Wuhan University of Technology, School of Science, Department of Mathematics
Course Computer Lab Report

Course: Data Mining        Class: Information and Computing Science 1201        Date: June 9        Grade: ________
Students: Zhang Xujun (26), Li Xuemei (35), Zhang Xiaoting (33)        Lab: Mathematics Building, Room 207        Instructor's signature: ________
Experiment: Liver cancer prediction with a decision tree        Software: Python

Experiment Purpose and Content
Purpose: become familiar with the basic ideas of decision-tree classification and practice applying the ID3 algorithm to a concrete example.
Content: classify the given liver-cancer data, build the decision tree, and verify its correctness and applicability.

Experimental Principle and Steps
For convenience, before coding we encode the values of the ten attributes of the experimental data as the integers 1, 2, 3, 4, as follows:

X1:  no = 1, light = 2, mid = 3, serious = 4
X2:  no = 1, branch = 2, trunk = 3
X3:  positive = 1, negative = 2
X4:  positive = 1, negative = 2
X5:  right liver = 1, left liver = 2, all liver = 3
X6:  small = 1, middle = 2, big = 3, very big = 4
X7:  dilation = 1, infiltration = 2
X8:  no = 1, part = 2, integrate = 3
X9:  no = 1, have = 2
X10: no = 1, less = 2, much = 3

At each node, ID3 splits on the attribute with the largest information gain, where the Shannon entropy of a data set D with class proportions p_k is H(D) = -sum_k p_k * log2(p_k).

Code:

import math
import operator

# Compute the Shannon entropy in two steps: first count the class
# frequencies, then apply H(D) = -sum_k p_k * log2(p_k).
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

# The 20 encoded samples; the last entry of each row is the class label
# (Y = liver cancer, N = no liver cancer).
def createDataSet():
    dataSet = [[3, 2, 2, 2, 1, 2, 1, 2, 1, 2, 'Y'],
               [3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 'N'],
               [4, 1, 2, 1, 2, 3, 1, 1, 1, 3, 'Y'],
               [1, 1, 2, 2, 3, 4, 1, 3, 1, 3, 'Y'],
               [2, 2, 1, 1, 1, 1, 2, 3, 2, 1, 'N'],
               [3, 3, 1, 2, 1, 2, 2, 2, 1, 1, 'Y'],
               [2, 2, 1, 2, 1, 1, 2, 1, 2, 3, 'Y'],
               [1, 3, 2, 1, 3, 3, 1, 2, 1, 2, 'N'],
               [3, 2, 1, 2, 1, 2, 1, 3, 2, 2, 'N'],
               [1, 1, 2, 1, 1, 4, 1, 2, 1, 1, 'N'],
               [4, 3, 2, 2, 1, 3, 2, 3, 2, 2, 'N'],
               [2, 3, 1, 2, 3, 1, 1, 1, 1, 2, 'Y'],
               [1, 1, 2, 1, 1, 4, 2, 2, 1, 3, 'N'],
               [1, 2, 2, 2, 3, 4, 2, 3, 2, 1, 'N'],
               [4, 2, 1, 1, 1, 3, 2, 2, 2, 2, 'Y'],
               [3, 1, 2, 1, 1, 2, 1, 3, 2, 3, 'N'],
               [3, 2, 2, 2, 1, 2, 1, 3, 1, 2, 'N'],
               [2, 3, 2, 1, 2, 1, 2, 1, 1, 1, 'Y'],
               [1, 3, 2, 1, 1, 4, 2, 1, 1, 1, 'N'],
               [1, 1, 1, 1, 1, 4, 1, 2, 1, 2, 'Y']]
    labels = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
    return dataSet, labels

# Partition the data set: keep the samples with featVec[axis] == value and
# return them with the splitting attribute removed (it is no longer needed).
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]       # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the best attribute to split on. The idea is simple: try splitting on
# every attribute and keep the one with the largest information gain. A set
# collects the unique values of each attribute.
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1             # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):                  # iterate over all the features
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)                # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy       # info gain = reduction in entropy
        if infoGain > bestInfoGain:               # compare to the best gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                            # returns an integer index

# Because the tree is built recursively by consuming attributes, the
# attributes may run out before every class is pure; in that case the leaf's
# class is decided by majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Build the decision tree recursively. labels holds the feature names, purely
# to make the resulting tree easier to read.
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                       # stop when all classes are equal
    if len(dataSet[0]) == 1:                      # stop when no features are left
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                     # copy labels so recursion doesn't mangle them
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

# Walk the tree from the root until a leaf (a non-dict value) is reached.
def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel

def getResult():
    dataSet, labels = createDataSet()
    mtree = createTree(dataSet, labels)           # note: createTree consumes labels
    print(mtree)
    featLabels = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']
    print(classify(mtree, featLabels, [3, 1, 2, 1, 1, 2, 1, 3, 2, 3]))
    print(classify(mtree, featLabels, [3, 2, 2, 2, 1, 2, 1, 3, 1, 2]))
    print(classify(mtree, featLabels, [2, 3, 2, 1, 2, 1, 2, 1, 1, 1]))
    print(classify(mtree, featLabels, [1, 3, 2, 1, 1, 4, 2, 1, 1, 1]))
    print(classify(mtree, featLabels, [1, 1, 1, 1, 1, 4, 1, 2, 1, 2]))

if __name__ == '__main__':
    getResult()
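Before trusting the full run, the two core helpers can be sanity-checked on toy data. The interactive session below is our own illustration, not part of the original report; the expected values follow directly from the entropy formula above (a 50/50 class split has entropy 1, a pure set has entropy 0).

>>> calcShannonEnt([[1, 'Y'], [1, 'N']])    # maximally mixed classes
1.0
>>> calcShannonEnt([[1, 'Y'], [1, 'Y']])    # pure set
0.0
>>> splitDataSet([[1, 5, 'Y'], [2, 6, 'N'], [1, 7, 'N']], 0, 1)    # keep rows with X1 == 1, drop X1
[[5, 'Y'], [7, 'N']]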
Experimental Results and Analysis

Step 1: First, all 20 samples are used as training samples to build the decision tree, and the same 20 samples are then used as test samples to check the tree's correctness and applicability. Every prediction matches the true class. The tree printed by the program is:

{'X8': {1: {'X1': {1: 'N', 2: 'Y', 3: 'N', 4: 'Y'}},
        2: {'X1': {1: {'X3': {1: 'Y', 2: 'N'}}, 3: 'Y', 4: 'Y'}},
        3: {'X1': {1: {'X2': {1: 'Y', 2: 'N'}}, 2: 'N', 3: 'N', 4: 'N'}}}}

Predicted: Y N Y Y N Y Y N N N N Y N N Y N N Y N Y
Actual:    Y N Y Y N Y Y N N N N Y N N Y N N Y N Y

Step 2: To guard against overfitting and to give a fairer check of correctness, the first 15 samples are used as training samples and the last 5 as test samples. One of the five predictions is wrong:

Predicted: N N Y N N
Actual:    N N Y N Y

Step 3: Repeating the procedure, the first 10 samples and the last 5 samples are used as training samples, and the remaining 5 samples as test samples.
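The hold-out checks in steps 2 and 3 can also be scripted instead of being run by hand. The sketch below is our own illustration built on the functions above; the helper name holdOutCheck and the accuracy printout are ours, not the report's. One caveat: classify raises a KeyError when a test sample carries an attribute value that never appeared in the training subset, a known limitation of this bare-bones ID3 implementation.

def holdOutCheck(trainIdx, testIdx):
    # Train on the samples at trainIdx and report predictions on testIdx.
    dataSet, labels = createDataSet()
    trainSet = [dataSet[i] for i in trainIdx]
    tree = createTree(trainSet, labels[:])    # pass a copy: createTree consumes labels
    correct = 0
    for i in testIdx:
        sample = dataSet[i]
        # May raise KeyError if the tree never saw this attribute value.
        predicted = classify(tree, labels, sample[:-1])
        print(predicted, sample[-1])          # predicted vs. actual class
        correct += (predicted == sample[-1])
    print('accuracy: %d/%d' % (correct, len(testIdx)))

# Step 2: train on the first 15 samples, test on the last 5.
holdOutCheck(range(15), range(15, 20))
# Step 3: train on the first 10 and last 5, test on samples 11-15.
holdOutCheck(list(range(10)) + list(range(15, 20)), range(10, 15))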