Classification trees with bivariate linear discrim

ClassicationTreeswithBivariateLinearDiscriminantNodeModelsHyunjoongKimWei-YinLohDepartmentofStatisticsDepartmentofStatisticsUniversityofTennesseeUniversityofWisconsinKnoxville,TN37996Madison,WI53706hjkim@utk.eduloh@stat.wisc.eduAbstractWeintroduceaclassicationtreealgorithmthatcansimultaneouslyreducetreecom-plexity,improveclassprediction,andenhancedatavisualization.Weaccomplishthisbyttingabivariatelineardiscriminantmodeltothedataineachnode.Standardalgorithmscanproducefairlycomplextreestructures,becausetheyemployaverysimplenodemodel,whereintheentirepartitionassociatedwithanodeisassignedtooneclass.Wereducethesizeofourtreesbylettingthediscriminantmodelsabsorbsomeofthetreecomplexity.Beingthemselvesclassiers,thediscriminantmodelscanalsohelptoimprovepredictionaccuracy.Finally,becausethediscriminantmodelsutilizeonlytwopredictorvariablesatatime,theireectsareeasilyvisualizedbymeansoftwo-dimensionalplots.Ouralgorithmdoesnotsimplytdiscriminantmodelstotheterminalnodesofaprunedtree,asthisdoesnotreducethesizeofthetree.Instead,discriminantmodelingiscarriedoutinallphasesoftreegrowthandthemisclassicationcostsofthenodemodelsareexplicitlyusedtoprunethetree.Ouralgorithmisalsodistinctfromthe\linearcombinationsplitalgorithmsthatpartitionthedataspacewitharbitrarilyorientedhyperplanes.Weuseaxis-orthogonalsplitstopreservetheinterpretabilityofthetreestructures.Anextensiveempiricalstudywithrealdatasetsshowsthatingeneralouralgorithmhasbetterpredictionpowerthanmanyothertreeornon-treealgorithms.Keywordsandphrases:Decisiontree,lineardiscriminantanalysis,tree-structuredclassier1INTRODUCTIONAmajoradvantageofclassicationtreesisthedirectandintuitivewaytheycanbeinterpreted.Consider,forexample,Figure1whichshowsatreeobtainedusingversion4oftheCARTalgorithm(Breiman,Friedman,OlshenandStone,1984;SteinbergandColla,1997).ItisbasedondatafromastudyonbreastcancerattheUniversityofWisconsin(WolbergandMangasarian,ResearchpartiallysupportedbyU.S.ArmyResearchOcegrantDAAD19-01-1-0586.1UnifCellSize2.5BareNuclei5.5416j5benign1j7malignantUnifCellShape2.5ClumpThickness5.518j1benign0j4malignantUnifCellSize4.5BareNuclei2.5MarginalAdhesion3.510j1benign0j3malignant8j48malignant5j172malignantFigure1:CARTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.1990).Thedataconsistofmeasurementstakenfrom699patientson9predictorvariablestakingintegervaluesbetween1and10.Theresponsevariablerecordswhetherapatient'stumorisbenignormalignant.Thetreeisstraightforwardtointerpretandshowsthatvepredictorvariablesmaybesucienttopredicttheresponse.Figures2and3showtreesconstructedfromthesamedatausingtheCRUISE(KimandLoh,2001)andQUEST(LohandShih,1997)algorithms.BotharelesscomplexthantheCARTtreeandarethuseveneasiertointerpret.Butaretheyequallygoodintermsofpredictionaccuracy?Empiricalevidence(Lim,LohandShih,2000;KimandLoh,2001)indicatesthatthesealgorithmstendtoproducetreeswithcomparableaccuracy.Ifthesethreetreesareindeedequallyaccurate,theQUESTtreemaybepreferredforitssimplicity.Theclasscompositionsareshownbesidetheterminalnodesofthetrees.Forexample,theterminalnodeontheextremeleftoftheCARTtreecontains416benignand5malignantcases.Ifauserwishestondouthowthese421casesaredistributedinthespaceofthepredictorvariables,whatisthebestwaytodothisgraphically?Onecouldmakeone-dimensionaldotplotsofthedataforeachvariable,usingadierentplotsymbolforeachclass.Thiswillrequire9dotplots.Betterstill,wecouldlookattwo-dimensionalplots.Butwhichpredictorvariabletoplotagainstwhich?Ascatterplotmatrixofallpairsofpredictorswillcontain92=81plotsperterminalnode.Theseplotscanbetiresometoexamineifthenumberofpredictorsorthenumberofterminalnodesislarge.Besides,manyoftheplotswillprobablybeuninteresting.Clearly,itisusefultohaveamethodthatcanscreenthroughalltheplotsandshowusonly2BlandChromatin3.5UnifCellShape3.5424j7benignClumpThickness6.5MarginalAdhesion5.514j3benign0j7malignant0j28malignantUnifCellShape1.57j1benign13j195malignantFigure2:CRUISEtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.UnifCellShape2.5403j9benignBareNuclei2.5UnifCellSize3.539j2benign5j20malignant11j210malignantFigure3:QUESTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.3BlandChromatin3.5UnifCellShape3.5424j714j3820j196Figure4:ClassicationtreeforthebreastcancerdatausingtheproposedMmethod.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcases.theinterestingones.Wewillsaythataplotis\interes

Classification trees with bivariate linear discrim

免费阅读已结束，点击付费阅读剩下 ... 页

阅读已结束，您可以下载文档离线阅读

小区推广

芜湖市某污水处理厂工程施工组织设计_secret

§9有色金属和粉末冶金

抗癌药物的研究和发展简介

《权健秘方产品之专方专用系列》lican

东湖大学科技园发展对策研究

绩效考核与操作流程（一）

物流产业与政策

焊接工程师培训讲义2

表格使用说明及注意事项

相关文档

相关搜索