您好,欢迎访问三七文档
ClassicationTreeswithBivariateLinearDiscriminantNodeModelsHyunjoongKimWei-YinLohDepartmentofStatisticsDepartmentofStatisticsUniversityofTennesseeUniversityofWisconsinKnoxville,TN37996Madison,WI53706hjkim@utk.eduloh@stat.wisc.eduAbstractWeintroduceaclassicationtreealgorithmthatcansimultaneouslyreducetreecom-plexity,improveclassprediction,andenhancedatavisualization.Weaccomplishthisbyttingabivariatelineardiscriminantmodeltothedataineachnode.Standardalgorithmscanproducefairlycomplextreestructures,becausetheyemployaverysimplenodemodel,whereintheentirepartitionassociatedwithanodeisassignedtooneclass.Wereducethesizeofourtreesbylettingthediscriminantmodelsabsorbsomeofthetreecomplexity.Beingthemselvesclassiers,thediscriminantmodelscanalsohelptoimprovepredictionaccuracy.Finally,becausethediscriminantmodelsutilizeonlytwopredictorvariablesatatime,theireectsareeasilyvisualizedbymeansoftwo-dimensionalplots.Ouralgorithmdoesnotsimplytdiscriminantmodelstotheterminalnodesofaprunedtree,asthisdoesnotreducethesizeofthetree.Instead,discriminantmodelingiscarriedoutinallphasesoftreegrowthandthemisclassicationcostsofthenodemodelsareexplicitlyusedtoprunethetree.Ouralgorithmisalsodistinctfromthe\linearcombinationsplitalgorithmsthatpartitionthedataspacewitharbitrarilyorientedhyperplanes.Weuseaxis-orthogonalsplitstopreservetheinterpretabilityofthetreestructures.Anextensiveempiricalstudywithrealdatasetsshowsthatingeneralouralgorithmhasbetterpredictionpowerthanmanyothertreeornon-treealgorithms.Keywordsandphrases:Decisiontree,lineardiscriminantanalysis,tree-structuredclassier1INTRODUCTIONAmajoradvantageofclassicationtreesisthedirectandintuitivewaytheycanbeinterpreted.Consider,forexample,Figure1whichshowsatreeobtainedusingversion4oftheCARTalgorithm(Breiman,Friedman,OlshenandStone,1984;SteinbergandColla,1997).ItisbasedondatafromastudyonbreastcancerattheUniversityofWisconsin(WolbergandMangasarian,ResearchpartiallysupportedbyU.S.ArmyResearchOcegrantDAAD19-01-1-0586.1UnifCellSize2.5BareNuclei5.5416j5benign1j7malignantUnifCellShape2.5ClumpThickness5.518j1benign0j4malignantUnifCellSize4.5BareNuclei2.5MarginalAdhesion3.510j1benign0j3malignant8j48malignant5j172malignantFigure1:CARTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.1990).Thedataconsistofmeasurementstakenfrom699patientson9predictorvariablestakingintegervaluesbetween1and10.Theresponsevariablerecordswhetherapatient'stumorisbenignormalignant.Thetreeisstraightforwardtointerpretandshowsthatvepredictorvariablesmaybesucienttopredicttheresponse.Figures2and3showtreesconstructedfromthesamedatausingtheCRUISE(KimandLoh,2001)andQUEST(LohandShih,1997)algorithms.BotharelesscomplexthantheCARTtreeandarethuseveneasiertointerpret.Butaretheyequallygoodintermsofpredictionaccuracy?Empiricalevidence(Lim,LohandShih,2000;KimandLoh,2001)indicatesthatthesealgorithmstendtoproducetreeswithcomparableaccuracy.Ifthesethreetreesareindeedequallyaccurate,theQUESTtreemaybepreferredforitssimplicity.Theclasscompositionsareshownbesidetheterminalnodesofthetrees.Forexample,theterminalnodeontheextremeleftoftheCARTtreecontains416benignand5malignantcases.Ifauserwishestondouthowthese421casesaredistributedinthespaceofthepredictorvariables,whatisthebestwaytodothisgraphically?Onecouldmakeone-dimensionaldotplotsofthedataforeachvariable,usingadierentplotsymbolforeachclass.Thiswillrequire9dotplots.Betterstill,wecouldlookattwo-dimensionalplots.Butwhichpredictorvariabletoplotagainstwhich?Ascatterplotmatrixofallpairsofpredictorswillcontain92=81plotsperterminalnode.Theseplotscanbetiresometoexamineifthenumberofpredictorsorthenumberofterminalnodesislarge.Besides,manyoftheplotswillprobablybeuninteresting.Clearly,itisusefultohaveamethodthatcanscreenthroughalltheplotsandshowusonly2BlandChromatin3.5UnifCellShape3.5424j7benignClumpThickness6.5MarginalAdhesion5.514j3benign0j7malignant0j28malignantUnifCellShape1.57j1benign13j195malignantFigure2:CRUISEtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.UnifCellShape2.5403j9benignBareNuclei2.5UnifCellSize3.539j2benign5j20malignant11j210malignantFigure3:QUESTtreeforbreastcancerdata.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcasesatthenode.3BlandChromatin3.5UnifCellShape3.5424j714j3820j196Figure4:ClassicationtreeforthebreastcancerdatausingtheproposedMmethod.Atanintermediatenode,acasegoestotheleftsubnodeifitsatisestheconditionthere;otherwiseitgoestotherightsubnode.Thepairofnumbersontheleftofaterminalnodegivesthenumbersofbenignandmalignantcases.theinterestingones.Wewillsaythataplotis\interes
本文标题:Classification trees with bivariate linear discrim
链接地址:https://www.777doc.com/doc-3327603 .html