您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 数据挖掘与识别 > 机器学习十大算法:CART
Chapter10CART:ClassificationandRegressionTreesDanSteinbergContents10.1Antecedents.........................................................18010.2Overview...........................................................18110.3ARunningExample.................................................18110.4TheAlgorithmBrieflyStated........................................18310.5SplittingRules......................................................18510.6PriorProbabilitiesandClassBalancing...............................18710.7MissingValueHandling.............................................18910.8AttributeImportance................................................19010.9DynamicFeatureConstruction.......................................19110.10Cost-SensitiveLearning.............................................19210.11StoppingRules,Pruning,TreeSequences,andTreeSelection.........19310.12ProbabilityTrees....................................................19410.13TheoreticalFoundations.............................................19610.14Post-CARTRelatedResearch........................................19610.15SoftwareAvailability................................................19810.16Exercises............................................................198References..................................................................199The1984monograph,“CART:ClassificationandRegressionTrees,”coauthoredbyLeoBreiman,JeromeFriedman,RichardOlshen,andCharlesStone(BFOS),repre-sentsamajormilestoneintheevolutionofartificialintelligence,machinelearning,nonparametricstatistics,anddatamining.Theworkisimportantforthecompre-hensivenessofitsstudyofdecisiontrees,thetechnicalinnovationsitintroduces,itssophisticatedexamplesoftree-structureddataanalysis,anditsauthoritativetreatmentoflargesampletheoryfortrees.SinceitspublicationtheCARTmonographhasbeencitedsome3000timesaccordingtothescienceandsocialsciencecitationindexes;GoogleScholarreportsabout8,450citations.CARTcitationscanbefoundinalmostanydomain,withmanyappearinginfieldssuchascreditrisk,targetedmarketing,fi-nancialmarketsmodeling,electricalengineering,qualitycontrol,biology,chemistry,andclinicalmedicalresearch.CARThasalsostronglyinfluencedimagecompression179©2009byTaylor&FrancisGroup,LLC180CART:ClassificationandRegressionTreesviatree-structuredvectorquantization.ThisbriefaccountisintendedtointroduceCARTbasics,touchingonthemajorthemestreatedintheCARTmonograph,andtoencouragereaderstoreturntotherichoriginalsourcefortechnicaldetails,discus-sionsrevealingthethoughtprocessesoftheauthors,andexamplesoftheiranalyticalstyle.10.1AntecedentsCARTwasnotthefirstdecisiontreetobeintroducedtomachinelearning,althoughitisthefirsttobedescribedwithanalyticalrigorandsupportedbysophisticatedstatisticsandprobabilitytheory.CARTexplicitlytracesitsancestrytotheauto-maticinteractiondetection(AID)treeofMorganandSonquist(1963),anautomatedrecursivemethodforexploringrelationshipsindataintendedtomimictheitera-tivedrill-downstypicalofpracticingsurveydataanalysts.AIDwasintroducedasapotentiallyusefultoolwithoutanytheoreticalfoundation.This1960s-eraworkontreeswasgreetedwithprofoundskepticismamidstevidencethatAIDcouldradicallyoverfitthetrainingdataandencourageprofoundlymisleadingconclusions(Einhorn,1972;Doyle,1973),especiallyinsmallersamples.By1973well-readstatisticianswereconvincedthattreeswereadeadend;theconventionalwisdomheldthattreesweredangerousandunreliabletoolsparticularlybecauseoftheirlackofatheoreticalfoundation.Otherresearchers,however,werenotyetpreparedtoabandonthetreelineofthinking.TheworkofCoverandHart(1967)onthelargesamplepropertiesofnearestneighbor(NN)classifierswasinstrumentalinpersuadingRichardOlshenandJeromeFriedmanthattreeshadsufficienttheoreticalmerittobeworthpursu-ing.OlshenreasonedthatifNNclassifierscouldreachtheCoverandHartboundonmisclassificationerror,thenasimilarresultshouldbederivableforasuitablyconstructedtreebecausetheterminalnodesoftreescouldbeviewedasdynami-callyconstructedNNclassifiers.Thus,theCoverandHartNNresearchwastheimmediatestimulusthatpersuadedOlshentoinvestigatetheasymptoticpropertiesoftrees.Coincidentally,Friedman’salgorithmicworkonfastidentificationofnearestneighborsviatrees(Friedman,Bentley,andFinkel,1977)usedarecursivepartition-ingmechanismthatevolvedintoCART.OnepredecessorofCARTappearsinthe1975StanfordLinearAcceleratorCenter(SLAC)discussionpaper(Friedman,1975),subsequentlypublishedinashorterformbyFriedman(1977).WhileFriedmanwasworkingoutkeyelementsofCARTatSLAC,withOlshenconductingmathemat-icalresearchinthesamelab,similarindependentresearchwasunderwayinLosAngelesbyLeoBreimanandCharlesStone(BreimanandStone,1978).Thetwoseparatestrandsofresearch(FriedmanandOlshenatStanford,BreimanandStoneinLosAngeles)werebroughttogetherin1978whenthefourCARTauthorsfor-mallybegantheprocessofmergingtheirworkandpreparingtowritetheCARTmonograph.©2009byTaylor&FrancisGroup,LLC10.3ARunningExample18110.2OverviewTheCARTdecisiontreeisabinaryrecursivepartitioningprocedurecapableofpro-cessingcontinuousandnominalattributesastargetsandpredictors.Dataare
本文标题:机器学习十大算法:CART
链接地址:https://www.777doc.com/doc-6332826 .html