9. Decision Tree (决策树)
数据挖掘实验 (Data Mining Lab)
张晓航, 北邮经管学院

Contents
- Basic Concepts
- CHAID
- C4.5
- CART
- Tree in SAS EM

Fitted Decision Tree
[Figure: a fitted tree that splits on DEBTINC (at 45), NINQ, and DELINQ, with the BAD rate shown at each leaf (ranging from 2% to 75%). A new case with DEBTINC = 20, NINQ = 2, DELINQ = 0 is dropped down the tree to read off its predicted BAD rate.]

Divide and Conquer
[Figure: the root node holds n = 5,000 cases with 10% BAD. Splitting on Debt-to-Income Ratio < 45 (yes/no) gives one child with n = 4,300 and 1% BAD and another with n = 700 and 65.3% BAD.]

The Cultivation of Trees
- Split Search: which splits are to be considered?
- Splitting Criterion: which split is best?
- Stopping Rule: when should the splitting stop?
- Pruning Rule: should some branches be lopped off?

Possible Splits to Consider
[Figure: the number of candidate splits (up to 500,000 on the vertical axis) plotted against the number of input levels (2 to 20); it grows far faster for a nominal input than for an ordinal input.]

Splitting Criteria
Counts of Good/Bad cases in each child node:
- Perfect split: Left (Good 4,500, Bad 0), Right (Good 0, Bad 500).
- Worthless split: Left (Good 2,700, Bad 300), Right (Good 1,800, Bad 200); both children keep the parent's 10% Bad rate.
- A competing three-way split: Left (Good 2,521, Bad 115), Center (Good 1,188, Bad 162), Right (Good 791, Bad 223).

The Right-Sized Tree
- Stunting
- Pruning

A Field Guide to Tree Algorithms
CART, AID, THAID, CHAID, ID3, C4.5, C5.0

Comparison of trees
- CART: binary splits; post-pruning; within-node sampling; Gini index for classification trees; variance reduction for regression trees.
- CHAID: multi-way splits; pre-pruning; the split-search algorithm is designed for categorical inputs; splitting and stopping criteria are based on statistical significance (chi-squared test).
- C4.5: L-way splits for L-level categorical inputs; binary splits for continuous inputs; post-pruning; splitting criteria are based on information gain.

Benefits of Trees
- Interpretability: tree-structured presentation
- Mixed measurement scales: nominal, ordinal, interval
- Regression trees
- Robustness
- Missing values

Benefits of Trees (cont.)
- Automatically detects interactions (AID)
- Accommodates nonlinearity
- Selects input variables
[Figure: the fitted probability surface over two inputs is a multivariate step function.]

Drawbacks of Trees
- Roughness of the fitted (step-function) surface
- Linear and main-effect relationships are captured only crudely

CHAID

Chi-square Test (卡方检验法)

The Essence of the CHAID Algorithm (CHAID 算法的本质)

Example (实例)
A contingency table of the input X (levels A, B, C, D) against the target Y (levels 1, 2):

          A    B    C    D   Row total
  Y=1    23    5   19    4          51
  Y=2    12    2   15   13          42
  Col    35    7   34   17        (93)

Expected count for the cell Y = 1, X = A:
  E(Y=1, X=A) = 51 x 35 / 93 ≈ 19.19
Degrees of freedom:
  d.f. = (r - 1)(c - 1) = (2 - 1)(4 - 1) = 3
Test statistic and p-value:
  χ² = Σ_i (O_i - E_i)² / E_i ≈ 9.19,  P ≈ 0.0268

Example (实例)
Four candidate binary splits X1-X4 of a churn target (流失 = churn, 不流失 = no churn); observed counts with expected counts in parentheses:

  X1        churn        no churn     total
  0       30 (37.5)    120 (112.5)      150
  1       20 (12.5)     30 (37.5)        50
  total       50           150           200
  χ² = 8, d.f. = 1

  X2        churn        no churn     total
  0       25 (25)       75 (75)          100
  1       25 (25)       75 (75)          100
  total       50           150            200
  χ² = 0, d.f. = 1

  X3        churn        no churn     total
  0        0 (37.5)    150 (112.5)      150
  1       50 (12.5)      0 (37.5)        50
  total       50           150           200
  χ² = 200, d.f. = 1

  X4        churn        no churn     total
  0       40 (15)        20 (45)          60
  1       10 (35)       130 (105)        140
  total       50           150            200
  χ² = 79.37, d.f. = 1

X3 yields the largest χ² (the most significant split) and X2 the smallest, so a chi-square-based search would split on X3 first.

Decision Tree Construction Flow (决策树构造流程)
[Figure: a sample S with candidate attribute set R is split on a chosen attribute D into subsets S1, ..., Sm; each subset is then processed recursively with the remaining attributes R - {D}.]

C4.5

Information Gain (信息增益)

The Concept of the Information Gain Ratio (信息增益比概念)

Example (实例)
The same table, now with split attribute D (levels 1, 2) and target T (classes A, B, C, D):

          A    B    C    D   Row total
  D=1    23    5   19    4          51
  D=2    12    2   15   13          42
  Col    35    7   34   17        (93)

  Info(T) = -Σ_i p_i log₂ p_i
          = -[(35/93)log₂(35/93) + (7/93)log₂(7/93) + (34/93)log₂(34/93) + (17/93)log₂(17/93)]
          = 1.79

  Info(D, T) = Σ_i p_i Info(T_i)
             = (51/93)[-(23/51)log₂(23/51) - (5/51)log₂(5/51) - (19/51)log₂(19/51) - (4/51)log₂(4/51)]
             + (42/93)[-(12/42)log₂(12/42) - (2/42)log₂(2/42) - (15/42)log₂(15/42) - (13/42)log₂(13/42)]
             = 1.72

  Gain(D, T) = Info(T) - Info(D, T) = 1.79 - 1.72 = 0.07

Example (实例)
The same four candidate splits X1-X4 (counts are churn, no churn), evaluated by information gain:

  X1: level 0 -> (30, 120), level 1 -> (20, 30);   Gain = 0.027
  X2: level 0 -> (25, 75),  level 1 -> (25, 75);   Gain = 0
  X3: level 0 -> (0, 150),  level 1 -> (50, 0);    Gain = 0.811
  X4: level 0 -> (40, 20),  level 1 -> (10, 130);  Gain = 0.276

X3 again wins: it separates churners from non-churners perfectly, so its gain equals the full target entropy of 0.811 bits.

CART

Gini Impurity Function (Gini 杂度函数)

Example (实例)
The same table (split attribute D, target T with classes A-D):

  Gini(T) = 1 - Σ_i (n_i / N)²
          = 1 - [(35/93)² + (7/93)² + (34/93)² + (17/93)²] ≈ 0.686

  Gini(D, T) = Σ_i p_i Gini(T_i)
             = (51/93)[1 - (23/51)² - (5/51)² - (19/51)² - (4/51)²]
             + (42/93)[1 - (12/42)² - (2/42)² - (15/42)² - (13/42)²] ≈ 0.665

  ΔGini = Gini(T) - Gini(D, T) ≈ 0.021

Example (实例)
The same four candidate splits, evaluated by the reduction in Gini impurity:

  X1: ΔGini = 0.015
  X2: ΔGini = 0
  X3: ΔGini = 0.375
  X4: ΔGini = 0.149

Once more X3 is chosen; as a perfect split, its ΔGini equals the parent impurity Gini(T) = 0.375.
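The worked examples above (CHAID's chi-square, C4.5's information gain, CART's Gini reduction) are easy to reproduce; the next few sketches do so in Python. This first one covers the chi-square computations. It is only an illustration (the course itself uses SAS EM, not Python); it assumes scipy is available, and the variable names table_xy and splits are mine, not from the slides.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2 x 4 table of Y (rows: 1, 2) against X (columns: A, B, C, D) from the CHAID example.
table_xy = np.array([[23, 5, 19, 4],
                     [12, 2, 15, 13]])

chi2, p, dof, expected = chi2_contingency(table_xy)
print(f"chi2 = {chi2:.2f}, d.f. = {dof}, P = {p:.4f}")   # about 9.19, 3, 0.0268
print(expected.round(2))                                 # E(Y=1, X=A) = 51*35/93, about 19.19

# The four candidate 2 x 2 splits (rows: X=0, X=1; columns: churn, no churn).
splits = {
    "X1": [[30, 120], [20, 30]],
    "X2": [[25, 75], [25, 75]],
    "X3": [[0, 150], [50, 0]],
    "X4": [[40, 20], [10, 130]],
}
for name, tab in splits.items():
    # correction=False disables the Yates continuity correction so the plain
    # Pearson chi-square values (8, 0, 200, 79.37) from the slides are reproduced.
    chi2, p, dof, _ = chi2_contingency(tab, correction=False)
    print(f"{name}: chi2 = {chi2:.2f}, d.f. = {dof}, P = {p:.4g}")
```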
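The C4.5 entropy and information-gain figures can be checked the same way. Again a minimal sketch under the same assumptions; the helper names entropy and info_gain are hypothetical, and base-2 logarithms are used as in the worked example.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(parent_counts, child_counts):
    """Gain(D, T) = Info(T) - Info(D, T): entropy reduction achieved by a split."""
    n = sum(sum(c) for c in child_counts)
    cond = sum(sum(c) / n * entropy(c) for c in child_counts)
    return entropy(parent_counts) - cond

# Worked example: target T has classes A, B, C, D; split attribute D has levels 1 and 2.
parent = [35, 7, 34, 17]              # column totals
children = [[23, 5, 19, 4],           # D = 1 (51 cases)
            [12, 2, 15, 13]]          # D = 2 (42 cases)
print(round(entropy(parent), 2))              # Info(T)    ~ 1.79
print(round(info_gain(parent, children), 2))  # Gain(D, T) ~ 0.07

# The four candidate churn splits: gains ~ 0.027, 0, 0.811, 0.276.
for name, kids in {"X1": [[30, 120], [20, 30]],
                   "X2": [[25, 75], [25, 75]],
                   "X3": [[0, 150], [50, 0]],
                   "X4": [[40, 20], [10, 130]]}.items():
    print(name, round(info_gain([50, 150], kids), 3))
```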
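And the CART criterion: a short sketch that reproduces the Gini impurity and its reduction for the same tables. The helper names gini and gini_reduction are again my own, not from the course material.

```python
import numpy as np

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

def gini_reduction(parent_counts, child_counts):
    """Gini(T) - sum_i p_i * Gini(T_i): impurity reduction achieved by a split."""
    n = sum(sum(c) for c in child_counts)
    weighted = sum(sum(c) / n * gini(c) for c in child_counts)
    return gini(parent_counts) - weighted

# Worked example: Gini(T) ~ 0.686, weighted child impurity ~ 0.665, reduction ~ 0.021.
print(round(gini([35, 7, 34, 17]), 3))
print(round(gini_reduction([35, 7, 34, 17], [[23, 5, 19, 4], [12, 2, 15, 13]]), 3))

# The four candidate churn splits: reductions ~ 0.015, 0, 0.375, 0.149.
for name, kids in {"X1": [[30, 120], [20, 30]],
                   "X2": [[25, 75], [25, 75]],
                   "X3": [[0, 150], [50, 0]],
                   "X4": [[40, 20], [10, 130]]}.items():
    print(name, round(gini_reduction([50, 150], kids), 3))
```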
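The construction flow (split S on a chosen attribute D, then recurse on each subset with the remaining attributes R - {D}) can also be outlined as a small recursive function. This is a bare-bones, ID3-style sketch of the general idea, not code from the course; the data layout (a list of dicts plus a label list) and the stopping rules are simplifying assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_by(rows, labels, attr):
    """Partition the sample into S_1 .. S_m, one subset per level of attr."""
    groups = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        groups[row[attr]][0].append(row)
        groups[row[attr]][1].append(y)
    return groups

def build_tree(rows, labels, attrs):
    """S & R -> pick D in R, split into S_1 .. S_m, recurse with R - {D}."""
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    def gain(attr):                                   # splitting criterion: information gain
        parts = split_by(rows, labels, attr)
        cond = sum(len(ys) / len(labels) * entropy(ys) for _, ys in parts.values())
        return entropy(labels) - cond
    best = max(attrs, key=gain)
    node = {"split_on": best, "children": {}}
    for level, (sub_rows, sub_labels) in split_by(rows, labels, best).items():
        node["children"][level] = build_tree(sub_rows, sub_labels, attrs - {best})
    return node

# Tiny usage example with two of the churn attributes (hypothetical rows).
rows = [{"X1": 0, "X3": 0}, {"X1": 1, "X3": 0}, {"X1": 1, "X3": 1}, {"X1": 0, "X3": 1}]
labels = ["no churn", "no churn", "churn", "churn"]
print(build_tree(rows, labels, {"X1", "X3"}))   # splits on X3, which separates the classes
```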
Tree in SAS EM

Model assessment criteria - interval target
- Profit or loss: evaluates the tree using the maximum average profit or minimum average loss. If the profit or loss matrix defined in the target profile contains fewer than two decisions, the Tree node uses the average squared error to evaluate the tree instead.
- ASE: selects the tree that has the smallest average squared error.
- Average, profit, or loss in the top 10, 25, or 50%: evaluates the tree based on the average predicted values for the target, the maximum average profit, or the minimum average loss for the top n% of the cases. Use this model assessment criterion when the overall goal is to create the tree with the best lift value. The average profit or loss in the top n% is used if you defined a profit or loss matrix that has two or more decisions; otherwise, the Tree node selects the tree based on the average predicted values for the target in the top n%.

Model assessment criteria - ordinal target
- Proportion misclassified: selects the tree that has the smallest misclassification rate.
- Ordinal proportion correct, profit, or loss: evaluates the tree with the best classification rate when weighted for the ordinal distances. Let ORDER(Y) denote the rank order of target value Y, so that ORDER(Y) takes the values 1, 2, 3, ..., (number of target levels). The classification rate weighted for ordinal distances is then computed from these ranks.
- Proportion of event, profit, or loss in the top 10, 25, or 50%: evaluates the tree that results in the maximum profit or minimum ...
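To make the interval-target criteria concrete, here is a small sketch of two of them: the average squared error and a lift-style "top n%" measure. It is my own reading of the definitions above, not SAS EM code; the function names, the synthetic data, and the 10% cutoff are assumptions.

```python
import numpy as np

def average_squared_error(y_true, y_pred):
    """ASE: mean of (actual - predicted)^2 over the validation cases."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def average_in_top_pct(y_true, y_pred, pct=10):
    """Average actual target value among the top pct% of cases ranked by prediction
    (a lift-style criterion: higher is better for an interval target)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    k = max(1, int(round(len(y_pred) * pct / 100)))
    top = np.argsort(y_pred)[::-1][:k]          # indices of the k highest predictions
    return y_true[top].mean()

# Toy validation data: 20 cases with actual values and two competing trees' predictions.
rng = np.random.default_rng(0)
actual = rng.normal(100, 20, size=20)
tree_a = actual + rng.normal(0, 5, size=20)      # closer predictions
tree_b = actual + rng.normal(0, 15, size=20)     # noisier predictions

for name, pred in [("tree A", tree_a), ("tree B", tree_b)]:
    print(name,
          "ASE =", round(average_squared_error(actual, pred), 1),
          "| mean actual in top 10% =", round(average_in_top_pct(actual, pred, 10), 1))
```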