04-Classification(分类)
Classification

• Definition: assign an object to one of several predefined categories
• Given:
  • a set of predefined classes
  • a number of attributes
  • a learning set
• Goal: predict the class of unclassified data

Applications

• A medical researcher wants to analyze patient data to determine who is at risk of heart disease:
  • Categories: at-risk, not-at-risk
  • Dataset: (age, heart rate, blood pressure, smoking, heart disease in family, class)
• A company would like to analyze customer data to predict which customers are likely to leave
• A scientist would like to classify trees by looking at the leaves they produce

General Approach

• A 2-step process:
  1. Learning step: use the training data to build a classification model
  2. Classification step: use the model from step 1 to predict the class of test data and estimate the accuracy of the model
• This is supervised learning, since the class of each training record is given

Methods

• Decision trees
• Classification rules
• Naïve Bayes, Bayes networks
• Neural networks
• Nearest neighbor
• Ensemble methods

Decision Boundaries

• The borderline between two neighboring regions of different classes is known as the decision boundary
• Decision boundaries may be linear, nonlinear, or rectilinear

Decision Trees

[figure: example of a decision tree classifier]

Rule Based

[figure: example of a rule-based classifier]

Performance Evaluation

• Training error: the number of misclassified records in the training set
• Generalization error: the expected error of the model on previously unseen records
• Goal: reduce both training error AND generalization error

Decision Tree Example

[figure: worked decision tree example]

Performance Evaluation

• Confusion matrix:

                       Predicted Class
                       Class=1   Class=0
  Actual    Class=1    f11       f10
  Class     Class=0    f01       f00

• Accuracy: fraction of correct predictions
  accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
• Error rate: fraction of wrong predictions
  error rate = (f10 + f01) / (f11 + f10 + f01 + f00)

Underfitting and Overfitting

• Underfitting: the model is too simple
• Overfitting: the model is built to fit the training set too tightly

Overfitting due to Noise

• The decision boundary is distorted by a noise point

Overfitting due to Lack of Representative Samples

• A lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

Other Performance Measures

• Speed: the computational cost involved in generating and using the model
• Robustness: the ability to make correct predictions from noisy/missing values
• Scalability: the ability to construct the classifier given large amounts of data
• Interpretability: the level of insight provided by the classifier

Holdout Method

• Split the full data set into two disjoint sets: the training set and the test set
• Limitations:
  • Fewer data are available for training
  • The model is highly dependent on the composition of the two sets
  • The training set and test set are not independent: an overrepresented class in one set will be underrepresented in the other

Random Subsampling

• Repeat the holdout method multiple times to improve the estimate
• Accuracy: acc_sub = (1/k) * sum_{i=1..k} acc_i
• Limitations:
  • Not using as much data as possible for training
  • Some records are used multiple times for training

K-Fold Cross-Validation

• Split the data into k disjoint sets
• In each iteration, one set is used for testing and the remaining k−1 for training
• Advantage: all records are used for both training and testing

Leave-One-Out Cross-Validation

• A special case of cross-validation where k = N
• Uses as much data as possible for training
• The test sets are mutually exclusive
• Computationally expensive

Bootstrap

• Training records are sampled with replacement, so the training data may contain duplicate records
• Repeat the process b times to generate b bootstrap samples
• When N records are chosen with replacement from an N-record set, the probability that a given record is selected is
  1 − (1 − 1/N)^N ≈ 1 − e^(−1) = 0.632 if N is sufficiently large
• Accuracy (.632 bootstrap):
  acc_boot = (1/b) * sum_{i=1..b} (0.632 * acc_i + 0.368 * acc_s)
  where acc_i is the accuracy of bootstrap sample i and acc_s is the accuracy of the original sample used as the training set

Comparing Classifiers

• Consider two models M_A and M_B:
  • M_A: 85% accuracy on a test set of 30 records
  • M_B: 75% accuracy on a test set of 5000 records
• Which one is better?
• How much confidence does each accuracy have?
• Can we explain the difference in accuracy as a result of variations in the test sets?

Confidence Interval of Accuracy

• The task of predicting a label is a binomial experiment with probability of success p
• If the test set contains N records, let X be the number of records correctly predicted
• X has a binomial distribution with mean Np and variance Np(1 − p):
  P(X = v) = C(N, v) * p^v * (1 − p)^(N − v)
• Example: obtaining heads on a coin toss has p = 0.5; the probability of obtaining heads 20 times in 50 tosses is
  P(X = 20) = C(50, 20) * 0.5^20 * (1 − 0.5)^30 ≈ 0.0419
• The accuracy X/N has a binomial distribution with mean p and variance p(1 − p)/N
• For large N, X/N has approximately a normal distribution with mean p and variance p(1 − p)/N

Confidence Interval for Accuracy

• Confidence intervals are constructed for given confidence levels
• Example: the confidence interval is between 82% and 88% at confidence level 95%
• Meaning: if the population is sampled multiple times, the result of the experiment falls within the confidence interval (82%, 88%) 95% of the time

Z-Score

• A measure of how many standard deviations an element is below or above the mean:
  z = (x − μ) / σ
• What is the z-score corresponding to the 96th percentile? z = 1.75

Confidence Interval for Accuracy

• Using the normal approximation:
  P( −Z_{α/2} ≤ (acc − p) / sqrt(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α
• Solving for p gives the confidence interval:
  p = ( 2N*acc + Z²_{α/2} ± Z_{α/2} * sqrt( Z²_{α/2} + 4N*acc − 4N*acc² ) ) / ( 2(N + Z²_{α/2}) )
• Example: consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  • N = 100, acc = 0.8
  • Let 1 − α = 0.95 (95% confidence)
  • From the probability table, Z_{α/2} = 1.96
  • [table: confidence interval of the true accuracy for N = 50, 100, 1000]
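The confusion-matrix accuracy and error-rate formulas above can be checked with a short sketch. The counts f11, f10, f01, f00 follow the slides' notation; the helper function names are my own, not from the lecture:

```python
def confusion_counts(actual, predicted):
    """Count f11, f10, f01, f00 for binary labels in {0, 1}.

    First index: actual class, second index: predicted class.
    """
    f11 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    f10 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    f01 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    f00 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return f11, f10, f01, f00

def accuracy(f11, f10, f01, f00):
    """Fraction of correct predictions: (f11 + f00) / total."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of wrong predictions: (f10 + f01) / total."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# 3 of 5 predictions match, so accuracy is 0.6 and error rate 0.4.
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that accuracy and error rate always sum to 1, since every prediction is either correct or wrong.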
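The k-fold cross-validation procedure described above can be sketched as pure-Python index splitting (the function name is mine; in practice one would use a library implementation such as scikit-learn's KFold):

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of k folds.

    The data is split into k disjoint sets; in each iteration one set
    is held out for testing and the remaining k-1 are used for training,
    so every record is used for both training and testing.
    """
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

splits = list(k_fold_splits(10, 5))
```

Setting k equal to the number of records gives the leave-one-out special case from the slides.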
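The .632 bootstrap accuracy estimate above translates directly into code. Here acc_i are the per-sample bootstrap accuracies and acc_s is the accuracy of the model trained on the original sample, as in the slides; the function name is my own:

```python
def bootstrap_632_accuracy(boot_accs, acc_s):
    """acc_boot = (1/b) * sum(0.632 * acc_i + 0.368 * acc_s)."""
    b = len(boot_accs)
    return sum(0.632 * acc_i + 0.368 * acc_s for acc_i in boot_accs) / b

# Probability that a given record appears in a bootstrap sample of size N:
# 1 - (1 - 1/N)^N, which tends to 1 - 1/e = 0.632 for large N.
p_selected = 1 - (1 - 1 / 1000) ** 1000
```

The 0.632/0.368 weights are exactly this selection probability and its complement, which is where the estimator's name comes from.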
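The coin-toss example above (P(X = 20) in 50 fair tosses ≈ 0.0419) can be verified with the binomial formula; `math.comb` from the standard library computes the binomial coefficient C(n, v):

```python
from math import comb

def binomial_pmf(v, n, p):
    """P(X = v) = C(n, v) * p**v * (1 - p)**(n - v)."""
    return comb(n, v) * p ** v * (1 - p) ** (n - v)

p_20_heads = binomial_pmf(20, 50, 0.5)  # ~0.0419, as on the slide
```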
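The closed-form confidence interval above can be applied to the worked example from the slides (N = 100, acc = 0.8, Z_{α/2} = 1.96 for 95% confidence). A minimal sketch, with a function name of my own choosing:

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z):
    """Bounds on the true accuracy p from the slides' formula:

    p = (2N*acc + z^2 +/- z*sqrt(z^2 + 4N*acc - 4N*acc^2)) / (2*(N + z^2))
    """
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# N = 100 test instances, observed accuracy 0.8, 95% confidence.
lo, hi = accuracy_confidence_interval(0.8, 100, 1.96)
```

As expected, the interval tightens as N grows, since the variance of X/N is p(1 − p)/N.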