04-Classification(分类)
Classification

• Definition: assign an object to one of several predefined categories
• Given:
  • a set of predefined classes
  • a number of attributes
  • a learning set
• Goal: predict the class of unclassified data

Applications

• A medical researcher wants to analyze patient data to determine who is at risk of heart disease:
  • Categories: at-risk, not-at-risk
  • Dataset: (age, heart rate, blood pressure, smoking, heart disease in family, class)
• A company would like to analyze customer data to predict which customers are likely to leave
• A scientist would like to classify trees by looking at the leaves they produce

General Approach

• A 2-step process:
  1. Learning step: use the training data to build a classification model
  2. Classification step: use the model from step 1 to predict the class of test data and estimate the accuracy of the model
• This is supervised learning, since the class of each training record is given

Methods

• Decision trees
• Classification rules
• Naïve Bayes, Bayes networks
• Neural networks
• Nearest neighbor
• Ensemble methods

Decision Boundaries

• The borderline between two neighboring regions of different classes is known as the decision boundary
• Decision boundaries may be linear, nonlinear, or rectilinear

Decision Trees

[figure: example of a decision tree classifier]

Rule Based

[figure: example of a rule-based classifier]

Performance Evaluation

• Training error: the number of misclassified records in the training set
• Generalization error: the expected error of the model on previously unseen records
• Goal: reduce both training error AND generalization error

Decision Tree Example

[figure: worked decision tree example]

Performance Evaluation

• Confusion matrix:

                       Predicted Class
                       Class=1   Class=0
  Actual    Class=1    f11       f10
  Class     Class=0    f01       f00

• Accuracy: fraction of correct predictions
  accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
• Error rate: fraction of wrong predictions
  error rate = (f10 + f01) / (f11 + f10 + f01 + f00)

Underfitting and Overfitting

• Underfitting: the model is too simple
• Overfitting: the model is built to fit the training set too tightly

Overfitting due to Noise

• The decision boundary is distorted by a noise point

Overfitting due to Lack of Representative Samples

• A lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region

Other Performance Measures

• Speed: the computational cost involved in generating and using the model
• Robustness: the ability to make correct predictions from noisy/missing values
• Scalability: the ability to construct the classifier given large amounts of data
• Interpretability: the level of insight provided by the classifier

Holdout Method

• Split the full data set into two disjoint sets: the training set and the test set
• Limitations:
  • Fewer data are available for training
  • The model is highly dependent on the composition of the two sets
  • The training set and test set are not independent: an overrepresented class in one set will be underrepresented in the other

Random Subsampling

• Repeat the holdout method multiple times to improve the estimate
• Accuracy: acc_sub = (1/k) * sum_{i=1..k} acc_i
• Limitations:
  • Not using as much data as possible for training
  • Some records are used multiple times for training

K-Fold Cross-Validation

• Split the data into k disjoint sets
• In each iteration, one set is used for testing and the remaining k−1 for training
• Advantage: all records are used for both training and testing

Leave-One-Out Cross-Validation

• A special case of cross-validation where k = N
• Uses as much data as possible for training
• The test sets are mutually exclusive
• Computationally expensive

Bootstrap

• Training records are sampled with replacement, so the training data may contain duplicate records
• Repeat the process b times to generate b bootstrap samples
• When N records are chosen with replacement from an N-record set, the probability that a given record is selected is
  1 − (1 − 1/N)^N ≈ 1 − e^(−1) = 0.632 if N is sufficiently large
• Accuracy (.632 bootstrap):
  acc_boot = (1/b) * sum_{i=1..b} (0.632 * acc_i + 0.368 * acc_s)
  where acc_i is the accuracy of bootstrap sample i and acc_s is the accuracy of the original sample used as the training set

Comparing Classifiers

• Consider two models M_A and M_B:
  • M_A: 85% accuracy on a test set of 30 records
  • M_B: 75% accuracy on a test set of 5000 records
• Which one is better?
• How much confidence does each accuracy have?
• Can we explain the difference in accuracy as a result of variations in the test sets?

Confidence Interval of Accuracy

• The task of predicting a label is a binomial experiment with probability of success p
• If the test set contains N records, let X be the number of records correctly predicted
• X has a binomial distribution with mean Np and variance Np(1 − p):
  P(X = v) = C(N, v) * p^v * (1 − p)^(N − v)
• Example: obtaining heads on a coin toss has p = 0.5; the probability of obtaining heads 20 times in 50 tosses is
  P(X = 20) = C(50, 20) * 0.5^20 * (1 − 0.5)^30 ≈ 0.0419
• The accuracy X/N has a binomial distribution with mean p and variance p(1 − p)/N
• For large N, X/N has approximately a normal distribution with mean p and variance p(1 − p)/N

Confidence Interval for Accuracy

• Confidence intervals are constructed for given confidence levels
• Example: the confidence interval is between 82% and 88% at confidence level 95%
• Meaning: if the population is sampled multiple times, the result of the experiment falls within the confidence interval (82%, 88%) 95% of the time

Z-Score

• A measure of how many standard deviations an element is below or above the mean:
  z = (x − μ) / σ
• What is the z-score corresponding to the 96th percentile? z = 1.75

Confidence Interval for Accuracy

• Using the normal approximation:
  P( −Z_{α/2} ≤ (acc − p) / sqrt(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α
• Solving for p gives the confidence interval:
  p = ( 2N*acc + Z²_{α/2} ± Z_{α/2} * sqrt( Z²_{α/2} + 4N*acc − 4N*acc² ) ) / ( 2(N + Z²_{α/2}) )
• Example: consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  • N = 100, acc = 0.8
  • Let 1 − α = 0.95 (95% confidence)
  • From the probability table, Z_{α/2} = 1.96
  • [table: confidence interval of the true accuracy for N = 50, 100, 1000]
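The confusion-matrix accuracy and error-rate formulas above can be checked with a short sketch. The counts f11, f10, f01, f00 follow the slides' notation; the helper function names are my own, not from the lecture:

```python
def confusion_counts(actual, predicted):
    """Count f11, f10, f01, f00 for binary labels in {0, 1}.

    First index: actual class, second index: predicted class.
    """
    f11 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    f10 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    f01 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    f00 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return f11, f10, f01, f00

def accuracy(f11, f10, f01, f00):
    """Fraction of correct predictions: (f11 + f00) / total."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of wrong predictions: (f10 + f01) / total."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# 3 of 5 predictions match, so accuracy is 0.6 and error rate 0.4.
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that accuracy and error rate always sum to 1, since every prediction is either correct or wrong.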
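The k-fold cross-validation procedure described above can be sketched as pure-Python index splitting (the function name is mine; in practice one would use a library implementation such as scikit-learn's KFold):

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of k folds.

    The data is split into k disjoint sets; in each iteration one set
    is held out for testing and the remaining k-1 are used for training,
    so every record is used for both training and testing.
    """
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

splits = list(k_fold_splits(10, 5))
```

Setting k equal to the number of records gives the leave-one-out special case from the slides.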
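The .632 bootstrap accuracy estimate above translates directly into code. Here acc_i are the per-sample bootstrap accuracies and acc_s is the accuracy of the model trained on the original sample, as in the slides; the function name is my own:

```python
def bootstrap_632_accuracy(boot_accs, acc_s):
    """acc_boot = (1/b) * sum(0.632 * acc_i + 0.368 * acc_s)."""
    b = len(boot_accs)
    return sum(0.632 * acc_i + 0.368 * acc_s for acc_i in boot_accs) / b

# Probability that a given record appears in a bootstrap sample of size N:
# 1 - (1 - 1/N)^N, which tends to 1 - 1/e = 0.632 for large N.
p_selected = 1 - (1 - 1 / 1000) ** 1000
```

The 0.632/0.368 weights are exactly this selection probability and its complement, which is where the estimator's name comes from.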
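The coin-toss example above (P(X = 20) in 50 fair tosses ≈ 0.0419) can be verified with the binomial formula; `math.comb` from the standard library computes the binomial coefficient C(n, v):

```python
from math import comb

def binomial_pmf(v, n, p):
    """P(X = v) = C(n, v) * p**v * (1 - p)**(n - v)."""
    return comb(n, v) * p ** v * (1 - p) ** (n - v)

p_20_heads = binomial_pmf(20, 50, 0.5)  # ~0.0419, as on the slide
```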
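The closed-form confidence interval above can be applied to the worked example from the slides (N = 100, acc = 0.8, Z_{α/2} = 1.96 for 95% confidence). A minimal sketch, with a function name of my own choosing:

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z):
    """Bounds on the true accuracy p from the slides' formula:

    p = (2N*acc + z^2 +/- z*sqrt(z^2 + 4N*acc - 4N*acc^2)) / (2*(N + z^2))
    """
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# N = 100 test instances, observed accuracy 0.8, 95% confidence.
lo, hi = accuracy_confidence_interval(0.8, 100, 1.96)
```

As expected, the interval tightens as N grows, since the variance of X/N is p(1 − p)/N.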