Chapter 3: Supervised Learning

Road Map
- Basic concepts
- Decision tree induction
- Evaluation of classifiers
- Classification using association rules
- Naïve Bayesian classification
- Naïve Bayes for text classification
- Support vector machines
- K-nearest neighbor
- Ensemble methods: Bagging and Boosting
- Summary

An example application
- An emergency room in a hospital measures 17 variables (e.g., blood pressure, age) of newly admitted patients.
- A decision is needed: whether to put a new patient in an intensive-care unit (ICU).
- Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.
- Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another application
- A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc.
- Problem: to decide whether an application should be approved, i.e., to classify applications into two categories, approved and not approved.

Machine learning and our focus
- Like human learning from past experiences.
- A computer does not have "experiences".
- A computer system learns from data, which represent some "past experiences" of an application domain.
- Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
- The task is commonly called supervised learning, classification, or inductive learning.

The data and the goal
- Data: a set of data records (also called examples, instances or cases) described by
  - k attributes: A1, A2, …, Ak.
  - a class: each example is labelled with a pre-defined class.
- Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

An example: data (loan application)
[Table not preserved: loan application data; class attribute: approved or not]

An example: the learning task
- Learn a classification model from the data.
- Use the model to classify future loan applications into Yes (approved) and No (not approved).
- What is the class for the following case/instance? [Instance not preserved]

Supervised vs. unsupervised learning
- Supervised learning: classification is seen as supervised learning from examples.
  - Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes. It is as if a "teacher" gives the classes (supervision).
  - Test data are classified into these classes too.
- Unsupervised learning (clustering)
  - Class labels of the data are unknown.
  - Given a set of data, the task is to establish the existence of classes or clusters in the data.

Supervised learning process: two steps
- Learning (training): learn a model using the training data.
- Testing: test the model using unseen test data to assess the model accuracy.

Accuracy = (Number of correct classifications) / (Total number of test cases)

What do we mean by learning?
- Given a data set D, a task T, and a performance measure M, a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M.
- In other words, the learned model helps the system to perform T better as compared to no learning.

An example
- Data: loan application data.
- Task: predict whether a loan should be approved or not.
- Performance measure: accuracy.
- No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%.
- We can do better than 60% with learning.

Fundamental assumption of learning
- Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).
- In practice, this assumption is often violated to a certain degree.
- Strong violations will clearly result in poor classification accuracy.
- To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

Decision tree induction

Introduction
- Decision tree learning is one of the most widely used techniques for classification.
  - Its classification accuracy is competitive with other methods, and
  - it is very efficient.
- The classification model is a tree, called a decision tree.
- C4.5 by Ross Quinlan is perhaps the best-known system. It can be downloaded from the Web.

The loan data (reproduced)
[Table not preserved: loan application data; class attribute: approved or not]

A decision tree from the loan data
[Figure not preserved: decision nodes and leaf nodes (classes)]

Use the decision tree
[Figure not preserved; the predicted class for the example case is No]

Is the decision tree unique?
- No. Here is a simpler tree. [Figure not preserved]
- We want a tree that is both small and accurate. A small tree is easy to understand and tends to perform better.
- Finding the best tree is NP-hard.
- All current tree-building algorithms are heuristic algorithms.

From a decision tree to a set of rules
- A decision tree can be converted to a set of rules.
- Each path from the root to a leaf is a rule.

Algorithm for decision tree learning
- Basic algorithm (a greedy divide-and-conquer algorithm)
  - Assume attributes are categorical for now (continuous attributes can be handled too).
  - The tree is constructed in a top-down recursive manner.
  - At the start, all the training examples are at the root.
  - Examples are partitioned recursively based on selected attributes.
  - Attributes are selected on the basis of an impurity function (e.g., information gain).
- Conditions for stopping partitioning
  - All examples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning; the majority class is the leaf.
  - There are no examples left.

Decision tree learning algorithm
[Algorithm listing not preserved]

Choose an attribute to partition data
The key to building a decision tree
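
The basic algorithm described above (top-down and recursive, with attributes chosen by information gain and the three stopping conditions) can be sketched in Python roughly as follows. This is a minimal ID3-style illustration under stated assumptions, not C4.5 itself and not the listing from the original slides. The function names (entropy, information_gain, build_tree, classify) and the six-row data set are hypothetical; the data set is made up for the example and is not the loan table referred to in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the examples on attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def majority_class(labels):
    """Most frequent class label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, attrs, default=None):
    """Greedy top-down induction over categorical attributes.

    Returns a class label (leaf) or a tuple (attr, branches, majority)
    for a decision node.
    """
    if not labels:                 # stop: no examples left (only reachable
        return default             # if called with an empty example set)
    if len(set(labels)) == 1:      # stop: all examples share one class
        return labels[0]
    majority = majority_class(labels)
    if not attrs:                  # stop: no attributes left, use majority
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset],
                                     remaining, majority)
    return (best, branches, majority)


def classify(tree, row):
    """Follow one root-to-leaf path; unseen values fall back to the majority."""
    while isinstance(tree, tuple):
        attr, branches, majority = tree
        tree = branches.get(row[attr], majority)
    return tree


# Hypothetical toy data (NOT the loan data from the slides).
rows = [
    {"has_job": "no", "own_house": "no"},
    {"has_job": "yes", "own_house": "no"},
    {"has_job": "no", "own_house": "yes"},
    {"has_job": "yes", "own_house": "yes"},
    {"has_job": "no", "own_house": "no"},
    {"has_job": "no", "own_house": "yes"},
]
labels = ["No", "Yes", "Yes", "Yes", "No", "Yes"]

tree = build_tree(rows, labels, ["has_job", "own_house"])
predictions = [classify(tree, r) for r in rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

On this toy data the root splits on own_house, because that attribute has the larger information gain. The accuracy computed here is on the training examples only; as the text notes, a model should be assessed on unseen test data. Each root-to-leaf path followed by classify corresponds to one rule.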
Document title: Web Data Mining, 3: Supervised Learning (1)
Link: https://www.777doc.com/doc-2867264 .html