您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 数据挖掘与识别 > 机器学习十大算法8:kNN概要
Chapter8kNN:k-NearestNeighborsMichaelSteinbachandPang-NingTanContents8.1Introduction(1518.2DescriptionoftheAlgorithm(1528.2.1High-LevelDescription(1528.2.2Issues(1538.2.3SoftwareImplementations(1558.3Examples(1558.4AdvancedTopics(1578.5Exercises(158Acknowledgments(159References(1598.1IntroductionOneofthesimplestandrathertrivialclassifiersistheRoteclassifier,whichmemorizestheentiretrainingdataandperformsclassificationonlyiftheattributesofthetestobjectexactlymatchtheattributesofoneofthetrainingobjects.Anobviousproblemwiththisapproachisthatmanytestrecordswillnotbeclassifiedbecausetheydonotexactlymatchanyofthetrainingrecords.Anotherissueariseswhentwoormoretrainingrecordshavethesameattributesbutdifferentclasslabels.Amoresophisticatedapproach,k-nearestneighbor(kNNclassification[10,11,21],findsagroupofkobjectsinthetrainingsetthatareclosesttothetestobject,andbasestheassignmentofalabelonthepredominanceofaparticularclassinthisneighborhood.Thisaddressestheissuethat,inmanydatasets,itisunlikelythatoneobjectwillexactlymatchanother,aswellasthefactthatconflictinginformationabouttheclassofanobjectmaybeprovidedbytheobjectsclosesttoit.Thereareseveralkeyelementsofthisapproach:(ithesetoflabeledobjectstobeusedforevaluatingatestobject’sclass,1(iiadistanceorsimilaritymetricthatcanbeusedtocomputeThisneednotbetheentiretrainingset.151152kNN:k-NearestNeighborstheclosenessofobjects,(iiithevalueofk,thenumberofnearestneighbors,and(ivthemethodusedtodeterminetheclassofthetargetobjectbasedontheclassesanddistancesoftheknearestneighbors.Initssimplestform,kNNcaninvolveassigninganobjecttheclassofitsnearestneighbororofthemajorityofitsnearestneighbors,butavarietyofenhancementsarepossibleandarediscussedbelow.Moregenerally,kNNisaspecialcaseofinstance-basedlearning[1].Thisincludescase-basedreasoning[3],whichdealswithsymbolicdata.ThekNNapproachisalsoanexampleofalazylearningtechnique,thatis,atechniquethatwaitsuntilthequeryarrivestogeneralizebeyondthetrainingdata.AlthoughkNNclassificationisaclassificationtechniquethatiseasytounderstandandimplement,itperformswellinmanysituations.Inparticular,awell-knownresultbyCoverandHart[6]showsthattheclassificationerror2ofthenearestneighborruleisboundedabovebytwicetheoptimalBayeserror3undercertainreasonableassump-tions.Furthermore,theerrorofthegeneralkNNmethodasymptoticallyapproachesthatoftheBayeserrorandcanbeusedtoapproximateit.Also,becauseofitssimplicity,kNNiseasytomodifyformorecomplicatedclassifi-cationproblems.Forinstance,kNNisparticularlywell-suitedformultimodalclasses4aswellasapplicationsinwhichanobjectcanhavemanyclasslabels.Toillustratethelastpoint,fortheassignmentoffunctionstogenesbasedonmicroarrayexpressionprofiles,someresearchersfoundthatkNNoutperformedasupportvectormachine(SVMapproach,whichisamuchmoresophisticatedclassificationscheme[17].TheremainderofthischapterdescribesthebasickNNalgorithm,includingvari-ousissuesthataffectbothclassificationandcomputationalperformance.PointersaregiventoimplementationsofkNN,andexamplesofusingtheWekamachinelearn-ingpackagetoperformnearestneighborclassificationarealsoprovided.Advancedtechniquesarediscussedbrieflyandthischapterconcludeswithafewexercises.8.2DescriptionoftheAlgorithm8.2.1High-LevelDescriptionAlgorithm8.1providesahigh-levelsummaryofthenearest-neighborclassificationmethod.GivenatrainingsetDandatestobjectz,whichisavectorofattributevaluesandhasanunknownclasslabel,thealgorithmcomputesthedistance(orsimilarity2Theclassificationerrorofaclassifieristhepercentageofinstancesitincorrectlyclassifies.3TheBayeserroristheclassificationerrorofaBayesclassifier,thatis,aclassifierthatknowstheunderlyingprobabilitydistributionofthedatawithrespecttoclass,andassignseachdatapointtotheclasswiththehighestprobabilitydensityforthatpoint.Formoredetail,see[9].4Withmultimodalclasses,objectsofaparticularclasslabelareconcentratedinseveraldistinctareasofthedataspace,notjustone.Instatisticalterms,theprobabilitydensityfunctionfortheclassdoesnothaveasingle“bump”likeaGaussian,butrather,hasanumberofpeaks.8.2DescriptionoftheAlgorithm153Algorithm8.1BasickNNAlgorithmInput:D,thesetoftrainingobjects,thetestobject,z,whichisavectorofattributevalues,andL,thesetofclassesusedtolabeltheobjectsOutput:cz∈L,theclassofzforeachobjecty∈Ddo|Computed(z,y,thedistancebetweenzandy;endSelectN⊆D,theset(neighborhoodofkclosesttrainingobjectsforz;cz=argmaxv∈Ly∈NI(v=class(cy;whereI(·isanindicatorfunctionthatreturnsthevalue1ifitsargumentistrueand0otherwise.betweenzandallthetrainingobjectstodetermineitsnearest-neighborlist.Itthenassignsaclasstozbytakingtheclassofthemajorityofneighboringobjects.Tiesarebrokeninanunspecifiedmanner,forexample,randomlyorbytakingthemostfrequentclassinthetrainingset.ThestoragecomplexityofthealgorithmisO(n,wherenisthenumberoftrainingobjects.ThetimecomplexityisalsoO(n,sincethedistanceneedstobecomputedbetweenthetargetandeachtrainingobject.However,thereisnotimetakenfortheconstructionoftheclassificationmodel,forexample,adecisiontreeorsepa-ratinghyperplane.Thus,kNNisdifferentfrommostotherclassificationtechniqueswhichhavemoderatelytoqu
本文标题:机器学习十大算法8:kNN概要
链接地址:https://www.777doc.com/doc-4208117 .html