Project 1: k-nearest-neighbor method

Basic

1) Develop a k-NN classifier with Euclidean distance and simple voting

This is done by function "knn.m". Inputs: training data, test data and the k value. Output: the classification result of the test data. All the data are used both for training and for validation. The accuracy curves are drawn by script "main.m" (test k-NN and plot accuracy curves) for k from 1 to 50 (see Figure 1).

Figure 1  The accuracy curve of the k-NN classifier

2) Perform 5-fold cross validation, find out which k performs the best (in terms of accuracy)

This is done by function "cross_validation.m". Inputs: the provided data and the k value. Outputs: the k-NN accuracy and the value of k that performs the best. "cross_validation.m" divides the provided data into 5 parts at random, performs 5-fold cross-validation by invoking "knn.m", calculates the k-NN accuracy for each value of k, and returns the best k value found by the 5-fold cross validation (an illustrative sketch of this procedure is given further below). Script "main.m" (test 5-fold cross validation, plot accuracy curves and plot best_k) invokes "cross_validation.m", plots the accuracy curves of the 5-fold cross validation and reports the best k value (see Figure 2). Because the data are grouped at random inside "cross_validation.m", the results of "main.m" (the accuracy and the best k value) differ from run to run. The green symbol "*" marks the best working point of the 5-fold cross validation, and the corresponding best k value is printed out.

Figure 2  5-fold CV, simple voting, without PCA

3) Use PCA to reduce the dimensionality to 6, then perform 2) again. Does PCA improve the accuracy?

This is done by adding several lines of code to script "main.m" (dimension reduction: PCA, plot accuracy curves and plot best_k). The accuracy curves of the 5-fold cross-validation after using PCA to reduce the dimensionality to 6 are shown in Figure 3. Compared with Figure 2, PCA did not improve the accuracy, but it does reduce the amount of computation.

Figure 3  5-fold CV, simple voting, with PCA

Plus

● Explore the data before classification using summary statistics or visualization

This is done by script "EDA.m". A scatter plot, parallel coordinates, a box plot and empirical CDFs are used in this data analysis (see Figure 4). The scatter plot and the parallel coordinates of the raw data are grouped by the grouping variable "categories"; there are three categories: category 1, category 2 and category 3. The box plot covers all 13 variables, while the empirical CDF is drawn only for the first four variables.

Figure 4.1  Data displayed with a scatter plot
Figure 4.2  Data displayed with parallel coordinates
Figure 4.3  Box plot of the 13 variables
Figure 4.4  Empirical CDF of the first four variables

● Pre-process the data

This is done by script "main.m".

1) Normalization of the training and test data. As can be seen in Figure 2 and Figure 3, the classifier reaches higher accuracy on the normalized data. In the raw data the values of some variables are quite large, so we pre-process the data to weaken the influence of these variables.

2) Denoising. After denoising and normalization we obtain different results (see Figure 5).

Figure 5  Box plot after denoising and normalization

● Try other distance metrics or distance-based voting

1) Other distance metric: Manhattan distance. This is done by function "knn.m" and script "ceshi.m". The Manhattan distance is used to develop a k-NN classifier with simple voting. All the data are pre-processed by denoising and normalization. We perform 5-fold cross validation and plot the accuracy curves for both the Euclidean distance and the Manhattan distance (see Figure 6.1).

2) Distance-based voting. This is done by function "knn.m" and script "ceshi.m". Distance-based voting is used to develop a k-NN classifier with Euclidean distance. All the data are pre-processed by denoising and normalization. We perform 5-fold cross validation and plot the accuracy curves for both simple voting and distance-based voting (see Figure 6.2). A sketch of such a classifier follows this section.

Figure 6.1  Accuracy curves of Euclidean distance and Manhattan distance
Figure 6.2  Accuracy curves of simple voting and distance-based voting
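As an illustration only, a minimal MATLAB sketch of such a k-NN classifier is shown below. It supports both the Euclidean and Manhattan distances and both simple and distance-based (inverse-distance weighted) voting. The function name knn_sketch, its interface and the assumption of numeric class labels are hypothetical and do not reproduce the actual "knn.m".

```matlab
function labels = knn_sketch(Xtrain, ytrain, Xtest, k, metric, voting)
% Minimal k-NN sketch (hypothetical interface, numeric class labels).
% metric: 'euclidean' or 'manhattan'; voting: 'simple' or 'distance'.
n = size(Xtest, 1);
labels = zeros(n, 1);
classes = unique(ytrain);
for i = 1:n
    diffs = Xtrain - Xtest(i, :);            % implicit expansion (R2016b+)
    if strcmp(metric, 'manhattan')
        dist = sum(abs(diffs), 2);           % L1 distance
    else
        dist = sqrt(sum(diffs.^2, 2));       % Euclidean distance
    end
    [dist, idx] = sort(dist);
    nnLabels = ytrain(idx(1:k));             % labels of the k nearest neighbors
    if strcmp(voting, 'distance')
        w = 1 ./ (dist(1:k) + eps);          % closer neighbors get larger votes
        score = zeros(numel(classes), 1);
        for c = 1:numel(classes)
            score(c) = sum(w(nnLabels == classes(c)));
        end
        [~, best] = max(score);
        labels(i) = classes(best);
    else
        labels(i) = mode(nnLabels);          % simple majority vote
    end
end
end
```

A call such as knn_sketch(Xtrain, ytrain, Xtest, 5, 'manhattan', 'distance') would then combine the Manhattan distance with distance-based voting.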
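In the same spirit, the 5-fold cross-validation of Basic 2) and the PCA reduction to 6 dimensions of Basic 3) could be sketched as follows. This is not the actual "cross_validation.m" or "main.m"; it builds on the hypothetical knn_sketch above and assumes the features X (n-by-13) and numeric labels y are already loaded.

```matlab
% Sketch: 5-fold cross-validation over k = 1..50, optionally after
% projecting onto the first 6 principal components (Basic 3).
% Assumes X (n-by-13 features), y (numeric labels) and knn_sketch above.
usePCA = true;
if usePCA
    Xc = X - mean(X, 1);                     % center the data
    [~, ~, V] = svd(Xc, 'econ');             % principal directions
    X = Xc * V(:, 1:6);                      % keep 6 components
end
n = size(X, 1);
folds = mod(randperm(n)', 5) + 1;            % random assignment to 5 folds
acc = zeros(50, 1);
for k = 1:50
    correct = 0;
    for f = 1:5
        te = (folds == f);                   % current validation fold
        tr = ~te;
        pred = knn_sketch(X(tr, :), y(tr), X(te, :), k, 'euclidean', 'simple');
        correct = correct + sum(pred == y(te));
    end
    acc(k) = correct / n;                    % pooled accuracy over the 5 folds
end
[bestAcc, bestK] = max(acc);
plot(1:50, acc, '-o'); hold on;
plot(bestK, bestAcc, 'g*');                  % best working point, as in Figure 2
xlabel('k'); ylabel('accuracy');
```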
● Try other dimensionality reduction methods

This is done by function "knn.m" and script "ceshi.m", using the LLE dimensionality reduction method. Steps of the LLE algorithm (a sketch of these steps is given at the end of the report):

a) Find the nearest neighbors.
b) Use the K nearest neighbors to reconstruct each point.
c) Find the locally linear combination weights by minimizing the reconstruction error E(W) = Σ_i ||X_i - Σ_j W_ij X_j||^2.
d) If X_j is not one of the K nearest neighbors of X_i, set W_ij = 0.
e) Find the low-dimensional vectors Y_i.
f) When mapping X_i to Y_i, keep the weights W_ij fixed and minimize Φ(Y) = Σ_i ||Y_i - Σ_j W_ij Y_j||^2 over Y.

Figure 7  Accuracy curves of PCA and LLE

● How to set the k value, if not using cross validation? Verify your idea

Bootstrapping is used to set the k value here. The key point of bootstrapping is sampling with replacement: after we randomly draw an observation from the original sample, we put it back before drawing the next one. As a result, any observation can be drawn once, more than once, or not at all. If we sampled without replacement, we would end up with the same set of observations we started with, only in a different order.

Steps of the experiment (a MATLAB sketch of this loop is given below):

a) Sample the wine data with replacement to create the training data.
b) Use the remaining samples as the test data.
c) Run the k-NN classifier.

Figure 8  Accuracy curves of 5-fold cross validation and bootstrap

From Figure 8 we can see that both the 5-fold cross validation and the bootstrap perform best at k = 1, and they have almost the same performance in terms of accuracy. If cross validation is not used, the bootstrap is another way to set the k value.
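A minimal MATLAB sketch of this bootstrap experiment is given below; it again builds on the hypothetical knn_sketch function, and the number of bootstrap rounds B and the variables X and y are assumptions rather than the actual code.

```matlab
% Sketch of the bootstrap experiment: a bootstrap sample (drawn with
% replacement) is the training set, the out-of-bag samples are the test
% set. Assumes X, y and knn_sketch as in the earlier sketches.
n = size(X, 1);
B = 20;                                      % number of bootstrap rounds (assumption)
acc = zeros(50, 1);
for b = 1:B
    inBag = randi(n, n, 1);                  % n indices drawn with replacement
    oob = setdiff((1:n)', inBag);            % out-of-bag samples form the test set
    for k = 1:50
        pred = knn_sketch(X(inBag, :), y(inBag), X(oob, :), k, 'euclidean', 'simple');
        acc(k) = acc(k) + mean(pred == y(oob)) / B;   % average accuracy over rounds
    end
end
[~, bestK] = max(acc);                       % bootstrap estimate of the best k
```

Averaging the out-of-bag accuracy over several bootstrap rounds reduces the variance of the resulting estimate of the best k.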
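For completeness, the LLE steps a) to f) listed above could be sketched as follows. This is a generic minimal implementation (with neighbor count K and output dimension d as inputs), not the code used in "ceshi.m"; the regularization constant and the function name lle_sketch are assumptions.

```matlab
function Y = lle_sketch(X, K, d)
% Minimal LLE sketch: X is n-by-D, K nearest neighbors, d output dimensions.
n = size(X, 1);
W = zeros(n, n);
for i = 1:n
    dist = sum((X - X(i, :)).^2, 2);         % squared distances to X_i
    dist(i) = inf;                           % exclude the point itself
    [~, idx] = sort(dist);
    nb = idx(1:K);                           % step a): K nearest neighbors
    Z = X(nb, :) - X(i, :);                  % neighbors centered on X_i
    C = Z * Z';                              % local Gram matrix (K-by-K)
    C = C + eye(K) * (1e-3 * trace(C) + 1e-12);   % regularization (assumption)
    w = C \ ones(K, 1);                      % steps b)-c): reconstruction weights
    W(i, nb) = w' / sum(w);                  % weights sum to 1; others stay 0 (step d)
end
M = (eye(n) - W)' * (eye(n) - W);            % steps e)-f): minimize ||Y - W*Y||^2
[V, D] = eig((M + M') / 2);
[~, order] = sort(diag(D));
Y = V(:, order(2:d+1));                      % skip the bottom (constant) eigenvector
end
```

The weights of each point are constrained to sum to one, and the embedding is read off the eigenvectors of (I - W)'(I - W) with the smallest non-zero eigenvalues.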