A tutorial on Principal Components Analysis

Lindsay I Smith

February 26, 2002

Chapter 1
Introduction

This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.

There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.

Chapter 2
Background Mathematics

This section will attempt to give some elementary background mathematical skills that will be required to understand the process of Principal Components Analysis. The topics are covered independently of each other, and examples given. It is less important to remember the exact mechanics of a mathematical technique than it is to understand the reason why such a technique may be used, and what the result of the operation tells us about our data. Not all of these techniques are used in PCA, but the ones that are not explicitly required do provide the grounding on which the most important techniques are based.

I have included a section on Statistics which looks at distribution measurements, or, how the data is spread out. The other section is on Matrix Algebra and looks at eigenvectors and eigenvalues, important properties of matrices that are fundamental to PCA.

2.1 Statistics

The entire subject of statistics is based around the idea that you have this big set of data, and you want to analyse that set in terms of the relationships between the individual points in that data set. I am going to look at a few of the measures you can do on a set of data, and what they tell you about the data itself.

2.1.1 Standard Deviation

To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples of some bigger population. There is a reference later in this section pointing to more information about samples and populations.

Here’s an example set:

I could simply use the symbol $X$ to refer to this entire set of numbers. If I want to refer to an individual number in this data set, I will use subscripts on the symbol $X$ to indicate a specific number. E.g. $X_3$ refers to the 3rd number in $X$, namely the number 4. Note that $X_1$ is the first number in the sequence, not $X_0$ like you may see in some textbooks. Also, the symbol $n$ will be used to refer to the number of elements in the set $X$.

There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Notice the symbol $\bar{X}$ (said “X bar”) to indicate the mean of the set $X$. All this formula says is “Add up all the numbers and then divide by how many there are”.

Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of middle point. For example, these two data sets have exactly the same mean (10), but are obviously quite different:
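The mean formula, and the point that two quite different sets can share a mean, can be sketched in Python. This is only a minimal sketch under stated assumptions: the helper name `mean` and the two lists `set_a` and `set_b` are illustrative choices made for this example, picked so that both means come out to 10. It also shows how the 1-based subscripts used in this tutorial map onto Python's 0-based list indices.

```python
def mean(xs):
    """Add up all the numbers and then divide by how many there are."""
    n = len(xs)  # n is the number of elements in the set
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(xs) / n


# Illustrative values (an assumption for this sketch), chosen so that
# both sets have a mean of 10 even though their spread differs.
set_a = [0, 8, 12, 20]
set_b = [8, 9, 11, 12]

mean_a = mean(set_a)
mean_b = mean(set_b)

# The tutorial counts from 1, Python counts from 0,
# so the 3rd number of a set is at list index 3 - 1.
third_of_a = set_a[3 - 1]

print(mean_a, mean_b)  # prints: 10.0 10.0
print(third_of_a)      # prints: 12
```

Both calls return the same value, which is exactly why the mean alone cannot distinguish the two sets.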