Journal of Machine Learning Research 9 (2008) 2579-2605
Submitted 5/08; Revised 9/08; Published 11/08
©2008 Laurens van der Maaten and Geoffrey Hinton.

Visualizing Data using t-SNE

Laurens van der Maaten    LVDMAATEN@GMAIL.COM
TiCC
Tilburg University
P.O. Box 90153, 5000 LE Tilburg, The Netherlands

Geoffrey Hinton    HINTON@CS.TORONTO.EDU
Department of Computer Science
University of Toronto
6 King's College Road, M5S 3G4 Toronto, ON, Canada

Editor: Yoshua Bengio

Abstract

We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.

Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

1. Introduction

Visualization of high-dimensional data is an important problem in many different domains, and deals with data of widely varying dimensionality. Cell nuclei that are relevant to breast cancer, for example, are described by approximately 30 variables (Street et al., 1993), whereas the pixel intensity vectors used to represent images or the word-count vectors used to represent documents typically have thousands of dimensions. Over the last few decades, a variety of techniques for the visualization of such high-dimensional data have been proposed, many of which are reviewed by de Oliveira and Levkowitz (2003). Important techniques include iconographic displays such as Chernoff faces (Chernoff, 1973), pixel-based techniques (Keim, 2000), and techniques that represent the dimensions in the data as vertices in a graph (Battista et al., 1994). Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the data to the human observer. This severely limits the applicability of these techniques to real-world data sets that contain thousands of high-dimensional datapoints.

In contrast to the visualization techniques discussed above, dimensionality reduction methods convert the high-dimensional data set $X = \{x_1, x_2, \ldots, x_n\}$ into two or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a scatterplot. In the paper, we refer to the low-dimensional data representation $Y$ as a map, and to the low-dimensional representations $y_i$ of individual datapoints as map points. The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. Various techniques for this problem have been proposed that differ in the type of structure they preserve. Traditional dimensionality reduction techniques such as Principal Components Analysis (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) are linear techniques that focus on keeping the low-dimensional representations of dissimilar datapoints far apart. For high-dimensional data that lies on or near a low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a linear mapping.

A large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007). In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969), (2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong performance of these techniques on artificial data sets, they are often not very successful at visualizing real, high-dimensional data. In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map. For instance, a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song et al., 2007).

In this paper, we describe a way of converting a high-dimensional data set into a matrix of pairwise similarities and we introduce a new technique, called "t-SNE", for visualizing the resulting similarity data. t-SNE is capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales. We illustrate the performance of t-SNE by comparing it to the seven dimensionalit
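
As a rough illustration of what the introduction means by converting a data set $X$ into a low-dimensional map $Y$, the sketch below applies PCA, one of the linear techniques named above, to synthetic data. It is not the t-SNE method of this paper. The helper name `pca_map`, the use of NumPy's SVD, and the random 30-variable data are assumptions made only for this example.

```python
import numpy as np

def pca_map(X, n_components=2):
    """Hypothetical helper: project the rows of X onto their top principal components."""
    # Center each variable so the principal directions describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Each row of the result is one map point y_i for the datapoint x_i.
    return X_centered @ Vt[:n_components].T

# Synthetic stand-in data (an assumption): n = 200 datapoints with 30 variables each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))

Y = pca_map(X)   # the two-dimensional "map" of X
print(Y.shape)   # (200, 2)
```

Each row of `Y` can be plotted as one point in a scatterplot. Because the projection is linear, it tends to keep dissimilar datapoints far apart rather than keep very similar datapoints close together, which is the limitation the introduction attributes to linear techniques.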