Visualizing High-Dimensional Data Using t-SNE vand

Journal of Machine Learning Research 9 (2008) 2579-2605
Submitted 5/08; Revised 9/08; Published 11/08

Visualizing Data using t-SNE

Laurens van der Maaten (LVDMAATEN@GMAIL.COM)
TiCC, Tilburg University
P.O. Box 90153, 5000 LE Tilburg, The Netherlands

Geoffrey Hinton (HINTON@CS.TORONTO.EDU)
Department of Computer Science, University of Toronto
6 King's College Road, M5S 3G4 Toronto, ON, Canada

Editor: Yoshua Bengio

© 2008 Laurens van der Maaten and Geoffrey Hinton.

Abstract

We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.

Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

1. Introduction

Visualization of high-dimensional data is an important problem in many different domains, and deals with data of widely varying dimensionality. Cell nuclei that are relevant to breast cancer, for example, are described by approximately 30 variables (Street et al., 1993), whereas the pixel intensity vectors used to represent images or the word-count vectors used to represent documents typically have thousands of dimensions. Over the last few decades, a variety of techniques for the visualization of such high-dimensional data have been proposed, many of which are reviewed by de Oliveira and Levkowitz (2003). Important techniques include iconographic displays such as Chernoff faces (Chernoff, 1973), pixel-based techniques (Keim, 2000), and techniques that represent the dimensions in the data as vertices in a graph (Battista et al., 1994). Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the data to the human observer. This severely limits the applicability of these techniques to real-world data sets that contain thousands of high-dimensional datapoints.

In contrast to the visualization techniques discussed above, dimensionality reduction methods convert the high-dimensional data set $X = \{x_1, x_2, \ldots, x_n\}$ into two or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a scatterplot. In the paper, we refer to the low-dimensional data representation $Y$ as a map, and to the low-dimensional representations $y_i$ of individual datapoints as map points. The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. Various techniques for this problem have been proposed that differ in the type of structure they preserve. Traditional dimensionality reduction techniques such as Principal Components Analysis (PCA; Hotelling, 1933) and classical multidimensional scaling (MDS; Torgerson, 1952) are linear techniques that focus on keeping the low-dimensional representations of dissimilar datapoints far apart. For high-dimensional data that lies on or near a low-dimensional, non-linear manifold it is usually more important to keep the low-dimensional representations of very similar datapoints close together, which is typically not possible with a linear mapping.

A large number of nonlinear dimensionality reduction techniques that aim to preserve the local structure of data have been proposed, many of which are reviewed by Lee and Verleysen (2007). In particular, we mention the following seven techniques: (1) Sammon mapping (Sammon, 1969), (2) curvilinear components analysis (CCA; Demartines and Hérault, 1997), (3) Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2002), (4) Isomap (Tenenbaum et al., 2000), (5) Maximum Variance Unfolding (MVU; Weinberger et al., 2004), (6) Locally Linear Embedding (LLE; Roweis and Saul, 2000), and (7) Laplacian Eigenmaps (Belkin and Niyogi, 2002). Despite the strong performance of these techniques on artificial data sets, they are often not very successful at visualizing real, high-dimensional data. In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map. For instance, a recent study reveals that even a semi-supervised variant of MVU is not capable of separating handwritten digits into their natural clusters (Song et al., 2007).

In this paper, we describe a way of converting a high-dimensional data set into a matrix of pairwise similarities and we introduce a new technique, called "t-SNE", for visualizing the resulting similarity data. t-SNE is capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales. We illustrate the performance of t-SNE by comparing it to the seven dimensionalit
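
The dimensionality reduction setting described in the introduction, in which a data set X is converted into a two-dimensional map Y, can be illustrated with the linear PCA baseline that the text mentions. The following is a minimal sketch assuming NumPy; the function name `pca_map` and the random input data are hypothetical choices for illustration and do not come from the paper. It shows PCA only, not t-SNE itself.

```python
import numpy as np

def pca_map(X, n_components=2):
    """Project the rows of X onto their top principal components.

    Hypothetical helper for illustration: a linear mapping from
    high-dimensional datapoints x_i to low-dimensional map points y_i.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Made-up data: n = 100 datapoints described by 30 variables each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
Y = pca_map(X)  # the "map": one 2-D map point per datapoint
```

Each row of `Y` could then be drawn as one point in a scatterplot. Because the mapping is linear, it is subject to the limitation noted above for data that lies on or near a non-linear manifold.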
