您好,欢迎访问三七文档
PictureFiltering&Locality-SensitiveHashingLixiangLiuJanuary13th,2010ContentPictureFiltering&OurApproachLocality-SensitiveHashingPictureFiltering-ANewApproachProblemdescriptionCurrentmethodsOurapproachProblemDescription•Findthoseimagessimilartospecificimagefromasetoffiles.•Acquirements:–Highaccuracyrate–SpeedCurrentMethods•Symbolassist•Featureextraction•Machinelearning•……PictureFilter-TheBruteForceApproach•LinearScan–ProgrammingisSimple!–Thenumberoffilescouldbebillions,pictureprocessingistooSlow!IdeasSummary•Distributed/ParallelComputing–Concurrentprogramming–Distributedcomputing•Algorithmoptimized–Advanceddatastructure?–Algorithmsfromrelevantresearchfield•Other–Pipelines?–Codetuning?ConcurrentProgramming•Multipleprocessors(core)computersarewidelyavailable.•ProgramminglanguagessuchasJava、Erlang、Scalaetc.aresupportconcurrentprogramming.点拆分•将图像的点平均拆分成N份•N个线程分别处理每一份的数据•线程间采用共享计数器的方法点拆分数据集合拆分•将索引数据平均拆分成N份,对每一份数据集合分别建立对应的索引(可以用多个线程并发)•将待匹配图像数据复制N份,分别在每个集合中查找•线程间不存在交互,不需要考虑共享计数器等问题,容易扩展到分布式处理模型数据集合拆分ExperimentResult--Summary•以上两种都采用静态拆分的方法(点拆分按点数评价拆分、数据集合则按文件数目平均拆分)•并发的方法的执行时间为线性的1/4•数据量大时数据集合拆分的效率略优于点拆分的办法ExperimentResult(数据说明)Case精确(模糊)1(10)2(11)3(12)4(13)5(14)6(15)7(16)8(17)9(18)说明同文件1FileSize(KB)116116116116116116131343PointNumber5125125125125125125655191ExperimentResult单位:毫秒小文件处理方面v32性能不如v8ExperimentResult单位:毫秒小文件处理方面v32性能不如v8DistributedComputing•Usingdistributedcomputingmethods–HPC(HighPerformanceComputing)•GridSim•MPICH2(aMPIimplementation)–MapReduce•Hadoop(Opensourceproject,Java)•CUDA(onNVIDIAGPUs)•……AlgorithmOptimization•算法和数据结构的优化分两个方向:•1、通过实验调节设计符合分布式并行处理的数据结构;•2、借鉴并改进信息检索领域现有成熟的算法(特别是SimilarSearch相关算法),使其符合我们的应用。Why“SimilarSearch”?Modelforpicturefiltering•RangeQueriesModelforsimilarsearch•NNS(NearestNeighborSearch)DefinitionofNN•Given:asetPofpointsinRd•NearestNeighbor:foranyqueryq,returnsapointp∈Pminimizing||p-q||•r-NearNeighbor:foranyqueryq,returnsapointp∈Ps.t.||p-q||≤r(ifitexists)ThispagecomesfromPiotrIndyk’slecture“Near-OptimalHashingAlgorithmsforApproximateNear(est)NeighborProblem”onFOCS06SolutiontoNNS•d=2–Voronoidiagram•Space:O(n)•Querytime:O(logn)•d2–K-dTree?•work“well”in“low-medium”dimensions•Near-linearquerytimeforhighdimensions–LSHSolutiontoNNS—Summary•locality-sensitivehashing(LSH),low-distortionembeddings,k-dtrees,kd-trees,metrictrees,M-trees,R*-trees,vp-trees,vantagepointtrees,vantagepointforest,multi-vantagepointtree,bisectortrees,Orchard'salgorithm,randomprojections,fixedqueriestree,Voronoitree,BBD-tree,min-wiseindependentpermutations,Burkhard-Kellertree,generalizedhyperplanetree,geometricnear-neighboraccesstree(GNAT),approximatingeliminatingsearchalgorithm(AESA),invertedindex,spatialapproximationtree(SAT)……•From:•Currently,thenewfamilyismostlyoftheoreticalinterest.Thisisbecausetheasymptoticimprovementintherunningtimeachievedviaabetterseparationofcollisionprobabilitiesmakesadifferenceonlyforarelativelylargenumberofinputpoints.Nevertheless,itisquitelikelythatonecandesignbetterpseudorandompointconfigurationswhichdonotsufferfromthisproblem.From:Near-OptimalHashingAlgorithmsforApproximateNearestNeighborinHighDimensionsCACMJan-2008Basicidea•RandomizedAlgorithm(ApproximateAlgorithm)•LSHisbasedonthesimpleideathat,iftwopointsareclosetogether,thenaftera“projection”operationthesetwopointswillremainclosetogether.•Theprobabilityofcollisionofsimilarobjectsismuchhigherthanotherobjects.TheoryBasisInorderforalocality-sensitivehash(LSH)familytobeuseful,ithastosatisfyP1P2.Limitation&Solution•Limitation:–IfthegapbetweenP1andP2istoosmall,thenthealgorithmwillbeineffective.•Solution–AmplifythegapbetweenthehighprobabilityP1andlowprobabilityP2byconcatenatingseveralfunctions.ProofAlgorithmFramework•Hashfunction:g(p)=h1(p),h2(p),…,hk(p)•Preprocessing:–Selectg1…gL–Forallp∈P,hashptobucketsg1(p)…gL(p)•Query:–Retrievethepointsfrombucketsg1(q),g2(q),…,until•EitherthepointsfromallLbucketshavebeenretrieved,or•Totalnumberofpointsretrievedexceeds2L–Answerthequerybasedontheretrievedpoints–Totaltime:O(dL)ImplementationSchemaofLSH•VectorModel–Scalarproduct,cosine•StringModel–Hammingdistance,Editdistance•Black-boxModel–triangleinequality(byOracle,Unknown!)•SetModel–Sizeofintersection•Smallgraphs–Structure/labelsmatchingHamming•Definebyhammingdistance(Euclidean1-spacedistance)L2Distance•p-StableDistribution(Euclideann-spacedistance)LSHFamilies(1)Set(JaccardLSHfamily)•ThedistanceoftwosetsAandBismeasuredbyArccos•Forvectorsp,q,considerthedistancemeasurethatistheanglebetweenthetwovectorsLSHFamilies(2)Implementations•E2LSH(ExactEuclideanLSH)–as-stabledistributedLSHfamilyimplementationbyMIT–Link:~andoni/LSH•LSHKIT0.2.1:–AC++LSHlibrarywrittenbyWeiDongfromPrincetonCASS
本文标题:LSH
链接地址:https://www.777doc.com/doc-4050342 .html