Software, 2014, 35(11)
cosoft@163.com

Duplicate Detection for Chinese Texts Based on Semantic Fingerprint and LCS

CHEN Lu, WU Guo-shi, LI Jing
(School of Software, Beijing University of Posts and Telecommunications, Beijing 100083, China)

Abstract: Traditional duplicate detection algorithms for Chinese content often suffer from low accuracy. To address this issue, this paper proposes a novel method based on semantic fingerprints and LCS. From the pre-processed text, the method first extracts a summary of the article, then applies the tf-idf algorithm to obtain one feature vector for the content and one for the summary. Taking the two vectors as input, it computes fingerprints of both the content and the summary with the simhash method. For a pair of texts, the Hamming distance between the corresponding fingerprints is computed separately for content and summary, and the two distances are substituted into the formula proposed in this paper to obtain the fingerprint similarity of the two texts. The fingerprint serves as a preliminary filter, and the LCS algorithm then confirms the similarity. With this two-level selection, the method avoids false matches and achieves better accuracy. The method is evaluated by comparing its results with those of widely used algorithms such as LCS and simhash. Experiments show that it improves accuracy and also runs faster, so it performs better on large-scale data.

Keywords: theoretical computer science; semantic fingerprint; simhash; LCS; duplicate detection

The introduction reviews LCS [1], SVM [2,3,4], Broder's shingling [5] and Charikar's simhash [6], and compares LCS, simhash and shingling by precision and recall rate.

Tab. 1 Comparison of the effects of commonly used algorithms

Algorithm    /%    /%    Time/ms
LCS          86    73    4116775
Simhash      70    43    5885
Shingling    70    67    3849524

Fig. 1 The framework of the duplicate detection method

The summary is extracted as in [7], and the feature vectors are built with tf-idf [8]. Simhash turns each feature vector into a fingerprint. Let hCon be the Hamming distance between the content fingerprints of two texts and hSum the Hamming distance between their summary fingerprints. The two distances are combined with weights δ and β:

$$F = \delta \cdot hCon + \beta \cdot hSum \quad (1)$$

Fig. 2 The change of the effect of the fingerprint algorithm with the weight

With the weights chosen from the experiment in Fig. 2 (0.67 and 0.33), formula (1) becomes:

$$F = 0.67 \cdot hCon + 0.33 \cdot hSum \quad (2)$$

Fig. 3 Duplicate removal steps after joining LCS

Fig. 3 lists ten steps. For an incoming NewArticle, the fingerprints simContent and simSummary are computed. For each stored article, the distances hCon and hSum and the combined value F are obtained, and F is compared with a threshold T. For candidates that pass, the LCS similarity similarLCS between NewArticle and article is computed and compared with the LCS threshold, and the duplicates found are recorded in similarList.

The experiments compare five methods: the proposed method ("TwoSimhash&LCS"), "simhash", "TwoSimhash", simhash+LCS, and LCS alone. Word segmentation uses NLPIR/ICTCLAS [9,10], which reports a segmentation accuracy of 97.58% (973 expert evaluation), unknown-word recall above 90%, close to 98% recall for Chinese personal names, and a processing speed of 31.5 KB/s.

Experiment environment: CPU Intel(R) i5 3210, 2.50 GHz; 4 GB memory; Windows 8.1 64-bit; Java, MyEclipse 10.0.

Tab. 2 Comparison of several algorithms' time cost (ms)

Columns: Simhash, TwoSimhash, Simhash+lcs, TwoSimhash&LCS, LCS
(2174)  25152326534079312796942540
(1991)  18551121514554106804926109

Fig. 4 Comparison of several algorithms' time cost

Tab. 3 The precision and recall rate

Algorithm      Detected  Correct  Wrong  Recall/%  Precision/%
LCS            39        28       11     90.0      71.8
simhash        58        24       19     80.0      41.4
Simhash+lcs    42        25       17     83.3      59.5
Twosimhash     45        24       21     80.0      53.3
twoSim+Lcs     38        28       10     93.3      73.6

Fig. 5 The precision and recall rate of several algorithms

References

[1] E. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2): 251–266, 1986.
[2] Zhang Kuo, Xu Hui, Tang Jie, et al. Keyword extraction using support vector machine [C]// Proceedings of the 7th International Conference on Web-Age Information Management, Hong Kong, China, 2006: 85–96.
[3] Xiao xiao, Xu Qihua. A Comparative Research on the Classification and Regression Based on SVM and BP [J]. The Journal of New Industrialization, 2014, 4(5): 48–53.
[4] Yuan Ailing, Qi Wei, Qian Xu. Text Classification with a SVM based on Manifold Regularization [J]. Software, 2013, 34(2): 65–68.
[5] A. Z. Broder, S. C. Glassman, M. S. Manasse, et al. Syntactic clustering of the web. Computer Networks, 29(8–13): 1157–1166, 1997.
[6] Luhn H. P. The Automatic Creation of Literature Abstracts [J]. IBM Journal of Research and Development, 1958, 2(2): 159.
[7] Jiang Changjin, Peng Hong, Chen Jianchao, et al. Automatic Text Summarization Based on Thematic Word Weight and Sentence Features [J]. Journal of South China University of Technology (Natural Science Edition), 2010, 38(7): 50–55.
[8] Qian Aibing, Jiang Lan. Improved TF-IDF-based Keyword Extraction for Chinese Web Page: A Case Study of Web News [J]. Information Studies: Theory & Application, 2008, 31(6): 945–950.
[9] Li Gang, Mao Jin, Chen Jing-hao. Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint [J]. Modern Library and Information Technology, 2013(9): 41–47.
[10] Li Jia. A Tentative Study on Chinese Segmentation Algorithm [J]. Software, 2013, 34(7): 75–76.
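The two-level selection described in the abstract (simhash fingerprints of content and summary as a preliminary filter, then LCS as confirmation) can be sketched roughly as follows. This is a minimal illustration, not the authors' Java implementation. The 64-bit fingerprint width, the MD5-based term hash, the LCS normalization by the longer text, and both threshold values are assumptions made for the example. The feature vectors are passed in as precomputed `{term: weight}` dictionaries, so segmentation, summarization and tf-idf are out of scope here.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumption: fingerprint width chosen for this sketch


def simhash(weights, bits=FINGERPRINT_BITS):
    """Simhash of a {term: weight} feature vector (bits must be <= 64)."""
    totals = [0.0] * bits
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the term
        for i in range(bits):
            # Add the weight where the term's hash bit is 1, subtract otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    # Bit i of the fingerprint is 1 when the accumulated total is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def combined_distance(h_con, h_sum, delta=0.67, beta=0.33):
    """Weighted sum of the two Hamming distances, as in formula (2)."""
    return delta * h_con + beta * h_sum


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_duplicate(new_text, new_content_w, new_summary_w,
                 old_text, old_content_w, old_summary_w,
                 f_threshold=10.0, lcs_threshold=0.8):
    """Two-level check; both thresholds are hypothetical placeholder values."""
    h_con = hamming(simhash(new_content_w), simhash(old_content_w))
    h_sum = hamming(simhash(new_summary_w), simhash(old_summary_w))
    if combined_distance(h_con, h_sum) > f_threshold:
        return False  # rejected by the fingerprint filter, LCS is never run
    return lcs_similarity(new_text, old_text) >= lcs_threshold
```

Running the quadratic LCS step only on pairs that survive the cheap fingerprint filter is what keeps the combined method much faster than LCS alone while still letting LCS veto fingerprint collisions.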
Title: A Text Deduplication Method Based on Semantic Fingerprint and LCS - Chen Lu
Link: https://www.777doc.com/doc-4415375 .html