Cross Language Information Retrieval

Road Map
- Cross-lingual IR motivation
- Definition
- General issues with CLIR
- Basic approaches to CLIR
- CLIR evaluation
- CLIR applications

2019/8/2

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
(Figure labels: documents, query)

Growth in Internet language use by continent, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture
World Internet Users and 2015 Population Stats

World Regions      Population     Internet Users  Penetration     Users         Growth
                                                  (% population)  (% of table)  2000-2015
Africa             1,158,355,663    313,257,074   27.0%             9.6%        6,839%
Asia               4,032,466,882  1,563,208,143   38.8%            47.8%        1,268%
Europe               821,555,904    604,122,380   73.5%            18.5%          475%
Middle East          236,137,235    115,823,882   49.0%             3.5%        3,426%
North America        357,172,209    313,862,863   87.9%             9.6%          191%
Latin America        617,776,105    333,115,908   53.9%            10.2%        1,743%
Oceania/Australia     37,157,120     27,100,334   72.9%             0.8%          256%
World Total        7,260,621,118  3,270,490,584   45%             100%            806%

Usage of content languages for websites

2002                 2015
English      72%     English      54.5%
German        7%     Russian       5.9%
Japanese      6%     German        5.7%
Spanish       3%     Japanese      5.0%
French        3%     Spanish       4.7%
Italian       2%     French        4.1%
Dutch         2%     Portuguese    2.6%
Chinese       2%     Chinese       2.2%
Korean        1%     Italian       2.1%
Russian       1%     Polish        1.9%
Portuguese    1%     Turkish       1.6%

Source: ://www.oclc.org/research/activities/wcp/stats/intnl.html

Cross Language IR

Motivation
- Information is unavailable in some languages
- Language barrier

Definition
Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).

Example: a user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- The goal is to find, retrieve and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  (1) Cognate matching
- Translation
  (2) Query translation
  (3) Document translation
  (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related linguistically (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.

Query translation
(Diagram labels: search engine, translation system, French query, French documents, results, Chinese query, select, browse, French document collection)
Process:
- Translate the Chinese query into French
- Search the French document collection
- Translate the retrieved results into Chinese

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is less computationally costly than translating a large set of documents.
- Challenge: term ambiguity. "Queries are often short and short queries provide little context for disambiguation."
- Term disambiguation will be discussed later.

Query translation: pros and cons
Pros
- Simple and easy to operate
- Flexible
- Saves time and space; efficient
Cons
- Lacks context
- For short queries, translation ambiguity is high

Document translation
(Diagram labels: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select, browse)
Process:
- Translate the entire French document collection into Chinese documents
- Search the Chinese documents directly

Document translation
- Document translation has the opposite advantages and disadvantages to query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, because the more varied context available within each document can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperforms query translation in a CLIR experiment from German to English.

Document translation: pros and cons
Pros
- Each document is translated only once
- Documents provide richer context
- Documents can be translated offline in advance
Cons
- Translation is slow
- Takes up a large amount of space and time; inefficient
- Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language resources available
- Query translation is generally more widely used
- Both approaches pose an "interactivity" challenge

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
- One type of interlingual approach uses the "synsets" provided in WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques
- Dictionary-based methods
- Parallel corpora-based method
- Use of WWW resources

Dictionary-based methods
- Using a bilingual Machine Readable Dictionary (MRD).
- Most retrieval systems are still based on so-called "bag-of-words" architecture
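
The dictionary-based query translation described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the method of any particular system: the toy English-to-French dictionary `TOY_MRD`, the three-document collection `DOCS`, and the helper names `fold`, `build_index`, `translate_query` and `search` are all hypothetical. Each query term is replaced by every translation listed in the dictionary, terms missing from the dictionary are passed through unchanged (the naive cognate-matching fallback), and documents are ranked by a simple bag-of-words overlap count.

```python
import unicodedata
from collections import defaultdict

# Hypothetical toy English->French dictionary; a real MRD is far larger.
TOY_MRD = {
    "bank": ["banque", "rive"],      # ambiguous: financial bank vs. river bank
    "river": ["rivière", "fleuve"],
    "loan": ["prêt"],
}

# Hypothetical French document collection (doc id -> text).
DOCS = {
    "d1": "la banque accorde un prêt",
    "d2": "la rive de la rivière",
    "d3": "une génération de chercheurs",
}


def fold(term):
    """Lowercase a term and strip accents so near-identical spellings match."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs):
    """Build a bag-of-words inverted index: folded term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[fold(term)].add(doc_id)
    return index


def translate_query(query, dictionary):
    """Replace each query term by all of its dictionary translations.

    Terms that are not in the dictionary are kept unchanged, which is the
    naive cognate-matching fallback.
    """
    translated = []
    for term in query.lower().split():
        for candidate in dictionary.get(term, [term]):
            translated.append(fold(candidate))
    return translated


def search(query, dictionary, index):
    """Rank documents by how many translated query terms they contain."""
    scores = defaultdict(int)
    for term in translate_query(query, dictionary):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(translate_query("bank loan", TOY_MRD))
    print(search("bank loan", TOY_MRD, INDEX))
    print(search("generation", TOY_MRD, INDEX))
```

In this toy setup the query "bank loan" also retrieves `d2`, because "bank" expands to both "banque" and "rive". That is the term-ambiguity problem of short queries in miniature. The untranslated query "generation" still reaches `d3` only because accent folding makes the English and French spellings identical, which is an assumption of this sketch rather than something a dictionary provides.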
Document title: Cross-Language Information Retrieval Techniques