Bilingual Parallel Corpora and Language Engineering

Harold Somers
Department of Language Engineering, UMIST, Manchester, England
harold@ccl.umist.ac.uk

1. Introduction

The use of corpora has become an important issue in Language Engineering (LE). In this paper we will be considering a specific type of corpus, the bilingual parallel corpus. By “parallel corpus”, we mean a text which is available in two (or more) languages: it may be an original text and its translation, or it may be a text which has been written by a consortium of authors in a variety of languages, and then published in various language versions. A corpus of this type of text is sometimes called a “comparable corpus”, though this term is also used (confusingly) for a corpus of similar but not necessarily equivalent texts. Another term sometimes found is “bitext”, due to Brian Harris (1988).

Parallel corpora are a valuable source of a kind of linguistic meta-knowledge, which forms the basis of techniques such as tokenization, POS-tagging, morphological and syntactic analysis, which in turn can be used to develop LE applications. This paper focuses on problems (and solutions) related to the extraction of linguistic meta-knowledge from parallel corpora.

2. “First, catch your corpus”

The first requirement for knowledge extraction from bilingual corpora is, rather obviously, a parallel corpus. Fully annotated aligned multilingual parallel corpora in a number of languages are becoming increasingly widely available through various coordinated international efforts. A visit to any of a number of websites devoted to corpora in general and bilingual corpora in particular reveals a long list of such collections. The W3C website at Essex University (cl) is a good starting point.

Nevertheless, even though the number of collections is ever increasing, the number of different languages featured is still rather small. Also, some of the collections are relatively unfocused in terms of subject matter. In either case there may be a problem of coverage for a particular need. In this case, you might need to attempt to locate and analyse your corpus from scratch. So we begin by considering some ways of automatically locating parallel texts, and some issues involved in retrieving and storing such data.

2.1. Locating parallel corpora automatically

Although English is overwhelmingly the lingua franca of the World Wide Web, a great number of websites have parallel material in several languages. These evidently provide an instant source of parallel texts, if they can be located and successfully aligned.

Interesting work on automatically identifying and locating parallel corpora has been initiated by Resnik (1998, 1999). The idea is first of all to find likely candidate pairs of texts using such “tricks” as searching for sites which seem to have parallel “anchors” (see below), often accompanied by images of flags, or pairs of filenames which differ only in the identification of a language, e.g. with alternative directories in the paths, or suffixes such as .en and .fr. These candidates are then evaluated by comparing, in a very simplistic manner, their content: since they are usually HTML documents, it is usually quite easy to align the HTML mark-up (heading and subheading identifiers, for example), and to compare the amount of text between each anchor. In this way, we get a rough map of the structures of the two documents. These can then be compared using a variety of more or less sophisticated techniques which may or may not include the kinds of linguistic methods used in the alignment of known parallel texts (see next section).

Flexibility in mark-up conventions can undermine this technique, however. For example, Figure 1 shows parallel English and French pages (written by the current author) with minor differences in mark-up and content.

Figure 1. HTML versions of parallel web pages. Notice differences in capitalization in the tags, order of elements in the BODY tag, and textual differences, e.g. an additional LI item in the French version.

English version:

    <HTML><HEAD><TITLE>ATLAS Symposium</TITLE></HEAD>
    <BODY bgcolor=ffffff text=115511 LINK=004080 vLINK=0040800>
    <center><img src=”...” alt=logo height=145 width=184>
    <h1>Arabic Translation and Localisation Symposium
    <p>Symposium sur la Traduction et la Localisation en Arabe
    <br><img src=arabatlas.gif alt=arabic></h1>
    ...
    </center>
    <p>It is one of the five official languages of the United Nations, it has 260 million native speakers, and is used as a second language by a further 1.3 billion people.
    ...
    <center>
    <li>Arabic corpus processing
    <li>Development of Arabic resources
    <li>Web tools for Arabic

French version:

    <HTML><HEAD><TITLE>Symposium ATLAS</TITLE></HEAD>
    <BODY TEXT=#115511 BGCOLOR=#FFFFFF LINK=#004080 VLINK=#048000>
    <CENTER><IMG SRC=”... ALT=logo HEIGHT=145 WIDTH=184>
    <H1>Symposium sur la Traduction et la Localisation en Arabe
    <P>Arabic Translation and Localisation Symposium
    <BR><IMG SRC=arabatlas.gif></h1>
    ...
    </CENTER>
    <p>L'une des cinq langues officielles de l'ONU est l'<B>arabe</B>, la langue maternelle de 260 millions de locuteurs, qu'utilisent environ 1.3 milliards de musulmans comme deuxième langue.
    ...
    <CENTER>
    <LI>les standards de codage des caractères arabes</LI>
    <LI>le traitement des corpus en arabe</LI>
    <LI>le développement des ressources pour l'arabe</LI>
    <LI>les outils Internet pour l'arabe</LI>

2.2. Storage and encoding

Having located a suitable parallel corpus, there remain a number of aspects to consider before the process of linguistic knowledge extraction can begin. One, which should not be ignored, is the issue of determining the legal position with respect to the text: even though the
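
As a rough illustration of the two heuristics described in Section 2.1, the following Python sketch first groups URLs that differ only in a language identifier, and then builds a crude "structure map" of each page (the sequence of tags, plus the amount of text following each tag) so that two candidates can be compared. It is a minimal sketch under stated assumptions and not a reconstruction of Resnik's actual procedure: the function names, the two-entry LANG_CODES list, and the use of difflib's similarity ratio as the comparison measure are all chosen here for illustration only.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical list of language identifiers; a real crawler would need many more.
LANG_CODES = ("en", "fr")


def language_neutral(url):
    """Replace a stand-alone language code (e.g. /en/ or .fr) with a placeholder."""
    pattern = r"(?<![A-Za-z])(" + "|".join(LANG_CODES) + r")(?![A-Za-z])"
    return re.sub(pattern, "*", url)


def candidate_pairs(urls):
    """Return pairs of URLs that differ only in the language identifier."""
    groups = {}
    for url in urls:
        groups.setdefault(language_neutral(url), []).append(url)
    return [tuple(sorted(group)) for group in groups.values() if len(group) == 2]


class StructureMap(HTMLParser):
    """Record each tag together with the amount of text that follows it."""

    def __init__(self):
        super().__init__()
        self.entries = []  # items of the form [tag, characters of text after it]

    def handle_starttag(self, tag, attrs):
        # HTMLParser delivers tag names in lower case; attributes are ignored,
        # so differences in capitalization or attribute order do not matter.
        self.entries.append([tag, 0])

    def handle_endtag(self, tag):
        self.entries.append(["/" + tag, 0])

    def handle_data(self, data):
        if self.entries:
            self.entries[-1][1] += len(data.strip())


def structure_map(html):
    parser = StructureMap()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.entries


def structural_similarity(html_a, html_b):
    """Similarity (0 to 1) of the two tag sequences, ignoring the text itself."""
    tags_a = [tag for tag, _ in structure_map(html_a)]
    tags_b = [tag for tag, _ in structure_map(html_b)]
    return SequenceMatcher(None, tags_a, tags_b).ratio()
```

Normalizing tag case and discarding attributes removes two of the mark-up differences visible in Figure 1. Omitted end tags and extra list items still lower the score, so any acceptance threshold would have to be tuned on real data. The text lengths recorded by structure_map could additionally be compared for tags that align, in the spirit of comparing the amount of text between anchors.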