Data capture and corpus markup
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com

Outline of the session
• Lecture
  – Data capture
  – Some e-text archives
  – Copyright in corpus creation
  – Corpus markup
• Lab
  – MLCT
  – WST WebGetter
  – Some transcribing tools

Data to be collected
• Like other decisions in corpus creation (e.g. balance, representativeness, size), the kind of data to be collected also depends on your research questions
  – If you wish to compare British English and American English, you will need to collect spoken and/or written data produced by native speakers of the two regional varieties of English
  – If you are interested in how Chinese speakers acquire English as a second language, you will then need to collect the English data produced by Chinese learners to create a learner corpus
  – If you are interested in how the English language has evolved over centuries, you will need to collect samples of English produced in different historical periods to build a historical or diachronic corpus

Data capture
• Having developed an understanding of the type of data you need to collect, and having made sure that no ready-made corpus of such material exists, you’ll need to capture the data
• Data digitalisation
  – Machine-readability is a de facto feature of a modern corpus

Data capture (continued)
• Text must be rendered machine-readable
  – Keyboarding
  – OCR (Optical Character Recognition) scanning
  – Transcribing audio/video recordings
• Existing electronic data is preferred over paper-based materials
  – The Web as an important source of machine-readable data for many languages
  – Converting other file formats such as HTML, Word and PDF into plain text format
• The World Wide Web is an important source of electronic text archives

Some useful data sources
• Oxford Text Archive
  – Oldest text archive: thousands of texts (and many well-known corpora) in more than 25 different languages
• Project Gutenberg
  – First producer of free electronic books
  – 28,000 e-books!
• Digital collections of university libraries
• Corpus4u electronic text archives
  – =21

Copyright in corpus creation
• A corpus consisting entirely of copyright-free old texts is not useful in the study of contemporary language
• Copyright is a major issue in data collection if you are to publish or make your corpus publicly available
• The samples taken under the convention of ‘fair dealing’ in copyright law are so small as to jeopardize any claim of balance or representativeness
• There is as yet no satisfactory solution to the issue of copyright in corpus creation

Copyright in corpus creation (continued)
• Tips for copyright issues
  – Usually easier to obtain permission for samples than for full texts
  – Easier for smaller samples than for larger ones
  – If you show that you are acting in good faith, and only small samples will be used in non-profit-making research, copyright holders are typically pleased to grant you permission
  – You don’t need to worry about copyright if you build a corpus for your private use!

Corpus markup
• System of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing and other processes
  – Describing the document (“metadata” like source, name, author, date, etc.)
  – Marking boundaries for paragraphs, sentences and words, omissions, etc.
  – Displaying markup (font, font size, positioning)

Example of markup
[Figure: sample markup with the start tag and end tag labelled]

Why markup?
• Markup recovers the contextual information that is lost when sampled texts are taken out of context
• Markup allows for a broader range of research questions to be addressed by providing extra information such as text types, sociolinguistic variables and structural organization
• Markup allows corpus builders to insert editorial comments during the corpus building process
• Pre-processing written texts, and particularly transcribing spoken data, also involves markup (e.g. pauses, paralinguistic features, etc.)

Markup schemes
• The extra markup information must be kept separate from the textual data in a corpus
• Markup schemes
  – COCOA
  – TEI (Text Encoding Initiative)
  – CES (Corpus Encoding Standard)

COCOA reference
• One of the earliest markup schemes
• Consisting of a set of attribute names and values enclosed in angle brackets
  – e.g. <A WILLIAM SHAKESPEARE>
    • attribute name = A (author)
    • attribute value = WILLIAM SHAKESPEARE
• Only encoding a limited set of features such as authors, titles and dates
• Giving way to more modern schemes

TEI guidelines
• Sponsored by three major academic associations concerned with humanities computing
  – The Association for Computational Linguistics (ACL)
  – The Association for Literary and Linguistic Computing (ALLC)
  – The Association for Computers and the Humanities (ACH)
• Aiming to facilitate data exchange by standardizing the markup or encoding of information stored in electronic form

TEI guidelines (continued)
• Each individual text is a document consisting of a header and a body, which are in turn composed of different elements
• TEI corpus header has 4 principal elements
  – A file description (fileDesc): a full bibliographic description
  – An encoding description (encodingDesc): relationship between an electronic text and its source or sources
  – A text profile (profileDesc): a detailed description of non-bibliographic aspects of a text
  – A revision history (revisionDesc): a record of changes to a file
• Only fileDesc is required to be TEI-compliant
  – The other three elements are optional
• Tags can be nested, i.e. an element can appear inside another element

The BNC header
[Figure: the BNC header]

TEI guidelines (continued)
• Markup languages adopted by the TEI
  – SGML (Standard Generalized Markup Language)
  – XML (eXtensible Markup Language)
• Current version of TEI: P5 guidelines
• See the TEI of
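
Example: from a COCOA reference to a TEI-style header
The slides describe COCOA references (an attribute name and a value in angle brackets) and a TEI header in which only fileDesc is required. The Python sketch below is one possible illustration of both ideas, not part of the original lecture. It assumes that a COCOA reference has the shape <NAME VALUE>, and it uses a hypothetical attribute T for the title (only A for author appears in the slides). The placeholder wording in publicationStmt and sourceDesc is also invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a COCOA reference: <NAME VALUE>, e.g. <A WILLIAM SHAKESPEARE>.
COCOA_RE = re.compile(r"<(\w+) ([^<>]*)>")


def parse_cocoa(text):
    """Return {attribute name: attribute value} for each COCOA reference found."""
    return {name: value.strip() for name, value in COCOA_RE.findall(text)}


def build_tei_header(title, author=None):
    """Build a header that contains only the required fileDesc element.

    encodingDesc, profileDesc and revisionDesc are optional and left out here.
    """
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # Nested tags: title and author sit inside titleStmt, which sits inside fileDesc.
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    if author:
        ET.SubElement(title_stmt, "author").text = author

    # Placeholder statements (hypothetical wording, for illustration only).
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Unpublished sample for private research use"
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = "Converted from a COCOA-encoded text"
    return header


# "T" as a title attribute is an assumption made for this example.
sample = "<A WILLIAM SHAKESPEARE> <T HAMLET>"
refs = parse_cocoa(sample)
header = build_tei_header(title=refs.get("T", "Untitled"), author=refs.get("A"))
print(ET.tostring(header, encoding="unicode"))
```

Keeping the parsed references in a plain dictionary keeps the markup information separate from the textual data, which is the separation the slides ask for.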
Document title: Zhejiang University, Xiao Zhonghua, Corpus session 3 (Foreign Language Learning)