PCFG Models of Linguistic Tree Representations

Mark Johnson*
Brown University

* Department of Cognitive and Linguistic Sciences, Box 1978, Providence, RI 02912
© 1998 Association for Computational Linguistics. Computational Linguistics, Volume 24, Number 4.

The kinds of tree representations used in a treebank corpus can have a dramatic effect on the performance of a parser based on the PCFG estimated from that corpus, causing the estimated likelihood of a tree to differ substantially from its frequency in the training corpus. This paper points out that the Penn II treebank representations are of the kind predicted to have such an effect, and describes a simple node relabeling transformation that improves a treebank PCFG-based parser's average precision and recall by around 8%, or approximately half of the performance difference between a simple PCFG model and the best broad-coverage parsers available today. This performance variation comes about because any PCFG, and hence the corpus of trees from which the PCFG is induced, embodies independence assumptions about the distribution of words and phrases. The particular independence assumptions implicit in a tree representation can be studied theoretically and investigated empirically by means of a tree transformation/detransformation process.

1. Introduction

Probabilistic context-free grammars (PCFGs) provide simple statistical models of natural languages. The relative frequency estimator provides a straightforward way of inducing these grammars from treebank corpora, and a broad-coverage parsing system can be obtained by using a parser to find a maximum-likelihood parse tree for the input string with respect to such a treebank grammar. PCFG parsing systems often perform as well as other simple broad-coverage parsing systems for predicting tree structure from part-of-speech (POS) tag sequences (Charniak 1996). While PCFG models do not perform as well as models that are sensitive to a wider range of dependencies (Collins 1996), their simplicity makes them straightforward to analyze both theoretically and empirically. Moreover, since more sophisticated systems can be viewed as refinements of the basic PCFG model (Charniak 1997), it seems reasonable to first attempt to better understand the properties of PCFG models themselves.

It is well known that natural language exhibits dependencies that context-free grammars (CFGs) cannot describe (Culy 1985; Shieber 1985). But the statistical independence assumptions embodied in a particular PCFG description of a particular natural language construction are in general much stronger than the requirement that the construction be generated by a CFG. We show below that the PCFG extension of what seems to be an adequate CFG description of PP attachment constructions performs no better than PCFG models estimated from non-CFG accounts of the same constructions.

More specifically, this paper studies the effect of varying the tree structure representation of PP modification from both a theoretical and an empirical point of view. It compares PCFG models induced from treebanks using several different tree representations, including the representation used in the Penn II treebank corpora (Marcus, Santorini, and Marcinkiewicz 1993) and the Chomsky adjunction representation now standardly assumed in generative linguistics.

One of the weaknesses of a PCFG model is that it is insensitive to nonlocal relationships between nodes. If these relationships are significant then a PCFG will be a poor language model. Indeed, the sense in which the set of trees generated by a CFG is context free is precisely that the label on a node completely characterizes the relationships between the subtree dominated by the node and the nodes that properly dominate this subtree.

Roughly speaking, the more nodes in the trees of the training corpus, the stronger the independence assumptions in the PCFG language model induced from those trees. For example, a PCFG induced from a corpus of completely flat trees (i.e., consisting of the root node immediately dominating a string of terminals) generates precisely the strings of the training corpus with likelihoods equal to their relative frequencies in that corpus. Thus the location and labeling of the nonroot nonterminal nodes determine how a PCFG induced from a treebank generalizes from that training data. Generally, one might expect that the fewer the nodes in the training corpus trees, the weaker the independence assumptions in the induced language model. For this reason, a flat tree representation of PP modification is investigated here as well.

A second method of relaxing the independence assumptions implicit in a PCFG is to encode more information in each node's label. Here the intuition is that the label on a node is a communication channel that conveys information between the subtree dominated by the node and the part of the tree not dominated by this node, so all other things being equal, appending to the node's label additional information about the context in which the node appears should make the independence assumptions implicit in the PCFG model weaker. The effect of adding a particularly simple kind of contextual information, the category of the node's parent, is also studied in this paper.

Whether either of these two PCFG models outperforms a PCFG induced from the original treebank is a separate question. We face a classical bias versus variance dilemma here (Geman, Bienenstock, and Doursat 1992): as the independence assumptions implicit in the PCFG model are weakened, the number of parameters that must be estimated (i.e., the number of productions) increases. Thus while moving to a class of models with weaker independence assumptions permits us to more accurately describe a wider class of distributions (i.e., it reduces the bias implicit in the estimator), in general our esti
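
The relative frequency estimator, the flat-tree observation, and the parent-labeling idea described in this introduction can be illustrated with a minimal Python sketch. It assumes trees are encoded as nested tuples of the form (label, child, ...) with POS tags as string leaves. The function names, the `^` separator, and the toy trees are illustrative assumptions and do not come from the paper, and the sketch makes no claim to reproduce the paper's experimental setup.

```python
from collections import Counter
from fractions import Fraction


def productions(tree):
    """Yield (lhs, rhs) for every nonterminal node; leaves are plain strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def estimate_pcfg(treebank):
    """Relative frequency estimator: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {prod: Fraction(n, lhs_totals[prod[0]]) for prod, n in counts.items()}


def tree_probability(tree, pcfg):
    """Probability of a tree: the product of its production probabilities."""
    prob = Fraction(1)
    for prod in productions(tree):
        prob *= pcfg.get(prod, Fraction(0))
    return prob


def annotate_parent(tree, parent=None):
    """Append the parent's category to the label of each nonroot nonterminal."""
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    new_children = [
        c if isinstance(c, str) else annotate_parent(c, label) for c in children
    ]
    return (new_label, *new_children)


def strip_parent(tree):
    """Detransformation: remove the parent annotation from every label."""
    label, *children = tree
    new_children = [c if isinstance(c, str) else strip_parent(c) for c in children]
    return (label.split("^")[0], *new_children)


# Flat trees: each tree's likelihood equals its relative frequency.
flat_treebank = [("S", "Det", "N", "V"), ("S", "Det", "N", "V"), ("S", "N", "V")]
flat_pcfg = estimate_pcfg(flat_treebank)
print(tree_probability(flat_treebank[0], flat_pcfg))  # 2/3

# Parent annotation separates subject NPs from object NPs (toy example).
tree = ("S", ("NP", "Pro"), ("VP", "V", ("NP", "Det", "N")))
annotated = annotate_parent(tree)
print(annotated)
```

In the toy tree, the plain PCFG assigns NP -> Pro a probability of 1/2 because subject and object NPs share one label, while the annotated PCFG keeps NP^S and NP^VP apart. The price is more productions to estimate, which is the bias versus variance trade-off discussed above.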