CS388: Natural Language Processing: N-Gram Language Models
Raymond J. Mooney, University of Texas at Austin

Language Models
• Formal grammars (e.g. regular, context-free) give a hard "binary" model of the legal sentences in a language.
• For NLP, a probabilistic model of a language that gives a probability that a string is a member of the language is more useful.
• To specify a correct probability distribution, the probability of all sentences in a language must sum to 1.

Uses of Language Models
• Speech recognition
– "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry".
• OCR & handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– "Their are problems wit this sentence."

Completion Prediction
• A language model also supports predicting the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it.

N-Gram Models
• Estimate the probability of each word given prior context.
– P(phone | Please turn off your cell)
• The number of parameters required grows exponentially with the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states, so an N-gram model is an (N−1)th-order Markov model.

N-Gram Model Formulas
• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability: $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$

Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
– Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
– N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.

Generative Model & MLE
• An N-gram model can be seen as a probabilistic automaton for generating sentences:
– Initialize the sentence with N−1 <s> symbols.
– Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.
• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T: $\hat{\theta} = \operatorname{argmax}_{\theta} P(T \mid M(\theta))$

Example from Textbook
• P(<s> i want english food </s>)
  = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
  = .25 × .33 × .0011 × .5 × .68 = .000031
• P(<s> i want chinese food </s>)
  = P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
  = .25 × .33 × .0065 × .52 × .68 = .00019
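The relative-frequency estimates, the bigram approximation, and the generative loop above are straightforward to make concrete in code. Below is a minimal Python sketch, assuming a tiny toy corpus (not the textbook's restaurant corpus, so the numbers differ from the example above) and ignoring unknown-word handling; the names train_bigram_mle, sentence_probability, and generate are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def train_bigram_mle(sentences):
    """Relative-frequency (MLE) bigram estimates: P(w2|w1) = C(w1 w2) / C(w1).
    Each sentence is a list of tokens; <s> and </s> are appended here."""
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {(w1, w2): c / context_counts[w1] for (w1, w2), c in bigram_counts.items()}

def sentence_probability(sent, bigram_probs):
    """Bigram approximation: P(<s> w1 ... wn </s>) = prod_k P(w_k | w_{k-1})."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_probs.get((w1, w2), 0.0)  # unseen bigram -> 0 (this motivates smoothing)
    return prob

def generate(bigram_probs, max_len=20):
    """Treat the model as a probabilistic automaton: sample words until </s>."""
    sentence, prev = [], "<s>"
    while len(sentence) < max_len:
        candidates = [(w2, p) for (w1, w2), p in bigram_probs.items() if w1 == prev]
        words, weights = zip(*candidates)
        nxt = random.choices(words, weights=weights)[0]
        if nxt == "</s>":
            break
        sentence.append(nxt)
        prev = nxt
    return sentence

corpus = [["i", "want", "english", "food"], ["i", "want", "chinese", "food"]]
model = train_bigram_mle(corpus)
print(sentence_probability(["i", "want", "chinese", "food"], model))  # 0.5 on this toy corpus
print(generate(model))
```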
Train and Test Corpora
• A language model must be trained on a large corpus of text to estimate good parameter values.
• The model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be representative of the actual application data.
• May need to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.

Unknown Words
• How to handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
• Train a model that includes an explicit symbol for an unknown word (<UNK>).
– Choose a vocabulary in advance and replace other words in the training corpus with <UNK>, or
– Replace the first occurrence of each word in the training data with <UNK>.

Evaluation of Language Models
• Ideally, evaluate use of the model in the end application (extrinsic, in vivo).
– Realistic
– Expensive
• Evaluate on the ability to model a test corpus (intrinsic).
– Less realistic
– Cheaper
• Verify at least once that intrinsic evaluation correlates with an extrinsic one.

Perplexity
• Measure of how well a model "fits" the test data.
• Uses the probability that the model assigns to the test corpus, normalizes for the number of words in the test corpus, and takes the inverse:
  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).

Sample Perplexity Evaluation
• Models trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979-word vocabulary.
• Evaluated on a disjoint set of 1.5 million WSJ words:
  Model:       Unigram  Bigram  Trigram
  Perplexity:    962      170     109

Smoothing
• Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (a.k.a. sparse data).
• If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
• In practice, parameters are smoothed (a.k.a. regularized) to reassign some probability mass to unseen events.
– Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.

Laplace (Add-One) Smoothing
• "Hallucinate" additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly:
– Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
– N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
– where V is the total number of possible (N−1)-grams (i.e. the vocabulary size for a bigram model).
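Add-one smoothing and perplexity fit together naturally in code. Below is a minimal sketch, assuming bigram and context counts collected with <s>/</s> padding as in the earlier example, that OOV test words have already been mapped to <UNK>, and that the product of probabilities is accumulated in log space to avoid underflow; the names count_bigrams, laplace_bigram_prob, and perplexity are illustrative.

```python
import math
from collections import defaultdict

def count_bigrams(sentences):
    """Collect context (unigram) and bigram counts with <s>/</s> padding."""
    context_counts, bigram_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return context_counts, bigram_counts

def laplace_bigram_prob(w1, w2, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: P(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + vocab_size)

def perplexity(test_sentences, context_counts, bigram_counts, vocab_size):
    """PP(W) = P(w_1 ... w_N)^(-1/N), accumulated in log space."""
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            log_prob += math.log(
                laplace_bigram_prob(w1, w2, context_counts, bigram_counts, vocab_size))
            n += 1
    return math.exp(-log_prob / n)

train = [["i", "want", "english", "food"], ["i", "want", "chinese", "food"]]
context_counts, bigram_counts = count_bigrams(train)
vocab = {w for sent in train for w in sent} | {"</s>"}
test = [["i", "want", "english", "food"]]
print(perplexity(test, context_counts, bigram_counts, len(vocab)))
```

Because the counts are stored in defaultdicts, unseen bigrams and contexts simply contribute a count of zero, and the add-one numerator and V in the denominator keep every estimate strictly positive, so the test perplexity stays finite even for word sequences never seen in training.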