CS388: Natural Language Processing: N-Gram Language Models
Raymond J. Mooney, University of Texas at Austin

Language Models
• Formal grammars (e.g. regular, context-free) give a hard "binary" model of the legal sentences in a language.
• For NLP, a probabilistic model of a language that gives a probability that a string is a member of the language is more useful.
• To specify a correct probability distribution, the probability of all sentences in a language must sum to 1.

Uses of Language Models
• Speech recognition
– "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry".
• OCR & handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– "Their are problems wit this sentence."

Completion Prediction
• A language model also supports predicting the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it.

N-Gram Models
• Estimate the probability of each word given prior context.
– P(phone | Please turn off your cell)
• The number of parameters required grows exponentially with the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states, so an N-gram model is an (N−1)th-order Markov model.

N-Gram Model Formulas
• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability: $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$

Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
– Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
– N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.

Generative Model & MLE
• An N-gram model can be seen as a probabilistic automaton for generating sentences:
– Initialize the sentence with N−1 <s> symbols.
– Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.
• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T: $\hat{\theta} = \operatorname{argmax}_{\theta} P(T \mid M(\theta))$

Example from Textbook
• P(<s> i want english food </s>)
  = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
  = .25 × .33 × .0011 × .5 × .68 = .000031
• P(<s> i want chinese food </s>)
  = P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
  = .25 × .33 × .0065 × .52 × .68 = .00019
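The relative-frequency estimates, the bigram approximation, and the generative loop above are straightforward to make concrete in code. Below is a minimal Python sketch, assuming a tiny toy corpus (not the textbook's restaurant corpus, so the numbers differ from the example above) and ignoring unknown-word handling; the names train_bigram_mle, sentence_probability, and generate are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def train_bigram_mle(sentences):
    """Relative-frequency (MLE) bigram estimates: P(w2|w1) = C(w1 w2) / C(w1).
    Each sentence is a list of tokens; <s> and </s> are appended here."""
    context_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {(w1, w2): c / context_counts[w1] for (w1, w2), c in bigram_counts.items()}

def sentence_probability(sent, bigram_probs):
    """Bigram approximation: P(<s> w1 ... wn </s>) = prod_k P(w_k | w_{k-1})."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_probs.get((w1, w2), 0.0)  # unseen bigram -> 0 (this motivates smoothing)
    return prob

def generate(bigram_probs, max_len=20):
    """Treat the model as a probabilistic automaton: sample words until </s>."""
    sentence, prev = [], "<s>"
    while len(sentence) < max_len:
        candidates = [(w2, p) for (w1, w2), p in bigram_probs.items() if w1 == prev]
        words, weights = zip(*candidates)
        nxt = random.choices(words, weights=weights)[0]
        if nxt == "</s>":
            break
        sentence.append(nxt)
        prev = nxt
    return sentence

corpus = [["i", "want", "english", "food"], ["i", "want", "chinese", "food"]]
model = train_bigram_mle(corpus)
print(sentence_probability(["i", "want", "chinese", "food"], model))  # 0.5 on this toy corpus
print(generate(model))
```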
Train and Test Corpora
• A language model must be trained on a large corpus of text to estimate good parameter values.
• The model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be representative of the actual application data.
• May need to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.

Unknown Words
• How to handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
• Train a model that includes an explicit symbol for an unknown word (<UNK>).
– Choose a vocabulary in advance and replace other words in the training corpus with <UNK>, or
– Replace the first occurrence of each word in the training data with <UNK>.

Evaluation of Language Models
• Ideally, evaluate use of the model in the end application (extrinsic, in vivo).
– Realistic
– Expensive
• Evaluate on the ability to model a test corpus (intrinsic).
– Less realistic
– Cheaper
• Verify at least once that intrinsic evaluation correlates with an extrinsic one.

Perplexity
• Measure of how well a model "fits" the test data.
• Uses the probability that the model assigns to the test corpus, normalizes for the number of words in the test corpus, and takes the inverse:
  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).

Sample Perplexity Evaluation
• Models trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979-word vocabulary.
• Evaluated on a disjoint set of 1.5 million WSJ words:
  Model:       Unigram  Bigram  Trigram
  Perplexity:    962      170     109

Smoothing
• Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (a.k.a. sparse data).
• If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
• In practice, parameters are smoothed (a.k.a. regularized) to reassign some probability mass to unseen events.
– Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.

Laplace (Add-One) Smoothing
• "Hallucinate" additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly:
– Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
– N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
– where V is the total number of possible (N−1)-grams (i.e. the vocabulary size for a bigram model).
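Add-one smoothing and perplexity fit together naturally in code. Below is a minimal sketch, assuming bigram and context counts collected with <s>/</s> padding as in the earlier example, that OOV test words have already been mapped to <UNK>, and that the product of probabilities is accumulated in log space to avoid underflow; the names count_bigrams, laplace_bigram_prob, and perplexity are illustrative.

```python
import math
from collections import defaultdict

def count_bigrams(sentences):
    """Collect context (unigram) and bigram counts with <s>/</s> padding."""
    context_counts, bigram_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return context_counts, bigram_counts

def laplace_bigram_prob(w1, w2, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: P(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + vocab_size)

def perplexity(test_sentences, context_counts, bigram_counts, vocab_size):
    """PP(W) = P(w_1 ... w_N)^(-1/N), accumulated in log space."""
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            log_prob += math.log(
                laplace_bigram_prob(w1, w2, context_counts, bigram_counts, vocab_size))
            n += 1
    return math.exp(-log_prob / n)

train = [["i", "want", "english", "food"], ["i", "want", "chinese", "food"]]
context_counts, bigram_counts = count_bigrams(train)
vocab = {w for sent in train for w in sent} | {"</s>"}
test = [["i", "want", "english", "food"]]
print(perplexity(test, context_counts, bigram_counts, len(vocab)))
```

Because the counts are stored in defaultdicts, unseen bigrams and contexts simply contribute a count of zero, and the add-one numerator and V in the denominator keep every estimate strictly positive, so the test perplexity stays finite even for word sequences never seen in training.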