您好,欢迎访问三七文档
当前位置:首页 > 商业/管理/HR > 信息化管理 > n-gram和数据平滑
n-gramchbb@pku.edu.cn(LanguageModeling)sP(s)s1=s2=P(s1)P(s2)P(s)PLPPLOCR1)(=∑∈LssPP(s)P(s)=f(θ1,θ2,…,θn)θis=w1w2…wnP(s)?(Chainrule)P(“JOHNREADABOOK”)=p(“JOHN”)p(“READ|JOHN”)p(“A|JOHNREAD”)p(“BOOK|JOHNREADA”))()()()()(11213121−⋅⋅⋅=llw...w|wpww|wpw|wpwpsP∏=−=liiiw...w|wp111)(“”Johnreada______bookorvegetable?n-1“thelargegreen______.”“mountain”?“tree”?“Sueswallowedthelargegreen______.”“pill”?“broccoli”?“Sueswallowed”n-gramp(wi|w1w2…wi-1)n-1p(wi|wi-n+1…wi-1)n-gramn-gramnnn-gramunigram(n=1)p(wi)2000020000bigram(n=2)p(wi|wi-1)20000200002trigram(n=3)p(wi|wi-2wi-1)20000200003four-gram(n=4)(digramquadrigram)……n-gram:tokenizationBOSEOSIeat.ÆBOSIeat.EOSIsleep.ÆBOSIsleep.EOS(MLE)c(w1,..,wn)n-gramw1,..,wn)()()(11111n-nn-nMLE,..,wwc,..,wwc,..,w|wwp=BOSJOHNREADMOBYDICKEOSBOSMARYREADADIFFERENTBOOKEOSBOSSHEREADABOOKBYCHEREOS31==)BOS(c)JOHNBOS(c)BOS|JOHN(p11==)JOHN(c)READJOHN(c)JOHN|READ(p32==)READ(c)AREAD(c)READ|A(p21==)A(c)BOOKA(c)A|BOOK(p21==)BOOK(c)EOSBOOK(c)BOOK|EOS(pJOHNREADABOOKP(JOHNREADABOOK)=p(JOHN|BOS)p(READ|JOHN)p(A|READ)p(BOOK|A)p(EOS|BOOK)=0.06bigram(underflow)Ln(P(JOHNREADABOOK))=Ln(p(JOHN|BOS))+Ln(p(READ|JOHN))+Ln(p(A|READ))+Ln(p(BOOK|A))+Ln(p(EOS|BOOK))=Ln(1/3)+Ln(1)+Ln(2/3)+Ln(1/2)+Ln(1/2)=-2.8902EOS(DataSparsness)CHERREADABOOKc(CHERREAD)=0Æp(READ|CHER)0Æp(CHERREADABOOK)=0()MLE0n-gram,n-gram0NLPZipfZipfwfr(wr)fr=k(k)wi50wj150wiwj3TomSawyer71,370(wordtokens)8,018(wordtypes)ZipfTomSawyer3993(50%)(wordtypes)10051%(wordtokens)ZipfZipfk8000-90003r=100ZipfTomSawyer1-3Zipf0501001502002503003500500100015002000RankFreqZipf19981ZipfZipfZipf()ZipfPrincipleofLeasteffort()()ZipfBalh150trigram23%trigramMLE:discounting()Add-oneAdd-deltaWitten-BellGood-TuringChurch-GaleJelinek-MercerKatz……Add-one(Laplace’slaw)n-gramn-gram:new_count(n-gram)=old_count(n-gram)+1n-gram0Add-oneIwanttoeatChinesefoodlunch…Total(N)I810870130003437want3078606861215to301086030123256eat002019252938Chinese200001201213food1901700001506lunch4000010459…bigramIwanttoeatChinesefoodlunch…TotalI.0023(8/3437).320.0038(13/3437)0001want.00250.650.0049.0066.00491to.000920.0031.26.000920.00371eat00.00210.020.0021.0551Chinese.00940000.56.00471food.0130.01100001lunch.00870000.002201…bigram1stword2ndwordAdd-oneIwanttoeatChinesefoodlunch…Total(N+V)I891087108811411134375053want34178717972831to411186141134872eat11231203532554Chinese3111112121829food2011811113122lunch51111212075bigramIwanttoeatChinesefoodlunch…TotalI.0018(9/5053).22.0002.0028(14/5053).0002.0002.00021want.0014.00035.28.00035.0025.0032.00251to.00082.00021.0023.18.00082.00021.00271eat.00039.00039.0012.00039.0078.0012.0211Chinese.0016.00055.00055.00055.00055.066.00111food.0064.00032.0058.00032.00032.00032.000321lunch.0024.00048.00048.00048.00048.0022.000481bigramp(w1|w2)Add-oneN:n-gram(token)V:n-gram(type)VN…=…1)()(22111Add-onen-gram00n-gramn-gramNLPAdd-onen-gramn-gramn-gram()AP(ChurchandGale,1991)22,000,000bigrams(token)273,266(type)(74,674,306,756bigrams(type))74,671,100,000bigramsAdd-onebigram0.0002950.0008224.2150.0006853.2340.0005482.2430.0004111.2520.0002740.44810.0002950.0000270fadd-onefempiricalfMLEtoohightoolowFreq.fromtrainingdataFreq.fromheld-outdataAdd-onesmoothedfreq.bigram=(74,671,100,000x0.000295)/22,000,000~99.96!!!!Add-delta(Lidstone’slaw)1,1λλ=0.5Add-oneVNλλ++…=…)()(2211V.N.)()(2211++…=…(Held-outEstimation)(Held-outdata):(trainingset):(heldoutdata):n-gramw1...wn:Ctr(w1...wn)w1...wnCho(w1...wn)w1...wn:r=n-gramNr=rn-gramTr=rn-gramT=n-gram(token):∑==})(|{111)(rwwCwwnhorntrnwwCTLLL)(,1)(11ntrrrnhowwCrNTTwwPLL=×=)(1)(11ntrrrnhowwC,rNTTwwPLL=×=rn-gramNrrn-gramr=510n-gram(types)5:N5=1010n-gram20:T5=202000n-gram(token)0010101200020)5(.withrann-gramPho=×==(DeletedEstimation)…part0part1part0part1part1part0(DeletedEstimationortwo-waycrossvalidation)r...ww,cNNNTT...wwpnrrrrndel=++=)()()(11010011Good-Turing:n-gramn-gramTuringestimaten-gramrrNNrr*1)1(++=rn-gramr+1n-gramNNNrN*100=Good-Turingrn-gramprr+1n-gramNr+1=0pr=0Nr“”S(.)S(r)Nr(Good-Turingestimate)S(.)?)()1()1(*rSrSrr++=Good-Turing[1]log(r)log(Nr)rNr0NrNrtrq,Nt0,Nr0,Nq0txrrxqNx=0)(5.0qtNZrr−=[1]Church,K.andW.Gale,(1991),AcomparisonoftheenhancedGood-TuringanddeletedestimationmethodsforestimatingprobabilitiesofEnglishbigrams,ComputerSpeechandLanguage,v.5,pp.19-54.Good-Turingr*r(r1)r*rpr=r*/Nn-gramrr*rr*/r1rpMLE)log()log(rbaNr+=))log((,AaArNbr==Good-Turinglog(Zr)log(r)abb-1r*r1bbbr1r*)r11(rAr)1r(A)1r(NN)1r(r+++=++=+=Good-TuringTuringTuringTuringGood-TuringGood-TuringTuringGood-Turing1.65TuringTuring)1()1()(1212*rrrrTNNNNrVVar++++≈1)1(11≥⋅−=∑≥rpNpNNprunnormrrunnormrrn-gram()bigramjournalofPunsmoothed(of|journal)=0journalfromPunsmoothed(from|journal)=0journalneverPunsmoothed(never|journal)=0,“journalof”.“of”“from”&“never”unigramP(of)P(from)P(never)unigrammodelbigrammodelbigrammodeltrigrammodel…n-gramn-gramn-gram(SimpleLinearInterpolation)n-gramPli(wn|wn-2,wn-1)=λ1P(wn)+λ2P(wn|wn-1)
本文标题:n-gram和数据平滑
链接地址:https://www.777doc.com/doc-6708193 .html