Lecture 2 --- Language Modeling
Prof. Hui Jiang
Department of Computer Science and Engineering
York University, Toronto, Canada
hj@cse.yorku.ca

• Acoustic Model (AM): gives the probability of generating feature X when W is uttered.
  – Need a model for every W to model all speech signals (features) from W → the HMM is an ideal model for speech.
  – Speech unit selection: which speech unit is modeled by each HMM? (phoneme, syllable, word, phrase, sentence, etc.)
    • A sub-word unit is more flexible (better).
• Language Model (LM): gives the probability that W (a word, phrase, or sentence) is chosen to be said.
  – Need a flexible model to calculate the probability for all kinds of W → Markov chain model (n-gram).
• The two models are combined in recognition:

  $\hat{W} = \arg\max_{W \in \Gamma} p(W \mid X) = \arg\max_{W \in \Gamma} P(W) \cdot p(X \mid W) = \arg\max_{W \in \Gamma} P_\Gamma(W) \cdot p_\Lambda(X \mid W)$

  where $p_\Lambda(X \mid W)$ is the acoustic model and $P_\Gamma(W)$ is the language model.

• Training Stage:
  – Acoustic modeling: how to select speech units and estimate HMMs reliably and efficiently from the available training data.
  – Language modeling: how to estimate an n-gram model from text training data; how to handle the data sparseness problem.
• Test Stage:
  – Search: given the HMMs and the n-gram model, how to efficiently search for the optimal path through a huge grammar network.
    • The search space is extremely large.
    • Calls for an efficient pruning strategy.

• An n-gram language model (LM) is essentially a Markov chain model, composed of a set of multinomial distributions.
• Given W = w_1, w_2, …, w_M, the LM probability Pr(W) is expressed as

  $\Pr(W) = \Pr(w_1, w_2, \ldots, w_M) = \prod_{i=1}^{M} p(w_i \mid h_i)$

  – where h_t = w_{t-n+1}, …, w_{t-1} is the history of w_t.
  – In a unigram, h_t = null (parameters ~ |V|, where |V| is the vocabulary size).
  – In a bigram, h_t = w_{t-1} (parameters ~ |V|·|V|).
  – In a trigram, h_t = w_{t-2} w_{t-1} (parameters ~ |V|·|V|·|V|).
  – In a 4-gram, h_t = w_{t-3} w_{t-2} w_{t-1} (parameters ~ |V|·|V|·|V|·|V|).
• How to evaluate the performance of an LM?

• Perplexity: the most widely used performance measure for LMs.
• Given an LM {Pr(·)} with vocabulary size |V| and a sufficiently long test word sequence W = w_1, w_2, …, w_M:
  – Calculate the negative log-probability per word: $LP = -\frac{1}{M} \log_2 \Pr(W)$
  – The perplexity of the LM is then computed as: $PP = 2^{LP}$
• Perplexity indicates that prediction by the LM is about as difficult as guessing a word among PP equally likely words.
• The smaller the PP value, the better the prediction capability of the LM.
• Training-set perplexity: how well the LM fits or explains the training data.
• Test-set perplexity: the generalization capability of the LM in predicting new text data.

• Large vocabulary size → exponential growth in the number of possible n-grams → exponential increase in LM parameters → much more training data and computing resources needed.
• Need to control the vocabulary size in the LM.
• Given the training text data:
  – Limit the vocabulary of the LM to the most frequent words occurring in the training corpus, e.g., the top N words.
  – All other words are mapped to the unknown word, UNK.
  – This gives the lowest out-of-vocabulary (OOV) rate for a given vocabulary size (see the sketch at the end of this section).
• Example: English newspaper WSJ (Wall Street Journal)
  – Training corpus: 37 million words (full 3-year archive)
  – Vocabulary: 20,000 words
  – OOV rate: 4%
  – 2-gram PP: 114
  – 3-gram PP: 76

• Collect a text corpus: tens of millions of words are needed for a 3-gram.
• Corpus preprocessing (very time-consuming):
  – Text clean-up: remove punctuation and other symbols.
  – Normalization: 0.1% → (zero) point one percent; 6:00 → six o'clock; 1/2 → one half; …
  – Surround each sentence with the tags <s> and </s>.
  – Language-specific processing: e.g., for some oriental languages (Chinese, Japanese, etc.), do tokenization → find word boundaries in a stream of characters.
  – Output: clean text, e.g., <s> w1 w2 w3 w4 w5 </s> <s> … </s> <s> … </s> …
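To make the vocabulary-control and perplexity definitions above concrete, here is a minimal Python sketch (not part of the original slides): it keeps the top-N training words, maps everything else to UNK, trains a unigram model, and computes test-set perplexity. The add-one smoothing and all function names are assumptions made for this illustration; the slides discuss n-gram models in general, and a unigram is used here only for brevity.

```python
from collections import Counter
import math

def build_vocab(train_words, top_n):
    """Keep the top-N most frequent training words; everything else becomes UNK."""
    counts = Counter(train_words)
    return {w for w, _ in counts.most_common(top_n)}

def map_oov(words, vocab):
    """Replace out-of-vocabulary words by the UNK symbol."""
    return [w if w in vocab else "<UNK>" for w in words]

def unigram_lm(train_words, vocab):
    """Unigram estimate p(w) ~ N(w) / N, with add-one smoothing over the vocabulary
    (an assumption for this sketch) so that test probabilities are never exactly zero."""
    words = map_oov(train_words, vocab)
    counts = Counter(words)
    total = sum(counts.values())
    v = len(vocab) + 1                      # +1 for <UNK>
    return lambda w: (counts[w] + 1) / (total + v)

def perplexity(lm, test_words, vocab):
    """PP = 2^LP, where LP = -(1/M) * sum_i log2 p(w_i)."""
    words = map_oov(test_words, vocab)
    lp = -sum(math.log2(lm(w)) for w in words) / len(words)
    return 2 ** lp

if __name__ == "__main__":
    train = "the cat sat on the mat the dog sat on the rug".split()
    test = "the cat sat on the rug".split()
    vocab = build_vocab(train, top_n=5)
    lm = unigram_lm(train, vocab)
    oov = sum(w not in vocab for w in test) / len(test)
    print(f"OOV rate: {oov:.2f}")
    print(f"Test-set perplexity: {perplexity(lm, test, vocab):.2f}")
```

For a bigram or trigram model the perplexity computation is identical; only $p(w_i \mid h_i)$ replaces $p(w_i)$.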
• LM parameter estimation from clean text:
  – The entire training text can be mapped into an ordered sample of n-grams without loss of information: S = h_1 w_1, h_2 w_2, …, h_T w_T (assuming the training corpus contains T words).
  – Group together all n-grams with the same history h: S_h = h w_{x1}, h w_{x2}, …, h w_{xn}.
  – S_h can be viewed as an i.i.d. sample from Pr(w | h).
  – Denote p_{hw} = p(w | h) for all possible w and h.
  – So the probability of S_h follows a multinomial distribution:

    $\Pr(S_h) \propto \prod_{w \in V} [p(w \mid h)]^{N(hw)}$

    where N(hw) is the frequency of the n-gram hw in S_h.

• Maximum-likelihood (ML) estimation of a multinomial distribution is easy to derive.
• The ML estimate of the n-gram LM is

  $\{p^{ML}_{hw}\} = \arg\max_{\{p_{hw}\}} \prod_{w \in V} [p_{hw}]^{N(hw)} = \arg\max_{\{p_{hw}\}} \sum_{w \in V} N(hw) \ln p_{hw}$

  subject to the constraints $\sum_{w \in V} p_{hw} = 1$ for all h, which gives

  $p^{ML}(w \mid h) = \frac{N(hw)}{\sum_{w \in V} N(hw)} = \frac{N(hw)}{N(h)}$

• The natural conjugate prior of the multinomial distribution is the Dirichlet distribution.
• Choose a Dirichlet distribution as the prior:

  $p(\{p_{hw}\}) \propto \prod_{w \in V} [p_{hw}]^{K(hw)}$

  – where {K(hw)} are hyper-parameters that specify the prior.
• Derive the posterior p.d.f. by Bayesian learning:

  $p(\{p_{hw}\} \mid S_h) \propto \prod_{w \in V} [p_{hw}]^{K(hw) + N(hw)}$

• Maximization of the posterior p.d.f. → the MAP estimate:

  $p^{MAP}_{hw} = \frac{N(hw) + K(hw)}{\sum_{w \in V} [N(hw) + K(hw)]}$

• MAP estimates of the n-gram LM can be used for smoothing (see the sketch after this section).

• ML estimation never works in practice, due to data sparseness.
• Example: in 1.2 million words of English text (vocabulary of 1,000 words):
  – 20% of bigrams and 60% of trigrams occur only once.
  – 85% of trigrams occur fewer than five times.
  – After observing the whole 1.2 Mw of data, the expected chance of seeing a new bigram is 22%, and a new trigram 65%.
• In ML estimation: zero frequency → zero probability.
• The data sparseness problem cannot be solved by collecting more data:
  – Extremely uneven distribution of n-grams in natural language.
  – Once the amount of data reaches a certain point, the rate at which the OOV rate (or the rate of new n-grams) decreases with more data becomes extremely slow.
• Calls for a better estimation strategy.
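As a minimal illustration of the ML and MAP estimates above (not from the original slides), the Python sketch below counts bigrams from a toy corpus and compares the two estimators. It assumes a uniform prior count K(hw) = K for every word, under which the MAP estimate reduces to add-K smoothing; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Count N(hw) for bigrams, with <s>/</s> sentence tags."""
    n_hw = defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for h, w in zip(words[:-1], words[1:]):
            n_hw[h][w] += 1
    return n_hw

def ml_estimate(n_hw, h, w):
    """p_ML(w|h) = N(hw) / N(h); zero for unseen bigrams."""
    n_h = sum(n_hw[h].values())
    return n_hw[h][w] / n_h if n_h else 0.0

def map_estimate(n_hw, vocab, h, w, k=1.0):
    """p_MAP(w|h) = (N(hw) + K) / sum_w (N(hw) + K), assuming a uniform
    Dirichlet prior K(hw) = K; unseen bigrams now get a small nonzero mass."""
    n_h = sum(n_hw[h].values())
    return (n_hw[h][w] + k) / (n_h + k * len(vocab))

if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the cat ran"]
    n_hw = bigram_counts(corpus)
    vocab = {w for c in n_hw.values() for w in c} | set(n_hw)
    print("p_ML(sat | cat)  =", ml_estimate(n_hw, "cat", "sat"))          # 0.5
    print("p_ML(ran | dog)  =", ml_estimate(n_hw, "dog", "ran"))          # 0.0 (zero frequency)
    print("p_MAP(ran | dog) =", map_estimate(n_hw, vocab, "dog", "ran"))  # > 0 after smoothing
```

With K(hw) = 1 for all w this is Laplace (add-one) smoothing, one simple way to avoid the zero-probability problem noted above; non-uniform choices of K(hw) yield other smoothing behaviors within the same MAP formula.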