Profile Hidden Markov Models

• Markov models
• Hidden Markov models
• Profile hidden Markov models

Markov Chain

• A set of states
• The transitions from one state to all other states, including itself, are governed by a probability distribution
• First-order Markov chain: the probabilities depend solely on the current state
• n-th order Markov chain: the probabilities depend on the n previous states

A Markov Model of DNA Mutations

       A      C      G      T
A    0.99   0.002  0.006  0.002
C    0.002  0.99   0.002  0.006
G    0.006  0.002  0.99   0.002
T    0.002  0.006  0.002  0.99

[Figure: state diagram of the same model over the states A, C, G, T, with edges labeled 0.99, 0.006, and 0.002]

Two Uses of a Markov Model

• Generate sequences according to the probabilities
• Compute the probability of a sequence

A Markov Model Generating Random DNA Sequences

[Figure: states A, C, G, T together with a begin state and an end state]

A Good Introduction to HMM

• The examples in the following slides are taken from:
• An introduction to hidden Markov models for biological sequences
• Anders Krogh
• In Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls, and S. Kasif, pages 45-63, Elsevier, 1998
• [AT][CG][AC][ACGT]*A[TG][GC]

A Problem with Regular Expression

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

[AT][CG][AC][ACGT]*A[TG][GC]

Does not distinguish between:
TGCT--AGG   exceptional
ACAC--ATC   consensus

A Hidden Markov Model

[Figure: profile HMM with node 1, node 2, node 3, node 4, node 5, node 6, and an insertion node]

First Three and Last Three Columns

• Column 1: 4 A's and 1 T
  – probability for A is 0.8
  – probability for T is 0.2

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Insertions

• Columns 4, 5, 6 are the insertions
• At the fourth column, 3 out of 5 sequences have insertions
  – the probability of transition from the third node to the insertion node is 0.6
• In the insertion node, 1 A, 2 C's, 1 G, 1 T
  – the probabilities of A, C, G, T are 0.2, 0.4, 0.2, 0.2
• Transitions out of the insertion node
  – 3 out of the 5 insertions terminate the insertion
  – the probability of leaving the insertion node is 0.6

Pr(ACACATC) = 0.8 ∙ 1 ∙ 0.8 ∙ 1 ∙ 0.8 ∙ 0.6 ∙ 0.4 ∙ 0.6 ∙ 1 ∙ 1 ∙ 0.8 ∙ 1 ∙ 0.8 ≈ 0.047

Two uses: compute probability, generate sequences

The Probabilities of Sequences

              Sequence    Prob ∙ 100
Consensus     ACAC--ATC   4.7
sequence 1    ACA---ATG   3.3
sequence 2    TCAACTATC   0.0075
sequence 3    ACAC--AGC   1.2
sequence 4    AGA---ATC   3.3
sequence 5    ACCG--ATC   0.59
exceptional   TGCT--AGG   0.0023

A Problem with Probabilities

• Biased by the lengths of the sequences
• 0.047 for ACAC--ATC
• 0.000075 for TCAACTATC
• Normalize for the length
  – let L be the length of the sequence
  – divide the probability by (0.25)^L to obtain the odds ratio
  – take the (natural) logarithm of the odds ratio: log-odds score

Probabilities and Log-odds Scores

              Sequence    Prob ∙ 100   log-odds
Consensus     ACAC--ATC   4.7          6.7
sequence 1    ACA---ATG   3.3          4.9
sequence 2    TCAACTATC   0.0075       3.0
sequence 3    ACAC--AGC   1.2          5.3
sequence 4    AGA---ATC   3.3          4.9
sequence 5    ACCG--ATC   0.59         4.6
exceptional   TGCT--AGG   0.0023       -0.97

log-odds(ACACATC) = 1.16 + 0 + 1.16 + 0 + 1.16 - 0.51 + 0.47 - 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 = 6.64

If There Are No Insertions

• Remove the insertion node
• The probabilities of all transitions are 1
  – log-odds score is 0
• Then the score of a sequence is the sum of the per-position log-odds scores
  – This reduces HMM to a position-specific scoring matrix (PSSM)

A Problem with Probabilities Derived from Simple Counting

• Counts of (A, C, G, T) in column 1: (4, 0, 0, 1)
  – probabilities of (A, C, G, T): (0.8, 0, 0, 0.2)
• Are we absolutely sure that no C or G can appear in this column?
• What if we are working with amino acids?
  – Some amino acids are known to substitute for each other

Prior Information

• In an alignment of 3 sequences, a column contains only isoleucine
  – prior information: isoleucine is commonly found in buried beta-strand environments, and leucine and valine often substitute for it in these environments
  – estimate of the probability distribution should include leucine and valine, and perhaps other amino acids, albeit with smaller probabilities
• In an alignment of 100 varied sequences, a column contains only isoleucine
  – more evidence from the data that isoleucine is conserved at this position
  – in this case, prior information is less important

The Pseudocount Method

• Data counts: (A, C, G, T) = (4, 0, 0, 1)
  – probabilities: (4/5, 0, 0, 1/5)
• Pseudocounts (1, 1, 1, 1) as the prior
  – posterior counts: (5, 1, 1, 2)
  – posterior probabilities: (5/9, 1/9, 1/9, 2/9)
• Pseudocounts (4, 4, 4, 4)
  – posterior counts: (8, 4, 4, 5)
  – posterior probabilities: (8/21, 4/21, 4/21, 5/21)

The Paper on Dirichlet Mixtures

• Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
• Sjolander, Karplus, Brown, Hughey, Krogh, Mian, and Haussler
• CABIOS, volume 12, number 4, 327-345, 1996

Dirichlet Density

• Let P be (p1, p2, …, p20), a probability distribution of the 20 amino acids, such that pi ≥ 0, ∑pi = 1
• A Dirichlet density R has parameters (α1, α2, …, α20), such that αi > 0
• R is a probability distribution of P
  – A probability distribution of probability distributions
• R(P) = (constant) ∙ p1^(α1-1) ∙ p2^(α2-1) ∙ … ∙ p20^(α20-1)
  – The constant scales R(P) so that R is a probability distribution
  – E(pi) = αi / ∑αj, the expectation
  – All αi = 0.005, all E(pi) = 1/20, but prefers pure distributions
  – All αi = 0.5, all E(pi) = 1/20, but prefers mixed distributions

Compare 2 Dirichlet Densities with 3 Parameters

[Figure: comparison of two Dirichlet densities with three parameters]

Dirichlet Mixture

• A Dirichlet mixture with L components
  – R = q1 R1 + q2 R2 + … + qL RL
  – R1, R2, …, RL: each is a Dirichlet density, and they are called the components of the mixture
  – q1, q2, …, qL are the mixture coefficients, which sum to one
• The pseudocount method corresponds to a Dirichlet mixture with one component

The Probabilities by Using a Single Dirichlet Density

• Let (N1, N2, …, N20) be the amino acid counts in a column
  – N = ∑Ni
• The Dirichlet density has parameters (α1, α2, …, α20)
  – α = ∑αi
• Prob(i) = (Ni + αi) / (N + α)
• This re
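
The single-density estimate Prob(i) = (Ni + αi) / (N + α) and the worked numbers for the consensus sequence ACACATC can be checked numerically. The following is a minimal Python sketch rather than code from the slides: the function names are hypothetical, the background frequency of 0.25 per nucleotide is the one the slides divide by, and the natural logarithm is an assumption chosen because it reproduces the slide's terms (1.16, -0.51, 0.47, 1.39).

```python
import math


def posterior_probs(counts, alphas):
    """Single-Dirichlet (pseudocount) estimate: (N_i + alpha_i) / (N + alpha)."""
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    total = sum(counts) + sum(alphas)  # N + alpha
    return [(n + a) / total for n, a in zip(counts, alphas)]


def path_probability(emissions, transitions):
    """Probability of one path: product of emission and transition probabilities."""
    return math.prod(emissions) * math.prod(transitions)


def path_log_odds(emissions, transitions, background=0.25):
    """Log-odds score: log of the path probability divided by background**L.

    Each emission contributes log(e / background); each transition
    contributes log(t). A zero probability would raise ValueError in math.log.
    """
    emission_score = sum(math.log(e / background) for e in emissions)
    transition_score = sum(math.log(t) for t in transitions)
    return emission_score + transition_score


# Column 1 of the example alignment: counts of (A, C, G, T).
COLUMN_COUNTS = [4, 0, 0, 1]

# Path of ACACATC: A, C, A in nodes 1-3, C in the insertion node, A, T, C in nodes 4-6.
EMISSIONS = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
# Transitions between consecutive states; 0.6 enters and 0.6 leaves the insertion node.
TRANSITIONS = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

if __name__ == "__main__":
    print("pseudocounts (1,1,1,1):", posterior_probs(COLUMN_COUNTS, [1, 1, 1, 1]))
    print("pseudocounts (4,4,4,4):", posterior_probs(COLUMN_COUNTS, [4, 4, 4, 4]))
    print("Pr(ACACATC):", path_probability(EMISSIONS, TRANSITIONS))
    print("log-odds(ACACATC):", path_log_odds(EMISSIONS, TRANSITIONS))
```

Computed without rounding the individual terms, the log-odds score comes out near 6.65, which sits between the slide's sum of rounded terms (6.64) and the table entry (6.7). Setting every transition to 1 leaves only the emission terms, which is the PSSM case described under "If There Are No Insertions".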