SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces
Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno

How SHRiMP works:
- Stage 1: Map reads to target genome
- Stage 2: Compute statistics

Read Mapping
- Three phases
  - Very fast k-mer scan (index reads, scan genome)
  - Fast, vectorized Smith-Waterman to confirm
  - Slow, complete backtracking S-W for top ‘n’ hits

Read Mapping: Phase 1
- Create a hash table (index) of size 4^(k-mer length)
  - 4 bases – ignore all else (‘N’, ‘X’, wobble codes…)
- This becomes our k-mer to read index…
- Example read AACTGTACCAGTGAG, with a 6-base window sliding one base at a time:
    AACTGTaccagtgag  ->  AACTGT
    aACTGTAccagtgag  ->  ACTGTA
    aaCTGTACcagtgag  ->  CTGTAC
    aacTGTACCagtgag  ->  TGTACC
- Each k-mer entry lists the reads that contain it (the slide's example entries: Read7, Read32, Read18, Read12, Read13, Read12, Read7, Read15)

Read Mapping: Phase 1
- Once we’ve indexed all reads, just scan the genome by k-mer
[Figure: reads matched against the genome]

Read Mapping: Phase 1
- Remember the k-mer hits within a given interval (window)
- When sufficient hits, look more closely
- “Look more closely” means calculate a fast Smith-Waterman score

Technicalities
- We don’t always use full k-mers (q-grams).
- We actually support ‘spaced seeds’, but the algorithm doesn’t change much.
- For each spaced seed, ‘compress out’ the k-mer and use it as the hash index

Read Mapping: Phase 2
- Smith-Waterman is very expensive
  - N x M matrix isn’t too big for short reads and windows, but…
  - We call the vectorized code millions of times
- We don’t want a bottleneck – aim for no more than 50% of the total run time
- We only want one score as quickly as possible

Read Mapping: Phase 2
[Figure: S-W matrix for ACTAGACTTG against TCCAGT, marking the cell being computed and the previously computed cells]

$$M_{i,j} = \max \{\, M_{i-1,j-1} + S(A_{i-1}, B_{j-1}),\; M_{i-1,j} + \mathrm{gap},\; M_{i,j-1} + \mathrm{gap} \,\}$$

Read Mapping: Phase 2
- Each forward-facing diagonal in the S-W matrix depends on:
  - Small constant # of previous diagonals
  - Small constant # of scalars
- We can compute entire diagonals in parallel
- Our speed-up is proportional to the diagonal size

Read Mapping: Phase 2
[Figure: the current, previous, and penultimate diagonals of the S-W matrix for ACTAGACTTG against TCCAGT, with the second sequence also shown reversed as TGACCT]

Read Mapping: Phase 2
- Most commodity processors have vector instructions
  - Remember the MMX brouhaha?
- SIMD – Single Instruction, Multiple Data
    [4 12 8 7] + [2 9 15 3] = [6 21 23 10]

Read Mapping: Phase 2
- Match scores typically use a scoring matrix
  - ScoringMatrix[SeqA[i]][SeqB[j]]
- But this doesn’t scale:
  - Individual cell scores become a bottleneck
- Can precompute a ‘query profile’ (expensive), or…
- If we only care about strict match/mismatch we can use logical bit-wise operations
  - SIMD instructions work here (fully parallel)

Read Mapping: Phase 2
- Results:
  - Our vectorized S-W is as fast as, or faster than, other very complicated SIMD implementations
  - 500 million+ matrix cells/second on Core2 machines
  - Even with small seeds, S-W accounts for at most half of the total run time

Read Mapping: Phase 3
- Recap:
  - K-mer scan selects areas of reasonable similarity
  - Vectorized S-W (dis)confirms similarity
- Best ‘n’ hits per read are given a full alignment with backtrace

Read Mapping: Phase 3
- Letter-space alignments are simple:
  - K-mer scan, Vectorized S-W, Full S-W in letters, give user pretty output
- What about AB SOLiD colour-space?
  - Biologists want to see A, C, G, T, not 0, 1, 2, 3…
  - Dealing with strange SOLiD properties…
- Our solution:
  - K-mer scan, Vectorized S-W in colour-space
  - Full S-W in letter-space, but we can’t just convert

AB Di-base Reads
- We think in terms of nucleotides: A, C, G, and T’s.
- AB’s NGS machine outputs 4 colours
- One colour per pair of bases:
    Letters:  TTGAGCGTTC
    Colours:  T012233102

        T  G  C  A
    T   0  1  2  3
    G   1  0  3  2
    C   2  3  0  1
    A   3  2  1  0

AB Di-base Reads
[Figure: colour transitions between A, G, C, and T, shown with the same colour table]

SOLiD Translations
- Given the following read, there are 4 translations (we need an initial base):
    012233102
    AACTCGCAAG
    CCAGATACCT
    GGTCTATGGA
    TTGAGCGTTC
- Reads begin with a known primer (‘T’), which selects TTGAGCGTTC

SOLiD Translations
- What happens if a read error occurs?
- The right translation was: TTGAGCGTTC
    010233102
    AACCTATGGA
    CCAAGCGTTC
    GGTTCGCAAG
    TTGGATACCT

Colour-space Smith-Waterman
- There are four unique translations for every read
- An error will cause us to change frames (different translation)
- Why not do a S-W across all four letter-space translations with some error penalty?

Colour-space Smith-Waterman
- Think of 4 S-W matrices stacked above one another
- If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices
[Figure: genome against read, with stacked matrices for Frame 1, Frame 2, Frame 3, and Frame 4]

Colour-space Smith-Waterman
- End result:
    G: 1123724  TA-ACCACGGTCACACTTGCATCAC  1123701
                |||||||||||||||X|||||||
    T:          TACACCACGGTCAGACTtGCATCAC
    R: 0        T0311101130121221211313211  24
  (one colour in the read line is annotated: Should be ‘0’)

Statistics
- After reads are mapped, mull over the results
- For each read:
  - P(hit by pure chance – not a valid hit)
  - P(hit generated by genome – valid hit)
  - P(hit is best of all for particular read)

Results
- Speed
  - Simple k-mer scan is very fast
    - Important when seeds are bigger (less S-W)
  - Vectorized S-W is fast
    - Important when seeds are smaller (more S-W)
  - Generally well-
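
The Phase 1 slides (index the reads by k-mer, then scan the genome and count hits inside a window) can be sketched roughly as follows. This is an illustration only, not SHRiMP's code: the function names, the `window` and `min_hits` parameters, and the use of a Python dictionary in place of a dense table of size 4^k are all assumptions made for the example.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_codes(seq, k):
    """Yield (offset, code) for each k-mer made only of A/C/G/T.

    The code is the k-mer read as a base-4 number, so it could index
    a table of size 4**k. K-mers containing 'N', 'X', etc. are skipped.
    """
    for i in range(len(seq) - k + 1):
        code = 0
        for ch in seq[i:i + k].upper():
            if ch not in BASE_CODE:
                break
            code = code * 4 + BASE_CODE[ch]
        else:
            yield i, code

def build_index(reads, k):
    """Map each k-mer code to the set of read ids containing it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for _, code in kmer_codes(seq, k):
            index[code].add(read_id)
    return index

def scan_genome(genome, index, k, window, min_hits):
    """Return {read_id: [genome offsets where enough hits accumulated]}.

    A read becomes a candidate at an offset once it has at least
    min_hits k-mer hits within the last `window` genome positions.
    """
    recent = defaultdict(list)
    candidates = defaultdict(list)
    for pos, code in kmer_codes(genome, k):
        for read_id in index.get(code, ()):
            hits = [p for p in recent[read_id] if pos - p < window]
            hits.append(pos)
            recent[read_id] = hits
            if len(hits) >= min_hits:
                candidates[read_id].append(pos)
    return candidates

reads = {"r1": "AACTGTACC", "r2": "GGGGNGGGG"}
index = build_index(reads, k=6)
candidates = scan_genome("TTTTAACTGTACCTTTT", index, k=6, window=10, min_hits=3)
```

In this toy run only `r1` becomes a candidate; every 6-mer of `r2` contains an `N` and is ignored. In the talk's pipeline, a candidate region would then be passed to the fast Smith-Waterman scoring of Phase 2.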
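
The ‘compress out’ step for spaced seeds in the Technicalities slide might look like the sketch below. The mask notation (a string of `1` for positions that count and `0` for positions that are ignored) and the function name are assumptions for illustration; the slides do not show how seeds are written.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def spaced_seed_key(window, seed):
    """Build a hash key from the positions of `window` where seed has '1'.

    Positions marked '0' are dropped ("compressed out"), so the key is
    a base-4 number with as many digits as the seed has '1's.
    Returns None if a kept position holds a non-ACGT character.
    """
    key = 0
    for ch, keep in zip(window.upper(), seed):
        if keep != "1":
            continue
        if ch not in BASE_CODE:
            return None
        key = key * 4 + BASE_CODE[ch]
    return key

# Two windows that differ only at the ignored position get the same key.
key_a = spaced_seed_key("ACGT", "1101")
key_b = spaced_seed_key("ACAT", "1101")
```

Because the ignored positions never reach the key, a mismatch there still produces an index hit, which is the point of a spaced seed.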
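
The Phase 2 observation that each diagonal of the S-W matrix depends only on the two diagonals before it can be checked with a small scalar sketch. It is not vectorized and is not SHRiMP's implementation; it only shows that computing by diagonals gives the same score as the usual row-by-row order. The scoring values and the floor of 0 (the standard local-alignment convention, not shown in the slide's recurrence) are assumptions.

```python
def sw_score_rowwise(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, filling the matrix row by row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def sw_score_diagonal(a, b, match=2, mismatch=-1, gap=-2):
    """Same score, filling the matrix one anti-diagonal (i + j = d) at a time.

    Only three diagonals are kept: current, previous, penultimate.
    Each is a dict keyed by row index i; missing keys are border cells (0).
    Cells within one diagonal do not depend on each other, which is what
    makes a SIMD version possible.
    """
    n, m = len(a), len(b)
    penultimate, previous = {}, {}
    best = 0
    for d in range(2, n + m + 1):
        current = {}
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            current[i] = max(
                0,
                penultimate.get(i - 1, 0) + s,   # M[i-1][j-1], diagonal d-2
                previous.get(i - 1, 0) + gap,    # M[i-1][j],   diagonal d-1
                previous.get(i, 0) + gap,        # M[i][j-1],   diagonal d-1
            )
        best = max(best, max(current.values(), default=0))
        penultimate, previous = previous, current
    return best

same = sw_score_diagonal("ACTAGACTTG", "TCCAGT") == sw_score_rowwise("ACTAGACTTG", "TCCAGT")
```

The inner loop over `i` is the part a vector unit would execute in one step per diagonal, which is why the speed-up grows with the diagonal length.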
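
The di-base colour table and the four translations in the SOLiD slides can be reproduced with a few lines. The sketch below assumes the 2-bit coding A=0, C=1, G=2, T=3, under which the colour of a base pair is the XOR of the two codes; this matches every entry of the table in the slides. The function names are hypothetical.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3

def colour(x, y):
    """Colour (0-3) for the adjacent base pair x, y."""
    return BASES.index(x) ^ BASES.index(y)

def encode(seq):
    """Letter-space sequence -> colour string (one colour per base pair)."""
    return "".join(str(colour(x, y)) for x, y in zip(seq, seq[1:]))

def decode(first_base, colours):
    """Colour string -> letters, given the initial base."""
    out = [first_base]
    for c in colours:
        out.append(BASES[BASES.index(out[-1]) ^ int(c)])
    return "".join(out)

def all_translations(colours):
    """The four possible translations, one per initial base A, C, G, T."""
    return [decode(b, colours) for b in BASES]

good = decode("T", "012233102")   # read without error, known primer 'T'
bad = decode("T", "010233102")    # same read with one colour misread
frames = all_translations("010233102")
```

With the misread colour, the `T` translation diverges from the true sequence after the error, while the translation that starts from `C` agrees with it from that point on. This is the frame change that the colour-space Smith-Waterman slides handle by aligning across all four translations.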