您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 图形图像 > 基于de-Bruijn图的DNA+contig生成算法
硕士学位论文基于deBruijn图的DNAcontig生成算法DNACONTIGGENERATIONALGORITHMBASEDONDEBRUIJNGRAPH王旭哈尔滨工业大学2011年6月中图分类号:TP319学校代码:10213UDC:621.3密级:公开工学硕士学位论文基于deBruijn图的DNAcontig生成算法硕士研究生:王旭导师:陈彬副教授申请学位:工学硕士学科:计算机科学与技术所在单位:计算机科学与技术学院答辩日期:2011年6月授予学位单位:哈尔滨工业大学ClassifiedIndex:TP319U.D.C:621.3DissertationfortheMasterDegreeinEngineeringDNACONTIGGENERATIONALGORITHMBASEDONDEBRUIJNGRAPHCandidate:WangXuSupervisor:AssociateProf.ChenBinAcademicDegreeAppliedfor:MasterofEngineeringSpecialty:ComputerScienceandTechnologyAffiliation:SchoolofComputerScienceandTechnologyDateofDefence:June,2011Degree-Conferring-Institution:HarbinInstituteofTechnology哈尔滨工业大学工学硕士学位论文-I-摘要为了探索生命的本质,人们迫切希望快速地获得新物种个体DNA分子的全部碱基序列。现在,新一代测序技术不断发展,但测序过程中DNA分子已被打碎成碱基片段,于是从头测序被提出来了。随着第二代、第三代测序技术的产生,人们能在较短时间内获得大量的测序数据。测序技术以高通量、低成本、高精度为发展方向,现在积累的测序数据越来越多。如何快速、准确地处理海量测序数据已成为DNA测序发展的瓶颈。测序之前,DNA分子要经过复制、打碎、过滤等过程,然后通过测序仪,把DNA片段读出,读出的DNA片段称为读取。要获得整个DNA分子的碱基序列,首先要根据读取构造重叠群。重叠群是由读取相互重叠而成的DNA片段,并且重叠群上的每个碱基都被一条读取所覆盖。在本文中,提出了一种新的重叠群生成算法,叫SRGA。该算法基于deBruijn图,将从头测序问题转化成一个四叉树的搜索问题,并采用启发式搜索策略,能够快速地处理海量测序数据,而且能得到质量较高的重叠群。本文详细叙述了算法的原理以及实现过程。为了存储大量的读取,在本文中使用了一种新的deBruijn图结构。为了引入启发式规则,决策表结构是必要的。决策表中保存了正在参与拼接的读取,后继k-mer就是由这些读取决定的。当选定后继k-mer时,决策表需要更新。算法中通过不断选取后继k-mer,来扩展重叠群。故后继k-mer的选取是一个非常重要的过程,只有选择了正确的后继k-mer才能得到质量较高的重叠群。本文提出了读取锁定策略,即通过设置决策表中的锁定位,将一些快要成功拼接的读取锁定,令后继k-mer由这些读取产生。这样可以保证重叠群上每一个碱基都被一个成功拼接的读取所覆盖。最后,本文将SRGA与EULER算法进行了比较。发现EULER产生的重叠群较短,数量较多,单条重叠群与参考基因组匹配较好,但总体覆盖度较低。而SRGA能产生较长的重叠群,且使用了更少的时间与内存。虽然单条重叠群与参考基因组匹配稍差,但总体覆盖度较高。关键词:从头测序;启发式搜索;deBruijn图;决策表哈尔滨工业大学工学硕士学位论文-II-AbstractInordertoexploretheessenceoflife,peopleareeagertogetallthebasesonDNAmoleculesofnewspeciesquickly.Now,anewgenerationofsequencingtechnologyisdevelopingonandon,butintheprocessofsequencing,DNAhasbeenbrokenintofragments.Sodenovosequencingisproposed.Asthesecondgeneration,thirdgenerationsequencingtechnologygenerated,peoplecangetalotofsequencedatainarelativelyshortperiodoftime.Thesequencingtechnologyisdevelopinginthedirectionofhigh-throughput,lowcostandhigh-precision,whichleadstomoreandmoresequencingdataareaccumulated.HowtoquicklyandaccuratelyhandlethemassivesequencingdatahasbecomethebottleneckofthedevelopmentofDNAsequencing.Beforesequencing,DNAmustbecopied,smashedandfiltered,andthenthesequencerreadDNAfragmentsout.TheseDNAfragmentsarecalledreads.First,wemustgeneratecontigsfromreads.AcontigissuchaDNAfragmentthatisovoerlapedbyreads,andeachbaseiscoveredbyaread.Inthispaper,weproposeanewdenovosequencingalgorithm,calledSRGA.ThisalgorithmisbasedondeBruijngraph,andittransformsthedenovosequencingproblemintoaquadtreesearchproblem.Inthisalgorithmheuristicsearchstrategyisused,soitisabletodealwithmasssequencingdataquicklyandthecontigsgeneratedfromthealgorithmhavehighquality.Inthispaper,theprincipleandtherealizationofthealgorithmisdescribeddetailedly.Inordertosaveagreatdealofreads,anewdeBruijngraphisused.Decisiontableisessentialforheuristicrules.Assemblingreadsaresavedinthedecisiontable,andthesereadsdecidewhichk-meristhesubsequentk-mer.Thedecisiontableneedstobeupdatewhenasubsequentk-merischosen.Wechoosesubsequentk-merrepeatedlytoextendcontig.Sochoosingthesubsequentk-merisanimportantprocess.Ifandonlyiftherightsubsequentk-merischosencanwegetthecontigwithhighquality.Herelockedreadstrategyisproposed.Somereadswhichareaboutsuccesscanbelockedthroughsetthelockedbitinthedecisiontableandthenthelockedreadswilldecidewhichk-meristhesubsequentk-mer.Thisensuresthateachbaseonthecontigcanbecoveredbyasuccessfulassemblingread.Atlast,wecompareSRGAwithEULER.EULERproducesshortetcontigs,andeachcontigcanmatchreferencegenomewell,butoverallcoverageislow.SRGAproduceslongercontigs,anduseslesstimeandmemory.Althougheachcontigcannotmatchreferencegenomewell,yetoverallcoverageishigh.Keywords:denovosequencing,heuristicsearch,deBruijngraph,decisiontable哈尔滨工业大学工学硕士学位论文-III-目录摘要...............................................................................................................................IAbstract.............................................................................................................................II第1章绪论....................................................................................................................11.1课题背景...............................................................................................................11.2研究目的和意义...................................................................................................21.3国内外研究现状和分析.......................................................................................41.4基于deBruijn图的算法概述..............................................................................61.4.1deBruijn图简介...............................................................................................61.4.2基于deBruijn图算法的一般步骤.................................................................71.5本文主要研究内容...............................................................................................71.5.1基因组中存在大量重复片段..........................................................................81.5
本文标题:基于de-Bruijn图的DNA+contig生成算法
链接地址:https://www.777doc.com/doc-7358924 .html