GATK Usage Notes and Workflow

1. Sequence names in the data to be analyzed should generally not contain spaces.

2. Prepare the reference file (must be in FASTA format) and run the alignment

1) Index the reference. This generates a file named ref.fasta.fai:

    samtools faidx ref.fasta

2) Generate the .dict file:

    java -jar picardtools/CreateSequenceDictionary.jar R=ref.fasta O=ref.dict

Build the BWA index and align the reads:

    bwa index -a bwtsw ref.fasta
    bwa mem -t 16 -M ref.fasta read.fq mates.fq > sample.sam

Convert the result file to BAM format:

    java -jar picardtools/SamFormatConverter.jar I=xx.sam O=xx.bam

or

    samtools view -bS xx.sam -o xx.bam

3. Prepare the sample BAM file

1) Sort the aligned reads by coordinate order
2) Mark duplicates
3) Add read group information (this tool can also convert SAM to BAM and sort, so it can be merged with step 1)
4) Index the BAM file

Sort:

    java -jar picardtools/SortSam.jar INPUT=unsorted_reads.bam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

(INPUT may be a SAM file and OUTPUT a BAM file, which makes the separate format conversion above unnecessary.)

Mark duplicates:

    java -jar picardtools/MarkDuplicates.jar INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=sample01.dedup.metrics MAX_FILE_HANDLES=1000

Note: MAX_FILE_HANDLES takes an integer; its upper limit is the value reported by "ulimit -n".

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

Add read group information:

    java -jar picardtools/AddOrReplaceReadGroups.jar I=dedup_reads.bam O=addrg_reads.bam ID=group1 LB=lib1 PL=illumina PU=unit1 SM=sample1

    ID=String   Read Group ID. Default value: 1. This option can be set to 'null' to clear the default value.
    LB=String   Read Group Library. Required.
    PL=String   Read Group platform (e.g. illumina, solid). Required.
    PU=String   Read Group platform unit (e.g. run barcode). Required.
    SM=String   Read Group sample name. Required.

Index the BAM file:

    java -jar picardtools/BuildBamIndex.jar I=addrg_reads.bam

or

    samtools index addrg_reads.bam

For example (the file names differ between the steps below; in practice, pass the output of each step as the input of the next):

1) Align with bwa

    bwa index -a bwtsw ref.fasta
    bwa mem -t 16 -M ref.fasta read.fq mates.fq > sample.sam

2) Convert SAM to BAM

    samtools view -bS sample01.sam -o sample01.bam

or

    java -jar picardtools/SamFormatConverter.jar I=xx.sam O=xx.bam

3) Sort

    java -jar picardtools/SortSam.jar I=sample.bam O=sample.sorted.bam SORT_ORDER=coordinate

4) Remove duplicates

    java -jar picardtools/MarkDuplicates.jar INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=sample01.dedup.metrics MAX_FILE_HANDLES=1000

5) Add read groups

    java -jar picardtools/AddOrReplaceReadGroups.jar I=sample.sorted.bam O=group.bam ID=group1 LB=lib1 PL=illumina PU=unit1 SM=sample1

6) Index the sample

    java -jar ~/my_bin/picardtools1.94/BuildBamIndex.jar I=group.bam

4. Usage parameters

--------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v2.6-4-g3e5ff60, Compiled 2013/06/24 14:48:56
Copyright (c) 2010 The Broad Institute
For support and documentation go to ...

(required String)
    Type of analysis to run. The command in section 5 passes this value with -T.

-I, --input_file
    Input file(s), SAM or BAM.

-rbs, --read_buffer_size
    Number of reads per SAM file to buffer in memory.

--BQSR / -BQSR (File)
    The input covariates table file which enables on-the-fly base quality score recalibration (intended for use with BaseRecalibrator and PrintReads). Enables on-the-fly recalibration of base qualities. The covariates tables are produced by the BaseQualityScoreRecalibrator tool. Please be aware that one should only run recalibration with the covariates file created on the same input BAM(s).

-K, --gatk_key
    GATK key file. Required if running with -et NO_ET. Please see ...-home-and-how-does-it-affect-me#latest for details.

--intervals / -L (List[IntervalBinding[Feature]])
    One or more genomic intervals over which to operate. Can be explicitly specified on the command line or in a file (including a rod file). Using this option one can instruct the GATK engine to traverse over only part of the genome. This argument can be specified multiple times. One may use samtools-style intervals either explicitly (e.g. -L chr1 or -L chr1:100-200) or listed in a file (e.g. -L myFile.intervals). Additionally, one may specify a rod file to traverse over the positions for which there is a record in the file (e.g. -L file.vcf). To specify the completely unmapped reads in the BAM file (i.e. those without a reference contig) use -L unmapped.

-XL, --excludeIntervals
    One or more genomic intervals to exclude from processing. Can be explicitly specified on the command line or in a file (including a rod file).

--reference_sequence / -R (File)
    Reference sequence file.

--num_threads / -nt (Integer with default value 1)
    How many data threads should be allocated to running this analysis. Data threads contain N cpu threads per data thread, and act as completely data parallel processing, increasing the memory usage of GATK by M data threads. Data threads generally scale extremely effectively, up to 24 cores ......

5. Analysis workflow

1) Mapping and Duplicate Marking
2) Local Realignment
3) Base Quality Recalibration

Base quality recalibration requires known SNP/indel information as a reference. If no known sites are available, first call variants with both GATK and samtools, then use the SNPs/indels on which the two agree as the reference. For details, see the blog post by another author: ...

Variant calling by GATK:

    java -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T UnifiedGenotyper \
        -I sample01.realn.bam \
        -o sample01.gatk.raw.vcf \
        -stand_call_conf 30.0 \
        -stand_emit_conf 0 \
        -glm BOTH \
        -rf BadCigar

The -rf option filters out reads that do not meet a given requirement. Here it excludes reads whose CIGAR string is malformed. For details on CIGAR strings, see the description of the SAM file format (The SAM Format Specification): the sixth column of a SAM record is the CIGAR string. The other -rf param...
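The BAM preparation steps in section 3 (sort, mark duplicates, add read groups, index) can be chained so that each step reads the previous step's output. Below is a minimal sketch of that idea, not a definitive pipeline. The names PICARD_DIR, DRY_RUN, run, max_file_handles and prepare_bam, the <sample>.* file naming scheme, and the reserve of 50 file handles are all illustrative assumptions. The only things taken from the notes are the Picard commands themselves and the rule that MAX_FILE_HANDLES is bounded by "ulimit -n". By default the functions only print the commands.

```shell
# Sketch only. PICARD_DIR, DRY_RUN, the function names and the <sample>.* file
# naming scheme are illustrative assumptions, not part of the original notes.
PICARD_DIR="${PICARD_DIR:-picardtools}"   # directory holding the Picard .jar files

# Print a command; execute it only when DRY_RUN=0 (default: print only).
run() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Choose a MAX_FILE_HANDLES value a little below the open-file limit.
# $1: limit to use (defaults to the output of `ulimit -n`).
# The reserve of 50 handles is an assumption made for this sketch.
max_file_handles() {
    local limit="${1:-$(ulimit -n)}"
    local reserve=50
    case "$limit" in
        # Non-numeric limit such as "unlimited": reuse the value from the notes.
        ''|*[!0-9]*) echo 1000; return 0 ;;
    esac
    if [ "$limit" -gt $((reserve * 2)) ]; then
        echo $((limit - reserve))
    else
        echo $((limit / 2))
    fi
}

# Sort, mark duplicates, add read groups and index <sample>.sam.
# Steps are joined with && so the chain stops at the first failing command.
prepare_bam() {
    local s="$1"
    local handles
    handles="$(max_file_handles)"
    run java -jar "$PICARD_DIR/SortSam.jar" \
        INPUT="$s.sam" OUTPUT="$s.sorted.bam" SORT_ORDER=coordinate &&
    run java -jar "$PICARD_DIR/MarkDuplicates.jar" \
        INPUT="$s.sorted.bam" OUTPUT="$s.dedup.bam" \
        METRICS_FILE="$s.dedup.metrics" MAX_FILE_HANDLES="$handles" &&
    run java -jar "$PICARD_DIR/AddOrReplaceReadGroups.jar" \
        I="$s.dedup.bam" O="$s.addrg.bam" \
        ID=group1 LB=lib1 PL=illumina PU=unit1 SM="$s" &&
    run java -jar "$PICARD_DIR/BuildBamIndex.jar" I="$s.addrg.bam"
}
```

After sourcing the snippet, calling prepare_bam sample01 prints the four commands for sample01.sam so they can be checked first. Setting DRY_RUN=0 before the call executes them. SortSam is given the SAM file directly, which relies on the remark in section 3 that it accepts SAM input and writes BAM output.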