2013 BMMB 597D: Analyzing Next Generation Sequencing Data
Week 5, Lecture 9
István Albert
Bioinformatics Consulting Center, Penn State

NGS sequencing read formats
Reads: short sequences produced by the instrument
Illumina → FastQ format (.fastq or .fq)
Solid → colorspace fasta (.xsq or .csfasta + .qual)
454 → standard flowgram format (.sff)

The structure of the FASTQ file
Four lines per FASTQ record
1. @ indicates the sequence id
2. the sequence content of the read
3. + optionally repeat the sequence id (often left empty)
4. quality string
Paper: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771.

Encodings
An encoding is a transformation from one representation to another
• The information is not changed
• The optimization method changes
Example: pig latin is a type of encoding
Ordinal (numerical) value of a character (ord)
Character value of an integer (chr)

Encoding
One character → one byte of space
ABCa = 4 bytes long
65 66 67 97 = 11 bytes long
Good: three characters are turned into one, saves space
Bad: not readable, hinders understanding, may have different decoding options

Quality Scores
• A quality score is a number that usually has limits, from a low (say 0) to a high (say 40)
• A quality score represents an error probability.
• It characterizes a single step of the process and NOT the entire experimental procedure
• Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities

Remapping an encoding
• Only some types of characters can be printed.
• So the encoding must start at a character that can be printed, but we also want that value to be the low end of the scale = 0
• Say character “A” has a code of 65. If we choose “A” as the minimum of our scale then “A” stands for 0, “B” (code 66) for 1, and so on: the value is the character code minus 65.

PHRED Quality Scores
For a quality score Q the error probability is P = 10^(-Q/10)
Examples:
Q = 10 → P = 10^-1 = 1/10 = 0.1 = 10%
Q = 40 → P = 10^-4 = 1/10000 = 0.0001 = 0.01%

There are multiple encodings
• Illumina used to switch around the encoding every once in a while.
• Finally they settled on the Sanger encoding with the Phred quality representation.
• There are datasets/tools out there with different encodings!

Sanger Encoding (+33)
• Quality values range between 0 and 93
• Start the scale at character 33
• End the scale at character 33 + 93 = 126 (currently most instruments only produce qualities in the range 0 to 40)

Illumina 1.3 encoding (+64)
(obsolete but still often observed in the wild)
• Quality values range between 0 and 62
• Start the scale at character 64
• End the scale at character 64 + 62 = 126

Understanding encodings

FASTQ format
The first column indicates the record type
De facto standard for processing sequencing reads.
Download the lecture-6.zip data

Other formats
• Some instruments generate files in different formats. Occasionally two files:
1. A sequence file in FASTA format
2. A FASTA-like quality file that lists numerical qualities
Convert them to FASTQ

FASTA and Quality Files
First step is to convert to FASTQ format.

Homework 9
• What characters in the Sanger encoding represent base calling error probabilities of:
– 100%
– 0.01%
– 0.001%
• Create a Sanger encoded FASTQ file that contains the sequence ATGC and has the qualities of 32, 51, 38 and 34
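
The relationships covered in this lecture (ord/chr, the Phred formula, and the +33 and +64 offsets) can be tried out with a minimal Python sketch like the one below. It is an illustration only, not part of the course material: the function names (phred_to_error, error_to_phred, encode_qualities, decode_qualities, fastq_record), the read name "read1" and the example qualities are all made up for this sketch.

```python
import math

SANGER_OFFSET = 33      # Sanger encoding: scale starts at character 33
ILLUMINA13_OFFSET = 64  # Illumina 1.3 encoding: scale starts at character 64


def phred_to_error(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Inverse of the formula above: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def encode_qualities(scores, offset=SANGER_OFFSET):
    """Turn numerical qualities into a quality string (one character each)."""
    return "".join(chr(q + offset) for q in scores)


def decode_qualities(text, offset=SANGER_OFFSET):
    """Turn a quality string back into a list of numerical qualities."""
    return [ord(c) - offset for c in text]


def fastq_record(name, sequence, scores, offset=SANGER_OFFSET):
    """Build one four-line FASTQ record from a sequence and its qualities."""
    if len(sequence) != len(scores):
        raise ValueError("need exactly one quality value per base")
    return f"@{name}\n{sequence}\n+\n{encode_qualities(scores, offset)}\n"


# ord and chr are inverses of each other
print(ord("A"), chr(97))                        # 65 a

# "ABCa" takes 4 characters; its spelled-out codes take 11
print(len("ABCa"), len(" ".join(str(ord(c)) for c in "ABCa")))  # 4 11

# The same qualities look different under the two encodings
print(encode_qualities([0, 10, 40]))                     # !+I
print(encode_qualities([0, 10, 40], ILLUMINA13_OFFSET))  # @Jh

# A FASTA sequence plus a list of numerical qualities becomes a FASTQ record
print(fastq_record("read1", "GATTACA", [40, 40, 30, 20, 10, 5, 2]), end="")
```

Decoding a quality string with the wrong offset shifts every value by 31 (64 - 33), which is why it matters to know which encoding a dataset uses before interpreting its qualities.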
Document title: lecture-9 (Penn State next-generation sequencing data analysis course)