2013-BI543-NextGenSeqInf-v02x - BC Bioinformatics

High throughput sequencing:
informatics & software aspects
Gabor T. Marth
Boston College Biology Department
BI543 Fall 2013
January 29, 2013
Traditional DNA sequencing
Genetics of living organisms
Chromosomes
DNA
Radioactive label gel sequencing
Four-color capillary sequencing
[Figure labels: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb; ABI 3700 four-color sequence trace]
Individual human resequencing
Next-generation DNA sequencing
New sequencing technologies…
… vast throughput, many applications
[Figure: bases per machine run (1 Mb to 1 Tb) plotted against read length (10 bp to 1,000 bp) for ABI / capillary, 454, and Illumina, SOLiD]
Sequencing chemistries
DNA base extension
DNA ligation
Church, 2005
Template clonal amplification
Church, 2005
Massively parallel sequencing
Church, 2005
Chemistry of paired-end sequencing
Double-stranded DNA is folded into a bridge shape, then separated into single strands. The end of each strand is then sequenced.
(Figure courtesy of Illumina)
Paired-end reads
• fragment amplification:
fragment length 100 - 600 bp
• fragment length limited by
amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
Features of NGS data
Short sequence reads
100-200bp
25-35bp (micro-reads)
Huge amount of sequence per run
Up to gigabases per run
Huge number of reads per run
Up to hundreds of millions
Higher error rate than Sanger sequencing
Error profile different from Sanger
Application areas of next-gen
sequencing
Application areas
• Genome resequencing
• variant discovery
• somatic mutation detection
• mutational profiling
• De novo assembly
• Identification of protein-bound DNA
• chromatin structure
• methylation
• transcription factor binding sites
Mikkelsen et al. Nature 2007
• RNA-Seq
• expression
• transcript discovery
Cloonan et al. Nature Methods, 2008
SNP and short-INDEL discovery
Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from
paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
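The depth-of-coverage bullet can be illustrated with a minimal sketch. This is not a production CNV caller: the function name `copy_number_estimates`, the fixed window size, and the assumption that the median window represents the normal diploid depth are all hypothetical choices made for this example.

```python
from statistics import median

def copy_number_estimates(read_starts, genome_length, window=1000, ploidy=2):
    """Estimate copy number per window from 0-based read start positions.

    Assumption: the median window depth corresponds to the normal
    copy number (ploidy), so each window is scaled relative to it.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median window depth is zero; cannot normalize")
    return [ploidy * c / baseline for c in counts]

# Toy data: five 1 kb windows, the third one covered at half depth.
reads = [w * 1000 + i for w in (0, 1, 3, 4) for i in range(100)]
reads += [2000 + i for i in range(50)]
print(copy_number_estimates(reads, 5000))
```

In the toy data the half-depth window comes out at an estimated copy number of 1, consistent with a heterozygous deletion. Real data would also need correction for GC content and mappability, which this sketch ignores.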
Identification of protein-bound DNA
genome sequence
aligned reads
Chromatin structure (ChIP-seq)
(Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods, 2007)
Novel transcript discovery (genes)
Mortazavi et al. Nature Methods
• novel exons
• novel transcripts containing known exons
Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
Expression profiling
[Figure: aligned reads over two genes]
Jones-Rhoades et al. PLoS Genetics, 2007
• tag counting (e.g. SAGE, CAGE)
• shotgun transcript sequencing
De novo genome sequencing
Lander et al. Nature 2001
short reads
read pairs
longer reads
assembled sequence contigs
The informatics of sequencing
Re-sequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
[Figure labels: REF, IND, GigaBayes]
The variation discovery toolbox
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
Raw data processing / base calling
• These steps are usually handled well by the
machine manufacturers’ software
Trace extraction
Base calling
• What most analysts want to see is base calls
and well-calibrated base quality values
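Base quality values are conventionally reported on the Phred scale, Q = -10 log10(P_error). The sketch below shows the conversion in both directions; the function names are illustrative and not taken from any base-calling package.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p_error):
    """Convert a base-call error probability to a Phred quality value."""
    return -10 * math.log10(p_error)

for q in (10, 20, 30, 40):
    print(q, phred_to_error(q))
```

A quality value is well calibrated when bases reported at Q20 are in fact wrong about once in 100 calls, bases at Q30 about once in 1,000 calls, and so on.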
Sequence traces are machine-specific
Base calling is increasingly left to machine manufacturers
Read mapping…
Is like a jigsaw puzzle…
…where they give you the
cover on the box
Some pieces are easier to place than others…
pieces that look like each other…
…pieces with
unique features
Repeats → multiple mapping problem
Lander et al. 2001
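The jigsaw analogy and the multiple mapping problem can be illustrated with a toy exact-match mapper. This is a simplified sketch, not how any particular read mapper works: the k-mer index, the function names, and the toy reference sequence are assumptions made for this example, and real mappers also tolerate mismatches and gaps.

```python
from collections import defaultdict

def build_index(reference, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k):
    """Return all reference positions where the read matches exactly."""
    hits = []
    for pos in index.get(read[:k], []):  # seed with the first k-mer
        if reference[pos:pos + len(read)] == read:  # verify the full read
            hits.append(pos)
    return hits

# Toy reference containing two copies of "GATTACA" (a repeat).
reference = "GATTACAGGCCGATTACATTT"
index = build_index(reference, 4)
print(map_read("GATTACA", reference, index, 4))  # read inside the repeat
print(map_read("TTACAG", reference, index, 4))   # read extending past one copy
```

The first read lies entirely within the repeat and has two equally good placements. The second read extends past the repeat into unique sequence, so only one placement survives verification.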
Paired-end (PE) reads
fragment length: 100 – 600 bp
fragment length: 1 – 10 kb
Korbel et al. Science 2007
PE reads are now the standard for whole-genome short-read
sequencing
Mapping quality values
[Figure: a read with three candidate map locations, with probabilities 0.8, 0.19, and 0.01]
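The three values 0.8, 0.19, and 0.01 can be read as placement probabilities for one read at three candidate locations. A mapping quality is commonly expressed as the Phred-scaled probability that the chosen location is wrong. The sketch below assumes that convention; the function names and the cap of 60 are hypothetical, and how a given mapper derives its placement probabilities is tool-specific.

```python
import math

def placement_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods into probabilities."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]

def mapping_quality(posteriors, cap=60.0):
    """Phred-scaled probability that the best placement is wrong."""
    p_wrong = 1.0 - max(posteriors)
    if p_wrong <= 0.0:
        return cap  # no competing placement: report the cap
    return min(cap, -10 * math.log10(p_wrong))

posteriors = placement_posteriors([0.8, 0.19, 0.01])
print(round(mapping_quality(posteriors), 2))
```

With a best placement probability of 0.8, the mapping quality is only about 7, which signals an ambiguous placement.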
SNP calling
SNP calling: what goes into it?
Base qualities
sequencing error
true polymorphism
Base coverage
Prior expectation
Bayesian SNP calling
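Bayesian posterior probability of a SNP at an alignment column covered by N reads:

\[
P(\mathrm{SNP}) =
\frac{\displaystyle \sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdot \ldots \cdot
      \frac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \cdot
      P_{Prior}(S_1, \ldots, S_N)}
     {\displaystyle \sum_{S_{i1} \in [A,C,G,T]} \cdots \sum_{S_{iN} \in [A,C,G,T]}
      \frac{P(S_{i1} \mid R_1)}{P_{Prior}(S_{i1})} \cdot \ldots \cdot
      \frac{P(S_{iN} \mid R_N)}{P_{Prior}(S_{iN})} \cdot
      P_{Prior}(S_{i1}, \ldots, S_{iN})}
\]

Annotations on the formula:
• P(S_i | R_i): base call + base quality
• P_Prior(S_i): base composition
• P_Prior(S_1, ..., S_N): expected polymorphism rate
• N: depth of coverage
[Figure: polymorphic and monomorphic permutations of the bases A, C, G, T across reads]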
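The Bayesian posterior P(SNP) on this slide can be evaluated by brute force for a handful of reads. The sketch below is an illustration under stated assumptions, not the PolyBayes implementation: it assumes uniform base composition (0.25 per base), spreads each read's error probability evenly over the three other bases, and divides the expected polymorphism rate `theta` evenly among all polymorphic permutations. All function and parameter names are hypothetical.

```python
from itertools import product
from math import prod

BASES = "ACGT"

def base_probs(called, error):
    """P(S | R): probability of each true base given the call and its error rate."""
    return {b: (1 - error) if b == called else error / 3 for b in BASES}

def snp_probability(calls, errors, theta=0.001):
    """Posterior probability that the column is polymorphic.

    Assumed priors (illustrative only):
      - P_Prior(S) = 0.25 for every base
      - monomorphic permutations share (1 - theta) by base composition
      - polymorphic permutations share theta evenly
    """
    n = len(calls)
    per_read = [base_probs(c, e) for c, e in zip(calls, errors)]
    base_prior = 0.25
    n_polymorphic = 4 ** n - 4
    numerator = 0.0
    denominator = 0.0
    for perm in product(BASES, repeat=n):  # every permutation S_1..S_N
        polymorphic = len(set(perm)) > 1
        joint_prior = theta / n_polymorphic if polymorphic else (1 - theta) * base_prior
        term = joint_prior * prod(per_read[i][s] / base_prior for i, s in enumerate(perm))
        denominator += term
        if polymorphic:
            numerator += term
    return numerator / denominator

print(snp_probability("AAAA", [0.001] * 4))  # all reads agree
print(snp_probability("AACC", [0.001] * 4))  # two alleles, high quality
```

The enumeration grows as 4^N, so this form is only practical for very shallow coverage; it is meant to make the terms of the formula concrete.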
The PolyBayes software
http://bioinformatics.bc.edu/~marth/PolyBayes
• First statistically rigorous SNP discovery
tool
• Correctly analyzes alternative cDNA
splice forms
Marth et al., Nature Genetics, 1999
SNP calling (continued)
P(B1=aacc|G1=aa)
P(B1=aacc|G1=cc)
P(B1=aacc|G1=ac)
-----a---------a---------a---------a---------c-----
P(Bi=aaaac|Gi=aa)
P(Bi=aaaac|Gi=cc)
P(Bi=aaaac|Gi=ac)
-----c---------c---------c---------c-----
P(Bn=cccc|Gn=aa)
P(Bn=cccc|Gn=cc)
P(Bn=cccc|Gn=ac)
“genotype
likelihoods”
P(SNP)
P(G1=aa|B1=aacc; Bi=aaaac; Bn= cccc)
P(G1=cc|B1=aacc; Bi=aaaac; Bn= cccc)
P(G1=ac|B1=aacc; Bi=aaaac; Bn= cccc)
Prior(G1,..,Gi,.., Gn)
-----a---------a---------c---------c-----
P(Gi=aa|B1=aacc; Bi=aaaac; Bn= cccc)
P(Gi=cc|B1=aacc; Bi=aaaac; Bn= cccc)
P(Gi=ac|B1=aacc; Bi=aaaac; Bn= cccc)
P(Gn=aa|B1=aacc; Bi=aaaac; Bn= cccc)
P(Gn=cc|B1=aacc; Bi=aaaac; Bn= cccc)
P(Gn=ac|B1=aacc; Bi=aaaac; Bn= cccc)
“genotype
probabilities”
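The step from genotype likelihoods P(B|G) to genotype probabilities P(G|B) can be sketched for a single diploid sample. This is a simplification: the slide uses a joint prior Prior(G1,..,Gi,..,Gn) across all samples, whereas the sketch applies an independent per-sample prior. The fixed per-base error rate, the uniform prior in the example, and the function names are assumptions for illustration.

```python
from math import prod

def base_likelihood(observed, allele, error):
    """P(observed base | true allele), with errors spread over the 3 other bases."""
    return 1 - error if observed == allele else error / 3

def genotype_likelihood(bases, genotype, error=0.01):
    """P(B | G): each read samples one of the two alleles with probability 1/2."""
    a1, a2 = genotype
    return prod(
        0.5 * base_likelihood(b, a1, error) + 0.5 * base_likelihood(b, a2, error)
        for b in bases
    )

def genotype_posteriors(bases, priors, error=0.01):
    """P(G | B) by Bayes' rule with a per-sample genotype prior."""
    joint = {g: genotype_likelihood(bases, g, error) * p for g, p in priors.items()}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

priors = {"aa": 1 / 3, "cc": 1 / 3, "ac": 1 / 3}  # hypothetical uniform prior
for observed in ("aacc", "aaaac", "cccc"):
    print(observed, genotype_posteriors(observed, priors))
```

For the observations aacc the heterozygous genotype ac receives nearly all of the posterior mass, and for cccc the homozygous genotype cc does.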
Insertion/deletion (INDEL) variants
These variants have been on the “radar screen” for decades
Accurate automated detection is difficult
Different mutation mechanisms
Often appear in repetitive sequence and therefore difficult to align
Often multi-allelic
Deleted allele has no base quality values
Alignment methods became more refined
Original
alignment
After left realignment
After haplotype-aware
realignment
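Left realignment places an INDEL at its leftmost equivalent position, so that the same event in a repeat is always reported at the same coordinate. The sketch below handles the deletion case only; the function name and 0-based coordinates are choices made for this example, and haplotype-aware realignment is not covered.

```python
def left_align_deletion(ref, start, length):
    """Shift a deletion of ref[start:start + length] as far left as possible.

    The deletion can move one base left whenever the base entering the
    deleted window equals the base leaving it, because the resulting
    sequence is unchanged.
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

def apply_deletion(ref, start, length):
    """Return the sequence with ref[start:start + length] removed."""
    return ref[:start] + ref[start + length:]

ref = "GGCACACACATT"  # a CA dinucleotide repeat
original = 8          # deletion of "CA" reported at the right end of the repeat
shifted = left_align_deletion(ref, original, 2)
print(shifted, apply_deletion(ref, original, 2) == apply_deletion(ref, shifted, 2))
```

Both positions describe the same deleted allele; only the representation changes.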
Medium length INDELs still a problem
Guillermo Angel
Structural variation detection
Feuk et al. Nature Reviews Genetics, 2006
Structural variant detection (cont’d)
Detection Approaches
• Read depth: good for big CNVs
• Paired-end: all types of SV
• Split-reads: good break-point resolution
• De novo assembly: ~ the future
[Figure labels: Sample, Reference, Lmap, read, contig]
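The paired-end approach compares the mapped distance of a read pair on the reference (Lmap) with the expected fragment length of the library. A minimal sketch follows, assuming the fragment length distribution is summarized by a mean and standard deviation and that a cutoff of three standard deviations is used; both are illustrative choices, as is the function name. Inversions and translocations, which show up as unexpected orientations or chromosomes, are not handled.

```python
def classify_pair(lmap, mean_fragment, sd_fragment, n_sd=3.0):
    """Classify a read pair by its mapped distance Lmap on the reference."""
    if lmap > mean_fragment + n_sd * sd_fragment:
        # Pair maps too far apart: sequence present in the reference
        # is missing from the sample.
        return "deletion candidate"
    if lmap < mean_fragment - n_sd * sd_fragment:
        # Pair maps too close together: the sample carries extra sequence.
        return "insertion candidate"
    return "concordant"

for lmap in (300, 1500, 100):
    print(lmap, classify_pair(lmap, mean_fragment=300, sd_fragment=30))
```

A single discordant pair is weak evidence; in practice several pairs supporting the same event would be required before reporting it.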
SV slides courtesy of Chip Stewart, Boston College
Relative numbers of events
SV detection – resolution
[Figure: expected CNVs plotted against CNV event length [bp], with the resolution ranges of karyotype, micro-array, and sequencing]
Standard data formats
Reads: FASTQ
Alignments: SAM/BAM
Variants: VCF
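As an illustration of the read format, a FASTQ record holds a read name, the base calls, a separator line, and one quality character per base. The sketch below assumes four-line records and the Phred+33 quality encoding; older data may use a different offset, and the function name is hypothetical.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into (name, sequence, qualities) tuples.

    Assumptions: no line-wrapped sequences, Phred+33 quality encoding.
    """
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        name = lines[i][1:]  # drop the leading '@'
        sequence = lines[i + 1]
        qualities = [ord(c) - offset for c in lines[i + 3]]
        records.append((name, sequence, qualities))
    return records

example = "@read1\nACGT\n+\nII5!\n"
print(parse_fastq(example))
```

For real data, an established parser from one of the toolkits is preferable to hand-written code.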
Tools for analyzing & manipulating 1000G
data
Alignments: SAM/BAM
• samtools: http://samtools.sourceforge.net/
• BamTools: http://sourceforge.net/projects/bamtools/
• GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Variants: VCF
• VCFTools: http://vcftools.sourceforge.net/
• VcfCTools: https://github.com/AlistairNWard/vcfCTools