Transcript Indel
From calling bases to calling variants:
Experiences with Illumina data
Gerton Lunter
Wellcome Trust Centre for Human Genetics
This talk
Refresher: Illumina sequencing
QC
What can go wrong
Useful QC statistics
Read mapping
Comparison of popular read mappers
Stampy
Indel and SNP calling
(Some results: 1000 Genomes indel calls)
7x Illumina GA-II
2x Roche 454
WTCHG production (Gb)
700
600
500
400
300
200
100
0
1x Illumina HiSeq 2000
1. Refresher: Illumina sequencing
Illumina sequencing
Illumina sequencing
8 lanes…
x 120 tiles x108 bp x 2 reads…
= about 48 Gb raw bp
2. QC
Quality issues
Bases are identified by their fluorescent tag
Single base per cycle: reversible terminator chemistry
Not perfect: fraction lags, fraction runs ahead: “Phasing”
Limits read length
Optimizing yield: cluster density
Overlapping emission spectra
Higher densities mean more errors
Above an optimum, yield decreases
Partly signal processing issue: software improvements
Low amounts of initial DNA
Linker-linker hybrids; duplicated reads
Overlapping fluorescence spectra
C/A
and G/T overlap
(Most common mutations are transitions, A-G and C-T)
Rougemont et al, 2008
Refresher: Phred scores
Phred score = 10 log10( probability of error )
10:
10% error probability
20: 1% error probability
30: 0.1% error probability (one in 1,000)
3 = 50%,
7 = 20%
13 = 5%,
17 = 2%
23 = 0.5%, 27 = 0.2%
33 = 0.05%, 37 = 0.02%
Phasing
August 2009
June 2010
Cluster density & other improvements
August 2009:
June 2010:
Library complexity, duplicate reads
Some sequences are read several times:
Problem for variant calling
Criterion: both ends of a PE read map to matching location
Can occur by chance, but low probability, except for very high coverage
Post processing: duplicate removal
Any PCR error will be seen twice: evidence for variant
Rate of duplicates is rarely >5%
Low amount of initial material, many PCR copies
Optical duplicates; secondary cluster seeding
Standard processing step (e.g. Samtools, Picard)
Useful statistic:
Duplicate fraction is approximately additive across lanes (same library)
2x duplication fraction ≈ fraction of the library that was sequenced
Library complexity, duplicate reads
Fraction α of all molecules is sequenced
Number of times a PCR copy is sequenced: Poisson(α)
n=
0
1
2
3
…
Poisson
e-α
α e-α
α2 e-α/2
α3 e-α/6
…
0
1
2
…
Duplicates: 0
Expected fraction of duplicates: e-α-1+α
As a fraction of all reads sequenced:
(e-α-1+α)/α
=
½α+…
Sequencing QC
QC statistics
QC statistics - coverage
QC statistics – quality scores
GATK recalibration tool
3. Read mapping
Read mapping
First processing step after sequencing:
Read mapping (most times)
Assembly (no reference sequence; specialized analyses)
Quality of mapping determines downstream results
Accessible genome
Biases
(ref vs. variant)
Sensitivity (divergent reference; snps, indels, SV)
Specificity (calibration of mapping quality)
Read mapper comparison
Read mappers:
Maq
BWA
Eland
Novoalign
Stampy
Criteria:
Sensitivity
(overall; divergent reference; variants)
Specificity (mapping quality calibration)
Speed
Sensitivity
Sensitivity - indels
Sensitivity – Divergent reference
Specificity – ROC curves
ROC - indels
Proportion mapped to within 10kb of mate
Performance on real data
Efficiency
Stampy – first part of algorithm
read
15 bp
subsequence
4 bytes x
229 entry (2 Gb)
hash table
candidate
positions
Remove
rev-comp
symmetry
29 bit word
open addressing,
cache-friendly
Second part: Fast candidate alignment
Single-instructionmultiple-data (SIMD),
parallel execution
Affine gap penalties.
Linear-time and constant
memory algorithm:
DP table in registers.
Maximum indel size 15 bp.
Third part: Modeling mapping failures
Pseudo-bayesian posterior
(using candidates, rather than all mapping positions)
Failure to find the correct candidate
(2 or more mismatches in every 15bp subsequence)
Sequence not in reference
(is sequence match better than expected best random match?)
4. SNP and indel calling
SNP calling
General idea:
Works quite well! Some caveats:
Include mapping quality:
P(read|g) = P(read | wrong map) P(wrong map) +
P(read | g, correct map) P(correct map)
Mapping errors are dependent: don’t include mapQ<10
Base errors are not uniform (A/C/G/T): assume worst case (all identical)
Assumes no anomalies (seg dups; alignments; indel/SV; …)
Hard problem: be conservative
Expected SNP rate (human): 10-3/nt. FPR of 10-5 required for 1% FDR
Filtering is required to achieve good FDR –
or all data features must be adequately modeled
Indel calling
General idea:
Differences with SNP calling:
Pseudo-Bayes:
cannot consider all possible variants/genotypes
Generate large set of candidates
Filter using goodness-of-fit test
Illumina
reads do not have an explicit indel error model
Indel error model
Indel error rate (per nt)
1.00E-02
1.00E-03
1.00E-04
1.00E-05
1
2
3
4
5
6
7
8
9
10
11
12
Homopolymer run length
13
14
15
Wrap up
GA-II produces large amounts of good data
Artefacts do occur, keep a look at QC statistics
Choice of mapper influences yield and quality
Variant calling:
Bayesian approaches work well
Some assumptions (independence) not met, hard to model
Filtering remains necessary