Transcript Indel

From calling bases to calling variants:
Experiences with Illumina data
Gerton Lunter
Wellcome Trust Centre for Human Genetics
This talk

Refresher: Illumina sequencing

QC



What can go wrong
Useful QC statistics
Read mapping


Comparison of popular read mappers
Stampy

Indel and SNP calling

(Some results: 1000 Genomes indel calls)
7x Illumina GA-II
2x Roche 454
WTCHG production (Gb)
700
600
500
400
300
200
100
0
1x Illumina HiSeq 2000
1. Refresher: Illumina sequencing
Illumina sequencing
Illumina sequencing

8 lanes…
x 120 tiles x108 bp x 2 reads…
= about 48 Gb raw bp
2. QC
Quality issues

Bases are identified by their fluorescent tag


Single base per cycle: reversible terminator chemistry



Not perfect: fraction lags, fraction runs ahead: “Phasing”
Limits read length
Optimizing yield: cluster density




Overlapping emission spectra
Higher densities mean more errors
Above an optimum, yield decreases
Partly signal processing issue: software improvements
Low amounts of initial DNA

Linker-linker hybrids; duplicated reads
Overlapping fluorescence spectra
 C/A
and G/T overlap
 (Most common mutations are transitions, A-G and C-T)
Rougemont et al, 2008
Refresher: Phred scores

Phred score = 10 log10( probability of error )
 10:
10% error probability
 20: 1% error probability
 30: 0.1% error probability (one in 1,000)
3 = 50%,
7 = 20%
 13 = 5%,
17 = 2%
 23 = 0.5%, 27 = 0.2%
 33 = 0.05%, 37 = 0.02%

Phasing
August 2009
June 2010
Cluster density & other improvements
August 2009:
June 2010:
Library complexity, duplicate reads

Some sequences are read several times:



Problem for variant calling



Criterion: both ends of a PE read map to matching location
Can occur by chance, but low probability, except for very high coverage
Post processing: duplicate removal


Any PCR error will be seen twice: evidence for variant
Rate of duplicates is rarely >5%


Low amount of initial material, many PCR copies
Optical duplicates; secondary cluster seeding
Standard processing step (e.g. Samtools, Picard)
Useful statistic:


Duplicate fraction is approximately additive across lanes (same library)
2x duplication fraction ≈ fraction of the library that was sequenced
Library complexity, duplicate reads
Fraction α of all molecules is sequenced
Number of times a PCR copy is sequenced: Poisson(α)
n=
0
1
2
3
…
Poisson
e-α
α e-α
α2 e-α/2
α3 e-α/6
…
0
1
2
…
Duplicates: 0
Expected fraction of duplicates: e-α-1+α
As a fraction of all reads sequenced:
(e-α-1+α)/α
=
½α+…
Sequencing QC
QC statistics
QC statistics - coverage
QC statistics – quality scores
GATK recalibration tool
3. Read mapping
Read mapping

First processing step after sequencing:
Read mapping (most times)
 Assembly (no reference sequence; specialized analyses)


Quality of mapping determines downstream results
Accessible genome
 Biases
(ref vs. variant)
 Sensitivity (divergent reference; snps, indels, SV)
 Specificity (calibration of mapping quality)

Read mapper comparison

Read mappers:
 Maq
 BWA
 Eland
 Novoalign
 Stampy

Criteria:
 Sensitivity
(overall; divergent reference; variants)
 Specificity (mapping quality calibration)
 Speed
Sensitivity
Sensitivity - indels
Sensitivity – Divergent reference
Specificity – ROC curves
ROC - indels
Proportion mapped to within 10kb of mate
Performance on real data
Efficiency
Stampy – first part of algorithm
read
15 bp
subsequence
4 bytes x
229 entry (2 Gb)
hash table
candidate
positions
Remove
rev-comp
symmetry
29 bit word
open addressing,
cache-friendly
Second part: Fast candidate alignment
Single-instructionmultiple-data (SIMD),
parallel execution
Affine gap penalties.
Linear-time and constant
memory algorithm:
DP table in registers.
Maximum indel size 15 bp.
Third part: Modeling mapping failures

Pseudo-bayesian posterior
(using candidates, rather than all mapping positions)

Failure to find the correct candidate
(2 or more mismatches in every 15bp subsequence)

Sequence not in reference
(is sequence match better than expected best random match?)
4. SNP and indel calling
SNP calling

General idea:

Works quite well! Some caveats:
Include mapping quality:
P(read|g) = P(read | wrong map) P(wrong map) +
P(read | g, correct map) P(correct map)
 Mapping errors are dependent: don’t include mapQ<10
 Base errors are not uniform (A/C/G/T): assume worst case (all identical)
 Assumes no anomalies (seg dups; alignments; indel/SV; …)


Hard problem: be conservative
Expected SNP rate (human): 10-3/nt. FPR of 10-5 required for 1% FDR
 Filtering is required to achieve good FDR –
or all data features must be adequately modeled

Indel calling

General idea:

Differences with SNP calling:
 Pseudo-Bayes:
cannot consider all possible variants/genotypes
Generate large set of candidates
Filter using goodness-of-fit test
 Illumina
reads do not have an explicit indel error model
Indel error model
Indel error rate (per nt)
1.00E-02
1.00E-03
1.00E-04
1.00E-05
1
2
3
4
5
6
7
8
9
10
11
12
Homopolymer run length
13
14
15
Wrap up

GA-II produces large amounts of good data

Artefacts do occur, keep a look at QC statistics

Choice of mapper influences yield and quality

Variant calling:
Bayesian approaches work well
 Some assumptions (independence) not met, hard to model
 Filtering remains necessary
