Information Theory of DNA Sequencing

Transcript Information Theory of DNA Sequencing

Information Theory of DNA Sequencing
David Tse
Dept. of EECS
U.C. Berkeley
Guy Bresler
LIDS Student Conference
MIT
Abolfazl Motahari
Feb. 2, 2012
Research supported by NSF Center for Science of Information.
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
courtesy: Batzoglou
Impetus: Human Genome Project
1990: Start
3 billion basepairs
2001: Draft
2003: Finished
courtesy: Batzoglou
Sequencing Gets Cheaper and Faster
Cost of one human genome
• HGP:
$ 3 billion
• 2004:
$30,000,000
• 2008:
$100,000
• 2010:
$10,000
• 2011:
$4,000
• 2012-13:
$1,000
• ???:
$300
courtesy: Batzoglou
But many genomes to sequence
100 million species
(e.g. phylogeny)
7 billion individuals
(SNP, personal genomics)
1013 cells in a human
(e.g. somatic mutations
such as HIV, cancer)
courtesy: Batzoglou
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
Sequencing Technologies
• HGP era: single technology
(Sanger)
• Current: multiple “next generation”
technologies (eg. Illumina, SoLiD,
Pac Bio, Ion Torrent, etc.)
• All provide massively parallel
sequencing.
• Each technology has different
read lengths, noise profiles, etc
Assembly Algorithms
• Many proposed algorithms.
• Different algorithms tailored to different technologies.
• Each algorithm deals with the full complexity of the
problem while trying to scale well with the massive
amount of data.
• Lots of heuristics used in the design.
A Basic Question
• What is the minimum number of reads needed to
reconstruct with a given reliability?
• A benchmark for comparing different algorithms.
• An algorithm-independent basis for comparing
different technologies and designing new ones.
Coverage Analysis
• Pioneered by Lander-Waterman
• What is the minimum number of reads to ensure there is
no gap between the reads with a desired prob.?
• Only provides a lower bound.
• Can one get a tight lower bound?
Communication and Sequencing:
An Analogy
Communication:
Sequencing:
Communication: Fundamental Limits
Given statistical models for source and channel:
Cchannel = channel capacity
H source = source entropy rate
Shannon 48
Asymptotically reliable communication at rate R source symbols
per channel output symbol if and only if:
R<
C ch a n n el
H so u r ce
DNA Sequencing: Fundamental Limits?
S1 ; S2 ; : : : ; SG
R1; R2; : : : ; RN
• Define: sequencing rate R = G/N basepairs per read
• Question: can one define a sequencing capacity C
such that:
asymptotically reliable reconstruction is possible if
and only if R < C?
A Simple Model
• DNA sequence: i.i.d. with distribution p.
• Starting positions of reads are i.i.d. uniform on the
DNA sequence.
• Read process is noiseless.
Will extend to more complex source model and noisy
read process later.
The read channel
read
channel
AGCTTATAGGTCCGCATTACC
• Capacity depends on
– read length: L
L ") C "
– DNA length: G
G ") C #
• Normalized read length:
L¹ :=
L
log G
• Eg. L = 100, G = 3 £ 109 : L¹ = 4:6
AGGTCC
Result: Sequencing Capacity
C = L¹
X4
H 2 (p) = ¡ log
p2i
i= 1
Renyi entropy of order 2
C= 0
Coverage Constraint
G
T
L
N reads
Starting positions of reads ~ Poisson(1/R)
R=
G
N
¡
E[# of gaps] = N ¢P[T > L] = N e
,
E [# of gaps] ! 0
R < L¹
L
R
No-Duplication Constraint
L
L
L
L
The two possibilities have the same set of length L subsequences.
Ã
E [# of duplicat ed pairs] ¼ N 2 ¢
X4
!
p2i
i= 1
2 ¡ L H 2 (p )
= N e
,
E [# of duplicat ed pairs] ! 0
2
¹L >
H 2 (p)
L
Achievability
no-duplication constraint
coverage constraint
achievable?
Greedy Algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge
them into a new contig.
3. Repeat step 2 until only one contig remains or no
more merging can be done.
Algorithm progresses in stages:
at stage ` = L ¡ 1; L ¡ 2; : : : ; 0
merge reads at overlap `
Greedy algorithm: the beginning
gap
Most reads have large overlap with neighbors
Expected # of errors in stage L-1:
probability two disjoint
reads are equal
Very small since no-duplication constraint is satisfied.
Greedy algorithm: stage
Expected # of errors at stage
probability two disjoint
reads appear to
overlap
¡
¢
¡ L =R 2
This may get larger, but no larger than N e
Very small since coverage constraint is satisfied.
when ` = 0
Summary: Two Regimes
coverage-limited
regime
duplication-limited
regime
Relation to Earlier Works
• Coverage constraint: Lander-Waterman 88
• No-duplication constraint: Arratia et al 96
• Arratia et al focused on a model where all length L
subsequences are given (seq. by hybridization)
• Our result: the two constraints together are
necessary and sufficient for shotgun sequencing.
Rest of Talk
• Impact of read noise.
• Impact of repeats in DNA sequence
Read Noise
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT
TATA CTTA
Model:
discrete memoryless channel defined by transition
probabilities
Modified Greedy algorithm
X
Y
Y
Do we merge the two reads at overlap
?
We observe two strings: X and Y.
Are they noisy versions of the same DNA subsequence? (merge)
Or from two different locations? (do not merge)
This is a hypothesis testing problem!
Impact on Sequencing Rate
H0: noisy versions of the same DNA subsequence (merge)
H1: from disjoint DNA subsequences (do not merge)
• Hypothesis test:
MAP rule: declare H0 if
X
Y
Y
Two types of error:
• false positive (same as before)
• missed detection (new type of error)
Impact on Sequencing Rate
no-duplication constraint
coverage constraint
obtained by optimizing MAP threshold
More Complex DNA Statistics
• i.i.d. is not a very good model for the DNA sequence.
• More generally, we may want to model it as a
correlated random process.
• For short-scale correlation, H2(p) can be replaced by
the Renyi entropy rate of the process.
1
H 2 = lim ¡ log P(x ` = y ` )
`! 1
`
• But for higher mammals, DNA contains long repeats,
repeat length comparable or longer than reads.
• This is handled by paired-end reads in practice.
A Simple Model for Repeats
K
Model: M repeats of length K placed uniformly into DNA sequence
If repeat length K>> read length L, how to reconstruct
sequence?
Use paired-end reads:
J
These reads can bridge the repeats
reads come in pairs
with known separation
Impact on Sequencing Rate
K= repeat length
J = paired-end
separation
constant
indep of K
If J > 2d + K
then capacity
is the same as
without repeats
no-duplication constraint
coverage
of
repeats
constraint
constraint
Conclusion
• DNA sequencing is an important problem.
• Many new technologies and new applications.
• An analogy between sequencing and communication
is drawn.
• A notion of sequencing capacity is formulated.
• A principled design framework?

Information Theory of DNA Sequencing

Transcript Information Theory of DNA Sequencing

Directory