DNA Sequencing

Download Report

Transcript DNA Sequencing

DNA Sequencing
CS273a 2016
DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
CS273a 2016
Human Genome Sequencing –
A Crazy Idea
1953
Franklin, Watson, Crick
1977
Sanger & colleagues
1985:
•
Robert Sinsheimer, UCSC

•
Rene Dulbecco, SALK

•
Chancellor, UC Santa Cruz, Salk
Salk Institute
Charles DeLisi, DOE

DOE
Source:
Encyclopedia of Philosophy
CS273a Stanford
2016
“DOE’s program
for unemployed
bombmakers”
The Human Genome Project
1986: Discussions
1990: Launch
1996: Map first, sequence later
1997: Weber & Myers
1998: Celera
2000: Bill Clinton:
2001: Draft
3 billion basepairs
$3 billion
CS273a 2016
2003: “Finished”
“most important
scientific discovery
in the 20th century”
Whole Genome Shotgun Sequencing
Genome Research, 1997
Let’s sequence
the human
genome with the
shotgun strategy
That is
impossible, and a
bad idea anyway
CS273a 2016
Gene Myers
Phil Green
Human Genome Race 1999-2000
• 1998: Celera, Inc
• 1999: Celera drosophila genome
• 2000:
Venter, Myers and colleagues
Lander, Collins and colleagues
 5 years ahead of time, a computational achievement
 Huge boost to sequencing technology development
CS273a 2016
Sequencing genomes in order to align them
•
•
•
•
•
•
Human, 2001
Mouse, 2002
Rat, 2003
Chicken, 2004
Dog, Chimpanzee, 2005
Many mammals & other vertebrates, 2006-today
•
Genscan (1997)

•
MUMmer, Glass (1999-2000)

•
Pairwise genome alignment
Rosetta/Glass, Twinscan (2000-1)

•
HMM for gene prediction
Human/Mouse gene prediction
MLAGAN, MAVID, BLASTZ (2003-4)

Multiple genome alignment
NHGRI FY 2005 Budget: $492M. “One of the most powerful approaches for unlocking the secrets
of the human genome is comparative genomics. [[…]] The current NHGRI-supported, large-scale
sequencing centers [[… are...]] expected to yield the equivalent of about 20 additional draft
vertebrate genomes in just the next three years.
CS273a 2016
Sequencing Growth
Cost of one human genome
• 2004:
$30,000,000
• 2008:
$100,000
• 2010:
$10,000
• 2015:
“$1,000”
• ???:
$300
CS273a 2016
SOLID
A Data-Acquisition Technology Explosion
Enabled by Inexpensive Sequencing
•
•
RNAseq
•
DNase-seq
•
ATAC-seq
•
Hi-C
CS273a 2016
ChIP-seq for

Transcription factor binding

Nucleosome positioning

Histone Modifications
•
Bisulfite treatment for
methylation
•
MeDIP for methylation
Major Data Acquisition Efforts
Human Microbiome
Roadmap Epigenomics
1000 Genomes
CS273a 2016
Sequencing Growth
Cost of one human genome
• 2004:
$30,000,000
• 2008:
$100,000
• 2010:
$10,000
• 2015:
“$1,000”
• ???:
$300
How much would you
pay for a smartphone?
CS273a 2016
Stephens ZD et al. Plos Biology 2015
How soon will we all be sequenced?
Applications
Cost
Time
2016?
2022?
• Learning from the data
• Roadblocks
CS273a 2016
Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different
members of a species
Humans: ~1/1,000
Other organisms have much higher polymorphism rates
 Population size!
CS273a 2016
Why humans are so similar
Out of Africa
N
Heterozygosity: H
H = 4Nu/(1 + 4Nu)
u ~ 10-8, N ~ 104
 H ~ 410-4
CS273a 2016
A small population that interbred
reduced the genetic variation
Out of Africa ~ 40,000 years ago
Ancient sequencing technology – Sanger
Vectors
DNA
Shake
DNA fragments
Vector
Circular genome
(bacterium, plasmid)
CS273a 2016
+
=
Known
location
(restriction
site)
Ancient sequencing technology – Sanger
Gel Electrophoresis
1.
Start at primer (restriction
site)
2.
Grow DNA chain
3.
Include dideoxynucleoside
(modified a, c, g, t)
4.
Stops reaction at all
possible points
5.
Separate products with
length, using gel
electrophoresis
CS273a 2016
Illumina cluster concept
CS273a 2016
Slide Credit: Arend Sidow
Cluster generation (‘bridge amplification’)
CS273a 2016
Slide Credit: Arend Sidow
Clonally Amplified Molecules on Flow Cell
1µM
CS273a 2016
Slide Credit: Arend Sidow
Sequencing by Synthesis, One Base at a Time
3’-
…-5’
G T A T T T T C G G C A C A G
A
G
A
C
T
C
Cycle 1:
T
G
T
Add sequencing reagents
First base incorporated
Remove unincorporated bases
Detect signal
Cycle 2-n:
CS273a 2016
Add sequencing reagents and repeat
Slide Credit: Arend Sidow
Method to sequence longer regions
genomic segment
cut many times at
random (Shotgun)
Get one or two reads from
each segment
~900 bp
CS273a 2016
~900 bp
Two main assembly problems
• De Novo Assembly
• Resequencing
CS273a 2016
Reconstructing the Sequence
(De Novo Assembly)
reads
Cover region with high redundancy
Overlap & extend reads to reconstruct the original genomic region
CS273a 2016
Definition of Coverage
C
Length of genomic segment:
Number of reads:
Length of each read:
G
N
L
Definition:
C=NL/G
Coverage
How much coverage is enough?
Lander-Waterman model: Prob[ not covered bp ] = e-C
Assuming uniform distribution of reads, C=10 results in 1
gapped region /1,000,000 nucleotides
CS273a 2016
Repeats
Bacterial genomes:
Mammals:
5%
50%
Repeat types:
•
Low-Complexity DNA (e.g. ATATATATACATA…)
•
Microsatellite repeats
•
Transposons
 SINE
(a1…ak)N where k ~ 3-6
(e.g. CAGCAGTAGCAGCACCAG)
(Short Interspersed Nuclear Elements)
e.g., ALU: ~300-long, 106 copies
 LINE
 LTR retroposons
(Long Interspersed Nuclear Elements)
~4000-long, 200,000 copies
(Long Terminal Repeats (~700 bp) at each end)
cousins of HIV
•
Gene Families
genes duplicate & then diverge (paralogs)
•
Recent duplications
~100,000-long, very similar copies
CS273a 2016
Sequencing and Fragment Assembly
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3x109 nucleotides
50% of human DNA is composed of repeats
CS273a 2016
Error!
Glued together two distant regions
What can we do about repeats?
Two main approaches:
• Cluster the reads
•
Link the reads
CS273a 2016
What can we do about repeats?
Two main approaches:
• Cluster the reads
•
Link the reads
CS273a 2016
What can we do about repeats?
Two main approaches:
• Cluster the reads
•
Link the reads
CS273a 2016
Sequencing and Fragment Assembly
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3x109 nucleotides
A
R
B
ARB, CRD
or
C
CS273a 2016
R
D
ARD, CRB ?
Sequencing and Fragment Assembly
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3x109 nucleotides
CS273a 2016
Long Reads
The Holy Grail
CS273a 2016
Long Reads - PacBio
Chemistry
RS II: P4-C2
RS II: P5-C3
Optimized For
higher quality
longer reads
Run time
180 min
180 min
Total output
~275 Mb
~375 Mb
Output/day
~2.2 Gb
~3 Gb
Mean read length
~5.5 kb
~8.5 kb
Top 5% of reads
>11 kb
>18 kb
Single pass accuracy
~86%
~83%
>99.999%
>99.98%
~50k
~50k
Instrument price
~$700k
~$700k
Run price
~$400
~$400
Consensus (50X)
accuracy
# of reads
CS273a 2016
Long Reads – Oxford Nanopore
Read length: 50,000+?
Cost ?
CS273a 2016
10x System
Massively Parallel Partitioning
10X Instrument & Reagents
Read Clouds (“linked reads”)
Hap1
Hap2
CS273a 2016
X 3,000,000
Phased 60Kb deletion
Read Clouds
Partition 1
xC
R
Synthetic Long Reads (SLR):
CR >= 50x
Read Clouds:
CR < 1x
Partition n
CFCR
Coverage = Xx
CS273a 2016
C
XF
Fragment Assembly
(in whole-genome shotgun sequencing)
CS273a 2016
Fragment Assembly
Given N reads…
Where N ~ 30
million…
We need to use a
linear-time
algorithm
CS273a 2016
Steps to Assemble a Genome
Some Terminology
1. Find
overlapping
reads
read
a 500-900
long word
that comes
out of sequencer
mate pair a pair of reads from two ends
2. Merge
some
pairs of reads into
of the
same“good”
insert fragment
longer contigs
contig
a contiguous sequence formed
by several overlapping reads
with
no gaps
3. Link
contigs
to form supercontigs
supercontig an ordered and oriented set
(scaffold)
of contigs, usually by mate
pairs
4. Derive consensus sequence
consensus sequence derived from the
sequene
multiple alignment of reads
in a contig
CS273a 2016
..ACGATTACAATAGGTT..
1. Find Overlapping Reads
aaactgcagtacggatct
aaactgcag
aactgcagt
…
gtacggatct
tacggatct
gggcccaaactgcagtac
gggcccaaa
ggcccaaac
…
actgcagta
ctgcagtac
gtacggatctactacaca
gtacggatc
tacggatct
…
ctactacac
tactacaca
CS273a 2016
(read, pos., word, orient.)
(word, read, orient., pos.)
aaactgcag
aactgcagt
actgcagta
…
gtacggatc
tacggatct
gggcccaaa
ggcccaaac
gcccaaact
…
actgcagta
ctgcagtac
gtacggatc
tacggatct
acggatcta
…
ctactacac
tactacaca
aaactgcag
aactgcagt
acggatcta
actgcagta
actgcagta
cccaaactg
cggatctac
ctactacac
ctgcagtac
ctgcagtac
gcccaaact
ggcccaaac
gggcccaaa
gtacggatc
gtacggatc
tacggatct
tacggatct
tactacaca
1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
• Caveat: repeats
 A k-mer that occurs N times, causes O(N2) read/read comparisons
 ALU k-mers could cause up to 1,000,0002 comparisons
• Solution:
 Discard all k-mers that occur “too often”
• Set cutoff to balance sensitivity/speed tradeoff, according to genome at
hand and computing resources available
CS273a 2016
1. Find Overlapping Reads
Create local multiple alignments from the
overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
CS273a 2016
1. Find Overlapping Reads
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
insert A
replace T with C
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
correlated errors—
probably caused by repeats
 disentangle overlaps
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
In practice, error correction removes
up to 98% of the errors
CS273a 2016
TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA
2. Merge Reads into Contigs
• Overlap graph:
 Nodes: reads r1…..rn
 Edges: overlaps (ri, rj, shift, orientation, score)
Reads that come
from two regions of
the genome (blue
and red) that contain
the same repeat
Note:
of course, we don’t
know the “color” of
these nodes
CS273a 2016
2. Merge Reads into Contigs
repeat region
Unique Contig
Overcollapsed Contig
We want to merge reads up to potential repeat boundaries
CS273a 2016
2. Merge Reads into Contigs
• Remove transitively inferable overlaps
 If read r overlaps to the right reads r1, r2,
and r1 overlaps r2, then (r, r2) can be
inferred by (r, r1) and (r1, r2)
CS273a 2016
r
r1
r2
r3
2. Merge Reads into Contigs
CS273a 2016
Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
 Read that spans across a repeat disambiguates order of flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK
 We throw overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
 Increase read length
 Decrease sequencing error rate
Role of error correction:
Discards up to 98% of single-letter sequencing errors
decreases error rate
 decreases effective repeat content
 increases contig length
CS273a 2016
3. Link Contigs into Supercontigs
Normal density
Too dense
 Overcollapsed
Inconsistent links
 Overcollapsed?
CS273a 2016
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if  2 forward-reverse links
supercontig
(aka scaffold)
CS273a 2016
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
Complex algorithmic step
•
•
CS273a 2016
Exponential number of paths
Forward-reverse links
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
CS273a 2016
De Brujin Graph formulation
• Given sequence x1…xN, k-mer length k,
Graph of 4k vertices,
Edges between words with (k-1)-long overlap
Reads
de Bruijn Graph
AAGA
ACTT
ACTC
ACTG
AGAG
CCGA
CGAC
CTCC
CTGG
CTTT
…
CCG
CS273a 2016
TCC
CGA
AAG
AGA
Potential Genomes
AAGACTCCGACTGGGACTTT
CTC
GAC
ACT
GGA
CTT
AAGACTGGGACTCCGACTTT
TTT
CTG
GGG
TGG
Slide by Michael Schatz
Node Types
Isolated nodes (10%)
Tips (46%)
Bubbles/Non-branch (9%)
Dead Ends (.2%)
Half Branch (25%)
Full Branch (10%)
(Chaisson, 2009)
CS273a 2016
Slide by Michael Schatz
Error Correction
 Errors at end of read
B’
• Trim off ‘dead-end’ tips
B
A
A
B
 Errors in middle of read
• Pop Bubbles
B’
C
A
A
C
B*
B
 Chimeric Edges
A
Slide by Michael Schatz
A
B
C
D
x
• Clip short, low coverage nodes
CS273a 2016
B
C
D
De Brujin Graph formulation
CS273a 2016
Quality of assemblies—mouse
Terminology: N50 contig length
If we sort contigs from largest to smallest, and start
Covering the genome in that order, N50 is the length
Of the contig that just covers the 50th percentile.
7.7X
sequence
coverage
CS273a 2016
Panda Genome
CS273a 2016
Hominid lineage
CS273a 2016
Orangutan genome
CS273a 2016
Assemblathon
CS273a 2016
Assemblathon
CS273a 2016
History of WGA
1997
• 1982: -virus, 48,502 bp
• 1995: h-influenzae,
Let’s sequence
the human
1 genome
Mbp with the
shotgun strategy
• 2000: fly, 100 Mbp
• 2001 – present
That
 human (3Gbp), mouse (2.5Gbp),
ratis*, chicken, dog, chimpanzee,
several fungal genomes impossible, and a
bad idea anyway
Gene Myers
CS273a 2016
Phil Green