Biostat Jhsph Edu Hji Courses Genomics Sequencing Ppt

Download Report

Transcript Biostat Jhsph Edu Hji Courses Genomics Sequencing Ppt

Special Topics in Genomics
Next-generation Sequencing
Work flow of conventional versus secondgeneration sequencing
(a) With high-throughput shotgun Sanger sequencing,
genomic DNA is fragmented, then cloned to a plasmid
vector and used to transform E. coli. For each sequencing
reaction, a single bacterial colony is picked and plasmid
DNA isolated. Each cycle sequencing reaction takes place
within a microliter-scale volume, generating a ladder of
ddNTP-terminated, dye-labeled products, which are
subjected to high-resolution electrophoretic separation
within one of 96 or 384 capillaries in one run of a
sequencing instrument. As fluorescently labeled fragments
of discrete sizes pass a detector, the four-channel
emission spectrum is used to generate a sequencing
trace.
(b) In shotgun sequencing with cyclic-array methods,
common adaptors are ligated to fragmented genomic
DNA, which is then subjected to one of several protocols
that results in an array of millions of spatially immobilized
PCR colonies or 'polonies'15. Each polony consists of
many copies of a single shotgun library fragment. As all
polonies are tethered to a planar array, a single microliterscale reagent volume (e.g., for primer hybridization and
then for enzymatic extension reactions) can be applied to
manipulate all array features in parallel. Similarly, imagingbased detection of fluorescent labels incorporated with
each extension can be used to acquire sequencing data
on all features in parallel. Successive iterations of
enzymatic interrogation and imaging are used to build up a
contiguous sequencing read for each array feature.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135
- 1145 (2008)
Available next-generation sequencing
platforms
•
•
•
•
•
•
Illumina/Solexa
ABI SOLiD
Roche 454
Polonator
HeliScope
…
Example: Illumina/Solexa
1. Prepare genomic DNA
2. Attach DNA to surface
3. Bridge amplification
4. Fragement become
double stranded
5. Denature the double
stranded molecules
6. Complete amplification
Illumina/Solexa
7. Determine first base
8. Image first base
9. Determine second base
10. Image second base
11. Sequence reads over
multiple cycles
12. Align data.
>50 milliion clusters/flow
cell, each 1000 copies of
the same template, 1
billion bases per run, 1%
of the cost of capillarybased method.
(From:
http://www.illumina.com/
downloads/SS_DNAseque
ncing.pdf)
Clonal amplification of sequencing features in the
second-generation sequencing
(a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR20 to amplify clonal sequencing features. In brief, an in
vitro–constructed adaptor-flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified
(that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise
adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5'-attached) of micron-scale
beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having
either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is
present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification
products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to
amplification of a single molecule from the template library. (b) The Solexa technology relies on bridge PCR21, 22 (aka 'cluster
PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library is PCR amplified, but
both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence,
amplification products originating from any given member of the template library remain locally tethered near the point of origin. At
the conclusion of the PCR, each clonal cluster contains 1,000 copies of a single member of the template library. Accurate
measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding
overcrowding.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Strategies for cyclic array sequencing
(a)
With the 454 platform, clonally amplified 28-m beads generated by emulsion PCR serve
as sequencing features and are randomly deposited to a microfabricated array of
picoliter-scale wells. With pyrosequencing, each cycle consists of the introduction of a
single nucleotide species, followed by addition of substrate (luciferin, adenosine 5'phosphosulphate) to drive light production at wells where polymerase-driven
incorporation of that nucleotide took place. This is followed by an apyrase wash to
remove unincorporated nucleotide.
(b) With the Solexa technology, a dense array of clonally amplified sequencing features is
generated directly on a surface by bridge PCR. Each sequencing cycle includes the
simultaneous addition of a mixture of four modified deoxynucleotide species, each
bearing one of four fluorescent labels and a reversibly terminating moiety at the 3'
hydroxyl position. A modified DNA polymerase drives synchronous extension of primed
sequencing features. This is followed by imaging in four channels and then cleavage of
both the fluorescent labels and the terminating moiety.
(c) With the SOLiD and the Polonator platforms, clonally amplified 1-m beads are used to
generate a disordered, dense array of sequencing features. Sequencing is performed
with a ligase, rather than a polymerase. With SOLiD, each sequencing cycle introduces
a partially degenerate population of fluorescently labeled octamers. The population is
structured such that the label correlates with the identity of the central 2 bp in the
octamer (the correlation with 2 bp, rather than 1 bp, is the basis of two-base encoding).
After ligation and imaging in four channels, the labeled portion of the octamer (that is,
'zzz') is cleaved via a modified linkage between bases 5 and 6, leaving a free end for
another cycle of ligation. Several such cycles will iteratively interrogate an evenly
spaced, discontiguous set of bases. The system is then reset (by denaturation of the
extended primer), and the process is repeated with a different offset (e.g., a primer set
back from the original position by one or several bases) such that a different set of
discontiguous bases is interrogated on the next round of serial ligations.
(d) With the HeliScope platform, single nucleic acid molecules are sequenced directly, that is,
there is no clonal amplification step required. Poly-A–tailed template molecules are
captured by hybridization to surface-tethered poly-T oligomers to yield a disordered
array of primed single-molecule sequencing templates. Templates are labeled with Cy3,
such that imaging can identify the subset of array coordinates where a sequencing read
is expected. Each cycle consists of the polymerase-driven incorporation of a single
species of fluorescently labeled nucleotide at a subset of templates, followed by
fluorescence imaging of the full array and chemical cleavage of the label.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Conventional sequencing
• Can sequence up to 1,000 bp, and per-base
'raw' accuracies as high as 99.999%. In the
context of high-throughput shotgun genomic
sequencing, Sanger sequencing costs on the
order of $0.50 per kilobase.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Second-generation DNA sequencing
technologies
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Applications of next-generation sequencing
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Base calling
Schematic representation of main Illumina noise factors.
(a–d) A DNA cluster comprises identical DNA templates (colored boxes) that are attached to the flow cell.
Nascent strands (black boxes) and DNA polymerase (black ovals) are depicted.
(a) In the ideal situation, after several cycles the signal (green arrows) is strong, coherent and corresponds to
the interrogated position.
(b) Phasing noise introduces lagging (blue arrows) and leading (red arrow) nascent strands, which transmit a
mixture of signals.
(c) Fading is attributed to loss of material that reduces the signal intensity (c).
(d) Changes in the fluorophore cross-talk cause misinterpretation of the received signal (teal arrows; d). For
simplicity, the noise factors are presented separately from each other.
Erlich et al. Nature Methods 5: 679-682 (2008)
Base calling: Alta-Cyclic
The training process (green arrows) starts with creation of the training set, beginning with sequences generated by the standard
Illumina pipeline, by linking intensity reads and a corresponding genome sequence (the 'correct' sequence). Then, two grid
searches are used to optimize the parameters to call the bases. After optimization, a final SVM array is created, each of which
corresponds to a cycle. In the base-calling stage (blue arrows), the intensity files of the desired library undergo deconvolution to
correct for phasing noise using the optimized values and are sent for classification with the SVM array. The output is processed,
and sequences and quality scores are reported.
Erlich et al. Nature Methods 5: 679-682 (2008)
Alta-Cyclic performance
(a) Analysis of the HepG2 RNA library using Alta-Cyclic. The absolute number of additional fully correct reads (in addition to
those generated by the Illumina base caller) is indicated by the red line; the fold change of the improvement is indicated by the
blue bars. (b) A comparison of fully correct reads for the Tetrahymena micronuclear library by the Illumina base caller and AltaCyclic. (c) The average error rate in calls of the artificial SNP locations in the phi X library as a function of the cycle in which
they were called. The dashed line represents 1% error rate (Q20). The plot on the right shows the last 18 cycles in a different
scale. (d) A comparison of fully correct reads for the phi X library with 1% artificial SNPs. (e) Phi X sequences generated by
Alta-Cyclic or Illumina were exhaustively aligned to the reference genome (allowing up to 53 mismatches out of 78). The
distribution of alignment scores is shown beginning with an identical number of raw reads for input into each base caller.
Erlich et al. Nature Methods 5: 679-682 (2008)
ChIP-Seq
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
ChIP-Seq Analysis
Alignment
Peak Detection
Annotation
Visualization
Sequence
Analysis
Motif Analysis
Alignment
•
•
•
•
•
ELAND
Bowtie
SOAP
SeqMap
…
Peak detection
•
•
•
•
•
•
•
•
FindPeaks
CHiPSeq
BS-Seq
SISSRs
QuEST
MACS
CisGenome
…
Two common designs
• One sample experiment
contains only a ChIP’d sample
• Two sample experiment
contains a ChIP’d sample and a negative
control sample
One sample analysis
A simple way is the sliding window method
Poisson background model is commonly used to estimate error rate
ki ~ Poisson(λ0)
ki
Or people use Monte Carlo simulations
Both are based on the assumption that read sampling rate is a constant
across the genome.
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
The constant rate assumption does not hold!
Negative binomial model fits the data better!
ki | λi ~ Poisson(λi)
λi ~ Gamma(α, β)
Marginally,
ki ~ NegBinom(α, β)
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
FDR estimation based on Poisson and negative
binomial model
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Read direction provides extra information
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
CisGenome procedure
Alignment
Exploration
FDR
computation
Negative binomial model
Peak Detection
Post
Processing
Use read direction to refine
peak boundary and filter
low quality peaks
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Two sample analysis
Reason: read sample rates at the same genomic locus are correlated across different
samples.
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
CisGenome two sample analysis
Alignment
k1i
Exploration
k2i
ni =k1i + k2i
k1i | ni ~ Binom(ni , p0)
FDR
computation
Peak Detection
Post
Processing
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
A comparative study of ChIP-chip and ChIP-seq
• NRSF ChIP-chip
2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix
Human Tiling 2.0R arrays.
• NRSF ChIP-seq
ChIP + Negative Control in Jurkat cells, sequenced with the
next generation sequencer made by Illumina/Solexa.
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Intersection
Before post-processing
After post-processing
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Signal correlation
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Visual comparison
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Comparison of peak detection results
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Are array specific peaks noise or signal?
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
Effects of read number in ChIP-seq
Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008
RNA-Seq
Mortazavi et al. Nature Methods,5:621-628, 2008
(a) After two rounds of poly(A) selection, RNA
is fragmented to an average length of 200
nt by magnesium-catalyzed hydrolysis and
then converted into cDNA by random
priming. The cDNA is then converted into a
molecular library for Illumina/Solexa 1G
sequencing, and the resulting 25-bp reads
are mapped onto the genome. Normalized
transcript prevalence is calculated with an
algorithm from the ERANGE package.
(b) Primary data from mouse muscle RNAs
that map uniquely in the genome to a 1-kb
region of the Myf6 locus, including reads
that span introns. The RNA-Seq graph
above the gene model summarizes the
quantity of reads, so that each point
represents the number of reads covering
each nucleotide, per million mapped reads
(normalized scale of 0–5.5 reads).
(c) Detection and quantification of differential
expression. Mouse poly(A)-selected RNAs
from brain, liver and skeletal muscle for a
20-kb region of chromosome 10 containing
Myf6 and its paralog Myf5, which are
muscle specific. In muscle, Myf6 is highly
expressed in mature muscle, whereas Myf5
is expressed at very low levels from a small
number of cells. The specificity of RNA-Seq
is high: Myf6 expression is known to be
highly muscle specific, and only 4 reads out
of 71 million total liver and brain mapped
reads were assigned to the Myf6 gene
model.
Reproducibility, linearity and sensitivity
Mortazavi et al. Nature Methods,5:621-628, 2008
(a) Comparison of two brain technical replicate
RNA-Seq determinations for all mouse gene
models (from the UCSC genome database),
measured in reads per kilobase of exon per
million mapped sequence reads (RPKM), which
is a normalized measure of exonic read density;
R 2 = 0.96. (b) Distribution of uniquely mappable
reads onto gene parts in the liver sample.
Although 93% of the reads fall onto exons or the
RNAFAR-enriched regions (see Fig. 3 and text),
another 4% of the reads falls onto introns and
3% in intergenic regions. (c) Six in vitro–
synthesized reference transcripts of lengths 0.3–
10 kb were added to the liver RNA sample (1.2
104 to 1.2 109 transcripts per sample; R 2 >
0.99). (d) Robustness of RPKM measurement
as a function of RPKM expression level and
depth of sequencing. Subsets of the entire liver
dataset (with 41 million mapped unique + splice
+ multireads) were used to calculate the
expression level of genes in four different
expression classes to their final expression
level. Although the measured expression level of
the 211 most highly expressed genes (black and
cyan) was effectively unchanged after 8 million
mappable reads, the measured expression
levels of the other two classes (purple and red)
converged more slowly. The fraction of genes for
which the measured expression level was within
5% of the final value is reported. 3 RPKM
corresponds to approximately one transcript per
cell in liver. The corresponding number of
spliced reads in each subset is shown on the top
x axis.
ERANGE
(a) The main steps in the computational pipeline
are outlined at left, with different aspects of read
assignment and weighting diagrammed at right and
the corresponding number of gene model reads
treated in muscle shown in parentheses. In each
step, the sequence read or reads being assigned
by the algorithm are shown as a black rectangle,
and their assignment to one or more gene models
is indicated in color. Sequence reads falling
outside known or predicted regions are shown in
gray. RNAFAR regions (clusters of reads that do
not belong to any gene model in our reference set)
are shown as dotted lines. They can either be
assigned to neighboring gene models, if they are
within a specified threshold radius (purple), or
assigned their own predicted transcript model
(green). Multireads (shown as parallelograms) are
assigned fractionally to their different possible
locations based on the expression levels of their
respective gene models as described in the text.
(b) Comparison of mouse liver expanded RPKM
values to publicly available Affymetrix microarray
intensities from GEO (GSE6850) for genes called
as present by Rosetta Resolver. Expanded RPKMs
include unique reads, spliced reads and RNAFAR
candidate exon aggregation, but not multireads.
Genes with >30% contribution of multireads to their
final RPKM (Supplementary Fig. 4) are marked in
red. (c) Comparison of Affymetrix intensity values
with final RPKMs, which includes multireads. Note
that the multiread-affected genes that are below
the regression line in b straddle the regression line
in c.
Mortazavi et al. Nature Methods,5:621-628, 2008
Summary
The next-generation sequencing is
• Rapidly evolving
• Democratizing the extent to which individual
investigators can pursue projects at a scale previously
accessible only to major genome centers
• New statistical tools need to be developed