Reference - Human Microbiome Journal Club
Next-Generation Sequencing of Microbial
Genomes and Metagenomes
Christine King
Farncombe Metagenomics Facility
Human Microbiome Journal Club
July 13, 2012
Overview
Next-generation sequencing
Applications
Instruments
Library prep and sequencing chemistry
Sequence quality
Project overview
Microbial genomes
Microbial communities
DNA Sequencing
1st generation
Sanger chain termination
Capillary electrophoresis
2nd generation (NGS)
High throughput, “massively parallel”
Shorter reads
Sequencing-by-synthesis
3rd generation
Single molecule
Nanopores
Applications
DNA sequencing
De novo genomes
Resequencing
Metagenome
Shotgun (e.g. mutant strains)
Amplicon (e.g. HLA, cancer)
Sequence capture (e.g. exome)
Amplicon (e.g. 16S, COI, viral)
Shotgun
ChIP
RNA sequencing
Gene expression
Gene annotation, splice variants
Metatranscriptome
Instruments
Instrument      # of reads  Read length (bp)  Total output (Gb)  Cost per base  Run time  Technology
GS FLX          1M          450               0.5                $$$$           ++        emPCR, SBS, light detection
GS FLX+         1M          650               0.6                $$$$           ++        emPCR, SBS, light detection
GS Jr           100K        450               0.05               $$$$           ++        emPCR, SBS, light detection
GAIIx           640M        2x 150            90                 $$             +++       Bridge PCR, SBS, fluorophore
HiSeq 2000      6B          2x 100            600                $              +++       Bridge PCR, SBS, fluorophore
MiSeq           12M         2x 150            2                  $$             ++        Bridge PCR, SBS, fluorophore
PacBio RS       >10K        >1000             0.01               $$$$           +         Single-molecule seq, fluorophore
SOLiD 5500xl    1.4B        75 + 35           155                $              +++       emPCR, probe ligation, fluorophore
Ion PGM - 316   1M          >100              0.1                $$$            +         emPCR, SBS, pH change
Ion PGM - 318   6M          >100              1                  $$             +         emPCR, SBS, pH change
Which instrument(s) to use?
Read length vs number of reads
Cost per base, per sample, per project (multiplexing?)
Accuracy
Run time, wait time
Application       Length  # Reads  Accuracy  Instruments            Considerations
De novo (small)   +++     ++       ++        MiSeq, 454, Ion        Mix lengths
De novo (large)   +++     +++      ++        HiSeq, 454, SOLiD      Mix lengths, MP
Re-seq (small)    ++      ++       ++        MiSeq, Ion             Multiplex?
Re-seq (large)    ++      +++      ++        HiSeq, SOLiD           Enrichment?
RNA-seq (count)   +       +++      +         Illumina, SOLiD, Ion   Ref? Size? Rare?
Amplicons         +++     +        +++       454, MiSeq             Size? Multiplex?
Metagenomics      ++      +++      +++       Illumina, 454, SOLiD   Length vs depth
Library Preparation
Goal: fragments of DNA, each end flanked by adaptor sequences
Adaptors contain amplification- and sequencing primer binding sites; platform- and chemistry-specific
Optional: sample-specific barcodes/indexes/MIDs/tags allow multiplexing during sequencing
Library QC: quantity, size
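The barcode idea above can be sketched in a few lines. This is a minimal illustration, assuming the barcode sits at the very start of each read (as with inline MIDs) and must match exactly; the barcode sequences and sample names are hypothetical, and real pipelines usually tolerate mismatches.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by the barcode found at the start of each read.

    reads    -- iterable of sequence strings
    barcodes -- dict of barcode sequence -> sample name (hypothetical values)
    Returns a dict of sample name -> reads with the barcode removed.
    Reads with no matching barcode go under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode so only the insert sequence remains
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)

# Hypothetical 4-nt barcodes and toy reads
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAAA", "GGGGCCCC"]
result = demultiplex(reads, barcodes)
```

On Illumina instruments the index is typically read separately from the insert, so the prefix matching here would be replaced by a lookup on the index read.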
Library Preparation
Library types:
Shotgun (DNA)
May begin with ChIP
May follow with sequence capture
Mate pair (DNA)
Amplicon (DNA)
Total RNA
May enrich for mRNA (poly-A enrichment, rRNA depletion)
Convert to cDNA (then similar to DNA protocols)
Small RNA
RNA ligations, convert to cDNA after
Library Preparation: Shotgun
Fragmentation
Sonication
Nebulization
Enzymatic
End repair
3’ overhangs digested
5’ overhangs filled
5’ phosphate added
Library Preparation: Shotgun
Adapter ligation
T-overhangs
Forked structure controls orientation
Library amplification
Few cycles
Enrich for correctly-adapted fragments
Required to complete adapter structure in some protocols
Size selection
Gel excision, AMPure beads
Limit insert size as needed, remove artifacts
Library Preparation: Amplicon
Amplify region of interest using PCR
Primers contain adapter sequences
Library Preparation: Mate Pair
Begin with large fragments (e.g. 3kb, 20kb)
Circularize and fragment again
Illumina: direct ligation
454: Cre/Lox recombination
Enrich for fragments containing the junction
Proceed with shotgun library prep
Library Preparation: Mate Pair
Why? Paired sequences are a known distance apart; improves genome assembly
Note: 454 calls these “paired end libraries”, not to be confused with Illumina’s “paired end sequencing”!
Sequencing: Illumina
Cluster generation
Library fragments hybridize to oligos on the flow cell
New strand synthesized, original denatured, removed
Free end binds to adjacent oligos (bridge formation)
Complementary strand synthesized, denatured (both tethered to flow cell)
Repeat to form clonal cluster
Cleave one oligo, denature to leave ssDNA clusters
~800K clusters/mm^2
Sequencing: Illumina
Variety of workflows:
Single- or paired-end reads
0, 1, or 2 index reads
Sequencing: Illumina
At each cycle, all 4 fluorescently-labeled nucleotides pass over the flow cell
Each cluster incorporates one nt (terminator) per cycle
Fluor is imaged, then cleaved
De-block and repeat
Sequencing: Illumina
Other terminology:
cBot – accessory instrument that performs cluster generation
Lanes – divisions (8) of HiSeq and GAIIx flow cells
PhiX – bacteriophage with small, balanced genome; PhiX library spiked in with samples for QC
Phasing/pre-phasing – nt incorporation falls behind or jumps ahead on a portion of strands in the cluster and contributes to noise
Chastity filter – measures signal purity (after intensity corrections); if the background signal is high, cluster will be discarded
BaseSpace – cloud computing site for processing MiSeq data
File format: fastq
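As a rough illustration of the fastq format, the sketch below reads four-line records and decodes the quality string into per-base Q scores. It assumes the Phred+33 ASCII offset and unwrapped (single-line) sequences; older Illumina pipelines used a different offset, so check your own files.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        # Assumption: Phred+33 encoding (ASCII code minus 33 = Q score)
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Toy record held in memory instead of a real file
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```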
Sequencing: 454
emPCR: clonal amplification of bead-bound library in microdroplets
Library input amounts critical!
One molecule per bead
Titration procedure
Sequencing: 454
Library capture: beads coated with complementary oligo
Amplification: droplet contains PCR reagents and the other oligo
Post-PCR: millions of identical fragments attached to the bead
Sequencing: 454
Bead Recovery: physical and chemical disruption
Enrichment: capture successfully amplified beads using biotinylated primers + magnetic, streptavidin beads
Sequencing: 454
Deposit bead layers onto PicoTiterPlate:
Enzyme beads
Enriched DNA beads
More enzyme beads
PPiase beads
Sequencing: 454
Pyrosequencing
4 nucleotides flow separately
If nt is incorporated: PPi released, converted to light
APS + PPi → ATP (sulfurylase)
Luciferin + ATP → light + oxyluciferin (luciferase)
Amount of light proportional to # nt incorporated
Rinse and repeat with next nt
Sequencing: 454
Camera captures light emitted from every well during every nucleotide flow
Sequencing: 454
Flowgram: representation of a sequence, based on the pattern of light emitted from a single well
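To make the flowgram idea concrete, here is a minimal sketch that turns per-flow light signals into bases by rounding each signal to the nearest whole number of incorporations. The flow order "TACG" and the example signal values are assumptions for illustration; real base callers also correct for background and signal decay.

```python
def flowgram_to_sequence(signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence.

    signals    -- light intensity per flow, in units of one incorporation
    flow_order -- nucleotide order of the flows (assumed here, not from the slides)
    """
    bases = []
    for i, signal in enumerate(signals):
        nt = flow_order[i % len(flow_order)]
        n = int(signal + 0.5)  # round to the nearest number of bases
        bases.append(nt * n)   # n > 1 represents a homopolymer run
    return "".join(bases)

# Hypothetical signals for two cycles (eight flows)
signals = [1.02, 0.05, 0.98, 2.10, 0.03, 1.10, 0.00, 0.95]
seq = flowgram_to_sequence(signals)
```

A signal near 1.5 would round unpredictably, which is why homopolymer lengths are the main source of uncertainty in this chemistry.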
Sequencing: 454
Other terminology:
Lib-L/Lib-A: adapter variants, “ligated” or “annealed”
Titanium chemistry: ~450 bp reads on all instruments
XL+ chemistry: ~700 bp reads on the FLX+ instrument
Flow: one of the four nucleotides flows over the PTP
Cycle: a set of four flows, in order
Valley flow: if number of bases incorporated in a given read during that flow is uncertain, e.g. 1.5 units of light (background signal, homopolymers)
File format: sff (standard flowgram format)
Sequencing: Ion Torrent
Procedures and chemistry similar to 454
Instead of PPi, measure H+ release (pH change) via semiconductor chip
No expensive camera or laser required, no modified nucleotides
Sequence Quality
Phred (Q) score  Probability of error (P)  Base call accuracy
10               1 in 10                   90%
20               1 in 100                  99%
30               1 in 1K                   99.9%
40               1 in 10K                  99.99%
50               1 in 100K                 99.999%
Error probabilities determined using training sets, platform-specific biases
Expressed as a quality value (QV or Q score) per base
Similar to PHRED scores:
$Q = -10 \log_{10} P$
$P = 10^{-Q/10}$
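The two conversions above are easy to check numerically. The sketch below implements both directions and adds a small helper (not from the slides) that sums per-base error probabilities to estimate the expected number of errors in a read.

```python
import math

def phred_to_error(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def expected_errors(quals):
    """Expected number of miscalled bases in a read (illustrative helper)."""
    return sum(phred_to_error(q) for q in quals)

p_q30 = phred_to_error(30)    # about 1 in 1,000
q_1pct = error_to_phred(0.01) # about 20
```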
Project 1: Microbial Genome
Considerations:
Reference genome?
How much coverage do I want?
How big is the genome?
How much data do I need?
bp needed = genome size × coverage
Which instrument/chemistry configuration to use?
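The data estimate above is simple arithmetic, sketched below. The 5 Mb genome size is a hypothetical example; the 25X depth and the roughly 2 Gb MiSeq output are taken from elsewhere in this talk. The numbers are idealized and ignore reads lost to filtering or uneven pooling.

```python
import math

def bases_needed(genome_size_bp, coverage):
    """bp needed = genome size x coverage."""
    return genome_size_bp * coverage

def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads of a given length needed to reach the target coverage."""
    return math.ceil(bases_needed(genome_size_bp, coverage) / read_length_bp)

def genomes_per_run(run_output_bp, genome_size_bp, coverage):
    """How many such genomes could be multiplexed on one run (idealized)."""
    return run_output_bp // bases_needed(genome_size_bp, coverage)

genome = 5_000_000                                 # hypothetical 5 Mb bacterial genome
total_bp = bases_needed(genome, 25)                # 125,000,000 bp
n_reads = reads_needed(genome, 25, 150)            # 150 bp reads
n_genomes = genomes_per_run(2_000_000_000, genome, 25)
```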
Coverage
Depth: number of times a particular base is “covered” by a read (e.g. 25X)
Breadth: % of genome with at least 1X coverage
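A small sketch may help separate the two coverage measures. It assumes reads have already been aligned and are given as hypothetical (start, end) intervals, 0-based and end-exclusive, on a toy 10 bp genome.

```python
def coverage_stats(genome_length, alignments):
    """Return (mean depth, breadth) from aligned read intervals.

    alignments -- list of (start, end) positions, 0-based, end-exclusive
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    mean_depth = sum(depth) / genome_length
    # Breadth: fraction of positions covered by at least one read
    breadth = sum(1 for d in depth if d >= 1) / genome_length
    return mean_depth, breadth

# Three toy reads on a 10 bp genome; the last two positions are uncovered
mean_depth, breadth = coverage_stats(10, [(0, 5), (3, 8), (4, 6)])
```

Here the mean depth is above 1X even though 20% of the genome has no coverage at all, which is why both numbers are worth reporting.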
Project 1: Microbial Genome
Sample preparation
Isolate high quality (not degraded) and high purity (no RNA) gDNA
Verify on a gel
Quantify using dsDNA-specific dye
Library preparation
Can do this yourself if you like
~ $200 per sample for Nextera
Cheaper protocols
Cheaper in bulk
Barcode compatibility
Project 1: Microbial Genome
Library QC
Insert size confirmed on BioAnalyzer (within range, no artifacts)
Pool barcoded libraries (normalize based on PicoGreen quantification)
Absolute quantification of library pools using qPCR
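Normalizing libraries before pooling comes down to converting each concentration to molarity and pipetting equal molar amounts. The sketch below uses the common approximation of 660 g/mol per base pair of dsDNA; the library names, concentrations, and fragment sizes are hypothetical.

```python
def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration (ng/uL) to nM.

    Assumes ~660 g/mol per base pair, a widely used approximation.
    """
    return conc_ng_per_ul * 1e6 / (660 * mean_fragment_bp)

def equimolar_volumes(libraries, fmol_each):
    """Volume (uL) of each library that contributes fmol_each femtomoles.

    libraries -- dict of name -> (ng/uL, mean fragment size in bp)
    Note: 1 nM equals 1 fmol/uL, so volume = fmol / nM.
    """
    return {
        name: fmol_each / library_molarity_nM(conc, size)
        for name, (conc, size) in libraries.items()
    }

# Hypothetical libraries: same size, one twice as concentrated
libs = {"lib1": (10.0, 500), "lib2": (20.0, 500)}
vols = equimolar_volumes(libs, fmol_each=30)
```

The more concentrated library needs half the volume, so each sample should receive a similar share of the reads.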
Project 1: Microbial Genome
MiSeq sequencing
Dilute and denature library pool (optimal concentration requires titration...)
Spike in PhiX library as needed (e.g. 1%)
Prepare and load reagents, flow cell
Basic filtering and de-multiplexing performed automatically
Download fastq files from BaseSpace
Project 1: Microbial Genome
Data processing
Additional filtering
Trim the ends
Remove PCR duplicates
Assembly: overlapping reads are assembled to each other based on sequence similarity = contigs
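Two of the filtering steps above can be sketched briefly. The Q20 cutoff is an assumed threshold, not one given in the talk, and duplicates are treated here simply as identical sequences; dedicated tools usually identify PCR duplicates from alignment positions instead.

```python
def trim_ends(seq, quals, min_q=20):
    """Remove low-quality bases from both ends of a read.

    min_q is an assumed cutoff; bases inside the read are left untouched.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def remove_exact_duplicates(seqs):
    """Keep the first copy of each identical sequence, preserving order."""
    seen = set()
    unique = []
    for s in seqs:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique

trimmed = trim_ends("ACGTAC", [5, 30, 35, 32, 10, 8])
deduped = remove_exact_duplicates(["AAA", "CCC", "AAA"])
```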
Project 1: Microbial Genome
What’s next?
Polish the genome (hybrid assemblies, mate pair libraries)
Annotate (ORFs, RNA-seq)
Compare
Project 2: Microbial Community
Shotgun metagenomics
Unbiased survey of community content
Random library fragments may provide very little taxonomic resolution (e.g. conserved, unknown)
Identify genes, classify by function
Targeted metagenomics
Limited survey of community content
Targeted loci provide excellent taxonomic resolution, but may exclude certain taxa
Identify OTUs, classify by taxonomy
Project 2: Microbial Community
16S rRNA
Multi-copy gene (1.5 kb)
Conserved and hypervariable regions
Extensive databases from known species
Project 2: Microbial Community
Considerations:
Biases in sampling methods, culturing, DNA isolation, PCR...replicate
Available SOPs
How many reads per sample?
Read length matters!
Sample preparation:
Isolate DNA
PCR amplify, purify
High-fidelity polymerase
Barcoded primers
No primer dimers!
Normalize PCR products and pool
Project 2: Microbial Community
454 Sequencing
emPCR titrations with different library input
Bulk emPCR
Sequence
Basic filtering
Collect sff files
Data processing
De-multiplexing
Additional filtering
Trim the barcodes, primers
Check for chimeras
Project 2: Microbial Community
Clustering
Sequences grouped by similarity = OTUs
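One simple way to picture clustering is a greedy pass in which each sequence joins the first existing cluster whose seed it resembles closely enough, or else starts a new cluster. This sketch assumes equal-length, already aligned reads and a 97% default threshold; both are illustrative assumptions, and production tools use proper alignment and more careful seed selection.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed is similar enough.

    threshold -- minimum identity to the seed (0.97 is an assumed default)
    Returns a list of OTUs, each a list of member sequences.
    """
    seeds, otus = [], []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                otus[i].append(s)
                break
        else:
            # No seed was close enough: this sequence founds a new OTU
            seeds.append(s)
            otus.append([s])
    return otus

# Toy 10-nt reads, clustered at 90% so a single mismatch is tolerated
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAAAA"]
otus = greedy_otus(toy, threshold=0.9)
```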
Project 2: Microbial Community
Taxonomic identification
OTUs are classified by comparing to known 16S sequences
Level of classification (e.g. family vs genus)?
Diversity
Within sample
Between samples
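The slides do not name specific diversity measures, so the sketch below picks two common ones as examples: the Shannon index for within-sample diversity and Bray-Curtis dissimilarity for between-sample comparison. Inputs are OTU count vectors with hypothetical values, listed in the same OTU order for both samples.

```python
import math

def shannon_index(counts):
    """Within-sample diversity: H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(counts_a, counts_b):
    """Between-sample dissimilarity: 0 = identical counts, 1 = no shared OTUs."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))

# Hypothetical OTU counts for two samples
h_even = shannon_index([10, 10])       # two equally abundant OTUs
h_single = shannon_index([20, 0])      # one OTU only
bc_same = bray_curtis([10, 10], [10, 10])
bc_disjoint = bray_curtis([10, 0], [0, 10])
```

Because both measures depend on sequencing depth, samples are usually compared after normalizing to a similar number of reads.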