Lecture 1
Introduction to high throughput sequencing
Michael Brudno
Adapted from presentations by Francis Ouellette (OICR), Michael Stromberg (BC) and Asim Siddiqui (ABI)
DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
DNA Sequencing
Goal:
Find the complete sequence of A, C, G, T’s in DNA
Challenge:
There is no machine that takes long DNA as an input, and gives the
complete sequence as output
Can only sequence ~500 letters at a time
Generations of Sequencers
• Sanger-style: Classic
• 454: “First Next-gen”
• Illumina + ABI SOLiD: “Next-gen”
• Helicos: “2.5 Gen”
• PacBio: “Next-next-gen”, 3rd gen
Why are we sequencing?
• Before Next-generation:
– DNA, RNA, (proteins), (populations), sampling, averages,
consensus
• Problems: sampling, averages, consensus.
• After Next-generation:
– Genome sequence and structure
– Less cloning/PCR
– Single molecules (for some)
Sanger (old-gen) Sequencing vs. Now-Gen Sequencing
• Whole Genome
– Sanger: Human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
– Now-gen: New human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
• RNA
– Sanger: cDNA clones, ESTs, Full Length Insert cDNAs, other RNAs
– Now-gen: RNA-Seq: digitization of transcriptome, alternative splicing events, miRNA
• Communities
– Sanger: Environmental sampling, 16S RNA populations, ocean sampling
– Now-gen: Human microbiome, deep environmental sequencing, Bar-Seq
• Other
– Now-gen: Epigenome, rearrangements, ChIP-Seq
Differences between the various platforms:
• Nanotechnology used.
• Resolution of the image analysis.
• Chemistry and enzymology.
• Signal to noise detection in the software
• Software/images/file size/pipeline
• Cost $$$
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Next Generation DNA Sequencing Technologies
Human genome: 6 Gb == 6,000 Mb
                            3730                  454         Illumina
Req’d coverage              6                     12          30
bp/read                     600                   400         2X75
reads/run                   96                    500,000     100,000,000
bp/run                      57,600                0.5 Gb      15 Gb
# runs req’d                625,000               144         12
runs/day                    2                     1           0.1
Machine days/human genome   312,500 (856 years)   144         120
Cost/run                    $48                   $6,800      $9,300
Total cost                  $15,000,000           $979,200    $111,600
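The "# runs req’d" and "machine days" rows follow from the genome size, the required coverage and the per-run yield. A minimal sketch of that arithmetic, using the figures from the table (the function and variable names are hypothetical, chosen only for illustration):

```python
# Figures are taken from the table above; names are illustrative only.
GENOME_BP = 6_000_000_000  # human genome size as stated on the slide (6 Gb)

platforms = {
    # name: (required coverage, bp per run, runs per day)
    "3730": (6, 57_600, 2.0),
    "454": (12, 500_000_000, 1.0),
    "Illumina": (30, 15_000_000_000, 0.1),
}

def runs_required(genome_bp, coverage, bp_per_run):
    """Number of runs needed to reach the target coverage, rounded up."""
    total_bp = genome_bp * coverage
    return -(-total_bp // bp_per_run)  # ceiling division on integers

for name, (coverage, bp_per_run, runs_per_day) in platforms.items():
    runs = runs_required(GENOME_BP, coverage, bp_per_run)
    machine_days = runs / runs_per_day
    print(f"{name}: {runs:,} runs, about {machine_days:,.0f} machine days")
```

Under these assumptions the sketch gives 625,000, 144 and 12 runs for the three platforms, matching the table.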
Solexa-based Whole Genome Sequencing
Illumina (Solexa)
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
What is a base quality?
Base Quality    Perror(obs. base)
3               50.12%
5               31.62%
10              10.00%
15              3.16%
20              1.00%
25              0.32%
30              0.10%
35              0.03%
40              0.01%
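The values in this table follow the standard Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10). A minimal sketch that reproduces the table (function names are hypothetical):

```python
import math

def phred_to_error(q):
    """Error probability for a Phred-scaled base quality q."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(f"Q{q}: {phred_to_error(q):.2%}")
```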
Next-gen sequencers
From John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
• AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb in 50-100 bp reads, >100M reads, 4-8 days
• 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours
• ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours
DNA sequencing – vectors
• Shake DNA to break it into DNA fragments
• Vector: circular genome (bacterium, plasmid)
• DNA fragment + vector = fragment inserted at a known location (restriction site)
Method to sequence longer regions
• Cut the genomic segment many times at random (Shotgun)
• Get two reads (~500 bp each) from each segment
Reconstructing the Sequence (Fragment Assembly)
• Cover region with reads at ~7-fold redundancy (7X)
• Overlap reads and extend to reconstruct the original genomic region
Definition of Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = nl/L
How much coverage is enough?
Lander-Waterman model:
Assuming uniform distribution of reads, C=10 results in 1 gapped
region per 1,000,000 nucleotides
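A minimal sketch of the coverage definition and the Lander-Waterman estimate. It assumes the simplified form in which the expected number of gaps is about n·e^(-C), ignoring the minimum overlap needed to join two reads, and it assumes ~500 bp reads as stated earlier in the lecture. Function names are hypothetical.

```python
import math

def coverage(n_reads, read_len, segment_len):
    """Coverage C = n * l / L."""
    return n_reads * read_len / segment_len

def expected_gaps(n_reads, cov):
    """Simplified Lander-Waterman estimate of the number of gaps: n * e^(-C)."""
    return n_reads * math.exp(-cov)

segment_len = 1_000_000   # L: a 1 Mb genomic segment
read_len = 500            # l: assumed read length
n_reads = 20_000          # n: chosen so that C = 10

c = coverage(n_reads, read_len, segment_len)
print(f"C = {c:.0f}X, expected gaps = {expected_gaps(n_reads, c):.2f}")
```

With these numbers the estimate comes out at roughly one gap per 1,000,000 nucleotides, in line with the slide.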
Challenges with Fragment Assembly
• Sequencing errors
– ~1-2% of bases are wrong
• Repeats
– False overlap due to repeat
• Computation: ~O(N^2) where N = # reads
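The quadratic cost comes from comparing every read against every other read. A naive sketch of that all-pairs overlap step, assuming error-free reads and exact suffix/prefix matching (real assemblers tolerate errors and use indexing to avoid the full N^2 comparison; the function names and the `min_len` parameter are hypothetical):

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: N * (N - 1) comparisons."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_length(a, b, min_len)
                if k:
                    overlaps[(i, j)] = k
    return overlaps

# Three overlapping reads cut from the start of the example sequence ACGTGACTGAGGACCGTG
reads = ["ACGTGACT", "GACTGAGG", "GAGGACCG"]
print(all_overlaps(reads))
```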
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
History of DNA Sequencing
1870 Miescher: Discovers DNA
1940 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA-Ala
1970 Wu: Sequences λ Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: Partial Automation
1990
• Cycle Sequencing
• Improved Sequencing Enzymes
• Improved Fluorescent Detection Schemes
2002
2009
• Next Generation Sequencing
• Improved enzymes and chemistry
• New image processing
Efficiency (bp/person/year), in increasing order along the timeline: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different
members of a species
Humans: ~1/1,000 – 1/10,000
Other organisms have much higher polymorphism rates
Why humans are so similar
Out of Africa
A small population that interbred reduced
the genetic variation
Out of Africa ~ 40,000 years ago
Migration of human variation
http://info.med.yale.edu/genetics/kkidd/point.html
Genetic Variations: Why?
• Phenotypic differences
• Inherited diseases
• Ancestral history
Genetic Variations: SNPs & INDELs
Structural Variations
Paul Medvedev
review in prep
July 2009
SNP Discovery: Goal
[Figure: aligned reads, with sequencing errors and a SNP labelled]
SNP Discovery: Base Qualities
[Figure: base calls labelled as high quality and low quality]
SNPs & Bayesian Statistics
\[
\Pr(G_1^k, G_2^k, \ldots, G_n^k \mid B) =
\frac{\prod_{i=1}^{n} \left[ \sum_{T_i^k} \Pr(B_i \mid T_i^k)\, \Pr(T_i^k \mid G_i^k) \right] \Pr(G_1^k, G_2^k, \ldots, G_n^k)}
     {\sum_{G^l} \prod_{i=1}^{n} \left[ \sum_{T_i^l} \Pr(B_i \mid T_i^l)\, \Pr(T_i^l \mid G_i^l) \right] \Pr(G_1^l, G_2^l, \ldots, G_n^l)}
\]
Labels on the slide: # of individuals (n), base quality, allele call in read
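To illustrate how base qualities enter a Bayesian call, here is a much simplified sketch for a single site in a single haploid sample. It is not the multi-individual calculation on the slide: it assumes a uniform prior over the four alleles, independent reads, and errors spread equally over the three other bases. The function name is hypothetical.

```python
def allele_posterior(base_calls, qualities, alleles="ACGT"):
    """Posterior probability of each candidate allele given base calls and Phred qualities."""
    likelihoods = {}
    for allele in alleles:
        likelihood = 1.0
        for base, q in zip(base_calls, qualities):
            p_err = 10 ** (-q / 10)  # Phred quality -> error probability
            # A matching call is correct with probability 1 - p_err;
            # a mismatching call is one of three equally likely errors.
            likelihood *= (1 - p_err) if base == allele else p_err / 3
        likelihoods[allele] = likelihood
    total = sum(likelihoods.values())  # uniform prior cancels in the normalisation
    return {a: lk / total for a, lk in likelihoods.items()}

# Two high-quality A calls outweigh one low-quality C call.
posterior = allele_posterior("AAC", [30, 30, 10])
print(max(posterior, key=posterior.get), posterior)
```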
SNP Discovery
haploid
strain 1
AACGTTAGCATA
AACGTTAGCATA
AACGTTAGCATA
strain 2
AACGTTCGCATA
AACGTTCGCATA
strain 3
AACGTTAGCATA
AACGTTAGCATA
AACGTTAGCATA
diploid
individual 1
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 2
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 3
AACGTTAGCATA
AACGTTAGCATA
Genotyping & Consensus Generation
haploid
strain 1
[A]
AACGTTAGCATA
AACGTTAGCATA
AACGTTAGCATA
strain 2
[C]
AACGTTCGCATA
AACGTTCGCATA
strain 3
[A]
AACGTTAGCATA
AACGTTAGCATA
AACGTTAGCATA
diploid
individual 1
[A/C]
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 2
[C/C]
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
individual 3
[A/A]
AACGTTAGCATA
AACGTTAGCATA
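The genotype labels above can be reproduced with a naive counting rule. This sketch assumes gap-free, pre-aligned reads and ignores base qualities; the function name and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter

def call_genotype(reads, position, ploidy, min_fraction=0.2):
    """Call a genotype at one column of gap-free, pre-aligned reads."""
    counts = Counter(read[position] for read in reads)
    total = sum(counts.values())
    top_allele = counts.most_common(1)[0][0]
    if ploidy == 1:
        return top_allele
    # Keep alleles seen in at least min_fraction of reads, most frequent first.
    supported = [b for b, c in counts.most_common() if c / total >= min_fraction]
    alleles = sorted(supported[:2]) or [top_allele]
    if len(alleles) == 1:
        alleles = alleles * 2  # homozygous
    return "/".join(alleles)

A_READ, C_READ = "AACGTTAGCATA", "AACGTTCGCATA"
print(call_genotype([A_READ] * 3, 6, ploidy=1))               # strain 1
print(call_genotype([A_READ] * 2 + [C_READ] * 2, 6, ploidy=2))  # individual 1
```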
Visualization: Consed
1000 Genomes Project
1000G: Goals
• Discover genetic variations
– Variants at 1% minor allele frequency (MAF) across the genome
– Variants at 0.1 – 0.5% MAF across gene regions
• Variant alleles
– Estimate frequencies
– Identify haplotype background
– Characterize linkage disequilibrium
1000G: Pilot Projects
• Pilot 1: Low coverage
– 180 samples (70 samples @ 4X, 110 samples @ 2X)
– 2.7 Tbp total: 202 Gbp 454, 1.8 Tbp Illumina, 640 Gbp AB SOLiD
• Pilot 2: Deep trios (CEU & YRI)
– 6 samples
– 1.1 Tbp total: 87 Gbp 454, 773 Gbp Illumina, 270 Gbp AB SOLiD
• Pilot 3: Exon capture
– 607 samples
– 2.2 Mbp of targets (8800 targets)
– 10 – 20x coverage
Questions about the genome
• Obtaining a genome sequence is one step
towards understanding biological processes
• Questions that follow from the genome are:
– What is transcribed?
– Where do proteins bind?
– What is methylated?
• In other words, how does it work?
Central dogma
• DNA → (transcription) → RNA: mRNA, tRNA, rRNA, snRNA
• mRNA → (translation) → polypeptide
Transcription
• The DNA is contained in the nucleus of the
cell.
• A stretch of it unwinds there, and its message
(or sequence) is copied onto a molecule of
mRNA.
• The mRNA then exits from the cell nucleus.
[Figure: a DNA strand paired base by base with the RNA being transcribed. In DNA, A pairs with T and G pairs with C; in RNA, U takes the place of T.]
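The base-pairing rule used during transcription can be sketched in a few lines. The function names are hypothetical, and strand orientation is simplified: the template is given 3'→5' so that the resulting mRNA reads 5'→3'.

```python
# RNA nucleotide that pairs with each DNA template base
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_template(template_3_to_5):
    """Build the mRNA by pairing an RNA nucleotide with each template base."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5)

def transcribe_coding(coding_5_to_3):
    """Equivalent shortcut: the mRNA matches the coding strand with U in place of T."""
    return coding_5_to_3.replace("T", "U")

print(transcribe_template("TACGGT"))  # mRNA built from the template strand
print(transcribe_coding("ATGCCA"))    # same mRNA derived from the coding strand
```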
More complexity
• The RNA message is sometimes “edited”.
• Exons are nucleotide segments whose
codons will be expressed.
• Introns are intervening segments (genetic
gibberish) that are snipped out.
• Exons are spliced together to form mRNA.
Splicing
frgjjthissentencehjfmkcontainsjunkelm
thissentencecontainsjunk
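The sentence analogy maps directly onto code: given exon coordinates, splicing keeps the exons and discards everything else. A minimal sketch, with a hypothetical function name and half-open (start, end) coordinates chosen for this example:

```python
def splice(pre_mrna, exons):
    """Join exon segments, given as half-open (start, end) intervals, in order."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
exons = [(5, 17), (22, 34)]  # "thissentence" and "containsjunk"

print(splice(pre_mrna, exons))
```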
Key player: RNA polymerase
• It is the enzyme that brings about
transcription by going down the line, pairing
mRNA nucleotides with their DNA
counterparts.
Promoters
• Promoters are sequences in the DNA just upstream of
transcripts that define the sites of initiation.
Promoter
5’
3’
• The role of the promoter is to attract RNA polymerase to
the correct start site so transcription can be initiated.
Transcription – key steps
• Initiation
• Elongation
• Termination
DNA → DNA + RNA
Genes can be switched on/off
• An adult multicellular organism contains a
wide variety of cell types, e.g. muscle, nerve
and blood cells.
• The different cell types nevertheless contain
the same DNA.
• This differentiation arises because different
cell types express different genes.
• Promoters are one type of gene regulator
Transcription (recap)
• The DNA is contained in the nucleus of the
cell.
• A stretch of it unwinds there, and its message
(or sequence) is copied onto a molecule of
mRNA.
• The mRNA then exits from the cell nucleus.
• Its destination is a molecular workbench in
the cytoplasm, a structure called a ribosome.
The Transcriptome
• The transcriptome is the entire set of RNA
transcripts in the cell, tissue or organ.
• The transcriptome is cell-type specific and
time dependent, i.e. it is a function of cell state
• The transcriptome can help us understand
how cells differentiate and respond to
changes in their environment.
Transcriptome complexity
• Transcripts may be:
– Modified
– Spliced
– Edited
– Degraded
• Transcriptome is substantially more complex
than the genome and is time variant.
ESTs
• ESTs were the first genome-wide scan for
transcriptional elements
• Different library types:
– Proportional
– Normalized
– Subtractive
• Can be sequenced from the 5’ or 3’ end
“Hello Mr Chips”
• Microarray chips introduced in 90’s
• Parallel way to measure many genes
– Probes placed on slides
– RNA -> cDNA, labelled with fluorescent dye and hybridized.
– Fluorescence measured
• Chips have been highly successful
• Simplified analysis
• Useful when there is no genome sequence
• Linear signal across 500 fold variation
• Standardization has aided use in medical diagnostics
– E.g. Mammaprint
Microarray expression profiling
by 2-color assay (“cDNA arrays”)
Array:
PCR products
6250 yeast ORFs
hybridized cDNAs:
green = control
red = experiment
*Schena et al., 1995
Chips: pros and cons
• Advantages
– Do not require a genome sequence
– Highly characterised, with many s/w packages available
– One Affymetrix chip FDA approved
• Disadvantages
– Measurements limited to what’s on the array
– Hard to distinguish isoforms when used for expression
– Can’t detect balanced translocations or inversions when used
for resequencing
mRNA-seq
• Basic work flow
– Align reads (sometimes to transcriptome first and
then the genome)
– Tally transcript counts
– Align tags to spliced transcripts
– Add to transcript counts
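The work flow above can be sketched as a two-pass tally. This is a toy illustration only: exact substring matching stands in for a real aligner, multi-mapped reads are simply dropped, and all sequences, gene names and function names are made up for the example.

```python
from collections import Counter

def tally_transcripts(reads, exon_seqs, junction_seqs):
    """Count reads per gene: first against exon sequences, then against spliced junctions."""
    counts = Counter()
    unmapped = []
    # Pass 1: align reads to unspliced exon sequence and tally counts.
    for read in reads:
        hits = [gene for gene, seq in exon_seqs.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            unmapped.append(read)
    # Pass 2: align the remaining reads to spliced junctions and add to the counts.
    still_unmapped = []
    for read in unmapped:
        hits = [gene for gene, seq in junction_seqs.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            still_unmapped.append(read)
    return counts, still_unmapped

exon_seqs = {"geneA": "ACGTTAGCATAGG", "geneB": "TTGACCGATCGAA"}
junction_seqs = {"geneA": "ATAGGCCTGA"}  # end of one exon joined to the start of the next
reads = ["GTTAGC", "GACCGA", "AGGCCT", "NNNNNN"]

counts, leftover = tally_transcripts(reads, exon_seqs, junction_seqs)
print(dict(counts), leftover)
```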
Cloonan et al. 2008
• Used SOLiD to generate 10 Gb of data from
mouse embryonic stem cells and embryoid
bodies
• Used a library of exon junctions to map across
known splice events
Distribution of tags
Tag locations
General issues
• Coverage across the transcript may not be
random
• Some reads map to multiple locations
• Some reads don’t map at all
• Reads mapping outside of known exons may
represent
– New gene models
– New genes
Size of the transcriptome
• Carter et al (2005)
– Using arrays, estimated 520,000 to 850,000
transcripts per cell.
– Take the upper limit and an estimated average
transcript size of 2 kb
– Transcriptome ~2 Gb
• Transcriptome cost ~ genome cost
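The size estimate is simple arithmetic on the two figures quoted above; a short sketch (variable names are illustrative):

```python
# Figures from the slide; variable names are illustrative.
transcripts_per_cell = 850_000   # upper estimate from Carter et al. (2005)
average_transcript_bp = 2_000    # estimated average transcript size (2 kb)

transcriptome_bp = transcripts_per_cell * average_transcript_bp
print(f"{transcriptome_bp / 1e9:.1f} Gb")  # rounded up to ~2 Gb on the slide
```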
The Boundome
• DNA binding proteins control genome
function
• Histones impact chromatin structure
• Activators and repressors impact gene
expression
• The location of these proteins helps us
understand how the genome works
ChIP
ChIP-Seq
• Instead of probing against a chip, measure
directly
• Basic work flow
– Align reads to the genome
– Identify clusters and peaks
– Determine bound sites
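The "identify clusters and peaks" step can be illustrated with a naive fixed-window count over aligned read positions. Real peak callers model background and strand shifts; this sketch does neither, and the function name, window size and threshold are hypothetical.

```python
from collections import Counter

def find_peaks(read_starts, window=100, min_reads=3):
    """Bin aligned read start positions into fixed windows; report windows with enough reads."""
    bins = Counter(pos // window for pos in read_starts)
    return sorted(
        (b * window, (b + 1) * window, n)
        for b, n in bins.items()
        if n >= min_reads
    )

# Toy read start positions on one chromosome
read_starts = [120, 130, 145, 160, 900, 2210, 2230, 2250, 5000]
for start, end, n in find_peaks(read_starts):
    print(f"candidate bound site {start}-{end}: {n} reads")
```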
Robertson et al. 2007
• Used Illumina technology to find STAT1 binding sites
• Comparisons with two ChIP-PCR data sets suggested that
ChIP-seq sensitivity was between 70% and 92% and
specificity was at least 95%.
Tag statistics
Typical Profile
Mikkelsen et al., 2007
• Performed a comparison with ChIP-chip
methods ~98% concordance
Comparison with ChIP-seq
The Methylome
• In methylated DNA, cytosines are methylated.
• This leads to silencing of genes in the region
e.g. X inactivation
• It is yet another form of transcriptional control
and together with histone modifications a key
component of epigenetics
Bisulphite sequencing
• Converts unmethylated cytosines to uracil
(which is read as thymine after PCR
amplification)
• Experimental procedure is difficult
• Sequence alignment is tricky, but the basic
concepts hold
Taylor et al, 2007
• Targeted sequencing reduced alignment difficulties
• Used dynamic programming to identify alignments of
sequences against an in silico bisulphite-converted
sequence of the target amplicon regions
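An in silico bisulphite conversion of a reference sequence can be sketched as follows: every cytosine not marked as methylated is written as T, which is what a read from fully converted DNA would show. The function name and the example sequence are hypothetical, and only the forward strand is handled.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Return seq with unmethylated C replaced by T (forward strand only)."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

reference = "ACGTCCGA"
print(bisulfite_convert(reference))                            # no methylation
print(bisulfite_convert(reference, methylated_positions={1}))  # C at index 1 protected
```

Aligning reads against the converted reference, rather than the original, is one way to keep ordinary alignment methods usable.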
Metagenomics
• Craig Venter’s sequencing of the sea is one of
the earliest and best-known examples
– Used Sanger sequencing
• Many recent studies, including
– Angly et al – studied the ocean virome
– Cox-Foster et al – studied colony collapse disorder
• All use 454 for its longer read length and
targeted amplification of 16S or 18S ribosomal
subunits