Transcript mSA 41

Towards your own genome
Sequencing strategy
Genome size and genome complexity?!
Estimated from a related organism, pulsed-field gel electrophoresis (PFGE), or flow cytometry
Designing your Sequencing Run
https://genohub.com/next-generation-sequencing-guide/
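Designing a run comes down to simple arithmetic once the genome size is estimated: expected coverage = number of reads × read length / genome size. A minimal sketch of that calculation follows; the function names and the example numbers (5 Mbp genome, 250 bp reads, 50× target) are illustrative assumptions, not values from the slides.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 5 Mbp bacterial genome, 250 bp reads, 50x target
print(reads_needed(50, 250, 5_000_000))
print(expected_coverage(1_000_000, 250, 5_000_000))
```

For paired-end runs, each read of the pair counts separately in this estimate. Real runs lose some reads to trimming and filtering, so the target is usually padded.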
Noncoding DNA in genomes
Repetitive DNA in the human genome
Sequencing strategy
Template and Library prep:
Fragment (SE), paired-end (PE) or mate-pair (MP)
BAC clones, fosmids....
Sequencing Platform
Genome sequencing: Comparison of NGS methods

Method: Single-molecule real-time sequencing (Pacific Bio)
Read length: 2900 bp average[38]
Accuracy: 87% - 99%
Reads per run: 35-75 thousand
Time per run: 30 minutes to 2 hours
Cost per 1 million bases: $2
Advantages: Longest read length. Fast. Detects 4mC, 5mC, 6mA.[41]
Disadvantages: Low yield at high accuracy. Equipment can be very expensive.

Method: Ion semiconductor (Ion Torrent sequencing)
Read length: 200 bp
Accuracy: 98%
Reads per run: up to 5 million
Time per run: 2 hours
Cost per 1 million bases: $1
Advantages: Less expensive equipment. Fast.
Disadvantages: Homopolymer errors.

Method: Pyrosequencing (454)
Read length: 700 bp
Accuracy: 99.9%
Reads per run: 1 million
Time per run: 24 hours
Cost per 1 million bases: $10
Advantages: Long read size. Fast.
Disadvantages: Runs are expensive. Homopolymer errors.

Method: Sequencing by synthesis (Illumina)
Read length: 50 to 250 bp
Accuracy: 98%
Reads per run: up to 3 billion
Time per run: 1 to 10 days, depending upon sequencer
Cost per 1 million bases: $0.05 to $0.15
Advantages: Potential for high sequence yield, depending upon sequencer model and desired application.

Method: Sequencing by ligation (SOLiD sequencing)
Read length: 50+35 or 50+50 bp
Accuracy: 99.9%
Reads per run: 1.2 to 1.4 billion
Time per run: 1 to 2 weeks
Cost per 1 million bases: $0.13
Advantages: Low cost per base.
Disadvantages: Short reads. Slower than other methods.

Method: Chain termination (Sanger sequencing)
Read length: 400 to 900 bp
Accuracy: 99.9%
Reads per run: N/A
Time per run: 20 minutes to 3 hours
Cost per 1 million bases: $2400
Advantages: Long individual reads. Useful for many applications.
Disadvantages: More expensive and impractical for larger sequencing projects.

Instrument
Application: de novo assemblies
BACs, plastids, & microbial genomes
Transcriptome
Plant & animal genome
454 – GS Jr.
B – good but expensive
D – cost prohibitive
454 – FLX+
A – good, need to multiplex to be economical
C – need multiple runs, expensive
B – good but expensive, libraries
usually normalized, not best for
short RNAs
MiSeq – v2
A – good, need to multiplex for best economics
C – OK as part of a mixed platform
strategy, prohibitive to use alone
A/B – expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly
B – expensive relative to HiSeq, but additional read length can be valuable
HiSeq 2000/2500, standard run
B/C – more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq
A – good, assembly more challenging than 454 but much more data available for analyses
A – primary data type in many current projects; requires mate-pair libraries
HiSeq 2500, rapid run (projected)
B – more data than needed unless highly indexed; assembly more challenging than 454
A – good, assembly more challenging than 454 but much more data available for analyses
A – will probably be more expensive than HiSeq2000, but increased read length may be worth it
C – OK, but reads are shorter & more expensive than Illumina
Ion Torrent – 318
B/A – good, less data than MiSeq
B/A – good, less data than MiSeq, reads similar to 454 titanium but less expensive
Ion Torrent Proton I
B – more data than needed unless indexed; assembly more challenging than 454 or Illumina
B/A – assembly currently more challenging than Illumina or 454
D – cost prohibitive, reads shorter
than alternatives
C – high cost relative to Proton or
Illumina, more economical than 454
for mixed platform strategy
B – expensive relative to HiSeq or
Proton II/III
Ion Torrent Proton II (projected)
B/C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
B/A – assembly currently more challenging than Illumina or 454
A/B – should be similar to HiSeq
Ion Torrent Proton III (forecast)
C – more data than needed unless highly indexed
SOLiD – 5500
C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina
A – cost per MB could make it the
best
C/D – short reads make assembly
challenging or impossible
Ion Torrent – 314
PacBio – RS
B/C – OK, lowest experimental cost but reads are
shorter & more expensive than Illumina
B/A – need assembly pipelines
C/D – short reads make assembly
challenging or impossible
B – good for hybrid assemblies; not economical for solo assemblies – requires high coverage due to high error rates
B/D – good for hybrid assemblies; too expensive for solo use; short RNA is challenging
B/D – good for hybrid assemblies &
scaffolding (mixed platform
strategy); cost prohibitive for solo
use
Platform – instrument
Application: resequencing
Targeted loci
Transcript counting
454 – GS Jr.
B/C – good but expensive, need to limit
loci
D – cost prohibitive
454 – FLX+
B – good but expensive, should limit loci
D – cost prohibitive
MiSeq
A/B – good, fewer and higher cost reads
than HiSeq
B – more expensive than HiSeq or
SOLiD or ProtonII+
HiSeq 2000/2500 – standard run
A – primary data type in many current projects; best for many loci
HiSeq 2500 – rapid run
(projected)
A – faster path to leading data type
Genome resequencing
D – cost prohibitive for large
genomes
D – cost prohibitive for large
genomes
B/C – expensive for large genomes
A – primary data type in many
current projects
A/B – likely to be slightly more
expensive than with standard flow
cell
A – primary data type in many
current projects
D – cost prohibitive
A – faster path to leading data type
Ion Torrent – 314
C – OK but expensive, need to limit loci
D – cost prohibitive
Ion Torrent – 318
B – good, slightly less data per run than
MiSeq
B/C – more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile
C – expensive for large genomes
Ion Torrent Proton I
A/B – similar to MiSeq, but different error profile will inhibit switching
B – more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina)
B – expensive relative to HiSeq or Proton II+
Ion Torrent Proton II
(projected)
A/B – similar to HiSeq, but different error profile will inhibit switching
A/B – new informatics pipelines needed
Ion Torrent Proton III
(forecast)
A/B – costs projected to be better than HiSeq; error profile different than Illumina
A/B – new informatics pipelines needed
SOLiD – 5500xl
PacBio – RS
B – harder to assemble than Illumina
A/B – used much less than HiSeq
C/D – expensive but can sequence difficult regions
D – cost prohibitive
A – supposed to set new pricing
standard, could become leading
shorter-read platform
A – supposed to set new pricing
standard, could become leading
shorter-read platform
A/B – used much less than HiSeq
C/D – cost prohibitive except for structural variants
Bacterial genomes
Noncoding DNA in genomes
Complex Bacterial Genomes
Fosmid and plasmid library; Sanger
Simplified Bacterial Genomes
454 (average read length 225bp)
Illumina (33bp)
MDA for 16h on one lysed cell
3kb Sanger libraries plus 454
15 gaps (chimeric clones) Sanger finishing
Polishing by Illumina reads
37 regions Sanger polishing
Bacterial genomes
Eukaryotic Genomes
Eukaryotic Genomes: Fish genomes
Template: A female fish was chosen because of its XX sex chromosome constitution
Roche 454 Titanium (3 and 20 kb libraries)
Illumina PE insert size 200 bp and 75 bp reads
Physical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10× genome coverage with a total of 43,192 clones available)
Bird genomes
Mammalian genomes
HiSeq2000
DNA isolated from blood
Extremely large genomes
The largest genome assembled to date
loblolly pine (Pinus taeda)
DNA template:
a single megagametophyte, the haploid tissue of a single pine seed – quantity
long-fragment mate pair libraries from the parental diploid DNA
Novel fosmid DiTag libraries
N50 scaffold size of 66.9 kbp
Raw Data Trimming and Filtering
Quality score
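A Phred quality score Q relates to the base-calling error probability P by Q = -10 · log10(P), so Q20 means a 1 in 100 chance of a wrong base. The sketch below decodes a FASTQ quality string and converts scores to error probabilities. It assumes the common Phred+33 ASCII encoding; older Illumina files used a +64 offset, so the offset is an assumption to check against your data. The function names are illustrative.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = decode_phred33("I5!")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```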
Raw Data Trimming and Filtering
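As a rough illustration of trimming and filtering, the sketch below removes low-quality bases from the 3' end of a read and then discards reads that end up too short. This is a simplified stand-in for what dedicated trimming tools do (they typically use sliding windows and also remove adapters); the thresholds of Q20 and 30 bp are assumed defaults, not recommendations from the slides.

```python
def trim_3prime(seq, quals, min_q=20):
    """Cut bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_length_filter(seq, min_len=30):
    """Keep only reads that are still at least min_len bases long."""
    return len(seq) >= min_len


# Toy read: the last two bases have low quality and are removed
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
print(trimmed_seq, trimmed_quals)
print(passes_length_filter(trimmed_seq, min_len=4))
```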
Assembly
N50
N75
Contigs
Scaffolds
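N50 is the length of the contig (or scaffold) at which the running total of lengths, sorted from longest to shortest, first reaches 50% of the total assembly length; N75 uses 75%. A minimal sketch of that definition, with a made-up list of contig lengths:

```python
def nx(lengths, x=50):
    """Return the Nx statistic (N50 by default) for contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths (total 290)
contigs = [80, 70, 50, 40, 30, 20]
print(nx(contigs, 50))  # 70
print(nx(contigs, 75))  # 40
```

The same function applies to scaffolds; scaffold N50 is normally larger than contig N50 because scaffolds join contigs across gaps.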
Assembly: K-mer
K-mer: a subsequence of length k; a k-mer shared by a pair of reads marks a candidate overlap
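To make the k-mer idea concrete, the sketch below lists all k-mers of a read and finds the ones two reads have in common. The reads and the choice k = 4 are toy assumptions; real assemblers use much larger k and far more compact data structures.

```python
def kmers(read, k):
    """Set of all substrings of length k in the read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmers(read_a, read_b, k):
    """K-mers present in both reads, a hint that the reads may overlap."""
    return kmers(read_a, k) & kmers(read_b, k)


print(sorted(kmers("ACGTAC", 4)))
print(shared_kmers("ACGTAC", "GTACGG", 4))
```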
Assembly – algorithms
Repeats!
OLC Overlap/Layout/Consensus
Overlap: all-against-all overlap discovery with a seed & extend heuristic algorithm;
k-mers serve as alignment seeds, and the choice of k affects sensitivity
Layout: Construction and manipulation of an overlap graph leads to an
approximate read layout
Consensus: Multiple sequence alignment (MSA) determines the precise layout
and then the consensus sequence.
Loading base calls-computer memory
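The overlap step of OLC can be sketched on toy reads. The code below finds exact suffix-prefix overlaps all-against-all and then merges greedily by longest overlap. Note the simplifications: the greedy merge is a stand-in for the layout and consensus stages (no overlap graph, no multiple sequence alignment), overlaps must be exact, and repeats would mislead it. All names and the minimum overlap of 3 are illustrative assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic, fine only for toy input
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))  # ['ACGTACGGATTC']
```

The quadratic all-against-all loop is the reason real OLC assemblers seed overlap candidates with shared k-mers instead of comparing every pair directly.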
Assembly vs Repetitive DNA
Assembly vs Repetitive DNA and Coverage
Why is coverage important?
resolution
repeat discovery, copy number estimation
binning of metagenomic data
Assembly vs GC content
Both GC-rich and AT-rich fragments are underrepresented in Illumina sequencing results
Why is GC important?
affecting coverage
HGT discovery
binning of metagenomic data
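GC content is simply the fraction of G and C among the called bases, and computing it in windows along a contig is one way to spot regions with atypical composition (for example candidate HGT regions or contigs to bin separately). A minimal sketch, with window and step sizes chosen arbitrarily for the toy sequence:

```python
def gc_content(seq):
    """Fraction of G+C among A/C/G/T bases (ambiguous bases such as N are ignored)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window, step):
    """GC content of consecutive windows along the sequence."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


print(gc_content("GGCCAATT"))
print(windowed_gc("GGCCAATT", window=4, step=4))
```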
Assembly vs GC content
Less even coverage with Illumina
Assembling algorithms and Scaffolders
Velvet and Velvet Optimizer
Newbler
Celera
MaSuRCA
http://en.wikipedia.org/wiki/Sequence_assembly
Annotation
Ready for Annotation?
Checking gene coverage:
UCOs - Ultra Conserved Orthologs (Kozik et al., 2007)
CEGMA - Core Eukaryotic Genes Mapping Approach (Parra et al., 2007)
SICO - Single Copy genes in Proteobacteria (Lerat et al., 2003)
Percent gaps:
library insert size vs. 50 “N”s
Median gene length roughly
proportional to genome size
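Percent gaps can be checked directly from the scaffold sequences by counting "N" characters. The sketch below takes scaffolds as plain strings; reading them from a FASTA file is left out, and the function name is an illustrative assumption.

```python
def percent_gaps(scaffolds):
    """Percentage of all scaffold bases that are gap characters (N)."""
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        return 0.0
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return 100 * gaps / total


# Toy scaffolds: 4 N's out of 20 bases
print(percent_gaps(["ACGTNNNNAC", "ACGTACGTAC"]))
```

Keep in mind the caveat from the slide: some scaffolders write a fixed run of 50 N's regardless of the gap estimated from the library insert size, so this percentage may understate the true missing sequence.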
Ready for Annotation?
UCOs
Sanger
Illumina
454
Annotation of Prokaryotic Genomes
Gene prediction:
GLIMMER
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
Automated pipelines and annotation software:
RAST
BASys
SOP
PROKKA
IMG ER
Annotation of Prokaryotic Genomes
Repeated errors
Inconsistent gene names
Additional data and postgenomic experiments
Annotation of Eukaryotic Genomes
Standard draft assembly
High quality draft assembly
Two phases
1. computation phase
repeat masking (homopolymers, transposable elements)
evidence alignment (proteins, ESTs, RNA-seq data aligned)
ab initio gene prediction vs Evidence driven gene prediction
2. annotation phase
finding a consensus
Annotation of Eukaryotic Genomes
Gene prediction and gene annotation are not synonyms!
Predictors do not report untranslated regions (UTRs) or splice variants