Transcript Document

From First Assembly
Towards a New
Cyberpharmaceutical
Computing Paradigm
Sorin Istrail
Senior Director, Informatics Research
Stan Ulam’s Vision
“Don’t ask what Mathematics
can do for Biology, ask what
Biology can do for Mathematics.”
“The more I see the less I know for sure.”
John Lennon
The O.J. Simpson Problem
No matter how good the data look,
it is full of errors!
David Botstein, Albuquerque, 1994
Science and Technology








Genomics
Comparative Genomics
Proteomics
Pharmacogenomics
Structural Genomics
Drug and Vaccine Design
DNA Expression Chips
Animal Models
GENOMICS
There is Nothing More Important than
the Assembly!
Gene Myers and Granger Sutton
and the Assembly Team
Celera Human Assembly:
Largest Non-Defence Computation
20,000 CPU hours -- done 5 times now
160 Processors --
Compaq Architecture
Assembly Progression
(Macro View)
Generally 3-10 pairs link each consecutive contig
Completed
Microbial
Genomes
Escherichia
coli
Helicobacter
pylori
Bacillus
subtilis
Chlamydia
trachomatis
Archaeoglobus
fulgidus
Mycoplasma
pneumoniae
1997
Mycoplasma
genitalium
Methanococcus
jannaschii
(Deinococcus
radiodurans)
Aq Aquifex
aeolicus
Haemophilus
influenzae
1996
(Mycobacterium
tuberculosis
CSU#93)
Methanobacterium
thermoautotropicum
Saccharomyces
cerevisiae
1998
Borrelia
burgdorferi
Synechocystis
sp.
Pyrococcus
horikoshii
Mycobacterium
tuberculosis H37Rv
Treponema
pallidum
1999
(Thermotoga
maritima)
(Rickettsia
prowazekii)
GENOMES SEQUENCED AT TIGR and CELERA
Pathogens
*Haemophilus influenzae Rd
*Mycoplasma genitalium
Environment
Plants
*Helicobacter pylori
*Methanococcus jannaschii
*Arabidopsis thaliana
*Borrelia burgdorferi
*Archaeoglobus fulgidus
*Treponema pallidum
*Thermotoga maritima
*Plasmodium flaciparum
*Neisseria meningitidis
*Chlamydia trachomatis
Insects
**Drosophila melanogaster
*Deinococcus radiodurans
*Chlorobium tepidum
*Chlamydia pneumoniae
*Caulobacter crescentus
*Vibro cholerae
*Shewanella putrafaciens
*Streptococcus pneumoniae
*Desulfovibrio vulgaris
*Mycobacterium tuberculosis
*Pseudomonas putida
Mammals
**Human
**Mouse
*Porphyromonas gingivalis
*Trypanosoma brucei
*Staphylococcus aureus
*Enterococcus faecalis
* The Institute for Genomic Research
*Porphyromonas gingivalis
** Celera Genomics
*Chlamydia psittaci
Genesis of Celera August 1998
 New 3700 automated DNA
Sequencer changed the sequencing
possibility
 Combined with TIGR Whole
Genome Sequencing Strategy

And 64bit computing
Celera’s Sequencing / SNP Discovery Center
Celera Supercomputing Facility

Celera’s system is one of the most
powerful civilian super-computing
facilities in the world

Currently over 1.5 teraflop of
computing power in a virtual compute
farm of Compaq processors with 100
terabytes storage

Next phase a 100 teraflop computer
Obstacles to Genome Sequencing
• Sequencing reactions produce short reads (~550bp).
Sequence read
~550 bases
Human Genome
~3 billion bases
GCATTA...GACCGT
• The human genome is repeat-rich.
CGGATAGACATAAC
CGGATAGACATAAC
CGGATAGACATAAC
CAGCAGCAGCAGCA
CAGCAGCAGCAGCA
CAGCAGCAGCAGCA
Many short reads look identical to each other.
Comparison of Sequencing Strategies
Lab-Intense
(SLOW)
1.
Mapping and Walking
2.
Mapping and Clone by
Clone Shotgun
3. Whole Genome Shotgun
with Mate Pairs
Compute-Intense (FAST)
Clone by Clone Shotgun sequencing
• Mapping and Shotgun
1) Replicate mapped spans of DNA.
Chromosome
Mapped span
(BAC) 35,000
2) Shear the replicates randomly and
sequence the pieces.
3) Assemble reads by overlap
matching. Infer the original sequence
by consensus.
cgattc
cgattc
cgattc
cgattc
cgattc
cgattc
cgattc
cgattc
cgattc
Computed
overlaps
Computed
sequence
cgattcggattctcgattctacgaa
Mate-Pair Shotgun DNA Sequencing
DNA target sample
SHEAR & SIZE
End Reads / Mate Pairs
590bp
10,000bp
CLONE
& END SEQUENCE
e.g., 10Kbp
± 8% std.dev.
Whole Genome Sequencing Approaches

Whole Genome Shotgun Sequencing:
~ 27 million reads
for Human.
10Kbp
2Kbp
50Kbp
–
Collect 10-15x BAC inserts and end sequence:
–
Early simulations showed that if repeats were considered black boxes, one
could still cover 99.7% of the genome unambiguously.
BAC 5’
~ 300K pairs for Human.
BAC 3’
Building Scaffolds
Mated reads
Confirmed if at least 2 join the same
unitigs and one of them is a U-unitig.
1 in 1015 that a confirmed
pair is in error.
1. 2k-10k Scaffolds: Compute all “unitigs” in graph of
U-unitigs connected by confirmed mate links.
2. BAC Scaffolds: Compute all “unitigs” in graph of
2/10K scaffolds connected by confirmed BAC links.
Scaffold
Sequence or Repeat Gaps
(with estimated distances)
The Gene Counting Problem
The number probably will be never known exactly.
Current estimates: 30,000-40,000
Other estimates: 120,000
• sequence analysis
Gene discovery: • motif recognition
• matches to mRNA
• computational predictions
• mouse data matches
• experimental validation
Gene Counting
Random ESTs from tissues
CpG islands (55% of known genes are in CpG islands)
Complexity of EST data sets -- sampling biased on
tissue and depth of collection
Underrepresented in data bases: Low abundance genes,
in inaccessible tissues or developmental stages
Overrepresented: EST data sets are composed of incomplete
sequences of mRNA, and non-overlapping pieces of same mRNA
Drosophila
Nucleic Acid
Binding
Functional Assignment8%
using
Hypothetical
11%
Enzyme
Gene Ontology
18%
Signal
Transduction
4%
Transporter
4%
13,601 Genes
Structural Protein
2%
Unknown
48%
Ligand Binding or
Carrier
2%
Motor Protein
1%
Nucleic Acid Binding
Transporter
Cell Adhesion
Unknown
Enzyme
Structural Protein
Chaperone
Hypothetical
Chaperone
1%
Cell Adhesion
1%
Signal Transduction
Ligand Binding or Carrier
Motor Protein
Gene Number in the Human Genome
Number of genes
10 K
Known genes
20 K
30 K
Otto
4
3
2
40 K
1
50K
Confidence
Haemophilus vs. Drosophila

 Genome Size (Mbp)
 Sequences
 Months in sequencing
 Sequencing Staff
 Assembly Group Staff
Hflu
1.8
Drosophila
120
X
67
26,000
3,100,000
116
4
4
1
24
50
1
10
2.1
10
Human Genome Sequence from 5 Humans (3
females-2 males) completed
=Human sequencing started 9/8/99
=Over 39X coverage of the genome in paired plasmid reads
=First Assembly announced June 26 2.9 billion bp
=Published in Science, February 16, 2001
Validation Against STS-map


Scaffolds were aligned against
the BDGP STS-content map
All scaffolds with spanning 2 or
more STSs were checked for
order discrepancies.
16 STS sites out of 2175 (.73%)
were out of order, well within
the estimated error rate of the
STS map. 10 have been
determined to be incorrect.
4
3L
BDGP STS Order

3R
2L
2R
X
Celera Scaffold and STS Order
Components vs. GeneMap ‘99
Order & Orientation is Essential to Finding Genes
Exon 1
Exon 2
Exon 3
Exon 4
But if contigs are not correctly put together:
2
1
4
3 reversed
Exons are shuffled and unoriented, significantly impacting the ability of
gene finding programs to make a correct prediction.
Users consistently report finding genes that they can’t find elsewhere.
Contactin-associated protein gene (CNTNAP2)
Comparison of genomic DNA sequences retrieved from the public working draft
and the Celera database
Working draft
Celera
Genomics 73, 108-112, 2001. http://www.idealibrary.com
Scaffold Sizes
Scaffold Length
scaffold length
(Mbp)
25
20
Mouse WGA
Human WGA
Human CSA
15
10
5
0
of genome
percent of%
genome
coverage
Scaffold Sizes
Scaffold Length
(Mbp)
80
60
Mouse WGA
Human WGA
Human CSA
Drosophila WGA
Celera-only WGA
40
20
0
% of genome
Human and Mouse Assemblies
All
Span (Mbp) Scaffolds
Gap (Mbp)
Gaps 30K % 100K %
WGA
2,847
119,000
261
221,000
90.4
88.6
CSA
2,905
53,000
252
170,000
94.6
92.9
Mouse WGA
2,446
19,778
212
265,000
96.8
95.5
C-only WGA
2,781
6,500
134 *182,000
99.0
98.7
30K
Span (Mbp) Scaffolds
Gap (Mbp)
Gaps
WGA
2,574
2,507
240
99,000
CSA
2,748
2,845
224
112,000
Mouse WGA
2,367
1,779
193
242,000
C-only WGA
2,754
537
118
174,000
The Human Genome is NOT
THE Book of Life
The Blueprint of Humanity
The Language of God
The Parts List of Humanity
ADRB2
Molecular Function of Predicted proteins
BLAST, FASTA, and SIM4
Sorin Istrail
Celera Genomics
BLAST (Basic Local Alignment Search Tool)
 A suite of sequence comparison algorithms optimized for
speed used to search sequence databases for optimal local
alignments to a protein or nucleotide query
Program
Query
Database
blastp
protein
protein
blastn
DNA
DNA
blastx
DNA (translated in 6 frames)
protein
tblastn
protein
DNA (translated in 6 frames)
tblastx
DNA (translated in 6 frames)
DNA (translated in 6 frames)
Altschul, Gish, Miller, Myers, Lipman “Basic Local Alignment Search Tool”, J.Mol.Biol. 215(3):403-10 (1990)
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman “Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs”, NAR 25(17):3389-402 (1997) (and references therein)
The BLAST algorithm

Detect all word hits (exact, or nearly identical matches) of a given
length between the two sequences



Extend the word hits in both directions to high-scoring gap-free
segment pairs (HSPs)



retain only HSPs that score above a threshold
start from the center of the HSP (original BLAST, 1990), or from the center
of a pair of HSPs located close to each other on the same diagonal (gapped
BLAST, 1997)
Extend the HSPs in both directions allowing for gaps


k=10 for nucleotide sequences (exact word matches)
k=3 for protein sequences (nearly identical word matches)
use dynamic programming, and stop when the alignment score falls more
than a threshold X below the best score yet seen
Report all statistically significant local alignments


E-value (starting with BLAST 2.0) is used to measure the statistical significance
E-value = the number of alignments with score equal to or higher than s one
would expect to find by chance when searching the database
FASTA

A program for rapid alignment of pairs of protein and DNA sequences,
building a local alignment from matching sequence patterns, or words

Algorithm for comparing a query to a database of sequences
For each database sequence:

Identify the 10 diagonal regions having the largest number of perfect word
matches of a given length


word size: k=1,2 for protein, and k=6-10 for nucleotide searches
Re-score these regions using a given scoring matrix (e.g., PAM250), and trim
them to form (gap-free) maximal scoring initial regions
 Join (non-overlapping) initial regions from adjacent diagonals to generate
longer regions, allowing for gaps
 Re-score these based on the initial regions’ scores, assessing a penalty for
each joining
Align the query sequence to each of the sequences in the search set having the
highest overall scores
Pearson and Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci.
USA 85; 2444-2448 (1988).
Sim4

Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a
genomic sequence for that gene, allowing for introns and sequencing
errors
cDNA
genomic
sequence
Intron
Intron
5’
Exon 1
GT
AG
GT
Exon 2
AG
Exon 3
3’
Florea, Hartzell, Zhang, Rubin, Miller, “A computer program for aligning expressed DNA and genomic
sequences”, Genome Res 8(9):967:74 (1998)
Stages and algorithmic techniques

Detect basic homology blocks

Determine gap-free matches (HSPs) using a ‘blast’-like homology search






Extend the similarity blocks using fast greedy sequence comparison algorithms
Detect new exon cores with the ‘blast’-like homology search tuned for higher
sensitivity
Refine the introns


Connect the HSPs to form larger blocks (‘exon cores’) using sparse dynamic
programming
Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA
sequence


Detect all exact word matches of length k (e.g., k=12)
Extend the word hits in both directions, by substitutions, to gap-free high-scoring
segment pairs (HSPs)
Retain only HSPs scoring above a threshold
Predict the locations of splice junctions using a combined measure of the
accuracy of alignment and the intensity of splice signals at the ends of each
intron
Generate the spliced alignment


Align the sequences within individual exons using greedy alignment algorithms
Connect the chain of exon alignments by gaps (introns)