Comparative genomics for
biological discovery
Lior Pachter
Dept. Mathematics, U.C. Berkeley
February 3, 2004
Comparative Genomics
From: Hardison RC (2003) Comparative Genomics.
PLoS Biol 1(2): e58.
February 2001
December 2002
Rat 2004
Picture credit: G.Bourque, P. Pevzner, G. Tesler and the
Rat Genome Sequencing Consortium
State of the Genomes (Jan 2004)
[Figure: genome assemblies as of January 2004, labelled with assembly versions and sizes (about 0.35 Gb to 3 Gb, some marked with an asterisk) and grouped by status: "Aligned (multiple)", "Working on it", and "As soon as released".]
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Outline
VISTA/AVID tools for comparative genomics
Related biological stories
Human/Mouse/Rat
Phylogenetic Shadowing
http://www-gsd.lbl.gov/vista
Processed ~ 11000 queries on-line, distributed > 560 copies of the
program in 34 countries
VISTA/AVID package
• AVID: Program for global alignment of DNA
fragments of any length
N. Bray and L. Pachter, MAVID: Constrained Ancestral Alignment of Multiple
Sequences, Genome Research, in press.
N. Bray, I. Dubchak, L. Pachter, AVID: A Global Alignment Program , Genome Research,
13 (2003) p 97 - 102.
• VISTA: Visualization of alignment and various
sequence features for any number of species
C. Mayor, M. Brudno, J.R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter and I.
Dubchak, VISTA: Visualizing global DNA sequence alignments of arbitrary length,
Bioinformatics, 16 (2000), p 1046-1047.
Aligning large genomic regions
• Long sequences lead to memory problems
• Speed becomes an issue
• Long alignments are very sensitive to parameters
• Draft sequences present a nontrivial problem
• Accuracy is difficult to measure and to achieve
References for other existing programs:
Glass:
Domino Tiling, Gene Recognition, and Mice.
Pachter, L. Ph.D. Thesis, MIT (1999)
Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction.
Batzoglou, S., Pachter, L., Mesirov, J., Berger, B., Lander, E. Genome Research (2000).
MUMmer
Delcher, A.L., Kasif S., Fleischmann, R.D., Peterson J., White, O. and Salzberg, S.L.
Alignment of whole genomes. Nucleic Acids Research (1999)
PipMaker
PipMaker: A Web Server for Aligning Two Genomic DNA Sequences.
Scott Schwartz, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs,
Ross Hardison, and Webb Miller. Genome Research (2000)
DIALIGN
Multiple DNA and protein sequence alignment based on segment-to-segment comparison
B. Morgenstern, A. Dress and T. Werner, Proc. Natl. Acad. Sci. USA 93 (1996)
Variations on Sequence Alignment
• Global alignment: find the best OVERALL alignment.
• Local alignment: find ALL regions of similarity.
• Optimal local alignment: find the BEST region of similarity.
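To make the global/local distinction concrete, here is a minimal sketch that scores the same pair of sequences both ways with the textbook dynamic programs (Needleman-Wunsch for global, Smith-Waterman for optimal local). The scoring values and the function name align_scores are assumptions for illustration, not parameters taken from these slides.

```python
def align_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, best_local_score) for sequences a and b."""
    m = len(b)
    g_prev = [j * gap for j in range(m + 1)]   # global: leading gaps are charged
    l_prev = [0] * (m + 1)                     # local: free to start anywhere
    best_local = 0
    for i in range(1, len(a) + 1):
        g_cur = [i * gap] + [0] * m
        l_cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            g_cur[j] = max(g_prev[j - 1] + s, g_prev[j] + gap, g_cur[j - 1] + gap)
            # local scores are clamped at 0 so a bad prefix can be dropped
            l_cur[j] = max(0, l_prev[j - 1] + s, l_prev[j] + gap, l_cur[j - 1] + gap)
            best_local = max(best_local, l_cur[j])
        g_prev, l_prev = g_cur, l_cur
    return g_prev[m], best_local
```

For two sequences that share only a short core, the local score stays at the score of the core while the global score is dragged down by the unrelated flanks.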
AVID: the alignment engine behind VISTA
• Very fast global alignment of megabases of sequence.
• Provides details about ordered and oriented contigs, and accurate placement in the finished sequence.
• Full integration with repeat masking.

• ORDER and ORIENT
• FIND all common k-long words (k-mers)
• ALIGN k-mers, scoring by local homology
• FIX k-mers with good local homology
• RECURSE with smaller k (shorter words)
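The FIND, ALIGN, FIX, and RECURSE steps above can be sketched roughly as follows. This is an illustration only, not AVID's implementation: exact k-mers that occur once in each sequence stand in for "good local homology", a simple quadratic chaining step stands in for ALIGN, and the names common_kmers, chain, and anchors are hypothetical.

```python
from collections import defaultdict

def common_kmers(a, b, k):
    """FIND: (i, j) pairs where the same k-mer occurs exactly once in a and in b."""
    def index(s):
        pos = defaultdict(list)
        for i in range(len(s) - k + 1):
            pos[s[i:i + k]].append(i)
        return pos
    ia, ib = index(a), index(b)
    return sorted((ia[w][0], ib[w][0]) for w in ia
                  if len(ia[w]) == 1 and len(ib.get(w, ())) == 1)

def chain(matches, k):
    """ALIGN (simplified): longest collinear, non-overlapping chain of matches."""
    if not matches:
        return []
    best, prev = [1] * len(matches), [-1] * len(matches)
    for x, (i, j) in enumerate(matches):
        for y in range(x):
            pi, pj = matches[y]
            if pi + k <= i and pj + k <= j and best[y] + 1 > best[x]:
                best[x], prev[x] = best[y] + 1, y
    x = max(range(len(matches)), key=best.__getitem__)
    out = []
    while x != -1:
        out.append(matches[x])
        x = prev[x]
    return out[::-1]

def anchors(a, b, k=8, min_k=3, off_a=0, off_b=0):
    """FIX the chained k-mers, then RECURSE on the gaps between them with k // 2."""
    if k < min_k or len(a) < k or len(b) < k:
        return []
    result, pa, pb = [], 0, 0
    for i, j in chain(common_kmers(a, b, k), k):
        result += anchors(a[pa:i], b[pb:j], k // 2, min_k, off_a + pa, off_b + pb)
        result.append((off_a + i, off_b + j, k))   # (start in a, start in b, length)
        pa, pb = i + k, j + k
    result += anchors(a[pa:], b[pb:], k // 2, min_k, off_a + pa, off_b + pb)
    return result
```

Each returned triple marks an exact match that a full aligner would treat as fixed; the remaining gaps between anchors would still need a base-level alignment.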
Visualization
tggtaacattcaaattatg-----ttctcaaagtgagcatgaca-acttttttccatgg
|| | ||||  |  |  ||     || | | |    |||||| |  ||   |   | ||
tgatgacatctatttgctgtttcctttttagaaactgcatgagagcctggctagtaggg
Window of length L is centered at a particular nucleotide in
the base sequence
Percent of identical nucleotides in L positions of the alignment
is calculated and plotted
Move to the next nucleotide
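A minimal sketch of the windowed percent-identity curve described above, assuming the two rows of a pairwise alignment are given as equal-length strings with "-" for gaps. How window edges and gap columns are treated is an assumption here (the window is truncated at the ends and gap columns count as mismatches); the function name window_identity is hypothetical.

```python
def window_identity(base, other, L):
    """Percent identity in a window of about L columns centred on each base nucleotide.

    For even L the window spans L + 1 columns so that it stays centred.
    """
    if len(base) != len(other):
        raise ValueError("aligned rows must have equal length")
    prefix = [0]                       # prefix[c] = identical columns before column c
    for x, y in zip(base, other):
        prefix.append(prefix[-1] + (x == y and x != "-"))
    half = L // 2
    curve = []
    for col, ch in enumerate(base):
        if ch == "-":                  # only nucleotides of the base sequence are plotted
            continue
        lo, hi = max(0, col - half), min(len(base), col + half + 1)
        curve.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return curve
```

The returned list has one value per nucleotide of the base sequence, which is the quantity a VISTA-style plot draws against the base-sequence coordinate.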
Finding conserved regions with
percentage and length cutoffs
Conserved segments with percent identity X and
length Y are regions in which every contiguous
subsegment of length Y is at least X% identical
to its paired sequence. These segments are
merged to define the conserved regions.
Output:
11054 - 11156 = 103bp at 77.670%  NONCODING
13241 - 13453 = 213bp at 87.793%  EXON
14698 - 14822 = 125bp at 84.800%  EXON
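The cutoff rule above can be sketched as follows: slide a window of length Y, keep every window at or above X% identity, and merge runs of consecutive passing windows so that every length-Y subsegment of a reported region meets the cutoff. This is a sketch under assumptions: coordinates are 1-based alignment columns rather than base-sequence positions, gap columns count as mismatches, and the name conserved_regions is hypothetical.

```python
def conserved_regions(base, other, min_len, min_pct):
    """Return (start, end, length, percent_identity) tuples, 1-based inclusive."""
    prefix = [0]
    for x, y in zip(base, other):
        prefix.append(prefix[-1] + (x == y and x != "-"))
    runs = []                                   # [first_window_start, last_window_start]
    for start in range(len(base) - min_len + 1):
        ident = prefix[start + min_len] - prefix[start]
        if 100 * ident >= min_pct * min_len:
            if runs and runs[-1][1] == start - 1:
                runs[-1][1] = start             # extend the current run of passing windows
            else:
                runs.append([start, start])
    out = []
    for first, last in runs:
        end = last + min_len
        pct = 100.0 * (prefix[end] - prefix[first]) / (end - first)
        out.append((first + 1, end, end - first, pct))
    return out

def format_region(region):
    """Render one region in the style of the output lines shown on the slide."""
    start, end, length, pct = region
    return f"{start} - {end} = {length}bp at {pct:.3f}%"
```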
VISTA Plot
[Figure: VISTA plot of the KIF gene showing conserved noncoding sequences. Horizontal axis: human sequence, 0 kb to 10 kb. Vertical axis: % identity, 50% to 100%.]
Multi-Species Comparative Analysis (mVISTA)
Apolipoprotein AI gene
[Figure: mVISTA plots of human/macaque, human/pig, human/rabbit, human/mouse, human/rat, and human/chicken, each on a % identity axis from 50% to 100%. A liver enhancer is marked.]
Some results obtained with VISTA
J Mol Cell Cardiol 34, 1345-1356 (2002)
Myocardin: A Component of a Molecular Switch for Smooth Muscle
Differentiation. J. Chen, C. M. Kitchen, J. W. Streb and J. M. Miano
University of Oxford
VISTA used to solve the gene structures of rat and human myocardin.
Blood, 100, 3450-3456 (2002)
Deletion of the mouse α-globin regulatory element (HS 26) has an
unexpectedly mild phenotype
E. Anguita, J. A. Sharpe, J. A. Sloane-Stanley, C. Tufarelli, D. R. Higgs, and W. G. Wood
University of Oxford.
Genome Research 11, 78 (2001)
Human and Mouse α-Synuclein Genes: Comparative Genomic Sequence Analysis
and Identification of a Novel Gene Regulatory Element
J. W. Touchman, et al.
NIH Intramural Sequencing Center, National Institutes of Health
Synuclein gene involved in Alzheimer’s disease
EMBO reports 4:143 (2003)
The kangaroo genome. Leaps and bounds in comparative genomics
M. J. Wakefield and J. A. Marshall Graves
Research School of Biological Sciences, The Australian National University,
Canberra, ACT 0200, Australia
‘The kangaroo genome is a rich and unique resource for comparative genomics,
a treasure trove of comparative genomics data’.
Phylogenetic footprinting of 3’ untranslated region of the SLC16A2 gene
VISTA flavors
• VISTA – comparing DNA of multiple
organisms
• for 3 species - analyzing cutoffs to define
actively conserved non-coding sequences
• cVISTA - comparing two closely related
species
• rVISTA – regulatory VISTA
Identifying conserved non-coding sequences (CNSs)
involved in transcriptional regulation
rVISTA - prediction of transcription
factor binding sites
• Simultaneous searches of the major transcription
factor binding site database (Transfac) and the
use of global sequence alignment to sieve through
the data
• Combination of database searches with comparative
sequence analysis reduces the number of predicted
transcription factor binding sites by several orders
of magnitude
Regulatory VISTA (rVISTA)
1. Identify potential transcription factor binding sites for
each sequence using library of matrices (TRANSFAC)
2. Identify aligned sites using AVID
3. Identify conserved sites using dynamic shifting window
Percentage of conserved sites out of the total: 3-5%
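Steps 2 and 3 above can be sketched as a filter over predicted sites. This is not rVISTA's code: it assumes site predictions are already available as (factor, start, end) tuples in ungapped 0-based coordinates, treats a site as "aligned" when the same factor is predicted at the same alignment columns in the second sequence, and interprets the "dynamic shifting window" as the best-scoring window of the given width among all windows that contain the site. The name conserved_sites and all parameters are hypothetical.

```python
def conserved_sites(aln_a, aln_b, sites_a, sites_b, window=20, min_pct=80.0):
    """Return (aligned, conserved) subsets of sites_a.

    aln_a, aln_b: the two gapped rows of one pairwise alignment ('-' for gaps).
    """
    n = len(aln_a)
    cols_a = [c for c, x in enumerate(aln_a) if x != "-"]   # ungapped pos -> column
    cols_b = [c for c, y in enumerate(aln_b) if y != "-"]
    prefix = [0]
    for x, y in zip(aln_a, aln_b):
        prefix.append(prefix[-1] + (x == y and x != "-"))
    hits_b = {(tf, cols_b[s], cols_b[e - 1]) for tf, s, e in sites_b}
    aligned, conserved = [], []
    for tf, s, e in sites_a:
        c0, c1 = cols_a[s], cols_a[e - 1]
        if (tf, c0, c1) not in hits_b:
            continue                                        # no matching site in b
        aligned.append((tf, s, e))
        w = min(n, max(window, c1 - c0 + 1))
        # every window start for which the window still covers the whole site
        starts = range(max(0, c1 + 1 - w), min(c0, n - w) + 1)
        best = max(100.0 * (prefix[lo + w] - prefix[lo]) / w for lo in starts)
        if best >= min_pct:
            conserved.append((tf, s, e))
    return aligned, conserved
```

With the 20 bp window and 80% cutoff quoted on the slides as defaults, the function returns both the aligned and the conserved subsets, so the reduction at each step can be counted.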
Ikaros-2, NFAT, and Ikaros-2 sites marked on a six-species alignment (20 bp dynamic shifting window, >80% ID):
Human   TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse   TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog     TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit  TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA
~1 Mb region, 5q31
                                                          Coding   Noncoding
Human interval Transfac predictions for GATA sites           839       20654
Aligned with the same predicted site in the mouse seq.       450        2618
Aligned sites conserved at 80% / 24 bp dynamic window        303         731
Random DNA sequence of the same length: 29280
[Figure: IL 5 region with 2 experimentally verified GATA-3 sites; tracks for GATA-3 (28) and GATA-3 Conserved (4).]
[Figure, panels A to C, each on a % identity axis from 50% to 100%. A: Ik-2-All, Ik-2-Aligned, Ik-2-conserved. B: AP-1-conserved, NFAT-conserved, GATA-3-conserved. C: AP-1 and NFAT, each as All, Aligned, and Conserved.]
Main features of AVID
• Alignments up to several megabases
• Works with finished and draft sequences
• Fast
• Accurate for close and distant organisms
Main features of VISTA
• Clear, configurable output
• Ability to visualize several global
alignments on the same scale
• Available source code and Web site
Large scale VISTA/AVID applications:
Cardiovascular comparative genomics database
http://pga.lbl.gov
Berkeley Genome Pipeline – comparing the human and
mouse genome
http://pipeline.lbl.gov/
Multiple whole genome comparisons using MAVID
http://bio.math.berkeley.edu/genome/
Automatic computational system for
comparative analysis of pairs of genomes
http://pipeline.lbl.gov
Alignments (all pair-wise combinations):
Human genome: Golden Path Assembly
Mouse assemblies: Arachne, Phusion (2001); MGSC v3 (2002)
Rat assemblies: November 2002, February 2003
D. melanogaster vs D. pseudoobscura: February 2003
Main modules of the system
Mapping and alignment of mouse contigs
against the human genome
Visualization
Analysis of conservation
Tandem Local/Global Alignment Approach
• Finding a likely mapping for a contig
• Multi-step verification of potential regions by global alignment
Specificity test
The ratio of the number of bp on each human chromosome covered by
alignments of the reversed mouse genome and the number of base pairs
covered by the actual mouse genome.
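The specificity ratio described above can be computed from two sets of alignment intervals on one human chromosome. A minimal sketch, assuming half-open (start, end) intervals in base pairs; the function names covered_bp and specificity_ratio are hypothetical.

```python
def covered_bp(intervals):
    """Number of base pairs covered by a set of possibly overlapping intervals."""
    total, cur_s, cur_e = 0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                total += cur_e - cur_s       # close the previous merged interval
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)            # overlapping or touching: extend
    if cur_e is not None:
        total += cur_e - cur_s
    return total

def specificity_ratio(reversed_hits, real_hits):
    """bp covered by reversed-mouse alignments divided by bp covered by real ones."""
    real = covered_bp(real_hits)
    return covered_bp(reversed_hits) / real if real else float("nan")
```

A ratio near zero means that alignments to the reversed (control) genome cover very little of the chromosome compared with alignments to the actual mouse genome.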
Apolipoprotein(a) region. The expressed gene is confined to
a subset of primates. Our method is the only one to predict
that apo(a) has NO homology in the mouse.
VistaBrowser
Input your own sequence to align against the Reference
Genomes: Human, Mouse, Rat, D. melanogaster
GenomeVISTA
Opossum BAC versus Human Genome
Examples of Results
• Understanding the structure of conservation
• Identification of putative functional sites
• Discovery of new genes
• Detection of contamination and misassemblies
Two assemblies are better than one
Identification of a New Apo Gene on Human 11q23
Gene Name
Highly Conserved Region
Zoom In
ApoA4
ApoC3 ApoA1
Identification of a New Apo Gene on Human 11q23
New Gene (ApoA5)
Pennacchio LA et al.
Science. 2001, 294:169-73.
Finding regulatory regions
Muscle Specific Regulatory Region: human
beta enolase intronic enhancer
Comparative analysis of genomic intervals
containing important cardiovascular genes
http://pga.lbl.gov
http://pga.lbl.gov/cvcgd.html
Example of CVCGD entry
Short annotation of the region
Detailed annotation in AceDB
format
VISTA plot of the region
multiVISTA plot of the region
Alignment
Conserved regions
Comparing the human, mouse and rat
• Design a computational scheme for multiple genome
mapping (Construction of Homology Maps)
• Move from pair-wise to multiple DNA alignment (MAVID)
• Novel visualization and browsing techniques (KBROWSER)
MAVID architecture overview
ML ancestor
AVID
Nicolas Bray
http://baboon.math.berkeley.edu/mavid/
Human-Mouse-Rat
Human: April 03
Mouse: Feb. 03
Rat: June 03
Homology map (Colin Dewey)
~500 HMR blocks
MAVID
Computer cluster
Conservation
Annotation
…..
Result:
3-way alignment of human-mouse-rat
Foundation for further analysis
Can be browsed at
http://hanuman.math.berkeley.edu/kbrowser/
[Figure: human/mouse/rat tree with branch lengths t_h (human), t_m (mouse), and t_r (rat).]
Identification of Rodent Hotspots
[Figure: two human/mouse/rat trees.]
http://bio.math.berkeley.edu/slam/
SLAM components
• Splice site detector
– VLMM
• Intron and intergenic regions
– 2nd order Markov chain
– independent geometric lengths
• Coding sequence
– PHMM on protein level
– generalized length distribution
• Conserved non-coding sequence
– PHMM on DNA level
SLAM input and output
• Input:
– Pair of syntenic sequences (FASTA).
• Output:
– CDS and CNS predictions in both sequences.
– Protein predictions.
– Protein and CNS alignment.
Summary statistics
# of SLAM human/mouse genes: 29370
# of SLAM human/rat genes: 25427
# of SLAM genes identical in human, mouse, and rat: 3698
# of SLAM human/mouse/rat genes overlapping human RefSeq: 2478
% of SLAM human/mouse/rat genes with correct structure (out of genes overlapping human RefSeq): 36%
# of novel (not overlapping with human Ensembl, RefSeq, or Known genes) SLAM human/mouse/rat genes: 924
# of SLAM human/mouse/rat genes tested: 48 ortholog pairs (48 human, 48 rat)
% of SLAM human/mouse/rat genes verified: 73% (29 pairs verified in both human and rat, 6 verified only in rat)
Comparative Genomics
From: Hardison RC (2003) Comparative Genomics.
PLoS Biol 1(2): e58.
Example: LXR-α exon 3
[Figure: % identity plot, 50% to 100%.]
Human: chromosome 11
13 other primate sequences (~2kb each)
• Begin with a multi-FASTA file
• No phylogenetic tree
• No alignment
• No annotation
Nicolas Bray
http://baboon.math.berkeley.edu/mavid/
Non-conserved likelihood calculation
Conserved likelihood calculation
Example: LXR-a exon 3
[Figure: log(lik[fast]/lik[slow]) (vertical axis, -2.1 to 0.4) along the sequence (bp, 0 to 1500), shown with a % identity track (50% to 100%).]
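The plotted quantity, log(lik[fast]/lik[slow]), compares the likelihood of one alignment column under a fast-evolving (non-conserved) model and a slow-evolving (conserved) model on the same tree. Below is a minimal sketch using Felsenstein's pruning algorithm. The Jukes-Cantor substitution model, the rate multipliers 1.0 and 0.1, the nested-tuple tree encoding, and the function names are all assumptions for illustration; the slides do not state which model the shadowing analysis uses.

```python
import math

BASES = "ACGT"

def jc_prob(same, t):
    """Jukes-Cantor probability of ending in the same / a given different base after t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial(node, column, rate):
    """P(observed leaves below node | base at node), one entry per base.

    A leaf is a species name; an internal node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        obs = column[node]
        # gaps and unknown characters are treated as missing data
        return [1.0 if obs not in BASES else float(obs == b) for b in BASES]
    out = [1.0] * 4
    for child, length in node:
        below = partial(child, column, rate)
        for i, x in enumerate(BASES):
            out[i] *= sum(jc_prob(x == y, rate * length) * below[j]
                          for j, y in enumerate(BASES))
    return out

def log_lik(tree, column, rate):
    """Log-likelihood of one column with uniform base frequencies at the root."""
    return math.log(sum(0.25 * p for p in partial(tree, column, rate)))

def log_ratio(tree, column, fast=1.0, slow=0.1):
    """log(lik[fast] / lik[slow]); negative values favour the conserved model."""
    return log_lik(tree, column, fast) - log_lik(tree, column, slow)
```

For example, with a hypothetical tree such as (("human", 0.1), ((("chimp", 0.1), ("baboon", 0.2)), 0.1)), a column that is identical in all species gives a negative ratio, and a column that differs in every species gives a positive one.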
Which primates should we
sequence?
[Figure: primate phylogeny with rodents as the outgroup; time axis from 80 to 0 million years, scale ticks 0 and 0.25. Leaves in order: Lemurs, Lorises, Tarsioids (prosimians); Cebuella, Callithrix, Callimico, Saguinus, Leontopithecus, Saimiri, Cebus, Aotus, Callicebus, Pithecia, Chiropotes, Cacajao, Alouatta, Lagothrix, Brachyteles, Ateles; Allenopithecus, Miopithecus, Erythrocebus, Chlorocebus, Cercopithecus, Macaca, Mandrillus, Cercocebus, Lophocebus, Papio, Theropithecus, Procolobus, Piliocolobus, Colobus, Semnopithecus, Kasi, Trachypithecus, Presbytis, Nasalis, Simias, Pygathrix, Rhinopithecus; Hylobates, Pongo, Gorilla, Pan, Homo. Clade labels: New-world monkeys, Old-world monkeys, Hominoids.]
k-MST problem
Given a phylogenetic tree on n leaves, and an
integer k<n, find the subtree of maximum weight
spanning k leaves.
The clamped k-MST problem is to find the subtree
of maximum weight spanning k leaves where
one of the leaves is human.
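One simple way to attack the clamped problem is a greedy heuristic: start from the clamped leaf and repeatedly add the leaf whose path to the current subtree adds the most branch length. The sketch below assumes the tree is given as a list of (node, node, length) edges; it makes no claim about being the method used for these slides, and the function name and the toy branch lengths in the usage note are hypothetical.

```python
from collections import defaultdict

def clamped_k_leaves(edges, k, clamp="Homo"):
    """Greedily pick k leaves (including clamp); return (chosen_leaves, total_weight)."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    # Root the tree at the clamped leaf: parent[x] = (next node toward clamp, length)
    parent, stack = {clamp: None}, [clamp]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = (u, w)
                stack.append(v)
    leaves = [x for x in adj if len(adj[x]) == 1 and x != clamp]
    covered = {clamp}          # nodes already connected to clamp by chosen branches
    chosen, total = [clamp], 0.0

    def gain(leaf):
        """Branch length added by connecting leaf to the current subtree."""
        g, x = 0.0, leaf
        while x not in covered:
            g += parent[x][1]
            x = parent[x][0]
        return g

    for _ in range(k - 1):
        candidates = [x for x in leaves if x not in covered]
        if not candidates:
            break
        best = max(candidates, key=gain)
        total += gain(best)
        x = best
        while x not in covered:      # mark the new path as part of the subtree
            covered.add(x)
            x = parent[x][0]
        chosen.append(best)
    return chosen, total
```

On a toy tree with edges Homo-a (1), Pan-a (1), a-b (2), Macaca-b (5), and Lemurs-b (10), asking for k = 3 picks Lemurs first and Macaca second, since the closest relative adds the least new branch length.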
[Figure: the primate phylogeny shown again (rodent outgroup, prosimians, New-world monkeys, Old-world monkeys, Hominoids; time axis from 80 to 0 million years).]
Phylogenetic shadowing of the apo(a) promoter
[Figure: log(lik[fast]/lik[slow]) (vertical axis, -0.5 to 4.5) against sequence position (bp, 250 to 2250), with conserved and non-conserved regions marked and the TATA box, HNF-1α site, and exon annotated.]
Gel-shift assay to assess DNA-protein interactions
conserved elements
DNA-protein
complex
unbound DNA
nuclear extract
non-conserved elements
Gel-shift analysis of conserved elements in the apo(a) promoter
[Figure: gel lanes for conserved elements 1 to 9, 10-1, and 10-2 and for non-conserved elements 1 to 7, with a bar chart of % oligonucleotide shifted (0 to 35) per oligonucleotide: C1 to C9, C10.1, C10.2 and N1 to N7.]
Summary and Conclusions - Phylogenetic Shadowing
• Alignment problem is tractable
• Trees can be constructed accurately
• Total tree weight is sufficient for distinguishing
conserved from non-conserved regions
• Likelihood calculations are reliable because alignments
are good
• Can decide a priori which organisms should be sequenced
• Annotation of primate-specific elements is possible
• Annotation of coding exons is accurate
• Annotation of regulatory elements is possible
• Sequencing is easier because comparative mapping and
assembly techniques can be applied
Web sites
• MAVID alignment program
http://bio.math.berkeley.edu/mavid/
• SLAM comparative gene prediction program
http://bio.math.berkeley.edu/slam/mouse/
• VISTA
http://www-gsd.lbl.gov/vista/
• KBROWSER
http://hanuman.math.berkeley.edu/kbrowser/
• SHADOWER
http://bonaire.lbl.gov/shadower/
Credits
(M)AVID
Nicolas Bray
VISTA Projects and PGA
Michael Brudno
Gaby Loots
Eddy Rubin
Olivier Couronne
Chris Mayor
Inna Dubchak
Ivan Ovcharenko
Homology Mapping
Colin Dewey
Evolutionary Hotspots
Von Bing Yap
KBROWSER
Kushal Chakrabarti
Phylogenetic Shadowing
Dario Boffelli
Jon McAuliffe
Gene Finding
Marina Alexandersson
Colin Dewey
Keith Lewis
Ivan Ovcharenko
Michael Jordan
Eddy Rubin
Simon Cawley
Richard Gibbs
Sourav Chatterji
Jia Qian Wu
Kelly Frazer
Alexander Poliakov