Comparative genomics and proteomics in Ensembl


Comparative genomics
and proteomics in
Ensembl
Sep 2006
Overview
• Rationale
• Species available
• Comparative proteomics
– Orthologue and paralogue prediction
– Protein clustering into families
• Comparative genomics
– Genome-wide DNA alignments
– Synteny block characterisation
• Future and perspectives
2 of 56
Compara
The Compara database is one single
multispecies database
• Gene orthology/paralogy prediction
• Protein clustering
• Whole genome alignments
• Synteny regions
3 of 56
The era of sequencing genomes
[Figure: species tree of sequenced genomes drawn on a time axis in million years (1000 to 100). Clades labelled: Eutheria, Metatheria, Mammalia, Aves, Amniota, Amphibia, Tetrapoda, Teleostei, Vertebrata, Urochordata, Chordata, Arthropoda, Nematoda, Fungi]
Red : whole genome assembly available
Green : whole genome assembly due within the next year in Ensembl
* 19 species currently in Ensembl
+ 10 Pre! Ensembl
H. sapiens (human) * +
P. troglodytes (chimpanzee) *
M. mulatta (rhesus macaque) *
M. musculus (house mouse) *
R. norvegicus (Norway rat) *
C. familiaris (dog) *
F. catus (cat)
E. caballus (horse)
S. scrofa (pig)
B. taurus (cow) *
O. aries (sheep)
L. africana (elephant) +
M. domestica (opossum) *
G. gallus (chicken) *
X. tropicalis (western clawed frog) *
X. laevis (African clawed frog)
D. rerio (zebrafish) *
O. latipes (Japanese medaka)
G. aculeatus (stickleback) +
T. nigroviridis (spotted green pufferfish) *
T. rubripes (torafugu) *
C. savignyi (sea squirt) +
C. intestinalis (transparent sea squirt) *
A. aegypti (yellow fever mosquito) +
A. gambiae (African malaria mosquito) *
D. melanogaster (fruitfly) *
A. mellifera (honey bee) *
C. elegans (nematode) *
S. cerevisiae (baker’s yeast) *
4 of 56
Comparing different species
• Ensembl joins species through
– orthologous/paralogous gene links
– chromosome synteny links
– protein family links
• From a broader perspective
– Where are syntenic regions located?
– How many genes are conserved?
– Where are orthologous/paralogous genes?
– Is gene order conserved?
– Where are potential regulatory regions?
– What is missing in one species, present only in another?
5 of 56
Orthologue and Paralogue
Prediction
• Evolutionary studies
• Identify potential species-specific
proteins/genes
• Identify orthologues of (human)
genes in model organisms
6 of 56
Gene Evolution
Orthologues and Paralogues
Reconstruct the Molecular Evolutionary history from
the evidence visible within the known extant genes
• Divergence
• Speciation / Duplication
• Change within allelic population
• Point Mutations / Selection / Drift
• Exon/domain shuffling
• Transposition / Translocation
• Retroposition (reverse transcription)
• Horizontal gene transfer?
7 of 56
Homologue Relationships
• Orthologues : any gene pairwise
relation where the ancestor node is
a speciation event
• Paralogues : any gene pairwise
relation where the ancestor node is
a duplication event
8 of 56
Homologue Relationships
[Figure: gene tree drawn along a time axis, with an ancestral gene A, duplication and speciation events, genes A1, A2, H1, H2, M1, M2 and M2', and gene pairs labelled as orthologues, inparalogues or outparalogues]
Orthologous genes have originated from a single ancestor (often have equivalent functions).
Paralogous genes are related via duplication:
• Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation
• Between_species_paralog (outparalogues): duplication precedes speciation
9 of 56
Orthology Prediction Algorithm
• Find orthologous genes by comparing the protein sets of two species (only the
longest peptide considered).
• blastp+sw all versus all (on a paired species basis)
• Build a graph of gene relations based on BRH (best reciprocal hit) and BSR
(BLAST score ratio)
• Extract connected components (single linkage clusters ), each cluster
representing a gene family
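The steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Compara pipeline itself: the score tables are toy data, the gene names are hypothetical, and the BSR is assumed here to be the pairwise score divided by the larger of the two self-hit scores, with a hypothetical threshold of 0.33.

```python
from collections import defaultdict

def best_hits(scores):
    """Map each query gene to its best-scoring target. scores: {(query, target): score}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Gene pairs (a, b) that are each other's best hit (BRH)."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

def bsr_edges(scores, self_scores, threshold):
    """Pairs whose BLAST score ratio reaches the threshold (BSR definition assumed)."""
    return sorted((q, t) for (q, t), s in scores.items()
                  if s / max(self_scores[q], self_scores[t]) >= threshold)

def connected_components(edges):
    """Single-linkage clusters: connected components of the gene graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Toy data: hypothetical human (H*) and mouse (M*) genes with made-up scores.
human_vs_mouse = {("H1", "M1"): 950, ("H1", "M2"): 200,
                  ("H2", "M2"): 800, ("H3", "M2"): 700}
mouse_vs_human = {("M1", "H1"): 940, ("M2", "H2"): 810, ("M2", "H3"): 690}
self_scores = {gene: 1000 for gene in ("H1", "H2", "H3", "M1", "M2")}

brh = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
edges = brh + bsr_edges(human_vs_mouse, self_scores, threshold=0.33)
clusters = connected_components(edges)
```

In this toy graph H3 is not a reciprocal best hit of M2, but its score ratio is high enough to link it into the same cluster as H2 and M2, which is the kind of case the BSR edges are meant to catch.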
10 of 56
GeneTree prediction:
MUSCLE/PHYML
• Multiple alignment of each cluster (built from BRH and BSR)
with MUSCLE
• Unrooted gene tree built using PHYML (Guindon &
Gascuel, 2003)
• Tree reconciliation (gene tree with species tree) to call
duplication events on internal nodes and root the tree
using RAP (Dufayard et al. 2005)
• Infer pairwise relations of orthology and paralogy types
(from each tree)
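As a rough illustration of the last step, the sketch below labels the internal nodes of an already rooted gene tree and reads pairwise relations off it. It does not reproduce RAP's reconciliation: it swaps in the simpler species-overlap rule (a node is called a duplication when its two subtrees share a species, otherwise a speciation). The tree encoding (nested tuples with "species:gene" leaves) and the gene names are assumptions made for the example.

```python
from itertools import product

def species_of(leaf):
    """Leaves are strings of the form 'species:gene' (a convention assumed here)."""
    return leaf.split(":")[0]

def leaf_names(node):
    """All leaves below a node; internal nodes are (left, right) tuples."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaf_names(left) + leaf_names(right)

def pairwise_relations(node, relations=None):
    """Label every gene pair by the event at its last common ancestor node."""
    if relations is None:
        relations = {}
    if isinstance(node, str):
        return relations
    left, right = node
    left_leaves, right_leaves = leaf_names(left), leaf_names(right)
    shared = ({species_of(x) for x in left_leaves}
              & {species_of(x) for x in right_leaves})
    # Species-overlap rule: shared species means the node is a duplication.
    relation = "paralogue" if shared else "orthologue"
    for a, b in product(left_leaves, right_leaves):
        relations[tuple(sorted((a, b)))] = relation
    pairwise_relations(left, relations)
    pairwise_relations(right, relations)
    return relations

# Hypothetical rooted gene tree: an old duplication at the root,
# followed by speciation and a later mouse-specific duplication.
tree = (("human:H1", "mouse:M1"),
        ("human:H2", ("mouse:M2", "mouse:M2b")))
relations = pairwise_relations(tree)
```

Here H1/M1 and H2/M2 come out as orthologues, M2/M2b as paralogues created after speciation, and H1/M2 as paralogues whose duplication precedes speciation.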
11 of 56
Molecular Phylogenetics
• Protein sequences in different species both:
– Provide information about the history of evolution
– Allow us to reconstruct evolution
• We are after an alignment that equally reflects
all species
• Modeling the branching processes by
comparing gene and species trees (tree
reconciliation)
12 of 56
Phylogenies
Revealing the evolutionary history that has led to
the organisms at the current stage.
- Leaves are real genomes
- Internal nodes are ancestors
Duplication node
Speciation node or leaf
13 of 56
Orthologue and Paralogue types
• ortholog_one2one
• ortholog_one2many
• ortholog_many2many
• apparent_ortholog_one2one
• within_species_paralog
• between_species_paralog
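One plausible way to derive the three main orthologue types from a list of orthologous gene pairs between two species is to count partners on each side. The sketch below assumes exactly that counting rule and uses hypothetical gene names; it does not cover apparent_ortholog_one2one or the paralogue types.

```python
from collections import defaultdict

def orthology_types(pairs):
    """Classify (gene_a, gene_b) orthologue pairs between species A and B."""
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    for a, b in pairs:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)
    types = {}
    for a, b in pairs:
        n_in_b = len(partners_of_a[a])  # orthologues of a in species B
        n_in_a = len(partners_of_b[b])  # orthologues of b in species A
        if n_in_a == 1 and n_in_b == 1:
            types[(a, b)] = "ortholog_one2one"
        elif n_in_a == 1 or n_in_b == 1:
            types[(a, b)] = "ortholog_one2many"
        else:
            types[(a, b)] = "ortholog_many2many"
    return types

# Hypothetical orthologue pairs between human (H*) and mouse (M*).
pairs = [("H1", "M1"),
         ("H2", "M2"), ("H2", "M2b"),
         ("H3", "M3"), ("H3", "M4"), ("H4", "M3"), ("H4", "M4")]
types = orthology_types(pairs)
```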
14 of 56
…in Ensembl…
15 of 56
Orthologue and Paralogue types
16 of 56
GeneView
17 of 56
GeneView
18 of 56
GeneTreeView
Links to ATV
and JalView
GeneTree
MUSCLE
protein alignment
19 of 56
GeneTreeView
Duplication
node (red)
Speciation
node (blue)
20 of 56
ATV
21 of 56
Protein clustering into families
• Cluster proteins from different
organisms that may share the
same function
• Obtain some kind of description
for ‘novel’ genes/proteins
• Locate family members over the
whole genome
• Identify possible orthologues and
paralogues in other species
22 of 56
Protein Dataset
• Nearly a million proteins clustered:
– All Ensembl proteins from all species in Ensembl
• 513,256 predicted proteins
– All metazoan (animal) proteins in UniProt
• 55,892 UniProt/Swiss-Prot
• 469,725 UniProt/TrEMBL
• BLASTP all versus all, then clustering with MCL
23 of 56
Clustering Strategy
• BLASTP all-versus-all comparison
• Markov clustering
• For each cluster:
– Calculation of multiple sequence
alignments with ClustalW
– Assignment of a consensus
description
24 of 56
Markov Clustering (MCL)
• MCL stands for Markov CLustering, an algorithm based on flow
simulation in graphs (http://micans.org/mcl/)
• Keeps in the same cluster only very well interconnected nodes (proteins)
• Allows rapid and accurate detection of protein families
on a large scale
• Automatic description and ClustalW multiple alignment
applied to each cluster
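The flow simulation behind MCL alternates two matrix operations: expansion (squaring the column-stochastic matrix, so flow spreads along paths) and inflation (raising entries to a power and renormalising, so strong flow is reinforced and weak flow dies out). The pure-Python sketch below is a minimal toy version for small graphs, not the mcl program from micans.org; the inflation value, iteration count, threshold and example graph are all assumptions.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix

def expand(matrix):
    """Expansion: matrix squaring, i.e. two steps of the random walk."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(matrix, power):
    """Inflation: entry-wise power followed by column renormalisation."""
    return normalise_columns([[x ** power for x in row] for row in matrix])

def mcl(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    """Return clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Add self-loops, as is customary, then make the matrix column-stochastic.
    matrix = normalise_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Rows that still carry flow are attractors; the columns they attract
    # form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if matrix[i][j] > threshold)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Toy similarity graph: two tight triangles (0,1,2) and (3,4,5)
# joined by one weak edge between nodes 2 and 3.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
families = mcl(graph)
```

A higher inflation value generally yields smaller, tighter clusters, which is the main knob when tuning family granularity.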
25 of 56
ProtView
Link to
FamilyView
26 of 56
FamilyView
JalView multiple
alignments
Ensembl family
members within
human
Ensembl family
members in
other species
27 of 56
For each cluster
• We store
– Description and score
– Multiple alignment
• Future extensions
– Improving descriptions
– Multiple alignment assessment
– Build phylogeny on each cluster
• Using the multiple alignment
• Using dS values (mainly inside mammals)
• Extend paralogous prediction
28 of 56
Aligning complete genomes
29 of 56
Whole Genome Alignments
• Understand what evolution has done to
the species compared, after speciation
– What is missing in one species, present only
in another?
– Differences between closely related species
may help understanding speciation
• Define syntenic regions, those long
regions of DNA sequence where order
and orientation are highly conserved
• Conserved non-coding regions
– Guides to putative regulatory regions
30 of 56
Evolution at the DNA level
Sequence edits (mutation, deletion)
…ACTGACATGTACCA…
…AC----CATGCACCA…
Rearrangements
• Inversion
• Translocation
• Duplication
31 of 56
Basic Idea
• Functional sequences evolve more
slowly than non-functional sequences
• Comparing genomic sequences from
species at different evolutionary
distances allows us to identify:
– Coding genes
– Non-coding genes
– Non-coding regulatory sequences
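A simple way to make this idea concrete is to slide a window along a pairwise alignment and flag windows whose percent identity stays high. The sketch below assumes two already aligned, equal-length sequences; the window size, the 70% cut-off and the example sequences are illustrative choices only.

```python
def window_identity(aligned_a, aligned_b, window):
    """Percent identity of each window over two aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(len(aligned_a) - window + 1):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        # A column counts as a match only if both bases agree and are not gaps.
        matches = sum(1 for x, y in columns if x == y and x != "-")
        result.append((start, 100.0 * matches / window))
    return result

def conserved_windows(aligned_a, aligned_b, window=10, min_identity=70.0):
    """Start positions of windows at or above the identity cut-off."""
    return [start for start, identity
            in window_identity(aligned_a, aligned_b, window)
            if identity >= min_identity]

# Made-up aligned sequences: a conserved first half, a diverged second half.
seq_a = "ACGTACGTACTTTTTTTTTT"
seq_b = "ACGTACGTACGGGGGGGGGG"
conserved = conserved_windows(seq_a, seq_b)
```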
32 of 56
Aligning large genomic sequences
• Independent from protein/gene predictions
• Should find all highly similar regions between two
sequences
• Should allow for segments without similarity,
rearrangements etc.
• Issues
– Heavy process
– Scalability, as more and more genomes are sequenced
– Time constraint
– Computes run only by a few dedicated groups
– As the «true» alignment is not known, it is difficult to
measure alignment accuracy and to choose the right
method
33 of 56
Using a local aligner
• Local alignment
– Find all highly similar regions over 2
sequences
• Find the orthologous as well as all the
paralogous sequences
– Separated by segments without alignment
– Can handle rearranged sequences
– Needs post-filtering to limit excessively
overlapping alignments
34 of 56
Local v Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC
[Figure: the two sequences above aligned locally and globally]
Local
– Advantages: compares large genomic regions (requires syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
– Disadvantages: fails to identify insertions or deletions
Global
– Advantages: detects insertions and deletions
– Disadvantages: fails to detect rearrangements (inversions)
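The difference between the two modes shows up directly in the dynamic programming recurrence: a global alignment must pay for every unaligned end, while a local alignment may restart at zero anywhere. The sketch below computes only the best score in each mode, with a linear gap cost; the scoring values are arbitrary assumptions, and real genome aligners use far more elaborate schemes.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=False: global alignment, end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart here
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]

# The two sequences shown on the slide.
seq_1 = "AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA"
seq_2 = "AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC"
global_score = alignment_score(seq_1, seq_2)
local_score = alignment_score(seq_1, seq_2, local=True)
```

With the same scoring scheme the local score can never be lower than the global one, since the best local alignment is free to drop poorly matching ends.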
35 of 56
Glocal Alignment Problem
GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG
Find the least-cost transformation of one sequence
into another using new operations
• Sequence edits (indels, mutations)
• Inversions
• Translocations
• Duplications
• A combination of these
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT
36 of 56
Glocal aligner (Brudno et al., 2003)
BLASTZ-net, tBLAT and MLAGAN
• BLASTZ-net (comparison at the nucleotide
level) is used for species that are
evolutionarily close, e.g. human - mouse
• Translated BLAT (comparison at the amino
acid level) is used for evolutionarily
more distant species, e.g. human - zebrafish
• MLAGAN global alignment is used for
multispecies alignments
37 of 56
all versus all approach using
BLASTZ (collaboration with UCSC)
• Can handle large sequences
• Used 2-weighted spaced seeding strategy
• Dynamic masking
• Makes distinction between repeat and
non-repeat sequences (soft masking)
• Try aligning inside repeats
• One iterative step with lower threshold
to expand alignments
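To illustrate what a spaced seed buys over a contiguous one, the sketch below indexes one sequence by the bases at the "care" positions of a seed pattern and looks up the other sequence against that index. The pattern "1101" and the sequences are hypothetical and much shorter than anything a real aligner would use; this is not BLASTZ's actual seed or implementation.

```python
def spaced_seed_hits(seq_a, seq_b, pattern="1101"):
    """Positions (i, j) where seq_a and seq_b agree at every '1' of the pattern.

    Positions marked '0' are ignored, so a mismatch there still yields a hit.
    """
    care = [k for k, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    # Index every window of seq_b by the bases at the care positions.
    index = {}
    for j in range(len(seq_b) - span + 1):
        key = "".join(seq_b[j + k] for k in care)
        index.setdefault(key, []).append(j)
    hits = []
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in care)
        hits.extend((i, j) for j in index.get(key, []))
    return hits

# Two toy sequences differing at the third base.
spaced = spaced_seed_hits("ACGTA", "ACTTA", pattern="1101")
contiguous = spaced_seed_hits("ACGTA", "ACTTA", pattern="1111")
```

The spaced pattern finds a hit at the start of both sequences despite the substitution, whereas the contiguous pattern of the same span finds none.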
38 of 56
Blastz strategy
• 10Mb Human fragments (3000)
• 30Mb Mouse fragments (100)
• Lineage-specific repeats removed
• 48 hours on 1024 CPUs
• Generates 9Gb of output
• When filtered for Best hit on Human,
reduced to 2.5Gb
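The figures above imply a rough cost per comparison. The back-of-the-envelope calculation below assumes that every human fragment is compared against every mouse fragment and that the 48 hours are wall-clock time with all 1024 CPUs busy; neither assumption is stated on the slide.

```python
# Figures taken from the slide.
human_fragments = 3000
mouse_fragments = 100
wall_clock_hours = 48
cpus = 1024
raw_output_gb = 9.0
filtered_output_gb = 2.5

# Derived quantities (valid only under the assumptions named above).
pairwise_jobs = human_fragments * mouse_fragments      # fragment-vs-fragment comparisons
cpu_hours = wall_clock_hours * cpus                    # total compute
cpu_minutes_per_job = cpu_hours * 60 / pairwise_jobs   # average cost of one comparison
kept_fraction = filtered_output_gb / raw_output_gb     # output surviving the best-hit filter
```

Under these assumptions the run amounts to 300,000 comparisons and about 49,000 CPU hours, or roughly 10 CPU minutes per fragment pair, and the best-hit filter keeps a little over a quarter of the raw output.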
39 of 56
Blastz human genome coverage
• 40% of the human genome is covered by an
alignment of mouse sequences
• Rescoring the alignments with a “tight”, very stringent
matrix that looks for high conservation
(>70% identity) brings the coverage down to 6%
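Coverage figures like these come down to merging overlapping alignment blocks on the reference genome and dividing the covered length by the genome length. The sketch below shows that calculation on made-up blocks; the half-open coordinate convention, the block tuples and the identity values are assumptions for illustration, and a plain identity cut-off stands in for the matrix rescoring described above.

```python
def covered_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by at least one (start, end) interval.

    Intervals are half-open and may overlap; overlaps are counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / sequence_length

# Made-up alignment blocks on a 1000 bp reference: (start, end, identity).
blocks = [(0, 100, 0.65), (50, 150, 0.80), (300, 400, 0.90)]

all_coverage = covered_fraction([(s, e) for s, e, _ in blocks], 1000)
stringent_coverage = covered_fraction(
    [(s, e) for s, e, identity in blocks if identity > 0.70], 1000)
```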
40 of 56
DNA/DNA matches web display
ContigView human
EPO
Conserved
sequences
41 of 56
DotterView
Human
sequence
Mouse
sequence
42 of 56
Multiple alignments
• Currently 3 sets:
– MLAGAN-primates:
– MLAGAN-amniote vertebrates:
– MLAGAN-eutherian mammals:
43 of 56
Strategy
• Use all coding exons
• Get sets of best reciprocal hits
• Create orthology maps
• Build multiple global alignments
44 of 56
MultiContigView
45 of 56
Multiple alignments
ContigView human
EPO
46 of 56
AlignSliceView
Export
alignments
Human
Dog
Rat
Mouse
Alignment on
basepair level
47 of 56
MultiContigView vs. AlignSliceView
48 of 56
AlignView
49 of 56
GeneSeqalignView
50 of 56
GeneSeqalignView
51 of 56
Syntenic Regions
• Genome alignments are refined into
larger syntenic regions
• Alignments are clustered together
when the relative distance between
them is less than 100 kb and order
and orientation are consistent
• Any clusters less than 100 kb are
discarded
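The clustering rule above can be sketched as a single pass over alignment blocks sorted along the reference genome. The code below is one possible reading, not Ensembl's implementation: it assumes the 100 kb distance applies to the gap on both genomes, measures cluster length on the reference, and uses a hypothetical Block record with made-up coordinates.

```python
from typing import NamedTuple

class Block(NamedTuple):
    """One pairwise alignment block (hypothetical record layout)."""
    ref_start: int
    ref_end: int
    other_chrom: str
    other_start: int
    other_end: int
    strand: int  # +1 same orientation, -1 inverted

MAX_GAP = 100_000      # blocks closer than this may be merged
MIN_LENGTH = 100_000   # clusters shorter than this are discarded

def compatible(prev, nxt):
    """True if nxt continues prev with consistent order and orientation."""
    if prev.other_chrom != nxt.other_chrom or prev.strand != nxt.strand:
        return False
    if nxt.ref_start - prev.ref_end >= MAX_GAP:
        return False
    if prev.strand == 1:
        other_gap = nxt.other_start - prev.other_end
    else:  # inverted: positions on the other genome must decrease
        other_gap = prev.other_start - nxt.other_end
    return 0 <= other_gap < MAX_GAP

def synteny_blocks(blocks):
    """Chain compatible neighbours, then drop clusters that are too short."""
    clusters = []
    for block in sorted(blocks):  # sorted by ref_start first
        if clusters and compatible(clusters[-1][-1], block):
            clusters[-1].append(block)
        else:
            clusters.append([block])
    result = []
    for cluster in clusters:
        ref_start = cluster[0].ref_start
        ref_end = max(b.ref_end for b in cluster)
        if ref_end - ref_start >= MIN_LENGTH:
            result.append((ref_start, ref_end, cluster[0].other_chrom,
                           cluster[0].strand, len(cluster)))
    return result

# Made-up blocks: a forward run on chr5, an isolated short hit on chr9,
# and an inverted run on chr2.
blocks = [
    Block(0, 40_000, "chr5", 1_000_000, 1_040_000, 1),
    Block(60_000, 120_000, "chr5", 1_070_000, 1_130_000, 1),
    Block(150_000, 200_000, "chr5", 1_160_000, 1_210_000, 1),
    Block(500_000, 520_000, "chr9", 300_000, 320_000, -1),
    Block(900_000, 960_000, "chr2", 800_000, 860_000, -1),
    Block(980_000, 1_050_000, "chr2", 720_000, 790_000, -1),
]
syntenic = synteny_blocks(blocks)
```

In this toy input the three chr5 blocks merge into one 200 kb region, the two inverted chr2 blocks into a 150 kb region, and the isolated 20 kb chr9 hit is dropped.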
52 of 56
SyntenyView
Human
chromosome
Orthologues
Mouse
chromosomes
Mouse
chromosomes
53 of 56
CytoView
Syntenic
blocks
54 of 56
Outlook
• OrthoView
• Displaying alignments both from
whole genome alignments and on
orthologues
• Consider all isoforms for each gene
• Calculate dN/dS
55 of 56
Acknowledgements
• Abel Ureta-Vidal
• Benoît Ballester
• Kathryn Beal
• Stephen Fitzgerald
• Javier Herrero
• Albert Vilella
Ensembl team
Sep 2006
56 of 56
Basic idea
[Figure: ancestor sequence, speciation event, mutations and selection, followed by alignment; labelled features: mutation, regulatory region, exon]
57 of 56
Global v Local Alignments
[Figure: local and global alignments of two sequences containing a duplication and an inversion (-)]
Local
– Advantages: compares large genomic regions (uses syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
– Disadvantages: fails to identify insertions or deletions
Global
– Advantages: detects insertions and deletions
– Disadvantages: fails to detect rearrangements (inversions)
Glocal aligner (Brudno et al., 2003): pairwise only
58 of 56
Inparalogues vs Outparalogues
59 of 56
Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620
Problems: weak orthologies
60 of 56
Problems: misalignments
61 of 56
Possible solutions
• Weak orthologies:
• Poor alignments:
– report to author
– edit alignments, detect wrong
edges, redefine blocks
– use another aligner
62 of 56
From Edgar, R. C. (2004) NAR 32:1792-1797
63 of 56