Amsterdam 2004 - Theoretical Biology & Bioinformatics
Bioinformatics and Evolutionary
Genomics
Gene Trees, Gene Duplications (I), and
Orthology
Phylogenetic gene trees: how to make them
• Homology: are two pieces of sequence related;
Trees: when did they diverge (how are they related)
• Start from a multiple sequence alignment
• All multiple sequence alignment programs make a
global alignment, so feed them regions that you know
are homologous → Domains!
• MUSCLE / clustal / t_coffee
• Visual inspection of alignments (gaps,
fragments/complete sequences, weird things e.g. A)
Put homologs in the alignment
• Even if they are not homologous MUSCLE will align
them (muscle/clustalw implicitly “assumes” that the
sequences you feed it are homologous)
• And in a phylogeny program, non-homologous
sequences will be clustered
Visual inspection of alignments: ?!
An additive tree which is wrongly reconstructed by
UPGMA
[Figure: the additive tree for A, B, C and D with its branch lengths, next to the tree reconstructed by UPGMA]
     A    B    C    D
A    x   12    9    9
B   12    x    9    7
C    9    9    x    6
D    9    7    6    x

      A    B   CD
A     x   12    9
B    12    x    8
CD    9    8    x

       A  BCD
A      x   10
BCD   10    x
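The UPGMA steps behind these matrices can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `upgma` and the dictionary input format are assumptions. It repeatedly joins the closest pair of clusters and averages distances weighted by cluster size.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the UPGMA merges as (cluster_a, cluster_b, node_height) tuples.

    names: leaf labels; dist: {(a, b): distance} for every pair of leaves.
    """
    clusters = {name: 1 for name in names}          # cluster label -> number of leaves
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2           # UPGMA assumes a molecular clock
        new = a + b
        size_a, size_b = clusters[a], clusters[b]
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average of the distances to the two merged clusters.
                d[frozenset((new, c))] = (
                    size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
                ) / (size_a + size_b)
        del clusters[a], clusters[b]
        clusters[new] = size_a + size_b
        merges.append((a, b, height))
    return merges

dist = {("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
        ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}
merges = upgma("ABCD", dist)
for a, b, height in merges:
    print(f"join {a} + {b} at height {height}")
```

On the slide's matrix this joins C with D first, then B, then A, reproducing the reduced matrices above (A to CD = 9, B to CD = 8, A to BCD = 10).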
Neighbour-Joining (Saitou and Nei, 1987)
• Global measure: keeps the total branch length minimal
• At each step, join the two nodes that minimise the total
tree length (criterion of minimum evolution)
• Leads to unrooted tree
Neighbour-Joining
At each step all possible “neighbour joinings” are checked and the one
corresponding to the minimal total tree length (calculated by adding all
branch lengths) is taken.
Neighbour-Joining
Distance matrix, with r = net divergence (row sum):

     A    B    C    D     r
A    x   12    9    9    30
B   12    x    9    7    28
C    9    9    x    6    24
D    9    7    6    x    22

M_ab = d_ab - (r_a + r_b)/(N - 2)
e.g. M_AB = 12 - (30 + 28)/(4 - 2) = -17

     A    B    C    D
A    x  -17  -18  -17
B         x  -17  -18
C              x  -17
D                   x

AC → U
d_AU = d_AC/2 + (r_A - r_C)/(2(N - 2)) = 9/2 + (30 - 24)/(2*2) = 6
d_CU = d_AC - d_AU = 9 - 6 = 3
d_BU = (d_AB + d_BC - d_AC)/2 = (12 + 9 - 9)/2 = 6
d_DU = (d_AD + d_CD - d_AC)/2 = (9 + 6 - 9)/2 = 3

[Figure: A and C joined at U with branch lengths A-U = 6 and C-U = 3; B and D at distances 6 and 3 from U]

New distance matrix:

     U    B    D     r
U    x    6    3     9
B    6    x    7    13
D    3    7    x    10

     U    B    D
U    x  -16  -16
B         x  -16
D              x

e.g. UB → V
d_UV = d_UB/2 + (r_U - r_B)/(2(N - 2)) = 6/2 + (9 - 13)/(2*1) = 3 - 2 = 1
d_BV = d_UB - d_UV = 6 - 1 = 5
d_DV = (d_UD + d_BD - d_UB)/2 = (3 + 7 - 6)/2 = 2

[Figure: resulting tree with branch lengths A-U = 6, C-U = 3, U-V = 1, B-V = 5, D-V = 2]
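The worked example can be replayed with a short Python sketch. It is an illustration under assumptions, not the lecture's code: the function name `neighbour_joining`, the input format and the internal-node labels U, V, ... are made up here, and ties in M are broken by taking the first minimal pair, which happens to match the choices on the slide (AC first, then UB).

```python
from itertools import combinations

def neighbour_joining(names, dist):
    """Return the NJ tree as a list of (node, node, branch_length) edges.

    names: leaf labels; dist: {(a, b): distance} for every pair of leaves.
    """
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    nodes = list(names)
    labels = iter("UVWXYZ")          # hypothetical labels; enough for up to 8 leaves
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r of each node: sum of its distances to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair with the smallest M_ab = d_ab - (r_a + r_b)/(N - 2).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2),
        )
        d_ab = d[frozenset((a, b))]
        d_au = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        d_bu = d_ab - d_au
        u = next(labels)
        edges += [(a, u, d_au), (b, u, d_bu)]
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((c, u))] = (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab
                ) / 2
        nodes = [u] + [c for c in nodes if c not in (a, b)]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))   # last two nodes share one branch
    return edges

dist = {("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
        ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}
edges = neighbour_joining("ABCD", dist)
for a, b, length in edges:
    print(a, b, length)
```

The branch lengths it reports (A-U 6, C-U 3, U-V 1, B-V 5, V-D 2) add up to every entry of the original distance matrix, which is what makes the tree additive.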
Unequal rates between species
are a very real phenomenon
Character based: parsimony and maximum likelihood
• Two-way classification of phylogeny methods: distance
based vs character based
• Character-state methods search “directly” (i.e.
without defining distances) for the tree that best fits
the data (the alignment)
Maximum likelihood
• Search for the tree with the highest likelihood
• For each possible tree, one computes the maximum
likelihood (ML) value for the character state configurations
among the sequences under study, and chooses the tree
with the largest ML value as the preferred tree.
Maximum likelihood
• have to specify a model of sequence evolution
• likelihood for all sites is the product of the likelihoods for
individual sites assuming all the nucleotide sites evolve
independently.
• maximum likelihood method computes the probabilities for all
possible combinations of ancestral states!
• ML methods evaluate phylogenetic hypotheses in terms of the
probability that a proposed model of the evolutionary process
and the proposed unrooted tree (hypothesis) would give rise to
the observed data (the alignment). The tree found to have the
highest (log)ML value is considered to be the preferred tree.
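A minimal sketch of how such a likelihood is computed for one fixed tree, assuming the Jukes-Cantor model (JC69) as the model of sequence evolution. The tree encoding (nested tuples of `(child, branch_length)` pairs) and all names are assumptions made for this illustration. The per-site likelihood sums over all ancestral states, and the sites are combined by adding their logs (multiplying their likelihoods).

```python
import math

BASES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, site):
    """Return {base: P(observed leaves below node | node carries this base)}."""
    if isinstance(node, str):                       # a leaf, identified by its name
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    (left, t_left), (right, t_right) = node
    p_left, p_right = partial(left, site), partial(right, site)
    return {
        x: sum(jc69(x, y, t_left) * p_left[y] for y in BASES)
           * sum(jc69(x, y, t_right) * p_right[y] for y in BASES)
        for x in BASES
    }

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, i.e. sites are assumed independent."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        p_root = partial(tree, site)
        # Under JC69 each base has prior probability 1/4 at the root.
        total += math.log(sum(0.25 * p_root[x] for x in BASES))
    return total

# Two hypothetical sequences x and y, each 0.1 substitutions per site from the root.
tree = (("x", 0.1), ("y", 0.1))
alignment = {"x": "AC", "y": "AG"}
loglik = log_likelihood(tree, alignment)
print(loglik)
```

A tree search would repeat this for many topologies and branch lengths and keep the tree with the highest value. Because JC69 is time-reversible, where the root is placed for the computation does not change the result, which is why the hypothesis is an unrooted tree.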
Interpreting trees
(recurring theme)
Interpreting the tree
• Taxonomic findings
• Paraphyly
• Monophyly
Interpreting the tree
• Outgroup: place the root between a distant
homologous sequence and the rest of the group (b)
• Midpoint: place the root at the midpoint of the longest
path (sum of branches between any two leaves); NB njplot
• Gene duplication: place the root between paralogous
gene copies (b)
• NB all affected by rates!
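Midpoint rooting can be sketched as follows, using the five-branch tree from the Neighbour-Joining example as input. The function name `midpoint_root` and the edge-list format are assumptions for this illustration.

```python
from itertools import combinations

def midpoint_root(edges, leaves):
    """Return (node_a, node_b, offset): the root lies on edge a-b, offset away from a."""
    weight = {frozenset((a, b)): w for a, b, w in edges}
    adjacent = {}
    for a, b, _ in edges:
        adjacent.setdefault(a, []).append(b)
        adjacent.setdefault(b, []).append(a)

    def path(src, dst):
        # Depth-first search; in a tree there is exactly one path between two nodes.
        stack = [[src]]
        while stack:
            p = stack.pop()
            if p[-1] == dst:
                return p
            stack.extend(p + [nxt] for nxt in adjacent[p[-1]] if nxt not in p)

    def length(p):
        return sum(weight[frozenset(step)] for step in zip(p, p[1:]))

    # Longest leaf-to-leaf path, then walk along it until half its length is covered.
    longest = max((path(a, b) for a, b in combinations(leaves, 2)), key=length)
    half = length(longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        w = weight[frozenset((a, b))]
        if walked + w >= half:
            return a, b, half - walked
        walked += w

edges = [("A", "U", 6), ("C", "U", 3), ("U", "V", 1), ("B", "V", 5), ("V", "D", 2)]
root = midpoint_root(edges, ["A", "B", "C", "D"])
print(root)
```

Here the longest path is A to B (length 12), so the root falls 6 units from A, exactly at node U. The long branch leading to A drags the root towards it, a small instance of rooting being affected by rates.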
Simple example (kinase)
Two genes per species: how to
differentiate between one ancient
or two recent duplications?
• Two genes in Human chromosomes ( Human A &
Human B) & two genes in mouse chromosomes
(Mouse A & Mouse B)
Duplications, Speciations
[Figure: gene tree with internal nodes 1, 2, 3 and ? to be labelled as duplications or speciations]
Interpreting the tree: duplications vs speciations, going
pseudo 3D
Speciation
Interpreting the tree: gene trees vs species trees
Interpreting the tree
Example: vertebrate
duplications
• Tetraploidy?
Interpreting the tree: Horizontal Gene Transfer ( HGT )
Bacteria
Eukarya
Archaea
Jargon for interpretation: Orthology (and paralogy) as a
specification of homology when discussing two species
human1
mouse1
human2
Fitch 1970
Two genes in two species are
orthologous if they derive from one
gene in their last common ancestor
“the corresponding gene”
Genes can diverge by
Speciation, or
Duplication
“Gene duplication by cell division”
implied to have the same function
Orthology ~ annotating internal nodes
as duplications or speciations
Because of the
definition, how does
that translate to a
tree
With or without
species phylogeny?
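One way to annotate internal nodes without a species phylogeny is the species-overlap rule, sketched below: a node is called a duplication if the same species occurs on both sides, and a speciation otherwise. This is a common heuristic brought in for illustration and not necessarily the procedure meant on the slide. The leaf naming (`species_gene`), the nested-tuple trees and the function name are assumptions, and the gene tree must already be rooted.

```python
def label_events(tree):
    """Label internal nodes of a rooted binary gene tree, in post-order."""
    events = []

    def species_below(node):
        if isinstance(node, str):                   # leaf such as "human_A"
            return {node.split("_")[0]}
        left, right = (species_below(child) for child in node)
        # Shared species on both sides means the split was a duplication.
        events.append("duplication" if left & right else "speciation")
        return left | right

    species_below(tree)
    return events

# One ancient duplication before the human/mouse split:
ancient = (("human_A", "mouse_A"), ("human_B", "mouse_B"))
# Two recent, lineage-specific duplications after the split:
recent = (("human_A", "human_B"), ("mouse_A", "mouse_B"))

print(label_events(ancient))   # root is the last entry
print(label_events(recent))
```

In the first tree Human A and Mouse A are orthologs and the A and B copies are outparalogs; in the second the two human genes are inparalogs and both are co-orthologous to both mouse genes.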
Terminology: inparalogs, outparalogs, co-orthologs
Inparalogs
Co-orthologs
Outparalogs
Importance of orthology for comparative genomics: more
resolution
[Figure: gene tree with two orthologous groups, leaves Ec Hi Bs Af and Af Ec Bs Mg]
Gene family present in: Ec Hi Bs Mg Af
Orthologs 1 present in: Ec Hi Bs Af
Orthologs 2 present in: Ec Bs Mg Af
Phenotype ~ gene correlation
Function prediction if Hi is the only biochemically characterized enzyme
Function prediction by co-occurrence
Evolution of gene content: loss vs duplication
Heuristics for orthology definition
• Needed because
– Speed (MSA plus reliable tree building is slow)
– Difficulty in deciding which sequences to build
a tree from in the first place (PFAM?)
– Difficulty in operationalizing nuanced tree
orthology into group orthology
• Historically: bidirectional best BLAST hits (BBH)
BBH
[Figure: gene tree with leaves Ec1, Hi, Bs1, Af and Af, Ec2, Bs2, Mg]
Ec1 vs Bs1: 50%
Ec1 vs Bs2: 35%
Ec2 vs Bs1: 33%
Ec2 vs Bs2: 48%
Extracting tree-like information from pairwise similarities
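A minimal sketch of bidirectional best hits on the four Ec/Bs identities from this slide. The function name and input format are assumptions, and the pairing of percentages to gene pairs is assumed to follow the order in which they are listed. Real pipelines would typically rank hits by BLAST score or E-value rather than percent identity.

```python
def bidirectional_best_hits(identity):
    """identity: {(gene_in_genome_1, gene_in_genome_2): similarity}. Returns BBH pairs."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in identity.items():
        # Best hit of each genome-1 gene in genome 2 ...
        if a not in best_for_a or score > identity[(a, best_for_a[a])]:
            best_for_a[a] = b
        # ... and of each genome-2 gene in genome 1.
        if b not in best_for_b or score > identity[(best_for_b[b], b)]:
            best_for_b[b] = a
    # Keep only the pairs that pick each other.
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)

identity = {("Ec1", "Bs1"): 50, ("Ec1", "Bs2"): 35,
            ("Ec2", "Bs1"): 33, ("Ec2", "Bs2"): 48}
bbh = bidirectional_best_hits(identity)
print(bbh)
```

With these numbers Ec1 pairs with Bs1 and Ec2 with Bs2, which is the tree-like information the pairwise similarities contain.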
BBH issues 1: unequal rates
[Figure: gene tree with nodes marked as duplication or speciation and clades marked as 1:1 orthologs and outparalogs. Leaves: prpC (N. meningitidis, E. coli, P. aeruginosa), VCh1337 (V. cholerae), mmgD (B. subtilis, B. halodurans), citZ (B. subtilis, B. halodurans), VCh2092 (V. cholerae), gltA (P. multocida, E. coli, P. aeruginosa, N. meningitidis)]
BBH issues 2: ignores inparalogs
[Figure: gene tree with leaves Ec1, Hi, Bs1, Af and Af, Ec2, Bs2, Bs3]
Ec1 vs Hi: 70%
Ec2 vs Hi: 38%
Ec2 vs Bs2: 48%
Ec2 vs Bs3: 51%
(Bs2 vs Bs3: 70%)
Prevalence? Depends on e.g. evolutionary distance and on group vs pairwise orthology
At least 16% in prokaryotes
INPARANOID
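The inparalog problem can be illustrated with the Ec2/Bs2/Bs3 numbers from this slide. The sketch below follows the idea behind Inparanoid in a much simplified form: start from a BBH pair and add any gene from the same genome that is more similar to one of the seed orthologs than the seeds are to each other. It is not the actual Inparanoid algorithm, which works on alignment scores and adds confidence values; the function name and inputs are assumptions.

```python
def add_inparalogs(seed_pair, between, within):
    """Extend a BBH seed pair with inparalogs.

    between: {(gene_a, gene_b): similarity} across the two genomes.
    within:  {(gene_x, gene_y): similarity} between genes of the same genome.
    """
    seeds = set(seed_pair)
    cutoff = between[seed_pair]          # similarity between the two seed orthologs
    group = set(seeds)
    for (x, y), score in within.items():
        # A same-genome gene closer to a seed than the seeds are to each other
        # most likely arose by duplication after the speciation: an inparalog.
        if score > cutoff and (x in seeds or y in seeds):
            group.update((x, y))
    return group

between = {("Ec2", "Bs2"): 48, ("Ec2", "Bs3"): 51}
within = {("Bs2", "Bs3"): 70}
seed = max(between, key=between.get)     # Ec2's best hit is Bs3, so the BBH is (Ec2, Bs3)
group = add_inparalogs(seed, between, within)
print(sorted(group))
```

Plain BBH would report only Ec2 and Bs3; the extra step recovers Bs2 as a co-ortholog of Ec2.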
BBH issues 3: differential gene loss
[Figure: gene tree with leaves Ec1, Hi, Bs1, Af and Af, Ec2, Bs2, Mg]
Mg vs Hi: 35%
Other Large Scale orthology schemes: Inparanoid
Eric Sonnhammer
Orthologous groups
• Solution to the non-transitivity of the concept of
orthology sensu stricto is: “Group orthology”
• Conceptually: all proteins that are directly descended
from one protein in the last common ancestor are
considered orthologous to each other
• Operationally: Combine all connected “best triangular
hits” into Clusters of Orthologous Groups (COGs,
Tatusov et al, 1997). WWW.NCBI.NLM.GOV (Watch
out for fusion/fission though !!!)
Large Scale orthology schemes: COG
• 1. Perform the all-against-all protein sequence
comparison.
• 2. Detect and collapse obvious paralogs, that is,
proteins from the same genome that are more similar
to each other than to any proteins from other species.
• 3. Detect triangles of mutually consistent, genome-specific
best hits (BeTs), taking into account the
paralogous groups detected at step 2.
• 4. Merge triangles with a common side to form COGs.
• 5. A case-by-case analysis of each COG. This analysis serves
to eliminate false-positives and to identify groups that contain
multidomain proteins by examining the pictorial representation
of the BLAST search outputs. The sequences of detected
multidomain proteins are split into single-domain segments and
steps 1–4 are repeated with these sequences, which results in
the assignment of individual domains to COGs in accordance
with their distinct evolutionary affinities.
• 6. Examination of large COGs that include multiple members
from all or several of the genomes using phylogenetic trees,
cluster analysis and visual inspection of alignments; as a result,
some of these groups are split into two or more smaller ones
that are included in the final set of COGs.
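Steps 3 and 4 can be sketched as follows. This is a simplified illustration, not the COG pipeline itself: it treats a triangle as three genes from three different genomes that are all mutual best hits, it skips the paralog collapsing of step 2 and the manual curation of steps 5 and 6, and the gene names, function name and input format are hypothetical.

```python
from itertools import combinations

def cogs_from_best_hits(genome_of, best_hit):
    """genome_of: gene -> genome; best_hit: (gene, other_genome) -> best hit there."""
    def mutual(a, b):
        return (best_hit.get((a, genome_of[b])) == b
                and best_hit.get((b, genome_of[a])) == a)

    # Step 3: triangles of mutually consistent best hits across three genomes.
    triangles = [
        t for t in combinations(sorted(genome_of), 3)
        if len({genome_of[g] for g in t}) == 3
        and all(mutual(x, y) for x, y in combinations(t, 2))
    ]

    # Step 4: merge triangles that share a side, using a small union-find.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_with_side = {}
    for i, tri in enumerate(triangles):
        for side in combinations(tri, 2):
            if side in first_with_side:
                parent[find(i)] = find(first_with_side[side])
            else:
                first_with_side[side] = i

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())

genome_of = {"ec1": "Ec", "hi1": "Hi", "bs1": "Bs", "af1": "Af"}
mutual_pairs = [("ec1", "hi1"), ("ec1", "bs1"), ("hi1", "bs1"),
                ("ec1", "af1"), ("bs1", "af1")]       # hi1 and af1 are not mutual best hits
best_hit = {}
for a, b in mutual_pairs:
    best_hit[(a, genome_of[b])] = b
    best_hit[(b, genome_of[a])] = a

cogs = cogs_from_best_hits(genome_of, best_hit)
print(cogs)
```

The two triangles (ec1, hi1, bs1) and (ec1, bs1, af1) share the side ec1-bs1 and merge into one group, so hi1 and af1 end up in the same COG even though they are not each other's best hit.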
Other Large Scale orthology schemes: OrthoMCL
The too ambitious comparative genomics dilemma:
duplication/speciation vs domains
[Figure: domain composition and accretion over time (present to very distant past) against sequence divergence (trivial orthologs, orthologs, homologs, distant homologs); units shown: single gene, gene fusion, domains, domain cassettes, structural elements?]
i.e. genome comparison between close species:
no domain considerations, sub-sub-orthologs. Between distant
homologs: loads of domain considerations
Implication of coupling between duplication & domain
accretion for evolution and function prediction
• For some genes life is easy: 1:1:1 orthologs, no fusions
or domain rearrangements, a couple of losses. But a minority
of families, covering a large proportion of proteins, is a
formidable challenge: domain permutations and duplications
make life complicated
Orthology & function prediction
Blast with a newly sequenced globin from frog
What kind of globin is it?
[Figure: tree of globins with the position of the Blast query]
Orthologs & function prediction
vs
homologs that are not orthologous (HTANOs) & function
• Orthologs tend to have the exact same molecular
function; mere HTANOs do not
• Orthologs also tend to operate in the same “pathway”
• Orthologs mostly have the same domain
composition
… but inparalogs: fate after duplication:
neofunctionalization or subfunctionalization
• Even evolutionary true orthologs can have “different
functions”
• Both co-orthologs have taken over some aspect of
the ancestral function and have lost other aspects
• Acquisition of a new function or loss of function: one of
the co-orthologs now does something different.
Does retaining the ancestral “role” correlate with speed of
sequence evolution: yes but a substantial minority is inconsistent
386
220
rfbB / rffG
RfbB and RffG catalyze the same reaction, but are involved in
two different biological processes. rfb gene cluster:
biosynthesis of O-specific polysaccharides (inner membrane).
rff gene cluster: complex biosynthesis of enterobacterial
common antigen (outer membrane).
Why do we observe inconsistencies?
[Figure: histogram of frequency (# cases) against sequence identity between inparalogs (%), shown separately for consistent and inconsistent cases]
Not because of chance due to lack of divergence time
Why do we observe inconsistencies?
Similar sequence divergence of
inparalogs relative to their single ortholog: molecular function similar?
Any inconsistencies are then a
chance outcome: both duplicates
have diverged, but at (roughly) the
same evolutionary speed (most amino
acid substitutions have only been
subject to purifying selection and not
to adaptive selection)
• In certain orthology schemes, gene order is given
precedence over highest similarity
• The gene at the conserved position is considered the
“original” and the other duplicate the “copy”