Amsterdam 2004
Bioinformatics and Evolutionary Genomics
Request
• We have a small group
• and also heterogeneous with respect to previous
knowledge
• PLEASE: interrupt / ask questions when I am going
too fast, when I use jargon, or when I make
jumps/conclusions that seem obvious and 100%
logical to me but erratic to you; please point out my
implicit assumptions regarding what everybody
knows
Lectures and computer exercises
• Homology, trees,
• Genomic context , genome evolution, pathway
evolution
• HTP data
• Eukaryotic Genome Evolution, tree of life.
• Exercises … basic abilities, plus impression of what
is possible / how type of research is done (albeit on
a larger scale)
Literature Discussion
• Each (set of) article(s) will be introduced (= presentation) by one or two
persons; the presentation should last approximately half an hour,
followed by a discussion
• What to discuss
– What are the articles actually saying? What have authors
done? (so that everybody knows)
– What does this mean in a larger context? (e.g. a discussion
of the discussion)
Homology and Domains
Gene / protein sequence evolution:
what is homology
• Definition homology (biology)
• structures are said to be homologous if they are
alike because of shared ancestry.
• Classic: arms ~ bird wings ~ bat wings,
• Genes/proteins/stretches of dna: sequence similarity
because derived from the same ancestral sequence
• Instead of analogous: with sequences we have
convergence, but thought to be limited to specific
cases (e.g. coiled-coil, regulatory motifs); but with
function we have analogy e.g. analogous enzymes
Why are we interested in homology
• Function prediction → Homologous proteins tend to
have similar functions
• Evolutionary dynamics → Tracing the evolution of
genes (duplication, gene trees, origin of new gene
families)
How do we detect homology
• Similarity of:
– 3D structure → most conserved aspect, yet not all structures are
available. Structures are compared and classified by “eye” and
software packages (Dali). (NB classical homology); criterion: shared
“idiosyncratic” features that are not strictly necessary for function +
sequence features
– Sequence → less conserved; many sequences are, however, available.
Homology determination is mainly based on models of sequence
evolution and the likelihood that when you compare a sequence to a
database you will find a sequence of at least that similarity.
• NB Manually curated databases of 3D structure similarity are used as a
benchmark for detection of homology by sequence similarity (SCOP,
Blundels Bus).
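The "likelihood of finding a sequence of at least that similarity" can be made concrete with the Karlin-Altschul expectation E = K·m·n·e^(−λS) that underlies BLAST E-values. The snippet below is only a minimal sketch: the function names are made up for this illustration, and the default λ and K are illustrative values of the kind reported for ungapped BLOSUM62 scoring, not parameters taken from these slides.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.3176, k=0.134):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults (an assumption of this sketch);
    real searches use parameters fitted to the scoring scheme in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of seeing at least one chance hit this good: 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical query of 300 residues against a database of 1e9 residues.
    for score in (40, 80, 120):
        e = expected_hits(score, query_len=300, db_len=1_000_000_000)
        print(f"raw score {score}: E = {e:.3g}, P = {p_value(e):.3g}")
```

The point of the sketch is that the same raw score becomes less significant as the database grows, which is why homology is judged on E-values rather than on percent identity alone.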
Gene / protein evolution: beyond blast, “distant homology”
• Not obvious by blast
• Substantial divergence, due to time and/or speed
• Use “profile” (HMMer or PSI-BLAST)
• In general these work better because:
[Figure: toy alignments (ECGHR / ECNHN compared with TCQQL and with SIGNL); the conserved C and G positions are highlighted]
Gene / protein evolution: beyond blast, “distant homology”
• PSI-BLAST: a multiple sequence alignment is
generated on the fly to detect which
residues/positions characterize the family.
• OR use CDD, PFAM or SMART
– Experts have collected representative and
divergent members of a gene family and use
HMMer or RPS-BLAST to see if your query
sequence belongs to this gene family (i.e. is
homologous to the members)
– clearer/cleaner than psi-blast or blast.
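To illustrate why a profile can pick up what a single-sequence comparison misses, here is a minimal sketch of a position-specific scoring matrix built from a tiny alignment. It is not how PSI-BLAST or HMMer are implemented: the uniform background, the pseudocount of 1, the ungapped scoring and the three-sequence toy family are all assumptions made for this example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Ungapped score of a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Made-up toy family in which only the C (column 2) and H (column 4) are conserved.
family = ["ECGHR", "ECNHN", "DCGHK"]
profile = build_profile(family)

# A diverged candidate that keeps the two characteristic positions scores
# higher than a sequence with the same composition in a different order.
print(score(profile, "TCQHL"), score(profile, "CTHQL"))
```

Conserved columns receive large positive scores, variable columns contribute little, so a distant relative that retains only the characteristic residues can still stand out.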
How to detect very distant homology / superfamilies
• When two protein families are homologous but the
homology is not obvious, they are part of the same
so-called superfamily
How to detect:
• In-depth PSI-BLAST
• Reciprocal
• Use of right seed
• “hopping” (homology is by definition transitive)
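Because homology is transitive, pairwise links found by "hopping" can be merged into superfamilies. The sketch below does this with a small union-find structure; the family names and links are hypothetical placeholders, and in practice every link would first have to pass a significance check, since one false link merges two unrelated groups.

```python
def superfamilies(families, links):
    """Group families into superfamilies from pairwise homology links.

    families: iterable of family names
    links:    iterable of (family_a, family_b) pairs judged homologous
    """
    parent = {family: family for family in families}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for family in parent:
        groups.setdefault(find(family), set()).add(family)
    return list(groups.values())


# Hypothetical example: famA~famB and famB~famC were detected, famA~famC was not.
example = superfamilies(
    ["famA", "famB", "famC", "famD"],
    [("famA", "famB"), ("famB", "famC")],
)
print(example)
```

famA and famC end up in the same group although no direct hit connects them, which is exactly the "hop" through the intermediate family.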
Gene / protein evolution: Distant homology
• Alignment-vs-alignment, profile-vs-profile, HMM-vs-HMM
comparison (whereas HMMer and PSI-BLAST
compare a profile to a single sequence)
• Unfortunately the statistics are still poor
• “Works” because
[Figure: toy alignments ACRNG / ACGNR compared with TCQQL / TFQQI and with TCQQL / TCILL; the shared C column is marked]
Gene / protein evolution: Distant homology
• 3D structure comparison/alignment plus visual
inspection of multiple sequence alignment by Alexey
Murzin
• The results of this are stored in the SCOP database
• (Blundel’s bus)
Structural alignment
Secondary structure
elements
• Alpha-helices
• Beta strands (beta
sheets)
• Loops
Fold vs superfamily?
An example of distant homology
• E.g. superfamily P-loop containing nucleoside
triphosphate hydrolase
• In humans: AAA 130, ABC_tran 182, SMC_N 29
• Zot; UPF0079; TraG; SMC_N; SKI; Sigma54_activat;
Rep_fac_C; Rad17; NACHT; Mg_chelatase; MCM;
KTI12; IstB; GSPII_E; DUF853; DNA_pol3_delta;
Bac_DnaA; APS_kinase; ABC_tran; AAA_PrkA;
AAA_5; AAA_3; AAA_2; AAA;
Apart from sequence and structural features, the basic molecular
function is also conserved
Distant Homology:
Applications to function prediction
• Bacterial protein of unknown function (DUF853)
• Member of the P-loop containing nucleoside
triphosphate hydrolase superfamily
• Thus thought to be an ATPase
Relevance of homology for function prediction: “Similar function”. What is function?
• Various levels of description:
• Sequence similarity / homology has the largest
relevance for molecular function. This is the aspect of
protein function that is best conserved; protein
sequence and structure can often be interpreted in
terms of function.
Using distant homology for function prediction: example
from (just) before PSI-BLAST & HMMer
Secreted Fringe-like Signaling Molecules May Be
Glycosyltransferases.
Cell. 1997 Jan 10;88(1):9-11.
Y. Yuan, J. Schultz, M. Mlodzik, P. Bork
Distant Homology: Application to evolution
• Invention vs (duplication and) divergence
• First determine homology before putting sequences
in multiple sequence alignment & tree building
software
• Two (or more) protein families that are present in all
three kingdoms of life and that can be shown
to be homologous to each other: information from
before the Last Universal Common Ancestor, i.e.
information about very early evolution
Protein domains: structural definition: separate in
structure
• a structural domain ("domain") is an element of
overall structure that is self-stabilizing and often
folds independently of the rest of the protein chain
Protein domains: sequence/evolutionary definition:
Separate in “evolution”
• Homologous parts of proteins that occur with different
“partners”
• Mobile
• Modules
• Almost always same as structural definition
Implications of domains for homology:
• The shared ancestry is not a property of the whole
gene but only of part of the gene.
• When studying the evolution of gene families,
consider fusions / domain combinations (also when
making trees etc.)
Domain repeats. Homology?
• Blast homology vs the “real” homology unit
• Q8TKV1 (Methanosarcina acetivorans)
• ?
[Figure: domain repeats in Q8TKV1]
Ramifications for function prediction & understanding of
cellular processes: “one domain one (molecular) function”
(in contrast to one gene one function)
• This bit does this and that bit does that
• E.g.
– multidomain enzymes
– Transcriptional regulators
Example multidomain enzyme: TrpG E. coli
Ramifications for function prediction when doing
blast: mind the domains
[Figure: domain architectures of proteins A and B, with domains labelled 1 and 2]
Protein B is wrongly annotated as having the function of
domain 1, based on homology with the multidomain
protein A, even though B is not homologous to domain 1 itself
(multi-domain architecture problem for annotating proteins via blast)
Ramifications for function prediction when doing
blast: mind the domains
[Figure: domain architectures of proteins A and B, with domains labelled 1 and 2]
Protein B is incompletely annotated as having only the
function of domain 2, based on homology with the
single-domain protein A; the second domain is missed
in the annotation
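Both pitfalls above can be flagged by checking how much of the query and of the database hit is actually covered by the alignment. The following is a minimal sketch under stated assumptions: the `Hit` record, the labels and the 0.8 coverage cut-off are invented for this illustration and do not correspond to any particular BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record of one pairwise hit (1-based, inclusive coordinates)."""
    query_len: int
    subject_len: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def covered_fraction(start, end, length):
    """Fraction of a sequence that lies inside the aligned region."""
    return (end - start + 1) / length


def classify_hit(hit, min_cov=0.8):
    """Label a hit by alignment coverage; min_cov is an arbitrary cut-off."""
    q_cov = covered_fraction(hit.q_start, hit.q_end, hit.query_len)
    s_cov = covered_fraction(hit.s_start, hit.s_end, hit.subject_len)
    if q_cov >= min_cov and s_cov >= min_cov:
        return "full_length"
    if q_cov >= min_cov:
        # The subject has regions (possibly other domains) the query lacks:
        # do not transfer the function of those regions to the query.
        return "subject_has_extra_region"
    if s_cov >= min_cov:
        # The query has regions the subject does not explain:
        # the transferred annotation is probably incomplete.
        return "query_has_extra_region"
    return "local_only"


# First pitfall: a 200-residue query matches only the C-terminal part of a 500-residue protein.
print(classify_hit(Hit(200, 500, 1, 200, 301, 500)))
# Second pitfall: a 500-residue query is matched only in part by a 200-residue protein.
print(classify_hit(Hit(500, 200, 301, 500, 1, 200)))
```

A coverage check of this kind only signals that a domain analysis is needed; it does not replace looking up the actual domain architecture.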
Ramifications for function prediction
when doing blast do psi-blast, cdd / pfam instead.
• Rather than discover the domain structure by blast
yourself, use e.g. SMART / PFAM / CDD to do it for
you
• NB CDD
Domains and distant homologies
• Promiscuous domains (i.e. those present in many proteins) are often
quite diverged and thus need sensitive homology detection tools in
order to be recognized.
• Moreover, it is often only the most general functional property of the
domain that is conserved over such long evolutionary distances
• Over long evolutionary distances genes are often only homologous in
the sense that they share a domain, rather than being homologous
over their full length
• We THUS use PFAM/SMART etc.
1. For the domains
2. To improve upon BLAST / be cleaner than PSI-BLAST
3. Because most of the sequences are covered by these
databases. No need to reinvent the wheel. The ones that are not
are often “non-globular”, recent inventions, or very fast evolving
Disclaimer: non-globular regions
• Low complexity
• Unstructured, Elongated (as opposed to globular)
• Many polar/charged residues; few hydrophobic
residues
• Parts of proteins that do not possess a clear 3D
structure
• Convergence
• Do not obey PAM or BLOSUM
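Low-complexity stretches can be spotted roughly by measuring how few different residues occur in a sliding window. The sketch below uses Shannon entropy for this; it is a simplified illustration rather than the algorithm of any specific filtering program, and the window size of 12 and threshold of 2.2 bits are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Start positions (0-based) of windows whose entropy is below the threshold."""
    return [
        i
        for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]


# Made-up sequence: a compositionally diverse stretch followed by a poly-Q run.
toy = "MKVLAAGIDTSW" + "QQQQQQQQQQQQ"
print(low_complexity_windows(toy))
```

Regions flagged this way are the ones where a similarity hit should not be read as evidence of homology without further support.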
Disclaimer: Coiled coil
• All alpha: thought to arise independently
(convergence)
• Hypothesis: reservoir for “new” folds: all alpha folds
(Koonin EV)
• E.g. ras / rho / rab / ran / -GAPs
Disclaimer: Other protein motifs
• Signal peptides
• Lipid anchoring
• Convergence yet still important to predict
• Trans-membrane?
Interesting result on protein evolution regarding domains
and duplications: neutral?
[Figure legend]
Black: observed
Blue: model of recombination & duplication separate
Red: also duplication of combinations