Structural phylogenomic inference of protein function
Structural Phylogenomic Inference of Protein
Function
Kimmen Sjölander
University of California Berkeley
Extend function prediction through
inclusion of structure prediction and
analysis
Predict active site &
subfamily specificity
positions
Anti-fungal
defensin
(Radish)
Drosomycin
(Drosophila)
Scorpion toxin
VirB4
Annotation transfer by homology
• Status quo approach to protein function prediction
– Given a gene (or protein) of unknown function
• Run BLAST to find homologs
• Identify the top BLAST hit(s)
• If the score is significant, transfer the annotation
– If resources permit, predict domains using PFAM or CDD
• Problems:
– Approach fails completely for ~30% of genes
– Of those with annotations, only 3% have any supporting
experimental evidence
• 97% have had functions predicted by homology alone*
– High error rate
* Based on analysis of >300K proteins in the UniProt database
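The status quo procedure above can be sketched in a few lines. This is a minimal illustration, not a real BLAST wrapper: the `Hit` record, the hit names, and the 1e-5 significance cutoff are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database hit (hypothetical record, not a BLAST library type)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the annotation of the most significant annotated hit, or None.

    max_evalue is an assumed cutoff; the slide only says "significant".
    """
    significant = [h for h in hits if h.evalue <= max_evalue and h.annotation]
    if not significant:
        return None  # the approach fails for this gene
    return min(significant, key=lambda h: h.evalue).annotation


# Hypothetical hits for a query of unknown function
hits = [
    Hit("hypothetical_RLK", 1e-40, "receptor-like kinase"),
    Hit("hypothetical_LRR", 1e-12, "LRR protein"),
]
print(transfer_annotation(hits))
```

Nothing in this procedure checks whether the top hit shares the query's domain structure or is an ortholog, which is where the error sources discussed in the following slides enter.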
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function
prediction by homology, particularly for very
common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
BLAST against Arabidopsis
Top BLAST hit in Arabidopsis is an RLK!
Panther
PFAM results
Errors due to domain shuffling
(sic)
Error presumably due to non-orthology of database hits used for annotation
Phylogenetic analysis suggests it’s more likely a Biogenic Amine GPCR
Human neutral sphingomyelinase or bacterial isochorismate synthase?
Database annotation errors
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate
between orthologs and paralogs)
3. Existing database annotation errors
Errors in gene structure
Contamination
Other…
Propagation of existing
database annotation errors
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption,” In Silico Biol. 1998
Phylogenomic inference
Eisen, “Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis,” Genome Research 1998
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics 2004
Piet Hein, Grooks
There is nothing more difficult to take in hand,
more perilous to conduct, or more uncertain in its success,
than to take the lead in the introduction of a new order of things.
Because the innovator has for enemies
all those who have done well under the old conditions,
and lukewarm defenders in those who may do well under the new.
This coolness arises partly from the incredulity of men,
who do not readily believe in new things
until they have had a long experience of them.
Construction of genome-scale
phylogenomic libraries
Cluster genome into
global homology groups
Include homologs
from other species
Construct multiple sequence alignment
Construct phylogenetic trees.
Overlay with annotation data.
Identify subfamilies.
Retrieve key literature
Predict cellular
localization.
Predict protein structure
Predict key residues
Deposit book in library
Construct HMMs for the
family and for individual
subfamilies.
Berkeley Universal Proteome
Phylogenomic Explorer
9,707 protein family “books” and 708K HMMs and expanding daily
http://phylogenomics.berkeley.edu/UniversalProteome
Protein fold prediction
12% identity
VirB4
TrwB structure (1E9RA)
Active site
Example Book: Voltage-gated K+ channels
SCI-PHY subfamilies supported by ML tree, and also consistent with subtype and phylogenetic distribution
(only one branch of ML tree displayed)
GO annotations for Shal subfamily
Database queries
Look up protein family “books” based on the annotations associated with any sequence.
Queries can be based on GO biological process, PFAM domains, UniProt accession numbers,
etc.
Key algorithms in PhyloFacts
library construction
What clustering methods are appropriate for
inference of protein function?
What alignment methods are accurate?
How to mask?
What tree methods to use?
How to root a tree?
Can we define functional
subfamilies automatically?
Fraction superposable positions drops with evolutionary divergence
%ID      #pair   %Superpos
>70      107     90.6
50-70    63      87.2
40-50    46      83.4
30-40    65      85.4
25-30    41      82.1
20-25    53      77.9
15-20    84      73
10-15    151     64.4
5-10     204     50.4
0-5      122     39.5
Pairwise alignment               MSA-pw
BLAST     ClustalW  Tcoffee      ClustalW  MAFF
0.954     0.955     0.955        0.955     0.9
0.862     0.903     0.894        0.901     0.9
0.824     0.872     0.855        0.856     0.8
0.811     0.874     0.867        0.87      0.8
0.779     0.782     0.788        0.795     0.8
0.612     0.599     0.627        0.633     0.6
0.381     0.451     0.457        0.49      0.4
0.16      0.186     0.234        0.302     0.
-0.007    -0.014    0            -0.047    0.0
-0.033    -0.049    -0.051       -0.034    -0.0
FlowerPower
Clustering global (or glocal) homologs
Minimize profile drift
Improved alignment accuracy
Nandini Krishnamurthy, Ph.D.
Step 1: Construct SearchDB (Q = query)
Construct SearchDB using PSI-BLAST against target database
Step 2: Select and align core set.
Inclusion criteria:
• E-value 1e-10
• Bi-directional coverage
MUSCLE multiple alignment (Edgar, 2003)
Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs)
BETE subfamily identification: Sjölander 1998
SHMM construction: Brown et al, 2004
Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.
Step 5: Run SCI-PHY on extended alignment to identify new subfamilies and construct SHMMs.
Iterate until convergence
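The Step 2 inclusion criteria can be illustrated with a small filter. This is a sketch under stated assumptions: the slides give the E-value cutoff (1e-10) but not the coverage threshold, so `min_coverage=0.7` and all parameter names are hypothetical.

```python
def passes_inclusion(evalue: float,
                     aligned_query_len: int, aligned_hit_len: int,
                     query_len: int, hit_len: int,
                     max_evalue: float = 1e-10,
                     min_coverage: float = 0.7) -> bool:
    """Keep a hit only if it is significant AND the aligned region covers
    most of BOTH the query and the hit (bi-directional coverage).

    min_coverage is an assumed value; only the E-value cutoff is from the slide.
    """
    query_coverage = aligned_query_len / query_len
    hit_coverage = aligned_hit_len / hit_len
    return (evalue <= max_evalue
            and query_coverage >= min_coverage
            and hit_coverage >= min_coverage)


# A global homolog: most of both sequences is aligned
print(passes_inclusion(1e-30, 280, 275, 300, 310))
# A local match: the hit is a much longer protein sharing one region
print(passes_inclusion(1e-30, 280, 280, 300, 900))
```

Requiring coverage in both directions is what restricts the cluster to global homologs: a long multi-domain protein that merely contains the query's domain is rejected even when the local match is highly significant.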
Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM. SCOP used to cluster PFAM domains into structural equivalence classes.
Subfamily Classification In PHYlogenomics
(SCI-PHY)
Nandini Krishnamurthy, Ph.D.
Duncan Brown
Multiple sequence alignment
Seq1   LERY-K
Seq2   LDRFPR
Seq3   IERYGK
Seq4   MDRF-K
Seq5   VERYGK
[Figure: phylogenetic tree & subfamily decomposition; leaves drawn in the order 5, 3, 1, 4, 2]
Agglomerative clustering
Input: MSA
Initialize: construct a profile [1] for each row in the MSA
While (#clusters > 1) {
    Join the closest [2] pair of clusters
    Re-estimate the profile [1] of the merged cluster
    Compute the encoding cost for this stage
}
/* Cut the tree at the stage with minimum encoding cost */
[1] Use Dirichlet mixture densities
[2] Distance function: relative entropy
Sjölander, K. "Phylogenetic inference in protein superfamilies: Analysis of SH2 domains" Proceedings of the Conference on Intelligent Systems for Molecular Biology
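The clustering loop can be sketched as runnable code on the five-sequence toy alignment. This is a simplified illustration, not the SCI-PHY implementation: it assumes a flat pseudocount in place of Dirichlet mixture densities, uses a symmetrized relative entropy summed over columns as the distance, and omits the encoding-cost computation. All function names are hypothetical.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile(rows, pseudocount=0.1):
    """Per-column amino acid distributions for aligned rows (gaps ignored).

    A flat pseudocount stands in for the Dirichlet mixture prior (assumption).
    """
    columns = []
    for c in range(len(rows[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for row in rows:
            if row[c] in counts:
                counts[row[c]] += 1.0
        total = sum(counts.values())
        columns.append({a: n / total for a, n in counts.items()})
    return columns


def relative_entropy(p, q):
    return sum(p[a] * math.log(p[a] / q[a]) for a in ALPHABET)


def distance(prof_a, prof_b):
    """Symmetrized relative entropy, summed over alignment columns."""
    return sum(relative_entropy(p, q) + relative_entropy(q, p)
               for p, q in zip(prof_a, prof_b))


def agglomerate(msa):
    """msa: dict of name -> aligned row. Returns the list of merges in order."""
    clusters = {(name,): [row] for name, row in msa.items()}
    merges = []
    while len(clusters) > 1:
        # Join the closest pair of clusters, then re-estimate on the next pass
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: distance(profile(clusters[pair[0]]),
                                             profile(clusters[pair[1]])))
        merges.append((a, b))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges


msa = {"Seq1": "LERY-K", "Seq2": "LDRFPR", "Seq3": "IERYGK",
       "Seq4": "MDRF-K", "Seq5": "VERYGK"}
for a, b in agglomerate(msa):
    print(a, "+", b)
```

Each merge corresponds to one stage of the algorithm; in the full method the encoding cost would be evaluated at every stage and the tree cut where it is lowest.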
Detection of critical
positions
Subfamilies identified using
minimum encoding cost principles
• Each stage of the algorithm defines a different set of
alignments, one for each cluster (“subfamily”).
• Find the point during the clustering where the
encoding cost of the alignments is minimal. This
defines the subfamily decomposition.
[Plot: encoding cost (y-axis) against number of classes (x-axis), running from N down to 1]
N = number of sequences; S = number of subfamilies;
n_{c,1} … n_{c,S} are the amino acids aligned by subfamilies 1 through S at column c.
The prior is a Dirichlet mixture.
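One form of the encoding cost that is consistent with the symbols defined above is given here as an assumption, not as a quotation of the slide. The symbol \Theta is introduced only as a name for the Dirichlet mixture prior.

```latex
\mathrm{Cost}(S) \;=\; N \log S \;-\; \sum_{c} \log P\left(n_{c,1}, \ldots, n_{c,S} \mid \Theta\right)
```

Under this reading, the first term charges for labelling each of the N sequences with one of S subfamilies, and the second charges for encoding the aligned columns given the prior. Splitting into more subfamilies raises the first term and lowers the second, so the minimum balances the two.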
SCI-PHY analysis of selected GPCRs
Venter et al, The sequence of the human genome (2001) Science.
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," (2004) Bioinformatics
Key residue prediction using subfamily
and family-wide conservation analysis
[Figure: key residues labelled Y221, W222, D558, R627, D628, G629, Y743, A744, H745; motif labels D, RD, E, YAH]
Elizabeth Hua-Mei Kellogg
Ryan Ritterson
Nandini Krishnamurthy
Parker JS, Roe SM, Barford D., EMBO J., 2004
Tanaka Hall, T., Structure 2005
Rivas et al, 2005
Function Prediction Using HMMs
Family: 7TM GPCR, ABC Transporter, Amidohydrolase, ATPase
Subfamily: Dihydropyrimidinase (3.5.2.2), Cytosine deaminase (3.5.4.1), Dihydroorotase (3.5.2.3), Urease (3.5.1.5)
[Figure labels: Family, Subfamily, Error]
Subfamily HMM construction
1. At completely conserved positions, and subfamily gapped positions: Use match state distributions estimated for general (family) HMM.
2. At other positions:
   a. Estimate Dirichlet mixture density posterior for each subfamily at each position separately.
   b. Use Dirichlet density posteriors to weight contributions from other subfamilies.
   c. Compute amino acid distribution using weighted counts and standard Dirichlet procedure.
Brown et al, “Subfamily HMMs in functional genomics” (2005) Pacific Symposium on Biocomputing
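The last two sub-steps (weighting contributions from other subfamilies, then estimating an amino acid distribution from the weighted counts) can be sketched as follows. This is a simplified illustration: it assumes a single Dirichlet component where the method uses a mixture, and it takes the weights as given inputs, where the method derives them from Dirichlet density posteriors. The toy four-letter alphabet, the prior values, and the function names are hypothetical.

```python
def combine_counts(own, others, weights):
    """Add counts from other subfamilies to a subfamily's own counts,
    each scaled by a weight (weights are supplied by the caller here)."""
    combined = dict(own)
    for counts, weight in zip(others, weights):
        for residue, n in counts.items():
            combined[residue] = combined.get(residue, 0.0) + weight * n
    return combined


def posterior_mean(counts, alpha):
    """Posterior mean distribution under a single Dirichlet prior:
    p_i = (n_i + alpha_i) / (sum(n) + sum(alpha))."""
    total = sum(counts.values()) + sum(alpha.values())
    return {a: (counts.get(a, 0.0) + alpha[a]) / total for a in alpha}


# Toy example at one alignment position (hypothetical numbers)
alpha = {"D": 0.5, "E": 0.5, "K": 0.5, "R": 0.5}
own_counts = {"D": 3.0}
other_subfamily_counts = [{"E": 2.0}]
weights = [0.5]

weighted = combine_counts(own_counts, other_subfamily_counts, weights)
probs = posterior_mean(weighted, alpha)
print(probs)
```

The effect is that a subfamily with few sequences borrows information from related subfamilies at a position, in proportion to the weights, instead of relying on its own sparse counts.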
Subfamily HMMs increase the separation
between true and false positives
• 515 unique SCOP folds
• PFAM full MSAs
• Scored against Astral PDB90
1.5% error rate in subfamily classification using
top-scoring SHMM
SATCHMO: Simultaneous Alignment and
Tree Construction using
Hidden Markov mOdels
Xia Jiang
Nandini Krishnamurthy
Duncan Brown
Michael Tung
Jake Gunn-Glanville
Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using
Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
SATCHMO motivation
• Structural divergence within a superfamily means that…
– Multiple sequence alignment (MSA) is hard
– The set of alignable positions varies with the degree of divergence
• Current MSA methods not designed to handle this
variability
– Assume globally alignable, all columns (e.g. ClustalW)…
• Over-aligns, i.e. aligns regions that are not superposable
– …or identify and align only highly conserved positions (e.g., SAM
software with HMM “surgery”)
• Challenge
– Different degrees of alignability in different sequence pairs,
different regions
– Masking protocols are lossy: loop regions may be variable across
the family but may be critical for function!
SATCHMO algorithm
• Input: unaligned sequences
• Initialize: a profile HMM is constructed for each
sequence.
• While (#clusters > 1) {
– Use profile-profile scoring to select clusters to join
– Align clusters to each other, keeping columns fixed
– Analyze joint MSA to predict which positions appear to be
structurally similar; these are retained, the remainder are masked.
– Construct a profile HMM for the new masked MSA
}
• Output: Tree and MSA
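The control flow of the loop above can be sketched with the four operations passed in as functions. This shows only the order of operations and is not the SATCHMO implementation: `build_model`, `score`, `align`, and `mask` are hypothetical callables, and the toy stand-ins at the bottom are not profile HMMs.

```python
from itertools import combinations


def progressive_build(sequences, build_model, score, align, mask):
    """Join clusters until one remains; return (tree, msa).

    sequences: dict of name -> sequence. The four callables stand in for
    HMM construction, profile-profile scoring, cluster alignment, and masking.
    """
    # Each cluster is (tree, msa, model); start with one cluster per sequence
    clusters = [(name, [seq], build_model([seq]))
                for name, seq in sequences.items()]
    while len(clusters) > 1:
        # Select the highest-scoring pair of clusters to join
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: score(clusters[ij[0]][2], clusters[ij[1]][2]))
        (tree_i, msa_i, _), (tree_j, msa_j, _) = clusters[i], clusters[j]
        joint = mask(align(msa_i, msa_j))  # align, then mask
        merged = ((tree_i, tree_j), joint, build_model(joint))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    tree, msa, _ = clusters[0]
    return tree, msa


# Toy stand-ins (assumptions for illustration only)
toy_model = lambda msa: set("".join(msa))           # residues seen
toy_score = lambda x, y: len(x & y) / len(x | y)    # set overlap
toy_align = lambda msa_a, msa_b: msa_a + msa_b      # rows already equal length
toy_mask = lambda msa: msa                          # keep every column

tree, msa = progressive_build({"a": "AAAA", "b": "AAAT", "c": "GGGG"},
                              toy_model, toy_score, toy_align, toy_mask)
print(tree)
print(msa)
```

Because the model is rebuilt from the masked joint alignment at every join, later joins are scored only on the positions judged alignable so far, which is the point of interleaving tree construction with alignment.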
Alignment of proteins with
different overall folds
Assessing sequence alignment
with respect to structural alignment
Xia Jiang
Duncan Brown Nandini Krishnamurthy
Alignment accuracy as a function of % ID
(including homologs, full-length sequences)
[Plot: average CS score (0 to 1) against percent ID, in bins from 10-15% to 35-40%, for CLUSTALW, MUSCLE, MAFFT and SATCHMO]
Future work: Interactive specificity position identification
• Enable users to select subtrees for analysis
• Identify positions conserved within each subtree, but which differentiate the two**
• Plot over MSA and on structure (if available)
** Donald and Shakhnovich, NAR 2005
[Figure: catalytic residues colored red]
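The second bullet can be illustrated on the toy alignment from the SCI-PHY slide. This is a minimal sketch: the split of the five sequences into two subtrees is an assumption made for the example, and the strict "fully conserved in each subtree" test is the simplest possible criterion. The function name is hypothetical.

```python
def specificity_positions(subtree_a, subtree_b):
    """Return 0-based columns that are fully conserved within each subtree
    but carry a different residue in the two subtrees (gap columns skipped)."""
    positions = []
    for c in range(len(subtree_a[0])):
        residues_a = {row[c] for row in subtree_a}
        residues_b = {row[c] for row in subtree_b}
        conserved = len(residues_a) == 1 and len(residues_b) == 1
        if conserved and residues_a != residues_b and "-" not in residues_a | residues_b:
            positions.append(c)
    return positions


# Assumed split of the toy alignment into two subtrees
subtree_1 = ["LERY-K", "IERYGK", "VERYGK"]   # Seq1, Seq3, Seq5
subtree_2 = ["LDRFPR", "MDRF-K"]             # Seq2, Seq4

print(specificity_positions(subtree_1, subtree_2))  # [1, 3]
```

Columns 1 (E versus D) and 3 (Y versus F) are reported; column 2 (R in every sequence) is conserved family-wide and so does not differentiate the subtrees.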
Major challenge: Phylogenetic uncertainty
Given: A (gene tree of unknown function), gene trees B and C (characterized function)
Predict function for A.
[Figure: alternative tree topologies relating A, B and C]
Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA, you also change the tree…
Need: Better simulation studies, benchmark datasets
http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
PI: Kimmen Sjölander
Nandini Krishnamurthy, Ph.D.
Duncan Brown
Sriram Sankararaman
Xia Jiang
Jake Gunn-Glanville
Lead programmer and web administrator:
Dan Kirshner
This work is supported in part by
a Presidential Early Career Award for Scientists and Engineers from the NSF,
and by an R01 from the NHGRI (NIH).