Statistical Machine Learning and
Computational Biology
Michael I. Jordan
University of California, Berkeley
November 5, 2007
Statistical Modeling in Biology
• Motivated by underlying stochastic phenomena
thermodynamics
recombination
mutation
environment
• Motivated by our ignorance
evolution of molecular function
protein folding
molecular concentrations
incomplete fossil record
• Motivated by the need to fuse disparate sources of data
Outline
• Graphical models
phylogenomics
• Nonparametric Bayesian models
protein backbone modeling
multi-population haplotypes
• Sparse regression
protein folding
Part 1: Graphical Models
Probabilistic Graphical Models
• Given a graph G = (V, E), where each node v ∈ V is associated with a random variable Xv
[Figure: a six-node directed graphical model on X1, …, X6 with local conditionals p(x1), p(x2|x1), p(x3|x2), p(x4|x1), p(x5|x4), p(x6|x2, x5)]
• The joint distribution on (X1, X2,…, XN) factors according to
the “parent-of” relation defined by the edges E:

    p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2|x1) p(x3|x2) p(x4|x1) p(x5|x4) p(x6|x2, x5)
Inference
• Conditioning
• Marginalization
• Posterior probabilities
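For concreteness, here is a brute-force illustration of these operations on the six-node example above (not from the original slides; the binary variables and CPT values are hypothetical):

```python
import itertools

# Hypothetical binary CPTs for the six-node example; p_x2[x1][x2] = p(x2 | x1), etc.
p_x1 = [0.6, 0.4]
p_x2 = [[0.7, 0.3], [0.2, 0.8]]          # indexed [x1][x2]
p_x3 = [[0.9, 0.1], [0.4, 0.6]]          # indexed [x2][x3]
p_x4 = [[0.5, 0.5], [0.1, 0.9]]          # indexed [x1][x4]
p_x5 = [[0.8, 0.2], [0.3, 0.7]]          # indexed [x4][x5]
p_x6 = [[[0.9, 0.1], [0.5, 0.5]],        # indexed [x2][x5][x6]
        [[0.4, 0.6], [0.2, 0.8]]]

def joint(x1, x2, x3, x4, x5, x6):
    """The factored joint distribution from the slide."""
    return (p_x1[x1] * p_x2[x1][x2] * p_x3[x2][x3] *
            p_x4[x1][x4] * p_x5[x4][x5] * p_x6[x2][x5][x6])

# Marginalization: p(x6 = 1) by summing over all other variables.
p_x6_1 = sum(joint(*xs, 1) for xs in itertools.product([0, 1], repeat=5))

# Conditioning / posterior: p(x1 = 1 | x6 = 1) = p(x1 = 1, x6 = 1) / p(x6 = 1).
p_x1_1_x6_1 = sum(joint(1, *xs, 1) for xs in itertools.product([0, 1], repeat=4))
print(p_x1_1_x6_1 / p_x6_1)
```

Brute-force enumeration is exponential in the number of variables; the exact algorithms on the next slide exploit the factorization to avoid this.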
Inference Algorithms
• Exact algorithms
sum-product
junction tree
• Sampling algorithms
Metropolis-Hastings
Gibbs sampling
• Variational algorithms
mean-field
Bethe, Kikuchi
convex relaxations
Hidden Markov Models
• Widely used in computational biology to parse strings
of various kinds (nucleotides, markers, amino acids)
• Sum-product algorithm yields the forward-backward recursions for posterior state probabilities
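As a sketch of what that recursion looks like (a toy two-state HMM; all parameters are hypothetical):

```python
import numpy as np

# A toy 2-state HMM over a binary observation alphabet; parameters are illustrative.
pi = np.array([0.5, 0.5])                  # initial state distribution
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # A[i, j] = p(state j | state i)
B = np.array([[0.7, 0.3], [0.1, 0.9]])     # B[i, k] = p(observation k | state i)

def forward(obs):
    """Sum-product forward recursion: alpha[i] = p(observations so far, state = i)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate through transitions, then emit
    return alpha.sum()                     # total likelihood p(obs)

print(forward([0, 0, 1, 0, 1]))
```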
Hidden Markov Model Variations
Phylogenies
• The shaded nodes represent the observed nucleotides at a given
site for a set of organisms
• Site independence model (note the plate)
• The unshaded nodes represent putative ancestral nucleotides
• Computing the likelihood involves summing over the unshaded
nodes
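That sum can be organized recursively by running sum-product on the tree (Felsenstein's pruning algorithm). A minimal sketch for a single site on a star-shaped tree under a Jukes-Cantor-style substitution model; the branch lengths and observed nucleotides below are hypothetical:

```python
import numpy as np

def jukes_cantor(t, rate=1.0):
    """4x4 nucleotide substitution matrix for branch length t (Jukes-Cantor model)."""
    p_same = 0.25 + 0.75 * np.exp(-4.0 * rate * t / 3.0)
    p_diff = 0.25 - 0.25 * np.exp(-4.0 * rate * t / 3.0)
    return np.full((4, 4), p_diff) + np.eye(4) * (p_same - p_diff)

def site_likelihood(leaf_states, branch_lengths):
    """Likelihood of one site on a star tree (root -> each observed leaf),
    summing over the unobserved ancestral nucleotide at the root."""
    prior = np.full(4, 0.25)               # uniform distribution at the root
    like = np.ones(4)                      # partial likelihood at the root
    for state, t in zip(leaf_states, branch_lengths):
        like *= jukes_cantor(t)[:, state]  # message from each observed leaf
    return float(prior @ like)

# A, C, G, T encoded as 0..3; hypothetical observed nucleotides and branch lengths.
print(site_likelihood([0, 0, 2], [0.1, 0.2, 0.5]))
```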
Hidden Markov Phylogeny
• This yields a gene finder that exploits evolutionary constraints
Evolutionary rate is state-dependent
(edges from state to nodes in phylogeny are omitted for simplicity)
• Based on sequence data from 12-15 primate species, we obtain a
nucleotide sensitivity of 100%, with a specificity of 89%
GENSCAN yields a sensitivity of 45%, with a specificity of 34%
Annotating new genomes
Species: Aspergillus nidulans (a fungal organism)
>Q8X1T6 (hypothetical protein)
MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPL
IFSMARKNNVSLPSPSSNPAYTSV
ETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAEL
AWTYFKRAHAEGVHHTEVFFDPQV
HMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCF
LKHLPLESAQRLYDTALNEGDLGL
DGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSI
NLTAHAGEEGDASYIAAALDMGAT
RIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLK
CVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILE
CYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNE
LLWRIDECVKRF
What molecular function does protein Q8X1T6 have?
Images courtesy of Broad Institute, MIT
Annotation Transfer
BLAST Search: Q8X1T6 (Aspergillus nidulans)
Species Name                  Molecular Function              Score  E-value
Schizosaccharomyces pombe     adenosine deaminase             390    e-107
Gibberella zeae               hypothetical protein FG01567.1  345    7e-94
Saccharomyces cerevisiae      adenine deaminase               308    1e-82
Wolinella succinogenes        putative adenosine deaminase    268    1e-70
Rhodospirillum rubrum         adenosine deaminase             266    6e-70
Azotobacter vinelandii        adenosine deaminase             260    4e-68
Streptomyces coelicolor       probable adenosine deaminase    254    2e-68
Caulobacter crescentus CB1    adenosine deaminase             253    5e-66
Streptomyces avermitilis      putative adenosine deaminase    251    2e-65
Ralstonia solanacearum        adenosine deaminase             251    2e-65
environmental sequence        unknown                         246    5e-64
Pseudomonas aeruginosa        probable adenosine deaminase    245    1e-63
Pseudomonas aeruginosa        adenosine deaminase             245    1e-63
environmental sequence        unknown                         244    3e-63
Pseudomonas fluorescens       adenosine deaminase             243    7e-63
Pseudomonas putida KT2440     adenosine deaminase             243    7e-63
Methodology to System: SIFTER
Set of homologous proteins (Pfam)
[Table: the same BLAST hits as above, now treated as a set of homologous proteins from a Pfam family]
[Figure: the SIFTER approach: a gene tree built by maximum parsimony (MP), reconciled with the species tree, with Gene Ontology functions (adenine vs. adenosine deaminase) propagated through the tree to unannotated proteins]
Functional diversity problem
1887 Pfam-A families with more than two experimentally
characterized functions
[Pie chart: breakdown of these families by number of distinct functions: 2-5, 6-10, 11-20, 21-50, and >51]
Available methods for comparison
• Sequence similarity methods
• BLAST [Altschul 1990]: sequence similarity search,
transfer annotation from sequence with most significant
similarity
• Runs against the largest curated protein database in the world
• GOtcha [Martin 2004]: BLAST search on seven genomes
with GO functional annotations
• GOtcha runs use all available annotations
• GOtcha-exp runs use only available experimental annotations
• Sequence similarity plus bootstrap orthology
• Orthostrapper [Storm 2002]: transfer annotation when
query protein is in statistically supported orthologous
cluster with annotated protein
AMP/adenosine deaminase
• 251 member proteins in Pfam v. 18.0
• 13 proteins with experimental evidence from GOA
• 20 proteins with experimental annotations from manual literature search
• 129 proteins with electronic annotations from GOA
• Molecular function: remove the amine group from the base of the substrate
• Alignment from the Pfam family seed alignment
• Phylogeny built with PAUP* parsimony, BLOSUM50 matrix
Mouse adenosine deaminase, courtesy PDB
AMP/adenosine deaminase
SIFTER Errors
• Leave-one-out cross-validation: 93.9% accuracy (31 of 33)
• BLAST: 66.7% accuracy (22 of 33)
AMP/adenosine deaminase
[Plot: posterior probabilities; note: x-axis is on log scale]
• Multifunction families: a numerical cutoff for posterior-probability prediction can be chosen using this type of plot
Sulfotransferases: ROC curve
• SIFTER (no truncation): 70.0% accuracy (21 of 30)
• BLAST: 50.0% accuracy (15 of 30)
[ROC curve; note: x-axis is on log scale]
Nudix Protein Family
• 3703 proteins in the family
• 97 proteins with molecular
functions characterized
• 66 different candidate
molecular functions
Nudix: SIFTER vs BLAST
• SIFTER, truncation level 1: 47.4% accuracy (46 of 97)
• BLAST: 34.0% accuracy (33 of 97); only 23.3% of the correct terms appeared in the search results at all
Trade specificity for accuracy
• Leave-one-out cross-validation, truncation at level 1: 47.4% accuracy (66 candidate functions)
• Leave-one-out cross-validation, truncation at levels 1-2: 78.4% accuracy (15 candidate functions)
Fungal Genomes
• Work with Jason Stajich
[Figure: fungal phylogeny spanning Hemiascomycota, Euascomycota, Archeascomycota, Basidiomycota, and Zygomycota]
Images courtesy of Broad Institute
Fungal Genomes Methods
• Gene finding in all 46 genomes
• hmmsearch for all 427,324 genes
• Aligned hits with hmmalign to 2,883 Pfam v. 20 families
• Built trees using PAUP* maximum parsimony for 2,883 Pfam
v. 20 families; reconciled with Forester
• BLASTed each protein against Swiss-Prot/TrEMBL for exact
match; used ID to search for GOA annotations
• Ran SIFTER with (a) experimental annotations only and (b)
experimental and electronic annotations
SIFTER Predictions by Species
Part 2: Nonparametric Bayesian Models
Clustering
• There are many, many methodologies for clustering
• Heuristic methods
hierarchical clustering
• M-estimation
K-means
spectral clustering
• Model-based methods
finite mixture models
Dirichlet process mixture models
Nonparametric Bayesian Clustering
• Dirichlet process mixture models are a nonparametric
Bayesian approach to clustering
• They have the major advantage that we don’t have to
assume that we know the number of clusters a priori
Chinese Restaurant Process (CRP)
• Customers sit down in a Chinese restaurant with an
infinite number of tables
first customer sits at the first table
the n-th customer sits at a table drawn from the following distribution:

    p(occupied table k) = nk / (n − 1 + α)
    p(next unoccupied table) = α / (n − 1 + α)

where nk is the number of occupants of table k and α > 0 is a concentration parameter
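A minimal simulation of this process (the concentration parameter α and the number of customers below are arbitrary choices):

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments for n_customers under a CRP with concentration alpha."""
    rng = random.Random(seed)
    counts = []                      # counts[k] = current occupants of table k
    assignments = []
    for i in range(n_customers):     # customer i arrives after i prior customers
        # Table k with prob counts[k] / (i + alpha); new table with prob alpha / (i + alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            counts.append(1)         # open a new table
            assignments.append(len(counts) - 1)
    return assignments

print(chinese_restaurant_process(10, alpha=1.0))
```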
The CRP and Mixture Models
• The customers around a table form a cluster
associate a mixture component with each table
the first customer at a table chooses the component parameter from the prior; e.g., for Gaussian mixtures, a mean θ ~ G0
[Figure: customers seated around tables 1-4, all customers at a table sharing that table's θ]
• It turns out that the (marginal) distribution that this induces on the θ's is exchangeable
Example: Mixture of Gaussians
Dirichlet Process
• Exchangeability implies an underlying stochastic process;
that process is known as a Dirichlet process
Dirichlet Process Mixture Models
• Given observations x1, …, xn, we model each xi with a latent factor θi:

    xi | θi ~ F(· | θi)

• We put a Dirichlet process prior on the θi:

    θi | G ~ G,    G ~ DP(α, G0)
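Combining the CRP with per-table parameter draws gives a sketch of the DP mixture's generative process; here a 1-D Gaussian mixture, with hypothetical values for α, G0, and the observation noise:

```python
import random

def dp_gaussian_mixture_sample(n, alpha=1.0, g0_std=5.0, noise_std=0.5, seed=0):
    """Generate n points from a DP mixture of 1-D Gaussians via the CRP.
    The base measure G0 = N(0, g0_std^2) is the prior over component means."""
    rng = random.Random(seed)
    counts, means, data = [], [], []
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        acc, table = 0.0, None
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                table = k
                break
        if table is None:            # new table: draw its mean theta from the prior G0
            means.append(rng.gauss(0.0, g0_std))
            counts.append(0)
            table = len(counts) - 1
        counts[table] += 1
        data.append(rng.gauss(means[table], noise_std))
    return data, counts

data, counts = dp_gaussian_mixture_sample(100)
print(len(counts), "clusters for 100 points")   # the number of clusters is random
```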
Connection to the Chinese Restaurant
• The marginal distribution on the θ's obtained by marginalizing out the Dirichlet process is the Chinese restaurant process
• Let’s now consider how to build on these ideas and
solve the multiple clustering problem
Multiple Clustering Problems
• In many statistical estimation problems, we have not
one data analysis problem, but rather we have groups
of related problems
• Naive approaches either treat the problems separately, lump them together, or merge them in some ad hoc way; in statistics we have a better sense of how to proceed:
shrinkage
empirical Bayes
hierarchical Bayes
• Does this multiple group problem arise in clustering?
I’ll argue “yes!”
• If so, how do we “shrink” in clustering?
Multiple Data Analysis Problems
• Consider a set of data which is subdivided into groups, and where each group is characterized by a Gaussian distribution with unknown mean:

    xij ~ N(μi, σ²),    j = 1, …, ni

• Maximum likelihood estimates of the μi are obtained independently (the group sample means)
• This often isn’t what we want (on theoretical and practical grounds)
Hierarchical Bayesian Models
• Multiple Gaussian distributions linked by a shared hyperparameter:

    μi ~ N(μ0, τ²),    xij ~ N(μi, σ²)

• Yields shrinkage estimators for the μi
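As a worked illustration (hypothetical data, known variances), the posterior mean of each group shrinks its sample mean toward the shared prior mean, with the smallest groups shrunk the most:

```python
import numpy as np

# Hypothetical groups of Gaussian data with a shared prior mu_i ~ N(mu0, tau^2).
groups = [np.array([2.1, 1.8, 2.5]), np.array([0.2]), np.array([-1.0, -0.6])]
mu0, tau2, sigma2 = 0.0, 1.0, 1.0    # hyperparameters and known noise variance

for x in groups:
    n = len(x)
    # Posterior mean: precision-weighted average of the sample mean and the prior mean.
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
    print(f"MLE {x.mean():+.2f} -> shrunk {w * x.mean() + (1 - w) * mu0:+.2f}")
```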
Protein Backbone Modeling
• An important contribution to the energy of a protein
structure is the set of angles linking neighboring amino
acids
• For each amino acid, it turns out that two angles suffice; traditionally called φ and ψ
• A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot
A Ramachandran Plot
• This can be (usefully) approached as a mixture modeling
problem
• Doing so is much better than the “state-of-the-art,” in
which the plot is binned into a three-by-three grid
Ramachandran Plots
• But that plot is an overlay of 400 different plots, one for each
combination of 20 amino acids on the left and 20 amino acids on
the right
• Shouldn’t we be treating this as a multiple clustering problem?
Haplotype Modeling
• A haplotype is the pattern of alleles along a single
chromosome
• Data comes in the form of genotypes, which lose the information as to which allele is associated with which member of a pair of homologous chromosomes:
• Need to restore haplotypes from genotypes
• A genotype is well modeled as a mixture model, where
a mixture component is a pair of haplotypes (the real
difficulty is that we don’t know how many mixture
components there are)
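In a sketch (hypothetical haplotype frequencies; haplotypes are 0/1 vectors and a genotype is the sitewise sum of two haplotypes):

```python
from itertools import product

# Hypothetical haplotype frequencies over 3 biallelic sites.
freqs = {(0, 0, 0): 0.4, (1, 0, 1): 0.3, (0, 1, 1): 0.2, (1, 1, 1): 0.1}

def genotype_likelihood(g):
    """p(g) = sum over ordered haplotype pairs (h1, h2) with h1 + h2 = g sitewise."""
    return sum(p1 * p2
               for (h1, p1), (h2, p2) in product(freqs.items(), repeat=2)
               if all(a + b == c for a, b, c in zip(h1, h2, g)))

# Genotype values are 0, 1, or 2 copies of the minor allele at each site.
print(genotype_likelihood((1, 1, 2)))   # sums both orderings of (1,0,1) and (0,1,1)
```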
Multiple Population Haplotype Modeling
• When we have multiple populations (e.g., ethnic
groups) we have multiple mixture models
• How should we analyze these data (which are now
available, e.g., from the HapMap project)?
• Analyze them separately? Lump them together?
Scenes, Objects, Parts and Features
Shared Parts
Hidden Markov Models
• An HMM is a discrete state space model
• The discrete state can be viewed as a cluster indicator
• We thus have a set of clustering problems, one for each
value of the previous state (i.e., for each row of the
transition matrix)
Solving the Multiple Clustering Problem
• It’s natural to take a hierarchical Bayesian approach
• It’s natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori
• How do we do this?
Hierarchical Bayesian Models
• Multiple Gaussian distributions linked by a shared hyperparameter:

    μi ~ N(μ0, τ²),    xij ~ N(μi, σ²)

• Yields shrinkage estimators for the μi
Hierarchical DP Mixture Model?
• Let us try to model each group of data with
a Dirichlet process mixture model
let the groups share an underlying
hyperparameter
• But each group is generated independently
different groups cannot share the same components if the base measure G0 is continuous: the atoms (“spikes”) of independent draws do not match up
Hierarchical Dirichlet Process Mixtures
The Chinese Restaurant Franchise
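In the franchise metaphor, each group runs its own restaurant (its own CRP over tables), but every new table orders a dish from a shared, franchise-wide menu governed by a top-level CRP; dishes play the role of mixture components, so components are shared across groups. A minimal sketch (the concentration parameters α and γ are arbitrary):

```python
import random

def crf_sample(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    """Chinese restaurant franchise: a CRP over tables within each group, and a
    top-level CRP over dishes; dishes (components) are shared across groups."""
    rng = random.Random(seed)
    dish_counts = []                         # franchise level: tables serving each dish
    all_dishes = []
    for n in group_sizes:
        table_counts, table_dish, dishes = [], [], []
        for i in range(n):
            r = rng.uniform(0, i + alpha)    # within-group CRP over tables
            acc, t = 0.0, None
            for k, c in enumerate(table_counts):
                acc += c
                if r < acc:
                    t = k
                    break
            if t is None:                    # new table: choose its dish at the top level
                s = rng.uniform(0, sum(dish_counts) + gamma)
                acc2, d = 0.0, None
                for j, c in enumerate(dish_counts):
                    acc2 += c
                    if s < acc2:
                        d = j
                        break
                if d is None:                # brand-new dish (new mixture component)
                    dish_counts.append(0)
                    d = len(dish_counts) - 1
                dish_counts[d] += 1
                table_counts.append(0)
                table_dish.append(d)
                t = len(table_counts) - 1
            table_counts[t] += 1
            dishes.append(table_dish[t])
        all_dishes.append(dishes)
    return all_dishes                        # per-group component assignments

print(crf_sample([5, 5]))                    # the two groups reuse shared dishes
```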
HDP Model of Ramachandran Plots
• We would like to solve 400 different related clustering problems, one for each combination of 20 amino acids on the left and 20 amino acids on the right
HDP Model of Ramachandran Plots
Some HDP Success Stories
• New backbone model for Rosetta
• New method for multi-population haplotype phasing
• Solution to the problem of choosing the number of states in HMMs
• State-of-the-art method for statistical parsing
• Competitive method for image denoising
• Competitive method for scene categorization
• State-of-the-art method for object recognition
Part 3: Sparse Regression
Rosetta Ab Initio Search
• Very successful method from David Baker's lab (UW)
• Consistent top performer at CASP
• Already used in real-world problems (e.g., HIV vaccine design)
• Monte Carlo procedure
Treat the energy E (actually exp(−E)) as a probability density, and sample from it
Primary move set: fragment insertion
Fragments come from a library of solved structures with similar residue subsequences
Energetically plausible local solutions
Like a coordinate descent move: jump to a new local minimum in a few coordinates
Rosetta Ab Initio Search
• Start from a fully extended chain
• Repeat:
• Propose a fragment insertion (or other) move
• Do local gradient descent to evaluate the proposal
• Accept or reject by a Metropolis criterion
• Throw away all but the lowest-energy sample from the previous round
• Switch to the high-resolution (full-atom) energy function and perform further search (relaxation)
• Return the single lowest-energy conformation seen in sampling
• Run many, many times
Our idea: resampling
• No more blind repetition: learn where to search from previous searches
• An initial round of sampling gives us lots of information about the search space
Areas of conformation space that always have very poor energy
Structural elements that are predicted very consistently
• How can we use this information to guide further sampling (without getting too greedy)?
General approach:
• Define a reduced search space
• Learn a smoothed energy function from the decoys
• Minimize smoothed energy to find new decoys
• Repeat
Step 2: Response surface minimization
• Restrict attention to local minima (decoys)
• Fit a smoothed response surface to the decoys
• Minimize the response surface to find a new candidate
• Full-atom relax the candidate
• Add the candidate to the decoy pool
• Re-fit the surface, repeat
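A hedged sketch of the idea: fit a regularized linear response surface to (feature, energy) pairs from the decoys, then take its minimizer over the decoy pool as the next candidate. The features, data, and ridge regularizer here are illustrative stand-ins, not Rosetta's actual energy terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoys: feature vectors (e.g., torsion/secondary-structure indicators)
# paired with energies; a noisy linear model stands in for the true energy.
X = rng.normal(size=(200, 10))               # 200 decoys, 10 structural features
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=200)  # stand-in for full-atom energies

# Fit a ridge-regularized linear response surface E(x) ~ x . w.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# "Minimize" the surface over the decoy pool: pick the decoy with the lowest
# predicted energy; in the real procedure this candidate is relaxed and added back.
candidate = X[np.argmin(X @ w)]
```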
Features
• Torsion angle features
e.g., from the HDP model of the Ramachandran plot
[Figure: Ramachandran plot, axes from (−180, −180) to (180, 180), with regions labeled A, B, E, G]
• Secondary structure features
[Figure: sheet, loop, and helix elements]
More Features
• Sidechain rotamer features
• Register shift features
• Burial features (buried vs. exposed residues)
Sparse Regression Models
• Lasso regression: penalize large weights

    minimize ||y − Xβ||²    subject to    Σj |βj| ≤ C

• L1 regularization leads to sparse solutions
• LARS (Efron et al. 2004): finds the estimates for all values of C simultaneously, as efficiently as least squares
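A minimal sketch of the lasso path using scikit-learn's lars_path implementation of LARS (synthetic data; the sparsity pattern is planted):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                # only 3 of 20 features matter
y = X @ beta + 0.1 * rng.normal(size=100)

# One run of LARS recovers coefficient estimates for every value of the L1 constraint.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which features enter the model:", active[:5])
```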
Results: 1ogw
Results: 1n0u
Results: 1di2
Collaborators
• David Baker
• Ben Blum
• Steven Brenner
• Roland Dunbrack
• Barbara Engelhardt
• Guillaume Obozinski
• Yee Whye Teh
• Daniel Ting
• Eric Xing
Finis
• For more information (papers, slides, tutorials, software):
http://www.cs.berkeley.edu/~jordan