Evidence Integration in Bioinformatics
Phil Long
Columbia University
A little molecular biology
• DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes
• Regions of DNA, called genes, encode proteins
• Proteins are the biochemical workhorses
• Proteins are made up of amino acids
  • also strung together linearly
  • fold up to form 3D structure
Problems
• What do proteins do?
• Which pairs of proteins work together?
Evidence of related function
• Similar protein sequences
  • BLAST
  • Smith-Waterman
  • FASTA
• Evidence of interaction (see next slide)
Evidence of protein-protein interaction
• Indirect
  • co-expression
  • common essentiality
  • similar functional annotations
  • …
• Direct
  • yeast two-hybrid
  • in-vivo pulldown
  • …
• All yield errors
• Complementary
• How best to combine?
• Goals:
  • Automatic
  • High throughput
Combining using machine learning
• Supervised analysis (useful gold standard designations available)
  • Bayes Nets
  • Support Vector Machines (kernel fusion)
  • Boosting (RankBoost)
• Unsupervised analysis
Overfitting and inductive bias
• Overfitting: capturing insignificant details of the data at the expense of useful trends
• Inductive bias: an a priori preference for some explanations of the data over others (e.g. “simple” explanations are better)
• Stronger inductive bias better protects against overfitting
• Theme: exploit the special structure of evidence integration problems to fashion appropriate inductive biases
Supervised Learning with Bayes Nets
(Pearl, 1988)
(Kasif, Koller, Friedman, Segal, Jordan,…)
Bayesian Networks
(a.k.a. Bayes Nets)
• Define a joint probability distribution over multiple variables
• Useful for describing sparse direct dependence among variables:
  • the graph describes which direct dependencies exist
  • the parameters describe what kind of direct dependencies they are
(Friedman and Goldszmidt)
Bayes Net - Example
Graph: Burglar → Alarm ← Earthquake, Alarm → Neighbor Call

Alarm parameters:
  E  B  Pr(A|B,E)
  0  0  0.1
  0  1  0.81
  1  0  0.9
  1  1  0.91

Neighbor Call parameters:
  A  Pr(N|A)
  0  0.2
  1  0.9
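A minimal sketch of the joint distribution this net defines, factorized along the graph. The priors Pr(B=1) and Pr(E=1) are not given on the slide; the values below are assumed purely for illustration.

```python
# Joint distribution of the burglar-alarm net: one conditional table per
# node given its parents. PR_B and PR_E are ASSUMED (not on the slide).

PR_B = 0.01                      # assumed prior Pr(Burglar = 1)
PR_E = 0.02                      # assumed prior Pr(Earthquake = 1)
PR_A = {(0, 0): 0.1, (0, 1): 0.81, (1, 0): 0.9, (1, 1): 0.91}  # Pr(A=1 | E, B)
PR_N = {0: 0.2, 1: 0.9}          # Pr(N=1 | A)

def joint(b, e, a, n):
    """Pr(B=b, E=e, A=a, N=n) = Pr(b) Pr(e) Pr(a|b,e) Pr(n|a)."""
    pb = PR_B if b else 1 - PR_B
    pe = PR_E if e else 1 - PR_E
    pa = PR_A[(e, b)] if a else 1 - PR_A[(e, b)]
    pn = PR_N[a] if n else 1 - PR_N[a]
    return pb * pe * pa * pn

# Example query: Pr(Burglar = 1 | Neighbor called), summing out E and A.
num = sum(joint(1, e, a, 1) for e in (0, 1) for a in (0, 1))
den = sum(joint(b, e, a, 1) for b in (0, 1) for e in (0, 1) for a in (0, 1))
print("Pr(burglary | neighbor called) =", num / den)
```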
(Jansen, et al, 2003)
Naïve Bayes
Graph: Interact → {Co-expression, Essentiality, Function, …}

To use:
• Learn the joint distribution
• Score: Pr(Interact | Co-expression, Essentiality, Function…) (see the sketch below)
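A minimal scoring sketch, assuming binary evidence variables; the prior and conditional probabilities below are made-up placeholders, not the values Jansen et al estimated from gold-standard data.

```python
import math

# Naive Bayes score Pr(Interact = 1 | evidence), assuming the evidence
# variables are conditionally independent given Interact.
# ALL probabilities here are placeholders, not estimates from real data.

prior = {1: 0.001, 0: 0.999}          # placeholder Pr(Interact)
cond = {                              # placeholder Pr(variable = 1 | Interact)
    "coexpression": {1: 0.6, 0: 0.1},
    "essentiality": {1: 0.5, 0: 0.2},
    "function":     {1: 0.7, 0: 0.05},
}

def score(evidence):
    """Posterior Pr(Interact = 1 | evidence) via log odds."""
    log_odds = math.log(prior[1] / prior[0])
    for name, value in evidence.items():
        p1 = cond[name][1] if value else 1 - cond[name][1]
        p0 = cond[name][0] if value else 1 - cond[name][0]
        log_odds += math.log(p1 / p0)
    return 1 / (1 + math.exp(-log_odds))

print(score({"coexpression": 1, "essentiality": 0, "function": 1}))
```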
(Jansen, et al, 2003)
Hierarchical Naïve Bayes

Graph: Interact → {Co-expression, Essentiality, Function, Y2H}, with the hidden Y2H variable → {Uetz Y2H, Ito Y2H}

• When a subset of the variables is related:
  • Use Naïve Bayes for the related subset
  • Form a new variable with the resulting scoring function
  • Add it to the group, do a new Naïve Bayes
• Possible interpretation: hidden Y2H variable (see the sketch below)
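A sketch of the hierarchical step under the same placeholder assumptions; the inner-model probabilities and the binarizing threshold are illustrative, not estimated values.

```python
import math

# Hierarchical step: naive Bayes over the related subset (the two yeast
# two-hybrid screens) yields a score that becomes one new variable for the
# outer naive Bayes. All probabilities and the threshold are PLACEHOLDERS.

def nb_log_likelihood_ratio(evidence, cond):
    """Sum over binary variables of log Pr(e | Y=1) / Pr(e | Y=0)."""
    total = 0.0
    for name, v in evidence.items():
        p1 = cond[name][1] if v else 1 - cond[name][1]
        p0 = cond[name][0] if v else 1 - cond[name][0]
        total += math.log(p1 / p0)
    return total

# Inner model: Pr(screen = 1 | Interact) for the two screens.
y2h_cond = {"uetz": {1: 0.30, 0: 0.01}, "ito": {1: 0.35, 0: 0.02}}
y2h_score = nb_log_likelihood_ratio({"uetz": 1, "ito": 0}, y2h_cond)

# Collapse the subset into a single binary Y2H variable (threshold 0 is an
# arbitrary choice) and feed it to the outer naive Bayes alongside
# co-expression, essentiality, function, ...
y2h_variable = int(y2h_score > 0)
```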
Supervised learning with SVMs
(kernel fusion)
(Boser, Guyon and Vapnik, 1992) (Haussler, Jaakkola, Vert, Leslie, Weston,…)
Support Vector Machines for Classification

• Uses a similarity function K (called the “kernel function”)
• Special examples (“support vectors”) chosen, given weights
• New items assigned class predictions using the special examples and weights; principles:
  • each example “votes” for its class
  • more similar, bigger vote
  • bigger weight, bigger vote
• Specifically, if
  • classes are {-1,1}, and
  • examples are (x1,y1),…, (xm,ym), then
  • the output classifier takes the form h(x) = sign(α1y1K(x,x1) + … + αmymK(x,xm)) (see the sketch below)
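A direct transcription of that prediction rule; the RBF kernel is just one possible choice of K, and the bias term present in most SVM formulations is omitted, as on the slide.

```python
import numpy as np

# h(x) = sign(alpha_1 y_1 K(x, x_1) + ... + alpha_m y_m K(x, x_m)).
# The RBF kernel below is one common similarity function; gamma is arbitrary.

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, support_vectors, labels, alphas, kernel=rbf_kernel):
    """Each support vector votes for its class; similarity and weight scale the vote."""
    vote = sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if vote >= 0 else -1
```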
Support Vector Machine Training
• Weights, support vectors chosen to optimize a scoring function
  • trades complexity of the classifier with fit to the data
  • strong theoretical foundation
  • globally optimal solution found fast (polynomial time)
• Successful in many domains, including bioinformatics
(Lanckriet, et al, 2004)
Kernel fusion
• Suppose we have kernels K1, K2,…, Kn.
• Idea:
  • use μ1 K1 + μ2 K2 + … + μn Kn
  • learn the μi’s together with the αi’s
  • can be done in polynomial time (semidefinite programming)
• Implicit view: similarities according to the Ki’s provide independent sources of evidence of the correct class (see the sketch below)
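A simplified sketch on synthetic data: Lanckriet et al learn the μi’s jointly with the αi’s by semidefinite programming, whereas this sketch just fixes the μi’s and trains an off-the-shelf SVM on the fused precomputed kernel, so it illustrates the fused-kernel idea but not the SDP itself.

```python
import numpy as np
from sklearn.svm import SVC

def fuse(kernels, mus):
    """Weighted sum mu_1*K_1 + ... + mu_n*K_n of precomputed Gram matrices."""
    return sum(mu * K for mu, K in zip(mus, kernels))

# Synthetic data and two kernels standing in for two evidence sources.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))          # labels in {-1, 1}
K1 = X @ X.T                                              # linear kernel
K2 = np.exp(-0.5 * np.square(X[:, None] - X[None, :]).sum(-1))  # RBF kernel

# Fixed, equal weights here; the actual method optimizes the mus jointly
# with the SVM (semidefinite programming), which this sketch does NOT do.
K = fuse([K1, K2], [0.5, 0.5])
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:5]))   # rows: kernel between test points and training points
```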
Evaluation - ROC Curve
[Figure: predictions sorted by score, each labeled T or F (here F, T, T, F, T, F, T); walking down the sorted list accumulates true positives and false positives.]
Evaluation - ROC curve
[Figure: ROC curve plotting true positives against false positives, both axes running from 0 to 1, with the area under the ROC curve shaded.]
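A minimal AUC computation matching this picture's pairwise interpretation: the area under the ROC curve equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The scores below are made up to match the T/F ordering sketched above.

```python
# AUC = Pr(random positive scored above random negative), ties counted half.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Scores chosen so that, sorted by score, the labels read F, T, T, F, T, F, T.
print(auc([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
          [0,   1,   1,   0,   1,   0,   1]))
```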
(Lanckriet, et al, 2004)
Results – membrane protein prediction
(Results essentially unaffected by adding random kernels.)
Supervised learning with boosting
(RankBoost)
(Freund, et al 1999)
RankBoost
• Learns a scoring function for ranking
• Input:
  • list of “order preferences” (u1 should be ranked above v1, u2 should be ranked above v2, …)
  • list of “base ranking functions”
• Output: a single ranking function built from the base rankers
(Freund, et al 1999)
RankBoost behavior
• Add base rankers one at a time in rounds:
  • assign weights to the ranking preferences -- higher priority to pairs often out of order in earlier rankings
  • choose the base ranker that minimizes the weighted average “loss” (how wrong the ranking was about each pair)
  • assign a weight to the base ranker based on how well it did
• Output: weighted sum of the base rankers’ scoring functions (see the sketch below)
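A compact sketch of these rounds (after Freund et al, 1999), assuming {0,1}-valued base rankers; the stopping rule and data structures are illustrative simplifications, not the paper's exact pseudocode.

```python
import math

def rankboost(pairs, base_rankers, rounds=10):
    """pairs: list of tuples (u, v) meaning u should rank above v.
    base_rankers: {0,1}-valued functions. Returns a combined scoring function."""
    D = {p: 1.0 / len(pairs) for p in pairs}       # weights on preferences
    ensemble = []                                  # (alpha, ranker) terms
    for _ in range(rounds):
        # r = sum_D [f(u) - f(v)]: how well f orders the weighted pairs.
        def advantage(f):
            return sum(D[(u, v)] * (f(u) - f(v)) for u, v in pairs)
        f = max(base_rankers, key=lambda g: abs(advantage(g)))
        r = advantage(f)
        if abs(r) >= 1 - 1e-12:                    # orders every weighted pair
            ensemble.append((math.copysign(1.0, r), f))
            break
        alpha = 0.5 * math.log((1 + r) / (1 - r))  # base ranker's weight
        ensemble.append((alpha, f))
        # Raise weight on pairs this ranker got out of order, lower the rest.
        for (u, v) in pairs:
            D[(u, v)] *= math.exp(alpha * (f(v) - f(u)))
        z = sum(D.values())
        for p in D:
            D[p] /= z
    return lambda x: sum(a * f(x) for a, f in ensemble)
```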
RankBoost
• Theoretically well-founded
• Works well on
  • combining search engine results
  • movie preference prediction
• Bioinformatics?
(Segal, et al 2003)
Unsupervised Learning with Bayes Nets
Regulation of Expression
• Gene expression is regulated
• Major mechanism: expression of gene G is regulated by proteins binding DNA near G
• Proteins that regulate expression by binding DNA are called transcription factors
• Sites where a transcription factor binds approximately conform to a motif
Transcriptional Modules
• Groups of coordinately regulated genes often work together
• Goal:
  • Find such transcriptional modules
  • Identify associated motifs
Two common approaches
• Use results of microarrays (which measure expression) and DNA sequence near regulated genes
• Approach 1:
  • Cluster expression behaviors
  • Look for short DNA sequences over-represented in clusters (see the sketch below)
• Approach 2:
  • Learn functions from nearby DNA to expression behaviors
  • Look for important “features”
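A toy sketch of Approach 1 on synthetic data: cluster genes by expression profile with k-means, then rank short DNA words by enrichment in each cluster's upstream sequences relative to the background. The cluster count, word length, and sequence lengths are arbitrary choices here.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def kmer_counts(seqs, k=6):
    """Count all length-k words across the given sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

# Synthetic stand-ins: a genes-by-experiments expression matrix and one
# upstream ("promoter") sequence per gene.
rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 10))
promoters = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(50)]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)
background = kmer_counts(promoters)
total_bg = sum(background.values())
for cluster in range(3):
    in_cluster = kmer_counts([p for p, l in zip(promoters, labels) if l == cluster])
    total_in = sum(in_cluster.values())
    # Most enriched word: in-cluster frequency relative to background frequency.
    word = max(in_cluster, key=lambda w: (in_cluster[w] / total_in) /
                                         (background[w] / total_bg))
    print(cluster, word)
```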
(Segal, et al 2003)
Bayes Net for Transcription Modules

[Figure: Bayes net in which DNA sequence nodes S1–S5 feed motif nodes R1, R2, R3, which feed a module node M (values are 1, 2, 3, …); M feeds expression-experiment nodes E1–E4. The functions defining the conditional distributions for the Ri’s and M are restricted, and additional conditional independence is implicit. During training, clusters of experiments and motif definitions alternately influence each other: clusters of experiments influence the motif definitions, and motif definitions influence the cluster determination.]
(Long, et al 2005)
Unsupervised Evidence Integration
Problem
• Gold standard designations often not available
• Where a few are available, they are enriched for conclusions based on popular approaches
Examples
• Protein-protein interactions in higher organisms
• Pairing equivalent genes (“orthologs”) between newly sequenced genomes and human:
  • direct sequence comparison
  • phylogenetic analysis
  • local conservation of gene order
• Meta-analysis of multiple screens for differential expression using microarrays
  • for each gene, each study gives a score
  • how to combine?
More generally
• Have a (very large) list of candidate propositions
• For each, several scores
• Which propositions are most likely to be true?
Notation
• Y indicates whether the proposition is true or not
• X1,...,Xn are the scores.

  Input                     Output
  X1  X2  X3  X4  X5
   2   3   0   0   1        0.3
   0   1   0   ?   1        0.01
   8   4   5   4  10        0.98
Isn’t this just clustering?
Specific opportunities/challenges:
• Each Xi positively associated with Y; bigger Xi → more likely that Y=1
• Variables X1,...,Xn often complementary
• Minority class minute
• Classes overlap significantly
Related Theoretical Work [MV03] – Problem

• Training example generation:
  • All variables {0,1}-valued
  • Y chosen randomly, fixed
  • X1,...,Xn chosen independently with Pr(Xi = Y) = pi, where pi is
    • unknown,
    • the same when Y is 0 or 1 (crucial for the analysis)
  • only X1,...,Xn given to the training algorithm
• Goal:
  • given m training examples generated as above
  • output an accurate classifier h that, given scores, outputs a prediction of the corresponding Y
Related Theoretical Work [MV03] – Results

• If n ≥ 3, can approach the Bayes error (best possible for the source) as m gets large
• Idea:
  • a variable is “good” if it often agrees with the others: can determine how good variables are by looking at how often pairs agree
  • can estimate how often pairs agree from training data
  • can plug in to get estimates of how good the variables are
  • can use those to approximate the optimal classifier:
    • use everybody, but
    • listen to the good ones more (see the sketch below)
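A worked sketch of that idea for three binary variables on synthetic data. Writing ci = 2pi − 1, conditional independence gives 2·Pr(Xi = Xj) − 1 = ci·cj, so the three pairwise agreement rates pin down all three accuracies (assuming each pi > 1/2).

```python
import math
import random

# Synthetic [MV03]-style source: Y is a fair coin; each X_i independently
# equals Y with probability p_i, the same whether Y is 0 or 1.
random.seed(0)
p = [0.9, 0.75, 0.6]                 # true accuracies, hidden from the learner
examples = []
for _ in range(100_000):
    y = random.random() < 0.5
    examples.append([y if random.random() < pi else not y for pi in p])

def agreement(i, j):
    return sum(e[i] == e[j] for e in examples) / len(examples)

# With c_i = 2*p_i - 1, independence gives 2*Pr(X_i = X_j) - 1 = c_i * c_j.
c01 = 2 * agreement(0, 1) - 1
c02 = 2 * agreement(0, 2) - 1
c12 = 2 * agreement(1, 2) - 1
c = [math.sqrt(c01 * c02 / c12),     # |c_0|, since (c0 c1)(c0 c2)/(c1 c2) = c0^2
     math.sqrt(c01 * c12 / c02),
     math.sqrt(c02 * c12 / c01)]
print([(1 + ci) / 2 for ci in c])    # estimates of the p_i, assuming p_i > 1/2
```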
In our problem(s)...
• Some of X1,...,Xn continuous-valued, others discrete-valued
• Reasonable to assume X1,...,Xn conditionally independent given Y
• Reasonable to assume Pr(Y = 1 | Xi = x) increasing in x, for all i
• Pr(Y = 1) typically very small
Conditional independence
• Captures non-redundancy in the variables (“different views”)
• Consequence: associations among variables are due to common association with the class designation
• Promoted by preprocessing (merge redundant variables)
• Likely some conditional dependencies remain
• Steps to make the algorithm robust against departures from the assumption:
  • “Shrink” estimates (skeptical of indications of very strong associations)
  • Evaluate each variable using many pairs of other variables, take the median of the results
• If the main source of dependence is the targeted class, there is hope…
Our Approach
[Figure: for each proposition, each score Xi is converted to a binary variable Ui; example binary values (1, 0, 0, …) are shown for U1 through U5 on one proposition and on another proposition.]
Notes
• Estimate Pr(Ui = 1 | Y = 1) by
  • using agreement rates of Ui with all other pairs of variables (new analysis needed), and
  • taking the median of the results (see the sketch below).
• Use Bayes’ Rule to approximate Pr(Y = 1 | Xi’s); use this for the score.
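A sketch of the median-over-pairs step, reusing the symmetric triple-product identity from the [MV03] sketch above as a stand-in; the slide's actual estimator rests on a new analysis whose exact formulas are not reproduced here.

```python
import math
from itertools import combinations
from statistics import median

def estimate_quality(c, i, n):
    """Median over pairs (j, k) of triple-product estimates for variable i.

    c[i][j] holds the empirical 2*Pr(U_i = U_j) - 1 for i != j; each pair of
    other variables gives one estimate of |2*Pr(U_i = Y) - 1|, and the median
    guards against a few conditionally dependent pairs skewing the result.
    """
    estimates = []
    others = [j for j in range(n) if j != i]
    for j, k in combinations(others, 2):
        if c[j][k] == 0:
            continue                          # this pair gives no estimate
        ratio = c[i][j] * c[i][k] / c[j][k]
        if ratio > 0:
            estimates.append(math.sqrt(ratio))
    return median(estimates)
```

The per-variable estimates then enter Bayes’ Rule exactly as in the earlier naive Bayes sketch, giving the score Pr(Y = 1 | U1, …, Un).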
Evaluation: Yeast protein-protein data

• Use data from the Jansen, et al study
• Mix of experimental and indirect evidence
• Preprocessing
  • summed the two yeast two-hybrid variables
  • summed the two in-vivo pulldown variables
  • inverted and summed the functional annotation variables.
• Gold standard designations as in Jansen, et al
• 20+ million examples, 8000+ interactions, 5 variables
Evaluation: other algorithms
• k-means
• spectral algorithm: project onto the principal component (direction of maximum variance) (see the sketch below)
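A sketch of the spectral baseline as described; orienting the component so that larger scores go with larger raw values is an assumption added here so the scores respect the "bigger Xi, more likely Y=1" convention.

```python
import numpy as np

def spectral_scores(X):
    """Score each row of X by its projection onto the first principal component."""
    Xc = X - X.mean(axis=0)
    # Leading right singular vector of the centered data = direction of
    # maximum variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    direction = vt[0]
    # Orientation heuristic (assumed): flip so larger scores track larger X.
    if direction.sum() < 0:
        direction = -direction
    return Xc @ direction
```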
Evaluation
• Run the algorithms without using the gold standard, generate scores
• Evaluate using the gold standard with the ROC score
Results: Protein-protein data
  Algorithm    ROC score
  Peer         0.961
  Spectral     0.859
  k-means      0.865
Results: Protein-protein data
  Algorithm    Area Over ROC Curve
  Peer         0.039
  Spectral     0.141
  k-means      0.135

AOC = 1 – AUC = Pr(random 0 ranked above random 1)
Evaluation: Artificial data
• Inspired by Altschul’s characterization of the BLAST score distribution
• Five variables, the Xi’s conditionally independent given Y, Pr(Y = 1) = 0.01.
  • All variables uniform on [0,1] when Y=1
  • Each variable exponential when Y=0
  • Means of the exponentials chosen randomly when the source is defined (see the sketch below)
• Results averaged over 100 random sources, 100000 examples for each source
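A generator sketch for this source. The distribution the exponential means are drawn from is not stated on the slide; drawing them uniformly from [0.05, 0.5] is an assumption made purely for illustration.

```python
import numpy as np

# Artificial source: Pr(Y=1) = 0.01; X_i ~ Uniform[0,1] given Y=1,
# X_i ~ Exponential(mean_i) given Y=0. The range the means are drawn
# from below is ASSUMED (the slide only says "chosen randomly").

rng = np.random.default_rng(0)
n_vars, n_examples = 5, 100_000
means = rng.uniform(0.05, 0.5, size=n_vars)   # fixed once, when the source is defined

y = rng.random(n_examples) < 0.01
X = np.where(y[:, None],
             rng.random((n_examples, n_vars)),                    # Y=1: uniform
             rng.exponential(means, size=(n_examples, n_vars)))   # Y=0: exponential
```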
Results: Artificial source
  Algorithm    ROC score
  Peer         0.997
  Spectral     0.974
  k-means      0.861
Results: Artificial source
  Algorithm    Area Over ROC Curve
  Peer         0.003
  Spectral     0.026
  k-means      0.139

AOC = 1 – AUC = Pr(random 0 ranked above random 1)
Paper and Software
• Includes
  • paper,
  • source,
  • data,
  • scripts to reproduce all results.
• Location: http://www.cs.columbia.edu/~plong/peer