
Sometimes you can trust a rat
The sbv IMPROVER species translation challenge

Michael Biehl
University of Groningen, Johann Bernoulli Institute
www.cs.rug.nl/biehl
[email protected]

Gyan Bhanot, Rutgers Univ.
Sahand Hormoz, Adel Dayarian, KITP, UC Santa Barbara

sbv IMPROVER species translation challenge:
IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com
sbv IMPROVER: systems biology verification combined with industrial methodology for process verification in research
Winning the rat race
protein phosphorylation

reversible protein phosphorylation:
the addition or removal of a phosphate group alters the shape and function of proteins

chemical stimuli → phosphorylation → gene expression
complex network (incomplete snapshot)

chemical stimuli → phosphorylation status (measured) → gene expression (Δ measured)
challenge data

• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 (A) + 26 (B)), additional controls
• phosphorylation status after 5 minutes and 25 minutes
  - rather low noise levels
  - subtract control, median of replicates
  - challenge organizers define activation as abs(P) > 3 @5min. or @25min.
  - ~10% positive examples
• gene expression after 6 hours
  - noisy data (microarray)
  - correct for saturation effects
  - N = 20110 genes (human), N = 13841 genes (rat)
challenge set-up and goals

1. intra-species prediction of phosphorylation from gene expression
2. predict the response in human using data available for rat cells
3. predict gene expression response across species
sub-challenge 1: intra-species phosphorylation prediction

combination of two approaches:
• voter method: gene selection based on mutual information
• machine learning analysis: Principal Components representation + Linear Discriminant Analysis
• weighted combination, based on Leave-One-Out cross validation
voter method

binarize data by thresholding:
gene expression: G = 1 if p < 0.01 (p-value for differential expression)
phosphorylation: P = 1 if abs(P) > 3 (@5min. or @25min.)

for all pairs of genes and proteins: calculate separate and joint entropies, using frequencies over stimuli, and from these the mutual information
I(G;P) = H(G) + H(P) − H(G,P)

assumption: high I indicates that a gene is predictive for the corresponding protein status
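The entropy-based gene selection above can be sketched in a few lines (a generic implementation of the standard definition; the toy vectors are illustrative, not challenge data):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(g, p):
    """I(G;P) = H(G) + H(P) - H(G,P) for two binary vectors over the stimuli."""
    joint = 2 * np.asarray(g) + np.asarray(p)   # encode the 4 joint states
    return entropy(g) + entropy(p) - entropy(joint)

# toy example: binarized gene status vs. protein status over 8 stimuli
g = np.array([1, 1, 0, 0, 1, 0, 0, 1])
p = np.array([1, 1, 0, 0, 1, 0, 0, 0])
print(round(mutual_information(g, p), 3))   # → 0.549
```

Ranking all gene/protein pairs by this score then yields the most predictive genes per protein.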
voter method

example: SYNPR level predictive of AKT1 activation
(green = significant phosphorylation, red = significant gene expression)
SYNPR under-expressed → AKT1 phosphorylated

for each protein:
- determine a set of most predictive genes (varying number, ~30-70)
- vote according to the presence of significant gene expressions
- the relative frequency of positive votes determines a certainty score in [0,1]

Leave-One-Out (L-1-O) validation:
consider mutual information only over 25 stimuli, predict the 26th
→ performance estimate with respect to predicting novel data
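The voting step itself is simple; a minimal sketch (function and encoding are hypothetical, but the certainty definition follows the slide: the fraction of selected genes voting for activation):

```python
import numpy as np

def vote_certainty(expr_significant, expected_sign):
    """Fraction of predictive genes that 'vote' for activation.

    expr_significant: +1 / -1 / 0 per selected gene (significantly up / down / not)
    expected_sign:    the expression direction that predicts activation per gene
    """
    votes = (np.asarray(expr_significant) == np.asarray(expected_sign))
    return votes.mean()   # certainty score in [0, 1]

# 5 selected genes; 3 of them show their activation-predictive pattern
print(vote_certainty([-1, -1, 0, 1, -1], [-1, -1, -1, 1, 1]))   # → 0.6
```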
voter method prediction

[heatmap: certainties in [0,1] for 16 proteins × stimuli 27-52, i.e. 416 predictions w.r.t. data set B]
• voting schemes obtained from the examples in A, applied to the 26 new stimuli of data set B
• certainties in [0,1], averaged over the 26 L-1-O runs
machine learning approach

low-dimensional representation of the gene expression data:
• omit all genes with zero variation or only insignificant (p > 0.05) expression values over all 26 training stimuli (13841 → 6033 genes)
• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)
  - error-free representation of all data possible by max. 52 PCs
  - here: use k ≤ 22 leading PCs only (remove small variations due to noise)
• Linear Discriminant Analysis (LDA) (Matlab, Statistics Toolbox: classify)
  - identifies discriminative directions in the k-dim. space, based on within-class and between-class variation
  - probabilistic output provided, interpreted as certainty score
  - if all training examples are negative, score 0 is assigned
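A minimal sketch of this PCA + LDA pipeline (here in Python with scikit-learn rather than the Matlab/pcascat tools named above; the data and labels are random placeholders with the slide's shapes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 6033))   # 26 stimuli x filtered genes (shapes as on the slide)
y = np.array([1] * 8 + [0] * 18)  # binarized phosphorylation status (illustrative)

k = 22                                     # number of leading principal components
Z = PCA(n_components=k).fit_transform(X)   # 26 x k representation

lda = LinearDiscriminantAnalysis()
lda.fit(Z, y)
certainty = lda.predict_proba(Z)[:, 1]     # probabilistic output as certainty score
print(certainty.shape)                     # → (26,)
```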
machine learning approach

• Leave-One-Out procedure with varying number k of PC projections:
  for each of the 16 target proteins, for k = 1...22:
  - repeat 26 times: LDA based on 25 stimuli, predict the 26th;
    yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)
  - determine the number of false positives (fp), true positives (tp), false negatives (fn), true negatives (tn)
  - compute the Matthews Correlation Coefficient mcc(k) (0 ≤ mcc ≤ 1)
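The per-k quality measure can be computed directly from the confusion counts; a small helper (hypothetical name; negative values clipped to 0, matching the range stated on the slide):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion counts, clipped to [0, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return max(0.0, (tp * tn - fp * fn) / denom)

# perfect prediction on 26 stimuli with 3 positives
print(mcc(tp=3, fp=0, tn=23, fn=0))   # → 1.0
# chance-level prediction
print(mcc(tp=1, fp=1, tn=1, fn=1))    # → 0.0
```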
machine learning approach

• perform a protein-specific weighted average over k to obtain certainties, with weights given by mcc(k) from the Leave-One-Out runs
• prediction: apply to test set (B)

[heatmaps: certainties and binarized predictions for 16 proteins × stimuli 27-52]
machine learning approach

• for a fair comparison with the voter method: Nested Leave-One-Out procedure
  for each protein, repeat 26 times:
  - L-1-O using 24 out of 25 stimuli, varying k
  - mcc-weighted prediction for the 26th stimulus
• averaged certainties as weighted means (unweighted mean if both mcc = 0)
combined prediction

[heatmap: combined certainties for 16 proteins × stimuli 27-52 in data set B]
Scores and ranks of 21 participating teams

Team      AUPR   Pearson   BAC
Team_75   0.38   0.72      0.72
Team_49   0.42   0.71      0.69
Team_50   0.38   0.72      0.68
Team_93   0.37   0.70      0.61
Team_111  0.35   0.64      0.67
Team_61   0.35   0.68      0.60
Team_89   0.31   0.65      0.65
Team_112  0.29   0.63      0.66
Team_116  0.27   0.62      0.59
Team_64   0.23   0.59      0.58
Team_90   0.24   0.59      0.56
Team_100  0.23   0.60      0.56
Team_78   0.28   0.56      0.55
Team_72   0.15   0.55      0.58
Team_105  0.19   0.56      0.53
Team_82   0.14   0.55      0.55
Team_106  0.13   0.53      0.55
Team_71   0.14   0.49      0.45
Team_52   0.13   0.49      0.46
Team_84   0.10   0.48      0.49
Team_99   0.07   0.43      0.50

statistically significant (FDR < .05): 12 teams

[plot: sum of ranks (AUPR, Pearson, BAC) per team, better rank to the left; 3 teams are separated from the rest]

AUPR: Area Under Precision Recall curve
Pearson: Pearson correlation between predictions and binarized Gold Standard
BAC: Balanced Accuracy

© 2013 sbv IMPROVER, PMI and IBM
Scores and ranks of 21 participating teams, with the individual methods inset:

voting:             AUPR 0.40, Pearson 0.67, BAC 0.65
LDA:                AUPR 0.34, Pearson 0.71, BAC 0.67
combined (Team_50): AUPR 0.38, Pearson 0.72, BAC 0.68

→ the combination improved the performance!
sub-challenge 2: inter-species phosphorylation prediction
sub-challenge 2 set-up

restrict ourselves to the use of phosphorylation data only
reasoning: the immediate response to stimuli should be comparable between species
data

[schematic: 16 proteins × 52 stimuli]
human data set A (stimuli 1-26):  humP known
human data set B (stimuli 27-52): | humP | > 3 ? to be predicted
rat data set A (stimuli 1-26):    ratP known
rat data set B (stimuli 27-52):   ratP known
naïve prediction

assume similar activation in both species: "human ≈ rat"
prediction score derived from ratP, corresponding to threshold 3 for activation:
- the precise (monotonic!) form is irrelevant for ROC, PR etc.
- threshold 0.5 for crisp classification
- here: a scaling factor yields values well-spread in [0,1]
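One way to realize such a monotonic score (the exact squashing function is not given on the slide; a logistic function of |ratP| − 3 is one plausible choice that puts 0.5 exactly at the activation threshold):

```python
import math

def naive_certainty(rat_p, threshold=3.0, scale=1.0):
    """Monotonic score in (0, 1); equals 0.5 exactly at |ratP| = threshold."""
    return 1.0 / (1.0 + math.exp(-scale * (abs(rat_p) - threshold)))

print(round(naive_certainty(3.0), 2))   # → 0.5 at the activation threshold
print(naive_certainty(6.0) > 0.5)       # clearly activated → score above 0.5
```

As the slide notes, any monotonic transform gives the same ROC and PR curves; the scale only spreads the values over [0,1].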
naïve prediction

[ROC curve (sensitivity vs. 1-specificity) with respect to the full panel (416 predictions) of | humP | > 3: AUC ≈ 0.83]
naïve prediction

[heatmap: color-coded certainty for | humP | > 3, 16 proteins × stimuli 27-52 in data set B]
machine learning approach

training: rat data set A (ratP, stimuli 1-26) as 16-dim. vectors,
targets: | humP | > 3 ? from human data set A
→ 16 separate binary classification problems

prediction: apply to rat data set B (stimuli 27-52) to predict | humP | > 3 in human data set B
LVQ prediction

here: 16-dim. data, LVQ1 with one prototype per class, nearest prototype classification
LVQ prediction

prediction score / certainty for activation:
- the precise (monotonic!) form is irrelevant for ROC, PR etc.
- crisp classification for threshold 0.5
- here: a scaling factor yields a range of values similar to the naïve prediction

validation: 26 Leave-One-Out training processes,
splitting data set A in 25 training / 1 test sample
(if the training set is all negative: accept the naïve prediction)

prediction: ensemble average of certainties over the 26 LVQ systems
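The LVQ1 step above can be sketched as follows (a generic LVQ1 with one prototype per class; learning rate, epochs, and the toy data are illustrative, not taken from the slides):

```python
import numpy as np

def train_lvq1(X, y, lr=0.05, epochs=50, seed=0):
    """LVQ1: one prototype per class; the winner is moved toward a correctly
    classified sample and away from a misclassified one."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    protos = np.array([X[y == c].mean(axis=0) for c in classes])  # class-mean init
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(protos - X[i], axis=1)
            w = np.argmin(d)                                # winning prototype
            sign = 1.0 if classes[w] == y[i] else -1.0
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos, classes

def predict(protos, classes, X):
    """Nearest-prototype classification."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# toy data: two well-separated 16-dim. clusters (26 "stimuli")
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (13, 16)), rng.normal(4, 1, (13, 16))])
y = np.array([0] * 13 + [1] * 13)
protos, classes = train_lvq1(X, y)
print((predict(protos, classes, X) == y).mean())   # → 1.0 on this easy toy set
```

A distance-based certainty can then be derived from the margin between the two prototype distances, monotonically squashed into [0,1] as described above.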
LVQ prediction

[ROC curve with respect to the full panel (416 predictions) of | humP | > 3: AUC ≈ 0.88, obtained in the Leave-One-Out validation scheme; for comparison, the naïve prediction achieves AUC ≈ 0.83]
combined prediction

weighted average of the naïve and LVQ certainties, according to protein-specific performance (AUROC)
[heatmaps: 16 proteins × stimuli 27-52 for both methods]
combined prediction

[heatmap: color-coded certainty for | humP | > 3, 16 proteins × stimuli 27-52 in data set B]
results (sub-challenge 2)

method        AUPR   Pearson   BAC    rank
naïve (rat)   0.45   0.74      0.79   1
LVQ           0.37   0.69      0.76   3

→ naïve scheme: best individual prediction
  (the L-1-O result was not confirmed in the test set)
→ combination improves performance!
→ confirmed in the "wisdom of the crowd" analysis
Classifier Methods for SC2

Team      Classifier                             Feature Selection                                     Rank
Team_50   Learning Vector Quantization (LVQ1)    NA                                                    1
          + naïve approach
Team_111  Neural networks                        13489 inputs, 1000 hidden sigmoid units, 32 outputs   2
Team_49   LDA                                    rank proteins by moderated t-test p-values,           3
                                                 threshold; cross-validate
Team_61   Linear Fit                             PCA                                                   4
Team_52   Least absolute regression model LBE    NA                                                    5
Team_93   Random forest                          predict activation matrix of 7 proteins,              6
                                                 use it for the remaining 9
Team_89   SVM w. radial basis kernel and RF      Biogrid, STRING                                       7
sub-challenge 3: inter-species pathway perturbation prediction
additional data / domain knowledge

1) mapping of rat genes to human orthologs:
   HGNC Comparison of Ortholog Predictions (HCOP), www.genenames.org/cgi-bin/hcop.pl
2) annotation of gene sets representing known pathways and functions:
   246 gene sets from the C2CP collection (Broad Institute)
   www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP
3) gene set enrichment analysis: www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment scores, representing expression
   FDR: false discovery rate, i.e. statistical significance; threshold: FDR < 0.25
rat vs. human

[scatter plot: gene sets with FDR < 0.25 in the stimuli of set A]
frequent observation: negative correlations between significant rat and human gene sets
biology? data (pre-)processing?
machine learning approach

training: 246 classification problems
training data: 26 stimuli in rat data set A, as 246-dim. vectors of rat NES
targets: binarized human FDR (< 0.25?)

• PCA: dimension and noise reduction;
  rat gene set data A and B represented by k (≤ 52) projections
• LDA: linear classifier using the k projections as features (probabilistic output)
• Leave-One-Out validation: determine the optimal k from data set A
• use k = 8 to make predictions for data set B (averaged over 26 L-1-O runs)
human gene set prediction

[heatmap: final prediction, certainties for 246 gene sets × stimuli 27-52]
Team scores and ranks

Team       AUPR   Pearson   BAC    rank
Team_50    0.19   0.59      0.54   1
Team_133   0.12   0.54      0.54   2
Team_49    0.12   0.53      0.53   3
Team_52    0.10   0.52      0.54   4
Team_131   0.11   0.50      0.52   5
Team_105   0.11   0.52      0.51   6
Team_111   0.06   0.41      0.43   7

significant: FDR ≤ 0.01

Aggregation of results: The Wisdom of Crowds
[plots: sum of ranks (AUPR, Pearson, BAC) per team, better rank to the left; and sum of ranks for the aggregated predictions top_2, top_3, ..., all_teams vs. the best individual team]
summary

→ sc-1: intra-species prediction of phosphorylation:
  gene expression is predictive for phosphorylation status
→ sc-2: inter-species prediction of phosphorylation:
  rat phosphorylation is predictive for human cell response
→ sc-3: inter-species prediction of gene sets:
  weakly predictive; presence of negative correlations between rat and human genes and gene sets
outlook

• more sophisticated learning schemes / classifiers,
  e.g. feature weighting schemes, Matrix Relevance LVQ
• 'joint' predictions of protein or gene set tableaus,
  e.g. predict 1 protein from the 16 + 15 values in set A; two-step procedure for set B
• include gene expression in sub-challenge 2
• investigate difficult-to-predict proteins / gene sets
• infer and enhance network models from experimental data

on-going, new challenge (runs until February 2014):
Network Verification Challenge (NVC), www.sbvimprover.com
take home messages

• team work works (and Skype is great)
• in case of doubt: PCA
• the smaller the data set, the simpler the method
• committees can be useful!
• if you have won the rat race, you might be a rat