Disease Stratification - Baliga Lab at Institute for Systems Biology
Systems Approaches to
Disease Stratification
Nathan Price
Introduction to Systems Biology Short Course
August 20, 2012
Goals and Motivation
Currently, most diagnoses are based on
symptoms and visual features (pathology, histology)
However, many diseases appear deceptively similar
but are, in fact, distinct entities from the
molecular perspective
Drive towards personalized medicine
Outline
Molecular signature classifiers: main issues
Signal to noise
Small sample size issues
Error estimation techniques
Phenotypes and sample heterogeneity
Example study
Advanced topics
Network-based classification
Importance of broad disease context
Molecular signature classifiers
Overall strategy
Molecular signatures for diagnosis
The goals of molecular classification of tumors:
Identify subpopulations of cancer
Inform choice of therapy
Generally, a set of microarray experiments is used with
~100 patient samples
~10^4 transcripts (genes)
This very small number of samples relative to the number of
transcripts is a key issue
Feature selection & model selection
Small sample size issues dominate
Error estimation techniques
Also, the microarray platform used can have a significant effect on
results
Randomness
Expression values have randomness arising from both biological and experimental variability.
Design, performance evaluation, and application of classifiers must take this randomness into account.
Three critical issues arise…
Given a set of variables, how does one design a
classifier from the sample data that provides good
classification over the general population?
How does one estimate the error of a designed
classifier when data is limited?
Given a large set of potential variables, such as
the large number of expression levels provided by
each microarray, how does one select a set of
variables as the input vector to the classifier?
Small sample issues
Our task is to predict future events
Thus, we must avoid overfitting
It is easy (if the model is complicated enough) to fit the data
we have
Simplicity of the model is vital when data is sparse and the
space of possible relationships is large
This is exactly the case in virtually all microarray studies,
including ours
In the clinic
At the end, want a test that can easily be implemented
and actually benefit patients
Error estimation and variable selection
An error estimator may be unbiased but have a
large variance, and therefore often report low values by chance.
This can produce a large number of gene sets and
classifiers with low error estimates.
For a small sample, one can end up with
thousands of gene sets for which the error
estimate from the sample data is near zero!
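This pitfall is easy to reproduce. The toy simulation below (illustrative only, not from the study; the sample sizes and the single-feature threshold rule are assumptions) draws pure-noise features and counts how many nonetheless achieve an optimistically low apparent (resubstitution) error:

```python
import random

random.seed(0)

n_samples, n_features = 20, 2000
labels = [i % 2 for i in range(n_samples)]          # 10 per class, no real signal
data = [[random.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(n_samples)]

def threshold_error(values, labels):
    """Lowest resubstitution error of any single-threshold rule on one feature."""
    best = len(labels)
    for cut in values:                  # candidate thresholds at observed values
        for sign in (1, -1):            # either direction of the rule
            err = sum((sign * (v - cut) > 0) != bool(y)
                      for v, y in zip(values, labels))
            best = min(best, err)
    return best / len(labels)

errors = [threshold_error([row[j] for row in data], labels)
          for j in range(n_features)]
n_low = sum(e <= 0.2 for e in errors)
print(f"{n_low} of {n_features} pure-noise features have apparent error <= 20%")
```

Even with no signal at all, a sizable fraction of features look far better than chance; with gene sets instead of single genes, the effect is stronger still.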
Overfitting
Complex decision boundary may be unsupported
by the data relative to the feature-label distribution.
Relative to the sample data, a classifier may have
small error; but relative to the feature-label
distribution, the error may be severe!
Classification rule should not cut up the space in a
manner too complex for the amount of sample
data available.
Overfitting: example of KNN rule
[Figure: k-NN (k = 3) decision boundaries and a test sample, for training sets of N = 30 and N = 90 points]
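The same effect can be shown numerically. In this sketch (an assumed setup, not the lecture's figure), a 1-nearest-neighbor rule memorizes 30 noise-labeled training points perfectly while performing at chance on held-out data:

```python
import random

random.seed(1)

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (2-D, Euclidean)."""
    nearest = sorted(train,
                     key=lambda p: (p[0] - query[0])**2 + (p[1] - query[1])**2)[:k]
    return int(sum(label for _, _, label in nearest) > k / 2)

def make_data(n):
    # (x, y, label): positions uniform in the unit square, labels pure coin flips
    return [(random.random(), random.random(), random.randint(0, 1))
            for _ in range(n)]

train, test = make_data(30), make_data(200)

train_err = sum(knn_predict(train, (x, y), k=1) != lab
                for x, y, lab in train) / len(train)
test_err = sum(knn_predict(train, (x, y), k=1) != lab
               for x, y, lab in test) / len(test)
print(f"1-NN: training error {train_err:.2f}, held-out error {test_err:.2f}")
```

Training error is exactly zero (each point is its own nearest neighbor) even though the held-out error sits near 50%, the textbook signature of an overfit decision boundary.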
Example: How to identify appropriate models
(regression… but the issues are the same)
y = f(x) + n, where n is noise
Learn f from the data
Linear…
Quadratic…
Piecewise linear interpolation…
Which one is best?
Cross-validation
Simple: just choose the classifier with the best
cross-validation error
But… (there is always a but)
we are training on even less data, so the classifier
design is worse
if the sample size is small, the test set is small and the
error estimator has high variance
so we may be fooling ourselves into thinking we have a
good classifier…
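A minimal k-fold cross-validation loop might look like this (a stdlib-only sketch; the nearest-centroid rule and the synthetic two-class data are illustrative):

```python
import random

random.seed(2)

def nearest_centroid_fit(train):
    """Per-class mean of a single feature."""
    return {label: sum(x for x, y in train if y == label) /
                   sum(1 for _, y in train if y == label)
            for label in (0, 1)}

def nearest_centroid_predict(means, x):
    return min(means, key=lambda label: abs(x - means[label]))

def cross_val_error(data, k=5):
    data = data[:]                       # shuffle a copy into random folds
    random.shuffle(data)
    fold = len(data) // k
    errs = []
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        model = nearest_centroid_fit(train)
        errs.append(sum(nearest_centroid_predict(model, x) != y
                        for x, y in test) / len(test))
    return sum(errs) / k

# Two well-separated classes, so the estimated error should be low.
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(4, 1), 1) for _ in range(50)]
cv_err = cross_val_error(data)
print(f"5-fold CV error: {cv_err:.2f}")
```

Each fold's classifier is trained without its test samples, which is what makes the averaged error a (noisy) estimate of performance on future cases.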
LOOCV (leave-one-out cross-validation)
[Figure: the candidate fits scored by leave-one-out error: mean square error 2.12; mean square error 0.96 (best); mean square error 3.33]
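A sketch of LOOCV-based model selection (synthetic data; the two candidate models here are illustrative stand-ins for the fits in the regression example above):

```python
import random

random.seed(5)

xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]   # noisy line

def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return lambda x, a=my - slope * mx, b=slope: a + b * x

def fit_mean(xs, ys):
    """Constant model: just the mean response."""
    return lambda x, m=sum(ys) / len(ys): m

def loocv_mse(xs, ys, fit):
    """Leave each point out, refit, and score it on the held-out point."""
    errs = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((model(xs[i]) - ys[i]) ** 2)
    return sum(errs) / len(errs)

mean_mse = loocv_mse(xs, ys, fit_mean)
line_mse = loocv_mse(xs, ys, fit_line)
print(f"LOOCV MSE: mean model {mean_mse:.2f}, line model {line_mse:.2f}")
```

The model with the lower leave-one-out mean square error is preferred, exactly as in the slide's comparison of the three fits.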
Estimating Error on Future Cases
Use cross-validation to estimate accuracy on future cases
Feature selection and model selection must be within the loop to avoid overly optimistic estimates
Methodology
Best case: have an independent test set
Otherwise, use resampling techniques: the data set is shuffled repeatedly into training and test sets, with NO information passage between them
Average performance on the test set provides an estimate for behavior on future cases
This can be MUCH different than behavior on the training set
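The warning about keeping feature selection inside the loop can be demonstrated directly. In the simulation below (the sizes and the one-feature centroid rule are assumptions), the labels are pure noise, yet selecting the "best" feature before cross-validation yields an optimistic error estimate, while selecting within each fold does not:

```python
import random

random.seed(3)

n, p = 30, 1000
labels = [i % 2 for i in range(n)]                 # balanced, pure-noise problem
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

def centroid_error(train_idx, test_idx, j):
    """Error of a one-feature nearest-centroid rule, trained/tested by index."""
    means = {c: [X[i][j] for i in train_idx if labels[i] == c] for c in (0, 1)}
    means = {c: sum(v) / len(v) for c, v in means.items()}
    wrong = sum(min(means, key=lambda c: abs(X[i][j] - means[c])) != labels[i]
                for i in test_idx)
    return wrong / len(test_idx)

def best_feature(idx):
    """Feature with the lowest resubstitution error on the given samples."""
    return min(range(p), key=lambda j: centroid_error(idx, idx, j))

folds = [list(range(i, n, 5)) for i in range(5)]   # 5 balanced folds
j_all = best_feature(list(range(n)))               # WRONG: selection sees all data

biased, honest = [], []
for fold in folds:
    train = [i for i in range(n) if i not in fold]
    biased.append(centroid_error(train, fold, j_all))
    honest.append(centroid_error(train, fold, best_feature(train)))  # selection inside

biased_avg, honest_avg = sum(biased) / 5, sum(honest) / 5
print(f"selection outside CV: {biased_avg:.2f}; inside CV: {honest_avg:.2f}")
```

With no real signal, the honest estimate hovers near chance (0.5), while the leaky protocol reports a much lower error for what is in fact a useless classifier.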
Classification methods
k-nearest neighbor
Support vector machine (SVM)
Linear, quadratic
Perceptrons, neural networks
Decision trees
k-Top Scoring Pairs
Many others
Molecular signature classifiers
Example Study
Diagnosing similar cancers with
different treatments
Challenge in medicine: diagnosis, treatment, prevention
of disease suffer from lack of knowledge
Gastrointestinal Stromal Tumor (GIST)
and Leiomyosarcoma (LMS)
morphologically similar, hard to distinguish using current methods
different treatments, correct diagnosis is critical
studying genome-wide patterns of expression aids clinical diagnosis
[Figure: tumor specimens from a GIST patient and an LMS patient; which is which?]
Goal: Identify molecular signature that will accurately differentiate these
two cancers
Relative Expression Reversal Classifiers
Find a classification rule as follows:
IF gene A > gene B THEN class1, ELSE class2
Classifier is chosen by finding the most accurate and
robust rule of this type from all possible pairs in
the dataset
If needed, a set of classifiers of the above form
can be used, with final classification resulting from
a majority vote (k-TSP)
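A minimal version of this pair search can be sketched as follows (the data and gene indices are synthetic; the score is the standard TSP score, the absolute difference between the two classes in the probability that gene A exceeds gene B):

```python
import random

random.seed(4)

def tsp_fit(X, y, n_genes):
    """Score every gene pair by |P(A > B | class 0) - P(A > B | class 1)|."""
    best_pair, best_score = None, -1.0
    for a in range(n_genes):
        for b in range(a + 1, n_genes):
            above = [0, 0]
            counts = [0, 0]
            for row, label in zip(X, y):
                counts[label] += 1
                above[label] += row[a] > row[b]
            score = abs(above[0] / counts[0] - above[1] / counts[1])
            if score > best_score:
                best_pair, best_score = (a, b), score
    return best_pair, best_score

# Synthetic data: gene 0 > gene 1 in class 0, reversed in class 1,
# plus 8 uninformative genes.
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        hi, lo = random.uniform(5, 9), random.uniform(1, 4)
        row = ([hi, lo] if label == 0 else [lo, hi])
        row += [random.uniform(0, 10) for _ in range(8)]
        X.append(row)
        y.append(label)

pair, score = tsp_fit(X, y, n_genes=10)
print(f"best pair: genes {pair}, score {score:.2f}")
# decision rule: IF gene pair[0] > gene pair[1] THEN class 0, ELSE class 1
```

Because only the within-sample ordering of the two genes matters, the rule is unchanged by any monotone normalization of each array.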
Geman, D. et al., Stat. Appl. Genet. Mol. Biol., 3, Article 19, 2004
Tan et al., Bioinformatics, 21:3896-904, 2005
Rationale for k-TSP
Based on concept of relative expression reversals
Advantages
Does not require data normalization
Does not require population-wide cutoffs or weighting functions
Has reported accuracies in the literature comparable to SVMs, PAM, and
other state-of-the-art classification methods
Results in classifiers that are easy to implement
Designed to avoid overfitting
n = number of genes, m = number of samples
The number of candidate gene-pair rules, C(n, 2), is far smaller than the number of possible labelings of the samples, 2^m
For the example I will show, this yields: C(n, 2) ≈ 10^9 << 2^m ≈ 10^20
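A quick back-of-the-envelope check of this complexity argument (the values of n and m below are assumptions, chosen to reproduce the slide's orders of magnitude):

```python
import math

# n and m are illustrative assumptions chosen to reproduce the slide's figures
n = 43_000   # genes on a typical microarray (order of magnitude)
m = 68       # patient samples

pairs = math.comb(n, 2)      # candidate gene-pair rules
labelings = 2 ** m           # possible binary labelings of the samples
print(f"C(n, 2) = {pairs:.2e}  <<  2^m = {labelings:.2e}")
```

A rule family this small relative to the labelings it could be asked to fit has little capacity to overfit.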
Diagnostic Marker Pair
[Figure: scatter plot of OBSCN expression vs. C9orf65 expression (log scales, 10^1–10^5); the pairwise decision rule splits the plane into regions classified as GIST and as LMS; clinicopathological diagnosis: X – GIST, O – LMS]
Accuracy on data = 99%; predicted accuracy on future data (LOOCV) = 98%
Price, N.D. et al., PNAS 104:3414-9 (2007)
RT-PCR Classification Results
[Figure: difference of average Ct between OBSCN and C9orf65 for each sample, separating GIST from LMS; representative amplification traces shown for samples 79 and 62]
100% Accuracy
19 independent samples
20 samples from microarray study, including a previously indeterminate case
Price, N.D. et al., PNAS 104:3414-9 (2007)
Comparative biomarker accuracies
[Figure: c-kit gene expression (log scale) plotted against the 2-gene relative expression classifier (OBSCN / C9orf65 expression); GIST – X, LMS – O]
Price, N.D. et al., PNAS 104:3414-9 (2007)
Kit Protein Staining of GIST-LMS
[Figure: Kit immunostaining; blue arrows – GIST, red arrows – LMS; top row – GIST positive staining, bottom row – GIST negative staining]
Accuracy as a classifier ~87%
Price, N.D. et al., PNAS 104:3414-9 (2007)
A few general lessons
Choosing markers based on relative expression
reversals of gene pairs has proven to be very
robust, with high predictive accuracy in the sets we
have tested so far
Simple and independent of normalization
Ultimately easy to implement as a clinical test
All that's needed is RT-PCR on two genes
Advantages of this approach may be even more
applicable to proteins in the blood
Each decision rule requires measuring the
relative concentration of 2 proteins
Network-based classification
Network-based classification
Can modify feature selection methods based on networks
Can improve performance (not always)
Generally improves biological insight by integrating heterogeneous data
Shown to improve prediction of breast cancer metastasis (complex phenotype)
Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:140 (2007)
Rationale: Differential Rank Analysis (DIRAC)
Cancer is a multi-genic disease
Analyze high-throughput data to identify aspects of the genome-scale network that are most affected
Networks or pathways inform the best targets for therapies
Initial version uses a priori defined gene sets
BioCarta, KEGG, GO, etc.
[Figure: the OBSCN vs. C9orf65 diagnostic marker pair plot from earlier; accuracy on data = 99%, predicted accuracy on future data (LOOCV) = 98%]
Differential rank conservation (DIRAC) for studying:
Expression rank conservation for pathways within a phenotype
Pathways that discriminate well between phenotypes
Eddy, J.A. et al., PLoS Computational Biology (2010)
Price, N.D. et al., PNAS, 2007
Differential Rank Conservation
…across pathways in a phenotype
[Figure: schematic gene orderings (g1–g4) across samples: a tightly regulated pathway shows the same ranking in every sample (highest conservation); a weakly regulated pathway shows shuffled rankings (lowest conservation)]
…across phenotypes for a pathway
[Figure: a pathway's ranking (g1–g8) shuffled between phenotypes, GIST vs. LMS]
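The conservation idea sketched above can be computed roughly as follows (a simplified reading of DIRAC's rank template and matching score, with synthetic data):

```python
from itertools import combinations

def rank_template(samples, genes):
    """Majority ordering of each gene pair over the samples (True: i below j)."""
    return {(i, j): sum(s[i] < s[j] for s in samples) > len(samples) / 2
            for i, j in combinations(genes, 2)}

def matching_score(sample, template):
    """Fraction of gene pairs whose ordering matches the template."""
    hits = sum((sample[i] < sample[j]) == t for (i, j), t in template.items())
    return hits / len(template)

def conservation_index(samples, genes):
    """Mean matching score (mu_R) of the samples against their own template."""
    template = rank_template(samples, genes)
    return sum(matching_score(s, template) for s in samples) / len(samples)

genes = ["g1", "g2", "g3", "g4"]
# Tightly regulated pathway: same ordering g1 < g2 < g3 < g4 in every sample.
tight = [{"g1": 1 + e, "g2": 2 + e, "g3": 3 + e, "g4": 4 + e}
         for e in (0.0, 0.1, 0.2)]
# Weakly regulated pathway: ordering shuffles from sample to sample.
loose = [{"g1": 1, "g2": 2, "g3": 3, "g4": 4},
         {"g1": 4, "g2": 3, "g3": 2, "g4": 1},
         {"g1": 2, "g2": 4, "g3": 1, "g4": 3}]

print(f"tight pathway mu_R: {conservation_index(tight, genes):.2f}")
print(f"loose pathway mu_R: {conservation_index(loose, genes):.2f}")
```

A tightly regulated pathway scores µR = 1.0 (every sample matches the template ordering), while the shuffled pathway scores well below it; comparing templates between phenotypes then flags differentially regulated pathways.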
Visualizing global network rank conservation

Network name    Num. genes    µR
GS              6             1.000
FOSB            4             0.981
AKAP13          7             0.955
AGPCR           11            0.955
RNA             8             0.948
CACAM           12            0.947
NDKDYNAMIN      17            0.946
ETC             8             0.946
SET             11            0.945
…
ALTERNATIVE     8             0.847
ALK             34            0.845
LAIR            14            0.844
PITX2           16            0.840
METHIONINE      5             0.839
IL5             10            0.833
STEM            15            0.829
ION             5             0.806
CYTOKINE        21            0.805
IL18            6             0.763
LEPTIN          8             0.728
Average rank conservation across all 248 networks: 0.903
Global regulation of networks across phenotypes
[Figure: networks ordered from highest to lowest rank conservation]
Tighter network regulation: normal prostate
Looser network regulation: primary prostate cancer
Loosest network regulation: metastatic prostate cancer
Eddy et al., PLoS Computational Biology (2010)
Differential rank conservation of the MAPK network
DIRAC classification is comparable to other methods
Cross validation accuracies in prostate cancer
Differential Rank Conservation
(DIRAC): Key Features
Independent of data normalization
Independent of genes/proteins outside the network
Can show massive/complete perturbations
Unlike Fisher's exact test (e.g. GO enrichment)
Measures the "shuffling" of the network in terms of the hierarchy
of expression of the components
Distinct from enrichment or GSEA
Provides a distinct mathematical classifier, yielding a
measurement of predictive accuracy on test data
Stronger than a p-value for determining signal
Code for the method can be found at our website:
http://price.systemsbiology.net
Eddy et al., PLoS Computational Biology (2010)
Global Analysis of Human
Disease
Importance of broad context to disease diagnosis
The envisioned future of blood diagnostics
Next generation molecular disease-screening
Why global disease analyses are essential
Organ-specificity: separating signal from noise
Hierarchy of classification
Context-independent classifiers
Based on organ-specific markers
Context-dependent classifiers
Based on excellent markers once organ-specificity defined
Provide context for how disease classifiers should
be defined
Provide broad perspective into how separable
diseases are and if disease diagnosis categories
seem appropriate
GLOBAL ANALYSIS OF DISEASE-PERTURBED TRANSCRIPTOMES IN
THE HUMAN BRAIN
Example case study
Multidimensional scaling plot of brain disease data
[Figure: MDS plot of samples from AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal classes]
Identification of Structured Signatures And Classifiers (ISSAC)
• At each class in the decision tree, a test sample is either allowed to pass down the tree for further classification or rejected (i.e. 'does not belong to this class') and thus unable to pass
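A toy version of this pass-or-reject traversal (the tree, tests, scores, and thresholds below are hypothetical placeholders, not ISSAC's actual signatures):

```python
# Each node carries a membership test; a sample either passes down the tree
# for finer classification or is rejected at that node.
def classify(sample, tree):
    """Walk the tree; return the leaf label reached, or a rejection message."""
    node = tree
    while True:
        if not node["test"](sample):
            return "rejected: does not belong to " + node["name"]
        if "children" not in node:
            return node["name"]
        # descend into the best-scoring child
        node = max(node["children"], key=lambda c: c["score"](sample))

# Toy two-level hierarchy: brain sample -> tumor vs. normal.
tree = {
    "name": "brain",
    "test": lambda s: s["brain_marker"] > 0.5,
    "children": [
        {"name": "tumor",  "test": lambda s: s["proliferation"] > 0.5,
         "score": lambda s: s["proliferation"]},
        {"name": "normal", "test": lambda s: s["proliferation"] <= 0.5,
         "score": lambda s: 1 - s["proliferation"]},
    ],
}

print(classify({"brain_marker": 0.9, "proliferation": 0.8}, tree))   # tumor
print(classify({"brain_marker": 0.1, "proliferation": 0.8}, tree))   # rejected
```

The reject option is what distinguishes this scheme from a flat multi-class classifier: a sample that matches no signature is flagged rather than forced into the nearest class.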
Accuracy on randomly split test sets
[Figure: classification accuracy (%) by class for AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal/control; per-class values range from 81.8% to 100%]
Average accuracy of all class samples: 93.9%
The challenge of 'Lab Effects'
Sample heterogeneity issues in personalized medicine
Independent hold-out trials for 18 GSE datasets
[Figure: hold-out accuracy (0–100%) per dataset: Normal (GSE7307, GSE3526), PA (GSE12907, GSE5675), OLG (GSE4290, GSE4412), MNG (GSE16581, GSE9438, GSE4780), MDL (GSE12992, GSE10327), GBM (GSE4290, GSE9171, GSE8692, GSE4271, GSE4412), EPN (GSE21687, GSE16155)]
[Figure: class sensitivity (0–1) for each individual expression data set (E_* study batches)]
Leave-batch-out validation shows impact of other batch effects
[Figure: class sensitivity (0–1) by excluded study batch (ADC, SCC, LCLC, AST, COPD, NORM), for ISSAC and for ooSVM (50 features)]
Take home messages
There is tremendous promise in high-throughput
approaches to identify biomarkers
Significant challenges remain to their broad
success
Integrative systems approaches that link together
data very broadly are essential
If the training set is representative of the population,
there are robust signals in the data and
excellent accuracy is possible
Forward designs and partnering closely with
clinicians are essential, as is
standardization of data collection and analysis
Summary
Molecular signature classifiers provide a promising avenue
for disease stratification
Machine-learning approaches are key
Goal is optimal prediction of future data
Must avoid overfitting
Model complexity
Feature selection & model selection
Technical challenges
Measurement platforms
Network-based classification
Global disease context is key
Lab and batch effects are critical to overcome
Sampling of heterogeneity for some diseases is now sufficient to
achieve stability in classification accuracies
Acknowledgments
Nathan D. Price Research Laboratory
Institute for Systems Biology, Seattle, WA | University of Illinois, Urbana-Champaign, IL
Price Lab Members
Collaborators
Seth Ament, PhD
Daniel Baker
Matthew Benedict
Julie Bletz, PhD
Victor Cassen
Sriram Chandrasekaran
Nicholas Chia, PhD (now
Asst. Prof. at Mayo Clinic)
John Earls
James Eddy
Cory Funk, PhD
Pan Jun Kim, PhD (now
Asst. Prof. at POSTECH)
Alexey Kolodkin, PhD
Charu Gupta Kumar, PhD
Ramkumar Hariharan, PhD
Ben Heavner, PhD
Piyush Labhsetwar
Andrew Magis
Caroline Milne
Shuyi Ma
Beth Papanek
Matthew Richards
Areejit Samal, PhD
Vineet Sangar, PhD
Bozenza Sawicka
Evangelos Simeonidis
Jaeyun Sung
Chunjing Wang
Don Geman (Johns Hopkins)
Wei Zhang (MD Anderson)
Funding
• NIH / National Cancer Institute - Howard
Temin Pathway to Independence Award
• NSF CAREER
• Department of Energy
• Energy Biosciences Institute (BP)
• Department of Defense (TATRC)
• Luxembourg-ISB
Systems Medicine Program
• Roy J. Carver Charitable Trust
Young Investigator Award
• Camille Dreyfus
Teacher-Scholar Award