Disease Stratification - Baliga Lab at Institute for Systems Biology


Systems Approaches to Disease Stratification
Nathan Price
Introduction to Systems Biology Short Course
August 20, 2012
Goals and Motivation
- Currently, most diagnoses are based on symptoms and visual features (pathology, histology)
- However, many diseases appear deceptively similar but are, in fact, distinct entities from the molecular perspective
- Drive towards personalized medicine
Outline
- Molecular signature classifiers: main issues
  - Signal to noise
  - Small sample size issues
  - Error estimation techniques
  - Phenotypes and sample heterogeneity
  - Example study
- Advanced topics
  - Network-based classification
  - Importance of broad disease context
Molecular signature classifiers
Overall strategy
Molecular signatures for diagnosis
- The goals of molecular classification of tumors:
  - Identify subpopulations of cancer
  - Inform choice of therapy
- Generally, a set of microarray experiments is used, with:
  - ~100 patient samples
  - ~10^4 transcripts (genes)
- This very small number of samples relative to the number of transcripts is a key issue
  - Feature selection & model selection
  - Small sample size issues dominate
  - Error estimation techniques
- Also, the microarray platform used can have a significant effect on results
Randomness
- Expression values have randomness arising from both biological and experimental variability.
- Design, performance evaluation, and application of classifiers must take this randomness into account.
Three critical issues arise…
- Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population?
- How does one estimate the error of a designed classifier when data is limited?
- Given a large set of potential variables, such as the large number of expression levels provided by each microarray, how does one select a set of variables as the input vector to the classifier?
Small sample issues
- Our task is to predict future events
  - Thus, we must avoid overfitting
  - It is easy (if the model is complicated enough) to fit the data we already have
  - Simplicity of the model is vital when data are sparse and the number of possible relationships is large
  - This is exactly the case in virtually all microarray studies, including ours
- In the clinic
  - In the end, we want a test that can easily be implemented and actually benefit patients
Error estimation and variable selection
- An error estimator may be unbiased but have a large variance, and therefore will often be low by chance.
- This can produce a large number of gene sets and classifiers with low error estimates.
- For a small sample, one can end up with thousands of gene sets for which the error estimate from the sample data is near zero!
Overfitting
- A complex decision boundary may be unsupported by the data relative to the feature-label distribution.
- Relative to the sample data, a classifier may have small error; but relative to the feature-label distribution, the error may be severe!
- A classification rule should not cut up the space in a manner too complex for the amount of sample data available.
Overfitting: example of KNN rule
[Figure: k-nearest-neighbor decision boundaries on a test sample with k = 3, for training set sizes N = 30 and N = 90]
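The KNN overfitting point can be reproduced with a tiny nearest-neighbor implementation: a 1-NN rule memorizes the training sample (zero apparent error) yet still errs on fresh data drawn from the same two overlapping classes. The data, sizes, and seed below are illustrative, not the slide's.

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """Classify each test point by majority vote among its k nearest training points."""
    preds = []
    for x in Xte:
        d = ((Xtr - x) ** 2).sum(1)          # squared distances to training points
        nn = np.argsort(d)[:k]               # indices of the k nearest neighbors
        preds.append(int(ytr[nn].sum() * 2 > k))  # majority of {0, 1} labels
    return np.array(preds)

rng = np.random.default_rng(0)

def sample(n):
    """Two overlapping Gaussian classes in 2-D (illustrative)."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

Xtr, ytr = sample(30)     # small training set, as in the slide
Xte, yte = sample(500)    # large "future" sample

train_err_k1 = np.mean(knn_predict(Xtr, ytr, Xtr, 1) != ytr)  # always 0: pure memorization
test_err_k1 = np.mean(knn_predict(Xtr, ytr, Xte, 1) != yte)   # nonzero on future cases
test_err_k3 = np.mean(knn_predict(Xtr, ytr, Xte, 3) != yte)   # smoother rule, k = 3
```

The gap between `train_err_k1` and `test_err_k1` is exactly the apparent-versus-true error gap the slide warns about.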
Example: How to identify appropriate models
(regression… but the issues are the same)

y = f(x) + n, where n is noise; we learn f from the data.

Candidate models:
- Linear…
- Quadratic…
- Piecewise linear interpolation…
Which one is best?
Cross-validation
- Simple: just choose the classifier with the best cross-validation error
- But… (there is always a but)
  - we are training on even less data, so the classifier design is worse
  - if the sample size is small, the test set is small and the error estimator has high variance
  - so we may be fooling ourselves into thinking we have a good classifier…
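A minimal sketch of cross-validated model selection for the regression example above: compare a linear, a quadratic, and a deliberately over-flexible degree-9 fit by k-fold cross-validation on noisy quadratic data. All data values, the seed, and the fold count are illustrative assumptions.

```python
import numpy as np

def cv_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a polynomial fit of the given degree by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)     # fit on the training fold only
        errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
    return float(np.mean(errs))

# Noisy data from y = f(x) + n with a quadratic f
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x - x**2 + rng.normal(0, 1.0, size=x.size)

# CV error for linear (1), quadratic (2), and over-flexible (9) models
scores = {d: cv_mse(x, y, d) for d in (1, 2, 9)}
best = min(scores, key=scores.get)
```

The linear model underfits (it cannot capture the x² term), while the degree-9 model fits the training folds closely but pays for its variance on the held-out folds; CV is what exposes both.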
LOOCV (leave-one-out cross-validation)
[Figure: LOOCV mean square errors for the three candidate models: 2.12, 0.96 (best), and 3.33]
Estimating Error on Future Cases
- Use cross-validation to estimate accuracy on future cases
- Feature selection and model selection must be within the loop to avoid overly optimistic estimates
- Methodology
  - Best case: have an independent test set
  - Resampling techniques: the data set is shuffled repeatedly into training and test sets, with NO information passage between them
- Average performance on the test set provides an estimate for behavior on future cases, which can be MUCH different than behavior on the training set
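The "selection must be within the loop" point can be demonstrated on pure noise: selecting the "best" genes on the full data before LOOCV yields a wildly optimistic error estimate, while selecting inside the loop returns roughly chance accuracy. The sample sizes, the mean-difference filter, and the nearest-mean classifier are all illustrative choices, not a published protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k_feats = 40, 2000, 10          # few samples, many "genes"
X = rng.normal(size=(n, p))           # pure noise: no real signal anywhere
y = np.array([0, 1] * (n // 2))

def top_features(Xtr, ytr, k):
    # rank genes by absolute difference of class means (a simple filter)
    diff = np.abs(Xtr[ytr == 0].mean(0) - Xtr[ytr == 1].mean(0))
    return np.argsort(diff)[-k:]

def nearest_mean_predict(Xtr, ytr, Xte):
    # assign each test sample to the nearer class centroid
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    d0 = ((Xte - m0) ** 2).sum(1)
    d1 = ((Xte - m1) ** 2).sum(1)
    return (d1 < d0).astype(int)

def loocv_acc(X, y, select_inside):
    feats_all = top_features(X, y, k_feats)   # WRONG when used below: sees test labels
    hits = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        f = top_features(X[tr], y[tr], k_feats) if select_inside else feats_all
        hits += nearest_mean_predict(X[tr][:, f], y[tr], X[[i]][:, f])[0] == y[i]
    return hits / len(y)

acc_wrong = loocv_acc(X, y, select_inside=False)  # optimistic: selection saw all labels
acc_right = loocv_acc(X, y, select_inside=True)   # honest: roughly chance on noise
```

On data with no signal at all, `acc_wrong` comes out far above 50%, which is exactly the "fooling ourselves" failure mode described above.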
Classification methods
- k-nearest neighbor
- Support vector machine (SVM)
- Linear, quadratic
- Perceptrons, neural networks
- Decision trees
- k-Top Scoring Pairs
- Many others
Molecular signature classifiers
Example Study
Diagnosing similar cancers with different treatments
- Challenge in medicine: diagnosis, treatment, and prevention of disease suffer from lack of knowledge
- Gastrointestinal Stromal Tumor (GIST) and Leiomyosarcoma (LMS)
  - morphologically similar, hard to distinguish using current methods
  - different treatments, so correct diagnosis is critical
  - studying genome-wide patterns of expression aids clinical diagnosis
[Figure: tissue images from a GIST patient and an LMS patient]
- Goal: Identify a molecular signature that will accurately differentiate these two cancers
Relative Expression Reversal Classifiers
- Find a classification rule of the form:
  - IF gene A > gene B THEN class 1, ELSE class 2
- The classifier is chosen by finding the most accurate and robust rule of this type from all possible pairs in the dataset
- If needed, a set of classifiers of the above form can be used, with the final classification resulting from a majority vote (k-TSP)

- Geman, D., et al. Stat. Appl. Genet. Mol. Biol., 3, Article 19, 2004
- Tan et al., Bioinformatics, 21:3896-904, 2005
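A brute-force sketch of finding a single top-scoring pair of the form above: score every gene pair by how differently the two classes order it, then classify by that one comparison. The toy data are illustrative, and the published TSP method also includes a secondary tie-breaking score not shown here.

```python
import numpy as np

def train_tsp(X, y):
    """Top Scoring Pair: pick the gene pair (a, b) maximizing
    |P(X_a < X_b | class 0) - P(X_a < X_b | class 1)|."""
    n_genes = X.shape[1]
    best, best_score = None, -1.0
    for a in range(n_genes):
        for b in range(a + 1, n_genes):
            p0 = np.mean(X[y == 0, a] < X[y == 0, b])   # ordering frequency, class 0
            p1 = np.mean(X[y == 1, a] < X[y == 1, b])   # ordering frequency, class 1
            if abs(p0 - p1) > best_score:
                best_score, best = abs(p0 - p1), (a, b, p0 > p1)
    return best  # (a, b, flag): predict class 0 when (X_a < X_b) == flag

def predict_tsp(rule, X):
    a, b, flag = rule
    return np.where((X[:, a] < X[:, b]) == flag, 0, 1)

# Toy data: genes 0 and 1 reverse their ordering between classes
rng = np.random.default_rng(0)
X0 = rng.normal([1.0, 2.0, 0.0], 0.3, size=(20, 3))  # class 0: gene 0 < gene 1
X1 = rng.normal([2.0, 1.0, 0.0], 0.3, size=(20, 3))  # class 1: gene 0 > gene 1
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

rule = train_tsp(X, y)                      # picks the (gene 0, gene 1) reversal
acc = np.mean(predict_tsp(rule, X) == y)
```

Note that the decision depends only on which of the two genes is higher within each sample, which is why this family of rules needs no normalization.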
Rationale for k-TSP
- Based on the concept of relative expression reversals
- Advantages
  - Does not require data normalization
  - Does not require population-wide cutoffs or weighting functions
  - Has reported accuracies in the literature comparable to SVMs, PAM, and other state-of-the-art classification methods
  - Results in classifiers that are easy to implement
- Designed to avoid overfitting
  - n = number of genes, m = number of samples
  - The number of candidate gene-pair rules is small relative to the number of possible labelings of the samples: C(n, 2) << 2^m
  - For the example I will show, this yields: 10^9 << 10^20
Diagnostic Marker Pair
[Figure: scatter plot of OBSCN expression vs. C9orf65 expression on log scales; samples on one side of the decision line are classified as GIST, on the other as LMS. Clinicopathological diagnosis: X = GIST, O = LMS]
Accuracy on data = 99%; predicted accuracy on future data (LOOCV) = 98%
- Price, N.D. et al, PNAS 104:3414-9 (2007)
RT-PCR Classification Results
[Figure: difference of average Ct between OBSCN and C9orf65 for each sample, separating GIST from LMS; amplification traces shown for representative samples 79 and 62]
- 100% accuracy
  - 19 independent samples
  - 20 samples from the microarray study
  - including a previously indeterminate case
- Price, N.D. et al, PNAS 104:3414-9 (2007)
Comparative biomarker accuracies
[Figure: c-kit gene expression plotted against the OBSCN / C9orf65 expression ratio of the 2-gene relative expression classifier, on log scales; GIST = X, LMS = O]
- Price, N.D. et al, PNAS 104:3414-9 (2007)
Kit Protein Staining of GIST-LMS
[Figure: Kit protein staining; blue arrows = GIST, red arrows = LMS. Top row: GIST-positive staining; bottom row: GIST-negative staining]
- Accuracy as a classifier ~87%
- Price, N.D. et al, PNAS 104:3414-9 (2007)
A few general lessons
- Choosing markers based on relative expression reversals of gene pairs has proven to be very robust, with high predictive accuracy in the sets we have tested so far
- Simple and independent of normalization
- Ultimately an easy-to-implement clinical test
  - All that's needed is RT-PCR on two genes
- The advantages of this approach may be even more applicable to proteins in the blood
  - Each decision rule requires measuring the relative concentration of only 2 proteins
Network-based classification
- Feature selection methods can be modified based on networks
- Can improve performance (though not always)
- Generally improves biological insight by integrating heterogeneous data
- Shown to improve prediction of breast cancer metastasis (a complex phenotype)
- Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:140 (2007)
Rationale: Differential Rank Analysis (DIRAC)
- Cancer is a multi-genic disease
- Analyze high-throughput data to identify the aspects of the genome-scale network that are most affected
- The initial version uses a priori defined gene sets
  - BioCarta, KEGG, GO, etc.
- Networks or pathways inform the best targets for therapies
- Differential rank conservation (DIRAC) for studying:
  - Expression rank conservation for pathways within a phenotype
  - Pathways that discriminate well between phenotypes
- Eddy, J.A. et al, PLoS Computational Biology (2010)
- Price, N.D. et al, PNAS, 2007
Differential Rank Conservation
[Schematic: …across pathways in a phenotype — a tightly regulated pathway keeps the same gene ranking (e.g. g3 > g2 > g1 > g4) in every sample (highest conservation), while a weakly regulated pathway shows a different ranking in each sample (lowest conservation). …across phenotypes for a pathway — a pathway's ranking over genes g1–g8 can be shuffled between phenotypes, e.g. between GIST and LMS]
Visualizing global network rank conservation
[Table of networks ranked by mean rank conservation µR; rows paired in the order listed on the slide]

Network name   Num. genes   µR
GS             6            1.000
FOSB           4            0.981
AKAP13         7            0.955
AGPCR          11           0.955
RNA            8            0.948
CACAM          12           0.947
NDKDYNAMIN     17           0.946
ETC            8            0.946
SET            11           0.945
…              …            …
ALTERNATIVE    8            0.847
ALK            34           0.845
LAIR           14           0.844
PITX2          16           0.840
METHIONINE     5            0.839
IL5            10           0.833
STEM           15           0.829
ION            5            0.806
CYTOKINE       21           0.805
IL18           6            0.763
LEPTIN         8            0.728
Visualizing global network rank conservation
Average rank conservation across all 248 networks: 0.903
Global regulation of networks across phenotypes
[Figure: networks ordered from highest to lowest rank conservation — tighter network regulation in normal prostate, looser in primary prostate cancer, loosest in metastatic prostate cancer]
- Eddy et al, PLoS Computational Biology (2010)
Differential rank conservation of the MAPK network
DIRAC classification is comparable to other methods
[Figure: cross-validation accuracies in prostate cancer]
Differential Rank Conservation (DIRAC): Key Features
- Independent of data normalization
- Independent of genes/proteins outside the network
- Can show massive/complete perturbations
  - Unlike Fisher's exact test (e.g. GO enrichment)
- Measures the "shuffling" of the network in terms of the hierarchy of expression of its components
  - Distinct from enrichment or GSEA
- Provides a distinct mathematical classifier to yield a measurement of predictive accuracy on test data
  - Stronger than a p-value for determining signal
- Code for the method can be found at our website: http://price.systemsbiology.net
- Eddy et al, PLoS Computational Biology (2010)
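The rank-conservation idea can be sketched directly: build a pairwise-ordering template for a gene set within a phenotype, then score each sample by the fraction of gene pairs whose ordering matches the template. This is a simplified reading of DIRAC's rank matching score on toy data, not the published implementation.

```python
import numpy as np

def rank_template(X):
    """For each gene pair (i, j), i < j: True if gene i < gene j in a majority
    of samples. X is a samples-by-genes matrix for one phenotype."""
    n = X.shape[1]
    return {(i, j): np.mean(X[:, i] < X[:, j]) > 0.5
            for i in range(n) for j in range(i + 1, n)}

def matching_score(template, sample):
    """Fraction of pairwise orderings in one sample that agree with the template."""
    hits = [(sample[i] < sample[j]) == t for (i, j), t in template.items()]
    return np.mean(hits)

def rank_conservation(X):
    """Mean matching score across a phenotype's samples (the µR of the slides)."""
    T = rank_template(X)
    return float(np.mean([matching_score(T, x) for x in X]))

# Tightly regulated pathway: same gene ordering in every sample -> µR = 1.0
tight = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.1, 2.2, 3.1, 4.4],
                  [0.5, 2.0, 2.5, 9.0]])

# Weakly regulated pathway: ordering shuffled per sample -> µR well below 1
rng = np.random.default_rng(0)
loose = np.array([rng.permutation(4) for _ in range(30)]).astype(float)

mu_tight = rank_conservation(tight)
mu_loose = rank_conservation(loose)
```

Because the score depends only on within-sample orderings, it inherits the same normalization-independence as the relative expression reversal classifiers earlier in the talk.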
Global Analysis of Human Disease
Importance of broad context to disease diagnosis
The envisioned future of blood diagnostics: next-generation molecular disease screening
Why global disease analyses are essential
- Organ-specificity: separating signal from noise
- Hierarchy of classification
  - Context-independent classifiers: based on organ-specific markers
  - Context-dependent classifiers: based on excellent markers once organ-specificity is defined
- Provide context for how disease classifiers should be defined
- Provide broad perspective into how separable diseases are and whether disease diagnosis categories seem appropriate
GLOBAL ANALYSIS OF DISEASE-PERTURBED TRANSCRIPTOMES IN THE HUMAN BRAIN
Example case study
Multidimensional scaling plot of brain disease data
[Figure: MDS plot with classes AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal]
Identification of Structured Signatures And Classifiers (ISSAC)
- At each class in the decision tree, a test sample is either allowed to pass down the tree for further classification or rejected (i.e. 'does not belong to this class') and thus unable to pass
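The pass-or-reject idea can be sketched as a chain of class tests with an explicit reject outcome (a simplified linearization of the tree; the class names are from the slides, but the scorers and thresholds are hypothetical stand-ins for molecular signatures, not the published ISSAC method).

```python
def classify_with_rejection(sample, nodes):
    """Walk an ordered list of (label, score_fn, threshold) nodes.
    At each node the sample is either claimed by that class or passed on;
    if no node claims it, it is rejected as 'unclassified'."""
    for label, score_fn, threshold in nodes:
        if score_fn(sample) >= threshold:
            return label
    return "unclassified"

# Toy one-dimensional scorers standing in for per-class signatures (hypothetical)
nodes = [
    ("GBM",    lambda s: s["marker_a"], 0.8),
    ("ALZ",    lambda s: s["marker_b"], 0.8),
    ("normal", lambda s: 1.0 - max(s["marker_a"], s["marker_b"]), 0.5),
]

r1 = classify_with_rejection({"marker_a": 0.9, "marker_b": 0.1}, nodes)  # "GBM"
r2 = classify_with_rejection({"marker_a": 0.2, "marker_b": 0.3}, nodes)  # "normal"
r3 = classify_with_rejection({"marker_a": 0.6, "marker_b": 0.6}, nodes)  # "unclassified"
```

The reject option is what lets a hierarchical classifier say "does not belong to this class" rather than forcing every sample into one of the known diagnoses.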
Accuracy on randomly split test sets
[Figure: bar chart of per-class classification accuracy (%) for AI, ALZ, GBM, MDL, MNG, NB, OLG, PRK, and normal/control; individual class accuracies range from 81.8% to 100%]
- Average accuracy across all class samples: 93.9%
The challenge of ‘Lab Effects’
Sample heterogeneity issues in
personalized medicine
Independent hold-out trials for 18 GSE datasets
[Figure: hold-out accuracies (0–100%) per dataset: Normal (GSE7307, GSE3526), PA (GSE12907, GSE5675), OLG (GSE4290, GSE4412), MNG (GSE16581, GSE9438, GSE4780), MDL (GSE12992, GSE10327), GBM (GSE4290, GSE9171, GSE8692, GSE4271, GSE4412), EPN (GSE21687, GSE16155)]
[Figure: per-dataset class sensitivity (0–1) across the individual expression datasets (E_12667, E_19188, E_3141, and others)]
Leave-batch-out validation shows the impact of other batch effects
[Figure: leave-batch-out class sensitivities (0–1) for ISSAC vs. ooSVM (50 features), for classes ADC, SCC, LCLC, AST, COPD, and NORM, as each study batch is excluded]
Take home messages
- There is tremendous promise in high-throughput approaches to identify biomarkers
- Significant challenges remain to their broad success
- Integrative systems approaches that link together data very broadly are essential
- If the training set is representative of the population, there are robust signals in the data and excellent accuracy is possible
- Forward designs and partnering closely with clinical collaborators are essential, as is standardization of data collection and analysis
Summary
- Molecular signature classifiers provide a promising avenue for disease stratification
- Machine-learning approaches are key
  - Goal is optimal prediction of future data
  - Must avoid overfitting
    - Model complexity
    - Feature selection & model selection
- Technical challenges
  - Measurement platforms
- Network-based classification
- Global disease context is key
- Lab and batch effects are critical to overcome
- Sampling of heterogeneity for some diseases is now sufficient to achieve stability in classification accuracies
Acknowledgments
Nathan D. Price Research Laboratory
Institute for Systems Biology, Seattle, WA | University of Illinois, Urbana-Champaign, IL
Price Lab Members
Collaborators
Seth Ament, PhD
Daniel Baker
Matthew Benedict
Julie Bletz, PhD
Victor Cassen
Sriram Chandrasekaran
Nicholas Chia, PhD (now Asst. Prof. at Mayo Clinic)
John Earls
James Eddy
Cory Funk, PhD
Pan Jun Kim, PhD (now Asst. Prof. at POSTECH)
Alexey Kolodkin, PhD
Charu Gupta Kumar, PhD
Ramkumar Hariharan, PhD
Ben Heavner, PhD
Piyush Labhsetwar
Andrew Magis
Caroline Milne
Shuyi Ma
Beth Papanek
Matthew Richards
Areejit Samal, PhD
Vineet Sangar, PhD
Bozenza Sawicka
Evangelos Simeonidis
Jaeyun Sung
Chunjing Wang
Don Geman (Johns Hopkins)
Wei Zhang (MD Anderson)
Funding
• NIH / National Cancer Institute - Howard
Temin Pathway to Independence Award
• NSF CAREER
• Department of Energy
• Energy Biosciences Institute (BP)
• Department of Defense (TATRC)
• Luxembourg-ISB
Systems Medicine Program
• Roy J. Carver Charitable Trust
Young Investigator Award
• Camille Dreyfus
Teacher-Scholar Award