Slides - The Fenyo Lab

Proteomics Informatics –
Molecular signatures (Week 14)
Definition of a molecular signature
A molecular signature is a computational or mathematical model that
links high-dimensional molecular information to a phenotype or other
response variable of interest.
The FDA calls them “in vitro diagnostic multivariate assays”.
Uses of molecular signatures
1. Models of disease phenotype/clinical outcome
• Diagnosis
• Prognosis, long-term disease management
• Personalized treatment (drug selection, titration)
2. Biomarkers for diagnosis, or outcome prediction
• Make the above tasks resource efficient, and
easy to use in clinical practice
3. Discovery of structure & mechanisms
(regulatory/interaction networks, pathways, subtypes)
• Leads for potential new drug candidates
Experimental Design
Experimental Design by Christine Ambrosino
www.hawaii.edu/fishlab/Nearside.htm
Overcoming the threat from chance and
bias to the validity of conclusions.
[Figure: a process converts inputs into outputs and is influenced by controllable factors and uncontrollable factors.]
• Recognition and statement of the problem (e.g. testing a
specific hypothesis or open-ended discovery).
• Selecting a response variable.
• Choosing controllable factors and their range.
• Listing uncontrollable factors and estimating their effect.
• Choosing the experimental design.
• Performing the experiment.
• Statistical analysis of the data.
• Designing the next experiment based on the results.
Exploring the Parameter Space
[Figure: score as a function of Factor 1 and Factor 2, explored one factor at a time, compared with factorial designs over Factor 1, Factor 2, and Factor 3.]
• 2-factor factorial design: 4 experiments
• 3-factor factorial design: 8 experiments
• k-factor factorial design: 2^k experiments
For example,
7 factors: 128 experiments,
10 factors: 1,024 experiments
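The growth in the number of runs can be sketched by enumerating a full two-level factorial design. This is a minimal illustration, not part of the original slides: the factor names and levels below are hypothetical, and `full_factorial` is a helper introduced here.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


# Hypothetical two-level factors (assumed for illustration only).
factors = {
    "gradient_length": ["1h", "3h"],
    "column_temperature": ["low", "high"],
    "injection_volume": ["small", "large"],
}

runs = full_factorial(factors)
for run in runs:
    print(run)
print(len(runs), "experiments for", len(factors), "two-level factors")
```

With k two-level factors the list has 2^k entries, which is why full factorial designs become impractical beyond a handful of factors.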
Randomization
• Statistical methods require that observations are
independently distributed random variables. Randomization
usually makes this assumption valid.
• Randomization guards against unknown and uncontrolled
factors.
• Randomize with respect to analysis order, location, material
etc.
[Figure: measured values of two groups plotted against order of measurements, not randomized versus randomized.]
No change in sensitivity during measurement:
• Not randomized: p = 0.19 (standard deviation: 0.8, 0.8)
• Randomized: p = 0.32 (standard deviation: 0.7, 0.9)
Change in sensitivity during measurement:
• p = 5.7x10^-6; p = 0.20; standard deviation: 1.8, 1.3
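Why randomizing the run order matters when sensitivity drifts can be sketched with a small simulation. The setup is an assumption made for illustration: two groups with an identical true signal, and a sensitivity that falls linearly by 1% per run. The function name `measured_means` and all numbers are hypothetical.

```python
import random


def measured_means(order, true_value=10.0, drift_per_run=0.01):
    """Mean measured signal per group when sensitivity drops linearly with run position."""
    totals, counts = {}, {}
    for position, group in enumerate(order):
        signal = true_value * (1.0 - drift_per_run * position)
        totals[group] = totals.get(group, 0.0) + signal
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Not randomized: all of group A is measured before group B.
sequential = ["A"] * 10 + ["B"] * 10

# Randomized: the same 20 runs in a shuffled order (seed fixed for repeatability).
randomized = sequential[:]
random.Random(0).shuffle(randomized)

seq = measured_means(sequential)
rnd = measured_means(randomized)
diff_seq = seq["A"] - seq["B"]
diff_rnd = rnd["A"] - rnd["B"]
print(f"not randomized: A - B = {diff_seq:.3f}")
print(f"randomized:     A - B = {diff_rnd:.3f}")
```

In the sequential order the drift is completely confounded with the group, so a difference appears even though the true signals are equal. Shuffling spreads the drift over both groups, at the cost of a larger within-group spread.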
Blocking
Blocking is used to control for known and controllable factors.
Randomized Complete Block Design - minimizing the effect of variability
associated with e.g. location, operator, plant, batch, time.
Instrument 1: Sample 3, Sample 1, Sample 4, Sample 2
Instrument 2: Sample 3, Sample 4, Sample 2, Sample 1
Instrument 3: Sample 2, Sample 1, Sample 3, Sample 4
Instrument 4: Sample 1, Sample 4, Sample 2, Sample 3
The Latin Square Design - minimizing the effect of variability
associated with two independent factors
              Operator 1  Operator 2  Operator 3  Operator 4
Instrument 1  Sample 1    Sample 2    Sample 4    Sample 3
Instrument 2  Sample 2    Sample 3    Sample 1    Sample 4
Instrument 3  Sample 3    Sample 4    Sample 2    Sample 1
Instrument 4  Sample 4    Sample 1    Sample 3    Sample 2
The rows and columns represent two restrictions on randomization
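The defining property of a Latin square (each sample appears exactly once per instrument and once per operator) can be checked and generated in a few lines. This is a sketch: `is_latin_square` and `random_latin_square` are helpers introduced here, and the generator shuffles a cyclic square, which is a simple construction that does not sample uniformly from all possible Latin squares.

```python
import random


def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return len(symbols) == n and rows_ok and cols_ok


def random_latin_square(n, rng):
    """Cyclic Latin square with rows, columns, and sample labels shuffled."""
    base = [[(r + c) % n for c in range(n)] for r in range(n)]
    rng.shuffle(base)                      # permute rows (instruments)
    cols = list(range(n))
    rng.shuffle(cols)                      # permute columns (operators)
    labels = list(range(1, n + 1))
    rng.shuffle(labels)                    # relabel samples
    return [[labels[row[c]] for c in cols] for row in base]


# The 4x4 assignment from the slide (rows: instruments, columns: operators).
slide_square = [
    [1, 2, 4, 3],
    [2, 3, 1, 4],
    [3, 4, 2, 1],
    [4, 1, 3, 2],
]
print("slide design is a Latin square:", is_latin_square(slide_square))

for row in random_latin_square(4, random.Random(1)):
    print(row)
```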
Replication
Replication is needed to estimate the variance in the measurements.
• Technical replicates (repeat measurements)
• Process replicates
• Biological replicates
Uncertainty in Determining the Mean
[Figure: distribution of the sample mean for Normal, Skewed, Long-tailed, and Complex distributions, for sample sizes from n = 3 to n = 1000.]
Standard error of the mean: $\mathrm{SEM} = \sigma / \sqrt{n}$
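The 1/sqrt(n) behaviour of the standard error of the mean can be checked numerically. The sketch below is an illustration under stated assumptions: it draws from a normal distribution with sigma = 1 only, and `sem` and `simulated_sem` are names introduced here.

```python
import math
import random
import statistics


def sem(sigma, n):
    """Theoretical standard error of the mean for n independent measurements."""
    return sigma / math.sqrt(n)


def simulated_sem(n, trials=1000, sigma=1.0, seed=0):
    """Spread of the sample mean over many repeated experiments of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


for n in (3, 10, 100):
    print(f"n={n:4d}  theory={sem(1.0, n):.3f}  simulated={simulated_sem(n):.3f}")
```

Going from n = 3 to n = 300 only reduces the uncertainty in the mean by a factor of 10, which is the practical cost of the square root.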
An example of bad experimental design
Before/After Treatment  Gradient Length  Date              Laboratory  Patient
Before                  3h               2010/07/02 13:08  1           6
Before                  3h               2010/07/02 19:15  1           11
Before                  3h               2010/07/04 18:19  1           4
Before                  3h               2010/07/05 00:26  1           10
Before                  3h               2010/07/11 05:29  1           16
Before                  3h               2010/07/11 08:33  1           17
Before                  3h               2010/07/11 14:39  1           19
Before                  3h               2010/07/11 20:46  1           29
Before                  3h               2010/07/19 00:12  1           20
Before                  3h               2010/07/19 09:22  1           53
Before                  3h               2010/07/19 12:26  1           58
Before                  3h               2010/07/19 15:29  1           61
Before                  3h               2010/07/25 09:17  1           35
Before                  3h               2010/07/25 12:20  1           39
After                   1h               2011/02/20 10:49  1           4
After                   1h               2011/02/20 13:57  1           6
After                   1h               2011/02/20 17:05  1           11
After                   1h               2011/03/04 14:07  2           15
After                   1h               2011/03/04 15:47  2           16
After                   1h               2011/03/04 17:06  2           17
After                   1h               2011/03/04 18:25  2           19
After                   1h               2011/03/04 19:44  2           20
After                   1h               2011/03/04 21:03  2           29
After                   1h               2011/03/05 02:19  2           35
After                   1h               2011/03/05 03:39  2           39
After                   1h               2011/03/05 04:57  2           53
After                   1h               2011/03/07 00:35  2           65
After                   1h               2011/03/07 02:51  2           58
Before                  3h               2011/04/16 20:43  1           11
After                   3h               2011/04/21 04:54  1           10
After                   3h               2011/04/21 11:00  1           15
After                   1h               2011/04/22 08:20  1           17
After                   1h               2011/04/23 09:03  1           65
Before                  3h               2011/04/23 21:20  1           20
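One way to see what is wrong with the runs listed above is to cross-tabulate treatment status against the other recorded factors. The sketch below encodes the 34 runs in listed order as (treatment, gradient length, laboratory); the encoding and variable names are introduced here for illustration.

```python
from collections import Counter

# The 34 runs in listed order: (before/after treatment, gradient length, laboratory).
runs = (
    [("Before", "3h", 1)] * 14
    + [("After", "1h", 1)] * 3
    + [("After", "1h", 2)] * 11
    + [("Before", "3h", 1)]
    + [("After", "3h", 1)] * 2
    + [("After", "1h", 1)] * 2
    + [("Before", "3h", 1)]
)

by_gradient = Counter((treatment, gradient) for treatment, gradient, _ in runs)
by_lab = Counter((treatment, lab) for treatment, _, lab in runs)

print("treatment x gradient length")
for treatment in ("Before", "After"):
    for gradient in ("1h", "3h"):
        print(f"  {treatment:6s} {gradient}: {by_gradient[(treatment, gradient)]}")

print("treatment x laboratory")
for treatment in ("Before", "After"):
    for lab in (1, 2):
        print(f"  {treatment:6s} lab {lab}: {by_lab[(treatment, lab)]}")
```

Every "Before" sample was run with a 3h gradient in laboratory 1, while most "After" samples used a 1h gradient and many were run in laboratory 2, months later. Any before/after difference is therefore hard to separate from gradient length, laboratory, and date.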
Experimental Design - Summary
• Chance and bias are threats to the conclusions drawn from
experiments
• Controllable and uncontrollable factors
• Randomization to guard against unknown and uncontrolled
factors
• Replication (technical, process, and biological replicates) is
used to estimate error in measurement and yields a more
precise estimate.
• Blocking to control for known and controllable factors
• Multiple testing
• Molecular markers
• Use your domain knowledge: using a designed experiment is
not a substitute for thinking about the problem.
• Keep the design and analysis as simple as possible.
• Recognize the difference between practical and statistical
significance.
• Design iterative experiments.
MammaPrint
• Developed by Agendia (www.agendia.com)
• 70-gene signature to stratify women with breast cancer that
hasn’t spread into “low risk” and “high risk” for recurrence of
the disease
• Independently validated in >1,000 patients
• So far performed >10,000 tests
• Cost of the test is ~$3,000
• In February 2007, the FDA cleared the MammaPrint test for
marketing in the U.S. for node-negative women under 61 years of
age with tumors of less than 5 cm.
• TIME Magazine’s 2007 “medical invention of the year”.
Oncotype DX Breast Cancer Assay
• Developed by Genomic Health (www.genomichealth.com)
• 21-gene signature to predict whether a woman with localized,
ER+ breast cancer is at risk of relapse
• Independently validated in thousands of patients
• So far performed >100,000 tests
• Price of the test is $4,175
• Not FDA approved but covered by most insurers, including
Medicare
• Its sales reached $170M in 2010 and, growing at a compound
annual rate, are projected to reach $300M by 2015.
Improved Survival and Cost Savings
In a 2005 economic analysis of recurrence in LN-, ER+ patients
receiving tamoxifen, Hornberger et al. performed a cost-utility
analysis using a decision analytic model. In this model, use of
the Recurrence Score result was predicted on average to increase
quality-adjusted survival by 16.3 years and reduce overall costs
by $155,128.
In a 2 million member plan, approximately 773 women are
eligible for the test. If half receive the test, given the high and
increasing cost of adjuvant chemotherapy, supportive care and
management of adverse events, the use of the Oncotype DX
assay is estimated to save approximately $1,930 per woman
tested (given an aggregate 34% reduction in chemotherapy use).
EF Petricoin III, AM Ardekani, BA Hitt, PJ Levine, VA Fusaro, SM Steinberg, GB Mills, C Simone, DA Fishman, EC Kohn, LA Liotta, "Use of proteomic patterns in serum to identify ovarian cancer", Lancet 359 (2002) 572-77.
E Check, "Proteomics and cancer: running before we can walk?", Nature 429 (2004) 496-7.
Example: OvaCheck
• Developed by Correlogic (www.correlogic.com)
• Blood test for the early detection of epithelial ovarian
cancer
• Failed to obtain FDA approval
• Looks for subtle changes in patterns among the tens of
thousands of proteins, protein fragments and metabolites in
the blood
• Signature developed by genetic algorithm
• Significant artifacts in data collection & analysis questioned
validity of the signature:
- Results are not reproducible
- Data collected differently for different groups of
patients
http://www.nature.com/nature/journal/v429/n6991/full/429496a.html
Main ingredients for developing
a molecular signature
Base-Line Characteristics
DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005) 142-9.
How to Address Bias
DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005) 142-9.
Principles and geometric representation
for supervised learning
• Want to classify objects as boats and houses.
• All objects before the coast line are boats and all
objects after the coast line are houses.
• Coast line serves as a decision surface that
separates two classes.
[Figure: boats and houses plotted by longitude and latitude.]
• Then the algorithm seeks to find a decision
surface that separates classes of objects.
[Figure: new, unlabeled objects (marked "?") on either side of the decision surface; those on one side are classified as houses, those on the other as boats.]
• Unseen (new) objects are classified as
“boats” if they fall below the decision
surface and as “houses” if they fall above it.
[Figure: some boats fall on the house side of the decision surface and will be misclassified as houses; one house falls on the boat side and will be misclassified as a boat.]
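The boats-and-houses picture can be written down as a few lines of code. This is a minimal sketch with a hand-picked straight line as the decision surface; the slope, intercept, and the toy coordinates are assumptions made for illustration, and no learning algorithm is run here.

```python
def classify(longitude, latitude, slope=0.5, intercept=1.0):
    """Label a point by the side of the line latitude = slope * longitude + intercept."""
    boundary = slope * longitude + intercept
    return "house" if latitude > boundary else "boat"


# Toy labeled objects: ((longitude, latitude), true label). Values are made up.
objects = [
    ((1.0, 0.5), "boat"),
    ((3.0, 1.0), "boat"),
    ((2.0, 2.5), "boat"),    # a boat above the line: will be misclassified
    ((1.0, 3.0), "house"),
    ((4.0, 4.0), "house"),
    ((4.0, 2.0), "house"),   # a house below the line: will be misclassified
]

errors = [
    (point, truth)
    for point, truth in objects
    if classify(*point) != truth
]
print("misclassified:", errors)
print("error rate:", len(errors) / len(objects))
```

A supervised learning algorithm would choose the slope and intercept from the labeled objects instead of fixing them by hand.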
In 2-D this looks simple, but what happens with
higher-dimensional data? Typical numbers of variables:
• 10,000-50,000 (gene expression microarrays, aCGH,
and early SNP arrays)
• >500,000 (tiled microarrays, SNP arrays)
• 10,000-1,000,000 (MS based proteomics)
• >100,000,000 (next-generation sequencing)
This is the ‘curse of dimensionality’.
High-dimensionality
(especially with small samples) causes:
• Some methods do not run at all (classical regression)
• Some methods give bad results (KNN, Decision trees)
• Very slow analysis
• Very expensive/cumbersome clinical application
• Tends to “overfit”
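The tendency to overfit with many variables and few samples can be sketched with purely random data. In the illustration below (all sizes and names are assumptions), labels and features are independent coin flips, so no feature carries real signal, yet the best of 5,000 features looks predictive on 10 samples.

```python
import random


def best_chance_feature(n_samples=10, n_features=5000, n_new=1000, seed=0):
    """Best training accuracy among random features, and accuracy on new random data."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]

    best_acc = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_samples)]
        acc = sum(f == y for f, y in zip(feature, labels)) / n_samples
        best_acc = max(best_acc, acc)

    # On new samples the selected feature is still unrelated to the label,
    # so its accuracy is that of a coin flip.
    new_acc = sum(rng.randint(0, 1) == rng.randint(0, 1) for _ in range(n_new)) / n_new
    return best_acc, new_acc


train_acc, new_acc = best_chance_feature()
print(f"best feature, training accuracy: {train_acc:.2f}")
print(f"same feature, accuracy on new data: {new_acc:.2f}")
```

Selecting the best-looking variable out of thousands is itself a form of fitting, which is why accuracy must be estimated on data that played no role in the selection.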
Two problems: Over-fitting & Under-fitting
• Over-fitting (a model to your data) = building a model
that is good in original data but fails to generalize
well to new/unseen data.
• Under-fitting (a model to your data) = building a
model that is poor in both original data and
new/unseen data.
Over/under-fitting are related to the complexity of the
decision surface and how well the training data is fit.
[Figure: outcome of interest Y versus predictor X, showing training data and future data, with fitted lines labeled "This line is good!", "This line overfits!", and "This line underfits!".]
Successful data analysis methods balance training
data fit with complexity:
- Too complex a signature (to fit the training data well) →
overfitting (i.e., the signature does not generalize).
- Too simplistic a signature (to avoid overfitting) →
underfitting (it will generalize, but the fit to both the
training and future data will be poor and predictive
performance low).
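The trade-off can be made concrete by comparing three models of increasing complexity on simulated data. This is a sketch under stated assumptions: the data follow y = 2x plus Gaussian noise, and the three models (a constant, a least-squares line, and a 1-nearest-neighbor lookup that memorizes the training set) are stand-ins chosen for illustration.

```python
import random


def make_data(n, rng, noise=0.5):
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    """Too simple: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Too complex: memorize the training set (1-nearest neighbor)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(2000, rng)

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:8s} train MSE={results[name][0]:7.3f}  test MSE={results[name][1]:7.3f}")
```

The memorizing model has zero training error but a larger error on future data than the line (overfitting), while the constant model is poor on both (underfitting).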
Challenges in computational analysis of omics data
It is relatively easy to develop a predictive model and even
easier to believe that a model is good when it is not.
There are both practical and theoretical problems.
Omics data has many special characteristics and is
difficult to analyze.
Example: OvaCheck, a blood test for early detection
of epithelial ovarian cancer, failed to obtain FDA approval.
- Looks for subtle changes in patterns of protein levels
- Signature developed by genetic algorithm
- Data collected differently for the different patient
groups
The Support Vector Machine (SVM)
approach for building molecular signatures
• The support vector machine (SVM) is a binary
classification algorithm.
• SVMs are important because of (a) theoretical
reasons:
- Robust to very large numbers of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
[Figure: normal patients and cancer patients plotted by gene X and gene Y.]
• Consider an example dataset described by 2 genes,
gene X and gene Y.
• Represent patients geometrically (by “vectors”).
• Find a linear decision surface (“hyperplane”) that can
separate patient classes and has the largest distance (i.e.,
largest “gap” or “margin”) between border-line patients (i.e.,
“support vectors”).
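The idea of the margin can be illustrated by computing it for candidate hyperplanes on a toy dataset. This sketch does not train an SVM (which would solve an optimization problem to find the best hyperplane); it only compares two hand-picked lines. The patient coordinates and the helper names `margin` and `support_vectors` are assumptions made for illustration.

```python
import math


def margin(w, b, points):
    """Smallest signed distance from any labeled point to the line w.x + b = 0."""
    norm = math.hypot(*w)
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, label in points)


def support_vectors(w, b, points, tol=1e-9):
    """Points that lie exactly on the margin (the border-line patients)."""
    m = margin(w, b, points)
    norm = math.hypot(*w)
    return [
        x for x, label in points
        if abs(label * (w[0] * x[0] + w[1] * x[1] + b) / norm - m) <= tol
    ]


# Toy data: (gene X, gene Y) with label -1 for normal and +1 for cancer patients.
patients = [
    ((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
    ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 5.0), +1),
]

diagonal = ((1.0, 1.0), -5.5)   # line x + y = 5.5
vertical = ((1.0, 0.0), -3.0)   # line x = 3

for name, (w, b) in [("diagonal", diagonal), ("vertical", vertical)]:
    print(f"{name}: margin = {margin(w, b, patients):.3f}, "
          f"support vectors = {support_vectors(w, b, patients)}")
```

Both lines separate the two classes, but the diagonal one leaves a larger gap to the nearest patients, so it is the one a maximum-margin classifier would prefer of the two.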
[Figure: cancer and normal patients that cannot be separated by a line in the gene X / gene Y plane are separated by a decision surface after a kernel mapping.]
• If such a linear decision surface does not exist, the data is
mapped into a much higher dimensional space (“feature
space”) where the separating decision surface is found.
• The feature space is constructed via a very clever
mathematical projection (“kernel trick”).
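The kernel trick can be shown on the smallest possible example. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x.z)^2 as an assumed example kernel: it gives the same value as a dot product in an explicit 3-dimensional feature space, without ever constructing that space. The XOR-like data points are made up for illustration.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit feature space that the kernel corresponds to."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# XOR-like data: not separable by any straight line in the original 2-D space.
points = [((1.0, 1.0), +1), ((-1.0, -1.0), +1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

for x, _ in points:
    for z, _ in points:
        assert abs(poly_kernel(x, z) - dot(feature_map(x), feature_map(z))) < 1e-9

# In feature space the middle coordinate alone separates the two classes.
for x, label in points:
    print(x, label, "middle feature:", round(feature_map(x)[1], 3))
```

Because the algorithm only needs dot products between patients, it can work with the kernel values and never has to compute the high-dimensional coordinates.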
Estimation of signature accuracy
• Large sample case: use hold-out validation (the data is split
once into a training set and a test set).
• Small sample case: use N-fold cross-validation (the data is split
into N parts; each part serves once as the test set while the
remaining parts are used for training).
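The N-fold splitting scheme can be sketched as a small generator of train/test index lists. `k_fold_splits` is a helper introduced here for illustration; the sample count and number of folds are arbitrary.

```python
import random


def k_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=10, n_folds=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

The signature is built on each training part and evaluated on the matching test part, and the N accuracy estimates are then averaged.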
Challenges in computational analysis of omics data
for development of molecular signatures
• Signature multiplicity (Rashomon effect)
• Poor experimental design
• Is there predictive signal?
• Assay validity/reproducibility
• Efficiency (Statistical and computational)
• Causality vs predictiveness
• Methods development (reinventing the wheel)
• Many variables, few samples, noise, artifacts
• Editorialization/Over-simplification/Sensationalism
General conclusions
1. Molecular signatures play a crucial role in personalized medicine
and translational bioinformatics.
2. Molecular signatures are being used to treat patients today, not
in the future.
3. Development of accurate molecular signatures should rely on the
use of supervised methods.
4. In general, there are many challenges for computational analysis
of omics data for development of molecular signatures.
5. One of these challenges is molecular signature multiplicity.
6. There exists an algorithm that can extract the set of maximally
predictive and non-redundant molecular signatures from
high-throughput data.