Statistical, Computational, and Informatics Tools for Biomarker Analysis
Methodology Development at the Data Management and Coordinating Center of the Early Detection Research Network
EDRN ORGANIZATIONAL STRUCTURE
18 Laboratories
2 Laboratories
NIST
8 Centers
CDCP
Chair: Bernard Levin
Chair: David Sidransky
An “infrastructure” for supporting collaborative research on molecular, genetic and other biomarkers in human cancer detection and risk assessment.
INFRASTRUCTURE
BIOREPOSITORY
• Specimens with matching controls and epidemiological data
• Infrastructure to provide preneoplastic tissues:
- Prostate
- Lung
- Ovarian
- Colon
- Breast
INFRASTRUCTURE
LABORATORY CAPACITY
• Capability in high-throughput molecular and biochemical assays
• Ability to respond to evolving technologies for EDRN needs
• Extensive experience and scale-up ability in proteomics and molecular assays
• Outstanding infrastructure for handling multiple assays and validation requests
INFRASTRUCTURE
DATA STORAGE AND MINING
• Outstanding track record in biomarker research
• Statistical and data mining technology
• Statistical and predictive models for multiple biomarkers
• Novel statistical methods to interpret high-throughput data
INFRASTRUCTURE
DATA EXCHANGE AND SHARING
• Improving informatics and information flow
• Network web sites:
- Public web site
- Secure web site
• Early Detection Research Network Exchange (ERNE)
• Standardization of data reporting: Common Data Elements (CDEs) developed
Early Detection Research Network (EDRN)
INFORMATICS AND INFORMATION FLOW
COLLABORATION
How To Become an Associate Member
• Contact one of the EDRN Principal Investigators to serve as a sponsor for an application. Three types of collaborative opportunities are available:
Type A: Novel research ideas complementing ongoing EDRN efforts; one year of funding at $100,000
Type B: Sharing of tools, technology and resources; no time limit
Type C: Participation in the EDRN meetings and workshop
For details on how to apply, see http://www.cancer.gov/edrn
DMCC Statisticians
• Margaret Pepe, Lead of Methodology Group
• Ziding Feng, Principal Investigator
• Yinsheng Qu
• Mary Lou Thompson
• Mark Thornquist
• Yutaka Yasui
Biomarker Lab Collaborators at Eastern Virginia Medical School
• Bao-Ling Adam
• John Semmes
• George Wright
Focus of Presentation
• Design: Phase Structure for Biomarker Research
• Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
Design: Phase Structure for Biomarker Research
• The three-phase structure for therapeutic trials is well established
• The structure promotes coherent, thorough, efficient development
• A similar structure needs to be developed for biomarker research
Biomarker Development
• Categorize process into 5 phases
• Define objectives for each phase
• Define ideal study designs, evaluation and criteria for proceeding further
• Standardize the process to promote efficiency and rigor
Figure 2. Phases of Biomarker Development
PHASE 1 (Preclinical Exploratory): Promising directions identified
PHASE 2 (Clinical Assay and Validation): Clinical assay detects established disease
PHASE 3 (Retrospective Longitudinal): Biomarker detects preclinical disease and a “screen positive” rule defined
PHASE 4 (Prospective Screening): Extent and characteristics of disease detected by the test and the false referral rate are identified
PHASE 5 (Cancer Control): Impact of screening on reducing burden of disease on population is quantified
The Details of Study Design
• Specific Aims
• Subject/Specimen Selection
• Outcome measures
• Evaluation of Results
• Sample Size Calculations
• Limitations / Pitfalls
Specific Aims
Phase 1
• Identify leads for potentially useful biomarkers
• Prioritize these leads
Phase 2
• Determine the sensitivity and specificity or ROC curve for the clinical biomarker assay in discriminating clinical cancer from controls
Specimen Selection -- Cases
Phase 1
• Cancers that are ultimately serious if not treated early, but treatable in early stage
• Spectrum of sub-types
• Collected at diagnosis
Phase 2: same criteria as for phase 1
• Wide spectrum of cases
• Clinical specimen at diagnosis
• From target screening population
Specimen Selection -- Controls
Phase 1
• Non-cancer tissue, same organ, same patient
• Normal tissue, non-cancer patient
• Benign growth tissue, non-cancer patient
Phase 2
• From potential target population for screening
Outcome Measures
Phase 1
• True positive and false positive rates (binary result)
• True positive rate at threshold yielding acceptable false positive rate
• ROC curve
Phase 2
• Results of clinical biomarker assay
Evaluation of Results
Phase 1
• Algorithms select and prioritize markers that best distinguish tumor from non-tumor tissue
• Initial exploratory studies need confirmation with new validation specimens
Phase 2
• ROC curves
• ROC regression to determine if characteristics of cases and/or characteristics of controls affect the biomarker’s discriminatory capacity
Sample Size
Phase 1
• Should be large enough so that very promising biomarkers are likely to be selected for phase 2 development
Phase 2
• Based on confidence intervals for the TPR or FPR, or confidence intervals for the ROC curve at selected critical points
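A minimal sketch of a confidence-interval-based phase 2 calculation follows. It is an illustration only, not the EDRN look-up tables: the Wilson score interval, the function names, and the planning values (expected TPR of 0.90, minimally acceptable TPR of 0.80) are all assumptions chosen for the example.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion such as a TPR or FPR."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def cases_needed(expected_tpr, min_acceptable_tpr, z=1.96, n_max=100000):
    """Smallest number of cases for which the lower Wilson bound exceeds
    min_acceptable_tpr, if the observed TPR equals expected_tpr.
    For planning purposes the expected count expected_tpr * n is allowed
    to be non-integer."""
    for n in range(2, n_max + 1):
        lower, _ = wilson_interval(expected_tpr * n, n, z)
        if lower > min_acceptable_tpr:
            return n
    return None

# Hypothetical planning values, not taken from the presentation.
n_cases = cases_needed(expected_tpr=0.90, min_acceptable_tpr=0.80)
print("cases needed:", n_cases)
```

The same function can be applied to the controls by treating the FPR (or 1 minus the FPR) as the proportion of interest.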
Findings: Sample Size Estimation
• For phase 1 microarray experiments, use of ROC curves is more efficient than comparing means
• For phase 2 studies, equal numbers of cases and controls are often not optimally efficient
• Sample size calculations and look-up tables are now available on the EDRN website
1. Pepe et al. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute 93(14):1054–61, 2001.
2. Pepe et al. “Elements of Study Design for Biomarker Development.” In Tumor Markers, Diamandis, Fritsche, Lilja, Chan, and Schwartz, eds. AACC Press, Washington, DC, 2002.
3. Pepe. “Statistical Evaluation of Diagnostic Tests & Biomarkers.” Oxford U. Press, 2003.
Selecting Differentially Expressed Genes
from Microarray Experiments
Lead: Margaret Pepe
Context
• Gene expression arrays for n_D tumor tissues and n_C normal tissues
• Y_ig = logarithm of relative intensity at gene g for tissue i
• For which genes are Y_ig different in some/most cases from the normals?
• How many tissues, n_D and n_C, should be evaluated in these experiments?
• Illustrated with ovarian cancer data
Statistical Measures for Gene Selection
• A two-sample t-test for each gene is typically used
• We argue that sensitivity and specificity are more directly relevant for cancer biomarker research
• Focus attention on high specificity (or high sensitivity)
• Use the partial area under the ROC curve, instead of the t-test, to rank genes
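The ranking idea can be sketched as follows. This is a rough illustration, not the authors' implementation: it computes an empirical partial AUC over false positive rates in [0, t0], with the control order statistics as thresholds. The function names, the default t0 = 0.1, and the toy data are assumptions.

```python
import numpy as np

def partial_auc(cases, controls, t0=0.1):
    """Empirical area under the ROC curve for false positive rates in [0, t0].

    The k = floor(t0 * n_C) largest control values are used as thresholds;
    each contributes (1 / n_C) * (fraction of cases above that threshold).
    """
    cases = np.asarray(cases, dtype=float)
    controls = np.sort(np.asarray(controls, dtype=float))[::-1]  # largest first
    k = int(np.floor(t0 * len(controls) + 1e-9))
    tpr_at_thresholds = [(cases > c).mean() for c in controls[:k]]
    return float(np.sum(tpr_at_thresholds)) / len(controls)

def rank_genes(expr_cases, expr_controls, t0=0.1):
    """Order genes (rows) from highest to lowest partial AUC."""
    scores = np.array([partial_auc(d, c, t0)
                       for d, c in zip(expr_cases, expr_controls)])
    return np.argsort(-scores), scores

# Toy data (hypothetical): gene 1 is over-expressed in every tumor, gene 0 is not.
controls = np.tile(np.arange(10.0), (2, 1))          # 2 genes x 10 normal tissues
cases = np.array([[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]])  # 2 genes x 3 tumor tissues
order, scores = rank_genes(cases, controls, t0=0.2)
print(order, scores)
```

With perfect separation the partial AUC reaches its maximum, t0, which is why gene 1 ranks first here.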
Example
Gene Rank (among 100 genes)
               gene #5    gene #97
t-test            10          4
partial AUC        3         31
[Figure: histograms of expression (frequency) in diseased and normal tissues for gene 5 and gene 97, and the ROC curves for both genes, plotting ROC(t) = P[Y_D > u] against t = P[Y_C > u]]
Sample Sizes for Gene Discovery Studies
• Traditional calculations are based on statistical hypothesis testing
• These are exploratory studies, so new methods are needed
• We propose to base calculations on the probability that a differentially expressed gene will rank high among all genes
• Use computer simulation for sample size calculations
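A simulation of this kind might look like the sketch below. The presentation uses the ovarian cancer data as its simulation model; here, purely as an assumption for illustration, expression values are drawn from normal distributions and a fixed shift is added to the truly differentially expressed genes. All names and default values are hypothetical.

```python
import numpy as np

def selection_power(n_d, n_c, n_genes=500, n_true=10, shift=1.5, top=50,
                    t0=0.2, n_sim=200, seed=0):
    """Estimate the probability that a truly differentially expressed gene
    (gene 0) ranks within the top `top` genes by partial AUC."""
    rng = np.random.default_rng(seed)
    k = max(1, int(np.floor(t0 * n_c + 1e-9)))   # number of control thresholds
    hits = 0
    for _ in range(n_sim):
        controls = rng.normal(size=(n_genes, n_c))
        cases = rng.normal(size=(n_genes, n_d))
        cases[:n_true] += shift                  # assumed effect in tumor tissue
        # k largest control values per gene, used as thresholds
        thresholds = -np.sort(-controls, axis=1)[:, :k]
        # fraction of cases above each threshold, per gene
        tpr = (cases[:, :, None] > thresholds[:, None, :]).mean(axis=1)
        pauc = tpr.sum(axis=1) / n_c
        # ties are resolved in favour of gene 0 (an optimistic convention)
        rank_of_gene0 = 1 + int(np.sum(pauc > pauc[0]))
        hits += int(rank_of_gene0 <= top)
    return hits / n_sim

for n in (15, 25, 50):
    print(n, selection_power(n, n, n_sim=50))
```

Repeating the calculation over a grid of (n_D, n_C) values gives a table of selection probabilities from which a sample size can be chosen.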
Table 3
Study power Pg{100 | k1} as a function of sample size using the ovarian cancer data as a simulation model. Also shown is the power for the more stringent criterion.

                          True Ranking (k1)
(n_D, n_C)      < 10     < 20     < 30     < 40     < 50
Pg{100 | k1}
(15, 15)        .997     .982     .934     .893     .850
(25, 25)       1.000     .996     .973     .949     .914
(50, 50)       1.000    1.000     .994     .987     .968
(100, 100)     1.000    1.000     .999     .998     .990
More stringent criterion
(15, 15)        .960     .654     .120     .016     .000
(25, 25)       1.000     .928     .486     .202     .024
(50, 50)       1.000    1.000     .836     .638     .206
(100, 100)     1.000    1.000     .984     .928     .608
• With 50 tumor and 50 normal tissues, we can be 83.6% sure that the top 30 genes will rank in the top 100 in the experiment.
• Pepe et al. Selecting differentially expressed genes from microarray experiments. Biometrics (in press).
Summary
• The methods we developed for selecting genes and calculating sample sizes are more appropriate than t-test rankings and hypothesis-test-based calculations for the purpose of diagnosis and early detection
Analysis:
Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
• Method development motivated by SELDI data from John Semmes/George Wright at Eastern Virginia Medical School
• Data consist of protein intensities at tens of thousands of mass/charge points on each of 297 individuals
• Developed three approaches to biomarker discovery: wavelets, boosted decision trees, and automated peak identification
The EVMS prostate cancer biomarker
project
• Prostate cancer patients:
N=99 early-stage
N=98 late-stage
• Normal controls
N=96
• Serum samples for proteomic analysis by Surface
Enhanced Laser Desorption/Ionization (SELDI)
• Goal: To discover protein signals that distinguish
cancers from normals
An example of SELDI output
48,000 mass/charge points (200K Da)
[Figure: SELDI spectrum, intensity versus mass/charge]
The design of the biomarker analysis
Normal (N=96), PCa-early (N=99), PCa-late (N=98)
Training Data: 167 PCa (84 early, 83 late) vs. 81 Normal
Test Data (Blinded): 30 PCa, 15 Normal
Wavelet Analysis
Lead: Yinsheng Qu
Steps in the wavelet analysis:
• Represent original data plot with a set of
wavelets (dimension reduction)
• Determine those wavelets that distinguish
between subgroups (information criterion)
• Define discriminating functions based on
the distinguishing wavelets (Fisher
discrimination)
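The three steps can be sketched on synthetic spectra as below. This is a simplified illustration, not the published method: a two-sample t-like score stands in for the information criterion, only two groups are used, and the small ridge term, the function names, and the simulated data are all assumptions.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar wavelet transform of a signal whose length is a power of 2."""
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail coefficients at this scale
        approx = (even + odd) / np.sqrt(2)          # smoothed signal for the next scale
    return np.concatenate([approx] + details[::-1])  # coarsest scale first

def select_coefficients(W, y, n_keep=5):
    """Keep the coefficients with the largest two-sample t-like score
    (a stand-in for the information criterion named on the slide)."""
    a, b = W[y == 1], W[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b)) + 1e-12
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-score)[:n_keep]

def fisher_rule(Z, y, ridge=1e-6):
    """Two-group Fisher linear discriminant: direction w and midpoint cut-off."""
    a, b = Z[y == 1], Z[y == 0]
    Sw = (np.atleast_2d(np.cov(a, rowvar=False)) * (len(a) - 1)
          + np.atleast_2d(np.cov(b, rowvar=False)) * (len(b) - 1))
    w = np.linalg.solve(Sw + ridge * np.eye(Z.shape[1]), a.mean(axis=0) - b.mean(axis=0))
    cut = w @ (a.mean(axis=0) + b.mean(axis=0)) / 2
    return w, cut

# Synthetic spectra (hypothetical): group 1 has an extra peak at positions 20-23.
rng = np.random.default_rng(0)
n, length = 40, 64
y = np.repeat([0, 1], n)
spectra = rng.normal(size=(2 * n, length))
spectra[y == 1, 20:24] += 3.0

W = np.array([haar_transform(s) for s in spectra])   # step 1: wavelet representation
keep = select_coefficients(W, y)                     # step 2: distinguishing wavelets
w, cut = fisher_rule(W[:, keep], y)                  # step 3: discriminating function
predicted = (W[:, keep] @ w > cut).astype(int)
accuracy = (predicted == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Training accuracy is optimistic because the coefficients are selected on the same samples; a held-out test set, as in the design above, gives a fairer estimate.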
[Figure: original data and reconstructed signal, plotted against M/Z]
[Figure: original data compared with reconstructions using 112, 225, and 450 wavelet coefficients, plotted against M/Z]
Three Group Classification: Normal, Cancer, BPH
12,352 mass spectrum data points, reduced to 3,420 Haar wavelet coefficients, of which 17 coefficients distinguish between the three groups.
2 classification functions generated.
Predicted:
Normal
Cancer
BPH
Normal
14
1
27
0
Truth:
Cancer BPH
0
7
3
0
8
Qu Y et al. Data reduction using discrete
wavelet transform in discriminant analysis
with very high dimension. Biometrics, in
press.
Boosted Decision Tree Method.
Lead: Yinsheng Qu/Yutaka Yasui
• This method combines multiple weak
learners into a very accurate classifier
• It can be used in cancer detection
• It can also be used in identification of tumor
markers
• Using this method we can separate controls,
BPH, and PCA without error in test set
Outline of the boosted decision tree method
• The combined classifier is a committee whose members are the base classifiers, here decision stumps. It makes decisions by majority vote.
• The base classifiers are constructed on weighted examples: misclassified examples receive larger weights in the next round.
• The 2nd stump’s specialty is to correct the 1st stump’s mistakes, the 3rd stump’s specialty is to correct the 2nd stump’s mistakes, and so on.
• A combined classifier with dozens or even hundreds of decision stumps can be very accurate.
• The boosting technique is resistant to overfitting.
Classifier 2: A boosted decision stump classifier with 21 peaks (potential markers)

                     Training set (predicted)       Testing set (predicted)
True class         normal     bph    cancer       normal     bph    cancer
normal               82        0        0           14        0        1
bph                   0       74        3            0       15        0
cancer                7        0      160            0        1       29
sensitivity               95.81%                         96.67%
specificity               98.11%                         96.67%
# of peaks         21 in 21 base classifiers
minimal margin     -0.2555
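As a check on how the reported rates relate to the counts, the short sketch below recomputes them. It assumes that rows are the true class and columns the predicted class, in the order normal, bph, cancer, and that specificity pools normal and BPH subjects as non-cancer. Those assumptions reproduce the reported percentages; the function name is hypothetical.

```python
import numpy as np

# Counts from the slide; rows = true class, columns = predicted class (assumed),
# both in the order (normal, bph, cancer).
train = np.array([[82, 0, 0], [0, 74, 3], [7, 0, 160]])
test = np.array([[14, 0, 1], [0, 15, 0], [0, 1, 29]])

def cancer_sensitivity_specificity(cm, cancer=2):
    """Sensitivity and specificity for cancer versus non-cancer (normal + bph)."""
    sensitivity = cm[cancer, cancer] / cm[cancer].sum()
    non_cancer = np.delete(cm, cancer, axis=0)        # rows of true non-cancer subjects
    specificity = 1 - non_cancer[:, cancer].sum() / non_cancer.sum()
    return sensitivity, specificity

for name, cm in (("training", train), ("testing", test)):
    sens, spec = cancer_sensitivity_specificity(cm)
    print(f"{name}: sensitivity {100 * sens:.2f}%, specificity {100 * spec:.2f}%")
# training: sensitivity 95.81%, specificity 98.11%
# testing: sensitivity 96.67%, specificity 96.67%
```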
The Boosting procedure
• y_i = {cancer, normal} = {1, -1}, f_m(x_i) = {1, -1}
• Initial weights (m = 1): w_i = 1 (i = 1, ..., N).
• Choose first peak and threshold c.
• For m = 1 to M: w_i = w_i exp{a_m I(incorrect)}
– where a_m = ln((1 - err)/err) and err is the classification error rate at the current stage
– normalize the weights so they sum to N
– choose a peak and c (the i-th subject with weight w_i)
• Final classifier: f(x) = sum(a_m f_m(x)) over m = 1 to M. f(x_i) > 0 --> i-th subject classified as cancer
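The procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the SELDI analysis: stumps are found by exhaustive search over observed values, the error rate is clipped away from 0 and 1 to keep a_m finite, and the synthetic two-feature data stand in for peak intensities. Function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, c, s):
    """Decision stump: s * (+1 if feature j exceeds threshold c, else -1)."""
    return s * np.where(X[:, j] > c, 1, -1)

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j]):
            for s in (1, -1):
                err = np.sum(w * (stump_predict(X, j, c, s) != y)) / np.sum(w)
                if err < best[0]:
                    best = (err, j, c, s)
    return best

def boost(X, y, M=20):
    """Boosted stumps with a_m = ln((1 - err)/err); labels y are in {1, -1}."""
    N = len(y)
    w = np.ones(N)                                   # initial weights w_i = 1
    stumps, alphas = [], []
    for _ in range(M):
        err, j, c, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)         # assumption: avoid log(0)
        a = np.log((1 - err) / err)
        incorrect = stump_predict(X, j, c, s) != y
        w = w * np.exp(a * incorrect)                # up-weight misclassified subjects
        w = w * N / w.sum()                          # normalize so weights sum to N
        stumps.append((j, c, s))
        alphas.append(a)
    return stumps, np.array(alphas)

def score(X, stumps, alphas):
    """f(x) = sum over m of a_m f_m(x); f(x) > 0 is classified as cancer (+1)."""
    return sum(a * stump_predict(X, j, c, s) for a, (j, c, s) in zip(alphas, stumps))

# Synthetic data (hypothetical): the class depends on two "peaks" jointly.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = np.where(X.sum(axis=1) > 1, 1, -1)
stumps, alphas = boost(X, y, M=40)
train_acc = (np.where(score(X, stumps, alphas) > 0, 1, -1) == y).mean()
print(f"training accuracy with 40 stumps: {train_acc:.2f}")
```

No single stump can follow the diagonal boundary in this example, so the gain over one stump comes entirely from the weighted combination.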
When to stop iteration?
• Minimal margin: the minimum of y_i f(x_i) over all N subjects.
• The minimal margin in the training sample measures how well the two classes are separated by the classifier.
• Even after the classifier reaches zero error on the training sample, further iterations that still increase the minimal margin can improve prediction in future samples.
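A small sketch of the minimal margin computation is given below. The optional division by the sum of the |a_m|, which puts margins on a [-1, 1] scale, is an assumption added for illustration; the slides do not say whether the reported margin is normalized. The labels and scores are made-up values.

```python
import numpy as np

def minimal_margin(y, f, alphas=None):
    """Minimum of y_i * f(x_i) over all subjects.

    y      : labels in {1, -1}
    f      : classifier scores f(x_i)
    alphas : optional stump weights a_m; if given, margins are divided by
             sum(|a_m|) (an assumed normalization, not taken from the slides)
    """
    margins = np.asarray(y, dtype=float) * np.asarray(f, dtype=float)
    if alphas is not None:
        margins = margins / np.sum(np.abs(alphas))
    return float(margins.min())

# Hypothetical labels and scores for three subjects.
y = [1, -1, 1]
f = [2.0, -0.5, -0.3]
print(minimal_margin(y, f))                          # -0.3
print(minimal_margin(y, f, alphas=[1.0, 0.5, 0.5]))  # -0.15
```

A negative minimal margin means at least one training subject is still misclassified; a positive value means every training subject lies on the correct side of the decision boundary.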
• Qu et al. 2002. Boosted Decision Tree
Analysis of SELDI Mass Spectral Serum
Profiles Discriminates Prostate Cancer from
Non-Cancer Patients. Clinical Chemistry. In press.
• Adam et al. 2002. Serum Protein
Fingerprinting Coupled with a Pattern
Matching Algorithm that Distinguishes
Prostate Cancer from Benign Prostate
Hyperplasia and Healthy Men. Cancer
Research. 62:3609-3614.
Summary
• Wavelets approach: Does not require peak
identification (black-box classification)
• Boosting decision tree: Requires peak
identification first. Useful for both
classification and protein mass
identification
Final Summary
• The methods developed in the past two years are mainly for Phase 1 and 2 studies, reflecting the current needs of EDRN.
• EDRN DMCC statisticians are working on key design and analysis issues in early detection research.
• More work remains to be done (e.g., in classification, consider the mislabeling of prostate cancer as BPH; examine gene-by-environment interactions).