Statistical Analysis of Microarray Data


Transcript: Statistical Analysis of Microarray Data

Development of Genomic Prediction Models for
Personalized Medicine
James J. Chen
National Center for Toxicological Research
US Food and Drug Administration
Department of Statistics
National Cheng Kung University
September 18, 2014
* The views expressed in this presentation do not represent
those of the U.S. Food and Drug Administration
Outline

Background: personalized medicine, biomarkers,
personalized medicine biomarkers, subgroup identification.

Statistical methods in personalized medicine
 Binary classification
 Classification of survival outcome
 Imbalanced class size
 Biomarker adaptive design with subgroup selection

Data mining methods
 Bicluster analysis and topic modelling
One Size Fits All Approach
A drug is developed with the intent of treating the entire population:
 A new drug may not be approved if it is effective for only a fraction of the
patient population - Drug is effective for a subpopulation.
 An approved drug is sometimes removed after the post-marketing
discovery of unexpected toxicity - Drug is harmful to a subpopulation.
Post-surgery adjuvant chemotherapy in cancer patients:
 About 75% of stage II colon cancer patients are cured by surgery alone.
 About 40% of stage III patients are cured by surgery alone.
 Additional therapy is needed only for a subpopulation.
Variability among Patients: differences in disease outcomes and
response to a therapy can be attributed to disease characteristics,
genetic predispositions, dietary/environmental factors, interactions, etc.
One Size Does Not Fit All
Serious adverse drug reactions
caused by marketed drugs
Personalized Medicine
Personalized medicine uses information from a patient's
genotype to select the most appropriate therapy for a disease or
condition that is particularly suited to that patient.
The right treatment at the right
dose for the right person at the
right time for the right outcome.
Aim: Identifying molecular biomarkers that cause differential
disease outcomes or treatment responses for better matching of
disease with specific therapies to optimize treatment assignment.
Biomarker: A biological indicator (characteristic) of the status of an
organism with respect to a particular health condition or disease state.
Molecular-Based Biology
[Diagram: Genotype x Environment. Genotype (genetic makeup) determines gene function and metabolism; environment covers diet/lifestyle, susceptibility, and exposure; together they determine the target response and health status/disease risk. Biomarkers (genes, proteins, metabolites, physiological and biochemical indices), measured by sequencing, transcriptomics, proteomics, and metabolomics, link genotype and environment to the phenotype (health/disease, treatment response).]
Biomarkers in Drug Development
 Biomarker of exposure: a chemical, its metabolite, or the product
measured in the human body, indicates internal dose, or the amount
of chemical exposure that has resulted in absorption into the body.
 Biomarker of effect: an indicator of biochemical, physiologic, or
other alteration in an organism that can be recognized as associated
with an established or possible health impairment or disease.
 Biomarker of susceptibility: an indicator of natural characteristics
or acquired ability of an organism to respond to exposure to a specific
chemical or a specific adverse effect or disease.
• Personalized medicine: To identify individuals who are more
susceptible to a disease or responsive to a specific therapy, prior to
the treatment.
Personalized Medicine Biomarkers
 Prognostic Biomarker: A baseline biological measurement
(before treatment) to indicate overall disease outcome.
 The presence or absence of such markers is used to guide
patients for a certain or no treatment
 Marker does not relate to any particular treatment
 Predictive Biomarker: A baseline biological measurement
(before treatment) to identify which patient is likely or
unlikely to benefit from a particular treatment
 The presence or absence of such markers is used to select
targeted patients for the treatment
 Marker predicts patient’s response to a particular treatment
 Personalized medicine biomarkers should be identifiable
before the treatment
Statistics in Molecular-Based Biology
A two-way data matrix Dmn: rows are the variables x1, ..., xm; columns are the samples S1, ..., Sn; entries xij give the value of variable i in sample j; and each sample Sj also has a target value yj.
Molecular technology: microarray, SNP, sequence, protein, or
metabolite data, or a combination.
High dimensional data:
Each sample involves hundreds or
thousands of multivariate correlated
measurements.
Statistical and data mining:
 Find meaningful and useful
subgroups in the samples.
 Find the relationships among subgroups and between subgroups
in S and the variables in G
 Develop a prediction model to classify the subgroup memberships of
new samples by finding a set of biomarkers g
Subgroup Identification
Subgroup identification: Partition patients into subgroups defined
by sets of biomarkers, where each subgroup corresponds to an
optimal treatment.
S1: Biomarker discovery: Identifying biomarkers, based on the
underlying molecular basis of the disease (tumor’s characteristics
and patient’s genetic makeup) and/or response to a therapy, for
matching disease with the therapy.
S2: Patient classification: Developing a prediction model to
classify patients into different risk/therapy groups, based on the
biomarkers identified.
Subgroup analysis: Evaluating treatment effects in specific patient
subgroups defined by the biomarkers (or phenotypes).
Prediction and Classification
Classification: To develop a model f to predict the outcome of a
future sample, based on a set of features (biomarkers) X, from
the available data: f: X → y
[Diagram: biomarkers X → prediction function f → outcome measure y = f(X)]
 Outcome y: binary or multi-class, continuous, or time-to-event
response (+/-, tumor size, risk score, survival times, ...)
 Classification: y is a class variable
• Classification algorithms are not designed to analyze non-class variables.
Building a Classifier
0. Split data into two sets: training and test (or external)
1. Model Building (Training phase)
 Selecting predictor set (feature selection and feature extraction)
 Selecting the prediction model (algorithm)
 Specifying the parameters of the prediction model
 Fitting the prediction model to the training samples.
2. Performance assessment (Testing phase)
 Evaluate predictive sensitivity, specificity, and accuracy on the
test data (cross-validation) or external data
• Sensitivity: correct prediction of positive samples
• Specificity: correct prediction of negative samples
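To make the workflow concrete, here is a minimal sketch in Python, assuming scikit-learn; the random expression matrix, the choice of 50 genes, and the linear SVM are placeholders for illustration, not the method used in the talk.

    # Minimal sketch of the classifier-building workflow: split, feature
    # selection, model fitting, and performance assessment on held-out data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2000))          # 100 samples x 2,000 gene features (placeholder)
    y = rng.integers(0, 2, size=100)          # binary outcome (placeholder)

    # 0. Split data into training and test sets
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # 1. Model building: feature selection and model fitting on training data only
    selector = SelectKBest(f_classif, k=50).fit(X_tr, y_tr)
    clf = SVC(kernel="linear").fit(selector.transform(X_tr), y_tr)

    # 2. Performance assessment on the held-out test set
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(selector.transform(X_te))).ravel()
    print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))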
Dimensionality Reduction To Improve
Predictive Performance
HD data: Each sample is characterized by hundreds or
thousands of correlated variables (features).
 Many features are not related to the target variable; using
all features can degrade performance: over-fitting
 Some classification algorithms require that the number of
predictors be smaller than the number of samples.
Dimensionality Reduction
 Feature Extraction: Transform the original HD data set into a
lower-dimensional (e.g., 2- or 3-dimensional) space, such as PCA, MDS, and Isomap.
 Feature Selection: Find the most relevant features.
Feature Extraction: Principal Component Analysis: 24
arrays in 6 groups with more than 12,000 genes
[Figure: PCA score plot; the first three components are used to build a model to predict high versus low dose exposure.]
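A rough sketch of this kind of feature extraction, assuming scikit-learn and substituting random numbers for the 24 x 12,000 expression matrix:

    # Sketch of feature extraction with PCA followed by a simple classifier.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(24, 12000))      # 24 arrays x 12,000 genes (placeholder values)
    y = np.repeat([0, 1], 12)             # high- vs low-dose labels (placeholder)

    pca = PCA(n_components=3)             # project onto the first three principal components
    scores = pca.fit_transform(X)         # 24 x 3 matrix of component scores
    model = LogisticRegression().fit(scores, y)
    print(pca.explained_variance_ratio_, model.score(scores, y))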
Feature (Variable) Selection
Evaluate each feature individually or in combination with other
features and select the most relevant (minimum number of)
features to build the prediction model.
Advantages of Feature Selection over Feature Extraction:
 Speeding up learning process
 Improving model interpretability
 Better understanding of the important features and how they are
related to each other
Two general approaches:
 Filter: filters out irrelevant features
 Wrapper: selects the features by testing with the algorithm
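The two approaches can be sketched as follows, assuming scikit-learn: a univariate F-test plays the role of a filter, and recursive feature elimination (RFE) around a linear SVM plays the role of a wrapper. The data and the choice of 20 features are placeholders.

    # Sketch contrasting a filter with a wrapper for feature selection.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif, RFE
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 500))
    y = rng.integers(0, 2, size=60)

    # Filter: score each feature on its own with a univariate F-test
    filt = SelectKBest(f_classif, k=20).fit(X, y)
    # Wrapper: recursive feature elimination driven by the classifier itself
    wrap = RFE(SVC(kernel="linear"), n_features_to_select=20).fit(X, y)

    print("filter keeps:", np.flatnonzero(filt.get_support())[:5], "...")
    print("wrapper keeps:", np.flatnonzero(wrap.get_support())[:5], "...")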
Overfitting - Prediction Model
Example: Polynomial regression. A 10th-degree polynomial is fitted to noisy data:
y = w0 + w1 x + w2 x^2 + ... + w10 x^10
[Figure: a degree-10 polynomial fit (d = 10, r = 0.01) to a handful of noisy points over the range -10 to 10.]
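A small numerical illustration of the same idea (a toy setup of my own, not the data behind the slide): a degree-10 polynomial fitted to a few noisy points from a simple underlying trend typically reproduces the training points much better than it predicts new points, whereas a low-degree fit does not.

    # Sketch of over-fitting with polynomial regression.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1, 1, 12)
    y_train = 1.0 + 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)  # linear trend + noise
    x_test = np.linspace(-1, 1, 200)
    y_test = 1.0 + 2.0 * x_test                                               # noise-free truth

    for degree in (1, 10):
        w = np.polyfit(x_train, y_train, degree)               # estimate w0 ... w_degree
        train_mse = np.mean((np.polyval(w, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(w, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")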
Building a Classifier and Cross Validation
[Diagram: 10-fold cross-validation. The samples are split into 10 parts; in each round, 9 parts (90%) form the training set and the remaining part (10%) forms the test set. Feature selection and model building (e.g., RF, SVM, DLDA) are carried out on the training set, accuracy is evaluated on the test set, and the accuracies from the 10 rounds are averaged to give the cross-validated accuracy.]
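A sketch of the cross-validation scheme in the diagram, assuming scikit-learn; putting feature selection and the classifier into one pipeline ensures the genes are re-selected within each training fold, as the diagram requires. Data and parameter choices are placeholders.

    # Sketch of cross-validated accuracy with feature selection inside each fold.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(62, 2000))          # e.g., 62 samples x 2,000 genes (placeholder)
    y = rng.integers(0, 2, size=62)

    # One pipeline ensures the 50 genes are re-selected from each training fold only.
    pipe = make_pipeline(SelectKBest(f_classif, k=50), SVC(kernel="linear"))
    scores = cross_val_score(pipe, X, y, cv=10)
    print("average 10-fold cross-validated accuracy:", scores.mean())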
Colon data (22 normal and 40 tumor samples): Cross-validation
accuracy without feature selection (2,000 genes) and with
selection of 50 genes for four classifiers*
Genes   Cross-validation   SVM-RFE (1)   RF-MDA (1)   DLDA-BW (2)   NB-t (2)
2000    LOU                85.5          82.3         64.5          61.3
2000    10-fold            83.7          81.5         66.3          64.6
2000    2-fold             81.2          74.1         65.3          63.7
50      LOU                83.9          85.5         88.7          75.8
50      10-fold            83.8          83.9         86.2          74.4
50      2-fold             80.9          78.2         78.1          73.6
(1) Wrapper approach; (2) filter approach
Baek et al. (2009), Briefings in Bioinformatics
Performance of Prediction Model
Performance: Ability to predict the outcome of interest
Consistency/Repeatability/Reproducibility: An agreement of the
predictive scores across settings/conditions.
 Predictability: Ability to classify patients who are not included in
the model development but come from the same study protocol (a similar
population) into the proper subgroups correctly.
 Generalizability: Ability to reproduce predictive performance when
applied to data generated independently from different batches
(different times or locations).
• Signature transferability and model transferability
Major Challenge - Reproducibility
Reproducibility among 9 published gene signature sets for
lung cancer: little overlap among the different sets:
 Batch effect or cohort difference
Lau et al., J Clin Oncol. 2007
Imbalanced Class Size Data
Class Imbalanced data: the class sizes differ substantially.
Ex: clinical diagnostic test of rare diseases, fraud detection.
(Much more negative data than positive data)
If the P:N ratio is 1:9, a procedure that predicts every sample as
negative will have 90% accuracy, but 0% sensitivity.
A procedure with a high specificity but very low sensitivity, or
vice versa, is not useful in some applications.
 Clinical diagnostic tests: high sensitivity.
 Epidemiology screening tests: high specificity.
Problems in Class-imbalance Data
Fundamental issues:
1. Imbalance ratio
2. Domain complexity
3. Small disjuncts
4. Feature selection
Standard classifiers can result in low accuracy on the minority class prediction together with high accuracy on the majority class.
[Figure: a separating hyperplane and a small disjunct of minority-class samples.]
Effect of Lack of Minority Data on Classification Performance
[Figure: classification boundaries learned with a lack of minority-class data versus with more data.]
A Data-based Ensemble Classifier for
Imbalanced Classification
[Diagram: the majority class is re-sampled into subsets; each subset is paired with the full minority class to train a binary classifier, yielding an ensemble of 2k+1 classifiers (Classifier 1, Classifier 2, ..., Classifier 2k+1).]
Class Imbalanced Classification using SVM
Standard and an Ensemble Classifier
Data               Standard                     Ensemble
                   SN      SP      ACC          SN      SP      ACC
Liver tumor (1)    8.5     97.4    72.3         62.8    59.7    60.6
Imprinted (2)      69.8    91.6    84.4         78.6    79.2    79.0
Estrogen (3)       87.0    73.4    81.1         84.9    75.8    80.9
Breast Cancer (4)  91.0    74.1    86.1         88.1    81.0    85.2
1. 282 positives and 714 negatives; 2. 43 positives and 88 negatives;
3. 131 positives and 101 negatives; 4. 65 positives and 34 negatives
Subgroup Selection in Clinical Trials
Molecularly targeted treatments aim to treat underlying disease
mechanism and are expected to have differential treatment effects
for biologically heterogeneous disease.
Subgroup Selection: To find a subgroup of g+ patients who are
more responsive to a given treatment than its complementary g-.
Clinical trials for targeted drugs can be designed as drug and
diagnostic co-development, combining an efficacy test for the treatment
effect with a diagnostic test for patient identification.
Adaptive design: a prospectively planned opportunity to modify specified
aspects of the design based on analysis of data in the study
FDA Guidance: Drug and Diagnostic Co-Development (2005, 2011),
Adaptive design (2010); Enrichment (2012)
Conventional Design and a Biomarker Design
[Diagram: In the conventional design, a mixture of marker-positive and marker-negative patients is randomized (R) to standard (Std) or treatment (Trt). In the biomarker-by-treatment interaction design, patients are first classified by the biomarker diagnosis (+, -), and the marker-positive and marker-negative patients are then randomized separately to Std or Trt.]
Subgroup Analysis in Adaptive Design
Subgroup analysis is an evaluation of treatment effects in specific
subgroups of patients defined by baseline characteristics.
Two primary hypotheses:
 H1: No treatment effect in the overall population
 H2: No treatment effect in the g+ subpopulation
Statistical tests (α1 + α2 = α, the overall type I error):
1. Control vs. treatment in all patients at level α1
2. Control vs. treatment in the g+ patients at level α2
Adaptive design: a study with a prospectively planned opportunity to
modify specified aspects of the design/hypotheses based on analysis of
data in the study
Adaptive Design with Predefined Subgroups
[Diagram: Patients are randomized (R) to standard (Std) or treatment (Trt). T and S are compared at the α1 significance level: if significant (S), efficacy is claimed for all patients and the procedure stops; if not significant (NS), T and S are compared at α2 in the g+ patients. If that test is significant, efficacy is claimed for the g+ patients; otherwise the trial fails.]
Biomarker Adaptive Design with
Subgroup Selection
Aim: To find and show a subgroup of patients from a mixture
of patients for whom a new targeted treatment is effective.
Three components (steps):
1. Biomarker identification: To identify predictive “biomarkers” that
show differentially therapeutic responses.
2. Subgroup classification: To develop prediction models, based on the
predictive biomarkers identified, to classify patients into g+ targeted
(responder) and g- non-targeted (non-responder) subgroups.
3. Performance assessment: 1) Whether the prediction model can
classify patients accurately, and 2) Whether the treatment is beneficial
to the g+ targeted subgroup.
Biomarker Adaptive Design
[Diagram: 1. Biomarker identification; 2. Subgroup identification into g+ and g- subgroups. H1 compares T and S at the α1 significance level (S: efficacy for all patients; NS: stop). H2 compares T and S at α2 for the g+ patients (S: efficacy for the g+ patients), and H3 compares T and S at α2 for the g- patients (S: efficacy for the g- patients).]
Example: Lung Cancer Data (GSE14814)*
Lung cancer dataset: 62 control; 71 treatment
Cox regression identified 18 predictive biomarkers
[Kaplan-Meier survival plot. Median survival time (number of patients):
        S            T
All     5.3  (62)    6.7  (71)
g+      3.24 (51)    6.49 (53)
g-      5.14 (11)    4.74 (18)
Log-rank tests: all patients p = 0.376; g+ patients p = 0.107; g- patients p = 0.298.]
* Zhu et al., J Clinical Oncology (2010)
Statistical Issues in the Development of
Biomarker Adaptive Designs
Marker identification:
• High dimensionality: feature selection strategy
• Level of significance for the interaction test
• Needed sample size to have sufficient number of identifications
Patient classification:
• Classification algorithms and Imbalanced subgroup size
• Threshold cutoff for continuous response data
• Validation of classifier: predictability and generalizability
Subgroup analysis:
• Type I error allocation between α1 and α2
• Test strategy for g+ and g- in the control and treatment arms
• Needed sample size for all patients and for g+ patients
Others: data quality, consistency, ...
Data Mining
Data mining (analysis of large multivariate data): computational
and statistical techniques for systematically analyzing large data to
discover hidden patterns/structures or unexpected occurrences.
Supervised analysis, D = {X; y}
 To find a model between X and y, f: X → y, where X is the matrix of
explanatory multivariate variables and y is the target (outcome) variable
 Common methods: regression, classification, prediction, ...
Unsupervised analysis, D = {X}
 To find patterns and structures in X and explore relationships
between X and the samples.
 Common methods: cluster analysis, bicluster analysis, topic
modelling, ..
Cluster Analysis
Cluster analysis: Identify hidden patterns by assigning samples
into meaningful/useful clusters and see how they are related.
Hierarchical clustering tree
Non-hierarchical clustering: a) k-means;
b) Self-organizing map; c) Multidimensional
scaling; d) Principal component analysis
Bicluster Analysis
Bicluster analysis: To identify a submatrix in a two-way data
matrix in which rows and columns are correlated- a subset
of variables X correlated with a subset of samples S.
A bicluster defines a local association
between variables and samples.
Each bicluster Ci = {Gi, Si} represents
an association between a subset of
variables Gi and a subset of the sample
population Si.
[Figure: a variables-by-samples heatmap containing four biclusters: red, green, pink, and blue.]
Cluster and Bicluster Analysis
[Figure: (a) bicluster analysis of a data matrix, with blocks labeled 1-5.]
 Cluster analysis partitions samples
into disjoint clusters.
 In cluster analysis, each sample
is assigned to one cluster.
 Cluster analysis uses all variables to
cluster all samples into subgroups.
 In bicluster analysis, a sample
can be part of more than one
bicluster or of no bicluster.
 Cluster analysis provides global
association analysis between
samples and variables.
 Bicluster analysis provides
local association between
samples and variables.
Biclustering Methods
 Model-based Methods: δ-biclustering (Cheng and Church, 2000); Plaid
(Lazzeroni and Owen, 2002); FABIA (Hochreiter et al., 2010)
 Matrix Factorization Methods: Singular value decomposition (Kluger et al.,
2003); non-negative matrix factorization (Carmona-Saez et al., 2006)
 Binary data (constant bicluster): Bimax (Prelic et al., 2006; Zhang et al.,
2010). Software: BicAT (Barkow et al., Bioinformatics, 2006)
Types of "coherence" biclusters (Madeira and Oliveira, 2004):
Constant: Xij = μ0; constant rows: Xij = μ0 + ri; constant columns: Xij = μ0 + cj;
additive coherence: Xij = μ0 + ri + cj; multiplicative coherence: Xij = μ0 · ri · cj
References: Tanay et al., Handbook of Comp. Mol. Bio. (2005); Kriegel et al., ACM TKDD (2009)
SVD Biclustering Algorithm
Singular value decomposition (SVD): X = UDV^T = Σi λi ui vi^T = Σi Xi
Identification of δ-biclusters:
S1: Order the rows and columns of the data matrix according to the
i-th column vectors ui and vi of U and V.
S2a: Identify candidate biclusters from 2x2 matrices at the
upper-left and lower-right corners.
S2b: For each candidate bicluster, identify a δ-bicluster such that
the proportion of non-signal is less than δ.
S3: Refinement: merge the biclusters with the same rows or
the same columns and delete the subsets.
Chen et al. (2013), PLoS One
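A sketch of the ordering step S1 using plain NumPy; the planted block and the use of only the leading singular vectors are simplifications of the full algorithm.

    # Sketch: decompose X and reorder rows/columns by the leading singular
    # vectors so that a correlated submatrix (candidate bicluster) moves
    # toward a corner of the reordered matrix. Placeholder data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    X[:10, :8] += 3.0                        # plant one 10 x 8 bicluster

    U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) V^T
    row_order = np.argsort(U[:, 0])          # order rows by the first left singular vector
    col_order = np.argsort(Vt[0, :])         # order columns by the first right singular vector

    # The planted rows/columns gather at one end of the ordering (which end
    # depends on the arbitrary sign of the singular vectors).
    print(np.sort(row_order[:10]), np.sort(row_order[-10:]))
    print(np.sort(col_order[:8]), np.sort(col_order[-8:]))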
A Biclustering Analysis of a Simulated Dataset
[Figure: model (input) and simulated (output) matrices.]
Matrix: 100×20. Four biclusters: 10×8, 15×9, 20×5, and 10×3
FAERS Data - Cardiovascular Drugs
 193 drugs and 8,453 adverse events
(1,631,429 possible combinations)
 150,261 drug-event combinations
reported
Current focus is on detection involving
a single drug and a single AE only:
 Identify the AEs with high reporting rates
compared to other AEs associated with
a particular drug
 Identify drugs associated with a high
reporting rate of a particular AE
compared to the other drugs
[Figure: drugs-by-AEs reporting matrix.]
Composite Prediction Model Via Biclusters
[Diagram: a bicluster Ci = {Gi, Si} within the data matrix, showing its sample subset Si, the complementary samples Si^c, and its variable subset Gi.]
Subgroup-specific classifiers: For each bicluster Ci = {Gi, Si}, a
subgroup-specific binary classifier mi(s | Gi) = I{s ∈ Si} is built to
predict whether or not a sample s is in the subgroup Si.
A composite classification model consists of a set of k binary
classifiers M = {m1, m2, ..., mk} to divide samples into several
subgroups, based on the k biclusters Ci.
For K = 2, there are 4 possible classification subgroups:
{0,0}, {1,0}, {0,1}, {1,1}.
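A sketch of the composite-model idea, assuming scikit-learn; the biclusters, the planted signal, and the logistic-regression membership classifiers are placeholders for illustration.

    # Sketch of a composite prediction model: k binary membership classifiers
    # (one per bicluster) whose combined 0/1 predictions define each sample's
    # subgroup pattern.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(111, 100))                       # samples x genes (placeholder)
    biclusters = [                                        # Ci = {Gi, Si}: gene and sample index sets
        (np.arange(0, 20), np.arange(0, 40)),
        (np.arange(20, 45), np.arange(40, 80)),
        (np.arange(45, 70), np.arange(80, 111)),
    ]
    for genes, samples in biclusters:                     # plant signal so each subgroup is learnable
        X[np.ix_(samples, genes)] += 2.0

    models = []
    for genes, samples in biclusters:
        member = np.isin(np.arange(X.shape[0]), samples).astype(int)   # I{s in Si}
        models.append((genes, LogisticRegression(max_iter=1000).fit(X[:, genes], member)))

    # Subgroup pattern of each sample, e.g. "100", "010", "001", "000", ...
    patterns = ["".join(str(m.predict(X[i, g].reshape(1, -1))[0]) for g, m in models)
                for i in range(X.shape[0])]
    print(sorted(set(patterns)))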
Lung Cancer data: 111 samples and 100 top genes,
composite model identified 3 biclusters
Subgroup classification of 111 lung cancer samples with
53 adenocarcinoma and 58 squamous cell carcinoma
Method            Subgroup pattern   Adenocarcinoma   Squamous cell carcinoma
Composite model   000                39               6
                  010                2                0
                  100                12               52
2-means           0                  42               6
                  1                  11               52
3-means           0                  32               3
                  1                  14               3
                  2                  7                52
4-means           0                  33               4
                  1                  9                2
                  2                  6                22
                  3                  5                30
Total                                53               58
Pathogens Salmonella Serotypes Classification
The five serotypes were randomly selected:
 4,5,12:i-: training (1113), test (1156)
 Hadar: training (982), test (992)
 Oranienburg: training (997), test (930)
 Thompson: training (990), test (1047)
 Typhimurium: training (972), test (930)
Training: Bicluster analysis identified 10 biclusters and built 10
binary classifiers from the 5054 isolates:
 16 classification patterns
 13 subpopulations (n ≥ 5) were identified
Test: (5,055 + 1,000 decoy) isolates were classified using the 10 binary classifiers
 24 classification patterns
 14 subpopulations (n ≥ 5) were identified
PFGE Serotype Prediction: Test Samples
14 subpopulations (n ≥ 5)

Pattern        4,5,12:i-   Hadar     Oranienburg   Thompson   Typhi     Decoy      Total
               (n=1156)    (n=992)   (n=930)       (n=1047)   (n=930)   (n=1000)   (n=6055)
0000000000     38          73        122           56         133       747        1169
1000000000     699         0         0             0          1         176        876
1000000100     204         0         0             0          0         3          207
1000001000     204         0         0             0          0         11         215
0000001000     5           0         0             0          0         2          7
0100000000     0           0         0             0          795       3          798
0010000000     0           0         0             987        0         42         1029
0001000000     0           0         24            0          0         0          24
0000100000     0           0         216           0          0         1          217
0001100000     0           0         544           0          0         0          544
0001110000     0           0         8             0          0         0          8
0000110000     0           0         6             0          0         0          6
0000010000     0           911       9             1          0         10         931
1000010000     0           8         0             0          0         1          9
Hierarchical Cluster Analysis of the Predicted 14 Subpopulations
[Dendrogram: the 14 predicted subpopulations grouped into branches A-F, corresponding to the 4,5,12:i-, Hadar, Oranienburg, Thompson, Typhi, and other subpopulations.]
FDA’s Adverse Event Reporting
System (AERS) Database
Post-marketing safety surveillance database (over 20 million reports):
 AERS: 1968 - 2012
 FAERS: 2013 - present
 Over 5,000 drugs and 16,000 adverse events (AEs)
Collected from physicians, pharmacists, nurses, patients, others
Data Mining for signal detection:
 Identify the AEs with a high reporting rate of a particular drug
compared to other AEs reported for that drug
 Identify the drugs with a high reporting rate of a particular AE
compared to the other drugs reported for that AE
CBER (VAERS); CFSAN (CAERS); CDRH (MAUDE)
AE Signal Detection Methods
A high occurrence of a drug i and event j is summarized in a 2 x 2 table for every drug-event combination:

              With Event j   W/O Event j   Total
With Drug i   fij = a        b             a+b
W/O Drug i    c              d             c+d
Total         a+c            b+d           a+b+c+d
Compute an expected or baseline count e from (a, b, c, d), based
on the assumption of no association between drug and event.
 Proportional reporting ratio (PRR), reporting odds ratio (ROR), ...
 EBGM (Empirical Bayes Geometric Mean), ...
 Likelihood ratio test (Huang et al., JASA)
(Signal threshold: nij > 3, n/e > 2.0, and χ² > 4.0)
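For concreteness, here is a sketch of the standard disproportionality quantities computed from such a 2 x 2 table (PRR, ROR, the expected count e, and the chi-square statistic); EBGM requires the full empirical Bayes model and is omitted. The counts in the example call are invented.

    # Sketch of simple disproportionality measures from the 2x2 table above.
    def signal_stats(a, b, c, d):
        n = a + b + c + d
        e = (a + b) * (a + c) / n                 # expected count under no association
        prr = (a / (a + b)) / (c / (c + d))       # proportional reporting ratio
        ror = (a * d) / (b * c)                   # reporting odds ratio
        chi2 = sum((obs - exp) ** 2 / exp for obs, exp in [
            (a, e),
            (b, (a + b) * (b + d) / n),
            (c, (c + d) * (a + c) / n),
            (d, (c + d) * (b + d) / n),
        ])
        return e, prr, ror, chi2

    e, prr, ror, chi2 = signal_stats(a=20, b=980, c=100, d=98900)
    print(f"e={e:.1f}, n/e={20/e:.1f}, PRR={prr:.1f}, ROR={ror:.1f}, chi2={chi2:.1f}")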
Beyond Single Drug-Event Combination
Bicluster analysis provides a structured approach to identifying
drug families with their associated AE groups.
 Identify drug families that share a common profile of AEs, with
which potential AEs are analyzed:
• help validation and provide evidence for a single drug-AE analysis
• provide insight into the etiology of AEs
• assist in identifying previously unrecognized AEs
 Predict potential AEs for new drugs with similar active ingredients
or molecular structure on the basis of the known drug classes
• Perez et al., Clin. Phar. Therap. (2011);
• Huang et al., JBS (2013); Chen et al., JBS (2013)
Example: Cardiovascular Drugs
Data*: 193 cardiovascular drugs and 8,453 adverse events (1,631,429
combinations), with 150,261 drug-event combinations reported
SVD δ-bicluster analysis (cutoff EBGM = 2):
 23 biclusters are identified
 61 drugs and 118 adverse events
[Figure: heatmap of one identified bicluster containing 7 drugs (A-G) and 5 AEs.]
* Dr. Ana Szarfman
Individual drug-event combinations in the Bicluster
with 7 drugs and 5 AEs.
Adverse Events
Drug   Caesarean section   Growth retardation   Premature baby    Respiratory distress   Exposure pregnancy
A      EB=2.81 RR=2.58     EB=3.56 RR=2.23      EB=3.09 RR=1.91   EB=4.07 RR=2.71        EB=2.46 RR=3.62
B      EB=21.4 RR=11.8     EB=46.8 RR=14.6      EB=13.3 RR=12.5   EB=28.6 RR=11.8        EB=9.19 RR=8.56
C      EB=24.8 RR=20.4     EB=8.94 RR=8.34      EB=16.3 RR=13.3   EB=39.4 RR=26.6        EB=8.73 RR=14.1
D      EB=19.5 RR=10.9     EB=14.9 RR=8.55      EB=10.8 RR=10.7   EB=8.77 RR=10.4        EB=5.29 RR=4.76
E      EB=7.31 RR=5.04     EB=3.50 RR=2.66      EB=5.66 RR=5.27   EB=3.72 RR=3.56        EB=5.51 RR=4.61
F      EB=2.92 RR=2.49     EB=3.49 RR=2.52      EB=2.85 RR=2.60   EB=1.22 RR=0.94        EB=3.73 RR=5.63
G      EB=2.06 RR=0.95     EB=9.37 RR=3.79      EB=2.16 RR=1.55   EB=4.38 RR=2.92        EB=1.46 RR=1.13
Topic Modeling
Topic modeling is a statistical method for text mining of a large
collection of documents by analyzing the words in the texts to
discover the topics of each document.
Corpus: a collection of n documents,
D = {s1, s2, ..., sn}
Document: a document si contains m
words, W = {w1, w2, ..., wm}
Word: an item from a vocabulary
indexed by {1, ..., V}
Bag-of-words: disregarding word
order and grammar but keeping
multiplicity
The corpus is represented by a word-by-document frequency matrix with entries fvi, the frequency of word wv in document si.
Latent Dirichlet Allocation (LDA)*
Topic modelling: A document is a mixture of k latent topics and
each topic is expressed by a distribution of words.
Each document is a mixture over k latent topics with topic
proportions p(z | θ), z ~ Multinomial(θ).
Conditional on a topic zi, each word w is drawn from a
multinomial p(w | ϕ, zi), w ~ Multinomial(ϕ).
θ ~ Dirichlet(α) and ϕ ~ Dirichlet(β).
Likelihood: p(w | α, β) = ∫ p(θ | α) [ Πn Σ_{zn} p(zn | θ) p(wn | zn, β) ] dθ
Blei et al. (J Mach Learn Res 2003)
[Plate diagram: word, document, and topic layers of the LDA graphical model.]
Topic Modeling Inferences
Sample-topic matrix
P(tj | si): the probability that the document si is related to the
topic tj; for each document si, ∑j p(tj | si) = 1
Topic-word matrix
P(wv | tj): the probability that the topic tj contains the word wv;
conditional on tj, ∑v p(wv | tj) = 1 for each topic tj
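A sketch of these two matrices on a toy corpus, assuming scikit-learn's LatentDirichletAllocation; the documents and the choice of two topics are placeholders.

    # Sketch of topic modeling with LDA: build a bag-of-words matrix, fit k
    # topics, and recover the document-topic and topic-word matrices.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [                                   # toy corpus (placeholder documents)
        "drug event report adverse reaction",
        "tumor gene expression biomarker",
        "adverse drug reaction haemorrhage",
        "gene biomarker survival prediction",
    ]
    counts = CountVectorizer().fit(docs)
    X = counts.transform(docs)                 # documents x words frequency matrix

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    doc_topic = lda.transform(X)               # P(topic | document), rows sum to 1
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # P(word | topic)
    print(doc_topic.round(2))
    print(topic_word.round(2))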
Individual drug-event combinations in the LDA
analysis with 5 drugs and 8 AEs.
Drug   AE1              AE2             AE3             AE4             AE5             AE6             AE7             AE8
A      EB=18.6 f=1392   EB=13.5 f=510   EB=4.01 f=191   EB=3.46 f=358   EB=29.6 f=243   EB=3.66 f=157   EB=16.7 f=204   EB=6.42 f=93
B      EB=10.4 f=335    EB=9.25 f=157   EB=5.04 f=114   EB=4.99 f=261   EB=34.0 f=128   EB=3.38 f=87    EB=14.7 f=105   EB=8.07 f=49
C      EB=1.66 f=32     EB=5.90 f=60    EB=4.25 f=58    EB=1.99 f=76    EB=71.1 f=153   EB=3.11 f=49    EB=9.18 f=44    EB=28.5 f=96
D      EB=3.66 f=34     EB=5.45 f=30    EB=3.00 f=21    EB=0.98 f=15    EB=61.5 f=70    EB=3.67 f=28    EB=13.2 f=31    EB=18.4 f=30
E      EB=15.0 f=323    EB=12.0 f=128   EB=3.38 f=52    EB=2.95 f=106   EB=15.2 f=40    EB=5.80 f=101   EB=18.2 f=92    EB=5.61 f=27
AE1: Thrombocy-topenia; AE2: Haemorrhage; AE3: Gastrointestinal haemorrhage;
AE4: Myocardial infarction; AE5: Haemorrhage intracranial;
AE6: Haemoglobin decreased; AE7: Haematoma; AE8: Haemorrhagic stroke
References
 Lin, W-J and Chen J.J. (2011). An approach to identify preclinical biomarkers of
susceptibility to drug-induced toxicity. Pharmacogenomics 6, 493-501.
 Lin, W-J and Chen J.J. (2012). Biomarker classifiers for identifying susceptible
subpopulations for treatment decisions. Pharmacogenomics 13, 147-157.
 Lin, W-J and Chen J.J. (2012). Class-imbalanced classifiers for high-dimensional
data. Briefings in Bioinformatics 13, doi:10.1093/bib/bbs006.
 Chen H-C, Kodell RL, Cheng KF, and Chen, JJ (2012). Assessment of performance of
survival prediction models for cancer prognosis. BMC Med. Res Methodol 12:102.
 Chen H-C and Chen, J.J. (2013). Assessment of reproducibility of cancer survival risk
predictions across medical centers. BMC Med. Res Methodol 13:25.
 Chen, H-C, Tsong, Y., and Chen, J.J. (2013). Data mining for signal detection of
adverse event safety data. J Biopharm Stat 23, 146-160.
 Chen, H-C, Zou, W, Tien, Y-J, and Chen J.J. (2013). Identification of bicluster regions
in a binary matrix and its applications. PLoS ONE 8(8): e71680.
 Chen J.J., Lin W-J, and Chen, H-C. (2013). Pharmacogenomic biomarkers for
personalized medicine. Pharmacogenomics 14, 969-980.
Acknowledgements
Biomarkers, Pharmacogenomics, Classification:
Prof. Wei-Jiun Lin (林維鈞)
Cancer biomarkers (Bioinformatics):
Prof. Tzu-Pin Lu (盧子彬)
Bicluster analysis, Subgroup Identification:
Dr. Hung-Chia Chen (陳弘家)
Subgroup Identification and Topic Modelling:
Dr. Yu-Juan Chen and Dr. Weizhong Zhao (NCTR)
James J. Chen, Ph.D.
Biostatistics Branch
National Center for Toxicological Research
US Food and Drug Administration
E-mail: [email protected]
Thank You
 Center for Biologics Evaluation and Research (CBER) –
vaccines, blood, biologics (genes, tissues …)
 Center for Drug Evaluation and Research (CDER) – drugs
 Center for Devices and Radiological Health (CDRH) –
medical devices, diagnostics ..
 Center for Food Safety and Applied Nutrition (CFSAN) –
foods (food safety)
 Center for Veterinary Medicine (CVM) – animal drugs
 Center for Tobacco Products (CTP)
 National Center for Toxicological Research (NCTR)
A data analyst is someone who
wants to get exactly the right
answer, even if it’s the answer to
the wrong question.
A statistician is someone who is
willing to settle for an approximate
answer, as long as it’s the answer
to the right question.
A bioinformaticist is someone who
is willing to settle for answers of
unknown accuracy, to questions
that have not been clearly
articulated, as long as the results
can be graphed in color.
Subgroup Identification in Clinical Oncology
 Outcome y: Survival time, tumor size, treatment response, ..
 Predictors X: 1) Clinical covariates - age, gender, TNM stage, ..
2) genomic data (collected before treatment)
 Treatment: chemotherapy, new drug
ID          site   gender   Ethnicity   ajcc   os      dfs             age   os.time
GSM437270   VB     female   Caucasian   2      alive   no recurrence   70    54.04932
GSM437271   VB     female   black       3      alive   no recurrence   48    63.91233
GSM437272   VB     male     Caucasian   2      death   no recurrence   72    46.29041
...         ...    ...      ...         ...    ...     ...             ...   ...
Aim: To develop a model for predicting a patient's disease outcome or
treatment response y and to classify patients into disjoint subgroups for
treatment recommendation.
Prognostic model: high-risk and low-risk; predictive model: responder and non-responder
Building a Classifier for Survival Outcome*
I. Model building (Training data)
1. Identify biomarkers x1, x2, ..., xk (prognostic or predictive)
2. Fit a multivariate regression model based on the variables identified
3. Compute the predictive score Σi wi xij for each patient j, based on the
fitted regression coefficients
4. Specify a cutoff (e.g., the median of the predictive scores)
II. Model evaluation (Test data)
5. Compute the predictive score Σi wi xij for each patient
6. Use the cutoff to classify patients into subgroups
7. Compare the survival times between the subgroups
* Binary classifiers can only be used to classify binary outcomes
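A sketch of steps 1-7 for a survival outcome, assuming the lifelines package; the simulated data, the three biomarkers, and the median cutoff are placeholders.

    # Sketch: fit a Cox model on training data, compute risk scores, split at
    # the training median, and compare subgroups in the test set by log-rank test.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.statistics import logrank_test

    rng = np.random.default_rng(0)
    covs = ["g1", "g2", "g3"]                                 # the identified biomarkers
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=covs)
    df["time"] = rng.exponential(scale=np.exp(-df["g1"]))     # survival times; g1 is prognostic
    df["event"] = 1                                           # 1 = event observed (no censoring here)

    train, test = df.iloc[:150], df.iloc[150:]
    cph = CoxPHFitter().fit(train, duration_col="time", event_col="event")      # steps 1-2

    train_score = np.asarray(cph.predict_partial_hazard(train[covs])).ravel()   # step 3
    cutoff = np.median(train_score)                                             # step 4
    test_score = np.asarray(cph.predict_partial_hazard(test[covs])).ravel()     # step 5
    high = test_score > cutoff                                                  # step 6

    res = logrank_test(test["time"][high], test["time"][~high],                 # step 7
                       event_observed_A=test["event"][high],
                       event_observed_B=test["event"][~high])
    print("log-rank p-value:", res.p_value)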
Biomarker Identification
Prognostic and predictive biomarkers can be identified by fitting
a regression model to the control and treatment data, respectively:*
h(y) = β0 + β1i xiu, where xiu is the value of biomarker xi for patient u
When β1i is significant, xi identifies subgroups in the sampled
patients, such as marker-positive (g+) and marker-negative (g-).
 The analysis is performed for each gene at a chosen significance level (e.g., controlling the FDR)
 Each significant gene defines a subgroup
 The set of significant genes U forms a genomic signature
 U contains true-positive and false-positive biomarkers
 U is used to develop a prognostic/predictive classifier
* Many other models can be used to identify U
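A sketch of per-gene identification of predictive biomarkers, assuming statsmodels; it uses a gene-by-treatment interaction test with Benjamini-Hochberg FDR control, which is one common variant of the approach described above and not necessarily the exact model used in the talk.

    # Sketch: for each gene, fit y ~ gene * treatment and keep genes whose
    # interaction term passes an FDR threshold; these form the signature U.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n, m = 120, 200
    genes = rng.normal(size=(n, m))
    trt = rng.integers(0, 2, size=n)
    y = 0.8 * genes[:, 0] * trt + rng.normal(size=n)       # gene 0 is truly predictive

    pvals = []
    for i in range(m):
        d = pd.DataFrame({"y": y, "x": genes[:, i], "trt": trt})
        fit = smf.ols("y ~ x * trt", data=d).fit()
        pvals.append(fit.pvalues["x:trt"])                 # gene-by-treatment interaction

    reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    U = np.flatnonzero(reject)                             # signature: significant genes
    print("selected genes:", U)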
Example: Breast Cancer Data
Breast Cancer dataset (van 't Veer et al., 2002):
 Training Data (78 patients): 34 poor and 44 good prognoses
 Test Data (19 patients): 12 poor and 7 good prognoses
Goal: To develop a prognostic prediction model to assign
patients into a risk category based on the AJCC TNM staging
system and clinical variables, or gene expression data.
A: Cox regression – clinical variables
B: Cox regression – 6 most significant genes
C: Ensemble classification tree
Kaplan–Meier survival curves & p-values
of the log-rank test
A (clinical variables): p = 0.0076; low risk N = 11, high risk N = 8
B (6 genes): p = 0.0288; low risk N = 8, high risk N = 11
C (ensemble tree): p = 0.0234; low risk N = 10, high risk N = 9

Subjective cutoff specification (Models A and B):
 The number of patients in the two groups can be very different.
 In setting the cutoff, other percentiles may be used, which result in different subgroups.
 An alternative cutoff can be based on the survival probability at a pre-specified year.
Cross Validation Test
The 78 training and 19 test patients were pooled:
 The 97 total patients are randomly split into a training set
of 78 patients and a test set of 19 patients.
 The prediction model is developed from the training set
and applied to the test set.
 The procedure is repeated 1,000 times; the proportion of repetitions
with p-value ≤ 0.05 is the measure of performance of the
prediction model.
Cross-validation test results: A: 0.290; B: 0.261; C: 0.355
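A sketch of this repeated-split evaluation, again assuming lifelines and simulated placeholder data; the number of repetitions is reduced for speed.

    # Sketch: repeatedly split 97 patients into 78 training / 19 test, build
    # the risk-score model, and record how often the test log-rank p <= 0.05.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.statistics import logrank_test

    rng = np.random.default_rng(0)
    covs = [f"g{i}" for i in range(6)]
    df = pd.DataFrame(rng.normal(size=(97, 6)), columns=covs)
    df["time"] = rng.exponential(scale=np.exp(-df["g0"]))
    df["event"] = 1                                      # no censoring, for simplicity

    hits = 0
    n_rep = 200                                          # 1,000 on the slide; fewer here for speed
    for _ in range(n_rep):
        idx = rng.permutation(97)
        train, test = df.iloc[idx[:78]], df.iloc[idx[78:]]
        cph = CoxPHFitter().fit(train, duration_col="time", event_col="event")
        cut = np.median(np.asarray(cph.predict_partial_hazard(train[covs])).ravel())
        high = np.asarray(cph.predict_partial_hazard(test[covs])).ravel() > cut
        res = logrank_test(test["time"][high], test["time"][~high],
                           event_observed_A=test["event"][high],
                           event_observed_B=test["event"][~high])
        hits += res.p_value <= 0.05
    print("proportion of splits with p <= 0.05:", hits / n_rep)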