Lecture slides

Introduction to Classification Issues in Microarray Data
Analysis
Jane Fridlyand
Jean Yee Hwa Yang
University of California, San Francisco
Elsinore, Denmark
May 17-21, 2004
1
Microarray Workshop
Brief Overview of the Life-Cycle
2
Life Cycle
Biological question → Experimental design → Microarray experiment → Image analysis → Quality measurement (Pass / Failed) → Pre-processing → Analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation
3
• The steps outlined in the “Life Cycle” need to be
carefully thought through and re-adjusted for each data
type/platform combination. Experimental design will
impact what questions should be asked and may be
answered once the data are collected.
• To call in the statistician after the experiment is done
may be no more than asking him to perform a
postmortem examination: he may be able to say
what the experiment died of.
Sir RA Fisher
4
Different Technologies
• SAGE
• Nylon membrane
• Illumina Bead Array
• GeneChip Affymetrix
• cDNA microarray
• Agilent: Long oligo Ink Jet
• CGH
5
Some statistical issues
• Designing gene expression experiments.
• Acquiring the raw data: image analysis.
• Assessing the quality of the data.
• Summarizing and removing artifacts from the data.
• Interpretation and analysis of the data:
- Discovering which genes are differentially expressed
- Discovering which genes exhibit interesting expression patterns
- Detection of gene regulatory mechanisms.
- Classification of samples
- And many others…
For a review see Smyth, Yang and Speed, “Statistical issues in microarray
data analysis”, In: Functional Genomics: Methods and Protocols, Methods in
Molecular Biology, Humana Press, March 2003
Lots of other bioinformatics issues …
6
Image analysis → Quality assessment and pre-processing → Analysis
Short-oligonucleotide chip data (CEL, CDF files):
• quality assessment,
• background correction,
• probe-level normalization,
• probe set summary.
Two-color spotted array data (gpr, gal files):
• quality assessment; diagnostic plots,
• background correction,
• array normalization.
Array CGH data (UCSF spot file):
• quality assessment; diagnostic plots,
• background correction,
• clones summary,
• array normalization.
Analysis: probes by sample matrix of log-ratios or log-intensities
Analysis of expression data:
• Identify D.E. genes, estimation and testing,
• clustering, and
• discrimination.
7
Linear Models
Specific examples
• T-tests
• F-tests
• Empirical Bayes
• SAM
Examples
• Identify differentially expressed genes among two or more tumor subtypes or different cell treatments.
• Look for genes that have different time profiles between different mutants.
• Look for genes associated with survival.
8
Clustering
Algorithms
• Hierarchical clustering
• Self-organizing maps
• Partitioning around medoids (PAM)
Examples
• We can cluster cell samples (cols), e.g. the identification of new / unknown tumor subclasses or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-expressed genes.
9
Discrimination
Learning set: B-ALL, T-ALL, AML; new sample: ?
Questions
• Identification of groups of genes that are predictive of a particular class of tumors?
• Can I use the expression profile of cancer patients to predict survival?
Classification rules
• DLDA or DQDA
• k-nearest neighbor (knn)
• Support vector machine (svm)
• Classification tree
[Figure: classification tree with splits on Gene 1 (Mi1 < -0.67) and Gene 2 (Mi2 > 0.18) and leaves AML, B-ALL and T-ALL]
10
Annotation
Riken ID: ZX00049O01
GenBank accession: AV128498
Nucleotide sequence:
TCGTTCCATTTTTCTTTAGGGGGTCTTTC
CCCGTCTTGGGGGGGAGGAAAAGTTCTG
CTGCCCTGATTATGAACTCTATAATAGAG
TATATAGCTTTTGTACCTTTTTTACAGGAA
GGTGCTTTCTGTAATCATGTGATGTATAT
TAAACTTTTTATAAAAGTTAACATTTTGCA
TAAT AAACCATTTTTG
Name: Inhibitor of DNA binding 3
LocusLink: 15903
MGD: MGI:96398
UniGene: Mm.110
Gene symbol: Idb3
Map position: Chromosome 4, 66.0 cM
Swiss-Prot: P20109
GO: GO:0000122, GO:0005634, GO:0019904
Literature (PubMed): 12858547, 2000388, etc.
Other resources: BayGenomics ES cells; biochemical pathways (KEGG)
11
What is your question?
• What are the target genes for my knock-out gene?
• Look for genes that have different time profiles between different cell types.
Gene discovery, differential expression
• Is a specified group of genes all up-regulated in a specified condition?
Gene set, differential expression
• Can I use the expression profile of cancer patients to predict survival?
• Identification of groups of genes that are predictive of a particular class of tumors?
Class prediction, classification
• Are there tumor sub-types not previously identified?
• Are there groups of co-expressed genes?
Class discovery, clustering
• Detection of gene regulatory mechanisms.
• Do my genes group into previously undiscovered pathways?
Clustering. Often expression data alone are not enough; one needs to incorporate sequence and other information.
12
Classification
13
Gene expression data
Two color spotted array
Data on G genes for n samples

        mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90    ...
2       -0.10     0.49     0.24     0.06     0.46    ...
3        0.15     0.74     0.04     0.10     0.20    ...
4       -0.45    -1.03    -0.79    -0.56    -0.32    ...
5       -0.06     1.06     1.35     1.09    -1.09    ...

Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
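As a minimal sketch of the definition above, the log-ratio for a single spot can be computed as follows. The base of the logarithm is not stated on the slide; base 2 is assumed here, the intensity values are hypothetical, and the normalization step is not shown.

```python
import math

def log_ratio(red, green):
    """Base-2 log-ratio M = log2(red / green) for one spot (base 2 is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected (red, green) intensities for three spots.
spots = [(1200.0, 600.0), (500.0, 500.0), (250.0, 1000.0)]
ratios = [log_ratio(r, g) for r, g in spots]
print(ratios)
```

A positive value means the red channel is brighter than the green one, a negative value the reverse, and zero means equal intensities.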
14
Classification
• Task: assign objects to classes (groups) on the
basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover
them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a
(training or learning) set of labeled objects to form
a classifier for classification of future observations
15
Example: Tumor Classification
• Reliable and precise classification essential for successful
cancer treatment
• Current methods for classifying human malignancies rely on a
variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; likely that existing classes
are heterogeneous
• Characterize molecular variations among tumors by monitoring
gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor
classification (and therefore more appropriate treatments and
better outcomes)
16
Tumor Classification Using Gene Expression
Data
Three main types of statistical problems associated
with tumor classification:
• Identification of new/unknown tumor classes using
gene expression profiles (unsupervised learning –
clustering)
• Classification of malignancies into known classes
(supervised learning – discrimination)
• Identification of “marker” genes that characterize
the different tumor classes (feature or variable
selection).
17
Clustering
18
Generic Clustering Tasks
• Estimating the number of clusters
• Assigning samples to the groups
• Assessing strength/confidence of cluster assignments for individual objects
19
What to cluster
• Samples: To discover novel subtypes of the
existing groups or entirely new partitions. Their
utility needs to be confirmed with other types of
data, e.g. clinical information.
• Genes: To discover groups of co-regulated
genes/ESTs and use these groups to infer
function where it is unknown using members of
the groups with known function.
20
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on
predefined criteria.
Issues: Which genes / arrays to use?
Which similarity or dissimilarity measure?
Which method to use to join clusters/observations?
Which clustering algorithm?
How to validate the resulting clusters?
It is advisable to reduce the number of genes from the full set to
some more manageable number, before clustering. The basis for
this reduction is usually quite context specific and varies depending
on what is being clustered, genes or arrays.
21
Clustering of genes
Array Data → For each gene, calculate a summary statistic and/or adjusted p-values → Set of candidate DE genes → Biological verification
Set of candidate DE genes + Similarity metrics + Clustering algorithm → Clustering → Descriptive interpretation
22
Clustering of samples and genes
Array Data → Set of samples to cluster; Set of genes to use in clustering (DO NOT use class labels in the set determination)
Samples and genes + Similarity metrics + Clustering algorithm → Clustering
Clustering → Descriptive interpretation of genes separating novel subgroups of the samples; Validation of clusters with clinical data
23
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or
dissimilarity between two data objects
• Two main classes of metric:
- Correlation coefficients (similarity)
- Compares shape of expression curves
- Types of correlation:
- Centered.
- Un-centered.
- Rank-correlation
- Distance metrics (dissimilarity)
- City Block (Manhattan) distance
- Euclidean distance
24
Correlation (a measure between -1 and 1)
• Pearson Correlation Coefficient (centered correlation)
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) $$
where S_x = standard deviation of x and S_y = standard deviation of y
• Others include Spearman’s ρ and Kendall’s τ
You can use absolute correlation to capture both positive and negative correlation.
[Figure: scatter plots illustrating positive correlation and negative correlation]
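The Pearson coefficient can be sketched directly from its definition. The helper names and the two expression profiles below are illustrative assumptions; the dissimilarity "one minus correlation" (optionally on the absolute correlation) is one common way to turn the similarity into something a clustering algorithm can use. The sketch assumes neither profile is constant (non-zero standard deviation).

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

def correlation_dissimilarity(x, y, absolute=False):
    """1 - r, or 1 - |r| when positive and negative correlation should both count as similar."""
    r = pearson(x, y)
    return 1.0 - (abs(r) if absolute else r)

# Hypothetical expression profiles of two genes across five samples.
gene_a = [0.46, 0.30, 0.80, 1.51, 0.90]
gene_b = [-a for a in gene_a]  # perfectly anti-correlated with gene_a

print(pearson(gene_a, gene_b))
print(correlation_dissimilarity(gene_a, gene_b, absolute=True))
```

With `absolute=True` the two anti-correlated genes end up at dissimilarity close to zero, which is the behaviour the slide describes for absolute correlation.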
25
Potential pitfalls
Correlation = 1
26
Distance metrics
• City Block (Manhattan) distance:
- Sum of absolute differences across dimensions
- Less sensitive to outliers
- Diamond shaped clusters
$$ d(X, Y) = \sum_{i} |x_i - y_i| $$
• Euclidean distance:
- Most commonly used distance
- Sphere shaped cluster
- Corresponds to the geometric distance in the multidimensional space
$$ d(X, Y) = \sqrt{\sum_{i} (x_i - y_i)^2} $$
where gene X = (x1,…,xn) and gene Y=(y1,…,yn)
[Figure: two panels showing genes X and Y plotted against Condition 1 and Condition 2, one for each distance]
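Both distances are short enough to write out in full. The sketch below uses two hypothetical genes measured under two conditions; the function names are illustrative.

```python
import math

def manhattan(x, y):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical genes measured under two conditions.
gene_x = [0.0, 0.0]
gene_y = [3.0, 4.0]
print(manhattan(gene_x, gene_y), euclidean(gene_x, gene_y))
```

For this pair the city block distance (7) is larger than the Euclidean distance (5), since it adds the two coordinate differences instead of taking the straight-line path.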
27
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
28
How to Compute Group Similarity?
Four Popular Methods:
Given two groups g1 and g2,
•Single-link algorithm: s(g1,g2)= similarity of the closest pair
•Complete-link algorithm: s(g1,g2)= similarity of the farthest pair
•Average-link algorithm: s(g1,g2)= average of similarity of all pairs
•Centroid algorithm: s(g1,g2)= distance between centroids of the two clusters
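The four rules can be sketched as follows. This version works with Euclidean distance (a dissimilarity) instead of a similarity, so "closest pair" becomes the minimum distance and "farthest pair" the maximum; the two groups are hypothetical points on a line.

```python
import math

def pairwise_distances(g1, g2):
    """All distances between a point of g1 and a point of g2."""
    return [math.dist(p, q) for p in g1 for q in g2]

def single_link(g1, g2):
    return min(pairwise_distances(g1, g2))      # closest pair

def complete_link(g1, g2):
    return max(pairwise_distances(g1, g2))      # farthest pair

def average_link(g1, g2):
    d = pairwise_distances(g1, g2)
    return sum(d) / len(d)                      # average over all pairs

def centroid_link(g1, g2):
    def centroid(group):
        return [sum(coord) / len(group) for coord in zip(*group)]
    return math.dist(centroid(g1), centroid(g2))

# Two hypothetical groups of two points each.
g1 = [(0.0, 0.0), (1.0, 0.0)]
g2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(g1, g2), complete_link(g1, g2),
      average_link(g1, g2), centroid_link(g1, g2))
```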
Supplementary slide
29
Adapted from internet
Distance between clusters
Examples of clustering methods
• Single (nearest neighbor): leads to the “cluster chains”
• Complete (farthest neighbor): leads to small compact clusters
• Distance between centroids
• Average (mean) linkage
30
Comparison of the Three Methods
• Single-link
- Elongated clusters
- Individual decision, sensitive to outliers
• Complete-link
- Compact clusters
- Individual decision, sensitive to outliers
• Average-link or centroid
- “In between”
- Group decision, insensitive to outliers
• Which one is the best? Depends on what you need!
Adapted from internet
31
Clustering algorithms
• Clustering algorithms come in two basic flavors:
- Hierarchical
- Partitioning
32
Partitioning methods
• Partition the data into a pre-specified number k of
mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until
some criterion is met, e.g. minimize within cluster sums
of squares. Ideally, dissimilarity between clusters will be
maximized while it is minimized within clusters.
• Examples:
- k-means, self-organizing maps (SOM), PAM, etc.;
- Fuzzy (each object is assigned probability of being in
a cluster): needs stochastic model, e.g. Gaussian
mixtures.
33
Partitioning methods
K=2
34
Partitioning methods
K=4
35
Example of a partitioning algorithm
K-Means or PAM (Partitioning Around Medoids)
1. Given a similarity function
2. Start with k randomly selected data points
3. Assume they are the centroids (medoids) of k clusters
4. Assign every data point to a cluster whose centroid (medoid) is the closest to the data point
5. Recompute the centroid (medoid) for each cluster
6. Repeat this process until the similarity-based objective function converges
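The steps above can be sketched for the k-means case as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, stops when the assignments no longer change (used here as a stand-in for convergence of the objective function), and the function name, seed and data are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of coordinate tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)     # steps 2-3: k random data points as centroids
    labels = None
    for _ in range(max_iter):
        # Step 4: assign every point to the closest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:          # step 6: stop when nothing changes
            break
        labels = new_labels
        # Step 5: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Two hypothetical, well separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

PAM follows the same loop but restricts each cluster center to be one of the observed data points (a medoid), which lets it work with an arbitrary dissimilarity matrix.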
36
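Mixture Model for Clustering
[Figure: three component densities P(X|Cluster1), P(X|Cluster2) and P(X|Cluster3)]
$$ P(X) = \pi_1 P(X \mid \text{Cluster}_1) + \pi_2 P(X \mid \text{Cluster}_2) + \pi_3 P(X \mid \text{Cluster}_3) $$
$$ X \mid \text{Cluster}_i \sim N(\mu_i, \sigma_i^2) $$
π_i is a cluster prior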
37
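Adapted from internet
Mixture Model Estimation
• Likelihood function (generally Gaussian)
$$ p(x) = \sum_{i=1}^{k} \pi_i \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right) $$
(inside the square root, π is the usual constant; π_i with a subscript is the cluster prior)
• Parameters: e.g., μ_i, σ_i, π_i
• Using EM algorithm
- Similar to “soft” K-means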
• Number of clusters can be determined using a
model-selection criterion, e.g. BIC (Raftery and
Fraley, 1998)
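A minimal sketch of EM for a one-dimensional Gaussian mixture is given below, assuming the number of components is fixed in advance. The starting values, the fixed number of iterations and the small floor on the standard deviation are choices made for this illustration only; model selection by BIC is not shown.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gaussian_mixture(xs, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (mus, sigmas, weights, responsibilities)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability ("soft" assignment) of each cluster for each point.
        resp = []
        for x in xs:
            dens = [weights[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of the means, standard deviations and priors.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / n_i
            sigmas[i] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsing component
            weights[i] = n_i / len(xs)
    return mus, sigmas, weights, resp

# Hypothetical data: two groups centred near 0 and near 5.
xs = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
mus, sigmas, weights, resp = em_gaussian_mixture(xs, [0.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(mus, weights)
```

The rows of `resp` are the soft cluster memberships, which is what gives model-based clustering its natural probability interpretation.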
38
Adapted from internet
Hierarchical methods
• Hierarchical clustering methods produce a tree
or dendrogram.
• They avoid specifying how many clusters are
appropriate by providing a partition for each k
obtained from cutting the tree at some level.
• The tree can be built in two distinct ways
- bottom-up: agglomerative clustering (usually used).
- top-down: divisive clustering.
39
Agglomerative Methods
• Start with n mRNA sample (or G gene) clusters
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
The distance between clusters is defined by the method used (e.g., with complete linkage, the distance is defined as the distance between the farthest pair of points in the two clusters)
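The merging procedure can be sketched with a deliberately naive implementation. It assumes Euclidean distance, recomputes all between-cluster distances at every step (fine for a handful of points, far too slow for thousands of genes), and uses hypothetical one-dimensional data. Passing `max` gives complete linkage and `min` gives single linkage.

```python
import math

def agglomerative(points, linkage=max):
    """Return the merge history as a list of (members, merge distance) pairs."""
    clusters = [[i] for i in range(len(points))]   # start with one cluster per object
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])  # merge the two closest clusters
        merges.append((merged, d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges

# Five hypothetical objects on a line.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
merges = agglomerative(points, linkage=max)
for members, distance in merges:
    print(members, distance)
```

The merge distances are the heights at which branches join in the dendrogram; cutting the tree at a given height yields the partition for the corresponding number of clusters.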
Supplementary slide
40
Divisive Methods
• Start with only one cluster
• At each step, split clusters into two parts
• Advantage: Obtain the main structure of the
data (i.e. focus on upper levels of
dendrogram)
• Disadvantage: Computational difficulties
when considering all possible divisions into
two groups
Divisive methods are rarely utilized in microarray data
analysis.
Supplementary slide
41
Illustration of points in two-dimensional space
[Figure: five points (1 to 5) in two-dimensional space and the agglomerative dendrogram built from them, with leaf order 1 5 2 3 4 and clusters {1,5}, {1,2,5}, {3,4}, {1,2,3,4,5}]
42
Tree re-ordering?
[Figure: the same agglomerative dendrogram (clusters {1,5}, {1,2,5}, {3,4}, {1,2,3,4,5}) drawn with two leaf orders, 1 5 2 3 4 and 2 1 5 3 4]
43
Partitioning vs. hierarchical
Partitioning:
Advantages
• Optimal for certain criteria.
• Objects automatically assigned to
clusters
Disadvantages
• Need initial k;
• Often require long computation
times.
• All objects are forced into a cluster.
Hierarchical
Advantages
• Faster computation.
• Visual.
Disadvantages
• Unrelated objects are
eventually joined
• Rigid, cannot correct later for
erroneous decisions made
earlier.
• Hard to define clusters – still
need to know “where to cut”.
Note that hierarchical clustering results may be used as the starting points for
the partitioning or model-based algorithms
44
Clustering microarray data
• Clustering leads to readily interpretable figures and can
be helpful for identifying patterns in time or space.
Examples:
• We can cluster cell samples (cols),
e.g. the identification of new / unknown tumor classes or
cell subtypes using gene expression profiles.
• We can cluster genes (rows) ,
e.g. using large numbers of yeast experiments, to
identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf.
variable selection) in predictive models.
45
Estimating number of clusters using silhouette (see PAM)
Define the silhouette width of an observation as:
S = (b-a)/max(a,b)
where a is the average dissimilarity to all the other points in its own cluster and b is the smallest average dissimilarity to the points of any other cluster.
Intuitively, objects with large S are well-clustered while the ones with small S tend to lie between clusters.
How many clusters: Perform clustering for a sequence of the number of clusters k and choose the number of clusters corresponding to the largest average silhouette.
The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
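The silhouette width can be sketched as below, taking b as the smallest average dissimilarity to the points of another cluster, as in the standard definition used with PAM. The sketch assumes Euclidean distance and at least two clusters; the convention of giving singleton clusters a width of 0 and the example data are assumptions of this illustration.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width S = (b - a) / max(a, b) for every observation."""
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:                       # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        # Average distance to each other cluster; b is the smallest of these.
        other = {}
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(math.dist(p, q))
        b = min(sum(d) / len(d) for d in other.values())
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths

def average_silhouette(points, labels):
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)

# Four hypothetical observations and two candidate clusterings.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette(points, [0, 0, 1, 1])
bad = average_silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

To choose the number of clusters, one would run the clustering for each candidate k and keep the k with the largest average silhouette.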
46
Estimating Number of Clusters with
Silhouette (ctd)
Compute average silhouette for k=3
And compare it with the results for
other k’s.
47
Estimating number of clusters using reference distribution
Idea: Define a goodness of clustering score to minimize, e.g. the pooled Within-cluster Sum of Squares (WSS) around the cluster means, reflecting compactness of clusters:
$$ W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r $$
where n_r and D_r are the number of points in cluster r and the sum of all pairwise distances in cluster r, respectively.
Then the gap statistic for k clusters is defined as:
$$ \mathrm{Gap}_n(k) = E_n^{*}(\log W_k) - \log W_k $$
where E*_n is the average under a sample of the same size from the reference distribution. The reference distribution can be generated either parametrically (e.g. from a multivariate distribution) or non-parametrically (e.g. by sampling from the marginal distributions of the variables). The first local maximum is chosen to be the number of clusters (the actual rule is slightly more complicated) (Tibshirani et al, 2001)
48
Adapted from internet
Estimating number of clusters
There are other resampling (e.g. Dudoit and Fridlyand, 2002) and non-resampling based rules for estimating the number of clusters (for review see Milligan and Cooper (1978) and Dudoit and Fridlyand (2002)).
The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for clustering.
49
Confidence in the individual cluster assignments
Want to assign confidence to individual observations of being in their assigned clusters.
•Model-based clustering: natural probability interpretation
•Partitioning methods: silhouette
•Dudoit and Fridlyand (2003) have presented a resampling-based approach that assigns confidence by computing the proportion of resampling runs in which an observation ends up in its assigned cluster.
50
Tight clustering (genes)
Identifies small stable gene clusters by not attempting to cluster all the genes.
Thus, it does not necessitate estimation of the number of clusters and
assignment of all points into the clusters. Aids interpretability and validity of the
results. (Tseng et al, 2003)
Algorithm:
For sequence of k > k0:
1. Identify the set of genes that are consistently grouped together when
genes are repeatedly sub-sampled. Order those sets by size. Consider the top
largest q sets for each k.
2. Stop when for (k, (k+1)), the two sets are nearly identical. Take the set
corresponding to (k+1). Remove that set from the dataset.
3. Set k0 = k0 -1 and repeat the procedure.
51
Two-way clustering of genes and samples.
Refers to the methods that use samples and genes simultaneously to extract information. These methods are not yet well developed.
Some examples of the approaches include Block Clustering (Hartigan, 1972), which repeatedly rearranges rows and columns to obtain the largest reduction of total within-block variance.
Another method is based on Plaid Models (Lazzeroni and Owen, 2002).
Friedman and Meulman (2002) present an algorithm that clusters samples based on subsets of attributes, i.e. each group of samples can be characterized by a different gene set.
52
Applications of clustering to the microarray data
Alizadeh et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
•Three subtypes of lymphoma (FL, CLL and
DLBCL) have different genetic signatures. (81 cases
total)
•DLBCL group can be partitioned into two
subgroups with significantly different survival. (39
DLBCL cases)
53
Clustering both
cell samples
and genes
Taken from
Nature February, 2000
Paper by A Alizadeh et al
Distinct types of diffuse large
B-cell lymphoma identified by
Gene expression profiling,
54
Clustering cell samples
Discovering sub-groups
Taken from
Alizadeh et al
(Nature, 2000)
55
Attempt at validation
of DLBCL subgroups
Taken from
Alizadeh et al
(Nature, 2000)
56
Clustering genes
Finding different patterns in the data
Yeast Cell Cycle
(Cho et al, 1998)
6 × 5 SOM with
828 genes
Taken from Tamayo et al, (PNAS, 1999)
57
Summary
Which clustering method should I use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- Hard or soft boundaries between clusters?
Keep in mind:
- Clustering cannot NOT work. That is, every clustering method will return clusters.
- Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide biological “proof”.
- Clustering is generally used as an exploratory and hypothesis-generating tool.
58
Discrimination
59
Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
Aim: predict Y from X.
[Figure: objects grouped into predefined classes {1, 2, …, K}; each object has a feature vector X = {colour, shape} and a class label, e.g. Y = 2; for a new object with X = {red, square}, the classification rule has to predict Y = ?]
60
Discrimination and Allocation
Discrimination: Learning set (data with known classes) → Classification technique → Classification rule
Prediction: Data with unknown classes → Classification rule → Class assignment
61
Learning set
Predefined classes: clinical outcome
• Bad prognosis: recurrence < 5yrs
• Good prognosis: recurrence > 5yrs
Objects: arrays
Feature vectors: gene expression
New array → Classification rule → Good prognosis? (metastasis > 5)
Reference
L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
62
Learning set
Predefined classes: tumor type (B-ALL, T-ALL, AML)
Objects: arrays
Feature vectors: gene expression
New array → Classification rule → T-ALL?
Reference
Golub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
63
Classification Rule
Components of a classification rule:
- Classification procedure
- Feature selection
- Parameters [pre-determined, estimable]
- Distance measure
- Aggregation methods
Performance Assessment, e.g. cross-validation
• One can think of the classification rule as a black box; some methods provide more insight into the box.
• Performance assessment needs to be looked at for all classification rules.
64
Classification rule
Maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses
the parameter value that makes the chance of the
observations the highest.
• For known class conditional densities pk(X), the
maximum likelihood (ML) discriminant rule predicts
the class of an observation X by
C(X) = argmaxk pk(X)
65
Gaussian ML discriminant rules
• For multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k), the ML classifier is
$$ C(X) = \arg\min_k \left\{ (X - \mu_k) \Sigma_k^{-1} (X - \mu_k)' + \log |\Sigma_k| \right\} $$
• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)
• In practice, population mean vectors μ_k and covariance matrices Σ_k are estimated by corresponding sample quantities
66
ML discriminant rules - special cases
[DLDA]
Diagonal linear discriminant analysis:
class densities have the same diagonal covariance matrix Σ = diag(σ_1², …, σ_p²)
[DQDA]
Diagonal quadratic discriminant analysis:
class densities have different diagonal covariance matrices Σ_k = diag(σ_1k², …, σ_pk²)
Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for
two classes (different variance calculation).
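A minimal sketch of DLDA follows, assuming equal class priors so that the rule reduces to choosing the class whose mean is closest once every gene is scaled by its pooled within-class variance. The function names and the tiny two-gene data set are hypothetical.

```python
def fit_dlda(X, y):
    """Estimate class means and the pooled within-class variance of each feature."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means = {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(col) / len(rows) for col in zip(*rows)]
    pooled = []
    for j in range(p):
        ss = sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y))
        pooled.append(ss / (n - len(classes)))   # same diagonal covariance for all classes
    return means, pooled

def predict_dlda(model, x):
    """Pick the class minimising the variance-scaled squared distance to its mean."""
    means, pooled = model
    return min(means, key=lambda k: sum((xj - mj) ** 2 / v
                                        for xj, mj, v in zip(x, means[k], pooled)))

# Hypothetical learning set: six arrays, two genes, two classes.
X = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [2.0, -1.0], [2.2, -0.9], [1.9, -1.2]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_dlda(X, y)
print(predict_dlda(model, [0.1, 0.9]), predict_dlda(model, [2.1, -1.0]))
```

DQDA would differ only in estimating a separate variance per class and per gene and adding the log-determinant term to the score.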
67
The Logistic Regression Model
2-class case: log[p/(1-p)] = α + β^t X
• p is the probability that the event Y occurs given the observed gene expression pattern, p(Y=1 | X)
• p/(1-p) is the odds
• log[p/(1-p)] is the log odds, or "logit"
This can easily be generalized to multiclass outcome and to more general dependences than linear. Also, logistic regression makes fewer assumptions on the marginal distribution of the variables. However, the results are generally very similar to LDA. (Hastie et al, 2003)
68
Classification with SVMs
Generalization of the ideas of separating hyperplanes in the original space.
Linear boundaries between classes in higher-dimensional space lead to
the non-linear boundaries in the original space.
69
Adapted from internet
Nearest neighbor classification
• Based on a measure of distance between
observations (e.g. Euclidean distance or one
minus correlation).
• k-nearest neighbor rule (Fix and Hodges (1951))
classifies an observation X as follows:
- find the k observations in the learning set closest to X
- predict the class of X by majority vote, i.e., choose
the class that is most common among those k
observations.
• The number of neighbors k can be chosen by
cross-validation (more on this later).
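The k-nearest neighbor rule described above can be sketched in a few lines. Euclidean distance is assumed here (one minus correlation would work the same way), ties in the vote are broken arbitrarily, and the learning set is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k learning-set observations closest to x."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical learning set: six arrays described by two genes.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (0.15, 0.1)),
      knn_predict(train_X, train_y, (1.0, 0.95)))
```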
70
Nearest neighbor rule
71
Classification tree
• Partition the feature space into a set of
rectangles, then fit a simple model in each one
• Binary tree structured classifiers are constructed
by repeated splits of subsets (nodes) of the
measurement space X into two descendant
subsets (starting with X itself)
• Each terminal subset is assigned a class label;
the resulting partition of X corresponds to the
classifier
72
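Classification tree
[Figure: a binary tree with splits on Gene 1 (Mi1 < -0.67) and Gene 2 (Mi2 > 0.18) and leaves labelled 0, 1 and 2, shown next to the corresponding partition of the (Gene 1, Gene 2) plane at Gene 1 = -0.67 and Gene 2 = 0.18]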
73
Three aspects of tree construction
• Split selection rule:
- Example, at each node, choose split maximizing decrease in
impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping:
- Example, grow large tree, prune to obtain a sequence of
subtrees, then use cross-validation to identify the subtree with
lowest misclassification rate.
• Class assignment:
- Example, for each terminal node, choose the class minimizing
the resubstitution estimate of misclassification probability, given
that a case falls into this node.
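The split selection rule can be illustrated for a single gene with the Gini index as the impurity measure. This is only the first of the three aspects (no stopping or pruning), candidate thresholds are taken halfway between consecutive observed values, and the expression values and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Threshold t (rule: value < t) giving the largest decrease in Gini impurity."""
    n = len(labels)
    parent = gini(labels)
    best_t, best_gain = None, 0.0
    ordered = sorted(set(values))
    for low, high in zip(ordered, ordered[1:]):
        t = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical expression values of one gene and the class of each sample.
values = [-1.2, -0.9, -0.8, 0.1, 0.4, 0.9]
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
threshold, gain = best_split(values, labels)
print(threshold, gain)
```

A tree-growing procedure would repeat this search over all genes at every node and apply the best split found, then recurse on the two descendant nodes.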
Supplementary slide
74
Another component in classification rule: aggregating classifiers
[Figure: the training set X1, X2, … X100 is resampled 500 times (Resample 1, …, Resample 500); a classifier is built on each resample (Classifier 1, …, Classifier 500) and the 500 classifiers are combined into an aggregate classifier]
Examples:
• Bagging
• Boosting
• Random Forest
75
Aggregating classifiers: Bagging
[Figure: the training set (arrays) X1, X2, … X100 is resampled 500 times; each resample X*1, X*2, … X*100 is used to grow a tree (Tree 1, …, Tree 500). Each tree predicts a class for the test sample (e.g. Tree 1: Class 1, Tree 2: Class 2, Tree 499: Class 1, Tree 500: Class 1). Let the trees vote: 90% Class 1, 10% Class 2]
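The resample-and-vote scheme can be sketched as follows. To keep the example short, the base classifier is a nearest-class-mean rule on a single gene, used here as a stand-in for the trees of the slide; the function names, the seed and the data are hypothetical.

```python
import random
from collections import Counter

def fit_nearest_mean(sample):
    """Base classifier (stand-in for a tree): predict the class with the closest mean."""
    groups = {}
    for value, label in sample:
        groups.setdefault(label, []).append(value)
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_votes(train, x, n_resamples=500, seed=0):
    """Build one classifier per bootstrap resample and count the votes for x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_resamples):
        # Bootstrap: draw as many cases as in the training set, with replacement.
        resample = [rng.choice(train) for _ in train]
        votes[fit_nearest_mean(resample)(x)] += 1
    return votes

# Hypothetical training set: (expression value, class) for six arrays.
train = [(0.0, "A"), (0.2, "A"), (0.1, "A"), (1.0, "B"), (1.2, "B"), (0.9, "B")]
votes = bagging_votes(train, 0.15)
predicted = votes.most_common(1)[0][0]
print(predicted, dict(votes))
```

The vote proportions give a rough indication of how stable the prediction is across resamples, in the spirit of the "90% Class 1, 10% Class 2" summary on the slide.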
76
Other classifiers include…
• Neural networks
• Projection pursuit
• Bayesian belief networks
•…
77
Why select features
• Lead to better classification performance
by removing variables that are noise with
respect to the outcome
• May provide useful insights into etiology of
a disease
• Can eventually lead to the diagnostic tests
(e.g., “breast cancer chip”)
78
Why select features?
[Figure: correlation plots (colour scale from -1 to +1) for the leukemia data, 3 classes: no feature selection compared with selection of the top 100 features based on variance]
79
Approaches to feature selection
• Methods fall into three basic categories
- Filter methods
- Wrapper methods
- Embedded methods
• The simplest and most frequently used
methods are the filter methods.
Adapted from A. Hartemink
80
Filter methods
R^p → Feature selection → R^s → Classifier design, s << p
•Features are scored independently and the top s are used by
the classifier
•Score: correlation, mutual information, t-statistic, F-statistic,
p-value, tree importance statistic etc
Easy to interpret. Can provide some insight into the disease
markers.
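A filter method can be sketched by scoring each gene on its own and keeping the top s. The score used here is the absolute Welch two-sample t-statistic, which is one of several scores listed above; the sketch assumes exactly two classes with at least two samples each and non-constant genes, and the data are hypothetical.

```python
import math
import statistics

def t_statistic(a, b):
    """Welch two-sample t-statistic for one feature."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

def top_features(X, y, s):
    """Indices of the s features with the largest absolute t-statistic."""
    first, second = sorted(set(y))      # assumes exactly two classes
    scores = []
    for j in range(len(X[0])):
        a = [x[j] for x, label in zip(X, y) if label == first]
        b = [x[j] for x, label in zip(X, y) if label == second]
        scores.append(abs(t_statistic(a, b)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:s]

# Hypothetical data: six samples, three genes; only the second gene separates the classes.
X = [[0.10, 5.0, 0.30], [0.20, 5.2, 0.10], [0.00, 4.9, 0.20],
     [0.15, 1.0, 0.25], [0.05, 1.1, 0.15], [0.10, 0.9, 0.20]]
y = [0, 0, 0, 1, 1, 1]
selected = top_features(X, y, s=1)
print(selected)
```

Because each gene is scored independently, two highly correlated genes would both be selected, which is the redundancy problem of filter methods.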
Adapted from A. Hartemink
81
Problems with filter method
• Redundancy in selected features: features are
considered independently and not measured on
the basis of whether they contribute new
information
• Interactions among features generally can not
be explicitly incorporated (some filter methods
are smarter than others)
• Classifier has no say in what features should be used: some scores may be more appropriate in conjunction with some classifiers than others.
Supplementary slide
Adapted from A. Hartemink
82
Dimension reduction: a variant on a filter
method
• Rather than retain a subset of s features, perform
dimension reduction by projecting features onto s
principal components of variation (e.g. PCA etc)
• Problem is that we are no longer dealing with one
feature at a time but rather a linear or possibly more
complicated combination of all features. It may be good
enough for a black box but how does one build a
diagnostic chip on a “supergene”? (even though we don’t
want to confuse the tasks)
• Those methods tend not to work better than simple filter
methods.
Supplementary slide
Adapted from A. Hartemink
83
Wrapper methods
R^p → Feature selection → R^s → Classifier design, s << p
•Iterative approach: many feature subsets are scored based
on classification performance and best is used.
•Selection of subsets: forward selection, backward selection,
Forward-backward selection, tree harvesting etc
Adapted from A. Hartemink
84
Problems with wrapper methods
• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
• No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only.
• Easy to overfit.
Supplementary slide
Adapted from A. Hartemink
85
Embedded methods
• Attempt to jointly or simultaneously train
both a classifier and a feature subset
• Often optimize an objective function that
jointly rewards accuracy of classification
and penalizes use of more features.
• Intuitively appealing
Some examples: tree-building algorithms,
shrinkage methods (LDA, kNN)
Adapted from A. Hartemink
86
Performance assessment
• Any classification rule needs to be evaluated for its
performance on the future samples. It is almost never
the case in microarray studies that a large independent
population-based collection of samples is available at the
time of initial classifier-building phase.
• One needs to estimate future performance based on
what is available: often the same set that is used to build
the classifier.
• Assessing performance of the classifier based on
- Cross-validation.
- Test set
- Independent testing on future dataset
87
Diagram of performance assessment
[Figure: a classifier built on the training set is assessed either on the same training set (resubstitution estimation) or on an independent test set (test set estimation)]
88
Performance assessment (II)
• V-fold cross-validation (CV) estimation: Cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers by leaving one subset out; compute test set error rates on the left-out subset and average them.
- Bias-variance tradeoff: smaller V can give larger bias but smaller
variance
- Computationally intensive.
• Leave-one-out cross validation (LOOCV).
(Special case for V=n). Works well for stable classifiers (kNN, LDA, SVM)
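V-fold cross-validation can be sketched generically: the `fit` argument is any function that takes a learning set and returns a classifier. The nearest-class-mean rule, the seed and the data below are assumptions made for the illustration, and V is assumed to be no larger than the number of cases. Any feature selection belongs inside `fit`, so that it is redone on each learning subset.

```python
import math
import random

def fit_nearest_mean(X, y):
    """Illustrative classifier: predict the class whose mean vector is closest."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    means = {label: [sum(col) / len(rows) for col in zip(*rows)]
             for label, rows in groups.items()}
    return lambda x: min(means, key=lambda label: math.dist(x, means[label]))

def v_fold_cv_error(X, y, fit, V=5, seed=0):
    """Average misclassification rate over V left-out subsets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)             # random division into V subsets
    folds = [idx[v::V] for v in range(V)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        classifier = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(classifier(X[i]) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / V

# Hypothetical learning set: ten arrays, two genes, two well separated classes.
X = [(0.1 * i, 0.0) for i in range(5)] + [(5.0 + 0.1 * i, 1.0) for i in range(5)]
y = ["A"] * 5 + ["B"] * 5
cv_error = v_fold_cv_error(X, y, fit_nearest_mean, V=5)
print(cv_error)
```

Setting V equal to the number of cases gives leave-one-out cross-validation.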
Supplementary slide
89
Performance assessment (I)
• Resubstitution estimation: error rate on the learning set.
- Problem: downward bias
• Test set estimation:
1) Divide the learning set into two sub-sets, L and T; build
the classifier on L and compute the error rate on T.
2) Build the classifier on the training set (L) and compute
the error rate on an independent test set (T).
- L and T must be independent and identically distributed (i.i.d.).
- Problem: reduced effective sample size
Supplementary slide
90
Diagram of performance assessment
[Diagram: performance assessment of a classifier by resubstitution
estimation (on the training set), by cross-validation (training set
split into CV learning sets and CV test sets), and by test set
estimation (on an independent test set).]
91
Performance assessment (III)
• Common practice is to do feature selection using the
learning set, then use CV only for model building and
classification.
• However, usually the features are unknown and the intended
inference includes feature selection. Then, CV
estimates as above tend to be downward biased.
• Features (variables) should be selected only from the
CV learning set used to build the model in each fold (and
not the entire set).
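The downward bias can be seen on pure noise. The sketch below is an illustration under stated assumptions, not the lecturers' analysis: the data are simulated with labels unrelated to the features, genes are ranked by absolute difference in class means, and a nearest-centroid rule is used; all names (top_genes, predict, loocv_error) are hypothetical.

```python
import random

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute difference in class means."""
    def score(g):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict(X, y, genes, x):
    """Nearest-centroid rule restricted to the chosen genes."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        dist = sum((x[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error(X, y, k, select_inside):
    """LOOCV error with gene selection done inside or outside the CV loop."""
    genes_all = top_genes(X, y, k)  # selection that has already seen every sample
    wrong = 0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, k) if select_inside else genes_all
        wrong += predict(X_tr, y_tr, genes, X[i]) != y[i]
    return wrong / len(y)

rng = random.Random(1)
n, p, k = 20, 500, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0] * (n // 2) + [1] * (n // 2)  # labels carry no information about X
biased = loocv_error(X, y, k, select_inside=False)  # selection outside CV
honest = loocv_error(X, y, k, select_inside=True)   # selection within each fold
print(biased, honest)
```

Since there is no real signal, an honest estimate should be near chance, while selecting genes on the full set first typically reports a much smaller error.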
92
Comparison study
• Leukemia data – Golub et al. (1999)
- n = 72 samples,
- G = 3,571 genes,
- 3 classes (B-cell ALL, T-cell ALL, AML).
• Reference:
S. Dudoit, J. Fridlyand, and T. P. Speed (2002).
Comparison of discrimination methods for the
classification of tumors using gene expression data.
Journal of the American Statistical Association, Vol. 97,
No. 457, p. 77-87
93
[Figure: Leukemia data, 3 classes: test set error rates; 150 LS/TS runs]
94
Results
• In the main comparison, NN and DLDA had the
smallest error rates.
• Aggregation improved the performance of CART
classifiers.
• For the leukemia datasets, increasing the
number of genes to G=200 didn't greatly affect
the performance of the various classifiers.
95
Comparison study – discussion (I)
• “Diagonal” LDA: ignoring correlation between genes
helped here. Unlike classification trees and nearest
neighbors, DLDA is unable to take into account gene
interactions.
• Classification trees are capable of handling and
revealing interactions between variables. In addition,
aggregated classifiers have useful by-products:
prediction votes, variable importance statistics.
• Although nearest neighbors are simple and intuitive
classifiers, their main limitation is that they give very little
insight into mechanisms underlying the class
distinctions.
96
Summary (I)
• Bias-variance trade-off. Simple classifiers do well on
small datasets. As the number of samples increases, we
expect to see that classifiers capable of considering
higher-order interactions (and aggregated classifiers) will
have an edge.
• Cross-validation. It is of utmost importance to cross-validate
for every parameter that has been chosen
based on the data, including meta-parameters:
- what and how many features
- how many neighbors
- pooled or unpooled variance
- the classifier itself.
If this is not done, it is possible to wrongly declare having
discrimination power when there is none.
97
Summary (II)
• Generalization error rate estimation. It is necessary to
keep the sampling scheme in mind.
• Thousands and thousands of independent samples from
a variety of sources are needed to be able to address the
true performance of the classifier.
• We are not at that point yet with microarray studies.
The van 't Veer et al. (2002) study is probably the only study to
date with ~300 test samples.
98
Some performance assessment
quantities
Assume a 2-class problem
class 1 = no event ~ null hypothesis, e.g., no recurrence
class 2 = event ~ alternative hypothesis, e.g., recurrence
All quantities are estimated on the available dataset (test set if
available)
• Misclassification error rate: proportion of misclassified samples
• Lift: proportion of correct class 2 predictions divided by the
proportion of class 2 cases
Proportion(class 2 is true | class 2 is detected) / Proportion(class is 2)
• Odds ratio: measure of association between true and predicted
labels.
99
Some performance assessment
quantities (ctd)
• Sensitivity: proportion of correct class 2 predictions
Prob(detect class 2| class 2 is true) ~ power
• Specificity: proportion of correct class 1 predictions
Prob(declare class 1 | class 1 is true ) = 1 –
Prob(detect class 2 | class 1 is true) ~ 1 – type I error
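The quantities defined on these slides can all be read off the 2x2 table of true versus predicted labels. The following is a small sketch, assuming labels coded 1 (no event) and 2 (event) as on the slides; the function name two_class_summary and the example labels are hypothetical, and the odds ratio is computed as the usual cross-product ratio of the table, which the slides do not spell out.

```python
def two_class_summary(true_labels, predicted_labels):
    """Performance quantities for labels coded 1 (no event) and 2 (event).

    Assumes every cell of the 2x2 table is non-zero, so no division by zero.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == 2 and p == 2 for t, p in pairs)  # class 2 correctly detected
    fn = sum(t == 2 and p == 1 for t, p in pairs)  # class 2 missed
    fp = sum(t == 1 and p == 2 for t, p in pairs)  # class 1 called class 2
    tn = sum(t == 1 and p == 1 for t, p in pairs)  # class 1 correctly declared
    n = tp + fn + fp + tn
    ppv = tp / (tp + fp)  # Proportion(class 2 is true | class 2 is detected)
    return {
        "error_rate": (fp + fn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "lift": ppv / ((tp + fn) / n),
        "odds_ratio": (tp * tn) / (fp * fn),
    }

# Hypothetical example: 10 events and 10 non-events.
truth = [2] * 10 + [1] * 10
predicted = [2] * 8 + [1] * 2 + [2] * 1 + [1] * 9
summary = two_class_summary(truth, predicted)
for name, value in summary.items():
    print(name, round(value, 3))
```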
100
Some performance assessment
quantities (ctd)
• Positive Predictive Value (PPV): proportion of class 2 cases
among predicted class 2 cases (should be applicable to the
population)
Prob(class 2 is true | class 2 is detected)
= Prob(detect class 2 | class 2 is true) x Prob(class 2 is true) / Prob(detect class 2)
= sensitivity x Prob(class is 2) /
[sensitivity x Prob(class is 2) + (1 - specificity) x (1 - Prob(class is 2))]
Note that PPV is the only quantity explicitly incorporating population
proportions, i.e., the prevalence of class 2 in the population of interest
(Prob(class is 2)), as well as sensitivity and specificity.
If the prevalence is low, the specificity of the test has to be very high for it
to be clinically useful.
101
Case studies
Reference 1
Retrospective study
L van’t Veer et al. Gene expression profiling predicts
clinical outcome of breast cancer. Nature, Jan 2002.
- Learning set (Bad, Good); Classification rule
- Feature selection: correlation with class labels, very similar
to t-test.
- Using cross-validation to select 70 genes
Reference 2
Cohort study
M Van de Vijver et al. A gene expression signature as a
predictor of survival in breast cancer. The New England
Journal of Medicine, Dec 2002.
- 295 samples selected from the Netherlands Cancer Institute
tissue bank (1984 – 1995).
- Results: gene expression profile is a more powerful predictor
than standard systems based on clinical and histologic criteria
Reference 3
Prospective trials.
Aug 2003
Clinical trials
http://www.agendia.com/
- Agendia (formed by researchers from the Netherlands Cancer Institute)
has started in Oct, 2003
1) 5000 subjects [Health Council of the Netherlands]
2) 5000 subjects, New York based Avon Foundation.
- Custom arrays are made by Agilent including
70 genes + 1000 controls
102
Van’t Veer breast cancer study
Investigate whether a tumor's ability to metastasize is
acquired later in development or is inherent in the initial
gene expression signature.
• Retrospective sampling of node-negative women: 44
non-recurrences within 5 years of surgery and 34
recurrences. Additionally, 19 test samples (12 recur. and 7
non-recur.)
• Want to demonstrate that the gene expression profile is
significantly associated with recurrence independently of
the other clinical variables.
Nature, 2002
103
Predictor development
• Identify a set of genes with correlation > 0.3 with the binary outcome. Show that there
is significant enrichment for such genes in the dataset.
• Rank-order genes on the basis of their correlation.
• Optimize the number of genes in the classifier by using leave-one-out cross-validation.
Classification is made on the basis of the correlations of the expression profile of the
left-out sample with the mean expression of the remaining samples from the good
and bad prognosis patients, respectively.
N.B.: The correct way to select genes is within rather than outside cross-validation,
resulting in a different set of markers for each CV iteration.
N.B.: Optimizing the number of variables and other parameters should be done via 2-level
cross-validation if results are to be assessed on the training set.
The classification indicator is included in the logistic model along with other clinical
variables. It is shown that the gene expression profile has the strongest effect. Note that
some of this may be due to overfitting for the threshold parameter.
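The 2-level cross-validation mentioned in the N.B. can be sketched as follows. This is an illustration under assumptions, not the published analysis: genes are ranked by absolute Pearson correlation with a 0/1 outcome, a sample is assigned to the class whose mean profile it correlates with best, and the number of genes k is chosen by an inner leave-one-out loop that never sees the sample left out by the outer loop. The function names, the candidate values of k and the simulated data are hypothetical.

```python
import random

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0

def rank_genes(X, y):
    """Gene indices ordered by absolute correlation with the 0/1 outcome."""
    scores = [abs(pearson([row[g] for row in X], y)) for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)

def classify(X, y, genes, x):
    """Assign x to the class whose mean profile it correlates with best."""
    best = None
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        mean_profile = [sum(r[g] for r in rows) / len(rows) for g in genes]
        r = pearson([x[g] for g in genes], mean_profile)
        if best is None or r > best[0]:
            best = (r, label)
    return best[1]

def loo_error(X, y, k):
    """Leave-one-out error with the top-k genes re-ranked in every iteration."""
    wrong = 0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        wrong += classify(X_tr, y_tr, rank_genes(X_tr, y_tr)[:k], X[i]) != y[i]
    return wrong / len(y)

def nested_loo_error(X, y, candidate_k):
    """Outer loop estimates error; inner loop picks k on the outer training set only."""
    wrong = 0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        k = min(candidate_k, key=lambda c: loo_error(X_tr, y_tr, c))  # inner CV
        wrong += classify(X_tr, y_tr, rank_genes(X_tr, y_tr)[:k], X[i]) != y[i]
    return wrong / len(y)

# Simulated data (assumption): 6 informative genes out of 40, 6 samples per class.
rng = random.Random(0)
signal = [1.5] * 3 + [-1.5] * 3 + [0.0] * 34
y = [0] * 6 + [1] * 6
X = [[rng.gauss(0, 1) + (s if lab == 1 else -s) for s in signal] for lab in y]
err = nested_loo_error(X, y, candidate_k=(6, 10, 20))
print(err)
```

The cost grows quickly (every outer iteration runs a full inner CV per candidate k), which is why the sketch uses a deliberately small dataset.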
104
Van ‘t Veer, et al., 2002
105
van de Vijver’s breast cancer data
(NEJM, 2002)
• 295 additional breast cancer patients, a mix
of node-negative and node-positive
samples.
• Want to use the predictor that was
developed to identify patients at risk for
metastasis.
• The predicted class was significantly
associated with time to recurrence in the
multivariate Cox proportional hazards model.
106
107
Some examples of wrong answers and
questions in microarray data analysis
108
Life Cycle
[Diagram: Biological question; Experimental design; Microarray
experiment; Image analysis; Normalization; Quality measurement
(Pass / Failed); Analysis (Estimation, Testing, Clustering,
Discrimination); Biological verification and interpretation]
109
Prediction I: estimating misclassification
error
Performance of classifiers on future samples needs to be assessed
while taking population proportions into account.
Question: Build a classifier to predict a rare (1/100) subclass of cancer and
estimate its misclassification rate in the population.
Design: Retrospectively collect equal numbers of rare and common subtypes
and build a classifier. Estimate its future performance using cross-validation
on the collected set.
Issues: Population proportions of the two types differ from the proportions in the
study. For instance, if 0/50 of the rare subtype and 10/50 of the common subtype
were misclassified (10/100), then among 100 samples from the population we
expect to observe 1 rare instance and 99 common ones and will misclassify
approximately 20/100 samples.
Conclusion: If a dataset is not representative of population distributions, one
needs to think hard about how to do the “translation” (e.g., Positive
Predictive Value on the future samples vs Specificity and Sensitivity on the
current ones).
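The arithmetic in the example above is a weighted average of class-specific error rates. The short sketch below reproduces it; the function name population_error_rate and the dictionary keys are hypothetical.

```python
def population_error_rate(class_error_rates, class_proportions):
    """Weight each class-specific error rate by the proportion of that class."""
    return sum(class_error_rates[c] * class_proportions[c] for c in class_error_rates)

# Error rates observed in the balanced retrospective study (numbers from the slide).
study_errors = {"rare": 0 / 50, "common": 10 / 50}

in_study = population_error_rate(study_errors, {"rare": 0.50, "common": 0.50})
in_population = population_error_rate(study_errors, {"rare": 0.01, "common": 0.99})
print(in_study, in_population)
```

With equal class sizes the overall rate is 10/100, while weighting by the 1:99 population mix gives about 20/100, as stated on the slide.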
110
Adapted from the comment in Lancet by Rockhill
Prediction II: Prevalence vs PPV (ctd)
Entries are PPV (%).

Prevalence      Specificity
                90%     95%     99%     99.9%
50%             91      95      99      99.9
43%             88      94*     99      99.9
10%             53      69      92      99
1%              9       17      50      91
0.1%            1       2       9       50
One per 2500    0.4     0.8**   4       29

Assumes a constant sensitivity of 100%.
*PPV reported by Petricoin et al (2002)
**Correct PPV assuming the prevalence of ovarian cancer in the general population is
1/2500.
Note that discovering discriminatory power is not the same as demonstrating the clinical
utility of the classifier.
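The table entries follow from the PPV formula given earlier with sensitivity fixed at 100%. The sketch below recomputes them; the function name ppv and the variable names are assumptions, and the printed layout is only for inspection.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

prevalences = [0.50, 0.43, 0.10, 0.01, 0.001, 1 / 2500]
specificities = [0.90, 0.95, 0.99, 0.999]

# Rows: prevalence; columns: specificity; entries: PPV in percent.
table = [[100 * ppv(1.0, spec, prev) for spec in specificities]
         for prev in prevalences]

for prev, row in zip(prevalences, table):
    print(f"{100 * prev:6.2f}%  " + "  ".join(f"{value:5.1f}" for value in row))
```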
111
Experimental design
Proper randomization is essential in experimental
design.
Question: Build a predictor to diagnose ovarian
cancer.
Design: Tissue from normal women and ovarian
cancer patients arrives at different times.
Issues: Complete confounding between tissue type
and time of processing.
This phenomenon is very common in the absence
of a carefully thought-through design.
Post-mortem diagnosis: lack of randomization.
112
Clustering I
The procedure should not bias results towards desired
conclusions.
Question: Do expression data cluster according to
survival status?
Design: Identify genes with a high t-statistic for comparing
short and long survivors. Use these genes to cluster
samples. Get excited that samples cluster according to
survival status.
Issues: The genes were already selected based on
survival status. Therefore, it would rather be surprising if
samples did *not* cluster according to their survival.
Conclusion: None is possible with respect to clustering,
as variable selection was driven by class distinction.
113
Clustering II
P-values for differential expression are only valid when the class labels
are independent of the current dataset.
Question: Identify genes distinguishing among “interesting” subgroups.
Design: Cluster samples into K groups. For each gene, compute the
F-statistic and its associated p-value to test for differential expression
among the subgroups.
Issues: The same data were used to create the groups and to test for DE, so the
p-values are invalid.
Conclusion: None with respect to DE p-values. Nevertheless, it is
possible to select genes with a high value of the statistic and test
hypotheses about functional enrichment with, e.g., Gene Ontology.
Also, one can cluster these genes and use the results to generate new
hypotheses.
114
Acknowledgements
SFGH
• Agnes Paquet
• David Erle
• Andrea Barczac
• UCSF Sandler Genomics
Core Facility.
UCSF /CBMB
• Ajay Jain
• Mark Segal
• UCSF Cancer Center
Array Core
• Jain Lab
UCB
• Terry Speed
• Sandrine Dudoit
115
Some references
1. Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, Springer,
2001
2. Speed (editor), “Statistical Analysis of Gene Expression Microarray Data”,
Chapman & Hall/CRC, 2003
3. Alizadeh et al, “Distinct types of diffuse large B-cell lymphoma identified by
gene expression profiling”, Nature, 2000
4. Van ‘t Veer et al, “Gene expression profiling predicts clinical outcome of breast
cancer”, Nature, 2002
5. Van de Vijver et al, “A gene-expression signature as a predictor of survival in
breast cancer”, NEJM, 2002
6. Petricoin et al, “Use of proteomic patterns in serum to identify ovarian cancer”,
Lancet, 2002 (and relevant correspondence)
7. Golub et al, “Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring”, Science, 1999
8. Cho et al, “A genome-wide transcriptional analysis of the mitotic cell cycle”,
Mol. Cell, 1999
9. Dudoit et al, “Comparison of discrimination methods for the classification of
tumors using gene expression data”, JASA, 2002
116
Some references
10. Ambroise and McLachlan, “Selection bias in gene extraction on the basis of
microarray gene-expression data”, PNAS, 2002
11. Tibshirani et al, “Estimating the number of clusters in the dataset via the gap
statistic”, Tech Report, Stanford, 2000
12. Tseng et al, “Tight clustering: a resampling-based approach for identifying
stable and tight patterns in data”, Tech Report, 2003
13. Dudoit and Fridlyand, “A prediction-based resampling method for estimating
the number of clusters in a dataset”, Genome Biology, 2002
14. Dudoit and Fridlyand, “Bagging to improve the accuracy of a clustering
procedure”, Bioinformatics, 2003
15. Kaufman and Rousseeuw, “Clustering by means of medoids”,
Elsevier/North Holland, 1987
16. See the many articles by Leo Breiman on aggregation
117