slides III - George C. Tseng


Cluster Analysis III
10/5/2012
Outline
• Estimate the number of clusters.
• Evaluation of clustering results.

Estimate the number of clusters
Milligan & Cooper (1985) compared over 30 published rules. None is particularly better than all the others; the best method is data dependent.

1. Calinski & Harabasz (1974): choose the k that maximizes

$CH(k) = \dfrac{B(k)/(k-1)}{W(k)/(n-k)}$

where B(k) and W(k) are the between- and within-cluster sums of squares with k clusters.

2. Hartigan (1975): compute $H(k) = \left(\dfrac{W(k)}{W(k+1)} - 1\right)(n-k-1)$ and stop when H(k) < 10.

A minimal sketch of both rules is given below.
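As an illustration (not from the original slides), a minimal Python sketch of both rules, assuming k-means clustering and using the k-means inertia as W(k):

```python
# A minimal sketch of the Calinski-Harabasz and Hartigan rules, assuming
# k-means clustering; W(k) is the within-cluster sum of squares (inertia)
# and B(k) = total sum of squares - W(k).
import numpy as np
from sklearn.cluster import KMeans

def ch_and_hartigan(X, K=10, seed=0):
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    # W(k) for k = 1..K+1 (K+1 is needed for the Hartigan statistic at k = K)
    W = {k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
         for k in range(1, K + 2)}
    ch = {k: ((total_ss - W[k]) / (k - 1)) / (W[k] / (n - k))
          for k in range(2, K + 1)}
    hartigan = {k: (W[k] / W[k + 1] - 1) * (n - k - 1) for k in range(1, K + 1)}
    return ch, hartigan

# Usage: pick the k that maximizes ch[k], or the smallest k with hartigan[k] < 10.
```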
Estimate the number of clusters
3. Gap statistic: Tibshirani, Walther & Hastie (2000)
• The within-cluster sum of squares W(k) is a decreasing function of k.
• One normally looks for the turning point of the elbow shape to identify the number of clusters k.
Estimate the number of clusters
• Instead of the above arbitrary criterion, the paper proposes to maximize the following Gap statistic, where the background expectation $E_n^*$ is calculated from random sampling from a uniform distribution:

$\max_k \; \mathrm{Gap}_n(k) = E_n^*[\log W(k)] - \log W(k)$
[Figure: observations and their bounding box for generating Monte Carlo reference samples, aligned either with the feature axes or with the principal axes.]
Computation of the Gap Statistic
for k = 1 to K
    Cluster the observations into k groups and compute log W(k)
for b = 1 to B
    Generate the Monte Carlo sample X_{1b}, X_{2b}, ..., X_{nb} (n = # obs.)
    for k = 1 to K
        Cluster the M.C. sample into k groups and compute log W(k, b)
Compute $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W(k,b) - \log W(k)$
Compute sd(k), the s.d. of $\{\log W(k,b)\}_{b=1,\ldots,B}$
Set the total s.e. $s_k = \sqrt{1 + 1/B}\; sd(k)$
Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
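A minimal sketch of this procedure in Python, assuming k-means as the clustering routine and a uniform reference distribution over the feature-axis-aligned bounding box (the choices of K, B and the random seed are illustrative):

```python
# A minimal sketch of the gap statistic procedure above, using k-means and a
# uniform reference distribution over the bounding box of the data.
import numpy as np
from sklearn.cluster import KMeans

def log_w(X, k, seed=0):
    """log of the within-cluster sum of squares W(k) from k-means."""
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)

def gap_statistic(X, K=10, B=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_w_obs = np.array([log_w(X, k) for k in range(1, K + 1)])
    log_w_ref = np.empty((B, K))
    for b in range(B):
        Xb = rng.uniform(lo, hi, size=X.shape)        # Monte Carlo reference sample
        log_w_ref[b] = [log_w(Xb, k) for k in range(1, K + 1)]
    gap = log_w_ref.mean(axis=0) - log_w_obs
    s = np.sqrt(1 + 1 / B) * log_w_ref.std(axis=0)    # total s.e.
    for i in range(K - 1):                            # python index i = cluster count i+1
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return i + 1, gap, s                      # smallest k satisfying the rule
    return K, gap, s
```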
Estimate the number of clusters
4. Resampling methods: Tibshirani et al. (2001); Dudoit and Fridlyand (2002)
• $X^{tr}$: training data; $X^{te}$: testing data.
• $C(X^{tr}, k)$: K-means clustering centroids from the training data, treated as the underlying truth.
• $D[C(X^{tr}, k), X^{te}]$: co-membership matrix; entry (i, i') is 1 if test observations i and i' are assigned to the same training-data cluster, and 0 otherwise.
• Prediction strength: let $A_{k1}, A_{k2}, \ldots, A_{kk}$ be the K-means clusters from the test data $X^{te}$ and $n_{kj} = \#(A_{kj})$; compare the test clusters with the training clustering via

$ps(k) = \min_{1 \le j \le k} \dfrac{1}{n_{kj}(n_{kj}-1)} \sum_{i \ne i' \in A_{kj}} D[C(X^{tr},k), X^{te}]_{ii'}$

• Find the k that maximizes ps(k).
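A minimal sketch of prediction strength in Python, assuming a random 50/50 training/test split and k-means; the function name and split fraction are illustrative, not from the cited papers:

```python
# A minimal sketch of prediction strength: ps(k) is the worst per-cluster
# fraction of test-data pairs that the training-data centroids still place
# in the same cluster.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def prediction_strength(X, k, test_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_te = int(test_frac * len(X))
    X_te, X_tr = X[idx[:n_te]], X[idx[n_te:]]
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_te)
    tr_on_te = km_tr.predict(X_te)        # test points assigned to training centroids
    ps = []
    for j in range(k):                    # loop over test clusters A_k1, ..., A_kk
        members = np.where(km_te.labels_ == j)[0]
        if len(members) < 2:
            continue
        same = sum(tr_on_te[a] == tr_on_te[b] for a, b in combinations(members, 2))
        ps.append(2 * same / (len(members) * (len(members) - 1)))
    return min(ps)
```

Computing prediction_strength(X, k) over a range of k and picking the maximizer follows the rule on this slide.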
Estimate the number of clusters
Conclusions:
1. There is no dominating "good" method for estimating the number of clusters; some are good only in specific simulations or examples.
2. In a high-dimensional, complex data set there might not be a clear "true" number of clusters.
3. The problem is also one of "resolution": at a coarser resolution a few loose clusters may be identified, while at a more refined resolution many small, tight clusters will stand out.
Cluster Evaluation
• Evaluation and comparison of clustering methods is
always difficult.
• In supervised learning (classification), the class labels
(underlying truth) are known and performance can be
evaluated through cross validation.
• In unsupervised learning (clustering), external validation is
usually not available.
• Ideal data for cluster evaluation:
• Data with class/tumor labels (for clustering samples)
• Cell cycle data (for clustering genes)
• Simulated data
Rand Index
Y = {(a,b,c), (d,e,f)} and Y' = {(a,b), (c,d,e), (f)}. Classify the 15 pairs of elements:
• Together in both (2): ab, de
• Separate in both (7): ad, ae, af, bd, be, bf, cf
• Discordant (6): ac, bc, cd, ce, df, ef

Rand index: c(Y, Y') = (2 + 7)/15 = 0.6 (percentage of concordance)
1. $1 \ge c(Y, Y') \ge 0$
2. Clustering methods can be evaluated by $c(Y, Y_{truth})$ if $Y_{truth}$ is available.
Adjusted Rand index (Hubert and Arabie 1985):

$\text{Adjusted Rand index} = \dfrac{\text{index} - \text{expected index}}{\text{maximum index} - \text{expected index}}$

The adjusted Rand index takes a maximum value of 1 and has a constant expected value of 0 when the two clusterings are totally independent.
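A quick check of the toy example above, with the Rand index computed by brute-force pair counting and the adjusted Rand index taken from scikit-learn (the label encodings are illustrative):

```python
# Rand index by pair counting, and adjusted Rand index via scikit-learn.
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def rand_index(y1, y2):
    pairs = list(combinations(range(len(y1)), 2))
    concordant = sum((y1[i] == y1[j]) == (y2[i] == y2[j]) for i, j in pairs)
    return concordant / len(pairs)

# Y  = {(a,b,c), (d,e,f)}    -> labels 0,0,0,1,1,1
# Y' = {(a,b), (c,d,e), (f)} -> labels 0,0,1,1,1,2
Y  = [0, 0, 0, 1, 1, 1]
Yp = [0, 0, 1, 1, 1, 2]
print(rand_index(Y, Yp))             # (2 + 7) / 15 = 0.6
print(adjusted_rand_score(Y, Yp))    # adjusted for chance agreement
```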
Comparison

Hierarchical clustering
• Advantages: intuitive algorithm; good interpretability; no need to estimate the number of clusters.
• Disadvantages: very vulnerable to outliers; tree not unique, and genes closer in the tree are not necessarily more similar; hard to read when the tree is big.

K-means
• Advantages: simplified Gaussian mixture model; normally gives nice clusters.
• Disadvantages: local minimum problem; need to estimate the number of clusters.

SOM
• Advantages: clusters have an interpretation on a 2D geometry (more interpretable).
• Disadvantages: the algorithm is very heuristic; solution sub-optimal due to the 2D geometry restriction.

Model-based clustering
• Advantages: flexibility in cluster structure; rigorous statistical inference.
• Disadvantages: model selection usually difficult; local minimum problem.

Tight clustering
• Advantages: allows genes not to be clustered and only produces tight clusters; eases the problem of accurately estimating the number of clusters; biologically more meaningful.
• Disadvantages: slower computation when the data are large.
Simulation (Thalamuthu et al. 2006):
• 20 time-course samples for each gene; 15 clusters.
• In each cluster, four groups of samples with similar intensity.
• Individual sample and gene variation are added.
• The number of genes in each cluster ~ Poisson(10).
• Scattered (noise) genes are added.
• By visualization, the simulated data closely resemble real data.
Different types of perturbations
• Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j.
• Type II: for each element of the log-transformed expression matrix, a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added, to evaluate the robustness of the clustering against potential random errors.
• Type III: combination of Types I and II (a sketch of all three is given after this list).
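A minimal sketch of these perturbations (illustrative code, not from the paper), assuming X is a genes-by-samples log-expression matrix of the clustered genes:

```python
# Type I: scattered genes drawn sample-wise from the empirical distribution.
# Type II: independent normal error added to each element.
# Type III: both applied together.
import numpy as np

def add_scattered_genes(X, frac, rng):
    """Type I: for each added scattered gene, the value in sample j is drawn
    from the empirical distribution of all clustered genes in sample j."""
    n_new = int(round(frac * X.shape[0]))
    scattered = np.column_stack(
        [rng.choice(X[:, j], size=n_new, replace=True) for j in range(X.shape[1])])
    return np.vstack([X, scattered])

def add_noise(X, sd, rng):
    """Type II: add independent normal error to each element."""
    return X + rng.normal(0.0, sd, size=X.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))                                      # placeholder matrix
X_type1 = add_scattered_genes(X, 0.60, rng)                         # 60% scattered genes
X_type2 = add_noise(X, 0.4, rng)                                    # SD = 0.4
X_type3 = add_noise(add_scattered_genes(X, 0.60, rng), 0.4, rng)    # Type III
```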
Different degrees of perturbation in the simulated microarray data
Simulation schemes performed in the paper: in total, 25 (simulation settings) × 100 (data sets) = 2500 data sets are evaluated.
• Adjusted Rand index: a measure of similarity of two clusterings.
• Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better).
T: tight clustering
M: model-based
P: K-medoids
K: K-means
H: hierarchical
S: SOM
Consensus Clustering (Monti et al., 2003)
Simpson et al. BMC Bioinformatics 2010 11:590 doi:10.1186/1471-2105-11-590
Consensus Clustering
• The consensus matrix can be used as a distance matrix for clustering.
• Alternatively, one can cluster on the original data and attach a measurement of robustness to each cluster.
A sketch of building a consensus matrix is given below.
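A minimal sketch of building such a consensus matrix by subsampling and k-means, in the spirit of Monti et al. (2003); the subsampling fraction and number of resamples are illustrative assumptions:

```python
# Consensus (co-clustering) matrix built by repeated subsampling + k-means.
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_resamples=100, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))   # times i and j fall in the same cluster
    sampled = np.zeros((n, n))    # times i and j are drawn in the same resample
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    M = np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)
    return M
```

The returned matrix M has entries in [0, 1]; 1 - M can be fed to a clustering routine as a distance matrix, as noted above.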
Cluster and membership robustness
• Cluster robustness: the average connectivity within a cluster,

$m_k = \dfrac{1}{N_k (N_k - 1)/2} \sum_{i, j \in I_k,\ i < j} M_{i,j}$

where $M_{i,j}$ is the connectivity measurement of the i-th and j-th elements of cluster k.
• Membership robustness: the average connectivity between one element and all of the other elements of the cluster,

$m_i(k) = \dfrac{1}{N_k - 1} \sum_{j \in I_k,\ j \ne i} M_{i,j}$
Evaluate the clustering results: comparison in real data sets (see paper for detailed comparison criteria).
[Figure legend: consensus clustering with PAM (blue); consensus clustering with hierarchical clustering (red); HOPACH (black); fuzzy c-means (green).]
• Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in high-dimensional data analysis.
• We should use these methods for visualization, investigation and hypothesis generation.
• We should not use these methods inferentially.
• In general, methods that use resampling evaluation, allow scattered genes and are related to a model-based approach perform better.
• Hierarchical clustering specifically: we are provided with a picture from which we can draw many (or any) conclusions.
Common mistakes or warnings:
1. Run K-means with a large k and get excited to see patterns without further investigation.
K-means can let you see patterns even in randomly generated data, and besides, human eyes tend to see "patterns".
2. Identify genes that are predictive of survival (e.g. apply t-statistics to long and short survivors), cluster samples based on the selected genes, and find that the samples are clustered according to survival status.
The gene selection procedure is already biased towards the result you desire.
Common mistakes (cont'd):
3. Cluster samples into k groups, then perform an F-test to identify genes differentially expressed among the subgroups.
The data have been re-used for both clustering and identifying differentially expressed genes. You will always obtain a set of differentially expressed genes, but you cannot be sure whether it is real or due to chance.