Transcript Clustering

Clustering
GLBIO ML workshop
May 17, 2016
Ivan Kryukov and Jeff Wintersinger
Introduction
Why cluster?
Goal: given data points, group them by common properties
What properties do they share?
Example of unsupervised learning -- no ground truth against which we can compare
Sometimes we want a small number of clusters broadly summarizing trends
Sometimes we want a large number of homogeneous clusters, each with only a few members
Image source: Wikipedia
Our problem
We have single-cell RNA-seq data for 271 cells across 575 genes
Cells sampled at 0 h, 24 h, 48 h, 72 h
Image source: Trapnell (2014)
Do cells at the same timepoint show the same gene expression?
If each cluster consists of only cells from the same timepoint, then the answer is yes
K-means
K-means clustering
Extremely simple clustering algorithm, but can be quite effective
One of two clustering algorithms we will discuss
You must define the number of clusters K you want
Image source: Wikipedia
K-means clustering: step 1
We’re going to create three clusters
So, we randomly place three centroids amongst our data
Image source: Wikipedia
K-means clustering: step 2
Assign every data point to its closest centroid
Image source: Wikipedia
K-means clustering: step 3
Move each centroid to the centre of all the data points belonging to its cluster
Now go back to step 2 and iterate
Image source: Wikipedia
K-means clustering: step 4
When no data points change assignments, you’re done!
Note that, depending on where you place your centroids at the start, your results may differ
Image source: Wikipedia
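To make the steps concrete, here is a minimal k-means sketch using scikit-learn; the simulated data (make_blobs) and parameter choices are illustrative, not the exact code from the workshop notebook.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Simulate some clustered data (illustrative stand-in for the workshop data).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K must be chosen up front; n_init restarts guard against a bad initial placement of centroids.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final centroid positions
print(labels[:10])          # cluster assignments of the first 10 points

Because different random initializations can give different results, n_init reruns the algorithm several times and keeps the best solution.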
Gaussian mixture models
Gaussian mixture model clustering
We will fit a mixture of Gaussians using expectation maximization
Each Gaussian has parameters describing mean and variance
GMM step 1
Initialize with a Gaussian for each cluster, using random means and variances
GMM step 2
Calculate expectation of cluster membership for each point
Not captured by figure: these are soft assignments
GMM step 3
Choose parameter values that maximize the likelihood of the observed assignment of points to clusters
GMM step 4
Once you converge, you’re done!
Let’s cluster simulated data using a GMM!
Once more, to the notebook!
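A minimal GMM sketch with scikit-learn follows; as above, the simulated data and parameters are illustrative rather than the notebook’s exact code.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit a mixture of three Gaussians by expectation maximization.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)             # most likely cluster for each point
soft_assignments = gmm.predict_proba(X)  # per-cluster membership probabilities (the soft assignments)
print(soft_assignments[:5].round(3))

Unlike k-means, each point gets a probability of belonging to every cluster, which is what step 2’s soft assignments refer to.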
Evaluating clustering success
Evaluating clustering success
How do we evaluate clustering?
For supervised learning, we can examine accuracy, precision-recall curve, etc.
Two types of evaluation: extrinsic and intrinsic
Extrinsic measure: compare your clusters relative to ground-truth classes
This is similar to supervised learning, in which you know the “correct” answer for some of your data
For our RNA-seq data, we know what timepoint each cell came from
But if gene expression isn’t consistent between cells in the same timepoint, the data won’t cluster well -- this is a problem with the data, not the clustering algorithm
Extrinsic metric: V-measure
V-measure: average of homogeneity and completeness, both of which are desirable
Homogeneity: for a given cluster, do all the points in it come from the same class?
Completeness: for a given class, are all its points placed in one cluster?
Achieving good V-measure scores:
Perfect homogeneity, perfect completeness: your clustering matches your classes perfectly
Perfect homogeneity, horrible completeness: every single point is placed in its own cluster
Perfect completeness, horrible homogeneity: all your points are placed in just one cluster
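These cases can be checked directly with scikit-learn’s metrics; the toy class and cluster labels below are made up for illustration.

from sklearn.metrics import homogeneity_score, completeness_score

classes = [0]*4 + [1]*4 + [2]*4      # ground-truth classes for 12 points

perfect    = [0]*4 + [1]*4 + [2]*4   # clusters match classes exactly
singletons = list(range(12))         # every point in its own cluster
lumped     = [0]*12                  # all points in a single cluster

for name, clusters in [("perfect", perfect), ("singletons", singletons), ("lumped", lumped)]:
    print(name, homogeneity_score(classes, clusters), completeness_score(classes, clusters))

# "perfect" scores 1.0 on both; "singletons" keeps homogeneity at 1.0 but loses
# completeness; "lumped" keeps completeness at 1.0 but loses homogeneity.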
Calculating homogeneity
Homogeneity and completeness are defined in terms of entropy, a numeric measure of uncertainty
Both values lie on the [0, 1] interval
If I tell you what points went in a given cluster -- e.g., “for cluster 1, cells 19, 143, and 240 are in it” -- and you know with certainty the class of all points in that cluster -- “oh, that’s the T = 24 h timepoint” -- then the cluster is homogeneous
Calculating completeness
If I tell you what points are in a given class -- “the T = 48 h timepoint has cells 131, 179, and 221” -- and you know with certainty what cluster they belong to -- “oh, those cells are all in the second cluster” -- then that class is complete with respect to the clustering
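A by-hand sketch of the standard entropy-based definitions (homogeneity h = 1 - H(class | cluster) / H(class), completeness c = 1 - H(cluster | class) / H(cluster)); the toy labels are made up, and the code assumes non-degenerate labelings so the entropies are nonzero.

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector, in nats.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def conditional_entropy(a, b):
    # H(a | b): entropy of a within each group defined by b, weighted by group size.
    a, b = np.asarray(a), np.asarray(b)
    return sum((b == v).mean() * entropy(a[b == v]) for v in np.unique(b))

classes  = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]

h = 1 - conditional_entropy(classes, clusters) / entropy(classes)
c = 1 - conditional_entropy(clusters, classes) / entropy(clusters)
print(round(h, 3), round(c, 3))  # agrees with sklearn’s homogeneity_score / completeness_score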
Now that we have homogeneity and completeness
...V-measure is just the (harmonic) mean of homogeneity and completeness
Why the harmonic mean rather than the arithmetic mean?
If h = 1 and c = 0, then the arithmetic mean is 0.5
This is the degenerate case where each point goes to its own cluster
But with the same values, the harmonic mean is 0, which better represents the quality of the clustering
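In code, the harmonic mean looks like this; the scores passed in are hypothetical values, not measurements from the workshop data.

def v_measure(h, c):
    # Harmonic mean of homogeneity and completeness; defined as 0 when both are 0.
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

print(v_measure(1.0, 0.0))   # 0.0 -- the degenerate clustering scores poorly
print((1.0 + 0.0) / 2)       # 0.5 -- the arithmetic mean would overstate its quality
print(v_measure(0.8, 0.9))   # ~0.847

(scikit-learn exposes this directly as sklearn.metrics.v_measure_score)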
Intrinsic measure: silhouette score
Silhouette score: for each point, compare its average distance to points in its own cluster against its average distance to points in the nearest other cluster -- no ground-truth classes needed
Example low silhouette score
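A minimal sketch of computing the silhouette score with scikit-learn; the data and clustering below are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 (points sit closer to other clusters) to +1 (tight, well-separated clusters).
print(silhouette_score(X, labels))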
Let’s see how well our simulated data is clustered!
Notebook time! Hooray!
Why does k-means do better than GMM?
Our data were generated via Gaussians
Exercise: generate more complex simulated data, evaluate performance
The curse of dimensionality
What is the curse of dimensionality?
You have a straight line 100 metres long. Drop a penny on it. Easy to find!
You have a square 100 m * 100 m. Drop a penny inside it. Harder to find
Like two football fields put next to each other
You have a building 100 m * 100 m * 100 m. Drop a penny in it
Now you’re searching inside a 30-storey building the size of a football field
Your life sucks
The point: intuition of what works in two or three dimensions breaks down as we move to much higher-dimensional spaces
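A small numeric illustration of the same idea (an addition for this write-up, not part of the workshop notebook): as the dimension grows, distances between random points concentrate, so “near” and “far” become harder to distinguish.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                             # 500 random points in the d-dimensional unit cube
    dists = np.linalg.norm(X[:250] - X[250:], axis=1)    # 250 random pairwise distances
    print(d, round(float(dists.std() / dists.mean()), 3))  # relative spread shrinks as d grows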