
Clustering
• Relevance to Bioinformatics
– Array(s) analysis
– Examples
• Principal Component Analysis
• Clustering Algorithms
Lecture 13, CS567
Bioinformatic arrays
• Array = View of biological (sub) system(s)
in terms of a vector of attributes
• Arrays at different resolutions
– Ecosystem:
• Spectrum of life in a pond
– Populations:
• Characteristics of a population of species
– Organism:
• “FBI profile” (multiple attribute types)
• Single Nucleotide Polymorphism (SNP) profile
(single attribute type)
Bioinformatic arrays
• Arrays at different resolutions
– Cell:
• mRNA expression array in cancer cell
– Cell subcompartment:
• Cytosolic protein array
• Multiple arrays
– Trivially, replicates
• Increase signal/noise ratio
– Under different conditions
• Useful for analysis of
– Relationships between conditions
– Relationships between attributes
Bioinformatic clusters
• Genes {g} that are turned on together in
condition (e.g., disease, phase of life) x
• Conditions {x} in which gene g is turned on
• Sequence/subsequence families
• Structure/substructure families
• Hierarchical classification of molecules/systems
• Group of people sharing a disease
• Group of diseases common to a people/species
Principal Component Analysis
• Not really clustering but of some use in analysis
of multidimensional data
• Grounds for PCA
– Humans are restricted to viewing only 3 dimensions at a
time; PCA renders a lower-dimensional
approximation of a high-dimensional space
– PCA highlights attributes showing maximum
variation and helps in ignoring attributes showing
minimal variation
Principal Component Analysis
• Project data from an N-dimensional space to an M-dimensional
space (sketched in code below) such that
– M << N
– The first principal component is aligned with the “direction”
of maximal variability
– Data remaining after removal of the projection along a principal
component is “residual” data, with “residual” variance
– Each component is derived in ranked order of capturing
residual variance
– All PCs are orthogonal
• Limitations
– Different clusterings may look alike by PCA
– The combination of attributes that constitutes a PC may not be
biologically intuitive
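The projection itself is a few lines of linear algebra. Below is a minimal sketch in Python/NumPy (the language used for all sketches in this transcript) that extracts the top M components via the SVD of the centered data matrix; the 100 x 50 data matrix is synthetic.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components (a sketch)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    # Rows of Vt are the principal axes, ranked by captured variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # top-M orthogonal directions
    projected = Xc @ components.T           # M-dimensional coordinates
    explained = (S ** 2) / (S ** 2).sum()   # fraction of variance per PC
    return projected, explained[:n_components]

# Synthetic data: 100 samples (arrays) x 50 attributes (e.g., genes)
X = np.random.rand(100, 50)
Y, var_frac = pca(X, n_components=2)        # 100 x 2 view for plotting
```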
Clustering
• Ultimate goal:
– “Birds of a feather flock together; we be brothers thicker than blood”
– Minimum similarity within a cluster should exceed maximum similarity
between clusters
– Maximum distance within a cluster should be less than minimum distance
between clusters
• Clustering: Classifying data without class labels
• Supervised: Classes are known (“Classifying Olympic athletes
who are missing their country tags”)
• Unsupervised: Classes not known (“Classifying forms of life on
an alien planet”)
• Hybrid: Combination of above (“Find classes based on sample
of data, followed by classifying remaining data into the newly
defined classes”)
The notion of Distance
• Definition
– Positivity
– Symmetry (why relative entropy is not a distance: “one-way
traffic is not really a distance!”)
– Triangle inequality (“‘Beam me up, Scotty’ route not a
distance”)
• Getting from a to b via some c never costs less than getting from a
to b directly
– Self distance = 0 (“If you are trying to find yourself, just
open your eyes – you are right here!”)
• Examples of true distances
– Minkowski distances (déjà vu; spot-checked in the sketch below)
– Distance derived from the Pearson correlation coefficient, e.g.,
sqrt(2(1 - r)) for profiles normalized to zero mean and unit
length (the raw correlation itself is a similarity, not a metric)
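A quick sketch of a Minkowski distance with spot-checks of the axioms above; the points are arbitrary examples.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

a, b, c = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([1.0, 1.0])
assert minkowski(a, a) == 0.0                                # self-distance
assert minkowski(a, b) > 0                                   # positivity
assert minkowski(a, b) == minkowski(b, a)                    # symmetry
assert minkowski(a, b) <= minkowski(a, c) + minkowski(c, b)  # triangle inequality
```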
Representing Distances
• Absolute measures
– Measures that can be defined without reference to
other data points
– Examples: Height, Weight
• Pairwise distances
– Defined with reference to pair of data points
– May or may not be derivable from absolute
measures
– Inversely related to similarity between points
– Example: difference in height (see the sketch below)
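For instance, a pairwise distance matrix derived from an absolute measure (the heights are made up):

```python
import numpy as np

heights = np.array([150.0, 160.0, 175.0, 182.0])   # absolute measures (cm)
D = np.abs(heights[:, None] - heights[None, :])    # D[i, j] = |h_i - h_j|
```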
Distances in bioinformatics
• Absolute
– (sub)Sequence Length/Molecular weight
• Pairwise
– Between primary structures
• Based on probabilistic similarity scores obtained from
pairwise sequence alignment
• Based on expectations of similarity scores
• Examples: Clustering based on BLAST searches
– Between secondary/tertiary structures
• Root mean square (RMS) difference between pairs of 3D coordinate
arrays (sketched below)
• Examples: Structural classes in the CATH database
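A minimal sketch of the RMS difference between two 3D coordinate arrays. A real structural comparison (as in CATH-style classification) would first optimally superimpose the structures, e.g., via the Kabsch algorithm; this sketch assumes pre-aligned, hypothetical coordinates.

```python
import numpy as np

def rmsd(P, Q):
    """RMS deviation between two (n_atoms x 3) coordinate arrays,
    assuming they are already superimposed."""
    P, Q = np.asarray(P), np.asarray(Q)
    return float(np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1))))

P = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]]  # hypothetical atoms
Q = [[0.1, 0.0, 0.0], [1.4, 0.1, 0.0], [3.1, -0.1, 0.0]]
print(rmsd(P, Q))
```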
Data Types
• Real numbers
– Interval-scaled
• Linear scale
– Ratio-scaled
• Non-linear scale
– Fine-grained resolution
• Nominal
– Named categories
– Distance typically Boolean (“You are either a Muggle or a
Wizard!”)
– Coarse distance
• Ordinal
– Named categories with implicit ranking (“Distance from
janitor to CEO > distance from President to CEO, not
counting the .com boom years”)
– Distance depends on relative ranks (intermediate granularity; see
the sketch below)
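The three granularities can be made concrete with toy distance functions; the rank ladder below is hypothetical.

```python
def interval_distance(x, y):
    """Interval-scaled: distance on a linear scale (fine-grained)."""
    return abs(x - y)

def nominal_distance(x, y):
    """Nominal: Boolean match/mismatch ("Muggle or Wizard")."""
    return 0 if x == y else 1

def ordinal_distance(rank_x, rank_y, n_ranks):
    """Ordinal: normalized rank difference (intermediate granularity)."""
    return abs(rank_x - rank_y) / (n_ranks - 1)

# Hypothetical ladder: janitor=0, ..., president=3, CEO=4
print(ordinal_distance(0, 4, 5))  # janitor to CEO -> 1.0
print(ordinal_distance(3, 4, 5))  # president to CEO -> 0.25
```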
Clustering algorithms
• K-means (centroid-based)
– Algorithm (sketched in code below)
1. Pick k points as initial cluster centers
2. For each point i, find Dmin(i,k) and assign i to the nearest center k
3. Replace the k points by the centroids (means) of each cluster
4. Calculate the mean square distance from the cluster centers
5. Iterate 2-4 till membership of points does not change OR the
quantity in 4 is minimized
– Scales well
– Limitations:
• Number of clusters predetermined
• Clusters always spherical in shape
• Adversely affected by outliers (because centroids are virtual data
points, à la “Simone”)
• Mean needs to be definable
• Only a local minimum is found
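A minimal k-means sketch following steps 1-5 (illustrative rather than production code; initialization and empty-cluster handling are simplified):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster rows of X (n_points x n_attributes, float) into k groups."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                # step 5: membership stable
        labels = new_labels
        # Step 3: replace each center by its cluster's centroid (mean)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Step 4: mean square distance from centers (the local objective)
    mse = np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
    return labels, centers, mse

labels, centers, mse = kmeans(np.random.rand(200, 2), k=3)
```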
Clustering algorithms
• K-means (centroid) – EM variation
– Algorithm (a soft-assignment sketch follows below)
1. Pick k points
2. For each point i, find all D(i,k) and assign i to each cluster k
probabilistically
3. Replace the k points by the probability-weighted centroids
(means) of each cluster
4. Calculate the mean square distance from the cluster centers
5. Iterate 2-4 till the expectation of the error function is minimized
(likelihood is maximized)
– Better minimum found (Why?)
– Limitations:
• Number of clusters predetermined
• Clusters always spherical in shape
• Adversely affected by outliers (because centroids are virtual data
points)
• Mean needs to be definable
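Full EM would fit a Gaussian mixture; the sketch below is the simpler “soft k-means” reading of step 2, where beta is an assumed stiffness parameter (large beta approaches hard k-means):

```python
import numpy as np

def soft_kmeans(X, k, beta=5.0, n_iter=50, seed=0):
    """EM-flavored k-means: each point belongs to every cluster
    with a responsibility weight instead of a hard label."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # E-step: probabilistic assignment from distance to each center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-beta * d2)
        r = w / w.sum(axis=1, keepdims=True)          # responsibilities
        # M-step: centers become responsibility-weighted means
        centers = (r.T @ X) / r.sum(axis=0)[:, None]
    return r, centers

r, centers = soft_kmeans(np.random.rand(200, 2), k=3)
```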
Clustering algorithms
• K-medoids (use actual data points, “medoids”, as cluster centers
rather than means)
– Algorithm (sketched in code below)
1. Pick k points
2. For each point i, find Dmin(i,k) and assign i to the nearest medoid k
3. Calculate the mean square distance Cold from the cluster centers
4. Replace one of the k medoids by a random point from its cluster,
calculating Cnew every time the cluster structure is changed
5. If Cnew < Cold, then accept the new clustering
6. Iterate till membership of points does not change OR C is
minimized
– Outliers handled well
– Limitations:
• Number of clusters predetermined
• Clusters always spherical in shape
• Doesn’t scale as well as k-means
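A sketch of the random-swap loop in steps 1-6 (PAM-flavored); for simplicity the swap candidate is any non-medoid point rather than one drawn per cluster, and the full pairwise distance matrix is precomputed, which is part of why this scales worse than k-means:

```python
import numpy as np

def kmedoids(X, k, n_trials=500, seed=0):
    """Medoids are actual data points, so outliers cannot drag a center."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairs
    medoids = rng.choice(n, size=k, replace=False)             # step 1
    cost_old = D[:, medoids].min(axis=1).sum()                 # steps 2-3
    for _ in range(n_trials):
        trial = medoids.copy()                  # step 4: propose a swap
        trial[rng.integers(k)] = rng.integers(n)
        if len(set(trial)) < k:
            continue                            # skip duplicate medoids
        cost_new = D[:, trial].min(axis=1).sum()
        if cost_new < cost_old:                 # step 5: accept if better
            medoids, cost_old = trial, cost_new
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

labels, medoids = kmedoids(np.random.rand(100, 2), k=3)
```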
Clustering algorithms
• Hierarchical clustering (tree building – average
linkage clustering)
– Algorithm (sketched in code below)
1. Find the points i and j that are the closest and unite them
with a node
2. Treat i and j as a single cluster by taking their average
3. Iterate 1-2 till all points are included in the tree
– Biologically relevant results
– Limitations:
• Solution frequently suboptimal
• Cluster boundaries not clear
• Linear orderings may not correspond to similarity
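Average-linkage trees are conveniently built with SciPy’s hierarchical clustering; a sketch on a hypothetical expression matrix (genes x conditions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(6, 4)   # 6 genes x 4 conditions (synthetic)

# Repeatedly merge the two closest clusters, where cluster-to-cluster
# distance is the average of all pairwise distances between them
Z = linkage(X, method='average', metric='euclidean')

# Cutting the tree into flat clusters is a separate choice; one reason
# cluster boundaries are not clear (limitation above)
labels = fcluster(Z, t=2, criterion='maxclust')
```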
Clustering algorithms
• Single linkage clustering
– Algorithm (sketched in code below)
1. If D(i,j) < cut-off, then include i and j in the same cluster
2. If i in cluster x is within the cut-off distance of j in cluster y,
then merge x and y into a single cluster
– Handles non-spherical clusters well
– Limitations:
• Rarely the method of choice on its own
• Doesn’t scale well
• Essentially assumes transitivity in similarity (“Saudi
princely clan, Indian joint family, Godfather’s clan”)
• Clusters not compact, but scraggly
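A sketch of single linkage on two synthetic crescent-shaped point sets, again via SciPy; the 0.3 cut-off is an assumed value. Because each point only needs to be within the cut-off of its nearest neighbor, the clusters are chains that follow the crescents, which is exactly the transitivity (and scraggliness) noted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two interleaved crescents: non-spherical clusters
t = np.linspace(0, np.pi, 50)
X = np.vstack([np.c_[np.cos(t), np.sin(t)],
               np.c_[1 + np.cos(t), 0.3 - np.sin(t)]])

Z = linkage(X, method='single')                     # nearest-neighbor merges
labels = fcluster(Z, t=0.3, criterion='distance')   # cut-off distance
```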