May 9th, 2014
Analyzing biological data remains a large frontier:
– Organize and serve data
– Develop tools to analyze it
– Interpret results
• Clustering’s goal is to take large amounts of data
and group it into similarity classes.
• Viewing and analyzing vast amounts of biological
data as a whole can be a challenging job. It is easier
to interpret the data if it is partitioned into clusters
combining similar data points.
• Clustering arose from researchers wanting to know
the functions of newly sequenced genes. Comparing a
new gene's sequence to already known DNA sequences
does not often give away the function of the gene. In
fact, the functionality of 40% of sequenced genes
cannot be determined by only comparing their
sequences to those of known genes.
• Biologists infer gene functions by using microarrays.
A microarray is a grid of DNA segments of known
sequence that is used to test and map DNA
fragments. Microarrays measure the expression
levels of genes under varying conditions.
• Expression level is estimated by measuring the
amount of mRNA for that particular gene. Microarray
data are transformed into an intensity matrix, which
allows biologists to make correlations between
genes. This is where clustering comes in handy.
• Each microarray data point is plotted as a point in
an N-dimensional space. Clustering builds a
distance matrix holding the distances between every
pair of gene points. Genes that are close together
share similar characteristics, and the outcome is
groups of clusters with related functionality.
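The step above can be sketched in code: turn an expression intensity matrix into a distance matrix, where each gene is a point in N-dimensional space (one dimension per condition) and entry (i, j) is the Euclidean distance between genes i and j. The intensity values below are illustrative, not from the slides.

```python
import math

def distance_matrix(intensity):
    # intensity: rows = genes, columns = expression levels per condition
    n = len(intensity)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between gene i and gene j in N-dim space
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(intensity[i], intensity[j])))
            d[i][j] = d[j][i] = dist
    return d

# hypothetical expression levels for three genes under three conditions
intensity = [
    [10.0, 8.0, 10.0],   # gene 1
    [10.0, 0.0, 9.0],    # gene 2
    [4.0, 8.5, 3.0],     # gene 3
]
d = distance_matrix(intensity)
# genes with similar expression profiles end up close together in d
```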
[Microarray figure: each box represents one gene's expression.
Red: expressed in the experimental cell; yellow: expressed in
both samples; black: expressed in neither sample.]
• Hierarchical clustering is a method which seeks to
build a hierarchy of clusters. This is done in two
ways, agglomerative and divisive.
• Agglomerative starts with every element in its own
cluster and iteratively joins clusters together. The
time complexity of the standard agglomerative
algorithm is O(n^3).
• Divisive starts with one cluster and iteratively divides
it into smaller clusters. Divisive clustering with an
exhaustive search is O(2^n), which is even worse.
However, for some special cases, optimal efficient
agglomerative methods (of complexity O(n^2)) are
known: SLINK for single-linkage and CLINK for
complete-linkage clustering.
• Hierarchical clustering is often used to reveal
structure in the data. The minimum distance between
two clusters is the distance between the closest pair
of their elements, while the average distance is taken
over all pairs of elements.
This is a good example of clustering: it satisfies
both homogeneity and separation.
Hierarchical Clustering (d, n)
  Form n clusters, each with one element
  Construct a graph T by assigning one vertex to each cluster
  while there is more than one cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    Compute the distance from C to all other clusters
    Add a new vertex C to T and connect it to vertices C1 and C2
    Remove the rows and columns of d corresponding to C1 and C2
    Add a row and column to d corresponding to the new cluster C
dmin(C, C*) = min d(x, y) over all x in C, y in C*
davg(C, C*) = (1 / (|C| |C*|)) ∑ d(x, y) over all x in C, y in C*
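The hierarchical clustering procedure above can be sketched in a few lines of Python, using the dmin (closest-pair) cluster distance. This is a minimal illustration on a precomputed distance matrix; the tree structure and toy data are assumptions, not from the slides.

```python
def hierarchical_clustering(d, n):
    # d: dict mapping frozenset({i, j}) -> distance; n: number of elements
    clusters = [frozenset([i]) for i in range(n)]

    # dmin linkage: minimum distance over all element pairs
    def dmin(c1, c2):
        return min(d[frozenset([x, y])] for x in c1 for y in c2)

    tree = []  # records each merge as (cluster1, cluster2, merged cluster)
    while len(clusters) > 1:
        # find the two closest clusters
        c1, c2 = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: dmin(*pair),
        )
        merged = c1 | c2
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(merged)
        tree.append((c1, c2, merged))
    return tree

# toy example: three points on a line at positions 0, 1, and 5
pts = [0, 1, 5]
d = {frozenset([i, j]): abs(pts[i] - pts[j])
     for i in range(3) for j in range(i + 1, 3)}
tree = hierarchical_clustering(d, 3)
# points 0 and 1 (distance 1) merge first, then the pair joins point 2
```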
Squared Error Distortion:
d(V, X) = ∑ d(v_i, X)^2 / n, where d(v_i, X) is the distance
from point v_i to the closest center in X
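The distortion formula above can be computed directly: the mean squared distance from each point to its closest center. The points and centers below are illustrative values.

```python
def squared_error_distortion(points, centers):
    # squared Euclidean distance between two equal-length tuples
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    # for each point take its closest center, then average over all points
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(1.0, 0.0), (10.0, 0.0)]
# squared distances to the closest center are 1, 1, 0,
# so the distortion is (1 + 1 + 0) / 3
```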
K-Means Clustering and Lloyd's Algorithm
• K-Means clustering is a technique to partition a set of N
points into K clusters.
• 1-Means Clustering problem is easy. However, it becomes
very difficult (NP-complete) for more than one center.
• An efficient heuristic method for K-Means clustering is the
Lloyd algorithm. It differs from K-Means in that its input
is a continuous geometric region rather than a discrete
set of points. Lloyd's algorithm, while fast, may move
many data points in each iteration without necessarily
improving convergence.
• A better overall clustering cost can be achieved with
Progressive Greedy K-Means clustering. This method
moves one point at a time.
• A comparison between the two concludes that
Lloyd's method is more efficient in run-time, while the
greedy method can achieve a better clustering cost.
K-Means Clustering: Lloyd
1. Arbitrarily assign the k cluster centers
2. while the cluster centers keep changing
3. Assign each data point to the cluster Ci
   corresponding to the closest cluster
   representative (center) (1 ≤ i ≤ k)
4. After the assignment of all data points,
   compute new cluster representatives
   according to the center of gravity of each
   cluster; that is, the new representative of
   cluster C is (∑ v) / |C| over all v in C
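The Lloyd iteration above can be sketched for one-dimensional points. The initial centers and data here are illustrative; a real implementation would handle N-dimensional points and smarter initialization.

```python
def lloyd_kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # step 3: assign each point to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # step 4: move each center to the center of gravity of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # step 2: stop once the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# two obvious groups: {1, 2, 3} and {10, 11, 12}
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = lloyd_kmeans(points, [1.0, 2.0])
# the centers converge to the means of the two groups: 2.0 and 11.0
```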
Progressive “Greedy” K-Means
K-Means Execution Example
• A distance matrix can
be turned into a
distance graph: genes
become the vertices,
and edges connect
genes that are close
together. Ideally the
clusters appear as
cliques (clique graphs).
A problem with real
data is that the clique
graph is corrupted,
which can be solved
heuristically with the
CAST algorithm.
Other Clustering Uses
• Clustering has a few more applications.
Clustering can be used on highly homologous
sequences to reduce the size of large protein
databases. It reduces the complexity and
speeds up comparisons.
• Clustering can be used to evaluate protein-protein
interaction networks. The structure of protein
interactions can be represented by a graph, and
clustering then looks for clusters in that graph.
• It can also be used to improve protein structure
prediction by merging the predictions made by a
large number of alternative methods.
• Need a way to compare newly
sequenced genes with known sequences
• Quadratic local alignment is too slow
for database-scale searches
• BLAST's heuristic approach instead
allows much faster searches in practice
• Step 1: Removal of low-complexity
regions from the query
• Step 2: Splitting the query into short words
• Step 3: All words scoring above a threshold
are incorporated into the query list and
placed in an efficient search tree
• Step 4: DB queried for matches
• Step 5: Verification/Extension of matches
– Ungapped BLAST
– Gapped BLAST
• Step 6: Verifies threshold and
returns ranked results
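The word-matching core of the steps above (splitting the query into words and querying the database for matches) can be sketched as a toy seed finder. This is a simplification: the word size and sequences are illustrative, and real BLAST also scores words against a threshold and extends each seed into an alignment.

```python
def find_seeds(query, database, w=4):
    # index every w-letter word position in the database sequence
    index = {}
    for i in range(len(database) - w + 1):
        index.setdefault(database[i:i + w], []).append(i)
    # split the query into overlapping words and look up exact matches
    seeds = []
    for j in range(len(query) - w + 1):
        for i in index.get(query[j:j + w], []):
            seeds.append((j, i))  # (query position, database position)
    return seeds

seeds = find_seeds("GATTACA", "TTGATTACATT", w=4)
# each seed marks an exact 4-letter match that would then be
# extended left and right into an ungapped (and later gapped) hit
```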
• NCBI: http://blast.ncbi.nlm.nih.gov/
Future of BLAST