Clustering and unsupervised learning


Introduction to Bioinformatics: Lecture VII
Clustering and Unsupervised Learning
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org
Outline of the lecture

• From information flood to knowledge: finding an inherent
“structure” in the data using an unsupervised learning approach
• Cluster analysis as a data mining and knowledge discovery approach
• The notion of similarity and its importance in clustering
• K-means and hierarchical clustering algorithms
Literature watch: “Mathematics in Biology” featured in Science
From Bayesian networks to game theory in biology ….
http://www.sciencemag.org/
Reading assignment: Friedman N., “Inferring Cellular Networks Using
Probabilistic Graphical Models”, Science 303 (2004)
Clustering and cluster analysis: general considerations

Definition The goal of cluster analysis is to group a collection of objects
into subsets, called clusters, such that the objects within each cluster are
more closely related to one another than to objects assigned to different clusters.
One classical example: gene expression data

[Figure: hierarchically clustered gene expression heatmap of 45 individuals
(33 JRA patients + 12 controls), each annotated with diagnosis (Pauci, Poly,
Syst, Spond, Control), MTX treatment status, and X-ray findings. Of 242
differentially expressed genes, 105 show significantly lower and 137
significantly higher expression in poly-articular JRA.
Picture: courtesy of B. Aronow]
Another example: clustering and hierarchically
organizing protein structures in CATH
http://www.biochem.ucl.ac.uk/bsm/cath/
Orengo et al.
Overall structure of the protein universe:
which cluster does my protein belong to?
Measuring similarity to cluster representatives …
Similarity and distance measures in clustering

• Clustering (or segmentation) of objects starts with the
choice of a similarity measure that describes proximity
between different objects
• The choice of the similarity (or dissimilarity/distance)
measure ultimately defines the outcome of the clustering
and is far more important than the choice of the actual
clustering algorithm
• Subject-specific considerations provide a suitable notion
of similarity, externally to the actual clustering
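To make this concrete, here is a minimal Python sketch (the profiles `a`, `b`, `c` are made-up toy data, not from the lecture) comparing two common dissimilarities for expression-like profiles: Euclidean distance and correlation-based dissimilarity. The two measures can rank the same pairs differently, which is why the choice shapes the clustering outcome:

```python
import math

def euclidean(u, v):
    # Straight-line distance between two profiles.
    return math.dist(u, v)

def correlation_dissim(u, v):
    # 1 - Pearson correlation: 0 for perfectly co-varying profiles, up to 2
    # for perfectly anti-correlated ones; insensitive to scale and offset.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [x - mu for x in u]
    dv = [y - mv for y in v]
    return 1.0 - sum(p * q for p, q in zip(du, dv)) / math.sqrt(
        sum(p * p for p in du) * sum(q * q for q in dv))

a, b, c = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 2.0, 1.0)
# Euclidean: c is closer to a than b is.
# Correlation: b matches a perfectly (same trend, scaled), c is its opposite.
print(euclidean(a, b), euclidean(a, c))                    # ~3.74 vs ~2.83
print(correlation_dissim(a, b), correlation_dissim(a, c))  # ~0.0 vs ~2.0
```

For gene expression data, correlation-based dissimilarity is often preferred precisely because co-regulated genes share a trend rather than absolute levels.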
Distance and similarity measures for string matching
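The slide's content is not preserved in the transcript; as one standard example of a string distance, here is a sketch of the Levenshtein edit distance (the minimum number of single-character insertions, deletions, and substitutions), computed by dynamic programming:

```python
def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, row by row.
    prev = list(range(len(t) + 1))      # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        cur = [i]                       # deleting the first i characters of s
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute (or match)
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

An edit distance like this satisfies the metric axioms, so it can feed directly into the clustering algorithms discussed next.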
Similarity and distance measures in clustering

• In general, any similarity measure can be converted into
a dissimilarity measure by applying a suitable
monotone-decreasing function
• For N objects one may define an N × N matrix D with
non-negative entries (and zero diagonal elements)
representing the dissimilarity of each pair of objects
• Some clustering algorithms assume that the dissimilarity
or distance measure is a metric

Problem Assuming that D is not symmetric, define a modified symmetric
distance measure.
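As a hint (one standard construction, not the unique answer): averaging D with its transpose yields a symmetric matrix while preserving non-negativity and the zero diagonal.

```python
def symmetrize(D):
    # D'[i][j] = (D[i][j] + D[j][i]) / 2 is symmetric, non-negative, and keeps
    # the zero diagonal; max(D[i][j], D[j][i]) is another common choice.
    n = len(D)
    return [[(D[i][j] + D[j][i]) / 2 for j in range(n)] for i in range(n)]

print(symmetrize([[0, 1], [3, 0]]))  # [[0.0, 2.0], [2.0, 0.0]]
```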
Similarity measures
Some observations regarding similarity measures
K-means heuristic as a solution to the clustering problem

• Definition For N objects and K < N postulated clusters,
find an assignment of each object to one of the clusters
that minimizes a “within cluster” point scatter cost function
• The number of possible assignments scales exponentially
(again), and in fact the clustering problem is another
instance of a global optimization problem
• K-means is an iterative greedy descent algorithm that
attempts to find a good solution (a local minimum)
• Trying different initial solutions may be a good idea
K-means algorithm
Step 1: Choose K initial cluster centers (e.g. randomly chosen data points)
and the resulting cluster assignment: each point is assigned to the closest
cluster center.
K-means algorithm
Step 2: Given the initial cluster assignment, find new cluster centers as the
geometric centers of all the data points (vectors) in each cluster. Redefine
cluster assignments using the new centers.
K-means algorithm
Iterate Step 2 until cluster assignments do not change.
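Steps 1–2 above can be sketched in a few lines of plain Python (Lloyd's algorithm; the 2-D toy points below are made-up for illustration):

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """A minimal K-means (Lloyd's algorithm) sketch for lists of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Step 1: pick K data points as centers
    assign = None
    for _ in range(n_iter):
        # Assign every point to its nearest current center.
        new_assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_assign == assign:             # assignments stopped changing
            break
        assign = new_assign
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                      # keep the old center if a cluster empties
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign

# Two well-separated toy blobs: K-means recovers them as the two clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, 2)
```

Since each iteration can only decrease the within-cluster scatter, the loop always terminates, but only at a local minimum, which is why restarting from different initial centers (different seeds here) is advisable.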
Hierarchical clustering and its applications

• As opposed to K-means, there is no predefined number
of clusters
• Instead, groups (clusters) are built iteratively from
pairwise dissimilarities (distances), such that at each level
of the hierarchy clusters are created by merging (or
splitting) clusters at the next (previous) level
• Representation of the data structure as a rooted binary
tree: a convenient analysis tool
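An agglomerative (bottom-up) pass can be sketched directly from a table of pairwise dissimilarities. This toy version (not the lecture's code) uses single linkage, i.e. the minimum cross-cluster distance, and records the merge heights used to draw the dendrogram:

```python
def single_linkage(dist, n):
    # dist maps frozenset({i, j}) -> dissimilarity between objects i and j.
    clusters = [frozenset([i]) for i in range(n)]
    merges = []                              # (cluster A, cluster B, merge height)

    def linkage(a, b):
        # Single linkage: smallest distance over all cross-cluster pairs.
        return min(dist[frozenset({i, j})] for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two currently closest clusters.
        a, b = min(((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
                   key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b, linkage(a, b)))  # height = intergroup dissimilarity
    return merges

# Toy 1-D example: objects at positions 0, 1 and 10.
d = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 10.0, frozenset({1, 2}): 9.0}
merges = single_linkage(d, 3)
# First merge joins objects 0 and 1 at height 1.0; the final merge is at 9.0.
```

Stopping the merging early (or, equivalently, cutting the dendrogram at a chosen height) yields a flat partition with any desired number of clusters.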
Hierarchical clustering: agglomerative vs. divisive strategies

One may use the height of the nodes to indicate the intergroup dissimilarity
between the two daughter nodes (subclusters at this level). One may also
order the tree (dendrogram) by flipping the branches, such that the tighter
cluster is placed e.g. to the left.

[Figure: dendrogram with horizontal cuts yielding K=2 and K=3 clusters]

The example above is just an illustration of the divisive (top-down) strategy