HCS Clustering Algorithm - Computer Science @ UC Davis

Download Report

Transcript HCS Clustering Algorithm - Computer Science @ UC Davis

HCS Clustering Algorithm
A Clustering Algorithm
Based on Graph Connectivity
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
2
The Problem
• Clustering:
– Group elements into subsets based on similarity between
pairs of elements
• Requirements:
– Elements in the same cluster are highly similar to each
other
– Elements in different clusters have low similarity to each
other
• Challenges:
– Large sets of data
– Inaccurate and noisy measurements
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
3
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
4
HCS Algorithm Overview
• Highly Connected Subgraphs Algorithm
– Uses graph theoretic techniques
• Basic Idea
– Uses similarity information to construct a
similarity graph
– Groups elements that are highly connected with
each other
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
5
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
6
HCS: Main Players
• Similarity Graph
– Nodes correspond to elements (genes)
– Edges connect similar elements (those whose similarity
value is above some threshold)
gene2
gene1
Gene1 similar to gene2
Gene1 similar to gene3
Gene2 similar to gene3
gene3
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
7
HCS: Main Players
• Edge Connectivity
– Minimum number of edges whose removal results in a
disconnected graph
gene2
gene3
Must remove 3 edges to
disconnect graph, thus has an
edge connectivity k(G) = 3
gene1
gene4
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
8
HCS: Main Players
• Edge Connectivity
– Minimum number of edges whose removal results in a
disconnected graph
gene2
gene3
Must remove 3 edges to
disconnect graph, thus has an
edge connectivity k(G) = 3
gene1
gene4
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
9
HCS: Main Players
• Edge Connectivity
– Minimum number of edges whose removal results in a
disconnected graph
gene2
gene3
Must remove 3 edges to
disconnect graph, thus has an
edge connectivity k(G) = 3
gene1
gene4
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
10
HCS: Main Players
• Highly Connected Subgraphs
– Subgraphs whose edge connectivity exceeds half the
number of nodes
gene2
gene5
gene3
gene1
gene8
Entire Graph
Nodes = 8
Edge connectivity = 1
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
Not HCS!
11
HCS: Main Players
• Highly Connected Subgraphs
– Subgraphs whose edge connectivity exceeds half the
number of nodes
gene2
gene5
gene3
gene1
gene8
Sub Graph
Nodes = 5
Edge connectivity = 3
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
HCS!
12
HCS: Main Players
• Cut
– A set of edges whose removal disconnects the graph
gene2
gene5
gene3
gene1
gene8
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
13
HCS: Main Players
• Minimum Cut
– A cut with a minimum number of edges
gene2
gene5
gene3
gene1
gene8
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
14
HCS: Main Players
• Minimum Cut
– A cut with a minimum number of edges
gene2
gene5
gene3
gene1
gene8
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
15
HCS: Main Players
• Minimum Cut
– A cut with a minimum number of edges
gene2
gene5
gene3
gene1
gene8
gene6
gene4
gene7
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
16
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
17
HCS: Algorithm (by example)
5
2
3
4
6
12
11
10
7
9
8
1
find and remove a minimum cut
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
18
HCS: Algorithm (by example)
5
Highly Connected!
2
3
4
6
12
11
10
7
9
8
1
are the resulting
subgraphs highly connected?
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
19
HCS: Algorithm (by example)
5
Cluster 1
2
3
4
6
12
11
10
7
9
8
1
repeat process on non-highly
connected subgraphs
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
20
HCS: Algorithm (by example)
5
Cluster 1
2
3
4
6
12
11
10
7
9
8
1
find and remove a minimum cut
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
21
HCS: Algorithm (by example)
5
Cluster 1
2
3
4
6
Highly Connected!
1
Highly Connected!
12
11
are the resulting
subgraphs highly connected?
10
7
9
8
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
22
HCS: Algorithm (by example)
5
Cluster 1
2
3
4
6
Cluster 2
1
Cluster 3
12
resulting clusters
11
10
7
9
8
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
23
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
else
HCS( Hi )
}
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
24
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
else
Find a minimum cut in graph G.
This returns a set of subgraphs
HCS( Hi )
{ H , … , H } resulting
from the removal of the cut set.
}
1
t
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
25
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
else
HCS( Hi )
For each subgraph…
}
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
26
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
If the subgraph is highly
else
connected, then return that
subgraph as a cluster.
HCS( Hi )
(Note: k( H ) denotes edge
connectivity of graph H , n
}
i
i
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
denotes number of nodes)
27
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
Otherwise, repeat the algorithm
else
on the subgraph.
(recursive function)
HCS( Hi )
This continues until there are
}
no more subgraphs, and all
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
clusters have been found.
28
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > n ÷ 2
return Hi
Running time is bounded by
else
2N × f( n, m ) where N is
the number of clusters found,
HCS( Hi )
and f( n, m ) is the time
complexity of computing a
}
minimum cut in a graph with n
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
nodes and m edges.
29
HCS: Algorithm
HCS( G ) {
MINCUT( G ) = { H1, … , Ht }
for each Hi, i = [ 1, t ] {
if k( Hi ) > Deterministic
n ÷ 2 for
Graph:
return Un-weighted
Hi
takes O(nm) steps
else
where n is the number
nodes and m is the
HCS( Hiof )
number of edges
}
}
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
30
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
31
HCS: Properties
• Homogeneity
– Each cluster has a diameter of at most 2
• Distance is the minimum length path between two nodes
– Determined by number of EDGES traveled between nodes
• Diameter is the longest distance in the graph
– Each cluster is at least half as dense as a clique
• Clique is a graph with maximum possible edge connectivity
a
f
e
Dist( a, d ) = 2
Dist( a, e ) = 3
Diam( G ) = 4
b
c
d
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
clique
32
HCS: Properties
• Separation
– Any non-trivial split is unlikely to have diameter of two
– Number of edges removed by each iteration is linear in
the size of the underlying subgraph
• Compared to quadratic number of edges within final clusters
• Indicates separation unless sizes are small
• Does not imply number of edges removed overall
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
33
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
34
HCS: Improvements
2
3
4
6
12
11
10
7
1
8
Choosing between cut sets
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
35
HCS: Improvements
2
3
4
6
12
11
10
7
1
8
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
36
HCS: Improvements
2
3
4
6
12
11
10
7
1
8
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
37
HCS: Improvements
• Iterated HCS
– Sometimes there are multiple minimum cuts to
choose from
• Some cuts may create “singletons” or nodes that
become disconnected from the rest of the graph
– Performs several iterations of HCS until no new
cluster is found (to find best final clusters)
• Theoretically adds another O(n) factor to running time,
but typically only needs 1 – 5 more iterations
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
38
HCS: Improvements
• Remove low degree nodes first
– If node has low degree, likely will just be
separated from rest of graph
– Calculating separation for those nodes is
expensive
– Removal helps eliminate unnecessary iterations
and significantly reduces running time
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
39
Presentation Outline
• The Problem
• HCS Algorithm Overview
– Main Players
– General Algorithm
– Properties
– Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
40
Conclusion
• Performance
– With improvements, can handle problems with
up to thousands of elements in reasonable
computing time
– Generates clusters with high homogeneity and
separation
– More robust (responds better when noise is
introduced) than other approaches based on
connectivity
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
41
References
“A Clustering Algorithm
based on Graph Connectivity”
By Erez Hartuv and Ron Shamir
March 1999 ( Revised December 1999)
http://www.math.tau.ac.il/~rshamir/papers.html
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
42