Transcript Clustering

Data Mining:
Concepts and Techniques
— Chapter 6 —
Clustering
April 9, 2016
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
2
Clustering: Applications
• Pattern Recognition
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
3
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
4
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure
used by the method and its implementation
• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
5
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a
cluster.
• The definitions of distance functions are usually very different for different kinds
of variables.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
6
Requirements of Clustering in Machine Learning
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Minimal requirements for domain knowledge to determine
input parameters
• Ability to deal with noise and outliers
• High dimensionality
• Interpretability and usability
7
Similarity and Dissimilarity Between Objects
There are several methods to measure the similarity (distance) between objects (see the KNN slides).
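As a refresher, here is a minimal sketch (assuming numeric feature vectors) of two common distance measures; the function names are illustrative only:

    import math

    def euclidean_distance(x, y):
        # Straight-line distance: square root of the sum of squared differences
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def manhattan_distance(x, y):
        # City-block distance: sum of absolute differences
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    print(euclidean_distance([0, 0], [3, 4]))   # 5.0
    print(manhattan_distance([0, 0], [3, 4]))   # 7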
8
Dissimilarity between Binary Variables
• A contingency table for binary data (rows: object i, columns: object j):

                        Object j
                          1        0        sum
        Object i    1     a        b        a + b
                    0     c        d        c + d
                  sum   a + c    b + d        p

• Distance measure for symmetric binary variables:

        d(i, j) = (b + c) / (a + b + c + d)

• Distance measure for asymmetric binary variables:

        d(i, j) = (b + c) / (a + b + c)
9
Dissimilarity between Binary Variables
• Example:

        Name    Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
        Jack    M        Y       N       P        N        N        N
        Mary    F        Y       N       P        N        P        N
        Jim     M        Y       P       N        N        N        N

  – gender is a symmetric attribute
  – the remaining attributes are asymmetric binary
  – treating the values Y and P as 1 and the value N as 0, the asymmetric distance gives:

        d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
        d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
        d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
10
Nominal Variables
• A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green
• Method 1: Simple matching (see the sketch below)
  – m: # of matches, p: total # of variables

        d(i, j) = (p − m) / p

• Method 2: use a large number of binary variables
  – creating a new binary variable for each of the M nominal states
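A minimal sketch of both methods (function names illustrative):

    def simple_matching_dissimilarity(x, y):
        # Method 1: d(i, j) = (p - m) / p, where m = # of matches, p = # of variables
        p = len(x)
        m = sum(1 for xi, yi in zip(x, y) if xi == yi)
        return (p - m) / p

    def one_hot(value, states):
        # Method 2: one binary variable per nominal state
        return [1 if value == s else 0 for s in states]

    print(simple_matching_dissimilarity(["red", "blue"], ["red", "green"]))   # 0.5
    print(one_hot("yellow", ["red", "yellow", "blue", "green"]))              # [0, 1, 0, 0]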
11
Major Clustering Approaches (I)
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: Graph Partitioning, k-means, k-medoids
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, ROCK
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS
12
Major Clustering Approaches (II)
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Typical methods: STING, WaveCluster, CLIQUE
• Model-based approach:
  – A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
  – Typical methods: EM, SOM, COBWEB
• Frequent pattern-based approach:
  – Based on the analysis of frequent patterns
  – Typical methods: pCluster
13
An Example: Graph Partitioning(1)
• Sessions:
        S1: a, c, b, d, c
        S2: b, d, e, a
        S3: e, c
        [...]
        Sn: b, e, d
• Build a co-occurrence matrix M from the sessions, then create an undirected graph based on matrix M
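The slide does not show matrix M itself, so the following is only a hedged sketch; it assumes M[(i, j)] counts the number of sessions in which items i and j occur together:

    from collections import defaultdict
    from itertools import combinations

    sessions = [
        ["a", "c", "b", "d", "c"],   # S1
        ["b", "d", "e", "a"],        # S2
        ["e", "c"],                  # S3
        ["b", "e", "d"],             # Sn
    ]

    # Assumed definition: M[(i, j)] = number of sessions containing both i and j
    M = defaultdict(int)
    for s in sessions:
        for i, j in combinations(sorted(set(s)), 2):
            M[(i, j)] += 1

    # Edges of the undirected graph: every pair with a nonzero co-occurrence count
    edges = [(i, j, w) for (i, j), w in M.items() if w > 0]
    print(edges)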
14
An Example: Graph Partitioning(2)
• From the co-occurrence matrix, build the undirected graph and partition it into clusters:
        C1: a, b, e
        C2: c, f, g, i
        C3: d
• Small clusters are filtered out (MinClusterSize)
15
An Example: Graph Partitioning(3)
16
Distance (Similarity) Matrix for Graph-based
Clustering
• Similarity (Distance) Matrix
  – Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
  – The (i, j) entry in the matrix is the distance (similarity) between items i and j: dij = similarity (or distance) of Di to Dj
  – Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangular part of the matrix
  – The diagonal is all 1's (similarity) or all 0's (distance)
17
Example: Term Similarities in Documents
• Suppose we want to cluster terms that appear in a collection of documents with different frequencies
• Each term can be viewed as a vector of term frequencies (weights):

                T1   T2   T3   T4   T5   T6   T7   T8
        Doc1     0    4    0    0    0    2    1    3
        Doc2     3    1    4    3    1    2    0    1
        Doc3     3    0    0    0    3    0    3    0
        Doc4     0    1    0    3    0    0    2    0
        Doc5     2    2    2    3    1    4    0    2

• We need to compute a term-term similarity matrix
  – For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

        sim(Ti, Tj) = Σ_{k=1..N} (wik × wjk)

    where N = total number of dimensions (in this case, documents) and wik = weight of term i in document k
  – Example:
        sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0×4 + 3×1 + 3×0 + 0×1 + 2×2 = 7
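A small sketch (assuming NumPy is available) that reproduces the dot-product term-term similarities from the document-term matrix above:

    import numpy as np

    # Rows: Doc1..Doc5, columns: T1..T8 (the document-term matrix above)
    W = np.array([
        [0, 4, 0, 0, 0, 2, 1, 3],   # Doc1
        [3, 1, 4, 3, 1, 2, 0, 1],   # Doc2
        [3, 0, 0, 0, 3, 0, 3, 0],   # Doc3
        [0, 1, 0, 3, 0, 0, 2, 0],   # Doc4
        [2, 2, 2, 3, 1, 4, 0, 2],   # Doc5
    ])

    # sim(Ti, Tj) = sum_k w_ik * w_jk, i.e. the dot product of the term columns
    S = W.T @ W
    print(S[0, 1])   # sim(T1, T2) = 7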
18
Example: Term Similarities in Documents
• Applying sim(Ti, Tj) = Σ_{k=1..N} (wik × wjk) to the document-term matrix on the previous slide yields the term-term similarity matrix:

                T2   T3   T4   T5   T6   T7   T8
        T1       7   16   15   14   14    9    7
        T2            8   12    3   18    6   17
        T3                18    6   16    0    8
        T4                      6   18    6    9
        T5                           6    9    3
        T6                                2   16
        T7                                     3
19
Similarity (Distance) Thresholds
– A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar
– Using a threshold value of 10 in the previous example, the term-term similarity matrix

                T2   T3   T4   T5   T6   T7   T8
        T1       7   16   15   14   14    9    7
        T2            8   12    3   18    6   17
        T3                18    6   16    0    8
        T4                      6   18    6    9
        T5                           6    9    3
        T6                                2   16
        T7                                     3

  becomes the thresholded (binary) matrix:

                T2   T3   T4   T5   T6   T7   T8
        T1       0    1    1    1    1    0    0
        T2            0    1    0    1    0    1
        T3                 1    0    1    0    0
        T4                      0    1    0    0
        T5                           0    0    0
        T6                                0    1
        T7                                     0
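A small sketch (assuming NumPy) that applies the threshold of 10 to the term-term similarity matrix and produces the binary matrix above:

    import numpy as np

    W = np.array([[0, 4, 0, 0, 0, 2, 1, 3],    # Doc1
                  [3, 1, 4, 3, 1, 2, 0, 1],    # Doc2
                  [3, 0, 0, 0, 3, 0, 3, 0],    # Doc3
                  [0, 1, 0, 3, 0, 0, 2, 0],    # Doc4
                  [2, 2, 2, 3, 1, 4, 0, 2]])   # Doc5

    S = W.T @ W                      # term-term similarities (dot products)
    A = (S > 10).astype(int)         # 1 = "sufficiently" similar, 0 = not
    np.fill_diagonal(A, 0)           # ignore self-similarity
    print(A)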
20
Graph Representation
• The similarity matrix can be visualized as an undirected graph
  – each item is represented by a node, and edges represent the fact that two items are similar (a one in the thresholded similarity matrix):

                T1   T2   T3   T4   T5   T6   T7
        T2       0
        T3       1    0
        T4       1    1    1
        T5       1    0    0    0
        T6       1    1    1    1    0
        T7       0    0    0    0    0    0
        T8       0    1    0    0    0    1    0

  – The corresponding graph has nodes T1 through T8, with an edge for every 1 in the matrix
  – If no threshold is used, then the matrix can be represented as a weighted graph
21
Simple Clustering Algorithms
• If we are interested only in whether the threshold is exceeded (and not in the degree of similarity or distance), we can use the graph directly for clustering
• Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of
all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters
• Single Link Method
– any item in a cluster must be within the similarity threshold of at
least one other item in that cluster
– produces larger but weaker clusters
22
Simple Clustering Algorithms
• Clique Method
  – a clique is a completely connected subgraph of a graph
  – in the clique method, each maximal clique in the graph becomes a cluster
  – the maximal cliques (and therefore the clusters) in the previous example are:
        {T1, T3, T4, T6}
        {T2, T4, T6}
        {T2, T6, T8}
        {T1, T5}
        {T7}
  – Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
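A sketch of the clique method using the networkx library (assumed available); it should reproduce the maximal cliques listed above:

    import networkx as nx

    # Edges of the thresholded similarity graph (a 1 in the binary matrix)
    edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
             ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
             ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]

    G = nx.Graph(edges)
    G.add_node("T7")                 # T7 has no edge above the threshold

    # Each maximal clique becomes a cluster (clusters may overlap)
    for clique in nx.find_cliques(G):
        print(sorted(clique))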
23
Simple Clustering Algorithms
• Single Link Method
  1. select an item not yet in a cluster and place it in a new cluster
  2. place all other items similar to it in that cluster
  3. repeat step 2 for each item in the cluster until nothing more can be added
  4. repeat steps 1-3 for each item that remains unclustered
  – In this case the single link method produces only two clusters:
        {T1, T3, T4, T5, T6, T2, T8}
        {T7}
  – Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
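At a fixed threshold, single-link clusters are simply the connected components of the graph, so a sketch with networkx (assumed available) is short:

    import networkx as nx

    edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
             ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
             ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]

    G = nx.Graph(edges)
    G.add_node("T7")

    # Single link over a thresholded graph: each connected component is one cluster
    for component in nx.connected_components(G):
        print(sorted(component))     # {T1..T6, T8} and {T7}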
24
K-Means (A partitioning clustering Method)
• In machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
• Centroid: the “middle” of a cluster:

        Cm = (Σ_{i=1..N} tip) / N

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances:

        E = Σ_{m=1..k} Σ_{tmi ∈ Km} (Cm − tmi)²

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means: each cluster is represented by the center of the cluster
  – k-medoids: each cluster is represented by one of the objects in the cluster
25
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current
partition (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when no new assignments are made (see the sketch below)
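A minimal sketch of these four steps in plain Python (Euclidean distance assumed; names illustrative):

    import random

    def kmeans(points, k, max_iter=100):
        # Step 1: pick k objects arbitrarily as the initial cluster centers
        centers = random.sample(points, k)
        for _ in range(max_iter):
            # Step 3: assign each object to the cluster with the nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: sum(
                    (pi - ci) ** 2 for pi, ci in zip(p, centers[c])))
                clusters[nearest].append(p)
            # Step 2: recompute each center as the mean of its cluster
            new_centers = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[i]
                           for i, cl in enumerate(clusters)]
            if new_centers == centers:   # Step 4: stop when nothing changes
                break
            centers = new_centers
        return centers, clusters

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centers, clusters = kmeans(pts, k=2)
    print(centers)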
26
The K-Means Clustering Method
[Figure: k-means on a 2-D dataset with K = 2. Arbitrarily choose K objects as initial cluster centers, assign each object to the most similar center, update the cluster means, then reassign objects and update the means again, repeating until the assignments no longer change.]
27
The K-Means Clustering Method: An Example
Document Clustering
• Initial set with k = 3: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
• Document-term matrix and cluster means:

                T1     T2     T3     T4     T5
        D1       0      3      3      0      2
        D2       4      1      0      1      2
        D3       0      4      0      0      2
        D4       0      3      0      3      3
        D5       0      1      3      0      1
        D6       2      2      0      0      4
        D7       1      0      3      2      0
        D8       3      1      0      0      2
        C1     4/2    4/2    3/2    1/2    4/2
        C2     0/2    7/2    0/2    3/2    5/2
        C3     2/2    3/2    3/2    0/2    5/2
28
Example: K-Means
Now compute the similarity (or distance) of each document to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

                C1      C2      C3
        D1    29/2    31/2    28/2
        D2    29/2    20/2    21/2
        D3    24/2    38/2    22/2
        D4    27/2    45/2    24/2
        D5    17/2    12/2    17/2
        D6    32/2    34/2    30/2
        D7    15/2     6/2    11/2
        D8    24/2    17/2    19/2

For each document, reallocate the document to the cluster with which it has the highest similarity (the largest value in its row of the table above). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.
C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
This is the end of the first iteration (i.e., the first reallocation).
Next, we repeat the process for another reallocation…
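Before moving on, here is a small sketch (assuming NumPy) that reproduces the first reallocation above from the document-term matrix and the initial clusters:

    import numpy as np

    docs = {"D1": [0, 3, 3, 0, 2], "D2": [4, 1, 0, 1, 2], "D3": [0, 4, 0, 0, 2],
            "D4": [0, 3, 0, 3, 3], "D5": [0, 1, 3, 0, 1], "D6": [2, 2, 0, 0, 4],
            "D7": [1, 0, 3, 2, 0], "D8": [3, 1, 0, 0, 2]}
    clusters = {"C1": ["D1", "D2"], "C2": ["D3", "D4"], "C3": ["D5", "D6"]}

    # Cluster means (centroids) of the initial partition
    centroids = {c: np.mean([docs[d] for d in members], axis=0)
                 for c, members in clusters.items()}

    # Dot-product similarities, then reassign each document to the most similar
    # cluster; ties are broken in favour of the current cluster (D5 ties C1/C3)
    current = {d: c for c, members in clusters.items() for d in members}
    new_clusters = {c: [] for c in clusters}
    for d, vec in docs.items():
        sims = {c: float(np.dot(vec, cen)) for c, cen in centroids.items()}
        best = max(sims, key=lambda c: (sims[c], c == current.get(d)))
        new_clusters[best].append(d)
    print(new_clusters)   # C1: [D2, D7, D8], C2: [D1, D3, D4, D6], C3: [D5]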
29
Example: K-Means
Now compute new cluster centroids using the original document-term matrix:
C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

                T1      T2      T3      T4      T5
        D1       0       3       3       0       2
        D2       4       1       0       1       2
        D3       0       4       0       0       2
        D4       0       3       0       3       3
        D5       0       1       3       0       1
        D6       2       2       0       0       4
        D7       1       0       3       2       0
        D8       3       1       0       0       2
        C1     8/3     2/3     3/3     3/3     4/3
        C2     2/4    12/4     3/4     3/4    11/4
        C3     0/1     1/1     3/1     0/1     1/1
30
Comments on the K-Means Method
Weakness
– Applicable only when a mean is defined (what about categorical data?)
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
31
Variations of the K-Means Method
• A few variants of the k-means method differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
32
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers !
– Since an object with an extremely large value may substantially distort the distribution of
the data.
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
33
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids) algorithm:
  1. Initialize: randomly select k of the n data points as the medoids
  2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean, Manhattan, or Minkowski distance)
  3. For each medoid m and each non-medoid data point o, swap m and o and compute the total cost of the configuration
  4. Select the configuration with the lowest cost
  5. Repeat steps 2 to 4 until there is no change in the medoids
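A compact sketch of PAM following the steps above (Manhattan distance assumed; names illustrative and not from the slides):

    import random

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def total_cost(points, medoids):
        # Sum of distances from every point to its closest medoid
        return sum(min(manhattan(p, m) for m in medoids) for p in points)

    def pam(points, k, max_iter=100):
        medoids = random.sample(points, k)              # step 1: random initial medoids
        for _ in range(max_iter):
            best_cost, best_medoids = total_cost(points, medoids), medoids
            for m in medoids:                           # step 3: try every medoid/non-medoid swap
                for o in points:
                    if o in medoids:
                        continue
                    candidate = [o if x == m else x for x in medoids]
                    cost = total_cost(points, candidate)
                    if cost < best_cost:                # step 4: keep the cheapest configuration
                        best_cost, best_medoids = cost, candidate
            if best_medoids == medoids:                 # step 5: stop when the medoids do not change
                break
            medoids = best_medoids
        # step 2: final association of each point with its closest medoid
        assignment = {p: min(medoids, key=lambda m: manhattan(p, m)) for p in points}
        return medoids, assignment

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (25, 25)]   # (25, 25) is an outlier
    print(pam(pts, k=2)[0])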
34
Hierarchical Clustering
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: with objects a, b, c, d, e, agglomerative clustering (AGNES, Agglomerative Nesting) proceeds from step 0 to step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA, Divisive Analysis) runs the same steps in reverse, from step 4 back to step 0.]
35
Agglomerative hierarchical clustering
• Group data objects in a bottom-up fashion.
• Initially each data object is in its own cluster.
• Then we merge these atomic clusters into larger and larger clusters, until
all of the objects are in a single cluster or until certain termination
conditions are satisfied.
• A user can specify the desired number of clusters as a termination
condition.
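A brief sketch using SciPy's hierarchical clustering (assuming SciPy is available); single linkage is chosen for illustration, and the termination condition is a desired number of clusters:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Six 2-D points forming two visually separate groups
    X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

    # Bottom-up merging; "single" = nearest-neighbour (single-link) distance
    Z = linkage(X, method="single")

    # Terminate by asking for a desired number of clusters (here 2)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)   # e.g. [1 1 1 2 2 2]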
36
Divisive hierarchical clustering
• Groups data objects in a top-down fashion.
• Initially all data objects are in one cluster.
• We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.
37
AGNES (Agglomerative Nesting)
[Figure: three scatter plots showing AGNES progressively merging nearby points into larger clusters.]
38
Dendrogram: Shows How the Clusters are Merged
39
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
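A brief sketch using scikit-learn's DBSCAN (assuming scikit-learn is available); eps and min_samples are the density parameters mentioned above, and the label -1 marks noise:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense groups of arbitrary shape plus one isolated noise point
    X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [25, 25]])

    # eps: neighbourhood radius; min_samples: minimum points for a dense region
    db = DBSCAN(eps=2.0, min_samples=3).fit(X)
    print(db.labels_)   # e.g. [0 0 0 1 1 1 -1], where -1 is noise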
40
Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) & OPTICS (Ordering Points To Identify the Clustering Structure)
41
GRID-BASED CLUSTERING METHODS
• This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
• Assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters.
42
[Figure: our "space" is the plane of Age (20 to 60) versus Salary (0 to 8, in units of 10,000), divided into grid cells.]
43
STING: A Statistical Information Grid Approach
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution
44
Different Grid Levels during Query Processing.
45
The STING Clustering Method
– Each cell at a high level is partitioned into a number of smaller cells in the next lower
level
– Statistical info of each cell is calculated and stored beforehand and is used to answer
queries
– Parameters of higher level cells can be easily calculated from parameters of lower level cells
• count, mean, s (standard deviation), min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
– Start from a pre-selected layer—typically with a small number of cells
– For each cell in the current level compute the confidence interval
46
Comments on STING
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
– Query-independent, easy to parallelize, incremental update
• Disadvantages:
– All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
47
Model-Based Clustering
• What is model-based clustering?
– Attempt to optimize the fit between the given data and some mathematical
model
– Based on the assumption: data are generated by a mixture of underlying probability distributions
• Typical methods
– Statistical approach
• EM (Expectation maximization), AutoClass
– Machine learning approach
• COBWEB, CLASSIT
– Neural network approach
• SOM (Self-Organizing Feature Map)
48
Conceptual Clustering
• Conceptual clustering
– A form of clustering in machine learning
– Produces a classification scheme for a set of unlabeled objects
– Finds characteristic description for each concept (class)
• COBWEB
– A popular and simple method of incremental conceptual learning
– Creates a hierarchical clustering in the form of a classification tree
– Each node refers to a concept and contains a probabilistic description of that
concept
49
COBWEB Clustering Method
A classification tree
50
What Is Outlier Discovery?
• What are outliers?
– A set of objects that are considerably dissimilar from the remainder of the data
• Problem: Define and find outliers in large data sets
• Applications:
– Credit card fraud detection
– Customer segmentation
– Medical analysis
51