CSE591 Data Mining


4. Clustering Methods
• Concepts
• Partitional (k-Means, k-Medoids)
• Hierarchical (Agglomerative & Divisive, COBWEB)
• Density-based (DBSCAN, CLIQUE)
• Large-size data (STING, BIRCH, CURE)
Concepts of Clustering
• Clusters
• Different ways of representing clusters
– Division with boundaries
– Venn diagram or spheres
– Probabilistic (figure: a table assigning each instance I1, I2, …, In a probability for clusters 1, 2, 3, e.g. 0.5, 0.2, 0.3)
– Dendrograms
– Trees
– Rules
• About clusters
– Inter-cluster distance → maximization
– Intra-cluster distance → minimization
• Clustering vs. classification
– Which one is more difficult? Why?
– Among the various possible ways of clustering, which one is the best?
Major Categories
• Partitioning: Divide the data into k partitions (k fixed); repartition to obtain a better clustering.
• Hierarchical: Divide into different numbers of partitions in layers – merge (bottom-up) or divide (top-down).
• Density-based: Continue to grow a cluster as long as its density exceeds a threshold.
• Grid-based: First divide the space into grids, then perform clustering on the grids.
k-Means
• Algorithm
1. Given k
2. Randomly pick k instances as the initial centers
3. Assign each remaining instance to the closest of the k clusters
4. Recalculate the mean of each cluster
5. Repeat 3 & 4 until the means don't change
• How good are the clusters?
– Initial and final clusters
– Within-cluster variation: Σ diff(x, mean)^2
– Why don't we consider inter-cluster distance?
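
A minimal Python sketch of the algorithm above; the helper name, the optional `init` argument, and the use of squared Euclidean distance are illustrative choices, not part of the slides.

```python
import random

def k_means(points, k, init=None, max_iter=100, seed=0):
    """Minimal k-Means over 1-D numbers or tuples; returns (clusters, centers, within-cluster SSE)."""
    pts = [p if isinstance(p, tuple) else (p,) for p in points]
    rng = random.Random(seed)
    # step 2: initial centers, either supplied or picked at random from the data
    centers = [c if isinstance(c, tuple) else (c,) for c in init] if init else rng.sample(pts, k)
    clusters = []
    for _ in range(max_iter):
        # step 3: assign every instance to the closest of the k current centers
        clusters = [[] for _ in range(k)]
        for p in pts:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # step 4: recalculate the mean of each (non-empty) cluster
        new_centers = [tuple(sum(m[d] for m in members) / len(members) for d in range(len(pts[0])))
                       if members else centers[idx]
                       for idx, members in enumerate(clusters)]
        # step 5: stop when the means no longer change
        if new_centers == centers:
            break
        centers = new_centers
    # within-cluster variation: sum over instances of squared difference to the cluster mean
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centers[j]))
              for j, members in enumerate(clusters) for p in members)
    return clusters, centers, sse
```
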
Example
• For simplicity, 1-dimensional objects and k = 2.
• Objects: 1, 2, 5, 6, 7
• k-Means:
– Randomly select 5 and 6 as the initial centroids;
– => two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
– => {1, 2}, {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
– => no change.
– Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
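
Running the sketch from the previous slide on this data reproduces the trace above (passing the initial centroids through the illustrative `init` argument):

```python
# The example from the slide: objects 1, 2, 5, 6, 7 with 5 and 6 as the initial centroids
clusters, centers, sse = k_means([1, 2, 5, 6, 7], k=2, init=[5, 6])
print(clusters)   # [[(1,), (2,)], [(5,), (6,), (7,)]]
print(centers)    # [(1.5,), (6.0,)]
print(sse)        # 2.5
```
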
Discussions
• Limitations:
– Means cannot be defined for categorical attributes
– Choice of k
– Sensitive to outliers
– Crisp clustering
• Variants of k-means exist:
– Using modes to deal with categorical attributes
• How about distance measures?
• Is it similar to or different from k-NN?
– With and without learning
k-Medoids
• k-Means algorithm is sensitive to outliers
– Is this true? How to prove it?
• Medoid – the most centrally located point in a
cluster, as a representative point of the cluster.
• In contrast, a centroid is not necessarily inside a
cluster.
• An example (figure: initial medoids)
Partition Around Medoids
• PAM:
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each instance to the nearest medoid x
4. Calculate the objective function
• the sum of dissimilarities of all instances to their nearest medoids
5. Randomly select a non-medoid instance y
6. Swap x with y if the swap reduces the objective function
7. Repeat (3-6) until no change
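
A compact sketch of the PAM loop above, assuming a user-supplied dissimilarity function; the random-swap acceptance mirrors steps 5-6 on the slide, and the example data are illustrative.

```python
import random

def pam(points, k, dist, max_swaps=1000, seed=0):
    """Simplified PAM: greedily accept random medoid/non-medoid swaps that lower the total cost."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)        # indices of the k initial medoids

    def cost(meds):
        # objective: sum of dissimilarities of all instances to their nearest medoid
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    for _ in range(max_swaps):
        x = rng.choice(medoids)                                              # a current medoid
        y = rng.choice([i for i in range(len(points)) if i not in medoids])  # a non-medoid
        candidate = [y if m == x else m for m in medoids]
        c = cost(candidate)
        if c < best:                                    # swap x with y only if it reduces the objective
            medoids, best = candidate, c
    # final assignment: each instance goes to its nearest medoid
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points]
    return [points[m] for m in medoids], labels, best

# e.g. 1-D data with one far-away outlier
meds, labels, total = pam([1, 2, 5, 6, 7, 107], k=2, dist=lambda a, b: abs(a - b))
```
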
k-Means and k-Medoids
• The key difference lies in how they update the means or medoids
• Both require distance calculation and reassignment of instances
• Time complexity
– Which one is more costly?
• Dealing with outliers (figure: an outlier 100 units away)
EM (Expectation and Maximization)
• Moving away from crisp clusters as in k-Means by allowing an instance to belong to several clusters
• Finite mixtures – a statistical clustering model
– A mixture is a set of k probability distributions, representing k clusters
– The simplest finite mixture: one feature with a Gaussian per cluster
– When k = 2, we need to estimate 5 parameters: 2 pairs of μ and σ, and pA, where pB = 1 - pA
• EM
– Estimate the parameters using the instances
– Maximize the overall likelihood that the data came from this mixture model
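
A sketch of the EM iterations for the simplest finite mixture described above (one numeric feature, two Gaussians); the initialisation, stopping rule, and data are illustrative.

```python
import math

def em_gaussian_mixture(xs, iters=50):
    """EM for a 1-D mixture of two Gaussians: estimates muA, sigmaA, muB, sigmaB and pA (pB = 1 - pA)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # crude initialisation: split the sorted data in half
    muA, muB = sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)
    sigmaA = sigmaB = (max(xs) - min(xs)) / 4 or 1.0
    pA = 0.5

    def gauss(x, mu, sigma):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    for _ in range(iters):
        # E-step: probability that each instance belongs to cluster A (soft, not crisp, membership)
        w = []
        for x in xs:
            a = pA * gauss(x, muA, sigmaA)
            b = (1 - pA) * gauss(x, muB, sigmaB)
            w.append(a / (a + b))
        # M-step: re-estimate the five parameters from the weighted instances
        sA, sB = sum(w), len(xs) - sum(w)
        muA = sum(wi * x for wi, x in zip(w, xs)) / sA
        muB = sum((1 - wi) * x for wi, x in zip(w, xs)) / sB
        sigmaA = math.sqrt(sum(wi * (x - muA) ** 2 for wi, x in zip(w, xs)) / sA) or 1e-6
        sigmaB = math.sqrt(sum((1 - wi) * (x - muB) ** 2 for wi, x in zip(w, xs)) / sB) or 1e-6
        pA = sA / len(xs)
    return (muA, sigmaA), (muB, sigmaB), pA

# two overlapping groups of 1-D values
params_A, params_B, pA = em_gaussian_mixture([1, 2, 2, 3, 8, 9, 9, 10])
```
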
Agglomerative
• Each object is initially viewed as a cluster (bottom-up).
• Repeat until the number of clusters is small enough:
– Choose the closest pair of clusters
– Merge the two into one
• Defining "closest": centroid (cluster mean) distance, (average) sum of pairwise distances, …
– Refer to the Evaluation part
• A dendrogram is a tree that shows the clustering process.
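
A minimal sketch of the bottom-up merging loop with centroid distance; the 1-D data are chosen to match the dendrogram example that follows.

```python
def agglomerative(points, target_k):
    """Bottom-up clustering: repeatedly merge the two clusters whose centroids are closest."""
    clusters = [[p] for p in points]                 # start with each object in its own cluster
    merges = []                                      # record of merges (the dendrogram, bottom-up)
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])   # centroid of cluster i
                cj = sum(clusters[j]) / len(clusters[j])   # centroid of cluster j
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [clusters[i] + clusters[j]]
    return clusters, merges

# The dendrogram example below: cluster 1, 2, 4, 5, 6, 7 into two clusters
clusters, merges = agglomerative([1, 2, 4, 5, 6, 7], target_k=2)
print(clusters)   # [[1, 2], [4, 5, 6, 7]]
```
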
Dendrogram
• Cluster 1, 2, 4, 5, 6, 7 into two clusters (centroid distance)
• (Figure: the resulting dendrogram over the leaves 1, 2, 4, 5, 6, 7.)
An Example to Show Different Links
• Pairwise distance matrix for objects A–E:

      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0

• Single link
– Merge the nearest clusters as measured by the shortest edge between the two
– (((A B) (C D)) E)
• Complete link
– Merge the nearest clusters as measured by the longest edge between the two
– (((A B) E) (C D))
• Average link
– Merge the nearest clusters as measured by the average edge length between the two
– (((A B) (C D)) E)
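
Assuming SciPy is available, the three linkage rules can be checked directly on this matrix; the reported merge orders reproduce the parenthesised groupings above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The 5x5 distance matrix from the slide (objects A, B, C, D, E)
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)
condensed = squareform(D)          # SciPy expects the condensed (upper-triangular) form

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method, Z[:, :3])        # each row: the two clusters merged and the merge distance
```
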
Divisive
• All instances start in one cluster (top-down)
• Finding an optimal division at each layer (especially the top one) is computationally prohibitive.
• One heuristic method is based on the Minimum Spanning Tree (MST) algorithm
– Connect all instances with an MST (O(N^2))
– Repeatedly cut out the longest edges at each iteration until some stopping criterion is met or until one instance remains in each cluster.
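
A sketch of this MST-based heuristic (Prim's algorithm plus edge cutting); the union-find bookkeeping and the toy data are illustrative.

```python
def mst_divisive(points, num_clusters, dist):
    """Divisive clustering heuristic: build an MST (Prim's algorithm), then cut the longest edges."""
    n = len(points)
    # Prim's algorithm, O(N^2): grow the tree one vertex at a time
    in_tree = {0}
    edges = []                                        # MST edges as (weight, u, v)
    while len(in_tree) < n:
        w, u, v = min((dist(points[i], points[j]), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((w, u, v))
        in_tree.add(v)
    # Cutting the num_clusters - 1 longest MST edges leaves num_clusters connected components
    edges.sort()                                      # ascending by weight
    kept = edges[: n - num_clusters]
    # Recover the components with union-find
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for _, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

# cutting the longest edge (7 to 20) separates the isolated point
print(mst_divisive([1, 2, 4, 5, 6, 7, 20], num_clusters=2, dist=lambda a, b: abs(a - b)))
```
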
COBWEB
• Building a conceptual hierarchy incrementally
• Category Utility:
– CU = Σk Σi Σj P(fi = vij) P(fi = vij | ck) P(ck | fi = vij)
– summed over all categories ck, all features fi, and all feature values vij
• It attempts to maximize both the probability that two objects in the same category have values in common and the probability that objects in different categories will have different property values
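
A small sketch of how that category utility could be computed for a given partition of categorical instances; the `category_utility` helper and the toy data are purely illustrative.

```python
from collections import Counter

def category_utility(categories):
    """Category utility as on the slide: sum over categories ck, features fi, and values vij of
    P(fi=vij) * P(fi=vij | ck) * P(ck | fi=vij).
    `categories` is a list of clusters, each a list of instances (tuples of categorical values)."""
    instances = [x for cat in categories for x in cat]
    n = len(instances)
    num_features = len(instances[0])
    # overall value counts per feature, shared by all categories
    overall = [Counter(x[i] for x in instances) for i in range(num_features)]
    cu = 0.0
    for cat in categories:
        for i in range(num_features):
            within = Counter(x[i] for x in cat)          # counts of vij inside category ck
            for v, cnt in within.items():
                p_v = overall[i][v] / n                  # P(fi = vij)
                p_v_given_ck = cnt / len(cat)            # P(fi = vij | ck)
                p_ck_given_v = cnt / overall[i][v]       # P(ck | fi = vij)
                cu += p_v * p_v_given_ck * p_ck_given_v
    return cu

# two hypothetical categories of instances described by (size, colour)
cats = [[("small", "red"), ("small", "red")],
        [("large", "blue"), ("large", "red")]]
print(category_utility(cats))
```
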
• Processing one instance at a time by evaluating
– Placing the instance in the best existing category
– Adding a new category containing only the instance
– Merging of two existing categories into a new one and
adding the instance to that category
– Splitting of an existing category into two and placing
the instance in the best new resulting category
(Figure: splitting a parent into Child 1 and Child 2 under the grandparent, and merging Child 1 and Child 2 into a new node.)
Density-based
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• It grows regions with sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise.
– Many existing clustering algorithms find only spherical shapes of clusters
• DBSCAN defines a cluster as a maximal set of density-connected points.
• Defining density and connection
– ε-neighborhood of an object x (core object) (M, P, O)
– MinPts objects within the ε-neighborhood (say, 3)
– directly density-reachable (Q from M, M from P)
– density-reachable (Q from P, P not from Q) [asymmetric]
– density-connected (O, R, S) [symmetric] for border points
• What is the relationship between DR and DC?
(Figure: example points M, O, Q, R, S illustrating these definitions.)
• Clustering with DBSCAN
– Search for clusters by checking the ε-neighborhood of each instance x
– If the ε-neighborhood of x contains more than MinPts objects, create a new cluster with x as a core object
– Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters
– Terminate when no new point can be added to any cluster
• DBSCAN is sensitive to the density thresholds, but it is many times faster than CLARANS
• Time complexity: O(N log N) if a spatial index is used, O(N^2) otherwise
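
A compact sketch of the procedure above; without a spatial index the neighborhood queries make it O(N^2), as noted. The eps, MinPts, and toy data values are illustrative.

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: labels[i] is a cluster id, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:                 # not a core object: tentatively noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                  # start a new cluster around core object i
        queue = [j for j in nbrs if j != i]
        while queue:                            # collect everything density-reachable from i
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id          # a border point gets adopted by the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:          # j is also a core object: expand through it
                queue.extend(j_nbrs)
        cluster_id += 1
    return labels

# 1-D toy data: two dense groups and one isolated point (noise)
print(dbscan([1, 2, 2, 3, 10, 11, 11, 12, 30], eps=1.5, min_pts=3, dist=lambda a, b: abs(a - b)))
# -> [0, 0, 0, 0, 1, 1, 1, 1, -1]
```
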
Dealing with Large Data
• Key ideas
– Reducing the number of instances while maintaining the distribution
– Identifying relevant subspaces where clusters possibly exist
– Using summarized information to avoid repeated data access
• Sampling
– CLARA (Clustering LARge Applications) works on samples instead of the whole data set
– CLARANS (Clustering Large Applications based on RANdomized Search)
• Grid: STING (STatistical INformation Grid)
– Statistical parameters of higher-level cells can easily be
computed from those of lower-level cells
• Attribute-independent: count
• Attribute-dependent: mean, standard deviation, min, max
• Type of distribution: normal, uniform, exponential, or
unknown
– Irrelevant cells can be removed
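
A sketch of that roll-up, assuming each cell stores the count, mean, standard deviation, min, and max of one attribute; the dictionary layout is illustrative.

```python
import math

def aggregate_cells(children):
    """Combine the statistics of lower-level grid cells into their parent cell (STING-style roll-up).
       Each child is a dict with keys: n, mean, std, min, max."""
    n = sum(c["n"] for c in children)                       # attribute-independent: count
    mean = sum(c["n"] * c["mean"] for c in children) / n    # weighted mean of the children
    # E[x^2] of each child is std^2 + mean^2; combine, then subtract the parent mean^2
    ex2 = sum(c["n"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# two hypothetical child cells of a parent cell
child_a = {"n": 4, "mean": 2.0, "std": 1.0, "min": 1.0, "max": 3.0}
child_b = {"n": 6, "mean": 5.0, "std": 2.0, "min": 2.0, "max": 9.0}
print(aggregate_cells([child_a, child_b]))
```
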
Representatives
• BIRCH, using the Clustering Feature (CF) and the CF tree
– A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS)
• N – the number of instances, LS – the linear sum, SS – the square sum
– Two thresholds: the branching factor and the maximum number of children per non-leaf node
– Two phases
1. Build an initial in-memory CF tree
2. Apply a clustering algorithm to cluster the leaf nodes of the CF tree
• CURE (Clustering Using REpresentatives) is another example
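
The usefulness of the CF triplet comes from its additivity; a minimal 1-D sketch (helper names are illustrative):

```python
def cf_of(points):
    """Clustering feature of a set of 1-D instances: (N, LS, SS)."""
    return (len(points), sum(points), sum(x * x for x in points))

def cf_merge(cf1, cf2):
    """CF additivity: the CF of the union of two sub-clusters is the element-wise sum."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def cf_centroid_and_radius(cf):
    """Centroid and radius can be derived from (N, LS, SS) without revisiting the instances."""
    n, ls, ss = cf
    centroid = ls / n
    radius = (ss / n - centroid ** 2) ** 0.5
    return centroid, radius

cf_a = cf_of([1, 2, 3])          # (3, 6, 14)
cf_b = cf_of([7, 8])             # (2, 15, 113)
merged = cf_merge(cf_a, cf_b)    # (5, 21, 127)
print(cf_centroid_and_radius(merged))
```
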
• Taking advantage of a property of density
– If a region is dense in a higher-dimensional subspace, it is also dense in the lower-dimensional projections of that subspace
– CLIQUE (CLustering In QUEst)
• With high-dimensional data, there are many void subspaces
• Using this property, we can start from the dense lower-dimensional units
• CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
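
A toy sketch of the bottom-up idea: dense 1-D units generate the only candidates for 2-D units. The grid size, density threshold, and data are illustrative, and the sketch stops at two dimensions.

```python
from collections import Counter

def dense_units(points, grid_size, threshold):
    """Bottom-up search in the spirit of CLIQUE: dense 1-D units first, then candidate 2-D units
       built only from dense 1-D units (the density property makes the pruning safe)."""
    dims = len(points[0])
    unit = lambda x: int(x // grid_size)                 # index of the grid interval a value falls into

    # 1-D pass: count points per (dimension, interval) and keep the dense units
    counts_1d = Counter((d, unit(p[d])) for p in points for d in range(dims))
    dense_1d = {u for u, c in counts_1d.items() if c >= threshold}

    # 2-D pass: only combine dense 1-D units from two different dimensions
    candidates = {((d1, u1), (d2, u2))
                  for (d1, u1) in dense_1d for (d2, u2) in dense_1d if d1 < d2}
    counts_2d = Counter()
    for p in points:
        for (d1, u1), (d2, u2) in candidates:
            if unit(p[d1]) == u1 and unit(p[d2]) == u2:
                counts_2d[((d1, u1), (d2, u2))] += 1
    dense_2d = {u for u, c in counts_2d.items() if c >= threshold}
    return dense_1d, dense_2d

# hypothetical 2-D data with a dense patch near (1, 1)
data = [(1.1, 1.2), (1.3, 1.1), (1.2, 1.4), (1.4, 1.3), (7.5, 3.2), (9.0, 8.8)]
print(dense_units(data, grid_size=2.0, threshold=3))
```
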
Chameleon
• A Hierarchical Clustering Algorithm Using Dynamic Modeling
– Motivated by observations on the weaknesses of CURE and ROCK
Summary
• There are many clustering algorithms
• Good clustering algorithms maximize inter-cluster
dissimilarity and intra-cluster similarity
• Without prior knowledge, it is difficult to choose
the best clustering algorithm.
• Clustering is an important tool for outlier analysis.
Bibliography
• I.H. Witten and E. Frank. Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. 2000. Morgan Kaufmann.
• M. Kantardzic. Data Mining – Concepts, Models, Methods, and Algorithms. 2003. IEEE.
• J. Han and M. Kamber. Data Mining – Concepts and Techniques. 2001. Morgan Kaufmann.
• M. H. Dunham. Data Mining – Introductory and Advanced Topics.