Transcript cluster

Clustering Part 2 (continued)
1. BIRCH (skipped)
2. Density-based Clustering --- DBSCAN and DENCLUE
3. Grid-based Approaches --- STING and CLIQUE
4. SOM (skipped)
5. Cluster Validation (other set of transparencies)
6. Outlier/Anomaly Detection (other set of transparencies)
Slides in red will be used for Part 1.
1
BIRCH (1996)


Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
 Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data records
2
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: linear sum of the N points, $LS = \sum_{i=1}^{N} X_i$
 SS: square sum of the N points, $SS = \sum_{i=1}^{N} X_i^2$
Example: CF = (5, (16,30), (54,190)) for the five points shown below
[Scatter plot on a 10x10 grid of the five data points (3,4), (2,6), (4,5), (4,7), (3,8)]
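As a quick illustration (a minimal Python sketch, not BIRCH's actual implementation), the CF entry of the five points above can be computed directly, and two CF entries can be merged by component-wise addition, which is the additivity property a CF tree exploits when sub-clusters are combined:

```python
import numpy as np

class CF:
    """Clustering Feature: the triple (N, LS, SS) summarizing a set of points."""
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.N = len(pts)                  # number of data points
        self.LS = pts.sum(axis=0)          # linear sum of the points
        self.SS = (pts ** 2).sum(axis=0)   # square sum of the points

    def merge(self, other):
        """CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        out = CF([])
        out.N, out.LS, out.SS = self.N + other.N, self.LS + other.LS, self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

cf = CF([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf.N, cf.LS, cf.SS)   # -> N=5, LS=(16, 30), SS=(54, 190)
```

From a CF entry alone, quantities such as the centroid and radius of a sub-cluster can be derived without revisiting the raw points.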
3
CF Tree
[CF-tree diagram: root node (branching factor B = 7) with entries CF1, CF2, CF3, ..., CF6, each pointing to a child; a non-leaf node with entries CF1, CF2, CF3, ..., CF5 and child pointers; leaf nodes (at most L = 6 entries) holding CF entries and linked by prev/next pointers]
4
Chapter 8. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
5
Density-Based Clustering Methods



Clustering based on density (local cluster criterion), such as density-connected points or based on an explicitly constructed density function
Major features:
 Discover clusters of arbitrary shape
 Handle noise
 Usually one scan
 Need density estimation parameters
Several interesting studies:
 DBSCAN: Ester, et al. (KDD'96)
 OPTICS: Ankerst, et al. (SIGMOD'99)
 DENCLUE: Hinneburg & Keim (KDD'98)
 CLIQUE: Agrawal, et al. (SIGMOD'98)
6
DBSCAN



Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
 1) p belongs to N_Eps(q)
 2) core point condition: |N_Eps(q)| >= MinPts
[Figure: q is a core point (Eps = 1 cm, MinPts = 5) and p lies in q's Eps-neighbourhood, so p is directly density-reachable from q]
7
Density-Based Clustering: Background (II)

Density-reachable: A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
[Figure: a chain of points from q to p, each directly density-reachable from the previous one]
Density-connected: A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
[Figure: p and q are both density-reachable from o, hence density-connected]
8
DBSCAN: Density Based Spatial
Clustering of Applications with Noise


Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure (Eps = 1 cm, MinPts = 5): core points, border points that are density-reachable from a core point, and outliers that are not density-reachable from any core point]
9
DBSCAN: The Algorithm





Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is not a core point, no points are density-reachable from p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed. A sketch of the algorithm follows below.
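A minimal sketch of these steps in Python (my own illustration; the helper name `region_query` and the brute-force neighbourhood search are assumptions, not part of the original algorithm description):

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=5):
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # indices of points within eps of X[i] (the Eps-neighbourhood)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            continue                   # not a core point; stays noise for now
        labels[i] = cluster_id         # start a new cluster at core point i
        seeds = list(neighbors)
        while seeds:                   # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # border or core point joins
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:
                    seeds.extend(j_neighbors)  # j is also a core point
        cluster_id += 1
    return labels
```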
10
Density-based Clustering: Pros and Cons
+: can discover clusters of arbitrary shape
+: not sensitive to outliers and supports outlier detection
+: can handle noise
+-: medium algorithmic complexity
-: finding good density estimation parameters is frequently difficult; more difficult to use than K-means
-: usually does not do well in clustering high-dimensional datasets
-: cluster models are not well understood (yet)
11
DENCLUE: using density functions

DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
Major features:
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
 But needs a large number of parameters
12
Denclue: Technical Essence





Influence function: describes the impact of a data point within its neighborhood.
The overall density of the data space can be calculated as the sum of the influence functions of all data points.
Clusters can be determined mathematically by identifying density attractors; objects that are associated with the same density attractor belong to the same cluster.
Density attractors are local maxima of the overall density function.
Uses grid cells to speed up computations; only data points in neighboring grid cells are used to determine the density at a point.
13
DENCLUE Influence Function and its Gradient

Example: Gaussian influence function, the resulting overall density function, and its gradient:

$$f_{\mathrm{Gaussian}}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$$

$$f^{D}_{\mathrm{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$$

$$\nabla f^{D}_{\mathrm{Gaussian}}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$$
14
Example: Density Computation
D = {x1, x2, x3, x4}
f^D_Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78
[Figure: data points x1, ..., x4 and query points x and y; the influences of x1, x2, x3, x4 on x are 0.04, 0.06, 0.08, and 0.6, respectively]
Remark: the density value of y would be larger than the one for x.
15
Density Attractor
16
Examples of DENCLUE Clusters
17
Basic Steps of the DENCLUE Algorithm
1. Determine density attractors
2. Associate data objects with density attractors using hill climbing (initial clustering); a sketch of this step is given below
3. Merge the initial clusters further, relying on a hierarchical clustering approach (optional)
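A minimal sketch of step 2 (my own illustration, assuming the Gaussian density from the previous slides, a fixed step size, and a simple convergence tolerance; DENCLUE's actual update rule differs in detail):

```python
import numpy as np

def gradient(x, data, sigma=1.0):
    """Gradient of the Gaussian density function at x."""
    d2 = np.sum((data - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum((data - x) * w[:, None], axis=0)

def hill_climb(x, data, sigma=1.0, step=0.1, tol=1e-4, max_iter=200):
    """Move x uphill along the density gradient until it (nearly) vanishes;
    the end point approximates the density attractor of the starting point."""
    x, data = np.asarray(x, dtype=float), np.asarray(data, dtype=float)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = np.linalg.norm(g)
        if norm < tol:
            break
        x = x + step * g / norm   # normalized gradient-ascent step
    return x

# Objects whose hill climbs end near the same attractor form one initial cluster.
```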
18
Chapter 8. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
19
Steps of Grid-based Clustering Algorithms
Basic Grid-based Algorithm
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function). A sketch of these steps is given below.
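A minimal Python sketch of steps 1-4 (illustrative only; the 2-D case with square cells, and the parameter names `cell_size` and `density_threshold` are assumptions):

```python
import numpy as np
from collections import defaultdict

def grid_clusters(X, cell_size=1.0, density_threshold=3):
    X = np.asarray(X, dtype=float)
    # Steps 1-2: assign each object to a grid cell and count the cell densities
    counts = defaultdict(int)
    for p in X:
        counts[tuple((p // cell_size).astype(int))] += 1
    # Step 3: keep only the dense cells
    dense = {c for c, n in counts.items() if n >= density_threshold}
    # Step 4: merge contiguous (edge-adjacent) dense cells via flood fill
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, component = [cell], []
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            component.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters   # each cluster is a list of adjacent dense grid cells
```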
20
Advantages of Grid-based Clustering
Algorithms


Fast:
 No distance computations
 Clustering is performed on summaries and not individual objects; the complexity is usually O(#populated grid cells) and not O(#objects)
 Easy to determine which clusters are neighboring
Shapes are limited to unions of rectangular grid cells
21
Grid-Based Clustering Methods

Several interesting methods (in addition to the basic grid-based algorithm):
 STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD'98)
22
STING: A Statistical Information
Grid Approach



Wang, Yang and Muntz (VLDB'97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
23
STING: A Statistical Information
Grid Approach (2)







The main contribution of STING is the proposal of a data structure that can be used for many purposes (e.g., SCMRG and, to some extent, BIRCH use this kind of structure)
The data structure is used to form clusters based on queries
Each cell at a high level is partitioned into a number of smaller cells at the next lower level
Statistical info of each cell is calculated and stored beforehand and is used to answer queries
Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells (see the sketch below):
 count, mean, s (standard deviation), min, max
 type of distribution—normal, uniform, etc.
Uses a top-down approach to answer spatial data queries
Clusters are formed by merging cells that match a given query description (next slide)
24
STING: Query Processing(3)
Uses a top-down approach to answer spatial data queries:
1. Start from a pre-selected layer—typically one with a small number of cells.
2. From the pre-selected layer down to the bottom layer, do the following: for each cell in the current level, compute the confidence interval indicating the cell's relevance to the given query;
 If it is relevant, include the cell in a cluster
 If it is irrelevant, remove the cell from further consideration
 Otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on grid-neighborhood) and return the so-obtained clusters as your answer. A simplified sketch of this descent is given below.
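A simplified sketch of the descent (my own illustration: relevance is reduced to a range-overlap test on one stored attribute, whereas STING decides relevance from confidence intervals over the cell statistics):

```python
class Cell:
    def __init__(self, lo, hi, children=None):
        self.lo, self.hi = lo, hi        # precomputed min/max of the attribute in the cell
        self.children = children or []   # empty list means a bottom-layer cell

def relevant_cells(cell, q_lo, q_hi):
    """Top-down descent: keep relevant cells, prune irrelevant ones,
    and examine the next lower layer only when the answer is unclear."""
    if cell.hi < q_lo or cell.lo > q_hi:       # clearly irrelevant: prune the subtree
        return []
    if q_lo <= cell.lo and cell.hi <= q_hi:    # clearly relevant: keep the whole cell
        return [cell]
    if not cell.children:                      # bottom layer: keep the overlapping cell
        return [cell]
    out = []
    for child in cell.children:                # unsure: recurse into the lower layer
        out.extend(relevant_cells(child, q_lo, q_hi))
    return out
```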
25
STING: A Statistical Information
Grid Approach (3)


Advantages:
 Query-independent, easy to parallelize, incremental update
 O(K), where K is the number of grid cells at the lowest level
 Can be used in conjunction with a grid-based clustering algorithm
Disadvantages:
 All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
26
Subspace Clustering


Clustering in very high-dimensional spaces is very difficult
High-dimensional attribute spaces tend to be sparse, so it is hard to find any clusters
It is also very difficult to create summaries for clusters in very high-dimensional spaces
This creates the motivation for subspace clustering:
 Find interesting subspaces (areas that are dense with respect to the attributes belonging to the subspace)
 Find clusters for each interesting subspace
 Remark: multiple, overlapping clusters might be obtained; basically one clustering for each subspace
27
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
CLIQUE can be considered both density-based and grid-based:
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a subspace
28
CLIQUE: The Major Steps



Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori principle (a sketch of the dense-unit search is given below).
Identify clusters:
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest.
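A minimal sketch of the first two steps (my own simplification: equal-width intervals per dimension, a density threshold `tau`, and only the 1-D to 2-D Apriori step is shown):

```python
import numpy as np
from itertools import combinations

def bin_index(X, j, n_intervals):
    """Equal-width interval index of each point along dimension j."""
    col = X[:, j]
    lo, hi = col.min(), col.max()
    b = ((col - lo) / (hi - lo + 1e-12) * n_intervals).astype(int)
    return np.minimum(b, n_intervals - 1)

def clique_dense_units(X, n_intervals=10, tau=0.05):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    bins = {j: bin_index(X, j, n_intervals) for j in range(d)}

    # 1-D dense units: intervals holding more than a fraction tau of the points
    dense1 = {}
    for j in range(d):
        counts = np.bincount(bins[j], minlength=n_intervals)
        dense1[j] = {b for b in range(n_intervals) if counts[b] / n > tau}

    # Apriori principle: a dense 2-D unit must project onto dense 1-D units,
    # so only crossings of dense 1-D units need to be counted at all
    dense2 = {}
    for j, k in combinations(range(d), 2):
        kept = set()
        for bj in dense1[j]:
            for bk in dense1[k]:
                if ((bins[j] == bj) & (bins[k] == bk)).sum() / n > tau:
                    kept.add((bj, bk))
        if kept:
            dense2[(j, k)] = kept
    return dense1, dense2
```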
29
[Figure (t = 3): grids over the subspaces salary ($10,000) vs. age and vacation (weeks) vs. age, used to identify dense units and the clusters they form]
30
Strength and Weakness of CLIQUE


Strength:
 It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
 It is insensitive to the order of records in input and does not presume some canonical data distribution
 It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
Weakness:
 The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
 Quite expensive
31
Self-organizing feature maps (SOMs)





Clustering is also performed by having several units compete for the current object
The unit whose weight vector is closest to the current object wins
The winner and its neighbors learn by having their weights adjusted (see the sketch below)
SOMs are believed to resemble processing that can occur in the brain
Useful for visualizing high-dimensional data in 2- or 3-D space
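A minimal sketch of this competitive learning scheme (a standard online SOM update written for illustration; the grid size, learning rate, neighborhood radius, and their decay schedules are assumed values):

```python
import numpy as np

def train_som(X, grid=(5, 5), epochs=20, lr=0.5, radius=1.5, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    W = rng.random((grid[0], grid[1], X.shape[1]))       # unit weight vectors
    coords = np.dstack(np.mgrid[0:grid[0], 0:grid[1]])   # unit positions on the map
    for _ in range(epochs):
        for x in rng.permutation(X):
            # competition: the unit whose weights are closest to x wins
            dists = np.linalg.norm(W - x, axis=2)
            winner = np.unravel_index(dists.argmin(), dists.shape)
            # cooperation: the winner and its map neighbors move toward x
            grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            W += lr * h[..., None] * (x - W)
        lr *= 0.95        # decay the learning rate over epochs
        radius *= 0.95    # shrink the neighborhood over epochs
    return W
```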
32
References (1)










R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
33
References (2)









L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
34