Data Mining: Concepts and Techniques
Data Mining:
Concepts and Techniques
— Chapter 7 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
April 10, 2016
Data Mining: Concepts and Techniques
1
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering
feature spaces
Detect spatial clusters or support other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function d(i, j), which is typically a metric
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled (i.e., numeric), boolean,
categorical, ordinal, ratio, and vector variables.
Weights should be associated with different variables based
on applications and data semantics.
It is hard to define “similar enough” or “good enough”;
the answer is typically highly subjective.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data (incremental clustering)
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
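Data matrix (n objects × p variables; two modes)
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (n objects × n objects; one mode)
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]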
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
Standardize data
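Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
where
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]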
Using mean absolute deviation is more robust than using
standard deviation
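A minimal Python sketch of the standardization above, assuming the values of one variable f are held in a plain list. The function name `standardize`, the sample values, and the handling of a zero deviation are illustrative assumptions, not part of the slides.

```python
def standardize(values):
    """Return z-scores z_if = (x_if - m_f) / s_f for one variable f,
    where s_f is the mean absolute deviation (not the standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        # Assumption: a constant variable carries no information, so map it to 0.
        return [0.0] * n
    return [(x - m_f) / s_f for x in values]


ages = [20, 30, 40, 50, 60]   # hypothetical measurements
print(standardize(ages))
```

Because the deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so its z-score stays comparatively large and the outlier remains detectable.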
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include the Minkowski distance:
\[ d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \]
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive
integer
If q = 1, d is the Manhattan distance:
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
Similarity and Dissimilarity Between
Objects (Cont.)
If q = 2, d is the Euclidean distance:
\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
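The Minkowski family can be sketched in a few lines of Python. The function name `minkowski` and the sample points are assumptions made for illustration; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects x and y."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


i = (1, 2)   # hypothetical objects
j = (4, 6)
print(minkowski(i, j, q=1))   # Manhattan distance
print(minkowski(i, j, q=2))   # Euclidean distance
```

For these two points the coordinate differences are 3 and 4, so the Manhattan distance is 7 and the Euclidean distance is 5.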
Binary Variables
A contingency table for binary data

                  Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
            sum   a+c    b+d    p

Distance measure for symmetric binary variables:
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
Distance measure for asymmetric binary variables:
\[ d(i,j) = \frac{b + c}{a + b + c} \]
Jaccard coefficient (similarity measure for asymmetric binary variables):
\[ sim_{Jaccard}(i,j) = \frac{a}{a + b + c} \]
Dissimilarity between Binary Variables
Example
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      Y      N       N       N       N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = ? \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
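The patient example can be checked with a short Python sketch. It assumes the encoding stated on the slide (Y and P become 1, N becomes 0) and uses only the six asymmetric attributes; the names `asymmetric_binary_dissimilarity` and `patients` are illustrative. Under that encoding the pair Jack and Jim has a = 1, b = 1, c = 1, so its dissimilarity works out to 2/3.

```python
def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (b + c) / (a + b + c); 0-0 matches (d) are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    if a + b + c == 0:
        return 0.0   # assumption: objects with no positive value are treated as identical
    return (b + c) / (a + b + c)


# Columns: Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[first], patients[second])
    print(first, second, round(d, 2))
```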
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
\[ d(i,j) = \frac{p - m}{p} \]
Method 2: use a large number of binary variables
creating a new binary variable for each of the M
nominal states
Nominal Variables
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      Y      N       N       N       N

Name   Gender  Red    Yellow  Green  Blue   Pink   Gray
Jack   M       Y      N       P      N      N      N
Mary   F       Y      N       P      N      P      N
Jim    M       Y      Y       N      N      N      N
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank rif ∈ {1, …, Mf}
map the range of each variable onto [0, 1] by replacing
i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
compute the dissimilarity using methods for interval-scaled variables
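A small Python sketch of the rank mapping above. It assumes the ordered states of the variable are known in advance and that there are at least two of them; the function name and the example grades are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z_if = (r_if - 1) / (M_f - 1) in [0, 1]."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    # Rank r_if runs from 1 (lowest state) to M_f (highest state).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


grades = ["fair", "good", "excellent", "good"]   # hypothetical ordinal data
print(ordinal_to_unit_interval(grades, ["fair", "good", "excellent"]))
```

The resulting values can then be fed to any interval-scaled distance, such as the Manhattan or Euclidean distance.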
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
Methods:
treat them like interval-scaled variables: not a good
choice! (Why? The scale can be distorted.)
apply logarithmic transformation
yif = log(xif)
treat them as continuous ordinal data and treat their ranks as
interval-scaled
Variables of Mixed Types
A database may contain all the six types of variables
symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
One may use a weighted formula to combine their
effects
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled
compute ranks rif and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
and treat zif as interval-scaled
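The weighted formula can be sketched in Python as follows. This is a sketch under stated assumptions, not the slides' definition in full: the indicator for variable f is taken to be 0 when either value is missing (None) or when f is asymmetric binary and both values are 0, and 1 otherwise; the "normalized distance" for interval variables is taken to be the absolute difference divided by the variable's range. The names `mixed_dissimilarity`, `kinds`, and `ranges` are illustrative.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted dissimilarity over variables of mixed types.

    kinds[f]  : "nominal", "symmetric", "asymmetric", or "interval"
    ranges[f] : max - min of variable f (used only for "interval")
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                      # indicator = 0: missing value
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue                      # indicator = 0: uninformative 0-0 match
        if kind in ("nominal", "symmetric", "asymmetric"):
            d = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d = abs(xf - yf) / ranges[f]
        else:
            raise ValueError("unknown variable type: " + kind)
        numerator += d
        denominator += 1.0
    return numerator / denominator if denominator else 0.0


kinds = ("nominal", "asymmetric", "interval")
ranges = (None, None, 50.0)               # hypothetical range of the third variable
print(mixed_dissimilarity(("red", 1, 30.0), ("blue", 1, 40.0), kinds, ranges))
```

Ordinal and ratio-scaled variables can be passed as "interval" after converting them to zif with a range of 1.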
Vector Objects
Vector objects: keywords in documents, gene
features in micro-arrays, etc.
Broad applications: information retrieval, biological
taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient
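The formulas for these two measures did not survive on this slide. As a hedged illustration, the sketch below uses the standard definitions: cosine similarity is the dot product divided by the product of the vector lengths, and the Tanimoto coefficient is the dot product divided by (x·x + y·y − x·y). The function names and the sample keyword vectors are assumptions.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def tanimoto(x, y):
    """Tanimoto coefficient; equals the Jaccard coefficient for 0/1 vectors."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


doc1 = (1, 1, 0, 0)   # hypothetical keyword-occurrence vectors
doc2 = (0, 1, 1, 0)
print(cosine(doc1, doc2), tanimoto(doc1, doc2))
```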
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Major Clustering Approaches (II)
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best
fit of the data to that model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance
between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)
Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
Medoid: distance between the medoids of two clusters,
i.e., dis(Ki, Kj) = dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
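The alternatives above differ only in how the pairwise distances between the two clusters are aggregated. A minimal Python sketch, assuming Euclidean distance and clusters given as lists of coordinate tuples (all function names and sample points are illustrative):

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(ki, kj):
    """All distances between an element of ki and an element of kj."""
    return [euclidean(p, q) for p in ki for q in kj]


def single_link(ki, kj):
    return min(pairwise(ki, kj))


def complete_link(ki, kj):
    return max(pairwise(ki, kj))


def average_link(ki, kj):
    d = pairwise(ki, kj)
    return sum(d) / len(d)


def centroid(k):
    return tuple(sum(col) / len(k) for col in zip(*k))


def centroid_link(ki, kj):
    return euclidean(centroid(ki), centroid(kj))


A = [(0, 0), (0, 1)]   # hypothetical clusters
B = [(3, 0), (4, 0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
```

The medoid variant would follow the same pattern as `centroid_link`, with the centroid replaced by a chosen, centrally located object of each cluster.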
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster
\[ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} \]
Radius: square root of the average squared distance from any point of the
cluster to its centroid
\[ R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}} \]
Diameter: square root of the average squared distance between
all pairs of points in the cluster
\[ D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}} \]
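The three quantities can be computed directly from their definitions. The sketch below assumes numeric points stored as tuples and interprets the squared differences as squared Euclidean distances; the function names and the sample cluster are illustrative.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))


def diameter(points):
    """Square root of the average squared distance over all ordered pairs i != j."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)  # i == j terms add 0
    return math.sqrt(total / (n * (n - 1)))


square = [(0, 0), (2, 0), (0, 2), (2, 2)]   # hypothetical cluster
print(centroid(square), radius(square), diameter(square))
```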
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters such that the sum of squared distances is minimized:
\[ \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2 \]
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method
Typically, the square-error criterion is used:
\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \]
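A minimal Python sketch of the usual k-means iteration (assign each object to the nearest center, recompute the means, repeat until nothing changes), reporting the square-error criterion E at the end, where m_i is the mean of cluster C_i. Taking the initial centers as an explicit argument, keeping an empty cluster's old center, and the `max_iter` cap are assumptions made to keep the example deterministic.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Return (centers, clusters, E) after iterating assignment and mean update."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its most similar (nearest) center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each cluster mean (an empty cluster keeps its center).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    error = sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)
    return centers, clusters, error


data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # hypothetical objects
centers, clusters, error = kmeans(data, centers=[(1, 1), (2, 1)])
print(centers, error)
```

Different initial centers can lead to different local optima, which is why the procedure is often restarted several times in practice.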
The K-Means Clustering Method
Example
[Figure: k-means with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Comments on the K-Means Method
Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as deterministic annealing
and genetic algorithms
Weakness
Applicable only when the mean is defined; what about categorical
data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially
distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the most
centrally located object in a cluster.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale
well for large data sets
CLARA (Kaufman & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (Total Cost = 20); randomly select a non-medoid object, O_random; compute the total cost of swapping (Total Cost = 26); swap O and O_random if quality is improved; loop until no change.]
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
Use real objects to represent the clusters
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
3. For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
4. Repeat steps 2-3 until there is no change
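The swap loop can be sketched in Python as below. This is a simplified variant rather than the exact procedure on the slide: on each pass it evaluates TCih for every (medoid i, non-medoid h) pair by recomputing the total cost, applies only the single best swap with TCih < 0, and stops when no swap improves the cost. The function names, the Manhattan distance, and the sample points are assumptions.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids, dist):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, initial_medoids, dist=manhattan):
    """Return (medoids, labels, cost); medoids and labels are indices into points."""
    medoids = list(initial_medoids)
    cost = total_cost(points, medoids, dist)
    while True:
        best_cost, best_medoids = cost, None
        for pos in range(len(medoids)):            # selected object i
            for h in range(len(points)):           # non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                candidate_cost = total_cost(points, candidate, dist)
                if candidate_cost < best_cost:     # i.e. TC_ih < 0 and best so far
                    best_cost, best_medoids = candidate_cost, candidate
        if best_medoids is None:
            break                                  # no swap improves the clustering
        medoids, cost = best_medoids, best_cost
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels, cost


data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]   # hypothetical objects
print(pam(data, initial_medoids=[0, 1]))
```

Recomputing the total cost for every candidate swap is the source of the poor scaling on large data sets; CLARA and CLARANS reduce this work by sampling.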
PAM Clustering: Total swapping cost TCih = Σj Cjih
[Figure: four panels showing the cost contribution Cjih of a non-selected object j when medoid i is swapped with non-medoid h, where t is another medoid. The panels are labeled:]
Cjih = d(j, h) - d(j, i)
Cjih = 0
Cjih = d(j, t) - d(j, i)
Cjih = d(j, h) - d(j, t)
What Is the Problem with PAM?
PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not
scale well for large data sets.
O(k(n-k)^2) for each iteration,
where n is # of data objects, k is # of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufman and Rousseeuw, 1990)
Built into statistical analysis packages, such as S-Plus
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging objects a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs in the opposite direction, with its steps numbered in reverse.]
AGNES (Agglomerative Nesting)
Introduced in Kaufman and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-Plus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
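A naive Python sketch of single-link agglomerative merging in the spirit of AGNES. It assumes a user-supplied distance function, stops when k clusters remain (the termination condition), and makes no attempt at efficiency; the function name and sample data are illustrative.

```python
def agnes_single_link(points, dist, k=1):
    """Merge the two least dissimilar clusters until only k clusters remain.

    Returns (clusters, merges); clusters hold indices into points and
    merges records (cluster_a, cluster_b, merge_distance) in merge order.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: smallest distance between members of the two clusters.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


values = [1.0, 2.0, 6.0, 7.0, 15.0]   # hypothetical one-dimensional objects
clusters, merges = agnes_single_link(values, dist=lambda p, q: abs(p - q), k=1)
for left, right, d in merges:
    print(left, right, d)
```

With single link the recorded merge distances never decrease, which matches the non-descending order noted above and is what a dendrogram displays.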
Dendrogram: Shows How the Clusters are
Merged
DIANA (Divisive Analysis)
Introduced in Kaufman and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-Plus
Inverse order of AGNES
Eventually each node forms a cluster on its own
Recent Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods
do not scale well: time complexity of at least O(n^2),
where n is the number of total objects
can never undo what was done previously
Integration of hierarchical with distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
ROCK (1999): clustering categorical data by neighbor
and link analysis
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the
clusters
CURE ignores information about interconnectivity of the objects;
ROCK ignores information about the closeness of two clusters
A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
[Figure: Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]
CHAMELEON (Clustering Complex Objects)
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition: |NEps(q)| >= MinPts
[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from
a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
Density-connected
A point p is density-connected to a
point q w.r.t. Eps, MinPts if there
is a point o such that both, p and
q are density-reachable from o
w.r.t. Eps and MinPts
DBSCAN: Density Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps
and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been
processed.
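The steps above can be sketched in Python as follows. The sketch assumes a brute-force neighbourhood query (no spatial index), counts a point as a member of its own Eps-neighbourhood, and labels noise as -1; the function name `dbscan` and the sample data are illustrative.

```python
def dbscan(points, eps, min_pts, dist):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbours(i):
        # Eps-neighbourhood of point i, including i itself (brute force).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                       # already processed
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # not a core point; may become a border point later
            continue
        labels[i] = cluster_id             # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours) # j is also a core point: keep expanding
        cluster_id += 1
    return labels


values = [0.0, 0.5, 1.0, 1.5, 5.0, 5.4, 5.8, 20.0]   # hypothetical 1-D points
print(dbscan(values, eps=0.6, min_pts=3, dist=lambda p, q: abs(p - q)))
```

In this example the isolated value 20.0 is never density-reachable from a core point and stays labeled as noise.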
DBSCAN: Sensitive to Parameters
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering
Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Produces a special order of the database with respect to its
density-based clustering structure
This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of
parameter settings
Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
Can be represented graphically or using visualization
techniques
OPTICS: Some Extension from
DBSCAN
New Definition:
[Figure: reachability plot showing the reachability-distance (possibly undefined) of each object against the cluster-order of the objects]
Density-Based Clustering: OPTICS & Its Applications
Grid-Based Clustering Method
Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by Wang,
Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
A multi-resolution clustering approach using wavelet
method
CLIQUE: Agrawal, et al. (SIGMOD’98)
On high-dimensional data (thus put in the section on clustering
high-dimensional data)
STING: A Statistical Information Grid Approach
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
The STING Clustering Method
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from
parameters of lower level cell
count, mean, s, min, max
type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small
number of cells
For each cell in the current level compute the confidence
interval
Comments on STING
Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the
next lower level
Repeat this process until the bottom layer is reached
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the lowest
level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis
www.cs.uiuc.edu/~hanj
Thank you !!!
CABDET
( Clustering Algorithm based on Building a DEnsity-Tree )
Classical density-based methods require two
parameters: radius of neighborhood ε and
density threshold MinPts
Three applicable algorithms:
(1) DBSCAN
(2) OPTICS
(3) DILC: Density-isoline clustering
One parameter is fixed or both are fixed
CABDET—two Changeable Parameters
Definitions
Definition 1 (point density): The number of objects
within the ε-neighborhood of a given object P is called the ε-neighborhood density of object P, denoted by Density
(P, ε).
Definition 2 (neighborhood coefficient): If object Q
is within the εp-neighborhood of object P, then the
neighborhood coefficient of object Q is defined by:
Density(Q, P )
Density(P, Q )
CABDET
Example:
CABDET
Establishing adjacency
Method: the current object is regarded as the father node, whose sons are
the other objects within the neighborhood radius of the current object.
Sons on the same level are ranked from left to right according to their
distance from their father, from short to long. If a son node has several
fathers, we choose the leftmost node as its father. Nodes are processed
in breadth-first order.
The Algorithm: 4 Steps
Step 1: calculate the distance array Dist (i,j)
among all objects.
Step 2: calculate the allowed max-radius of the
neighborhood.
Step 3: extend the density-tree until no point is
added according to its dynamic-generating radius
of the neighborhood.
Step 4: delete clusters containing one object or
very few objects.
Step 1 & Step 2
calculate the distance array Dist (i,j)
Similarity matrix Dist (i, j) is computed
from the distance function which is defined
according to similarity.
Calculate the allowed max-radius of the
neighborhood
(1) search Dist(i, j) and compute point density according to ε0
(2) rank objects in descending order of point density and
record the corresponding identifiers in the vector array Desc(n)
(3) compute the median value of point density:
Step 3 (1)
Step 3 (2)
Step 3 (3)
Step 4 Delete Noise
Deleting clusters containing one point or very
few points
The algorithm CABDET directly deletes
those clusters containing fewer objects than a
threshold predefined by users
Computational complexity:
O(n^2)
where n is the number of objects.
Performance: Synthetic Test Sets
Performance: Real Test Sets
Improved Methods:
(1) Decrease the Running Time
(2) Merge the micro-clusters by hierarchical clustering
(3) Others