
Clustering and Object Similarity Evaluation
©Jiawei Han and Micheline Kamber
with Additions and Modifications by Ch. Eick
Organization for COSC 6340:
1. What is Clustering?
2. Object Similarity Measurement
3. K-Means Clustering Algorithm
4. Hierarchical Clustering
Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Data Structures for Clustering
• Data matrix (n objects, p attributes):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

• (Dis)Similarity matrix (n × n):

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
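A minimal sketch (not from the original slides) of how the two structures relate: given the n × p data matrix, the n × n dissimilarity matrix can be filled pairwise with any distance function; Euclidean distance is assumed here.

```python
import numpy as np

def dissimilarity_matrix(X, dist=None):
    """Build an n x n dissimilarity matrix from an n x p data matrix X.

    dist: pairwise distance function; Euclidean distance is assumed
    if none is given.
    """
    if dist is None:
        dist = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):          # matrix is symmetric with a zero diagonal
            D[i, j] = D[j, i] = dist(X[i], X[j])
    return D

# Example: 4 objects with 2 attributes
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 7.5]])
print(dissimilarity_matrix(X))
```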
Quality Evaluation of Clusters
• Dissimilarity/similarity metric: similarity is expressed in terms of a normalized distance function d, which is typically a metric; typically s(oi, oj) = 1 − d(oi, oj).
• There is a separate "quality" function that measures the "goodness" of a cluster.
• The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Challenges in Obtaining Object Similarity Measures
• Many types of variables:
  • Interval-scaled variables
  • Binary variables and nominal variables
  • Ordinal variables
  • Ratio-scaled variables
• Objects are characterized by variables belonging to different types (a mixture of variables)
Case Study: Patient Similarity
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age)
Attribute domains:
• ssn: 9 digits
• weight: between 30 and 650; mweight=158, sweight=24.20
• height: between 0.30 and 2.20 (in meters); mheight=1.52, sheight=19.2
• cancer-sev: 4=serious, 3=quite_serious, 2=medium, 1=minor
• eye-color: {brown, blue, green, grey}
• age: between 3 and 100; mage=45, sage=13.2
Task: Define Patient Similarity
Generating a Global Similarity Measure
from Single Variable Similarity Measures
Assumption: A database may contain up to six
types of variables: symmetric binary, asymmetric
binary, nominal, ordinal, interval and ratio.
1. Standardize each variable, associate a similarity measure df with the standardized f-th variable, and determine the weight wf of the f-th variable.
2. Create the following global (dis)similarity measure d:
\[
d(o_i, o_j) = \frac{\sum_{f=1}^{p} d_f(o_i, o_j) \cdot w_f}{\sum_{f=1}^{p} w_f}
\]
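A minimal sketch of this weighted combination, assuming each per-variable measure df has already been defined as a function of two objects (the function and parameter names are illustrative):

```python
def global_similarity(o_i, o_j, measures, weights):
    """Combine per-variable (dis)similarities d_f into one global measure d.

    measures: list of functions d_f(o_i, o_j), one per variable
    weights:  list of weights w_f, one per variable
    """
    numerator = sum(d_f(o_i, o_j) * w_f for d_f, w_f in zip(measures, weights))
    return numerator / sum(weights)
```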
A Methodology to Obtain a Similarity Matrix
1. Understand the variables.
2. Remove (non-relevant and redundant) variables.
3. (Standardize and) normalize variables (typically using z-scores, or by transforming variable values to numbers in [0,1]).
4. Associate a (dis)similarity measure df with each variable.
5. Associate a weight (measuring its importance) with each variable.
6. Compute the (dis)similarity matrix.
7. Apply a similarity-based data mining technique (e.g. clustering, nearest neighbor, multi-dimensional scaling, …).
Interval-scaled Variables
• Standardize data using z-scores:
  • Calculate the mean absolute deviation:
\[
s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)
\]
    where
\[
m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)
\]
  • Calculate the standardized measurement (z-score):
\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]
• Using the mean absolute deviation is more robust than using the standard deviation.
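A minimal sketch of this standardization, using the mean absolute deviation as the slide recommends (the sample values are made up):

```python
import numpy as np

def z_scores_mad(x):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    x = np.asarray(x, dtype=float)
    m_f = x.mean()                    # m_f: mean of the f-th variable
    s_f = np.abs(x - m_f).mean()      # s_f: mean absolute deviation
    return (x - m_f) / s_f

# Example on a handful of 'weight' values from the patient case study
print(z_scores_mad([150, 160, 180, 142, 158]))
```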
Normalization in [0,1]
Problem: If non-normalized variables are used, the maximum distance between two values can be greater than 1.
Solution: Normalize interval-scaled variables using
\[
z_{if} = \frac{x_{if} - \min_f}{(\max_f - \min_f) \cdot s}
\]
where minf denotes the minimum value and maxf the maximum value of the f-th attribute in the data set, and s is a constant that is chosen depending on the similarity measure (e.g. if the Manhattan distance is used, s is chosen to be 1).
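A minimal sketch of this normalization with s = 1 (the choice suggested for the Manhattan distance):

```python
import numpy as np

def min_max_normalize(x, s=1.0):
    """Map an interval-scaled variable into [0, 1/s] using the data set's min and max."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / ((x.max() - x.min()) * s)

print(min_max_normalize([30, 158, 650]))   # [0.0, ~0.206, 1.0]
```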
Other Normalizations
Goal: limit the maximum distance to 1.
• Start with a distance measure df(x,y).
• Determine the maximum distance dmaxf that can occur between two values of the f-th attribute (e.g. dmaxf = maxf − minf).
• Define the similarity d'f(x,y) = 1 − df(x,y)/dmaxf.
Advantage: negative similarities cannot occur.
Similarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects.
• Some popular ones include the Minkowski distance:
\[
d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\,}
\]
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
• If q = 1, d is the Manhattan distance:
\[
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|
\]
Similarity Between Objects (Cont.)
• If q = 2, d is the Euclidean distance:
\[
d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2\,}
\]
• Properties:
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)
• Also, one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
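A minimal sketch of the Minkowski family, with Manhattan (q = 1) and Euclidean (q = 2) as special cases:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(i, j, q=1))   # Manhattan distance: 5.0
print(minkowski(i, j, q=2))   # Euclidean distance: ~3.61
```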
Similarity with respect to a Set of Binary Variables
• A contingency table for binary data:

                    Object j
                    1        0        sum
    Object i   1    a        b        a + b
               0    c        d        c + d
               sum  a + c    b + d    p

• Jaccard coefficient (ignores agreements in 0's):
\[
d_{Jaccard}(i,j) = \frac{a}{a + b + c}
\]
• Simple matching coefficient (considers agreements in 0's and 1's to be equivalent):
\[
d_{sym}(i,j) = \frac{a + d}{a + b + c + d}
\]
Similarity between Binary Variable Sets
• Example:

    Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
    Jack   M        Y       N       P        N        N        N
    Mary   F        Y       N       P        N        P        N
    Jim    M        Y       P       N        N        N        N

• gender is a symmetric attribute
• the remaining attributes are asymmetric binary
• let the values Y and P be set to 1, and the value N be set to 0
\[
d_{Jacc}(\mathrm{jack}, \mathrm{mary}) = \frac{2}{2 + 0 + 1} = 0.67
\]
\[
d_{Jacc}(\mathrm{jack}, \mathrm{jim}) = \frac{1}{1 + 1 + 1} = 0.33
\]
\[
d_{Jacc}(\mathrm{jim}, \mathrm{mary}) = \frac{1}{1 + 1 + 2} = 0.25
\]
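A minimal sketch that reproduces these values for the asymmetric binary attributes (each patient is encoded as a 0/1 tuple over Fever, Cough, Test-1 through Test-4):

```python
def jaccard(x, y):
    """Jaccard coefficient for two asymmetric binary vectors (1 = present)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # 1 in x only
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # 1 in y only
    return a / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(jaccard(jack, mary), 2))  # 0.67
print(round(jaccard(jack, jim), 2))   # 0.33
print(round(jaccard(jim, mary), 2))   # 0.25
```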
Nominal Variables
• A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
• Method 1: simple matching
  • m: # of matches, p: total # of variables
\[
d(o_i, o_j) = \frac{p - m}{p}
\]
• Method 2: use a large number of binary variables
  • create a new binary variable for each of the M nominal states
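A minimal sketch of Method 1 (simple matching) for nominal attributes:

```python
def nominal_dissimilarity(o_i, o_j):
    """Simple matching dissimilarity (p - m) / p for nominal attributes."""
    p = len(o_i)                                     # total number of variables
    m = sum(1 for a, b in zip(o_i, o_j) if a == b)   # number of matches
    return (p - m) / p

print(nominal_dissimilarity(("brown", "red"), ("blue", "red")))  # 0.5
```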
Ordinal Variables
• An ordinal variable can be discrete or continuous.
• Order is important (e.g. UH-grade, hotel-rating).
• Can be treated like interval-scaled:
  • replace xif by its rank rif ∈ {1, …, Mf}
  • map the range of each variable onto [0, 1] by replacing the f-th variable of the i-th object by
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]
  • compute the dissimilarity using methods for interval-scaled variables
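A minimal sketch of the rank-based mapping for one ordinal variable, using the cancer-sev scale from the case study as an example:

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r_if = ordered_states.index(value) + 1   # rank in {1, ..., M_f}
    M_f = len(ordered_states)
    return (r_if - 1) / (M_f - 1)

severity = ["minor", "medium", "quite_serious", "serious"]
print(ordinal_to_unit_interval("quite_serious", severity))  # 2/3
```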
Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}.
• Methods:
  • treat them like interval-scaled variables (not a good choice! why?)
  • apply a logarithmic transformation: yif = log(xif)
  • treat them as continuous ordinal data and treat their ranks as interval-scaled
Case Study --- Normalization
Patient(ssn, weight, height, cancer-sev, eye-color, age)
• Attribute relevance: ssn: none; eye-color: minor; all others: major
• Attribute normalization:
  • ssn: remove!
  • weight: between 30 and 650; mweight=158, sweight=24.20; transform to zweight = (xweight − 158)/24.20 (alternatively, zweight = (xweight − 30)/620)
  • height: normalize like weight!
  • cancer-sev: 4=serious, 3=quite_serious, 2=medium, 1=minor; transform 4 to 1, 3 to 2/3, 2 to 1/3, 1 to 0, and then normalize like weight!
  • age: normalize like weight!
Case Study --- Weight Selection
and Similarity Measure Selection
Patient(ssn, weight, height, cancer-sev, eye-color, age)
• For the normalized weight, height, cancer-sev, and age values use a Manhattan-distance-based function; e.g.:
  dweight(w1,w2) = 1 − |(w1 − 158)/24.20 − (w2 − 158)/24.20|
• For eye-color use: deye-color(c1,c2) = if c1=c2 then 1 else 0
• Weight assignment: 0.2 for eye-color; 1 for all others
Final solution: the chosen similarity measure d:
  Let o1=(s1,w1,h1,cs1,e1,a1) and o2=(s2,w2,h2,cs2,e2,a2)
  d(o1,o2) := (dweight(w1,w2) + dheight(h1,h2) + dcancer-sev(cs1,cs2) + dage(a1,a2) + 0.2·deye-color(e1,e2)) / 4.2
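A minimal sketch of this final measure. The per-variable similarities follow the slide's normalization; the cancer-sev similarity is simplified to 1 − |c1 − c2| on the already-mapped values, and the two patient tuples are made up (ssn is carried along but ignored):

```python
def sim_weight(w1, w2):      return 1 - abs((w1 - 158) / 24.20 - (w2 - 158) / 24.20)
def sim_height(h1, h2):      return 1 - abs((h1 - 1.52) / 19.2 - (h2 - 1.52) / 19.2)
def sim_cancer_sev(c1, c2):  return 1 - abs(c1 - c2)   # values already mapped to {0, 1/3, 2/3, 1}
def sim_age(a1, a2):         return 1 - abs((a1 - 45) / 13.2 - (a2 - 45) / 13.2)
def sim_eye_color(e1, e2):   return 1 if e1 == e2 else 0

def patient_similarity(o1, o2):
    """d(o1, o2) as defined on this slide; weight 0.2 for eye-color, 1 for the rest."""
    _, w1, h1, cs1, e1, a1 = o1
    _, w2, h2, cs2, e2, a2 = o2
    return (sim_weight(w1, w2) + sim_height(h1, h2) + sim_cancer_sev(cs1, cs2)
            + sim_age(a1, a2) + 0.2 * sim_eye_color(e1, e2)) / 4.2

p1 = ("111-22-3333", 160, 1.70, 1.0, "brown", 40)   # hypothetical patients
p2 = ("444-55-6666", 150, 1.65, 2/3, "blue", 50)
print(patient_similarity(p1, p2))
```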
Major Clustering Approaches
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: the k-means and k-medoids algorithms
  • k-means (MacQueen'67): each cluster is represented by the center of the cluster
  • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps (see the sketch after this list):
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. the mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no more new assignments.
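A minimal sketch of these four steps on numeric data, assuming Euclidean distance and a random initial partition (plain NumPy, not the slides' own implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: returns (labels, centroids) for an n x p data matrix X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, k, size=n)            # Step 1: random initial partition
    for _ in range(max_iter):
        # Step 2: centroids of the current clusters (reseed any empty cluster)
        centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                              else X[rng.integers(0, n)] for c in range(k)])
        # Step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # Step 4: stop when nothing changes
            break
        labels = new_labels
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
print(k_means(X, k=2))
```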
The K-Means Clustering Method
• Example: [figure omitted: a sequence of scatter plots on a 0 to 10 by 0 to 10 grid, showing objects being reassigned to the nearest center and the centers being recomputed over successive k-means iterations]
Comments on the K-Means Method
• Strength
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
  • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
• Weakness
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes
Hierarchical Clustering
• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[figure omitted: agglomerative clustering (AGNES) merges the objects a, b, c, d, e bottom-up over steps 0 to 4 (a,b → ab; d,e → de; c,de → cde; ab,cde → abcde), while divisive clustering (DIANA) performs the same sequence of splits top-down, from step 4 back to step 0]
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Decomposes data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster (see the sketch below).
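A minimal sketch of agglomerative clustering and dendrogram cutting, using SciPy's hierarchical-clustering routines (single linkage is an arbitrary choice here, and the five points are made up to loosely mirror the a..e example above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D objects
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [8.0, 8.0], [8.3, 8.4]])

Z = linkage(X, method="single")                    # AGNES-style merge tree
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)
```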
CAL-FU/UH Database Clustering
Similarity Assessment Environments
[architecture diagram omitted; components shown: a Clustering Tool with a user interface, a library of clustering algorithms, a Learning Tool, a Similarity Measure Tool fed by default choices and domain information, type and weight information, and a library of similarity measures, a Data Extraction Tool providing an object view over the DBMS, the training data, the resulting similarity measure, and a set of clusters]
Prototypes of Similarity Assessment Tools
• Prototype 1 (CAL State Fullerton): supported the interactive definition of similarity measures; the knowledge representation format does not rely on modular units; provides a nearest-neighbor clustering algorithm for database clustering; functions were supported outside a DBMS.
• Prototype 2 (UH): similarity measures are defined using a special language (not interactively); the tool supports modular units, and functions are provided using a Java/SQL-Server 2000 framework; functions were partially moved inside a DBMS (although some are still inside Java); analysis results are stored in the database and are therefore available for further analysis.
• Prototype 3 (UH): supports the learning of similarity measures; functions are provided inside the DBMS; other capabilities???
Self-organizing feature maps (SOMs)
• SOMs employ neural network techniques for clustering
• Clustering is performed by having several units compete for the current object
• The unit whose weight vector is closest to the current object wins
• The winner and its neighbors learn by having their weights adjusted
• SOMs are believed to resemble processing that can occur in the brain
• Useful for visualizing high-dimensional data in 2- or 3-D space
Problems and Challenges
• Considerable progress has been made in scalable clustering methods:
  • Partitioning: k-means, k-medoids, CLARANS
  • Hierarchical: BIRCH, CURE
  • Density-based: DBSCAN, CLIQUE, OPTICS
  • Grid-based: STING, WaveCluster
  • Model-based: AutoClass, DenClue, COBWEB
• Current clustering techniques do not address all the requirements adequately
• Constraint-based clustering analysis: constraints exist in data space (bridges and highways) or in user queries
Summary: Object Similarity & Clustering
• Cluster analysis groups objects based on their similarity and has wide applications.
• Appropriate similarity measures have to be chosen for the various types of variables and combined into a global similarity measure.
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
• Methods to measure, compute, and learn object similarity are quite important, not only for clustering, but also for nearest-neighbor approaches, information retrieval in general, and data visualization.
References (1)
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB'98.
• S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
• P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
• R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. VLDB'94.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
• W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.