No Slide Title

Transcript No Slide Title

Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 8 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
July 18, 2015
Data Mining: Concepts and Techniques
1
Chapter 8. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
July 18, 2015
Data Mining: Concepts and Techniques
2
Clustering Problem Formally



Given a database D={t1,t2,…,tn} of tuples and an integer
value k, the Clustering Problem is to define a mapping
f:D  {1,..,k} where each ti is assigned to one cluster Kj,
1<=j<=k.
A cluster, Kj, contains precisely those tuples mapped to
it.
Unlike classification problem, clusters are not known a
priori.
July 18, 2015
Data Mining: Concepts and Techniques
4
General Applications of Clustering





Pattern Recognition
Spatial Data Analysis
 create thematic maps in GIS by clustering feature
spaces
 detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
 Document classification
 Cluster Weblog data to discover groups of similar
access patterns
July 18, 2015
Data Mining: Concepts and Techniques
5
Examples of Clustering Applications





Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
July 18, 2015
Data Mining: Concepts and Techniques
6
Clustering Issues
– The appropriate number of clusters for each data set.
– How to define similarity or the criterion used to
group data together.
– Outlier handling is difficult. Should they be a part of
an existing cluster, or another cluster?
– Dynamic database, how to update the clusters when
there are changes in data.
– The semantic meaning of each cluster. (Contrast with
classes in classification process, each has a definitive
meaning.)
– Type of attributes that the clustering algorithm can
handle.
– Scalability to large datasets.
July 18, 2015
Data Mining: Concepts and Techniques
7
Notion of a cluster is ambigious
July 18, 2015
Data Mining: Concepts and Techniques
8
Different types of clusters
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 3
Cluster 4
July 18, 2015
Data Mining: Concepts and Techniques
9
What Is Good Clustering?



A good clustering method will produce high quality
clusters with

high intra-class similarity

low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
July 18, 2015
Data Mining: Concepts and Techniques
10
Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to
determine input parameters

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability
July 18, 2015
Data Mining: Concepts and Techniques
11
Chapter 8. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
July 18, 2015
Data Mining: Concepts and Techniques
12
Similarity and Dissimilarity Metric
•Similarity
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1]
• Dissimilarity
- Numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies
• Proximity refers to a similarity or dissimilarity
July 18, 2015
Data Mining: Concepts and Techniques
13
Data Structures

Data matrix



This represents n objects, such as
persons, with p variables (also called
measurements or attributes), such as age,
height, gender, race, and so on.
Called “two modes” : since rows and
columns represent different entities
Dissimilarity matrix



July 18, 2015
Stores a collection of proximities that are
available for all pairs of n objects. (n by n
matrix)
Called “one mode” : since it reprsents the
same entity
d(i,j) is the measured difference or
dissimilarity between objects i and j.
 x11

 ...
x
 i1
 ...
x
 n1
... x1f
... ...
... xif
...
...
... xnf
... x1p 

... ... 
... xip 

... ... 
... xnp 

 0

 d(2,1)

0


 d(3,1) d ( 3,2) 0



:
:
:


d ( n,1) d ( n,2) ... ... 0
Data Mining: Concepts and Techniques
14
Measure the Quality of Clustering





Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”

the answer is typically highly subjective.
July 18, 2015
Data Mining: Concepts and Techniques
15
Type of data in clustering analysis

Interval-scaled variables:

Binary variables:

Nominal, ordinal, and ratio variables:

Variables of mixed types:
July 18, 2015
Data Mining: Concepts and Techniques
16
Interval-valued variables

Interval-scaled (based) variables are continuous measurements of a roughly
linear scale (such as weight, height, weather).

The measurement unit used can affect the clustering analysis. Using inches or
meters for a measurement may lead to a very different clustering structure. To
avoid dependence on on the choice of measurement units, the data should be
standardized.

How to Standardize data

Calculate the mean absolute deviation:
sf  1
n (| x1 f  m f |  | x2 f  m f | ... | xnf  m f |)
where
mf  1
n (x1 f  x2 f  ...  xnf )

Calculate the standardized measurement (z-score)
xif  m f
zif 
sf
.

Using mean absolute deviation is more robust than using standard deviation
July 18, 2015
Data Mining: Concepts and Techniques
17
Similarity and Dissimilarity Between Objects


Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include: Minkowski distance:
d (i, j)  (| x  x |  | x  x | ... | x  x | )
i1 j1
i2 j 2
ip jp
q
q
q
q
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two pdimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1 i2 j2
ip jp
July 18, 2015
Data Mining: Concepts and Techniques
18
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:
d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 )
i1 j1
i2
j2
ip
jp

Properties





d(i,j)  0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j)  d(i,k) + d(k,j)
Also one can use weighted distance, parametric Pearson
product moment correlation, or other disimilarity
measures.
July 18, 2015
Data Mining: Concepts and Techniques
19
Euclidean Distance
Source : S. Ranka
July 18, 2015
Data Mining: Concepts and Techniques
20
Similarity and Dissimilarity Between Objects (Cont.)

Determine similarity between two objects.
Definition:
The similarity between two tuples ti and t j , sim(ti ,t j ), in a database
D is a mapping from D  D to the range[0,1]. Thus, sim(ti ,t j )  [0,1].
Similarity characteristics:
ti  D, sim(ti ,ti )  1
ti ,t j  D, sim(ti ,t j )  0 if ti and t j are not alike at all
ti ,t j ,tk  D, sim(ti ,t j )  sim(ti ,tk ) if ti is more like tk than it is like t j
Source : Dunham
July 18, 2015
Data Mining: Concepts and Techniques
21
Similarity and Dissimilarity Between Objects (Cont.)
2h 1tiht jh
k
Dice : sim(ti ,t j ) 

k
t  h 1t 2jh
k
2
h 1 ih


k
Jaccard: sim(ti ,t j ) 
t t
h 1 ih jh
k
2
h 1 jh
t
 t
t t

Cosine : sim(t ,t ) 
 t  t
t t

Overlap: sim(t ,t ) 
min  t , 
k
2
h 1 ih
h 1tiht jh
k
k
i
h 1 ih jh
j
k
k
2
h 1 ih
2
h 1 jh
k
i
j
h 1 ih jh
k
k
2
2
ih
h 1
h 1 jh
t

ti  ti1,,tik ,t j  t j1,,t jk
July 18, 2015
Data Mining: Concepts and Techniques
22
Similarity and Dissimilarity Between Objects (Cont.)

Measure dissimilarity between objects
Euclidean (L2 ) : dis(ti ,t j ) 
2
(
t

t
)
h1 ih jh
k
Manhattan(L1) : dis(ti ,t j )  h1|(tih  t jh ) |
k
July 18, 2015
Data Mining: Concepts and Techniques
23
Binary Variables

How can we compute the dissimilaty between objects descired by by either
symmetic or asymmetic binary variables. A binary variable has only two
states 0 and 1.

Symetric : both states are equally valuable and carry the same weight.


Example: gender having states male and female
Asymmetric : the outcome states are not equally important, such as the
positive and negative outcomes of a disease test.

Example :

HIV positive - represented by 1 (rarest)

HIV negative – represented by 0
July 18, 2015
Data Mining: Concepts and Techniques
24
Binary Variables

A contingency table for binary data
1
0
Object i
Object j
1
0
sum
a
c
b
d
a b
cd
sum a  c b  d
p
• Simple matching coefficient (invariant, if the binary variable is
symmetric):
d (i, j) 
bc
a bc  d
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d (i, j) 
July 18, 2015
bc
a bc
Data Mining: Concepts and Techniques
25
Dissimilarity between Binary Variables

Example
Name
Jack
Mary
Jim



Gender
M
F
M
Fever
Y
Y
Y
Cough
N
N
P
Test-1
P
P
N
Test-2
N
N
N
Test-3
N
P
N
Test-4
N
N
N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
01
 0.33
2 01
11
d ( jack, jim ) 
 0.67
111
1 2
d ( jim , mary) 
 0.75
11 2
d ( jack, mary) 
July 18, 2015
Data Mining: Concepts and Techniques
26
Nominal Variables


A nominal variable is a generalization of the binary
variable in that it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: Simple matching

m: # of matches, p: total # of variables
m
d (i, j)  p 
p

Method 2: use a large number of binary variables

creating a new binary variable for each of the M
nominal states
July 18, 2015
Data Mining: Concepts and Techniques
27
Ordinal Variables

order is important, e.g., rank

An ordinal variable can be discrete or continuous


A discrete ordinal variable resebles a nominal variable, except that M
states of the the ordinal value are ordered in a meaningful sequence
(e.g. Projesional ranks : Assistant, Associate, Full professor)
A continuous ordinal variable looks like a set of continous data of of
an unkwon scale; that is, the realtive ordering of values is essential
but their actual size is not. (e.g. The relative ranking in a particular
sport: gold, silver, and bronze)
July 18, 2015
Data Mining: Concepts and Techniques
28
Ordinal Variables

They can be treated like interval-scaled
Suppose f is a variable from a set of ordinal variables descibing n
objects


The value of f for the ith object is xif
f has Mf ordered states 1,....,Mf
Replace each xif by its rank corresponding rank

rif {1,...,M f }

map the range of each variable onto [0, 1] by replacing i-th object in
the f-th variable by

zif

rif 1

M f 1
compute the dissimilarity using methods for interval-scaled variables
July 18, 2015
Data Mining: Concepts and Techniques
29
Ratio-Scaled Variables


Ratio-scaled variable: A ratio scale variable makes a positive
measurement on a nonlinear scale, approximately at exponential
scale, such as AeBt or Ae-Bt
Methods:

treat them like interval-scaled variables — not a good choice!
(why?) it is likely that the scale may be distorted.

apply logarithmic transformation
yif = log(xif)

treat them as continuous ordinal data treat their rank as intervalscaled.
July 18, 2015
Data Mining: Concepts and Techniques
30
Variables of Mixed Types


A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal,
interval and ratio.
One may use a weighted formula to combine their
effects.
 pf  1 ij( f ) dij( f )
d (i, j) 
 pf  1 ij( f )
 f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled
r 1
z

 compute ranks rif and
if
M 1
 and treat zif as interval-scaled
if
f
July 18, 2015
Data Mining: Concepts and Techniques
31
Chapter 8. Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary
July 18, 2015
Data Mining: Concepts and Techniques
32
Major Clustering Approaches (Han)

Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
July 18, 2015
Data Mining: Concepts and Techniques
33
Major Clustering Approaches (Dunham)
Clustering
Hierarchical
Agglomerative
July 18, 2015
Partitional
Divisive
Categorical
Sampling
Data Mining: Concepts and Techniques
Large DB
Compression
34

No Slide Title

Transcript No Slide Title

Directory