No Slide Title
Download
Report
Transcript No Slide Title
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 8 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
July 18, 2015
Data Mining: Concepts and Techniques
1
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
July 18, 2015
Data Mining: Concepts and Techniques
2
Clustering Problem Formally
Given a database D={t1,t2,…,tn} of tuples and an integer
value k, the Clustering Problem is to define a mapping
f:D {1,..,k} where each ti is assigned to one cluster Kj,
1<=j<=k.
A cluster, Kj, contains precisely those tuples mapped to
it.
Unlike classification problem, clusters are not known a
priori.
July 18, 2015
Data Mining: Concepts and Techniques
4
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature
spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar
access patterns
July 18, 2015
Data Mining: Concepts and Techniques
5
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
July 18, 2015
Data Mining: Concepts and Techniques
6
Clustering Issues
– The appropriate number of clusters for each data set.
– How to define similarity or the criterion used to
group data together.
– Outlier handling is difficult. Should they be a part of
an existing cluster, or another cluster?
– Dynamic database, how to update the clusters when
there are changes in data.
– The semantic meaning of each cluster. (Contrast with
classes in classification process, each has a definitive
meaning.)
– Type of attributes that the clustering algorithm can
handle.
– Scalability to large datasets.
July 18, 2015
Data Mining: Concepts and Techniques
7
Notion of a cluster is ambigious
July 18, 2015
Data Mining: Concepts and Techniques
8
Different types of clusters
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 1
Cluster 2
Cluster 3
Cluster 4
July 18, 2015
Data Mining: Concepts and Techniques
9
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
July 18, 2015
Data Mining: Concepts and Techniques
10
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
July 18, 2015
Data Mining: Concepts and Techniques
11
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
July 18, 2015
Data Mining: Concepts and Techniques
12
Similarity and Dissimilarity Metric
•Similarity
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1]
• Dissimilarity
- Numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies
• Proximity refers to a similarity or dissimilarity
July 18, 2015
Data Mining: Concepts and Techniques
13
Data Structures
Data matrix
This represents n objects, such as
persons, with p variables (also called
measurements or attributes), such as age,
height, gender, race, and so on.
Called “two modes” : since rows and
columns represent different entities
Dissimilarity matrix
July 18, 2015
Stores a collection of proximities that are
available for all pairs of n objects. (n by n
matrix)
Called “one mode” : since it reprsents the
same entity
d(i,j) is the measured difference or
dissimilarity between objects i and j.
x11
...
x
i1
...
x
n1
... x1f
... ...
... xif
...
...
... xnf
... x1p
... ...
... xip
... ...
... xnp
0
d(2,1)
0
d(3,1) d ( 3,2) 0
:
:
:
d ( n,1) d ( n,2) ... ... 0
Data Mining: Concepts and Techniques
14
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
July 18, 2015
Data Mining: Concepts and Techniques
15
Type of data in clustering analysis
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
July 18, 2015
Data Mining: Concepts and Techniques
16
Interval-valued variables
Interval-scaled (based) variables are continuous measurements of a roughly
linear scale (such as weight, height, weather).
The measurement unit used can affect the clustering analysis. Using inches or
meters for a measurement may lead to a very different clustering structure. To
avoid dependence on on the choice of measurement units, the data should be
standardized.
How to Standardize data
Calculate the mean absolute deviation:
sf 1
n (| x1 f m f | | x2 f m f | ... | xnf m f |)
where
mf 1
n (x1 f x2 f ... xnf )
Calculate the standardized measurement (z-score)
xif m f
zif
sf
.
Using mean absolute deviation is more robust than using standard deviation
July 18, 2015
Data Mining: Concepts and Techniques
17
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include: Minkowski distance:
d (i, j) (| x x | | x x | ... | x x | )
i1 j1
i2 j 2
ip jp
q
q
q
q
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two pdimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance
d (i, j) | x x | | x x | ... | x x |
i1 j1 i2 j2
ip jp
July 18, 2015
Data Mining: Concepts and Techniques
18
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is Euclidean distance:
d (i, j) (| x x |2 | x x |2 ... | x x |2 )
i1 j1
i2
j2
ip
jp
Properties
d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) d(i,k) + d(k,j)
Also one can use weighted distance, parametric Pearson
product moment correlation, or other disimilarity
measures.
July 18, 2015
Data Mining: Concepts and Techniques
19
Euclidean Distance
Source : S. Ranka
July 18, 2015
Data Mining: Concepts and Techniques
20
Similarity and Dissimilarity Between Objects (Cont.)
Determine similarity between two objects.
Definition:
The similarity between two tuples ti and t j , sim(ti ,t j ), in a database
D is a mapping from D D to the range[0,1]. Thus, sim(ti ,t j ) [0,1].
Similarity characteristics:
ti D, sim(ti ,ti ) 1
ti ,t j D, sim(ti ,t j ) 0 if ti and t j are not alike at all
ti ,t j ,tk D, sim(ti ,t j ) sim(ti ,tk ) if ti is more like tk than it is like t j
Source : Dunham
July 18, 2015
Data Mining: Concepts and Techniques
21
Similarity and Dissimilarity Between Objects (Cont.)
2h 1tiht jh
k
Dice : sim(ti ,t j )
k
t h 1t 2jh
k
2
h 1 ih
k
Jaccard: sim(ti ,t j )
t t
h 1 ih jh
k
2
h 1 jh
t
t
t t
Cosine : sim(t ,t )
t t
t t
Overlap: sim(t ,t )
min t ,
k
2
h 1 ih
h 1tiht jh
k
k
i
h 1 ih jh
j
k
k
2
h 1 ih
2
h 1 jh
k
i
j
h 1 ih jh
k
k
2
2
ih
h 1
h 1 jh
t
ti ti1,,tik ,t j t j1,,t jk
July 18, 2015
Data Mining: Concepts and Techniques
22
Similarity and Dissimilarity Between Objects (Cont.)
Measure dissimilarity between objects
Euclidean (L2 ) : dis(ti ,t j )
2
(
t
t
)
h1 ih jh
k
Manhattan(L1) : dis(ti ,t j ) h1|(tih t jh ) |
k
July 18, 2015
Data Mining: Concepts and Techniques
23
Binary Variables
How can we compute the dissimilaty between objects descired by by either
symmetic or asymmetic binary variables. A binary variable has only two
states 0 and 1.
Symetric : both states are equally valuable and carry the same weight.
Example: gender having states male and female
Asymmetric : the outcome states are not equally important, such as the
positive and negative outcomes of a disease test.
Example :
HIV positive - represented by 1 (rarest)
HIV negative – represented by 0
July 18, 2015
Data Mining: Concepts and Techniques
24
Binary Variables
A contingency table for binary data
1
0
Object i
Object j
1
0
sum
a
c
b
d
a b
cd
sum a c b d
p
• Simple matching coefficient (invariant, if the binary variable is
symmetric):
d (i, j)
bc
a bc d
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d (i, j)
July 18, 2015
bc
a bc
Data Mining: Concepts and Techniques
25
Dissimilarity between Binary Variables
Example
Name
Jack
Mary
Jim
Gender
M
F
M
Fever
Y
Y
Y
Cough
N
N
P
Test-1
P
P
N
Test-2
N
N
N
Test-3
N
P
N
Test-4
N
N
N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
01
0.33
2 01
11
d ( jack, jim )
0.67
111
1 2
d ( jim , mary)
0.75
11 2
d ( jack, mary)
July 18, 2015
Data Mining: Concepts and Techniques
26
Nominal Variables
A nominal variable is a generalization of the binary
variable in that it can take more than 2 states, e.g., red,
yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
m
d (i, j) p
p
Method 2: use a large number of binary variables
creating a new binary variable for each of the M
nominal states
July 18, 2015
Data Mining: Concepts and Techniques
27
Ordinal Variables
order is important, e.g., rank
An ordinal variable can be discrete or continuous
A discrete ordinal variable resebles a nominal variable, except that M
states of the the ordinal value are ordered in a meaningful sequence
(e.g. Projesional ranks : Assistant, Associate, Full professor)
A continuous ordinal variable looks like a set of continous data of of
an unkwon scale; that is, the realtive ordering of values is essential
but their actual size is not. (e.g. The relative ranking in a particular
sport: gold, silver, and bronze)
July 18, 2015
Data Mining: Concepts and Techniques
28
Ordinal Variables
They can be treated like interval-scaled
Suppose f is a variable from a set of ordinal variables descibing n
objects
The value of f for the ith object is xif
f has Mf ordered states 1,....,Mf
Replace each xif by its rank corresponding rank
rif {1,...,M f }
map the range of each variable onto [0, 1] by replacing i-th object in
the f-th variable by
zif
rif 1
M f 1
compute the dissimilarity using methods for interval-scaled variables
July 18, 2015
Data Mining: Concepts and Techniques
29
Ratio-Scaled Variables
Ratio-scaled variable: A ratio scale variable makes a positive
measurement on a nonlinear scale, approximately at exponential
scale, such as AeBt or Ae-Bt
Methods:
treat them like interval-scaled variables — not a good choice!
(why?) it is likely that the scale may be distorted.
apply logarithmic transformation
yif = log(xif)
treat them as continuous ordinal data treat their rank as intervalscaled.
July 18, 2015
Data Mining: Concepts and Techniques
30
Variables of Mixed Types
A database may contain all the six types of variables
symmetric binary, asymmetric binary, nominal, ordinal,
interval and ratio.
One may use a weighted formula to combine their
effects.
pf 1 ij( f ) dij( f )
d (i, j)
pf 1 ij( f )
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled
r 1
z
compute ranks rif and
if
M 1
and treat zif as interval-scaled
if
f
July 18, 2015
Data Mining: Concepts and Techniques
31
Chapter 8. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
July 18, 2015
Data Mining: Concepts and Techniques
32
Major Clustering Approaches (Han)
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
July 18, 2015
Data Mining: Concepts and Techniques
33
Major Clustering Approaches (Dunham)
Clustering
Hierarchical
Agglomerative
July 18, 2015
Partitional
Divisive
Categorical
Sampling
Data Mining: Concepts and Techniques
Large DB
Compression
34