Data Mining: Concepts & Techniques

Data Mining Techniques
Clustering
Purpose
• In clustering analysis, there is no pre-classified
data
• Instead, clustering analysis is a process where
a set of objects is partitioned into several
clusters
• All members in one cluster are similar to each
other and different from the members of other
clusters, according to some similarity metric
(e.g., the opposite of distance between objects)
Cluster Analysis
[Figure: scatter plot of customers (objects) on two variables, X (Income) and Y (Age), with the customers grouped into clusters]
Cluster Analysis
• Data Matrix: n objects × p variables
• Dissimilarity Matrix: n × n
Attribute Types Involved in
Cluster Analysis
• Interval Variables
– An interval variable contains continuous measurements
(e.g., height, weight, temperature, cost, etc.) which
follow a linear scale
– It is essential that intervals keep the same importance
throughout the scale
• Nominal Variables
– A nominal variable takes on more than two states. For
example, the eye color of a person can be blue, brown,
green, or grey
– These states may be coded as 1, 2, ..., M, however their
order and the interval between any two states do not
have any meaning
• Ordinal Variables
– An ordinal variable takes on more than two states. For
example, you may ask someone to convey his/her
appreciation of some paintings in terms of the
following categories: 1=detest, 2=dislike, 3=indifferent,
4=like and 5=admire
– The states of an ordinal variable are ordered in a
meaningful sequence. However, the intervals between
consecutive states are not necessarily equal
• Binary Variables
– Binary variables have only two possible states. For
example, the gender of a person is either female or male
Dissimilarity (Distance)
Measure
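Some commonly used dissimilarity measures for the attribute types above can be sketched as follows. This is an illustration under assumptions, not necessarily the exact formulas intended here: Minkowski distance for interval variables, simple matching for nominal variables, and a rank-to-[0, 1] mapping for ordinal variables. All function names are hypothetical.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def simple_matching(x, y):
    """Nominal variables: fraction of variables on which the two objects differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


def ordinal_to_unit(rank, num_states):
    """Map an ordinal rank 1..M onto [0, 1] so it can be treated like an interval value."""
    return (rank - 1) / (num_states - 1)


manhattan = minkowski((0, 0), (3, 4), 1)
euclidean = minkowski((0, 0), (3, 4), 2)
eye_hair = simple_matching(("blue", "brown"), ("blue", "green"))
indifferent = ordinal_to_unit(3, 5)  # 3=indifferent on the 1..5 appreciation scale
```

Mapping ranks onto [0, 1] is a convention: it treats the steps between ordinal states as equal, which is an approximation for this variable type.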
Categorization of Clustering
Methods
• Exclusive vs. Non-Exclusive (Overlapping)
• Hierarchical Methods vs. Partitioning Methods
• Hierarchical Methods
– Single Link Method
– Complete Link Method
• Partitioning Methods
– Kohonen Self-Organizing Feature Maps
– K-Means Methods
– K-Medoids Methods (PAM, CLARA, CLARANS)
– Density-Based Methods
– …
Hierarchical Methods
Dissimilarity
Matrix (5×5)
K-Means Methods
Sensitive to outliers!
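The sensitivity to outliers can be seen in a minimal K-Means sketch. The version below assumes the common two-step iteration (assign each object to its nearest centroid, then move each centroid to the mean of its members) with Euclidean distance and caller-supplied initial centroids. The function name k_means and all data values are hypothetical.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Basic K-Means; returns (assignments, centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest centroid
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no object changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids


# Two well-separated hypothetical groups
clean = [(1, 1), (2, 1), (3, 1), (8, 1), (9, 1), (10, 1)]
start = [(1, 1), (10, 1)]
clean_labels, clean_centroids = k_means(clean, start)

# The same data plus a single outlier
noisy_labels, noisy_centroids = k_means(clean + [(100, 1)], start)
```

Without the outlier, the two groups are recovered. With it, the mean of the second cluster is dragged so far toward the outlier that the outlier ends up in a cluster of its own and the two real groups are merged.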
Exercise 7
Number of clusters = 2
Use Single Link, Complete Link, and K-Means to cluster the following data:
Object   X    Y
1        22   60
2        40   25
3        60   30
4        64   66
5        80   30
6        82   55