K-prototypes Algorithm


Extensions to the K-means
Algorithm for Clustering Large
Data Sets with Categorical Values
Author: Zhexue Huang
Advisor: Dr. Hsu
Graduate: Yu-Wei Su
2001/11/06
The Lab of Intelligent Database System, IDS
Outline
Motivation
Objective
Research Review
Notation
K-means Algorithm
K-mode Algorithm
K-prototype Algorithm
Experiment
Conclusion
Personal opinion
Motivation

K-means methods are efficient for processing large data sets
K-means is limited to numeric data
Real-world data sets with millions of objects mix numeric and categorical values
Objective

Extending K-means to categorical domains and to domains with mixed numeric and categorical values
Research Review

Partition methods
A partitioning algorithm organizes the objects into k partitions (k < n), e.g.:
K-means [MacQueen, 1967]
K-medoids [Kaufman and Rousseeuw, 1990]
CLARANS [Ng and Han, 1994]
Notation

[A1, A2, ..., Am] is the set of m attributes; each Ai describes a domain of values, denoted by DOM(Ai)
X = {X1, X2, ..., Xn} is a set of n objects; object Xi is represented as [x_{i,1}, x_{i,2}, ..., x_{i,m}]
Xi = Xk if x_{i,j} = x_{k,j} for 1 <= j <= m
A mixed-type object is written [x_1^r, ..., x_p^r, x_{p+1}^c, ..., x_m^c]: the first p elements are numeric values, the rest are categorical values
K-means Algorithm

Problem P: minimise

P(W, Q) = Σ_{l=1}^{k} Σ_{i=1}^{n} w_{i,l} d(X_i, Q_l)

subject to

Σ_{l=1}^{k} w_{i,l} = 1, 1 <= i <= n
w_{i,l} ∈ {0, 1}, 1 <= i <= n, 1 <= l <= k

k is the number of clusters, n is the number of objects
W is an n x k partition matrix; Q = {Q1, Q2, ..., Qk} is a set of objects in the same object domain
d(., .) is the Euclidean distance between two objects
K-means Algorithm (cont.)

Problem P can be solved by iteratively solving the following two problems:

Problem P1: fix Q = Q̂ and solve the reduced problem P(W, Q̂):
w_{i,l} = 1 if d(X_i, Q̂_l) <= d(X_i, Q̂_t) for 1 <= t <= k
w_{i,t} = 0 for t ≠ l

Problem P2: fix W = Ŵ and solve the reduced problem P(Ŵ, Q):
q_{l,j} = Σ_{i=1}^{n} ŵ_{i,l} x_{i,j} / Σ_{i=1}^{n} ŵ_{i,l}, for 1 <= l <= k and 1 <= j <= m
K-means Algorithm (cont.)

1. Choose an initial Q^0 and solve P(W, Q^0) to obtain W^0. Set t = 0
2. Let Ŵ = W^t and solve P(Ŵ, Q) to obtain Q^{t+1}. If P(Ŵ, Q^{t+1}) = P(Ŵ, Q^t), output Ŵ, Q^t and stop; otherwise, go to 3
3. Let Q̂ = Q^{t+1} and solve P(W, Q̂) to obtain W^{t+1}. If P(W^{t+1}, Q̂) = P(W^t, Q̂), output W^t, Q̂ and stop; otherwise, let t = t + 1 and go to 2
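The two-step iteration above can be sketched in code. A minimal NumPy version, with made-up illustrative data (not from the paper):

```python
# Minimal sketch of the k-means iteration above: alternately solve P1
# (fix the centres Q, assign objects) and P2 (fix the assignment W,
# recompute the centres). Data and parameters are illustrative.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = X[rng.choice(len(X), size=k, replace=False)]  # initial Q^0
    for _ in range(iters):
        # P1: fix Q, set w_{i,l} = 1 for the nearest centre
        dists = np.linalg.norm(X[:, None, :] - Q[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # P2: fix W, recompute each centre as its cluster mean
        newQ = np.array([X[labels == l].mean(axis=0)
                         if np.any(labels == l) else Q[l]
                         for l in range(k)])
        if np.allclose(newQ, Q):  # cost stops decreasing -> converged
            break
        Q = newQ
    return labels, Q

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centres = kmeans(X, 2)
```

On well-separated data like this, the two widely spaced groups end up in different clusters after a few iterations.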
K-mode Algorithm

Using a simple matching dissimilarity measure for categorical objects
Replacing means of clusters by modes
Using a frequency-based method to find the modes
K-mode Algorithm (cont.)

Dissimilarity measure:

d_1(X, Y) = Σ_{j=1}^{m} δ(x_j, y_j)

where δ(x_j, y_j) = 0 if x_j = y_j, and 1 if x_j ≠ y_j

Mode of a set:
A mode of X = {X1, X2, ..., Xn} is a vector Q = [q1, q2, ..., qm] that minimises

D(X, Q) = Σ_{i=1}^{n} d_1(X_i, Q)
K-mode Algorithm (cont.)

Finding a mode for a set:
Let n_{c_{k,j}} be the number of objects having the kth category c_{k,j} in attribute A_j, and let

f_r(A_j = c_{k,j} | X) = n_{c_{k,j}} / n

be the relative frequency of category c_{k,j} in X

Theorem 1: D(X, Q) is minimised iff f_r(A_j = q_j | X) >= f_r(A_j = c_{k,j} | X) for q_j ≠ c_{k,j}, for all j = 1, ..., m
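Theorem 1 means the mode can be found attribute-by-attribute from category frequencies. A minimal sketch with illustrative data:

```python
# Frequency-based mode (Theorem 1): in each attribute, pick the
# category with the highest relative frequency in X.
from collections import Counter

def mode(X):
    m = len(X[0])
    return [Counter(x[j] for x in X).most_common(1)[0][0]
            for j in range(m)]

X = [["a", "p"], ["a", "q"], ["b", "q"], ["a", "q"]]
print(mode(X))  # most frequent category per attribute: ['a', 'q']
```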
K-mode Algorithm (cont.)

Two initial mode selection methods:
1. Select the first k distinct records from the data set as the k modes
2. Select the k modes by the frequency-based method
K-mode Algorithm (cont.)

P(W, Q) = Σ_{l=1}^{k} Σ_{i=1}^{n} Σ_{j=1}^{m} w_{i,l} δ(x_{i,j}, q_{l,j})

where w_{i,l} ∈ W and Q_l = [q_{l,1}, q_{l,2}, ..., q_{l,m}] ∈ Q

The total cost P is calculated against the whole data set each time a new Q or W is obtained
K-mode Algorithm (cont.)

1. Select k initial modes, one for each cluster
2. Allocate an object to the cluster whose mode is nearest to it. Update the mode of the cluster after each allocation according to Theorem 1
K-mode Algorithm (cont.)

3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object is found whose nearest mode belongs to another cluster, reallocate the object to that cluster and update the modes of both clusters
4. Repeat step 3 until no object has changed clusters
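The steps above can be sketched as follows. This is a minimal illustrative implementation: the per-allocation updates of steps 2-3 are simplified to whole-pass (batch) updates, and the data is made up:

```python
# Sketch of the k-modes procedure above, simplified to batch updates:
# assign each object to its nearest mode under d1, then recompute the
# modes, repeating until no object changes cluster. Data is illustrative.
from collections import Counter

def d1(x, y):
    return sum(xj != yj for xj, yj in zip(x, y))

def find_mode(cluster):
    m = len(cluster[0])
    return [Counter(x[j] for x in cluster).most_common(1)[0][0]
            for j in range(m)]

def kmodes(X, k, iters=100):
    modes = [list(x) for x in X[:k]]  # method 1: first k distinct records
    labels = [min(range(k), key=lambda l: d1(x, modes[l])) for x in X]
    for _ in range(iters):
        for l in range(k):
            members = [x for x, lab in zip(X, labels) if lab == l]
            if members:
                modes[l] = find_mode(members)
        new = [min(range(k), key=lambda l: d1(x, modes[l])) for x in X]
        if new == labels:  # step 4: stop when no object changes cluster
            break
        labels = new
    return labels, modes

X = [("a", "p"), ("b", "q"), ("a", "p"), ("b", "q"), ("a", "r")]
labels, modes = kmodes(X, 2)
```

Here the "a" objects and the "b" objects separate into the two clusters.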
K-prototypes Algorithm

Integrates the k-means and k-modes algorithms to cluster mixed-type objects
[A_1^r, A_2^r, ..., A_p^r, A_{p+1}^c, ..., A_m^c] are the m attributes; the first p are numeric, the rest are categorical
K-prototypes Algorithm (cont.)

d_2(X, Y) = Σ_{j=1}^{p} (x_j - y_j)² + γ Σ_{j=p+1}^{m} δ(x_j, y_j)

The first term is the squared Euclidean distance on the numeric attributes and the second term is the simple matching dissimilarity measure on the categorical attributes
The weight γ is used to avoid favouring either type of attribute
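A small sketch of d_2; the values of p, γ, and the data are made up for the example:

```python
# Mixed dissimilarity d2: squared Euclidean distance on the first p
# numeric attributes plus gamma times the matching dissimilarity on
# the categorical ones. p, gamma, and the data are illustrative.
def d2(x, y, p, gamma):
    numeric = sum((xj - yj) ** 2 for xj, yj in zip(x[:p], y[:p]))
    categorical = sum(xj != yj for xj, yj in zip(x[p:], y[p:]))
    return numeric + gamma * categorical

x = [1.0, 2.0, "red", "round"]
y = [2.0, 2.0, "red", "square"]
print(d2(x, y, p=2, gamma=0.5))  # (1.0)^2 + 0.5 * 1 = 1.5
```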
K-prototypes Algorithm (cont.)

Cost function, minimise:

P(W, Q) = Σ_{l=1}^{k} ( Σ_{i=1}^{n} w_{i,l} Σ_{j=1}^{p} (x_{i,j} - q_{l,j})² + γ Σ_{i=1}^{n} w_{i,l} Σ_{j=p+1}^{m} δ(x_{i,j}, q_{l,j}) )
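Under this cost function, each prototype carries a numeric mean part and a categorical mode part. A minimal sketch with illustrative data and γ, using batch updates rather than per-allocation updates:

```python
# Sketch of k-prototypes under the cost function above: prototypes are
# numeric means on the first p attributes plus categorical modes on the
# rest; gamma weights the categorical term. Data and gamma are illustrative.
from collections import Counter

def d2(x, y, p, gamma):
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(a != b for a, b in zip(x[p:], y[p:]))
    return numeric + gamma * categorical

def kprototypes(X, k, p, gamma, iters=100):
    protos = [list(x) for x in X[:k]]  # first k records as initial prototypes
    labels = [min(range(k), key=lambda l: d2(x, protos[l], p, gamma))
              for x in X]
    for _ in range(iters):
        for l in range(k):
            members = [x for x, lab in zip(X, labels) if lab == l]
            if not members:
                continue
            mean = [sum(x[j] for x in members) / len(members)
                    for j in range(p)]
            mode = [Counter(x[j] for x in members).most_common(1)[0][0]
                    for j in range(p, len(members[0]))]
            protos[l] = mean + mode  # numeric means + categorical modes
        new = [min(range(k), key=lambda l: d2(x, protos[l], p, gamma))
               for x in X]
        if new == labels:
            break
        labels = new
    return labels, protos

X = [(0.0, "a"), (0.2, "a"), (5.0, "b"), (5.1, "b")]
labels, protos = kprototypes(X, 2, p=1, gamma=1.0)
```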
K-prototypes Algorithm (cont.)

[Figure: algorithm flow — choose clusters, then modify the mode]
Experiment

K-modes: the data set was the soybean disease data set, with 4 diseases and 47 instances {D=10, C=10, R=10, P=17}, and 21 attributes
K-prototypes: the second data set was the credit approval data set, with 2 classes and 666 instances {approval=299, reject=367}, and 6 numeric and 9 categorical attributes
Conclusion

The k-modes algorithm is faster than the k-means and k-prototypes algorithms because it needs fewer iterations to converge
How many clusters are in the data?
The weight γ adds an additional problem
Personal opinion

Conceptual inclusion relationships
The outlier problem
Massive data sets cause efficiency problems