Using Category-Based Adherence
to Cluster Market-Basket Data
Author : Ching-Huang Yun, Kun-Ta Chuang,
Ming-Syan Chen
Graduate : Chien-Ming Hsiao
Outline
Motivation
Objective
Introduction
Preliminary
Design of algorithm CBA
Experimental results
Conclusion
Personal Opinion
Motivation
Market-basket data are known to be of high dimensionality and sparsity, and to contain massive outliers.
Objective
To devise an efficient algorithm for clustering
market-basket data
Introduction
In market-basket data, the taxonomy of items defines the generalization relationships among the concepts at different abstraction levels.
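To make the taxonomy concrete, here is a minimal Python sketch (the item and category names such as milk, dairy, and food are hypothetical) showing how leaf items generalize to category nodes under a virtual root:

```python
# Hypothetical item/category names; each node points to its parent category,
# with "root" as the virtual concept generalizing all categories.
taxonomy = {
    "milk": "dairy", "cheese": "dairy",
    "beef": "meat", "pork": "meat",
    "dairy": "food", "meat": "food",
    "food": "root",
}

def ancestors(node, taxonomy):
    """Return the category nodes above `node`, from its parent up to (but excluding) the root."""
    result = []
    while node in taxonomy and taxonomy[node] != "root":
        node = taxonomy[node]
        result.append(node)
    return result

print(ancestors("milk", taxonomy))   # ['dairy', 'food']
```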
Preliminary
Problem Description
Information Gain Validation Model
Problem description
A database of transactions is denoted by $D = \{t_1, t_2, \ldots, t_h\}$, where each transaction $t_j$ is represented by a set of items $\{i_1, i_2, \ldots, i_h\}$.
Problem description
A clustering $U = \{C_1, C_2, \ldots, C_k\}$ is a partition of the transactions into $k$ clusters, where $C_j$ is a cluster consisting of a set of transactions.
Items in the transactions can be generalized to multiple concept levels of the taxonomy.
Problem description
The leaf nodes are called the item nodes
The internal nodes are called the category
nodes
The root node in the highest level is a
virtual concept of the generalization of all
categories
The occurrence count is used as the measurement to determine which items or categories are the major features of each cluster.
Problem description
Definition 1 : the count of an item $i_k$ in a cluster $C_j$, denoted by Count($i_k$, $C_j$), is defined as the number of transactions in cluster $C_j$ that contain this item $i_k$.
An item $i_k$ in a cluster $C_j$ is called a large item if Count($i_k$, $C_j$) exceeds a predetermined threshold.
Definition 2 : the count of a category $c_k$ in a cluster $C_j$, denoted by Count($c_k$, $C_j$), is defined as the number of transactions in cluster $C_j$ containing items under this category $c_k$.
A category $c_k$ in a cluster $C_j$ is called a large category if Count($c_k$, $C_j$) exceeds a predetermined threshold.
Example
$C_1$ = {10, 40, 50, 60, 90, 120}
Problem description
Definition 3 : for cluster $C_j$, the minimum support count $S_c(C_j)$ is defined as :
$$S_c(C_j) = S_p \times |C_j|$$
where $S_p$ is the predetermined minimum support percentage and $|C_j|$ denotes the number of transactions in cluster $C_j$.
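A small Python sketch of Definitions 1-3, under an assumed toy taxonomy (items mapped to their ancestor categories) and a hand-made cluster; each item and each of its categories is counted once per transaction, and the nodes whose count exceeds $S_p \times |C_j|$ are reported as large:

```python
from collections import Counter

# Hypothetical taxonomy: each leaf item mapped to its ancestor categories.
ANCESTORS = {
    "milk": ["dairy", "food"], "cheese": ["dairy", "food"],
    "beef": ["meat", "food"],  "pork":  ["meat", "food"],
}

def large_features(cluster, support_pct):
    """Count(node, Cj) for item and category nodes (Definitions 1 and 2), and the
    nodes whose count exceeds the minimum support count Sp * |Cj| (Definition 3)."""
    min_count = support_pct * len(cluster)
    counts = Counter()
    for transaction in cluster:
        nodes = set(transaction)                   # item nodes
        for item in transaction:
            nodes.update(ANCESTORS.get(item, []))  # category nodes, counted once per transaction
        counts.update(nodes)
    large = {node for node, c in counts.items() if c > min_count}
    return counts, large

# A hand-made cluster of three transactions:
C1 = [{"milk", "beef"}, {"milk", "cheese"}, {"pork"}]
counts, large = large_features(C1, support_pct=0.6)
print(large)   # {'milk', 'dairy', 'meat', 'food'} -- each appears in more than 1.8 transactions
```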
Information Gain Validation Model
To evaluate the quality of clustering results
The nature of numeric data is quantitative (e.g., weight or length), whereas that of categorical data is qualitative (e.g., color or gender).
A validation model based on Information Gain (IG) is proposed to assess the quality of the clustering results.
Information Gain Validation Model
Definition 4 : the entropy of an attribute Ja in
database D is defined as :
$$I(J_a, D) = -\sum_{i=1}^{n} \frac{|J_{a_i}|}{|D|} \log_2 \frac{|J_{a_i}|}{|D|}$$
where $|D|$ is the number of transactions in the database $D$, and $|J_{a_i}|$ denotes the number of transactions whose attribute $J_a$ is classified as the value $J_{a_i}$ in the database $D$.
Example
$$I(g, D) = -\frac{5}{12}\log_2\frac{5}{12} - \frac{7}{12}\log_2\frac{7}{12}$$
Information Gain Validation Model
Definition 5 : the entropy of an attribute Ja in a
cluster Cj is defined as :
$$I(J_a, C_j) = -\sum_{i=1}^{n} \frac{|J_{a_i}, C_j|}{|C_j|} \log_2 \frac{|J_{a_i}, C_j|}{|C_j|}$$
where $|C_j|$ is the number of transactions in cluster $C_j$, and $|J_{a_i}, C_j|$ denotes the number of transactions in cluster $C_j$ whose attribute $J_a$ is classified as the value $J_{a_i}$.
Information Gain Validation Model
Definition 6 : let a clustering $U$ contain the clusters $C_1, C_2, \ldots, C_m$. Then the entropy of an attribute $J_a$ in the clustering $U$ is defined as :
$$E(J_a, U) = \sum_{C_j \in U} \frac{|C_j|}{|D|} \, I(J_a, C_j)$$
Definition 7 : the information gain obtained by separating $J_a$ into the clusters of the clustering $U$ is defined as :
$$Gain(J_a, U) = I(J_a, D) - E(J_a, U)$$
Information Gain Validation Model
Definition 8 : the information gain of the
clustering U is defined as :
$$IG(U) = \sum_{J_a \in I} Gain(J_a, U)$$
where $I$ is the set of all items purchased in the whole market-basket data records.
$$IG_{total}(U) = IG_{item}(U) + IG_{cat}(U)$$
For clustering market-basket data, the larger the IG value, the better the clustering quality.
$$Gain(g, U) = I(g, D) - E(g, U)$$
$$= \left(-\frac{5}{12}\log_2\frac{5}{12} - \frac{7}{12}\log_2\frac{7}{12}\right) - \left[\frac{7}{12}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right) + \frac{5}{12}\left(-\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5}\right)\right]$$
$$\approx 0.10$$
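The worked example can be reproduced with a short Python sketch; the value counts (5 and 7 over the 12 transactions of $D$; 4 and 3 in the cluster of 7 transactions, 1 and 4 in the cluster of 5) are taken directly from the slide:

```python
from math import log2

def entropy(counts):
    """I(Ja, .) computed from the value counts of attribute Ja (Definitions 4 and 5)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Attribute g splits D's 12 transactions 5/7; the clustering U puts 7 of them in
# one cluster (split 4/3 on g) and 5 in the other (split 1/4 on g).
I_gD = entropy([5, 7])
E_gU = 7 / 12 * entropy([4, 3]) + 5 / 12 * entropy([1, 4])   # Definition 6
gain = I_gD - E_gU                                           # Definition 7
print(round(gain, 2))   # 0.1 (approximately 0.10, as on the slide)
```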
Design of Algorithm CBA
Similarity Measurement : Category-Based
Adherence
Procedure of Algorithm CBA
Similarity Measurement:
Category-Based Adherence
Definition 9 : (Distance of an item to a cluster) For an item $i_k$ of a transaction, the distance of $i_k$ to a given cluster $C_j$, denoted by $d(i_k, C_j)$, is defined as the number of links between $i_k$ and the nearest large node of $i_k$. If $i_k$ is a large node in cluster $C_j$, then $d(i_k, C_j) = 0$.
Example (taxonomy tree of cluster $C_1$ omitted) :
TID 50 = $\{g, x, y\}$ : $d(g, C_1) = 0$, $d(x, C_1) = 2$, $d(y, C_1) = 2$
TID 70 = $\{k, m, n\}$ : $d(k, C_1) = 1$, $d(m, C_1) = 1$, $d(n, C_1) = 0$
Similarity Measurement:
Category-Based Adherence
Definition 10 : (Adherence of a transaction to a cluster) For a transaction $t = \{i_1, i_2, \ldots, i_p\}$, the adherence of $t$ to a given cluster $C_j$ is defined as :
$$H(t, C_j) = \frac{1}{p} \sum_{k=1}^{p} d(i_k, C_j)$$
where $d(i_k, C_j)$ is the distance of $i_k$ to cluster $C_j$.
Example (taxonomy tree of cluster $C_1$ omitted) :
TID 50 = $\{g, x, y\}$ : $H(50, C_1) = \frac{1}{3}(0 + 2 + 2) = \frac{4}{3}$
TID 70 = $\{k, m, n\}$ : $H(70, C_1) = \frac{1}{3}(1 + 1 + 0) = \frac{2}{3}$
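A minimal Python sketch of Definitions 9 and 10; the parent pointers and the set of large nodes of $C_1$ are assumptions chosen only so that the distances reproduce the example above:

```python
# Hypothetical taxonomy as parent pointers; "root" is the virtual top concept.
PARENT = {"g": "G", "x": "X", "y": "X", "k": "K", "m": "K", "n": "N",
          "G": "root", "X": "root", "K": "root", "N": "root"}

def distance(item, large_nodes):
    """d(ik, Cj): number of links from ik up to its nearest large node (Definition 9)."""
    links, node = 0, item
    while node not in large_nodes:
        if node not in PARENT:              # passed the root without meeting a large node
            return float("inf")
        node, links = PARENT[node], links + 1
    return links

def adherence(transaction, large_nodes):
    """H(t, Cj): average distance of the items of t to the cluster (Definition 10)."""
    return sum(distance(i, large_nodes) for i in transaction) / len(transaction)

large_C1 = {"g", "n", "K", "root"}            # assumed large nodes of cluster C1
print(adherence(["g", "x", "y"], large_C1))   # (0 + 2 + 2) / 3 = 4/3
print(adherence(["k", "m", "n"], large_C1))   # (1 + 1 + 0) / 3 = 2/3
```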
Procedure of Algorithm CBA
Step 1 : Randomly select k transactions as the seed transactions of the k clusters from the database D.
Step 2 : Read each transaction sequentially and allocate it to the cluster with the minimum category-based adherence. For each moved transaction, the counts of its items and their ancestors are increased by one.
Step 3 : Repeat Step 2 until no transaction is moved between clusters.
Step 4 : Output the taxonomy tree of each cluster as the visual representation of the clustering result.
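A condensed Python sketch of Steps 1-3, assuming an adherence_fn(transaction, cluster) that implements Definition 10; the incremental maintenance of item/ancestor counts and Step 4's taxonomy-tree output are omitted:

```python
import random

def cba(database, k, adherence_fn, max_iters=100):
    """Sketch of the CBA procedure. `database` is a list of transactions
    (each a set of items); `adherence_fn(t, cluster)` gives H(t, Cj)."""
    # Step 1: randomly select k seed transactions, one per cluster.
    seeds = random.sample(range(len(database)), k)
    assignment = [None] * len(database)
    clusters = [[] for _ in range(k)]
    for c, tid in enumerate(seeds):
        assignment[tid] = c
        clusters[c].append(database[tid])

    # Steps 2-3: move each transaction to the cluster with the minimum
    # category-based adherence, and repeat until no transaction moves.
    for _ in range(max_iters):
        moved = False
        for tid, t in enumerate(database):
            best = min(range(k),
                       key=lambda c: adherence_fn(t, clusters[c]) if clusters[c] else float("inf"))
            if best != assignment[tid]:
                if assignment[tid] is not None:
                    clusters[assignment[tid]].remove(t)
                clusters[best].append(t)    # in CBA the item/ancestor counts of
                assignment[tid] = best      # the receiving cluster are updated here
                moved = True
        if not moved:
            break
    return clusters
```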
Experimental Results
Data generation
Real market-basket data from a large bookstore company are used for the performance study.
There are |D| = 100K transactions and NI = 21,807 items.
Performance Study
Conclusion
As validated on both real and synthetic datasets, the experimental results show that, by exploiting the taxonomy information, the CBA algorithm devised in this paper significantly outperforms prior works in both execution efficiency and clustering quality for market-basket data.
Personal Opinion
An illustrative example is necessary.
The data in data mining often contain both numeric and categorical values. This characteristic limits the applicability of many existing clustering algorithms.