Clustering With SLOPE

Download Report

Transcript Clustering With SLOPE

CLOPE: a Fast and Effective
Clustering Algorithm for
Transactional Data
Advisor : Dr. Hsu
Graduate : Sheng-Hsuan Wang
Author : Yiling Yang
Xudong Guan
Jinyuan You
Outline








Motivation
Objective
Introduction
Clustering With sLOPE
Implementation
Experiments
Conclusions
Personal opinion
Motivation

This paper studies the problem of
categorical data clustering, especially for
transactional data characterized by high
dimensionality and large volume.
Objective

To present a fast and efficient clustering algorithm
CLOPE for transactional data.
Introduction

Clustering is an important data mining technique.

Objective : to group data into sets



Intra-cluster similarity is maximized
Inter-cluster similarity is minimized
The basic idea behind CLOPE


Uses global criterion function that tries to increase the
intra-cluster overlapping of transaction items by
increasing the height-to-width ratio of the cluster
histogram.
A parameter to control the tightness of the cluster.
Introduction
Clustering With SLOPE

The Criterion function can be defined
locally or globally.

Locally


The criterion function is built on the pair-wise
similarity between transactions.
Globally

Clustering quality is measured in the cluster
level,utilizing information like the sets of large
and small items in the clustering.
Clustering With SLOPE

We define the size S(C) and width W(C)
of a cluster C below:
S (C ) 

iD ( C )
Occ (i, C )   | ti |
ti C
W (C ) | D (C ) |

The height of a cluster is defined as
H (C )  S (C ) / W (C )

We use gradient G(C )  H (C ) / W (C )  S (C ) / W (C )2
instead of H(C) as the quality measure
for cluster C.
Clustering With SLOPE
Clustering With SLOPE
k
k
profit (C ) 
 G (C )  | C
i
i 1
i
|

k
| C
i 1
i
S (Ci )
 | Ci |

2
i 1 W (Ci )
k
| C
|
i 1
i
|
k
profit r (C ) 
S (Ci )
 | Ci |

r
i 1 W (Ci )
k
| C
i 1
i
|
Here, r is a positive real number called repulsion,
used to control the level of intra-cluster similarity.
Clustering With SLOPE

Problem definition


Given D and r, find a clustering C that
maximize profit r (C) .
The sketch of the CLOPE algorithm.
Implementation

RAM data structure



Remark


We keep only the current transaction and
A small amount of information for each cluster.The
information, called cluster features.
The total memory required for item occurrences is
approximately M*K*4 bytes using array of 4-byte
integers.
The computation of profit

Computing the delta value of adding t to C.
Implementation
Implementation

The time and space complexity





Suppose the average length of a transaction is A.
The total number of transactions is N.
The maximum number of clusters is K.
The time complexity for one iteration is
O( N  K  A ).
The space requirement for CLOPE is approximately
the memory size of the cluster features.
Experiments

We analyze the effectiveness and execution speed
of CLOPE with two real-life datasets.


For effectiveness, we compare the clustering quality of
CLOPE on a labeled dataset with those of LargeItem
and ROCK.
For execution speed, we compare CLOPE with
LargeItem on a large web log dataset.
Mushroom

Mushroom dataset (real-life)


The mushroom dataset from the UCI machine learning
repository has been used by both ROCK and LargeItem
for effectiveness tests.
It contains 8124 records with two classes,4208 edible
mushrooms and 3916 poisonous mushrooms.
Mushroom
Cost ,w (C )  w  Intra  Inter
Mushroom
Mushroom


The result of CLOPE on mushroom is
better than that of LargeItem and close
to that of ROCK.
Sensitivity to data order


The results are different but very close to
the original ones.
It shows that CLOPE is not very sensitive
to the order of input data.
Berkeley web logs


Web log data is another typical category of
transactional databases.
The web log files from
http://www.cs.berkeley.edu/log/ as the
dataset for our second experiment and test
the scalability as well as performance of
CLOPE.

There are about 7 million entries in the raw log file
and 2 million of them are kept after non-html
entries removed.
Conclusions




The CLOPE algorithm is proposed based on the intuitive
idea of increasing the height-to-width ratio of the cluster
histogram.
The idea is generalized with a repulsion parameter that
controls tightness of transactions in a cluster.
Experiments show that CLOPE is quite effective in finding
interesting clusterings.
Moreover,CLOPE is not very sensitive to data order and
requires little domain knowledge in controlling the number
of clusters.
Personal Opinion

The idea behind CLOPE is very simple, can help
beginner easy to learning.