Knowledge discovery & data mining
Clustering & customer segmentation
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab
CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
Problem 1: customer segmentation
Given:
Large database of customer data containing
their properties and past buying records
Goal:
Find groups of customers with similar behavior
Find customers with unusual behavior
Problem 2: CAD databases
Given:
Large database of CAD data containing
abstract feature vectors
Goal:
Find homogeneous groups of similar CAD parts
Determine standard parts for each group
Use standard parts instead of special parts
reduction of the number of parts to be produced
Clustering: general problem description
Given:
A data set with N d-dimensional data items.
Find:
a natural partitioning of the data set into a
number of clusters (k) and noise.
Clusters should be such that:
items in same cluster are similar
intra-cluster similarity is maximized
items from different clusters are different
inter-cluster similarity is minimized
Clustering = unsupervised classification
No predefined classes!
Used either as a stand-alone tool to get insight
into data distribution
or as a preprocessing step for other algorithms.
Helps to understand how objects in a data set tend
to naturally group together
Cluster: a number of things of the same kind being
close together in a group
(Longman dictionary of contemporary English)
Research perspective
From the past …
clustering is a well-known problem in statistics
more recent research in
machine learning
databases
visualization
… to the future
effective and efficient clustering algorithms for
large high-dimensional data sets with high noise
requires scalability with respect to
the number of data points (N)
the number of dimensions (d)
the noise level
Basic methods
Partitioning methods
k-means, k-medoids
Hierarchical methods
Agglomerative/divisive, BIRCH, CURE
Linkage-based methods
Density-based methods
DBSCAN, DENCLUE
Statistical methods
IBM-IM demographic clustering, COBWEB
A small example
Source: DM@CINECA
http://open.cineca.it/datamining/dmCineca/index.htm
Domain: technology watch - a.k.a.
competitive intelligence
Which are the emergent technologies?
Which of my competitors are investing in
them?
In which area are my competitors active?
Which areas will my competitors drop in the near
future?
Data sources:
public (on-line) databases
The Derwent database
Contains all patents filed worldwide in the last
10 years
Searching this database by keywords may
yield thousands of documents
Derwent documents are semi-structured:
many long text fields
Goal: analyze Derwent documents to build a
model of competitors’ strategy
Structure of Derwent documents
Example dataset
Patents in the area: patch technology
(medical plaster)
105 companies from 12 countries
94 classification codes
52 Derwent codes
Clustering output
Zoom on cluster 2
Zoom on cluster 2 - profiling competitors
Activity of competitors in the clusters
Temporal analysis of clusters
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
A case study on customer segmentation
Source: Gary Saarenvirta, “Mining customer
data”. DB2 Magazine online, 1998
http://www.db2mag.com/98fsaar.html
Gratefully acknowledged!
Customer segmentation
Task: use customer-purchase transaction
data to
track buying behavior
create strategic business initiatives
divide customers into segments based on
shareholder value variables:
customer profitability,
measure of risk,
measure of the lifetime value of a customer,
retention probability.
Attractive customer segments
high-profit, high-value, and low-risk customers
highly attractive: typically 10% to 20% of
customers who create 50% to 80% of a company's
profits
strategic initiative for the segment is retention
low-profit, high-value, and low-risk customers
attractive
strategic initiative for the segment is to increase
profitability
cross-selling (selling new products)
up-selling (selling more of what customers currently buy)
Behavioral vs. demographic segments
Within behavioral segments:
demographic subsegments.
Customer demographic data are not used to
create segments.
Demographic (sub)segmenting is used to
select appropriate tactics (advertising,
marketing channels, and campaigns) to
implement the strategic initiatives.
AIR MILES
is a Canadian reward program for a coalition
of more than 125 companies in all industry
sectors
finance, credit card, retail, grocery, gas, telecom.
60% of Canadian households enrolled
is a frequent-shopper program:
consumers collect bonuses that can be redeemed
for rewards (air travel, hotel accommodation,
theatre/sport tickets, …)
Data capture
The coalition partners capture consumer
transactions and transmit them to AIR
MILES, which
stores these transactions and uses the
data for database marketing initiatives.
The AIR MILES data warehouse contained (at
the time of the project)
more than 6.3 million household records
1 billion transaction records.
Before data mining
Analytical techniques used:
Recency, Frequency, Monetary value (RFM)
analysis
online analytic processing tools
linear statistical methods
to analyze the success of the various
marketing initiatives undertaken by the
coalition.
Data mining project at AMRP
Goal: create a customer segmentation using
a data mining tool and
compare the results to an existing
segmentation developed using RFM analysis.
data mining platform
IBM - DB2 Universal Database Enterprise
parallelized over a five-node RS/6000 SP
system.
IBM - Intelligent Miner for Data
Dataset
~50,000 customers and their associated transactions for a 12-month period.
Data preparation
“shareholder value” variables
revenue
customer tenure
number of sponsor companies shopped at over the
customer tenure
number of sponsor companies shopped at over the last 12
months,
recency (in months) of the last transaction
calculated by aggregating the transaction data and
then adding them to each customer record
84 variables =
14 categories of sponsor companies ×
3 variables per category ×
2 quarters (first two quarters of 1997)
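A minimal pandas sketch of this aggregate-then-pivot step, on a toy transaction table whose schema (cust_id, category, quarter, revenue) is an assumption, not the case study's actual layout:

```python
import pandas as pd

# Synthetic stand-in for the transaction data (schema is an assumption).
tx = pd.DataFrame({
    "cust_id":  [1, 1, 2, 2, 2],
    "category": ["grocery", "gas", "grocery", "grocery", "telecom"],
    "quarter":  ["97Q1", "97Q2", "97Q1", "97Q1", "97Q2"],
    "revenue":  [120.0, 35.0, 80.0, 60.0, 45.0]})

# Aggregate per customer, category, and quarter, then pivot into one wide
# record per customer: 14 categories x 3 variables x 2 quarters = 84 columns
# in the case study (here only a toy subset with 2 variables).
agg = (tx.groupby(["cust_id", "category", "quarter"])
         .agg(revenue=("revenue", "sum"), n_tx=("revenue", "size")))
wide = agg.unstack(["category", "quarter"])
print(wide.fillna(0))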
Data cleaning - missing values
demographic data
is usually categorical
has a high % of missing values
the missing values can be set to either "unknown"
or "unanswered" (if they result from unanswered
questions)
if a large portion of the attribute is
missing, it may be discarded
in the case study, missing numeric values
set to 0
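A short pandas sketch of these cleaning rules on toy data; the 70% discard cutoff is an assumption for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"occupation": ["clerk", None, "engineer", None],
                   "income":     [40_000, np.nan, 75_000, np.nan],
                   "fax_number": [np.nan, np.nan, np.nan, "555-0100"]})

# Categorical demographics: make the missing values an explicit category.
df["occupation"] = df["occupation"].fillna("unknown")

# Discard attributes that are mostly missing (70% cutoff is an assumption).
df = df.loc[:, df.isna().mean() < 0.7]

# As in the case study: set missing numeric values to 0.
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(0)
```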
Data transformation
Ratio variables.
E.g.: profitability = profit / tenure
Time-derivative variables.
E.g.: profit 2nd quarter - profit 1st quarter
Discretization using quantiles.
E.g., break points at the 10th, 25th, 50th, 75th, and 90th percentiles.
Discretization using predefined ranges.
E.g., those used in the census
Log transforms.
E.g., for very skewed distributions
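The same four transformations sketched in pandas on synthetic data (column names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"profit_q1": rng.gamma(2, 50, 500),
                   "profit_q2": rng.gamma(2, 55, 500),
                   "tenure_months": rng.integers(1, 120, 500)})
df["profit"] = df["profit_q1"] + df["profit_q2"]

df["profitability"] = df["profit"] / df["tenure_months"]   # ratio variable
df["profit_trend"] = df["profit_q2"] - df["profit_q1"]     # time derivative

# Quantile discretization: break points at the 10th, 25th, 50th,
# 75th, and 90th percentiles (six bins).
df["profit_bin"] = pd.qcut(df["profit"],
                           q=[0, .10, .25, .50, .75, .90, 1.0],
                           labels=False)

# Log transform for very skewed distributions (log1p handles zeros).
df["log_profit"] = np.log1p(df["profit"])
```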
Distribution of original data
Distribution of discretized data
Before/after discretization
Clustering/segmentation methodology
IBM-IM demographic clustering
Designed for categorical variables
Similarity index:
increases with number of common values on
same attribute
decreases with number of different values on
same attribute
# of clusters is not fixed a priori
only an upper bound is set
Demographic clustering: data structure
Demographic clustering: parameters
Demographic clustering: similarity index
proportional to the number of common 1's (N11)
inversely proportional to the number of 0-1 and 1-0 mismatches (N01, N10)
unaffected by the number of common 0's (N00)
Condorcet index = N11 / (N11 + (N01 + N10)/2)
Dice index = N11 / (N11 + (N01 + N10)/4)
Dice is looser than Condorcet
appropriate with highly different objects
Both indexes generalize from objects to clusters
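A small Python sketch of both indexes on binary object vectors (the object-to-cluster generalization is left out):

```python
import numpy as np

def condorcet(x, y):
    """Condorcet similarity of two binary vectors: common 1's count fully,
    mismatches count half, common 0's are ignored."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11, n01, n10 = np.sum(x & y), np.sum(~x & y), np.sum(x & ~y)
    return n11 / (n11 + 0.5 * (n01 + n10))

def dice(x, y):
    """Dice similarity: mismatches count a quarter, hence 'looser'."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11, n01, n10 = np.sum(x & y), np.sum(~x & y), np.sum(x & ~y)
    return n11 / (n11 + 0.25 * (n01 + n10))

# Example: two binary code vectors agreeing on 2 attributes,
# disagreeing on 2, sharing one common 0.
a = [1, 1, 0, 1, 0]
b = [1, 1, 1, 0, 0]
print(condorcet(a, b), dice(a, b))  # ~0.667 vs. 0.8: Dice is looser
```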
Demographic clustering: similarity index
Similarity threshold
objects i and j are assumed similar if s(i,j) exceeds the threshold
low threshold values (< 0.5) are appropriate with highly
different objects
Weights for attributes
importance of attributes in the similarity index
may be varied with different weights
default weight = 1
IM Demographic clustering
basic parameters to set:
Maximum number of clusters.
Maximum number of passes through the data.
Accuracy: a stopping criterion for the algorithm.
If the change in the Condorcet criterion between
data passes is smaller than the accuracy (as %),
the algorithm will terminate.
In the case study:
maximum # of clusters: 9
maximum # of passes: 5
accuracy: 0.1
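Purely as a hedged sketch of how these parameters interact, the loop below mimics the pass structure described above; the actual Intelligent Miner algorithm is not public and certainly differs in detail, and the object-to-cluster similarity used here is just one plausible generalization of the Condorcet index:

```python
import numpy as np

def cluster_similarity(x, profile):
    # Condorcet index generalized from a pair of objects to an object
    # vs. a cluster's attribute-frequency profile (one plausible choice).
    n11 = np.sum(x * profile)
    n10 = np.sum(x * (1 - profile))
    n01 = np.sum((1 - x) * profile)
    denom = n11 + 0.5 * (n01 + n10)
    return n11 / denom if denom else 0.0

def demographic_clustering(X, max_clusters=9, max_passes=5,
                           accuracy=0.1, threshold=0.5):
    """Sketch: X is an array of binary records (one row per record).
    A real implementation would also remove a record from its old
    cluster before re-assigning it on later passes."""
    sums, counts = [], []            # per-cluster attribute sums and sizes
    labels = np.zeros(len(X), int)
    prev_crit = None
    for _ in range(max_passes):
        crit = 0.0                   # global Condorcet-style criterion
        for i, x in enumerate(X):
            sims = [cluster_similarity(x, s / n)
                    for s, n in zip(sums, counts)]
            best = int(np.argmax(sims)) if sims else -1
            if sims and (sims[best] > threshold
                         or len(sums) >= max_clusters):
                labels[i] = best
                sums[best] = sums[best] + x
                counts[best] += 1
                crit += sims[best]
            else:                    # open a new cluster
                sums.append(x.astype(float))
                counts.append(1)
                labels[i] = len(sums) - 1
                crit += 1.0
        # stop when the criterion changes by less than `accuracy` percent
        if prev_crit and abs(crit - prev_crit) / prev_crit * 100 < accuracy:
            break
        prev_crit = crit
    return labels
```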
Input dataset
dataset: all continuous variables discretized.
input variables:
# of products purchased over customer’s lifetime
# of products purchased in the last 12 months
Customer's revenue contribution over lifetime
Customer tenure in months
Ratio of revenue to tenure
Ratio of number of products to tenure
Region
Recency
Tenure (# of months since customer first enrolled in the
program).
Input dataset
Other discrete and categorical variables
and some interesting continuous variables
were input as supplementary variables:
variables used to profile the clusters but
not to define them.
easier interpretation of clusters using data
other than the input variables.
Output of demographic clustering
Visualization of clusters
horizontal strip = a cluster
clusters are ordered from top to bottom in
order of size
variables are ordered from left to right in
order of importance to the cluster, based
on a chi-square test between variable and
cluster ID.
other metrics include entropy, Condorcet
criterion, and database order.
Visualization of clusters
variables used to define clusters are without
brackets, while the supplementary variables appear
within brackets.
numeric (integer), discrete numeric (small integer),
binary, and continuous variables have their
frequency distribution shown as a bar graph.
red bars = distribution of the variable within the
current cluster.
gray solid bars = distribution of the variable in
the whole universe.
Visualization of clusters
Categorical variables are shown as pie charts.
inner pie = distribution of the categories for the
current cluster
outer ring = distribution of the variable for the
entire universe.
The more different the cluster distribution is
from the average, the more interesting or distinct
the cluster.
Qualitative characterization of clusters
Gold98 is a binary variable that indicates
the best customers in the database,
created previously by the business using
RFM analysis.
Each cluster contains either almost all Gold
customers or almost none.
Gold segment is confirmed!
Qualitative characterization of clusters
Clustering results
validate the existing concept of Gold customers,
suggest the introduction of a cluster within the Gold98
segment - a platinum segment
Cluster 5:
9% of the population
revenue and bonuses collected are all in the 75th
percentile and above, skewed to almost all
greater than the 90th percentile
looks like a very profitable cluster
Detailed view of cluster 5
Profiling clusters
Goal: assess the potential business value of
each cluster quantitatively by profiling the
aggregate values of the shareholder value
variables by cluster.
Profiling clusters
leverage = ratio of the cluster's share of revenue to its share of customers.
product index = average number of products
purchased by the customers in the cluster
divided by the average number of products
purchased overall.
as profitability increases, so does the average
number of products purchased.
customer profitability increases as tenure
increases.
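A pandas sketch of such a profile table; the column names and the synthetic data are assumptions, not the case study's schema:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row per customer with an assigned cluster label.
rng = np.random.default_rng(0)
df = pd.DataFrame({"cust_id": range(1000),
                   "cluster": rng.integers(0, 9, 1000),
                   "revenue": rng.gamma(2.0, 100.0, 1000),
                   "n_products": rng.poisson(3, 1000)})

profile = df.groupby("cluster").agg(
    customers=("cust_id", "size"),
    revenue=("revenue", "sum"),
    avg_products=("n_products", "mean"))
profile["pct_customers"] = 100 * profile["customers"] / profile["customers"].sum()
profile["pct_revenue"] = 100 * profile["revenue"] / profile["revenue"].sum()
# leverage: a cluster's share of revenue relative to its share of customers
profile["leverage"] = profile["pct_revenue"] / profile["pct_customers"]
# product index: cluster's average basket size vs. the overall average
profile["product_index"] = profile["avg_products"] / df["n_products"].mean()
```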
Business opportunities
Best customers in clusters 2, 5, and 7:
indication: retention
Clusters 2, 6, and 0
indication: cross-selling by contrasting with clusters 5
and 7.
2, 6, and 0 have a product index close to that of 5 and
7, which have the highest number of products purchased.
Try to convert customers from clusters 2, 6, and 0 to
clusters 5 and 7. By comparing which products are bought,
we can find products that are candidates for cross-selling.
Business opportunities
Clusters 3 and 4
indication: cross-selling to clusters 2, 6, and 0
Cluster 1
indication: wait and see. It appears to be a
group of new customers
Cluster 8
indication: do not waste marketing dollars
Follow-up
Reactions from the management
visualization of results allowed for meaningful and
actionable analysis.
the original segmentation methodology was validated, and
the refinements obtained are potentially valuable.
decision to undertake further data mining projects,
including
predictive models for direct mail targeting,
further work on segmentation using more detailed behavioral
data,
opportunity identification using association algorithms within
the segments discovered.
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
A taxonomy of main clustering methods
Partitioning methods
K-means (MacQueen 67), expectation maximization (Lauritzen
95), CLARANS (Ng and Han 94)
Hierarchical methods
Agglomerative/divisive, BIRCH (Zhang et al 96), CURE (Guha
et al 98)
Linkage-based methods
Density-based methods
DBSCAN (Ester et al 96), DENCLUE (Hinneburg and Keim 98)
Statistical methods
IBM-IM demographic clustering, COBWEB (Fisher 87),
Autoclass (Cheeseman 96)
Which notion of distance?
Given two samples with n continuous
attributes X=<x1,…,xn> and Y=<y1,…,yn>,
the Minkowski distance between X and Y is:
d(X,Y) = (|x1-y1|^q + … + |xn-yn|^q)^(1/q)
Euclidean distance: q = 2
Manhattan distance: q = 1
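A direct NumPy transcription of the formula (toy vectors for illustration):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance d(X,Y) = (sum_i |x_i - y_i|^q)^(1/q)."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(minkowski(x, y, 1))  # Manhattan: 7.0
```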
K-Means
Determine k prototypes (p1, …, pk) of a given data set
Assign each data point to its nearest prototype
Stop criterion: the total distance of the data points to their prototypes is minimized (the prototypes stop moving)
Iterative algorithm:
Shift the prototypes towards the mean of their point set
Re-assign the data points to the nearest prototype
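A minimal NumPy sketch of this iteration, assuming random initial prototypes drawn from the data (not an optimized implementation):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate assignment and mean-shift
    until the prototypes stop moving."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest prototype (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # shift each prototype to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else prototypes[j] for j in range(k)])
        if np.allclose(new, prototypes):
            break  # converged: prototypes no longer move
        prototypes = new
    return labels, prototypes
```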
K-Means: example
Expectation Maximization
Generalization of k-Means
probabilistic assignment of points to
clusters
Idea:
Estimate parameters of k Gaussians
Optimize the probability that the mixture of
parameterized Gaussians fits the data
Iterative algorithm similar to k-Means
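As a hedged illustration, scikit-learn's GaussianMixture fits a mixture of k parameterized Gaussians by EM and, unlike k-means, yields a probabilistic assignment of points to clusters (synthetic data below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a mixture of k Gaussians by EM on synthetic 2-d data.
X = np.random.default_rng(0).normal(size=(200, 2))
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely cluster per point
soft_assign = gmm.predict_proba(X)  # probabilistic assignment per cluster
```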
Linkage-based methods (from statistics)
Single Linkage
Connected components for distance d
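For illustration, SciPy's hierarchical clustering provides single linkage; cutting the resulting dendrogram at distance d yields the connected components of the "distance < d" graph (synthetic data, illustrative cut threshold):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))
Z = linkage(X, method="single")  # single-linkage dendrogram
# Cut at distance d = 0.5: clusters are the connected components.
labels = fcluster(Z, t=0.5, criterion="distance")
```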
Linkage-based methods
Method of Wishart
Min. number of neighboring points: 4
Reduce the data set to its dense points (those with at least the minimum number of neighbors)
Apply single linkage to the reduced data set
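A hedged sketch of these two steps, assuming Euclidean distance and illustrative parameter values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def wishart(X, d=0.5, min_neighbors=4):
    """Sketch of Wishart's method: keep only dense points (those with
    at least `min_neighbors` other points within distance d), then
    apply single linkage to the reduced data set."""
    dist = cdist(X, X)
    dense = (dist < d).sum(axis=1) - 1 >= min_neighbors  # exclude self
    Z = linkage(X[dense], method="single")
    return dense, fcluster(Z, t=d, criterion="distance")
```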
Kernel Density Estimation
[Figure: a sample data set, its influence function, and the resulting density function]
Influence Function: Influence of a data point in its
neighborhood
Density Function: Sum of the influences of all data
points
Kernel Density Estimation
Influence Function
The influence of a data point y at a point x in
the data space is modeled by a function f_y(x),
e.g. the Gaussian influence function
f_y(x) = exp(-d(x,y)^2 / (2σ^2))
Density Function
The density at a point x in the data space is
defined as the sum of the influences of all data
points: f_D(x) = f_x1(x) + … + f_xN(x)
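A small Python sketch of these two definitions with the Gaussian influence function; σ and the sample data are assumptions for illustration:

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Gaussian influence of data point y at location x."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = np.random.default_rng(0).normal(size=(100, 2))
print(density([0.0, 0.0], data))  # high near the data's center
```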
KDE and visual clustering
Scalability problem of clustering methods
Effectiveness degenerates
with dimensionality d
with noise level
Efficiency degenerates
(at least) linearly with the number of data points N
exponentially with dimensionality d
References - Clustering
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. on Math., Stat., and Prob., 1:281-297, 1967.
S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
Charu C. Aggarwal, Cecilia Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. Fast algorithms for projected clustering. Proc. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. Proc. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, p. 153-180, AAAI/MIT Press, 1996.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, and Xiaowei Xu. Incremental clustering for mining in a data warehousing environment. Proc. VLDB Conference, New York City, New York, August 1998.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD'96.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. KDD'98.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
Venkatesh Ganti, Raghu Ramakrishnan, and Johannes Gehrke. Clustering large datasets in arbitrary metric spaces. Proc. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. Proc. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: a robust clustering algorithm for categorical attributes. Proc. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988.
Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. Proc. VLDB Conference, Santiago, Chile, September 1994.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. Proc. ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.