Knowledge discovery & data mining
Clustering & customer segmentation
Fosca Giannotti and
Dino Pedreschi
Pisa KDD Lab
CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
Problem 1: customer segmentation
Given:
Large database of customer data containing
their properties and past buying records
Goal:
Find groups of customers with similar behavior
Find customers with unusual behavior
Problem 2: CAD databases
Given:
Large database of CAD data containing
abstract feature vectors
Goal:
Find homogeneous groups of similar CAD parts
Determine standard parts for each group
Use standard parts instead of special parts
reduction of the number of parts to be produced
Clustering: general problem description
Given:
A data set with N d-dimensional data items.
Find:
a natural partitioning of the data set into a
number of clusters (k) and noise.
Clusters should be such that:
items in same cluster are similar
intra-cluster similarity is maximized
items from different clusters are different
inter-cluster similarity is minimized
Clustering = unsupervised classification
No predefined classes!
Used either as a stand-alone tool to get insight
into data distribution
or as a preprocessing step for other algorithms.
Helps to understand how objects in a data set tend
to naturally group together
Cluster: a number of things of the same kind being
close together in a group
(Longman dictionary of contemporary English)
Research perspective
From the past …
clustering is a well-known problem in statistics
more recent research in
machine learning
databases
visualization
… to the future
effective and efficient clustering algorithms for
large high-dimensional data sets with high noise
requires scalability with respect to
the number of data points (N)
the number of dimensions (d)
the noise level
Basic methods
Partitioning methods
k-means, k-medoids
Hierarchical methods
Agglomerative/divisive, BIRCH, CURE
Linkage-based methods
Density-based methods
DBSCAN, DENCLUE
Statistical methods
IBM-IM demographic clustering, COBWEB
A small example
Source: DM@CINECA
http://open.cineca.it/datamining/dmCineca/index.htm
Domain: technology watch - a.k.a.
competitive intelligence
Which are the emergent technologies?
Which of my competitors are investing in
them?
In which area are my competitors active?
Which areas will my competitors drop in the near
future?
Data sources:
public (on-line) databases
The Derwent database
Contains all patents filed worldwide in the last
10 years
Searching this database by keywords may
yield thousands of documents
Derwent documents are semi-structured:
many long text fields
Goal: analyze Derwent documents to build a
model of competitors’ strategy
Structure of Derwent documents
Example dataset
Patents in the area: patch technology
(medical plaster)
105 companies from 12 countries
94 classification codes
52 Derwent codes
Clustering output
Zoom on cluster 2
Zoom on cluster 2 - profiling competitors
Activity of competitors in the clusters
Temporal analysis of clusters
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
A case study on customer segmentation
Source: Gary Saarenvirta, “Mining customer
data”. DB2 Magazine online, 1998
http://www.db2mag.com/98fsaar.html
Gratefully acknowledged!
Customer segmentation
Task: use customer-purchase transaction
data to
track buying behavior
create strategic business initiatives
divide customers into segments based on
shareholder value variables:
customer profitability,
measure of risk,
measure of the lifetime value of a customer,
retention probability.
Attractive customer segments
high-profit, high-value, and low-risk customers
highly attractive: typically 10% to 20% of
customers who create 50% to 80% of a company's
profits
strategic initiative for the segment is retention
low-profit, high-value, and low-risk customers
attractive
strategic initiative for the segment is to increase
profitability
cross-selling (selling new products)
up-selling (selling more of what customers currently buy)
Behavioral vs. demographic segments
Within behavioral segments:
demographic subsegments.
Customer demographic data are not used to
create segments.
Demographic (sub)segmenting is used to
select appropriate tactics (advertising,
marketing channels, and campaigns) to
implement the strategic initiatives.
AIR MILES
is a Canadian reward program for a coalition
of more than 125 companies in all industry
sectors
finance, credit card, retail, grocery, gas, telecom.
60% of Canadian households enrolled
is a frequent-shopper program:
consumers collect bonuses that can be redeemed
for rewards (air travel, hotel accommodation,
theatre/sport tickets, …)
Data capture
The coalition partners capture consumer
transactions and transmit them to AIR
MILES, which
stores these transactions and uses the
data for database marketing initiatives.
The AIR MILES data warehouse contained (at
the time of the project)
more than 6.3 million household records
1 billion transaction records.
Before data mining
Analytical techniques used:
Recency, Frequency, Monetary value (RFM)
analysis
online analytic processing tools
linear statistical methods
to analyze the success of the various
marketing initiatives undertaken by the
coalition.
Data mining project at AMRP
Goal: create a customer segmentation using
a data mining tool and
compare the results to an existing
segmentation developed using RFM analysis.
data mining platform
IBM - DB2 Universal Database Enterprise
parallelized over a five-node RS/6000 SP
system.
IBM - Intelligent Miner for Data
Dataset
~50,000 customers and their associated transactions for a 12-month period.
Data preparation
“shareholder value” variables
revenue
customer tenure
number of sponsor companies shopped at over the
customer tenure
number of sponsor companies shopped at over the last 12
months,
recency (in months) of the last transaction
calculated by aggregating the transaction data and
then adding them to each customer record
84 variables =
14 categories of sponsor companies ×
3 variables per category ×
2 quarters (first two quarters of 1997)
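A minimal pandas sketch of this aggregate-then-pivot step, on a toy transaction table whose schema (cust_id, category, quarter, revenue) is an assumption, not the case study's actual layout:

```python
import pandas as pd

# Synthetic stand-in for the transaction data (schema is an assumption).
tx = pd.DataFrame({
    "cust_id":  [1, 1, 2, 2, 2],
    "category": ["grocery", "gas", "grocery", "grocery", "telecom"],
    "quarter":  ["97Q1", "97Q2", "97Q1", "97Q1", "97Q2"],
    "revenue":  [120.0, 35.0, 80.0, 60.0, 45.0]})

# Aggregate per customer, category, and quarter, then pivot into one wide
# record per customer: 14 categories x 3 variables x 2 quarters = 84 columns
# in the case study (here only a toy subset with 2 variables).
agg = (tx.groupby(["cust_id", "category", "quarter"])
         .agg(revenue=("revenue", "sum"), n_tx=("revenue", "size")))
wide = agg.unstack(["category", "quarter"])
print(wide.fillna(0))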
Data cleaning - missing values
demographic data
is usually categorical
has a high % of missing values
the missing values can be set to either "unknown"
or "unanswered" (if they result from unanswered
questions)
if a large portion of the attribute is
missing, it may be discarded
in the case study, missing numeric values
set to 0
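A short pandas sketch of these cleaning rules on toy data; the 70% discard cutoff is an assumption for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"occupation": ["clerk", None, "engineer", None],
                   "income":     [40_000, np.nan, 75_000, np.nan],
                   "fax_number": [np.nan, np.nan, np.nan, "555-0100"]})

# Categorical demographics: make the missing values an explicit category.
df["occupation"] = df["occupation"].fillna("unknown")

# Discard attributes that are mostly missing (70% cutoff is an assumption).
df = df.loc[:, df.isna().mean() < 0.7]

# As in the case study: set missing numeric values to 0.
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(0)
```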
Data transformation
Ratio variables.
E.g.: profitability = profit / tenure
Time-derivative variables.
E.g.: profit 2nd quarter - profit 1st quarter
Discretization using quantiles.
E.g., break points at the 10th, 25th, 50th, 75th, and 90th percentiles.
Discretization using predefined ranges.
E.g., those used in the census
Log transforms.
E.g., for very skewed distributions
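The same four transformations sketched in pandas on synthetic data (column names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"profit_q1": rng.gamma(2, 50, 500),
                   "profit_q2": rng.gamma(2, 55, 500),
                   "tenure_months": rng.integers(1, 120, 500)})
df["profit"] = df["profit_q1"] + df["profit_q2"]

df["profitability"] = df["profit"] / df["tenure_months"]   # ratio variable
df["profit_trend"] = df["profit_q2"] - df["profit_q1"]     # time derivative

# Quantile discretization: break points at the 10th, 25th, 50th,
# 75th, and 90th percentiles (six bins).
df["profit_bin"] = pd.qcut(df["profit"],
                           q=[0, .10, .25, .50, .75, .90, 1.0],
                           labels=False)

# Log transform for very skewed distributions (log1p handles zeros).
df["log_profit"] = np.log1p(df["profit"])
```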
Distribution of original data
Distribution of discretized data
Before/after discretization
Clustering/segmentation methodology
IBM-IM demographic clustering
Designed for categorical variables
Similarity index:
increases with number of common values on
same attribute
decreases with number of different values on
same attribute
# of clusters is not fixed a priori
only an upper bound is set
Demographic clustering: data structure
Demographic clustering: parameters
Demographic clustering: similarity index
proportional to the number of common 1's (N11)
inversely proportional to the number of 0-1 and 1-0 mismatches (N01, N10)
unaffected by the number of common 0's (N00)
Condorcet index = N11 / (N11 + (N01 + N10)/2)
Dice index = N11 / (N11 + (N01 + N10)/4)
Dice is looser than Condorcet
appropriate with highly different objects
Both indexes generalize from objects to clusters
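A small Python sketch of both indexes on binary object vectors (the object-to-cluster generalization is left out):

```python
import numpy as np

def condorcet(x, y):
    """Condorcet similarity of two binary vectors: common 1's count fully,
    mismatches count half, common 0's are ignored."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11, n01, n10 = np.sum(x & y), np.sum(~x & y), np.sum(x & ~y)
    return n11 / (n11 + 0.5 * (n01 + n10))

def dice(x, y):
    """Dice similarity: mismatches count a quarter, hence 'looser'."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11, n01, n10 = np.sum(x & y), np.sum(~x & y), np.sum(x & ~y)
    return n11 / (n11 + 0.25 * (n01 + n10))

# Example: two binary code vectors agreeing on 2 attributes,
# disagreeing on 2, sharing one common 0.
a = [1, 1, 0, 1, 0]
b = [1, 1, 1, 0, 0]
print(condorcet(a, b), dice(a, b))  # ~0.667 vs. 0.8: Dice is looser
```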
Demographic clustering: similarity index
Similarity threshold
objects i and j are assumed similar if s(i,j) exceeds the threshold
low threshold values (< 0.5) are appropriate with highly
different objects
Weights for attributes
importance of attributes in the similarity index
may be varied with different weights
default weight = 1
IM Demographic clustering
basic parameters to set:
Maximum number of clusters.
Maximum number of passes through the data.
Accuracy: a stopping criterion for the algorithm.
If the change in the Condorcet criterion between
data passes is smaller than the accuracy (as %),
the algorithm will terminate.
In the case study:
maximum # of clusters: 9
maximum # of passes: 5
accuracy: 0.1
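Purely as a hedged sketch of how these parameters interact, the loop below mimics the pass structure described above; the actual Intelligent Miner algorithm is not public and certainly differs in detail, and the object-to-cluster similarity used here is just one plausible generalization of the Condorcet index:

```python
import numpy as np

def cluster_similarity(x, profile):
    # Condorcet index generalized from a pair of objects to an object
    # vs. a cluster's attribute-frequency profile (one plausible choice).
    n11 = np.sum(x * profile)
    n10 = np.sum(x * (1 - profile))
    n01 = np.sum((1 - x) * profile)
    denom = n11 + 0.5 * (n01 + n10)
    return n11 / denom if denom else 0.0

def demographic_clustering(X, max_clusters=9, max_passes=5,
                           accuracy=0.1, threshold=0.5):
    """Sketch: X is an array of binary records (one row per record).
    A real implementation would also remove a record from its old
    cluster before re-assigning it on later passes."""
    sums, counts = [], []            # per-cluster attribute sums and sizes
    labels = np.zeros(len(X), int)
    prev_crit = None
    for _ in range(max_passes):
        crit = 0.0                   # global Condorcet-style criterion
        for i, x in enumerate(X):
            sims = [cluster_similarity(x, s / n)
                    for s, n in zip(sums, counts)]
            best = int(np.argmax(sims)) if sims else -1
            if sims and (sims[best] > threshold
                         or len(sums) >= max_clusters):
                labels[i] = best
                sums[best] = sums[best] + x
                counts[best] += 1
                crit += sims[best]
            else:                    # open a new cluster
                sums.append(x.astype(float))
                counts.append(1)
                labels[i] = len(sums) - 1
                crit += 1.0
        # stop when the criterion changes by less than `accuracy` percent
        if prev_crit and abs(crit - prev_crit) / prev_crit * 100 < accuracy:
            break
        prev_crit = crit
    return labels
```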
Input dataset
dataset: all continuous variables discretized.
input variables:
# of products purchased over customer’s lifetime
# of products purchased in the last 12 months
Customer's revenue contribution over lifetime
Customer tenure in months
Ratio of revenue to tenure
Ratio of number of products to tenure
Region
Recency
Tenure (# of months since customer first enrolled in the
program).
Input dataset
Other discrete and categorical variables
and some interesting continuous variables
were input as supplementary variables:
variables used to profile the clusters but
not to define them.
easier interpretation of clusters using data
other than the input variables.
Output of demographic clustering
Visualization of clusters
horizontal strip = a cluster
clusters are ordered from top to bottom in
order of size
variables are ordered from left to right in
order of importance to the cluster, based
on a chi-square test between variable and
cluster ID.
other metrics include entropy, Condorcet
criterion, and database order.
Visualization of clusters
variables used to define clusters are without
brackets, while the supplementary variables appear
within brackets.
numeric (integer), discrete numeric (small integer),
binary, and continuous variables have their
frequency distribution shown as a bar graph.
red bars = distribution of the variable within the
current cluster.
gray solid bars = distribution of the variable in
the whole universe.
Visualization of clusters
Categorical variables are shown as pie charts.
inner pie = distribution of the categories for the
current cluster
outer ring = distribution of the variable for the
entire universe.
The more different the cluster distribution is
from the average, the more interesting or distinct
the cluster.
Qualitative characterization of clusters
Gold98 is a binary variable that indicates
the best customers in the database,
created previously by the business using
RFM analysis.
Each cluster contains either almost all Gold
customers or almost none.
Gold segment is confirmed!
Qualitative characterization of clusters
Clustering results
validate the existing concept of Gold customers,
suggest the introduction of a cluster within the Gold98
segment - a platinum segment
Cluster 5:
9% of the population
revenue and bonuses collected are all in the 75th
percentile and above, skewed to almost all
greater than the 90th percentile
looks like a very profitable cluster
Detailed view of cluster 5
Profiling clusters
Goal: assess the potential business value of
each cluster quantitatively by profiling the
aggregate values of the shareholder value
variables by cluster.
Profiling clusters
leverage = ratio of the cluster's share of revenue to its share of customers.
product index = average number of products
purchased by the customers in the cluster
divided by the average number of products
purchased overall.
as profitability increases, so does the average
number of products purchased.
customer profitability increases as tenure
increases.
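A pandas sketch of such a profile table; the column names and the synthetic data are assumptions, not the case study's schema:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row per customer with an assigned cluster label.
rng = np.random.default_rng(0)
df = pd.DataFrame({"cust_id": range(1000),
                   "cluster": rng.integers(0, 9, 1000),
                   "revenue": rng.gamma(2.0, 100.0, 1000),
                   "n_products": rng.poisson(3, 1000)})

profile = df.groupby("cluster").agg(
    customers=("cust_id", "size"),
    revenue=("revenue", "sum"),
    avg_products=("n_products", "mean"))
profile["pct_customers"] = 100 * profile["customers"] / profile["customers"].sum()
profile["pct_revenue"] = 100 * profile["revenue"] / profile["revenue"].sum()
# leverage: a cluster's share of revenue relative to its share of customers
profile["leverage"] = profile["pct_revenue"] / profile["pct_customers"]
# product index: cluster's average basket size vs. the overall average
profile["product_index"] = profile["avg_products"] / df["n_products"].mean()
```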
Business opportunities
Best customers in clusters 2, 5, and 7:
indication: retention
Clusters 2, 6, and 0
indication: cross-selling by contrasting with clusters 5
and 7.
2, 6, and 0 have a product index close to that of 5 and
7, which have the highest number of products purchased.
Try to convert customers from clusters 2, 6, and 0 to
clusters 5 and 7. By comparing which products are bought,
we can find products that are candidates for cross-selling.
Business opportunities
Clusters 3 and 4
indication: cross-selling to clusters 2, 6, and 0
Cluster 1
indication: wait and see. It appears to be a
group of new customers
Cluster 8
indication: do not waste marketing dollars
Follow-up
Reactions from the management
visualization of results allowed for meaningful and
actionable analysis.
the original segmentation methodology was validated, and
the refinements obtained are potentially valuable.
decision to undertake further data mining projects,
including
predictive models for direct mail targeting,
further work on segmentation using more detailed behavioral
data,
opportunity identification using association algorithms within
the segments discovered.
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering
techniques
A taxonomy of main clustering methods
Partitioning methods
K-means (MacQueen 67), expectation maximization (Lauritzen
95), CLARANS (Ng and Han 94)
Hierarchical methods
Agglomerative/divisive, BIRCH (Zhang et al 96), CURE (Guha
et al 98)
Linkage-based methods
Density-based methods
DBSCAN (Ester et al 96), DENCLUE (Hinneburg and Keim 98)
Statistical methods
IBM-IM demographic clustering, COBWEB (Fisher 87),
Autoclass (Cheeseman 96)
Which notion of distance?
Given two samples with n continuous
attributes X=<x1,…,xn> and Y=<y1,…,yn>,
the Minkowski distance between X and Y is:
d(X,Y) = (|x1-y1|^q + … + |xn-yn|^q)^(1/q)
Euclidean distance: q = 2
Manhattan distance: q = 1
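A direct NumPy transcription of the formula (toy vectors for illustration):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance d(X,Y) = (sum_i |x_i - y_i|^q)^(1/q)."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(minkowski(x, y, 1))  # Manhattan: 7.0
```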
K-Means
Determine k prototypes (p1, …, pk) of a given data set
Assign each data point to its nearest prototype
Stop criterion: the total distance of the data points to their prototypes is minimized (the prototypes stop moving)
Iterative algorithm:
Shift the prototypes towards the mean of their point set
Re-assign the data points to the nearest prototype
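A minimal NumPy sketch of this iteration, assuming random initial prototypes drawn from the data (not an optimized implementation):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate assignment and mean-shift
    until the prototypes stop moving."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest prototype (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # shift each prototype to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else prototypes[j] for j in range(k)])
        if np.allclose(new, prototypes):
            break  # converged: prototypes no longer move
        prototypes = new
    return labels, prototypes
```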
K-Means: example
Expectation Maximization
Generalization of k-Means
probabilistic assignment of points to
clusters
Idea:
Estimate parameters of k Gaussians
Optimize the probability that the mixture of
parameterized Gaussians fits the data
Iterative algorithm similar to k-Means
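As a hedged illustration, scikit-learn's GaussianMixture fits a mixture of k parameterized Gaussians by EM and, unlike k-means, yields a probabilistic assignment of points to clusters (synthetic data below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a mixture of k Gaussians by EM on synthetic 2-d data.
X = np.random.default_rng(0).normal(size=(200, 2))
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely cluster per point
soft_assign = gmm.predict_proba(X)  # probabilistic assignment per cluster
```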
Linkage-based methods (from statistics)
Single Linkage
Connected components for distance d
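For illustration, SciPy's hierarchical clustering provides single linkage; cutting the resulting dendrogram at distance d yields the connected components of the "distance < d" graph (synthetic data, illustrative cut threshold):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))
Z = linkage(X, method="single")  # single-linkage dendrogram
# Cut at distance d = 0.5: clusters are the connected components.
labels = fcluster(Z, t=0.5, criterion="distance")
```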
Linkage-based methods
Method of Wishart
Min. number of neighboring points: 4
Reduce the data set to its dense points (those with at least the minimum number of neighbors)
Apply single linkage to the reduced data set
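A hedged sketch of these two steps, assuming Euclidean distance and illustrative parameter values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def wishart(X, d=0.5, min_neighbors=4):
    """Sketch of Wishart's method: keep only dense points (those with
    at least `min_neighbors` other points within distance d), then
    apply single linkage to the reduced data set."""
    dist = cdist(X, X)
    dense = (dist < d).sum(axis=1) - 1 >= min_neighbors  # exclude self
    Z = linkage(X[dense], method="single")
    return dense, fcluster(Z, t=d, criterion="distance")
```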
Kernel Density Estimation
[Figure: a sample data set, its influence function, and the resulting density function]
Influence Function: Influence of a data point in its
neighborhood
Density Function: Sum of the influences of all data
points
Kernel Density Estimation
Influence Function
The influence of a data point y at a point x in
the data space is modeled by a function f_y(x),
e.g. the Gaussian influence function
f_y(x) = exp(-d(x,y)^2 / (2σ^2))
Density Function
The density at a point x in the data space is
defined as the sum of the influences of all data
points: f_D(x) = f_x1(x) + … + f_xN(x)
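A small Python sketch of these two definitions with the Gaussian influence function; σ and the sample data are assumptions for illustration:

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Gaussian influence of data point y at location x."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = np.random.default_rng(0).normal(size=(100, 2))
print(density([0.0, 0.0], data))  # high near the data's center
```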
KDE and visual clustering
Scalability problem of clustering methods
Effectiveness degenerates
with dimensionality d
with noise level
Efficiency degenerates
(at least) linearly with the number of data points N
exponentially with dimensionality d
References - Clustering
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. on Math., Stat., and Prob., 1:281-297, 1967.
S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
Charu C. Aggarwal, Cecilia Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. Fast algorithms for projected clustering. Proc. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. Proc. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, p. 153-180, AAAI/MIT Press, 1996.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, and Xiaowei Xu. Incremental clustering for mining in a data warehousing environment. Proc. VLDB Conference, New York City, New York, August 1998.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD'96.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. KDD'98.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
Venkatesh Ganti, Raghu Ramakrishnan, and Johannes Gehrke. Clustering large datasets in arbitrary metric spaces. Proc. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. Proc. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: a robust clustering algorithm for categorical attributes. Proc. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988.
Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. Proc. VLDB Conference, Santiago, Chile, September 1994.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. Proc. ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.