Statistics for Marketing and Consumer Research
Cluster Analysis
(from Chapter 12)
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi
Cluster analysis
• Cluster analysis is a class of techniques used to classify
cases into groups that are
• relatively homogeneous within themselves and
• heterogeneous between each other
• These groups are called clusters
Market segmentation
• Cluster analysis is especially useful for market
segmentation
• Segmenting a market means dividing its potential
consumers into separate sub-sets where
• Consumers in the same group are similar with respect to
a given set of characteristics
• Consumers belonging to different groups are dissimilar
with respect to the same set of characteristics
• This allows one to calibrate the marketing mix
differently according to the target consumer group
Other uses of cluster analysis
• Clustering similar brands or products according to their
characteristics allows one to identify competitors, potential
market opportunities and available niches.
• Data reduction
• Factor analysis and principal component analysis reduce the
number of variables.
• Cluster analysis reduces the number of observations by
grouping them into homogeneous clusters.
• Maps that profile consumers and products simultaneously,
showing market opportunities and preferences, as in preference or
perceptual mapping.
Steps to conduct a cluster analysis
• Select a distance measure
• Select a clustering algorithm
• Define the distance between two clusters
• Determine the number of clusters
• Validate the analysis
Distance measures for individual
observations
• To measure similarity between two
observations a distance measure is needed.
• Multiple variables require an aggregate
distance measure
• The best-known measure of distance is the
Euclidean distance, which is the concept we
use in everyday life for spatial coordinates.
Examples of distances
Euclidean distance
$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$
City-block (Manhattan) distance
$$D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$$
where
$D_{ij}$ = distance between cases i and j
$x_{kj}$ = value of variable $x_k$ for case j
Problems:
• Different measurement units give variables different implicit weights
• Correlation between variables (double counting)
Solution: standardization, rescaling, principal component
analysis
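The two distances and the standardization remedy can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the three cases (age in years, annual income in euros) are hypothetical and not taken from the book.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute differences across variables (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardize(rows):
    """Rescale each variable (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical cases: (age in years, annual income in euros)
cases = [[25, 20000], [45, 21000], [30, 40000]]

# On the raw data the income scale dominates the distance.
print(euclidean(cases[0], cases[1]), euclidean(cases[0], cases[2]))

# After standardization both variables are on a comparable scale.
z = standardize(cases)
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))
print(city_block(z[0], z[1]))
```

On the raw data the 20-year age gap between the first two cases barely registers next to income differences measured in thousands; after standardization age contributes on the same footing.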
Clustering procedures
• Hierarchical procedures
• Agglomerative (start from n clusters to get to 1
cluster)
• Divisive (start from 1 cluster to get to n
clusters)
• Non-hierarchical procedures
• K-means clustering (the number of clusters, c,
must be known in advance).
Distance between clusters
• Algorithms vary according to the way the
distance between two clusters is
defined.
• The most common algorithms for
hierarchical methods include
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm
• centroid method
Linkage methods
• Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
• Complete linkage method (furthest neighbour):
distance between two clusters is the maximum
distance between observations belonging to
the two clusters.
• Average linkage method: the distance between
two clusters is the average of all distances
between observations in the two clusters
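The three linkage definitions differ only in how they summarise the set of pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance between observations; the two small clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All distances between one observation from each cluster."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(a, b):
    """Nearest neighbour: the minimum pairwise distance."""
    return min(pairwise(a, b))

def complete_linkage(a, b):
    """Furthest neighbour: the maximum pairwise distance."""
    return max(pairwise(a, b))

def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    d = pairwise(a, b)
    return sum(d) / len(d)

# Hypothetical clusters on a line, so distances are easy to check by hand.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
```

The four pairwise distances here are 4, 6, 3 and 5, so the same pair of clusters is 3, 6 or 4.5 apart depending on the linkage rule chosen.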
Hierarchical vs. non-hierarchical
methods
Hierarchical methods
• No decision about the number of clusters
• Problems when data contain a high level of error
• Can be very slow, preferable with small data sets
• Initial decisions are more influential (one-step only)
• At each step they require computation of the full proximity matrix
Non-hierarchical methods
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration
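The speed difference follows from simple counting. The sketch below compares the number of distinct pairwise distances in a full proximity matrix with the number of observation-to-seed distances needed in one non-hierarchical iteration; the function names and sample sizes are illustrative assumptions.

```python
def proximity_matrix_entries(n):
    """Distinct pairwise distances among n observations: n(n-1)/2."""
    return n * (n - 1) // 2

def seed_distances_per_iteration(n, c):
    """Distances from each of n observations to each of c cluster seeds."""
    return n * c

for n in (100, 10_000):
    print(n, proximity_matrix_entries(n), seed_distances_per_iteration(n, c=3))
```

The first count grows with the square of the sample size, the second only linearly, which is why non-hierarchical methods scale better to large data sets.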
The number of clusters c
• Two alternatives
• Determined by the analysis
• Fixed by the researchers
• In segmentation studies, c represents the number of
potential separate segments.
• Preferable approach: “let the data speak”
• Hierarchical approach and optimal partition identified through
statistical tests (stopping rule for the algorithm)
• However, the detection of the optimal number of clusters is subject
to a high degree of uncertainty
• If the research objectives allow the number of clusters to be
chosen rather than estimated, non-hierarchical
methods are the way to go.
Example: fixed number of clusters
• A retailer wants to identify several shopping
profiles in order to activate new and targeted
retail outlets
• The budget only allows him to open three types of
outlets
• A partition into three clusters follows naturally,
although it is not necessarily the optimal one.
• Fixed number of clusters, hence a non-hierarchical
(k-means) approach
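A fixed three-cluster partition of this kind can be sketched with a bare-bones k-means loop. This is an illustrative implementation in plain Python, not the procedure of any particular package; the shopper data (visits per month, average basket in euros) and the choice of the initial seeds are hypothetical, and in practice the variables would usually be standardized first.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest centre, recompute centres, repeat."""
    centres = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Only point-to-centre distances are needed, not the full proximity matrix.
        new_labels = [
            min(range(len(centres)), key=lambda c: euclidean(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centres

# Hypothetical shoppers: (visits per month, average basket in euros)
points = [(2, 15), (3, 18), (2, 20),
          (10, 60), (11, 55), (9, 65),
          (5, 120), (6, 110), (4, 130)]
seeds = [points[0], points[3], points[6]]  # c = 3, one seed per outlet type

labels, centres = k_means(points, seeds)
print(labels)
print(centres)
```

The number of clusters is imposed by the three seeds, so the algorithm returns a three-way partition whether or not three is the best-supported number of groups in the data.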
Determining the optimal number of
clusters from hierarchical methods
(in SPSS)
• Agglomeration schedule
• Icicle plot
• Dendrogram
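The idea behind reading an agglomeration schedule can be sketched outside SPSS. The code below builds a simple schedule (clusters remaining after each merge, and the distance at which the merge happened) and then applies one common heuristic: stop just before the largest jump in merge distance. Single linkage, the largest-jump rule and the six data points are assumptions made for illustration; this is not the SPSS stopping rule.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(a, b):
    return min(euclidean(p, q) for p in a for q in b)

def agglomerate(points):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [[p] for p in points]
    schedule = []  # (clusters remaining after the merge, merge distance)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]],
                                                        clusters[ij[1]]))
        dist = single_linkage(clusters[i], clusters[j])
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        schedule.append((len(clusters), dist))
    return schedule

def suggest_c(schedule):
    """Heuristic: stop just before the largest jump in merge distance."""
    jumps = [(schedule[s + 1][1] - schedule[s][1], schedule[s][0])
             for s in range(len(schedule) - 1)]
    return max(jumps)[1]

# Hypothetical data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
schedule = agglomerate(points)
for remaining, dist in schedule:
    print(remaining, round(dist, 2))
print("suggested number of clusters:", suggest_c(schedule))
```

A dendrogram displays the same information graphically: a long vertical gap between successive merges corresponds to a large jump in the schedule.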