Statistics for Marketing and Consumer Research


Cluster Analysis
(from Chapter 12)
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi
Cluster analysis
• It is a class of techniques used to classify
cases into groups that are
• relatively homogeneous within themselves and
• heterogeneous between each other
• These groups are called clusters
Market segmentation
• Cluster analysis is especially useful for market
segmentation
• Segmenting a market means dividing its potential
consumers into separate sub-sets where
• Consumers in the same group are similar with respect to
a given set of characteristics
• Consumers belonging to different groups are dissimilar
with respect to the same set of characteristics
• This allows one to calibrate the marketing mix
differently according to the target consumer group
Other uses of cluster analysis
• Clustering similar brands or products according to their
characteristics allows one to identify competitors, potential
market opportunities and available niches.
• Data reduction
• Factor analysis and principal component analysis reduce the
number of variables.
• Cluster analysis reduces the number of observations by
grouping them into homogeneous clusters.
• Maps that profile consumers and products simultaneously,
showing market opportunities and preferences, as in preference
or perceptual mapping.
Steps to conduct a cluster analysis
• Select a distance measure
• Select a clustering algorithm
• Define the distance between two clusters
• Determine the number of clusters
• Validate the analysis
Distance measures for individual
observations
• To measure the similarity between two
observations, a distance measure is needed.
• Multiple variables require an aggregate
distance measure.
• The best-known measure of distance is the
Euclidean distance, which is the concept we
use in everyday life for spatial coordinates.
Examples of distances
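Euclidean distance
$$D_{ij} = \sqrt{\sum_{k=1}^{n} \left(x_{ki} - x_{kj}\right)^2}$$
City-block (Manhattan) distance
$$D_{ij} = \sum_{k=1}^{n} \left|x_{ki} - x_{kj}\right|$$
[Figures: each distance illustrated for two points A and B]
$D_{ij}$: distance between cases i and j
$x_{kj}$: value of variable $x_k$ for case j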
Problems:
Different measurement units = different implicit weights for the variables
Correlation between variables (double counting)
Solution: standardization, rescaling, principal component
analysis
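The two distances and the standardization remedy can be sketched in a few lines of Python. This is a minimal illustration, not material from the slides: the function names, the use of z-scores with the population standard deviation, and the three consumers (age in years, income in euros) are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two cases (equal-length sequences)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City-block (Manhattan) distance between two cases."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardize(cases):
    """Rescale every variable (column) to mean 0 and standard deviation 1."""
    n = len(cases)
    columns = list(zip(*cases))
    means = [sum(col) / n for col in columns]
    # Population standard deviation; a constant column is left unscaled.
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
           for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in cases]

# Hypothetical data: (age in years, annual income in euros)
cases = [[25, 20000], [45, 21000], [26, 40000]]

# On the raw data, income dominates because of its larger scale.
print(euclidean(cases[0], cases[1]), euclidean(cases[0], cases[2]))
print(city_block(cases[0], cases[1]))

# After standardization both variables carry comparable weight.
z = standardize(cases)
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))
```

On the raw data the first consumer looks far closer to the second than to the third, purely because income is measured in larger units. After standardization the two distances are about equal.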
Clustering procedures
• Hierarchical procedures
• Agglomerative (start from n clusters to get to 1
cluster)
• Divisive (start from 1 cluster to get to n
clusters)
• Non-hierarchical procedures
• K-means clustering (the number of clusters, c,
must be known in advance).
Distance between clusters
• Algorithms vary according to the way the
distance between two clusters is
defined.
• The most common algorithms for
hierarchical methods include
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm
• centroid method
Linkage methods
• Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
• Complete linkage method (furthest neighbour):
the distance between two clusters is the maximum
distance between observations belonging to
the two clusters.
• Average linkage method: the distance between
two clusters is the average of all distances
between observations in the two clusters
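The three linkage definitions above differ only in how they summarize the set of pairwise distances between the two clusters. A minimal sketch follows; the function names, the choice of Euclidean distance, and the two small clusters are assumptions made for the example.

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All distances between an observation in A and an observation in B."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbour: minimum pairwise distance."""
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbour: maximum pairwise distance."""
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances."""
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters of two observations each
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
```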
Hierarchical vs. non-hierarchical
methods
Hierarchical methods
• No prior decision about the number of
clusters is needed
• Problems when data contain a high
level of error
• Can be very slow, preferable with
small data sets
• Initial decisions are more influential
(one-step only)
• At each step they require computation
of the full proximity matrix
Non-hierarchical methods
• Faster, more reliable, work with
large data sets
• Need to specify the number of
clusters
• Need to set the initial seeds
• Only the distances to the seeds need
to be computed in each iteration
The number of clusters c
• Two alternatives
• Determined by the analysis
• Fixed by the researchers
• In segmentation studies, c represents the number of
potential separate segments.
• Preferable approach: “let the data speak”
• Hierarchical approach and optimal partition identified through
statistical tests (stopping rule for the algorithm)
• However, the detection of the optimal number of clusters is subject
to a high degree of uncertainty
• If the research objectives allow a choice rather than
estimating the number of clusters, non-hierarchical
methods are the way to go.
Example: fixed number of clusters
• A retailer wants to identify several shopping
profiles in order to activate new and targeted
retail outlets
• The budget only allows him to open three types of
outlets
• A partition into three clusters follows naturally,
although it is not necessarily the optimal one.
• Fixed number of clusters and a non-hierarchical
(k-means) approach
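A fixed three-cluster partition of this kind could be obtained with a basic k-means loop, sketched below. The function, the six shoppers (visits per month, average basket value) and the choice of initial seeds are hypothetical. In practice the variables would normally be standardized first, and a statistical package would be used rather than hand-written code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, seeds, max_iter=100):
    """Basic k-means: the number of clusters equals the number of seeds."""
    centroids = [list(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every observation to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids

# Hypothetical shoppers: (visits per month, average basket value)
points = [(1, 10), (2, 12), (8, 50), (9, 55), (15, 20), (16, 22)]
seeds = [points[0], points[2], points[4]]  # assumed initial seeds, c = 3

assignment, centroids = k_means(points, seeds)
print(assignment)  # [0, 0, 1, 1, 2, 2]
print(centroids)   # [[1.5, 11.0], [8.5, 52.5], [15.5, 21.0]]
```

Only the distances from each observation to the three centroids are computed in each iteration, which is why this approach scales to large data sets. The result depends on the initial seeds.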
Determining the optimal number of
clusters from hierarchical methods
(in SPSS)
• Agglomeration schedule
• Icicle plot
• Dendrogram
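To show what an agglomeration schedule contains, the sketch below runs a naive agglomerative procedure with single linkage and records the distance at which each merge happens. It is not SPSS output. The "largest jump in merge distance" rule used to suggest the number of clusters is a simple heuristic assumed for the example, not one of the formal statistical tests, and the data points are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(a, b, points):
    """Minimum distance between members of clusters a and b (index lists)."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Return a schedule of (cluster_a, cluster_b, distance, clusters_left)."""
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (first pair wins on ties).
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        schedule.append((clusters[x], clusters[y], d, len(clusters) - 1))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (x, y)] + [merged]
    return schedule

def suggest_number_of_clusters(schedule):
    """Heuristic: stop just before the largest jump in merge distance."""
    distances = [step[2] for step in schedule]
    jumps = [b - a for a, b in zip(distances, distances[1:])]
    return schedule[jumps.index(max(jumps))][3]

# Hypothetical observations forming three tight pairs
points = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0)]
schedule = agglomerate(points)
for cluster_a, cluster_b, distance, clusters_left in schedule:
    print(cluster_a, cluster_b, distance, clusters_left)
print(suggest_number_of_clusters(schedule))  # 3
```

A dendrogram displays the same information graphically: the height of each join is the merge distance recorded in the schedule.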