Clustering

Transcript Clustering

Clustering analysis workshop
CITM, Lab 3
18 Oct 2014
Facilitator:
Hosam Al-Samarraie, PhD.
Outline
• The basic concepts of cluster analysis.
• The different types of clustering procedures.
• How to execute and generate clustering results.
• The SPSS clustering outputs.
• The learning machine outputs.
What Does Data Mining Do?
• Data mining extracts patterns from data
– Pattern? A mathematical (numeric and/or
symbolic) relationship among data items
• Types of patterns
– Association
– Prediction
– Cluster (segmentation)
Knowledge Discovery
Steps in a Knowledge Discovery
process
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: the output is known, and we want to examine
the effect of the independent variables on the
dependent one.
• Unsupervised learning (clustering)
– The class or the nature of the variables is unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
The concept of cluster analysis
Cluster analysis is unsupervised learning for identifying homogenous groups of objects called clusters.
Objects in a cluster share many characteristics, but are very dissimilar to objects not belonging to that cluster.
Cont…
• Measuring distances (differences or
dissimilarities between subjects)
• Measuring proximities (similarity
between subjects)
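A minimal sketch of the two ideas above, in plain Python. The subjects and their height/weight values are made up for illustration, and the `similarity` formula is just one common way to turn a distance into a proximity score, not the only one.

```python
import math

# Hypothetical subjects described by (height in cm, weight in kg).
subjects = {
    "A": (160.0, 55.0),
    "B": (162.0, 58.0),
    "C": (185.0, 90.0),
}

def euclidean_distance(p, q):
    """Straight-line distance between two subjects measured on the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity(p, q):
    """Proximity score in (0, 1]: 1 for identical subjects, close to 0 for distant ones."""
    return 1.0 / (1.0 + euclidean_distance(p, q))

d_ab = euclidean_distance(subjects["A"], subjects["B"])
d_ac = euclidean_distance(subjects["A"], subjects["C"])
print(f"distance A-B = {d_ab:.2f}, distance A-C = {d_ac:.2f}")
print(f"similarity A-B = {similarity(subjects['A'], subjects['B']):.3f}")
```

A small distance and a high similarity say the same thing: A and B belong together more than A and C do.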
Types of Data
• Numeric: count, length
• Not numeric: gender, age group
Typical research questions the Cluster
Analysis answers are as follows:
• Medicine – What are the diagnostic clusters?
To answer this question the researcher would devise a diagnostic questionnaire that
entails the symptoms (for example in psychology standardized scales for anxiety,
depression etc.). The cluster analysis can then identify groups of patients that present
with similar symptoms and simultaneously maximize the difference between the groups.
• Marketing – What are the customer segments?
To answer this question a market researcher conducts a survey most commonly covering
needs, attitudes, demographics, and behavior of customers. The researcher then uses
the cluster analysis to identify homogenous groups of customers that have similar needs
and attitudes but are distinctively different from other customer segments.
• Education – What are student groups that need special attention?
The researcher measures a couple of psychological, aptitude, and achievement
characteristics. A cluster analysis then identifies what homogeneous groups exist among
students (for example, high achievers in all subjects, or students that excel in certain
subjects but fail in others, etc.).
A discriminant analysis then profiles these performance clusters and tells us what
psychological, environmental, aptitudinal, affective, and attitudinal factors characterize
these student groups.
Types of clustering
Hierarchical Clustering
1. Uses agglomerative ("bottom-up") algorithms that begin with each element as a
separate cluster and merge them into successively larger clusters.
2. Handles continuous data.
Cont…
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits
[Figure: dendrogram of six objects, leaf order 1, 3, 2, 5, 4, 6; vertical axis from 0 to 0.2]
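The bottom-up merging that a dendrogram records can be sketched in plain Python. This is an illustration only: it uses average linkage as a simple stand-in (not Ward's method and not SPSS's code), and the function name `agglomerate` and the sample points are assumptions.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Bottom-up clustering with average linkage; returns the list of merges."""
    clusters = {i: [i] for i in range(len(points))}  # every point starts alone
    merges = []
    next_id = len(points)

    def linkage(a, b):
        # Average distance over all pairs of members from the two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them into a new one.
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        merges.append((a, b, linkage(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Hypothetical two-variable data: two tight pairs and one far-away point.
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]
merges = agglomerate(points)
for a, b, height in merges:
    print(f"merge {a} + {b} at distance {height:.3f}")
```

Each printed line corresponds to one join in the dendrogram, and the distance is the height at which the two branches meet.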
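Non-hierarchical
K-means clustering
1. Begin with k starting center points (two by default) and allocate each item
to the nearest cluster center.
2. Recompute each cluster center as the mean of its items and reallocate items to the nearest center; repeat until the allocation no longer changes.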
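A minimal sketch of the k-means loop in plain Python, with the usual center-update step included. It is not SPSS's implementation; the function name `k_means`, the height/weight values, and the choice of the first and fourth person as starting centers are assumptions for illustration.

```python
import math

def k_means(points, centers, max_iter=100):
    """Assign each item to its nearest center, recompute the centers, repeat."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Allocation step: each item goes to its nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # allocation stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical (height, weight) measurements for six people.
people = [(160, 55), (162, 58), (158, 52), (185, 90), (182, 88), (188, 95)]
labels, centers = k_means(people, centers=[people[0], people[3]])
print(labels)
print(centers)
```

Different starting centers can lead to different solutions, so it is worth rerunning with other starting points.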
Mixed
Two-Step Clustering
1. Designed to handle very large data sets.
2. Can handle both continuous and categorical variables or
attributes.
3. Automatically selects the number of clusters.
Generate clustering
1. Decide on cluster variables
• At the beginning of the clustering process, we
have to select appropriate variables for
clustering.
Note!!!
• It is important to avoid using an abundance of clustering
variables, as this increases the odds that the variables are
no longer dissimilar.
• Meaning? If highly correlated variables are used for cluster
analysis, specific aspects covered by these variables will be
overrepresented in the clustering solution.
• In this regard, absolute correlations above 0.90 are always
problematic.
• For example, measuring both the happiness and the joy of a person.
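A quick way to screen candidate variables against the 0.90 rule, sketched in plain Python. The survey scores below are invented, `income` is a hypothetical third variable, and the helper names `pearson` and `highly_correlated` are assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def highly_correlated(data, threshold=0.90):
    """Return the variable pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged

# Hypothetical scores for five respondents.
survey = {
    "happiness": [1, 2, 3, 4, 5],
    "joy": [1, 2, 3, 5, 5],
    "income": [3, 1, 4, 1, 5],
}
flagged = highly_correlated(survey)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.2f} (consider dropping one)")
```

With these made-up numbers only the happiness/joy pair is flagged, so one of the two could be left out of the clustering.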
Insight!!
• When we use factor analysis to reduce the clustering variables, the factor solution
usually leaves a certain amount of variance unexplained;
• As such, information is discarded before the segments are
identified.
• In addition, removing variables with low loadings on all the extracted
factors means that some potential information for the identification
of segments is discarded.
• This in turn reduces the possibility of identifying different groups.
• Finally, the interpretation of the resulting factors based on the original variables becomes
questionable.
2. Decide on the clustering procedure
• Refers to the process of forming the clusters.
Dataset
• Let's say I have different people with different
measures of height and weight (variables).
• Now, if I want to group those people by
weight and height into different groups, then I
need to use cluster analysis.
The SPSS clustering
• Cases: the people to be clustered.
• Variables: the measures to cluster on. These can be performance, achievement, etc.
Cont…
Hierarchical methods: if there is a limited
number of observations, usually < 200.
▸ Analyze ▸ Classify ▸ Hierarchical Cluster
K-Means: if there are many observations,
usually > 500.
▸ Analyze ▸ Classify ▸ K-Means Cluster
Two-step cluster: if there are many
observations and the clustering variables are measured
on different scale levels (5-point Likert scale,
nominal, ordinal, etc.)
▸ Analyze ▸ Classify ▸ Two-Step Cluster
In Hierarchical
Select a Clustering Algorithm
• Ward’s method (only hierarchical clustering)
▸ Analyze ▸ Classify ▸ Hierarchical Cluster ▸
Method ▸ Cluster Method
Select measure of Similarity
In hierarchical
• Only applies to hierarchical and two-step methods
Euclidean distance is the most commonly used measure when it comes to analyzing ratio or interval-scaled data.
Select measure of Similarity
In Two-step
Two-step clustering:
• ▸ Analyze ▸ Classify ▸ Two-Step Cluster ▸
Distance Measure
Standardize in hierarchical only.
In both methods, convert variables with multiple
categories (to a range of 0 to 1 or -1 to 1, or use
Z scores).
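The three conversions mentioned above (range 0 to 1, range -1 to 1, and Z scores) can be sketched in plain Python. The helper names and the height values are assumptions; the Z scores here use the sample standard deviation, which is one reasonable choice.

```python
import statistics

def z_scores(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def rescale(values, low=0.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`."""
    vmin, vmax = min(values), max(values)
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

heights = [150, 160, 170, 180, 190]  # hypothetical heights in cm
print(rescale(heights))              # range 0 to 1
print(rescale(heights, -1.0, 1.0))   # range -1 to 1
print([round(z, 2) for z in z_scores(heights)])
```

After any of these conversions, a variable measured in large units (such as height in cm) no longer dominates the distance calculation.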
3. Identify the number of clusters
• For hierarchical clustering, by examining the
dendrogram:
Not always recommended
• ▸ Analyze ▸ Classify ▸ Hierarchical Cluster
• ▸ Plots ▸ Dendrogram
Alternative solution
• Draw a scree plot (e.g., using Microsoft Excel) based
on the coefficients in the agglomeration schedule.
(Elbow method). In this example, 2 clusters are a possible choice.
[Figure: scree plot of the agglomeration schedule coefficients; vertical axis "Coefficient", -2000 to 9000; horizontal axis 0 to 25]
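The elbow rule can also be applied numerically: look for the largest jump between consecutive coefficients in the agglomeration schedule and stop just before it. A minimal sketch, assuming one coefficient per merge stage; the function name `suggest_clusters` and the coefficient values are invented for illustration.

```python
def suggest_clusters(coefficients):
    """Suggest a number of clusters from agglomeration coefficients.

    coefficients[i] is the coefficient at stage i + 1; n cases give n - 1 stages.
    """
    n_cases = len(coefficients) + 1
    jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    # jumps[biggest] is the increase between stage biggest + 1 and stage biggest + 2.
    # Stopping after stage biggest + 1 leaves n_cases - (biggest + 1) clusters.
    return n_cases - (biggest + 1)

# Hypothetical schedule for 8 cases (7 merge stages).
schedule = [10, 25, 45, 80, 130, 900, 8000]
print(suggest_clusters(schedule))
```

This only automates reading the plot; a second, smaller jump may point to another solution worth inspecting.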
For two-step and k-means
• Note: two-step clustering identifies the number
of clusters automatically.
• However, K-means uses a default of 2 clusters. The most
commonly recommended number is 3-4 clusters.
• So you need to try both and see which one
provides useful output.
Save membership
• After identifying the number of clusters, we
save the cluster membership of each case.
Click save
Add 2
Membership to be used
Here is the membership
4. Assess the solution’s stability
• Use other clustering methods and compare the
results with each other.
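One simple way to compare two saved membership variables is the Rand index: the share of case pairs on which both solutions agree (together in both, or apart in both). This is a sketch of one possible measure, not something SPSS reports in the clustering dialogs; the membership lists are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of case pairs treated the same way by both clusterings (0 to 1)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Hypothetical memberships of six cases from three methods.
hierarchical = [1, 1, 1, 2, 2, 2]
k_means_run = [2, 2, 2, 1, 1, 1]   # same groups, different label numbers
two_step = [1, 1, 2, 2, 2, 2]      # one case placed differently

print(rand_index(hierarchical, k_means_run))
print(round(rand_index(hierarchical, two_step), 3))
```

The label numbers themselves do not matter, only which cases end up together.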
Assess the solution’s validity
• Criterion validity: Evaluate whether there are
significant differences between the segments
resulting from the membership step.
• p < 0.05: we are doing well.
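The test behind that p value is typically a one-way ANOVA with the saved membership as the grouping variable. A minimal sketch that computes only the F statistic in plain Python; the weights are invented, and turning F into a p value needs the F distribution (SPSS or a statistics package does that step).

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of measurements."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the cases around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical weights (kg) of the cases in cluster 1 and cluster 2.
cluster_1 = [52, 55, 58]
cluster_2 = [88, 90, 95]
f_value = one_way_f([cluster_1, cluster_2])
print(f"F = {f_value:.2f}")
```

A large F means the clusters differ far more between each other than the cases differ within a cluster.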
Interpret the cluster solution
• Examine cluster centroids and assess whether
these differ significantly from each other (e.g.,
by means of t-tests or ANOVA). As we did
earlier.
• Identify names or labels for each cluster and
characterize each cluster by means of
observable variables, if necessary (cluster
profiling).
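Cluster profiling starts from the mean of each variable within each cluster. A small sketch in plain Python, assuming each case is a dictionary of variables and the membership list comes from the saved cluster variable; all names and values are illustrative.

```python
from collections import defaultdict
from statistics import mean

def cluster_profiles(rows, membership):
    """Mean of every variable within each cluster (the cluster centroids)."""
    grouped = defaultdict(list)
    for row, cluster in zip(rows, membership):
        grouped[cluster].append(row)
    return {
        cluster: {var: mean(r[var] for r in members) for var in members[0]}
        for cluster, members in grouped.items()
    }

# Hypothetical cases and their saved cluster membership.
rows = [
    {"height": 160, "weight": 55},
    {"height": 162, "weight": 58},
    {"height": 185, "weight": 90},
    {"height": 187, "weight": 94},
]
membership = [1, 1, 2, 2]
profiles = cluster_profiles(rows, membership)
for cluster, centroid in profiles.items():
    print(cluster, centroid)
```

With these numbers, cluster 1 could be labeled "shorter and lighter" and cluster 2 "taller and heavier".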
SPSS
• That’s all. Now let’s try it in SPSS.
Another example
• Let's say I want to explore children who need
special learning.
• So I collected some data about children's
reading and cognitive performance gain.
Now I ask the question,
• Which groups of children need extra
learning?
• For the data, go to this URL: www.hosamspace.com/data
• Download the cluster children data.
• Open the file in SPSS (or just double-click it).
• Now observe the data.
Thank you
• Any further inquiry:
