Cluster Analysis

The term cluster analysis encompasses a number of
different algorithms and methods for grouping objects
of similar kind into respective categories.
A general question facing researchers in many areas of
inquiry is how to organize observed data into meaningful
structures, that is, to develop taxonomies.
Cluster analysis is thus an exploratory data analysis
tool that sorts objects into groups such that the degree
of association between two objects is maximal if they
belong to the same group and minimal otherwise.
Monday, 28 March 2016
12:08 AM
In other words, cluster analysis simply discovers
structures in data without explaining why they exist.
For an overview see:
Odilia Yim and Kylee T. Ramdeen (2015). Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and Application to Psychological Data. The Quantitative Methods for Psychology (TQMP), 11(1), 8-21.
Muhammad Naveed Khalid (2011). Cluster Analysis – A Standard Setting Technique in Measurement and Testing. Journal of Applied Quantitative Methods, 6(2), 46-58.
The data (set a) are correlations between variables
relating to home and school circumstances of children.
The file contains the full matrix of correlations, which
we use as similarities.
X1  Parental circumstances in 1964
X2  Details of class teacher in 1964
X3  School-parent interaction in 1964
X4  Girl's attitude in 1964
X5  Test score in 1964
X6  Type of school in 1968
X7  Parental circumstances in 1968
X8  School-parent interaction in 1968
X9  Test score in 1968
Rather than clustering individuals (as is usual), the aim is
to examine how five measurements made on schoolgirls in
1964 relate to four measurements made on the same girls
in 1968. The data are for 398 girls who were in their final
year of primary school in 1964 and their fourth year of
secondary school in 1968. About a quarter of the children
could not be traced, which may bias the results. The nine
variables are composite measures.
Source
David J. Bartholomew, Fiona Steele, Irini Moustaki and J.I. Galbraith (2002). The Analysis and Interpretation of Multivariate Data for Social Scientists. Chapman and Hall/CRC. Table 2.17.
G.F. Peaker (1971). The Plowden Children Four Years Later. National Foundation for Educational Research in England and Wales. Table 7.
In order to have the matrix of proximities recognized as such by
the cluster procedure, we must add two variables to the matrix file
and we must run the procedure as a syntax command. The two
variables are ROWTYPE_ and VARNAME_. Both variables are string
variables with a width of 8 characters.
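The layout of such a matrix file can be sketched outside SPSS. The Python snippet below is only an illustration: the three-variable correlation matrix is made up (it is not the Plowden data), the helper name matrix_rows is hypothetical, and the row-type label "PROX" is an assumption that should be checked against the SPSS documentation for your version.

```python
import csv
import io

# Hypothetical 3-variable correlation matrix, for illustration only;
# these numbers are NOT the Plowden values.
names = ["X1", "X2", "X3"]
corr = [
    [1.00, 0.20, 0.50],
    [0.20, 1.00, 0.10],
    [0.50, 0.10, 1.00],
]

def matrix_rows(names, matrix, rowtype="PROX"):
    """Return one dict per matrix row, with the two extra string columns.

    rowtype="PROX" is an assumption about the label SPSS expects for a
    proximity matrix; check it against your SPSS documentation.
    """
    rows = []
    for name, values in zip(names, matrix):
        # Both string columns are truncated to a width of 8 characters.
        row = {"ROWTYPE_": rowtype[:8], "VARNAME_": name[:8]}
        row.update(zip(names, values))
        rows.append(row)
    return rows

rows = matrix_rows(names, corr)

# Write the layout as CSV text so it can be inspected.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ROWTYPE_", "VARNAME_"] + names)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Each row carries its row type and the name of the variable it describes, followed by that variable's similarities to every variable in the matrix.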
First load the data matrix (set a), either by activating
the button on the web site or by saving the data to
your local machine and opening it as follows.
File >
Open >
Data
Then navigate to the location of the required file and
select “Open”.
Now open the syntax window.
File >
New >
Syntax
Enter the following into the syntax window.
CLUSTER
/MATRIX IN (*)
/METHOD COMPLETE
/PRINT SCHEDULE
/PLOT DENDROGRAM.
Simply cut and paste.
Now run the syntax (Run > All).
The agglomeration schedule describes the successive
formation of the clusters: variable 1 links to 7, then 5
to 9, then 5 to 6, and so on.
Agglomeration Schedule
        Cluster Combined                       Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1       1           7            .770          0           0           4
2       5           9            .758          0           0           3
3       5           6            .572          2           0           4
4       1           5            .388          1           3           5
5       1           3            .305          4           0           6
6       1           4            .193          5           0           7
7       1           8            .128          6           0           8
8       1           2           -.050          7           0           0
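The schedule can be replayed to see which variables share a cluster after each stage. The following is a minimal Python sketch, not SPSS output: the stage rows are copied from the agglomeration schedule, while the function names replay and first_stage_together are hypothetical helpers. It assumes that a merged cluster keeps the label shown in the "Cluster 1" column, which is how the later stages of the table refer to it.

```python
# (stage, cluster 1, cluster 2, coefficient) copied from the schedule.
schedule = [
    (1, 1, 7, 0.770),
    (2, 5, 9, 0.758),
    (3, 5, 6, 0.572),
    (4, 1, 5, 0.388),
    (5, 1, 3, 0.305),
    (6, 1, 4, 0.193),
    (7, 1, 8, 0.128),
    (8, 1, 2, -0.050),
]

def replay(schedule, n_items=9):
    """Return the cluster membership after each stage.

    Assumes the merged cluster keeps the label in the "Cluster 1" column.
    """
    clusters = {i: {i} for i in range(1, n_items + 1)}
    history = []
    for stage, first, second, coefficient in schedule:
        # Absorb the second cluster into the first.
        clusters[first] |= clusters.pop(second)
        snapshot = sorted(sorted(members) for members in clusters.values())
        history.append((stage, coefficient, snapshot))
    return history

def first_stage_together(history, a, b):
    """Stage at which variables a and b first share a cluster."""
    for stage, _, snapshot in history:
        if any(a in members and b in members for members in snapshot):
            return stage
    return None

history = replay(schedule)
for stage, coefficient, snapshot in history:
    # Show only clusters with more than one member.
    multi = [members for members in snapshot if len(members) > 1]
    print(f"stage {stage} (coefficient {coefficient:+.3f}): {multi}")
```

Calling first_stage_together(history, 1, 7) reports the stage at which x1 and x7 are first joined, and the same call works for any other pair of variables.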
The dendrogram summarises the agglomeration schedule
from the previous slide; it is all you really need.
The second column lists the variables by their sequence
numbers: (x)1, (x)7 … (x)2.
The dendrogram shows that two pairs of variables,
parental circumstances in 1964 (x1) and 1968 (x7), and
test scores in 1964 (x5) and 1968 (x9), are each closely
linked.
The two school-parent interaction variables (x3 and x8),
by contrast, are not closely linked: they are joined only
at the seventh of the eight steps, when x8 enters the
main cluster.
We might conclude that the teacher’s characteristics (x2),
the girl’s attitude in 1964 (x4), and the school-parent
interaction in 1968 (x8) are only weakly associated with
the test scores (x5 and x9), whereas the other four
variables (x1, x3, x6 and x7) have stronger associations
with the test scores.
The above conclusions can be confirmed by examining the
correlation matrix.
In SPSS numerous methods and measures are available.
The three methods are:
K-means cluster is a method for quickly clustering large
data sets, which typically take a while to compute with
the preferred hierarchical cluster analysis. The
researcher must define the number of clusters in
advance. This is useful for testing different models with
different assumed numbers of clusters (for example, in
customer segmentation).
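The idea behind k-means can be sketched in a few lines of Python. This is a plain textbook version, not the SPSS implementation: the function names, the random choice of starting centres and the tiny data set are all assumptions made for illustration. Note that k has to be supplied by the caller, as described above.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    """Plain Lloyd-style k-means; k must be chosen in advance."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k of the data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centre moves to the mean of its points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = mean_point(members)
    return labels, centres

# Hypothetical two-dimensional data with two obvious groups.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
labels, centres = k_means(data, k=2)
print(labels)
```

Re-running with a different k gives a different model, which is how alternative numbers of clusters would be compared.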
Hierarchical cluster is the most common method. We
will discuss this method shortly. It takes time to
calculate, but it generates a series of models with
cluster solutions from 1 (all cases in one cluster) to n
(each case is its own cluster). Hierarchical cluster also
works with variables as opposed to cases; it can cluster
variables together in a manner somewhat similar to
factor analysis. In addition, hierarchical cluster analysis
can handle nominal, ordinal, and scale data; however, it is
not recommended to mix different levels of
measurement.
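The series of solutions from n clusters down to 1 can be illustrated with a naive agglomerative procedure in Python. This is a sketch under stated assumptions, not the SPSS algorithm: it uses Euclidean distances on made-up one-dimensional data, the function name agglomerate is hypothetical, and it supports only single and complete linkage.

```python
import math

def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering on Euclidean distances.

    Returns the merges in order as (members_a, members_b, distance).
    linkage="single" uses the closest pair between two clusters,
    linkage="complete" the farthest pair.
    """
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pick(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical data: five cases measured on one variable.
data = [(0.0,), (1.0,), (3.0,), (7.0,), (8.0,)]
for left, right, distance in agglomerate(data, linkage="complete"):
    print(left, "+", right, "at distance", distance)
```

With n cases there are n - 1 merges, and stopping after any one of them yields one of the cluster solutions in the series. The nested search over all pairs is what makes the method slow on large data sets.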
Two-step cluster analysis is more of a tool than a single
analysis. It identifies the groupings by running
pre-clustering first and then hierarchical methods.
Because it uses a quick cluster algorithm upfront, it can
handle large data sets that would take a long time to
compute with hierarchical cluster methods. In this
respect, it combines the best of both approaches.
Two-step clustering can also handle scale and ordinal
data in the same model, and it automatically selects the
number of clusters, a task normally assigned to the
researcher in the two other methods.
For a second example (set b) the data are
the percentage employed in different
industries in European countries during 1979.
The job categories are agriculture, mining,
manufacturing, power supplies,
construction, service industries, finance,
social and personal services, and transport
and communications.
Agr  agriculture
Min  mining
Man  manufacturing
PS   power supplies
Con  construction
SI   service industries
Fin  finance
SPS  social and personal services
TC   transport and communications
It is important to note that these data were collected during the Cold
War (source).
The data may be loaded by utilising the link on the module web site, or
saving locally and using the approach previously described.
The raw data
Select
Analyze
> Classify
> Hierarchical Cluster
Select all the variables and case labels
Select the desired plots
Select the desired method and scaling
Can you see any
“structure”?
A single-linkage cluster analysis shows that the
countries cluster together into three main groups
along political lines.
Group 1 at the top
of the plot contains
countries of
capitalist Western
Europe.
Group 2 contains countries of the communist
Eastern Bloc. I suspect Spain was affected
by its Fascist past.
Group 3 contains
Yugoslavia, which
was unaligned and
shared some
characteristics of
both other groups,
and Turkey, which is
probably more
properly classified
as an Asian nation
since only a small
percentage of its
land area lies on the
European continent.
Alternatively, enter the following into the syntax window.
DATASET DECLARE D0.08685272375079045.
PROXIMITIES
Agr Min Man PS Con SI Fin SPS TC
/MATRIX OUT(D0.08685272375079045)
/VIEW=CASE
/MEASURE=EUCLID
/PRINT NONE
/ID=Country
/STANDARDIZE=VARIABLE SD.
CLUSTER
/MATRIX IN(D0.08685272375079045)
/METHOD SINGLE
/ID=Country
/PRINT SCHEDULE
/PLOT DENDROGRAM.
DATASET CLOSE D0.08685272375079045.
This syntax is particularly complex since it scales the
data. Why is this important?
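One way to explore this question is to compare Euclidean distances before and after scaling. The Python sketch below uses made-up percentages for three hypothetical countries (not values from data set b) and divides each variable by its sample standard deviation, which is assumed here to be what the STANDARDIZE=VARIABLE SD subcommand does. The function name scale_by_sd is hypothetical.

```python
import math
import statistics

def scale_by_sd(rows):
    """Divide each column by its sample standard deviation."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(column) for column in columns]
    return [tuple(value / sd for value, sd in zip(row, sds)) for row in rows]

# Hypothetical percentages for three countries and two industries;
# these are NOT values from data set b.
raw = [
    (3.0, 0.4),   # country A: (Agr, Min)
    (20.0, 0.5),  # country B
    (4.0, 2.9),   # country C
]

scaled = scale_by_sd(raw)

for label, rows in (("raw", raw), ("scaled", scaled)):
    d_ab = math.dist(rows[0], rows[1])
    d_ac = math.dist(rows[0], rows[2])
    print(f"{label}: d(A,B) = {d_ab:.2f}, d(A,C) = {d_ac:.2f}")
```

In the raw data the distance is dominated by the variable with the larger spread, so a difference in the small variable barely registers. After scaling, each variable contributes on a comparable footing, and the relative distances between countries can change considerably.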
You could enter the following into the syntax window,
with no scaling.
CLUSTER
Agr Min Man PS Con SI Fin SPS TC
/METHOD SINGLE
/MEASURE=EUCLID
/ID=Country
/PRINT NONE
/PLOT DENDROGRAM.
This leads to a very different view (structure).
Clearly Turkey is still “unusual”.
For those who wish to investigate far beyond the scope of
this course see:
Christian Wiwie, Jan Baumbach and Richard Röttger (2015). Comparing the performance of biomedical clustering methods. Nature Methods. DOI: 10.1038/nmeth.3583
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is
error-prone and impossible to perform manually. Many computational methods have been developed to
tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene
expression to protein domains. Performance was judged on the basis of 13 common cluster validity
indices. We developed a clustering analysis platform, ClustEval, to promote streamlined evaluation,
comparison and reproducibility of clustering results in the future. This allowed us to objectively
evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each,
resulting in a total of more than 4 million calculated cluster validity indices. We observed that there
was no universal best performer, but on the basis of this wide-ranging comparison we were able to
develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to
pick the appropriate tool for their data type and allows method developers to compare their tool to
the state of the art.