Transcript Slide 1

College of Science & Technology
Dep. Of Computer Science & IT
BCs of Information Technology
Data Mining
Chapter 6_2: Clustering Methods
2013
Prepared by: Mahmoud Rafeek Al-Farra
www.cst.ps/staff/mfarra
Course’s Out Lines
2











Introduction
Data Preparation and Preprocessing
Data Representation
Classification Methods
Evaluation
Clustering Methods
Mid Exam
Association Rules
Knowledge Representation
Special Case study : Document clustering
Discussion of Case studies by students
Out Lines
3




Definition of Clustering
Clustering Process
Clustering Algorithms (Methods)
Cluster validation
Definition ?
4

Clustering is a division of data into groups of similar objects.

Each group, called cluster, consists of objects that are similar
between themselves and dissimilar to objects of other groups.

Clustering is an unsupervised classification problem.
Clustering Process (Document Case)
5
Preprocessing step
• Document cleaning
•Feature selection or extraction.
Clustering Algorithm
• Similarity Measure
• Criterion Clustering Function
Documents samples
1
2
Clusters
3
4
Results interpretation
Knowledge
Cluster validation
• External Indices
• Internal Indices
• Relative Indices.
Clustering algorithm design or selection
6



This step is usually combined with the selection of
a corresponding proximity measure and the
construction of a criterion function.
Obviously, the proximity measure directly affects
the formation of the resulting clusters.
Almost all clustering algorithms are explicitly or
implicitly connected to some definition of
proximity measure.
Clustering algorithm design or selection
7



In order to be able to group similar data objects a
proximity metric has to be used to find which
objects (or clusters) are similar.
Similarity Measure can be done through measure
how much two objects are similar to each other
(Similarity) or measure how mach two objects are
different (dissimilarity ).
There is a large number of similarity metrics
reported in the literature due to the large number of
representation models and clustering algorithms.
Clustering algorithm design or selection
8
Document cluster
Inter-ClusterDocument cluster
Sim.
Intra-Cluster
Sim.
Document cluster
Clustering Algorithms
9

Once a proximity measure is chosen, the
construction of a clustering criterion function
makes the partition of clusters an optimization
problem, which is well defined mathematically, and
has rich solutions in the literature.
Clustering Algorithms
10
Partitional Clustering
• K-means
• Fuzzy C-means
• Bisecting k-means
NN Clustering
Clustering
Algorithms
Density Clustering
Grid Clustering
Agglomerative
(AHC)
Hierarchical Clustering
Divisive
(DHC)
Hierarchical Clustering
11


Hierarchical techniques produce a nested sequence of
partitions, with a single all-inclusive cluster at the top and
singleton clusters of individual objects at the bottom.
The result of a hierarchical clustering algorithm can be viewed
as a tree, called a dendogram.
{a, b,c,d,e}
{a}, {b,c,d,e}
{a}, {b,c}, {d,e}
{a}, {b,c}, {d}, {e}
{a}, {b}, {c}, {d}, {e}
a
b
c
d
e
Hierarchical Clustering
12


AHC starts with the set of objects as individual
clusters; then, at each step merges the most two
similar clusters.
This process is repeated until a minimal number of
clusters have been reached, or, if a complete
hierarchy is required then the process continues
until only one cluster is left.
Hierarchical Clustering
13

DHC Methods work from top to bottom, starting
with the whole data set as one cluster, and at each
step split a cluster until only singleton clusters of
individual objects remain
Partitional Clustering
14



Partitional clustering techniques create a one-level
(un-nested) partitioning of the data points.
If K is the desired number of clusters, the partitional
approaches typically find all K clusters at once.
The most known class of partitional clustering
algorithms are the k-means algorithm and its
variants.
Centroids
Neural Networks-Based Clustering
15


Neural networks (NNs) are able to learn complex
relationships from data samples either in a
supervised or unsupervised fashion.
In supervised leaning, a labeled set of data is used
to train the network for modeling the input and
output functions, prior to testing. Whereas
unsupervised networks do not use such a priori
knowledge but they can learn the underlying
relationships from the data.
Next:
16


Cluster validation
Examples of Clustering algorithm
Prepare 2 slides for each of the following clustering
algorithm:
• Density Clustering
• Grid Clustering
Thanks
17