Transcript Slide 1

College of Science & Technology
Dept. of Computer Science & IT
BSc of Information Technology
Data Mining
Chapter 6_3: Clustering Methods
2013
Prepared by: Mahmoud Rafeek Al-Farra
www.cst.ps/staff/mfarra
Course Outline
2

 Introduction
 Data Preparation and Preprocessing
 Data Representation
 Classification Methods
 Evaluation
 Clustering Methods
 Mid Exam
 Association Rules
 Knowledge Representation
 Special Case Study: Document Clustering
 Discussion of Case Studies by Students
Outline
3

 Cluster validation
 Similarity measure
 Overall similarity
 Entropy

 Examples of Document Clustering algorithms
 Suffix Tree Clustering algorithm
 DIG for document representation
 A SOM-Based document clustering using phrases
 Text Clustering using SOM on Semantic Graphs
 Graph-based Growing Hierarchal Self-Organizing Map
Cluster validation - intro.
4

 The results of any clustering algorithm should be evaluated using an informative quality measure that reflects the “goodness” of the resulting clusters.

 In addition, it gives us the possibility to compare different clustering algorithms, since different approaches usually lead to different clusters.
Cluster validation - intro.
5

 Generally, there are two main measures for testing the quality of clusters; which of them to use depends on whether we have labeled data or no prior knowledge about the classification of the data objects.

Measure Type              | Measure Example
--------------------------|----------------------
Internal quality measure  | Overall similarity
External quality measure  | Entropy and F-measure
Similarity measure
6

 Similarity in the VSM is based on the distance or angle between two vectors. One of the most widely used distance measures is the family of Minkowski distances, described as:

   d_p(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}

Where:
 x and y: the vectors of the two objects
 i: the feature index; n: the number of features
 p: assumes values greater than or equal to 1
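The Minkowski distance above can be sketched in a few lines of Python (a minimal illustration; the function name is our own):

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    if len(x) != len(y) or p < 1:
        raise ValueError("vectors must have equal length and p >= 1")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)
```

For example, `minkowski_distance([0, 0], [3, 4], p=2)` returns the Euclidean distance 5.0, while `p=1` gives the Manhattan distance 7.0.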
Similarity measure
7

 A more common similarity measure, used specifically in document clustering, is the cosine correlation measure. It is defined as follows:

   cos(x, y) = (x · y) / (||x|| ||y||)

Where:
 (·) indicates the vector dot product, and ||·|| indicates the length (norm) of the vector
 ||x|| = √(x · x),  x · y = x_1 y_1 + x_2 y_2 + ...
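The cosine measure can be sketched directly from the definition above (a minimal illustration; the function name is our own):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)
```

Orthogonal vectors give 0, and parallel vectors (e.g. one document being a scaled copy of another) give 1, which is why cosine ignores document length.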
Overall Similarity
8

 The overall similarity is an internal measure which is used for computing cluster cohesiveness in the absence of any external knowledge.

 It uses the weighted similarity of the internal cluster similarity, as in:

   OverallSimilarity(C_u) = (1 / |C_u|^2) · Σ_{O_1, O_2 ∈ C_u} Sim(O_1, O_2)

Where:
 C_u: the cluster under consideration
 Sim(O_1, O_2): the similarity between the two objects O_1 and O_2, which belong to the cluster C_u
 |C_u|: the number of documents in the cluster
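The formula above can be sketched in Python, using cosine similarity as Sim (a minimal illustration; the function names are our own):

```python
import math

def cosine(x, y):
    """Cosine similarity used here as the pairwise Sim(O1, O2)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def overall_similarity(cluster, sim):
    """(1 / |Cu|^2) * sum of sim(o1, o2) over all ordered pairs in Cu
    (each object is also compared with itself)."""
    n = len(cluster)
    return sum(sim(o1, o2) for o1 in cluster for o2 in cluster) / (n * n)
```

A cluster of identical vectors scores 1.0; mixing orthogonal vectors pulls the score down, reflecting lower cohesiveness.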
Entropy Measure
9

 Entropy is one of the external measures; it provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering.

 Note that, using entropy, the best score is trivially obtained when each cluster contains exactly one data point.

 The lower the entropy, the better the quality of the clustering; the best quality using entropy is 0.
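A common way to compute this measure, sketched in Python (a minimal illustration; function names are our own): the entropy of each cluster is computed from the class labels of its members, then the cluster entropies are combined weighted by cluster size.

```python
import math

def cluster_entropy(labels_in_cluster):
    """Entropy of one cluster, given the class labels of its members."""
    n = len(labels_in_cluster)
    ent = 0.0
    for label in set(labels_in_cluster):
        p = labels_in_cluster.count(label) / n  # fraction of this class
        ent -= p * math.log2(p)
    return ent

def total_entropy(clusters):
    """Size-weighted entropy over all clusters (0 = every cluster is pure)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)
```

Two pure clusters give a total entropy of 0, while a cluster mixing two classes evenly gives 1 bit, the worst case for two classes.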
Suffix Tree Clustering algorithm
10

 The STC method is basically based on using a compact tree structure to represent the phrases shared between documents. For example:

 D1: cat ate cheese
 D2: mouse ate cheese too
 D3: cat ate mouse too
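The shared phrases a suffix tree exposes for the three example documents can be illustrated with a brute-force sketch (an actual suffix tree computes the same information in linear time; variable names here are our own):

```python
from collections import defaultdict

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}

# Map every contiguous word phrase to the set of documents containing it.
phrase_docs = defaultdict(set)
for doc_id, text in docs.items():
    words = text.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase_docs[" ".join(words[i:j])].add(doc_id)

# Base clusters: phrases shared by at least two documents.
base_clusters = {p: d for p, d in phrase_docs.items() if len(d) >= 2}
```

For the example, "cat ate" groups D1 and D3, "ate cheese" groups D1 and D2, and "ate" is shared by all three documents; STC then merges such base clusters with high document overlap.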
Document Index Graph for clustering (DIG)
11

 It is based on constructing an incremental cumulative graph representing the collection of documents, such that each node represents a term and stores all the required information about that term, while the edges represent the relations between terms that form the phrases.

 Then the documents are clustered incrementally using a histogram-based method that maximizes the tightness of clusters by carefully watching the similarity distribution inside each cluster.
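The graph-construction step can be sketched in a simplified form: terms become nodes, consecutive terms become edges, and each edge records which documents it came from, so edges shared by several documents reveal shared phrases. This is a heavily reduced sketch of the idea, not the full DIG structure; names are our own.

```python
from collections import defaultdict

def add_document(graph, doc_id, text):
    """Incrementally add a document to a cumulative term graph.

    graph maps an edge (term, next_term) to the set of documents in
    which the two terms appear consecutively."""
    words = text.split()
    for a, b in zip(words, words[1:]):
        graph[(a, b)].add(doc_id)

graph = defaultdict(set)
add_document(graph, "D1", "cat ate cheese")
add_document(graph, "D3", "cat ate mouse too")

# Edges shared by more than one document indicate shared phrases.
shared = {edge for edge, doc_ids in graph.items() if len(doc_ids) > 1}
```

Here the edge ("cat", "ate") is shared by D1 and D3, recovering the shared phrase "cat ate" incrementally as documents arrive.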
A SOM-Based document clustering using phrases
12

 This algorithm represents the document as a vector of phrases instead of single terms.

 Extracting phrases here is achieved by a phrase grammar extraction technique which is based on mutual information.

 Then the documents are represented as a phrase vector space, and then input to the SOM.

 [Diagram: Documents → Phrase Grammar Extraction → Phrase Feature Vectors → SOM]
Text Clustering using SOM on Semantic Graphs
13

 The semantic relations are captured by the Universal Networking Language (UNL), which expresses a document in the form of a semantic graph, with disambiguated words as nodes and the semantic relations between them (e.g. plt, coo) as edges.

 These graphs are then converted into vectors, and the SOM is applied to them as the clustering method.

 [Diagram: Documents → Semantic Graphs (UNL relations) → Feature Vectors → SOM]
Graph-based Growing Hierarchal Self-Organizing Map
14

 [Diagram: Web documents and well-structured XML documents → Preprocessing Step → Graph Representation (cumulative document graphs) → Similarity Measure → Hierarchy Growing SOM → Document Clusters]
Next:
15

Association Rules
Thanks
16