Transcript 2. Method

TECHNIQUE OF CLUSTERING
(Panorama of methods, works of the Benno Stein group)
Mikhail Alexandrov
Centro de Investigación en Computación
Instituto Politécnico Nacional de México
(+ 52) 5729-6000 ext. 56544; [email protected]
Results of the Investigation of the Benno Stein Group
Nature of data structure
Definition of the best data structure
based on graph presentation
Numerical method of clustering
Development of the MajorClust method from
the density-based group of methods
Validity and usability of clustering
F-measure for cluster usability (expert opinion)
Density index ρ for cluster validity (formal estimation)
Numerical experiments
Comparison of different methods of clustering
and different ways of index selection
Subject of Grouping
TEXTUAL DATA
NON-TEXTUAL DATA
Local Terminology
It does not matter whether the
source of the data is textual or
non-textual.
Data: work in the space of
parameters
Texts: work in the space of
words
Presentation of Textual Data
TEXTS (‘indexed’)
TEXTS (parametrized)
Local terminology
Indexed texts are
parametrized texts in
the space of words
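The following sketch is not from the original slides; it only illustrates what "indexed" texts are, namely vectors in the space of words. The sample texts and the use of scikit-learn's CountVectorizer are assumptions for illustration.

```python
# Hypothetical sketch: representing ("indexing") texts as vectors in the space of words.
# The sample texts and the CountVectorizer choice are assumptions, not part of the lecture.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "clustering groups similar documents",
    "classification assigns documents to known classes",
    "clustering of documents without a teacher",
]

vectorizer = CountVectorizer()
obj_atr = vectorizer.fit_transform(texts)   # object/attribute (documents x words) matrix

print(vectorizer.get_feature_names_out())   # the "space of words"
print(obj_atr.toarray())                    # each row is one indexed (parametrized) text
```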
Types of Grouping
Clustering
Classification
Characteristics:
Absence of patterns or descriptions
of classes, so the results are
defined by the nature of
the data themselves (N > 1)
Characteristics:
Presence of patterns or descriptions
of classes, so the results are
defined by the user (N >= 1)
Synonyms:
Classification without a teacher
Unsupervised classification
Synonyms:
Classification with a teacher
Supervised classification
Number of clusters
[ ] is known exactly
[x] is known approximately
[ ] is not known => searching
Special terms:
Categorization (of documents)
Diagnostics (engineering, medicine)
Recognition (engineering, science)
Objectives of Grouping
1. Organization (structuring) of object set
Process is named data structuring
Technologies: clustering, categorization
2. Improvement of the process of searching for interesting patterns
Process is named navigation
Technology: cluster-based search
3. Grouping for other applications:
- Knowledge discovery
Technology: clustering
- Summarization (generalization) of documents
Technologies: clustering, categorization
Note: do not confuse the type of grouping with its objective
Example 1.
Clustering of opinions of the experts invited by the Moscow
City Government
Example 2.
Categorization of letters and complaints of Moscow dwellers
directed to the Mayor of the city
Learning
General books about Information Retrieval
1. Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval.
Addison Wesley, 1999
2. Manning, C.D., Schütze, H. Foundations of Statistical Natural Language
Processing. MIT Press, 1999.
Journals and Congresses about Information Retrieval
1. Journal “Knowledge and Information Systems”, Springer
2. KDD – Knowledge Discovery in Databases
3. NLDB – Natural Language and Data Bases
4. XX-AI – National Congresses on Artificial Intelligence (MICAI, ….)
Learning
General books and articles about clustering
1. Hartigan, J., Clustering Algorithms, Wiley, 1975 <= !
2. Kaufman, L. and Rousseeuw, P., Finding Groups in Data, Wiley, 1990
3. Gordon, A.D. “How many clusters?”, Proc. of IFCS-96, Springer,
series “Studies in classification, data analysis…..”, 1996
Articles about clustering texts
1. Eissen, S., Stein, B. “Analysis of clustering algorithms for Web-based
search”, Springer, LNAI, vol. 2569
2. Stein, B., Eissen, S., Wibrock, F. “On cluster validity and the information
need of users”, Proc. 3rd Conf. on Artificial Intelligence Applications, IASTED, 2003
Universal packages with clustering algorithms
1. ClustAn (Scotland), www.clustan.com
Clustan Graphics-6 (2003)
2. MatLab
Descriptions are available on the Internet
3. Statistica
Descriptions are available on the Internet
General Scheme of Clustering Process
Here:
the raw and the refined matrices
are object/attribute (Obj/Atr) matrices
Preprocessing <=
Processing
Principal idea:
to transform texts into numerical form
so that mathematical tools can be used (the problem
is grouping documents, not
understanding them)
General Scheme of Clustering Process
Preprocessing
Processing
<=
Here:
instead of the Obj/Obj matrix,
the Atr/Atr matrix can be used
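As a hedged sketch (not from the slides), the lines below show how an object/object (Obj/Obj) similarity matrix, or alternatively an attribute/attribute (Atr/Atr) matrix, can be derived from the Obj/Atr matrix. Cosine similarity and the toy counts are assumptions; the lecture does not prescribe a particular measure.

```python
# Hypothetical sketch: from an Obj/Atr matrix to Obj/Obj and Atr/Atr matrices.
# Cosine similarity is an assumption; other closeness measures can be used.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

obj_atr = np.array([            # documents x words (toy counts)
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
])

obj_obj = cosine_similarity(obj_atr)    # documents x documents
atr_atr = cosine_similarity(obj_atr.T)  # words x words (the alternative mentioned above)

print(np.round(obj_obj, 2))
print(np.round(atr_atr, 2))
```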
Matrices to be Considered
Clustering for Categorization
Colour matrix “words-words”
before clustering
Clustering for Categorization
Colour matrix “words-words”
after clustering
Importance of Preprocessing
Definitions
Def. 1 “Let V be a set of objects. A clustering
C = { Ci | Ci ⊆ V } of V is a division of V into subsets for
which we have: ∪i Ci = V and Ci ∩ Cj = ∅ for i ≠ j.”
Set
Def. 2 “Let V be a set of nodes, E a set of edges, and φ a weight
function that reflects the distance between objects, so that we have
a weighted graph G = { V, E, φ }. In this case C is called a
clustering of G.”
Clique
In the framework of the second definition, every Ci induces a
subgraph G(Ci). Both the subsets Ci and the subgraphs G(Ci) are
called clusters.
Graph
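A minimal sketch, assuming Python sets, of the two conditions of Def. 1 (the clusters cover V and are pairwise disjoint); the helper name is hypothetical and not part of the lecture.

```python
# Hypothetical helper checking Def. 1: the clusters cover V and are pairwise disjoint.
from itertools import combinations

def is_clustering(V, clusters):
    covers_v = set().union(*clusters) == set(V)
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    return covers_v and disjoint

V = {1, 2, 3, 4, 5}
print(is_clustering(V, [{1, 2}, {3, 4, 5}]))     # True: a valid clustering of V
print(is_clustering(V, [{1, 2}, {2, 3, 4, 5}]))  # False: the clusters overlap
```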
Definitions
Principal note
Both definitions SAY NOTHING:
- about the quality of the clusters
- about the number of clusters
Reason for the difficulties
Nowadays there is no general agreement on a
universal definition of the term “cluster”.
What does it mean for a clustering to be good?
1. The closeness between objects inside clusters is
essentially greater than the closeness between
the clusters themselves
2. The constructed clusters correspond to the intuitive
notions of users (they are natural clusters)
Classification of methods
Based on the way of grouping
1. Hierarchical methods
Any neighbors
( N is not given)
2. Methods oriented to examples
K-means
( N is given)
3. Methods oriented to density
MajorClust
(N is calculated automatically)
Classification of methods
Based on belonging to
cluster
Exclusive methods
Every object belongs only to one
cluster. Methods are named hard
grouping methods
Based on data presentation
Methods oriented on sets in
free metric space
Every object is presented as a
point in a free space
Methods oriented on graphs
Non-exclusive methods
Every object can belong to several
clusters. Methods are named soft
grouping methods.
Every object is presented as an
element on graph
Fuzzy Grouping
Hard classification
Hard clustering
Hard categorization
Soft classification
Soft clustering
Soft categorization
Example.
The distribution of Muscovites’ letters
to the Government is a soft categorization
(the numbers in the table reflect the relative
weight of each theme)
Hierarchical methods
Agglomerative methods (bottom => top)
- Relations with neighbors
- Ward’s method
Divisive methods (top => bottom)
- Min-cut method
- Methods based on dissimilarity
(Figure: example dendrogram over the words “I”, “he”, “his”, “if”, “this”, “the”, “a”)
Hierarchical agglomerative methods
Neighbors
1. Nearest neighbor (single linkage)
2. Farthest neighbor (complete linkage)
3. Averaged neighbor (average linkage)
- Unweighted Pair-Group Average (UPGMA)
- Weighted Pair-Group Average (WPGMA)
- Unweighted Pair-Group Centroid (UPGMC) <= King method
Density
Ward’s method
The difference between the methods lies in the
evaluation of the closeness between clusters
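As a hedged illustration of the linkage variants listed above, the sketch below uses SciPy's hierarchical clustering; the toy points are assumptions, and SciPy's "centroid" linkage is used as a stand-in for the pair-group centroid variant.

```python
# Hypothetical sketch of the linkage variants: single (nearest neighbor),
# complete (farthest neighbor), average (UPGMA), centroid and Ward.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # toy data: two visible groups
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # only the cluster-closeness rule differs
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```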
Hierarchical methods
Neighbors:
every object is a cluster
Hierarchical methods
General algorithm
1.
Initially every object is its own
cluster.
2. A series of steps is
performed. At every step
the closest pair of clusters
is found and the two clusters
are merged.
3. At the end we have one
cluster.
Neighbors:
every object is a cluster
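A minimal sketch of the general agglomerative scheme just described: every object starts as its own cluster, and the closest pair is merged at every step. The single-linkage closeness and the stopping point (a target number of clusters rather than one cluster) are assumptions for illustration.

```python
# Hypothetical sketch of the general agglomerative algorithm with single linkage:
# start with singleton clusters and repeatedly merge the closest pair.
import numpy as np

def agglomerate(X, n_clusters):
    clusters = [[i] for i in range(len(X))]          # 1. every object is its own cluster
    while len(clusters) > n_clusters:                # 2. merge until the target is reached
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])  # single-linkage closeness
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)               # merge the closest pair
    return clusters                                  # 3. stop (here: at n_clusters, not at 1)

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])
print(agglomerate(X, 2))
```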
Hierarchical methods
Nearest neighbor method (NN)
Hierarchical methods
Farthest neighbor method
Hierarchical methods
Averaged neighbor
(Unweighted Pair-Group Centroid, UPGMC)
Hierarchical methods
Averaged neighbors
- Unweighted Pair-Group
Centroid (UPGMC) (a)
- Unweighted Pair-Group
Average (UPGMA) (b)
- Weighted Pair-Group
Average (WPGMA)
The difference lies in the evaluation
of the closeness between clusters
Methods oriented to examples
Variants of the
K-means method
- K-means, centroid (a)
- K-means, medoid (b)
- K-means, fuzzy
The difference lies in the
choice of the centers
Methods oriented to examples
K-means, centroid
Methods oriented to examples
The K-means method
General algorithm
1.
Initially, K centers are selected
at random.
2.
A series of steps is performed. At
every step the objects are
distributed among the centers
according to the nearest-center
criterion. Then all centers are
recalculated.
3.
The search stops when the
clusters no longer change.
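A minimal sketch of the K-means (centroid) scheme written directly from the three steps above; the random initialization, the convergence test and the toy data are assumptions.

```python
# Hypothetical sketch of the K-means (centroid) algorithm described above.
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0), max_iter=100):
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centers
    for _ in range(max_iter):
        # 2. assign each object to its nearest center, then recompute the centers
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # 3. stop when nothing changes
            break
        centers = new_centers
    return labels, centers

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = k_means(X, k=2)
print(labels, centers)
```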
Natural Structure of Data
Graph decomposition
C = (C1, C2, …, Cn) is a decomposition of the graph
Λ(C) is the weighted partial connectivity of C
λ(Ci) = λi is the edge connectivity of G(Ci)
H(C*, E(C*)) is the connectivity structure
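The sketch below computes a weighted partial connectivity in the spirit of the Λ-criterion. The formula Λ(C) = Σi |Ci|·λ(Ci) is my reading of the definition and should be treated as an assumption, and networkx is used only as a convenient way to get the edge connectivity λ; the toy graph is invented.

```python
# Hypothetical sketch of the Lambda-criterion: Lambda(C) = sum over clusters of
# |Ci| * lambda(Ci), where lambda(Ci) is the edge connectivity of the induced
# subgraph G(Ci).  This reading of "weighted partial connectivity" is an assumption.
import networkx as nx

def weighted_partial_connectivity(G, clusters):
    return sum(len(c) * nx.edge_connectivity(G.subgraph(c)) for c in clusters)

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
good = [{1, 2, 3}, {4, 5, 6}]   # two dense triangles: the natural structure
bad = [{1, 2, 3, 4}, {5, 6}]    # a split that cuts across the natural structure
print(weighted_partial_connectivity(G, good), weighted_partial_connectivity(G, bad))
```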
Density-based methods
MajorClust method
Principal idea
An object belongs to the cluster
to which the majority of its
neighbors belong
Suboptimal solution
Only a part of the neighbors is
considered at every step.
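A hedged sketch of the MajorClust idea: each node repeatedly adopts the label that carries the largest total edge weight among its neighbors, until nothing changes. This is my reading of the method, not the authors' code; the tie handling, the node ordering and the networkx-based details are assumptions.

```python
# Hypothetical sketch of the MajorClust idea: every node repeatedly adopts the
# label that carries the largest total edge weight among its neighbors,
# until no label changes.  Details (tie handling, ordering) are assumptions.
import random
from collections import defaultdict
import networkx as nx

def major_clust(G, max_sweeps=100, seed=0):
    rng = random.Random(seed)
    labels = {v: v for v in G}                      # start: every node is its own cluster
    for _ in range(max_sweeps):
        changed = False
        nodes = list(G)
        rng.shuffle(nodes)
        for v in nodes:
            votes = defaultdict(float)
            for u in G[v]:                          # neighbors vote with their edge weight
                votes[labels[u]] += G[v][u].get("weight", 1.0)
            if votes:
                best = max(votes, key=votes.get)
                if best != labels[v]:
                    labels[v], changed = best, True
        if not changed:
            break
    return labels

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
print(major_clust(G))
```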
Density-based methods
Comparison of the Nearest Neighbor method and
MajorClust
The NN method does not change the cluster membership of objects;
MajorClust does change the membership of objects.
Λ - criterion and MajorClust
Validity
Principal question:
Which of the formal validity
measures best reflects a user's
opinion?
Dunn index
Davies- Bouldin index
etc.
Dunn index (to be maximized)
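A minimal sketch of the Dunn index: the minimum distance between clusters divided by the maximum cluster diameter, so larger is better. The single-linkage between-cluster distance is one common variant and is an assumption here; other definitions exist.

```python
# Hypothetical sketch of the Dunn index: the minimum distance between clusters
# divided by the maximum cluster diameter (to be maximized).
import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    diam = max(np.linalg.norm(a - b) for c in clusters
               for a, b in combinations(c, 2))                 # largest within-cluster distance
    sep = min(np.linalg.norm(a - b) for ca, cb in combinations(clusters, 2)
              for a in ca for b in cb)                         # smallest between-cluster distance
    return sep / diam

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(dunn_index(X, labels), 3))
```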
Validity and Usability
Conclusion
The expected density measure
corresponds to the F-measure
reflecting the experts’ opinion
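As a hedged sketch of the usability side, the function below computes one common clustering F-measure against an expert (reference) partition: every expert class is matched with its best-fitting cluster and the class-wise F values are weighted by class size. This particular weighting is an assumption and may differ in detail from the definition used by the Stein group; the toy labels are invented.

```python
# Hypothetical sketch of a clustering F-measure against an expert partition.
import numpy as np

def clustering_f_measure(expert, clusters):
    expert, clusters = np.asarray(expert), np.asarray(clusters)
    n, total = len(expert), 0.0
    for cls in np.unique(expert):
        in_cls = expert == cls
        best = 0.0
        for clu in np.unique(clusters):
            in_clu = clusters == clu
            tp = np.sum(in_cls & in_clu)
            if tp == 0:
                continue
            p, r = tp / np.sum(in_clu), tp / np.sum(in_cls)    # precision and recall
            best = max(best, 2 * p * r / (p + r))
        total += np.sum(in_cls) / n * best                     # weight by class size
    return total

expert = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
print(round(clustering_f_measure(expert, found), 3))
```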
Classification of Technologies
Meta methods (Technology)
They construct separated data sets using optimization criteria and
constraints:
- neither too many nor too few clusters
- neither too large nor too small clusters
etc.
Visual methods (Technology)
They present visual images to the user so that the clusters can be selected manually
- using different methods
- comparing the results
Algorithm
Meta Methods
1. Initially the number of clusters Kini and their
centers Ci are given (at random)
2. The K-medoid method (or any other one) is run
3. If N > Nmax in any cluster, then this
cluster is divided along its diagonal. Go to step 2
4. If N < Nmin in any cluster, then the
closest clusters are joined. Go to step 2
5. If D > Dmax in any cluster, then this
cluster is divided along its diagonal. Go to step 2
6. If D < Dmin in any cluster, then the
closest clusters are joined. Go to step 2
7. When the number of iterations I > Imax, stop
Here: N is the number of objects in a given cluster
D is the diagonal of a given cluster
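A hedged sketch of the meta-method loop above, wrapped around scikit-learn's KMeans as a stand-in for K-medoid. The thresholds Nmax, Nmin, Dmax, Dmin are toy values, and "dividing a cluster along its diagonal" is simplified here to increasing the number of centers; both simplifications are assumptions.

```python
# Hypothetical sketch of the meta-method: run a base clusterer, then increase or
# decrease the number of clusters while any cluster violates the size (N) or
# diagonal (D) constraints.  KMeans stands in for K-medoid; thresholds are toy values.
import numpy as np
from sklearn.cluster import KMeans

def diagonal(points):
    return np.linalg.norm(points.max(axis=0) - points.min(axis=0))

def meta_cluster(X, k_ini=2, n_max=10, n_min=2, d_max=5.0, d_min=0.1, i_max=20):
    k = k_ini
    for _ in range(i_max):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        groups = [X[labels == j] for j in range(k)]
        sizes = [len(g) for g in groups]
        diags = [diagonal(g) for g in groups]
        if any(n > n_max for n in sizes) or any(d > d_max for d in diags):
            k += 1          # some cluster is too big: split it (add a center)
        elif (any(n < n_min for n in sizes) or any(d < d_min for d in diags)) and k > 1:
            k -= 1          # some cluster is too small: join (drop a center)
        else:
            break
        k = max(1, min(k, len(X)))
    return labels

X = np.vstack([np.random.default_rng(0).normal(c, 0.3, (8, 2)) for c in (0, 4, 8)])
print(meta_cluster(X))
```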
Technique of Visual Clustering
Clustering on dendrite
Clustering in space of factors
Demographic Map of the World,
clusters in space of factors and on dendrite
Searching for Leaders in Clusters
Objectives (Leader = Representative)
• To reduce the time for the elaboration (processing) of a document corpus
• To improve the procedure of navigation on the Internet (new)
etc.
Typical documents
A. The typical document (the averaged one)
It reflects the principal idea of a given cluster
B. The least typical document
It is good for reaching a consensus between clusters
C. The most typical document
It reflects the difference between clusters
Variants of Leaders
A. Typical
B. Least typical
C. Most typical
The typical, the most typical and the least typical documents
form the so-called semantic sphere (boundaries) of a cluster.
Examples
Typical: the most popular costume in China is jeans.
The most typical: the most typical costume in China is a kimono.
The least typical: the least typical costume in China is a European suit
(it is the closest one to the other clusters).
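A minimal sketch of selecting the leaders described above: the typical document is taken as the one closest to its own centroid, the least typical as the one closest to the other clusters' centroids, and the most typical as the one farthest from the other clusters' centroids. These readings of the slide, and the toy data, are assumptions.

```python
# Hypothetical sketch of choosing cluster leaders: "typical" = closest to the own
# centroid, "least typical" = closest to the other clusters' centroids,
# "most typical" = farthest from the other clusters' centroids (assumptions).
import numpy as np

def leaders(X, labels, cluster_id):
    own = X[labels == cluster_id]
    others = np.array([X[labels == c].mean(axis=0)
                       for c in np.unique(labels) if c != cluster_id])
    dist_own = np.linalg.norm(own - own.mean(axis=0), axis=1)
    dist_others = np.linalg.norm(own[:, None] - others[None, :], axis=2).min(axis=1)
    return {
        "typical": int(np.argmin(dist_own)),           # the averaged document
        "least_typical": int(np.argmin(dist_others)),  # closest to the other clusters
        "most_typical": int(np.argmax(dist_others)),   # stresses the difference
    }

X = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1])
print(leaders(X, labels, cluster_id=0))   # indices are local to cluster 0
```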
Open Problems
1. Clustering with filtering (a)
The given object distribution reflects:
the real structure (nature) +
noise (alien objects)
2. Validation of clusters (b)
What is really inside the black box?
3. Statistical clustering (c)
How to take into account the
randomness of the data?
4. Visual presentation (d)
Facility for visual clustering
Certain Observations
The number of clustering methods is a little bit larger than the
number of researchers working in this area.
The problem is not to find the best method for all cases;
the problem is to find the method that is relevant for your data.
Only you know which methods are best for your own data.
To be sure that the results are really good and do not depend on the method used,
it is recommended to test these results using other methods, and these
methods should preferably be antipodes of each other.
Solomon G., 1977. The strongest antipodes are the NN method and K-means.
The principal problem is the choice of a closeness measure
adequate to the given problem and the given data.
Frequently the results are bad because of a bad measure, not a bad
method!
CLUSTERING DOCUMENTS
(General information, works of the Benno Stein group)
End