SOM - Artificial Intelligence & Information Analysis Laboratory


Document Organization using
Self-Organizing Feature Maps
(WEBSOFM)
Apostolos Georgakis
Artificial Intelligence
and
Information Analysis Lab
Department of Informatics
Aristotle University of Thessaloniki
Self-Organizing Maps Algorithm
SOMs are neural networks with:
•Two-layer structure
•Feed-forward topology
They:
•form a non-linear projection from an arbitrary data manifold onto a low-dimensional discrete map
•perform competitive, unsupervised training
•map the probability density function (pdf) of the input space onto a 2D or 3D lattice
The topology of the lattice can be either hexagonal or orthogonal.
SOMs belong to the class of vector quantizers (k-means, VQ, LVQ, etc.).
Let $x_i(t) = [x_{i1}(t), x_{i2}(t), \dots, x_{im}(t)]^T$ denote an $m \times 1$ feature vector and let $m_j(t)$ denote the weight vector of the $j$-th neuron.
[Figure: feature vectors $x_i$ mapped onto the lattice of neurons with weight vectors $m_j$]
Index of the winning neuron:

$$c = \arg\min_j \left\| x_i(t) - m_j(t) \right\|$$
Updating function:

$$m_j(t+1) = \begin{cases} m_j(t) + h_{cj}(t)\,\left[ x_i(t) - m_j(t) \right], & j \in N_c \\ m_j(t), & j \notin N_c \end{cases}$$

where $N_c$ is the neighbourhood centred on the winner; the weight vectors inside it are modified toward $x_i(t)$.
$$h_{cj}(t) = a(t) \exp\!\left( - \frac{\left\| r_c - r_j \right\|^2}{2 \sigma^2(t)} \right)$$

$a(t)$ is the learning rate, and $\sigma(t)$ denotes the diameter of the updating kernel. $r_c$ and $r_j$ are the locations on the lattice of the winning neuron and of the neuron being updated, respectively.
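The full update step can be sketched in code. This is an illustrative NumPy toy, not the deck's actual implementation: the lattice size, learning-rate schedule, and σ schedule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def som_step(x, weights, coords, lr, sigma):
    """One SOM update: find the winner c = argmin_j ||x - m_j||, then pull
    neurons toward x, weighted by the Gaussian neighbourhood kernel
    h_cj = lr * exp(-||r_c - r_j||^2 / (2 sigma^2))."""
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # winning neuron
    d2 = np.sum((coords - coords[c]) ** 2, axis=1)           # squared lattice distances
    h = lr * np.exp(-d2 / (2.0 * sigma**2))                  # neighbourhood kernel
    weights += h[:, None] * (x - weights)                    # m_j(t+1)
    return c

# Toy 5x5 orthogonal lattice trained on random 3-D feature vectors.
coords = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
weights = rng.standard_normal((25, 3))
for t, x in enumerate(rng.standard_normal((200, 3))):
    som_step(x, weights, coords, lr=0.5 * 0.99**t, sigma=2.0 * 0.99**t)
```

Shrinking the learning rate and σ over time lets the map organize globally first and then fine-tune locally.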
Information Retrieval of
Domain Specific Material
Information retrieval (IR) consists of:
•Organization – representation and storage of the available data
•Retrieval – exploration of the organized data
Prior to retrieval, the data repository (corpus) has to be organized according to the retrieval method to be applied. Without a retrieval-oriented organization method, retrieving even a fraction of the available documents becomes onerous.
Information Retrieval of
Domain Specific Material
• Scenario:
– A user is interested in locating documents or web pages related to a certain topic.
– The user queries the IR system in natural language or supplies available sources (documents and/or web pages).
Information Retrieval
• The above problem can be dealt with by either the:
– Boolean model (index terms)
– Vector model (vectors in a t-dimensional space)
– Probabilistic model (probability theory)
Boolean Model
Documents are represented by index terms or keywords.
• Indexing techniques:
– Suffix arrays (a text suffix is a string extending from a text position to the end of the text)
– Signature files (word-oriented index structures based on hashing)
– Inverted files (a vocabulary file plus an occurrences file)
• Boolean operators are used in the retrieval.
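A minimal sketch of an inverted file with Boolean retrieval; the three toy documents are invented for illustration.

```python
from collections import defaultdict

docs = {
    1: "prague czech restaurant guide",
    2: "czech beer tourist service",
    3: "prague tourist guide",
}

# Vocabulary -> set of documents containing the term (the occurrences file).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean retrieval: AND is set intersection, OR is union, AND-NOT is difference.
hits = index["prague"] & index["guide"]
print(sorted(hits))  # documents containing both "prague" and "guide"
```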
• Drawbacks of the Boolean model
– It does not support significance-weighting of the query terms.
– Users are not always familiar with the Boolean operators.
– No sufficient mechanism exists for ranking the retrieved documents.
– It can handle only keyword-based queries.
• Alternatives
– Fuzzy Boolean operators.
Vector Space Model
• The words and the documents are represented by vectors.
• Advantages
– Supports term-weighting.
– The retrieved documents can be sorted by their similarity to the query, computed with the various vector norms.
– It is a user-friendly model.
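Term-weighting and similarity-based sorting can be illustrated with tf-idf weights and the cosine measure, a common instantiation of the vector space model; the toy corpus, the query, and the log-idf weighting are illustrative choices, not the deck's.

```python
import math
from collections import Counter

docs = [
    "czech restaurant prague",
    "prague tourist guide prague",
    "beer restaurant menu",
]
query = "prague restaurant"

vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)
df = {w: sum(w in d.split() for d in docs) for w in vocab}  # document frequency

def tfidf(text):
    """Weight each vocabulary term by term frequency * inverse document frequency."""
    tf = Counter(text.split())
    return [tf[w] * math.log(N / df[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q = tfidf(query)
ranked = sorted(range(N), key=lambda i: cosine(tfidf(docs[i]), q), reverse=True)
```

`ranked` lists document indices by decreasing similarity to the query — exactly the sorted retrieval the Boolean model lacks.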
Preprocessing
• Text processing
– HTML cleaning
– Plain-text cleaning (e-mail addresses, URLs, word separators, numbers, etc.)
• Stemming
– Clustering of word forms by elimination of word suffixes
• Feature vector extraction
– Calculation of the contextual statistics
– Formation of the feature vectors (vector space model)
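The cleaning and stemming stages can be sketched as follows; the regexes and the crude suffix list are illustrative assumptions (the stemmed output shown later in the deck suggests a Porter-style stemmer, which is considerably more elaborate than this).

```python
import re

SUFFIXES = ("ations", "ation", "ing", "ers", "er", "ed", "es", "s")  # crude list, not Porter

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text)  # e-mail addresses and URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)           # numbers and word separators
    return text.lower().split()

def stem(word):
    """Cluster word forms by stripping the first matching suffix."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

tokens = [stem(w) for w in clean("<p>Visit http://example.com for guided tours!</p>")]
```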
Language Modeling (Contextual Statistics)
• Consider, for each word, both its preceding and its following words.

$$x_i = \frac{1}{N_i} \begin{bmatrix} \displaystyle\sum_{l=1,\, l \neq i}^{N_w} n_{li}\, e_l \\ \varepsilon\, e_i \\ \displaystyle\sum_{m=1,\, m \neq i}^{N_w} n_{im}\, e_m \end{bmatrix} \quad \text{or} \quad x_i = \frac{1}{N_i} \sum_{l} n_{li}\, e_l$$
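The averaged context statistics can be sketched as follows. The random code vectors $e_w$, their 90-dimensional size, and the damping factor ε are illustrative assumptions (in WEBSOM-style systems each word gets a random real code vector, and ε de-emphasizes the word's own part of the feature vector).

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = "the cat sat on the mat the cat ate".split()
code = {w: rng.standard_normal(90) for w in sorted(set(corpus))}  # e_w per word

def context_vector(word, eps=0.2):
    """Stack [summed preceding-word codes; eps * own code; summed following-word
    codes] over the N_i occurrences of `word`, then divide by N_i."""
    left, right, n = np.zeros(90), np.zeros(90), 0
    for pos, w in enumerate(corpus):
        if w != word:
            continue
        n += 1
        if pos > 0:
            left += code[corpus[pos - 1]]
        if pos + 1 < len(corpus):
            right += code[corpus[pos + 1]]
    return np.concatenate([left, eps * code[word], right]) / max(n, 1)

x_cat = context_vector("cat")  # 3 * 90 = 270-dimensional feature vector
```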
[Figure: the word-category map for a 27×27 network]
k-th document
A Prague Guide - Czech Restaurants - Prag - Praha - From
Andel 3W Tourist Service English.htm
PRAGUE guid CZECH restaur PRAG PRAHA ANDEL tourist servic
ENGLISH CZECH restaur HOSTINEC KALICHA NEBOZIZEK
NOVOMESTSKY PIVOVAR POD KRIDLEM FLEKU KRKAVCU MALTEZSKYCH
RYTIRU HOSTINEC KALICHA NA BOJISTI PRAGUE famou CZECH
restaur PRAGUE popular . folklor musican realli get live
mood drink glass beer gobbl traddit CZECH GULAS DUMPLINGS
pick histori . reserv recommend . back list NEBOZIZEK NA
BOJISTI PRAGUE famou CZECH restaur PRAGUE popular .
….
These are combined together, and smoothing is applied.
Document Map
[Figure: document map]
Mean Square Error
[Figure: mean square error curve. The solid vertical line denotes the time instant when the simulated annealing process was used to overcome local minima in the clustering procedure.]
Dimensionality reduction
The dimensionality of the feature vectors is exceptionally high; e.g., for the HyperGeo corpus (393 web pages, 3084 stems) the dimension is 3084 × 3 = 9252.
Dimensionality reduction techniques:
•PCA, ICA, LDA
•Random projection
•Component elimination
Random Projection
In random projection we compute the $m \times 3N_w$ matrix $R$ with the following properties:
•The components in each column are chosen to be independent, identically distributed Gaussian variables with zero mean and unit variance.
•Each column is normalized to unit norm.

$$\hat{x}_i = R_{m \times 3N_w}\, x_i$$
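The construction of $R$ can be sketched directly; the target dimension of 300 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_out = 9252, 300   # 3 * N_w for the HyperGeo corpus, down to a few hundred

# Columns: i.i.d. zero-mean, unit-variance Gaussian entries, normalized to unit norm.
R = rng.standard_normal((dim_out, dim_in))
R /= np.linalg.norm(R, axis=0, keepdims=True)

x = rng.standard_normal(dim_in)      # a high-dimensional feature vector
x_proj = R @ x                       # its m-dimensional projection
```

Because random columns in high dimension are nearly orthogonal, pairwise distances are approximately preserved, which is what keeps the subsequent winner search meaningful in the reduced space.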
Component Elimination
To overcome the computational complexity of locating the winner, we order the components of the feature vectors according to the distance:

$$u_j = \sum_{i=1}^{k} \left( x_{ij}(t) - E_j\!\left[ x(t) \right] \right)^2$$

where

$$E_j\!\left[ x(t) \right] = \frac{1}{k} \sum_{i=1}^{k} x_{ij}(t)$$

and $k$ is the number of feature vectors.
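The ordering statistic is just the per-component total variation over the corpus; a sketch with invented toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10)) * np.arange(1, 11)  # k=50 vectors, 10 components

# u_j = sum_i (x_ij - E_j[x])^2, the total variation of component j.
u = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
order = np.argsort(u)[::-1]          # strongest components first
X_ordered = X[:, order]
```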
Rearrange the components in each feature vector using $u_j$ so that the components with the strongest values appear first. The Euclidean distance is then computed by

$$\left\| x_i(t) - m_j(t) \right\|^2 = \sum_{n=1}^{d} \left( x_{in}(t) - m_{jn}(t) \right)^2 + \sum_{n=d+1}^{m} \left( x_{in}(t) - m_{jn}(t) \right)^2$$

By omitting the second sum we can still get an estimate of the winning neuron.
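Omitting the tail sum amounts to searching over only the first d components; a sketch with invented dimensions and data, assuming the components have already been reordered by $u_j$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 100, 20                       # full dimension and truncation point
weights = rng.standard_normal((25, m))
x = rng.standard_normal(m)

# Exact winner over all m components.
exact = int(np.argmin(np.sum((weights - x) ** 2, axis=1)))

# Estimated winner: the sum over n = d+1..m is simply omitted.
approx = int(np.argmin(np.sum((weights[:, :d] - x[:d]) ** 2, axis=1)))
```

With purely random data the estimate need not match the exact winner; after the $u_j$ reordering the first d components carry most of the distance, so in practice it usually does.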
[Figure: percentage of the correctly identified winners with respect to the dimensionality difference between the original space and the “truncated” sub-space]