Artificial Immune Systems: An Emerging Technology
Download
Report
Transcript Artificial Immune Systems: An Emerging Technology
Document Maps
Slawomir Wierzchon , Mieczyslaw Klopotek
Michal Draminski Krzysztof Ciesielski
Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences
Warsaw
Research partially supported by the KBN research project 4 T11C
026 25 "Maps and intelligent navigation in WWW using
Bayesian networks and artificial immune systems"
Agenda
Motivation
What is a document map
Map creation
Clustering
Experimental results
Future directions
Motivation
The Web as well as intranets become
increasingly content-rich: simple ranked
lists or even hierarchies of results seem
not to be adequate anymore
A good way of presenting massive
document sets in an understandable
form will be crucial in the near future
Document map
Many attempts have been made to visualize
sets of dicuments not just like a list, but rather
in two dimensions
A document map is a mapping of a set of
documents to 2-D representing their interrelationships
Linear relationship presentation
(Internet Cartographer)
A relationship
A link between hypertext documents
Citation in the bibliography
Content similarity
A tree of relations with central
subject (Inxight – Tree Studio )
Selforganizing map (WebSOM)
dissimilarity of grouops of
documents
Document frequency in clusters
A meta search engine map
Our approach – multiple
representations (BEATCA)
Map visualizations in 3D
(BEATCA)
Future research –
hypergeometric representation
(Fish-Eye eEffect)
Processing Flow Diagram - BEATCA
Search Engine
DB
REGISTRY
Spider
INTERNET Downloading HT-Base
Do Spid
wn er
loa
di n
Indexing +
Optimizing VEC-Base
Clustering
of docs
DocGR
-Base
Mapping
MAPBase
Clustering
of cells
CellGRBase
........
........
g
........
........
........
HT-Base
........
The preparation of documents is done by an indexer, which
turns a document into a vector-space model representation
Indexer also identifies frequent phrases in document set for
clustering and labelling purposes
Subsequently, dictionary optimization is performed - extreme
entropy and extremely frequent terms excluded
The map creator is applied, turning the vector-space
representation into a form appropriate for on-the-fly map
generation
‘The best’ (wrt some similarity measure) map is used by the
query processor in response to the user’s query
How are the maps created
A modified WebSOM method is used:
– compact reference vectors representation
– broad-topic initialization method
– joint winner search method
– multi-level (hierarchical) maps
– multi-phase document clustering:
•
•
•
•
initial grouping to identify major topics
Initial document grouping
WEBSOM on document groups
fuzzy cell clusters extraction and labelling
My dog likes
Document model in search
this food
engines
In the so-called
vector model a
document is
considered as a
vector in space
spanned by the
words it contains.
dog
food
When walking, I
take some food
walk
Document model in search
engines
The relevance of
a document to a
query or to
another document
is measured as
cosine of angle
between the query
and the
document.
dog
food
Query: walk
walk
Reference vector representation
Vectors are sparse by nature
During learning process they become even
sparser
Represented as a balanced red-black trees
Tolerance threshold imposed
Terms (dimensions) below threshold are removed
Significant complexity reduction without
negative quality impact
Topic-sensitive initialization
Inter-topic similarities important both for map
learning and visualization/cluster extraction
Simple approach:
– Use LSI to select K main broad topics
– Select K map cells (evenly spread over the map) as the
fixpoints for individual topics
– Initialize selected fixpoints with broad topics
– Initialize remaining cells with „in-between values”
Clustering document vectors
r
x
m
Mocna zmiana
położenia (gruba
strzałka)
Document space
2D map
Important difference to general clustering: not only clusters
with similar documents, but also neighboring clusters similar
Joint winner search
Global winner search: accurate but slow
Local winner search: faster but can be inaccurate
during rapid changes
Start with single phase of global search
Document movements become more smooth
during learning process: usually local search is
enough
Use global search when occassional sudden
moves occur (eg. outliers, neighbourhood width
decrease)
Hierarchical maps
Bottom-up
approach
Feasible (with joint
winner search
method)
Start with most
detailed map
Compute weighted
centroids of map
areas
Use them as seeds
for coarser map
Top-down approach
is possible but
requires fixpoints
Clustering document groups
Numerous methods exists but none of them
directly applicable:
– Extremely fuzzy structure of topical groups in SOM
cells
– Neccesity of taking into account similiarity measures
both in original document space and in the map space
– Outlier-handling problem during cluster formation
– No a priori estimation of the number of topical groups
Fuzzy C-MEANS on lattice of map cells applied
Graph theoretical approach (density- and
distance- based MST) combined with fuzzy
clustering
Clustered documents are labeled by weighted
centroids of cell reference vectors scaled with
between-group entropy
Experiments with map
convergence
We examined the convergence of the maps
to a stable state depending on:
– type of alpha function (search radius
reduction)
– type of winner search method
– type of initialization method
Convergence – alpha functions
(linear versus reciprocal)
Convergence – winner search
(joint versus local)
Experiments with execution
time
The impact of the following factors on the speed
of map creation was investigated:
– Map size (total number of cells)
– Optimization methods:
• dictionary optimization
• reference vector representation
Map quality assessment:
– Compare with ‘ideal’ map (e.g. without optimizations)
– Identical initialization and learning parameters
– Compute sum of squared distances of location of each
document on both maps
Execution time - map size
Execution time - optimizations
Future research
Maps for joint term-citation model, taking into
account between-group link flow direction
Fully distributed map creation
Adaptive document retrieval and clustering:
– Bayesian network based relevance measure
– Survival models for document update rate estimation
– Dead link propagation methods for page freshness
estimation
We also intend to integrate Bayesian and immune
system methodologies with WebSOM in order to
achieve new clustering effects
Future research
Bayesian networks will be applied in
particular to:
– measure relevance and classify documents
– accelerate document clustering processes
– construct a thesaurus supporting query
enrichment
– keyword extraction
– between-topic dependencies estimation
Thank you!
Any
questions?