Artificial Immune Systems: An Emerging Technology

Download Report

Transcript Artificial Immune Systems: An Emerging Technology

Document Maps
Slawomir Wierzchon , Mieczyslaw Klopotek
Michal Draminski Krzysztof Ciesielski
Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences
Warsaw
Research partially supported by the KBN research project 4 T11C
026 25 "Maps and intelligent navigation in WWW using
Bayesian networks and artificial immune systems"
Agenda
 Motivation
 What is a document map
 Map creation
 Clustering
 Experimental results
 Future directions
Motivation
 The Web as well as intranets become
increasingly content-rich: simple ranked
lists or even hierarchies of results seem
not to be adequate anymore
 A good way of presenting massive
document sets in an understandable
form will be crucial in the near future
Document map
 Many attempts have been made to visualize
sets of dicuments not just like a list, but rather
in two dimensions
 A document map is a mapping of a set of
documents to 2-D representing their interrelationships
Linear relationship presentation
(Internet Cartographer)
A relationship
 A link between hypertext documents
 Citation in the bibliography
 Content similarity
A tree of relations with central
subject (Inxight – Tree Studio )
Selforganizing map (WebSOM)
dissimilarity of grouops of
documents
Document frequency in clusters
A meta search engine map
Our approach – multiple
representations (BEATCA)
Map visualizations in 3D
(BEATCA)
Future research –
hypergeometric representation
(Fish-Eye eEffect)
Processing Flow Diagram - BEATCA
Search Engine
DB
REGISTRY
Spider
INTERNET Downloading HT-Base
Do Spid
wn er
loa
di n
Indexing +
Optimizing VEC-Base
Clustering
of docs
DocGR
-Base
Mapping
MAPBase
Clustering
of cells
CellGRBase
........
........
g
........
........
........
HT-Base
........
 The preparation of documents is done by an indexer, which
turns a document into a vector-space model representation
 Indexer also identifies frequent phrases in document set for
clustering and labelling purposes
 Subsequently, dictionary optimization is performed - extreme
entropy and extremely frequent terms excluded
 The map creator is applied, turning the vector-space
representation into a form appropriate for on-the-fly map
generation
 ‘The best’ (wrt some similarity measure) map is used by the
query processor in response to the user’s query
How are the maps created
 A modified WebSOM method is used:
– compact reference vectors representation
– broad-topic initialization method
– joint winner search method
– multi-level (hierarchical) maps
– multi-phase document clustering:
•
•
•
•
initial grouping to identify major topics
Initial document grouping
WEBSOM on document groups
fuzzy cell clusters extraction and labelling
My dog likes
Document model in search
this food
engines
 In the so-called
vector model a
document is
considered as a
vector in space
spanned by the
words it contains.
dog
food
When walking, I
take some food
walk
Document model in search
engines
 The relevance of
a document to a
query or to
another document
is measured as
cosine of angle
between the query
and the
document.
dog
food
Query: walk
walk
Reference vector representation
 Vectors are sparse by nature
 During learning process they become even
sparser
 Represented as a balanced red-black trees
 Tolerance threshold imposed
 Terms (dimensions) below threshold are removed
 Significant complexity reduction without
negative quality impact
Topic-sensitive initialization
 Inter-topic similarities important both for map
learning and visualization/cluster extraction
 Simple approach:
– Use LSI to select K main broad topics
– Select K map cells (evenly spread over the map) as the
fixpoints for individual topics
– Initialize selected fixpoints with broad topics
– Initialize remaining cells with „in-between values”
Clustering document vectors
r
x
m
Mocna zmiana
położenia (gruba
strzałka)
Document space
2D map
Important difference to general clustering: not only clusters
with similar documents, but also neighboring clusters similar
Joint winner search
 Global winner search: accurate but slow
 Local winner search: faster but can be inaccurate
during rapid changes
 Start with single phase of global search
 Document movements become more smooth
during learning process: usually local search is
enough
 Use global search when occassional sudden
moves occur (eg. outliers, neighbourhood width
decrease)
Hierarchical maps
 Bottom-up
approach
 Feasible (with joint
winner search
method)
 Start with most
detailed map
 Compute weighted
centroids of map
areas
 Use them as seeds
for coarser map
 Top-down approach
is possible but
requires fixpoints
Clustering document groups
 Numerous methods exists but none of them
directly applicable:
– Extremely fuzzy structure of topical groups in SOM
cells
– Neccesity of taking into account similiarity measures
both in original document space and in the map space
– Outlier-handling problem during cluster formation
– No a priori estimation of the number of topical groups
 Fuzzy C-MEANS on lattice of map cells applied
 Graph theoretical approach (density- and
distance- based MST) combined with fuzzy
clustering
 Clustered documents are labeled by weighted
centroids of cell reference vectors scaled with
between-group entropy
Experiments with map
convergence
 We examined the convergence of the maps
to a stable state depending on:
– type of alpha function (search radius
reduction)
– type of winner search method
– type of initialization method
Convergence – alpha functions
(linear versus reciprocal)
Convergence – winner search
(joint versus local)
Experiments with execution
time
 The impact of the following factors on the speed
of map creation was investigated:
– Map size (total number of cells)
– Optimization methods:
• dictionary optimization
• reference vector representation
 Map quality assessment:
– Compare with ‘ideal’ map (e.g. without optimizations)
– Identical initialization and learning parameters
– Compute sum of squared distances of location of each
document on both maps
Execution time - map size
Execution time - optimizations
Future research
 Maps for joint term-citation model, taking into
account between-group link flow direction
 Fully distributed map creation
 Adaptive document retrieval and clustering:
– Bayesian network based relevance measure
– Survival models for document update rate estimation
– Dead link propagation methods for page freshness
estimation
 We also intend to integrate Bayesian and immune
system methodologies with WebSOM in order to
achieve new clustering effects
Future research
 Bayesian networks will be applied in
particular to:
– measure relevance and classify documents
– accelerate document clustering processes
– construct a thesaurus supporting query
enrichment
– keyword extraction
– between-topic dependencies estimation
Thank you!
Any
questions?