Textual Information Clustering
and Visualization for Knowledge
Discovery and Management
Xavier Polanco
URI-INIST-CNRS
Introduction
• We are concerned with the design and
development of computer-based
information analysis tools
• Cluster analysis, computational linguistics
and artificial intelligence techniques are
combined
2
On the technology side
• A computer-based information analysis
system is an integrated environment that
assists a user in carrying out the complex
process of converting information from
textual data sources into knowledge
3
Information Analysis System
[Diagram labels]
• Inputs: French or English text-data (dataset or corpus); lexicons or terminological resources
• Functions: bibliometric statistics; term extraction and indexation; clustering and mapping
• Components: SDOC, MIRIAD, ILC, HENOCH, NEURODOC, DBMS-R, WWW server
• Platforms: Mac, PC, WS
4
[Diagram labels: Home Pages, Intranet, Extranet]
5
Plan
• Text Mining
• Cluster Analysis
• Visualization or Mapping
• Knowledge Discovery
• Knowledge Management
6
Textual Information
• A large amount of information is available in
textual form in databases and online sources
• At this scale, manual analysis and
effective extraction of useful information
are not feasible
• It is relevant to provide automatic tools for
analyzing large textual collections
7
Text Mining
• Text mining consists of extracting information
from hidden patterns in large text-data
collections
• The results can be important both:
– for the analysis of the collection, and
– for providing intelligent navigation and browsing
methods
8
Process
• The text mining process can be organized
roughly into five major steps:
1. Data Selection
2. Term Extraction and Filtering
3. Data Clustering and Classification
4. Mapping or Visualization
5. Result Interpretation
• Iterative and interactive process
9
Natural Language Processing
• Experience shows that a linguistic
engineering approach ensures higher
performance of the data mining algorithms
• Part-of-speech tagging and
lemmatization are generally accepted tasks
10
The approach
• Our approach to text mining is based on
extracting meaningful terms from
documents
• In this presentation, the focus is on the term
extraction process, and
• on the need to organize the
extracted terms in a taxonomy
11
The main tasks
• Term extraction or acquisition
• Indexation
• Human control and screening
– Indexing quality control
– Index screening → clustering phase
12
Language Engineering
[Diagram: Lexicons and Text-DB → Natural Language Engineering System → Indexed Corpus]
• Lexicons: management and linguistic processing
• Texts: part-of-speech tagging, lemmatization, and indexation
13
Variation
• Normal form: Resistance gene
– Syntactic variation: Resistance methylase gene
– Syntactic variation: Resistance and susceptibility gene
– Syntactic variation: Gene of the antibiotic resistance
• Normal form: Rare species
– Morpho-syntactic variation: Rarely encountered enterococcus species
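As a rough illustration of how variants can be mapped back to a normal form, here is a minimal sketch. The crude suffix stripping and the `is_variant` helper are assumptions made for this example; they stand in for the part-of-speech tagging and lemmatization of a real linguistic engineering system and are not that system's actual method.

```python
def crude_stem(word):
    """Toy normalization (assumption): lowercase, then strip a final 'ly' or 's'."""
    w = word.lower()
    for suffix in ("ly", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def is_variant(normal_form, candidate):
    """True if every stemmed word of the normal form occurs in the candidate."""
    needed = {crude_stem(w) for w in normal_form.split()}
    found = {crude_stem(w) for w in candidate.split()}
    return needed <= found


# (normal form, variant) pairs taken from the slide.
examples = [
    ("Resistance gene", "Resistance methylase gene"),
    ("Resistance gene", "Resistance and susceptibility gene"),
    ("Resistance gene", "Gene of the antibiotic resistance"),
    ("Rare species", "Rarely encountered enterococcus species"),
]
for normal_form, candidate in examples:
    print(normal_form, "<-", candidate, is_variant(normal_form, candidate))
```

Word order is ignored on purpose, so the permuted form "Gene of the antibiotic resistance" is still matched; a real system would also check the syntactic pattern to avoid false matches.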
14
Taxonomy
• A taxonomic structure should improve text mining
• Considering the clustering techniques that might
be used in text mining, one should keep in mind that
more taxonomic classifying capabilities could be
incorporated into text mining
• A taxonomic classifying capability might also
facilitate cluster interpretation by giving the user
some kind of rules
15
Clustering
• Clustering is a descriptive task where one
seeks to identify a finite set of categories
• Clustering is used to segment a database
into subsets or clusters
• Clustering means finding the clusters
themselves from a given set of data
16
Clustering Process
[Diagram: Lexicons and Text-DB → Natural Language Engineering System → Indexed Corpus → D(n,p) → Clustering Algorithm → C(m,p)]
• The clustering algorithm relies on:
– Similarity Measures: s(x,y)
– Dissimilarity Measures: d(x,y)
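The slide leaves s(x,y) and d(x,y) unspecified. As one possible instance, the sketch below computes the Jaccard similarity for binary keyword vectors and the cosine similarity for frequency vectors, with d = 1 - s as the matching dissimilarity. The choice of these two measures and the function names are assumptions, not necessarily the measures the system uses.

```python
import math


def jaccard_similarity(x, y):
    """s(x,y) for binary vectors: shared keywords / keywords present in either."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0


def cosine_similarity(x, y):
    """s(x,y) for frequency vectors: cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def dissimilarity(similarity):
    """d(x,y) derived from a similarity that lies in [0, 1]."""
    return 1.0 - similarity


# Two rows of a document-by-keyword matrix D(n,p) with p = 6 (illustrative values).
x = [1, 0, 1, 0, 1, 1]
y = [1, 0, 0, 1, 0, 1]
print(jaccard_similarity(x, y))
print(dissimilarity(jaccard_similarity(x, y)))
print(cosine_similarity(x, y))
```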
17
Documents × Keywords
Di × KWj = {1,0}
Di × KWj = {1, 2, …, n}
   KW1 KW2 KW3 KW4 KW5 KW6
D1  1   0   1   0   1   1
D2  1   0   1   0   1   1
D3  0   1   0   1   0   0
D4  1   0   0   1   0   1
C1 = ({D1,D2},{KW1,KW3,KW5,KW6})
C2 = ({D4},{KW1,KW4,KW6})
C3 = ({D3},{KW2,KW4})
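The three clusters of this example can be reproduced with a very simple sequential (single-pass) procedure. This is only a sketch: the Jaccard similarity, the 0.5 threshold, and the leader-based assignment are assumptions chosen for illustration, and the slides do not say which algorithm produced C1 to C3.

```python
# Document-by-keyword table from the slide, stored as keyword sets.
documents = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}


def jaccard(a, b):
    """Similarity of two keyword sets: shared keywords / all keywords."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def sequential_clustering(docs, threshold=0.5):
    """Single pass: join the first cluster whose leader is similar enough,
    otherwise start a new cluster. The threshold is an assumed parameter."""
    clusters = []  # each cluster: (leader keyword set, list of document names)
    for name, keywords in docs.items():
        for leader, members in clusters:
            if jaccard(leader, keywords) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((keywords, [name]))
    return clusters


for leader, members in sequential_clustering(documents):
    # Describe each cluster by the union of its members' keywords.
    keywords = set().union(*(documents[m] for m in members))
    print(members, sorted(keywords))
```

The clusters come out in document order ({D1,D2}, then {D3}, then {D4}), so the last two appear in the opposite order to C2 and C3 on the slide. Lowering the threshold to 0.4 would merge D4 into the first cluster, which shows how sensitive such a procedure is to its parameter.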
18
Clustering Algorithms
• Major families of clustering methods:
• Sequential algorithms
• Hierarchical algorithms
– Agglomerative algorithms
– Divisive algorithms
• Fuzzy clustering algorithms
19
Information Analysis Process
• The text-data information analysis is
divided into two phases:
1. Cluster generation
2. Map display of clusters
• A hypertext user interface enables the
analyst to explore and interpret results
20
Example
Antibiotic Resistance
[Diagram: Data (2 DB, 4025 documents, 1998-1999; Medicine, Molecular Biology) → 30 Clusters → Map → Hypertext]
21
Information Visualization
• Definition: The use of computer-supported,
interactive, visual representation of abstract data to
amplify the acquisition or use of knowledge (Card
et al., 1999)
• Visual artifacts aid human thought
• The progress of civilization can be read in the
invention of visual artifacts, from writing to
mathematics, to maps, to diagrams, to visual
computing
22
Process
• Raw Data → Data Tables
• Data Tables → Clustering
• Clustering → Visual Structures: Map
• Visual Structures → Views
23
Visual Structures
• Data Tables are mapped to Visual Structures,
which augment a spatial substrate with marks and
graphical properties to encode information
• A Graphic Representation is said to be expressive
if all and only the data in the Data Table are also
represented in the Visual Structure
• A Graphic Representation is said to be more
effective if it is faster to interpret
24
Map Display
• We are concerned with the map display of the
clusters
• A problem of particular interest is how to
visualize a data set with many variables:
1. Multivariate data are clustered, and
2. Clusters are mapped
25
Mapping tools
• For mapping, we use the following
techniques:
– Density and Centrality Diagrams
– Principal Component Analysis (PCA)
– Multi-Layer Perceptrons (MLP)
– Self-Organizing Maps (SOM)
– Multi-SOMs
26
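Multi-Layer Perceptron 1
[Figure: network with inputs x1 … xi … xp and outputs s1 … sk … sp; weight matrices Wc(p,2) with entries Wcij and Ws(2,p) with entries Wsjk; error criterion ISE = ||s - x||²]
[Map labels: prion proteins, human disease, spongiform encephalopathy, mankind, scrapie, CJD]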
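The labels Wc(p,2), Ws(2,p) and ISE = ||s - x||² suggest a network that compresses a p-dimensional input x into two hidden values, usable as map coordinates, and reconstructs it as s. The sketch below computes one forward pass and the squared error under the assumption of linear units and hand-picked weights; the activation functions and the training procedure are not given in the slides, so all of this is hypothetical.

```python
def matvec_t(weights, vector):
    """Multiply the transpose of `weights` (one row per input) by `vector`."""
    n_out = len(weights[0])
    return [sum(weights[i][j] * vector[i] for i in range(len(vector)))
            for j in range(n_out)]


def forward(x, wc, ws):
    """x (p values) -> hidden h (2 values) -> reconstruction s (p values)."""
    h = matvec_t(wc, x)   # Wc has shape (p, 2)
    s = matvec_t(ws, h)   # Ws has shape (2, p)
    return h, s


def ise(s, x):
    """Squared error ||s - x||^2 between reconstruction and input."""
    return sum((si - xi) ** 2 for si, xi in zip(s, x))


# Hypothetical weights for p = 3: keep the first two inputs, drop the third.
wc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ws = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [1.0, 2.0, 3.0]
h, s = forward(x, wc, ws)
print(h, s, ise(s, x))
```

Here the two hidden values h would be the coordinates of x on the map, and the error measures how much information the two-dimensional projection loses.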
27
Multi-Layer Perceptron 2
[Figure: C(m,p) feeds an Input Layer (x1 … xp), followed by a First Hidden Layer, a Second Hidden Layer (Cartography) with a polarizer node, and an Output Layer (y1 … yp)]
[Map labels: protein, infection, resistance, Agrobacterium, plasmids]
28
Multi-SOM Platform
[Diagram labels]
• Raw Data (DB)
• Processing System: Pre-processing, SOMPACK, Post-processing
• MAPS
• MULTISOM Java Application
• Graphic-Hypertext User Interface
29
Multi-Self-Organizing Map
Display
Maps associated with 5 viewpoints:
• Map 1 → Plants
• Map 2 → Plant Parts
• Map 3 → Pathogen Agents
• Map 4 → Genetic Techniques
• Map 5 → Patenting Firms
[Figure: the five maps, with the Rice area activated through the inter-map communication mechanism]
30
Knowledge Discovery
• KD is informally defined as the extraction
of useful knowledge from databases or large
amounts of data
• One of the most important research topics in
KD is rule discovery or extraction
• The discovered knowledge is usually
expressed in the form of "if-then" rules
31
Association Rules
• Association rules can be seen as one of the
key tasks of KDD
• The intuitive meaning of an association rule
X → Y, where X and Y are keywords or
descriptors, is: “a document set containing
keyword X is likely to also contain keyword Y”
32
Example
• In a given food-industry corpus:
• “98% of the documents that deal with
apple juice relate it to the
chromatography analytical technique”
• X → Y: “apple juice → chromatography”
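A figure such as the 98% above is usually called the confidence of the rule: the share of documents containing X that also contain Y. The following minimal sketch computes it, together with the support of the keyword pair. The four-document corpus and its keywords are made up for the illustration and do not come from the food-industry corpus mentioned on the slide.

```python
def support(corpus, keywords):
    """Fraction of documents that contain all the given keywords."""
    return sum(1 for doc in corpus if keywords <= doc) / len(corpus)


def confidence(corpus, x, y):
    """Fraction of the documents containing x that also contain y."""
    with_x = [doc for doc in corpus if x in doc]
    if not with_x:
        return 0.0
    return sum(1 for doc in with_x if y in doc) / len(with_x)


# Hypothetical corpus: each document is represented by its set of keywords.
corpus = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography", "pesticides"},
    {"apple juice", "fermentation"},
    {"wine", "chromatography"},
]
print(support(corpus, {"apple juice", "chromatography"}))
print(confidence(corpus, "apple juice", "chromatography"))
```

In this toy corpus the rule "apple juice → chromatography" holds for two of the three documents about apple juice, so it would be reported with a confidence of about 67%.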
33
The Galois Lattice
• Our current research includes an approach
based on the lattice structure to discover
concepts and rules relating the objects
(documents) and their properties
(keywords)
• The Galois lattice approach is also known
as conceptual clustering
34
The concept lattice
Given the context (D1,T1) where
D1 = {d1,d2,d3,d4} & T1 = {t1,t2,t3,t4,t5,t6}
Table: The input relation R = documents × keywords
R  t1 t2 t3 t4 t5 t6
d1 1  0  1  0  1  1
d2 1  0  1  0  1  1
d3 0  1  0  1  0  0
d4 1  0  0  1  0  1
Hasse Diagram (top to bottom)
C1: (D1, Ø)
C2: ({d1,d2,d4},{t1,t6})   C3: ({d3,d4},{t4})
C4: ({d1,d2},{t1,t3,t5,t6})   C5: ({d4},{t1,t4,t6})   C6: ({d3},{t2,t4})
C7: (Ø, T1)
The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
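The seven concepts can be checked mechanically. A formal concept is a pair (documents, terms) in which the terms are exactly those shared by the documents and the documents are exactly those carrying all the terms. The sketch below enumerates them by brute force over all subsets of documents; this is an assumption made for readability, workable only for a tiny context like this one, and not the lattice construction algorithm used in the research.

```python
from itertools import combinations

# Input relation R from the slide: document -> set of keywords.
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
all_terms = set().union(*context.values())


def common_terms(docs):
    """Terms shared by every document in `docs` (all terms if `docs` is empty)."""
    terms = set(all_terms)
    for d in docs:
        terms &= context[d]
    return terms


def documents_with(terms):
    """Documents that carry every term in `terms`."""
    return {d for d, ts in context.items() if terms <= ts}


def formal_concepts():
    """Brute force: close every subset of documents into a concept."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset)
            extent = documents_with(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


for extent, intent in sorted(formal_concepts(), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```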
35
Association Rules Extraction
• The formal concept C4 yields the following
rules:
• R1: t3 → t1 ∧ t6
• R2: t5 → t1 ∧ t6
• R3: t3 ↔ t5
• Interpretation of R1 and R2: the use of term t3 or
t5 is always associated with that of terms t1 and t6
• Rule R3 expresses the mutual equivalence of the terms
{t3,t5}: all the documents that have the term t3 also have
the term t5, and vice versa.
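These rules are exact: they hold for every document of the context. A minimal sketch of that check follows, reusing the input relation of the concept lattice slide; the `holds` helper is a name introduced here for illustration.

```python
# Input relation R from the concept lattice slide: document -> set of terms.
context = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def holds(premise, conclusion):
    """True if every document containing all premise terms also contains
    all conclusion terms (an exact rule, with no counterexample)."""
    return all(conclusion <= terms
               for terms in context.values() if premise <= terms)


print(holds({"t3"}, {"t1", "t6"}))                       # R1: True
print(holds({"t5"}, {"t1", "t6"}))                       # R2: True
print(holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"}))   # R3: True
print(holds({"t1"}, {"t3"}))                             # False: d4 has t1 but not t3
```

The last line shows why R1 is a one-way rule: the inherited terms t1 and t6 also occur in d4, which lacks the own terms t3 and t5.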
36
Summary
Text Mining
Clustering
Mapping
Knowledge Discovery
37
Knowledge Management
• A knowledge management system is
concerned with the identification,
acquisition, development, diffusion, use,
and preservation of the enterprise’s
knowledge
38
KM Objectives
• Using advanced technology
• For facilitating creation, access, and reuse
of knowledge
• For converting knowledge from the sources
accessible to an organization and
connecting people with that knowledge
39
Project
• Adding to the information analysis
system a formalized operator for
processing together:
– The knowledge that is extracted from
databases
– The knowledge that the experts produce when
they analyze the clusters, maps, concepts and
rules
40
We have reached our last subject,
but not the end !
41
Xavier Polanco
42