國立雲林科技大學
National Yunlin University of Science and Technology
A text mining approach for automatic construction of hypertexts
Advisor: Dr. Hsu
Graduate student: Kuo-min Wang
Authors: Hsin-Chang Yang, Chung-Hong Lee
Expert Systems with Applications, 2005
Outline
Motivation
Objective
Introduction
System Overview
Related Work
Text mining by self-organizing maps
Automatic hypertext construction
Experimental results
Conclusions
Personal Opinion
Motivation

Traditionally, hyperlinks have been constructed by the creators of the web pages, with or without the help of authoring tools.
Thus an automatic hypertext construction method is necessary for content providers to efficiently produce adequate information.
Objective

Propose a new automatic hypertext construction
method based on a text mining approach
Introduction


The use of hypertext provides a feasible mechanism for retrieving related documents.
A hypertext contains a number of navigational hyperlinks that point to related hypertexts or to other locations within the same hypertext.
Introduction (cont.)


To transform a flat text into a hypertext, we need to decide where to insert hyperlinks in the text.
(Figure: a hyperlink inserted at a position within the flat text.)
Introduction (cont.)

Structural links: made available to navigate an information space by its structure, and not necessarily specifically related to the content of the page they are found on.
The other kinds are referential links and associative links.
The critical points in creating associative hyperlinks are twofold:
1. Find the source text to be linked.
2. Find the documents which are semantically relevant to these sources.
Introduction (cont.)


Our method applies the self-organizing map (SOM) algorithm to cluster the flat text documents of a training corpus and generate two maps.
These maps are then used to identify the sources and destinations of important hyperlinks within the training documents.
System overview
(Figure: system overview in four steps; step 1 transforms the texts into index terms, e.g. 大陸 (Mainland China), 天津 (Tianjin), 外交部 (Ministry of Foreign Affairs).)
Related Work

Hypertext construction methodologies and systems
Salton et al. and Allan (1992): compute the similarity between fragments of documents in order to identify content links.
Agosti and Crestani (1993): establish a concept model that consists of three levels (concept level, index term level, and document level) and devise a five-step process to construct the links.
McClellan (1995): combines formal grammar and document indexing techniques to convert semi-structured text to hypertext.
Related Work (cont.)

Shin, Nam, and Kim (1997): combine statistical similarity and semantic similarity to create good hypertext.
Green (1997, 2000): analyzes lexical chains in a text based on WordNet.
Hyperlink analysis
Mehler (1999): uses the concept of lexical cohesion for text linkage; if two texts comprise semantically similar words, those texts are valid candidates for text linkage.
Text mining by self-organizing maps

Document encoding and preprocessing
1. Keep only the nouns.
2. Ignore terms that appear extremely few times.
3. Use a manually constructed stop list to filter out less important words.
4. Use a binary vector scheme to encode the documents, ignoring any kind of term weighting scheme (a minimal encoding sketch follows below).
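The encoding in steps 1-4 can be sketched as follows. This is a hedged Python illustration, not the authors' implementation: it assumes the documents have already been segmented and reduced to nouns, and the frequency threshold, stop-list handling, and function names are assumptions standing in for the paper's actual Chinese preprocessing pipeline.

```python
# Minimal sketch of the preprocessing/encoding steps above (illustrative only).
from collections import Counter

def build_binary_vectors(documents, stop_list, min_count=2):
    """documents: list of term lists (nouns only); returns (vocabulary, vectors)."""
    counts = Counter(term for doc in documents for term in doc)
    # Steps 2 and 3: drop very rare terms and stop-list words
    vocabulary = sorted(t for t, c in counts.items()
                        if c >= min_count and t not in stop_list)
    index = {t: n for n, t in enumerate(vocabulary)}
    # Step 4: binary encoding, no term weighting
    vectors = []
    for doc in documents:
        v = [0] * len(vocabulary)
        for term in doc:
            if term in index:
                v[index[term]] = 1
        vectors.append(v)
    return vocabulary, vectors
```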
Text mining by self-organizing maps

Document clustering using SOM
Document vectors: $\mathbf{x}_i = \{x_{in} \mid 1 \le n \le N\},\ 1 \le i \le M$
Neuron weight vectors: $\mathbf{w}_j = \{w_{jn} \mid 1 \le n \le N\},\ 1 \le j \le J$
(Figure: an input vector $\mathbf{x}_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ mapped to its winner node $\mathbf{m}_j = \{m_{j1}, m_{j2}, \ldots, m_{jN}\}$ on the map.)
SOM algorithm (a minimal sketch follows these steps):
1. Randomly select a training vector $\mathbf{x}_i$.
2. Find the best-matching unit (BMU): $\|\mathbf{x}_i - \mathbf{w}_j\| = \min_{1 \le k \le J} \|\mathbf{x}_i - \mathbf{w}_k\|$.
3. Update the weights of the winner and its neighboring neurons: $\mathbf{w}_l^{new} = \mathbf{w}_l^{old} + \alpha(t)(\mathbf{x}_i - \mathbf{w}_l^{old})$.
4. Repeat steps 1-3 until all training vectors have been selected.
5. Increase the epoch $t$, and decrease $\alpha(t)$ and the neighborhood size.
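The five steps above can be sketched as a training loop in Python/NumPy. This is a generic SOM sketch under stated assumptions (grid size, learning-rate schedule, and neighborhood shape are illustrative), not the paper's exact configuration; the input is assumed to be the binary document vectors from the previous slide as a NumPy array.

```python
# Hedged sketch of the SOM training steps above (illustrative only).
import numpy as np

def train_som(X, grid_rows=8, grid_cols=8, epochs=100, alpha0=0.5, seed=0):
    """X: (M documents x N terms) matrix; returns the (J x N) neuron weight matrix."""
    M, N = X.shape
    J = grid_rows * grid_cols
    rng = np.random.default_rng(seed)
    W = rng.random((J, N))                              # neuron weight vectors w_j
    coords = np.array([(r, c) for r in range(grid_rows)
                       for c in range(grid_cols)], dtype=float)
    for t in range(epochs):
        alpha = alpha0 * (1.0 - t / epochs)             # step 5: decreasing alpha(t)
        radius = max(grid_rows, grid_cols) / 2.0 * (1.0 - t / epochs)  # shrinking neighborhood
        for i in rng.permutation(M):                    # step 1: pick training vectors
            bmu = int(np.argmin(np.linalg.norm(X[i] - W, axis=1)))     # step 2: BMU
            near = np.linalg.norm(coords - coords[bmu], axis=1) <= radius
            W[near] += alpha * (X[i] - W[near])         # step 3: update rule
    return W
```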
Text mining by self-organizing maps

Document clustering using SOM
(Figure: the two resulting maps, the word cluster map (WCM) and the document cluster map (DCM).)
Text mining by self-organizing maps

Mining document and word associations

Word labeling process: for the weight vector of the $j$th neuron, $\mathbf{w}_j$, if its $n$th component exceeds a predetermined threshold, the corresponding word of that component is labeled to this neuron.
Discover the themes of a set of related documents by linking the DCM and the WCM according to the neuron locations (a sketch of both labeling steps follows below).
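Both labeling steps can be sketched roughly as follows. The threshold value and the rule of labeling each document to its best-matching neuron are assumptions made for illustration, consistent with but not taken verbatim from the paper; the trained weights and vocabulary are reused from the earlier sketches.

```python
# Hedged sketch of the word and document labeling processes (illustrative only).
import numpy as np

def build_wcm(W, vocabulary, threshold=0.5):
    """Word cluster map: neuron j -> words whose weight component exceeds the threshold."""
    return {j: [vocabulary[n] for n in range(len(vocabulary)) if W[j, n] > threshold]
            for j in range(W.shape[0])}

def build_dcm(X, W):
    """Document cluster map: neuron j -> documents whose best-matching unit is j."""
    dcm = {j: [] for j in range(W.shape[0])}
    for i in range(X.shape[0]):
        bmu = int(np.argmin(np.linalg.norm(X[i] - W, axis=1)))
        dcm[bmu].append(i)
    return dcm

# Linking by neuron location: the documents in dcm[j] are associated with the
# words in wcm[j], which exposes the document-word co-occurrence patterns.
```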
Automatic hypertext construction

Finding sources

Two kinds of words

Words that are themes of other documents but not of this document (inter-cluster hyperlinks). For instance, a user reading a document about music may see the word oboe and want to learn more about it.
Words that are themes of this document (intra-cluster hyperlinks). These link documents that are related to this document for referential purposes; such documents share a common theme and provide good references for the users.
Automatic hypertext construction

Obtain the sources of these two kinds of hyperlinks: $k_i \in W_c$ (intra-cluster) and $k_i \in W_m$ for $m \neq c$ (inter-cluster).

Introduce a spanning factor to limit the number of hyperlinks.
Generate the ranking by averaging the total weights of all word clusters:
$$w_{ic} = \frac{1}{|C_i|} \sum_{m \in C_i} w_{im}$$
The top-ranked word clusters give the inter-cluster sources, and the top-ranked words of the document's own cluster give the intra-cluster sources; the spanning factor limits how many of each are selected.
(Figure: document $D_j$ and word cluster $W_c$ on the WCM.)
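One way to read this selection step in code is sketched below. It reuses the WCM and weight matrix from the earlier sketches; beta_inter and beta_intra are hypothetical names for the spanning-factor limits, and the exact ranking procedure in the paper may differ.

```python
# Hedged sketch of hyperlink-source selection (interpretation of the slide, not the paper's code).
import numpy as np

def find_sources(doc_terms, doc_neuron, W, wcm, term_index,
                 beta_inter=3, beta_intra=2):
    """doc_terms: set of index terms occurring in the document;
    doc_neuron: neuron index c that the document is labeled to."""
    w = W[doc_neuron]
    # Rank the other word clusters by their average weight for this neuron
    scores = {}
    for m, words in wcm.items():
        cols = [term_index[t] for t in words if t in term_index]
        if m != doc_neuron and cols:
            scores[m] = float(np.mean(w[cols]))          # average of w_im over the cluster
    top_clusters = sorted(scores, key=scores.get, reverse=True)[:beta_inter]
    # Inter-cluster sources: words of this document that belong to a top-ranked cluster
    inter = [t for m in top_clusters for t in wcm[m] if t in doc_terms]
    # Intra-cluster sources: highest-weighted words of the document's own cluster
    own = [t for t in wcm[doc_neuron] if t in doc_terms and t in term_index]
    intra = sorted(own, key=lambda t: w[term_index[t]], reverse=True)[:beta_intra]
    return inter, intra
```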
Automatic hypertext construction

Finding destinations
(1) $D_j$ and $D_{k_i}$ belong to different document clusters.
(2) $c$ is the neuron index of the word cluster associated with $D_j$; $k_i \in W_m$, $m \neq c$, and $w_{im} = \max_{l \neq c,\ 1 \le l \le M} w_{il}$.
(3) The distance between $D_{k_i}$ and $D_j$ is minimal, i.e. $\|\mathbf{x}_{k_i} - \mathbf{x}_j\| = \min_{1 \le l \le M} \|\mathbf{x}_l - \mathbf{x}_j\|$ (a code sketch of these criteria follows below).
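A rough sketch of how the three criteria might be applied in code. Restricting candidate destinations to the document cluster of neuron $m$, and the helper names (bmu_of, word_col), are assumptions for illustration; the paper's exact procedure may differ.

```python
# Hedged reading of destination criteria (1)-(3) above (illustrative only).
import numpy as np

def find_destination(j, word_col, c, X, bmu_of, W):
    """j: index of source document D_j; word_col: vector component of source word k_i;
    c: neuron index associated with D_j; bmu_of[i]: neuron that document i is labeled to."""
    # (2) the word cluster m != c in which the word's weight is maximal
    weights = W[:, word_col].copy()
    weights[c] = -np.inf
    m = int(np.argmax(weights))
    # (1) candidate destinations lie in a different document cluster (here, cluster m)
    candidates = [i for i in range(X.shape[0]) if bmu_of[i] == m and i != j]
    if not candidates:
        return None
    # (3) pick the candidate whose encoded vector is closest to D_j
    dists = [float(np.linalg.norm(X[i] - X[j])) for i in candidates]
    return candidates[int(np.argmin(dists))]
```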
Experimental results


The test corpus contains 3268 news articles posted by the CNA.
To reduce the size of the vocabulary:
1. Discard words that occur only once in a document.
2. Discard words in a manually constructed stop list, which contains 5259 Chinese words.
Experimental results (cont.)
$$L(D_j) = \frac{\left|\{k_i \mid k_i \in S_j\}\right|}{\left|\{k_i \mid x^j_i = 1\}\right|}$$
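Read literally, this ratio counts the index terms of $D_j$ that fall in $S_j$ over all index terms present in $D_j$ ($x^j_i = 1$). Under that hedged interpretation, where $S_j$ is taken to be the set of selected hyperlink sources for $D_j$, the measure can be computed as:

```python
# Hedged sketch of the ratio above; the interpretation of S_j and the
# function name are assumptions, not the paper's definition verbatim.
def linking_ratio(doc_vector, source_columns):
    present = {n for n, v in enumerate(doc_vector) if v == 1}
    return len(present & set(source_columns)) / len(present) if present else 0.0
```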
Conclusions


We devised a novel method for automatic hypertext construction.
Experiments show that the text mining approach successfully reveals the co-occurrence patterns, and that the devised hypertext construction process effectively constructs semantic hyperlinks.
Personal opinions


The approach combines many existing methods into a single applicable process.
……