An On-line Document Clustering Method Based on Forgetting Factors

Yoshiharu Ishikawa, Yibing Chen, Hiroyuki Kitagawa
University of Tsukuba, Japan
ECDL 2001, Sept. 7, 2001
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Background
- The Internet has enabled on-line document delivery services
  - newsfeed services over the network
  - periodically issued on-line journals
- Important technologies (and applications) for on-line documents
  - information filtering
  - document summarization, information extraction
  - topic detection and tracking (TDT)
- Clustering works as a core technique for these applications
Our Objectives (1)
- Development of an on-line clustering method which considers the novelty of each document
- Presents a snapshot of clusters in an up-to-date manner
- Example: articles from a sports news feed
  [Figure: cluster snapshots as time passes from τ to τ + Δτ: Formula 1 & M. Schumacher, U.S. Open Tennis, Soccer World Cup, and other articles]
Our Objectives (2)
- Development of a novelty-based clustering method for on-line documents
- Features:
  - It weights newer documents more highly than older ones and forgets obsolete ones
    - introduction of a new document similarity measure that considers the novelty and obsolescence of documents
  - Incremental clustering processing
    - low processing cost to generate a new clustering result
  - Automatic maintenance of target documents
    - obsolete documents are automatically deleted from the clustering target
Incremental Clustering Process (1)
- When t = 0 (initial state):
  1. arrival of new documents
  2. store new documents in the repository
  3. calculate and store statistics
  4. cluster documents and present the result
  [Figure: the clustering module groups the documents stored at t = 0 into Cluster 1 ... Cluster k]
Incremental Clustering Process (2)
- When t = 1:
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. cluster documents and present the result
  [Figure: the clustering module re-forms Cluster 1 ... Cluster k over the documents stored at t = 0 and t = 1]
Incremental Clustering Process (3)
- When t = τ + Δτ (a control-flow sketch follows below):
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. cluster documents and present the result
  [Figure: the clustering module maintains Cluster 1 ... Cluster k over the repository, which holds documents from t = τ through t = τ + Δτ; old documents are deleted]
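To make the control flow concrete, here is a minimal Python sketch of one update period. The helper functions are hypothetical placeholders for the steps detailed on later slides; only the overall flow follows the figure above.

    def update_statistics(stats, new_docs, tau):
        pass  # decay weights and fold in new documents (update-processing slides)

    def expire(repository, stats, tau, eps):
        pass  # drop documents whose weight fell below eps (expiration slides)

    def cluster(repository, stats):
        return []  # select seeds by seed power, assign the rest (C2ICM slides)

    def run_period(repository, stats, new_docs, tau, eps):
        repository.extend(new_docs)              # 1-2. receive and store
        update_statistics(stats, new_docs, tau)  # 3. update statistics
        expire(repository, stats, tau, eps)      # 4. delete old documents
        return cluster(repository, stats)        # 5. cluster and present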
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
C2ICM Clustering Method
- Cover-Coefficient-based Incremental Clustering Methodology
- Proposed by F. Can (ACM TOIS, 1993) [3]
- An incremental clustering method with low update cost
- A seed-based clustering method
  - Based on the concept of seed powers
  - Seed powers are defined probabilistically
  - Documents with the highest seed powers are selected as cluster seeds
Decoupling/Coupling Coefficients
- Two important notions in the C2ICM method
  - used to calculate seed powers
- Decoupling coefficient of document di:
    $\delta_i = \Pr(d_i \mid d_i)$
  - the probability that document di itself is obtained when document di is given
  - an index to measure the independence of di
- Coupling coefficient of document di:
    $\psi_i = 1 - \delta_i$
  - an index to measure the dependence of di
Seed Power
- Seed power spi for document di measures the appropriateness (moderate dependence) of di as a cluster seed (computed in the sketch below):
    $sp_i = \delta_i \cdot \psi_i \cdot \sum_{j=1}^{n} freq(d_i, t_j) \cdot \delta'_j \cdot \psi'_j$
  - freq(di, tj): the occurrence frequency of term tj within document di
  - $\delta'_j$: decoupling coefficient for term tj, $\delta'_j = \Pr(t_j \mid t_j)$
  - $\psi'_j$: coupling coefficient for term tj, $\psi'_j = 1 - \delta'_j$
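As an illustration, the following Python snippet computes these coefficients and seed powers from a toy document-term frequency matrix, assuming uniform document probabilities as in Can's original C2ICM (the forgetting-factor weighting comes later); the matrix values are invented.

    # Toy document-term frequency matrix (3 documents, 4 terms).
    freq = [
        [2, 1, 0, 0],  # d1
        [1, 2, 1, 0],  # d2
        [0, 0, 1, 2],  # d3
    ]
    m, n = len(freq), len(freq[0])
    row = [sum(freq[i]) for i in range(m)]                       # per-document totals
    col = [sum(freq[i][k] for i in range(m)) for k in range(n)]  # per-term totals

    # delta_i = Pr(d_i | d_i) = sum_k Pr(t_k | d_i) * Pr(d_i | t_k)
    delta = [sum((freq[i][k] / row[i]) * (freq[i][k] / col[k])
                 for k in range(n) if freq[i][k]) for i in range(m)]
    psi = [1 - d for d in delta]                 # coupling coefficients

    # Term-side coefficients: the same computation with rows and columns swapped.
    delta_t = [sum((freq[i][k] / col[k]) * (freq[i][k] / row[i])
                   for i in range(m) if freq[i][k]) for k in range(n)]
    psi_t = [1 - d for d in delta_t]

    # sp_i = delta_i * psi_i * sum_j freq(d_i, t_j) * delta'_j * psi'_j
    sp = [delta[i] * psi[i] *
          sum(freq[i][k] * delta_t[k] * psi_t[k] for k in range(n))
          for i in range(m)]
    print(sp)  # the documents with the highest sp become cluster seeds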
C2ICM Clustering Algorithm (1)
- Initial phase:
  1. Select seeds based on the seed powers
  2. Other documents are assigned to the cluster with the most similar seed
  [Figure: example clusters; red: F1 & Schumacher, green: U.S. Open Tennis]
C2ICM Clustering Algorithm (2)
- Incremental update phase:
  1. Select new seeds based on the seed powers
  2. Other documents are assigned to the cluster with the most similar seed
  [Figure: clusters as time passes from τ to τ + Δτ; red: F1 & Schumacher, green: U.S. Open Tennis, orange: Soccer World Cup]
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
  - C2ICM Clustering Method
  - F2ICM Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
F2ICM Clustering Method
- An extension of the C2ICM method
- Main differences:
  - Introduction of a new document similarity measure based on the notion of the forgetting factor: it weights newer documents more highly when generating clusters
  - Incremental maintenance of statistics
  - Automatic deletion of obsolete old documents
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Document Similarity Based on Forgetting Factor
- A new document similarity measure based on the document forgetting model
  - Assumption: each delivered document gradually loses its value (weight) as time passes
  - Derivation of a document similarity measure based on this assumption
    - put high weights on new documents and low weights on old ones → old documents have little effect on clustering
- Using the derived document similarity measure, we can achieve novelty-based clustering
Document Forgetting Model (1)
- Ti: acquisition time of document di
- The information value (weight) of di at the current time τ is defined as
    $dw_i|_\tau = \lambda^{\tau - T_i}$
- The document weight exponentially decreases as time passes
- λ (0 < λ < 1) determines the forgetting speed of document di
  [Figure: dwi decays exponentially from 1 at the acquisition time Ti as the current time τ advances]
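A few lines of Python make the decay concrete (the value of λ here is only an example; the parameter-setting slide later derives it from a half-life):

    # Exponential forgetting: dw_i at time tau is lam ** (tau - T_i).
    lam = 0.5 ** (1 / 7)   # example value: half-life of 7 time units

    def doc_weight(T_i, tau):
        """Weight of a document acquired at time T_i, observed at time tau."""
        return lam ** (tau - T_i)

    print(doc_weight(0, 0))    # 1.0    (just arrived)
    print(doc_weight(0, 7))    # ~0.5   (one half-life later)
    print(doc_weight(0, 30))   # ~0.05  (nearly forgotten)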
Document Forgetting Model (2)
- Why do we use the exponential forgetting model?
  - It inherits the ideas from the behavioral law of human memory
    - The Power Law of Forgetting [1]: human memory decays exponentially as time passes
  - Relationship with citation analysis:
    - Obsolescence (aging) of citations can be measured by measuring citation rates
    - Some simple obsolescence models take exponential forms
  - Efficiency: based on the model, we can obtain an efficient statistics maintenance procedure
  - Simplicity: we can control the forgetting speed with the single parameter λ
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Our Approach for Document Similarity Derivation
- Probabilistic derivation based on the document forgetting model
- Let Pr(di, dj) be the probability of selecting the document pair (di, dj) from the document repository
- We regard the co-occurrence probability Pr(di, dj) as their similarity sim(di, dj)
  [Figure: a pair of documents di, dj is drawn from the repository with probability Pr(di, dj)]
Derivation of Similarity Formula (1)
- tdw: total weight of all the m documents
  - a simple summation of all document weights:
    $tdw = \sum_{l=1}^{m} dw_l$, where $dw_l|_\tau = \lambda^{\tau - T_l}$
- Pr(di): subjective probability of selecting document di from the repository:
    $\Pr(d_i) = \dfrac{dw_i}{tdw}$
  - Since old documents have small document weights, their selection probabilities are small
Derivation of Similarity Formula (2)
- Pr(tk | di): selection probability of term tk from document di:
    $\Pr(t_k \mid d_i) = \dfrac{freq(d_i, t_k)}{\sum_{l=1}^{n} freq(d_i, t_l)}$
  - freq(di, tk): the number of occurrences of tk in di
  - this probability corresponds to the term frequency:
    $tf(d_i, t_k) \equiv \Pr(t_k \mid d_i)$
Derivation of Similarity Formula (3)
- Pr(tk): occurrence probability of term tk:
    $\Pr(t_k) = \sum_{i=1}^{m} \Pr(t_k \mid d_i) \cdot \Pr(d_i)$
  - this probability corresponds to the document frequency of term tk:
    $df(t_k) \equiv \Pr(t_k)$
  - the reciprocal of df(tk) represents the IDF (inverse document frequency):
    $idf(t_k) = \dfrac{1}{df(t_k)}$
Derivation of Similarity Formula (4)
- Using Bayes' theorem,
    $\Pr(d_j \mid t_k) = \Pr(d_j) \cdot tf(d_j, t_k) \cdot idf(t_k)$
- Then we get
    $\Pr(d_j \mid d_i) = \sum_{k=1}^{n} \Pr(d_j \mid t_k) \cdot \Pr(t_k \mid d_i) = \Pr(d_j) \sum_{k=1}^{n} tf(d_i, t_k) \cdot tf(d_j, t_k) \cdot idf(t_k)$
Derivation of Similarity Formula (5)
- Therefore, the co-occurrence probability of di and dj is (see the sketch below):
    $\Pr(d_i, d_j) = \Pr(d_i) \Pr(d_j) \sum_{k=1}^{n} tf(d_i, t_k) \cdot tf(d_j, t_k) \cdot idf(t_k)$
  - the inner product of document vectors based on TF-IDF weighting
  - old documents have low similarity scores
- The older a document di becomes, the smaller its similarity scores with other documents are
  - because old documents have low Pr(di) values
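The following Python sketch evaluates this formula on invented toy data (three documents with already-normalized tf vectors); it is only meant to show how the forgetting weights flow into the similarity:

    lam = 0.5 ** (1 / 7)                # forgetting factor (example value)
    T = [0, 5, 9]                       # acquisition times of d1..d3
    tau = 10                            # current time
    tf = [                              # tf(di, tk), normalized per document
        [0.5, 0.5, 0.0],
        [0.2, 0.4, 0.4],
        [0.0, 0.5, 0.5],
    ]

    dw = [lam ** (tau - t) for t in T]  # document weights
    tdw = sum(dw)
    pr_d = [w / tdw for w in dw]        # Pr(di) = dw_i / tdw

    n = len(tf[0])
    # df(tk) = sum_i Pr(tk | di) * Pr(di); idf(tk) is its reciprocal
    df = [sum(tf[i][k] * pr_d[i] for i in range(len(tf))) for k in range(n)]

    def sim(i, j):
        """Pr(di, dj): TF-IDF inner product scaled by Pr(di) * Pr(dj)."""
        return pr_d[i] * pr_d[j] * sum(
            tf[i][k] * tf[j][k] / df[k] for k in range(n) if df[k] > 0)

    print(sim(0, 1), sim(1, 2))  # the newer pair (d2, d3) scores higher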
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
  - Document forgetting model
  - Derivation of document similarity measure
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Updating Statistics and Probabilities
- When t = τ + Δτ:
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. present new clustering result
  [Figure: the clustering module maintains Cluster 1 ... Cluster k over the repository, which holds documents from t = τ through t = τ + Δτ]
Approach to Update Processing (1)
- In every incremental clustering step, we have to calculate document similarities
  - To compute similarities, we need to calculate document statistics and probabilities beforehand
  - It is inefficient to compute the statistics from scratch every time
- Incremental update processing
  - Store the calculated statistics and probabilities and utilize them in later computations
Approach to Update Processing (2)
- Formulation
  - d1, ..., dm: a document set consisting of m documents
  - t1, ..., tn: the index terms that appear in d1, ..., dm
  - t = τ: the latest update time of the document set
- Assumptions
  - when t = τ + Δτ, new documents dm+1, ..., dm+m' are appended to the document set
  - the new documents dm+1, ..., dm+m' introduce additional terms tn+1, ..., tn+n'
  - m >> m' and n >> n' are satisfied
Update Processing Method (1)
- Update of document weight dwi
  - Since Δτ time units have passed since the previous update time t = τ, the weight of each existing document decays by the factor λ^Δτ
  - Each new document is assigned the initial weight 1:
    $dw_i|_{\tau+\Delta\tau} = \begin{cases} \lambda^{\tau+\Delta\tau-T_i} = \lambda^{\Delta\tau} \cdot dw_i|_\tau & (1 \le i \le m) \\ 1 & (m+1 \le i \le m+m') \end{cases}$
Update Processing Method (2)
- Example of incremental update processing: updating from tdw|τ to tdw|τ+Δτ
- Naive approach: compute tdw|τ+Δτ from scratch:
    $tdw|_{\tau+\Delta\tau} = \sum_{l=1}^{m+m'} dw_l|_{\tau+\Delta\tau} = dw_1|_{\tau+\Delta\tau} + \cdots + dw_{m+m'}|_{\tau+\Delta\tau} = \lambda^{\tau+\Delta\tau-T_1} + \cdots + \lambda^{\tau+\Delta\tau-T_{m+m'}}$
  - time consuming!
Update Processing Method (3)
- Smart approach: compute tdw|τ+Δτ incrementally (compared against the naive approach in the sketch below):
    $tdw|_{\tau+\Delta\tau} = \sum_{l=1}^{m+m'} \lambda^{\tau+\Delta\tau-T_l} = \sum_{l=1}^{m} \lambda^{\tau+\Delta\tau-T_l} + \sum_{l=m+1}^{m+m'} \lambda^{\tau+\Delta\tau-T_l} = \lambda^{\Delta\tau} \sum_{l=1}^{m} \lambda^{\tau-T_l} + m' = \lambda^{\Delta\tau} \cdot tdw|_\tau + m'$
  - each new document has Tl = τ + Δτ, so it contributes weight λ^0 = 1, which yields the final term m'
  - exponential weighting enables efficient incremental computation
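A small Python check (with invented acquisition times) confirms that the one-step update agrees with the from-scratch recomputation:

    lam, dtau = 0.5 ** (1 / 7), 1.0
    T_old = [0.0, 3.0, 6.0]            # acquisition times of existing docs
    tau = 9.0                          # previous update time
    tdw_tau = sum(lam ** (tau - t) for t in T_old)   # stored total at tau

    # Naive: recompute over all documents at tau + dtau.
    m_new = 2                          # two new docs arrive at tau + dtau
    T_all = T_old + [tau + dtau] * m_new
    naive = sum(lam ** (tau + dtau - t) for t in T_all)

    # Incremental: decay the stored total once, then add the m' new
    # documents, each entering with weight 1.
    incremental = lam ** dtau * tdw_tau + m_new

    print(abs(naive - incremental) < 1e-12)  # True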
Update Processing Method (4)
- The occurrence probability Pr(di) of each document can then be easily recalculated:
    $\Pr(d_i)|_{\tau+\Delta\tau} = \dfrac{dw_i|_{\tau+\Delta\tau}}{tdw|_{\tau+\Delta\tau}}$, where $tdw|_{\tau+\Delta\tau} = \lambda^{\Delta\tau} \cdot tdw|_\tau + m'$
- We need to calculate term frequencies tf(di, tk) only for the new documents dm+1, ..., dm+m'
Update Processing Method (5)
- Update formulas for the document frequency df(tk) of each term
  - we expand the formula of df(tk) as follows, then store each $\widetilde{df}(t_k)$ persistently:
    $df(t_k)|_\tau = \dfrac{1}{tdw|_\tau} \cdot \widetilde{df}(t_k)|_\tau$, where $\widetilde{df}(t_k)|_\tau = \sum_{i=1}^{m} dw_i|_\tau \cdot tf(d_i, t_k)$
  - $\widetilde{df}(t_k)$ can be incrementally updated using the formula (sketched in code below):
    $\widetilde{df}(t_k)|_{\tau+\Delta\tau} = \lambda^{\Delta\tau} \cdot \widetilde{df}(t_k)|_\tau + tfsum(t_k)$
  - since each new document enters with weight 1, $tfsum(t_k) = \sum_{i=m+1}^{m+m'} tf(d_i, t_k)$
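In code, the per-term update is one decay plus one sum over the new documents (toy values; the term names are invented):

    # df~(tk)|{tau+dtau} = lam**dtau * df~(tk)|tau + tfsum(tk)
    lam, dtau = 0.5 ** (1 / 7), 1.0

    df_tilde = {"nato": 0.8, "russia": 0.5}   # stored statistics from time tau
    new_tf = [                                 # tf vectors of the new documents
        {"nato": 0.3, "bosnia": 0.7},
        {"russia": 0.5, "bosnia": 0.5},
    ]

    for tk in set(df_tilde) | {t for d in new_tf for t in d}:
        tfsum = sum(d.get(tk, 0.0) for d in new_tf)
        df_tilde[tk] = lam ** dtau * df_tilde.get(tk, 0.0) + tfsum

    print(df_tilde)  # df(tk) itself is recovered as df~(tk) / tdw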
Update Processing Method (6)
- Calculation of the new decoupling coefficient δi is easy:
    $\delta_i = \sum_{k=1}^{n+n'} \dfrac{tf(d_i, t_k)^2}{\widetilde{df}(t_k)|_{\tau+\Delta\tau}}$
- Update formulas for the term decoupling coefficients δ'j
  - incremental update is also possible
  - details are shown in the paper
Summary of Update Processing
- The following statistics are maintained persistently (m: no. of documents, n: no. of terms):
  - dwi: weight of document di (1 ≤ i ≤ m)
  - tdw: total weight of documents
  - freq(di, tk): term occurrence frequency (1 ≤ i ≤ m, 1 ≤ k ≤ n)
  - docleni: document length (1 ≤ i ≤ m)
  - $\widetilde{df}(t_k)$: statistics to compute df(tk) (1 ≤ k ≤ n)
  - $\widetilde{\delta}_i$: statistics to compute δi (1 ≤ i ≤ m)
- Incremental statistics update cost
  - O(m + m'n) ≈ O(m + n) with storage cost O(m + n): linear cost
  - cf. the naive (non-incremental) method costs O(mn)
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Expiration of Old Documents (1)
- When t = τ + Δτ:
  1. arrival of new documents
  2. store new documents in the repository
  3. update statistics
  4. delete old documents
  5. present new clustering result
  [Figure: the same update cycle as before; step 4 removes the oldest documents from the repository]
Expiration of Old Documents (2)
- Two reasons to delete old documents:
  - reduction of storage space
  - old documents have only a tiny effect on the resulting clustering structure
- Our approach (sketched in code below):
  - If dwi < ε (ε is a small parameter constant) is satisfied, delete document di
  - When we delete di, the related statistics values are also deleted
    - e.g., freq(di, tk)
    - details are in the proceedings and [6]
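A minimal sketch of the deletion test, assuming a dictionary-based store and the ε = λ^γ setting from the next slide (document ids and times are invented):

    lam = 0.5 ** (1 / 7)
    eps = lam ** 30                     # epsilon = lam**gamma (next slide)
    tau = 40.0                          # current time

    docs = {"d1": 0.0, "d2": 5.0, "d3": 25.0}   # doc id -> acquisition time
    freq = {"d1": {"nato": 2}, "d2": {"russia": 1}, "d3": {"bosnia": 3}}

    for doc_id in [d for d, T in docs.items() if lam ** (tau - T) < eps]:
        del docs[doc_id]                # remove the document itself
        del freq[doc_id]                # ...and its related statistics
    print(sorted(docs))                 # ['d3']: only the recent doc survives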
Parameter Setting Methods
- F2ICM uses two parameters in its algorithms:
  - forgetting factor λ (0 < λ < 1): specifies the forgetting speed
  - expiration parameter ε (0 < ε < 1): threshold value for document deletion
- We use the following metaphors (computed in the snippet below):
  - β: half-life span of the value of a document
    - $\lambda^\beta = 1/2$ is satisfied, namely $\lambda = \exp(-\log 2 / \beta)$
  - γ: life span of a document
    - ε is determined by $\epsilon = \lambda^\gamma$
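For the values used later in the experiments (β = 7, γ = 30), the parameters work out as follows:

    import math

    beta, gamma = 7, 30
    lam = math.exp(-math.log(2) / beta)   # lam**beta == 1/2
    eps = lam ** gamma                    # weight threshold for deletion

    print(lam)          # ~0.9057
    print(lam ** beta)  # ~0.5  (half-life check)
    print(eps)          # ~0.0513 (a document expires after gamma = 30 units)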
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Dataset and Parameter Settings
- Dataset: Mainichi Daily Newspaper articles
  - Each article consists of the following information:
    - issue date
    - subject area (e.g., economy, sports)
    - keyword list (50-150 words in Japanese)
  - The articles used in the experiment:
    - issue date: January 1994 to February 1994
    - subject area: international affairs
- Parameter settings
  - nc (no. of clusters) = 10
  - β (half-life span) = 7: the value of an article is reduced to 1/2 in one week
  - γ (life span) = 30: every document is deleted after 30 days
Computational Cost for Clustering Sequences
- Plot of the CPU time and response time for the clustering performed every day
- Costs increase linearly until the 30th day, then become almost constant
  [Plot: CPU time and response time (sec, 0-60) versus days (1-51)]
Overview of Clustering Result (1)
- Summary of the 10 clusters after 30 days (on January 31, 1994):

  No. | Subject
  1   | East Europe, NATO, Russia, Ukraine
  2   | Clinton (Whitewater/politics), military issues (Korea/Myanmar/Mexico/Indonesia)
  3   | China (import and export/U.S.)
  4   | U.S. politics (economic sanctions and Vietnam/elections)
  5   | Clinton (Syria/South East issue/visiting Europe), Europe (France/Italy/Switzerland)
  6   | South Africa (ANC/human rights), East Europe (Bosnia-Herzegovina, Croatia), Russia (Zhirinovsky/ruble/Ukraine)
  7   | Russia (economy/Moscow/U.S.), North Korea (IAEA/nuclear)
  8   | China (Patriot missiles/South Korea/Russia/Taiwan/economics)
  9   | Mexico (indigenous peoples/riot), Israel
  10  | South East Asia (Indonesia/Cambodia/Thailand), China (Taiwan/France), South Korea (politics)
Overview of Clustering Result (2)
- Summary of the 10 clusters after 57 days (on March 1, 1994):

  No. | Subject
  1   | Bosnia-Herzegovina (NATO/PKO/UN/Serbia), China (diplomacy)
  2   | U.S. issues (Japan/economy/New Zealand/Bosnia/Washington)
  3   | Myanmar, Russia, Mexico
  4   | Bosnia-Herzegovina (Sarajevo/Serbia), U.S. (North Korea/economy/military)
  5   | North Korea (IAEA/U.S./nuclear)
  6   | Middle East (Hebron/random shooting/PLO), Myanmar, Bosnia-Herzegovina
  7   | U.S. (society/crime/North Korea/IAEA)
  8   | U.N. (PKO/Bosnia-Herzegovina/EU), China
  9   | Bosnia-Herzegovina (U.N./PKO/Sarajevo), Russia (Moscow/Serbia)
  10  | Sarajevo (Bosnia-Herzegovina), China (Taiwan/Tibet)
Summary of the Experiment
- Brief observations
  - F2ICM groups similar articles into a cluster as long as an appropriate seed is selected
  - But a cluster obtained in the experiment usually contains multiple topics, and different clusters contain similar topics: the clusters are not well separated
- Reasons for the observed phenomena:
  - The selected seeds are not well separated in topics: a more sophisticated seed selection method is required
  - The number of keywords per article is rather small (50-150 words)
Outline
- Background and Objectives
- F2ICM Incremental Document Clustering Method
- Document Similarity Based on Forgetting Factor
- Updating Statistics and Probabilities
- Document Expiration and Parameter Setting
- Experimental Results
- Conclusions and Future Work
Conclusions and Future Work
- Conclusions
  - Development of an on-line clustering method which considers the novelty of documents
    - Introduction of the document forgetting model
    - F2ICM: Forgetting Factor-based Incremental Clustering Method
    - Incremental statistics update method (linear update cost)
    - Automatic document expiration and parameter setting methods
    - Preliminary report of the experiments
- Current and Future Work
  - Revision of the clustering algorithms based on the Scatter/Gather approach [4]
  - More detailed experiments and their evaluation
  - Development of automatic parameter tuning methods