News Indexing-tree
國立雲林科技大學
National Yunlin University of Science and Technology
New Event Detection Based on
Indexing-tree and Named Entity
Advisor : Dr. Hsu
Presenter : Hsin-Yi Huang
Authors : Zhang Kuo, Li Juan Zi, Wu Gang
2007.SIGIR.8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Introduction
Motivation
Objective
Basic New Event Detection (NED) Model
News Indexing-tree
Term reweighting approach
Experiment
Conclusion
Comments
2007/8/15
Introduction
Traditional New Event Detection (NED) System
1. Story Representation
(1) Preprocessing
(2) Term weight calculation
2. Similarity Calculation
3. Detection Procedure
(1) S-S type
(2) S-C type
Output:
1. The decision
2. The confidence of the decision
[Figure: NED model. A news story stream (stories A to E) is represented by term weights over example terms: marriage, storm, explode, film, diet.]
Motivation
How can the detection procedure be sped up without
decreasing detection accuracy?
How can cluster (topic) information be used to
improve accuracy?
How can a better news story representation be obtained
through a better understanding of named entities?
Objective
Efficiency
News Indexing-tree
Accuracy
Using cluster (topic) information
Making use of named entities based on news
classification
Basic NED Model
TF-IDF
(term frequency–inverse document frequency)
Incremental TF-IDF
[Figure: timeline of time steps 1, 2, …, t-1, t]
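The incremental TF-IDF idea can be sketched as follows. The formula from the slide did not survive in this transcript, so the weighting below is an assumption: a commonly used variant that combines log-scaled term frequency with a smoothed inverse document frequency and L2 normalization. The class name `IncrementalTfIdf` and its method names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


class IncrementalTfIdf:
    """Keeps document frequencies up to date as new stories arrive."""

    def __init__(self):
        self.num_docs = 0          # N_t: number of stories seen up to time t
        self.doc_freq = Counter()  # df_t(w): number of stories containing term w

    def update(self, batch):
        """Add a batch of tokenized stories: df_t(w) = df_{t-1}(w) + df_batch(w)."""
        for tokens in batch:
            self.num_docs += 1
            self.doc_freq.update(set(tokens))

    def weights(self, tokens):
        """Return L2-normalized TF-IDF weights for one tokenized story.

        Assumed weighting (not taken from the slide):
        log(tf + 1) * log((N_t + 1) / (df_t + 0.5)).
        """
        tf = Counter(tokens)
        raw = {
            w: math.log(tf[w] + 1)
            * math.log((self.num_docs + 1) / (self.doc_freq[w] + 0.5))
            for w in tf
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        return {w: v / norm for w, v in raw.items()} if norm > 0 else raw


# Usage: statistics are updated first, then the newest story is weighted.
model = IncrementalTfIdf()
model.update([["storm", "film"], ["storm", "diet"]])
story_weights = model.weights(["storm", "film"])
```

Because document frequencies only use stories seen so far, a term that is rare up to time t (here "film") receives a higher weight than one that appears in every story ("storm").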
Basic NED Model (cont.)
Similarity Calculation
Detection Procedure
[Figure: detection flowchart. A Yes/No decision labels each incoming story as an old story or a new story.]
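The similarity calculation and detection procedure can be sketched as below. The slide's formulas are not preserved here, so this is a minimal sketch under stated assumptions: similarity is taken to be cosine similarity over sparse term-weight vectors, detection is assumed to be the story-to-story (S-S) kind, and the threshold of 0.5 is a hypothetical value. The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(story, previous_stories, threshold=0.5):
    """Compare a story with every earlier story (assumed S-S detection).

    Returns (is_new, confidence). The story is flagged as new when its
    highest similarity to any earlier story is below the threshold;
    confidence is 1 minus that highest similarity.
    The threshold default is a hypothetical value.
    """
    best = max(
        (cosine_similarity(story, old) for old in previous_stories),
        default=0.0,
    )
    return best < threshold, 1.0 - best


# Usage: the first call sees a repeated story, the second an unrelated one.
seen = [{"storm": 1.0}]
old_result = detect({"storm": 1.0}, seen)
new_result = detect({"film": 1.0}, seen)
```

This returns both outputs named in the introduction: the decision and the confidence of the decision. Comparing against every earlier story is linear in the number of stories, which is the cost the indexing-tree is meant to reduce.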
News Indexing-tree
Term reweighting approach
Based on Distribution Distance
Term reweighting approach (cont.)
Based on Term Type and Story Class
Term reweighting approach (cont.)
Experiment
Datasets
TDT2 (news stories from January to June 1998)
TDT3 (English stories from October to December 1998)
Evaluation Metric
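The metric itself is not preserved in this transcript. Assuming it is the normalized detection cost conventionally used in TDT evaluations, it can be computed as in the sketch below. The default constants (miss cost 1, false alarm cost 0.1, target prior 0.02) are the usual TDT settings and are an assumption here, as is the function name.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized TDT-style detection cost; lower is better.

    p_miss: fraction of new-event stories the system failed to flag.
    p_fa:   fraction of old-event stories wrongly flagged as new.
    The default costs and prior are assumed, not taken from the slide.
    """
    p_non_target = 1.0 - p_target
    cost = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    # Divide by the cost of the better trivial system
    # (one that answers "new" always or "old" always).
    return cost / min(c_miss * p_target, c_fa * p_non_target)
```

Under these assumptions a system that misses every new event and raises no false alarms scores 1.0, so a useful system should score below 1.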
[Table: configuration of the compared systems (System-1 to System-8). Columns: Term Weight Calculation, Similarity Calculation, Detection Procedure. Entries include S-S type, S-C type, incremental TF-IDF, Hellinger distance, term distributions, Indexing-tree, Term Type and Story Class, and the other NED systems.]
Experiment (cont.)
Conclusion
The news indexing-tree reduces comparison time without hurting NED
accuracy.
The two extensions contribute to improved
accuracy.
Future work
Collect a news set spanning a longer period
from the Internet, and integrate time information into the NED
task.
Refine cluster granularity to the event level, and identify
different events and their relations within a topic.
Comments
Advantage
More efficient
More accurate
Drawback
Ambiguous symbols
Too many parameters
Application
…