News Indexing-tree

國立雲林科技大學
National Yunlin University of Science and Technology
New Event Detection Based on
Indexing-tree and Named Entity
Advisor : Dr. Hsu
Presenter : Hsin-Yi Huang
Authors : Zhang Kuo, Li Juan Zi, Wu Gang
2007.SIGIR.8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
 Introduction
 Motivation
 Objective
 Basic New Event Detection (NED) Model
 News Indexing-tree
 Term reweighting approach
 Experiment
 Conclusion
 Comments
2007/8/15
Introduction
 Traditional New Event Detection (NED) System
1. Story Representation
(1) Preprocessing
(2) Term weight calculation
2. Similarity Calculation
3. Detection Procedure
(1) S-S type
(2) S-C type
 Output of the NED model
1. The decision
2. The confidence of the decision
[Figure: a news story stream enters the NED model; stories A to E are shown as term-weight vectors over example terms such as marriage, storm, explode, film, and diet.]
Motivation
 How can the detection procedure be sped up without
decreasing detection accuracy?
 How can cluster (topic) information be put to good use
to improve accuracy?
 How can a better news story representation be obtained
through a better understanding of named entities?
Objective
 Efficiency
 News Indexing-tree
 Accuracy
  Use cluster (topic) information
  Make use of named entities based on news
classification
Basic NED Model
 TF-IDF
(term frequency–inverse document frequency)
 Incremental TF-IDF
[Figure: timeline of time steps 1, 2, …, t-1, t]
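The incremental variant can be illustrated with a short sketch. The slide does not give the formula, so the weighting below (log-scaled term frequency times a smoothed inverse document frequency, then L2 normalisation) is an assumption chosen for illustration, and the class name `IncrementalTfidf` and its methods are hypothetical. What it shows is only the incremental idea: document frequencies are updated as stories arrive at each time step, and a story is weighted with the statistics available at that time.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Hypothetical sketch: TF-IDF whose document frequencies grow over time."""

    def __init__(self):
        self.num_docs = 0          # stories seen up to the current time step
        self.doc_freq = Counter()  # number of seen stories containing each term

    def update(self, batch):
        """Fold the stories arriving at the current time step into the statistics."""
        for tokens in batch:
            self.num_docs += 1
            self.doc_freq.update(set(tokens))  # count each term once per story

    def weights(self, tokens):
        """Return an L2-normalised term-weight dict for one story.

        Assumed formula (not taken from the slide):
        log(tf + 1) * log((N + 1) / (df + 0.5)).
        """
        tf = Counter(tokens)
        raw = {
            term: math.log(count + 1.0)
            * math.log((self.num_docs + 1.0) / (self.doc_freq[term] + 0.5))
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        return {t: v / norm for t, v in raw.items()} if norm > 0 else raw
```

A typical use would call `update` with each new batch of stories and then `weights` on each story in that batch, so that rare terms seen so far receive larger weights than terms that already occur in most stories.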
Basic NED Model (cont.)
 Similarity Calculation
 Detection Procedure
[Figure: flowchart of the detection procedure; a Yes/No decision labels each incoming story as an old story or a new story.]
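The two steps on this slide can be sketched together. The code below assumes cosine similarity between term-weight vectors and a fixed threshold on the highest similarity to any earlier story; the slide does not state either choice, so both are assumptions. The function names, the default threshold of 0.3, and the use of one minus the best similarity as a confidence score are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect(story, previous_stories, threshold=0.3):
    """Return (is_new, confidence) for one incoming story.

    Assumption: the story reports a new event when its best match among
    the previous stories falls below `threshold` (a hypothetical value).
    """
    best = max((cosine(story, old) for old in previous_stories), default=0.0)
    return best < threshold, 1.0 - best
```

This corresponds to a story-to-story comparison; a story-to-cluster variant would compare against one representative vector per cluster instead of every earlier story.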
News Indexing-tree
Term reweighting approach
 Based on Distribution Distance
Term reweighting approach (cont.)
 Based on Term Type and Story Class
Term reweighting approach (cont.)
Experiment
 Datasets
  TDT2 (news stories from January to June 1998)
  TDT3 (English stories from October to December 1998)
 Evaluation Metric
[Table: configurations of System-1 to System-8, described by term weight calculation, similarity calculation, and detection procedure (S-S type or S-C type). Components named in the table: incremental TF-IDF, Hellinger distance, term distributions, Indexing-tree, Term Type and Story Class, the other NED systems.]
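Hellinger distance is listed among the system components. For reference, here is a minimal sketch of its standard definition for two discrete distributions over terms. How the compared systems turn this distance into a similarity score is not stated on the slide, and `to_distribution` is a hypothetical helper added only to make the example self-contained.

```python
import math


def to_distribution(weights):
    """Hypothetical helper: scale non-negative term weights to sum to 1."""
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total > 0 else {}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts.

    H(p, q) = sqrt( 0.5 * sum_t (sqrt(p_t) - sqrt(q_t))^2 ), which lies in [0, 1].
    """
    terms = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2 for t in terms
    )
    return math.sqrt(squared / 2.0)
```

Identical distributions give a distance of 0, and distributions with no terms in common give a distance of 1, so one simple way to obtain a similarity would be to subtract the distance from 1.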
Experiment (cont.)
Conclusion
 The news indexing-tree reduces comparison time without hurting NED
accuracy.
 The two extensions contribute to improved
accuracy.
 Future work
  Collect a news set spanning a longer period
from the Internet, and integrate time information into the NED
task.
  Refine cluster granularity to the event level, and identify
different events and their relations within a topic.
Comments
 Advantage
 More efficient
 More accurate
 Drawback
 Ambiguous signs
 Too many parameters
 Application
…