N.Y.U.S.T.
I. M.
An information-pattern-based approach to novelty detection
Presenter: Lin, Shu-Han
Authors: Xiaoyan Li, W. Bruce Croft
Information Processing and Management (2008)
Intelligent Database Systems Lab
Outline
Motivation
Objective
Definition
Observation
Methodology
Experiments
Conclusion
Personal Comments
Motivation - specific topic
It is very difficult for traditional word-based approaches to separate
the two non-relevant sentences (3 and 4) from the two relevant
sentences (1 and 2).
The two non-relevant sentences are very likely to be identified as
novel because they contain many new words that do not appear in
previous sentences.
Motivation - general topic
It is very difficult for traditional word-based approaches to separate
the non-relevant sentence (2) from the relevant sentence (1).
Objectives
To attack the hard problem above:
To provide a new and more explicit definition of novelty. Novelty is defined as new
answers to the potential questions representing a user’s request or information need.
To propose a new concept in novelty detection – query-related information patterns. Very
effective information patterns for novelty detection at the sentence level have been
identified.
To propose a unified pattern-based approach that includes the following three steps: query
analysis, relevant sentence detection and new pattern detection. The unified approach
works for both specific topics and general topics.
Definition - Information Patterns
Information patterns of specific topics
Table. Word patterns for the five types of NE (named entity) questions
Information patterns of general topics
Opinion patterns and opinion sentences
Table. Examples of opinion patterns
Event patterns and event sentences
Observation – information patterns
Sentence lengths
Table. Statistics of sentence lengths
Relevant sentences on average have more words than non-relevant sentences.
Novel sentences on average have slightly more words than relevant sentences.
Opinion patterns
Table. Statistics on opinion patterns for 22 opinion topics (2003)
There are relatively more opinion sentences in relevant (and novel) sentences than in non-relevant sentences.
The percentage of opinion sentences is slightly higher among novel sentences than among
relevant sentences.
Observation – information patterns (Cont.)
NE (named entity) combinations
PLD types (PERSON, LOCATION, DATE) are more effective in separating relevant and non-relevant sentences.
POLD types (PERSON, ORGANIZATION, LOCATION, DATE) will be used in new pattern detection; NEs of the ORGANIZATION type may provide different sources of new information.
NEs of the PLD types play a more important role in event topics than in opinion topics.
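The use of POLD entities in new pattern detection can be illustrated with a minimal sketch. It assumes the sentences have already been tagged by an external named-entity tagger (not shown), and the names `POLD_TYPES` and `extract_novel` are hypothetical; the slides do not give the authors' actual procedure, so this only shows the general idea of flagging a sentence when it introduces a previously unseen entity.

```python
# Entity types named on the slide; the tag strings themselves are an assumption.
POLD_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE"}

def extract_novel(tagged_sentences, types=POLD_TYPES):
    """Return indices of sentences that mention an unseen entity of the given types.

    `tagged_sentences` is a list with one entry per sentence, each entry being
    a list of (text, type) pairs produced by some named-entity tagger.
    """
    seen = set()
    novel = []
    for index, entities in enumerate(tagged_sentences):
        # Keep only entities of the types of interest, case-normalized.
        current = {(text.lower(), etype) for text, etype in entities if etype in types}
        if current - seen:          # at least one entity not seen before
            novel.append(index)
        seen |= current             # history covers every earlier sentence
    return novel
```

Passing `types={"PERSON", "LOCATION", "DATE"}` restricts the check to the PLD combination instead.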
Methodology
Fig. ip-BAND: a unified information-pattern-based approach to novelty detection.
Methodology (Cont.)
(1) Query analysis and question formulation
How many (2)
Where (3)
Methodology (Cont.)
(2) Using patterns in relevance re-ranking
Ranking with TFISF (term frequency–inverse sentence frequency) models
TFISF with information patterns
Sentence lengths
Named entities
Opinion patterns
Opinion patterns
(3) Novel sentence extraction
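Step (2) can be sketched roughly as follows. The slides do not show the scoring formulas, so the TFISF form used here (query term frequency times sentence term frequency times squared inverse sentence frequency) is an assumed variant, and the pattern adjustment with its weights `w_len`, `w_ne` and `w_op` is purely hypothetical. Tokenization by lowercase whitespace split is also an assumption.

```python
import math
from collections import Counter

def tfisf_scores(query, sentences):
    """Score each sentence against the query with an assumed TFISF weighting.

    score(s) = sum over query terms t of tf_s(t) * tf_q(t) * isf(t)**2,
    isf(t) = log((N + 1) / (0.5 + sf(t))), where N is the number of
    sentences and sf(t) is the number of sentences containing t.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    sf = Counter()
    for tokens in tokenized:
        sf.update(set(tokens))      # count each term once per sentence
    q_tf = Counter(query.lower().split())
    scores = []
    for tokens in tokenized:
        s_tf = Counter(tokens)
        score = 0.0
        for term, q_count in q_tf.items():
            isf = math.log((n + 1) / (0.5 + sf[term]))
            score += s_tf[term] * q_count * isf ** 2
        scores.append(score)
    return scores

def adjust_with_patterns(scores, sentences, has_entity, has_opinion,
                         w_len=0.1, w_ne=0.1, w_op=0.1):
    """Boost base scores with three pattern signals (hypothetical weights).

    Longer sentences, sentences with named entities, and sentences with
    opinion patterns receive a larger multiplicative boost.
    """
    lengths = [len(s.split()) for s in sentences]
    avg_len = sum(lengths) / len(lengths)
    adjusted = []
    for score, length, ne, op in zip(scores, lengths, has_entity, has_opinion):
        boost = 1.0 + w_len * (length / avg_len) + w_ne * ne + w_op * op
        adjusted.append(score * boost)
    return adjusted
```

Sorting sentences by the adjusted score in descending order gives the re-ranked list that novel sentence extraction would then consume.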
Experiments
Baseline approaches
B-NN: initial retrieval ranking
B-NW: new word detection
B-NWT: new word detection with a threshold
B-MMR: Maximal Marginal Relevance (MMR)
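The two new-word baselines can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization and that the word history covers every earlier sentence; the function name `new_word_novelty` and the `threshold` parameter are hypothetical, and the preprocessing used in the paper (for example stopword removal or stemming) is not stated on the slide.

```python
def new_word_novelty(sentences, threshold=1):
    """Flag each sentence as novel if it has at least `threshold` unseen words.

    threshold=1 mimics plain new word detection (B-NW);
    a larger value mimics new word detection with a threshold (B-NWT).
    """
    seen = set()
    flags = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        new_words = words - seen    # words absent from all earlier sentences
        flags.append(len(new_words) >= threshold)
        seen |= words
    return flags
```

Because the decision depends only on unseen words, a non-relevant sentence with unfamiliar vocabulary is flagged as novel, which is the weakness described in the motivation slides.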
Experiments
Performance for specific topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 8 specific topics (queries) from TREC 2002
3.4 of 15 novel sentences
Table. Performance of novelty detection for 15 specific topics (queries) from TREC 2003
10.1 of 15 novel sentences
Table. Performance of novelty detection for 11 specific topics (queries) from TREC 2004
4.6 of 15 novel sentences
Note:
Data marked * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked ** pass at the 90% level.
Chg%: Improvement over the first (B-NN) baseline in %.
Experiments
Performance for general topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 41 general topics (queries) from TREC 2002
3.2 of 15 novel sentences
Table. Performance of novelty detection for 35 general topics (queries) from TREC 2003
7.5 of 15 novel sentences
Table. Performance of novelty detection for 39 general topics (queries) from TREC 2004
3.4 of 15 novel sentences
Note:
Data marked * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked ** pass at the 90% level.
Chg%: Improvement over the first (B-NN) baseline in %.
Experiments
Comparison among specific, general and all topics at top 15 ranks
Table. Comparison among specific, general and all topics at top 15 ranks
Note:
Chg%: Improvement over the first baseline in percentage;
Nvl#: Number of true novel sentences;
Rdd#: Number of relevant but redundant sentences;
NRl#: Number of non-relevant sentences.
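The quantities in the note can be related with a small sketch. It assumes that the three counts together cover all sentences at the top ranks, so that precision is the novel share of that total, and that Chg% is the relative difference from the baseline value; the function names are hypothetical.

```python
def precision_at_top(nvl, rdd, nrl):
    """Fraction of top-ranked sentences that are truly novel.

    nvl, rdd, nrl correspond to Nvl#, Rdd# and NRl# in the note above.
    """
    total = nvl + rdd + nrl
    return nvl / total if total else 0.0

def change_percent(value, baseline):
    """Chg%: relative improvement of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0
```

For example, `precision_at_top(3, 6, 6)` divides 3 novel sentences by the 15 sentences at the top ranks; these numbers are illustrative only and are not taken from the tables.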
Conclusions
Novelty means new answers to the potential questions representing a
user’s request or information need.
The proposed ip-BAND outperforms all baselines for both specific topics
and general topics, and it performs better on specific topics than on general topics.
It is impossible to collect complete novelty judgments in reality.
Baseline selection and evaluation measure by human assessors
Misjudgment of relevance and/or novelty by human assessors and disagreement of
judgments between the human assessors
Limitation and accuracy of question formulations
Novelty detection precision will be low since some non-relevant sentences may be treated
as novel.
Personal Comments
Advantage
…
Drawback
…
Application
…