Transcript Document

N.Y.U.S.T.
I. M.
An information-pattern-based approach
to novelty detection
Presenter : Lin, Shu-Han
Authors : Xiaoyan Li, W. Bruce Croft
Information Processing and Management (2008)
Intelligent Database Systems Lab
Outline

Motivation

Objective

Definition

Observation

Methodology

Experiments

Conclusion

Personal Comments
Motivation - specific topic

It is very difficult for traditional word-based approaches to separate
the two non-relevant sentences (3 & 4) from the two relevant
sentences (1 & 2).

The two non-relevant sentences are very likely to be identified as
novel because they contain many new words that do not appear in
previous sentences.
Motivation - general topic

It is very difficult for traditional word-based approaches to separate
the non-relevant sentence (2) from the relevant sentence (1).
Objectives

To attack the hard problems above:



To provide a new and more explicit definition of novelty. Novelty is defined as new
answers to the potential questions representing a user’s request or information need.
To propose a new concept in novelty detection – query-related information patterns. Very
effective information patterns for novelty detection at the sentence level have been
identified.
To propose a unified pattern-based approach that includes the following three steps: query
analysis, relevant sentence detection and new pattern detection. The unified approach
works for both specific topics and general topics.
Definition - Information Patterns

Information patterns of specific topics
Table. Word patterns for the five types of NE (Named Entity) questions

Information patterns of general topics

Opinion patterns and opinion sentences
Table. Examples of opinion patterns

Event patterns and event sentences
Observation – information patterns


Sentence lengths
Table. Statistics of sentence lengths

Relevant sentences on average have more words than non-relevant sentences.

Novel sentences on average have slightly more words than relevant sentences.
Opinion patterns
Table. Statistics on opinion patterns for 22 opinion topics (2003)


There are relatively more opinion sentences among relevant (and novel) sentences than among
non-relevant sentences.
The percentage of opinion sentences among novel sentences is slightly larger than among
relevant sentences.
Observation – information patterns (Cont.)

NE (Named Entity) combinations

PLD (PERSON, LOCATION, DATE) types
are more effective in separating
relevant and non-relevant
sentences.

POLD types (PERSON, ORGANIZATION,
LOCATION, DATE) will be used in
new pattern detection; NEs of
the ORGANIZATION type may
provide different sources of
new information.

NEs of the PLD types play a
more important role in event
topics than in opinion topics.
Methodology
Fig. ip-BAND: a unified information-pattern-based approach to novelty detection.
Methodology(Cont.)

(1) Query analysis and question formulation
Example question words formulated from the query (from the slide figure): "How many", "Where"
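Question formulation can be sketched as a mapping from cue words in the topic to (question word, expected NE type) templates. The mappings below are illustrative assumptions, not the paper's actual rules:

```python
# Illustrative only: map cue words in a topic/query to NE-question
# templates (these specific mappings are assumptions, not the
# authors' rule set).
QUESTION_TEMPLATES = {
    "casualties": ("How many", "NUMBER"),
    "toll": ("How many", "NUMBER"),
    "site": ("Where", "LOCATION"),
    "location": ("Where", "LOCATION"),
}

def formulate_questions(query_words):
    """Turn query words into (question word, expected NE type) pairs."""
    return [QUESTION_TEMPLATES[w] for w in query_words
            if w in QUESTION_TEMPLATES]
```

Each formulated question tells the later detection steps which NE type would constitute an answer, and hence what counts as new information.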
Methodology(Cont.)


(2) Using patterns in relevance re-ranking

Ranking with TFISF (term frequency - inverse sentence frequency) models

TFISF with information patterns

Sentence lengths

Named Entities

Opinion patterns
(3) Novel sentence extraction
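The TFISF ranking step can be sketched as follows, assuming TFISF is computed analogously to TF-IDF with sentences playing the role of documents; the function name and tokenization are illustrative, not the authors' implementation:

```python
import math
from collections import Counter

def tfisf_scores(query_terms, sentences):
    """Score each sentence against the query with a TF.ISF weighting:
    term frequency within the sentence times the log inverse of the
    fraction of sentences containing the term."""
    n = len(sentences)
    tokenized = [s.lower().split() for s in sentences]
    # Number of sentences containing each term (sentence frequency).
    sf = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(tf[t] * math.log(n / sf[t])
                    for t in query_terms if sf.get(t))
        scores.append(score)
    return scores
```

Sentences are then re-ranked by these scores before the pattern-based adjustments (sentence length, named entities, opinion patterns) are applied.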
Experiments

Baseline approaches

B-NN: initial retrieval ranking

B-NW: new word detection

B-NWT: new word detection with a threshold

B-MMR: Maximal Marginal Relevance (MMR)
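The new-word baselines (B-NW and B-NWT) can be sketched in a few lines; this is a minimal illustration of the idea, not the authors' implementation:

```python
def new_word_novelty(sentences, threshold=1):
    """B-NW / B-NWT style baseline sketch: a sentence is judged novel
    when it contains at least `threshold` words unseen in all previous
    sentences; threshold=1 corresponds to plain new-word detection
    (B-NW), larger thresholds to the thresholded variant (B-NWT)."""
    seen, novel = set(), []
    for s in sentences:
        words = set(s.lower().split())
        if len(words - seen) >= threshold:
            novel.append(s)
        seen |= words  # every word seen counts against later sentences
    return novel
```

As the motivation slides note, this kind of baseline over-predicts novelty for non-relevant sentences, since any sentence full of previously unseen words looks novel regardless of relevance.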
Experiments

Performance for specific topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 8 specific topics (queries) from TREC 2002 (on average 3.4 of the top 15 sentences are novel)
Table. Performance of novelty detection for 15 specific topics (queries) from TREC 2003 (on average 10.1 of the top 15 sentences are novel)
Table. Performance of novelty detection for 11 specific topics (queries) from TREC 2004 (on average 4.6 of the top 15 sentences are novel)
Note:
Data with * pass the significance test at the 95% confidence level by the Wilcoxon test, and ** at the 90% level.
Chg%: improvement over the first (B-NN) baseline in %.
Experiments

Performance for general topics from TREC 2002, 2003, 2004
Table. Performance of novelty detection for 41 general topics (queries) from TREC 2002 (on average 3.2 of the top 15 sentences are novel)
Table. Performance of novelty detection for 35 general topics (queries) from TREC 2003 (on average 7.5 of the top 15 sentences are novel)
Table. Performance of novelty detection for 3 general topics (queries) from TREC 2004 (on average 3.4 of the top 15 sentences are novel)
Note:
Data with * pass the significance test at the 95% confidence level by the Wilcoxon test, and ** at the 90% level.
Chg%: improvement over the first (B-NN) baseline in %.
Experiments

Comparison among specific, general and all topics at top 15 ranks
Table. Comparison among specific, general and all topics at top 15 ranks
Note:
Chg%: Improvement over the first baseline in percentage;
Nvl#: Number of true novel sentences;
Rdd#: Number of relevant but redundant sentences;
NRl#: Number of non-relevant sentences.
Conclusions

Novelty means new answers to the potential questions representing a
user’s request or information need.

The proposed ip-BAND outperforms all baselines for both specific topics
and general topics, and performance on specific topics is better than on general topics.

It is impossible to collect complete novelty judgments in reality.

Baseline selection and evaluation measure by human assessors

Misjudgment of relevance and/or novelty by human assessors and disagreement of
judgments between the human assessors

Limitation and accuracy of question formulations

Novelty detection precision will be low since some non-relevant sentences may be treated
as novel.
Personal Comments

Advantage


…
Drawback


…
Application

…