Transcript ppt

Threshold Setting and Performance Monitoring
for Novel Text Mining
Wenyin Tang and Flora S. Tsai
School of Electrical and Electronic Engineering
Nanyang Technological University
E-mail: [email protected], [email protected]
May 2, 2009
1
Outline
• Introduction
– Novel Text Mining (NTM) System
– Performance Evaluation of NTM
• Adaptive Threshold Setting for NTM
– Motivations
– Our Method: Gaussian-based Adaptive
Threshold Setting (GATS)
– Experimental Result
• Conclusion
2
Overview of Novel Text Mining System
Prepare a clean data
matrix which can be
easily processed by
a computer.
Categorise each
incoming document or
sentence into its
relevant topic bin.
Interact with users:
input documents, output
novel info, preference
setting and feedback.
Detect novel yet
relevant documents or
sentences in each
3
topic.
Novel Text Mining Algorithm
Given a set of relevant documents
in a specific topic, e.g. “football
games”, NTM retrieves the novel
documents by:
– Step 1: rank documents in the
topic “football games” in a
chronological order.
D1, D2, D3, D4 …
– Step 2: assign a novelty score
for each document by
comparing the document with its
history documents.
D1
– Step 3: predict the document as
“novel” if its novelty score is
greater than the predefined
novelty threshold.
Unfortunately, I am
“non-novel” because I
am very similar to my
nearest neighbor D3
D4
D3
I am “novel”
because I am
dissimilar with
my nearest
neighbor D2
D2
I am “novel”
because I am
the first
document
I am “novel”
because I am
dissimilar to D1
Vector space
4
NTM Performance Evaluation
novel
• Given a set of documents D1, D2, to D10, relevant to
some topic, for example,
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
non-novel
# Novel:
System (S):
8
Assessor (A):
5
Matched (M):
4
•
•
•
Precision (P) reflects how likely the system retrieved docs are truly
novel. P=M/S=4/8=0.5, i.e. 50% system retrieved docs are truly
novel.
Recall (R) reflects how likely the truly novel docs can be retrieved
by the system. R=M/A=4/5=0.8, i.e. 80% truly novel docs can be
retrieved by the system.
1
Fβ score: the function of P and R: F 
 1 
P

R
5
Threshold Setting vs. Users’ Requirements
I want to read
the most
novel
information in
a short time1.
I am not sure
until I can see
the documents
The NTM system
should define the
novelty threshold
based on the users’
requirements
adaptively.
I do not want
to miss any
novel
information2.
Different users may have different performance requirements.
6
1. High-precision NTM systems are desired;
2. High-recall NTM systems are desired.
Why Adaptive Threshold Setting
Motivations:
1. As NTM system is a real-time system, there is little or no
training information in the initial stages of NTM.
Therefore, the threshold cannot be predefined with
confidence.
2. As NTM system is an accumulating system, more
training information will be available for threshold setting,
based on user’s feedback given over time.
3. Different users may have different definitions of
“novelty”:
– One user: a document with 50% novel info
– Another user: a document with 90% novel info
7
Gaussian-based Adaptive Threshold
Setting (GATS)
Basic idea:
• GATS is a score distribution-based threshold
setting method. It models the score distributions
of both novel and non-novel documents (based
on the user feedback);
• This parametric model provides the global
information of data, from which we can construct
an optimization criterion of desired performance
to search the best threshold.
8
Novelty Score Distributions
Novel
Gaussian probability
distribution approximation
Non-novel
Empirical probability distribution and its Gaussian
probability distribution approximation for TREC 2004
Novelty Track data topic N54
9
Optimization Criterion
Satisfy 2 conditions:
1. Criterion is a function of Threshold:
J=f (θ)
2. Criterion is directly related to system
performance:
J=Fβ (θ)
Optimization Criterion
Non-novel
Novel
S1
θ
S0
 n1 

0
P ( ) 
S1 ( )
S1 ( )  S 0 ( )
S ( )
R ( )  1
n1
θ
S1 ( )  n1 Pr( x   | c1 )
S0 ( )  n0 Pr( x   | c0 )
p( x | c1 )dx
 *  arg max F ( )  arg max


1

1 

P( ) R( )
11
Flow Chart of NTM with GATS
Experimental Data
Sentence-level data: TREC 2004 Novelty Track data
The news providers of the document set are Xinghua English (XIE) , New
York Times (NYT), and Associated Press Worldstream (APW). The NIST
assessors created 50 topics for this data. Each topic consists of around 25
documents. These documents were ordered chronologically and then
segmented into sentences. Each sentence was given an identifier and
concatenated together to form the target sentence set. In this data, the
overall percentage of novel sentences is around 41.4%. The statistics of data
is summarized in Table 1.
Table 1 Statistics of TREC 2004 Novelty Track data
#Novel
#Non-novel Sum
Relevant
3454
4889
8343
(41.4%)
(58.6%)
13
Experimental Data
Document-level data: APWSJ
APWSJ consists of news articles from Associate Press (AP) and Wall Street
Journal (WSJ), which cover the same period from 1988 to 1990 [Zhang et al.,
2002]. There are 50 TREC topics from Q101 to Q150 in this data and 5
topics (Q131, Q142, Q145, Q147, Q150) that lack non-novel documents are
excluded from the experiments. The statistics of this data are summarized in
Table 2.
Table 2 Statistics of APWSJ data
Relevant
#Novel
10,839
(91.1%)
#Non-novel Sum
1057
11,896
(8.9%)
14
Methods & Parameters
• Baseline:
– Fixed threshold setting θ from 0.05~0.95 with
an equal step 0.05.
• Our method, GATS:
– Complete feedback: with β from 0.1~0.9 with
an equal step 0.1.
– Partial feedback: with β from 0.1~0.9 with an
equal step 0.1, percentages of feedback:
10%, 20%, 50% and 80%.
Experimental Result
Precision
Sentence-Level NTM on TREC 2004 Data
Recall
16
Experimental Result
Redundancy-Precision
Document-Level NTM on APWSJ Data
Redundancy-Recall
17
Comparison: GATS vs. Fixed Threshold
• For precision-recall tradeoff
– Fixed threshold θ cannot reflect the tradeoff of the
precision and recall directly.
– GATS parameter β reflects the weights of precision
and recall directly.
• Under various performance requirements, GATS is
able to approximate the best fixed threshold.
Table 3 Comparison of Fβ on TREC 2004 Novelty Track data
Experimental Result
Precision
Sentence-Level NTM on TREC 2004 Data
Recall
PR curves of GATS (tuned for Fβ) with different percentages of
the user’s feedback.
19
Experimental Result
Redundancy-Precision
Document-Level NTM on APWSJ Data
Redundancy-Recall
R-PR curves of GATS with different percentages of the
user’s feedback.
20
Conclusion
• A Gaussian-based Adaptive Threshold Setting (GATS)
algorithm was proposed for NTM system.
• GATS is a generic method, which can be tuned
according to different performance requirements varying
from high-precision to high-recall.
• By testing the proposed method on both document and
sentence-level datasets, we found the experimental
results showed the promising performance of GATS for a
real-time NTM system.
21
Q&A
22