Natural Language Processing
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing
2. D. Jurafsky and J. H. Martin, Speech and Language Processing
Motivation for NLP (1/2)
• Academic: Explore the nature of linguistic communication
– Obtain a better understanding of how languages work
• Practical: Enable effective human-machine
communication
– Conversational agents are becoming an important form of human-computer communication
– Revolutionize the way computers are used
• More flexible and intelligent
Motivation for NLP (2/2)
• Different Academic Disciplines: Problems and Methods
– Electrical Engineering, Statistics
– Computer Science
– Linguistics
– Psychology
(Figure: Venn diagram placing NLP at the intersection of Linguistics,
Psychology, Computer Science, and Electrical Engineering/Statistics)
• Many of the techniques presented were first developed
for speech and then spread into NLP
– E.g., language models in speech recognition
Turing Test
• Alan Turing, 1950
– Turing predicted that by the end of the 20th century a machine
with 10 gigabytes of memory would have a 30% chance of fooling a
human interrogator after 5 minutes of questioning
• Has this prediction come true?
Hollywood Cinema
• Computers/robots can listen, speak, and answer our
questions
– E.g.: the HAL 9000 computer in “2001: A Space Odyssey” (2001太空漫遊)
State of the Art (1/3)
• A Canadian computer program accepted daily weather
data and generated weather reports (1976)
• Systems that read student essays and grade them
• Automated reading tutors
• Spoken dialogue systems
– AT&T’s How May I Help You?
Major Topics for NLP (2/2)
• Semantics/Meaning
– Representation of Meaning
– Semantic Analysis
– Word Sense Disambiguation
• Pragmatics
– Natural Language Generation
– Discourse, Dialogue and Conversational Agents
– Machine Translation
Dissidence
• Rationalists (e.g., Chomsky)
– Humans have an innate language faculty
– (Almost fully) encoded rules plus reasoning mechanisms
– Dominant from the 1960s to the mid-1980s
• Empiricists (e.g., Shannon)
– The mind does not begin with detailed sets of principles and
procedures for language components and cognitive domains
– Rather, only general operations for association, pattern
recognition, generalization, etc., are innate
• General language models plus machine learning approaches
– Dominant from the 1920s to the mid-1960s, resurging from the 1990s onward
Dissidence: Statistical and Non-Statistical NLP
• The dividing line between the two has recently become
much fuzzier
– An increasing amount of non-statistical research uses corpus
evidence and incorporates quantitative methods
• Corpus: “a body of texts”
– Statistical NLP should start with all the scientific knowledge
available about a phenomenon when building a probabilistic
model, rather than closing one’s eyes and taking a clean-slate
approach
• Probabilistic and data-driven
• Statistical NLP → “Language Technology” or “Language
Engineering”
(I) Part-of-Speech Tagging
Review
• Tagging (part-of-speech tagging)
– The process of assigning (labeling) a part-of-speech or other
lexical class marker to each word in a sentence (or a corpus)
• Decide whether each word is a noun, verb, adjective, or
whatever
The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
Or
The/AT representative/JJ put/NN chairs/VBZ on/IN the/AT table/NN
– An intermediate layer of representation of syntactic structure
• When compared with syntactic parsing
– Above 96% accuracy for the most successful approaches
Tagging can be viewed as a kind of syntactic disambiguation
Introduction
• Parts-of-speech
– Known as POS tags, word classes, lexical tags, or morphological classes
• Tag sets
– Penn Treebank: 45 word classes used (Marcus et al., 1993)
• The Penn Treebank is a parsed corpus
– Brown corpus: 87 word classes used (Francis, 1979)
– ….
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
The Penn Treebank POS Tag Set
Disambiguation
• Resolve the ambiguities and choose the proper tag for
the context
• Most English words are unambiguous (have only one tag),
but many of the most common words are ambiguous
– E.g.: “can” can be an (auxiliary) verb or a noun
– E.g.: statistics of the Brown corpus
• 11.5% of word types are ambiguous
• But 40% of tokens are ambiguous
• However, the probabilities of the tags associated with a word
are not equal, P(t1 | w) ≫ P(t2 | w) ≫ …, so many ambiguous
tokens are easy to disambiguate
Process of POS Tagging
(Flow: a string of words + a specified tagset → tagging algorithm
→ a single best tag for each word)
VB   DT   NN     .
Book that flight .
VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?
Two information sources are used:
- Syntagmatic information (looking at information about tag sequences)
- Lexical information (predicting a tag based on the word concerned)
POS Tagging Algorithms (1/2)
Fall into One of Two Classes
• Rule-based Tagger
– Involves a large database of handcrafted disambiguation rules
• E.g., a rule specifies that an ambiguous word is a noun rather
than a verb if it follows a determiner
• ENGTWOL: a simple rule-based tagger based on the
constraint grammar architecture
• Stochastic/Probabilistic Tagger
– Also called model-based tagger
– Uses a training corpus to compute the probability of a given word
having a given tag in a given context
– E.g.: the HMM tagger chooses the best tag for a given word
(maximizing the product of word likelihood and tag sequence probability)
• E.g., “a new play”: P(NN|JJ) ≈ 0.45, P(VBP|JJ) ≈ 0.0005
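For illustration, here is a minimal Python sketch of the determiner rule mentioned above; the toy lexicon, tag names, and the single rule are illustrative assumptions, not ENGTWOL’s actual rule set:

```python
# Second-stage rule-based disambiguation (illustrative sketch).
# Each word starts with a list of candidate tags from a small
# hand-made dictionary; rules then winnow the list down.

LEXICON = {                      # hypothetical toy dictionary
    "the": ["AT"],
    "chairs": ["NNS", "VBZ"],
    "table": ["NN", "VB"],
}

def apply_determiner_rule(words, candidates):
    """If an ambiguous word follows a determiner, keep noun tags only."""
    for i in range(1, len(words)):
        if candidates[i - 1] == ["AT"] and len(candidates[i]) > 1:
            nouns = [t for t in candidates[i] if t.startswith("NN")]
            if nouns:
                candidates[i] = nouns
    return candidates

words = ["the", "table"]
cands = [list(LEXICON[w]) for w in words]
print(apply_determiner_rule(words, cands))   # [['AT'], ['NN']]
```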
POS Tagging Algorithms (2/2)
• Transformation-based/Brill Tagger
– A hybrid approach
– Like the rule-based approach, it determines the tag of an ambiguous
word based on rules
– Like the stochastic approach, the rules are automatically induced
from a previously tagged training corpus with machine learning
techniques
• Supervised learning
Rule-based POS Tagging (1/3)
• Two-stage architecture
– First stage: Use a dictionary to assign each word a list of
potential parts-of-speech
– Second stage: Use large lists of hand-written disambiguation
rules to winnow down this list to a single part-of-speech for each
word
An example of first-stage output from the ENGTWOL tagger for
“Pavlov had shown that salivation …”:
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO (preterit)
            HAVE PCP2 SVO (past participle)
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS (complementizer)
salivation  N NOM SG
A set of 1,100 constraints can then be applied to the input sentence
Rule-based POS Tagging (2/3)
• Simple lexical entries in the ENGTWOL lexicon
(Table: sample ENGTWOL lexical entries; PCP2 = past participle)
Rule-based POS Tagging (3/3)
Example:
It isn’t that odd!      (“that” is an ADV modifying the adjective “odd”)
I consider that odd.    (“that” is a complement, not an ADV)
HMM-based Tagging (1/8)
• Also called maximum likelihood tagging
– Pick the most-likely tag for a word
• For a given sentence or word sequence, an HMM
tagger chooses the tag sequence that maximizes the
following probability for a word at position i:
tag_i = argmax_j P(word_i | tag_j) · P(tag_j | previous n−1 tags)
(word/lexical likelihood × tag sequence probability: an N-gram HMM tagger)
HMM-based Tagging (2/8)
• Assumptions made here
– Words are independent of each other
• A word’s identity only depends on its tag
– “Limited Horizon” and “Time Invariant” (“Stationary”)
• Limited horizon: a word’s tag only depends on the previous
few tags
• Time invariant: the tag dependency does not change as the tag
sequence appears at different positions of a sentence
These assumptions do not model long-distance relationships well!
- e.g., wh-extraction, …
HMM-based Tagging (3/8)
• Apply a bigram-HMM tagger to choose the best tag for a
given word
– Choose the tag t_i for word w_i that is most probable given the
previous tag t_{i−1} and the current word w_i:
t_i = argmax_j P(t_j | t_{i−1}, w_i)
– Through some simplifying Markov assumptions:
t_i = argmax_j P(t_j | t_{i−1}) · P(w_i | t_j)
(tag sequence probability × word/lexical likelihood)
HMM-based Tagging (4/8)
• Example: choose the best tag for a given word
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
to/TO race/???
P(VB|TO) = 0.34, P(race|VB) = 0.00003 → P(VB|TO)·P(race|VB) ≈ 0.00001
P(NN|TO) = 0.021, P(race|NN) = 0.00041 → P(NN|TO)·P(race|NN) ≈ 0.000007
⇒ race is tagged VB
(Pretend that the previous word has already been tagged)
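A minimal Python sketch of this bigram decision, plugging in the slide’s probabilities:

```python
# Bigram-HMM decision for to/TO race/??? using the slide's numbers.
p_trans = {"VB": 0.34, "NN": 0.021}        # P(tag | TO)
p_word  = {"VB": 0.00003, "NN": 0.00041}   # P(race | tag)

scores = {t: p_trans[t] * p_word[t] for t in p_trans}
print(scores)                       # {'VB': 1.02e-05, 'NN': 8.61e-06}
print(max(scores, key=scores.get))  # VB wins
```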
HMM-based Tagging (5/8)
• The Viterbi algorithm for the bigram-HMM tagger
1. Initialization: δ_1(j) = π_j · P(w_1 | t_j), 1 ≤ j ≤ J, where π_j = P(t_j)
2. Induction:
δ_i(j) = max_{1≤k≤J} [δ_{i−1}(k) · P(t_j | t_k)] · P(w_i | t_j), 2 ≤ i ≤ n, 1 ≤ j ≤ J
ψ_i(j) = argmax_{1≤k≤J} δ_{i−1}(k) · P(t_j | t_k)
3. Termination and backtracking:
X_n = argmax_{1≤j≤J} δ_n(j)
for i := n−1 down to 1: X_i = ψ_{i+1}(X_{i+1})
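A log-domain Python sketch of the recursion above; the model parameters (pi, trans, emit) are assumed to be given as dictionaries estimated elsewhere:

```python
import math

def viterbi(words, tags, pi, trans, emit):
    """Viterbi decoding for a bigram-HMM tagger, in the log domain.

    pi[t]       : initial tag probability P(t)
    trans[u][t] : tag transition probability P(t | u)
    emit[t][w]  : word likelihood P(w | t)
    Returns the most probable tag sequence for `words`.
    """
    def lp(x):  # safe log
        return math.log(x) if x > 0 else float("-inf")

    n = len(words)
    # 1. Initialization: delta_1(j) = pi_j * P(w_1 | t_j)
    delta = [{t: lp(pi[t]) + lp(emit[t].get(words[0], 0.0)) for t in tags}]
    psi = [{}]
    # 2. Induction over positions 2..n
    for i in range(1, n):
        delta.append({})
        psi.append({})
        for t in tags:
            best_u = max(tags, key=lambda u: delta[i - 1][u] + lp(trans[u][t]))
            psi[i][t] = best_u
            delta[i][t] = (delta[i - 1][best_u] + lp(trans[best_u][t])
                           + lp(emit[t].get(words[i], 0.0)))
    # 3. Termination and backtracking
    path = [max(tags, key=lambda t: delta[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```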
HMM-based Tagging (6/8)
• Apply a trigram-HMM tagger to choose the best sequence
of tags for a given sentence
– When a trigram model is used:
T̂ = argmax_{t_1,…,t_n} [ P(t_1) · P(t_2 | t_1) · ∏_{i=3}^{n} P(t_i | t_{i−2}, t_{i−1}) ] · ∏_{i=1}^{n} P(w_i | t_i)
• Maximum likelihood estimation based on the relative
frequencies observed in the pre-tagged training corpus
(labeled data):
P_ML(t_i | t_{i−2}, t_{i−1}) = c(t_{i−2} t_{i−1} t_i) / Σ_j c(t_{i−2} t_{i−1} t_j)
P_ML(w_i | t_i) = c(w_i, t_i) / Σ_j c(w_j, t_i)
• Smoothing or linear interpolation is needed:
P_smoothed(t_i | t_{i−2}, t_{i−1}) = α · P_ML(t_i | t_{i−2}, t_{i−1}) + β · P_ML(t_i | t_{i−1}) + (1 − α − β) · P_ML(t_i)
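A Python sketch of the MLE counting and interpolation above; the corpus layout and the weights alpha/beta are assumptions (in practice the weights are tuned on held-out data):

```python
from collections import Counter

def train(corpus, alpha=0.6, beta=0.3):
    """corpus: list of tagged sentences, each a list of (word, tag)."""
    tri, bi, uni = Counter(), Counter(), Counter()
    wt, tagc = Counter(), Counter()
    for sent in corpus:
        tags = [t for _, t in sent]
        for w, t in sent:
            wt[(w, t)] += 1
            tagc[t] += 1
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    total = sum(uni.values())

    def p_tag(t, t2, t1):
        """P_smoothed(t | t2, t1): interpolated trigram/bigram/unigram."""
        p3 = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
        p2 = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
        p1 = uni[t] / total
        return alpha * p3 + beta * p2 + (1 - alpha - beta) * p1

    def p_word(w, t):
        """P_ML(w | t) from the labeled counts."""
        return wt[(w, t)] / tagc[t] if tagc[t] else 0.0

    return p_tag, p_word
```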
HMM-based Tagging (7/8)
Pti ti 1  
cti 1ti 
 cti 1t j 
j
Pwi ti  
cwi , ti 
 cw j , ti 
j
• Probability smoothing of Pti ti1  and Pwi ti  is necessary
27
HMM-based Tagging (8/8)
• Probability re-estimation based on unlabeled data
• The EM (Expectation-Maximization) algorithm is applied
– Start with a dictionary that lists which tags can be
assigned to which words
» the word likelihood function P(w_i | t_i) can be estimated
» the tag transition probabilities P(t_i | t_{i−1}) are set to be equal
– The EM algorithm learns (re-estimates) the word likelihood
function for each tag and the tag transition
probabilities
• However, a tagger trained on hand-tagged data works better
than one trained via EM
– Treat the model as a Markov model in training but as a
hidden Markov model in tagging
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
Transformation-based Tagging (1/8)
• Also called Brill tagging
– An instance of Transformation-Based Learning (TBL)
• Motive
– Like the rule-based approach, TBL is based on rules that specify
what tags should be assigned to what word
– Like the stochastic approach, rules are automatically induced
from the data by machine learning techniques
• Note that TBL is a supervised learning technique
– It assumes a pre-tagged training corpus
Transformation-based Tagging (2/8)
• How the TBL rules are learned
– Three major stages
1. Label every word with its most-likely tag using a set of
tagging rules (use the broadest rules first)
2. Examine every possible transformation (rewrite rule), and
select the one that results in the most improved tagging
(supervised: compare against the pre-tagged corpus)
3. Re-tag the data according to this rule
– The above three stages are repeated until some stopping
criterion is reached
• Such as insufficient improvement over the previous pass
– An ordered list of transformations (rules) is finally obtained
Transformation-based Tagging (3/8)
• Example
1. P(NN|race) = 0.98, P(VB|race) = 0.02,
so race will initially be tagged NN
(label every word with its most-likely tag)
2. (a) is/VBZ expected/VBN to/TO race/NN tomorrow/NN
(b) the/DT race/NN for/IN outer/JJ space/NN
Referring to the correct-tag information for each word,
the tag of race in (a) is found to be wrong
3. Learn/pick the most suitable transformation rule
(by examining every possible transformation):
Change NN to VB when the previous tag is TO
Rewrite rule: expected/VBN to/TO race/NN → expected/VBN to/TO race/VB
Transformation-based Tagging (4/8)
• Templates (abstracted transformations)
– The set of possible transformations may be infinite
– We should therefore limit the set of transformations
– The design of a small set of templates (abstracted transformations)
is needed
• E.g., this rules out strange rules like:
transform NN to VB if the previous word was “IBM” and
the word “the” occurs between 17 and 158 words before that
Transformation-based Tagging (5/8)
• Possible templates (abstracted transformations)
(Table: Brill’s templates; each begins with “Change tag a to tag b when ….”)
Transformation-based Tagging (6/8)
• Learned transformations
(Table: rules learned by Brill’s original tagger; each rule has
constraints on tags and on words. Annotated tags include verb 3sg
past tense, modal verbs (should, can, …), verb past participle, and
verb 3sg present; example word context: “more valuable player”)
Transformation-based Tagging (7/8)
• Reference for tags used in the previous slide
Transformation-based Tagging (8/8)
• Algorithm
(Figure: the TBL learning algorithm. For all combinations of tags X,
Y, and Z, traverse the corpus and score each candidate transformation;
get the best instance for each transformation, check whether it is
better than the best instance achieved in previous iterations, and
append the winner to the rule list. The GET_BEST_INSTANCE procedure
in the example algorithm is “Change tag from X to Y if the previous
tag is Z”.)
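A schematic Python sketch of this learning loop, restricted to the single template “change X to Y if the previous tag is Z”; the data layout (current vs. gold tag sequences) is an assumption for illustration:

```python
from itertools import product

def tbl_learn(current, gold, tagset, n_rules=10):
    """current/gold: lists of tag sequences (current guess vs. hand labels).
    Returns an ordered list of (X, Y, Z) rules."""
    rules = []
    for _ in range(n_rules):
        best, best_gain = None, 0
        for x, y, z in product(tagset, repeat=3):  # all tag combinations
            gain = 0
            for cur, ref in zip(current, gold):
                for i in range(1, len(cur)):
                    if cur[i] == x and cur[i - 1] == z:
                        # +1 if the rewrite fixes this token, -1 if it breaks it
                        gain += (y == ref[i]) - (x == ref[i])
            if gain > best_gain:
                best, best_gain = (x, y, z), gain
        if best is None:          # insufficient improvement: stop
            break
        x, y, z = best
        for cur in current:       # re-tag the data according to this rule
            for i in range(1, len(cur)):
                if cur[i] == x and cur[i - 1] == z:
                    cur[i] = y
        rules.append(best)
    return rules
```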
(II) Extractive Spoken Document Summarization
- Models and Features
Introduction (1/3)
• The World Wide Web has led to a renaissance in research
on automatic document summarization, and has extended
it to cover a wider range of new tasks
• Speech is one of the most important sources of
information about multimedia content
• However, spoken documents associated with multimedia
are unstructured, without titles and paragraphs, and thus
are difficult to retrieve and browse
– Spoken documents are merely audio/video signals or a very long
sequence of transcribed words, including errors
– It is inconvenient and inefficient for users to browse through each
of them from beginning to end
Introduction (2/3)
• Spoken document summarization, which aims to generate
a summary automatically for spoken documents, is the
key to better speech understanding and organization
• Extractive vs. Abstractive Summarization
– Extractive summarization selects a number of indicative
sentences or paragraphs from the original document and
sequences them to form a summary
– Abstractive summarization rewrites a concise abstract that
reflects the key concepts of the document
– Extractive summarization has gained much more attention in the
recent past
Introduction (3/3)
(Figure: a typical spoken document summarization pipeline. The speech
signal is passed to speech recognition (feature extraction plus
acoustic and linguistic models), producing a speech transcription with
confidence scores; prosodic feature extraction yields prosody features.
Important unit (sentence) extraction then combines statistical and
lexical features with a language model and word dependency
probabilities trained on a large text corpus; sentence compaction
finally produces the summary, records, captions, and indexes.)
History of Summarization Research
(Timeline, 1950–2000, of text-document and spoken-document
summarization:)
– Early system using a surface-level approach (1958)
– The first entity-level approaches based on syntactic analysis (1961)
– The use of location features (1969)
– The extended surface-level approach to include the use of cue phrases
– The emergence of more extensive entity-level approaches (1972)
– The first discourse-based approaches based on story grammars (1980)
– A variety of different work (entity-level approaches based on AI,
logic and production rules, semantic networks; hybrid approaches)
– More natural language generation work begins to focus on text
summarization
– The first training approach (1995)
– The first SVD-based approach (1995)
– The emergence of new areas such as multi-document summarization
(1997), multilingual summarization, and multimedia summarization
(1997), including spoken-document summarization
– Recent work has almost exclusively focused on extracts rather than
abstracts, with a renewed interest in earlier surface-level approaches
Extraction Based on Sentence Locations/Structures
• Sentence extraction using sentence location information
– Lead (Hajime and Manabu 2000)
– Focusing on the introductory and concluding segments (Hirohata
et al. 2005)
– Specific structure in some domains (Maskey et al. 2003)
• E.g., broadcast news programs: sentence position, speaker type,
previous-speaker type, next-speaker type, speaker change
Statistical Summarization Approaches (1/7)
• Spoken sentences are ranked and selected based on
similarity measures or significance scores
(a) Similarity Measures
(Figure: document vector D and sentence vector S_i in a vector space)
– Vector Space Model (VSM) (Ho 2003)
• The document and its sentences are represented in vector form
• The sentences that have the highest relevance scores to the
whole document are selected
– To summarize more important and different concepts in a
document:
• Relevance measure (Gong et al. 2001)
• Maximum Marginal Relevance (MMR) (Murray et al. 2005)
Statistical Summarization Approaches (2/7)
(a) Similarity Measures
– Relevance measure (Gong et al. 2001), applied to the candidate
sentence set {S_1, S_2, …, S_i, …, S_m}:
1. Compute the relevance score between each sentence S_i and the
document D
2. Select the sentence S_max that has the highest relevance score
into the summary
3. Stop if the number of sentences in the summary reaches the
predefined value; otherwise delete S_max, recompute the weighted
term-frequency vector D for the document, and go to step 1
– Maximum Marginal Relevance, MMR (Murray et al. 2005):
MMR(S_i) = a · Sim(S_i, D) − (1 − a) · Sim(S_i, Summ)
where Summ denotes the sentences already selected into the summary
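A Python sketch of MMR selection as defined above; sentence and document vectors are assumed to be precomputed term vectors (e.g., TF-IDF), and cosine similarity stands in for Sim:

```python
import numpy as np

def mmr_summarize(sent_vecs, doc_vec, ratio=0.3, a=0.7):
    """Greedy MMR selection: `a` trades relevance against redundancy."""
    def cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / denom if denom else 0.0

    n_pick = max(1, int(len(sent_vecs) * ratio))
    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < n_pick:
        def score(i):
            # redundancy w.r.t. the summary so far (max similarity)
            redundancy = max((cos(sent_vecs[i], sent_vecs[j])
                              for j in selected), default=0.0)
            return a * cos(sent_vecs[i], doc_vec) - (1 - a) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```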
Statistical Summarization Approaches (3/7)
(b) SVD-based Method
– Each sentence can also be represented as a semantic vector,
and sentences carrying more topic or semantic information are
selected
– LSA (Gong et al. 2001)
• Build a J × M term-sentence matrix A from the J content words
and the M sentences S_1, S_2, …, S_M, where column i is the
weighted word-frequency vector A_i = [a_1i, a_2i, …, a_Ji]^T of
sentence S_i
• Apply SVD: A = U Σ V^T, where U is the left singular vector
matrix (information about each word), Σ the singular value
matrix, and V^T the right singular vector matrix (information
about each sentence)
• After dimensionality reduction to K dimensions, sentence S_i is
represented by the weighted singular-value vector
Â_i = [σ_1 v_i1, σ_2 v_i2, …, σ_K v_iK]^T
– DIM (Hirohata et al. 2005)
• Score each sentence by the norm of its reduced-dimension vector:
Score(S_i) = sqrt( Σ_{k=1}^{K} (σ_k v_ik)² )
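A Python sketch of the LSA/DIM scoring above using numpy’s SVD; how the weighted term-sentence matrix A is built (e.g., from TF weights) is assumed:

```python
import numpy as np

def lsa_scores(A, K=5):
    """A: J x M weighted term-sentence matrix; returns one score
    per sentence (DIM-style norm in the reduced K-dim space)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    K = min(K, len(s))
    # Reduced representation of sentence i: [s_1*v_i1, ..., s_K*v_iK]
    reduced = (s[:K, None] * Vt[:K, :]).T      # shape (M, K)
    return np.linalg.norm(reduced, axis=1)     # norm = significance score
```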
Statistical Summarization Approaches (4/7)
(c) Sentence Significance Score (SIG)
– Each sentence in the document is represented as a sequence of
terms, which can simply be given a significance score
– Features such as the confidence score, linguistic score, or
prosodic information can be further integrated
– Sentence selection can be performed based on this score
– E.g., given a sentence S_i = {w_1, w_2, …, w_j, …, w_J}:
S_i = (1/J) Σ_{j=1}^{J} [ L(w_j) + λ_I I(w_j) + λ_C C(w_j) ]
• Linguistic score: L(w_j) = log P(w_j | …, w_{j−1})
• Significance score: I(w_j) = f_j log(F_A / F_j), where f_j is the
frequency of w_j in the document, F_j its frequency in a large
corpus, and F_A the total word count of that corpus
• Or the sentence significance score S_i = (1/J) Σ_{j=1}^{J} I(w_j)
(Hirohata et al. 2005)
Statistical Summarization Approaches (5/7)
(c) Sentence Significance Score (cont.)
– Sentence: S_i = {w_1, w_2, …, w_j, …, w_J}
S_i = (1/J) Σ_{j=1}^{J} [ λ_1 s(w_j) + λ_2 l(w_j) + λ_3 c(w_j) + λ_4 g(w_j) ] + λ_5 b(S_i)
• s(w_j): statistical measure, such as TF/IDF
• l(w_j): linguistic measure, e.g., named entities and POSs
• c(w_j): confidence score
• g(w_j): N-gram score
• b(S_i): calculated from the grammatical structure of the sentence
• The statistical measure can also be evaluated using PLSA
(Probabilistic Latent Semantic Analysis)
– Topic Significance
– Term Entropy
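A Python sketch of the significance score I(w_j) = f_j · log(F_A / F_j) from the previous slide, averaged over a sentence; the document and corpus counts are assumed to be given, and the floor for unseen words is an illustrative choice:

```python
import math

def sentence_sig(sentence, doc_counts, corpus_counts):
    """sentence: list of words; doc_counts/corpus_counts: word -> count."""
    F_A = sum(corpus_counts.values())          # total corpus word count
    def I(w):
        F_w = corpus_counts.get(w, 1)          # crude floor for unseen words
        return doc_counts.get(w, 0) * math.log(F_A / F_w)
    return sum(I(w) for w in sentence) / len(sentence)
```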
Statistical Summarization Approaches (6/7)
(d) Classification-based Methods
– These methods need a set of training documents (labeled data)
for training the classifiers, with classes C ∈ {Summary, Non-summary}
– Naïve Bayes’ Classifier / Bayesian Network Classifier
(Kupiec 1995, Koumpis et al. 2005, Maskey et al. 2005):
P(x | C) = ∏_{v=1}^{V} P(x_v | C)
where x = (x_1, …, x_V) is the feature vector of sentence S_i
– Support Vector Machine (SVM) (Zhu and Penn 2005)
– Logistic Regression (Zhu and Penn 2005)
– Gaussian Mixture Models (GMM) (Murray et al. 2005)
Statistical Summarization Approaches (7/7)
(e) Combined Methods (Hirohata et al. 2005)
– Sentence Significance Score (SIG) combined with Location
Information
– Latent semantic analysis (LSA) combined with Location
Information
– DIM combined with Location Information
Probabilistic Generative Approaches (1/7)
• MAP criterion for sentence selection:
P(S | D_i) = P(D_i | S) · P(S) / P(D_i) ∝ P(D_i | S) · P(S)
(sentence model × sentence prior)
• Sentence prior
– The sentence prior P(S) is simply set to uniform here
– Or it may depend on
• sentence duration/position, correctness of sentence
boundaries, confidence scores, prosodic information, etc.
• Each sentence of the document can be ranked by this
likelihood value
Probabilistic Generative Approaches (2/7)
• Hidden Markov Model (HMM)
– Each sentence S of the spoken document D_i is treated as a
probabilistic generative model of N-grams, while the spoken
document D_i = w_1 w_2 … w_j … w_{L_i} is the observation:
P_HMM(D_i | S) = ∏_{w_j ∈ D_i} [ λ · P(w_j | S) + (1 − λ) · P(w_j | C) ]^{c(w_j, D_i)}
– P(w_j | S): the sentence model, estimated from the sentence S
– P(w_j | C): the collection model, estimated from a large corpus C
(in order to give every term in the vocabulary some probability)
– λ: a weighting parameter
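A Python sketch of this document likelihood in the log domain; the collection model (a word-to-probability dictionary) and the MLE sentence model are assumed, and the floor probability for unseen words is an illustrative choice:

```python
import math
from collections import Counter

def hmm_score(doc_words, sent_words, collection_model, lam=0.5):
    """log P_HMM(D | S): the sentence generates the document,
    smoothed with a collection model."""
    doc = Counter(doc_words)
    sent = Counter(sent_words)
    slen = sum(sent.values())
    score = 0.0
    for w, c in doc.items():
        p_s = sent[w] / slen                    # P(w | S), MLE
        p_c = collection_model.get(w, 1e-9)     # P(w | C)
        score += c * math.log(lam * p_s + (1 - lam) * p_c)
    return score
```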
Probabilistic Generative Approaches (3/7)
• Relevance Model, RM
– In HMM, the true sentence model P(w_j | S) might not be
accurately estimated by MLE, P(w | S_i) = c(w, S_i) / |S_i|,
since the sentence consists of only a few terms
– In order to improve estimation of the sentence model
• Each sentence S has its own associated relevance model R_S,
constructed from the subset of documents in the collection that
are relevant to the sentence
• The relevance model is then linearly combined with the
original sentence model to form a more accurate sentence
model:
P̂(w_j | S) = α · P(w_j | S) + (1 − α) · P(w_j | R_S)
P̂_HMM(D_i | S) = ∏_{w_j ∈ D_i} [ λ · P̂(w_j | S) + (1 − λ) · P(w_j | C) ]^{c(w_j, D_i)}
Probabilistic Generative Approaches (4/7)
• A schematic diagram of extractive spoken document
summarization jointly using the HMM and RM models
(Figure: each sentence S of the spoken documents to be summarized
D_1, D_2, …, D_i is issued as a query to an IR system over a
contemporary text news collection, with local feedback, to retrieve
relevant documents for building S’s RM model; S’s HMM model is
estimated from the sentence itself and a general text news
collection. The two models jointly produce the document likelihood
P_HMM(D_i | S).)
Probabilistic Generative Approaches (5/7)
• Topical Mixture Model (TMM)
– Build a probabilistic latent topical space
– Measure the likelihood of a sentence generating a given
document in that space:
P(D | S_i) = ∏_{w ∈ D} [ Σ_{k=1}^{K} P(w | T_k) · P(T_k | S_i) ]^{n(w, D)}
(Figure: the TMM model for a specific sentence S_i; the document
D = w_1 w_2 … w_n … w_N is generated from the topic distributions
P(w_n | T_1), …, P(w_n | T_K), weighted by P(T_1 | S_i), …, P(T_K | S_i))
Probabilistic Generative Approaches (6/7)
• Word Topical Mixture Model (wTMM)
– To explore the co-occurrence relationship between words of the
language
– Each word w_j of the language is treated as a topical mixture
model M_{w_j} for predicting the occurrence of another word w:
P(w | M_{w_j}) = Σ_{k=1}^{K} P(w | T_k) · P(T_k | M_{w_j})
– Each sentence of the spoken document to be summarized is
treated as a composite word-TMM model for generating the
document
– The likelihood of the document D being generated by S_i can be
expressed as:
P(D | S_i) = ∏_{w ∈ D} [ Σ_{w_j ∈ S_i} λ_{j,i} Σ_{k=1}^{K} P(w | T_k) · P(T_k | M_{w_j}) ]^{n(w, D)}
Probabilistic Generative Approaches (7/7)
• Word Topical Mixture Model (wTMM), training and testing
(Figure: the word models M_{w_1}, …, M_{w_V} are trained on text
documents D_1, …, D_C paired with their titles H_1, …, H_C, and are
then used to score the sentences S_1, …, S_i, …, S_N of a spoken
document D.)
– Training, maximizing the likelihood of each text document D_c
given its title H_c:
P(D_c | H_c) = ∏_{w ∈ D_c} [ Σ_{w_j ∈ H_c} λ_{j,c} Σ_{k=1}^{K} P(w | T_k) · P(T_k | M_{w_j}) ]^{n(w, D_c)}
– Testing, scoring each sentence S_i of the spoken document D:
P(D | S_i) = ∏_{w ∈ D} [ Σ_{w_j ∈ S_i} λ_{j,i} Σ_{k=1}^{K} P(w | T_k) · P(T_k | M_{w_j}) ]^{n(w, D)}
Comparison of Extractive Summarization Methods
• Literal Term Matching vs. Concept Matching
– Literal Term Matching:
• Extraction using degree of similarity (VSM, MMR)
• Extraction using feature scores (sentence significance score)
• HMM and HMM with relevance model (HMMRM)
– Concept Matching:
• Extraction using latent semantic analysis (LSA, DIM)
• TMM, wTMM
Evaluation Metrics (1/3)
• Subjective Evaluation Metrics (direct evaluation)
– Conducted by human subjects
– At different levels
• Objective Evaluation Metrics
– Automatic summaries are evaluated by objective metrics
• Automatic Evaluation
– Summaries are evaluated by an IR task
Evaluation Metrics (2/3)
• Objective Evaluation Metrics
– Sentence recall/precision (Hirohata et al. 2004)
• Sentence recall/precision is commonly used in evaluating
sentence-extraction-based text summarization
• However, sentence boundaries are not explicitly indicated in
input speech, and estimated boundaries based on recognition
results do not always agree with those in manual summaries
(Kitade et al., 2004)
• F-measure, F-measure/max, F-measure/ave.:
R = |S_man ∩ S_sum| / |S_man|, P = |S_man ∩ S_sum| / |S_sum|, F = 2RP / (R + P)
where S_man is the sentence set of the manual summary and
S_sum that of the automatic summary
Evaluation Metrics (3/3)
• Objective Evaluation Metrics
– ROUGE-N (Lin et al. 2003)
• ROUGE-N is an N-gram recall between an automatic
summary and a set of manual summaries S_H:
ROUGE-N = [ Σ_{S ∈ S_H} Σ_{g_n ∈ S} C_m(g_n) ] / [ Σ_{S ∈ S_H} Σ_{g_n ∈ S} C(g_n) ]
where C(g_n) is the count of N-gram g_n in a manual summary and
C_m(g_n) the number of its co-occurrences in the automatic summary
• E.g. (word-segmented Chinese): manual summary
“昨天 / 馬英九 / 訪問 / 中國大陸” (“yesterday / Ma Ying-jeou /
visited / mainland China”) vs. automatic summary
“昨天 / 馬英九 / 結束 / 訪問 / 回國” (“yesterday / Ma Ying-jeou /
ended / the visit / and returned home”)
– Cosine Measure (Saggion et al. 2002, Ho 2003):
ACC(m%) = (1 / 2H) Σ_{h=1}^{H} [ SIM(A_D(m%), E_{h,D}(m%)) + SIM(A_D(m%), R_{h,D}) ]
averaged over the documents D in the test set, where A_D(m%) is
the automatic summary of D at summarization ratio m%, and E_{h,D}
and R_{h,D} are the h-th human subject’s extractive and abstractive
summaries, respectively
SIM(A_D(m%), E_{h,D}(m%)) = ( V_{A_D(m%)} · V_{E_{h,D}(m%)} ) / ( ‖V_{A_D(m%)}‖ · ‖V_{E_{h,D}(m%)}‖ )
(Figure: summary vectors in a vector space)
HW: Document Summarization
(Flow: given data → preprocessing → automatic ranking of the
sentences by your summarization method → evaluation against the
reference ranking of the sentences; the ranking step is the work
you have to do)
Corpus
• The broadcast news corpus consists of 200 documents
• You have to implement your summarization method and
evaluate it using these documents
(Figure: the content of the documents, listing each sentence with
its word IDs)
Summarization Result
• Different ratios of the summarization (ranking) result are
taken to be the final summary
– The ratio can be 10%, 20%, 30%, 50%, etc.:
number of sentences to take = ⌊(total sentences of the document) × ratio⌋
– E.g., at a 50% ratio, a document with 7 sentences yields
7 × 50% = 3.5 → 3 sentences
Performance Evaluation
• Recall rate for a document D_i:
Recall_i = (number of matching sentences) / (total sentences at this ratio)
– E.g., at a 50% ratio, if 1 of the 3 sentences in the test summary
also appears in the reference summary, Recall_i = 1/3 ≈ 0.33
• Average recall rate:
Average Recall = ( Σ_{i=1}^{N} Recall_i ) / N, where N = 200 in the HW
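A Python sketch combining the ratio rule and the recall metric of the HW; rankings are assumed to be lists of sentence IDs ordered from most to least important:

```python
def recall_at_ratio(ranked, reference, total_sentences, ratio=0.5):
    """Take floor(total * ratio) top-ranked sentences and count how
    many also appear among the top reference sentences."""
    k = int(total_sentences * ratio)        # e.g., 7 * 50% = 3.5 -> 3
    picked = set(ranked[:k])
    matched = len(picked & set(reference[:k]))
    return matched / k if k else 0.0

# e.g., with 1 of 3 picked sentences matching: recall = 1/3 ~ 0.33
print(recall_at_ratio([3, 1, 5, 2, 4, 6, 7],
                      [2, 6, 3, 1, 4, 5, 7], total_sentences=7))
```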