Content Analysis:
Automated Summarizing
Prof. Marti Hearst
SIMS 202, Lecture 16
Summarization
What is it for?
What kinds of summaries are there?
Abstracts
Extracts
Highlights
Reduce in complexity, and hence in length, while
retaining some of the essential qualities of the
original. (Kupiec et al. 95)
Other definitions?
Difficult to Evaluate Quality
Abstracts
Act as Document Surrogates
Intermediate point between title and
entire document
Summarize main points
Cohesive narratives
Reporting vs. Critical
How to Generate Abstracts Automatically?
“Understand” the text and
summarize?
Automated text understanding is still not
possible.
Approximations to this require huge
amounts of hand-coded knowledge.
Hard even for people (remember the
SATs?)
Extracts
Excerpt directly from source material
Present inventory of all major topics or
points
Easier to automate than Abstracts
Automatic Extracting
Define Goal
Summarize main point?
Create a survey of topics?
Show relation of document to user’s query
terms?
Show the context of document (for web
page excerpts)?
Example search-result excerpts (showing document context for web pages):

2. lecture 18
Lecture 18. Index language functions (Text Chapter 13) Objectives. The student should understand the principle of request-oriented (user-centered)...
http://oriole.umd.edu:8000/courses/670/spring97/lect18.html
- size 4K - 23-Apr-97 - English

5. Actions/change, Accepted articles
Research area of. Planning and Scheduling. Received Research Articles. The following articles have been received for the ETAI area "Planning and...
http://www.ida.liu.se/ext/etai/planning/nj/received.html
- size 2K - 8-Sep-97 - English

8. Wilson Readers' Guide Abstracts
Wilson Readers' Guide Abstracts. Wilson Readers' Guide Abstracts includes citations and abstracts for articles from over 250 of the popular English...
http://www.ovid.com/db/databses/wrga.htm
- size 3K - 29-May-97 - English
Automating Extracting
Just about any simple algorithm can
get “good” results for coarse tasks
Pull out “important” phrases
Find “meaningfully” related words
Create extract from document
Major problem: Evaluation
Need to define goal or purpose
Human extractors agree on only about 25% of
sentences (Rath et al. 61)
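To make "just about any simple algorithm" concrete, here is a minimal frequency-based sentence extractor. This is a sketch rather than any method from the lecture: the stopword list, the sentence splitter, and the scoring rule are all simplifying assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that",
             "for", "it", "on", "with"}

def simple_extract(text, n_sentences=3):
    """Score each sentence by the average document frequency of its content
    words, then return the top sentences in their original document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower())
                   if w not in STOPWORDS)
    scored = []
    for i, sent in enumerate(sentences):
        content = [w for w in re.findall(r"[a-z]+", sent.lower())
                   if w not in STOPWORDS]
        if content:
            score = sum(freq[w] for w in content) / len(content)
            scored.append((score, i))
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:n_sentences])
    return [sentences[i] for i in keep]
```

Even a crude scorer like this tends to surface topical sentences, which is exactly why evaluation, not generation, is the hard part.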
Summary of Summary Paper
Kupiec, Pedersen, and Chen, SIGIR 95
To summarize is to reduce in complexity, and hence in length,
while retaining some of the essential qualities of the original.
This paper focuses on document extracts, a particular kind of
computed document summary.
Document extracts consisting of roughly 20% of the original can
be as informative as the full text of a document, which suggests
that even shorter extracts may be useful indicative summaries.
The trends in our results are in agreement with those of
Edmundson, who used a subjectively weighted combination of
features as opposed to training the feature weights using a
corpus.
We have developed a trainable summarizer program that is
grounded in a solid statistical framework.
Text Pre-Processing
The following steps are typical:
Tokenization
Morphological Analysis (Stemming)
inflectional, derivational, or crude IR methods
Part-of-Speech Tagging
I/Pro see/VP Pathfinder/PN on/P Mars/PN ...
Phrase Boundary Identification
[Subj I] [VP saw] [DO Pathfinder] [PP on Mars] [PP
with a telescope].
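A sketch of these steps using NLTK follows; the toolkit choice is an assumption (the lecture names the steps, not the tools), and the required resource downloads are noted in comments.

```python
import nltk
from nltk.stem import PorterStemmer

# One-time resource downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "I saw Pathfinder on Mars with a telescope."

tokens = nltk.word_tokenize(text)                    # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]    # crude IR-style stemming
tagged = nltk.pos_tag(tokens)                        # part-of-speech tagging

# Rough phrase-boundary identification via noun-phrase chunking.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))
```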
Extracting
Example: sentence extraction from a
single document (Kupiec et al.)
Start with training set of manually generated extracts
(This allows for objective evaluation.)
Create heuristics
Train Classification Function to estimate
the probability a sentence is included in
the extract.
42% of assigned sentences actually belonged in the extracts.
Heuristic Feature Selection
Sentence Length Cut-off
Key fixed-phrases
“this letter”, “in conclusion”
phrases appearing right after the conclusions section heading
Position
Of paragraph in document (first & last 10
paragraphs)
Of sentence within paragraph (first, last, median)
Thematic Words
Most frequent content words
See Choueka article in reader
Uppercase Words (proper names)
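A sketch of these heuristics as boolean sentence features is below. The phrase list comes from the slide, but the length threshold and the shape of the position test are illustrative placeholders, not the paper's exact values.

```python
FIXED_PHRASES = ("this letter", "in conclusion")   # sample cue phrases from the slide

def sentence_features(tokens, sent_slot, para_index, n_paragraphs, thematic_words):
    """Boolean features for one sentence.  `sent_slot` is 'first', 'last',
    or 'median' within its paragraph; thresholds here are illustrative."""
    text = " ".join(tokens).lower()
    return {
        "length":    len(tokens) > 5,                            # length cut-off
        "fixed":     any(p in text for p in FIXED_PHRASES),      # key fixed phrases
        "paragraph": para_index < 10 or para_index >= n_paragraphs - 10,
        "position":  sent_slot in ("first", "last", "median"),
        "thematic":  any(t.lower() in thematic_words for t in tokens),
        "uppercase": any(t[:1].isupper() for t in tokens[1:]),   # rough proper-name test
    }
```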
Classifier Function
For each sentence S, compute the
probability that S will appear in extract E:
If a feature appears in sentences chosen to
be in extracts, and not in other sentences,
that feature is useful.
If a sentence contains many of the useful
features, that sentence is likely to be
chosen for the extract.
Classifier Function
Compute:
How likely is each feature to occur anywhere in any document?
How likely is each feature to occur in a
sentence that ends up in an extract?
Combine feature scores for a sentence
to compute probability for that sentence
to be included in the extract.
Let S be the target extract from the training set, and let s be a sentence.
Assume k features F_i, i = 1, ..., k. By Bayes' rule:

    P(s \in S \mid F_1, F_2, \ldots, F_k)
      = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S) \, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}

Assuming the features are independent, we simplify this and compute it as:

    P(s \in S \mid F_1, \ldots, F_k)
      \propto \frac{P(F_1 \mid s \in S) \, P(F_2 \mid s \in S) \cdots P(F_k \mid s \in S)}{P(F_1) \, P(F_2) \cdots P(F_k)}

where

    P(F_i) = \frac{1}{N} \times (\text{number of times } F_i \text{ appears in all documents})

    P(F_i \mid s \in S) = \frac{1}{N} \times (\text{number of times } F_i \text{ appears in a sentence that ends up in an extract})

and N is the number of sentences in all documents in the training set.
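A sketch of this computation in code, assuming a feature function that maps a sentence to a dict of booleans (as sketched earlier); the add-one smoothing is my assumption, added so that unseen features do not produce log(0).

```python
from collections import Counter
from math import log

def train(documents, extracts, feature_fn):
    """Count feature occurrences over all N training sentences, and over
    the subset of sentences that end up in an extract."""
    n, in_all, in_extract = 0, Counter(), Counter()
    for doc, extract in zip(documents, extracts):
        for sent in doc:
            n += 1
            for f, present in feature_fn(sent).items():
                if present:
                    in_all[f] += 1
                    if sent in extract:
                        in_extract[f] += 1
    return n, in_all, in_extract

def score(sent, n, in_all, in_extract, feature_fn):
    """Log of prod_i P(F_i | s in S) / P(F_i) for the features present in
    sentence s; add-one smoothing is an assumption made here."""
    total = 0.0
    for f, present in feature_fn(sent).items():
        if present:
            total += log((in_extract[f] + 1) / (n + 1))
            total -= log((in_all[f] + 1) / (n + 1))
    return total
```

Sentences are then ranked by this score, and the top-scoring ones form the extract.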
Evaluation
Corpus of Manual Extracts
Engineering journal articles, average length 86
sentences
188 document/extract pairs from 21 journals
Statistics for manual extracts (training set):

    Direct Sentence Matches        451   (79%)
    Direct Joins                    19    (3%)
    Unmatchable Sentences           50    (9%)
    Incomplete Single Sentences     21    (4%)
    Incomplete Joins                27    (5%)
    Total Extract Sentences        568

    (Join: a sentence combined with other material.)
Evaluation
Training set vs. Testing Set
Must keep them separate for legitimate
results
Danger to avoid: over-fitting
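For example, a minimal document-level split (the function name and the test fraction are illustrative):

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Hold out whole document/extract pairs: the classifier never sees
    test sentences during training, which guards against over-fitting."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return pairs[:cut], pairs[cut:]
```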
Evaluation
Baseline:
Use only sentences from the beginning of the document, applying the length cut-off.
121 of the manually extracted sentences were selected by the algorithm (24%).
Results using classifier:
The algorithm assigns the same number of sentences as the manual abstractors did.
195 direct matches + 6 direct joins = 35% correct
When extracts are larger (25% of the size of the original text), the algorithm selects 84% of the extracted sentences.
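The baseline is easy to state in code; a sketch follows, with the cut-off value as an illustrative placeholder.

```python
def lead_baseline(sentences, k, min_tokens=5):
    """Take the first k sentences from the beginning of the document
    that pass the length cut-off (min_tokens is a placeholder value)."""
    return [s for s in sentences if len(s.split()) > min_tokens][:k]
```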
Evaluation
Performance for each feature:

    Feature             A            B
    Paragraph           163 (33%)    163 (33%)
    Fixed Phrases       145 (29%)    209 (42%)
    Length Cut-Off      121 (24%)    217 (44%)
    Thematic Word       101 (20%)    209 (42%)
    Uppercase Word      211 (42%)    211 (42%)

A: Sentence-level performance for each individual feature alone. If many sentences share the same feature, they are placed in the abstract in order of the sentence's appearance within the document.
B: How performance varies as features are combined cumulatively from the top down.
Thought Points
How well might this work on a
different kind of collection?
How to summarize a collection?