Transcript DC-NLP-2014-08 Tony Davis

Segmentation in NLP:
A quick survey of
techniques and applications
Tony Davis (3M Health Information Systems)
DC NLP meetup
13 August 2014
What is segmentation?
• Dividing a document, broadly defined as a sequential information-bearing
structure, into parts that have some internal coherence
– Well known examples: book chapters, episodes of TV miniseries
(segments are contiguous, strictly ordered, disjoint, and exhaustive)
– More examples: sections and subsections of a textbook, topically-based
segments defined by an index, plays in a sports broadcast
(these depart from the prototype; they’re hierarchical, overlapping, “fuzzy”)
– Notice that these examples are more or less “preset” at the time the
document is created, but ad hoc segments can be useful, too
– Labeling the segments with good terms is an additional aspect of
segmentation
Examples: chapter titles, section headers in a clinical report, descriptions of
plays in a football game
Why is segmentation useful?
• Segmentation guides and improves the user
experience
• Segmentation affects meaning
What are we trying to do with
automatic segmentation?
• Create “good”/”useful” segments, label them appropriately,
and relate them to one another
• What makes a good segment?
– Highly dependent on type of content (baseball game,
cooking show, feature-length movie, talk show)
– Thus, the mix of techniques for automatic segmentation
will vary
• Relating segments to one another?
– Topical similarity: the same people speaking about the
same things, similar players or similar kinds of plays,
common themes, …
– Continuity of narrative (subplots in a TV comedy series,
continuation of a project on home repair show)
Topic-based segmentation
• Topic segmentation: dividing a document into topically coherent segments
– Possibly a partition of the document (exhaustive, non-overlapping segments)
– But ad hoc segments may be more useful in some circumstances
– Labeling the segments with good terms is often treated separately
• Several approaches used:
– Discourse features
• Some signal a topic shift; others a continuation
• Highly domain-specific
– Similarity measures between adjacent blocks of text
• Use document similarity measures, as in TextTiling (Hearst, 1994) or Choi’s
algorithm (Choi, 2000)
• Posit boundaries at points where similarity is low
– Lexical chains: repeated occurrences of a term (or of closely related terms)
• Again, posit boundaries where cohesion is low, i.e., where few lexical chains cross the
boundary (e.g., Galley et al., 2003)
Topic segmentation – text similarity
• Choi’s algorithm (Choi, 2000)
– Construct a similarity matrix of all sentence pairs
– “Sharpen” the distinctions by replacing similarity values with their ranks in the
local region
– Coherent regions appear as brighter
square areas along the diagonal;
segment boundaries lie at their
corners
Freddy Y.Y. Choi. 2000. Advances in
domain independent linear text
segmentation. Proceedings of NAACL
2000, 109-117.
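For concreteness, here is a minimal sketch of the matrix-and-ranking step, assuming cosine
similarity over bag-of-words sentence vectors; the divisive clustering that Choi uses to
actually place boundaries is omitted.

# Sketch of Choi-style similarity ranking (assumes cosine over bag-of-words
# sentence vectors; the divisive clustering that places boundaries is omitted).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_matrix(sentences, mask_size=5):
    """Build a sentence-pair similarity matrix, then replace each cell with
    the proportion of its local neighbours (a mask_size x mask_size region)
    that have a lower similarity value."""
    vectors = CountVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(vectors)
    n = sim.shape[0]
    half = mask_size // 2
    ranks = np.zeros_like(sim)
    for i in range(n):
        for j in range(n):
            lo_i, hi_i = max(0, i - half), min(n, i + half + 1)
            lo_j, hi_j = max(0, j - half), min(n, j + half + 1)
            region = sim[lo_i:hi_i, lo_j:hi_j]
            ranks[i, j] = (region < sim[i, j]).sum() / (region.size - 1 or 1)
    return ranks  # bright square blocks along the diagonal = coherent regions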
Topic segmentation – lexical chains
• LCSeg (Galley, et al., 2003)
– Based on lexical chains: repeated occurrences of a term in a document
– Each chain receives a score based on the number of terms within it, and its
length (shorter chains get a higher score)
– Cohesion at each potential boundary is measured by comparing scores of
chains spanning that boundary and chains on either side near that boundary
– Segments are postulated where cohesion is low
Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse
Segmentation of Multi-Party Conversation. Proceedings of the 41st Annual Meeting of
the ACL, 562-569
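A rough sketch of the idea: the chain score here grows with term frequency and shrinks with
chain span (an assumption; Galley et al.'s exact weighting differs), and cohesion at a
candidate boundary is the cosine between chain-score vectors of the adjacent windows.

# Rough LCSeg-style sketch: chains are term repetitions; a chain's score grows
# with its frequency and shrinks with its span; cohesion at a candidate
# boundary is the cosine between chain-score vectors of adjacent windows.
import math
from collections import defaultdict

def lexical_chains(sentences):
    positions = defaultdict(list)          # term -> sentence indices
    for i, sent in enumerate(sentences):
        for term in set(sent.lower().split()):
            positions[term].append(i)
    n = len(sentences)
    chains = {}
    for term, occ in positions.items():
        if len(occ) < 2:
            continue
        span = occ[-1] - occ[0] + 1
        chains[term] = (occ, len(occ) * math.log(n / span))  # short, dense => high
    return chains

def cohesion(chains, boundary, window=3):
    left = range(boundary - window, boundary)
    right = range(boundary, boundary + window)
    dot = l2 = r2 = 0.0
    for term, (occ, score) in chains.items():
        lv = score if any(i in occ for i in left) else 0.0
        rv = score if any(i in occ for i in right) else 0.0
        dot += lv * rv; l2 += lv * lv; r2 += rv * rv
    return dot / math.sqrt(l2 * r2) if l2 and r2 else 0.0
# Boundaries are posited at sharp local minima of cohesion().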
Topic segmentation – modeling term
influence and relatedness
• Model both the influence of a term beyond the sentence it occurs in and
semantic relatedness among terms
– The range of a term’s influence extends beyond the sentence it occurs in, but
how far? (relevance intervals)
– Semantic relatedness among terms (contextually mediated graphs)
• Apply this model to topic-based segmentation
• And now, a short diversion about streaming media…
Geetu Ambwani and Anthony R. Davis. 2010. Contextually-Mediated Semantic Similarity
Graphs for Topic Segmentation. Proceedings of the Workshop on Graph-based Methods
in NLP, 48th Annual Meeting of the ACL.
Automatic segmentation of streaming
media
• Streaming media presents particular difficulties
– Its timed nature makes navigation cumbersome, so a useful system
should extract relevant intervals of documents rather than present a list
of documents to the user
– Speech recognition is error-prone, so the system must compensate for
noisy input
• To address these issues, use a mix of statistical and symbolic NLP
– Create relevant intervals for each useful term, using term co-occurrence
information (the “COW model”)
– Term co-occurrence measured using pointwise mutual information (PMI)
within a corpus
PMI(X,Y) = log [ P(X,Y) / (P(X) P(Y)) ]
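A minimal sketch of how such a PMI table might be built from a corpus; the choice of
document-level co-occurrence as the counting unit is an assumption (a fixed-width window
would work the same way).

# Minimal sketch of corpus PMI for the COW model; "co-occurrence" here means
# both terms appear in the same document (an assumed counting unit).
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    n = len(documents)
    term_df, pair_df = Counter(), Counter()
    for doc in documents:
        terms = set(doc.lower().split())
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    pmi = {}
    for pair, joint in pair_df.items():
        x, y = tuple(pair)
        pmi[pair] = math.log((joint / n) / ((term_df[x] / n) * (term_df[y] / n)))
    return pmi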
“Original” StreamSage system
architecture
[Architecture diagram. Components: media archive; speech recognition (SR output); a large
text corpus feeding the co-occurring word (COW) model; WSD, anaphor resolution, and
relevance interval calculation; topic segmentation, query expansion, and compound interval
calculation; a virtual document collection; and user interfaces for applications.]
Relevance Intervals (ad hoc segments)
• Each RI is a contiguous segment of audio/video deemed relevant to
a term
• RIs are calculated for all content words (after lemmatization) and
multi-word expressions
• RI basis: sentence containing the term
• Each RI is expanded forward and backward to capture relevant material, using
techniques including:
1. Topic boundary detection by changes in COW values across sentences
2. Topic boundary detection via discourse markers
3. Synonym-based query expansion
4. Anaphor resolution
• Nearby RIs for the same term are merged
• Each RI is assigned a magnitude, reflecting its likely importance to a
user searching on that term, based on the number of occurrences
of the term in the RI, and the COW values of other words in the RI
with the term
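As a hedged illustration of the overall procedure (not the production system), the sketch
below builds RIs for one term from a hypothetical cow(term, sentence) relatedness score and
then merges nearby intervals; discourse markers, synonym expansion, anaphora, and the
magnitude calculation are left out.

# Sketch of relevance-interval construction for one term: start from sentences
# containing the term, extend across adjacent sentences whose COW relatedness
# to the term stays high, then merge nearby intervals.  cow() is a hypothetical
# scoring function; the real system also uses discourse markers, synonyms, and
# anaphor resolution.
def relevance_intervals(sentences, term, cow, threshold=0.2, merge_gap=2):
    seeds = [i for i, s in enumerate(sentences) if term in s.lower().split()]
    intervals = []
    for i in seeds:
        start = end = i
        while start > 0 and cow(term, sentences[start - 1]) >= threshold:
            start -= 1
        while end < len(sentences) - 1 and cow(term, sentences[end + 1]) >= threshold:
            end += 1
        intervals.append([start, end])
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= merge_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged  # a magnitude per RI could be added from term counts and COW values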
Relevance Intervals: an Example
• Index term: squatter
Paul Bew is professor of Irish politics at Queens University in
Belfast.
In South Africa the government is struggling to contain a growing
demand for land from its black citizens.
Authorities have vowed to crack down and arrest squatters illegally
occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South
Africans have seized government and privately-owned property.
Hundreds were arrested earlier this week and the government
hopes to move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer
loudly as their leaders deliver angry speeches against whites and
landlessness in South Africa.
“Must give us a place…”
• We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example
• Index term: squatter
Paul Bew is professor of Irish politics at Queens University in
Belfast.
In South Africa the government is struggling to contain a growing
demand for land from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally
occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South
Africans have seized government and privately-owned property.
[cow-expand]
Hundreds were arrested earlier this week and the government
hopes to move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer
loudly as their leaders deliver angry speeches against whites and
landlessness in South Africa.
“Must give us a place…” [topic segment boundary]
• We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example
• Index term: squatter
Paul Bew is professor of Irish politics at Queens University in Belfast. [topic
segment boundary]
In South Africa the government is struggling to contain a growing demand
for land from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally
occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans
have seized government and privately-owned property. [cow-expand]
Hundreds were arrested earlier this week and the government hopes to
move the rest out in the next two days. [merge nearby intervals]
NPR’s Kenneth Walker has a report. [merge nearby intervals]
Thousands of squatters in a suburb outside Johannesburg cheer loudly as
their leaders deliver angry speeches against whites and landlessness in
South Africa.
“Must give us a place…” [topic segment boundary]
• Two occurrences of squatter produce a complete merged
interval.
Merging RIs for multiple terms
[Diagram: separate RIs for Russia and for Iran, with occurrences of the original terms marked]
Note that this can only be done at query time, so it needs to be fairly quick and simple.
Merging RIs
[Diagram: activation spreading extends the RIs for each term]
Merging RIs
[Diagram: the extended RIs are combined into intervals relevant to both Russia and Iran]
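A toy sketch of the query-time step: widen each term's precomputed RIs slightly (a simple
stand-in for activation spreading) and intersect them; interval bounds here are sentence
indices, which is an assumption about the representation.

# Query-time sketch: for a multi-term query ("Russia and Iran"), combine the
# precomputed per-term RIs by intersecting them; a small pad widens each
# interval first so near-misses still intersect.
def merge_query_ris(ris_by_term, pad=1):
    def widen(intervals):
        return [(max(0, s - pad), e + pad) for s, e in intervals]
    merged = widen(next(iter(ris_by_term.values())))
    for intervals in list(ris_by_term.values())[1:]:
        widened = widen(intervals)
        merged = [(max(s1, s2), min(e1, e2))
                  for s1, e1 in merged
                  for s2, e2 in widened
                  if max(s1, s2) <= min(e1, e2)]
    return merged

# Example: merge_query_ris({"russia": [(10, 14)], "iran": [(13, 18)]}) -> [(12, 15)]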
Evaluating RI Quality
• Evaluation of retrieval effectiveness in timed media raises further
issues:
– Building a gold-standard is painstaking, and potentially quite subjective
– It’s necessary to measure how closely the system’s RIs match the gold
standard’s
– What’s a reasonable baseline?
• We created a gold standard of about 2300 RIs with about 200
queries on about 50 documents (NPR, CNN, ABC, and business
webcasts), and rated each RI on a scale of 1 (highly relevant) to 3
(marginally relevant).
• Testing performed on speech recognizer output
Evaluating RI Quality
• Measure amounts of extraneous and missed content
[Diagram: an ideal RI and a system RI on a timeline; content in the system RI but not in the
ideal RI is extraneous, content in the ideal RI but not in the system RI is missed by the
system]
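A small sketch of that measure, assuming intervals are (start, end) times in seconds and
that the intervals within each list are disjoint.

# Extraneous/missed content for one query: compare the system's RIs against
# the gold-standard ("ideal") RIs.  Assumes intervals within each list are
# disjoint and that neither list is empty.
def extraneous_and_missed(system_ris, ideal_ris):
    def total(intervals):
        return sum(e - s for s, e in intervals)
    def overlap(a, b):
        return sum(max(0, min(e1, e2) - max(s1, s2))
                   for s1, e1 in a for s2, e2 in b)
    shared = overlap(system_ris, ideal_ris)
    extraneous = total(system_ris) - shared   # in system RI but not in ideal RI
    missed = total(ideal_ris) - shared        # in ideal RI but missed by system
    return extraneous / total(system_ris), missed / total(ideal_ris)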
Evaluating RI Quality
Comparison of percentages of median extraneous and missed content over all
queries between system using COWs and system using only sentences with query
terms present:

relevance:      most relevant    relevant         most relevant    marginally
                                                  and relevant     relevant
system          extra   miss     extra   miss     extra   miss     extra   miss
with COWs        9.3    12.7     39.2    27.5     29.9    21.6     61.7    15.0
without COWs     0.0    57.7      0.3    64.7      0.0    63.4     18.8    60.2
How to spot shifts in topic
• Yesterday, I took my dog to the park.
• While there, I took him off the leash to get some
exercise.
• After 2 minutes, Spot began chasing a squirrel.
• ______________(Topic Shift)________________
• I realized I needed to go grocery shopping.
• So I went off to Trader Joe’s.
• Unfortunately, they were out of cashews.
Calculating connection strengths for
edges
• Construct a graph in which each node represents a (term, sentence) pair; a node exists
iff the sentence is contained in an RI for that term
Calculating connection strengths for
edges
• Label each sentence with the terms that have RIs including
that sentence; think of these as nodes in a graph
Calculating connection strengths for
edges
Now we’ll connect these nodes to one another,
with edges that have weights
Calculating connection strengths for edges
[Figure-only slides: the graph is built up as weighted edges are added between term nodes
in adjacent sentences]
Connection strength formula
and in general, for terms a and b in sentences i and i + 1 respectively:
[connection strength formula shown on slide]
Build a graph of the terms in the
document, sentence by sentence
Filtering edges in the graph
A real example (from CNN)
• S_190 We've got to get this addressed and
hold down health care costs.
• S_191 Senator Ron Wyden the optimist from
Oregon, we appreciate your time tonight.
• S_192 Thank you.
• ______________(Topic Shift)______________
• S_193 Coming up, the final day of free health
clinic in Kansas City, Missouri.
Graph representation of documents
After pruning low-weight edges, regions near a boundary tend to
be sparse
Data and parameter settings
• Two data sets
– Concatenated New York Times articles
• 186 pseudodocuments, each containing 20 articles
– Closed-captions for 13 TV shows
• News, talk shows, documentaries, lifestyle shows
• “Noisier” than news articles
• Parameter settings
– Edge-pruning threshold: remove all edges with a connection strength below a
set threshold (we tried a couple and usually used 0.5)
– “Normalized novelty”: on the two sides of a potential boundary, the number
of nodes labeled with the same terms, normalized by the total number of
terms
– Threshold (given normalized novelty): posit a boundary between sentences
where normalized novelty falls below a set threshold (we typically used 0.6)
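A minimal sketch of the boundary test described above, assuming each sentence has already
been labeled with the terms whose RIs cover it (the graph nodes); edge pruning is not shown.

# Boundary test using "normalized novelty": compare the term labels on the two
# sides of a candidate boundary and posit a boundary where the overlap,
# normalized by the total number of terms, falls below a threshold (~0.6 in
# the talk).
def boundaries(sentence_terms, window=3, threshold=0.6):
    """sentence_terms: list of sets, the terms labelling each sentence."""
    cuts = []
    for b in range(window, len(sentence_terms) - window):
        left = set().union(*sentence_terms[b - window:b])
        right = set().union(*sentence_terms[b:b + window])
        total = len(left | right)
        shared = len(left & right)
        if total and shared / total < threshold:
            cuts.append(b)          # boundary posited before sentence b
    return cuts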
Systems compared
C99: Implementation of Choi (2000), from MorphAdorner*
SN: Our system, using a single node for each term occurrence (no extension)
FE: Our system, using an extension of a fixed number of sentences for each term
from the sentence it occurs in
SS: Our system, using RIs, without “hard” boundaries determined by the modified
Choi algorithm
SS+C: Our full segmentation system, incorporating “hard” boundaries determined
by the modified Choi algorithm
* morphadorner.northwestern.edu/morphadorner/textsegmenter
Evaluation metrics
• How well does the hypothesized set of boundaries match the
true (reference) set?
• Pk (Beeferman, et al. 1997) and WindowDiff (Pevzner & Hearst,
2002)
– Both compare hypothesis to reference segmentation within a sliding window
– Pk is the proportion of windows in which hypothesis and reference disagree about
whether the two ends of the window fall in the same segment
– WindowDiff tallies, for each window, whether the hypothesis and reference contain
different numbers of boundaries
– Both commonly used instead of precision and recall, because they take
approximate matching into account
– They have drawbacks of their own, however
Doug Beeferman, Adam Berger, and John Lafferty. 1997. Text Segmentation Using Exponential
Models. Proceedings of EMNLP 2
Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric
for text segmentation. Computational Linguistics, 28:1
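For reference, minimal sketches of both metrics, with a segmentation represented as the set
of boundary positions (a boundary after sentence i is the integer i) and n sentences; the
window size defaults to half the average reference segment size, as on the next slides.

# Minimal Pk and WindowDiff sketches over boundary-position sets.
def pk(ref, hyp, n, k=None):
    k = k or max(1, round(n / (2 * (len(ref) + 1))))   # half mean segment size
    def same_segment(bounds, i, j):
        return not any(i <= b < j for b in bounds)
    errors = sum(same_segment(ref, i, i + k) != same_segment(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)

def window_diff(ref, hyp, n, k=None):
    k = k or max(1, round(n / (2 * (len(ref) + 1))))
    def count(bounds, i, j):
        return sum(1 for b in bounds if i <= b < j)
    errors = sum(count(ref, i, i + k) != count(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)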
Evaluation metrics
• Pk and WindowDiff in action: comparing two segmentations
Evaluation metrics
• Pk and WindowDiff: sliding window is half the average reference segment
size
Evaluation metrics
• Pk and WindowDiff in action: window slides one sentence forward
Evaluation metrics
• Pk and WindowDiff in action: still no disagreement between the two
segmentations within the window
Evaluation metrics
• Pk and WindowDiff in action: one black mark against the hypothesis
segmentation, which posits a boundary where the reference doesn’t
Evaluation metrics
• Pk and WindowDiff in action: the same mistake counts again against the
hypothesis (note that mistakes closer to reference boundaries appear in
fewer windows, and are therefore penalized less)
Results on pseudodocuments
185 documents, containing 20 NYT articles each
Number of boundaries not specified to systems
system   precision   recall   F       Pk      WindowDiff
C99      0.404       0.569    0.467   0.338   0.360
SN       0.096       0.112    0.099   0.570   0.702
FE       0.265       0.140    0.176   0.478   0.536
SS       0.566       0.383    0.448   0.292   0.317
SS+C     0.578       0.535    0.537   0.262   0.283
Results on pseudodocuments
86 documents out of the 186, for which C99 and SS+C both
find at least 20 boundaries; top 20 boundaries selected
system   prec = rec = F   Pk      WindowDiff
C99      0.530            0.222   0.231
SS+C     0.643            0.192   0.201
TV show closed-captions: interannotator agreement on segmentation
• Annotators note major and minor boundaries, using 1-5 rating scale
• As expected, IAA is low, so we create a reference annotation
TV show closed-captions: interannotator agreement on segmentation
• Pk values between pairs of annotators: all
boundaries and major boundaries
TV show closed-captions:
segmentation
• Accuracy is low, which is unsurprising given the low IAA
All topic boundaries
system   precision   recall   F       Pk      WindowDiff
C99      0.197       0.186    0.184   0.476   0.507
SS+C     0.315       0.208    0.240   0.421   0.462

Major topic boundaries only
system   precision   recall   F       Pk      WindowDiff
C99      0.170       0.296    0.201   0.637   0.812
SS+C     0.271       0.316    0.271   0.463   0.621
Using segment-level metadata for
sports broadcast segmentation
• Play-by-play data for sports
– Available for several sports in near real-time
– Particularly useful for sports that are inherently segmented
into discrete plays
– Advantages of consistency and quality
• Many other types of segment level metadata,
depending on type of content
– DVD chapters
– Advertising breaks in TV shows
– Screenplays
Segment-level metadata for sports
• Closed captions can be problematic
– Sometimes describe clearly what’s going on…
>> Kenny: THEY START FROM THEIR OWN 35.
AND THIS IS GRANT WHO FUMBLED MOMENTS
AGO. HAD A TERRIFIC HALF OF THE SEASON
• Here is data for this play from Stats, Inc. (near real-time):
<play quarter="1" play-start-time="14:33" play-stop-time="14:27"
      down="1" distance="10"
      away-score-before="7" away-score-after="7"
      home-score-before="0" home-score-after="0"
      yardline="GB35" end-yardline="GB43"
      possession="GB" end-possession="GB"
      details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/>
Segment-level metadata for sports
• Alignment of this type of metadata with the video is fairly easily achieved,
because the game clock is displayed on the video
[Video frame with the on-screen game clock and scoreboard: SEA 7, GB 0, 14:33 Q1, DN 1 YD 10]
• Stats, Inc. data for play, with added video times:
<play quarter="1" play-start-time="14:33" play-stop-time="14:27"
      video-start-time="0:12:41" video-stop-time="0:12:45"
      down="1" distance="10"
      away-score-before="7" away-score-after="7"
      home-score-before="0" home-score-after="0"
      yardline="GB35" end-yardline="GB43"
      possession="GB" end-possession="GB"
      details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/>
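A sketch of how such a record could be turned into a video segment; the attribute names
follow the example above, and clock_to_video is a hypothetical lookup built from reading
the on-screen game clock.

# Turn a Stats, Inc.-style play record into a labelled video segment.
# clock_to_video(quarter, game_clock) is a hypothetical mapping derived from
# the clock displayed on the video.
import xml.etree.ElementTree as ET

def play_segment(play_xml, clock_to_video):
    play = ET.fromstring(play_xml)
    start = clock_to_video(play.get("quarter"), play.get("play-start-time"))
    stop = clock_to_video(play.get("quarter"), play.get("play-stop-time"))
    return {
        "video_start": start,
        "video_stop": stop,
        "label": play.get("details"),        # e.g. "Ryan Grant rush ... 8 yards ..."
        "down": int(play.get("down")),
        "distance": int(play.get("distance")),
    }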
Finding interesting plays in a football game
Segmenting clinical documents
• Clinical documents are divided into sections, but
they can be messy, and the section boundaries
aren’t always clear
• Segmentation affects semantics; family medical
history is different from patient’s current
diagnosis
• Medical coding (and ultimately reimbursement)
depends on accurately discerning the structure of
clinical reports
• Thus, automated medical coding requires, among
other things, good segmentation (“regioning”)
Where segmentation fits into
automated medical coding
[Flowchart: a Clinical Document is split into Extracted Sections; codability of each
section is assessed and sections are tagged (code_diagnosis? -> Auto-code diagnosis;
code_historical? -> Auto-code historical; code_procedure? -> Auto-code procedure;
no_coding? -> Ignore); the Computer Assisted Coding Module then produces billing codes,
e.g. V543.12, 432.11]
Machine learning for clinical document
segmentation
• Features used include:
– Common section header terms
– Formatting and punctuation (e.g., all caps, colon at end of
line)
– Semantic type of terms within a sentence (e.g., body parts,
medications, procedures, findings)
• For each sentence, generate a label:
– new_region or continuing_region
– Labels for a region’s use can be hypothesized at the same time, or
added later
• Results are close to human levels of accuracy on many
kinds of documents
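A hedged sketch of this sentence-classification setup using generic scikit-learn components
(the actual 3M system and its feature set are not shown); the semantic-type lookup is a
stand-in dictionary, and the header-term list is illustrative.

# Per-sentence classification into new_region vs. continuing_region, using
# features of the kinds listed above.  SEMANTIC_TYPES stands in for a medical
# lexicon lookup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

SEMANTIC_TYPES = {"knee": "body_part", "ibuprofen": "medication"}   # stand-in

def features(sentence):
    tokens = sentence.split()
    return {
        "all_caps": sentence.isupper(),
        "ends_with_colon": sentence.rstrip().endswith(":"),
        "first_token": tokens[0].lower() if tokens else "",
        "has_header_term": any(t.lower() in {"history", "plan", "assessment"}
                               for t in tokens),
        **{"sem_" + SEMANTIC_TYPES[t.lower()]: True
           for t in tokens if t.lower() in SEMANTIC_TYPES},
    }

def train(sentences, labels):
    # labels: "new_region" or "continuing_region", one per sentence
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([features(s) for s in sentences], labels)
    return model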
Summing up
• Segmentation problems are ubiquitous in NLP
• Motivations are both user and machine driven
• Many kinds of information to exploit,
depending on the data and the application
• Resources available to get started