A Sentence Boundary Detection System
Student: Wendy Chen
Faculty Advisor: Douglas Campbell
1
Introduction
• People use . ? and ! to mark the ends of sentences.
• End-of-sentence marks are overloaded.
2
Introduction
• The period is the most ambiguous mark.
• It also appears in decimals, e-mail addresses, abbreviations,
initials in names, and honorific titles.
• For example:
U.S. Dist. Judge Charles L. Powell denied all motions made by
defense attorneys Monday in Portland's insurance fraud trial.
Of the handful of painters that Austria has produced in the 20th
century, only one, Oskar Kokoschka, is widely known in U.S. This
state of unawareness may not last much longer.
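To make the overloading concrete, the following is a minimal Python sketch (added for illustration; not part of the original slides) of what happens when every end-of-sentence mark followed by whitespace is treated as a boundary. The input is the first example above.

import re

text = ("U.S. Dist. Judge Charles L. Powell denied all motions made by "
        "defense attorneys Monday in Portland's insurance fraud trial.")

# Naive rule: any . ? or ! followed by whitespace ends a sentence.
fragments = re.split(r"(?<=[.?!])\s+", text)

for fragment in fragments:
    print(fragment)
# Prints four fragments ("U.S.", "Dist.", "Judge Charles L.", ...) even though
# the text is a single sentence.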
3
Introduction
• Sentence boundary detection by humans is
tedious, slow, error-prone, and extremely
difficult to codify.
• Algorithmic syntactic sentence boundary
detection is a necessity.
4
Five Applications
I. Part-of-speech tagging
– Examples of parts of speech include nouns,
verbs, adverbs, prepositions, conjunctions, and
interjections.
– John [noun] Smith [noun], the [determiner]
president [noun] of [preposition] IBM [noun]
announced [verb] his [pronoun] resignation
[noun] yesterday [noun].
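As a rough illustration (not from the slides), the same kind of tagging can be produced with NLTK's off-the-shelf tagger, assuming NLTK and its tokenizer and tagger data are installed; the tag set uses codes such as NNP rather than the plain labels above.

import nltk  # assumes NLTK plus its tokenizer and tagger models are installed

sentence = "John Smith, the president of IBM announced his resignation yesterday."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('John', 'NNP'), ('Smith', 'NNP'), (',', ','), ('the', 'DT'),
#       ('president', 'NN'), ('of', 'IN'), ('IBM', 'NNP'), ...]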
5
Five Applications
II. Natural language parsing
– Identify the hierarchical constituent structure in a
sentence.
[Parse tree for "John Smith, the president of IBM, announced his resignation yesterday" with constituents S, NP, and PP and part-of-speech tags NNP, DT, NN, IN, VBD, and PRP$]
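A bracketed version of roughly this structure can be written down and displayed with NLTK's Tree class. This is an illustrative reconstruction, not the exact tree from the slide.

from nltk import Tree  # assumes NLTK is installed

# Approximate constituent structure for the slide's example sentence.
tree = Tree.fromstring(
    "(S"
    "  (NP (NP (NNP John) (NNP Smith))"
    "      (NP (NP (DT the) (NN president)) (PP (IN of) (NP (NNP IBM)))))"
    "  (VBD announced)"
    "  (NP (PRP$ his) (NN resignation))"
    "  (NP (NN yesterday)))"
)
tree.pretty_print()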
6
Five Applications
III. Reading level of a document
– Readability formulas such as the Bormuth Grade Level
and the Flesch Reading Ease use information about the
sentences in a document (the Flesch formula is sketched below).
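The standard Flesch Reading Ease formula divides by the number of sentences, so the score depends directly on how accurately sentences are counted. A small sketch:

def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score; higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# The same text with two different sentence counts gets two different scores.
print(flesch_reading_ease(120, 8, 180))  # about 64.7
print(flesch_reading_ease(120, 5, 180))  # about 55.6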
7
Five Applications
IV. Text editors
– The command to move to the end of a sentence.
V. Plagiarism detection
8
Related Work
• As of 1997:
“identifying sentences has not received as much attention
as it deserves.” [Reynar and Ratnaparkhi 1997]
“Although sentence boundary disambiguation is
essential . . ., it is rarely addressed in the literature
and there are few public-domain programs for performing
the segmentation task.” [Palmer and Hearst 1997]
• Two approaches
– Rule-based approach
– Machine-learning-based approach
9
Related Work
I. Rule-based approach
– Regular expressions
• [Cutting 1991]
• Mark Wasson converted a grammar into a finite
automaton with 1,419 states and 18,002 transitions.
– Lexical endings of words
• [Müller 1980] uses a large word list.
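In the same spirit as these systems (and only as an illustration, not a reproduction of any of them), a regular-expression rule might accept a mark as a boundary only when the token it ends is not on an abbreviation list:

import re

ABBREVIATIONS = {"U.S.", "Dist.", "Mr.", "Dr.", "Rd."}  # illustrative list only

def rule_based_split(text):
    """Treat . ? ! followed by whitespace and a capital letter as a boundary,
    unless the token it ends is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        if last_token not in ABBREVIATIONS:
            sentences.append(text[start:end].strip())
            start = end
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

print(rule_based_split("Mr. Smith denied all motions. The trial continues."))
# ['Mr. Smith denied all motions.', 'The trial continues.']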
10
Related Work
II. Machine-learning-based approach
– [Riley 1989] uses regression trees.
– [Palmer and Hearst 1997] use decision trees or
neural networks.
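As a schematic example (not the systems cited above, and assuming scikit-learn is available), a decision tree can be trained on simple features of the context around each period, such as whether the next word is capitalized and whether the preceding token is a known abbreviation or initial:

from sklearn.tree import DecisionTreeClassifier

# Each row: [next word capitalized, previous token is a single initial,
#            previous token is on an abbreviation list]; label 1 = real boundary.
X = [
    [1, 0, 0],  # "...trial. Of the handful..."   -> boundary
    [1, 0, 1],  # "...to the U.S. While they..."  -> boundary after an abbreviation
    [0, 0, 1],  # "...the U.S. market..."         -> not a boundary
    [1, 1, 0],  # "...Charles L. Powell..."       -> initial, not a boundary
]
y = [1, 1, 0, 0]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 0, 0]]))  # predicts 1 (a boundary) for this pattern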
11
Our Approach
• Punctuation rules to disambiguate end-of-sentence punctuation.
• A punctuation rule-based model is simple in
design and easy to modify.
12
Our Reference Corpus
• A “sentence” reference corpus is a corpus
with each sentence put on its own line.
• We manipulated the Brown Corpus to create
a 51,590-sentence reference corpus (one way to
build such a file is sketched below).
• Two sections - training text and final-run
text.
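The slides do not show how the file was built; one plausible way (an assumption about tooling, not the authors' actual procedure) is to write NLTK's copy of the Brown Corpus out one sentence per line.

from nltk.corpus import brown  # assumes the NLTK Brown Corpus data is installed

# Write a "sentence" reference corpus: one sentence per line.
with open("reference_corpus.txt", "w") as out:
    for sent in brown.sents():
        out.write(" ".join(sent) + "\n")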
13
High Level Architecture
[Architecture diagram: a text document flows into the Sentenizer Module (Rules + Sentence Recognizer), which emits sentences; the Analysis Module compares those sentences against the Reference Corpus]
14
Our Sentenizer Module
• Our sentenizer module has two parts:
– A set of end-of-sentence punctuation rules.
– An engine to apply the rules.
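A minimal sketch of that structure, with hypothetical rule names (the actual rules appear in the experiment runs later in the transcript):

# Each rule looks at the token containing the mark and the next token, and
# answers True (boundary), False (no boundary), or None (no opinion).
ABBREVIATIONS = {"U.S.", "Dist.", "Rd.", "Dr.", "Mr."}  # illustrative only

def abbreviation_rule(token, next_token):
    return False if token in ABBREVIATIONS else None

def capitalization_rule(token, next_token):
    return next_token[:1].isupper()

RULES = [abbreviation_rule, capitalization_rule]  # the engine applies these in order

def is_sentence_boundary(token, next_token):
    for rule in RULES:
        verdict = rule(token, next_token)
        if verdict is not None:
            return verdict
    return False

print(is_sentence_boundary("trial.", "Of"))    # True
print(is_sentence_boundary("U.S.", "market"))  # False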
15
Our Analysis Module
[Diagram: the Analysis Module runs diff between the Sentenizer's output and the Reference Corpus, producing <rule>.txt and rules_summary]
16
Our Analysis Module
• <rule>.txt
The Japanese want to increase exports to the U.S. ||||
While they have been curbing shipments, they have watched
Hong Kong step in and capture an expanding share of the
big U.S. market.
The Hartsfield home is at 637 E. Pelham Rd. @@PE@@ NE.
But what came in was piling up. |||| @@PE@@ The nearest
undisrupted end of track from Boston was at Concord, N. H.
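The diagram on the previous slide lists diff as the comparison step; a rough Python equivalent of that comparison (the file names here are assumptions) could use difflib.

import difflib

with open("sentenizer_output.txt") as f:
    produced = f.read().splitlines()
with open("reference_corpus.txt") as f:
    reference = f.read().splitlines()

# Lines only in the reference are missed boundaries; lines only in the output
# are spurious ones.  In <rule>.txt, markers such as |||| and @@PE@@ appear to
# flag the points of disagreement.
for line in difflib.unified_diff(reference, produced, "reference", "sentenizer", lineterm=""):
    print(line)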
17
Overview of Experiment Results
Run Number | Key description                           | Percentage of correctly marked sentences
-----------|-------------------------------------------|-----------------------------------------
Run 1      | All marks                                 | 84.35%
Run 2      | Mark at token end                         | 89.03%
Run 3      | Correction of text                        | 88.31%
Run 4      | Double punctuation endings                | 89.01%
Run 5      | Check next word capitalization            | 89.53%
Run 6      | Correction of text                        | 89.55%
Run 7      | Modify capitalization function            | 91.41%
Run 8      | Correction of text                        | 91.35%
Run 9      | Modify capitalization function            | 91.35%
Run 10     | Correction of text                        | 90.40%
Run 11     | Correction of text                        | 90.58%
Run 12     | Add abbreviation list                     | 95.60%
Run 13     | Check single initials                     | 98.94%
Run 14     | Form black chunk of token                 | 99.12%
Run 15     | Check numbering lists and double initials | 99.85%
Run 16     | Reduce abbreviation list                  | 98.90%
Run 17     | Check special abbreviations               | 99.83%
Run 18     | Confidence ratings                        | 99.83%
Run 19     | Check sentences with ellipsis points      | 99.83%
Run 20     | Check sentences with parenthesis marks    | 99.83%
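Run 13 (single initials) and Run 15 (numbering lists) describe pattern checks; a schematic, purely illustrative version of those checks might look like this:

import re

def is_single_initial(token):
    """A capital letter followed by a period, as in 'Charles L. Powell'."""
    return re.fullmatch(r"[A-Z]\.", token) is not None

def is_list_number(token):
    """An enumerator such as '3.' or 'IV.' at the start of a list item."""
    return re.fullmatch(r"(\d+|[IVXLC]+)\.", token) is not None

for tok in ["L.", "IV.", "3.", "trial."]:
    print(tok, is_single_initial(tok) or is_list_number(tok))
# L. True / IV. True / 3. True / trial. False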
18
Evaluation on Testing Corpus
• Testing corpus - 26,647 sentences
• Sentenizer output - 26,613 sentences
• Total of 120 errors
– 43 false positives
– 77 false negatives
• 99.84% accuracy
19
Contributions
• Highly accurate
– 99.8% accuracy rate
– Comparable to or better than existing systems
• Highly efficient
– About 50 double-spaced papers per second
– About 1,000 sentences per second
• Easily modifiable
– A rule-based model
20