EE669 Lecture 5 - National Cheng Kung University
Lecture 4: Corpus-Based Work
(Chapter 4 of Manning and Schutze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2008/10/13
(Slides from Dr. Mary P. Harper,
http://min.ecn.purdue.edu/~ee669/)
Fall 2001
EE669: Natural Language Processing
1
What is a Corpus?
(1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse.
(2) In linguistics and lexicography, a body of texts,
utterances, or other specimens considered more or less
representative of a language, and usually stored as an
electronic database.
Currently, computer corpora may store many millions of
running words, whose features can be analyzed by means
of tagging and the use of concordancing programs.
[from The Oxford Companion to the English Language, ed. McArthur &
McArthur 1992]
Corpus-Based Work
• Text corpora are usually big, often representative
samples of some population of interest. For
example, the Brown Corpus collected by Kucera
and Francis was designed as a representative
sample of written American English. Balance of
subtypes (e.g., genre) is often desired.
• Corpus work involves collecting a large number of
counts from corpora that need to be accessed
quickly.
• There exists some software for processing corpora
(see useful links on course homepage).
Taxonomies of Corpora
• Media: printed, electronic text, digitized audio,
video, OCR text, etc.
• Raw (plain text) vs. Annotated (use a markup
scheme to add codes to the file, e.g., part-of-speech tags)
• Language variables:
– monolingual vs. multilingual
– original vs. translation
Major Suppliers of Corpora
• Linguistic Data Consortium (LDC):
http://www.ldc.upenn.edu
• European Language Resources Association (ELRA):
http://www.icp.grenet.fr/ELRA/
• Oxford Text Archive (OTA): http://ota.ahds.ac.uk
• Child Language Data Exchange System (CHILDES):
http://childes.psy.cmu.edu/
• International Computer Archive of Modern English
(ICAME): http://nora.hd.uib.no/icame.html
Software
• Text Editors: e.g., emacs
• Regular Expressions: to identify patterns in text
(equivalent to a finite state machine; can process
text in linear time).
• Programming Languages: C, C++, Java, Perl,
Prolog, etc.
• Programming Techniques:
– Data structures like hash tables are useful for mapping
words to numbers.
– Need counts to calculate probabilities (two passes: emit
tokens first and then count them later, e.g., the CMU-Cambridge
Statistical Language Modeling toolkit).
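The hash-table technique above can be sketched in a few lines: a dictionary maps each word to its count, which is the core of any counting pass over a corpus. This is a minimal illustration, not the CMU-Cambridge toolkit itself.

```python
from collections import Counter

def count_tokens(text):
    """Map each word to its frequency using a hash table (dict)."""
    tokens = text.lower().split()
    return Counter(tokens)

counts = count_tokens("the cat sat on the mat")
print(counts["the"])  # 2
```

The counts can then be normalized into relative frequencies to estimate probabilities.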
Challenges for Corpus Building
• Low-level formatting issues: dealing with junk
and case
• What is a word? -- Tokenization
• To stem or not to stem?
• What is a sentence, and how can we detect its
boundaries?
Low-Level Formatting Issues
• Junk Formatting/Content: Examples include document
headers and separators, typesetter codes, tables and
diagrams, garbled data in the file. Problems arise if data
was obtained using OCR (unrecognized words). May need
to remove junk content before any processing begins.
• Uppercase and Lowercase: Should we keep the case or
not? The, the, and THE should all be treated as the same
token but White in George White and white in white snow
should be treated as distinct tokens. What about sentence
initial capitalization (to downcase or not to downcase)?
Tokenization: What is a Word?
• Early in processing, we must divide the input text
into meaningful units called tokens (e.g., words,
numbers, punctuation).
• Tokenization is the process of breaking input from
a text character stream into tokens to be
normalized and saved (see Sampson’s 1995 book English for
the Computer by Oxford University Press for a carefully designed and
tested set of tokenization rules).
• A graphic word token (Kucera and Francis):
– A string of contiguous alphanumeric characters with space on
either side which may include hyphens and apostrophes, but no
other punctuation marks.
– Problems: Micro$oft or :-)
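The graphic-word definition above can be approximated with a regular expression. This is a rough sketch of the Kucera & Francis rule, not their exact specification: contiguous alphanumeric runs that may contain internal hyphens and apostrophes, with no other punctuation.

```python
import re

# A rough sketch of the Kucera & Francis "graphic word": alphanumeric
# runs that may include internal hyphens and apostrophes, but no other
# punctuation marks.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

tokens = GRAPHIC_WORD.findall("The dog's well-known trick, obviously!")
print(tokens)  # ['The', "dog's", 'well-known', 'trick', 'obviously']
```

Note that tokens containing other symbols (the "problem" cases above) fall outside this pattern.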
Some of the Problems: Period
• Words are not always separated from other
tokens by white space. For example,
periods may signal an abbreviation (do not
separate) or the end of sentence (separate?).
– Abbreviations (haplology): etc. St. Dr.
• A single capital followed by a period, e.g., A. B. C.
• A sequence of letter-period-letter-period’s such as U.S.,
m.p.h.
• Mt. St. Wash.
– End of sentence? I live on Burt St.
Some of the Problems: Apostrophes
• How should contractions and clitics be
regarded? One or two tokens?
– I’ll or I ’ll
– The dog’s food or The dog ’s food
– The boys’ club
• From the perspective of parsing, I’ll needs
to be separated into two tokens because
there is no category that combines nouns
and verbs together.
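Splitting clitics off into separate tokens, as the bullet suggests for I'll, can be sketched with a small suffix rule. The rule list here is a simplified illustration, not the full Penn Treebank tokenizer.

```python
import re

# A few common English clitics (a simplified, illustrative list).
CLITIC_END = re.compile(r"('ll|'s|'re|n't)$", re.IGNORECASE)

def split_clitics(word):
    """Split a trailing clitic into its own token, Treebank-style."""
    m = CLITIC_END.search(word)
    if m:
        return [word[:m.start()], m.group(1)]
    return [word]

print(split_clitics("I'll"))   # ['I', "'ll"]
print(split_clitics("dog's"))  # ['dog', "'s"]
```

A parser can then assign the pronoun and the contracted verb their own categories.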
Some of the Problems: Hyphens
• How should we deal with hyphens? Are hyphenated
words comprised of one or multiple tokens? Usage:
1. Typographical, to improve the right margin of a document:
typically these hyphens should be removed, since breaks occur at
syllable boundaries; however, the hyphen may also be part of the
word.
2. Lexical hyphens: inserted before or after small word formatives
(e.g., co-operate, so-called, pro-university).
3. Word grouping: take-it-or-leave-it, once-in-a-lifetime, text-based, etc.
• How many lexemes will you allow?
– Data base, data-base, database
– Cooperate, co-operate
– Mark-up, mark up
Some of the Problems: Hyphens
• Authors may not be consistent with hyphenation,
e.g., cooperate and co-operate may appear in the
same document.
• Dashes can be used as punctuation without
separating them from words with space: I am
happy-Bill is not.
Different Formats in Text Pattern
Some of the Problems: Homographs
• In some cases, lexemes have overlapping forms
(homographs) as in:
– I saw the dog.
– When you saw the wood, please wear safety goggles.
– The saw is sharp.
• These forms will need to be distinguished for part-of-speech tagging.
Some of the Problems: No space
between Words
• There are no separators between words in
languages like Chinese, so whitespace-based English
tokenization methods do not apply.
• Waterloo is located in the south of Canada.
• Compounds in German:
Lebensversicherungsgesellschaftsangestellter
("life insurance company employee")
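For unsegmented text like Chinese, a classic dictionary-based approach is greedy maximum matching: repeatedly take the longest dictionary word starting at the current position. This is a minimal sketch with a tiny toy dictionary (the entries are assumptions for illustration); real segmenters use much larger lexicons and statistical models.

```python
def max_match(text, dictionary):
    """Greedy left-to-right maximum-matching word segmentation."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i; fall back to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary (hypothetical entries for illustration only).
d = {"中国", "人民", "银行"}
print(max_match("中国人民银行", d))  # ['中国', '人民', '银行']
```

The same longest-match idea is one simple way to split German compounds against a lexicon.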
Some of the Problems: Spaces
within Words
• Sometimes spaces occur in the middle of
something that we would prefer to call a single
token:
– Phone number: 765 494 3654
– Names: Mr. John Smith, New York, U. S. A.
– Verb plus particle: work out, make up
Some of the Problems: Multiple
Formats
• Numbers (format plus ambiguous separator):
– English: 123,456.78
• [0-9]{1,3}([,][0-9]{3})*([.][0-9]+)?
– French: 123 456,78
• [0-9]{1,3}([ ][0-9]{3})*([,][0-9]+)?
• There are also multiple formats for:
– Dates
– Phone numbers
– Addresses
– Names
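Patterns along these lines can be checked directly with a regex engine. The two patterns below are a sketch (anchored so the whole string must match), one per locale convention:

```python
import re

# English style: comma as thousands separator, period as decimal point.
EN_NUMBER = re.compile(r"[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?$")
# French style: space as thousands separator, comma as decimal point.
FR_NUMBER = re.compile(r"[0-9]{1,3}( [0-9]{3})*(,[0-9]+)?$")

print(bool(EN_NUMBER.match("123,456.78")))  # True
print(bool(FR_NUMBER.match("123 456,78")))  # True
print(bool(EN_NUMBER.match("123 456,78")))  # False
```

The separator ambiguity is visible here: the same string is a well-formed number under one convention and ill-formed under the other.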
Morphology: What Should I Put in
My Dictionary?
• Should all word forms be stored in the lexicon?
Probably ok for English (little morphology) but
not for Czech or German (lots of forms!)
• Stemming: Strip off affixes and leave the stem
(lemma).
– Not that helpful in English (from an IR point of view)
– Perhaps more useful for other languages or in other
contexts
• Treating multi-word expressions as single tokens can
also help.
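Suffix stripping can be illustrated with a toy stemmer. This is not Porter's algorithm; the suffix list and the minimum-stem-length condition are deliberately simplified assumptions.

```python
def crude_stem(word):
    """Strip one of a few common English suffixes (a toy illustration;
    real stemmers such as Porter's use ordered rules with conditions)."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("walking"))  # walk
print(crude_stem("cats"))     # cat
print(crude_stem("was"))      # was  (too short after stripping)
```

Even this toy version shows why stemming helps little in English: few forms per lemma, and naive stripping quickly over- or under-stems.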
What is a Sentence?
• Something ending with a '.', '?' or '!'. True in
90% of the cases.
– Sentences may be split up by other punctuation marks
(e.g., : ; --).
– Sentences may be broken up, as in: "You should be
here," she said, "before I know it!"
– Quote marks may be at the very end of the sentence.
– Identifying sentence boundaries can involve hand-coded
heuristic methods; some effort to automate sentence-boundary
detection has also been made.
Heuristic Algorithm
• Place putative sentence boundaries after all
occurrences of . ? !.
• Move boundary after following quotation marks,
if any.
• Disqualify a period boundary in the following
circumstances:
– If it is preceded by a known abbreviation of a sort that
does not normally occur word finally, but is commonly
followed by a capitalized proper name, such as Prof. or
vs.
Heuristic Algorithm (cont.)
– If it is preceded by a known abbreviation and not
followed by an uppercase word. This will deal correctly
with most usage of abbreviations like etc. or Jr. which
can occur sentence medially or finally.
• Disqualify a boundary with a ? or ! if:
– it is followed by a lowercase letter (or a known name).
• Regard other putative sentence boundaries as
sentence boundaries.
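The heuristic above can be sketched as follows. The abbreviation lists are small hypothetical samples (a real system needs much fuller lists), and tokenization is simplified to whitespace splitting.

```python
# Sample abbreviation lists (hypothetical; real systems need fuller lists).
ALWAYS_ABBREV = {"Prof.", "vs.", "Dr.", "Mr."}   # rarely sentence-final
SOMETIMES_FINAL = {"etc.", "Jr."}                # may end a sentence

def split_sentences(text):
    """Heuristic sentence splitter following the slide's steps."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Putative boundary after . ? !, moved past a trailing quote mark.
        if tok[-1] in ".?!" or (len(tok) > 1 and tok[-2] in ".?!"
                                and tok[-1] in "\"'”’"):
            # Disqualify: known abbreviation that is rarely sentence-final.
            if tok in ALWAYS_ABBREV:
                continue
            # Disqualify: abbreviation not followed by an uppercase word.
            if tok in SOMETIMES_FINAL and nxt and not nxt[0].isupper():
                continue
            # Disqualify: ? or ! followed by a lowercase letter.
            if tok[-1] in "?!" and nxt and nxt[0].islower():
                continue
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("I met Prof. Smith. He said hi! We left."))
# ['I met Prof. Smith.', 'He said hi!', 'We left.']
```

Cases like "I live on Burt St." still fail, which is what motivates the adaptive approach on the next slides.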
Adaptive Sentence Boundary Detection
• The group included Dr. J. M. Freeman and T. Boone
Pickens Jr.
• David D. Palmer, Marti A. Hearst, Adaptive
Sentence Boundary Disambiguation, Technical
Report 97/94, UC Berkeley: 98-99% correct
• The part-of-speech probabilities of the tokens
surrounding a punctuation mark are input to a feed
forward neural network, and the network’s output
activation value indicates the role of the
punctuation.
Adaptive Sentence Boundary Detection (cont.)
• To avoid a processing cycle (tagging normally presupposes
sentence boundaries), instead of assigning a single POS to
each word, the algorithm uses the prior probabilities of all
POS tags for that word (20 of them).
• Input: k*20 units, where k is the number of words of
context surrounding an instance of an end-of-sentence
punctuation mark.
• k hidden units with a sigmoid squashing activation
function.
• A single output unit indicates whether the punctuation
mark ends a sentence.
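The architecture described above can be sketched as a forward pass in plain Python. The weights here are random (untrained) and the input is dummy data, so this only shows the dimensions from the slide (k*20 inputs, k sigmoid hidden units, one output), not a working disambiguator.

```python
import math
import random

random.seed(0)
k = 6                 # words of context around the punctuation mark
n_in = k * 20         # 20 POS prior probabilities per context word

# Random, untrained weights (a real system learns these from data).
W1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(k)]
W2 = [random.gauss(0, 1) for _ in range(k)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_boundary(pos_priors):
    """Forward pass: k*20 inputs -> k sigmoid hidden units -> 1 output."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, pos_priors)))
              for row in W1]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)))

x = [random.random() for _ in range(n_in)]   # dummy POS priors
print(0.0 < score_boundary(x) < 1.0)  # True
```

The output activation is a value in (0, 1), which is thresholded to decide whether the punctuation mark is a sentence boundary.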
Marking up Data: Mark-up Schemes
• Plain text corpora are useful, but more can be learned if
information is added.
– Boundaries for sentences, paragraphs, etc.
– Lexical tags
– Syntactic Structure
– Semantic Representation
– Semantic class
• Different Mark-up schemes:
– COCOA format (header information in texts, e.g., author,
date, title): uses angle brackets, with the first letter
indicating the broad semantics of the field.
– Standard Generalized Markup Language or SGML
(related: HTML, TEI, XML)
SGML Examples
• <p> <s> This book does not delve very deeply
into SGML. </s> … <s> In XML, such empty
elements may be specifically marked by ending
the tag name with a forward slash character. </s>
</p>
• <utt speak="Mary" date="now"> SGML can be
very useful. </utt>
• Character and entity codes: begin with an ampersand
and end with a semicolon
– &lt; is the less than symbol <
– r&eacute;sum&eacute; is résumé
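Entity references of this kind can be decoded with Python's standard html module, which handles the named character references used in HTML (an SGML application):

```python
import html

# Decode named character/entity references into characters.
print(html.unescape("&lt;"))                  # <
print(html.unescape("r&eacute;sum&eacute;"))  # résumé
print(html.unescape("AT&amp;T"))              # AT&T
```

A corpus-cleaning pass typically applies such decoding before tokenization so that entity codes don't surface as spurious tokens.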
Marking up Data: Grammatical
Coding
• Tagging corresponds to indicating the various conventional
parts of speech. Tagging can be done automatically (we
will talk about that in a later lecture).
• Different Tag Sets have been used, e.g., Brown Tag Set,
University of Lancaster Tag Set, Penn Treebank Tag Set,
British National Corpus (CLAWS*), Czech National
Corpus
• The Design of a Tag Set:
– Target Features: useful information on the grammatical class
– Predictive Features: useful for predicting behavior of other words
in context (e.g., distinguish modals and auxiliary verbs from
regular verbs)
Penn Treebank Set
• Adjective: JJ, JJR, JJS
• Cardinal: CD
• Adverb: RB, RBR, RBS, WRB
• Conjunction: CC, IN (subordinating and that)
• Determiner: DT, PDT, WDT
• Noun: NN, NNS, NNP, NNPS (no distinction for adverbial)
• Pronoun: PRP, PRP$, WP, WP$, EX
• Verb: VB, VBP, VBZ, VBD, VBG, VBN (have, be, and do are not distinguished)
• Infinitive marker (to): TO
• Preposition to: TO
• Other prepositions: IN
• Punctuation: . ; , - $ ( ) `` ''
• FW, SYM, LS
Tag Sets
• General definition:
– Tags can be represented as vectors: (c1, c2, ..., cn)
– Or thought of as a flat list T = {t_i}, i = 1..n, with some assumed
1:1 mapping T ↔ (C1, C2, ..., Cn)
• English tagsets:
– Penn Treebank (45) (VBZ: Verb, Pres, 3, sg; JJR: Adj. Comp.)
– Brown Corpus (87), CLAWS C5 (62), London-Lund (197)
Tag Sets for other Languages
• Differences:
– Larger number of tags
– Categories covered (POS, Number, Case, Negation, ...)
– Level of detail
– Presentation (short names vs. structured ("positional"))
• Example:
– Czech (positional tag): AGFS3----1A----
– Each character position encodes one category: POS, SUBPOS,
GENDER, NUMBER, CASE, POSSG (possessor's gender), POSSN
(possessor's number), PERSON, TENSE, DCOMP (degree of
comparison), NEG, VOICE, and VAR (variant); unused
positions are marked with '-'.
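A positional tag can be unpacked mechanically, one character per category. The field names below follow the Prague Dependency Treebank positional convention and are an assumption for illustration (the slide lists a similar but not identically ordered set):

```python
# Field names per character position (Prague Dependency Treebank style;
# assumed here for illustration).
FIELDS = ["POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSGENDER",
          "POSSNUMBER", "PERSON", "TENSE", "GRADE", "NEGATION",
          "VOICE", "RESERVE1", "RESERVE2", "VAR"]

def parse_positional_tag(tag):
    """Map each character of a positional tag to its category,
    skipping unused positions marked '-'."""
    return {name: ch for name, ch in zip(FIELDS, tag) if ch != "-"}

print(parse_positional_tag("AGFS3----1A----"))
# {'POS': 'A', 'SUBPOS': 'G', 'GENDER': 'F', 'NUMBER': 'S',
#  'CASE': '3', 'GRADE': '1', 'NEGATION': 'A'}
```

This shows why positional tagsets scale to thousands of distinct tags while staying compact: each tag is just a fixed-width product of small category alphabets.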
Sentence Length Distribution