Tokenization, Morphological Analysis

SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
Sept 8, 2004
Today
Tokenizing using Regular Expressions
Elementary Morphology
Frequency Distributions in NLTK
Tokenizing in NLTK
The Whitespace Tokenizer doesn’t work very well
What are some of the problems?
NLTK provides an easy way to incorporate regex’s
into your tokenizer
Uses Python’s regex package (re)
http://docs.python.org/lib/re-syntax.html
Modified from Dorr and Habash (after Jurafsky and Martin)
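As a rough illustration of the question above, here is a minimal sketch of whitespace tokenization using only Python's built-in str.split(). The sample sentence is made up for this example and is not from the lecture.

```python
# Hypothetical sample sentence, chosen to show several problem cases at once.
text = "Mr. O'Neill didn't pay $3.50 for the state-of-the-art widget (yet)."

# Whitespace tokenization: split on runs of spaces, tabs, and newlines.
tokens = text.split()
print(tokens)

# Punctuation stays glued to its neighbours, so "(yet)." is one token and
# the plain word "yet" never appears in the output.
glued = [tok for tok in tokens if not tok.isalnum()]
print(glued)
```

Six of the ten tokens still carry punctuation, which is the kind of problem the regex-based tokenizer is meant to address.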
Regex’s for Tokenizing
Build up your recognizer piece by piece
Make a string of regex’s combined with OR’s
Put each one in a group (surrounded by parens)
Things to recognize:
urls
words with hyphens in them
words in which hyphens should be removed (end-of-line hyphens)
Numerical terms
Words with apostrophes
Regex’s for Tokenizing
Here are some I put together:
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
» Allows port number but no argument variables.
hyphen = r'(\w+\-\s?\w+)'
» Allows for a space after the hyphen
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
» Needs to handle large numbers with commas (as written, it also requires at least two digits)
punct = r'([^\w\s]+)'
wordr = r'(\w+)'
A nice python trick:
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])
– Makes one string in which a “|” goes in between each substring
Regex’s for Tokenizing
More code:
# NLTK API as used in this 2004 course; print is the Python 2 statement
from nltk.token import *
from nltk.tokenizer import *
t = Token(TEXT='This is the girl\'s depart- ment store.')
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])
RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t)
print t['WORDS']
[<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
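The same combined pattern can be tried with only Python's re module. The following is a minimal sketch, not the course's NLTK code: tokenize and rejoin_line_break are hypothetical helper names introduced here, and the clean-up step for end-of-line hyphens is an assumption about how one might handle that case.

```python
import re

# Patterns from the slides, in the same priority order.
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
hyphen = r'(\w+\-\s?\w+)'
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
punct = r'([^\w\s]+)'
wordr = r'(\w+)'
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])

def tokenize(text):
    """Return the non-overlapping matches of the combined pattern, left to right."""
    # group(0) is the whole match, so the capturing groups do not matter here.
    return [m.group(0) for m in re.finditer(regexp, text)]

def rejoin_line_break(token):
    """Hypothetical clean-up: drop a hyphen that is followed by whitespace."""
    return re.sub(r'-\s+', '', token)

tokens = tokenize("This is the girl's depart- ment store.")
print(tokens)
print([rejoin_line_break(t) for t in tokens])
```

Because alternatives are tried left to right at each position, the more specific patterns (url, hyphen, apostro, numbers) have to come before the catch-all wordr and punct.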
Tokenization Issues
Sentence Boundaries
Include parens around sentences?
What about quotation marks around sentences?
Periods – end of line or not?
– We’ll study this in detail in a couple of weeks.
Proper Names
What to do about
– “New York-New Jersey train”?
– “California Governor Arnold Schwarzenegger”?
Clitics and Contractions
Morphology
Morphology:
The study of the way words are built up from smaller meaning
units.
Morphemes:
The smallest meaningful unit in the grammar of a language.
Contrasts:
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
A useful resource:
Glossary of linguistic terms by Eugene Loos
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
“unladylike”
3 morphemes, 4 syllables
un- ‘not’
lady ‘(well behaved) female adult human’
-like ‘having the characteristics of’
Can’t break any of these down further without
distorting the meaning of the units
“technique”
1 morpheme, 2 syllables
“dogs”
2 morphemes, 1 syllable
-s, a plural marker on nouns
Morpheme Definitions
Root
The portion of the word that:
– is common to a set of derived or inflected forms, if any, when
all affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem
The root or roots of a word, together with any derivational
affixes, to which inflectional affixes are added.
Affix
A bound morpheme that is joined before, after, or within a
root or stem.
Clitic
a morpheme that functions syntactically like a word, but
does not appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)
Inflectional vs. Derivational
Word Classes
Parts of speech: noun, verb, adjectives, etc.
Word class dictates how a word combines with morphemes to
form new words
Inflection:
Variation in the form of a word, typically by means of an
affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, nonidiosyncratic change of
meaning.
Derivation:
The formation of a new word or inflectable stem from another
word or stem.
Inflectional Morphology
Adds:
tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
come is inflected for person and number:
The pizza guy comes at noon.
las and rojas are inflected for agreement with
manzanas in grammatical gender by -a and in
number by –s
las manzanas rojas
(‘the red apples’)
Derivational Morphology
Nominalization (formation of nouns from other parts of speech,
primarily verbs in English):
computerization
appointee
killer
fuzziness
Formation of adjectives (primarily from nouns)
computational
clueless
embraceable
Difficult cases:
building -> from which sense of “build”?
A resource:
CatVar: Categorial Variation Database
http://clipdemos.umiacs.umd.edu/catvar
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
hope+ing -> hoping
hop+ing -> hopping
Affixes
Prefixes: anti-, dis- in antidisestablishmentarianism
Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
Infixes: hingi (borrow) – humingi (borrower) in Tagalog
Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology
Roots and Patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb
  – Active, passive, middle voice
Example:
– lmd (to learn or study)
  – CaCaC -> lamad (he studied)
  – CiCeC -> limed (he taught)
  – CuCaC -> lumad (he was taught)
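Root-and-pattern combination can be sketched as slot filling: each C in the template takes the next consonant of the root. This is a minimal illustration of the example above, not a model of Hebrew morphology; apply_template is a hypothetical helper name.

```python
def apply_template(root, template):
    """Fill each C slot of the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Vowels in the template are copied through; C slots consume the root.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

root = "lmd"
for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template(root, template))
```

The root contributes the basic meaning and the template contributes the vowels, which is why the two cannot be separated into a simple prefix or suffix.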
Nouns and Verbs (in English)
Nouns have simple inflectional morphology
cat
cat+s, cat+’s
Verbs have more complex morphology
Nouns and Verbs (in English)
Nouns
Have simple inflectional morphology
Cat/Cats
Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
More complex morphology
Walk/Walked
Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped
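The regular forms in the table follow a few spelling rules (e-deletion, consonant doubling, y to ie). The sketch below is one possible encoding of those rules, written for this transcript rather than taken from the lecture. The function names are hypothetical, and the doubling test is a simplification that only handles one-syllable stems.

```python
import re

VOWELS = "aeiou"

def doubles_final_consonant(stem):
    """True for one-syllable stems ending consonant-vowel-consonant (map, hop)."""
    if len(stem) < 3 or len(re.findall(r"[aeiou]+", stem)) != 1:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return c1 not in VOWELS and v in VOWELS and c2 not in VOWELS + "wxy"

def s_form(stem):
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"            # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                  # catch -> catches
    return stem + "s"

def ing_form(stem):
    if stem.endswith("ie"):
        return stem[:-2] + "ying"           # die -> dying
    if stem.endswith("e") and not stem.endswith(("ee", "ye", "oe")):
        return stem[:-1] + "ing"            # merge -> merging
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ing"      # map -> mapping
    return stem + "ing"

def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                   # merge -> merged
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"            # try -> tried
    if doubles_final_consonant(stem):
        return stem + stem[-1] + "ed"       # map -> mapped
    return stem + "ed"

for stem in ("walk", "merge", "try", "map"):
    print(stem, s_form(stem), ing_form(stem), ed_form(stem))
```

Irregular verbs such as eat or catch cannot be produced by rules like these and need to be listed explicitly.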
Irregular (English) Verbs
Morphological Form Classes   Irregularly Inflected Verbs
Stem                         eat      catch     cut
-s form                      eats     catches   cuts
-ing form                    eating   catching  cutting
Past form                    ate      caught    cut
-ed participle               eaten    caught    cut
“To love” in Spanish
Syntax and Morphology
Phrase-level agreement
Subject-Verb
– John studies hard (STUDY+3SG)
Noun-Adjective
– Las vacas hermosas
Sub-word phrasal structures
‫שבספרינו‬
‫נו‬+‫ים‬+‫ספר‬+‫ב‬+‫ש‬
That+in+book+PL+Poss:1PL
Which are in our books
Phonology and Morphology
Script Limitations
Spoken English has 14 vowels
– heed hid hayed head had hoed hood who’d hide
how’d taught Tut toy enough
English Alphabet has 5
– Use vowel combinations: far fair fare
– Consonantal doubling (hopping vs. hoping)
Computational Morphology
Approaches
Lexicon only
Rules only
Lexicon and Rules
– Finite-state Automata
– Finite-state Transducers
Systems
WordNet’s morphy
PCKimmo
– Named after Kimmo Koskenniemi, much work done by Lauri Karttunen,
Ron Kaplan, and Martin Kay
– Accurate but complex
– http://www.sil.org/pckimmo/
Two-level morphology
– Commercial version available from InXight Corp.
Background
Chapter 3 of Jurafsky and Martin
A short history of Two-Level Morphology
– http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Porter Stemmer
Discount morphology
So not all that accurate
Uses a series of cascaded rewrite rules
ATIONAL -> ATE
(relational -> relate)
ING -> ε (the empty string)
if stem contains vowel
(motoring -> motor)
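The two rules above can be sketched as plain string rewrites applied one after another. This is a simplified illustration, not the full Porter algorithm: the function names are hypothetical, and the vowel test ignores Porter's special treatment of y.

```python
VOWELS = "aeiou"

def contains_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def rewrite_ational(word):
    # ATIONAL -> ATE
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    return word

def rewrite_ing(word):
    # ING -> (empty), but only if what remains contains a vowel
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]
    return word

def cascade(word):
    """Apply the rules in sequence; each rule sees the previous rule's output."""
    for rule in (rewrite_ational, rewrite_ing):
        word = rule(word)
    return word

for w in ("relational", "motoring", "sing"):
    print(w, "->", cascade(w))
```

The vowel condition is what keeps a word like sing from being reduced to a bare consonant.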
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE     relational     -> relate
(m>0) TIONAL  -> TION    conditional    -> condition
                         rational       -> rational
(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
(m>0) ALITI   -> AL      formaliti      -> formal
(m>0) IVITI   -> IVE     sensitiviti    -> sensitive
(m>0) BILITI  -> BLE     sensibiliti    -> sensible
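The (m>0) condition refers to Porter's measure m, the number of vowel-consonant sequences in the stem when it is written as [C](VC)^m[V]. The sketch below is an illustrative reimplementation of this one rule table written for this transcript, not the reference Porter code; measure, is_consonant, and apply_rules are hypothetical names.

```python
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a vowel when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def apply_rules(word):
    # Try longer suffixes first so IZATION wins over ATION.
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Only the longest matching suffix is considered.
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", apply_rules(w))
```

The word rational is left alone because removing the suffix leaves a stem with m = 0, which is the case the (m>0) guard exists for.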
Porter Stemmer
Errors of Omission (related words the stemmer fails to conflate)
European        Europe
analysis        analyzes
matrices        matrix
noise           noisy
explain         explanation
Errors of Commission (unrelated words the stemmer conflates)
organization    organ
doing           doe
generalization  generic
numerical       numerous
university      universe
Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)
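Analyses like those in the table can be produced by combining a small lexicon with suffix rules, in the spirit of the "Lexicon and Rules" approach. The following is a toy sketch under that assumption, not a finite-state transducer: LEXICON, IRREGULAR, and the rule lists are made up to cover only the words in the table, and bare verb forms are deliberately left out.

```python
# Toy lexicon: stem -> parts of speech it can take.
LEXICON = {
    "cat": {"N"}, "city": {"N"}, "goose": {"N"},
    "duck": {"N", "V"}, "merge": {"V"}, "catch": {"V"},
}

# Irregular surface forms are listed outright.
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}

# (suffix to strip, string to put back) pairs.
S_RULES = [("ies", "y"), ("es", ""), ("s", "")]   # cities, catches, cats
ING_RULES = [("ing", "e"), ("ing", "")]           # merging, ducking

def stems(word, rules):
    """Candidate stems that a rule produces and that the lexicon knows."""
    return [word[:-len(suf)] + rep for suf, rep in rules
            if word.endswith(suf) and word[:-len(suf)] + rep in LEXICON]

def analyze(word):
    if word in IRREGULAR:
        return list(IRREGULAR[word])
    out = []
    if "N" in LEXICON.get(word, ()):
        out.append(f"{word} +N +SG")
    for stem in stems(word, S_RULES):
        if "N" in LEXICON[stem]:
            out.append(f"{stem} +N +PL")
        if "V" in LEXICON[stem]:
            out.append(f"{stem} +V +3SG")
    for stem in stems(word, ING_RULES):
        if "V" in LEXICON[stem]:
            out.append(f"{stem} +V +PRES-PART")
    return out

for w in ("cats", "cat", "cities", "geese", "ducks", "merging", "caught"):
    print(w, analyze(w))
```

Ambiguous words such as ducks come back with more than one analysis, which a later step such as part-of-speech tagging would have to resolve.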
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about
  • Arabic or
  • Turkish or
  • Chinese?
Surface form    Lexical form
acclaim         acclaim $N$
acclaim         acclaim $V+0$
acclaimed       acclaim $V+ed$
acclaimed       acclaim $V+en$
acclaiming      acclaim $V+ing$
acclaims        acclaim $N+s$
acclaims        acclaim $V+s$
acclamation     acclamation $N$
acclamations    acclamation $N+s$
acclimate       acclimate $V+0$
acclimated      acclimate $V+ed$
acclimated      acclimate $V+en$
acclimates      acclimate $V+s$
acclimating     acclimate $V+ing$
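With a lexicon-only approach, analysis and generation both reduce to table lookup. A minimal sketch using the pairs listed above follows; the dictionary layout and the function names analyze and generate are assumptions made for this example.

```python
from collections import defaultdict

# (surface form, lexical form) pairs copied from the listing above.
PAIRS = [
    ("acclaim", "acclaim $N$"), ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"), ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"), ("acclaims", "acclaim $V+s$"),
    ("acclamation", "acclamation $N$"), ("acclamations", "acclamation $N+s$"),
    ("acclimate", "acclimate $V+0$"),
    ("acclimated", "acclimate $V+ed$"), ("acclimated", "acclimate $V+en$"),
    ("acclimates", "acclimate $V+s$"), ("acclimating", "acclimate $V+ing$"),
]

surface_to_lexical = defaultdict(list)   # analysis direction
lexical_to_surface = {}                  # generation direction
for surface, lexical in PAIRS:
    surface_to_lexical[surface].append(lexical)
    lexical_to_surface[lexical] = surface

def analyze(surface):
    """All lexical forms listed for a surface form (empty if unlisted)."""
    return surface_to_lexical.get(surface, [])

def generate(lexical):
    """The surface form listed for a lexical form, or None if unlisted."""
    return lexical_to_surface.get(lexical)

print(analyze("acclaimed"))
print(generate("acclimate $V+ing$"))
print(analyze("acclaimer"))   # not listed, so no analysis
```

Lookup is trivial, but every form has to be listed in advance, which is why the table grows so quickly for languages with richer morphology.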
For Next Week
Software status:
Software on 3 lab machines, more coming
Lecture on Monday Sept 13:
Part of speech tagging
For Wed Sept 15
Do exercises 1-3 in Tutorial 2 (Tokenizing)
Do the following exercises from Tutorial 3 (Tagging)
1a-h
2, 3, 4, 5a-b
Turn them in online
(I’ll have something available for this by then)