Transcript Document

EECS 595 / LING 541 / SI 661
Natural Language Processing
Fall 2005
Lecture Notes #4
Features and unification
Introduction
• Grammatical categories have properties
• Constraint-based formalisms
• Example: this flights: agreement is difficult to
handle at the level of grammatical categories
• Example: many water: count/mass nouns
• Sample rule that takes into account features: S 
NP VP (but only if the number of the NP is equal
to the number of the VP)
Feature structures
CAT
NP
NUMBER SINGULAR
PERSON 3
CAT
AGREEMENT
NP
NUMBER SG
PERSON 3
Feature paths:
{x agreement number}
Unification
[NUMBER SG]
[NUMBER SG]
+
[NUMBER SG]
[NUMBER PL]
-
[NUMBER SG]
[NUMBER []] = [NUMBER SG]
[NUMBER SG]
[PERSON 3] = ?
Agreement
• S  NP VP
{NP AGREEMENT} = {VP AGREEMENT}
• Does this flight serve breakfast?
• Do these flights serve breakfast?
• S  Aux NP VP
{Aux AGREEMENT} = {NP AGREEMENT}
Agreement
• These flights
• This flight
• NP  Det Nominal
{Det AGREEMENT} = {Nominal AGREEMENT}
• Verb  serve
{Verb AGREEMENT NUMBER} = PL
• Verb  serves
{Verb AGREEMENT NUMBER} = SG
Subcategorization
• VP  Verb
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = INTRANS
• VP  Verb NP
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = TRANS
• VP  Verb NP NP
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = DITRANS
Regular Expressions and
Automata
Regular expressions
• Searching for “woodchuck”
• Searching for “woodchucks” with an
optional final “s”
• Regular expressions
• Finite-state automata (singular: automaton)
Regular expressions
• Basic regular expression patterns
• Perl-based syntax (slightly different from other
notations for regular expressions)
• Disjunctions [abc]
• Ranges [A-Z]
• Negations [^Ss]
• Optional characters ? and *
• Wild cards .
• Anchors ^ and $, also \b and \B
• Disjunction, grouping, and precedence |
Writing correct expressions
• Exercise: write a Perl regular expression to match
the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
A more complex example
• Exercise: Write a regular expression that will
match “any PC with more than 500MHz and 32
Gb of disk space for less than $1000”:
/$[0-9]+/
/$[0-9]+\.[0-9][0-9]/
/\b$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|
[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Advanced operators
Substitutions and memory
• Substitutions
s/colour/color/
• Memory (\1, \2, etc. refer back to matches)
s/([0-9]+)/<\1>/
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO
HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU
THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC
EXAMPLE/
Finite-state automata
• Finite-state automata (FSA)
• Regular languages
• Regular expressions
Finite-state automata (machines)
baa!
baaa!
baaaa!
baaaaa!
...
b
q0
baa+!
a
a
q1
a
q2
state
!
q3
transition
q4
final
state
Input tape
q0
a
b
a
!
b
Finite-state automata
•
•
•
•
•
Q: a finite set of N states q0, q1, … qN
: a finite input alphabet of symbols
q0: the start state
F: the set of final states
(q,i): transition function
State-transition tables
State
0
1
2
3
4
b
1
0
0
0
0
Input
a
0
2
3
3
0
!
0
0
0
4
0
The FSM toolkit and friends
• Developed at AT&T Research (Riley, Pereira, Mohri,
Sproat)
• Download:
http://www.research.att.com/sw/tools/fsm/tech.html
http://www.research.att.com/sw/tools/lextools/
• Tutorial available
• 4 useful parts: FSM, Lextools, GRM, Dot (separate)
– /data2/tools/fsm-3.6/bin
– /data2/tools/lextools/bin
– /data2/tools/dot/bin
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
index  Beginning of tape
current-state  Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
elsif transition-table [current-state, tape[index]] is empty then
return reject
else
current-state  transition-table [current-state, tape[index]]
index  index + 1
end
Adding a failing state
a
b
q0
a
a
q1
!
q2
q3
q4
!
!
!
b
!
b
b
a
b
qF
a
Languages and automata
• Formal languages: regular languages, non-regular
languages
• deterministic vs. non-deterministic FSAs
• Epsilon () transitions
Using NFSAs to accept strings
• Backup: add markers at choice points, then
possibly revisit underexplored markers
• Look-ahead: look ahead in input
• Parallelism: look at alternatives in parallel
Using NFSAs
Input
State
0
1
2
3
4
b
1
0
0
0
0
a
0
2
2,3
0
0
!
0
0
0
4
0

0
0
0
0
0
More about FSAs
• Transducers
• Equivalence of DFSAs and NFSAs
• Recognition as search: depth-first, breadthsearch
Recognition using NFSAs
Regular languages
• Operations on regular languages and FSAs:
concatenation, closure, union
• Properties of regular languages (closed
under concatenation, union, disjunction,
intersection, difference, complementation,
reversal, Kleene closure)
An exercise
• J&M 2.8. Write a regular expression for the
language accepted by the NFSA in the
Figure.
Morphology and
Finite-State Transducers
Morphemes
• Stems, affixes
• Affixes: prefixes, suffixes, infixes: hingi
(borrow) – humingi (agent) in Tagalog,
circumfixes: sagen – gesagt in German
• Concatenative morphology
• Templatic morphology (Semitic languages)
: lmd (learn), lamad (he studied), limed (he
taught), lumad (he was taught)
Morphological analysis
• rewrites
• unbelievably
Inflectional morphology
•
•
•
•
Tense, number, person, mood, aspect
Five verb forms in English
40+ forms in French
Six cases in Russian:
http://www.departments.bucknell.edu/russian/language/case.html
• Up to 40,000 forms in Turkish (you cause X
to cause Y to … do Z)
Derivational morphology
• Nominalization: computerization,
appointee, killer, fuzziness
• Formation of adjectives: computational,
embraceable, clueless
Finite-state morphological
parsing
•
•
•
•
•
•
•
Cats: cat +N +PL
Cat: cat +N +SG
Cities: city +N +PL
Geese: goose +N +PL
Ducks: (duck +N +PL) or (duck +V +3SG)
Merging: +V +PRES-PART
Caught: (catch +V +PAST-PART) or (catch +V
+PAST)
Principles of morphological
parsing
•
•
•
•
•
Lexicon
Morphotactics (e.g., plural follows noun)
Orthography (easy  easier)
Irregular nouns: e.g., geese, sheep, mice
Irregular verbs: e.g., caught, ate, eaten
FSA for adjectives
•
•
•
•
•
•
•
Big, bigger, biggest
Cool, cooler, coolest, coolly
Red, redder, reddest
Clear, clearer, clearest, clearly, unclear, unclearly
Happy, happier, happiest, happily
Unhappy, unhappier, unhappiest, unhappily
What about: unbig, redly, and realest?
Using FSA for recognition
• Is a string a legitimate word or not?
• Two-level morphology: lexical level +
surface level (Koskenniemi 83)
• Finite-state transducers (FST) – used for
regular relations
• Inversion and composition of FST
Orthographic rules
•
•
•
•
•
Beg/begging
Make/making
Watch/watches
Try/tries
Panic/panicked
x
 
  /  s ^ __ s #
z
 
a  b / c __ d
Combining FST lexicon and rules
• Cascades of transducers:
the output of one becomes the input of another
Weighted Automata
Phonetic symbols
• IPA
• Arpabet
• Examples
Using WFST for language
modeling
• Phonetic representation
• Part-of-speech tagging
Word Classes and
Part Of Speech Tagging
Some POS statistics
•
•
•
•
•
Preposition list from COBUILD
Single-word particles
Conjunctions
Pronouns
Modal verbs
Tagsets for English
• Penn Treebank
• Other tagsets (see Week 1 slides)
POS ambiguity
• Degrees of ambiguity (DeRose 1988)
• Rule-based POS tagging
– ENGTWOL (Voutilainen et al. )
– Sample rule:
• Adverbial-That rule
(“it isn’t that odd”)
(“Given input: “that”
if (+1 A/ADV/QUANT);
(+2 SENT-LIM);
(NOT –1 SVOC/A);
(not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag
Evaluating POS taggers
• Percent correct
• What is the lower bound on a system’s
performance?
• What about the upper bound?
P( A)  P( E )
k
1  P( E )
Kappa
•
•
•
•
N: number of items (index i)
n: number of categories (index j)
k: number of annotators
when k > .8 – agreement is considered high
k


m
  ij 
 i 1

 Nk 




N
N
n
1
1
2
P( A) 
mij 

Nk (k  1) i 1 j 1
k 1
n
P( E )  
j 1
2
Midterm reading list
• Chapter 1 – Introduction
• Chapter 2 – Regular expressions and automata
• Chapter 3 – Morphology and finite-state
transducers + FSM tutorial
• Chapter 8 – Word classes and POS tagging
• Chapter 9 – Context-free grammars for English
• Chapter 10 – Parsing with context-free grammars
• Chapter 11 - Features and unification
Syntaxscape
• Written by Juno Suk of Lucent
• http://www.cs.columbia.edu/~radev/syntaxscape/
Read by yourselves
• 9.9. Spoken language syntax
• 9.10. Grammar equivalence
• 9.11. Finite-state and context-free grammars