ppt - UMIACS
Download
Report
Transcript ppt - UMIACS
CMSC 723 / LING 645: Intro to
Computational Linguistics
September 15, 2004: Dorr
More about FSA’s,
Finite State Morphology (J&M 3)
Prof. Bonnie J. Dorr
Dr. Christof Monz
TA: Adam Lee
More about FSAs
Transducers
Equivalence of DFSAs and NFSAs
Recognition as search: depth-first, breadth-
search
Recognition using NFSAs
NFSA Recognition of “baaa!”
Breadth-first Recognition
of “baaa!”
should be q2
Regular languages
Regular languages are characterized by
FSAs
For every NFSA, there is an equivalent
DFSA.
Regular languages are closed under
concatenation, Kleene closure, union.
Concatenation
Kleene Closure
Union
Morphology
Definitions and Problems
– What is Morphology?
– Topology of Morphologies
Approaches to Computational Morphology
– Lexicons and Rules
– Computational Morphology Approaches
Morphology
The study of the way words are built up from
smaller meaning units called Morphemes
Abstract versus Realized
HOP +PAST hop +ed hopped /hapt/
Syntax
Lexeme/Inflected Lexeme
Grammars
sentences
Morphology
Morpheme/Allomorph
Morphotactics
words
Phonology
Phoneme/Allophone
Phonotactics
letters
Phonology and Morphology
Phonology vs. Orthography
Historical spelling
– night, nite
– attention, mission, fish
Script Limitations
– Spoken English has 14 vowels
• heed hid hayed head had hoed hood who’d hide how’d taught
Tut toy enough
– English Alphabet has 5
• Use vowel combinatios: far fair fare
• Consonantal doubling (hopping vs. hoping)
Syntax and Morphology
Phrase-level agreement
– Subject-Verb
• John studies hard (STUDY+3SG)
– Noun-Adjective
• Las vacas hermosas
Sub-word phrasal structures
–
–
–
–
conj
שבספרינו
prep
נו+ים+ספר+ב+ש
That+in+book+PL+Poss:1PL
noun
Which are in our books
article plural
poss
Topology of Morphologies
Concatenative vs. Templatic
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
– hope+ing hoping
hop hopping
Affixes
–
–
–
–
Prefixes: Antidisestablishmentarianism
Suffixes: Antidisestablishmentarianism
Infixes: hingi (borrow) – humingi (borrower) in Tagalog
Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
– uygarlaştıramadıklarımızdanmışsınızcasına
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become
civilized
Templatic Morphology
Roots and Patterns
ك ت ب
KTB
כת ב
? مَ ? ? و
???ו
مكتوب
כתוב
maktuub
written
ktuuv
written
Templatic Morphology:
Root Meaning
KTB: writing “stuff”
كتاب
book
مكتبة
library
مكتب
office
كتب
مكتوب
كاتب
write
letter
writer
כתב
מכתב
כתב
כתיב
spelling
כתובת
address
Derivational vs. Inflectional
Word Classes
– Parts of speech: noun, verb, adjectives, etc.
– Word class dictates how a word combines with
morphemes to form new words
Derivational morphology
Nominalization: computerization,
appointee, killer, fuzziness
Formation of adjectives: computational,
clueless, embraceable
CatVar: Categorial Variation Database
http://clipdemos.umiacs.umd.edu/catvar/
Inflectional morphology
Adds: Tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Five verb forms in English
Other languages have (lots more)
Nouns and Verbs (in English)
Nouns have simple inflectional
morphology
– cat
– cat+s, cat+’s
Verbs have more complex morphology
Regulars and Irregulars
Nouns
– Cat/Cats
– Mouse/Mice, Ox, Oxen, Goose, Geese
Verbs
– Walk/Walked
– Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes
Regularly Inflected Verbs
Stem
walk
merge
try
map
-s form
walks
merges
tries
maps
-ing form
walking
merging
trying
mapping
Past form or –ed participle
walked
merged
tried
mapped
Irregular (English) Verbs
Morphological Form Classes Irregularly Inflected Verbs
Stem
eat
catch
cut
-s form
eats
catches
cuts
-ing form
eating
catching cutting
Past form
ate
caught
cut
-ed participle
eaten
caught
cut
“To love” in Spanish
Computational Morphology
Finite State Morphology
– Finite State Transducers (FST)
Input/Output
Analysis/Generation
Computational Morphology
WORD
cats
cat
cities
geese
ducks
STEM (+FEATURES)*
cat +N +PL
cat +N +SG
city +N +PL
goose +N +PL
(duck +N +PL) or
(duck +V +3SG)
merging merge +V +PRES-PART
caught
(catch +V +PAST-PART) or
(catch +V +PAST)
Building a Morphological
Parser
The Rules and the Lexicon
–
–
–
–
General versus Specific
Regular versus Irregular
Accuracy, speed, space
The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
acclaim
acclaim
acclaimed
acclaimed
acclaiming
acclaims
acclaims
acclamation
acclamations
acclimate
acclimated
acclimated
acclimates
acclimating
acclaim $N$
acclaim $V+0$
acclaim $V+ed$
acclaim $V+en$
acclaim $V+ing$
acclaim $N+s$
acclaim $V+s$
acclamation
acclamation
acclimate
acclimate
acclimate
acclimate
acclimate
$N$
$N+s$
$V+0$
$V+ed$
$V+en$
$V+s$
$V+ing$
Building a Morphological
Parser
The Rules and the Lexicon
–
–
–
–
General versus Specific
Regular versus Irregular
Accuracy, speed, space
The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon and Rules:
FSA Inflectional Noun Morphology
• English Noun Lexicon
reg-noun
Irreg-pl-noun
Irreg-sg-noun
plural
fox
cat
dog
geese
sheep
mice
goose
sheep
mouse
-s
• English Noun
Rule
Lexicon and Rules: FSA English Verb
Inflectional Morphology
reg-verb-stem
irreg-verb-stem
irreg-past-verb
past
past-part
pres-part
3sg
walk
fry
talk
impeach
cut
speak
spoken
sing
sang
caught
ate
eaten
-ed
-ed
-ing
-s
FSA for Derivational Morphology:
Adjectival Formation
More Complex
Derivational Morphology
Using FSAs for Recognition:
English Nouns and their Inflection
Morphological Parsing
Finite-state automata (FSA)
–
–
Recognizer
One-level morphology
Finite-state transducers (FST)
–
Two-level morphology
•
–
PC-Kimmo (Koskenniemi 83)
input-output pair
Terminology for PC-Kimmo
Upper = lexical tape
Lower = surface tape
Characters correspond to pairs, written a:b
If “a:a”, write “a” for shorthand
Two-level lexical entries
# = word boundary
^ = morpheme boundary
Other = “any feasible pair that is not in this
transducer”
Final states indicated with “:” and non-final states
indicated with “.”
Four-Fold View of FSTs
As a recognizer
As a generator
As a translator
As a set relater
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules
Name
Rule Description
Example
Consonant Doubling
1-letter consonant doubled before -ing/-ed
beg/begging
E-deletion
Silent e dropped before -ing and -ed
make/making
E-insertion
e added after s,z,x,ch,sh before s
watch/watches
Y-replacement
-y changes to -ie before -s, -i before -ed
try/tries
K-insertion
verbs ending with vowel + -c add -k
panic/panicked
Chomsky and Halle Notation
ε→e/
x
s
z
^ __ s #
Intermediate-to-Surface
Transducer
State Transition Table
Two-Level Morphology
Sample Run
KIMMO DEMO
FSTs and ambiguity
Parse Example 1: unionizable
– union +ize +able
– un+ ion +ize +able
Parse Example 2: assess
– assessv
– assN +essN
Parse Example 3: tender
– tenderAJ
– tenNum+dAJ+erCMP
What to do about
Global Ambiguity?
Accept first successful structure
Run parser through all possible paths
Bias the search in some manner
Computational Morphology
The Rules and the Lexicon
–
–
–
–
General versus Specific
Regular versus Irregular
Accuracy, speed, space
The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Computational Morphology
The Rules and the Lexicon
–
–
–
–
General versus Specific
Regular versus Irregular
Accuracy, speed, space
The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only (next time!!)
Readings for next time
J&M Chapter 6