Computational Morphology
Download
Report
Transcript Computational Morphology
Computational Morphology
Outline
• What is morphology?
– Word structure
– Types of morphological operation
– Levels of affixation
• Computational approaches to morphology
– Finite State transducers
– Two level morphology
– Koskenniemi’s rule formalism
Morphology
S.Ananiadou
2
References
• L. Bauer (1988) Introducing Linguistic Morphology,
EUP
• A. Spencer (1991) Morphological Theory,
Blackwells
• Jurafsky, D. & Martin, J. (2000) Speech and
Language Processing, Chapter 3.
• Koskenniemi, K. & Church, K. (1988) “Complexity,
two-level morphology, and Finnish”, in COLING-88,
Budapest, pp.335-339.
• Ananiadou, S. & McNaught, J. (1986) A Review of
Two-level Morphology. Eurotra Research Paper.
September 1986.
Morphology
S.Ananiadou
3
What is morphology?
• Morphology is the study of the way words are
built up from smaller meaning bearing units,
morphemes.
– ‘antiintellectualism’ -anti -ism -al -intellect
• Free and bound morphemes
– intellect (free)
– anti- -ism, -al (bound)
• Stems and affixes
• Complex words contain a central morpheme, which
contributes the basic meaning, and a collection of
other morphemes serving to modify this meaning
in different ways.
Morphology
S.Ananiadou
4
• ‘disagreements’
agree (stem) dis- -ment -s (affixes)
dis- prefix -ment suffix
-s suffix
• English doesn’t stack more than 4-5 affixes,
Turkish 10 affixes. Agglutinative language.
• Two broad classes of ways to form words from
morphemes: inflection and derivation.
• Inflection: is the combination of a word stem with
a grammatical morpheme, resulting in a word of
the same class
– cat-s cats
play-ed
• Derivation: combination of a word stem with a
grammatical morpheme, resulting in a word of a
different class
– agree -ment
Morphology
S.Ananiadou
5
English Inflectional Morphology
• English nouns have two kinds of inflection: plural &
possessive
– cat cats / ibis ibises / finch finches / box boxes
– llama’s / children’s / llamas’
• English verbal inflection is more complicated
– main verbs (eat/sleep)
– modal verbs (can / will/ should)
– primary verbs (be, have, do) (see Quirk et al: Grammar
of English Language)
– Regular verbs (walk / walks / walking / walked)
– Irregular verbs (eat / eats / eating / ate / eaten)
Morphology
S.Ananiadou
6
Derivational Morphology
• Syntactic category changing e.g. nominalization
computerize computerization
Suffix
Base Noun/Verb/adjective Derived Noun
-ation
computerize
computerization
-ee
appoint
appointee
-er
kill
killer
-ness
fuzzy
fuzziness
-al
computation
computational
-able
like
likeable
-less
clue
clueless
Morphology
S.Ananiadou
7
• Derivation is less productive
• Affixes attach to stems and to each other
according to certain constraints
• Level Ordering in Derivation
• In English we distinguish 2 types of affixation
– class I affixation (+)
– class II affixation (#)
– Class I occurs before class II
I -> ion, ity, ate, ive, ic ...
II -> y, ly, like, ful, ness, less, hood …
danger-ous1-ness2
*fear-less2-ity1
*tender-ness2-ous1
Morphology
S.Ananiadou
8
• Members of the same family may appear in any
order with respect to each other
fear-less-ness
tender-ness-less
• Ordering Hypothesis about occurrence of
morphological processes occur or morphotactics
Class I affixation
Class II affixation
Inflection
Compounding
Morphology
S.Ananiadou
9
Finite State Morphological Parsing
• Take an input like ‘cats’ and produce output forms
like ‘cat +N +PL’ (morphological features)
• In order to build a morphological parser we need:
– lexicon
– morphotactics
– orthographic rules (spelling rules) model the changes
occurring when two morphemes combine e.g. city cities
How to use FSA to model morphotactic information
FST as a way of modeling morphological features in
the lexicon
How to use FSTs to model orthographic rules
Morphology
S.Ananiadou
10
Lexicon and Morphotactics
• A lexicon is a repository of words
• Since we cannot list every word in the language,
computational lexicons are structured as a list of
stems and affixes with a representation of the
morphotactics.
• One way to model morphotactics is the finitestate automaton
Reg-noun
Plural
q1
q0
q2
Irregular-pl-noun
Irreg-sg-noun
Morphology
S.Ananiadou
11
reg-noun
fox
cat
dog
aardvark
irreg-pl-noun
geese
sheep
mice
irreg-sg-noun
goose
sheep
mouse
reg-verb
stem
irreg-verb- irreg-past
stem
verb
past
walk
fry
talk
impeach
cut
speak
sing
sang
spoken
-ed
caught
ate
eaten
plural
-s
past-part pres-part 3sg
-ed
-ing
-s
• English derivational morphology is more complex than
inflectional morphology, automata for modeling are complex
Morphology
S.Ananiadou
12
• Morphotactics for English adjectives
big, bigger, biggest
cool, cooler, coolest, coolly
red, redder, reddesr
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
real, unreal, really
• we need to set up classes of roots and specify
which can occur with which suffixes
– Adj-root1 would include adjectives that can occur with
un- and -ly (clear, happy, real)
– Adj-root2 will include adjectives that can’t (big, cool,
red)
Morphology
S.Ananiadou
13
An FSA for a fragment of English adjective
morphology
adj-root1
un-
-er, -ly, -est
q2
q1
q0
adj-root1
q5
q4
q3
adj-root2
Morphology
-er -est
S.Ananiadou
14
• We can use FSAs to solve the problem of
morphological recognition; determining whether an
input string of letters makes up a legitimate
English word or not.
– We do this by taking the morphotactic FSAs and plugging
in each sub-lexicon into the FSA
– we expand each arc (reg-noun-stem arc) with all the
morphemes that make up the set of reg-noun-stem.
Morphology
S.Ananiadou
15
Morphological parsing with FSTs
• Given input cats, we want output cat + N +PL telling
us that cat is a plural noun
• We do this via two-level morphology (TLM)
– TLM represents a word as a correspondence between a
lexical level, which represents a simple concatenation of
morphemes making up a word, and the surface level,
which represents the actual spelling of the final word.
– Morphological parsing is implemented by building mapping
rules that map letter sequences like cats on the surface
level into morpheme and features sequences like cat + N
+ PL on the lexical level
– the automaton used for this mapping is the finite-state
transducer or FST
Morphology
S.Ananiadou
16
FST
• FST maps sets of symbols via a finite automaton
• We visualize an FST as a two-tape automaton
which recognizes or generates pairs of strings.
• FST defines a relation between sets of strings; an
FST is a machine that reads one string and
generates another
lexical
c
surface
c
a
a
t
+N +PL
t
s
Morphology
S.Ananiadou
17
• An FST accepts a language over pairs of symbols,
as in:
= { a : a, b : b, ! : !, a : !, a : , : !}
• For TLM we view an FST as having two tapes; the
upper or lexical tape, is composed from characters
from the left side of the a : b pairs, the lower or
surface tape, is composed of characters from the
right side of the a : b pairs.
• Dictionary, text: each consist of a sequence of
items
– items of the dictionary are expressed according to an
alphabet which consists of {a…z}, 0 (empty character), +
morpheme boundary character, set of archi-phonemes
e.g. S for {s, z}
– items of text are expressed by a subset of this alphabet
{a…z, 0}
Morphology
S.Ananiadou
18
• We can build an FST morphological parser out of a
morphotactic FSA and lexica by adding an extra
lexical tape and the appropriate morphological
features
Reg-noun-stem q1
+N:
q0
Irreg-sg-noun-f q2
q4
+N:
q5
+PL: ^s#
+SG:#
q7
+N:
Irreg-pl-noun-f
q3
Morphology
q6
S.Ananiadou
+PL:#
19
Koskeniemmi’s work
• In this model, all FST’s treating individual
phenomena operate in parallel
– so rule ordering and interactions between rules is not
necessary part of morphological description
– all FSTs share the same two heads but otherwise operate
completely independently
– heads move at the same time
– to have an overall correspondence between lexical and
surface string, two heads must have reached the end of
two strings, and all the FSTs must be in a final state
– when all FSTs agree, a correspondence is reached
– if only one FST blocks while scanning the two strings
then the proposed correspondence is rejected
Morphology
S.Ananiadou
20
……… t r i e s …………
FST1 FST2 FST3 … FSTn
………… t r Y + s …..
text
d o g 0 s
surface tape
FST
dictionary
lexical tape
d
o g+ S
sequence of mappings d,d o,o g,g 0,+ s,S
• the morpheme boundary + corresponds to nothing on the
surface; the S archiphoneme / grapheme corresponds to
surface s.
Morphology
S.Ananiadou
21
Koskenniemi’s rule formalism
• The general form of a rule is
CP op LC --- RC
– CP = correspondence part; this is a concrete or abstract
character pair whose occurrence is restricted by the
rule
– op = an operator, one of four types; four types of rules
– LC, RC = left context, right context
The Rules
Exclusion rule: a : b / LC - RC
a may not be realised as b, in the context LC-RC
a:b not allowed in given context
Morphology
S.Ananiadou
22
Context restriction rule a:b LC --RC
a may be realised as b only in the given context, and
nowhere else; a:b allowed in given context
Surface coercion rule a:b LC-RC
a must be realised as b in the given context; a:b required in
given context
Composite rule a:b LC--RC
this rule is a combination of context restriction and surface
coercion; a lexical a must correspond to surface b in the
given context, and this correspondence is licit only in
that context; a:b required in given context and nowhere
else
Morphology
S.Ananiadou
23
Example of Koskenniemi’s rule
formalism
• Treats epenthesis in English
– Epenthesis: a morpheme boundary +, is realised as an ‘e’
on the surface when it follows ‘ch’, ‘sh’, ‘s’, ‘x’, ‘z’ or ‘y/i’
and occurs before an ‘s’. Otherwise the lexical character
+ corresponds to 0 on the surface (empty string)
– foxes, churches, spies (+:e)
– +/e { { c | s (h) } | S | y/i} --s
CP op
LC
RC
– CP, LC, RC consist of sequences of pairs, the first
member of a pair drawn from the lexical alphabet, the
second from the surface alphabet
Morphology
S.Ananiadou
24