Transcript Ling645-FST

Morphology
What is morphology?
Finite State Transducers
Two Level Morphology
What is morphology?
• Decomposition of words into meaningful units:
anti dis establish ment arian ism
•Interacts with- syntax( categories and word order)
[establish] = verb + ment = noun
phonology: divine divinity
obscene obscenity
• Interacts with semantics:
boy boys
Peter Peterchen
Phonological String
morphological analyzer
dictionary lookup
syntactic analyzer
lexical- semantic analysis
discourse processing
Why store all words as morphemes rather than all
Morphological combinations as words?
What does the morphological analyzer have to output?
The what and the how:
•Efficient and effective algorithm to decompose categories into,
or build categories from, component morphemes.
What this algorithm will be depends on problems it has to solve.
In turn depends on representations computed.
Given stem /lemma ( e.g. ‘jump’ add material to change category
Or grammatical properties of word ‘jumped’, ‘jumpable’
• order of composition matters:
ride/ riding
enoble/ enobling/*nobling Adj ---> V,
trance/*trancing/entrance/entrancing
V===> V+ing
CONCATENATIVE MORPHOLOGICAL
PROCESSES:
COMPOUNDING:
firefighter
PREFIXATION:
Un+ well
INFIXATION: ( TAGALOG)
fikas - strong
fumikas - be strong
SUFFIXATION:
Kick + er
CIRCUMFIXATION: ( German)
ge [sag] t past prefix [say] past suffix
Inflectional Morphology
•non category changing, required by syntax
•Agreement: person/number:
Je parle
Nous parlons
Ils parlent
• Gender:
la petite ( the little one (fem))
le petit ( the little one (masc))
la squelette ( the skeleton)
Derivational Morphology
•changes category. Not required by syntax
Deverbal Nominal:
bak+er
catch+ er
tion: destroy/destruction
Roman's destruction of the city
'er' = agent of action Catcher of the ball
John’s catcher of the ball
'John" ~= one who caught
Regular vs Irregular
Jump/jumped
hit/hit bring/brought sing/sang
Productive/Non-Productive
adore/adorable, kick/kickable, fax/faxable
produce/production destroy/destruction *graft/graftuction
Bring/ brought
Regular (English) Verbs
Morphological Form Classes
Regularly Inflected Verbs
Stem
walk
merge
try
map
-s form
walks
merges
tries
maps
-ing form
walking
merging
trying
mapping
Past form or –ed participle
walked
merged
tried
mapped
Irregular (English) Verbs
Morphological Form Classes Irregularly Inflected Verbs
Stem
eat
catch
cut
-s form
eats
catches
cuts
-ing form
eating
catching cutting
Past form
ate
caught
cut
-ed participle
eaten
caught
cut
“To love” in Spanish
•Productive and rule governed:
fax
fax +er
??? Crudoy
cruduction
•Category sensitivity:
breakable/* manable
sensitivity/ *hittivity
•Semantic sensitivity:
un + well
un + happy
*un + ill
*un+ sad
Store morphemes or words?
lebensversicherungsgesellschaftsangesteller
leben+ versicherung + gesellschaft+s+angesteller
life insurance company +Poss
employee
Turkish:
Turkish verns have 40k forms
Non- concatenative Morphology
• Templatic morphology (Semitic languages):
lmd (learn), lamad (he studied), limed (he
taught), lumad (he was taught)
Concatenation: Beads on a string
Agglutinative ( concatenative) languages are well behaved for FSAs
as long as we don’t include phonological or spelling changes
Verb Lexicon:
jump
kiss
stream
hop, ???
jump+ed
kiss+ed
stream+ed
*hopp+ed
verb
q q0
ed
qq11
q2
Pieces of a Morphological Analyzer
un
q0
-er,est,ly
adj-root
q1
q2
q3
The lexicon stores the lemmas, and divides them into adjective
classes
really/clearly *bigly/redly
Morphotactics:
State sequence indicates order of morpheme composition
e.g. comparative or adverb formation is by suffixation
Lexicon
• Arranged as TRIE ( letter strings in common relative to position
n-k-e-y
D-o
-g
• Classed by part of speech category ( noun, verb) and morphotactic
(which other affixes can precede or follow)
or orthographic considerations.
Orthography
•spelling rules- handle phonological or spelling variation in
orthographic
a morpheme
Try /trying/tries
Cringe/cringing/cringes
FSA for Inflectional Morphology:
English Nouns
FSA for Inflectional
Morphology: English Verbs
FSA for Derivational Morphology:
Adjectival Formation
More Complex
Derivational Morphology
Using FSAs for Recognition:
English Nouns and their Inflection
Orthographic
• Want association between morpheme and semantic function
• Want association between allographs or allophones of the same
phoneme
Allographs:
city -cities
bake- baking
divine-divinity
try tried
Finite State Transducers (FSTs)- the Big Idea
Need to relate lexical level, the level that gives us the morphological
analysis (+plural,+able to the surface level that keeps track of
phonological/
or graphological (spelling_ changes)
Two Level Morphology:
-
Finite state transducer moves pair of read heads
simultaneously along 2 strings and maps one string into
another string, making sure that steps are aligned and
derivation proceeds in parallel
-
Parallelism important to limit growth of transducers
-
Total machine = sum of individual machines.
-
Parallelism important for efficiency.
Parsing vs recognition
• An FSA can give you the string composition
of a morphological sequence, and can tell
you whether a given morphological string is
or is not in the language. It recognizes the
string
• An FST parses the string. It tells you the
morphological structure associated with the
string. Other instances of parsing?
Formal definition
•An FST defines a relation between sets of pairs of strings:
•It contains at least a lexical level that is a concatenation of morphemes
and a surface level that shows the correct spelling for each
morpheme in a given context
cat/sheep
^ s
e.g. noun (instanciated from lexicon) + plural
E s
cats/sheep
Q= finite set of states q0 to qn
finite alphabet of complex symbols (feasible pairs)
i:o with one symbol from the input alphabet
Q0 = the start state
F= set of final states
 = (q, i:o) the transition function or matrix between
states. Takes a state from Q and a complex symbol
i:o from and returns a new state.
feasible pair: a relation of a symbol on one tape to a symbol
on the other tape.
e.g. can + [pl:^s]
• default pair- the upper tape is the same as the lower tape
same input as output :c*a*t/c:c*a:a*t:t*pl:^s
•feasible pairs either stated in lexicon if irregular
g:g*o:e*o:e*s:s*e:e goose:geese
or by an automaton that stipulates correspondence in rule
governed way if the relation is regular. If regular, indicated as
Default paris and usually represented by one symbol.
•FSTs are closed under:
inversion: switches i/o labels
composition: union of two transducers
one after the other.
trie: in lexicon, categories arranged by letter one at a time with
class at end. Allows parallel search as long as things match
e.g. m*e*t*a*l <N> m*e*t*a <root>
metal, meta-language
Kimmo-Based
Morphological Parsing
• Two-level morphology: lexical level +
surface level (Koskenniemi 83)
• Finite-state transducers (FST): input-output
pair
Four-Fold View of FSTs
•
•
•
•
As a recognizer
As a generator
As a translator
As a set relater
Terminology for Kimmo
•
•
•
•
•
•
•
•
Upper = lexical tape
Lower = surface tape
Characters correspond to pairs, written a:b
If “a=b”, write “a” for shorthand
Two-level lexical entries
# = word boundary
^ = morpheme boundary
Other = “any feasible pair that is not in this
tranducer”
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules
Name
Rule Description
Example
Consonant Doubling
1-letter consonant doubled before -ing/-ed
beg/begging
E-deletion
Silent e dropped before -ing and -ed
make/making
E-insertion
e added after s,z,x,ch,sh before s
watch/watches
Y-replacement
-y changes to -ie before -s, -i before -ed
try/tries
K-insertion
verbs ending with vowel + -c add -k
panic/panicked
Notation
x
e--> e / s
z
^ __ s #
Intermediate-to-Surface
Transducer
Two-Level Morphology
Sample Run
FSTs and ambiguity
Parse Example 1: unionizable
Parse Example 2: assess
What to do about
Global Ambiguity?
• Accept first successful structure
• Run parser through all possible paths
• Bias the search in some manner
Some Limitations
Latin Nominative deletion:
( from Sproat)
stem
leg
lit
fraud
front
dent
sort
Genitive
legis
litis
fraudis
frontis
dentis
sortis
Nominative
legs
lis
fraus
frons
dens
sors
Gloss
law
strife
deceit
brow
tooth
lot
t/d
t/d ---------> 0/[+son ( vowels or nasals)]_______s
Rule ordering:
Finnish:
"T-deletion"
t ------> 0
V______V ( simplified)
naura( to laugh) + ten ( 2nd infinitive)
nauraen
i ------> j/ V________V
talo +
house
i +
pl
talo + i +en
talo + j +en
talojen
ten
Gen
Kimmo treatment of Finnish
- no rule ordering because feeding relationships block parallelism.
- two stems for suffix
iten( genitive plural)
ien
selector feature to trigger j -----> i rule
t a l o & + i e n
t a l o 0 0 j e n
Insert & into V____V ( simplified)
diacritic triggers j ---> i automaton
Cost = something that's rule governed is listed in lexicon. Not expressing
truw generalization.
Gain = parallelism and potentially effciency of finite state transduction.
Stemming
• For some applications,don’t need full morphological analysis.
• IR- don’t care that e.g ‘logician’ is related to ‘logical’ Just want
to know that if you are interested in articles about ‘logic’
may want former two classes as well. So just want to ‘get back
to root list.
• Relate two forms by having a literal relation rule. E.g
al#---> 0
• Is it useful: in a big document may not be necessary because the
will appear in many forms including form in query
• stemming is morphologically impoverished so error driven
- can’t distinguish rules that apply at morpheme boundaries
versus internal to root:
patronization = patron + ize + ation
organization = organize+ ation
But the stemmer will treat these as a single class and derive
“organ” as an underlying root.
-’adverse’/’adversity
‘universe / university
Psycholinguistics
• Is the human lexicon efficient in the way computational lexica
are?
-Stanners et al (1979) :where two words are related inflectionally,then root stored and other forms rule derived. Where
there is a derivational relationship, then both forms are stored
paradigm = repetition priming
‘great, happy, peachy, adorable , round, short, great
small
Repetition priming for ‘turns’ given ‘turning’ but not
‘select’, ‘selective’
• Marslen- Wilson et al (1994): May have priming for
Semantically similar derivationally related words:
permit/permission
* create/creativity
On-line versus long term storage lexicon:
Speech errors: ‘we have screw looses’