Morphology and Finite State Transducer


Chapter 3: Morphology and Finite State Transducer
Heshaam Faili
[email protected]
University of Tehran
Morphology

- Morphology is the study of the internal structure of words.
- Morphemes are (roughly) the minimal meaning-bearing units in a language, the smallest "building blocks" of words.
- Morphological parsing is the task of breaking a word down into its component morphemes, i.e., assigning structure:
  - going → go + ing
  - running → run + ing
  - Note that spelling rules (e.g., the doubled n in running) are different from morphological rules.
- Parsing can also provide us with an analysis:
  - going → go:VERB + ing:GERUND
Kinds of morphology

- Inflectional morphology = grammatical morphemes that are required for words in certain syntactic situations:
  - I run
  - John runs
  - -s is an inflectional morpheme marking the third-person singular verb
- Derivational morphology = morphemes that are used to produce new words, providing new meanings and/or new parts of speech:
  - establish
  - establishment
  - -ment is a derivational morpheme that turns verbs into nouns
More on morphology


- We will refer to the stem of a word (its main part) and its affixes (additions), which include prefixes, suffixes, infixes, and circumfixes.
- Most inflectional morphological endings (and some derivational ones) are productive: they apply to every word in a given class.
  - -ing can attach to any verb (running, hurting)
  - re- can attach to any verb (rerun, rehurt)
- Morphology is highly complex in more agglutinative languages like Persian and Turkish.
  - Some of the work done by syntax in English is done by morphology in Turkish.
  - This shows that we can't simply list all possible words.
Overview
A. Morphological recognition with finite-state automata (FSAs)
B. Morphological parsing with finite-state transducers (FSTs)
C. Combining FSTs
D. More applications of FSTs
A. Morphological recognition with FSAs

- Before we talk about assigning a full structure to a word, we can talk about recognizing legitimate words.
- We already have the technology to do this: finite-state automata (FSAs).
Overview of English verbal morphology

- Four English regular verb forms: base, -s, -ing, -ed
  - walk/walks/walking/walked
  - merge/merges/merging/merged
  - try/tries/trying/tried
  - map/maps/mapping/mapped
  - These are generally productive forms.
- English irregular verbs (~250):
  - eat/eats/eating/ate/eaten
  - catch/catches/catching/caught/caught
  - cut/cuts/cutting/cut/cut
  - etc.
Analyzing English verbs

- For the -s and -ing forms, both regular and irregular verbs use their base forms.
- Irregular verbs differ in how they treat the past and the past participle forms.
FSA for English verbal morphology (morphotactics)

initial = 0; final = {1, 2, 3}
0 -> verb-past-irreg -> 3
0 -> vstem-reg -> 1
1 -> +past -> 3
1 -> +pastpart -> 3
0 -> vstem-reg -> 2
0 -> vstem-irreg -> 2
2 -> +prog -> 3
2 -> +sing -> 3

N.B. This covers the morphotactics, but not the spelling rules (the latter require a separate FSA).
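To make the morphotactics concrete, here is a minimal sketch in Python of the FSA above as a nondeterministic recognizer over morpheme-class symbols; the transition table is from the slide, while the word glosses in the comments (walk, ate) are illustrative assumptions.

# Nondeterministic FSA recognizer for the English verbal morphotactics above.
# Input is a sequence of morpheme-class symbols, not raw characters.
TRANSITIONS = {
    (0, "verb-past-irreg"): {3},
    (0, "vstem-reg"): {1, 2},   # regular stems head for -ed or -s/-ing
    (0, "vstem-irreg"): {2},    # irregular stems only take -s/-ing here
    (1, "+past"): {3},
    (1, "+pastpart"): {3},
    (2, "+prog"): {3},
    (2, "+sing"): {3},
}
INITIAL, FINALS = 0, {1, 2, 3}

def accepts(symbols):
    """Return True if the symbol sequence is a legal verb form."""
    states = {INITIAL}
    for sym in symbols:
        states = set().union(*(TRANSITIONS.get((q, sym), set()) for q in states))
        if not states:
            return False
    return bool(states & FINALS)

assert accepts(["vstem-reg", "+past"])            # e.g. walk + ed
assert accepts(["verb-past-irreg"])               # e.g. ate
assert not accepts(["verb-past-irreg", "+prog"])  # *ate + ing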

A fun FSA exercise: Isleta morphology

- Consider the following data from Isleta, a dialect of Southern Tiwa, a Native American language spoken in New Mexico:

  [temiban]  'I went'
  [amiban]   'you went'
  [temiwe]   'I am going'
  [mimiay]   'he was going'
  [tewanban] 'I came'
  [tewanhi]  'I will come'
Practising Isleta

- List the morphemes corresponding to the following English translations:
  'I', 'you', 'he', 'go', 'come', +past, +present_progressive, +past_progressive, +future
- What is the order of morphemes in Isleta?
- How would you say each of the following in Isleta?
  'He went', 'I will go', 'You were coming'
An FSA for Isleta verbal inflection

initial = 0; final = {3}
0 -> mi|te|a -> 1
1 -> mi|wan -> 2
2 -> ban|we|ay|hi -> 3
B. Morphological parsing with FSTs

- Using a finite-state automaton (FSA) to recognize a morphological realization of a word is useful, but what if we also want to analyze that word?
  - e.g., given cats, tell us that it's cat + N + PL
- A finite-state transducer (FST) gives us the necessary technology to do this.
- Two-level morphology:
  - Lexical level: stem plus affixes
  - Surface level: actual spelling/realization of the word
- Roughly, we'll have the following for cats (see the sketch below):
  - c:c a:a t:t ε:+N s:+PL
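To make the symbol-pair notation concrete, here is a minimal sketch in Python that walks the pairs for cats in the analysis direction; only the pair list comes from the slide, and reading the pairs as surface:lexical is an assumption about the notation.

# Symbol pairs for "cats", read surface:lexical as written on the slide.
# An empty surface side ('') plays the role of epsilon.
PAIRS = [("c", "c"), ("a", "a"), ("t", "t"), ("", "+N"), ("s", "+PL")]

def analyze(surface):
    """Consume the surface string against the pair list; return the lexical string."""
    lexical, i = [], 0
    for surf, lex in PAIRS:
        if surf:  # non-epsilon pair: must match the next surface symbol
            if i >= len(surface) or surface[i] != surf:
                return None
            i += 1
        lexical.append(lex)
    return "".join(lexical) if i == len(surface) else None

print(analyze("cats"))  # -> cat+N+PL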
Finite-state transducers

- While an FSA recognizes (accepts/rejects) an input expression, it doesn't produce any other output.
- An FST, on the other hand, additionally produces an output expression → we define this in terms of relations.
- So, an FSA is a recognizer, whereas an FST translates from one expression to another.
  - It reads from one tape and writes to another tape (see Figure 3.8, p. 71).
- Actually, it can also read from the output tape and write to the input tape.
  - So FSTs can be used for both analysis and generation (they are bidirectional).
Transducers and relations

- Let's pretend we want to translate from the Cyrillic alphabet to the Roman alphabet.
- We can use a mapping table, such as:
  - А:A
  - Б:B
  - Г:G
  - Д:D
  - etc.
- We define R = {<А, A>, <Б, B>, <Г, G>, <Д, D>, ...}
- We can think of this as a relation R ⊆ Cyrillic × Roman.
  - To understand FSTs, we need to understand relations.
The Cyrillic transducer

initial = 0; final = {0}
0 -> А:A -> 0
0 -> Б:B -> 0
0 -> Г:G -> 0
0 -> Д:D -> 0
...

- Transducers implement a mapping defined by a relation: R = {<А, A>, <Б, B>, <Г, G>, <Д, D>, ...}
- These relations are called regular relations (since each side can be described by a regular expression).
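A minimal sketch of this single-state transducer in Python, with the mapping table truncated to the four pairs shown on the slide:

# Single-state transducer for Cyrillic-to-Roman transliteration.
# Every arc loops on state 0, so the machine is just the relation itself.
PAIRS = {"А": "A", "Б": "B", "Г": "G", "Д": "D"}  # etc.

def transduce(s):
    """Map a Cyrillic string to its Roman counterpart (reject unknown symbols)."""
    out = []
    for ch in s:
        if ch not in PAIRS:
            raise ValueError(f"no arc for {ch!r}")
        out.append(PAIRS[ch])
    return "".join(out)

print(transduce("БАД"))  # -> "BAD"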
FSAs and FSTs

- FSTs, then, are almost identical to FSAs. Both have:
  - Q: a finite set of states
  - q0: a designated start state
  - F: a set of final states
  - δ: a transition function
- The difference: the alphabet (Σ) for an FST is now comprised of complex symbols (e.g., X:Y).
  - FSA: Σ = a finite alphabet of symbols
  - FST: Σ = a finite alphabet of complex symbols, or pairs
- As a shorthand, if we have X:X, we can write this as X.
FSTs for morphology

- For morphology, using FSTs means that we can:
  - set up pairs between the surface level (actual realization) and the lexical level (stem/affixes)
    - "c:c a:a t:t ε:+N s:+PL"
  - set up pairs to go from one form to another, i.e., the "underlying" base form maps to the plural (goose → geese)
    - "g:g o:e o:e s:s e:e"
- We can combine both kinds of information into the same FST:
  - g:g o:o o:o s:s e:e ε:+N ε:+SG
  - g:g o:e o:e s:s e:e ε:+N ε:+PL
Isleta verbal inflection

- 'I will go'
  - Surface: temihi
  - Lexical: te +PRO +1P mi hi +FUTURE
- The tapes align as follows:
  - te → te +PRO +1P
  - mi → mi
  - hi → hi +FUT
- Note that the cells have to line up across tapes. So, if an input symbol gives rise to more or fewer output symbols, epsilons have to be added to the input/output tape in the appropriate positions.
An FST for Isleta verbal inflection

initial = 0; final = {3}
0 -> mi:mi+PRO+3P | te:te+PRO+1P | a:a+PRO+2P -> 1
1 -> mi|wan -> 2
2 -> ban:ban+PAST | we:we+PRES+PROG | ay:ay+PAST+PROG | hi:hi+FUT -> 3

Interpret te:te+PRO+1P as shorthand for 3 separate arcs.
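Here is a minimal sketch in Python of this FST run in the analysis (surface → lexical) direction; chunking the word into pronoun/stem/tense substrings is a simplifying assumption that stands in for the epsilon-padded per-character arcs.

# Sketch of the Isleta FST in the analysis direction. Each table maps a
# surface chunk to its lexical-level output; trying the three stages in
# sequence mirrors the 0 -> 1 -> 2 -> 3 path of the FST.
PRONOUNS = {"te": "te+PRO+1P", "a": "a+PRO+2P", "mi": "mi+PRO+3P"}
STEMS = {"mi": "mi", "wan": "wan"}
TENSES = {"ban": "ban+PAST", "we": "we+PRES+PROG",
          "ay": "ay+PAST+PROG", "hi": "hi+FUT"}

def analyze(word):
    """Yield every lexical analysis of a surface Isleta verb form."""
    for p, p_out in PRONOUNS.items():
        for s, s_out in STEMS.items():
            for t, t_out in TENSES.items():
                if word == p + s + t:
                    yield f"{p_out}+{s_out}+{t_out}"

print(list(analyze("temihi")))  # -> ['te+PRO+1P+mi+hi+FUT']
print(list(analyze("mimiay")))  # -> ['mi+PRO+3P+mi+ay+PAST+PROG']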
A lexical transducer (Xerox)

- Remember that FSTs can be used in either direction.

  l e a v e +VBZ : l e a v e s
  l e a v e +VB  : l e a v e
  l e a v e +VBG : l e a v i n g
  l e a v e +VBD : l e f t
  l e a v e +NN  : l e a v e
  l e a v e +NNS : l e a v e s
  l e a f  +NNS  : l e a v e s
  l e f t  +JJ   : l e f t

- Left-to-right: input leave+VBD ("upper language") → output left ("lower language")
- Right-to-left: input leaves (lower language) → output leave+NNS, leave+VBZ, leaf+NNS (upper language)
Transducer example (Xerox)

- L1 = [a-z]+
- Consider the language L2 that results from replacing any instance of "ab" in L1 by "x".
- So, to define the mapping, we define a relation R ⊆ L1 × L2
  - e.g., <"abacab", "xacx"> is in R.
  - Note: "xacx" in the lower language is paired with 4 strings in the upper language: "abacab", "abacx", "xacab", and "xacx".
- N.B. ? = [a-z] \ {a, b, x}
C. Combining FSTs: spelling rules

- So far, we have gone from a lexical level (e.g., cat+N+PL) to a "surface level" (e.g., cats).
- But this surface level is actually an intermediate level → it doesn't take spelling into account.
  - So, the lexical level "fox+N+PL" corresponds to "fox^s".
  - We will use "^" to refer to a morpheme boundary.
- We need another level to account for spelling rules.
Lexicon FST

- The lexicon FST will convert the lexical level to an intermediate form:
  - dog+N+PL → dog^s
  - fox+N+PL → fox^s
  - mouse+N+PL → mouse^s
  - dog+V+SG → dog^s
- This will be of the form:
  - 0 -> f -> 1
  - 1 -> o -> 2
  - 2 -> x -> 3
  - 3 -> +N:^ -> 4
  - 4 -> +PL:s -> 5
  - 4 -> +SG:ε -> 6
  - and so on
English noun lexicon as an FST

[Figure: the English noun lexicon as an FST (J&M Fig 3.9), and the LEX-FST obtained by expanding the aliases (J&M Fig 3.11)]
LEX-FST

- Let's allow ε to pad the tape.
  - Then, we won't force both tapes to have the same length.
- Also, let's pretend we're generating. Lexical tape above, intermediate tape below:

  c a t +N +PL #        f o x +N +PL #
  c a t ^  s   #        f o x ^  s   #

  goose → geese: g:g o:e o:e s:s e:e

  (^ = morpheme boundary; # = word-final boundary)
Rule FST

- The rule FST will convert the intermediate form into the surface form:
  - dog^s → dogs (covers both N and V forms)
  - fox^s → foxes
  - mouse^s → mice
- Assuming we include other arcs for every other character, this will be of the form:
  - 0 -> f -> 0
  - 0 -> o -> 0
  - 0 -> x -> 1
  - 1 -> ^:ε -> 2
  - 2 -> ε:e -> 3
  - 3 -> s -> 4
- This FST is too impoverished …
Some English spelling rules

- Consonant doubling: beg / begging / begged
- E deletion: make / making
- E insertion: watch / watches
- Y replacement: try / tries
- K insertion: panic / panicked
E-insertion FST (J&M Fig 3.14, p. 78)

ε → e / {x, s, z} ^ __ s#

(i.e., insert an e after a morpheme boundary that follows x, s, or z and precedes a word-final s)
E-insertion FST

  Intermediate tape: f o x ^ s #
  Surface tape:      f o x e s #

Traces:

- generating foxes# from fox^s#:
  q0 -f-> q0 -o-> q0 -x-> q1 -^:ε-> q2 -ε:e-> q3 -s-> q4 -#-> q0
- generating foxs# from fox^s#:
  q0 -f-> q0 -o-> q0 -x-> q1 -^:ε-> q2 -s-> q5 -#-> FAIL
- generating salt# from salt#:
  q0 -s-> q1 -a-> q0 -l-> q0 -t-> q0 -#-> q0
- parsing assess#:
  q0 -a-> q0 -s-> q1 -s-> q1 -^:ε-> q2 -ε:e-> q3 -s-> q4 -s-> FAIL
  q0 -a-> q0 -s-> q1 -s-> q1 -e-> q0 -s-> q1 -s-> q1 -#-> q0
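As a sanity check on the rule (not a full implementation of the five-state machine in the figure), here is a minimal sketch in Python that applies e-insertion to an intermediate-tape string; it assumes ^ marks morpheme boundaries and # the word end, as on the slides.

import re

def e_insertion(intermediate):
    """Apply  ε -> e / {x,s,z} ^ __ s#  to an intermediate-tape string,
    then erase the remaining boundary symbols."""
    # Insert e when a boundary follows x/s/z and precedes word-final s.
    surface = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # -> foxes
print(e_insertion("cat^s#"))   # -> cats
print(e_insertion("kiss^s#"))  # -> kisses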
Combining lexicon and rule FSTs

- We would like to combine these two FSTs, so that we can go from the lexical level to the surface level.
- How do we integrate the intermediate level?
  - Cascade the FSTs: run one after the other
  - Compose the FSTs: combine the rules at each state
Cascading FSTs

- The idea of cascading FSTs is simple:
  - Input1 → FST1 → Output1
  - Output1 → FST2 → Output2
- The output of the first FST is run as the input of the second (see the sketch below).
- Since both FSTs are reversible, the cascaded FSTs are still reversible/bidirectional.
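A minimal sketch of such a cascade in Python, reusing the e-insertion rule sketched above; the toy LEXICON dict stands in for the lexicon FST, and its entries are assumptions for the example.

import re

# Rule FST stage: the e-insertion sketch from the previous section.
def e_insertion(intermediate):
    surface = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    return surface.replace("^", "").replace("#", "")

# Lexicon FST stage: a toy lexical -> intermediate mapping (assumed entries).
LEXICON = {"cat+N+PL": "cat^s#", "fox+N+PL": "fox^s#", "dog+N+PL": "dog^s#"}

def generate(lexical):
    """Cascade: lexical -(lexicon FST)-> intermediate -(rule FST)-> surface."""
    return e_insertion(LEXICON[lexical])

print(generate("fox+N+PL"))  # -> foxes
print(generate("cat+N+PL"))  # -> cats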
Composing FSTs

- We can compose each transition in one FST with a transition in another:
  - FST1: p0 -> a:b -> p1,  p0 -> d:e -> p1
  - FST2: q0 -> b:c -> q1,  q0 -> e:f -> q0
- Composed FST:
  - (p0,q0) -> a:c -> (p1,q1)
  - (p0,q0) -> d:f -> (p1,q0)
- The new state names (e.g., (p0,q0)) seem somewhat arbitrary, but this ensures that two FSTs with different structures can still be composed.
  - e.g., a:b and d:e originally went to the same state, but now we have to distinguish those states.
- Why doesn't e:f loop anymore? (Because FST1's side of the composed state advances from p0 to p1, so the composed arc is no longer a self-loop.)
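A minimal sketch in Python of this product construction over the two toy FSTs above; it composes single arcs only and ignores epsilon handling, which is enough to reproduce the slide's example.

# Compose two FSTs given as lists of (src, in_sym, out_sym, dst) arcs.
# An arc a:b in FST1 composes with b:c in FST2 to give a:c in the product.
FST1 = [("p0", "a", "b", "p1"), ("p0", "d", "e", "p1")]
FST2 = [("q0", "b", "c", "q1"), ("q0", "e", "f", "q0")]

def compose(f1, f2):
    arcs = []
    for (p, a, b, p2) in f1:
        for (q, b2, c, q2) in f2:
            if b == b2:  # FST1's output must match FST2's input
                arcs.append(((p, q), a, c, (p2, q2)))
    return arcs

for arc in compose(FST1, FST2):
    print(arc)
# ((p0,q0), a, c, (p1,q1)) and ((p0,q0), d, f, (p1,q0)), as on the slide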
Composing FSTs for morphology

- With our lexical, intermediate, and surface levels, this means that we'll compose:

  Lexicon FST:            Rule FST:
  p2 -> x -> p3           q0 -> x -> q1
  p3 -> +N:^ -> p4        q1 -> ^:ε -> q2
  p4 -> ε:ε -> p4         q2 -> ε:e -> q3
  p4 -> +PL:s -> p5       q3 -> s -> q4

- into:

  (p2,q0) -> x -> (p3,q1)
  (p3,q1) -> +N:ε -> (p4,q2)
  (p4,q2) -> ε:e -> (p4,q3)
  (p4,q3) -> +PL:s -> (p5,q4)
Generating or parsing with the FST lexicon and rules

[Figure: generating and parsing with the FST lexicon and rules]
Lexicon-free FST: the Porter stemmer

- Used for IR and search engines
  - e.g., a search for "foxes" should also match "fox"
  - this is called stemming
- Lexicon-free: the Porter algorithm
  - ATIONAL -> ATE (relational -> relate)
  - ING -> ε if the stem contains a vowel (motoring -> motor)
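A minimal sketch in Python of just these two rules; the real Porter algorithm has many more rules and measure-based conditions, so this checks only for a vowel in the stem, as on the slide.

import re

def porter_step(word):
    """Apply two illustrative Porter rules: ATIONAL -> ATE, and ING-deletion."""
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    if word.endswith("ing"):
        stem = word[:-len("ing")]
        if re.search(r"[aeiou]", stem):  # rule applies only if the stem has a vowel
            return stem
    return word

print(porter_step("relational"))  # -> relate
print(porter_step("motoring"))    # -> motor
print(porter_step("sing"))        # -> sing (stem "s" has no vowel)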
D. More applications of FSTs

- Syntactic parsing using FSTs
  - approximates the actual structure
  - (it won't work in general for syntax)
- Noun phrase (NP) parsing using FSTs
  - also referred to as NP chunking, or partial parsing
  - often used for prepositional phrases (PPs), too
Syntactic parsing using FSTs

- Parsing is more than recognition; it returns a structure.
  - For syntactic recognition, an FSA could be used.
- How does syntax work?

  S -> NP VP      D -> the
  NP -> (D) N     N -> girl
  VP -> V NP      V -> saw
                  N -> zebras

- How do we go about encoding this?
Syntactic parsing using FSTs

- A cascade of FSTs builds the structure bottom-up: FST1 builds NPs, FST2 builds VPs, FST3 builds Ss.
- Input: The girl saw zebras (D N V N)
- FST1 (NPs): initial = 0; final = {2}
  0 -> N:NP -> 2
  0 -> D:ε -> 1
  1 -> N:NP -> 2
- Running the cascade (see the sketch below):

  Input: D  N   V  N
  FST1:  ε  NP  V  NP
  FST2:  ε  NP  ε  VP
  FST3:  ε  ε   ε  S
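A minimal sketch in Python of FST1 run over a POS-tag sequence; it follows the slide's convention of emitting ε (here an empty string) for the determiner and the NP label once the phrase is complete, and it simply passes other tags through, which is a simplifying assumption.

# FST1 from the slide: 0 -> D:eps -> 1, and {0,1} -> N:NP -> 2.
# Runs over a tag sequence, restarting at state 0 after each completed NP.
def np_chunk(tags):
    out, state = [], 0
    for tag in tags:
        if state == 0 and tag == "D":
            out.append("")        # D:eps -- hold output until the NP completes
            state = 1
        elif tag == "N":          # 0 -> N:NP  or  1 -> N:NP
            out.append("NP")
            state = 0
        else:
            out.append(tag)       # pass other tags through unchanged
            state = 0
    return out

print(np_chunk(["D", "N", "V", "N"]))  # -> ['', 'NP', 'V', 'NP']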
Syntactic structure with FSTs

- Note that the previous FSTs only output labels after the phrase has been completed. Where did the phrase start?
- To fully capture the structure of a sentence, we need an FST which delineates the beginning and the end of a phrase:
  - 0 -> Det:NP-Start -> 1
  - 1 -> N:NP-Finish -> 2
- Another FST can group the pieces into complete phrases.
Why FSTs can't always be used for syntax

- Syntax is infinite, but we have set up a finite number of levels (depth) in a tree with a finite number of FSTs.
  - We can still use FSTs, but (arguably) not as elegantly.
  - The girl saw that zebras saw the giraffes.
    - We have a VP inside another VP and will have to run FST2 twice at different times.
  - Furthermore, we begin to get very complex FST abbreviations, e.g., /Det? Adj* N PP*/, which don't match linguistic structure.
- Center-embedding constructions:
  - allowed in languages like English
  - mathematically impossible to capture with finite-state methods
Center embedding

- Examples:
  - The man [(that) the woman saw] laughed.
  - The man [Harry said [the woman saw]] laughed.
- An S in the middle of another S
- This is a problem for FSA/FST technology:
  - There's no way for finite-state grammars to make sure that the number of NPs matches the number of verbs.
  - These are a^n c b^n constructions → not regular.
- We have to use context-free grammars … a topic we'll return to later in the course.
Noun phrase (NP) parsing using FSTs

- If we make the task more narrow, we can have more success, e.g., only parse NPs:
  - The man on the floor likes the woman who is a trapeze artist
  - [The man]NP on [the floor]NP likes [the woman]NP who is [a trapeze artist]NP
- Taking the NP chunker's output as input, a PP chunker can then be run:
  - [The man]NP [on [the floor]NP]PP likes [the woman]NP who is [a trapeze artist]NP
Exercises

- Exercises 3.1, 3.4, and 3.6 (3.3 and 3.5 in the new, 2005, edition)
- Write a Persian morphological analyzer (covering both nouns and verbs) in Perl.
  - See the document on morphological analysis from the Shiraz project.