Finite State Morphology

Download Report

Transcript Finite State Morphology

Finite State Morphology
Alexander Fraser & Luisa Berlanda
[email protected]
CIS, Ludwig-Maximilians-Universität München
Computational Morphology and Electronic Dictionaries
SoSe 2016
2016-25-05
SFST
• programming language for developing finite-state
transducers
• compiler which translates programs to transducers
• tools for
– applying transducers
– printing transducers
– comparing transducers
SFST Example Session
> echo "Hello\ World\!" > test.fst storing a small test program
> fst-compiler test.fst test.a
calling the compiler
test.fst: 2
> fst-mor test.a
reading transducer...
finished.
analyze> Hello World!
Hello World!
analyze> Hello World
no result for Hello World
analyze> q
interactive transducer usage
transducer is loaded
input
recognised
another input
not recognised
terminate program
Transducer Variables
$Vroot$ = walk | talk | bark
$Vinfl$ = <V>:<> (\
[<inf><n3s>]:<> |\
{<3s>}:{s} |\
{<ger>}:{ing} |\
{<past>}:{ed})
% list of verbs with regular inflection
% regular verbal inflection
$Nroot$ = hat | head | trick
$Ninfl$ = <N>:<> (\
{<sg>}:{} |\
{<pl>}:{s})
% list of nouns with regular inflection
% regular nominal inflection
$Vroot$ $Vinfl$ | $Nroot$ $Ninfl$ % combine stems and inflectional endings
Homework
• Write a pipeline that maps all letters to lowercase and
orders them backwards
• Family Huber has three children. Their first child is called Mia,
the next one Toni and the last one Pia.
• Family Band has three children as well. Michael, Paul and Pia.
• Write a program, that can tell us the following details about
the children: which family does he/she belong to, is it a son or
a daugter, was he/she the first, second or third child.
• Output format:
• <family><family name><first, second child><gender>
Solution
• Write a pipeline that maps all letters to
lowercase and orders them backwards
• [a-z]:[A-Z]* ||
[ZYXWVUTSRQPONMLKJIHFECBA]:[A-Z]*
Solution
{<family><huber>}:{} ({<1><daughter>}:{mia}
|{<1><son>}:{toni} |{<2><daughter>}:{pia}) |\
{<family><band>}:{} ({<1><son>}:{michael}
|{<2><son>}:{paul} |{<1><daughter>}:{pia})
Lexicon Files
$Vroot$ = “verb.lex“
$Nroot$ = “noun.lex“
The command “filename“ reads the respective file line by line and forms the
disjunction of all lines. Only the symbols : \ < > % are treated as operators.
Lexicon Files
$Vroot$ = “verb.lex“
$Vinfl$ = <V>:<> (\
[<inf><n3s>]:<> |\
{<3s>}:{s} |\
{<ger>}:{ing} |\
{<past>}:{ed})
$Nroot$ = “noun.lex“
$Ninfl$ = <N>:<> (\
{<sg>}:{} |\
{<pl>}:{s})
$Vroot$ $Vinfl$ | $Nroot$ $Ninfl$
Symbol Set Variables
#cons# = bcdfghjklmnpqrstvwxzß
#CONS# = BCDFGHJKLMNPQRSTVWXZß
#Cons# = #cons# #CONS#
#vowel# = aeiouäöü
#VOWEL# = AEIOUÄÖÜ
#Vowel# = #vowel# #VOWEL#
#letter# = #vowel# #cons#
#LETTER# = #VOWEL# #CONS#
[#LETTER# #letter#]:[#letter# #LETTER#]*
What would you get for Hallo and Ruß?
Solution
• What would you get for Hallo and Ruß?
•  hALLO
•  rUß
Alphabet
The alphabet defines the set of available symbol pairs which is relevant for
the wildcard symbol „.“, the negation operator „!“ and the replacement
operators (introduced later).
ALPHABET = [A-Z] [A-Z]:[a-z]
The expression on the right-hand side is compiled into a transducer and the
set of character pairs is extracted from its transitions.
A:. is here identical to A:[Aa]
. is identical to .:.
[^A-Z] all characters appearing in the alphabet which are not uppercase
letters, i.e. the set of lowercase letters.
.* maps mixed letter sequences to all uppercase letter sequences (analysis)
Alphabet
fullform.lex:
house<N><sg>
house<>:s<N><pl>
walk<V><inf>
walk<>:i<>:n<>:g<V><ger>
emorph.fst:
ALPHABET = [a-zA-Z] [<V><N><sg><pl><ger><inf>]:<>
“fullform.lex“ || .*
reads the lexicon and deletes the grammatical markers on the surface side.
Orthographic Rules
Replace operator: t ^─> (l _ r)
applies the mapping implemented in the transducer t in the left context l and
right context r. l and r are automata (i.e. transducers mapping strings to
themselves.)
e-elision: bake<V>ing → baking
$Morph$ = bake<V>{<ger>}:{ing}
ALPHABET = [A-Za-z] <V>
$e-elision$ = e:<> ^─> (_ _<V> [ei] )
$Morph$ = $Morph$ || $e-elision$
ALPHABET = [A-Za-z] <V>:<>
$Morph$ || .*
% delete e before <V>e or <V>i
% apply the rule
% delete the <V> marker
Orthographic Rules
What does this program?
$Morph$ = bake<V>{<ger>}:{ing} | crash<V>{<3><sg>}:s |
happy<ADJ>{<comp>}:{er} | fly<V>{<3><sg>}:s
ALPHABET = [A-Za-z] <V><N><ADJ>
$e-elision$ = e:<> ^-> (__<V> [ei])
$e-epenthesis$ = (<V> <>:e) ^-> ([sh]__ s)
$y2i$ = y:i ^-> ([^ae] __[<ADJ><V>] e)
$y2ie$ = y:{ie} ^-> ([^ae] __ [<V><N>] s)
$Morph$ = $Morph$ || $e-elision$ || $e-epenthesis$ || $y2i$ || $y2ie$
ALPHABET = [A-Za-z] [<V><N><ADJ>]:<>
$Morph$ || .*
Solution
• e-epenthesis: crash<V>s → crashes
• y to i: happy<ADJ>er→ happier fly<V>s →
flies
• only in the analyze mode!
Agreement Variables
$Morph$ = big<ADJ>{<comp>}:{er} | fat<ADJ>{<comp>}:{er}
$cons$ = [bcdfghjklmnpqrstvwxz]
$vowel$ = [aeiouy]
#=g# = bdglmnpt
$g$ = [#=g#] <>:[#=g#]
ALPHABET = [A-Za-z] <V><N><ADJ>
$gemination$ = $g$ ^-> ($cons$ $vowel$ __<ADJ> e)
$Morph$ = $Morph$ || $gemination$
ALPHABET = [A-Za-z] [<V><N><ADJ>]:<>
$Morph$ || .*
Solution
• analyze> bigger
 big<ADJ><comp>
• analyze> fater
 no result for fater
• analyze> fatter
 fat<ADJ><comp>
• Thank you for your attention