MLTT Summer Interns
Download
Report
Transcript MLTT Summer Interns
The lexc Language
Prepare to partition your brain to learn a
whole new formalism.
Beesley 2001
The lexc language
• “lexc” stands for “LEXicon Compiler”
• lexc is a high-level, declarative programming language
• lexc is different from regular expressions and from xfst
•
•
•
•
the syntax is different
the assumptions are different
the special characters are different
the interfaces are different
This is all
fertile ground
for confusion!
• BUT, the lexc compiler produces STANDARD Xerox networks
• these networks are fully compatible with networks from xfst
• you can sometimes choose to use lexc or xfst for building a network
Beesley 2001
Why a Separate lexc Language?
• Lexc is intended for use by lexicographers.
• Regular expressions in xfst are often hard to read, especially big ones
• Typing spaces between all the letters , e.g. e l e p h a n t , to be concatenated
in xfst is a nuisance, especially if you need to type 40,000 words
• You can also write {elephant} in xfst regular expressions, but that’s a
nuisance too
• Lexc is more efficient for compiling large natural-language
lexicons (it optimizes the union operation)
• Lexc has better error messages
• But remember:
• lexc is just another formalism for defining finite-state languages and relations
• you can (and will) use lexc and xfst together in building significant
applications
Beesley 2001
The lexc Source File:
Multichar_Symbols
The lexc compiler and the xfst regular-expression compiler
have completely opposite assumptions about
multicharacter symbols:
• In xfst Regular Expressions, the default is to treat a string of
symbols written together, e.g. %+Noun or cat, as a single
symbol. Concatenation of separate symbols is indicated by
manually separating symbols with white space, e.g. [ c a t ], or
by using the curly-brace notation, e.g. {cat}.
• In lexc, in contrast, the default is to treat strings, e.g. cat, as a
concatenation of three symbols. Any multicharacter symbols
must be explicitly declared at the top of the source file.
Beesley 2001
Multichar_Symbols declaration
Multichar_Symbols +Noun +Verb +Adj +Adv +Sg +Pl +1P
+2P +3P ^FEAT1 ^FEAT2
The Multichar_Symbols statement is formally optional and is
placed at the top of your lexc source file. You can declare as many
multicharacter symbols as you find necessary or useful. The
compiler uses this declaration to separate the strings of your lexc
program into symbols. You are strongly encouraged to include a
non-alphabetic character like the plus sign or the circumflex to help
the multicharacter symbol stand out visually.
Beesley 2001
The Body of your lexc Program
• The body of a lexc program is composed of LEXICONs.
• There should be one LEXICON named Root. It
corresponds to the Start State in the resulting Network.
LEXICON Root
dog
N;
cat
N;
bird
N;
• If you don’t define a LEXICON Root, lexc will try to use
the first LEXICON in the file as the Start State.
Beesley 2001
Entries in a LEXICON
• Each defined LEXICON must have at least one entry.
• An entry consists of two parts and is terminated with a
semicolon
data
continuation-class ;
• The data part has to fit one of four formats:
•
•
•
•
string
upper:lower
< regular-expression >
empty
Beesley 2001
e.g.
e.g.
e.g.
e.g.
dog
swim:swam
< a b* c >
upper:lower Entries
The upper:lower entries are the simplest way to specify portions of
the network where the upper-side and lower-side differ. They are
especially useful for irregularies/suppletions.
Multichar_Symbols +Verb +Past +Noun +Sg +Pl
LEXICON Root
swim+Verb+Past:swam
go+Verb+Past:went
child+Noun+Pl:children
ox+Noun+Pl:oxen
Beesley 2001
#;
#;
#;
#;
upper:lower Entries
In upper:lower entries, you can overtly indicate where the epsilons
should go.
Multichar_Symbols +Verb +Past +Noun +Sg +Pl +Nom
LEXICON Root
poder+Verb:pod0r
FutCond ;
Danger: the lexc upper:lower notation is not quite the
same as the regular-expression colon notation.
Beesley 2001
Regular Expressions in lexc
• Any data written as a regular expression must be
surrounded with angle brackets, e.g.
<elephant>
< a b* c+>
CC ;
CC ;
• Inside angle brackets, you revert to all the assumptions
suitable for xfst regular expressions, including the
treatment of multicharacter symbols vs. concatenation of
symbols.
• This is fertile ground for confusion and errors.
Beesley 2001
Continuation Classes
• The Continuation Class is just the name of a defined
LEXICON or #, indicating end-of-word (a final state).
Multichar_Symbols +Noun +Sg +Pl
LEXICON Root
dog
N;
cat
N;
LEXICON N
+Noun+Sg:0
+Noun+Pl:s
Beesley 2001
#;
#;
Thinking About lexc LEXICONS
• A LEXICON should hold a coherent class of morphemes
• The entries in a lexc LEXICON are unioned together by
the compiler; the order of the entries in a LEXICON is
not significant.
• Think of LEXICONs as potential “targets”
• Entries “point at” a LEXICON via the ContinuationClass
• But each entry in a LEXICON could itself point to a different
ContinuationClass
• During development, you may have to subdivide lexicons
• Avoid having copies of the same material (if possible)
• You may change an entry in one place and forget to change the
copy
Beesley 2001
Formally Speaking
• Lexc syntax is a kind of right-recursive phrase-structure
grammar.
• Phrase-structure grammars can in general describe
languages beyond finite-state power, including languages
with balanced parentheses.
• But with the right-recursive limitation, a phrase-structure
grammar can define only finite-state languages.
• Lexc can describe only finite-state languages.
• Lexc descriptions compile into finite-state networks.
Beesley 2001
Lexc Idiom: Optional Morphemes via
By-Pass
LEXICON Vroot
kant
V;
dir
V;
don
V;
pens
V;
LEXICON V
AdLex ;
Vend ;
LEXICON AdLex
ad
Vend ;
Beesley 2001
LEXICON Vend
as
is
os
us
u
i
#;
#;
#;
#;
#;
#;
Lexc Idiom: Optional Morphemes via
“Escape” Entries
LEXICON Vroot
kant
AdLex ;
dir
AdLex ;
don
AdLex ;
pens
AdLex ;
LEXICON AdLex
ad
Vend ;
Vend ; ! escape
Beesley 2001
LEXICON Vend
as
is
os
us
u
i
#;
#;
#;
#;
#;
#;
Lexc Idiom: Loops
LEXICON Nroot
kat
N;
hund
N;
elefant
N;
LEXICON Plur
j
Case ; ! Opt. plural ending
Case ;
LEXICON N
eg
N ; ! loop
et
N ; ! loop
in
N ; ! loop
Nend ;
LEXICON Case
n
#;
! Opt. case ending
#;
LEXICON Nend
o
Plur ;
Beesley 2001
Stem compounding (loops) in lexc
LEXICON Nroot
kat
N;
hund
N;
elefant
N;
LEXICON Plur
j
Case ; ! Opt. plural ending
Case ;
LEXICON N
LEXICON Case
n
#;
#;
Nroot ;
Nend ;
LEXICON Nend
o
Plur ;
Beesley 2001
Special Characters in lexc
Overall, there are far fewer special characters in lexc than in regular
expressions. In lexc, the following are special:
Special
Literalized
:
used in upper:lower notation
%:
;
terminates an entry
%;
<
begins a regular expression
%<
>
ends a regular expression
%>
0
denotes the empty string (epsilon)
%0
!
introduces a comment line
%!
#
continuation-class for end-of-word
%#
%
literalizing prefix
%%
Beesley 2001
Lexc source files
• Lexc sources files are ascii, typically edited with xemacs
• Lexc programs for natural language can get very large
• Typically 8000 to 12000 entries for verbs
• Tens of thousands of entries for nouns and proper nouns
Beesley 2001
The lexc interface
• Invoke the lexc interface by simply entering ‘lexc’ at the
UNIX prompt.
unixprompt% lexc
• You communicate with the interface using lexc
commands. Type ‘?’ to see all the possible commands.
Invoke ‘help commandname’ to see some terse online
documentation.
• Enter ‘quit’ to leave lexc and return to the operating
system.
lexc: quit
Beesley 2001
The Three lexc Registers
To understand lexc commands, you must understand that
they refer to and operate on networks held in three
registers, visualized as
SOURCE
Typically used to store a lexicon FST.
RULES
Typically used to store a rule FST or FSTs.
RESULT
Typically used to store the result of composing
the rule FST(s) under the source FST.
Beesley 2001
Basic lexc commands
compile-source filename
SOURCE
compiles the lexc source code in filename and
stores the resulting network in SOURCE
read-rules filename
RULES
reads the binary file filename and stores the
network(s) in RULES. The binary file may be
from xfst or twolc.
compose-result
RESULT
Beesley 2001
composes the RULE FST(s) under the SOURCE
FST and stores the resulting FST in RESULT
Some Other lexc Commands
read-source
save-source filename
save-result filename
lookup word
lookdown word
result-to-source
load a pre-compiled binary network into
the SOURCE register
store network in SOURCE to binary file
store network in RESULT to binary file
(equivalent to the xfst ‘apply up’)
(equivalent to the xfst ‘apply down’)
move the network from the RESULT
register to SOURCE
Enter ‘?’ to see all the lexc interface commands.
Beesley 2001
Using lexc and xfst together
• Write a lexc source file (e.g. mysrc.lex) using xemacs or a
similar editor
• Write suitable alternation rules (in xfst or even twolc). Compile
them and save the network(s) to file, e.g. to myrul.fst
Then from the lexc interface:
lexc: compile-source mysrc.lex
lexc: read-rules
myrul.fst
lexc: compose-result
lexc: save-result
mylang.fst
Beesley 2001
Using lexc and xfst together
• Lexc lexicons build “words” (strings) using union and concatenation
• Entries within a LEXICON are unioned (the order of entries is not significant)
• The
LEXICON Root
• The special #
corresponds to the start state
continuation class corresponds to final states
• Other continuation classes translate into concatenation
• By Xerox convention, upper-side strings consist of a baseform and “tags”
• By convention, a surface (or more surfacy) form appears on the lower-side
• The surfacy forms generated by lexc may still be rather abstract, hyper-regular, or
“morphophonemic”. They may sometimes contain multicharacter symbols.
• Replace Rules (perhaps a whole cascade of them) map from the surfacy strings
produced by lexc to real surface strings; rules are applied using composition.
• Composition can also be used to “filter” out various kinds of overgeneration.
Beesley 2001
A Typical Finite-State System
Filters (xfst)
.o.
Core Lexicon
(lexc)
.o.
Orthographical or
Phonological
Alternation Rules
(xfst)
Beesley 2001
A System may be a Union of
Subsystems
Nouns (lexc)
Verbs (lexc)
.o.
.o.
Noun
Rules
Verb
Rules
Adjs (lexc)
Numbers (lexc)
Then, in xfst:
define final NounFST | VerbFST | AdjFST | NumberFST ;
Beesley 2001
Review: Outputs and Inputs
Unix Pipe:
cat wordlist.in | sort | uniq -c | sort -rnb > myfile.out
• The output of one routine is the input to the next
• NOT reversible
Cascade of Replace Rules:
read regex [ N -> m || _ p ] .o. [ p -> m || m _ ] ;
• Reversible/bidirectional relation
• apply down: the output of the first rule is the input to the second
• the lower side of the top rule is the upper side of the bottom rule
Beesley 2001
Review: Up and Down
In xfst (regular expressions)
a:b
%+Pl:s
[ a .x. b ]
a -> b
a <- b
Beesley 2001
In lexc
swim+Verb+Past:swam
upper:lower
Review: Xerox Conventions
The lexical side language, and all intermediate
languages, have to be defined by the linguist
writing the grammar.
Upper (lexical) language:
baseform+Tag+Tag+Tag
(mapping via rules)
Lower (surface) language:
orthographical-string
The surface language is usually determined
for you by the standard orthography.
Beesley 2001
Review: Up and Down with
Composition
A rule that refers to tags on
its lower side
.o.
baseform+Tag+Tag+Tag
surfacy-form
.o.
A rule that refers to a surfacy
form on its upper side
Beesley 2001
An FST defined
using lexc
Review: Think in terms of Languages
and Relations
Lexical Language
Core Lexicon FST
Surfacy Language
Rule1
Intermediate Language
Rule2
Intermediate Language
Rule n
Final Surface Language
Beesley 2001
Other Important Topics in The Book
• Composition is Our Friend
• Modify a common “core” network to handle
– Multiple orthographies
– Multiple dialects
– Multiple registers
• Testing with the Finite-State Calculus
•
•
•
•
Bulk testing against corpora
Regression testing/comparison
Testing against wordlists
Testing the well-formedness of the upper-side strings
Beesley 2001
Advanced Features
• “Flag Diacritic” features and feature unification
• Simplify lexc descriptions
• Help keep transducers small
• The compile-replace Algorithm
• Useful for non-concatenative morphology
– Reduplication
– Semitic Interdigitation
Beesley 2001