Guest lecture: Tracy King


Integrating Finite-state Morphologies
with Deep LFG Grammars
Tracy Holloway King
FST and deep grammars


Finite state tokenizers and morphologies can
be integrated into deep processing systems
Integrated tokenizers
– eliminate the need for preprocessing
– allow the grammar writer more control over the
input

Morphologies
– eliminate the need to list (multiple) surface forms in
the lexicon
– eliminate the need for lexical entries for words with
predictable subcategorization frames
Talk outline



Basic integrated system
Integrating morphology FSTs
Interaction of tokenization and morphology
Basic Architecture
Input string
(Shallow markup)
Tokenizing FSTs
Morphology FSTs
LFG grammar and lexicons
Constituent-structure
(tree)
Functional-structure
(AVM)
Example steps through the system




Input string: Boys appeared.
Tokenizing: boys TB appeared TB . TB
Morphology:
boy + Noun +Pl
appear +Verb +PastBoth +123SP
. +Punct
C-structure/F-structure: next slides
C-structure tree
F-structure AVM
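The walk-through above can be sketched in Python, with small dictionaries standing in for the tokenizing and morphology FSTs (the TB marker and the tag set follow the slides; everything else is an illustrative simplification):

```python
TB = "TB"  # token-boundary marker, as on the slides

def tokenize(sentence):
    """Break off final punctuation, lowercase the initial word, and put
    a TB after every token. (XLE's tokenizer is non-deterministic about
    lowercasing; this sketch forces it.)"""
    words = sentence.rstrip(".").split()
    words[0] = words[0].lower()
    tokens = words + ["."]
    out = []
    for t in tokens:
        out += [t, TB]
    return out

# stand-in for the morphology FST: surface form -> (lemma, tags) analyses
MORPH = {
    "boys":     [("boy", ["+Noun", "+Pl"])],
    "appeared": [("appear", ["+Verb", "+PastBoth", "+123SP"])],
    ".":        [(".", ["+Punct"])],
}

def analyze(tokens):
    return [MORPH[t] for t in tokens if t != TB]

print(tokenize("Boys appeared."))
# ['boys', 'TB', 'appeared', 'TB', '.', 'TB']
```

The real system then looks the stems and tags up in the lexicon and builds c- and f-structures; this sketch only covers the first two stages.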
The wider system: XLE

Handwritten grammars for various languages
– Substantial for English, German, Japanese, Norwegian
– Also: Arabic, Chinese, Urdu, Korean, Welsh, Malagasy, Turkish

Robustness mechanisms
– Fragment grammar rules
– Morphological guessers
– Skimming when resource limits approached

Ambiguity management (packing)
– Compute all analyses (no “aggressive pruning”)
– Propagate packed ambiguities across processing modules

Stochastic disambiguation
– MaxEnt models to select from packed (f-)structures

Other processing available:
– generation, semantics, transfer/rewriting

Comparisons to other systems/tasks
– Parsing WSJ (Riezler et al, ACL 2002)
– Comparison to Collins model 3 (Riezler et al, NAACL 2004)
FST Morphologies

Associate surface form with
– a lemma (stem/canonical form)
– a set of tags

Process is non-deterministic
– can have many analyses for one surface form
– grammar has to be able to deal with multiple
analyses (morphological ambiguity)
– Issue: can the grammar control rampant
morphological ambiguity?
– e.g. Arabic vowelless representations, where one written form has many analyses
Example Morphology Output





turnips <=> turnip +Noun +Pl
Mary <=> Mary +Prop +Giv +Fem +Sg
falls <=> fall +Noun +Pl
fall +Verb +Pres +3sg
broken <=> break +Verb +PastPerf +123SP
break +Verb +PastPart
broken +Adj
New York <=>
New York +Prop +Place +USAState +Prefer
New York +Prop +Place +City +Prefer
[ plus analyses of New and York ]
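The non-determinism of the morphology output can be sketched with a toy analyzer that returns every analysis for a surface form (the forms and tags come from the slide; the dict is an illustrative stand-in for the transducer):

```python
# surface form -> list of (lemma, tags) analyses
ANALYSES = {
    "turnips": [("turnip", "+Noun +Pl")],
    "falls":   [("fall", "+Noun +Pl"),
                ("fall", "+Verb +Pres +3sg")],
}

def analyze(form):
    # an unlisted form would fall through to a guesser (see later slides)
    return ANALYSES.get(form, [])

# "falls" is morphologically ambiguous: the grammar must handle both
print(analyze("falls"))
```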
Morphologies and lexicons

Without a morphology, need to list all surface
forms in the lexicon
– bad for English
– horrible for languages like Finnish and Arabic

With a morphology, one entry for the stem
form
go V XLE @(V-INTRANS go).
for: go, goes, going, gone, went

With additional integration, words with
predictable subcategorization frames need no
entry
Basic idea

Run surface forms of words through the
morphology to produce stems and tags
– MorphConfig file specifies which morphologies the
grammar uses



Look up stems and tags in the lexicon
Sublexical phrase structure rules build
syntactic nodes covering the stems and tags
Standard grammar rules build larger phrases
Lexical entries for tags
boys ==> boy +Noun +Pl

boy    N         XLE @(NOUN boy).
+Noun  N_SFX     XLE @(PERS 3)
                     @(EXISTS NTYPE).
+Pl    NNUM_SFX  XLE @(NUM pl).
Sublexical rules for tags



Build up lexical nodes from stem plus tags
Rules are identical to standard phrase structure
rules
– Except display can hide the sublexical information
N --> N_BASE
      N_SFX_BASE
      NNUM_SFX_BASE.

N
  N_BASE         boy
  N_SFX_BASE     +Noun
  NNUM_SFX_BASE  +Pl
Resulting structures
C-structure:
N
  N_BASE         boy
  N_SFX_BASE     +Noun
  NNUM_SFX_BASE  +Pl

F-structure:
[ PRED  'boy'
  PERS  3
  NUM   pl
  NTYPE common ]
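The way the sublexical tags contribute f-structure features can be sketched as follows, following the lexical entries for +Noun and +Pl (the dict-based AVM and the NTYPE value are illustrative simplifications):

```python
# each tag contributes f-structure features, per its lexical entry
TAG_FEATURES = {
    "+Noun": {"PERS": "3", "NTYPE": "common"},   # @(PERS 3), NTYPE constraint
    "+Pl":   {"NUM": "pl"},                      # @(NUM pl)
}

def build_fstructure(stem, tags):
    avm = {"PRED": "'%s'" % stem}    # the stem supplies the PRED
    for tag in tags:
        avm.update(TAG_FEATURES.get(tag, {}))
    return avm

print(build_fstructure("boy", ["+Noun", "+Pl"]))
# {'PRED': "'boy'", 'PERS': '3', 'NTYPE': 'common', 'NUM': 'pl'}
```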
Lexical entries

Stems with unpredictable subcategorization
frames need entries
– verbs
– adjectives with obliques (proud of her)
– nouns with that complements (the idea that he
laughed)

Most lexical items have predictable frames
determined by part of speech
– common and proper nouns
– adjectives
– adverbs
– numbers
-unknown lexical entry


Match any stem to the entry
Provide desired functional information
– %stem will pass in the appropriate surface form
(i.e., the lemma/stem)


Constrain application via morphological tag
possibilities
-unknown N XLE @(NOUN %stem);
A XLE @(ADJ %stem);
ADV XLE @(ADVERB %stem).
-unknown example


The box boxes.
Lexicon entries:
box V XLE @(V-INTRANS %stem).
-unknown N XLE @(NOUN %stem); ADV…; A...

Morphology output:
box ==> box +Noun +Sg | +Verb +Non3Sg
boxes ==> box +Noun +Pl | +Verb +3Sg

Build up four effective lexical entries
– 1 noun, 1 verb, 1 adverb, 1 adjective
– adverb and adjective fail sublexically
– noun and verb relevant for the sentence
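The example can be simulated as follows: the stem "box" gets its explicit verb entry plus N/A/ADV from -unknown, and the sublexical rules then reject the categories whose tags do not fit (the function names and the tag-compatibility check are illustrative):

```python
MORPH_TAGS = {"boxes": ["+Noun +Pl", "+Verb +3Sg"]}  # stem "box" in both
EXPLICIT_CATS = {"box": ["V"]}    # box V XLE @(V-INTRANS %stem).
UNKNOWN_CATS = ["N", "A", "ADV"]  # from the -unknown entry

def effective_entries(stem):
    return [(stem, c) for c in EXPLICIT_CATS.get(stem, []) + UNKNOWN_CATS]

def survives_sublexically(cat, analyses):
    # only categories compatible with some morphological analysis survive
    wanted = {"N": "+Noun", "V": "+Verb"}.get(cat)
    return any(wanted and wanted in a for a in analyses)

entries = effective_entries("box")
print(len(entries))    # 4: one verb, plus N, A, ADV via -unknown

viable = [c for _, c in entries
          if survives_sublexically(c, MORPH_TAGS["boxes"])]
print(sorted(viable))  # ['N', 'V']: adverb and adjective fail
```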
Inflectional morphology summary



Integrating FST morphologies significantly
decreases the lexicon development effort
Verbs and other unpredictable items are
listed only under their stem form
Predictable items such as nouns are
processed via -unknown and never listed in
the lexicon
Guessers



Even large industrial FST morphologies are not
complete
Novel words usually have regular morphology
Build an FST guesser based on this
– Words with capital letters are proper nouns
(Saakashvili)
– Words ending in -ed are past tense verbs or
deverbal adjectives

Guessed words will go through -unknown
– no difference from standard morphological output
– can add +Guessed tag for further control
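A guesser of this kind can be sketched with simple string patterns, each analysis carrying +Guessed so the grammar can constrain its use (the patterns and the crude -ed stemming are illustrative, not the actual FST rules):

```python
def guess(form):
    analyses = []
    if form[0].isupper():
        # capitalized novel words are guessed to be proper nouns
        analyses.append((form, "+Prop +Guessed"))
    if form.endswith("ed"):
        # -ed words: past tense verb or deverbal adjective
        analyses.append((form[:-2], "+Verb +Past +Guessed"))  # crude stemming
        analyses.append((form, "+Adj +Deverbal +Guessed"))
    return analyses

print(guess("Saakashvili"))
# [('Saakashvili', '+Prop +Guessed')]
```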
Guessers: controlling application

Apply guesser in the grammar only if there is no
form in the regular morphology
– don't guess unless you have to

Control this with the MorphConfig
– use multiple fst morphologies
– stop looking once an analysis is found
Sample MorphConfig
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
  english.tok.parse.fst
ANALYZE USEFIRST:
  english.infl.fst        (try regular morphology first)
  english.guesser.fst     (if that fails, guess)
MULTIWORD:
  english.standard.mwe.fst
Multiple morphology FSTs

In addition to the regular morphology and
guesser, can have other morphologies
– morphology for technical terms, part numbers, etc.

These can be applied in sequence or in
parallel (cascaded or unioned)
ANALYZE USEALL:
  english.infl.fst            (try regular morphology)
  english.eureka.parts.fst    (and also part names)
Morphology vs. surface form


System always allows surface form through
Lexicon can match this form for
– multiword expressions
– override/supplement morphological analysis

Example: or as adverb (Or you could leave now.)
or ADV * @(ADVERB or);
CONJ XLE @(CONJ or).
Tokenizers


Tokenizers break strings (sentences) into
tokens (words)
Need to (for English):
– break off punctuation
Mary laughs. ==> Mary TB laughs TB . TB
– lower case certain letters
The dog ==> the TB dog
Tokenization and morphology


Linguistic analysis may govern tokenization
Are English contracted auxiliaries affixes or clitics?
– affixes: John'll ==> no tokenization
John +Noun +Proper +Fut
– clitics: John'll ==> John TB 'll TB
John +Noun +Proper will +Fut

Arabic determiners and conjunctions
– both written with adjacent words
determiner as an affix giving +Def (Albint the-girl)
conjunction tokenized separately (wakutub and-books)
Non-deterministic tokenizers:
Punctuation


Cannot just break off punctuation and insert a TB
Comma haplology
Find the dog, a poodle. ==>
find TB the TB dog TB , TB a TB poodle TB , TB . TB

Period haplology
Go to Palm Dr. ==>
go TB to TB Palm TB Dr. TB . TB


Resulting tokenizer is non-deterministic
System must be able to handle multiple inputs
Capitalization

Initial capitals are optionally lower cased
The boy left. ==> the boy left.
Mary left. ==> Mary left.

Example for both types of non-determinism
Bush saw them. ==>
{ Bush | bush } TB saw TB them TB [, TB]* . TB

Tokenization rules vary from language to
language and by choice of linguistic analysis
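Both kinds of non-determinism on "Bush saw them." can be sketched by enumerating the choices: the initial capital is optionally lowercased, and a haplologized comma is optionally reinserted before the period (TB marks token boundaries, as on the slides; a real tokenizing FST produces this set directly):

```python
from itertools import product

def tokenizations(sentence):
    words = sentence.rstrip(".").split()
    first_choices = [words[0], words[0].lower()]   # { Bush | bush }
    comma_choices = [[], [","]]                    # [, TB] optional
    results = []
    for first, comma in product(first_choices, comma_choices):
        toks = [first] + words[1:] + comma + ["."]
        results.append(" TB ".join(toks) + " TB")
    return results

for t in tokenizations("Bush saw them."):
    print(t)
# includes e.g. "bush TB saw TB them TB , TB . TB"
```

All of these token sequences are passed to the morphology and grammar in parallel; the grammar is what decides which survive.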
Conclusions

System architecture integrates FST techniques
with deep LFG parsing
– tokenizers
– morphologies and guessers

Allows generalizations to be factored out
– properties of words
– properties of strings

Allows use of existing large-scale lexical
resources
– avoids redundant specification

System is actively in use in ParGram grammars
Shallow Markup



Preprocessing with shallow markup can
reduce ambiguity and speed processing
Tokenizer must be able to process the
markup
Part of speech tagging:
– I/PRP_ saw/VBD_ her/PRP_ duck/VB_.

Named entities
– <person>General Mills</person> bought it.
POS tagging

POS tags are not relevant for tokenizing, but
the tokenizer must skip them
– She walks/VBZ_. should be treated like She walks.

The morphology must only insert compatible
tags
– A mapping table states allowable combinations
/VBZ_  +Verb +3sg
/NN_   +Noun +Sg
– These are encoded into a filtering FST
– Only compatible tags are passed to the grammar
POS tagging example

I saw her duck
duck +Noun +Sg
duck +Verb +Pres +Non3sg
– both possibilities passed to the grammar

I saw her duck/VB_.
– only +Verb +Pres +Non3sg possibility is
compatible with /VB_ POS tag
– only this possibility is passed to the grammar
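The filtering step can be sketched as follows: the mapping table lists which morphological tags are compatible with each POS mark, and only compatible analyses reach the grammar (a Python filter stands in for the filtering FST; the subset check is an illustrative compatibility criterion):

```python
# POS mark -> morphological tags it requires
COMPATIBLE = {
    "/VB_":  {"+Verb"},
    "/VBZ_": {"+Verb", "+3sg"},
    "/NN_":  {"+Noun", "+Sg"},
}

# morphology output for "duck"
ANALYSES = {
    "duck": [("duck", {"+Noun", "+Sg"}),
             ("duck", {"+Verb", "+Pres", "+Non3sg"})],
}

def filter_by_pos(form, pos=None):
    analyses = ANALYSES[form]
    if pos is None:
        return analyses                  # untagged: pass everything
    return [a for a in analyses
            if COMPATIBLE[pos] <= a[1]]  # all required tags present

print(len(filter_by_pos("duck")))        # both analyses pass untagged
print(filter_by_pos("duck", "/VB_"))     # only the verb analysis passes
```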
Named Entities

Named entities appear in text as XML markup
<person>General Mills</person> bought it.

Tokenizer
– creates special tag for these
– puts literal spaces instead of TBs
– allows version without markup for fallback
General Mills TB +NamedEntity TB
General TB +Title TB Mills +Proper TB


Lexical entry added for +NamedEntity
Sublexical N and NAME rules allow the tag
Sample Named Entity output