Unlocking and Sharing LCTL Linguistic Knowledge
With 6,500 languages in the world,
we must explore
new ways to learn, document, and share
our linguistic knowledge.
John J. Kovarik
NSA/CSS Senior Language Technology Authority
Unlocking and Sharing LCTL
Linguistic Knowledge
Keywords: CFG parsing, language generation,
computational linguistics
CALICO ’05
University of Michigan
Ann Arbor, MI May 17-20, 2005
The Challenges of Learning
and Sharing Knowledge of an
LCTL in the 21st Century
John J. Kovarik
National Security Agency
Presentation Overview
General LCTL Challenges
Challenges of Learning Mongolian
Recipe for New Approach
Khalkha Mongolian Parts of Speech
Mongolian Morphological Affixes
Method of Lexical Knowledge Representation
Analyze, Parse, Build Grammar Model, Test
Iterate Repeatedly
LCTL Learning Challenges
Fewer Learned Resources to Learn from
Less Recognition Nationally
Fewer Opportunities to Document What’s Learned
Very Few Students to Learn from You
Almost All Learning Done Manually
Few Reliable 21st Century Applications
– Microsoft IME
– Font
Mongolian Learning Challenges
Input Method Editor (IME)
– Microsoft IME
• Keyboard arranged for native Mongols
• American Mongolists prefer phonetic keyboard
– “a” key on Mongolian keyboard mapped to ASCII “a” etc.
Fonts commonly used on Internet
– Russian Cyrillic fonts are commonly used
• “|” and “0” commonly substituted for “ү” and “ө”
• “у” and “о” often loosely used to stand for “ү” and “ө”
Recipe for a New Approach
Take a student with a computational linguistics
background
Infuse with curiosity and energy
Stir in access to the Internet
Add Mongolian syntax and morphology
Create a morphological analyzer, context-free
parser, and grammatical generator for Mongolian
Resulting lexicons, software, and grammar models
can be used by other linguistically adept students
Khalkha Mongolian
Parts of Speech
Declinable Nouns
Declinable Adjectives
Inflected Verbs
Unchanging Adverbs
Declinable Converbs
Unchanging Postpositions
Unchanging Conjunctions
Unchanging Particles
Mongolian Morphological Affixes
27 verbal suffixes denoting tense and mood
2 verb infixes denoting verb manner
– Consultative
– Passive
6 verb paradigms or verb types
3 irregular common verbs
6 cases in singular and plural number
Both nouns and adjectives are declined
Lexical Knowledge Representations
Unchanging adverbs, conjunctions,
particles, etc. and irregular verb forms
(unchanging.txt file)
Lemmas of declinable nouns and adjectives
(declinables.txt file)
Inflected verbs and nominalized verbs
(regvb.txt file)
Affix files (casendings.txt, reflex.txt,
infixes.txt, vbforms.txt)
Some Examples
declinables.txt file
– N нэр
– Q хэн
regverb.txt file
– V ир
– V өс
Affix files
– casendings.txt
• g ний
• d д
• a ыг
• b оос
– reflex.txt
• аа
• ээ
• оо
– infixes.txt
• C лц
• R лд
• P гд
– vbforms.txt
• ipf нө
• i1p в
• i3p чээ
• Ypf охгүй
unchanging.txt file
– Pg->талаар
– Pc->холбогдуулан
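As a rough illustration of how tag/form entries like those in the examples above could be read into memory, here is a minimal Python sketch. The original tools were Perl scripts whose code is not shown, so the function name load_tagged_entries and the whitespace-separated "TAG form" layout are assumptions made only for this example.

```python
from io import StringIO

def load_tagged_entries(lines):
    """Read whitespace-separated 'TAG form' pairs into a form -> tag dict."""
    entries = {}
    for line in lines:
        tokens = line.split()
        # Pair tokens up as (tag, form), (tag, form), ...
        for tag, form in zip(tokens[0::2], tokens[1::2]):
            entries[form] = tag
    return entries

# Sample contents copied from the slide examples; the real file layout
# is an assumption (hypothetical), not the confirmed format.
declinables = load_tagged_entries(StringIO("N нэр\nQ хэн\n"))
case_endings = load_tagged_entries(StringIO("g ний d д\na ыг b оос\n"))
infixes = load_tagged_entries(StringIO("C лц R лд P гд\n"))

print(declinables)
print(case_endings)
print(infixes)
```

Keying each dictionary by surface form keeps the later lookup (does this string match a known lemma or ending?) to a single dictionary access.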
Merge Morphology Knowledge
with the Power of the Computer
Wrote yalgah.pl to serve as a tireless lexical pedagogue
Searches for identifiable affixes by comparison
with lexical knowledge affix files
Matches resulting lemma against lexical
knowledge declinables, verbs, and unchanging
words, then outputs word/part of speech tag to
standard output file plus expository lexicon
Depending on whether the lemma can or cannot be
matched, outputs:
• Lemma to Out Of Vocabulary (oov) file noting affixes found
• Word/part of speech tag to standard output file
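The affix-stripping and lemma-matching steps described above might look roughly like the following Python sketch. It is not yalgah.pl itself: the names peel and analyze are hypothetical, only reflexive and case endings from the slide examples are handled, and verb forms, infixes, and vowel harmony are left out.

```python
# Tiny lexicons copied from the slide examples (illustrative only).
LEMMAS = {"нэр": "N", "хэн": "Q", "ир": "V", "өс": "V"}
UNCHANGING = {"талаар": "Pg", "холбогдуулан": "Pc"}
CASE_ENDINGS = ["ний", "д", "ыг", "оос"]
REFLEXIVES = ["аа", "ээ", "оо"]

def peel(word, endings, label):
    """Return (stem, note) for every listed ending that the word ends with."""
    return [(word[:-len(e)], f"{label}:{e}")
            for e in endings
            if word.endswith(e) and len(word) > len(e)]

def analyze(word):
    """Return ('tagged', tag, affixes) or ('oov', None, affixes_seen)."""
    if word in UNCHANGING:
        return ("tagged", UNCHANGING[word], [])
    # Candidate stems: the bare word, then the word minus a reflexive ending.
    candidates = [(word, [])]
    candidates += [(s, [n]) for s, n in peel(word, REFLEXIVES, "reflexive")]
    # For each candidate so far, also try removing one case ending.
    for stem, notes in list(candidates):
        candidates += [(s, notes + [n])
                       for s, n in peel(stem, CASE_ENDINGS, "case")]
    for stem, notes in candidates:
        if stem in LEMMAS:
            return ("tagged", LEMMAS[stem], notes)
    # No lemma matched: report the affixes that were recognized.
    seen = sorted({n for _, notes in candidates for n in notes})
    return ("oov", None, seen)

print(analyze("нэрд"))
print(analyze("талаар"))
print(analyze("хуралд"))
```

A word whose stem is missing from the lexicon still comes back with the endings that were recognized, which mirrors the "noting affixes found" behavior of the oov file.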
Additional Outputs
Expository Morphology File (named morphlex.txt)
IR->verb command imperative 2nd person singular
IREEREY->converb future perfect continuative
IREG->verb command concessive 3rd person singular/plural
BAGA->adjective
HURAL->noun nominative
IH->adjective
AJILDAA->reflexive noun dative-locative
ORLOO->verb indicative second past
Out Of Vocabulary File (named oov)
[ункноенахаасаа] (UNKNOWNAHAASAA) WORD 0 LINE 2
FALLS OUTSIDE OF VOCABULARY
possible reflexive ending <аа>-<AA>
possible declinable case ending <b>-<аас>-<AAS>
possible verbal part of speech <Ypf>-<ах>-<AH>
possible participial/converbal stem <ункноен>--<UNKNOWN>
Feed Analytic Output to Parser
Developed context-free grammar (CFG) rules for both discourse and newspaper texts
S->Sbj Prd
S->Prd
Sbj->Nn
Sbj->NP
NP->Tg Nn
NP->Tg Ng Nn
Prd->J
Wrote parse.pl to validate CFG rules against input text tagged by part of speech
When each sentence can be fully parsed, outputs a parse tree and an English gloss.
Working on "BAGA HURAL IH AJILDAA ORLOO ."
ENGLISH GLOSS: small hural great work began .
The sentence does parse.
Branch nodes on tree:
S -> (Sbj Prd)
Sbj -> (NP)
NP -> (J Nn)
Prd -> (NPd Vi2p)
NPd -> (J Nd)
POS: J Nn J Nd Vi2p
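To make the rule-validation step concrete, here is a small backtracking top-down recognizer in Python that checks a sequence of part-of-speech tags against CFG rules and returns a tree. It is a sketch only, not parse.pl: the rule table combines rules listed on the slides with those visible in the sample tree, and the function names expand, expand_rule, and parse are hypothetical. It assumes the grammar has no left recursion.

```python
# Rules taken from the slides; any symbol without an entry is a POS tag.
RULES = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["Nn"], ["NP"]],
    "NP": [["J", "Nn"], ["Tg", "Nn"], ["Tg", "Ng", "Nn"]],
    "Prd": [["J"], ["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

def expand(symbol, tags, pos):
    """Yield (tree, next_pos) for each way symbol can cover tags from pos."""
    if symbol not in RULES:  # terminal: must match the tag at this position
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in RULES[symbol]:
        yield from expand_rule(symbol, rhs, tags, pos)

def expand_rule(symbol, rhs, tags, pos):
    """Match every child of one rule left to right, keeping all alternatives."""
    partial = [([], pos)]
    for child in rhs:
        partial = [(kids + [tree], nxt)
                   for kids, p in partial
                   for tree, nxt in expand(child, tags, p)]
    for kids, end in partial:
        yield (symbol, kids), end

def parse(tags):
    """Return the first tree for S that consumes every tag, else None."""
    for tree, end in expand("S", tags, 0):
        if end == len(tags):
            return tree
    return None

print(parse("J Nn J Nd Vi2p".split()))
print(parse("J J".split()))
```

When parse returns None, that is the cue to write a new rule, which is the loop the slides describe.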
Feed Output to Generator
Wrote gramgen.pl to generate sentences
based on lexical knowledge, morphological
knowledge, and syntactic knowledge gained
Output routinely reviewed for accuracy and
Chomskian explanatory adequacy of the
grammar models created for the parser and
generator engines
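One simple way to generate test sentences from such a grammar is to expand the start symbol at random and fill each part-of-speech slot from the lexicon. The Python sketch below does this under stated assumptions: it is not gramgen.pl, the function name generate is hypothetical, and the rules and words are limited to those in the sample sentence shown earlier.

```python
import random

# Subset of rules from the sample parse tree, plus S->Prd from the rule list.
RULES = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

# Words grouped by POS tag, taken from the morphlex.txt examples.
WORDS = {
    "J": ["BAGA", "IH"],
    "Nn": ["HURAL"],
    "Nd": ["AJILDAA"],
    "Vi2p": ["ORLOO"],
}

def generate(symbol, rng):
    """Expand symbol top-down, picking a random rule, then a random word."""
    if symbol in RULES:
        rhs = rng.choice(RULES[symbol])
        return [word for child in rhs for word in generate(child, rng)]
    return [rng.choice(WORDS[symbol])]

rng = random.Random(0)  # fixed seed so a run can be repeated
for _ in range(3):
    print(" ".join(generate("S", rng)) + " .")
```

Random expansion will also produce sentences that are grammatical by the rules but odd in meaning, which is why the output still needs human review.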
Iterative Process
First take a new newspaper article or dialogue and
run the morphological analyzer on it until all words
are listed within the vocabulary (no output in the oov
[Out Of Vocabulary] file)
Run output through parser, creating new CFG
rules until new text parses
Run generator for a hundred or more examples to
ensure adequacy of new rules
Morpho-analyzer, Parser, Generator
Software Led This Student to Deeper
Understanding of Mongolian
A linguistically adept learner can thus write
software to help one learn deeper & faster
Language tool development is thus
grounded in gaining and applying language
knowledge in a systematic and linguistically
principled manner for oneself and others
Contact Information
John Kovarik
Email: [email protected]
Home Page:
http://www.worldnet.att/~kovariks
Phone: 443-479-7188