Kirrkirr: a Bilingual Warlpiri-English Dictionary

Kirrkirr: a Bidirectional Warlpiri-English Dictionary
Kristen Parton
Kirrkirr: Objectives
 Kirrkirr aims to present the contents of a dictionary in a way which is
flexible, interactive, customizable, and (especially) fun
 Kirrkirr has diverse target users, with varying levels of literacy, for
example professional linguists, elementary school children,
teachers, and native speakers
 Currently, Kirrkirr is used with the Australian Aboriginal language
Warlpiri, spoken by about 3,000 people in northern Australia
 Kirrkirr uses a Warlpiri-English dictionary developed by linguists in
Australia, with detailed information about each word, including
glosses, definitions, dialects, grammatical comments and cross-references between words for synonyms, antonyms, “see also” and
other relationships
 Unlike paper dictionaries, electronic dictionaries can provide an
interactive educational tool customizable to various audiences
Dictionary Usability
 The interface has a colorful, clickable panel which links words
related in different ways, rather than just relying on the alphabetical
list of words; this also makes the dictionary more interactive
 Many words are linked to pictures and sounds, which reinforce the
meaning of the words through non-textual means
 The dictionary uses “fuzzy spelling” to catch spelling errors made
by the user when searching for a word
 User modes tailor the appearance of the formatted entries to each
target audience:
 English meaning only, for novice users with English
backgrounds
 In Warlpiri, for native speakers of Warlpiri
 Basic details, for intermediate users such as students
 Full details, for advanced users such as teachers or linguists
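The slides do not say how “fuzzy spelling” is implemented. As a minimal sketch, assuming a plain string-similarity match over the headword list (Python's `difflib`, which is not necessarily what Kirrkirr does), a misspelled query could be mapped to nearby headwords as follows. The short word list and the name `fuzzy_lookup` are illustrative only.

```python
import difflib

# Illustrative headword list; the real dictionary is far larger.
HEADWORDS = ["yawarrangi", "jawirdiki", "kirany-kiranypa", "nya-nyi"]

def fuzzy_lookup(query, headwords=HEADWORDS, limit=3, cutoff=0.6):
    """Return headwords similar to the query, best match first."""
    if query in headwords:
        return [query]  # exact hit, no fuzzy matching needed
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything scoring below the cutoff.
    return difflib.get_close_matches(query.lower(), headwords, n=limit, cutoff=cutoff)

# A query missing one "r" still finds the intended headword.
print(fuzzy_lookup("yawarangi"))
```

A spelling-aware matcher tuned to Warlpiri orthography would likely do better than generic string similarity; the sketch only shows where such a step sits in the search.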
Lexicon Structure
 The dictionary is maintained by linguists in Australia in an ad hoc text format, which is converted to a structured XML
dictionary by a Perl script
 Rather than load the large (10 MB) XML file in memory, each
headword’s XML entry is loaded individually as needed
 The rich structure of the XML allows XSLT stylesheet
manipulation of the dictionary entries to produce output
formatted differently for different users
 The XSLT stylesheet outputs HTML pages, which make use of
the cross-references in the dictionary by creating hyperlinks
between different words
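One way to load a single headword's entry on demand is to keep a small index of byte offsets into the XML file and parse only the fragment that is needed. The sketch below assumes this approach; the slides do not describe Kirrkirr's actual mechanism, and the element names `ENTRY`, `HW` and `GL` are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for the large XML dictionary file (element names are assumed).
XML_BYTES = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>nya-nyi</HW><GL>see</GL></ENTRY>\n"
    b"<ENTRY><HW>yawarrangi</HW><GL>large male kangaroo</GL></ENTRY>\n"
    b"</DICTIONARY>\n"
)

ENTRY_RE = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HW_RE = re.compile(rb"<HW>(.*?)</HW>")

def build_offset_index(data):
    """Map each headword to the (start, length) of its entry in the file."""
    index = {}
    for match in ENTRY_RE.finditer(data):
        headword = HW_RE.search(match.group(0)).group(1).decode("utf-8")
        index[headword] = (match.start(), match.end() - match.start())
    return index

def load_entry(stream, index, headword):
    """Seek to one entry and parse only that fragment."""
    start, length = index[headword]
    stream.seek(start)
    return ET.fromstring(stream.read(length))

# The index is built once (here in memory, for brevity) and reused for lookups.
INDEX = build_offset_index(XML_BYTES)
entry = load_entry(io.BytesIO(XML_BYTES), INDEX, "yawarrangi")
print(entry.findtext("GL"))
```

In practice the index would be built in a single pass and saved alongside the dictionary, so that start-up never reads the whole file.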
Customizing Format with XSLT
 At run-time, the XML entries are processed by an XSLT stylesheet,
which selects which elements of the entry to show, determines the
order to show them in, and formats each field differently depending on
the user mode
 For example, “Meaning only” outputs the English glosses of a word
in large font, whereas “Full details” outputs all of the information in
the dictionary in a normal-sized font in a specific order
 Since the XML is parsed at run-time, more information can be added to
the XML to allow “parameter passing” from the program to the XSLT
 For example, the location of the images folder can only be
determined at run-time, but by adding an <IMAGE-DIR> field to the
XML at run-time, the XSLT can create an <IMG SRC> tag to display
an image in the HTML output
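The two ideas above, mode-dependent formatting and run-time “parameter passing”, can be sketched in Python. This is a stand-in for the XSLT stylesheet, not a copy of it: it uses `xml.etree.ElementTree` in place of an XSLT processor, and every element name except `IMAGE-DIR`, along with the image file name, is hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Fields shown per user mode, in display order (hypothetical element names).
MODE_FIELDS = {
    "meaning_only": ["GL"],
    "full": ["HW", "GL"],
}

ENTRY_XML = (
    "<ENTRY><HW>yawarrangi</HW><GL>large male kangaroo</GL>"
    "<IMAGE>yawarrangi.jpg</IMAGE></ENTRY>"
)

def render(entry_xml, mode, image_dir):
    """Format one entry as HTML for the given user mode."""
    entry = ET.fromstring(entry_xml)
    # "Parameter passing": add a value known only at run-time to the entry.
    ET.SubElement(entry, "IMAGE-DIR").text = image_dir
    parts = []
    for tag in MODE_FIELDS[mode]:
        for field in entry.findall(tag):
            text = escape(field.text or "")
            if mode == "meaning_only":
                parts.append(f'<p style="font-size: x-large">{text}</p>')
            else:
                parts.append(f'<p class="{tag}">{text}</p>')
    image = entry.findtext("IMAGE")
    if image and mode == "full":
        # The formatter reads the injected field back to build the image path.
        src = escape(entry.findtext("IMAGE-DIR") + "/" + image, quote=True)
        parts.append(f'<img src="{src}">')
    return "\n".join(parts)

print(render(ENTRY_XML, "meaning_only", "/opt/kirrkirr/images"))
print(render(ENTRY_XML, "full", "/opt/kirrkirr/images"))
```

In a real stylesheet the same selection and ordering would be expressed as templates, with the injected `<IMAGE-DIR>` element read by the template that emits the image tag.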
English-Warlpiri Dictionary
 The original dictionary is one-way Warlpiri to English, but a
bidirectional bilingual dictionary is more useful for most users
 An English index was built from glosses in the dictionary such that
each gloss links to the equivalent Warlpiri entries.
 Rather than being two separate monolingual dictionaries, these
dictionaries share the same data, thus eliminating conflicting entries
and maintaining consistency
 The XML entries of all the Warlpiri equivalents to an English word are
merged, and passed to an XSLT stylesheet, which creates an
HTML page for the English word
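Building the English index from the Warlpiri data amounts to inverting the headword-to-gloss mapping. The following is a minimal sketch under that assumption; the data structure and function names are illustrative, and `headword-2` is a placeholder rather than a real Warlpiri word.

```python
from collections import defaultdict

# Warlpiri headword -> English glosses. "headword-2" is a placeholder,
# not a real word; it is only here to show two entries sharing a gloss.
ENTRIES = {
    "yawarrangi": ["kangaroo"],
    "headword-2": ["kangaroo"],
    "jawirdiki": ["stay"],
}

def build_english_index(entries):
    """Invert the dictionary: each English gloss points at its headwords."""
    index = defaultdict(list)
    for headword, glosses in entries.items():
        for gloss in glosses:
            index[gloss.lower()].append(headword)
    return dict(index)

def merged_entries(english_word, index, entries):
    """Collect the Warlpiri entries behind one English word, ready for formatting."""
    return [
        {"headword": hw, "glosses": entries[hw]}
        for hw in index.get(english_word.lower(), [])
    ]

INDEX = build_english_index(ENTRIES)
print(merged_entries("kangaroo", INDEX, ENTRIES))
```

Because the English side holds only references into the Warlpiri entries, a correction made to one Warlpiri entry shows up in both directions.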
English-Warlpiri Dictionary
 To make the English dictionary symmetric to the Warlpiri, Kirrkirr
now has an English word list, English formatted entries, a much
faster English search, and the capability to do “fuzzy spelling” in
English
 Problems arise because most Warlpiri words have several English
equivalents, and also because phrases in English might be indexed
under several different terms
 For example, “yawarrangi” meaning “large male kangaroo”
should be indexed under “kangaroo” rather than “large” or “male”
 However, “jawirdiki” and other words that mean “stay put”
should be indexed under “stay” and not “put”
 Words like “kirany-kiranypa” meaning “spinifex lizard” should be
indexed under “spinifex” (the type) and “lizard”
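The slides do not state the rule Kirrkirr uses to pick index terms for a multi-word gloss. As a rough sketch, assuming a simple skip list of words that should not serve as index keys, the three examples above could be handled like this. The skip list is hand-picked to reproduce those examples and is purely hypothetical.

```python
# Hand-picked for the three examples above; not Kirrkirr's actual rule.
SKIP_WORDS = {"large", "male", "put"}

def index_keys(gloss, skip_words=SKIP_WORDS):
    """Return the English words a gloss should be indexed under."""
    words = gloss.lower().split()
    keys = [w for w in words if w not in skip_words]
    return keys or words  # never leave a gloss unindexed

for gloss in ["large male kangaroo", "stay put", "spinifex lizard"]:
    print(gloss, "->", index_keys(gloss))
```

A fixed skip list is brittle, since a word such as “put” is a poor key in “stay put” but a good one elsewhere, which is part of why the slides call this a problem.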
Warlpiri Morphology
 Warlpiri is an agglutinating language, meaning that grammatical
suffixes get added on to words:
 nyangulparnangku
 nya- ngu- lpa- rna- ngku
 see- PAST- IPFV- 1SG.SUBJ- 2SG.OBJ
 “I was looking at you.”
 Root word: “nya-nyi” meaning “to see”
 For lookup in the dictionary, users have to know the root word
 This is difficult for learners of Warlpiri, given that morphemes are not
always separated by hyphens and verbs are indexed with non-past
tense inflections
 To make Kirrkirr more usable, a morphological analyzer was
implemented to accept well-formed Warlpiri words and find the
possible root words to look up
Morphological Analysis
 Suffixes from the dictionary are stored in a trie for quick lookup
 Each time an affix is stripped, the remaining string is checked to
see whether it is in the dictionary
 Each possible morpheme is added to a lattice structure which
holds all possible morphological decompositions of the word
 Grammar rules are applied to eliminate many impossible parses
 Some properties of Warlpiri make parsing more difficult, and
show the need for a different indexing system:
 Verbs are stored with non-past inflections but are seen with
different inflections. For example, “nya-nyi” may show up as
“nya-ngu.” But indexing “nya-nyi” under “nya” creates more
ambiguity, since “nya” is another word.
 Some words have optional suffixes, such as “l(pa)” which
may be seen as “l” or “lpa.” These words must be indexed
under both entries.
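The stripping procedure can be sketched as follows. This is a simplified illustration, not the analyzer itself: it enumerates decompositions recursively instead of building an explicit lattice, applies no grammar rules, and uses a suffix list taken from the example word on the earlier slide (treating its ending as the two suffixes “rna” and “ngku”, to match the 1SG.SUBJ and 2SG.OBJ glosses). The stem table and all function names are assumptions.

```python
def expand_optional(suffix):
    """'l(pa)' -> ['l', 'lpa']: index an optional-part suffix under both forms."""
    if "(" not in suffix:
        return [suffix]
    head, rest = suffix.split("(", 1)
    optional, tail = rest.split(")", 1)
    return [head + tail, head + optional + tail]

def build_suffix_trie(suffixes):
    """Store each suffix reversed, so matching walks back from the word's end."""
    trie = {}
    for suffix in suffixes:
        for form in expand_optional(suffix):
            node = trie
            for ch in reversed(form):
                node = node.setdefault(ch, {})
            node["$"] = form  # end-of-suffix marker
    return trie

def suffixes_at_end(word, trie):
    """Yield every known suffix that ends the given string."""
    node = trie
    for ch in reversed(word):
        node = node.get(ch)
        if node is None:
            return
        if "$" in node:
            yield node["$"]

def analyze(word, trie, stems):
    """Return all decompositions [stem, suffix, ...] whose stem is in `stems`."""
    parses = []

    def strip(rest, stripped):
        if rest in stems:  # check the remainder after each strip
            parses.append([rest] + stripped)
        for suffix in suffixes_at_end(rest, trie):
            if len(suffix) < len(rest):
                strip(rest[: -len(suffix)], [suffix] + stripped)

    strip(word, [])
    return parses

SUFFIXES = ["ngu", "l(pa)", "rna", "ngku"]
# Verb stems point at the non-past headword; "nya" is also a word itself.
STEMS = {"nya": ["nya-nyi", "nya"]}

trie = build_suffix_trie(SUFFIXES)
for parse in analyze("nyangulparnangku", trie, STEMS):
    print("-".join(parse), "->", STEMS[parse[0]])
```

The stem table makes the ambiguity described above visible: stripping down to “nya” yields both the verb “nya-nyi” and the separate word “nya”, and grammar rules would be needed to choose between them.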
Conclusions
 Making Kirrkirr a bidirectional English-Warlpiri and Warlpiri-English
dictionary increases its usability and practicality, by making it easier
for users who are more comfortable in English to browse and search
in English.
 Allowing lookup of Warlpiri words from actual speech using the
morphological analysis also increases usability, especially for users
who are learning Warlpiri, since they do not have to figure out the
root word.
 Future work:
 Improving the morphological analysis to provide roughly ranked
possible parses of all morphemes of an entire word, using more
grammatical information and frequency information
 Extending Kirrkirr to other languages