Progress on Dictionary - Carnegie Mellon School of Computer Science

Data Collection and Analysis of Mapudungun Morphology for Spelling Correction
Christian Monson, Lori Levin, Rodolfo Vega, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell, Eliseo Cañulef, Rosendo Huisca
AVENUE Mapudungun
• Instituto de Estudios Indígenas
– Universidad de La Frontera, Temuco, Chile
• Programa de Educación Intercultural
Bilingüe
– Ministry of Education (Mineduc), Chile
• Language Technologies Institute
– Carnegie Mellon University, USA
Goals of AVENUE Mapudungun
• Multicultural and Bilingual Education, Mineduc, Chile
– Basic skills taught in Spanish and mother tongue
– Use of technology and networking even in rural areas
• NLP tools for bilingual education: AVENUE Project, CMU, USA
– On-line dictionary
– Bilingual corpus
– Spelling checker
• NLP tools for languages with low resources
– Machine learning of morphology and translation rules
Outline
• Overview of Mapudungun language
• Plan for on-line dictionary
• Progress on dictionary
• Plan for spelling checker
• Progress on spelling checker
Mapudungun
• Mapuche people
– Around 900,000
– Chile and Argentina
• Agglutinative/Polysynthetic
– Up to 36 suffix slots (Smeets, 1989)
• Typical verb has five or six suffixes
– Noun incorporation
• Noun goes immediately after the verb stem
– Vstem+(noun)+(suffixes)+last-suffix
• Last suffix for finite verb is mood and person/number of agent or patient
• Last suffix for non-finite verb is nominalization or adverbialization
• Other suffixes include aspect, negation, inversive, etc.
Examples of Mapudungun verbs
Amu  -ke        -yngün
go   -habitual  -3plIndic
They (usually) go

Ngütrümtu  -a    -lu
call       -fut  -adverb
While calling (tomorrow), …

nentu    -ñma  -nge   -ymi
extract  -mal  -pass  -2sgIndic
You were extracted (on me)

ngütramka  -me   -a    -fi    -ñ
tell       -loc  -fut  -3obj  -1sgIndic
I will tell her (away)
Plans for Dictionary (Mineduc)
• Tri-lingual (Spanish-Mapudungun-English)
• Pronunciation for each word for each language
• Example of use for each Mapudungun word
• Specific users can exchange suggestions and alternate pronunciations
– Teachers and students of schools in the PEIB/Orígenes program www.origenes.cl
– Web-based, using Flash
– Based on shared lesson plans and network communications
• Vocabulary
– From the corpus of spoken Mapudungun
– From the Chilean curriculum for first four years of school
– From the informatics domain
• User interface will be designed by Mineduc
Corpus of spoken Mapudungun
• 170 hours of speech
– 120 hours: Nguluche dialect
– 30 hours: Lafkenche dialect
– 20 hours: Pewenche dialect
– 0 hours: Williche dialect
• Different and more endangered
• Mapuche interviewer and interviewees
• Dialogues about health problems treated by doctor or
traditional healer.
• Recorded with DAT recorder
– Some recordings are poor quality
– Some high enough quality for training a speech recognizer
• Transcribed using TransEdit
• Translated into Spanish by native speaker of
Mapudungun
Examples from Mapudungun-Spanish corpus
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
Progress on Dictionary
• Around 3000 Mapudungun words (stems and
fully inflected forms)
– Spanish translation of the word
– Sentence from the corpus of spoken Mapudungun
containing the word form
– Spanish translation of the sentence, and
– Reference into the corpus of spoken Mapudungun
identifying the specific cited sentence
– For 1600 words
• segmentation of the word into morphemes
• gloss for each morpheme
• Stored as a Word file with delimiters between
fields.
– Can be easily converted to other formats
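As a rough illustration of converting such delimited records, the sketch below parses one entry into named fields and pairs each morpheme with its gloss. The `|` delimiter, the field order, and the names `FIELDS` and `parse_entry` are assumptions made for this example; the slides do not specify the actual delimiter or layout of the Word file.

```python
import json

# Hypothetical field order; the real file layout is not given in the slides.
FIELDS = ["word", "segmentation", "gloss", "translation",
          "example", "example_translation", "corpus_ref"]


def parse_entry(line, delimiter="|"):
    """Split one delimited dictionary record into a dict of named fields."""
    values = [v.strip() for v in line.split(delimiter)]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    # Pair morphemes with glosses when the entry has a segmentation.
    if entry["segmentation"]:
        entry["morphemes"] = list(zip(entry["segmentation"].split("-"),
                                      entry["gloss"].split("-")))
    else:
        entry["morphemes"] = []
    return entry


record = ("Kümekünueymu | küme-künu-eymu | bien-quedar-él(ella).a.ti | "
          "te ha dejado muy bien | Ka kümekünueymu tati. | "
          "Y te ha dejado muy bien. | nmlch-nmpll1_x_0070_nmlch_00")
entry = parse_entry(record)
as_json = json.dumps(entry, ensure_ascii=False, indent=2)
```

Once the fields are named, writing the entries out as JSON, XML, or database rows is a matter of choosing a serializer.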
Examples from Dictionary
• Lichi: .? . / /.
– Translation: leche.
– Example: Feychi lichi, ¿chem lichingey?
– Example translation: (Esta leche ¿qué leche es?)
– Index: nmlch-nmfhp1_x_0051_nmlch_00. Ec/Rh/Fc. Ec/ Rh02-01-03.
Examples from Dictionary
• Kümekünueymu:
– Segmentation: küme-künu-eymu.
– Gloss: bien-quedar-él(ella).a.ti .? . / /.
– Translation: te ha dejado muy bien.
– Example: Ka kümekünueymu tati.
– Example translation: (Y te ha dejado muy bien).
– Index: nmlch-nmpll1_x_0070_nmlch_00. EC/RH0302-03.
Examples from Dictionary
• Mongepeürkelayan:
– Segmentation: monge-pe-ürke-la-y-a-n.
– Gloss: sanar-tal.vez-acaso-no-0-futuro-yo .? . / /.
– Translation: no mejoraré tal vez.
– Example: "Feytüfachi operalayaymi, operaeliyu l'ayaymi" pieneu. "Mongepeürkelayan may" pin. Fey l'awen'tueneu, l'awen'tueneu; fey ka tripantun.
– Example translation: ("Esta vez no te vas a operar, si te opero te vas a morir" me dijo. "No mejoraré tal vez, entonces", dije. Entonces me medicinó, me medicinó; entonces también estuve un año).
– Index: nmlch-nmpll1_x_0042_nmpll_00. Ec/Rh/Fc. Ec/ Rh23-12-02
Plans for spelling checker
• Goal: identify misspellings even for
morphologically complex words.
• We don’t have a morphological analyzer
– Mapudungun speakers don’t know computational linguistics
– We don’t know Mapudungun
– Currently training a field linguist from Argentina
(Roberto Aranovich) in computational linguistics
– Research on automated morphology learning (Christian
Monson)
• We want the spelling checker to be compatible
with a major word processor.
• Using MySpell and OpenOffice
MySpell
• Open-source, standalone version of
OpenOffice.org spell-checker
• Functional equivalent of Unix 'ispell'
• Data files specify stems and classes of affixes
– each base-form word specifies valid affix classes
– can condition applicability based on characters in base-form word
➔ e.g. English plurals formed with -es if word ends in -ch
– can modify base form prior to adding affix
➔ e.g. change -y to -ie before adding -s
• Limitation: at most one prefix and one suffix can be applied to each base form
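To make the affix-class idea concrete, here is a minimal Python sketch of how an ispell-style suffix rule can be modeled: each rule has a condition on the end of the base form, characters to strip, and characters to add. This is not MySpell's code or file format; the class `SuffixRule`, the rule list `PLURAL`, and the function `inflect` are names invented for this illustration, and the English plural rules are simplified.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SuffixRule:
    strip: str      # characters removed from the end of the base form
    add: str        # characters appended after stripping
    condition: str  # regex that the end of the base form must match

    def apply(self, base: str) -> Optional[str]:
        """Return the inflected form, or None if the rule does not apply."""
        if not re.search(f"(?:{self.condition})$", base):
            return None
        if not base.endswith(self.strip):
            return None
        return base[: len(base) - len(self.strip)] + self.add


# One affix class: simplified English plural rules (illustrative only).
PLURAL = [
    SuffixRule(strip="y", add="ies", condition="[^aeiou]y"),  # pony -> ponies
    SuffixRule(strip="", add="s", condition="[aeiou]y"),      # day -> days
    SuffixRule(strip="", add="es", condition="[sxzh]"),       # church -> churches
    SuffixRule(strip="", add="s", condition="[^sxzhy]"),      # cat -> cats
]


def inflect(base: str, rules: List[SuffixRule]) -> List[str]:
    """Apply every rule of an affix class; exactly one suffix is added."""
    forms = []
    for rule in rules:
        form = rule.apply(base)
        if form is not None:
            forms.append(form)
    return forms
```

Because each rule adds a single suffix to a base form, a word carrying several suffixes in sequence cannot be generated by chaining rules, which is the limitation noted above.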
Plans for Spelling Checker
• MySpell for Mapudungun
– Example of full segmentation
• Mongepeürkelayan
• monge-pe-ürke-la-y-a-n.
• no mejoraré tal vez.
– Example of segmentation for MySpell
• monge (stem)
• peürkelayan (suffix string)
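The collapse from a full segmentation to a stem plus one suffix string can be sketched in a few lines of Python. The function name `to_myspell_split` is made up for this example, and the sketch assumes the stem is always the first morpheme, which ignores cases such as noun incorporation.

```python
def to_myspell_split(segmentation):
    """Collapse 'stem-suf1-suf2-...' into (stem, single suffix string).

    Assumes the first morpheme is the stem (an assumption of this sketch).
    """
    morphemes = segmentation.strip().rstrip(".").split("-")
    stem, suffixes = morphemes[0], morphemes[1:]
    return stem, "".join(suffixes)


stem, suffix_string = to_myspell_split("monge-pe-ürke-la-y-a-n.")
```

Treating the whole chain as one suffix string works around the one-suffix limit, at the cost of listing every attested combination of suffixes separately.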
Progress on Spelling Checker
• Step 1: Devise spelling conventions
– There are competing standards for
Mapudungun spelling
– First version of spelling checker:
• AVENUE Mapudungun spelling standards by
Cañulef, Huisca, Painequeo, and Carrasco
• Step 2: Get a list of “correctly” spelled
words, according to the conventions.
– Currently have “correct” spelling for the
70,000 most frequent words from the corpus
Progress on Spelling Checker
Most frequent 70,000 words corrected by hand

Frequency Rank   Transcribed Word Form   Spelling Corrected Word Form
…
103              feli                    feley
104              pichikeche              pichikeche
105              kümey                   kümey
…
10,001           chumkunual              chumkünuael
10,002           puedelafuy              puedelafuy
10,003           tulayin                 tulayiñ
10,004           kimngepelay             kimngepelay
…
Can we use this list instead of stemming?
The bad news
[Figure: type-token curves for Mapudungun and Spanish. X-axis: Tokens, in Thousands (0 to 1,500). Y-axis: Types, in Thousands (0 to 140).]
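A type-token curve plots the number of distinct word forms (types) seen so far against the number of running words (tokens) read. A minimal Python sketch of the computation follows; the function name `type_token_curve` and the tiny sample are illustrative only and are not the project's data or code.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy sample built from word forms shown on the slides.
sample = "feley kümey feley pichikeche kümey tulayiñ".split()
curve = type_token_curve(sample)
```

When the curve keeps rising as more text is read, new word forms keep appearing, so a fixed list of correctly spelled words will keep missing valid forms.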
Progress on Spelling Checker
• Step 3: Iteration of stem/suffix boundaries
– Start with 1600 segmented words from the dictionary
– Identify the suffix strings
– For the next most frequent 1000 words
• If the word ends in a known suffix string, insert a stem/suffix
boundary
• Oversegments because we don’t check that the remaining stem is
known after the suffix string is removed
– Native speakers correct the boundaries
• 333 had to be corrected
– Two more iterations
• Next most frequent 3000 words (579 were wrong)
• Next most frequent 5000 words (1175 were wrong)
– Results in 9000 words with correct stem/suffix boundaries
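One round of the procedure above might look like the following Python sketch. The `+` boundary marker, the function names, and the longest-match-first choice are assumptions of this example and are not stated in the slides. As in the description, the stem is not checked against any list, so the proposal can oversegment.

```python
def suffix_strings(segmented_words):
    """Collect the suffix strings from entries written as 'stem+suffixstring'."""
    suffixes = set()
    for entry in segmented_words:
        if "+" in entry:
            suffix = entry.split("+", 1)[1]
            if suffix:
                suffixes.add(suffix)
    return suffixes


def propose_boundary(word, suffixes):
    """Insert a stem/suffix boundary if the word ends in a known suffix string.

    Tries longer suffix strings first (an assumption of this sketch).
    The remaining stem is NOT validated, so this can oversegment.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + "+" + suffix
    return word


# Seed built from segmentations shown earlier in the slides.
seed = ["monge+peürkelayan", "amu+keyngün"]
known = suffix_strings(seed)

# 'xyz' is not a stem, yet a boundary is still proposed: oversegmentation.
proposal = propose_boundary("xyzkeyngün", known)
```

After native speakers correct the proposals, the corrected entries would be added to the seed before the next round. The correction counts reported above work out to 33.3% of the first 1000 words, 19.3% of the next 3000, and 23.5% of the next 5000.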
Effect of stemming on number of types
Mapudungun T-T Curve
[Figure: number of types (0 to 7,000) plotted against number of tokens (0 to 30,000), with stemming and without stemming.]
If the suffix string and the stem are in the list of 9000 correctly segmented words, treat it as an instance of the stem. Otherwise, treat it as a new type.
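The counting rule can be sketched as follows in Python. This is one possible reading of the rule: a token is reduced to its stem when it can be split into a known stem followed by a known suffix string. The function names and the toy data are invented for this example, and the bare stem "amu" is used as a token purely for illustration.

```python
def reduce_to_stem(word, stems, suffixes):
    """Return the stem if word = known stem + known suffix string, else word."""
    for i in range(1, len(word)):
        if word[:i] in stems and word[i:] in suffixes:
            return word[:i]
    return word


def count_types(tokens, stems, suffixes):
    """Return (types without stemming, types with stemming)."""
    plain = set(tokens)
    stemmed = {reduce_to_stem(tok, stems, suffixes) for tok in tokens}
    return len(plain), len(stemmed)


stems = {"amu", "monge"}
suffixes = {"keyngün", "peürkelayan"}
tokens = ["amukeyngün", "amu", "mongepeürkelayan", "feley", "amukeyngün"]
without_stemming, with_stemming = count_types(tokens, stems, suffixes)
```

Here "amukeyngün" and "amu" fall together as one type under stemming, while "feley" stays a separate type because it has no known split.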
Conclusion
• Building tools that can be used for
bilingual education in Chilean schools
• Large parallel corpus of spoken
Mapudungun translated into Spanish
• Small dictionary with examples from the
corpus
• Can we build a spelling checker with
MySpell?
– We will let you know at a future conference.