Automatic Feedback for CALL Using NLP


Building and Using an Inuktitut-English Parallel Corpus
Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan
<firstname>.<lastname>@nrc.gc.ca
Agglutinative written form
qaisaaliniaqquunngikkaluaqpuq
Root-suffixes-grammatical suffix:
qai-, -saali-, -niaq-, -qquu-, -nngit-, -galuaq, -puq
“Actually, he probably won’t come early today.”
Nunavut Hansards
• 155 days of Nunavut Legislative Assembly
• April 1, 1999 to November 1, 2002
             Characters                 Words        Sentences    Paragraphs
English      20,124,587                 3,432,212    348,619      112,346
Inuktitut    13,457,581 / 21,305,295    1,586,423    352,486      118,733
English:
These symbols, like the Qamutik that rests on the floor, will find a home in our new Assembly building.
I would finally like to recognize the artists who created the mace.
Inuktitut:
taakkua qamutiik natirmiittuuk iniqarumaanniaqtuuk nutaamik maligaliurvingmi.
kingulliqpaami ilitarijumavakka sananngualauqtuminiujuit anautarmik.
Difficulties Aligning Inuktitut Hansards
• No spelling checkers
• Many dialects (translators)
  • “School”: ilinniarvik, ilisavik, ilinniaqvik, ilitarvik, ilinniavik
• Words
  • 1:1 word alignment is not usually possible
  • No root dictionary for Eastern Canada
• Lengths
  • Aligning by length in words is not a good idea
  • Aligning by length in characters: average = 1.05
Alignment Techniques
• Length Alignment: (Gale and Church, 1993)
• Gaussian to estimate matching probability
• Dynamic programming to optimize the match
• Lexical Alignment:
• non-alphabetic sequences (9:00, 42-1(1) and 1999)
• 8 reliable word correspondences
• speaker/uqaqti
• motion/pigiqati
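The length-alignment bullets above (a Gaussian match probability optimized by dynamic programming) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constants `C`, `S2` and the bead priors are the values published by Gale and Church (1993) for European language pairs, and the function names `match_cost` and `align` are hypothetical. If the 1.05 average reported for characters is the Inuktitut/English length ratio, it could be substituted for `C`.

```python
import math

# Assumed parameters from Gale and Church (1993); they are not given in the slides.
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length difference per source character
# Prior probability of each bead type: (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, prior):
    """Negative log probability that spans of these character lengths match."""
    if len_src == 0 and len_tgt == 0:
        tail = 1.0
    else:
        mean = (len_src + len_tgt / C) / 2.0
        delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
        # Two-tailed probability of a deviation this large under a standard normal.
        tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return beads ((src_start, src_end), (tgt_start, tgt_end)) of minimal total cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return beads[::-1]


# Invented sentence lengths in characters: the second source sentence
# is best explained as a 1:2 bead covering two target sentences.
print(align([20, 100, 30], [21, 52, 50, 31]))
# [((0, 1), (0, 1)), ((1, 2), (1, 3)), ((2, 3), (3, 4))]
```

A lexical stage such as the one listed above could be layered on top by lowering the cost of beads whose two sides share an anchor (for example the same number or time string); that refinement is not shown here.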
Initial Alignment Results
                    Precision            Recall
Gale & Church       2448/3670 = 66.7%    2448/3424 = 71.5%
G&C paragraphs      2978/3479 = 85.6%    2978/3424 = 87%
Lexical & Length    3161/3459 = 91.4%    3161/3424 = 92.3%
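The percentages in these results can be rechecked with a few lines of arithmetic. The sketch below assumes the usual reading of the fractions: the numerator is the number of correct alignments, the precision denominator is the number of alignments the method proposed, and the recall denominator (3424 in every row) is the number of reference alignments. The function name `precision_recall` is illustrative only.

```python
def precision_recall(correct, proposed, reference):
    """Precision = correct / proposed, recall = correct / reference."""
    return correct / proposed, correct / reference


# (correct, proposed, reference) copied from the results slide.
results = {
    "Gale & Church": (2448, 3670, 3424),
    "G&C paragraphs": (2978, 3479, 3424),
    "Lexical & Length": (3161, 3459, 3424),
}

for name, counts in results.items():
    precision, recall = precision_recall(*counts)
    print(f"{name}: precision {precision:.1%}, recall {recall:.1%}")
# Gale & Church: precision 66.7%, recall 71.5%
# G&C paragraphs: precision 85.6%, recall 87.0%
# Lexical & Length: precision 91.4%, recall 92.3%
```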
Is the alignment useful?
• Term Dictionary
• Few contemporary dictionaries
• Few with roots and suffixes (Eastern Arctic)
• Spelling differences, dialectal differences
• Examples:
  • -kiaq “don’t know”
  • tukisi- “understand”
  • -juma- “want”
  • maligaliur(vi)- “assembly”
  • piita “Peter”
  • kanata “Canada”
  • makalain “McLean”
What is a term?
• Inuktitut Terms
• Words, phrases of 2 to 4 words
• Prefixes, internal substrings, final substrings < 10 ch.
• English Terms
• Words, phrases of 2 to 4 words
• Prefixes
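One way to enumerate candidate terms of the kinds listed above is sketched below. It is an illustration under assumptions: whitespace tokenization, the "< 10 ch." limit applied to every Inuktitut substring type (the slide may intend it only for final substrings), and a hyphen notation marking where a substring was cut. All function names are hypothetical.

```python
def word_ngrams(words, max_n=4):
    """Single words and phrases of 2 to max_n consecutive words."""
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}


def substrings(word, max_len=9):
    """Prefixes ("abc-"), internal substrings ("-abc-") and final substrings ("-abc")."""
    out = set()
    n = len(word)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            if start == 0 and end == n:
                continue  # the whole word is already a word-level term
            left = "" if start == 0 else "-"
            right = "" if end == n else "-"
            out.add(left + word[start:end] + right)
    return out


def inuktitut_terms(sentence):
    words = sentence.split()
    terms = word_ngrams(words)
    for word in words:
        terms |= substrings(word)
    return terms


def english_terms(sentence):
    words = sentence.lower().split()
    terms = word_ngrams(words)
    for word in words:
        # English side: proper prefixes only, no internal or final substrings.
        terms |= {word[:end] + "-" for end in range(1, len(word))}
    return terms


i_terms = inuktitut_terms("kanata piita")
e_terms = english_terms("new Assembly building")
print("kana-" in i_terms, "-nat-" in i_terms, "-ata" in i_terms)  # True True True
print("assem-" in e_terms, "-bly" in e_terms)                     # True False
```

The number of substrings grows quickly with word length, which is consistent with the later remark that comparing every Inuktitut term with every English term is slow.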
All against all
• Consider every Inuktitut term to every English term
• Slow with big files of partial results
Consistent Translations
                              Bead contains     Inuktitut term
                              Inuktitut term    is missing
Bead contains English term    I & E             ~I & E
English term is missing       I & ~E            ~I & ~E
PMI = log( Pr(I&E) / (Pr(I) * Pr(E)) )
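The four cells of the contingency table and the PMI score can be computed from aligned beads as in the sketch below. The bead representation (a pair of term sets), the function names, and the toy beads are assumptions made for illustration; the natural logarithm is used because the slide does not state a base.

```python
import math


def contingency(beads, i_term, e_term):
    """Count beads in each cell: (I & E, I & ~E, ~I & E, ~I & ~E)."""
    both = i_only = e_only = neither = 0
    for i_terms, e_terms in beads:
        has_i, has_e = i_term in i_terms, e_term in e_terms
        if has_i and has_e:
            both += 1
        elif has_i:
            i_only += 1
        elif has_e:
            e_only += 1
        else:
            neither += 1
    return both, i_only, e_only, neither


def pmi(both, i_only, e_only, neither):
    """PMI = log( Pr(I&E) / (Pr(I) * Pr(E)) ), estimated from bead counts."""
    total = both + i_only + e_only + neither
    if both == 0:
        return float("-inf")  # the terms never co-occur
    p_ie = both / total
    p_i = (both + i_only) / total
    p_e = (both + e_only) / total
    return math.log(p_ie / (p_i * p_e))


# Toy beads (invented), each a pair of term sets for one aligned unit.
beads = [
    ({"kanata"}, {"canada"}),
    ({"kanata", "piita"}, {"canada", "peter"}),
    ({"piita"}, {"peter"}),
    ({"uqaqti"}, {"speaker"}),
]
print(pmi(*contingency(beads, "kanata", "canada")))  # positive: the terms co-occur
print(pmi(*contingency(beads, "kanata", "peter")))   # 0.0: no association
```

PMI is unstable for rare terms, since a pair seen together once in one bead can score very highly, which is one motivation for putting a confidence interval around the underlying ratios.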
Confidence Interval around Ratios (95%)
Frequency    Total    Lower     Upper
2            2        0.3424    1.0000
2            10       0.0567    0.5098
167          1000     0.1452    0.1914
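The slide does not name the interval method. The tabulated bounds appear to match the Wilson score interval with z = 1.96, so the sketch below uses that formula as an assumption; the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(frequency, total, z=1.96):
    """Wilson score interval for the proportion frequency / total (95% when z = 1.96)."""
    p = frequency / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for frequency, total in [(2, 2), (2, 10), (167, 1000)]:
    lower, upper = wilson_interval(frequency, total)
    print(f"{frequency:>4} {total:>5} {lower:.4f} {upper:.4f}")
#    2     2 0.3424 1.0000
#    2    10 0.0567 0.5098
#  167  1000 0.1452 0.1914
```

The first row shows why the interval matters for glossary building: a term pair seen together in 2 of 2 beads has a ratio of 1.0, but its lower bound is only about 0.34, so ranking by the lower bound favours pairs with more supporting evidence.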
Glossary Results
4362 term pairs
72.3% of English word occurrences (but…)
Exact Matches (43%):
a) half were uninflected proper nouns.
b) inuup and person’s.
Good (more in the Inuktitut) Matches (44%):
pigiaqtitara and deal. “I deal with him”.
Summary
http://www.InuktitutComputing.ca/NunavutHansard/en/
1) Sentence alignment of an agglutinative language.
2) Use of the sentence alignment to build a glossary.
Glossary examples:
• -lauqsimanngit- “have never”
• inuliriji “social worker”
• -kiaq “don’t know”
• nuu juak “New York”
• tusaumajjutilirinirmut kanngunaqtulirinirmullu (kamis-) “Information and Privacy Commissioner”