Eliciting a corpus of word-aligned phrases for MT

Lori Levin, Alon Lavie, Jaime Carbonell
Erik Peterson, Alison Alvarez
Language Technologies Institute
Carnegie Mellon University
Introduction
• Problem: Building Machine Translation systems
for languages with scarce resources:
– Not enough data for Statistical MT and Example-Based MT
– Not enough human linguistic expertise for writing
rules
• Approach:
– Elicit high quality, word-aligned data from bilingual
speakers
– Learn transfer rules from the elicited data
Modules of the AVENUE/MilliRADD
rule learning system and MT system
[Figure: system diagram. Components shown: word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run-Time Transfer System, Lattice, Decoder, English Language Model, Word-to-Word Translation Probabilities.]
Example transfer rule shown in the figure:
{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)
(X1::Y2))
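The example rule PP::PP [NP POSTP] -> [PREP NP] on this slide can be read as a reordering: the second source constituent (X2, the postposition) lands in the first target slot (Y1), and the first source constituent (X1, the NP) lands in the second (Y2). The sketch below illustrates only that reordering step. The class name `TransferRule`, its fields, and the sample words are hypothetical, and it assumes the constituents have already been translated; it is not the AVENUE implementation.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    """Hypothetical container for one reordering rule."""
    source_rhs: tuple[str, ...]               # e.g. ("NP", "POSTP")
    target_rhs: tuple[str, ...]               # e.g. ("PREP", "NP")
    alignments: tuple[tuple[int, int], ...]   # 1-based (X index, Y index)

    def apply(self, translated: list[str]) -> list[str]:
        """Place already-translated source constituents in target order."""
        if len(translated) != len(self.source_rhs):
            raise ValueError("constituent count does not match the rule")
        out = [""] * len(self.target_rhs)
        for x, y in self.alignments:
            out[y - 1] = translated[x - 1]
        return out


# The rule from the slide: (X2::Y1) and (X1::Y2).
pp_rule = TransferRule(("NP", "POSTP"), ("PREP", "NP"), ((2, 1), (1, 2)))

# Illustrative input: an NP followed by a postposition, both already
# rendered in English (the words are made up for this example).
reordered = pp_rule.apply(["the table", "on"])
```

Here `reordered` holds the constituents in prepositional order, with "on" before "the table".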
Outline
• Demo of elicitation interface
• Description of elicitation corpus
Demo of Elicitation Tool
• Speaker needs to be bilingual and literate: no
other knowledge necessary
• Mappings between words and phrases: Many-to-many, one-to-none, many-to-none, etc.
• Create phrasal mappings
• Fonts and character sets:
– Including Hindi, Chinese, and Arabic
• Add morpheme boundaries to target language
• Add alternate translations
• Notes and context
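One possible way to store the word and phrase mappings listed above is as a set of (source index, target index) links. Linked words are then grouped and each group is labeled by its shape. This is a minimal sketch under that assumption: the function name `mapping_types` and the sample alignment are hypothetical, not the tool's actual data model. Each unaligned word is reported on its own, so a many-to-none phrase would appear as several one-to-none entries.

```python
from __future__ import annotations
from collections import defaultdict


def mapping_types(n_src: int, n_tgt: int,
                  links: set[tuple[int, int]]) -> list[tuple[list[int], list[int], str]]:
    """Group linked word positions and label each group's mapping type."""
    src_to_tgt: dict[int, set[int]] = defaultdict(set)
    tgt_to_src: dict[int, set[int]] = defaultdict(set)
    for s, t in links:
        src_to_tgt[s].add(t)
        tgt_to_src[t].add(s)

    seen_src: set[int] = set()
    seen_tgt: set[int] = set()
    groups: list[tuple[list[int], list[int]]] = []
    for start in range(n_src):
        if start in seen_src:
            continue
        # Follow links back and forth until the group stops growing.
        src_group: set[int] = set()
        tgt_group: set[int] = set()
        frontier = [("s", start)]
        while frontier:
            side, i = frontier.pop()
            if side == "s" and i not in src_group:
                src_group.add(i)
                frontier.extend(("t", t) for t in src_to_tgt[i])
            elif side == "t" and i not in tgt_group:
                tgt_group.add(i)
                frontier.extend(("s", s) for s in tgt_to_src[i])
        seen_src |= src_group
        seen_tgt |= tgt_group
        groups.append((sorted(src_group), sorted(tgt_group)))

    # Target words that no source word links to.
    for t in range(n_tgt):
        if t not in seen_tgt:
            groups.append(([], [t]))

    def label(n: int) -> str:
        return "none" if n == 0 else "one" if n == 1 else "many"

    return [(s, t, f"{label(len(s))}-to-{label(len(t))}") for s, t in groups]


# Illustrative alignment: "I am falling" -> "Estoy cayendo",
# with "I" and "am" both linked to "Estoy".
groups = mapping_types(3, 2, {(0, 0), (1, 0), (2, 1)})
```

With this sample alignment, "I am" and "Estoy" form a many-to-one group and "falling" and "cayendo" form a one-to-one group.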
English-Chinese Example
English-Hindi Example
Spanish-Mapudungun Example
English-Arabic Example
Testing of Elicitation Tool
• DARPA Hindi Surprise Language Exercise
• Around 10 Hindi speakers
• Around 17,000 phrases translated and
aligned
– Elicitation corpus
– NPs and PPs from Treebanked Brown Corpus
Elicitation Corpus: Basic
Principles
• Minimal pairs
• Syntactic compositionality
• Special semantic/pragmatic constructions
• Navigation based on language typology and universals
• Challenges
Elicitation Corpus: Minimal Pairs
• Eng: I fell.
  Sp: Caí
  M: Tranün
• Eng: I am falling.
  Sp: Estoy cayendo
  M: Tranmeken
• Eng: You (John) fell.
  Sp: Tú (Juan) caíste
  M: Eymi tranimi (Kuan)
• Eng: You (John) are falling.
  Sp: Tú (Juan) estás cayendo
  M: Eimi (Kuan) tranmekeymi
• Eng: You (Mary) fell.
  Sp: Tú (María) caíste
  M: Eymi tranimi (Maria)
Mapudungun: Spoken by around one million
people in Chile and Argentina.
Using feature vectors to detect
minimal pairs
• np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
• cl1:(subj np1).intr-ag.past.complete
– Eng: You (John) fell.
  Sp: Tú (Juan) caíste
  M: Eymi tranimi (Kuan)
• np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
• cl1:(subj np1).intr-ag.past.complete
– Eng: You (Mary) fell.
  Sp: Tú (María) caíste
  M: Eymi tranimi (Maria)
Inventory of features is based on fieldwork checklists: Comrie
and Smith; Bouquiaux and Thomas.
Feature vectors can be extracted from the output of a parser for
English or Spanish. (Except for features that English and Spanish
do not have…)
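The two feature vectors above differ only in masc versus fem, which is what makes the sentences a minimal pair. A minimal sketch of that comparison follows. It assumes the dot-separated notation from the slide and compares features by position; the function names are hypothetical and the parsing is a guess at the format, not the project's actual code.

```python
def parse_vector(line):
    """Split 'np1:(subj-of cl1).pro-pers...' into a name and a feature list."""
    name, _, features = line.partition(":")
    return name.strip(), [f.strip() for f in features.split(".") if f.strip()]


def parse_sentence(lines):
    """Map each constituent name (np1, cl1, ...) to its feature list."""
    return dict(parse_vector(line) for line in lines)


def feature_differences(a, b):
    """List (constituent, value_a, value_b) for every differing feature.

    Returns None when the two sentences are not comparable, i.e. they have
    different constituents or feature lists of different lengths.
    """
    if a.keys() != b.keys():
        return None
    diffs = []
    for name in a:
        if len(a[name]) != len(b[name]):
            return None
        diffs += [(name, x, y) for x, y in zip(a[name], b[name]) if x != y]
    return diffs


def is_minimal_pair(a, b):
    """True when exactly one feature differs between the two sentences."""
    diffs = feature_differences(a, b)
    return diffs is not None and len(diffs) == 1


# The two vectors from the slide: "You (John) fell." and "You (Mary) fell."
john = parse_sentence([
    "np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
])
mary = parse_sentence([
    "np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
])
diffs = feature_differences(john, mary)
```

For these two sentences the only difference reported is the gender feature on np1, so `is_minimal_pair(john, mary)` holds.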
Syntactic Compositionality
– The tree
– The tree fell.
– I think that the tree fell.
• We learn rules for smaller phrases
– E.g., NP
• Their root nodes become non-terminals in the
rules for larger phrases.
– E.g., S containing an NP
• Meaning of a phrase is predictable from the
meanings of the parts.
Special Semantic and Pragmatic
Constructions
• Meaning may not be compositional
– Not predictable from the meanings of the parts
• May not follow normal rules of grammar.
– Suggestion: Why not go?
• Word-for-word translation may not work.
• Tend to be sources of MT mismatches
– Comparative:
• English: Hotel A is [closer than Hotel B]
• Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
Hotel A TOP Hotel B than close is
• “Closer than Hotel B” is a constituent in English, but “Hoteru B
yori tikai” is not a constituent in Japanese.
Examples of Semantic/Pragmatic
Categories
• Speech Acts: requests, suggestions, etc.
• Comparatives and Equatives
• Modality: possibility, probability, ability,
obligation, uncertainty, evidentiality
• Correlatives: (the more the merrier)
• Causatives
• Etc.
A Challenge: Combinatorics
– Person (1, 2, 3, 4)
– Number (sg, pl, du, paucal)
– Gender/Noun Class (?)
– Animacy (animate/inanimate)
– Definiteness (definite/indefinite)
– Proximity (near, far, very far, etc.)
– Inclusion/exclusion
• Multiply with: tenses and aspects (complete, incomplete,
real, unreal, iterative, habitual, present, past, recent past,
future, recent future, non-past, non-future, etc.)
• Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.
• (Case marking and agreement may vary with verb tense,
verb class, animacy, definiteness, and whether or not
object outranks subject in person or animacy.)
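To get a rough sense of the scale, the explicitly listed values can be multiplied out. The sketch below is only a lower-bound illustration under stated assumptions. It omits gender/noun class and inclusion/exclusion because the slide gives no value list for them, it stops at the three named proximity values, and it treats the tense and aspect labels as one flat dimension even though they are not mutually exclusive in real languages. The dictionary and function names are hypothetical.

```python
from itertools import product
from math import prod

# Values copied from the slide; dimensions without a value list are omitted.
DIMENSIONS = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
    "proximity": ["near", "far", "very far"],
    # Simplification: tense and aspect labels flattened into one dimension.
    "tense_aspect": [
        "complete", "incomplete", "real", "unreal", "iterative", "habitual",
        "present", "past", "recent past", "future", "recent future",
        "non-past", "non-future",
    ],
    "verb_class": [
        "agentive intransitive", "non-agentive intransitive",
        "transitive", "ditransitive",
    ],
}


def count_combinations(dimensions):
    """Number of feature vectors in the full cross product."""
    return prod(len(values) for values in dimensions.values())


def enumerate_vectors(dimensions):
    """Yield every combination as a dict, one value per dimension."""
    names = list(dimensions)
    for values in product(*dimensions.values()):
        yield dict(zip(names, values))


total = count_combinations(DIMENSIONS)
```

Even with these omissions, `total` comes to 9,984 combinations, which is far more than an informant can reasonably be asked to translate.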
Solutions to Combinatorics
• Generate paradigms of feature vectors,
and then automatically generate
sentences to match each feature vector.
• Use known universals to eliminate
features: e.g., Languages without plurals
don’t have duals.
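Both ideas can be sketched together: build a paradigm of feature vectors, drop values that a known universal rules out, and render one elicitation sentence per remaining vector. Everything in the sketch is a toy assumption: the two-person, three-number paradigm, the pronoun table with parenthetical hints, and the single past-tense verb are hypothetical stand-ins for a real sentence generator.

```python
from itertools import product

PERSON_VALUES = ["1", "2"]
NUMBER_VALUES = ["sg", "pl", "du"]

# Toy English prompts; parenthetical hints mark distinctions that
# English does not express on the pronoun itself.
PRONOUNS = {
    ("1", "sg"): "I",
    ("1", "pl"): "We",
    ("1", "du"): "We (two)",
    ("2", "sg"): "You",
    ("2", "pl"): "You (all)",
    ("2", "du"): "You (two)",
}


def allowed_numbers(has_plural):
    """Apply the universal from the slide: no plural implies no dual."""
    return NUMBER_VALUES if has_plural else ["sg"]


def generate_paradigm(has_plural):
    """Yield one feature vector per person and number combination."""
    for person, number in product(PERSON_VALUES, allowed_numbers(has_plural)):
        yield {"person": person, "number": number, "tense": "past"}


def render(vector):
    """Produce an English elicitation sentence for one feature vector."""
    return f"{PRONOUNS[(vector['person'], vector['number'])]} fell."


full = [render(v) for v in generate_paradigm(has_plural=True)]
pruned = [render(v) for v in generate_paradigm(has_plural=False)]
```

With `has_plural=False` the paradigm shrinks from six sentences to two, so the informant is never shown the plural or dual prompts.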
Navigation through the corpus
• Initial diagnostics:
– Does the language mark number on nouns or
in agreement with verbs?
• Sentence selection:
– Based on initial diagnostics
– Based on principles and universals
• E.g., languages that don’t have plurals don’t have
duals
– So that the informant sees very few sentences
that are not relevant for his/her language
Other Challenges of Computer-Based Elicitation
• Inconsistency of human translation and
alignment
• Bias toward word order of the elicitation
language
– Need to provide discourse context for given and new
information
• How to elicit things that aren’t grammaticalized
in the elicitation language:
– Evidential: I see that it is raining/Apparently it is
raining/It must be raining.
• Context: You are inside the house. Your friend comes in wet.