Machine Translation Overview
Alon Lavie
Language Technologies Institute
Carnegie Mellon University
LTI Immigration Course
August 23, 2007
Machine Translation: History
• MT started in the 1940’s, one of the first conceived
applications of computers
• Promising “toy” demonstrations in the 1950’s, failed
miserably to scale up to “real” systems
• ALPAC Report: MT recognized as an extremely difficult,
“AI-complete” problem in the mid-1960’s
• MT Revival started in earnest in 1980s (US, Japan)
• Field dominated by rule-based approaches, requiring
100s of K-years of manual development
• Economic incentive for developing MT systems for small
number of language pairs (mostly European languages)
August 23, 2007
LTI IC 2007
2
Machine Translation:
Where are we today?
• Age of Internet and Globalization – great demand for
MT:
– Multiple official languages of UN, EU, Canada, etc.
– Documentation dissemination for large manufacturers
(Microsoft, IBM, Caterpillar)
• Economic incentive is still primarily within a small
number of language pairs
• Some fairly good commercial products in the market for
these language pairs
– Primarily a product of rule-based systems after many
years of development
• Web-based (mostly free) MT services: Google,
Babelfish, others…
• Pervasive MT between most language pairs still nonexistent and not on the immediate horizon
Example of High Quality MT
PAHO’s Spanam system:
• Mediante petición recibida por la Comisión Interamericana de
Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor
Lino César Oviedo (en adelante …) denunció que la República del
Paraguay (en adelante …) violó en su perjuicio los derechos a las
garantías judiciales … en su contra.
• Through petition received by the `Inter-American Commission on
Human Rights` (hereinafter …) on 6 October 1997, Mr. Linen César
Oviedo (hereinafter “the petitioner”) denounced that the Republic of
Paraguay (hereinafter …) violated to his detriment the rights to the
judicial guarantees, to the political participation, to // equal protection
and to the honor and dignity consecrated in articles 8, 23, 24 and 11,
respectively, of the `American Convention on Human Rights`
(hereinafter …”), as a consequence of judgments initiated against it.
Core Challenges of MT
• Ambiguity and Language Divergences:
– Human languages are highly ambiguous, and each
language is ambiguous in different ways
– Ambiguity at all “levels”: lexical, syntactic, semantic,
language-specific constructions and idioms
• Amount of required knowledge:
– Translation equivalencies for vast vocabularies
(several 100k words and phrases)
– Syntactic knowledge (how to map syntax of one
language to another), plus more complex language
divergences (semantic differences, constructions and
idioms, etc.)
– How do you acquire and construct a knowledge base
that big that is (even mostly) correct and consistent?
Major Sources of Translation
Problems
• Lexical Differences:
– Multiple possible translations for SL word, or
difficulties expressing SL word meaning in a single TL
word
• Structural Differences:
– Syntax of SL is different than syntax of the TL: word
order, sentence and constituent structure
• Differences in Mappings of Syntax to
Semantics:
– Meaning in TL is conveyed using a different syntactic
structure than in the SL
• Idioms and Constructions
Lexical Differences
• SL word has several different meanings, that
translate differently into TL
– Ex: financial bank vs. river bank
• Lexical Gaps: SL word reflects a unique
meaning that cannot be expressed by a single
word in TL
– Ex: English snub doesn’t have a corresponding verb
in French or German
• TL has finer distinctions than SL → SL word
should be translated differently in different
contexts
– Ex: English wall can be German Wand (internal),
Mauer (external)
Structural Differences
• Syntax of SL is different than syntax of the
TL:
– Word order within constituents:
• English NPs: art adj n (the big boy)
• Hebrew NPs: art n art adj (ha yeled ha gadol)
– Constituent structure:
• English is SVO: Subj Verb Obj (I saw the man)
• Modern Arabic is VSO: Verb Subj Obj
– Different verb syntax:
• Verb complexes in English vs. in German:
I can eat the apple / Ich kann den Apfel essen
– Case marking and free constituent order
• German and other languages that mark case:
den Apfel esse ich / the(acc) apple eat I(nom)
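A structural difference of this kind can be handled by a purely syntactic transfer rule. The sketch below applies the English-to-Hebrew NP reordering from the slide (art adj n → art n art adj). The function name and the three-word lexicon are illustrative assumptions, not part of any real system.

```python
# Toy syntactic transfer rule: English NP (art adj n) -> Hebrew NP (art n art adj).
# LEXICON is a hypothetical three-entry dictionary built from the slide's example.
LEXICON = {"the": "ha", "big": "gadol", "boy": "yeled"}


def transfer_np(english_np):
    """Reorder an English [art, adj, n] phrase into Hebrew [art, n, art, adj] order."""
    art, adj, noun = english_np  # the rule only covers this three-word pattern
    reordered = [art, noun, art, adj]  # Hebrew repeats the article before the adjective
    return [LEXICON[word] for word in reordered]


print(" ".join(transfer_np(["the", "big", "boy"])))  # ha yeled ha gadol
```

The rule never looks at what the words mean, only at their categories and positions, which is why one rule covers every NP of this shape.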
Syntax-to-Semantics Differences
• Meaning in TL is conveyed using a different
syntactic structure than in the SL
– Changes in verb and its arguments
– Passive constructions
– Motion verbs and state verbs
– Case creation and case absorption
• Main Distinction from Structural Differences:
– Structural differences are mostly independent of
lexical choices and their semantic meaning → can be
addressed by transfer rules that are syntactic in
nature
– Syntax-to-semantic mapping differences are
meaning-specific: require the presence of specific
words (and meanings) in the SL
Syntax-to-Semantics Differences
• Structure-change example:
I like swimming
“Ich schwimme gern”
I swim gladly
• Verb-argument example:
Jones likes the film.
“Le film plaît à Jones.”
(lit: “the film pleases to Jones”)
• Passive Constructions
– Example: French reflexive passives:
Ces livres se lisent facilement
*“These books read themselves easily”
These books are easily read
Idioms and Constructions
• Main Distinction: meaning of whole is
not directly compositional from meaning
of its sub-parts → no compositional
translation
• Examples:
– George is a bull in a china shop
– He kicked the bucket
– Can you please open the window?
Formulaic Utterances
• Good night.
• tisbaH cala xEr
(lit: “waking up on good”)
• Romanization of Arabic from CallHome Egypt
How to Tackle the Core Challenges
• Manual Labor: 1000s of person-years of human
experts developing large word and phrase translation
lexicons and translation rules.
Example: Systran’s RBMT systems.
• Lots of Parallel Data: data-driven approaches for
finding word and phrase correspondences automatically
from large amounts of sentence-aligned parallel texts.
Example: Statistical MT systems.
• Learning Approaches: learn translation rules
automatically from small amounts of human translated
and word-aligned data. Example: AVENUE’s Statistical
XFER approach.
• Simplify the Problem: build systems that are limited-domain
or constrained in other ways. Examples:
CATALYST, NESPOLE!.
State-of-the-Art in MT
• What users want:
– General purpose (any text)
– High quality (human level)
– Fully automatic (no user intervention)
• We can meet any 2 of these 3 goals
today, but not all three at once:
– FA HQ: Knowledge-Based MT (KBMT)
– FA GP: Corpus-Based (Example-Based) MT
– GP HQ: Human-in-the-loop (efficiency tool)
Types of MT Applications:
• Assimilation: multiple source languages,
uncontrolled style/topic. General purpose MT,
no semantic analysis. (GP FA or GP HQ)
• Dissemination: one source language,
controlled style, single topic/domain. Special
purpose MT, full semantic analysis. (FA HQ)
• Communication: Lower quality may be okay,
but system robustness, real-time required.
Approaches to MT: Vauquois MT Triangle
[Figure: triangle with Analysis along one side and Generation along the other, showing three levels of translation for the same sentence]
• Interlingua (apex): Give-information+personal-data (name=alon_lavie)
• Transfer (middle): [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
• Direct (base): Mi chiamo Alon Lavie → My name is Alon Lavie
Analysis and Generation
Main Steps
• Analysis:
– Morphological analysis (word-level) and POS tagging
– Syntactic analysis and disambiguation (produce
syntactic parse-tree)
– Semantic analysis and disambiguation (produce
symbolic frames or logical form representation)
– Map to language-independent Interlingua
• Generation:
– Generate semantic representation in TL
– Sentence Planning: generate syntactic structure and
lexical selections for concepts
– Surface-form realization: generate correct forms of
words
Direct Approaches
• No intermediate stage in the translation
• First MT systems developed in the
1950’s-60’s (assembly code programs)
– Morphology, bi-lingual dictionary lookup,
local reordering rules
– “Word-for-word, with some local word-order
adjustments”
• Modern Approaches: EBMT and SMT
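The early “word-for-word, with some local word-order adjustments” style can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical five-word English-to-Spanish dictionary with hand-assigned part-of-speech tags and a single reordering rule (adj n → n adj). Real systems of that era also did morphological analysis, which is omitted here.

```python
# Hypothetical toy bilingual dictionary: English word -> (Spanish word, part of speech).
DICTIONARY = {
    "the": ("la", "art"),
    "red": ("roja", "adj"),
    "house": ("casa", "n"),
    "is": ("es", "v"),
    "big": ("grande", "adj"),
}


def direct_translate(sentence):
    """Word-for-word lookup followed by one local reordering rule (adj n -> n adj)."""
    # Step 1: dictionary lookup; unknown words are passed through untranslated.
    tagged = [DICTIONARY.get(w, (w, "unk")) for w in sentence.lower().split()]
    # Step 2: local reordering; swap each adjective that directly precedes a noun.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "adj" and tagged[i + 1][1] == "n":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(direct_translate("the red house is big"))  # la casa roja es grande
```

The output is only correct because the dictionary hard-codes feminine forms; with no analysis stage, agreement and ambiguity have nowhere to be resolved.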
Statistical MT (SMT)
• Proposed by IBM in early 1990s: a direct, purely statistical,
model for MT
• Statistical translation models are trained on a sentence-aligned
parallel bilingual corpus
– Train word-level alignment models
– Extract phrase-to-phrase correspondences
– Apply them at runtime on source input and “decode”
• Attractive: completely automatic, no manual rules, much
reduced manual labor
• Main drawbacks:
– Effective only with large volumes (several mega-words) of parallel
text
– Broad domain, but domain-sensitive
– Still viable only for small number of language pairs!
• Impressive progress in last 5 years
– Large DARPA funding programs (TIDES, GALE)
– Lots of research in this direction
– GIZA++, Pharaoh, CAIRO
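To make “train word-level alignment models” concrete, here is a toy version of the simplest such model, IBM Model 1, trained with EM. It is a sketch under simplifying assumptions (no NULL word, uniform initialization, a three-sentence corpus invented for illustration) and is not how any production toolkit is implemented.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """Estimate word translation probabilities t(f|e) with EM (IBM Model 1, no NULL word)."""
    src_vocab = {e for src, _ in corpus for e in src}
    tgt_vocab = {f for _, tgt in corpus for f in tgt}
    # Uniform initialization: every target word is equally likely for every source word.
    t = {(f, e): 1.0 / len(tgt_vocab) for e in src_vocab for f in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being aligned to e
        total = defaultdict(float)  # expected count of e being aligned to anything
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e) in t:
            t[(f, e)] = count[(f, e)] / total[e]
    return t


# Invented sentence-aligned corpus: (English source words, German target words).
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm_model1(corpus)
for e in ["the", "house", "book", "a"]:
    best = max((f for (f, e2) in t if e2 == e), key=lambda f: t[(f, e)])
    print(e, "->", best)
```

Even though “house” co-occurs equally often with “das” and “Haus”, EM shifts probability toward “Haus” because “das” is better explained by “the”. The need for many such co-occurrences is why the approach depends on large volumes of parallel text.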
EBMT Paradigm
New Sentence (Source):
Yesterday, 200 delegates met with President Clinton.
Matches to Source Found:
– Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
– Difficulties with President Clinton… / Schwierigkeiten mit Praesident Clinton…
Alignment (Sub-sentential):
– Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
– Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
Translated Sentence (Target):
Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
Transfer Approaches
• Syntactic Transfer:
– Analyze SL input sentence to its syntactic structure
(parse tree)
– Transfer SL parse-tree to TL parse-tree (various
formalisms for specifying mappings)
– Generate TL sentence from the TL parse-tree
• Semantic Transfer:
– Analyze SL input to a language-specific semantic
representation (i.e., Case Frames, Logical Form)
– Transfer SL semantic representation to TL semantic
representation
– Generate syntactic structure and then surface
sentence in the TL
Transfer Approaches
Main Advantages and Disadvantages:
• Syntactic Transfer:
– No need for semantic analysis and generation
– Syntactic structures are general, not domain specific
→ Less domain dependent, can handle open domains
– Requires word translation lexicon
• Semantic Transfer:
– Requires deeper analysis and generation, symbolic
representation of concepts and predicates → difficult to
construct for open or unlimited domains
– Can better handle non-compositional meaning structures
→ can be more accurate
– No word translation lexicon – generate in TL from symbolic
concepts
Knowledge-based
Interlingual MT
• The classic “deep” Artificial Intelligence
approach:
– Analyze the source language into a detailed symbolic
representation of its meaning
– Generate this meaning in the target language
• “Interlingua”: one single meaning
representation for all languages
– Nice in theory, but extremely difficult in practice:
• What kind of representation?
• What is the appropriate level of detail to represent?
• How to ensure that the interlingua is in fact universal?
Interlingua versus Transfer
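• With interlingua, need only N parsers/
generators instead of N² transfer systems:
[Figure: two diagrams over languages L1 to L6. In one, every pair of languages is connected directly by a transfer system; in the other, each language is connected only to a central interlingua.]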
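The scaling argument can be checked with a few lines of arithmetic. The sketch below assumes one transfer system per directed language pair, which gives N(N-1), on the order of N², and one analyzer plus one generator per language for the interlingua design, which gives 2N. Both function names are illustrative.

```python
def transfer_systems_needed(n_languages):
    """One transfer system per directed (source, target) pair: N * (N - 1)."""
    return n_languages * (n_languages - 1)


def interlingua_modules_needed(n_languages):
    """One analyzer (parser) plus one generator per language: 2 * N."""
    return 2 * n_languages


for n in (3, 6, 20):
    print(n, transfer_systems_needed(n), interlingua_modules_needed(n))
```

For the six languages in the diagram this is 30 transfer systems against 12 interlingua modules, and the gap widens quickly as languages are added.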
Multi-Engine MT
• Apply several MT engines to
each input in parallel
• Create a combined
translation from the
individual translations
• Goal is to combine
strengths, and avoid
weaknesses.
• Along all dimensions:
domain limits, quality,
development time/cost,
run-time speed, etc.
• Various approaches to the
problem
Speech-to-Speech MT
• Speech just makes MT (much) more difficult:
– Spoken language is messier
• False starts, filled pauses, repetitions, out-of-vocabulary words
• Lack of punctuation and explicit sentence boundaries
– Current Speech technology is far from perfect
• Need for speech recognition and synthesis in
foreign languages
• Robustness: MT quality degradation should be
proportional to SR quality
• Tight Integration: rather than separate
sequential tasks, can SR + MT be integrated in
ways that improve end-to-end performance?
MT at the LTI
• LTI originated as the Center for Machine
Translation (CMT) in 1985
• MT continues to be a prominent sub-discipline
of research within the LTI
– More MT faculty than any of the other areas
– More MT faculty than anywhere else
• Active research on all main approaches to MT:
Interlingua, Transfer, EBMT, SMT
• Leader in the area of speech-to-speech MT
• Multi-Engine MT (MEMT)
• MT Evaluation (METEOR, BLANC)
KBMT: KANT, KANTOO, CATALYST
• Deep knowledge-based framework, with symbolic
interlingua as intermediate representation
– Syntactic and semantic analysis into an unambiguous
detailed symbolic representation of meaning using
unification grammars and transformation mappers
– Generation into the target language using unification
grammars and transformation mappers
• First large-scale multi-lingual interlingua-based MT
system deployed commercially:
– CATALYST at Caterpillar: high quality translation of
documentation manuals for heavy equipment
• Limited domains and controlled English input
• Minor amounts of post-editing
• Active follow-on projects
• Contact Faculty: Eric Nyberg and Teruko Mitamura
EBMT
• Developed originally for the PANGLOSS system
in the early 1990s
– Translation between English and Spanish
• Generalized EBMT under development for the
past several years
• Used in a variety of projects in recent years
– DARPA TIDES and GALE programs
– DIPLOMAT and TONGUES
• Active research work on improving alignment
and indexing, decoding from a lattice
• Contact Faculty: Ralf Brown and Jaime
Carbonell
Statistical MT
• Word-to-word and phrase-to-phrase translation pairs
are acquired automatically from data and assigned
probabilities based on a statistical model
• Extracted and trained from very large amounts of
sentence-aligned parallel text
– Word alignment algorithms
– Phrase detection algorithms
– Translation model probability estimation
• Main approach pursued in CMU systems in the
DARPA/TIDES program and now in GALE
– Chinese-to-English and Arabic-to-English
• Most active work is on phrase detection and on
advanced decoding techniques
• Contact Faculty: Stephan Vogel and Alex Waibel
Speech-to-Speech MT
• Evolution from JANUS/C-STAR systems to
NESPOLE!, LingWear, BABYLON, TC-STAR
– Early 1990s: first prototype system that fully
performed sp-to-sp (very limited domains)
– Interlingua-based, but with shallow task-oriented
representations:
“we have single and double rooms available”
[give-information+availability]
(room-type={single, double})
– Semantic Grammars for analysis and generation
– Multiple languages: English, German, French, Italian,
Japanese, Korean, and others
– Stat-MT applied in Speech-to-Speech scenarios
– Most active work on portable speech translation on
small devices: Arabic/English and Thai/English
– Contact Faculty: Alan Black, Stephan Vogel, Tanja
Schultz and Alex Waibel
AVENUE/LETRAS:
Learning-based Transfer MT
• Develop new approaches for automatically acquiring
syntactic MT transfer rules from small amounts of
elicited translated and word-aligned data
– Specifically designed to bootstrap MT for languages for
which only limited amounts of electronic resources are
available (particularly indigenous minority languages)
– Use machine learning techniques to generalize transfer
rules from specific translated examples
– Combine with SMT-inspired decoding techniques for
producing the best translation of new input from a lattice
of translation segments
• Languages: Hebrew, Hindi, Mapudungun, Quechua
• Most active work on designing a typologically
comprehensive elicitation corpus, advanced techniques
for automatic rule learning, improved decoding, and rule
refinement via user interaction
• Contact Faculty: Alon Lavie, Lori Levin, Jaime Carbonell
and Bob Frederking
Multi-Engine MT
• New approach developed over past two years under
DoD and DARPA funding (used in GALE)
• Main ideas:
– Treat original engines as “black boxes”
– Align the word and phrase correspondences between the
translations
– Build a collection of synthetic combinations based on the
aligned words and phrases
– Score the synthetic combinations based on Language
Model and confidence measures
– Select the top-scoring synthetic combination
• Architecture Issues: integrating “workflows” that
produce multiple translations and then combine them
with MEMT
– IBM’s UIMA architecture
• Contact Faculty: Alon Lavie
Synthetic Combination MEMT
Two Stage Approach:
1. Align: Identify common words and phrases across
the translations provided by the engines
2. Decode: search the space of synthetic combinations
of words/phrases and select the highest scoring
combined translation
Example:
1. announced afghan authorities on saturday reconstituted
four intergovernmental committees
2. The Afghan authorities on Saturday the formation of the
four committees of government
MEMT: the afghan authorities announced on Saturday the
formation of four intergovernmental committees
Automatic MT Evaluation
• METEOR: new metric developed at CMU
• Improves upon BLEU metric developed by IBM and used
extensively in recent years
• Main ideas:
– Assess the similarity between a machine-produced
translation and (several) human reference translations
– Similarity is based on word-to-word matching that
matches:
• Identical words
• Morphological variants of same word (stemming)
• synonyms
– Similarity is based on weighted combination of Precision
and Recall
– Address fluency/grammaticality via a direct penalty: how
well-ordered is the matching of the MT output with the
reference?
• Improved levels of correlation with human judgments of
MT Quality
• Contact Faculty: Alon Lavie
The METEOR Metric
• Example:
– Reference: “the Iraqi weapons are to be handed over to the
army within two weeks”
– MT output: “in two weeks Iraq’s weapons will give army”
• Matching: Ref: Iraqi weapons army two weeks
MT: two weeks Iraq’s weapons army
• P = 5/8 = 0.625, R = 5/14 = 0.357
• Fmean = 10*P*R/(9P+R) = 0.3731
• Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50
• Discounting factor: DF = 0.5 * (frag**3) = 0.0625
• Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
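The arithmetic above can be reproduced with a short function. This sketch follows the formulas exactly as written on this slide and takes the match counts as given inputs; it does not perform the word matching itself (exact, stem, synonym). The function and parameter names are illustrative, not taken from the METEOR software.

```python
def meteor_score(matches, mt_length, ref_length, chunks):
    """Score from match statistics, using the formulas shown on the slide."""
    if matches == 0:
        return 0.0  # nothing matched, so precision and recall are both zero
    precision = matches / mt_length
    recall = matches / ref_length
    # Harmonic mean weighted 9:1 in favor of recall.
    fmean = 10 * precision * recall / (9 * precision + recall)
    # Fragmentation as on the slide: (frags - 1) / (matched words - 1).
    frag = (chunks - 1) / (matches - 1) if matches > 1 else 0.0
    discount = 0.5 * frag ** 3
    return fmean * (1 - discount)


# Values from the example: 5 matched words, 8 MT words, 14 reference words, 3 fragments.
score = meteor_score(matches=5, mt_length=8, ref_length=14, chunks=3)
print(round(score, 4))  # 0.3498
```

Because recall carries nine times the weight of precision, dropping reference words costs far more than adding extra words to the MT output.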
Summary
• Main challenges for current state-of-the-art MT
approaches - Coverage and Accuracy:
– Acquiring broad-coverage high-accuracy translation
lexicons (for words and phrases)
– Learning syntactic mappings between languages from
parallel word-aligned data
– Overcoming syntax-to-semantics differences and
dealing with constructions
– Stronger Target Language Modeling
Questions…