Transcript Translation

Corpora and Translation
Parallel corpora
Statistical MT
(not to mention: corpora of translated text, for translation studies)
Parallel corpora
• Corpora of texts and their translations
• Basic idea that such parallel corpora implicitly
contain lots of information about translation
equivalence
• Nowadays many such “bitexts” are available
– bilingual countries have laws, parliamentary
proceedings, and other documents
– large multinational organizations (UN, EU [Europarl
corpus], etc.)
– multinational commercial organizations produce
multilingual texts
Bilingual concordance
Source: TransSearch, Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal, http://www-rali.iro.umontreal.ca
Parallel corpora
• Usually not corpora in the strict sense
(planned, annotated, etc.)
• Usefulness may depend on
– the quality of translation
– the closeness of translation
– whether we have a text and its translation, or
a multilingually authored text
– the language pair
• Parallel corpus needs to be aligned
Alignment
• Means annotating the bilingual corpus to show
explicitly the correspondences
– at sentence level
– at word and phrase level
• Main difficulty for sentence alignment is that
translations do not always keep sentence
boundaries, or even sentence order
• In addition, translation may be “localized” and
therefore not especially faithful
Sentence-level alignment
• If the parallel corpus is a fairly literal
translation, sentence alignment can be done using
quite low-level information
– sentence length
– looking for anchors
• proper names, dates, figures
• eg in a parliamentary debate, speakers’ names
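As a rough illustration of length-based sentence alignment (in the spirit of Gale & Church 1993), the sketch below aligns two sentence lists by dynamic programming over 1:1, 1:0, 0:1, 2:1 and 1:2 "beads", scoring each bead by how far the character lengths deviate from proportionality. The cost function and penalties are illustrative assumptions, not the published model.

def length_cost(len_s, len_t):
    # crude penalty for pairing a source chunk of len_s characters with a
    # target chunk of len_t characters: squared deviation from equal length,
    # scaled by the average length (a stand-in for the probabilistic cost)
    if len_s == 0 and len_t == 0:
        return 0.0
    return (len_s - len_t) ** 2 / ((len_s + len_t) / 2.0)

def align_sentences(src_sents, tgt_sents):
    # dynamic programming over alignment "beads" (1:1, 1:0, 0:1, 2:1, 1:2);
    # returns a list of (source sentence indices, target sentence indices) beads
    S = [len(s) for s in src_sents]
    T = [len(t) for t in tgt_sents]
    n, m = len(S), len(T)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # (source sentences consumed, target sentences consumed, fixed penalty)
    beads = [(1, 1, 0.0), (1, 0, 6.0), (0, 1, 6.0), (2, 1, 3.0), (1, 2, 3.0)]
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj, penalty in beads:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or cost[pi][pj] == INF:
                    continue
                c = cost[pi][pj] + penalty + length_cost(sum(S[pi:i]), sum(T[pj:j]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (pi, pj)
    # trace back the lowest-cost path
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(path))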
Alignment tools
Corpus-based MT
• Translation memory (tool for translators)
– database of previous translations
– find close matching examples to current
translation unit
– translator decides what to do with it
Note that the translator has to know/decide which bits of the target sentence to change
• Example-based translation
– similar idea, but computer program tries to
manipulate example(s)
– may involve “learning” general rules from
multiple examples
Statistical MT
• Pioneered by IBM in early 1990s
• Spurred on by the greater success in speech
recognition of statistical over linguistic rule-based approaches
• Idea that translation can be modelled as a
statistical process
• Seems to work best in limited domains where
the given data is a good model of future translations
Translation as a probabilistic problem
• For a given SL sentence Si, there are a
number of possible “translations” T of varying
probability
• Task is to find for Si the sentence Tj for
which the probability P(Tj | Si) is the
highest
Two models
• P(Tj | Si) is a function of two models:
– The probabilities of the individual words that
make up Tj given the individual words in Si – the “translation model”
– The probability that the individual words that
make up Tj are in the appropriate order – the
“language model”
Expressed in mathematical terms:

arg max_T P(T | S) = arg max_T P(T) × P(S | T) / P(S)

Since S is a given, and constant, this can be simplified as

arg max_T P(T | S) = arg max_T P(T) × P(S | T)

where P(T) is the language model and P(S | T) is the translation model
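A minimal sketch of what this arg max means in practice, assuming we already have log-probability functions for a language model and a translation model, and some hypothetical list of candidate translations:

def best_translation(source, candidates, lm_logprob, tm_logprob):
    # arg max over T of P(T) * P(S|T), computed in log space; lm_logprob(T)
    # and tm_logprob(S, T) are assumed to come from pre-trained models
    return max(candidates, key=lambda t: lm_logprob(t) + tm_logprob(source, t))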
So how do we translate?
• For a given input sentence Si we have to have a
practical way to find the Tj that maximizes the
formula
• We have to start somewhere, so we start with
the translation model: which words look most
likely to help us?
• In a systematic way we can keep trying different
combinations together with the language model
until we stop getting improvements
[Diagram: input sentence → translation model → bag of possible words → language model → most probable translation; improvement is sought by trying other combinations]
Where do the models come from?
• All the statistical parameters are pre-computed
(“learned”), based on a parallel corpus
• Language model is probabilities of word
sequences (n-grams)
• Translation model is derived from aligned
parallel corpus
• This approach is attractive to some as an
example of “machine learning”
– The computer learns to translate (just) from seeing
previous examples of translation
The translation model
• Take sentence-aligned parallel corpus
• Extract entire vocabulary for both
languages
• For every word-pair, calculate probability
that they correspond – e.g. by comparing
distributions
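One standard way to estimate such word-pair probabilities from a sentence-aligned corpus is the EM training of IBM Model 1; the sketch below is a bare-bones version (no NULL word, no pruning), intended only to show the idea of comparing distributions over aligned sentence pairs.

from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    # bitext: list of (source_words, target_words) sentence pairs
    # returns t[(s, e)] ~ P(source word s | target word e), i.e. P(S|T) word by word
    t = defaultdict(lambda: 1.0)              # crude uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts (E-step)
        total = defaultdict(float)
        for src, tgt in bitext:
            for s in src:
                z = sum(t[(s, e)] for e in tgt)   # normalise over possible alignments
                for e in tgt:
                    c = t[(s, e)] / z
                    count[(s, e)] += c
                    total[e] += c
        for (s, e), c in count.items():           # re-estimate (M-step)
            t[(s, e)] = c / total[e]
    return dict(t)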
Problem: fertility
• “fertility”: not all word correspondences are
1:1
– Some words have multiple possible translations, e.g. the → {le, la, l’, les}
– Some words have no translation, e.g. in il se rase ‘he shaves’, se → ∅
– Some words are translated by several words, e.g. cheap → peu cher
– Not always obvious how to align
Problem: distortion
• Notice that corresponding words do not
appear in the same order.
• The translation model includes
probabilities for “distortion”
– e.g. P(5|2): the probability that the source word in
position 2 will produce a target word in position 5
– can be more complex: P(5|2,4,6): the probability
that the source word in position 2 will produce a
target word in position 5 when S has 4 words and T has 6.
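A very simple way to get such distortion probabilities is relative-frequency estimation over a word-aligned corpus; the sketch below conditions on the richer case, P(target position | source position, source length, target length). The input format is an assumption for illustration.

from collections import defaultdict

def distortion_table(aligned_corpus):
    # aligned_corpus yields (src_len, tgt_len, links) where links is a list of
    # (source position, target position) pairs from the word alignment
    counts = defaultdict(float)
    totals = defaultdict(float)
    for l, m, links in aligned_corpus:
        for i, j in links:
            counts[(j, i, l, m)] += 1.0       # how often source position i maps to target j
            totals[(i, l, m)] += 1.0
    # P(j | i, l, m) by relative frequency
    return {(j, i, l, m): c / totals[(i, l, m)] for (j, i, l, m), c in counts.items()}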
The language model
• Impractical to calculate probability of every
word sequence:
– Many will be very improbable …
– Because they are ungrammatical
– Or because they happen not to occur in the data
• Probabilities of sequences of n words (“n-grams”) more practical
– Bigram model: P(w1, w2, ..., wn) ≈ ∏ P(wi | wi–1)
where P(wi | wi–1) ≈ f(wi–1, wi) / f(wi–1)
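A maximum-likelihood bigram model can be read straight off a corpus; the sketch below counts bigrams and divides by the frequency of the preceding word (real systems work in log space and smooth these estimates).

from collections import defaultdict

def train_bigram_lm(sentences):
    # sentences: list of lists of words; returns P(w | previous word) by relative frequency
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]     # sentence boundary markers
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return {(prev, cur): c / unigram[prev] for (prev, cur), c in bigram.items()}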
Sparse data
• Relying on n-grams with a large n risks zero probabilities
• Bigrams are less risky but sometimes not
discriminatory enough
– e.g. I hire men who is good pilots
• 3- or 4-grams allow a nice compromise, and if
a 3-gram is previously unseen, we can give it
a score based on the component bigrams
(“smoothing”)
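One simple form of such smoothing is linear interpolation: an unseen 3-gram still gets a non-zero score from its component bigram and unigram. The lambda weights below are illustrative; real systems tune them (or use back-off schemes such as Katz or Kneser-Ney).

def smoothed_trigram(trigram, bigram, unigram, w1, w2, w3,
                     l3=0.6, l2=0.3, l1=0.1):
    # linear interpolation of trigram, bigram and unigram relative frequencies
    return (l3 * trigram.get((w1, w2, w3), 0.0)
            + l2 * bigram.get((w2, w3), 0.0)
            + l1 * unigram.get(w3, 0.0))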
Put it all together and …?
• To build a statistical MT system we need:
– Aligned bilingual corpus
– “Training programs” which will extract from the
corpora all the statistical data for the models
– A “decoder” which takes a given input and
seeks the output favoured by the magic
argmax formula, using a heuristic search
algorithm
• Software for this purpose is freely available
– http://www.statmt.org/moses/,
http://www.isi.edu/licensed-sw/pharaoh/
• Claim is that an MT system for a new
language pair can be built in a matter of
hours
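Real decoders such as Moses use beam search over partial hypotheses with reordering; the sketch below is a much cruder stand-in (greedy, monotone, left to right), shown only to illustrate how translation-model and language-model scores are combined during the search. All data structures here are hypothetical.

def greedy_decode(src_words, phrase_table, bigram_lm, max_phrase_len=3):
    # phrase_table: {source phrase tuple: [(target phrase tuple, log P(S|T)), ...]}
    # bigram_lm: {(previous word, word): log P(word | previous word)}
    UNSEEN = -10.0                            # crude penalty for unseen events
    output, i, total = ["<s>"], 0, 0.0
    while i < len(src_words):
        best = None
        for k in range(1, max_phrase_len + 1):
            for tgt, tm_score in phrase_table.get(tuple(src_words[i:i + k]), []):
                score, prev = tm_score, output[-1]
                for w in tgt:                 # add language-model continuation score
                    score += bigram_lm.get((prev, w), UNSEEN)
                    prev = w
                if best is None or score > best[0]:
                    best = (score, k, tgt)
        if best is None:                      # unknown word: copy it through
            best = (UNSEEN, 1, (src_words[i],))
        total += best[0]
        output.extend(best[2])
        i += best[1]
    return output[1:], total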
SMT latest developments
• Nevertheless, quality is limited
• SMT researchers quickly learned that this crude
approach can only get them so far (quite far, actually),
but that to go the extra distance you need
linguistic knowledge (e.g. morphology, “phrases”,
constituents)
• Latest developments aim to incorporate this
• Big difference is that it too can be LEARNED
(automatically) from corpora
• So SMT still contrasts with traditional RBMT
where rules are “hand coded” by linguists
Direct phrase alignment
(Wang & Waibel 1998, Och et al. 1999, Marcu & Wong 2002)
• Enhance word translation model by adding
joint probabilities, i.e. probabilities for
phrases
• Phrase probabilities compensate for
missing lexical probabilities
• Easy to integrate probabilities from
different sources/methods, allows for
mutual compensation
Word alignment induced model
Koehn et al. 2003; example stolen from Knight & Koehn
http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf
Maria did not slap the green witch
Maria no daba una bofetada a la bruja verde
Start with all phrase pairs
justified by the word
alignment
(Maria, Maria), (no, did not), (daba una bofetada, slap),
(a la, the), (verde, green), (bruja, witch)
(Maria, Maria), (no, did not), (daba una bofetada, slap),
(a la, the), (verde, green), (bruja, witch),
(Maria no, Maria did not), (no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch)
etc.
(Maria, Maria), (no, did not), (slap, daba una bofetada), (a la, the),
(bruja, witch), (verde, green), (Maria no, Maria did not),
(no daba una bofetada, did not slap),
(daba una bofetada a la, slap the), (bruja verde, green witch),
(Maria no daba una bofetada, Maria did not slap),
(no daba una bofetada a la, did not slap the),
(a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Maria did not slap the),
(daba una bofetada a la bruja verde, slap the green witch),
(no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Maria did not slap the
green witch)
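The extraction step behind this list can be written down compactly: a phrase pair is kept if the word alignment links inside it never point outside it. The sketch below is a simplified version (it does not handle the extension over unaligned boundary words described by Koehn et al. 2003).

def extract_phrases(src, tgt, alignment, max_len=7):
    # alignment: set of (source index, target index) word links
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(len(src), i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # consistency: no link from inside the target span back to outside the source span
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for (i, j) in alignment):
                continue
            if j2 - j1 < max_len:
                pairs.append((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs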
Alignment templates
(Och et al. 1999; further developed by Marcu and Wong 2002,
Koehn and Knight 2003, Koehn et al. 2003)
• Problem of sparse data worse for phrases
• So use word classes instead of words
– alignment templates instead of phrases
– more reliable statistics for translation table
– smaller translation table
– more complex decoding
• Word classes are induced (by distributional
statistics), so may not correspond to intuitive
(linguistic) classes
• Takes context into account
Problems with phrase-based models
• Still do not handle very well ...
– dependencies (especially long-distance)
– distortion
– discontinuities (e.g. bought = habe ... gekauft)
• More promising seems to be ...
Syntax-based SMT
• Better able to handle
– Constituents
– Function words
– Grammatical context (e.g. case marking)
• Inversion Transduction Grammars
• Hierarchical transduction model
• Tree-to-string translation
• Tree-to-tree translation
Inversion transduction grammars
• Wu and colleagues (1997 onwards)
• Grammar generates two trees in parallel
and mappings between them
• Rules can specify order changes
• Restriction to binary rules limits complexity
Inversion transduction grammars
• Grammar is trained on word-aligned bilingual
corpus: Note that all the rules are learned
automatically
• Translation uses a decoder which effectively
works like traditional RBMT:
– Parser uses source side of transduction rules to build
a parse tree
– Transduction rules are applied to transform the tree
– The target text is generated by linearizing the tree
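To make the "two trees in parallel" idea concrete, the sketch below uses a toy data layout (purely an assumption for illustration): a derivation node is either a terminal (source word, target word) pair or an (orientation, left, right) triple, where "inverted" rules swap the order of their children on the target side only.

def linearize(node, side):
    # read off one side ("src" or "tgt") of a parallel ITG derivation
    if len(node) == 2:                        # terminal: (source word, target word)
        return [node[0] if side == "src" else node[1]]
    orientation, left, right = node
    children = [left, right]
    if side == "tgt" and orientation == "inverted":
        children = [right, left]              # inverted rule: swap order on the target side
    return [w for child in children for w in linearize(child, side)]

# toy derivation for "the green witch" / "la bruja verde"
tree = ("straight", ("the", "la"),
        ("inverted", ("green", "verde"), ("witch", "bruja")))
# linearize(tree, "src") -> ['the', 'green', 'witch']
# linearize(tree, "tgt") -> ['la', 'bruja', 'verde']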
Other approaches
• Other approaches use more and more
“linguistic” information
• In each case automatically learned,
especially from treebanks
• Traditional (“rule-based”) MT used (handwritten) grammars and lexicons
• State-of-the-art MT is moving back in this
direction, except that linguistic rules are
machine learned