Statistical Machine Translation
Alona Fyshe
Based on slides from Colin Cherry and Dekang Lin
Basic statistics
• 0 <= P(x) <= 1
• P(A)
Probability that A happens
• P(A,B)
Probability that A and B happen
• P(A|B)
Probability that A happens given that we know
B happened
Basic statistics
• Conditional probability
P(A | B) = P(A,B) / P(B)
Basic Statistics
• Use definition of conditional probability to
derive the chain rule
P(A | B) = P(A,B) / P(B)
P(A,B) = P(B) P(A | B) = P(A) P(B | A)
P(A_1, A_2, …, A_n) = P(A_n | A_1, …, A_{n-1}) P(A_1, …, A_{n-1})
= P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) … P(A_n | A_1, …, A_{n-1})
Basic Statistics
• Bayes Rule
P(A,B) = P(A | B) P(B)
P(A,B) = P(B | A) P(A)
P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)
Basic Statistics
• Just remember
Definition of cond. prob.
P(A | B) = P(A,B) / P(B)
Bayes rule
P(A | B) = P(B | A) P(A) / P(B)
Chain rule
P(A_1, …, A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) … P(A_n | A_1, …, A_{n-1})
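These three identities are all we need; here is a quick numeric sanity check in Python (the joint distribution is made up for illustration):

```python
# Sanity check of the three identities above on a made-up joint
# distribution P(A, B) over two binary events.
joint = {(True, True): 0.30, (True, False): 0.20,
         (False, True): 0.10, (False, False): 0.40}

p_A = sum(p for (a, b), p in joint.items() if a)   # P(A) = 0.50
p_B = sum(p for (a, b), p in joint.items() if b)   # P(B) = 0.40

# Definition of conditional probability: P(A|B) = P(A,B) / P(B)
p_A_given_B = joint[(True, True)] / p_B            # 0.75
p_B_given_A = joint[(True, True)] / p_A            # 0.60

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12

# Chain rule (two events): P(A,B) = P(A) P(B|A)
assert abs(joint[(True, True)] - p_A * p_B_given_A) < 1e-12
```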
Goal
• Translate.
• I’ll use French (F) into English (E) as
the running example.
Oh, Canada
• I’m Canadian
Mandatory French class in school until grade 6
I speak “Cereal Box French”
Gratuit (free)
Gagner (win)
Chocolat (chocolate)
Glaçage (icing)
Sans gras (fat-free)
Sans cholestérol (cholesterol-free)
Élevé dans la fibre (high in fibre)
Machine Translation
• Translation is easy for (bilingual) people
• Process:
Read the text in French
Understand it
Write it down in English
Machine Translation
Understanding language
Writing well-formed text
• Hard tasks for computers
The human process is invisible, intangible
One approach: Babelfish
• A rule-based approach to
machine translation
• A 30-year-old feat in Software Eng.
• Programming the knowledge in by hand
is difficult and expensive
Alternate Approach: Statistics
• We are trying to model P(E|F)
I give you a French sentence
You give me back English
• How are we going to model this?
We could use Bayes rule:
P(E | F) = P(F | E) P(E) / P(F)
Alternate Approach: Statistics
P(E | F) = P(F | E) P(E) / P(F)
Given a French sentence F, we could search for the E that
maximizes P(E | F). Since P(F) is fixed for a given F, that
amounts to maximizing P(F | E) P(E)
Why Bayes rule at all?
• Why not model P(E|F) directly?
• P(F|E)P(E) decomposition allows us to be
sloppy
P(E) worries about good English
P(F|E) worries about French that matches English
The two can be trained independently
Crime Scene Analogy
• F is a crime scene. E is a person who may have
committed the crime
P(E|F) - look at the scene - who did it?
P(E) - who had a motive? (Profiler)
P(F|E) - could they have done it? (CSI: transportation, access to weapons, alibi)
• Some people might have great motives, but no
means - you need both!
On voit Jon à la télévision ("We see Jon on television")
Candidates, judged on good English? P(E) and good match to French? P(F|E):
Jon appeared in TV.
It back twelve saw.
In Jon appeared TV.
Jon is happy today.
Jon appeared on TV.
TV appeared on Jon.
Jon was not happy.
Only "Jon appeared on TV." does well on both measures.
Table borrowed from Jason Eisner
I speak English good.
• How are we going to model good English?
• How do we know these sentences are not good
English?
Jon appeared in TV.
It back twelve saw.
In Jon appeared TV.
TV appeared on Jon.
Je ne parle pas l'anglais.
I speak English good.
• Je ne parle pas l'anglais. ("I do not speak English.")
These aren’t English words.
• It back twelve saw.
These are English words, but it’s gibberish.
• Jon appeared in TV.
“appeared in TV” isn’t proper English
I speak English good.
• Let’s say we have a huge collection of
documents written in English
Like, say, the Internet.
• It would be a pretty comprehensive list of
English words
Save for “named entities”: people, places, things
Might include some non-English words
Speling mitsakes! lol!
• Could also tell if a phrase is good English
Google, is this good English?
• Jon appeared in TV.
“Jon appeared” 1,800,000 Google results
“appeared in TV” 45,000 Google results
“appeared on TV” 210,000 Google results
• It back twelve saw.
“twelve saw” 1,100 Google results
“It back twelve” 586 Google results
“back twelve saw” 0 Google results
• Imperfect counting… why?
Google, is this good English?
• Language is often modeled this way
Collect statistics about the frequency of words
and phrases
N-gram statistics
1-gram = unigram
2-gram = bigram
3-gram = trigram
4-gram = four-gram
5-gram = five-gram
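As a sketch of what collecting these statistics looks like (the toy corpus below is made up; the real counts come from billions of words):

```python
# Minimal n-gram counting over a (made-up) tokenized corpus.
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "i love to eat pie and i love to eat chocolate".split()
print(ngram_counts(corpus, 2)[("love", "to")])         # bigram count: 2
print(ngram_counts(corpus, 3)[("love", "to", "eat")])  # trigram count: 2
```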
Google, is this good English?
• Seriously, you want to query google for
every phrase in the translation?
• Google created and released a 5-gram
data set.
Now you can query Google locally
(kind of)
Language Modeling
• What’s P(e)?
P(English sentence)
P(e_1, e_2, e_3, …, e_i)
Using the chain rule:
P(e_1) P(e_2 | e_1) P(e_3 | e_1, e_2) P(e_4 | e_1, e_2, e_3) … P(e_i | e_1, e_2, …, e_{i-1})
Language Modeling
P(e_1) P(e_2 | e_1) P(e_3 | e_1, e_2) P(e_4 | e_1, e_2, e_3) … P(e_i | e_1, e_2, …, e_{i-1})
• Markov assumption
The choice of word ei depends only on the n
words before ei
P(e_i | e_1, e_2, …, e_{i-1}) ≈ P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
• Definition of conditional probability
P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}) = P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}, e_i) / P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
Language Modeling
P(pie | I, love, to, eat) = P(I, love, to, eat, pie) / P(I, love, to, eat)
Language Modeling
P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}) = P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}, e_i) / P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
• Approximate probability using counts
P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}) ≈ C(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}, e_i) / C(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
• Use the n-gram corpus!
Language Modeling
• Use the n-gram corpus!
P(pie | I, love, to, eat) = P(I, love, to, eat, pie) / P(I, love, to, eat)
≈ C(I, love, to, eat, pie) / C(I, love, to, eat)
= 2,760 / 409,000
≈ 0.0067
Not surprisingly, given that you love to eat, loving to
eat chocolate is more probable (0.177)
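In code, the estimate is just a ratio of two count lookups; the numbers below are the counts quoted above:

```python
# P(pie | I, love, to, eat) estimated from the two counts quoted above.
c5 = 2_760      # C(I, love, to, eat, pie)
c4 = 409_000    # C(I, love, to, eat)
print(round(c5 / c4, 4))   # 0.0067
```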
Language Modeling
• But what if
C(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}, e_i) = 0
• Then P(e) = 0
• Happens even if the sentence is
grammatically correct
“Al Gore’s pink Hummer was stolen.”
Language Modeling
• Smoothing
Many techniques
• Add one smoothing
Add one to every count
No more zeros, no problems
• Backoff
If P(e1, e2, e3, e4, e5) = 0 use something
related to P(e1, e2, e3, e4)
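A rough sketch of both fixes; the counters, vocabulary size, and back-off weight below are stand-ins, and real systems use more careful smoothing:

```python
# A sketch of add-one smoothing and back-off. The Counters and the
# vocabulary size V are hypothetical stand-ins for real n-gram data.
from collections import Counter

fivegrams = Counter({("i", "love", "to", "eat", "pie"): 2760})
fourgrams = Counter({("i", "love", "to", "eat"): 409000})
V = 1_000_000   # hypothetical vocabulary size

def p_addone(history, word):
    """Add one to every count: no n-gram has probability zero."""
    return (fivegrams[history + (word,)] + 1) / (fourgrams[history] + V)

def p_backoff(history, word, shorter_model):
    """If the 5-gram was never seen, fall back to a shorter history."""
    if fivegrams[history + (word,)] > 0:
        return fivegrams[history + (word,)] / fourgrams[history]
    return 0.4 * shorter_model(history[1:], word)  # 0.4 is an arbitrary weight

h = ("i", "love", "to", "eat")
print(p_addone(h, "pie"))     # below the unsmoothed 0.0067: V inflates the denominator
print(p_addone(h, "zebras"))  # small but nonzero
```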
Language Modeling
• Wait… Is this how people “generate”
English sentences?
Do you choose your fifth word based only on the four words before it?
• Admittedly, this is an approximation to a
process which is both
intangible and
hard for humans themselves to explain
• If you disagree, and care to defend
yourself, consider a PhD in NLP
Back to Translation
• Anyway, where were we?
Oh right…
P(E | F) = P(F | E) P(E) / P(F)
So, we’ve got P(e), let’s talk P(f|e)
Where will we get P(F|E)?
Cereal boxes in English + the same cereal boxes in French
→ Machine Learning Magic →
P(F|E) model
Where will we get P(F|E)?
Books in English + the same books in French
→ Machine Learning Magic →
P(F|E) model
We call collections stored in two languages parallel
corpora or parallel texts
Want to update your system? Just add more text!
Translated Corpora
• The Canadian Parliamentary Debates
Available in both French and English
• UN documents
Available in Arabic, Chinese, English, French,
Russian and Spanish
Problem:
• How are we going to generalize from examples
of translations?
• I’ll spend the rest of this lecture telling you:
What makes a useful P(F|E)
How to obtain the statistics needed for P(F|E) from
parallel texts
Strategy: Generative Story
• When modeling P(X|Y):
Assume you start with Y
Decompose the creation of X from Y into
some number of operations
Track statistics of individual operations
For a new example X,Y: P(X|Y) can be
calculated based on the probability of the
operations needed to get X from Y
What if…?
The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
New Information
• Call this new info a word alignment (A)
• With A, we can make a good story
The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F,A|E) Story
null The quick fox jumps over the lazy dog
P(F, A | E) = ?
P(F,A|E) Story
null The quick fox jumps over the lazy dog
f1 f2 f3 … f10
P(F, A | E) = ?
Simplifying assumption: choose the length of the French
sentence first; all lengths have equal probability
P(F,A|E) Story
null The quick fox jumps over the lazy dog
f1 f2 f3 … f10
P(F, A | E) = 1/(8+1)^10 × …
There are (l+1)^m = (8+1)^10 possible alignments: each of the
m = 10 French positions picks one of the l+1 = 9 English words
(including null)
P(F,A|E) Story
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F, A | E) = (1/9)^10 × pt(Le | The) × pt(renard | fox) × … × pt(parasseux | lazy)
P(F,A|E) Story
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F, A | E) = 1/(l+1)^m × ∏_{j=1}^{m} pt(f_j | e_{a_j})
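Here is that formula as a short Python sketch; the translation table pt and the alignment below are hypothetical numbers, just to show the mechanics:

```python
# P(F, A | E) = 1/(l+1)^m * product over j of pt(f_j | e_{a_j}).
# The table pt and the alignment are made up for illustration.
pt = {("Le", "The"): 0.4, ("renard", "fox"): 0.5, ("rapide", "quick"): 0.6}

def p_f_a_given_e(french, english, alignment):
    """english[0] is the null word; alignment[j] is the English
    position that generated french[j]."""
    l = len(english) - 1                # English length, excluding null
    m = len(french)
    prob = (1.0 / (l + 1)) ** m         # uniform choice of alignment
    for j, f in enumerate(french):
        prob *= pt.get((f, english[alignment[j]]), 0.0)  # pt(f_j | e_{a_j})
    return prob

english = ["null", "The", "quick", "fox"]
french = ["Le", "renard", "rapide"]
print(p_f_a_given_e(french, english, [1, 3, 2]))  # (1/4)^3 * 0.4 * 0.5 * 0.6
```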
Getting Pt(f|e)
• We need numbers for Pt(f|e)
• Example: Pt(le|the)
Count lines in a large collection of
aligned text
Pt(le | the) = #(le, the) / (#(le, the) + #(la, the) + #(les, the))
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
(the same aligned pair, repeated for every line of the parallel corpus)
Pt(f | e) = #(e linked to f) / #(e linked to anything)
Where do we get the lines?
• That sure looked like a lot of monkeys…
• Remember: sometimes the information hidden
in the text just jumps out at you
We’ll get alignments out of unaligned text by treating
the alignment as a hidden variable
We infer an A that maximizes the probability of our corpus
Generalization of ideas in HMM training: called EM
Where’s “heaven” in Vietnamese?
Example borrowed from Jason Eisner
Where’s “heaven” in Vietnamese?
English: In the beginning God created the heavens and the earth.
Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât.
English: God called the expanse heaven.
Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi.
English: … you are this day like the stars of heaven in number.
Vietnamese: … các nguoi dông nhu sao trên tròi.
Notice that “tròi” shows up in every Vietnamese sentence above.
Example borrowed from Jason Eisner
EM: Expectation Maximization
• Assume a probability distribution (weights)
over hidden events
Take counts of events based on this
distribution
Use counts to estimate new parameters
Use parameters to re-weight examples.
• Rinse and repeat
Alignment Hypotheses
[Eight candidate alignments of “null I like milk” with “Je aime le lait”, drawn with weights 0.65, 0.25, 0.05, 0.01, 0.01, 0.01, 0.01, and 0.001]
Weighted Alignments
• What we’ll do is:
Consider every possible alignment
Give each alignment a weight - indicating how
good it is
P(A | E, F) = P(F, A | E) / P(F | E)
Count weighted alignments as normal
Good grief! We forgot about P(F|E)!
• No worries, a little more stats gets us what
we need:
P(F | E) = Σ_A P(F, A | E)
P(A | E, F) = P(F, A | E) / Σ_A P(F, A | E)
Big Example: Corpus
1: fast car ↔ voiture rapide
2: fast ↔ rapide
Possible Alignments
1a: fast car / voiture rapide (fast–voiture, car–rapide)
1b: fast car / voiture rapide (fast–rapide, car–voiture)
2: fast / rapide (fast–rapide)
Parameters
P(voiture|fast) = 1/2
P(rapide|fast) = 1/2
P(voiture|car) = 1/2
P(rapide|car) = 1/2
Weight Calculations
Using the current parameters:
1a: P(A,F|E) = 1/2 × 1/2 = 1/4; P(A|F,E) = (1/4) / (2/4) = 1/2
1b: P(A,F|E) = 1/2 × 1/2 = 1/4; P(A|F,E) = (1/4) / (2/4) = 1/2
2: P(A,F|E) = 1/2; P(A|F,E) = (1/2) / (1/2) = 1
Count Lines
1a (weight 1/2): fast car / voiture rapide
1b (weight 1/2): fast car / voiture rapide
2 (weight 1): fast / rapide
#(voiture,fast) = 1/2
#(rapide,fast) = 1/2 + 1 = 3/2
#(voiture,car) = 1/2
#(rapide,car) = 1/2
Normalize
P(voiture|fast) = 1/4
P(rapide|fast) = 3/4
P(voiture|car) = 1/2
P(rapide|car) = 1/2
Parameters
P(voiture|fast) = 1/4
P(rapide|fast) = 3/4
P(voiture|car) = 1/2
P(rapide|car) = 1/2
Weight Calculations
Using the new parameters:
1a: P(A,F|E) = 1/4 × 1/2 = 1/8; P(A|F,E) = (1/8) / (4/8) = 1/4
1b: P(A,F|E) = 1/2 × 3/4 = 3/8; P(A|F,E) = (3/8) / (4/8) = 3/4
2: P(A,F|E) = 3/4; P(A|F,E) = (3/4) / (3/4) = 1
Count Lines
1a (weight 1/4): fast car / voiture rapide
1b (weight 3/4): fast car / voiture rapide
2 (weight 1): fast / rapide
#(voiture,fast) = 1/4
#(rapide,fast) = 3/4 + 1 = 7/4
#(voiture,car) = 3/4
#(rapide,car) = 1/4
Normalize
P(voiture|fast) = 1/8
P(rapide|fast) = 7/8
P(voiture|car) = 3/4
P(rapide|car) = 1/4
After many iterations:
1a (weight ~0): fast car / voiture rapide
1b (weight ~1): fast car / voiture rapide
2 (weight 1): fast / rapide
P(voiture|fast) = 0.001
P(rapide|fast) = 0.999
P(voiture|car) = 0.999
P(rapide|car) = 0.001
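The whole procedure fits in a few lines of Python. This sketch mirrors the slides: the same two-sentence corpus, uniform starting parameters, and only the 1-to-1 alignments considered above (the null word is ignored here); it converges to the same ~0.999 values:

```python
# EM for the word-translation probabilities, as traced above.
from itertools import permutations
from collections import defaultdict

corpus = [(["fast", "car"], ["voiture", "rapide"]),   # sentence pair 1
          (["fast"],        ["rapide"])]              # sentence pair 2

# Start uniform, as in the first Parameters slide.
pt = {(f, e): 0.5 for e in ("fast", "car") for f in ("voiture", "rapide")}

for _ in range(20):
    counts = defaultdict(float)   # expected #(e linked to f)
    totals = defaultdict(float)   # expected #(e linked to anything)
    for english, french in corpus:
        # E-step: weight each 1-to-1 alignment by P(A | F, E).
        # Constant factors in P(F, A | E) cancel when we normalize.
        aligns = list(permutations(range(len(english)), len(french)))
        weights = []
        for a in aligns:
            w = 1.0
            for j, f in enumerate(french):
                w *= pt[(f, english[a[j]])]
            weights.append(w)
        z = sum(weights)                   # sum over all alignments
        for a, w in zip(aligns, weights):
            for j, f in enumerate(french):
                counts[(f, english[a[j]])] += w / z
                totals[english[a[j]]] += w / z
    # M-step: normalize the weighted counts into new parameters.
    for f, e in pt:
        pt[(f, e)] = counts[(f, e)] / totals[e]

print(pt[("rapide", "fast")], pt[("voiture", "car")])  # both -> ~1 (the slide's ~0.999)
```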
Seems too easy?
• What if you have no 1-word sentence?
Words in shorter sentences will get more weight (fewer possible alignments)
Weight is additive throughout the corpus: if a word e
shows up frequently with some other word f, P(f|e) will
go up
Like our heaven example
The Final Product
• Now we have a model for P(F|E)
• Test it by aligning a corpus!
i.e., find argmax_A P(A|F,E)
• Use it for translation:
Combine with our n-gram model for P(E)
Search space of English sentences for one
that maximizes P(E)P(F|E) for a given F
Model could be a lot better:
• Word positions
• Multiple f’s generated by the same e
• Could take into account who generated
your neighbors
• Could use syntax, parsing
• Could align phrases
Sure, but is it any better?
• We’ve got some good ideas for improving
translation
• How can we quantify the change in
translation quality?
Sure, but is it any better?
• How to (automatically) measure translation?
Original French
Dès qu'il fut dehors, Pierre se dirigea vers la rue de Paris, la principale rue du
Havre, éclairée, animée, bruyante.
Human translation to English
As soon as he got out, Pierre made his way to the Rue de Paris, the high-street
of Havre, brightly lighted up, lively and noisy.
Two machine translations back to French:
Dès qu'il est sorti, Pierre a fait sa manière à la rue De Paris, la haut-rue de
Le Havre, brillamment allumée, animée et bruyante.
Dès qu'il en est sorti, Pierre s'est rendu à la Rue de Paris, de la grande rue
du Havre, brillamment éclairés, animés et bruyants.
Example from http://www.readwriteweb.com/archives/google_translation_systran.php
Bleu Score
• Bleu
Bilingual Evaluation Understudy
A metric for comparing translations
• Considers
n-grams in common with the target translation
Length of target translation
• Score of 1 is identical, 0 shares no words in
common
• Even human translations don’t score 1
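A toy sketch of the n-gram-overlap idea behind Bleu (not the real metric, which combines clipped n-gram precisions for n = 1..4 with a brevity penalty, possibly over multiple reference translations):

```python
# Toy clipped n-gram precision against a single reference translation.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

reference = "Jon appeared on TV".split()
print(clipped_precision("Jon appeared in TV".split(), reference, 1))  # 0.75
print(clipped_precision("Jon appeared in TV".split(), reference, 2))  # ~0.33
print(clipped_precision("Jon appeared on TV".split(), reference, 2))  # 1.0
```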
Google Translate
• http://translate.google.com/translate_t
25 language pairs
• In the news (digg.com)
http://www.readwriteweb.com/archives/google
_translation_systran.php
• In competition
http://www.nist.gov/speech/tests/mt/doc/mt06
eval_official_results.html
Questions?
References
(Inspiration, Sources of borrowed material)
• Colin Cherry, MT for NLP, 2005
http://www.cs.ualberta.ca/~colinc/ta/MT650.pdf
• Knight, K., Automating Knowledge Acquisition for Machine
Translation, AI Magazine 18(4), 1997.
• Knight, K., A Statistical Machine Translation Tutorial Workbook,
1999, http://www.clsp.jhu.edu/ws99/projects/mt/mt-workbook.htm
• Eisner, J., JHU NLP Course notes: Machine Translation, 2001,
http://www.cs.jhu.edu/~jason/465/PDFSlides/lect32-translation.pdf
• Olga Kubassova, Probability for NLP,
http://www.comp.leeds.ac.uk/olga/ProbabilityTutorial.ppt
Enumerating all alignments
P(F | E) = 1/(l+1)^m × Σ_{a_1=0}^{l} ⋯ Σ_{a_m=0}^{l} ∏_{j=1}^{m} pt(f_j | e_{a_j})
There are (l+1)^m possible alignments!
Gah!
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
pt(f1|e0) pt(f2|e0)
pt(f1|e0) pt(f2|e1)
pt(f1|e0) pt(f2|e2)
pt(f1|e1) pt(f2|e0)
pt(f1|e1) pt(f2|e1)
pt(f1|e1) pt(f2|e2)
pt(f1|e2) pt(f2|e0)
pt(f1|e2) pt(f2|e1)
pt(f1|e2) pt(f2|e2)
Let’s move these over here…
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
pt(f1|e0) × [pt(f2|e0) + pt(f2|e1) + pt(f2|e2)]
pt(f1|e1) × [pt(f2|e0) + pt(f2|e1) + pt(f2|e2)]
pt(f1|e2) × [pt(f2|e0) + pt(f2|e1) + pt(f2|e2)]
And now we can do this…
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
[pt(f1|e0) + pt(f1|e1) + pt(f1|e2)] × [pt(f2|e0) + pt(f2|e1) + pt(f2|e2)]
So, it turns out:
Σ_{a_1=0}^{l} ⋯ Σ_{a_m=0}^{l} ∏_{j=1}^{m} pt(f_j | e_{a_j}) = ∏_{j=1}^{m} Σ_{i=0}^{l} pt(f_j | e_i)
Requires only m(l+1) operations.
Can be used whenever your alignment choice
for one word does not affect the probability of
the rest of the alignment
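A quick numeric check of that rearrangement, with made-up pt values for a two-word French sentence:

```python
# Brute force over all (l+1)^m alignments vs. the factored product of sums.
from itertools import product

english = ["null", "fast", "car"]   # positions 0..l (0 is null), l = 2
french = ["voiture", "rapide"]      # m = 2
pt = {("voiture", "null"): 0.1, ("voiture", "fast"): 0.2, ("voiture", "car"): 0.6,
      ("rapide", "null"): 0.1, ("rapide", "fast"): 0.7, ("rapide", "car"): 0.1}

# Left side: (l+1)^m = 9 terms, each a product of m factors.
brute = 0.0
for a in product(range(len(english)), repeat=len(french)):
    term = 1.0
    for j, f in enumerate(french):
        term *= pt[(f, english[a[j]])]
    brute += term

# Right side: m factors, each a sum of l+1 terms, so m(l+1) operations.
factored = 1.0
for f in french:
    factored *= sum(pt[(f, e)] for e in english)

assert abs(brute - factored) < 1e-12
print(brute, factored)   # both ~0.81
```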