Nihonglish
Machine Translation System
Jed Cruz
Eliot Ayer
May 14th
Introduction
Project Goal:
• Translate from Japanese to English using statistical machine translation (an N-gram model).
• Tanaka Corpus: 150,000 parallel Japanese-English sentences.
N-Grams
1. A probabilistic model.
2. Predicts the next word from the N-1 previous words:
   n-gram:  P(w_n | w_{n-1}, w_{n-2}, ..., w_1)
   bigram:  P(w_2 | w_1)
   trigram: P(w_3 | w_2, w_1)
3. Based on statistics gathered from our parallel corpus.
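As an illustration (not the project's code), here is a minimal Python sketch of estimating bigram probabilities by counting, with the sentence-start marker "<s>" as an assumed convention:

    from collections import Counter

    def bigram_model(sentences):
        """Estimate P(w2 | w1) by counting bigrams in a corpus.
        sentences: list of token lists."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        # Maximum-likelihood estimate: count(w1 w2) / count(w1)
        return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    p = bigram_model([["i", "want", "chinese", "food"]])   # toy corpus
    print(p("i", "want"))   # 1.0: "want" always follows "i" here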
Bi-Gram vs Tri-Gram
bigram:  P(w_2 | w_1 = "the")
trigram: P(w_3 | w_1 = "in", w_2 = "the")
Japanese Language
Kanji:    亜米利加 (over 5,000 in use)
Hiragana: あ め り か (48 characters)
Katakana: ア メ リ カ (48 characters)
Romaji:   a me ri ka

Like a speech recognizer disambiguating "seas", "seize", or "sees".
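One hedged sketch of how such disambiguation can work: score each candidate in its surrounding context with the language model and keep the winner (lm_logprob is a hypothetical sentence scorer, e.g., built from the counting code above):

    def disambiguate(candidates, left_context, right_context, lm_logprob):
        """Pick the candidate whose context the language model scores highest."""
        return max(candidates,
                   key=lambda w: lm_logprob(left_context + [w] + right_context))

    # e.g. disambiguate(["seas", "seize", "sees"], ["she"], ["the", "cat"],
    #                   lm_logprob)  # should favor "sees" given a reasonable LM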
N-Gram Usages
Disambiguation (choosing the correct meaning, e.g., of Japanese "hashi"):
"bridge", "chopsticks", "edge", "end"
Predicting the next word after "Tokyo":
"tomodachi" - friend
"to" - with
"to iu" - said
Interpolation
We utilize uni-, bi-, tri-, and quad-grams, using linear interpolation to combine them. Each N-gram probability is weighted by λ_i, such that:

P_interp(w_n | w_{n-3}, w_{n-2}, w_{n-1}) = λ_1 P(w_n | w_{n-3}, w_{n-2}, w_{n-1})
                                          + λ_2 P(w_n | w_{n-2}, w_{n-1})
                                          + λ_3 P(w_n | w_{n-1})
                                          + λ_4 P(w_n)
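A minimal sketch of this interpolation, assuming per-order probability functions like the bigram estimator above (the λ values here are illustrative, not the project's tuned weights):

    def interpolated_prob(w, history, p4, p3, p2, p1,
                          lambdas=(0.4, 0.3, 0.2, 0.1)):
        """Linear interpolation of quad-, tri-, bi-, and uni-gram estimates.
        history holds the preceding words, most recent last; the lambdas
        must sum to 1 so the result stays a probability."""
        l1, l2, l3, l4 = lambdas
        return (l1 * p4(w, history[-3:]) +
                l2 * p3(w, history[-2:]) +
                l3 * p2(w, history[-1:]) +
                l4 * p1(w))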
Statistical Machine Translation
Given a foreign sentence F, find the most probable English sentence E.
By Bayes' rule, this becomes maximizing the product of two terms: P(F | E) P(E).
Training is divided into two models:
The translation model, P(F | E):
  Requires a parallel corpus.
  Gives the probabilities of foreign words corresponding to English words.
The language model, P(E):
  Can be computed with just English text.
  Gives the probability that E is a fluent English sentence.
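In symbols: E* = argmax_E P(E | F) = argmax_E P(F | E) P(E). A hedged sketch of this decision rule over a set of candidate sentences, where translation_logprob and lm_logprob are hypothetical scoring functions for the two models:

    def best_translation(F, candidates, translation_logprob, lm_logprob):
        """Noisy-channel rule: argmax over E of log P(F|E) + log P(E)."""
        return max(candidates,
                   key=lambda E: translation_logprob(F, E) + lm_logprob(E))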
Creating a Bilingual Dictionary
Estimate P(F | E) using Baum-Welch:
• Start with no parameters...
• End with convergence.
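The slides name Baum-Welch; as an illustration of the same expectation-maximization idea, here is a minimal IBM-Model-1-style sketch for estimating word translation probabilities (an assumed formulation, not the project's code):

    from collections import defaultdict

    def train_word_translation(pairs, iterations=10):
        """EM estimation of t(f | e) from (foreign, english) token pairs."""
        f_vocab = {f for fs, _ in pairs for f in fs}
        t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start: "no parameters"
        for _ in range(iterations):
            count = defaultdict(float)   # expected counts c(f, e)
            total = defaultdict(float)   # expected counts c(e)
            for fs, es in pairs:
                for f in fs:
                    z = sum(t[(f, e)] for e in es)   # E-step normalizer
                    for e in es:
                        count[(f, e)] += t[(f, e)] / z
                        total[e] += t[(f, e)] / z
            for (f, e), c in count.items():          # M-step: renormalize
                t[(f, e)] = c / total[e]
        return t

    pairs = [(["watashi", "wa", "iku"], ["i", "go"]),       # invented toy data
             (["watashi", "wa", "tabemasu"], ["i", "eat"])]
    t = train_word_translation(pairs)                       # converges over iterations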
Generate a phrase-based dictionary:
• Word-based dictionaries have some problems.
• Using the phrase as the fundamental unit results in better, faster translations.
• Phrases that appear in both languages are aligned; each phrase and its probability is stored in a phrase translation table (sketched below).
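A hedged sketch of what such a phrase translation table might look like as a data structure (the entries are invented examples, not drawn from the Tanaka Corpus):

    # Foreign phrase -> list of (English phrase, probability) entries.
    phrase_table = {
        ("tokyo", "ni"): [(("to", "tokyo"), 0.7), (("in", "tokyo"), 0.3)],
        ("iki", "masu"): [(("will", "go"), 0.6), (("go",), 0.4)],
    }

    def lookup(foreign_phrase):
        """Return candidate English phrases, most probable first."""
        return sorted(phrase_table.get(tuple(foreign_phrase), []),
                      key=lambda entry: entry[1], reverse=True)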
Finding the Hidden English
The output of the translation model is a "bag of words". We need to rearrange these words:

[ will Tokyo tomorrow go I ]  ->  [ I will go to Tokyo tomorrow ]
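A hedged sketch of the most naive way to do this rearrangement: score every permutation of the bag with the language model and keep the best order (workable only for short toy inputs; real decoders search far more cleverly). lm_logprob is the hypothetical sentence scorer used above:

    from itertools import permutations

    def reorder(bag_of_words, lm_logprob):
        """Pick the word order the language model likes best."""
        return max(permutations(bag_of_words),
                   key=lambda order: lm_logprob(list(order)))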
N-Grams to Model Language
The language model, i.e. what statistically sounds like a good English sentence, can be built with n-grams. Example words: food, chinese, want, i.
w_{i-1}   w_i       frequency in corpus
$         i         20267
i         want      651
want      chinese   0
chinese   food      12

word1     word2     word3    word4    joint probability
i         want      chinese  food     158,325,804
food      chinese   i        want     46,872
chinese   food      i        want     23,436
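A hedged sketch of how the counts in the first table become a sentence probability; the unigram totals here are invented for illustration (they are not given in the slides):

    # Bigram counts from the table above; unigram totals are assumed.
    bigram_counts = {("$", "i"): 20267, ("i", "want"): 651,
                     ("want", "chinese"): 0, ("chinese", "food"): 12}
    unigram_counts = {"$": 100000, "i": 25000, "want": 1200, "chinese": 150}

    def sentence_prob(words):
        """P(sentence) ~ product of bigram estimates P(w_i | w_{i-1})."""
        p = 1.0
        for w1, w2 in zip(["$"] + words, words):
            p *= bigram_counts.get((w1, w2), 0) / unigram_counts[w1]
        return p

    print(sentence_prob(["i", "want", "chinese", "food"]))  # 0.0
    # The zero "want chinese" count zeroes the whole product, which is
    # exactly why the interpolation slide backs off to lower-order n-grams.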
Examples
These are word-based translations.
Limitations of Nihonglish
Only infinitive verbs.
No katakana.
Must include the subject.
Limited scalability.
No punctuation.