Transcript Slide 1

1
English-Persian SMT
Reza Saeedi
[email protected]
WTLAB
Wednesday, May 25, 2011
Outline
2






MT Introduction
SMT Introduction
Requirements for SMT
Evaluation metrics
English-Persian MT challenges
English-Persian SMT
 System1
 System2

Problems in English-Persian SMT
MT Introduction
3

Automatic translation of text written in a natural
language into another one by the use of computers
is referred to as Machine Translation.

There are several way to do this work:
 Dictionary-based
 Rule-based
 Example-based
 Statistical
approach
SMT Introduction
4

First ideas of Statistical machine translation was
proposed by Warren Weaver in 1947.

Statistical machine translation tries to learn the
translation by examining the translations made by
humans.
SMT Introduction(Cont.)
5



Statistical MT models take the view that every
sentence in the target language is a translation of
the source language sentence with some
probability.
The best translation, of course, is the sentence that
has the highest probability.
The key problems in statistical MT are:
 estimating
the probability of a translation
 and efficiently finding the sentence with the highest
probability.
SMT Introduction(Cont.)
6

Given a Source sentence f, we seek the target
sentence e that maximizes P(e | f).
e‘ = argmaxe P(e | f)

Intuitively, P(e|f) should depend on two factors:
 P(e|f)
= P(e) * P(f | e) / P(f)
 argmaxe
P(e | f) = argmaxe P(e) * P(f | e)
fluency
faithfulness
SMT Introduction(Cont.)
7

Philipp koehn
 http://homepages.inf.ed.ac.uk/pkoehn
Why SMT?
8

Better use of resources
Not need linguistic knowledge
It can use for any pair of language

But


 We
need a big training corpus
Steps of SMT
9
Requirements for SMT
10

Bilingual and Monolingual Corpus:
 For
bilingual need tow file aligned sentence by
sentence (one file for source language and other for
target language)
 Microsoft

Bi-Lingual sentence Aligner
Language Model:
 We
need a tool to compute P(e)
 For this step we need to monolingual corpus
 SRILM: a tool for create N-grams
LM output
11
Requirements for SMT
12

Translation Model:
 We
need a tool for compute P(f|e)
 For this step we need to bilingual corpus
 GIZA++
 The output of this tool is a phrase table

Decode:
 For
search and find best translation
 Moses
Phrase table
13
Moses tool
14
The training steps
15









Prepare data
Run GIZA++
Align words
Get lexical translation table
Extract phrases
Score phrases
Build reordering model
Build generation models
Create configuration file
Evaluation metrics
16

BLEU(BiLingual Evaluation Understudy)
 Developed
 The
at IBM’s
closer a MT is to a professional human translation,
the better it is

NIST
English-Persian MT challenges
17




The Persian language structure is very different in
comparison to English
The structure of Persian language is very complex
There has been little previous work done for this
language pair
Effective SMT systems rely on very large bilingual
corpora but there are not readily available for the
English/Persian language pair
English-Persian SMT
18

There have been few English-Persian MT systems
developed

Most of them are purely rule-based

There are two work on English-Persian SMT
 Mohaghegh
 Pilevar
and Sarrafzadeh (Massey University)
and Faili (Tehran University)
System1
19

Corpus: BBC news
System1(Cont.)
20

Tools: SRILM, GIZA++, Moses
System1: Improved Language Modeling
21
System2
22

Corpus:
 Bidirectional(TEP):
Subtitle of films, 3 books, KDE4
System2(Cont.)
23

Corpus:
 Monolingual:
Hamshahri, subtitle of films
System2(Cont.)
24

Tools: SRILM, GIZA++, Moses
PersianSMT with 4-gram Sub-LM
Comparison PersianSMT with Google Translator
25
Problems in English-Persian SMT
26

compound verbs (aligning problem)
 Use
a phrase-based SMT system
 But problem is inflectional morphology
 Large number of inflected verb forms does not let the
system learn to translate all the individual forms of a
compound verb

Persian takes personal pronouns as an optional
element in the sentence (aligning problem)
Problems(Cont.)
27

failure of the system to place the elements of the
sentence in the right order

Use a phrase-based SMT system

Re-rank the n-best output list and/or reorder the output
sentences

Prior to translation, the input sentence is reordered using
morpho-syntactic information, so that the word order
resembles better that of the target language.
28
References
29

[1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report,
Department of Computer Science and Engineering Indian Institute of Technology,
2000.

[2] A. LOPEZ, "Statistical Machine Translation", ACM Computing Surveys, 2008.

[3] M. Mohaghegh, & A. Sarrafzadeh, “The first english-persian statistical machine
translation”, New Zealand Postgraduate Conference, 2009 .

[4] M. Mohaghegh, & A. Sarrafzadeh, " An analysis of the effect of training data
variation in English-Persian Statistical Machine Translation”, 2009 International
Conference on Innovations in Information Technology (IIT 2009)

[5] M. Mohaghegh, & A. Sarrafzadeh, " Performance evaluation of various training
data in English-Persian statistical machine translation “, Appear in Proceedings of
the 10th International Conference on the Statistical Analysis of Textual Data
(JADT 2010), Rome, Italy, June 9-11, 2010.

[6] M. Mohaghegh, & A. Sarrafzadeh, " Improved Language Modeling for
English-Persian Statistical Machine Translation”, COLING 2010 / SIGMT
Workshop 23rd International Conference on Computational Linguistics Beijing,
China 28 August 2010
References(Cont.)
30

[7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian
Statistical Machine Translation", to appear in Proc. of 10th International
Conference on statistical analysis of textual data (JADT 2010)