Translating Subtitles using Machine Translation

Practices, Problems, Methodology
Elsa Sklavounou, Ph. D.
Linguist, Co-funded Projects Technical Coordinator
SYSTRAN
www.systransoft.com
SYSTRAN MT Customization Methodology
Overview
A customization project involves three customization
levels that provide incrementally higher translation quality:

Basic Terminology

Complex Terminology

Linguistic Rules
SYSTRAN MT Customization Methodology
Overview

Basic Terminology
The first step entails the creation of a User Dictionary that covers most
of the noun terminology in the corpus, and various simple adjective and
verb terms.

Complex Terminology
The second level concerns the coding of complex terminological
entries, such as complex verbs with their complements
(subject, object, …) and their translations.

Linguistic Rules
The third level involves language-specific code modifications in the
SYSTRAN linguistic modules.
SYSTRAN MT Customization Methodology
Level 1 & Level 2
Customization levels 1 and 2 focus on implementing specialized
terminology from the corpus in the systems. Level 1 and 2
tasks include:
Simple and complex term extraction;
Simple and complex term translation;
Simple and complex term coding;
Simple and complex term review.
SYSTRAN MT Customization Methodology
Level 1 & Level 2
Step 1: Corpus installation and analysis
Prerequisite 1: a formatted corpus
Step 2: Term extraction
Simple terms (nouns and noun expressions)
Complex terms (verb patterns)
DNT (Do Not Translate) integration
SYSTRAN MT Customization Methodology
Level 3
Customization level 3 focuses on the implementation of linguistic
rules uniquely adapted to language-specific syntactic and
semantic issues found in translations taken from the corpus.
Level 3 tasks include:
Detailed linguistic evaluations and the development of a
comprehensive customization plan:
Implementation of customized rules
Regression tests
Correction of linguistic translation errors
Acceptance testing before release
SYSTRAN MT Customization Methodology
Quality Levels
Estimate of the quality levels that may be achieved for each
customization level.
SYSTRAN MT Customization Methodology
Software Tools
The process for coding simple and complex terms and related
dictionary maintenance is managed by the SYSTRAN Linguistics
Platform that integrates the following two tools, required to
complete customization levels 1 and 2.
SYSTRAN MT Customization Methodology
Software Tools
SYSTRAN Dictionary Manager
The SYSTRAN Dictionary Manager (SDM) enables translators to
build and manage multilingual dictionaries. SDM includes
preparation steps for dictionary coding tasks, an online dictionary
lookup (via an HTML interface), and a compiler for runtime machine
translation dictionaries. It is composed of three main components: a
database, an HTML query form (dictionary lookup, reports, logs,
import and export), and a Windows client (interactive coding tool).
SYSTRAN Customization Methodology
Software Tools
The SYSTRAN Review Manager (SRM) is a productivity tool used for
the review, quality assessment, and maintenance of linguistic
resources used in combination with a SYSTRAN system.
SYSTRAN Customization Methodology
Prerequisite 1:
a formatted grammatical corpus
Grammar Writing Rules
Using Articles
Avoiding Speech Ambiguity
Using Enumeration
Ensuring Subject-Verb Agreement
Using Prepositions
Using Infinitives at the Beginning of Sentences
Using Imperatives
Observing Punctuation Rules
Using Main Clauses
Using Subordinate Clauses
Using Relative Clauses
Avoiding Multiple Stacking
Using Compound Words
Using Capitalization
Using Spelling Variations
Lexical Ambiguities
Disambiguation of Product Names and Menus
Avoiding Lexical Ambiguities
Using Compounds
Format and Typographical Issues
Segmentation
SYSTRAN Customization Methodology
for MUSA
Corpus generated fully automatically by two processes:
Speech Recognition (KU Leuven),
Automatic Sentence Compression (CNTS)
First priority
Subtitles Constraints
Second Priority
The least possible ambiguous content
Lesson learned: No prerequisite
SYSTRAN MT Customization Methodology
Upgraded Software Tools (Client Tools v5)
SYSTRAN Translation Project Manager
Terminology Review
Not Found Words Extraction
Reviewing Terminology and Sentences
The Terminology Review tab in the
Review window lets you identify
expressions such as Not Found
Words or Terminology extracted by
the software.
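The idea behind Not Found Words extraction can be sketched as follows. This is only an illustration, not SYSTRAN's implementation: the function name, the tokenization rule, and the sample word list are assumptions made for the example.

```python
import re

def not_found_words(sentences, known_words):
    """Return tokens whose lowercase form is absent from known_words.

    Hypothetical helper: results are ordered by frequency, then alphabetically.
    """
    counts = {}
    for sentence in sentences:
        # Assumed tokenization: alphabetic runs, optionally with one apostrophe part.
        for token in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence):
            if token.lower() not in known_words:
                counts[token] = counts.get(token, 0) + 1
    return sorted(counts, key=lambda w: (-counts[w], w))

# Toy stand-in for the words covered by the main dictionary (assumption).
known = {"the", "triple", "vaccine", "which", "protects", "them",
         "from", "measles", "mumps", "and", "rubella"}
corpus = ["MMR, the triple vaccine which protects them from measles, mumps and rubella."]
candidates = not_found_words(corpus, known)
print(candidates)
```

Candidates found this way would then be reviewed and, where appropriate, coded in a User Dictionary or a Do Not Translate list.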
SYSTRAN Translation Project Manager
Terminology Review
Not Found Words Extraction
Examples
SRC_Id
these parents know measles can be
dangerous, but they don't want their
child to have MMR, the triple vaccine
which protects them from measles,
mumps and rubella.
Raw MT
ces parents savent la rougeole peut
être dangereuse, mais ils ne veulent
pas que leur enfant a MMR, le vaccin
triple qui les protège contre la rougeole,
les oreillons et la rubéole.
SYSTRAN Translation Project Manager
Alternative Meanings
Alternative Meanings
shows alternative translations
based on different meanings of a
source word or expression.
The Alternative Meanings tab in the
Review window shows alternative
meanings for expressions in
SYSTRAN or User Dictionaries.
SYSTRAN Translation Project Manager
Alternative Meanings
Examples
SRC_Id
they'd rather pay for single vaccines at
60 pounds a shot, even though the
government insists MMR is safe.
Raw MT
ils payeraient plutôt les vaccins uniques
à 60 livres un coup de feu, quoique le
gouvernement exige que MMR est sûr.
Customized MT
ils payeraient plutôt les vaccins uniques
à 60 livres une injection, quoique le
gouvernement exige que MMR est sûr.
SYSTRAN Dictionary Manager
User Dictionaries (UDs)
User Dictionaries (UDs) let you
increase the quality of source
language analyses, which also
improves the
translation output for all associated
target languages. UDs can be used
for a number of functions, including:
Automatically translating words not
found in the SYSTRAN dictionaries
(Not Found Words).
Overriding the target-language
meaning of a word or expression in
the SYSTRAN dictionaries, a
capability that lets you customize
translation output to fit specific
needs.
Ensuring that an expression is
always treated as a unit by SYSTRAN
analysis programs.
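To illustrate how a dictionary entry can make a multiword expression behave as a unit, here is a minimal sketch using greedy longest-match lookup. It is a toy under stated assumptions, not the SYSTRAN analysis: the function name, the `max_len` parameter, and the two sample entries (taken from the MMR example in these slides) are hypothetical.

```python
def apply_user_dictionary(tokens, user_dict, max_len=4):
    """Replace the longest matching dictionary entry at each position.

    Tokens without an entry are passed through unchanged; a real system
    would hand them to the main translation engine instead.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate expression first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in user_dict:
                out.append(user_dict[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical entries for illustration only.
user_dict = {"autism": "autisme", "bowel disease": "maladie d'entrailles"}
result = apply_user_dictionary("a group with autism and bowel disease".split(), user_dict)
print(result)
```

Because the longest entry wins, "bowel disease" is looked up as one expression rather than as two separate words.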
SYSTRAN Dictionary Manager
User Dictionaries (UDs)
Metrics
Type of Dictionary        ENFR                   ENEL
Do Not Translate Words    3532 entries (enxx)
Proper Nouns              1495 entries (enfr)    1495 entries (enel)
MUSA Terminology          1443 entries (enfr)    5228 entries (enel)
SYSTRAN Dictionary Manager
User Dictionaries (UDs)
Examples
SRC_ID
Andrew Wakefield ignited the debate
over MMR by announcing the findings
of research into a group with autism
and bowel disease.
 Raw MT
Andrew Wakefield a enflammé la
discussion au-dessus de MMR en
annonçant les résultats de la recherche
dans un groupe avec la maladie
d'autism et d'entrailles.
 Customized MT
Andrew Wakefield a enflammé la
discussion au-dessus de MMR en
annonçant les résultats de la recherche
dans un groupe avec autisme et
maladie d'entrailles.
SYSTRAN Translation Project Manager
Source Analysis
Interactive Disambiguation
The Source Analysis tab in the
Review window shows how the
software handled source
ambiguities and allows you to
override the software selections.
SYSTRAN Translation Project Manager
Source Analysis
Interactive Disambiguation
Examples
ID 523
At first we thought it was parts of the
building but it was people, literally
people falling all around us.
Raw MT
D'abord nous avons pensé que ce
faisait partie du bâtiment mais c'était les
gens, peuplent littéralement la chute
tout autour de nous.
Customized MT
D’abord nous avons pensé que c’etait
des fragments du bâtiment, mais c’était
des gens, littéralement des gens qui
tombaient autour de nous.
SYSTRAN Dictionary Manager
Normalization Dictionaries (NDs)
There are two types of
Normalization Dictionaries
(NDs): source normalization and
target normalization.
Source normalization
normalizes the source document
before translation.
Target normalization adapts
translation output to user needs
in terms of terminology
consistency.
It can also provide a way to
replace expressions chosen by
the software’s translation engine
with user-defined expressions.
SYSTRAN Dictionary Manager
Normalization Dictionaries (NDs)
Examples
SRC_IDs
we did n't know she had measles but
we do.
I mean I ca n't help...
Raw MT
nous avons fait le n't savons qu'il a eu la
rougeole mais nous faisons.
Je veux dire l'aide de n't d'I ca…
Customized MT via SRC
Normalization
nous n'avons pas su qu'il a eu la
rougeole mais nous faisons.
Je veux dire que je ne peux pas aider
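The source normalization in the example above (detached "n't" tokens produced by the automatic transcription) can be sketched with a few rewrite rules applied before translation. This is a minimal illustration, not the SYSTRAN Normalization Dictionary format: the rule list and the function name are assumptions.

```python
import re

# Hypothetical source-normalization entries: (pattern, normalized form).
# Specific rules come first so that "ca n't" is not rewritten as "ca not".
SOURCE_RULES = [
    (re.compile(r"\bca n't\b"), "cannot"),
    (re.compile(r"\b(\w+) n't\b"), r"\1 not"),
]

def normalize_source(text):
    """Apply each rewrite rule in order to the source text."""
    for pattern, replacement in SOURCE_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize_source("we did n't know she had measles but we do."))
print(normalize_source("I mean I ca n't help..."))
```

The generic rule would mishandle irregular forms such as "wo n't"; each of those would need its own entry ahead of the generic rule.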
SYSTRAN Translation Project Manager
Sentence Review
for Translation Memory Construction
The Sentence Review tab in the
Review window compares
sentences in the source and
target.
You can then check the
sentences you want to send to
User Dictionaries, where you can
work with them further in order
to post-edit them and construct
Translation Memories.
SYSTRAN Dictionary Manager
Translation Memories (TMs)
Translation Memory (TM)
A set of translated and validated
sentences that can be integrated into
the translation process. Translation
Memories (TMs) are databases of
aligned pre-translated sentences.
Unlike Dictionaries, TM
entries can be formatted (for
example, italic or bold) and are used
by the translation engine to perform
matches on full sentences in the
source document. TMs are not
usually created manually, but are
built using
SYSTRAN’s Translation Project
Export or from TMX files.
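Full-sentence matching against a Translation Memory can be sketched as a lookup that runs before machine translation. This is a simplified illustration: the class, the case- and whitespace-insensitive key, and the `fallback` callable standing in for the MT engine are assumptions, not SYSTRAN behaviour.

```python
import re

def _key(sentence):
    """Lookup key that ignores case and runs of whitespace (assumed policy)."""
    return re.sub(r"\s+", " ", sentence).strip().lower()

class TranslationMemory:
    """Toy in-memory store of validated sentence pairs."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        self._entries[_key(source)] = target

    def translate(self, sentence, fallback):
        """Return the stored translation, or call fallback (the MT engine)."""
        hit = self._entries.get(_key(sentence))
        return hit if hit is not None else fallback(sentence)

tm = TranslationMemory()
tm.add("Now people kind of started panicking and said we've got to leave no matter what.",
       "Les gens maintenant avaient l’air de paniquer disant qu’ils devaient à tout prix partir.")

def fake_mt(sentence):
    # Placeholder for a call to the translation engine.
    return "[MT] " + sentence

print(tm.translate("Now people kind of started panicking and said we've got to leave no matter what.", fake_mt))
```

Only sentences that match a stored source sentence bypass the engine; everything else is still machine translated.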
SYSTRAN Dictionary Manager
Translation Memories (TMs)
Examples
ID 370
Now people kind of started
panicking and said we've got to
leave no matter what.
Raw MT
Maintenant sorte de personnes de
panique commencée et dite nous
avons pour laisser n'importe ce
que.
Customized MT
Les gens maintenant avaient l’air
de paniquer disant qu’ils devaient à
tout prix partir.
SYSTRAN Dictionary Manager
Translation Memories (TMs)
Translation Memory
Import/Export
Existing TMX (Translation Memory
eXchange) standard files
can be imported and exported
via the SYSTRAN Dictionary
Manager.
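For readers unfamiliar with TMX, the sketch below reads sentence pairs from a small TMX document using only the Python standard library. It shows the file format, not the SYSTRAN import feature; the function name and the sample document (built from the ID 370 example in these slides) are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(tmx_text, src_lang, tgt_lang):
    """Return (source, target) pairs for every <tu> that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Now people kind of started panicking and said we've got to leave no matter what.</seg></tuv>
      <tuv xml:lang="fr"><seg>Les gens maintenant avaient l’air de paniquer disant qu’ils devaient à tout prix partir.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx(SAMPLE, "en", "fr")
print(pairs)
```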