Creating a Term Base to Customize an MT System


Creating a Term Base to Customize an MT System:
Reusability of Resources and Tools from the Translator’s Point of View
Natalie Kübler
Intercultural Centre for Studies in Lexicology
Objectives
• Introducing available resources, tools, and MT in translation training
• Testing customisable MT as a time-saving tool for « industrial » translation
• Using simple tools and immediately available resources to improve MT translation results
Translation training
Post-graduate students in language industry (LI) and specialised translation (ST):
Translation, linguistics, localisation, technical writing
Dreamweaver, Catalyst, HTML, XML, SQL, UNIX, translation memory, etc.
Semi-professional: every other week with a private company in translation or language industry
Corpus linguistics and applications to terminology and translation => project in ST (HOWTO) + LI (analysis + feedback to Systran)
Experiment
Translating some as yet untranslated Linux HOWTOs, using an MT system
• Subdomain of computing
• Highly specialised texts
• Written by computer experts, not technical writers, for computer experts
• Translated by French-speaking computer experts
+ Translating computing dictionary entries
Systranet
Systran’s on-line customisable service
• Domain-specific dictionaries
• User dictionaries:
  - Mono- or multitarget
  - « Advanced » linguistic information
• On-line source and target text alignment
  - Words not in any (Systran’s or user’s) dictionary
  - Words in the user’s dictionary
Resources + Tools
• Headwords + equivalents + linguistic information
• On-line technical bilingual glossaries
• On-line term bases
• Comparable and translation technical corpora
• The Web as a corpus
• Term extraction (Terminology Extractor)
Methodology
• Step-one dictionary: extracting term candidates from text
• Creating and coding the step-one dictionary
• First translation using the dictionary
• Step-two dictionary: changing and/or adding linguistic information using Systranet’s alignment and colour features + linguistic analysis (feedback)
• Step two is repeated until the dictionary is saturated
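The repeat-until-saturated loop can be sketched in Python. This is only an illustration under stated assumptions, not Systranet's interface: "saturated" is taken to mean that a pass over the source text flags no word missing from the dictionary, and `review` is a hypothetical stand-in for the translator deciding which entries to add in each round.

```python
import re

def unknown_words(text, dictionary):
    """Return the source words (lowercased, sorted) that have no dictionary entry."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    return sorted({w.lower() for w in words} - set(dictionary))

def refine_until_saturated(text, dictionary, review, max_rounds=10):
    """Repeat step two until a pass flags nothing new or nothing gets added.

    `review` (hypothetical) receives the flagged words and returns a dict of
    new entries; it may cover only part of the list in each round.
    Returns the refined dictionary and the number of rounds that added entries.
    """
    dictionary = dict(dictionary)  # work on a copy
    for completed in range(max_rounds):
        missing = unknown_words(text, dictionary)
        new_entries = review(missing) if missing else {}
        if not new_entries:
            return dictionary, completed
        dictionary.update(new_entries)
    return dictionary, max_rounds
```

In practice each round would also include retranslating the text and checking the output, which this sketch leaves out.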
Web-based HOWTO glossary
Several French equivalents
• boot, root disk = disquettes (d') amorce ou de démarrage, racine
• browser = butineur, navigateur, arpenteur
• buffer = tampon
• to build = bâtir
• currently = actuellement
• feedback = comment contacter l'auteur, retour d'information
• A.D.S.L. (noun) = raccordement numérique asymétrique
Step 1: Terminology Extractor
• French and English dictionaries
• Morphological analysis
• Stop words
• Collocations: sequences of 2 to 10 words repeated at least once
• Non-words
• Concordances
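The collocation criterion above (sequences of 2 to 10 words that are repeated at least once, i.e. occur at least twice) can be approximated with a short Python sketch. It does not reproduce Terminology Extractor's actual algorithm: the stop-word list is illustrative, and the rule that a candidate may not start or end with a stop word is an assumption.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption, not Terminology Extractor's own).
STOP_WORDS = {"the", "a", "an", "of", "on", "at", "to", "is", "will"}

def repeated_sequences(text, min_len=2, max_len=10, stop_words=STOP_WORDS):
    """Count word sequences of min_len..max_len words that occur at least twice."""
    counts = Counter()
    # Handle each sentence-like chunk separately so sequences do not cross boundaries.
    for chunk in re.split(r"[.!?;:\n]+", text.lower()):
        tokens = re.findall(r"[a-z0-9]+", chunk)
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Assumption: candidates may not start or end with a stop word.
                if gram[0] in stop_words or gram[-1] in stop_words:
                    continue
                counts[" ".join(gram)] += 1
    return {seq: c for seq, c in counts.items() if c >= 2}
```

Unlike the tool's output, this sketch does not group morphological variants such as { Server servers }; that would need the morphological analysis listed above.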
TE non-words
Debian
Permedia
RedHat
RgbPath
ServerFlags
ServerLayour
XkbLayout
Solaris
UI
USB
WindowMaker
Netscape
Dennis
Dialogs
FAQs
Howto
README
XkbModel
ISA
KDE
LeftOf
ModulePath
accellerate
XFCE
Corel
anoying
Microdoft
Linux
RealAudio
degredation
GUI
IRQs
NFS
TE collocations
Internet Gateway 3
{ Looking look } at the Network 3
IP aliasing 3
name server 4
ISA { card cards } 3
Network { Device devices } 4
latest version 3
Linux computer 3
DHCP Server 15
IP { addresses address } 16
Linux gateway 3
Linux box 16
modules file 3
card on the Linux box 4
scripts / ifcfg 3
DNS { Server servers } 17
server will start 3
interface configuration file 3
{ Network networking } { Card Cards } 12
« Le grand dictionnaire terminologique »
Looking for French equivalents
ENGLISH: buffer
Syn.: buffer storage, buffer memory, intermediate memory
FRENCH: mémoire tampon n. f.
Syn.: tampon n. m., mémoire intermédiaire n. f., zone tampon n. f.
HOWTO translation corpus
English source – French translation
• WALL: Web-based environment
  - Concordances with Perl-like regexps
  - Paragraph alignment
• French equivalents
• Lexicogrammatical information
• Semantic classes
• « Statistical » information in the domain
HOWTOs: equivalents
The daemon […] listens to all messages on each network device
Le démon […] écoute tous les messages sur chacun des périphériques réseau
All the Digital cards will autoprobe for their media
Toutes les cartes Digital effectueront la détection automatique du média
The latest source distribution can be FTPed from the directory ftp…or Mosaiced from http…
On peut charger la dernière version sur ftp…et sous Mosaic depuis http…
Called by the kernel when the card posts an interrupt.
Appelé par le noyau quand la carte déclenche une interruption
HOWTOs « semantic classes »
can I run 32-bit video games under dosemu
used to run Linux on a 386/16 MHz (
unless you want your modem to answer the phone
The static SLIP server will answer your modem call
WebCorp
• The Web as a corpus
• Concordances: buffer, run* * * on
• Updated information
• More elements
buffer
• me des débordements de buffer (tampon en français). Pour
• com/advisories/bufero.html . Writing buffer overflow exploits – a tutorial for
• de NOP . débordement de buffer dans le tas (heap buffer overflow)
• (buffer overflow) . débordement de buffer sous windows (et oui ;-)) --[
Customized dictionary
« Advanced » linguistic information, such as:
Part-of-speech information
noun, proper noun (product name, country, etc.), verb, adjective, sentence
Morphological information
URL (noun) (plural:URLs) / cache (noun)(masculine)
Lexicogrammatical information
access (verb)(noprep)=accéder (verb)(prep:à)
Basic semantic information
to run (verb)(context:OS)
Unix (noun) (SEMCAT:OS)
Idioms
Your mileage may vary (sentence)
Dictionary Sample
"AT&T" (company name)
auto-dial (noun)=numérotation automatique (noun)
automatic number identification (noun)=identification de l'appelant (noun)
based (adjective)(noprep)=architecturé (adjective)(prep:autour)
basic language constructs (noun) (plural)=base de construction du langage (noun) (singular)
to log in (verb)=se loger (verb)
to introduce (verb) (context:extensions)=introduire
to carry (verb)(context:digital data)=transmettre (verb)
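The entry notation in this sample (a headword followed by parenthesised codes, then `=` and the equivalent with its own codes) is regular enough to be read by a small parser. The Python sketch below only illustrates the notation as it appears on these slides; it is not Systranet's import format, and the `Side` class and function names are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Side:
    """One side of an entry: the term and its parenthesised codes."""
    text: str
    tags: list = field(default_factory=list)

def parse_side(raw):
    """Split 'access (verb)(noprep)' into the term and its list of codes."""
    raw = raw.strip()
    tags = []
    # Peel parenthesised codes off the end of the string, right to left.
    while True:
        m = re.search(r"\s*\(([^()]*)\)$", raw)
        if not m:
            break
        tags.insert(0, m.group(1).strip())
        raw = raw[:m.start()]
    return Side(raw.strip().strip('"'), tags)

def parse_entry(line):
    """Return (source, target); target is None for monolingual entries."""
    source, sep, target = line.partition("=")
    return parse_side(source), (parse_side(target) if sep else None)
```

For instance, `parse_entry("access (verb)(noprep)=accéder (verb)(prep:à)")` yields the source term `access` with codes `verb` and `noprep`, and the target term `accéder` with codes `verb` and `prep:à`.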
With Step-one dictionary
This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.
Cette page contient un cookbook simple pour le chapeau rouge 6.X d'établissement en tant que Gateway d'Internet pour un réseau à la maison ou le petit réseau de bureau.
With Step-two dictionary
This page contains a simple cookbook for setting up Red Hat 6.X as an internet gateway for a home network or small office network.
Cette page contient des recettes simples pour installer Red Hat 6.X en tant que passerelle Internet pour un réseau domestique ou un petit réseau de bureau.
Error typology
• Morphosyntax: subject-verb or noun-adjective agreement
• Syntax:
  - POS ambiguity
  - NP: determiners, NP coordination
  - Transformations/ellipsis/cleft sentences/PP attachment
• Metacharacters
• « Bugs »
Error examples (1)
• I am not going
  *je n'vais pas => je ne vais pas
• the phase of the light through it
  *la phase du dépassement léger par lui => la phase de la lumière qui les traverse.
• decoded by specific individuals.
  *décodée par les individus spécifiques. => décodée par des individus spécifiques.
• A cable or ADSL connection
  *un câble ou une connexion d’AADSL => Une connexion par câble ou ADSL
When a user picks or is assigned a password, it is encoded with a randomly generated value called the salt.
=> *Quand un utilisateur sélectionne ou est généré un mot de passe, il est codé avec une valeur aléatoirement produite appelée le sel.
Conclusion
• Translation results can be significantly improved by creating customised dictionaries
• The tools mentioned here are user-friendly
• However, this requires a great deal of work at the beginning, and translators must be trained in linguistics and basic NLP
• Change of attitude towards MT and various tools, especially in the language-industry-oriented option
More things to be done
• Merging all dictionaries together into a « Systranet term base »
• Translating more HOWTOs
• Project with Systran: improve user coding
…