Creating a Term Base to Customize an MT System
Creating a Term Base to Customize an MT System:
Reusability of Resources and Tools from the Translator’s Point of View
Natalie Kübler
Intercultural Centre for Studies in Lexicology
Objectives
Introducing available resources, tools, and MT in translation training
Testing customisable MT as a timesaving tool for « industrial » translation
Using simple tools and immediately available resources to improve MT translation results
Translation training
Post-graduate students in language industry (LI) and specialised translation (ST):
Translation, linguistics, localisation, technical writing
Dreamweaver, Catalyst, HTML, XML, SQL, UNIX, translation memory, etc.
Semi-professional: every other week with a private company in translation or language industry
Corpus linguistics and applications to terminology and translation => project in ST (HOWTO) + LI (analysis + feedback to Systran)
Experiment
Translating some as yet untranslated Linux HOWTOs, using an MT system
Subdomain of computing
Highly specialised texts written by computer experts, not technical writers, for computer experts
Translated by French-speaking computer experts
+ Translating computing dictionary entries
Systranet
Systran’s on-line customisable service
Domain-specific dictionaries
User dictionaries:
Mono- or multitarget
« advanced » linguistic information
On-line source and target text alignment
Words not in any (Systran’s or user’s) dictionary
Words in the user’s dictionary
Resources + Tools
Headwords + equivalents + linguistic information
On-line technical bilingual glossaries
On-line term bases
Comparable and translation technical corpora
The Web as a corpus
Term extraction (Terminology Extractor)
Methodology
Step-one dictionary: extracting term candidates from text
Creating and coding the step-one dictionary
First translation using the dictionary
Step-two dictionary: changing and/or adding linguistic information using Systranet’s alignment and color features + linguistic analysis (feedback)
Repeating step two until the dictionary is saturated
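The slides do not define "saturated". One possible reading, sketched below, is that the dictionary is saturated once a newly processed HOWTO yields no term candidates that are missing from it. The function name, the sample term lists, and the placeholder value are all hypothetical and only illustrate the loop.

```python
def track_saturation(documents, dictionary):
    """Return, per document, how many candidate terms were not yet in the dictionary.

    Missing terms are added with a placeholder so that later documents
    only count terms that are genuinely new.
    """
    history = []
    for terms in documents:
        # dict.fromkeys removes duplicates while keeping the original order.
        missing = [t for t in dict.fromkeys(terms) if t not in dictionary]
        for term in missing:
            dictionary[term] = None  # equivalent still to be coded by the translator
        history.append(len(missing))
    return history


# Hypothetical term candidates extracted from three successive HOWTOs.
howtos = [
    ["gateway", "kernel", "daemon"],
    ["kernel", "daemon", "buffer"],
    ["gateway", "buffer"],
]
user_dictionary = {}
history = track_saturation(howtos, user_dictionary)
saturated = history[-1] == 0  # assumed stopping criterion
```

Under this assumption, a falling count of new terms per document is the signal that further rounds of step two bring little benefit.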
Web-based HOWTO glossary
Several French equivalents
boot, root disk = disquettes (d') amorce ou de démarrage, racine
browser = butineur, navigateur, arpenteur
buffer = tampon
to build = bâtir
currently = actuellement
feedback = comment contacter l'auteur, retour d'information
A.D.S.L. (noun) = raccordement numérique asymétrique
Step 1:Terminology Extractor
French and English dictionaries
Morphological analysis
Stop words
Collocations: sequences of 2 to 10 words repeated at least once
Non-words
Concordances
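The collocation criterion above can be approximated in a few lines. This is a minimal sketch, not Terminology Extractor's actual algorithm: it assumes that "repeated at least once" means a sequence occurring at least twice, it uses a tiny made-up stop-word list, and it discards sequences that begin or end with a stop word.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the tool's own list).
STOP_WORDS = {"the", "a", "an", "of", "on", "to", "is", "and", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def repeated_sequences(text, min_len=2, max_len=10, min_count=2):
    """Count word sequences of min_len..max_len tokens seen at least min_count times.

    Sequences that start or end with a stop word are skipped.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}


sample = (
    "Configure the DHCP server on the Linux box. "
    "The DHCP server assigns an IP address. "
    "Each Linux box needs an IP address."
)
candidates = repeated_sequences(sample)
```

On this sample the candidates are "dhcp server", "linux box" and "ip address", each seen twice. The sketch does no morphological analysis, so singular and plural forms are counted separately.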
TE non-words
Debian
Permedia
RedHat
RgbPath
ServerFlags
ServerLayour
XkbLayout
Solaris
UI
USB
WindowMaker
Netscape
Dennis
Dialogs
FAQs
Howto
README
XkbModel
ISA
KDE
LeftOf
ModulePath
accellerate
XFCE
Corel
anoying
Microdoft
Linux
RealAudio
degredation
GUI
IRQs
NFS
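A list like the one above can be produced by checking every token against a word list. The sketch below assumes a plain lowercase lexicon and a made-up sample sentence; Terminology Extractor's own dictionaries and morphological analysis are not reproduced here.

```python
import re


def find_non_words(text, lexicon):
    """Return tokens absent from the lexicon, in order of first appearance."""
    non_words = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() not in lexicon and token not in non_words:
            non_words.append(token)
    return non_words


# Hypothetical mini-lexicon standing in for a full English dictionary.
LEXICON = {
    "the", "driver", "will", "on", "and", "but", "may",
    "cause", "some", "of", "performance",
}
text = (
    "The driver will accellerate on RedHat and Debian, "
    "but may cause some degredation of performance."
)
non_words = find_non_words(text, LEXICON)
```

As in the slide, the result mixes product names (RedHat, Debian) with misspellings (accellerate, degredation), so the translator still has to sort the list by hand before coding dictionary entries.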
TE collocations
Internet Gateway 3
{ Looking look } at the Network 3
IP aliasing 3
name server 4
ISA { card cards } 3
Network { Device devices } 4
latest version 3
Linux computer 3
DHCP Server 15
IP { addresses address } 16
Linux gateway 3
Linux box 16
modules file 3
card on the Linux box 4
scripts / ifcfg 3
DNS { Server servers } 17
server will start 3
interface configuration file 3
{ Network networking } { Card Cards } 12
« Le grand dictionnaire terminologique »
Looking for French equivalents
ENGLISH: buffer
Syn.: buffer storage, buffer memory, intermediate memory
FRENCH: mémoire tampon n. f.
Syn.: tampon n. m., mémoire intermédiaire n. f., zone tampon n. f.
HOWTO translation corpus
English source – French translation
WALL: Web-based environment
Concordances with perl-like regexp
Paragraph alignment
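A regexp concordance over aligned text can be sketched as follows. This is not the WALL implementation: the function name and data layout are assumptions, and the two aligned pairs are sentences quoted from the HOWTO corpus in these slides.

```python
import re

# (English source, French translation) pairs, aligned one to one.
ALIGNED = [
    ("All the Digital cards will autoprobe for their media",
     "Toutes les cartes Digital effectueront la détection automatique du média"),
    ("Called by the kernel when the card posts an interrupt.",
     "Appelé par le noyau quand la carte déclenche une interruption"),
]


def concordance(pairs, pattern, width=20):
    """Return (left context, match, right context, aligned target) for each hit."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for source, target in pairs:
        for m in rx.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            hits.append((left, m.group(0), right, target))
    return hits


# Perl-like pattern: "card" or "cards" as a whole word.
hits = concordance(ALIGNED, r"\bcards?\b")
for left, keyword, right, target in hits:
    print(f"{left:>20} [{keyword}] {right:<20} | {target}")
```

Showing the aligned French segment next to each hit is what lets the translator read off candidate equivalents, here "cartes" and "carte".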
French equivalents
lexicogrammatical information
semantic classes
« statistical » information in the domain
HOWTOs: equivalents
The daemon […] listens to all messages on each network device
Le démon […] écoute tous les messages sur chacun des périphériques réseau
All the Digital cards will autoprobe for their media
Toutes les cartes Digital effectueront la détection automatique du média
The latest source distribution can be FTPed from the directory ftp…or Mosaiced from http…
On peut charger la dernière version sur ftp…et sous Mosaic depuis http…
Called by the kernel when the card posts an interrupt.
Appelé par le noyau quand la carte déclenche une interruption
HOWTOs « semantic classes »
can I run 32-bit video games under dosemu
used to run Linux on a 386/16 MHz (
unless you want your modem to answer the phone
The static SLIP server will answer your modem call
WebCorp
The Web as a corpus
Concordances: buffer, run* * * on
Updated information
More elements
buffer
me des débordements de buffer (tampon
en français). Pour
com/advisories/bufero.html . Writing buffer
overflow exploits – a tutorial for
de NOP . débordement de buffer dans le
tas (heap buffer overflow)
(buffer overflow) . débordement de buffer
sous windows (et oui ;-)) --[
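A wildcard query such as `run* * * on` can be turned into a regular expression for use on any local text. The sketch assumes that `*` attached to a word matches any word ending and that a standalone `*` stands for exactly one token; WebCorp's own query semantics may differ in detail.

```python
import re


def wildcard_to_regex(query):
    """Compile a space-separated wildcard query into a case-insensitive regex."""
    parts = []
    for item in query.split():
        if item == "*":
            parts.append(r"\S+")  # assumed: one arbitrary token
        else:
            # Escape the literal text, then let '*' match any word characters.
            parts.append(re.escape(item).replace(r"\*", r"\w*"))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


rx = wildcard_to_regex("run* * * on")
hit = rx.search("We are running Red Hat on an old 386.")
miss = rx.search("used to run Linux on a 386")
```

The first sentence matches with "running Red Hat on". The second does not, because only one word stands between "run" and "on".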
Customized dictionary
« Advanced » linguistic information, such as:
Part-of-speech information
noun, proper noun (product name, country, etc.), verb, adjective, sentence
Morphological information
URL (noun) (plural:URLs) / cache (noun)(masculine)
Lexicogrammatical information
access (verb)(noprep)=accéder (verb)(prep:à)
Basic semantic information
to run (verb)(context:OS)
Unix (noun) (SEMCAT:OS)
Idioms
Your mileage may vary (sentence)
Dictionary Sample
"AT&T" (company name)
auto-dial (noun)=numérotation automatique (noun)
automatic number identification (noun)=identification de l'appelant (noun)
based (adjective)(noprep)=architecturé (adjective)(prep:autour)
basic language constructs (noun) (plural)=base de construction du langage (noun) (singular)
to log in (verb)=se loger (verb)
to introduce (verb) (context:extensions)=introduire
to carry (verb)(context:digital data)=transmettre (verb)
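Entries written in the notation shown above can be read into a structured form for checking or merging. The parser below is a sketch based only on the sample entries in these slides. It is not a specification of Systranet's dictionary format, and the field names are hypothetical.

```python
import re

# A feature is anything between parentheses, e.g. (verb) or (prep:à).
FEATURE = re.compile(r"\(([^)]*)\)")


def parse_side(side):
    """Split one side of an entry into its term and its list of features."""
    features = FEATURE.findall(side)
    term = FEATURE.sub("", side).strip().strip('"')
    return {"term": term, "features": features}


def parse_entry(line):
    """Parse 'source (feat)...=target (feat)...'; the target part is optional."""
    source, _, target = line.partition("=")
    return {
        "source": parse_side(source),
        "target": parse_side(target) if target else None,
    }


entry = parse_entry("access (verb)(noprep)=accéder (verb)(prep:à)")
company = parse_entry('"AT&T" (company name)')
```

The first entry yields the term "access" with features `verb` and `noprep`, mapped to "accéder" with `verb` and `prep:à`. The second has no target, which suits a name that should be left untranslated. Terms that themselves contain `=` or parentheses are not handled.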
With Step-one dictionary
This page contains a simple cookbook for
setting up Red Hat 6.X as an internet
gateway for a home network or small office
network.
Cette page contient un cookbook simple pour le
chapeau rouge 6.X d'établissement en tant
que Gateway d'Internet pour un réseau à la
maison ou le petit réseau de bureau.
With Step-two dictionary
This page contains a simple cookbook for
setting up Red Hat 6.X as an internet
gateway for a home network or small
office network.
Cette page contient des recettes simples
pour installer Red Hat 6.X en tant que
passerelle Internet pour un réseau
domestique ou un petit réseau de
bureau.
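The difference between the two outputs comes down to which entries win at lookup time. The toy lookup below illustrates the idea with a greedy longest match in which user entries override general ones. It is not Systran's engine: it does no reordering or agreement, and both mini-dictionaries are hypothetical.

```python
def translate_terms(tokens, general, user):
    """Replace tokens by dictionary equivalents, preferring the longest match.

    Keys are tuples of lowercase words; user entries override general ones.
    """
    merged = {**general, **user}
    max_len = max(len(key) for key in merged)
    output, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in merged:
                output.append(merged[key])
                i += n
                break
        else:
            output.append(tokens[i])  # no entry: keep the source token
            i += 1
    return output


GENERAL = {("red",): "rouge", ("hat",): "chapeau", ("gateway",): "Gateway"}
USER = {("red", "hat"): "Red Hat", ("gateway",): "passerelle"}

tokens = "Red Hat 6.X as a gateway".split()
without_user = translate_terms(tokens, GENERAL, {})
with_user = translate_terms(tokens, GENERAL, USER)
```

Without the user dictionary, "Red Hat" is translated word by word. With it, the two-word entry is matched first and the product name is preserved, while "gateway" receives the domain equivalent.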
Error typology
Morphosyntax: subject-verb or noun-adjective agreement
Syntax:
POS ambiguity
NP: determiners, NP coordination
transformations/ellipsis/cleft sentences/PP attachment
Metacharacters
« Bugs »
Error examples (1)
I am not going
*je n'vais pas => je ne vais pas
the phase of the light through it
*la phase du dépassement léger par lui => la phase de la lumière qui les traverse.
decoded by specific individuals.
*décodée par les individus spécifiques. => décodée par des individus spécifiques.
A cable or ADSL connection
*un câble ou une connexion d’AADSL => Une connexion par câble ou ADSL
Error examples (2)
When a user picks or is assigned a
password, it is encoded with a
randomly generated value called the
salt.
=> *Quand un utilisateur sélectionne ou
est généré un mot de passe, il est
codé avec une valeur aléatoirement
produite appelée le sel.
Conclusion
Translation results can be significantly improved by creating customised dictionaries
The tools mentioned here are user-friendly
However, this requires a lot of work at the beginning, and translators need training in linguistics and basic NLP
Change of attitude towards MT and the various tools, especially among students in the language-industry-oriented option
More things to be done
Merging all dictionaries together into a « Systranet term base »
Translating more HOWTOs
Project with Systran: improve user coding
…