Checking Terminology Consistency with Statistical Methods


Checking Terminology Consistency with Statistical Methods
LRC XIII
Alfredo Maldonado Guerra
Microsoft European Development Centre
2nd October 2008
Masaki Itagaki
Microsoft Corporation
About this presentation
Introduction
Internal Consistency Check
Step 1: Mine Source Terms
Step 2: Identify translations of Source Terms (Alignment)
Step 3: Consistency Check
Current Challenges
Tips
Future Improvements
Introduction
Terminology Consistency: A key element of localised
language quality
Terminology Consistency: Difficult to maintain
Hard to keep source and target in synch during the dev/loc process
Translation done by several people (often working remotely)
Terminology changes (e.g. between product versions)
Manual Language Quality Assurance (QA) can help,
however
QA costs time and money
QA usually concentrates on a sample of the text
Reviewer must be familiar with reference material
It’s hard for humans to keep track of terminology
Introduction
Can we use technology to control consistency?
Yes, but…
Existing tools require term lists or term bases
Not all software companies have term bases set up
Companies that do have term bases won’t have every
single term captured – building a term base is always a
work in progress
Introduction
Our Approach doesn’t require a term base
By using Term Mining technology we identify terms
on the source strings
We then check the translation consistency of the
terminology mined
Internal Consistency Check
1. Mine Source Terms
2. Align Translations
3. Consistency Check

Example (Source String → Target String):
"Resets all the input fields in the containing form to their original values." → "Réattribue leurs valeurs initiales à tous les champs d'entrée du formulaire conteneur."
"A single line break" → "Saut de ligne unique"
"Groups a collection of input fields in the containing form." → "Regroupe une série de champs d'entrée de la forme conteneuse." ← Inconsistency! ("containing form" is rendered as "forme conteneuse" here but as "formulaire conteneur" elsewhere)
"Web Part Page" → "Page de composants WebPart"
"Submits all input from the containing form to the server for processing." → "Soumet toutes les entrées à partir du formulaire conteneur vers le serveur en vue du traitement."
Step 1: Source Term Mining
Bigram and Trigram extraction
Noun phrases of the form
Noun + Noun
Noun + Noun + Noun
Verb phrases are discarded: only 5% of terms are verb phrases
Adjective phrases are discarded: only 2% of terms
Monogram (single-word) nouns are discarded: most are common
words, and only 27% of terms are monograms
In the future we might cover Adj + Noun forms
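The mining step above can be sketched as follows. A real pipeline would use a part-of-speech tagger to find Noun+Noun and Noun+Noun+Noun phrases; the toy `NOUNS` lexicon below is an assumption standing in for one, not the authors' implementation.

```python
import re

# Toy noun lexicon standing in for a real POS tagger (assumption).
NOUNS = {"input", "field", "fields", "web", "part", "page", "error", "code"}

def mine_terms(source_strings):
    """Keep only bigrams and trigrams whose words are all nouns."""
    terms = set()
    for s in source_strings:
        tokens = re.findall(r"[a-z]+", s.lower())
        for n in (2, 3):                              # bigrams and trigrams
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if all(t in NOUNS for t in gram):     # Noun+Noun(+Noun)
                    terms.add(" ".join(gram))
    return sorted(terms)

print(mine_terms(["Resets all the input fields in the containing form.",
                  "Web Part Page"]))
# → ['input fields', 'part page', 'web part', 'web part page']
```

Note that the naive all-noun filter also emits sub-terms such as "part page"; a real miner would additionally rank candidates by frequency.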
Step 2: Translation Alignment
Problem statement:
Given a mined source term S, identify the corresponding
target term T in the translation column.
Example:
Mined term: "input field" (S) → "champ d'entrée" (T)
Source String → Target String:
"Resets all the input fields in the containing form to their original values." → "Réattribue leurs valeurs initiales à tous les champs d'entrée du formulaire conteneur."
"Groups a collection of input fields in the containing form." → "Regroupe une série de champs d'entrée du forme conteneur."
Step 2: Translation Alignment
We need to consider all possible term combinations in the target string. For
"Réattribue leurs valeurs initiales à tous les champs d'entrée."
these include:
Réattribue leurs / leurs valeurs / valeurs initiales / initiales à / … /
Réattribue leurs valeurs / leurs valeurs initiales / …
We call each combination an NGram.
NGrams: where N = 2, 3, 4, maybe 5.
For languages like German we even consider N = 1.
How do we decide which NGram is the correct
translation for the term?
Bayesian statistics can help!
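The NGram generation described above can be sketched as follows (a minimal illustration, not the shipped code):

```python
def ngrams(text, n_values=(2, 3, 4, 5)):
    """All contiguous word NGrams of the given sizes.
    N=1 could be added for compounding languages like German."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in n_values
            for i in range(len(tokens) - n + 1)]

candidates = ngrams("Réattribue leurs valeurs initiales à tous les champs")
# candidates include "leurs valeurs", "valeurs initiales",
# "Réattribue leurs valeurs", ... — each is a possible target term
```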
Step 2: Translation Alignment
Problem statement:
Given a source term S, obtain the NGram T that maximises
the conditional probability:
  T* = argmax_T P(T | S)   [1]
But how do we calculate this?!
Step 2: Translation Alignment
Well, the multiplication rule of conditional probability tells us that
  P(T | S) = P(S | T) · P(T) / P(S)
So [1] becomes:
  T* = argmax_T P(S | T) · P(T) / P(S)   [2]
These probabilities are estimated from two counts:
|NGrams| is the number of NGrams of the same N as T. For example,
if T is a two-word term (a bigram), |NGrams| is the number of
NGrams made up of 2 words.
|STSeg| is the number of segments (strings) that contain both S in
the source column and T in the target column.
Step 2: Translation Alignment
In our Best Target Term Selection Routine we compare the
probabilities of different target terms (Tk's). Since P(S) remains
constant during these comparisons, we can eliminate it.
We call the resulting quantity I(Tk):
  I(Tk) = P(S | Tk) · P(Tk)   [3]
The candidate Tk with the highest I is our Best Target Term Candidate.
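A count-based sketch of the selection routine. The slides define the counts |NGrams| and |STSeg| but not the exact estimators, so the estimates P(Tk) ≈ freq(Tk)/|NGrams of the same N| and P(S|Tk) ≈ |STSeg|/|segments whose target contains Tk| are assumptions:

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def best_candidate(source_term, segments, n_values=(2, 3)):
    """Pick the target NGram Tk maximising I(Tk) = P(S|Tk) * P(Tk)."""
    total_by_n = Counter()   # NGrams generated per length N
    freq = Counter()         # occurrences of each target NGram
    tseg = Counter()         # segments whose target contains Tk
    stseg = Counter()        # segments with S in source and Tk in target
    for src, tgt in segments:
        toks = tgt.split()
        for n in n_values:
            grams = ngrams(toks, n)
            total_by_n[n] += len(grams)
            freq.update(grams)
            for g in set(grams):
                tseg[g] += 1
                if source_term in src.lower():
                    stseg[g] += 1
    score = {g: (stseg[g] / tseg[g]) * (f / total_by_n[len(g.split())])
             for g, f in freq.items()}
    return max(score, key=score.get)

segs = [
    ("Resets all the input fields",
     "Réattribue les champs d'entrée du formulaire"),
    ("Groups a collection of input fields",
     "Regroupe des champs d'entrée de la forme"),
    ("A single line break", "Saut de ligne unique"),
]
print(best_candidate("input field", segs))   # → champs d'entrée
```

"champs d'entrée" wins because it co-occurs with "input field" in every segment and is the most frequent such NGram of its length.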
Step 2: Translation Alignment
Normalisation
Depending on context any particular term can be
translated in a slightly different way.
For example, "file name" could be translated into Spanish as:
nombre de archivo
nombre del archivo
nombres de archivo
nombres de archivos
nombres de los archivos
Our algorithm has to be clever enough to realise that
“nombres de archivo” is just a form of “nombre de archivo”.
Step 2: Translation Alignment
Normalisation
So, during NGram generation, we need to generate
regular expressions for our terms
Since Asian languages do not inflect, regular expressions
are simpler for these languages
Source term: error code
Target term (Japanese): エラー コード
Regular expression: \bエラー\s?コード\b
Matches (admitted translations): エラー コード

For European languages we use more complex regular expressions:

Source term: error code
Target term (Italian): codice errore
Regular expression: \bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b
Matches (admitted translations): codice d'errore, codice di errore, codice errore, codici di errore
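The regular-expression generation could be sketched like this. The stem length and the "up to two short function words" allowance mirror the Italian example above, but the exact parameters are assumptions, not the shipped patterns:

```python
import re

def term_regex(term, latin=True):
    """Build a loose pattern admitting inflected forms of a target term.

    For European languages: keep a stem of each word, allow a short
    inflected ending, and let up to two short connecting words
    ("di", "d'", "de" ...) sit between stems. For non-inflecting
    Asian languages only the space is optional.
    """
    words = term.split()
    if not latin:
        return r"\b" + r"\s?".join(map(re.escape, words)) + r"\b"
    stems = [re.escape(w[:4]) + r"\w{0,3}" for w in words]
    glue = r"(?:\s\w{1,4}'?){0,2}\s?"
    return r"\b" + glue.join(stems) + r"\b"

pattern = re.compile(term_regex("codice errore"), re.IGNORECASE)
for s in ("codice errore", "codice di errore",
          "codice d'errore", "codici di errore"):
    assert pattern.search(s)
```

The generated Italian pattern is essentially the `\bcodi\w{0,3}…erro\w{0,3}\b` shape shown in the table above.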
Step 3: Consistency Check
Detect the strings that do not use any of our
admitted translations
Report these strings along with our findings to the
user
Source term → Admitted translations:
event viewer → visualizzatore eventi
local computer → computer locale

Source String → Target String:
"At least one service or driver failed during system startup. Use Event Viewer to examine the event log for details." → "Impossibile avviare uno o più servizi o driver. Controllare il registro eventi per ulteriori informazioni."
"The local computer has a previous version" → "Sul computer remoto è presente una versione precedente"
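Step 3 then reduces to a scan over the segments. A minimal sketch, assuming the admitted translations are held as compiled regular expressions (the function and data names are illustrative):

```python
import re

def check_consistency(admitted, segments):
    """Flag (source, target) pairs where a mined term appears in the
    source but no admitted translation matches the target string."""
    flagged = []
    for src, tgt in segments:
        for term, pattern in admitted.items():
            if term in src.lower() and not pattern.search(tgt):
                flagged.append((term, src, tgt))
    return flagged

admitted = {
    "event viewer": re.compile(r"\bvisualizzatore\s?eventi\b", re.IGNORECASE),
    "local computer": re.compile(r"\bcomputer locale\b", re.IGNORECASE),
}
segs = [
    ("The local computer has a previous version",
     "Sul computer remoto è presente una versione precedente"),
]
print(check_consistency(admitted, segs))
# flags ("local computer", ...) — the target says "computer remoto"
```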
Current Challenges
False Positives
Due to “heavy” rephrasing
Source term: client → Target term: client
Source String: "Because of a security error, the client could not connect to the remote computer."
Inconsistent Target String (a false positive: the rephrased Italian drops the word "client" entirely): "Impossibile connettersi al computer remoto. Errore di protezione."
Unreliable for short, generic monograms
Source term: data
Admitted translations found (Italian): d, d3d, da, dac, dai, dal, dall, data, dati, dato, dc, ddc, dei, del, dell, deny, der, deve, dfs, dhcp, di, dir, disk, dll, dma, dns, dopo, dos, dove, dpc, dsis, dtr, due, dvd, dwm
Current Challenges
Verbs can potentially cause problems
Due to high inflection:
amar → amo, amas, ama, amamos, amáis, aman
venir → vengo, vienes, viene, venimos, venís, vienen
Difficult to differentiate from other parts of speech
Source term: download → Admitted translations (Spanish): descarga, descargar, descargó, descargue
Source term: install → Admitted translations (Italian): install, installa, installare, installata, installati, installato, installer
Not all languages supported:
Arabic
Complex Script languages
Current Challenges
Best Candidate Selection logic is very good, but it's not
perfect: about 70% of term selections are correct.
Correct selections (the correct term, highlighted on the slide, is marked with *):

Source term: data context
  contexte de données *          I = 8.50E-05
  contexte de                    I = 6.01E-06
  de données                     I = 1.68E-06

Source term: reference type
  type référence *               I = 6.41E-05
  un type référence              I = 5.87E-06
  un type                        I = 1.12E-06

Source term: function evaluation
  évaluation de la fonction *    I = 7.74E-05
  évaluation de la               I = 3.87E-05
  évaluation de                  I = 3.74E-05
  de la fonction                 I = 2.88E-05
  la fonction                    I = 9.46E-06
  de la                          I = 1.70E-07

Incorrect selections:

Source term: name key
  nom fort                       I = 5.29E-06
  de nom fort                    I = 1.08E-06
  de nom                         I = 1.15E-07
  clé de nom *                   I = (tiny number)

Source term: type argument
  argument de                    I = 3.41E-05
  argument de type *             I = 8.33E-06
  de type                        I = 1.00E-06

Source term: com reference
  la référence                   I = 2.49E-05
  référence com *                I = 7.44E-06
  la référence com               I = 4.13E-06
Tips
Make sure your data is reasonably clean.
Source term: font → Target term: tipo di carattere
Source String: <font size="3">Server for NFS Overview</font>
Inconsistent Target String (false positive caused by the markup): <font size="3">Panoramica di Server per NFS</font>
Remove any HTML/XML tags from your strings
Filter out any unlocalised strings and
non-localisable strings.
For Asian languages, run a word breaker tool on
your target strings (this is required for proper
NGram handling)
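The tag-stripping tip can be sketched with a small cleanup pass (a hypothetical helper, not the authors' tooling):

```python
import re

def clean(segment):
    """Drop HTML/XML tags and collapse whitespace before mining and
    alignment, so markup such as <font size="3"> cannot masquerade
    as an occurrence of the term "font"."""
    return " ".join(re.sub(r"<[^>]+>", " ", segment).split())

print(clean('<font size="3">Server for NFS Overview</font>'))
# → Server for NFS Overview
```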
Tips
If you already have source term lists you’re
interested in, you can use them to bypass the term
mining process
If your source terms are well selected, you’ll
achieve very good results – A well selected source
term has a precise technical meaning.
Source term: failure — bad — too generic
Source term: data — bad — too generic; forms part of many other terms: data type, data structure, etc.
Source term: worker process — good — has a precise meaning
Source term: user account control — good — has a precise meaning
Tips
The more data you have, the more accurate your
results will be
Try combining software data with help / user
education data to increase term repetitions
Future Improvements
More work with Adj + Noun
Work with verbs
Add support for Complex Script languages and
languages that inflect on different parts of the word
Further refine Best Translation Candidate Selection
logic
Questions?
Thank You!