Czech-English Word Alignment
Ondřej Bojar ([email protected]), Magdalena Prokopová ([email protected])
Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University in Prague
Automatic Word Alignment
Motivation
Steps in statistical machine translation:
Sentence-parallel corpus -> automatic word alignment -> word-to-word alignments -> phrase extraction -> phrase table (~ a translation dictionary of multi-word expressions)
• Automatic word alignment is run twice (Cs->En and En->Cs).
• The two guessed alignments can be merged using union, intersection or possibly other techniques.
GIZA++ (Och and Ney, 2003) automatically creates asymmetric alignments (1 source word
connected to n target words).
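As an illustration of the merging step (a sketch, not the authors' code), two directed alignments can be symmetrized by set intersection and union; links are assumed to be given as (Czech index, English index) pairs:

def symmetrize(cs_en, en_cs):
    """Merge two directed word alignments given as sets of (cs_idx, en_idx) pairs.

    cs_en: links guessed with Czech as source and English as target
    en_cs: links guessed in the opposite direction, already flipped to
           (cs_idx, en_idx) ordering
    """
    intersection = cs_en & en_cs    # high-precision 1-1 links
    union = cs_en | en_cs           # high-recall n-n links
    return intersection, union

# Hypothetical toy example; indices are word positions in the sentence pair.
inter, uni = symmetrize({(0, 1), (2, 2), (3, 3)}, {(0, 1), (2, 2), (4, 5)})
print(sorted(inter))   # [(0, 1), (2, 2)]
print(sorted(uni))     # [(0, 1), (2, 2), (3, 3), (4, 5)]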
The test set for GIZA++ was created by merging the two human annotations (a sketch follows the list):
• both annotators mark a sure connection -> required connection
• one of the annotators chooses a sure connection and the other any other connection type -> required connection
• at least one of the annotators chooses a connection of any type -> allowed connection
• otherwise -> connection not allowed
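A minimal sketch of these merging rules (an illustration under an assumed data format, not the authors' code):

def merge_annotations(ann1, ann2):
    """Merge two human annotations into required and allowed link sets.

    ann1, ann2: dicts mapping (cs_idx, en_idx) -> connection type,
                where the type is "sure", "possible" or "phrasal".
    """
    # at least one annotator made a connection of any type -> allowed
    allowed = set(ann1) | set(ann2)
    # both annotators connected the pair and at least one marked it sure -> required
    required = {pair for pair in set(ann1) & set(ann2)
                if "sure" in (ann1[pair], ann2[pair])}
    return required, allowed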
Motivation to manually annotate word alignment:
• to create evaluation data for automatic alignment methods
• to learn more about inter-annotator agreement and the limits of the task
Manual Annotation
Two annotators independently annotated 515 sentences using 3 main connection types:
• the word has no counterpart (null)
• the words can possibly be linked (possible)
• the words are translations of each other (sure)
Additionally, some segments could be marked as phrasal translations:
• whole phrases correspond, but not the individual words (phrasal)
Types of connections used to compare the annotations:

                      Possible, Sure, Phrasal    Connection of Any Type
Annotator A1                  15,476                    15,399
Annotator A2                  16,631                    16,246
A1 but not A2                  2,343                     1,146
A2 but not A1                  3,498                     1,714
Relative mismatch             18.2 %                     9.0 %

• The mismatch rate is relatively high, but it drops to half if the differences in connection type are disregarded.
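As an illustration (the poster does not spell out the exact formula), relative mismatch is assumed here to be the number of connections made by only one annotator, relative to the total number of connections made by both:

def relative_mismatch(a1, a2):
    """Relative mismatch of two annotations given as sets of connections.

    a1, a2: sets of (cs_idx, en_idx) links made by annotator 1 and 2;
            to disregard connection types, simply drop the type before
            building the sets.
    """
    only_a1 = a1 - a2          # connections made by A1 but not A2
    only_a2 = a2 - a1          # connections made by A2 but not A1
    return (len(only_a1) + len(only_a2)) / (len(a1) + len(a2))

With the counts above this gives (2,343 + 3,498) / (15,476 + 16,631) ≈ 18.2 % for the type-sensitive comparison and (1,146 + 1,714) / (15,399 + 16,246) ≈ 9.0 % when connection types are disregarded.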
Evaluation metrics: precision penalizes superfluous connections (connections generated automatically but not even allowed), recall penalizes forgotten required connections. Alignment error rate (AER) is a combination of precision and recall.
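A minimal sketch of these metrics under the standard definitions of Och and Ney (2003), where the required (sure) links are a subset of the allowed (possible) links; the function is illustrative and not taken from the poster:

def alignment_scores(auto, required, allowed):
    """Precision, recall and alignment error rate (AER) of an automatic alignment.

    auto:     set of automatically produced links
    required: sure links from the merged human annotation
    allowed:  all allowed links (a superset of the required ones)
    """
    precision = len(auto & allowed) / len(auto)      # superfluous links lower precision
    recall = len(auto & required) / len(required)    # forgotten required links lower recall
    aer = 1.0 - (len(auto & required) + len(auto & allowed)) / (len(auto) + len(required))
    return precision, recall, aer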
Preprocessing of the input text, such as lemmatization, significantly reduces data sparseness (see the table Details about the PCEDT below) and helps to achieve better alignments:
Baseline (raw input text):                 Zisk     se     vyšvihl     na     117   milionů   dolarů
Lemmas:                                    zisk     se-1   vyšvihnout  na-1   117   milion    dolar
Lemmas + Numbers:                          zisk     se-1   vyšvihnout  na-1   NUM   milion    dolar
Lemmas + Singletons backed off with POS:   zisk     se-1   VERB        na-1   117   milion    dolar
Gloss:                                     Revenue  refl   soared      to     117   million   dollar
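A rough sketch of the preprocessing variants illustrated above (an illustration under an assumed token format; lemmatization and tagging themselves come from external tools):

import re
from collections import Counter

def preprocess(sentences, mode):
    """Produce alignment input in one of the variants shown above.

    sentences: list of token lists; each token is a dict with "form",
               "lemma" and "pos" keys (a hypothetical format).
    mode: "baseline", "lemmas", "lemmas+numbers" or "lemmas+singletons"
    """
    if mode == "baseline":
        return [[t["form"] for t in s] for s in sentences]

    lemma_counts = Counter(t["lemma"] for s in sentences for t in s)
    out = []
    for s in sentences:
        row = []
        for t in s:
            token = t["lemma"]
            if mode == "lemmas+numbers" and re.fullmatch(r"\d+([.,]\d+)?", token):
                token = "NUM"            # collapse all numbers into a single token
            elif mode == "lemmas+singletons" and lemma_counts[token] == 1:
                token = t["pos"]         # back off singleton lemmas to their POS tag
            row.append(token)
        out.append(row)
    return out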
Results of automatic word alignment:

                                           Intersection (1-1)        Union (n-n)
                                           Prec    Rec    AER     Prec    Rec    AER
Baseline                                   97.4    57.6   27.4    65.9    86.7   25.5
Lemmas                                     97.9    75.0   15.0    77.1    89.8   17.2
Lemmas + Numbers                           97.9    75.2   14.8    77.5    89.9   17.0
Lemmas + Singletons backed off with POS    97.4    75.8   14.6    77.8    88.5   17.4
Most Frequent Problematic Cases

Where GIZA++ Fails, Humans Were Often in Trouble, Too
The following table displays the percentage of tokens where there was a match (OK) or mismatch (Problems) in the respective languages:
• Humans: the two human annotations compared against each other.
• GIZA++: GIZA++ compared against the golden alignments (i.e. merged human annotations).

                       Humans: Problems               Humans: OK
                       GIZA++ Problems   GIZA++ OK    GIZA++ Problems   GIZA++ OK
Baseline   en                14.3           0.1             38.6           46.9
           cs                15.5           0.1             35.7           48.7
Improved   en                14.3           0.2             25.2           60.4
           cs                15.5           0.1             25.0           59.4

• Out of all the positions where GIZA++ failed, 38% were problematic for humans.
• The improvement thanks to lemmatization is not observed on words that are difficult for humans anyway.
Top Ten Problematic Words and POSes

Problematic Words                   Problematic Parts of Speech
English         Czech               English          Czech
to      319     ,      679          IN     1348      N     259
the     271     se     519          DT     1283      V     159
of      146     v      510          NN      661      R     143
a       112     na     386          PRP     505      P     124
,        74     o      361          TO      448      Z     107
be       61     že     327          VB      398      A      99
it       55     .      310          JJ      280      D      95
that     47     a      245          RB      192      J      84
in       41     bude   216          NNP      59      C      80
by       37     k      199          VBN      22      T       …

Typical problematic cases highlighted in the tables:
• verbs and their belongings, including the negative particle
• punctuation: commas are used more frequently in Czech; the dollar symbol ($) is almost always translated and thus rarely repeated in Czech
• English articles in cases where the rule “connect to the Czech governing noun” cannot be clearly applied

English Penn Treebank tag-set: IN - preposition or subordinating conjunction, DT - determiner, NN - common noun (singular or mass), PRP - personal pronoun, TO - to, VB - verb (base form), JJ - adjective, NNP - proper noun (singular), VBN - verb (past participle).
Czech tag-set: N - noun, V - verb, R - preposition, P - pronoun, Z - punctuation or sentence border, A - adjective, D - adverb, J - conjunction, C - number, T - particle.
Details about the Prague Czech-English
Dependency Treebank
• Source: Wall Street Journal section of the Penn Treebank
• Translated sentence-by-sentence to Czech.
                                                        Czech      English
Sentences                                              21,141       21,141
Running Words                                         475,719      494,349
Running Words without Punctuation                     404,523      439,304
Baseline: Vocabulary                                   57,085       30,770
Baseline: Singletons                                   31,458       14,637
Lemmas: Vocabulary                                     28,007       25,000
Lemmas: Singletons                                     13,009       11,873
Lemmas + Singletons backed off with POS: Vocabulary    15,041       13,150
Lemmas + Singletons backed off with POS: Singletons        12            2
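For reference, a small sketch (not the authors' tooling) of how the running-word, vocabulary and singleton figures above can be computed from tokenized text:

from collections import Counter

def corpus_stats(sentences):
    """Running words, vocabulary size and singleton count as in the table above.

    sentences: list of token lists (word forms or lemmas, depending on
               which rows of the table are being reproduced).
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    running_words = sum(counts.values())
    vocabulary = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return running_words, vocabulary, singletons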
Full text, acknowledgement and the list of references in the proceedings of LREC 2006.