Transcript of terms

1
Building Bilingual Lexicons Using
Lexical Translation Probabilities
via Pivot Language
Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1,2)
LREC 2008, 29 May, 2008
(1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo
(2) School of Computer Science, University of Manchester / National Centre for Text Mining
2
Introduction

Building bilingual lexicons via pivot languages
CHINESE: 计步器 (jìbùqì)
  -> via C-E lexicon ->
ENGLISH: odometer, pedometer
  -> via E-J lexicon ->
JAPANESE: オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 万歩計 (mampokei), 歩数計 (hosūkei)
3
Introduction

Building bilingual lexicons via pivot languages
计步器 (jìbùqì)
(1) odometer: オドメーター (odomētā)
(2) pedometer: ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei)
Creative Commons Attribution ShareAlike 2.0 License
by skippy13
4
Advantages of the pivotal approach

Constructing Japanese-Chinese lexicon from
Japanese-English and English-Chinese lexicons
through English terms


J-E and E-C lexicons cover many more terms and domains than J-C lexicons
J-C lexicons are especially scarce for technical terms, because technical terms are first written in English in most cases
The pivotal approach could help us find J-C translation term pairs (semi-)automatically
5
Mismatch problem

We cannot find a Chinese-Japanese term pair whose members do not share an identical English translation.
Chinese terms | English terms | Japanese terms
全球变暖 (qúanqíu-bìannŭan) | global heating | (n/a)
(n/a) | global warming | 地球温暖化 (chikyū-ondanka)
Is it possible to generate the following lexical item?
Chinese terms | English terms | Japanese terms
全球变暖 | global heating, global warming | 地球温暖化
6
Merging Two Bilingual Lexicons

“Exact merging”
  cannot merge pairs that do not share the identical English translations -> mismatch problem
Challenges to merge more terms
  “Word-based merging”
  “Alignment-based merging”
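As a point of reference for the methods listed above, exact merging can be sketched as a join of the two lexicons on identical English terms. This is a minimal sketch, not the authors' implementation: the function name `exact_merge` and the small pair lists are illustrative assumptions built from the examples on earlier slides.

```python
from collections import defaultdict

def exact_merge(ce_pairs, je_pairs):
    """Join C-E and J-E pairs on identical English terms (hypothetical helper)."""
    english_to_japanese = defaultdict(set)
    for japanese, english in je_pairs:
        english_to_japanese[english].add(japanese)
    merged = set()
    for chinese, english in ce_pairs:
        # Only an identical English string links the two lexicons.
        for japanese in english_to_japanese.get(english, ()):
            merged.add((chinese, japanese))
    return merged

# Illustrative pairs taken from the slide examples.
ce_pairs = [("计步器", "pedometer"), ("全球变暖", "global heating")]
je_pairs = [("歩数計", "pedometer"), ("万歩計", "pedometer"),
            ("地球温暖化", "global warming")]
merged = exact_merge(ce_pairs, je_pairs)
```

In this toy input, 全球变暖 stays unmatched because "global heating" and "global warming" are different strings, which is the mismatch problem.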
7
Word-based merging
Tokenize a term into word tokens, and
 Translate each word by the bilingual lexicon

Chinese terms | English terms | Japanese terms
全球变暖 | global heating | (n/a)
(n/a) | global warming | 地球温暖化
(n/a) | global | 地球
(n/a) | heating | 温暖化
Result: 全球变暖 (qúanqíu-bìannŭan) | global heating | 地球 温暖化 (chikyū-ondanka)
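The two steps of word-based merging (tokenize, then translate each word with the bilingual lexicon) might look like the sketch below. It assumes whitespace tokenization and a word-level E-J dictionary; `word_based_translate` and the dictionary contents are hypothetical and only mirror the example above.

```python
from itertools import product

def word_based_translate(term, word_lexicon):
    """Translate each whitespace-separated word and combine the options."""
    options = []
    for word in term.split():
        translations = word_lexicon.get(word)
        if not translations:
            return []  # one untranslatable word blocks the whole term
        options.append(translations)
    # Every combination of per-word translations is a candidate term.
    return [" ".join(combo) for combo in product(*options)]

# Word entries as they appear in the example table.
ej_word_lexicon = {"global": ["地球"], "heating": ["温暖化"]}
candidates = word_based_translate("global heating", ej_word_lexicon)
```

The sketch keeps the source word order, which is an assumption; it also shows why this method still fails whenever a single word is missing from the lexicon.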
8
Alignment-based merging:
Overview



Align each word,
Calculate word translation probabilities, and
Translate each word by the probabilities
Chinese terms | English terms | Japanese terms
全球 变暖 | global heating | (n/a)
(n/a) | global warming | 地球 温暖化
(n/a) | heating | 温暖化
[Figure: word alignments between 全球 变暖 and global heating, and between global warming and 地球 温暖化; heating is translated as 温暖化]
9
Alignment-based merging: Overview
Word-by-word translation
Merging word pairs & recalculating probabilities
(Add term frequencies on Web)
10
Alignment-based merging


Apply word alignment (GIZA++) (Och & Ney, 2003) for all term pairs
Calculate word translation probabilities from co-occurrence frequencies,
for both of the bilingual lexicons, source(f)-pivot(p) and pivot(p)-target(e):
$$p(w_f \mid w_p; a_{pf}) = \frac{C(w_p, w_f; a_{pf})}{C(w_p)}, \qquad p(w_p \mid w_e; a_{ep}) = \frac{C(w_e, w_p; a_{ep})}{C(w_e)}$$
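The relative-frequency estimate above can be illustrated with a few lines of Python. This sketch assumes the word alignment step has already produced a flat list of aligned (conditioning word, translated word) links, and it counts C(w) as the number of links that word takes part in. The function name and the toy links, including 加熱, are made up for illustration.

```python
from collections import Counter

def translation_probabilities(aligned_word_pairs):
    """Estimate p(target | source) = C(source, target) / C(source)."""
    pair_counts = Counter(aligned_word_pairs)
    source_counts = Counter(source for source, _ in aligned_word_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }

# Toy alignment links (pivot word, source-language word); not real data.
aligned = [("heating", "温暖化"), ("warming", "温暖化"),
           ("heating", "加熱"), ("heating", "温暖化")]
probs = translation_probabilities(aligned)
```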
11
Alignment-based merging

Calculate word translation probabilities from
a target-language word to a source-language
word (Utiyama & Isahara, 2007):
$$p(w_f \mid w_e) = p(w_f \mid w_e; a_{ep}, a_{pf}) = \sum_{w_p} p(w_f \mid w_p; a_{pf})\, p(w_p \mid w_e; a_{ep})$$
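The sum over pivot words can be sketched as follows. The dictionaries are keyed by (translated word, conditioning word), and all probabilities are invented toy values; `pivot_merge` is a hypothetical name, and the quadratic loop is chosen for readability rather than speed.

```python
from collections import defaultdict

def pivot_merge(p_f_given_p, p_p_given_e):
    """p(w_f | w_e) = sum over w_p of p(w_f | w_p) * p(w_p | w_e)."""
    merged = defaultdict(float)
    for (w_p, w_e), prob_pe in p_p_given_e.items():
        for (w_f, pivot), prob_fp in p_f_given_p.items():
            if pivot == w_p:  # chain only through a shared pivot word
                merged[(w_f, w_e)] += prob_fp * prob_pe
    return dict(merged)

# Toy values: two English pivots link 变暖 and 温暖化.
p_p_given_e = {("heating", "变暖"): 0.6, ("warming", "变暖"): 0.4}
p_f_given_p = {("温暖化", "heating"): 0.5, ("温暖化", "warming"): 1.0}
p_f_given_e = pivot_merge(p_f_given_p, p_p_given_e)
```

Here the two pivot paths contribute 0.5 × 0.6 and 1.0 × 0.4, so the pair receives probability mass even though no single English term is shared by both lexicons.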
12
Alignment-based merging

Calculate the translation probabilities (scores) based
on the noisy channel model (Brown et al., 1990)
$$\Pr(w_e \mid w_f) \propto p(w_e)\, p(w_f \mid w_e) = p(w_e) \prod_i p(w_{f,i} \mid w_{e,i})$$
The language model p(w_e) is estimated from the number of Web search results (Google) for the term w_e:
  p(w_e) ∝ (hit count of w_e)
Generate the merged lexicon from the pairs whose translation probabilities are greater than zero:
  New_Lexicon = {(w_f, w_e) | Pr(w_e|w_f) > 0 and Pr(w_f|w_e) > 0}
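A rough sketch of the scoring step follows, assuming a one-to-one, same-order pairing of words and using the raw hit count as an unnormalized stand-in for p(w_e). The function `channel_score`, the probabilities, and the hit count are illustrative assumptions, not values from the experiments.

```python
def channel_score(target_words, source_words, p_f_given_e, hit_count):
    """Score = hit_count * product of word translation probabilities."""
    if len(target_words) != len(source_words):
        return 0.0  # the word-by-word model needs equal lengths
    prob = 1.0
    for w_e, w_f in zip(target_words, source_words):
        # Unseen word pairs get probability zero.
        prob *= p_f_given_e.get((w_f, w_e), 0.0)
    return hit_count * prob

# Toy word translation probabilities keyed by (w_f, w_e).
p_f_given_e = {("地球", "全球"): 0.8, ("温暖化", "变暖"): 0.5}
score = channel_score(["全球", "变暖"], ["地球", "温暖化"], p_f_given_e, 1000)
unseen = channel_score(["全球", "变冷"], ["地球", "温暖化"], p_f_given_e, 1000)
```

A candidate with zero Web hits or with any zero-probability word pair scores zero, so it falls outside the merged lexicon as defined above.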

13
Experimental settings

Used lexicons: Bilingual lexicons that consist of
technical terms



C-E: Wanfang Data E-C & C-E Science and Technology
Dictionary
J-E: JST Machine Translation Dictionary
By “exact merging,” we can translate about 22% of
Japanese (or Chinese) terms
Lexicon | # of terms (J) | # of terms (E) | # of terms (C)
J-E | 465,563 | 416,578 | -
C-E | - | 429,766 | 439,795
# of distinct E terms | - | 777,344 | -
C-J by “exact merging” | 103,437 (22.2%) | 68,996 | 98,537 (22.4%)
14
Experimental results
Utilization ratio
Method | # of terms (J) (utilization ratio of J) | # of terms (C) (utilization ratio of C)
Exact merging | 103,437 (22.2%) | 98,537 (22.4%)
Word-based merging | 124,945 (26.8%) | 167,929 (38.1%)
Alignment-based merging | 438,976 (94.2%) | 342,229 (77.8%)
 Alignment-based merging drastically improved the utilization ratio,
and the size of merged lexicon also increased

Accuracy (by manual evaluation)
Source-Target | MRR | Prec1 | Prec10
Japanese-to-Chinese | 0.242 | 0.14 | 0.46
Chinese-to-Japanese | 0.258 | 0.20 | 0.40
MRR: Mean Reciprocal Rank (Voorhees, 1999), the mean of the reciprocal ranks of the correct translation over all source terms
Prec1: Precision of the highest-ranked translations
Prec10: Proportion of source terms whose 10-best outputs include the correct one
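The three measures can be computed from the rank of the first correct candidate for each source term. The sketch below assumes that a term with no correct candidate contributes zero to every measure; the function name and the rank list are made-up illustrations, not evaluation data.

```python
def evaluate(ranks, k=10):
    """ranks: rank of the first correct candidate per source term, or None."""
    n = len(ranks)
    # A term without any correct candidate adds 0 to each sum.
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    prec1 = sum(1 for r in ranks if r == 1) / n
    prec_k = sum(1 for r in ranks if r is not None and r <= k) / n
    return mrr, prec1, prec_k

# Five hypothetical source terms: correct answer at ranks 1, 3, none, 2, 12.
mrr, prec1, prec10 = evaluate([1, 3, None, 2, 12])
```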
15
Experimental results: Examples (1/2)

A Chinese-to-Japanese example of “角膜 实质 炎” (jiăomó - shízhì - yán) (keratitis parenchymatosa)
Japanese translation | J-to-E literal translation | Score | Log10 prob. | Hit count
角膜 実質 炎 | kerato- parenchymatitis | 0.057 | -2.89 | 432 | OK
角膜 的 炎 | kerato- inflammation | 0.00457 | -3.34 | 10
角膜 物質 炎 | kerato- material inflammation | 0 | -2.24 | 0
角膜 物質 関節 | kerato- material joint | 0 | -2.49 | 0
角膜 実 炎 | kerato- real inflammation | 0 | -2.63 | 0
角膜 物質 性 | kerato- materiality | 0 | -2.66 | 0
角膜 材料 炎 | kerato- stuff inflammation | 0 | -2.66 | 0
角膜 物質 高安 | kerato- material high-low | 0 | -2.83 | 0
角膜 物質 胃腸 | kerato- material stomach | 0 | -2.87 | 0
16
Experimental results: Examples (2/2)

A J-to-C example of “発育 状態” (hatsuiku - jōtai) (growth status)
Chinese translation | C-to-E literal translation | Score | Log10 prob. | Hit count
的 状态 | state of | 7249 | -2.43 | 1960000
发展 状态 | development state | 6593 | -1.58 | 252000
发展 条件 | development condition | 6001 | -2.05 | 674000
的 条件 | condition of | 3159 | -2.90 | 2510000
发展 国家 | development country | 2715 | -2.57 | 998000
生长 状态 | growing state | 2688 | -1.51 | 87900 | OK
生长 条件 | growing condition | 2248 | -1.98 | 216000
增长 状态 | rising state | 1343 | -1.72 | 69800 | OK
开发 条件 | development condition | 1260 | -2.78 | 192000
17
Conclusion
Alignment-based merging of two bilingual lexicons via a pivot language is proposed
Alignment-based merging achieved a utilization ratio of at least 75% in our experiments
Precision remains low, at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese); a more sophisticated scoring method should improve it
Future directions
  Choosing the correct translation by examining the context or semantic classes of source and target terms
  Evaluating a machine translation system with this lexicon integrated
18
Thank you for your attention

Acknowledgments




MEXT, Japan
Japan Science and Technology Agency (JST), Japan
NICT, Japan
Wanfang Data, China
19
Experimental Results

Our system could generate at least one Japanese translation for 73.4% (385,509/525,259) of the items in the C-E lexicon
Chinese input term: 传染性 肝炎 病毒 (reference translation: infectious hepatitis virus, 感染性肝炎ウイルス)
Japanese candidate | score
感染 性 肝炎 ウイルス | -8.29
感染 肝炎 ウィールス | -16.58
感染 肝炎 ウイルス | -16.60
感染 性 肝炎 ウイルス | -17.24
感染 性 肝炎 ウイルス | -17.42
伝染 性 肝炎 ウィールス | -17.63
伝染 性 肝炎 ウイルス | -17.65
Chinese input term: 大肠 杆菌 噬菌体 (reference translation: coliphage, 大腸菌ファージ)
Japanese candidate | score
大腸 菌 ファージ | -17.68
大腸 ファージ | -17.82
大腸 菌 型 ファージ | -18.48
大腸 菌 ファージ の | -18.88
大腸 菌 バクテリオファージ | -18.88
コリフォーム ファージ | -19.01
大腸 ファージ の | -19.02
20
Experimental Results
Chinese input term: 声 延迟 线存 储器 (reference translation: acoustic delay line storage, 音響遅延線記憶装置)
Chinese input term: 补码 (reference translation: complement form, 補数形式)
[Tables: ranked Japanese candidate translations with scores for each input term, with scores between about -17.15 and -19.05]
Note on the slide: same character but the meanings are not identical
21
Manual evaluation

A human evaluator checked the translation results of 200 Chinese terms classified in the category “Computer” in the C-E lexicon
  Terms that could be translated into Japanese: 181 (90.5%)
  Terms whose top-10 translations included the correct one: 135 (67.5%)
  Terms whose top translation was correct: 73 (36.5%)
  MRR (mean reciprocal rank) = 0.466
    The average, over terms, of the reciprocal rank of the highest-ranked correct translation
Terms whose top translation was correct:
  激光 存储器 电路 – laser memory circuit – レーザー メモリ 回路
  虚拟 处理 – dummy treatment – 仮想 処理
  综合 数字网 – integrated digital network – 総合 ディジタル 網
Terms whose top translation was incorrect / terms that could not be translated:
  数 组 元素 – array element – 配列 元素
  计算机 化 管理 学会 – ICM – 特 発 性 心筋 障害
  信息量 – information content – 量
  转镜 式激 光束 影像 记录 仪 – laser beam rotating mirror image recorder – (NO)
26
Conclusion
We proposed a method that uses phrase-based SMT to construct a J-C lexicon from J-E and C-E lexicons.
  We obtained J translations for 73.4% of the items in the C-E lexicon, outperforming “exact matching” (22.2%).
  36.5% of the top J translations were correct, and 67.5% of the top-10 J translation lists included the correct one.
We could apply this method to support manual construction of bilingual dictionaries and use this lexicon for MT.
Future work
  Parameter optimization of SMT by using existing J-C lexicons
  Chinese character similarity that considers the similarity between individual characters
  More sophisticated reordering model (considering parts of speech)
  Other translation directions (EJ, JC, EC)
27
Acquisition of Translation Pairs of Technical Terms
Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents
To construct such dictionaries, we must rely on experts who can deal with both languages
  This is very costly
  We must keep up with the rapid increase of new terms
Automatic acquisition of translation candidates of technical terms
  • Support for constructing the dictionary
  • Improvement of the performance of machine translation systems
28
J-E bilingual lexicon


527,206 translation pairs
Numbers of distinct terms: 465,565 J terms, 509,259 E terms
Japanese terms | English terms
“外装・内装”派 | "exterior・interior" fraction
(案) | (draft)
(案) | (plan)
(株) | Co.,Ltd.
(株) | Inc.
… | …
ころがり接触疲労 | rolling contact fatigue
ころがり損失 | rolling loss
ころがり対偶 | rolling pair
ころがり疲れ寿命 | rolling fatigue life
29

C-E bilingual lexicon
Wanfang Data E-C & C-E Science and Technology Dictionary
525,259 pairs
id | Chinese terms | English terms | Category
1 | ……的瞬时值 | Instantaneous… | 科技 (science and technology)
2 | Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | 电子 (electronic)
3 | Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | 电子
4 | Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | 电子
5 | ⅠA族化合物 | ⅠA compound | 无化 (inorganic chemistry)
…
525259 | 专利发明 | patent | 专利 (patent)
30
Construction of the C-J bilingual lexicon
Attach Japanese translations for each lexical item of C-E lexicon
Chinese terms | English terms | Japanese terms
……的瞬时值 | Instantaneous… | 瞬間…
Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | Ⅰ-V族化合物半導体
Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | Ⅰ-Ⅵ族化合物半導体
Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | Ⅰ-Ⅶ族化合物半導体
ⅠA族化合物 | ⅠA compound | ⅠA族化合物
…
专利发明 | patent | 特許
31
Overview of constructing J-C lexicon
We treat the C-E and J-E lexicons as parallel corpora and use them as training data for constructing a J-C SMT system
  Word/phrase-level merging through English becomes possible by applying an SMT approach to the C-E and J-E lexicons
  We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon
Statistical approaches seem to be effective because of the similarities in semantics and word order between C and J
It is easy to introduce other clues such as Chinese character similarity
32


Collecting J-E & C-E translation phrase pairs
Apply morphological analyzers, and obtain word alignments by GIZA++ (Och and Ney, 2003) for J-E and C-E lexicons
Collect phrase pairs by “Grow-diag-final” method (using Moses, Koehn et al., 2007) and calculate the probabilities by the relative frequencies
Example term pair: ころがり 疲れ 寿命 / rolling fatigue life
Japanese phrases | English phrases | p( e | j ) | p( j | e )
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1
33
Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)
Japanese phrases | English phrases | p( e | j ) | p( j | e )
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1
Chinese phrases | English phrases | p( e | c ) | p( c | e )
侧倾 | rolling | 0.182 | 0.029
横摇 | rolling | 0.5 | 0.014
… | … | … | …
疲乏 | fatigue | 1 | 0.011
… | … | … | …
疲劳 寿命 | fatigue life | 1 | 1
34
Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)
Japanese phrases | Chinese phrases | p( c | j ) | p( j | c )
ころがり | 侧倾 | … | 0.015
ころがり | 横摇 | … | 0.042
… | … | … | …
疲れ | 疲乏 | … | 0.297
… | … | … | …
疲れ 寿命 | 疲劳 寿命 | … | 0.545
$$p(w_f \mid w_e) = \frac{1}{Z_e} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e), \qquad Z_e = \sum_{w_f} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e)$$
(Z_e is a normalization factor)
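As a worked check of the merging formula, the products below use the J-E and C-E phrase probabilities shown on the preceding slides. They assume that the listed English phrase is the only pivot for each pair and they leave out the normalization by Z_e, so they should be read as an approximate illustration rather than the exact computation behind the table.

$$p(\text{ころがり} \mid \text{侧倾}) \approx p(\text{ころがり} \mid \text{rolling})\, p(\text{rolling} \mid \text{侧倾}) = 0.083 \times 0.182 \approx 0.015$$

$$p(\text{ころがり} \mid \text{横摇}) \approx p(\text{ころがり} \mid \text{rolling})\, p(\text{rolling} \mid \text{横摇}) = 0.083 \times 0.5 \approx 0.042$$

$$p(\text{疲れ 寿命} \mid \text{疲劳 寿命}) \approx p(\text{疲れ 寿命} \mid \text{fatigue life})\, p(\text{fatigue life} \mid \text{疲劳 寿命}) = 0.545 \times 1 = 0.545$$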
35
Features for learning of the log-linear model
We employ the following features h1-h4 for the log-linear model:
$$\hat{w}_e = \arg\max_{w_e} \sum_{m=1}^{M} \lambda_m h_m(w_e, w_f)$$
1. Phrase translation prob.: $h_1(w_e, w_f) = \sum_i \log p(w_e^{(i)}, w_f^{(i)})$
   where $w_e^{(i)}, w_f^{(i)}$ are the i-th phrase pair for the translation
2. 3-gram language model of the target language: $h_2(w_e, w_f) = \log p(w_e)$
   where p(w_e) is a language model probability from other monolingual corpora
3. Phrase reordering penalty (Koehn et al., 2003)
4. Chinese character similarity (Zhang et al., 2005)
36
Feature 3: Phrase reordering penalty (Koehn et al., 2003)
The feature value is the sum of the penalties d defined by the following formula for the phrase pairs of w_e, w_f:
$$d(w_e^{(i)}, w_f^{(i)}) = -\left| a_i - b_{i-1} - 1 \right|, \qquad h_3(w_e, w_f) = \sum_i d(w_e^{(i)}, w_f^{(i)})$$
where a_i is the position of the first word of the i-th phrase of w_f and b_{i-1} is the position of the last word of the phrase of w_f translated in the previous step
Example: source f1 … f8, translated in the order e1 e2 <- f1 f2 f3, e3 <- f8, e4 <- f6 f7, e5 e6 <- f4 f5
d(e1 e2, f1 f2 f3) = 0
d(e3, f8) = – |8 – 3 – 1| = – 4
d(e4, f6 f7) = – |6 – 8 – 1| = – 3
d(e5 e6, f4 f5) = – |4 – 7 – 1| = – 4
h3(e1…e6, f1…f8) = – 11
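The worked example above can be reproduced with a short sketch. It assumes each translated phrase is given as a (first position, last position) span over the source words, listed in target order, with b_0 = 0; the function name is hypothetical.

```python
def reordering_penalty(source_spans):
    """Sum of d_i = -|a_i - b_{i-1} - 1| over phrases in target order."""
    total = 0
    previous_end = 0  # b_0 = 0, so a monotone start at position 1 costs nothing
    for start, end in source_spans:
        total += -abs(start - previous_end - 1)
        previous_end = end
    return total

# Source spans for e1 e2, e3, e4, e5 e6 in the example: f1-f3, f8, f6-f7, f4-f5.
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
h3 = reordering_penalty(spans)
```

A fully monotone translation, such as spans (1, 3), (4, 5), (6, 8), gets a penalty of zero under the same definition.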
37
Feature 4: Chinese character similarity
Chinese and Japanese writing systems both use Chinese characters, and their similarity should be a powerful clue for deriving translation phrase pairs (Zhang et al., 2005)
We define the feature value h4 between w_e and w_f as follows:
$$h_4(w_e, w_f) = 1 - \frac{\text{edit distance of Chinese characters between } w_e \text{ and } w_f}{\max(\text{number of characters in } w_e,\ \text{number of characters in } w_f)}$$
Differences between Chinese and Japanese forms of characters are ignored
Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 – 2 / 3 = 0.333
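The feature can be sketched with a standard Levenshtein edit distance. This sketch assumes the inputs have already been mapped to a common character form (the step that turns 计步器 into 計歩器 is not implemented here) and that every character counts toward the length; both function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def character_similarity(w_e, w_f):
    """h4 = 1 - edit distance / length of the longer string."""
    longest = max(len(w_e), len(w_f))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(w_e, w_f) / longest

similarity = character_similarity("万歩計", "計歩器")
```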