Transcript Experiment
國立雲林科技大學
N.Y.U.S.T.
I. M.
National Yunlin University of Science and Technology
Chinese Word Segmentation and
Statistical Machine Translation
Presenter: Wu, Jia-Hao
Authors: RUIQIANG ZHANG, KEIJI YASUDA, EIICHIRO SUMITA
TOSLP (2008)
Intelligent Database Systems Lab
Outline
Motivation
Objective
Methodology
Dictionary-based
CRF-based
Experiments
Conclusion
Personal Comments
Motivation
Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT).
However, creating a CWS system involves many choices, such as which segmentation specification and which CWS method to use.
Ex: 我們要發展中國家用電器
Segmentation 1: 我們 / 要 / 發展 / 中國 / 家用電器
We want to develop China's home electrical appliances.
Segmentation 2: 我們 / 要 / 發展中國家 / 用 / 電器
We want developing countries to use electrical appliances.
Chinese word segmentation
Statistical machine translation
Chinese names are rendered in romanized phonetic transcription.
Objective
The authors created 16 CWS schemes under different settings to examine the relationship between CWS and SMT.
They also tested two CWS methods: the dictionary-based and the CRF-based approaches.
They propose two approaches for combining the advantages of different specifications:
A simple concatenation of training data.
Linear interpolation of multiple translation models.
Methodology-Dictionary-based
The pure dictionary-based CWS does not recognize out-of-vocabulary (OOV) words.
The authors combined an N-gram language model with dictionary-based word segmentation.
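For a given Chinese character sequence C = c_0 c_1 c_2 … c_N, the word sequence W = w_{t_0} w_{t_1} w_{t_2} … w_{t_M} satisfies
\[
w_{t_0} = c_0 \ldots c_{t_0}, \quad
w_{t_1} = c_{t_0+1} \ldots c_{t_1}, \quad
w_{t_i} = c_{t_{i-1}+1} \ldots c_{t_i}, \quad
w_{t_M} = c_{t_{M-1}+1} \ldots c_{t_M},
\]
\[
t_i > t_{i-1}, \quad 0 \le t_i \le N, \quad 0 \le i \le M,
\]
and is chosen such that
\[
W = \arg\max_W P(W \mid C) = \arg\max_W P(W)\, P(C \mid W)
  = \arg\max_W P(w_{t_0} w_{t_1} \ldots w_{t_M})\,
    \delta(c_0 \ldots c_{t_0}, w_{t_0})\,
    \delta(c_{t_0+1} \ldots c_{t_1}, w_{t_1}) \cdots
    \delta(c_{t_{M-1}+1} \ldots c_{t_M}, w_{t_M}).
\]
δ(u,v) equals 1 if both arguments are the same, and 0 otherwise.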
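The search described above can be sketched with dynamic programming. This is a minimal illustration, not the authors' implementation: it assumes a unigram language model in place of a general N-gram model, and the lexicon `TOY_LOGPROB` with its probabilities is entirely hypothetical. The δ terms are enforced by only allowing dictionary words that match the characters exactly.

```python
import math

# Hypothetical toy lexicon with made-up unigram probabilities (assumption).
TOY_LOGPROB = {w: math.log(p) for w, p in {
    "我們": 0.05, "要": 0.05, "發展": 0.02, "中國": 0.02,
    "家用電器": 0.001, "發展中國家": 0.0005, "用": 0.03, "電器": 0.002,
}.items()}


def segment(chars, unigram_logprob):
    """Return the dictionary-word sequence W that maximizes P(W) under a unigram model."""
    n = len(chars)
    # best[i] = (best log-probability of chars[:i], start index of its last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    max_len = max(len(w) for w in unigram_logprob)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            # delta(c..., w) = 1 only when the dictionary word equals the characters
            if word in unigram_logprob and best[start][0] > -math.inf:
                score = best[start][0] + unigram_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        # A pure dictionary method cannot cover out-of-vocabulary material.
        raise ValueError("no segmentation covers the input with dictionary words")
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(chars[start:end])
        end = start
    return words[::-1]


result = segment("我們要發展中國家用電器", TOY_LOGPROB)
```

With these toy probabilities the first reading of the example sentence scores higher than the second; different counts could flip the choice, which is why the language model matters.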
Methodology-CRF-based IOB Tagging
Each character of a word is labeled.
B if it is the first character of a multiple-character word.
O if the character functions as an independent word.
I for all other characters.
Ex:全北京市 is labeled 全/O 北/B 京/I 市/I
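The labeling scheme above can be sketched as a pair of conversions between a segmented word list and per-character IOB tags. The function names `iob_tags` and `words_from_tags` are illustrative, not from the paper.

```python
def iob_tags(words):
    """Label each character: O for a single-character word, B for the first
    character of a multi-character word, I for the remaining characters."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged


def words_from_tags(tagged):
    """Rebuild the word list from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            words[-1] += ch  # continue the current word
        else:
            words.append(ch)  # B or O starts a new word
    return words


tags = iob_tags(["全", "北京市"])
```

Because the mapping is reversible, a tagger that predicts the IOB sequence for raw characters yields a segmentation directly.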
The CRF defines the probability of an IOB tag sequence T = t_0 t_1 … t_M given the word sequence W = w_0 w_1 … w_M.
Unigram features: w_0, w_{-1}, w_1, w_{-2}, w_2, w_0 w_{-1}, w_0 w_1, w_{-1} w_1, w_{-2} w_{-1}, w_2 w_0
Bigram features: simply use absolute counts for each feature in the training data and define a cutoff value for each feature type.
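The formula for the tag-sequence probability did not survive on this slide. For reference, the standard linear-chain CRF form is given here as an assumption, not as the slide's exact equation. The symbols f_k (bigram feature functions over adjacent tags), g_k (unigram feature functions over the current tag), and the weights λ_k and μ_k are introduced for illustration, and t_{-1} is taken as a start-of-sequence marker.

\[
P(T \mid W) = \frac{1}{Z} \exp\left( \sum_{i=0}^{M} \left( \sum_{k} \lambda_k f_k(t_{i-1}, t_i, W) + \sum_{k} \mu_k g_k(t_i, W) \right) \right),
\]
\[
Z = \sum_{T'} \exp\left( \sum_{i=0}^{M} \left( \sum_{k} \lambda_k f_k(t'_{i-1}, t'_i, W) + \sum_{k} \mu_k g_k(t'_i, W) \right) \right).
\]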
Methodology-Achilles
An In-House CWS including Both Dictionary-Based and
CRF-Based Approaches.
Dictionary-based
Zero OOV recognition rate.
Higher in-vocabulary recognition rate.
CRF-based
Higher OOV recognition rate than the dictionary-based approach.
Best F-scores.
Methodology-Phrase-Based SMT
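The method uses a framework of log-linear models to integrate multiple features:
\[
P(E \mid F) = \frac{\exp\left( \sum_{i} \lambda_i f_i(F, E) \right)}{\sum_{E'} \exp\left( \sum_{i} \lambda_i f_i(F, E') \right)}
\]
where f_i(F,E) is the logarithmic value of the i-th feature and λ_i is the weight of the i-th feature. The target sentence candidate that maximizes P(E|F) is the solution.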
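A minimal sketch of how a log-linear model ranks translation candidates. The feature names (`tm`, `lm`), the weights, and the candidate values are all hypothetical, and a real decoder searches a far larger candidate space than this fixed list.

```python
import math


def log_linear_score(features, weights):
    """Weighted sum of log-valued features: sum_i lambda_i * f_i(F, E)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the candidate E with the highest log-linear score."""
    return max(candidates, key=lambda cand: log_linear_score(candidates[cand], weights))


def posterior(candidates, weights):
    """Normalize exponentiated scores into P(E|F) over the given candidates."""
    scores = {c: log_linear_score(f, weights) for c, f in candidates.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / z for c, s in scores.items()}


# Hypothetical weights and feature values (assumptions for illustration).
weights = {"tm": 1.0, "lm": 0.5}
candidates = {
    "we want to develop": {"tm": math.log(0.2), "lm": math.log(0.1)},
    "we want developing": {"tm": math.log(0.3), "lm": math.log(0.01)},
}
best = best_candidate(candidates, weights)
probs = posterior(candidates, weights)
```

The normalizer does not depend on E, so maximizing P(E|F) is the same as maximizing the weighted feature sum.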
Experiments
The data used in the experiments were provided by LDC; the authors used the English sentences of the data plus the Xinhua news of the LDC Gigaword English corpus.
Implementation of CWS Schemes
Tokens: the total number of words in the training data.
Unique words: the lexicon size of the segmented training data.
OOVs: the unknown words in the test data.
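The three statistics above can be computed from segmented text as in the following sketch. It assumes each sentence is already a list of words and that OOVs are counted as distinct word types, which the slide does not specify. The function name `corpus_stats` and the sample sentences are illustrative.

```python
def corpus_stats(train_sentences, test_sentences):
    """Count tokens and unique words in training data, and OOV types in test data."""
    train_tokens = [w for sent in train_sentences for w in sent]
    lexicon = set(train_tokens)
    # Assumption: an OOV is a distinct test word absent from the training lexicon.
    oovs = {w for sent in test_sentences for w in sent if w not in lexicon}
    return {
        "tokens": len(train_tokens),
        "unique_words": len(lexicon),
        "oovs": len(oovs),
    }


train = [["我們", "要", "發展"], ["我們", "用", "電器"]]
test = [["我們", "要", "家用電器"]]
stats = corpus_stats(train, test)
```

Since every count depends on where word boundaries fall, the same raw corpus yields different numbers under each CWS scheme.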
Experiment
The effect of CWS specifications on SMT.
Experiment - Combining multiple CWS schemes
Effect of Combining Training Data from Multiple CWS
Specifications.
The authors created a new CWS scheme called dict-hybrid by combining AS, CITYU, MSR, and PKU.
Training data: 49,546,231 tokens and 112,072 unique words. Test data: 693 OOVs.
Experiment
Effect of Feature Interpolation of Translation Models.
The authors generated multiple translation models by using different
word segmenters.
The phrase translation model p(e|f) can be linearly interpolated as
\[
p(e \mid f) = \sum_{i=1}^{S} \alpha_i \, p_i(e \mid f)
\]
where p_i(e|f) is the phrase translation model corresponding to the i-th CWS, α_i is its weight, and S is the total number of models.
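A minimal sketch of linear interpolation over phrase tables, under stated assumptions: each model is a dictionary mapping a (source phrase, target phrase) pair to p(e|f), a pair missing from a model contributes probability 0, and the weights are required to sum to 1. None of these details are given on the slide, and the phrase pairs and probabilities are invented.

```python
def interpolate(models, alphas):
    """Combine phrase tables as p(e|f) = sum_i alpha_i * p_i(e|f)."""
    if len(models) != len(alphas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    combined = {}
    for model, alpha in zip(models, alphas):
        for pair, prob in model.items():
            # A pair absent from a model implicitly contributes 0 (assumption).
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined


# Hypothetical phrase tables produced under two different segmenters.
model_a = {("電器", "electrical appliances"): 0.6, ("電器", "appliance"): 0.4}
model_b = {("電器", "electrical appliances"): 0.8, ("家用電器", "home appliances"): 1.0}
table = interpolate([model_a, model_b], [0.5, 0.5])
```

The combined table keeps phrase pairs that only one segmenter produces, which is one way the interpolated model can draw on several specifications at once.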
Conclusion
The authors analyzed multiple CWS specifications and
built a CWS for each one to examine how they affected
translations.
They proposed a new approach, linear interpolation of translation features, which improved translation and achieved the best BLEU score of all the CWS schemes.
Comments
Advantage
Many experiments are conducted to evaluate performance.
Drawback
Some interpretations of the experiments are complex.
Application
Chinese Word Segmentation.
Statistical Machine Translation.