Confidence Estimation for Machine Translation
J. Blatz et al., COLING 2004
SSLI MTRG 11/17/2004
Takahiro Shinozaki
Abstract
Detailed study of CE for machine translation
Various machine learning methods
CE for sentences and for words
Different definitions of correctness
Experiments: NIST 2003 Chinese-to-English MT evaluation
1 Introduction
CE can improve the usability of NLP-based systems
CE techniques are not well studied in machine translation
Investigate sentence- and word-level CE
2 Background
Strong vs. weak CE
Strong CE: requires a probability
The CE score is a correctness probability; applying a threshold turns it into a binary output
Weak CE: requires only binary classification
The CE score need not be a probability; only the thresholded binary output matters
2 Background
With or without a distinct CE layer
No distinct CE layer: the NLP system maps input x to output y and produces the confidence score itself
Distinct CE layer: a separate CE module (naïve Bayes, NN, SVM, etc.) takes the system's input x and output y and produces the confidence score
Requires a training corpus, but is powerful and modular
3 Experimental Setting
Input source sentences (Src) are translated by the ISI alignment-template MT system into N-best hypotheses (Hyp)
Each hypothesis is labelled correct or not (C) against the reference sentences
The labelled data are split into training, validation, and test sets
3.1 Corpora
Chinese-to-English
Evaluation sets from NIST MT competitions
Multi-reference corpus from LDC
3.2 CE Techniques
Data: a collection of pairs (x, c)
x: feature vector, c: correctness
Weak CE: x → MLP → score (regressing an MT evaluation score)
Strong CE: x → naïve Bayes → P(c=1|x), or x → MLP → P(c=1|x)
3.2 Naïve Bayes (NB)
Assume the features x = (x_1, x_2, …, x_D) are statistically independent given the class:
P(c | x) ∝ P(c) P(x | c) = P(c) ∏_{d=1}^{D} P(x_d | c)
Apply absolute discounting to the feature probability estimates
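As a concrete illustration of the naïve Bayes confidence estimator above, here is a minimal sketch assuming discrete (binned) feature values; the class structure, discount constant, and method names are my own, not from the paper.

```python
# Minimal naive Bayes confidence estimator over discrete (binned) feature values.
# Absolute discounting: subtract a small constant from every observed count and
# redistribute the freed mass uniformly over unseen feature values.
from collections import Counter, defaultdict

class NaiveBayesCE:
    def __init__(self, discount=0.5):
        self.discount = discount

    def fit(self, X, c):
        # X: list of feature vectors (tuples of ints), c: list of 0/1 correctness labels
        self.classes = sorted(set(c))
        self.prior = {k: sum(1 for ci in c if ci == k) / len(c) for k in self.classes}
        self.counts = defaultdict(Counter)   # (d, class) -> counts of feature values
        self.values = defaultdict(set)       # d -> all values seen for feature d
        for x, ci in zip(X, c):
            for d, v in enumerate(x):
                self.counts[(d, ci)][v] += 1
                self.values[d].add(v)
        return self

    def _p(self, d, v, k):
        cnt = self.counts[(d, k)]
        total = sum(cnt.values())
        if total == 0:
            return 1.0 / max(len(self.values[d]), 1)
        if cnt[v] > 0:
            return (cnt[v] - self.discount) / total
        unseen = max(len(self.values[d]) - len(cnt), 1)
        return (self.discount * len(cnt)) / (total * unseen)

    def predict_proba(self, x):
        # P(c | x) proportional to P(c) * prod_d P(x_d | c), normalized over the classes
        score = {k: self.prior[k] for k in self.classes}
        for d, v in enumerate(x):
            for k in self.classes:
                score[k] *= self._p(d, v, k)
        z = sum(score.values()) or 1.0
        return {k: s / z for k, s in score.items()}

# e.g. NaiveBayesCE().fit([(0, 1), (1, 1), (1, 0)], [0, 1, 1]).predict_proba((1, 1))
```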
3.2 Multi-Layer Perceptron (MLP)
Non-linear mapping of input features
Linear transformation layers
Non-linear transfer functions
Parameter estimation
Weak CE (Regression)
• Target: MT evaluation score
• Minimizing a squared error loss
Strong CE (Classification)
• Target: Binary correct/incorrect class
• Minimizing negative log likelihood
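To make the regression/classification distinction concrete, here is a sketch using scikit-learn; it is illustrative only (the paper does not use scikit-learn), and the toy data, hidden-layer size, and 30%-correct labelling are assumptions.

```python
# Weak CE (regression) vs. strong CE (classification) with MLPs -- illustrative sketch.
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 91))                 # toy stand-in for 91 sentence-level features
nist = rng.uniform(0, 10, size=200)            # toy sentence-level NIST scores
correct = (nist >= np.quantile(nist, 0.7)).astype(int)   # top 30% labelled "correct"

# Weak CE: regress the MT evaluation score, minimizing a squared-error loss.
weak = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000).fit(X, nist)
weak_score = weak.predict(X)                   # only useful for ranking / thresholding

# Strong CE: classify correct/incorrect, minimizing the negative log-likelihood.
strong = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, correct)
p_correct = strong.predict_proba(X)[:, 1]      # estimate of P(c = 1 | x)
```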
3.3 Metrics for Evaluation
Strong CE metric:
Evaluates probability distribution
Normalized cross entropy (NCE)
Weak CE metrics:
Evaluates discriminability
Classification error rate (CER)
Receiver operating characteristic (ROC)
3.3 Normalized Cross Entropy
Cross entropy (average negative log-likelihood) of the probabilities estimated by the CE module:
NLL = -(1/n) Σ_i log P(c_i | x_i)
Baseline cross entropy, using the empirical class probabilities obtained from the test set:
NLLb = -[ (n0/n) log(n0/n) + (n1/n) log(n1/n) ]
Normalized cross entropy:
NCE = (NLLb - NLL) / NLLb
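A small sketch of the NCE computation under the definitions above; the function name and epsilon clipping are my own.

```python
# Normalized cross entropy of predicted correctness probabilities (sketch).
import math

def nce(p_correct, labels, eps=1e-12):
    n = len(labels)
    n1 = sum(labels)                     # number of correct examples
    n0 = n - n1
    # model cross entropy: average negative log-likelihood of the true labels
    nll = -sum(math.log(max(p if c == 1 else 1.0 - p, eps))
               for p, c in zip(p_correct, labels)) / n
    # baseline: always predict the empirical class frequencies n1/n and n0/n
    nllb = -(n0 / n * math.log(max(n0 / n, eps)) + n1 / n * math.log(max(n1 / n, eps)))
    return (nllb - nll) / nllb

# e.g. nce([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]) > 0 means better than the baseline
```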
3.3 Classification Error Rate
CER: ratio of samples with a wrong binary (correct/incorrect) prediction
Threshold optimization:
Sentence-level experiments: on the test set
Word-level experiments: on the validation set
Baseline (always guess the more frequent class): CERb = min(n0, n1) / n
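A minimal sketch of CER, its baseline, and the threshold search described above; the grid search and function names are assumptions.

```python
# Classification error rate, its majority-class baseline, and threshold tuning (sketch).
def cer(p_correct, labels, threshold=0.5):
    wrong = sum((p >= threshold) != (c == 1) for p, c in zip(p_correct, labels))
    return wrong / len(labels)

def cer_baseline(labels):
    n1 = sum(labels)
    n0 = len(labels) - n1
    return min(n0, n1) / len(labels)      # always guessing the more frequent class

def best_threshold(p_correct, labels, steps=100):
    # pick the threshold minimizing CER (on the test set for sentence level,
    # on the validation set for word level, as in the experiments above)
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: cer(p_correct, labels, t))
```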
3.3 Receiver Operating Characteristic (ROC)
Counts by fact (rows) vs. prediction (columns):
                 predicted correct   predicted incorrect
fact correct            a                    b
fact incorrect          c                    d
Correct-accept ratio = a / (a + b)
Correct-reject ratio = d / (c + d)
The ROC curve plots the correct-accept ratio against the correct-reject ratio as the acceptance threshold varies; a random confidence score gives the diagonal
IROC: the area under the ROC curve
Cf. recall = a / (a + b), precision = a / (a + c)
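A sketch computing the two ratios and the area under the resulting curve; the threshold sweep and trapezoid integration are my own choices.

```python
# Correct-accept / correct-reject ratios and IROC (area under the CAR-CRR curve), sketch.
def car_crr(p_correct, labels, threshold):
    a = sum(1 for p, c in zip(p_correct, labels) if c == 1 and p >= threshold)  # correct, accepted
    b = sum(1 for p, c in zip(p_correct, labels) if c == 1 and p < threshold)   # correct, rejected
    c_ = sum(1 for p, c in zip(p_correct, labels) if c == 0 and p >= threshold) # incorrect, accepted
    d = sum(1 for p, c in zip(p_correct, labels) if c == 0 and p < threshold)   # incorrect, rejected
    car = a / (a + b) if a + b else 0.0
    crr = d / (c_ + d) if c_ + d else 0.0
    return car, crr

def iroc(p_correct, labels, steps=200):
    # Sweep the acceptance threshold and integrate CRR over CAR (trapezoid rule).
    # IROC is about 0.5 for a random confidence score and 1.0 for a perfect one.
    pts = {car_crr(p_correct, labels, i / steps) for i in range(steps + 1)}
    pts |= {(0.0, 1.0), (1.0, 0.0)}       # reject-everything / accept-everything endpoints
    pts = sorted(pts)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```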
4 Sentence Level Experiments
MT evaluation measures:
WERg: normalized word error rate
NIST: sentence-level NIST score
“Correctness” definition: thresholding WERg or thresholding NIST
Threshold value chosen so that either 5% or 30% of the examples are “correct”
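A small sketch of how such a quantile-based threshold can be derived; the function name and NumPy usage are mine.

```python
# Turning a continuous MT evaluation score into a binary "correct" label by
# thresholding at a quantile, so that e.g. 5% or 30% of the examples are correct (sketch).
import numpy as np

def label_correct(scores, fraction_correct=0.30, higher_is_better=True):
    scores = np.asarray(scores, dtype=float)
    if higher_is_better:                                    # e.g. sentence-level NIST
        threshold = np.quantile(scores, 1.0 - fraction_correct)
        return scores >= threshold
    threshold = np.quantile(scores, fraction_correct)       # e.g. WERg (lower is better)
    return scores <= threshold
```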
4.1 Features
Total of 91 sentence-level features
Base-Model-Intrinsic
• Output from 12 functions of the maximum-entropy-based base system
• Pruning statistics
N-best List
• Rank, score ratio to the best hypothesis, etc.
Source Sentence
• Length, n-gram frequency statistics, etc.
Target Sentence
• LM scores, parenthesis matching, etc. (see the sketch after this list)
Source/Target Correspondence
• IBM Model 1 probabilities, semantic similarity, etc.
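To make the target-side features concrete, here is an illustrative version of two of them (length ratio and parenthesis matching); the exact feature definitions in the paper may differ, and the function and key names are mine.

```python
# Illustrative target-sentence features: length ratio and parenthesis matching (sketch).
def target_features(src: str, hyp: str) -> dict:
    src_len = len(src.split())
    hyp_len = len(hyp.split())
    balance, ok = 0, True
    for ch in hyp:
        if ch == "(":
            balance += 1
        elif ch == ")":
            balance -= 1
            ok = ok and balance >= 0      # a ")" must never come before its "("
    return {
        "hyp_length": hyp_len,
        "length_ratio": hyp_len / src_len if src_len else 0.0,
        "paren_matched": float(ok and balance == 0),
    }

# e.g. target_features("他 说 （ 好 ）", "he said (fine)")
#      -> {'hyp_length': 3, 'length_ratio': 0.6, 'paren_matched': 1.0}
```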
4.2 MLP Experiments
MLPs are trained on all features for the four problem settings (NIST / WERg correctness, 5% / 30% “correct”)
[Table 2: CER of the strong CE (classification) and weak CE (regression) MLPs against the baseline, for NIST- and WERg-based correctness]
Classification models are better than the regression model
Performance is better than the baseline
4.3 Feature Comparison
Compare the contributions of features
Individual features
Groups of features
All: all features
Base: base-model scores
BD: base-model dependent
BI: base-model independent
S: apply to the source sentence
T: apply to the target sentence
ST: apply to both source and target sentences
4.3 Feature Comparison (results)
Experimental condition: NIST, 30% “correct”
Feature sets compared: All, Base, BD, BI, S, T, ST (Table 3, Figure 1)
Base comes close to All
BD > BI
T > ST > S
CE layer > no CE layer
5 Word Level Experiments
Definition of word correctness
A word is correct if:
Pos: it occurs at exactly the same position as in the reference
WER: it is aligned to a reference word in the WER (edit-distance) alignment
PER: it occurs anywhere in the reference
A “best” reference is selected from the multiple references
Ratio of “correct” words: Pos (15%) < WER (43%) < PER (64%)
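Two of these correctness definitions are easy to sketch (Pos and PER); the WER definition additionally needs an edit-distance alignment, which is omitted here. Function names are mine.

```python
# Pos- and PER-style word correctness labels against a single reference (sketch).
def pos_correct(hyp_words, ref_words):
    # correct iff the word appears at exactly the same position in the reference
    return [i < len(ref_words) and w == ref_words[i] for i, w in enumerate(hyp_words)]

def per_correct(hyp_words, ref_words):
    # position-independent: correct iff the word occurs anywhere in the reference
    ref = set(ref_words)
    return [w in ref for w in hyp_words]

# e.g. pos_correct("the red house".split(), "the house is red".split()) -> [True, False, False]
#      per_correct("the red house".split(), "the house is red".split()) -> [True, True, True]
```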
5.1 Features
Total of 17 features
SMT-model-based features (2)
• Identity of the alignment template; whether or not the word was translated by a rule
IBM Model 1 (1)
• Averaged word translation probability
Word posterior and related measures (3 × 3 = 9)
• Three scores: relative frequency, rank-weighted frequency, word posterior probability
• Each in three variants: any (WPP-any), source (WPP-source), target (WPP-target); see the sketch after this list
Target-language-based features (3 + 2)
• Semantic features from WordNet
• Syntax check, number of occurrences in the sentence
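As an illustration of the WPP-any style of feature, here is a sketch computing the (optionally rank-weighted) relative frequency with which a word appears anywhere in the N-best hypotheses; the weighting scheme is an assumption, not the paper's exact definition.

```python
# WPP-any style feature: how often a word occurs anywhere in the N-best list (sketch).
def wpp_any(word, nbest, rank_weighted=False):
    if not nbest:
        return 0.0
    if rank_weighted:
        weights = [1.0 / (rank + 1) for rank in range(len(nbest))]   # illustrative weighting
    else:
        weights = [1.0] * len(nbest)                                  # plain relative frequency
    hit = sum(w for w, hyp in zip(weights, nbest) if word in hyp.split())
    return hit / sum(weights)

# e.g. wpp_any("house", ["the house is red", "a house is red", "the home is red"]) -> 2/3
```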
5.2 Performance of Single Features
Experimental setting:
Naïve Bayes classifier
PER-based correctness
WPP-any gives the best results
WPP-any > Model 1 > WPP-source
Top 3 combined > any single feature
No further gain from using ALL features
(Table 4)
5.3 Comparison of Different Models
Naïve Bayes and MLPs with different numbers of hidden units (MLP0, MLP5, MLP10, MLP20)
All features, PER-based correctness
Naïve Bayes < MLP5 (Figure 2)
5.4 Comparison of Word Error Measures
Experimental settings: MLP20, all features
PER is the easiest to learn (Table 5)
6 Conclusion
A separate CE layer is useful
Features derived from the base model are better than external ones
N-best-based features are valuable
Target-based features are more valuable than non-target-based ones
MLPs with hidden units are better than naïve Bayes