Confidence Estimation for Machine Translation
J. Blatz et al., COLING 2004
SSLI MTRG 11/17/2004
Takahiro Shinozaki
Abstract
Detailed study of CE for machine translation
Various machine learning methods
CE for sentences and for words
Different definitions of correctness
Experiments: NIST 2003 Chinese-to-English MT evaluation
1 Introduction
CE can improve the usability of NLP-based systems
CE techniques are not well studied in machine translation
Investigate sentence- and word-level CE
2 Background
Strong vs. weak CE
Strong CE: requires a probability
The CE score is a correctness probability; applying a threshold turns it into a binary output
Weak CE: requires only binary classification
The CE score need not be a probability; only the thresholded binary output matters
2 Background
With or without a distinct CE layer
No distinct CE layer: the NLP system maps input x to output y and produces the confidence score itself
Distinct CE layer: a separate CE module (naïve Bayes, NN, SVM, etc.) takes the system's input x and output y and produces the confidence score
Requires a training corpus, but is powerful and modular
3 Experimental Setting
Input source sentences (Src) are translated by the ISI alignment-template MT system into N-best hypotheses (Hyp)
Each hypothesis is labelled correct or not (C) against the reference sentences
The labelled data are split into training, validation, and test sets
3.1 Corpora
Chinese-to-English
Evaluation sets from NIST MT competitions
Multi-reference corpus from LDC
3.2 CE Techniques
Data: a collection of pairs (x, c)
x: feature vector, c: correctness
Weak CE: x → MLP → score (regressing an MT evaluation score)
Strong CE: x → naïve Bayes → P(c=1|x), or x → MLP → P(c=1|x)
3.2 Naïve Bayes (NB)
Assume the features x = (x_1, x_2, …, x_D) are statistically independent given the class:
P(c | x) ∝ P(c) P(x | c) = P(c) ∏_{d=1}^{D} P(x_d | c)
Apply absolute discounting to the feature probability estimates
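As a concrete illustration of the naïve Bayes confidence estimator above, here is a minimal sketch assuming discrete (binned) feature values; the class structure, discount constant, and method names are my own, not from the paper.

```python
# Minimal naive Bayes confidence estimator over discrete (binned) feature values.
# Absolute discounting: subtract a small constant from every observed count and
# redistribute the freed mass uniformly over unseen feature values.
from collections import Counter, defaultdict

class NaiveBayesCE:
    def __init__(self, discount=0.5):
        self.discount = discount

    def fit(self, X, c):
        # X: list of feature vectors (tuples of ints), c: list of 0/1 correctness labels
        self.classes = sorted(set(c))
        self.prior = {k: sum(1 for ci in c if ci == k) / len(c) for k in self.classes}
        self.counts = defaultdict(Counter)   # (d, class) -> counts of feature values
        self.values = defaultdict(set)       # d -> all values seen for feature d
        for x, ci in zip(X, c):
            for d, v in enumerate(x):
                self.counts[(d, ci)][v] += 1
                self.values[d].add(v)
        return self

    def _p(self, d, v, k):
        cnt = self.counts[(d, k)]
        total = sum(cnt.values())
        if total == 0:
            return 1.0 / max(len(self.values[d]), 1)
        if cnt[v] > 0:
            return (cnt[v] - self.discount) / total
        unseen = max(len(self.values[d]) - len(cnt), 1)
        return (self.discount * len(cnt)) / (total * unseen)

    def predict_proba(self, x):
        # P(c | x) proportional to P(c) * prod_d P(x_d | c), normalized over the classes
        score = {k: self.prior[k] for k in self.classes}
        for d, v in enumerate(x):
            for k in self.classes:
                score[k] *= self._p(d, v, k)
        z = sum(score.values()) or 1.0
        return {k: s / z for k, s in score.items()}

# e.g. NaiveBayesCE().fit([(0, 1), (1, 1), (1, 0)], [0, 1, 1]).predict_proba((1, 1))
```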
3.2 Multi-Layer Perceptron (MLP)
Non-linear mapping of input features
Linear transformation layers
Non-linear transfer functions
Parameter estimation
Weak CE (Regression)
• Target: MT evaluation score
• Minimizing a squared error loss
Strong CE (Classification)
• Target: Binary correct/incorrect class
• Minimizing negative log likelihood
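To make the regression/classification distinction concrete, here is a sketch using scikit-learn; it is illustrative only (the paper does not use scikit-learn), and the toy data, hidden-layer size, and 30%-correct labelling are assumptions.

```python
# Weak CE (regression) vs. strong CE (classification) with MLPs -- illustrative sketch.
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 91))                 # toy stand-in for 91 sentence-level features
nist = rng.uniform(0, 10, size=200)            # toy sentence-level NIST scores
correct = (nist >= np.quantile(nist, 0.7)).astype(int)   # top 30% labelled "correct"

# Weak CE: regress the MT evaluation score, minimizing a squared-error loss.
weak = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000).fit(X, nist)
weak_score = weak.predict(X)                   # only useful for ranking / thresholding

# Strong CE: classify correct/incorrect, minimizing the negative log-likelihood.
strong = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, correct)
p_correct = strong.predict_proba(X)[:, 1]      # estimate of P(c = 1 | x)
```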
3.3 Metrics for Evaluation
Strong CE metric:
Evaluates probability distribution
Normalized cross entropy (NCE)
Weak CE metrics:
Evaluates discriminability
Classification error rate (CER)
Receiver operating characteristic (ROC)
3.3 Normalized Cross Entropy
Cross entropy (average negative log-likelihood) of the probabilities estimated by the CE module:
NLL = -(1/n) Σ_i log P(c_i | x_i)
Baseline cross entropy, using the empirical class probabilities obtained from the test set:
NLLb = -[ (n0/n) log(n0/n) + (n1/n) log(n1/n) ]
Normalized cross entropy:
NCE = (NLLb - NLL) / NLLb
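A small sketch of the NCE computation under the definitions above; the function name and epsilon clipping are my own.

```python
# Normalized cross entropy of predicted correctness probabilities (sketch).
import math

def nce(p_correct, labels, eps=1e-12):
    n = len(labels)
    n1 = sum(labels)                     # number of correct examples
    n0 = n - n1
    # model cross entropy: average negative log-likelihood of the true labels
    nll = -sum(math.log(max(p if c == 1 else 1.0 - p, eps))
               for p, c in zip(p_correct, labels)) / n
    # baseline: always predict the empirical class frequencies n1/n and n0/n
    nllb = -(n0 / n * math.log(max(n0 / n, eps)) + n1 / n * math.log(max(n1 / n, eps)))
    return (nllb - nll) / nllb

# e.g. nce([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]) > 0 means better than the baseline
```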
3.3 Classification Error Rate
CER: ratio of samples with a wrong binary (correct/incorrect) prediction
Threshold optimization:
Sentence-level experiments: on the test set
Word-level experiments: on the validation set
Baseline (always guess the more frequent class): CERb = min(n0, n1) / n
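A minimal sketch of CER, its baseline, and the threshold search described above; the grid search and function names are assumptions.

```python
# Classification error rate, its majority-class baseline, and threshold tuning (sketch).
def cer(p_correct, labels, threshold=0.5):
    wrong = sum((p >= threshold) != (c == 1) for p, c in zip(p_correct, labels))
    return wrong / len(labels)

def cer_baseline(labels):
    n1 = sum(labels)
    n0 = len(labels) - n1
    return min(n0, n1) / len(labels)      # always guessing the more frequent class

def best_threshold(p_correct, labels, steps=100):
    # pick the threshold minimizing CER (on the test set for sentence level,
    # on the validation set for word level, as in the experiments above)
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: cer(p_correct, labels, t))
```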
3.3 Receiver Operating Characteristic (ROC)
Counts by fact (rows) vs. prediction (columns):
                 predicted correct   predicted incorrect
fact correct            a                    b
fact incorrect          c                    d
Correct-accept ratio = a / (a + b)
Correct-reject ratio = d / (c + d)
The ROC curve plots the correct-accept ratio against the correct-reject ratio as the acceptance threshold varies; a random confidence score gives the diagonal
IROC: the area under the ROC curve
Cf. recall = a / (a + b), precision = a / (a + c)
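A sketch computing the two ratios and the area under the resulting curve; the threshold sweep and trapezoid integration are my own choices.

```python
# Correct-accept / correct-reject ratios and IROC (area under the CAR-CRR curve), sketch.
def car_crr(p_correct, labels, threshold):
    a = sum(1 for p, c in zip(p_correct, labels) if c == 1 and p >= threshold)  # correct, accepted
    b = sum(1 for p, c in zip(p_correct, labels) if c == 1 and p < threshold)   # correct, rejected
    c_ = sum(1 for p, c in zip(p_correct, labels) if c == 0 and p >= threshold) # incorrect, accepted
    d = sum(1 for p, c in zip(p_correct, labels) if c == 0 and p < threshold)   # incorrect, rejected
    car = a / (a + b) if a + b else 0.0
    crr = d / (c_ + d) if c_ + d else 0.0
    return car, crr

def iroc(p_correct, labels, steps=200):
    # Sweep the acceptance threshold and integrate CRR over CAR (trapezoid rule).
    # IROC is about 0.5 for a random confidence score and 1.0 for a perfect one.
    pts = {car_crr(p_correct, labels, i / steps) for i in range(steps + 1)}
    pts |= {(0.0, 1.0), (1.0, 0.0)}       # reject-everything / accept-everything endpoints
    pts = sorted(pts)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```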
4 Sentence Level Experiments
MT evaluation measures:
WERg: normalized word error rate
NIST: sentence-level NIST score
“Correctness” definition: thresholding WERg or thresholding NIST
Threshold value chosen so that either 5% or 30% of the examples are “correct”
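A small sketch of how such a quantile-based threshold can be derived; the function name and NumPy usage are mine.

```python
# Turning a continuous MT evaluation score into a binary "correct" label by
# thresholding at a quantile, so that e.g. 5% or 30% of the examples are correct (sketch).
import numpy as np

def label_correct(scores, fraction_correct=0.30, higher_is_better=True):
    scores = np.asarray(scores, dtype=float)
    if higher_is_better:                                    # e.g. sentence-level NIST
        threshold = np.quantile(scores, 1.0 - fraction_correct)
        return scores >= threshold
    threshold = np.quantile(scores, fraction_correct)       # e.g. WERg (lower is better)
    return scores <= threshold
```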
4.1 Features
Total of 91 sentence-level features
Base-Model-Intrinsic
• Output from 12 functions of the maximum-entropy-based base system
• Pruning statistics
N-best List
• Rank, score ratio to the best hypothesis, etc.
Source Sentence
• Length, n-gram frequency statistics, etc.
Target Sentence
• LM scores, parenthesis matching, etc. (see the sketch after this list)
Source/Target Correspondence
• IBM Model 1 probabilities, semantic similarity, etc.
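To make the target-side features concrete, here is an illustrative version of two of them (length ratio and parenthesis matching); the exact feature definitions in the paper may differ, and the function and key names are mine.

```python
# Illustrative target-sentence features: length ratio and parenthesis matching (sketch).
def target_features(src: str, hyp: str) -> dict:
    src_len = len(src.split())
    hyp_len = len(hyp.split())
    balance, ok = 0, True
    for ch in hyp:
        if ch == "(":
            balance += 1
        elif ch == ")":
            balance -= 1
            ok = ok and balance >= 0      # a ")" must never come before its "("
    return {
        "hyp_length": hyp_len,
        "length_ratio": hyp_len / src_len if src_len else 0.0,
        "paren_matched": float(ok and balance == 0),
    }

# e.g. target_features("他 说 （ 好 ）", "he said (fine)")
#      -> {'hyp_length': 3, 'length_ratio': 0.6, 'paren_matched': 1.0}
```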
4.2 MLP Experiments
MLPs are trained on all features for the four problem settings (NIST / WERg correctness, 5% / 30% “correct”)
[Table 2: CER of the strong CE (classification) and weak CE (regression) MLPs against the baseline, for NIST- and WERg-based correctness]
Classification models are better than the regression model
Performance is better than the baseline
4.3 Feature Comparison
Compare the contributions of features
Individual features
Groups of features
All: all features
Base: base-model scores
BD: base-model dependent
BI: base-model independent
S: apply to the source sentence
T: apply to the target sentence
ST: apply to both source and target sentences
4.3 Feature Comparison (results)
Experimental condition: NIST, 30% “correct”
Feature sets compared: All, Base, BD, BI, S, T, ST (Table 3, Figure 1)
Base comes close to All
BD > BI
T > ST > S
CE layer > no CE layer
5 Word Level Experiments
Definition of word correctness
A word is correct if:
Pos: it occurs at exactly the same position as in the reference
WER: it is aligned to a reference word in the WER (edit-distance) alignment
PER: it occurs anywhere in the reference
A “best” reference is selected from the multiple references
Ratio of “correct” words: Pos (15%) < WER (43%) < PER (64%)
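Two of these correctness definitions are easy to sketch (Pos and PER); the WER definition additionally needs an edit-distance alignment, which is omitted here. Function names are mine.

```python
# Pos- and PER-style word correctness labels against a single reference (sketch).
def pos_correct(hyp_words, ref_words):
    # correct iff the word appears at exactly the same position in the reference
    return [i < len(ref_words) and w == ref_words[i] for i, w in enumerate(hyp_words)]

def per_correct(hyp_words, ref_words):
    # position-independent: correct iff the word occurs anywhere in the reference
    ref = set(ref_words)
    return [w in ref for w in hyp_words]

# e.g. pos_correct("the red house".split(), "the house is red".split()) -> [True, False, False]
#      per_correct("the red house".split(), "the house is red".split()) -> [True, True, True]
```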
5.1 Features
Total of 17 features
SMT-model-based features (2)
• Identity of the alignment template; whether or not the word was translated by a rule
IBM Model 1 (1)
• Averaged word translation probability
Word posterior and related measures (3 × 3 = 9)
• Three scores: relative frequency, rank-weighted frequency, word posterior probability
• Each in three variants: any (WPP-any), source (WPP-source), target (WPP-target); see the sketch after this list
Target-language-based features (3 + 2)
• Semantic features from WordNet
• Syntax check, number of occurrences in the sentence
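As an illustration of the WPP-any style of feature, here is a sketch computing the (optionally rank-weighted) relative frequency with which a word appears anywhere in the N-best hypotheses; the weighting scheme is an assumption, not the paper's exact definition.

```python
# WPP-any style feature: how often a word occurs anywhere in the N-best list (sketch).
def wpp_any(word, nbest, rank_weighted=False):
    if not nbest:
        return 0.0
    if rank_weighted:
        weights = [1.0 / (rank + 1) for rank in range(len(nbest))]   # illustrative weighting
    else:
        weights = [1.0] * len(nbest)                                  # plain relative frequency
    hit = sum(w for w, hyp in zip(weights, nbest) if word in hyp.split())
    return hit / sum(weights)

# e.g. wpp_any("house", ["the house is red", "a house is red", "the home is red"]) -> 2/3
```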
5.2 Performance of Single Features
Experimental setting:
Naïve Bayes classifier
PER-based correctness
WPP-any gives the best results
WPP-any > Model 1 > WPP-source
Top 3 combined > any single feature
No further gain from using ALL features
(Table 4)
5.3 Comparison of Different Models
Naïve Bayes and MLPs with different numbers of hidden units (MLP0, MLP5, MLP10, MLP20)
All features, PER-based correctness
Naïve Bayes < MLP5 (Figure 2)
5.4 Comparison of Word Error Measures
Experimental settings: MLP20, all features
PER is the easiest to learn (Table 5)
6 Conclusion
A separate CE layer is useful
Features derived from the base model are better than external ones
N-best-based features are valuable
Target-based features are more valuable than non-target-based ones
MLPs with hidden units are better than naïve Bayes