Maximum F1-Score
Discriminative Training Criterion for
Automatic Mispronunciation Detection
Hao Huang, Haihua Xu, Xianhui Wang,
Wushour Silamu
Yaochi Hsu
2015/05/12
1
Outline
• Abstract
• Introduction
• The MFC Objective Function
• Model Space Discriminative Training
• Feature Space Discriminative Training
• Experiments And Results
• Conclusion And Future Work
2
Abstract
• We carry out an in-depth investigation of a newly proposed Maximum F1-score Criterion (MFC) discriminative training objective function for Goodness of Pronunciation (GOP) based automatic mispronunciation detection that uses Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) acoustic models.
• We present a model-space training algorithm for MFC that uses extended Baum-Welch-like update equations based on the weak-sense auxiliary function method.
• We then present MFC based feature-space discriminative training.
3
Abstract (cont.)
• Mispronunciation detection experiments show that MFC based model-space training and feature-space training are effective in improving the F1-score and other commonly used evaluation metrics.
• Further, we review and compare mispronunciation detection results with
the use of MFC and some traditional training criteria that minimize word
error rate in speech recognition.
4
Introduction
•
Automatic mispronunciation detection, which aims to help the learner by automatically pinpointing erroneous pronunciations, is one of the most widely deployed applications.
•
A major approach to mispronunciation detection is based on automatic speech
recognition (ASR) technologies.
•
There are two types of ASR based mispronunciation detection techniques:
1. One uses confidence scores, such as the posterior probability, to measure the correctness of a pronunciation (GOP).
2. The other uses a phone recognizer to decode the input waveforms with extended pronunciation networks that include both correct and incorrect pronunciations to capture possible error types.
5
Introduction (cont.)
• An alternative to ASR based approach is to use acoustic phonetic features
as front-end and a classifier as a back-end.
• Mispronunciation detection can be formulated more suitably as a classification problem, and thus more discriminative features and classifiers can be explored (DBN, DNN, SVM).
• In this paper, we use GOP based mispronunciation detection method and
use GMM-HMM based acoustic models to compute GOP scores.
6
Introduction (cont.)
• In ASR, discriminative training (DT) of acoustic models has been widely used and has proved to give significant improvements over the traditional ML estimation method.
• Minimum Classification Error (MCE), Maximum Mutual Information (MMI), Minimum Phone Error (MPE)
• In ASR, the performance of a system is often evaluated in terms of Word
Error Rate (WER).
• In mispronunciation detection, the commonly used metrics include False Rejections, False Acceptances, Precision, and Recall.
7
Introduction (cont.)
• The F1-score is nowadays an important metric when evaluating the
performance of a natural language processing (NLP) system or an
information retrieval (IR) system.
• Recently, researchers began to refine system parameters by directly
maximizing the F1-score for logistic regression based classifiers in NLP.
• The training objective function is a smoothed form of F1-score function,
denoted as Maximum F1-score Criterion (MFC).
8
The MFC Objective Function
$$\mathrm{GOP}(O_{r,n}, q_{r,n}) = \frac{1}{T_{r,n}} \log \frac{p(O_{r,n} \mid q_{r,n})\, P(q_{r,n})}{\sum_{q \in Q(r,n)} p(O_{r,n} \mid q)\, P(q)}$$

$$\mathrm{GOP}(O_{r,n}, q_{r,n}) > \tau \;\Rightarrow\; \text{Yes (correct pronunciation)}; \quad \text{otherwise No (mispronunciation)}$$
10
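As a concrete sketch of the GOP computation above (assuming per-segment total log-likelihoods log p(O|q) and log-priors log P(q) are available from a recognition pass; all function names and inputs here are illustrative, not from the paper):

```python
import math

def gop_score(ll_canonical, lp_canonical, ll_competitors, lp_competitors, T):
    """Duration-normalized GOP of the canonical phone q_{r,n} against the
    competing set Q(r,n). Inputs are hypothetical total log-likelihoods
    log p(O|q) and log-priors log P(q); T is the segment length T_{r,n}."""
    log_num = ll_canonical + lp_canonical
    # log-sum-exp over the competing set gives the log denominator
    terms = [ll + lp for ll, lp in zip(ll_competitors, lp_competitors)]
    m = max(terms)
    log_den = m + math.log(sum(math.exp(t - m) for t in terms))
    return (log_num - log_den) / T

def detect(gop, tau):
    # GOP above the threshold tau -> accepted as a correct pronunciation
    return "correct" if gop > tau else "mispronunciation"
```

When the canonical phone is included in Q(r, n) and the priors are equal, the score is at most 0 and approaches 0 as the canonical phone dominates the competing set.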
The MFC Objective Function (cont.)
0 < κ < 1 is an exponential scaling factor commonly applied in discriminative training to reduce the dynamic range of the probabilities.

$$G(r,n) = \log \frac{P(O_{r,n} \mid q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q \in Q(r,n)} P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}$$

$$d(r,n) = -G(r,n) + \tau$$

$$\mathbb{1}(d(r,n)) = \begin{cases} 1 & \text{if } d(r,n) > 0 \quad (\text{mispronunciation}) \\ 0 & \text{if } d(r,n) \le 0 \quad (\text{correct pronunciation}) \end{cases}$$
11
The MFC Objective Function (cont.)
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2N_{WW}}{N_D + N_W}$$

$$\mathrm{Precision} = \frac{N_{WW}}{N_D} \times 100\%, \qquad \mathrm{Recall} = \frac{N_{WW}}{N_W} \times 100\%$$

$$N_{WW} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} \mathbb{1}(d(r,n)) \cdot E(r,n), \qquad N_D = \sum_{r=1}^{R}\sum_{n=1}^{N_r} \mathbb{1}(d(r,n))$$

 $N_{WW}$ is the number of phones marked as mispronunciations by both the computer and the human evaluator.
 $N_D$ is the total number of mispronunciations detected by the machine.
 $N_W$ is the number of mispronunciations judged by the human evaluator.
 $E(r,n)$ is the human-annotated result of segment $(r,n)$: $E(r,n) = 1$ if it is marked as a mispronunciation, and 0 otherwise.
12
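To make the counting concrete, a toy sketch computing N_WW, N_D, N_W and the resulting metrics from flattened decision scores d(r, n) and labels E(r, n) (illustrative inputs, not from the paper):

```python
def f1_counts(d, E):
    """Hard F1 from decision scores d and human labels E, with all
    segments (r, n) flattened into flat lists; d > 0 means the system
    flags a mispronunciation (a sketch of the slide's counting)."""
    flagged = [di > 0 for di in d]
    n_ww = sum(1 for f, e in zip(flagged, E) if f and e == 1)  # agreed mispronunciations
    n_d = sum(flagged)                                          # machine detections N_D
    n_w = sum(E)                                                # human-marked N_W
    precision = n_ww / n_d if n_d else 0.0
    recall = n_ww / n_w if n_w else 0.0
    f1 = 2 * n_ww / (n_d + n_w) if (n_d + n_w) else 0.0
    return precision, recall, f1
```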
The MFC Objective Function (cont.)
$$S(u) = \frac{1}{1 + \exp(-\xi u)}$$

$$N_{WW}^{S} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n)) \cdot E(r,n), \qquad N_{D}^{S} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n))$$

$$\mathcal{F}_{MFC} = \frac{2N_{WW}^{S}}{N_{D}^{S} + N_W} = \frac{2\sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n)) \cdot E(r,n)}{\sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n)) + N_W}$$
 However, the F1-score is not differentiable because of the step indicator function, which makes it difficult to optimize with a gradient-based method; the sigmoid S(u) replaces the indicator to give the smoothed objective $\mathcal{F}_{MFC}$.
13
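A quick numeric sketch of the smoothing: replacing the indicator with S(u) makes the objective differentiable, and for large ξ the smoothed value approaches the hard F1 (toy inputs, not from the paper; for these inputs the hard F1 is 0.5):

```python
import math

def smoothed_f1(d, E, xi):
    """Smoothed F_MFC over flattened decision scores d and labels E:
    the step indicator is replaced by S(u) = 1/(1 + exp(-xi*u));
    N_W stays a hard human count (a sketch, illustrative inputs)."""
    S = [1.0 / (1.0 + math.exp(-xi * di)) for di in d]
    n_ww_s = sum(s * e for s, e in zip(S, E))  # smoothed N_WW^S
    n_d_s = sum(S)                              # smoothed N_D^S
    n_w = sum(E)                                # hard N_W
    return 2 * n_ww_s / (n_d_s + n_w)
```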
Model Space Discriminative Training
• Model-space discriminative training optimizes the MFC objective function by updating the GMM-HMM parameters.
• Various optimization methods have been tried for discriminative training of GMM-HMMs:
– MCE: Generalized Probabilistic Descent (GPD) algorithm
– MMI: Stochastic Gradient Ascent (SGA) and Resilient Propagation (RProp)
– MMI, MPE and Boosted MMI (BMMI): the weak-sense auxiliary function method (there is no need to determine an appropriate learning rate or to use second-order statistics).
14
Model Space Discriminative Training (cont.)
The weak-sense auxiliary function $Q(\theta, \bar\theta)$ must satisfy

$$\left.\frac{\partial Q(\theta, \bar\theta)}{\partial \theta}\right|_{\theta = \bar\theta} = \left.\frac{\partial \mathcal{F}(\theta)}{\partial \theta}\right|_{\theta = \bar\theta}$$

For MFC, it is constructed as

$$Q(\theta, \bar\theta) = \sum_{q} \left.\frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_q \mid q)}\right|_{\theta = \bar\theta} \log p(O_q \mid q) + Q^{S}(\theta, \bar\theta)$$
 $\bar\theta$ represents the current model parameters and $\theta$ the parameters to be estimated.
 $q$ is a phone arc within the training confusion networks, and $O_q$ represents the observation sequence in arc $q$.
 The smoothing term $Q^{S}(\theta, \bar\theta)$ is added on the right-hand side to ensure concavity of the auxiliary function and consequently to improve stability in optimization.
 It must satisfy the constraint $\partial Q^{S}(\theta, \bar\theta)/\partial\theta\,|_{\theta = \bar\theta} = 0$ so that the result is still a valid weak-sense auxiliary function.
15
Model Space Discriminative Training (cont.)
$$Q^{S}(\theta, \bar\theta) = -\frac{1}{2} \sum_{s=1}^{S} \sum_{m=1}^{M_s} D_{sm} \left\{ \log\!\left(2\pi\sigma_{sm}^{2}\right) + \frac{\bar\sigma_{sm}^{2} + \bar\mu_{sm}^{2} - 2\mu_{sm}\bar\mu_{sm} + \mu_{sm}^{2}}{\sigma_{sm}^{2}} \right\}$$

 $S$ is the number of states in the HMM set, and $M_s$ is the number of Gaussians in state $s$.
 $\bar\mu_{sm}$ and $\bar\sigma_{sm}$ are respectively the current mean and variance for Gaussian $m$ in state $s$.
 $\mu_{sm}$ and $\sigma_{sm}$ are the mean and variance to be updated.
 $D_{sm}$ is a Gaussian-dependent factor.
16
Model Space Discriminative Training (cont.)
$$\psi_q^{MFC} = \frac{1}{\kappa} \frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_q \mid q)}$$

$$U_{sm}^{l}(t) = \begin{cases} \psi_{sm}^{MFC}(t) & \text{if } l = \text{numerator} \\ -\psi_{sm}^{MFC}(t) & \text{if } l = \text{denominator} \end{cases}$$

$$\beta_{sm}^{l} = \sum_{t=1}^{T} \max(0, U_{sm}^{l}(t))$$

$$X_{sm}^{l} = \sum_{t=1}^{T} \max(0, U_{sm}^{l}(t)) \cdot o(t), \qquad Y_{sm}^{l} = \sum_{t=1}^{T} \max(0, U_{sm}^{l}(t)) \cdot o^{2}(t)$$

 where $\psi_{sm}^{MFC}(t) = \psi_q^{MFC}\, \psi_{q,sm}(t)$, and $\psi_{q,sm}(t)$ is the posterior probability of being in state $s$ and Gaussian mixture $m$ of arc $q$ at time $t$, which can be obtained via a forward-backward pass within arc $q$.
17
Model Space Discriminative Training (cont.)
$$\mu_{sm} = \frac{X_{sm}^{n} - X_{sm}^{d} + D_{sm}\,\bar\mu_{sm}}{\beta_{sm}^{n} - \beta_{sm}^{d} + D_{sm}}$$

$$\sigma_{sm}^{2} = \frac{Y_{sm}^{n} - Y_{sm}^{d} + D_{sm}\left(\bar\sigma_{sm}^{2} + \bar\mu_{sm}^{2}\right)}{\beta_{sm}^{n} - \beta_{sm}^{d} + D_{sm}} - \mu_{sm}^{2}$$

$$D_{sm} = E\,\beta_{sm}^{d}$$

where the superscripts $n$ and $d$ denote the numerator and denominator statistics.
 The smoothing factor 𝐷𝑠𝑚 is empirically determined for each Gaussian
component.
 where 𝐸 is a global constant controlling the update speed.
 A large value of 𝐸 will slow down the convergence speed.
 Typically a constant E = 3.0 was chosen in the experiments.
18
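The update above can be sketched for a single scalar Gaussian (superscripts n/d are the numerator/denominator statistics, and D = E·β^d as on the slide; this toy version is for intuition only, not the paper's full implementation):

```python
def ebw_update(beta_n, beta_d, x_n, x_d, y_n, y_d, mu_bar, var_bar, E=3.0):
    """Extended Baum-Welch style mean/variance update from accumulated
    numerator (n) and denominator (d) statistics for one scalar Gaussian.
    mu_bar/var_bar are the current parameters; returns the updated pair."""
    D = E * beta_d                      # Gaussian-dependent smoothing factor
    denom = beta_n - beta_d + D
    mu_new = (x_n - x_d + D * mu_bar) / denom
    var_new = (y_n - y_d + D * (var_bar + mu_bar ** 2)) / denom - mu_new ** 2
    return mu_new, var_new
```

Note that when the numerator and denominator statistics cancel, the smoothing term keeps the model at its current parameters, which is the stabilizing role of D.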
Model Space Discriminative Training (cont.)
$$\mathcal{F}_{MFC} = \frac{2N_{WW}^{S}}{N_{D}^{S} + N_{W}}, \qquad N_{WW}^{S} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n)) \cdot E(r,n), \qquad N_{D}^{S} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n))$$

$$\frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_{r,n} \mid q)} = \frac{\partial \mathcal{F}_{MFC}}{\partial N_{WW}^{S}} \frac{\partial N_{WW}^{S}}{\partial \log p(O_{r,n} \mid q)} + \frac{\partial \mathcal{F}_{MFC}}{\partial N_{D}^{S}} \frac{\partial N_{D}^{S}}{\partial \log p(O_{r,n} \mid q)}$$

$$= \frac{2}{N_{D}^{S} + N_{W}} \frac{\partial N_{WW}^{S}}{\partial \log p(O_{r,n} \mid q)} - \frac{2N_{WW}^{S}}{\left(N_{D}^{S} + N_{W}\right)^{2}} \frac{\partial N_{D}^{S}}{\partial \log p(O_{r,n} \mid q)}$$
19
Model Space Discriminative Training (cont.)
$$\frac{\partial N_{D}^{S}}{\partial \log p(O_{r,n} \mid q)} = \frac{\partial N_{D}^{S}}{\partial d(r,n)} \cdot \frac{\partial d(r,n)}{\partial e^{\log p(O_{r,n} \mid q)}} \cdot \frac{\partial e^{\log p(O_{r,n} \mid q)}}{\partial \log p(O_{r,n} \mid q)}$$

$$\frac{\partial N_{D}^{S}}{\partial d(r,n)} = \xi\, S(d(r,n)) \left(1 - S(d(r,n))\right) = \xi\, \frac{1}{1 + e^{-\xi d(r,n)}} \left(1 - \frac{1}{1 + e^{-\xi d(r,n)}}\right) = \frac{\xi\, e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}}$$
20
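The closed-form sigmoid derivative above is easy to sanity-check against a finite difference (a standalone check, not from the paper; ξ and the test point are arbitrary):

```python
import math

def sigmoid(u, xi):
    return 1.0 / (1.0 + math.exp(-xi * u))

def sigmoid_grad(u, xi):
    """Analytic derivative used on the slide:
    dS/du = xi * S(u) * (1 - S(u)) = xi * exp(-xi*u) / (1 + exp(-xi*u))**2."""
    s = sigmoid(u, xi)
    return xi * s * (1.0 - s)

def numeric_grad(u, xi, h=1e-6):
    # central finite difference for comparison
    return (sigmoid(u + h, xi) - sigmoid(u - h, xi)) / (2 * h)
```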
Model Space Discriminative Training (cont.)
$$\frac{\partial d(r,n)}{\partial e^{\log p(O_{r,n} \mid q)}} = \frac{\partial}{\partial p(O_{r,n} \mid q)} \left[ -\log \frac{P(O_{r,n} \mid q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}} + \tau \right]$$

When $q \ne q_{r,n}$:

$$= -\frac{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}{P(O_{r,n} \mid q_{r,n})^{\kappa/T_{r,n}}} \cdot \frac{-P(O_{r,n} \mid q_{r,n})^{\kappa/T_{r,n}} \cdot \frac{\kappa}{T_{r,n}} P(O_{r,n} \mid q)^{\kappa/T_{r,n} - 1}}{\left( \sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}} \right)^{2}}$$

$$= \frac{\kappa}{T_{r,n}} \cdot \frac{P(O_{r,n} \mid q)^{\kappa/T_{r,n} - 1}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}$$
21
Model Space Discriminative Training (cont.)
When $q = q_{r,n}$:

$$\frac{\partial d(r,n)}{\partial e^{\log p(O_{r,n} \mid q)}} = -\frac{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}{P(O_{r,n} \mid q)^{\kappa/T_{r,n}}} \cdot \frac{\frac{\kappa}{T_{r,n}} P(O_{r,n} \mid q)^{\kappa/T_{r,n} - 1} \left( \sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}} - P(O_{r,n} \mid q)^{\kappa/T_{r,n}} \right)}{\left( \sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}} \right)^{2}}$$

$$= -\frac{\kappa}{T_{r,n}} \cdot \frac{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}} - P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}} \cdot \frac{1}{P(O_{r,n} \mid q)}$$
22
Model Space Discriminative Training (cont.)
$$\frac{\partial e^{\log p(O_{r,n} \mid q)}}{\partial \log p(O_{r,n} \mid q)} = p(O_{r,n} \mid q)$$

$$\frac{\partial N_{D}^{S}}{\partial \log p(O_{r,n} \mid q)} = \frac{\partial N_{D}^{S}}{\partial d(r,n)} \cdot \frac{\partial d(r,n)}{\partial e^{\log p(O_{r,n} \mid q)}} \cdot \frac{\partial e^{\log p(O_{r,n} \mid q)}}{\partial \log p(O_{r,n} \mid q)}$$

When $q \ne q_{r,n}$:

$$= \frac{\kappa}{T_{r,n}} \cdot \frac{\xi\, e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}} \cdot \frac{P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}$$

When $q = q_{r,n}$:

$$= -\frac{\kappa}{T_{r,n}} \cdot \frac{\xi\, e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}} \cdot \frac{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}} - P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}$$
23
Model Space Discriminative Training (cont.)
$$\frac{\partial N_{WW}^{S}}{\partial \log p(O_{r,n} \mid q)} = \frac{1}{T_{r,n}}\, E(r,n) \cdot P \cdot \left( \psi_q(r,n) - A(q, q_{r,n}) \right)$$

$$\frac{\partial N_{D}^{S}}{\partial \log p(O_{r,n} \mid q)} = \frac{1}{T_{r,n}} \cdot P \cdot \left( \psi_q(r,n) - A(q, q_{r,n}) \right)$$

$$P = \frac{\kappa\, \xi\, e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}}, \qquad \psi_q(r,n) = \frac{P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n} \mid q')^{\kappa/T_{r,n}}}$$

 $A(q, q_{r,n})$ is the 'phone accuracy' of phone $q$ and is equal to 1 when $q$ and $q_{r,n}$ are the same, and 0 otherwise.
24
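The arc posterior ψ_q(r, n) above is a κ/T_{r,n}-scaled softmax over the arcs of Q(r, n); a log-domain sketch (the per-arc log-likelihoods are hypothetical inputs, not from the paper):

```python
import math

def arc_posteriors(logliks, kappa, T):
    """psi_q(r,n) = P(O|q)^(kappa/T) / sum_q' P(O|q')^(kappa/T),
    computed in the log domain for numerical stability; logliks holds
    one total log-likelihood log P(O_{r,n}|q) per arc q in Q(r,n)."""
    scaled = [(kappa / T) * ll for ll in logliks]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # shift by max to avoid overflow
    z = sum(exps)
    return [e / z for e in exps]
```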
Model Space Discriminative Training (cont.)
1. Initialize the acoustic models using ML estimation.
2. Iterative MFC model training:
   a. Compute GOP scores of all the phone segments (r, n) in the training utterances;
   b. Search for the best phone-dependent thresholds τ that maximize $\mathcal{F}_{MFC}$;
   c. Compute $N_{WW}^{S}$ and $N_{D}^{S}$ using the GOP scores and thresholds obtained in 2.a and 2.b;
   d. Do forward-backward computations to accumulate the sufficient statistics;
   e. Update the means and variances of the Gaussians $\{\mu_{sm}, \sigma_{sm}^{2}\}$;
   f. Go to step 2.a unless convergence or the maximum number of iterations is reached.
25
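The steps above can be sketched as a training-loop skeleton in which steps 2.a–2.e are hypothetical callbacks (none of these function names come from the paper; they only mark where each step would plug in):

```python
def mfc_train(models, utterances, compute_gop, search_thresholds,
              accumulate_stats, update_gaussians, max_iters=10, tol=1e-4):
    """Skeleton of the iterative MFC model training loop. The five
    callbacks are placeholders for steps 2.a-2.e; convergence is tested
    on the change of the F_MFC value returned by the threshold search."""
    prev = float("-inf")
    for _ in range(max_iters):
        gop = compute_gop(models, utterances)        # 2.a GOP scores
        taus, f_mfc = search_thresholds(gop)         # 2.b best thresholds
        stats = accumulate_stats(models, gop, taus)  # 2.c + 2.d statistics
        models = update_gaussians(models, stats)     # 2.e parameter update
        if abs(f_mfc - prev) < tol:                  # 2.f convergence check
            break
        prev = f_mfc
    return models
```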
Feature Space Discriminative Training
• Region-dependent feature compensation uses a global Gaussian mixture
model to divide the acoustic space into multiple regions, each having a
different compensation offset.
$$y_t = o_t + \sum_{m=1}^{\mathcal{M}} \phi_t^{(m)}\, b_m = o_t + B\Phi_t$$

 $o_t$ is the observation vector of input features at time $t$.
 $\mathcal{M}$ is the total number of Gaussians in the HMM set.
 $\phi_t^{(m)}$ is the posterior probability of Gaussian $m$ given $o_t$.
 $b_m$ is a region-dependent offset vector of Gaussian $m$.
 $B = [b_1\, b_2 \cdots b_m \cdots b_{\mathcal{M}}]^{T}$ is the offset matrix.
 $\Phi_t$ is an $\mathcal{M}$-dimensional vector whose elements are the posterior probabilities $\phi_t^{(m)}$, $m = 1, \ldots, \mathcal{M}$.
26
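A pure-Python sketch of the compensation y_t = o_t + BΦ_t, with the posteriors Φ_t computed from a global diagonal-covariance GMM (toy inputs; here B[d][m] stores component d of offset b_m, so B's columns are the b_m):

```python
import math

def rd_compensate(o_t, means, variances, weights, B):
    """Region-dependent feature offset: y_t = o_t + B Phi_t. means,
    variances are per-Gaussian lists of per-dimension values; weights
    are mixture weights; B has D rows and M columns (column m = b_m)."""
    # per-Gaussian log weight + log N(o_t; mu_m, diag(var_m))
    log_p = []
    for mu, var, w in zip(means, variances, weights):
        ll = math.log(w)
        for o, m, v in zip(o_t, mu, var):
            ll -= 0.5 * (math.log(2 * math.pi * v) + (o - m) ** 2 / v)
        log_p.append(ll)
    mx = max(log_p)
    phi = [math.exp(l - mx) for l in log_p]
    z = sum(phi)
    phi = [p / z for p in phi]          # posterior vector Phi_t
    # y_t = o_t + sum_m phi_m * b_m
    return [o + sum(phi[m] * B[d][m] for m in range(len(phi)))
            for d, o in enumerate(o_t)]
```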
Feature Space Discriminative Training (cont.)
• We use a gradient ascent algorithm to update the transform matrix B.
$$y_t = o_t + B\Phi_t$$

$$\frac{\partial \mathcal{F}_{MFC}}{\partial B} = \sum_{t=1}^{T} \frac{\partial \mathcal{F}_{MFC}}{\partial y_t} \frac{\partial y_t}{\partial B} = \sum_{t=1}^{T} \frac{\partial \mathcal{F}_{MFC}}{\partial y_t} \Phi_t^{T}$$
27
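The gradient above accumulates one outer product (∂F_MFC/∂y_t)Φ_tᵀ per frame; a minimal sketch in which the per-frame gradients ∂F_MFC/∂y_t are assumed given (illustrative inputs, not from the paper):

```python
def grad_B(grad_y, phis):
    """dF/dB = sum_t (dF/dy_t) Phi_t^T. grad_y[t] is a per-frame gradient
    of length D; phis[t] is the posterior vector Phi_t of length M;
    the result has shape D x M, matching the offset matrix layout."""
    D, M = len(grad_y[0]), len(phis[0])
    G = [[0.0] * M for _ in range(D)]
    for g, phi in zip(grad_y, phis):
        for d in range(D):
            for m in range(M):
                G[d][m] += g[d] * phi[m]
    return G
```

The transform would then be updated by gradient ascent, B ← B + η · ∂F_MFC/∂B.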
Experiments And Results
• The proposed method is evaluated on a Mandarin mispronunciation detection task for Uighur college students who have been learning Putonghua (Mandarin Chinese) at Xinjiang University.
• We focus only on phone-level mispronunciations, without considering tonal error detection, which is normally a separate research topic.
• Consequently, the methods proposed in this paper can also be applied to non-tonal languages.
28
Experiments And Results (cont.)
 The spectral front-end uses a 39-dimensional feature vector.
 Due to the limited non-native training data, especially the limited amount of mispronounced training data, only monophone HMMs are used.
 Another reason for using context-independent models is that context information is not easy to exploit, considering possible mispronunciations around the target phone.
29
Experiments And Results (cont.)
• Here we simply include all the initials in the competing set when the canonical phone $q_{r,n}$ is an initial, and all the finals when $q_{r,n}$ is a final.
$$G(r,n) = \log \frac{P(O_{r,n} \mid q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q \in Q(r,n)} P(O_{r,n} \mid q)^{\kappa/T_{r,n}}}$$
• We tuned the threshold of each phone $q$ using grid search to find the best $\mathcal{F}_{MFC}$ while keeping the other phone thresholds fixed.
• The procedure was repeated until the objective function $\mathcal{F}_{MFC}$ converged to an optimum.

$$\hat\tau = \arg\max_{\tau} \mathcal{F}_{MFC}$$
30
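The coordinate-wise tuning described above can be sketched as follows (illustrative flattened inputs; a segment is flagged as mispronounced when its GOP falls at or below its phone's threshold, matching d = −G + τ > 0):

```python
def tune_thresholds(gops, labels, phones, grid, max_rounds=5):
    """Coordinate-wise grid search over per-phone thresholds: tune one
    phone's threshold with the others fixed, repeat until F1 stops
    improving. gops/labels/phones are hypothetical flattened lists."""
    taus = {p: grid[0] for p in set(phones)}

    def f1(t):
        flags = [g <= t[p] for g, p in zip(gops, phones)]
        n_ww = sum(1 for f, e in zip(flags, labels) if f and e)
        n_d, n_w = sum(flags), sum(labels)
        return 2 * n_ww / (n_d + n_w) if (n_d + n_w) else 0.0

    best = f1(taus)
    for _ in range(max_rounds):
        improved = False
        for p in sorted(taus):          # one phone at a time
            for t in grid:
                cand = dict(taus)
                cand[p] = t
                s = f1(cand)
                if s > best:
                    best, taus, improved = s, cand, True
        if not improved:                # converged to a (local) optimum
            break
    return taus, best
```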
Experiments And Results (cont.)
 The ML-estimated baseline is trained on the native speech database, and the MFC training is conducted on the non-native speech database.
 It is seen that the F1-scores on the training and test sets are 0.388 and 0.381 respectively; that is, no clear F1-score improvement is obtained.
 This indicates that, in our experiments, adapting the baseline acoustic models to the L2 acoustic conditions does not help improve mispronunciation detection performance.
31
Experiments And Results (cont.)
 The HMM set contains 1592 Gaussians, so the dimension of the posterior probability vector $\Phi_t$ in $y_t = o_t + B\Phi_t$ is 1592.
 When only the posterior probability vector of the current frame $\Phi_t$ is used, the context window size is 1 (CXT=1).
 We also expand the posterior probability vector $\Phi_t$ by splicing the 3 successive posterior probability vectors $\Phi_{t-1}$, $\Phi_t$ and $\Phi_{t+1}$ together (denoted as RDLC CXT=3).
32
Experiments And Results (cont.)
33
Experiments And Results (cont.)
 True Acceptance: the pronunciation is correct and detected as correct
 False Acceptance: the pronunciation is incorrect but detected as correct
 True Rejection: the pronunciation is incorrect and detected as incorrect
 False Rejection: the pronunciation is correct but detected as incorrect
34
Experiments And Results (cont.)
$$DER = \frac{1}{N}\left(N_{FA} + N_{FR}\right) \times 100\% \qquad \text{(Detection Error Rate)}$$

$$SA = 1 - DER = \frac{1}{N}\left(N_{TA} + N_{TR}\right) \times 100\% \qquad \text{(Scoring Accuracy)}$$
35
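The two metrics follow directly from the four acceptance/rejection counts of the previous slide; a minimal sketch (illustrative counts, not from the paper's results):

```python
def der_sa(n_ta, n_fa, n_tr, n_fr):
    """Detection Error Rate and Scoring Accuracy (as percentages) from
    the true/false acceptance and rejection counts; N is their total."""
    n = n_ta + n_fa + n_tr + n_fr
    der = (n_fa + n_fr) / n * 100.0   # errors: false accepts + false rejects
    sa = (n_ta + n_tr) / n * 100.0    # correct decisions; equals 100 - DER
    return der, sa
```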
Experiments And Results (cont.)
※ Syllable Error Rate
※ Phone Error Rate

$$\mathcal{F}_{MMI} = \sum_{r=1}^{R} \log \frac{p(O_r \mid s_r)^{\kappa}\, p(s_r)^{\kappa}}{\sum_{s} p(O_r \mid s)^{\kappa}\, p(s)^{\kappa}}$$

$$\mathcal{F}_{MPE} = \sum_{r=1}^{R} \frac{\sum_{s \in S} p(O_r \mid s)^{\kappa}\, p(s)^{\kappa}\, A(s, s_r)}{\sum_{s' \in S} p(O_r \mid s')^{\kappa}\, p(s')^{\kappa}}$$
36
Conclusion And Future Work
• Mispronunciation detection experiments have shown that the methods are effective in increasing F1-scores on both the training set and the test set.
• The GOP based mispronunciation detection method can also be viewed as a two-class classification method.
• The method uses the GOP score as the feature input and a pre-set threshold as the back-end classifier.
• Mispronunciation detection using the GOP score and related features as input with DNN or SVM based classifiers has been investigated.
• Evaluating the use of MFC-optimized GOP with better back-end classifiers remains for future work.
37
Conclusion And Future Work (cont.)
• In mispronunciation detection, using DNN based acoustic models to compute GOP scores has been proposed and has shown better mispronunciation detection results.
• In ASR, applying sequence-discriminative criteria such as MMI and MPE to DNN training has yielded lower WERs.
• This suggests that using a task-oriented objective function could help obtain better performance.
• We think the MFC objective function might be a better fine-tuning target for DNN based acoustic models in mispronunciation detection; its validation remains for future work.
38