Maximum F1-Score
Discriminative Training Criterion for
Automatic Mispronunciation Detection
Hao Huang, Haihua Xu, Xianhui Wang,
Wushour Silamu
Yaochi Hsu
2015/05/12
Outline
• Abstract
• Introduction
• The MFC Objective Function
• Model Space Discriminative Training
• Feature Space Discriminative Training
• Experiments And Results
• Conclusion And Future Work
Abstract
• We carry out an in-depth investigation of a newly proposed Maximum F1-score Criterion (MFC) discriminative training objective function for Goodness of Pronunciation (GOP) based automatic mispronunciation detection that uses Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) acoustic models.
• We present a model-space training algorithm for MFC using extended Baum-Welch style update equations based on the weak-sense auxiliary function method.
• We then present MFC based feature-space discriminative training.
Abstract (cont.)
• Mispronunciation detection experiments show that MFC based model-space training and feature-space training are effective in improving the F1-score and other commonly used evaluation metrics.
• Further, we review and compare mispronunciation detection results obtained with MFC and with some traditional training criteria that minimize the word error rate in speech recognition.
Introduction
• Automatic mispronunciation detection, which aims at helping the learner by automatically pinpointing erroneous pronunciations, is one of the most widely deployed applications.
• A major approach to mispronunciation detection is based on automatic speech recognition (ASR) technologies.
• There are two types of ASR based mispronunciation detection techniques:
  1. One uses confidence scores such as the posterior probability to measure the correctness of a pronunciation (GOP).
  2. The other uses a phone recognizer to decode the input waveforms with extended pronunciation networks that include correct and incorrect pronunciations, so as to capture possible error types.
Introduction (cont.)
• An alternative to the ASR based approach is to use acoustic-phonetic features as the front-end and a classifier as the back-end.
• Mispronunciation detection can be formulated more suitably as a classification problem, and thus more discriminative features and classifiers can be explored (DBN, DNN, SVM).
• In this paper, we use the GOP based mispronunciation detection method and GMM-HMM based acoustic models to compute GOP scores.
Introduction (cont.)
• In ASR, discriminative training (DT) of the acoustic models has been widely used and has proved to give significant improvements over the traditional ML estimation method:
  – Minimum Classification Error (MCE), Maximum Mutual Information (MMI), Minimum Phone Error (MPE)
• In ASR, the performance of a system is often evaluated in terms of Word Error Rate (WER).
• In mispronunciation detection, the commonly used metrics include False Rejections, False Acceptances, Precision and Recall.
Introduction (cont.)
• The F1-score is nowadays an important metric when evaluating the
performance of a natural language processing (NLP) system or an
information retrieval (IR) system.
• Recently, researchers began to refine system parameters by directly
maximizing the F1-score for logistic regression based classifiers in NLP.
• The training objective function is a smoothed form of F1-score function,
denoted as Maximum F1-score Criterion (MFC).
The MFC Objective Function
$$\mathrm{GOP}(O_{r,n}, q_{r,n}) = \frac{1}{T_{r,n}} \log \frac{p(O_{r,n}|q_{r,n})\,P(q_{r,n})}{\sum_{q \in Q(r,n)} p(O_{r,n}|q)\,P(q)}$$

Decision rule:

$$\mathrm{GOP}(O_{r,n}, q_{r,n}) > \tau \;\Rightarrow\; \text{Yes (correct pronunciation)}; \quad \text{otherwise No (mispronunciation)}$$
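As an illustration, the GOP score above can be computed from segment-level log-likelihoods. A minimal NumPy sketch, assuming the per-phone log-likelihoods log p(O|q) have already been produced by the acoustic model and that the phone priors P(q) are uniform (so they cancel); the phone names and the threshold value here are hypothetical:

```python
import numpy as np

def gop_score(loglik, canonical, phones, T):
    """GOP of one segment: duration-normalized log ratio of the canonical
    phone's likelihood to the sum over the competing phone set.

    loglik    -- dict phone -> log p(O|q) for the whole segment
    canonical -- the canonical phone q_{r,n}
    phones    -- competing phone set Q(r,n)
    T         -- segment duration in frames (T_{r,n})
    Assumes uniform phone priors P(q), which then cancel out.
    """
    num = loglik[canonical]
    # log-sum-exp over the competing set, for numerical stability
    den = np.logaddexp.reduce([loglik[q] for q in phones])
    return (num - den) / T

# toy example: the canonical phone dominates, so GOP is close to 0
loglik = {"a": -100.0, "e": -130.0, "o": -125.0}
score = gop_score(loglik, "a", ["a", "e", "o"], T=20)
detected_correct = score > -0.5   # hypothetical threshold tau = -0.5
```

Since the canonical phone is part of the competing set, the score is always non-positive; a well-pronounced segment scores close to 0.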
The MFC Objective Function (cont.)
$0 < \kappa < 1$ is an exponential scaling factor commonly applied in discriminative training to reduce the dynamic range of the probabilities.

$$G(r,n) = \log \frac{P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q \in Q(r,n)} P(O_{r,n}|q)^{\kappa/T_{r,n}}}$$

$$d(r,n) = -G(r,n) + \tau$$

$$\mathbb{1}(d(r,n)) = \begin{cases} 1 & \text{if } d(r,n) > 0 \quad (\text{mispronunciation}) \\ 0 & \text{if } d(r,n) \le 0 \quad (\text{correct pronunciation}) \end{cases}$$
The MFC Objective Function (cont.)
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 N_{WW}}{N_D + N_W}$$

$$\mathrm{Precision} = \frac{N_{WW}}{N_D} \times 100\%, \qquad \mathrm{Recall} = \frac{N_{WW}}{N_W} \times 100\%$$

$$N_{WW} = \sum_{r=1}^{R} \sum_{n=1}^{N_r} \mathbb{1}(d(r,n)) \cdot E(r,n), \qquad N_D = \sum_{r=1}^{R} \sum_{n=1}^{N_r} \mathbb{1}(d(r,n))$$

$N_{WW}$ is the number of phones marked as mispronunciations by both the computer and the human evaluator.
$N_D$ is the total number of mispronunciations detected by the machine.
$N_W$ is the number of mispronunciations judged by the human evaluator.
$E(r,n)$ is the human-annotated result of segment $(r,n)$: $E(r,n) = 1$ if the segment is marked as a mispronunciation, and 0 otherwise.
The MFC Objective Function (cont.)
$$S(u) = \frac{1}{1 + \exp(-\xi u)}$$

$$N^{S}_{WW} = \sum_{r=1}^{R} \sum_{n=1}^{N_r} S(d(r,n)) \cdot E(r,n), \qquad N^{S}_{D} = \sum_{r=1}^{R} \sum_{n=1}^{N_r} S(d(r,n))$$

$$\mathcal{F}_{MFC} = \frac{2 N^{S}_{WW}}{N^{S}_{D} + N_W}$$

However, the F1-score itself is not differentiable because of the step indicator function, which makes it difficult to optimize with a gradient-based method; replacing the indicator with the sigmoid $S(\cdot)$ yields the smoothed objective $\mathcal{F}_{MFC}$ above.
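The effect of the sigmoid smoothing can be illustrated directly. A small Python sketch using made-up $d(r,n)$ values and labels $E(r,n)$; as $\xi$ grows, the smoothed objective approaches the true F1:

```python
import numpy as np

def sigmoid(u, xi=1.0):
    # S(u) = 1 / (1 + exp(-xi * u))
    return 1.0 / (1.0 + np.exp(-xi * u))

def f1_hard(d, E, N_W):
    # F1 with the step indicator: counts segments with d(r,n) > 0
    detected = d > 0
    N_WW = np.sum(detected * E)
    N_D = np.sum(detected)
    return 2.0 * N_WW / (N_D + N_W)

def f1_mfc(d, E, N_W, xi=10.0):
    # smoothed F1 (the MFC objective): indicator replaced by sigmoid S(d)
    s = sigmoid(d, xi)
    return 2.0 * np.sum(s * E) / (np.sum(s) + N_W)

# d(r,n) = -GOP + tau for 6 toy segments; E marks human-labelled mispronunciations
d = np.array([1.2, -0.8, 0.5, -1.5, 2.0, 0.1])
E = np.array([1, 0, 1, 0, 0, 1])
N_W = int(E.sum())   # mispronunciations judged by the human evaluator

hard = f1_hard(d, E, N_W)
smooth = f1_mfc(d, E, N_W, xi=10.0)
```

Unlike `f1_hard`, `f1_mfc` is differentiable in `d`, which is what makes the gradient derivations on the following slides possible.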
Model Space Discriminative Training
• Model-space discriminative training optimizes the MFC objective function by updating the GMM-HMM parameters.
• Various optimization methods have been tried for discriminative training of GMM-HMMs:
  – MCE: Generalized Probabilistic Descent (GPD) algorithm
  – MMI: Stochastic Gradient Ascent (SGA) and Resilient Propagation (RProp)
  – MMI, MPE and Boosted MMI (BMMI): the weak-sense auxiliary function method (there is no need to determine an appropriate learning rate or to use second-order statistics)
Model Space Discriminative Training (cont.)
A weak-sense auxiliary function $Q(\theta, \bar\theta)$ for the objective $\mathcal{F}(\theta)$ must have the same gradient as the objective at the current parameter values:

$$\left.\frac{\partial Q(\theta, \bar\theta)}{\partial \theta}\right|_{\theta = \bar\theta} = \left.\frac{\partial \mathcal{F}(\theta)}{\partial \theta}\right|_{\theta = \bar\theta}$$

For MFC, the auxiliary function is

$$Q(\theta, \bar\theta) = \sum_{q} \frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_q|q)} \log p(O_q|q) + Q^{S}(\theta, \bar\theta)$$

where $\bar\theta$ represents the current model parameters and $\theta$ the parameters to be estimated; $q$ is a phone arc within the training confusion networks, and $O_q$ represents the observation sequence in arc $q$.

A smoothing term $Q^{S}(\theta, \bar\theta)$ is added on the right-hand side to ensure concavity of the auxiliary function and consequently improve stability in optimization. It must satisfy the constraint

$$\left.\frac{\partial Q^{S}(\theta, \bar\theta)}{\partial \theta}\right|_{\theta = \bar\theta} = 0$$

so that the result is still a valid weak-sense auxiliary function.
Model Space Discriminative Training (cont.)
$$Q^{S}(\theta, \bar\theta) = -\frac{1}{2} \sum_{s=1}^{S} \sum_{m=1}^{M_s} D_{sm} \left\{ \log\!\left(2\pi\sigma^{2}_{sm}\right) + \frac{\bar\mu^{2}_{sm} + \bar\sigma^{2}_{sm} - 2\mu_{sm}\bar\mu_{sm} + \mu^{2}_{sm}}{\sigma^{2}_{sm}} \right\}$$

where $S$ is the number of states in the HMM set and $M_s$ is the number of Gaussians in state $s$.
$\bar\mu_{sm}$ and $\bar\sigma_{sm}$ are respectively the current mean and variance for Gaussian $m$ in state $s$.
$\mu_{sm}$ and $\sigma_{sm}$ are the mean and variance to be updated.
$D_{sm}$ is a Gaussian-dependent smoothing factor.
Model Space Discriminative Training (cont.)
The sufficient statistics are accumulated from numerator ($l = \mathrm{num}$) and denominator ($l = \mathrm{den}$) occupancies:

$$U^{l}_{sm}(t) = \begin{cases} \psi^{MFC}_{sm}(t) & \text{if } l = \text{numerator} \\ -\psi^{MFC}_{sm}(t) & \text{if } l = \text{denominator} \end{cases}$$

$$\beta^{l}_{sm} = \sum_{t=1}^{T} \max\!\left(0, U^{l}_{sm}(t)\right)$$

$$X^{l}_{sm} = \sum_{t=1}^{T} \max\!\left(0, U^{l}_{sm}(t)\right) o(t)$$

$$Y^{l}_{sm} = \sum_{t=1}^{T} \max\!\left(0, U^{l}_{sm}(t)\right) o^{2}(t)$$

where $\psi^{MFC}_{sm}(t) = \psi^{MFC}_{q}\,\psi_{q,sm}(t)$ with

$$\psi^{MFC}_{q} = \frac{1}{\kappa} \frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_q|q)}$$

and $\psi_{q,sm}(t)$ is the posterior probability of being in state $s$ and Gaussian mixture $m$ of arc $q$ at time $t$, obtained via a forward-backward pass within arc $q$.
Model Space Discriminative Training (cont.)
The extended Baum-Welch style updates for the means and variances are

$$\mu_{sm} = \frac{X^{n}_{sm} - X^{d}_{sm} + D_{sm}\bar\mu_{sm}}{\beta^{n}_{sm} - \beta^{d}_{sm} + D_{sm}}$$

$$\sigma^{2}_{sm} = \frac{Y^{n}_{sm} - Y^{d}_{sm} + D_{sm}\left(\bar\sigma^{2}_{sm} + \bar\mu^{2}_{sm}\right)}{\beta^{n}_{sm} - \beta^{d}_{sm} + D_{sm}} - \mu^{2}_{sm}$$

$$D_{sm} = E\,\beta^{d}_{sm}$$

The smoothing factor $D_{sm}$ is empirically determined for each Gaussian component, where $E$ is a global constant controlling the update speed. A large value of $E$ will slow down convergence; typically a constant $E = 3.0$ was chosen in the experiments.
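The update equations can be sketched for a single one-dimensional Gaussian. A toy Python illustration with made-up statistics; a real implementation would apply this per dimension over every Gaussian in the HMM set:

```python
def ebw_update(beta_n, beta_d, X_n, X_d, Y_n, Y_d, mu, var, E=3.0):
    """Extended Baum-Welch style update of one Gaussian's mean and variance
    from numerator/denominator statistics (1-D sketch).

    mu, var -- current parameters; returns the updated (mu_new, var_new).
    """
    D = E * beta_d                        # Gaussian-dependent smoothing factor
    denom = beta_n - beta_d + D
    mu_new = (X_n - X_d + D * mu) / denom
    var_new = (Y_n - Y_d + D * (var + mu ** 2)) / denom - mu_new ** 2
    return mu_new, var_new

# toy statistics: the numerator occupancy pulls the mean towards 1.0
mu_new, var_new = ebw_update(beta_n=10.0, beta_d=4.0, X_n=10.0, X_d=0.0,
                             Y_n=12.0, Y_d=0.0, mu=0.0, var=1.0, E=3.0)
```

Note how a larger `E` (and hence larger `D`) keeps the updated parameters closer to the current ones, which is what slows convergence.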
Model Space Discriminative Training (cont.)
Recall the smoothed objective:

$$\mathcal{F}_{MFC} = \frac{2 N^{S}_{WW}}{N^{S}_{D} + N_W}, \qquad N^{S}_{WW} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n))\,E(r,n), \qquad N^{S}_{D} = \sum_{r=1}^{R}\sum_{n=1}^{N_r} S(d(r,n))$$

By the chain rule,

$$\frac{\partial \mathcal{F}_{MFC}}{\partial \log p(O_{r,n}|q)} = \frac{\partial \mathcal{F}_{MFC}}{\partial N^{S}_{WW}} \frac{\partial N^{S}_{WW}}{\partial \log p(O_{r,n}|q)} + \frac{\partial \mathcal{F}_{MFC}}{\partial N^{S}_{D}} \frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)}$$

$$= \frac{2}{N^{S}_{D} + N_W} \frac{\partial N^{S}_{WW}}{\partial \log p(O_{r,n}|q)} - \frac{2 N^{S}_{WW}}{\left(N^{S}_{D} + N_W\right)^{2}} \frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)}$$
Model Space Discriminative Training (cont.)
$$\frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)} = \frac{\partial N^{S}_{D}}{\partial d(r,n)} \frac{\partial d(r,n)}{\partial e^{\log p(O_{r,n}|q)}} \frac{\partial e^{\log p(O_{r,n}|q)}}{\partial \log p(O_{r,n}|q)}$$

The first factor is the sigmoid derivative:

$$\frac{\partial N^{S}_{D}}{\partial d(r,n)} = \xi \left(1 - S(d(r,n))\right) S(d(r,n)) = \xi \left(1 - \frac{1}{1 + e^{-\xi d(r,n)}}\right) \frac{1}{1 + e^{-\xi d(r,n)}} = \frac{\xi e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}}$$
Model Space Discriminative Training (cont.)
$$\frac{\partial d(r,n)}{\partial p(O_{r,n}|q)} = \frac{\partial}{\partial p(O_{r,n}|q)} \left[ -\log \frac{P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}} + \tau \right]$$

When $q \ne q_{r,n}$, only the denominator of the fraction depends on $p(O_{r,n}|q)$:

$$\frac{\partial d(r,n)}{\partial p(O_{r,n}|q)} = \frac{\kappa}{T_{r,n}} \frac{P(O_{r,n}|q)^{\kappa/T_{r,n} - 1}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}}$$
Model Space Discriminative Training (cont.)
When $q = q_{r,n}$, both the numerator and the denominator depend on $p(O_{r,n}|q_{r,n})$:

$$\frac{\partial d(r,n)}{\partial p(O_{r,n}|q_{r,n})} = -\frac{\kappa}{T_{r,n}} \frac{1}{P(O_{r,n}|q_{r,n})} + \frac{\kappa}{T_{r,n}} \frac{P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n} - 1}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}}$$

$$= -\frac{\kappa}{T_{r,n}} \, \frac{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}} - P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}} \cdot \frac{1}{P(O_{r,n}|q_{r,n})}$$
Model Space Discriminative Training (cont.)
The last chain-rule factor is simply

$$\frac{\partial e^{\log p(O_{r,n}|q)}}{\partial \log p(O_{r,n}|q)} = p(O_{r,n}|q)$$

Combining the three factors:

When $q \ne q_{r,n}$:

$$\frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)} = \frac{\kappa}{T_{r,n}} \frac{\xi e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}} \frac{P(O_{r,n}|q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}}$$

When $q = q_{r,n}$:

$$\frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)} = \frac{\kappa}{T_{r,n}} \frac{\xi e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}} \left( \frac{P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}} - 1 \right)$$
Model Space Discriminative Training (cont.)
Both cases can be written in a unified form:

$$\frac{\partial N^{S}_{WW}}{\partial \log p(O_{r,n}|q)} = \frac{1}{T_{r,n}} E(r,n)\,P \left[ \psi_q(r,n) - A(q, q_{r,n}) \right]$$

$$\frac{\partial N^{S}_{D}}{\partial \log p(O_{r,n}|q)} = \frac{1}{T_{r,n}} P \left[ \psi_q(r,n) - A(q, q_{r,n}) \right]$$

where

$$P = \frac{\kappa \xi e^{-\xi d(r,n)}}{\left(1 + e^{-\xi d(r,n)}\right)^{2}}, \qquad \psi_q(r,n) = \frac{P(O_{r,n}|q)^{\kappa/T_{r,n}}}{\sum_{q' \in Q(r,n)} P(O_{r,n}|q')^{\kappa/T_{r,n}}}$$

and $A(q, q_{r,n})$ is the 'phone accuracy' of phone $q$: it is equal to 1 when $q$ and $q_{r,n}$ are the same, and 0 otherwise.
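The unified gradient expression can be checked numerically on a toy segment. A Python sketch with made-up log-likelihoods, treating $N^S_D$ as a single-segment sum; the finite-difference gradient should match $(1/T_{r,n})\,P\,[\psi_q - A]$:

```python
import math

def softmax_psi(logps, kappa, T):
    # psi_q = P(O|q)^(k/T) / sum_q' P(O|q')^(k/T), computed in the log domain
    a = kappa / T
    m = max(a * lp for lp in logps.values())
    exps = {q: math.exp(a * lp - m) for q, lp in logps.items()}
    Z = sum(exps.values())
    return {q: v / Z for q, v in exps.items()}

def N_D_s(logps, canonical, kappa, T, tau, xi):
    # one-segment smoothed detection count: S(d), with d = -G + tau
    psi = softmax_psi(logps, kappa, T)
    G = math.log(psi[canonical])          # equals G(r,n) in log-domain form
    d = -G + tau
    return 1.0 / (1.0 + math.exp(-xi * d))

kappa, T, tau, xi = 0.5, 10.0, 0.2, 2.0
logps = {"a": -50.0, "e": -55.0, "o": -58.0}   # made-up log p(O|q)

# analytic gradient: (1/T) * P * (psi_q - A(q, q_rn)), with canonical phone "a"
psi = softmax_psi(logps, kappa, T)
d = -math.log(psi["a"]) + tau
P = kappa * xi * math.exp(-xi * d) / (1.0 + math.exp(-xi * d)) ** 2
for q in logps:
    A = 1.0 if q == "a" else 0.0
    analytic = (1.0 / T) * P * (psi[q] - A)
    # finite-difference gradient w.r.t. log p(O|q)
    eps = 1e-6
    bumped = dict(logps); bumped[q] += eps
    numeric = (N_D_s(bumped, "a", kappa, T, tau, xi)
               - N_D_s(logps, "a", kappa, T, tau, xi)) / eps
    assert abs(analytic - numeric) < 1e-6
```

The check also makes the sign pattern visible: raising the canonical phone's likelihood decreases the smoothed detection count, while raising a competitor's likelihood increases it.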
Model Space Discriminative Training (cont.)
1. Initialize the acoustic models using ML estimation;
2. Iterative MFC model training:
   a. Compute GOP scores of all the phone segments $(r,n)$ in the training utterances;
   b. Search for the best phone-dependent thresholds $\tau$ that maximize $\mathcal{F}_{MFC}$;
   c. Compute $N^{S}_{WW}$ and $N^{S}_{D}$ using the GOP scores and thresholds obtained in 2.a and 2.b;
   d. Do forward-backward computations to accumulate the sufficient statistics;
   e. Update the means and variances of the Gaussians $\{\mu_{sm}, \sigma^{2}_{sm}\}$;
   f. Go to step 2.a unless convergence or the maximum number of iterations is reached.
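The steps above can be sketched as a loop. A minimal Python skeleton in which scoring and model updating are toy stand-ins (the names `score_fn` and `update_fn` are hypothetical, and a single shared threshold replaces the phone-dependent thresholds of step 2.b):

```python
import math

def mfc_training_loop(score_fn, update_fn, segments, labels, n_iter=5, xi=10.0):
    """Skeleton of the iterative MFC training procedure (steps 1 and 2.a-2.f).

    score_fn(model, seg)      -> GOP score of one segment under the model;
    update_fn(model, d, lab)  -> new model (stands in for steps 2.d-2.e).
    """
    def smoothed_f1(gop, tau):
        s = [1.0 / (1.0 + math.exp(-xi * (-g + tau))) for g in gop]  # S(d)
        n_ww = sum(si * e for si, e in zip(s, labels))
        return 2.0 * n_ww / (sum(s) + sum(labels))

    model = 0.0                                   # step 1: ML-initialized (toy)
    best_f = -1.0
    for _ in range(n_iter):                       # step 2
        gop = [score_fn(model, seg) for seg in segments]          # 2.a
        grid = [t / 10.0 for t in range(-30, 31)]
        tau = max(grid, key=lambda t: smoothed_f1(gop, t))        # 2.b
        f = smoothed_f1(gop, tau)                                 # 2.c
        d = [-g + tau for g in gop]
        model = update_fn(model, d, labels)                       # 2.d-2.e
        if f <= best_f + 1e-6:                                    # 2.f: stop
            break
        best_f = f
    return model, best_f

# toy run: GOP is the stored score plus a model offset; identity "update"
segments = [-0.2, -1.5, -0.1, -2.0]
labels = [0, 1, 0, 1]
model, F = mfc_training_loop(lambda m, s: s + m, lambda m, d, l: m,
                             segments, labels, n_iter=5)
```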
Feature Space Discriminative Training
• Region-dependent feature compensation uses a global Gaussian mixture model to divide the acoustic space into multiple regions, each having a different compensation offset.

$$y_t = o_t + \sum_{m=1}^{\mathcal{M}} \phi^{(m)}_t b_m = o_t + B \Phi_t$$

where:
$o_t$ is the observation vector of input features at time $t$;
$\mathcal{M}$ is the total number of Gaussians in the HMM set;
$\phi^{(m)}_t$ is the posterior probability of Gaussian $m$ given $o_t$;
$b_m$ is a region-dependent offset vector of Gaussian $m$;
$B = [b_1\, b_2 \dots b_m \dots b_{\mathcal{M}}]^{T}$ is the offset matrix;
$\Phi_t$ is an $\mathcal{M}$-dimensional vector whose elements consist of the posterior probabilities $\phi^{(m)}_t$, $m = 1, \dots, \mathcal{M}$.
Feature Space Discriminative Training (cont.)
• We use a gradient ascent algorithm to update the transform matrix $B$. With $y_t = o_t + B\Phi_t$,

$$\frac{\partial \mathcal{F}_{MFC}}{\partial B} = \sum_{t=1}^{T} \frac{\partial \mathcal{F}_{MFC}}{\partial y_t} \frac{\partial y_t}{\partial B} = \sum_{t=1}^{T} \frac{\partial \mathcal{F}_{MFC}}{\partial y_t} \Phi^{T}_t$$
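The update can be sketched on toy data. A NumPy illustration that substitutes a stand-in objective $-\tfrac{1}{2}\sum_t \lVert y_t\rVert^2$ for $\mathcal{F}_{MFC}$ so that $\partial\mathcal{F}/\partial y_t$ is available in closed form; the dimensions, seed and learning rate are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions: 3-dim features, M = 4 Gaussians in the global GMM
D, M, T = 3, 4, 50
o = rng.normal(size=(T, D))             # observation vectors o_t
Phi = rng.random(size=(T, M))
Phi /= Phi.sum(axis=1, keepdims=True)   # posterior vectors Phi_t (rows sum to 1)
B = np.zeros((D, M))                    # offset matrix, one column b_m per region

def compensate(o, Phi, B):
    # y_t = o_t + B Phi_t, applied to all frames at once
    return o + Phi @ B.T

# gradient ascent on B for the stand-in objective (dF/dy_t = -y_t here)
eta = 0.05
for _ in range(100):
    y = compensate(o, Phi, B)
    dF_dy = -y                           # placeholder for dF_MFC/dy_t
    grad_B = dF_dy.T @ Phi               # sum_t (dF/dy_t) Phi_t^T
    B += eta * grad_B
```

The batched matrix product `dF_dy.T @ Phi` is exactly the sum of outer products $\sum_t (\partial\mathcal{F}/\partial y_t)\,\Phi_t^T$ from the slide.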
Experiments And Results
• The proposed method is evaluated on a Mandarin mispronunciation detection task for Uighur college students who have been learning Putonghua (Mandarin Chinese) at Xinjiang University.
• We focus only on phone-level mispronunciations, without consideration of tonal error detection, which is normally treated as a separate research topic.
• Hence, the methods proposed in this paper can also be applied to other, non-tonal languages.
Experiments And Results (cont.)
[Table: ratios 32:1 and 39:1]

The spectral front-end uses a 39-dimensional feature vector.
Due to the limited non-native training data, especially the limited amount of mispronounced training data, only monophone HMMs are used.
Another reason for using context-independent models is that context information is not easy to exploit, considering the possible mispronunciations around the target phone.
Experiments And Results (cont.)
• Here we simply add all the initials into the segment when the canonical
phone 𝑞𝑟,𝑛 is an initial, and add all the finals when 𝑞𝑟,𝑛 is a final.
$$G(r,n) = \log \frac{P(O_{r,n}|q_{r,n})^{\kappa/T_{r,n}}}{\sum_{q \in Q(r,n)} P(O_{r,n}|q)^{\kappa/T_{r,n}}}$$

• We tuned the threshold of each phone $q$ using grid search to find the best $\mathcal{F}_{MFC}$ while keeping the thresholds of the other phones fixed.
• The procedure was repeated until the objective function $\mathcal{F}_{MFC}$ converged to an optimum:

$$\tau = \arg\max_{\tau} \mathcal{F}_{MFC}$$
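The coordinate-wise grid search can be sketched as follows. A Python illustration on made-up GOP scores for two hypothetical phones; each phone's threshold is optimized in turn with the others held fixed, and sweeps repeat until no improvement:

```python
import math

def smoothed_f1(gop, phones, labels, tau, xi=10.0):
    # F_MFC over segments; tau is a dict phone -> threshold
    s = [1.0 / (1.0 + math.exp(-xi * (-g + tau[p])))   # S(d), d = -GOP + tau_p
         for g, p in zip(gop, phones)]
    n_ww = sum(si * e for si, e in zip(s, labels))
    return 2.0 * n_ww / (sum(s) + sum(labels))

def search_thresholds(gop, phones, labels, grid, n_sweeps=5):
    """Coordinate-wise grid search: optimize each phone's threshold with the
    others fixed; repeat until F_MFC stops improving."""
    tau = {p: 0.0 for p in set(phones)}
    best = smoothed_f1(gop, phones, labels, tau)
    for _ in range(n_sweeps):
        improved = False
        for p in tau:
            for t in grid:
                trial = dict(tau); trial[p] = t
                F = smoothed_f1(gop, phones, labels, trial)
                if F > best + 1e-9:
                    best, tau = F, trial
                    improved = True
        if not improved:
            break
    return tau, best

# toy data: phone "b" is systematically scored lower than phone "a",
# which is why phone-dependent thresholds help
gop    = [-0.1, -1.8, -0.3, -2.2, -1.0, -3.0]
phones = ["a",  "a",  "a",  "a",  "b",  "b"]
labels = [0,    1,    0,    1,    0,    1]
tau, F = search_thresholds(gop, phones, labels,
                           grid=[t / 10.0 for t in range(-40, 1)])
```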
Experiments And Results (cont.)
The ML-estimated baseline is trained on the native speech database, and the MFC training is conducted on the non-native speech database.
The F1-scores on the training and test sets are 0.388 and 0.381, respectively; that is, no clear F1-score improvement is obtained.
This indicates that, in our experiments, adaptation of the baseline acoustic models to the L2 acoustic conditions is not helpful in improving mispronunciation detection performance.
Experiments And Results (cont.)
The HMM set contains 1592 Gaussians, thus the dimension of the posterior
probability vector 𝛷𝑡 in 𝑦𝑡 = 𝑜𝑡 + 𝐵𝛷𝑡 is 1592.
When only the posterior probability vector of the current frame $\Phi_t$ is used, the context window size is 1 (CXT=1).
Here we expand the posterior probability vector $\Phi_t$ in $y_t = o_t + B\Phi_t$ by splicing the three successive posterior probability vectors $\Phi_{t-1}$, $\Phi_t$ and $\Phi_{t+1}$ together (denoted as RDLC CXT=3).
Experiments And Results (cont.)
Experiments And Results (cont.)
True Acceptance: the pronunciation is correct and is detected as correct.
False Acceptance: the pronunciation is wrong but is detected as correct.
True Rejection: the pronunciation is wrong and is detected as wrong.
False Rejection: the pronunciation is correct but is detected as wrong.
Experiments And Results (cont.)
$$\mathrm{DER} = \frac{1}{N}\left(N_{FA} + N_{FR}\right) \times 100\% \qquad \text{(Detection Error Rate)}$$

$$\mathrm{SA} = 1 - \mathrm{DER} = \frac{1}{N}\left(N_{TA} + N_{TR}\right) \times 100\% \qquad \text{(Scoring Accuracy)}$$
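These counts and rates can be computed directly. A small Python sketch on a made-up set of five segments:

```python
def detection_metrics(pred_mispron, true_mispron):
    """Compute the TA/FA/TR/FR counts and the derived DER and SA.

    pred_mispron, true_mispron -- parallel lists of booleans per phone
    segment (True = mispronunciation); acceptance means predicting False.
    """
    pairs = list(zip(pred_mispron, true_mispron))
    TA = sum(1 for p, t in pairs if not p and not t)  # correct, accepted
    FA = sum(1 for p, t in pairs if not p and t)      # wrong, accepted
    TR = sum(1 for p, t in pairs if p and t)          # wrong, rejected
    FR = sum(1 for p, t in pairs if p and not t)      # correct, rejected
    N = TA + FA + TR + FR
    DER = (FA + FR) / N * 100.0    # Detection Error Rate, in %
    SA = (TA + TR) / N * 100.0     # Scoring Accuracy = 100% - DER
    return {"TA": TA, "FA": FA, "TR": TR, "FR": FR, "DER": DER, "SA": SA}

m = detection_metrics([True, False, False, True,  False],
                      [True, False, True,  False, False])
```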
Experiments And Results (cont.)
※ Syllable Error Rate
※ Phone Error Rate

$$\mathcal{F}_{MMI} = \sum_{r=1}^{R} \log \frac{p(O_r|s_r)^{\kappa}\, p(s_r)^{\kappa}}{\sum_{s} p(O_r|s)^{\kappa}\, p(s)^{\kappa}}$$

$$\mathcal{F}_{MPE} = \sum_{r=1}^{R} \frac{\sum_{s \in S} p(O_r|s)^{\kappa}\, p(s)^{\kappa}\, A(s, s_r)}{\sum_{s' \in S} p(O_r|s')^{\kappa}\, p(s')^{\kappa}}$$
Conclusion And Future Work
• Mispronunciation detection experiments have shown that the methods are effective in increasing F1-scores on both the training set and the test set.
• The GOP based mispronunciation detection method can also be viewed as a two-class classification method: it uses the GOP score as the feature input and a pre-set threshold as the back-end classifier.
• Mispronunciation detection using the GOP score and related features as input and a DNN or SVM based classifier has been investigated.
• Evaluating the use of MFC-optimized GOP with better back-end classifiers is left for future work.
Conclusion And Future Work
• In mispronunciation detection, using DNN based acoustic models to compute GOP scores has been proposed and has shown better mispronunciation detection results.
• In ASR, applying sequence discriminative criteria such as MMI and MPE to DNN training has yielded lower WER.
• This suggests that using a task-oriented objective function could be helpful in obtaining better performance.
• We think the MFC objective function might be a better fine-tuning target for DNN based acoustic models in mispronunciation detection; its validation is left for future work.