Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15
Contents
Introduction
Goodness of Pronunciation (GoP) algorithm
Basic GoP algorithm
Phone dependent thresholds
Explicit error modeling
Collection of a non-native database
Performance measures
The labeling consistency of the human judges
Experimental results
Conclusions and future work
Introduction (1/3)
CAPT systems (Computer-Assisted Pronunciation Training)
Word and phrase level scoring (’93, ’94, ’97)
Intonation, stress, and rhythm
Requires several recordings of native utterances for each word
Difficult to add new teaching material
Selected phonemic error teaching (1997)
Uses duration information or models trained on non-native speech
Introduction (2/3)
HMMs have been used to produce sentence-level scores (1990, 1996)
Eskenazi’s system (1996) produces phone-level scores, but no attempt is made to relate them to human judgement
The authors’ proposed system: measures pronunciation quality for non-native speech at the phone level
Introduction (3/3)
Other issues:
GoP algorithms with refinements
Performance measures for both GoP scores and scores by human judges
Non-native database
Experiments on these performance measures
Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm (1/5)
A score is computed for each phone $p$ from $O^{(p)}$, the acoustic segment corresponding to that phone
GoP is the duration-normalized log of the posterior probability of the phone given the corresponding acoustic segment:
$$\mathrm{GoP}(p) = \frac{\left|\log P\!\left(p \mid O^{(p)}\right)\right|}{NF(p)}$$
Basic GoP algorithm (2/5)
$Q$ = the set of all phone models; $NF(p)$ = number of frames in $O^{(p)}$
$$P\!\left(p \mid O^{(p)}\right) = \frac{p\!\left(O^{(p)} \mid p\right) P(p)}{\sum_{q \in Q} p\!\left(O^{(p)} \mid q\right) P(q)}$$
By assuming equal phone priors and approximating the denominator sum by its maximum:
$$\mathrm{GoP}(p) \approx \frac{1}{NF(p)} \left|\log \frac{p\!\left(O^{(p)} \mid p\right)}{\max_{q \in Q} p\!\left(O^{(p)} \mid q\right)}\right|$$
Basic GoP algorithm (3/5)
The numerator term $p\!\left(O^{(p)} \mid p\right)$ is computed using forced alignment with the known transcription
The denominator term $\max_{q \in Q} p\!\left(O^{(p)} \mid q\right)$ is determined using an unconstrained phone loop
Basic GoP algorithm (4/5)
If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum-likelihood phone to be identical to the assumed phone
Hence, the denominator score is computed by summing the log likelihood per frame over the duration of $O^{(p)}$
In practice, this often means that more than one phone in the unconstrained phone sequence contributes to the computation of the denominator term (see the sketch below)
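A minimal sketch of the basic GoP computation in Python, assuming the numerator and denominator log likelihoods are already available from an HMM decoder; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def gop(forced_loglik: float, loop_logliks: np.ndarray) -> float:
    """Duration-normalised GoP score for one phone segment.

    forced_loglik: total log likelihood of the segment under the
        transcribed phone's model (numerator, from forced alignment).
    loop_logliks: per-frame log likelihoods along the best path of an
        unconstrained phone loop over the same frames (denominator);
        several loop phones may contribute to this segment.
    """
    nf = len(loop_logliks)                  # NF(p): segment length in frames
    denom = float(np.sum(loop_logliks))     # sum log likelihood over the segment
    return abs(forced_loglik - denom) / nf  # |log likelihood ratio| per frame
```

A score near zero means the transcribed phone fits about as well as the best unconstrained phone sequence; the phone is flagged as mispronounced when the score exceeds a rejection threshold.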
Basic GoP algorithm (5/5)
It is natural to train the acoustic models on speech data from native speakers
However, non-native speech is characterized by different formant structures than native speech for the same phone
Solution: adapt the Gaussian means by MLLR
Only a single global transform of the HMM Gaussian component means is used, to avoid adapting to specific phone error patterns
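A sketch of applying a single global MLLR mean transform, assuming the transform $W = [A \mid b]$ has already been estimated (the maximum-likelihood estimation itself is omitted); names are illustrative:

```python
import numpy as np

def apply_global_mllr(means: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Adapt all HMM Gaussian component means with one shared transform.

    means: (K, d) array of component means.
    W:     (d, d+1) global transform [A | b]; one transform for every
           component, so adaptation captures broad speaker traits rather
           than phone-specific error patterns.
    Returns the adapted means mu' = A @ mu + b for each component.
    """
    xi = np.hstack([means, np.ones((len(means), 1))])  # extended means [mu, 1]
    return xi @ W.T
```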
Phone dependent thresholds
The acoustic fit of phone-based HMMs differs from phone to phone
E.g. fricatives tend to have lower log likelihoods than vowels
2 ways to determine phone-specific thresholds:
Using the mean $\mu_p$ and variance $\sigma_p^2$ of the GoP scores for phone $p$ (sketched below)
Approximating human labeling behavior
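A sketch of the first option, under the assumption (not stated on the slide) that the threshold takes the common form $T_p = \mu_p + \alpha\,\sigma_p$:

```python
import numpy as np

def phone_thresholds(gop_samples: dict, alpha: float = 1.0) -> dict:
    """Phone-specific rejection thresholds T_p = mu_p + alpha * sigma_p.

    gop_samples: mapping phone -> list of GoP scores observed for it.
    alpha:       hypothetical scaling knob; larger values reject fewer
                 phones, since a higher GoP means a worse pronunciation.
    """
    return {p: float(np.mean(s) + alpha * np.std(s))
            for p, s in gop_samples.items()}
```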
Explicit error modeling (1/3)
2 types of pronunciation errors:
Individual mispronunciations
Systematic mispronunciations: substitutions of native sounds for target-language sounds that do not exist in the learner's native language
Knowledge of the learner’s native language is included in order to detect systematic mispronunciations
Explicit error modeling (2/3)
Solution: a recognition network incorporating both the correct pronunciation and common pronunciation errors, in the form of error sublattices for each phone
E.g. “but”
Explicit error modeling (3/3)
Target phone posterior probability
Scores for systematic mispronunciations
GoP that includes an additional penalty for systematic mispronunciations (one reading is sketched below)
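The formulas on this slide did not survive extraction; as one plausible reading (an assumption, not the paper's exact definition), the extended score adds a fixed penalty whenever the best path through the error network selects a known error variant of the phone:

```python
def gop_with_error_penalty(gop_score: float,
                           recognized_variant: str,
                           error_variants: set,
                           penalty: float = 1.0) -> float:
    """Hypothetical extended GoP with a systematic-mispronunciation penalty.

    recognized_variant: pronunciation variant chosen by the best path
        through the recognition network for this phone.
    error_variants: entries of the phone's error sublattice, built from
        knowledge of the learner's native language.
    penalty: illustrative fixed penalty; a higher GoP means worse.
    """
    is_systematic = recognized_variant in error_variants
    return gop_score + (penalty if is_systematic else 0.0)
```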
Collection of a non-native database (1/2)
Based on the procedures used for the WSJCAM0 corpus
Texts are composed of a limited vocabulary of 1500 words
6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1)
Each speaker reads 120 sentences:
a common set of 40 phonetically balanced sentences
80 sentences that vary from session to session
Collection of a non-native database (2/2)
6 human judges, all native speakers of British English
Each speaker was labeled by 1 judge
20 sentences from a female Spanish speaker were used as calibration sentences, annotated by all 6 judges
Transcriptions reflect the actual sounds uttered by the speakers, including phonemes from other languages
Performance measures (1/3)
Each measure compares 2 transcriptions of the same sentence
Transcriptions are either produced by human judges or generated automatically
4 types of performance measures
Strictness
Agreement
Cross-correlation
Overall phone correlation
Performance measures (2/3)
Transcriptions are compared on a frame-by-frame basis
Each frame is marked 1 if it is an error and 0 otherwise
This yields a vector $e$ of length $N$ with $e_i \in \{0, 1\}$
A Hamming window is applied (see the sketch below), because:
The transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain
Forced alignment might be erroneous due to poor acoustic modeling of non-native speech
Window length
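A sketch of the frame-error smoothing, assuming a normalised Hamming window convolved with the binary error vector; the window length here is an arbitrary placeholder, since the value did not survive extraction:

```python
import numpy as np

def smooth_errors(error_frames, win_len: int = 9) -> np.ndarray:
    """Soften abrupt 0/1 transitions in a frame-level error vector.

    error_frames: binary vector e with e_i in {0, 1}.
    win_len:      placeholder window length (odd keeps output centred).
    """
    e = np.asarray(error_frames, dtype=float)
    w = np.hamming(win_len)
    w /= w.sum()                   # normalise so smoothed values stay in [0, 1]
    return np.convolve(e, w, mode="same")
```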
Performance measures (3/3)
Strictness (S): measures how strict the judge was in marking pronunciation errors; the fraction of frames marked as errors, $S = \frac{1}{N}\sum_i e_i$
Relative strictness: the difference in strictness between 2 transcriptions
Overall Agreement (A): measures the agreement over all frames between 2 transcriptions; defined in terms of the cityblock distance between the 2 transcription vectors, $A = 1 - \frac{1}{N}\sum_i |e_{1,i} - e_{2,i}|$
Cross-correlation (CC): measures the agreement between the error frames in either or both transcriptions, $CC = \frac{e_1 \cdot e_2}{\|e_1\|\,\|e_2\|}$, where $\|\cdot\|$ is the Euclidean length
Phoneme Correlation (PC): measures the overall agreement of the rejection statistics for each phone between 2 judges/systems, $PC = \frac{\sum_p (r_{1,p} - \bar{r}_1)(r_{2,p} - \bar{r}_2)}{\sqrt{\sum_p (r_{1,p} - \bar{r}_1)^2}\,\sqrt{\sum_p (r_{2,p} - \bar{r}_2)^2}}$, where $r_j$ is a vector of rejection counts for each phone and $\bar{r}_j$ denotes the mean rejection count
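A compact sketch of the four measures as reconstructed above, assuming e1, e2 are (possibly smoothed) frame-error vectors and r1, r2 are per-phone rejection counts; this follows the slide's verbal definitions rather than a verified transcription of the paper's formulas:

```python
import numpy as np

def strictness(e) -> float:
    """S: fraction of frames marked as pronunciation errors."""
    return float(np.mean(e))

def agreement(e1, e2) -> float:
    """A: 1 minus the normalised cityblock distance between the vectors."""
    return 1.0 - float(np.mean(np.abs(np.asarray(e1) - np.asarray(e2))))

def cross_correlation(e1, e2) -> float:
    """CC: inner product normalised by the Euclidean vector lengths."""
    return float(np.dot(e1, e2) /
                 (np.linalg.norm(e1) * np.linalg.norm(e2)))

def phoneme_correlation(r1, r2) -> float:
    """PC: correlation of per-phone rejection counts between two raters."""
    c1 = np.asarray(r1, float) - np.mean(r1)
    c2 = np.asarray(r2, float) - np.mean(r2)
    return float(np.dot(c1, c2) /
                 (np.linalg.norm(c1) * np.linalg.norm(c2)))
```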
Labeling consistency of the human judges (1/4)
Labeling consistency of the human judges (2/4)
All results are within an acceptable range:
0.85 < A < 0.95, mean = 0.91
0.40 < CC < 0.65, mean = 0.47
0.70 < PC < 0.85, mean = 0.78
0.03 < relative strictness < 0.14, mean = 0.06
These mean values can be used as benchmark values
Labeling consistency of the human judges (3/4)
Labeling consistency of the human judges (4/4)
Experimental results (1/7)
Multiple-mixture monophone models
Corpus: WSJCAM0
The range of the rejection threshold was restricted so that the system's strictness lies within one standard deviation of the judges' strictness, i.e. $|S_{\mathrm{sys}} - \bar{S}_{\mathrm{judge}}| \leq \sigma_S$ (a filtering sketch follows below)
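A sketch of that restriction, assuming candidate thresholds are screened by the strictness they induce on a development set; all names are illustrative:

```python
def admissible_thresholds(candidates, strictness_at, s_mean, s_std):
    """Keep thresholds whose induced system strictness falls within one
    standard deviation of the judges' mean strictness.

    candidates:    iterable of rejection thresholds to try.
    strictness_at: callable mapping a threshold to the fraction of
                   frames the system rejects at that threshold.
    s_mean, s_std: mean and standard deviation of the judges' strictness.
    """
    return [t for t in candidates
            if abs(strictness_at(t) - s_mean) <= s_std]
```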
Experimental results (2/7)
Experimental results (3/7)
Experimental results (4/7)
Experimental results (5/7)
Experimental results (6/7)
Error modeling with Latin-American Spanish phone models is added to detect systematic mispronunciations
Experimental results (7/7)
Comparison of transcriptions between the human judges and the system with the error network
Conclusions and future work
2 GoP scoring mechanisms:
Basic GoP
GoP with a systematic-mispronunciation penalty
Refinement methods:
MLLR adaptation
Phone-dependent thresholds trained from human judgement
Error networks
Future work:
Information about the type of mistake