Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15
Contents
 Introduction
 Goodness of Pronunciation (GoP) algorithm
 Basic GoP algorithm
 Phone-dependent thresholds
 Explicit error modeling
 Collection of a non-native database
 Performance measures
 Labeling consistency of the human judges
 Experimental results
 Conclusions and future work
Introduction (1/3)
 CAPT systems (Computer-Assisted Pronunciation Training)
 Word- and phrase-level scoring (’93, ’94, ’97)
 Intonation, stress, and rhythm
 Requires several recordings of native utterances for each word
 Difficult to add new teaching material
 Selected phonemic error teaching (1997)
 Uses duration information or models trained on non-native speech
Introduction (2/3)
 HMMs have been used to produce sentence-level scores (1990, 1996)
 Eskenazi’s system (1996) produces phone-level scores, but makes no attempt to relate them to human judgement
 The authors’ proposed system measures pronunciation quality for non-native speech at the phone level
Introduction (3/3)
 Other topics covered
 GoP algorithms with refinements
 Performance measures for both GoP scores and scores by human judges
 A non-native database
 Experiments on these performance measures
Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm (1/5)
 A score for each phone $p$:

$$GoP(p) = \frac{1}{NF(p)} \left| \log P\left(p \mid O^{(p)}\right) \right|$$

 $O^{(p)}$ = the acoustic segment corresponding to phone $p$
 GoP = duration-normalised log of the posterior probability of a phone given the corresponding acoustic segment
Basic GoP algorithm (2/5)
 $Q$ = the set of all phone models
 $NF(p)$ = the number of frames in $O^{(p)}$
 Expanding the posterior:

$$P\left(p \mid O^{(p)}\right) = \frac{p\left(O^{(p)} \mid p\right) P(p)}{\sum_{q \in Q} p\left(O^{(p)} \mid q\right) P(q)}$$

 By assuming equal phone priors and approximating the sum by its maximum:

$$GoP(p) \approx \frac{1}{NF(p)} \left| \log \frac{p\left(O^{(p)} \mid p\right)}{\max_{q \in Q} p\left(O^{(p)} \mid q\right)} \right|$$
Basic GoP algorithm (3/5)
 The numerator term $p\left(O^{(p)} \mid p\right)$ is computed using forced alignment with the known transcription
 The denominator term $\max_{q \in Q} p\left(O^{(p)} \mid q\right)$ is determined using an unconstrained phone loop
Basic GoP algorithm (4/5)
 If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum-likelihood phone to be identical to the assumed phone
 Hence, the denominator score is computed by summing the log likelihood per frame over the duration of $O^{(p)}$
 In practice, this often means that more than one phone in the unconstrained phone sequence contributes to the computation of the denominator score (see the sketch below)
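As a concrete illustration, here is a minimal Python sketch (not from the paper) of the duration-normalised GoP computation, assuming per-frame log-likelihoods from the forced alignment and from the unconstrained phone loop are already available:

```python
import numpy as np

def gop_score(forced_loglik_frames, phone_loop_loglik_frames):
    """Duration-normalised GoP for one phone segment.

    forced_loglik_frames: per-frame log-likelihoods from forced alignment
        with the known (assumed) phone p.
    phone_loop_loglik_frames: per-frame log-likelihood of the best phone
        chosen by the unconstrained phone loop; across frames, several
        different phones may contribute to this score.
    """
    nf = len(forced_loglik_frames)                  # NF(p), segment length
    numerator = np.sum(forced_loglik_frames)        # log p(O^(p) | p)
    denominator = np.sum(phone_loop_loglik_frames)  # ~ log max_q p(O^(p) | q)
    return abs(numerator - denominator) / nf        # |log ratio| / NF(p)

# Hypothetical 5-frame segment
forced = np.array([-8.1, -7.9, -8.4, -8.0, -8.2])
loop   = np.array([-6.0, -5.8, -6.3, -6.1, -5.9])
print(gop_score(forced, loop))  # ~2.1: large score, likely mispronunciation
```

With these toy numbers the phone loop scores much higher than the forced alignment, so the GoP score is large and the phone would likely be rejected.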
Basic GoP algorithm (5/5)
 It is intuitive to use speech data from native speakers to train the acoustic models
 However, non-native speech is characterized by different formant structures than a native speaker’s for the same phone
 Adapt the Gaussian means by MLLR
 Use only a single global transform of the HMM Gaussian component means, to avoid adapting to specific phone error patterns (see the sketch below)
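A minimal sketch of the single-global-transform idea; the transform values below are illustrative only, since real MLLR estimates $A$ and $b$ by maximising the likelihood of the adaptation data:

```python
import numpy as np

def apply_global_mllr(means, A, b):
    """Apply one global mean transform mu' = A @ mu + b to every Gaussian
    component, regardless of which phone model it belongs to. Sharing a
    single transform keeps the adaptation from absorbing the learner's
    phone-specific error patterns.
    means: (n_components, dim) array of HMM Gaussian component means."""
    return means @ A.T + b

dim = 3                                              # toy feature dimension
means = np.random.randn(10, dim)                     # 10 hypothetical component means
A = np.eye(dim) + 0.05 * np.random.randn(dim, dim)   # illustrative transform only
b = 0.1 * np.ones(dim)                               # real MLLR estimates A, b from data
adapted_means = apply_global_mllr(means, A, b)
```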
Phone-dependent thresholds
 The acoustic fit of phone-based HMMs differs from phone to phone
 E.g. fricatives tend to have lower log likelihoods than vowels
 2 ways to determine phone-specific thresholds
 By using the mean $\mu_p$ and variance $\sigma_p^2$ of the scores for phone $p$ (see the sketch below)
 By approximating human labeling behavior
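A minimal sketch of the first option, assuming phone-labelled GoP scores from development data; the exact form of the threshold (here $\mu_p + \alpha\,\sigma_p$ with a tunable $\alpha$) is an assumption, not the paper's formula:

```python
import numpy as np
from collections import defaultdict

def phone_thresholds(scored_phones, alpha=1.0):
    """scored_phones: list of (phone_label, gop_score) pairs from
    development data. Returns a rejection threshold per phone, so that
    loosely fitting phones (e.g. fricatives) are not penalised relative
    to tightly fitting ones (e.g. vowels).
    Threshold form mu_p + alpha * sigma_p is an assumption."""
    by_phone = defaultdict(list)
    for phone, score in scored_phones:
        by_phone[phone].append(score)
    return {p: float(np.mean(s) + alpha * np.std(s))
            for p, s in by_phone.items()}

# Usage: reject phone p if its GoP score exceeds thresholds[p]
thresholds = phone_thresholds([("f", 3.1), ("f", 2.8), ("iy", 0.9), ("iy", 1.2)])
```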
Explicit error modeling (1/3)
 2 types of pronunciation errors
 Individual mispronunciations
 Systematic mispronunciations
 These consist of substitutions of native sounds for target-language sounds that do not exist in the native language
 Knowledge of the learner’s native language is included in order to detect systematic mispronunciations
Explicit error modeling (2/3)
 Solution: a recognition network incorporating both the correct pronunciation and common pronunciation errors, in the form of error sublattices for each phone (see the sketch below)
 E.g. “but” (figure: error network for the word “but”)
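A minimal sketch of such a network as a plain data structure; the phone labels and error alternatives below are hypothetical, not the paper's actual error sets:

```python
# Each target phone maps to its correct model plus common error
# substitutions drawn from the learner's native language; the recognizer
# is then free to choose any path through these alternatives.
# Phone labels and alternatives here are hypothetical.
error_network = {
    "b":  ["b", "p", "v"],       # sublattice for /b/
    "ah": ["ah", "aa", "uh"],    # sublattice for the vowel in "but"
    "t":  ["t", "d", "th"],      # sublattice for /t/
}

def word_lattice(phones, network):
    """Expand a word's phone sequence into per-phone alternative lists,
    i.e. a simple linear lattice with one sublattice per phone."""
    return [network.get(p, [p]) for p in phones]

print(word_lattice(["b", "ah", "t"], error_network))
```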
Explicit error modeling (3/3)
 Target phone posterior probability
 Scores for systematic mispronunciations
 GoP that includes an additional penalty for systematic mispronunciation
Collection of a non-native database (1/2)
 Based on the procedures used for the WSJCAM0 corpus
 Texts are composed from a limited vocabulary of 1500 words
 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1)
 Each speaker reads 120 sentences
 A common set of 40 phonetically-balanced sentences
 80 sentences varied from session to session
Collection of a non-native database (2/2)
 6 human judges who are native speakers of British English
 Each speaker was labeled by 1 judge
 20 sentences from a female Spanish speaker are used as calibration sentences
 Annotated by all 6 judges
 Transcriptions reflect the actual sounds uttered by the speakers
 Including phonemes from other languages
Performance measures (1/3)
 Compares 2 transcriptions of the same sentence
 Transcriptions are either transcribed by human judges or generated automatically
 4 types of performance measures
 Strictness
 Agreement
 Cross-correlation
 Overall phone correlation
Performance measures (2/3)
 Transcriptions are compared on a frame-by-frame basis
 Each error frame is marked as 1, and 0 otherwise
 Yields a vector $t = (t_1, \dots, t_N)$ of length $N$ with $t_i \in \{0, 1\}$
 A Hamming window is applied to smooth the transitions (see the sketch below)
 The transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain
 Forced alignment might be erroneous due to poor acoustic modeling of non-native speech
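A minimal sketch of the frame-level comparison and smoothing; the window length of 9 frames is an assumed value, not the paper's:

```python
import numpy as np

def smooth_error_vector(t, win_len=9):
    """Convolve a binary frame-level error vector with a normalised
    Hamming window so that uncertain segment boundaries contribute
    gradually rather than as hard 0/1 steps. win_len is an assumed value."""
    w = np.hamming(win_len)
    return np.convolve(t, w / w.sum(), mode="same")

t = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)  # toy error frames
print(np.round(smooth_error_vector(t), 2))
```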
Performance measures (3/3)

Strictness (S)
 Measures how strict the judge was in marking pronunciation errors, i.e. the proportion of frames marked as errors
 Relative strictness $S_{rel}$ compares the strictness of two transcriptions

Overall Agreement (A)
 Measures the agreement of all frames between 2 transcriptions
 Defined in terms of the cityblock distance between the 2 transcription vectors:

$$A = 1 - \frac{1}{N} \sum_{i=1}^{N} \left| t_i^{(1)} - t_i^{(2)} \right|$$
Cross-correlation (CC)
 Measures the agreement between the error frames in either or both transcriptions:

$$CC = \frac{\sum_{i=1}^{N} t_i^{(1)} t_i^{(2)}}{\lVert t^{(1)} \rVert \, \lVert t^{(2)} \rVert}$$

 $\lVert t \rVert$ is the Euclidean distance (norm) of a transcription vector
Phoneme Correlation (PC)
 Measures the agreement of the overall rejection statistics for each phone between 2 judges/systems
 PC is defined as the correlation coefficient of the rejection-count vectors (see the sketch below):

$$PC = \frac{\sum_k \left(r_k^{(1)} - \bar r^{(1)}\right)\left(r_k^{(2)} - \bar r^{(2)}\right)}{\sqrt{\sum_k \left(r_k^{(1)} - \bar r^{(1)}\right)^2 \sum_k \left(r_k^{(2)} - \bar r^{(2)}\right)^2}}$$

 $r$ is a vector of rejection counts for each phone
 $\bar r$ denotes the mean rejection count
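A minimal sketch of the four measures as reconstructed above; the exact forms of relative strictness and PC are assumptions based on the definitions given here:

```python
import numpy as np

def strictness(t):
    """S: fraction of frames marked as pronunciation errors."""
    return float(np.mean(t))

def relative_strictness(t1, t2):
    """Assumed form: absolute difference of the two strictness values."""
    return abs(strictness(t1) - strictness(t2))

def agreement(t1, t2):
    """A: overall agreement via the normalised cityblock (L1) distance."""
    return 1.0 - float(np.mean(np.abs(t1 - t2)))

def cross_correlation(t1, t2):
    """CC: agreement between error frames, normalised by Euclidean norms."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

def phone_correlation(r1, r2):
    """PC: Pearson correlation of the per-phone rejection-count vectors."""
    return float(np.corrcoef(r1, r2)[0, 1])

t1 = np.array([0, 0, 1, 1, 1, 0], dtype=float)   # toy transcription, judge 1
t2 = np.array([0, 1, 1, 1, 0, 0], dtype=float)   # toy transcription, judge 2
print(agreement(t1, t2), cross_correlation(t1, t2), relative_strictness(t1, t2))
```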
Labeling consistency of the human judges (1/4)
Labeling consistency of the human judges (2/4)
 All results are within an acceptable range
 0.85 < A < 0.95, mean = 0.91
 0.40 < CC < 0.65, mean = 0.47
 0.70 < PC < 0.85, mean = 0.78
 0.03 < $S_{rel}$ < 0.14, mean = 0.06
 These mean values can be used as benchmarks
Labeling consistency of the human judges (3/4)
Labeling consistency of the human judges (4/4)
Experimental results (1/7)
 Multiple-mixture monophone models
 Corpus: WSJCAM0
 The range of the rejection threshold was restricted so that the resulting strictness lies within one standard deviation of the judges’ strictness (see the sketch below):

$$\bar S - \sigma_S \le S \le \bar S + \sigma_S$$

where $\bar S$ and $\sigma_S$ are the mean and standard deviation of the judges’ strictness
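A minimal sketch of this restriction, assuming a hypothetical helper `system_strictness` that maps a candidate threshold to the strictness the system produces with it:

```python
import numpy as np

def admissible_thresholds(candidates, system_strictness, judge_strictness):
    """Keep only rejection thresholds whose resulting system strictness
    lies within one standard deviation of the judges' mean strictness.
    system_strictness: hypothetical callable, threshold -> strictness."""
    mean, std = np.mean(judge_strictness), np.std(judge_strictness)
    return [th for th in candidates
            if mean - std <= system_strictness(th) <= mean + std]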
Experimental results (2/7)
Experimental results (3/7)
Experimental results (4/7)
Experimental results (5/7)
Experimental results (6/7)
 Add error handling with Latin-American
Spanish models to detect systematic
mispronunciations
Experimental results (7/7)
 Comparison of transcriptions between the human judges and the system with the error network
Conclusions and future work
 2 GoP scoring mechanisms
 Basic GoP
 GoP with a penalty for systematic mispronunciation
 Refinement methods
 MLLR adaptation
 Phone-dependent thresholds trained from human judgement
 Error network
 Future work
 Provide information about the type of mistake made