Speaker Adaptation
Adaptation Techniques in
Automatic Speech Recognition
Tor André Myrvoll
Telektronikk 99(2), Issue on Spoken Language
Technology in Telecommunications, 2003.
Goal and Objective
Make ASR robust to speaker and
environmental variability.
Model adaptation: automatically adapt an HMM using limited but
representative new data to improve performance.
Train ASRs for applications with insufficient data.
What Do We Have/Adapt?
An HMM-based ASR trained in the usual manner.
The output probabilities are parameterized by GMMs.
Adapting the state transition probabilities and mixture weights gives
no improvement; they are difficult to estimate robustly.
The mixture means can be adapted “optimally” and have proven useful.
Adaptation Principles
Main assumption: the original model is “good enough”; model
adaptation cannot amount to re-training!
Offline Vs. Online
If possible, adapt offline (performance is then not compromised
for computational reasons).
Decode the adaptation speech data using the current model.
Use the result to estimate the “speaker-dependent” model’s statistics.
Online Adaptation Using Prior
Evolution.
The present posterior is the next prior:

$p(\theta \mid O_1^i, \hat{W}_1^i) = \dfrac{p(O_i, O_1^{i-1}, \hat{W}_1^i, \theta)}{p(O_1^i, \hat{W}_1^i)} = \dfrac{p(O_i \mid O_1^{i-1}, \hat{W}_1^i, \theta)\, p(\theta \mid O_1^{i-1}, \hat{W}_1^i)\, p(O_1^{i-1}, \hat{W}_1^i)}{p(O_i \mid O_1^{i-1}, \hat{W}_1^i)\, p(O_1^{i-1}, \hat{W}_1^i)}$

$\approx \dfrac{p(O_i \mid \hat{W}_i, \theta)\, p(\theta \mid O_1^{i-1}, \hat{W}_1^{i-1})}{p(O_i \mid \hat{W}_i)} = \dfrac{\sum_{Q}\sum_{K} p(O_i, Q, K \mid \hat{W}_i, \theta)\, p(\theta \mid O_1^{i-1}, \hat{W}_1^{i-1})}{p(O_i \mid \hat{W}_i)}$

where $Q$ and $K$ are the hidden state and mixture component sequences.
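To make the recursion concrete, here is a minimal sketch (my illustration, not from the article) for the one case where it is exact: a single Gaussian mean with known variance and a conjugate normal prior. The posterior after each batch is again normal, so it can literally be carried forward as the next prior; for HMMs no such fixed-dimension sufficient statistic exists, so approximate recursions are used instead.

```python
import numpy as np

# Sketch: incremental Bayesian estimation of a single Gaussian mean with
# known variance. With a conjugate normal prior, the posterior after batch
# O_i is again normal and becomes the prior for batch O_{i+1}.

def prior_evolution(batches, mu0, tau0):
    """mu0: prior mean; tau0: prior strength (pseudo-count of frames)."""
    mu, tau = mu0, tau0
    for O in batches:                          # O_i: i-th adaptation batch
        n = len(O)
        mu = (tau * mu + O.sum()) / (tau + n)  # posterior mean
        tau += n                               # posterior strength
        # the present posterior (mu, tau) is the next prior
    return mu, tau

rng = np.random.default_rng(0)
batches = [rng.normal(1.5, 1.0, size=20) for _ in range(5)]
print(prior_evolution(batches, mu0=0.0, tau0=10.0))
```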
MAP Adaptation
$\theta_{MAP} = \arg\max_{\theta}\, p(O \mid W, \theta)\, g(\theta \mid \varphi)$
HMMs have no fixed-dimension sufficient statistics =>
conjugate prior-posterior pairs cannot be used.
Find the posterior via EM.
Find the prior empirically (it is multi-modal; the first model is
estimated using ML training).
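For the mixture means, this yields the familiar MAP update: the adapted mean interpolates between the prior (SI) mean and the occupancy-weighted data average. A hedged numpy sketch, where gamma stands for the E-step mixture occupancies and tau is the relevance factor (names are mine, not the article's):

```python
import numpy as np

# Sketch of the standard MAP update for GMM means: interpolate between the
# prior mean mu0 and the occupancy-weighted data average, controlled by the
# relevance factor tau. gamma[t, m]: posterior probability of mixture m at
# frame t, produced by an E-step.

def map_adapt_means(X, gamma, mu0, tau=10.0):
    occ = gamma.sum(axis=0)                      # per-mixture occupancy
    num = gamma.T @ X                            # weighted data sums, (M, D)
    return (tau * mu0 + num) / (tau + occ)[:, None]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                    # adaptation frames
gamma = rng.dirichlet(np.ones(4), size=100)      # stand-in occupancies
mu0 = np.zeros((4, 3))                           # prior (SI) means
print(map_adapt_means(X, gamma, mu0).shape)      # (4, 3)
```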
EMAP
Not all phonemes occur in every context in the adaptation data;
we need to store the correlations between variables.
EMAP only considers correlations between the mean vectors,
under a jointly Gaussian assumption.
$S_0 = E[(\tilde{\mu} - \tilde{\mu}_0)(\tilde{\mu} - \tilde{\mu}_0)^T]$
For large model sizes, share means across
models.
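A sketch of the correlation idea under the jointly Gaussian assumption: means that never occur in the adaptation data can be shifted via their prior covariance $S_0$ with the means that were observed, exactly as in Gaussian conditioning. Shapes and names below are illustrative only (one scalar mean per Gaussian for brevity):

```python
import numpy as np

# Sketch of EMAP-style propagation: unseen means are updated through their
# prior correlations S0 with the means that were adapted directly.

def emap_propagate(mu0, S0, seen, mu_seen_hat):
    """mu0: stacked prior means; S0: prior covariance between mean vectors;
    seen: indices adapted directly; mu_seen_hat: their adapted values."""
    unseen = np.setdiff1d(np.arange(len(mu0)), seen)
    S_uo = S0[np.ix_(unseen, seen)]
    S_oo = S0[np.ix_(seen, seen)]
    shift = S_uo @ np.linalg.solve(S_oo, mu_seen_hat - mu0[seen])
    mu_hat = mu0.copy()
    mu_hat[seen] = mu_seen_hat
    mu_hat[unseen] = mu0[unseen] + shift     # correlation-based update
    return mu_hat
```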
Transformation Based Model
Adaptation
Estimate a transform $T_\eta$ parameterized by $\eta$.

ML:
$\eta_{ML} = \arg\max_{\eta}\, p(O \mid T_\eta(\Lambda_{SI}), W)$

MAP:
$\eta_{MAP} = \arg\max_{\eta}\, p(O \mid T_\eta(\Lambda_{SI}), W)\, g(\eta \mid \varphi)$
Bias, Affine and Nonlinear
Transformations
ML estimation of a bias:
$\hat{\mu}_m = \mu_m + b_{r(m)}$

Affine transformation:
$\hat{\mu}_m = A_{r(m)}\mu_m + b_{r(m)}$
$\hat{\Sigma}_m = A_{r(m)}\Sigma_m A_{r(m)}^T$

Nonlinear transformation ($g_\eta$ may be a neural network):
$\hat{\mu}_m = g_\eta(\mu_m)$

Here $r(m)$ denotes the regression class of mixture $m$.
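For the bias case the ML estimate has a closed form: a precision-weighted average of the occupancy-weighted residuals. A small sketch assuming diagonal covariances and a single shared bias (gamma, var and the function name are my notation):

```python
import numpy as np

# Sketch of the ML bias estimate for diagonal-covariance mixtures: the bias
# that maximizes the expected log-likelihood when every mean is shifted by
# the same amount is a precision-weighted average of the residuals.
# gamma[t, m]: mixture occupancies; var[m]: diagonal covariances.

def ml_bias(X, gamma, mu, var):
    w = gamma[:, :, None] / var[None, :, :]   # gamma_tm / sigma^2_md
    resid = X[:, None, :] - mu[None, :, :]    # x_t - mu_m
    return (w * resid).sum(axis=(0, 1)) / w.sum(axis=(0, 1))
```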
MLLR
$f(x) = Ax + b$, with $x \sim N(\mu_m, \Sigma_m)$
$W = [A\ \ b]$
$\hat{\mu}_m = W\xi_m$, where $\xi_m = [\mu_m^T\ \ 1]^T$ is the extended mean vector
$\hat{W} = \arg\max_{W}\, p(O \mid W, \Lambda, \mathcal{W})$
$\hat{\Sigma}_m = B_m^T \hat{H} B_m$, where $B_m$ is the inverse Choleski factor of $\Sigma_m^{-1}$
Apply separate transformations to different parts of the model
(HEAdapt in HTK).
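With diagonal covariances the MLLR mean transform has the well-known row-by-row closed form: each row $w_i$ of $W$ solves $G_i w_i = k_i$, where both statistics are accumulated from occupancies and extended means. A sketch under those assumptions:

```python
import numpy as np

# Sketch of MLLR mean-transform estimation, diagonal-covariance case.
# Each row of W = [A b] has a closed-form solution G_i w_i = k_i.

def mllr_mean_transform(X, gamma, mu, var):
    D = X.shape[1]
    M = mu.shape[0]
    xi = np.hstack([mu, np.ones((M, 1))])     # extended means xi_m, (M, D+1)
    occ = gamma.sum(axis=0)                   # per-mixture occupancy
    s = gamma.T @ X                           # occupancy-weighted data sums
    W = np.zeros((D, D + 1))
    for i in range(D):                        # one row per feature dimension
        wgt = occ / var[:, i]                 # gamma_m / sigma^2_mi
        G = (wgt[:, None, None] * xi[:, :, None] * xi[:, None, :]).sum(axis=0)
        k = ((s[:, i] / var[:, i])[:, None] * xi).sum(axis=0)
        W[i] = np.linalg.solve(G, k)
    return W                                  # adapted means: xi @ W.T
```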
SMAP
Model the mismatch between the SI model (x) and
the test environment.
$y_{mt} = \Sigma_m^{-1/2}(x_t - \mu_m)$
No mismatch: $y_{mt} \sim N(0, I)$
Mismatch: $y_{mt} \sim N(\nu, \eta)$
$\hat{\mu}_m = \mu_m + \Sigma_m^{1/2}\nu$
$\hat{\Sigma}_m = \Sigma_m^{1/2}\,\eta\,(\Sigma_m^{1/2})^T$
$\nu$ and $\eta$ are estimated by the usual ML methods on the
adaptation data.
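The diagonal-covariance case makes the update easy to sketch: pool the normalized residuals, fit the single mismatch Gaussian $N(\nu, \eta)$, and map it back into every mixture. This flat version ignores SMAP's tree structure over mixtures and is only an illustration:

```python
import numpy as np

# Sketch of an SMAP-style mismatch estimate, diagonal-covariance case:
# normalize each frame by each mixture, pool the residuals with occupancies,
# fit one mismatch Gaussian, and back-transform it into every mixture.

def smap_adapt(X, gamma, mu, var):
    # normalized residuals y_mt = Sigma_m^{-1/2} (x_t - mu_m)
    y = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    w = gamma[:, :, None]
    nu = (w * y).sum(axis=(0, 1)) / gamma.sum()               # mismatch mean
    eta = (w * (y - nu) ** 2).sum(axis=(0, 1)) / gamma.sum()  # mismatch var
    mu_hat = mu + np.sqrt(var) * nu                           # back-transform
    var_hat = var * eta
    return mu_hat, var_hat
```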
Adaptive Training
Gender dependent model selection
VTLN (in HTK using WARPFREQ)
Speaker Adaptive Training
Assumption: there exists a compact model ($\Lambda_C$) that relates
to every speaker-dependent model via an affine transformation $T$
(as in MLLR). The model and the transformations are found using EM.
$(\Lambda_C, T) = \arg\max_{\Lambda,\,T}\, \prod_{r=1}^{R} p(O^r \mid T_r(\Lambda_C), W^r)$
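The EM alternation is easy to see in a toy version that uses per-speaker bias transforms instead of full MLLR matrices: estimate each $T_r$ given the canonical means, then re-estimate the canonical means from transform-compensated data. Occupancies are held fixed and covariances assumed to be identity for brevity; this is a sketch of the idea, not the article's procedure:

```python
import numpy as np

# Toy sketch of Speaker Adaptive Training: alternate between per-speaker
# bias transforms and the canonical mixture means, so speaker variability is
# absorbed by the transforms rather than the canonical model.
# Xs[r]: frames for speaker r; gammas[r][t, m]: fixed mixture occupancies.

def sat(Xs, gammas, mu_init, iters=5):
    mu = mu_init.copy()                            # canonical means (M, D)
    for _ in range(iters):
        # T_r step: bias that best explains speaker r given canonical means
        bs = [(g[:, :, None] * (X[:, None, :] - mu)).sum(axis=(0, 1)) / g.sum()
              for X, g in zip(Xs, gammas)]
        # canonical step: re-estimate means from bias-compensated data
        num = sum(g.T @ (X - b) for X, g, b in zip(Xs, gammas, bs))
        den = sum(g.sum(axis=0) for g in gammas)
        mu = num / den[:, None]
    return mu, bs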
Cluster Adaptive Training
Group the speakers in the training set into clusters, then find
the cluster closest to the test speaker.
Use Canonical Models
$M_m = [\mu_m^1 \cdots \mu_m^C]$
$\mu_m^{(r)} = M_m\lambda_r + b$, where $\lambda_r$ are the cluster weights for speaker $r$
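Given the canonical cluster means $M_m$, the test speaker's weights $\lambda_r$ come from a small weighted least-squares problem. A sketch assuming unit covariances (a bias cluster would simply add a column of ones to each $M_m$):

```python
import numpy as np

# Sketch of the Cluster Adaptive Training weight estimate: minimize the
# occupancy-weighted squared error sum_m sum_t gamma_mt ||x_t - M_m @ lam||^2.

def cat_weights(X, gamma, Ms):
    """Ms[m]: (D, C) matrix of cluster means for mixture m."""
    C = Ms[0].shape[1]
    G = np.zeros((C, C))
    k = np.zeros(C)
    for m, Mm in enumerate(Ms):
        occ = gamma[:, m].sum()
        G += occ * Mm.T @ Mm
        k += Mm.T @ (gamma[:, m] @ X)
    return np.linalg.solve(G, k)   # lambda_r; speaker mean = Ms[m] @ lambda_r
```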
Eigenvoices
Similar to Cluster Adaptive Training.
Concatenate the means from R speaker-dependent models and
perform PCA on the resulting vectors.
Store K << R eigenvoice vectors.
Form a vector of means from the SI model too.
Given a new speaker, the mean supervector is a linear combination
of the SI vector and the eigenvoice vectors.
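A sketch of the construction: stack each speaker-dependent model's means into a supervector, take the top K principal components, and represent a new speaker in that low-dimensional basis. A plain least-squares projection stands in here for the usual ML weight estimation on adaptation data:

```python
import numpy as np

# Sketch of eigenvoice adaptation: PCA over R speaker supervectors, keep
# K << R eigenvoices, and express a new speaker in that basis.

def eigenvoices(supervectors, K):
    S = np.asarray(supervectors)        # (R, M*D) speaker supervectors
    mean = S.mean(axis=0)
    U, s, Vt = np.linalg.svd(S - mean, full_matrices=False)
    return mean, Vt[:K]                 # SI-like mean and K eigenvoices

def adapt_new_speaker(sv_target, mean, E):
    w = E @ (sv_target - mean)          # projection weights (rows of E are
    return mean + E.T @ w               # orthonormal), adapted supervector
```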
Summary
Two major approaches: MAP (& EMAP) and MLLR.
MAP needs more data than MLLR (it uses a simple prior), but with
enough data MAP converges to the SD model.
Adaptive training is gaining popularity.
For mobile applications, complexity and
memory are major concerns.