
Adaptation Techniques in
Automatic Speech Recognition
Tor André Myrvoll
Telektronikk 99(2), Issue on Spoken Language
Technology in Telecommunications, 2003.
Goal and Objective

- Make ASR robust to speaker and environmental variability.
- Model adaptation: automatically adapt an HMM using limited but representative new data to improve performance.
- Train ASR systems for applications with insufficient data.
What Do We Have/Adapt?

- An HMM-based ASR system trained in the usual manner, with output probabilities parameterized by GMMs.
- Adapting the state transition probabilities and mixture weights yields no improvement.
- The covariances are difficult to estimate robustly.
- The mixture means can be adapted "optimally" and have proven useful.
Adaptation Principles

- Main assumption: the original model is "good enough"; model adaptation cannot amount to full re-training.
Offline vs. Online

- If possible, adapt offline (performance is then not compromised for computational reasons).
- Decode the adaptation speech data using the current model.
- Use the decoded transcriptions to estimate the speaker-dependent model's statistics.
Online Adaptation Using Prior Evolution

- The present posterior becomes the next prior:

  p(\Lambda \mid O_1^i, \hat W_1^i)
    = \frac{p(O_i \mid O_1^{i-1}, \hat W_1^i, \Lambda)\, p(\Lambda \mid O_1^{i-1}, \hat W_1^i)\, p(O_1^{i-1}, \hat W_1^i)}
           {p(O_i \mid O_1^{i-1}, \hat W_1^i)\, p(O_1^{i-1}, \hat W_1^i)}
    = \frac{p(O_i \mid \hat W_i, \Lambda)\, p(\Lambda \mid \hat W_1^{i-1}, O_1^{i-1})}
           {p(O_i \mid \hat W_i)}
    = \frac{\sum_{Q}\sum_{K} p(O_i, Q, K \mid \hat W_i, \Lambda)\, p(\Lambda \mid O_1^{i-1}, \hat W_1^{i-1})}
           {p(O_i \mid \hat W_i)}

  where Q and K are the hidden state and mixture-component sequences.
MAP Adaptation

- \Lambda_{MAP} = \arg\max_{\Lambda} p(O \mid W, \Lambda)\, g(\Lambda \mid \varphi)
- HMMs have no sufficient statistics of fixed dimension, so conjugate prior-posterior pairs cannot be used.
- Find the posterior mode via EM.
- Find the prior empirically (it is multi-modal; the first model is estimated using ML training).
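For a single Gaussian mean, the MAP update reduces to interpolating between the prior (speaker-independent) mean and the occupation-weighted sample mean of the adaptation data. The sketch below illustrates that update; the function name, the fixed prior weight `tau`, and the use of precomputed occupation probabilities are illustrative assumptions, not the article's notation.

```python
import numpy as np

def map_adapt_mean(mu0, tau, frames, gammas):
    """MAP re-estimation of one Gaussian mean (illustrative sketch).

    mu0    : prior (speaker-independent) mean, shape (d,)
    tau    : prior weight; larger tau keeps the estimate closer to mu0
    frames : adaptation feature vectors, shape (n, d)
    gammas : per-frame occupation probabilities for this Gaussian, shape (n,)
    """
    gamma_sum = gammas.sum()
    # Occupation-weighted sum of the adaptation frames.
    weighted_sum = (gammas[:, None] * frames).sum(axis=0)
    # Interpolate: with little data the prior dominates; with much data
    # the estimate converges to the weighted sample mean (the SD model).
    return (tau * mu0 + weighted_sum) / (tau + gamma_sum)
```

With tau = 0 this is plain ML re-estimation; as tau grows, the SI mean is trusted more, which is exactly the "MAP needs more data" behavior noted in the summary.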
EMAP

- Not all phonemes occur in all contexts in the adaptation data, so correlations between variables must be stored and exploited.
- EMAP considers only the correlation between mean vectors, under a jointly Gaussian assumption:

  S_0 = E\big[(\tilde\mu - \tilde\mu_0)(\tilde\mu - \tilde\mu_0)^T\big]

- For large model sizes, share means across models.
Transformation-Based Model Adaptation

- Estimate a transform T parameterized by \theta.
- ML: \theta_{ML} = \arg\max_{\theta} p(O \mid T_{\theta}(\Lambda_{SI}), W)
- MAP: \theta_{MAP} = \arg\max_{\theta} p(O \mid T_{\theta}(\Lambda_{SI}), W)\, g(\theta \mid \varphi)

Bias, Affine and Nonlinear Transformations

- ML estimation of a bias: \hat\mu_m = \mu_m + b_{r(m)}
- Affine transformation: \hat\mu_m = A_{r(m)}\mu_m + b_{r(m)}, \quad \hat\Sigma_m = A_{r(m)}\Sigma_m A_{r(m)}^T
- Nonlinear transformation: \hat\mu_m = g_{\theta}(\mu_m) (g may be a neural network)
MLLR

- Assume an affine transform of the means: f(x) = Ax + b, with x \sim N(\mu_m, \Sigma_m).
- Write W = [A\; b] and \xi_m = [\mu_m^T\; 1]^T, so that \hat\mu_m = W\xi_m.
- The transform is estimated by maximum likelihood: \hat W = \arg\max_{W} p(O \mid \Lambda, W)
- Variances can be transformed as \hat\Sigma_m = B_m^T \hat H_m B_m.
- Apply separate transformations to different parts of the model (HEAdapt in HTK).
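To make the mean transform concrete, here is a small numpy sketch: it applies \hat\mu_m = W\xi_m and, as a stand-in for the EM-based estimation, fits W by least squares on pairs of source and target means. Real MLLR maximizes an EM auxiliary function using state occupation statistics; the least-squares fit and the function names here are illustrative simplifications.

```python
import numpy as np

def apply_mllr(W, mu):
    """Apply an MLLR mean transform: mu_hat = W @ xi, with xi = [mu; 1]."""
    xi = np.append(mu, 1.0)          # extended mean vector
    return W @ xi

def estimate_mllr_ls(mus, targets):
    """Toy least-squares estimate of W = [A b] from (SI mean, adapted mean)
    pairs. Real MLLR instead maximizes the EM auxiliary function with
    occupation statistics; this is only an illustrative stand-in."""
    Xi = np.hstack([mus, np.ones((len(mus), 1))])   # rows are extended means
    X, *_ = np.linalg.lstsq(Xi, targets, rcond=None)
    return X.T                                      # shape (d, d+1)
```

Because a single W is shared by many Gaussians, MLLR can adapt means it never observed in the adaptation data, which is its main advantage over MAP with little data.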
SMAP

- Model the mismatch between the SI model (x) and the test environment in the normalized domain:

  y_{mt} = \Sigma_m^{-1/2}(x_t - \mu_m)

- No mismatch: y_{mt} \sim N(0, I)
- Mismatch: y_{mt} \sim N(\nu, \eta)
- \nu and \eta are estimated by the usual ML methods on the adaptation data.
- Back-transform to update the model:

  \hat\mu_m = \mu_m + \Sigma_m^{1/2}\nu, \quad \hat\Sigma_m = \Sigma_m^{1/2}\,\eta\,(\Sigma_m^{1/2})^T
Adaptive Training

- Gender-dependent model selection.
- VTLN (in HTK, via WARPFREQ).
Speaker Adaptive Training

- Assumption: there exists a compact model \Lambda_C that relates to every speaker-dependent model via an affine transformation T (as in MLLR). The model and the transformations are found jointly using EM:

  (\hat\Lambda_C, \hat T) = \arg\max_{\Lambda, T} \prod_{r=1}^{R} p(O^{(r)} \mid T_r(\Lambda_C), W^{(r)})
Cluster Adaptive Training

- Group the speakers in the training set into clusters; at test time, find the cluster closest to the test speaker.
- Alternatively, use canonical models:

  M_m = [\mu_m^1 \cdots \mu_m^C], \quad \mu_m = M_m \lambda(r) + b

Eigenvoices

- Similar to cluster adaptive training.
- Concatenate the means from R speaker-dependent models into supervectors and perform PCA on the resulting vectors; store K << R eigenvoice vectors.
- Form a supervector of means from the SI model as well.
- For a new speaker, the means are a linear combination of the SI supervector and the eigenvoice vectors.
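The eigenvoice construction is essentially PCA on speaker mean supervectors. Below is a minimal numpy sketch (function names and the SVD-based PCA are assumptions); using the average of the speaker supervectors as the origin is a simplification of the slides' use of the SI supervector.

```python
import numpy as np

def eigenvoice_basis(supervectors, K):
    """PCA on R stacked speaker-dependent mean supervectors; keep K << R.

    supervectors : (R, D) array, one concatenated mean vector per speaker
    Returns the origin supervector and the K leading eigenvoices (K, D).
    """
    origin = supervectors.mean(axis=0)
    # SVD of the centered data gives the principal directions in Vt's rows.
    _, _, Vt = np.linalg.svd(supervectors - origin, full_matrices=False)
    return origin, Vt[:K]

def adapt_speaker(origin, eigenvoices, weights):
    """New speaker's mean supervector: origin plus weighted eigenvoices."""
    return origin + weights @ eigenvoices
```

At test time only the K weights are estimated, so, as with CAT, adaptation works from a few seconds of speech.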
Summary

- Two major approaches: MAP (and EMAP) and MLLR.
- MAP needs more data than MLLR (it uses a simple prior); with enough data, MAP converges to the SD model.
- Adaptive training is gaining popularity.
- For mobile applications, complexity and memory are major concerns.