Robust Speaker Recognition
JHU Summer School 2008
Lukas Burget
Brno University of Technology
Intersession variability

The largest challenge to practical use of speaker detection systems is channel/session variability.

• Variability refers to changes in channel effects between training and successive detection attempts.
• Channel/session variability encompasses several factors:
  – The microphone: carbon-button, electret, hands-free, array, etc.
  – The acoustic environment: office, car, airport, etc.
  – The transmission channel: landline, cellular, VoIP, etc.
  – Differences in the speaker's voice: aging, mood, spoken language, etc.
• Anything that affects the spectrum can cause problems: speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers.
NIST SRE 2008, interview speech: with a different microphone in training and test, EER is about 3%; with the same microphone in training and test, EER is below 1%.
Channel/Session Compensation

Channel/session compensation occurs at several levels in a speaker detection system (front-end processing → features → target/background model adaptation → LR score normalization):

• Signal domain
  – Noise removal
  – Tone removal
• Feature domain
  – Cepstral mean subtraction
  – RASTA filtering
  – Mean & variance normalization
  – Feature warping
  – Feature mapping
  – Eigenchannel adaptation in feature domain
• Model domain
  – Speaker Model Synthesis
  – Eigenchannel compensation
  – Joint Factor Analysis
  – Nuisance Attribute Projection
• Score domain
  – Z-norm
  – T-norm
  – ZT-norm
Adaptive Noise Suppression

Basic idea of spectral subtraction (or Wiener filtering):

Y(n) = X(n) − N(n)

• Y(n) – spectrum of enhanced speech
• X(n) – spectrum of the nth frame of noisy speech
• N(n) – estimate of the stationary additive noise spectrum

Reformulated as filtering: Y(n) = H(n)X(n), where H(n) = (X(n) − N(n)) / X(n)

In practice it is necessary to:
• smooth H(n) in time
• make sure the magnitude spectrum is not negative
• …
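The scheme above can be sketched as follows. This is a minimal illustration, not the exact algorithm from the slides; the floor value and the smoothing constant `alpha` are assumed for the example.

```python
import numpy as np

def spectral_subtraction(X, N, floor=0.01, alpha=0.9):
    """Frame-wise spectral subtraction expressed as a filter Y(n) = H(n) X(n).

    X: (num_frames, num_bins) magnitude spectra of noisy speech
    N: (num_bins,) estimate of the stationary additive noise spectrum
    floor: lower bound on the gain, keeping the magnitude spectrum non-negative
    alpha: constant for smoothing H(n) in time (assumed value)
    """
    H_prev = np.ones(X.shape[1])
    Y = np.empty_like(X, dtype=float)
    for n, frame in enumerate(X):
        H = (frame - N) / np.maximum(frame, 1e-10)   # H(n) = (X(n) - N(n)) / X(n)
        H = np.maximum(H, floor)                     # keep the spectrum non-negative
        H = alpha * H_prev + (1 - alpha) * H         # smooth H(n) in time
        Y[n] = H * frame                             # Y(n) = H(n) X(n)
        H_prev = H
    return Y
```

For frames containing only the estimated noise, the gain is driven to the floor and the output spectrum is strongly attenuated.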
Adaptive Noise Suppression

• Goal: suppress wideband noise and preserve the speech.
• Approach: maintain transient and dynamic speech components, such as energy bursts in consonants, that are important “information-carriers”.
• The suppression algorithm has two primary components:
  – Detection of speech or background in each frame
  – A suppression component using an adaptive Wiener filter, which requires:
    • the underlying speech signal spectrum, obtained by smoothing the enhanced output
    • the background spectrum
    • a signal change measure, given by a spectral derivative, for controlling smoothing constants
[Block diagram: degraded speech → short-time spectral magnitude → suppression filter → enhanced speech; the filter is driven by speech/background detection, background and speech spectrum estimates, and a spectral derivative controlling the smoothing time constant.]
Adaptive Noise Suppression

• C3 example from ICSI, processed with the LLEnhance toolkit for wideband noise reduction
• [Audio examples at SNR = 15 dB and SNR = 25 dB]
Cepstral Mean Subtraction

• MFCC feature extraction scheme: Fourier transform → magnitude → filter bank → log() → cosine transform.
• Consider the same speech signal recorded twice, over different microphones attenuating certain frequencies.
• Scaling in the magnitude spectrum domain (e.g. ×0.5) corresponds to a constant shift (e.g. −0.3) of the log filter bank outputs.
Cepstral Mean Subtraction

• Assuming the frequency characteristics of the two microphones do not change over time, the whole temporal trajectories of the affected log filter bank outputs differ by a constant.
• The shift disappears after subtracting the mean computed over the segment.
• Usually only speech frames are considered for the mean estimation.
• Since the cosine transform is a linear operation, the same trick can be applied directly in the cepstral domain.
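CMS is a one-liner in practice. The sketch below illustrates the key property from the slide: two recordings differing by a constant channel shift become identical after CMS (the optional speech mask for restricting the mean estimate is an assumed interface).

```python
import numpy as np

def cepstral_mean_subtraction(C, speech_mask=None):
    """Subtract the per-coefficient mean, estimated over the segment
    (optionally over speech frames only), from cepstral features C
    of shape (num_frames, num_ceps)."""
    frames = C if speech_mask is None else C[speech_mask]
    return C - frames.mean(axis=0)
```

Adding any constant channel shift to `C` leaves the CMS output unchanged.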
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas, CMS; miss probability vs. false alarm probability.]
RASTA filtering

• Filter the log filter bank output (or equivalently cepstral) temporal trajectories with a band-pass filter.
• Remove fast changes (> 25 Hz), which are unlikely to be caused by the speaker, whose ability to quickly change vocal tract configuration is limited.
• Remove slow changes to compensate for the channel effect (≈ CMS over a 0.5 s sliding window).

[Plots: frequency characteristic of the RASTA filter (magnitude in dB vs. modulation frequency in Hz), its impulse response, and an example trajectory before and after RASTA filtering.]
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS vs. CMS with RASTA.]
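A band-pass filter of the kind described above can be sketched as a small IIR filter. The coefficients below follow the widely used RASTA filter form (a smoothed-derivative numerator over a leaky integrator); the pole value 0.94 is one common choice and is an assumption here, not taken from the slides.

```python
import numpy as np

# RASTA band-pass filter sketch:
#   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.94 z^-1)
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR part: smoothed derivative
POLE = 0.94                                       # IIR part: leaky integration

def rasta_filter(traj):
    """Band-pass filter one log filter-bank (or cepstral) temporal trajectory."""
    traj = np.asarray(traj, dtype=float)
    y = np.zeros_like(traj)
    for n in range(len(traj)):
        fir = sum(B[k] * traj[n - k] for k in range(len(B)) if n - k >= 0)
        y[n] = fir + (POLE * y[n - 1] if n >= 1 else 0.0)
    return y
```

Since the numerator coefficients sum to zero, the DC gain is zero: a constant (channel) component is filtered out, just like the mean removed by CMS.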
Mean and Variance Normalization

• While convolutive noise causes a constant shift of the cepstral coefficient temporal trajectories, noise that is additive in the spectral domain fills the valleys in the trajectories.
• In addition to subtracting the mean, each trajectory can be normalized to unit variance (i.e. divided by its standard deviation) to compensate for this effect.

[Plot: cepstral trajectory of clean speech, speech with additive noise, and both after CMN/CVN.]
Feature Warping

• Warp each cepstral coefficient, within a 3-second sliding window, to a Gaussian distribution: the empirical CDF value of each frame is passed through the inverse Gaussian cumulative distribution function.
• Combines the advantages of the previous techniques (CMN/CVN, RASTA).
• The resulting coefficients are (locally) Gaussianized → more suitable for GMM models.
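The warping step can be sketched as follows: rank each frame within its sliding window and map the resulting empirical CDF value through the inverse Gaussian CDF. The window length of 300 frames (≈ 3 s at an assumed 10 ms frame shift) and the rank offset of 0.5 are illustrative choices.

```python
import numpy as np
from statistics import NormalDist

def feature_warp(traj, win=300):
    """Warp one cepstral-coefficient trajectory so that, within each sliding
    window, its values follow a standard normal distribution."""
    inv_cdf = NormalDist().inv_cdf
    traj = np.asarray(traj, dtype=float)
    half = win // 2
    warped = np.empty_like(traj)
    for t in range(len(traj)):
        window = traj[max(0, t - half):t + half + 1]
        rank = int(np.sum(window < traj[t])) + 1         # rank of the centre frame
        warped[t] = inv_cdf((rank - 0.5) / len(window))  # empirical CDF -> Gaussian
    return warped
```

Regardless of the input's scale and offset, the output is approximately zero-mean and unit-variance, which also subsumes CMN/CVN.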
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS, with RASTA, and with Feature Warping.]
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS, with RASTA, with Feature Warping, + triple deltas + HLDA.]
HLDA

Heteroscedastic Linear Discriminant Analysis provides a linear transformation that de-correlates classes. HLDA allows for dimensionality reduction while preserving the discriminability between classes (HLDA without dimensionality reduction is also called MLLT).

[Illustration: example of a 2D GMM with a useful dimension and a nuisance dimension.]
Speaker Model Synthesis

• It is generally difficult to get enrollment speech from all microphone types to be used.
• The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000).
  – A mapping of model parameters between different microphone types is applied.

[Illustration: a model trained on carbon-button speech is synthesized into electret and cellular variants.]
Speaker Model Synthesis

Learning the mapping of model parameters between different microphone types:
• Start with a channel-independent root model.
• Create channel models by adapting the root with channel-specific data.
• Learn the mean shift between channel models.
Speaker Model Synthesis

Training a speaker model:
• Adapt the channel model which scores highest on the training data to get the target model.
• Synthesize a new target channel model by applying the learned shift to each mean:

T_i^{CD1→CD2}(μ_i) = μ_i + (μ_i^{CD2} − μ_i^{CD1})

• GMM weights and variances can also be adapted and used to improve the mapping of model parameters between different microphone types:

T_i^{CD1→CD2}(w_i) = w_i (w_i^{CD2} / w_i^{CD1})
T_i^{CD1→CD2}(σ_i) = σ_i (σ_i^{CD2} / σ_i^{CD1})
Feature mapping

• Aim: apply a transform to map each channel-dependent feature space into a channel-independent feature space.
• Approach:
  – Train a channel-independent (CI) model by pooling data from all types of channels.
  – Train channel-dependent (CD) models using MAP adaptation.
  – For each utterance, find the top-scoring CD model (channel detection).
  – Map each feature vector in the utterance into the CI space:

y_t = M_i^{CD→CI}(x_t)

D.A. Reynolds, “Channel Robust Speaker Verification via Feature Mapping,” ICASSP 2003
Feature mapping

• As for SMS, create channel models by adapting the root with channel-specific data.
• Learn the mean shifts between each channel model and the channel-independent root model.
Feature mapping

• For each (training or test) speech segment, determine the maximum likelihood channel model.
• For each frame of the segment, record the top-1 Gaussian:

i = argmax_{1≤j≤M} w_j^{CD} p_j^{CD}(x_t)

• For each frame, apply the mapping taking x with the CD pdf to y with the CI pdf:

y_t = (x_t − μ_i^{CD}) (σ_i^{CI} / σ_i^{CD}) + μ_i^{CI}

• The target model is adapted from the CI model using the mapped features.
• Mapped features and CI models are used in test.
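The per-frame mapping above can be sketched as follows, assuming diagonal covariances (stored as per-component variance vectors):

```python
import numpy as np

def map_features(X, cd_means, cd_vars, cd_weights, ci_means, ci_vars):
    """Map features from the detected channel-dependent (CD) space to the
    channel-independent (CI) space via the top-1 Gaussian per frame:
        y_t = (x_t - mu_i^CD) * sigma_i^CI / sigma_i^CD + mu_i^CI
    Means/vars have shape (M, D); weights have shape (M,)."""
    Y = np.empty_like(X, dtype=float)
    for t, x in enumerate(X):
        # weighted log-likelihood of x under each diagonal-covariance CD Gaussian
        ll = (np.log(cd_weights)
              - 0.5 * np.sum(np.log(2 * np.pi * cd_vars)
                             + (x - cd_means) ** 2 / cd_vars, axis=1))
        i = np.argmax(ll)  # top-1 Gaussian for this frame
        Y[t] = (x - cd_means[i]) * np.sqrt(ci_vars[i] / cd_vars[i]) + ci_means[i]
    return Y
```

In the single-Gaussian case where the CD model is simply a shifted copy of the CI model, the mapping reduces to undoing the shift.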
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS, with RASTA, with Feature Warping, + triple deltas + HLDA, + Feature mapping (14 classes).]
Session variability in mean supervector space

• GMM mean supervector – a column vector created by concatenating the mean vectors of all GMM components.
• When variances are shared by all speaker models, the supervector M fully defines the speaker model.
• Speaker Model Synthesis can be rewritten as: M_CD2 = M_CD1 + k_{CD1→CD2}, where k_{CD1→CD2} is the cross-channel shift.
• Drawbacks of SMS (and Feature Mapping):
  – Channel-dependent models must be created for each channel.
  – Different factors causing intersession variability may combine (e.g. channel and language) → compensation must be trained for each such combination.
  – The factors are not discrete (i.e. their effects on the intersession variability may be more or less strong).
• There is evidence that only a limited number of directions in the supervector space are strongly affected by intersession variability; different directions possibly correspond to different factors.
Session variability in mean supervector space

[Example: single Gaussian model with 2D features; target speaker model and UBM shown with the high intersession variability direction.]

Session compensation in supervector space

For recognition, move both the target speaker model and the UBM along the high intersession variability direction(s) to best fit the test data (e.g. in the ML sense).
6D example of supervector space
Identifying high intersession variability directions

• Take multiple speech segments from many training speakers, recorded under different channel conditions. For each segment, derive a supervector by MAP adapting the UBM.
• From each supervector, subtract the mean computed over the supervectors of the corresponding speaker.
• Find the directions with the largest intersession variability using PCA (eigenvectors of the average within-speaker covariance matrix).

[Illustration: clusters of supervectors for speakers 1, 2 and 3, sharing a common within-speaker variability direction.]
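The three steps above can be sketched as a small PCA routine (eigendecomposition of the within-speaker covariance of speaker-mean-centered supervectors):

```python
import numpy as np

def eigenchannel_directions(supervectors, speaker_ids, k):
    """Estimate the k directions of largest intersession variability.

    supervectors: (num_segments, dim) MAP-adapted supervectors
    speaker_ids:  per-segment speaker labels
    k: number of eigenchannel directions to keep
    """
    sv = np.asarray(supervectors, dtype=float)
    centered = sv.copy()
    for spk in set(speaker_ids):
        idx = [i for i, s in enumerate(speaker_ids) if s == spk]
        centered[idx] -= sv[idx].mean(axis=0)      # remove each speaker's mean
    cov = centered.T @ centered / len(centered)    # within-speaker covariance
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1][:k]]     # (dim, k), largest variance first
```

If the session variability truly lies along a single direction, that direction is recovered as the top eigenvector.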
Eigenchannel adaptation

• The speaker model is obtained in the usual way, by MAP adapting the UBM.
• For test, adapt the speaker model and the UBM by moving their supervectors in the direction(s) of the eigenchannel(s) to best fit the test data → find the factors x maximizing the likelihood of the test data:

Σ_t log p(x_t | M + Ux)

• The score is the LLR computed using the adapted speaker model and UBM.

N. Brummer, SDV NIST SRE'04 system description, 2004.

[Illustration: target speaker model M and UBM shifted along the eigenchannel direction toward the test data.]
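For a toy single-Gaussian model with diagonal covariance, the ML channel factors x have a closed form: setting the gradient of Σ_t log p(x_t | m + Ux) to zero gives a small linear system. This is an illustrative simplification; a real GMM system accumulates per-component posterior statistics instead.

```python
import numpy as np

def eigenchannel_shift(X, m, U, var):
    """ML channel factors x for a single Gaussian with mean m + U x and
    diagonal covariance `var`, fitted to frames X of shape (T, D)."""
    T = len(X)
    f = (X - m).sum(axis=0)                 # first-order statistics
    A = T * (U.T * (1.0 / var)) @ U         # sum_t U^T Sigma^{-1} U
    b = (U.T * (1.0 / var)) @ f             # U^T Sigma^{-1} sum_t (x_t - m)
    return np.linalg.solve(A, b)
```

When the test data really were generated with a shift U x, the factors are recovered exactly.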
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS, with RASTA, with Feature Warping, + triple deltas + HLDA, + Feature mapping (14 classes), + Eigenchannel adaptation.]
Nuisance Attribute Projection

• NAP is an intersession compensation technique proposed for SVMs.
• It projects out the eigenchannel directions U from the supervectors before they are used for training SVMs or for test.
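The projection can be sketched in one line, assuming U has orthonormal columns:

```python
import numpy as np

def nap_project(supervectors, U):
    """Remove the nuisance (eigenchannel) subspace: y = (I - U U^T) m,
    with U an orthonormal basis of shape (dim, k)."""
    sv = np.atleast_2d(np.asarray(supervectors, dtype=float))
    return sv - (sv @ U) @ U.T
```

After the projection, the supervectors have no component left along any eigenchannel direction.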
Constructing models in supervector space

• Speaker Model Synthesis: M_CD2 = M_CD1 + k_{CD1→CD2}
  – a constant supervector shift for the recognized training and test channels
• Eigenchannel adaptation: M_test = M_train + Ux
  – the shift is given by a linear combination of the eigenchannel basis U, with factors x tuned to the test data
• Eigenvoice adaptation
  – also consider a supervector subspace V with high speaker variability and use it to obtain the speaker model
  – M = M_UBM + Vy – the speaker model is given by a linear combination of the UBM supervector and the eigenvoice basis
  – speaker factors y are tuned to match the enrollment data
  – can be combined with the channel subspace: M = M_UBM + Vy + Ux
    • both x and y are estimated on the enrollment data
    • only x is updated for test data, to adapt the speaker model to the test channel condition
Joint Factor Analysis

• M = M_UBM + Vy + Dz + Ux
• Probabilistic model:
  – Gaussian priors are assumed for the factors y, z, x
  – hyperparameters M_UBM, V, D, U can be trained using the EM algorithm
  – D is a diagonal matrix describing the remaining speaker variability not covered by the eigenvoices

[Illustration: eigenchannel directions u1, u2, eigenvoice directions v1, v2, and diagonal terms d11, d22, d33.]
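The JFA decomposition above mostly amounts to bookkeeping over matrix shapes; a minimal sketch of composing a supervector from the factors (with the diagonal D stored as a vector) is:

```python
import numpy as np

def jfa_supervector(m_ubm, V, y, D_diag, z, U, x):
    """Compose the JFA supervector M = M_UBM + V y + D z + U x.
    Shapes: m_ubm (dim,), V (dim, r_v), y (r_v,), D_diag (dim,), z (dim,),
    U (dim, r_u), x (r_u,)."""
    return m_ubm + V @ y + D_diag * z + U @ x
```

With all factors at zero (the mode of their Gaussian priors), the supervector reduces to the UBM supervector.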
[DET plot, NIST SRE 2005, all trials: 2048-Gaussian GMM, 13 MFCC + deltas; CMS, with RASTA, with Feature Warping, + triple deltas + HLDA, + Feature mapping (14 classes), + Eigenchannel adaptation, and Joint Factor Analysis (extrapolated result).]
Z-norm

• Target model LR scores have different biases and scales on test data:
  – an unusual channel or poor quality speech in the training segments → lower scores from the target model
  – little training data → target model close to the UBM → all LLR scores close to 0
• Z-norm attempts to remove these bias and scale differences from the LR scores:
  – estimate the mean and standard deviation of the target model's scores on non-target, same-sex utterances drawn from data similar to the test data
  – during testing, normalize the LR score, aligning each model's non-target scores to N(0,1):

Z_tgt(x) = (L_tgt(x) − μ_tgt) / σ_tgt

[Illustration: pooled LR scores of two target models before and after Z-norm.]
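The normalization above can be sketched in a few lines:

```python
import numpy as np

def znorm_params(impostor_scores):
    """Per-target-model Z-norm statistics, estimated by scoring the model
    against non-target (impostor) utterances similar to the test data."""
    s = np.asarray(impostor_scores, dtype=float)
    return s.mean(), s.std()

def znorm(score, mu_tgt, sigma_tgt):
    """Z_tgt(x) = (L_tgt(x) - mu_tgt) / sigma_tgt"""
    return (score - mu_tgt) / sigma_tgt
```

By construction, each model's impostor score distribution maps to zero mean and unit variance.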
T-norm

• Similar idea to Z-norm, but compensating for differences in the test data.
• Estimates the bias and scale parameters for score normalization using a fixed “cohort” set of speaker models:
  – normalizes the target score relative to a non-target model ensemble
  – similar to standard cohort normalization, except for the standard deviation scaling

T_tgt(u) = (L_tgt(u) − μ_coh) / σ_coh

• Cohorts of the same gender as the target are used.
• Can be used in conjunction with Z-norm (ZT-norm or TZ-norm, depending on the order).

Introduced in 1999 by Ensigma (DSP Journal, January 2000).
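The corresponding sketch for T-norm, where the statistics come from scoring the same test utterance against the cohort models:

```python
import numpy as np

def tnorm(target_score, cohort_scores):
    """T_tgt(u) = (L_tgt(u) - mu_coh) / sigma_coh, with mu_coh and sigma_coh
    estimated from the cohort models' scores on the same test utterance."""
    c = np.asarray(cohort_scores, dtype=float)
    return (target_score - c.mean()) / c.std()
```

Applying Z-norm first and T-norm second gives ZT-norm (TZ-norm for the opposite order).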
Effect of ZT-norm

[DET plot, NIST SRE 2006, telephone trials: eigenchannel adaptation and Joint Factor Analysis, each with no normalization vs. with ZT-norm.]
Score fusion

NIST SRE 2006, all trials. Linear logistic regression fusion of scores from:
• a GMM with eigenchannel adaptation
• an SVM based on GMM supervectors
• an SVM based on MLLR transformations (the transformation adapting a speaker-independent LVCSR system to the speaker)

The fusion is trained using many target and non-target trials from a development set.
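Linear logistic regression fusion can be sketched as a tiny gradient-descent trainer; the learning rate and iteration count are assumptions of this sketch, and production systems typically use a dedicated optimizer instead.

```python
import numpy as np

def train_llr_fusion(scores, labels, lr=0.1, iters=2000):
    """Learn fusion weights w and bias b so that sigmoid(w . s + b) predicts
    target (1) vs non-target (0) trials, by minimizing cross-entropy.

    scores: (num_trials, num_systems) per-system scores on development trials
    labels: (num_trials,) 1 for target trials, 0 for non-target trials
    """
    X = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid of the fused score
        w -= lr * (X.T @ (p - y)) / len(y)       # cross-entropy gradient step
        b -= lr * np.mean(p - y)
    return w, b

def fuse(scores, w, b):
    """Fused score: a weighted sum of the individual systems' scores."""
    return np.asarray(scores, dtype=float) @ w + b
```

On development data where the systems separate targets from non-targets, the fused score preserves that separation with weights reflecting each system's usefulness.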
Conclusions