Speech and Music Discrimination using GMM


Speech and Music Discrimination using
Gaussian Mixture Model
_________________________
Seminar Program
Project Team:
CHOI, Arthur Tsz Kin (3015809)
CHENG, Derek Ka Chun (3015631)
Supervisor: Dr. Deep Sen
_________________________
Speech and Music Discrimination using GMM
Motivations
• Much research has used HMMs; comparatively little has used GMMs
• GMMs reduce complexity compared with HMMs
• Our feature extraction methods will further reduce complexity
• Multimedia file search and storage are still under development
• Meets the University requirements
_________________________
Speech and Music Discrimination using GMM
Applications
• Audio Database Indexing
• Automatic Bandwidth Allocation
• Broadcast Browsing
• Intelligent Signal Processing
• Intelligent Audio Coding
• Audio file Compression
• Audio Clip Editing
_________________________
Speech and Music Discrimination using GMM
Approaches
Deterministic signals
can be analysed as completely specified functions of time
Non-deterministic (random) signals
must be analysed probabilistically
[Tele3013 notes]
_________________________
Speech and Music Discrimination using GMM
Procedures
1. Read a signal
2. Segment it into small frames
3. Extract features from each frame
4. Classify each frame
(a Matlab sketch of this pipeline follows)
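A minimal Matlab sketch of these four steps, assuming a mono input file named input.wav and placeholder functions extract_features and classify_frame that stand in for the feature-extraction and classification methods described on the later slides:

% Sketch of the overall procedure (assumptions: mono input, 20 ms frames,
% extract_features/classify_frame are placeholders for the methods below).
[x, fs] = audioread('input.wav');          % 1. read a signal
frameLen = round(0.020 * fs);              % 2. segment into ~20 ms frames
numFrames = floor(length(x) / frameLen);
labels = cell(numFrames, 1);
for n = 1:numFrames
    frame = x((n-1)*frameLen + 1 : n*frameLen);
    feat = extract_features(frame, fs);    % 3. feature extraction (hypothetical)
    labels{n} = classify_frame(feat);      % 4. classification (hypothetical), e.g. 'speech'/'music'
end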
_________________________
Speech and Music Discrimination using GMM
Feature Extractions
_________________________
Speech and Music Discrimination using GMM
Classification
_________________________
Speech and Music Discrimination using GMM
[Diagram: an audio stream segmented and labelled as music, speech, silence, speech]
_________________________
Speech and Music Discrimination using GMM
Segmentation
Reasons
• Obtain a better estimation result
• Achieve real-time behaviour
[Figure: a music signal]
Problems and solutions
• Frames too large -- classification accuracy decreases
• Frames too small -- feature extraction accuracy decreases
• Chosen frame size: ~20 ms (see the sketch below)
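As a concrete example, at an assumed sampling rate of 16 kHz (the slides do not fix one), a 20 ms frame is 0.020 x 16000 = 320 samples. A sketch of the segmentation step, with x being the signal vector from the earlier sketch:

% Segment a signal into non-overlapping ~20 ms frames (fs = 16 kHz is an assumption).
fs = 16000;
frameLen = round(0.020 * fs);                                    % 320 samples per frame
numFrames = floor(length(x) / frameLen);
frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);  % one frame per column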
_________________________
Speech and Music Discrimination using GMM
4 Hz modulation energy
Speech energy has a characteristic energy modulation peak around the 4 Hz syllabic rate. [Houtgast & Steeneken 1985]
Reasons
• Accurately separates speech and music signals (~94%)
• Easy to implement in Matlab
• Novel and Robust
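A sketch of how this feature could be measured in Matlab: take the short-time energy envelope of the signal, then compare the energy of the envelope spectrum around 4 Hz with the total envelope energy. The 20 ms energy frames and the 3-5 Hz band are assumptions, not the project's exact settings.

% Sketch of a 4 Hz modulation energy measure (the exact implementation may differ).
% Assumptions: x is a mono signal, fs its sampling rate, 20 ms energy frames.
function m = mod4hz_energy(x, fs)
    frameLen = round(0.020 * fs);                  % 20 ms frames -> ~50 Hz envelope rate
    numFrames = floor(length(x) / frameLen);
    frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);
    env = sum(frames.^2, 1);                       % short-time energy envelope
    env = env - mean(env);                         % remove DC before spectral analysis
    envFs = fs / frameLen;                         % sampling rate of the envelope
    E = abs(fft(env)).^2;
    f = (0:numFrames-1) * envFs / numFrames;       % frequency axis of the envelope spectrum
    band = (f >= 3) & (f <= 5);                    % energy around the 4 Hz syllabic rate
    m = sum(E(band)) / sum(E(f > 0 & f <= envFs/2));  % normalised 4 Hz modulation energy
end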
_________________________
Speech and Music Discrimination using GMM
[Figure: a music signal and a speech signal]
_________________________
Speech and Music Discrimination using GMM
[Figure: energy vs. time for a music signal and a speech signal]
_________________________
Speech and Music Discrimination using GMM
Zero-Crossing Count (ZCC)
The zero-crossing count is the total number of times that a signal crosses the zero axis within a given time window.
Speech signals: high ZCC
Music signals: low ZCC
Reasons
• The ZCC of a speech signal is significantly higher than that of music
• Very easy to implement in Matlab
• Mature and Robust
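A minimal Matlab implementation of the ZCC of one frame, counting sign changes between consecutive samples:

% Zero-crossing count of one frame: number of sign changes in the samples.
function z = zero_crossing_count(frame)
    s = sign(frame);
    s(s == 0) = 1;                 % treat exact zeros as positive to avoid double counting
    z = sum(abs(diff(s)) > 0);     % count sign changes
end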
_________________________
Speech and Music Discrimination using GMM
Spectral Roll-off Point
The spectral roll-off point measures the “skewness” of the
spectrum.
Reasons
• Music usually has more energy in the high-frequency range
• Useful later for separating different kinds of speech
_________________________
Speech and Music Discrimination using GMM
Spectral Roll-off Point
The spectral roll-off point SR is the frequency below which a fixed fraction c of the total spectral power is concentrated (c = 0.95 is a common choice):
SR = min{ f : Σ_{k ≤ f} P(k) ≥ c · Σ_k P(k) }
where P(k) is the power in frequency bin k.
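A Matlab sketch of the roll-off computation for one frame; the threshold c = 0.95 is an assumed default, as above:

% Spectral roll-off of one frame: lowest frequency below which a fraction c
% of the total spectral power lies.
function sr = spectral_rolloff(frame, fs, c)
    if nargin < 3, c = 0.95; end
    N = length(frame);
    P = abs(fft(frame)).^2;
    P = P(1:floor(N/2)+1);                          % keep non-negative frequencies
    f = (0:floor(N/2))' * fs / N;
    idx = find(cumsum(P(:)) >= c * sum(P), 1, 'first');
    sr = f(idx);
end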
_________________________
Speech and Music Discrimination using GMM
[Figure: power vs. frequency spectra of a music signal and a speech signal]
_________________________
Speech and Music Discrimination using GMM
Entropy Modulation
Music appears more “ordered” than a speech signal
[J. Pinquier, J.-L. Rouas, R. André-Obrecht 2002]
Higher entropy means a less ordered (more random) signal
Higher dynamism means a higher rate of change
Reasons
• Accurately separates speech and music signals (~90%)
• Novel and Robust
_________________________
Speech and Music Discrimination using GMM
[Figure: a music signal and a speech signal]
_________________________
Speech and Music Discrimination using GMM
[J. Ajmera, I.A. McCowan, H. Bourlard 2002]
_________________________
Speech and Music Discrimination using GMM
[Equations: instantaneous entropy, average entropy, average instantaneous entropy]
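The exact entropy expressions are not reproduced on these slides, so the following Matlab sketch assumes the spectral entropy of each frame as the instantaneous entropy, and uses its mean and standard deviation over an excerpt as the average entropy and entropy modulation:

% Sketch of entropy-based features under the assumptions stated above.
function [Hinst, Havg, Hmod] = entropy_features(frames)
    % frames: one frame per column (see the segmentation sketch earlier)
    P = abs(fft(frames)).^2;
    P = P ./ (sum(P, 1) + eps);             % normalise each column to a distribution
    Hinst = -sum(P .* log2(P + eps), 1);    % instantaneous (per-frame) entropy
    Havg  = mean(Hinst);                    % average entropy
    Hmod  = std(Hinst);                     % entropy modulation over the excerpt
end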
_________________________
Speech and Music Discrimination using GMM
Pulse Metric
The beat of a piece of music is one of its clearest features.
[K.D. Martin, E.D. Scheirer, B.L. Vercoe 1998]
_________________________
Speech and Music Discrimination using GMM
Other Features
• Spectral Centroid
• Spectral Flux
• Silence Ratio
• Short-Time Energy Ratio
• Volume Dynamic Change
• Number of Segments
• Segment Duration
• …etc
_________________________
Introduction to Gaussian Mixture Model (GMM)
• Differentiates speech and music from a sound source
• Used in speech processing, mostly for speech recognition, speaker identification and voice conversion
• Models densities and represents general spectral features
Why we choose GMM?
Low complexity
Rate independence
Bit scalability
Short computation time
What is Gaussian Mixture Model?
A Gaussian Mixture Model consists of a set of local Gaussian modes and an integrating network. Different Gaussian distributions represent different regions of the feature space and have different output characteristics.
A GMM describes a complex system as a combination of all the Gaussian clusters, instead of using a single model.
Gaussian mixtures or clusters
Used to describe a complex system instead of a single model
Each cluster represents part of the dataset by a mean and a covariance
Gaussian Mixture Model
A Gaussian Mixture Model is represented by:
f(x, λ) = Σ_{i=1}^{M} w_i N(x, μ_i, Σ_i)
where
x is the P-dimensional input vector
w_i are the mixture weights
N(x, μ_i, Σ_i) are the component densities
λ = {w_i, μ_i, Σ_i} denotes the set of model parameters
Clustering
‘Clustering’ is a technique from pattern classification used to group samples.
A P-dimensional feature vector is considered as a point in space, and all points ‘near’ each other are clustered together.
[Figure: clustering example; a grey circle represents the variance of each distribution]
Gaussian component density
P-variate Gaussian function of the form:
N(x, μ_i, Σ_i) = 1 / ( (2π)^{P/2} |Σ_i|^{1/2} ) · exp( -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) )
where
μ_i is the mean vector
Σ_i is the covariance matrix
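A Matlab sketch that evaluates the component density and the full mixture density from the two formulas above; the variable layout (means as columns of mu, covariances stacked along the third dimension of Sigma) is an assumption:

% P-variate Gaussian component density N(x, mu, Sigma).
function p = gauss_pdf(x, mu, Sigma)
    P = length(x);
    d = x(:) - mu(:);
    p = exp(-0.5 * d' * (Sigma \ d)) / ((2*pi)^(P/2) * sqrt(det(Sigma)));
end

% GMM density f(x) = sum_i w(i) * N(x, mu(:,i), Sigma(:,:,i)).
function f = gmm_pdf(x, w, mu, Sigma)
    f = 0;
    for i = 1:numel(w)
        f = f + w(i) * gauss_pdf(x, mu(:, i), Sigma(:, :, i));
    end
end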
Covariance matrix
Indicates the dispersion of distribution
In mathematics, it is defined as the matrix whose ij-th element σ_ij is the covariance of x_i and x_j:
σ_ij = σ_ji = E[ (x_i - μ_i)(x_j - μ_j) ],  i, j = 1, …, d
Covariance matrix
The diagonal components of the covariance
matrix are the variances of individual random
variables
Off-diagonal components are the covariances of pairs of random variables x_i and x_j
The covariance matrix is symmetric
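A small Matlab example computing a sample covariance matrix and checking these properties; the toy data are random and purely illustrative:

% Sample covariance of 1000 three-dimensional observations (assumed toy data).
X  = randn(1000, 3);                    % one observation per row
mu = mean(X, 1);
Xc = X - mu;                            % centre the data
S  = (Xc' * Xc) / (size(X, 1) - 1);     % same result as cov(X)
disp(diag(S)');                         % diagonal entries: variances of each variable
disp(norm(S - S'));                     % ~0, i.e. the matrix is symmetric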
Full covariance matrix
The most powerful Gaussian model as it
fits the data best
drawback!
Needs a lot of data to estimate parameters
Costly in high-dimensional feature spaces
Diagonal covariance matrix
Good compromise between quality and
model size
Gaussian components can act together to
model the overall probability density
function
Capable of modelling the correlations between the elements of the feature vector
Review the Gaussian mixture
density
The mixture weights w_i must satisfy the conditions
Σ_{i=1}^{M} w_i = 1  and  w_i ≥ 0
Three components compose the Gaussian
mixture density: mean vectors, covariance
matrices and mixture weights
Expectation-maximization (EM)
Estimates the mean vectors, covariance matrices and mixture weights
Iteratively updates the distribution of each Gaussian component and the conditional probabilities
Idea of Expectation-maximization
Instead of starting with a random configuration of all components and improving that configuration with expectation-maximization, we start with the optimal one-component mixture.
Then we repeat two steps until convergence:
i) Insert a new component, and
ii) Apply EM until convergence
Convergence Theorem
Because the sequence of likelihoods is monotonically increasing and bounded, the likelihood converges to a local maximum
EM algorithm
Assume L(X_n, f_k) = Σ_n log f_k(x_n) denotes the log-likelihood of the dataset under the k-component mixture f_k.
1. Compute the optimal one-component mixture f_1. Set k = 1.
2. Find the optimal new component φ(x; θ*) and its corresponding mixture weight α*:
   {θ*, α*} = arg max_{θ, α} Σ_n log[ (1 - α) f_k(x_n) + α φ(x_n; θ) ]
   while keeping f_k fixed.
EM algorithm
3. Set f_{k+1}(x) = (1 - α*) f_k(x) + α* φ(x; θ*) and k = k + 1.
4. Update f_k with EM until convergence.
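The inner EM updates can be sketched in Matlab as below for a diagonal-covariance GMM; this is a plain EM sketch rather than the greedy component-insertion scheme above, and the initialisation and variance floor are assumptions:

% EM updates for a diagonal-covariance GMM (sketch).
function [w, mu, sig2] = gmm_em_diag(X, M, numIter)
    % X: N x P data, M: number of components, numIter: EM iterations
    [N, P] = size(X);
    w    = ones(1, M) / M;                       % mixture weights
    mu   = X(randperm(N, M), :);                 % means initialised from data points (M x P)
    sig2 = repmat(var(X, 0, 1), M, 1);           % diagonal covariances (M x P)
    for it = 1:numIter
        % E-step: responsibilities gamma(n, i) = P(component i | x_n)
        logp = zeros(N, M);
        for i = 1:M
            d = (X - mu(i, :)).^2 ./ sig2(i, :);
            logp(:, i) = log(w(i)) - 0.5 * (sum(d, 2) + sum(log(2 * pi * sig2(i, :))));
        end
        gamma = exp(logp - max(logp, [], 2));
        gamma = gamma ./ sum(gamma, 2);
        % M-step: re-estimate weights, means and variances
        Nk = sum(gamma, 1);                      % effective counts per component
        w  = Nk / N;
        mu = (gamma' * X) ./ Nk';
        for i = 1:M
            d2 = (X - mu(i, :)).^2;
            sig2(i, :) = (gamma(:, i)' * d2) / Nk(i) + 1e-6;  % small floor for stability
        end
    end
end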
Speech/music discrimination using GMM
An interesting feature of GMMs is that the component densities of the mixture may represent:
Different phonetic events when modelling speech
Different portions of the sound when modelling the spectra of a musical instrument
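A sketch of the discrimination step itself: fit one GMM to speech feature vectors and one to music feature vectors (for example with the EM sketch above), then label each frame by the model giving the higher log-likelihood. The matrix names, the 8 components and the 30 iterations are assumptions:

% Fit one GMM per class (featSpeech/featMusic/feat are assumed N x P feature matrices).
[wS, muS, sS] = gmm_em_diag(featSpeech, 8, 30);
[wM, muM, sM] = gmm_em_diag(featMusic,  8, 30);

llS = gmm_loglik(feat, wS, muS, sS);              % log-likelihood of each test frame
llM = gmm_loglik(feat, wM, muM, sM);
isSpeech = llS > llM;                             % label frames with the likelier model

function ll = gmm_loglik(X, w, mu, sig2)
    % Per-frame log-likelihood under a diagonal-covariance GMM
    % (same layout as gmm_em_diag: w 1 x M, mu M x P, sig2 M x P).
    [N, ~] = size(X);
    M = numel(w);
    logp = zeros(N, M);
    for i = 1:M
        d = (X - mu(i, :)).^2 ./ sig2(i, :);
        logp(:, i) = log(w(i)) - 0.5 * (sum(d, 2) + sum(log(2 * pi * sig2(i, :))));
    end
    m = max(logp, [], 2);
    ll = m + log(sum(exp(logp - m), 2));          % log-sum-exp over the M components
end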
Achievement
Identified an optimal frame size
Obtained robust features
Performed a few tests
Implemented some Matlab code
Studied Gaussian Mixture Models (GMMs) and some of their mathematical expressions
Next year planning
Comprehensive and more in-depth
research on GMMs
Model the sound sources based on GMMs
Evaluate the effect of noise
Matlab implementation for speech/music
separation
Next year planning
Investigate a novel classification method –
Support Vector Machine (SVM)
Differentiate Male and female speech
Differentiate Classical and Non-Classical
Music
Generate a final thesis report
_________________________
Speech and Music Discrimination using GMM
Resources
• Internet, Microsoft Sound Recorder, Matlab
• Neural Networks for Pattern Recognition (Bishop 1996)
• Processing and Perception of Speech and Music
(Morgan 2000)
• Research Papers
_________________________
Speech and Music Discrimination using GMM
Management Plan
• Dec – Feb 04
Matlab Implementations
Investigate noise effect
Research on Support Vector Machine
Experiments
• Jan 05
Separating classical and non-classical music
• Feb 05
Separating male and female speech
• Mar – Jun 05
Separate chamber music and orchestral music; separate baby speech (if time permits)
References
Morgan, Processing and Perception of Speech and Music (2000), John Wiley & Sons, Inc., USA.
Joseph F. Hair, Jr., Rolph E. Anderson, Ronald L. Tatham, William C. Black, Multivariate Data Analysis, 4th Edition (1995), Prentice-Hall International, Inc., USA.
Keinosuke Fukunaga, Computer Science and Scientific Computing: Introduction to Statistical Pattern Recognition, 2nd Edition (1990), Academic Press, Inc., California, USA. ISBN 0-12-269851-7.
Marty J. Schmidts, Understanding and Using Statistics (1975), D.C. Heath and Company, Canada. ISBN 0-669-94490-4.
Norman L. Johnson, Samuel Kotz, Distributions in Statistics: Continuous Univariate Distributions, Vol. 1 (1970), Houghton Mifflin Company, Boston, USA.
Richard A. Johnson, Dean W. Wichern, Applied Multivariate Statistical Analysis (1992), Prentice-Hall, Inc., New Jersey, USA. ISBN 0-13-041400-X.
Richard J. Harris, A Primer of Multivariate Statistics (1975), Academic Press Inc., New York, USA. ISBN 0-12-327250-5.
Thomas D. Rossing, The Science of Sound (1982), Addison-Wesley Publishing Company Inc., USA. ISBN 0-201-06505-3.
Thomas D. Rossing, Neville H. Fletcher, Principles of Vibration and Sound (1995),

Thank you