The Chinese University of Hong Kong
Department of Computer Science and
Engineering
Lyu0202
Advanced Audio Information
Retrieval System
Network-based AdvAIR System
Consists of client side and server side
Client Side
Consists of 2 parts
Advanced Part
• Audio Data Mining
• Audio Data Retrieval and Indexing
Basic Part
• Audio Streaming from Server side
Server Side
For Audio Streaming and Searching on Server
Advanced Part of AdvAIR system
Audio Data Mining
Segmentation
Recognition Engine
Segmentation with Speaker Recognition
Audio Retrieval and Indexing
Query by Humming
Pattern Matching
Search on Server
Audio Data Mining – Recognition Engine
Consists of Three functions:
Speaker Recognition
Language Recognition
Gender Recognition
Speaker Recognition engine
Open-set system with 10 models and 1 general model
Language Recognition engine
Closed-set system with 3 models (Cantonese, English,
Mandarin)
Gender Recognition engine
Closed-set system with 2 models (Male and Female)
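The difference between a closed-set and an open-set decision can be sketched as follows. This is only an illustration, not the AdvAIR code: it assumes every model returns a log-likelihood score and that the general model serves as a reject option for unknown speakers. The function names, the `margin` parameter and all score values are made up.

```python
def closed_set_decision(scores):
    """Closed set: always return the name of the best-scoring model."""
    return max(scores, key=scores.get)


def open_set_decision(scores, general_score, margin=0.0):
    """Open set: return the best speaker only if it beats the general model
    by more than `margin`; otherwise report an unknown speaker (None)."""
    best = max(scores, key=scores.get)
    if scores[best] - general_score > margin:
        return best
    return None


# Hypothetical log-likelihood scores, for illustration only.
language_scores = {"Cantonese": -41.2, "English": -39.8, "Mandarin": -44.0}
print(closed_set_decision(language_scores))

speaker_scores = {"speaker_03": -40.1, "speaker_07": -42.5}
print(open_set_decision(speaker_scores, general_score=-39.0))
```

With a closed set the input is always assigned to one of the trained models; with an open set the engine may also answer that none of the trained speakers matches.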
Audio Data Mining - Segmentation
[Figure: segments labelled Group 1, Group 2 and Group 3]
Audio Data Mining - Segmentation
The Bayesian Information Criterion (BIC) is used to
determine the acoustic change points of
the input MPEG file
First, input an MPEG file
Next, extract the features
Use the BIC to locate the change
points
Finally, obtain a list of segments
cut at the acoustic change points
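The BIC test for a single change point can be sketched as below. This is a simplified illustration under stated assumptions: the features are one-dimensional (real features would be multi-dimensional vectors), each segment is modelled by one Gaussian, and the names `delta_bic`, `find_change_point`, `penalty_weight` and `min_len` are hypothetical. A positive delta-BIC means two Gaussians describe the data better than one, even after the penalty for the extra parameters.

```python
import math


def log_var(xs):
    """Log of the maximum-likelihood variance of a 1-D sequence."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.log(max(var, 1e-12))  # floor avoids log(0)


def delta_bic(xs, t, penalty_weight=1.0):
    """Delta-BIC for splitting xs at index t; positive favours a change point."""
    n = len(xs)
    left, right = xs[:t], xs[t:]
    gain = 0.5 * (n * log_var(xs)
                  - len(left) * log_var(left)
                  - len(right) * log_var(right))
    # A second Gaussian adds d + d(d+1)/2 parameters; for d = 1 that is 2.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


def find_change_point(xs, min_len=5):
    """Index with the largest positive delta-BIC, or None if no split helps."""
    best_t, best_score = None, 0.0
    for t in range(min_len, len(xs) - min_len + 1):
        score = delta_bic(xs, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Toy feature stream whose statistics change halfway through.
quiet = [0.1 * ((i * 7) % 5 - 2) for i in range(30)]
loud = [5.0 + 0.1 * ((i * 3) % 5 - 2) for i in range(30)]
print(find_change_point(quiet + loud))
```

To produce a list of segments, the same test would be repeated on the audio after each detected change point.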
Audio Data Mining – Recognition Engine
[Figure: Input MPEG → Extract features → Calculate a score for each trained model → Select the most suitable model]
Audio Data Mining – Recognition Engine
Uses Gaussian Mixture Models (GMM)
text independent, robust, computationally
efficient
256 mixtures for each model
Needs pre-processing (training)
First, input an MPEG file
Next, extract the features
Calculate a score for each model and
select the model with the best score
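The scoring step can be sketched as follows. This is a toy version, not the engine itself: it uses one-dimensional features and 2 mixtures per model instead of 256, and it assumes the score is the average log-likelihood of the frames. The model parameters and the input frames are invented for the example.

```python
import math


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of 1-D frames under a GMM given as
    a list of (weight, mean, variance) triples."""
    total = 0.0
    for x in frames:
        # Log of each weighted Gaussian component density at x.
        logs = [
            math.log(w) - 0.5 * math.log(2 * math.pi * var)
            - (x - mu) ** 2 / (2 * var)
            for w, mu, var in gmm
        ]
        # Log-sum-exp keeps the sum over components numerically stable.
        m = max(logs)
        total += m + math.log(sum(math.exp(v - m) for v in logs))
    return total / len(frames)


def recognise(frames, models):
    """Score the frames against every trained model and pick the best one."""
    scores = {name: gmm_log_likelihood(frames, gmm)
              for name, gmm in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical pre-trained models (training itself is not shown).
models = {
    "Male": [(0.5, 110.0, 100.0), (0.5, 130.0, 100.0)],
    "Female": [(0.5, 200.0, 150.0), (0.5, 230.0, 150.0)],
}
label, scores = recognise([205.0, 215.0, 198.0, 222.0], models)
print(label, scores)
```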
Audio Data Mining – Segmentation with
Speaker Recognition
Automatic speaker recognition engine
First, do segmentation
Next, each segment is sent to the
speaker recognition engine
Finally, we get a list of segments in which
the speaker of each segment is
known
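The three steps above can be sketched as a small pipeline. The segmentation and the recognition engine are replaced by stand-ins here: `change_points` would come from the BIC step and `identify` from the speaker recognition engine. All names are hypothetical.

```python
def label_segments(change_points, total_frames, identify):
    """Cut [0, total_frames) at the change points and label every segment
    with the speaker returned by `identify(start, end)`."""
    bounds = [0] + sorted(change_points) + [total_frames]
    labelled = []
    for start, end in zip(bounds, bounds[1:]):
        labelled.append((start, end, identify(start, end)))
    return labelled


def speaker_index(labelled):
    """Group the segments by speaker so each speaker's turns can be looked up."""
    index = {}
    for start, end, speaker in labelled:
        index.setdefault(speaker, []).append((start, end))
    return index


# Stand-in for the recognition engine: a fixed answer per segment start.
answers = {0: "Speaker 2", 100: "Speaker 1", 250: "Speaker 2"}
segments = label_segments([100, 250], 400, lambda s, e: answers[s])
print(segments)
print(speaker_index(segments))
```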
[Figure: Speaker identification process. Group 1, Group 2 and Group 3 are matched against Speaker 1, Speaker 2 and Speaker 3; output: Speaker 2, Speaker 1, Speaker 2]
Audio Retrieval and Indexing - Query by
Humming
First Step:
Do pitch tracking on the input audio clip using the time-domain
autocorrelation function (ACF)
Track the trend of the input audio clip in terms of
“Up”, “Down” or “Same”
Intermediate output: a file consisting of a list of “Up”,
“Down” and “Same”
Second Step:
Do largest substring matching between the intermediate
output of the input audio clip and the intermediate output
of each audio clip in the database, and
calculate a score
Last Step:
List the audio clips in the database according to the score
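The contour and matching steps can be sketched as below. Two assumptions are made: pitches are given as note numbers (any scale where a higher number means a higher pitch would do), and "largest substring matching" is read as the longest common substring, with its length used as the score. The function names, the `tolerance` parameter and the database entries are invented.

```python
def contour(pitches, tolerance=0.5):
    """Map a pitch sequence to a string of U (up), D (down) and S (same)."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            out.append("U")
        elif prev - cur > tolerance:
            out.append("D")
        else:
            out.append("S")
    return "".join(out)


def longest_common_substring(a, b):
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1  # extend the run ending here
                best = max(best, cur[j])
        prev = cur
    return best


def rank(query, database):
    """List database clips by score, best match first."""
    scored = [(longest_common_substring(query, c), name)
              for name, c in database.items()]
    return sorted(scored, key=lambda pair: -pair[0])


query = contour([60, 64, 65, 67, 65, 62, 64])
database = {"clip_a": "SUUDDUS", "clip_b": "DDSSUDD"}  # made-up contours
print(query, rank(query, database))
```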
[Figure: Hummed song → Pitch tracker → Intermediate representation → Largest substring matching against the intermediate database]
Pitch tracker
Track the pitch of the hummed voice and convert it
into a representation of the relative change in
pitch
E.g. Do Me Fa So Fa Re Me
• U U U D D U
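Pitch tracking with the time-domain autocorrelation function can be sketched for a single frame as follows. This is a minimal version under stated assumptions: the frame is longer than the largest lag searched, the pitch lies between `f_min` and `f_max` (both hypothetical defaults), and the test signal is a synthetic sine wave rather than a real hummed voice.

```python
import math


def acf_pitch(frame, sample_rate, f_min=80.0, f_max=500.0):
    """Estimate the pitch of one frame from the lag at which its
    autocorrelation is largest; returns None for a silent frame."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag over the overlapping samples.
        value = sum(frame[n] * frame[n + lag]
                    for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]
print(acf_pitch(frame, sample_rate))
```

Running this on successive frames gives a pitch sequence, from which the Up/Down/Same representation is derived by comparing neighbouring values.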
Audio Retrieval and Indexing – Direct
Audio Search
First Step:
A covariance matrix is calculated from the feature
vectors of the cue audio and of each clip in the database
Second Step:
The AHS (arithmetic harmonic sphericity) distance
measure is used to calculate a score
Last Step:
List the audio clips in the database according to the
score
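The two steps can be sketched as below, using the commonly given form of the measure, AHS(X, Y) = log(tr(Cy Cx^-1) * tr(Cx Cy^-1)) - 2 log(d), where Cx and Cy are the covariance matrices and d is the feature dimension. To stay short the sketch is restricted to 2-dimensional feature vectors; the function names and the sample vectors are invented. The distance is 0 for identical covariances and grows as they differ, so a smaller value means a closer match.

```python
import math


def covariance_2d(vectors):
    """Maximum-likelihood covariance of 2-D vectors, as ((a, b), (b, c))."""
    n = len(vectors)
    mx = sum(v[0] for v in vectors) / n
    my = sum(v[1] for v in vectors) / n
    a = sum((v[0] - mx) ** 2 for v in vectors) / n
    b = sum((v[0] - mx) * (v[1] - my) for v in vectors) / n
    c = sum((v[1] - my) ** 2 for v in vectors) / n
    return ((a, b), (b, c))


def trace_p_qinv(p, q):
    """tr(P Q^-1) for symmetric 2x2 matrices, via the closed-form inverse."""
    (pa, pb), (_, pc) = p
    (qa, qb), (_, qc) = q
    det = qa * qc - qb * qb
    return (pa * qc - 2 * pb * qb + pc * qa) / det


def ahs_distance(cov_x, cov_y, dim=2):
    """Arithmetic harmonic sphericity distance between two covariances."""
    product = trace_p_qinv(cov_y, cov_x) * trace_p_qinv(cov_x, cov_y)
    return math.log(product) - 2 * math.log(dim)


cue = covariance_2d([(0.0, 0.0), (1.0, 0.5), (2.0, 0.2), (3.0, 1.9)])
clip = covariance_2d([(0.0, 0.0), (1.0, 3.0), (2.0, -1.0), (3.0, 0.5)])
print(ahs_distance(cue, cue), ahs_distance(cue, clip))
```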
[Figure: Source clip compared with target clips of the same size using AHS comparison]
Audio Retrieval and Indexing – Search on
Server
Direct Audio Search on Server
Server side has a database
Client connects to the server
Client selects a cue audio and uploads it to the
server
Server does the direct audio search and
sends back the result
Client can use audio streaming to get
the result file
Basic Part - Audio Streaming
AdvAIR is an N-to-N system, allowing N servers and N
clients
Clients and servers can be added at any time
It is fault tolerant
Basic Part – Server Side
Have two parts:
For Audio Streaming
For Searching on Server (Direct Search on server)
They are separated because searching on the server uses a
lot of resources
A server cannot serve too many users at the
same time
Only privileged users are allowed to use the search
on server function
Basic Part – Client Side
When a client requests a download, the audio clip
is divided into many small parts
Each server sends small parts to the client
simultaneously to speed up the
download
The client combines all the small parts to form
the whole file
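The download scheme can be sketched as follows. This is a simulation, not the AdvAIR protocol: the servers are in-memory stand-ins rather than network connections, and the retry rule (try the next server when one fails) is an assumed way of getting the fault tolerance mentioned earlier. All names and the part size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_server(data, fail=False):
    """Stand-in for a streaming server: returns the requested byte range,
    or raises if the server is down."""
    def fetch(start, end):
        if fail:
            raise ConnectionError("server unavailable")
        return data[start:end]
    return fetch


def fetch_part(servers, index, start, end):
    """Start at a different server for each part; fall back to the next
    server if one fails."""
    for offset in range(len(servers)):
        server = servers[(index + offset) % len(servers)]
        try:
            return server(start, end)
        except ConnectionError:
            continue
    raise ConnectionError("no server could deliver part %d" % index)


def download(servers, size, part_size):
    """Fetch all parts concurrently and combine them in order."""
    ranges = [(s, min(s + part_size, size)) for s in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(fetch_part, servers, i, s, e)
                   for i, (s, e) in enumerate(ranges)]
        parts = [f.result() for f in futures]
    return b"".join(parts)


clip = bytes(range(256)) * 40
servers = [make_server(clip), make_server(clip, fail=True), make_server(clip)]
print(download(servers, len(clip), part_size=1024) == clip)
```

Because the parts are collected in request order, the file is reassembled correctly even when the parts arrive from different servers at different times.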
The End