The Chinese University of Hong Kong
Department of Computer Science and
Engineering
Lyu0202
Advanced Audio Information
Retrieval System
Network-based AdvAIR System
Consists of a client side and a server side
Client Side
Consists of two parts
Advanced Part
• Audio Data Mining
• Audio Data Retrieval and Indexing
Basic Part
• Audio Streaming from the server side
Server Side
For audio streaming and searching on the server
Advanced Part of AdvAIR system
Audio Data Mining
Segmentation
Recognition Engine
Segmentation with Speaker Recognition
Audio Retrieval and Indexing
Query by Humming
Pattern Matching
Search on Server
Audio Data Mining – Recognition Engine
• Consists of three functions:
Speaker Recognition
Language Recognition
Gender Recognition
• Speaker Recognition engine
Open-set system with 10 models and 1 general model
• Language Recognition engine
Closed-set system with 3 models (Cantonese, English, Mandarin)
• Gender Recognition engine
Closed-set system with 2 models (Male and Female)
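The difference between the closed-set and open-set engines can be sketched as follows. This is a minimal illustration, not the AdvAIR implementation: it assumes the general model acts as a background model, so that an input is reported as unknown when no speaker model beats it. The function names, the `margin` parameter and the scores are hypothetical.

```python
def closed_set_decision(scores):
    """Return the label of the best-scoring model (closed set: always answers)."""
    return max(scores, key=scores.get)


def open_set_decision(speaker_scores, general_score, margin=0.0):
    """Return the best speaker, or None when the general model explains
    the audio at least as well (open set: may answer 'unknown')."""
    best = max(speaker_scores, key=speaker_scores.get)
    if speaker_scores[best] - general_score > margin:
        return best
    return None


# Hypothetical average log-likelihood scores (higher is better).
language_scores = {"Cantonese": -41.2, "English": -44.8, "Mandarin": -43.1}
print(closed_set_decision(language_scores))

speaker_scores = {"speaker_01": -39.5, "speaker_02": -47.0}
print(open_set_decision(speaker_scores, general_score=-42.0))
print(open_set_decision(speaker_scores, general_score=-38.0))
```

A closed-set engine always names one of its models, while the open-set variant can decline to name anyone.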
Audio Data Mining - Segmentation
[Figure: an audio stream cut into three segments: Group 1, Group 2, Group 3]
Audio Data Mining - Segmentation
The Bayesian Information Criterion (BIC) is used to
determine the acoustic change points of
the input MPEG file
First, input an MPEG file
Next, extract the features
Use the BIC to calculate the change
points
Finally, obtain a list of segments
cut at the acoustic change points
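The BIC test for a single change point can be sketched as below. This is a simplified illustration under stated assumptions: the features are one-dimensional, each segment is modelled by one Gaussian, and the penalty weight `lam` and minimum segment length `min_len` are hypothetical parameters. Real systems work on multi-dimensional feature vectors extracted from the audio.

```python
import math
import random


def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-8)


def delta_bic(frames, i, lam=1.0):
    """BIC gain from modelling frames[:i] and frames[i:] with two Gaussians
    instead of one. Positive values favour a change point at i."""
    n = len(frames)
    gain = 0.5 * (n * math.log(variance(frames))
                  - i * math.log(variance(frames[:i]))
                  - (n - i) * math.log(variance(frames[i:])))
    penalty = lam * math.log(n)  # 0.5 * (2 extra parameters) * log(n), for 1-D
    return gain - penalty


def find_change_point(frames, min_len=10, lam=1.0):
    """Return the most likely change index, or None if no split has positive ΔBIC."""
    candidates = range(min_len, len(frames) - min_len)
    best = max(candidates, key=lambda i: delta_bic(frames, i, lam), default=None)
    if best is None or delta_bic(frames, best, lam) <= 0:
        return None
    return best


random.seed(0)
# Synthetic 1-D "feature" stream: two sources with different statistics.
stream = ([random.gauss(0.0, 1.0) for _ in range(200)]
          + [random.gauss(4.0, 1.0) for _ in range(200)])
change = find_change_point(stream)
print(change)
```

To obtain a full list of segments, the same test would be applied repeatedly over the stream, for example on a sliding or growing window.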
Audio Data Mining – Recognition Engine
[Figure: input MPEG → extract features → calculate a score for each trained model → select the most suitable model]
Audio Data Mining – Recognition Engine
Uses a Gaussian Mixture Model (GMM)
Text-independent, robust, computationally
efficient
256 mixture components for each model
Needs pre-processing (training)
First, input an MPEG file
Next, extract the features
Calculate a score for each model and
select the model with the best score
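The scoring step can be sketched as follows, assuming diagonal-covariance GMMs and an average per-frame log-likelihood as the score. Both are assumptions: the slides do not state the covariance type or the exact score. The toy models below are hypothetical and use 2 components instead of 256.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm), where gmm is a list of (weight, mean, var) components."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))


def score(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one model."""
    return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)


def recognise(frames, models):
    """Return the name of the model with the best score."""
    return max(models, key=lambda name: score(frames, models[name]))


# Toy models with 2 components over 2-D features (the slides use 256 mixtures).
models = {
    "Male": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "Female": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
frames = [[4.2, 3.9], [5.1, 4.8], [4.6, 4.4]]
print(recognise(frames, models))
```

Training the component weights, means and variances (typically with the EM algorithm) is the pre-processing step and is not shown here.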
Audio Data Mining – Segmentation with
Speaker Recognition
Automatic speaker recognition engine
First, do segmentation
Next, each segment is sent to the
speaker recognition engine
Finally, we get a list of segments in which
the speaker of each segment is
known
[Figure: Group 1, Group 2, Group 3 → speaker identification process (Speaker 1, Speaker 2, Speaker 3) → Speaker 2, Speaker 1, Speaker 2]
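The combination of segmentation and speaker recognition can be sketched as a small pipeline. The change points are assumed to come from the BIC step, and `toy_identify` is a hypothetical stand-in for the speaker recognition engine. All names here are illustrative.

```python
def cut_segments(n_frames, change_points):
    """Turn change-point indices into (start, end) frame ranges."""
    bounds = [0] + sorted(change_points) + [n_frames]
    return list(zip(bounds[:-1], bounds[1:]))


def label_segments(frames, change_points, identify):
    """Send each segment to the recognition engine and record its speaker."""
    return [(start, end, identify(frames[start:end]))
            for start, end in cut_segments(len(frames), change_points)]


def toy_identify(segment):
    """Stand-in for the speaker recognition engine (hypothetical rule)."""
    mean = sum(segment) / len(segment)
    return "Speaker 1" if mean < 2.5 else "Speaker 2"


frames = [5.0] * 3 + [0.0] * 4 + [5.0] * 3
for start, end, speaker in label_segments(frames, [3, 7], toy_identify):
    print(start, end, speaker)
```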
Audio Retrieval and Indexing - Query by
Humming
• First step:
Do pitch tracking on the input audio clip using the time-domain
autocorrelation function (ACF)
Track the trend of the input audio clip as
“Up”, “Down” or “Same”
Intermediate output: a file consisting of a list of “Up”,
“Down” and “Same” symbols
• Second step:
Do largest substring matching between the
intermediate output of each audio clip in the database and
the intermediate output of the input audio clip, and
calculate a score
• Last step:
List the audio clips in the database according to the score
[Figure: hummed song → pitch tracker → intermediate representation → largest substring matching against the intermediate database]
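The matching and ranking steps can be sketched as below. It assumes that "largest substring matching" means the longest common contiguous substring between two Up/Down/Same strings, and that its length is used directly as the score. Both are interpretations, and the database contents are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous run shared by sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rank_clips(query, database):
    """Sort clips by match length against the query contour, best first."""
    scored = [(longest_common_substring(query, contour), name)
              for name, contour in database.items()]
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical contours over the alphabet U (up), D (down), S (same).
database = {"song_a": "UUDDSUUD", "song_b": "UUUDDUSD", "song_c": "SSDDSSUU"}
ranking = rank_clips("UUUDDU", database)
print(ranking)
```

Because only the relative contour is compared, a query hummed in a different key from the stored clip can still match.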
Pitch tracker
Track the pitch of the hummed voice and convert it
into a representation of the relative change of
the voice
E.g. Do Me Fa So Fa Re Me
• U U U D D U
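A minimal sketch of the pitch tracker follows. It estimates one pitch per note from the time-domain autocorrelation and then reduces the pitch sequence to Up/Down/Same symbols. The sample rate, search range, 3% tolerance and the use of pure sine tones at C-major frequencies in place of real humming are all assumptions made for illustration.

```python
import math


def acf_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the pitch of one frame as the lag that maximises the
    time-domain autocorrelation within [fmin, fmax]."""
    lo = int(sample_rate / fmax)
    hi = int(sample_rate / fmin)

    def acf(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(lo, hi + 1), key=acf)
    return sample_rate / best_lag


def contour(pitches, tolerance=0.03):
    """Map successive pitch values to 'U', 'D' or 'S' (relative change)."""
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        ratio = cur / prev
        if ratio > 1 + tolerance:
            symbols.append("U")
        elif ratio < 1 - tolerance:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def tone(freq, sample_rate=8000, n=1024):
    """Synthetic stand-in for one hummed note: a pure sine frame."""
    return [math.sin(2 * math.pi * freq * k / sample_rate) for k in range(n)]


# Do Me Fa So Fa Re Me, using C-major frequencies in Hz (an assumption).
notes = [261.63, 329.63, 349.23, 392.00, 349.23, 293.66, 329.63]
pitches = [acf_pitch(tone(f), 8000) for f in notes]
print(contour(pitches))
```

Integer lags limit the pitch resolution, which is acceptable here because only the direction of change is kept.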
Audio Retrieval and Indexing – Direct
Audio Search
First step:
A covariance matrix is calculated from the feature
vectors of the cue audio and of each clip in the database
Second step:
The AHS (arithmetic-harmonic sphericity) distance
measure is used to calculate a score
Last step:
List the audio clips in the database according to the
score
[Figure: source clip → AHS comparison with target clips of the same size]
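The two computation steps can be sketched as follows, using the commonly cited form of the AHS measure, log(tr(Cy Cx^-1) · tr(Cx Cy^-1)) − 2 log d, where d is the feature dimension. The slides do not give the formula, so treat this form, the helper names and the toy 2-D feature vectors as assumptions.

```python
import math


def covariance(vectors):
    """Sample covariance matrix (d x d) of a list of d-dimensional vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    return [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / n
             for j in range(d)] for i in range(d)]


def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    d = len(m)
    aug = [list(row) + [float(i == j) for j in range(d)] for i, row in enumerate(m)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def trace_of_product(a, b):
    """tr(a @ b) without forming the full product."""
    d = len(a)
    return sum(a[i][k] * b[k][i] for i in range(d) for k in range(d))


def ahs_distance(cx, cy):
    """Arithmetic-harmonic sphericity distance between two covariance matrices."""
    d = len(cx)
    return math.log(trace_of_product(cy, inverse(cx))
                    * trace_of_product(cx, inverse(cy))) - 2 * math.log(d)


# Hypothetical 2-D feature vectors for a cue clip and two database clips.
cue = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 2.5]]
database = {
    "clip_a": [[0.0, 0.1], [1.0, 1.1], [2.0, 2.4], [3.0, 2.6]],
    "clip_b": [[0.0, 3.0], [1.0, 0.5], [2.0, 2.0], [3.0, 0.0]],
}
cue_cov = covariance(cue)
ranking = sorted(database,
                 key=lambda name: ahs_distance(cue_cov, covariance(database[name])))
print(ranking)
```

The distance is zero for identical covariance matrices and grows as they differ, so here the clips are listed in ascending order of distance.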
Audio Retrieval and Indexing – Search on
Server
Direct audio search on the server
The server side has a database
The client connects to the server
The client selects a cue audio and uploads it to the
server
The server performs the direct audio search and
sends back the result
The client can use audio streaming to get
the result file
Basic Part - Audio Streaming
• AdvAIR is an N-to-N system: it allows N servers and N
clients
• Clients and servers can be added at any time
• It is fault tolerant
Basic Part – Server Side
• Has two parts:
For audio streaming
For searching on the server (direct search on server)
• They are separated because searching on the server uses a
lot of resources
• A server cannot serve too many users at the
same time
• Only privileged users are allowed to use the search
on server function
Basic Part – Client Side
When a client requests a download, the audio clip
is divided into many small parts
Each server sends small parts to the client
simultaneously to speed up the
download
The client combines all the small parts to form
the whole file
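The multi-server download can be sketched as below. Servers are simulated by plain Python callables rather than network connections. The fixed part size, the round-robin assignment and the fallback to another server on failure are assumptions about how the parts might be scheduled, not a description of the AdvAIR protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, part_size):
    """Byte ranges [start, end) that cover a file of the given size."""
    return [(s, min(s + part_size, size)) for s in range(0, size, part_size)]


def download(size, servers, part_size):
    """Fetch every part from some server in parallel and reassemble in order.
    `servers` is a list of callables fetch(start, end) -> bytes."""
    ranges = split_ranges(size, part_size)

    def fetch_part(index):
        start, end = ranges[index]
        # Start from a different server per part; fall back to the others on failure.
        for attempt in range(len(servers)):
            server = servers[(index + attempt) % len(servers)]
            try:
                return server(start, end)
            except ConnectionError:
                continue
        raise ConnectionError(f"no server could deliver bytes {start}-{end}")

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        parts = list(pool.map(fetch_part, range(len(ranges))))
    return b"".join(parts)


# Simulated servers that all hold the same clip; one of them is down.
CLIP = bytes(range(256)) * 40


def healthy_server(start, end):
    return CLIP[start:end]


def broken_server(start, end):
    raise ConnectionError("server unavailable")


result = download(len(CLIP), [healthy_server, broken_server, healthy_server],
                  part_size=1024)
print(result == CLIP)
```

`pool.map` returns results in request order, so the parts can be joined directly even though they finish at different times.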
The End