Transcript qbt
Query by Tapping
敲擊選歌
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
2016/3/27
1
Query by Tapping
Goal:
Music search based on uses’ tapping (at notes’
onsets) over the microphone/keyboard
Characteristics
Only note duration is used for comparison, note
pitch is discarded.
A hard task for human to recognize (which is
different from query by singing/humming)
Try this…
-2-
Query by Tapping
Goal:
Music search based on uses’ tapping (at notes’
onsets) over the microphone/keyboard
Characteristics
Only note duration is used for comparison, note
pitch is discarded.
A hard task for human to recognize (which is
different from query by singing/humming)
Try this…
-3-
Query by Tapping
Challenges:
Users is unlikely to use the same tempo as the
intended song
Users tend to lose notes instead of gaining ones
We have about 13,000 songs in the database
Major approach:
A distance measure based on dynamic
programming
-4-
Feature Extraction via Microphone
Microphone input:
After frame blocking, energy computation,
and thresholding:
-6-
Performance Evaluation of Onset
Detection
simSequence.m
1
Computed
GT
0.8
precision=3/6=0.5
recall=3/5=0.6
f-measure=2pr/(p+r)=0.5455
0.6
0.4
0.2
0
-0.2
-0.4
-0.6
-0.8
-1
0
1
2
3
4
5
6
-7-
Similarity Comparison with Songs in
Database
A fast method based on IOI ratios
Compute the IOI ratios for both query and db IOI
vectors
Compute the Euclidean distance these two ratio
vectors
-8-
Music Note Alignment
t: test (input) IOI vector
r: reference IOI vector
Alignment
by DP
Normalization
r(3)
t(3)
r(2)
t(2)
t(1)
r(1)
t
r
t
r
t
r
-9-
Normalization
Normalization to have
p
q
i 1
j 1
~
~
t
(
i
)
rq (i) 1000
~
t round (1000 * t / sum(t ))
~
r round (1000 * r (1 : q) / sum(r (1 : q)))
q
(Multiplication of 1000 to guarantee high resolution
in fixed-point computation.)
dist (t , r )
min
q p 2~ p 2
~
D( t , ~
rq )
-10-
Dynamic-programming-based Distance
j
t: test IOI vector of length m
r: reference IOI vector of length n
D (i , j )
Recurrent relation:
r(j-1)
D(i, j )
r(j-2)
D(i 1, j 2) r ( j 2) r ( j 1) t (i 1) 1
min D(i 1, j 1) t (i 1) r ( j 1)
D(i 2, j 1) t (i 2) t (i 1) r ( j 1)
2
r(2)
D(i,1) 0, i 1 ~ m 1
r(1)
t(1) t(2)
t(i-2) t(i-1)
D(1, j ) 0, j 1 ~ n 1
i
-11-
Experimental Environment
269 test wave files of tapping clips
9 contributors (7 males, 2 females)
Wave length: 15 seconds
Wave format: PCM, 11025Hz, 8bits, Mono
Start position: Beginning of a song
Environment
Pentium III 800, 256MB RAM
Database
11,744 MIDI files
-12-
Test Results Using Clips of 15 Seconds
Average response time: 3.42
seconds (29.98 notes)
Recognition rates:
Top-1 (top 0.0085%): 15%
Top-10 (top 0.085%): 51%
Top-100 (top 0.85%): 80%
-13-
Error Analysis
Errors analysis of low-ranked clips
Some users cannot tap consistently through 15
seconds
Feature extraction is not robust enough to handle
noisy input.
Some MIDI files are not faithful rendition of the
original tunes.
Users cannot keep up with short consecutive
notes.
-14-
Recog. Rates w.r.t. Tapping Duration
Top-100 and 1000
curves level off after 10
seconds.
Top-100 curve does not
go up monotonically.
Top-1000
Top-100
Top-10
-15-
Demo
No. of MIDI files: 12982
-16-
Partial List of Songs
All I have to do is dream
You are my sunshine
Beautiful Sunday
Do Re Mi
Feelings
A time for us
Love is blue
Let it be me
My way
Love story
More than I can say
Only you
Rain and tears
Rhythm of the rain
Rose Rose I love you
The sound of silence
Unchained melody
We are the world
Yesterday
I just call to say I love you
Close to you
Mr. Lonely
Ben
Hey Jude
Donna Donna
Sealed with a kiss
-17-
Potential Applications
Interactive toys
Beat-tracking training and games
Song retrieval in noisy karaoke bars
-18-
Conclusions
Our MIR system is the first one with queryby-tapping capability.
Rhythm-based search can be used in
conjunction with pitch-contour-based search
to achieve a better recognition rate.
-19-
Future Work
Search scope expansion
How to retrieve MP3 or CD music directly?
Scale-up by hierarchical filtering method
How to deal with database with 100,000 songs?
What if the user tap from anywhere in the middle
of a song?
-20-