Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
2016/4/6
Outline
Introduction to MIR
QBSH (query by singing/humming)
Intro, demos, conclusions
AFP (audio fingerprinting)
Intro, demos, conclusions
Content-based Music Information Retrieval (MIR) via Acoustic Inputs
Melody
Query by humming (usually “ta” or “da”)
Query by singing
Query by whistling
Note onsets
Query by tapping (at the onsets of notes)
Metadata
Query by speech (for metadata such as title, artist, lyrics)
Audio contents
Query by example (noisy versions of original clips)
Drums
Query by beatboxing
Introduction to QBSH
QBSH: Query by Singing/Humming
Input: Singing or humming from microphone
Output: A ranked list retrieved from the song database
Progression
First paper: Around 1994
Extensive studies since 2001
State of the art: QBSH tasks at ISMIR/MIREX
Two Stages in QBSH
Offline stage
Database preparation: from MIDI files, from audio music (e.g., MP3), from human vocals
Indexing (if necessary)
Online stage
Perform pitch tracking on the user’s query
Compare the query pitch with songs in the database
Return the ranked list according to similarity
Frame Blocking for Pitch Tracking
[Figure: waveform split into overlapping frames, plus a zoomed-in view of a single frame]
Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
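A minimal sketch of this frame blocking in Python, assuming NumPy (the function name and the lack of edge handling are illustrative):

import numpy as np

def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a signal into overlapping frames (slide parameters as defaults)."""
    hop = frame_size - overlap                    # 256 - 84 = 172 samples per hop
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

# At fs = 11025 Hz, one frame starts every 172 samples,
# i.e., 11025/172 ~ 64 pitch estimates per second.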
ACF: Auto-correlation Function
Frame s(i) and its shifted copy s(i+t); at t = 30, for instance, acf(30) is the inner product of the overlapping parts of the two frames.

\mathrm{acf}(t) = \sum_{i=0}^{n-1-t} s(i)\, s(i+t)

The lag t that maximizes acf(t) gives the pitch period.
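A minimal Python sketch of ACF-based pitch estimation following this definition (the lag bounds are chosen to match the E2 to C6 range of the next slide; not necessarily the exact implementation used here):

import numpy as np

def acf_pitch(frame, fs=11025, min_lag=10, max_lag=135):
    """Estimate a frame's pitch via the auto-correlation function.

    Lags of 10..135 samples cover roughly C6 (1047 Hz) down to E2 (82 Hz) at fs = 11025.
    """
    n = len(frame)
    # acf(t) = sum_{i=0}^{n-1-t} s(i)*s(i+t): inner product of the overlapping parts
    acf = [np.dot(frame[: n - t], frame[t:]) for t in range(max_lag + 1)]
    best_lag = min_lag + int(np.argmax(acf[min_lag:]))   # pitch period in samples
    return fs / best_lag                                 # pitch frequency in Hz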
Frequency to Semitone Conversion
Semitone: a musical scale based on A440

\mathrm{semitone} = 69 + 12 \log_2\!\left(\frac{\mathrm{freq}}{440}\right)

Reasonable pitch range: E2 to C6 (82 Hz to 1047 Hz)
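The conversion as a one-line Python function (a sketch; the function name is illustrative):

import math

def freq_to_semitone(freq):
    """Map a frequency in Hz to a semitone number on the A440 scale."""
    return 69 + 12 * math.log2(freq / 440.0)

# freq_to_semitone(440)  -> 69.0  (A4)
# freq_to_semitone(82)   -> ~39.9 (E2);  freq_to_semitone(1047) -> ~84.0 (C6)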
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower)
Comparison of Pitch Vectors
Yellow line: target pitch vector
Comparison Methods of QBSH
Categories of approaches to QBSH:
Histogram/statistics-based
Note vs. note: edit distance
Frame vs. note: HMM
Frame vs. frame: linear scaling, DTW, recursive alignment (see the DTW sketch below)
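For the frame-vs-frame category, a minimal dynamic time warping sketch over two pitch vectors (standard textbook DTW, not necessarily the variant used here):

import numpy as np

def dtw_distance(query, target):
    """Frame-vs-frame distance between two pitch vectors via DTW."""
    m, n = len(query), len(target)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(query[i - 1] - target[j - 1])   # local pitch difference
            D[i, j] = cost + min(D[i - 1, j],          # skip a query frame
                                 D[i, j - 1],          # skip a target frame
                                 D[i - 1, j - 1])      # match both
    return D[m, n]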
Linear Scaling
Scale the query pitch linearly to match the candidates.
[Figure: the original input pitch compressed by 0.5 and 0.75, and stretched by 1.25 and 1.5, each compared against the target pitch in the database; the best match is kept.]
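A sketch of linear scaling with the factors shown in the figure (comparing against the target's prefix with a mean absolute distance is an assumption):

import numpy as np

def linear_scaling_distance(query, target,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Stretch/compress the query pitch vector and keep the best match."""
    best = np.inf
    for f in factors:
        n = int(round(len(query) * f))
        if n < 2 or n > len(target):
            continue
        # resample the query to n points by linear interpolation
        scaled = np.interp(np.linspace(0, len(query) - 1, n),
                           np.arange(len(query)), query)
        best = min(best, np.mean(np.abs(scaled - target[:n])))
    return best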
Challenges in QBSH Systems
Song database preparation
MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
Input from mobile devices or in noisy karaoke bars
Efficient/effective retrieval
Karaoke machine: ~10,000 songs
Internet music search engine: ~500,000,000 songs
Goal and Approach
Goal: to retrieve songs effectively within a given response time, say, 5 seconds
Our strategy
Multi-stage progressive filtering
Indexing for different comparison methods
Repeating-pattern identification
MIRACLE
Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13,000)
MIDI files
Solo vocals (<100)
Melody extracted from polyphonic music (<100)
Comparison methods
Linear scaling
Dynamic time warping
Top-10 accuracy: 70~75%
Platform: single CPU + GPU
Current MIRACLE
Single server with GPU: NVIDIA 560 Ti, 384 cores (speedup factor = 10)
Database size: ~13,000
[Figure: clients (PC, PDA/smartphone over cellular) send a pitch-vector request to the master server, which returns the search result.]
QBSH for Various Platforms
PC
Web version
Embedded systems
Karaoke machines
Smartphones
iPhone/iPad
Android phone
Toys
QBSH Demo
Demo page of MIR Lab:
http://mirlab.org/mir_products.asp
MIRACLE demo:
http://mirlab.org/demo/miracle
Existing commercial QBSH systems
www.midomi.com
www.soundhound.com
Conclusions for QBSH
QBSH
Fun and interesting way to retrieve music
Can be extended to singing scoring
Commercial applications getting mature
Challenges
How to deal with massive music databases?
How to extract melody from audio music?
Audio Fingerprinting (AFP)
Goal
Identify a noisy version of a given audio clip (query by example, not by “cover versions”)
Technical barriers
Robustness
Efficiency (6M tags/day for Shazam)
Effectiveness (15M tracks for Shazam)
Applications
Song purchase
Royalty assignment (over radio)
Confirmation of commercials (over TV)
Copyright violation (over web)
TV program ID
Two Stages in AFP
Offline
Robust feature extraction (audio fingerprinting)
Hash table construction
Inverted indexing (see the sketch below)
Online
Robust feature extraction
Hash table search
Ranked list of the retrieved songs/music
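A sketch of the offline hash table/inverted index construction (the data layout is an assumption; fingerprint extraction itself is not shown):

from collections import defaultdict

def build_index(songs):
    """Build an inverted index: hash key -> list of (song ID, offset time).

    `songs` maps song_id -> list of (time, hash_key) fingerprints.
    """
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for t, key in fingerprints:
            index[key].append((song_id, t))
    return index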
Representative Approaches to AFP
Philips: J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam: A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google:
S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
Shazam: Landmarks as Features
Spectrogram
Local peaks of the spectrogram
Pair peaks to form landmarks
Landmark: [t1, f1, t2, f2]
20-bit hash key:
f1: 8 bits
Δf = f2-f1: 6 bits
Δt = t2-t1: 6 bits
Hash value: song ID & offset time
(Source: Dan Ellis)
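A sketch of packing a landmark into the 20-bit hash key (field widths from the slide; the exact bit layout is an assumption):

def landmark_key(t1, f1, t2, f2):
    """Pack a landmark [t1, f1, t2, f2] into a 20-bit hash key."""
    df = (f2 - f1) & 0x3F                         # Δf in 6 bits
    dt = (t2 - t1) & 0x3F                         # Δt in 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt   # f1 in the top 8 bits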
Shazam: Landmarks as Features (II)
Peak picking after smoothing
Matched landmarks (green)
(Source: Dan Ellis)
Shazam: Time-justified Landmarks
Valid landmarks are selected by their offset time, which maintains robustness even with hash collisions (see the matching sketch below).
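A sketch of this offset-time matching against the inverted index built offline (a simple vote count; names are illustrative):

from collections import Counter

def match_song(query_fingerprints, index):
    """Vote on (song ID, offset) pairs; colliding hash keys rarely agree
    on a consistent offset, so the top vote is a robust match."""
    votes = Counter()
    for t_query, key in query_fingerprints:
        for song_id, t_db in index.get(key, ()):
            votes[(song_id, t_db - t_query)] += 1   # consistent offset => true match
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count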
Our AFP Engine
Database (~2,500)
2,500 tracks currently
50K tracks soon
1M tracks in the future
Driving forces
Fundamental issues in computer science (hashing, indexing…)
Requests from local companies
Methods
Landmarks as features (Shazam)
Speedup by hash tables and inverted files
Platform
Currently: single CPU
In the future: multiple CPUs & GPUs
Experiments
Accuracy test
Corpora
Database: 2550 tracks
Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
Accuracy vs. duration
5-sec clips: 161/275 = 58.6%
10-sec clips: 121/136 = 89.0%
15-sec clips: 88/90 = 97.8%
20-sec clips: 65/66 = 98.5%
Computing time vs. duration
Accuracy vs. computing time
Demos of Audio Fingerprinting
Commercial apps
Shazam
SoundHound
Our demo
http://mirlab.org/demo/afpFarmer2550
Conclusions for AFP
Conclusions
Landmark-based methods are effective
Machine learning is indispensable for further improvement
Future work: scale up
Shazam: 15M tracks in database, 6M tags/day
Our goal:
50K tracks with a single PC and GPU
1M tracks with cloud computing on 10 PCs