Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval Lab
CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
2016/4/6
Outline
Introduction to MIR
QBSH (query by singing/humming)
Intro, demos, conclusions
AFP (audio fingerprinting)
Intro, demos, conclusions
Content-based Music Information Retrieval (MIR) via Acoustic Inputs
Melody
Query by humming (usually “ta” or “da”)
Query by singing
Query by whistling
Note onsets
Query by tapping (at the onsets of notes)
Metadata
Query by speech (for metadata such as title, artist, lyrics)
Audio contents
Query by example (noisy versions of original clips)
Drums
Query by beatboxing
Introduction to QBSH
QBSH: Query by Singing/Humming
Input: Singing or humming from microphone
Output: A ranked list retrieved from the song database
Progression
First paper: Around 1994
Extensive studies since 2001
State of the art: QBSH tasks at ISMIR/MIREX
Two Stages in QBSH
Offline stage
Database preparation: from MIDI files, from audio music (e.g., MP3), from human vocals
Indexing (if necessary)
Online stage
Perform pitch tracking on the user’s query
Compare the query pitch with songs in the database
Return the ranked list according to similarity
Frame Blocking for Pitch Tracking
[Figure: waveform split into overlapping frames, plus a zoomed-in view of a single frame]
Frame size = 256 points
Overlap = 84 points
Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
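A minimal sketch of this frame blocking in Python, assuming NumPy (the function name and the lack of edge handling are illustrative):

import numpy as np

def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a signal into overlapping frames (slide parameters as defaults)."""
    hop = frame_size - overlap                    # 256 - 84 = 172 samples per hop
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

# At fs = 11025 Hz, one frame starts every 172 samples,
# i.e., 11025/172 ~ 64 pitch estimates per second.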
ACF: Auto-correlation Function
Frame s(i) and its shifted copy s(i+t); at t = 30, for instance, acf(30) is the inner product of the overlapping parts of the two frames.

\mathrm{acf}(t) = \sum_{i=0}^{n-1-t} s(i)\, s(i+t)

The lag t that maximizes acf(t) gives the pitch period.
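A minimal Python sketch of ACF-based pitch estimation following this definition (the lag bounds are chosen to match the E2 to C6 range of the next slide; not necessarily the exact implementation used here):

import numpy as np

def acf_pitch(frame, fs=11025, min_lag=10, max_lag=135):
    """Estimate a frame's pitch via the auto-correlation function.

    Lags of 10..135 samples cover roughly C6 (1047 Hz) down to E2 (82 Hz) at fs = 11025.
    """
    n = len(frame)
    # acf(t) = sum_{i=0}^{n-1-t} s(i)*s(i+t): inner product of the overlapping parts
    acf = [np.dot(frame[: n - t], frame[t:]) for t in range(max_lag + 1)]
    best_lag = min_lag + int(np.argmax(acf[min_lag:]))   # pitch period in samples
    return fs / best_lag                                 # pitch frequency in Hz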
Frequency to Semitone Conversion
Semitone: a musical scale based on A440

\mathrm{semitone} = 69 + 12 \log_2\!\left(\frac{\mathrm{freq}}{440}\right)

Reasonable pitch range: E2 to C6 (82 Hz to 1047 Hz)
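The conversion as a one-line Python function (a sketch; the function name is illustrative):

import math

def freq_to_semitone(freq):
    """Map a frequency in Hz to a semitone number on the A440 scale."""
    return 69 + 12 * math.log2(freq / 440.0)

# freq_to_semitone(440)  -> 69.0  (A4)
# freq_to_semitone(82)   -> ~39.9 (E2);  freq_to_semitone(1047) -> ~84.0 (C6)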
Typical Result of Pitch Tracking
Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower)
Comparison of Pitch Vectors
Yellow line: target pitch vector
Comparison Methods of QBSH
Categories of approaches to QBSH:
Histogram/statistics-based
Note vs. note: edit distance
Frame vs. note: HMM
Frame vs. frame: linear scaling, DTW, recursive alignment (see the DTW sketch below)
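For the frame-vs-frame category, a minimal dynamic time warping sketch over two pitch vectors (standard textbook DTW, not necessarily the variant used here):

import numpy as np

def dtw_distance(query, target):
    """Frame-vs-frame distance between two pitch vectors via DTW."""
    m, n = len(query), len(target)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(query[i - 1] - target[j - 1])   # local pitch difference
            D[i, j] = cost + min(D[i - 1, j],          # skip a query frame
                                 D[i, j - 1],          # skip a target frame
                                 D[i - 1, j - 1])      # match both
    return D[m, n]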
Linear Scaling
Scale the query pitch linearly to match the candidates.
[Figure: the original input pitch compressed by 0.5 and 0.75, and stretched by 1.25 and 1.5, each compared against the target pitch in the database; the best match is kept.]
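A sketch of linear scaling with the factors shown in the figure (comparing against the target's prefix with a mean absolute distance is an assumption):

import numpy as np

def linear_scaling_distance(query, target,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Stretch/compress the query pitch vector and keep the best match."""
    best = np.inf
    for f in factors:
        n = int(round(len(query) * f))
        if n < 2 or n > len(target):
            continue
        # resample the query to n points by linear interpolation
        scaled = np.interp(np.linspace(0, len(query) - 1, n),
                           np.arange(len(query)), query)
        best = min(best, np.mean(np.abs(scaled - target[:n])))
    return best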
Challenges in QBSH Systems
Song database preparation
MIDIs, singing clips, or audio music
Reliable pitch tracking for acoustic input
Input from mobile devices or in noisy karaoke bars
Efficient/effective retrieval
Karaoke machine: ~10,000 songs
Internet music search engine: ~500,000,000 songs
Goal and Approach
Goal: to retrieve songs effectively within a given response time, say, 5 seconds
Our strategy
Multi-stage progressive filtering
Indexing for different comparison methods
Repeating-pattern identification
MIRACLE
Music Information Retrieval Acoustically via Clustered and paralleL Engines
Database (~13,000)
MIDI files
Solo vocals (<100)
Melody extracted from polyphonic music (<100)
Comparison methods
Linear scaling
Dynamic time warping
Top-10 accuracy: 70~75%
Platform: single CPU + GPU
Current MIRACLE
Single server with GPU: NVIDIA 560 Ti, 384 cores (speedup factor = 10)
Database size: ~13,000
[Figure: clients (PC, PDA/smartphone over cellular) send a pitch-vector request to the master server, which returns the search result.]
QBSH for Various Platforms
PC
Web version
Embedded systems
Karaoke machines
Smartphones
iPhone/iPad
Android phone
Toys
QBSH Demo
Demo page of MIR Lab:
http://mirlab.org/mir_products.asp
MIRACLE demo:
http://mirlab.org/demo/miracle
Existing commercial QBSH systems
www.midomi.com
www.soundhound.com
Conclusions for QBSH
QBSH
Fun and interesting way to retrieve music
Can be extended to singing scoring
Commercial applications getting mature
Challenges
How to deal with massive music databases?
How to extract melody from audio music?
Audio Fingerprinting (AFP)
Goal
Identify a noisy version of a given audio clip (query by example, not by “cover versions”)
Technical barriers
Robustness
Efficiency (6M tags/day for Shazam)
Effectiveness (15M tracks for Shazam)
Applications
Song purchase
Royalty assignment (over radio)
Confirmation of commercials (over TV)
Copyright violation (over web)
TV program ID
Two Stages in AFP
Offline
Robust feature extraction (audio fingerprinting)
Hash table construction
Inverted indexing (see the sketch below)
Online
Robust feature extraction
Hash table search
Ranked list of the retrieved songs/music
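A sketch of the offline hash table/inverted index construction (the data layout is an assumption; fingerprint extraction itself is not shown):

from collections import defaultdict

def build_index(songs):
    """Build an inverted index: hash key -> list of (song ID, offset time).

    `songs` maps song_id -> list of (time, hash_key) fingerprints.
    """
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for t, key in fingerprints:
            index[key].append((song_id, t))
    return index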
Representative Approaches to AFP
Philips: J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam: A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google:
S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
Shazam: Landmarks as Features
Spectrogram
Local peaks of the spectrogram
Pair peaks to form landmarks
Landmark: [t1, f1, t2, f2]
20-bit hash key:
f1: 8 bits
Δf = f2-f1: 6 bits
Δt = t2-t1: 6 bits
Hash value: song ID & offset time
(Source: Dan Ellis)
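A sketch of packing a landmark into the 20-bit hash key (field widths from the slide; the exact bit layout is an assumption):

def landmark_key(t1, f1, t2, f2):
    """Pack a landmark [t1, f1, t2, f2] into a 20-bit hash key."""
    df = (f2 - f1) & 0x3F                         # Δf in 6 bits
    dt = (t2 - t1) & 0x3F                         # Δt in 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt   # f1 in the top 8 bits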
Shazam: Landmarks as Features (II)
Peak picking after smoothing
Matched landmarks (green)
(Source: Dan Ellis)
Shazam: Time-justified Landmarks
Valid landmarks are selected by their offset time, which maintains robustness even with hash collisions (see the matching sketch below).
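A sketch of this offset-time matching against the inverted index built offline (a simple vote count; names are illustrative):

from collections import Counter

def match_song(query_fingerprints, index):
    """Vote on (song ID, offset) pairs; colliding hash keys rarely agree
    on a consistent offset, so the top vote is a robust match."""
    votes = Counter()
    for t_query, key in query_fingerprints:
        for song_id, t_db in index.get(key, ()):
            votes[(song_id, t_db - t_query)] += 1   # consistent offset => true match
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count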
Our AFP Engine
Database (~2,500)
2,500 tracks currently
50K tracks soon
1M tracks in the future
Driving forces
Fundamental issues in computer science (hashing, indexing…)
Requests from local companies
Methods
Landmarks as features (Shazam)
Speedup by hash tables and inverted files
Platform
Currently: single CPU
In the future: multiple CPUs & GPUs
Experiments
Accuracy test
Corpora
Database: 2550 tracks
Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
Accuracy vs. duration
5-sec clips: 161/275 = 58.6%
10-sec clips: 121/136 = 89.0%
15-sec clips: 88/90 = 97.8%
20-sec clips: 65/66 = 98.5%
Computing time vs. duration
Accuracy vs. computing time
Demos of Audio Fingerprinting
Commercial apps
Shazam
SoundHound
Our demo
http://mirlab.org/demo/afpFarmer2550
Conclusions for AFP
Conclusions
Landmark-based methods are effective
Machine learning is indispensable for further improvement
Future work: scale up
Shazam: 15M tracks in database, 6M tags/day
Our goal:
50K tracks with a single PC and GPU
1M tracks with cloud computing on 10 PCs