Audio Fingerprinting
J.-S. Roger Jang (張智星)
[email protected]
http://mirlab.org/jang
MIR Lab, CSIE Dept.
National Taiwan University
Intro to Audio Fingerprinting (AFP)
Goal
Identify a noisy version of a given audio clip
Also known as…
“Query by exact example” → no “cover versions”
Can also be used to…
Align two different-speed audio clips of the same
source
Dan Ellis used AFP for aligned annotation on the Beatles dataset
AFP Challenges
Music variations
Encoding/compression (MP3 encoding, etc.)
Channel variations
Speakers & microphones, room acoustics
Environmental noise
Efficiency (6M tags/day for Shazam)
Database collection (15M tracks for Shazam)
AFP Applications
Commercial applications of AFP
Music identification & purchase
Royalty assignment (over radio)
ID of TV shows or commercials (over TV)
Copyright violation detection (e.g., YouTube)
Major commercial players
Shazam, SoundHound, IntoNow, Viggle…
Company: Shazam
Facts
First commercial product of AFP
Since 2002, UK
Technology
Audio fingerprinting
Founder
Avery Wang (PhD at Stanford, 1994)
Company: Soundhound
Facts
First product with multi-modal music search
AKA: midomi
Technologies
Audio fingerprinting
Query by singing/humming
Speech recognition
Multi-modal retrieval
Founder
Keyvan Mohajer (PhD at Stanford, 2007)
Two Stages in AFP
Offline
Feature extraction
Hash table construction for songs in the database → inverted indexing
Online
Feature extraction
Hash table search → ranked list of the retrieved songs/music
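The two-stage split above can be summarized in code. Below is a minimal Python sketch (not the lab's actual implementation), where `extract` stands for any landmark extractor returning (t1, f1, t2, f2) tuples; the extraction and hashing details are filled in on later slides.

```python
from collections import defaultdict

class AfpEngine:
    """Minimal sketch of the two-stage AFP architecture."""

    def __init__(self, extract):
        self.extract = extract              # placeholder landmark extractor
        self.table = defaultdict(list)      # inverted index: key -> [(song_id, t1)]

    def add_song(self, song_id, audio):
        """Offline stage: feature extraction + hash table construction."""
        for t1, f1, t2, f2 in self.extract(audio):
            self.table[(f1, f2 - f1, t2 - t1)].append((song_id, t1))

    def query(self, audio):
        """Online stage: feature extraction + hash table search."""
        votes = defaultdict(int)            # (song_id, offset) -> matched LM count
        for t1, f1, t2, f2 in self.extract(audio):
            for song_id, db_t1 in self.table.get((f1, f2 - f1, t2 - t1), []):
                votes[(song_id, db_t1 - t1)] += 1
        # ranked list of the retrieved songs, best match first
        return sorted(votes.items(), key=lambda kv: -kv[1])
```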
Robust Feature Extraction
Various kinds of features for AFP
Invariance along time and frequency
Landmark of a pair of local maxima
Wavelets
…
Extensive tests are required for choosing the best features
Representative Approaches to AFP
Philips
J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
Shazam
A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
Google
S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
Philips: Thresholding as Features
Observation
Magnitude spectrum S(t, f)
The sign of energy
differences is robust to
various operations
Lossy encoding
Range compression
Added noise
Thresholding as features
F(t, f) = 1 if S(t, f) + S(t−1, f+1) > S(t, f+1) + S(t−1, f), and 0 otherwise
Fingerprint: F(t, f)
(Source: Dan Ellis)
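A minimal NumPy sketch of this bit extraction, assuming S is a magnitude spectrogram indexed as S[t, f] (the variable names are illustrative):

```python
import numpy as np

def philips_fingerprint(S):
    """Binary fingerprint F(t, f): sign of the time difference of
    energy differences along frequency (Haitsma & Kalker, 2002)."""
    d = S[:, :-1] - S[:, 1:]                # S(t, f) - S(t, f+1)
    return (d[1:, :] - d[:-1, :] > 0).astype(np.uint8)  # compare with frame t-1
```

This is equivalent to the inequality above: F(t, f) = 1 exactly when S(t, f) + S(t−1, f+1) > S(t, f+1) + S(t−1, f).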
Philips: Thresholding as Features (II)
Robust to low-bitrate MP3 encoding (see the figure)
Sensitive to “frame time difference” → hop size is kept small!
(Figure: original fingerprint vs. fingerprint after MP3 encoding; BER = 0.078)
Philips: Robustness of Features
BER of the features after various operations
Generally low
High for speed and time-scale changes (which are unlikely to occur under query by example)
Philips: Search Strategies
Via hashing
Inverted indexing
Shazam’s Method
Ideas
Take advantage of local structures in music
Find salient peaks on spectrogram
Pair peaks to form landmarks for comparison
Efficient search by hash tables
Use positions of landmarks (excluding t1) as hash keys
Use song ID and start time as hash values
Use time constraints to find matched landmarks
Database Preparation
Compute spectrogram
Perform mean subtraction & high-pass filtering
Detect salient peaks
Find initial threshold
Update the threshold along time
Pair salient peaks to form landmarks
Define target zone
Form landmarks and save them to a hash table
Query Match
Identify landmarks
Find matched landmarks
Retrieve landmarks from the hash table
Keep only time-consistent landmarks
Rank the database items
Via matched landmark counts
Via other confidence measures
Shazam: Landmarks as Features
Pair peaks in a target zone to form landmarks
(Figure: salient peaks of a spectrogram, paired into landmarks)
• Landmark: [t1, f1, t2, f2]
• 24-bit hash key:
• f1: 9 bits
• Δf = f2−f1: 8 bits
• Δt = t2−t1: 7 bits
• Hash value:
• Song ID
• Landmark's start time t1
(Avery Wang, 2003)
How to Find Salient Peaks
We need to find peaks that are salient along
both frequency and time axes
Frequency axis: Gaussian local smoothing
Time axis: Decaying threshold over time
How to Find Initial Threshold?
Goal
To suppress neighboring peaks
Ideas
Find the local maxima of the magnitude spectra of the initial 10 frames
Superimpose a Gaussian on each local max
Find the max of all Gaussians
Example
Based on “Bad Romance”, generated by envelopeGen.m
(Figure: the original signal, its positive local maxima, and the final threshold envelope)
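A sketch of the initial-threshold computation in Python; whether the slides take per-frame local maxima or the pointwise max over the initial frames is not fully specified, and `sigma` is an assumed width in frequency bins:

```python
import numpy as np
from scipy.signal import find_peaks

def initial_threshold(S, num_frames=10, sigma=8.0):
    """Initial threshold = max of Gaussians superimposed on the local
    maxima of the magnitude spectra of the first num_frames frames."""
    spec = S[:num_frames].max(axis=0)       # combine the initial frames
    peaks, _ = find_peaks(spec)             # local maxima along frequency
    f = np.arange(len(spec))
    # one Gaussian per local max, scaled to the peak height
    gaussians = [spec[p] * np.exp(-0.5 * ((f - p) / sigma) ** 2) for p in peaks]
    return np.max(gaussians, axis=0) if gaussians else np.zeros_like(spec)
```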
How to Update the Threshold along Time?
Decay the threshold
Find local maxima larger than the threshold → salient peaks
Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
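A forward-sweep sketch of the decaying-threshold peak picking; the decay rate and Gaussian width are assumed values, not the toolbox's actual parameters:

```python
import numpy as np
from scipy.signal import find_peaks

def pick_salient_peaks(S, threshold, decay=0.998, sigma=8.0):
    """Decay the threshold each frame, keep local maxima above it as
    salient peaks, then raise the threshold with Gaussians centered at
    the accepted peaks. S[t, f]; threshold has one value per freq bin."""
    f = np.arange(S.shape[1])
    thr = threshold.copy()
    salient = []                            # list of (frame, freq_bin)
    for t, spec in enumerate(S):
        thr *= decay                        # time-decaying threshold
        peaks, _ = find_peaks(spec)
        active = [p for p in peaks if spec[p] > thr[p]]
        salient.extend((t, p) for p in active)
        for p in active:                    # new threshold = max(old, Gaussians)
            thr = np.maximum(thr, spec[p] * np.exp(-0.5 * ((f - p) / sigma) ** 2))
    return salient
```

The backward sweep runs the same procedure over the frames in reverse order; keeping only peaks found by both sweeps reduces the peak count, as noted on the next slide.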
How to Control the No. of Salient peaks?
To decrease the no. of salient peaks
Perform forward and backward sweeps to keep only peaks that are salient along both directions
Use Gaussians with a larger standard deviation
…
Time-decaying Thresholds
landmarkFind01.m
(Figure: forward-pass and backward-pass time-decaying thresholds, plotted as freq index vs. frame index)
How to Pair Salient Peaks?
Target zone
A target zone is created right after each salient peak
The leading peak is paired with each peak in the target zone to form landmarks
Each landmark is denoted by [t1, f1, t2, f2]
(Figure: peak 1 paired with peak 2 inside its target zone)
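A sketch of the pairing step; the target-zone bounds dt_max/df_max are assumed values (they must at least fit the 7-bit Δt and 8-bit Δf of the hash key):

```python
def pair_peaks(salient, dt_max=63, df_max=31):
    """Pair each leading peak with every peak in its target zone,
    yielding landmarks [t1, f1, t2, f2]."""
    peaks = sorted(salient)                     # (t, f) pairs, sorted by time
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > dt_max:                # beyond the target zone in time
                break
            if t2 > t1 and abs(f2 - f1) <= df_max:
                landmarks.append((t1, f1, t2, f2))
    return landmarks
```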
Salient Peaks and Landmarks
(Figures: peak picking after forward smoothing; matched landmarks shown in green)
(Source: Dan Ellis)
Time Skew
Out-of-sync frame boundaries between the reference frames and the query frames cause time skew
(Figure: reference frames vs. query frames on a common time axis; the query's frame grid is shifted → time skew!)
Solution
Increase frame size
Repeated LM extraction at several time shifts
To Avoid Time Skew
To avoid time skew, query landmarks are extracted at various time shifts
Example of 4 shifts with step = hop/4 (see the sketch below)
(Diagram: LM set 1, LM set 2, LM set 3, LM set 4 → union & unique → query landmark set)
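A sketch of the shifted extraction, with `extract` again standing for the landmark extractor of the earlier slides:

```python
def query_landmarks(audio, extract, hop=512, num_shifts=4):
    """Union of the landmark sets extracted at num_shifts time shifts
    of step = hop/num_shifts samples each."""
    step = hop // num_shifts
    lm = set()
    for k in range(num_shifts):
        # landmarks from each shifted copy, on the common frame grid
        lm |= set(extract(audio[k * step:]))
    return lm                                   # union & unique
```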
Landmarks for Hash Table Access
Convert each landmark to a hash key/value pair
Landmark from the database → hash table creation
songId, [t1, f1, t2, f2]
24-bit hash key = f1 (9 bits) + ∆f (8 bits) + ∆t (7 bits)
32-bit hash value = songId (18 bits) + t1 (14 bits)
Landmark from the query → hash table lookup
Use f1, t2, f2 to generate the hash key for hash table lookup
Use t1 (start time) to find matched (time-consistent) landmarks
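The bit layout above translates directly into packing/unpacking routines; a Python sketch:

```python
def make_key(f1, df, dt):
    """24-bit hash key = f1 (9 bits) | df (8 bits) | dt (7 bits).
    Masking df keeps a negative f2 - f1 within its 8-bit field."""
    return (f1 << 15) | ((df & 0xFF) << 7) | (dt & 0x7F)

def make_value(song_id, t1):
    """32-bit hash value = songId (18 bits) | t1 (14 bits)."""
    return (song_id << 14) | (t1 & 0x3FFF)

def parse_value(v):
    """Recover (songId, t1) from a 32-bit hash value."""
    return v >> 14, v & 0x3FFF
```

For a landmark (t1, f1, t2, f2) of song songId, the stored pair is key = make_key(f1, f2 - f1, t2 - t1) and value = make_value(songId, t1).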
Parameters in Our Implementation
Landmarks
Sample rate = 8000 Hz
Frame size = 1024
Overlap = 512
Frame rate = 15.625 frames/sec
Landmark rate ≈ 400 LM/sec
Hash table
Hash key size = 2^24 = 16.78M
Max song ID = 2^18 = 262K
Max start time = 2^14/frameRate = 17.48 minutes
Our implementation is based on Dan Ellis’ work:
Robust Landmark-Based Audio Fingerprinting, http://labrosa.ee.columbia.edu/matlab/fingerprint
Structure of Hash Table
Collision happens when LMs have the same [f1, t2, f2]
Hash Table Lookup
Query: hash keys from landmarks (generated from f1, t2, f2)
Hash table: 2^24 entries, with keys 0 … 2^24−1; each key maps to a bucket of hash values
(Figure: query hash keys, e.g. 8002 and 15007, index into their buckets in the hash table; the hash values stored there are the retrieved landmarks)
How to Find Query Offset Time?
Offset time of the query can be derived from a retrieved and matched LM
A landmark is represented by its first peak
Query offset time = database LM start time − query LM start time
(Figure: retrieved database landmarks aligned against query landmarks on a common timeline)
Find Matched Landmarks
Start-time plot for landmarks
X axis: start time of database LMs; Y axis: start time of query LMs
Query offset time = x − y
Example: a given LM starting at t = 9.5 sec retrieves 3 LMs from the hash table, but only the time-consistent one is matched!
(Figure: start-time scatter plot with the query-offset line)
Find Matched Landmarks
We can determine the offset time by plotting a histogram of the start-time differences (x − y); the true offset shows up as the dominant bin
(Figures: start-time plot and histogram of start-time differences)
(Avery Wang, 2003)
Matched Landmark Count
To find the matched (time-consistent) landmark count of a song:

All retrieved landmarks (note the LMs from the same song, e.g. song 2286):

Song ID | Offset time | Hash value
2046    | 6925        | 485890
2286    | 555         | 485890
2286    | 795         | 485890
2286    | 1035        | 485890
2286    | 2715        | 384751
2286    | 555         | 384751
2286    | 556         | 963157
…       | …           | …

Histogram of the offset times of a song:

Offset time | Count
…           | 18
…           | 1
…           | 1
…           | 1
…           | …

The peak count (18) is the matched landmark count of song 2286.
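A sketch of this counting step in Python (the triples below are illustrative, shaped like the table above):

```python
from collections import Counter, defaultdict

def matched_counts(retrieved):
    """retrieved: (song_id, db_t1, query_t1) triples for every landmark
    whose hash key hit the table. A song's matched (time-consistent)
    landmark count is the peak bin of its offset-time histogram."""
    hist = defaultdict(Counter)                 # song_id -> offset histogram
    for song_id, db_t1, query_t1 in retrieved:
        hist[song_id][db_t1 - query_t1] += 1    # offset time = x - y
    return {sid: c.most_common(1)[0][1] for sid, c in hist.items()}
```

A production system would typically bin nearby offsets (such as 555 and 556 in the table above) into the same bucket to tolerate small timing errors.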
Final Ranking
A common way to obtain the final ranking
Based on each song's matched landmark count
Can also be converted into scores between 0 and 100

Song ID | Matched landmark count
2286    | 18
2746    | 13
2255    | 9
2033    | 5
2019    | 4
…       | …
Matched Landmarks vs. Noise
(Figure: spectrograms, freq index vs. frame index, with matched landmarks for the original clip and three noisy versions, Noisy01–Noisy03)
Run goLmVsNoise.m in AFP toolbox to create this example.
Optimization Strategies for AFP
Several ways to optimize AFP
Strategy for query landmark extraction
Confidence measure
Incremental retrieval
Better use of the hash table
Re-ranking for better performance
Strategy for LM Extraction (1)
Goal
To trade computation for accuracy
Steps:
1. Use a classifier to predict if a query is likely to be a “hit” or a “miss”.
2. Increase the landmark counts of “miss” queries for better accuracy.
(Flowchart: 10-sec query → classifier → “hit”: regular LM extraction; “miss”: dense LM extraction → AFP engine → retrieved songs)
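A hedged sketch of the gating step, using scikit-learn's SVM; the training data, feature layout, and the regular/dense extractors are all placeholders for the pieces described on the next slides:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training set: one feature row (mean volume, std of volume, ...)
# per past query, labeled 1 = "hit", 0 = "miss". Random placeholders here.
train_features = np.random.rand(200, 3)
train_labels = np.random.randint(0, 2, size=200)

clf = SVC(kernel="rbf")
clf.fit(train_features, train_labels)

def extract_for_query(audio, features, regular_extract, dense_extract):
    """Predicted hits get the regular (cheap) landmark extraction;
    predicted misses get the dense (slower, more accurate) one."""
    if clf.predict(features.reshape(1, -1))[0] == 1:    # likely a "hit"
        return regular_extract(audio)                   # e.g. 4 time shifts
    return dense_extract(audio)                         # e.g. 8 time shifts
```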
Strategy for LM Extraction (2)
Classifier construction
Training data: “hit” and “miss” queries
Classifier: SVM
Features
mean volume
standard deviation of volume
standard deviation of the absolute sum of high-order differences
…
Requirements
Fast in evaluation
Simple or readily available features
Efficient classifier
Adaptive
Effective threshold for detecting miss queries
Strategy for LM Extraction (3)
To increase landmarks for “miss” queries
Use more time-shifted queries for LM extraction
Our test uses 4 shifts vs. 8 shifts
Decay the thresholds more rapidly to reveal more salient peaks
…
Strategy for LM Extraction (4)
Song database
44.1 kHz, 16 bits
1500 songs
1000 songs (30 seconds) from the GTZAN dataset
500 songs (3~5 minutes) from our own collection of English/Chinese songs
Datasets
10-sec clips recorded by mobile phones
Training data: 1412 clips (1223:189)
Test data: 1062 clips
Strategy for LM Extraction (5)
(Figure: AFP accuracy vs. computing time)
Confidence Measure (1)
Confusion matrix

                Predicted No | Predicted Yes
Groundtruth No  c00          | c01
Groundtruth Yes c10          | c11

Performance indices
False acceptance rate: FAR = c01 / (c00 + c01)
False rejection rate: FRR = c10 / (c10 + c11)
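The two indices are straightforward to compute from the matrix entries; a small sketch:

```python
def far_frr(c00, c01, c10, c11):
    """False acceptance / rejection rates from the confusion matrix."""
    far = c01 / (c00 + c01)     # groundtruth "no" wrongly accepted
    frr = c10 / (c10 + c11)     # groundtruth "yes" wrongly rejected
    return far, frr
```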
Confidence Measure (2)
Factors for confidence
measure
Matched landmark count
Landmark count
Salient peak count
…
How to use these
factors
Take a value of the factor and use it as a threshold
Normalize the threshold by dividing it by the query duration
Vary the threshold to
identify FAR & FRR
Dataset for Confidence Measure
Song database
44.1 kHz, 16 bits
1500 songs
1000 songs (30 seconds) from the GTZAN dataset
16284 songs (3~5 minutes) from our own collection of English songs
Datasets
10-sec clips recorded by mobile phones
In the database: 1062 clips
Not in the database: 1412 clips
Confidence Measure (3)
DET (Detection Error Tradeoff) curve
Accuracy vs. tolerance (no OOV queries)
Tolerance of matched landmarks vs. accuracy: 79.19%, 79.66%, 79.57%
Incremental Retrieval
Goal
Take additional query input if the confidence
measure is not high enough
Implementation issues
Use only the forward mode for landmark extraction
No. of landmarks ↗ → computation time ↗
Use statistics of matched landmarks to restrict the number of extracted landmarks for comparison
Hash Table Optimization
Possible directions for hash table optimization
To increase song capacity → 20 bits for songId
Song capacity = 2^20 = 1M
Max start time = 2^12/frameRate = 4.37 minutes
Longer songs are split into shorter segments
To increase efficiency → 80-20 rule
Put the 20% most likely songs in fast memory
Put the 80% less likely songs in slow memory
To avoid collisions → better hashing strategies
Re-ranking for Better Performance
Features that can be used to rank the matched
songs
Matched landmark count
Matched frequency count 1
Matched frequency count 2
…
Our AFP Engine
Music database
260k tracks currently
1M tracks in the future
Driving forces
Fundamental issues in
computer science
(hashing, indexing…)
Requests from local
companies
Methods
Landmarks as features
(Shazam’s method)
Speedup by GPU
Platform
Single CPU + 3 GPUs
Specs of Our AFP Engine
Platform
OS: CentOS 6
CPU: Intel Xeon X5670, six cores, 2.93 GHz
Memory: 96 GB
Database
Please refer to this page.
Experiments
Accuracy test
Corpora
Database: 2550 tracks
Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
Accuracy vs. duration
5-sec clips: 161/275=58.6%
10-sec clips: 121/136=89.0%
15-sec clips: 88/90=97.8%
20-sec clips: 65/66=98.5%
Computing time vs. duration
Accuracy vs. computing time
MATLAB Prototype for AFP
Toolboxes
Audio fingerprinting
SAP
Utility
Dataset
Russian songs
Instructions
Download the toolboxes
Modify afpOptSet.m (in the audio fingerprinting toolbox) to add toolbox paths
Run goDemo.m
Demos of Audio Fingerprinting
Commercial apps
Shazam
Soundhound
Our demo
http://mirlab.org/demo/audioFingerprinting
QBSH vs. AFP
QBSH
Goal: MIR
Feature: pitch (perceptible, small data size)
Method: LS (linear scaling)
Database: harder to collect, small storage
Bottleneck: CPU/GPU-bound
AFP
Goal: MIR
Feature: landmarks (not perceptible, big data size)
Method: matched LM count
Database: easier to collect, large storage
Bottleneck: I/O-bound
Conclusions For AFP
Conclusions
Landmark-based methods are effective
Machine learning is indispensable for further improvement
Future work: scale up
Shazam: 15M tracks in database, 6M tags/day
Our goal:
1M tracks with a single PC and GPU
10M tracks with cloud computing on 10 PCs
References (I)
Robust Landmark-Based Audio Fingerprinting, Dan Ellis, http://labrosa.ee.columbia.edu/matlab/fingerprint/
Avery Wang (Shazam)
“An Industrial-Strength Audio Search Algorithm”, ISMIR, 2003.
“The Shazam Music Recognition Service”, Comm. ACM 49(8), 44-48, 2006.
J. Haitsma and T. Kalker (Philips)
“A Highly Robust Audio Fingerprinting System”, ISMIR, 2002.
“A Highly Robust Audio Fingerprinting System with an Efficient Search Strategy”, J. New Music Research 32(2), 211-221, 2003.
References (II)
Google
“Content Fingerprinting Using Wavelets”, S. Baluja and M. Covell, Proc. CVMP, 2006.
“Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications”, V. Chandrasekhar, M. Sharifi, and D. A. Ross, ISMIR, 2011.
“Computer Vision for Music Identification”, Y. Ke, D. Hoiem, and R. Sukthankar, CVPR, 2005.