Redundant Bit Vectors for the Audio Fingerprinting Server

Download Report

Transcript Redundant Bit Vectors for the Audio Fingerprinting Server

Redundant Bit Vectors for the
Audio Fingerprinting Server
John Platt
Jonathan Goldstein
Chris Burges
Structure of Talk
1.
2.
3.
4.
5.
6.
Problem Statement
Problems with Existing Techniques
Bit Vectors
Partitioning the Query Space
Results
Future Extensions
Problem Statement

Find all hyperspheres that overlap a query point
S1
S2
S3
Query is song 2
Query
S4
S5
Data sphere



Centers of spheres = undistorted fingerprint of song
Radius of sphere = acceptable distortion of fingerprint
Sphere overlaps query = query song is same as dB song
Existing Techniques

Linear scan




For each sphere, check if query is inside
Linear effort in size of dB
Previous best known technique
Data partitioning (R-trees, SS-trees, etc)




Store data in a tree of shapes
Shapes chop up space
Descend and backtrack in tree, finding all
leaf nodes that could match query
Performs worse than linear scan!!!
Key New Ideas
New Algorithm: Redundant Bit Vectors (RBV)
Partition the queries, not the data

1.

Avoids hopeless task
Store each data point redundantly
2.

Combats high dimensionality
Use bit vectors to index database
3.

Small & fast representation
Data Structure: Bit Vectors

Bit vectors represent every data point as one bit
1
1
1
1
1
0
1
1
0
0
1
0
1
0
0
1
1
1
1
1
1 & 1 & 0 & 1 → 0
1
1
0
0
1
0
1
1
1
0
1
0
1
1
1
1
Point N 1
Point 1
Most examples
are excluded
from a final
linear scan
0 means exclude this example from linear search
1 means linear search must still look at example
Why Bit Vectors are Good

Small memory footprint


1 bit per example per bit vector
Fast on modern CPUs


1 CPU cycle operates on 32 examples per clock cycle
Compare to Euclidean distance


3 operations/example/dimension
Potential speed up: 96x !!!

we use 1 bit vector per dimension in lookup
Partition the Queries, not the Data
Query
1
0
1
1
1
1
1
0
1
1
0
1
1
1
1
1
bin
1
1
0
1
1
1
1
1
1
1
0
1
0
1
1
0
For each query dimension,
• dimension indexed into bins
• bit vector associated with each bin
• when query falls into bin …
use the corresponding bit vector
AND together bit vectors for some
or all dimensions
 Each dimension trims examples
 Perform linear scan on survivors
How We Compute the RBVs
At index building time, construct the bit vectors:
S1
S2
S3
1
1
0
0
0
0
1
0
1
0
0
0
S4
S5
S6
0
0
1
1
0
1
0
0
0
1
1
1
Project spheres into
each dimension
i th vector, j th bit
= does sphere j overlap
bin i ?
How We Decide on the Bin Edges

Two equivalent heuristics



each bin should have ~ same number of spheres
adjacent bit vectors should be ~ constant Hamming
distance apart
Place bin edges equal num. of sphere edges apart
6
3
S2
S1
S3
9
S4
12
S5
S6
Improve Selectivity by Shrinking Boxes

Fingerprinting with hyperspheres (L2 norm)


Low, but non-zero false negative rate
Bit vectors implement hyperrectangles (L∞ norm)
bit vectors guaranteed
to never introduce false
positives
bit vectors empirically
found to introduce no
extra false positives
We shrink the hyperrectangles to speed up
final linear search
Speed Comparisons




Ran 1000 queries against fingerprint database
Database size = 240K
64 dimensional points
14 bit vector dimensions used



32 bins per dimension


chosen to optimize bit vector + linear scan speed
more dim: bit vectors slow down, linear scan speeds up
chosen to optimize memory/speed tradeoff
Pentium 4, 2.2 GHz
Results
Method
Queries/sec
14-D RBVs + 64-D
limited L2 scan
1490
Factor Slower
Than RBVs
1
64-D full L2 scan
31
48
14-D Hilbert R-trees
10
149
• all linear scans used early bailing
• measured code by itself, not in context of SQL or IIS
Code Details



~600 lines of C++ (pretty simple)
Not integrated with SQL Server or IIS, etc.
Running as part of audio fingerprinting demo
How Fast Is It Really?

You tell us: it depends on several factors



Linear time in size of database
Linear time in amount of resistance to cropping
Sorting by popularity may help substantially
Resistance to Cropping

How much “crop slop” do you need?
location of fingerprint
true
start
of song
largest acceptable
crop to recognize

current system = 5.4 traces / second of slop


need to search
this portion of song
possible to reduce by 2x
server load linear in number of traces
Popularity Sorting May Help
Order database by approximate popularity of music
Split search into different sections
5000 most
popular
50000 next
most popular
first search here
if not found,
search here
if not found,
search here
the rest
May yield
substantial
speed gain
Memory Performance

Bit Vectors can stay resident in memory





For 240K songs
All fingerprints live in 128M
Bit vector indices only require 13M
We can store fingerprints as 2-byte short, save 2x mem.
Bit vector search blows out of cache

speed depends on memory bandwidth of server
Summary


Bit Vectors are Simple, Small, and Fast
Must be used to get good server-side performance