Database Operation On GPU - University of North Carolina

Download Report

Transcript Database Operation On GPU - University of North Carolina

Database Operations on GPU
Changchang Wu
4/18/2007
Outline
• Database Operations on GPU
• Point List Generation on GPU
• Nearest Neighbor Searching on GPU
Database Operations on GPU
Design Issues
• Low bandwidth between GPU and CPU
• Avoid frame buffer readbacks
• No arbitrary writes
• Avoid data rearrangements
• Programmable pipeline has poor branching
• Evaluate branches using fixed function tests
Design Overview
• Use depth test functionality of GPUs for
performing comparisons
• Implements all possible comparisons <, <=, >=, >,
==, !=, ALWAYS, NEVER
• Use stencil test for data validation and storing
results of comparison operations
• Use occlusion query to count number of
elements that satisfy some condition
Basic Operations
Basic SQL query
Select A
From T
Where C
A= attributes or aggregations (SUM, COUNT, MAX etc)
T=relational table
C= Boolean Combination of Predicates (using operators
AND, OR, NOT)
Basic Operations
• Predicates – ai op constant or
ai op aj
• Op is one of <,>,<=,>=,!=, =, TRUE, FALSE
• Boolean combinations – Conjunctive Normal
Form (CNF) expression evaluation
• Aggregations – COUNT, SUM, MAX, MEDIAN,
AVG
Predicate Evaluation
• ai op constant (d)
• Copy the attribute values ai into depth buffer
• Define the comparison operation using depth test
• Draw a screen filling quad at depth d
glDepthFunc(…)
glStencilOp(fail, zfail, zpass );
Predicate Evaluation
• Comparing two attributes:
• ai op aj is treated as (ai – aj) op 0
• Semi-linear queries
• Easy to compute with fragment shader
Boolean Combinations
• Expression provided as a CNF
• CNF is of form (A1 AND A2 AND … AND Ak)
where Ai = (Bi1 OR Bi2 OR … OR Bimi )
• CNF does not have NOT operator
• If CNF has a NOT operator, invert comparison operation to
eliminate NOT
Eg. NOT (ai < d) => (ai >= d)
• For example, compute ai within [low, high]
• Evaluated as ( ai >= low ) AND ( ai <= high )
Algorithm
Range Query
• Compute ai within [low, high]
• Evaluated as ( ai >= low ) AND ( ai <= high )
Aggregations
• COUNT, MAX, MIN, SUM, AVG
• No data rearrangements
COUNT
• Use occlusion queries to get pixel pass
count
• Syntax:
• Begin occlusion query
• Perform database operation
• End occlusion query
• Get count of number of attributes that
passed database operation
• Involves no additional overhead!
MAX, MIN, MEDIAN
• We compute Kth-largest number
• Traditional algorithms require data
rearrangements
• We perform no data rearrangements, no
frame buffer readbacks
K-th Largest Number
• By comparing and counting, determinate
every bit in order of MSB to LSB
Example: Parallel Max
• S={10,24,37,99,192,200,200,232}
• Step 1: Draw Quad at 128(10000000)
• S = {10,24,37,99,192,200,200,232}
• Step 2: Draw Quad at 192(11000000)
• S = {10,24,37,192,200,200,232}
• Step 3: Draw Quad at 224(11100000)
• S = {10,24,37,192,200,200,232}
• Step 4: Draw Quad at 240(11110000)
•
– No values pass
• Step 5: Draw Quad at 232(11101000)
• S = {10,24,37,192,200,200,232}
• Step 6,7,8: Draw Quads at 236,234,233 – No
values pass, Max is 232
Accumulator, Mean
• Accumulator - Use sorting algorithm and add
all the values
• Mean – Use accumulator and divide by n
• Interval range arithmetic
• Alternative algorithm
• Use fragment programs – requires very few
renderings
• Use mipmaps [Harris et al. 02], fragment programs
[Coombe et al. 03]
Accumulator
• Data representation is of form
ak 2k + ak-1 2k-1 + … + a0
Sum = sum(ak) 2k+ sum(ak-1) 2k-1+…+sum(a0)
Current GPUs support no bit-masking
operations
The Algorithm
>=0.5 means i-th bit is 1
Implementation
• Algorithm
• CPU – Intel compiler 7.1 with hyper-threading,
multi-threading, SIMD optimizations
• GPU – NVIDIA Cg Compiler
• Hardware
• Dell Precision Workstation with Dual 2.8GHz Xeon
Processor
• NVIDIA GeForce FX 5900 Ultra GPU
• 2GB RAM
Benchmarks
• TCP/IP database with 1 million records
and four attributes
• Census database with 360K records
Copy Time
Predicate Evaluation
Range Query
Multi-Attribute Query
Semi-linear Query
Kth-Largest
Kth-Largest
Kth-Largest conditional
Accumulator
Analysis: Issues
• Precision
• Copy time
• Integer arithmetic
• Depth compare masking
• Memory management
• No Branching
• No random writes
Analysis: Performance
• Relative Performance Gain
• High Performance – Predicate evaluation,
multi-attribute queries, semi-linear queries,
count
• Medium Performance – Kth-largest number
• Low Performance - Accumulator
High Performance
• Parallel pixel processing engines
• Pipelining
• Early Z-cull
• Eliminate branch mispredictions
Medium Performance
• Parallelism
• FX 5900 has clock speed 450MHz, 8 pixel
processing engines
• Rendering single 1000x1000 quad takes
0.278ms
• Rendering 19 such quads take 5.28ms.
Observed time is 6.6ms
• 80% efficiency in parallelism!!
Low Performance
• No gain over SIMD based CPU
implementation
• Two main reasons:
• Lack of integer-arithmetic
• Clock rate
Advantages
• Algorithms progress at GPU growth rate
• Offload CPU work
• Fast due to massive parallelism on
GPUs
• Algorithms could be generalized to any
geometric shape
• Eg. Max value within a triangular region
• Commodity hardware!
GPU Point List Generation
• Data compaction
Overall task
3D to 2D mapping
Current Problem
The solution
Overview, Data Compaction
Algorithm: Discriminator
Algorithm: Histogram Builder
Histogram Output
Algorithm: PointList Builder
PointList Output
Timing
Reduces a highly sparse matrix with N
elements to a list of its M active entries
in O(N) + M (log N) steps,
Applications
• Image Analysis
• Feature Detection
• Volume Analysis
• Sparse Matrix Generation
Searching
• 1D Binary Search
• Nearest Neighbor Search for High
dimension space
• K-NN Search
Binary Search
• Find a specific element in an ordered list
• Implement just like CPU algorithm
• Assuming hardware supports long enough shaders
• Finds the first element of a given value v
• If v does not exist, find next smallest element > v
• Search algorithm is sequential, but many
searches can be executed in parallel
• Number of pixels drawn determines number of
searches executed in parallel
• 1 pixel == 1 search
Binary Search
• Search for v0
Initialize
Search starts at center of
sorted array
4
v2 >= v0 so search left half
of sub-array
Sorted List
v0
0
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0
Initialize
4
Step 1
2
Sorted List
v0
0
v0 >= v0 so search left half
of sub-array
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0
Initialize
4
Step 1
2
Step 2
1
Sorted List
v0
0
v0 >= v0 so search left half
of sub-array
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0
Initialize
4
Step 1
2
Step 2
1
Step 3
0
Sorted List
v0
0
At this point, we either
have found v0 or are 1
element too far left
One last step to resolve
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0
Initialize
4
Step 1
2
Step 2
1
Step 3
0
Step 4
0
Sorted List
v0
0
Done!
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0 and v2
Initialize
4
Search starts at center of
sorted array
4
Both searches proceed to
the left half of the array
Sorted List
v0
0
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0 and v2
Initialize
4
4
Step 1
2
2
Sorted List
v0
0
The search for v0
continues as before
The search for v2
overshot, so go back to
the right
v0
1
v0
2
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0 and v2
Initialize
4
4
Step 1
2
2
Step 2
1
3
Sorted List
v0
0
v0
1
v0
2
We’ve found the proper
v2, but are still looking for
v0
Both searches continue
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0 and v2
Initialize
4
4
Step 1
2
2
Step 2
1
3
Step 3
0
2
Sorted List
v0
0
v0
1
v0
2
Now, we’ve found the
proper v0, but overshot v2
The cleanup step takes
care of this
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search
• Search for v0 and v2
Initialize
4
4
Step 1
2
2
Step 2
1
3
Step 3
0
2
Step 4
0
3
Sorted List
v0
0
v0
1
v0
2
Done! Both v0 and v2 are
located properly
v2
3
v2
4
v2
5
v5
6
v5
7
Binary Search Summary
• Single rendering pass
• Each pixel drawn performs independent
search
• O(log n) steps
Nearest Neighbor Search
• Very fundamental step in similarity
search of data mining, retrieval…
• Curse of dimensionality,
• When dimensionality is very high, structures
like k-d tree does not help
• Use GPU to improve linear scan
Distances
• N-norm distance
• Cosine distance acos(dot(x,y))
Data Representation
• Use separate textures to store different
dimensions.
Distance Computation
• Accumulating distance component of
different dimensions
Reduction in RGBA
Reduction to find NN
Results
Results
K-Nearest Neighbor Search
• Given a sample point p, find the k points
nearest p within a data set
• On the CPU, this is easily done with a heap or
priority queue
• Can add or reject neighbors as search progresses
• Don’t know how to build one efficiently on GPU
• kNN-grid
• Can only add neighbors…
kNN-grid Algorithm
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
• Candidate neighbors
must be within max
search radius
• Visit voxels in order
of distance to sample
point
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
1
sample point
candidate neighbor
neighbors found
Want 4 neighbors
• If current number of
neighbors found is
less than the number
requested, grow
search radius
kNN-grid Algorithm
2
sample point
candidate neighbor
neighbors found
Want 4 neighbors
• If current number of
neighbors found is
less than the number
requested, grow
search radius
kNN-grid Algorithm
2
sample point
candidate neighbor
neighbors found
Want 4 neighbors
• Don’t add neighbors
outside maximum
search radius
• Don’t grow search
radius when neighbor
is outside maximum
radius
kNN-grid Algorithm
• Add neighbors within
search radius
3
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
• Add neighbors within
search radius
4
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
• Don’t expand search
radius if enough
neighbors already
found
4
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
• Add neighbors within
search radius
5
sample point
candidate neighbor
neighbors found
Want 4 neighbors
kNN-grid Algorithm
6
sample point
candidate neighbor
neighbors found
Want 4 neighbors
• Visit all other voxels
accessible within
determined search
radius
• Add neighbors within
search radius
kNN-grid Summary
6
sample point
candidate neighbor
neighbors found
Want 4 neighbors
• Finds all neighbors
within a sphere
centered about
sample point
• May locate more
than requested knearest neighbors
References
• Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin and
Dinesh Manocha, Fast Computation of Database Operations
using Graphics Processors
http://www.gpgpu.org/s2004/slides/govindaraju.DatabaseOperations.ppt
• Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel
Keim, A Graphic Hardware Accelerated Algorithm for Nearest
Neighbor Search
• Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel,
GPU Point List Generation through Histogram Pyramids
http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/
• Tim Purcell, Sorting and Searching
http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.
ppt