Fast Numerical Computations on Large Integer Databases

Database and Stream Mining
using GPUs
Naga K. Govindaraju
UNC Chapel Hill
Goal
• Utilize graphics processors for fast
computation of common database operations
• Conjunctive selections
• Aggregations
• Semi-linear queries
• Essential components
2
Motivation: Fast operations
• Increasing database sizes
• Faster processor speeds, but little improvement in query execution time
• Memory stalls
• Branch mispredictions
• Resource stalls
• Ref: [Ailamaki99,01] [Boncz99] [Manegold00,02]
[Meki00] [Shatdal94] [Rao99] [Ross02] [Zhou02]……
3
Fast Database Operations
[Figure: system architecture. The CPU (3 GHz) with 2 GB of system memory is connected over a PCI-e bus (4 GB/s) to the GPU (500 MHz) with 256 MB of video memory and 512 MB of AGP memory; other approaches execute on the CPU, ours on the GPU.]
4
                               NVIDIA GeForceFX        NVIDIA GeForceFX        Intel Pentium 4
                               6800 Ultra              5900 Ultra
Memory Bandwidth               35.2 GBps               27.2 GBps               6.4 GBps (DDR2 400 RDRAM)
Clock                          400 MHz                 450 MHz                 3.4 GHz
Peak SIMD Instructions         6 Vertex Ops,           4 Vertex Ops,           4 Float Ops (SSE),
(Vector Ops per Clock)         16 Pixel Ops (Float):   4 Pixel Ops (Float):    2 Double Ops (SSE2):
                               16 vector4 (float)      4 vector4 (float)       1 vector4 (float)
Peak Comparison Ops per Clock  64                      16                      4
5
Graphics Processors: Design Issues
• Relatively low bandwidth to CPU
• Design database operations avoiding frame buffer
readbacks
• No arbitrary writes
• Design algorithms avoiding data rearrangements
• Programmable pipeline has poor branching
• Design algorithms without branching in programmable
pipeline - evaluate branches using fixed function tests
6
Basic DB Operations
Basic SQL query
Select A
From T
Where C
A = attributes or aggregations (SUM, COUNT, MAX, etc.)
T = relational table
C = Boolean combination of predicates (using operators AND, OR, NOT)
7
Database Operations
• Predicates
• ai op constant or ai op aj
• op: <,>,<=,>=,!=, =, TRUE, FALSE
• Boolean combinations
• Conjunctive Normal Form (CNF)
• Aggregations
• COUNT, SUM, MAX, MEDIAN, AVG
8
Data Representation
• Attribute values ai are stored in 2D textures
on the GPU
• A fragment program is used to copy
attributes to the depth buffer
9
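As an illustration of this layout (a sketch, not the paper's code): an attribute column of N values can be packed row by row into a W x H texture and uploaded with standard OpenGL calls; the copy into the depth buffer is then performed by rendering a screen-filling quad with a small fragment program (not shown). The texture format and packing below are illustrative assumptions.

    #include <GL/gl.h>
    #include <vector>

    // Upload one attribute column (values pre-scaled to [0,1]) into a 2D texture.
    // Assumes a valid OpenGL context is current and W*H >= values.size().
    GLuint uploadAttributeTexture(const std::vector<float>& values, int W, int H)
    {
        std::vector<float> texels(W * H, 0.0f);
        for (std::size_t i = 0; i < values.size(); ++i)
            texels[i] = values[i];                // row-major packing: record i -> (i % W, i / W)

        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        // GL_LUMINANCE keeps the sketch simple; a float internal format
        // (e.g., from ARB_texture_float) would preserve full precision in practice.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, W, H, 0,
                     GL_LUMINANCE, GL_FLOAT, texels.data());
        return tex;
    }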
Copy Time to the Depth Buffer
10
Data Representation: Issues
• Floating point and fixed point
representations are different
• Need to define scaling operations
11
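One simple scaling choice, shown here only as an assumption for illustration (the paper may use a different mapping): an affine map from the attribute's known [min, max] range into the [0, 1] depth range.

    // Map a fixed-point attribute value into the [0,1] depth range, and back.
    // attrMin/attrMax are assumed to be known bounds for the attribute.
    float scaleToDepth(int a, int attrMin, int attrMax)
    {
        return (a - attrMin) / float(attrMax - attrMin);
    }

    int depthToAttribute(float d, int attrMin, int attrMax)
    {
        return attrMin + int(d * (attrMax - attrMin) + 0.5f);   // round back to the fixed-point value
    }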
Predicate Evaluation
• ai op constant (d)
• Copy the attribute values ai into depth buffer
• Specify the comparison operation used in the depth
test
• Draw a screen-filling quad at depth d and perform the depth test
12
ai op d
[Figure: a screen-filling quad is drawn at depth d; for each fragment P, if (ai op d) the fragment passes the depth test, else it is rejected.]
13
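A minimal OpenGL sketch of this evaluation, assuming the attribute values are already in the depth buffer (slide 9) and scaled to [0, 1] (slide 11). The helper names are illustrative. Note that OpenGL compares the incoming quad depth d against the stored value ai, so the comparison function is the mirrored form of "ai op d".

    #include <GL/gl.h>

    enum CompareOp { LT, GT, LEQ, GEQ, EQ, NEQ, ALWAYS, NEVER };

    // "ai op d" becomes the mirrored GL depth function, since GL tests (incoming d) FUNC (stored ai).
    static GLenum depthFuncFor(CompareOp op)
    {
        switch (op) {
            case LT:  return GL_GREATER;   // ai <  d  <=>  d >  ai
            case GT:  return GL_LESS;      // ai >  d  <=>  d <  ai
            case LEQ: return GL_GEQUAL;    // ai <= d  <=>  d >= ai
            case GEQ: return GL_LEQUAL;    // ai >= d  <=>  d <= ai
            case EQ:  return GL_EQUAL;
            case NEQ: return GL_NOTEQUAL;
            case ALWAYS: return GL_ALWAYS;
            default:     return GL_NEVER;
        }
    }

    // Evaluate "ai op d" for every record; fragments whose stored depth (= ai) satisfy the test survive.
    // Assumes identity modelview/projection matrices and the default depth range.
    void evaluatePredicate(CompareOp op, float d)
    {
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  // no color output is needed
        glDepthMask(GL_FALSE);                                // keep the stored attribute values intact
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(depthFuncFor(op));

        float z = 2.0f * d - 1.0f;        // NDC z that lands at window depth d
        glBegin(GL_QUADS);                // screen-filling quad at depth d
        glVertex3f(-1.0f, -1.0f, z); glVertex3f(1.0f, -1.0f, z);
        glVertex3f( 1.0f,  1.0f, z); glVertex3f(-1.0f, 1.0f, z);
        glEnd();
    }

The surviving fragments can then be recorded in the stencil buffer (slide 17) or counted with an occlusion query (slide 28).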
Predicate Evaluation
CPU implementation — Intel compiler 7.1 with SIMD optimizations
14
Predicate Evaluation
• ai op aj
• Equivalent to (ai – aj) op 0
• Semi-linear queries
• Defined as a linear combination of attribute values compared against a constant
• The linear combination is computed as a dot product of two vectors: Σi si ai
• Utilize the vector processing capabilities of GPUs
15
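For intuition, this is the per-record arithmetic that one vector4 operation performs on the GPU, written out in plain C++ (names are hypothetical). The resulting scalar is then compared against the constant exactly like the simple predicate above.

    struct float4 { float x, y, z, w; };

    static float dot4(const float4& s, const float4& a)
    {
        return s.x * a.x + s.y * a.y + s.z * a.z + s.w * a.w;
    }

    // Semi-linear predicate on up to four attributes packed into one vector4:
    // sum_i si * ai  op  d
    bool semiLinearPredicate(const float4& s, const float4& a, float d)
    {
        return dot4(s, a) <= d;   // "<=" chosen arbitrarily; any comparison operator works
    }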
Semi-linear Query
16
Boolean Combination
• CNF:
• (A1 AND A2 AND … AND Ak), where Ai = (Bi1 OR Bi2 OR … OR Bi,mi)
• Performed using stencil test recursively
• C1 = (TRUE AND A1) = A1
• Ci = (A1 AND A2 AND … AND Ai) = (Ci-1 AND Ai)
• Different stencil values are used to code the
outcome of Ci
• Positive stencil values — pass predicate evaluation
• Zero — fail predicate evaluation
17
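One possible mapping of this recursion onto the stencil test, consistent with the encoding on these slides (positive stencil passes, zero fails); the exact stencil values and state settings in the paper may differ. It reuses evaluatePredicate() from the earlier sketch and assumes the attribute referenced by each Bij has already been copied into the depth buffer.

    #include <GL/gl.h>
    #include <cstddef>
    #include <vector>

    // CompareOp and evaluatePredicate() come from the predicate-evaluation sketch above.
    enum CompareOp { LT, GT, LEQ, GEQ, EQ, NEQ, ALWAYS, NEVER };
    void evaluatePredicate(CompareOp op, float d);

    struct Predicate { CompareOp op; float constant; };   // one Bij: "attribute op constant"

    static void drawScreenQuad()
    {
        glBegin(GL_QUADS);
        glVertex3f(-1.0f, -1.0f, 0.0f); glVertex3f(1.0f, -1.0f, 0.0f);
        glVertex3f( 1.0f,  1.0f, 0.0f); glVertex3f(-1.0f, 1.0f, 0.0f);
        glEnd();
    }

    // Evaluate Ci = C(i-1) AND Ai, where Ai = (Bi1 OR Bi2 OR ... OR Bi,mi).
    // Precondition: stencil == 1 exactly where C(i-1) holds, 0 elsewhere (C0 = TRUE everywhere).
    // Postcondition: stencil == 1 exactly where Ci holds, 0 elsewhere.
    void andClause(const std::vector<Predicate>& clause)
    {
        glEnable(GL_STENCIL_TEST);

        // For each Bij: where C(i-1) holds (stencil 1) and Bij passes the depth test, bump stencil to 2.
        glStencilFunc(GL_EQUAL, 1, ~0u);
        glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);
        for (std::size_t j = 0; j < clause.size(); ++j)
            evaluatePredicate(clause[j].op, clause[j].constant);   // attribute of Bij must be in the depth buffer

        // Pixels still at stencil 1 satisfy C(i-1) but none of the Bij: zero them out (fail).
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_ALWAYS);
        glStencilFunc(GL_EQUAL, 1, ~0u);
        glStencilOp(GL_KEEP, GL_KEEP, GL_ZERO);
        drawScreenQuad();

        // Remap stencil 2 back to 1 so the next clause sees the same encoding.
        glStencilFunc(GL_EQUAL, 2, ~0u);
        glStencilOp(GL_KEEP, GL_KEEP, GL_DECR);
        drawScreenQuad();
    }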
A1 AND A2
A2 = (B21 OR B22 OR B23)
[Figure: screen regions satisfying A1, B21, B22, and B23.]
18
A1 AND A2
[Figure: stencil value 1 over the screen; the region satisfying A1 is shown.]
19
A1 AND A2
TRUE AND A1
[Figure: stencil value 1 inside the region satisfying A1, stencil value 0 outside.]
20
A1 AND A2
[Figure: within the A1 region (stencil = 1), stencil is set to 2 where B21, B22, or B23 passes; stencil = 0 outside A1.]
21
A1 AND A2
[Figure: final result. Stencil = 2 in the regions satisfying A1 AND B21, A1 AND B22, and A1 AND B23; stencil = 0 elsewhere.]
23
Multi-Attribute Query
24
Range Query
• Compute ai within [low, high]
• Evaluated as ( ai >= low ) AND ( ai <= high )
• Use NVIDIA depth bounds test to evaluate
both conditionals in a single clock cycle
25
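A sketch of the range query using the EXT_depth_bounds_test extension, assuming the attribute is already in the depth buffer and low/high have been scaled into [0, 1]. The depth bounds test discards fragments whose stored depth lies outside [zmin, zmax], so one quad evaluates both conditionals.

    #include <GL/gl.h>
    #include <GL/glext.h>   // GL_DEPTH_BOUNDS_TEST_EXT; glDepthBoundsEXT is loaded via the extension mechanism

    // Select records with low <= ai <= high in one pass (low/high scaled to [0,1]).
    void rangeQuery(float low, float high)
    {
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_FALSE);

        glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
        glDepthBoundsEXT(low, high);        // fragments whose stored depth is outside [low, high] are discarded

        glDisable(GL_DEPTH_TEST);           // the bounds test alone performs the comparison here
        glBegin(GL_QUADS);                  // screen-filling quad; its own depth value is irrelevant
        glVertex3f(-1.0f, -1.0f, 0.0f); glVertex3f(1.0f, -1.0f, 0.0f);
        glVertex3f( 1.0f,  1.0f, 0.0f); glVertex3f(-1.0f, 1.0f, 0.0f);
        glEnd();

        glDisable(GL_DEPTH_BOUNDS_TEST_EXT);
    }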
Range Query
26
Aggregations
• COUNT, MAX, MIN, SUM, AVG
27
COUNT
• Use occlusion queries to get the number of pixels passing the tests
Syntax:
• Begin occlusion query
• Perform database operation
• End occlusion query
• Get the count of attributes that passed the database operation
• Involves no additional overhead!
• Efficient selectivity computation
28
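A sketch of COUNT with a standard OpenGL 1.5 occlusion query wrapped around any of the selections above; rangeQuery() here stands in for the database operation being counted.

    #include <GL/gl.h>

    void rangeQuery(float low, float high);   // from the range-query sketch above

    // Count the records that pass a selection by counting the fragments it lets through.
    GLuint countSelection(float low, float high)
    {
        GLuint query = 0, count = 0;
        glGenQueries(1, &query);

        glBeginQuery(GL_SAMPLES_PASSED, query);   // begin occlusion query
        rangeQuery(low, high);                    // perform the database operation
        glEndQuery(GL_SAMPLES_PASSED);            // end occlusion query

        // Retrieve the number of fragments that passed; this waits until the result is ready.
        glGetQueryObjectuiv(query, GL_QUERY_RESULT, &count);
        glDeleteQueries(1, &query);
        return count;
    }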
MAX, MIN, MEDIAN
• Kth-largest number
• Traditional algorithms require data
rearrangements
• We perform
• no data rearrangements
• no frame buffer readbacks
29
K-th Largest Number
• Let vk denote the k-th largest number
• How do we generate a number m equal to
vk?
• Without knowing vk’s value
• Using occlusion queries to obtain the number of values ≥ some given value
• Starting from the most significant bit, determine the value of one bit at a time
30
K-th Largest Number
• Given a set S of values
• c(m): the number of values ≥ m
• vk: the k-th largest number
• We have
• If c(m) > k-1, then m ≤ vk
• If c(m) ≤ k-1, then m > vk
31
2nd Largest in 9 Values
Values: 0011, 1011, 1101, 0111, 0101, 0001, 0111, 1010, 0010
m = 0000
v2 = 1011
32
Draw a Quad at Depth 8
Compute c(1000)
m = 1000
v2 = 1011
33
1st bit = 1
m = 1000
v2 = 1011
c(m) = 3
34
Draw a Quad at Depth 12
Compute c(1100)
m = 1100
v2 = 1011
35
2nd bit = 0
m = 1100
v2 = 1011
c(m) = 1
36
Draw a Quad at Depth 10
Compute c(1010)
m = 1010
v2 = 1011
37
3rd bit = 1
m = 1010
v2 = 1011
c(m) = 3
38
Draw a Quad at Depth 11
Compute c(1011)
m = 1011
v2 = 1011
39
4th bit = 1
m = 1011
v2 = 1011
c(m) = 2
40
Our algorithm
• Initialize m to 0
• Starting with the MSB, scan all bits down to the LSB
• At each bit, set the corresponding bit position of m to 1
• If c(m) ≤ k-1, reset that bit to 0
• Proceed to the next bit
41
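A CPU-side sketch of this bit-by-bit construction; on the GPU, count_ge(m) is exactly c(m), obtained by drawing a quad at depth m with a >= depth test and counting the passing fragments with an occlusion query. The helper names are illustrative.

    #include <cstddef>
    #include <vector>

    // c(m): number of values >= m (on the GPU, an occlusion-query count).
    static std::size_t count_ge(const std::vector<unsigned>& values, unsigned m)
    {
        std::size_t c = 0;
        for (std::size_t i = 0; i < values.size(); ++i)
            if (values[i] >= m) ++c;
        return c;
    }

    // Construct m = k-th largest value, one bit at a time from MSB to LSB.
    unsigned kthLargest(const std::vector<unsigned>& values, std::size_t k, int numBits)
    {
        unsigned m = 0;
        for (int bit = numBits - 1; bit >= 0; --bit) {
            m |= (1u << bit);                      // tentatively set this bit of m
            if (count_ge(values, m) <= k - 1)      // too few values >= m: the bit must be 0
                m &= ~(1u << bit);
        }
        return m;
    }

    // Example from the slides: with the nine 4-bit values of slide 32 and k = 2,
    // kthLargest(values, 2, 4) returns 1011 (binary).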
Kth-Largest
42
Median
43
Top K Frequencies
• Given n values in the frame buffer, compute the top k frequencies without performing data rearrangements, using comparisons
44
Accumulator, Mean
• Possible algorithms
• Use fragment programs – requires very few
renderings
• Use mipmaps [Harris et al. 02], fragment programs
[Coombe et al. 03]
• Issue: overflow in floating point values
• Our approach: bit-based algorithm
• Mean is computed using the accumulator, followed by a division by n
45
Accumulator
• Data representation is of the form
2^k ak + 2^(k-1) ak-1 + … + a0
Sum = 2^k Σ ak + 2^(k-1) Σ ak-1 + … + Σ a0
Σ ai = number of values whose i-th bit is 1
Current GPUs support no bit-masking operations
46
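The decomposition above, written out as a CPU-side sketch: the sum is assembled from per-bit counts, where on the GPU each per-bit count is one TestBit pass (next slide) counted with an occlusion query. Names are illustrative.

    #include <cstdint>
    #include <vector>

    // Number of values whose i-th bit (i = 0 for the LSB) is 1.
    // On the GPU this is one TestBit pass counted with an occlusion query.
    static std::uint64_t count_of_bit(const std::vector<unsigned>& values, int i)
    {
        std::uint64_t c = 0;
        for (std::size_t j = 0; j < values.size(); ++j)
            if ((values[j] >> i) & 1u) ++c;
        return c;
    }

    // Sum = sum_i 2^i * (number of values with the i-th bit set).
    std::uint64_t accumulate(const std::vector<unsigned>& values, int numBits)
    {
        std::uint64_t sum = 0;
        for (int i = 0; i < numBits; ++i)
            sum += (std::uint64_t(1) << i) * count_of_bit(values, i);
        return sum;
    }

The mean then follows by dividing the sum by n, as noted on the previous slide.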
TestBit
• Read the data value from the texture, say ai
• Compute F = frac(ai / 2^k)
• If F >= 0.5, then the k-th bit of ai is 1
• Set F as the alpha value; the alpha test passes a fragment if its alpha value >= 0.5
47
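The same test written in plain C++, following the slide's convention that bits are counted from 1 at the least significant position (so bit k has weight 2^(k-1)); on the GPU, F is routed to the alpha channel and glAlphaFunc(GL_GEQUAL, 0.5) performs the comparison.

    #include <cmath>

    // True if the k-th bit of a is 1, with k = 1 denoting the least significant bit.
    // Mirrors the fragment-program test: frac(a / 2^k) >= 0.5.
    bool testBit(float a, int k)
    {
        float scaled = a / std::pow(2.0f, float(k));
        float frac = scaled - std::floor(scaled);
        return frac >= 0.5f;      // on the GPU: write frac to alpha, alpha test passes if >= 0.5
    }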
Accumulator
48
Stream Mining
• Streams are continuous sequences of data values arriving at a port
• Common examples include networking data, stock market and financial data, and data collected from sensors
• Goal: efficiently approximate order statistics such as frequencies and quantiles on data streams
• Exact computations require infinite memory
49
Issues
• Data streaming applications have real-time processing requirements
• Applications also require a small or limited memory footprint
50
Issues
• Efficient CPU algorithms perform histogram computations and are either
• Compute-limited, and therefore cannot process data faster than its arrival rate, or
• Memory-limited, and therefore use memory hierarchies on disk and are slow. Alternatively, load-shedding algorithms that drop excess items are also used
51
Histogram Computation
• Efficient sorting is fundamental to histogram computation
• Our new sorting network algorithm uses the texture mapping and blending functionality of GPUs to perform fast sorting
• The comparator mapping is performed using texture mapping
• The conditional assignments (MIN and MAX) are implemented using blending
• Maps efficiently to rasterization and is fast!
52
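This is not the paper's sorting network, but a sketch of the building block it rasterizes: a compare-exchange pass, shown here as CPU-side odd-even transposition. On the GPU, the comparator partner is selected via texture mapping and the MIN/MAX writes are performed by the blending unit.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One compare-exchange pass of an odd-even transposition network (illustrative only).
    // phase = 0 compares pairs (0,1),(2,3),...; phase = 1 compares (1,2),(3,4),...
    void compareExchangePass(std::vector<float>& a, int phase)
    {
        for (std::size_t i = phase; i + 1 < a.size(); i += 2) {
            float lo = std::min(a[i], a[i + 1]);   // on the GPU: MIN blending
            float hi = std::max(a[i], a[i + 1]);   // on the GPU: MAX blending
            a[i] = lo;
            a[i + 1] = hi;
        }
    }

    // Running n passes with alternating phase sorts the array (O(n^2) work; the paper's
    // network is more efficient, but each pass has this comparator structure).
    void oddEvenTranspositionSort(std::vector<float>& a)
    {
        for (std::size_t pass = 0; pass < a.size(); ++pass)
            compareExchangePass(a, int(pass % 2));
    }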
Further details
Fast and Approximate Stream Mining of
Quantiles and Frequencies Using Graphics
Processors
Naga K. Govindaraju, Nikunj Raghuvanshi,
Dinesh Manocha
Proc. of ACM SIGMOD 2005
53
Advantages
• Algorithms progress at GPU growth rate
• Offload CPU work
• Streaming processor parallel to CPU
• Fast
• Massive parallelism on GPUs
• High memory bandwidth
• No branch mispredictions
• Commodity hardware!
54
Conclusions
• Novel algorithms to perform
database operations on GPUs
• Evaluation of predicates, boolean
combinations of predicates, aggregations
• Algorithms take into account GPU
limitations
• No data rearrangements
• No frame buffer readbacks
55
Conclusions
• Algorithms map well to rasterization and
GPUs
• Preliminary comparisons with optimized CPU implementations are promising
• GPU as a useful co-processor
56
Future Work
• Improve performance of many of our
algorithms
• More database operations such as join,
sorting, classification and clustering.
• Queries on spatial and temporal databases
57