Transcript Talk
Fast Computation of Database
Operations using Graphics
Processors
Naga K. Govindaraju
Univ. of North Carolina
Modified By,
Mahendra Chavan for CS632
Goal
• Utilize graphics processors for fast computation of
common database operations
Motivation: Fast operations
• Increasing database sizes
• Faster processor speeds but low improvement in
query execution time
– Memory stalls
– Branch mispredictions
– Resource stalls Eg. Instruction dependency
• Utilize the available architectural features and
exploit parallel execution possibilities
Graphics Processors
• Present in almost every PC
• Have multiple vertex and pixel processing engines
running parallel
• Can process tens of millions of geometric primitives
per second
• Peak Perf. Of GPU is increasing at the rate of 2.5-3
times a year!
• Programmable- fragment programs – executed on
pixel processing engines
Main Contributions
• Algorithms for predicates, boolean combinations and
aggregations
• Utilize SIMD capabilities of pixel processing engines
• They have used these algorithms for selection
queries on one or more attributes and aggregate
queries
Related Work
• Hardware Acceleration for DB operations
– Vector processors for relational DB operations
[Meki and Kambayashi 2000]
– SIMD instructions for relational DB operations
[ Zhou and Ross 2002]
– GPUs for spatial selections and joins [Sun et al. 2003]
Graphics Processors:
Design Issues
• Programming model is limited due to lack of
random access writes
– Design algorithms avoiding data rearrangements
• Programmable pipeline has poor branching
– Design algorithms without branching in programmable
pipeline - evaluate branches using fixed function tests
Frame Buffer
• Pixels stored on graphics card in a frame buffer.
• Frame buffer conceptually divided into:
• Color Buffer
– Stores color component of each pixel in the frame buffer
• Depth Buffer
– Stores depth value associated with each pixel. The depth is
used to determine surface visibility
• Stencil Buffer
– Stores stencil value for each pixel . Called Stencil because, it
is typically used for enabling/disabling writes to frame buffer
Graphics Pipeline
Vertices
Vertex
Processing
Engine
Setup
Engine
Pixel
processing
Engine
Alpha Test
Stencil Test
Depth Test
Graphics Pipeline
• Vertex Processing Engine
– Transforms vertices to points on screen
• Setup Engine
– Generates Info. For color, depth etc. associated with primitive
vertices
• Pixel processing Engines
– Fragment processors, performs a series of tests before
writing the fragments to frame buffer
Pixel processing Engines
• Alpha Test
– Compares fragments alpha value to user-specified reference
value
• Stencil Test
– Compares fragments’ pixel’s stencil value to user-specified
reference value
• Depth Test
– Compares depth value of the fragment to the reference depth
value.
Operators
• =
• <
• >
• <=
• >=
• Never
• Always
Fragment Programs
• Users can supply custom fragment programs on
each fragment
Occlusion Query
•Gives no. of fragments that pass different no. of tests
Radeon R770 GPU by AMD Graphics Product Group
Data Representation on GPUs
• Textures – 2 D arrays- may have multiple channels
• We store data in textures in floating point formats
• To perform computations on the values, render the
quadrilateral, generate fragments, run fragment
programs and perform tests!
Stencil Tests
• Fragments failing Stencil test are rejected from the
rasterization pipeline
• Stencil Operations
–
–
–
–
–
–
KEEP: keep the stencil value in the stencil buffer
INCR: stencil value ++
DECR: stencil value –
ZERO: stencil value = 0
REPLACE: stencil value = reference value
INVERT: bitwise invert (stencil value)
Stencil and Depth Tests
• We can setup the stencilOP routine as below
• For each fragment , three possible outcomes, based
on the outcome, corresponding stencil op. is
executed
• Op1: when a fragment fails stencil test
• Op2: when a fragment passes stencil test but fails
depth test
• Op3: when a fragment passes stencil and depth test
Outline
• Database Operations on GPUs
• Implementation & Results
• Analysis
• Conclusions
Outline
• Database Operations on GPUs
• Implementation & Results
• Analysis
• Conclusions
Overview
• Database operations require comparisons
• Utilize depth test functionality of GPUs for
performing comparisons
– Implements all possible comparisons <, <=, >=, >, ==, !=,
ALWAYS, NEVER
• Utilize stencil test for data validation and storing
results of comparison operations
Basic Operations
Basic SQL query
Select A
From T
Where C
A= attributes or aggregations (SUM, COUNT, MAX etc)
T=relational table
C= Boolean Combination of Predicates (using operators AND,
OR, NOT)
Outline: Database Operations
• Predicate Evaluation
– (a op constant) – depth test and stencil test
– (a op b) = (a-b op 0 ) – can be executed on GPUs
• Boolean Combinations of Predicates
– Express as CNF and repetitively use stencil tests
• Aggregations
– Occlusion queries
Outline: Database Operations
• Predicate Evaluation
• Boolean Combinations of Predicates
• Aggregations
Basic Operations
• Predicates – ai op constant or
ai op aj
– Op is one of <,>,<=,>=,!=, =, TRUE, FALSE
• Boolean combinations – Conjunctive Normal Form
(CNF) expression evaluation
• Aggregations – COUNT, SUM, MAX, MEDIAN, AVG
Predicate Evaluation
• ai op constant (d)
– Copy the attribute values ai into depth buffer
– Define the comparison operation using depth test
– Draw a screen filling quad at depth d
ai op d
If ( ai op d )
pass fragment
P
Else
reject fragment
Screen
d
Predicate Evaluation
• ai op aj
– Treat as (ai – aj) op 0
• Semi-linear queries
– Defined as linear combination of attribute values compared
against a constant
– Linear combination is computed as a dot product of two
vectors
– Utilize the vector processing capabilities of GPUs
Data Validation
• Performed using stencil test
• Valid stencil values are set to a given value “s”
• Data values that fail predicate evaluation are set to
“zero”
Outline: Database Operations
• Predicate Evaluation
• Boolean Combinations of Predicates
• Aggregations
Boolean Combinations
• Expression provided as a CNF
• CNF is of form
(A1 AND A2 AND … AND Ak)
where Ai = (Bi1 OR Bi2 OR … OR Bimi )
• CNF does not have NOT operator
– If CNF has a NOT operator, invert comparison operation to
eliminate NOT
Eg. NOT (ai < d) => (ai >= d)
Boolean Combination
• We will focus on (A1 AND A2)
• All cases are considered
– A1 = (TRUE AND A1)
– If Ei = (A1 AND A2 AND … AND Ai-1 AND Ai),
Ei = (Ei-1 AND Ai)
•
Clear stencil value to 1
•
For each Ai , i=1,….,k
•
do
– if (mod(I,2)) /* Valid stencil value is 1 */
• Stencil test to pass if stencil value is equal to 1
• StencilOp (KEEP,KEPP, INCR)
– Else
• Stencil test to pass if stencil value is equal to 2
• StencilOp (KEEP,KEPP, DECR)
– Endif
– For each Bij, j=1,…..,mi
– Do
• Perform Bij using COMPARE /* depth test */
– End for
– If (mod(I,2)) /* valid stencil value is 2 */
• If stencil value on screen is 1 , REPLACE with 0
– Else /* valid stencil value is 1 */
• If stencil value on screen is 2, REPLACE with 0
– Endif
•
End For
A1 AND A2
B23
A1
B22
B21
A1 AND A2
Stencil value = 1
A1 AND A2
Stencil value = 1
A1
A1 AND A2
Stencil value = 0
Stencil value = 2
A1
A1 AND A2
St = 0
St=2
St=0
St=1
B22
A1
St=1
B23
St=1
B21
A1 AND A2
Stencil = 0
St = 0
St=1
B22
A1
St=1
B23
St=1
B21
A1 AND A2
St = 0
St = 1
A1 AND B22
St=1
A1 AND B23
St=1
A1 AND B21
Range Query
• Compute ai within [low, high]
– Evaluated as ( ai >= low ) AND ( ai <= high )
Outline: Database Operations
• Predicate Evaluation
• Boolean Combinations of Predicates
• Aggregations
Aggregations
• COUNT, MAX, MIN, SUM, AVG
• No data rearrangements
COUNT
• Use occlusion queries to get pixel pass count
• Syntax:
–
–
–
–
Begin occlusion query
Perform database operation
End occlusion query
Get count of number of attributes that passed database
operation
• Involves no additional overhead!
MAX, MIN, MEDIAN
• We compute Kth-largest number
• Traditional algorithms require data rearrangements
• We perform no data rearrangements, no frame
buffer readbacks
K-th Largest Number
• Say vk is the k-th largest number
• How do we generate a number m equal to vk?
– Without knowing vk’s bit-representation and using
comparisons
Our algorithm
• b_max = max. no. of bits in the values in tex
• x=0
• For i= b_max-1 down to 0
– Count = Compare (text >= x + 2^i)
– If Count > k-1
• x=x+2^i
• Return x
K-th Largest Number
• Lemma: Let vk be the k-th largest number. Let count
be the number of values >= m
– If count > (k-1): m<= vk
– If count <= (k-1): m>vk
• Apply the earlier algorithm ensuring that count >(k1)
Example
• Vk = 11101001
• M = 00000000
Example
• Vk = 11101001
• M = 10000000
• M <= Vk
Example
• Vk = 11101001
• M = 11000000
• M <= Vk
Example
• Vk = 11101001
• M = 11100000
• M <= Vk
Example
• Vk = 11101001
• M = 11110000
• M > Vk
Make the bit 0
M = 11100000
Example
• Vk = 11101001
• M = 11101000
• M <= Vk
Example
• Vk = 11101001
• M = 11101100
• M > Vk
• Make this bit 0
• M = 11101000
Example
• Vk = 11101001
• M = 11101010
• M > Vk
• M = 11101000
Example
• Vk = 11101001
• M = 11101001
• M <= Vk
Example
• Integers ranging from 0 to 255
• Represent them in depth buffer
– Idea – Use depth functions to perform comparisons
– Use NV_occlusion_query to determine maximum
Example: Parallel Max
• S={10,24,37,99,192,200,200,232}
• Step 1: Draw Quad at 128
– S = {10,24,37,99,192,200,200,232}
• Step 2: Draw Quad at 192
– S = {10,24,37,192,200,200,232}
• Step 3: Draw Quad at 224
– S = {10,24,37,192,200,200,232}
• Step 4: Draw Quad at 240 – No values pass
• Step 5: Draw Quad at 232
– S = {10,24,37,192,200,200,232}
• Step 6,7,8: Draw Quads at 236,234,233 – No values
pass
• Max is 232
SUM and AVG
• Mipmaps – multi resolution textures consisting of multiple
levels
• Highest level contains average of all values at lowest level
• SUM = AVG * COUNT
• Problems with mipmaps
– If we want sum of a subset of values then we have to introduce
conditions in the fragment programs
– Floating point representations may have problems
Accumulator
• Data representation is of form
• ak 2k + ak-1 2k-1 + … + a0
Sum = sum(ak) 2k+ sum(ak-1) 2k-1+…+sum(a0)
Current GPUs support no bit-masking operations
AVG = SUM/COUNT
TestBit
• Read the data value from texture, say ai
• F= frac(ai/2k+1)
• If F>=0.5, then k-th bit of ai is 1
• Set F to alpha value. Alpha test passes a fragment if
alpha value>=0.5
Outline
• Database Operations on GPUs
• Implementation & Results
• Analysis
• Conclusions
Implementation
• Dell Precision Workstation with Dual 2.8GHz Xeon
Processor
• NVIDIA GeForce FX 5900 Ultra GPU
• 2GB RAM
Implementation
• CPU – Intel compiler 7.1 with hyperthreading, multithreading, SIMD optimizations
• GPU – NVIDIA Cg Compiler
Benchmarks
• TCP/IP database with 1 million records and four
attributes
• Census database with 360K records
Copy Time
Predicate Evaluation (3 times
faster)
Range Query(5.5 times faster)
Multi-Attribute Query (2 times)
Semi-linear Query (9 times
faster)
COUNT
• Same timings for GPU implementation
Kth-Largest for median(2.5
times)
Kth-Largest
Kth-Largest conditional
Accumulator(20 times slower!)
Outline
• Database Operations on GPUs
• Implementation & Results
• Analysis
• Conclusions
Analysis: Issues
• Precision
– Currently depth buffer has only 24 bit precision , inadequate
• Copy time
– Copy from texture to depth buffer – no mechanism in GPU
• Integer arithmetic
– Not enough arithmetic inst. In pixel processing engines
• Depth compare masking
– Useful to have comparison mask for depth function
Analysis: Issues
• Memory management
– Current GPUS have 512 MB video memory, we may use the
out-of–core techniques and swap
• No random writes
– No data re-arrangements possible
Analysis: Performance
• Relative Performance Gain
– High Performance – Predicate evaluation, multi-attribute
queries, semi-linear queries, count
– Medium Performance – Kth-largest number
– Low Performance - Accumulator
High Performance
• Parallel pixel processing engines
• Pipelining
– Multi-attribute queries get advantage
• Early Depth culling
– Before passing through the pixel processing engine
• Eliminate branch mispredictions
Medium Performance
• Parallelism
• FX 5900 has clock speed 450MHz, 8 pixel processing
engines
• Rendering single 1000x1000 quad takes 0.278ms
• Rendering 19 such quads take 5.28ms. Observed time is
6.6ms
•
80% efficiency in parallelism!!
Low Performance
• No gain over SIMD based CPU implementation
• Two main reasons:
– Lack of integer-arithmetic
– Clock rate
Outline
• Database Operations on GPUs
• Implementation & Results
• Analysis
• Conclusions
Conclusions
•
Novel algorithms to perform database operations
on GPUs
–
•
Evaluation of predicates, boolean combinations of
predicates, aggregations
Algorithms take into account GPU limitations
–
–
No data rearrangements
No frame buffer readbacks
Conclusions
• Preliminary comparisons with optimized CPU
implementations is promising
• Discussed possible improvements on GPUs
• GPU as a useful co-processor
Relational Joins
• Modern GPUs have thread groups
• Each thread group have several threads
• Data Parallel primitives
– Map
– Scatter – scatters the Data of a relation with respect to an
array L
– Gather – reverse of scatter
– Split – Divides the relation into a number of disjoint partitions
with a given partitioning function
NINLJ
R
Thread
Group 1
S
Thread
Group j
Thread
Group i
Thread
Group Bp
INLJ
• Used Cache Optimized Search Trees (CSS trees)
for index structure
• Inner relation as the CSS tree
• Multiple keys are searched in parallel on the tree
Sort Merge join
• Merge step is done in parallel
• 3 steps
– Divide relation S into Q chunks Q= ||S|| / M
– Find the corresponding matching chunks from R by using the
start and end of each chunk of S
– Merge each pair of S and R chunk in parallel. 1 thread group
per pair.
Hash join
• Partitioning
– Use the Split primitive to partition both the relations
• Matching
– Read the inner relation in memory relation
– Each tuple from the outer relation uses sequential/binary
search on the inner relation
– For binary search, the inner relation will be sorted using
bitonic sort.