L15: Tree-Structured Algorithms on GPUs
CS6963
L15: Tree Algorithms
Administrative
STRSM due March 17 (EXTENDED)
Midterm coming
In class April 4, open notes
Review notes, readings and review lecture (before break)
Will post prior exams
Design Review
Intermediate assessment of progress on project, oral and short
Tentatively April 11 and 13
Final projects
Poster session, April 27 (dry run April 25)
Final report, May 4
Outline
• Mapping trees to data-parallel architectures
• Sources:
• Parallel scan from Lin and Snyder, _Principles of Parallel
Programming_
• “An Effective GPU Implementation of Breadth-First Search,” Lijuan
Luo, Martin Wong and Wen-mei Hwu, DAC ‘10, June 2010.
• “Inter-block GPU communication via fast barrier synchronization,” S.
Xiao and W. Feng, Virginia Tech Technical Report, 2009.
• “Stackless KD-Tree Traversal for High Performance GPU Ray
Tracing,” S. Popov, J. Gunther, H-P Seidel, P. Slusallek,
Eurographics 2007, 26(3), 2007.
Mapping Challenge
From this: [figure: a hierarchical tree structure]
To this: [figure: a flat, data-parallel array layout]
Simple Example
Parallel Prefix Sum:
Compute every partial sum y[i] = A[0]+…+A[i] of A[0],…,A[n-1]
Standard way to express it
sum = 0;
for (i = 0; i < n; i++) {
    sum += A[i];
    y[i] = sum;   /* y[i] = A[0] + ... + A[i] */
}
Semantics require:
(…((sum+A[0])+A[1])+…)+A[n-1]
That is, sequential
Can it be executed in parallel?
Graphical Depiction of Sum Code
Original Order
Pairwise Order
Which decomposition is better suited for parallel execution?
Parallelization Strategy
1. Map tree to flat array
2. Two passes through tree:
a. Bottom up: Compute sum for each subtree and
propagate all the way up to root node
b. Top down:
• Each non-leaf node receives from its parent the prefix
sum up to its subtree. It passes that value to its left
child, and sends the right child that value plus the left
child's subtree sum computed on the bottom-up pass.
• Leaves add the prefix value from the parent to their
saved value to compute the final result.
3. Solution can be found on my website from
CS4961 last fall
Solution (Figure 1.4 from Lin and Snyder)
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Breadth First Search
Definition: Beginning at a node s in a connected
component, explore all the neighboring nodes. Then for
each of the neighbors, explore their unexplored
neighbors until all nodes have been visited. The result
of this search is the set of nodes reachable from s.
Input: G=(V,E) and distinguished vertex s
Output: a breadth first spanning tree with root s that
contains all reachable vertices
Key data structure: a frontier queue of nodes that
have been visited at the current level of the tree
GPU Challenges
Very little work at the root node. Much more work as
the algorithm progresses to leaves.
Managing global frontier queue can lead to high
overhead.
Sparse matrix implementation (L multiplications
corresponding to L levels) can be slower than sequential
CPU algorithms for large graphs.
Assumption for this paper: graph is sparse
Overview of Strategy
Parallelism comes from expanding all frontier
vertices simultaneously.
With a sparse graph, each frontier vertex has only a
few neighbors, so expanding a single vertex offers
little work to parallelize.
Vary how much of the GPU is used, depending on a
profitability threshold:
A single warp as the baseline
A single block as the next level
The entire device as the outermost level
Overview of Data Decomposition
Corresponding mapping of frontier queue to levels in implementation
• Hierarchical frontier queue leads to infrequent global
synchronization.
• Split the frontier queue into levels according to position in the tree
• Lowest level is for a single warp
• Next level for per block shared memory
• Outermost level is for a larger global memory structure.
A Few Details
At warp level, writes to a W-Frontier are atomic:
a single element is added to the queue and the
end-of-queue index is updated.
Different warps write to different W-Frontier queues.
A B-Frontier is the union of 8 W-Frontiers.
A single thread walks the W-Frontiers to derive
indices of frontier nodes.
A G-Frontier is shared across the device.
Copies from B-Frontier to G-Frontier are done
atomically. (Can use coalescing.)
Global Synchronization Approach
Global synchronization across blocks uses the
approach of the Virginia Tech TR (Xiao and Feng).
Approach depends on:
Using the entire device, all SMs
Exact match of blocks to SMs
All SMs must be included in “global barrier” to
prevent deadlock.
Another Algorithm: K-Nearest Neighbor
Ultimate goal: identify the k nearest neighbors to a
distinguished point in space from a set of points
Create an acceleration data structure, called a
KD-tree, to represent the points in space.
Given this tree, the nearest neighbors of any
distinguished point can be identified.
One application of K-Nearest Neighbor is ray tracing
Constructing a KD-Tree
Hierarchically partition a set of points (2D example)
Slide source: http://graphics.stanford.edu/courses/cs368-00spring/TA/manuals/CGAL/ref-manual2/SearchStructures/Chapter_main.html
Representing the KD-Tree
Like the parallel prefix sum, we flatten the tree data
structure to represent the KD-tree in memory.
Difference: the tree is not fully populated, so we
cannot use the dense linearized layout of the parallel
prefix sum.
An auxiliary structure, called "ropes", provides links
between neighboring cells.
The traversal is called "stackless" because it needs
neither recursion nor an explicit stack.
Summary of Lecture
To map trees to data-parallel architectures with little
or no support for recursion:
Flatten hierarchical data structures into dense
vectors
Use hierarchical storage corresponding to
hierarchical algorithm decomposition to avoid costly
global synchronization and minimize global memory
accesses.
Possibly vary amount of device participating in
computation for different levels of the tree.