Transcript Sort-first
Introduction to Parallel
Rendering: Sorting,
Chromium, and MPI
Mengxia Zhu
Spring 2006
Parallel Rendering
Graphics rendering process is
computationally intensive
Parallel computation is a natural measure to
leverage for higher performance
Two levels of parallelism:
parallelism – pipelining
Data parallelism – multiple results computed at the
same time
Functional
Rendering Pipeline
Data Parallel Algorithms
A lot of taxonomies of categorizing parallel
algorithms
Image
space vs. object space
Shared memory architecture, distributed memory
architecture
MPI, OpenMP, …
Need a uniform framework to study and
understand parallel rendering
Sorting in Rendering
Rendering as a sorting process:
Sort from object coordinates to screen coordinates
Use this concept to study computational and
communication costs
The key procedure: calculating the effect of each
primitive on each pixel
Use this concept to study computational and
communication costs
Sorting Categories
The location of this ‘sort’ determines the
structure of the parallel algorithm
Sort-first
during
geometry processing
distributes “raw” primitives
Sort-middle
between
geom. processing and rasterization
distributes screen-space primitives
Sort-last
during
rasterization
distributes pixels/fragments
Sorting cont
A landmark paper: “A sorting classification of parallel
rendering”, Molner, et. al., IEEE CG&A’94.
Sort-First
Sort-Middle
Sort-Last
G R
G
R
G R
G R
G
R
G R
G R
G
R
G R
C
Sort First
Primitives initially assigned arbitrarily
Pre-transformation is done to determine
which screen regions are covered
Primitives are then redistributed over
the network to the correct renderer
Renderer performs the work of the
entire pipeline for that primitive from that
point on
Sort First cont
Sort First cont
Screen space is partitioned into nonoverlapping 2D tiles, each is rendered
independently by a tightly coupled pair
of geometry and rasterization
processors.
Sub-image of 2D tiles are composited
without depth comparison.
Analysis Terms
Assume a dataset containing nr raw primitives
with average size ar .
We will call primitives that result from
tessellation display primitives. If T is the
tessellation ratio, there are nd = Tnr of these,
with average size ad = ar /T. If there is no
tessellation, T = 1, nd = nr , and ad = ar .
Assume an image containing A pixels and need
to compute S samples per pixel. Assume that all
primitives within the viewing frustum.
Sort-first analysis
Pros:
Low
communication requirements when
tessellation or oversampling are high, or
when inter-frame coherence exploited
Processors implement entire rendering
pipeline for a given screen region
Cons:
Susceptible
to load imbalance (clumping)
Exploiting coherence is difficult
Sort Middle
Primitives initially assigned arbitrarily
Primitives fully transformed, lit, etc., by
the geometry processor to which they
are initially assigned
Transformed primitives are distributed
over the network to the rasterizer
assigned to their region of the screen
Sort Middle
Sort Middle Analysis
Pros:
Redistribution
occurs at a “natural” place
Cons:
High
communication cost if T is high
Susceptible to load imbalance in the same
way as sort-first
Overhead:
Display
primitive distribution cost
Tessellation factor
Sort Last
Sort Last
Defers sorting until the end (imagine
phase)
Renderers operate independently until
the visibility stage
Fragments transmitted over network to
compositing processors to resolve
visibility
Sort Last Analysis
Pros:
Renderers
implement full pipeline and are
independent until pixel merging
Less prone to load imbalance
Very scalable
Cons:
Pixel
traffic can be extremely high
Image Composition
• A naïve approach is binary compositing.
• Each disjoint pair of processors produces a new
subimage.
• N/2 subimages are left after the first stage.
• Half the number of the original processors are paired
up for the next level of compositing hence another half
would be idle.
• The binary-swap compositing method makes sure that
every processor participates in all the stages of the
process.
• The key idea – at each compositing stage, the two
processors involved in a composite operation split the
image plane into two pieces.
Binary Swap Example
• The binary-swap compositing algorithm for four processors:
Which to choose?
It depends.
Which ones can be best matched to
hardware capabilities?
Number of primitives, tessellation factor,
coherence, etc., are all considerations.
Many tradeoffs.
Load Balancing
For better load balancing,
Task queuing: the task queue can be ordered in
decreasing task size, such that the concurrency gets
finer until the queue is exhausted.
Load stealing: having nodes steal smaller tasks from
other nodes, once they have completed their own tasks
Time stamp: timeout stamps used for each task, such
that if the node can not finish its task before the
timeout, it takes the remnant of the task, re-partitions it
and re-distributes it.
Hierarchical data structures, such as octree,
k-d tree, etc., are commonly used.
References
These slides reference contents from
Jian
Huang at University of Tennessee at
Knoxville
William
Gropp and Ewing Lusk at Argonne
National Laboratory