
Introduction to Parallel Rendering: Sorting, Chromium, and MPI
Mengxia Zhu
Spring 2006
Parallel Rendering

 Graphics rendering process is computationally intensive
 Parallel computation is a natural measure to leverage for higher performance
 Two levels of parallelism:
  Functional parallelism – pipelining
  Data parallelism – multiple results computed at the same time
Rendering Pipeline
Data Parallel Algorithms

 Many taxonomies exist for categorizing parallel algorithms:
  Image space vs. object space
  Shared-memory vs. distributed-memory architectures
  MPI, OpenMP, …
 We need a uniform framework to study and understand parallel rendering
Sorting in Rendering

 Rendering as a sorting process:
  Sort from object coordinates to screen coordinates
  Use this concept to study computational and communication costs
 The key procedure: calculating the effect of each primitive on each pixel
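
To make the "sort from object coordinates to screen coordinates" concrete, here is a minimal C sketch of carrying one vertex through a combined modelview-projection matrix, the perspective divide, and the viewport mapping. The matrix layout and all names are illustrative assumptions, not from the slides.

```c
typedef struct { float x, y, z, w; } Vec4;

/* m is a 4x4 matrix in row-major order. */
static Vec4 mat_mul(const float m[16], Vec4 p)
{
    Vec4 r;
    r.x = m[0]*p.x  + m[1]*p.y  + m[2]*p.z  + m[3]*p.w;
    r.y = m[4]*p.x  + m[5]*p.y  + m[6]*p.z  + m[7]*p.w;
    r.z = m[8]*p.x  + m[9]*p.y  + m[10]*p.z + m[11]*p.w;
    r.w = m[12]*p.x + m[13]*p.y + m[14]*p.z + m[15]*p.w;
    return r;
}

/* Object coordinates -> window (pixel) coordinates. */
Vec4 to_screen(const float mvp[16], Vec4 obj, int width, int height)
{
    Vec4 clip = mat_mul(mvp, obj);
    float inv_w = 1.0f / clip.w;               /* perspective divide */
    Vec4 ndc = { clip.x * inv_w, clip.y * inv_w, clip.z * inv_w, 1.0f };
    Vec4 win = {
        (ndc.x + 1.0f) * 0.5f * width,         /* viewport mapping */
        (ndc.y + 1.0f) * 0.5f * height,
        (ndc.z + 1.0f) * 0.5f,                 /* depth in [0, 1] */
        1.0f
    };
    return win;
}
```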
Sorting Categories

 The location of this ‘sort’ determines the structure of the parallel algorithm
 Sort-first
  during geometry processing
  distributes “raw” primitives
 Sort-middle
  between geometry processing and rasterization
  distributes screen-space primitives
 Sort-last
  during rasterization
  distributes pixels/fragments
Sorting cont

 A landmark paper: “A Sorting Classification of Parallel Rendering”, Molnar et al., IEEE CG&A ’94.
[Figure: schematic pipelines for Sort-First, Sort-Middle, and Sort-Last. G = geometry processing, R = rasterization, C = compositing; the redistribution network sits before geometry processing, between G and R, and after rasterization (feeding C), respectively.]
Sort First
 Primitives initially assigned arbitrarily
 Pre-transformation is done to determine which screen regions are covered
 Primitives are then redistributed over the network to the correct renderer
 Renderer performs the work of the entire pipeline for that primitive from that point on
Sort First cont
 Screen space is partitioned into non-overlapping 2D tiles, each rendered independently by a tightly coupled pair of geometry and rasterization processors.
 Sub-images of the 2D tiles are composited without depth comparison.
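
A minimal sketch of the sort-first classification step, assuming a fixed grid of non-overlapping tiles with one renderer per tile; the pre-transformation is reduced to a projected 2D bounding-box test. The tile sizes, struct layouts, and function names are hypothetical, not from the slides.

```c
#include <math.h>

#define TILES_X 4
#define TILES_Y 4
#define TILE_W  (1024 / TILES_X)
#define TILE_H  (768  / TILES_Y)

typedef struct { float x, y; } Vec2;
typedef struct { Vec2 v[3]; } Tri2D;   /* triangle already projected to screen */

/* Report every tile (renderer) whose region the primitive's screen
 * bounding box overlaps; sort-first sends the raw primitive there. */
int tiles_overlapped(const Tri2D *t, int owners[], int max_owners)
{
    float xmin = fminf(fminf(t->v[0].x, t->v[1].x), t->v[2].x);
    float xmax = fmaxf(fmaxf(t->v[0].x, t->v[1].x), t->v[2].x);
    float ymin = fminf(fminf(t->v[0].y, t->v[1].y), t->v[2].y);
    float ymax = fmaxf(fmaxf(t->v[0].y, t->v[1].y), t->v[2].y);

    int n = 0;
    for (int ty = (int)(ymin / TILE_H); ty <= (int)(ymax / TILE_H); ty++)
        for (int tx = (int)(xmin / TILE_W); tx <= (int)(xmax / TILE_W); tx++)
            if (tx >= 0 && tx < TILES_X && ty >= 0 && ty < TILES_Y &&
                n < max_owners)
                owners[n++] = ty * TILES_X + tx;
    return n;
}
```

A primitive whose bounding box straddles a tile boundary is sent to every overlapped renderer, which is one source of sort-first's redistribution overhead.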
Analysis Terms

 Assume a dataset containing n_r raw primitives with average size a_r.
 We call primitives that result from tessellation display primitives. If T is the tessellation ratio, there are n_d = T·n_r of these, with average size a_d = a_r/T. If there is no tessellation, T = 1, n_d = n_r, and a_d = a_r.
 Assume an image containing A pixels, with S samples computed per pixel. Assume that all primitives fall within the viewing frustum.
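
A quick worked example, with numbers chosen purely for illustration: with n_r = 10,000 raw primitives and a tessellation ratio T = 4, the back half of the pipeline handles n_d = 4 × 10,000 = 40,000 display primitives, each averaging a quarter of the raw size (a_d = a_r/4). Sort-first redistributes the 10,000 raw primitives, while sort-middle must redistribute all 40,000 display primitives, which is why its communication cost grows with T.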
Sort-first analysis

 Pros:
  Low communication requirements when tessellation or oversampling are high, or when inter-frame coherence is exploited
  Processors implement the entire rendering pipeline for a given screen region
 Cons:
  Susceptible to load imbalance (primitives clump into a few regions)
  Exploiting coherence is difficult
Sort Middle

 Primitives initially assigned arbitrarily
 Primitives fully transformed, lit, etc., by the geometry processor to which they are initially assigned
 Transformed primitives are distributed over the network to the rasterizer assigned to their region of the screen (see the sketch below)
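
As a sketch of the redistribution step, assume each geometry processor has already flattened its transformed screen-space primitives into per-rasterizer buckets inside one send buffer; a counts exchange followed by MPI_Alltoallv then routes every primitive to the rasterizer that owns its screen region. The buffer layout and the redistribute name are assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

/* sendbuf holds all outgoing primitives as floats, grouped by
 * destination; scounts[i] is the float count bound for rasterizer i.
 * Returns the received primitives; *rtotal is the received count. */
float *redistribute(const float *sendbuf, const int *scounts,
                    int *rtotal, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    int *rcounts = malloc(nprocs * sizeof(int));
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    /* Everyone first learns how much it will receive from everyone. */
    MPI_Alltoall((void *)scounts, 1, MPI_INT, rcounts, 1, MPI_INT, comm);

    sdispls[0] = rdispls[0] = 0;
    for (int i = 1; i < nprocs; i++) {
        sdispls[i] = sdispls[i - 1] + scounts[i - 1];
        rdispls[i] = rdispls[i - 1] + rcounts[i - 1];
    }
    *rtotal = rdispls[nprocs - 1] + rcounts[nprocs - 1];

    float *recvbuf = malloc((size_t)*rtotal * sizeof(float));
    MPI_Alltoallv((void *)sendbuf, (int *)scounts, sdispls, MPI_FLOAT,
                  recvbuf, rcounts, rdispls, MPI_FLOAT, comm);

    free(rcounts); free(sdispls); free(rdispls);
    return recvbuf;
}
```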
Sort Middle Analysis

 Pros:
  Redistribution occurs at a “natural” place in the pipeline
 Cons:
  High communication cost if T is high
  Susceptible to load imbalance in the same way as sort-first
 Overhead:
  Display primitive distribution cost
  Tessellation factor
Sort Last
 Defers sorting until the end of the pipeline (the image composition phase)
 Renderers operate independently until the visibility stage
 Fragments are transmitted over the network to compositing processors to resolve visibility (see the sketch below)
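
A minimal sketch of the fragment merging, assuming each renderer holds a full-resolution RGBA image plus a depth buffer and rank 0 plays the single compositing processor; all names are illustrative, not from the slides.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void sort_last_merge(float *rgba, float *depth, int npix, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    if (rank != 0) {                /* renderers ship color + depth */
        MPI_Send(rgba, 4 * npix, MPI_FLOAT, 0, 0, comm);
        MPI_Send(depth, npix, MPI_FLOAT, 0, 1, comm);
        return;
    }
    float *rgba_in  = malloc((size_t)npix * 4 * sizeof(float));
    float *depth_in = malloc((size_t)npix * sizeof(float));
    for (int src = 1; src < nprocs; src++) {
        MPI_Recv(rgba_in, 4 * npix, MPI_FLOAT, src, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(depth_in, npix, MPI_FLOAT, src, 1, comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < npix; i++)
            if (depth_in[i] < depth[i]) {      /* nearer fragment wins */
                depth[i] = depth_in[i];
                memcpy(&rgba[4 * i], &rgba_in[4 * i], 4 * sizeof(float));
            }
    }
    free(rgba_in); free(depth_in);
}
```

Every renderer ships a full A-pixel frame no matter how little it drew, which is exactly the heavy pixel traffic noted in the analysis below; the compositing schemes that follow spread that cost.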
Sort Last Analysis

 Pros:
  Renderers implement the full pipeline and are independent until pixel merging
  Less prone to load imbalance
  Very scalable
 Cons:
  Pixel traffic can be extremely high
Image Composition
• A naïve approach is binary compositing: each disjoint pair of processors composites its two subimages into one.
• N/2 subimages are left after the first stage.
• At each subsequent level of compositing, half of the remaining processors pair up, so the other half sit idle.
• The binary-swap compositing method makes sure that every processor participates in all the stages of the process.
• The key idea: at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces and swap halves (a sketch follows).
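
A minimal sketch of binary-swap compositing under these assumptions: a flat image of npix pixels (RGBA floats plus a depth buffer), nprocs a power of two that divides npix, and visibility resolved by depth comparison; all names are illustrative. At each stage a processor keeps one half of its current span, swaps the other half with its partner, and composites what it receives; after log2(N) stages each rank owns a fully composited 1/N of the image plane.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Per-pixel merge: the fragment nearer the viewer wins. */
static void z_composite(float *rgba, float *depth,
                        const float *rgba_in, const float *depth_in, int n)
{
    for (int i = 0; i < n; i++)
        if (depth_in[i] < depth[i]) {
            depth[i] = depth_in[i];
            memcpy(&rgba[4 * i], &rgba_in[4 * i], 4 * sizeof(float));
        }
}

/* On return, this rank holds the fully composited span [*lo, *hi). */
void binary_swap(float *rgba, float *depth, int npix,
                 int *lo, int *hi, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    *lo = 0; *hi = npix;
    float *rgba_in  = malloc((size_t)npix * 4 * sizeof(float));
    float *depth_in = malloc((size_t)npix * sizeof(float));

    /* One stage per bit of the rank: partners always hold identical
     * spans, split them, swap halves, and composite what they keep. */
    for (int stride = 1; stride < nprocs; stride *= 2) {
        int partner = rank ^ stride;
        int mid = (*lo + *hi) / 2;
        int keep_lo = (rank < partner) ? *lo : mid;  /* low rank keeps left */
        int keep_hi = (rank < partner) ? mid : *hi;
        int send_lo = (rank < partner) ? mid : *lo;
        int send_n  = (rank < partner) ? (*hi - mid) : (mid - *lo);
        int keep_n  = keep_hi - keep_lo;

        MPI_Sendrecv(&rgba[4 * send_lo], 4 * send_n, MPI_FLOAT, partner, 0,
                     rgba_in, 4 * keep_n, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&depth[send_lo], send_n, MPI_FLOAT, partner, 1,
                     depth_in, keep_n, MPI_FLOAT, partner, 1,
                     comm, MPI_STATUS_IGNORE);

        z_composite(&rgba[4 * keep_lo], &depth[keep_lo],
                    rgba_in, depth_in, keep_n);
        *lo = keep_lo; *hi = keep_hi;
    }
    free(rgba_in); free(depth_in);
}
```

The ever-shrinking spans are what keep every processor busy at every stage, in contrast to the naïve binary compositing above.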
Binary Swap Example
• [Figure: the binary-swap compositing algorithm for four processors.]
Which to choose?
 It depends.
 Which approach can best be matched to the hardware's capabilities?
 Number of primitives, tessellation factor, coherence, etc., are all considerations.
 Many tradeoffs.
Load Balancing

 For better load balancing (a sketch of the task-queue approach follows this list):
  Task queuing: the task queue can be ordered by decreasing task size, so that the granularity of concurrency becomes finer as the queue is drained.
  Load stealing: nodes steal smaller tasks from other nodes once they have completed their own tasks.
  Time stamps: a timeout stamp is attached to each task; if a node cannot finish its task before the timeout, it takes the remnant of the task, re-partitions it, and re-distributes it.
 Hierarchical data structures, such as octrees and k-d trees, are commonly used.
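
A minimal sketch of the decreasing-size task queue mentioned above: tasks are sorted largest-first, so early pulls hand out coarse work and the tail of the queue fills in imbalance with fine-grained tasks. Task fields and names are illustrative assumptions.

```c
#include <stdlib.h>

typedef struct {
    int first_prim, nprims;   /* a chunk of primitives to render */
} Task;

static int by_size_desc(const void *a, const void *b)
{
    return ((const Task *)b)->nprims - ((const Task *)a)->nprims;
}

/* Order the shared queue by decreasing task size. */
void init_queue(Task *queue, int ntasks)
{
    qsort(queue, ntasks, sizeof(Task), by_size_desc);
}

/* Workers pop from the front whenever they go idle. */
Task *next_task(Task *queue, int ntasks, int *next)
{
    return (*next < ntasks) ? &queue[(*next)++] : NULL;  /* NULL: drained */
}
```

In a real distributed run the pop in next_task would need a lock or an MPI-based work server, and load stealing extends the same idea by letting an idle node pull from the tail of another node's queue.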
References

 These slides reference contents from:
  Jian Huang at the University of Tennessee, Knoxville
  William Gropp and Ewing Lusk at Argonne National Laboratory