Transcript Lecture 23
Introduction to Parallel
Rendering
Jian Huang, CS 594, Spring 2002
Parallel Rendering
• Graphics rendering process is computationally
intensive
• Parallel computation is a natural means to achieve
higher performance
• Two levels of parallelism:
– Functional parallelism – pipelining
– Data parallelism – multiple results computed at the
same time
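The two levels can be contrasted in a small sketch (illustrative only; the stage functions below are made-up stand-ins for geometry processing and rasterization):

```python
# Functional parallelism (pipelining): different stages run concurrently,
# each on a different item. Here we only show the stage composition; on
# real hardware, transform(v[i+1]) overlaps rasterize(v[i]).
def transform(v):
    return tuple(2 * x for x in v)   # stand-in geometry stage

def rasterize(v):
    return sum(v)                    # stand-in rasterization stage

def pipeline(vertices):
    return [rasterize(transform(v)) for v in vertices]

# Data parallelism: the same stage applied to many items at once.
from multiprocessing.dummy import Pool  # thread pool; enough for a sketch

def data_parallel(vertices, workers=4):
    with Pool(workers) as pool:
        return pool.map(lambda v: rasterize(transform(v)), vertices)

verts = [(1, 2), (3, 4), (5, 6)]
assert pipeline(verts) == data_parallel(verts)
```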
Data Parallel Algorithms
• Many taxonomies exist for categorizing parallel
algorithms
– Image space vs. object space
– Shared memory architecture, distributed memory
architecture
– MPI, OpenMP, …
• Need a uniform framework to study and
understand parallel rendering
A Core Problem
• To partition work and distribute it among processors
– Rendering requires the involved data to go with each work
partition
– How to collect the rendered results to form the final image
– Intensive data communication
• A landmark paper: “A Sorting Classification of Parallel
Rendering”, Molnar et al., IEEE CG&A ’94.
The Rendering Process
• The rendering pipeline:
– Geometry processing: transformation, lighting, clipping
– Rasterization: scan-conversion, shading, visibility
• Parallel processing:
A New Perspective
• Rendering as a sorting process:
– Sort from object coordinates to screen coordinates
– Use this concept to study computational and
communication costs
Sorting
• Parallel geometry processing: assign a subset of
primitives or objects to each processor
• Parallel rasterization processing: assign a subregion of the screen to each processor
• The key procedure: calculating the effect of each
primitive on each pixel
• Rendering is a ‘sorting’ process from each
primitive into each screen pixel
– This sort involves redistributing data among processors
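A minimal sketch of this idea, treating each primitive as a screen-space bounding rectangle (a stand-in for a triangle's bounds, assumed for illustration) and "sorting" it into the pixels it affects:

```python
# Rendering as a sort: build, for every pixel, the list of primitives
# whose screen-space bounds touch it.

def pixels_covered(prim, width, height):
    """Return the set of (x, y) pixels a primitive's bounds overlap."""
    x0, y0, x1, y1 = prim
    return {(x, y)
            for y in range(max(0, y0), min(height, y1 + 1))
            for x in range(max(0, x0), min(width, x1 + 1))}

def sort_primitives(prims, width, height):
    buckets = {}
    for i, prim in enumerate(prims):
        for px in pixels_covered(prim, width, height):
            buckets.setdefault(px, []).append(i)
    return buckets

buckets = sort_primitives([(0, 0, 1, 1), (1, 1, 2, 2)], width=4, height=4)
assert buckets[(1, 1)] == [0, 1]   # pixel (1,1) is touched by both
```

In a parallel renderer, the buckets live on different processors, which is exactly why this sort implies data redistribution.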
Where does the ‘sort’ take place?
• The location of this ‘sort’ determines the
structure of the parallel algorithm
• This ‘sort’ can take place during:
– Geometry processing (sort-first)
– Between geometry processing and rasterization
(sort-middle)
– Rasterization (sort-last)
• Each different sort has distinct properties
Sort-First
• Redistributing raw primitives
Sort-Middle
• Redistributing screen primitives
Sort-Last
• Redistributing fragments, samples, or pixels
Processing and Communication
Model
• A refined model
• Assume a dataset containing nr raw primitives with average size ar.
• We will call primitives that result from tessellation display
primitives. If T is the tessellation ratio, there are nd = Tnr of
these, with average size ad = ar/T. If there is no tessellation,
T = 1, nd = nr, and ad = ar.
• Assume an image containing A pixels, with S samples computed
per pixel. Assume that all primitives fall within the viewing
frustum.
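The model's definitions can be worked through with concrete numbers (the values below are made up for illustration):

```python
# nr raw primitives of average size ar; tessellation ratio T gives
# nd = T * nr display primitives of average size ad = ar / T.
n_r, a_r = 10_000, 50.0   # raw primitives and their average size
T = 4                     # tessellation ratio

n_d = T * n_r             # display primitives: 40,000
a_d = a_r / T             # their average size: 12.5

# Total geometric data volume is unchanged by tessellation:
assert n_d * a_d == n_r * a_r

A, S = 1024 * 1024, 4     # image pixels and samples per pixel
total_samples = A * S     # work on the rasterization side of the sort
```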
The Terms
Analysis of Sort-First
• c, proportion of primitives to be redistributed
• Advantages
– Low communication requirements when
the tessellation ratio and the degree of
oversampling are high, or when frame-to-frame
coherence can be exploited.
– Processors implement entire rendering
pipeline for a portion of the screen.
• Disadvantages
– Susceptible to load imbalance. Primitives
may clump into regions, concentrating the
work on a few renderers.
– To take advantage of frame-to-frame
coherence, retained mode and complex
data handling code are necessary.
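The redistribution step can be sketched as follows. The vertical-strip screen decomposition, the function names, and the use of a pre-transformed bound to pick the destination are assumptions for illustration, not details from the paper:

```python
# Sort-first sketch: each processor owns one vertical screen strip; a
# cheap pre-transform of each primitive's bound decides the destination
# *before* full geometry processing. c is the fraction that must move.

def owner(xmin, n_procs, screen_w):
    """Processor owning the strip containing a primitive's left edge."""
    return min(n_procs - 1, int(xmin / screen_w * n_procs))

def sort_first(prims, placed_on, n_procs, screen_w):
    """prims: (xmin, ...) screen bounds from a quick pre-transform.
    placed_on[i]: processor currently holding raw primitive i.
    Returns (destination list, c = fraction redistributed)."""
    dest = [owner(p[0], n_procs, screen_w) for p in prims]
    moved = sum(1 for i, d in enumerate(dest) if d != placed_on[i])
    return dest, moved / len(prims)

# 4 primitives on 2 processors over a 100-pixel-wide screen:
dest, c = sort_first([(10,), (80,), (20,), (90,)],
                     placed_on=[0, 0, 1, 1], n_procs=2, screen_w=100)
assert dest == [0, 1, 0, 1] and c == 0.5
```

With good frame-to-frame coherence, most primitives stay where they already are, so c and the communication cost stay small.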
Cost over uni-processor
rendering
Analysis of Sort-Middle
• Advantages: General and
straightforward; redistribution
occurs at a natural place in the
pipeline.
• Disadvantages:
– High communication costs if
tessellation ratio is high.
– Susceptible to load imbalance
between rasterizers when
primitives are distributed unevenly
over the screen.
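The sort-middle routing step can be sketched as below; the vertical-tile layout and function names are assumptions for illustration:

```python
# Sort-middle sketch: after geometry processing, each *screen-space*
# primitive is routed to every rasterizer whose tile it overlaps.

def tiles_overlapped(bounds, tile_w, n_tiles):
    """Indices of vertical screen tiles a screen-space bound overlaps."""
    x0, x1 = bounds
    first, last = int(x0 // tile_w), int(x1 // tile_w)
    return range(max(0, first), min(n_tiles - 1, last) + 1)

def sort_middle(screen_prims, tile_w, n_tiles):
    """Return per-rasterizer work lists. A primitive spanning k tiles
    is sent k times -- one source of high traffic when T is large."""
    work = [[] for _ in range(n_tiles)]
    for i, b in enumerate(screen_prims):
        for t in tiles_overlapped(b, tile_w, n_tiles):
            work[t].append(i)
    return work

# Primitive 0 spans two 50-pixel tiles and is duplicated to both:
work = sort_middle([(40, 60), (10, 20)], tile_w=50, n_tiles=2)
assert work == [[0, 1], [0]]
```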
Cost over uni-processor
rendering
Analysis of Sort-Last
• Sparse merging – only merge the small
region each processor renders
• Full-frame merging – always merge the
whole frame-buffer
• Advantages:
– Renderers implement the full rendering
pipeline and are independent until pixel
merging.
– Less prone to load imbalance.
– Sort-last full-frame merging can be embedded
in a linear network, making it linearly scalable.
• Disadvantage: Pixel traffic may be
extremely high, particularly when
oversampling.
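Full-frame merging amounts to a depth-compositing pass over whole framebuffers, which can be sketched as follows (the framebuffer representation here is a simplification for illustration):

```python
# Sort-last full-frame sketch: each renderer produces a complete
# framebuffer of (color, z) pixels; buffers are merged pairwise by
# depth test. Because the merge is associative, it can be chained
# processor-to-processor along a linear network.
from functools import reduce

def merge(fb_a, fb_b):
    """Depth-composite two framebuffers: nearer z wins per pixel."""
    return [a if a[1] <= b[1] else b for a, b in zip(fb_a, fb_b)]

def full_frame_composite(framebuffers):
    return reduce(merge, framebuffers)

# Two 3-pixel partial frames from two renderers:
fb0 = [('red', 1.0), ('red', 5.0), ('red', 2.0)]
fb1 = [('blue', 2.0), ('blue', 1.0), ('blue', 2.0)]
assert full_frame_composite([fb0, fb1]) == \
    [('red', 1.0), ('blue', 1.0), ('red', 2.0)]
```

Every merge step moves a full frame of pixels, which is why pixel traffic, not primitive traffic, dominates the cost.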
Cost over uni-processor
rendering
A Comparison
• Sort-first, sort-middle and sort-last.
• No single category is strictly best
• Different categories can be combined in an
implementation as well
Parallel Volume Rendering
• Many algorithms distribute the data across processing
nodes. Each node renders its portion of the data and
generates a partial image. The partial images are then
accumulated together.
• Screen-space partitioning (tiles or contiguous
scan-lines) is also used.
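The accumulation of partial images can be sketched with the standard front-to-back "over" operator; the assumption here is that the data partition yields a known front-to-back ordering of the nodes' partial images:

```python
# Each node's partial image is a list of (color, alpha) pixels;
# partial images are combined front to back with the over operator.

def over(front, back):
    """Composite one pixel: front (c, a) over back (c, a)."""
    cf, af = front
    cb, ab = back
    return (cf + (1 - af) * cb, af + (1 - af) * ab)

def composite(partials):
    """partials: per-node pixel lists, sorted front to back."""
    result = partials[0]
    for img in partials[1:]:
        result = [over(f, b) for f, b in zip(result, img)]
    return result

# Two one-pixel partial images:
front_img = [(0.5, 0.5)]   # half-opaque, intensity 0.5
back_img  = [(1.0, 1.0)]   # fully opaque behind it
assert composite([front_img, back_img]) == [(1.0, 1.0)]
```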
Load Balancing
• For better load balancing,
– Task queuing: the task queue can be ordered in
decreasing task size, such that the concurrency gets
finer until the queue is exhausted.
– Load stealing: nodes steal smaller tasks from
other nodes once they have completed their own tasks.
– Time stamp: a timeout is set for each task; if a node
cannot finish its task before the timeout, the remnant
of the task is re-partitioned and re-distributed.
• Hierarchical data structures, such as octree, k-d
tree, etc., are commonly used.
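The task-queuing strategy above can be sketched as greedy list scheduling (the task sizes and worker counts below are made up for illustration):

```python
# Tasks are ordered by decreasing size; an idle worker takes the next
# (largest remaining) task, so the work grains get finer as the queue
# drains and the final loads end up close to balanced.
import heapq

def schedule(task_sizes, n_workers):
    """Greedy list scheduling; returns the sorted total load per worker."""
    queue = sorted(task_sizes, reverse=True)     # big tasks first
    # Min-heap of (current load, worker id): each task goes to the
    # least-loaded worker, mimicking "take work when you finish".
    loads = [(0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    for size in queue:
        load, w = heapq.heappop(loads)
        heapq.heappush(loads, (load + size, w))
    return sorted(load for load, _ in loads)

# Uneven tasks end up evenly balanced across 2 workers:
assert schedule([7, 1, 4, 2, 2], n_workers=2) == [8, 8]
```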
Sum It Up
• Parallelism is just a tool
• It depends on which algorithm you parallelize
• What about an OpenGL application utilizing
occlusion culling?