Transcript Slide

Team Members:
Tyler Drake
Robert Wrisley
Kyle Von Koepping
Justin Walsh
Faculty Advisors:
Computer Science – Prof. Sanjay Rajopadhye
Electrical & Computer Engineering – Prof. Olivera Notaros
• Project Goals: To develop parallel versions of applications that run on a graphics card, and to measure their performance.
– Started with a simple Matrix Multiply program.
– We intend to develop one or two additional applications and to pursue an analysis of hardware optimizations.
– Develop a process for tuning applications & hardware that other developers can follow more easily.
• Tyler Drake – Computer Science major
• Robert Wrisley – Computer Science/Computer Engineering
dual major
• Kyle Von Koepping – Electrical Engineering major
• Justin Walsh – Computer Science/Computer Engineering
dual major
• Shared coding responsibilities
– Enables comparison and greater understanding for all team
members
– Possibly divide responsibilities for the second half of the project
• Transistor densities on single-core processors have been doubling approximately every 18 months.
• This trend has remained valid since first
observed in 1965 and is expected to hold for
several more years.
• This natural trend has become the standard goal for hardware companies.
• There is an ultimate limit to Moore’s law.
• Transistors will soon approach the atomic scale.
• Moore's Law does not apply to Random Access Memory (RAM) speeds or hard drive seek times (the so-called "Memory Wall").
• Redesign of processor architecture isn’t
driven directly by Moore’s Law, but by
the fact that these and other factors have
not kept up with this growth rate.
• The CPU (or multiple CPUs) is not the only processor found on a personal computer.
• The graphics card has a
graphics processing unit
(GPU).
• The GPU is specifically designed to render 3D models onto a 2D display.
• Designed for floating point computation with a
highly parallel architecture.
• Engineers have begun to exploit the highly
parallel architecture of the GPU for general
applications.
• Graphics companies encourage general
purpose computing on the GPU (GPGPU).
• Nvidia has developed CUDA (Compute
Unified Device Architecture).
• Because CUDA is based on the C language, programmers can easily shift to developing on the GPU.
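As a hedged illustration of what that C-style programming looks like (illustrative names, not our project code), a minimal CUDA kernel might be written as:

// A minimal CUDA kernel: __global__ marks a function that runs on the GPU
// but is launched from ordinary CPU code. (Illustrative sketch only.)
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each GPU thread computes one element of the output vector.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launching it looks like a C call with an extra <<<blocks, threadsPerBlock>>> clause:
//   vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

Host-side setup (allocating and copying the d_a, d_b, d_c buffers) is sketched later under Memory.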
What We Have Done So Far
• Learning about CUDA
– NVIDIA CUDA guides
– Lecture slides from University of Illinois, Urbana-Champaign
– Papers from various academic groups
• University of Illinois, Urbana-Champaign
• Tokyo Institute of Technology
• University of California at Berkeley
• Learning to write parallel programs in CS475 using
MPI & OpenMP
• Writing simple programs using CUDA and observing
performance
– Matrix Multiply
• Results
– Achieved 131 GFLOPS on a GTX 280 with N = 1024; the GTX 280's theoretical peak is 933 GFLOPS.
• Optimizations
– Tiling the result matrix into smaller sub-matrices, with each thread block computing one sub-matrix, reduces the amount of data each thread block must load (sketched below).
– This helps to reduce memory latency.
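To make the tiling idea concrete, below is a hedged sketch of a shared-memory tiled kernel (assuming N is a multiple of the tile width, as with our N = 1024 runs; an illustration, not necessarily our exact project kernel). For scale, at 131 GFLOPS one N = 1024 multiply, about 2 × 1024^3 ≈ 2.1 billion floating-point operations, takes roughly 16 ms.

#define TILE 16  // tile width: each block is 16 x 16 = 256 threads

// Each thread block computes one TILE x TILE sub-matrix of C = A * B,
// staging tiles of A and B in on-chip shared memory so each global-memory
// value is loaded once per block rather than once per thread.
__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the tiles along A's row and down B's column.
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // wait before overwriting the tiles
    }
    C[row * N + col] = sum;
}

// Launch for N = 1024 (assumes N is a multiple of TILE):
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, N / TILE);
//   matMulTiled<<<grid, block>>>(d_A, d_B, d_C, N);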
• Memory
– Must allocate memory on the graphics card from
the main program being run on the CPU
– Memory for the graphics card is explicitly
managed by the programmer
• An “extension” to C, not a separate language
– Similar to MPI, OpenMP, etc.
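A hedged host-side sketch of the explicit memory management described above, using standard CUDA runtime calls (variable names are illustrative):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    // Ordinary C allocation on the CPU ("host") side.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    // The programmer explicitly allocates matching storage on the card ("device"),
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);

    // copies input over before launching any kernels,
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... kernel launches would go here ...

    // copies results back afterwards, and frees both sides.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    return 0;
}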
Increasing Problem Complexity
• Some applications are no longer "pleasantly parallel"
• Higher degree of kernel analysis required
• Moving to more dynamic programs
• Additional programs being written for the GPU include:
– Scan: an array computation where the ith element is the sum of the previous i-1 elements (a prefix sum; see the sketch after this list)
– Knapsack: profit maximization given a capacity and a list of items with their weights & profits
– Matrix Multiply for still larger matrices
– Triangular Matrix Multiplication
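For the Scan entry above, here is a hedged sketch of the simplest single-block case, written as a Hillis-Steele inclusive scan (each output includes its own element); the full program needs multiple blocks plus a pass to combine per-block sums, and our actual implementation may differ:

// Inclusive prefix sum (scan) of up to blockDim.x elements in one block.
__global__ void scanBlock(const float *in, float *out, int n)
{
    extern __shared__ float temp[];      // shared-memory working buffer
    int tid = threadIdx.x;

    temp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    // Hillis-Steele: at each step every element adds in the value
    // 'offset' places to its left, doubling the reach each time.
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        float addend = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();                 // everyone reads before anyone writes
        temp[tid] += addend;
        __syncthreads();
    }

    if (tid < n)
        out[tid] = temp[tid];
}

// Launch for n <= 512 elements (one block, shared memory sized to the block):
//   scanBlock<<<1, 512, 512 * sizeof(float)>>>(d_in, d_out, n);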
Mandelbrot Set
• Pleasantly parallel, familiar
• Easily scalable
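To show why this is pleasantly parallel: each pixel can be computed by its own thread with no communication between threads. A hedged sketch (window bounds and iteration limit are arbitrary, not project code):

// One GPU thread per pixel: compute the escape-iteration count for its point.
__global__ void mandelbrot(int *iters, int width, int height, int maxIter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel into the classic -2..1 by -1.5..1.5 window of the complex plane.
    float cr = -2.0f + 3.0f * px / width;
    float ci = -1.5f + 3.0f * py / height;

    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (i < maxIter && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;        // z = z^2 + c
        zr = t;
        ++i;
    }
    iters[py * width + px] = i;          // colored on the CPU for display
}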
Ray Tracing
• Very computationally intensive
• Feasible for non-real-time computations
• Very dynamic, due to recursion
• High degree of realism
Examples of images generated by Ray Tracing
Hidden Markov Models
• Clear parallelism
• Wide range of applications
Uses of Hidden Markov Models
• To develop a more complex application for the
GPU and optimize the performance
• To analyze hardware optimizations and
evaluate the performance gains
• Develop a process for future programmers that gives the best performance increase for the minimum development effort
• Please Note: These goals are tentative and subject to change.
• Moore's Law is now being applied to cores per processor instead of transistors per processor.
• Multi-core machines offer the next generation
of performance enhancements… but they are
already here!
• GPUs provide massively parallel architectures
that programmers can take advantage of to
see phenomenal performance gains.
• Learning to use the CUDA library and some of its nuances.
• Have gotten good performance on our Matrix Multiply attempts.
• Also completing CUDA
versions of Scan and
Knapsack problems.
• Move on to a more
complex application.
• Researching hardware
optimizations that can
further enhance
performance on GPUs.
• Develop a combined
approach for future
applications
programmers to follow.
• $50 spent on a CUDA-compatible graphics card.
• We’d like to thank Prof. Dan Connors for the use of
his machines with Nvidia GTX280 graphics cards.
– This provided free access to a consistent build for all of us to run our code and sample code on.
• We don’t project any major costs next semester,
except perhaps for some materials for our E-Days
presentation.