Team Members:
Tyler Drake
Robert Wrisley
Kyle Von Koepping
Justin Walsh
Faculty Advisors:
Computer Science – Prof. Sanjay Rajopadhye
Electrical & Computer Engineering – Prof. Olivera Notaros
• Project Goals: To develop parallel versions of applications that will run on a graphics card and to measure their performance.
– Started with a simple Matrix Multiply program.
– We intend to develop one or two additional applications and to pursue an analysis of hardware optimizations.
– Develop a process for tuning applications &
hardware that other developers can use more
easily.
• Tyler Drake – Computer Science major
• Robert Wrisley – Computer Science/Computer Engineering
dual major
• Kyle Von Koepping – Electrical Engineering major
• Justin Walsh – Computer Science/Computer Engineering
dual major
• Shared coding responsibilities
– Enables comparison and greater understanding for all team
members
– Possibly divide responsibilities for the second half of the project
• Transistor densities on single-core
processors were doubling approximately
every 18 months.
• This trend has remained valid since first
observed in 1965 and is expected to hold for
several more years.
• This natural trend has become the standard goal for hardware companies.
• There is an ultimate limit to Moore’s law.
• Transistors will soon approach atomic scale.
• Moore’s law does not apply to Random Access Memory (RAM) speeds or hard-drive seek times (the “Memory Wall”).
• Redesign of processor architecture isn’t
driven directly by Moore’s Law, but by
the fact that these and other factors have
not kept up with this growth rate.
• The CPU, or multiple CPUs, are not the only processors found on a personal computer.
• The graphics card has a
graphics processing unit
(GPU).
• The GPU is specifically designed to render 3D models onto a 2D display.
• Designed for floating point computation with a
highly parallel architecture.
• Engineers have begun to exploit the highly
parallel architecture of the GPU for general
applications.
• Graphics companies encourage general
purpose computing on the GPU (GPGPU).
• Nvidia has developed CUDA (Compute
Unified Device Architecture).
• Because it is based on the C language, programmers can easily shift to developing on the GPU.
What We Have Done So Far
• Learning about CUDA
– NVIDIA CUDA guides
– Lecture slides from University of Illinois, Urbana-Champaign
– Papers from various academic groups
• University of Illinois, Urbana-Champaign
• Tokyo Institute of Technology
• University of California at Berkeley
• Learning to write parallel programs in CS475 using
MPI & OpenMP
• Writing simple programs using CUDA and observing
performance
– Matrix Multiply
• Results
• Achieved 131 GFLOPS on a GTX 280 with N = 1024; the GTX 280’s peak is 933 GFLOPS.
• Optimizations
• Tiling the result matrix into smaller sub-matrices and having each thread block compute one sub-matrix reduces the amount of data each thread block must load, as sketched below.
• This reduces global memory traffic and helps hide memory latency.
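The tiling idea, in kernel form, looks roughly like the following. This is a minimal sketch assuming square N x N row-major matrices and a tile width that evenly divides N; it is illustrative, not the exact code we benchmarked.

#define TILE 16   // assumed tile width; must divide N in this simplified sketch

// C = A * B for N x N matrices; each thread block computes one TILE x TILE sub-matrix of C
__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tiles staged in fast on-chip shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk across a row of tiles of A and a column of tiles of B
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before the tile is overwritten
    }
    C[row * N + col] = sum;
}

A typical launch would be dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE); matMulTiled<<<grid, block>>>(dA, dB, dC, N); each element of A and B is then read from global memory N / TILE times instead of N times.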
• Memory
– Must allocate memory on the graphics card from
the main program being run on the CPU
– Memory for the graphics card is explicitly managed by the programmer (see the sketch after this list)
• An “extension” to C, not a separate language
– Similar to MPI, OpenMP, etc.
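A hedged sketch of the pattern described above, showing both the C-like kernel and launch syntax and the explicitly programmer-managed device memory. The vector-add kernel and every name in it are illustrative choices of ours, not code from the project; it would be compiled with nvcc as a .cu file.

#include <stdio.h>

// Kernel: runs on the GPU, one thread per element ("extension to C" syntax)
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) memory
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) memory must be allocated explicitly from the CPU-side program
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    // Copy inputs to the card, launch the kernel, copy the result back
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("hC[0] = %f\n", hC[0]);   // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}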
Increasing problem complexity:
• Some applications are no longer “pleasantly parallel”
• A higher degree of kernel analysis is required
• Moving to more dynamic programs
• Additional programs being written for the GPU
include:
– Scan (prefix sum): an array computation where the ith element is the sum of the preceding i-1 elements (sketched after this list)
– Knapsack: profit maximization given a capacity and a list of items with their weights & profits
– Matrix Multiply for still larger matrices
– Triangular Matrix Multiplication
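For scan, a minimal single-block sketch of the idea is below, written in the inclusive form (element i includes in[i] as well). It uses the Hillis-Steele formulation and is illustrative only, not our tuned version; a full-size scan needs an extra pass over per-block sums.

// Inclusive scan (prefix sum) of up to one block's worth of elements
__global__ void scanInclusive(const float *in, float *out, int n) {
    extern __shared__ float temp[];          // shared memory sized at launch time
    int tid = threadIdx.x;
    if (tid < n) temp[tid] = in[tid];
    __syncthreads();

    // Hillis-Steele: at step d, element i adds in element i - d
    for (int d = 1; d < n; d *= 2) {
        float val = 0.0f;
        if (tid >= d && tid < n) val = temp[tid - d];
        __syncthreads();                     // everyone reads before anyone writes
        if (tid >= d && tid < n) temp[tid] += val;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}

Launched as scanInclusive<<<1, n, n * sizeof(float)>>>(dIn, dOut, n) for n up to the maximum block size.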
Mandelbrot Set
• Pleasantly parallel, familiar
• Easily scalable
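Because every pixel’s escape count is independent of every other pixel’s, the Mandelbrot set maps naturally onto one thread per pixel. A sketch follows; the view window, grayscale mapping, and all names are our own illustrative choices.

// One thread per pixel; each pixel iterates z = z*z + c independently
__global__ void mandelbrot(unsigned char *img, int width, int height, int maxIter) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel to a point c in the complex plane
    float cx = -2.0f + 3.0f * px / width;    // real part in [-2, 1]
    float cy = -1.5f + 3.0f * py / height;   // imaginary part in [-1.5, 1.5]

    float zx = 0.0f, zy = 0.0f;
    int iter = 0;
    while (zx * zx + zy * zy < 4.0f && iter < maxIter) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++iter;
    }
    img[py * width + px] = (unsigned char)(255 * iter / maxIter);  // simple grayscale
}

A 2D launch such as dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); covers the whole image, and scaling to larger images only changes the grid size.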
Ray Tracing
• Very computationally intensive
• Feasible for non-realtime computations
• Very dynamic, due to recursion
• High degree of realism
Examples of images generated by Ray Tracing
Hidden Markov Models
• Clear parallelism
• Wide range of applications
Uses of Hidden Markov Models
• To develop a more complex application for the
GPU and optimize the performance
• To analyze hardware optimizations and
evaluate the performance gains
• To develop a process for future programmers that gives them the best performance increase for the minimum development effort
• Please Note: These goals are tentative and subject to change.
• Moore’s Law is now being applied to cores per processor instead of transistors per processor.
• Multi-core machines offer the next generation
of performance enhancements… but they are
already here!
• GPUs provide massively parallel architectures
that programmers can take advantage of to
see phenomenal performance gains.
• Learning to use the CUDA library and some of its nuances.
• Have achieved good performance on our Matrix Multiply attempts.
• Also completing CUDA
versions of Scan and
Knapsack problems.
• Move on to a more complex application.
• Research hardware optimizations that can further enhance performance on GPUs.
• Develop a combined approach for future application programmers to follow.
• $50 spent on a CUDA-compatible graphics card.
• We’d like to thank Prof. Dan Connors for the use of
his machines with Nvidia GTX280 graphics cards.
– This gave us free access to a consistent platform on which all of us could run our code and sample code.
• We don’t project any major costs next semester,
except perhaps for some materials for our E-Days
presentation.