Day 1 Session 1

Transcript Day 1 Session 1

Training Program on GPU Programming with CUDA
31st July, 7th Aug, 14th Aug 2011
CUDA Teaching Center @ UoM
Day 1, Session 1
Introduction
Sanath Jayasena
Outline
• Training Program Description
• CUDA Teaching Center at UoM
Subject Matter
• Introduction to GPU Computing
• GPU Computing with CUDA
• CUDA Programming Basics
July-Aug 2011
CUDA Training Program
Overview of Training Program
• 3 Sundays, starting 31st July
• Schedule and program outline
• Main resource persons
– Sanath Jayasena, Jayathu Samarawickrama, Kishan Wimalawarna, Lochandaka Ranathunga
• Dept of Computer Science & Eng and Dept of Electronic & Telecom. Engineering (both of Faculty of Engineering), and Faculty of IT
CUDA Teaching Center
• UoM was selected as a CTC
– A group of people from multiple Depts
– http://research.nvidia.com/content/cuda-teaching-centers
• Benefits
– Donation of hardware by NVIDIA (GeForce GTX 480s and Tesla C2070)
– Access to other resources
• Expectations
– Use of the resources for teaching/research, industry collaboration
GPU Computing: Introduction
• Graphics Processing Units (GPUs)
– High-performance many-core processors that can be used to accelerate a wide range of applications
• GPGPU: General-Purpose computation on Graphics Processing Units
• GPUs have led the race for floating-point performance since the start of the 21st century
• GPUs are being used as parallel processors
GPU Computing: Introduction
• General-purpose computing, until the end of the 20th century
– Relied on advances in hardware to increase the speed of software/apps
• Hardware speedups have slowed down since then due to
– Power consumption issues
– Limits on the productive work achievable within a single processor
• Switch to multi-core and many-core models
– Multiple processing units (processor cores) used in each chip to increase the processing power
– Impact on software developers?
GPU Computing: Introduction
• A sequential program will run on only one of the cores, which will not become any faster
• With each new generation of processors
– Software that continues to enjoy performance improvement will be parallel programs
– In these, multiple threads of execution cooperate to achieve the functionality faster
CPU-GPU Performance Gap
Source: CUDA Prog. Guide 4.0
GPGPU & CUDA
• The GPU is designed as a numeric computing engine
– Will not perform as well as CPUs on some tasks
– Most applications will use both CPUs and GPUs
• CUDA
– NVIDIA’s parallel computing architecture, aimed at increasing computing performance by harnessing the power of the GPU
– Also a programming model
More Details on GPUs
• A GPU is typically an add-in card installed in a PCI Express x16 slot
• Market leaders: NVIDIA, Intel, AMD (ATI)
– Example NVIDIA GPUs (donated to UoM): GeForce GTX 480 and Tesla C2070
Example Specifications
                                                 GTX 480           Tesla C2070
Peak double precision floating point performance 650 Gigaflops     515 Gigaflops
Peak single precision floating point performance 1300 Gigaflops    1030 Gigaflops
CUDA cores                                       480               448
Frequency of CUDA cores                          1.40 GHz          1.15 GHz
Memory size (GDDR5)                              1536 MB           6 GB
Memory bandwidth                                 177.4 GB/s        150 GB/s
ECC memory                                       No                Yes
CPU vs. GPU Architecture
The GPU devotes more transistors to computation
CPU-GPU Communication
CUDA Architecture
• CUDA is NVIDIA’s solution for accessing the GPU
• Can be seen as an extension to C/C++
CUDA Software Stack
CUDA Architecture
There are two main parts:
1. Host (CPU part)
- Single Program, Single Data
2. Device (GPU part)
- Single Program, Multiple Data
CUDA Architecture
The Grid
1. A group of threads all running the same kernel
2. Multiple grids can run at once
GRID Architecture
The Block
1. Grids are composed of blocks
2. Each block is a logical unit containing a number of coordinating threads and some amount of shared memory
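To illustrate how the threads of one block can coordinate through shared memory, here is a minimal sketch that sums an array within a single block. The names block_sum, sum_on_device and THREADS are made up for this illustration and do not come from the slides; error checking is reduced to the final copy to keep it short.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS 128   /* threads per block; a power of two, chosen for illustration */

/* Each thread loads one element into shared memory, then the threads of the
   block cooperate in a tree reduction to compute the sum. */
__global__ void block_sum(const int *in, int *out)
{
    __shared__ int cache[THREADS];   /* visible to all threads of this block */
    int t = threadIdx.x;

    cache[t] = in[t];
    __syncthreads();                 /* wait until every thread has stored its value */

    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (t < stride)
            cache[t] += cache[t + stride];
        __syncthreads();             /* all threads reach this barrier in every step */
    }

    if (t == 0)
        *out = cache[0];
}

/* Host-side helper (hypothetical): sums THREADS integers on the GPU.
   Returns 0 on success, -1 if the final copy reports an error. */
int sum_on_device(const int *data, int *result)
{
    int *dev_in = NULL, *dev_out = NULL;

    cudaMalloc((void **)&dev_in, THREADS * sizeof(int));
    cudaMalloc((void **)&dev_out, sizeof(int));
    cudaMemcpy(dev_in, data, THREADS * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<1, THREADS>>>(dev_in, dev_out);   /* one block of THREADS threads */

    cudaError_t err = cudaMemcpy(result, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return (err == cudaSuccess) ? 0 : -1;
}

int main(void)
{
    int data[THREADS], sum = 0;

    for (int i = 0; i < THREADS; ++i)
        data[i] = i + 1;

    if (sum_on_device(data, &sum) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("sum = %d\n", sum);
    return 0;
}
```

Threads in different blocks cannot share this cache array; each block gets its own copy, which is why the block is the unit of cooperation.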
Some Applications of GPGPU
Computational Structural Mechanics
Bio-Informatics and Life Sciences
Computational Electromagnetics and Electrodynamics
Computational Finance
Some Applications…
Computational Fluid Dynamics
Data Mining, Analytics, and Databases
Imaging and Computer Vision
Medical Imaging
Some Applications…
Molecular Dynamics
Numerical Analytics
Weather, Atmospheric, Ocean Modeling and Space Sciences
CUDA Programming Basics
Accessing/Using the CUDA-GPUs
• You have been given access to our cluster
– User accounts on 192.248.8.13x
– It is a Linux system
• CUDA Toolkit and SDK for development
– Includes CUDA C/C++ compiler for GPUs (“nvcc”)
– Will need C/C++ compiler for CPU code
• NVIDIA device drivers needed to run programs
– For programs to communicate with hardware
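One way to check that the driver and toolkit are working on a machine is to list the GPUs through the CUDA runtime API. The sketch below assumes it is saved as devquery.cu (a hypothetical file name) and compiled with "nvcc devquery.cu -o devquery"; the helper count_devices is made up for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Returns the number of CUDA-capable devices, or -1 if the runtime reports
   an error (for example, when no NVIDIA driver is installed). */
int count_devices(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return count;
}

int main(void)
{
    int count = count_devices();
    if (count < 0)
        return 1;

    printf("Found %d CUDA device(s)\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;   /* skip devices whose properties cannot be read */

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory: %llu MB\n",
               (unsigned long long)(prop.totalGlobalMem >> 20));
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```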
Example Program 1
    #include <cuda.h>
    #include <stdio.h>

    // Empty kernel: compiled for and executed on the device (GPU)
    __global__ void kernel (void)
    { }

    int main (void)
    {
        kernel <<< 1, 1 >>> ();      // launch the kernel: grid size 1, block size 1
        printf("Hello World!\n");    // printed by the host (CPU)
        return 0;
    }

• “__global__” says the function is to be compiled to run on a “device” (GPU), not the “host” (CPU)
• Angle brackets “<<<” and “>>>” pass params/args to the runtime
A function executed on the GPU (device) is usually called a “kernel”
Example Program 2 – Part 1
As can be seen in the next slide:
• We can pass parameters to a kernel as we would with any C function
• We need to allocate memory to do anything useful on a device, such as returning values to the host
Example Program 2 – Part 2
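    int main (void) {
        int c, *dev_c;

        // Allocate space for one int in device (GPU) memory
        cudaMalloc ((void **) &dev_c, sizeof (int));

        // Launch the kernel; the result is written to dev_c on the device
        add <<< 1, 1 >>> (2, 7, dev_c);

        // Copy the result from device memory back to the host
        cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("2 + 7 = %d\n", c);

        cudaFree(dev_c);
        return 0;
    }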
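The transcript does not show the body of the add kernel. A complete version might look like the sketch below, where the kernel body is an assumption based on how main calls it, and add_on_device is a hypothetical helper introduced here to add basic error checking.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Assumed kernel (its body is not in the transcript): adds two integers on
   the device and stores the result through a device pointer. */
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

/* Hypothetical helper: runs add() on the GPU and copies the result back.
   Returns 0 on success, -1 on failure. */
int add_on_device(int a, int b, int *result)
{
    int *dev_c = NULL;

    if (cudaMalloc((void **)&dev_c, sizeof(int)) != cudaSuccess)
        return -1;

    add<<<1, 1>>>(a, b, dev_c);

    /* cudaMemcpy waits for the kernel to finish before copying */
    cudaError_t err = cudaMemcpy(result, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c);
    return (err == cudaSuccess) ? 0 : -1;
}

int main(void)
{
    int c = 0;

    if (add_on_device(2, 7, &c) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("2 + 7 = %d\n", c);
    return 0;
}
```

Note that dev_c is a device pointer: the host may pass it to kernels and to cudaMemcpy, but must not dereference it directly.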
Example Program 3
Within host (CPU) code, call the kernel using <<< and >>>, specifying the grid size (number of blocks) and/or the block size (number of threads per block); more details later
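The code for this example is not in the transcript. As an illustration of setting the grid size in the launch, the sketch below adds two vectors using N blocks of one thread each. The names vec_add, vec_add_on_device and N are assumptions made for this sketch, and error checking is reduced to the final copy.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 10   /* number of elements, and number of blocks in the grid */

/* Each block handles one element; blockIdx.x identifies the block. */
__global__ void vec_add(const int *a, const int *b, int *c)
{
    int i = blockIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

/* Hypothetical helper: adds two N-element host arrays on the GPU.
   Returns 0 on success, -1 if the final copy reports an error. */
int vec_add_on_device(const int *a, const int *b, int *c)
{
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
    size_t bytes = N * sizeof(int);

    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);
    cudaMalloc((void **)&dev_c, bytes);
    cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);

    /* First launch parameter: grid size (N blocks).
       Second launch parameter: block size (1 thread per block). */
    vec_add<<<N, 1>>>(dev_a, dev_b, dev_c);

    cudaError_t err = cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return (err == cudaSuccess) ? 0 : -1;
}

int main(void)
{
    int a[N], b[N], c[N];

    for (int i = 0; i < N; ++i) {
        a[i] = i;
        b[i] = i * i;
    }
    if (vec_add_on_device(a, b, c) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < N; ++i)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```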
Example Program 3 (contd.)
Note:
Details on threads and thread IDs will come later
Example Program 4
Grids, Blocks and Threads
• A grid of size 6 (3x2 blocks)
• Each block has 12 threads (4x3)
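A layout like this can be expressed with the dim3 type in the kernel launch. The sketch below assumes 3 blocks along x and 2 along y, and 4 threads along x and 3 along y in each block, giving 6 x 12 = 72 threads; the names fill_ids and fill_on_device are made up for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define GRID_X  3
#define GRID_Y  2
#define BLOCK_X 4
#define BLOCK_Y 3
#define TOTAL (GRID_X * GRID_Y * BLOCK_X * BLOCK_Y)   /* 6 blocks x 12 threads */

/* Each thread computes its global (x, y) position from its block and thread
   indices, and writes a unique id into the output array. */
__global__ void fill_ids(int *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column across the whole grid */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row across the whole grid */
    int width = gridDim.x * blockDim.x;              /* threads per row of the grid */

    out[y * width + x] = y * width + x;
}

/* Hypothetical helper: launches the 3x2 grid of 4x3 blocks and copies the ids
   back. Returns 0 on success, -1 if the final copy reports an error. */
int fill_on_device(int *host_out)
{
    int *dev_out = NULL;
    cudaMalloc((void **)&dev_out, TOTAL * sizeof(int));

    dim3 grid(GRID_X, GRID_Y);      /* grid size: 3x2 blocks */
    dim3 block(BLOCK_X, BLOCK_Y);   /* block size: 4x3 threads */
    fill_ids<<<grid, block>>>(dev_out);

    cudaError_t err = cudaMemcpy(host_out, dev_out, TOTAL * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return (err == cudaSuccess) ? 0 : -1;
}

int main(void)
{
    int ids[TOTAL];

    if (fill_on_device(ids) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("threads launched: %d, last id: %d\n", TOTAL, ids[TOTAL - 1]);
    return 0;
}
```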
Conclusion
• In this session we discussed
– Introduction to GPU Computing
– GPU Computing with CUDA
– CUDA Programming Basics
• Next session
– Data Parallelism
– CUDA Programming Model
– CUDA Threads
References for this Session
• Chapters 1 and 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Chapters 1-4 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010
• Chapters 1-2 of: NVIDIA CUDA C Programming Guide, NVIDIA Corporation, 2006-2011 (Versions 3.2 and 4.0)