GPUs (FirePro) and Intro to OpenCL
GPU – History
First true 3D graphics started with early
display controllers (video shifters)
They acted as a pass-through between the CPU
and the display
RCA's "Pixie" video chip (CDP1861) in 1976 was
capable of outputting a video signal at 64x128
resolution
In 1977 this chip was followed by the Television
Interface Adapter (TIA) 1A
GPU – History
The TIA was integrated into the Atari 2600 to
generate the screen display and sound effects
and to read input from the controllers
GPU – History
Rendering basically meant transforming vertices into pixels
Computing this "by hand" (on the CPU) was very slow
From the early '80s to the late '90s, GPUs worked
with a fixed-function pipeline
GPU – History
Later, general programmability was extended to
the shader stages
Data independence was also explored
In 2006 the NVIDIA GeForce 8800 mapped the
separate graphics stages to a unified array of
processors (for vertex shading, geometry, and
pixel processing)
In 2007 NVIDIA released CUDA
GPU
A GPU is similar to a CPU, but designed
specifically for computing very complex
mathematical and geometric calculations
In the past all this work was done by the CPU,
which put strain on the CPU and degraded
performance
The GPU improves performance because its
parallel processing architecture allows it
to perform multiple calculations at the same time
GPU
Some of the fastest GPUs have many more
transistors than the average CPU
Because of its intensive calculations and speed, a
GPU produces a lot of heat, so on the motherboard it is
usually located under a heat sink or fan
GPUs typically interface with the motherboard using a
PCI Express bus or Accelerated Graphics Port (AGP) and
can be replaced or upgraded easily
Multiple GPUs can draw images
simultaneously to the screen, increasing the
processing power (Google Tango Project)
GPU vs CPU
Architecturally, a CPU
consists of a few cores
that can handle
multiple threads at
a time
A GPU consists of
hundreds of cores
that can handle
thousands of
threads
simultaneously
GPU vs CPU
The discrepancy in floating-point capability between CPU
and GPU exists because the GPU is specialized for compute-intensive,
highly parallel computation - exactly what
graphics rendering is about - so 80% of its transistors
are devoted to data processing
GPU vs CPU
The same function is executed on each element of
the data, with high arithmetic intensity
Benefits of using GPU vs CPU
The GPU has many benefits, such as more
computing power, larger memory
bandwidth, and lower power
consumption, but alongside its high
computing power there are some constraints
Developing code for the GPU takes more
time and requires highly skilled work
GPU code runs in parallel, so data
partitioning and synchronization are needed
Benefits of using GPU vs CPU
It is hard to say which is better, since it
is application dependent
Simply put, the GPU is very good at following a
straight line of processing, but not so
good at handling divergent
processing paths
Code should be executed on the GPU when
it must be executed many times in
parallel
Benefits of using GPU vs CPU
For example, we can blend pixels from A and B
and store them all in C
This task, when executed on the CPU, would be:

    for (int i = 0; i < pixelCount; i++)
        C[i] = A[i] + B[i];

This code can be slow when there are many pixels
Benefits of using GPU vs CPU
We take the code C[i] = A[i] + B[i]; and populate the cores with
it, assigning a value of i to each
This is where the GPU is at its best, because all cores execute
the program at the same time
An example where the GPU is not very fast is conditional branching,
which implies making a copy of the program that follows branch
A and populating all cores with this code
Execute until the first logical operation
Evaluate all elements, continue processing all elements
that follow branch A, and enqueue all processes that chose
path B
The problem is that there is no program for B loaded, and now all
cores that chose B must sit idle (see the sketch below)
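A minimal OpenCL C sketch of such divergence (the kernel name and
the threshold condition are made up for illustration): work items
whose condition evaluates differently within the same group
serialize branches A and B.

    __kernel void branch_demo(__global const float* in,
                              __global float* out,
                              const int num)
    {
        int i = get_global_id(0);
        if (i < num) {
            if (in[i] > 0.5f)          /* branch A */
                out[i] = in[i] * 2.0f; /* cores taking A run first */
            else                       /* branch B */
                out[i] = 0.0f;         /* cores taking B wait, then run */
        }
    }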
Benefits of using GPU vs CPU
What is the possible worst case from the previous point?
Only one core executes the A branch while all
the others idle
Once the cores executing A are done, we could
activate the branch-B version of the program
(copying it from the memory buffer to core
memory)
Execute the B branches and, if needed, merge the
results
Benefits of using GPU vs CPU
The GPU is designed for multithreaded
calculations
GPU makers can easily add more cores
whenever they want to add computational
power, but the problem is that some problems
cannot be divided into smaller problems
Lecture point: not every problem lends itself to
parallelism
Ex: the nth number in the Fibonacci series (the CPU is much
faster here, as the sketch below illustrates)
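A minimal C sketch of why this is inherently sequential: each
iteration depends on the two values before it, so the loop cannot
be split across cores.

    /* nth Fibonacci number: the loop carries a data dependency,
       so no two iterations can run in parallel */
    unsigned long fib(int n)
    {
        unsigned long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            unsigned long next = a + b; /* depends on the previous two */
            a = b;
            b = next;
        }
        return a;
    }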
Benefits of using GPU vs CPU
The GPU can be more efficient for other reasons
besides parallel computing:
More restrictive memory access
Does not support as many data types
GPUs have limited instruction sets that perform
specialized calculations
GPUs are highly optimized for floating-point
calculations
Integer calculations are not necessarily
faster on the GPU
AMD FirePro™ D-Series GPU
The newest star GPU in the new Mac Pro
3 models: D300, D500, D700
The main differences among them are the number
of stream processors, the VRAM, the width of the
memory bus, the memory bandwidth, and
teraflop performance
More processing power for video editing,
3D modeling and animation, and
photography
GPU computing using OpenCL (more on it
later)
AMD FirePro™ D-Series GPU
The architecture of this particular GPU supports
OpenCL 2.0; even the lower-end D300 model
supports a 256-bit memory bus that delivers
160 GB per second of memory bandwidth,
meaning large amounts of data can be
read quickly
With support for OpenCL 2.0 it is now
possible for an application to run on both the GPU
and the CPU simultaneously; AMD refers
to this as Accelerated Parallel
Processing
GPU Accelerated Computing
It is basically the use of the GPU together with the
CPU to accelerate scientific, analytics,
consumer, and enterprise applications
Started in 2007 by NVIDIA
GPUs are currently accelerating
applications on platforms ranging from cars and
mobile phones to risk management
GPU-accelerated computing offers better
performance by offloading compute-intensive
portions of an application to the GPU
while the remainder of the code still runs on the
CPU
GPU Accelerated Computing
We can see from the
image on the left
that some parts of the
code run on the GPU
and some parts run
on the CPU; from the
user's perspective
the application simply
runs faster
OpenCL
Open Computing Language
From the makers of OpenGL
Wide industry support: AMD, Apple,
NVIDIA, Samsung, etc.
OpenCL model:
OpenCL Architecture
The host controls multiple compute devices
OpenCL Architecture
Each of these compute devices consists of
multiple compute units
Compute units (execution units and
arithmetic processing units) contain
processing elements
Processing elements execute OpenCL
kernels (these are just functions written by the
programmer in the OpenCL language - C with
restrictions and special keywords and data
types)
Kernels are the basic unit of executable code
A program is a collection of kernels and other
functions
OpenCL Architecture
We should also be aware that an OpenCL
program is divided into two parts:
One part that executes on the device (GPU)
A second part that executes on the host
(CPU)
The device part is where we need to write the
special functions called kernels
OpenCL Architecture – Device
The device is the GPU
A kernel is written, which is a function
executed on the GPU (and there can be more than one)
Kernels are the entry points into the device
program (the only functions that can be
called from the host)
We need to program the kernels ourselves
How to program a kernel - SIMT
SIMT: Single Instruction, Multiple Thread,
which reflects how instructions are
executed on the device
The same code is executed in parallel by
different threads, and each thread
executes the code with different data
How to program a kernel – Work Item
Work items are equivalent to threads
and are the smallest execution entity
Every time a kernel is launched, lots of
work items (a number specified by the
programmer) are launched, each one
executing the same code
Each work item has an ID, which is
accessible from the kernel and is used
to distinguish the data to be processed
by each work item
How to program a kernel – Work Group
Work groups exist to allow
communication and cooperation
between work items
They also reflect how work items are
organized: an N-dimensional grid of work items
within each group (N = 1, 2, or 3)
Work groups also have an ID, which can be
queried from the kernel (see the sketch below)
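A minimal sketch of querying these IDs inside a kernel (the kernel
name is made up for illustration):

    __kernel void ids_demo(__global int* out)
    {
        int gid   = get_global_id(0);   /* unique ID across the whole ND range */
        int lid   = get_local_id(0);    /* ID within the work group */
        int group = get_group_id(0);    /* ID of the work group itself */
        int lsize = get_local_size(0);  /* work items per group in dimension 0 */
        out[gid] = group * lsize + lid; /* equals gid when the global offset is zero */
    }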
How to program a kernel – ND Range
The ND range is the next
organizational level,
specifying how work
groups are
organized:
an N-dimensional grid
of work groups,
where N = 1, 2, or 3
Kernel Example - CPU
    void vector_add_cpu (const float* src_a,
                         const float* src_b,
                         float* res,
                         const int num)
    {
        for (int i = 0; i < num; i++)
            res[i] = src_a[i] + src_b[i];
    }

This is the CPU version of a kernel that adds 2 vectors. Here we basically
have one thread iterating through all the elements.
Kernel Example - GPU
    __kernel void vector_add_gpu (__global const float* src_a,
                                  __global const float* src_b,
                                  __global float* res,
                                  const int num)
    {
        /* get_global_id(0) returns the ID of the thread in execution.
           As many threads are launched at the same time, executing the same kernel,
           each one will receive a different ID, and consequently perform a different
           computation. */
        const int idx = get_global_id(0);

        /* Now each work item asks itself: "is my ID inside the vector's range?"
           If the answer is YES, the work item performs the corresponding computation. */
        if (idx < num)
            res[idx] = src_a[idx] + src_b[idx];
    }
Kernel Example - GPU
Each thread computes one element
"__kernel" is the reserved word that specifies
that the function is a kernel
Kernel functions always return void
In a similar way we can program the host
side as well, as the sketch below shows
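A minimal host-side sketch for launching vector_add_gpu (assuming a
context, command queue, and built program already exist; the function
name run_vector_add is made up, and error checking is omitted for
brevity):

    #include <CL/cl.h>

    void run_vector_add(cl_context context, cl_command_queue queue,
                        cl_program program, const float* a, const float* b,
                        float* c, int num)
    {
        size_t bytes = num * sizeof(float);

        /* Allocate device buffers and copy the input vectors to the GPU */
        cl_mem src_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      bytes, (void*)a, NULL);
        cl_mem src_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      bytes, (void*)b, NULL);
        cl_mem res   = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        /* Create the kernel object and bind its four arguments */
        cl_kernel kernel = clCreateKernel(program, "vector_add_gpu", NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &src_a);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &src_b);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &res);
        clSetKernelArg(kernel, 3, sizeof(int), &num);

        /* Launch one work item per element, then read the result back */
        size_t global_size = num;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, res, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

        clReleaseKernel(kernel);
        clReleaseMemObject(src_a);
        clReleaseMemObject(src_b);
        clReleaseMemObject(res);
    }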
Parallel Processing and OpenCL
The OpenCL data-parallel programming
model is very hierarchical, and the hierarchy can be
specified in two ways:
Explicitly - the programmer defines the total
number of work items to execute in parallel as
well as how to group them
Implicitly - the programmer defines the total
number of work items to execute in parallel
and OpenCL manages grouping them
OpenCL and Synchronization
The two domains of synchronization in
OpenCL are work items within a single work
group and command queues within a single
context
Work-group barriers enable
synchronization of work items in a work
group - barrier()
Barriers and memory fences synchronize
threads in a work group
All threads are required to reach the barrier
before any of them can continue
OpenCL and Synchronization
Built-in functions: mem_fence() and barrier()
mem_fence(CLK_LOCAL_MEM_FENCE and/or
CLK_GLOBAL_MEM_FENCE)
waits until all reads/writes to local and/or global
memory made by the calling work item prior to
mem_fence() are visible to all threads in the work
group
barrier(CLK_LOCAL_MEM_FENCE and/or
CLK_GLOBAL_MEM_FENCE)
waits until all work items in the work group have
reached this point, and calls mem_fence(CLK_LOCAL_MEM_FENCE
and/or CLK_GLOBAL_MEM_FENCE);
barrier() in use is sketched below
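A minimal sketch (the kernel name is made up for illustration): work
items stage data in local memory and must all reach the barrier before
any of them reads a neighbour's slot.

    __kernel void shift_left(__global const float* in,
                             __global float* out,
                             __local float* tmp)
    {
        int gid   = get_global_id(0);
        int lid   = get_local_id(0);
        int lsize = get_local_size(0);

        tmp[lid] = in[gid];            /* each work item fills one slot */
        barrier(CLK_LOCAL_MEM_FENCE);  /* wait until every slot is written */

        /* only after the barrier is it safe to read another work item's slot */
        out[gid] = tmp[(lid + 1) % lsize];
    }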
OpenCL and Synchronization
There are two types of synchronization between
commands in a command queue:
Command-queue barrier - enforces
ordering within a single queue; any
resulting changes in memory are available
to the next command in the queue
Events - enforce ordering between or
within queues
Enqueued commands in OpenCL return an
event identifying the command as well as the
memory objects updated by it
OpenCL – Memory Model
OpenCL has 4 address spaces:
private - specific to a work item and not
visible to other work items
local - specific to a work group and
accessible only to work items belonging to
that work group
global - accessible to all work items
executing in a context, as well as to the host
constant - a read-only region for host-allocated
objects that are not changed
during kernel execution
(see the sketch below)
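A minimal sketch of the four address-space qualifiers (the kernel name
is made up for illustration):

    __kernel void spaces_demo(__global float* data,     /* global: visible to all work items and the host */
                              __constant float* coeffs, /* constant: read-only during execution */
                              __local float* scratch)   /* local: shared within one work group */
    {
        int gid = get_global_id(0);
        float x = data[gid];          /* private: x lives in per-work-item memory */
        scratch[get_local_id(0)] = x * coeffs[0];
        barrier(CLK_LOCAL_MEM_FENCE);
        data[gid] = scratch[get_local_id(0)];
    }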
OpenCL – Memory Model
There is also a host-accessible region for
application data structures and program
data
PCIe memory is the part of host (CPU) memory
accessible from, and modifiable by, both the host
program and the GPU device
Modifying this memory requires
synchronization between the GPU compute
device and the CPU
OpenCL – Communication
Communication and data transfer
between the host and the GPU occur over the PCIe
channel
Actual transfer performance is CPU
dependent
Transfers from the host to the GPU are
done by the command processor
The GPU device can read and write system
memory directly through kernel
instructions over the PCIe bus
OpenCL – Processing API Calls
The host application does not interact with the GPU
device directly (it works with data structures on the host)
A driver layer translates and issues commands
to the hardware
Most commands to the GPU are buffered in a
command queue on the host side
The queue of commands is sent to and processed
by the GPU
There is no guarantee as to when commands
from the command queue are executed, only
that they are executed in order (see the sketch below)
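A minimal sketch of this queueing behaviour (assuming queue, kernel,
buf, host_ptr, bytes, and global_size exist as in the earlier
host-side sketch): the three commands are only enqueued, and
clFinish blocks until the queue has drained.

    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL);
    clFinish(queue); /* no guarantee when they ran, only that they ran in order */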
OpenCL – Scheduling
GPU devices are very efficient at
parallelizing large numbers of work items in a
manner transparent to the application
Each GPU device uses a large number of
wavefronts to hide memory-access
latencies: the scheduler switches the
active wavefront on a given compute unit
whenever the current wavefront is waiting
for a memory access to complete
Data Parallelism in OpenCL
Define an N-dimensional computation domain
(N = 1, 2, or 3)
Each independent element of execution in the
ND domain is called a work item
The ND domain defines the total number of
work items that execute in parallel
E.g., process a 1024 x 1024 image. Global
problem dimensions:
1024 x 1024 = 1 kernel execution per pixel:
1,048,576 total executions (launching this is sketched below)
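A minimal host-side sketch of launching that 2D domain (assuming
queue and kernel exist):

    size_t global[2] = {1024, 1024};         /* one work item per pixel */
    clEnqueueNDRangeKernel(queue, kernel, 2, /* 2-dimensional ND range */
                           NULL, global, NULL, 0, NULL, NULL);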
Data Parallelism in OpenCL
Kernels are executed across a global domain
of work items
The global dimensions define the range of
computation: one work item per
computation, executed in parallel
Work items are grouped into local
workgroups
The local dimensions define the size of the
workgroups
Workgroups are executed together on one device and
share local memory and synchronization
OpenCL C (quick glance)
Derived from ISO C99 (with some
restrictions)
Language features added: work items
and work groups, vector types, and
synchronization
Includes a large set of built-in functions for
image manipulation, work-item
manipulation, and math
OpenCL C language restriction
Pointers to functions are not allowed
Pointers to pointers are allowed within a
kernel, but not as a kernel argument
Variable-length arrays and structures are
not supported
Recursion is not supported
3D image writes are not supported
OpenCL C optional extension
Extensions are optional features
exposed through OpenCL
The OpenCL working group has already
approved many extensions to the
OpenCL specification, such as double-precision
floating-point types, built-in
functions to support doubles, and byte-addressable
stores (writes to pointers to
types < 32 bits); see the sketch below
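A minimal sketch of enabling the double-precision extension in kernel
source before using the double type (the kernel name is made up for
illustration):

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void scale(__global double* v, const double f)
    {
        v[get_global_id(0)] *= f; /* double arithmetic requires cl_khr_fp64 */
    }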
Work Items and Work Groups
OpenCL Data Types
Scalar data types (bool, char, cl_char,
unsigned char, uchar, cl_uchar, short,
cl_short, unsigned short, etc.)
Image types (image2d_t, image3d_t,
image2d_array_t, image1d_t, etc.)
Vector data types (charn, ucharn,
shortn, ushortn, intn, uintn, etc.)
Supported values of n are 2, 3, 4, 8, and
16 for all vector data types (see the sketch below)
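A minimal sketch of vector types in use: a float4 holds four floats,
and arithmetic on it is componentwise (the kernel name is made up for
illustration):

    __kernel void add4(__global const float4* a,
                       __global const float4* b,
                       __global float4* c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];              /* four additions in one operation */
        float4 ones = (float4)(1.0f, 1.0f, 1.0f, 1.0f);
        c[i] += ones;                    /* componentwise again */
    }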
Q&A