Transcript UNIT 3

• The death of the single-core solution, NVIDIA and CUDA, GPU hardware, alternatives to CUDA, understanding parallelism with GPUs, CUDA hardware overview.
• Parallel architectures and programming principles: parallel computing, parallel architecture, architectural classification schemes, parallel programming models, parallel algorithms, performance analysis of parallel algorithms.
Single Core Processor:
• On a single core, handling more than one task at a time is achieved by concurrency (interleaving tasks), not true parallelism.
The Death of the Single-Core Solution
• The efficiency of a CPU depends on its clock rate.
• As the clock rate increases, the time to execute any task decreases.
• At higher clock rates the CPU generates too much heat, which requires expensive cooling solutions.
• Power consumption also rises: because supply voltage must typically be raised along with frequency, power consumption grows approximately as the cube of the clock rate.
Multiple Cores
• Writing programs that exploit multiple cores is a challenge.
NVIDIA
• NVIDIA Corporation is an American global technology company, founded in 1993.
• NVIDIA manufactures Graphics Processing Units (GPUs) and System-on-a-Chip (SoC) units.
• It provides parallel processing capabilities to researchers.
Graphics Processing Unit
• A GPU is a specialized electronic circuit designed for rapidly accessing and manipulating memory.
• It is designed to manipulate computer graphics efficiently using its highly parallel structure.
• It is a single-chip processor with integrated transform, lighting, triangle setup, clipping, and rendering engines.
Manufacturers of GPUs
• NVIDIA
• Intel
• AMD
• S3 Graphics
• Matrox
NVIDIA Series
• GeForce
• Quadro
• Tesla
• Tegra
• The number of cores is more than 4000 in some GPUs.
Architectures of CPU and GPU
Graphics Card
• A graphics card consists of a number of Streaming Multiprocessors (SMs).
• Attached to each SM are eight or more Scalar Processors (SPs).
• Each Streaming Multiprocessor has an instruction unit, called the Multi-Threaded Instruction Unit.
GPGPU (General-Purpose Computing on GPUs)
• CUDA on NVIDIA GPUs
GPU Computing
CUDA
[Compute Unified Device Architecture]
• CUDA is a parallel computing platform and programming model created by NVIDIA.
• It is used to execute general-purpose programs that can run in parallel, for example:
• Fast sort algorithms on large lists, such as quick sort and bitonic sort.
• Two dimensional fast wavelet transform.
• Molecular dynamics simulations.
• Image and video processing.
• Computational biology and chemistry.
• The multithreaded tasks are known as Kernels in CUDA.
• The CPU spawns (launches) Kernels onto the GPU.
• The GPU has its own internal scheduler, which allocates the Kernels to the available hardware.
• If there are more threads than processors, some processors are assigned more than one thread.
• This is in line with the pigeonhole principle.
• Programs written in CUDA are compiled by NVIDIA's nvcc compiler driver.
• The compilation process includes steps such as splitting, preprocessing, compiling, and merging of the source code.
• Programs are independent of the specific GPU model.
• Serial (host) code: VC++, g++, GCC
• Parallel (device) code: nvcc
• It is the task of the programmer to modify an algorithm so that it can run as parallel operations.
• CUDA ensures that these modified parallel operations run on simultaneous GPU threads.
Alternatives to CUDA
• OpenCL (Open Computing Language)
• DirectCompute
OpenCL
• An open, royalty-free standard supported by NVIDIA, AMD, and others.
• Its trademark is owned by Apple.
• It is a low-level API for heterogeneous computing that runs on GPUs.
• OpenCL provides parallel computing using task-based and data-based parallelism.
• It defines a C-like language for writing programs, called Kernels, that execute on compute devices.
• A compute device consists of many Processing Elements (PEs). A single Kernel execution runs on many PEs in parallel.
DirectCompute
• A tool made by Microsoft.
• Image processing: image reduction, histogram, convolution, etc.
• Video processing: transcoding, super-resolution.
• Physics-related effects: particles, smoke, water.
• Games and artificial intelligence.
• Ray tracing and radiosity.
Parallelism with GPUs
• Tools:
• OpenMP (Open Multiprocessing)
• MPI (Message Passing Interface)
• Pthreads (POSIX Threads)
• ZeroMQ (ØMQ or ZMQ)
• ZeroMQ is a high-performance asynchronous messaging library used in scalable distributed or concurrent applications.
• It provides a message queue and can run without a dedicated message broker. It was originally developed by iMatix.
• It behaves like an intelligent socket.
• It supports many kinds of connection patterns.
• It is multiplatform and supports 30+ languages.
• While multithreading, ZeroMQ reads a task from one of the n sockets.
• Each task is assigned to a thread.
• There are no wait states and no locks.
• The CPU is fully utilized, and it is scalable to any number of cores.
Concurrency using CUDA
• CUDA splits problems into grids and blocks, each
containing multiple threads.
• The blocks may run in any order.
• A block must execute from start to completion.
• It can run on any one of the SMS, to which it is
allocated.
• The allocated SM must have free slots.
• This allocation is done on a round robin basis
Computer generations (generation, duration, component details):
• First (1938-53): Vacuum tubes and magnetic drums; machine-language programming. Examples: UNIVAC, ENIAC.
• Second (1952-63): Transistors, smaller than vacuum tubes; assembly languages along with high-level languages like FORTRAN and COBOL were in use.
• Third (1962-75): Integrated circuits (MSI); transistors were miniaturized and placed on silicon chips, which increased speed and efficiency drastically. Time-sharing operating systems and virtual memory were developed.
• Fourth (1975-90): Microprocessors (VLSI); a complete processor and a large section of main memory on a single chip. CRT screens, laser printers, floppy disks, LANs, and WANs were developed.
• Fifth (1990-till date): Vector processors, object-oriented programming, and multiple pipelines; the World Wide Web was introduced.
• Sixth (proceeds in parallel with the fifth generation): Artificial-intelligence games, expert systems, neural networks, and robotics were developed.
Sequential vs Parallel Programming
Types of Parallelism:
• Task-based parallelism
• Data-based parallelism
Granularity of parallelism:
• Coarse grain:
– Program-level parallelism
– Process- or task-level parallelism
– Parallelism at the level of a group of statements
– Statement-level parallelism (parallelism within a statement)
• Fine grain:
– Instruction-level parallelism (parallelism within an instruction)
– Logic- and circuit-level parallelism
Granularity of parallelism
• Program-level parallelism:
• Two programs p1.cpp and p2.cpp are executed on two processors simultaneously.
• Process- or task-level parallelism:
• Two functions fun1() and fun2() within a single program are executed in parallel.
Parallelism at the level of a group of statements
• for (i = 0; i < n; i++)
• {
– ----------
• }
• for (j = 0; j < k; j++)
• {
– ----------
• }
• The two loops are independent of each other, so they can execute on different processors in parallel.
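The two independent loops above can run concurrently. A minimal Python sketch using threads (the loop bodies and the names loop1/loop2 are illustrative placeholders for the dashed statements):

```python
import threading

squares, cubes = [], []

def loop1(n):                  # first independent loop
    for i in range(n):
        squares.append(i * i)

def loop2(k):                  # second independent loop
    for j in range(k):
        cubes.append(j ** 3)

t1 = threading.Thread(target=loop1, args=(4,))
t2 = threading.Thread(target=loop2, args=(3,))
t1.start(); t2.start()         # both loops execute concurrently
t1.join(); t2.join()           # wait for both to finish

print(squares, cubes)          # [0, 1, 4, 9] [0, 1, 8]
```

Because the loops touch disjoint data, no locking is needed here.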
Statement-level parallelism
• a = b + c;
• x = y * z;
• The two statements have no common operands, so they can execute in parallel.
Parallelism within a statement
• Consider the statement a = b + c + c * d
• The sub-expressions (b + c) and (c * d) can be evaluated in parallel.
• Instruction-level parallelism: parallelism within an instruction.
• Logic- and circuit-level parallelism.
Multiprocessor organizations:
• Time shared
• Message based
• Hybrid
• Cluster based
Data Flow Computer
• Consider the statements:
1. A = B + C
2. D = E - F
3. G = B * C
4. H = C / E
5. I = A * D
6. J = H + I
• Some valid execution orders (respecting the data dependences):
• (1, 2, 3, 4, 5, 6)
• (1, 2, 3, 5, 4, 6)
• (1, 2, 4, 3, 5, 6)
• (1, 2, 4, 5, 3, 6)
• (1, 3, 2, 4, 5, 6)
• DFG (Data Flow Graph)
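The legal execution orders are exactly the topological orders of the data-flow graph. A small sketch that checks an order against the read/write dependences of the six statements (B, C, E, F are the initially available inputs):

```python
# Each statement: (set of variables it reads, variable it writes).
stmts = {
    1: ({"B", "C"}, "A"),   # A = B + C
    2: ({"E", "F"}, "D"),   # D = E - F
    3: ({"B", "C"}, "G"),   # G = B * C
    4: ({"C", "E"}, "H"),   # H = C / E
    5: ({"A", "D"}, "I"),   # I = A * D
    6: ({"H", "I"}, "J"),   # J = H + I
}

def is_valid(order, initial=frozenset({"B", "C", "E", "F"})):
    ready = set(initial)                 # values available so far
    for s in order:
        reads, writes = stmts[s]
        if not reads <= ready:           # an operand is not computed yet
            return False
        ready.add(writes)
    return True

print(is_valid((1, 2, 3, 5, 4, 6)))   # True  (listed above)
print(is_valid((5, 1, 2, 3, 4, 6)))   # False (5 needs A and D first)
```

Statement 5 must follow 1 and 2, and 6 must follow 4 and 5; everything else may be reordered, which generates the orders listed above.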
Array processor
Systolic array
• A = a_n x^n + a_(n-1) x^(n-1) + a_(n-2) x^(n-2) + ... + a_1 x + a_0
• Horner's rule: A = ( ... ((a_n x + a_(n-1)) x + a_(n-2)) x + ... + a_1 ) x + a_0
• Each step computes t = s * x + a_k
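Horner's rule reduces the evaluation to repeated steps of the form t = s * x + a_k, one multiply-add per coefficient. A minimal sketch:

```python
def horner(coeffs, x):
    """Evaluate a_n x^n + ... + a_1 x + a_0.
    coeffs = [a_n, a_(n-1), ..., a_1, a_0]."""
    t = 0
    for a_k in coeffs:
        t = t * x + a_k    # one step: t = s*x + a_k
    return t

# 2x^2 + 3x + 4 at x = 5  ->  2*25 + 3*5 + 4 = 69
print(horner([2, 3, 4], 5))  # 69
```

It is this uniform multiply-add step that makes the scheme map naturally onto a systolic pipeline.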
Neural network
Architectural classification schemes
• Flynn's classification:
– SISD
– SIMD
– MISD
– MIMD
• Feng's classification:
– Depends on the degree of parallelism.
– Let Pi be the number of bits that can be processed within the i-th processor cycle.
– If there are T processor cycles, then the average degree of parallelism Pa is defined as Pa = (P1 + P2 + ... + PT) / T.
– Word Serial and Bit Serial (WSBS)
– Word Parallel and Bit Serial (WPBS)
– Word Serial and Bit Parallel (WSBP)
– Word Parallel and Bit Parallel (WPBP)
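Feng's average degree of parallelism Pa = (P1 + ... + PT) / T can be computed directly from the per-cycle bit counts; the cycle values below are made-up illustration data, not a real machine profile:

```python
def average_parallelism(bits_per_cycle):
    """Feng's Pa: mean number of bits processed per processor cycle."""
    T = len(bits_per_cycle)                 # number of processor cycles
    return sum(bits_per_cycle) / T          # (P1 + P2 + ... + PT) / T

# Hypothetical machine: bits processed in each of 4 processor cycles.
P = [16, 32, 32, 16]
print(average_parallelism(P))   # (16+32+32+16)/4 = 24.0
```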
Handler's Classification
• T(C) = <K x K', D x D', W x W'>
• where K is the number of processor control units, D the number of ALUs per control unit, and W the word length; the primed values give the degree of pipelining at each level.
Johnson's Classification
Shore's Classification
• Machines I, II, III, IV, V, VI
Parallel programming models
• Shared memory model
• Message passing model
• Threads model
• Data parallel model
Parallel algorithms
• Sequential Vs parallel algorithms
Performance analysis of parallel
algorithms
• Computation step
• Communication step
• The running time depends on:
(i) Size of the input
(ii) Type of the input
(iii) Machine used for execution
• Ω(f(n)): g(n) ∈ Ω(f(n)) if there exist constants c > 0 and n0 such that g(n) ≥ c·f(n) for all n ≥ n0.
• O(f(n)): g(n) ∈ O(f(n)) if there exist constants c > 0 and n0 such that g(n) ≤ c·f(n) for all n ≥ n0.
• Θ(f(n)): g(n) ∈ Θ(f(n)) if g(n) ∈ O(f(n)) and g(n) ∈ Ω(f(n)).
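The asymptotic definitions can be checked numerically for small cases; g(n) = 3n + 2 together with the witnesses c = 4 and n0 = 2 is an illustrative choice showing g(n) ∈ O(n):

```python
def g(n):
    return 3 * n + 2   # example running-time function

def f(n):
    return n           # candidate bound

c, n0 = 4, 2           # witnesses: 3n + 2 <= 4n holds exactly when n >= 2
print(all(g(n) <= c * f(n) for n in range(n0, 1000)))  # True
print(g(1) <= c * f(1))                                # False: n0 matters
```

The same pattern with the inequality reversed checks an Ω bound.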
• Speedup: S = (running time of the best sequential algorithm) / (running time of the parallel algorithm)
• Number of processors used
• Efficiency: E = S / n
• Amdahl's law: S = 1 / ((1 - P) + P / n), where P is the fraction of the program that can be parallelized and n is the number of processors.
• Gustafson's law: S = n - α(n - 1), where α is the serial (non-parallelizable) fraction and n is the number of processors.
• Folk's theorems
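Amdahl's and Gustafson's laws can be evaluated numerically; the sketch below assumes the standard forms S = 1 / ((1 - P) + P/n) and S = n - α(n - 1), and the 90%-parallel workload is an illustrative value:

```python
def amdahl(P, n):
    """Amdahl's law: P = parallelizable fraction, n = processors."""
    return 1.0 / ((1.0 - P) + P / n)

def gustafson(alpha, n):
    """Gustafson's law: alpha = serial fraction, n = processors."""
    return n - alpha * (n - 1)

# A program that is 90% parallel, run on 8 processors:
print(round(amdahl(0.9, 8), 2))   # 4.71 (fixed problem size)
print(gustafson(0.1, 8))          # 7.3  (scaled problem size)
```

The contrast is the point: for the same serial fraction, Amdahl's fixed-size speedup saturates well below n, while Gustafson's scaled-size speedup stays close to n.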