
Challenges and Opportunities in Heterogeneous Multi-core Era
R. Govindarajan
HPC Lab,SERC, IISc
[email protected]
Garuda-PM July 2011
(C)RG@SERC,IISc
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Moore’s Law : Transistors
Moore’s Law : Performance
Processor performance doubles every 1.5 years
Overview
• Introduction
• Trends in Processor Architecture
– Instruction Level Parallelism
– Multi-Cores
– Accelerators
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Progress in Processor Architecture
• More transistors → New architecture innovations
– Multiple Instruction Issue processors
• VLIW
• Superscalar
• EPIC
– More on-chip caches, multiple levels of cache
hierarchy, speculative execution, …
⇒ Era of Instruction Level Parallelism
Moore’s Law: Processor Architecture Roadmap (Pre-2000)
[Roadmap figure: superscalar and EPIC processors, i.e., the ILP era]
Multicores : The Right Turn
[Figure: performance roadmap. Single-core frequency scaling (1 GHz, 3 GHz, a projected 6 GHz on one core) gives way to multi-core scaling at a constant 3 GHz (2, 4, and 16 cores).]
Era of Multicores (Post 2000)
• Multiple cores in a single die
• Early efforts utilized multiple cores for multiple programs
• Throughput-oriented rather than speedup-oriented!
Moore’s Law: Processor Architecture Roadmap (Post-2000)
[Roadmap figure: superscalar and EPIC processors, extended post-2000 by multi-cores]
Progress in Processor Architecture
• More transistors → New architecture innovations
– Multiple Instruction Issue processors
– More on-chip caches
– Multi-cores
– Heterogeneous cores and accelerators
• Graphics Processing Units (GPUs)
• Cell BE, ClearSpeed
• Larrabee or SCC
• Reconfigurable accelerators …
⇒ Era of Heterogeneous Accelerators
Moore’s Law: Processor Architecture Roadmap (Post-2000)
[Roadmap figure: superscalar and EPIC processors, now joined by accelerators]
Accelerators
Accelerators: Hype or Reality?
Some Top500 Systems (June 2011 List)
Rank  System           Description                            # Procs.          R_max (TFLOPS)
2     Tianhe           Xeon + Nvidia Tesla C2050 GPUs         186,368           2,566
4     Nebulae-Dawning  Intel X5650, Nvidia Tesla C2050 GPU    55,680 + 64,960   1,271
7     Tsubame          Xeon + Nvidia GPU                      73,278            1,192
10    Roadrunner       Opteron + Cell BE                      6,480 + 12,960    1,105
Accelerators – Cell BE
Accelerator – Fermi S2050
How is the GPU connected?
[Block diagram: CPU cores C0–C3, shared last-level cache (SLC), integrated memory controller (IMC), and memory]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Handling the Multi-Core Challenge
• Shared and Distributed Memory Programming Languages
– OpenMP
– MPI
• Other Parallel Languages (partitioned global address space languages)
– X10, UPC, Chapel, …
• Emergence of Programming Languages for GPUs (a minimal CUDA sketch follows below)
– CUDA
– OpenCL
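To make the GPU programming model concrete, here is a minimal, self-contained CUDA sketch of the kind of data-parallel kernel these languages target. It is a generic vector addition written for this transcript, not code from the talk.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one output element: classic CUDA data parallelism.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // explicit CPU -> GPU copies
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Execution configuration: enough 256-thread blocks to cover n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU copy of the result
    printf("c[12345] = %f\n", hc[12345]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The host explicitly allocates device memory, copies inputs over, picks an execution configuration, and copies the result back; these are exactly the steps the programming challenges on the following slides revolve around.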
GPU Programming: Good News
• Emergence of Programming Languages for GPU
– CUDA
– OpenCL, an open standard
• Growing code base
– CUDA Zone
– GPU-enabled packages from ISVs
• Impressive performance
– Yes!
• What about Programmer Productivity?
GPU Programming: Boon or Bane
• Challenges in GPU programming (several are illustrated in the sketch below)
– Managing parallelism across SMs and SPMD cores
– Transfer of data between CPU and GPU
– Managing CPU-GPU memory bandwidth efficiently
– Efficient use of the different levels of memory (GPU memory, shared memory, constant and texture memory, …)
– Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
– Identifying an appropriate execution configuration for efficient execution
– Synchronization across multiple SMs
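As a rough illustration of three of these challenges (explicit CPU-GPU transfers, choosing an execution configuration, and managing on-chip shared memory), the hedged sketch below performs a block-wise sum reduction. The kernel and sizes are illustrative and not taken from the talk.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Block-wise sum reduction: each block reduces its tile in fast on-chip
// shared memory, then writes one partial sum to global memory.
__global__ void blockSum(const float *in, float *partial, int n) {
    extern __shared__ float tile[];              // explicitly managed shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             // synchronizes within a block only
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;                     // execution configuration is a tuning choice
    const int blocks = (n + threads - 1) / threads;

    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *dIn, *dPartial;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, h, n * sizeof(float), cudaMemcpyHostToDevice);   // CPU -> GPU

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(dIn, dPartial, n);

    // There is no cheap global barrier across SMs: finish the reduction on the
    // host (or launch a second kernel, which acts as a global sync point).
    float *hPartial = (float *)malloc(blocks * sizeof(float));
    cudaMemcpy(hPartial, dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (int b = 0; b < blocks; ++b) sum += hPartial[b];
    printf("sum = %.0f (expected %d)\n", sum, n);

    cudaFree(dIn); cudaFree(dPartial); free(h); free(hPartial);
    return 0;
}
```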
The Challenge : Data Parallelism
[Figure: data-parallel programming interfaces for the GPU: OpenCL, CUDA, AMD CAL, Cg]
Challenge – Level 2: Synergistic Exec. on CPU & GPU Cores
[Block diagram: CPU cores C0–C3, shared last-level cache (SLC), memory controller (MC), and memory]
Challenge – Level 3: Synergistic Exec. on Multiple Nodes
[Block diagram: multiple nodes, each with memory and a NIC, connected through a network switch]
Data, Thread, & Task Level Parallelism
[Figure: the programming interfaces involved at the different levels: OpenMP, MPI, SSE, OpenCL, CUDA, CAL, PTX, …]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
– CUDA for Clusters
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Our Approach
[Figure: parallel programming models (OpenMP, MPI, streaming languages, array languages such as MATLAB, CUDA/OpenCL and other parallel languages) feed a compiler/runtime system that targets synergistic execution on multiple heterogeneous cores: multi-cores with SSE, GPUs, Cell BE, and other accelerators]
CUDA on Other Platforms?
• CUDA programs are data parallel
• Clusters of multi-cores typically run task-parallel applications
• CUDA for multi-core SMPs (M-CUDA)
• CUDA on a cluster of multi-core nodes?
CUDA on Clusters: Motivation
• CUDA is a natural way to write data-parallel programs
• Large code base (CUDA Zone)
• Emergence of OpenCL
• How can we leverage these on (existing) clusters without GPUs?
Challenges:
• Mismatch in parallelism granularity
• Synchronization and context switches can be more expensive!
CUDA on Multi-Cores (M-CUDA)
[Figure: thread blocks executing on CPU cores over shared memory]
• Fuses the threads of a thread block into a single CPU thread (sketched below)
• OpenMP parallel for executes thread blocks in parallel
• Proper synchronization across kernels
• Loop transformation techniques for efficient code
• Shared memory and synchronization are inherent!
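The sketch below conveys the flavor of this transformation under simplifying assumptions (a kernel with no __syncthreads(), one-dimensional indices, invented names); it is not M-CUDA's actual generated code.

```cuda
#include <omp.h>

// Original CUDA kernel, shown for reference:
//   __global__ void scale(float *a, float s, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) a[i] *= s;
//   }

// M-CUDA-style CPU version: thread blocks become iterations of an
// OpenMP-parallel loop, and the threads of each block are fused into an
// inner serial loop (legal here because the kernel has no __syncthreads();
// kernels with barriers need loop fission around each synchronization point).
void scale_mcuda(float *a, float s, int n, int gridDim_x, int blockDim_x) {
    #pragma omp parallel for
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x) {
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            if (i < n) a[i] *= s;
        }
    }
}
```

CUDA shared memory simply becomes ordinary per-block data on the multi-core, which is why shared memory and intra-block synchronization come essentially for free in this setting.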
CUDA-For-Clusters
• CUDA kernels naturally express parallelism
– Block-level and thread-level
• Multiple levels of parallelism in hardware too!
– Across multiple nodes
– And across multiple cores within a node
• A runtime system performs this work partitioning at multiple granularities
• Little communication between blocks (in data-parallel programs)
• Need some mechanism for
– “Global shared memory” across multiple nodes!
– Synchronization across nodes
• A 'lightweight' software distributed shared memory layer provides the necessary abstraction (a sketch of the work partitioning follows below)
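A minimal sketch of the two-level work partitioning, under the assumption that thread blocks are independent; the names (run_block, the scaling kernel) are invented for illustration, and the software DSM that the real runtime layers underneath is only hinted at by a barrier.

```cuda
#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Block-level code of a trivial kernel, as a source-to-source pass might
// emit it: execute one thread block's slice of array 'a' of length n.
static void run_block(float *a, int n, float s, int blockIdx_x, int blockDim_x) {
    for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
        int i = blockIdx_x * blockDim_x + threadIdx_x;
        if (i < n) a[i] *= s;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 16, blockDim_x = 256;
    const int gridDim_x = (n + blockDim_x - 1) / blockDim_x;
    float *a = new float[n];                 // stand-in for DSM-backed global data
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // Level 1: a contiguous slice of thread blocks for this node.
    int per_node = (gridDim_x + nprocs - 1) / nprocs;
    int begin = rank * per_node;
    int end = (begin + per_node < gridDim_x) ? begin + per_node : gridDim_x;

    // Level 2: this node's blocks run across its cores; blocks are independent.
    #pragma omp parallel for schedule(dynamic)
    for (int b = begin; b < end; ++b)
        run_block(a, n, 2.0f, b, blockDim_x);

    // Kernel boundary: the real runtime reconciles the software DSM here;
    // a barrier stands in for that step in this sketch.
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("a[0] = %f\n", a[0]);

    delete[] a;
    MPI_Finalize();
    return 0;
}
```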
CUDA-For-Clusters
[Figure: the thread blocks B0–B7 of a CUDA program pass through source-to-source transforms into a C program with block-level code; an MPI + OpenMP work-distribution runtime assigns pairs of blocks (B0,B1 … B6,B7) to nodes N1–N4, each with two processors and RAM, connected by an interconnect and backed by a software distributed shared memory]
• Consistency only at kernel boundaries
• Relaxed consistency model for global data
• Lazy update
• CUDA memory management functions are linked to the DSM functions
• Runtime system ensures partitioning of tasks across cores
• OpenMP and MPI
Experimental Setup
• Benchmarks from the Parboil and Nvidia CUDA SDK suites
• 8-node Xeon cluster, each node with dual-socket quad-cores (2 x 4 cores)
• Baseline: M-CUDA on a single node (8 cores)
Experimental Results
[Chart: performance results with the lazy update optimization]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
– Stream Programs on GPU
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Stream Programming Model
• Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relation) between them
• Exposes pipelined parallelism and task-level parallelism
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and Cell BE
StreamIt Example Program: 2-Band Equalizer
[Stream graph: a Signal Source feeds a duplicating splitter; each branch applies a Bandpass Filter followed by an Amplifier; a Combiner merges the two bands]
Challenges on GPUs
• Work distribution between the multiprocessors
– GPUs have hundreds of processors (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism
– Scheduling across the multiprocessors
– Multiple concurrent threads in an SM to exploit DLP
• Determining the execution configuration (number of threads for each filter) that minimizes execution time
• Lack of synchronization mechanisms between the multiprocessors of the GPU
• Managing CPU-GPU memory bandwidth efficiently
Stream Graph Execution
[Figure: a stream graph A → B → C → D is software-pipelined onto SM1–SM4 over time steps 0–7, with each filter split into data-parallel instances (A1–A4, …, D1–D4). Instances of one filter spread across SMs give data parallelism, different filters running concurrently give task parallelism, and the overlap of successive filters across time steps gives pipeline parallelism]
Our Approach
• Multithreading
– Identify a good execution configuration to exploit the right amount of data parallelism
• Memory
– Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced (see the sketch below)
• Task partitioning between GPU and CPU cores
• Work scheduling and processor (SM) assignment problem
– Takes into account communication bandwidth restrictions
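To illustrate what a coalescing-friendly buffer layout means, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout for a two-channel sample stream. The data type and kernels are invented for illustration and are not the compiler's actual layout scheme.

```cuda
// Array-of-structures: thread i touches samples[i].left, so consecutive
// threads access floats that are 8 bytes apart. The strided accesses
// waste memory transactions compared with fully coalesced loads.
struct Sample { float left; float right; };

__global__ void gainAoS(Sample *samples, float g, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { samples[i].left *= g; samples[i].right *= g; }
}

// Structure-of-arrays: consecutive threads read consecutive floats from
// 'left' (and from 'right'), so each warp's accesses coalesce into a few
// wide memory transactions.
__global__ void gainSoA(float *left, float *right, float g, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { left[i] *= g; right[i] *= g; }
}
```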
Compiler Framework
[Flowchart: the StreamIt program is compiled to generate code for profiling; profile runs feed configuration selection; an ILP partitioner and a heuristic partitioner each perform task partitioning and instance partitioning; modulo scheduling and code generation then produce CUDA code + C code]
Experimental Results on Tesla
[Chart: speedups of the Mostly GPU, Syn-Heur, and Syn-ILP schemes; most bars fall in the 0–20x range, with outliers annotated at > 32x, > 52x, and > 65x]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
– MATLAB on CPU/GPU
• Conclusions
Compiling MATLAB to Heterogeneous Machines
• MATLAB is an array language extensively used for scientific computation
• Expresses data parallelism
– Well suited for acceleration on GPUs (a small illustration follows below)
• Current solutions (Jacket, GPUmat) require user annotation to identify “GPU friendly” regions
• Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic
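As a hedged illustration of why array code maps well to GPUs (hand-written for this transcript, not MEGHA's actual output), here is an element-wise MATLAB statement and the kind of CUDA kernel a compiler could emit for it once types and sizes are known.

```cuda
// MATLAB source (element-wise over arrays of equal length):
//   Y = A .* X + B;

// A kernel a compiler could generate after type inference has established
// that A, X and B are single-precision arrays of length n.
__global__ void fused_axpb(const float *A, const float *X,
                           const float *B, float *Y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        Y[i] = A[i] * X[i] + B[i];   // the whole expression fused into one kernel
}
```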
Compiler Overview
• Frontend constructs an SSA intermediate representation (IR) from the input MATLAB code
• Type inference is performed on the SSA IR
– Needed because MATLAB is dynamically typed
• Backend identifies “GPU friendly” kernels, decides where to run them and inserts required data transfers
[Flowchart: Frontend (Code Simplification, SSA Construction, Type Inference) followed by Backend (Kernel Identification, Mapping and Scheduling, Data Transfer Insertion, Code Generation)]
Backend: Kernel Identification
• Kernel identification identifies sets of IR statements (kernels) on which mapping and scheduling decisions are made
• Modeled as a graph clustering problem
• Takes into account several costs and benefits while forming kernels
– Register utilization (Cost)
– Memory utilization (Cost)
– Intermediate array elimination (Benefit)
– Kernel invocation overhead reduction (Benefit)
Backend: Scheduling and Transfer Insertion
• Assignment and scheduling is performed using a heuristic algorithm
– Assigns kernels to the processor which minimizes their completion time
– Uses a variant of list scheduling
• Data transfers for dependencies within a basic block are inserted during scheduling
• Transfers for inter-basic-block dependencies are inserted using a combination of data flow analysis and edge splitting
Results
[Chart: speedup over MATLAB of the GPUmat, CPU only, and CPU+GPU versions on the benchmarks bscholes, clos, fdtd, nb1d and nb3d; most bars lie below 14x, with two bars annotated at 64x and 113x]
Benchmarks were run on a machine with an Intel Xeon 2.83 GHz quad-core processor and a Tesla S1070. Generated code was compiled using nvcc version 2.3. MATLAB version 7.9 was used.
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Summary
• Multi-cores and heterogeneous accelerators present a tremendous research opportunity in
– Architecture
– High Performance Computing
– Compilers
– Programming Languages & Models
• The challenge is PPP: exploiting Performance, keeping programmer Productivity high, and preserving Portability of code across platforms
Thank You !!