Implications of Heterogeneous, Multicore Systems

Kevin Skadron
University of Virginia, Dept. of Computer Science
LAVA Lab
© Kevin Skadron, 2008

Where We Are Today: Multicore

Power wall
Programmability wall
Classic architectures

Image: http://interactive.usc.edu/classes/ctin542-designprod/archives/r2d2-01.jpg

Outline of Overall Talk
…or, A Day in the Life of a Computer Architect

Why multicore, why heterogeneous?
GPU/CUDA overview
Results with GPUs

What is Multicore?

Combine multiple processors (e.g., Pentium 4's) on a single die: a chip multiprocessor!
Note that the cores don't have to be identical
• e.g., AMD Fusion: combine CPU(s) + GPU
• Intel has announced a similar product (which will probably come out first)
• Most bets are that heterogeneous multicore is the way of the future

[Die photos of current multicore processors. Source: chip-architect.com/news/Shanghai_Nehalem.jpg]

Why Multicore? The Good Old Days

Ideal scaling:
• Transistors shrink by 0.7x in each dimension with every generation (about 1.5 years)
• Pack 2X as many transistors in the same area, for the same power density
• With a 1.3X boost in frequency due to lower propagation delay
Architectural changes gave another 1.3-1.5X boost in performance
• Deeper pipelines (higher clock frequency)
• Less stalling (e.g., on memory accesses and branches)
• Execute multiple instructions simultaneously from a single thread (instruction-level parallelism, ILP)
Net result: doubling performance every 2 years

Why Multicore? Non-Ideal Scaling

How did we get here? A combination of the "ILP wall," "frequency wall," and "power wall"
• ILP wall: can't figure out how to keep increasing ILP without unreasonable area and power costs
  - only known solution is speculative execution
• Frequency wall: can't figure out how to increase clock frequency without an unreasonable increase in power
  - modern transistors are increasingly leaky
  - parasitic effects get worse with smaller transistors
• Power wall: air cooling is capped at ~150 W (thermal design power, TDP)
Moore's Law is providing area that a single thread can't economically use
• How much cache does one core need?
Area not spent on ILP can be spent on more cores!
Small simplifications in core complexity yield large reductions in power

Intel Power Trends

[Chart: Max Power (Watts), log scale from 1 W to 100 W, versus process generation from 1.5 µm down to 0.13 µm, for the i386, i486, Pentium, Pentium w/MMX tech., Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core 2 Duo. Source: Intel]

Why Heterogeneous?

Existing accelerators are useful (e.g., the GPU)
• Some GPGPU research has produced staggering speedups (100X+)
Specialization of some cores is a more effective use of area and power than general-purpose cores (large or small)
Main attraction today: performance per chip
• A teraflop workstation will have wider impact than a petaflop supercomputer
Important benefits going forward:
• Energy efficiency (performance/watt)
• Cooling costs (performance/$)
• But not necessarily lower power per mm²!

Heterogeneity and the Power Wall

Specialized cores: lower energy/op, but maybe a restricted set of ops
Specialized memory hierarchy: lower energy per memory op, but maybe harder to use
Specialized relationships among cores
• e.g., SIMD = lockstep
Specialized interconnect among PEs
• Direct forwarding of intermediate results from one unit to the next
GPUs and Cell have the first three; GPUs have a bit of the last one (but it is not yet useful for general-purpose GPU computing)

Heterogeneous Challenges

Which specialized units?
• How to select? How to decide the best performance/$?
• Very few units are actually in play right now…
• Possible outcome: several CPU cores, one GPU, one media core, one crypto core, and the rest is cache and reconfigurable logic
How to integrate?
• Lots of design issues
How to program?
• Need portability
• Dusty decks, dusty brains

Image source: Retrospect Galleries, http://www.retrospectgalleries.com/artist_photos/jorghe/jorghe/detail/homer_simpson_scream.jpg

Programmability Wall

Parallel programming is often hard
• Easy case: for each web request, do…
• Really hard case: for each time step, simulate…
  - fundamentally serial
• Moderately hard case: build a binary tree in parallel
  - lots of interactions requiring synchronization
Multicore requires parallel programming to keep riding Moore's Law

Programming for Heterogeneity

Current model: a serial master thread offloads pieces of work

Serial Code        ...
Parallel Kernel    KernelA<<< nBlk, nTid >>>(args);
Serial Code        ...
Parallel Kernel    KernelB<<< nBlk, nTid >>>(args);

Our Research Approach

Port applications to CUDA and OpenMP
See what we learn
Speed up some important applications for UVA scientists and engineers

Outline of Overall Talk

Why heterogeneous?
GPU/CUDA overview
Results with GPUs

Manycore GPU – Block Diagram

Tesla architecture, launched Nov 2006
GTX 280 has 240 scalar, single-precision PEs in 30 SIMD groups of 8 (SMs); DP throughput is about 1/8
Per-block shared memory (PBSM) allows communication among threads

[Block diagram: Host -> Input Assembler -> Thread Execution Manager, feeding an array of Thread Processor groups, each with its own PBSM; load/store path to Global Memory. © NVIDIA, 2007]

Key Parallel Abstractions in CUDA

Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads

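To make these abstractions concrete, here is a minimal sketch (not from the talk; the kernel name, block size, and toy operation are illustrative only) showing a grid of thread blocks, per-block shared memory, and the lightweight block-level barrier:

// Illustrative CUDA kernel: each block stages its slice of the data in
// per-block shared memory, synchronizes, then has threads read values
// loaded by other threads in the same block (here, reversing the slice).
__global__ void reverse_within_block(float *data) {
    __shared__ float buf[256];                      // per-block shared memory (PBSM)
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // position in the thread hierarchy
    buf[threadIdx.x] = data[i];                     // each thread stages one element
    __syncthreads();                                // lightweight barrier within the block
    data[i] = buf[blockDim.x - 1 - threadIdx.x];    // read a value another thread loaded
}

// Launched as a grid of blocks, e.g. (assuming N is a multiple of 256):
//   reverse_within_block<<< N / 256, 256 >>>(d_data);
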
Simple Example: Vector Addition

C = A + B

  A:  1  2  3  4  5  6  7  8
+ B:  9 10 11 12 13 14 15 16
= C: 10 12 14 16 18 20 22 24

Example courtesy of M. Boyer, 2008

C Code

float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));
    // Compute the sum
    for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
    // Return the result
    return C;
}

CUDA Kernel

// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void add_vectors_kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int block_id = blockIdx.x + gridDim.x * blockIdx.y;
    int thread_id = blockDim.x * block_id + threadIdx.x;
    // Compute a single element of the result vector (if it is valid)
    if (thread_id < N) C[thread_id] = A[thread_id] + B[thread_id];
}

CUDA Host Code

float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    dim3 grid_size = …
    add_vectors_kernel <<< grid_size, threads_per_block >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
    return C_CPU;
}

Program Output

./vector_add 50,000,000

GPU:
  Transfer to GPU:    0.236 sec
  Kernel execution:   0.005 sec
  Transfer from GPU:  0.152 sec
  Total:              0.404 sec

CPU: 0.136 sec

CPU outperformed GPU by 2.97x
Vector addition doesn't do enough work per memory operation to justify offload! (At any vector size)

Outline of Overall Talk

Why heterogeneous?
GPU/CUDA overview
Results with GPUs

Case Study: Leukocyte Detection & Tracking

Leukocytes (white blood cells) play an important role in the body's inflammation response
Biologists studying inflammation can use in vivo video microscopy to measure the velocity of rolling leukocytes in mice
Manual measurement is tedious and error-prone
Automated approaches are computationally demanding and still far from real-time

To appear in IPDPS 2009

Microscopic Video

Image processing challenges:
• Leukocytes can be dark or light
• Multiple layers of vessels
• Jitter due to breathing of the subject

Detection

Find the location and shape of all leukocytes in a frame:
1. At each pixel, compute the Gradient Inverse Coefficient of Variation (GICOV) score for a range of possible ellipses
2. Dilate the GICOV matrix to find local maxima
3. For local maxima, use an active contour to refine the ellipse

Tracking

[Figure: old and new leukocyte locations in consecutive frames]

Given the locations of leukocytes in the previous frame, find their updated positions in the current frame:
1. Compute the Motion Gradient Vector Flow (MGVF) matrix around each cell
2. Use an active contour to minimize an energy function defined on the MGVF matrix

OpenMP Acceleration

Straightforward parallelization using parallel pragmas (see the sketch below):
• Detection: per pixel
• Tracking: per cell

Speedup vs. thread count:
  2 threads: Detection 2.0x, Tracking 1.6x
  3 threads: Detection 3.0x, Tracking 2.4x
  4 threads: Detection 3.8x, Tracking 3.1x

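A rough sketch of what the per-pixel detection pragma might look like (the function names and loop body are placeholders, not the authors' code; each pixel's score is independent, so the loop parallelizes directly):

#include <omp.h>

// Hypothetical per-pixel stand-in for the real GICOV computation described above.
float compute_gicov(const float *frame, int p, int width, int height) {
    (void) width; (void) height;
    return frame[p];              // placeholder for the actual per-pixel score
}

void detect_frame(const float *frame, float *gicov, int width, int height) {
    // One independent iteration per pixel: straightforward OpenMP parallelization.
    #pragma omp parallel for
    for (int p = 0; p < width * height; p++) {
        gicov[p] = compute_gicov(frame, p, width, height);
    }
}

The per-cell tracking case is analogous, with the parallel loop running over cells instead of pixels.
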
Detection: CUDA Optimizations

Naïve CUDA implementation: 5.1X speedup over single-threaded C
• All results for this application use an engineering sample of the GTX 260 (lower clock speed)

Optimizations:
• Take advantage of special-purpose GPU address spaces (constant & texture; see the sketch below)
• Increase the regularity of memory access patterns
• Reduce the number of memory accesses at the expense of increased computation
These provide an additional 4.7x speedup
Overall CUDA speedup: 23.9x over C

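As an illustration of the first optimization, here is a minimal sketch (assumed, not the authors' code; the array name, size, and arithmetic are illustrative) of moving a small read-only table into CUDA constant memory:

// Hypothetical use of constant memory for small read-only data. Constant memory
// is cached and is most effective when all threads in a warp read the same element.
#define NUM_OFFSETS 150

__constant__ float d_offsets[NUM_OFFSETS];

// Host side: copy the table into constant memory once, before any kernel launches.
void upload_offsets(const float *h_offsets) {
    cudaMemcpyToSymbol(d_offsets, h_offsets, NUM_OFFSETS * sizeof(float));
}

__global__ void score_pixels(const float *image, float *score, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = 0.0f;
    for (int k = 0; k < NUM_OFFSETS; k++)   // every thread reads the same offset,
        s += image[i] * d_offsets[k];       // the ideal pattern for constant memory
    score[i] = s;
}
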
Detection: CUDA Optimizations

[Bar chart: speedup over single-threaded C for the OpenMP versions (2, 3, and 4 threads) and successive CUDA versions (naïve, constant memory, array ordering, one-pass variance, texture memory). Reported speedups include 2.2x, 3.5x, 5.1x, 6.8x, 11.0x, 12.8x, 33.0x, 48.0x, and 51.7x.]

Tracking: Naïve CUDA Implementation

Offload the matrix-based function in the critical loop of the iterative solver
The function is called about 50,000 times per frame
CUDA memory allocation and copying overheads dominate the runtime
Result: 4.4x slowdown compared to C

Tracking: CUDA Optimizations

• Increase the granularity of the CUDA kernel (make the kernel do more work)
• Reduce memory allocation overhead
• Perform the reduction on the GPU
• Abandon the canonical one-to-one mapping between threads and data elements (see the persistent-thread-block sketch below)
These provide a 241x speedup over the naïve version
Overall CUDA speedup: 55.4x over C
Double precision incurs about a 2X slowdown

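A rough sketch of the persistent-thread-block idea (hypothetical names and a stand-in update rule; the real MGVF solver is more involved): one block is launched once and iterates internally, so per-iteration kernel-launch overhead disappears and __syncthreads() serves as the barrier between iterations.

// Hypothetical persistent-thread-block solver: one block, launched once, runs all
// iterations; each thread covers several elements per iteration (no 1:1 mapping).
__global__ void persistent_solver(float *a, float *b, int n, int iterations) {
    float *src = a, *dst = b;
    for (int it = 0; it < iterations; it++) {
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            dst[i] = 0.5f * (src[i] + src[(i + 1) % n]);   // stand-in update rule
        }
        __syncthreads();                 // block-wide barrier replaces a kernel relaunch
        float *tmp = src; src = dst; dst = tmp;            // ping-pong the buffers
    }
    // (the final result is in a or b depending on the parity of 'iterations')
}

// Launched with a single thread block, e.g.:
//   persistent_solver<<<1, 256>>>(d_buf_a, d_buf_b, n, num_iterations);

This only works because a single block can synchronize with __syncthreads(); spanning multiple blocks would require the global memory fences warned against later in the talk.
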
Tracking: CUDA Optimizations

[Bar chart: speedup over single-threaded C for the OpenMP versions (2, 3, and 4 threads) and successive CUDA versions (naïve, larger kernel, reduced allocation, partial reduction, full reduction with 2 kernels, full reduction with 1 kernel, persistent thread block). Reported values include 0.5x, 2.0x, 2.3x, 4.0x, 5.9x, 7.7x, 18.2x, 34.1x, 45.2x, 52.4x, and 111.2x.]

Overall Results (GTX 280)

Frames per second (FPS):
  Matlab:       0.11
  C:            0.22
  C + OpenMP:   0.83
  CUDA:        17.60

Average Power Comparison

[Bar chart: average power (W), 0-300 W scale, for the system when idle and when running the C, OpenMP, and CUDA implementations.]

Total Energy Comparison

Total energy (kJ):
  C:       504
  OpenMP:  169
  CUDA:      8

Less Promising Case Study

Heart wall tracking in ultrasound cardiograms (mice)
• Accurate heart wall tracking and an understanding of normal motion provide diagnostic benefits
• Identify the inner and outer heart wall, define control points, then track these points

Heart Wall Diagram

Processing stages (time, fraction of total runtime):
• Read first frame from input file, display image: 0.533 s (0.28%)
• Crop image, display image: 0.089 s (0.05%)
• SRAD, display image: 3.224 s (1.72%)
• Detect edges, display image: 0.448 s (0.24%)
• Morphological transformation, display image: 0.275 s (0.15%)
• Dilate image, display image: 0.285 s (0.15%)
• Inner and outer ellipse parameter setup: 0.001 s (0.00%)
• Hough Search, display images: 15.872 s (8.47%)
• Create ellipse sample points: 0.726 s (0.39%)
• Track movement of sample points in all frames: 129.276 s (68.99%)
• Display movement of sample points through frames: 36.63 s (19.55%)
• Save outputs into file: 0.035 s (0.02%)

Semi-Modular Implementation

Kernel       | MATLAB functions to be recoded | Kernel calls | Operations per kernel (average) | Original MATLAB time before porting [s] | Speedup (GTX 280)
SRAD         | 4                              | 500          | 9                               | 8.71                                    | 10x
Hough Search | 3                              | 200          | 7                               | 15.87                                   | 7x
Tracking     | 19                             | 8000         | 2                               | 115.43                                  | 2.3x

Overall speedup: 2.2x -> Amdahl's Law strikes again!

- The number of MATLAB functions to be recoded indicates coding effort and time.
- Kernel calls suggest the penalty paid for each call; the number of calls differs depending on the algorithm used for each part of the code.
- SRAD and Hough Search iterate through the raw data of the entire image with few functions: more data per kernel.
- Tracking iterates only through select data (such as frames and numerous points inside each frame) with more functions: less data per kernel.
- Operations per kernel show how efficient each kernel is, i.e., how well it compensates for launch overhead.

Basic Efficiency Rules

Develop algorithms with a data-parallel mindset
• Simple example: a parallel summation now requires a reduction (see the sketch below)
Maximize locality of global memory accesses
• This will improve memory bandwidth utilization and, depending on the platform, local caching
Expose enough parallelism
• For GPUs, you need a minimum of 1000s of threads
• This degree of parallelism isn't needed for CPUs, but it can't hurt

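A minimal sketch of such a reduction (illustrative only; it assumes a block size of 256, a power of two, and leaves the final sum of the per-block partials to the host or a second kernel):

// Illustrative parallel-summation kernel: each block reduces its slice of the
// input in per-block shared memory and writes one partial sum to global memory.
__global__ void partial_sums(const float *in, float *block_sums, int n) {
    __shared__ float buf[256];                      // matches the 256-thread block size
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = buf[0];            // one partial result per block
}

// Example launch:
//   partial_sums<<<(n + 255) / 256, 256>>>(d_in, d_block_sums, n);
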
General Offload Optimizations

Reduce the number of kernel calls by increasing the amount of work per kernel call
• ~5 usec for a null kernel call
• Not clear what happens when the GPU is on the same chip as the CPU
Reduce the memory management overhead by allocating and deallocating memory only once (see the sketch below)
• CUDA's malloc and free implementations are 2,500 times slower than the C standard library implementations
Reduce the memory transfer overhead by moving computation to data, even if the computation is less efficient
• Trade off increased computation for decreased memory access
Avoid global (inter-thread-block) memory fences

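A hedged sketch of the allocate-once pattern (hypothetical names and a trivial stand-in kernel; the point is that cudaMalloc/cudaFree sit outside the per-frame loop and the same device buffers are reused for every kernel call):

// Hypothetical structure for reusing device buffers across many kernel calls.
__global__ void frame_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // stand-in for the real per-frame work
}

float *d_in = NULL, *d_out = NULL;

void setup_buffers(size_t bytes) {             // called once, before processing
    cudaMalloc((void **) &d_in,  bytes);
    cudaMalloc((void **) &d_out, bytes);
}

void process_frame(const float *h_in, float *h_out, int n) {   // called per frame
    size_t bytes = n * sizeof(float);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    frame_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
}

void teardown_buffers(void) {                  // called once, after all frames
    cudaFree(d_in);
    cudaFree(d_out);
}
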
What is Computer Architecture?

Identify HW capabilities and abstractions to meet SW needs
• Just like a building architect identifies HW capabilities to meet user needs
• SW needs: language features (e.g., function returns), common routines (e.g., square root, stencil access), reliability (e.g., redundant execution)
Adapt to changing technology
• Ideally transparent to SW, which multicore and heterogeneous organizations clearly are not
  - …unless we figure out new HW capabilities or middleware techniques
• Next big thing? Molecular computing? Quantum computing? On-chip wireless communication?

Summary

Technology trends are pushing multicore and heterogeneous organizations
Case studies with GPUs: impressive potential, but major software challenges remain
Portability and language compatibility are serious challenges
Need to abstract away platform/product-specific details

Thank You

Questions?

Contact me:
• [email protected]
• http://www.cs.virginia.edu/~skadron

Backup Slides

Sample Results (JPDC, to appear)

• NVIDIA GeForce GTX 280 vs. OpenMP on an Intel Xeon with two hyperthreaded dual-core processors (3.2 GHz, 2 MB L2, 4 GB memory)
• All results single precision

Benchmarks:
• SRAD: ultrasound despeckling
• HotSpot: heat transfer on a regular grid
• DES: encryption algorithm
• Backprop: machine learning
• Kmeans: clustering
• Needleman-Wunsch: dynamic programming for gene alignment

[Per-benchmark speedup charts: SRAD, HotSpot, DES, Back Propagation, Kmeans, Needleman-Wunsch]

Terminology: Asymmetric vs. Heterogeneous

Asymmetric: general-purpose cores, same ISA, only differ in size/ILP
• A few ILP cores for single-thread performance
• Many small, simple cores for throughput
• Problem: fixed partitioning wastes area for workloads at either extreme
• Solution: federation or core fusion: dynamically compose small cores into large cores only when needed
Heterogeneous: some cores may be specialized, different ISA
• GPU, crypto accelerator, etc.
• May be on- or off-chip
Fuzzy line between these terms

Terminology: What is "GPGPU"?

Definition 1: GPGPU = general-purpose computing with GPUs = any use of GPUs for non-rendering tasks
Definition 2: GPGPU = general-purpose computing with 3D APIs (i.e., DirectX and OpenGL)
• 3D APIs have the processing overhead of the entire graphics pipeline
• Limited interface to memory, no inter-thread communication
• Often difficult to map an application as the rendering of polygon(s)
• These restrictions are now indelibly tied to "GPGPU"
• No satisfactory replacement term; "GPU Computing"?

Computing Grid Size

// The number of threads per thread block
#define block_size 256

// The maximum size of each dimension of the grid
#define MAX 65535

// Determine the number of thread blocks in the x- and y-dimensions
int num_blocks = ceil((float) N / (float) block_size);
int num_blocks_y = ceil((float) num_blocks / (float) MAX);
int num_blocks_x = ceil((float) num_blocks / (float) num_blocks_y);
dim3 grid_size(num_blocks_x, num_blocks_y, 1);

kernel <<< grid_size, block_size >>> (…);

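As a worked example, using the 50,000,000-element run shown earlier and block_size = 256: num_blocks = ceil(50,000,000 / 256) = 195,313, which exceeds the 65,535-block limit of a single grid dimension, so num_blocks_y = ceil(195,313 / 65,535) = 3 and num_blocks_x = ceil(195,313 / 3) = 65,105, giving a 65,105 x 3 grid of 195,315 blocks. The kernel's block_id = blockIdx.x + gridDim.x * blockIdx.y flattens this 2D grid back into a linear block index, and the if (thread_id < N) guard discards the few hundred surplus threads.
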
How do GPUs differ from CPUs?

Key: performance/mm²
Emphasize throughput, not per-thread latency
Maximize the number of PEs and their utilization
• Many small PEs
• Amortize hardware in time: multithreading
• Hide latency with computation, not caching
  - Spend area on PEs instead
  - Hide latencies with fast thread switching and many threads/PE (24 on NVIDIA Tesla/G80!)
Exploit SIMD efficiency
• Amortize hardware in space: share fetch/control among multiple PEs
  - 8 in the case of Tesla
• Note that SIMD ≠ vector
  - NVIDIA's architecture is "scalar SIMD" (SIMT); AMD does both
High bandwidth to global memory
• Minimize the amount of multithreading needed
• Tesla memory interface is 512-bit
Net result: 470 GFLOP/s and ~80 GB/s sustained on a GeForce 8800 GTX

How do GPUs differ from CPUs? (2)

Hardware thread creation and management
• New thread for each vertex/pixel
• CPU: kernel or user-level software involvement
Virtualized cores
• Program is agnostic about the physical number of cores
  - True for both 3D and general-purpose code
• CPU: number of threads is generally f(# cores)
Hardware barriers
These characteristics simplify problem decomposition, scalability, and portability
Nothing prevents non-graphics hardware from adopting these features

How do GPUs differ from CPUs? (3)

Specialized graphics hardware exposed through CUDA:
• Texture path
  - High-bandwidth gather, interpolation
• Constant memory
  - Even higher-bandwidth access to small read-only data regions
• Transcendentals (reciprocal sqrt, trig, log2, etc.)
• Different implementation of atomic memory operations
  - GPU: handled in the memory interface
  - CPU: generally handled with CPU involvement
• Local scratchpad in each core (a.k.a. per-block shared memory)
• Memory system exploits spatial, not temporal, locality

Myths of GPU Computing

GPUs layer normal programs on top of graphics
  NO: CUDA compiles directly to the hardware
GPU architectures are:
• Very wide (1000s) SIMD machines
  NO: CUDA is 32-wide (multiplexed on the 8-wide Tesla architecture)
• Unable to branch, or branching is prohibitive
  NO: Flexible branching and efficient management of SIMD divergence
• Built from 4-wide vector registers
  NO: NVIDIA Tesla is scalar (this one is still true for AMD Radeon)
GPUs don't do real floating point
  NO: Almost full IEEE single-precision FP compliance now (still limited under/overflow handling)
  Double precision at reduced speed on GTX 260/280
