Mathematical Primitives


Why GPUs?
Robert Strzodka
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
Data Processing in General

[Diagram: a processor reads input data from memory and writes output data back to memory. The two bottlenecks are labeled: the "memory wall" on the memory side, "lack of parallelism" on the processor side.]
Old and New Wisdom in Computer Architecture
• Old: Power is free, Transistors are expensive
• New: “Power wall”, Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)
• Old: Multiplies are slow, Memory access is fast
• New: “Memory wall”, Multiplies fast, Memory slow
(200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old: Increasing Instruction Level Parallelism via compilers,
innovation (Out-of-order, speculation, VLIW, …)
• New: “ILP wall”, diminishing returns on more ILP HW
(Explicit thread and data parallelism must be exploited)
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
slide courtesy of Christos Kozyrakis
Uniprocessor Performance (SPECint)

[Chart: single-processor SPECint performance relative to the VAX-11/780, 1978-2006, log scale. Growth runs at 25%/year until the mid-1980s, then 52%/year, then flattens ("??%/year") after about 2002; a 3X gap to the earlier trend line is marked. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

Sea change in chip design: multiple "cores" or processors per chip

slide courtesy of Christos Kozyrakis
Instruction-Stream-Based Processing

[Diagram: the processor fetches an instruction stream from memory and reads/writes its data through a cache to memory.]
Instruction- and Data-Streams

Addition of 2D arrays: C = A + B

Instruction stream processing data:

    for(y=0; y<HEIGHT; y++)
      for(x=0; x<WIDTH; x++) {
        C[y][x] = A[y][x] + B[y][x];
      }

Data streams undergoing a kernel operation:

    inputStreams(A,B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
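For reference, here is the instruction-stream fragment above completed into a self-contained C program (my completion, not part of the deck; the array extents and sample data are arbitrary choices for illustration):

    #include <stdio.h>

    #define WIDTH  4
    #define HEIGHT 3

    int main(void) {
        float A[HEIGHT][WIDTH], B[HEIGHT][WIDTH], C[HEIGHT][WIDTH];

        /* Fill the inputs with sample data. */
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++) {
                A[y][x] = (float)(y * WIDTH + x);
                B[y][x] = 1.0f;
            }

        /* Instruction-stream version: one processor walks the whole
           index space and applies the kernel element by element. */
        for (int y = 0; y < HEIGHT; y++)
            for (int x = 0; x < WIDTH; x++)
                C[y][x] = A[y][x] + B[y][x];

        printf("C[2][3] = %g\n", C[2][3]);  /* 11 + 1 = 12 */
        return 0;
    }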
Data-Stream-Based Processing

[Diagram: data streams from memory through several parallel pipelines inside the processor and back to memory; a separate configuration input defines what the pipelines compute.]
Architectures: Data – Processor Locality
• Field Programmable Gate Array (FPGA)
– Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
– Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
– Insert processing elements directly into RAM chips
• Stream Processor
– Create data locality through a hierarchy of memories
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
The GPU is a Fast, Parallel Array Processor

[Diagram: the pipeline from input arrays to output arrays.]
• Input arrays: 2D (typical), also 1D and 3D
• Vertex Processor (VP): kernel changes index regions of output arrays
• Rasterizer: creates data streams from index regions (a stream of array elements, order unknown)
• Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
• Output arrays: 2D (typical), also 1D and 3D (slice)
Index Regions in Output Arrays

[Diagram: the same output region covered by quads, by line segments, and by points.]
• Quads and triangles
  – Fastest option
• Line segments
  – Slower; try to pair lines into 2×h or w×2 quads
• Point clouds
  – Slowest; try to gather points into larger forms
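To make the fastest option concrete, here is a minimal sketch (my addition, not from the slides) of how an output region was selected in OpenGL-era GPGPU code: one quad covering the region is drawn, and the rasterizer turns it into one fragment per output element. It assumes a fragment program and render target are already bound; draw_output_region and its coordinate conventions are illustrative.

    #include <GL/gl.h>

    /* Cover a w x h output region with a single quad. Each
       rasterized fragment becomes one kernel invocation. */
    void draw_output_region(int w, int h) {
        glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f,     0.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f((float)w, 0.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f((float)w, (float)h);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f,     (float)h);
        glEnd();
    }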
High Level Graphics Language for the Kernels

• Float data types:
  – half 16-bit (s10e5), float 32-bit (s23e8)
• Vectors, structs and arrays:
  – float4, float vec[6], float3x4, float arr[5][3], struct {}
• Arithmetic and logic operators:
  – +, -, *, /; &&, ||, !
• Trigonometric and exponential functions:
  – sin, asin, exp, log, pow, …
• User-defined functions:
  – float max3(float a, float b, float c) { return max(a, max(b, c)); }
• Conditional statements, loops:
  – if, for, while; dynamic branching in PS 3.0
• Streaming and random data access
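As a sanity check on how C-like the kernel language is, the max3 example above compiles almost unchanged as C (my analogue, not part of the deck; the shading language's scalar max() corresponds to C's fmaxf()):

    #include <math.h>

    /* C analogue of the user-defined kernel function above. */
    float max3(float a, float b, float c) {
        return fmaxf(a, fmaxf(b, c));
    }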
Input and Output Arrays

• CPU: input and output arrays may overlap
• GPU: input and output arrays must not overlap
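A small C illustration (my addition) of why the GPU restriction exists: when input and output alias, the result depends on traversal order, and GPU fragments are processed in unknown order (see the array processor slide above), so no consistent result could be guaranteed.

    #include <stdio.h>

    int main(void) {
        /* In-place shift where input and output alias. */
        float a[5] = {0, 1, 2, 3, 4};
        for (int i = 0; i < 4; i++)   /* ascending traversal   */
            a[i] = a[i + 1];          /* reads unmodified data  */
        /* a is now 1 2 3 4 4 */

        float b[5] = {0, 1, 2, 3, 4};
        for (int i = 3; i >= 0; i--)  /* descending traversal   */
            b[i] = b[i + 1];          /* reads overwritten data */
        /* b is now 4 4 4 4 4 -- same code, different order,
           different result: exactly what hardware with unknown
           execution order must rule out. */

        for (int i = 0; i < 5; i++)
            printf("%g %g\n", a[i], b[i]);
        return 0;
    }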
Native Memory Layout – Data Locality

• CPU
  – 1D input
  – 1D output
  – Higher dimensions with offsets
• GPU
  – 1D, 2D, 3D input
  – 2D output
  – Other dimensions with offsets

[Diagram: input and output arrays with locality color-coded from red (near) to blue (far).]
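To show what "higher dimensions with offsets" means in practice, here is an index-arithmetic sketch (my addition; the function names are illustrative). The CPU flattens a 3D array into its native 1D memory; a GPU of this era packed 3D data as a grid of 2D slices inside its native 2D layout (the "flat 3D" technique):

    #include <stddef.h>

    /* CPU: 3D index -> offset into native 1D memory. */
    size_t idx_cpu_3d(size_t x, size_t y, size_t z,
                      size_t W, size_t H) {
        return (z * H + y) * W + x;
    }

    /* GPU: 3D index -> coordinates in a native 2D array that
       stores the Z slices side by side ("flat 3D" layout). */
    void idx_gpu_flat3d(size_t x, size_t y, size_t z,
                        size_t W, size_t H, size_t slicesPerRow,
                        size_t *u, size_t *v) {
        *u = (z % slicesPerRow) * W + x;  /* column in 2D array */
        *v = (z / slicesPerRow) * H + y;  /* row in 2D array    */
    }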
Data-Flow: Gather and Scatter

• CPU
  – Arbitrary gather
  – Arbitrary scatter
• GPU
  – Arbitrary gather
  – Restricted scatter
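In code, the two access patterns look like this (a C sketch of the standard definitions, not from the slides): gather computes the input address, scatter computes the output address. A fragment kernel can read from computed addresses (dependent texture fetches) but writes only to its own fixed output position, hence arbitrary gather but restricted scatter.

    #include <stddef.h>

    /* Gather: output position fixed, input position computed. */
    void gather(float *out, const float *in,
                const size_t *idx, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[idx[i]];
    }

    /* Scatter: input position fixed, output position computed.
       Arbitrary on the CPU; restricted on the GPU because a
       fragment cannot choose where its result is written. */
    void scatter(float *out, const float *in,
                 const size_t *idx, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[idx[i]] = in[i];
    }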
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
1) Computational Performance

[Chart: peak GFLOPS over time for CPUs and GPUs; the ATI R520 marks the top GPU data point. Chart courtesy of John Owens.]

Note: Sustained performance is usually much lower and depends heavily on the memory system!
2) Memory Performance

Memory access types: cache, sequential, random

• CPU (Pentium 4)
  – Large cache
  – Few processing elements
  – Optimized for spatial and temporal data reuse
• GPU (GeForce 7800 GTX)
  – Small cache
  – Many processing elements
  – Optimized for sequential (streaming) data access

[Chart: bandwidth for cache, sequential, and random access on the Pentium 4 and the GeForce 7800 GTX. Chart courtesy of Ian Buck.]
3) Configuration Overhead

[Chart: small workloads are configuration limited, large workloads are computation limited. Chart courtesy of Ian Buck.]
Conclusions

• Parallelism is now indispensable for further performance increases
• Both memory-dominated and processing-element-dominated designs have pros and cons
• Mapping algorithms to the appropriate architecture allows enormous speedups
• Many of the GPU's restrictions are crucial for its parallel efficiency (eat the cake or have it)