
GPGPU
Ing. Martino Ruggiero
Ing. Andrea Marongiu
[email protected]
[email protected]
Old and New Wisdom in Computer Architecture
• Old: Power is free, Transistors are expensive
• New: “Power wall”, Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)
• Old: Multiplies are slow, Memory access is fast
• New: “Memory wall”, Multiplies fast, Memory slow
(200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old: Increasing Instruction Level Parallelism via compilers,
innovation (Out-of-order, speculation, VLIW, …)
• New: “ILP wall”, diminishing returns on more ILP HW
(Explicit thread and data parallelism must be exploited)
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Uniprocessor Performance (SPECint)
SW Performance: 1993-2008
Instruction-Stream Based Processing
Data-Stream-Based Processing
Instruction- and Data-Streams
Architectures: Data–Processor Locality
• Field Programmable Gate Array (FPGA)
– Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
– Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
– Insert processing elements directly into RAM chips
• Stream Processor
– Create data locality through a hierarchy of memories
• Graphics Processor Unit (GPU)
– Hide data access latencies by keeping 1000s of threads in-flight
GPUs often excel in the performance/price ratio
Graphics Processing Unit (GPU)
• Development driven by the multibillion dollar game industry
– Bigger than Hollywood
• Need for physics, AI and complex lighting models
• Impressive Flops / dollar performance
– Hardware has to be affordable
• Evolution speed surpasses Moore’s law
– Performance doubles approximately every 6 months
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
– Programmability
– Precision
– Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
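
To make the data-parallel bullets above concrete, here is a minimal sketch of a CUDA kernel in the style the slide describes: one thread per element of a large array, fine-grain parallelism, mostly floating point arithmetic. The kernel name and shapes are illustrative, not taken from the slides.

    // SAXPY (y = a*x + y): one GPU thread per array element.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // Each thread computes its own global index into the array.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard against out-of-range threads
            y[i] = a * x[i] + y[i];   // low-latency FP multiply-add
    }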
Motivation 1:
• Computational Power
– GPUs are fast…
– GPUs are getting faster, faster
Motivation 2:
• Flexible, Precise and Cheap:
– Modern GPUs are deeply programmable
• Solidifying high-level language support
– Modern GPUs support high precision
• 32 bit floating point throughout the pipeline
• High enough for many (not all) applications
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
– Via a separate HW interface
– In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Product images: GeForce 8800, Tesla D870, Tesla S870]
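
A hedged sketch of the host side of such a program, reusing the saxpy kernel sketched earlier, to show what “programmable in C with CUDA tools” means in practice; the 256-thread block size and array length are illustrative choices, and error checking is omitted for brevity.

    #include <cstdlib>

    int main()
    {
        const int n = 1 << 20;                 // 1M elements (illustrative)
        size_t bytes = n * sizeof(float);
        float *h_x = (float *)malloc(bytes);
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        float *d_x, *d_y;                      // device (global) memory
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // SPMD launch: the grid scales transparently because blocks are
        // scheduled independently on however many SMs the GPU has.
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }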
Towards GPGPU
• The previous 3D GPU
– A fixed-function graphics pipeline
• The modern 3D GPU
– A programmable parallel processor
• NVIDIA’s Tesla and Fermi architectures
– Unify the vertex and pixel processors
The evolution of the pipeline
Elements of the graphics pipeline:
1. A scene description: vertices, triangles, colors, lighting
2. Transformations that map the scene to a camera viewpoint
3. “Effects”: texturing, shadow mapping, lighting calculations
4. Rasterizing: converting geometry into pixels
5. Pixel processing: depth tests, stencil tests, and other per-pixel operations
Parameters controlling design of the pipeline:
1. Where is the boundary between CPU and GPU?
2. What transfer method is used?
3. What resources are provided at each step?
4. What units can access which GPU memory elements?
Generation I: 3dfx Voodoo (1996)
• One of the first true 3D game cards
• Worked by supplementing the standard 2D video card
• Did not do vertex transformations: these were done on the CPU
• Did do texture mapping and z-buffering
[Pipeline diagram: vertex transforms run on the CPU; over PCI, the GPU performs primitive assembly, rasterization and interpolation, and raster operations into the frame buffer]
Generation II: GeForce/Radeon 7500 (1998)
• Main innovation: shifting the transformation and lighting calculations to the GPU
• Allowed multi-texturing: giving bump maps, light maps, and others
• Faster AGP bus instead of PCI
[Pipeline diagram: over AGP, the GPU now performs vertex transforms, primitive assembly, rasterization and interpolation, and raster operations into the frame buffer]
Generation III: GeForce3/Radeon 8500 (2001)
• For the first time, allowed a limited amount of programmability in the vertex pipeline
• Also allowed volume texturing and multi-sampling (for antialiasing)
[Pipeline diagram: as in Generation II, with small vertex shaders added alongside the vertex transforms]
Generation IV: Radeon 9700/GeForce FX (2002)
• This generation is the first generation of fully-programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
[Pipeline diagram: the GPU pipeline now contains a programmable vertex shader and a programmable fragment processor between rasterization and raster operations]
[Programmable pipeline diagram: the 3D application or game sends 3D API commands through the 3D API (OpenGL or Direct3D); across the CPU-GPU boundary (AGP/PCIe), the GPU front end receives the GPU command and data stream; a vertex index stream feeds primitive assembly; pre-transformed vertices enter the programmable vertex processor, which emits transformed vertices; assembled primitives go to rasterization and interpolation, which emits a pixel location stream and pre-transformed fragments; the programmable fragment processor emits transformed fragments; pixel updates flow through raster operations into the frame buffer]
• Vertex processors
– Operate on the vertices of primitives (points, lines, and triangles)
– Typical operations: transforming coordinates, setting up lighting and texture parameters
• Pixel processors
– Operate on rasterizer output
– Typical operations: filling the interior of primitives
The road to unification
• Vertex and pixel processors have evolved at different rates
• Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one.
• However, typical workloads are not well balanced, leading to inefficiency.
– For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true.
• The addition of more-complex primitive processing makes it much harder to select a fixed processor ratio.
• Increased generality → increased design complexity, area, and cost of developing two separate processors
• All these factors influenced the decision to design a unified architecture:
– to execute vertex and pixel-fragment shader programs on the same unified processor architecture.
Previous GPGPU Constraints
What’s wrong with GPGPU?
From pixel/fragment to thread program…
CPU-“style” cores
Slimming down
Two cores
Four cores
Sixteen cores
Add ALUs
128 elements in parallel
But what about branches?
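
The question can be answered concretely in code. Under the SIMT-style execution clarified on the next slide, threads that share one instruction stream but take different branch directions are serialized, with the non-participating lanes masked off. A sketch of the worst case (general SIMT behavior; names illustrative):

    __global__ void branchy(const int *flags, float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // If flags[] differs between threads that share an instruction
        // stream, the hardware runs BOTH paths back to back, masking off
        // the lanes that did not take each path, so the group pays for
        // the "if" and the "else".
        if (flags[i])
            data[i] = data[i] * 2.0f;   // taken by some lanes
        else
            data[i] = data[i] + 1.0f;   // taken by the remaining lanes
    }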
Clarification
SIMD processing does not imply SIMD instructions
• Option 1: Explicit vector instructions – Intel/AMD x86 SSE, Intel Larrabee
• Option 2: Scalar instructions, implicit HW vectorization
– HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
– NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures
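
A small sketch of the contrast between the two options. Under Option 2, the programmer writes ordinary scalar per-thread C, and the hardware shares one instruction stream across a group of ALUs; the explicit-SIMD alternative appears only as a comment for comparison (names illustrative):

    // Option 1 (explicit SIMD, shown for contrast): one x86 SSE intrinsic
    // such as _mm_add_ps(a, b) adds four packed floats in one instruction.
    //
    // Option 2 (implicit HW vectorization): each thread is plain scalar
    // code; the hardware issues one instruction for a whole warp of ALUs.
    __global__ void vec_add(const float *a, const float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];   // scalar add, vectorized across the warp by HW
    }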
Stalls!
• Stalls occur when a core cannot run the next instruction
because of a dependency on a previous operation.
• Memory access latency = 100’s to 1000’s of cycles
• We’ve removed the fancy caches and logic that help avoid stalls.
• But we have LOTS of independent work items.
• Idea #3: Interleave processing of many elements on a
single core to avoid stalls caused by high latency
operations.
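
Idea #3 in code: the usual way to give each core many elements to interleave is simply to launch far more threads than there are ALUs. The grid-stride loop below is one standard pattern for covering a large array this way (a sketch; names illustrative).

    __global__ void scale(int n, float a, float *data)
    {
        // With many warps resident per core, the scheduler can switch to
        // another warp whenever this one stalls on a long-latency load.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)   // grid-stride loop
            data[i] = a * data[i];
    }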
Hiding stalls
Throughput!
Summary: Three key ideas
• Use many “slimmed down cores” to run in parallel
• Pack cores full of ALUs (by sharing instruction stream across groups of work items)
• Avoid latency stalls by interleaving execution of many groups of work items / threads / …
– When one group stalls, work on another group
NVIDIA Tesla
[Architecture diagram: SMs with parallel data cache, global memory]
CUDA Device Memory Space Overview
• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Memory space diagram: the host connects to a (device) grid; each block contains shared memory plus per-thread registers and local memory; global, constant, and texture memories serve all blocks]
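
The spaces listed above map directly onto CUDA source qualifiers. A sketch showing where each one appears (identifiers illustrative; the kernel assumes 256-thread blocks):

    __constant__ float coeff[16];   // per-grid constant memory (read-only in kernels)

    __global__ void spaces_demo(const float *in, float *out)  // in/out: per-grid global memory
    {
        __shared__ float tile[256];  // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in per-thread registers
        tile[threadIdx.x] = in[i];   // global -> shared
        __syncthreads();             // make the block's shared writes visible
        out[i] = tile[threadIdx.x] * coeff[0];
    }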
Global, Constant, and Texture Memories (Long Latency Accesses)
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
• Texture and constant memories
– Constants initialized by host
– Contents visible to all threads
[Same memory space diagram as the previous slide]
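
On the host side, these long-latency spaces are reached through CUDA runtime calls; a sketch assuming the coeff symbol and spaces_demo kernel from the previous sketch (error handling omitted):

    void host_io_demo(const float *h_in, float *h_out, float *d_in, float *d_out,
                      size_t bytes, int blocks, int threads)
    {
        float h_coeff[16] = {0.5f};   // illustrative constants
        // Constants are initialized by the host, then read-only on the device:
        cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

        // Global memory is the main means of communicating R/W data:
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device
        spaces_demo<<<blocks, threads>>>(d_in, d_out);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host
    }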
Memory Hierarchy
• CPU and GPU Memory Hierarchy
[Diagram: disk; CPU main memory, CPU caches, CPU registers; GPU video memory, GPU caches, GPU constant registers, GPU temporary registers]
NVIDIA’s Fermi Generation CUDA Compute Architecture
The key architectural highlights of Fermi are:
• Third Generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM, 4x over GT200
– 8x the peak double precision floating point performance over GT200
• Second Generation Parallel Thread Execution ISA
– Unified address space with full C++ support
– Optimized for OpenCL and DirectCompute
• Improved Memory Subsystem
– NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
– Improved atomic memory op performance
• NVIDIA GigaThread™ Engine
– 10x faster application context switching
– Concurrent kernel execution
– Out-of-order thread block execution
– Dual overlapped memory transfer engines
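
Of the GigaThread features, concurrent kernel execution is the one directly visible to the programmer, through CUDA streams. A hedged sketch reusing the saxpy kernel from earlier (whether two kernels actually overlap depends on the hardware and their resource usage):

    void streams_demo(int n, float *d_x, float *d_y, float *d_z)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        int threads = 256, blocks = (n + threads - 1) / threads;
        // Independent kernels placed in different streams may run
        // concurrently on Fermi-class hardware:
        saxpy<<<blocks, threads, 0, s1>>>(n, 2.0f, d_x, d_y);
        saxpy<<<blocks, threads, 0, s2>>>(n, 3.0f, d_x, d_z);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }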
Third Generation Streaming Multiprocessor
• 512 high performance CUDA cores
– Each SM features 32 CUDA processors
– Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU)
• 16 Load/Store Units
– Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
– Supporting units load and store the data at each address to cache or DRAM.
• Four Special Function Units
– Special Function Units (SFUs) execute transcendental instructions such as sin, cosine, reciprocal, and square root.
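
From CUDA C the SFU-class transcendentals are reachable through fast-math intrinsics; a sketch (which hardware instruction a given intrinsic lowers to depends on the architecture and compiler flags):

    __global__ void sfu_demo(const float *x, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // __sinf / __expf are the fast, reduced-precision hardware variants,
        // as opposed to the slower but more accurate sinf / expf.
        out[i] = __sinf(x[i]) + __expf(x[i]);
    }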
Dual Warp Scheduler
• The SM schedules threads in groups of 32 parallel threads called warps.
• Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
• Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
• Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream.
• Using this elegant model of dual-issue, Fermi achieves near peak hardware performance.
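
Since the SM schedules in units of 32-thread warps, kernels often derive a warp and lane index from the thread index; a minimal sketch using the built-in warpSize variable (assumes a 1D block for simplicity):

    __global__ void warp_ids(int *warp_of, int *lane_of)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        warp_of[tid] = threadIdx.x / warpSize;  // which warp within the block
        lane_of[tid] = threadIdx.x % warpSize;  // position within the warp
    }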
Second Generation Parallel Thread Execution ISA
PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver.
The primary goals of PTX are:
– Provide a stable ISA that spans multiple GPU generations
– Achieve full GPU performance in compiled applications
– Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
– Provide a code distribution ISA for application and middleware developers
– Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
– Facilitate hand-coding of libraries and performance kernels
– Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores
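
PTX can be inspected during development: nvcc can stop at the virtual ISA instead of producing a device binary. A sketch of the workflow (file and kernel names illustrative):

    // kernel.cu -- emit the virtual ISA with:  nvcc -ptx kernel.cu
    // The resulting kernel.ptx is what the GPU driver translates to
    // machine instructions at install/load time, which is how one PTX
    // file can span multiple GPU generations.
    __global__ void axpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // typically a single multiply-add in PTX
    }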
Fermi and the PTX 2.0 ISA
Unified address space: three separate address spaces (thread private local, block shared, and global) for load and store operations.
• In PTX 1.0, load/store instructions were specific to one of the three address spaces:
– Programs could load/store values in a specific target address space known at compile time.
– This made it difficult to fully implement C/C++ pointers, since a pointer’s target address space may not be known at compile time.
• With PTX 2.0, a unified address space unifies all three address spaces into a single, continuous address space.
• The 40-bit unified address space supports a terabyte of addressable memory, and the load/store ISA supports 64-bit addressing for future growth.
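
The practical payoff of the unified address space is that ordinary C/C++ pointers work uniformly: the same __device__ function can be handed a pointer into global or shared memory, with the space resolved at run time. A sketch (names illustrative):

    __device__ float twice(const float *p)
    {
        // Generic pointer: under PTX 2.0's unified address space this may
        // point into local, shared, or global memory.
        return 2.0f * (*p);
    }

    __global__ void unified_demo(const float *g_in, float *g_out)
    {
        __shared__ float s_val;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x == 0)
            s_val = 0.5f;
        __syncthreads();
        g_out[i] = twice(&g_in[i])   // pointer into global memory
                 + twice(&s_val);    // pointer into shared memory
    }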
Summary Table