Raw Microprocessor Hardware in a slide

The Raw Architecture
Signal Processing on a Scalable
Composable Computation Fabric
David Wentzlaff, Michael Taylor, Jason Kim, Jason
Miller, Fae Ghodrat, Ben Greenwald, Paul
Johnson, Walter Lee, Albert Ma, Nathan Shnidman,
Henry Hoffmann, Arvind Saraf, Volker Strumpen, Matt
Frank, Saman Amarasinghe, and Anant Agarwal
http://www.cag.lcs.mit.edu/raw
MIT Laboratory For Computer Science
Outline
Motivation
Architecture
Raw Prototype
Networks
Signal Processing Applications
Status
Wire Delay and Tiled
Architectures
Fact 1: Number of transistors growing
Fact 2: Proportionally wires not getting faster
Problem: The number of gates we can reach in
one cycle is staying constant, but our chips
are getting bigger.
Solutions:
1. Hide wire delay latency in micro-architecture
(Clustering/Hidden communication stalls)
2. Expose the communication to the instruction set
level and allow the software to exploit locality
Make each tile as big as a signal can travel in one
clock cycle, and expose longer-distance
communication to the programmer
What Are We Building?
The Raw Prototype
16 Replicated Tiles (Processors)
What is in a tile?
8 stage Pipelined MIPS-like 32-bit processor
Pipelined Floating Point Unit
32KB Data Cache
32KB Instruction Memory
Interconnect Routers
Raw’s Networking Resources
2 Dynamic Networks
Fire and Forget
Header encodes destination
2 Stage router pipeline
2 Static Networks
Software configurable crossbar
Interlocked and Flow Controlled
5 Stage static router pipeline
3-cycle nearest-neighbor ALU-to-ALU communication latency
No header overhead, but requires knowledge of
communication patterns at compile time
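As a rough illustration of "header encodes destination" on the dynamic networks, the sketch below routes a message hop by hop across a 4x4 grid of tiles. The row-major tile numbering, the port names, and the X-then-Y (dimension-ordered) routing rule are assumptions made for this example; the slides only say that the header carries the destination.

```python
# Hypothetical model: 16 tiles in a 4x4 mesh, numbered row-major (0..15).
GRID = 4

def next_hop(current, dest):
    """Choose an output port from the destination carried in the header.

    Assumes X-then-Y dimension-ordered routing (an assumption for this
    sketch, not something stated on the slides).
    """
    cx, cy = current % GRID, current // GRID
    dx, dy = dest % GRID, dest // GRID
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "S" if dy > cy else "N"
    return "P"  # arrived: deliver to the local processor

def route(src, dest):
    """Return the list of tiles a message visits from src to dest."""
    step = {"E": 1, "W": -1, "S": GRID, "N": -GRID}
    path = [src]
    cur = src
    while True:
        port = next_hop(cur, dest)
        if port == "P":
            return path
        cur += step[port]
        path.append(cur)

print(route(0, 15))
```

The sender does nothing beyond attaching the destination; every routing decision is made from the header at each hop, which is the sense in which the message is "fire and forget". A static network would instead need the whole path known at compile time.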
Memory Mapped Communication
is Not a First Class Citizen
Communication to other tiles goes through a memory
system that happens to go over a network.
[Figure: tile processor pipeline; stage labels IF, D, RF, E, M1, M2, A, TL, TV, F, P, U, F4, WB]
Raw’s First Class Register-Mapped Communication
Ex: add r26, r25, r24
[Figure: tile processor pipeline (stage labels IF, D, RF, E, M1, M2, A, TL, TV, F, P, U, F4, WB) with registers r24, r25, r26, r27 connected to the Network Input FIFOs and Network Output FIFOs]
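A minimal sketch of what register-mapped communication means for an instruction such as add r26, r25, r24: reading a network register pops a word from an input FIFO, and writing one pushes a word to an output FIFO. The Tile class, the one-FIFO-per-register layout, and the method names are hypothetical and chosen only for illustration; the slide shows just the registers r24 to r27 and the FIFOs.

```python
from collections import deque

NET_REGS = {24, 25, 26, 27}  # registers named on the slide

class Tile:
    """Toy register file in which r24..r27 are network ports (assumed layout)."""

    def __init__(self):
        self.regs = [0] * 32
        self.net_in = {r: deque() for r in NET_REGS}
        self.net_out = {r: deque() for r in NET_REGS}

    def read(self, r):
        if r in NET_REGS:
            # Hardware would stall on an empty FIFO; this toy raises IndexError.
            return self.net_in[r].popleft()
        return self.regs[r]

    def write(self, r, value):
        if r in NET_REGS:
            self.net_out[r].append(value)
        else:
            self.regs[r] = value

    def add(self, rd, rs, rt):
        self.write(rd, self.read(rs) + self.read(rt))

tile = Tile()
tile.net_in[25].append(7)   # word arriving from a neighbor
tile.net_in[24].append(5)   # word arriving from another neighbor
tile.add(26, 25, 24)        # add r26, r25, r24
print(tile.net_out[26])
```

No load or store appears anywhere: the single add both receives its operands from the network and sends its result back out.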
Signal Processing Applications
Problem: Increase performance of Signal
Processing in a scalable fashion
Solution: Exploit parallelism in Signal
Processing Applications at all levels
Types of Parallelism in Signal
Processing
DSP Filter Style
Fine Grain Dataflow
Raw
Instruction Level Parallelism
Current Architectures
Data Parallel
Thread Level Parallelism (MPI)
Instruction Level Parallelism
RawCC
Maps dataflow graphs across tiles
ILP across Multiprocessor
Heavily Latency sensitive
Single cycle reconfigurable communication
Fine Grain Dataflow
Ex: Pipelined FIR Filter
[Figure: pipelined FIR filter with four taps; input samples xn through xn-3 are paired with weights W0 through W3, one tap per stage]
Computation: mul, add
Input Operands: xi, l
Output Operands: k
Cycle count
Class    Compute  Communicate  Overall
First    2        0            2
Second   2        3            5
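The per-tap computation in the FIR example can be sketched functionally as below. This reads "xi, l" as the input sample and the incoming partial sum, and "k" as the outgoing partial sum, which is an interpretation of the slide rather than something it states. The function names are hypothetical, and the loop models only the arithmetic, not the cycle-level pipelining across tiles.

```python
def tap(w, x, l):
    """One stage's work: a multiply and an add, k = w * x + l."""
    return w * x + l

def fir(weights, samples):
    """Chain one tap per weight, passing the partial sum from stage to stage."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for i, w in enumerate(weights):
            # Samples before the start of the stream are treated as zero.
            x = samples[n - i] if n - i >= 0 else 0
            acc = tap(w, x, acc)
        out.append(acc)
    return out

# An impulse input reproduces the weights.
print(fir([1, 2, 3, 4], [1, 0, 0, 0, 0]))
```

On a tiled fabric each call to tap would sit on its own tile, with acc travelling between neighbors over the network instead of through a local variable.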
Fine Grain Dataflow
First Class Interface
    mul $r3, Wx, NET_IN_1
    add NET_OUT_1, NET_IN_2, $r3
Second Class Interface
    ld $r4, NET_IN_1_ADDR
    ld $r5, NET_IN_2_ADDR
    mul $r3, Wx, $r4
    add $r6, $r5, $r3
    st NET_OUT_1_ADDR, $r6
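The cycle counts for the two interfaces can be reproduced by counting instructions in the two listings, under the simplifying assumption that every instruction costs one cycle and that only loads and stores count as communication. Both assumptions, and all names in the sketch, are illustrative only.

```python
# Opcode sequences taken from the two listings on the slide.
FIRST_CLASS = ["mul", "add"]
SECOND_CLASS = ["ld", "ld", "mul", "add", "st"]

# Assumption: loads and stores exist only to move network data.
COMMUNICATE_OPS = {"ld", "st"}

def cycle_breakdown(instrs):
    """Split a listing into compute and communicate cycles (1 cycle each)."""
    communicate = sum(1 for op in instrs if op in COMMUNICATE_OPS)
    compute = len(instrs) - communicate
    return {"compute": compute,
            "communicate": communicate,
            "overall": compute + communicate}

print(cycle_breakdown(FIRST_CLASS))
print(cycle_breakdown(SECOND_CLASS))
```

Under this model the memory-mapped version spends more cycles moving data than computing, while the register-mapped version folds communication into the operand fields of the mul and add.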
DSP Filter Style
[Figure: tiles mapped as a streaming filter; block labels: Offchip (input), DownSample, FFT (x4), Frequency Domain Filter, FFT-1 (x4), Offchip (output)]
Raw is Composable
Mix and match types of parallelism
[Figure: one chip running a mix of workloads on different groups of tiles; labels: White balance (x2), 4-way Threaded Java Application, Aliasing filter, mem (x2), 2-way RawCC Application, httpd, Zzz]
Raw Status
Stats
IBM SA-27E 0.15 µm, 6 Layer Copper
18.2 mm x 18.2 mm die
0.122 Billion Transistors
2048KB SRAM On-chip
1657 Pin CCGA Package
1080 HSTL Signal IO Operating at Core Speed
225MHz
~25 Watts
The Raw Performance
16 OPS/FLOPS per cycle (@225MHz = 3.6 GFLOPS)
230 Gb/s of on-chip “bisection bandwidth”
201 Gb/s of off-chip I/O bandwidth
115 Gb/s of on-chip memory bandwidth
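The peak figure follows directly from the slide: 16 operations per cycle at 225 MHz. The two bandwidth decompositions below are assumptions made for this sketch (32-bit words, one word per tile per cycle for memory, and four 32-bit networks crossing a four-tile-wide bisection in both directions); they happen to reproduce the quoted numbers after rounding, but the slides do not state how those numbers were derived.

```python
CLOCK_HZ = 225_000_000
TILES = 16
WORD_BITS = 32  # from the 32-bit processor on the tile slide

# From the slide: 16 OPS/FLOPS per cycle at 225 MHz.
peak_gflops = TILES * CLOCK_HZ / 1e9

# Assumed: each tile's memory delivers one 32-bit word per cycle.
memory_gbps = TILES * WORD_BITS * CLOCK_HZ / 1e9

# Assumed: 4 tiles along the cut, 4 networks (2 static + 2 dynamic),
# 2 directions, one 32-bit word per link per cycle.
bisection_gbps = 4 * 4 * 2 * WORD_BITS * CLOCK_HZ / 1e9

print(round(peak_gflops, 1), round(memory_gbps, 1), round(bisection_gbps, 1))
```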
Raw Status
Working:
Cycle Accurate Software Simulator
RTL Simulation
Emulation System
RawCC ILP Compiler
Current:
Verification
Backend Completion
Tapeout December 2001
Chips Back Summer 2002
Summary
Raw’s First Class communication
facilitates exploitation of new forms of
parallelism in Signal Processing
applications
Extra Slides