A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling
Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao and Bevan Baas
VLSI Computation Lab
University of California, Davis
Outline
• Background and the First Generation AsAP
• The Second Generation AsAP
– Processors and Shared Memories
– On-chip Communication
– DVFS
• Analysis and Summary
Project Motivation
• Fully programmable and reconfigurable architecture
• High energy efficiency and performance
• Exploit task-level parallelism in:
– Digital Signal Processing
– Multimedia
[Figure: 802.11a Baseband Receiver task graph, from ADC to MAC layer. Blocks shown: Frame Detection, Timing Synch, CFO Estimation, CORDIC Rotation, Guard Removing, Energy Computing, FFT, Subcarrier Reordering, Channel Estimation, Channel Equalizer, Constellation Demapping, De-interleaver 1, De-interleaver 2, De-puncturing, Viterbi Decoder, De-scrambling, Pad Removing]
Asynchronous Array of Simple Processors (AsAP)
• Key Ideas:
– Programmable, small, and simple fine-grained cores
– Small local memories sufficient for DSP kernels
– Globally Asynchronous and Locally Synchronous (GALS) clocking
• Independent clock frequencies on every core
• Local oscillator halts when processor is idle
[Figure: single AsAP processor containing Osc, IMem, Datapath, and DMem]
Asynchronous Array of Simple Processors (AsAP)
• Key Ideas, cont'd:
– 2D mesh, circuit-switched network architecture
• Nearest-neighbor communication only
• Low area overhead
• Easily scalable array
– Increased tolerance to process variations
• 36-processor fully-functional chip, 0.18 µm, 610 MHz @ 2.0 V, 0.66 mm² per processor, 802.11a/g tx consumes 407 mW @ 300 MHz
[ISSCC 06, HotChips 06, IEEE Micro 07, TVLSI 07, JSSC 08,…]
New Challenges Addressed
• Reduction in the power dissipation of
– Lightly-loaded processors (lowering Vdd)
– Unused processors (leakage)
• Achieving very high efficiencies and speed on
common demanding tasks such as FFTs,
video motion estimation, and Viterbi decoding
• Larger, area-efficient on-chip memories
• Efficient, low-overhead communication between distant processors
167-processor Computational Platform
• Key features
– 164 Enhanced programmable processors
– 3 Dedicated-purpose processors
– 3 Shared memories
– Long-distance circuit-switched communication network
– Dynamic Voltage and Frequency Scaling (DVFS)
[Figure: chip floorplan with one tile expanded into Osc, Core, DVFS, and Comm; also labeled: FFT, Motion Estimation, Viterbi Decoder, 16 KB Shared Memories, Config. and Test]
Homogeneous Processors
• 164 in-order, single-issue, 6-stage processors
– 16-bit datapath with RISC-like instructions
– 128x16-bit data memory (DMEM)
– 128x35-bit instruction memory (IMEM)
– Two 64x16-bit FIFOs for inter-processor communication
[Figure: processor pipeline with labeled stages IF, ID, MemRd, EXE 1, EXE 2, and WB. Blocks shown: PC, IMem, Instr Decode, AdrGen, FIFO 0 and FIFO 1 (Rd/Wrt), DMem (Rd/Wrt), DC Mem (Rd/Wrt), Bypass, ALU, Multiplier, Acc, Block Float Point]
Homogeneous Processors
• Over 60 basic instructions
– Add, Sub, Logic, Multiply, MAC, Branch, …
• New instructions and features
– Min/Max, Byte-Sub/Add, Absolute value, Fixed-to-Float
conversion assist
– Jump/Return (function support)
– Zero Overhead Looping (block repeat)
– Conditional Execution (predicating)
– Block Floating Point
• Floating point CORDIC square root requires 2.9x
fewer cycles compared to first generation AsAP
• Preliminary results from one chip: 1.2 GHz, 59 mW,
1.3 V, 100% active MAC/ALU operations
Fast Fourier Transform (FFT)
• Continuous flow architecture with a single radix-4,2 butterfly
• Runtime configurable from 16-pt to 4096-pt transforms, FFT and IFFT
• 760,000 1024-pt complex FFTs/sec @ 989 MHz, 1.3 V
• 1.01 mm²
• Preliminary measurements functional at 866 MHz, 34.97 mW @ 1.3 V
[Figure: FFT block layout, 1.23 mm x 0.82 mm, showing memory blocks (MEM) and a block labeled OF]
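As a rough consistency check on the throughput figure above, the quoted clock rate and transform rate can be turned into cycles per transform and compared with the butterfly count of a 1024-point FFT. This is a sketch only: it assumes a pure radix-4 decomposition (1024 = 4^5), and the function names are illustrative, not taken from the slides.

```python
def radix4_butterflies(n_points: int) -> int:
    """Count radix-4 butterflies in an N-point FFT, where N is a power of 4."""
    stages = 0
    n = n_points
    while n > 1 and n % 4 == 0:
        n //= 4
        stages += 1
    if n != 1:
        raise ValueError("n_points must be a power of 4")
    # Each stage processes N/4 butterflies.
    return stages * (n_points // 4)


def cycles_per_transform(clock_hz: float, transforms_per_sec: float) -> float:
    """Average clock cycles available per transform."""
    return clock_hz / transforms_per_sec


# Figures quoted on the slide: 760,000 1024-pt FFTs/sec @ 989 MHz.
butterflies = radix4_butterflies(1024)
cycles = cycles_per_transform(989e6, 760_000)
print(butterflies, round(cycles))  # 1280 1301
```

Under that assumption the two numbers are close, which fits a single butterfly unit kept busy nearly every cycle.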
Viterbi Decoder
• 8 Add-Compare-Select (ACS) units
• Highly configurable
– Up to 32 different rates, including 1/2 and 3/4
– Decode codes up to constraint length 10
• 72 Mbps @ 789 MHz, 1.3 V for rate = 1/2
• 0.17 mm²
• Preliminary measurements functional at 894 MHz, 17.55 mW @ 1.3 V
[Figure: Viterbi decoder layout, 0.41 mm x 0.41 mm, showing a memory block (MEM) and a block labeled OF]
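For readers unfamiliar with the ACS operation named above, the following is a generic textbook sketch of one add-compare-select step, not the chip's actual datapath. It assumes distance-style metrics where the smaller value wins; the function names are illustrative.

```python
def add_compare_select(pm_a, bm_a, pm_b, bm_b):
    """One ACS step for a trellis state with two incoming branches.

    pm_* are the path metrics of the two predecessor states and bm_* the
    branch metrics of the transitions into this state. Returns the survivor
    metric and a decision bit (0 = branch a, 1 = branch b).
    """
    cand_a = pm_a + bm_a  # add
    cand_b = pm_b + bm_b
    if cand_a <= cand_b:  # compare
        return cand_a, 0  # select
    return cand_b, 1


def trellis_states(constraint_length):
    """Number of trellis states for a convolutional code of constraint length K."""
    return 2 ** (constraint_length - 1)


metric, decision = add_compare_select(3, 1, 2, 4)
print(metric, decision)    # 4 0
print(trellis_states(10))  # 512
```

At the maximum constraint length of 10 the trellis has 512 states, so a decoder with 8 ACS units must reuse them many times per decoded bit.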
Motion Estimation for Video Encoding
• Supports a number of fixed and programmable search patterns
• Supports all H.264 specified block sizes within a 48x48 search range
• 14 billion SADs/sec @ 880 MHz, 1.3 V; supports 1080p HDTV @ 30fps
• 0.67 mm²
• Preliminary measurements functional at 938 MHz, 196.17 mW @ 1.3 V
[Figure: motion estimation block layout, 0.82 mm x 0.82 mm, showing memory blocks (MEM) and a block labeled OF]
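The SAD (sum of absolute differences) metric behind the throughput figure above can be illustrated in a few lines. This is a minimal software sketch with an exhaustive search over a tiny reference frame; the accelerator's search patterns are described only as "fixed and programmable", so the full search used here is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between block `cur` and the same-sized
    region of `ref` whose top-left corner is at (top, left)."""
    return sum(
        abs(pixel - ref[top + r][left + c])
        for r, row in enumerate(cur)
        for c, pixel in enumerate(row)
    )


def full_search(cur, ref):
    """Try every position where `cur` fits inside `ref`; return (cost, top, left)
    of the first position with the lowest SAD."""
    height, width = len(cur), len(cur[0])
    best = None
    for top in range(len(ref) - height + 1):
        for left in range(len(ref[0]) - width + 1):
            cost = sad(cur, ref, top, left)
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


ref = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]
cur = [[5, 6], [7, 8]]
print(full_search(cur, ref))  # (0, 1, 1)
```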
Shared Memories
• Ports for up to four processors (two connected in this chip) to directly connect to the block, which provides
– Port priority
– Port request arbitration
– Programmable address generation supporting multiple addressing modes
• Uses a 16 KByte single-ported SRAM
• 1.28 GHz operation, 1.3 V
– One read or write per cycle
– 20.5 Gbps peak throughput
• 0.34 mm²
• Preliminary measurements functional at 1.3 GHz, 4.55 mW @ 1.3 V
[Figure: shared memory layout, 0.82 mm x 0.41 mm, showing the SRAM and FIFO blocks]
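The peak throughput and area quoted above follow from simple arithmetic, sketched below. The 16-bit word width is an assumption inferred from the 16-bit datapath and links described elsewhere in the deck, and 1 KByte is taken as 1024 bytes; the helper names are illustrative.

```python
def peak_throughput_gbps(clock_ghz, word_bits, accesses_per_cycle=1):
    """Peak throughput of a memory that moves one word per access."""
    return clock_ghz * word_bits * accesses_per_cycle


def sram_words(capacity_bytes, word_bits):
    """Number of words that fit in the SRAM."""
    return capacity_bytes * 8 // word_bits


# One 16-bit read or write per cycle at 1.28 GHz (word width assumed).
throughput = peak_throughput_gbps(1.28, 16)
words = sram_words(16 * 1024, 16)
area_mm2 = 0.82 * 0.41  # layout dimensions from the figure labels

print(round(throughput, 2))  # 20.48, quoted as 20.5 Gbps
print(words)                 # 8192
print(round(area_mm2, 2))    # 0.34
```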
Inter-Processor Communication
• Circuit-switched source-synchronous communication
– Each link has a clk, 16-bit data bus, valid, and request
– Core can
• Write to any combination of the 8 outputs under software control
• Read from any 2 of the 8 inputs using statically configured FIFOs
[Figure: tile with Core, FIFO 0, FIFO 1, and Comm; link signals clk, data, valid, request]
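The routing options listed above can be pictured with a small configuration model. This is a hypothetical sketch: the bit-mask encoding of output selections, the port numbering 0 to 7, and the names `FifoConfig` and `output_targets` are all assumptions made for illustration, not the chip's configuration format.

```python
from dataclasses import dataclass

NUM_PORTS = 8  # 8 inputs and 8 outputs per core, per the slide


@dataclass(frozen=True)
class FifoConfig:
    """Static choice of which input port feeds each of the two FIFOs."""
    fifo0_input: int
    fifo1_input: int

    def __post_init__(self):
        for port in (self.fifo0_input, self.fifo1_input):
            if not 0 <= port < NUM_PORTS:
                raise ValueError(f"input port {port} out of range")


def output_targets(mask):
    """Decode an 8-bit mask (assumed encoding) into the list of output ports
    that a write is sent to."""
    if not 0 <= mask < (1 << NUM_PORTS):
        raise ValueError("mask must fit in 8 bits")
    return [port for port in range(NUM_PORTS) if (mask >> port) & 1]


config = FifoConfig(fifo0_input=3, fifo1_input=7)
print(output_targets(0b00000101))  # [0, 2]
print(1 << NUM_PORTS)              # 256 possible output combinations
```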
Long-Distance Communication
• Allows communication across tiles without disturbing cores
– Long-distance links may be pipelined or not
• Depending on: source clock frequency, distance, and latency
[Figure: tiles between Source and Destination, each with Osc, FIFO, Datapath, and Comm; clk and data pass through a Switch]
Per-Processor DVFS
• Each processor tile contains a core that operates at:
– A fully-independent clock frequency
• Any frequency below maximum
• Halts, restarts, and changes arbitrarily
– Dynamically-changeable supply voltage
• VddHigh or VddLow
• Disconnected for leakage reduction
• VddAlwaysOn powers DVFS and inter-processor communication
[Figure: tile supply diagram with rails VddHigh, VddLow, VddOsc, VddAlwaysOn, VddCore, GndOsc, and GndCom; DVFS Controller outputs control_high, control_low, and control_freq; blocks Osc, Core, and Comm]
DVFS Controller
• Voltage and frequency are set by:
– Static configuration
– Software
– Hardware (controller)
• FIFO “fullness”
• Processor “stalling frequency”
[Figure: DVFS controller block diagram. Blocks: FIR/IIR Filter, Stall Counter, Freq. & Voltage Selector, Voltage Switching Circuit, Osc, Config, Proc Core, FIFO. Signals: FIFO_utilization, stall, DVFS_config, DVFS_software, config_volt, osc_frequency, PMOS_VddHigh, PMOS_VddLow. Rails: VddAlwaysOn, VddHigh, VddLow, VddCore]
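The hardware-controlled mode above, where filtered FIFO fullness drives the voltage choice, can be sketched in software. Everything specific here is an assumption for illustration: the first-order IIR filter, the coefficient `alpha`, the two thresholds, and the policy that a full input FIFO calls for VddHigh are not taken from the slides, and the class name is hypothetical.

```python
class DvfsControllerSketch:
    """Toy model: low-pass filter FIFO utilization, then pick a supply rail
    using two thresholds so the choice does not flip on every sample."""

    def __init__(self, alpha=0.25, high_threshold=0.75, low_threshold=0.25):
        self.alpha = alpha          # filter coefficient (assumed value)
        self.high = high_threshold  # above this, switch to VddHigh (assumed)
        self.low = low_threshold    # below this, switch to VddLow (assumed)
        self.filtered = 0.0
        self.rail = "VddLow"

    def step(self, fifo_utilization):
        """Feed one utilization sample in [0, 1]; return the selected rail."""
        # First-order IIR low-pass filter.
        self.filtered += self.alpha * (fifo_utilization - self.filtered)
        if self.filtered > self.high:
            self.rail = "VddHigh"
        elif self.filtered < self.low:
            self.rail = "VddLow"
        return self.rail


ctrl = DvfsControllerSketch()
busy = [ctrl.step(1.0) for _ in range(5)]  # FIFO stays full
idle = [ctrl.step(0.0) for _ in range(4)]  # FIFO drains
print(busy)
print(idle)
```

With these assumed parameters the rail switches to VddHigh on the fifth consecutive full sample and returns to VddLow on the fourth consecutive empty one; the gap between the two thresholds provides hysteresis.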
Tile Layout and Power Grids
• Metal 6 and 7 are devoted to power distribution, global and local
• Vdd/Gnd metal 6 and 7 usage:
– 5 Vdds: 79% utilization
– 2 Gnds: 21% utilization
– Per rail: VddHigh 26%, VddLow 26%, VddCore 19%, VddAlwaysOn 6%, VddOsc 2%, GndCom 19%, GndOsc 2%
• 48 power gates surround core
[Figure: tile layout and power grids, showing Power Gates and Decaps around the core, with Osc, IMem, DMem, FIFO 0, and FIFO 1 inside it; dimension labels 410 μm, 410 μm, 360 μm, and 340 μm]
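The Vdd and Gnd utilization totals on this slide can be reproduced from the per-rail percentages in the power grid figure. A short check, assuming each percentage label belongs to the rail name printed next to it:

```python
# Share of metal 6/7 power routing per rail, in percent (from the figure labels).
rail_share = {
    "VddHigh": 26,
    "VddLow": 26,
    "VddCore": 19,
    "VddAlwaysOn": 6,
    "VddOsc": 2,
    "GndCom": 19,
    "GndOsc": 2,
}

vdd_total = sum(share for rail, share in rail_share.items() if rail.startswith("Vdd"))
gnd_total = sum(share for rail, share in rail_share.items() if rail.startswith("Gnd"))

print(vdd_total, gnd_total)  # 79 21
```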
Supply Voltage Switching
• The switching speed and profile “shapes” supply currents while switching to trade off switching time versus power grid noise
• Processor cores normally halt during a switch
[Figure: waveforms for the fastest, medium, and slowest switching profiles, with clock and switch-done signals, and VddHigh noise when VddCore switches from VddLow to VddHigh; voltage labels 0 V, 1.22 V, and 1.3 V; time axis 50 ns to 54 ns]
Supply Voltage Switching
• Slow switching results in negligible power grid noise
• Early VddCore disconnect from VddLow results in momentary core voltage drop (circled below)
[Figure: VddCore waveform with the momentary drop circled; labels 1.3 V, 0.9 V, and 2 ns]
Tile and Core Area Breakdowns
• Communication area approximately 7%
• DVFS area approximately 8%
• Routing complexity results in 27% for gaps and fillers
[Figure: Tile Area pie chart. Core 73%, Empty Spaces & Fillers 11%, DVFS 8%, I/O & Route 5%]
[Figure: Core Area pie chart. Logic 29%, Decaps, Gaps & Fillers 21%, DFFs 13%, SRAMs 34% (IMem 11%, DMem 13%, FIFOs 10%), Osc 3%, Clk Tree & Buffers 2%, Test & Config 1%]
Die Micrograph and Key Data
• 55 million transistors, 39.4 mm²
• 65 nm STMicroelectronics low-leakage CMOS
• Single Tile
– Transistors: 325,000
– Area: 0.17 mm²
– Max. frequency: 1.19 GHz @ 1.3 V
– Power (100% active): 59 mW @ 1.19 GHz, 1.3 V
– Further entries under the Power (100% active) and Power (JPEG) rows: 608 μW @ 66 MHz, 0.675 V; 6.3 mW @ 1.095 GHz, 1.3 V
[Figure: die micrograph with dimension labels 5.939 mm and 5.516 mm, a 410 μm x 410 μm tile outlined, and the Mot. Est., Mem, Vit, and FFT blocks labeled]
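The high-voltage and low-voltage operating points in the key data can be compared as energy per clock cycle, which is power divided by frequency. The sketch below assumes both figures describe a comparably active processor, which the flattened table does not make explicit; the function name is illustrative.

```python
def energy_per_cycle_pj(power_w, clock_hz):
    """Average energy per clock cycle in picojoules."""
    return power_w / clock_hz * 1e12


# Operating points as listed in the key data table.
at_1v3 = energy_per_cycle_pj(59e-3, 1.19e9)   # 59 mW @ 1.19 GHz, 1.3 V
at_0v675 = energy_per_cycle_pj(608e-6, 66e6)  # 608 uW @ 66 MHz, 0.675 V

print(round(at_1v3, 1))             # 49.6
print(round(at_0v675, 1))           # 9.2
print(round(at_1v3 / at_0v675, 1))  # 5.4
```

Under that assumption, each cycle at 0.675 V costs roughly a fifth of the energy of a cycle at 1.3 V, at the price of a much lower clock rate.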
Complete 802.11a Baseband Receiver
• 22 processors
– 32 processors using only nearest-neighbor connections (46% increase)
• 54 Mbps throughput, 75 mW @ 610 MHz, 1.3 V
• 6x faster than TI C62x, 2x faster than SODA, 4x faster than LART (all scaled to 65 nm technology, 1.3 V and 610 MHz)
[Figure: two chip mappings, captioned "802.11a with long-distance" and "802.11a without long-distance", each with the Vit. and FFT blocks labeled]
Summary
• 65 nm low-power STMicroelectronics process
• Maintains the basic GALS architecture of AsAP
• 164 homogeneous processors
– 1.2 GHz, 59 mW, 100% active @ 1.3 V
• Three 16 KB shared memories
• Three dedicated-purpose processors
• Long-distance circuit-switched communication
increases mapping efficiency without overhead
• DVFS nets a 48% reduction in energy for JPEG
with only 8% performance loss
Acknowledgements
• STMicroelectronics
• Intel
• UC Micro
• NSF Grant 430090 and CAREER award 546907
• SRC GRC
• Intellasys
• SEM
• J.-P. Schoellkopf
• K. Torki and S. Dumont
• R. Krishnamurthy and M. Anders