Many-cores Make Light Work
Nigel Topham
PPar Conference & Industrial Engagement Event
June 2015
Overview

• Parallelism can, and must, become pervasive to solve future silicon scaling problems!
• What drives us towards pervasive parallelism?
• Can embedded applications really benefit from many-core architectures?
• What does a low-power many-core embedded system look like?
• What does the future hold for embedded many-core systems?
2
Embedded systems in perspective

• More CPUs shipped in embedded systems than as PCs and servers combined
• Embedded market has the highest growth rate (PCs diminishing)
• Performance expectations of embedded systems continue to grow
  – smart-phones / tablets
  – embedded vision
  – automotive (Google car)
  – medical
  – robotics
  – Internet of Things...
3
Motivation for many-core embedded systems

• Imagine an embedded application requiring 20 billion operations per second to meet real-time performance targets
• What type of processor is most appropriate?
  – a standard x86 dual-core PC, running at 2.4 GHz?
  – a collection of lightweight embedded cores, running at 300 MHz?
• Answer depends on the power efficiency required, and the ability to run the application in parallel
                  x86 PC   Encore (EM6)   Extended EM6
  Cores / chip         2             32             32
  IPC / core           2              1              4
  MHz / core       2,400            300            300
  GOPS / sec        8.64           9.60          38.40
  Power (W)        65.00           0.25           0.35
  GOPS / W          0.13          38.40         109.71

Assumes EM6 cores implemented in a low-power 65nm CMOS process
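For reference, the throughput and efficiency rows follow from cores x IPC x clock: the extended EM6 column is 32 x 4 x 0.3 GHz = 38.4 GOPS, and 38.4 GOPS / 0.35 W ≈ 110 GOPS/W; the x86 row is quoted slightly below its 2 x 2 x 2.4 GHz = 9.6 GOPS theoretical peak.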
4
PART I – Shaping the many-core landscape
Fundamentals and drivers...
5
Moore's Law is not finished (yet)

[Chart: gate density (4-input NAND gates / sq.mm) and technology node / gate length (nm) projected by year, 2013-2028. Data source: ITRS Report 2013]

• Moore's Law describes the rate at which transistor dimensions scale
• Scaling linear dimensions by 0.7x doubles density every 2 years
• ITRS 2013 defines a roadmap for transistor scaling up to 2028
• Gate densities will increase by 30x in this period
• BEOL processing is unlikely to keep up, so interconnect density may not scale with transistor density
6
Dennard scaling ended some time ago

"The power density on a chip remains constant as transistor sizes reduce from one generation to the next."

• Each generation reduces linear dimensions by 30% (see the scaling relations sketched below):
  – 2x transistor density (Moore's Law)
  – 70% reduction in switching time → 40% increase in MHz
  – 30% reduction in supply voltage → 65% energy reduction
  – 65% energy reduction x 1.4x frequency increase → 50% power reduction
  – 50% power reduction x 2x transistor density → constant power density
• Moore's Law + Dennard scaling → 1000x improvement in microprocessor speed over the past 20 years
• What went wrong?
  – Sub-threshold leakage current increases when Vth decreases
  – We have reached the scaling limit of gate-oxide thickness
  – Net result: supply-voltage scaling breaks down, and power density rises at each new generation
• This does not halt Moore's Law, but it affects how we use the extra transistors!
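As a sketch of the standard reasoning behind these figures (not on the original slide), write the per-generation linear scaling factor as s ≈ 0.7:

    P = C V^2 f, \qquad C \to sC, \quad V \to sV, \quad f \to f/s
    \Rightarrow\; P \to s^2 P \approx 0.5\,P \quad \text{(about 50\% power reduction per device)}
    \Rightarrow\; \frac{P}{A} \to \frac{s^2 P}{s^2 A} = \frac{P}{A} \quad \text{(constant power density, since device area scales as } s^2\text{)}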
7
Pollack's Rule

"The increase in microprocessor performance due to microarchitecture is roughly proportional to the square root of the increase in complexity."

• ILP increases due to increased out-of-order processing:
  – increase in the number of in-flight instructions
  – increase in the number of live register values
  – increased branch-prediction accuracy required
• Micro-architectural structures to manage increasing ILP tend to grow quadratically (or worse) with respect to the feature they support
• Quadratic complexity growth → Pollack's Rule
1. Fred Pollack, "Pollack's Rule of Thumb for Microprocessor Performance and Area", http://en.wikipedia.org/wiki/Pollacks_Rule (cites [2] as source)
2. Shekhar Borkar and Andrew A. Chien (May 2011), "The Future of Microprocessors", Communications of the ACM 54 (5) (cites [1] as source)
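Expressed as a formula (a restatement of the rule above, with a doubled core as the worked case):

    \text{performance} \propto \sqrt{\text{complexity}}, \qquad 2\times \text{area} \;\Rightarrow\; \sqrt{2} \approx 1.4\times \text{performance}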
8
The convergence of single-core limitations

• Dennard scaling broke down in ~2000
• Transistor density increase continues, and even accelerates!
• ILP for desktop/server CPUs reaches a limit imposed by:
  – Dennard scaling
  – Pollack's Rule
• Note the failure of architects to improve single-core ILP in this millennium
9
The power-density (many-core) wall

• Many-core approach seeks to use many smaller, lower-power cores to avoid Pollack's Rule
  – achieve O(n) performance growth, instead of O(n^1/2)
  – lots of parallel programming / compiler headaches (a.k.a. "research opportunities")
• But it's not a complete solution – an ever-increasing proportion of future silicon will be Dark
• Significant energy-efficiency gains must be found from somewhere

[Esmaeilzadeh et al., ISCA '11]
10
Low-power embedded processor design

• Avoid micro-architectural speculation
• Avoid superscalar, out-of-order micro-architectures
• Use aggressive, hierarchical clock-gating (25% of dynamic power is in the clock tree)
  – state-dependent clock-gating + (explicit) module-level clock-gating
• Minimize memory accesses, including to on-chip caches
• Design CPU pipelines to allow use of low-leakage libraries, especially RAMs (even if they're slower)
• H/W complexity of synthesizable cores depends on synthesis constraints!
  – over-pipeline the core to relax the timing constraints per stage
  – 3-stage and 10-stage implementations of the same ISA will have overlapping complexity and energy curves
  – better to choose a 10-stage pipeline when F > 500 MHz
• Gate count and energy consumption per instruction depend on the synthesis target – they are not constant!
• 10-stage pipeline can allow 2 clock periods for L1 cache accesses
• L1 cache memories are then able to use high-density, low-leakage RAMs, even at maximum MHz
• 3-stage pipeline's critical paths include many gates, with many opportunities for logic synthesis to replicate logic to meet timing constraints
• 3-stage pipeline can be made to run at high speed, but efficiency drops dramatically
11
Overcoming power density limitations

• "Dark Silicon" and power gating
  – build heterogeneous systems, with functional specialization
  – big.LITTLE systems
• Rebalanced quantities of core logic and on-chip RAM
  – more memory, less logic → no increase in throughput!
• Spend less energy "moving data" and "synchronizing"
  – improve on-chip locality → many cores with local data
  – avoid coherency costs in shared memory → programming complexity!
• Use lower-leakage CMOS processes
  – transistors are slower → many cores running at 100-500 MHz
• Over-pipelined cores
  – simplifies logic complexity → many cores running at 100-500 MHz
• Reduce supply voltages towards the threshold voltage
  – big reductions in leakage power, but transistors are much slower
  – use thousands of cores at 10-50 MHz
12
[Image: courtesy H. Haas]
PART II – A Parallel Embedded Application
Making light really work...
With Oscar Almer, Freddie Qu, Bjoern Franke, Mike O'Boyle, and in collaboration with Prof Harald Haas and his team in the School of Engineering and Electronics.
13
OWC coding, using OFDM

• First apply forward-error-correction (FEC), typically using a convolutional code
• FEC output stream is partitioned into fixed-size frames
• Frame data is split across N orthogonal sub-channels, one symbol per channel
• Channel symbols are encoded using QAM (e.g. QAM-16 = 4 bits/symbol)
• OFDM is used to multiplex the digital data into a time-varying signal
• Digital signal is clipped, scaled, converted to analogue, and drives an output LED

[H. Haas, "Li-Fi: High Speed Wireless Communications via Light Bulbs", 19 June 2013]
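To make the QAM mapping step concrete, here is a minimal C sketch of a QAM-16 mapper; the Gray-coded constellation, the Q1.15 output format and the function names are illustrative assumptions, not the chip's actual implementation.

    #include <stdint.h>

    /* Hypothetical QAM-16 mapper: 4 input bits -> one complex channel symbol.
     * Each pair of bits selects a Gray-coded amplitude in {-3,-1,+1,+3},
     * scaled into Q1.15 fixed point (illustrative scaling only). */
    typedef struct { int16_t re; int16_t im; } cplx16_t;

    static int16_t gray_level_q15(unsigned bits2)
    {
        /* Gray mapping: 00 -> -3, 01 -> -1, 11 -> +1, 10 -> +3 */
        static const int level[4] = { -3, -1, +3, +1 };
        return (int16_t)(level[bits2 & 3] * (32767 / 4));   /* keep |value| < 1.0 */
    }

    cplx16_t qam16_map(unsigned nibble)      /* nibble = 4 FEC output bits */
    {
        cplx16_t s;
        s.re = gray_level_q15((nibble >> 2) & 3);   /* upper 2 bits -> I */
        s.im = gray_level_q15(nibble & 3);          /* lower 2 bits -> Q */
        return s;
    }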
14
The OFDM pipeline - an overview

• Transmission pipeline:
  – QAM mapper converts transmit data to channel symbols
  – serial symbol sequence is formed into a parallel symbol frame
  – Inverse Fast Fourier Transform (IFFT) converts sub-carriers to the time domain
  – cyclic prefix is inserted; symbol is scaled, clipped, and sent to the D/A converter
  – signal is DC-biased and drives an output LED to create the optical channel (e.g. 10 MHz)
• Receiver performs a similar process in reverse, using a forward FFT
• OFDM frame processing can be overlapped, at both Tx and Rx
• Stages can be pipelined; each stage has very different requirements
• FFT / IFFT dominates the computational load
[Block diagram: Transmitter: Tx bits → QAM mapper (f_l,m) → S/P → IFFT → cyclic prefix insertion → symbol scaling → clipping (x_l,k) → P/S → D/A → biasing → LED. Optical channel: x_l,k * h(k), plus AWGN z_l,k. Receiver: A/D (y_l,k) → S/P → cyclic prefix removal → FFT → channel equalizer → P/S → QAM demapper → Rx bits.]
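As one example of a per-frame transmit stage, the cyclic prefix is simply the last CP samples of the IFFT output copied in front of the symbol; a minimal sketch, with buffer layout and names chosen for illustration rather than taken from the chip's DMA format:

    #include <stdint.h>
    #include <string.h>

    /* Prepend a cyclic prefix to one OFDM symbol of n time-domain samples.
     * 'out' must have room for cp_len + n samples; samples are int16_t here
     * (interleaved I/Q would simply double the counts). Illustrative sketch. */
    void insert_cyclic_prefix(const int16_t *ifft_out, int16_t *out,
                              unsigned n, unsigned cp_len)
    {
        memcpy(out, ifft_out + (n - cp_len), cp_len * sizeof(int16_t)); /* copy tail */
        memcpy(out + cp_len, ifft_out, n * sizeof(int16_t));            /* full symbol */
    }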
15
FFT / IFFT computations

• Based on the recursive Cooley-Tukey algorithm
• In-place algorithm reduces memory space
• Each FFT butterfly computes:

    w = twiddle_factor(i,k)     // Load w from constant array
    tmp = x[i]                  // Load x[i]
    x[i] = tmp + x[j]           // Load x[j], Add, Store x[i]
    x[j] = w * (tmp - x[j])     // Sub, Mul, Store x[j]

  assuming j = i + (N >> k) was computed previously

• Values are fractional fixed-point, rather than floating-point, for speed and energy-efficiency
• Key optimization targets:
  – maximize butterfly computation rate
  – minimize energy consumption per butterfly
  – minimize memory footprint

[Diagram: butterfly network of an in-place FFT, stages k = 2, 1, 0]
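A minimal C version of the butterfly above, assuming Q1.15 fixed-point complex samples; the types, rounding and per-stage scaling choices are illustrative assumptions rather than the Pasta2 FFT extension's actual datapath.

    #include <stdint.h>

    typedef struct { int16_t re; int16_t im; } cplx16_t;   /* Q1.15 complex sample */

    /* Q1.15 complex multiply with rounding (illustrative). */
    static cplx16_t cmul_q15(cplx16_t a, cplx16_t b)
    {
        cplx16_t r;
        r.re = (int16_t)(((int32_t)a.re * b.re - (int32_t)a.im * b.im + (1 << 14)) >> 15);
        r.im = (int16_t)(((int32_t)a.re * b.im + (int32_t)a.im * b.re + (1 << 14)) >> 15);
        return r;
    }

    /* One in-place radix-2 butterfly on x[i] and x[j], with j = i + (N >> k).
     * Per-stage scaling to avoid overflow of the additions is not shown. */
    static void butterfly(cplx16_t *x, unsigned i, unsigned j, cplx16_t w)
    {
        cplx16_t tmp  = x[i];
        cplx16_t diff = { (int16_t)(tmp.re - x[j].re), (int16_t)(tmp.im - x[j].im) };
        x[i].re = (int16_t)(tmp.re + x[j].re);       /* x[i] = tmp + x[j] */
        x[i].im = (int16_t)(tmp.im + x[j].im);
        x[j] = cmul_q15(w, diff);                    /* x[j] = w * (tmp - x[j]) */
    }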
16
PART III – Design of an energy-efficient many-core system
The Pasta2 32-core chip
17
Li-Fi research chip

• Background
  – Developed within the EPSRC project "Pasta2"
• Goals of the chip
  – Research platform for adaptive visible light communication using OFDM
  – Provide sufficient computational capacity to allow for significant experimentation
  – Deliver high performance through multi-core parallelism, using very small lightweight cores, at low power consumption
  – Proof of principle for a commercial Li-Fi communications product
• Development timeline
  – Started system development in 2010-11
  – Rev #1 test-chip taped out Nov 2013; chips functional, but required improvements
  – Rev #2 taped out Nov 2014, funded by an ERI PoP award; received April 2015
18
Design Goals

• Heterogeneous clustered multi-core architecture
  – Mixture of specialized DSP cores and generic RISC cores, all derived from a common instruction-set architecture
• Custom processor extensions for FFT acceleration
  – Focus on energy efficiency and cycle efficiency
• DMA engines with inbuilt transformations
  – Replace energy-sapping software functions with lean H/W structures
• Primarily a platform for real-time software development for OFDM
• Architecture of the test-chip may form the basis of a future product
  – Design flow based on standard industry practices
  – Processor core should be licensable and extensible
  – Key value-add is in the processor extension, multi-core interconnect and I/O
• Targets modest clock frequencies, using a low-power process
• Relies on multi-core parallelism to achieve the required throughput
19
Parallelism and streaming
[Block diagram: Transmitter: Tx bits → convolutional encoder → puncture → QAM mapper (f_l,m) → S/P → IFFT → 8Q24-to-1Q15 conversion → cyclic prefix insertion → symbol scaling → clipping (x_l,k) → P/S → D/A → biasing → LED. Optical channel: x_l,k * h(k), plus AWGN z_l,k. Receiver: A/D (y_l,k) → S/P → cyclic prefix removal → FFT → 8Q24-to-1Q15 conversion → equalizer → P/S → QAM demapper → de-puncture → Viterbi decoder → Rx bits.]

• Multiple embedded cores running in parallel
  – one OFDM symbol per core
  – one core per stage of the computation
• Embedded core enhanced with FFT acceleration
  – in-place FFT / IFFT, with data stored in multi-banked scratchpad RAM (DCCM)
  – DMA input, (I)FFT, and DMA output all running at full speed from the DCCM
• Streaming DMA with on-the-fly FEC, QAM mapping and data conversion
  – DMA capable of chaining transforms within a DMA operation, e.g. convolutional encoder → puncture → QAM mapper
• Coordinating core + DMA to I/O channels (optical transducers)
  – concurrent transfer to/from the PHY interfaces
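To illustrate what "chaining transforms within a DMA operation" could look like from software, here is a purely hypothetical descriptor layout; the structure, field names and transform identifiers are invented for illustration and do not describe the Pasta2 DMA programming interface.

    #include <stdint.h>

    /* Hypothetical chained-DMA descriptor: each stage applies one transform
     * to the data as it streams from src to dst (illustrative only). */
    enum dma_transform { XFRM_NONE, XFRM_CONV_ENCODE, XFRM_PUNCTURE, XFRM_QAM_MAP };

    struct dma_stage {
        enum dma_transform xfrm;    /* transform applied at this stage          */
        uint32_t           param;   /* e.g. puncturing pattern or QAM order     */
    };

    struct dma_chain {
        const void      *src;       /* source buffer (e.g. Tx bit stream)       */
        void            *dst;       /* destination (e.g. DCCM symbol frame)     */
        uint32_t         len;       /* bytes to stream                          */
        struct dma_stage stage[4];  /* in-order transforms, XFRM_NONE-terminated */
    };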
20
OFDM system-on-chip architecture

• Based on the Synopsys EM6 embedded processor core
  – 3.2 CoreMarks/MHz, 1.9 DMIPS/MHz
  – 20-30 kgates, highly configurable and extensible
  – EM6 is derived from the Encore processor developed in the School of Informatics, and licensed to Synopsys
• Two heterogeneous configurations of EM6 are used:
  – P1: unmodified CPU core, with I-cache and D-cache
  – P2: extended CPU core for FFT, with I-cache, D-cache and scratchpad RAM (DCCM)
• Interconnect, on-chip memory, DMA and links
  – AXI-based interconnect, supporting multiple concurrent transactions
  – 4 x 64KB fast SRAM for buffering OFDM frames and servicing L1 caches
  – 7 x DMA devices, each supporting chained transformations
  – 4 bidirectional 10-bit serial links connect the chip to a supporting FPGA
[Block diagram: 32 cores (core 0 - core 31) attached to a shared interconnect, together with 7 DMA engines (each with a Viterbi block), 4 x 64KB SRAM banks, clock / reset / JTAG, and four I/O links (10-bit SerDes with ECC).]
21
System verification, synthesis and layout
[Images: floor-planned die; gates placed]

• Pre-silicon testing in a Xilinx V6 FPGA
  – 6-core test system
• Industry-standard design flow
  – Synopsys DC Ultra for topographic synthesis
  – Synopsys IC Compiler for place-and-route
  – Calibre DRC, LVS for physical verification
  – Synopsys StarRCXT for LPE
  – Synopsys PrimeTime for signoff timing analysis
  – VCS for post-layout gate-level simulation
• 7 CPU-days of computing for physical design:
  – 6 hours to synthesize (4 CPUs)
  – 30 hours to place and route (~40 CPU hours)
  – 1 hour for LVS (8 CPUs)
  – 6 hours for DRC (8 CPUs)
  – 8 hours for LPE (8 CPUs)
  – 2 hours for STA (1 CPU)
• 5 months to fabricate the chip
22
Summary of silicon implementation
[Annotated die plot: 8 clusters of 4 cores (core 0 - core 31) arranged around four 64 KB SRAM blocks with 2-cycle access; die is 4mm x 4mm]

• 32 cores in 8 clusters
• 8 cores specialized for FFT
• 4 x 64KB SRAM
• 7 DMA engines, with transforms
• 4 bi-directional serial links
• JTAG debug interface
• 4x4mm die
• 50 million transistors
  – 70% SRAM, 30% logic
• 144-pin CPGA package
• 65nm low-leakage process
  – UMC 65LL
  – Faraday RAM macros, HVT cells
  – UMC cells and I/Os, SVT only
• 1.2V core, 2.5V I/O
• 3.2 mW standby power
• 0.96 mW / MHz
• Worst-case Fmax = 235 MHz
• Typical Fmax = 400 MHz
23
Power density and IR drop across the die

• Supply current density @ 300 mW
  – mostly constant across the die
  – small islands around some RAMs
• IR drop less than 38 mV @ 300 mW
  – typical for 300 MHz operation
  – acceptable up to ~900 MHz
24
Computational throughput analysis

• Assume the device operates at 300 MHz...
• Instruction throughput:

                      IPC   Number   Sub-total
    FFT cores        10.5        8          84
    Regular cores       1       24          24
    Aggregate IPC                          108
    GOPS @ 300 MHz                        32.4

• Viterbi decoding throughput:
  – based on an ADD-COMPARE-SELECT kernel
  – 4 operations per decoded bit, taking 2 cycles → equivalent to ~2 IPC / DMA
  – x 3 concurrent DMA devices → 6 IPC → 1.8 GOPS
• QAM (64) mapping / de-mapping throughput:
  – 1 cycle per mapped output value, ~equivalent to 6 instructions → 6 IPC / DMA
  – x 6 concurrent DMA devices → 36 IPC → 10.8 GOPS
• Convolutional encoding throughput:
  – based on XOR-SHIFT operations
  – 7 operations per bit, taking 2 cycles → equivalent to 3.5 IPC / DMA
  – x 3 concurrent DMA devices → 10.5 IPC → 3.15 GOPS
• System throughput estimated to be equivalent to 48 GOPS
  – equivalent to 160 unmodified single-issue cores running at 300 MHz
  – 48 GOPS from high-end cores: 2.0 IPC/core → 10 cores @ 2.4 GHz → 40-50 Watts!
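For reference, the XOR-SHIFT pattern behind the convolutional-encoding estimate looks roughly like the sketch below; the constraint length K = 7 and the generator polynomials (the common 171/133 octal, rate-1/2 code) are assumptions for illustration, not necessarily the code used on the chip.

    #include <stdint.h>

    #define G0 0x79   /* generator polynomial 171 (octal) - assumed, K = 7 */
    #define G1 0x5B   /* generator polynomial 133 (octal) - assumed        */

    static unsigned parity7(unsigned v)        /* XOR of the low 7 bits */
    {
        v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
        return v & 1u;
    }

    /* Encode one input bit with a rate-1/2, K = 7 convolutional code.
     * '*state' is the 7-bit shift register; returns two coded bits (out1:out0).
     * The whole kernel is shifts and XORs - the "XOR-SHIFT" operations above. */
    static unsigned conv_encode_bit(uint8_t *state, unsigned in_bit)
    {
        *state = (uint8_t)(((*state << 1) | (in_bit & 1u)) & 0x7Fu);
        unsigned out0 = parity7(*state & G0);
        unsigned out1 = parity7(*state & G1);
        return (out1 << 1) | out0;
    }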
25
Li-Fi system architectures

[Diagram (a), experimental Li-Fi system: ADC / DAC ↔ Pasta2 modem ↔ FPGA ↔ SDRAM (x2)]
[Diagram (b), production Li-Fi system: ADC / DAC ↔ modem (with IEEE 1901 and DRAM I/F) ↔ SDRAM / Flash]

• Designed to be used in a multi-chip solution
  – modem chip
  – FPGA for I/O and the DRAM interface
  – ADC / DAC chips for the LED and sensor interfaces
• FPGA allows for flexible I/O design and avoids the need for DDR pads in the ASIC
• FPGA allows for future flexibility in interfacing the ADC / DAC to the rest of the system
• Future ASIC will integrate most FPGA functions, and possibly some ADC / DAC functions too
• Depending on memory requirements, the external SDRAM may be replaced by a Flash program memory
• Modem should be capable of supporting IEEE 1901 (HomePlug) for backhaul via lighting power lines
26
Scaling 65nm → 28nm → 16nm → 10nm...

• 65nm is not leading-edge today
• 32 cores at 28nm → 6.9 mm2
• 32 cores at 10nm → 2.5 mm2
• At 10nm, a 20 mm2 device → 2,048 cores

    Node     32-core design area
    65 nm    16 mm2
    28 nm    6.9 mm2
    16 nm    3.9 mm2
    10 nm    2.5 mm2
27
Summary

• The use of large-scale many-core architectures will allow individual cores to...
  – operate at low frequency & low voltage
  – exploit ultra-high-density logic/memory cells with ultra-low leakage
• Embedded applications are often parallel, and can exploit heterogeneous architectures
  – baseband, computer vision, ...
• Future silicon technologies will demand more energy-efficient computational systems
  – silicon doesn't have to be Dark, but it will be Dim
28
Thank you
Any Questions?
29