An Asynchronous Array of Simple Processors for DSP Applications

Advanced VLSI Course Class Presentation
An Asynchronous Array of Simple Processors
for DSP Applications
Rasoul Yousefi
Electrical and Computer Engineering Department
University of Tehran
1
Copyright
• Slides used
– Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari,
Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin,
Mandeep Singh, Bevan Baas, “An Asynchronous Array of Simple
Processors for DSP Applications”, VLSI Computation Lab, ECE
Department, University of California, Davis, USA, 2006 IEEE
International Solid-State Circuits Conference
2
Outline
• The idea
• Project objectives
• Key features of the AsAP processor
• Design of the AsAP processor
• Results
• Trends
3
The idea
• Using simple processor cores with small
instruction and data memory to dramatically
reduce area and power while increasing
performance
[Figure: AsAP processor block diagram with IMem (64 words), DMem (128 words), FIFO 0 and FIFO 1 (32 words each), ALU, MAC, Control, OSC, Output, static config, and dynamic config]
[1]
4
Outline
• The idea
• Project objectives
• Key features of the AsAP processor
• Design of the AsAP processor
• Results
• Trends
5
Project Objectives
• Programming flexibility
• Increasing performance
• Reducing power
• Reducing area
• Suitable for future fabrication technologies
[Figure: ASIC, FPGA, AsAP, and programmable DSP placed on axes of performance and energy efficiency versus programming flexibility; annotation: performance, area, and power improvement]
[1]
6
Outline
• The idea
• Project objectives
• Key features of the AsAP processor
• Design of the AsAP processor
• Results
• Trends
7
Key Features of the
AsAP Processor
• Chip multiprocessor
• Small memory and simple processor
• Globally asynchronous locally synchronous (GALS)
• Nearest neighbor communication
• High performance
• High energy efficiency
• Technology scalability
8
High Performance Through
Chip Multiprocessor
• Increasing the clock frequency is challenging
– Deeper pipeline → higher clock frequency → increased design complexity and lower energy efficiency
• Parallelism is more promising
– Instruction level parallelism (VLIW, Superscalar)
– Data level parallelism (SIMD)
– Task level parallelism
9
Task Level Parallelism
• Well suited for DSP applications
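[Figure: one processor running Task1, Task2, and Task3 against a shared memory holding A, B, and C, compared with Proc.1, Proc.2, and Proc.3 each running one task and passing the data A, B, and C between processors]
• Improves performance and potentially reduces memory size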
[1]
• Widely available in many DSP applications
[Figure: 802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband transmit path, with blocks including scram, coding, interleave, mod. map, pilots, training, load FFT, IFFT, window, scale, clip, upsamp, and filter]
[1]
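The processor-per-task arrangement above can be imitated in software. The following is only a minimal sketch, not the AsAP programming model: each thread stands in for one processor, each bounded `queue.Queue` stands in for a 32-word FIFO, and the three tasks, the `DONE` marker, and the function names are hypothetical stand-ins.

```python
import queue
import threading

FIFO_DEPTH = 32   # mirrors the 32-word FIFOs shown on the slides
DONE = object()   # hypothetical end-of-stream marker, not part of AsAP


def stage(task, inbox, outbox):
    """One 'processor': apply a task to each word and forward the result."""
    while True:
        word = inbox.get()
        if word is DONE:
            outbox.put(DONE)
            return
        outbox.put(task(word))


def run_pipeline(tasks, samples):
    """Chain tasks through bounded FIFOs; return outputs in arrival order."""
    fifos = [queue.Queue(maxsize=FIFO_DEPTH) for _ in range(len(tasks) + 1)]

    def feed():
        # The producer runs in its own thread so a full FIFO only stalls it.
        for word in samples:
            fifos[0].put(word)
        fifos[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=stage, args=(task, fifos[i], fifos[i + 1]))
        for i, task in enumerate(tasks)
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        word = fifos[-1].get()
        if word is DONE:
            break
        results.append(word)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for three DSP tasks (scale, clip, offset).
tasks = [lambda x: 3 * x, lambda x: min(x, 100), lambda x: x + 1]
outputs = run_pipeline(tasks, range(200))
```

No stage ever holds more than one word of its own plus whatever sits in its input FIFO, which is the sense in which streaming between tasks can reduce memory size.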
10
Memory Size in Modern Processors
• Memory occupies much of the area in modern
processors, which reduces the area available for
the core and consumes large amounts of power
• Memories frequently contain the critical path
• Area and energy dissipation can be
dramatically decreased while speed can be
increased with smaller memories
[Figure: area breakdown (mem, core, other; 0% to 100%) for TI C64x, Itanium, SPARC, and BlGe/L; ISSCC 02, 05]
[from 1]
11
Can we use small memory?
12
Small Memory Requirements
for DSP Tasks
• The memory required for common DSP tasks is quite small
• Several hundred words of memory are sufficient for many DSP tasks

Task                   IMem (words)   DMem (words)
N-pt FIR               6              2N
8-pt DCT               40             16
8x8 2-D DCT            154            72
Conv. coding (k = 7)   29             14
Huffman encode         200            330
N-pt convolution       29             2N
64-pt complex FFT      97             192
Bubble sort            20             1
N merge sort           50             N
Square root            62             15
Exponential            108            32
[1]
13
• A system with many processors can assign each task
to a single processor or a group of processors and
pass intermediate data between processors, so each
processor needs only a small memory
• So it is possible to use small memory
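As a rough check of this argument, the sketch below compares some fixed-size rows of the task table against one processor's memories (64-word IMem, 128-word DMem, as on the block diagram slide). The function `min_processors` is a hypothetical helper, and it assumes a task's code and data can be split evenly across processors, which the slides do not claim; treat the result as a crude lower bound.

```python
import math

IMEM_WORDS = 64    # instruction memory per processor (from the slides)
DMEM_WORDS = 128   # data memory per processor (from the slides)

# (IMem words, DMem words) for a subset of the fixed-size tasks in the table
TASKS = {
    "8-pt DCT": (40, 16),
    "Huffman encode": (200, 330),
    "64-pt complex FFT": (97, 192),
    "Bubble sort": (20, 1),
    "Square root": (62, 15),
    "Exponential": (108, 32),
}


def min_processors(imem, dmem):
    """Lower bound on processors, assuming memory splits evenly (assumption)."""
    return max(math.ceil(imem / IMEM_WORDS), math.ceil(dmem / DMEM_WORDS))


needs = {name: min_processors(*words) for name, words in TASKS.items()}

# An N-pt FIR needs 2N data words, so one processor holds at most this many taps.
max_fir_taps = DMEM_WORDS // 2

for name, count in needs.items():
    print(f"{name}: at least {count} processor(s)")
print(f"Largest single-processor FIR: {max_fir_taps} taps")
```

Under these assumptions half of the listed tasks fit in one processor and the largest (Huffman encode) needs at least four.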
14
GALS Clocking Style
• The challenge of globally synchronous systems
– Design difficulty when using high clock frequencies,
long clock wires caused by large chip sizes, and
large circuit parameter variations
– High clock power consumption and lack of flexibility
to independently control clock frequencies
• What about a fully asynchronous design style?
– Lack of EDA tool support
– Design complexity and circuit overhead
• The GALS compromise
– Synchronous blocks each operating in their own
independent clock domain
15
The GALS architecture basics
• Power breakdown in a high-performance CPU [2]
[Figure: power breakdown across Datapath, Memory, Control, I/O, and Clock]
[Adapted from 2]
16
The GALS architecture basics
• Eliminating the global clock → eliminates a
major source of power consumption [2]
• The frequency of each synchronous block (SB) can be
tailored to local needs → reduces average frequency
and overall power consumption [2]
[Figure: synchronous blocks SB1, SB2, and SB3 exchanging data through a handshake protocol]
[Adapted from 2]
17
On Chip Communication
• Methods to avoid global wires
– Network on chip (NOC)
– Local communication
– Nearest neighbor communication
[Figure: local communication versus nearest neighbor communication]
[1]
18
Outline
• The idea
• Project objectives
• Key features of the AsAP processor
• Design of the AsAP processor
• Results
• Trends
19
AsAP Block Diagram
• GALS array of identical processors
– Each processor is a reduced complexity programmable
DSP with small memories
– Each processor can receive data from any two
neighbors and send data to any of its four neighbors
[Figure: AsAP processor block diagram with IMem (64 words), DMem (128 words), FIFO 0 and FIFO 1 (32 words each), ALU, MAC, Control, OSC, Output, static config, and dynamic config]
[1]
20
Single Processor Architecture
• 54 32-bit instructions
• 9-stage pipeline
[Figure: pipeline stages IFetch, Decode, Mem. Read, Src Select, EXE 1, EXE 2, EXE 3, Result Select, and Write Back; datapath blocks include PC, Inst. Mem, Decode, Addr. Gens., FIFO 0 RD, FIFO 1 RD, Data Mem Read and Write, DC Mem Read and Write, ALU, Multiply Accumulator, Bypass, and Proc Output]
[1]
21
Programmable Clock Oscillator
• Standard cell based
• Configurable frequency
– Delay tunable stages using 7 parallel tri-state inverters
– 1 to 128 clock divider
• Results
– 1.66 MHz – 702 MHz
– Max gap: 0.08 MHz (1.66 – 500 MHz)
[Figure: oscillator built from selectable delay stages with stage_sel, reset, and halt inputs, followed by a clock divider that drives clk]
[1]
22
Inter-processor Communication
• Each processor
contains two
dual-clock FIFOs
• Rd/Wr in separate
clock domains
[Figure: each processor selects the inputs of FIFO 0 and FIFO 1 from its east, north, west, and south neighbors and feeds the CPU; each FIFO has an SRAM with write-side and read-side address logic, with binary-to-Gray conversion and synchronization between the two sides]
[1]
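The "Binary Gray & Sync." block in the figure suggests that FIFO addresses cross between the write and read clock domains in Gray code. The sketch below shows the standard binary-reflected Gray code and why it is attractive for this purpose: consecutive addresses differ in exactly one bit, so a pointer sampled mid-transition is off by at most one position. The 5-bit width is an assumption derived from the 32-word FIFO depth; the actual AsAP pointer width is not given on the slides.

```python
def binary_to_gray(n: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDR_BITS = 5  # assumption: 32-word FIFO -> 5 address bits
codes = [binary_to_gray(i) for i in range(2 ** ADDR_BITS)]

# Bit flips between each address and its successor, including the wraparound.
flips = [
    bits_changed(codes[i], codes[(i + 1) % len(codes)])
    for i in range(len(codes))
]
```

Every entry of `flips` is 1, whereas a plain binary counter flips all five bits when it wraps from 31 to 0.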
23
Advantages of the
AsAP Clocking System
• Simplified clock tree design
– The maximum span is < 1 mm in 0.18 µm
• Scalable – easy to add processors
• Improved energy efficiency
– Clock halts in 9 cycles (processor dissipates
leakage power only) and restores in < 1 cycle
according to work availability
– 53% and 65% power saving
– Independent clock and voltage scaling (individual
processor voltage scaling not implemented in this
version)
24
Physical Design
• 0.18 µm TSMC
• Standard cell
• Design flow
– Completely synthesized, except oscillator
– Macro memory blocks used for four main memories
– Completely auto placed and routed
• To simplify physical design, clock gating not
implemented in this version
25
Chip Micrograph of the 6 x 6 Array
Transistors:
– 1 Proc: 230,000
– Chip: 8.5 million
Max speed: 475 MHz @ 1.8 V
Area:
– 1 Proc: 0.66 mm²
– Chip: 32.1 mm²
Power (1 Proc @ 1.8 V, 475 MHz):
– Typical application: 32 mW
– Typical 100% active: 84 mW
– Worst case: 144 mW
Power (1 Proc @ 0.9 V, 116 MHz):
– Typical application: 2.4 mW
[Figure: chip micrograph, 5.68 mm x 5.65 mm; a single processor is 810 µm x 810 µm, with IMem, DMem, OSC, and FIFOs labeled]
[1]
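The two typical-application power figures can be compared with a first-order scaling estimate. The sketch below assumes the usual CMOS dynamic power relation, power proportional to V² · f, which is not stated on the slides; `scaled_dynamic_power` is a hypothetical helper name.

```python
def scaled_dynamic_power(p_ref_mw, v_ref, f_ref_mhz, v_new, f_new_mhz):
    """Scale a reference power by (V_new / V_ref)^2 * (f_new / f_ref)."""
    return p_ref_mw * (v_new / v_ref) ** 2 * (f_new_mhz / f_ref_mhz)


# Reference point from the slide: 32 mW at 1.8 V and 475 MHz.
estimate_mw = scaled_dynamic_power(32.0, 1.8, 475.0, 0.9, 116.0)
measured_mw = 2.4  # reported at 0.9 V and 116 MHz

print(f"V^2*f estimate: {estimate_mw:.2f} mW, reported: {measured_mw} mW")
```

The estimate comes to roughly 1.95 mW against the reported 2.4 mW. The slides do not explain the difference; power components that do not follow V² · f, such as leakage, are one plausible contributor.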
26
Outline
• The idea
• Project objectives
• Key features of the AsAP processor
• Design of the AsAP processor
• Results
• Trends
27
Area Evaluation
• Most of AsAP’s area is for the core (66%)
• Each processor requires a very small area;
more than 20x smaller than others
[Figure: area breakdown (core, mem, comm; 0% to 100%) for TI C64x, RAW, Fujitsu 4-VLIW, BlGe/L, and AsAP]
[1]
[Figure: processor area (mm², log scale from 0.1 to 100) for TI C64x, CELL/SPE, Fujitsu 4-VLIW, RAW, ARM, and AsAP, all scaled to 0.13 µm; bar values 72, 30, 29, 8.3, 7, and 0.34 for AsAP, marked 20x; ISSCC 05, 00; ISCA04; CMPON96]
[1]
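The 0.34 mm² figure for AsAP is consistent with scaling the fabricated 0.66 mm² processor from 0.18 µm to 0.13 µm. The sketch below assumes area scales with the square of the feature size; the slides say only "all scaled to 0.13 µm" and do not state the scaling rule, so `scale_area` is a hypothetical helper.

```python
def scale_area(area_mm2, from_um, to_um):
    """Scale an area between process nodes, assuming quadratic scaling."""
    return area_mm2 * (to_um / from_um) ** 2


# One AsAP processor: 0.66 mm^2 in 0.18 um, scaled to 0.13 um.
asap_scaled_mm2 = scale_area(0.66, 0.18, 0.13)
print(f"AsAP processor at 0.13 um: {asap_scaled_mm2:.2f} mm^2")
```

This gives about 0.34 mm², matching the value shown in the processor area chart.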
28
Power and Performance
[Figure: power / clock frequency / scale* (mW/MHz, log scale from 0.01 to 1) versus peak performance density (MOPS/mm², log scale from 10 to 10000) for AsAP, CELL/SPE, Fujitsu-4VLIW, ARM, RAW, and TI C64x; all scaled to 0.13 µm; ISSCC 05, 00; ISCA04; CMPON96]
* Assume 2 ops/cycle for CELL/SPE and 3.3 ops/cycle for TI C64x
Note: word widths and workload not factored
[1]
29
802.11a/802.11g Wireless
Transmitter Implementation
• 22 processors
• Fully functional on chip
• 407 mW @ 300 MHz, 30% of 54 Mb/s
• Code unscheduled and lightly optimized
• ~10x performance and 35x – 75x lower energy
dissipation than 8-way VLIW TI C62x
[SIPS 04; ICC 02]
[Figure: 22-processor mapping from data bits to the D/A converter, with blocks Pad, Scram, Conv. Code, Punc, Interleave 1, Interleave 2, Mod. Map, Pilot Insert, Train, IFFT (BR, BF, and Mem), GI/Wind., FIR, Output, and Output Sync]
[1]
30
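The transmitter figures above can be turned into a throughput and an energy-per-bit estimate. This sketch reads "30% of 54 Mb/s" as the throughput achieved at 300 MHz, which is an interpretation of the slide rather than something it states outright; the variable names are illustrative.

```python
FULL_RATE_MBPS = 54.0      # 802.11a/g full data rate
FRACTION_ACHIEVED = 0.30   # "30% of 54 Mb/s" (interpretation, see above)
TOTAL_POWER_MW = 407.0     # at 300 MHz
NUM_PROCESSORS = 22

throughput_mbps = FULL_RATE_MBPS * FRACTION_ACHIEVED
# mW divided by Mb/s gives nJ per bit (1e-3 W / 1e6 bit/s = 1e-9 J/bit).
energy_per_bit_nj = TOTAL_POWER_MW / throughput_mbps
avg_power_per_proc_mw = TOTAL_POWER_MW / NUM_PROCESSORS

print(f"Throughput: {throughput_mbps:.1f} Mb/s")
print(f"Energy per bit: {energy_per_bit_nj:.1f} nJ")
print(f"Average power per processor: {avg_power_per_proc_mw:.1f} mW")
```

Under that reading the implementation delivers about 16.2 Mb/s at roughly 25 nJ per transmitted bit, with an average of 18.5 mW per processor.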
Summary
• Scalable programmable processing array
– Many processors on a single chip
– Reduced complexity processors with small memories
– GALS clocking style
– Nearest neighbor communication
• Results
– 0.18 µm, 475 MHz @ 1.8 V
• 32 mW application power
• 84 mW 100% active power
• 2.4 mW application power @ 116 MHz and 0.9 V
– High performance density: 475 MOPS in 0.66 mm²
– Well suited for future fabrication technologies
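The headline numbers can be combined into a density and an energy-per-cycle figure. The sketch below is plain arithmetic on the values listed above; the variable names are illustrative, and the energy figure is per clock cycle, which equals energy per operation only at the peak rate of one operation per cycle implied by 475 MOPS at 475 MHz.

```python
PEAK_MOPS = 475.0
PROC_AREA_MM2 = 0.66

# (application power in mW, clock in MHz) at the two reported operating points
OPERATING_POINTS = {
    "1.8 V": (32.0, 475.0),
    "0.9 V": (2.4, 116.0),
}

density_mops_per_mm2 = PEAK_MOPS / PROC_AREA_MM2

# mW / MHz = nJ per cycle; multiply by 1000 for pJ per cycle.
energy_per_cycle_pj = {
    label: 1000.0 * power_mw / clock_mhz
    for label, (power_mw, clock_mhz) in OPERATING_POINTS.items()
}

print(f"Performance density: {density_mops_per_mm2:.0f} MOPS/mm^2")
for label, pj in energy_per_cycle_pj.items():
    print(f"Energy per cycle at {label}: {pj:.1f} pJ")
```

This gives about 720 MOPS/mm² in 0.18 µm, and roughly 67 pJ per cycle at 1.8 V against roughly 21 pJ per cycle at 0.9 V.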
31
Trends
• Clock gating
• Dynamic voltage scaling, which adds level conversion at the
asynchronous communication interface [3]
• CAD tools
– CAD tools for asynchronous design are mature
but not yet commercially strong [3]
32
References
1. Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari,
Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin,
Mandeep Singh, Bevan Baas, “An Asynchronous Array of
Simple Processors for DSP Applications”, VLSI Computation
Lab, ECE Department, University of California, Davis, USA,
2006 IEEE International Solid-State Circuits Conference
2. A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson,
J. Oberg, P. Ellervee, D. Lundqvist, “Lowering power consumption in
clock by using Globally Asynchronous Locally Synchronous
design style”
3. Anoop Iyer, Diana Marculescu, “Power and Performance
Evaluation of Globally Asynchronous Locally Synchronous
Processors”, Proceedings of the 29th Annual International Symposium
on Computer Architecture (ISCA'02)
33
Question?
Thanks for your attention
34