Bluespec technical deep dive

Architectural Exploration:
802.11a Transmitter
Arvind, Nirav Dave, Steve Gerding, Mike Pellauer
Computer Science & Artificial Intelligence Laboratory
Massachusetts Institute of Technology
MIT-Nokia Architecture Group
Helsinki, June 5, 2006
Why architectural exploration
Architects are clever people and can think of a variety of designs,
but they often cannot determine which design is best for a given metric (e.g., power).
There is too little time and manpower to take several designs far enough for proper evaluation.
=> Guesswork instead of architectural exploration
New design tools can change all that.
This talk
Architectural exploration of an 802.11a transmitter
The goal is to show that it is easy and economical to do so in Bluespec
You don’t have to know 802.11a or Bluespec to understand the talk
802.11a Transmitter Overview
[Figure: transmitter block diagram. Controller -> Scrambler -> Encoder -> Interleaver -> Mapper -> IFFT -> Cyclic Extend. Other labels: headers, data, 24 uncoded bits.]
Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol
IFFT: transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; accounts for > 95% of the area
One OFDM symbol = 64 complex numbers
Must produce one OFDM symbol every 4 μsec
Combinational IFFT
[Figure: 64-point combinational IFFT from in0..in63 to out0..out63. Three stages of 16 Radix-4 nodes each; the stages are followed by Permute_1, Permute_2 and Permute_3. Inset: one Radix-4 node with inputs t0..t3, built from multipliers (*), adders and subtractors (+, -) and one multiply by j (*j).]
All numbers are complex and represented as two sixteen-bit quantities.
Fixed-point arithmetic is used to reduce area, power, ...
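To make the "two sixteen-bit quantities" representation concrete, here is a minimal Python sketch of a complex multiply in 16-bit fixed point. The Q1.15 format (FRAC_BITS = 15), the truncating shift, and the helper names to_fixed, fx_mul and cmul are assumptions for illustration; the slides do not state the actual format or rounding mode.

```python
FRAC_BITS = 15  # assumption: Q1.15, i.e. 1 sign bit and 15 fraction bits


def to_fixed(x):
    """Convert a float in [-1, 1) to a saturated 16-bit fixed-point integer."""
    return max(-32768, min(32767, round(x * (1 << FRAC_BITS))))


def fx_mul(a, b):
    """Multiply two Q1.15 values; the shift truncates the extra fraction bits."""
    return (a * b) >> FRAC_BITS


def cmul(a, b):
    """Complex multiply on (real, imag) pairs of Q1.15 integers.

    The sums are not saturated here; real hardware would have to
    handle overflow explicitly.
    """
    (ar, ai), (br, bi) = a, b
    return (fx_mul(ar, br) - fx_mul(ai, bi),
            fx_mul(ar, bi) + fx_mul(ai, br))


half = to_fixed(0.5)
print(cmul((half, 0), (half, 0)))   # 0.5 * 0.5
print(cmul((0, half), (0, half)))   # 0.5j * 0.5j
```

Each complex multiply costs four integer multiplies and two adds, which is why the number of multipliers dominates the area discussion that follows.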
Design Tradeoffs
1. We can decrease the area by multiplexing some circuits
   It may be a win if the throughput requirements can be met without increasing the frequency
2. Power can be lowered by lowering the frequency, which can be adjusted by changing the voltage
   power ∝ (voltage)^2
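The voltage relation on this slide is the usual first-order model of dynamic CMOS power. As a sketch, with an activity factor \alpha and a switched capacitance C (symbols that do not appear in the slides):

```latex
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \left(\frac{V_{\text{new}}}{V_{\text{old}}}\right)^{2}
    \frac{f_{\text{new}}}{f_{\text{old}}}
```

Under this model, halving the frequency alone halves the power, and any accompanying reduction in voltage contributes quadratically.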
Combinational IFFT
Opportunity for reuse
[Figure: the same three-stage network as the combinational IFFT (16 Radix-4 nodes per stage, with Permute_1, Permute_2 and Permute_3), from in0..in63 to out0..out63.]
Reuse the same circuit three times
Circular pipeline: Reusing the
Pipeline Stage
[Figure: circular pipeline from in0..in63 to out0..out63. Components: one row of 16 Radix-4 nodes, Permute_1, Permute_2, Permute_3, 64 4-way muxes, and a Stage Counter.]
16 Radix-4s can be shared but not the three permutations. Hence the need for muxes.
Superfolded circular pipeline:
Just one Radix-4 node!
[Figure: superfolded circular pipeline from in0..in63 to out0..out63. Components: a single Radix-4 node, 4 16-way muxes and 4 16-way demuxes driven by an Index Counter (0 to 15), Permute_1, Permute_2, Permute_3, 64 4-way muxes, and a Stage Counter (0 to 2).]
Designs with 2, 4, and 8 Radix-4 modules make sense too!
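The two counters determine how many passes through the shared Radix-4 hardware one IFFT needs. The following Python sketch counts those passes for a given number of shared nodes. The function names ifft_passes and schedule are hypothetical, and the count covers Radix-4 work only, so it is not the same thing as the latency figures reported for the whole transmitter.

```python
STAGES = 3            # Stage Counter runs 0 to 2
NODES_PER_STAGE = 16  # Index Counter runs 0 to 15 when a single node is shared


def ifft_passes(num_radix4):
    """Passes through the shared Radix-4 nodes needed for one 64-point IFFT."""
    if num_radix4 <= 0 or NODES_PER_STAGE % num_radix4 != 0:
        raise ValueError("num_radix4 must divide 16")
    return STAGES * (NODES_PER_STAGE // num_radix4)


def schedule(num_radix4):
    """Yield (stage, first_node_index) for every pass, in execution order."""
    for stage in range(STAGES):
        for index in range(0, NODES_PER_STAGE, num_radix4):
            yield stage, index


for n in (1, 2, 4, 8, 16):
    print(n, "Radix-4 node(s):", ifft_passes(n), "passes")
```

Halving the number of nodes doubles the number of passes, which is the area versus clock-rate trade-off the later tables quantify.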
Which design consumes the least
energy to transmit a symbol?
Can we quickly code up all the alternatives?
   From a single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL
Expressing the designs in
Bluespec
Bluespec code: Radix-4 Node
function Vector#(4,Complex) radix4(Vector#(4,Complex) t,
                                   Vector#(4,Complex) k);
   Vector#(4,Complex) m = newVector(),
                      y = newVector(),
                      z = newVector();
   // Four multiplies
   m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
   m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
   // First layer of adds and subtracts; i corresponds to the *j in the figure
   y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
   y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);
   // Second layer of adds and subtracts
   z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
   z[2] = y[0] - y[2]; z[3] = y[1] - y[3];
   return(z);
endfunction
[Figure: Radix-4 node datapath: four multipliers (*) followed by two layers of adders and subtractors (+, -), with one multiply by j (*j).]
Polymorphic code: works on any type of numbers for which *, + and - have been defined
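For checking the node against a software reference, the same dataflow can be written in a few lines of Python. This is a sketch that mirrors the slide's function line by line using Python's built-in complex numbers; it is not part of the Bluespec design, and the argument values below are made up for illustration.

```python
def radix4(t, k):
    """Software mirror of the slide's radix4: t and k are length-4 sequences."""
    # Four multiplies
    m = [k[n] * t[n] for n in range(4)]
    # First layer of adds and subtracts; 1j plays the role of *j
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    # Second layer of adds and subtracts
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


ones = [1, 1, 1, 1]
print(radix4(ones, [1, 2, 3, 4]))
```

With one operand set to all ones, the node reduces to an unnormalized 4-point inverse DFT of the other operand, which gives a convenient sanity check for any hardware implementation.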
Combinational IFFT
Can be used as a reference
[Figure: the three-stage combinational IFFT network from in0..in63 to out0..out63; one stage (16 Radix-4 nodes followed by a permutation) is marked as the stage_f function.]
stage_f function: repeat it three times
Bluespec Code for
Combinational IFFT
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
   // Declare vectors
   SVector#(4, SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);
   return(stage_data[3]);
endfunction
The code is unfolded to generate a combinational circuit

Stage function
function SVector#(64, Complex) stage_f(Bit#(2) stage,
                                       SVector#(64, Complex) stage_in);
   // Declarations are not shown on the slide; assumed here
   SVector#(64, Complex) stage_temp = newSVector, stage_out = newSVector;
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      // stage_in[idx:idx+3] is slide shorthand for elements idx .. idx+3
      let y = radix4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]     = y[0]; stage_temp[idx + 1] = y[1];
      stage_temp[idx + 2] = y[2]; stage_temp[idx + 3] = y[3];
   end
   // Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
   return(stage_out);
endfunction
Synchronous pipeline
[Figure: inQ -> f1 -> sReg1 -> f2 -> sReg2 -> f3 -> outQ, with a data item x entering from inQ.]
rule sync_pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule
This is real IFFT code; just replace f1, f2 and f3 with stage_f code
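The rule's behavior depends on every register read seeing the value from before the rule fired. A small Python model makes that explicit. It is a sketch under assumptions: the name run_sync_pipeline, the zero initial register values, and the zero flush tokens are all hypothetical, and like the slide's rule it carries no valid bits, so the first two outputs are discarded as start-up garbage.

```python
def run_sync_pipeline(inputs, f1, f2, f3, depth=2):
    """Simulate the three-stage synchronous pipeline, one rule firing per loop."""
    s_reg1 = s_reg2 = 0              # assumed reset values
    outputs = []
    # Append `depth` dummy tokens so the last real inputs drain out.
    for x in list(inputs) + [0] * depth:
        out = f3(s_reg2)             # reads the old sReg2
        # Simultaneous update: f2 reads the old sReg1, as in the rule.
        s_reg1, s_reg2 = f1(x), f2(s_reg1)
        outputs.append(out)
    return outputs[depth:]           # drop results computed from reset values


result = run_sync_pipeline([1, 2, 3],
                           lambda v: v + 1,
                           lambda v: v * 2,
                           lambda v: v - 3)
print(result)
```

After the pipeline fills, one result emerges per firing, and each result equals f3(f2(f1(x))) for the input that entered two firings earlier.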
Folded pipeline
[Figure: inQ -> f -> outQ, where f stands for f1, f2 or f3 as selected by the stage register; intermediate results are held in sReg.]
rule folded_pipeline (True);
   if (stage==1)
      begin inQ.deq();
            sxIn = inQ.first(); end
   else
      sxIn = sReg;
   sxOut = f(stage,sxIn);
   if (stage==3) outQ.enq(sxOut);
   else sReg <= sxOut;
   stage <= (stage==3)? 1 : stage+1;
endrule

function f (stage,sx);
   case (stage)
      1: return f1(sx);
      2: return f2(sx);
      3: return f3(sx);
   endcase
endfunction
This is real IFFT code too ...
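The folded rule trades throughput for hardware: one function unit is visited three times per data item. The Python sketch below models that control flow and counts rule firings. The name run_folded_pipeline and the unbounded Python queues are assumptions for illustration; the real rule would also stall on an empty inQ or a full outQ.

```python
from collections import deque


def run_folded_pipeline(inputs, f1, f2, f3):
    """Simulate the folded pipeline; returns (outputs, number of rule firings)."""
    stages = {1: f1, 2: f2, 3: f3}
    in_q = deque(inputs)
    out_q = []
    stage, s_reg, cycles = 1, None, 0
    while in_q or stage != 1:
        # Stage 1 takes a fresh item; later stages reuse the held result.
        sx_in = in_q.popleft() if stage == 1 else s_reg
        sx_out = stages[stage](sx_in)
        if stage == 3:
            out_q.append(sx_out)
        else:
            s_reg = sx_out
        stage = 1 if stage == 3 else stage + 1
        cycles += 1
    return out_q, cycles


outs, cycles = run_folded_pipeline([1, 2, 3],
                                   lambda v: v + 1,
                                   lambda v: v * 2,
                                   lambda v: v - 3)
print(outs, cycles)
```

The results match a straight three-stage pipeline, but each item occupies the unit for three firings, so three items take nine firings.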
Expressing these designs in
Bluespec is easy
All these designs were done in less than one day!
   Combinational
   Pipelined
   Folded (16 Radices)
   Super-Folded (8 Radices)
   Super-Folded (4 Radices)
   Super-Folded (2 Radices)
   Super-Folded (1 Radix)
Area and power estimates?
How long will it take to write these designs in Verilog? VHDL? SystemC?
Bluespec Tool flow
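[Figure: tool flow. Bluespec SystemVerilog source -> Bluespec Compiler. The compiler emits C for the cycle-accurate Bluespec C sim, and Verilog 95 RTL. The Verilog RTL feeds Verilog sim, RTL synthesis (-> gates) and FPGA. Simulation produces VCD output for Debussy visualization. The power estimation tool is Sequence Design PowerTheater.]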
802.11a Transmitter Synthesis
results for various IFFT designs
IFFT Design                 Area (mm^2)   Min. CLK Period (ns)   Latency (clks/Sym)   ns/output (req 4000)
Combinational               15.15         33.0                   10                   132
Pipelined                   15.50         12.2                   12                   49
Folded (16 Radices)         6.26          13.0                   12                   52
Super-Folded (8 Radices)    4.02          13.1                   15                   79
SF (4 Radices)              2.86          13.1                   21                   157
SF (2 Radices)              2.33          13.2                   33                   317
SF (1 Radix)                2.00          13.2                   48                   634
TSMC .18 micron; numbers reported are before place and route.
Some areas will be larger after layout.
Algorithmic Improvements
[Figure: the three-stage combinational IFFT network from in0..in63 to out0..out63 (16 Radix-4 nodes per stage, with Permute_1, Permute_2 and Permute_3).]
1. All three permutations can be made identical
   => more savings in area
2. One multiplication can be removed from Radix-4
802.11a Transmitter Synthesis
results: old vs. new IFFT designs
IFFT Design                 Old Area (mm^2)   New Area (mm^2)
Combinational               15.15             5.91
Pipelined                   15.50             6.26
Folded (16 Radices)         6.26              4.61
Super-Folded (8 Radices)    4.02              3.57
SF (4 Radices)              2.86              2.75
SF (2 Radices)              2.33              2.21
SF (1 Radix)                2.00              1.67
(Slide annotations on the table: "???" and "expected")
TSMC .18 micron; numbers reported are before place and route.
802.11a Transmitter Synthesis
results with new IFFT designs
IFFT Design                 Area (mm^2)   Min. CLK Period (ns)   Latency (clks/Symbol)   Min. ns/output   Permitted Clock scaling
Combinational               5.91          33.0                   10                      132              30
Pipelined                   6.26          12.0                   12                      49               83
Folded (16 Radices)         4.61          13.0                   12                      52               77
Super-Folded (8 Radices)    3.57          13.1                   15                      79               51
SF (4 Radices)              2.75          13.1                   21                      157              25
SF (2 Radices)              2.21          13.1                   33                      314              13
SF (1 Radix)                1.67          13.1                   57                      629              6
TSMC .18 micron; numbers reported are before place and route.
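The slides do not define "Permitted Clock scaling". The numbers are consistent with the required symbol time (4000 ns, the "req 4000" figure) divided by the minimum ns per output, and the Python sketch below checks that reading. The formula and the names permitted_scaling, MIN_NS_PER_OUTPUT and TABLE_SCALING are assumptions; the values are copied from the table.

```python
SYMBOL_TIME_NS = 4000  # one OFDM symbol must be produced every 4000 ns

# Min. ns/output and Permitted Clock scaling, copied from the table
MIN_NS_PER_OUTPUT = {
    "Combinational": 132,
    "Pipelined": 49,
    "Folded (16 Radices)": 52,
    "Super-Folded (8 Radices)": 79,
    "SF (4 Radices)": 157,
    "SF (2 Radices)": 314,
    "SF (1 Radix)": 629,
}
TABLE_SCALING = {
    "Combinational": 30,
    "Pipelined": 83,
    "Folded (16 Radices)": 77,
    "Super-Folded (8 Radices)": 51,
    "SF (4 Radices)": 25,
    "SF (2 Radices)": 13,
    "SF (1 Radix)": 6,
}


def permitted_scaling(ns_per_output):
    """Factor by which the clock may be slowed while still meeting 4000 ns."""
    return SYMBOL_TIME_NS / ns_per_output


for name, ns in MIN_NS_PER_OUTPUT.items():
    print(f"{name:26s} computed {permitted_scaling(ns):6.2f}   table {TABLE_SCALING[name]}")
```

Under this assumed formula every row agrees with the table to within one unit; a difference of that size can come from the rounding of the ns/output column.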
802.11a Transmitter with new
IFFT designs: Power Estimates
c1                  c2            c3          c4             c5               c6
IFFT Design         Area (mm^2)   Min Freq.   Power (mW)     Power (mW)       Energy/Symbol (nJ)
                                              @ 100 MHz      @ Min Freq.
Combinational       5.91          1 MHz       398.6          0.399            1.594
Pipeline (48 R-4)   6.26          1 MHz       438.6          0.439            1.754
Folded (16 R-4)     4.61          1 MHz       475.6          0.476            1.902
SF (8 R-4)          3.57          1.5 MHz     299.7          0.446            1.798
SF (4 R-4)          2.75          3 MHz       166.2          0.499            1.994
SF (2 R-4)          2.21          6 MHz       98.7           0.592            2.369
SF (1 R-4)          1.67          12 MHz      66.2           0.794            3.178
c3 = min clock x scaling factor;
c4 is raw data collected by the Sequence Design PowerTheater
c5 = c4 x c3 / 100 MHz / voltage scaling (=10);
c6 = c5 x 4 μsec
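The footnote formulas for columns c5 and c6 can be replayed directly. The Python sketch below applies them to a few rows of the table, with the symbol time taken as 4 μs (4000 ns, the "req 4000" figure in the synthesis tables). The function and constant names are hypothetical; the input values are copied from columns c3 and c4.

```python
REF_FREQ_MHZ = 100.0     # c4 is measured at 100 MHz
VOLTAGE_SCALING = 10.0   # the footnote's "voltage scaling (=10)"
SYMBOL_TIME_US = 4.0     # one OFDM symbol every 4 microseconds


def power_at_min_freq_mw(min_freq_mhz, power_100mhz_mw):
    """c5 = c4 x c3 / 100 MHz / voltage scaling."""
    return power_100mhz_mw * min_freq_mhz / REF_FREQ_MHZ / VOLTAGE_SCALING


def energy_per_symbol_nj(power_mw):
    """c6 = c5 x symbol time; mW times microseconds gives nJ."""
    return power_mw * SYMBOL_TIME_US


# (design, c3 in MHz, c4 in mW), copied from the table
rows = [
    ("Combinational", 1.0, 398.6),
    ("Folded (16 R-4)", 1.0, 475.6),
    ("SF (1 R-4)", 12.0, 66.2),
]
for name, freq, p100 in rows:
    p = power_at_min_freq_mw(freq, p100)
    print(f"{name:16s} c5 = {p:.4f} mW   c6 = {energy_per_symbol_nj(p):.4f} nJ")
```

The comparison shows why the smallest design is not the most energy-efficient: its power at 100 MHz is lowest, but it must run at twelve times the clock rate to finish a symbol in time.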
Summary
It is essential to do architectural
exploration for better (area, power,
performance, ...) designs.
It is possible to do so with new design
tools and methodologies.
Better and faster tools for estimating
area, timing and power would
dramatically increase our capability to
do architectural exploration.
Thanks