RAMP BLUE Double Floating Point Coprocessor


RAMP BLUE:
Double-Floating Point
Coprocessor
Mitch Harwell
David Tylman
What is RAMP?

Research Accelerator for Multiple Processors
With multiple FPGAs on multiple BEE2 boards in a single
chassis, RAMP is building a massively
parallel multiprocessor system.
Why RAMP?




We have hit a “Power wall”: power consumption has become increasingly troublesome,
as has dissipating heat through the air. Power has become expensive,
while transistors are essentially free.
We have reached an “ILP wall”, where diminishing returns mean ever more
hardware is needed to squeeze the last ILP out of a design.
Along with power, we have hit a “Memory wall”, where memory latencies have
become restrictive
(200 clock cycles to DRAM, 4 clock cycles for a multiply).
Power Wall + ILP Wall + Memory Wall = Brick Wall

Because traditional uniprocessors will cease to exhibit the performance gains
of the last three decades, it is necessary to investigate other means of speeding
up computation. However, the computer architecture community lacks the basic
infrastructure tools required to carry out this research.

RAMP will accelerate research across all the fields that touch multiple
processors: operating systems, compilers, debuggers, programming languages,
scientific libraries, and so on.
Design Decisions

The interface was chosen for the purpose of minimizing the time spent
transferring data over the FSL bus.




No acknowledgements or synchronization structures were used.
We send the FPU control information over the
FSL_Control lines instead of sending a fifth data word.
This works under the assumption that the interface always
expects four input words and two output words.
The hardware unit was designed to be as simple as possible.


None of the units are pipelined, and only one functional unit
(add/sub, mult, div, sqrt, comp, fx->fl, fl->fx) runs at a
time.
New values are not processed until the calculation on the old
values has completed.
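
The four-words-in, two-words-out interface can be illustrated with a minimal C sketch. Everything named here is an assumption made for illustration: the slides do not give the opcode encoding, the word order, or any of these identifiers. The sketch assumes the operands are sent as A high, A low, B high, B low, and that the four control bits (one per data word) together form a 4-bit opcode, most significant bit first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-bit opcodes; the real encoding is not given in the slides. */
enum dfpu_op { DFPU_ADD = 0x0, DFPU_SUB = 0x1, DFPU_MUL = 0x2, DFPU_DIV = 0x3 };

/* One FSL transfer: a 32-bit data word plus the single control bit. */
struct fsl_word {
    uint32_t data;
    unsigned control;
};

/* Split two doubles into four words (assumed order: A high, A low,
   B high, B low) and spread the opcode over the four control bits,
   most significant bit first, so no fifth data word is needed. */
static void dfpu_pack(double a, double b, enum dfpu_op op,
                      struct fsl_word out[4])
{
    uint64_t bits[2];

    assert((unsigned)op <= 0xFu);      /* opcode must fit in four bits */
    memcpy(&bits[0], &a, sizeof a);    /* reinterpret without aliasing issues */
    memcpy(&bits[1], &b, sizeof b);

    for (int i = 0; i < 4; i++) {
        uint64_t v = bits[i / 2];
        out[i].data = (i % 2 == 0) ? (uint32_t)(v >> 32) : (uint32_t)v;
        out[i].control = ((unsigned)op >> (3 - i)) & 1u;
    }
}

/* Rebuild the 64-bit result from the two returned words (high, then low). */
static double dfpu_unpack(uint32_t hi, uint32_t lo)
{
    uint64_t v = ((uint64_t)hi << 32) | lo;
    double d;

    memcpy(&d, &v, sizeof d);
    return d;
}
```

Packing 1.0 and 2.0 with DFPU_MUL, for instance, yields the data words 0x3FF00000, 0x00000000, 0x40000000, 0x00000000 with control bits 0, 0, 1, 0. Because the word counts are fixed, neither side needs a length field or an acknowledgement.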
Software Shenanigans



gcc translates floating-point math operations
into function calls. The operands are broken into
four 32-bit words and sent one at a time over
the FSL bus.
For each data word, we also transmit a
control bit to specify which operation to
perform.
We stall the processor until the answer
appears on the FSL bus.
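
The call flow above can be sketched in C against a simulated link, so that it runs on any host. This is not the authors' code. The functions sim_fsl_put and sim_fsl_get are hypothetical stand-ins for blocking FSL writes and reads, the opcode values are assumed, and the "coprocessor" is a few lines of C that compute the answer as soon as the fourth word arrives. On a soft-float target, gcc would call library routines (libgcc names such as __adddf3 for double addition); a routine like dfpu_add below illustrates what a hardware-backed replacement might do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated link and coprocessor state (stand-in for the FSL and DFPU). */
static uint32_t sim_in[4];    /* operand words received so far */
static unsigned sim_count;    /* how many operand words have arrived */
static unsigned sim_ctrl;     /* control bits collected, one per word */
static uint32_t sim_out[2];   /* result words waiting to be read */
static unsigned sim_ready;    /* number of unread result words */

static uint64_t double_to_bits(double d)
{
    uint64_t v;
    memcpy(&v, &d, sizeof v);
    return v;
}

static double bits_to_double(uint64_t v)
{
    double d;
    memcpy(&d, &v, sizeof d);
    return d;
}

/* Hypothetical blocking write: one data word plus one control bit. */
static void sim_fsl_put(uint32_t data, unsigned control)
{
    sim_in[sim_count] = data;
    sim_ctrl = (sim_ctrl << 1) | (control & 1u);
    if (++sim_count == 4) {   /* all four words in: run the operation */
        double a = bits_to_double(((uint64_t)sim_in[0] << 32) | sim_in[1]);
        double b = bits_to_double(((uint64_t)sim_in[2] << 32) | sim_in[3]);
        double r = (sim_ctrl == 1u) ? a - b : a + b;  /* assumed: 0 add, 1 sub */
        uint64_t rb = double_to_bits(r);
        sim_out[0] = (uint32_t)(rb >> 32);
        sim_out[1] = (uint32_t)rb;
        sim_ready = 2;
        sim_count = 0;
        sim_ctrl = 0;
    }
}

/* Hypothetical blocking read. On hardware the processor would stall
   here until the result appears; the simulation has it ready already. */
static uint32_t sim_fsl_get(void)
{
    assert(sim_ready > 0);
    uint32_t w = sim_out[2 - sim_ready];
    sim_ready--;
    return w;
}

/* Send A and B as four words, opcode spread over the control bits
   (most significant bit first), then read the two result words. */
static double dfpu_call(double a, double b, unsigned op)
{
    uint64_t ab = double_to_bits(a), bb = double_to_bits(b);
    uint32_t words[4] = { (uint32_t)(ab >> 32), (uint32_t)ab,
                          (uint32_t)(bb >> 32), (uint32_t)bb };

    for (int i = 0; i < 4; i++)
        sim_fsl_put(words[i], (op >> (3 - i)) & 1u);

    uint32_t hi = sim_fsl_get();
    uint32_t lo = sim_fsl_get();
    return bits_to_double(((uint64_t)hi << 32) | lo);
}

double dfpu_add(double a, double b) { return dfpu_call(a, b, 0u); }
double dfpu_sub(double a, double b) { return dfpu_call(a, b, 1u); }
```

Calling dfpu_add(1.5, 2.25) returns 3.75 in this simulation. Because the caller always writes four words and then reads two, no handshake beyond the blocking reads and writes is needed, which matches the no-acknowledgement design choice.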
Hardware High-jinks
[Figure: DFPU block diagram. Recoverable labels:
Functional units (inputs A and B): add/sub, multiply, divide, sqrt, comparator, float -> fixed, fixed -> float
Unit signals: nd, rfd, rdy, op (add/sub op, comp op), result
Datapath: 128-bit shift register (shift enable), multiplexer, 64-bit register (Reg 64, halves [63-32] and [31-0])
Control: per-unit nd_and signals, SR one, SR two, 1-bit and 2-bit saturating counters, "lots of logic"
Fast Simplex Link signals: FSL_S_Data, FSL_S_Control, FSL_S_Exists, FSL_S_Read, FSL_M_Data, FSL_M_Write, FSL_M_Full]
The Current Design
[Figure: the DFPU block diagram from the previous slide, repeated with the additional labels MicroBlaze, FSL, idle, read, crunch, and write.]
What has been accomplished
The software talks to the hardware as expected.
The hardware captures the operands, performs the
correct operations, and returns correct results as
expected.
The software returns the hardware results as
expected.
Benchmarks

We ran an FFT benchmark twice:
Once on our DFPU hardware (6 minutes 17 seconds)
Once with software routines (56 minutes 31 seconds)

[Figures: "FFT benchmark times". Bar chart of execution time (minutes) for Hardware vs. Software; a second plot of time (seconds, axis 0 to 16000) over x-axis values 1 to 34.]
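
The two FFT timings imply a speedup that is easy to work out: 6 minutes 17 seconds is 377 seconds, 56 minutes 31 seconds is 3391 seconds, and 3391 / 377 is just under 9. A small C sketch of that arithmetic follows; the function names are illustrative only and do not come from the benchmark itself.

```c
#include <assert.h>

/* Convert a minutes-and-seconds reading to whole seconds. */
static unsigned to_seconds(unsigned minutes, unsigned seconds)
{
    return minutes * 60u + seconds;
}

/* Speedup of the fast run relative to the slow run. */
static double speedup(unsigned slow_s, unsigned fast_s)
{
    assert(fast_s > 0);   /* avoid dividing by zero */
    return (double)slow_s / (double)fast_s;
}
```

With the slide's figures, speedup(to_seconds(56, 31), to_seconds(6, 17)) evaluates to roughly 8.99, so the hardware DFPU ran this benchmark about nine times faster than the software routines.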
What remains

Fully-compliant IEEE 754 math units

Multiple processors sharing one DFPU

Pipelined design