Implementation of MAC Assisted
CORDIC engine on FPGA
EE382N-4
Abhik Bhattacharya
Mrinal Deo
Raghunandan K R
Samir Dutt
Motivation
• The ARM9-based processor in the Freescale i.MX21 System-on-Chip
on the TLL 5000 board does not have native support
for floating point.
• Floating-point operations are emulated in software using libraries, e.g. libc.
• "Math-heavy" applications, e.g. MAC (multiply-accumulate) based
operations that require computing sine/cos/arctan
values, are therefore not well suited to this platform.
Hardware Acceleration for Trigonometric Math operations
EE382N-4 Abhik Bhattacharya,Mrinal Deo,
Raghunandan R.K, Samir Dutt
Outline
• Select a basic mathematical building block, e.g.
CORDIC (from OpenCores).
• Implement the CORDIC engine in hardware
(FPGA).
• Implement higher-level primitives, e.g. the Discrete
Fourier Transform, using CORDIC.
• Use these blocks in a C program instead of the
<math.h> routines.
• Offload the heavy number crunching to the
hardware accelerator (FPGA), freeing up valuable
CPU resources.
CORDIC engine
• CORDIC (COordinate Rotation DIgital Computer) is a simple and
efficient algorithm for calculating hyperbolic and
trigonometric functions.
• We use it to calculate the sine and cosine of an angle given
in radians or degrees.
• To determine the sine and cosine of an angle β, we need
to find the X and Y coordinates of the corresponding point on the unit circle.
CORDIC contd.
• CORDIC is an iterative algorithm and uses a table lookup.
• First step: rotate the vector 45° counterclockwise.
• If ((β – α) != 0)
iterate
Else
exit.
(α is the angle rotated through so far.)
• Successive iterations rotate the vector in one direction or the other,
in steps of decreasing size.
• The magnitude of the rotation at step i is arctan(1/2^i).
– Where "i" is the iteration step (i = 0 gives the initial 45°).
• Terminate after 16 steps (approximately 5 digits of
precision).
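The iteration described above can be sketched in software as follows. This is a minimal floating-point illustration, not the OpenCores Verilog: the names `cordic_sin_cos`, `ATAN_TABLE` and `GAIN` are made up for this sketch, and a hardware version would use fixed-point values with shifts in place of the multiplications by 2^-i.

```python
import math

ITERATIONS = 16  # the 16 steps quoted on the slide

# Lookup table of the elementary rotation angles arctan(2^-i), in radians.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Every micro-rotation stretches the vector by sqrt(1 + 2^-2i).
# Dividing the start vector by the total gain keeps the result on the unit circle.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(beta):
    """Return (sin(beta), cos(beta)) for beta in [-pi/2, pi/2] radians."""
    x, y = 1.0 / GAIN, 0.0   # pre-scaled unit vector along the X axis
    z = beta                 # residual angle still to be rotated through
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0   # rotate towards the target angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x
```

After 16 steps the residual angle is at most arctan(2^-15), about 3e-5 rad, which is in line with the roughly 5 digits of precision quoted above.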
Discrete Fourier Transform(DFT)
DFT can be implemented using CORDIC
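For reference, the standard definition of the N-point DFT shows where the sine and cosine terms come from. The symbols x[n], X[k] and θ are the usual textbook notation, not taken from the slides.

```latex
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}
     = \sum_{n=0}^{N-1} x[n] \left( \cos\theta_{k,n} - j \sin\theta_{k,n} \right),
\qquad \theta_{k,n} = \frac{2\pi k n}{N}, \quad k = 0, \dots, N-1
```

Each output X[k] is therefore a sum of N products of a sample with a cosine or sine value, which is the kind of term a CORDIC engine can supply.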
Design of CORDIC
• The CORDIC Verilog from OpenCores can be operated
in different modes:
– Pipelined
– Iterative
– Combinatorial
• The pipelined mode is the most efficient from a performance perspective. We
trade off area for performance (it needs the largest number of
LUTs).
– It outputs a result every clock after an initial latency.
• Resolution is limited to about 5 digits of precision.
• The algorithm works in the 1st quadrant of the unit circle.
Logic was added to set the polarity (sign) of the outputs for the other quadrants.
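One possible way to handle the polarity is to fold any angle into the 1st quadrant and then fix the signs. The sketch below is an assumption about how such logic could look, not the team's actual Verilog; `first_quadrant_sin_cos` is a hypothetical stand-in for the CORDIC core and simply calls the math library.

```python
import math

HALF_PI = math.pi / 2


def first_quadrant_sin_cos(theta):
    """Hypothetical stand-in for a core that only accepts 0 <= theta <= pi/2."""
    assert -1e-12 <= theta <= HALF_PI + 1e-12
    return math.sin(theta), math.cos(theta)


def sin_cos_any_quadrant(theta):
    """Fold theta into the 1st quadrant, then restore the correct signs."""
    theta = theta % (2 * math.pi)              # wrap into [0, 2*pi]
    quadrant = min(int(theta // HALF_PI), 3)   # 0..3 (clamped for theta == 2*pi)
    if quadrant == 0:
        s, c = first_quadrant_sin_cos(theta)
        return s, c
    if quadrant == 1:
        s, c = first_quadrant_sin_cos(math.pi - theta)
        return s, -c                           # cosine is negative here
    if quadrant == 2:
        s, c = first_quadrant_sin_cos(theta - math.pi)
        return -s, -c                          # both are negative here
    s, c = first_quadrant_sin_cos(2 * math.pi - theta)
    return -s, c                               # sine is negative here
```

For example, `sin_cos_any_quadrant(math.radians(210))` folds the angle to 30° and negates both outputs.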
MAC Implementation
• The pipelined CORDIC gives sine/cosine values every
cycle if we can maintain a steady inflow of inputs.
• We can build a MAC (multiply-accumulate) engine on top of
this CORDIC functionality.
• Useful in linear time-variant control systems,
where the coefficients may be sine/cosine values
that need to be computed and accumulated.
• Simple example: Discrete Fourier Transform
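The DFT example can be written as a plain multiply-accumulate loop. This is a software sketch only: `dft_mac` is a made-up name, real-valued samples are assumed, and `math.sin`/`math.cos` stand in for the values the CORDIC would deliver each cycle.

```python
import math


def dft_mac(samples):
    """N-point DFT of real samples using one multiply-accumulate per term."""
    n_points = len(samples)
    result = []
    for k in range(n_points):
        acc_re = 0.0   # accumulator for the real part of X[k]
        acc_im = 0.0   # accumulator for the imaginary part of X[k]
        for n, x in enumerate(samples):
            theta = 2.0 * math.pi * ((k * n) % n_points) / n_points
            # Stand-in for the CORDIC outputs sin(theta) and cos(theta).
            s, c = math.sin(theta), math.cos(theta)
            acc_re += x * c   # MAC: sample times cosine
            acc_im -= x * s   # MAC: sample times (minus) sine
        result.append(complex(acc_re, acc_im))
    return result
```

Feeding it 32 samples of a cosine with 3 cycles per frame concentrates the energy in bins 3 and 29, which is a convenient sanity check for a hardware version as well.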
Design of DFT
• A 32-point DFT was implemented using the CORDIC-based
MAC.
• Samples are sent to the board from the user
application.
• One copy of the CORDIC-based MAC was
instantiated.
• The design was pipelined to avoid bubbles,
providing a new input (angle) to the CORDIC
every cycle.
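The slides do not say how the angle stream is produced. One common approach, shown here purely as an assumption, is a phase accumulator: for output bin k the phase index advances by k (modulo N) every cycle, so only N distinct angles 2π·phase/N ever occur. The name `angle_index_stream` is hypothetical.

```python
def angle_index_stream(n_points=32):
    """Yield (k, phase) pairs, one per clock, where theta = 2*pi*phase/n_points."""
    for k in range(n_points):
        phase = 0                              # restart for every output bin
        for _ in range(n_points):
            yield k, phase
            phase = (phase + k) % n_points     # add k each cycle, wrap at N
```

Because the generator produces one pair per step with no gaps, it models a bubble-free input to the pipeline: 32 × 32 angles in 1024 consecutive cycles.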
Block Diagram of our System
[Figure: block diagram. Blocks shown: Top Level, DFT, MAC Engine, CORDIC, CORDIC Gain. Signals shown: Input Samples, θ, sin(θ), cos(θ), DFT output.]
Operation of the System
[Figure: timeline, time increasing left to right. Events shown: user application writes i/p to RAM; initial CORDIC latency; MAC operation begins; 1st MAC output sample (after N clocks); final o/p from MAC.]
• The user application writes the 32 data samples to the RAM, followed by a
"compute_dft" instruction.
• Data is read from the RAM by the DFT encoder in a pipeline.
• Handshaking takes place between the two pipelined stages.
– The MAC operation begins after a delay of 16 clocks (the initial latency of the CORDIC pipeline).
– The 1st MAC output is generated N clocks after the initial latency, where N = 32 is the length of the
input sequence.
– After the MAC generates N output samples, the result of the N-point DFT is written to the RAM
module, followed by an interrupt.
– The user application reads the results from the RAM through the device driver on detection of
this interrupt.
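The timing above suggests a rough cycle count. The sketch below assumes one multiply-accumulate per clock on the single MAC, N clocks per output sample, and no stalls; it ignores the RAM writes and the interrupt. These are simplifying assumptions, not measured figures, and `cycle_estimate` is a made-up helper.

```python
CORDIC_LATENCY = 16   # clocks, initial latency of the CORDIC pipeline
N_POINTS = 32         # length of the input sequence


def cycle_estimate(n_points=N_POINTS, latency=CORDIC_LATENCY):
    """Return (clock of 1st MAC output, clock of final MAC output)."""
    first_output = latency + n_points              # N clocks after the latency
    final_output = latency + n_points * n_points   # N outputs, N clocks each
    return first_output, final_output
```

Under these assumptions the 1st output appears at clock 48 and the last at clock 1040.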
Performance Measurements
Issues Faced
• Coding an aggressive pipeline (avoiding bubbles) is always a
challenge.
• It is a time-consuming process that needs to be done in 2 steps:
– Code and validate in ModelSim (signals available for
debug).
– Adapt the design to run on the FPGA. Iterate for all
modules.
• The design needs to be aware of memory timing issues
(e.g. back-to-back writes from the FPGA to the RAM are a
problem).
• Calculating the correct polarity of the CORDIC output
samples.
Future scope
• Extending to a 256-point DFT. We cannot extend
beyond that because the resolution of the CORDIC is low;
the CORDIC resolution would need to be increased.
Lessons Learnt
• Debug on FPGA is interesting!!
Thank You!!
• No Questions!!! Please!! :x :p