Transcript Slides

Techniques for Low Power Turbo
Coding in Software Radio
Joe Antoon
Adam Barnett
Software Defined Radio
• Single transmitter for many protocols
• Protocols completely specified in memory
• Implementation:
– Microprocessors
– Field programmable logic
Why Use Software Radio?
• Wireless protocols are constantly reinvented
– 5 Wi-Fi protocols
– 7 Bluetooth protocols
– Proprietary mice and keyboard protocols
– Mobile phone protocol alphabet soup
• Custom DSP logic for each protocol is costly
So Why Not Use Software Radio?
• Requires high performance processors
• Consumes more power
[Diagram: general-purpose processors are inefficient; application-specific logic is efficient; field-programmable logic sits between the two]
Turbo Coding
• Channel coding technique
• Throughput nears theoretical limit
• Great for bandwidth limited applications
– CDMA2000
– WiMAX
– NASA's MESSENGER probe
Turbo Coding Considerations
• Presents a design trade-off
• Turbo coding is computationally expensive
• But it reduces cost in other areas
– Bandwidth
– Transmission power
Reducing Power in Turbo Decoders
• FPGA turbo decoders
– Use dynamic reconfiguration
• General processor turbo decoders
– Use a logarithmic number system
Generic Turbo Encoder
[Diagram: data stream s feeds one component encoder directly, producing parity p1, and a second component encoder through an interleaver, producing parity p2]
Generic Turbo Decoder
[Diagram: received stream r and parity streams q1 and q2 feed two decoders connected through an interleaver]
Decoder Design Options
• Multiple algorithms can be used to decode
• Maximum A-Posteriori (MAP)
– Most accurate estimate possible
– Complex computations required
• Soft-Output Viterbi Algorithm
– Less accurate
– Simpler calculations
FPGA Design Options
• Goal: make an adaptive decoder
[Diagram: received data and parity enter the decoder, which recovers the original sequence; a tunable parameter trades low power / low accuracy against high power / high accuracy]
Component Encoder
[Diagram: generator function feeding M one-bit register blocks]
• M blocks are 1-bit registers
• Memory provides encoder state
Encoder State
[Trellis diagram: the four encoder states (00, 01, 10, 11) and their transitions over time for input bits 0 and 1]
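The shift-register encoder above can be sketched in software. The generator taps below are illustrative assumptions (the slides do not give the actual generator polynomial):

```python
def component_encode(bits, taps=(1, 1, 1)):
    """Sketch of a feed-forward shift-register component encoder.

    `taps` is an assumed generator function over the input bit and
    M = len(taps) - 1 one-bit memory registers.
    """
    memory = [0] * (len(taps) - 1)   # M one-bit registers
    out = []
    for b in bits:
        window = [b] + memory        # current input plus encoder state
        out.append(sum(t & w for t, w in zip(taps, window)) % 2)
        memory = window[:-1]         # shift the registers
    return out

# With M = 2 registers the encoder state takes the four values
# 00, 01, 10, 11 shown in the trellis.
print(component_encode([1, 0, 1, 1]))   # [1, 1, 0, 0]
```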
Viterbi’s Algorithm
• Determine most likely output
• Simulate encoder state given received values
[Trellis diagram over time: input decisions d0, d1, d2, states s0, s1, s2, received values r0, r1, r2, and parity p0, p1, p2, …]
Viterbi’s Algorithm
• Write: Compute branch metric (likelihood)
• Traceback: Compute path metric, output data
• Update: Compute distance between paths
• Rank paths by path metric and choose best
• For N memory bits:
– Must calculate 2^(N−1) paths for each state
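The write/traceback steps above can be sketched as a minimal hard-decision Viterbi decoder. The 2-state trellis and generator (parity = input XOR state) are illustrative assumptions, not the paper's code:

```python
def branches(state):
    """All (input_bit, next_state, expected_output) transitions."""
    for bit in (0, 1):
        parity = bit ^ state            # illustrative generator
        yield bit, bit, (bit, parity)   # next state = input bit

def viterbi(received):
    """received: list of (systematic, parity) hard-decision pairs."""
    metrics = {0: 0, 1: float("inf")}   # path metric per state
    history = {0: [], 1: []}            # decoded bits per survivor
    for sys_bit, par_bit in received:
        new_metrics = {0: float("inf"), 1: float("inf")}
        new_history = dict(history)
        for state, metric in metrics.items():
            for bit, nxt, (s, p) in branches(state):
                # branch metric: Hamming distance to the received pair
                bm = (s != sys_bit) + (p != par_bit)
                if metric + bm < new_metrics[nxt]:
                    new_metrics[nxt] = metric + bm
                    new_history[nxt] = history[state] + [bit]
        metrics, history = new_metrics, new_history
    best = min(metrics, key=metrics.get)   # rank paths, choose best
    return history[best]

print(viterbi([(1, 1), (0, 1), (1, 1)]))   # [1, 0, 1]
```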
Adaptive SOVA
• SOVA: the inflexible path system scales poorly
• Adaptive SOVA: a pruning heuristic
– Limit to at most M paths
– Discard any path whose metric falls below a threshold T
– When there are too many paths, discard all but the top M
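The pruning heuristic can be sketched in a few lines. The sign convention (higher metric = more likely) is an assumption; the slides do not fix it:

```python
def prune(paths, M, T):
    """Adaptive SOVA pruning sketch.

    paths: list of (metric, decoded_bits) candidates after a step.
    Keeps at most M paths, dropping any below threshold T.
    """
    survivors = [p for p in paths if p[0] >= T]       # threshold test
    survivors.sort(key=lambda p: p[0], reverse=True)  # rank by metric
    return survivors[:M]                              # cap at M paths

paths = [(9.1, [0, 1]), (2.3, [1, 1]), (7.4, [1, 0]), (0.5, [0, 0])]
print(prune(paths, M=2, T=1.0))   # the two best paths above threshold
```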
Implementing in Hardware
[Block diagram: a control unit, a branch metric unit taking inputs r and q, an add-compare-select unit, and survivor memory]
Implementing in Hardware
• Controller
– Controls memory
– Selects paths
• Branch Metric Unit
– Computes likelihood
– Considers all possible “next” states
• Add, Compare, Select
– Appends path metrics
– Discards paths
• Survivor Memory
– Stores / discards path bits
Implementing in Hardware
Add, Compare, Select Unit
[Block diagram: present-state path values and branch values feed a compute/compare stage; path distances are compared against the threshold (>T) to produce the next-state path values]
Dynamic Reconfiguration
• Bit Error Rate (BER)
– Changes with signal strength
– Changes with number of paths used
• Change hardware at runtime
– Weak signal: use many paths, preserve accuracy
– Strong signal: use few paths, save power
– Sample SNR every 250k bits, reconfigure
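The runtime policy above can be sketched as follows. The SNR breakpoints and path budgets are illustrative assumptions; only the 250k-bit sampling interval comes from the slides:

```python
def choose_paths(snr_db):
    """Pick a path budget for the decoder from the measured SNR.

    Breakpoints (2 dB, 8 dB) and budgets are assumed for illustration.
    """
    if snr_db < 2:      # weak signal: many paths, preserve accuracy
        return 16
    elif snr_db < 8:    # moderate signal
        return 8
    else:               # strong signal: few paths, save power
        return 2

def reconfigure(bit_count, snr_db, current_paths, interval=250_000):
    """Re-evaluate the configuration every `interval` decoded bits."""
    if bit_count % interval == 0:
        return choose_paths(snr_db)
    return current_paths
```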
Dynamic Reconfiguration
Experimental Results
K (number of encoder bits) is proportional to average speed and power
Experimental Results
• FPGA decoding has a much higher throughput
• Due to parallelism
Experimental Results
• ASOVA performs worse than commercial cores
• However, in other metrics it is much better
– Power
– Memory usage
– Complexity
Future Work
• Extend the present reconfiguration scheme with
– Partial reconfiguration
– Dynamic voltage scaling
• Compare to power efficient software methods
Power-Efficient Implementation of a
Turbo Decoder in SDR System
• Turbo decoders are implemented using one of
three general processor types
– Fixed Point (FXP)
• Cheapest, simplest to implement, fastest
– Floating Point (FLP)
• More precision than fixed point
– Logarithmic Numbering System (LNS)
• Simplifies complex operations
• Complicates simple add/subtract operations
Logarithmic Numbering System
• X = {s, x}, where x = log_b(|X|)
– s = sign bit; the remaining bits hold the number value
• Example
– Let b = 2
– Then the decimal number 8 is represented
as log_2(8) = 3
– Numbers are stored in computer memory in two’s-
complement form (3 = 01111101) (sign bit = 0)
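The representation can be sketched in software. For clarity the log is kept as a Python float rather than the fixed-point two's-complement word a real LNS processor would use:

```python
import math

def to_lns(value):
    """Encode a nonzero number as (sign bit, base-2 log of magnitude)."""
    return (0 if value >= 0 else 1, math.log2(abs(value)))

def from_lns(sign, exp):
    """Decode an LNS pair back to an ordinary number."""
    return (-1 if sign else 1) * 2 ** exp

print(to_lns(8))   # (0, 3.0): sign bit 0, log2(8) = 3
```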
Why use Logarithmic System?
• Greatly simplifies multiplication, division,
roots, and exponents
– Multiplication simplifies to addition
• E.g. 8 * 4 = 32, LNS => 3 + 2 = 5
(2^5 = 32)
– Division simplifies to subtraction
• E.g. 8 / 4 = 2, LNS => 3 – 2 = 1
(2^1 = 2)
Why use Logarithmic System?
• Roots are done as right shifts
– E.g. sqrt(16) = 4,
LNS => 4 shifted right = 2
(2^2 = 4)
• Exponents are done as left shifts
– E.g. 8^2 = 64, LNS => 3 shifted left = 6
(2^6 = 64)
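The four slide examples can be checked directly on the stored logs. Logs are kept as integers here so the shifts are exact; a real LNS word is fixed-point:

```python
# In LNS, multiply/divide become add/subtract on the stored logs,
# and integer roots/powers become shifts.
x, y = 3, 2            # log2(8) = 3, log2(4) = 2

assert x + y == 5      # 8 * 4 = 32, since 2**5 == 32
assert x - y == 1      # 8 / 4 = 2, since 2**1 == 2
assert 4 >> 1 == 2     # sqrt(16): log2(16) = 4, halved -> 2 (2**2 = 4)
assert x << 1 == 6     # 8**2: log2(8) = 3, doubled -> 6 (2**6 == 64)
```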
So why not use LNS for all processors?
• Unfortunately, addition and subtraction are
greatly complicated in LNS
– Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z)
– Subtraction: log_b(|X| − |Y|) = x + log_b(1 − b^z)
• Where z = y − x
• Turbo coding/decoding is computationally
intense, requiring more multiplies, divides, roots,
and exponents than adds or subtracts
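The addition formula above can be sketched directly; the correction term log_2(1 + 2^z) is what LNS hardware typically reads from a lookup table:

```python
import math

def lns_add(x, y):
    """LNS addition of two magnitudes stored as logs x and y.

    Implements log2(|X| + |Y|) = x + log2(1 + 2**z), z = y - x.
    """
    z = y - x
    return x + math.log2(1 + 2 ** z)

# log2(8 + 4): x = log2(8) = 3, y = log2(4) = 2
print(lns_add(3, 2))   # log2(12) ≈ 3.585
```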
Turbo Decoder block diagram
• Each bit decision requires a subtraction, a table
lookup, and an addition
Proposed new block diagram
• As the difference d = |a − b| grows, the correction
term stored in the lookup table becomes negligible
• For this simulation, a threshold of d > 5 was used
How it works
• When d > 5:
– The new mux (on the right) ignores the SRAM input and simply
adds 0 to the MAX result
– The pre-decoder circuitry disables the SRAM to conserve power
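The modified Max* unit can be sketched as follows, using Max*(a,b) = max(a,b) + ln(1 + e^(−|a−b|)); the threshold of 5 comes from the slides, the rest is an illustrative software model of the hardware:

```python
import math

def max_star(a, b, threshold=5):
    """Max* sketch with the SRAM-bypass optimization.

    When d = |a - b| exceeds the threshold, the correction term is
    negligible: the mux adds 0 and the SRAM lookup is skipped.
    """
    d = abs(a - b)
    if d > threshold:
        return max(a, b)                              # SRAM disabled, add 0
    return max(a, b) + math.log(1 + math.exp(-d))     # table value added

print(max_star(2.0, 1.0))    # correction applied: ln(e^2 + e^1)
print(max_star(10.0, 1.0))   # d = 9 > 5: returns max only, 10.0
```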
Comparing the 3 simulations
• Comparisons were done between a 16-bit
fixed point microcontroller, a 16-bit floating
point processor, and a 20-bit LNS processor.
• 11 bits would be sufficient for FXP and FLP,
but 16-bit processors are much more common
• Similarly, 17 bits would suffice for an LNS
processor, but 20-bit is a common width
Power Consumption
Latency
• Recall: Max*(a,b) = ln(e^a+e^b)
Power savings
• Pre-Decoder circuitry adds 11.4% power
consumption compared to SRAM read.
• So when an SRAM read is required, we use
111.4% of the power compared to the
unmodified system
• However, when SRAM is blocked we only use
11.4% of the power we used before.
Power savings
• The CACTI simulations for the system reported
that the Max* operation accounted for 40% of
all operations in the decoder
• The Max* operations in the modified system
saved 69% of the power compared to the
unmodified system
• This leads to an overall power savings of
69% * 40% = 27.6%
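The overall figure is simple composition of the two reported numbers: a per-operation saving applied to the Max* share of all decoder operations:

```python
# Max* accounts for 40% of decoder operations (CACTI simulation),
# and the modified unit saves 69% of the Max* power, so the decoder
# as a whole saves 0.69 * 0.40 = 27.6%.
maxstar_share = 0.40
maxstar_saving = 0.69
overall_saving = maxstar_share * maxstar_saving
print(f"{overall_saving:.1%}")   # 27.6%
```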
Conclusion
• Turbo codes are computationally intense,
requiring more complex operations than
simple ones
• LNS processors simplify complex operations at
the expense of making adding and subtracting
more difficult
Conclusion
• Using an LNS processor with slight
modifications can reduce power consumption
by 27.6%
• Overall latency is also reduced due to ease of
complex operations in LNS processor when
compared to FXP or FLP processors.