Transcript Slides

Design of a High-Speed
Asynchronous Turbo Decoder
Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel
Asynchronous CAD/VLSI Group
Ming Hsieh Electrical Engineering Department
University of Southern California
ASYNC 2007 – Berkeley, California
March 12th 2007
Motivation and Goal
Mainstream acceptance of asynchronous design
• Leverage the ASIC standard-cell, library-based design flow
• Achieve benefits significant enough to overcome synchronous momentum
Our research goal for async designs…
• High-speed standard-cell flow
• Applications where designs yield significant improvement
• throughput and throughput per area
• energy efficiency
Single Track Full Buffer (Ferretti’02)
[Figure: STFB pipeline stage — a forward path from L to R over 1-of-N channels with a request-completion detector (RCD), and a reset path with a state-completion detector (SCD) and S/R drivers]
Follows a 2-phase protocol
High performance standard cell circuit family
Comparison to synchronous standard-cell
• 4.5x better latency
• 1+GHz in 0.18µm
• ~2.4X faster than synchronous
• 2.8x more area
Block Processing –
Pipelining and Parallelism
[Figure: Steinhart Aquarium analogy — K cases served by M people (pipelines), latency l]
First M cases arrive at t = l
Let c be the person cycle time
Consider two scenarios
• Baseline
• cycle time C1, latency L1
• Improved
• cycle time C2 = C1/2.4, latency L2 = L1/4.5
Questions
• How does cycle time affect throughput?
• How does latency affect throughput?
Subsequent M cases arrive every c time units

ExecutionTime = l + (⌈K/M⌉ − 1) · c
Throughput = K / ExecutionTime
Block Processing –
Combined Cycle Time and Latency Effect
[Plots: Throughput vs number of cases (K), Baseline vs Improved, for small K (0–45) and large K (0–1200); annotated throughput ratios of 4.32 and 2.6]
Large K: throughput ratio ≈ cycle time ratio
Small K: throughput ratio ≈ latency ratio
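These limits are easy to check numerically. The sketch below (Python) uses hypothetical values for C1 and L1, since the slides give only the ratios C2 = C1/2.4 and L2 = L1/4.5, and shows the throughput ratio approaching the latency ratio (4.5) for small K and the cycle-time ratio (2.4) for large K.

```python
import math

def throughput(K, M, c, l):
    """Cases per unit time: the first M cases complete at t = l,
    and each later group of M completes every c time units."""
    execution_time = l + (math.ceil(K / M) - 1) * c
    return K / execution_time

# C1 and L1 are hypothetical; only the ratios come from the slides
C1, L1 = 1.0, 10.0
C2, L2 = C1 / 2.4, L1 / 4.5   # improved: 2.4x cycle time, 4.5x latency

for K in (1, 10_000):
    ratio = throughput(K, 1, C2, L2) / throughput(K, 1, C1, L1)
    print(K, round(ratio, 2))   # small K -> ~4.5, large K -> ~2.4
```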
Talk Outline
• Turbo coding and decoding – an introduction
• Tree soft-input soft-output (SISO) decoder
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Turbo Coding – Introduction
[Figure: encoder block — K data bits in, N code-word bits out]
Error correcting codes
• Adds redundancy
• The input data is K bits
• The output code word is N bits (N>K)
• The code rate is r = K/N
• Types of codes
• Linear code
• Convolutional code (CC)
• Turbo code
Turbo Encoding - Introduction
[Figure: turbo encoder — outer CC → interleaver → inner CC]
Turbo Encoding
• Berrou, Glavieux and Thitimajshima (1993)
• Performance close to Shannon channel capacity
• Typically uses two convolutional codes and an interleaver
• Interleaver used to improve error correction
• increases minimum distance of code
• creates a large block code
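As a toy illustration of the interleaver's role, the sketch below (Python) treats interleaving as a fixed permutation of bit positions. The permutation itself is made up and far too short to be a real turbo interleaver, which is designed to spread adjacent bits far apart.

```python
# Hypothetical 5-bit interleaver: any fixed permutation of positions.
perm = [2, 0, 4, 1, 3]

def interleave(bits):
    """Reorder bits according to perm."""
    return [bits[p] for p in perm]

def deinterleave(bits):
    """Invert the permutation, restoring the original order."""
    out = [0] * len(bits)
    for i, p in enumerate(perm):
        out[p] = bits[i]
    return out

data = [1, 0, 1, 1, 0]
assert deinterleave(interleave(data)) == data   # round-trips exactly
```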
Turbo Decoding
Turbo decoder components
• Two soft-in soft-out (SISO) decoders
• one for inner CC and one for outer CC
• soft input: a priori estimates of input data
• soft output: a posteriori estimates of input data
• SISO often based on Min-Sum formulation
• Interleaver / De-interleaver
• maps SISO outputs to SISO inputs
• same permutation as used in encoder
Iterative nature of algorithm leads to block processing
• One SISO must finish before next SISO starts
[Figure: turbo decoder — received-data memory feeds the inner SISO; an interleaver/deinterleaver pair connects the inner and outer SISOs]
The Decoding Problem
Requires finding paths in a graph called a trellis
• Node: State j of encoder at time index k
• Edge: represents receiving a 0 or 1 while in state j at time k
• Path: Represents a possible decoded sequence
• the algorithm finds multiple paths
Example Trellis
• For a 2-state encoder, encoding K bits
[Figure: example trellis from t=0 to t=K — one edge style for sent bit 1, another for sent bit 0; decoded sequence: 0 1 0 0 0 1 0 1 0 0]
Min-Sum SISO Problem Formulation
Branch and path metrics
• Branch metric (BM)
• indicates difference between expected and received values
• Path metric
• sum of associated branch metrics
Min-Sum Formulation: for each time index k find
• Minimum path metric over all paths for which bit k = 1
• Minimum path metric over all paths for which bit k = 0
[Figure: trellis from t=0 to t=K with branch metrics (1s, 2s, and 3s) on the edges — one edge style for sent bit 1, another for sent bit 0]
Minimum path metric when bit k = 1 is 13
Minimum path metric when bit k = 0 is 16
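A brute-force sketch of the Min-Sum computation in Python: it enumerates every path through a toy 2-state trellis and takes the minimum path metric among paths whose bit at index k equals 1 (and, separately, 0). The branch-metric table and encoder state transition below are hypothetical, not the slide's example.

```python
from itertools import product

K = 3  # toy block length

# Hypothetical branch metrics: bm[(k, state, bit)] -> metric
bm = {(k, s, b): (s + b + k) % 3 + 1
      for k in range(K) for s in (0, 1) for b in (0, 1)}

def next_state(s, b):
    """Toy 2-state encoder: next state is simply the input bit."""
    return b

def min_metric(k_idx, bit):
    """Min path metric over all length-K paths with bit k_idx == bit."""
    best = float("inf")
    for bits in product((0, 1), repeat=K):
        if bits[k_idx] != bit:
            continue
        s, metric = 0, 0          # encoder starts in state 0
        for k, b in enumerate(bits):
            metric += bm[(k, s, b)]
            s = next_state(s, b)
        best = min(best, metric)
    return best

# Soft output for bit k is the difference between the two minima
k = 1
print(min_metric(k, 0), min_metric(k, 1))
```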
Talk Outline
• Turbo coding and decoding – an introduction
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Conventional SISO - O(K) latency
Calculation of the minimum path can be divided into two phases
• Forward state metric for time k and state j: Fk j
• Backward state metric for time k and state j: Bkj
[Figure: trellis edges around time k — one edge style for received bit 1, another for received bit 0]

F_k^0 = min( F_{k-1}^0 + BM_{k-1}^{00}, F_{k-1}^1 + BM_{k-1}^{10} )
B_k^0 = min( B_{k+1}^0 + BM_k^{00}, B_{k+1}^1 + BM_k^{01} )
Data dependency loop prevents pipelining
• Cycle time limited to latency of 2-way ACS
• Latency is O(K)
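The forward/backward recursions above can be sketched directly. The branch-metric table below is hypothetical; the point is the 2-way ACS (add-compare-select) structure, where each state metric at time k depends on both state metrics at the previous time.

```python
INF = float("inf")
K = 4  # toy block length

# Hypothetical branch metrics: bm[k][(i, j)] for the state-i -> state-j
# edge at time k (same table at every step, for simplicity)
bm = [{(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 1} for _ in range(K)]

F = [[INF, INF] for _ in range(K + 1)]
B = [[INF, INF] for _ in range(K + 1)]
F[0] = [0, INF]          # encoder starts in state 0
B[K] = [0, 0]            # any ending state allowed

for k in range(1, K + 1):                 # forward pass: 2-way ACS
    for j in (0, 1):
        F[k][j] = min(F[k-1][i] + bm[k-1][(i, j)] for i in (0, 1))

for k in range(K - 1, -1, -1):            # backward pass: 2-way ACS
    for i in (0, 1):
        B[k][i] = min(B[k+1][j] + bm[k][(i, j)] for j in (0, 1))

print(F[K], B[0])
```

The serial dependence of F[k] on F[k-1] (and B[k] on B[k+1]) is the data-dependency loop that limits the cycle time and makes the latency O(K).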
Tree SISO – low latency architecture
Tree SISO (Beerel/Chugg JSAC’01)
• Calculate BMs for larger and larger segments of the trellis (BM_{k,l}^{i,j})
• Analogous to creating group-wise PG logic for tree adders
• Tree SISO can process the entire trellis in parallel
• No data dependency loops so finer pipelining possible
• Latency is O(log K)
[Figure: tree of segment combines over t = 0 … 4, with example branch metrics on each segment]

BM_{0,4}^{00} = min( BM_{0,2}^{00} + BM_{2,4}^{00}, BM_{0,2}^{01} + BM_{2,4}^{10} )
             = min(1 + 2, 2 + 2) = 3
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Synchronous Base-Line Turbo Decoder
• Synchronous turbo decoder base-line
• IBM 0.18µm Artisan standard cell library
• An SCCC code with a rate of ½ was used
• Number of iterations performed is 6
• Gate level pipelined to achieve high throughput
• Performed timing-driven P&R
• Peak frequency of 475MHz
• SISO area of 2.46mm2
• To achieve high throughput, multiple blocks instantiated
Asynchronous Turbo Decoder
Static Single Track Full Buffer Standard-Cell Library (Golani’06)
• Total of (only) 14 cells in IBM 0.18µm process
• Extensive spice simulations were performed
• optimized trade-off between performance and robustness
Chip design
• Standard ASIC place-and-route flow (congestion-based)
• ECO optimization flow
Chip level simulation
• Performed on critical sub-block (55K transistors)
• Verified timing constraints
• Measured latency and throughput using Synopsys Nanosim
Static Single Track Full Buffer (Ferretti’01)
[Figure: sender and receiver connected by an SST channel carrying 1-of-N data — the sender (S) drives the channel wire high, the receiver (R) drives it low; keepers on each side (holds low / drives high, holds high / drives low) statically hold the wire between transitions]
1-of-N static single-track protocol
Statically drive line → improves noise margin
Asynchronous Implementation Challenges - I
• Degradation in throughput
• Unbalanced fork and join structure
• The token on the short branch is stalled due to the imbalance
• This slows down the overall fork-join structure
[Figure: unbalanced fork-join]
• Slack matching
• Improves throughput by adding pipeline buffers
• Identify fork/join bottlenecks and resolve them by adding buffers
• After P&R, long wires can also create this problem
• solved by adding buffers on long wires using the ECO flow
Asynchronous Implementation Challenges - II
[Figure: full-adder tree built with explicit Fork cells vs. the same tree built with full adders that have integrated forks]
• SSTFB implements only point-to-point communication
• Using dedicated Fork cells
• creates another pipeline stage
• requires buffers on the other paths for slack matching
• Integrating the Fork within the Full Adder
• takes 45% less area than a separate full adder and fork
• decreases the number of slack-matching buffers required
Asynchronous Implementation Challenges – III
• 60% of the design is slack-matching buffers
• Most of these buffers occur in linear chains
[Figure: a linear chain of buffers replaced by SLACK2 and SLACK4 cells]
• To save area and power two new cells were created
• SLACK2
• SLACK4
17% area and 10% power improvement for SLACK2
30% area and 19% power improvement for SLACK4
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Comparisons
• Synchronous
• Peak frequency of 475MHz
• Logic area of 2.46mm2
• Asynchronous
• Peak frequency of 1.15GHz
• Logic area of 6.92mm2
• Design time comparison
• Synchronous: ~4 graduate-student months
• Asynchronous: ~12 graduate-student months
Synch vs Async
[Figure: received memory feeding M pipelined 8-bit Tree SISOs through the interleaver/deinterleaver]
K bits per block, latency l
First M bits arrive at t = l
Let c be the sync clock cycle time (475 MHz)
Subsequent M bits arrive every c time units
Two implementations
• Sync: cycle time C1, latency L1
• Async: cycle time C2 = C1/2.4, latency L2 = L1/4.5
Desired comparisons
• Throughput comparison vs block size
• Energy comparison vs block size
K M 
ExecutionT ime  
 cl
 M 
Throughput 
K
ExecutionT ime
Comparisons – Throughput / Area
[Plot: Throughput/area (Mbps/mm²) vs block size (0–5000 bits) — Async (M=1) vs Sync (M = 3, 8, 11); annotated values 1.28, 2.13, 3.91]
• For small block sizes asynchronous provides better throughput/area
• As block size ↑ the two implementations become comparable
• For block sizes of 512 bits, synchronous cannot achieve the asynchronous throughput
Comparisons – Energy/Block
[Plot: Energy per block (nJ, 0–14) vs block size (0–5000 bits) — Async (M=1) vs Sync (variable M)]
For equivalent throughputs and small block sizes
asynchronous is more energy efficient than synchronous
Async advantages grow with larger async library (e.g., w/ BUF1of4)
Conclusions
• Asynchronous turbo decoder vs. synchronous baseline
• static STFB offers significant improvements for small block sizes
• more than 2X throughput/area
• higher peak throughput (~500Mbps)
• more energy efficient
• well-suited for low-latency applications (e.g. voice)
• High-performance async advantageous for applications which require
• high performance (e.g., pipelining)
• low latency
• block processing for which parallelism has diminishing returns
• synchronous design requires extensive parallelism to achieve
equivalent throughput
Future Work
• Library Design
• Larger library with more than 1 size per cell
• 1-of-4 encoding
• Async CAD
• Automated slack matching
• Static timing analysis
Questions?