Nokia 16-17 August 1999 - Rice University Electrical and Computer


Overview of Implementation Issues for
Multitier Networks on DSPs
Joseph R. Cavallaro
Electrical & Computer Engineering Dept.
Rice University
August 17, 1999
Outline

Overview of Multitier Networks

DSP Rapid Prototyping Tools

Channel Estimation and Multistage Detection

DSP implementation and Real-time Issues

ASIC Implementation of Algorithm Modules

Conclusions and Future Directions
Multitier Overlay Networks
[Figure: overlaid network tiers: Home Area Wireless LAN; High Speed Office Wireless LAN; Outdoor CDMA Cellular Network]
Time Scales in Multitier Networks

Multiple Radio Interfaces

Reconfigurability and Commonality of Modules
Multitier Network Interface Card
[Figure: time scales. Medium Access: msec; Horizontal Handoff: secs; Vertical Handoff: 10 secs; Session Lifetime: mins]
[Figure: mNIC system block diagram. Labels: Server, Mobile Platform, mNIC, NIC, Application, Transcoders, Proxy File System, Proxy Awareness, File System, Network Protocols, Internet, base stations (BS)]
Current Group

Suman Das - Universal Baseline Software System

Vishwas Sundaramurthy - System Design Issues

Sridhar Rajagopal - Channel Estimation Algorithms

Oscar Pan – Real Time Workshop Implementation

Recent Graduates:
– Chaitali Sengupta - ML Synchronization
– Gang Xu - Differencing Multistage Detector
W-CDMA Simulation Testbed Overview

Development of an integrated software testbed

Unified framework to evaluate new algorithms for coding,
synchronization, detection, etc.

Construction of a faster, efficient, and possibly hardware
accelerated simulation testbed

TI TMS320C6201- TMS320C6701 based system – Base Station

TI TMS320C54 and FPGA / ASIC - Mobile
Software Rapid Prototyping Methodology

Communication and Signal Processing Algorithms in MATLAB and “C”

Faster Execution of “C” Code

Acceleration on DSP Boards

Multiple DSP Boards
[Figure: prototyping flow. Labels: MATLAB CODE, MATLAB COMPILER, C - CODE, C mex - CODE, WRAPPER (C - Code or Simulink), DSP CODE GENERATION TOOLS, HOST, DSP CODE, DSP hardware]
Simulink

Simulink
– Good system for algorithm evaluation in
communication systems and signal processing
– Ties in well with MATLAB environment and
functions
– More intuitive than (C/Matlab) code based
evaluation

Used in software version of wireless testbed
RTW

Real-Time Workshop
– Generates ANSI C-code for Simulink block diagrams
– Tool for DSP rapid prototyping
– Quick but inefficient/non-optimized C-code

RTW support for C67x generation boards
– Hardware (DSP)-in-the-loop simulations
CDMA Wireless System Testbed
Simulink Version
[Figure: Simulink block diagram. Blocks: User Data, Wireless Channel (AWGN Channel), Chip Matched Filter (Chip MF), Channel Estimation (Max. Likelihood Channel Est.), Multiuser Detection (Decorrelating Detector), Error Rate Calculation (Error Counter), Parameters (Update Parameters), Statistics (Show Stats)]
Hardware Platform Issues

Current System
– TI TMS320C6201 and TMS320C6701 EVM boards

Multiple DSP Processor Configuration Issues and Task Decomposition.

Planned Upgrade to BlueWave, Spectrum
DSPs in Simulink based Wireless testbed


Use of C67 based boards for simulations
– Useful for study of individual algorithms on C67
generation processors
Multiprocessing issues
– Need block diagram partitioning and code
generation support from Simulink/RTW
– Need cleaner external communication
mechanisms in the C67x DSP
– Need support for controlling multiple DSPs
Architectural Issues

Memory
– More internal memory for large temporary matrices

Prefetch Buffers
– Matrices stored as arrays in memory.

ASIC / FPGA glue support
– To explore HW acceleration of critical parts of the code

Specialized instructions: square roots, reciprocals, rotations?
Compiler Support

Compilers for VLIW
– Scheduling & tracking units difficult in manual assembly
– Challenge to generate code to keep all units busy.
– Small Operating System Support

Architectural improvements require coordinated advances in compiler support.
W-CDMA Software Testbed Experiments

Third generation wireless communication systems

Multimedia capabilities

Multirate services

Quality of service

Higher Data Rates: 2 Mbps, 384 Kbps, 144 Kbps.
The Wireless Channel: Multiuser, Multipath
[Figure: the desired user's signal reaches the antenna over a direct path and reflected paths, together with noise and multiple access interference (MAI)]
Faces Attenuation, Delays and Doppler Effects: Unknown Channel Parameters
W-CDMA Base-Station Receiver
[Figure: receiver block diagram. Blocks: Antenna, Demodulator, Demux (Pilot / Data), Channel Estimator (Estimated Amplitudes & Delays), Multiuser Detector, Decoder]
CDMA Uplink System
[Figure: uplink block diagram. For each user k = 1..K, data d_k passes through a Channel Encoder and Spreading; the spread signals are summed with AWGN to form the received signal R(t). A bank of Matched Filters produces y_1..y_K, which feed the Channel Estimator and the Multi-User Detector, followed by a Demux and per-user Channel Decoders with outputs d_1'..d_K']
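As a rough illustration of the spreading and matched-filter stages in the uplink diagram, the sketch below assumes a synchronous, noise-free channel with orthogonal Walsh codes. The codes, amplitudes and function names are illustrative assumptions and are not taken from the testbed. A real W-CDMA uplink is asynchronous and multipath, which is what produces the multiple access interference that the multiuser detector has to remove.

```python
# Toy synchronous uplink: each user's +/-1 bits are spread by a code,
# the chip streams are summed on the channel, and a bank of matched
# filters despreads them. All values here are illustrative assumptions.

WALSH_CODES = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
]

def spread(bits, code):
    """Replace each bit with bit * code (one chip sequence per bit)."""
    return [b * c for b in bits for c in code]

def channel_sum(chip_streams, amplitudes):
    """Superpose the users' chip streams, each scaled by its amplitude."""
    return [sum(a * s[i] for a, s in zip(amplitudes, chip_streams))
            for i in range(len(chip_streams[0]))]

def matched_filter(received, code):
    """Correlate the received chips with one user's code, bit by bit."""
    n = len(code)
    return [sum(received[i + j] * code[j] for j in range(n)) / n
            for i in range(0, len(received), n)]

def detect(soft):
    """Hard decision on each matched-filter output."""
    return [1 if z >= 0 else -1 for z in soft]

user_bits = [[1, -1, 1], [-1, -1, 1], [1, 1, -1]]
amplitudes = [1.0, 2.0, 0.5]
streams = [spread(b, c) for b, c in zip(user_bits, WALSH_CODES)]
r = channel_sum(streams, amplitudes)
soft_outputs = [matched_filter(r, c) for c in WALSH_CODES]
decisions = [detect(y) for y in soft_outputs]
```

Because the codes in this toy case are orthogonal, each matched-filter output is just the user's amplitude times its bit. With non-orthogonal or misaligned codes the outputs would also contain cross-correlation terms from the other users.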
Maximum Likelihood - Channel Estimation

Send a time-multiplexed Preamble (Pilot).

Channel properties extracted from received signal.

Compare received signal with known pilot and
estimate channel parameters.

Keep estimate for remaining data bits (static).

Repeat preamble every frame, if no tracking.
The Maximum Likelihood Algorithm

Compute the correlation matrices $R_{rr}$, $R_{br}$ and $R_{bb}$.

Compute the channel estimate $Y = R_{br} R_{bb}^{-1}$.

Calculate the noise covariance matrix $K$.

Calculate the channel impulse response vector $z$.

Extract the amplitudes and delays from the channel impulse response vector using a least squares fit.
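The first two steps can be sketched as follows. This is a minimal sketch under stated assumptions: the correlation matrices are taken to be time averages of outer products over the L pilot bits, R_br = (1/L) Σ r_i b_i^T and R_bb = (1/L) Σ b_i b_i^T, the channel is noise-free, and every function name is hypothetical. The noise covariance and the least squares extraction of amplitudes and delays are not shown.

```python
import random

def correlation(xs, ys):
    """Time-averaged outer product (1/L) * sum_i x_i y_i^T."""
    L = len(xs)
    rows, cols = len(xs[0]), len(ys[0])
    return [[sum(x[m] * y[n] for x, y in zip(xs, ys)) / L
             for n in range(cols)] for m in range(rows)]

def invert(a):
    """Gauss-Jordan inverse of a small real matrix (partial pivoting)."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def ml_channel_estimate(received, pilot_bits):
    """Y = R_br * R_bb^-1 (assumed definitions, see the text above)."""
    r_br = correlation(received, pilot_bits)    # complex-real products
    r_bb = correlation(pilot_bits, pilot_bits)  # real, depends only on the pilot
    return matmul(r_br, invert(r_bb))

# Assumed toy setup: 2 users, 4 received samples per bit, 150 pilot bits.
rng = random.Random(0)
Y_TRUE = [[1 + 0.5j, 0.2j], [0.3, -0.7 + 0.1j], [0.0, 0.4], [-0.2j, 0.1]]
pilot = [[rng.choice((-1, 1)) for _ in range(2)] for _ in range(150)]
rx = [[sum(Y_TRUE[m][k] * b[k] for k in range(2)) for m in range(4)]
      for b in pilot]
Y_EST = ml_channel_estimate(rx, pilot)
```

Since R_bb depends only on the known pilot, its inverse could be prepared ahead of time, leaving the correlation and the matrix product as the per-frame work.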
The ML Algorithm Complexity

Complex-Real Dot Product.

Complex-Real Matrix Product.

Complex-Real Product.

Real Square roots.
– Solving quadratic equation for least squares fit.

Critical code: Matrix-vector multiplications / Dot Product

Assuming Unity Noise Covariance
[Equations for the correlation $R_{br}$, the channel estimate $Y = R_{br} R_{bb}^{-1}$ (one term marked "Offline") and $z_k$ accompany these items]
Differencing Multistage - Multiuser Detection

Based on the principle of Parallel Interference
Cancellation (PIC)

Cross-correlation information used to remove
interference of other users from desired user

Repeated iterations for convergence

Differencing techniques applied for improving the
performance of the algorithm
The Differencing Multistage Detector

Split the cross-correlation matrix into lower, upper and diagonal matrices:
$R = D + S + S^T$

Calculate the channel impulse response iteratively using
$z^{(l)} = z^{(l-1)} - (S + S^T) A \hat{x}^{(l-1)}$
where $\hat{x}^{(l-1)} = \hat{d}^{(l-1)} - \hat{d}^{(l-2)}$, with $\hat{x}_k \in \{0, +2, -2\}$.

$\hat{x}$ is called the differencing vector.
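To make the differencing idea concrete, the sketch below compares a conventional multistage (parallel interference cancellation) update, which recomputes z = y - (S + S^T) A d from scratch at every stage, with a differencing update that only corrects z for the bits that changed. It is a minimal sketch under assumptions: three synchronous users, D equal to the identity, no noise, hard decisions by sign, and made-up values for S, A and y. The function names are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def conventional(y, B, stages):
    """Each stage recomputes z = y - B d from the previous decisions."""
    d = [sign(v) for v in y]               # stage 0: matched-filter decisions
    for _ in range(stages):
        z = [yi - bi for yi, bi in zip(y, matvec(B, d))]
        d = [sign(v) for v in z]
    return d

def differencing(y, B, stages):
    """Each stage updates z using only the decisions that changed."""
    z = list(y)                            # z before any cancellation
    d_prev = [0] * len(y)                  # no decisions before stage 0
    d = [sign(v) for v in z]
    for _ in range(stages):
        x = [a - b for a, b in zip(d, d_prev)]   # differencing vector
        for j, xj in enumerate(x):
            if xj != 0:                    # unchanged bits cost nothing
                for k in range(len(z)):
                    z[k] -= B[k][j] * xj
        d_prev, d = d, [sign(v) for v in z]
    return d

# Assumed example: strictly lower-triangular S, amplitudes A, B = (S + S^T) A.
K = 3
S = [[0, 0, 0], [0.5, 0, 0], [-0.25, 0.5, 0]]
A = [1.0, 4.0, 1.0]
B = [[(S[i][j] + S[j][i]) * A[j] for j in range(K)] for i in range(K)]

# Noise-free matched-filter outputs for true bits [-1, 1, 1] with D = I.
y = [0.75, 4.0, 3.25]
```

In this example the strong second user pushes the first user's matched-filter output to the wrong sign, and one cancellation stage corrects it. After the first stage the entries of x are 0 or ±2, so most of the inner loop is skipped once the decisions settle.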
Multistage Detector Complexity

Matrix Multiplication: $B = (S + S^T) A$
– Computed only once for one frame

Dot Product: $z_k^{(l+1)} = z_k^{(l)} - \sum_j B_{kj} \hat{x}_j^{(l)}$
– Computed iteratively

Critical code: Dot Product
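A simple way to see where the savings come from is to count multiply-accumulate operations per stage. The sketch below is an illustrative cost model only, with a hypothetical function name and made-up decision history: a conventional stage is charged a full K x K matrix-vector product, while a differencing stage is charged one column of B per bit that changed.

```python
def stage_macs(decision_history):
    """Multiply-accumulate counts for each update after the first.

    decision_history[l] holds the hard decisions (+1 or -1) after stage l.
    Returns (conventional_counts, differencing_counts).
    """
    K = len(decision_history[0])
    conventional, differencing = [], []
    for prev, cur in zip(decision_history, decision_history[1:]):
        changed = sum(1 for a, b in zip(prev, cur) if a != b)
        conventional.append(K * K)        # full matrix-vector product
        differencing.append(K * changed)  # one column of B per changed bit
    return conventional, differencing

# Hypothetical decisions for K = 4 users over three stages.
history = [
    [1, 1, -1, 1],     # earlier stage
    [1, -1, -1, -1],   # two bits flipped
    [1, -1, -1, -1],   # converged: nothing flipped
]
conv, diff = stage_macs(history)
```

Here `conv` is `[16, 16]` and `diff` is `[8, 0]`. Since the nonzero entries of the differencing vector are ±2, the remaining products could plausibly be done as a shift followed by an add or subtract.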
TI Tools Used

Evaluation Modules (EVM) for C6201 and C6701 fixed and floating point DSPs
– 64 KB each internal program & data memory
– 256 KB SBSRAM, 8 MB SDRAM (external)

C Compiler ver 3.0 from Code Generation Tools

Code Composer ver 4.02 for profiling the code
DSP Implementation: Channel Estimation

Floating point implementation found more feasible
due to matrix inversions and square-roots.

Code optimized for the DSP

Use of Specialized approximate instructions
– Approximate reciprocal square roots
– Approximate reciprocals

Use of Assembly Code for critical part.
– TI's C67 floating point benchmarks for Matrix-Vector Multiplication & Dot Product

Data Memory requirements for Channel Estimation
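Approximate reciprocal and reciprocal square-root instructions trade accuracy for speed, so their results are usually refined in software. One common way to do this, sketched below, is a few Newton-Raphson iterations. This is an illustration under assumptions, not the code used on the C6701: `approx` is a hypothetical stand-in that emulates a low-precision hardware estimate by keeping 8 mantissa bits (the seed precision is assumed), inputs are assumed positive, and the function names are invented for this example.

```python
import math

def approx(value, bits=8):
    """Stand-in for a hardware estimate: keep only `bits` mantissa bits."""
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def reciprocal(a, iterations=2):
    """Refine a coarse 1/a estimate; each step roughly doubles the bits."""
    x = approx(1.0 / a)                  # seed: emulated approximate instruction
    for _ in range(iterations):
        x = x * (2.0 - a * x)            # Newton-Raphson step for 1/a
    return x

def reciprocal_sqrt(a, iterations=2):
    """Refine a coarse 1/sqrt(a) estimate the same way."""
    x = approx(1.0 / math.sqrt(a))       # seed: emulated approximate instruction
    for _ in range(iterations):
        x = x * (1.5 - 0.5 * a * x * x)  # Newton-Raphson step for 1/sqrt(a)
    return x
```

Each refinement step costs only a few multiplies and a subtract, which is why a one-cycle estimate plus a couple of iterations can be attractive compared with a full-precision library routine.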
Use of Approximate Instructions

TMS320C67x DSP Cycles
Approx. FP Reciprocal instruction: 1
FP reciprocal function: 28
Approx. FP Reciprocal Sq. root Instruction: 1
FP Reciprocal Sq. root Instruction: 34
[Figure: Use of specialized instructions and assembly code on C6701 DSP. Execution time (in milliseconds) vs. number of users for C6701: Original, C6701: with Intrinsics, C6701: with Assembly. Annotations: 10% improvement, 100% improvement. Parameters: L = 150, P = 3, N = 31, SNR = 5 dB, SINR = -10 dB]
Optimization Effects for Channel Estimation
[Figure: Effect of optimizations for Channel Estimation on C6701, execution time (normalized). Base (-o3 -pm); Approx. (-o3 -pm with intrinsics): 1.08X improvement; Assembly opt. (-o3 -pm with asm): 2.34X improvement]
Data Memory Requirements
[Figure: data memory requirements for channel estimation. Annotation: data to be placed in external memory]
DSP Implementation: Multistage Detection

16-bit Fixed Point C Code

Code optimized for the DSP

Use of Assembly Code for critical part
– TI's C62 fixed point assembly benchmarks for Dot Product

Data memory requirements for Multistage Detection
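The critical kernel here is a 16-bit fixed-point dot product. The sketch below emulates one possible form in Python: Q15 operands, 16 x 16 products accumulated in a saturating 32-bit register, result in Q30. The Q15 format, the saturation behaviour and all names are assumptions for illustration. They are not taken from TI's benchmark code.

```python
INT16_MIN, INT16_MAX = -32768, 32767
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def to_q15(value):
    """Quantize a real number in [-1, 1) to a saturated 16-bit Q15 integer."""
    return max(INT16_MIN, min(INT16_MAX, int(round(value * 32768))))

def dot_q15(a, b):
    """Dot product of Q15 vectors with a saturating 32-bit accumulator.

    Each 16 x 16 product is a Q30 value, so the sum is returned in Q30.
    """
    acc = 0
    for x, y in zip(a, b):
        acc = max(INT32_MIN, min(INT32_MAX, acc + x * y))
    return acc

def q30_to_float(acc):
    return acc / float(1 << 30)

a = [to_q15(v) for v in (0.5, -0.25, 0.125)]
b = [to_q15(v) for v in (0.5, 0.5, 0.5)]
result = q30_to_float(dot_q15(a, b))
```

For these inputs every value is exactly representable, so `result` equals 0.1875, the same as the floating-point dot product. With arbitrary data the quantization in `to_q15` introduces a small error, which is the price paid for 16-bit storage and integer arithmetic.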
Optimization Effects for Multistage Detector
[Figure: Effect of optimizations for Multistage Detection on C6201, execution time (normalized). Global opt. (-o3 -pm -mu); Software Pipelining (-o3 -pm): 5.22X improvement; Assembly opt. (-o3 -pm with asm): 7.47X improvement]
Data Memory Requirements
[Figure: data memory requirements for multistage detection. Annotation: data can be placed completely in internal memory]
Flops Count
[Figure: number of flops (x 10^4) vs. total number of iterations (1 to 8) for the Conventional Method and the Differencing Method. Users: K = 15, SNR = 6 dB. Annotation: 2X speedup for a three-stage detector]
Real-Time Requirements
[Figure: max bit rate per user (kb/s) vs. number of users (8 to 14) for the Conventional Method and the Differencing Method. SNR = 10 dB, WindowSize = 12. Annotations: real-time capability by C6201 DSP; 12 users; 150 kb/s]
Trends in Recent DSPs

More internal memory and higher clock speeds
– C6203: 512 KB data, 384 KB program, 250 MHz
– useful for uplink channel estimation algorithms.

Specialized Blocks in the DSP Core.
– Viterbi decoding in C54.

Lower Voltage operation
– 1.2 V in C5402, useful for saving power consumption in the mobile.
ASIC Implementation

Differencing Multistage Detector Block

MOSIS Tiny-Chip (40-pin DIP)
– 8 synchronous users
– 12-bit fixed point implementation
– 6000 transistors
– 1.2 µm CMOS technology
– 190 kb/s for each user (@ 12.5 MHz)
– 3-stage cascade delay < 15 µs
Chip (Single Stage) Architecture
[Figure: single-stage datapath. Blocks: REG, SHIFT, RECODER, (L+L')A, Control Logic. Signals: z(l), d(l), d(l-1). Internal and external signals are drawn differently]
$z^{(l+1)} = z^{(l)} - (L + L^T) A \hat{x}^{(l)}$
where $\hat{x}^{(l)} = \hat{d}^{(l)} - \hat{d}^{(l-1)}$
ASIC Architecture Features
Chip Layout
[Figure: chip layout, 2.0 mm. Regions: Recoding logic, Soft Decisions, Cross-Correlation, 12-bit ALU]
3-stage Cascade Mode
[Figure: three detector chips in cascade. Each chip has Sin/Sout, Hin/Hout, Fin/Fout, Load and CLK 1/2 pins. Inputs: Matched Filter Output, Handshaking, Load R, Clock. Outputs: Detector Output, Output Valid]
Current Work – GPP vs. DSP
• Joint work with Prof. Sarita Adve, Praful Kaul, and Parthasarathy Ranganathan
• Performance of general-purpose systems
• Comparing GPP and DSP performance
• Complete 3G benchmark suite with all components
• Identification of key performance bottlenecks
Preliminary Results (1 of 4)

(4 algorithms: channel estimation, multi-stage detection, FIR filter, dot product)

Performance of general-purpose processors
– Instruction-level parallelism features help (3.4X to 4.4X)
– Media ISA extensions help (1.2X to 5.4X)
  • New extensions for packing/multiplication useful

Comparing GPP and DSP performance
– GPPs outperform DSPs
  • UltraSPARC-II+VIS 2-4X better than TI TMS320C6701
  • Caveat: compiler issues with DSP
Preliminary Results (2 of 4)
[Figure: system block diagram for K users. Transmitter (mobile user): user's bits, Source Coding, Channel Coding, Spreading, Modulation. Receiver (base station): Demodulation, Channel Estimation, Detector, Decoder, detected bits of all K users]
Important to study complete system including all components
– Need for complete benchmark suite
Preliminary Results (3 of 4)

Complete 3G benchmark suite with all components
• Source coding
• Channel coding
• Spreading
• Modulation/De-modulation
• Multi-stage detection
• Channel estimation
• Channel decoding
• Source decoding

Used either public-domain or in-house “C” code
– Optimized with ISA extensions
Preliminary Results (4 of 4)
[Figure: two pie charts showing the share of each component, one for the G728 system and one for the GSM system]
Component               G728    GSM
Speech Coder            29%     11%
Speech Decoder          24%      6%
Channel Encoder          3%      3%
Channel Decoder         17%     20%
Channel Estimation       9%     20%
Multi-stage Detection   18%     40%
Choice of source coding standard makes big difference
– G728 system: source coding/decoding dominant
– GSM system: channel estimation/detection dominant
Conclusions

Implementation issues : Estimation & Detection Algorithms

Channel Estimation - Floating Point / External Memory

Multistage Detection - Fixed Point / Internal Memory

Specialized instructions : square root/reciprocals.

Additional support for complex arithmetic useful.

Recent trends in GPP / DSPs highly encouraging for next generation
wireless communication applications.
Future Work

FPGA / ASIC Implementation via VHDL models and SPW

Program & DSP implementations for W-CDMA uplink and downlink
– Blind Algorithms
– Adaptive Algorithms

Architectural bottlenecks and compiler issues in DSPs to enhance suitability for next generation W-CDMA systems

Multiple DSPs – mixed DSP / FPGA for mNIC