To DSP or Not to DSP?

Chad Erven
Words to Bits – Your Options

• ASIC
• FPGA
• DSP
• Embedded RISC
• General Purpose Processor (GPP)
Why Go Programmable?
1. Building the chip wrong
– Systems are increasingly too complex for RTL designers to describe efficiently
– Errors are orders of magnitude more difficult to find in hardware than in software
– Defects are extremely costly in hardware
2. Building the wrong chip
– Only software is flexible enough to adapt during and after system design
HARDWARE IS TOO HARD!
So Software and Processors, Right?

• Using processors has its drawbacks, especially in SoC designs
– Never a perfect match between the application and the hardware
– Performance costs, power penalties, and wasted silicon will ALWAYS happen to some extent
– Multiple disparate cores must be integrated with each other
Splitting the Difference – ASIPs

• Ever wish you were the processor designer?
• Now you are! Write the exact instructions you need and nothing more.
• An Application-Specific Instruction-set Processor (ASIP) offers the best of both worlds
Back Up!

• Isn’t hardware too much work?
– Yes
• So doesn’t an ASIP defeat the purpose?
– No
• Why not?
– Extending a base processor is much easier than designing hardware from scratch
– Readily amenable to automation
– You only have to verify the instruction description; integration into the processor is guaranteed
Cool, Show Me How It Works

• ASIPs derive their performance from addressing three problems that conventional processors face
1. Operations that are innately parallel must be expressed serially
– Somewhat solved by SIMD or MIMD processors
2. Memory space is addressed as one continuous space
– Somewhat solved by modifiers and/or pragmas (dm/pm)
3. Applications are complicated by their expression as operations on C types
– Somewhat alleviated by powerful instructions in hardware
Working with the Innate Nature of the Algorithm

Example – byte swap (common telecom task)
int *a, *b ;
…
for(int i = 0 ; i < 4096 ; i++ )
{
    a[i] = (
        ((b[i] & 0x000000ff) << 24) |
        ((b[i] & 0x0000ff00) << 8)  |
        ((b[i] & 0x00ff0000) >> 8)  |
        ((b[i] & 0xff000000) >> 24)
    );
}
Working with the Innate Nature of the Algorithm

Write your own instruction:
operation swap {in AR x, out AR y}{}
{y = {x[7:0],x[15:8],x[23:16],x[31:24]};}

Making the C Code:
for(int i = 0 ; i < 4096 ; i++) a[i] = swap(b[i]) ;
Execution cycles without TIE extension: 4,915,300
Execution cycles with TIE extension: 1,638,524
3X SPEED UP!!!
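To try the idea without the target hardware, the swap operation can be modelled in plain C. This is only a sketch: swap_model, swap_buffer and swap_selfcheck are hypothetical names, not part of TIE or any vendor toolchain, and unsigned 32-bit words are assumed. The model checks behaviour only and says nothing about cycle counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the custom swap instruction:
   y = {x[7:0], x[15:8], x[23:16], x[31:24]} */
uint32_t swap_model(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8)  |
           ((x & 0x00ff0000u) >> 8)  |
           ((x & 0xff000000u) >> 24);
}

/* Applies the model to n words, mirroring the loop a[i] = swap(b[i]). */
void swap_buffer(uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = swap_model(b[i]);
}

/* Sanity checks: a known pattern, and swapping twice restores the word. */
void swap_selfcheck(void)
{
    assert(swap_model(0x11223344u) == 0x44332211u);
    assert(swap_model(swap_model(0xdeadbeefu)) == 0xdeadbeefu);
}
```

Calling swap_buffer(a, b, 4096) reproduces the loop from the slide; on the real core the body of swap_model would collapse into the single custom instruction.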
Instruction Fusion
[Figure: unfused vs. fused operation. Unfused: op1 reads reg1 and reg2 and writes reg3; op2 reads reg3 and reg4 and writes reg5. Fused: op1 and op2 are combined into a single operation that reads reg1, reg2, and reg4 and writes reg5.]

Example
for(i=0 ; i<n ; i++ ) c[i] = (a[i] * b[i]) >> 4 ;
Assembly:
loop:
    l8ui   a12,a11,0
    l8ui   a13,a10,0
    addi   a11,a11,1
    addi   a10,a10,1
    mul16u a8,a12,a13
    srai   a8,a8,4
    s8i    a8,a9,0
    addi   a9,a9,1
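Example
[Figure: dataflow graph of the loop body. a11 and a10 each feed an l8ui (offset 0) and an addi (increment 1); the two loaded values feed mul16u, then srai by 4, then s8i to the address in a9; a final addi increments a9 by 1.]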
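Example
[Figure: the same dataflow graph after fusion. The two l8ui loads and the addi increments of a10 and a11 remain; mul16u, srai, s8i, and the addi on a9 are replaced by the single instruction fusion.mul16u.srai.s8i.addi.]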
Example
New assembly code:
loop:
    l8ui   a12,a11,0
    l8ui   a13,a10,0
    addi   a10,a10,1
    addi   a11,a11,1
    fusion.mul16u.srai.s8i.addi a9,a12,a13
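What the fused instruction does can be sketched in C and checked against the plain loop. The function names below are hypothetical, and the model assumes unsigned 8-bit data (as the l8ui and s8i instructions suggest), so the shift never sees a negative value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unfused reference: the original loop on 8-bit data. */
void scale_unfused(uint8_t *c, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (uint8_t)((a[i] * b[i]) >> 4);
}

/* Hypothetical model of fusion.mul16u.srai.s8i.addi:
   multiply, shift right by 4, store the low byte, bump the pointer. */
uint8_t *fused_mul_shift_store(uint8_t *dst, uint8_t x, uint8_t y)
{
    uint32_t product = (uint32_t)x * (uint32_t)y; /* mul16u */
    uint32_t shifted = product >> 4;              /* srai by 4 (value is non-negative) */
    *dst = (uint8_t)shifted;                      /* s8i keeps the low 8 bits */
    return dst + 1;                               /* addi: advance the store address */
}

/* Same loop, with one call standing in for the four fused operations. */
void scale_fused(uint8_t *c, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c = fused_mul_shift_store(c, a[i], b[i]);
}
```

Both versions produce the same bytes; the gain on the real core comes from issuing one instruction where the unfused loop issues four.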
Benchmarking
Figure: EEMBC ConsumerMarks (performance). From [Rowen].
Figure: EEMBC Summary (performance/MHz). From [Rowen].
• Hand-coded assembly for the other processors
And I Haven’t Even Gotten To…

• Sharing input operands
• Substituting variables with constants
• Replacing memory tables with logic
• Limiting immediate values to the minimum required width
• Placing operands in special registers
• Creating SIMD instructions
• Reducing the size of operand specifiers
• Custom input/output queues
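As one illustration from this list, a custom SIMD instruction could add four packed 8-bit lanes held in a single 32-bit register. The C model below is an assumption made for illustration only: add4x8_model is a hypothetical name, and wrap-around (non-saturating) lanes are a design choice of this sketch, not something taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 4 x 8-bit SIMD add: each byte lane of x is
   added to the matching lane of y, and carries do not cross lanes. */
uint32_t add4x8_model(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t xs = (x >> (8 * lane)) & 0xffu;      /* extract lane from x */
        uint32_t ys = (y >> (8 * lane)) & 0xffu;      /* extract lane from y */
        result |= ((xs + ys) & 0xffu) << (8 * lane);  /* wrap within the lane */
    }
    return result;
}
```

In hardware the four lane additions would run side by side in a single instruction; the loop here only describes the result.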
OK, Let Me Have It, Dr. Smith
(The rest of you can ask questions too)