Parallel Memory Access -
Download
Report
Transcript Parallel Memory Access -
Systematic development of
programs with parallel instructions
SHARC ADSP21XXX processor
M. Smith,
Electrical and Computer Engineering,
University of Calgary, Canada
smithmr @ ucalgary.ca
To be tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
Rewrite with specialized resources
Move to “resource chart”
Unroll the loop
Adjust code
Reroll the loop
Check if worth the effort
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
2 / 33
ADSP-21XXX -- Parallelism opportunities
CACHE
Memory pointer operations
MEMORY
32 x 48
Post modify 2 index registers
DAG 1
DAG 2
Automatic8circular
buffer
operations
x 4 x 32
8 x 4 x 24
Automatic bit reverse addressing
PMA BUS
JTAG TEST &
Zero overhead loops
EMULATION
Instruction pipeline
issues
FLAGS
PROGRAM
SEQUENCER
TIMER
24
PMA
BUS
32
Ability forDMA
parallel
memory operation,
DMA
48
PMD BUS
One each on pm, dm and instruction cache bussesPMD
Key issue -- Only 48? bits available in OPCODE to describe
DMD BUS
40
16 data registers in 3 destinations and 6 sources = 135 bits
2 * (8 index + 8 modify + 16 data) = 64 bits
Condition code selection, 32 bit constants etc.
BUS CONNECT
DMD
Many parallel operations
and register to register bus transfers
REGISTER
Rn FLOATING
= Rx +& FIXED-POINT
Ry or Rn = RxFILE
* Ry
32-BIT
FLOATING-POINT
MULTIPLIER,
16 x 40
BARREL
& FIXED-POINT
FIXED-POINT
Rm = Rx
+ Ry, Rn = Rx - Ry with/without
Rp = Rq * Rr ALU
SHIFTER
ACCUMULATOR
Compiler is only -- somewhat useful
See article in course notes from
Embedded System Design Sept./October 2000
Need to get a systematic process to provide
Parallelism without pain
Lab. Library version of FFT, custom version of Burg
Algorithm (AR modeling)
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
4 / 33
Basic code development -- any system
Write the “C” code for the function
void Conjugate(float *re_pt, float *im_pt, int N)
Real and imaginary components in different arrays
Performs
input
output
= a + jb
= a - jb
Convert the code to ADSP 21XXX/68K etc.
assembly code, following the standard coding and
documentation practices
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
5 / 33
Parallel Instruction Code Development
Write the 21k assembly code for the function
void Conjugate(float *re_pt, float *im_pt, int N)
which etc…...
Determine the instruction flow through the
architecture using a resource usage diagram
Theoretically optimize the code -- a 2
minute counting process
Compare and contrast the amount of time to
perform the subroutine before and after
customization.
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
6 / 33
Standard “C” code
void Conjugate(float *re_pt, float *im_pt, int N) {
int count;
for (count = 0; count < N; count++) {
*im_pt = - *im_pt;
im_pt++;
}
void Conjugate_V2(float *re_pt, pm float *im_pt, int N) {
int count;
}
for (count = 0; count < N; count++) {
*im_pt = - *im_pt;
im_pt++;
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
9/12/2016
Copyright [email protected]
7 / 33
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be
used in parallel you could find the correct optimization
approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
8 / 33
21XXX-style load/store “C” code
void Conjugate(register float *in_pt, register float *out_pt
register int N) {
register int count;
register float *pt = out_pt;
register float scratch;
for (count = 0; count < N; count++) {
scratch = *pt;
scratch = -scratch
*pt = scratch;
pt++;
}
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
9 / 33
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be
used in parallel you could find the correct optimization
approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
10 / 33
Assembly code
PROLOGUE
BODY
Appropriate defines to make easy reading of code
Saving of non-volatile registers
Try to plan ahead for parallel operations
Know which 21k “multi-function” instructions are valid
EPILOGUE
Recover non-volatile registers
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
11 / 33
Straight conversion -- PROLOGUE
// void Conjugate(reg float *in, *out, reg int N) {
.segment/pm seg_pmco;
.global _Conjugate;
_Conjugate:
//
#define countR1 scratchR1
//
register int count = GARBAGE;
register float *pt = out_pt;
#define pt scratchDMpt
pt = INPAR2;
// dead <- R8, can re-use
#define scratchF8 F8
// float scratch = GARBAGE
// For the CURRENT code -- no volatile registers are needed
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
12 / 33
Straight conversion of code
//
for (count = 0; count < N; count++) {
LCNTR = INPAR3, DO LOOP_LAST UNTIL LCE:
// Dead <- INPAR3
scratchF8 = dm(0, pt);
//
scratch = *pt;
// Not ++ as pt re-used
scratchF8 = -scratchF8;
// scratch = -scratch
LOOP_LAST:
dm(pt, 1) = scratchF8;
// *pt = scratch; pt++;
5 magic lines of code
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
13 / 33
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
Write the assembly code using a hardware loop
Accounts for the SHARC super scalar RISC DSP architecture
Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be
used in parallel you could find the correct optimization
approach.
Means -- place values in appropriate registers to permit parallelism
BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
14 / 33
Speed rules for memory access
scratch = dm(0, pt);
scratch = dm(pt, 0);
dm(pt, 1) = scratch;
Can’t use PREMODIFY PERIOD
// Not ++ as to be re-used
Can’t use POST MODIFY
OPERATIONS with CONSTANTS
Use of constants as modifiers is not allowed -- not enough
bits in the opcode -- need 32 bits for each constant
Must use Modify registers to store these constants.
Several useful constants placed in modify registers
(DAG1 and DAG2) during “C-code” initialization (if linked in)
scratch = dm(pt, zeroDM); // Not ++ as to be re-used
dm(pt, plus1DM) = scratch;
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
15 / 33
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
Write the assembly code using a hardware loop
Accounts for the SHARC super scalar RISC DSP architecture
Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be
used in parallel you could find the correct optimization
approach
Means -- place values in appropriate registers to permit parallelism
BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
16 / 33
Resource Management -- Chart1 -- Basic code
MULTIPLIER
ADDER
DM BUS
PM BUS
Pt = INPAR2
Lcnt = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
F8 = dm( )
F8 = -F8
LOOP_LAST
dm( ) = F8
In theory -- if we could find out how
- and dm in parallel
DATA-BUS is limiting resource
dm
2 cycle loop possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
17 / 33
Resource Management – Chart 2 -- Basic code
MULTIPLIER
ADDER
DM BUS
PM BUS
Pt = INPAR2
Lcnt = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
F8 = dm( )
F8 = -F8
LOOP_LAST
pm( ) = F8
In theory -- if we could find out how
- and dm and pm in parallel
1 cycle loop possible
MORE COMPLEX EXAMPLE – MAY BE LESS OBVIOUS
IS THIS SUFFICIENT – IF NOT PROCEED NO FURTHER
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
18 / 33
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
Write the assembly code using a hardware loop
Means -- place values in appropriate registers to permit parallelism
BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be
used in parallel you could find the correct optimization
approach
Accounts for the SHARC super scalar RISC DSP architecture
Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
19 / 33
Unroll the loop
ADDER
DM BUS
F8 = dm( )
F8 = -F8
First time
Into loop
dm( ) = F8
F8 = dm( )
2nd time
dm( ) = F8
F8 = dm( )
3rd time
F8 = -F8
F8 = -F8
dm( ) = F8
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
20 / 33
Unrolled loop -- rewrite
ADDER
DM BUS
F4 = dm( )
First time
dm( ) = F4
F8 = dm( )
Don’t fight possible
register conflicts
F4 = -F4
F8 = -F4
dm( ) = F8
F4 = dm( )
F4 = -F4
dm( ) = F4
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
21 / 33
Unrolled loop – Move into stalls
ADDER
DM BUS
F4 = dm( )
First time
F4 = -F4
dm( ) = F4
F8 = dm( )
F8 = -F4
dm( ) = F8
F4 = dm( )
F4 = -F4
dm( ) = F4
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
22 / 33
Unrolled more optimized loop
ADDER
F4 = -F4
F8 = -F8
F4 = -F4
F8 = -F8
DM BUS
F4 = dm( )
F8 = dm( )
dm( ) = F4
dm( ) = F8
F4 = dm( )
F8 = dm( )
dm( ) = F4
dm( ) = F8
1
1
1
2
3
3
3
4
and 2
and 2
and 4
and 4
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
23 / 33
Need 1 of the resources to be
maxed out.
Otherwise algorithm is inefficient
May have to try a lot of different
approaches
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
24 / 33
Unrolled loop – identifying repeat
ADDER
F4 = -F4
F8 = -F8
F4 = -F4
F8 = -F8
DM BUS
F4 = dm( )
F8 = dm( )
dm( ) = F4
dm( ) = F8
F4 = dm( )
F8 = dm( )
dm( ) = F4
dm( ) = F8
Repeating
pattern
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
25 / 33
Now to to “reroll the loop”
The loop is currently just straight line coded.
Must put back into the “loop format” for coding
efficiency, maintainability and seg_pmco
limitations.
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21XXX
Copyright [email protected]
9/12/2016
26 / 33
Re-rolled loop
ADDER
DM BUS
INPAR3 = PASS INPAR3
IF EQ JUMP ENDLOOP;
INPAR3 = ASHIFT INPAR3 BY -1
LCNT = INPAR3, DO (PC, LOOP_LAST) UNTIL LCE
F4 = dm( )
F4 = -F4
F8 = dm( )
F8 = -F8
dm( ) = F4
LOOP_LAST dm( ) = F8
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
27 / 33
ISSUES
Before
Went N times around the loop
Loop was of size 3
Total count was 4 + N *3 + 5 cycles
NOW
Went N/2 times around the loop
Loop was of size 4
Total count was 3 + N/2 * 4 + 5 cycles
BUT N MUST BE KNOWN TO BE EVEN
N = 2 K, where K = 0, 1, 2, 3, 4, 5, etc
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
28 / 33
Your job
Rewrite the code if N is known to be odd
N = 2K + 1, where K = 0, 1, 2, 3, 4, 5, 6, 7
Rewrite the code if N could be either odd or even
Rewrite the code if the imaginary part of the
number could be in program memory
Rewrite the code if you “can’t leave” the array in
place – meaning you must move both the real and
imaginary parts while conjugating
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected]
9/12/2016
29 / 33
Tackled today
What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
Rewrite with specialized resources
Move to “resource chart”
Unroll the loop
Adjust code
Reroll the loop
Check if worth the effort
To come -- Tutorial practice of parallel coding
To come
-- Optimum FIR filter with parallelism
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
9/12/2016
Copyright [email protected]
30 / 33