A Reconfigurable Processor Architecture
and Software Development Environment
for Embedded Systems
Andrea Cappelli
F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa,
L. Lavagno, C. Passerone, R. Canegallo
Nice, France
April 22, 2003
Outline
- Motivations
- XiRisc: a VLIW Processor
- PiCoGA: A Pipelined Configurable Gate Array
- Software Development Environment
- Results & Measurements
- Conclusions


Motivations
- Increased on-chip transistor density
- Increased integration costs
- Increased algorithmic complexity
- Strong limitations in power supply
- Severe power consumption constraints
Quest for performance and flexibility
[Figure: two plots over 1997-2009. Labels: Moore's law, millions of transistors/chip, technology (nm); algorithm complexity, battery capacity]

Embedded systems algorithm analysis
- 90% of computational complexity is concentrated in small kernels covering small parts of the overall code
- Many algorithms show relevant instruction-level parallelism
  - Performance is improved by multiple parallel data paths
- Operand granularity is typically different from 32-bit
  - A traditional ALU is power-inefficient
Significant improvements can be obtained by extending embedded processors with application-specific function units
Reconfigurable computing to achieve maximum flexibility

Existing Architectures
Standard processor coupled with embedded programmable logic, where
application-specific functions are dynamically remapped depending on
the algorithm being performed
1: Coprocessor model
2: Function unit model
EXTENDED INSTRUCTION SET RISC ARCHITECTURE
- Function unit approach: the reconfigurable device fits in a classical RISC pipeline
  - Low communication overhead
  - Exploits very high resource parallelism
- 32-bit load/store RISC architecture (5-stage pipeline)
- VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle
- Set of specialized function units implementing DSP-specific operations

Architecture
- Duplicated instruction decode logic (2 symmetrical data channels)
- Duplicated commonly used function units (ALU and shifter)
- All other function units are shared (DSP operations, memory handler)
- A tightly coupled pipelined configurable gate array

Dynamic Instruction Set Extension

- pGA-load: specific operation to transfer data from a configuration cache to the PiCoGA
  Fields: region specification, configuration specification
- pGA-op: 32-bit and 64-bit operations to launch execution inside the PiCoGA (data exchange through the register file)
  32-bit pGA-op fields: Source 1, Source 2, Dest 1, Dest 2, operation specification
  64-bit pGA-op fields: Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2, operation specification

PiCoGA: a Pipelined Configurable Gate Array
- Embedded function unit for dynamic extension of the instruction set
- Two-dimensional array of LUT-based reconfigurable logic cells
- Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor
- Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file
- DFG-based elaboration
  - Row elaboration is activated by an embedded control unit
  - Execution enable signal for each pipeline stage
- PiCoGA operation latency depends on the operation performed
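
One way to read "each row implements a possible stage" together with "DFG-based elaboration" is as a levelization of the data-flow graph: every node is placed one stage after its deepest predecessor, and the number of stages gives an operation-dependent latency. The sketch below illustrates only that idea. The types and names (`dfg_node`, `dfg_schedule`) are hypothetical, and the assumption that one DFG level occupies one row and one cycle is not stated in the slides, which do not describe the actual mapping algorithm.

```c
#include <assert.h>

#define MAX_PREDS 4

/* Hypothetical DFG node: indices of the nodes it depends on.
 * Operands that come straight from the register file are not listed. */
typedef struct {
    int preds[MAX_PREDS]; /* predecessor indices, each smaller than the node's own index */
    int n_preds;          /* number of valid entries in preds[] */
} dfg_node;

/* Assign each node to a pipeline stage (1-based) and return the number
 * of stages. Nodes must be given in topological order; stage[] must
 * have room for n entries. */
int dfg_schedule(const dfg_node *nodes, int n, int *stage)
{
    int depth = 0;
    for (int i = 0; i < n; i++) {
        int deepest = 0;
        for (int k = 0; k < nodes[i].n_preds; k++) {
            int p = nodes[i].preds[k];
            assert(p >= 0 && p < i);        /* enforce topological order */
            if (stage[p] > deepest)
                deepest = stage[p];
        }
        stage[i] = deepest + 1;             /* one stage after the deepest input */
        if (stage[i] > depth)
            depth = stage[i];
    }
    return depth;
}
```

For an operation such as a shift followed by an add, the shift has no internal predecessors and the add depends on the shift, so this model yields two stages.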
PiCoGA Configuration
[Figure: PiCoGA with four configuration layers (Layer1 to Layer4) connected to the Configuration Cache]
Goal: reduce cache misses due to PiCoGA configuration
- Multi-context programming (4 cache layers/planes inside the array)
- Dedicated Configuration Cache with a high-bandwidth bus to the PiCoGA (192 bits)
- Partial run-time reconfiguration (a region is configured while another one is active)
- Configuration is completely concurrent with processor elaboration
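
The multi-context idea can be illustrated with a small bookkeeping model: four layers each hold one configuration, a request for a resident configuration is a hit, and a miss triggers a load from the configuration cache. This is a sketch under stated assumptions, not the actual controller. The names (`picoga_ctx`, `ctx_select`), the round-robin replacement policy, and the one-configuration-per-layer granularity are all hypothetical; the slides state only that there are four layers and that regions can be reconfigured while others are active.

```c
#include <assert.h>
#include <stdint.h>

#define PICOGA_LAYERS 4          /* four configuration layers, per the slide */
#define NO_CONFIG     UINT32_MAX /* marker for an empty layer (assumption) */

typedef struct {
    uint32_t resident[PICOGA_LAYERS]; /* configuration id held by each layer */
    unsigned next_victim;             /* round-robin replacement (assumption) */
    unsigned loads;                   /* number of modeled pGA-load transfers */
} picoga_ctx;

/* Mark every layer empty and reset the counters. */
void ctx_init(picoga_ctx *c)
{
    for (unsigned i = 0; i < PICOGA_LAYERS; i++)
        c->resident[i] = NO_CONFIG;
    c->next_victim = 0;
    c->loads = 0;
}

/* Return the layer holding config_id, loading it first on a miss. */
unsigned ctx_select(picoga_ctx *c, uint32_t config_id)
{
    assert(config_id != NO_CONFIG);
    for (unsigned i = 0; i < PICOGA_LAYERS; i++)
        if (c->resident[i] == config_id)
            return i;                         /* hit: no transfer needed */

    unsigned layer = c->next_victim;          /* miss: models one pGA-load */
    c->resident[layer] = config_id;
    c->next_victim = (layer + 1) % PICOGA_LAYERS;
    c->loads++;
    return layer;
}
```

In this model, a kernel that alternates between up to four configurations incurs loads only on first use, which is the effect the multi-context scheme is aiming for.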
The Software Development Environment
[Figure: tool flow. Blocks: Initial C code, Profiling, Computation kernel extraction, PiCoGA mapping, pGA-op latency information, Assembler-level scheduler, Executable code]
Software Simulation
Goals: check the correctness of the algorithm and evaluate performance
In the source code, a pGA-op is described using a pragma directive:
#pragma pGA shift_add 0x12 5 c a b
c = ( a << 2 ) + b;
#pragma end
/**************************************/
/* Shift_add mapped on PiCoGA         */
/**************************************/
#if defined(PiCoGA)
...
asm("pGA-op 0x12 ...");
...
/**************************************/
/* Emulation function _shift_add      */
/**************************************/
#else
void _shift_add(){
...
c = ( a << 2 ) + b;
...
}
#endif
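
The emulation branch above can be written as a self-contained function for testing on a host machine. This is a minimal sketch: the name `shift_add_emul`, the explicit parameters, and the unsigned 32-bit operand type are assumptions, since the slide shows only the expression `c = ( a << 2 ) + b` with its surroundings elided.

```c
#include <stdint.h>

/* Hypothetical host-side emulation of the shift_add pGA-op.
 * Mirrors the pragma body: c = (a << 2) + b.
 * Unsigned 32-bit operands are an assumption; results wrap modulo 2^32. */
uint32_t shift_add_emul(uint32_t a, uint32_t b)
{
    return (a << 2) + b;
}
```

For example, `shift_add_emul(3, 1)` evaluates `(3 << 2) + 1`, which is 13.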
Software Simulation
Two special instructions are defined to support emulation:
...
topga ...
jal _shift_add
fmpga ...
...
- topga saves the current state and passes arguments to the emulation function. The clock cycle count is halted
- fmpga copies the emulation function result(s) and restores registers; the cycle count is incremented by the latency value of the pGA-op
Overall performance is evaluated by counting elaboration cycles
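
The cycle accounting described here can be sketched as a counter that stops while the emulation function runs and is then charged the pGA-op latency. The structure and function names (`sim_clock`, `sim_tick`, `sim_topga`, `sim_fmpga`) are hypothetical, and saving and restoring of registers is not modeled; only the counting rule from the slide is shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simulator clock: a cycle counter that can be halted. */
typedef struct {
    uint64_t cycles;  /* elaboration cycles counted so far */
    bool     halted;  /* true while an emulation function is running */
} sim_clock;

/* Count one processor cycle unless counting is halted. */
void sim_tick(sim_clock *clk)
{
    if (!clk->halted)
        clk->cycles++;
}

/* topga: halt cycle counting before entering the emulation function. */
void sim_topga(sim_clock *clk)
{
    clk->halted = true;
}

/* fmpga: resume counting and add the latency of the emulated pGA-op. */
void sim_fmpga(sim_clock *clk, uint32_t pga_latency)
{
    clk->halted = false;
    clk->cycles += pga_latency;
}
```

With this rule, the host instructions executed by the emulation function cost nothing, and each pGA-op contributes exactly its declared latency to the total.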
Results and Measurements
Speed-ups for several signal processing cores:

Benchmark            Speed-up
DES                  13.5x
CRC                  4.3x
Median Filter        7.7x
Motion Estimation    12.4x
Motion Prediction    4.5x
Turbo Codes          12x

Normalized Energy Histogram
- 75% of energy consumption for a VLIW architecture is due to accesses to instruction and data memory
- Strong reduction of accesses to instruction memory
[Figure: normalized energy (scale 0 to 1), VLIW only vs. VLIW + PiCoGA. Bar values as listed: 0.27, 0.15, 0.22, 0.076. Benchmarks as listed: DES, CRC, Median Filter, Motion Prediction]
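
To relate a normalized bar value to a percentage saving, a small worked calculation may help. It assumes the listed values are the energy of VLIW + PiCoGA normalized to the VLIW-only case (= 1), which is how the histogram appears to be scaled; the function name is hypothetical.

```c
/* Energy reduction, in percent, for a normalized energy value.
 * 1.0 is taken to be the VLIW-only baseline (assumption about the chart). */
double energy_reduction_percent(double normalized_energy)
{
    return (1.0 - normalized_energy) * 100.0;
}
```

Under that assumption, the smallest listed value, 0.076, corresponds to a reduction of about 92.4%.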
Conclusions
- XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit
- PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells
- Specific software development toolchain
- Speed-ups range from 4.3x to 13.5x
- Up to 93% energy consumption reduction
