A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems
Andrea Cappelli
F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo
Nice, France
April 22, 2003
Outline
Motivations
XiRisc: a VLIW Processor
PiCoGA: a Pipelined Configurable Gate Array
Software Development Environment
Results & Measurements
Conclusions
Motivations
Increased on-chip transistor density
Increased integration costs
Increased algorithmic complexity
Strong limitations in power supply
Severe power consumption constraints
Quest for performance and flexibility
[Figures: trend charts over 1997-2009. Labels: Millions of transistors/chip, Technology (nm), Moore's law, Algorithm complexity, Battery capacity]
Embedded systems: algorithm analysis
90% of the computational complexity is concentrated in small kernels that cover a small part of the overall code
Many algorithms show significant instruction-level parallelism
Performance is improved by multiple parallel data paths
Operand granularity typically differs from 32 bits
A traditional ALU is power-inefficient
Significant improvements can be obtained by extending embedded processors with application-specific function units
Reconfigurable computing to achieve maximum flexibility
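The 90% figure above can be turned into a rough bound on whole-program speed-up using Amdahl's law. The sketch below is not from the slides: it assumes that "computational complexity" translates directly into run time, and the function name `overall_speedup` is made up for illustration.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a fraction `kernel_fraction` of the
 * run time is accelerated by a factor `kernel_speedup` and the rest of
 * the program is left unchanged.
 * Assumption (not stated in the slides): complexity share == run-time share. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With `kernel_fraction = 0.9`, a 10x kernel speed-up yields roughly 5.3x overall, and the overall figure can never exceed 10x however fast the kernels become. Under this assumption, speed-ups measured on isolated kernels would overstate the gain for a whole application.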
Existing Architectures
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically remapped depending on the algorithm being performed
1: Coprocessor model
2: Function unit model
EXTENDED INSTRUCTION SET RISC ARCHITECTURE
Function unit approach: the reconfigurable device fits in a classical RISC pipeline:
Low communication overhead
Exploits very high resource parallelism
32-bit load/store RISC architecture (5-stage pipeline)
VLIW elaboration:
Concurrent fetch and execution of two 32-bit instructions per cycle
Set of specialized function units implementing DSP-specific operations
Architecture
Duplicated instruction decode logic (2 symmetrical data channels)
Duplicated commonly used function units (ALU and shifter)
All other function units are shared (DSP operations, memory handler)
A tightly coupled pipelined configurable gate array
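The split between duplicated and shared units suggests a simple pairing rule for the two instruction slots. The sketch below is only an illustration of that idea: the unit classes, the `instr_t` layout and the rule itself are assumptions, and the slides do not say whether such checks are done by the scheduler or by hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical unit classes. The slides only state that the ALU and the
 * shifter are duplicated and that the other units are shared. */
typedef enum { UNIT_ALU, UNIT_SHIFTER, UNIT_DSP, UNIT_MEMORY } unit_t;

/* Hypothetical instruction summary used only for this pairing check. */
typedef struct {
    unit_t unit;
    int dest;   /* destination register, -1 if none */
    int src1;   /* source register, -1 if unused */
    int src2;   /* source register, -1 if unused */
} instr_t;

static bool is_duplicated(unit_t u)
{
    return u == UNIT_ALU || u == UNIT_SHIFTER;
}

/* Returns true if `first` and `second` could share one issue slot pair
 * under the assumed rule: no conflict on a shared unit, and the second
 * instruction neither reads nor overwrites the first one's result. */
bool can_dual_issue(const instr_t *first, const instr_t *second)
{
    assert(first != 0 && second != 0);

    /* Structural conflict: both need the same non-duplicated unit. */
    if (first->unit == second->unit && !is_duplicated(first->unit))
        return false;

    if (first->dest >= 0) {
        /* Read-after-write dependency inside the pair. */
        if (second->src1 == first->dest || second->src2 == first->dest)
            return false;
        /* Both write the same register. */
        if (second->dest == first->dest)
            return false;
    }
    return true;
}
```

Under this rule two independent ALU operations can be paired, while two memory operations cannot, because the memory handler exists only once.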
Dynamic Instruction Set Extension
Specific operation to transfer data from a configuration cache to the PiCoGA:
pGA-load fields: region specification, configuration specification
32-bit and 64-bit operations to launch execution inside the PiCoGA (data exchange through the register file):
32-bit pGA-op fields: Source 1, Source 2, Dest 1, Dest 2, operation specification
64-bit pGA-op fields: Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2, operation specification
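To make the 32-bit pGA-op format concrete, the sketch below packs and unpacks its fields. The slides give neither bit widths nor field order, so everything here is hypothetical: a 6-bit major opcode, four 5-bit register indices (assuming 32 registers) and a 6-bit operation specification, plus the names `pga_op32_t`, `PGA_OPCODE`, `pga_op32_encode` and `pga_op32_decode`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field set for a 32-bit pGA-op. */
typedef struct {
    uint8_t src1, src2;   /* source registers, 0..31 (assumed) */
    uint8_t dst1, dst2;   /* destination registers, 0..31 (assumed) */
    uint8_t op;           /* operation specification, 0..63 (assumed) */
} pga_op32_t;

#define PGA_OPCODE 0x3Fu  /* made-up major opcode, not from the slides */

/* Assumed layout: [31:26] opcode, [25:21] src1, [20:16] src2,
 * [15:11] dst1, [10:6] dst2, [5:0] operation specification. */
uint32_t pga_op32_encode(pga_op32_t f)
{
    assert(f.src1 < 32 && f.src2 < 32 && f.dst1 < 32 && f.dst2 < 32);
    assert(f.op < 64);
    return (PGA_OPCODE << 26)
         | ((uint32_t)f.src1 << 21)
         | ((uint32_t)f.src2 << 16)
         | ((uint32_t)f.dst1 << 11)
         | ((uint32_t)f.dst2 << 6)
         | (uint32_t)f.op;
}

pga_op32_t pga_op32_decode(uint32_t word)
{
    pga_op32_t f;
    f.src1 = (uint8_t)((word >> 21) & 0x1Fu);
    f.src2 = (uint8_t)((word >> 16) & 0x1Fu);
    f.dst1 = (uint8_t)((word >> 11) & 0x1Fu);
    f.dst2 = (uint8_t)((word >> 6) & 0x1Fu);
    f.op   = (uint8_t)(word & 0x3Fu);
    return f;
}
```

The 64-bit form would carry two more source fields, which is presumably why it needs a second instruction word.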
PiCoGA: a Pipelined Configurable Gate Array
Embedded function unit for dynamic extension of the instruction set
Two-dimensional array of LUT-based reconfigurable logic cells
Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor
Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file
DFG (data-flow graph) based elaboration
Row elaboration is activated by an embedded control unit
Execution enable signal for each pipeline stage
PiCoGA operation latency depends on the operation performed
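One way to picture DFG-based elaboration is to place every node of the data-flow graph in the earliest row that follows all of its predecessors. The depth of the graph then gives the number of pipeline stages. The sketch below illustrates that levelling step only. It assumes each node fits in one row and has at most two predecessors, and it is not the authors' mapping algorithm. The names `dfg_node_t` and `dfg_assign_rows` are made up.

```c
#include <assert.h>

/* A DFG node with up to two predecessors, given as indices into the node
 * array. -1 marks an operand that comes straight from the register file.
 * Nodes must be listed in topological order (predecessors first). */
typedef struct {
    int pred[2];
} dfg_node_t;

/* Assigns each node to the earliest row after all of its predecessors
 * (rows are numbered from 1) and returns the number of rows used. */
int dfg_assign_rows(const dfg_node_t *nodes, int n, int *row_out)
{
    assert(nodes != 0 && row_out != 0 && n >= 0);
    int depth = 0;
    for (int i = 0; i < n; i++) {
        int row = 1;
        for (int k = 0; k < 2; k++) {
            int p = nodes[i].pred[k];
            assert(p < i);   /* enforces topological order */
            if (p >= 0 && row_out[p] + 1 > row)
                row = row_out[p] + 1;
        }
        row_out[i] = row;
        if (row > depth)
            depth = row;
    }
    return depth;
}
```

For `c = (a << 2) + b`, the shift lands in row 1 and the add in row 2. If each row took one cycle, the depth of 2 would also be the latency of the operation, which matches the statement that latency depends on the operation performed.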
PiCoGA Configuration
[Figure: PiCoGA with four configuration layers (Layer 1 to Layer 4) connected to the configuration cache]
Goal: to reduce cache misses due to PiCoGA configuration
Multi-context programming (4 cache layers/planes inside the array)
Dedicated configuration cache with a high-bandwidth bus to the PiCoGA (192 bits)
Partial run-time reconfiguration (one region is configured while another is active)
Configuration is completely concurrent with processor elaboration
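The effect of multi-context programming can be illustrated with a small bookkeeping model: switching to a configuration that is already resident in one of the four layers costs no load, while any other configuration must first be fetched from the configuration cache. The model below is a sketch under stated assumptions. It ignores regions and partial reconfiguration, the round-robin replacement policy is invented for illustration, and the names `pga_ctx_t`, `pga_ctx_init` and `pga_ctx_activate` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PGA_LAYERS 4   /* from the slides: 4 layers/planes inside the array */

typedef struct {
    int cfg[PGA_LAYERS];  /* configuration id held by each layer, -1 = empty */
    int next_victim;      /* round-robin replacement pointer (assumed policy) */
    int active;           /* index of the active layer, -1 = none */
} pga_ctx_t;

void pga_ctx_init(pga_ctx_t *c)
{
    assert(c != 0);
    for (int i = 0; i < PGA_LAYERS; i++)
        c->cfg[i] = -1;
    c->next_victim = 0;
    c->active = -1;
}

/* Makes `cfg_id` the active configuration. Returns true if it was already
 * resident (context switch only), false if it had to be loaded into a
 * layer from the configuration cache. */
bool pga_ctx_activate(pga_ctx_t *c, int cfg_id)
{
    assert(c != 0 && cfg_id >= 0);
    for (int i = 0; i < PGA_LAYERS; i++) {
        if (c->cfg[i] == cfg_id) {
            c->active = i;
            return true;
        }
    }
    int victim = c->next_victim;
    c->cfg[victim] = cfg_id;
    c->next_victim = (victim + 1) % PGA_LAYERS;
    c->active = victim;
    return false;
}
```

Since the slides state that configuration runs concurrently with processor elaboration, a load in the real device need not stall the processor. The model only counts how often a load is required.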
The Software Development Environment
[Figure: tool flow diagram. Blocks: initial C code, profiling, computation kernel extraction, PiCoGA mapping, pGA-op latency information, assembler-level scheduler, executable code]
Software Simulation
Goals: check the correctness of the algorithm and evaluate performance
In the source code, a pGA-op is described using a pragma directive:
#pragma pGA shift_add 0x12 5 c a b
c = ( a << 2 ) + b;
#pragma end
/**************************************/
/* shift_add mapped on PiCoGA         */
/**************************************/
#if defined(PiCoGA)
...
asm("pGA-op 0x12 ...");
...
/**************************************/
/* Emulation function _shift_add      */
/**************************************/
#else
void _shift_add(){
...
c = ( a << 2 ) + b;
...
}
#endif
Software Simulation
Two special instructions are defined to support emulation:
...
topga ...
jal _shift_add
fmpga ...
...
topga saves the current state and passes arguments to the emulation function. The cycle count is halted
fmpga copies the emulation function result(s) and restores registers; the cycle count is incremented by the latency value of the pGA-op
Overall performance is evaluated by counting elaboration cycles
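The cycle accounting described for topga and fmpga can be sketched as a tiny counter model. It covers only the counting rule stated above and leaves out state saving and argument passing. The type `sim_clock_t` and the function names are hypothetical and do not come from the simulator itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t cycles;  /* elaboration cycles counted so far */
    bool halted;      /* true between topga and fmpga */
} sim_clock_t;

void sim_init(sim_clock_t *s)
{
    assert(s != 0);
    s->cycles = 0;
    s->halted = false;
}

/* Advances the simulation by n processor cycles. Cycles spent inside the
 * emulation function are not counted. */
void sim_step(sim_clock_t *s, uint64_t n)
{
    assert(s != 0);
    if (!s->halted)
        s->cycles += n;
}

/* topga: entering the emulation function halts the cycle count. */
void sim_topga(sim_clock_t *s)
{
    assert(s != 0 && !s->halted);
    s->halted = true;
}

/* fmpga: leaving the emulation function resumes counting and charges the
 * latency of the emulated pGA-op instead. */
void sim_fmpga(sim_clock_t *s, uint64_t pga_latency)
{
    assert(s != 0 && s->halted);
    s->halted = false;
    s->cycles += pga_latency;
}
```

For example, 10 ordinary cycles followed by an emulated pGA-op of latency 2 count as 12 cycles, no matter how many instructions the emulation function itself executes.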
Results and Measurements
Speed-ups for several signal processing cores:
Kernel              Speed-up
DES                 13.5x
CRC                 4.3x
Median Filter       7.7x
Motion Estimation   12.4x
Motion Prediction   4.5x
Turbo Codes         12x
Strong reduction of accesses to instruction memory
75% of the energy consumption of a VLIW architecture is due to accesses to instruction and data memory
Normalized Energy Histogram (VLIW only = 1)
Kernel              VLIW + PiCoGA
DES                 0.27
CRC                 0.15
Median Filter       0.22
Motion Prediction   0.076
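The normalized energy values can be converted into percentage reductions with one line of arithmetic. The helper below is a sketch for that conversion. It assumes each value is the VLIW + PiCoGA energy divided by the VLIW-only energy for the same kernel, and the function name `energy_reduction_percent` is made up.

```c
#include <assert.h>

/* Percentage of energy saved, given energy normalized to the VLIW-only
 * baseline (baseline = 1.0). */
double energy_reduction_percent(double normalized_energy)
{
    assert(normalized_energy >= 0.0);
    return (1.0 - normalized_energy) * 100.0;
}
```

A normalized value of 0.27 corresponds to a reduction of about 73%, and the smallest value in the histogram, 0.076, to about 92.4%.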
Conclusions
XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit
PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells
Specific software development toolchain
Speed-ups range from 4.3x to 13.5x
Up to 93% reduction in energy consumption