Diopsis Roadmap - Istituto Nazionale di Fisica Nucleare


Scalable Software Hardware Architecture Platform for Embedded Systems
SHAPES at DATE 2007
Pier Stanislao PAOLUCCI
Chief Technical Officer – ATMEL Roma
& (part-time) permanent staff researcher – INFN Roma
for the SHAPES Consortium
Project Motivation and Final Objective

SHAPES Acronym: Scalable Software Hardware Architecture Platform for Embedded Systems

Objective: develop a prototype of a tiled, scalable HW & SW architecture for embedded applications characterized by inherent parallelism
Experiment: "small" tiles (<10 MGate) connected by "short wires", weaving a packet-switching on-chip and off-chip network
• The HW architecture should scale to the next deep-submicron technologies

Challenge: how to program a tiled architecture

Benchmarks
• Multi-loudspeaker, multi-source wave field synthesis
• Multi-microphone voice extraction from noise
• Ultrasound scanners
• Physical modelling of quantum chromodynamics
January, 2007
Introduction to SHAPES
HW
HW Objectives
• Maintain profitable average selling prices
• Control NRE by IP reuse
HW Solution
• Appropriate granularity: "small" tiles (<10 MGate) connected by "short" (first-neighbour) wires
• Inside the typical elementary tile:
  • Fully C-programmable VLIW DSP for computing
  • RISC for control
  • Distributed Network Processor (a kind of generalized inter-tile DMA controller) for inter-tile communication
• Multi-tile silicon area >40 mm², <90 mm²
• Management of logic and place-and-route complexity through IP reuse
• Multi-level network:
  • Intra-tile: multi-layer bus matrix
  • Inter-tile: NoC (intra-chip) + 3DT (inter-chip)
  • A distributed routing fabric connects on-chip and off-chip tiles, weaving a packet-switching network
SW

Communication-centric, real-time-aware programming environment
• Application description: model based, with explicit annotation of real-time constraints
• Provide automated, optimized binding of processes to computing resources and of inter-process communication to communication resources, plus scheduling of processes and their communication
• Provide automated generation of hardware dependent software support
• Retargetable compilation managing intra-tile and inter-tile parallelism, bandwidth and latencies
• Fast simulation
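The automated binding of processes to computing resources mentioned on the SW slide can be illustrated with a toy search. This is only a sketch under stated assumptions, not the SHAPES or DOL algorithm: the ring-shaped tile layout, the cost function (traffic volume times hop count), the per-tile process limit and all names are hypothetical.

```python
from itertools import product


def hops(a, b, n_tiles):
    """Hop distance between tiles a and b on a ring (assumed topology)."""
    d = abs(a - b)
    return min(d, n_tiles - d)


def best_binding(processes, channels, n_tiles, max_per_tile):
    """Exhaustively search process-to-tile bindings.

    processes:    list of process names
    channels:     dict mapping (src, dst) -> traffic volume (hypothetical units)
    max_per_tile: maximum number of processes allowed on one tile
    Returns (cost, binding) minimising sum(traffic * hops).
    """
    best = None
    for assignment in product(range(n_tiles), repeat=len(processes)):
        # Skip bindings that overload a tile.
        if any(assignment.count(t) > max_per_tile for t in range(n_tiles)):
            continue
        binding = dict(zip(processes, assignment))
        cost = sum(volume * hops(binding[src], binding[dst], n_tiles)
                   for (src, dst), volume in channels.items())
        if best is None or cost < best[0]:
            best = (cost, binding)
    return best


# Four processes in a pipeline; the A-B and C-D channels carry most traffic.
cost, binding = best_binding(
    processes=["A", "B", "C", "D"],
    channels={("A", "B"): 10, ("B", "C"): 1, ("C", "D"): 10},
    n_tiles=2,
    max_per_tile=2,
)
print(cost, binding)
```

The search keeps the heavily communicating pairs on the same tile, so only the light B-to-C channel crosses between tiles. Exhaustive search is only practical for a handful of processes; a real tool flow would need a heuristic or an optimiser.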
Consortium Composition and
Roles of the Partners
System SW
ETH Zurich - Distributed Operation Layer: manages application parallelism
TIMA Lab and THALES - Hardware dependent Software Layer and RTOS
TARGET Compiler Tech. - Retargetable Compilers
RWTH Aachen Univ. – Fast Simulation of Heterogeneous Multi Proc. Systems
System HW
ATMEL Roma - Tile:
Evolution of (Diopsis®: mAgicV VLIW DSP™ + RISC) + INFN DNP™
INFN Roma - DNP™ Distributed Network Processor + 3D Toroidal Eng.:
Evolution of APE Massively Parallel Processors
STMicroelectronics + Univ. of Cagliari and Pisa – Network on Chip:
Evolution of Spidergon™ Packet Switching Network on Chip
Parallel Application benchmarking
Fraunhofer IDMT – multi-loudspeaker Audio Wave Field Synthesis
ESAOTE, MedCom, Fraunhofer IGD - Ultrasound scanner
INFN - Physical Modelling
ATMEL – multi-microphone arrays for voice-extraction
Deep Sub-micron Architectures…
• ~160 MGate available on a 100 mm² chip (45 nm CMOS, 2008)
• Increasing gates/chip → design complexity management:
  • Embedded processors use only a few million gates; IP reuse is possible and needed
• Wiring threatens Moore's law:
  • Wiring delay increases with each new CMOS silicon generation
  • The full chip cannot be reached in a single clock cycle
  • Classic monolithic processor architectures do not scale
  → Locally synchronous, globally asynchronous design needed
  → Communication-centric SW and HW architecture needed
• Solution: … tiled architecture … by simple geometric demonstration … if the logic complexity inside each tile is constant … then the length of intra-tile wires scales down as the tile itself … and the on-chip and off-chip inter-tile wires are short (~ first neighbours)
• Quest for the best tile and the best on-chip and off-chip interconnect. But how to program it? An explicit parallel programming paradigm, and culture, needed … proposed
• Power dissipation density approaches prohibitive values if higher clock speeds are used; much better operations/Watt at moderate clock + parallelism (the parallel architecture of the human brain does an excellent job at 50 Hz!... room for improvement)
Distributed Network Processor
DNP: a generalized DMA controller for inter-tile or intra-tile packet routing
DNP ports:
• BUS Slave (to receive commands from RISC & DSP)
• BUS Master (to read from intra-tile memories)
• BUS Master (simultaneous intra-tile memory write)
• NoC (to forward/receive inter-tile ON-CHIP packets)
• 3DT X+, 3DT X-, 3DT Y+, 3DT Y-, 3DT Z+, 3DT Z- (to forward/receive inter-tile OFF-CHIP packets)
• Collective communication
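To make the six 3DT ports concrete, the sketch below computes the first neighbours of a tile on a three-dimensional torus. It assumes tiles are addressed by (x, y, z) coordinates with wrap-around in each direction; the function name and the coordinate convention are hypothetical and are not taken from the DNP design.

```python
def torus_neighbours(coord, dims):
    """Return the six first neighbours of a tile on a 3D torus.

    coord: (x, y, z) position of the tile (assumed addressing scheme)
    dims:  (nx, ny, nz) number of tiles along each direction
    The result maps a port name such as "X+" to the neighbour's coordinates.
    """
    neighbours = {}
    for axis, name in enumerate("XYZ"):
        for step, sign in ((+1, "+"), (-1, "-")):
            n = list(coord)
            # The modulo implements the wrap-around link of the torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbours[name + sign] = tuple(n)
    return neighbours


# A corner tile of a 4 x 4 x 4 torus still has six neighbours.
for port, tile in torus_neighbours((0, 0, 0), (4, 4, 4)).items():
    print(port, tile)
```

Because every tile has the same six ports regardless of its position, the wiring stays short and regular as the system grows, which is the property the first-neighbour approach relies on.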
Different Types of Tiles
• RDT: RISC + DSP Elementary Tile
• RET: RISC Elementary Tile
• DET: DSP Elementary Tile
[Figure: block diagrams of the three tile types. Block labels: RISC, DSP, DNP, Multi-Layer BUS, DXM (DXM Mem Bus), POT (POT Pads), NoC, 3DT]
mAgicV IP Architecture
(Fully C programmable Gigaflops VLIW DSP)
[Figure: mAgicV block diagram. Labelled blocks:]
• External signals: DBG, IRQ IN, IRQ OUT, RST, CLOCKS, AHB MST, AHB SLV
• AHB Master DMA Engine
• AHB Slave, e.g. DMA Target
• 2-port, 8Kx128-bit VLIW Program Memory (DPM)
• VLIW Decompressor
• Flow Controller, VLIW Decoder
• Program Counter, Condition Generation, Status Register, Instruction Decoder
• 8R+8W 128x40 Data Register File System
• 10-float ops/cycle
• 4-address/cycle Multiple DSP Address Generation Unit
• 16 multi-field Address Register File
• 6-access/cycle Data Memory System, 2x8Kx40 (DDM)
WP 1.6 - RISC + VLIW DSP + DNP Tile
Tile Complexity estimated through Synthesis & Place & Route trials
• mAgicV DSP:
  • 915 Kgates + 1 Mbit Prog Mem + 640 Kbit Data Mem
• ARM926 & peripherals:
  • <2 equivalent Mgate (including 640 Kbit mem)
• Tile Complexity:
  • 4230 equivalent Kgate + DNP gate count, including on-chip memories
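The memory figures above can be cross-checked against the memory geometries given on the mAgicV architecture slide (DPM: 8Kx128 bit, DDM: 2x8Kx40 bit). The short calculation below does that, and also derives a rough upper bound on how many such tiles would fit in the ~160 MGate budget quoted for a 45 nm chip. The bound is an assumption-laden estimate: it ignores the DNP gate count, the NoC, pads and any other overhead.

```python
KBIT = 1024  # bits per Kbit, with 1 Mbit taken as 1024 Kbit


def memory_bits(banks, words, width):
    """Total capacity in bits of `banks` memories of `words` x `width` bits."""
    return banks * words * width


# Geometries quoted on the mAgicV architecture slide.
dpm_bits = memory_bits(banks=1, words=8 * 1024, width=128)  # program memory
ddm_bits = memory_bits(banks=2, words=8 * 1024, width=40)   # data memory

print(dpm_bits // KBIT, "Kbit program memory")  # 1024 Kbit, i.e. 1 Mbit
print(ddm_bits // KBIT, "Kbit data memory")     # 640 Kbit

# Rough upper bound on tiles per chip (DNP, NoC and pads are ignored).
chip_kgates = 160_000  # ~160 MGate on a 100 mm2 chip
tile_kgates = 4230     # equivalent Kgate per tile, without the DNP
max_tiles = chip_kgates // tile_kgates
print(max_tiles, "tiles at most")
```

Both memory sizes agree with the 1 Mbit and 640 Kbit quoted in the complexity estimate.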
Silicon Floorplan Trial of RISC + mAgicV VLIW DSP Tile
[Figure: silicon floorplan. Labelled regions: DSP Reg File, DSP Data Mem (DDM), DSP Logic, DSP Prog Mem (DPM), AMBA Multilayer, Peripherals, ARM926, ARM RDM]
Spidergon NoC topology
• It’s a family of regular/symmetric topologies
• We look for a complexity/performance trade-off
• Low degree (router cost)
• Low number of links (wire cost)
• Symmetry (homogeneous building blocks; simple routing)
• Low diameter (performance)
• Good scalability (small network size granularity)
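The trade-off between low degree and low diameter can be checked numerically. The sketch below assumes the usual Spidergon construction, which the slide itself does not spell out: an even number of nodes on a ring, each with a clockwise link, a counter-clockwise link and one "across" link to the diametrically opposite node. The function names are hypothetical.

```python
from collections import deque


def spidergon(n):
    """Adjacency lists of an n-node Spidergon (ring plus across links)."""
    if n < 4 or n % 2:
        raise ValueError("n must be an even number >= 4")
    return {i: [(i + 1) % n, (i - 1) % n, (i + n // 2) % n] for i in range(n)}


def diameter(adj):
    """Largest shortest-path hop count between any two nodes (BFS from each)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


net = spidergon(16)
links = sum(len(v) for v in net.values()) // 2  # each link is counted twice
print("degree:", len(net[0]), "links:", links, "diameter:", diameter(net))
```

For 16 nodes every router has degree 3, the network has 24 links, and no packet needs more than 4 hops, compared with 8 hops on a plain 16-node ring.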
Background: APENext (2005), a 2048-processor system with VLIW processors designed by INFN and manufactured by ATMEL
SW challenges from Tiled Architectures
• Facilitate expression of parallelism, e.g. Network of Actors
• Express real-time constraints in a formal manner, a feature missing in classical languages. This is a key cultural point!
• Avoid destroying information about available algorithm parallelism
• The compilation chain must be fully aware of key architectural parameters: bandwidth, computational power, pipeline and latencies
• Exploit memory locality: efficient management of distributed memories, getting rid of classical caches
• Manage long delays between distant tiles
• Reduce hot spots in communications
• Reduce tiled RTOS overhead (time and memory footprint)
• Introduce Hardware dependent Software and Hardware Abstraction Layers
• Capture scalability in a library of characterized SW/HW components
• Support (semi-)automation of iterative design over HW, SW and application
• Monitor quality and real-time constraints on real HW and simulators
• Simulation speed of multi-tiled architectures
SW Architecture
[Figure: SW tool-flow diagram.
Inputs: application specs; hardware platform specification.
Tools: Distributed Operation Layer, Simulator, Model Compiler, Mapping, HdS Generator, Memory mapping, RTOS, Compiler, Link, Dispatch.
Artefacts: component interaction, properties and constraints; trace information; mapping information; component source code; HdS source code; component binary; glue binary; HdS binary; OS services binary.]
Optimised compilation on tiles and comms network
Distributed Operation Layer – Application Specification
Two parts:
• Application structure (@system level)
  • processes
  • FIFO SW channels between processes
  • interconnection between processes
  • .xml schema definition available
• Behavior of each process
  • process' internals (.c … .c)
[Figure: example process network with processes A, B and C]
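The split between application structure (processes and FIFO channels) and per-process behaviour can be mimicked in a few lines. This is a minimal sketch, not the DOL API or its XML format: the class names, the `fire()` convention, the FIFO capacity and the round-robin scheduler are all hypothetical, and the real process behaviour would be written in C.

```python
from collections import deque


class Fifo:
    """Bounded software FIFO channel between two processes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def empty(self):
        return not self.items


class Producer:  # plays the role of process A
    def __init__(self, out, count):
        self.out, self.next, self.count = out, 0, count

    def fire(self):
        if self.next >= self.count or self.out.full():
            return False  # nothing left to send, or blocked on a full channel
        self.out.items.append(self.next)
        self.next += 1
        return True


class Square:  # plays the role of process B
    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def fire(self):
        if self.inp.empty() or self.out.full():
            return False  # blocked on input or output
        value = self.inp.items.popleft()
        self.out.items.append(value * value)
        return True


class Consumer:  # plays the role of process C
    def __init__(self, inp):
        self.inp, self.received = inp, []

    def fire(self):
        if self.inp.empty():
            return False
        self.received.append(self.inp.items.popleft())
        return True


def run(processes):
    """Round-robin scheduler: stop when no process can make progress."""
    progress = True
    while progress:
        progress = False
        for p in processes:
            if p.fire():
                progress = True


def demo(count=5, capacity=2):
    # Application structure: A -> B -> C connected by two FIFO channels.
    ab, bc = Fifo(capacity), Fifo(capacity)
    a, b, c = Producer(ab, count), Square(ab, bc), Consumer(bc)
    run([a, b, c])
    return c.received


print(demo())
```

The processes only touch their own channels, so the same structure could be bound to one tile or spread over several without changing the behaviour code; the result does not depend on the FIFO capacity.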
Virtual SHAPES Platform (VSP)
• Enable early software development
• Explore different tile configurations
• Binary compatible with the SHAPES hardware
• Debugging capability
• Export performance information
• Scalability to multiple tiles
[Figure: layer diagram. Labels: SHAPES SW and app partners; Applications, DOL, HdS, RTOS; VSP; HW]
VSP-DOL interfacing
TARGET Compiler
Core related requirements:
• Functional unit assignment for clustered VLIWs
• Support of predicated execution
• Support of VLIW instruction compaction
• Phase coupling: reg. allocation ⇔ SW pipelining
Communication related requirements:
• Communication latency aware scheduling
• Intra-tile multicore on-chip debugging
• Inter-tile communication using DNP
[Figure: array of tiles (TILE) with off-chip memories (OFFCHIP MEM), a tile detail and the mAgicV core. Tile labels: mAgicV DSP, ARM uP, DSP PROG MEM, DSP DATA MEM, uP MEM, COMM I/F. Core labels: mAgicV PCU (INSTR. DECODER, DECOMPACTION, PROGRAM MEMORY, INSTR. SEQUENCER, INTERRUPT CONTROLLER), REG FILE (RF0, RF1), Core_bus5, Core_bus7, and functional units FP/I Mul1 to Mul4, FP/I Add1, FP/I Add2, FP/I Cadd1, FP/I Cadd2, Conv1, Conv2, Div1, Div2, Sh/Log1, Sh/Log2, Min/Max1, Min/Max2]
TIMA - HdS & RTOS - Principles
• Communication differentiation
  • Intra-subsystem & inter-subsystem communications
• Networked operating system
• Hardware dependent Software (HdS): software directly dependent on the underlying hardware
[Figure: software stacks of the ARM Subsystem and the DSP Subsystem. Labels: Application SW, HdS API, HdS, RTOS (RT Linux), Monitor, COMM, HAL ARM, HAL DSP, HW]
SW Architecture
Tools, responsible partners and work packages:
• Simulation environment (RWTH), WP 1.4
• Mapping (ETHZ), WP 1.11
• Model compiler (ETHZ, RWTH), WP 1.11, WP 1.4
• HdS generator (TIMA), WP 1.10
• RTOS (TIMA, THALES), WP 1.10
• Compiler (TARGET), WP 1.9
• Link, Dispatch (TARGET), WP 1.9
• Memory mapping
[Figure: SW tool-flow diagram. Artefacts: application specification; hardware platform specification; component interaction, properties and constraints; trace information; mapping information; component source code; HdS source code; component binary; glue binary; HdS binary; OS services binary]
SHAPES SW Architecture: challenges
• High-level exploration, mapping, and simulation:
  • What is the degree of available parallelism? How can it be exposed to the mapping stage? What is a suitable model-based specification formalism? What adaptations are necessary in order to expose the inherent parallelism?
  • Define a common Profiling Trace Interface (PTI) over which information can be exchanged.
• Hardware-dependent software and operating system:
  • To use the features provided by the HdS (i.e. platform abstraction), a generic interface API has to be defined.
• Compiler technology:
  • Modeling low-latency communication interfaces in the C source code that is the input to the C compiler for the computational tiles.
  • Investigate how the HdS can be modeled entirely in C source code, to be compiled by the C compiler for the computational tiles.
Tiled HW Architecture
• Communication Centric, not Processor Centric
• Homogeneous SW interface for on-chip and off-chip scalable connection and I/O
• 3D first-neighbour Toroidal System Eng. (3DT) for off-chip communication
• Virtual tunnelling on packet-switching NoC (Network on Chip) and off-chip 3DT
• Parallelism Aware System SW: manage memory distribution, capture real-time constraints
• Explicit parallel programming / Network of Actors
[Figure: array of Tiles, each with OFF-CHIP MEM, linked by 3DT off-chip communication; an FPGA; a NoC with nodes numbered 0 to 15; tile detail with RISC, DSP, DNP, Multi-Layer BUS, DXM and POT; ADC/DAC connections to sensors and actuators]
The tile: DIOPSIS® + DNP
[Figure: tile block diagram. Labelled blocks:]
• RISC: ICE, JTAG, MMU, RCM Instr Cache, RCM Data Cache, BIU (I, D), RDM IF, RDM SRAM
• Multi-layer Bus MATRIX, ROM, DXM Interface (AHB EBI) to DXM
• mAgicV™ DSP: 2-port DPM, 16-port 256x40 Data Regs, 10-float ops/cycle, 4-addr/cycle Multiple DSP Addr Gen, 6-access/cycle DDM, DSP AHB Master, DSP AHB Slave, mAgicV DSP™ JTAG
• DNP: DNP AHB Master and DNP AHB Slave ports, 3DT ports X+, X-, Y+, Y-, Z+, Z-, C+, NoC (NI)
• PDMA, Bridge, APB, PERIPHERALS