On-Chip Communication
(Architecture and Design)
Sungjoo Yoo
ISRC, SNU
Contents
Part 1
Introduction to on-chip communication
On-chip communication architecture
Software architecture
Hardware architecture
On-chip communication networks
Part 2
Analysis and optimization of on-chip communication
network
On-chip communication design on unreliable
interconnect
Open issues and summary
Part 1
Introduction
On-chip communication design
High-level functional
specification
SoC Implementation of
on-chip communication
architecture
[Figure: modules M1, M2 and M3 mapped onto a μP and IPs, connected through SW and HW wrappers]
Physical Communication Network
Designer’s Objectives and
Problems
High-performance
What is the maximum bandwidth of wire?
What is the best suited OCA?
Low power consumption
What is the minimum energy required to send the
given amount of data?
How to achieve the minimum energy?
Small HW/SW overhead
Interconnection and transceiver
Conflicting objectives
Trade-offs
Incremental Refinement of
On-Chip Communication
Specification of On-Chip
Communication
Abstraction levels of on-chip
communication
Client/server level
Message level
Transaction level
Implementation level
Client/Server Level
Concept
Service request/provide relation
A client component demands a service from server(s).
Service provider component may not be fixed and can be
determined dynamically
Object request broker (ORB) is needed.
Real example
Modem service
PDA device: baseband modem vocoder
Modem service can be Bluetooth, IEEE802.11, CDMA2000,
GPS, etc. depending on the location of PDA device.
Indoor: Bluetooth or IEEE802.11
Outdoor: IEEE802.11 (short range) or CDMA2000
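The client/server relation above can be sketched in a few lines. This is only an illustration of dynamic service resolution, not a real ORB: the names ServiceBroker, register and resolve are hypothetical, and the availability rules are assumptions derived from the indoor/outdoor example.

```python
# Hypothetical sketch of client/server-level service resolution.
# ServiceBroker, register and resolve are illustrative names, not a real ORB API.

class ServiceBroker:
    def __init__(self):
        # service name -> list of (availability predicate, provider name)
        self._providers = {}

    def register(self, service, provider, available):
        """Register a provider together with a predicate saying when it is usable."""
        self._providers.setdefault(service, []).append((available, provider))

    def resolve(self, service, context):
        """Return the first registered provider that is usable in this context."""
        for available, provider in self._providers.get(service, []):
            if available(context):
                return provider
        raise LookupError(f"no provider for {service!r} in {context!r}")


broker = ServiceBroker()
# Assumed availability rules, loosely following the PDA example.
broker.register("modem", "Bluetooth", lambda ctx: ctx["location"] == "indoor")
broker.register("modem", "IEEE802.11", lambda ctx: ctx["short_range"])
broker.register("modem", "CDMA2000", lambda ctx: ctx["location"] == "outdoor")

# The client asks for "modem"; the provider is chosen at run time.
indoor = broker.resolve("modem", {"location": "indoor", "short_range": True})
outdoor = broker.resolve("modem", {"location": "outdoor", "short_range": False})
```

The client never names a provider; it only names the service, which is the point of this abstraction level.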
Message Level
Concept
Components communicate with each other
via messages.
Message sender/receiver are fixed.
A message can have any type of data.
Real example
PDA: In the CDMA2000 mode, the vocoder
sends messages to the CDMA2000 modem.
A message has a frame of voice data and
control info.
Transaction Level
Concept
Components are mapped on real processors.
Communication is mapped on abstract communication
networks.
Communication protocols are fixed.
Transaction can be read, write, burst_read,
burst_write, etc.
For each candidate of real communication networks, the
transaction performance can be analyzed.
Real example
PDA: vocoder on a DSP, modem on an IP, candidate
communication networks (AMBA, Sonics, IBM, ...)
Determine bus priorities, packet priorities, TDMA slot
assignment, etc.
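A minimal sketch of transaction-level analysis, assuming a very simple cost model: each transaction costs a fixed setup time plus a per-word time. The class AbstractBus, the workload, and the cycle numbers for the two candidates are all hypothetical; real candidates (AMBA, Sonics, ...) would need their own timing models.

```python
# Hypothetical transaction-level channel: counts cycles instead of modelling pins.

class AbstractBus:
    def __init__(self, setup_cycles, cycles_per_word):
        self.setup_cycles = setup_cycles        # assumed arbitration/address cost
        self.cycles_per_word = cycles_per_word  # assumed data-phase cost per word
        self.memory = {}
        self.elapsed = 0                        # accumulated cycles

    def burst_write(self, addr, values):
        self.elapsed += self.setup_cycles + self.cycles_per_word * len(values)
        for i, v in enumerate(values):
            self.memory[addr + i] = v

    def burst_read(self, addr, length):
        self.elapsed += self.setup_cycles + self.cycles_per_word * length
        return [self.memory.get(addr + i, 0) for i in range(length)]

    def write(self, addr, value):
        # A single write is modelled as a burst of length one.
        self.burst_write(addr, [value])

    def read(self, addr):
        return self.burst_read(addr, 1)[0]


def run_workload(bus):
    """The same sequence of transactions is replayed on every candidate."""
    bus.burst_write(0x100, list(range(8)))
    frame = bus.burst_read(0x100, 8)
    bus.write(0x200, sum(frame))
    return bus.read(0x200), bus.elapsed


# Two made-up candidates: cheap per-word cost versus cheap setup cost.
candidates = {"bus_a": AbstractBus(4, 1), "bus_b": AbstractBus(1, 2)}
results = {name: run_workload(bus) for name, bus in candidates.items()}
```

Both candidates return the same data; only the cycle count differs, which is what the designer compares at this level.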
Implementation Level
On-chip communication architecture is implemented.
Software and hardware architecture
[Figure: SW architecture (application SW, middleware, OS, device drivers) running on a μP/DSP with local memory w/ I/D caches and DMA; HW architecture (processor local bus, adapters, HW IP, memory)]
Communication network
(OCBs w/ bridges, Sonics, packet/circuit switch, etc.)
On-Chip Communication
Architecture
Software
Middleware, OS, device driver and ISR,
memory instructions
Hardware
DMA, (bus) adapter, communication
network (OCBs and bridges, packet
network, etc.), memory
Software On-Chip
Communication Architecture
Middleware: CORBA, COM+, JAVA, BREW
Service resolution
ORB implementation
Dynamic reconfiguration of services needs to be
supported.
802.11 baseband modem in HW -->
Bluetooth in SW
Operating system
Communication services
pipe, shared memory, semaphore, mutex,
etc.
Supported as OS system calls
Software On-Chip
Communication Architecture
Device driver and ISR
The device driver depends on OS and the processor
OS
• Preemptive or not, interrupt or not, synchronization
services (semaphore, lock var, …)
Processor
• Bus width, register set, exception behavior, etc.
Memory instructions
Load/store, load multiple/store multiple instructions
Cache/virtual memory instructions in ARM v6 architecture
Hardware On-Chip
Communication Architecture
DMA (Direct Memory Access)
Block size
Adapter
Basic functionality: protocol conversion
E.g. VCI -- AMBA
Local communication architecture
Distributed bus arbitration/network routing: e.g. Sonics,
packet switch network
[Figure: μPs and IPs, each with OS and adapter, attached to AMBA and CoreConnect buses that are linked through channel adapters (Ch. adp)]
Hardware On-Chip
Communication Architecture
Communication network
On-chip bus
AMBA, CoreConnect, PI, etc.
Sonics μNetwork
On-chip communication network
Circuit switch
• Philips
Packet switch
• W. Dally (DAC01), Guerrier (DATE00)
Hardware On-Chip
Communication Architecture
On-chip memory
Shared memory
E.g. external SDRAM in multimedia chips
Distributed memory w/ caches: e.g. Daytona architecture
Four 64-bit processing elements (PE’s)
Each PE
- 32-bit RISC with DSP enhancements
- 64-bit vector co-processor (four
MAC’s)
Split-transaction bus
- Shared memory based on L1 cache
snooping
- Caches reduce bus traffic.
Embedded RTOS dynamically schedules
tasks.
120 mm², 0.35 μm, 100 MHz
Hardware On-Chip
Communication Architecture
On-chip memory (cont’d)
On-chip implementation of linked list
Philips, DATE01
Data transfer and storage exploration
(DTSE)
IMEC
• Focus on low power consumption and area of
memory
On-Chip Communication
Networks
Routing
Sonics μNetwork SiliconBackplane
Philips, Circuit Switch Network
Packet Switch Networks, Guerrier, DATE00
Network topologies
Mesh, W. Dally, DAC2001
Octagon, ST Microelectronics, DAC2001
Sonics μNetwork
SiliconBackplane
On-chip bus
Time-division multiple access (TDMA)
Pre-characterized on-chip
bus agent
Two-step Arbitration
Originally assigned module --> TDMA
If no bus access by that module --> priority-based
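The two-step arbitration can be sketched as follows. This is a minimal model under stated assumptions: the function name arbitrate, the slot schedule, the priority order and the request pattern are all made up for illustration and are not taken from the Sonics documentation.

```python
# Hypothetical sketch of two-step arbitration for one TDMA slot.

def arbitrate(slot_owner, requests, priority):
    """Pick the bus master for one slot.

    slot_owner: module originally assigned to this TDMA slot
    requests:   set of modules requesting the bus in this slot
    priority:   list of modules, highest priority first
    """
    # Step 1: the pre-assigned owner gets the slot if it wants the bus.
    if slot_owner in requests:
        return slot_owner
    # Step 2: otherwise the unused slot goes to the highest-priority requester.
    for module in priority:
        if module in requests:
            return module
    return None  # the slot stays idle


# Assumed slot table, priority order and per-slot request pattern.
schedule = ["dsp", "cpu", "io", "dsp"]
priority = ["cpu", "dsp", "io"]
pending = [{"dsp", "io"}, {"io"}, {"cpu", "dsp"}, set()]

grants = [arbitrate(owner, reqs, priority) for owner, reqs in zip(schedule, pending)]
```

The TDMA step bounds the latency of each slot owner, while the fallback step keeps otherwise idle slots from being wasted.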
Pipelined TDMA Bus
Arbitration
Pipeline depth
Based on memory target latency at the desired clock
frequency
Design Example: Carrier-Class VoIP Processing Card
DSP + CPU banks + IO + DRAM
DSP: ~16 processors
voice and modem protocols
LEC
CPU: ~4 processors
Packet protocols
Control (call setup)
Hi BW SDRAM
Communication Bandwidth
Requirements: Basic I/O
IO traffic is low BW
Data IO rates
= 1000 ch x 64kb/s x 3 full duplex
= 48MB/s (worst case)
Data are buffered to SDRAM
Communication Bandwidth
Requirements: Cache Updates
CPU cache swap
-assuming 1.6MIPS/channel
-Total BW requirements:
48 + 600 + 320 = 968 (MB/s)
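The bandwidth figures on the two slides above can be checked with a short calculation. Assumptions: "full duplex" is read as a factor of 2, 1 MB is taken as 10^6 bytes, and the 600 and 320 MB/s terms are taken as given, since their derivation is not shown here.

```python
# Worked check of the bandwidth budget (variable names are illustrative).

channels = 1000
rate_bits_per_s = 64_000   # 64 kb/s per channel
factor = 3                 # the "x 3" from the slide, taken as given
directions = 2             # assumption: full duplex counts both directions

io_mb_per_s = channels * rate_bits_per_s * factor * directions / 8 / 1e6

# Remaining terms of the total, taken directly from the slide.
other_terms_mb_per_s = [600, 320]
total_mb_per_s = io_mb_per_s + sum(other_terms_mb_per_s)
```

Under these assumptions the basic I/O traffic comes out at 48 MB/s, a small share of the 968 MB/s total.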
μNetwork Implementation
Derivative Design Example
-Full G.168 LEC uses a specialized core
-LEC has local 4MB memory
-# of channels: 1000 --> 2000
-Increased traffic
-Bus width: 64 --> 128 (bits)
Circuit Switch Network:
Philips PROPHID Architecture
Focus on high-throughput signal
processing for multimedia applications
Requirements
High computation capacity and high communication bandwidth
Performance and programmability
PROPHID
Heterogeneous multi-processor architecture consisting
of general and application specific processors
General purpose processor
Control and low-medium signal processing
Application specific processors
High performance signal processing
Philips Multi-window TV
application
PNX8500
PROPHID architecture
PROPHID: An Architecture Template
For high throughput: ~10 Gbits/s
and reconfigurable connection
(switch matrix, 20 proc’s, 64MHz)
Programmability
and control app’s
~10 GOPS
Control-oriented bus
Autonomous tasks based on
data-driven execution
PROPHID: Autonomous
Execution of ADS Processors
- Autonomous task execution on Application Domain
Specific (ADS) processors
- Stream-based execution
- Data-availability determines the execution of tasks.
- Master(CPU)-slave synchronization can be avoided.
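Data-driven execution can be sketched with FIFOs and a firing rule: a task runs whenever every one of its input FIFOs holds a token. This is a simplified software model, not the PROPHID hardware; the Task class, the run loop and the two example tasks are hypothetical.

```python
from collections import deque

# Hypothetical sketch: tasks fire when all of their input FIFOs hold data.

class Task:
    def __init__(self, name, func, inputs, output):
        self.name = name
        self.func = func
        self.inputs = inputs    # list of input FIFOs (deques)
        self.output = output    # output FIFO (deque)

    def ready(self):
        # Data availability alone decides whether the task may execute.
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        args = [fifo.popleft() for fifo in self.inputs]
        self.output.append(self.func(*args))


def run(tasks):
    """Fire ready tasks until none can proceed; no master tells them when to start."""
    fired = []
    progress = True
    while progress:
        progress = False
        for task in tasks:
            if task.ready():
                task.fire()
                fired.append(task.name)
                progress = True
    return fired


src_a = deque([1, 2, 3])
src_b = deque([10, 20])
scaled = deque()
mixed = deque()

tasks = [
    Task("mix", lambda x, y: x + y, [scaled, src_b], mixed),
    Task("scale", lambda x: 2 * x, [src_a], scaled),
]
order = run(tasks)
```

The firing order follows from the data alone, even though "mix" is listed before its producer "scale".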
Kahn Process Network Model
of Multi-window Application
Communication
Infrastructure
Processor Model and
Surrounding Shell
Circuit Switch Network
Guaranteeing the throughput of streams
with hard-real-time constraints in the
PROPHID architecture.
Requirements of task execution on ADS
processors
Time-interleaved task execution
Each task requires input/output FIFO’s.
Circuit Switch Network
Network Topology
Time-Space-Time Routing
High-Performance
Communication Network in
PROPHID Architecture
[Figure: time, space and time switching stages]
Chip Photo and Metrics
A Generic Architecture for On-Chip
Packet-Switched Interconnections,
DATE 2000.
A scalable system-level interconnection template is presented.
A Generic Architecture for
On-Chip Packet-Switched
Interconnections
Bus-based architecture will not meet the bandwidth
requirements, since
it is inherently non-scalable in terms of bandwidth
Bandwidth is shared by connected comp’s.
Multiple on-chip bus approaches like VSIA
case-specific grouping of IP’s
Not a truly scalable and reusable interconnection.
In this paper, a generic interconnection template is
presented.
A Generic Architecture for
On-Chip Packet-Switched
Interconnections
Switching networks
Circuit switching
like PROPHID communication network
High performance
Drawbacks
• lack of reactivity against rapidly changing comm.
– E.g. data bursts in MPEG (worst case should be
assumed.), random traffic between CPU master and
slaves.
Packet switching
Packets are forwarded by routers, as on the Internet.
Because routing decisions are distributed over the routers,
the network can remain very reactive.
Packet Routing
Wormhole routing
Network Topology: Fat-tree
Network
-Ex. 16 terminals: 8 --> 8 communication
-The terminals can be processors, DSPs, memory, etc.
- Routers are free to use any of the available paths
- Packet: a sequence of 32 bit words
- Packet payload may be of any size
Scalability of Fat-Tree
Network
Scaling and Protocol Stack
Real Implementation
Network Costs and Latency
- One drawback of packet-switched
network
--> inherently arbitrary delay
Pros and Cons:
Bus versus Network
Structured On-Chip
Communication Network,
DAC2001
Why structured network?
Global routing on SoC is hard to
characterize and design.
It would be better to have electrically well
characterized wiring.
-Top 2 metal layers are used
-2D folded torus topology
-Each tile can have processor, DSP,
memory, I/O, etc.
-256bit data line
-Virtual channel support
Router Architecture
Real Implementation
0.1 μm CMOS
Router overhead
Eight virtual channels at each edge of tile
4 flits x 300b/flit = 1200 b
Each tile has ~5kB (= 4 edges x 8 VCs x 1200 b) buffer
storage
Metal routing: 50 μm x 3 mm
Total router overhead: 6.6% (0.59 mm²)
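One reading of the buffer figures above, as a worked calculation. It assumes that every one of the eight virtual channels on each of the four tile edges has its own 4-flit buffer; the tile area is not stated on the slide and is only inferred here from the overhead percentage.

```python
# Worked check of the router buffer budget (variable names are illustrative).

flits_per_vc = 4
bits_per_flit = 300
vcs_per_edge = 8
edges_per_tile = 4          # assumption: one router port per tile edge

bits_per_vc = flits_per_vc * bits_per_flit
bits_per_tile = bits_per_vc * vcs_per_edge * edges_per_tile
bytes_per_tile = bits_per_tile / 8

# Tile area implied by "6.6% (0.59 mm2)"; an inference, not a quoted figure.
router_area_mm2 = 0.59
overhead_fraction = 0.066
implied_tile_area_mm2 = router_area_mm2 / overhead_fraction
```

Under this reading the per-tile storage is 4800 bytes, which matches the "~5kB" on the slide.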
Network Processor Design:
ST Microelectronics Octagon
OC-768
40Gbps
114 x 10^6 packets/s, 44 B/packet
Processing requirement
1/(114 x 10^6) = 9 ns/packet
1 packet needs 500 instructions execution
57GIPS
• No single processor!
• Multiprocessors w/ high communication BW
Communication network for multiprocessor
SoC of OC-768
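The OC-768 numbers above follow from the line rate and packet size. A short check, using only figures from the slide (40 Gbps, 44 B/packet, 500 instructions per packet); the variable names are illustrative.

```python
# Worked check of the OC-768 processing requirement.

line_rate_bps = 40e9            # 40 Gbps
packet_bytes = 44               # packet size from the slide
instructions_per_packet = 500

packets_per_s = line_rate_bps / (packet_bytes * 8)
time_per_packet_ns = 1e9 / packets_per_s
required_gips = packets_per_s * instructions_per_packet / 1e9
```

Rounded, this gives about 114 x 10^6 packets/s, 9 ns per packet and 57 GIPS, which is why a single processor is ruled out.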
Octagon
ST Microelectronics
Octagon
Octagon
Cross Bar
Node Model
Scaling and Comparison
with Cross Bar
Summary
Introduction to on-chip communication
On-chip communication architecture
Software architecture
Hardware architecture
On-chip communication networks
Routing
Topology
Part 2 will treat
Analysis and optimization of on-chip communication network
On-chip communication design on unreliable interconnect
Open issues and summary
Part 2
Analysis of On-Chip
Communication
Analysis
Quality of service, runtime, power
consumption, etc.
Modeling of architecture components
OS modeling
Communication network modeling
On-chip bus
Packet switch network
Analysis of On-Chip
Communication
Given communication network and
mapping
Trace-based
S. Dey
Worst-case
R. Ernst : SW + HW
Statistical analysis
Queueing theory in packet switch network
Other modeling methods
Performance Analysis of
On-Chip Communication
Analysis with synthetic statistical testbenches
Hierarchical bus, TDMA, Ring
ICVD'00
Trace-based analysis
Hierarchical bus
ICCAD99, SiPS
Queueing theory
Circuit, packet switch
DAC01
Optimization of On-Chip
Communication
HW architecture
Communication resource management
Mapping, (reconfigurable) interconnection
(topology), scheduling and routing
Performance and power
Modulation/demodulation
Power
Average performance
SW architecture
Optimization of On-Chip
Communication Network
On-chip bus design
Gajski
Daveau
Glesner
Mapping and interconnection topology
S. Dey, ICCAD00
Potkonjak, ICCAD00
Pedram, DATE00
Others for low power, DAC2001
Optimization of On-Chip
Communication Network
Scheduling and routing
S. Dey: ICCAD, DAC (CAT, reconfigurable)
W/ mapping and interconnection topology
Circuit switch
Comm. arch. Book
Packet switch
Comm. arch. Book
Octagon
For better optimization, work on a virtual channel or
message basis, not on a physical module basis!
Optimization in SW On-Chip
Comm. Architecture Design
Middleware, OS, device driver
Minimum service implementation
Component-based middleware/OS design
Pebble, GO!, …
TIMA
JAVA-based implementation
JavaOS and JVM
Application-specific implementation
BREW
On-chip communication on
unreliable interconnect
Encoding/decoding
Low-power bus encoding, DAC, Benini
Communication on unreliable
communication media
CDMA style
To maintain average/statistical performance
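As an illustration of the low-power bus encoding item above, here is a sketch of bus-invert coding, a classic scheme that reduces switching activity on a bus. Whether the cited DAC work uses this particular code is not stated on the slide; the function names and the 8-bit example are assumptions.

```python
# Sketch of bus-invert coding (a classic low-power bus code, used here
# only as an example; not necessarily the scheme of the cited work).

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    mask = (1 << width) - 1
    prev = 0                      # assumed initial state of the data lines
    encoded = []
    for word in words:
        # Invert the word if sending it as-is would toggle more than half the lines.
        if hamming(prev, word) > width // 2:
            sent, invert = (~word) & mask, 1
        else:
            sent, invert = word, 0
        encoded.append((sent, invert))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    mask = (1 << width) - 1
    return [(~sent) & mask if invert else sent for sent, invert in encoded]


def count_transitions(values):
    """Total bit toggles when the values are driven one after another from 0."""
    total, prev = 0, 0
    for value in values:
        total += hamming(prev, value)
        prev = value
    return total


words = [0x00, 0xFF, 0x00, 0xF0]
encoded = bus_invert_encode(words)
raw_toggles = count_transitions(words)
# The extra invert line is counted as one more bus wire (bit 8).
coded_toggles = count_transitions([(inv << 8) | sent for sent, inv in encoded])
```

The code costs one extra wire and an XOR stage at each end, and in exchange it caps the number of toggling data lines at half the bus width per transfer.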
Find the paper
R. A. Bergamaschi and W. R. Lee,
“Designing Systems-on-Chip Using Cores”,
DAC 2000.
The problem of assembling SoC’s using IP blocks:
error-prone, labor-intensive, time-consuming
since the designer should understand
the functionality
interfaces
electrical characteristics of cores such as
processors, mem. controllers, bus arbiters, etc.
Moreover, cores are parameterized and need to be
configured according to their use in the SoC.
Designing Systems-on-Chip
Using Cores
With the VSIA’s Virtual Component Interfaces, the
designer still has to do
wrapper design
architecture design
assembling the SoC using VCI’s and wrappers
A digression: two key points in our design flow
application specific wrapper (comm. co-processor)
design
application specific architecture design flow
Designing Systems-on-Chip
Using Cores
Designers’ tasks to configure the bus architecture
define the cores to be used
32, 64, 128 bit bus, proc. characteristics, HW/SW
understand the functionality of all pins on all cores
and determine their connections
define request priorities, e.g. interrupt priorities
define the usage of DMA
define address maps
define clock domains
insert glue logic
insert/configure test logic
There has been no tool to automate those tasks.
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
1. Virtual design
Virtual component (VC) is a representation of a class of real
components.
E.g. PowerPC VC represents all real PowerPC cores (e.g.
401, 405, etc.).
Virtual interface is used instead of real interface.
• Smaller number of interface pins
2. Glueless interface
Automatic generation of glue logic
• First, include necessary glue logic into the core.
• Remaining minor glue logic is automatically generated.
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
3. Core and pin properties
encode the structural and functional characteristics of a
component and its pins.
Properties attached to all components and pins
Automatic pin connection algorithm is used.
Properties
• BUS_TYPE: ASB, APB, etc.
• INTERFACE_TYPE: MASTER, SLAVE
• FUNCTION_TYPE: READ, WRITE, INTERRUPT
• OPERATION_TYPE: REQUEST, ACKNOWLEDGE
• DATA_TYPE, RESOURCE_TYPE
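Property-driven pin connection can be sketched as a matching problem. The matching rule below (same bus, function and operation type, one MASTER and one SLAVE, different cores) is an assumption made for illustration; the slide does not give the paper's actual algorithm, and the Pin class and the example pins are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of automatic pin connection driven by pin properties.

@dataclass(frozen=True)
class Pin:
    core: str
    name: str
    bus_type: str         # e.g. ASB, APB
    interface_type: str   # MASTER or SLAVE
    function_type: str    # e.g. READ, WRITE, INTERRUPT
    operation_type: str   # e.g. REQUEST, ACKNOWLEDGE


def compatible(a, b):
    """Assumed rule: properties agree and the pins sit on opposite sides of the bus."""
    return (
        a.core != b.core
        and a.bus_type == b.bus_type
        and a.function_type == b.function_type
        and a.operation_type == b.operation_type
        and {a.interface_type, b.interface_type} == {"MASTER", "SLAVE"}
    )


def connect(pins):
    """Return (master pin, slave pin) name pairs for every compatible combination."""
    masters = [p for p in pins if p.interface_type == "MASTER"]
    slaves = [p for p in pins if p.interface_type == "SLAVE"]
    return [
        (f"{m.core}.{m.name}", f"{s.core}.{s.name}")
        for m in masters
        for s in slaves
        if compatible(m, s)
    ]


pins = [
    Pin("cpu", "rd_req", "ASB", "MASTER", "READ", "REQUEST"),
    Pin("mem", "rd_req_in", "ASB", "SLAVE", "READ", "REQUEST"),
    Pin("mem", "rd_ack", "ASB", "SLAVE", "READ", "ACKNOWLEDGE"),
    Pin("cpu", "rd_ack_in", "ASB", "MASTER", "READ", "ACKNOWLEDGE"),
    Pin("uart", "wr_req_in", "APB", "SLAVE", "WRITE", "REQUEST"),
]
nets = connect(pins)
```

The APB pin stays unconnected because no master pin carries matching properties, which is the kind of case a designer would otherwise have to find by reading pin lists.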
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
4. Interconnection engine
5. Virtual to real synthesis
Designing Systems-on-Chip
Using Cores
Automating SoC integration: 6 steps
6. Configuration engine
clocking, address map, interrupt map, DMA channel
assignment, etc.
Comments
To free the designer from pin interconnection and
glue logic design
Limitation
Automation applies to HW Module interface only at pin level
(with a fixed target architecture)
SW module interfacing (i.e. targeting and processor
interfacing) is not considered.
Open Issues
Architectural trade-off
HW/SW trade-off
in middleware and OS service implementation
Communication network design
Prioritized packet network design
Interconnection topology design with
physical DSM effects
Open Issues
Reconfigurable on-chip communication
In connection with component-based SoC
design
On-chip communication design w/
unreliable media
Unreliable physical wiring and environment
Summary