Platform Design
Multi-Processor Systems-on-Chip
MPSoC
TU/e 5kk70
Henk Corporaal
Bart Mesman
Overview
• What is a platform, and why platform-based design?
• Why parallel platforms?
• A first classification of parallel systems
• Design choices for parallel systems
• Shared memory systems
  – Memory coherency, consistency, synchronization, mutual exclusion
• Message passing systems
• Further decisions
Design & Product requirements?
• Short Time-to-Market
– Reuse / Standards
– Short design time
• Flexible solution
– Reduces design time
– Extends product lifetime; enables remote inspection and debug, …
• Scalability
• High performance and Low power
– Memory bottleneck, Wiring bottleneck
• Low cost
• High quality, reliability, dependability
• RTOS and libs
• Good programming environment
Solution ?
• Platforms
– Programmable
• One or more processor cores
– Reconfigurable
– Scalable and flexible
– Memory hierarchy
• Exploit locality
– Separate local and global wiring
– HW and SW IP reuse
• Standardization (on SW and HW-interfaces)
• Raising design abstraction level
– Reliable
– Cheaper
– Advanced Design Flow for Platforms
What is a platform?
Definition: a platform is a generic, but domain-specific, information processing (sub-)system.

Generic means that it is flexible, containing programmable component(s). Platforms are meant to quickly realize your next system (in a certain domain).
Single chip?
Example Platform: Sanyo Camera
Platform example: TI OMAP
[Block diagram annotations: up to 192 MByte off-chip memory; 192 KByte shared SRAM; 8 KByte data cache (2-way, 512 lines of 16 bytes); write buffer (17 elements); two 16 KByte caches (2-way); 8 KByte memory (2x 4K); 64 KByte dual-port RAM (8x 4K x 16b); 96 KByte single-port RAM (12x 4K x 16b); 32 KByte ROM]
Platform and platform design
[Diagram: a stack of Applications on top of the Platform on top of Enabling technologies; SDT (system design technology) maps applications onto the platform, PDT (platform design technology) builds the platform from the enabling technologies]
Why parallel processing
• Performance drive
• Diminishing returns for exploiting ILP and OLP
• Multiple processors fit easily on a chip
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd
However:
• Parallel programming is hard
Low power through parallelism
• Sequential processor:
  – switching capacitance C
  – frequency f
  – voltage V
  – P = f·C·V²
• Parallel processor (two times the number of units):
  – switching capacitance 2C
  – frequency f/2
  – voltage V' < V
  – P = (f/2)·2C·V'² = f·C·V'²
Power efficiency: compare 2 examples
• Intel Pentium-4 (Northwood) in 0.13 micron technology:
  – 3.0 GHz
  – 20 pipeline stages
  – aggressive buffering to boost clock frequency
  – 13 nanojoule / instruction
• Philips Trimedia “Lite” in 0.13 micron technology:
  – 250 MHz
  – 8 pipeline stages
  – relaxed buffering, focus on instruction parallelism
  – 0.2 nanojoule / instruction
• Trimedia is doing 65x better than Pentium
Parallel Architecture
• Parallel architecture extends traditional computer architecture with a communication network:
  – abstractions (HW/SW interface)
  – organizational structure to realize the abstraction efficiently
[Figure: a communication network connecting multiple processing nodes]
Platform characteristics
• System level
• Processor level
• Communication network
• Memory system
• Tooling
System level characteristics
• Homogeneous ↔ heterogeneous
• Granularity of processing elements
• Type of supported parallelism: TLP, DLP
• Runtime mapping support?
Homogeneous or Heterogeneous
• Homogeneous:
  – replication effect
  – memory dominated anyway
  – solve realization issues once and for all
  – less flexible
• Typically:
  – data-level parallelism
  – shared memory
  – dynamic task mapping
Example: Philips Wasabi
• Homogeneous multiprocessor for media applications
• Two-level communication hierarchy:
  – top: scalable message-passing network connecting tiles
  – tile: shared memory plus processors and accelerators

[Tile diagram: several TriMedia (TM) cores plus an ARM core around a shared tile memory, with accelerators for pixel SIMD, video scaling, and picture improvement]
• Fully cache coherent to support data parallelism
Homogeneous or Heterogeneous
• Heterogeneous:
  – better fit to application domain
  – smaller increments
• Typically:
  – task-level parallelism
  – message passing
  – static task mapping
Example: Viper2
• Heterogeneous
• Platform based
• >60 different cores
• Task parallelism
• Sync with interrupts
• Streaming communication
• Semi-static application graph
• 50 M transistors
• 120 nm technology
• Powerful, efficient

[Block diagram labels: MBS, VMPG, TM3260 (x2), TDCS, VIP, MIPS PR4450, MSP, QVCP5L, MDCS, QVCP2L]
Homogeneous or Heterogeneous
• Middle of the road approach
• Flexible tiles
• Fixed tile structure at top level
Types of parallelism
• Program/thread level → TLP: multi-threaded / MIMD (heterogeneous)
• Module level → DLP: SIMD / vector (homogeneous)
• Kernel level → ILP: VLIW / superscalar / dataflow architectures (heterogeneous)
Processor level characteristics
Processor consists of:
– instruction engine (control processor, I-fetch unit)
– processing element (PE): register file, function unit(s), L1 DMem

• Single PE ↔ multiple PEs (as in SIMD)
• Single FU/PE ↔ multiple FUs/PE (as in VLIW)
• Granularity of PEs, FUs
• Specialized ↔ generic
• Interruptible, pre-emption support
• Multithreading support (fast context switches)
• Clustering of PEs; clustering of FUs
• Type of inter-PE and inter-FU communication network
• Others: MMU – virtual memory, …
Generic or Specialized?
Intrinsic computational efficiency
General processor organization
[Figure: classic five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with PC, instruction memory, register file, ALU, and data memory; the register file / ALU / data memory part is marked as the PE (processing engine) with its FU, the instruction fetch and control part as the instruction engine]
(Linear) SIMD Architecture

[Figure: a control processor with IMem issues one instruction stream to a linear array of PEs (PE1 … PEn), each with its own register file (RF), function unit (FU), and local DMem]

To be added:
– inter-PE communication
– communication from PEs to the control processor
– input and output
Communication network
• Bus (single all-to-all connection) ↔ crossbar ↔ NoC with point-to-point connections
• Topology, router degree
• Routing
  – path, path control, collision resolution, network support, deadlock handling, livelock handling
• Virtual layer support
• Flow control and buffering
• Error handling
• Inter-chip network support
• Guarantees
  – TDMA
  – GT ↔ BE traffic
• etc., etc.
Comm. Network: Performance metrics
• Network Bandwidth
– Need high bandwidth in communication
– How does it scale with number of nodes?
• Communication Latency
– Affects performance, since processor may have to wait
– Affects ease of programming, since it requires more thought
to overlap communication and computation
• Latency Hiding
– Global memory access can take hundreds of cycles
– How can a mechanism help hide latency?
– Examples:
• overlap message send with computation,
• prefetch data,
• switch to other tasks
How good is your network?
Topology determines:
• Degree = number of links from a node
• Diameter = maximum number of links crossed between any two nodes
• Average distance = average number of links to a random destination
• Bisection = minimum number of links that separate the network into two halves
• Bisection bandwidth = link bandwidth × bisection
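As a quick illustration of these metrics (a hypothetical helper, not from the slides), the following C program computes them for a k x k 2D mesh, matching the 2D-mesh row in the table below:

    /* metrics for a k x k 2D mesh (N = k*k nodes) */
    #include <stdio.h>

    int main(void) {
        int k = 8;
        int N = k * k;                 /* 64 nodes */
        int degree    = 4;             /* links per interior node */
        int diameter  = 2 * (k - 1);   /* corner-to-corner hops */
        int bisection = k;             /* links cut by a half/half split */
        printf("N=%d degree=%d diameter=%d bisection=%d\n",
               N, degree, diameter, bisection);
        return 0;
    }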
Metrics for common topologies
Type        Degree     Diameter        Ave Dist      Bisection
1D mesh     2          N-1             ~N/3          1
2D mesh     4          2(N^1/2 - 1)    2N^1/2 / 3    N^1/2
3D mesh     6          3(N^1/3 - 1)    3N^1/3 / 3    N^2/3
nD mesh     2n         n(N^1/n - 1)    nN^1/n / 3    N^(n-1)/n
Ring        2          N/2             N/4           2
2D torus    4          N^1/2           N^1/2 / 2     2N^1/2
Hypercube   Log2N      n = Log2N       n/2           N/2
2D tree     3          2Log2N          ~2Log2N       1
Crossbar    N-1        1               1             N^2/2

N = number of nodes, n = dimension
More topology metrics
[Figure: hypercube, grid/mesh, and torus topologies]

Assume 64 nodes:

Criteria                  Bus    Ring    2D mesh    2D torus    6-cube    Fully connected
Performance:
  Bisection bandwidth     1      2       8          16          32        1024
Cost:
  Ports per switch        -      3       5          5           7         64
  Total number of links   1      128     176        192         256       2080
Multi-stage network: Butterfly or Omega
• All paths equal length
• Unique path from any
input to any output
• Try to avoid conflicts !!
[Figure: an 8 x 8 butterfly switch]

How to make a bigger butterfly network?
[Figure: two N/2 butterflies plus an extra switch stage form an N butterfly]
Multistage Fat Tree
• A multistage fat tree (as in the CM-5) avoids congestion at the root node
• Randomly assign packets to different paths on the way up to spread the load
• Increase the degree near the root to decrease congestion
What did architects design in the 90s?

Old (off-chip) MP networks:

Name        Number     Topology    Bits    Clock      Link BW    Bis. BW    Year
nCube/ten   1-1024     10-cube     1       10 MHz     1.2        640        1987
iPSC/2      16-128     7-cube      1       16 MHz     2          345        1988
MP-1216     32-512     2D grid     1       25 MHz     3          1,300      1989
Delta       540        2D grid     16      40 MHz     40         640        1991
CM-5        32-2048    fat tree    4       40 MHz     20         10,240     1991
CS-2        32-1024    fat tree    8       70 MHz     50         50,000     1992
Paragon     4-1024     2D grid     16      100 MHz    200        6,400      1992
T3D         16-1024    3D torus    16      150 MHz    300        19,200     1993

(Link and bisection bandwidth in MBytes/s)

No standard topology!
However, for on-chip networks, mesh and torus are in favor!
Memory hierarchy
• Number of memory levels: 1, 2, 3, 4
• HW ↔ SW controlled level 1
  – cache or scratchpad memory as L1
• Central ↔ distributed memory
• Shared ↔ distributed memory address space
• Intelligent DMA support: communication assist
• For shared memory:
  – coherency
  – consistency
  – synchronization
Intermezzo:
What’s the problem with memory?

[Graph (Patterson): processor performance grows ~55%/year (“Moore’s Law”), DRAM performance only ~7%/year; the processor-memory performance gap grows about 50% per year over 1980-2000]

Memories can also be big power consumers!
Multiple levels of memory
Architecture concept:

[Figure: a hierarchy of communication networks (level 0 … level N); at each level CPUs, reconfigurable accelerators (HW blocks), memories, and I/O attach to that level’s communication network]
Communication models: Shared Memory
[Figure: processes P1 and P2 both read and write a shared memory]

• Coherence problem
• Memory consistency issue
• Synchronization problem
Communication models: Shared memory
• Shared address space
• Communication primitives:
– load, store, atomic swap
Two varieties:
• Physically shared => Symmetric MultiProcessors (SMP)
– usually combined with local caching
• Physically distributed => Distributed Shared
Memory (DSM)
SMP: Symmetric Multi-Processor
• Memory: centralized with uniform access time (UMA)
and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, share a single main memory and I/O system over a bus]
DSM: Distributed Shared Memory
• Nonuniform access time (NUMA) and scalable
interconnect (distributed memory)
[Figure: each processor has its own cache and local memory; the nodes are connected by a scalable interconnection network, with distributed I/O]
Shared Address Model Summary
• Each processor can name every physical location
in the machine
• Each process can name all data it shares with
other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Memory hierarchy model applies:
– communication moves data to local proc. cache
Communication models: Message Passing
• Communication primitives
– e.g., send, receive library calls
• Note that MP can be built on top of SM, and vice versa
[Figure: processes P1 and P2 exchange data through FIFOs using send and receive operations]
Message Passing Model
• Explicit message send and receive operations
• Send specifies local buffer + receiving process
on remote computer
• Receive specifies sending process on remote
computer + local buffer to place data
• Typically blocking communication, but may use DMA

Message structure: Header | Data | Trailer
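The send/receive style above maps directly onto message-passing libraries such as MPI. A minimal sketch (assuming an MPI installation; the ranks, tag, and payload are illustrative):

    /* rank 0 sends one int to rank 1, which blocks until it arrives */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest=1, tag=0 */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* src=0, tag=0 */
            printf("received %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }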
Message passing communication
[Figure: each node contains a processor, cache, memory, DMA engine, and network interface; the nodes communicate over the interconnection network]
Communication Models: Comparison
• Shared-Memory
– Compatibility with well-understood (language) mechanisms
– Ease of programming for complex or dynamic communications
patterns
– Shared-memory applications; sharing of large data structures
– Efficient for small items
– Supports hardware caching
• Message Passing
– Simpler hardware
– Explicit communication
– Improved synchronization
Challenges of parallel processing
Q1: Can we get linear speedup?
Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?

Q2: How important is communication latency?
Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
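A worked sketch of both answers (Amdahl’s law plus a simple CPI model):
Q1: Speedup = 1 / (s + (1-s)/100) = 80 requires s + (1-s)/100 = 1/80, i.e. 99·s = 0.25, so s ≈ 0.25%. Almost nothing may remain sequential.
Q2: Effective CPI = 0.5 + 0.002 × 100 = 0.7, so the machine runs 0.7 / 0.5 = 1.4× slower because of the remote accesses.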
Three fundamental issues for shared memory multiprocessors

• Coherence: do I see the most recent data?
• Consistency: when do I see a written value?
  – e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
• Synchronization: how do we synchronize processes?
  – how do we protect access to shared data?
Coherence problem in a single-CPU system

[Figure: three snapshots of a CPU with cache, memory, and I/O. Initially the cache holds a'=100, b'=200 and memory holds a=100, b=200. After the CPU writes a'=550, memory still holds the stale a=100. After I/O writes b=440 to memory, the cache still holds the stale b'=200]
Coherence problem in a multi-processor system

[Figure: CPU-1’s cache holds a'=550, b'=200; CPU-2’s cache holds a''=100, b''=200; memory holds a=100, b=200 — after CPU-1’s write, CPU-2 still sees the old value of a]
What Does Coherency Mean?
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
Two rules to ensure coherency
• “If P writes x and P1 reads it, P’s write will be seen by
P1 if the read and write are sufficiently far apart”
• Writes to a single location are serialized:
seen in one order
– Latest write will be seen
– Otherwise could see writes in illogical order
(could see older value after a newer value)
Potential HW Coherency Solutions
• Snooping solution (snoopy bus):
  – send all requests for data to all processors (or local caches)
  – processors snoop to see if they have a copy and respond accordingly
  – requires broadcast, since caching information is at the processors
  – works well with a bus (natural broadcast medium)
  – dominates for small-scale machines (most of the market)
• Directory-based schemes:
  – keep track of what is being shared in one centralized place
  – distributed memory => distributed directory for scalability (avoids bottlenecks)
  – send point-to-point requests to processors via the network
  – scale better than snooping
  – actually existed BEFORE snooping-based schemes
Example Snooping protocol
• 3 states for each cache line:
– invalid, shared, modified (exclusive)
• FSM per cache, receives requests from both processor
and bus
[Figure: several processors, each with its own cache and coherence FSM, snoop a shared bus that connects to main memory and the I/O system]
Cache coherence protocol

Write-invalidate protocol for a write-back cache:
• showing the state transitions for each block in the cache
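A minimal C sketch of such a three-state write-invalidate FSM (an MSI-style illustration, not the exact diagram from the slide):

    /* per-cache-line state machine: invalid, shared, modified */
    typedef enum { INVALID, SHARED, MODIFIED } line_state;

    /* processor-side events */
    line_state proc_read(line_state s)  { return s == INVALID ? SHARED : s; }  /* miss fetches a shared copy */
    line_state proc_write(line_state s) { return MODIFIED; }   /* a bus invalidate is sent if others hold copies */

    /* bus-side events snooped from other caches */
    line_state bus_read(line_state s)   { return s == MODIFIED ? SHARED : s; } /* supply dirty data, downgrade */
    line_state bus_write(line_state s)  { return INVALID; }    /* another cache takes exclusive ownership */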
Synchronization problem

• A bank’s computer system has a credit process (P_c) and a debit process (P_d):

  /* Process P_c */             /* Process P_d */
  shared int balance            shared int balance
  private int amount            private int amount

  balance += amount             balance -= amount

  lw  $t0,balance               lw  $t2,balance
  lw  $t1,amount                lw  $t3,amount
  add $t0,$t0,$t1               sub $t2,$t2,$t3
  sw  $t0,balance               sw  $t2,balance
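The danger is in the interleaving of these load/store sequences: both processes can read the same old balance, and one update is lost. A minimal C/pthreads sketch of the same race (the amounts are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    int balance = 100;                                        /* shared */

    void *credit(void *arg) { balance += 50; return NULL; }  /* P_c */
    void *debit(void *arg)  { balance -= 30; return NULL; }  /* P_d */

    int main(void) {
        pthread_t pc, pd;
        pthread_create(&pc, NULL, credit, NULL);
        pthread_create(&pd, NULL, debit, NULL);
        pthread_join(pc, NULL);
        pthread_join(pd, NULL);
        /* expected 120; an unlucky interleaving of the underlying
           load/add/store sequences can leave 150 or 70 instead */
        printf("balance = %d\n", balance);
        return 0;
    }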
Critical Section Problem
• n processes all competing to use some shared data
• Each process has code segment, called critical section,
in which shared data is accessed.
• Problem – ensure that when one process is executing in
its critical section, no other process is allowed to execute
in its critical section
• Structure of process
while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
Attempt 1 – Strict Alternation
Process P0:
    shared int turn;
    while (TRUE) {
        while (turn != 0);      /* busy-wait until it is P0's turn */
        critical_section();
        turn = 1;
        remainder_section();
    }

Process P1:
    shared int turn;
    while (TRUE) {
        while (turn != 1);      /* busy-wait until it is P1's turn */
        critical_section();
        turn = 0;
        remainder_section();
    }
Two problems:
• Satisfies mutual exclusion, but not progress
(works only when both processes strictly alternate)
• Busy waiting
Attempt 2 – Warning Flags
Process P0:
    shared int flag[2];
    while (TRUE) {
        flag[0] = TRUE;         /* announce interest */
        while (flag[1]);        /* wait while P1 is interested */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    while (TRUE) {
        flag[1] = TRUE;         /* announce interest */
        while (flag[0]);        /* wait while P0 is interested */
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }

• Satisfies mutual exclusion:
  – P0 in critical section: flag[0] && !flag[1]
  – P1 in critical section: !flag[0] && flag[1]
• However, contains a deadlock (both flags may be set to TRUE!!)
Software solution: Peterson’s Algorithm
(combining warning flags and alternation)
Process P0:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[0] = TRUE;
        turn = 0;
        while (turn == 0 && flag[1]);   /* the last process to set turn backs off */
        critical_section();
        flag[0] = FALSE;
        remainder_section();
    }

Process P1:
    shared int flag[2];
    shared int turn;
    while (TRUE) {
        flag[1] = TRUE;
        turn = 1;
        while (turn == 1 && flag[0]);
        critical_section();
        flag[1] = FALSE;
        remainder_section();
    }
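For reference, a direct C11 rendering of this variant for process i (a sketch; on modern hardware the flag/turn accesses need sequentially consistent atomics, as used here, or the compiler/CPU may reorder them):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool flag[2];   /* interest flags */
    atomic_int  turn;      /* whoever set turn last backs off */

    void enter_critical(int i) {
        int j = 1 - i;
        atomic_store(&flag[i], true);
        atomic_store(&turn, i);
        while (atomic_load(&turn) == i && atomic_load(&flag[j]))
            ;              /* busy-wait */
    }

    void leave_critical(int i) {
        atomic_store(&flag[i], false);
    }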
Software solution is slow !
Issues for Synchronization
• Hardware support:
– Un-interruptable instruction to fetch-and-update
memory (atomic operation)
• User level synchronization operation(s) using
this primitive;
• For large scale MPs, synchronization can be a
bottleneck; techniques to reduce contention and
latency of synchronization
Uninterruptable Instructions to Fetch and Update Memory
• Atomic exchange: interchange a value in a register for a
value in memory
– 0 => synchronization variable is free
– 1 => synchronization variable is locked and unavailable
• Test-and-set: tests a value and sets it if the value passes
the test (also Compare-and-swap)
• Fetch-and-increment: it returns the value of a memory
location and atomically increments it
– 0 => synchronization variable is free
User Level Synchronization — Operation

• Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

          LI    R2,#1        ; load immediate
  lockit: EXCH  R2,0(R1)     ; atomic exchange
          BNEZ  R2,lockit    ; already locked?

• What about MP with cache coherency?
  – want to spin on the cache copy to avoid full memory latency
  – likely to get cache hits for such variables
• Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange (“test and test&set”):

  try:    LI    R2,#1        ; load immediate
  lockit: LW    R3,0(R1)     ; load var
          BNEZ  R3,lockit    ; not free => spin
          EXCH  R2,0(R1)     ; atomic exchange
          BNEZ  R2,try       ; already locked?
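The same test-and-test&set idea in C11 atomics (a sketch; atomic_exchange plays the role of the EXCH instruction):

    #include <stdatomic.h>

    atomic_int lock;   /* 0 = free, 1 = held */

    void acquire(atomic_int *l) {
        for (;;) {
            while (atomic_load(l) != 0)
                ;                            /* spin read-only on the cached copy */
            if (atomic_exchange(l, 1) == 0)  /* atomic exchange: try to grab it */
                return;                      /* got the lock */
        }
    }

    void release(atomic_int *l) {
        atomic_store(l, 0);
    }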
Fetch and Update (cont'd)

• Hard to have read & write in one instruction: use two instead
• Load Linked (or load locked) + Store Conditional
  – load linked returns the initial value
  – store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
• Example doing an atomic swap with LL & SC:

  try:  OR    R3,R4,R0   ; R3 = R4 (exchange value)
        LL    R2,0(R1)   ; load linked
        SC    R3,0(R1)   ; store conditional
        BEQZ  R3,try     ; branch if store fails (R3 = 0)
        MOV   R4,R2      ; put load value in R4

• Example doing fetch & increment with LL & SC:

  try:  LL    R2,0(R1)   ; load linked
        ADDUI R3,R2,#1   ; increment
        SC    R3,0(R1)   ; store conditional
        BEQZ  R3,try     ; branch if store fails (R3 = 0)
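In C11 the whole fetch & increment loop is available as a single call; compilers typically lower it to exactly such an LL/SC loop on MIPS-like machines (a sketch):

    #include <stdatomic.h>

    atomic_int counter;

    int fetch_and_increment(void) {
        return atomic_fetch_add(&counter, 1);   /* returns old value, adds 1 atomically */
    }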
Another MP Issue: Memory Consistency

• What is consistency? When must a processor see a new memory value?
• Example:

  P1:      A = 0;             P2:      B = 0;
           .....                       .....
           A = 1;                      B = 1;
  L1:      if (B == 0) ...    L2:      if (A == 0) ...
• Seems impossible for both if-statements L1 & L2 to be
true?
– What if write invalidate is delayed & processor continues?
• Memory consistency models:
what are the rules for such cases?
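A C11 sketch of the same example: with relaxed atomics (no ordering guarantees), both if-branches really can be taken, which is exactly what a consistency model has to pin down:

    #include <stdatomic.h>

    atomic_int A, B;   /* both start at 0 */

    void p1(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0) {
            /* L1: may be reached ... */
        }
    }

    void p2(void) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        if (atomic_load_explicit(&A, memory_order_relaxed) == 0) {
            /* L2: ... even while L1 is also reached */
        }
    }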
Tooling, OS, and Mapping
• Which mapping steps are performed in HW?
• Pre-emption support
• Programming model
  – streaming or vector support (like KernelC and StreamingC for Imagine, StreamIT for RAW)
  – process communication: shared memory ↔ message passing
  – process synchronization
A few platform examples
Massively Parallel Processors Targeting
Digital Signal Processing Applications
Field Programmable Object Array (MathStar)
PACT XPP-III Processor array
RAW processor from MIT
RAW: Switch Detail
Raw exposes wire delay at the ISA level. This allows the compiler to explicitly manage the static network: routes are compiled into the static router, and messages arrive in a known order.

Latency: 2 + #hops. Throughput: 1 word/cycle per direction, per network.
Philips AETHEREAL
The router provides both guaranteed-throughput (GT) and best-effort (BE) services to communicate with IPs.

The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.

[Figure: a mesh of routers (R); IP blocks attach to the router network through network interfaces]