Dataflow Architectures - Computer Systems @ JSI
Download
Report
Transcript Dataflow Architectures - Computer Systems @ JSI
DATAFLOW
ARCHITECTURES
JURIJ SILC, BORUT ROBIC, THEO UNGERER
1
Literature
Jurij Silc, Borut Robic, and Theo Ungerer: Processor Architecture: From
Dataflow to Superscalar and Beyond (Springer-Verlag, Berlin, New
York,1999).
Jurij Silc, Borut Robic, and Theo Ungerer: "Asynchrony in parallel
computing: From dataflow to multithreading", Parallel and Distributed
Computing Practices, 1(1):3-30, 1998.
Borut Robic, Jurij Silc, and Theo Ungerer: "Beyond dataflow", J.
Computing and Information Technology, 8(2):89-101, 2000.
2
Dataflow Processors - Motivation
In basic processor pipelining hazards limit performance
Structural hazards
Data hazards due to
true dependences or
name (false) dependences: anti and output dependences
Control hazards
Name dependences can be removed by:
compiler (register) renaming
renaming hardware advanced superscalars
single-assignment rule dataflow computers
Data hazards due to true dependences and control hazards can be
avoided if succeeding instructions in the pipeline stem from different
contexts
dataflow computers, multithreaded processors
3
Dataflow vs. Control-Flow
von Neumann or control flow computing model:
a program is a series of addressable instructions, each of which either
specifies an operation along with memory locations of the operands or
it specifies (un)conditional transfer of control to some other instruction.
Essentially: the next instruction to be executed depends on what happened
during the execution of the current instruction.
The next instruction to be executed is pointed to and triggered by the PC.
The instruction is executed even if some of its operands are not available yet
(e.g. uninitialized).
Dataflow model: the execution is driven only by the availability of operand!
no PC and global updateable store
the two features of von Neumann model that become bottlenecks in exploiting
parallelism are missing
4
Dataflow model of computation
Enabling rule:
An instruction is enabled (i.e. executable) if all operands are
available.
Notice, that in von Neumann model, an instruction is enabled if it is pointed to
by PC.
The computational rule or firing rule, specifies when an enabled instruction is
actually executed.
Basic instruction firing rule:
An instruction is fired (i.e. executed) when it becomes enabled.
The effect of firing an instruction is the consumption of its input data
(operands) and generation of output data (results).
Where are the structural hazards? Answer: ignored!!
5
Dataflow languages
Main characteristic: The single-assignment rule
A variable may appear on the left side of an assignment only
once within the area of the program in which it is active.
Examples: VAL, Id, LUCID
A dataflow program is compiled into a dataflow graph which is a directed
graph consisting of named nodes, which represent instructions, and arcs,
which represent data dependences among instructions.
The dataflow graph is similar to a dependence graph used in
intermediate representations of compilers.
During the execution of the program, data propagate along the arcs in
data packets, called tokens.
This flow of tokens enables some of the nodes (instructions) and fires
them.
6
Example in VAL
function Stats: computes the mean
and standard deviation of three
input values.
function Stats(x,y,z: real returns real,real);
let
Mean := (x + y + z)/3;
StDev := SQRT((x**2 + y**2 + z**2)/3
- Mean**2);
in
Mean, StDev
endlet
endfun
x
y
z
+
2
2
2
3
+
+
3
+
2
Mean
sqrt
StDev
7
Example in Id
n
Id program segment computes factorial n!
of integer n
( initial j <- n; k <- 1
while j > 1 do
new j <- j - 1;
new k <- k * j;
return k )
F
T
CHOOSE
F
1
>1
*
-1
SWITCH
T
F
n!
8
Two important characteristics of dataflow graphs
Functionality: The evaluation of a dataflow graph is equivalent to
evaluation of the corresponding mathematical function on the same input
data.
Composability: Dataflow graphs can be combined to form new graphs.
9
Dataflow architectures
Pure dataflow computers:
static,
dynamic,
and the explicit token store architecture.
Hybrid dataflow computers:
Augmenting the dataflow computation model with control-flow
mechanisms, such as
RISC approach,
complex machine operations,
multithreading,
large-grain computation,
etc.
10
Pure Dataflow
A dataflow computer executes a program by receiving, processing and sending out
tokens, each containing some data and a tag.
Dependences between instructions are translated into tag matching and tag
transformation.
Processing starts when a set of matched tokens arrives at the execution unit.
The instruction which has to be fetched from the instruction store (according to
the tag information) contains information about
what to do with data
and how to transform the tags.
The matching unit and the execution unit are connected by an asynchronous
pipeline, with queues added between the stages.
Some form of associative memory is required to support token matching.
a real memory with associative access,
a simulated memory based on hashing,
or a direct matched memory.
11
Static Dataflow
A dataflow graph is represented as a collection of activity templates,
each containing:
the opcode of the represented instruction,
operand slots for holding operand values,
and destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value.
Each token consists only of a value and a destination address.
12
Dataflow graph and Activity template
x
y
2
3
*
ni
sqrt
nj
z
ni
x
y
*
2
3
nj
x
y
nj
sqrt
data token
acknowledge signal
z
ni
z
data arc
acknowledgement arc
13
Acknowledgement signals
Notice, that different tokens destined for the same destination cannot be
distinguished.
Static dataflow approach allows at most one token on any one arc.
Extending the basic firing rule as follows:
An enabled node is fired if there is no token on any of its
output arcs.
Implementation of the restriction by acknowledge signals (additional
tokens ), traveling along additional arcs from consuming to producing
nodes.
The firing rule can be changed to its original form:
A node is fired at the moment when it becomes enabled.
Again: structural hazards are ignored assuming unlimited resources!
14
MIT Static Dataflow Machine
PE
...
Communication
Network
PE
Processing Element
SU
to/from the
Communication
Network
Operation
Unit(s)
local
communication
RU
Instruction
Queue
Fetch
Unit
Update
Unit
Activity
Store
15
Deficiencies of static dataflow
Consecutive iterations of a loop can only be pipelined.
Due to acknowledgment tokens, the token traffic is doubled.
Lack of support for programming constructs that are essential to modern
programming language
no procedure calls,
no recursion.
Advantage: simple model
16
Dynamic Dataflow
Each loop iteration or subprogram invocation should be able to execute in
parallel as a separate instance of a reentrant subgraph.
The replication is only conceptual.
Each token has a tag:
address of the instruction for which the particular data value is
destined
and context information
Each arc can be viewed as a bag that may contain an arbitrary number of
tokens with different tags.
The enabling and firing rule is now:
A node is enabled and fired as soon as tokens with identical
tags are present on all input arcs.
Structural hazards ignored!
17
The U-interpreter (U = unraveling)
Each token consists of an activity name and data
the activity name comprises the tag.
the tag has
an instruction address n,
the context field c that uniquely identifies the context in which the
instruction is to be invoked,
and the initiation number i that identifies the loop iteration in which
this activity occurs.
Note, that c is itself an activity name.
Since the destination instruction may require more than one input, each
token also carries the number of its destination port p.
We represent a token by c.i.n, data p
18
The U-interpreter
if the node ni performs a dyadic function f, and if the port p of nj is the
destination of ni, then we have
in : { c.i.ni , x 1 , c.i.ni , y 2 }
c.i.ni , x
c.i.ni , y
1
1
ni
f
out : { c.i.n j , ( x, y) }
p
2
2
execution
f
ni
c.i.nj , f(x,y)
p
p
nj
nj
19
MERGE and SWITCH nodes
X
execution
X
execution
MERGE node
X
SWITCH
X
SWITCH
T
T
F
F
T
execution
SWITCH
execution
SWITCH
F
T
T
F
F
SWITCH node
20
Branch Implementations
in : c.i.ni , x
data
; c.i.ni ; b
control
x
n
i
n
b
P
COPY
n
n
j
n
j
k
k
g
f
g
X
b T
bF
x
SWITCH
T
F
f
c.i.n j , x if
out :
c.i.nk , x if
Branch
b
n
i
T
F
CHOOSE
P
Speculative
branch
evaluation
21
L, L-1, D, and D-1 Operators for Loop
Implementation
L : in : c.i.ni , x
D : in : c'. j.n j , x
out : c'.1.nk , x , where c' c.i.ni
out : c'. j 1.nk , x
D1 : in : c'.k.nl , x
out : c'.1.nm , x
L1 : in : c'.1.nm , x
out : c'.i.nn , x
L:
D:
D-1:
L-1:
initiation, new loop context
increments loop iteration number
reset loop iteration number to 1
restore original context
22
Basic Loop Implementation
n
i
L
X
n
j
D
n
SWITCH
T
F
k
f
D -1
P
n
l
n
L -1
m
new x
23
A, A-1, BEGIN, and END Operators for function calls
A:
in : c.i.ni , q
, c.i.ni , a
c' c.i.n j
where
func
END: in : c'.i.nend , q (a )
arg
out : c'.1.nbegin , a
and nj is the address of the A-1 operator
out : c.i.n j , q (a )
A: create new context
BEGIN: replicate tokens for each fork
END: return results, unstack return address
A-1: replicate output for successors
24
Function application
q
q
a
n
a
n
A
i
n
i
begin
BEGIN
q
APPLY
n
A -1
j
n
END
end
25
I-structures (I = incremental)
Problem: Single-assignment rule and complex data structures
each update of a data structure consumes the structure and the value
producing a new data structure.
awkward or even impossible to implement.
Solution: concept of I-structure:
a data repository obeying the single-assignment rule
each element of the I-structure may be written only once but it may
be read any number of times
The basic idea is to associate with each element status bits and a queue
of deferred reads.
26
I-structures
The status of each element of the I-structure can be:
present: the element can be read but not written,
absent: a read request has to be deferred but a write operation into
this element is allowed,
waiting: at least one read request of the element has been deferred.
After an element of the data structure has become defined (initialized,
value assigned; can happen exactly once),
all deferred reads, which are kept in the associated queue, become
immediately satisfied.
I-structure makes it possible to use a data structure before it is fully
defined.
It allows defining complex data structures from existing though partially
defined data structures.
27
I-structures
The status of each element of the I-structure can be:
present: the element can be read but not written,
absent: a read request has to be deferred but a write operation into
this element is allowed,
waiting: at least one read request of the element has been deferred.
28
I-structure
The following three elementary operations are defined on I-structures:
allocate: reserves a specified number of elements for a new Istructure,
I-fetch: retrieves the contents of the specified I-structure element
(if the element has not yet been written, then this operation is
automatically deferred),
I-store: writes a value into the specified I-structure element
(if that element is not empty, an error condition is reported).
These elementary operations are used to construct nodes SELECT and
ASSIGN.
I-fetch instruction is implemented as split-phase memory operation:
a read request issued to an I-structure is independent in time from the
response received and thus does not cause a wait by the issuing PE.
29
I-structure select and assign
A
j
A
A j x
j
SELECT
ni
I-fetch
ASSIGN
I-store
signal
Storage
Storage
nj
x
I-structure
I-structure
nj
j
address
address
ni
A
nj
nj
30
MIT Tagged-Token Dataflow Architecture
I-Structure
PE
...
Communication
Network
Processing Element
Storage
...
I-Structure
PE
Storage
Token
Queue
RU
to/from the
Communication
Network
Wait-Match Unit
&
Waiting Token Store
local
communication
Instruction
Fetch
Unit
SU
Form
Token
Unit
ALU
&
Form
Tag
Program & Constant
Store
Store
31
Manchester Dataflow Machine
Host
Processing Element
Switch
Switch
Structure
Storage
PE
Switch
Structure
Storage
Matching
Unit
Instruction
Store
Processing Unit
ALU
...
Token
Queue
ALU
Switch
input
output
32
Advantages and Deficiencies of Dynamic Dataflow
Major advantage: better performance (compared with static) as it allows
multiple tokens on each arc thereby unfolding more parallelism.
Problems:
efficient implementation of the matching unit that collects tokens with
matching tags.
Associative memory would be ideal.
Unfortunately, it is not cost-effective since the amount of memory
needed to store tokens waiting for a match tends to be very large.
As a result, all existing machines use some form of hashing techniques
that are typically not as fast as associative memory.
33
Explicit Token Store Approach
Target: efficient implementation of token matching.
Basic idea: allocate a separate frame in the frame memory for each
active loop iteration or subprogram invocation.
A frame consists of slots where each slot holds an operand that is used in
the corresponding activity.
Since access to slots is direct (i.e. through offsets relative to the frame
pointer), no associative search is needed.
34
Explicit Token Store
Instruction Memory
op-code
offset
in the
activation
frame
<FP, IP, 3.01>
destinations
left
right
*
*
sqrt
2
+
3
5
-
+1
+1
+3
+2
+2
+5
+2
IP
Frame Memory
presence
bit
+
sqrt
-
value
FP
FP + 2
2.34
35
t=0
Frame memory
<FP, IP, 3.01>
presence
bit
Explicit Token
Store Matching Scheme
FP
*
+
sqrt
2.34
t=1
Frame Memory
< FP, IP, 2.0 >
,
presence
bit
<FP, IP, 7.4 >
value
FP
*
3.01
2.34
+
sqrt
value
Frame Memory
presence
bit
t=2
<FP, IP+1, 6.02 >
sqrt
*
<FP, IP+2, 6.02 >
value
FP
+
2.34
FP
,
7.4
36
k-bounded Loop Scheme
...
1
1+k
1+2k
...
k-1
2k-1
3k-1
...
p
...
SWITCH
T
F
Dreset
SWITCH
T
Dreset
Result
X
Dk
Synchronization Tree
...
...
...
Signal
D-2
k
loop body
Dk
G
F
activation
frame 1
X
k
2k
activation
frame 2
X
activation
frame k
.
.
.
k -1 tokens
Frame Memory
loop prelude
37
k-bounded Loop Scheme - new Operators
New: gate operator G, operator Dk and a synchronization tree.
The Dk operator performs the modulo k increment of the iteration number
of tokens circulation through loop.
The G operator has two inputs (for control and data) and functions as loop
throttle by passing one token from its data input to the output for each
token on its control input.
At the end of each iteration, a new control token is generated by
combining the output of all Dk operators into a single value using the
synchronization tree.
Loop throttling: at most k consecutive loop iterations active at the same
time and, as a result, only k frames are needed.
38
Monsoon, an Explicit Token Store Machine
Each PE is using an eight-stage pipeline
instruction fetch
--- precedes token matching (in contrast to dynamic dataflow
processors with associative matching units)!
token matching1: effective address generation: explicit token address
is computed from the frame address and operand offset
token matching2: presence bit operation: a presence bit is accessed to
find out if the first operand of a dyadic operation has already arrived
not arrived presence bit set and the current token is stored into
the frame slot of the frame memory
arrived presence bit is reset and the operand can be retrieved
from the slot of the frame memory in next stage
token matching3: frame operation stage: Operand storing or
retrieving.
39
Monsoon, an Explicit Token Store Machine
Each PE is using an eight-stage pipeline (continued)
Next three stages: execution stages in the course of which the next
tag is also computed concurrently.
Eighth stage: form-token: forms one or two new tokens that are sent
to the network, stored in a user token queue, a system token queue,
or directly recirculated to the instruction fetch stage of the pipeline.
40
Monsoon, an Explicit Token Store Machine
I-Structure
PE
...
Processing Element
Multistage
Packet
Switching
Network
Storage
...
I-Structure
PE
Storage
Instruction
Memory
Frame
Operation
Form
Token
Frame Memory
Effective
Address
Generation
Presence
Bit
Operation
User Queue
to/from the
Communication
Network
System Queue
Instruction
Fetch
ALU
41
Monsoon Prototype
16 prototypes at beginning of 90ies!
Processing element:
10 MHz clock
56 kW Instruction Memory (32 bit wide)
256 kW Frame Memory (word + 3 presence bits, word size: 64 bit data + 8 bit
tag)
Two 32 ktoken queues (system, user)
I-structure storage:
4MW (word + 3 presence bits)
5 M requests/sec
Network
Multistage, pipelined
Packet Routing Chips (PaRC, 4 x 4 crossbar)
4 M tokens/s/link (100 MB/s)
42
Advantages and Deficiencies of Dynamic Dataflow
Major advantage: better performance (compared with static) because it
allows multiple tokens on each arc thereby unfolding more parallelism.
Problems:
efficient implementation of the matching unit that collects tokens with
matching tags.
Associative memory would be ideal.
Unfortunately, it is not cost-effective since the amount of memory
needed to store tokens waiting for a match tends to be very large.
All existing machines use some form of hashing techniques.
bad single thread performance (when not enough workload is present)
dyadic instructions lead to pipeline bubbles when first operand tokens
arrive
no instruction locality no use of registers
43
Augmenting Dataflow with Control-Flow
Poor sequential code performance by dynamic dataflow computers
An instruction of the same thread can only be issued to the dataflow
pipeline after the completion of its predecessor instruction.
In the case of an 8-stage pipeline, instructions of the same thread can
be issued at most every eight cycles.
Low workload: the utilization of the dataflow processor drops to one
eighth of its maximum performance.
Another drawback: the overhead associated with token matching.
before a dyadic instruction is issued to the execution stage, two result
tokens have to be present.
The first token is stored in the waiting-matching store, thereby
introducing a bubble in the execution stage(s) of the dataflow
processor pipeline.
measured pipeline bubbles on Monsoon: up to 28.75 %
No use of registers possible!
44
Augmenting Dataflow with Control-Flow
Solution: combine dataflow with control-flow mechanisms.
Several techniques for combining control-flow and dataflow emerged:
hybrid dataflow,
RISC dataflow,
dataflow with complex machine operations,
threaded dataflow,
large-grain dataflow.
45
Threaded Dataflow
Threaded dataflow: the dataflow principle is modified so that instructions
of certain instruction streams are processed in succeeding machine cycles.
In a dataflow graph a subgraph that exhibits a low degree of parallelism is
transformed into a sequential thread.
The thread of instructions is issued consecutively by the matching unit
without matching further tokens except for the first instruction of the
thread.
Threaded dataflow covers
the repeat-on-input technique used in Epsilon-1 and Epsilon-2
processors,
the strongly connected arc model of EM-4, and
the direct recycling of tokens in Monsoon.
46
Threaded Dataflow (continued)
Data passed between instructions of the same thread is stored in registers
instead of written back to memory.
These registers may be referenced by any succeeding instruction in the
thread.
Thereby single-thread performance is improved.
The total number of tokens needed to schedule program instructions is
reduced which in turn saves hardware resources.
Pipeline bubbles are avoided for dyadic instructions within a thread.
Two threaded dataflow execution techniques can be distinguished:
direct token recycling (Monsoon),
consecutive execution of the instructions of a single thread (Epsilon &
EM).
47
Direct token recycling of Monsoon
Cycle-by-cycle instruction interleaving of threads similar to multithreaded
von Neumann computers!
8 register sets can be used by 8 different threads.
Dyadic instructions within a thread (except for the start instruction!) refer
to at least one register, i.e. need only a single token to be enabled.
A result token of a particular thread is recycled ASAP in the 8-stage
pipeline, i.e. every 8th cycle the next instruction of a thread is fired and
executed.
This implies that at least 8 threads must be active for a full pipeline
utilization.
Threads and fine-grain dataflow instructions can be mixed in the pipeline.
48
Epsilon and EM-4
Epsilon and EM-4 execute instructions from a thread consecutively.
The circular pipeline of fine-grain dataflow is retained.
However, the matching unit has to be enhanced with a mechanism that,
after firing the first instruction of a thread, delays matching of further
tokens in favor of consecutive issuing of all instructions of the started
thread.
Example: strongly connected arc model (EM-4):
each arc of the dataflow graph is classified as either a normal arc or a
strongly connected arc
The set of nodes that are connected by strongly connected arcs is
called strongly connected block.
such a block is fired if its source nodes are enabled
the instructions in the block executed successively
Problem: implementation of an efficient synchronization mechanism
49
Strongly connected blocks in EM-4
2
1
3
4
normal arc
5
7
A
6
8
9
10
B
11
13
strongly connected arc
12
strongly connected block
14
50
Direct matching:
Instruction Memory and Operand Memory
SF = dyadic right
*
Instruction Memory
op-code
Operand Memory
-
presence
bit
OSN
value
TSN
Operand
Segment
Template Segment
+
sqrt
TSN
*
sqrt
<SF,OSN,DPL, 6.02>
+
-
2.34
51
Processing Element:
EMC-R Processor + Memory
PE
...
Register
Files
Input Buffer Unit
Fetch Matching
Unit
Instruction
Fetch
Execute and
Emit Tokens
Memory Control Unit
PE
Execution Unit
Omega Network
EM-4
Operand
Segments
Template
Segments
Heap
Memory
Switching
Unit
EMC-R
to/from the
Communication
Network
52
Large-Grain (coarse-grain) Dataflow
A dataflow graph is enhanced to contain fine-grain (pure) dataflow nodes
and macro dataflow nodes.
A macro dataflow node contains a sequential block of instructions.
A macro dataflow node is activated in the dataflow manner,
its instruction sequences is executed in the von Neumann style!
Off-the-shelf microprocessors can be used to support the execution stage.
Large-grain dataflow machines typically decouple the matching stage
(sometimes called signal stage, synchronization stage, etc.) from the
execution stage by use of FIFO-buffers.
Pipeline bubbles are avoided by the decoupling and FIFO-buffering.
53
Large-Grain Dataflow: StarT
Message
Queue
Message
Queue
Node Memory
Data
Processor
dP
Continuation
Queue
local
to/from the
Communication
Network
Message
Queue
Synchronization
Processor
sP
Remote Memory
Request Processor
Rmem
54
Dataflow with Complex Machine Operations
Use of complex machine instructions, e.g. vector instructions
ability to exploit parallelism at the subinstruction level
Instructions can be implemented by pipeline techniques as in vector
computers.
The use of a complex machine operation may spare several nested
loops.
Structured data is referenced in block rather than element-wise and can
be supplied in a burst mode.
Problem: I-structure scheme: each data element within a complex data
structure is fetched individually from a structure store.
The structure-flow technique (SIGMA-1) defines structure load/store
instructions that can move whole vectors to/from structure store.
55
Dataflow with Complex Machine Operations
and combined with LGDF
Often: use of FIFO-buffers to decouple the firing stage and the execution
stage
bridges different execution times within a mixed stream of simple and
complex instructions.
Major difference to pure dataflow: tokens do not carry data (except for
the values true or false).
Data is only moved and transformed within the execution stage.
Applied in: Decoupled Graph/Computation Architecture, the Stollmann
Dataflow Machine, and the ASTOR architecture.
These architectures combine complex machine instructions with largegrain dataflow.
56
PE
...
PE
Dynamic
Code
Storage
Danymic Code
Access
Manager
Control
Construct
Managers
I/O
Manager
Static Code
Access
Manager
...
...
Static
Code
Storage
Program
flow
control
part
I/O
Manager
to/from the Instruction
Communication
Network
Instruction
Comunication
Network
Data
Communication
Network
Augsburg Structure-Oriented Architecture (ASTOR)
Data
Access
Managers
I/O
Manager
Data
Storage
Processing Element
Computational
Structure Manager
Data Transformation
Units
Data object
processing
part
to/from the Data
Communication
Network
57
RISC and Dataflow
Another dataflow/von Neumann hybrid of Arvind
Support the execution of existing software written for conventional
processors.
Using such a machine as a bridge between existing systems and new
dataflow supercomputers should have made the transition from imperative
von Neumann languages to dataflow languages easier!?
The basic philosophy underlying the development of the RISC dataflow
architecture can be summarized as follows:
use a RISC-like instruction set,
change the architecture to support multithreaded computation,
add (explicit!!) fork and join instructions to manage multiple threads,
implement all global storage as I-structure storage, and
implement load/store instructions to execute in split-phase mode.
58
RISCifying dataflow
op-code
sqrt
mul
join
operation
fork
sub
frame pointer
dest. instruction address
add
div
operands
109
join 2
...
110
mul
d1, ...
111
jump
120
115
sqrt
d2
116
jump
120
120
join 2
dj
121
add
d, d1, d2
122
fork
159
123
jump
141
141
sub
... , d, ...
159
join 2
...
160
div
... , d, ...
59
P-RISC Architecture
PE
...
Processing Element
Communication
Network
Global
Memory
Operand
Fetch
Unit
Communication
Queue
to/from the
Communication
Network
or
Global (I-structure)
Memory
Token Queue
ALU
Operand
Store
Unit
Frame Memory
Instruction
Fetch
Unit
Instruction
Memory
PE
60
P-RISC Characteristics
Load/store instructions are the only instructions accessing GM
(implemented as I-structure storage)
Arithmetic/logical instructions operate on local memory (registers)
Fixed instruction length
One-cycle instruction execution (except for load/store instructions)
No explicit matching unit: all operands associated with a sequential thread
of computation are kept in a frame in local Program Memory (PM).
Continuation: an (IP;FP) pair
IP serves to fetch the next instruction
FP serves as the base for fetching and storing operands.
To make P-RISC multithreaded, the stack of frames is arranged as a tree
of frames, and a separate continuation is associated with each thread.
61
Other Hybrids
Various different blends of operating principle possible:
Treleven 1982: dataflow and reduction principle
What is reduction?
Dataflow is data-driven,
reduction is demand-driven
MUSE 1985: dataflow, tree machine, and graph reduction
MADAME (Silc and Robic 1989): synchronous dataflow principle
DTN Dataflow Computer 1990: Dataflow workstation, based upon NEC
Image Pipelined Processor chips
62
Lessons Learned from Dataflow
The latest generation of superscalar microprocessors displays an out-of order dynamic execution that is referred to as local dataflow or micro
dataflow.
Colwell and Steck 1995, in the first paper on the PentiumPro:
The flow of the Intel Architecture instructions is predicted and these
instructions are decoded into micro-operations (ops), or series of ops,
and these ops are register-renamed, placed into an out-of-order
speculative pool of pending operations, executed in dataflow order (when
operands are ready), and retired to permanent machine state in source
program order.
State-of-the-art microprocessors typically provide 32 (MIPS R10000), 40
(Intel PentiumPro) or 56 (HP PA-8000) instruction slots in the instruction
window or reorder buffer.
Each instruction is ready to be executed as soon as all operands are
available.
63
Comparing dataflow computers with superscalar
microprocessors
Superscalar microprocessors are von Neumann based:
(sequential) thread of instructions as input
not enough fine-grained parallelism to feed the multiple functional units
speculation
dataflow approach resolves any threads of control into separate
instructions that are ready to execute as soon as all required operands
become available.
The fine-grained parallelism generated by dataflow principle is far larger
than the parallelism available for microprocessors.
However, locality is lost no caching, no registers
64
Lessons Learned from Dataflow (Pipeline Issues)
Microprocessors: Data and control dependences potentially cause
pipeline hazards that are handled by complex forwarding logic.
Dataflow: Due to the continuous context switches, pipeline hazards are
avoided; disadvantage: poor single thread performance.
Microprocessors: Antidependences and output dependences are
removed by register renaming that maps the architectural registers to the
physical registers.
Thereby the microprocessor internally generates an instruction stream
that satisfies the single assignment rule of dataflow.
The main difference between the dependence graphs of dataflow and
the code sequence in an instruction window of a microprocessor:
branch prediction and speculative execution.
Microprocessors: rerolling execution in case of a wrongly predicted path
is costly in terms of processor cycles.
65
Lessons Learned from Dataflow (Continued)
Dataflow: The idea of branch prediction and speculative execution has
never been evaluated in the dataflow environment.
Dataflow was considered to produce an abundance of parallelism while
speculation leads to speculative parallelism which is inferior to real
parallelism.
Microprocessors: Due to the single thread of control, a high degree of
data and instruction locality is present in the machine code.
Microprocessors: The locality allows to employ a storage hierarchy that
stores the instructions and data potentially executed in the next cycles
close to the executing processor.
Dataflow: Due to the lack of locality in a dataflow graph, a storage
hierarchy is difficult to apply.
66
Lessons Learned from Dataflow (Continued)
Microprocessors: The operand matching of executable instructions in
the instruction window is restricted to a part of the instruction sequence.
Because of the serial program order, the instructions in this window are
likely to become executable soon.
The matching hardware can be restricted to a small number of slots.
Dataflow: the number of tokens waiting for a match can be very high.
A large waiting-matching store is required.
Dataflow: Due to the lack of locality, the likelihood of the arrival of a
matching token is difficult to estimate,
caching of tokens to be matched soon is difficult.
67
Lessons Learned from Dataflow (Memory Latency)
Microprocessors: An unsolved problem is the memory latency caused by cache
misses.
Example: SGI Origin 2000:
latencies are 11 processor cycles for a L1 cache miss,
60 cycles for a L2 cache miss,
and can be up to 180 cycles for a remote memory access.
In principle, latencies should be multiplied by the degree of superscalar.
Microprocessors: Only a small part of the memory latency can be hidden by outof-order execution, write buffer, cache preload hardware, lockup free caches, and
a pipelined system bus.
Microprocessors often idle and are unable to exploit the high degree of internal
parallelism provided by a wide superscalar approach.
Dataflow: The rapid context switching avoids idling by switching execution to
another context.
68
Lessons Learned from Dataflow
(Continued)
Microprocessors: Finding enough fine-grain parallelism to fully exploit
the processor will be the main problem for future superscalars.
Solution: enlarge the instruction window to several hundred instruction
slots; two draw-backs
Most of the instructions in the window will be speculatively assigned
with a very deep speculation level (today's depth is normally four at
maximum).
most of the instruction execution will be speculative.
The principal problem here arises from the single instruction stream
that feeds the instruction window.
If the instruction window is enlarged, the updating of the instruction
states in the slots and matching of executable instructions lead to
more complex hardware logic in the issue stage of the pipeline thus
limiting the cycle rate.
69
Lessons Learned from Dataflow (Continued)
Solutions:
the decoupling of the instruction window with respect to different
instruction classes,
the partitioning of the issue stage into several pipeline stages,
and alternative instruction window organizations.
Alternative instruction window organization: the dependence-based
microprocessor:
Instruction window is organized as multiple FIFOs.
Only the instructions at the heads of a number of FIFO buffers can be
issued to the execution units in the next cycle.
The total parallelism in the instruction window is restricted in favor of
a less costly issue that does not slow down processor cycle rate.
Thereby the potential fine-grained parallelism is limited
somewhat similar to the threaded dataflow approach.
70
Lessons Learned from Dataflow
(alternative instruction window organizations)
Look at dataflow matching store implementations
Look into dataflow solutions like threaded dataflow
(e.g. repeat-on-input technique or strongly-connected arcs model)
Repeat-on-input strategy issues compiler-generated code sequences
serially (in an otherwise fine-grained dataflow computer).
Transferred to the local dataflow in an instruction window:
an issue string might be used;
a series of data dependent instructions is generated by a compiler and
issued serially after the issue of the leading instruction.
However, the high number of speculative instructions in the instruction
window remains.
71