Transcript Lec24

Quasi-static Scheduling for Reactive Systems
Jordi Cortadella, Universitat Politècnica de Catalunya, Spain
Alex Kondratyev, Cadence Berkeley Labs, USA
Luciano Lavagno, Politecnico di Torino, Italy
Claudio Passerone, Politecnico di Torino, Italy
Yosinori Watanabe, Cadence Berkeley Labs, USA
Outline
• The problem
– Synthesis of concurrent specifications
• Previous work: Dataflow networks
– Static scheduling of SDF networks
• Quasi-Static Scheduling of process
networks
– Petri net representation of process networks
– Scheduling and code generation
• Open problems
Embedded Software Synthesis
• Specification: concurrent functional netlist
(Kahn processes, dataflow actors, SDL processes, …)
• Software implementation:
(smaller) set of concurrent software tasks
• Two sub-problems:
– Generate code for each task
– Schedule tasks dynamically
• Goals:
– minimize real-time scheduling overhead
– maximize effectiveness of compilation
Environmental controller
[Figure: block diagram. Temperature and Humidity arrive on TSENSOR and HSENSOR, pass through TEMP FILTER and HUMIDITY FILTER onto TDATA and HDATA, and feed the CONTROLLER, which drives the outputs AC-on, DRYER-on and ALARM-on (AC, Dehumidifier, Alarm).]
Environmental controller
TEMP-FILTER
float sample, last;
last = 0;
forever {
sample = READ(TSENSOR);
if (|sample - last| > DIF) {
last = sample;
WRITE(TDATA, sample);
}
}
HUMIDITY-FILTER
float h;
forever {
h = READ(HSENSOR);
if (h > MAX) WRITE(HDATA, h);
}
Environmental controller
CONTROLLER
float tdata, hdata;
forever {
select(TDATA,HDATA) {
case TDATA:
tdata = READ(TDATA);
if (tdata > TFIRE)
WRITE(ALARM-on,10);
else if (tdata > TMAX)
WRITE(AC-on, tdata-TMAX);
case HDATA:
hdata = READ(HDATA);
if (hdata > HMAX)
WRITE(DRYER-on, 5);
}
}
Operating system
[Figure: timeline with columns Environ., Processes and OS, over the environmental controller network]
• Tsensor: T-FILTER wakes up, executes, sleeps
• Hsensor: H-FILTER wakes up, executes & sends data to HDATA, sleeps
• CONTROLLER wakes up, executes & reads data from HDATA
• ...
Goal: improve performance
• Reduce operating system overhead
• Reduce communication overhead
How?: Do as much as possible statically
• Scheduling
• Compiler optimizations
Intuitive semantics
• (Often stateless) actors perform computation
• Unbounded FIFOs perform communication via sequences
of tokens carrying values
– (matrix of) integer, float, fixed point
– image of pixels, …..
• Determinacy:
– unique output sequences given unique input sequences
– Sufficient condition: blocking read
(process cannot test input queues for emptiness)
Intuitive semantics
• Example: FIR filter
– single input sequence i(n)
– single output sequence o(n)
– o(n) = c1 * i(n) + c2 * i(n-1)
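[Figure: dataflow graph of the filter. Input i feeds a multiplier by c1 directly and a multiplier by c2 through a one-sample delay holding the initial token i(-1); an adder sums both products into output o.]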
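The FIR example above can be read as a single actor with one token of state. A minimal sketch in C, assuming the initial token i(-1) is supplied by the caller; the names fir_actor, fir_fire and fir_run are illustrative and do not come from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Two-tap FIR actor for o(n) = c1 * i(n) + c2 * i(n-1).
 * The struct and function names are assumptions made for this sketch. */
typedef struct {
    double c1, c2;  /* filter coefficients */
    double prev;    /* i(n-1); initialised with the initial token i(-1) */
} fir_actor;

/* One firing: consume one input token, produce one output token. */
static double fir_fire(fir_actor *a, double in)
{
    double out = a->c1 * in + a->c2 * a->prev;
    a->prev = in;   /* the delay element now holds the current sample */
    return out;
}

/* Fire the actor once per input token of a finite sequence. */
static void fir_run(fir_actor *a, const double *in, double *out, size_t n)
{
    for (size_t k = 0; k < n; k++)
        out[k] = fir_fire(a, in[k]);
}
```

Each call to fir_fire consumes exactly one token and produces exactly one, so the actor fits the fixed-rate (SDF) model.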
Examples of Dataflow actors
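• SDF: Static Dataflow: fixed number of input and output tokens
[Figure: SDF actors annotated with token rates: an adder (rates 1, 1, 1), an FFT (rates 1024, 1024) and an actor with rates 10 and 1]
• BDF: Boolean Dataflow: control token determines number of consumed and produced tokens
[Figure: select and merge actors, each with T and F ports]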
Static scheduling of DF
• Key property of DF networks: output sequences do not
depend on firing sequence of actors
• SDF networks can be statically scheduled at compile-time
– execute an actor when it is known to be fireable
– no overhead due to sequencing of concurrency
– static buffer sizing
• Different schedules yield different
– code size
– buffer size
– pipeline utilization
Balance equations
• Number of produced tokens must equal number of
consumed tokens on every edge
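[Figure: edge from actor A to actor B; A produces $n_p$ tokens per firing and B consumes $n_c$ tokens per firing]
• Repetitions (or firing) vector $v_S$ of schedule S: number of firings of each actor in S
• $v_S(A)\, n_p = v_S(B)\, n_c$ must be satisfied for each edge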
Balance equations
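[Figure: SDF graph with actors A, B, C: edge A→B (A produces 3, B consumes 1), edge B→C (B produces 1, C consumes 1), and two edges A→C (A produces 2, C consumes 1)]
• Balance for each edge:
– $3\, v_S(A) - v_S(B) = 0$
– $v_S(B) - v_S(C) = 0$
– $2\, v_S(A) - v_S(C) = 0$
– $2\, v_S(A) - v_S(C) = 0$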
Balance equations
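[Figure: SDF graph with actors A, B, C: edge A→B (produce 3, consume 1), edge B→C (produce 1, consume 1), and two edges A→C (produce 2, consume 1)]
$$M = \begin{pmatrix} 3 & -1 & 0 \\ 0 & 1 & -1 \\ 2 & 0 & -1 \\ 2 & 0 & -1 \end{pmatrix}$$
• $M v_S = 0$ iff S is periodic
• Full rank (as in this case)
– no non-zero solution
– no periodic schedule
(too many tokens accumulate on AB or BC)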
Balance equations
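[Figure: SDF graph with actors A, B, C: edge A→B (produce 2, consume 1), edge B→C (produce 1, consume 1), and two edges A→C (produce 2, consume 1)]
$$M = \begin{pmatrix} 2 & -1 & 0 \\ 0 & 1 & -1 \\ 2 & 0 & -1 \\ 2 & 0 & -1 \end{pmatrix}$$
• Non-full rank
– infinite solutions exist (linear space of dimension 1)
• Any multiple of $q = (1\ 2\ 2)^T$ satisfies the balance equations
• ABCBC and ABBCC are minimal valid schedules
• ABABBCBCCC is a non-minimal valid schedule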
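The two balance-equation examples can be checked mechanically. The sketch below fixes the firing count of actor 0 to 1, propagates rational firing counts along the edges, and then scales to the smallest integer vector. It is one possible way to solve M vS = 0 for a connected graph, not necessarily the method used by the authors; sdf_edge, gcd_l and repetition_vector are names invented here.

```c
#include <assert.h>

#define MAX_ACTORS 16

/* One SDF edge: src produces `prod` tokens per firing, dst consumes `cons`. */
typedef struct { int src, dst, prod, cons; } sdf_edge;

static long gcd_l(long a, long b)
{
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

/* Returns 1 and fills q[] with the smallest positive integer solution of the
 * balance equations, or 0 if only the zero solution exists.
 * Assumes a connected graph with at most MAX_ACTORS actors. */
static int repetition_vector(int n_actors, const sdf_edge *e, int n_edges, long *q)
{
    long num[MAX_ACTORS] = {0}, den[MAX_ACTORS];
    assert(n_actors > 0 && n_actors <= MAX_ACTORS);
    for (int i = 0; i < n_actors; i++) den[i] = 1;
    num[0] = 1;  /* fix v(actor 0) = 1, then solve for the others */

    for (int pass = 0; pass < n_actors; pass++) {
        for (int k = 0; k < n_edges; k++) {
            int a = e[k].src, b = e[k].dst;
            if (num[a] != 0 && num[b] == 0) {         /* v(b) = v(a) * prod / cons */
                num[b] = num[a] * e[k].prod;
                den[b] = den[a] * e[k].cons;
                long g = gcd_l(num[b], den[b]);
                num[b] /= g; den[b] /= g;
            } else if (num[b] != 0 && num[a] == 0) {  /* v(a) = v(b) * cons / prod */
                num[a] = num[b] * e[k].cons;
                den[a] = den[b] * e[k].prod;
                long g = gcd_l(num[a], den[a]);
                num[a] /= g; den[a] /= g;
            }
        }
    }
    for (int i = 0; i < n_actors; i++)
        if (num[i] == 0) return 0;                    /* actor not reached */
    for (int k = 0; k < n_edges; k++) {               /* every edge must balance */
        int a = e[k].src, b = e[k].dst;
        if (num[a] * e[k].prod * den[b] != num[b] * e[k].cons * den[a])
            return 0;
    }
    long l = 1, g = 0;
    for (int i = 0; i < n_actors; i++) l = l / gcd_l(l, den[i]) * den[i];
    for (int i = 0; i < n_actors; i++) { q[i] = num[i] * (l / den[i]); g = gcd_l(g, q[i]); }
    for (int i = 0; i < n_actors; i++) q[i] /= g;
    return 1;
}
```

With actors numbered A = 0, B = 1, C = 2, the graph whose A→B edge produces 3 tokens is rejected, while the graph whose A→B edge produces 2 tokens yields q = (1 2 2)^T.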
From repetition vector to schedule
• Repeatedly schedule fireable actors up to number
of times in repetition vector
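[Figure: SDF graph with actors A, B, C: edge A→B (produce 2, consume 1), edge B→C (produce 1, consume 1), and two edges A→C (produce 2, consume 1), with $q = (1\ 2\ 2)^T$]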
• Can find either ABCBC or ABBCC
• If the simulation deadlocks before returning to the original state, no valid schedule
exists (Lee ‘86)
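The procedure on this slide can be sketched as a token-counting simulation. The code below assumes no initial tokens on any edge and uses a simple index-order policy to pick among fireable actors; the prefer_last flag, the array limits and the name build_schedule are illustrative choices, not part of the original material.

```c
#include <assert.h>
#include <string.h>

#define MAX_ACTORS 16
#define MAX_EDGES  32

/* One SDF edge: src produces `prod` tokens per firing, dst consumes `cons`. */
typedef struct { int src, dst, prod, cons; } sdf_edge;

/* Simulates one period of the repetition vector q. Actor i is written to
 * `out` as the letter 'A' + i. Returns 1 on success, 0 on deadlock.
 * prefer_last selects which fireable actor wins (highest or lowest index). */
static int build_schedule(int n_actors, const sdf_edge *e, int n_edges,
                          const int *q, int prefer_last, char *out)
{
    int tokens[MAX_EDGES] = {0};  /* assumption: all edges start empty */
    int left[MAX_ACTORS];
    int total = 0, len = 0;

    assert(n_actors <= MAX_ACTORS && n_edges <= MAX_EDGES);
    for (int i = 0; i < n_actors; i++) { left[i] = q[i]; total += q[i]; }

    while (len < total) {
        int fired = -1;
        for (int s = 0; s < n_actors && fired < 0; s++) {
            int a = prefer_last ? n_actors - 1 - s : s;
            if (left[a] == 0) continue;       /* already fired q[a] times */
            int ok = 1;
            for (int k = 0; k < n_edges; k++)
                if (e[k].dst == a && tokens[k] < e[k].cons) ok = 0;
            if (ok) fired = a;
        }
        if (fired < 0) { out[len] = '\0'; return 0; }  /* deadlock */
        for (int k = 0; k < n_edges; k++) {
            if (e[k].dst == fired) tokens[k] -= e[k].cons;
            if (e[k].src == fired) tokens[k] += e[k].prod;
        }
        left[fired]--;
        out[len++] = (char)('A' + fired);
    }
    out[len] = '\0';
    return 1;
}
```

On the example graph with q = (1 2 2)^T, the lowest-index policy produces ABBCC and the highest-index policy produces ABCBC, the two schedules named on the slide.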
Compilation optimization
• Assumption: code stitching
(chaining custom code for each actor)
• More efficient than C compiler for DSP
• Comparable to hand-coding in some cases
• Explicit parallelism, no artificial control
dependencies
• Main problem: memory and processor/FU
allocation depends on scheduling, and vice versa
Code size minimization
• Assumptions (based on DSP architecture):
– subroutine calls expensive
– fixed iteration loops are cheap
(“zero-overhead loops”)
• Global optimum: single appearance schedule
e.g. ABCBC → A (2BC), ABBCC → A (2B) (2C)
• may or may not exist for an SDF graph…
• buffer minimization relative to single appearance
schedules
(Bhattacharyya ‘94, Lauwereins ‘96, Murthy ‘97)
Buffer size minimization
• Assumption: no buffer sharing
• Example:
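[Figure: SDF graph with edges A→C, B→C and C→D; on each edge the producer emits 1 token per firing and the consumer takes 10]
$q = (100\ 100\ 10\ 1)^T$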
• Valid SAS: (100 A) (100 B) (10 C) D
• requires 210 units of buffer area
• Better (factored) SAS: (10 (10 A) (10 B) C) D
• requires 30 units of buffer area, but…
• requires 21 loop initiations per period (instead of 3)
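The buffer figures for the two single appearance schedules can be reproduced by firing the actors in loop order and recording the peak token count on each edge. The sketch assumes the topology A→C, B→C, C→D with rates (1, 10) on every edge, as recovered from the repetition vector, and no buffer sharing; all identifiers are illustrative.

```c
#include <assert.h>

#define N_EDGES 3

/* One SDF edge: src produces `prod` tokens per firing, dst consumes `cons`. */
typedef struct { int src, dst, prod, cons; } sdf_edge;

/* Assumed topology: actors A=0, B=1, C=2, D=3. */
static const sdf_edge edges[N_EDGES] = {
    {0, 2, 1, 10},  /* A -> C */
    {1, 2, 1, 10},  /* B -> C */
    {2, 3, 1, 10},  /* C -> D */
};

typedef struct { int tokens[N_EDGES]; int peak[N_EDGES]; } buffers;

/* Fire one actor and update the current and peak token counts. */
static void fire(buffers *b, int actor)
{
    for (int k = 0; k < N_EDGES; k++) {
        if (edges[k].dst == actor) b->tokens[k] -= edges[k].cons;
        if (edges[k].src == actor) {
            b->tokens[k] += edges[k].prod;
            if (b->tokens[k] > b->peak[k]) b->peak[k] = b->tokens[k];
        }
    }
}

/* Without buffer sharing, the requirement is the sum of per-edge peaks. */
static int total_peak(const buffers *b)
{
    int sum = 0;
    for (int k = 0; k < N_EDGES; k++) sum += b->peak[k];
    return sum;
}

/* Flat SAS: (100 A) (100 B) (10 C) D */
static int flat_sas(void)
{
    buffers b = {{0}, {0}};
    for (int i = 0; i < 100; i++) fire(&b, 0);
    for (int i = 0; i < 100; i++) fire(&b, 1);
    for (int i = 0; i < 10; i++)  fire(&b, 2);
    fire(&b, 3);
    return total_peak(&b);
}

/* Factored SAS: (10 (10 A) (10 B) C) D */
static int factored_sas(void)
{
    buffers b = {{0}, {0}};
    for (int j = 0; j < 10; j++) {
        for (int i = 0; i < 10; i++) fire(&b, 0);
        for (int i = 0; i < 10; i++) fire(&b, 1);
        fire(&b, 2);
    }
    fire(&b, 3);
    return total_peak(&b);
}
```

flat_sas() returns 210 and factored_sas() returns 30, matching the slide. The factored version pays for this with one outer loop plus ten initiations of each inner loop, which gives the 21 loop initiations per period.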
Scheduling more powerful DF
• SDF is limited in modeling power
• More general DF is too powerful
– non-Static DF is Turing-complete (Buck ‘93)
– bounded-memory scheduling is not always possible
• Boolean Data Flow: Quasi-Static Scheduling of special
“patterns”
– if-then-else, repeat-until, do-while
• Dynamic Data Flow: run-time scheduling
– may run out of memory or deadlock at run time
• Kahn Process Networks: quasi-static scheduling using
Petri nets
– conservative: schedulable network may be declared
unschedulable
Outline
• The problem
– Synthesis of concurrent specifications
– Compiler optimizations across processes
• Previous work: Dataflow networks
– Static scheduling of SDF networks
– Code and data size optimization
• Quasi-Static Scheduling of process networks
– Petri net representation of process networks
– Scheduling and code generation
• Open problems
Quasi-Static Scheduling
QSS
• Sequentialize concurrent operations as much as possible
• less communication overhead
(run-time task generation)
• better starting point for compilation
(straight-line code from function blocks)
• Must handle
• data-dependent control
• multi-rate communication
The problem
• Given: a network of Kahn processes
– Kahn process: sequential function + ports
– communication: port-based, point-to-point, unidirectional, multi-rate
• Find: a single task
– functionally equivalent to the original network (modulo concurrency)
– driven by input stimuli (no OS intervention)
The scheduling procedure
1. Specify a network of processes
– process: C + communication operations
– netlist: connection between ports
2. Translate to the computational model:
Petri nets
3. Find a “schedule” on the Petri net
4. Translate the schedule to a task
[Figure: the TEMP-FILTER process next to its Petri net translation. The net has transitions "last = 0", "sample = READ(TSENSOR)", a T/F choice for the test |sample - last| > DIF, and "last = sample; WRITE(TDATA, sample)" on the T branch; TSENSOR and TDATA are the port places.]
Petri nets for Kahn process networks
• Sequential processes (1 token per process)
• Input/Output ports (communication with the environment)
• Channels (point-to-point communication between processes)
Petri nets for Kahn process networks
[Figure: Petri net fragments with True and False branches]
Data-dependent choices
• Conservative assumption (any outcome is possible)
Scheduling game
• Adversary: data choice + inputs (t1, t2, t3)
• Scheduler: the rest of transitions (t4, t5, t6)
[Figure: example play with adversary moves t1 t2, t1, t3 and scheduler moves t4 t5 t6]
Scheduling game
• Adversary: data choice + inputs (t1, t2, t3)
• Scheduler: the rest of transitions (t4, t5, t6)
[Figure: example play in which the adversary repeats t1 t2 t1 t2 t1 t2… and the scheduler fires t4 t4 t4]
Schedule generation
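[Figure: Petri net with source transitions a and d, transitions b, c, e, f, g and places p0 to p5, next to a schedule drawn as part of its reachability graph (states p0, p0p1, p2, p0p3, p4p5, p0p5, p0p5p1, p2p5, p0p3p5)]
Schedule is an RG subset:
1. Finite
2. Sequential
3. Live with respect to source transitions
4. All FCS transitions are fired in a state
(FCS: always conflicting transitions)
Depth first traversal with backtracking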
Schedule generation
[Figure: the same net and schedule, with the await states marked]
Handling infinity
PN with source transitions has infinite reachability space
Need for termination conditions during traversal
Irrelevance Criterion:
- Impose place “bounds” by the structure of the PN.
- Identify “irrelevant nodes” in the reachability tree.
- If the algorithm hits an irrelevant node, backtrack.
Bounds the reachability space!!!
Irrelevance criterion
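[Figure: place p with input arcs of weights i1 … im and output arcs of weights o1 … on; example arc weights 2, 4, 3, 1]
Bound of place p = max of
- $\max_k i_k + \max_j o_j - 1$
- the initial marking of p
Example: $\max(3 + 4 - 1,\ 1) = 6$
v is an irrelevant node iff, for some node u:
1. v succeeds u,
2. $\forall p:\ M(u, p) \le M(v, p)$ (v is at least as capable as u),
3. $\forall p$: if $M(u, p) < M(v, p)$, then $M(u, p) \ge$ the bound of p (u already hits the bounds).
Consequence: no traversal beyond v.
Irrelevance is more than marking, it is marking+history!!!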
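The bound and the irrelevance test translate directly into two small functions. This is a sketch under the assumption that markings are stored as integer arrays indexed by place and that the caller has already established that v succeeds u in the traversal; the function names are invented for illustration.

```c
#include <assert.h>

/* Bound of a place: max(largest input arc weight + largest output arc
 * weight - 1, initial marking). */
static int place_bound(const int *in_w, int n_in,
                       const int *out_w, int n_out, int m0)
{
    int max_in = 0, max_out = 0;
    for (int k = 0; k < n_in; k++)  if (in_w[k] > max_in)   max_in = in_w[k];
    for (int j = 0; j < n_out; j++) if (out_w[j] > max_out) max_out = out_w[j];
    int b = max_in + max_out - 1;
    return b > m0 ? b : m0;
}

/* Conditions 2 and 3 of the criterion for a node v with marking mv and an
 * earlier node u with marking mu. Condition 1 (v succeeds u) is assumed to
 * be checked by the caller. Returns 1 if v is irrelevant with respect to u. */
static int is_irrelevant(const int *mu, const int *mv,
                         const int *bound, int n_places)
{
    for (int p = 0; p < n_places; p++) {
        if (mu[p] > mv[p])
            return 0;  /* v does not cover u on place p */
        if (mu[p] < mv[p] && mu[p] < bound[p])
            return 0;  /* tokens grew on a place that was still below its bound */
    }
    return 1;
}
```

With the slide's numbers (largest arc weights 4 and 3, initial marking 1), place_bound returns 6. A depth-first traversal would call is_irrelevant for each ancestor u on the current path and backtrack as soon as one call returns 1.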
Quality of irrelevance criterion
Heuristic for the general Petri nets:
[Figure: example Petri net with places p1 to p6, transitions A to E and two arcs of weight 2, next to its reachability tree, in which one node is marked irrelevant]
For unique and/or free choice PNs irrelevance may be exact
(if yes, then schedulability is decidable in this class)
Open issue
Properties of the Algorithm
Claim1:
• If the algorithm terminates successfully, a
schedule is obtained.
Claim2:
• If the algorithm does NOT terminate
successfully, no schedule exists under given
termination conditions
Semi-decision procedure!!!
Divide and conquer
Single Source Schedules
[Figure: the example net with source transitions a and d. SSS(a) and SSS(d) are built separately; their composition is isomorphic to the schedule of the whole net.]
Divide and conquer
Single Source Schedules
[Figure: a variant of the example net with its SSS(a), its SSS(d) and their composition]
No isomorphic schedule exists
Independence of SSS!!!
Checking SSS independence
[Figure: SSS independence, consumption of tokens, marking equations]
Necessary and sufficient condition:
$$M_0(p) - \mathit{worst\_change}(p, a) - \mathit{SSS\_change}(p, a) \ge 0$$
where the subtracted terms cover the worst consumption of p in SSS(a) and the worst consumption of p in the other SSSs
Complexity of checking: O(|SSS|)
Composition has exponentially larger number of states!!!
Code generation
[Figure: a system with inputs I1 and I2 and its schedule, showing the initialization, the await states and a T/F choice]
Generated code:
• ISRs driven by input stimuli (I1 and I2)
• Each task contains threads from one await state to another await state
Code generation
[Figure: schedule of the system with inputs I1 and I2: initialization Init, await states S1, S2 and S3, code fragments C1 to C11 on the threads between await states, and a T/F choice at C8]
Code generation
[Figure: initialization thread: C0, then await state S1]
enum state {S1, S2, S3} S;
void Init(void) {
    C0(); S = S1; return;
}
Code generation
[Figure: threads triggered by input I1, one per await state S1, S2, S3]
enum state {S1, S2, S3} S;
void ISR1(void) {
    switch (S) {
    case S1: C1(); C2(); S = S2; return;
    case S2: C3(); C2(); return;
    case S3: C6(); C7(); C11(); C5(); return;
    }
}
Code generation
[Figure: threads triggered by input I2, one per await state S1, S2, S3, with a T/F choice at C8]
enum state {S1, S2, S3} S;
void ISR2(void) {
    switch (S) {
    case S1: C4(); C5(); S = S3; return;
    case S2: C10(); C11(); C5(); S = S3; return;
    case S3:
        if (C8()) {
            C7(); C11(); C5(); return;
        } else {
            C9(); S = S1; return;
        }
    }
}
Code generation
[Figure: Reset triggers Init(), input I1 triggers ISR1(), input I2 triggers ISR2(); the three routines share the state variable S]
Experimental Results
[Figure: MPEG-2 decoder process network with processes Thdr, Tvld, Tisiq, TdecMV, Tpredict, Tidct and Tadd, shown before and after QSS]
QSS applied to a subset of the MPEG-2 Decoder
(5 processes out of the original 11)
The MPEG2 decoder
                       Philips   QSS
Total                  7.5       4.1
  TestBench            0.27      0.28
  OS                   2.58      1.31
  MPEG                 4.66      2.51
    vld+hdr            0.94      0.94
    5 blocks           3.72      1.57
      Computation      1.49      1.44
      Communication    2.23      0.13
• Performance increased by 45%
– reduction of communication (no internal FIFOs
between statically scheduled processes)
– reduction of run-time scheduling (OS)
– no reduction in computation
Open problems
• Is a system schedulable ? (decidability)
• False paths in concurrent systems
(data dependencies)
• Synthesis for concurrent architectures
• Timing models
(Quasi) Static Scheduling approaches
• Lee et al. ‘86: Static Data Flow: cannot specify data-dependent control
• Buck et al. ‘94: Boolean Data Flow: undecidable
schedulability check, heuristic pattern-based algorithm
• Thoen et al. ‘99: Event graph: no schedulability check,
no task minimization
• Lin ‘97: Safe Petri Net: no schedulability check, single-rate, reachability-based algorithm
• Thiele et al. ‘99: Bounded Petri Net: partial
schedulability check, reachability-based algorithm
• Cortadella et al. ‘00: General Petri Net: maybe
undecidable schedulability check, balance equation-based algorithm
The false path problem (example)
while (true) {
a = rnd();
Write(ct, a, 1);
if (a > 0.5) {
Write(dt, d, 2);
} else {
Write(dt, d, 1);
}
}
Process P1
ct
dt
while (true) {
Read(ct, a, 1);
if (a > 0.5) {
Read(dt, d, 2);
} else {
Read(dt, d, 1);
}
}
Process P2
If P1 does Write(dt, d, 2), then P2 never does Read(dt, d, 1), i.e. this path is false!
False path elimination
• The designer manually tags sets of
correlated conditions
• An implicit data-dependent correlation
becomes explicit communication and
control-dependent synchronization
– the tool automatically adds synchronization
channels to model correlation
• Scheduling is then possible
– synchronization channels can be deleted after
code generation (no overhead in final
implementation)
False path elimination algorithm
#pragma tag sync
if (cond1) {
Write(syncT, d, 1);
stm1;
} else {
Write(syncF, d, 1);
stm2;
}
Process P1
port
syncT
syncF
#pragma tag sync
(void) (cond2);
switch(Select(syncT,syncF)) {
case 0: Read(syncT, d, 1);
stm3; break;
case 1: Read(syncF, d, 1);
stm4; break;
}
Process P2
False path elimination algorithm
1. For each correlated control pair, add two ports
<tag>T and <tag>F to processes P1 and P2 and
connect them.
2. Add Write statements at the beginning of both
branches of if-then-else, writing on the created
ports in process P1.
3. Delete if-then-else from process P2 and add a
switch on the output of a Select statement on
the created ports.
4. Fill in the case clauses with the appropriate
code from the branches in process P2,
reading data from the created ports.
5. Finally, apply QSS and eliminate the added
synchronization.
False path elimination algorithm
QSS
(void) (cond2);
if (cond1) {
Write(syncT, P1_d, 1);
Read(syncT, P2_d, 1);
stm1; stm3;
} else {
Write(syncF, P1_d, 1);
Read(syncF, P2_d, 1);
stm2; stm4;
}
Process P
Conclusions
• QSS shows significant gains in real examples
• Current theory has several open problems
• Future extensions are:
[Figure: QSS extended in three directions: concurrency, timing, dataflow analysis]