K1-Co-synthesis Algorithms (HW-SW Partitioning)


Design & Co-design of
Embedded Systems
HW/SW Partitioning Algorithms
+ Process Scheduling
Maziar Goudarzi
Today's Program
Introduction
Preliminaries
Hardware/Software Partitioning
Distributed System Co-Synthesis (Next session)
Reference:
Wayne Wolf, “Hardware/Software Co-Synthesis Algorithms,”
Chapter 2, Hardware/Software Co-Design: Principles and
Practice, Eds: J. Staunstrup, W. Wolf, Kluwer Academic
Publishers, 1997.
Fall 2005
Design & Co-design of
Embedded Systems
2
Topics
Introduction
A Classification
Examples
Vulcan
Cosyma
Introduction to
HW/SW Partitioning
The first variety of co-synthesis applications
Definition
A HW/SW partitioning algorithm implements a
specification on some sort of multiprocessor
architecture
Usually
Multiprocessor architecture = one CPU + some
ASICs on CPU bus
Introduction to
HW/SW Partitioning (cont’d)
Terminology
Allocation
Synthesis methods which design the multiprocessor
topology along with the PEs and SW architecture
Scheduling
The process of assigning PE (CPU and/or ASIC) time to processes for execution
Introduction to
HW/SW Partitioning (cont’d)
In most partitioning algorithms
Type of CPU is fixed and given
ASICs must be synthesized
What function to implement on each ASIC?
What characteristics should the implementation have?
These are single-rate synthesis problems
A CDFG is the starting model
HW/SW Partitioning (cont’d)
Normal use of architectural components
CPU performs less computationally-intensive
functions
ASICs used to accelerate core functions
Where to use?
High-performance applications
No CPU is fast enough for the operations
Low-cost applications
ASIC accelerators allow use of much smaller, cheaper CPU
A Classification
Criterion: Optimization Strategy
Trade-off between Performance and Cost
Primal Approach
Performance is the primary goal
First, all functionality in ASICs. Progressively move more
to CPU to reduce cost.
Dual Approach
Cost is the primary goal
First, all functions in the CPU. Move operations to the
ASIC to meet the performance goal.
A Classification (cont’d)
Classification by optimization strategy (cont'd)
Example co-synthesis systems
Vulcan (Stanford): Primal strategy
Cosyma (Braunschweig, Germany): Dual strategy
Co-Synthesis Algorithms:
HW/SW Partitioning
HW/SW Partitioning
Examples:
Vulcan
Partitioning Examples:
Vulcan
Gupta, De Micheli, Stanford University
Primal approach
1. All-HW initial implementation.
2. Iteratively move functionality to CPU to reduce cost.
System specification language
HardwareC
Is compiled into a flow graph
Partitioning Examples:
Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. "x=a; y=b;" compiles to a nop source node branching to operation nodes x=a and y=b; "if (c>d) x=e; else y=f;" compiles to a cond node with an edge guarded by c>d to node x=e and an edge guarded by c<=d to node y=f.]
Partitioning Examples:
Vulcan (cont’d)
Flow Graph Definition
A variation of a (single-rate) task graph
Nodes
Represent operations
Typically low-level operations: mult, add
Edges
Represent data dependencies
Each contains a Boolean condition under which the edge
is traversed
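A flow graph of this kind can be sketched as a small data structure. This is an illustrative encoding only (the class and field names are not Vulcan's), showing nodes as operations and edges as guarded data dependencies:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    src: str
    dst: str
    guard: str = "true"   # Boolean condition under which the edge is traversed

@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)   # name -> operation ("mult", "add", ...)
    edges: list = field(default_factory=list)

    def add_op(self, name, op):
        self.nodes[name] = op

    def add_dep(self, src, dst, guard="true"):
        self.edges.append(Edge(src, dst, guard))

# The "if (c>d) x=e; else y=f;" example from the previous slide:
g = FlowGraph()
g.add_op("cond", "cmp")
g.add_op("x=e", "assign")
g.add_op("y=f", "assign")
g.add_dep("cond", "x=e", guard="c>d")
g.add_dep("cond", "y=f", guard="c<=d")
```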
Partitioning Examples:
Vulcan (cont’d)
Flow Graph
is executed repeatedly at some rate
can have initiation-time constraints for each node
t(vi) + lij ≤ t(vj) ≤ t(vi) + uij
can have rate constraints on each node
mi ≤ Ri ≤ Mi
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis Algorithm
Partitioning quantum is a thread
Algorithm divides the flow graph into threads and
allocates them
Thread boundary is determined by
1. (always) a non-deterministic delay element, such as a wait on an external variable
2. (optionally) other points of the flow graph
Target architecture
CPU + Co-processor (multiple ASICs)
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Allocation
Primal approach
Scheduling
is done by a scheduler on the target CPU
• is generated as part of synthesis process
• schedules all threads (both HW and SW threads)
cannot be static, due to some threads' non-deterministic initiation times
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Cost estimation
SW implementation
• Code size
– relatively straightforward
• Data size
– Biggest challenge.
– Vulcan devotes some effort to finding bounds for each thread
HW implementation
•?
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Performance estimation
Both SW- and HW-implementation
• From flow-graph, and basic execution times for the operators
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details
Partitioning goal
Map each thread to one of two partitions
• CPU Set: FS
• Co-processor set: FH
Required execution-rate must be met, and total cost
minimized
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Algorithm steps
1. Put all threads in FH set
2. Iteratively do
2.1. Move some operations to FS.
2.1.1. Select a group of operations to move to FS.
2.1.2. Check performance feasibility, by computing worst-case
delay through flow-graph given the new thread times
2.1.3. Do the move, if feasible
2.2. Incrementally update the new cost-function to reflect the
new partition
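The iteration above can be sketched as a short loop. This is a hedged illustration, not Vulcan's actual code: the names (partition, delay_if_moved, deadline) are invented, step 2.2's incremental cost update is omitted for brevity, and a toy delay model stands in for the worst-case delay computation:

```python
def partition(threads, delay_if_moved, deadline):
    """All threads start in the HW set FH; move threads to the SW set FS
    while the performance constraint remains feasible (no backtracking)."""
    FH, FS = set(threads), set()
    candidates = list(threads)
    while candidates:
        t = candidates.pop(0)                       # 2.1.1 select a thread to move
        if delay_if_moved(FS | {t}) <= deadline:    # 2.1.2 check feasibility
            FH.discard(t)                           # 2.1.3 do the move if feasible
            FS.add(t)
    return FH, FS

# Toy example: each thread moved to SW adds 2 time units of worst-case delay.
threads = ["t1", "t2", "t3"]
fh, fs = partition(threads, lambda sw: 2 * len(sw), deadline=4)
```

With a deadline of 4, only two threads fit in software before the worst-case delay is exceeded; the third stays in hardware.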
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Vulcan cost function
f(w) = c1Sh(FH) - c2Ss(FS) + c3B - c4P + c5|m|
c: weight constants
S(): Size functions
B: Bus utilization (<1)
P: Processor utilization (<1)
m: total number of variables to be transferred between
the CPU and the co-processor
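The cost function on this slide transcribes directly into code. The weights and input values below are placeholders, not values Vulcan produces:

```python
def vulcan_cost(size_hw, size_sw, bus_util, cpu_util, n_transfer,
                c=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m|"""
    c1, c2, c3, c4, c5 = c
    return (c1 * size_hw - c2 * size_sw
            + c3 * bus_util - c4 * cpu_util + c5 * n_transfer)
```

Note the signs: hardware size, bus utilization, and communicated variables are penalized, while software size and processor utilization are rewarded, pushing functionality toward the CPU.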
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Complementary notes
A heuristic to minimize communication
• Once a thread is moved to FS, its immediate successors
are placed in the list for evaluation in the next iteration.
No back-track
• Once a thread is assigned to FS, it remains there
Experimental results
• Produces implementations considerably faster than all-SW, yet much cheaper than all-HW designs
Co-Synthesis Algorithms:
HW/SW Partitioning
HW/SW Partitioning
Examples:
Cosyma
Partitioning Examples:
Cosyma
Rolf Ernst, et al: Technical University of
Braunschweig, Germany
Dual approach
1. All-SW initial implementation.
2. Iteratively move basic blocks to the ASIC
accelerator to meet performance objective.
System specification language
Cx
Is compiled into an ESG (Extended Syntax Graph)
ESG is much like a CDFG
Partitioning Examples:
Cosyma (cont’d)
Cosyma Co-synthesis Algorithm
Partitioning quantum is a Basic Block
A basic block is a branch-free section of a program
Target Architecture
CPU + accelerator ASIC(s)
Scheduling
Allocation
Cost Estimation
Performance Estimation
Algorithm Details
Partitioning Examples:
Cosyma (cont’d)
Cosyma Co-synthesis Algorithm (cont’d)
Performance Estimation
SW implementation
• Done by examining the object code for the basic block
generated by a compiler
HW implementation
• Assumes one operator per clock cycle.
• Creates a list schedule for the DFG of the basic block.
• Depth of the list gives the number of clock cycles required.
Communication
• Done by data-flow analysis of the adjacent basic blocks.
• In shared memory:
– Proportional to the number of variables to be accessed
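The HW estimate above (one operator per clock cycle, list-schedule depth = cycle count) reduces to finding the longest dependence chain in the basic block's DFG. A minimal sketch, with an illustrative DFG encoding that is not Cosyma's:

```python
def hw_cycles(dfg):
    """dfg: {node: [predecessor nodes]}. With one operator per clock
    cycle, the schedule depth equals the longest dependence chain."""
    depth = {}
    def d(n):
        if n not in depth:
            preds = dfg.get(n, [])
            depth[n] = 1 + (max(d(p) for p in preds) if preds else 0)
        return depth[n]
    return max(d(n) for n in dfg)

# x = (a+b) * (c+d): the two adds run in parallel, then the multiply.
dfg = {"add1": [], "add2": [], "mul": ["add1", "add2"]}
```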
Partitioning Examples:
Cosyma (cont’d)
Algorithm Steps
Change in execution-time caused by moving basic
block b from CPU to ASIC:
Δc(b) = w × (tHW(b) − tSW(b) + tcom(Z) − tcom(Z ∪ b)) × It(b)
w: constant weight
t(b): execution time of basic block b
tcom(Z): estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
It(b): total number of times that b is executed
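The Δc formula transcribes directly; the inputs below are placeholder numbers, not Cosyma estimates. A negative Δc means moving the block to hardware reduces total execution time:

```python
def delta_c(w, t_hw, t_sw, t_com_z, t_com_z_with_b, iterations):
    """Dc(b) = w * (tHW(b) - tSW(b) + tcom(Z) - tcom(Z u {b})) * It(b)"""
    return w * (t_hw - t_sw + t_com_z - t_com_z_with_b) * iterations

# A block executed 100 times: 8 cycles faster in HW per execution,
# but moving it adds 2 cycles of communication overhead each time.
change = delta_c(1.0, t_hw=2, t_sw=10, t_com_z=5, t_com_z_with_b=7,
                 iterations=100)
```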
Partitioning Examples:
Cosyma (cont’d)
Experimental Results
By moving only basic-blocks to HW
Typical speedup of only 2x
Reason:
• Limited intra-basic-block parallelism
Cure:
• Implement several control-flow optimizations to increase parallelism
in the basic block, and hence in ASIC
• Examples: loop pipelining, speculative branch execution with multiple
branch prediction, operator pipelining
Result:
• Speedups: 2.7 to 9.7
• CPU times: 35 to 304 seconds on a typical workstation
Summary
What’s co-synthesis
Various keywords used in classification of co-synthesis algorithms
HW/SW Partitioning: One broad category of
co-synthesis algorithms
Criteria by which a co-synthesis algorithm is
categorized
Processes and operating
systems
Scheduling policies:
RMS;
EDF.
Scheduling modeling assumptions.
Interprocess communication.
Power management.
Reference (and slides):
Wayne Wolf, "Computers as Components: Principles of Embedded Computing System Design," Chapter 6 (Processes and Operating Systems), MKP, 2001.
© 2000 Morgan Kaufman, Overheads for Computers as Components
Metrics
How do we evaluate a scheduling policy:
Ability to satisfy all deadlines.
CPU utilization---percentage of time devoted
to useful work.
Scheduling overhead---time required to
make scheduling decision.
Rate monotonic
scheduling
RMS (Liu and Layland): widely-used,
analyzable scheduling policy.
Analysis is known as Rate Monotonic
Analysis (RMA).
RMA model
All processes run on a single CPU.
Zero context switch time.
No data dependencies between
processes.
Process execution time is constant.
Deadline is at end of period.
Highest-priority ready process runs.
Process parameters
Ti is computation time of process i; ti is
period of process i.
[Figure: process Pi executes for computation time Ti once every period ti.]
Rate-monotonic analysis
Response time: time required to finish
process.
Critical instant: scheduling state that
gives worst response time.
Critical instant occurs when all higher-priority processes are ready to execute.
Critical instant
[Figure: P4's critical instant occurs when the higher-priority interfering processes P1, P2, and P3 all become ready at the same time as P4.]
RMS priorities
Optimal (fixed) priority assignment:
shortest-period process gets highest priority;
priority inversely proportional to period;
break ties arbitrarily.
No fixed-priority scheme does better.
RMS example
[Figure: RMS schedule from time 0 to 10. P1, with the shorter period, has higher priority and runs at the start of each of its periods; P2 runs in the time left over within its longer period.]
RMS CPU utilization
Utilization for n processes is
U = Σi Ti / ti
As number of tasks approaches infinity,
maximum utilization approaches 69%.
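The bound behind that 69% figure is the Liu–Layland least upper bound, n(2^(1/n) − 1), which tends to ln 2 ≈ 0.693 as n grows. A small sketch (the schedulability check is the standard sufficient utilization test, not a necessary one):

```python
def rms_bound(n):
    """Least upper bound on RMS utilization for n tasks: n*(2^(1/n) - 1)."""
    return n * (2 ** (1 / n) - 1)

def rms_schedulable(tasks):
    """tasks: list of (computation_time, period).
    Sufficient test only: a set failing it may still be schedulable."""
    u = sum(t / p for t, p in tasks)
    return u <= rms_bound(len(tasks))
```

For one task the bound is 1.0 (the CPU can be fully used); for two tasks it drops to about 0.828, and it decreases toward 0.693 from there.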
RMS CPU utilization,
cont’d.
RMS cannot guarantee 100% CPU utilization, even with zero context-switch overhead.
Must keep idle cycles available to handle
worst-case scenario.
However, when the utilization bound is met, RMS guarantees all processes will always meet their deadlines.
RMS implementation
Efficient implementation:
scan processes;
choose highest-priority active process.
Earliest-deadline-first
scheduling
EDF: dynamic priority scheduling
scheme.
Process closest to its deadline has
highest priority.
Requires recalculating process priorities at every timer interrupt.
EDF example
[Figure: EDF schedule of P1 and P2; at each point, the process with the nearest deadline runs.]
EDF analysis
EDF can use 100% of CPU.
But EDF may miss a deadline.
EDF implementation
On each timer interrupt:
compute time to deadline;
choose process closest to deadline.
Generally considered too expensive to
use in practice.
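The per-interrupt decision can be sketched in a few lines (illustrative names; a real implementation would also handle ties and process activation):

```python
def edf_pick(ready, now):
    """ready: list of (name, absolute_deadline).
    Return the name of the ready process closest to its deadline."""
    return min(ready, key=lambda p: p[1] - now)[0]

# At time 10, P2's deadline (12) is nearest, so P2 is chosen.
ready = [("P1", 30), ("P2", 12), ("P3", 25)]
```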
Fixing scheduling
problems
What if your set of processes is
unschedulable?
Change deadlines in requirements.
Reduce execution times of processes.
Get a faster CPU.
Priority inversion
Priority inversion: low-priority process
keeps high-priority process from running.
Improper use of system resources can
cause scheduling problems:
Low-priority process grabs I/O device.
High-priority process needs the I/O device, but can't get it until the low-priority process is done.
Can cause deadlock.
Solving priority inversion
Give priorities to system resources.
Have process inherit the priority of a
resource that it requests.
Low-priority process inherits priority of
device if higher.
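The inheritance rule can be sketched as follows (a minimal illustration with invented names, ignoring nested resources and chains of inheritance):

```python
class Process:
    def __init__(self, name, base_prio):
        self.name = name
        self.base_prio = base_prio
        self.prio = base_prio          # effective (possibly inherited) priority

def block_on(holder, requester):
    # The resource holder inherits the requester's priority if it is higher,
    # so medium-priority processes can no longer preempt it.
    if requester.prio > holder.prio:
        holder.prio = requester.prio

def release(holder):
    holder.prio = holder.base_prio     # drop back to base priority on release

low = Process("low", 1)
high = Process("high", 10)
block_on(low, high)                    # low now runs at priority 10
```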
Data dependencies
Data dependencies allow us to improve utilization.
Restrict combinations of processes that can run simultaneously.
P1 and P2 can't run simultaneously.
[Figure: task graph with a data-dependency edge from P1 to P2.]
Context-switching time
Non-zero context switch time can push
limits of a tight schedule.
Hard to calculate effects---depends on
order of context switches.
In practice, OS context switch overhead
is small.
What about interrupts?
Interrupts take time away from processes.
Perform minimum work possible in the interrupt handler.
[Figure: timeline in which an interrupt preempts P1; the OS runs the handler, then resumes scheduling P2 and P3.]
Device processing
structure
Interrupt service routine (ISR) performs
minimal I/O.
Get register values, put register values.
Interrupt service process/thread
performs most of device function.
Summary
Two major scheduling policies
RMS
EDF
Other parameters affecting scheduling
Priority inversion
Data dependencies
Context switching time
Interrupts
Assignment
Questions from Chapter 6 of the textbook
From 6.17 to 6.26
Due date:
Two weeks: Sunday, Azar 27th