K1-Co-synthesis Algorithms (HW-SW Partitioning)
Design & Co-design of Embedded Systems
HW/SW Partitioning Algorithms + Process Scheduling
Maziar Goudarzi
Today's Program
Introduction
Preliminaries
Hardware/Software Partitioning
Distributed System Co-Synthesis (Next session)
Reference:
Wayne Wolf, “Hardware/Software Co-Synthesis Algorithms,” Chapter 2 in Hardware/Software Co-Design: Principles and Practice, J. Staunstrup and W. Wolf (eds.), Kluwer Academic Publishers, 1997.
Topics
Introduction
A Classification
Examples
Vulcan
Cosyma
Introduction to
HW/SW Partitioning
The first variety of co-synthesis applications
Definition
A HW/SW partitioning algorithm implements a
specification on some sort of multiprocessor
architecture
Usually
Multiprocessor architecture = one CPU + some ASICs on the CPU bus
Introduction to
HW/SW Partitioning (cont’d)
Terminology
Allocation
Synthesis methods that design the multiprocessor topology along with the PEs and the SW architecture
Scheduling
The process of assigning PE (CPU and/or ASIC) time to processes for execution
Introduction to
HW/SW Partitioning (cont’d)
In most partitioning algorithms
Type of CPU is fixed and given
ASICs must be synthesized
What function to implement on each ASIC?
What characteristics should the implementation have?
These are single-rate synthesis problems
The CDFG is the starting model
HW/SW Partitioning (cont’d)
Normal use of architectural components
CPU performs less computationally-intensive
functions
ASICs used to accelerate core functions
Where to use?
High-performance applications
No CPU is fast enough for the operations
Low-cost applications
ASIC accelerators allow use of a much smaller, cheaper CPU
A Classification
Criterion: Optimization Strategy
Trade-off between Performance and Cost
Primal Approach
Performance is the primary goal
First, all functionality in ASICs. Progressively move more
to CPU to reduce cost.
Dual Approach
Cost is the primary goal
First, all functions in the CPU. Move operations to the
ASIC to meet the performance goal.
A Classification (cont’d)
Classification by optimization strategy (cont'd)
Example co-synthesis systems
Vulcan (Stanford): Primal strategy
Cosyma (Braunschweig, Germany): Dual strategy
Co-Synthesis Algorithms: HW/SW Partitioning
Examples: Vulcan
Partitioning Examples:
Vulcan
Gupta, De Micheli, Stanford University
Primal approach
1. All-HW initial implementation.
2. Iteratively move functionality to CPU to reduce cost.
System specification language
HardwareC
Is compiled into a flow graph
Partitioning Examples:
Vulcan (cont’d)
[Figure: HardwareC fragments and their flow graphs. The sequence “x=a; y=b;” compiles to a nop source node forking to operation nodes x=a and y=b; the conditional “if (c>d) x=e; else y=f;” compiles to a cond node whose outgoing edges carry the conditions c>d (to x=e) and c<=d (to y=f).]
Partitioning Examples:
Vulcan (cont’d)
Flow Graph Definition
A variation of a (single-rate) task graph
Nodes
Represent operations
Typically low-level operations: mult, add
Edges
Represent data dependencies
Each contains a Boolean condition under which the edge
is traversed
Partitioning Examples:
Vulcan (cont’d)
Flow Graph
is executed repeatedly at some rate
can have initiation-time constraints for each node
t(vi) + lij ≤ t(vj) ≤ t(vi) + uij
can have rate constraints on each node
mi ≤ Ri ≤ Mi
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis Algorithm
Partitioning quantum is a thread
Algorithm divides the flow graph into threads and
allocates them
Thread boundary is determined by
1. (always) a non-deterministic delay element, such as a wait on an external variable
2. (optionally) other points of the flow graph
Target architecture
CPU + Co-processor (multiple ASICs)
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Allocation
Primal approach
Scheduling
is done by a scheduler running on the target CPU
• generated as part of the synthesis process
• schedules all threads (both HW and SW threads)
cannot be static, because some threads have non-deterministic initiation times
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Cost estimation
SW implementation
• Code size
– relatively straightforward
• Data size
– Biggest challenge.
– Vulcan expends some effort to find bounds for each thread
HW implementation
•?
Partitioning Examples:
Vulcan (cont’d)
Vulcan Co-synthesis algorithm (cont’d)
Performance estimation
Both SW- and HW-implementation
• From the flow graph and basic execution times for the operators
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details
Partitioning goal
Map each thread to one of two partitions
• CPU Set: FS
• Co-processor set: FH
Required execution-rate must be met, and total cost
minimized
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Algorithm steps
1. Put all threads in FH set
2. Iteratively do
2.1. Move some operations to FS.
2.1.1. Select a group of operations to move to FS.
2.1.2. Check performance feasibility, by computing worst-case
delay through flow-graph given the new thread times
2.1.3. Do the move, if feasible
2.2. Incrementally update the cost function to reflect the new partition
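As a toy illustration of these steps (not Vulcan's actual code: the thread data, the serial worst-case-delay model, and the greedy move order are all simplifying assumptions), the primal loop might look like this in C:

#include <stdio.h>

/* Toy primal partitioning loop: start all-HW, greedily move threads to
   the CPU set FS while a (grossly simplified) serial worst-case delay
   still meets the deadline. All numbers are illustrative. */
typedef struct { double hw_time, sw_time, hw_area; int in_FS; } Thread;

static double worst_case_delay(const Thread *t, int n) {
    double d = 0;                         /* serial-execution approximation */
    for (int i = 0; i < n; i++)
        d += t[i].in_FS ? t[i].sw_time : t[i].hw_time;
    return d;
}

static double hw_cost(const Thread *t, int n) {
    double a = 0;                         /* area of threads left in FH */
    for (int i = 0; i < n; i++)
        if (!t[i].in_FS) a += t[i].hw_area;
    return a;
}

int main(void) {
    Thread t[] = { {1.0, 4.0, 10.0, 0},   /* step 1: all threads in FH */
                   {2.0, 3.0, 25.0, 0},
                   {0.5, 6.0,  5.0, 0} };
    const int n = 3; const double deadline = 9.0;

    for (int i = 0; i < n; i++) {         /* step 2: iterate over moves */
        t[i].in_FS = 1;                   /* 2.1.1: candidate move to FS */
        if (worst_case_delay(t, n) > deadline)  /* 2.1.2: feasibility */
            t[i].in_FS = 0;               /* infeasible, so revert */
    }
    printf("delay = %.1f, HW cost = %.1f\n",    /* 2.2: resulting cost */
           worst_case_delay(t, n), hw_cost(t, n));
    return 0;
}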
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Vulcan cost function
f(w) = c1·Sh(FH) − c2·Ss(FS) + c3·B − c4·P + c5·|m|
ci: weight constants
Sh(), Ss(): hardware and software size functions, respectively
B: bus utilization (< 1)
P: processor utilization (< 1)
m: set of variables transferred between the CPU and the co-processor (|m| is its count)
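Read as code, the weighted sum might look like the following C function; the weight values, like the estimator inputs, are illustrative placeholders rather than Vulcan's actual numbers:

/* Vulcan-style cost function f(w); weights and inputs are placeholders. */
double vulcan_cost(double size_FH, double size_FS,
                   double bus_util, double cpu_util, int num_vars) {
    const double c1 = 1.0, c2 = 0.1, c3 = 5.0, c4 = 5.0, c5 = 0.2;
    return c1 * size_FH     /* Sh(FH): hardware size, penalized        */
         - c2 * size_FS     /* Ss(FS): software size                   */
         + c3 * bus_util    /* B: bus utilization (< 1)                */
         - c4 * cpu_util    /* P: processor utilization (< 1)          */
         + c5 * num_vars;   /* |m|: CPU/co-processor shared variables  */
}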
Partitioning Examples:
Vulcan (cont’d)
Algorithm Details (cont’d)
Complementary notes
A heuristic to minimize communication
• Once a thread is moved to FS, its immediate successors
are placed in the list for evaluation in the next iteration.
No backtracking
• Once a thread is assigned to FS, it remains there
Experimental results
• implementations considerably faster than all-SW, yet much cheaper than all-HW, are produced
Co-Synthesis Algorithms: HW/SW Partitioning
Examples: Cosyma
Partitioning Examples:
Cosyma
Rolf Ernst et al., Technical University of Braunschweig, Germany
Dual approach
1. All-SW initial implementation.
2. Iteratively move basic blocks to the ASIC
accelerator to meet performance objective.
System specification language
Cx
Is compiled into an ESG (Extended Syntax Graph)
ESG is much like a CDFG
Partitioning Examples:
Cosyma (cont’d)
Cosyma Co-synthesis Algorithm
Partitioning quantum is a Basic Block
A basic block is a branch-free sequence of program statements
Target Architecture
CPU + accelerator ASIC(s)
Scheduling
Allocation
Cost Estimation
Performance Estimation
Algorithm Details
Partitioning Examples:
Cosyma (cont’d)
Cosyma Co-synthesis Algorithm (cont’d)
Performance Estimation
SW implementation
• Done by examining the object code for the basic block
generated by a compiler
HW implementation
• Assumes one operator per clock cycle.
• Creates a list schedule for the DFG of the basic block.
• Depth of the list gives the number of clock cycles required.
Communication
• Done by data-flow analysis of the adjacent basic blocks.
• In shared memory
– proportional to the number of variables to be accessed
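To make the HW-time estimate above concrete, here is a self-contained C sketch under the stated one-operation-per-cycle assumption (the DAG representation is mine, not Cosyma's): with no resource limits, the schedule depth is simply the longest dependency chain in the basic block's DFG.

#include <stdio.h>

#define N 5  /* operations in the basic block's DFG (toy example) */

/* adj[i][j] = 1 if operation i produces a value consumed by j */
static const int adj[N][N] = {
    {0,0,1,0,0},   /* op0 -> op2 */
    {0,0,1,1,0},   /* op1 -> op2, op3 */
    {0,0,0,0,1},   /* op2 -> op4 */
    {0,0,0,0,1},   /* op3 -> op4 */
    {0,0,0,0,0},
};

/* With one operation per clock cycle and no resource limits, cycles
   needed = length of the longest dependency chain ending at this op. */
static int asap_cycle(int op, int depth[]) {
    if (depth[op]) return depth[op];      /* memoized result */
    int d = 1;                            /* the op itself takes a cycle */
    for (int p = 0; p < N; p++)
        if (adj[p][op]) {
            int pd = asap_cycle(p, depth) + 1;
            if (pd > d) d = pd;
        }
    return depth[op] = d;
}

int main(void) {
    int depth[N] = {0}, cycles = 0;
    for (int op = 0; op < N; op++)
        if (asap_cycle(op, depth) > cycles) cycles = depth[op];
    printf("estimated HW cycles: %d\n", cycles);  /* prints 3 here */
    return 0;
}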
Partitioning Examples:
Cosyma (cont’d)
Algorithm Steps
Change in execution time caused by moving basic block b from CPU to ASIC:
Δc(b) = w · ( tHW(b) − tSW(b) + tcom(Z) − tcom(Z ∪ b) ) · It(b)
w: constant weight
tHW(b), tSW(b): execution time of basic block b in hardware and software, respectively
tcom(Z): estimated communication time between the CPU and the accelerator ASIC, given the set Z of basic blocks implemented on the ASIC
It(b): total number of times that b is executed
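A direct C transcription of the formula (the types and the t_com model are hypothetical stand-ins; the communication estimate is made proportional to shared variables, as the previous slide describes):

/* Delta-c(b): change in execution time from moving basic block b to
   the ASIC. BB fields and the t_com model are illustrative. */
typedef struct { double t_hw, t_sw; long exec_count; int shared_vars; } BB;

static double t_com(const BB *on_asic, int n) {
    double t = 0;                    /* proportional to variables accessed */
    for (int i = 0; i < n; i++)
        t += 0.5 * on_asic[i].shared_vars;   /* 0.5: made-up per-var cost */
    return t;
}

double delta_c(const BB *b, const BB *Z, int nZ,
               const BB *Z_with_b, double w) {
    return w * (b->t_hw - b->t_sw              /* HW vs. SW time for b      */
                + t_com(Z, nZ)                 /* communication before move */
                - t_com(Z_with_b, nZ + 1))     /* communication after move  */
             * (double)b->exec_count;          /* It(b): execution count    */
}

A negative delta_c(b) means moving b to the accelerator saves time, so the dual loop moves the most profitable blocks first until the performance objective is met.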
Partitioning Examples:
Cosyma (cont’d)
Experimental Results
By moving only basic-blocks to HW
Typical speedup of only 2x
Reason:
• Limited intra-basic-block parallelism
Cure:
• Implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
• Examples: loop pipelining, speculative branch execution with multiple
branch prediction, operator pipelining
Result:
• Speedups: 2.7 to 9.7
• CPU times: 35 to 304 seconds on a typical workstation
Summary
What co-synthesis is
Various keywords used in classifying co-synthesis algorithms
HW/SW Partitioning: One broad category of
co-synthesis algorithms
Criteria by which a co-synthesis algorithm is
categorized
Processes and operating
systems
Scheduling policies:
RMS;
EDF.
Scheduling modeling assumptions.
Interprocess communication.
Power management.
Reference (and slides):
Wayne Wolf, “Computers as Components: Principles of Embedded Computing System Design,” Chapter 6 (Processes and Operating Systems), Morgan Kaufmann, 2001.
Metrics
How do we evaluate a scheduling policy:
Ability to satisfy all deadlines.
CPU utilization---percentage of time devoted
to useful work.
Scheduling overhead---time required to
make scheduling decision.
Rate monotonic
scheduling
RMS (Liu and Layland): a widely used, analyzable scheduling policy.
Analysis is known as Rate Monotonic
Analysis (RMA).
RMA model
All processes run on a single CPU.
Zero context switch time.
No data dependencies between
processes.
Process execution time is constant.
Deadline is at end of period.
Highest-priority ready process runs.
Process parameters
Ti is computation time of process i; ti is
period of process i.
[Figure: timeline of process Pi, showing computation time Ti within its period ti.]
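In C, these per-process parameters could be captured in a small struct (the field names are mine, not the book's):

/* Per-process parameters used by the scheduling analyses below. */
typedef struct {
    const char *name;   /* process Pi                            */
    double T;           /* Ti: worst-case computation time       */
    double t;           /* ti: period; deadline is at its end    */
} Process;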
Rate-monotonic analysis
Response time: time required to finish a process.
Critical instant: the scheduling state that gives the worst-case response time.
The critical instant occurs when all higher-priority processes are ready to execute.
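One standard way to turn the critical instant into a schedulability test (a well-known response-time recurrence consistent with this model, though not shown on the slide): release Pi together with every higher-priority process; its worst-case response time Ri then satisfies

Ri = Ti + sum over j in hp(i) of ceil(Ri / tj) * Tj

where hp(i) is the set of processes with priority above Pi. Iterating from Ri = Ti until the value stops changing gives the fixed point, and Pi meets its deadline if Ri ≤ ti.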
Critical instant
[Figure: critical-instant timelines. A process such as P4 experiences its worst-case response time at the critical instant, when it is released together with all interfering higher-priority processes P1, P2, and P3.]
RMS priorities
Optimal (fixed) priority assignment:
shortest-period process gets highest priority;
priority inversely proportional to period;
break ties arbitrarily.
No fixed-priority scheme does better.
RMS example
[Figure: RMS schedule over time 0 to 10. P1, having the shorter period, gets higher priority and runs at the start of each P1 period, preempting P2.]
RMS CPU utilization
Utilization for n processes is
U = Σi Ti / ti
As the number of tasks approaches infinity, the worst-case schedulable utilization approaches ln 2 ≈ 69%.
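A minimal C check of this bound on an illustrative three-task set (compile with -lm); the n(2^(1/n) - 1) expression is the Liu-Layland bound, which tends to ln 2 ≈ 0.69 as n grows:

#include <stdio.h>
#include <math.h>

int main(void) {
    double T[] = {1.0, 2.0, 3.0};       /* computation times Ti */
    double t[] = {4.0, 8.0, 12.0};      /* periods ti           */
    int n = 3;

    double U = 0.0;                     /* U = sum of Ti/ti */
    for (int i = 0; i < n; i++) U += T[i] / t[i];

    double bound = n * (pow(2.0, 1.0 / n) - 1.0);
    printf("U = %.3f, bound = %.3f -> %s\n", U, bound,
           U <= bound ? "guaranteed schedulable" : "bound inconclusive");
    return 0;
}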
RMS CPU utilization,
cont’d.
RMS cannot guarantee 100% CPU utilization, even with zero context-switch overhead.
Idle cycles must be kept available to handle the worst-case scenario.
However, when the utilization bound is met, RMS guarantees that all processes will always meet their deadlines.
RMS implementation
Efficient implementation:
scan processes;
choose highest-priority active process.
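A minimal sketch of that scan in C (naming is mine; a real kernel would index ready queues by priority instead of scanning): since RMS priority is fixed and inversely proportional to period, picking the shortest-period ready process is the same as picking the highest-priority one.

typedef struct { double period; int ready; } Proc;

/* Return the index of the process to dispatch, or -1 if all idle. */
int rms_pick(const Proc *p, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (p[i].ready && (best < 0 || p[i].period < p[best].period))
            best = i;                   /* shortest period wins */
    return best;
}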
Earliest-deadline-first
scheduling
EDF: dynamic priority scheduling
scheme.
Process closest to its deadline has
highest priority.
Requires recalculating process priorities at every timer interrupt.
EDF example
[Figure: EDF schedule for processes P1 and P2.]
EDF analysis
EDF can use 100% of the CPU.
But if the CPU is overloaded, EDF may miss a deadline.
EDF implementation
On each timer interrupt:
compute time to deadline;
choose process closest to deadline.
Generally considered too expensive to
use in practice.
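A sketch of the per-interrupt selection in C (illustrative naming), which also shows where the overhead comes from: a full recomputation and scan on every timer tick.

typedef struct { double deadline; int ready; } Task;

/* At each timer interrupt: recompute time-to-deadline for every ready
   task and return the one closest to its deadline, or -1 if all idle. */
int edf_pick(const Task *t, int n, double now) {
    int best = -1;
    double best_slack = 0.0;
    for (int i = 0; i < n; i++) {
        if (!t[i].ready) continue;
        double slack = t[i].deadline - now;    /* time to deadline */
        if (best < 0 || slack < best_slack) {
            best = i;
            best_slack = slack;
        }
    }
    return best;
}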
Fixing scheduling
problems
What if your set of processes is
unschedulable?
Change deadlines in requirements.
Reduce execution times of processes.
Get a faster CPU.
Priority inversion
Priority inversion: low-priority process
keeps high-priority process from running.
Improper use of system resources can
cause scheduling problems:
Low-priority process grabs I/O device.
High-priority process needs the I/O device, but can't get it until the low-priority process is done.
Can cause deadlock.
Solving priority inversion
Give priorities to system resources.
Have process inherit the priority of a
resource that it requests.
Low-priority process inherits priority of
device if higher.
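A minimal C sketch of priority inheritance under these rules (the structures are illustrative; a real RTOS would also track waiter queues and handle nested inheritance):

typedef struct { int base_prio, cur_prio; } Task;
typedef struct { Task *holder; } Resource;

/* A requester that outranks the holder lends it its priority, so the
   holder cannot be preempted by medium-priority work while it finishes. */
void request(Resource *r, Task *requester) {
    if (r->holder && r->holder != requester &&
        requester->cur_prio > r->holder->cur_prio)
        r->holder->cur_prio = requester->cur_prio;   /* inherit */
    /* ... requester then blocks until the resource is released ... */
}

void release(Resource *r) {
    r->holder->cur_prio = r->holder->base_prio;      /* restore priority */
    r->holder = 0;          /* next: grant to the highest-priority waiter */
}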
Data dependencies
Data dependencies allow us to improve utilization.
They restrict the combinations of processes that can run simultaneously.
P1 and P2 can't run simultaneously.
[Figure: task graph with a data-dependency edge from P1 to P2.]
Context-switching time
Non-zero context switch time can push
limits of a tight schedule.
Hard to calculate effects---depends on
order of context switches.
In practice, OS context switch overhead
is small.
What about interrupts?
Interrupts take time away from processes.
Perform the minimum possible work in the interrupt handler.
[Figure: execution timeline. P1 runs, the OS fields an interrupt (intr), then P2 and P3 run, with OS activity between processes.]
Device processing
structure
Interrupt service routine (ISR) performs
minimal I/O.
Get register values, put register values.
Interrupt service process/thread performs most of the device function.
Summary
Two major scheduling policies
RMS
EDF
Other parameters affecting scheduling
Priority inversion
Data dependencies
Context switching time
Interrupts
Assignment
Questions from Chapter 6 of the textbook
6.17 through 6.26
Due date:
Two weeks: Sunday, Azar 27th