No Slide Title
Download
Report
Transcript No Slide Title
Modular design and the
importance of models of
computation
Bart Kienhuis,
Leiden University, LIACS
Computer Systems Group
Based on the presentation given at the 37th DAC tutorial on Embedded System Design
New Applications
Stream oriented applications
Multi-media
(Smart) imaging
Bioinformatics
Classical digital signal processing
Ferocious appetite for compute
power
2
Smart Imaging
Camellia project
Target 1
Target 2
Core for ambient and
mobile intelligent
imaging applications
IST project (fr5)
Detection of a
Huge compute Requirements pedestrian walking in
Giga operations per second
front of a car
Real Time
Embedded System
Low cost
Low power
Renault
Philips
3
Heterogeneous
Architectures
Microprocessors
Programmable
Interconnect (NoC)
(DSP, CPU)
Reconfigurable Units
Memory
Memory
(FPGA)
...
Micro
Processor
CPU
Memory banks
RPU
Distributed Memory
IPcore
(IP cores)
Memory
Dedicated Hard
Stream-based applications
Autonomous operating components
(task-level parallelism)
Low bandwidth communication between components
Programmable interconnect
Distributed memory
4
Intrinsic Computational
Efficiency (ICE)
Computational efficiency
106 [MOPS/W]
105
104
Playing field
Intrinsic Computational
Efficiency of Silicon
103
7400
102
i386SX
101
100
601
microsparc
604
i486DX P5 Super
sparc
68040
2
1
Turbosparc
604e
604e
21164a 21364
Ultra
P6
sparc
0.5
Microprocessors
0.07
0.13
0.25
Feature size [m]
E. Roza, System-on-chip: what are the limits?, IEE Electronics
Communication Engineering Journal, vol13, No6, Dec 2001, pp 249-255.
5
Xilinx Virtex II Pro
PowerPC based
PowerPCs
420 Dhrystone
MIPS at 300 MHz
1 to 4 PowerPCs
Virtex-4
Already over a
billion transistors
90nm technology
Reconfigurable logic
and memory blocks
Source: Xilinx
6
SpaceCake
Title
Current Development in
Philips Research. Idea is to
make a platform that is
Moore’s Law resilient.
CPU0
Homogenous Tiles
If more transistors become
available, more titles can be
combined delivering more
compute power.
Stravers, P and Hoogerbrugge, J. 2001. Homogeneous multiprocessoring
and the future of silicon design paradigms. In proceedings of the
Int. Symposium on VLSI Technology, Systems, and Applications.
CPU1
CPU2
Programmable Interconnect
RPU
IPcore
Memory
7
PicoArray
Startup,
England
Processor with
430 16-bit RISC
cores on a
single die.
Projected
Markets:
WCDMA /
802.11
Homogeneous Architecture
Source: www.picoarray.com
8
Different Views
Homogeneous architectures
Linear scaling
More CPUs leading to an equal increase in available
compute power
Load balancing
Each CPU is working at the same workload level
Heterogeneous architectures
Flexibility
Match the computation to the correct component in terms of
ICE; Take advantage of heterogeneity
9
Software Efficiency
58% / Yr. compound
complexity growth rate
10,000,000
1,000,000
100,000
10,000
1,000
100,000
Productivity
gap
10,000
1,000
100
100
21% / Yr. compound
productivity growth rate
10
2009
2005
2001
1997
1993
1989
10
1981
Source: SEMATECH
© Kreutzer
Productivity
Trans. / Staff . Month
1,000,000
100,000,000
1985
Logic transistors per chip
(K)
10,000,000
10
Problem
“We know how to build billion transistor
ICs, but we do not know how to program
them”
Current compiler technology is not
capable to handle the heterogeneity of
the architectures
How come and how to solve?
11
Y-chart Approach
Three different ways to improve the performance of a system.
Applications
Applications
Applications
Architecture
Instance
Mapping
Performance
Analysis
Use different
Mapping strategies
Suggest architectural
improvements
Performance
Numbers
Rewrite the
applications
Kienhuis,B., Deprettere, E., Van der Wolf, P., and Vissers, K. 2002. A Methodology to Design
Programmable Embedded Systems. LNCS, vol. 2268. Springer Verlag, pages 18 – 37.
12
Applications
Applications
Applications
Architecture
Instance
Mapping
Mapping
Performance
Analysis
Performance
Numbers
quantization control
MPEG Demux
VLD
Coded
video
Q-1
IDCT
motion vectors
& mode
+
Reorder
ordering
Decoded
video
MAPPING
Motion
Buffer
MPEG
Decoding
Programmable
Interconnect (NoC)
Micro
Processor
Memory
CPU
RPU
IPcore
...
Memory
Memory
13
Mapping
CPU
quantization control
MPEG
bus
Demux
VLD
Q-1
IDCT
Coded
video
coproc
coproc.
+
Reorder
ordering
Decoded
video
motion vectors
& mode
Motion
Buffer
MPEG Decoding
Both described a network of components that perform
a particular function and that communication in a
particular way
Application:
Architecture:
•Resources
• Computations
•ALUs, CORDICS, PEs
•Registers, SRAM, DRAM
•Busses, Switches
•Communication
•Bits, Signals
•IDCT, SQRT, Quantizer
• Communication
•Pixels, Blocks
14
Mapping
Architecture
Application
Mapping
quantization control
CPU
MPEG
Demux
VLD
Q-1
IDCT
Coded
video
+
Reorder
ordering
Decoded
video
bus
motion vectors
& mode
coproc
coproc.
Motion
Buffer
MPEG Decoding
Can we formalize the description of these networks?
“Models of Architecture” and “Models of Computation”
15
Model of Computation
A Model of computation is a formal representation
of the operational semantics of networks of
functional blocks describing the computations.
A
C
B
D
16
Model of Computation
Terminology
Actor
Actor
A
Describes the functionality
C
Relation
B
The actors can communicate
with each other using relations.
D
Relation
Token
The exchange of a quantum
of information.
It represents a signal
Firing
A quantum of computation
Moment of interaction with other
actors
token
Port
fire {
…
token = get();
…
send(token);
…
}
Port
(Active/Passive)
17
Active/Passive Actors
A
C
B
D
fire {
token = get();
…
send(token);
…
}
Exit
fire {
Two kinds of Actors:
while(1) {
token = get();
send(token);
}
}
Passive Actor:
Active Actor:
•Scheduler needed.
•Schedules itself
•A firing typically doesn’t terminate
•Schedule ABBCD
•A firing needs to terminate
•Fire-and-exit behavior
•Endless while loop
•Process behavior
18
Communication Between
Actors
fire {
…
send();
…
}
Actor 1.
Token
port
port
Communication
(Semantics)
Data Type of the Token
•Integer, Double, Complex
•Matrix, Vector
•Record
fire {
…
get();
…
}
Actor 2.
Way exchange takes place
•Buffered
•Timed
•Synchronized
19
Different Semantics
Analog computers (odes)
Discrete time (difference
equations)
Discrete-event systems
(DE)
Process networks (Kahn)
Sequential processes with
rendezvous (CSP)
Dataflow (Dennis)
Synchronous-reactive
systems (SR)
Codesign finite state
machines (CFSM)
discrete time:
continuous time:
discrete events:
synchronous/
reactive:
partially-ordered
events:
E1 E2 E3
E4
E5
E6
20
Synchronous/reactive
Models (SR)
x f A (1)
y f ( z)
Passive actors
b
Communication is unbuffered
Computation and communication is instantaneous. z f c ( x, y )
Network of concurrent executing actors
Fixed point equation
A model progresses as a sequence of “ticks.”
At a tick, the signals are defined by a fixed point equation:
x
fire {
…
send();
…
}
Token
port
port
fire {
…
get();
…
}
A
y
C
B
z
D
Characteristics of SR models
Tightly synchronized
Stable state points
Control intensive systems
21
Process Network (PN)
Network of concurrent
executing processes
Active actors
Communicate over unbounded
FIFOs
Performing some operation, a
blocking read or a nonblocking write
Characteristics of process
networks
Deterministic execution
Doesn’t impose a particular
schedule
(Dynamic) dataflow
fire {
…
send();
…
}
Token
port
port
A
fire {
…
get();
…
}
Process
C
B
D
Stream
channel
22
Synchronous Dataflow
(SDF)
Network of concurrent
executing actors
Passive actors
Communication is buffered
A model progresses as a
sequence of “iterations.”
A “firing rule” determines the
firing condition of an actor.
At each firing, a fixed number
of tokens is consumes and
produces.
Characteristics of SDF
Compile time analyzable.
Memory/schedule/speed
Static dataflow
fire {
…
send();
…
}
Tokens
port
port
A
fire {
…
get();
…
}
1
1
C
B
1
1
3
3
D
3
3
Schedule: ABBBC
23
Codesign Finite State
Machine (CFSM)
Network of concurrent
executing actors
Passive actors
Synchronous locally
Asynchronous globally
An “event” causes the
evaluation (firing) of a FSM.
Characteristics of CFSM
Compile time analyzable.
Reactive systems
Token
FSM
FSM
port
port
A
C
B
D
Timed Event
24
Finite State Machine (FSM)
KEY=0N => START
Port_KEY
KEY=OFF or BELT=ON =>
ALARM=OFF
WAIT
Port_START
OFF
Port_BELT
Port_ALARM
Port_END
END=10 or
BELT=ON or
KEY=OFF =>
ALARM=OFF
ALARM
END=5 =>
ALARM=ON
•FSM may only have one state active at the time
•FSM has only a finite number of states.
•More efficient way to describe sequential control.
•Formal semantics which allows for verifying various properties like
safety, liveness, and fairness.
25
Model of Architecture
A Model of architecture is a formal representation
of the operational semantics of networks of
functional blocks describing architectures.
A
A,B,C and D are now
hardware resources like
CPUs, busses, Memory,
and dedicated coprocessors.
C
B
D
Model of Architecture is similar to
Model of Computation, but the focus is on
the architecture instead of on the applications.
26
Examples
Bus
Control Dominated Tasks
•Sequential
Memory
CPU
Low
Bus
PE
Memory
Programmable Communication Network
CPU
PE
1
PE
2
PE
3
Memory
[1] Wolf, Wayne. A Decade of Hardware/Software Codesign.
IEEE Computer, Volume 36, April 2003, pages 38-43
Data Dominated Tasks
•Parallel / DMA
•Data flow
•Distributed computation
Complexity
CPU
Control/ Data Tasks
•Sequential
•Centralized computation
•Mutual Exclusive [1]
High
Less mature then MoC
27
Conclusion:
Matching Models
Application
Architecture
Natual Fit
Model of Architecture
Model of Computation
Data Type
When the MoC and MoA match,
a simple mapping results
28
Putting It Together
Example 1.
Platform: microprocessor “von-Neumann architecture”
Machine Description
Applications
Applications
Application
Micro
Processor
Compiler
SPECInt
Benchmarks
GCC
Pentium/Arm
MIPS/Alpha
Architecture Instances
Performance
Numbers
29
Putting It Together
Example 1.
Micro Processor
Picture in Picture
Memory
(address)
Program
Counter
Instruction
Decoder
ALU
Model of Architecture:
• Sequential (Program Counter)
• one item over the bus
at the time.
• Shared Memory
Compiler
Simulator
Natural FIT
for i=1:1:10
for j=1:1:10
A(i,j) =FIR();
end
end
for i=1:1:10,
for j=1:1:10,
A(i,j) =SRC( A(i,j) );
end
end
Model of Computation:
• Sequential
• Shared Memory
Performance
Numbers
30
Putting It Together
Example 2.
Applications
Applications
Applications
Architecture
Instance
Mapping
Matlab Code (QR Algorithm)
Programmable
Interconnect (NoC)
Micro
Processor
Memory
CPU
RPU
IPcore
...
Memory
Memory
Performance
Analysis
Performance
Numbers
Model of Architecture:
• Task Level Parallelism
• Heterogeneity
• Distributed Memory
%parameter N 8 16;
%parameter K 100 1000;
for k = 1:1:K,
for j = 1:1:N,
[ r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );
for i = j+1:1:N,
[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );
end
end
end
Model of Computation:
NO Natural FIT
• Imperative
• Sequential Execution
31
• Global Memory
Putting It Together
Example 2.
Applications
Applications
Applications
Architecture
Instance
Mapping
CPU
Micro
Processor
...
Memory
Memory
Memory
RPU
IPcore
Programmable
Interconnect (NoC)
Performance
Analysis
P1
Sink
S1
Source
Performance
Numbers
P2
P3
P4
Model of Architecture:
Model of Computation:
• Task Level Parallelism
• Heterogeneity
• Distributed Memory
• Process Networks
• Distributed Memory
• Distributed Control 32
Natural FIT
Our Research Focus….
Process Network
Application
Matlab Code (QR Algorithm)
P1
%parameter N 8 16;
%parameter K 100 1000;
for k = 1:1:K,
for j = 1:1:N,
[ r(j,j), x(k,j), t ]=Vectorize( r(j,j), x(k,j) );
for i = j+1:1:N,
[ r(j,i), x(k,i), t]=Rotate( r(j,i), x(k,i), t );
end
end
end
Programming
Compaan
P3
P4
Memory
Micro
Processor
Memory
CPU
RPU
IPcore
...
Sink
S1
Source
Programmable Interconnect
(NoC)
P2
33
Other Examples
Other example of the “Natual Fit”
concept
VSP architecture, CSDF
Tangam, CSP
Polis system, CFSM
34
Conclusions
We will get billion transistor ICs
We already know how to build them, but not
how to program them
Current compilers do not take into account
the notion of “natural fit” in terms of models
of computation and models of architecture
New compiler research is needed that takes
into account different models of computation
35