Audio Visual Hints - University of California, Los Angeles

Download Report

Transcript Audio Visual Hints - University of California, Los Angeles

Architectures and Design
Techniques for Energy Efficient
Embedded DSP and Multimedia
Ingrid Verbauwhede, Christian
Piguet, Bart Kienhuis/Ed Deprettere,
Patrick Schaumont
Outline
•
•
•
•
•
•
•
Intro
Low Power Observations
SoC Architectures: Ingrid
Low Power Components: Christian
Design Methods: Ed
Design Methods: Patrick
Conclusion
2
Introduction
Applications
Mapped
Embedded DSP &
Multimedia
Design Methods
onto
Architectures
= Low Power!
Ingrid Verbauwhede,
UCLA, K.U.Leuven
3
Low Power observation 1:
architecture tuned to application
Display
AD7873
Digitizer
Maxim
Control
RF Micro
Poweramp
Driver
8M SDRAM
4M FLASH
MemoryCard
Slot
Motorola
DragonBall
Agere POM
Baseband
Motorola
Transceiver
Philips
USB
FPGA
Maxim
Transceivers
4
Observation 2:
Energy-flexibility trade-off
Application
Power
Cost
ASIC
Fixed
Platform
???
General
Purpose
5
Example: DSP processors
•
Specialized instructions: MAC
•
Dedicated co-processors: Viterbi
acceleration
6
FIR on TI C55x: Dual MAC
FIR filter: two outputs in parallel with 3 busses
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . .
+ c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . .
+ c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . .
+ c(N-1)x(3-N);
y(3) = c(0)x(3) + c(1)x(2) + c(2)x(1) + . . .
+ c(N-1)x(4-N);
DB(16)
BB(16)
CB(16)
N-1
y(n) =
 c(i) x(n-i)
x(n-i+1)
i=0
x(n-i)
c(i)
c(i)
X
X
MAC0
MAC1
+
y(n+1)
AC0
+
y(n)
AC1
7
Energy comparison
Total energy for one output sample:
Energy
Single
MAC
Dual
MAC
Dual MAC
3 busses
Dual MAC
with REG
No. of MAC operations
N
N
N
N
No of Memory reads
2N
2N
1.5N
N
No of Instruction Cycles
N
N/2
N/2
N/2
Adaptation of the datapath: MAC, DMAC
Adaptation of the memory architecture and bus network
Adaptation of the instruction set
8
Viterbi on TIC54x
ALU and CSSU: CMPS instruction
DB1(16)
DB0(16)
G = min [(G1 + m1), (G2 + m2)]
+
• ALU splits in 16 bit halves
• ACC splits in half
m2
m1
G1
TREG
+
Accumulator
• Shortest distance saved
ALU
G2
• CSSU compares halves
• Path indicator saved
ALU
MSW/LSW
Comp decision bit
Select
G
• 4 cycles / butterfly
TRN reg
Source: TI Application Report, Viterbi Decoding in the TMS320C54x family, document SPRA071
Data bus EB, to memory
9
Observation: Also general purpose
architectures become heterogeneous.
Conexant
3.125Gb Serial
IBM PowerPC ®
RISC CPU
XtremeDSP™
Synchronous
Dual-Port
RAM
SelectIO-ltra™
SystemIO™ &
XCITE ™
10
Source: Xilinx webpage
Question
•
•
•
Energy - flexibility are opposite demands!
How to navigate in this jungle?
3D design space:
Computational Abstraction Level
Binding rate
•
Reconfigurable feature
Next question: how to map (or compile) an application onto
such an architecture?
11
Flexibility (1) - Abstraction level
Computational Abstraction Level
•
Instruction set level = “programmable”
•
CLB level = “reconfigurable”
12
Flexibility (2) - Reconfigurable feature
•
Basic components:
Computational Abstraction Level
Communication Storage
Computation
Systems
Interconnect
network
Memory
hierarchy
Number & type
of processes
Instruction set
Architecture
Size address/
data bus
Register
set
Custom
instructions
Microarchitecture
Cross-bar
Busses
Register
file
Execution unit
type
RAM
details
CLB
Implementation Switches,
Muxes
Reconfigurable feature
13
Flexibility (3) - Binding rate
Binding rate
Compare processing to binding
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)
14
SOC architecture: RINGS
Software
DomainSpecific
Hardware
MEMORY
CPU
Networking
Medium access
Baseband Proc
mArchitecture
Circuit
Standard
Algorithm
Architecture
mArchitecture
Circuit
Networking
Video
Algorithm
Architecture
mArchitecture
Assembly
Signal Proc
RF
Baseband
Processing
Video
Engine
DSP
Reconfigurable Interconnect
15
Instruction set extension
•
•
•
•
Instruction set extension
Register mapped
Tightly coupled
Experiment: DFT
1000
iterations
SW on
Embedded
proc.
SW with
HW
datapath
Improvement
Energy
67.6 mJ
5.76 mJ
12.5 times
16
Co-processor
•
•
•
Memory mapped
Loosely coupled
Experiment: AES
Local
Memory
175
iterations
SW on
emb. Proc.
SW with
HW
datapath
Improvement
Energy
89.2 mJ
13.5 mJ
25 times
17
Independent IP
•
•
•
•
Loosely coupled
Network on chip
connected
Flexible
interconnect
Experiment:
TCP/IP checksum
router
router
100
packets
SW on
emb. Proc.
HW
datapath
Improvement
Energy
17.0 mJ
0.20 mJ
84 times
18
Communication: Energy-flexibility
•
•
•
•
Also energy flexibility conflict!
General purpose
NOC: tiles
FPGA: general
purpose
[Daly-DAC01]
20
21
22
23
10
11
12
13
00
01
02
03
Proc A
Therefore: domain
specific NOC
2D
router
Proc B
Proc A
1D
router
2D
router
2D
router
2D
router
Proc X
Proc Y
19
Conclusion
•
•
Low Power by going domain-specific
Energy-flexibility conflict
•
•
How to “program” this RINGS?
Next: Ultra-low power components:
Christian
Design exploration: Ed
Co-design environment: Patrick
•
•
20
Introduction
Applications
Mapped
Embedded DSP &
Multimedia
Design Methods
onto
Architectures
= Low Power!
21
Efficient Embedded DSP
Ultra-Low-Power Components
Christian Piguet, CSEM
Ultra Low Power DSP Processors
• The design of DSP processors is very challenging,
as it has to take into account contradictory goals:
•
•
an increased throughput request
at a reduced energy budget
• New issues due to very deep submicron
technologies such as interconnect delays and
leakage
• History of hearing aids circuits:
•
•
•
analog filters 15 years ago
digital ASIC-like circuits 5 years ago
powerful DSP processors today, below 1 Volt and 1 mW
23
DSP Architectures for Low-Power
• single MAC DSP core of 5-10 years ago
• parallel architectures with several MAC working
in parallel
• VLIW or multitask DSP architectures
• Benchmark:
•
number of simple operations executed per clock cycle,
up to 50 or more
• Drawbacks of VLIW:
•
•
•
very large instruction words up to 256 bits
Some instructions in the set are still missing
transistor count is not favorable to reduce leakage
24
VLIW TMS320C6x (VelociTI)
Instruction RAM / Cache
VLIW
Crossbar
Function Unit A
Register File A
Function Unit B
Register File B
Data RAM
Peak:
C62:
About
1000 MIPS/watt
(5’000 MOPS/watt)
C64:
About
5’000 MIPS/watt
(25’000 MOPS/watt)
About 5 Op/Instr.
25
3 Ways to be more Energy Efficient
• To to design specific very small DSP
engines for each task, in such a way that
each DSP task is executed in the most
energy efficient way on the smallest piece
of hardware (N co-processors)
• to design reconfigurable architectures such
as the DART cluster
• in which configuration bits allow the user to
modify the hardware in such a way that it
can much better fit to the executed
algorithms.
26
Co-processors
Microcontroller I/O
start
interru
pt
Mux
Co-proc:
DSP, JPEG
or
convolution
- best one regarding power
- minimal number of transistors
and transitions to perform a task
- control code on a
microcontroller
- main issue is the software
mapping of a given application
onto so many heterogeneous
processors and co-processors
- cut-off for leakage reduction
Data Memory
Ph.D of O. Paker,
DTU, Lyngby, June 2002, Minicores:
- IIR: 73’000 MOPS/watt
- FIR: 33’000 MOPS/watt
27
DART: O. Sentieys, ENSSAT
RDPx
RDP: Reconfigurable DataPath
- Many identical co-processors or hardware accelerators for // tasks
- Reconfigurable interconnections
- in 0.18, 1.9 V., 130 MHz, WCDMA, 6.2 GOPS, 40’000 MOPS/watt
- FPGA Xcv200E: 4’000 MOPS/watt; DSP C64: 3’000 MOPS/watt
28
Reconfigurable DSP Architectures
• Not FPGA, much more efficient than FPGA.
The key point is to reconfigure only a limited
number of units
• Reconfigurable datapath
• Reconfigurable interconnections
• Reconfigurable Addressing Units (AGU)
• FPGA:
• MACGIC DSP consumes 1 mW/MHz in 0.18
• Same MACGIC in Altera Stratix consumes 10
mW/MHz plus 900 mW of static power, so 1’000
mW at 10 MHz
29
Reconfigurable Datapaths
Memory Memory
x
Memory Memory
+
-
x
30
Reconfigurable Addressing Modes
•
•
operands fetch is generally a severe bottleneck in
parallel machines for which 8-16 operands are
required each clock cycle.
sophisticated addressing modes can be
dynamically reconfigured depending on the DSP
task to be executed
PR2
PR1
PR0
OFFA
W
 Examples :
an <- (an+on)%mn + OFFA
an <- (an+OFFA)%mn + 1
Selection of 3 operations
an <- (an+1)%mn + OFFA
an <- (an+1)%mn
(MACGIC)
31
MACGIC: Performance results
• Power-consumption for a 24-bits, 10 MHz synthesis
@0.9 V in the 0.18µm TSMC technology (SYNOPSYS &
MACHTA/PA simulations)
•
•
•
•
•
•
•
NOP:
ADD 24-bit:
MAC 24-bit/56-bit:
4* ADD 24-bit:
4* MAC 24-bit/56-bit:
MACV 24-bit/56-bit:
CBFY4 radix-4 FFT:
25 µA / MHz
102 µA / MHz
137 µA / MHz
167 µA / MHz
283 µA / MHz
269 µA / MHz
273 µA / MHz
98k MOPS/Watt
81k MOPS/Watt
120k MOPS/Watt
86k MOPS/Watt
90k MOPS/Watt
131k MOPS/Watt
• Number of transistors for this 24-bit version : ˜ 600’000
• Number of transistors for a 16-bit version: ˜ 400’000
32
Performance results: 64 pts Cpx FFT
•
•
•
•
•
•
•
•
Macgic® :
CARMEL:
PalmDSPCore:
SC140 Starcore:
R.E.A.L DSP:
SP-5flex (3DSP):
TI C62x:
TI C64x:
~250 clock cycles
526 clock cycles
450 clock cycles
288 clock cycles
850 clock cycles
500 clock cycles
675 clock cycles
276 clock cycles
33
Comparison
MOPS/watt
ASIC IIR: 240’000
ASIC Phonak:
70’000
Minicore IIR: 73’000
Minicore FIR:
33’000
Not flexible
Which developpment tools
for co-processors?
Programmable, developpement tools
MACGIC: 80’000
(0.9 V., 10 MHz)
MACGIC: 30’000 (1.8 V. 150 MHz)
C64: 25’000
ARC: 5’000
OAK, C54, 56K: 2’500
Flexibility
34
Introduction
Applications
Mapped
Embedded DSP &
Multimedia
Design Methods
onto
Architectures
= Low Power!
35
Design & Architecture
Exploration
Ed Deprettere, Professor
Bart Kienhuis, Assistant Professor
Leiden University
LIACS, The Netherlands
Embedded DSP Architectures
Programmable
Interconnect (NoC)
Micro
Processor
Weakly coupled
Processing elements
...
Memory
CPU
RPU
IPcore
Synfora
PicoChip
Silicon Hive
Art Builder
Memory
CPU: A simple Microprocessor
RPU: Reconfigurable Processing Unit
IPcore: Dedicated Accelerator block
37
NoC: Network on a Chip
System Level Design
•
Three aspects are important in System
Level Design
• The Architecture
• The Application
• How the Application is Mapped on the
Architecture.
•
•
To optimize a system, you need to take all
three aspect into consideration.
This is expressed in terms of the Y-Chart
38
Y-chart Approach
Applications
Applications
Applications
Architecture
Instance
Mapping
Performance
Analysis
Tuning the architecture
to get better performance
Performance
Numbers
39
Y-chart Approach
Applications
Applications
Applications
Architecture
Instance
Mapping
Use different
Performance
Mapping strategies
Analysis
Suggest architectural
improvements
Performance
Numbers
Rewrite the
applications
There are three different ways to improve a system.
40
Design Space Exploration
Exploration is about finding
The inverse relationship
Decision
Variables
Valuation
System
Metrics
Acquisition of Insight
41
Design Space Exploration
Exploration is about finding
The inverse relationship
Applications
Applications
Applications
Mapping
Performance
Analysis
Metrics
Decision
Variables
Architecture
Instance
Performance
Numbers
42
Y-chart Design For GP Processors
Applications
Applications
Applications
In C
ARM Processor
Tensilica
X86 process
Compiler
(GNU GCC)
Performance
Analysis
Performance Numbers are:
•Clock Speed
Numbers
•Power Consumption
•Throughput
•Latency
43
Y-chart Design For DSP Applications
Architecture Template
Xilinx FPGA
Synplicity
Xilinx Foundation
Tools
Companies like
PICO and ART
(Arm/Adelante) are
trying to provide the
Y-chart components
as well.
Applications
Applications
Applications
In VHDL/C
Performance
Analysis
Performance Numbers are:
•Clock Speed
Numbers
•Power Consumption
•Throughput
•Latency
44
How to improve performance
•
How can we improve the performance of the
system we are interested in.
•
•
For a low-power architecture parallelism is
important.
•
•
•
•
Others focus on architecture, we want to focus on the
application.
Exploiting more parallelism leads to fast calculations
using Voltage and Frequency scaling, we assume that
power is saved
There is already a lot of theory developed to
employ bit-level parallelism, instruction parallelism
and Task Level parallelism.
Especially Task Level parallelism is getting more
and more important to effectively map DSP
application onto the new emerging architectures.
45
Programming Problem
EASY to specify
DIFFICULT to specify
Application
Sequential
Application Specification
for j = 1:1:N,
[x(j)] = Source1( );
end
for i = 1:1:K,
[y(i)] = Source2( );
end
for j = 1:1:N,
for i = 1:1:K,
[y(i), x(j)] = F( y(i), x(j) );
end
end
for i = 1:1:K,
[Out(i)] = Sink( y( I ) );
end
Parallel
Application Specification
P1
Programming
Compaan
Memory
Sink
S1
Source
P3
Programmable Interconnect
(NoC)
P2
P4
EASY to map
DIFFICULT to map
Micro
Processor
Memory
CPU
RPU
IPcore
...
46
Kahn Process Network (KPN)
•
•
•
Kahn Process Networks [Kahn 1974][Parks&Lee 95]
• Processes run autonomously
• Communicate via unbounded FIFOs
• Synchronize via blocking read
Process is either
Fa
• executing (execute)
Fifo
• communicating
A
(send/get)
Fb
B
Characteristics
•
•
•
get
Deterministic
execute
Distributed Control
– No Global Scheduler send
send
is needed
Distributed Memory
– No memory contention
Fc
get
get
execute
send
C
get
execute
send C
47
Kahn Process Network (KPN)
FPGA A
Fifo
Fifo
Process B
Process A
Fifo
CPU 1
FPGA B
Fifo
Fifo
Process C
•Autonomously operating Processes; no global schedule needed
•Blocking Read simple realize in Hardware
•Buffer Sizes of the FIFOs are quite often very small
48
Matlab to Process Networks to FPGA
Matlab Program
C++ / SystemC / YAPI
Compaan Compiler
Kahn Process Network
Laura
Functional Simulation in Ptolemy II
Synthesizable VHDL
FPGA
49
Matlab to Matlab Transformations
•
To make the flow from Matlab to FPGA
interesting, we had to give the designers
means to change the characteristics
• Unrolling (Unfolding)
–Increases parallelism
• Retiming (Skewing)
–Improved pipeline behavior
• Clustering (Merging)
–Reducing parallelism
•
All these operations can be applied to the
source level of Matlab, leading to a new
Matlab program
50
Y-chart Design For DSP Applications
Algorithmic
Toolbox
Retiming Unrolling
Merging
Matlab
Program
Compaan
Xilinx/Altera
FPGA
Laura
Synplicity
Xilinx Foundation
Process
Network
Performance
Analysis
Performance
Numbers
51
Case Study
•
We use the Y-chart environment on a real case study
•
•
•
•
•
•
Adaptive QR [DAES 2002]
Using commercial IP cores
– QinetiQ Ltd.
– Vectorize 42 pipeline stages
– Rotate 55 pipeline stages
QR is interesting as it requires deeply pipelined IP cores
Most Design tools have difficulties with such IP cores
We will explore a number of simple steps to improve the
performance of the QR algorithm
Was reported to run at 12Mflops.
52
Example: Adaptive QR (Step 1)
for k = 1:1:21,
for j = 1:1:7,
for i = j:1:7,
if i <= j,
[r(j,j), rr(j,j), a(j),b(j), d(k) ] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
else
[r(j,i), x(k,i), a(I), b(I) ] = icell(r(j,i),x(k,i), a(i), b(i) );
end
end
end
end
•This algorithm runs at 60 Mflops @ 100Mhz, out-of-the-box. This is
more than 12Mflops [DAES2002] due to improved hardware design of
the processes and controllers.
•Still, the efficiency of the pipelines of bcell/icell is very poor <1%, for
example, due to the self-loop created by variables a and b.
53
Example: Adaptive QR (Step 2)
for k = 1:1:21,
for j = 1:1:7,
for i = j:1:7,
if i <= j,
[r(j,j), rr(j,j), a(j),b(j), d(k) ] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
else
[a(i)] = Pass_qr(a(i-1));
[b(i)] = Pass_qr(b(i-1));
[r(j,i), x(k,i) ] = icell(r(j,i),x(k,i), a(i-1), b(i-1) );
end
end
end
end
•We broke the self-loops on variables ‘a’ and ‘b’ by
introducing a ‘pass’ function. This gave a 5% improvement
leading to 67.7 Mflops.
•Still the efficiency is bcell (3.76%) / icell (1.56%) is very poor.
54
Example: Adaptive QR (Step 3)
for k = 1:1:21,
for j = 1:1:7,
for i = j:1:7,
for p = 1: 1: 10,
if i <= j,
[r(p,j,i), rr(p,j,i), a(p,i),b(p,i),d(p,k) ] =
bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
else
[a(p,i)] = Pass_qr(a(p,i-1));
[b(p,i)] = Pass_qr(b(p,i-1));
[r(p,j,i), x(p,k,i) ] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
end
end
end
end
end
•To fill the pipelines, we run independent problems over the
same resources (sometimes referred to as strip mining). This
fills the pipelines more efficiently, leading to 472.55 Mflops.
(bcell 8.2%/ icell 24.2%).
55
Example: Adaptive QR (Step 4)
for k = 1:1:21,
for j = 1:1:7,
for i = j:1:7,
for p = 1: 1: 10,
if i <= j,
[r(p,j,i), rr(p,j,i), a(p,i),b(p,i), d(p,k) ] =
bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
else
[a(p,i)] = Pass_qr(a(p,i-1));
[b(p,i)] = Pass_qr(b(p,i-1));
if mod(p,2) <= 0,
[r(p,j,i), x(p,k,i) ] = icell( r(p,j,i), x(p,k,i), (p,i-1), b(p,i-1) );
else
[r(p,j,i), x(p,k,i) ] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
end
end
end
end •The icell and bcell work load is not balanced. The bcell gives
more workload then the icell can process. Therefore, we unroll the
end
icell, leading to 673.81 Mflops. (bcell 11.8%/ icell 2*17.4% = 34.8%).
end
56
Conclusions
•
•
•
•
Optimizing a System Level Design for LowPower requires that you look at the
architecture, the mapping, and the
application.
The Y-chart gives a simple framework to
tune a system for optimal performance.
The Y-chart forms the basis for DSE
We showed that by playing with the way
applications are written, we get in a number
of steps orders better performance
60MFlops -> 673MFlops. This without
changing the architecture!
57
Introduction
Applications
Mapped
Embedded DSP &
Multimedia
Design Methods
onto
Architectures
= Low Power!
58
Domain-Specific Co-design
Environments
Patrick Schaumont, UCLA
Let’s waste no time !
DATE Party
You
Presentation
60
Design of the RINGS Multiprocessor
Core
Core
Hardware
NoC
Interface
NoC
Interface
NoC
Interface
Coprocessor
NoC Routers & Topology
Need an environment that combines
• Instruction-set Simulation (multi-core)
• Hardware Simulation
61
ARMZILLA
C
C
Network
On Chip
Hardware
Processors
FDL
FDL
C
Cross
Compiler
EXE
EXE
EXE
ARM ISS
ARM ISS
ARM ISS
ARMZILLA
Config
Configuration
Unit
Memory-mapped
Channels
GEZEL
Kernel
62
ARMZILLA
C
C
Network
On Chip
Hardware
Processors
FDL
FDL
C
Cross
Compiler
EXE
EXE
EXE
Config
Configuration
Unit
ARM ISS
ARM ISS
ARM ISS
ARMZILLA
Memory-mapped
Channels
GEZEL
Kernel
63
Requirements for an ISS in
a multiprocessor simulator
Hardware Simulation
Kernel
memory::write( )
memory::read( )
Linkable
Reentrant
Datainterface
ISS
Accessible
simulation control
Synchronizationinterface
64
ARMZILLA
C
C
Network
On Chip
Hardware
Processors
FDL
FDL
C
Cross
Compiler
EXE
EXE
EXE
Config
Configuration
Unit
ARM ISS
ARM ISS
ARM ISS
ARMZILLA
GEZEL
Kernel
Memory-mapped
Channels
65
Memory-mapped channels easy in C
volatile int * k = 0x80000000;
Software on core
Intialized Pointers
b
= *k;
// read
ARM ISS
Configuration
Private address-space per core
Symbolic name per core
Hardware Network
Parametrized Interfaces
(corename, address)
Memorybus
ARM ISS
Hardware Network
& Coprocessors
66
Hardware Simulation Kernel
C
C
Network
On Chip
Hardware
Processors
FDL
FDL
C
Cross
Compiler
EXE
EXE
EXE
Config
Configuration
Unit
ARM ISS
ARM ISS
ARM ISS
ARMZILLA
GEZEL
Kernel
Memory-mapped
Channels
67
GEZEL Hardware Simulation Kernel
has Hybrid Architecture
.fdl
parser
GEZEL Language
symbol
table
C++
SH3
DSP
LEON2
Sparc
ARM
code
generators
cycle
simulator
hdl
translation
sim API
syn API
SystemC
VHDL
68
The PONG Example
paddle
paddle
•
•
•
ball
Designed in 1967 by Baer - interactive TV feature
1977, General Instruments AY-3-8500 pong-in-a-chip
Magnavox, Coleco, Atari, Philips, URL, GHP, ...
69
Multiprocessor Model of PONG
Four-processor Model
PADDLE2
PADDLE1
(Player Strategy)
paddle
paddle
ball
BALL
(Ball dynamics)
Bounce = Message
FIELD
(Rendering)
70
Multiprocessor Operation - Initialize
Initialization Sequence
FIELD
PADDLE1
PADDLE2
BALL
FIELD DIMENSIONS
PADDLE SIZE & POS
TIME
BALL SPEED & POS
71
Multiprocessor Operation - Play
TRACK BALL
TRACK
PlayBALL
Sequence
FIELD
PADDLE1
TRACK BALL
PADDLE2
BALL
UPDATE POS
!!!
COLLISION
UPDATE POS & SPEED VECTOR
(CONDITIONAL)
LOOP
72
Let’s play!
QuickTime™ and a YUV420 codec decompressor are needed to see this picture.
73
Multiprocessor Architecture
PADDLE1
PADDLE2
BALL
FIELD
memorymapped
memorymapped
memorymapped
memorymapped
O
I
O
I
O
I
O
DISPLAY
HW
I
Point-to-point Model
74
Multiprocessor Architecture
Network-on-Chip Model
PADDLE1
PADDLE2
BALL
FIELD
memorymapped
memorymapped
memorymapped
memorymapped
1D
Router
1D
Router
1D
Router
1D
Router
payload
To:
DISPLAY
HW
From:
75
Links
SimIt-ARM ISS (W. Qin, Princeton University)
http://www.ee.princeton.edu/~wqin/armsim.htm
Cross Compiler arm-linux-gcc
ftp://ftp.arm.linux.org.uk
GEZEL & ARMZILLA:
http://www.ee.ucla.edu/~schaum/gezel
All these tools are free!
(free as in freedom, not as in free beer)
76
Conclusion
Applications
Mapped
Embedded DSP &
Multimedia
Design Methods
onto
Architectures
= Low Power!
77
Thanks for your attention !
78