Lecture 16
Heterogeneous Systems
(Thanks to Wen-Mei Hwu for many of the figures)
Spring 2007
What are Heterogeneous Systems?
• Programmable -- not restricted to one particular
application, though may be heavily optimized for a class
of applications.
• Multi-core -- Multiple, independent, execution units on a
chip
– Some people are starting to use the term “many-core” for architectures
where there are enough cores that you have to use a non-sequential
programming model to get full performance out of the system.
• Heterogeneous -- Cores are different
– Optimize cores for specific types of applications
– Can schedule for performance or power
Why are they Interesting?
Embedded applications have tough performance and
power requirements
Example: GSM decoder requires 10 Minst/second (10 million instructions per
second) in software
Motorola V70 GSM cell phone has power budget of
approximately 0.8 watts total when in use.
– Includes both encode and decode
– Includes microphone, speaker, radio
Application-Specific Integrated Circuits
[Figure: ASIC block diagram. Labels: Input Data, Custom Logic, Buffer, Custom Logic, Output Data, Control, CPU]
Why Not Keep Using ASICs?
• Decreasing Product Cycles
• Design Time/Cost
– Transistors/chip rising at 50%/year
– Transistors/designer day rising at 10%/year
• Re-usable cores helping some, but not enough
– Mask cost greater than $1M
• Need to fabricate many chips to justify a design
• Lack of Flexibility
– More and more, consumers want multifunction devices (ex. Cell phone
with camera)
– Increases design time, cost
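The two growth rates under Design Time/Cost compound. If transistors per chip grow 50% a year while transistors per designer-day grow 10% a year, the designer-days needed per chip grow by 1.5/1.1, roughly 1.36x, every year. A minimal sketch of that arithmetic, assuming both rates stay constant and effort is normalized to 1.0 at year 0 (the function name design_effort_ratio is hypothetical, not from the lecture):

```c
#include <assert.h>

/* Relative design effort (designer-days per chip) after `years` years,
 * normalized to 1.0 at year 0.
 * Assumption: both growth rates are constant over the whole period.
 * For the slide's figures, chip_growth = 0.50 and productivity_growth = 0.10. */
double design_effort_ratio(int years, double chip_growth, double productivity_growth)
{
    assert(years >= 0);
    double ratio = 1.0;
    for (int y = 0; y < years; y++) {
        /* Each year the chip gets bigger faster than designers get faster. */
        ratio *= (1.0 + chip_growth) / (1.0 + productivity_growth);
    }
    return ratio;
}
```

Under these assumptions, design_effort_ratio(5, 0.50, 0.10) comes out a little above 4.7, so the effort per chip more than quadruples in five years unless design reuse closes the gap.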
Why Heterogeneous Systems?
• Different parts of programs have different requirements
– Control-intensive portions need good branch predictors, speculation, big
caches to achieve good performance
– Data-processing portions need lots of ALUs, have simpler control flows
• Power Consumption
– Features like branch prediction, out-of-order execution, tend to have
very high power/performance ratios.
– Applications often have time-varying performance requirements
• Observation: Much of the performance and power
advantage of ASICs comes from application-specific
memory, not application-specific processing
Changing Memory to Communication
[Figure: the code below shown in two configurations, one with a CPU and DRAM and one with a CPU, DRAM, and PE’s. Buffer labels: Az_4, synth, res2, m_syn, F_g3, F_g4, syn, Ap3, Ap4, h, tmp, tmp1, tmp2. Stage labels: Weight_Ai, Residu, Copy + Set_zero, Syn_filt, Corr0/Corr1, preemph, Syn_filt, agc.]

Weight_Ai (Az, F_ga3, Ap3)
Weight_Ai (Az, F_g4, Ap4)
Residu (Ap3, &syn_subfr[i],)
Copy(Ap3, h, 11)
Set_zero(&h[11], 11)
Syn_filt (Ap4, h, h, 22, &h)
tmp = h[0] * h[0];
for (i = 1 ; i < 22 ; i++)
    tmp = tmp + h[i] * h[i];
tmp1 = tmp >> 8;
tmp = h[0] * h[1];
for (i = 1 ; i < 21 ; i++)
    tmp = tmp + h[i] * h[i+1];
tmp2 = tmp >> 8;
if (tmp2 <= 0)
    tmp2 = 0;
else {
    tmp2 = tmp2 * MU;
    tmp2 = tmp2/tmp1;
}
preemphasis(res2, temp2, 40)
Syn_filt (Ap4, res2, &syn_p),
    40, mem_syn_pst, 1);
agc (&syn[i_subfr], &syn)
    29491, 40)
View from source code
• Note how memory operations dominate
• Note presence of “expensive” instructions
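One rough way to see the first note is to count operations in the first correlation loop of the listing (tmp = h[0] * h[0]; then the loop over i). The sketch below is only a source-level count and makes several assumptions: every array reference is counted as a memory read (a compiler may keep h[i] in a register), h is assumed to hold 16-bit samples, and the struct and function names are hypothetical instrumentation, not part of the original code.

```c
/* Hypothetical counters for source-level operations. */
struct op_counts {
    long array_reads;  /* references to h[...] */
    long multiplies;   /* the "expensive" operation in this loop */
    long adds;
};

/* Same computation as the first correlation loop in the listing,
 * tmp = h[0]*h[0] + h[1]*h[1] + ... + h[n-1]*h[n-1],
 * with a counter bumped next to each operation it performs.
 * Assumption: h holds 16-bit samples, widened to long before multiplying. */
long corr0_counted(const short *h, int n, struct op_counts *c)
{
    long tmp = (long)h[0] * h[0];
    c->array_reads += 2;
    c->multiplies += 1;
    for (int i = 1; i < n; i++) {
        tmp = tmp + (long)h[i] * h[i];
        c->array_reads += 2;   /* h[i] appears twice in the source */
        c->multiplies += 1;
        c->adds += 1;
    }
    return tmp;
}
```

For n = 22, as in the listing, this counts 44 array reads against 22 multiplies and 21 adds. The counts describe the source text only; the instruction mix after compilation depends on the compiler and target.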
Not as Easy as it Looks
The order in which data is accessed may make it hard to transform
memory operations into communication
[Figure: timeline of Residu, Syn_filt, and preemphasis (multiply and add operations) accessing res in MEM. Index ranges on the accesses are labeled [39:0] and [0:39].]
Compilers to the Rescue!
• Remove anti-dependence by array renaming
• Apply loop reversal to match producer/consumer I/O
• Convert array access to inter-component communication
[Figure: timeline after the transformations. Labels: Residu, preemphasis, Syn_filt (multiply and add operations), res, res2, time.]
Interprocedural pointer analysis + array dependence test +
array access pattern summary + interprocedural memory data flow
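A small sketch of two of the transformations listed on this slide, loop reversal and converting array access into communication. It is an illustration under stated assumptions, not the compiler's actual output: the producer and consumer bodies (2 * in[i] and + 1) are made up, the struct fifo stands in for a hardware channel between processing elements, and all names are hypothetical. Array renaming is not shown.

```c
#include <assert.h>

#define N 40   /* subframe length that appears in the slides */

/* Hypothetical FIFO standing in for a channel between two components. */
struct fifo {
    int data[N];
    int head;   /* next slot to read  */
    int tail;   /* next slot to write */
};

static void fifo_push(struct fifo *f, int v)
{
    assert(f->tail < N);
    f->data[f->tail++] = v;
}

static int fifo_pop(struct fifo *f)
{
    assert(f->head < f->tail);
    return f->data[f->head++];
}

/* Before: the producer fills res[] from index 39 down to 0 and the
 * consumer reads it from 0 up to 39, so the whole array must sit in
 * memory before the consumer can start. */
void pipeline_with_array(const int *in, int *out)
{
    int res[N];
    for (int i = N - 1; i >= 0; i--)   /* producer, descending order */
        res[i] = 2 * in[i];
    for (int i = 0; i < N; i++)        /* consumer, ascending order */
        out[i] = res[i] + 1;
}

/* After: the producer loop is reversed. That is legal here because its
 * iterations are independent of each other. Both sides now walk the data
 * in the same order, so each value can be handed over through the channel
 * as soon as it is produced and res[] is no longer needed. */
void pipeline_with_channel(const int *in, int *out)
{
    struct fifo ch = { {0}, 0, 0 };
    for (int i = 0; i < N; i++) {
        fifo_push(&ch, 2 * in[i]);     /* producer step */
        out[i] = fifo_pop(&ch) + 1;    /* consumer step */
    }
}
```

Both functions compute the same output. The difference is that the second never holds more than one value in flight, which is what lets a memory buffer be replaced by a point-to-point connection.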
Heterogeneous Processor Vision
[Figure: heterogeneous processor block diagram. Labels: GPP, MAIN MEMORY, MTM, three ACC blocks, two LOCAL MEMORY blocks.]
• General-purpose processor (GPP) orchestrates activity
• Memory transfer module (MTM) schedules system-wide bulk data movement
• Accelerated activities and associated private data are localized for
bandwidth, power, efficiency
• Accelerators (ACC) can use scheduled, streaming communication…
or can operate on locally-buffered data pushed to them in advance
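The division of labor in this vision can be sketched in software. The model below is a simulation under assumptions, not a description of any real hardware interface: the local-memory size, the sum-of-squares kernel, and every name (struct accelerator, mtm_push, acc_sum_of_squares, gpp_run) are hypothetical. It shows the general-purpose processor orchestrating, the memory transfer module pushing data in bulk, and the accelerator touching only its local memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 64   /* assumed size of an accelerator's local memory */

struct accelerator {
    int local_mem[LOCAL_WORDS];  /* private, locally buffered data */
    int valid_words;             /* how many words are currently buffered */
};

/* Memory transfer module: bulk copy from main memory into local memory,
 * done ahead of the accelerator's computation. */
void mtm_push(struct accelerator *acc, const int *main_mem, int n)
{
    assert(n >= 0 && n <= LOCAL_WORDS);
    memcpy(acc->local_mem, main_mem, (size_t)n * sizeof(int));
    acc->valid_words = n;
}

/* Accelerator kernel: reads only its own local memory. */
long acc_sum_of_squares(const struct accelerator *acc)
{
    long sum = 0;
    for (int i = 0; i < acc->valid_words; i++)
        sum += (long)acc->local_mem[i] * acc->local_mem[i];
    return sum;
}

/* General-purpose processor: splits the work into chunks that fit in
 * local memory, has each chunk pushed, starts the accelerator, and
 * combines the partial results. */
long gpp_run(const int *main_mem, int total)
{
    struct accelerator acc = { {0}, 0 };
    long result = 0;
    for (int off = 0; off < total; off += LOCAL_WORDS) {
        int n = (total - off < LOCAL_WORDS) ? (total - off) : LOCAL_WORDS;
        mtm_push(&acc, main_mem + off, n);
        result += acc_sum_of_squares(&acc);
    }
    return result;
}
```

In this sequential model the transfer and the computation take turns. Overlapping the next transfer with the current computation, for example with two local buffers, would be a natural next step, but it is left out to keep the sketch short.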
Intel Network Processor -- Existing Example
[Figure: Intel network processor block diagram. Labels: XScale Core, 16 Microengine blocks, Hash Engine, Scratchpad SRAM, CSRs, RFIFO, TFIFO, SPI4 / CSIX, PCI, four QDR SRAM blocks, three RDRAM blocks.]
STI Cell Processor -- Emerging Example
[Figure: Cell processor block diagram. Labels:]
• Power Processor Element (PPE) (Simplified 64-bit PowerPC with VMX)
• Synergistic Processing Elements (SPE), SPE1 through SPE8
• Element Interconnect Bus (EIB) internal communication system
• Dual configurable high-speed channels (38.4 GB/sec ea.)
• Dual 12.8 GB/sec memory busses
• Two I/O Controller blocks, two Memory Controller blocks, two RAM blocks
Overview of the Rest of the Semester
• This is the last formal lecture
– If we haven’t covered it already, we can’t really expect you to use it on
your projects
• Final project proposal due Tuesday in class
• I’ll be in my office (208 CSL) during class on 3/27 to
provide an opportunity to discuss project issues
• Quiz 2 is 3/29
• Final project demos are 5/3