Breaking Down the Memory Wall
for Future Scalable Computing Platforms

Wen-mei Hwu
Sanders-AMD Endowed Chair Professor

with
John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,
Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng,
James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd,
Dan R. Burke, Nacho Navarro, Steven S. Lumetta

University of Illinois at Urbana-Champaign
DUSD(Labs)
Trends in hardware

- Large interconnect delay
  - Increasing interconnect delay and shrinking clock domains
  - Limited size of individual computing engines
- High variability
  - Increasing speed and power variability of transistors
  - Limited frequency increase
  - Reliability / verification challenges

[Figure: (left) RC delay of a 1mm copper interconnect vs. clock period, in ps, across process generations from 350nm to 65nm, with interconnect RC delay overtaking the clock period; (right) normalized frequency vs. normalized leakage (Isb) for 130nm parts, showing roughly 30% frequency spread and 5X leakage spread. Data: Shekhar Borkar, Intel]

Wen-mei W. Hwu—University of Illinois at Urbana-Champaign
SIGMICRO Online Seminar—January 18, 2005
Trends in architecture

- Transistors are free… until connected or used
- Continued scaling of traditional processor core no longer economically viable
  - 2-3X effective area yields ~1.6X performance [PollackMICRO32]
  - Verification, power, transistor variability
- Only obvious scaling route: “Multi-Everything”
  - Multi-thread, multi-core, multi-memory, multi-?
  - Conventional wisdom: distributed parallelism is easy to design
  - But what about software? If you build a better mousetrap…
A “multi-everything” processor of the future

[Diagram: GPP, APP, MTM, and two ACC units, each accelerator with its own LOCAL MEMORY, connected to MAIN MEMORY]

- Distributed, less complex components
  - Variability, power density, and verification easier to address
- Who bears the SW mapping burden?
  - General purpose software changes prohibitively expensive (cf. SIMD, IA-64)
  - Advanced compiler features (“Deep Analysis”)
  - New programming models / frameworks
  - Interactive compilers
General purpose processor component(s)

[Diagram: the multi-everything organization, with the GPP highlighted]

- The system director
  - Performs traditionally-programmed tasks
  - Software migration starts here
- Likely multiple GPPs
  - Less complex processor cores
Computational efficiency through customization

- Goal: offload most processing to more specialized, more efficient units
- Application Processors (APP)
- Programmable Accelerators (ACC)
  - Specialized instruction sets, memory organizations and access facilities
  - Think ASIC with knobs
  - Highly-specialized pipelines
  - Approximate ASIC design points
  - Higher performance/watt than general purpose for target applications
Memory efficiency through diversity

- Traditional monolithic memory model is a major power / performance sink
- Need partnership of general-purpose memory hierarchy and software-managed memories
- Local memories will reduce unnecessary memory traffic and power consumption
- Bulk data transfer scheduled by Memory Transfer Module (MTM)
- Software will gradually adopt decentralized model for power and bandwidth
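The software-managed local memory idea above can be sketched in a few lines of C. The names (`mtm_copy_in`, `local_buf`) are hypothetical, and the MTM is modeled as a plain memory copy on a host machine; in hardware it would be a DMA engine feeding an accelerator's local buffer, so the accelerator never pays a per-element miss to main memory.

```c
#include <assert.h>
#include <string.h>

#define TILE 64
static unsigned char local_buf[TILE];   /* stands in for a local memory */

/* Schedule a bulk copy, main memory -> local memory (a DMA in hardware). */
static void mtm_copy_in(unsigned char *dst, const unsigned char *src, int n) {
    memcpy(dst, src, n);  /* modeled as a plain copy on a host machine */
}

/* Accelerator kernel: touches only the local buffer. */
static void acc_scale(unsigned char *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = (unsigned char)(buf[i] * 2);
}

/* Process a large array tile by tile: one scheduled bulk transfer per
 * tile instead of fine-grained traffic to main memory. */
static void process(unsigned char *data, int n) {
    for (int off = 0; off < n; off += TILE) {
        int len = (n - off < TILE) ? n - off : TILE;
        mtm_copy_in(local_buf, data + off, len);
        acc_scale(local_buf, len);
        memcpy(data + off, local_buf, len);  /* bulk copy back */
    }
}
```

In a real system the copy-in of tile i+1 would be overlapped with the computation on tile i (double buffering), which is exactly the overlap the later slides aim for.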
Tolerating communication & adding macropipelining

- Bulk communication overhead often substantial for traditional accelerators
- Shared memory / snooping communication approach limits available bandwidth
- Compilation tools will have to seamlessly connect processors and accelerators
- Accelerators will be able to operate on bulk transferred, buffered data… or on streamed data
Embedded systems already trying out this paradigm

[Diagram: Intel IXP1200 Network Processor (ARM core, microengines, access control, CSRs); Intel IXP2400 Network Processor (XScale core, microengines, hash engine, scratchpad, TFIFO / RFIFO, SPI4 / CSIX, PCI, QDR SRAM and RDRAM interfaces); Philips Nexperia (Viper) with a MIPS core, a VLIW media processor, an MSP, and an MPEG video unit]
Decentralizing parallelism in a JPEG decoder

[Figure: conceptual dataflow view of two JPEG decoding steps: the YCC Image passes through an optional Upsample stage (or is bypassed) to form the Upsampled Image, then through Color Conversion, which reads Conversion Tables, to produce the RGB Image]

- Convert a typical media-processing application to the decentralized model
- Arrays used to implement streams
- Multiple loci of computation with various models of parallelism
- Memory access bandwidth a bottleneck without private data
Data privatization and local memory

[Figure: the same conceptual dataflow view of the two JPEG decoding steps]

- Accelerate color conversion first (execute in ACC or APP)
  - Main processor sends inputs, receives outputs
- Large tables: inefficient to send data from main processor
  - Need tables to reside in the accelerator for efficiency of access
- Tables are initialized once during program execution, and never modified again
  - Accurate pointer analysis necessary to determine this
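The once-initialized table pattern looks like the fragment below, modeled loosely on common libjpeg-style color-conversion setup code (the names and the fixed-point constant are illustrative, not taken from the talk). Pointer analysis must prove that no store other than the one in `build_tables` ever reaches the table before the compiler can privatize it into the accelerator's local memory.

```c
#include <assert.h>

static int cr_r_tab[256];     /* Cr -> red contribution lookup table */
static int init_done = 0;

/* Runs once at startup; the only writer of cr_r_tab. */
static void build_tables(void) {
    for (int cr = 0; cr < 256; cr++)
        cr_r_tab[cr] = ((cr - 128) * 91881) >> 16;  /* ~1.402 * (Cr-128) */
    init_done = 1;
}

/* Color conversion step: reads the table, never writes it. */
static int ycc_to_r(int y, int cr) {
    int r = y + cr_r_tab[cr];
    return r < 0 ? 0 : (r > 255 ? 255 : r);   /* clamp to 8-bit range */
}
```

Once the analysis establishes the read-only property, the table can be copied to the accelerator once, eliminating the per-pixel traffic to main memory the slide describes.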
Increasing parallelism

[Figure: timeline comparison: Upsample fills the Upsampled Image array from the YCC Image, then Convert produces the RGB Image; with streaming, the two loops overlap in time]

- Heavyweight loop nests communicate through intermediate array
- Direct streaming of data is possible, supports higher parallelism (macropipelining)
  - Convert() and Upsample() loops can be chained
  - Accurate interprocedural dataflow analysis is necessary
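The two forms can be sketched minimally as follows, with stand-in computations (real upsampling and color conversion are far more involved; `+ 1` stands in for Convert, and duplicating each input sample stands in for 2x Upsample):

```c
#include <assert.h>

#define N 8

/* Unfused form: Upsample writes a full intermediate array, then Convert
 * reads it back; this is the heavyweight-loops-communicating-through-an-
 * intermediate-array pattern. */
static void pipeline_arrays(const int *ycc, int *rgb) {
    int up[N];                       /* whole intermediate image buffered */
    for (int i = 0; i < N; i++)      /* Upsample (2x sample duplication) */
        up[i] = ycc[i / 2];
    for (int i = 0; i < N; i++)      /* Convert */
        rgb[i] = up[i] + 1;
}

/* Chained form: once interprocedural dataflow analysis shows Convert
 * consumes exactly what Upsample produces, the loops can be chained and
 * each element streamed producer-to-consumer, with no intermediate array
 * and with the two stages free to run as a macropipeline. */
static void pipeline_streamed(const int *ycc, int *rgb) {
    for (int i = 0; i < N; i++) {
        int sample = ycc[i / 2];     /* Upsample one element... */
        rgb[i] = sample + 1;         /* ...and Convert it immediately */
    }
}
```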
How the next-generation compiler will do it (1)

[Diagram: callgraph and memory view: heavyweight loops in Upsample and Color Conversion, alongside Table Initialization and Load Scanline]

To-do list:
- Identify acceleration opportunities
- Localize memory
- Stream data and overlap computation

Acceleration opportunities:
- Heavyweight loops identified for acceleration
- However, they are isolated in separate functions called through pointers
How the next-generation compiler will do it (2)

[Diagram: initialization code and large constant lookup tables identified; Upsample mapped to Accelerator 1, Load Scanline and Color Conversion to Accelerator 2]

To-do list:
- Identify acceleration opportunities (done)
- Localize memory
- Stream data and overlap computation

Localize memory:
- Pointer analysis identifies localizable memory objects
- Private tables inside accelerator initialized once, saving most traffic
How the next-generation compiler will do it (3)

[Diagram: constant table privatized into Accelerator 2; output access pattern of Accelerator 1 (Upsample) and input access pattern of Accelerator 2 (Load Scanline, Color Conversion) summarized]

To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation

Streaming and computation overlap:
- Memory dataflow summarizes array/pointer access patterns
- Opportunities for streaming are automatically identified
- Unnecessary memory operations replaced with streaming
How the next-generation compiler will do it (4)

[Diagram: Accelerator 1 (Upsample) streaming directly to Accelerator 2 (Load Scanline, Color Conversion)]

To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation (done)

Achieve macropipelining of parallelizable accelerators:
- Upsampling and color conversion can stream to each other
- Optimizations can have substantial effect on both efficiency and performance
Memory dataflow in the pointer world

[Figure: a Rows x Cols image stored as an array of constant pointers, one per row, each pointing to a separate array of Y C C samples; the row arrays never overlap]

- Arrays are not true 3D arrays (unlike in Fortran)
- Actual implementation: array of pointers to arrays of samples
- New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
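A sketch of that implementation in C (libjpeg, for example, uses this row-pointer layout for its sample arrays; the names here are illustrative). The rows never overlap, but proving that, and thus that a write through one row pointer cannot reach another row, is a memory dataflow problem rather than a classic Fortran array-subscript problem.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned char SAMPLE;

/* Allocate an image as an array of pointers to separately allocated,
 * disjoint row arrays; img itself acts as an array of constant pointers
 * once filled in. */
static SAMPLE **alloc_image(int rows, int cols) {
    SAMPLE **img = malloc(rows * sizeof *img);
    for (int r = 0; r < rows; r++)
        img[r] = calloc(cols, sizeof(SAMPLE));  /* zero-filled row */
    return img;
}

static void free_image(SAMPLE **img, int rows) {
    for (int r = 0; r < rows; r++)
        free(img[r]);
    free(img);
}
```

The analysis must recover, from allocation-site structure, exactly the disjointness property a Fortran compiler gets for free from the array type.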
Compiler vs. hardware memory walls

- Hardware memory wall
  - Prohibitive implementation cost of a memory system that keeps up with processor speed under a power budget
- Compiler memory wall
  - The use of memory as a generic pool obstructs the compiler’s view of true program and data structures
- The decentralized and diversified memory approach is key to breaking the hardware memory wall
- Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall
Pointer analysis: sensitivity, stability and safety [PASTE2004]

- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Improved analysis algorithms provide more accurate call graphs instead of a blurred view, for use by program transformation tools

[Figure: log-log scatter of discovered objects vs. observed connectivity for 132.ijpeg; as analysis scope grows, results move from a few, highly-connected objects (worse) toward a multitude of distinct objects (better)]
Pointer analysis: sensitivity, stability and safety

- Analysis is abstract execution
  - Simplifying abstractions → analysis stability, but “unrealizable dataflow” results
- Many components of accuracy: flow, context, heap, subfield typing, arithmetic
  - Typical to cut some corners to enable a “key” component for particular applications
  - Making the components usefully compatible is a major contribution
  - No need for a priori corner-cutting → better results across broad code base
- Safety in “unsafe” languages
  - C poses major challenges
  - Efficiency challenge increased in safe algorithms
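One of the accuracy components above, subfield typing (field sensitivity), can be seen in a tiny example of my own, not from the talk: without it, the two pointer fields of `s` collapse into one abstract location, and the analysis cannot prove that the store through `s.q` leaves `g1` untouched.

```c
#include <assert.h>

int g1, g2;                       /* globals, zero-initialized */
struct pair { int *p; int *q; };

static int demo(void) {
    struct pair s;
    s.p = &g1;
    s.q = &g2;
    *s.q = 7;        /* only g2 is written... */
    return g1;       /* ...so g1 keeps its initial value of 0 */
}
```

A field-insensitive analysis reports that `*s.q` may write either global; a field-sensitive one proves the return value is the untouched `g1`.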
How do sensitivity, stability and safety coexist?

Our two-pronged approach to sensitive, stable, safe pointer analysis:

- Summarization: only relevant details are forwarded to a higher level (increased abstraction)
- Containment: the algorithm can cut its losses locally (like a bulkhead) to avoid a global explosion in problem size

[Diagram: a corporate hierarchy analogy: workers report to managers, managers to VPs, VPs to the CEO; each level summarizes for the level above]

Example: summarization-based context sensitivity…
Context sensitivity: naïve inlining

int g;

jade(int *p, int q):
    int r;
    *p := q;
    r := g + 5;
    x := z

iris():
    int a;
    jade1(&g, 1)
    jade2(&a, 3)

Naïve inlining clones jade into each call site:

jade1: p1 := &g;  q1 := 1;  *p1 := q1;  r1 := g + 5;  x1 := z1
jade2: p2 := &a;  q2 := 3;  *p2 := q2;  r2 := g + 5;  x2 := z2

- Excess statements unnecessary and costly
- Retention of the callee’s side effects still leads to spurious results: besides the correct facts (g := 1, a := 3, r := 6), the analysis also derives g := 3, a := 1, and r := 8
Context sensitivity: summarization-based

iris():
    int a;
    jade1(&g, 1)    →  g := 1
    jade2(&a, 3)    →  a := 3

- A compact summary of jade is used at each call site
  - Summary accounts for all side-effects
  - BLOCK assignment prevents contamination across contexts
- Now, only the correct result is derived (g := 1, a := 3, r := 6)
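The iris/jade example can be written as compilable C to check the facts a context-sensitive analysis should derive (the slides' `x := z` statement is kept as a pair of stand-in globals, and iris is given a return value so the per-context results are observable):

```c
#include <assert.h>

/* A context-insensitive analysis merges the two calls, concluding p may
 * point to either g or a at both sites, and so derives spurious facts
 * like g := 3. A context-sensitive analysis (by inlining or by a
 * summary of jade) keeps the calls apart: g == 1, a == 3, r == 6. */

int g;
int x, z;   /* stand-ins for the unrelated statement x := z */

static int jade(int *p, int q) {
    int r;
    *p = q;        /* side effect: writes through p into the caller */
    r = g + 5;     /* reads g in every context */
    x = z;
    return r;
}

static int iris(void) {
    int a;
    int r1 = jade(&g, 1);  /* context 1: g := 1, so r1 == 6 */
    int r2 = jade(&a, 3);  /* context 2: a := 3, g untouched, r2 == 6 */
    return r1 + r2 + a;
}
```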
Analyzing large, complex programs [SAS2004]

- Originally, problem size exploded as more contexts were encountered: with naïve exhaustive inlining, problem size grows past 10^12 with call graph depth
- The new compaction algorithm contains problem size with each additional context
- This results in an efficient analysis process without loss of accuracy

[Charts: problem size vs. call graph depth, from main() to the leaves, for naïve exhaustive inlining (espresso, li, gcc, perl, gap, vortex, twolf, ijpeg, perlbmk) and for the new compaction algorithm]

[Table: analysis times for SPEC benchmarks (008.espresso, 099.go, 124.m88ksim, 130.li, 134.perl, 175.vpr, 176.gcc, 254.gap, 255.vortex): context-insensitive analysis takes 1 to 155 seconds; the previous context-sensitive analysis ranges from 2 seconds up to hours or months; the new context-sensitive analysis takes 1 to 198 seconds]
The outlook in software

- Software is changing too, more gradually
- Applications driving development are rich in parallelism
  - Physical world: medicine, weather
  - Video, games: signal & media processing
- Source code availability
  - Open Source continues to grow
  - Microsoft’s Phoenix Compiler Project
- New programming models
  - Enhanced developer productivity & enhanced parallelism
Beyond the traditional language environment

- Domain-specific, higher-level modeling languages
  - More intuitive than C for inherently parallel problems
  - Implementation details abstracted away from developers: increased productivity, increased portability
- Still an important role for the compiler in this domain
  - Little visibility “through” the model for low-level optimization by developers
  - Communication and memory optimization will be critical in next-gen systems
  - Model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
- As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles
Domain-specific modeling and optimization

[Diagram: NPClick programming model: fromDevice elements feed checkIPHeader elements, then rt (ipLookup), then toDevice or discardPush. In the naïve implementation, checkIPHeader and ipLookup each load the IP header from packet data in main memory; in the compiler-optimized implementation, redundant load elimination passes a packet token so the header is loaded only once]

- The programming model provides the compiler with information that one cannot extract with analysis alone
- The compiler breaks the limitations that are imposed by the model, allowing for efficient, high-performance binaries
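The redundant-load elimination in the diagram can be sketched with the elements reduced to plain C functions (the header layout and names are hypothetical, not NPClick's actual API); a counter stands in for trips to main memory:

```c
#include <assert.h>

struct ip_hdr { int version; unsigned dst; };

static int loads_issued;  /* counts trips to "main memory" */

static struct ip_hdr load_header(const struct ip_hdr *pkt) {
    loads_issued++;
    return *pkt;
}

/* Naive mapping: each element fetches the header independently. */
static unsigned route_naive(const struct ip_hdr *pkt) {
    struct ip_hdr h1 = load_header(pkt);   /* checkIPHeader */
    if (h1.version != 4) return 0;
    struct ip_hdr h2 = load_header(pkt);   /* ipLookup loads it again */
    return h2.dst;
}

/* Optimized mapping: the model guarantees the packet is unchanged
 * between elements, so the header loaded for checkIPHeader is reused. */
static unsigned route_optimized(const struct ip_hdr *pkt) {
    struct ip_hdr h = load_header(pkt);
    if (h.version != 4) return 0;
    return h.dst;
}
```

The guarantee that makes the second load removable comes from the model, not from analysis of the C code alone, which is the point of the slide.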
Concluding thoughts

- Reaching the true potential of multi-everything hardware
  - Scalability requires distributed parallelism and memory models
  - Requires new compilation tools to break the compiler memory wall
- Broad suite of analyses necessary
  - Advanced pointer analysis
  - Memory dataflow analysis
  - New interactions of classical analyses
- This is not just reinventing HPF
  - New distributed parallelism paradigms
  - New applications → new challenges!
- As the field develops, new domain-specific programming models will also benefit from advanced compilation technology