Breaking Down the Memory Wall
for Future Scalable Computing Platforms
Wen-mei Hwu
Sanders-AMD Endowed Chair Professor
with
John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,
Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng,
James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd,
Dan R. Burke, Nacho Navarro, Steven S. Lumetta
University of Illinois at Urbana-Champaign
DUSD(Labs)
Trends in hardware
• Large interconnect delay
  – Increasing interconnect delay and shrinking clock domains
  – Limited size of individual computing engines
• High variability
  – Increasing speed and power variability of transistors
  – Limited frequency increase
  – Reliability / verification challenges
[Figure: RC delay (ps) of a 1mm copper interconnect vs. clock period across process generations (350, 250, 180, 130, 90, 65nm); interconnect RC delay grows as the clock period shrinks. Companion plot: normalized leakage (Isb) vs. normalized frequency, showing roughly 5X leakage and 30% frequency variability. Data: Shekhar Borkar, Intel]
Wen-mei W. Hwu—University of Illinois at Urbana-Champaign
SIGMICRO Online Seminar—January 18, 2005
Trends in architecture
• Transistors are free… until connected or used
• Continued scaling of traditional processor core no longer economically viable
  – 2-3X effective area yields ~1.6X performance [PollackMICRO32]
  – Verification, power, transistor variability
• Only obvious scaling route: “Multi-Everything”
  – Multi-thread, multi-core, multi-memory, multi-?
• CW: Distributed parallelism is easy to design
  – But what about software?
  – If you build a better mousetrap…
A “multi-everything” processor of the future
• Distributed, less complex components
  – Variability, power density, and verification – easier to address
• Who bears the SW mapping burden?
  – General purpose software changes prohibitively expensive (cf. SIMD, IA-64)
  – Advanced compiler features (“Deep Analysis”)
  – New programming models / frameworks
  – Interactive compilers
[Diagram: GPP, APP, MTM, and two ACC units, each with a local memory, attached to main memory]
General purpose processor component(s)
• The system director
• Performs traditionally-programmed tasks
  – software migration starts here
• Likely multiple GPP’s
• Less complex processor cores
[Diagram: the GPP and its local memory highlighted within the multi-everything organization]
Computational efficiency through customization
• Goal: Offload most processing to more specialized, more efficient units
• Application Processors (APP)
  – Specialized instruction sets, memory organizations and access facilities
  – Think ASIC with knobs
• Programmable Accelerators (ACC)
  – Highly-specialized pipelines
  – Approximate ASIC design points
• Higher performance/watt than general purpose for target applications
Memory efficiency through diversity
• Traditional monolithic memory model – major power / performance sink
• Need partnership of general-purpose memory hierarchy and software-managed memories
• Local memories will reduce unnecessary memory traffic and power consumption
• Bulk data transfer scheduled by Memory Transfer Module
• Software will gradually adopt decentralized model for power and bandwidth
Tolerating communication & adding macropipelining
• Bulk communication overhead often substantial for traditional accelerators
• Shared memory / snooping communication approach limits available bandwidth
• Compilation tools will have to seamlessly connect processors and accelerators
• Accelerators will be able to operate on bulk transferred, buffered data… or on streamed data
Embedded systems already trying out this paradigm
[Diagram: three commercial examples. Intel IXP1200 Network Processor: ARM core, microengines, and access control. Intel IXP2400 Network Processor: XScale core, microengines, hash engine, scratchpad SRAM, TFIFO/RFIFO, SPI4/CSIX interface, PCI, CSRs, RDRAM and QDR SRAM channels. Philips Nexperia (Viper): MIPS core plus a VLIW media processor (MSP) and MPEG video acceleration]
Decentralizing parallelism in a JPEG decoder
[Figure: conceptual dataflow view of two JPEG decoding steps — the YCC image passes through an optional Upsample stage (which may be bypassed) into Color Conversion, which uses lookup tables to produce the RGB image]
• Convert a typical media-processing application to the decentralized model
  – Arrays used to implement streams
  – Multiple loci of computation with various models of parallelism
  – Memory access bandwidth a bottleneck w/o private data
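The “arrays used to implement streams” point can be made concrete with a small sketch (function and variable names are hypothetical; the upsampler is simplified to 2x sample replication): a producer loop fills an intermediate array that the consumer loop then reads — exactly the coupling the decentralized model has to break.

```c
#include <assert.h>

/* Producer stage: 2x upsampling of one chroma scanline by sample
   replication, writing into an intermediate array.  In the decoder,
   this array is how the two loci of computation communicate. */
void upsample_row(const unsigned char *in, unsigned char *out, int n) {
    for (int i = 0; i < n; i++) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[i];
    }
}

/* Consumer stage: a stand-in for color conversion; it cannot start
   until the producer has filled the whole array. */
int consume_row(const unsigned char *row, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += row[i];
    return s;
}
```

Treating such arrays as streams is what allows the two stages to be mapped onto different computing engines.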
Data privatization and local memory
[Figure: the same dataflow view of the two JPEG decoding steps, highlighting the color-conversion tables]
• Accelerate color conversion first (execute in ACC or APP)
  – Main processor sends inputs, receives outputs
• Large tables – inefficient to send data from main processor
  – Need tables to reside in the accelerator for efficiency of access
• Tables are initialized once during program execution, and never modified again
  – Accurate pointer analysis necessary to determine this
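The once-initialized table property can be sketched in C (a hypothetical range-limit table, loosely in the style of a JPEG decoder’s clamping table — not the actual code discussed in the talk). Pointer analysis must prove that after init_tables() runs, the table is only ever read; only then can it be privatized into the accelerator’s local memory.

```c
#include <assert.h>

/* Hypothetical clamping table: filled exactly once, read-only
   afterwards.  That write-once property is what the compiler must
   establish before moving the table into accelerator-local memory. */
static int clamp_table[512];

void init_tables(void) {
    for (int i = 0; i < 512; i++) {
        int v = i - 128;                 /* inputs range over [-128, 383] */
        clamp_table[i] = v < 0 ? 0 : (v > 255 ? 255 : v);
    }
}

/* Conversion kernel: pure reads of the table after initialization. */
int clamp(int v) {
    return clamp_table[v + 128];
}
```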
Increasing parallelism
[Figure: timeline comparison — sequential execution of Upsample then Convert through the intermediate upsampled image vs. streaming the data so the two loops overlap in time]
• Heavyweight loop nests communicate through intermediate array
• Direct streaming of data is possible, supports higher parallelism (macropipelining)
  – Convert() and Upsample() loops can be chained
  – Accurate interprocedural dataflow analysis is necessary
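A minimal sketch of the chaining idea (with hypothetical stand-ins for Upsample() and Convert()): the unfused version materializes the whole intermediate array before conversion begins, while the chained version converts each sample as soon as it is produced, which is what enables macropipelined overlap between the two stages.

```c
#include <assert.h>

static int convert(int y) { return y * 2; }   /* stand-in for color conversion */

/* Unfused: Upsample fills the whole intermediate array, then Convert
   runs over it -- the heavyweight-loop structure on the slide. */
int unfused(const int *in, int n, int *out) {
    int mid[16];                              /* intermediate array, n <= 8 assumed */
    for (int i = 0; i < n; i++) {             /* Upsample */
        mid[2 * i]     = in[i];
        mid[2 * i + 1] = in[i];
    }
    for (int i = 0; i < 2 * n; i++)           /* Convert */
        out[i] = convert(mid[i]);
    return 2 * n;
}

/* Chained: each upsampled sample is converted as it is produced; no
   intermediate array, so the two "stages" can run as a macropipeline. */
int chained(const int *in, int n, int *out) {
    for (int i = 0; i < n; i++) {
        out[2 * i]     = convert(in[i]);
        out[2 * i + 1] = convert(in[i]);
    }
    return 2 * n;
}
```

Proving the two versions equivalent is precisely where accurate interprocedural dataflow analysis is needed.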
How the next-generation compiler will do it (1)
[Diagram: callgraph and memory — Load Scanline calls Upsample, Color Conversion, and Table Initialization; the heavyweight loops are marked]
To-do list:
o Identify acceleration opportunities
o Localize memory
o Stream data and overlap computation
Acceleration opportunities:
o Heavyweight loops identified for acceleration
o However, they are isolated in separate functions called through pointers
How the next-generation compiler will do it (2)
[Diagram: same callgraph; initialization code and large constant lookup tables identified, with Upsample mapped to Accelerator 1 and Color Conversion to Accelerator 2]
To-do list:
✓ Identify acceleration opportunities
o Localize memory
o Stream data and overlap computation
Localize memory:
o Pointer analysis identifies localizable memory objects
o Private tables inside accelerator initialized once, saving most traffic
How the next-generation compiler will do it (3)
[Diagram: constant table privatized; input and output access patterns summarized for the two accelerators]
To-do list:
✓ Identify acceleration opportunities
✓ Localize memory
o Stream data and overlap computation
Streaming and computation overlap:
o Memory dataflow summarizes array/pointer access patterns
o Opportunities for streaming are automatically identified
o Unnecessary memory operations replaced with streaming
How the next-generation compiler will do it (4)
[Diagram: Accelerator 1 (Upsample) streams directly to Accelerator 2 (Color Conversion)]
To-do list:
✓ Identify acceleration opportunities
✓ Localize memory
✓ Stream data and overlap computation
Achieve macropipelining of parallelizable accelerators:
o Upsampling and color conversion can stream to each other
o Optimizations can have substantial effect on both efficiency and performance
Memory dataflow in the pointer world
[Figure: a Rows × Cols image stored as an array of constant pointers, each pointing to a separate row array of Y C C samples; the row arrays never overlap]
• Arrays are not true 3D arrays (unlike in Fortran)
• Actual implementation: array of pointers to arrays of samples
• New type of dataflow problem – understanding the semantics of memory structures instead of true arrays
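A sketch of that implementation style in C (names hypothetical, in the spirit of libjpeg’s sample arrays): each row is a separately allocated array reached through a pointer, so “row arrays never overlap” is an analysis result the compiler must derive, not a property guaranteed by the type.

```c
#include <assert.h>
#include <stdlib.h>

/* Image as an array of row pointers, each pointing at an independently
   allocated row of samples.  Nothing in the C type system says the
   rows are disjoint -- that is the fact memory dataflow analysis has
   to establish before treating this like a true 2D array. */
typedef unsigned char **sample_image;

sample_image alloc_image(int rows, int cols) {
    sample_image img = malloc(rows * sizeof *img);
    for (int r = 0; r < rows; r++)
        img[r] = calloc(cols, 1);       /* separate, non-overlapping row arrays */
    return img;
}

int row_sum(sample_image img, int r, int cols) {
    int s = 0;
    for (int c = 0; c < cols; c++)
        s += img[r][c];
    return s;
}
```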
Compiler vs. hardware memory walls
• Hardware memory wall
  – Prohibitive implementation cost of the memory system while trying to keep up with processor speed under a power budget
• Compiler memory wall
  – The use of memory as a generic pool obstructs the compiler’s view of true program and data structures
• The decentralized and diversified memory approach is key to breaking the hardware memory wall
• Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall
Pointer analysis: sensitivity, stability and safety [PASTE2004]
• Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
• Improved analysis algorithms provide more accurate call graphs instead of a blurred view for use by program transformation tools
[Figure: for 132.ijpeg, discovered objects vs. observed connectivity on log-log axes. With a small analysis scope the result is worse — a few, highly-connected objects; with a larger scope it is better — a multitude of distinct objects]
Pointer analysis: sensitivity, stability and safety
• Analysis is abstract execution
  – simplifying abstractions → analysis stability
  – “unrealizable dataflow” results
• Many components of accuracy (context, flow, heap, sub-field typing, arithmetic, …)
  – Typical to cut some corners to enable a “key” component for particular applications
• Making the components usefully compatible is a major contribution
  – No need for a priori corner-cutting → better results across broad code base
• Safety in “unsafe” languages
  – C poses major challenges
  – Efficiency challenge increased in safe algorithms
How do sensitivity, stability and safety coexist?
• Our two-pronged approach to sensitive, stable, safe pointer analysis
  – Summarization: only relevant details are forwarded to a higher level
  – Containment: the algorithm can cut its losses locally (like a bulkhead) … to avoid a global explosion in problem size
• Example: summarization-based context sensitivity…
[Diagram: a corporate hierarchy — workers report to managers, managers to VPs, VPs to the CEO — with abstraction increasing at each level]
Context sensitivity: naïve inlining
Example program:
  int g;
  iris() { int a; jade(&g, 1); jade(&a, 3); }
  jade(int *p, int q) { int r; *p := q; r := g + 5; x := z; }
Naïve inlining clones jade for each call site:
  jade1: p1 := &g; q1 := 1; *p1 := q1; r1 := g + 5; x1 := z1  →  g := 1
  jade2: p2 := &a; q2 := 3; *p2 := q2; r2 := g + 5; x2 := z2  →  a := 3
• Excess statements unnecessary and costly
• Without cloning, retention of side effects still leads to spurious results: merging p := {&g, &a} and q := {1, 3} yields g := 1 and a := 3 but also the spurious g := 3, a := 1, and r := 8 alongside the correct r := 6
Context sensitivity: summarization-based
Same program; instead of cloning, a compact summary of jade is used at each call site:
  jade(&g, 1): summary applies *p := q with p := &g, q := 1  →  g := 1
  jade(&a, 3): summary applies *p := q with p := &a, q := 3  →  a := 3
• Compact summary of jade used at each call site
• Summary accounts for all side-effects; BLOCK assignment prevents contamination across contexts
• Now, only the correct result is derived: g := 1, a := 3, r := 6
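The iris/jade example can be rendered as runnable C (with r returned through a pointer so both call sites can be observed, and the slide’s x := z line omitted). Executing it confirms what a context-sensitive analysis should derive: r is 6 in both contexts, and the r := 8 result is unrealizable.

```c
#include <assert.h>

int g;
int r_first, r_second;

/* The jade() of the slide: writes through p, then reads global g. */
void jade(int *p, int q, int *r_out) {
    *p = q;
    *r_out = g + 5;
}

/* The iris() of the slide: the same callee in two different contexts. */
void iris(void) {
    int a;
    jade(&g, 1, &r_first);    /* g := 1, so r := 6 */
    jade(&a, 3, &r_second);   /* a := 3; g is untouched, so r := 6 again */
}
```

A context-insensitive analysis that merges p ∈ {&g, &a} and q ∈ {1, 3} would also report g := 3 and r := 8, neither of which can occur in any execution.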
Analyzing large, complex programs [SAS2004]
• Originally, problem size exploded as more contexts were encountered
[Chart: naïve exhaustive inlining — number of contexts vs. call graph depth, from main() to the leaves, for espresso and li; counts climb into the 10^12–10^14 range]
• New algorithm contains problem size with each additional context
[Chart: new compaction algorithm — problem size vs. call graph depth for ijpeg, perl, gcc, perlbmk, gap, vortex, and twolf stays within roughly 10^0–10^5]
• This results in an efficient analysis process without loss of accuracy
[Table: analysis time per benchmark (008.espresso, 099.go, 124.m88ksim, 130.li, 134.perl, 175.vpr, 176.gcc, 254.gap, 255.vortex) — context-insensitive (inaccurate): 1–155 seconds; previous context-sensitive algorithm: from a few seconds up to hours or even months; new context-sensitive algorithm: 1–198 seconds]
The outlook in software
• Software is changing too, more gradually
• Applications driving development – rich in parallelism
  – Physical world – medicine, weather
  – Video, games – signal & media processing
• Source code availability
  – Open Source continues to grow
  – Microsoft’s Phoenix Compiler Project
• New programming models
  – Enhanced developer productivity & enhanced parallelism
Beyond the traditional language environment
• Domain-specific, higher-level modeling languages
  – More intuitive than C for inherently parallel problems
  – Implementation details abstracted away from developers
    → increased productivity, increased portability
• Still an important role for the compiler in this domain
  – Little visibility “through” the model for low-level optimization by developers
    → communication and memory optimization will be critical in next-gen systems
  – Model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
• As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles
Domain-specific modeling and optimization
[Figure: NPClick programming model — fromDevice elements feed checkIPHeader, then rt (ipLookup), then toDevice or discardPush. In the naïve implementation, checkIPHeader and ipLookup each load the IP header from packet data in main memory; after redundant load elimination, the compiler-optimized implementation loads the header once and carries it with the packet token]
• Programming model provides the compiler with information that one cannot extract with analysis alone
• Compiler breaks the limitations that are imposed by the model, allowing for efficient, high-performance binaries
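The redundant-load-elimination idea can be sketched as follows (all names are hypothetical; a counter stands in for the cost of a main-memory access): the naïve pipeline loads the IP header once per element, while the optimized pipeline loads it once and carries it in the packet token.

```c
#include <assert.h>

/* Hypothetical packet: the header word lives in (slow) main memory. */
struct packet { unsigned hdr; /* ... payload ... */ };

int loads_from_memory;                       /* counts expensive header loads */

static unsigned load_header(const struct packet *p) {
    loads_from_memory++;                     /* models a main-memory access */
    return p->hdr;
}

/* Naive: each pipeline element re-loads the header from packet data. */
unsigned naive_pipeline(const struct packet *p) {
    unsigned h1 = load_header(p);            /* checkIPHeader */
    (void)h1;
    return load_header(p);                   /* ipLookup */
}

/* Optimized: the header is loaded once and carried in the token. */
struct token { const struct packet *p; unsigned hdr; };

unsigned optimized_pipeline(const struct packet *p) {
    struct token t = { p, load_header(p) };  /* single load */
    unsigned h1 = t.hdr;                     /* checkIPHeader */
    (void)h1;
    return t.hdr;                            /* ipLookup */
}
```

The point of the slide is that the programming model tells the compiler the two elements see the same packet, so this rewrite is safe without heroic analysis.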
Concluding thoughts
• Reaching the true potential of multi-everything hardware
  – Scalability requires distributed parallelism and memory models
  – Requires new compilation tools to break the compiler memory wall
• Broad suite of analyses necessary
  – Advanced pointer analysis
  – Memory dataflow analysis
  – New interactions of classical analyses
• This is not just reinventing HPF
  – New distributed parallelism paradigms
  – New applications → new challenges!
• As the field develops, new domain-specific programming models will also benefit from advanced compilation technology