Transcript Slide 1
From Adaptive to Self-Tuning Systems
Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
School of Electrical and Computer Engineering
Architectural Challenges
Frequency Wall
Power Wall
• Negative returns with power
• Increasing inefficiencies due to
  • speculation
  • control flow
Power
Not much headroom left in stage-to-stage times (currently 8-12 FO4 delays) [4]
Single Thread Performance
Leakage current increases 7.5X with each generation [3]
ILP
Memory Wall
Source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
Cache Area
80% of transistor budget, 50% of total area [1]
Defects in cache affect processor yield
Significant power consumers (e.g., > 40% of total power in StrongARM) [2]
On-chip-DRAM gap continues to grow
Pipeline: in-order, OOO, aggressive OOO
Economic Wall
Costs of developing next generation processors
Design & Manufacturing costs
Extreme Device Variability
1. P. Ranganathan, S. Adve, N. Jouppi, "Reconfigurable Caches and their Application to Media Processing," ISCA 2000
2. M. Zhang, K. Asanovic, "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags," ISLPED 2002
3. S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, 1999
4. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, D. Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA 2000
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
System View
1. Capture and adapt to intrinsic application behavior
Static, off-line characterizations
[Figure: large-scale, many-core, heterogeneous system, drawn as a grid of tiles labeled P and M]
Dynamic, on-line, evolutionary behaviors
2. Device-level variations reduce architecture yield
Solution: Systems are self-tuning
The Space of Solutions
[Figure: the space of solutions, spanning structured to ill-structured workloads, with P and M blocks illustrating each class of architecture]
State of the practice: rigid HW/SW boundaries
Traditional architectures (fixed)
Ability to customize architectures before application deployment
Architectures change at SW-determined points of execution
Evolutionary or self-tuning systems
Architectures continuously and autonomously evolve and adapt
From Adaptive to Self Tuning
Where do we make future investments in transistors and software?
Hardware/software co-design for continuous monitoring and/or tuning
  • Expose and (dynamically) eliminate design redundancies
Two examples
  • Cache memory hierarchy
  • On-chip networks
Generational Behavior of Caches
[Figure: generational behavior of cache lines. Axes: memory lines versus time. Labels: miss, hit, idle interval, new generation]
1. S. Kaxiras, Z. Hu, M. Martonosi, "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power," ISCA 2001
2. J. Abella, A. González, X. Vera, M. F. P. O'Boyle, "IATAC: A Smart Predictor to Turn-Off L2 Cache Lines," TACO 2005
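The generational view can be made concrete with a small sketch. It is an illustration only: the event format, the `Generation` class, and the function names are assumptions introduced here, not taken from the cited papers. A miss opens a new generation for a cache frame, hits extend its live time, and the idle interval before the next miss is dead time that a decay-style policy could spend powered off.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    start: int        # time of the miss that fills the frame
    last_access: int  # time of the last access in this generation
    end: int          # time of the next miss, or the end of the trace

    @property
    def live_time(self) -> int:
        return self.last_access - self.start

    @property
    def dead_time(self) -> int:
        return self.end - self.last_access


def split_generations(events, trace_end):
    """Split (time, kind) events for ONE cache frame into generations.

    kind is "miss" or "hit"; events must be sorted by time.
    """
    generations = []
    start = last = None
    for time, kind in events:
        if kind == "miss":
            if start is not None:
                generations.append(Generation(start, last, time))
            start = last = time
        elif kind == "hit":
            if start is None:
                raise ValueError("hit before any miss")
            last = time
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if start is not None:
        generations.append(Generation(start, last, trace_end))
    return generations


def decayed_off_time(generations, decay_interval):
    """Idle time a frame could spend powered off (simplified assumption).

    The frame is assumed to switch off once it has been idle for
    decay_interval after the last access of a generation. Long gaps
    between hits inside a generation are ignored in this sketch.
    """
    return sum(max(0, g.dead_time - decay_interval) for g in generations)


# Hypothetical trace for a single frame.
events = [(0, "miss"), (2, "hit"), (5, "hit"), (20, "miss"), (22, "hit")]
gens = split_generations(events, trace_end=40)
```

With this trace the frame sees two generations. The dead time dominates the live time in both, which is the kind of redundancy the generational model is meant to expose.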
Cache Tuning: Conceptual Model
Remap memory into the cache: shape the cache
Match the program footprint: resize the cache
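The two knobs in this conceptual model, remapping and resizing, can be sketched with a toy direct-mapped cache. Everything here is hypothetical (the `TunableCache` class, its methods, and the trace); it only illustrates how a remap table overrides modulo placement and how the number of sets can be changed to match a small footprint.

```python
class TunableCache:
    """Toy direct-mapped cache with a remap table and a resizable set count."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.remap = {}      # memory line -> set index override
        self.resident = {}   # set index -> memory line currently stored

    def set_index(self, line):
        # Default placement is modulo; a remap entry overrides it.
        return self.remap.get(line, line % self.num_sets)

    def access(self, line):
        """Return True on a hit, False on a miss (the line is then filled)."""
        index = self.set_index(line)
        hit = self.resident.get(index) == line
        self.resident[index] = line
        return hit

    def resize(self, num_sets):
        """Change the number of sets; contents are flushed in this sketch."""
        self.num_sets = num_sets
        self.resident.clear()
        # Drop remap entries that point at sets that no longer exist.
        self.remap = {l: s for l, s in self.remap.items() if s < num_sets}


def count_misses(cache, trace):
    return sum(0 if cache.access(line) else 1 for line in trace)


trace = [0, 4, 0, 4, 0, 4]   # lines 0 and 4 conflict under modulo-4 placement

modulo = TunableCache(num_sets=4)
modulo_misses = count_misses(modulo, trace)

shaped = TunableCache(num_sets=4)
shaped.remap[4] = 1          # remap line 4 away from set 0
shaped_misses = count_misses(shaped, trace)

shaped.resize(2)             # footprint is two lines, so two sets suffice
resized_misses = count_misses(shaped, trace)
```

In this toy trace, remapping removes the conflict misses, and the resized cache serves the same footprint with half the sets.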
Cache Tuning: System Model & Opportunities
[Figure: system model. Program side: Region A contains a loop of statements with structured accesses and a remapping directive, Placement(B[][], param), supplied by static analysis or the programmer, or added by profile-based insertion. Hardware side: Thread 1 and Thread 2 on a processor (P) with L1, L2, and M; run-time tuning; an AT block with alternative implementations (LUT, logic).]
Static Tuning: Scientific Applications
Targeted to programs with predictable access patterns
  • Compiler can both resize and remap
  • Advanced compiler optimizations made possible
Dynamic Tuning: Folding Heuristics
Comparisons shown for a 256KB L2 cache
Find and utilize redundancies in the design
Miss folding: fold misses by remapping memory lines into the same cache set
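One possible reading of miss folding is sketched below. This is not the heuristic from the cited paper, whose details are not given on the slide; it assumes that lines which always miss (here, a hypothetical set of streaming lines) are remapped into a single set so that they stop evicting reused lines elsewhere. The function name, the trace, and the choice of set 0 are all illustrative assumptions.

```python
def count_misses(trace, num_sets, remap=None):
    """Count misses in a toy direct-mapped cache.

    remap maps a memory line to a set index and overrides modulo placement.
    """
    remap = remap or {}
    resident = {}   # set index -> memory line currently stored
    misses = 0
    for line in trace:
        index = remap.get(line, line % num_sets)
        if resident.get(index) != line:
            misses += 1
            resident[index] = line
    return misses


reused = [1, 2, 3]                    # lines with temporal reuse
streaming = [5, 6, 7, 9, 10, 11]      # lines touched once: they always miss
trace = reused + [5, 6, 7] + reused + [9, 10, 11] + reused

# Fold every streaming line into set 0 so their misses land on each other.
fold = {line: 0 for line in streaming}

baseline_misses = count_misses(trace, num_sets=4)
folded_misses = count_misses(trace, num_sets=4, remap=fold)
```

Under modulo placement every access in this trace misses, because the streaming lines keep evicting the reused ones. With folding, only the compulsory misses remain.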
S. Ramaswamy, S. Yalamanchili. Improving Cache Efficiency via Resizing + Remapping. ICCD 2007
Tuning for Yield: Decreasing Defect Sensitivity*
Recovering design inefficiencies
Performance yield: yield at a given performance (e.g., AMAT) for 1000 units
Up to four times greater than modulo placement
Exploiting redundancies: application to power management
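Performance yield can be illustrated with the standard AMAT formula, AMAT = hit time + miss rate × miss penalty. The sketch below counts the fraction of units that meet an AMAT target. The per-unit miss rates, latencies, and target are made-up inputs, not data from the cited study, and the function names are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(miss_rates, hit_time, miss_penalty, amat_target):
    """Fraction of units whose AMAT is at or below the target."""
    passing = sum(
        1 for rate in miss_rates
        if amat(hit_time, rate, miss_penalty) <= amat_target
    )
    return passing / len(miss_rates)


# Hypothetical per-unit miss rates, e.g. after defects disable some cache lines.
unit_miss_rates = [0.02, 0.05, 0.10, 0.20]
yield_at_target = performance_yield(
    unit_miss_rates, hit_time=1, miss_penalty=100, amat_target=6.5
)
```

A placement scheme that keeps the miss rate low in the presence of defects moves more units under the target, which is how remapping could raise yield at a fixed performance level.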
S. Ramaswamy, S. Yalamanchili, “Customizable Fault Tolerant Caches for Embedded Processors,” ICCD 2006
Opportunities
Voltage scaling
  • Combine voltage scaling and remapping for program-phase-dependent power management
Compiler-directed hardware optimizations
  • For example, concurrent data layout + cache placement
Application to multi-threaded and multi-core domains
  • Cache sharing across threads
  • Challenge: coherency traffic
The On-Chip Network
The network is in the critical path (performance)
  • Operand networks
  • Cache hierarchy
  • System on Chip
Increasing impact of wire (channel) delays
  • Wire delays must be actively managed
On-demand resource management
  • Initial studies: link tuning
  • Reference: research at EPFL & Stanford on robust link design
A System for Tuning and Actively Reconfiguring SoC Links (STARS)
[Figure: timing diagram of Latch 1, Latch 2, and Latch 3, each capturing Value 1 or Value 2 over time, for three cases: too fast, well tuned, too slow]
Variable delays and cascaded registers measure link delay
Digital PLL tunes the clock to match the link delay
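The measurement idea can be modeled in software as a sweep. This is a sketch under stated assumptions, not the STARS circuit: it assumes a latch clocked after a programmable delay captures the new value only if that delay is at least the link delay, and it sweeps the delay code until the capture succeeds. The function names, the 10 ps step, and the example link delay are hypothetical.

```python
def latch_captures_new_value(link_delay_ps, sample_delay_ps):
    """Idealized latch: sees the new value once the link has settled."""
    return sample_delay_ps >= link_delay_ps


def measure_link_delay(link_delay_ps, min_delay_ps, step_ps, num_codes):
    """Sweep a variable delay and return (code, delay_ps) of the first capture.

    Returns None if the link is slower than the largest programmable delay.
    """
    for code in range(num_codes):
        sample_delay_ps = min_delay_ps + code * step_ps
        if latch_captures_new_value(link_delay_ps, sample_delay_ps):
            return code, sample_delay_ps
    return None


# Hypothetical link with a 500 ps delay, swept in 10 ps steps from 118 ps.
measured = measure_link_delay(
    link_delay_ps=500, min_delay_ps=118, step_ps=10, num_codes=1024
)
```

The measured value overestimates the true delay by at most one step, so the resolution of the delay element bounds the accuracy of the tuning.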
FPGA Tests
[Figure: control flow. Monitoring: find end of link transition, find start of link transition, determine slack in the link. Tuning: adjust clock frequency.]
Low-speed tests to validate the control strategy
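The monitoring and tuning steps can be sketched as a simple control loop. This is an assumed model rather than the tested controller: slack is taken to be the clock period minus the time at which the link transition ends, and the clock period is stepped until the slack falls inside a target window. The margin, step size, and function names are hypothetical, and the start of the transition is not used in this simplified version.

```python
def classify(period_ps, transition_end_ps, margin_ps, step_ps):
    """Classify the clock relative to the link using the measured slack."""
    slack_ps = period_ps - transition_end_ps
    if slack_ps < margin_ps:
        return "too fast"
    if slack_ps > margin_ps + step_ps:
        return "too slow"
    return "well tuned"


def tune_period(period_ps, transition_end_ps, margin_ps=50, step_ps=100,
                max_steps=1000):
    """Step the clock period until the slack lies in [margin, margin + step]."""
    for _ in range(max_steps):
        state = classify(period_ps, transition_end_ps, margin_ps, step_ps)
        if state == "well tuned":
            return period_ps
        # Lengthen the period when too fast, shorten it when too slow.
        period_ps += step_ps if state == "too fast" else -step_ps
    raise RuntimeError("clock did not converge")


from_slow = tune_period(period_ps=2000, transition_end_ps=900)
from_fast = tune_period(period_ps=600, transition_end_ps=900)
```

Because the target window is exactly one step wide, the loop cannot jump over it, so it converges from either direction.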
Prototyping: 180nm
Variable Delay Elements (VDE)
  • Variable delay from 118ps to 1.47ns
  • 10 bits of resolution
  • 502 transistors
Digitally Controlled Oscillator (DCO)
  • Clock period from 240ps to 2.97ns
  • 10 bits of resolution
  • 528 transistors
Digital Clock Divider (DCD)
  • Min input clock period 480ps
  • 8 bits of resolution
  • 1127 transistors
Allows tuning links up to 2.083 GHz
  • From reference clock of 8.13MHz
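The headline numbers are consistent with simple arithmetic on the component figures, as the sketch below shows. Two points are assumptions rather than statements from the slide: that the reference clock equals the maximum link frequency divided by 2^8 (the divider's 8 bits), and that the delay and period codes are linearly spaced, which gives an average step size per code.

```python
PS = 1e-12  # one picosecond in seconds

# Maximum link frequency from the divider's minimum input clock period.
min_input_period_ps = 480
max_link_freq_hz = 1.0 / (min_input_period_ps * PS)

# Assumption: the reference clock is the maximum frequency divided by 2**8.
divider_bits = 8
reference_clock_hz = max_link_freq_hz / (2 ** divider_bits)


def average_step_ps(min_ps, max_ps, bits):
    """Average step per code, assuming linearly spaced codes."""
    return (max_ps - min_ps) / (2 ** bits - 1)


vde_step_ps = average_step_ps(118, 1470, 10)   # variable delay element
dco_step_ps = average_step_ps(240, 2970, 10)   # digitally controlled oscillator
```

Under these assumptions the maximum frequency is about 2.083 GHz and the reference clock about 8.138 MHz, which agrees with the 2.083 GHz and 8.13MHz figures on the slide.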
Extensions
Modulate link widths
Modulate buffer organizations
  • Channels/depth
Feedback between local congestion detection and link and buffer resources
Summary
Application demands will be time varying
Technology will introduce time-varying hardware characteristics
Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
  • Need the support of abstractions for tuning
  • Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
  • Build on existing research in cache performance & power management