Transcript Slide 1

From Adaptive to Self-Tuning Systems
Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
School of Electrical and Computer Engineering
Architectural Challenges
Frequency Wall
• Not much headroom left in stage-to-stage times (currently 8-12 FO4 delays) [4]
Power Wall
• Leakage current increases 7.5X with each generation [3]
ILP
• Negative returns with power
• Increasing inefficiencies due to
  • speculation
  • control flow
[Figure: single-thread performance vs. power for in-order, OOO, and aggressive OOO pipelines]
Memory Wall
Source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
• Cache area: 80% of transistor budget, 50% of total area [1]
• Defects in cache affect processor yield
• Significant power consumers (e.g. > 40% of total power in StrongARM) [2]
• On-chip-DRAM gap continues to grow
Economic Wall
• Costs of developing next generation processors
• Design & manufacturing costs
• Extreme device variability
1. P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000
2. M. Zhang, K. Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 2002
3. S. Borkar. "Design Challenges of Technology Scaling." IEEE Micro 1999
4. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, D. Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures." ISCA 2000
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
2
System View
1. Capture and adapt to intrinsic application behavior
• Static, off-line characterizations
• Dynamic, on-line, evolutionary behaviors
[Figure: large-scale, many-core, heterogeneous system built from processor (P) and memory (M) elements]
2. Device-level variations reduce architecture yield
Solution: Systems are self-tuning
The Space of Solutions
[Figure: space of architectures, with P and M blocks, spanning structured to ill-structured workloads]
State of the Practice: Rigid HW/SW Boundaries
Evolutionary or Self-Tuning Systems
• Traditional architectures (fixed)
• Ability to customize architectures before application deployment
• Architectures change at SW-determined points of execution
• Architectures continuously and autonomously evolve and adapt
From Adaptive to Self-Tuning
• Where do we make future investments in transistors and software?
• Hardware/software co-design for continuous monitoring and/or tuning
  • Expose and (dynamically) eliminate design redundancies
• Two examples
  • Cache memory hierarchy
  • On-chip networks
Generational Behavior of Caches
[Figure: accesses to memory lines over time; a miss starts a new generation, hits follow, and an idle interval precedes the next new generation]
1. S. Kaxiras, Z. Hu, M. Martonosi. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001
2. J. Abella, A. González, X. Vera, M. F. P. O'Boyle. "IATAC: A Smart Predictor to Turn-off L2 Cache Lines." TACO 2005
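The generational view can be made concrete with a small trace analysis. The sketch below is only an illustration, not the mechanism of the cited papers: it assumes a hypothetical trace of (time, line) accesses that all map to one cache frame, treats each miss as the start of a new generation, and reports each generation's live time (miss to last hit) and dead time (last hit to eviction). The names `Generation` and `generations` are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Generation:
    line: int        # memory line resident during this generation
    start: int       # time of the miss that brought the line in
    last_hit: int    # time of the final access before eviction
    end: int         # time of the next miss (eviction), or end of trace

    @property
    def live_time(self) -> int:
        return self.last_hit - self.start

    @property
    def dead_time(self) -> int:
        return self.end - self.last_hit


def generations(trace: List[Tuple[int, int]], trace_end: int) -> List[Generation]:
    """Split time-ordered accesses to a single cache frame into generations."""
    result: List[Generation] = []
    current: Optional[Generation] = None
    for time, line in trace:
        if current is None or line != current.line:
            # Miss: the resident line (if any) is evicted, a new generation begins.
            if current is not None:
                current.end = time
                result.append(current)
            current = Generation(line=line, start=time, last_hit=time, end=trace_end)
        else:
            # Hit: the current generation is still live.
            current.last_hit = time
    if current is not None:
        result.append(current)
    return result


# Hypothetical trace: line 7 is used briefly, idles, then line 9 replaces it.
trace = [(0, 7), (2, 7), (5, 7), (40, 9), (41, 9)]
gens = generations(trace, trace_end=50)
for g in gens:
    print(g.line, g.live_time, g.dead_time)
```

In this trace the frame holding line 7 is live for 5 time units and then idle for 35 before the next generation starts, which is the kind of idle interval a decay-style policy could use to switch the frame off.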
Cache Tuning: Conceptual Model
• Remap memory into the cache → shape the cache
• Match the program footprint → resize the cache
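The two knobs above, remapping and resizing, can be sketched with a toy direct-mapped cache. This is a minimal model under stated assumptions, not the hardware described in the talk: the class `TunableCache`, its `resize` and `place` methods, and the trace are all hypothetical, and resizing simply flushes the cache.

```python
from typing import Dict, Iterable


class TunableCache:
    """Direct-mapped cache model whose size and line-to-set mapping can be tuned."""

    def __init__(self, num_sets: int) -> None:
        self.num_sets = num_sets            # physical sets available
        self.active_sets = num_sets         # sets currently in use
        self.remap: Dict[int, int] = {}     # memory line -> set override
        self.sets: Dict[int, int] = {}      # set index -> resident memory line

    def resize(self, active_sets: int) -> None:
        """Resize: keep only the first `active_sets` sets in use and flush."""
        assert 1 <= active_sets <= self.num_sets
        self.active_sets = active_sets
        self.sets.clear()

    def place(self, line: int, set_index: int) -> None:
        """Shape: pin a memory line to a chosen set."""
        self.remap[line] = set_index

    def index(self, line: int) -> int:
        # Default is modulo placement; a remap entry overrides it.
        return self.remap.get(line, line) % self.active_sets

    def access(self, line: int) -> bool:
        """Return True on a hit, False on a miss (the line is then filled)."""
        s = self.index(line)
        hit = self.sets.get(s) == line
        self.sets[s] = line
        return hit


def misses(cache: TunableCache, trace: Iterable[int]) -> int:
    return sum(not cache.access(line) for line in trace)


# Hypothetical footprint: lines 0 and 8 conflict under modulo placement with 8 sets.
trace = [0, 8] * 10

baseline = TunableCache(num_sets=8)
baseline_misses = misses(baseline, trace)   # the two lines evict each other in set 0

tuned = TunableCache(num_sets=8)
tuned.resize(2)                             # the footprint is two lines, so two sets suffice
tuned.place(8, 1)                           # remap line 8 away from line 0
tuned_misses = misses(tuned, trace)         # only the two cold misses remain

print(baseline_misses, tuned_misses)
```

The tuned configuration uses a quarter of the sets and still removes the conflict misses, which is the point of combining the two operations.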
Cache Tuning: System Model & Opportunities
[Figure: system model]
• Program view: Region A contains a loop of statements with structured accesses and a remapping directive, Placement(B[][], param)
• Directive sources: static analysis or programmer supplied; profile based insertion
• Hardware view: Thread 1 and Thread 2 on P, with L1, L2, AT, and M; run-time tuning
• Alternative implementations: LUT, logic
Static Tuning: Scientific Applications
• Targeted to programs with predictable access patterns
• Compiler can both resize and remap
  • Advanced compiler optimizations made possible
Dynamic Tuning: Folding Heuristics
Comparisons shown for a 256KB L2 cache
Find and utilize redundancies in the design
• Miss folding → fold misses via re-mapping memory lines into the same cache set
S. Ramaswamy, S. Yalamanchili. Improving Cache Efficiency via Resizing + Remapping. ICCD 2007
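One plausible reading of miss folding can be sketched as follows. This is an assumption-laden illustration, not the heuristic evaluated in the cited ICCD 2007 paper: lines that are touched only once miss wherever they are placed, so the sketch remaps all of them into one set to keep them from evicting reused lines elsewhere. The functions `count_misses` and `fold_single_use_lines`, the 4-set cache, and the workload are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def count_misses(trace: List[int], num_sets: int, remap: Dict[int, int]) -> int:
    """Simulate a direct-mapped cache; `remap` overrides modulo placement per line."""
    resident: Dict[int, int] = {}   # set index -> memory line currently stored
    misses = 0
    for line in trace:
        s = remap.get(line, line % num_sets)
        if resident.get(s) != line:
            misses += 1
            resident[s] = line
    return misses


def fold_single_use_lines(trace: List[int], target_set: int) -> Dict[int, int]:
    """Build a remap that sends every line touched exactly once to `target_set`."""
    uses = Counter(trace)
    return {line: target_set for line, n in uses.items() if n == 1}


# Hypothetical workload: four reused lines plus four fresh streaming lines per pass.
NUM_SETS = 4
trace: List[int] = []
for i in range(8):
    trace += [0, 1, 2, 3]                            # reused working set
    trace += [100 + 4 * i + k for k in range(4)]     # streaming lines, each used once

modulo_misses = count_misses(trace, NUM_SETS, remap={})
folded_misses = count_misses(trace, NUM_SETS, fold_single_use_lines(trace, target_set=3))
print(modulo_misses, folded_misses)
```

Under modulo placement the streaming lines sweep through every set and evict the whole working set on each pass; after folding, only the set that absorbs the streaming lines keeps missing.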
Tuning for Yield: Decreasing Defect Sensitivity*
Recovering Design Inefficiencies
• Performance yield → yield at a given performance (e.g. AMAT) for 1000 units
• Up to four times greater than modulo placement
• Exploiting redundancies → application to power management
S. Ramaswamy, S. Yalamanchili, “Customizable Fault Tolerant Caches for Embedded Processors,” ICCD 2006
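Performance yield can be illustrated with a small Monte Carlo sketch. It uses the standard definition AMAT = hit time + miss rate × miss penalty and counts the fraction of 1000 simulated units that meet an AMAT target. Everything else is an assumption made for illustration and not taken from the cited paper: the cache size, footprint, defect probability, latencies, target, and the simple steady-state miss-rate models for modulo and remapped placement.

```python
import random
from typing import Set


def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def miss_rate_modulo(defective: Set[int], footprint: int) -> float:
    # With footprint <= number of sets, line l sits in set l; a line whose
    # set is defective is assumed to miss on every access.
    return sum(1 for l in range(footprint) if l in defective) / footprint


def miss_rate_remapped(defective: Set[int], footprint: int, num_sets: int) -> float:
    # Lines are steered to working sets; only lines left without a set miss.
    working = num_sets - len(defective)
    return max(0, footprint - working) / footprint


def performance_yield(units: int, remap: bool, seed: int = 0) -> float:
    """Fraction of simulated units whose AMAT meets the target."""
    rng = random.Random(seed)
    num_sets, footprint = 64, 48                        # hypothetical sizes
    defect_prob = 0.02                                  # hypothetical per-set defect rate
    hit_time, miss_penalty, target = 1.0, 100.0, 2.0    # cycles, all assumed
    good = 0
    for _ in range(units):
        defective = {s for s in range(num_sets) if rng.random() < defect_prob}
        if remap:
            rate = miss_rate_remapped(defective, footprint, num_sets)
        else:
            rate = miss_rate_modulo(defective, footprint)
        if amat(hit_time, rate, miss_penalty) <= target:
            good += 1
    return good / units


yield_modulo = performance_yield(1000, remap=False)
yield_remap = performance_yield(1000, remap=True)
print(yield_modulo, yield_remap)
```

Both runs use the same seed, so they see the same defect maps; remapping can only match or improve the yield in this model because spare working sets absorb the lines that modulo placement would leave in defective sets. The numbers it prints depend entirely on the assumed parameters and are not a reproduction of the result quoted above.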
Opportunities
• Voltage scaling
  • Combine voltage scaling and remapping for program phase dependent power management
• Compiler-directed hardware optimizations
  • For example, concurrent data layout + cache placement
• Application to multi-threaded and multi-core domains
  • Cache sharing across threads
  • Challenge: coherency traffic
The On-Chip Network
• The network is in the critical path (performance)
  • Operand networks
  • Cache hierarchy
  • System on Chip
• Increasing impact of wire (channel) delays
  • Wire delays must be actively managed
• On-demand resource management
  • Initial studies: link tuning
  • Reference: Research at EPFL & Stanford on robust link design
A System for Tuning and Actively Reconfiguring SoC Links (STARS)
[Figure: timing diagram over Time showing Value 1 and Value 2 at Latch 1, Latch 2, and Latch 3 for three cases: Too Fast, Well Tuned, Too Slow]
• Variable delays and cascaded registers measure link delay
• Digital PLL tunes the clock to match the link delay
FPGA Tests
• Low-speed tests to validate the control strategy
Monitoring
• Find Start of Link Transition
• Find End of Link Transition
• Determine Slack in the Link
Tuning
• Adjust Clock Frequency
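The monitoring and tuning steps can be sketched as a simple software control loop. This is a behavioral illustration under stated assumptions, not the FPGA implementation: the `Link` model, the `monitor` and `tune` functions, the sweep step, and the margin are all hypothetical, and the clock adjustment is a plain integer update rather than a hardware clock generator.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Link:
    """Hypothetical link: its output changes between two times after a clock edge."""
    transition_start_ps: int
    transition_end_ps: int

    def unchanged(self, sample_ps: int) -> bool:
        """True if a latch sampling at `sample_ps` still sees the old value."""
        return sample_ps < self.transition_start_ps

    def settled(self, sample_ps: int) -> bool:
        """True if a latch sampling at `sample_ps` sees the stable new value."""
        return sample_ps >= self.transition_end_ps


def monitor(link: Link, period_ps: int, step_ps: int) -> Tuple[Optional[int], Optional[int]]:
    """Sweep the sampling point across one period to locate the transition window."""
    start = end = None
    for t in range(0, period_ps + 1, step_ps):
        if start is None and not link.unchanged(t):
            start = t          # first sample that no longer sees the old value
        if end is None and link.settled(t):
            end = t            # first sample that sees the stable new value
    return start, end


def tune(link: Link, period_ps: int, step_ps: int, margin_ps: int, max_iters: int = 100) -> int:
    """Adjust the clock period until the slack is within one step of the margin."""
    for _ in range(max_iters):
        _, end = monitor(link, period_ps, step_ps)
        if end is None:
            period_ps += step_ps       # too fast: transition not seen within the period
            continue
        slack = period_ps - end
        if slack < margin_ps:
            period_ps += step_ps       # too fast: not enough margin
        elif slack >= margin_ps + step_ps:
            period_ps -= step_ps       # too slow: slack can be recovered
        else:
            break                      # well tuned
    return period_ps


link = Link(transition_start_ps=300, transition_end_ps=700)
window = monitor(link, period_ps=2000, step_ps=50)
tuned_from_slow = tune(link, period_ps=2000, step_ps=50, margin_ps=100)
tuned_from_fast = tune(link, period_ps=500, step_ps=50, margin_ps=100)
print(window, tuned_from_slow, tuned_from_fast)
```

Starting from either a period that is too long or one that is too short, the loop converges to the same period, the measured end of the transition plus the chosen margin.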
Prototyping: 180nm
• Variable Delay Elements (VDE)
  • Variable delay from 118ps to 1.47ns
  • 10 bits of resolution
  • 502 transistors
• Digitally Controlled Oscillator (DCO)
  • Clock period from 240ps to 2.97ns
  • 10 bits of resolution
  • 528 transistors
• Digital Clock Divider (DCD)
  • Min input clock period 480ps
  • 8 bits of resolution
  • 1127 transistors
• Allows tuning links up to 2.083 GHz
  • From reference clock of 8.13MHz
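The figures above can be cross-checked with a little arithmetic. The sketch below assumes, without confirmation from the slides, that each range is divided evenly across its control bits and that the 8-bit divider divides by up to 2**8; the helper names `step_ps` and `freq_ghz` are made up for this example.

```python
def step_ps(min_ps: float, max_ps: float, bits: int) -> float:
    """Average step if the range is spread evenly over 2**bits settings (assumed)."""
    return (max_ps - min_ps) / (2 ** bits - 1)


def freq_ghz(period_ps: float) -> float:
    """Frequency in GHz for a clock period given in picoseconds."""
    return 1000.0 / period_ps


vde_step = step_ps(118, 1470, 10)            # variable delay element, ps per setting
dco_step = step_ps(240, 2970, 10)            # digitally controlled oscillator, ps per setting
dco_max_ghz = freq_ghz(240)                  # fastest DCO clock
dcd_max_in_ghz = freq_ghz(480)               # fastest clock the divider accepts
ref_mhz = dcd_max_in_ghz * 1000 / 2 ** 8     # after dividing by 2**8 (assumed)

print(round(vde_step, 2), round(dco_step, 2))
print(round(dco_max_ghz, 3), round(dcd_max_in_ghz, 3), round(ref_mhz, 2))
```

A 480ps minimum divider input corresponds to about 2.083 GHz, matching the stated link tuning limit, and dividing that by 256 gives roughly 8.14 MHz, close to the quoted 8.13MHz reference clock. The divide-by-256 relationship is an inference from the numbers, not something the slide states.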
Extensions
• Modulate link widths
• Modulate buffer organizations
  • Channels/depth
• Feedback between local congestion detection and link and buffer resources
Summary
• Application demands will be time varying
• Technology will introduce time-varying hardware characteristics
• Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
  • Need the support of abstractions for tuning
  • Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
  • Build on existing research in cache performance & power management