Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen
Presenter: Borys Bradel
1
Introduction

Different programs have different requirements (e.g. ILP)
  This extends to phases of a single program
Heterogeneous cores
  Use the core that matches the requirements
Reuse existing cores
  Use multiple generations of the same family of processors
2
Outline

Methodology
  Hardware
  Assumptions
  Power
Experiments
  Optimal – energy/energy-delay product
  Heuristic based – static/dynamic
Related Work
Conclusion
3
Single-ISA Multi-Core Benefits

Small area overhead because of the growth in core sizes between generations
Clock frequencies of older cores would scale with technology
  P3 1 GHz = P4 1.4 GHz
  Pipeline depth was increased in newer designs precisely because older ones could not scale
4
Hardware – Alpha Family

2 in-order cores
  EV4 = 21064
  EV5 = 21164
2 out-of-order cores
  EV6 = 21264
  EV8- = 21464 (multithreading support removed)
5
Hardware Size

15% more area than using the 21464 alone
6
Assumptions

Can switch cores dynamically
Private L1 caches and a shared L2 cache
All cores use 0.10 micron technology
A single process executes on a single core at any one time
2.1 GHz clock (the 600 MHz, 0.35 micron 21264 scaled to 0.10 micron)
Input voltage 1.2 V
Cores shut down when idle
1000 cycle restart cost (staged power-up; the phase-locked loop is left running)
150 ns memory access
Stall cycles (cache latencies) estimated with CACTI
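The 2.1 GHz clock and the cost of a 1000 cycle restart can be checked with a little arithmetic. The sketch below assumes, as a first-order approximation, that clock frequency scales inversely with feature size. The function names are illustrative and do not come from the paper.

```python
def scaled_frequency_mhz(base_mhz, base_feature_um, target_feature_um):
    """First-order assumption: frequency scales inversely with feature size."""
    return base_mhz * base_feature_um / target_feature_um


def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a cycle count to microseconds (1 MHz = 1 cycle per microsecond)."""
    return cycles / clock_mhz


# 21264 (EV6): 600 MHz at 0.35 micron, scaled to 0.10 micron
ev6_mhz = scaled_frequency_mhz(600.0, 0.35, 0.10)
print(f"Scaled EV6 clock: {ev6_mhz:.0f} MHz")

# 1000 cycle restart cost at that clock
restart_us = cycles_to_microseconds(1000, ev6_mhz)
print(f"Restart cost: {restart_us:.2f} microseconds")
```

Under this assumption the scaled clock comes out at about 2100 MHz, and a 1000 cycle restart takes roughly half a microsecond.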
7
Core Configurations
8
Power Model

Use Wattch to account for activity-based dissipation
Use scaling and offset factors to account for other effects
This hybrid model is closer to manufacturers' data points
Peak power: from data sheets, less L2 cache and output pins
Typical power: scaled based on Intel chips
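One possible reading of the scaling and offset step is a linear correction fitted so that the simulator's output matches two target points, peak and typical power. The sketch below is an assumption about the form of that correction, not the paper's actual calibration procedure. All names and numbers are hypothetical.

```python
def fit_scale_offset(sim_peak, sim_typical, target_peak, target_typical):
    """Fit power = scale * sim + offset through two (simulated, target) points."""
    scale = (target_peak - target_typical) / (sim_peak - sim_typical)
    offset = target_peak - scale * sim_peak
    return scale, offset


def hybrid_power(sim_watts, scale, offset):
    """Apply the fitted correction to an activity-based estimate."""
    return scale * sim_watts + offset


# Hypothetical values in watts, chosen only to illustrate the fit.
scale, offset = fit_scale_offset(
    sim_peak=60.0, sim_typical=20.0, target_peak=50.0, target_typical=30.0
)
print(scale, offset)
print(hybrid_power(60.0, scale, offset), hybrid_power(20.0, scale, offset))
```

With two calibration points the fit is exact at both, so the corrected model reproduces the target peak and typical values by construction.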
9
Power and Area Statistics
10
Performance Modeling

Use SMTSIM, a cycle-accurate simulator
SimPoint is used to identify representative portions of each program and how many instructions need to be fast-forwarded

11
Varying Performance Ratio
12
Varying Energy Efficiency Ratio
13
Oracle Switching for Energy

Performance always within 10% of EV8-
14
Oracle Switching for Energy
15
Oracle Switching for Energy-Delay Product

Performance always within 50% of EV8-
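One way to read the oracle slides: for each interval, choose the core that minimizes the metric (energy, or energy-delay product) among the cores whose performance stays within the stated bound of EV8-. The Python sketch below assumes that per-interval energy and time are known for every core, that switching overhead is ignored, and that performance is 1/time. The numbers and function names are made up for illustration.

```python
def within_bound(time_s, reference_time_s, max_perf_loss):
    # Performance is taken as 1 / time, so losing at most max_perf_loss
    # means time <= reference_time / (1 - max_perf_loss).
    return time_s <= reference_time_s / (1.0 - max_perf_loss)


def oracle_pick(interval, metric, max_perf_loss, reference="EV8-"):
    """Return the core with the lowest metric among cores meeting the bound."""
    ref_time = interval[reference][1]
    eligible = {
        core: (energy_j, time_s)
        for core, (energy_j, time_s) in interval.items()
        if within_bound(time_s, ref_time, max_perf_loss)
    }
    return min(eligible, key=lambda core: metric(*eligible[core]))


def energy(energy_j, time_s):
    return energy_j


def energy_delay(energy_j, time_s):
    return energy_j * time_s


# Hypothetical (energy in J, time in s) for one interval; not measured data.
interval = {
    "EV4": (1.0, 3.0),
    "EV5": (2.0, 1.8),
    "EV6": (4.0, 1.05),
    "EV8-": (10.0, 1.0),
}

energy_choice = oracle_pick(interval, energy, max_perf_loss=0.10)
edp_choice = oracle_pick(interval, energy_delay, max_perf_loss=0.50)
print(energy_choice, edp_choice)
```

With these made-up numbers the energy oracle picks EV6 and the energy-delay oracle picks EV5. The looser 50% bound lets a smaller core qualify.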
16
Oracle Switching for Energy-Delay Product
17
Others

Voltage/frequency scaling – not as good
Static core selection
  Only EV6 and EV8- are used
Dynamic heuristic
  Running average performance within 10%
  Every 100 time intervals (100 million instructions), cores are sampled for 5 intervals
  Select the best core based on sampling
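The sampling loop of the dynamic heuristic can be sketched as follows. Only the 100-interval period and the 5-interval sampling window come from the slide. The round-robin order over cores, the use of energy-delay product as the selection metric, the starting core, and the `measure` callback are assumptions made for illustration. The running-average performance check is left out.

```python
CORES = ["EV4", "EV5", "EV6", "EV8-"]
PERIOD = 100  # intervals between sampling phases (from the slide)
SAMPLE = 5    # intervals spent sampling in each phase (from the slide)


def run_heuristic(n_intervals, measure, cores=CORES):
    """Return the core used in each interval.

    measure(core, i) is a hypothetical callback returning (energy, time)
    for running interval i on the given core.
    """
    schedule = []
    current = cores[-1]  # assumption: start on the largest core
    i = 0
    while i < n_intervals:
        if i % PERIOD == 0:
            # Sampling phase: visit cores round-robin (an assumption) and
            # record the energy-delay product seen on each.
            samples = {}
            for k in range(SAMPLE):
                if i >= n_intervals:
                    break
                core = cores[k % len(cores)]
                energy, time_s = measure(core, i)
                samples.setdefault(core, []).append(energy * time_s)
                schedule.append(core)
                i += 1
            # Keep the core with the lowest mean energy-delay product.
            current = min(samples, key=lambda c: sum(samples[c]) / len(samples[c]))
        else:
            measure(current, i)
            schedule.append(current)
            i += 1
    return schedule


# Hypothetical fixed per-core (energy, time) values; not measured data.
TABLE = {"EV4": (1.0, 4.0), "EV5": (2.0, 1.8), "EV6": (4.0, 1.05), "EV8-": (10.0, 1.0)}
schedule = run_heuristic(210, lambda core, i: TABLE[core])
print(schedule[:8])
```

With this fixed table, every sampling phase selects EV5, so the schedule alternates between five sampling intervals and 95 intervals on EV5.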
18
Results for Heuristics
19
Results for Heuristics/Static Core
20
Related Work

Gating-based power optimization
  Cannot gate at a fine enough granularity
  May still have leakage
  This approach could be thought of as gating to reduce the capabilities of different units
Voltage and frequency scaling
  Chip-wide – one size does not fit all
  Fine-grained – granularity problems
21
Conclusions

Heterogeneous multi-core architectures reduce the energy-delay product
  More fine-grained than other approaches
Using several cores from the same family is good
  Reduces development/testing costs
Is it scalable?
Just use EV6??
22