Transcript Slides

Memory Management Challenges in
the Power-Aware Computing Era
Dr. Avi Mendelson,
Intel - Mobile Processors Architecture group
[email protected]
and adjunct Professor in the CS and EE departments,
Technion Haifa
mendlson@{cs,ee}.technion.ac.il
Disclaimer
• No Intel proprietary information is disclosed.
• Every future estimate or projection is only speculation.
• Responsibility for all opinions and conclusions falls on the author only. That does not mean you cannot trust them…
June 10th 2006
© Dr. Avi Mendelson - ISMM'2006
Before we start
• Personal observation: focusing on low power resembles Alice Through the Looking-Glass: we are looking at the same old problems, but from the other side of the looking glass, and the landscape appears much different…
• Out-of-the-box thinking is needed.
Agenda
• What are power-aware architectures and why they are needed
  – Background of the problem
  – Implications and architectural directions
• Memory related issues
  – Static power implications
  – Dynamic power implications
  – Implications on software and memory management
• Summary
The “power aware era”
• Introduction
• Implications on computer architectures
Moore’s law
“Doubling the number of transistors on a manufactured die every year” - Gordon Moore, Intel Corporation
[Chart: transistors per die (log scale, 10² to 10⁹), 1970-2000, for memory parts (4K through 256M) and microprocessors (4004 and 8080 through i386™, i486™, the Pentium® family and Pentium® 4). Source: Intel]
In the last 25 years life was easy (*)
• Ideal process technology allowed us to
  – Double transistor density every 30 months
  – Improve their speed by 50% every 15-18 months
  – Keep the same power density
  – Reduce the power of an old architecture, or introduce a new architecture with significant performance improvement at the same power
• In reality
  – Process usually is not ideal and more performance than process scaling is needed, so die size, power, and power density increased over time:

Old Arch       mm (linear)   New Arch        mm (linear)   Ratio
i386C          6.5           i486            11.5          3.1
i486C          9.5           Pentium®        17            3.2
Pentium®       12.2          Pentium® Pro    17.3          2.1
Pentium® III   10.3          Next Gen        ?             2--3

(*) Source: Fred Pollack, Micro-32
Processor power evolution
[Chart: max power (Watts, log scale) per processor generation, rising from i386 and i486 (~1-3 W) through Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II and Pentium® III to Pentium® 4 (~100 W), with the next point marked “?”]
• Traditionally: a new generation always increases power
• Compactions: higher performance at lower power
• Used to be “one size fits all”: start with high power and shrink for mobile
Suddenly, the power monster
appears in all market segments
Power & energy
Power
• Dynamic power: consumed by all transistors that switch
  – P = aCV²f - work done per time unit (Watts)
    (a: activity, C: capacitance, V: voltage, f: frequency)
• Static power (leakage): consumed by all “inactive transistors” - depends on temperature and voltage.
Energy
• Power consumed during a time period.
Energy efficiency
• Energy × Delay (or Energy × Delay²)
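The dynamic-power formula can be evaluated numerically; all the numbers below are hypothetical, chosen only to show why scaling voltage and frequency together gives a roughly cubic power drop:

```python
# Illustrative numbers only (not from the slides): evaluating the
# dynamic-power formula P = a*C*V^2*f and the effect of scaling
# voltage and frequency together.

def dynamic_power(a, C, V, f):
    """Dynamic power in watts: activity factor * switched
    capacitance (F) * voltage squared (V) * frequency (Hz)."""
    return a * C * V**2 * f

# Hypothetical baseline: a=0.2, 2 nF switched, 1.2 V, 3 GHz.
base = dynamic_power(0.2, 2e-9, 1.2, 3e9)

# Dynamic voltage/frequency scaling: 0.9x voltage and 0.9x frequency.
scaled = dynamic_power(0.2, 2e-9, 1.2 * 0.9, 3e9 * 0.9)

print(round(base, 3))           # 1.728 (watts)
print(round(scaled / base, 3))  # 0.729 = 0.9**3: a roughly cubic drop
```

Note how the V² term makes voltage the dominant knob: a 10% voltage cut alone saves ~19% of dynamic power before the frequency reduction is even counted.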
Why high power matters
Power limitations
• Higher power → higher current
  – Cannot exceed platform power delivery constraints
• Higher power → higher temperature
  – Cannot exceed the thermal constraints (e.g., Tj < 100°C)
  – Increases leakage.
  – The heat must be controlled in order to avoid electromigration and other “chemical” reactions of the silicon
  – Avoid the “skin effect”
Energy
• Affects battery life.
  – Consumer devices – the processor may consume most of the energy
  – Mobile computers (laptops) – the system (display, disk, cooling, power supply, etc.) consumes most of the energy
• Affects the cost of electricity
The power crisis – power consumption
[Chart: power consumption trends. Source: cool-chips, Micro 32]
Power density
[Chart: power density (Watts/cm², log scale) per processor generation, from i386 and i486 (~1 W/cm²) through Pentium®, Pentium® Pro, Pentium® II and Pentium® III to Pentium® 4 (~100 W/cm²), crossing the “hot plate” level and trending toward nuclear reactor and rocket nozzle levels*]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote - 1999.
Conclusions so far
• Currently and in the near future, new processes keep providing more transistors, but the improvement in power reduction and speed is much lower than in the past
• Due to power consumption and power density constraints, we are limited in the amount of logic that can be devoted to improving single-thread performance
• We can use the transistors to add more “low power density” and “low power consumption” structures such as memory, assuming we can control the static power.
• BUT, we still need to double the performance of new computer generations every two years (on average).
We must go parallel
• In theory, power increases in the order of the frequency cube.
• Assuming that frequency approximates performance:
  – Doubling performance by increasing frequency increases the power cubically
  – Doubling performance by adding another core increases the power linearly.
• Conclusion: as long as enough parallelism exists, it is more efficient to achieve the same performance by doubling the number of cores rather than doubling the frequency.
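The arithmetic behind this conclusion can be sketched directly (the baseline wattage is hypothetical; the cubic model is the one stated above):

```python
# Two ways to double performance, under the slide's model:
# power ~ frequency^3 (voltage must rise with frequency), and
# power ~ number of cores (each core at the base operating point).

def power_freq_scaling(base_power, freq_ratio):
    """Power when frequency (and voltage) scale by freq_ratio: P ~ f^3."""
    return base_power * freq_ratio**3

def power_more_cores(base_power, n_cores):
    """Power with n identical cores at the base frequency: linear."""
    return base_power * n_cores

BASE = 10.0  # hypothetical single-core watts

via_frequency = power_freq_scaling(BASE, 2.0)  # 2x frequency -> 8x power
via_cores = power_more_cores(BASE, 2)          # 2 cores      -> 2x power

print(via_frequency)  # 80.0 W
print(via_cores)      # 20.0 W
```

The 4x gap is why, given enough parallelism, the second core wins; the caveat is that the workload must actually supply that parallelism.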
CPU architecture - multicores
[Chart: performance vs. power; the uniprocessor curve hits the power wall below the CMP curve, which pays only an MP overhead]
• Uniprocessors have lower power efficiency due to higher speculation and complexity
Source: Tomer Morad, Ph.D. student, Technion
The new trend – parallel systems on die
• There are at least three camps in the computer architects community:
  – Multi-cores - systems will continue to contain a small number of “big cores” – Intel, AMD, IBM
  – Many-cores – systems will contain a large number of “small cores” – Sun T1 (Niagara)
  – Asymmetric-cores – a combination of a small number of big cores and a large number of small cores – IBM Cell architecture.
New era in computer architectures
[Cartoon dialogue:]
“I remember, in the late 80’s it was clear that we could not improve the performance of single-threaded programs any longer, so we went parallel as well.”
“But this is totally different!!!! Now Alewife is called Niagara, DASH is called NOC, and shared-bus architecture is called CMP!!!”
“Let’s hope this time we will have a ‘happy end’.”
“Hmmm, you are right - it is a whole new world. We deserve it!!!”
From the opposite side of the mirror
• There are many similarities between the motivation and the solutions we are building today and what was developed in the late 80’s
• But the root cause is different and the software environment is different, so a new approach is needed to come up with the right solutions
• Power and power density are real physical limitations, so changing them requires a new direction (biological computing?????)
Agenda
• What are power-aware architectures and why they are needed
  – Background of the problem
  – Implications and architectural directions
• Memory related issues
  – Static power implications
  – Dynamic power implications
  – Implications on software and memory management
• Summary
Memory related implications
• The portion of the memory out of the overall die area increases over time (cache memories)
• Most of the memory has very little contribution to the active power consumption
  – Most of the active power that on-die memory consumes is spent at the L1 cache (which remains at the same or smaller size)
  – Most of the active power that off-die memory consumes is spent on the busses, interconnect and coherency related activities. Snoop traffic may have a significant impact on active power
• But larger memory may consume significant static power (leakage) if not handled very carefully
  – Will be discussed in the next few slides
Accessing the cache can be power hungry if designed for speed
[Diagram: set-associative cache lookup - the virtual address indexes the sets while the TLB translates it; the tag and data arrays of all ways are read in parallel, the physical address selects the matching way, and the data goes to the CPU]
• If access time is very important:
  – Memory cells are tuned for speed, not for power
  – Parallel access to all ways in the tag and data arrays of the cache
  – TLB is accessed in parallel (need to make sure that the width of the cache is less than a memory page)
  – Power is mainly spent in the sense amplifiers.
• High associativity increases the power of external snoops as well.
There are many techniques to control the power consumption of the cache (and memory) if designed for power
• Active power
  – Sequential access (tag first and only then data)
  – Optimized cells (can help both active and leakage power)
• Passive (leakage) power
  – Will be discussed later on
• BUT if the cache is large, the accumulated static power can be very significant.
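The sequential-access technique can be illustrated with a toy energy model (all constants here are hypothetical, not from the talk): a speed-tuned cache fires the tag and data arrays of every way at once, while a power-tuned cache reads the tags first and then only the one data way that hit.

```python
# Toy per-access energy model for an N-way set-associative cache.
WAYS = 8
E_TAG = 1.0    # hypothetical energy units per tag-way read
E_DATA = 4.0   # hypothetical energy units per data-way read (wider arrays)

def parallel_lookup_energy(ways=WAYS):
    # Speed-tuned: all tag and data ways fire at once (one array latency).
    return ways * (E_TAG + E_DATA)

def sequential_lookup_energy(ways=WAYS):
    # Power-tuned: read all tags, then only the hit way's data
    # (two serial array accesses, so it is slower).
    return ways * E_TAG + 1 * E_DATA

print(parallel_lookup_energy())    # 40.0
print(sequential_lookup_energy())  # 12.0
```

The saving grows with associativity, which also connects to the previous slide's note that high associativity makes external snoops expensive.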
How to control static power - General
• Process optimizations are out of the scope of this talk
• Design techniques
  – Sleep transistors
  – Power gating
  – Forward and backward biasing (out of our scope)
• Micro-architecture level
  – Hot spot control
  – Allowing advanced design techniques
• Architectural level
  – Program behavior dependent techniques
  – Compiler and program’s hint based techniques
  – ACPI
Design techniques – a few examples

Sleep transistor – data preserved
  Description: lower the voltage on the gate to the level where data is preserved but cannot be accessed
  Granularity: small-medium
  Impact: ~10x leakage reduction; longer access time to retrieve data

Sleep transistor – data not preserved
  Description: lower the power on the gate to a level where data may be corrupted
  Granularity: small-medium
  Impact: 20-100x leakage reduction; need to bring data from higher memory (costs time and power)

Power gate
  Description: “cut the power” from the gate
  Granularity: large
  Impact: no leakage; longer access time than sleep transistor
Architectural level – program behavior related techniques
• We have known for a long time that most of the lines in the cache are “dead”
• But dead lines are not free, since they consume leakage power
[Chart: unique live cache lines over time (K=10, Theta=3, curves B, D, L); only a small fraction of the lines is live at any moment]
• So, if we can predict which lines are dead, we could either:
  – put them under a sleep transistor that keeps the data (in case the prediction was not perfect) – Drowsy cache
  – or save more leakage power by using sleep transistors that lose the data, and pay a higher penalty for mispredicting a dead line – cache decay
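A minimal sketch of the cache-decay idea (the interval length and the exact mechanism here are invented for illustration, not the published design): each line counts idle intervals, and a line that stays untouched long enough is predicted dead and powered off to stop its leakage.

```python
DECAY_INTERVALS = 4  # hypothetical decay threshold, in coarse ticks

class DecayLine:
    """One cache line with a per-line idle counter and a power switch."""
    def __init__(self):
        self.idle = 0
        self.powered = True

    def touch(self):
        # An access resets the counter; touching a powered-off line
        # is the misprediction penalty (a miss). Returns True on a hit.
        hit = self.powered
        self.idle = 0
        self.powered = True
        return hit

    def tick(self):
        # Called once per decay interval by the cache controller.
        if self.powered:
            self.idle += 1
            if self.idle >= DECAY_INTERVALS:
                self.powered = False  # predicted dead: cut the leakage

line = DecayLine()
line.touch()
for _ in range(DECAY_INTERVALS):
    line.tick()
print(line.powered)  # False: the line decayed
print(line.touch())  # False: the next access pays a (mispredicted) miss
```

A Drowsy cache corresponds to the gentler variant: the low-voltage mode keeps the data, so a misprediction costs only extra access latency instead of a refill from the next memory level.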
Architectural level – ACPI
• An operating system mechanism to control power consumption at the system level (we will focus on the CPU only)
• Controls three aspects of the system:
  – C-state: when the system has nothing to do, how deep it can go to sleep. Deeper → more power is saved, but more time is needed to wake it.
  – P-state: when running, what is the minimum frequency (and voltage) the processor can run at without causing a notable slowdown to the system
  – T-state: prevents the system from becoming too hot.
• Implementation: periodically check the activity of the system and decide whether to change the operational point.
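The "periodically check and decide" loop can be sketched as a toy governor. The state names follow ACPI, but the thresholds, the P-state table and the selection policy below are invented for illustration; real ACPI state selection is far more involved.

```python
# Hypothetical P-state table: (state name, relative frequency).
P_STATES = [("P0", 1.0), ("P1", 0.8), ("P2", 0.6)]

def select_state(utilization, temperature, t_max=100.0):
    """Pick an operating point from one periodic activity sample."""
    if temperature >= t_max:
        return "T-throttle"  # T-state: too hot, throttle first
    if utilization == 0.0:
        return "C3"          # C-state: nothing to do, go to sleep
    # P-state: slowest frequency that still covers the demand,
    # so the CPU runs no faster (and no hotter) than needed.
    for name, freq in reversed(P_STATES):
        if utilization <= freq:
            return name
    return "P0"

print(select_state(0.0, 60.0))   # "C3": idle, sleep
print(select_state(0.5, 60.0))   # "P2": half load fits the 0.6x point
print(select_state(0.95, 60.0))  # "P0": near-full load needs full speed
print(select_state(0.5, 120.0))  # "T-throttle": thermal limit wins
```

The deeper-sleep trade-off from the slide would appear here as a choice among several C-states, weighing the power saved against the wake-up latency.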
Combining sleep transistors and ACPI – Intel Core Duo example
• Intel Core Duo was designed to be a dual core for low-power computers.
• The L2 cache size in Core Duo is 2MB, and in Core 2 Duo, 4MB
• When the system runs (C0 state) all the caches are active
• It has a sophisticated DVS algorithm to control the T and P states
• When starting to “nap” (C3), it cleans the L1 caches and closes the power to them (to save leakage)
• When in sleep (C4), it gradually shrinks the size of the L2 till it is totally empty.
• How much performance do you pay for this? → Sometimes you gain performance; most of the time the cost is on the order of 1%.
• More details: Intel Technology Journal, May 2006
Agenda
• What are power-aware architectures and why they are needed
  – Background of the problem
  – Implications and architectural directions
• Memory related issues
  – Static power implications
  – Dynamic power implications
  – Implications on software and memory management
• Summary
Memory management
• Adding larger shared memory arrays on die may not make sense any more
  – Access time to the LLC (last level cache) starts to be very slow
  – May cause contention on resources (shared memory and buses)
  – But it is still too fast to handle with SW or the OS
  – Solving these problems may cost significant power
• What are the alternatives? (WIP)
  – COMA within the chip
  – Use of buffers instead of caches – a change of the programming model
    • May require a different approach for memory allocation
  – Separate the memory protection mechanisms from the VM mechanism
Compiler and applications
• Currently most of the cache-related optimizations are based on performance
  – Many of them help energy, since they improve the efficiency of the CPU.
  – But they may worsen the max power and power density
• Increasing parallelism is THE key for future systems.
  – Speculation may hurt if not done with a high degree of confidence.
  – Do we need new programming models such as transactional memory for that?
• Reducing working sets can help reduce leakage power if the system supports Drowsy or Decay caches
• The program may give the HW and the OS “hints” that can help improve the efficiency of power consumption
  – Response time requirements
  – If new HW/SW interfaces are defined, we can control the power of the machine at a very fine granularity; e.g., when Floating-Point is not used, close it to save leakage power
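As one concrete illustration of such hints, here is a purely hypothetical SW-side interface (no such API is defined in the talk) through which a program could declare which units it will not use, so the hardware may gate their leakage:

```python
# Hypothetical hint interface: the program declares unused units;
# a real implementation would forward these hints to HW power gates.

class PowerHints:
    def __init__(self):
        self.gated = set()  # units currently allowed to be gated

    def unit_unused(self, unit):
        # e.g. "fpu": permit HW to power-gate the unit (save leakage)
        self.gated.add(unit)

    def unit_needed(self, unit):
        # Wake the unit before first use (pays a wake-up latency).
        self.gated.discard(unit)

hints = PowerHints()
hints.unit_unused("fpu")     # integer-only phase: gate the FPU
print("fpu" in hints.gated)  # True
hints.unit_needed("fpu")     # about to run FP code again
print("fpu" in hints.gated)  # False
```

The design question the slide raises is exactly where such calls would live: in the compiler, the runtime, or explicit program annotations.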
Garbage collection
• In many situations, not all the cores in the system will be active. An idle processor can be used to perform GC
  – In this case we may want to do the GC at a very fine granularity.
• Most CMP architectures share many of the memory hierarchies. GC done by one processor may replace the cache content and slow down the execution of the entire system.
  – Thus we may like to do the GC at a very coarse granularity in order to limit the overhead.
• A new interface between HW and SW may be needed in order to allow new algorithms for GC, or to optimize the execution of the system when using the current ones
Summary and future directions
• The power-aware era impacts all aspects of computer architecture
• It forces the market to “go parallel” and may cause the memory portion of the die to increase over time
  – To take advantage of “cold transistors”
  – To reduce memory and IO bandwidth
• We may need to start looking at new paradigms for memory usage and HW/SW interfaces
  – At all levels of the machine; e.g., programming models, OS, etc.
  – New programming models such as Transactional Memory may become very important in order to allow better parallelism. Do we need to develop a new memory paradigm to support it?
• Software will determine if the new (old) trend will become a major success or not.
  – Increase parallelism (but use speculative execution only with high confidence)
  – Control power
  – New SW/HW interfaces may be needed.
Questions?
Multi-cores – Intel, AMD and IBM
• Both companies have dual-core processors
  – Intel uses a shared cache architecture and AMD introduced a split cache architecture
  – AMD announced that in their next generation processors they will use a shared LLC (last level cache) as well
• Intel announced that they are working on four-core processors. Analysts think that AMD is doing the same.
• Intel said that they will consider going to 8-way processors only after SW catches up. AMD made a similar announcement.
• For servers, analysts claim that Intel is building a 16-way Itanium-based processor for the 2009 time frame.
• Power4 has 2 cores and Power5 has 2 cores+SMT. They are considering moving in the near future to 2 cores+SMT for each of them.
• Xbox has 3 Power4 cores.
• All three companies promise to increase the number of cores at a pace that fits the market’s needs.
back
Sun – Sparc-T1: Niagara
[Diagram (a)-(c): thread timelines for a single in-order core, a 4-thread core, and a multi-core extension]
• Looking at an in-order machine, each thread has computational time followed by a LONG memory access time (a)
• If you put 4 of them on a die, you can overlap I/O, memory access and computation (b)
• You can use this approach to extend your system (c)
• The Alewife project did it in the 80’s
back
Cell Architecture - IBM
[Diagram: one BIG core and several small cores connected by a ring-based bus unit]
back