Adv. Memory Mgt


Outline
• Objective for Lecture
  – Finish up memory issues:
    • Remote memory in the role of backing store
    • Power-aware memory (Duke research)
Topic: Paging to Remote Memory

Motivation: How to exploit remote memory?
[Figure: a node with processor (P), cache ($), and memory faces the "I/O bottleneck" for VM and file caching; remote memory sits across the network. Low-latency networks make retrieving data across the network faster than local disk, which also motivates Distributed Shared Memory (DSM).]
Global Memory Service (GMS)
Pageout/pagein, peer-to-peer style.
[Figure: compute nodes (each a processor P with memory M) and "idle" workstations or memory servers, connected by a switched LAN; pageout/pagein traffic flows between them.]
– GMS kernel module for Digital Unix and FreeBSD
– remote paging, cooperative caching
What does it take to page out to remote memory (and back in)?
• How to uniquely name and identify pages/blocks?
  • 128-bit Unique Identifiers (UIDs) serve as Global Page Numbers (GPNs).
• How to maintain consistent information about the location of pages in the network?
  • Each node keeps a Page Frame Directory (PFD) of pages cached locally.
  • A distributed Global Cache Directory (GCD) identifies the caching site(s) for each page or block.
  • Hash on the page ID to find the GCD site, as sketched below.
• How to keep a consistent view of the membership in the cluster and the status of participating nodes?
• How to globally coordinate memory usage for the page cache?
  • "Think globally, act locally." – choice of replacement algorithm
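To make the directory hashing concrete, here is a minimal Python sketch, assuming a fixed membership list and SHA-1 as the hash; names like gcd_site are illustrative, not GMS's actual interfaces.

```python
import hashlib

# Hypothetical cluster membership; real GMS must keep this view consistent
# as nodes join and leave (the membership question above).
CLUSTER_NODES = ["node0", "node1", "node2", "node3"]

def gcd_site(gpn: int, private: bool = False, self_node: str = "node0") -> str:
    """Map a 128-bit global page number (GPN) to its directory (GCD) site.

    A node is its own directory for private pages; sharable pages hash
    to a site, so every node computes the same directory independently.
    """
    if private:
        return self_node
    digest = hashlib.sha1(gpn.to_bytes(16, "big")).digest()
    return CLUSTER_NODES[int.from_bytes(digest[:4], "big") % len(CLUSTER_NODES)]

# Any two nodes agree on the directory site for the same sharable page.
gpn = 0x0123456789ABCDEF0123456789ABCDEF
print(gcd_site(gpn))
```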
Global Memory System (GMS) in a Nutshell
• Basic operations: putpage, getpage, movepage, discardpage.
• Directory site is self for private pages, hashed for sharable pages.
• Caching site selected locally based on global information.
[Figure: getpage: a GMS client sends a request to the directory site (GCD), which forwards it to the caching site (PFD), which returns the page. putpage: a GMS client sends the page to a caching site (PFD) and updates the directory site (GCD). GCD: Global Cache Directory, PFD: Page Frame Directory, P: Processor, M: Memory.]
GMS Page Fetch (getpage)
Page fetches are requested with a getpage RPC call to the page directory site. A getpage can result from a demand fault or a prefetch policy decision (asynchronous). The directory site delegates the request to the correct caching site (getpage forward, a delegated RPC), and the caching site sends the page directly back to the requesting node.
[Figure: getpage request to the directory site; getpage forward to the caching site; getpage reply directly to the requester.]
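A minimal sketch of this flow, using in-process dictionaries in place of GMS's kernel RPCs; all names here are illustrative stand-ins.

```python
# getpage flow: client -> directory site -> caching site -> client.
class Node:
    def __init__(self, name):
        self.name = name
        self.gcd = {}   # Global Cache Directory: gpn -> caching node
        self.pfd = {}   # Page Frame Directory: gpn -> page contents
        self.mem = {}   # pages this node has fetched

def getpage(client, directory, gpn):
    """Demand fault or prefetch: ask the directory site for the page."""
    caching_site = directory.gcd[gpn]          # directory lookup at GCD site
    # Delegated RPC: the directory forwards the request; the caching site
    # replies directly to the client rather than back through the directory.
    client.mem[gpn] = caching_site.pfd[gpn]

# Node B faults on page 42, which node A caches.
a, b, d = Node("A"), Node("B"), Node("dir")
a.pfd[42] = b"page-contents"
d.gcd[42] = a
getpage(b, d, 42)
assert b.mem[42] == b"page-contents"
```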
Global -- Local Scenarios
In each case Node B faults on some virtual address.
• Case 1: Page is in another node's global memory.
• Case 2: As in Case 1, but B has no global pages.
• Case 3: Page is on disk -- going active . . .
• Case 4: Shared page in the local memory of node A.
[Figure: Node A and Node B, each with a local LRU list and a global LRU list; arrows show where the faulted page comes from in each case (copied, for the shared page in Case 4) and which page is evicted or demoted to make room.]
Global Replacement
• Fantasy: on a page fetch, choose the globally "coldest" page in the cluster to replace (i.e., global LRU).
  – may demote a local page to global
• Reality: approximate global LRU by making replacement choices across epochs of (at most) M evictions in T seconds (see the sketch after this list).
  – trade off accuracy for performance
  – nodes send a summary of page values to the initiator at the start of each epoch
  – the initiator determines which nodes will yield the epoch's evictions
  – favor local pages over global pages
  – victims are chosen probabilistically
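A sketch of the epoch mechanism, assuming page "values" are simply ages since last use; the actual GMS summaries and weighting are more involved.

```python
import random

def plan_epoch(summaries, m_evictions):
    """At the start of an epoch, the initiator turns per-node summaries
    ({node: [page ages]}) into eviction quotas: nodes holding more cold
    memory yield more of the epoch's (at most) M evictions."""
    weights = {node: sum(ages) for node, ages in summaries.items()}
    total = sum(weights.values()) or 1
    return {node: round(m_evictions * w / total) for node, w in weights.items()}

def choose_victims(ages, quota):
    """Probabilistic victim choice: older (colder) pages are more likely
    to be picked (sampling with replacement, fine for a sketch)."""
    return random.choices(range(len(ages)), weights=ages, k=quota)

quotas = plan_epoch({"A": [5, 90, 200], "B": [1, 2, 3]}, m_evictions=4)
print(quotas)   # most evictions land on A, which holds the coldest pages
print(choose_victims([5, 90, 200], quotas["A"]))
```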
GMS Putpage/Movepage
Page/block evictions trigger an asynchronous putpage request from the client; the page is sent as a payload piggybacked on the message. The directory (GCD) site is selected by a hash on the global page number; every node is the GCD site for its own private pages. The caching site is selected locally based on global information distributed each epoch.
[Figure: the client's putpage goes to caching site #1 with an update to the directory site; a later movepage to caching site #2 triggers another directory update.]
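And a matching sketch of putpage (illustrative names again; real GMS piggybacks the page on the outgoing message rather than copying through shared structures):

```python
# putpage on eviction: the page payload rides on the putpage message to a
# caching site chosen locally from per-epoch global information, and the
# directory (GCD) entry is updated to point at the new caching site.
gcd = {}                  # directory site's view: gpn -> caching node name
pfd = {"A": {}, "B": {}}  # each node's Page Frame Directory

def putpage(gpn, data, target):
    pfd[target][gpn] = data    # piggybacked payload lands in the target PFD
    gcd[gpn] = target          # asynchronous directory update

putpage(7, b"evicted-page", target="A")   # evict toward "idle" node A
assert gcd[7] == "A" and pfd["A"][7] == b"evicted-page"
```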
Topic: Energy Efficiency of Memory

Motivation: What can the OS do about the energy consumption of memory?
[Figure: power budget pie charts. Laptop (9 watt processor) and handheld (1 watt processor), each split among processor, memory, and other components.]
Energy Efficiency Metrics
• Power consumption in watts (or mW).
• Battery lifetime in hours (or seconds, months).
• Energy consumption in joules (or mJ).
• Energy * Delay
• Watts per megabyte
Physics Basics
• Voltage is the energy transferred per unit of charge flowing from a power source (e.g., a battery) through the circuit.
• Power is the rate of energy transfer.
• Current is the rate of flow of charge in the circuit.
Terminology and Symbols

Concept      Symbol   Unit       Abbrev.
Current      I        amperes    A
Charge       Q        coulombs   C
Voltage      V        volts      V
Energy       E        joules     J
Power        P        watts      W
Resistance   R        ohms       Ω
Relationships
Energy (joules) = Power (watts) * Time (sec)
  E = P * t
Power (watts) = Voltage (volts) * Current (amps)
  P = V * I
Current (amps) = Voltage (volts) / Resistance (ohms)
  I = V / R
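A quick worked check of these relationships, using the Microdrive's write current from the Power Specs slide later in this lecture:

```python
# P = V * I and E = P * t: a drive writing at 300 mA on a 3.3 V supply
# dissipates about 1 W; writing for 10 seconds costs about 10 joules.
V = 3.3          # volts
I = 0.300        # amps (300 mA)
P = V * I        # watts  -> 0.99 W
E = P * 10       # joules -> 9.9 J over 10 s
print(f"P = {P:.2f} W, E = {E:.1f} J")
```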
Power Management (RAMBUS memory)
[Figure: RDRAM power state machine, shown for a two-chip configuration.]
• Active: 300 mW; a read or write takes 60 ns.
• Standby: 180 mW; +6 ns to return to active.
• Nap: 30 mW; +60 ns to return to active.
• Powered down: 3 mW; +6 µs to return to active.
The virtual-to-physical address mapping could exploit page coloring ideas.
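These numbers translate directly into an energy table; a small sketch (state powers and reactivation delays taken from the slide above):

```python
# RDRAM power states: power draw, and the delay to return to active.
# Idling in nap instead of active cuts power 10x, at the cost of a
# 60 ns resynchronization on the next access.
STATES = {                 # state: (power in mW, seconds back to active)
    "active":    (300, 0.0),
    "standby":   (180, 6e-9),
    "nap":       (30,  60e-9),
    "powerdown": (3,   6e-6),
}

T_IDLE = 1e-3              # an example 1 ms idle gap
for state, (mw, delay) in STATES.items():
    energy_mj = mw * T_IDLE          # mW * s = mJ
    print(f"{state:9s}: {energy_mj:5.3f} mJ over 1 ms idle, "
          f"+{delay * 1e9:7.0f} ns to reactivate")
```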
RDRAM as a Memory Hierarchy
• Each chip can be independently put into the appropriate power mode.
• The number of chips at each "level" of the hierarchy can vary dynamically.
[Figure: chips at Active, Active, and Nap levels of the hierarchy.]
Policy choices:
– initial page placement in an "appropriate" chip
– dynamic movement of a page from one chip to another
– transitioning of the power state of the chip containing the page
Two Dimensions to Control Energy
[Figure: software control: the OS maps and allocates pages from the CPU/$ onto chips 1..n. Hardware control: a per-chip controller transitions each chip among Active, Standby, and Power Down.]
Dual-state (Static) HW Power State Policies
• All chips sit in one base state (Standby, Nap, or Powerdown).
• An individual chip goes Active while it has pending requests.
• It returns to the base power state when no access is pending.
[Figure: two-state machine, Base <-> Active, transitioning on access and on no pending access; the timeline shows a chip rising to Active for each access and falling back to Base.]
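A minimal sketch of the dual-state energy accounting, assuming instantaneous transitions and the RDRAM numbers above:

```python
# Dual-state policy: the chip idles in a base state and is Active only
# while an access is in flight (60 ns per access, per the RDRAM slide).
ACTIVE_MW, ACCESS_S = 300, 60e-9

def dual_state_energy_mj(base_mw, n_accesses, total_s):
    busy = n_accesses * ACCESS_S               # time spent Active
    return ACTIVE_MW * busy + base_mw * (total_s - busy)

for base, mw in [("active", 300), ("standby", 180), ("nap", 30)]:
    e = dual_state_energy_mj(mw, n_accesses=10_000, total_s=1.0)
    print(f"base={base:7s}: {e:6.1f} mJ over 1 s")   # nap wins by far
```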
Quad-state (Dynamic) HW Policies
• Downgrade the state if there is no access for a threshold time.
• Independent transitions based on the access pattern to each chip.
• Competitive analysis: rent-to-buy (ski rental).
  – Active to Nap transition: 100s of ns
  – Nap to PDN transition: 10,000 ns
[Figure: four-state machine: Active -> STBY after no access for Ta-s, STBY -> Nap after no access for Ts-n, Nap -> PDN after no access for Tn-p; any access returns the chip to Active. The timeline shows a chip stepping down Active -> STBY -> Nap -> PDN between accesses.]
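A sketch of the downgrade logic with the diagram's threshold names; the values plugged in are the ones reported on the results slide later.

```python
# Quad-state dynamic policy: with no access for a state's threshold, the
# chip steps down one level; any access returns it to Active. Threshold
# names Ta_s, Ts_n, Tn_p follow the state diagram; the values are from
# the results slide later (0 ns A->S, 750 ns S->N, 375,000 ns N->P).
DOWN = {"active": ("stby", "Ta_s"),
        "stby":   ("nap",  "Ts_n"),
        "nap":    ("pdn",  "Tn_p")}
TH = {"Ta_s": 0, "Ts_n": 750, "Tn_p": 375_000}     # ns

def step(state, idle_ns):
    """One idle interval with no access: maybe downgrade one level."""
    if state in DOWN:
        lower, key = DOWN[state]
        if idle_ns >= TH[key]:
            return lower
    return state

state = "active"
for gap in (100, 800, 400_000):     # successive access-free gaps, in ns
    state = step(state, gap)
    print(state)                    # stby, then nap, then pdn
```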
Ski Rental Analogy
• Week 1: "Am I gonna like skiing? I better rent skis."
• Week 10: "I'm skiing black diamonds and loving it. I could've bought skis for all the rental fees I've paid. I should have bought them in the first place. Guess I'll buy now."
• If the skier buys exactly when rentalsPaid == skiCost, then he pays at most 2x the optimal cost.
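That 2x bound is easy to check numerically (rental cost of 1 per week is assumed for illustration):

```python
# Ski rental: rent for 1/week or buy for B. Buying exactly when the rent
# paid reaches B costs at most twice what the offline optimum pays, no
# matter how many weeks the skier ends up skiing.
def policy_cost(weeks, B):
    return weeks if weeks < B else 2 * B    # B in rent, then B to buy

def optimal_cost(weeks, B):
    return min(weeks, B)                    # knows the future: rent or buy

B = 10
worst = max(policy_cost(w, B) / optimal_cost(w, B) for w in range(1, 200))
print(worst)    # -> 2.0, the competitive ratio
```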
Page Allocation Policies
• Random Allocation
  – Pages spread across chips.
• Sequential First-Touch Allocation
  – Consolidate pages into the minimal number of chips.
  – One shot.
• Frequency-based Allocation
  – First-touch is not always best.
  – Allow movement after first-touch.
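A sketch contrasting the first two policies (chip count and capacity are assumed for illustration):

```python
import random

CHIPS, PAGES_PER_CHIP = 4, 8        # assumed geometry

def random_alloc(n_pages):
    """Random: pages land anywhere, so every chip tends to stay busy."""
    return [random.randrange(CHIPS) for _ in range(n_pages)]

def first_touch_alloc(n_pages):
    """Sequential first-touch: fill chip 0, then chip 1, ... ("one shot"),
    so untouched chips can drop to nap or powerdown."""
    return [i // PAGES_PER_CHIP for i in range(n_pages)]

n = 10
print(len(set(random_alloc(n))), "chips hold pages under random")      # ~4
print(len(set(first_touch_alloc(n))), "chips under first-touch")       # 2
```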
Dual-state + Random Allocation (NT traces)
[Figure: normalized Energy*Delay for the two-state model with base states Active, Standby, Nap, and Power Down, across acrord32, compress, go, netscape, powerpnt, and winword.]
• Go Active to perform an access, then return to the base state.
• Nap is the best base state: ~85% reduction in E*D over full power.
• Little change in run time; most of the gains are in energy/power.
Quad-state HW + Sequential Allocation (SPEC)
[Figure: normalized Energy*Delay for Dual-Nap-Seq, Quad-Random, and Quad-Sequential across vpr, gcc, go, compress, and bzip.]
• Base: dual-state Nap with sequential allocation.
• Thresholds: 0 ns A->S; 750 ns S->N; 375,000 ns N->P. Subsequent analysis shows a 0 ns S->N threshold is better.
• Quad-state + sequential: 30% to 55% additional improvement over dual-state Nap sequential.
• Sophisticated HW alone is not enough.
The Design Space

                           Random Allocation          Sequential Allocation
Dual-state HW (static)     Nap is the best dual-      Additional 10% to 30%
                           state policy (60%-85%)     over Nap
Quad-state HW (dynamic)    No compelling benefits     Best approach: 6% to 55% over
                           over simpler policies      dual-nap-seq, 80% to 99% over
                                                      all active.
Spin-down Disk Model
[Figure: disk state machine: Not Spinning -> (trigger: request or predict) -> Spinning up -> Spinning & Seek -> Spinning & Access -> Spinning & Ready -> (inactivity timeout threshold*) -> Spinning down -> Not Spinning.]
Spin-down Disk Model (with costs)
Etransition = Ptransition * Ttransition; spinning up or down imposes a ~1-3 s delay.
[Figure: the same state machine annotated with Pdown and Tdown while not spinning, Pspin while spinning, Tout for the inactivity timeout, and Tidle for the full idle interval.]
Reducing Energy Consumption
Energy = Σ over power states i of (Power_i * Time_i)
To reduce the energy used for a task:
– Reduce the power cost of power state i through better technology.
– Reduce the time spent in the higher-cost power states.
– Amortize transition states (spinning up or down) if significant.
Spinning down wins when:
  Pdown*Tdown + 2*Etransition + Pspin*Tout < Pspin*Tidle, where Tdown = Tidle - (Ttransition + Tout).
A worked check with real disk numbers appears below.
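Plugging in the TravelStar figures from the Power Specs slide below, with an assumed 2 s transition at the 4.7 W startup power (the slides only say the delay is ~1-3 s), shows how long an idle gap must be before spinning down pays off:

```python
# Evaluate the spin-down inequality with TravelStar numbers.
Pspin, Pdown   = 0.65, 0.25   # W: low-power idle vs. standby
Ptrans, Ttrans = 4.7, 2.0     # W, s: spin-up/down cost (assumption)
Etrans = Ptrans * Ttrans      # J per transition
Tout   = 2 * Etrans / Pspin   # fixed threshold: 2*Etrans = Pspin*Tout

def spin_down_wins(Tidle):
    Tdown = Tidle - (Ttrans + Tout)
    return Pdown * Tdown + 2 * Etrans + Pspin * Tout < Pspin * Tidle

for t in (30, 60, 120, 300):                 # candidate idle gaps, seconds
    print(f"Tidle = {t:3d} s: spinning down "
          f"{'wins' if spin_down_wins(t) else 'loses'}")
```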
Power Specs
IBM Microdrive (1 inch)
• writing: 300 mA at 3.3 V ≈ 1 W
• standby: 65 mA at 3.3 V ≈ 0.2 W
IBM TravelStar (2.5 inch)
• read/write: 2 W
• spinning: 1.8 W
• low-power idle: 0.65 W
• standby: 0.25 W
• sleep: 0.1 W
• startup: 4.7 W
• seek: 2.3 W
Spin-down Disk Model (TravelStar numbers)
[Figure: the same state machine annotated with TravelStar power figures: Not Spinning 0.2 W; Spinning up 4.7 W; Spinning & Seek 2.3 W; Spinning & Access 2 W; Spinning & Ready 0.65-1.8 W.]
Spin-Down Policies
• Fixed thresholds
  – Tout = spin-down cost, s.t. 2*Etransition = Pspin*Tout
• Adaptive thresholds: Tout = f(recent accesses)
  – Exploit burstiness in Tidle
• Minimizing bumps (user annoyance/latency)
  – Predictive spin-ups
• Changing access patterns (making burstiness)
  – Caching
  – Prefetching
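A sketch of the first two policy flavors; the adaptation rule and its constants are illustrative, not from the lecture.

```python
def fixed_threshold(E_transition, P_spin):
    """Tout such that 2*Etransition = Pspin*Tout: waiting out the timeout
    costs the same as one spin-down/spin-up cycle."""
    return 2 * E_transition / P_spin

def adaptive_threshold(recent_idle_gaps, base):
    """Tout = f(recent accesses): if recent idle gaps were mostly longer
    than the base threshold (bursty traffic), spin down sooner; if
    spin-downs keep getting interrupted, wait longer."""
    long_frac = sum(g > base for g in recent_idle_gaps) / len(recent_idle_gaps)
    return base * (0.5 if long_frac > 0.5 else 2.0)

base = fixed_threshold(E_transition=9.4, P_spin=0.65)   # ~28.9 s
print(adaptive_threshold([60, 90, 5, 120], base))       # bursty -> ~14.5 s
```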