Transcript ppt

Virtual Memory Primitives for User
Programs
Presentation by David Florey
CS533 - Concepts of Operating Systems
Overview

• This paper provides the basic primitives, how they are used, and implementation details on various OSs
• Discuss the various primitives and how they are used (in user-level algorithms)
• Discuss their performance on various OSs
• Discuss the ramifications of these uses (algorithms) on system design
The Primitives (VM Services)

• TRAP
  o Facility allowing user-level handling of page faults (protection or otherwise)
  o An event that is raised (in the form of a message or signal from the OS)
• PROT1
  o Decreases accessibility of a single page
  o A procedure call (via messaging, trap to OS, etc.)
• PROTN
  o Decreases accessibility of n pages
  o A procedure call (via messaging, trap to OS, etc.)
• UNPROT
  o Increases the accessibility of a single page
  o A procedure call (via messaging, trap to OS, etc.)
• DIRTY
  o Returns a set of pages that have been touched since the last call to DIRTY
  o A procedure call (via messaging, trap to OS, etc.)
• MAP2
  o Maps two different virtual addresses to point to the same physical page
  o Each virtual address has its own protection level
  o This is in the same address space (not two different processes or tasks or address spaces)
  o A procedure call (via messaging, trap to OS, etc.)
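As a rough guide (my summary, not from the slides), the primitives above correspond to familiar Unix facilities; the exact calls vary by OS:

```python
# Rough mapping of the paper's primitives onto Unix facilities.
# This is an illustrative summary table; the exact mechanisms vary by OS.
primitives = {
    "TRAP":   "user-mode fault handler, e.g. sigaction() on SIGSEGV",
    "PROT1":  "mprotect(addr, pagesize, PROT_NONE) on a single page",
    "PROTN":  "one mprotect() call over an n-page region (batches the TLB work)",
    "UNPROT": "mprotect(addr, pagesize, PROT_READ | PROT_WRITE)",
    "DIRTY":  "no portable Unix equivalent; treated as a wished-for service",
    "MAP2":   "mmap() the same object at two addresses with different protections",
}

for name, impl in primitives.items():
    print(f"{name:6} -> {impl}")
```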
CS533 - Concepts of Operating Systems
VM Service Usage
Concurrent Garbage Collection

• Stop all threads
• Divide memory into from-space and to-space
• Copy all objects reachable from "roots" and registers into to-space
• Use PROTN to protect all pages in the unscanned area
• Use MAP2 to allow the collector access to all pages while preventing mutators from accessing the same pages
• Restart threads
• As mutator threads attempt to access pages in to-space that are unscanned, a TRAP event:
  o Stops the mutator in its tracks
  o Calls the collector, which scans, forwards, and UNPROTs the page
  o The mutator is allowed to continue
• At some point this process is restarted, and all objects left in from-space are considered garbage and removed
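The trap-driven sequence above can be sketched as a toy, single-threaded simulation. Class and method names are mine; a real collector protects actual pages with PROTN/MAP2 rather than tracking a Python set:

```python
# Toy simulation of a concurrent copying collector's trap handling
# (illustrative names; not the paper's implementation).
class Heap:
    def __init__(self, objects, roots):
        self.from_space = dict(objects)   # name -> list of referenced names
        self.to_space = {}
        self.protected = set()            # stands in for PROTN'd unscanned pages
        self.roots = list(roots)

    def forward(self, name):
        """Copy an object to to-space, unscanned and protected."""
        if name not in self.to_space:
            self.to_space[name] = list(self.from_space[name])
            self.protected.add(name)      # PROTN
        return name

    def flip(self):
        """Stop the world briefly: forward only the roots."""
        for r in self.roots:
            self.forward(r)

    def scan(self, name):
        """TRAP handler: scan one object, forward its children, UNPROT it."""
        self.to_space[name] = [self.forward(c) for c in self.to_space[name]]
        self.protected.discard(name)      # UNPROT

    def mutator_read(self, name):
        if name in self.protected:        # access to unscanned page -> TRAP
            self.scan(name)
        return self.to_space[name]

heap = Heap({"A": ["B"], "B": [], "C": ["A"], "D": []}, roots=["A"])
heap.flip()
print(heap.mutator_read("A"))   # fault triggers scan; B gets forwarded
print(sorted(heap.to_space))    # C and D never copied: garbage
```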
Concurrent Garbage Collection

[Figure: from-space/to-space diagram. Registers R1 and R2 point to objects A and F in from-space (objects A–L); A and F are forwarded into to-space as A1 and F1 (along with L1), and the unscanned to-space pages are protected with PROTN. When the mutator thread follows A1's reference to B, a protection fault TRAPs to the collector thread, which uses MAP2 to access the protected page, forwards B to B1, rewrites the reference to B1, and lets the mutator continue.]
VM Service Usage
Shared Virtual Memory

• Each CPU (or machine) has its own memory and memory mapping manager
• Memory mapping managers keep CPU memory consistent with the "shared" memory
• When a page is shared, it is marked "read-only" (PROT1)
• Upon writing this page, a fault occurs in the writing thread, causing a TRAP event in the associated mapping manager
• The mapping manager uses the trap to notify other mapping managers, which in turn flush their copy of the page (this mechanism may also be used to get an up-to-date copy of the page)
• The page is then marked writable (UNPROT) and written
• MAP2 is used to allow the trap handler to access the protected page while the client cannot
• TRAP is also used by the mapping manager to pull down a page from another CPU or disk when it is not available locally
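The write-invalidate protocol above can be sketched as a toy simulation. The structure and names are mine (real managers operate on hardware page protections, not dicts):

```python
# Toy write-invalidate shared virtual memory, one "page cache" per CPU.
# Illustrative sketch, not the paper's (Li/Hudak-style) implementation.
class SharedVM:
    def __init__(self, n_cpus):
        self.copies = [dict() for _ in range(n_cpus)]  # per-CPU cached pages
        self.mode = [dict() for _ in range(n_cpus)]    # page -> "ro" | "rw"
        self.backing = {}                              # the "true" shared memory

    def read(self, cpu, page):
        if page not in self.copies[cpu]:               # page fault: fetch a copy
            self.copies[cpu][page] = self.backing.get(page)
            for m in self.mode:                        # PROT1: all holders read-only
                if page in m:
                    m[page] = "ro"
            self.mode[cpu][page] = "ro"
        return self.copies[cpu][page]

    def write(self, cpu, page, value):
        if self.mode[cpu].get(page) != "rw":           # write fault -> TRAP
            for other, cache in enumerate(self.copies):
                if other != cpu:                       # tell other managers to flush
                    cache.pop(page, None)
                    self.mode[other].pop(page, None)
            self.mode[cpu][page] = "rw"                # UNPROT: page now writable
        self.copies[cpu][page] = value
        self.backing[page] = value

vm = SharedVM(2)
vm.write(0, "C", 1)          # CPU0 owns page C
vm.read(1, "C")              # CPU1 takes a read-only copy; C is now shared
vm.write(0, "C", 2)          # write fault: CPU1's stale copy is flushed first
print(vm.read(1, "C"))       # CPU1 refetches the up-to-date page -> 2
```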
Shared Memory

[Figure: CPUs, each with a mapping manager and a local cache of pages A–F. Page C is shared and protected with PROT1. A thread on CPU1 attempts to write C; the write fault TRAPs to CPU1's mapping manager, which tells the other mapping managers to flush their copies ("C is MINE!"). The trap handler's own access to C is allowed because of MAP2. C is then UNPROTed and the thread is resumed and allowed to write to C. If a thread on CPU2 later needs C, its page fault is handled by its mapping manager, which retrieves the up-to-date C from CPU1.]
VM Service Usage
Concurrent Checkpointing

• Checkpointing is the process of saving state such as the heap, stack, etc. – which can be slow
• Instead of a synchronous save, we can simply use PROTN to mark the pages that need to be saved to disk read-only
• A second thread can then run concurrently with the user threads, writing out pages and UNPROTing each page as it's written
• If a user thread hits a "read-only" page, a fault occurs, TRAPping to the concurrent thread, which quickly writes the page and allows the faulting thread to continue
• Could also just do this with the DIRTY pages using PROT1
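The scheme above can be sketched as a toy, single-threaded simulation (names are mine; a real checkpointer runs the background pass in a separate thread and writes real pages to disk):

```python
# Toy concurrent checkpointing: protect everything, save lazily,
# but save-on-fault when the mutator wants to write first.
class Checkpointer:
    def __init__(self, pages):
        self.pages = dict(pages)
        self.protected = set(pages)      # PROTN: all pages read-only
        self.snapshot = {}               # stands in for the disk image

    def save(self, page):
        self.snapshot[page] = self.pages[page]  # "write to disk"
        self.protected.discard(page)            # UNPROT

    def mutator_write(self, page, value):
        if page in self.protected:       # write fault -> TRAP: save old value first
            self.save(page)
        self.pages[page] = value

    def background_pass(self):
        """The concurrent thread: write out remaining protected pages."""
        for page in list(self.protected):
            self.save(page)

cp = Checkpointer({"p1": "a", "p2": "b"})
cp.mutator_write("p1", "A")   # faults; page saved before being overwritten
cp.background_pass()
print(cp.snapshot)            # {'p1': 'a', 'p2': 'b'}: a consistent snapshot
```

The key property is that the snapshot reflects the state at protection time, even though the mutator kept running.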
Concurrent Checkpointing With DIRTY

• Use DIRTY to see which pages (of pages 1–4) have been modified
• Pages 1, 2, and 4 are dirty, so PROT1 each of these, then execute the original algorithm
VM Service Usage
Generational Garbage Collection

• Objects are kept in generations
• The longer an object lives, the older its generation
• Typically garbage is in younger generations, but an old object might be pointing at a young object, so…
• Use DIRTY (as in checkpointing) to see if pages containing old objects were changed; objects in these DIRTY pages can be scanned to see where they point
• Or
• PROTN all old pages and TRAP to a handler when an old page is written to; save the page id in a list for later scanning and UNPROT the page so the writer can write
• Later, the collector can scan the list of pages to see if any objects within the pages are pointing to younger generations
• Why use a small page size here?
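The PROTN/TRAP variant above amounts to a page-granularity write barrier. A minimal sketch, with names of my own choosing:

```python
# Toy page-marking write barrier for generational GC (illustrative names).
class RememberedPages:
    def __init__(self, old_pages):
        self.protected = set(old_pages)  # PROTN all old-generation pages
        self.modified = []               # page list for the next minor collection

    def write(self, heap, page, slot, ref):
        if page in self.protected:       # write fault -> TRAP
            self.modified.append(page)   # remember the page id for later scanning
            self.protected.discard(page) # UNPROT so the writer can write
        heap[page][slot] = ref           # the interrupted write proceeds

heap = {"old1": {}, "old2": {}}
rp = RememberedPages(heap)
rp.write(heap, "old1", "f", "young_obj")  # old->young pointer gets recorded
print(rp.modified)                        # ['old1']: only this page is rescanned
```

A smaller page size helps here because the collector must later scan every object on each remembered page, whether or not it actually points into a younger generation.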
[Figure: the problem (in red) – an object in older Generation X pointing to an object in younger Generation Y, which means we can't collect that object. Pretend each generation is in its own page: Generation X in page 1, Generation Y in page 2.

1. Use DIRTY: at time t1 nothing is DIRTY; at time t2, page 1 is dirty. At t3, the collector kicks in and scans the list of dirty pages for reverse dependencies.

2. Use PROTN and TRAP to create a list of modified pages: at time t2, PROTN all Generation X pages. When page 1 is written to, add it to a list and allow the write to continue (UNPROT page). At t3, the collector kicks in and scans the MANUALLY created list of written pages.]
VM Service Usage
Others…

• Persistent Stores
  o Can use VM services to protect pages, trap on writes, and persist dirty pages on commit or toss them on abort
  o TRAP, UNPROT, PROT1 or PROTN, MAP2
• Extending addressability
  o After translating 64-bit addresses to 32-bit, pages may need to be protected so that a TRAP handler can properly "load" the page for suitable access, then UNPROT it
  o TRAP, UNPROT, PROTN, MAP2
• Data-compression Paging
  o Compressing n pages into a couple of pages may be faster than writing these pages to disk. The compressed pages can then be access-protected. When the user then tries to access such a page: TRAP, decompress, UNPROT
  o Could also use PROT1 to test the access frequency of a page
  o TRAP, PROT1 or PROTN, UNPROT
• Heap overflow detection
  o Terminate the memory allocation area with a "guard" (PROT1) page
  o Upon access to this page, the TRAP handler triggers the collector
  o An alternative is a conditional branch
  o PROT1, TRAP
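Data-compression paging is easy to sketch in miniature. This toy (class and method names are mine) uses `zlib` in place of the paper's compression step, and dict membership in place of real page protection:

```python
import zlib

# Toy compressed "swap": evicting a page compresses and protects it;
# touching it again TRAPs, decompresses, and UNPROTs.
class CompressedPager:
    def __init__(self):
        self.resident = {}      # page -> raw bytes (accessible)
        self.compressed = {}    # page -> zlib blob (access-protected)

    def evict(self, page):
        """Compress instead of writing to disk, then access-protect."""
        self.compressed[page] = zlib.compress(self.resident.pop(page))

    def access(self, page):
        if page not in self.resident:   # protection fault -> TRAP
            blob = self.compressed.pop(page)
            self.resident[page] = zlib.decompress(blob)  # then UNPROT
        return self.resident[page]

pager = CompressedPager()
pager.resident["p1"] = b"d" * 4096      # a highly compressible "page"
pager.evict("p1")
print(len(pager.compressed["p1"]))      # far smaller than 4096
print(pager.access("p1") == b"d" * 4096)  # True: page restored on fault
```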
Persistent Store Example & Data Compression Example

[Figure: pages 1 and 2 hold data "ddd" and "qqq"; page 3 holds "rrr" as a compressed page. Pages 1 and 2 are PROTN (instead of copy-on-write). A write to a protected page TRAPs; the page is UNPROTed and the write allowed (2a, 2b), and the modified data is saved on commit (1a: save A on commit; 1b: save C on commit). Accessing the compressed page TRAPs, the page is decompressed, and the thread resumes.]
Performance in OSs

• Devised the Appel1 and Appel2 benchmarks based on the algorithms' patterns of primitive usage
• Appel1
  o PROT1, TRAP, UNPROT
  o e.g. Shared Virtual Memory
• Appel2
  o PROTN, TRAP, UNPROT
  o e.g. Concurrent garbage collection
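The two measured call patterns can be written out schematically (my reconstruction of the idea, not the paper's benchmark code): Appel1 pays one protect call per page, while Appel2 batches the protection of N pages into a single call.

```python
# Schematic call patterns for the two benchmarks (illustrative only).
def appel1(pages, prot1, trap, unprot):
    for p in pages:
        prot1(p)       # PROT1: one call (and TLB flush) per page
        trap(p)        # access fault handled in user mode
        unprot(p)      # UNPROT

def appel2(pages, protn, trap, unprot):
    protn(pages)       # PROTN: one call covering all N pages
    for p in pages:
        trap(p)
        unprot(p)

log = []
appel2([1, 2, 3],
       lambda ps: log.append(("protN", list(ps))),
       lambda p: log.append(("trap", p)),
       lambda p: log.append(("unprot", p)))
print(log[0])   # ('protN', [1, 2, 3]): protection is batched in Appel2
```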
Performance of Primitives

• All data is normalized to the speed of an Add instruction on each CPU
• Some OSs didn't implement MAP2
• Some OSs did a crummy job of implementing these primitives
  o mprotect does not flush the TLB correctly
• OS designers seem to be relying on old notions like disk latency
  o Not relevant with CPU-bound algorithms like these
• One OS performed exceptionally well, showing that these operations don't have to perform poorly
Ramifications on System Design

• Fault handling must be fast, because we are no longer at the mercy of the disk – we can do it all in the CPU
• TLB consistency
  o Making memory more accessible is good for TLB consistency
    • One less thing you need to worry about
  o Making memory less accessible in the multi-processor case forces TLB "shootdown"
    • Stop all processors and tell each to flush entry 123 in the TLB
    • Better if done in batches
    • In fact, paging out could improve if done in batches too
Ramifications on System Design

• Optimal page size
  o Some operations depend on the size of the page
    • "HEY OS DESIGNERS LISTEN UP!"
  o Disk latency can no longer be counted on to hide crummy design
  o Computations linearly proportional to page size are now going to be noticed, so we might benefit by cutting down the page size
    • Those algorithms that do a lot of scanning – like the generational garbage collector – would benefit from a smaller page size
  o Also be aware that shrinking page sizes will cause more page faults and more calls to the fault trap handler, so its overhead must also be very small
Ramifications on System Design

• Access to protected pages
  o Mapping the same page two different ways, with two different protections, in the same address space is FAST
    • Although it does add some bookkeeping overhead
    • And cache consistency could be a problem
  o You could achieve the same results by copying memory around – only 65 copies and you're there!
    • Or pounding your head on the desk – that works too
  o You could also use a heavyweight process and super-heavy RPC to context switch, relying on OS support for pages shared between processes
    • Techniques employed in LRPC and URPC can alleviate the context-switch problem
Ramifications on System Design

• What about pipelined processors?
  o Out-of-order execution
  o Dependence on sequential execution
  o Only a problem in the heap overflow detection case
    • Register tweaking can be a problem
    • All other algorithms work just like a typical page-fault handler – handle the fault, pull the page in, make the page accessible
Final Considerations

• Making memory more accessible one page at a time, and less accessible in large batches, is good for TLB consistency
• The total performance effect of page size should be considered (fixed costs vs. variable costs)
• Locality of reference is exploited in these algorithms
  o Better locality improves fault-handling overhead (as data is closer to the CPU)
• Pages should be accessible in different ways in a single address space