
Memory Management
CS 550
Spring 2017
Kenneth Chiu
Background
• Program must be brought (from disk) into
memory and placed within a process for it to be
run.
• Main memory and registers are only storage CPU
can access directly.
• Register access in one CPU clock (or less).
• Main memory can take many cycles.
• Cache sits between main memory and CPU
registers.
• Protection of memory required to ensure correct
operation.
Many Ways to Do Memory
Management!
• We will cover a variety of them.
• Not mutually exclusive, and can be hybridized
in various ways.
• Many of them are simplified examples,
illustrating basic principles.
• Need to understand the possibilities.
The Problem
• There is a finite amount of RAM on any machine.
• There is a bus with finite speed and non-zero latency.
• There is a CPU with a given clock speed.
• There is a disk with a given speed.
• How do we manage memory effectively?
  – Mainly, that means preventing CPU idle time, maintaining good response time, etc.
  – While maintaining isolation and security.
Early Systems: No Abstraction
• Earlier systems: no
abstraction
– OS == library that sat in
memory starting from physical
address 0.
– Just one running program
starting from physical address
64K.
[Figure: physical memory from 0KB to max: the OS (code, data, etc.) at 0KB, and the one current program (code, data, etc.) starting at 64KB.]
Multiprogramming and Time-Sharing
• Multiple concurrent processes
– Batch multiprogramming
– Timesharing
• Could switch by swapping.
– Fast or slow? Why?
• Instead, leave in memory.
• Good? Bad?
[Figure: 512KB of physical memory shared by several resident processes: the OS (code, data, etc.) at 0KB, Process C at 128KB, Process B at 192KB, and Process A at 320KB, with free regions in between.]
Address Space
• This is the view we'd
like to provide.
• Goals of VM:
  – Transparency
  – Protection
  – Isolation
  – Efficiency
[Figure: a 16KB example address space. Program code at 0KB–1KB (the code segment: where instructions live); the heap at 1KB–2KB (the heap segment: contains malloc’d data, dynamic data structures; it grows downward); free space in the middle; the stack at 15KB–16KB (the stack segment: contains local variables, arguments to routines, return values, etc.; it grows upward).]
ADDRESS TRANSLATION
Base and Bound
• Consider this C code:
  void func() {
      int x;
      ...
      x = x + 3;
  }
• It compiles to something like:
  128: movl 0x0(%ebx), %eax   ; load 0+ebx into eax
  132: addl $0x03, %eax       ; add 3 to eax register
  135: movl %eax, 0x0(%ebx)   ; store eax back to mem
• Also assume that all processes are contiguous in memory, smaller than physical memory, and the same size.
[Figure: the process's 16KB address space (code at the top, heap below it, stack at the bottom) relocated into 64KB of physical memory: OS at 0KB–16KB, the relocated process at 32KB–48KB, the rest not in use.]
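To make the mechanism concrete, here is a minimal C sketch of the check-and-relocate step; the base/bounds values come from the figure, and translate() is a made-up name for what the MMU does in hardware on every reference:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define KB 1024
  static uint32_t base   = 32 * KB;   /* where the process was placed */
  static uint32_t bounds = 16 * KB;   /* size of its address space */

  /* What the MMU does on every memory reference. */
  uint32_t translate(uint32_t vaddr) {
      if (vaddr >= bounds) {          /* out of range: protection fault */
          fprintf(stderr, "segfault at %u\n", vaddr);
          exit(1);
      }
      return base + vaddr;            /* relocate */
  }

  int main(void) {
      printf("%u\n", translate(128)); /* 32*1024 + 128 = 32896 */
      return 0;
  }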
OS Issues
• Find free slot.
– How?
• Can the base and limit registers be modified in user mode?
• What needs to be done on process switch?
• How does it know what base and bound to
use?
Segmentation
• Big chunk of unused space (internal
fragmentation).
– How to avoid?
• Segmentation: Have multiple base/bounds for
every "chunk" (called a segment).
[Figure: the same 16KB address space, now with per-segment placement: the code, heap, and stack segments are placed independently in 64KB of physical memory (OS at 0KB–16KB), so the unused middle of the address space takes no physical memory.]
Which Segment?
• How does the hardware know which segment
to use?
• Could be explicit, it's just part of the address.
• Could be implicit: where does the address
come from?
// get top 2 bits of 14-bit VA
Segment = (VirtualAddress & SEG_MASK) >> SEG_SHIFT
// now get offset
Offset = VirtualAddress & OFFSET_MASK
if (Offset >= Bounds[Segment])
    RaiseException(PROTECTION_FAULT)
else
    PhysAddr = Base[Segment] + Offset
    Register = AccessMemory(PhysAddr)
Shared Memory
• With segments, an OS can support shared
memory.
• Need additional protection bits, however.
Free Space Management
• There are similar problems to what malloc has
to deal with.
[Figure: 64KB of physical memory, not compacted vs. compacted. Not compacted: the OS at 0KB–8KB with allocated regions scattered among unused holes. Compacted: the OS at 0KB–8KB, all allocated regions moved together directly after it, leaving one large free region.]
PAGING
• Segmentation has external fragmentation
issues.
• But plain base and bounds (no segmentation) has internal fragmentation issues.
• Allocate in fixed size chunks, but a flexible
number of fixed size chunks.
• A page is the unit in the address space of the
process (virtual).
• A page frame is the physical unit.
[Figure: 128 bytes of physical memory split into eight 16-byte page frames. Page frame 0 is reserved for the OS; page 0 of the address space (AS) is in frame 3, page 1 in frame 7, page 2 in frame 5, and page 3 in frame 2; the rest are unused.]
• 4 pages, each 16 bytes.
• Mapping kept in page table.
Page Table Organization
• Simplest is just an array.
– Length of the array is number of pages.
– Width of the array depends on number of page
frames.
Page Table Entry
[Figure: an x86 PTE. Bits 31–12 hold the PFN; the low bits are flags: G, PAT, D, A, PCD, PWT, U/S, R/W, P.]
• Example: PTE from x86
  – Present bit (P)
  – Read/write bit (R/W)
  – User/supervisor bit (U/S)
  – Access bit (A)
  – Dirty bit (D)
  – Hardware caching related bits (PWT, PCD, PAT, and G)
  – The page frame number (PFN)
Translation Lookaside Buffers
• Address translation, as we've described it, is
slow.
– Page table is in memory.
– Whenever some kind of access or lookup is slow in
CS, what's the solution?
• A cache for PTEs, known as the TLB.
Associative Memory
• Associative memory is a fancy name for a lookup table.
– Use a key, produce a value.
– In software, we would implement it using a hash table, binary search tree, etc.
– In hardware, the idea is that it’s a parallel search on all keys. Very expensive.
[Figure: TLB associative registers map page # → frame #.]
• Address translation of (p, d) is then:
– If p is in associative register, produce the frame #.
– Otherwise get frame # from page table in memory.
Paging Hardware With TLB
Effective Access Time
• How long does a lookup take with a TLB?
• A crucial performance measure is the hit ratio, h, which is
the percentage of time that the desired page is in the TLB.
– Assume TLB lookup time is l.
– Assume memory cycle time is m.
• The effective access time (EAT) is then:
EAT = (m + l) h + (2m + l)(1 – h)
= 2m + l – h*m
• Example: Assume 80% hit ratio, 20 ns to search TLB, 100 ns
to access memory.
– If in TLB, how long does it take access memory?
– If not in TLB, how long does it take?
– EAT = .8*120 + .2*220 = 140 ns
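As a sanity check of those numbers, a small C sketch:

  #include <stdio.h>

  /* EAT = (m + l)h + (2m + l)(1 - h); slide numbers:
     l = 20 ns, m = 100 ns, h = 0.8. */
  double eat(double m, double l, double h) {
      return (m + l) * h + (2 * m + l) * (1 - h);
  }

  int main(void) {
      printf("hit: %g ns, miss: %g ns\n", 100 + 20.0, 2 * 100 + 20.0);
      printf("EAT: %g ns\n", eat(100, 20, 0.8));   /* 140 ns */
      return 0;
  }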
TLB Reach/Coverage
• TLB reach: Amount of virtual address space
covered by TLB.
• Example:
– 4 GB address space
– 4 KB pages
– 32 entry TLB
– 32 × 4 KB = 128 KB
Context Switches
• What is the key in the TLB?
• What happens when we do a context switch?
• Normally must flush the TLB.
• What does this do to performance?
• Some TLBs store address-space identifiers (ASIDs) in each TLB entry, which uniquely identify each process, to provide address-space protection for that process.
• This allows the TLB to not be flushed.
• Example:
– G: Global (ignore ASID)
– D: Dirty
– V: Valid (is this a used slot)
TLB Miss Handling
• On TLB miss, it needs to be filled from the
page tables.
• Could be handled by the hardware automatically, or it could be handled by software.
– What would be the mechanism for it to be
handled by software?
– Pros/cons?
TLB Replacement Policy
• TLB is a kind of cache.
• Caches can get full.
• What’s a good policy?
SMALLER TABLES
• Real-sized, linear table size:
  – 32-bit address space
  – 4 KB page size (2^12)
  – 4 byte PTE
  – 2^20 × 4 bytes = 4 MB
• Each process has a separate page table!
  – 100–200 processes → 400–800 MB just for page tables
• Too much, must make smaller!
Bigger Pages
• Initial scenario, but with 16 KB pages:
  – 32-bit address space
  – 16 KB page size (2^14)
  – 4 byte PTE
  – 2^18 × 4 bytes = 1 MB
• X86-64 has 4 KB, 2 MB, 1 GB page sizes.
• Still pretty big.
• Also have internal fragmentation.
• Bigger pages increase TLB reach.
Hybrid: Paging and Segmentation
• A linear page table is mostly empty.
• Hybrid approach:
  – Use segments of much smaller page tables.
• Example: 32-bit address space, 4 KB pages, 4
segments, 3 used. Address looks like:
• 3 base-bounds pairs.
– Each base points to the page table in physical memory
for that segment.
– Stored in hardware registers.
• Has some disadvantages.
– Assumes that memory usage can be divided into
"segments".
– Can lead to various fragmentation issues, since page
tables are now different sizes.
Multi-Level Page Tables
• How did we get to paging?
– Went with segmentation, then to paging.
– Then we did segmentation of page tables.
– Now can do paging of page tables!
• Multi-level page tables:
– Chop page table into page-sized chunks.
– If an entire page of page-table entries is invalid, don't allocate that chunk at all.
– Create an "outer" page table, called the page
directory.
Hierarchical Page Tables
• Break up the logical address space into
multiple page tables.
• A simple technique is a two-level page
table.
• First-level page table is contiguous, but
small.
• Each first-level PTE points to a separate 2nd-level table.
• Each 2nd-level table must still be contiguous,
but there are many of them, so each one is
(relatively) small.
Two-Level Page-Table Scheme
Address-Translation Scheme
[Figure: each entry in the outer table holds the physical page where that piece of the page table is stored.]
Example
• Consider this virtual address space:
  – 16 KB
  – 64 byte pages
  – 6 bit offset
  – 8 bit VPN
• First, chunk the page table into pages.
• The VPN splits into a page-directory index (PDI) and a page-table index (PTI): use the top 4 bits to index into the page directory, and the next 4 to index into the page table.
• More than two levels: consider a 30-bit virtual address space and 512 byte pages.
• All of the page directory must be present.
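A runnable C sketch of that two-level lookup; the table contents here are made up:

  #include <stdint.h>
  #include <stdio.h>

  #define OFFSET_BITS 6   /* 64-byte pages */
  #define PTI_BITS    4
  #define PDI_BITS    4

  static uint8_t page_dir[1 << PDI_BITS];                   /* PDE: which PT */
  static uint8_t page_tables[1 << PDI_BITS][1 << PTI_BITS]; /* PTE: a PFN */

  uint32_t translate(uint32_t vaddr) {
      uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
      uint32_t pti    = (vaddr >> OFFSET_BITS) & ((1u << PTI_BITS) - 1);
      uint32_t pdi    =  vaddr >> (OFFSET_BITS + PTI_BITS);
      uint32_t pt     = page_dir[pdi];          /* 1st memory access (PDE) */
      uint32_t pfn    = page_tables[pt][pti];   /* 2nd memory access (PTE) */
      return (pfn << OFFSET_BITS) | offset;
  }

  int main(void) {
      page_dir[0] = 0;
      page_tables[0][1] = 5;               /* map VPN 1 -> PFN 5 */
      printf("%u\n", translate(64 + 10));  /* VPN 1, offset 10 -> 330 */
      return 0;
  }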
64-Bit Systems
• How big is the outer page table?
• Three-level?
Inverted Page Table
• Normal page table has one entry for each page.
– Optimizing it amounts to figuring out how to not use resources for
large chunks of the page table that are empty.
• If the ratio of logical address space to physical address space is
large, this suggests using one entry per physical frame, rather than
one entry per page.
• Entry consists of the virtual address of the page stored in that real
frame, with information about the process that owns that page.
• Decreases memory needed to store each page table, but increases
time needed to search the table when a page reference occurs.
• Use hash table to limit the search to one — or at most a few —
page-table entries.
• Why do we need the pid in the table?
Hashed Page Tables
• Use a hash table for lookup.
• The virtual page number is hashed and used as an
index into a hash table.
– That entry in the table contains a chain of elements
hashing to the same location.
– I.e., virtual page numbers that have the same hash
code.
• Virtual page numbers are compared in this chain
searching for a match.
– If a match is found, the corresponding physical frame
is extracted.
[Figure: hashed page table. The page number from the logical address (p, d) is hashed into a hash anchor table; each anchor points to a chain of inverted page table entries (pid, page number, next); a matching entry at index i gives the physical address (i, d).]
• How do you do shared memory with an
inverted page table?
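A sketch of such a lookup in C; the sizes, hash function, and entry layout are all invented for illustration:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define NFRAMES  1024
  #define NBUCKETS 256

  struct ipt_entry { uint32_t pid, vpn; int next; };  /* next: chain index */

  static struct ipt_entry ipt[NFRAMES];  /* one entry per physical frame */
  static int anchor[NBUCKETS];           /* hash anchor table */

  static unsigned hash(uint32_t pid, uint32_t vpn) {
      return (pid * 2654435761u ^ vpn) % NBUCKETS;
  }

  void ipt_init(void) {
      memset(ipt, 0, sizeof ipt);
      for (int i = 0; i < NBUCKETS; i++) anchor[i] = -1;
  }

  /* Returns frame number, or -1 if not resident (page fault). */
  int lookup(uint32_t pid, uint32_t vpn) {
      for (int i = anchor[hash(pid, vpn)]; i >= 0; i = ipt[i].next)
          if (ipt[i].pid == pid && ipt[i].vpn == vpn)
              return i;                  /* entry index == frame number */
      return -1;
  }

  int main(void) {
      ipt_init();
      unsigned b = hash(7, 3);           /* put (pid 7, vpn 3) in frame 42 */
      ipt[42] = (struct ipt_entry){ .pid = 7, .vpn = 3, .next = anchor[b] };
      anchor[b] = 42;
      printf("%d\n", lookup(7, 3));      /* 42 */
      return 0;
  }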
GOING BEYOND PHYSICAL
MEMORY
Locality
• Locality is a fundamental property of
computer systems.
• Spatial locality:
– Nearby memory addresses are often accessed together.
• Temporal locality:
– Memory that is accessed now is likely to be
accessed again in the near future.
Memory Hierarchy
• There are different kinds of memory, in the general sense, with different price/performance:
  – Registers
  – Cache (L1, L2, L3)
  – RAM
  – Flash
  – Hard drive
  – Tape
• Each level is used as a cache for the level below.
Virtual Memory
• Idea is to make the process think there is more memory than there
really is.
• How?
– If it tries to access a (virtual) page that is not mapped to a physical
page, we on-the-fly find a (physical) page frame for it.
– Pages that are not mapped have their contents stored on disk.
• Virtual memory allows the sum total logical memory of all ready or
running processes to be greater than physical memory.
• Parts of the logical address space can be non-resident in physical memory.
• This also allows logical address space to be much larger than the
physical memory.
What is the advantage of virtual memory?
Does it work well with every process?
• Virtual memory counts on:
– In many cases, not all of the logical memory of a
process is used in any given run.
• For example, if you never use the comparison feature of Word.
• For example, if you statically allocate an array of a
certain size as a maximum.
– Furthermore, even if N pages are used in a given
run, they are not all used at the same time.
Page Fault
• A page fault: Reference to memory that is “valid”,
but not “present”.
• Sequence:
  1. Hardware checks TLB.
  2. Not there, causes exception. Jumps to TLB miss handler.
  3. Miss handler looks in page tables.
  4. If “invalid”, then segfault.
  5. If “valid”, but not present, then it finds a frame, pages the contents in from disk.
     – Does a context switch, since paging is slow.
If Memory Is Full
• No free frames -> kick out a page
• Which page to kick out?
– Page-replacement policy
• What would be a good page? Bad page?
– Ideally, kick out (“evict”) pages that won’t be used for
a long time.
• When should kick out occur?
– When actually needed, or earlier?
– Should try to use “idle” time to page-out pages ahead
of time.
POLICIES
RAM is Cache
• Which is faster, registers or RAM?
  – Why not just use registers then?
• Some memory is fast but expensive (registers).
• Some memory is cheap but slow (disk).
• This leads to a cache hierarchy.
• As with all hierarchies, cache management then is important.
Optimal Replacement
• How far from ideal?
PAGE REPLACEMENT
• Suppose two processes have ten pages each,
but each process only ever uses five pages.
• This will fit in ten frames.
• If each process ends up using six pages each,
though, then we have over-subscribed
physical memory. In such cases, frames that
are currently in use will need to be freed up.
• This is known as page replacement.
Basic Steps to Page In
• If the page already has contents (has been written to previously):
  1. Find the location of the desired page on disk.
  2. Find a free frame:
     • If there is a free frame, use it.
     • If there is no free frame, use a page replacement algorithm to select a victim frame.
  3. Bring the desired page into the (newly) free frame; update the page and frame tables.
  4. Restart the process.
What if the page is a new page? How does this change?
• If the page is a new page, then there is no need to page in from disk, but
  – a location on disk needs to be allocated for it,
  – and the frame needs to be zero-filled.
How many disk ops are needed? Always?
• Can use the modified bit: a special bit in the frame table that is maintained by the hardware.
Page Replacement
What are the goals of page replacement?
• A page replacement algorithm should:
– Result in the minimum number of page faults.
– Not result in too much overhead. In other words, it should not spend too much effort in figuring out which page to replace.
– Be implementable on available hardware.
• Want lowest page-fault rate.
• As you increase the number of frames, does
the number of page faults go up or down?
Graph of Page Faults Versus The Number of Frames
FIFO Page Replacement
• Evaluate algorithm by running it on a particular string of
memory references (reference string) and computing the
number of page faults on that string.
• One simple page replacement algorithm is simply FIFO.
• 15 page faults.
Is FIFO good?
What can you say about the page that is replaced?
• FIFO replaces the “first” page that was brought in.
If it was brought in a long time ago, does that mean it is not
needed anymore?
• Could be an initialization page that is not needed.
• Could be something that was used in the beginning and is still
heavily used.
Can you actually implement FIFO?
• Can be implemented by putting pages on a linked list
when they are paged in.
How much overhead would there be?
– Overhead is minimal, because there is already a page fault
when a page is paged in.
Another Example
• Consider this sequence of references:
  1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• How many faults with 4 page frames?
• How many with 3 page frames?
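A small C simulation of FIFO on this reference string answers both questions, and shows Belady's anomaly (more frames, more faults):

  #include <stdio.h>

  static int fifo_faults(const int *refs, int n, int nframes) {
      int frames[16], head = 0, used = 0, faults = 0;
      for (int i = 0; i < n; i++) {
          int hit = 0;
          for (int j = 0; j < used; j++)
              if (frames[j] == refs[i]) { hit = 1; break; }
          if (hit) continue;
          faults++;
          if (used < nframes) frames[used++] = refs[i];
          else { frames[head] = refs[i]; head = (head + 1) % nframes; }
      }
      return faults;
  }

  int main(void) {
      int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
      int n = sizeof refs / sizeof refs[0];
      printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* 9 */
      printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
      return 0;
  }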
Optimal Algorithm
Is there an algorithm that has the lowest possible number of faults?
• The optimal algorithm (the one that has the lowest possible number of
page faults) is simply: Replace the page that will not be used again for the
longest period of time.
– Sometimes called OPT or MIN algorithm.
• 9 page faults
Is this a practical algorithm?
If it’s not practical, what good is it? Why do we
mention it at all?
• Though not practical, it’s useful as a
comparison. It’s helpful to know how close a
particular algorithm performs compared to
the best possible.
Least Recently Used Algorithm
• Optimal algorithm: impractical; Approximate it, with LRU.
• Key idea: a page that has not been used recently probably won’t be
used for a while.
• Algorithm: Pick the page that has been unused for the longest time
as the page to replace.
• 12 page faults.
How do we know what page was least recently
used?
• LRU is considered the de facto standard page-replacement algorithm. The main question is how to implement it.
– Counters/timestamp:
• Associate with each PTE a time-of-use field.
• Use a CPU clock/counter register.
• Whenever a page is referenced, the contents of the
register are copied to the PTE field.
How do we find the LRU page then?
• To find the LRU page, need to search all PTEs.
– Stack approach:
• Maintain a stack of page numbers.
• When a page is referenced, remove it from its current location and push it
onto the top of the stack.
What is at the top of the stack? Bottom?
– The most recently used page is at the top of the stack then.
– The LRU page is at the bottom.
Which approach is more efficient?
– Stack doesn’t require a search, but is more complex.
LRU Approximation Algorithms
Say your boss says tomorrow, “I’ve read that LRU is
optimal. I want you to implement it.” What is your
response?
Do most systems provide the hardware support for
the preceding LRU techniques? What is provided?
• Most systems provide page protection bits.
Can we use this to implement LRU?
• Could use to implement LRU by trapping on every
reference, then updating a timestamp.
• But this would be horribly, horribly slow.
• Most systems do provide a reference bit, that
is set when the page is referenced.
How can we use this to help us approximate
LRU?
• Additional history bits:
– Maintain N-bit number for each page.
– At the end of each interval, move each reference bit
to the high-order bit, shifting all other bits to the right
by one bit. Then clear the reference bit.
– Use the page with the lowest number.
– Examples:
• 0000
– Not used at all in last four intervals.
• 1111
– Used in each of last four intervals.
• 0011
– Not used for two intervals.
• 1100
– Used in last two intervals, not used before that.
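A sketch of this history-bits ("aging") scheme, assuming 8 history bits and a hardware-set reference bit; the data layout is hypothetical:

  #include <stdint.h>

  #define NPAGES 256

  static uint8_t ref_bit[NPAGES];   /* set by hardware, cleared here */
  static uint8_t history[NPAGES];   /* 8 intervals of history */

  /* Run at the end of each interval. */
  void age_tick(void) {
      for (int i = 0; i < NPAGES; i++) {
          history[i] = (uint8_t)((history[i] >> 1) | (ref_bit[i] << 7));
          ref_bit[i] = 0;
      }
  }
  /* Victim: the page with the smallest history value. */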
Clock Algorithm (Second Chance)
• Maintain a circular list of page frames.
• There is a clock hand that points to the page
to be replaced next.
• When a page frame is needed, the hand
advances till it finds a frame that has not been
referenced.
• As it advances, it clears the bits.
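A sketch of the hand's advance in C (array-based, hypothetical layout; the hardware sets ref_bit on each access):

  #define NFRAMES 8

  static int ref_bit[NFRAMES];   /* set by hardware on each access */
  static int hand = 0;

  int clock_pick_victim(void) {
      while (ref_bit[hand]) {            /* referenced: second chance */
          ref_bit[hand] = 0;             /* clear as the hand passes */
          hand = (hand + 1) % NFRAMES;
      }
      int victim = hand;
      hand = (hand + 1) % NFRAMES;
      return victim;
  }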
What happens when all reference bits are set?
• If all reference bits are always set, this
degenerates to FIFO.
• Enhanced second-chance: Use two bits,
referenced and modified:
– (0, 0): Not used, not modified.
• Good candidate.
– (0, 1): Not used (recently), but modified.
• Page will need to be written out to disk.
– (1, 0): Used, but clean.
• Probably used again soon.
– (1, 1): Used and dirty.
– We first look for something in (0,0) class, then in
(0,1) class, etc. Each search is a complete scan.
Experiments
• 100 unique pages
• 10,000 total references
• Cache size (physical memory) is varied.
• No locality: Each reference to a random page.
• 80-20: 80% of references go to 20% of the pages (the “hot” pages), the rest to the “cold” pages.
• Looping-sequential: Touch pages 0-49, then
repeat.
• 80-20 with clock
Is VM for All Applications?
Do all applications benefit from VM?
• Some applications, such as databases, have a
very good understanding of their memory
usage.
• Such applications use raw disk partitions.
Thrashing
• [Show thrash.cpp]
• If a process does not have “enough” pages,
the page-fault rate is very high. This leads
to:
– low CPU utilization
– operating system thinks that it needs to
increase the degree of multiprogramming
– another process added to the system
• Thrashing ≡ a process is busy swapping pages in and out
Thrashing (Cont.)
Demand Paging and Thrashing
• Why does demand paging work?
Locality model
– Process migrates from one locality to another
– Localities may overlap
• Why does thrashing occur?
Σ size of localities > total memory size
Locality In A Memory-Reference Pattern
• Does locality
mean that
references are
close to each
other?
• Localities
switch. Stages.
Data Organization
• Access patterns:
  int data[128][128];
• Each row is stored in one page.
• Program 1:
  for (j = 0; j < 128; j++)
      for (i = 0; i < 128; i++)
          data[i][j] = 0;
• Program 2: [Show stride.]
  for (i = 0; i < 128; i++)
      for (j = 0; j < 128; j++)
          data[i][j] = 0;
  – Program 1 touches a different page on every inner iteration (128 × 128 page touches); Program 2 finishes each page before moving on (128 pages, touched sequentially).
• Summary:
– Optimal algorithm is to replace the page that will
not be used for the longest period of time starting
from now.
• “It’s tough to make predictions, especially about the
future.”
– Approximate optimal with LRU.
– But LRU has too much overhead, not practical on
real hardware.
– Approximate LRU with various heuristics such as
clock algorithm.
Counting Algorithms
• Keep a counter of the number of references that
have been made to each page.
• LFU Algorithm: replaces page with smallest count
  – Is the least-frequently-used page really not going to be used in the future?
• MFU Algorithm: based on the argument that the
page with the smallest count was probably just
brought in and has yet to be used. If count is high,
must be old, probably can be paged out.
Additional Paging-Related Algorithms
• Maintain a pool of clean, “free” pages. When a PF occurs:
– Start page-in to a page picked from the pool.
– Find victim, possibly page-out.
– When victim has been paged-out, it can be added to the free
pool.
What does “free” mean in this context?
– Free really just means that it can be reused immediately. It can
still have valid data.
• Pages in free pool can be reclaimed.
– If a page fault occurs, grab it back from the free pool.
• Alternative strategy, leave it mapped, but clean it and put
in free pool. When needed, then unmap it.
• Cleaning:
– Maintain list of dirty pages. When disk is idle, page
them out to clean them.
DEMAND PAGING
Demand Paging
• What percentage of a program’s code is actually executed?
• What percentage of a program’s data is actually accessed?
• Bring a page into memory only when it is needed:
  – Less I/O needed
  – Less memory needed
  – Faster response, compared to loading whole thing.
  – More users
• Page is needed → reference to it
  – bad reference → abort
  – okay but not-in-memory → bring to memory
• Lazy swapper – never swaps a page into memory unless
page will be needed
– Swapper that deals with pages is a pager
Page Fault
• If there is a reference to a page, first reference to that page will trap to operating system: page fault
  1. Operating system looks at another table to decide:
     – Invalid reference → abort
     – Just not in memory
  2. Get empty frame
  3. Read page into frame
  4. Reset tables
  5. Set validation bit = v
  6. Restart the instruction that caused the page fault
• Pure demand paging:
– Process starts with no pages in RAM. It faults on
the first instruction.
• Do processes reference memory in ways that
are purely random, when observed externally?
• Programs tend to have locality of reference.
They reference memory in ways that are not
totally random.
• Hardware needed is page table, and some
secondary memory.
Instruction Restart
• Another crucial requirement is the ability to
restart an instruction from exactly the spot where
the page fault occurred.
• Consider an ADD instruction:
– ADD A, B, C
– Say we fault when fetching A? B? C? Any problems?
• How about an instruction like this?
– MOVE ++R1(a), (b)
– Move offset from a, autoincrement register R1, to
location b.
Performance of Demand Paging
• Page Fault Rate 0 ≤ p ≤ 1.0
– if p = 0 no page faults
– if p = 1, every reference is a fault
• Effective Access Time (EAT)
EAT = (1 – p) x memory access
+ p x page fault time
• Steps that have to happen during a page fault:
  1. Trap to OS.
  2. Save registers and process state.
  3. Determine that interrupt was page fault.
  4. Determine that the fault is a “legal” reference to a page that has been paged-out, or an “illegal” reference to an invalid logical address (segfault).
  5. Find a free frame.
  6. Issue read to disk to read page into free frame.
  7. Call scheduler, and context-switch to some other thread or process.
  8. Receive an interrupt from disk that I/O is done.
  9. Save the registers before servicing the interrupt.
  10. Determine interrupt is from disk.
  11. Update page tables, so that the reference is not “invalid”.
  12. Wait for CPU to be allocated again.
  13. Restore registers.
• Typical times: memory access time is 200 ns, page fault time is 8 ms.
Demand Paging Example
• Memory access time = 200 nanoseconds
• Average page-fault service time = 8 milliseconds
• EAT = (1 – p) × 200 + p × 8 milliseconds
      = (1 – p) × 200 + p × 8,000,000
      = 200 + p × 7,999,800
• If one access out of 1,000 causes a page fault, then
EAT = 8.2 microseconds.
This is a slowdown by a factor of 40!!
• If we want the slowdown to be less than 10%, then we need to make sure that less than 1 out of 400,000 accesses causes a fault.
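The same arithmetic as a runnable check:

  #include <stdio.h>

  int main(void) {
      double ma = 200.0, fault = 8000000.0;   /* ns: 200 ns, 8 ms */
      double p = 1.0 / 1000;
      double eat = (1 - p) * ma + p * fault;
      printf("EAT = %.1f ns, slowdown = %.1fx\n", eat, eat / ma);
      /* For < 10% slowdown: 200 + p*7999800 < 220, so p < ~1/400,000. */
      return 0;
  }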
Process Creation
• Virtual memory allows other benefits during
process creation:
- Copy-on-Write
- Memory-Mapped Files
Copy-on-Write
• Copy-on-Write (COW) allows both parent and
child processes to initially share the same
pages in memory
• If either process modifies a shared page, only then is the page copied.
• COW allows more efficient process creation as
only modified pages are copied
Before Process 1 Modifies Page C
After Process 1 Modifies Page C
vfork()
• Recall that the fork issue is that in many cases, the child will exec
immediately after a fork.
– This is inefficient, because the fork caused the entire address space of the
parent to be copied.
• COW makes fork efficient.
– Nothing is copied. Rather, the pages are shared.
– Before COW, however, needed another way: vfork().
• After a vfork(), the parent process is suspended.
– The child “borrows” the parent’s address space until the child does an exec.
– At that point, the child gets a completely new address space (no copying), and
returns the address space to the parent, which continues to execute.
What happens if the child assigns to a variable before it does the exec?
• The child must not write to any pages, else it will be changing the parent
and violating the “rules” of vfork().
Base and Limit Registers
• Put every process at
a different location
in memory.
• A pair of base and
limit registers
define the logical
address space.
• How do you handle memory references in the
machine code in such a scheme?
Logical vs. Physical Address Space
• The concept of a logical address space that is
bound to a separate physical address space is
central to proper memory management.
– Logical address – generated by the CPU; also
referred to as virtual address.
– Physical address – address seen by the memory
unit.
• Logical and physical addresses can be the
same in primitive schemes; logical (virtual)
and physical addresses differ in most modern
schemes.
Memory-Management Unit (MMU)
• Hardware device that maps virtual to
physical address.
• In a simple MMU scheme, the value in
the relocation register is added to every
address generated by a user process at
the time it is sent to memory.
• The user program deals with logical
addresses; it never sees the real physical
addresses.
• Simple MMU.
Swapping
• A process can be swapped temporarily out of memory
to a backing store, and then brought back into memory
for continued execution.
• Backing store – fast disk large enough to
accommodate copies of all memory images for all
users; must provide direct access to these memory
images.
• Major part of swap time is transfer time; total transfer
time is directly proportional to the amount of memory
swapped.
• Modified versions of swapping are found on many
systems (i.e., UNIX, Linux, and Windows).
• System maintains a ready queue of ready-to-run
processes which have memory images on disk.
• The term swap is not that precise. Often used slightly
differently, depending on the OS.
Schematic View of Swapping
CONTIGUOUS MEMORY
ALLOCATION
• A common early way of allocating memory is to simply
provide each process with a contiguous logical address
space, and map that to a contiguous range of physical
addresses, called a partition.
• Relocation registers used to protect user processes from
each other, and from changing operating system code and
data.
– Relocation register contains value of smallest physical address.
– Limit register contains range of logical addresses. Each logical
address must be less than the limit register.
– MMU maps logical address dynamically.
• OS also needs a partition:
– Usually held in low memory with interrupt vector.
– User processes then held in high memory.
Hardware Support
Memory Allocation
• Can statically divide memory into several fixed-size partitions.
– Each process goes into one partition, and each
partition can only hold one process.
– Disadvantage?
• Can also use variable sized partitions.
– OS must keep a table of occupied parts of memory.
– An unoccupied chunk is called a hole.
– Memory consists of occupied partitions and holes of
various sizes.
– Holes of various size are scattered throughout memory.
– When a process arrives, it is placed into an input queue.
– OS makes periodic decisions about how to allocate a
partition to a process from the input queue.
– A hole that is too large will be split into a partition and a
new hole.
– When a process exits, a neighboring hole will be merged
with the newly created hole.
[Figure: four snapshots of memory over time: the OS plus processes 5, 8, and 2 resident; process 8 exits, leaving a hole; process 9 is allocated from the hole; then process 10 is allocated from what remains.]
Dynamic Storage-Allocation Problem
• How to satisfy a request of size n from a list of free
holes?
– First-fit: Allocate the first hole that is big enough.
• Fast.
– Best-fit: Allocate the smallest hole that is big enough;
must search entire list, unless ordered by size.
• Produces the smallest leftover hole.
– Worst-fit: Allocate the largest hole; must also search
entire list.
• Produces the largest leftover hole
• Which do you think works best?
– First-fit and best-fit better than worst-fit in terms of speed
and storage utilization.
• Input queue interaction also a factor. If there
is a hole of size N, should we?
1. Use it for the smallest job in the queue that fits?
2. Use it for the largest job in the queue that fits?
3. Use it only for the next job in the queue, if it
doesn’t fit, don’t allocate (FCFS).
• Which will guarantee highest number of jobs
per second?
Fragmentation
• External Fragmentation: Wasted space due to discontiguous holes.
– General rule of thumb, for a fully utilized memory, about 50% may be wasted.
– Should we allocate from top or bottom of hole?
• Internal Fragmentation: Allocated memory may be larger than actual
requested memory, due to allocation granularity.
– For example, if there is a hole of 2 bytes, no point in keeping track of it.
– Usually very small for variable sized partitions.
• External fragmentation can be addressed by compaction
– Shuffle memory contents to place all free memory together in one large block.
– Compaction is possible only if relocation is dynamic and is done at execution
time.
– I/O problem: What if I/O is going on?
• Latch job in memory while it is involved in I/O, or
• Do I/O only into OS buffers
PAGING
Paging
• Many of the issues seen so far are due to contiguity
requirements of the RAM. Let’s relax that.
• Divide physical memory into fixed-sized blocks called
frames, size is a power of 2.
• Divide logical memory into blocks of same size called
pages
• Keep track of all free frames.
• To run a program of size n pages, need to find n free
frames and load program.
• Set up a page table to translate logical to physical
addresses.
Addresses
• With paging, address generated by CPU is divided into:
– Page number (p): Used as an index into a page table which
contains base address of each page in physical memory.
– Page offset (d): Added to the base address to define the
physical memory address that is sent to the memory unit.
– Assume that there are m bits in the logical address, and that the page size is 2^n. Then the address splits as:
  | page number p (m − n bits) | page offset d (n bits) |
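In C, the split is just a shift and a mask; here n = 12 (4 KB pages):

  #include <stdint.h>
  #include <stdio.h>

  #define N 12   /* page size 2^12 = 4 KB */

  int main(void) {
      uint32_t addr = 0x12345;
      uint32_t p = addr >> N;               /* page number: 0x12 */
      uint32_t d = addr & ((1u << N) - 1);  /* page offset: 0x345 */
      printf("p = 0x%x, d = 0x%x\n", p, d);
      return 0;
  }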
Paging Hardware
Mapping Between Logical and Physical Memory
Example
• How big is page?
• Any external
fragmentation?
• Any internal
fragmentation?
• How big should a
page be?
[Figure: a 16-byte logical address space mapped onto a 32-byte physical memory.]
Free Frames
• When a new process arrives, it will request N
pages.
• N frames need to be found.
• How would you do this?
Before allocation
After allocation
Implementation of Page Table
• Page tables are typically done via a combination
of OS support and hardware support. Many
variations.
• Register scheme:
– Page table is kept in registers. Registers loaded during
context switch.
– PDP-11: Address space is 16 bits, page size is 8 KB → 13 bits. Page table needs ?? entries.
– Will this work if address space is 32 bits? How big
does the page table need to be?
• Another scheme, page table is kept in main memory:
– Page-table base register (PTBR) points to the page table.
– Page-table length register (PTLR) indicates size of the page
table.
– These registers are reloaded on context switch.
– How many memory accesses does a load instruction take?
– In this scheme every data/instruction access requires two
memory accesses. One for the page table and one for the
data/instruction.
• The two memory access problem can be solved by the
use of a special hardware cache called the translation
look-aside buffer (TLBs), which is a form of associative
memory.
Memory Protection
• Does memory need to be protected? Why?
• Should we associate the protection with the frame
(physical), or the page (logical)?
• Memory protection implemented by associating
protection bit with each page.
• Bits are added to the page table entry.
– These bits indicate things like read/write/execute.
• One important one is the valid-invalid:
– “Valid” indicates that the associated page is in the process’
logical address space, and is thus a legal page.
– “Invalid” indicates that the page is not in the process’
logical address space.
[Figure: the process only uses addresses up to 10,468, but its last page ends at 12,287.]
•
Consequence of internal fragmentation?
Sharing Pages
• Do we ever want to share memory between processes?
Why?
• Multiple processes executing the same program:
– One copy of read-only code shared among processes (i.e., text
editors, compilers, window systems).
• Shared libraries:
– Each process keeps a separate copy of the non-shared code and
data.
– The pages for the private code and data can appear anywhere in
the logical address space.
• IPC:
– Implementing messages, etc.
• How can this be done with paging?
STRUCTURE OF THE PAGE TABLE
SEGMENTATION
User’s View of a Program
• Normally, users
think of their
address space
conceptually.
• This conceptual
view is called
segmentation.
Segmentation
• Segmentation: Logical address space is divided into
multiple, distinct linear sequences.
– Each sequence is a segment.
– Each segment has a name (number) and a length.
• A logical address is a pair: <segment-number, offset>.
• A program is a collection of segments. Typical segments are:
  – Machine instructions (text)
  – Initialized data
  – Zero-initialized data (BSS)
  – Heap
  – Stack
  – Each shared library
Logical View of Segmentation
[Figure: segments 1–4 of user space mapped to non-contiguous regions of physical memory space.]
Segmentation Can Be Purely
Conceptual
• At the assembly and object code level,
segmentation is very common.
• However, it can be mapped to an OS and
hardware that doesn’t have it.
Segmentation Hardware
• Logical address consists of a pair:
<segment-number, offset>.
• Segment table maps to physical addresses; each table
entry has:
– base: contains the starting physical address where the
segments reside in memory
– limit: specifies the length of the segment
• Segment-table base register (STBR) points to the
segment table’s location in memory.
• Segment-table length register (STLR) indicates number
of segments used by a program;
– Segment number s is legal if s < STLR
• Protection: With each entry in segment table
associate:
– validation bit = 0  illegal segment
– read/write/execute privileges
• Protection bits associated with segments;
code sharing occurs at segment level
• Since segments vary in length, memory
allocation is a dynamic storage-allocation
problem
Segmentation and Paging
• Note that segmentation and paging can be
used together.
• What functionality does each provide?
• Without paging, each segment must still be
contiguous.
EXAMPLE: PENTIUM
Example: The Intel Pentium
• Supports both segmentation and segmentation
with paging
• CPU generates logical address
– Given to segmentation unit
• Which produces linear addresses
– Linear address given to paging unit
• Which generates physical address in main memory
• All of these together form the MMU
Pentium Segmentation
• Max 16K segments per process, each up to 4 GB.
• Segments divided into two partitions: those in the local descriptor table (LDT) (process specific), and those in the global descriptor table (GDT) (shared).
• Each entry in LDT or GDT is an 8-byte segment descriptor, which includes base and limit.
• Logical address is a pair <segment selector, offset>, where the selector is:
  | s (13 bits) | g (1 bit) | p (2 bits) |
  s is the segment number, g selects GDT or LDT, and p are protection bits.
• There are six segment registers, each holding a selector, along with six 8-byte internal registers to hold the corresponding descriptors.
• Summary: segment registers hold segment selectors, which are used to index into the LDT/GDT, which has the final segment description.
• Also some way for the CPU to determine which segment register should be used for a particular memory reference.
• Where does the selector come from?
• Not shown: segment limit is used to check the offset.
Pentium Paging
• The linear address is then sent to paging unit.
• Pages can be either 4 KB or 4 MB. For 4 KB pages, there is a
2-level paging scheme.
– Outer page table called the page directory. (CR3 register points
to it.)
– Page directory entry (PDE) points to page table.
– There is a bit in the PDE that indicates it is a 4 MB page.
• In this case, the PDE points to the page frame.
• Page tables (but not directories) can be swapped to disk.
There is a bit in the PDE that can be used to indicate that
the page table is swapped out.
FRAME ALLOCATION
Using the previous algorithms, how many frames
(physical memory) does each process get?
Each user?
Is this fair? Is this good?
• So far, the allocation of frames has been implicit.
It just all depends on the page replacement and
related algorithms.
– This is often not ideal.
• Frame allocation seeks to control how many
frames each process (or maybe user) gets.
– It seeks to make sure that a process gets the ideal
number of frames.
Minimum Number
Can a process proceed with just one frame
allocated to it?
• Each process needs minimum number of pages
• Example: IBM 370 – 6 pages to handle MOVE
instruction:
– instruction is 6 bytes, might span 2 pages
– 2 pages to handle from
– 2 pages to handle to
• Two major allocation schemes
– fixed allocation
– priority allocation
Fixed Allocation
• Equal allocation – For example, if there are 100 frames and 5 processes, give each process 20 frames.
• Proportional allocation – Allocate according to the size of the process:
  s_i = size of process p_i
  S = Σ s_i
  m = total number of frames
  a_i = allocation for p_i = (s_i / S) × m
• Example: m = 64, s_1 = 10, s_2 = 127:
  a_1 = (10 / 137) × 64 ≈ 5
  a_2 = (127 / 137) × 64 ≈ 59
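The same computation as a quick C check, rounding to the nearest frame:

  #include <stdio.h>

  int main(void) {
      int s[] = {10, 127}, m = 64, S = 0;
      int n = sizeof s / sizeof s[0];
      for (int i = 0; i < n; i++) S += s[i];
      for (int i = 0; i < n; i++)
          printf("a%d = %d\n", i + 1, (s[i] * m + S / 2) / S);  /* 5, 59 */
      return 0;
  }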
Priority Allocation
• Use a proportional allocation scheme
using priorities rather than size.
• If process Pi generates a page fault,
– select for replacement one of its frames
– select for replacement a frame from a
process with lower priority number
Global vs. Local Allocation
• Global replacement – process
selects a replacement frame from
the set of all frames; one process
can take a frame from another
• Local replacement – each process
selects from only its own set of
allocated frames
Non-Uniform Memory Access
• Large computers are often built from multiple
boards, connected by a fast interconnect.
• Each board has memory and CPUs.
• Still a single memory space.
• It is possible to access memory from different
board, but it is slower.
• In this case, you want to group threads and
memory.
– Note that cache effects are another reason to group
threads and memory.
Temporal vs. Spatial Locality
• Temporal: If something is used/accessed, likely
to be used/accessed again in the near future.
• Spatial: If something used/accessed, things
that are “nearby” are also likely to be
used/accessed in the near future.
• What does MM rely on?
Working-Set Model
• Δ ≡ working-set window ≡ a fixed number of page references
  Example: 10,000 references (or time units)
• WSS_i (working set of Process P_i) = total number of pages referenced in the most recent Δ
  – if Δ too small, will not encompass entire locality
  – if Δ too large, will encompass several localities
  – if Δ = ∞, will encompass entire program
• D = Σ WSS_i ≡ total demand frames
• if D > m → Thrashing
• Policy: if D > m, then suspend one of the processes
[Figure: working-set model.]
Keeping Track of the Working Set
• How can we implement this?
• Approximate with interval timer + a reference bit
• Example: Δ = 10,000
  – Timer interrupts after every 5,000 time units
  – Keep in memory 2 bits for each page
  – Whenever the timer interrupts, copy the reference bits into the in-memory bits and set all reference bits to 0
  – If one of the bits in memory = 1 → page in working set
• Why is this not completely accurate?
• Improvement: 10 bits and interrupt every 1,000 time units
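A sketch of this approximation, assuming 2 history bits per page and a hardware-maintained reference bit (layout hypothetical):

  #include <stdint.h>

  #define NPAGES 256

  static uint8_t ref_bit[NPAGES];    /* set by hardware */
  static uint8_t hist[NPAGES][2];    /* last two 5,000-unit intervals */

  /* Run on each timer interrupt (every 5,000 time units). */
  void timer_tick(void) {
      for (int i = 0; i < NPAGES; i++) {
          hist[i][1] = hist[i][0];
          hist[i][0] = ref_bit[i];
          ref_bit[i] = 0;
      }
  }

  int in_working_set(int page) {
      return ref_bit[page] || hist[page][0] || hist[page][1];
  }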
Working Sets and Page Fault Rates
Page Fault Frequency Scheme
What are we really concerned about?
• The actual problem is caused by page faults.
• Trying to determine the working set is one
level above that.
• Another tactic is to directly monitor the page
faults.
– If too many, we give the process more frames.
– If too few, we take away a few frames.
• Establish “acceptable” page-fault rate for a
process
– If actual rate too low, process loses frame
– If actual rate too high, process gains frame
• What could go wrong with PFF?
– A process with poor locality would get more
frames than it really deserves.
Backing Store
• When a frame needs to be reclaimed, if the page it is holding has
been changed, it needs to be written back to disk.
How is this space on disk organized?
• The paging space can be a file on a regular file system, or it can be a
dedicated partition.
– Partition is faster, theoretically.
– File is more flexible.
Does it need to be a magnetic hard drive?
• We can generalize the disk by recognizing that it need not be an
actual drive, but just some linear sequence of blocks.
– It could be flash.
– It could be a remote network device of some kind.
– It could be any old file, not a special paging file. This case is known as
memory-mapped files.
Memory-Mapped Files
When you use the write() system call, how many times is the data
copied?
• Typically, with the write() or read() system call, the data is copied
to/from a kernel buffer first, then to the drive.
• Memory-mapped file I/O allows file I/O to be treated as routine
memory access by mapping a disk block to a page in memory.
• A file is initially read using demand paging. A page-sized portion of
the file is read from the file system into a physical page. Subsequent
reads/writes to/from the file are treated as ordinary memory
accesses.
• Simplifies file access by treating file I/O through memory rather
than read()/write() system calls.
• Also allows several processes to map the same file allowing the
pages in memory to be shared.
• This approach to file I/O is generally clean and elegant, and in modern OSes, it is common to actually map the file internally anyway, even when using read/write system calls.
• You can also use mmap() to share memory. In
fact, this is the preferred way.
• You can mmap() a local file in shared mode. (Don’t mmap() an NFS-mounted file.)
• You can create anonymous mappings that are
shared via parent-child relationship.
• [Show mmap/]
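Not the instructor's mmap/ demo, but a minimal sketch of mapping a local file shared and storing through it; the file name is made up, and the file is assumed to be non-empty:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDWR);   /* hypothetical, non-empty file */
      if (fd < 0) { perror("open"); return 1; }
      struct stat st;
      fstat(fd, &st);
      char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }
      p[0] = 'X';                      /* an ordinary store dirties the page */
      msync(p, st.st_size, MS_SYNC);   /* force write-back now... */
      munmap(p, st.st_size);           /* ...or let the kernel do it later */
      close(fd);
      return 0;
  }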
• Let’s say that you use write(). When does
the data get written to disk?
• Let’s say that you use mmap(). When does
the data get written to disk?
Memory Mapped I/O
• At the assembly language level, how do you
actually communicate with a hardware device,
such as a disk drive, graphics card, etc.?
• A CPU only has limited ways to affect the
world.
– Write to special I/O ports.
– Write to memory that is being monitored by a
hardware device. The latter is known as memory
mapped I/O.
Allocating Kernel Memory
• Treated differently from user memory
• Often allocated from a free-memory pool
– Kernel requests memory for structures of varying
sizes
– Some kernel memory needs to be contiguous.
• Kernel memory allocation is also closely
related to general memory allocation.
• What is a memory allocator?
• It manages a contiguous chunk of linear address
space.
• When a request is made, it needs to find an unused block of the given size.
• When a block is freed, it needs to return the
block to the free memory pool.
– If the freed block is adjacent to a previously freed
block, the two blocks should ideally be merged.
(Why?)
• Memory allocation is often a huge bottleneck!
– Especially in multithreaded programs.
• What are criteria used to judge?
Buddy System
• Allocates memory from fixed-size segment
consisting of physically-contiguous pages
• Memory allocated using power-of-2 allocator
– Satisfies requests in units sized as power of 2
– Request rounded up to next highest power of 2
– When smaller allocation needed than is available,
current chunk split into two buddies of next-lower
power of 2
• Continue until appropriate sized chunk available
Buddy System Allocator
• Algorithm:
– malloc():
• Maintain a free list for each power of two.
• Round up to the next power of two.
• Find a block on that list.
– If found, return it.
• Otherwise, use a bigger block and split.
– free():
• Find the block’s buddy.
• Is it free? Then merge.
• Put (merged) block on free list.
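The heart of the buddy system is finding a block's buddy, which is a single XOR; a small sketch:

  #include <stdint.h>
  #include <stdio.h>

  /* For a block of size 2^k at offset off within the managed region,
     its buddy sits at off XOR 2^k. */
  static uint32_t buddy_of(uint32_t off, uint32_t k) {
      return off ^ (1u << k);
  }

  int main(void) {
      printf("%u\n", buddy_of(512, 8));   /* 256-byte block at 512 -> 768 */
      printf("%u\n", buddy_of(768, 8));   /* and vice versa -> 512 */
      return 0;
  }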
Slab Allocator
• Alternate strategy.
• Slab is one or more physically contiguous pages.
• Cache consists of one or more slabs.
• Single cache for each unique kernel data structure.
  – Each cache filled with objects – instantiations of the data structure
• When cache created, filled with objects marked as free
• When structures stored, objects marked as used
• If slab is full of used objects, next object allocated from empty slab
  – If no empty slabs, new slab allocated
• Benefits include no fragmentation, fast memory request satisfaction
• Any drawbacks?
Other Issues -- Prepaging
• Prepaging
– To reduce the large number of page faults that
occurs at process startup
– Prepage all or some of the pages a process will
need, before they are referenced
• For example, maybe when restarting a process.
– But if prepaged pages are unused, I/O and
memory was wasted
– Assume s pages are prepaged and α fraction of
the pages is used
• Is the cost of s × α saved page faults > or < the cost of prepaging s × (1 − α) unnecessary pages?
• α near zero → prepaging loses
Other Issues – Page Size
• Page size selection must take into
consideration:
– fragmentation
– table size
– I/O overhead
– locality
Other Issues – TLB Reach
• TLB Reach - The amount of memory accessible
from the TLB
• TLB Reach = (TLB Size) X (Page Size)
• Ideally, the working set of each process is stored
in the TLB
– Otherwise there is a high degree of page faults
• Increase the Page Size
– This may lead to an increase in fragmentation as
not all applications require a large page size
• Provide Multiple Page Sizes
– This allows applications that require larger page
sizes the opportunity to use them without an
increase in fragmentation
Instruction Organization
• What about the actual machine code for a
process?
• How should it be organized?
• Functions that are used together (belong to
the working set in the same stage) should be
grouped together.