Lecture 12 Virtual Memory

Peng Liu
[email protected]
1
Last time in Lecture 11
• Cache
– The memory system has a significant effect on
program execution time.
– The number of memory-stall cycles depends on both
the miss rate and the miss penalty.
» To reduce the miss rate, use associative placement schemes
» To reduce the miss penalty, allow a larger secondary cache to
handle misses to the primary cache
2
Associative Cache Example
Review
3
Direct-Mapped Cache
Review
[Figure: direct-mapped cache lookup. The address splits into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines, the stored tag is compared against the address tag to signal HIT, and the offset selects the data word or byte from the data block]
4
2-Way Set-Associative Cache Review
[Figure: 2-way set-associative cache lookup. The index (k bits) selects one set across both ways; the two stored tags are compared in parallel against the address tag (t bits), and the matching way supplies the data word or byte on a HIT]
5
Fully Associative Cache
Review
[Figure: fully associative cache lookup. There is no index; the address tag (t bits) is compared against every line's tag in parallel, and the block offset (b bits) selects the data word or byte on a HIT]
6
Tag & Index with Set-Associative Caches
Review
• Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative
– Which bits of the address are the tag or the index?
– m least significant bits are byte select within the block
• Basic idea
– The cache contains 2^n/2^m = 2^(n-m) blocks
– Each cache way contains 2^(n-m)/2^a = 2^(n-m-a) blocks
– Cache index: (n-m-a) bits after the byte select
• Same index used with all cache ways …
• Observation
– For fixed size, length of tags increases with the associativity
– Associative caches incur more overhead for tags
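As a concrete sketch, the field extraction can be written with shifts and masks; the parameter values below are assumed for illustration only:

    #include <stdint.h>

    /* Example parameters: a 2^15-byte (32 KB) cache with 2^6-byte (64 B)
       blocks, 2^2 = 4-way set-associative. */
    enum { N = 15, M = 6, A = 2 };
    enum { INDEX_BITS = N - M - A };   /* n - m - a = 7 index bits */

    uint32_t byte_select(uint32_t addr) { return addr & ((1u << M) - 1); }
    uint32_t set_index(uint32_t addr)   { return (addr >> M) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag_bits(uint32_t addr)    { return addr >> (M + INDEX_BITS); }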
7
Replacement Methods
Review
• Which line do you replace on a miss?
• Direct Mapped
– Easy, you have only one choice
– Replace the line at the index you need
• N-way Set Associative
– Need to choose which way to replace
– Random (choose one at random)
– Least Recently Used (LRU) (the one used least recently)
• True LRU is often difficult to implement, so approximations are used; what they really track is "not recently used"
8
Replacement Policy
Review
In an associative cache, which block from a set
should be evicted when the set becomes full?
• Random
• Least Recently Used (LRU)
– LRU cache state must be updated on every access
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU binary tree often used for 4-8 way (see the sketch below)
• First In, First Out (FIFO) a.k.a. Round-Robin
– Used in highly associative caches
This is a second-order effect. Why?
Replacement only happens on misses
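A hedged sketch of the pseudo-LRU binary tree mentioned above, for a single 4-way set (3 tree bits; the bit convention chosen here is one common option, not the only one):

    #include <stdint.h>

    /* Tree bits for one set: bit0 = root (left pair vs. right pair),
       bit1 = within ways 0-1, bit2 = within ways 2-3.
       Convention: each bit points toward the less-recently-used side. */
    static uint8_t plru;   /* a real cache keeps 3 such bits per set */

    void plru_touch(int way) {                 /* call on every hit or fill */
        if (way < 2) {
            plru |= 1;                         /* root: right pair is now older */
            if (way == 0) plru |= 2; else plru &= ~2;
        } else {
            plru &= ~1;                        /* root: left pair is now older */
            if (way == 2) plru |= 4; else plru &= ~4;
        }
    }

    int plru_victim(void) {                    /* follow the bits to the LRU leaf */
        if (plru & 1) return (plru & 4) ? 3 : 2;
        else          return (plru & 2) ? 1 : 0;
    }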
9
Causes for Cache Misses
Review
• Compulsory: first-reference to a block a.k.a. cold
start misses
- misses that would occur even with infinite cache
• Capacity: cache is too small to hold all data needed
by the program
- misses that would occur even under perfect
replacement policy
• Conflict: misses that occur because of collisions
due to block-placement strategy
- misses that would not occur with full associativity
10
Write Policy Choices
Review
• Cache hit:
– write through: write both cache & memory
• generally higher traffic but simplifies cache coherence
– write back: write cache only
(memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
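A minimal sketch of the write-back + write-allocate combination on the write path; the structure and helper names are illustrative placeholders, not a real cache's interface:

    struct line { int valid, dirty; unsigned tag; unsigned char data[64]; };

    extern void writeback_block(struct line *l);            /* assumed helper */
    extern void fetch_block(struct line *l, unsigned tag);  /* assumed helper:
                                                               sets valid/tag */

    void cache_write_byte(struct line *l, unsigned tag, unsigned off,
                          unsigned char b) {
        if (!(l->valid && l->tag == tag)) {   /* write miss */
            if (l->valid && l->dirty)
                writeback_block(l);           /* dirty victim goes to memory first */
            fetch_block(l, tag);              /* write allocate: fetch on write */
        }
        l->data[off] = b;                     /* write the cache only; */
        l->dirty = 1;                         /* memory is updated at eviction */
    }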
11
Cache Design: Datapath + Control
Review
Most design errors come from incorrect specification of state
machine behavior!
Common bugs: Stalls, Block replacement, Write buffer
[Figure: cache datapath and control. A control state machine drives the tag and data-block arrays; one Addr/Din/Dout interface faces the CPU and another faces the lower-level memory]
12
Virtual Memory
13
Motivation #1: Large Address Space for
Each Executing Program
• Each program thinks it has a ~2^32-byte address space of its own
– May not use it all though
• Available main memory may be much smaller
[Figure: process address space from 0x0000_0000 to 0xFFFF_FFFF. Kernel (OS) memory (code, data, heap, stack) above 0x8000_0000, invisible to user code; user stack (created at runtime) growing down from $sp (stack pointer); run-time heap (managed by malloc) growing up from brk; read/write segment (.data, .bss) at 0x1000_0000; read-only segment (.text) at 0x0040_0000, loaded from the executable file; unused space at the bottom]
14
Motivation #2: Memory Management for
Multiple Programs
• At any point in time, a computer may be running multiple programs
– E.g., Firefox + Foxmail
– Virtual machine (Mac OS + Windows)
• Questions:
– How do we share memory between multiple programs?
– How do we avoid address conflicts?
– How do we protect programs?
• Isolation and selective sharing
[Figure: process memory layout. Kernel virtual memory (invisible to user code); stack growing down from $sp; runtime heap (via malloc) growing up from the "brk" pointer; uninitialized data (.bss); initialized data (.data); program text (.text); forbidden region at the bottom]
15
Virtual Memory in a Nutshell
• Use hard disk (or Flash) as a large storage for data of all programs
– Main memory (DRAM) is a cache for the disk
– Managed jointly by hardware and the operating system (OS)
• Each running program has its own virtual address space
– Address space as shown in previous figure
– Protected from other programs
• Frequently-used portions of virtual address space copied to DRAM
– DRAM = physical address space
– Hardware + OS translate virtual addresses (VA) used by
program to physical addresses (PA) used by the hardware
– Translation enables relocation (DRAM ↔ disk) & protection
16
Reminder: Memory Hierarchy
Everything is a Cache for Something Else
Level (typical)      Access time        Capacity   Managed by
Registers            1 cycle            ~500B      Software/compiler
L1 cache             1-3 cycles         ~64KB      Hardware
L2 cache             5-10 cycles        1-10MB     Hardware
Main memory (DRAM)   ~100 cycles        ~10GB      Software/OS
Disk                 10^6-10^7 cycles   ~100GB     Software/OS
17
DRAM vs. SRAM as a “Cache”
• DRAM vs. disk is more extreme than SRAM vs. DRAM
– Access latencies:
• DRAM ~10X slower than SRAM
• Disk ~100000X slower than DRAM
– Importance of exploiting spatial locality
• First byte is ~100,000X slower than successive bytes on disk
• vs. ~4X improvement for page-mode vs. regular accesses to DRAM
[Figure: SRAM, DRAM, and disk shown as successive levels of the hierarchy]
18
Impact of These Properties on Design
• Bottom line:
– Design decision made for virtual memory driven by
enormous cost of misses (disk accesses)
• Consider the following parameters for DRAM as a “cache”
for the disk
– Line size?
• Large, since disk is better at transferring large blocks, and this
minimizes miss rate
– Associativity?
• High, to minimize miss rate
– Write through or write back?
• Write back, since can’t afford to perform small writes to disk
19
Terminology for Virtual Memory
• Virtual memory uses DRAM as a cache for disk
• New terms
– VM block is called a "page"
• The unit of data moving between disk and DRAM
• It is typically larger than a cache block (e.g., 1KB, 4KB, or 16KB)
• Virtual and physical address spaces can be divided into virtual
pages and physical pages (e.g., contiguous chunks of 4KB)
– VM miss is called a “page fault”
• More on this later
20
Locating an Object in a “Cache”
• SRAM Cache (L1, L2, etc)
– Tag stored with cache line
– Maps from cache block to a memory address
– No tag for blocks not in cache
• If not in cache, then it is in main memory
– Hardware retrieves and manages tag information
• Can quickly match against multiple tags
21
Locating an Object in a “Cache” (cont.)
• DRAM "Cache" (virtual memory)
– Each allocated page of virtual memory has entry in page table
– Mapping from virtual pages to physical pages
• One entry per page in the virtual address space
– Page table entry even if page not in memory
• Specifies disk address
– OS retrieves and manages page table information
[Figure: unlike a cache, which stores a tag only with each resident block (e.g., tag X with data 243), the page table records a location for every object name: D → memory location 0, J → on disk, X → memory location 1]
22
A System with Physical Memory Only
• Examples:
– Most Cray machines, early PCs, nearly all embedded
systems, etc
• Addresses generated by the CPU point directly to bytes in physical
memory
23
A System with Virtual Memory
• Examples:
– Workstations, servers, modern PCs, etc.
[Figure: the CPU issues virtual addresses 0 to N-1; a page table maps each virtual page either to a physical address (0 to P-1) in memory or to a location on disk]
Address Translation: Hardware converts virtual addresses to
physical addresses via an OS-managed lookup table (page
table)
24
Page Faults (Similar to “Cache Misses”)
• What if an object is on disk rather than in memory?
– Page table entry indicates virtual address not in memory
– OS exception handler invoked to move data from disk into memory
• OS has full control over placement
• Full-associativity to minimize future misses
[Figure: before the fault, the page table entry for the referenced virtual address points to disk; after the fault, the OS has copied the page from disk into memory and the entry points to a physical address]
25
Does VM Satisfy Original Motivations?
• Multiple active programs can share physical address space
• Address conflicts are resolved
– All programs think their code is at 0x400000
• Data from different programs can be protected
• Programs can share data or code when desired
[Figure: the per-process address-space layout shown earlier. Kernel (OS) memory above 0x8000_0000; user stack at $sp; run-time heap at brk; read/write segment (.data, .bss); read-only segment (.text) at 0x0040_0000, loaded from the executable file]
26
Answer: Yes, Using Separate Addresses
Spaces Per Program
• Each program has its own virtual address space and own page table
– Address 0x400000 from different programs can map to
different locations or to the same location, as desired
– The OS controls how virtual pages are assigned to physical
memory
27
Bare Machine
[Figure: five-stage pipeline (PC → Inst. Cache → Decode → Execute → Data Cache → Writeback) in which the PC, instruction fetch, and data access all use physical addresses directly, going through the memory controller to main memory (DRAM)]
• In a bare machine, the only kind of address is a
physical address
28
Dynamic Address Translation
Motivation
In early machines, I/O operations were slow and each word
transferred involved the CPU. Higher throughput was possible if the
CPU and I/O of two or more programs were overlapped. How?
Multiprogramming with DMA I/O devices and interrupts.
• Location-independent programs: programming and storage
management ease ⇒ need for a base register
• Protection: independent programs should not affect each other
inadvertently ⇒ need for a bound register
• Multiprogramming drives the requirement for resident supervisor
software to manage context switches between multiple programs
[Figure: physical memory partitioned between the OS and programs prog1 and prog2]
29
Translation: High-level View
• Fixed-size pages
• A physical page is sometimes called a frame
30
Translation: Process
31
Address Translation & Protection
Virtual Address = Virtual Page No. (VPN) | offset
[Figure: the VPN goes through a protection check (against kernel/user mode and read/write permissions) and address translation; the result is either an exception or the Physical Page No. (PPN), which is concatenated with the unchanged offset to form the physical address]
• Every instruction and data access needs address translation and
protection checks
• A good VM design needs to be fast (~ one cycle) and space efficient
32
Translation Process Explained
• Valid page
– Check access rights (R,W,X) against access type
• Generate physical address if allowed
• Generate a protection fault (exception) if illegal access
• Invalid page
– Page is not currently mapped and a page fault is generated
• Faults are handled by the operating system
– Sometimes due to a program error => program terminated
• E.g., accessing out of the bounds of an array
– Sometimes due to “caching” => refill & restart
• Desired data or code available on disk
• Space allocated in DRAM, page copied from disk, page table updated
• Replacement may be needed
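The decision process above can be summarized in a short sketch; the PTE layout, field names, and the two fault helpers are illustrative, not from any particular ISA:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool valid;                        /* page currently mapped in DRAM? */
        bool readable, writable, executable;
        uint32_t ppn;                      /* physical page number */
    } pte_t;

    enum access { READ, WRITE, EXEC };
    enum { PAGE_BITS = 12 };               /* 4 KB pages */

    extern void raise_page_fault(uint32_t va);        /* assumed OS entry points */
    extern void raise_protection_fault(uint32_t va);

    uint32_t translate(const pte_t *page_table, uint32_t va, enum access a) {
        uint32_t vpn    = va >> PAGE_BITS;
        uint32_t offset = va & ((1u << PAGE_BITS) - 1);
        pte_t pte = page_table[vpn];

        if (!pte.valid)
            raise_page_fault(va);          /* OS refills from disk, or terminates */
        if ((a == READ  && !pte.readable) ||
            (a == WRITE && !pte.writable) ||
            (a == EXEC  && !pte.executable))
            raise_protection_fault(va);    /* illegal access: exception */

        return (pte.ppn << PAGE_BITS) | offset;
    }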
33
VM: Replacement and Writes
• To reduce page fault rate, OS uses least-recently used
(LRU) replacement
– Reference bit (aka use bit) in PTE set to 1 on access to page
– Periodically cleared to 0 by OS
– A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
– Block at once, not individual locations
– Write-through is impractical
– Use write-back
– Dirty bit in PTE set when page is written
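A small sketch of how the reference bit approximates LRU; the data structures are illustrative, and a real OS keeps this state per physical frame:

    #define NFRAMES 1024
    struct frame { int ref, dirty; } frames[NFRAMES];

    void periodic_clear(void) {            /* run periodically by the OS */
        for (int i = 0; i < NFRAMES; i++)
            frames[i].ref = 0;             /* hardware sets ref again on access */
    }

    int pick_victim(void) {                /* prefer frames not used recently */
        for (int i = 0; i < NFRAMES; i++)
            if (!frames[i].ref)
                return i;
        return 0;                          /* everything referenced: pick any */
    }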
34
Fast Translation Using a TLB
• Address translation would appear to require extra
memory references
– One to access the PTE
– Then the actual memory access
• But access to page tables has good locality
– So use a fast hardware cache of PTEs within the processor
– Called a Translation Look-aside Buffer (TLB)
– Typical:
• 16-512 PTEs, 0.5-1 cycle for hit
• 10-100 cycles for miss, 0.01%-1% miss rate
• Misses could be handled by hardware or software
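As a sketch, a fully associative TLB lookup behaves like this (sizes and fields are illustrative; real hardware compares all tags in parallel rather than in a loop):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64
    enum { PG_BITS = 12 };

    struct tlb_entry { bool valid; uint32_t vpn, ppn; } tlb[TLB_ENTRIES];

    bool tlb_lookup(uint32_t va, uint32_t *pa) {
        uint32_t vpn = va >> PG_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {      /* TLB hit */
                *pa = (tlb[i].ppn << PG_BITS) | (va & ((1u << PG_BITS) - 1));
                return true;               /* no page-table access needed */
            }
        }
        return false;                      /* miss: walk the page table, refill */
    }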
36
Fast Translation Using a TLB
37
Translation Lookaside Buffers (TLB)
Address translation is very expensive!
In a two-level page table, each reference
becomes several memory accesses
Solution: Cache translations in TLB
TLB hit ⇒ Single-Cycle Translation
TLB miss ⇒ Page-Table Walk to refill
[Figure: each TLB entry holds V/R/W/D bits, a tag, and a PPN; the VPN (virtual page number) of the virtual address is matched against the tags, and on a hit the PPN (physical page number) is concatenated with the page offset to form the physical address]
38
TLB Entries
• The TLB is a cache for page table entries (PTE)
• The data for a TLB entry ( == a PTE entry)
– Physical page number (frame #)
– Access rights (R/W bits)
– Any other PTE information (dirty bit, LRU info, etc)
• The tags for a TLB entry
– Virtual page number
• Portion of it not used for indexing into the TLB
– Valid bit
– LRU bits
• If TLB is associative and LRU replacement is used
39
TLB Misses
• If page is in memory
– Load the PTE from memory and retry
– Could be handled in hardware
• Can get complex for more complicated page table structures
– Or in software
• Raise a special exception, with optimized handler
• This is what MIPS does using a special vectored interrupt
• If page is not in memory (page fault)
– OS handles fetching the page and updating the page table
– Then restart the faulting instruction
40
TLB & Memory Hierarchies
• Once the address is translated, it is used to access the memory
hierarchy
– A hierarchy of caches (L1, L2, etc)
41
TLB and Cache Interaction
• Basic process
– Use TLB to get PA
– Use PA to access caches and DRAM
• Question: can you ever access the TLB and the cache in parallel?
42
Page-Based Virtual-Memory Machine
(Hardware Page-Table Walk)
[Figure: pipeline (PC → Inst. TLB → Inst. Cache → Decode → Execute → Data TLB → Data Cache → Writeback). Each TLB translates a virtual address to a physical address and can raise a page fault or protection violation; on a TLB miss, a hardware page-table walker uses the Page-Table Base Register and physical addresses through the memory controller to main memory (DRAM)]
• Assumes page tables held in untranslated physical memory
43
Virtual Memory Summary
• Use hard disk (or Flash) as large storage for data of all programs
– Main memory (DRAM) is a cache for the disk
– Managed jointly by hardware and the operating system (OS)
• Each running program has its own virtual address space
– Address space as shown in previous figure
– Protected from other programs
• Frequently-used portions of virtual address space copied to DRAM
– DRAM = physical address space
– Hardware + OS translate virtual addresses (VA) used by
program to physical addresses (PA) used by the hardware
– Translation enables relocation & protection
44
Page Table Issues & Solutions
• Page table access latency
– Use TLB to cache page table entries close to processor
– Limit size of L1 cache to overlap TLB & cache accesses
• TLB coverage
– Larger pages
– Multi-level TLBs
• Page table size
– Multi-level page tables
45
Architectural Support for Operating Systems (OS)
• Operating system (OS)
– Manages hardware resources
• Processor, main memory (DRAM), I/O devices
– Provides virtualization
• Each program thinks it has exclusive access to resources
– Provides protection, isolation, and sharing
• Between user programs and between programs and OS
• Examples of operating systems
– Windows (XP, Vista, Win7), MacOS, Linux, BSD Unix, Solaris..
– Symbian, Windows Mobile, Android
46
Processes
• Definition: a process is an instance of a running program
• Process provides each program with two key abstractions
– Logical control flow
• Illusion of exclusive use of the processor
• Private set of register values (including PC)
– Private address space
• Illusion of exclusive use of main memory of "infinite" size
• How are these illusions maintained?
– Process execution is interleaved (multitasking)
• On available processors
– Address space is managed by virtual memory system
47
Execution of Processes
• Each process has its own logical control flow
– A & B are concurrent processes; B & C are sequential
– Control flows for concurrent processes are physically
disjoint in time
– However, we can think of concurrent processes as running in
parallel
48
Process Address Space
• Each process has its own private address space
• User processes cannot access the top region of memory
– Used by the OS kernel
– The kernel is the shared core of the OS
[Figure: the same address-space layout as before. Kernel (OS) memory above 0x8000_0000, invisible to user code; user stack created at runtime below $sp; run-time heap managed by malloc above brk; read/write segment (.data, .bss) at 0x1000_0000; read-only segment (.text) at 0x0040_0000, loaded from the executable file; unused space below]
49
Supporting Concurrent Processes
through Context Switching
• Processes are managed by the OS kernel
– Important: the kernel is not a separate process, but rather
runs as part of some user process
• Control flow passes from one process to another via a
context switch
• Time between context switches is typically 10-20 ms
50
HW Support for the OS
• Mechanisms to protect OS from processes
– Modes + virtual memory
• Mechanisms to switch control flow between the OS and
processes
– System calls + exceptions
• Mechanisms to protect processes from each other
– Virtual memory
• Mechanisms to interact with I/O devices
– Primarily memory-mapped I/O
51
Hardware Modes
(Protecting OS from Processes)
• 2 modes are needed, but some architectures have more
• User mode: used to run user processes
– Accessible instructions: user portion of the ISA
• The instructions we have seen so far
– Accessible state: user registers and memory
• But virtual memory translation is always on
• Cannot access EPC, Cause, … registers
– Exceptions and interrupts are always on
• Kernel mode: used to run (most of) the kernel
– Accessible instructions: the whole ISA
• User + privileged instructions
– Accessible state: everything
– Virtual memory, exceptions, and interrupts may be turned off
52
Altering the Control Flow: System Calls
• So far, we have branches and jumps
– They implement control flows within a program
• Expected switch to the OS: system call instruction (syscall)
– A jump (or function call) to OS code
• E.g., in order to perform a disk access
• Also similar to a program-initiated exception
– Switches processor to kernel mode & disables interrupts
– Jump to a pre-defined place in the kernel
• Returning from a syscall: use the eret instruction (privileged)
– Switch to user mode & enable interrupts
– Jump to program counter indicated by EPC
53
Altering Flow Control: Exceptions & Interrupts
• Exceptions & interrupts implement unexpected switch to
OS
– Exceptions: due to a problem in the program
• E.g., divide by zero, illegal instructions, page fault
– Interrupts: due to an external event
• I/O event, hitting ctrl+c, periodic timer interrupt
• Exceptions & interrupts operate similarly to system calls
– Set EPC & Cause registers
– Switch to kernel mode, turn off interrupts
– Jump to predefined part of the OS
– Return to user program using eret instruction
54
Dealing with Switches to the OS
• A syscall, exception, or interrupt transfers control to the
OS kernel
• Kernel action (starting in the exception handling code)
– Examine issue & correct if possible
– Return to the same process or to another process
55
A Simple Exception Handler
• Can't survive nested exceptions
– Since it does not re-enable interrupts
• Does not use any user registers
– No need to save them
56
Exception Example #1
• Memory Reference
– User writes to memory location
– That virtual page is currently on disk
– OS must load page in memory
• Update page table & TLB
– Return to faulting instruction
– Successful on second try
57
Exception Example #2
• Memory Reference
– User writes to memory location
– Address is not valid or wrong access rights
– OS decides to terminate the process
• Sends SIGSEGV signal to user process
• User process exits with “segmentation fault”
– OS switches to another active process
58
Using Virtual Memory & Exceptions: Page
Sharing
• Example: Unix fork() system call
– Creates a 2nd process which is a clone of the current one
– How can we avoid copying all of the address space?
• Expensive in time and space
• Idea: share pages in physical memory
– Each process gets a separate page table (separate virtual
space)
– But both PTs point to the same physical pages
– Saves time and space on fork()
• Same approach to support any type of sharing
– Two processes running same program
– Two processes that share data
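A minimal illustration of fork() from user code; with page sharing the kernel only copies page tables (typically marking the pages copy-on-write) instead of copying every byte:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int global = 42;                       /* lives in a shared/copied page */

    int main(void) {
        pid_t pid = fork();                /* clone the address space */
        if (pid == 0)
            printf("child  sees global = %d\n", global);
        else
            printf("parent sees global = %d\n", global);
        return 0;                          /* both print 42 without a full copy */
    }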
59
Page Sharing
60
Memory Fragmentation
[Figure: three snapshots of physical memory under contiguous allocation. Initially: OS space, user 1 (16K), user 2 (24K), user 3 (32K), and 24K free. Users 4 & 5 arrive: the free space is split into user 4 (16K), an 8K hole, and user 5 (24K). Users 2 & 5 leave: holes of 24K, 8K, and 24K are scattered between user 1, user 4, and user 3]
As users come and go, the storage is “fragmented”.
Therefore, at some stage programs have to be moved
around to compact the storage.
67
Paged Memory Systems
• Processor-generated address can be split into: page number | offset
• A Page Table contains the physical address at the start of each
page
[Figure: pages 0-3 of User-1's address space are scattered across physical memory; the Page Table of User-1 maps each page number to its physical page]
Page tables make it possible to store the pages of a
program non-contiguously.
68
Private Address Space per User
[Figure: users 1, 2, and 3 each have their own page table mapping the same virtual address VA1 into different pages of physical memory, alongside OS pages and free pages]
• Each user has a page table
• Page table contains an entry for each user page
69
Where Should Page Tables Reside?
• Space required by the page tables (PT) is
proportional to the address space, number of
users, ...
⇒ Too large to keep in registers
• Idea: Keep PTs in the main memory
– needs one reference to retrieve the page base address
and another to access the data word
⇒ doubles the number of memory references!
70
Page Tables in Physical Memory
[Figure: the page tables of User 1 and User 2 themselves live in physical memory, alongside the users' data pages; VA1 of each user's virtual address space maps to a different physical page]
71
Linear Page Table
• Page Table Entry (PTE)
contains:
– A bit to indicate if a page exists
– PPN (physical page number) for
a memory-resident page
– DPN (disk page number) for a
page on the disk
– Status bits for protection and
usage
• OS sets the Page Table Base Register whenever the
active user process changes
[Figure: the PT Base Register (a supervisor-accessible control register inside the CPU) points to the linear page table; the VPN of the virtual address (from the CPU execute stage) indexes a PTE, which holds either a PPN (memory-resident page) or a DPN (page on disk); the PPN plus the offset locates the data word in the data pages]
77
Size of Linear Page Table
(Textbook page 424)
• With 32-bit addresses, 4-KB pages & 4-byte
PTEs:
– 2^20 PTEs, i.e., 4 MB page table per user
– 4 GB of swap needed to back up full virtual address
space
• Larger pages?
– Internal fragmentation (not all memory in a page is used)
– Larger page fault penalty (more time to read from disk)
• What about a 64-bit virtual address space???
– Even 1 MB pages would require 2^44 8-byte PTEs (2^47 bytes ≈ 128 TB!)
What is the “saving grace” ?
78
Hierarchical Page Table
[Figure: the 32-bit virtual address splits into a 10-bit L1 index p1 (bits 31-22), a 10-bit L2 index p2 (bits 21-12), and a 12-bit offset (bits 11-0). A processor register holds the root of the current page table; p1 indexes the Level 1 page table to find a Level 2 page table, and p2 indexes that to reach the data page. Pages and PTEs may be in primary memory, in secondary memory, or nonexistent]
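A sketch of the two-level walk in the figure; the PTE encoding is illustrative (PPN in the high bits, valid in bit 0), and the fault path is an assumed helper:

    #include <stdint.h>

    typedef uint32_t pte_t;
    extern pte_t *root;                        /* Level 1 table; base in a CPU register */
    extern uint32_t page_fault(uint32_t va);   /* assumed fault path */

    uint32_t walk(uint32_t va) {
        uint32_t p1     = (va >> 22) & 0x3FF;  /* 10-bit L1 index */
        uint32_t p2     = (va >> 12) & 0x3FF;  /* 10-bit L2 index */
        uint32_t offset =  va        & 0xFFF;  /* 12-bit page offset */

        pte_t l1 = root[p1];
        if (!(l1 & 1)) return page_fault(va);  /* no Level 2 table present */
        pte_t *l2 = (pte_t *)(uintptr_t)(l1 & ~0xFFFu);
        pte_t pte = l2[p2];
        if (!(pte & 1)) return page_fault(va); /* page on disk or nonexistent */

        return (pte & ~0xFFFu) | offset;       /* PPN concatenated with offset */
    }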
79
Two-Level Page Tables in Physical Memory
[Figure: the Level 1 page tables of User 1 and User 2, a Level 2 page table of User 2, and the data pages (User1/VA1, User2/VA1) all reside in physical memory]
80
Address Translation & Protection
Virtual Address = Virtual Page No. (VPN) | offset
[Figure: as on slide 32, the VPN passes through the protection check (kernel/user mode, read/write) and address translation to yield an exception or the Physical Page No. (PPN), which is concatenated with the unchanged offset to form the physical address]
• Every instruction and data access needs address
translation and protection checks
• A good VM design needs to be fast (~ one cycle) and
space efficient
81
Translation Lookaside Buffers (TLB)
Address translation is very expensive!
In a two-level page table, each reference becomes several
memory accesses
Solution: Cache translations in TLB
TLB hit ⇒ Single-Cycle Translation
TLB miss ⇒ Page-Table Walk to refill
[Figure: as on slide 38, the VPN is matched against the TLB tags (entries hold V/R/W/D bits, tag, and PPN); on a hit the PPN is concatenated with the offset to form the physical address]
82
TLB Designs
• Typically 32-128 entries, usually fully associative
– Each entry maps a large page, hence less spatial locality across pages
⇒ more likely that two entries conflict
– Sometimes larger TLBs (256-512 entries) are 4-8 way set-associative
– Larger systems sometimes have multi-level (L1 and L2) TLBs
• Random or FIFO replacement policy
• No process information in TLB?
• TLB Reach: Size of largest virtual address space that
can be simultaneously mapped by TLB
Example: 64 TLB entries, 4KB pages, one page per entry
TLB Reach = 64 entries × 4 KB = 256 KB (if contiguous)
83
Handling a TLB Miss
• Software (MIPS, Alpha)
– TLB miss causes an exception and the operating
system walks the page tables and reloads TLB. A
privileged "untranslated" addressing mode is used
for the walk.
• Hardware (SPARC v8, x86, PowerPC,
RISC-V)
– A memory management unit (MMU) walks the
page tables and reloads the TLB.
– If a missing (data or PT) page is encountered
during the TLB reloading, MMU gives up and
signals a Page Fault exception for the original
instruction.
84
Hierarchical Page Table Walk: SPARC v8
[Figure: SPARC v8 virtual address = Index 1 (bits 31-24) | Index 2 (bits 23-18) | Index 3 (bits 17-12) | Offset (bits 11-0). The Context Table Register and Context Register select a root pointer in the Context Table; Index 1 selects a page-table pointer (PTP) in the L1 Table, Index 2 a PTP in the L2 Table, and Index 3 the PTE in the L3 Table; the PTE's PPN plus the offset forms the physical address]
MMU does this table walk in hardware on a TLB miss
85
Modern Virtual Memory Systems
Illusion of a large, private, uniform store
• Protection & Privacy: several users, each with their private address
space and one or more shared address spaces (page table = name space)
• Demand Paging: provides the ability to run programs larger than the
primary memory; hides differences in machine configurations
• The price is address translation on each memory reference
[Figure: user_i's virtual addresses (VA) are mapped through the TLB to physical addresses (PA) in primary memory, which the OS backs with secondary storage]
86
Page-Based Virtual-Memory Machine
(Hardware Page-Table Walk)
[Figure: same machine as slide 43. Pipeline with Inst. TLB/Inst. Cache and Data TLB/Data Cache, page-fault and protection-violation checks at each TLB, and a hardware page-table walker fed by the Page-Table Base Register through the memory controller to main memory (DRAM)]
• Assumes page tables held in untranslated physical memory
87
Address Translation:
putting it all together
Virtual Address
• TLB Lookup (hardware)
– hit ⇒ Protection Check (hardware)
» permitted ⇒ Physical Address (to cache)
» denied ⇒ Protection Fault ⇒ SEGFAULT (software)
– miss ⇒ Page Table Walk (hardware or software)
» the page is in memory ⇒ Update TLB, then retry the access
» the page is not in memory ⇒ Page Fault: OS loads page (software)
88
Page Fault Handler
• When the referenced page is not in DRAM:
– The missing page is located (or created)
– It is brought in from disk, and page table is updated
» Another job may be run on the CPU while the first job
waits for the requested page to be read from disk
– If no free pages are left, a page is swapped out
» Pseudo-LRU replacement policy, implemented in
software
• Since it takes a long time to transfer a page
(msecs), page faults are handled completely
in software by the OS
– Untranslated addressing mode is essential to allow kernel to
access page tables
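A high-level sketch of that handler; every helper name here is an illustrative placeholder, not a real kernel API:

    #include <stdint.h>

    extern int  allocate_frame(void);                 /* < 0 if none free */
    extern int  pick_victim(void);                    /* pseudo-LRU, in software */
    extern int  frame_is_dirty(int frame);
    extern void write_frame_to_disk(int frame);
    extern void read_page_from_disk(uint32_t va, int frame);
    extern void update_page_table(uint32_t va, int frame);

    void handle_page_fault(uint32_t bad_va) {
        int frame = allocate_frame();
        if (frame < 0) {                      /* no free pages: swap one out */
            frame = pick_victim();
            if (frame_is_dirty(frame))
                write_frame_to_disk(frame);   /* write back before reuse */
        }
        read_page_from_disk(bad_va, frame);   /* takes msecs; the scheduler runs
                                                 another job meanwhile */
        update_page_table(bad_va, frame);     /* set PPN, mark PTE valid */
        /* returning from the exception restarts the faulting instruction */
    }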
89
Handling VM-related exceptions
[Figure: pipeline with Inst TLB / Inst. Cache and Data TLB / Data Cache; both TLBs can signal TLB miss?, Page Fault?, or Protection violation?]
• Handling a TLB miss needs a hardware or software
mechanism to refill TLB
• Handling a page fault (e.g., page is on disk) needs a
restartable exception so software handler can resume
after retrieving page
– Precise exceptions are easy to restart
– Can be imprecise but restartable, but this complicates OS
software
• Handling protection violation may abort process
– But often handled the same as a page fault
90
Address Translation in CPU Pipeline
[Figure: the same pipeline, with TLB miss, page fault, and protection violation checks at both the instruction and data TLBs]
• Need to cope with additional latency of TLB:
– slow down the clock?
– pipeline the TLB and cache access?
– virtual address caches
– parallel TLB/cache access
91
Virtual-Address Caches
[Figure: conventional organization: CPU → VA → TLB → PA → Physical Cache → PA → Primary Memory]
Alternative: place the cache before the TLB
[Figure: virtual-cache organization (StrongARM): CPU → VA → Virtual Cache → VA → TLB → PA → Primary Memory]
• one-step process in case of a hit (+)
• cache needs to be flushed on a context switch unless address
space identifiers (ASIDs) included in tags (-)
• aliasing problems due to the sharing of pages (-)
• maintaining cache coherence (-) (see later in “Computer
Architecture” or “Multicore Architecture” course)
92
Virtually Addressed Cache
(Virtual Index/Virtual Tag)
[Figure: pipeline in which the Inst. Cache and Data Cache are accessed directly with virtual addresses (virtual index/virtual tag); only on a miss does the Inst. TLB or Data TLB produce a physical address for the hardware page-table walker and the memory controller to main memory (DRAM)]
Translate on miss
93
Aliasing in Virtual-Address Caches
[Figure: the page table maps two virtual pages VA1 and VA2 to the same physical page PA; a virtual cache can then hold a 1st copy of the data at PA tagged with VA1 and a 2nd copy tagged with VA2]
Two virtual pages share one physical page. A virtual cache can have
two copies of the same physical data. Writes to one copy are not
visible to reads of the other!
General Solution: Prevent aliases from coexisting in the cache
Software (i.e., OS) solution for direct-mapped cache:
VAs of shared pages must agree in cache index bits; this ensures
all VAs accessing the same PA will conflict in the direct-mapped cache
(early SPARCs)
94
Concurrent Access to TLB & Cache
(Virtual Index/Physical Tag)
[Figure: the L virtual-index bits and b block-offset bits of the VA address a direct-map cache of 2^L blocks (2^b-byte blocks) while the TLB translates the VPN; the PPN from the TLB is then compared with the physical tag read from the cache (hit?)]
Index L is available without consulting the TLB
⇒ cache and TLB accesses can begin simultaneously!
Tag comparison is made after both accesses are completed
Cases: L + b = k, L + b < k, L + b > k
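Worked example (illustrative numbers): with 4 KB pages the page offset is k = 12 bits. With 64-byte blocks (b = 6), an index taken purely from untranslated bits can be at most L = k - b = 6 bits, so a direct-mapped virtually-indexed cache is limited to 2^(L+b) = 4 KB, exactly one page (the L + b = k case). Growing the cache beyond a page forces L + b > k, which is handled by adding associativity (next slide) or by the anti-aliasing schemes on later slides.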
95
Virtual-Index Physical-Tag Caches:
Associative Organization
[Figure: with L = k - b index bits, the virtual index selects one set in each of the 2^a direct-map banks of 2^L blocks while the TLB translates the VPN; the PPN is compared against the 2^a physical tags in parallel (hit?)]
After the PPN is known, 2^a physical tags are compared
How does this scheme scale to larger caches?
96
Concurrent Access to TLB & Large
L1
The problem with L1 > Page size
[Figure: the virtual index (a bits of the VPN plus the page offset) addresses the direct-map L1 PA cache before the TLB finishes translating; aliases VA1 → PPNa and VA2 → PPNa can then occupy two different cache sets, each with its own copy of the data; the physical tag comparison (hit?) happens after the TLB produces the PPN]
Can VA1 and VA2 both map to PA?
97
A solution via Second Level Cache
[Figure: the CPU and register file access a split L1 (instruction cache and data cache); both L1 caches are backed by a unified L2 cache in front of memory]
Usually a common L2 cache backs up both Instruction
and Data L1 caches
L2 is “inclusive” of both Instruction and Data caches
• Inclusive means L2 has copy of any line in either L1
98
Anti-Aliasing Using L2 [MIPS R10000,1996]
[Figure: the virtual index (a bits + page offset) accesses the direct-map L1 PA cache, which may hold VA1 → PPNa and VA2 → PPNa in different sets; the full PPN from the TLB goes into the direct-mapped L2 tag, so looking up the PA in L2 detects the alias]
• Suppose VA1 and VA2 both map to PA and VA1 is already in L1 and L2
(VA1 ≠ VA2)
• After VA2 is resolved to PA, a collision will be detected in L2
• VA1 will be purged from L1 and L2, and VA2 will be loaded ⇒ no aliasing!
99
Anti-Aliasing using L2 for a Virtually
Addressed L1
[Figure: the L1 VA cache is indexed and tagged with virtual addresses ("virtual tag"), so VA1 and VA2 could both cache the same data; the L2 PA cache is physically indexed and tagged and additionally records, per line, the VA under which the line sits in L1, allowing an alias to be detected and purged]
A physically-addressed L2 can also be used to avoid aliases in a
virtually-addressed L1; L2 "contains" L1
100
VM features track historical uses:
• Bare machine, only physical addresses
– One program owned entire machine
• Batch-style multiprogramming
– Several programs sharing CPU while waiting for I/O
– Base & bound: translation and protection between programs (supports
swapping entire programs but not demand-paged virtual memory)
– Problem with external fragmentation (holes in memory), needed
occasional memory defragmentation as new jobs arrived
• Time sharing
– More interactive programs, waiting for user. Also, more jobs/second.
– Motivated move to fixed-size page translation and protection, no external
fragmentation (but now internal fragmentation, wasted bytes in page)
– Motivated adoption of virtual memory to allow more jobs to share limited
physical memory resources while holding working set in memory
• Virtual Machine Monitors
– Run multiple operating systems on one machine
– Idea from 1970s IBM mainframes, now common on laptops
» e.g., run Windows on top of Mac OS X
– Hardware support for two levels of translation/protection
» Guest OS virtual -> Guest OS physical -> Host machine physical
104
Virtual Memory Use Today - 1
• Servers/desktops/laptops/smartphones have full
demand-paged virtual memory
– Portability between machines with different memory sizes
– Protection between multiple users or multiple tasks
– Share small physical memory among active tasks
– Simplifies implementation of some OS features
• Vector supercomputers have translation and protection
but rarely complete demand-paging
• (Older Crays: base&bound, Japanese & Cray X1/X2: pages)
– Don’t waste expensive CPU time thrashing to disk (make jobs fit in
memory)
– Mostly run in batch mode (run set of jobs that fits in memory)
– Difficult to implement restartable vector instructions
105
Virtual Memory Use Today - 2
• Most embedded processors and DSPs provide physical
addressing only
– Can’t afford area/speed/power budget for virtual memory support
– Often there is no secondary storage to swap to!
– Programs custom written for particular memory configuration in
product
– Difficult to implement restartable instructions for exposed architectures
106
Acknowledgements
• UCB material derived from course CS152
107
Input/Output
108
Diversity of Devices
• I/O devices can be characterized by
– Behavior: input, output, storage
– Partner: human or machine
– Data rate: bytes/sec, transfers/sec
109
I/O System Design Goals
• Performance measures
– Latency (response time)
– Throughput (bandwidth)
• Dependability is important
– Resilience in the face of failures
– Particularly for storage devices
• Expandability
• Computer classes
– Desktop: response time and diversity of devices
– Server: throughput, expandability, failure resilience
– Embedded: cost and response time
110
I/O System Design
• Satisfying latency requirement
– For time-critical operations
– If system is unloaded
• Add up latency of components
• Maximizing throughput at steady state (loaded system)
– Find “weakest link” (lowest-bandwidth component)
– Configure to operate at its maximum bandwidth
– Balance remaining components in the system
127
I/O Commands:
A method for addressing a device
• Memory-mapped I/O:
– Portions of the address space are assigned to each I/O
device
– I/O addresses correspond to device registers
– User programs prevented from issuing I/O operations directly
since I/O address space is protected by the address
translation mechanism
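A brief sketch of what memory-mapped I/O looks like from (kernel) code; the device address and register layout here are hypothetical, and volatile keeps the compiler from caching or reordering the accesses:

    #include <stdint.h>

    #define UART_BASE   0x10000000u        /* assumed device address */
    #define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
    #define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x4))
    #define TX_READY    0x1u

    void uart_putc(char c) {
        while (!(UART_STATUS & TX_READY))  /* poll the status register */
            ;
        UART_DATA = (uint32_t)c;           /* a store acts as a device command */
    }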
140
Polling and Programmed I/O
143
I/O Notification
• Method #2: I/O Interrupt
– When an I/O device needs attention, it interrupts the
processor
– Interrupt must tell OS about the event and which device
• Using “cause” register(s): Kernel “asks” what interrupted
• Using vectored interrupts: A different exception handler for each device
– I/O interrupts are asynchronous events, and happen anytime
• Processor waits until current instruction is completed
– Interrupts may have different priorities
– Advantages: Execution is only halted during actual transfer
– Disadvantage: Software overhead of interrupt processing
144
I/O Interrupts
145
Data Transfer
• The third component to I/O communication is the transfer
of data from the I/O device to memory (or vice versa)
• Simple approach: “Programmed” I/O
– Software on the processor moves all data between memory
addresses and I/O addresses
– Simple and flexible, but wastes CPU time
– Also, lots of excess data movement in modern systems
• Mem->Network->CPU->Network->graphics
• So need a solution to allow data transfer to happen
without the processor’s involvement
146
Delegating I/O : DMA
• Direct Memory Access (DMA)
– Transfer blocks of data to or from memory without CPU
intervention
– Communication coordinated by the DMA controller
• DMA controllers are integrated in memory or I/O controller chips
– DMA controller acts as a bus master, in bus-based systems
• DMA steps
– Processor sets up DMA by supplying
• Identity of the device and the operation (read/write)
• The memory address for source/destination
• The number of bytes to transfer
– DMA controller starts the operation by arbitrating for the bus
and then starting the transfer when the data is ready
– Notify the processor when the DMA transfer is complete or
on error
• Usually using an interrupt
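The setup steps above might look like this against a hypothetical memory-mapped DMA controller; the register layout is invented for illustration:

    #include <stdint.h>

    #define DMA_BASE  0x20000000u
    #define DMA_ADDR  (*(volatile uint32_t *)(DMA_BASE + 0x0))  /* memory address */
    #define DMA_LEN   (*(volatile uint32_t *)(DMA_BASE + 0x4))  /* bytes to move */
    #define DMA_CTRL  (*(volatile uint32_t *)(DMA_BASE + 0x8))  /* direction/go */
    #define DMA_READ  0x1u
    #define DMA_GO    0x2u

    void dma_start_read(uint32_t dst_phys, uint32_t nbytes) {
        DMA_ADDR = dst_phys;               /* destination (pinned physical pages) */
        DMA_LEN  = nbytes;                 /* transfer size */
        DMA_CTRL = DMA_READ | DMA_GO;      /* controller arbitrates for the bus */
        /* the CPU keeps running; completion arrives later as an interrupt */
    }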
147
DMA Problems: Virtual vs. Physical Addresses
• If DMA uses physical addresses
– Memory access across physical page boundaries may not
correspond to contiguous virtual pages (or even the same
application)
• Solution 1: ≤ 1 page per DMA transfer, chaining a series of
1-page requests provided by the OS
– Single interrupt at the end of the last DMA request in the
chain
• Solution2: DMA engine uses virtual addresses
– Multi-page DMA requests are now easy
– A TLB is necessary for the DMA engine
• For DMA with physical addresses: pages must be pinned in DRAM
– OS should not page to disks pages involved with pending I/O
148
DMA Problems: Cache Coherence
• A copy of the data involved in a DMA transfer may reside in processor cache
– If memory is updated: must update or invalidate “old” cache copy
– If memory is read: must read latest value, which may be in the cache
• Only a problem with write-back caches
• This is called the “cache coherence” problem
– Same problem in multiprocessor systems
• Solution 1: OS flushes the cache before I/O reads or forces write backs
before I/O writes
– Flush/write-back may involve selective addresses or whole cache
– Can be done in software or with hardware (ISA) support
• Solution 2: Route memory accesses for I/O through the cache
– Search the cache for copies and invalidate or write-back as needed
– This hardware solution may impact performance negatively
• While searching cache for I/O requests, it is not available to processor
– Multi-level, inclusive caches make this easier
• Processor searches L1 cache mostly (until it misses)
• I/O requests search L2 cache mostly (until a copy of interest is found)
149
I/O Summary
• I/O performance has to take into account many variables
– Response time and throughput
– CPU, memory, bus, I/O device
– Workload, e.g., transaction processing
• I/O devices also span a wide spectrum
– Disks, graphics, and networks
• Buses
– Bandwidth
– Synchronization
– Transactions
• OS and I/O
– Communication: Polling and Interrupts
– Handling I/O outside CPU: DMA
150
Acknowledgements
• These slides contain material from courses:
– UCB CS152
– Stanford EE108B
• Read Book
– Pages 412-454
151