SE 292: High Performance Computing Memory Organization


SE 292: High Performance Computing
Memory Organization and Process
Management
Sathish Vadhiyar
Memory Image of a Process –
Structure of a process in main memory

 A process consists of different sections (shown in the figure) in main memory
 A process also includes current activity
   Program counter
   Process registers

[Figure: memory image of a process – stack at the top; memory-mapped region
for shared libraries (e.g. libc.so); heap; uninitialized and initialized
data; text (code), loaded from the executable (a.out)]

 Data section contains global variables (initialized & uninitialized)
 The memory image also consists of shared libraries used by a program (more later)
Memory Image
Stack





 Stack contains temporary data
 Holds function parameters, return addresses, local variables
 Expands and contracts with function calls and returns, respectively
 Addresses calculated and used by the compiler, relative to the top of stack or some other base register associated with the stack
 Growth of the stack area is thus managed by the program, as generated by the compiler
Memory Image
Heap



 Heap is dynamically allocated
 Expands and contracts with malloc and free, respectively
 Managed by a memory allocation library
Memory Abstraction



 Above scheme requires the entire process to be in memory
 What if the process space is larger than physical memory?
 Virtual memory
   Provides an abstraction of physical memory
   Allows execution of processes not completely in memory
   Offers many other features
Virtual Memory Concepts




 Instructions must be in physical memory to be executed
 But not all of the program needs to be in physical memory
   Each program could take less memory; more programs can reside simultaneously in physical memory
 Need for memory protection
   One program should not be able to access the variables of another
   Typically done through address translation (more later)
Virtual Memory Concepts – Large size

 Virtual memory can be extremely large when compared to a smaller physical memory

[Figure: pages 0, 1, 2, …, v of a large virtual memory mapped through a
memory map onto a smaller physical memory]
Virtual Memory Concepts - Sharing
 Virtual memory allows programs to share pages and libraries

[Figure: two processes, each with stack, shared library, heap, data, and
code regions; their memory maps point the shared-library pages at the same
physical pages]
Address Translation



 Terminology – virtual and physical addresses
 Virtual memory addresses are translated to physical memory addresses
 Memory Management Unit (MMU) provides the mapping
   MMU – dedicated hardware in the CPU chip
   Uses a lookup table (see below)
 To translate a virtual address to the corresponding physical address, a table of translation information is needed
Address Translation





 Minimize the size of the table by not managing translations on a per-byte basis but at a larger granularity
 Page: fixed-size unit of memory (contiguous memory locations) for which a single piece of translation information is maintained
 The resulting table is called a page table
 The contents of the page table are managed by the OS
 Thus, (address translation logic in MMU + OS + page table) are used for address translation
Address Translation
[Figure: a program/process virtual address space (0x00000000–0xFFFFFFFF)
divided into 256-byte virtual pages holding text, data, stack, and heap;
main memory forms a smaller physical address space (0x000000–0xFFFFFF) of
256-byte physical pages. E.g. virtual address 0x00000124 in virtual page 1
maps to physical address 0x000024 in physical page 0]
Address Translation
[Figure: a virtual address is split into a virtual page number (VPN) and an
offset; the translation table (page table) maps the VPN to a physical page
number (PPN), which is combined with the unchanged offset to form the
physical address]
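The VPN/offset split above can be sketched in Python, using the 256-byte pages from the earlier figure; the page-table contents here are hypothetical:

```python
PAGE_SIZE = 256  # bytes per page, matching the 256-byte pages in the figure

# Hypothetical page table: virtual page number (VPN) -> physical page number (PPN)
page_table = {0: 1, 1: 0}

def translate(vaddr):
    """Split a virtual address into (VPN, offset) and map VPN -> PPN."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = page_table[vpn]          # a real MMU raises a fault on an unmapped VPN
    return ppn * PAGE_SIZE + offset

# As in the figure, virtual address 0x00000124 (offset 0x24 within virtual
# page 1) translates to physical address 0x000024 in physical page 0.
```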
Size of Page Table
 Big Problem: Page Table size
 Example: 32-bit virtual address, 16 KB (2^14 byte) page size
 2^32 / 2^14 = 256K virtual pages => MBs of page table per process
 Has to be stored in memory
 Solution – multi-level page table (more later)
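The arithmetic behind the example, assuming a 4-byte page table entry (the entry size is not stated on the slide):

```python
virtual_address_bits = 32
page_size = 16 * 1024        # 16 KB pages => 14 offset bits
pte_size  = 4                # assumed: 4 bytes per page table entry

num_pages   = 2**virtual_address_bits // page_size   # 2^18 = 256K virtual pages
table_bytes = num_pages * pte_size                   # 1 MB of page table per process
```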
Address Translation
[Fig 10.13 (Bryant): the Page Table Base Register (PTBR) points to the page
table; an n-bit virtual address splits at bit p into a virtual page number
(VPN) and a virtual page offset (VPO); the page table entry holds a valid
bit and a physical page number (PPN), which is concatenated with the
unchanged offset (PPO = VPO) to form the m-bit physical address]
What’s happening…
[Figure: processes P1, P2, …, Pn each have a page table mapping their
virtual pages either to frames in main memory or to locations on disk;
main memory holds only a subset of all virtual page contents, the rest
reside on disk]
Demand Paging





 When a program refers to a page, the page table is looked up to obtain the corresponding physical frame (page)
 If the physical frame is not present, a page fault occurs
 The page is then brought in from the disk
 Thus pages are swapped in from the disk to physical memory only when needed – demand paging
 To support this, the page table has valid and invalid bits
   Valid – present in memory
   Invalid – present on disk
Page Fault


 Situation where the virtual address generated by the processor is not available in main memory
 Detected on attempt to translate the address
   Page table entry is invalid
 Must be `handled’ by the operating system
   Identify a slot in main memory to be used
   Get page contents from secondary memory
     Part of the disk can be used for this purpose
   Update the page table entry
   Data can now be provided to the processor
Demand Paging (Valid-Invalid Bit)
Fig. 9.5 (Silberschatz)
Page Fault Handling Steps
Fig. 9.6 (Silberschatz)
Page Fault Handling – a different perspective

[Fig. 10.14 (Bryant): (1) the processor sends the virtual address to the
MMU; (2–3) the MMU fetches the PTE from cache/memory; (4) an invalid PTE
raises a page fault exception, transferring control to the OS page fault
exception handler; (5) the handler writes a victim page back to disk if
needed; (6) it reads the new page from disk into memory and updates the
PTE; (7) the faulting instruction is restarted]
Page Fault Handling Steps
1. Check the page table to see if the reference is valid or invalid
2. If invalid, trap into the operating system
3. Find a free frame; locate the page on the disk
4. Swap in the page from the disk
   1. Replace an existing page if no frame is free (page replacement)
5. Modify the page table
6. Restart the instruction
Performance Impact due to Page Faults




 p – probability of a page fault
 ma – memory access time
 effective access time (EAT) = (1-p) x ma + p x page fault time
 Typically
   EAT = (1-p) x (200 nanoseconds) + p x (8 milliseconds)
       = 200 + 7,999,800 x p nanoseconds
 Dominated by the page fault time
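The EAT formula as code, with the slide’s numbers as defaults; even a tiny fault rate dominates the access time:

```python
def effective_access_time(p, ma_ns=200, fault_ns=8_000_000):
    """EAT = (1-p)*ma + p*page_fault_time, in nanoseconds."""
    return (1 - p) * ma_ns + p * fault_ns

# A fault probability of just 0.1% already slows memory accesses down
# by a factor of roughly 40 (8,199.8 ns vs 200 ns).
```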
So, what happens during a page fault?
1. Trap to OS; save registers and process state; determine the location of the page on disk
2. Issue a read from the disk
   1. The major time-consuming step – 3 milliseconds latency, 5 milliseconds seek
3. Receive interrupt from the disk subsystem; save registers of the other program
4. Restore registers and process state of this program

 To keep the EAT increase due to page faults very small (< 10%), only 1 in a few hundred thousand memory accesses should page fault
 i.e., most of the memory references should be to pages in memory
 But how do we manage this?
   Luckily, locality
   And smart page replacement policies that can make use of locality
Page Replacement Policies


 Question: How does the page fault handler decide which main memory page to replace when there is a page fault?
 Principle of Locality of Reference
   A commonly seen program property
    If memory address A is referenced at time t, then it (temporal) and its neighbouring memory locations (spatial) are likely to be referenced in the near future
    Suggests that a Least Recently Used (LRU) replacement policy would be advantageous
Locality of Reference

 Based on your experience, why do you expect that programs will display locality of reference?

                          Program              Data
 Same address (temporal)  Loop, Function       Loop index, local variables
 Neighbours (spatial)     Sequential code      Stepping through an array
Page Replacement Policies


 FIFO – performance depends on whether the initial pages are actively used
 Optimal page replacement – replace the page that will not be used for the longest time
   Difficult to implement
   Some ideas?
LRU



 Least Recently Used is most commonly used
 Implemented using counters
 LRU might be too expensive to implement
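A counter-based LRU sketch: `last_used` plays the role of the per-page counters mentioned above. On the classic 20-reference string from Silberschatz with 3 frames it produces 12 faults:

```python
def lru_faults(refs, num_frames):
    """Count page faults under LRU replacement."""
    frames, last_used, faults = set(), {}, 0
    for t, page in enumerate(refs):
        if page not in frames:
            faults += 1
            if len(frames) == num_frames:      # memory full: must evict
                victim = min(frames, key=lambda p: last_used[p])
                frames.remove(victim)
            frames.add(page)
        last_used[page] = t                    # update the page's counter
    return faults
```

The `min` over all resident pages on every eviction is what makes exact LRU expensive in hardware.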
Page Replacement Algorithms Alternatives

 LRU approximation – Second-Chance or Clock algorithm
   Similar to FIFO, but when a page’s reference bit is 1, the page is given a second chance
   Implemented using a circular queue
 Counting-based page replacement
   LFU, MFU
 Performance of page replacement depends on applications
 Section 9.4.8
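A sketch of the second-chance (clock) algorithm: a circular queue of frames plus one reference bit per frame.

```python
def clock_faults(refs, num_frames):
    """Second-chance replacement: FIFO order, but a set reference bit earns a reprieve."""
    frames  = [None] * num_frames   # circular queue of resident pages
    ref_bit = [0] * num_frames
    hand, faults = 0, 0
    for page in refs:
        if page in frames:
            ref_bit[frames.index(page)] = 1   # hit: set the reference bit
            continue
        faults += 1
        while ref_bit[hand]:                  # skip pages with a second chance
            ref_bit[hand] = 0
            hand = (hand + 1) % num_frames
        frames[hand] = page                   # evict the victim
        hand = (hand + 1) % num_frames
    return faults
```

With no re-references it degenerates to FIFO; a page that is touched again (like page 1 in the second test) survives the next eviction.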
Managing size of page table – TLB, multi-level tables

 Translation Lookaside Buffer (TLB) – a small and fast memory for storing page table entries
 MMU need not fetch the PTE from memory every time
 TLB is a virtually addressed cache

[Figure: the virtual page number is split into a TLB tag (TLBT) and a TLB
index (TLBI); (1) the processor sends the VA to the MMU, (2) the MMU looks
up the VPN in the TLB, (3) on a hit the PTE is returned without a memory
access, (4) the translated PA goes to cache/memory and data returns to the
processor]
Speeding up address translation


 Translation Lookaside Buffer (TLB): memory in the MMU that contains some page table entries that are likely to be needed soon
 TLB miss: the required entry is not available in the TLB
Multi Level Page Tables


 A page table can occupy a significant amount of memory
 Multi-level page tables to reduce page table size
   If a PTE in the level-1 PT is null, the corresponding level-2 PT does not exist
   Only the level-1 PT needs to be in memory; other PTs can be swapped in on demand

[Figure: a level-1 page table whose entries either point to level-2 page
tables or are null; a null entry means the corresponding region of virtual
memory is unallocated and no level-2 table exists for it]
Multi Level Page Tables (In general..)
[Fig. 10.19 (Bryant): an n-bit virtual address is divided into VPN 1,
VPN 2, …, VPN k and a p-bit VPO; the final PPN is concatenated with the
unchanged offset (PPO = VPO) to form the m-bit physical address]

 Each VPN i is an index into a PT at level i
 Each PTE in a level-j PT points to the base of some PT at level (j+1)
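Index extraction for a two-level table can be sketched as below, assuming the classic 32-bit/4 KB geometry with 10 index bits per level (these widths are an assumption, not from the slide):

```python
OFFSET_BITS, LEVEL_BITS = 12, 10   # assumed: 4 KB pages, 10-bit indices per level

def split_va(va):
    """Return (VPN 1, VPN 2, VPO) for a 32-bit virtual address."""
    vpo  = va & ((1 << OFFSET_BITS) - 1)
    vpn2 = (va >> OFFSET_BITS) & ((1 << LEVEL_BITS) - 1)
    vpn1 = va >> (OFFSET_BITS + LEVEL_BITS)
    return vpn1, vpn2, vpo
```

VPN 1 indexes the level-1 table, whose entry points to the level-2 table indexed by VPN 2.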
Virtual Memory and Fork




 During fork, a parent process creates a child process
 Initially the pages of the parent process are shared with the child process
 But the pages are marked as copy-on-write
 When the parent or child process writes to a page, a copy of the page is created
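A toy model of copy-on-write sharing (not a real fork): parent and child share the same page objects until one of them writes.

```python
class CowSpace:
    """Toy copy-on-write address space; pages are immutable bytes objects."""
    def __init__(self, pages):
        self.pages = pages

    def fork(self):
        # Child gets its own list, but the entries point to the SAME pages.
        return CowSpace(list(self.pages))

    def write(self, page_no, offset, value):
        # The "fault" on write: make a private copy, then modify it.
        copy = bytearray(self.pages[page_no])
        copy[offset] = value
        self.pages[page_no] = bytes(copy)
```

Right after `fork`, parent and child refer to identical page objects; a write gives the writer a private copy while the other process keeps the original, and untouched pages stay shared.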
Dynamic Memory Allocation




 Allocator – manages the heap region of the VM
 Maintains a list of “free” blocks in memory and allocates these blocks when a process requests them
 Can be explicit (malloc & free) or implicit (garbage collection)
 Allocator’s twin goals
   Maximizing throughput and maximizing utilization
   Contradictory goals – why?
Components of Memory Allocation



 The heap is organized as a free list
 When an allocation request is made, the list is searched for a free block
 Search method or placement policy
   First fit, best fit, worst fit
 Coalescing – merging adjacent free blocks
   Can be immediate or deferred
 Garbage collection – to automatically free allocated blocks no longer needed by the program
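First-fit placement and coalescing over a toy free list of (address, length) blocks; a sketch of the policy, not how any particular malloc is implemented:

```python
def first_fit(free_list, size):
    """Allocate `size` bytes from the first block that fits; split off the rest."""
    for i, (addr, length) in enumerate(free_list):
        if length >= size:
            rest = [] if length == size else [(addr + size, length - size)]
            return addr, free_list[:i] + rest + free_list[i + 1:]
    return None, free_list        # no single block fits (external fragmentation)

def coalesce(free_list):
    """Merge adjacent free blocks into larger ones."""
    merged = []
    for addr, length in sorted(free_list):
        if merged and merged[-1][0] + merged[-1][1] == addr:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((addr, length))
    return merged
```

Best fit would instead scan the whole list for the smallest adequate block; worst fit for the largest.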
Fragmentation



 Cause of poor heap utilization
 Unused memory that is not available to satisfy allocation requests
 Two types
   Internal – when an allocated block is larger than the required payload
   External – when there is enough aggregate free memory to satisfy an allocation request, but no single free block is large enough
Process Management
Computer Organization: Software


 Hardware resources of a computer system are shared by programs in execution
 Operating System: a special program that manages this sharing
   Ease-of-use, resource allocation, device controllers
 Process: a program in execution
   ps tells you the current status of processes
 Shell: a command interpreter through which you interact with the computer system
   csh, bash, …
Operating System, Processes, Hardware
[Figure: layered view – processes invoke system calls into the OS kernel,
which manages the hardware]
Operating System
Software that manages the resources of a computer system
   CPU time
   Main memory
   I/O devices
 OS functionalities
   Process management
   Memory management
   Storage management
Process Lifetime

 Two modes
   User – when executing on behalf of a user application
   Kernel – when a user application requests some OS service; some privileged instructions
 Implemented using mode bits
 Silberschatz – figure 1.10
Modes

 Can find out the total CPU time used by a process, as well as CPU time in user mode and CPU time in system mode
Shell - What does a Shell do?
while (true) {
  • Prompt the user to type in a command        (write)
  • Read in the command                         (read)
  • Understand what the command is asking for
  • Get the command executed                    (fork, exec)
}
Q: What system calls are involved?

 Shell – command interpreter
 Shell interacts with the user and invokes system calls
 Its functionality is to obtain and execute the next user command
 Most of the commands deal with file operations – copy, list, execute, delete, etc.
 It loads the commands in memory and executes them
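The loop body above can be sketched with the same system calls (fork, exec, wait); this is POSIX-only, and `os.waitstatus_to_exitcode` needs Python 3.9+:

```python
import os
import shlex

def run_command(line):
    """One shell iteration: fork, exec the command in the child, wait for it."""
    args = shlex.split(line)
    pid = os.fork()                    # child starts as a copy of the shell
    if pid == 0:
        os.execvp(args[0], args)       # replace the child's memory image
    _, status = os.waitpid(pid, 0)     # shell blocks until the command exits
    return os.waitstatus_to_exitcode(status)
```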
System Calls




 How a process gets the operating system to do something for it; an interface or API for interaction with the operating system
 Examples
   File manipulation: open, close, read, write, …
   Process management: fork, exec, exit, …
   Memory management: sbrk, …
   Device manipulation: ioctl, read, write
   Information maintenance: date, getpid
   Communications: pipe, shmget, mmap
   Protection: chmod, chown
 When a process is executing in a system call, it is actually executing operating system code
 System calls allow transition between modes
Mechanics of System Calls





 A process must be allowed to do sensitive operations while it is executing a system call
 Requires hardware support
 Processor hardware is designed to operate in at least 2 modes of execution
   Ordinary, user mode
   Privileged, system mode
 A system call is entered using a special machine instruction (e.g. MIPS 1 syscall) that switches the processor mode to system before the control transfer
 System calls are used all the time
   Accepting the user’s input from the keyboard, printing to the console, opening files, reading from and writing to files
System Call Implementation





 Implemented as a trap to a specific location in the interrupt vector (the interrupting instruction contains the specific requested service; additional information is contained in registers)
 Trap executed by the syscall instruction
 Control passes to a specific service routine
 System calls are usually not called directly – there is a mapping between an API function and a system call
 The system call interface intercepts calls in the API, looks up a table of system call numbers, and invokes the system calls
 Figure 2.9 (Silberschatz)
Traditional UNIX System Structure
System Boot





 Bootstrap loader – a program that locates the kernel, loads it into memory, and starts its execution
 When the CPU is booted, the instruction register is loaded with the bootstrap program from a pre-defined memory location
 Bootstrap is in ROM (firmware)
 Bootstrap initializes various things (mouse, devices), then starts the OS from the boot block on disk
 Practical:
   BIOS – boot firmware located in ROM
   Loads the bootstrap program from the Master Boot Record (MBR) on the hard disk
   MBR contains GRUB; GRUB loads the OS
   The OS then runs init and waits
Process Management

What is a Process?
 Program in execution
 But some programs run as multiple processes
 And the same program can be running multiple times at the same time
Process vs Program




 Program: static, passive, dead
 Process: dynamic, active, living
 A process changes state with time
 Possible states a process could be in?
   Running (executing on CPU)
   Ready (to execute on CPU)
   Waiting (for something to happen)
Process State Transition Diagram
[Figure: state transition diagram – Ready → Running when scheduled;
Running → Ready when preempted or the process yields the CPU;
Running → Waiting when waiting for an event to happen;
Waiting → Ready when the awaited event happens]
Process States


 Ready – waiting to be assigned to a processor
 Waiting – waiting for an event
 Figure 3.2 (Silberschatz)
CPU and I/O Bursts


 Processes alternate between two states: CPU burst and I/O burst
 There are a large number of short CPU bursts and a small number of long CPU bursts
Process Control Block


 A process is represented by a Process Control Block (PCB). Contains:
 Process state
   text, data, stack, heap
   Hardware – PC value, CPU registers
 Other information maintained by the OS:
   Identification – process id, parent id, user id
   CPU scheduling information – priority
   Memory-management information – page tables etc.
   Accounting information – CPU time spent
   I/O status information
 A process can be viewed as a data structure with operations like fork, exit, etc. and the above data
PCB and Context Switch
Fig. 3.4
(Silberschatz)
Process Management

 What should the OS do when a process does something that will take a long time?
   e.g., file read/write operation, page fault, …
 Objective: maximize utilization of the CPU
   Change the status of the process to `Waiting’ and make another process `Running’
 Which process?
 Objective?
   Minimize average program execution time
   Fairness
Process Scheduling




 Selecting a process for execution from the many available ready processes
 A process residing in memory and waiting for execution is placed in a ready queue
   Can be implemented as a linked list
 Other devices (e.g. disk) can have their own queues
Queue diagram
Fig. 3.7
(Silberschatz)
Scheduling Criteria






CPU utilization
Throughput
Turnaround time
Waiting time
Response time
Fairness
Scheduling Policies

 Preemptive vs non-preemptive
   Preemptive policy: one where the OS `preempts’ the running process from the CPU even though it is not waiting for something
   Idea: give a process some maximum amount of CPU time before preempting it, for the benefit of the other processes
   CPU time slice: amount of CPU time allotted
   In a non-preemptive process scheduling policy, a process yields the CPU either because it is waiting for something or voluntarily
Process Scheduling Policies

 Non-preemptive
   First Come First Served (FCFS)
   Shortest Process Next (SPN)
 Preemptive
   Round robin
   Preemptive Shortest Process Next (shortest remaining time first)
   Priority based
     A process that has not run for more time could get higher priority
     May even have larger time slices for some processes
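Average waiting time under the two non-preemptive policies, assuming all processes arrive at time 0; the burst lengths in the usage note are the classic textbook example:

```python
def avg_wait_fcfs(bursts):
    """Average waiting time when processes run in the order given."""
    wait = elapsed = 0
    for b in bursts:
        wait += elapsed       # this process waited for everyone before it
        elapsed += b
    return wait / len(bursts)

def avg_wait_spn(bursts):
    """Shortest Process Next: running short bursts first minimizes waiting."""
    return avg_wait_fcfs(sorted(bursts))
```

For bursts of 24, 3, and 3 ms arriving in that order, FCFS averages 17 ms of waiting while SPN averages 3 ms, showing why a long first burst hurts FCFS.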
Example: Multilevel Feedback


 Used in some kinds of UNIX
 Multilevel: priority based (preemptive)
   OS maintains one ready queue per priority level
   Schedules from the front of the highest-priority non-empty queue
 Feedback: priorities are not fixed
   A process is moved to a lower/higher priority queue for fairness
Context Switch


 When the OS changes the process that is currently running on the CPU
 Takes some time, as it involves replacing the hardware state of the previously running process with that of the newly scheduled process
   Saving the HW state of the previously running process
   Restoring the HW state of the scheduled process
 The amount of time taken would help in deciding what a reasonable CPU time slice value would be
Time: Process virtual and Elapsed
[Figure: a wall-clock (elapsed, real-time) timeline in which P1, P2, and P3
run in interleaved slices; process P1’s virtual time is only the sum of the
slices in which P1 runs, each split between running in user mode and
running in system mode]
How is a Running Process Preempted?

 OS preemption code must run on the CPU
   How does the OS get control of the CPU from the running process to run its preemption code?
 Hardware timer interrupt
   Hardware-generated periodic event
   When it occurs, hardware automatically transfers control to OS code (the timer interrupt handler)
 An interrupt is an example of a more general phenomenon called an exception
Exceptions


 Certain exceptional events during program execution that are handled by the processor HW
 Two kinds of exceptions
   Traps: synchronous, software generated
     Page fault, divide by zero, system call
   Interrupts: asynchronous, hardware generated
     Timer, keyboard, disk
What Happens on an Exception
1. Hardware
   • Saves processor state
   • Transfers control to the corresponding piece of OS code, called the exception handler
2. Software (exception handler)
   • Takes care of the situation as appropriate
   • Ends with a return-from-exception instruction
3. Hardware (execution of the RFE instruction)
   • Restores the saved processor state
   • Transfers control back to the saved PC value
Re-look at Process Lifetime

 Which process has the exception handling time accounted against it?
   The process running at the time of the exception
 All interrupt handling time while a process is in the running state is accounted against it
   Part of `running in system mode’
More…
Thrashing




 When CPU utilization decreases, the OS increases the multiprogramming level by adding more processes
 Beyond a certain multiprogramming level, processes compete for pages, leading to page faults
 Page faults cause disk reads by processes, leading to lower CPU utilization
 The OS adds more processes, causing more page faults and lower CPU utilization – a cumulative effect