MODERN OPERATING SYSTEMS Third Edition ANDREW S. …

Download Report

Transcript MODERN OPERATING SYSTEMS Third Edition ANDREW S. …

Chapter 8-1 : Multiple Processor Systems


Multiple Processor Systems
Multiprocessor Hardware
UMA Multiprocessors
 NUMA Multiprocessors
 Multicore Chips


Multiprocessor Operating Systems
Types of Operating Systems
 Multiprocessor Syncronization
 Multiprocessor Scheduling

1
More Power Problems

Run the clock faster




Electrical signals travel 20cm/nsec in copper (Einstein’s
special theory of relativity)
Signals can not travel for more than 2 cm for a 10 GHz
clock, or 2 mm for a 100 GHz computer
Making computers this small may be possible, but we
have a heat dissipation problem
We have limitations for the time being, so the solution
to more power is through using multiple and paralel
CPU’s
2
Multiple Processor Systems
Figure 8-1. (a) A shared-memory multiprocessor. (b) A messagepassing multicomputer. (c) A wide area distributed system.
3
Multiple Processor Systems



Shared-memory multiprocessors: Every CPU has equal
access to the entire physical memory
Message-passing multicomputers: Each CPU has it’s
own memory. The CPU’s communicate with each other
using messages over the interconnection structure
Wide area distributed system: Computer systems
connected over a network. Communication is again by
messages but there is a delay due to the network
4
Multiprocessor Hardware



UMA Multiprocessors
NUMA Multiprocessors
Multicore Chips
5
UMA (Uniform Memory Access)
Multiprocessors with
Bus-Based Architectures (1)
Figure 8-2. Three bus-based multiprocessors. (a) Without caching. (b) With
caching. (c) With caching and private memories.

UMA : Uniform access to the entire memory, same access times for all CPU’s
6
UMA Multiprocessors with
Bus-Based Architectures (2)




Each CPU has to wait for the bus to be idle to read or write to
the memory
For 2 or 3 computers, bus contention is manageable (Figure 8.2
(a))
For larger number of CPU’s, a cache is added to the CPU. Since
reads can be satisfied by cache contents, there will be less traffic
on the bus (Figure 8.2 (b))
 Writing has to be managed!
Some systems have private and shared memories (Figure 8.3 (c))
 Mostly private memory is used. Shared memory is for shared
variables between CPUs
 Needs carefull programming!
7
UMA Multiprocessors
Using Crossbar Switches (1)
Figure 8-3. (a) An 8 × 8 crossbar switch. (b) An
open crosspoint. (c) A closed crosspoint.
8
UMA Multiprocessors
Using Crossbar Switches (2)




Use of a single bus limits (even with caches) the
number of CPUs to about 16 or 32 CPUs
A crossbar switch connecting n CPUs to k
memories may solve this problem
A crosspoint is a small electronic switch
Contention for memory is still possible if k < n.
Partitioning the memory into n units may reduce
the contention
9
UMA Multiprocessors Using Multistage
Switching Networks (1)
Figure 8-4. (a) A 2 × 2 switch with two input lines, A and B, and two
output lines, X and Y. (b) A message format.




Module
Address
Opcode
Value
: memory unit
: an address within a module
: Read or Write
: value to be written
10
UMA Multiprocessors Using Multistage
Switching Networks (2)
Figure 8-5. An omega switching network.
11
NUMA (Nonuniform)
Multiprocessors (1)
Characteristics of NUMA machines:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is via LOAD and STORE
instructions.
3. Access to remote memory is slower than access to local
memory.
12
NUMA Multiprocessors (2)
Figure 8-6. (a) A 256-node directory-based multiprocessor.
(b) Division of a 32-bit memory address into fields.
(c) The directory at node 36.
13
NUMA Multiprocessors (3)






Let us assume that each node has one CPU, 16 MB of ram and a cache
The total memory is 232 bytes, divided up into 226 cache lines (blocks) of 64
bytes each
The total memory is allocated among nodes, with 0-16 MB in node 0, 16-32
MB in node 1, and so on
Each node has a directory containing an entry for each of the 218 (262,144)
64-byte cache lines
Each directory entry is 9 bits (cache presence bit + 8 bits for a node number),
so the total directory size is 218 * 9 = 2,359,296 bits = 294,912 bytes
We will assume that a cache line (memory block) is held in the cache of one
node only (single copy)
14
NUMA Multiprocessors (4)





The directory of each node is kept in an extremely fast specialpurpose hardware, since directory must be queried on every
instruction that references memory (so expensive)
Let us assume that CPU 20 references the address 0x 24000108.
This address corresponds node 36, block 4, offset 8 in decimal
Node 20 sends a request message to node 36 to find whether
block 4 is cached ot not (NOT from Figure 8-6 (c))
Node 36 fetches block 4 from it’s local ram, sends it back to the
to node 20, and updates the directory entry to indicate that the
line is now cached at node 20
Now let us assume that node 20 makes a reference to line 2 of
node 36. This line is cached in node 82 (Figure 8-6 (c)). Node 82
passes the line to node 20
Multicore Chips (1)




Moore's Law : The number of transistors that can be
placed on a chip increases exponentially, doubling
approximately every two years.
Multicore (dual-core, quad-core) means more than one
complete CPU on the same chip
The CPU’s share the same main memory but they may
have private (AMD) or shared (Intel) cache memory
Snooping : special hardware circuitry makes sure that if a
word is present in two or more caches and one of the
CPUs modifies the word, it is automatically and
atomically removed from all caches in order to maintain
consistency
Multiprocessor Operating
Systems



Each CPU has its own operating system
Master-Slave multiprocessors
Symmetric Multiprocessors
17
Each CPU Has Its Own
Operating System (1)
Figure 8-7. Partitioning multiprocessor memory among
four CPUs, but sharing a single copy of the operating
system code (The boxes marked Data are the operating
system’s private data for each CPU).
18
Each CPU Has Its Own
Operating System (2)







CPUs share the OS code
System calls handled by individual CPUs
No sharing of processes since each CPU has OS tables of its own
Each CPU schedules its own processes and may be idle if there are no
processes
Since memory allocation is fixed, pages are not shared among CPUs
Since each OS maintains its own cache of disk blocks, there may be
inconsistency if blocks are modified by different Oss
This model was used in the early days of multiprocessors and rarely used
these days
19
Master-Slave Multiprocessors
Figure 8-8. A master-slave multiprocessor model.



One master OS in one CPU distributes work among other slave
CPUs
Problems of previous arcitecture (no page sharing, CPUs idle,
buffer cache inconsistency) are solved
Master CPU may be overloaded since it has to cater for all others
20
Symmetric Multiprocessors (1)

Figure 8-9. The SMP multiprocessor model.
Each CPU runs a single but shared copy of OS
independently
21
Symmetric Multiprocessors (2)


When a system call is made, the CPU on which
the call is made traps the kernel and processes
the call
This model eliminates the asymmetry of masterslave configuration
One copy of OS executing on different CPUs
 One set of OS tables
 No master CPU bootleneck

22
Symmetric Multiprocessors (3)




Problem: What will happen if two CPUs try to
claim the same free page?
This is one example out of many
Locks (mutex) are provided to solve these
problems
The OS is splitted into several critical regions
(each controlled by different locks) that can be
executed concurrently by different CPUs
without interfering with each other
23
Multiprocessor Synchronization (1)
Figure 8-10. The TSL (Test and Set Lock) instruction can
fail if the bus cannot be locked. These four steps show a
sequence of events where the failure is demonstrated.
24
Multiprocessor Synchronization (2)

The synchronization problem of previous slide
can be prevented
By locking the bus so that other CPUs can not
access it
 Execute the TSL instruction
 Unlock the bus


This can be done preferably by hardware locking
or by software using spin locks (process
executes a tight loop testing its status)
25
Multiprocessor Scheduling



Uniprocessor scheduling : scheduler chooses the
thread to run next
Multiprocessor scheduling : scheduler has to
choose a thread and a CPU
Another complication factor is the thread
relations
Unrelated threads as in multi-user timesharing
environments
 Related threads as in the threads of some application
working together (such as compilers make
command)

26
Thread Scheduling Algorithms



Time-Sharing : unrelated threads
Space Sharing : related threads
Gang Scheduling : related threads
27
Timesharing
Figure 8-12. Using a single data structure for scheduling a
multiprocessor.


16 CPUs all busy, CPU 4 becomes idle, locks scheduling queues
and selects thread A. Next CPU 12 goes idle and chooses thread B
CPUs are time-shared and we have load balancing (no overload,
but work is distributed)
28
Space Sharing (1)

Figure 8-13. A set of 32 CPUs split into four
partitions, with two CPUs available.
Partitioning is based on related (grouped)
threads
29
Space Sharing (2)




Scheduling multiple related threads at the same time
across multiple CPUs is called space sharing
Some applications benefit from this approach such as
compilers make command
Each thread in a group is given its dedicated CPU. This
thread holds on to the CPU until it terminates. If a
thread blocks on I/O, it continues to hold the CPU
If there are not enough CPUs to start all threads of a
group at the same time, the whole group waits until
there are
30
Space Sharing (3)



Space sharing does not use multiprogramming
so we do not have context switching overhead
Time is wasted when a thread blocks for I/O or
for some other event and CPU is idle
Consequently, people have looked for
algorithms that attempt to schedule in both time
and space together, especially for threads that
create multiple threads, which usually need to
communicate with one another
31
Gang Scheduling (1)
Figure 8-14. Communication between two threads belonging to
thread A that are running out of phase.



Consider process A with threads A0 and A1, process B with threads B0 and
B1. A0 and B0 execute on CPU 0, A1 and B1 on CPU 1
A0 sends a message to A1, A1 sends a reply back repeately
A0 sends the message but gets the reply after a delay of 200 msec because of
B’s threads
32
Gang Scheduling (2)
The three parts of gang scheduling:
1. Groups of related threads are scheduled as a unit, a
gang.
2. All members of a gang run simultaneously, on different
timeshared CPUs.
3. All gang members start and end their time slices
together.
33
Gang Scheduling (3)




Time is divided into time slices.
At the start of a time slice, all CPUs are
rescheduled with a new thread in each.
At the start of the next time slice another
scheduling is done. No scheduling is done in
between
If a thread blocks, its CPU stays idle until the
end of the quantum
34
Gang Scheduling (4)

Figure 8-15. Gang scheduling (6 CPUs, 5 processes and 24 threads
in total).
In gang scheduling all threads of a process run together, so that if
one of them sends a request to another one, it will get the message
and reply immediately
35