Multiple Processor Systems
Multiprocessor Systems
• Continuous need for faster and more powerful computers
– Multiprocessor: shared-memory model (access in nsec)
– Multicomputer: message-passing model (access in µsec)
– Distributed system: wide-area model (access in msec)
Multiprocessor
Definition: a computer system in which two or more CPUs share full access to a common RAM
Multiprocessor Hardware (1)
• Bus-based multiprocessors
– per-CPU caches reduce bus contention but raise the memory-coherence problem
Multiprocessor Hardware (2)
• UMA (Uniform Memory Access) multiprocessor using a crossbar switch (n × n crosspoints)
Multiprocessor Hardware (3)
• UMA multiprocessors using multistage switching networks can be built from 2×2 switches
(a) 2×2 switch
(b) Message format
Multiprocessor Hardware (4)
• Omega switching network ((n/2) · log₂ n switches)
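For scale, a quick back-of-the-envelope comparison of the two UMA fabrics (standard counts; only the totals appear on the slides):

    \text{crossbar: } n^2 \text{ crosspoints} \qquad\qquad \text{omega: } \frac{n}{2}\,\log_2 n \text{ switches}

    \text{e.g. } n = 1024:\quad 1024^2 \approx 10^6 \text{ crosspoints vs. } 512 \times 10 = 5120 \text{ switches}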
Multiprocessor Hardware (5)
NUMA Multiprocessor Characteristics
1. Single address space visible to all CPUs
2. Access to remote memory is via ordinary LOAD and STORE instructions
3. Access to remote memory is slower than access to local memory
Performance is worse than in a comparable UMA machine
Multiprocessor OS Types (1)
• Each CPU has its own operating system
• System calls are caught and handled on the CPU that issued them
• No sharing of processes
• No sharing of pages
• Multiple independent buffer caches
Multiprocessor OS Types (2)
• Master-slave multiprocessors
– the master becomes a bottleneck
– the model fails for large multiprocessors
Multiprocessor OS Types (3)
• SMP (Symmetric MultiProcessor)
– simplest approach: only one CPU at a time can run the operating system (one big lock)
– better: the operating system is split into critical regions that can run in parallel, each protected by its own lock
Multiprocessor Synchronization (1)
The TSL (Test and Set Lock) instruction can fail on a multiprocessor if the bus is not locked: another CPU can slip in between its read and its write
TSL must therefore first lock the bus
Special bus support is needed (a spin-lock sketch follows)
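A minimal sketch of the spin lock that TSL enables, using C11's atomic_flag (whose test-and-set maps to an atomic read-modify-write such as TSL or XCHG on real hardware); the names are illustrative:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_lock(void) {
        /* atomic_flag_test_and_set is the C11 form of TSL: it
           atomically reads the old value and sets the flag. */
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin: someone else holds the lock */
    }

    void spin_unlock(void) {
        atomic_flag_clear(&lock);   /* store 0: release the lock */
    }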
Multiprocessor Synchronization (2)
Spinning on TSL generates constant bus traffic; multiple locks (e.g., a private lock per waiting CPU) are used to avoid cache thrashing (a common variant is sketched below)
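One standard way to cut the thrashing is test-and-test-and-set: spin on a plain cached read and attempt the atomic operation only when the lock looks free. This is a common textbook variant, not necessarily the exact scheme the slide refers to:

    #include <stdatomic.h>

    static atomic_int lock = 0;

    void ttas_lock(void) {
        for (;;) {
            /* first "test": spin on a cached read, no bus traffic
               while the lock is held elsewhere */
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;
            /* then "test-and-set": one atomic exchange (TSL/XCHG) */
            if (atomic_exchange(&lock, 1) == 0)
                return;            /* we got it */
        }
    }

    void ttas_unlock(void) {
        atomic_store(&lock, 0);
    }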
Multiprocessor Synchronization (3)
Spinning versus Switching
• In some cases the CPU must wait (spin)
– e.g., while acquiring the lock on the ready list, there is nothing else to switch to
• In other cases a choice exists
– spinning wastes CPU cycles
– switching also uses up CPU cycles (context switch, cache pollution)
– it is possible to make a separate decision each time a locked mutex is encountered (see the sketch below)
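A sketch of that per-encounter decision in its simplest form: spin for a bounded number of iterations, then give up the CPU. The bound and the yield call are illustrative choices, not from the slides:

    #include <stdatomic.h>
    #include <sched.h>            /* sched_yield(): POSIX */

    #define SPIN_LIMIT 1000       /* illustrative threshold */

    void adaptive_lock(atomic_int *lock) {
        for (;;) {
            for (int i = 0; i < SPIN_LIMIT; i++) {
                if (!atomic_load_explicit(lock, memory_order_relaxed)
                    && atomic_exchange(lock, 1) == 0)
                    return;       /* acquired while spinning */
            }
            sched_yield();        /* spun too long: switch instead */
        }
    }

    void adaptive_unlock(atomic_int *lock) {
        atomic_store(lock, 0);
    }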
Multiprocessor Scheduling
• Scheduling on a uniprocessor is one-dimensional: which process to run next
• Scheduling on a multiprocessor is two-dimensional: which process, and on which CPU
• Processes may be unrelated to one another…
• …or related in groups
Multiprocessor Scheduling (1)
Independent processes
• Pure timesharing: a single system-wide data structure is used for scheduling
• Affinity scheduling, a two-level algorithm (sketched below):
– a process is assigned to a CPU when it is created
– each CPU has its own priority list, but an idle CPU takes a process from another CPU
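A minimal sketch of the two-level idea: per-CPU queues plus stealing when idle. The data structures and names are hypothetical, and all locking is omitted for brevity:

    #include <stddef.h>

    #define NCPU 4

    struct proc { struct proc *next; /* ... */ };

    static struct proc *queue[NCPU];       /* one ready queue per CPU */

    /* Level 1: a new process goes to its home CPU's queue. */
    void enqueue(struct proc *p, int cpu) {
        p->next = queue[cpu];
        queue[cpu] = p;
    }

    /* Level 2: a CPU first looks at its own queue, then,
       if idle, steals from the other CPUs. */
    struct proc *pick_next(int cpu) {
        for (int i = 0; i < NCPU; i++) {
            int q = (cpu + i) % NCPU;      /* own queue first, then others */
            if (queue[q]) {
                struct proc *p = queue[q];
                queue[q] = p->next;
                return p;
            }
        }
        return NULL;                       /* nothing runnable anywhere */
    }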
Multiprocessor Scheduling (2)
Related processes
• Space sharing: multiple related processes or threads run at the same time across multiple CPUs (the CPUs are not multiprogrammed)
• Partition sizes may be fixed or dynamically adjusted
Multiprocessor Scheduling (3)
Time sharing and space sharing together:
• Communication between two threads becomes a problem (e.g., A0 and A1)
– both belong to process A
– both run, but out of phase, so their requests and replies keep missing each other
Multiprocessor Scheduling (4)
• Solution: gang scheduling
1. Groups of related threads are scheduled as a unit (a gang)
2. All members of a gang run simultaneously on different timeshared CPUs
3. All gang members start and end their time slices together
All CPUs are scheduled synchronously; time is divided into quanta.
Multiprocessor Scheduling (5)
Gang scheduling example: 6 CPUs, 5 processes (A-E)
In principle, all threads of the same process run together (a toy table follows)
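A toy reconstruction of the kind of CPU-by-quantum table the figure shows; the exact thread placement below is illustrative:

    #include <stdio.h>

    #define NCPU  6
    #define NSLOT 4

    /* Each row is one time quantum; each column one CPU.
       All threads of a gang (same letter) occupy one row,
       so they start and end their slice together. */
    static const char *slot[NSLOT][NCPU] = {
        {"A0","A1","A2","A3","A4","A5"},   /* quantum 0: gang A       */
        {"B0","B1","B2","C0","C1","C2"},   /* quantum 1: gangs B and C */
        {"D0","D1","D2","D3","D4","E0"},   /* quantum 2: gangs D and E */
        {"E1","E2","E3","E4","E5","E6"},   /* quantum 3: rest of gang E */
    };

    int main(void) {
        for (int t = 0; t < NSLOT; t++) {
            printf("quantum %d:", t);
            for (int c = 0; c < NCPU; c++)
                printf(" %s", slot[t][c]);
            printf("\n");
        }
        return 0;
    }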
Multicomputers
• Definition: tightly coupled CPUs that do not share memory
• Also known as
– cluster computers
– clusters of workstations (COWs), farms
• The interconnection network is crucial
Multicomputer Hardware (1)
• Interconnection topologies
(a) single switch
(b) ring
(c) grid (mesh)
(d) double torus
(e) cube
(f) hypercube
Multicomputer Hardware (2)
• Switching scheme: store-and-forward packet switching
• Packet switching vs. circuit switching (see the comparison below)
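As a rough model (standard networking arithmetic, not from the slides): for a packet of p bits crossing h hops over links of bandwidth b bits/sec,

    T_{\text{store-and-forward}} \approx h \cdot \frac{p}{b} \qquad\qquad T_{\text{circuit}} \approx T_{\text{setup}} + \frac{p}{b}

so store-and-forward pays the full transmission time at every hop, while circuit switching pays a one-time setup cost and then streams at link speed.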
Multicomputer Hardware (3)
Network interface boards in a multicomputer
DMA (and an on-board CPU) increases efficiency
Mapping the interface board directly into user space speeds things up, but raises the problems discussed next
Low-Level Communication Software (1)
Problems:
• What if several processes running on a node need network access to send packets?
– the interface board can be mapped into every process that needs it, but then a synchronization mechanism is needed (difficult with multiprogramming)
• What if the kernel needs access to the network too?
– two network boards can be used: one mapped into user space, one into kernel space
• DMA uses physical addresses, while user processes use virtual addresses
– the translation must be handled without system calls
Low-Level Communication Software (2)
Node-to-network-interface communication, with an on-board CPU:
• use send & receive rings, with a bitmap
• the rings coordinate the main CPU with the on-board CPU (sketch below)
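A minimal sketch of one such ring (names and sizes are hypothetical). A full bit per slot tells the two CPUs who owns each entry, so no other locking is needed for the handoff:

    #include <string.h>

    #define RING_SIZE 32
    #define FRAME_MAX 1500

    /* One slot of the send ring. The full bit is the ownership flag:
       0 = slot belongs to the main CPU, 1 = to the on-board CPU.
       (Real hardware would need a memory barrier before setting full.) */
    struct ring_entry {
        volatile int full;
        int  len;
        char frame[FRAME_MAX];
    };

    static struct ring_entry send_ring[RING_SIZE];
    static int head;                    /* main CPU's next slot to fill */

    /* Main CPU side: hand a packet to the on-board CPU to transmit. */
    int ring_put(const char *buf, int len) {
        struct ring_entry *e = &send_ring[head];
        if (e->full)
            return -1;                  /* ring full: board has fallen behind */
        memcpy(e->frame, buf, len);
        e->len  = len;
        e->full = 1;                    /* flip ownership to the on-board CPU */
        head = (head + 1) % RING_SIZE;
        return 0;
    }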
User Level Communication Software
• Minimum services: send and receive commands
• Blocking (synchronous) calls
– (a) blocking send call: the CPU is idle during the transmission
• Non-blocking (asynchronous) calls
– (b) non-blocking send call with kernel copy
– alternatives: with copy, with interrupt on completion, copy on write
(a minimal sketch of the kernel-copy variant follows)
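A sketch of the kernel-copy variant as it appears to the caller (the API and buffer are made up; a real system would transmit kernel_buf asynchronously via DMA or the on-board CPU):

    #include <string.h>

    #define KBUF_MAX 1024

    static char kernel_buf[KBUF_MAX];

    /* Non-blocking send with kernel copy: the "kernel" copies the
       user buffer and returns immediately, so the caller may reuse
       its buffer while transmission proceeds in parallel. */
    void send_nowait(const void *buf, int len) {
        if (len > KBUF_MAX)
            len = KBUF_MAX;             /* sketch: no real error handling */
        memcpy(kernel_buf, buf, len);   /* the copy is the whole trick */
    }

    int main(void) {
        char msg[] = "hello";
        send_nowait(msg, sizeof msg);
        msg[0] = 'H';       /* safe: the in-flight copy is unaffected */
        return 0;
    }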
Remote Procedure Call (1)
Message passing is based on I/O; RPC instead allows programs to call procedures located on a different CPU.
An RPC looks like an ordinary local call.
• Steps in making a remote procedure call (a client-stub sketch follows)
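A toy client stub for a hypothetical remote int add(int, int), showing the marshalling step; the message layout and transport are made up, and the "network" here just hands the request to a local server stub so the sketch is self-contained:

    #include <string.h>
    #include <stdio.h>

    /* Fake transport: "delivers" the request to a local server stub. */
    static void net_transact(const void *req, void *reply) {
        int x, y, r;
        memcpy(&x, req, sizeof x);                  /* server unmarshals */
        memcpy(&y, (const char *)req + sizeof x, sizeof y);
        r = x + y;                                  /* the real procedure */
        memcpy(reply, &r, sizeof r);                /* marshal the result */
    }

    /* Client stub: looks like an ordinary local call, but packs the
       parameters into a message, ships it, and unpacks the reply. */
    int add(int x, int y) {
        char msg[2 * sizeof(int)];
        int  result;
        memcpy(msg, &x, sizeof x);                  /* marshal parameters */
        memcpy(msg + sizeof x, &y, sizeof y);
        net_transact(msg, &result);                 /* send + block for reply */
        return result;
    }

    int main(void) {
        printf("%d\n", add(2, 3));                  /* prints 5 */
        return 0;
    }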
Remote Procedure Call (2)
Implementation issues
• Pointers cannot be passed
– call by reference becomes copy-restore (but this might fail)
• Weakly typed languages
– the client stub cannot determine the size of the parameters
• It is not always possible to determine parameter types
• Global variables cannot be used
– the called procedure may be moved to a remote machine, where it no longer shares them
Distributed Shared Memory (1)
• Note the layers where DSM can be implemented:
– hardware
– operating system
– user-level software
Distributed Shared Memory (2)
Replication
(a) Pages distributed on 4 machines
(b) CPU 0 references page 10
(c) CPU 1 reads (but does not write) page 10, so the page is replicated
Distributed Shared Memory (3)
• DSM moves data in multiples of the page size, so the page size is important
• False sharing: unrelated variables that happen to lie on the same page cause it to ping-pong between machines (a miniature example follows)
• Sequential consistency must also be achieved (what happens on a write to a replicated page?)
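False sharing in miniature, at cache-line rather than page granularity, but the effect is the same. The 64-byte padding is an assumption about the line size:

    #include <pthread.h>

    #define N 100000000L

    /* Two logically unrelated counters. In the "bad" layout they share
       one cache line (or, in DSM, one page), so every increment by one
       thread invalidates the other's copy. The padded layout avoids it. */
    struct bad  { long a; long b; };                       /* shown for contrast */
    struct good { long a; char pad[64 - sizeof(long)]; long b; };

    static struct good counters;

    static void *worker_a(void *arg) {
        (void)arg;
        for (long i = 0; i < N; i++)
            counters.a++;          /* touches only its own line */
        return NULL;
    }

    static void *worker_b(void *arg) {
        (void)arg;
        for (long i = 0; i < N; i++)
            counters.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker_a, NULL);
        pthread_create(&t2, NULL, worker_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }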
Multicomputer Scheduling
Load Balancing (1)
• Each process can only run on the CPU of the node it is located on
• Choice parameters: CPU and memory usage, communication needs, etc.
• Graph-theoretic deterministic algorithm: processes are graph nodes and communication traffic gives the edge weights; partition the graph over the machines so that the traffic crossing partition boundaries is minimized (in symbols below)
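In symbols (a standard formulation; the slide names only the algorithm): with processes as vertices, traffic w_{ij} between processes i and j, and an assignment \pi of processes to machines, the goal is

    \min_{\pi} \sum_{(i,j)\,:\,\pi(i) \neq \pi(j)} w_{ij}

subject to each machine's CPU and memory capacity constraints.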
Load Balancing (2)
Overloaded sender
• Sender-initiated distributed heuristic algorithm
– processes run on the CPUs that created them, unless the node is overloaded
– an overloaded node probes other nodes (e.g., picked at random) and offloads work to one that is lightly loaded
Load Balancing (3)
Under-loaded receiver
• Receiver-initiated distributed heuristic algorithm
– a node that becomes under-loaded asks other nodes for work (a sketch of both heuristics follows)
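A self-contained sketch of the two heuristics side by side; the node count, threshold, and load array are all made up stand-ins for per-node run queues and network probes:

    #include <stdlib.h>

    #define NNODE     8
    #define THRESHOLD 4                    /* illustrative load limit */

    static int load[NNODE];                /* stand-in for per-node run queues */

    /* Sender-initiated: a process runs where it was created, unless
       that node is overloaded; then probe one random node and hand
       the work over if that node is lightly loaded. */
    void on_process_created(int here) {
        if (++load[here] > THRESHOLD) {
            int n = rand() % NNODE;        /* one probe, chosen at random */
            if (n != here && load[n] < THRESHOLD) {
                load[here]--;              /* migrate the process */
                load[n]++;
            }
        }
    }

    /* Receiver-initiated: an under-loaded node goes looking for work. */
    void on_idle(int here) {
        int n = rand() % NNODE;
        if (load[n] > THRESHOLD) {         /* found an overloaded node */
            load[n]--;
            load[here]++;
        }
    }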