
Cellular Disco: resource
management using virtual clusters
on shared memory multiprocessors
Published at ACM SOSP 1999 by K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum.
Presenter: Soumya Eachempati
Motivation
• Large-scale shared-memory multiprocessors
– Large number of CPUs (32-128)
– NUMA Architectures
• Off-the-shelf OS not scalable
– Cannot handle large number of resources
– Memory management not optimized for NUMA
– No fault containment
Existing Solutions
• Hardware partitioning
– Provides fault containment
– Rigid resource allocation
– Low resource utilization
– Cannot dynamically adapt to workload
• New Operating System
– Provides flexibility and efficient resource management.
– Considerable effort and time
Goal: To exploit hardware resources to the fullest
with minimal effort while improving flexibility
and fault-tolerance.
Solution: Disco (VMM)
– Virtual machine monitor
– Addresses NUMA awareness and scalability issues
Issues not dealt with by Disco:
– Hardware fault tolerance/containment
– Resource management policies
Cellular DISCO
• Approach: Convert the multiprocessor machine
into a virtual cluster
• Advantages:
– Inherits the benefits of Disco
– Can support legacy OSes transparently
– Combines the strengths of hardware partitioning and a new OS
– Provides fault containment
– Fine-grained resource sharing
– Less effort than developing a new OS
Cellular DISCO
• Internally structured into semi-independent
cells.
• Much less development effort compared to
HIVE
• No performance loss, even with fault containment.
• Warranted design decision: the Cellular Disco code itself is assumed to be correct (it is a small, trusted system software layer).
Cellular Disco Architecture
Resource Management
• Over-commits resources
• Gives flexibility to adjust the fraction of resources assigned to each VM.
• Restrictions on resource allocation due to fault
containment.
• Both CPU and memory load balancing under constraints.
– Scalability
– Fault containment
– Avoid contention
• First-touch allocation, dynamic migration, and replication of hot memory pages (sketch below)
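
A minimal C sketch of how the NUMA policies in the last bullet can fit together: place a page on the node that first touches it, then migrate or replicate it once sampled remote misses mark it as hot. The threshold, counters, and structure layout are illustrative assumptions, not the paper's actual implementation.

/* Sketch: first-touch placement plus counter-driven migration or
 * replication of hot pages. Thresholds and structures are illustrative
 * assumptions, not Cellular Disco's actual data layout. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES     32
#define HOT_THRESHOLD 64     /* assumed miss count before the page is "hot" */

enum numa_action { KEEP, REPLICATE, MIGRATE };

struct page_info {
    int      home_node;                /* -1 until the first touch */
    bool     read_only;                /* read-only pages can be replicated */
    uint32_t miss_count[NUM_NODES];    /* sampled remote misses per node */
};

/* First-touch allocation: place the page on the node of the first CPU
 * that accesses it. */
static void on_first_touch(struct page_info *pg, int accessing_node)
{
    if (pg->home_node < 0)
        pg->home_node = accessing_node;
}

/* Called on a sampled cache miss from 'node'. Hot read-only pages get
 * replicated; hot writable pages get migrated toward the accessing node. */
static enum numa_action on_sampled_miss(struct page_info *pg, int node)
{
    if (node == pg->home_node)
        return KEEP;
    if (++pg->miss_count[node] < HOT_THRESHOLD)
        return KEEP;
    pg->miss_count[node] = 0;
    return pg->read_only ? REPLICATE : MIGRATE;
}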
Hardware Virtualization
• VM’s interface mimics the underlying H/W.
• Virtual machine resources (user-defined)
– VCPUs, memory, I/O devices (physical resources)
• Physical vs. machine resources (machine resources allocated dynamically, based on VM priority)
– VCPUs mapped onto real CPUs
– Physical pages mapped onto machine pages
• VMM intercepts privileged instructions
– Three modes: user and supervisor (guest OS), kernel (VMM)
– In supervisor mode, all memory accesses are mapped
• Allocates machine memory to back the physical memory
• Pmap and memmap data structures (sketch below)
• Second-level software TLB (L2TLB)
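
A minimal C sketch of the forward pmap lookup (physical page to machine page) with the memmap reverse entry kept in sync; the L2TLB would cache the resulting virtual-to-machine translations so most TLB misses avoid this path. Field names and the stand-in allocator are assumptions for illustration.

/* Sketch: pmap (physical page -> machine page) lookup with lazy backing,
 * and the memmap reverse entry kept in sync. */
#include <stdint.h>

typedef uint64_t pfn_t;          /* physical page number (the VM's view)  */
typedef uint64_t mfn_t;          /* machine page number (real hardware)   */

#define INVALID_MFN ((mfn_t)-1)

struct pmap_entry   { mfn_t machine_page; };                 /* forward map */
struct memmap_entry { int owner_vm; pfn_t physical_page; };  /* reverse map */

/* Stand-in allocator: a real one would prefer a free page on 'node'. */
static mfn_t next_free_mfn;
static mfn_t alloc_machine_page(int node) { (void)node; return next_free_mfn++; }

/* Translate a guest physical page to a machine page, allocating backing
 * memory on first use; the guest OS never sees machine addresses. */
static mfn_t pmap_translate(struct pmap_entry *pmap,
                            struct memmap_entry *memmap,
                            int vm_id, pfn_t pfn, int preferred_node)
{
    struct pmap_entry *pe = &pmap[pfn];

    if (pe->machine_page == INVALID_MFN) {
        mfn_t mfn = alloc_machine_page(preferred_node);
        pe->machine_page = mfn;
        memmap[mfn].owner_vm = vm_id;
        memmap[mfn].physical_page = pfn;
    }
    return pe->machine_page;
}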
Hardware fault containment
• The VMM provides software fault containment.
• Cell: the unit of fault containment.
• Inter-cell communication
– Inter-processor RPC
– Messages: no locking needed, since handling is serialized
– Shared memory for some data structures (pmap, memmap)
– Low latency, exactly-once semantics (sketch below)
• Trusted system software layer: enables the use of shared memory across cells.
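
A minimal C sketch of inter-cell messaging with per-sender sequence numbers, giving duplicate suppression ("exactly once" from the handler's point of view) and lock-free handler state because messages for a cell are dispatched serially. The queue layout, field names, and opcodes are assumptions, not Cellular Disco's actual message format.

/* Sketch: inter-cell messages with duplicate suppression. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_CELLS 8

struct cell_msg {
    int      src_cell;
    uint64_t seq;                /* per-sender sequence number */
    int      opcode;             /* e.g. page borrow, VCPU migration */
    uint64_t arg;
};

/* Per-cell receive state. Messages for a cell are dispatched by a single
 * handler at a time, so this state needs no locking. */
struct cell_rx_state {
    uint64_t last_seq_seen[MAX_CELLS];
};

/* Returns true if the message should be handled, false if it is a
 * retransmitted duplicate that was already processed. */
static bool accept_message(struct cell_rx_state *rx, const struct cell_msg *m)
{
    if (m->seq <= rx->last_seq_seen[m->src_cell])
        return false;            /* duplicate: drop to keep exactly-once */
    rx->last_seq_seen[m->src_cell] = m->seq;
    return true;
}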
Implementation 1: MIPS R10000
• 32-processor SGI Origin 2000
• Piggybacked on IRIX 6.4 (host OS)
• Guest OS: IRIX 6.2
• Spawns Cellular Disco (CD) as a multithreaded kernel process
– Additional overhead < 2% (time spent in host IRIX)
– No fault isolation: the IRIX kernel is monolithic
• Solution: some host OS support needed; one copy of the host OS per cell
I/O Request execution
• Cellular Disco piggybacked on IRIX kernel
Characteristics of workloads
• Database: decision-support workload
• Pmake: I/O-intensive workload
• Raytrace: CPU-intensive workload
• Web: kernel-intensive web-server workload
Virtualization Overheads
Fault-containment Overheads
• Left bar: single-cell configuration; right bar: 8-cell system
CPU Management
• Load balancing mechanisms:
– Three types of VCPU migrations: intra-node, inter-node, inter-cell (sketch below)
– Intra-node: loss of CPU cache affinity
– Inter-node: cost of copying the L2TLB, higher long-term cost
– Inter-cell: loss of both cache and node affinity, increases fault vulnerability
• Alleviates the penalty by replicating pages
• Load balancing policies: idle (local load stealer) and periodic (global redistribution) balancers
• Each CPU has a local run queue of VCPUs
• Gang scheduling
– Run all VCPUs of a VM simultaneously
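
A minimal C sketch of how a balancer might weigh the three migration types and the extra fault exposure of leaving a cell; the numeric weights and the vulnerability flag are assumptions chosen for illustration, not values from the paper.

/* Sketch: relative cost of moving a VCPU, per the bullets above. */
#include <stdbool.h>

/* Assumed relative penalties (cache refill, L2TLB copy, lost replication). */
#define COST_INTRA_NODE         1
#define COST_INTER_NODE         4
#define COST_INTER_CELL        16
#define COST_NEW_VULNERABILITY 64   /* widening the VM's fault exposure */

struct placement {
    int node, cell;
};

/* 'already_vulnerable' is true if the VM already depends on dst->cell,
 * so migrating there adds no new fault exposure. */
static int migration_cost(const struct placement *src,
                          const struct placement *dst,
                          bool already_vulnerable)
{
    if (src->cell != dst->cell)
        return COST_INTER_CELL +
               (already_vulnerable ? 0 : COST_NEW_VULNERABILITY);
    if (src->node != dst->node)
        return COST_INTER_NODE;   /* must copy the VCPU's L2TLB */
    return COST_INTRA_NODE;       /* only CPU cache affinity is lost */
}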
Load Balancing
• Low-contention distributed data structure: the load tree (sketch below).
• Contention can still arise on the higher-level tree nodes.
• Each VCPU carries the list of cells it is vulnerable to.
• Under heavy load, the idle balancer alone is not enough.
• A local periodic balancer redistributes load within each 8-CPU region.
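
A minimal C sketch of an array-based load tree: leaves hold per-CPU run-queue lengths, internal nodes hold subtree totals, and an idle CPU climbs from its own leaf until it finds a loaded sibling subtree, then descends toward the heavier child to pick a victim. The exact search and update policy (and the lack of atomics) is an assumption for illustration.

/* Sketch: implicit binary load tree over the CPUs. */
#include <stdint.h>

#define NUM_CPUS 32                      /* must be a power of two        */
#define TREE_SIZE (2 * NUM_CPUS)         /* 1-based implicit binary tree  */

static int32_t load_tree[TREE_SIZE];     /* index 1 is the root */

static int leaf_of(int cpu) { return NUM_CPUS + cpu; }

/* Update a CPU's leaf and propagate the delta toward the root
 * (a real monitor would use atomic updates here). */
static void load_tree_update(int cpu, int delta)
{
    for (int i = leaf_of(cpu); i >= 1; i /= 2)
        load_tree[i] += delta;
}

/* Idle balancer: climb until some sibling subtree has work, then descend
 * into the heavier child to locate a CPU to steal from. Returns a CPU id,
 * or -1 if no other CPU has queued VCPUs. */
static int load_tree_find_victim(int idle_cpu)
{
    int i = leaf_of(idle_cpu);
    while (i > 1 && load_tree[i ^ 1] == 0)   /* i^1 is the sibling node */
        i /= 2;
    if (i == 1)
        return -1;
    for (i ^= 1; i < NUM_CPUS; )             /* descend the heavier child */
        i = (load_tree[2 * i] >= load_tree[2 * i + 1]) ? 2 * i : 2 * i + 1;
    return i - NUM_CPUS;
}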
CPU Scheduling and Results
• Scheduling: pick the highest-priority gang-runnable VCPU that has been waiting, then send out RPCs to start the rest of the gang (sketch below).
• 3 configurations on 32 processors:
– a) One VM with 8 VCPUs running an 8-process raytrace
– b) 4 VMs
– c) 8 VMs (64 VCPUs in total)
• Pmap migrated only when all VCPUs are migrated out of a cell.
• Data pages also migrated, for independence.
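
A minimal C sketch of that decision: scan the local run queue for the highest-priority VCPU whose whole VM is runnable, then RPC the CPUs holding its sibling VCPUs so the gang starts together. The structures, the tie-breaking rule, and the send_schedule_rpc hook are assumptions for illustration.

/* Sketch: gang-scheduling pick on one CPU. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vm;

struct vcpu {
    struct vm   *vm;
    int          priority;
    uint64_t     waiting_since;      /* lower = waiting longer */
    int          assigned_cpu;
    struct vcpu *next;               /* run-queue link */
};

struct vm {
    int          num_vcpus;
    int          runnable_vcpus;     /* how many are ready right now */
    struct vcpu *vcpus;              /* array of num_vcpus entries */
};

/* Assumed hook: ask another CPU (via inter-processor RPC) to run a VCPU. */
static void send_schedule_rpc(int cpu, struct vcpu *v) { (void)cpu; (void)v; }

static bool gang_runnable(const struct vm *vm)
{
    return vm->runnable_vcpus == vm->num_vcpus;
}

/* Pick from this CPU's local run queue and kick off the rest of the gang. */
static struct vcpu *gang_schedule(struct vcpu *runq, int this_cpu)
{
    struct vcpu *best = NULL;
    for (struct vcpu *v = runq; v != NULL; v = v->next) {
        if (!gang_runnable(v->vm))
            continue;
        if (best == NULL ||
            v->priority > best->priority ||
            (v->priority == best->priority &&
             v->waiting_since < best->waiting_since))
            best = v;
    }
    if (best == NULL)
        return NULL;
    for (int i = 0; i < best->vm->num_vcpus; i++) {
        struct vcpu *peer = &best->vm->vcpus[i];
        if (peer != best && peer->assigned_cpu != this_cpu)
            send_schedule_rpc(peer->assigned_cpu, peer);
    }
    return best;                      /* caller context-switches to it */
}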
Memory Management
• Each cell has its own freelist of pages
indexed by the home node.
• Page allocation request (sketch below)
– Satisfied from local node
– Else satisfied from same cell
– Else borrowed from another cell
• Memory balancing
– Low memory threshold for borrowing and lending
– Each VM has a priority list of lender cells
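
A minimal C sketch of the page allocation order above: the requesting node first, then any node in the same cell, then a lender cell from the VM's priority list. The structures and the stand-in freelist helper are assumptions, not the actual memmap/freelist layout.

/* Sketch: allocation fallback chain for a page request. */
#include <stdint.h>

typedef int64_t mfn_t;
#define NO_PAGE ((mfn_t)-1)

#define NODES_PER_CELL 4
#define MAX_LENDERS    8

struct cell {
    int   first_node;                  /* first node owned by this cell      */
    int   free_count[NODES_PER_CELL];  /* pages left on each node's freelist */
    mfn_t next_free[NODES_PER_CELL];   /* stand-in for the real freelists    */
};

struct vm_policy {
    struct cell *home_cell;
    struct cell *lenders[MAX_LENDERS]; /* priority-ordered lender cells */
    int          num_lenders;
};

/* Pop a free page whose home is 'node' within cell 'c'. */
static mfn_t take_from_node(struct cell *c, int node)
{
    int slot = node - c->first_node;
    if (c->free_count[slot] == 0)
        return NO_PAGE;
    c->free_count[slot]--;
    return c->next_free[slot]++;       /* a real freelist would unlink a page */
}

static mfn_t alloc_page(struct vm_policy *p, int requesting_node)
{
    /* 1. the node the faulting VCPU runs on (best NUMA locality) */
    mfn_t mfn = take_from_node(p->home_cell, requesting_node);
    if (mfn != NO_PAGE)
        return mfn;

    /* 2. any other node in the same cell (no new fault dependency) */
    for (int n = 0; n < NODES_PER_CELL; n++) {
        mfn = take_from_node(p->home_cell, p->home_cell->first_node + n);
        if (mfn != NO_PAGE)
            return mfn;
    }

    /* 3. borrow from lender cells, in the VM's priority order */
    for (int i = 0; i < p->num_lenders; i++) {
        struct cell *lender = p->lenders[i];
        mfn = take_from_node(lender, lender->first_node);
        if (mfn != NO_PAGE)
            return mfn;
    }
    return NO_PAGE;   /* fall back to memory balancing / paging */
}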
Memory Paging
• Page Replacement
– Second-chance FIFO (sketch below)
• Avoids double paging overheads.
• Tracking used pages
– Use annotated OS routines
• Page Sharing
– Explicit marking of shared pages
• Redundant Paging
– Avoided by trapping every access to the virtual paging disk
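
A minimal C sketch of second-chance FIFO replacement over a ring of machine-page frames; in practice the reference information would come from the annotated guest routines and the pmap, and the ring layout here is an assumption for illustration.

/* Sketch: second-chance FIFO victim selection. */
#include <stdbool.h>

#define RING_SIZE 1024

struct frame {
    long page;          /* machine page backed by this frame */
    bool referenced;    /* set when the page is touched */
};

static struct frame ring[RING_SIZE];
static int hand;        /* FIFO position */

/* Pick a victim: recently referenced pages get one more pass (their bit
 * is cleared); the first unreferenced page the hand reaches is evicted. */
static int pick_victim(void)
{
    for (;;) {
        struct frame *f = &ring[hand];
        int slot = hand;
        hand = (hand + 1) % RING_SIZE;
        if (f->referenced) {
            f->referenced = false;   /* second chance */
            continue;
        }
        return slot;                 /* evict this frame */
    }
}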
Implementation 2: FLASH
Simulation
• FLASH has hardware fault recovery support
• Simulation of FLASH architecture on SimOS
• Use Fault injector
– Power failure
– Link failure
– Firmware failure (?)
• Results: 100% fault containment
Fault Recovery
• Hardware support needed
– Determine what resources are operational
– Reconfigure the machine to use good resources
• Cellular Disco recovery
– Step 1: All cells agree on a liveset of nodes
– Step 2: Abort RPCs/messages to dead cells
– Step 3: Kill VMs dependent on failed cells
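
A minimal C sketch of the three recovery steps above, with the liveset agreement reduced to a bitmask intersection of what each surviving cell observed; the helper names and stubs are assumptions, not the monitor's actual recovery code.

/* Sketch: liveset agreement and per-dead-cell cleanup. */
#include <stdint.h>

#define MAX_CELLS 8

typedef uint32_t liveset_t;          /* bit i set => cell i is alive */

/* Stubs for the sketch; the real monitor would walk RPC and memmap state. */
static void abort_pending_rpcs_to(int cell)  { (void)cell; }
static void kill_vms_dependent_on(int cell)  { (void)cell; }

/* Step 1: every surviving cell agrees on the same liveset, here by
 * intersecting the livesets each survivor observed from the hardware. */
static liveset_t agree_on_liveset(const liveset_t *observed, int num_cells,
                                  liveset_t survivors)
{
    liveset_t agreed = survivors;
    for (int c = 0; c < num_cells; c++)
        if (survivors & (1u << c))
            agreed &= observed[c];
    return agreed;
}

/* Steps 2 and 3: drop communication with dead cells, then terminate any VM
 * that depended on resources (CPUs, memory) owned by a dead cell. */
static void recover(liveset_t agreed, int num_cells)
{
    for (int c = 0; c < num_cells; c++) {
        if (agreed & (1u << c))
            continue;                 /* cell survived */
        abort_pending_rpcs_to(c);     /* step 2 */
        kill_vms_dependent_on(c);     /* step 3 */
    }
}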
Fault-recovery Times
• Recovery times are higher for larger memories
– Requires memory scanning for fault detection
Summary
• Virtual Machine Monitor
– Flexible Resource Management
– Legacy OS support
• Cellular Disco
– Cells provide fault-containment
– Creates a virtual cluster
– Needs hardware support for fault recovery