The Multikernel: A New OS Architecture for Scalable Multicore



The Multikernel: A New OS Architecture
for Scalable Multicore Systems
Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca
Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania.
The multikernel: a new OS architecture for scalable multicore systems. In
Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems
Principles (SOSP '09). ACM, New York, NY, USA, 29–44.
Introduction
• Multicore systems are increasingly common
• Similar to HPC systems
• But run more generic, OS-intensive workloads
• Shared memory with lock-protected structures causes scalability
problems
• Solutions:
1. Message passing between cores
2. Hardware-independent system structure
3. Replicated rather than shared state
Motivations
• Diverse hardware
• Hardware-specific optimizations cannot be reused on other
hardware
• Heterogeneous cores
• Interconnects resemble message-passing networks
• Optimization issues:
• Sun Niagara reader-writer lock
• Windows 7 dispatcher lock
Messaging vs. Shared Memory
• Messaging issues:
• No access to shared data
• Requires event-style programming
• Claims:
• The convenience of shared data is superficial
• Fine tuning requires developers to think in terms of cache-coherence messages and protocols
• Monolithic kernels are essentially event-driven already
• Message passing is already used in user interfaces, some network servers,
and large-scale computations
Multikernel
1. All inter-core communication is explicit
• The only shared memory is the message-passing channels
• Exposes what is accessed, by whom, and when
• Enables pipelining and batching optimizations
• Calls can be made and work can continue (like FlexSC)
2. The OS structure is hardware-independent
• Only the CPU drivers, device drivers, and message-passing
mechanisms are architecture-specific
3. State is replicated rather than shared
• No shared memory; increases scalability
Goals
• Comparable performance to existing systems
• Evidence of scalability
• Can be retargeted to different hardware
• Can use pipelined and batched messages to improve
performance
• Can adapt OS functionality to load or hardware
Implementation
• Test platforms
• 2×4-core Intel Xeon
• 2×2-, 4×4-, and 8×4-core AMD Opteron
• System structure
• Privileged-mode CPU driver (local to each core)
• User-mode monitor (inter-core communication)
• Microkernel design
Details
• CPU Driver
• Provides protection, time slices, access to hardware
• Event driven, single threaded, non-preemptable
• Local messaging between processes
• Monitors
• Coordinate system-wide state
• Run in user space; schedulable
• Maintain state replication/data consistency via agreement protocols
• Handle process wakeup, IPC setup, and core idling
• Processes
• Dispatch objects
• Scheduled by local CPU driver
Details
• Inter-core communication
• User-level RPC (shared-memory channel)
• Receiving is done by polling
• Optimized for the cache-coherence protocol
• Fewer intra-core context switches
Details
• Memory management
• Needs consistent memory allocation across cores
• Capability-based
• Shared address spaces
• Either share a hardware page table or replicate it with messages
• Capabilities must also be shared between cores
• Monitors provide the sharing between cores
• Dispatchers can start/stop and migrate threads
Details
• System Knowledge Base (SKB)
• Stores hardware information
• Other thoughts
• CPU/Monitor separation, Network stack, Shared state
TLB Shootdown
• Requires global synchronization
• Uses messages instead of IPIs (IPIs are lower latency, but invasive)
Computation
• Ran OpenMP and SPLASH-2 benchmarks on a 4×4-core AMD system
• Showed performance similar to Linux on FFT, Barnes-Hut, and
radiosity computations
Other Measurements
• Network throughput: equal for both systems at ≈ 951 Mbit/s
• Web server
• Barrelfish – 640 Mbit/s
• Linux – 316 Mbit/s
• Avoids context switches by running in user-space
• IP Loopback
• Barrelfish – 2154 Mbit/s
• Linux – 1823 Mbit/s
• Also avoids kernel crossings and shared memory
Related Work
• Performance optimization by reducing sharing
• Tornado, K42, Corey
• Microkernel vs Multikernel
• Similar message passing design
• Core management is different
• Other distributed systems
• Higher latency links
Future Work
• Improving the message queues
• More efficient message passing built on cache coherence
• Porting to ARM processors
• File systems – the current system is backed by NFS
Critique
• Good idea: message passing between multiple individually
managed cores, plus state replication
• Separation of CPU driver and monitor
• Similar to microkernel design
• Could message passing work in Linux?
• HPC/server-specific benchmarks
• Multicore OS benchmarks: we can do better
• A possible approach to improving virtualization?