Disco: Running Commodity Operating Systems on Scalable Multiprocessors


Kit Cischke
09/09/08
CS 5090
DISCO: RUNNING COMMODITY
OPERATING SYSTEMS ON
SCALABLE MULTIPROCESSORS
Overview
 Background
 What are we doing here?
 A Return to Virtual Machine Monitors
 What does Disco do?
 Disco: A Return to VMMs
 How does Disco do it?
 Experimental Results
 How well does Disco dance?
The Basic Problem
 With the explosion of multiprocessor
machines, especially of the NUMA variety,
the problem of effectively using these machines
becomes more immediate.
 NUMA = Non-Uniform Memory Access – shows up
a lot in clusters.
 The authors point out that the problem applies to
any major hardware innovation, not just
multiprocessors.
Potential Solution
 Solution: Rewrite the operating system to
address fault-tolerance and scalability.
 Flaws:
 Rewriting will introduce bugs.
 Bugs can disrupt the system or the applications.
 Instabilities are usually less tolerated on these
kinds of systems because of their application
space.
 You may not have access to the OS source code.
Not So Good
 Okay. So that wasn’t so good. What else do
we have?
 How about Virtual Machine Monitors?
 A new twist on an old idea, which may work
better now that we have faster processors.
Enter Disco
•Disco is a system VM
that presents the same
fundamental machine
abstraction to all of the
various OS's that
might be running on
the machine.
•These can be
commodity OS’s,
uniprocessor,
multiprocessor or
specialty systems.
Disco VMM
 Fundamentally, the virtualized system looks like
a cluster, but Disco adds global policies to
manage all of the resources, which makes for
better use of the hardware than a real cluster.
 We’ll use commodity operating systems and
write the VMM. Rather than millions of lines
of code, we’ll write a few thousand.
 What if an application's resource needs exceed
what a single commodity OS can scale to?
Scalability
 Very simple changes to the commodity OS
(maybe on the driver level or kernel
extension) can allow virtual machines to
share resources.
 E.g., a parallel database could have a cache in
shared memory and multiple virtual processors
running on virtual machines.
 Support for specialized OS’s that need the
power of multiple processors but not all of
the features offered by a commodity OS.
Further Benefits
 Running multiple copies of an OS naturally
addresses scalability and fault containment.
 Need greater scaling? Add a VM.
 Only the monitor and the system protocols (NFS, etc.)
need to scale.
 OS or application crashes? No problem. The rest of
the system is isolated.
 NUMA memory management issues are
addressed.
 Multiple versions of different OS’s provide legacy
support and convenient upgrade paths.
Not All Sunshine & Roses
 VMM Overhead
 Additional exception processing, instruction execution
and memory to virtualize hardware.
 Privileged instructions aren’t directly executed on the
hardware, so we need to fake it. I/O requests need to
be intercepted and remapped.
 Memory overhead is rough too.
 Consider having 6 copies of Vista in memory
simultaneously.
 Resource Management
 VMM can’t make intelligent decisions about code
streams without info from OS.
One Last Disadvantage
 Communication
 Sometimes resources simply can’t be shared the
way we want.
 Most of these can be mitigated though.
 For example, most operating systems have good
NFS support. So use it.
 But… We can make it even better! (Details
forthcoming.)
Introducing Disco
 VMM designed for the FLASH multiprocessor
machine
 FLASH is an academic machine designed at
Stanford University
 It is a collection of nodes, each containing a
processor, memory, and I/O. The nodes use directory-
based cache coherence, which makes the whole thing
look like a CC-NUMA machine.
 Disco has also been ported to a number of other
machines.
Disco’s Interface
 The virtual CPU of Disco is an abstraction of a
MIPS R10000.
 Not only emulates but extends the processor (e.g.,
it reduces some kernel operations to simple
load/store instructions).
 Disco presents an abstraction of contiguous
physical memory starting at address 0 (zero).
 I/O Devices
 Disks, network interfaces, interrupts, clocks, etc.
 Special interfaces for network and disks.
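Not in the paper, but as a rough mental model, the per-VM state behind this interface might look something like the C sketch below. All names, sizes, and fields are invented for illustration, not Disco's actual data structures.

#include <stdint.h>

#define NUM_GP_REGS     32
#define NUM_TLB_ENTRIES 64

/* Hypothetical sketch of the state a monitor like Disco keeps per virtual
 * CPU: the MIPS R10000 register file, the virtualized privileged (CP0)
 * registers, and the contents of the virtual TLB. */
struct vcpu {
    uint64_t gpr[NUM_GP_REGS];    /* general-purpose registers        */
    uint64_t pc;                  /* program counter                  */
    uint64_t cp0_status;          /* virtualized privileged registers */
    uint64_t cp0_cause;
    uint64_t cp0_entryhi;
    struct { uint64_t hi, lo0, lo1; } tlb[NUM_TLB_ENTRIES];
};

/* A virtual machine: its virtual CPUs plus the machine memory backing its
 * "physical" memory, which the guest sees as starting at address 0. */
struct vm {
    struct vcpu cpus[8];
    uint64_t    mem_base;   /* machine address backing guest PA 0 */
    uint64_t    mem_size;
};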
Disco’s Implementation
 Implemented as a multi-threaded shared-
memory program.
 Careful attention paid to memory placement,
cache-aware data structures and processor
communication patterns.
 Disco is only 13,000 lines of code.
 Windows Server 2003 - ~50,000,000
 Red Hat 7.1 - ~ 30,000,000
 Mac OS X 10.4 - ~86,000,000
Disco’s Implementation
 The execution of a virtual processor is
mapped one-for-one to a real processor.
 At each context switch, the real processor's state
is set to that of the scheduled VP.
 On MIPS, Disco itself runs in kernel mode and puts
the processor in the appropriate mode for
whatever is being run:
 Supervisor mode for the guest OS, user mode for apps.
 Simple scheduler allows VP’s to be time-
shared across the physical processors.
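A minimal sketch of that mapping, assuming hypothetical save/load/mode-switch stubs (none of these names come from the paper; in a real monitor they would be thin assembly routines):

#include <stdbool.h>

struct vcpu;   /* register state of a virtual processor (see earlier sketch) */

/* Assumed stubs, declared only for illustration. */
void save_cpu_state(struct vcpu *v);
void load_cpu_state(const struct vcpu *v);
void set_cpu_mode_supervisor(void);   /* guest kernel runs in supervisor mode */
void set_cpu_mode_user(void);         /* guest applications run in user mode  */

/* Hypothetical context switch: the real processor's state is made to be
 * that of the next virtual processor, and the hardware is dropped out of
 * kernel mode (which the monitor reserves for itself) into supervisor or
 * user mode before resuming direct execution. */
void vcpu_switch(struct vcpu *prev, struct vcpu *next, bool next_in_guest_kernel)
{
    save_cpu_state(prev);
    load_cpu_state(next);
    if (next_in_guest_kernel)
        set_cpu_mode_supervisor();
    else
        set_cpu_mode_user();
    /* execution now resumes at the virtual processor's saved PC */
}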
Disco’s Implementation
 Virtual Physical Memory
 This discussion goes on for 1.5 pages. To sum up:
 The OS makes requests to physical addresses, and
Disco translates them to machine addresses.
 Disco uses the hardware TLB for this.
 Switching a different VP onto a new processor
requires a TLB flush, so Disco maintains a 2nd-level
TLB to offset the performance hit.
 There's a technical issue with TLBs, kernel space,
and the MIPS processor that threw them for a loop:
MIPS kernels normally live in the unmapped KSEG0
segment, which bypasses the TLB, so IRIX had to be
relinked to run in mapped address space before Disco
could relocate its memory. (Rough sketch of the
translation path below.)
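Here is a toy C sketch of the physical-to-machine translation path described above: on a hardware TLB miss, check a software second-level TLB first, then fall back to the per-VM pmap. All names, sizes, and layouts are invented, not Disco's actual structures.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMAP_PAGES 1024      /* guest "physical" pages per VM (toy size) */
#define L2TLB_SIZE 256       /* software second-level TLB entries        */

/* Per-VM pmap: guest physical page number -> machine page number. */
struct pmap { uint32_t machine_pfn[PMAP_PAGES]; };

/* Software TLB entry cached to avoid re-walking the pmap after flushes. */
struct l2tlb_entry { uint32_t phys_pfn; uint32_t machine_pfn; int valid; };
static struct l2tlb_entry l2tlb[L2TLB_SIZE];

/* Translate a guest physical address to a machine address: hit the
 * second-level TLB if possible, otherwise walk the pmap and refill it.
 * (No bounds checking; this is only a sketch.) */
static uint64_t translate(struct pmap *pm, uint64_t phys_addr)
{
    uint32_t pfn = (uint32_t)(phys_addr >> PAGE_SHIFT);
    uint32_t off = (uint32_t)(phys_addr & ((1u << PAGE_SHIFT) - 1));
    struct l2tlb_entry *e = &l2tlb[pfn % L2TLB_SIZE];

    if (!e->valid || e->phys_pfn != pfn) {     /* second-level TLB miss */
        e->phys_pfn    = pfn;
        e->machine_pfn = pm->machine_pfn[pfn]; /* walk the pmap         */
        e->valid       = 1;
    }
    return ((uint64_t)e->machine_pfn << PAGE_SHIFT) | off;
}

int main(void)
{
    struct pmap pm;
    for (uint32_t i = 0; i < PMAP_PAGES; i++)
        pm.machine_pfn[i] = 0x4000 + i;        /* fake contiguous backing */
    printf("phys 0x2010 -> machine 0x%llx\n",
           (unsigned long long)translate(&pm, 0x2010));
    return 0;
}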
NUMA Memory Management
•In an effort to mitigate the non-uniform memory
access effects of a NUMA machine, Disco does a
bunch of stuff:
•Allocating as much memory as possible to have
"affinity" to the processor that uses it.
•Migrating or replicating pages across virtual
machines to reduce long-latency memory accesses.
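As a hedged illustration of the kind of policy involved (the real one is driven by FLASH's hardware miss counters and is considerably more elaborate), a toy decision function might look like this; the thresholds and names are invented:

#include <stdint.h>
#include <stdio.h>

#define NUM_NODES     4
#define HOT_THRESHOLD 64   /* invented threshold */

/* Toy per-page statistics, standing in for hardware miss counters. */
struct page_stats {
    uint32_t misses_from[NUM_NODES];  /* cache misses per requesting node */
    int      home_node;               /* node currently holding the page  */
    int      read_only;               /* nonzero if only read-shared      */
};

enum action { KEEP, MIGRATE, REPLICATE };

static enum action decide(const struct page_stats *p, int *target_node)
{
    int hot_nodes = 0, hottest = p->home_node;
    uint32_t hottest_misses = 0;

    for (int n = 0; n < NUM_NODES; n++) {
        if (n == p->home_node) continue;
        if (p->misses_from[n] >= HOT_THRESHOLD) {
            hot_nodes++;
            if (p->misses_from[n] > hottest_misses) {
                hottest_misses = p->misses_from[n];
                hottest = n;
            }
        }
    }
    if (hot_nodes == 0) return KEEP;
    if (hot_nodes == 1) { *target_node = hottest; return MIGRATE; }
    /* Hot from several nodes: replicate if read-only, otherwise leave it. */
    return p->read_only ? REPLICATE : KEEP;
}

int main(void)
{
    struct page_stats p = { {10, 200, 0, 0}, 0, 0 };
    int target = -1;
    printf("action=%d target=%d\n", decide(&p, &target), target);
    return 0;
}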
Virtual I/O Devices
 Obviously Disco needs to intercept I/O
requests and direct them to the actual device.
 Primarily handled by installing drivers for
Disco I/O in the guest OS.
 DMA provides an interesting challenge, in
that the DMA addresses need the same
translation as regular accesses.
 However, we can do some especially cool
things with DMA requests to disk.
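A small sketch of the DMA interception idea, under the assumption that the monitor walks the request's scatter/gather list and rewrites guest physical addresses into machine addresses before handing it to the real device (the layout and names are invented):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* One scatter/gather element of a guest DMA request (invented layout). */
struct dma_seg { uint64_t addr; uint32_t len; };

/* Translation callback: guest physical -> machine address.  In Disco this
 * would consult the per-VM pmap; here it is just a parameter. */
typedef uint64_t (*translate_fn)(void *vm, uint64_t phys_addr);

/* Rewrite every segment of an intercepted DMA request in place so the
 * real device sees machine addresses rather than guest physical ones.
 * Returns 0 on success, -1 if a segment crosses a page boundary (which
 * a real monitor would have to split). */
int remap_dma(void *vm, translate_fn xlate, struct dma_seg *segs, size_t nsegs)
{
    for (size_t i = 0; i < nsegs; i++) {
        uint64_t start = segs[i].addr;
        uint64_t end   = start + segs[i].len - 1;
        if ((start >> PAGE_SHIFT) != (end >> PAGE_SHIFT))
            return -1;                     /* would need to be split */
        segs[i].addr = xlate(vm, start);   /* phys -> machine        */
    }
    return 0;
}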
Copy-on-Write Disks
 All disk DMA requests are caught and analyzed.
If the data is already in memory, we don’t have
to go to disk for it.
 If the request is for a full page, we just update a
pointer in the requesting virtual machine.
 So what?
 Multiple VM’s can share data without being aware of
it. Only modifying the data causes a copy to be made.
 Awesome for scaling up apps by using multiple copies
of an OS. Only really need one copy of the OS kernel,
libraries, etc.
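A toy version of the idea, assuming a simple global table from disk block to machine page; the real Disco bookkeeping is more involved, but the shape is the same: a read that hits the table is just a mapping, and only a later write forces a private copy.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define MAX_CACHED 1024

/* Global cache: which machine page (if any) already holds a given disk
 * block.  Layout and sizes are invented for illustration. */
struct disk_cache_entry { uint64_t block; void *machine_page; };
static struct disk_cache_entry cache[MAX_CACHED];
static int cache_used;

static void *lookup(uint64_t block)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].block == block)
            return cache[i].machine_page;
    return NULL;
}

/* A full-page read "DMA": if the block is already in memory, hand back
 * the existing page (the caller maps it read-only, copy-on-write);
 * otherwise stand in for the real disk read and remember the page. */
void *cow_disk_read(uint64_t block)
{
    void *page = lookup(block);
    if (page)
        return page;                      /* shared, no disk access */
    page = malloc(PAGE_SIZE);
    if (!page)
        return NULL;
    memset(page, 0, PAGE_SIZE);           /* placeholder for the disk read */
    if (cache_used < MAX_CACHED)
        cache[cache_used++] = (struct disk_cache_entry){ block, page };
    return page;
}

/* On a write fault to a shared page, give the writing VM a private copy. */
void *cow_copy_on_write(const void *shared_page)
{
    void *copy = malloc(PAGE_SIZE);
    if (copy)
        memcpy(copy, shared_page, PAGE_SIZE);
    return copy;
}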
My Favorite – Networking
 The copy-on-write disk stuff is great for non-
persistent disks. But what about persistent
ones? Let’s just use NFS.
 But here’s a dumb thing: A VM has a copy of
information it wants to send to another VM
on the same physical machine. In a naïve
approach, we’d let that data be duplicated,
taking up extra memory pointlessly.
 So, let’s use copy-on-write for our network
interface too!
Virtual Network Interface
 Disco provides a virtual subnet for VM’s to talk to
each other.
 This virtual device is Ethernet-like, but with no
maximum transfer size.
 Transfers are accomplished by updating pointers
rather than actually copying data (until
absolutely necessary).
 The OS sends out the requests as NFS requests.
 “Ah,” but you say. “What about the data locality
as a VM starts accessing those files and
memory?”
 Page replication and migration!
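A sketch of the pointer-update "transfer", reusing the invented pmap layout from the earlier memory sketch: the sender's backing machine page is simply mapped into the receiver's physical address space, read-only on both sides so a later write faults and triggers copy-on-write.

#include <stdint.h>

#define PMAP_PAGES 1024

/* Minimal per-VM pmap (guest physical page -> machine page); fields are
 * invented for illustration. */
struct pmap {
    uint32_t machine_pfn[PMAP_PAGES];
    uint8_t  read_only[PMAP_PAGES];
};

/* "Send" one page-aligned, page-sized message over the virtual subnet:
 * instead of copying the data, map the sender's backing machine page into
 * the receiver's physical address space. */
void vnet_send_page(struct pmap *sender, uint32_t sender_ppn,
                    struct pmap *receiver, uint32_t receiver_ppn)
{
    uint32_t mpfn = sender->machine_pfn[sender_ppn];

    receiver->machine_pfn[receiver_ppn] = mpfn;  /* pointer update, no copy */
    receiver->read_only[receiver_ppn]   = 1;     /* writes fault -> COW     */
    sender->read_only[sender_ppn]       = 1;
}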
About those Commodity OS’s
 So what do we really need to do to get these
commodity operating systems running on Disco?
 Surprisingly a lot and a little.
 Minor changes were needed to IRIX’s HAL, amounting
to 2 header files and 15 lines of assembly code. This
did lead to a full kernel recompile though.
 Disco needs device drivers. Let’s just steal them from
IRIX!
 Don't trap on every privileged register access. Convert
them into normal loads/stores to a special address
space that is mapped to the privileged registers. (See
the sketch below.)
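To make that last point concrete, here is a hypothetical sketch of what the patched HAL accessors could look like. The base address, register indices, and function names are all invented; it only works inside a guest whose monitor actually maps such a page.

#include <stdint.h>

/* Invented address where the monitor would map the current VCPU's emulated
 * privileged-register page into the guest kernel's address space. */
#define DISCO_PRIVREG_BASE 0xFFFFFFFFC0000000ull

enum privreg { PR_STATUS = 0, PR_CAUSE = 1, PR_COUNT = 2, PR_COMPARE = 3 };

/* In an unpatched kernel these would be mfc0/mtc0 instructions, each of
 * which traps into the monitor.  Patched, they become ordinary loads and
 * stores to the special page, which the monitor backs with the VCPU's
 * saved state. */
static inline uint64_t hal_read_privreg(enum privreg r)
{
    volatile uint64_t *regs = (volatile uint64_t *)DISCO_PRIVREG_BASE;
    return regs[r];
}

static inline void hal_write_privreg(enum privreg r, uint64_t value)
{
    volatile uint64_t *regs = (volatile uint64_t *)DISCO_PRIVREG_BASE;
    regs[r] = value;
}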
More Patching
 “Hinting” added to HAL to help the VMM not
do dumb things (or at least do fewer dumb
things).
 When the OS goes idle, the MIPS (usually)
defaults to a low power mode. Disco just
stops scheduling the VM until something
interesting happens.
 Other minor things were done, but that
required patching the kernel.
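A toy sketch of how a monitor could act on that idle hint, assuming an invented per-VCPU scheduler structure (this is not Disco's actual scheduler): the idle VCPU is simply skipped until a virtual interrupt is posted.

#include <stdbool.h>

#define MAX_VCPUS 8

/* Invented per-VCPU scheduler state. */
struct sched_vcpu {
    bool idle;               /* set when the guest executes its idle hint */
    bool interrupt_pending;  /* set when a virtual device raises an irq   */
};
static struct sched_vcpu vcpus[MAX_VCPUS];

/* Called when the patched HAL idle loop signals "nothing to run". */
void vcpu_idle_hint(int id)
{
    vcpus[id].idle = true;
}

/* Called when a virtual interrupt is posted; wakes the VCPU back up. */
void vcpu_post_interrupt(int id)
{
    vcpus[id].interrupt_pending = true;
    vcpus[id].idle = false;
}

/* Round-robin pick of the next runnable VCPU, skipping idle ones so the
 * physical processor is not burned spinning in a guest's idle loop. */
int pick_next_vcpu(int last)
{
    for (int i = 1; i <= MAX_VCPUS; i++) {
        int id = (last + i) % MAX_VCPUS;
        if (!vcpus[id].idle || vcpus[id].interrupt_pending)
            return id;
    }
    return -1;   /* everything idle: halt until the next interrupt */
}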
SPLASHOS
 Some high-performance apps might need
most or all of the machine. The authors
wrote a “thin” operating system to run
SPLASH-2 applications.
 Mostly proof-of-concept.
Experimental Results
 Bad Idea: Target your software for a machine
that doesn’t physically exist.
 Like, I don’t know, FLASH?
 Disco was validated using two alternatives:
 SimOS
 SGI Origin2000 Board that will form the basis of
FLASH
Experimental Design
 Use 4 representative workloads for parallel
applications:
 Software Development (Pmake of a large app)
 Hardware Development (Verilog simulator)
 Scientific Computing (Raytracing and a sorting
algorithm)
 Commercial Database (Sybase)
 Not only are they representative, but they each
have characteristics that are interesting to study
 For example, Pmake is multiprogrammed, lots of
short-lived processes, OS & I/O intensive.
Simplest Results Graph
•Overhead of Disco is
pretty modest compared
to the uniprocessor
results.
•Raytrace is the lowest,
at only 3%. Pmake is the
highest, at 16%.
•The main hits come
from additional traps and
TLB misses (from all the
flushing Disco does).
•Interestingly, less time is
spent in the kernel in
Raytrace, Engineering
and Database.
•Running a 64-bit system
mitigates the impact of
TLB misses.
Memory Utilization
Key thing here is how 8 VM's don't require 8x the
memory of 1 VM.
Interestingly, we have 8 copies of IRIX running in
less than 256 MB of physical RAM!
Scalability
• Page migration and
replication were
disabled for these runs.
• All use 8 processors
and 256 MB of memory.
• IRIX has a terrible
bottleneck in
synchronizing the
system’s memory
management code.
• It also has a “lazy”
evaluation policy in the
virtual memory system
that drags “normal”
RADIX down.
•Overall though, check
out those performance
gains!
Page Migration Benefits
•The 100% UMA results
give an upper bound on the
gains available from page
migration and replication
(i.e., a lower bound on
execution time).
•But in short, the
policies work great.
Real Hardware
 Experience on the real SGI hardware pretty
much confirms the simulations, at least at the
uniprocessor level.
 Overheads tend to be in the range of 3-8% on
Pmake and the Engineering simulation.
Summing Up
 Disco works pretty well.
 Memory usage scales well, processor
utilization scales well.
 Performance overheads are relatively small
for most loads.
 Lots of engineering challenges, but most
seem to have been overcome.
Final Thoughts
 Everything in this paper seems, in retrospect,
to be totally obvious. However, the
combination of all of these factors seems like
it would have taken just a ton of work.
 Plus, I don’t think I could have done it half as
well, to be honest.
 Targeting a non-existent machine seems a
little silly.
 Overall, interesting paper.