Disco: Running Commodity Operating Systems on Scalable Multiprocessors
Kit Cischke
09/09/08
CS 5090
DISCO: RUNNING COMMODITY OPERATING SYSTEMS ON SCALABLE MULTIPROCESSORS
Overview
Background
What are we doing here?
A Return to Virtual Machine Monitors
What does Disco do?
Disco: A Return to VMMs
How does Disco do it?
Experimental Results
How well does Disco dance?
The Basic Problem
With the explosion of multiprocessor machines, especially of the NUMA variety, the problem of using these machines effectively becomes more immediate.
NUMA = Non-Uniform Memory Access; it shows up a lot in large scalable machines built from clustered nodes.
The authors point out that the problem applies to
any major hardware innovation, not just
multiprocessors.
Potential Solution
Solution: Rewrite the operating system to
address fault-tolerance and scalability.
Flaws:
Rewriting will introduce bugs.
Bugs can disrupt the system or the applications.
Instability is usually less tolerated on these kinds of systems because of their application space.
You may not have access to the OS.
Not So Good
Okay. So that wasn’t so good. What else do
we have?
How about Virtual Machine Monitors?
A new twist on an old idea, which may work
better now that we have faster processors.
Enter Disco
•Disco is a system VM that presents a similar fundamental machine to all of the various OS's that might be running on the machine.
•These can be commodity OS's, uniprocessor, multiprocessor, or specialty systems.
Disco VMM
Fundamentally, the hardware is a cluster, but
Disco introduces some global policies to
manage all of the resources, which makes for
better usage of the hardware.
We’ll use commodity operating systems and
write the VMM. Rather than millions of lines
of code, we’ll write a few thousand.
What if an application's resource needs exceed what a single commodity OS can handle?
Scalability
Very simple changes to the commodity OS
(maybe on the driver level or kernel
extension) can allow virtual machines to
share resources.
E.g., a parallel database could have a cache in
shared memory and multiple virtual processors
running on virtual machines.
Support for specialized OS’s that need the
power of multiple processors but not all of
the features offered by a commodity OS.
Further Benefits
Running multiple copies of an OS naturally addresses scalability and fault containment.
Need greater scaling? Add a VM.
Only the monitor and the system protocols (NFS, etc.)
need to scale.
OS or application crashes? No problem. The rest of
the system is isolated.
NUMA memory management issues are
addressed.
Multiple versions of different OS’s provide legacy
support and convenient upgrade paths.
Not All Sunshine & Roses
VMM Overhead
Additional exception processing, instruction execution
and memory to virtualize hardware.
Privileged instructions aren’t directly executed on the
hardware, so we need to fake it. I/O requests need to
be intercepted and remapped.
Memory overhead is rough too.
Consider having 6 copies of Vista in memory
simultaneously.
Resource Management
VMM can’t make intelligent decisions about code
streams without info from OS.
One Last Disadvantage
Communication
Sometimes resources simply can’t be shared the
way we want.
Most of these can be mitigated though.
For example, most operating systems have good
NFS support. So use it.
But… We can make it even better! (Details
forthcoming.)
Introducing Disco
VMM designed for the FLASH multiprocessor
machine
FLASH is an academic machine designed at
Stanford University
It is a collection of nodes, each containing a processor, memory, and I/O. Directory-based cache coherence makes it look like a CC-NUMA machine.
Disco has also been ported to a number of other machines.
Disco’s Interface
The virtual CPU of Disco is an abstraction of a MIPS R10000.
It not only emulates the processor but extends it (e.g., it reduces some kernel operations to simple load/store instructions).
Disco also presents an abstraction of physical memory starting at address 0 (zero).
I/O Devices
Disks, network interfaces, interrupts, clocks, etc.
Special interfaces for network and disks.
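To make the trap-and-emulate idea behind the virtual CPU concrete, here is a rough C sketch. It is not Disco's actual code; the structures, opcode names, and register layout are invented. The point is just that the guest kernel runs unprivileged, the hardware traps on a privileged instruction, and the monitor applies the effect to the virtual CPU's saved state rather than to the real hardware.

```c
/* Minimal sketch of trap-and-emulate for a virtualized CPU.
 * All structures and opcodes are invented for illustration;
 * they are not Disco's or MIPS's real definitions. */
#include <stdint.h>
#include <stdio.h>

/* Saved state of one virtual CPU (vastly simplified). */
struct vcpu {
    uint64_t regs[32];     /* general-purpose registers      */
    uint64_t status;       /* virtual privileged status reg  */
    uint64_t tlb_entryhi;  /* virtual TLB control register   */
};

/* Pretend opcodes for two privileged operations. */
enum priv_op { OP_READ_STATUS, OP_WRITE_TLB_ENTRYHI };

/* Called when the guest kernel (running unprivileged) executes a
 * privileged instruction and the hardware traps into the monitor. */
static void emulate_privileged(struct vcpu *v, enum priv_op op, int reg)
{
    switch (op) {
    case OP_READ_STATUS:
        /* Give the guest its *virtual* status register, not the real one. */
        v->regs[reg] = v->status;
        break;
    case OP_WRITE_TLB_ENTRYHI:
        /* Record the write; the real TLB is filled in later with a
         * machine address chosen by the monitor. */
        v->tlb_entryhi = v->regs[reg];
        break;
    }
}

int main(void)
{
    struct vcpu v = { .status = 0x1 };
    emulate_privileged(&v, OP_READ_STATUS, 2);
    printf("guest sees status = 0x%llx\n", (unsigned long long)v.regs[2]);
    return 0;
}
```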
Disco’s Implementation
Implemented as a multi-threaded shared-
memory program.
Careful attention paid to memory placement,
cache-aware data structures and processor
communication patterns.
Disco is only 13,000 lines of code.
Windows Server 2003 - ~50,000,000
Red Hat 7.1 - ~30,000,000
Mac OS X 10.4 - ~86,000,000
Disco’s Implementation
The execution of a virtual processor is mapped directly onto a real processor (direct execution).
At each context switch, the real processor's state is set to that of the virtual processor being scheduled.
On MIPS, Disco runs in kernel mode and puts
the processor in appropriate modes for
what’s being run
Supervisor mode for OS, user mode for apps
Simple scheduler allows VP’s to be time-
shared across the physical processors.
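A minimal sketch of that time-sharing, assuming a made-up vcpu structure and a stand-in for loading state onto the real processor; this is not Disco's real scheduler, just the shape of the idea.

```c
/* Sketch of time-sharing virtual processors on one physical CPU.
 * Hypothetical types and functions, not Disco's real scheduler. */
#include <stdint.h>
#include <stdio.h>

struct vcpu {
    int      id;
    uint64_t pc;         /* saved program counter            */
    uint64_t regs[32];   /* saved general-purpose registers  */
    int      mode;       /* 0 = user app, 1 = guest kernel   */
};

/* Stand-in for "restore this state onto the real CPU and run a while". */
static void run_on_real_cpu(struct vcpu *v)
{
    /* A real monitor would restore the registers, put the hardware in
     * supervisor mode for the guest kernel or user mode for apps, and
     * jump to v->pc.  Here we just log the scheduling decision. */
    printf("running vcpu %d in %s mode at pc=0x%llx\n",
           v->id, v->mode ? "supervisor" : "user",
           (unsigned long long)v->pc);
}

int main(void)
{
    struct vcpu vcpus[3] = {
        { .id = 0, .pc = 0x1000, .mode = 1 },
        { .id = 1, .pc = 0x2000, .mode = 0 },
        { .id = 2, .pc = 0x3000, .mode = 1 },
    };

    /* Simple round-robin: each timer tick, switch to the next vcpu. */
    for (int tick = 0; tick < 6; tick++)
        run_on_real_cpu(&vcpus[tick % 3]);
    return 0;
}
```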
Disco’s Implementation
Virtual Physical Memory
This discussion goes on for 1.5 pages. To sum up:
The OS makes requests to physical addresses, and
Disco translates them to machine addresses.
Disco uses the hardware TLB for this.
Switching a different VP onto a new processor
requires a TLB flush, so Disco maintains a 2nd-level
TLB to offset the performance hit.
There's a technical issue with TLBs, kernel address space, and the MIPS processor that threw the authors for a loop: the unmapped kernel segment bypasses the TLB entirely, so the guest kernel had to be relinked to run in mapped address space.
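Here is a hedged sketch of the physical-to-machine translation plus a software second-level TLB. The pmap and l2tlb layouts are invented and heavily simplified; Disco's real data structures differ in detail.

```c
/* Sketch of physical-to-machine translation on a TLB miss, backed by a
 * small software second-level TLB.  Invented structures for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES   16      /* guest "physical" pages per VM    */
#define L2_SIZE  8       /* software second-level TLB slots  */

struct l2_entry { uint32_t vpage, mpage; int valid; };

struct vm {
    uint32_t pmap[NPAGES];          /* physical page -> machine page */
    struct l2_entry l2tlb[L2_SIZE]; /* software second-level TLB     */
};

/* The guest OS has mapped virtual page `vpage` to its "physical" page
 * `ppage`; return the machine page the hardware TLB should really use. */
static uint32_t tlb_miss(struct vm *vm, uint32_t vpage, uint32_t ppage)
{
    int slot = vpage % L2_SIZE;

    /* Check the software TLB first: this is what softens the cost of
     * flushing the hardware TLB on a virtual-CPU switch. */
    if (vm->l2tlb[slot].valid && vm->l2tlb[slot].vpage == vpage)
        return vm->l2tlb[slot].mpage;

    uint32_t mpage = vm->pmap[ppage];            /* pmap translation */
    vm->l2tlb[slot] = (struct l2_entry){ vpage, mpage, 1 };
    return mpage;   /* caller inserts (vpage -> mpage) into the hw TLB */
}

int main(void)
{
    struct vm vm = { .pmap = { [2] = 9 } };
    /* Guest maps virtual page 5 to its "physical" page 2; the monitor
     * actually backs that with machine page 9. */
    printf("machine page = %u\n", tlb_miss(&vm, 5, 2));
    printf("machine page = %u (from software TLB)\n", tlb_miss(&vm, 5, 2));
    return 0;
}
```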
NUMA Memory Management
•In an effort to mitigate the non-uniform effects of a NUMA machine, Disco does a bunch of stuff:
•Allocating as much memory as possible with "affinity" to a processor.
•Migrating or replicating pages across virtual machines to reduce long memory accesses.
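As a rough illustration of the kind of migrate-vs-replicate decision involved: the paper drives this from FLASH's hardware cache-miss counters, but the thresholds, structure, and policy below are invented for the example.

```c
/* Sketch of a migrate-vs-replicate decision for one hot page, driven by
 * per-node miss counts.  Invented thresholds; not Disco's real policy. */
#include <stdio.h>

#define NNODES 4
#define HOT    100   /* misses before we consider moving the page */

enum action { KEEP, MIGRATE, REPLICATE };

static enum action decide(const int misses[NNODES], int home,
                          int read_only, int *target)
{
    int total = 0, best = home;
    for (int n = 0; n < NNODES; n++) {
        total += misses[n];
        if (misses[n] > misses[best])
            best = n;
    }
    if (total < HOT || best == home)
        return KEEP;                    /* not hot, or already local   */
    *target = best;
    /* Heavily shared read-only pages get replicated; pages used mostly
     * by one remote node get migrated to it. */
    if (read_only && misses[best] < total * 3 / 4)
        return REPLICATE;
    return MIGRATE;
}

int main(void)
{
    int misses[NNODES] = { 5, 90, 60, 2 };  /* node 0 is the home node */
    int target = 0;
    enum action a = decide(misses, 0, /*read_only=*/1, &target);
    printf("action=%d target node=%d\n", (int)a, target);
    return 0;
}
```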
Virtual I/O Devices
Obviously Disco needs to intercept I/O
requests and direct them to the actual device.
Primarily handled by installing drivers for
Disco I/O in the guest OS.
DMA provides an interesting challenge, in
that the DMA addresses need the same
translation as regular accesses.
However, we can do some especially cool
things with DMA requests to disk.
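A tiny sketch of the DMA remapping idea, with an invented descriptor format and pmap; the real device interfaces are more involved, but the translation step looks like this.

```c
/* Sketch of rewriting a guest DMA request: the guest supplies a buffer
 * address in its "physical" address space, and the monitor substitutes
 * the real machine address before the device sees the request.
 * The descriptor layout and pmap are invented for the example. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
#define NPAGES     16

struct dma_desc { uint32_t addr; uint32_t len; };

/* physical page -> machine page for one virtual machine */
static uint32_t pmap[NPAGES] = { [0] = 7, [1] = 3, [2] = 12 };

static void translate_dma(struct dma_desc *d)
{
    uint32_t ppage  = d->addr >> PAGE_SHIFT;
    uint32_t offset = d->addr & PAGE_MASK;
    d->addr = (pmap[ppage] << PAGE_SHIFT) | offset;  /* machine address */
}

int main(void)
{
    struct dma_desc d = { .addr = (1u << PAGE_SHIFT) + 0x40, .len = 512 };
    translate_dma(&d);
    printf("machine DMA address = 0x%x\n", d.addr);  /* machine page 3, offset 0x40 */
    return 0;
}
```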
Copy-on-Write Disks
All disk DMA requests are caught and analyzed.
If the data is already in memory, we don’t have
to go to disk for it.
If the request is for a full page, we just update a
pointer in the requesting virtual machine.
So what?
Multiple VM’s can share data without being aware of
it. Only modifying the data causes a copy to be made.
Awesome for scaling up apps by using multiple copies
of an OS. Only really need one copy of the OS kernel,
libraries, etc.
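To make the sharing concrete, here is a minimal sketch assuming a made-up global block-to-page table. The real Disco code tracks considerably more state; the essential trick is that a second VM asking for a cached block just gets the existing machine page mapped read-only.

```c
/* Sketch of sharing disk blocks copy-on-write across virtual machines.
 * A global table remembers which machine page already holds each disk
 * block; a second VM asking for that block shares the same page.
 * All structures are invented for the example. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

static int block_to_mpage[NBLOCKS];   /* -1 = not cached in memory */
static int next_free_mpage = 100;

/* Returns the machine page holding `block`, reading from disk only if
 * no VM has it in memory yet.  *read_only tells the caller to map the
 * page write-protected so a later store triggers a private copy. */
static int disk_read(int block, int *read_only)
{
    if (block_to_mpage[block] >= 0) {
        *read_only = 1;                    /* shared: copy-on-write */
        return block_to_mpage[block];
    }
    int mpage = next_free_mpage++;         /* pretend we DMA'd into this page */
    block_to_mpage[block] = mpage;
    *read_only = 1;
    return mpage;
}

int main(void)
{
    memset(block_to_mpage, -1, sizeof block_to_mpage);
    int ro;
    printf("VM1 gets page %d for block 5\n", disk_read(5, &ro));
    printf("VM2 gets page %d for block 5 (no second disk read)\n",
           disk_read(5, &ro));
    return 0;
}
```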
My Favorite – Networking
The Copy-on-write disk stuff is great for non-
persistent disks. But what about persistent
ones? Let’s just use NFS.
But here’s a dumb thing: A VM has a copy of
information it wants to send to another VM
on the same physical machine. In a naïve
approach, we’d let that data be duplicated,
taking up extra memory pointlessly.
So, let’s use copy-on-write for our network
interface too!
Virtual Network Interface
Disco provides a virtual subnet for VM’s to talk to
each other.
This virtual device is Ethernet-like, but with no
maximum transfer size.
Transfers are accomplished by updating pointers
rather than actually copying data (until
absolutely necessary).
The OS sends out the requests as NFS requests.
“Ah,” but you say. “What about the data locality
as a VM starts accessing those files and
memory?”
Page replication and migration!
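A sketch of the pointer-update transfer, with invented per-VM page maps rather than the real data structures: "sending" a page over the virtual subnet just makes both VMs point at the same machine page, write-protected so a later store forces a private copy.

```c
/* Sketch of a zero-copy transfer on the virtual subnet: instead of
 * copying the payload page, the monitor maps the sender's machine page
 * read-only into the receiver's physical address space.
 * Types and page numbers are invented for the example. */
#include <stdio.h>

#define NPAGES 16

struct vm {
    const char *name;
    int pmap[NPAGES];       /* guest-physical page -> machine page */
    int readonly[NPAGES];   /* 1 = write-protected (shared page)   */
};

/* "Send" one page from src to dst by remapping, not copying. */
static void vnet_send_page(struct vm *src, int src_ppage,
                           struct vm *dst, int dst_ppage)
{
    int mpage = src->pmap[src_ppage];
    dst->pmap[dst_ppage]     = mpage;   /* both VMs now see the same page */
    dst->readonly[dst_ppage] = 1;
    src->readonly[src_ppage] = 1;       /* a write by either side faults and copies */
    printf("%s page %d and %s page %d share machine page %d\n",
           src->name, src_ppage, dst->name, dst_ppage, mpage);
}

int main(void)
{
    struct vm nfs_server = { .name = "server", .pmap = { [3] = 42 } };
    struct vm nfs_client = { .name = "client" };
    vnet_send_page(&nfs_server, 3, &nfs_client, 7);
    return 0;
}
```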
About those Commodity OS’s
So what do we really need to do to get these
commodity operating systems running on Disco?
Surprisingly, a lot and a little.
Minor changes were needed to IRIX’s HAL, amounting
to 2 header files and 15 lines of assembly code. This
did lead to a full kernel recompile though.
Disco needs device drivers. Let’s just steal them from
IRIX!
Don't trap on every privileged register access; convert frequent accesses into normal loads/stores to a special address space linked to the privileged registers.
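A sketch of what that load/store conversion might look like, written in C with an invented shared-page layout; the actual change lives in the IRIX HAL as a small amount of MIPS assembly, so treat this purely as an illustration of the mechanism.

```c
/* Sketch of replacing a trapping privileged-register read with a plain
 * load from a special page the monitor keeps up to date.  The layout
 * and names are invented for the example. */
#include <stdint.h>
#include <stdio.h>

/* Page shared between the monitor and the guest kernel, mapped at a
 * well-known address in the guest.  The monitor writes it; the guest
 * only reads it, so no trap is needed for reads. */
struct priv_regs {
    uint64_t status;
    uint64_t cause;
    uint64_t count;
};

static struct priv_regs shared_page;   /* stand-in for the mapped page */

/* What the patched HAL does instead of a trapping privileged read. */
static uint64_t hal_read_status(void)
{
    return shared_page.status;         /* one ordinary load, no trap */
}

int main(void)
{
    shared_page.status = 0x8001;       /* the monitor keeps this current */
    printf("guest read status = 0x%llx\n",
           (unsigned long long)hal_read_status());
    return 0;
}
```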
More Patching
“Hinting” added to HAL to help the VMM not
do dumb things (or at least do fewer dumb
things).
When the OS goes idle, the MIPS (usually)
defaults to a low power mode. Disco just
stops scheduling the VM until something
interesting happens.
Other minor things were done, but that
required patching the kernel.
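A sketch of that idle hint, with invented structures: when the guest's idle loop executes the low-power instruction, the monitor simply stops scheduling that virtual CPU until a virtual interrupt arrives.

```c
/* Sketch of the idle hint: the guest kernel's idle loop traps into the
 * monitor, which deschedules the virtual CPU until an interrupt is
 * pending.  All names are invented for the example. */
#include <stdio.h>

struct vcpu { int id; int runnable; int pending_irq; };

/* Trap handler for the guest's low-power/idle instruction. */
static void on_guest_idle(struct vcpu *v)
{
    v->runnable = 0;                   /* take it off the run queue */
    printf("vcpu %d descheduled (idle)\n", v->id);
}

/* Delivery of a virtual interrupt (timer, disk, network, ...). */
static void post_irq(struct vcpu *v)
{
    v->pending_irq = 1;
    v->runnable = 1;                   /* something interesting happened */
    printf("vcpu %d woken by interrupt\n", v->id);
}

int main(void)
{
    struct vcpu v = { .id = 0, .runnable = 1 };
    on_guest_idle(&v);
    post_irq(&v);
    return 0;
}
```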
SPLASHOS
Some high-performance apps might need
most or all of the machine. The authors
wrote a “thin” operating system to run
SPLASH-2 applications.
Mostly proof-of-concept.
Experimental Results
Bad Idea: Target your software for a machine
that doesn’t physically exist.
Like, I don’t know, FLASH?
Disco was validated using two alternatives:
SimOS
SGI Origin2000 Board that will form the basis of
FLASH
Experimental Design
Use 4 representative workloads for parallel
applications:
Software Development (Pmake of a large app)
Hardware Development (Verilog simulator)
Scientific Computing (Raytracing and a sorting
algorithm)
Commercial Database (Sybase)
Not only are they representative, but they each
have characteristics that are interesting to study
For example, Pmake is multiprogrammed, lots of
short-lived processes, OS & I/O intensive.
Simplest Results Graph
•Overhead of Disco is pretty modest compared to the uniprocessor results.
•Raytrace is the lowest, at only 3%. Pmake is the highest, at 16%.
•The main hits come from additional traps and TLB misses (from all the flushing Disco does).
•Interestingly, less time is spent in the kernel in Raytrace, Engineering, and Database.
•Running a 64-bit system mitigates the impact of TLB misses.
Memory Utilization
Key thing here is how running 8 VM's doesn't require 8x the memory of 1 VM.
Interestingly, we have 8 copies of IRIX running in less than 256 MB of physical RAM!
Scalability
• Page migration and replication were disabled for these runs.
• All runs use 8 processors and 256 MB of memory.
• IRIX has a terrible bottleneck in synchronizing the system's memory management code.
• It also has a "lazy" evaluation policy in the virtual memory system that drags "normal" RADIX down.
• Overall though, check out those performance gains!
Page Migration Benefits
•The 100% UMA results show the best case, bounding the gains available from page migration and replication (you can't do better than all-local memory).
•But in short, the policies work great.
Real Hardware
Experience on the real SGI hardware pretty much confirms the simulations, at least at the uniprocessor level.
Overheads tend to be in the range of 3-8% on Pmake and the Engineering simulation.
Summing Up
Disco works pretty well.
Memory usage scales well, processor
utilization scales well.
Performance overheads are relatively small
for most loads.
Lots of engineering challenges, but most
seem to have been overcome.
Final Thoughts
Everything in this paper seems, in retrospect,
to be totally obvious. However, the
combination of all of these factors seems like
it would have taken just a ton of work.
Plus, I don’t think I could have done it half as
well, to be honest.
Targeting a non-existent machine seems a
little silly.
Overall, interesting paper.