IBM Research
Hypervisors
Orran Krieger
© 2005 IBM Corporation
http://www.research.ibm.com/hypervisor
Hypervisors will be pervasive for commercial systems
Server consolidation.
Incremental upgradeability of HW and SW.
Testing for new deployments.
Functions like RAS implemented just once.
Security.
Quality of service.
Virtualization is a commodity and ubiquitous: PHYP, zSeries,
VMware, Microsoft Virtual PC & Virtual Server, UML, coLinux, Xen,
Denali, L4, Jaluna, rHype, Virtual IRON…
My position in the mid 90s
Demonstrates limitations in current OSes:
– Server consolidation because the OSes can’t isolate the
workloads.
– Scalable apps on non-scalable OSes (Disco) & Fault
containment (Cellular Disco)
– Rt-Linux … to isolate real time.
Keeps the OS from direct access to the HW.
Large-grained partitioning of resources results in
inefficiencies.
Makes fine grained sharing difficult.
Requires configuration and management of multiple OSes.
Hypervisors are for weenies
Started dealing with Console vendor
Game comes with its own customized OS.
Absolutely deterministic performance.
– OS can *never* be upgraded.
But… need persistent storage, network access, general
applications…
Solution: hypervisor with general-purpose Linux and the ability to start
games in their own domain/partition.
We implemented IBM’s “Research Hypervisor” (rHype).
– Research platform for IBM’s PERCS project.
– Supports IBM’s PAPR interface used by PHYP on GP-UL: runs
K42 and existing Linux distros.
– Supports modified Linux on non-hypervisor mode PPC.
Released “Research Hypervisor” in February under GPL and now
working on Xen PPC.
Value for HEC/FASTOS (1)
Enable at-scale evaluation and testing before production
deployment.
Multiple OSes on a platform:
– App doesn’t have to be written to least common denominator:
improved productivity, and improved performance.
– Enable ISV community.
Can share machine with commercial environment.
Enables high level of security, e.g., virtual cluster provided to public.
Isolation from OS perturbations.
Services can be implemented once for all OSes, e.g.,
checkpoint/restart, migration, power management, network
virtualization…
Resource management, e.g., interactive supercomputing.
Scalability and fault containment on SMMP.
Value for HEC/FASTOS (2)
Enabling HW innovation.
Perform non-performance-critical operations, like
HW initialization/configuration, once, and virtualize non-performance-critical HW.
Security services, e.g., introspection, can be provided
outside the OS; can also verify OS invariants continuously.
Out-of-body debugging.
Enable wacky/library/customizable/light-weight… OSes:
– Can be deployed on real at-scale HW.
– Don’t have to solve all the problems (e.g., ACPI).
– Can be moved to new platforms.
Example 1: real mode application (PERCS HPCS)
GUPS Initial Results
[Chart: Time (ms), 0 to 30000, versus Size (MB), 1 to 128, for Linux and PROSE.]
Current Prototype (Parent/Child)
[Diagram labels: application, private namespace, libfs, file system
(open, read, write, close), shared memory (in channel, out channel),
mpifs, netfs, devcons, u9fs, tcp/ip, Ethernet, network, disk partition.]
Areas of Opportunity
[Diagram labels:
Host A: App, Libraries, 9P Enabled Application/Service, 9P Library Interface/Interposer.
Operating System 1 … N: File System (Local FS, NFS, LVM, …), Sockets (TCP, UDP, SCTP, …), IP, SCSI DD, Storage Adapter DD, NIC DD.
9P opportunities: Direct 9P Sockets/Pipes, 9P Exported Networking Stack, 9P Exported Device Drivers, 9P exporting self-virtualizing hardware.
Hypervisor; PCI Root A; PCI Bridge or Switch.
HCA: IB Transport, IB Network, IB Link, IB Phy.
Storage HBAs: various storage link options (pSCSI, SAS, SATA, iSCSI).
RNIC: RDMA/DDP, MPA, TCP/IP, Ethernet Link, Ethernet Phy.]
Example 2: Linux on compute nodes
Limited system call support.
Must marshal arguments and send to Linux.
[Diagram labels: I/O node, Compute node]
Layer to convert to Linux system calls…
Map FASTOS memory into Linux app.
System calls reflected into Linux.
All system calls can be supported.
Continue to avoid TLB, get deterministic performance…
[Diagram label: Compute node]
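The marshalling approach on this slide can be sketched in a few lines. This is a minimal illustration, not the PERCS or FASTOS implementation: the wire format, the `SYS_WRITE` number, and the names `marshal_write`, `serve_request`, and `demo` are all assumptions made for the example. The compute-node side packs a call into a flat message, and the I/O-node side unpacks it and issues the real call on Linux.

```python
import os
import struct

# Hypothetical wire format: syscall number, file descriptor, payload length.
HEADER = struct.Struct("!IiI")
SYS_WRITE = 1  # illustrative number, not a real Linux syscall number


def marshal_write(fd: int, data: bytes) -> bytes:
    """Compute-node side: pack a write() request into a flat message."""
    return HEADER.pack(SYS_WRITE, fd, len(data)) + data


def serve_request(message: bytes) -> int:
    """I/O-node side: unpack the message and issue the real system call."""
    syscall, fd, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if syscall != SYS_WRITE:
        # Only calls with a marshalling routine can be forwarded.
        raise NotImplementedError(f"syscall {syscall} is not supported")
    return os.write(fd, payload)


def demo() -> bytes:
    """Ship one write() through the marshalling layer and read it back."""
    read_end, write_end = os.pipe()
    try:
        written = serve_request(marshal_write(write_end, b"hello"))
        return os.read(read_end, written)
    finally:
        os.close(read_end)
        os.close(write_end)
```

Every call needs its own marshalling routine in this scheme, which is one way to read the "limited system call support" point; reflecting system calls into a co-resident Linux avoids writing such a routine per call.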
Implementation issues
With the right HW, system calls go directly to the OS; otherwise
they need trap reflection.
Options for page tables:
– Paravirtualized interface to hypervisor-owned page table.
– Page tables cached by hypervisor.
– Writable page tables.
Hypervisor needs to own some timer interrupt.
Interrupts likely go through the hypervisor and are forwarded to
the right OS.
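The first page-table option, a paravirtualized interface to a hypervisor-owned page table, can be modelled as follows. This is a toy sketch under stated assumptions: the class `Hypervisor` and the calls `create_partition`, `h_map`, and `translate` are hypothetical names, not the rHype, PAPR, or Xen interface. The point it illustrates is that the guest never writes the table itself; it asks, and the hypervisor checks ownership before installing the mapping.

```python
class Hypervisor:
    """Toy model in which the hypervisor owns every partition's page table."""

    def __init__(self):
        self.owned_frames = {}  # partition name -> set of physical frame numbers
        self.page_tables = {}   # partition name -> {virtual page: physical frame}

    def create_partition(self, name, frames):
        """Give a new partition exclusive ownership of some physical frames."""
        self.owned_frames[name] = set(frames)
        self.page_tables[name] = {}

    def h_map(self, partition, vpn, pfn):
        """Hypothetical hypercall: map vpn -> pfn if the caller owns the frame."""
        if pfn not in self.owned_frames[partition]:
            return False  # refuse a mapping into another partition's memory
        self.page_tables[partition][vpn] = pfn
        return True

    def translate(self, partition, vpn):
        """Return the physical frame for vpn, or None if it is unmapped."""
        return self.page_tables[partition].get(vpn)
```

For example, with partitions `"linux"` owning frames 0 to 7 and `"k42"` owning frames 8 to 15, `h_map("linux", 0x10, 3)` succeeds while `h_map("linux", 0x11, 9)` is refused. The other two options on the slide move this check elsewhere rather than removing it.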
IO hosting
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
Device Partitioning with IOMMU
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
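As a rough illustration of what an IOMMU contributes to device partitioning, the sketch below models a per-device translation table: a DMA address issued by a device is only honoured if an entry was programmed for that device. The class `Iommu`, its methods, and the 4 KiB page size are assumptions for the example and do not describe any particular hardware.

```python
PAGE_SIZE = 4096  # assumed page size for the example


class Iommu:
    """Toy IOMMU: each device sees only the frames mapped for it."""

    def __init__(self):
        self.tables = {}  # device name -> {I/O page number: physical frame number}

    def map(self, device, io_page, frame):
        """Program one translation entry (done by the hypervisor in this model)."""
        self.tables.setdefault(device, {})[io_page] = frame

    def translate(self, device, io_addr):
        """Translate a device-issued DMA address, or fault if it is unmapped."""
        page, offset = divmod(io_addr, PAGE_SIZE)
        table = self.tables.get(device, {})
        if page not in table:
            raise PermissionError(f"{device}: DMA to unmapped address {io_addr:#x}")
        return table[page] * PAGE_SIZE + offset
```

If the hypervisor only programs entries that point at frames owned by the partition the device is assigned to, a driver in one partition cannot direct DMA into another partition's memory.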
Self virtualizing devices
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
What we need to do
Xen is the obvious choice.
We need to help drive the definition of the hypervisor before it
becomes too mature.
– Investigate costs of their design decisions, and fix them.
Need to drive definition of I/O virtualization and self-virtualizing devices.
Determine the set of features we can use in common, e.g.:
– One implementation of checkpoint/restart/migration, …
– Gang scheduling of partitions.
Hypervisor as a base is close to ready; making it a first-class
platform for HEC will take investment.
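Gang scheduling of partitions means that all virtual CPUs of a partition run in the same time slot, so a parallel job never waits on a descheduled peer. The sketch below is one simple round-robin policy written for illustration; the function name `gang_schedule` and its parameters are assumptions, and it is not the scheduler of Xen or any IBM hypervisor.

```python
from collections import deque


def gang_schedule(vcpus, num_cpus, num_slots):
    """Return, per time slot, the partitions whose vCPUs all run together.

    vcpus maps partition name -> number of virtual CPUs it needs.
    A partition is placed in a slot only if every one of its vCPUs fits.
    """
    for name, count in vcpus.items():
        if count > num_cpus:
            raise ValueError(f"{name} needs {count} CPUs, only {num_cpus} exist")

    queue = deque(vcpus)  # round-robin order of partition names
    schedule = []
    for _ in range(num_slots):
        free = num_cpus
        slot = []
        # Visit each partition at most once per slot.
        for _ in range(len(queue)):
            name = queue[0]
            if vcpus[name] > free:
                break  # head of the queue waits for the next slot
            slot.append(name)
            free -= vcpus[name]
            queue.rotate(-1)  # move the scheduled partition to the back
        schedule.append(slot)
    return schedule
```

With `vcpus = {"A": 4, "B": 2, "C": 2}` on 4 CPUs, three slots give `[["A"], ["B", "C"], ["A"]]`: partition A gets the whole machine in its slot, and B and C share the next one.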