Transcript Hypervisors

IBM Research
Hypervisors
Orran Krieger
© 2005 IBM Corporation
http://www.research.ibm.com/hypervisor
Hypervisors will be pervasive for commercial systems
• Server consolidation.
• Incremental upgradeability of HW and SW.
• Testing for new deployments.
• Functions like RAS implemented just once.
• Security.
• Quality of service.
Virtualization is becoming a commodity and ubiquitous: PHYP, zSeries, VMware, Microsoft Virtual PC & Virtual Server, UML, coLinux, Xen, Denali, L4, Jaluna, rHype, Virtual Iron…
My position in the mid 90s
• Demonstrates limitations in current OSes:
– Server consolidation, because the OSes can’t isolate the workloads.
– Scalable apps on non-scalable OSes (Disco) and fault containment (Cellular Disco).
– Rt-Linux … to isolate real time.
• Keeps the OS from direct access to the HW.
• Large-grained partitioning of resources results in inefficiencies.
• Makes fine-grained sharing difficult.
• Requires configuration and management of multiple OSes.
Hypervisors are for weenies
Started dealing with a console vendor
• Game comes with its own customized OS.
• Absolutely deterministic performance.
– OS can *never* be upgraded.
• But… need persistent storage, network access, general applications…
• Solution: hypervisor with general-purpose Linux and the ability to start games in their own domain/partition.
• We implemented IBM’s “Research Hypervisor” (rHype).
– Research platform for IBM’s PERCS project.
– Supports IBM’s PAPR interface used by PHYP on GP-UL: runs K42 and existing Linux distros.
– Supports modified Linux on non-hypervisor mode PPC.
• Released “Research Hypervisor” in February under GPL and now working on Xen PPC.
Value for HEC/FASTOS (1)
• Enable at-scale evaluation and testing before production deployment.
• Multiple OSes on a platform:
– App doesn’t have to be written to the least common denominator: improved productivity and improved performance.
– Enable ISV community.
• Can share machine with commercial environment.
• Enables a high level of security, e.g., virtual cluster provided to the public.
• Isolation from OS perturbations.
• Services can be implemented once for all OSes, e.g., checkpoint/restart, migration, power management, network virtualization…
• Resource management, e.g., interactive supercomputing.
• Scalability and fault containment on SMMP.
Value for HEC/FASTOS (2)
• Enabling HW innovation.
• Perform non-performance-critical operations, like HW initialization/configuration, just once, and virtualize non-performance-critical HW.
• Security services, e.g., introspection, can be provided outside the OS; the hypervisor can also verify OS invariants continuously.
• Out-of-body debugging.
• Enable wacky/library/customizable/light-weight… OSes:
– Can be deployed on real at-scale HW.
– Don’t have to solve all the problems (e.g., ACPI).
– Can be moved to new platforms.
Example 1: real mode application (PERCS HPCS)
[Figure: GUPS Initial Results. Time (ms, axis from 0 to 30000) versus Size (MB: 1, 2, 4, 8, 16, 32, 64, 128) for two series, Linux and PROSE.]
Current Prototype (Parent/Child)
[Diagram labels: application (open, read, write, close); private namespace; libfs; mpifs; netfs; devcons; File System; Disk Partition; Shared Memory (in channel, out channel); u9fs; tcp/ip; Ethernet; Network.]
Areas of Opportunity
[Diagram labels, software stack: Host A; App; Libraries; OS (File System: Local FS, NFS, LVM, …; Sockets: TCP, UDP, SCTP, …; IP, …; SCSI DD; Storage Adapter DD; NIC DD, …); Operating System 1 … N; Hypervisor.]
[Diagram labels, hardware: PCI Root A, …; PCI Bridge or Switch; HCA (IB Transport, IB Network, IB Link, IB Phy); Storage HBAs (various storage link options: pSCSI, SAS, SATA, iSCSI); RNIC (RDMA/DDP, MPA, TCP/IP, Ethernet Link, Ethernet Phy).]
[Diagram callouts:]
• 9P Enabled Application/Service
• 9P Library Interface/Interposer
• Direct 9P Sockets/Pipes
• 9P Exported Networking Stack
• 9P Exported Device Drivers
• 9P exporting self-virtualizing hardware
Example 2: Linux on compute nodes
• Limited system call support.
• Must marshal arguments and send to Linux.
• Layer to convert to Linux system calls…
[Diagram labels: I/O node, Compute node]
• Map FASTOS memory into Linux app.
• System calls reflected into Linux.
• All system calls can be supported.
• Continue to avoid TLB, get deterministic performance…
[Diagram label: Compute node]
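The argument marshaling described for the forwarded path can be sketched as follows. This is a minimal illustration, not the actual FASTOS or PERCS protocol: the wire format, the SYS_WRITE number, and the marshal_syscall/unmarshal_syscall helpers are all assumptions made up for the example.

```python
import struct

# Hypothetical wire format (an assumption for illustration):
#   header: syscall number (u32), argument count (u32), payload length (u32)
#   body:   each argument as a signed 64-bit integer, then the raw payload
HEADER = struct.Struct("<III")
SYS_WRITE = 1  # hypothetical syscall number, not a real ABI value


def marshal_syscall(number, args, payload=b""):
    """Pack a system call into bytes that a compute node could ship out."""
    body = struct.pack("<%dq" % len(args), *args)
    return HEADER.pack(number, len(args), len(payload)) + body + payload


def unmarshal_syscall(message):
    """Unpack on the receiving side (the node running Linux)."""
    number, nargs, plen = HEADER.unpack_from(message, 0)
    offset = HEADER.size
    args = struct.unpack_from("<%dq" % nargs, message, offset)
    offset += 8 * nargs
    payload = message[offset:offset + plen]
    return number, list(args), payload


# Compute node side: write(fd=1, buf, len) becomes a message.
data = b"hello\n"
msg = marshal_syscall(SYS_WRITE, [1, len(data)], data)
# Receiving side: recover the call so it can be handed to the real OS.
number, args, payload = unmarshal_syscall(msg)
```

Each supported call needs a matching case on both sides, which is one plausible reason the forwarded path offers only limited system call support.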
Implementation issues
• With the right HW, system calls go directly to the OS; otherwise, trap reflection.
• Options for page tables:
– Paravirtualized interface to hypervisor-owned page table.
– Page tables cached by hypervisor.
– Writable page tables.
• Hypervisor needs to own some timer interrupt.
• Interrupts likely go through the hypervisor and are forwarded to the right OS.
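The first page-table option, a paravirtualized interface to a hypervisor-owned page table, can be sketched as below. This is a toy model rather than the rHype, PAPR, or Xen interface: the Hypervisor class, the map_page/unmap_page names, and the frame-ownership map are assumptions chosen for illustration.

```python
class HypercallError(Exception):
    """Raised when a guest request violates partition isolation."""


class Hypervisor:
    """Toy model: the hypervisor owns the page table and validates updates."""

    def __init__(self, frame_owner):
        # frame_owner maps physical frame number -> owning partition id.
        self.frame_owner = dict(frame_owner)
        # page_table maps (partition id, virtual page) -> physical frame.
        self.page_table = {}

    def map_page(self, partition, vpage, frame):
        """Hypothetical hypercall: install a translation for a guest."""
        if self.frame_owner.get(frame) != partition:
            raise HypercallError("frame %d not owned by partition %d"
                                 % (frame, partition))
        self.page_table[(partition, vpage)] = frame

    def unmap_page(self, partition, vpage):
        """Hypothetical hypercall: remove a translation, if present."""
        self.page_table.pop((partition, vpage), None)

    def translate(self, partition, vpage):
        """Look up the frame a guest's virtual page maps to, or None."""
        return self.page_table.get((partition, vpage))


# Frames 0-1 belong to partition 0, frames 2-3 to partition 1.
hv = Hypervisor({0: 0, 1: 0, 2: 1, 3: 1})
hv.map_page(0, vpage=0x10, frame=1)      # allowed: partition 0 owns frame 1
try:
    hv.map_page(0, vpage=0x11, frame=2)  # rejected: frame 2 is partition 1's
    rejected = False
except HypercallError:
    rejected = True
```

In this model the guest pays a hypercall for every update; the cached and writable page table options listed above are alternatives that handle guest updates differently.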
IO hosting
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
Device Partitioning with IOMMU
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
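As a rough sketch of what device partitioning with an IOMMU implies: a device's DMA addresses are translated through a table that the hypervisor fills in, so the device can only reach memory that its owning partition was granted. The Iommu class and its method names below are hypothetical and do not model any specific hardware.

```python
class DmaFault(Exception):
    """Raised when a device issues DMA to an address it was not granted."""


class Iommu:
    """Toy IOMMU: per-device tables from I/O page to physical frame."""

    PAGE_SIZE = 4096

    def __init__(self):
        self.tables = {}  # device id -> {I/O page number: physical frame}

    def grant(self, device, io_page, frame):
        """Called by the hypervisor when a partition maps a DMA buffer."""
        self.tables.setdefault(device, {})[io_page] = frame

    def revoke(self, device, io_page):
        """Remove a mapping so later DMA to that page faults."""
        self.tables.get(device, {}).pop(io_page, None)

    def translate(self, device, io_addr):
        """Translate a device-visible address, or raise DmaFault."""
        io_page, offset = divmod(io_addr, self.PAGE_SIZE)
        table = self.tables.get(device, {})
        if io_page not in table:
            raise DmaFault("device %r: unmapped I/O address %#x"
                           % (device, io_addr))
        return table[io_page] * self.PAGE_SIZE + offset


iommu = Iommu()
iommu.grant("nic0", io_page=0, frame=42)
phys = iommu.translate("nic0", 0x10)     # lands inside frame 42
try:
    iommu.translate("nic0", 5 * Iommu.PAGE_SIZE)  # never granted
    faulted = False
except DmaFault:
    faulted = True
```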
Self-virtualizing devices
[Diagram labels: four Logical Partitions; Kernel <-> Hypervisor Interface; Hypervisor; Hardware <-> Hypervisor Interface; Hardware Platform.]
What we need to do
• Xen is the obvious choice.
• We need to help drive the definition of the hypervisor before it becomes too mature.
– Investigate the costs of their design decisions, and fix them.
• Need to drive the definition of I/O virtualization and self-virtualizing devices.
• Determine the set of features we can use in common, e.g.:
– One implementation of checkpoint/restart/migration,…
– Gang scheduling of partitions.
• Hypervisor as a base is close to ready; making it a first-class platform for HEC will take investment.
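Gang scheduling of partitions, listed above as a candidate common feature, means that all virtual CPUs of a partition run in the same time slice. A minimal sketch, assuming simple round-robin packing and a hypothetical gang_schedule helper (a real scheduler would also have to handle priorities, idle CPUs, and partitions wider than the machine):

```python
def gang_schedule(partitions, num_cpus, num_slices):
    """Return one {cpu: (partition, vcpu)} map per time slice.

    partitions maps a partition name to its number of virtual CPUs.
    Partitions are packed in round-robin order; a partition is placed in
    a slice only if all of its vCPUs fit at once.
    """
    for name, width in partitions.items():
        if width > num_cpus:
            raise ValueError("partition %s needs %d CPUs" % (name, width))
    order = list(partitions)
    if not order:
        return [{} for _ in range(num_slices)]
    schedule = []
    start = 0
    for _ in range(num_slices):
        slot, free, placed = {}, num_cpus, 0
        # Walk partitions starting where the previous slice left off.
        for i in range(len(order)):
            name = order[(start + i) % len(order)]
            width = partitions[name]
            if width > free:
                break  # keep round-robin order; do not skip ahead
            first_cpu = num_cpus - free
            for vcpu in range(width):
                slot[first_cpu + vcpu] = (name, vcpu)
            free -= width
            placed += 1
        start = (start + placed) % len(order)
        schedule.append(slot)
    return schedule


# Two 2-way partitions share a slice; the 4-way partition gets its own.
schedule = gang_schedule({"A": 2, "B": 2, "C": 4}, num_cpus=4, num_slices=3)
```

Stopping at the first partition that does not fit keeps the ordering fair at the cost of leaving some CPUs idle; packing more aggressively is an equally valid design choice.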