Virtualization
Operating Systems, 2016, Meni Adler, Danny Hendler & Amnon Meisels
1
What is virtualization?
Creating a virtual version of something
o Hardware, operating system, application, network, memory, storage
“The construction of an isomorphism between
a guest system and a host” [Popek, Goldberg, ’74]
2
Example: virtual disk
Partition a single hard disk into multiple virtual disks
o Virtual disk has virtual tracks & sectors
Implement virtual disk by file
Map between virtual disk and real disk contents
Virtual disk write/read mapped to file write/read in host
system
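A minimal sketch of the file-backed virtual disk described above, assuming a made-up geometry (`SECTOR_SIZE`, `SECTORS_PER_TRACK`) and a hypothetical `VirtualDisk` class. Each virtual (track, sector) pair is mapped to a byte offset in an ordinary host file, so a virtual read/write becomes a host file read/write.

```python
import os
import tempfile

SECTOR_SIZE = 512        # bytes per virtual sector (assumed)
SECTORS_PER_TRACK = 8    # virtual geometry (assumed)


class VirtualDisk:
    """A virtual disk whose sectors live in one host file."""

    def __init__(self, path, tracks):
        self.path = path
        self.tracks = tracks
        # Pre-size the backing file so every virtual sector has a place.
        with open(path, "wb") as f:
            f.truncate(tracks * SECTORS_PER_TRACK * SECTOR_SIZE)

    def _offset(self, track, sector):
        # Map a virtual (track, sector) address to an offset in the host file.
        if not (0 <= track < self.tracks and 0 <= sector < SECTORS_PER_TRACK):
            raise ValueError("address outside the virtual disk")
        return (track * SECTORS_PER_TRACK + sector) * SECTOR_SIZE

    def write_sector(self, track, sector, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("a sector write must be exactly SECTOR_SIZE bytes")
        with open(self.path, "r+b") as f:
            f.seek(self._offset(track, sector))
            f.write(data)

    def read_sector(self, track, sector):
        with open(self.path, "rb") as f:
            f.seek(self._offset(track, sector))
            return f.read(SECTOR_SIZE)


# Usage: the guest addresses tracks and sectors; the host only sees a file.
with tempfile.TemporaryDirectory() as host_dir:
    disk = VirtualDisk(os.path.join(host_dir, "vm1.img"), tracks=4)
    disk.write_sector(2, 3, b"A" * SECTOR_SIZE)
    readback = disk.read_sector(2, 3)
    untouched = disk.read_sector(0, 0)
```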
3
What is virtualization? (continued)
A way to run multiple operating systems and applications
on the same hardware (virtual machines)
Only virtual machine manager (a.k.a. hypervisor) has full
system control
Virtual machines completely isolated from each other
(or so we hope)
4
Basic concepts
Virtual Machine (VM)
Host
Guest
Hypervisor (type II) /
Virtual Machine Monitor
5
Types of virtualization
Full virtualization – guest OS runs unmodified
Para-virtualization – guest OS must be aware of
virtualization, source-code modifications required
Hardware virtualization support may be used for both
Our focus is on full virtualization
10
Virtualization advantages
Cost-effectiveness – less hardware
o Multiple virtual machines / operating systems /
services on single physical machine (server consolidation)
o Various forms of computation as a service
Isolation
o Good for security
o Great for reliability and recovery: If VM crashes it can be
rebooted, does not affect other services (fault containment)
o VM migration
Development tool
o Work on multiple OS in parallel
o Develop and debug OS in user mode
o Origins of VMware as a tool for developers
11
Virtualization vs. Multi-Processing
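[Figure: two software stacks compared]
Multiprocessing: Process1, Process2, … run on a single OS, which runs on the HW (disk, NIC, …)
o User space/kernel separation between the processes and the OS
o HW interface between the OS and the HW
Virtualization: each VM runs its own processes (Pr1, Pr2, …) on its own OS (OS1, OS2, …)
o Virtual HW interface between each guest OS and the VMM/Hypervisor
o Real HW interface between the VMM/Hypervisor and the HW (disk, NIC, …)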
12
Type 1 and type 2 hypervisors
Type 1: VMware ESX, Microsoft Hyper-V, Xen
Type 2: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM
Figure 7-1. Location of type 1 and type 2 hypervisors.
13
Type 1 and type 2 hypervisors (continued)
Figure 7-2. Examples of the various combinations of
virtualization type and hypervisor. Type 1 hypervisors
always run on the bare metal whereas type 2
hypervisors use the services of an existing host
operating system.
14
What's required of a (classic) hypervisor
Hypervisor should provide the following:
Safety: have full control of virtualized resources
Fidelity: program behavior on VM should be identical to its
behavior on bare hardware
Efficiency: As much as possible, run directly on hardware without
hypervisor intervention
Full interpretation isn't efficient
15
Classic virtualization: trap and emulate
[Figure: VM1 and VM2 run above the VMM, which runs on the HW]
(1) Trap from the VM to the VMM
(2) Interrupt handler in the VMM performs HW emulation
(3) Return to process
Emulation is the process of implementing the functionality/interface
of one system on a system having different functionality/interface
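The three steps of trap and emulate can be sketched as a toy simulation. Everything here is an assumption made for illustration: the tuple instruction format, the `VirtualCPU` class, and the two-instruction privileged set. Ordinary instructions run directly; a privileged one raises a trap, the VMM applies its effect to the virtual CPU only, and execution resumes in the guest.

```python
class Trap(Exception):
    """Raised by the simulated HW when user-mode code runs a privileged instruction."""


PRIVILEGED = {"cli", "sti"}  # illustrative subset, not a real ISA listing


class VirtualCPU:
    """The state the guest believes it is controlling."""

    def __init__(self):
        self.interrupts_enabled = True
        self.acc = 0


def hw_execute(vcpu, instr):
    """Direct execution in user mode: privileged instructions trap."""
    op, *args = instr
    if op in PRIVILEGED:
        raise Trap(op)
    if op == "add":
        vcpu.acc += args[0]
    else:
        raise ValueError(f"unknown instruction {op!r}")


def vmm_emulate(vcpu, op):
    """VMM handler: apply the instruction's effect to the virtual CPU only."""
    if op == "cli":
        vcpu.interrupts_enabled = False
    elif op == "sti":
        vcpu.interrupts_enabled = True


def run_guest(vcpu, program):
    traps = 0
    for instr in program:
        try:
            hw_execute(vcpu, instr)          # runs without VMM intervention
        except Trap as trap:                 # (1) trap
            vmm_emulate(vcpu, trap.args[0])  # (2) handler emulates
            traps += 1
        # (3) return to the guest: continue with the next instruction
    return traps


vcpu = VirtualCPU()
trap_count = run_guest(vcpu, [("add", 2), ("cli",), ("add", 3)])
```

Only the `cli` causes a VMM entry; the two `add` instructions run at full speed, which is the efficiency argument for this scheme.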
16
Trap and emulate: difficulties
Sensitive instructions: behave differently in kernel/supervisor
and user mode
I/O instructions, enable/disable interrupts, …
Privileged instructions: cause a trap if executed in user mode
Theorem [Popek and Goldberg, 1974]
A machine can be virtualized [using trap and emulate]
if every sensitive instruction is privileged.
x86 processors prior to 2005 do not satisfy this condition
In 2005, Intel/AMD introduced virtualization HW support.
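The condition of the theorem is a subset test, which a few lines can make concrete. The instruction sets below are a toy classification chosen for illustration, not a complete listing of any real ISA, and the function names are hypothetical.

```python
def untrapped_sensitive(sensitive, privileged):
    """Sensitive instructions that would NOT trap when run in user mode."""
    return set(sensitive) - set(privileged)


def trap_and_emulate_possible(sensitive, privileged):
    """Popek-Goldberg condition: every sensitive instruction is privileged."""
    return not untrapped_sensitive(sensitive, privileged)


# Toy classification (assumed for the example).
SENSITIVE = {"cli", "sti", "in", "out", "popf"}
PRIVILEGED = {"cli", "sti", "in", "out", "hlt"}

offenders = untrapped_sensitive(SENSITIVE, PRIVILEGED)
ok_before = trap_and_emulate_possible(SENSITIVE, PRIVILEGED)
ok_if_popf_trapped = trap_and_emulate_possible(SENSITIVE, PRIVILEGED | {"popf"})
```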
17
What is sensitive?
CPU – registers
MMU
o Page table
o Segments
Interrupts
Timers
IO devices
18
X86 virtualization problem I
The x86 architecture (w/o virtualization extensions)
can't be virtualized by trap and emulate.
Some sensitive instructions are not privileged.
Example: the popf instruction
o Pops 16 bits from stack to flags register
o One of the flags masks (i.e. disables) interrupts
o The instruction is not privileged
o What happens if the OS of a VM runs popf?
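To answer the question on this slide: in protected mode, when the current privilege level is numerically greater than the I/O privilege level, popf leaves the interrupt flag unchanged and raises no exception, so the hypervisor never finds out. The function below is a deliberately simplified model of that behaviour (it ignores the IOPL bits themselves and all other special cases); `popf`, its parameters and the default `iopl=0` are assumptions of the sketch.

```python
IF_MASK = 1 << 9  # the interrupt-enable flag (IF) is bit 9 of the flags register


def popf(flags, popped, cpl, iopl=0):
    """Simplified popf: return the new flags value.

    If cpl > iopl, the IF bit keeps its old value and NO trap is raised;
    all other bits are taken from the popped value.
    """
    if cpl <= iopl:
        return popped
    return (popped & ~IF_MASK) | (flags & IF_MASK)


interrupts_on = IF_MASK

# A kernel really running in ring 0 disables interrupts as intended.
native = popf(interrupts_on, popped=0, cpl=0)

# A deprivileged guest kernel (ring 1) runs the same instruction:
# the write to IF is silently dropped and the hypervisor is not notified.
guest = popf(interrupts_on, popped=0, cpl=1)
```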
19
X86 virtualization problem II
Some instructions (push, pop, mov) can take segment
selectors (cs, ds, ss) as arguments even in user mode,
so the selectors can be read
The selectors have two bits that hold a privilege
level
o In x86 (beginning with 386), four privilege levels (ring 0 to ring 3)
o Each resource is assigned a level.
o The two lower bits of the cs register are the Current Privilege
Level (CPL) of the code.
o Guest OS thinks that it is in ring 0.
o Guest OS is actually in ring 1
Result - guest OS confusion.
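A short sketch of how guest code could notice the deprivileging: it reads cs (which needs no privilege) and masks the two low bits. The selector values `0x08`, `0x09` and `0x1B` are hypothetical examples, not taken from any particular OS.

```python
def current_privilege_level(cs_selector):
    """The two low bits of the cs selector are the CPL."""
    return cs_selector & 0b11


# Hypothetical selectors: same descriptor index, different privilege bits.
expected_by_guest_kernel = current_privilege_level(0x08)  # guest assumes ring 0
seen_when_deprivileged = current_privilege_level(0x09)    # actually ring 1
user_code = current_privilege_level(0x1B)                 # ring 3
```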
20
Implementation options
Emulation
o Full emulation – hypervisor executes code of VM step by step,
testing each instruction – prohibitive overhead.
o Trap and emulate – works only if all sensitive instructions are
privileged instructions.
Change sensitive instructions
o Interpretation – equivalent to emulation (BOCHS, JSLinux).
o Binary translation – rewrite the guest code on the fly (VMware, QEMU).
Para-virtualization – re-compile guest OS (Xen, Denali).
Hardware assistance – Intel VT-x and AMD-V (used by
KVM, Xen, VMware).
21
Outline
Concepts, classical CPU virtualization
o Basic interpretation
Memory virtualization
22
Binary translation
Binary translation is the process of translating one instruction
set to another one.
Approach I: statically translate the entire code base.
o In our case the result is para-virtualization.
o Problems
Dynamically linked libraries are not known at compile time.
Self-modifying code, e.g. program generating code and running it,
is not covered.
23
Dynamic binary translation
Approach II: translate code on the fly (Just In Time).
Simplest approach
o Keep table mapping old instructions to new instructions.
o Fetch old instruction.
o Use table to translate.
o Execute new instruction(s)
Problem: performance
o Overhead for every instruction similarly to interpretation.
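The simplest approach can be sketched as a fetch/translate/execute loop. The tuple instruction format, the `set_vif` pseudo-instruction (write the virtual interrupt flag) and the `lookups` counter are all assumptions made for the example; the counter is there to show that the table is consulted once per executed instruction, which is the performance problem noted above.

```python
# Table mapping old (guest) instructions to new instruction sequences.
TRANSLATION_TABLE = {
    "cli": [("set_vif", 0)],  # disable interrupts -> clear the virtual flag
    "sti": [("set_vif", 1)],  # enable interrupts  -> set the virtual flag
}


def translate(instr):
    """Look up one old instruction; unlisted instructions are kept as-is."""
    return TRANSLATION_TABLE.get(instr[0], [instr])


def execute(state, instr):
    op = instr[0]
    if op == "set_vif":
        state["vif"] = instr[1]
    elif op == "add":
        state["acc"] += instr[1]
    else:
        raise ValueError(f"unknown instruction {op!r}")


def run(program):
    state = {"acc": 0, "vif": 1}
    lookups = 0
    for old in program:          # fetch old instruction
        new = translate(old)     # use table to translate
        lookups += 1
        for instr in new:        # execute new instruction(s)
            execute(state, instr)
    return state, lookups


final_state, lookups = run([("add", 1), ("cli",), ("add", 2)])
```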
24
Dynamic BT with caching
Cache translated code region:
o After translation run from cache.
o Translation occurs only once.
Static translation cannot handle dynamic control transfer,
when:
o Jump depending on memory address.
o Indirect function call (by function pointer).
Translation of dynamic control transfer must be done at
execution time.
25
Virtualization prior to HW support
Figure 7-4. The binary translation rewrites the guest
operating system running in ring 1, while the hypervisor
runs in ring 0
26
VMWare binary translation: example
[Figure: the example as C code, as a 64-bit binary, and in binary (hex) representation]
27
VMWare binary translation: example
Translator reads guest memory at the address indicated by
guest PC
Decodes instructions, creates Intermediate Representation
- IR objects
Accumulates IR objects to translation units (TUs)
o Basic blocks (BB), stops upon control flow
First TU
Compiled code fragment (CCF)
28
VMWare binary translation: example (continued)
[Figure: the compiled code fragment (CCF) of the first TU, annotated over three build-up slides]
o Identical code
o Translation of jump BB
o Translation of fall through BB
31
VMWare binary translation: example
C code
64-bit binary
Which basic block will be translated next?
32
VMWare binary translation example: output
34
VMWare binary translation operation
Translation cache (TC) stores translations done so far
A hash table tracks the input to output correspondence
Chaining optimization allows one CCF to jump directly to
another without calling out of the translation cache
As TC gradually captures guest's working set, proportion of
translation decreases
User code does not have to be translated
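A minimal sketch of a translation cache with chaining, under strong simplifying assumptions: each guest basic block has a single fixed successor, the "translation" is an identity copy, and the `Translator` class and its counters are hypothetical. The hash table maps guest PCs to CCFs; once a CCF has been linked to its successor, control passes between them without going back through the table.

```python
class CCF:
    """Compiled code fragment: the translation of one basic block."""

    def __init__(self, ops, successor_pc):
        self.ops = ops                    # translated instructions
        self.successor_pc = successor_pc  # guest PC of the next block, or None
        self.chained = None               # direct link to the successor's CCF


class Translator:
    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks  # guest PC -> (instructions, successor PC)
        self.cache = {}                   # translation cache: guest PC -> CCF
        self.translations = 0
        self.lookups = 0

    def lookup(self, pc):
        """Hash-table lookup; translate the block only on a miss."""
        self.lookups += 1
        ccf = self.cache.get(pc)
        if ccf is None:
            instrs, successor = self.guest_blocks[pc]
            ccf = CCF(list(instrs), successor)  # identity translation (assumed)
            self.cache[pc] = ccf
            self.translations += 1
        return ccf

    def run(self, entry_pc, max_blocks):
        trace = []
        ccf = self.lookup(entry_pc)
        for _ in range(max_blocks):
            trace.extend(ccf.ops)
            if ccf.successor_pc is None:
                break
            if ccf.chained is None:
                # First exit from this CCF: resolve the target once and chain.
                ccf.chained = self.lookup(ccf.successor_pc)
            ccf = ccf.chained  # chained jump, no lookup
        return trace


# A guest loop made of two basic blocks at guest PCs 0 and 4.
bt = Translator({0: (["a"], 4), 4: (["b"], 0)})
trace = bt.run(entry_pc=0, max_blocks=6)
```

After the first pass around the loop both blocks are in the cache and chained, so the remaining iterations cause neither translations nor lookups.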
35
Dealing with privileged instructions: example
The cli (clear interrupts) instruction is privileged
Translated to: “vcpu.flags.IF=0”
Much faster than source binary!
36
Outline
Concepts, classical CPU virtualization
o Basic interpretation
Memory virtualization
37
Memory allocation
Each VM usually receives a contiguous set of physical
addresses.
o 512 Mbyte – 4 Gbyte are typical values.
As far as VM is concerned, this is the physical memory of
the machine.
The guest OS allocates pages or segments to guest
processes.
38
Memory management
Assumptions of OS in VM:
o Physical memory is a contiguous block of addresses from 0 to
some n.
o OS can map any virtual page to any page frame.
Hypervisor must:
o Partition memory among VMs.
o Ensure virtual page mapping only to assigned page frames.
TLB – a miss in a HW-managed TLB (e.g. x86) causes HW
to walk the page table and load the translation itself.
So the guest OS must not manage the real page table.
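The two hypervisor duties listed above can be sketched with a simple allocator. It assumes each VM gets one contiguous range of host page frames; the `MemoryPartition` class and the VM names are hypothetical. Every guest sees frame numbers starting at 0, and any guest frame outside the VM's own range is refused.

```python
class MemoryPartition:
    """Host page frames handed out to VMs in contiguous, disjoint ranges."""

    def __init__(self, total_frames):
        self.total_frames = total_frames
        self.next_free = 0
        self.ranges = {}  # VM name -> (first host frame, number of frames)

    def assign(self, vm, frames):
        if self.next_free + frames > self.total_frames:
            raise MemoryError("not enough host frames left")
        self.ranges[vm] = (self.next_free, frames)
        self.next_free += frames

    def guest_to_host_frame(self, vm, guest_frame):
        """Map a guest-physical frame to a host frame, enforcing the partition."""
        base, count = self.ranges[vm]
        if not 0 <= guest_frame < count:
            raise PermissionError("frame is not assigned to this VM")
        return base + guest_frame


partition = MemoryPartition(total_frames=8)
partition.assign("vm1", 4)
partition.assign("vm2", 4)
vm1_last = partition.guest_to_host_frame("vm1", 3)
vm2_first = partition.guest_to_host_frame("vm2", 0)
```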
39
Option 1: brute force
[Figure: brute-force memory virtualization]
Guest OS: page directory and page table, located through CR3; define these pages as not R/W
Hypervisor (VMM, SW): VM memory layout; on an interrupt, the VMM corrects the address
CPU (HW): CR3, TLB
40
Brute force – description
Guest page tables are read and write protected in host
system.
If guest OS reads page table (e.g. for page eviction),
writes page table (e.g. after page fault), or changes CR3,
the system traps.
The hypervisor then uses a VM memory layout to:
o Return answers to VM
o Update the layout
Hypervisor switches VM memory layout when new VM is
scheduled.
41
Option 2: shadow page tables
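[Figure: shadow page tables]
Guest OS: page directory and page table, located through the guest's CR3 (G-CR3)
Hypervisor (VMM, SW): shadow page table, located through the real CR3; on an interrupt, the VMM corrects the page table
CPU (HW): CR3, TLB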
42
Shadow page tables – description
Hypervisor maintains “shadow page tables”.
Guest page tables map: Guest VA → Guest PA
Shadow tables: Guest VA → Host PA.
Hypervisor does not trap guest updates to its page table.
o Result – inconsistent guest page table and shadow page table.
When guest process accesses virtual address
o The physical address is not in the guest page table, but in the
shadow page table.
o HW translates correctly, because it is aware only of shadow
tables.
43
Shadow page tables – description (continued)
If address in TLB – TLB hit and no problem.
When guest process causes a page fault
o Hypervisor begins execution.
o Hypervisor updates guest page table with new page.
o Hypervisor updates shadow page table.
Performance is as good as native execution as long as
there are no page faults.
Shadow page tables should be cached so that once a VM
is re-scheduled the page table does not have to be rebuilt
from scratch.
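A sketch of the lazy fill of a shadow table, assuming flat single-level tables keyed by page number and a hypothetical `ShadowPaging` class. The simulated MMU only ever consults the shadow table; when an entry is missing, the hypervisor composes the guest's mapping with its own guest-physical to host-physical map and retries. The case where the guest itself has no mapping (a fault the guest OS must handle) is only signalled, not modelled.

```python
class PageFault(Exception):
    pass


class ShadowPaging:
    def __init__(self, guest_pt, gpa_to_hpa):
        self.guest_pt = guest_pt      # guest page table: guest VPN -> guest PFN
        self.gpa_to_hpa = gpa_to_hpa  # hypervisor map:   guest PFN -> host PFN
        self.shadow = {}              # shadow table:     guest VPN -> host PFN
        self.faults = 0

    def hw_translate(self, vpn):
        """What the MMU does: it is aware only of the shadow table."""
        if vpn not in self.shadow:
            raise PageFault(vpn)
        return self.shadow[vpn]

    def access(self, vpn):
        try:
            return self.hw_translate(vpn)
        except PageFault:
            self.faults += 1              # hypervisor begins execution
            if vpn not in self.guest_pt:
                raise                     # true guest fault: left to the guest OS
            guest_pfn = self.guest_pt[vpn]
            self.shadow[vpn] = self.gpa_to_hpa[guest_pfn]  # update shadow table
            return self.hw_translate(vpn)  # retry now succeeds


mmu = ShadowPaging(guest_pt={0: 5, 1: 2}, gpa_to_hpa={5: 105, 2: 102})
first = mmu.access(0)    # faults once, fills the shadow entry
second = mmu.access(0)   # no hypervisor involvement
other = mmu.access(1)
```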
44
Option 3: nested page tables
[Figure: nested page tables]
Guest OS: page directory and page table, located through CR3
Hypervisor (VMM, SW): host page table, located through EPTP
CPU (HW): CR3, EPTP, TLB
45
Nested page tables - description
The name implies having page tables within page tables.
The essence of the idea is a hardware assist.
o Hardware has an extra pointer and the ability to walk an extra set
of page tables.
o Idea is called Extended Page Tables (EPT) by Intel
Guest page tables hold Guest VA → Guest PA mapping,
access by standard CR3
Extended page tables hold Host VA → Host PA mapping,
access by EPTP (EPT pointer).
Host VA = Guest PA
46
Nested page tables – description (cont'd)
TLB as usual holds Guest VA → Host PA
On memory access
o If found in TLB – no problem.
o If not in TLB, but no page fault, hardware walks both tables and
updates TLB.
o If page fault, hardware traps to the hypervisor, which gets a host
physical page and provides the host virtual page (guest physical) to the VM.
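The memory-access cases above can be sketched with two flat tables and a TLB. This is a simplification: real hardware walks multi-level tables, and each level of the guest walk needs its own lookup in the extended tables. The `NestedMMU` class and its `walks` counter are assumptions of the example.

```python
class NestedMMU:
    def __init__(self, guest_pt, ept):
        self.guest_pt = guest_pt  # reached via CR3:  guest VPN -> guest PFN
        self.ept = ept            # reached via EPTP: guest PFN -> host PFN
        self.tlb = {}             # caches guest VPN -> host PFN
        self.walks = 0

    def translate(self, vpn):
        if vpn in self.tlb:             # found in TLB: no walk needed
            return self.tlb[vpn]
        self.walks += 1                 # TLB miss: hardware walks both tables
        guest_pfn = self.guest_pt[vpn]  # guest VA -> guest PA
        host_pfn = self.ept[guest_pfn]  # guest PA (= host VA) -> host PA
        self.tlb[vpn] = host_pfn        # update TLB with the combined mapping
        return host_pfn


mmu = NestedMMU(guest_pt={0: 5, 1: 2}, ept={5: 105, 2: 102})
miss = mmu.translate(0)
hit = mmu.translate(0)
```

Unlike the shadow-table scheme, the hypervisor does not run on a TLB miss here; it is needed only when the extended tables have no entry for the guest physical page.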
47
Sources
“Modern operating systems”, 4th edition, A. Tanenbaum and
H. Bos
“Virtual machines”, J. E. Smith and R. Nair
A presentation by Niv Gilboa from CSE@BGU
“Formal requirements for virtualizable third generation
architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974
“A comparison of software and hardware techniques for x86
virtualization”, K. Adams and O. Agesen, ASPLOS 2006
48