Virtualization

Operating Systems, 2016, Meni Adler, Danny Hendler & Amnon Meisels
1
What is virtualization?
Creating a virtual version of something
o Hardware, operating system, application, network, memory, storage
“The construction of an isomorphism between
a guest system and a host” [Popek, Goldberg, ’74]
2
Example: virtual disk
 Partition a single hard disk into multiple virtual disks
o Virtual disk has virtual tracks & sectors
 Implement virtual disk by file
 Map between virtual disk and real disk contents
 Virtual disk write/read mapped to file write/read in host
system
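A minimal sketch of the file-backed virtual disk described above. The `VirtualDisk` class, the 512-byte sector size, and the file name are assumptions made for illustration; the only point is that a virtual sector read/write becomes a host file read/write at offset sector × sector size.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size, for illustration only


class VirtualDisk:
    """Hypothetical virtual disk stored in a single host file."""

    def __init__(self, path, num_sectors):
        self.path = path
        self.num_sectors = num_sectors
        # Create a zero-filled backing file of the right size.
        with open(path, "wb") as f:
            f.write(b"\x00" * (num_sectors * SECTOR_SIZE))

    def _offset(self, sector):
        # Map a virtual sector number to a byte offset in the host file.
        if not 0 <= sector < self.num_sectors:
            raise ValueError("sector out of range")
        return sector * SECTOR_SIZE

    def write_sector(self, sector, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        with open(self.path, "r+b") as f:
            f.seek(self._offset(sector))
            f.write(data)

    def read_sector(self, sector):
        with open(self.path, "rb") as f:
            f.seek(self._offset(sector))
            return f.read(SECTOR_SIZE)


def demo_virtual_disk():
    """Write one sector and read it back through the backing file."""
    with tempfile.TemporaryDirectory() as tmp:
        disk = VirtualDisk(os.path.join(tmp, "disk.img"), num_sectors=8)
        disk.write_sector(3, b"A" * SECTOR_SIZE)
        return disk.read_sector(3), disk.read_sector(4)
```

Calling `demo_virtual_disk()` returns the sector that was written and a neighbouring sector that still holds zeros.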
3
What is virtualization? (continued)
 A way to run multiple operating systems and applications
on the same hardware (virtual machines)
 Only virtual machine manager (a.k.a. hypervisor) has full
system control
 Virtual machines completely isolated from each other
(or so we hope)
4
Basic concepts
 Virtual Machine (VM)
 Host
 Guest
 Hypervisor (type II) /
Virtual Machine Monitor
5
Types of virtualization
 Full virtualization – guest OS runs unmodified
 Para-virtualization – guest OS must be aware of
virtualization, source-code modifications required
Hardware virtualization support may be used for both
Our focus is on full virtualization
10
Virtualization advantages
 Cost-effectiveness – less hardware
o Multiple virtual machines / operating systems /
services on single physical machine (server consolidation)
o Various forms of computation as a service
 Isolation
o Good for security
o Great for reliability and recovery: If VM crashes it can be
rebooted, does not affect other services (fault containment)
o VM migration
 Development tool
o Work on multiple OS in parallel
o Develop and debug OS in user mode
o Origins of VMware as a tool for developers
11
Virtualization vs. Multi-Processing
[Figure: two layered stacks, listed top to bottom.
Multiprocessing: Process1, Process2, … | user space/kernel separation | OS | HW interface | HW (disk, NIC, …)
Virtualization: VMs, each running Pr1, Pr2 on its own OS (OS1, OS2, …) | virtual HW interface | VMM/Hypervisor | real HW interface | HW (disk, NIC, …)]
12
Type 1 and type 2 hypervisors
 Type 1: VMware ESX, Microsoft Hyper-V, Xen
 Type 2: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM
Figure 7-1. Location of type 1 and type 2 hypervisors.
13
Type 1 and type 2 hypervisors (continued)
Figure 7-2. Examples of the various combinations of
virtualization type and hypervisor. Type 1 hypervisors
always run on the bare metal whereas type 2
hypervisors use the services of an existing host
operating system.
14
What's required of a (classic) hypervisor
Hypervisor should provide the following:
 Safety: have full control of virtualized resources
 Fidelity: program behavior on VM should be identical to its
behavior on bare hardware
 Efficiency: As much as possible, run directly on hardware without
hypervisor intervention
 Full interpretation isn't efficient
15
Classic virtualization: trap and emulate
[Figure: VM1 and VM2 run above the VMM, which runs on the HW. Steps: (1) trap, (2) interrupt handler in the VMM performs HW emulation, (3) return to process.]
Emulation is the process of implementing the functionality/interface
of one system on a system having different functionality/interface
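The three steps of trap and emulate can be sketched as a toy loop. Everything here is an assumption made for illustration: the two-instruction privileged set, the `VirtualCPU` state, and the use of a Python exception to stand in for a hardware trap.

```python
class VirtualCPU:
    """Hypothetical per-VM CPU state kept by the VMM."""

    def __init__(self):
        self.interrupts_enabled = True
        self.pc = 0


class Trap(Exception):
    """Stands in for the hardware trap on a privileged instruction."""

    def __init__(self, instruction):
        super().__init__(instruction)
        self.instruction = instruction


PRIVILEGED = {"cli", "sti"}  # toy instruction set, assumed for the sketch


def hardware_execute(instruction):
    """Simulated hardware running guest code in user mode."""
    if instruction in PRIVILEGED:
        raise Trap(instruction)  # step (1): trap
    # Non-privileged instructions run directly, without VMM involvement.


def vmm_emulate(vcpu, instruction):
    """Step (2): the VMM handler applies the effect to virtual state only."""
    if instruction == "cli":
        vcpu.interrupts_enabled = False
    elif instruction == "sti":
        vcpu.interrupts_enabled = True


def run_guest(vcpu, program):
    """Run a list of instruction names; return those that trapped."""
    trapped = []
    while vcpu.pc < len(program):
        instruction = program[vcpu.pc]
        try:
            hardware_execute(instruction)
        except Trap as trap:
            vmm_emulate(vcpu, trap.instruction)
            trapped.append(trap.instruction)
        vcpu.pc += 1  # step (3): resume the guest after the instruction
    return trapped
```

Running `run_guest(VirtualCPU(), ["add", "cli", "mov"])` involves the VMM only for `cli`; the other instructions never leave the "hardware" path, which is the efficiency requirement stated earlier.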
16
Trap and emulate: difficulties
 Sensitive instructions: behave differently in kernel/supervisor
and user mode
 I/O instructions, enable/disable interrupts, …
 Privileged instructions: cause a trap if executed in user mode
Theorem [Popek and Goldberg, 1974]
A machine can be virtualized [using trap and emulate]
if every sensitive instruction is privileged.
Not supported by x86 processors prior to 2005
In 2005, Intel/AMD introduced virtualization HW support.
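The condition in the theorem is a subset test, which a short sketch makes concrete. The instruction sets passed in are illustrative toy sets, not a full x86 classification; `popf` is used as the sensitive but unprivileged instruction.

```python
def is_trap_and_emulate_virtualizable(sensitive, privileged):
    """True if every sensitive instruction is also privileged."""
    return set(sensitive) <= set(privileged)


def offending_instructions(sensitive, privileged):
    """Sensitive instructions that would not trap in user mode."""
    return sorted(set(sensitive) - set(privileged))
```

With `sensitive = {"cli", "popf"}` and `privileged = {"cli"}`, the check fails and `offending_instructions` reports `popf`.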
17
What is sensitive?
 CPU – registers
 MMU
o Page table
o Segments
 Interrupts
 Timers
 IO devices
18
X86 virtualization problem I
 The x86 architecture (w/o virtualization extensions)
can't be virtualized by trap and emulate.
 Some sensitive instructions are not privileged.
 Example: the popf instruction
o Pops 16 bits from stack to flags register
o One of the flags masks (i.e. disables) interrupts
o The instruction is not privileged
o What happens if the OS of a VM runs popf?
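One way to see the answer to that question: on x86 without virtualization extensions, popf executed in user mode leaves the interrupt flag unchanged and raises no trap, so the hypervisor never learns that the guest tried to mask interrupts. The sketch below is a deliberately simplified toy model (it ignores IOPL and the other protected flags); the function name and its return shape are assumptions for illustration.

```python
IF_MASK = 1 << 9  # the interrupt-enable flag is bit 9 of the x86 flags register


def popf(current_flags, popped_value, kernel_mode):
    """Toy model of popf: returns (new_flags, trapped)."""
    if kernel_mode:
        # In kernel mode the popped value replaces the flags, including IF.
        return popped_value & 0xFFFF, False
    # User mode: IF keeps its old value, and no trap is raised.
    new_flags = (popped_value & 0xFFFF & ~IF_MASK) | (current_flags & IF_MASK)
    return new_flags, False
```

A guest kernel running in user mode that pops a value with IF cleared still ends up with IF set, and `trapped` is `False` in both modes, so trap and emulate has nothing to hook.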
19
X86 virtualization problem II
 Some instructions (push, pop, mov) can take segment
selectors (cs, ds, ss) as operands even in user
mode, so the selectors can be read
 The selectors have two bits that are their current privilege
level
o In x86 (beginning with 386), four privilege levels (ring 0 to ring 3)
o Each resource is assigned a level.
o The two lower bits of the cs register are the Current Privilege
Level (CPL) of the code.
o Guest OS thinks that it is in ring 0.
o Guest OS is actually in ring 1
 Result - guest OS confusion.
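A small sketch of how a guest could notice the problem: reading cs and masking its two low bits yields the CPL. The selector values used in the usage note are hypothetical examples.

```python
def current_privilege_level(cs_selector):
    """The two low bits of the cs selector give the CPL (ring 0 to 3)."""
    return cs_selector & 0b11


def guest_detects_deprivileging(cs_selector):
    """A guest kernel that expects ring 0 can notice it runs elsewhere."""
    return current_privilege_level(cs_selector) != 0
```

For example, a selector of `0x08` decodes to ring 0, while `0x09` decodes to ring 1, which is what a guest OS pushed into ring 1 would observe.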
20
Implementation options
 Emulation –
o Full emulation – hypervisor executes code of VM step by step,
testing each instruction – prohibitive overhead.
o Trap and emulate if sensitive instructions ⊆ privileged
instructions
 Change sensitive instructions
o Interpretation – equivalent to emulation (BOCHS, JSLinux).
o Binary translation – change (VMware, QEMU).
 Para-virtualization – re-compile guest OS (XEN, Denali).
 Hardware assistance – Intel VT-x and AMD-V (used by
KVM, XEN, VMware).
21
Outline
 Concepts, classical CPU virtualization
o Basic interpretation
 Memory virtualization
22
Binary translation
 Binary translation is the process of translating one instruction
set to another one.
 Approach I: statically translate the entire code base.
o In our case the result is para-virtualization.
o Problems
 Dynamically linked libraries are not known at compile time.
 Self-modifying code, e.g. program generating code and running it,
is not covered.
23
Dynamic binary translation
 Approach II: translate code on the fly (Just In Time).
 Simplest approach
o Keep table mapping old instructions to new instructions.
o Fetch old instruction.
o Use table to translate.
o Execute new instruction(s)
 Problem: performance
o Overhead for every instruction similarly to interpretation.
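The simplest approach can be sketched as a per-instruction table lookup. The table contents and the replacement names (`vcpu_clear_if`, `vcpu_set_if`) are hypothetical; "executing" an instruction is modelled as appending its name to a list.

```python
TRANSLATION_TABLE = {
    "cli": ["vcpu_clear_if"],  # sensitive: replaced by a safe sequence
    "sti": ["vcpu_set_if"],
}


def translate(instruction):
    """Look up the replacement; harmless instructions map to themselves."""
    return TRANSLATION_TABLE.get(instruction, [instruction])


def run_naive(program):
    """Fetch, translate and 'execute' one instruction at a time."""
    executed = []
    lookups = 0
    for old in program:
        lookups += 1  # translation cost is paid on every single fetch
        executed.extend(translate(old))
    return executed, lookups
```

The `lookups` counter equals the number of executed instructions, which is the per-instruction overhead noted above.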
24
Dynamic BT with caching
 Cache translated code region:
o After translation run from cache.
o Translation occurs only once.
 Static translation cannot handle dynamic control transfer,
when:
o Jump depending on memory address.
o Indirect function call (by function pointer).
 Translation of dynamic control transfer must be done at
execution time.
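A minimal sketch of translation with caching, assuming blocks are identified by their guest start address. The `TranslationCache` class is hypothetical, and upper-casing instruction names is only a stand-in for real translation. The `trace` argument plays the role of control-transfer targets that become known only at execution time.

```python
class TranslationCache:
    """Hypothetical cache: guest block address -> translated code."""

    def __init__(self, translate_block):
        self.translate_block = translate_block
        self.cache = {}
        self.translations = 0

    def lookup(self, guest_pc):
        if guest_pc not in self.cache:
            # First visit: translate once and remember the result.
            self.cache[guest_pc] = self.translate_block(guest_pc)
            self.translations += 1
        return self.cache[guest_pc]


def run_with_cache(blocks, trace):
    """blocks: address -> instruction list; trace: addresses as executed."""
    tc = TranslationCache(lambda pc: [i.upper() for i in blocks[pc]])
    executed = []
    for pc in trace:
        executed.extend(tc.lookup(pc))
    return executed, tc.translations
```

Executing the trace `[0x100, 0x200, 0x100, 0x100]` over two blocks translates each block once, however often it runs.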
25
Virtualization prior to HW support
Figure 7-4. The binary translation rewrites the guest
operating system running in ring 1, while the hypervisor
runs in ring 0
26
VMware binary translation: example
[Figure: the example as C code, as a 64-bit binary, and in binary (hex) representation]
 Translator reads guest memory at the address indicated by
guest PC
 Decodes instructions, creates Intermediate Representation
(IR) objects
 Accumulates IR objects into translation units (TUs)
o Basic blocks (BB): a TU stops upon control flow
[Figure: the first TU and its compiled code fragment (CCF). Successive builds mark the identical code, the translation of the jump BB, and the translation of the fall-through BB.]
Which basic block will be translated next?
33
VMware binary translation example: output
34
VMware binary translation operation
 Translation cache (TC) stores translations done so far
 A hash table tracks the input to output correspondence
 Chaining optimization allows one CCF to jump directly to
another without calling out of the translation cache
 As TC gradually captures guest's working set, proportion of
translation decreases
 User code does not have to be translated
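The hash table and the chaining optimization can be sketched together. This is a toy model under stated assumptions: each block has a single static successor, the `CCF` class and `run_chained` function are hypothetical names, and a "call-out" means leaving the translation cache to consult the translator.

```python
class CCF:
    """Hypothetical compiled code fragment for one guest basic block."""

    def __init__(self, guest_pc, next_pc):
        self.guest_pc = guest_pc
        self.next_pc = next_pc  # guest address of the successor block
        self.chained = None     # direct link to the successor's CCF


def run_chained(successors, start, steps):
    """Execute `steps` blocks; return (call-outs, cached fragments)."""
    tc = {}        # hash table: guest pc -> CCF
    callouts = 0   # times execution left the translation cache
    current = None
    pc = start
    for _ in range(steps):
        if current is not None and current.chained is not None:
            current = current.chained      # direct CCF-to-CCF jump
        else:
            callouts += 1
            if pc not in tc:
                tc[pc] = CCF(pc, successors[pc])  # translate on first use
            if current is not None:
                current.chained = tc[pc]   # patch the exit: chain the CCFs
            current = tc[pc]
        pc = current.next_pc
    return callouts, len(tc)
```

For a two-block loop run for six steps, only the first three steps call out; afterwards the fragments jump to each other directly, mirroring how the proportion of translation work drops as the cache captures the working set.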
35
Dealing with privileged instructions: example
 The cli (clear interrupt flag) instruction is privileged
 Translated to: “vcpu.flags.IF=0”
 Much faster than source binary!
36
Outline
 Concepts, classical CPU virtualization
o Basic interpretation
 Memory virtualization
37
Memory allocation
 Each VM usually receives a contiguous set of physical
addresses.
o 512 Mbyte – 4 Gbyte are typical values.
 As far as VM is concerned, this is the physical memory of
the machine.
 The guest OS allocates pages or segments to guest
processes.
38
Memory management
 Assumptions of OS in VM:
o Physical memory is a contiguous block of addresses from 0 to
some n.
o OS can map any virtual page to any page frame.
 Hypervisor must:
o Partition memory among VMs.
o Ensure virtual page mapping only to assigned page frames.
 TLB – a miss in a HW-managed TLB (e.g. x86) causes the HW
to fetch the translation from the page table.
 The guest OS must not manage the real page table.
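The two hypervisor duties above (partitioning memory and confining each VM to its own page frames) can be sketched as follows. The `MemoryPartitioner` class, the bump-style allocation, and the frame counts in the usage note are assumptions for illustration only.

```python
class MemoryPartitioner:
    """Hypothetical bookkeeping: each VM gets a contiguous run of host frames."""

    def __init__(self, total_frames):
        self.total_frames = total_frames
        self.next_free = 0
        self.regions = {}  # vm name -> (first host frame, frame count)

    def allocate(self, vm, frames):
        if self.next_free + frames > self.total_frames:
            raise MemoryError("not enough host frames")
        self.regions[vm] = (self.next_free, frames)
        self.next_free += frames

    def to_host_frame(self, vm, guest_frame):
        """Translate a guest frame number, refusing anything out of range."""
        base, count = self.regions[vm]
        if not 0 <= guest_frame < count:
            raise PermissionError("frame outside the VM's allocation")
        return base + guest_frame
```

Each guest sees frames numbered from 0, matching its assumption of contiguous physical memory, while `to_host_frame` keeps the result inside the frames assigned to that VM.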
39
Option 1: brute force
[Figure: SW layer: the guest OS with its page directory and page table ("Define these pages as not R/W"), and the hypervisor (VMM) with the VM memory layout. HW layer: CPU with CR3 and TLB. Annotation: "Interrupt & VMM corrects address."]
40
Brute force – description
 Guest page tables are read- and write-protected in the host
system.
 If the guest OS reads a page table (e.g. for page eviction),
writes a page table (e.g. after a page fault), or changes CR3,
the system traps.
 The hypervisor then uses the VM memory layout to:
o Return answers to the VM
o Update the layout
 Hypervisor switches the VM memory layout when a new VM is
scheduled.
41
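Option 2: shadow page tables
[Figure: SW layer: the guest OS with its page directory and page table (reached via G-CR3), and the hypervisor (VMM) with the shadow page table (reached via CR3). HW layer: CPU and TLB. Annotation: "Interrupt & VMM corrects page table."]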
42
Shadow page tables – description
 Hypervisor maintains “shadow page tables”.
 Guest page tables map: Guest VA → Guest PA
 Shadow tables: Guest VA → Host PA.
 Hypervisor does not trap guest updates to its page table.
o Result – inconsistent guest page table and shadow page table.
 When guest process accesses virtual address
o The physical address is not in the guest page table, but in the
shadow page table.
o HW translates correctly, because it is aware only of shadow
tables.
43
Shadow page tables – description (continued)
 If address in TLB – TLB hit and no problem.
 When guest process causes a page fault
o Hypervisor begins execution.
o Hypervisor updates guest page table with new page.
o Hypervisor updates shadow page table.
 Performance is as good as native execution as long as
there are no page faults.
 Shadow page tables should be cached so that once a VM
is re-scheduled the page table does not have to be rebuilt
from scratch.
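A minimal sketch of lazily filled shadow page tables, assuming page-granular dictionaries in place of real multi-level tables and a guest that only adds mappings (it never changes or removes one). The `ShadowPaging` class and its method names are hypothetical.

```python
class ShadowPaging:
    """Toy model: the 'hardware' consults only the shadow table."""

    def __init__(self, guest_to_host):
        self.guest_to_host = guest_to_host  # guest PA page -> host PA page
        self.guest_pt = {}                  # guest VA page -> guest PA page
        self.shadow_pt = {}                 # guest VA page -> host PA page
        self.faults = 0

    def guest_map(self, gva_page, gpa_page):
        """Guest edits its own table; this is not trapped."""
        self.guest_pt[gva_page] = gpa_page

    def access(self, gva_page):
        """Memory access by a guest process."""
        if gva_page not in self.shadow_pt:
            # Page fault: the hypervisor runs and fills the shadow entry.
            self.faults += 1
            gpa_page = self.guest_pt[gva_page]
            self.shadow_pt[gva_page] = self.guest_to_host[gpa_page]
        return self.shadow_pt[gva_page]
```

The first access to a page faults into the hypervisor; later accesses resolve from the shadow table without intervention, which is why performance matches native execution as long as there are no page faults.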
44
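Option 3: nested page tables
[Figure: SW layer: the guest OS with its page directory and page table (reached via CR3), and the hypervisor (VMM) with the host page table (reached via EPTP). HW layer: CPU and TLB.]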
45
Nested page tables - description
 The name implies having page tables within page tables.
 The essence of the idea is a hardware assist.
o Hardware has an extra pointer and the ability to walk an extra set
of page tables.
o Idea is called Extended Page Tables (EPT) by Intel
 Guest page tables hold the Guest VA → Guest PA mapping,
accessed via the standard CR3
 Extended page tables hold the Host VA → Host PA mapping,
accessed via the EPTP (EPT pointer).
 Host VA = Guest PA
46
Nested page tables – description (cont'd)
 TLB as usual holds Guest VA → Host PA
 On memory access
o If found in TLB – no problem.
o If not in TLB, but no page fault, hardware walks both tables and
updates TLB.
o If page fault, then the hypervisor gets a host physical page
and provides the host virtual page (guest physical) to the VM.
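The TLB-hit and TLB-miss cases can be sketched as a composition of two lookups. This is a simplification under stated assumptions: single-level dictionaries stand in for the page tables, and the sketch omits the fact that real hardware also translates the guest page-table pointers themselves through the second table. The function name is hypothetical.

```python
def nested_walk(guest_pt, ept, tlb, gva_page):
    """Return (host PA page, source) for a guest VA page."""
    if gva_page in tlb:
        return tlb[gva_page], "tlb"
    gpa_page = guest_pt[gva_page]  # first walk: guest table, via CR3
    hpa_page = ept[gpa_page]       # second walk: extended table, via EPTP
    tlb[gva_page] = hpa_page       # cache the combined Guest VA -> Host PA
    return hpa_page, "walk"
```

The first lookup of a page walks both tables and fills the TLB; a repeated lookup is answered from the TLB with no hypervisor involvement in either case.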
47
Sources
 “Modern operating systems”, 4th edition, A. Tanenbaum and
H. Bos
 “Virtual machines”, J. E. Smith and R. Nair
 A presentation by Niv Gilboa from CSE@BGU
 “Formal requirements for virtualizable third generation
architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974
 “A comparison of software and hardware techniques for x86
virtualization”, K. Adams and O. Agesen, ASPLOS 2006
48