SALVATORE DI GIROLAMO (TA)
Networks and Operating Systems:
Exercise Session 7
Virtual Machine Monitors
Literature: Barham et al., "Xen and the Art of Virtualization";
Anderson & Dahlin, Operating Systems: Principles and Practice, Chapter 14
VMMs and hypervisors
[Diagram: apps run on guest operating systems; each guest runs on its own VMM; the VMMs run on a hypervisor over the real hardware. The VMM creates the illusion of hardware.]
Some folks distinguish the Virtual Machine Monitor from the Hypervisor (we won't).
Running multiple OSes on one machine
[Diagram: several apps running over one hypervisor on the real hardware.]
Application compatibility: "I use Debian for almost everything, but I edit slides in PowerPoint." "Some people compile Barrelfish in a Debian VM over Windows 7 with Hyper-V."
Backward compatibility: "Nothing beats a Windows 98 virtual machine for playing old computer games."
Server consolidation
[Diagram: several applications consolidated over one hypervisor on the real hardware.]
Many applications assume they have the machine to themselves
Each machine is mostly idle
Consolidate servers onto a single physical machine
Resource isolation
[Diagram: applications in separate VMs over one hypervisor on the real hardware.]
Surprisingly, modern OSes do not have an abstraction for a single application
Performance isolation can be critical in some enterprises
Use virtual machines as resource containers
Cloud computing
[Diagram: applications spread across several hypervisors, each running on its own real hardware.]
Selling computing capacity on demand, e.g. Amazon EC2, GoGrid, etc.
Hypervisors decouple allocation of resources (VMs) from provisioning of infrastructure (physical machines)
Operating System development
[Diagram: an editor and compiler (e.g. Visual Studio) alongside the OS under test, over one hypervisor.]
Building and testing a new OS without needing to reboot real hardware
VMM often gives you more information about faults than real hardware anyway
Other cool applications…
[Diagram: a tracer running alongside applications over the hypervisor.]
Tracing
Debugging
Execution replay
Lock-step execution
Live migration
Rollback
Speculation
Etc.
Exam Question:
What is the difference between a Type 1 and Type 2
hypervisor?
Short answer. A Type 1 (bare-metal) hypervisor runs directly on the hardware; a Type 2 (hosted) hypervisor runs atop an existing host OS.
Hypervisor-based VMMs
[Diagram: guest operating systems plus a console (management) OS, each on its own VMM, all running on a hypervisor over the real hardware.]
Examples:
• VMware ESX
• IBM VM/CMS
• Xen

Hosted VMMs
[Diagram: guest operating systems on VMMs that run, alongside ordinary console applications, on a host operating system over the real hardware.]
Examples:
• VMware Workstation
• Linux KVM
• Microsoft Hyper-V
• VirtualBox
Exam Question:
Explain what it means for an instruction set to be
strictly virtualizable.
Answer. An instruction set is strictly virtualizable if it contains no instructions that fail silently when executed in the wrong privilege mode, i.e., instructions that do not cause a trap.
Virtualizing the CPU
A CPU architecture is strictly virtualizable if it can be perfectly
emulated over itself, with all non-privileged instructions
executed natively
Privileged instructions trap
Kernel mode (i.e., the VMM) emulates the instruction
Guest's kernel mode is actually user mode
(or another, extra privilege level such as ring 1)
Examples: IBM S/390, Alpha, PowerPC
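A minimal sketch of the trap-and-emulate idea in C, over a toy ISA (all names here, such as vcpu and vmm_emulate, are hypothetical and purely illustrative): privileged "instructions" trap into a VMM handler that updates per-guest virtual state, while everything else runs "natively".

/* Toy trap-and-emulate loop (hypothetical sketch, not a real VMM). */
#include <stdio.h>
#include <stdbool.h>

typedef enum { I_ADD, I_CLI, I_STI, I_HALT } insn_t;   /* toy ISA */

struct vcpu {
    bool virt_if;   /* guest's virtual interrupt-enable flag */
    long acc;       /* toy accumulator */
};

/* VMM trap handler: emulate a privileged instruction for the guest. */
static void vmm_emulate(struct vcpu *v, insn_t insn)
{
    switch (insn) {
    case I_CLI: v->virt_if = false; break;  /* guest disables interrupts */
    case I_STI: v->virt_if = true;  break;  /* guest enables interrupts */
    default: break;
    }
}

static bool is_privileged(insn_t insn)
{
    return insn == I_CLI || insn == I_STI;
}

int main(void)
{
    struct vcpu v = { .virt_if = true, .acc = 0 };
    insn_t program[] = { I_ADD, I_CLI, I_ADD, I_STI, I_HALT };

    for (int pc = 0; program[pc] != I_HALT; pc++) {
        if (is_privileged(program[pc]))
            vmm_emulate(&v, program[pc]);   /* trap into the VMM */
        else
            v.acc++;                        /* "native" execution */
    }
    printf("acc=%ld virtual IF=%d\n", v.acc, v.virt_if);
    return 0;
}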
Virtualizing the CPU
A strictly virtualizable processor can execute a complete native
Guest OS
Guest applications run in user mode as before
Guest kernel works exactly as before
Problem: x86 architecture is not virtualizable
About 20 instructions are sensitive but not privileged
Mostly segment loads and processor flag manipulation
Non-virtualizable x86: example
PUSHF/POPF instructions
Push/pop condition code register
Includes interrupt enable flag (IF)
Unprivileged instructions: fine in user space!
IF is ignored by POPF in user mode, not in kernel mode
VMM can’t determine if Guest OS wants interrupts disabled!
Can’t cause a trap on a (privileged) POPF
Prevents correct functioning of the Guest OS
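This is easy to demonstrate from user space. A small sketch, assuming x86-64 Linux with GCC inline assembly: we attempt to clear IF (bit 9 of RFLAGS) with POPFQ; in user mode the CPU silently ignores the change instead of trapping, which is exactly why a VMM running a guest kernel deprivileged never sees the guest's intent.

/* Shows POPF silently ignoring IF in user mode (x86-64, GCC). */
#include <stdio.h>

static unsigned long read_rflags(void)
{
    unsigned long f;
    __asm__ volatile("pushfq; popq %0" : "=r"(f));
    return f;
}

static void write_rflags(unsigned long f)
{
    __asm__ volatile("pushq %0; popfq" :: "r"(f) : "cc");
}

int main(void)
{
    unsigned long before = read_rflags();
    write_rflags(before & ~(1UL << 9));   /* try to clear IF: ignored, no trap */
    unsigned long after = read_rflags();

    printf("IF before: %lu, IF after: %lu (change silently ignored)\n",
           (before >> 9) & 1, (after >> 9) & 1);
    return 0;
}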
Exam Question:
Describe how a hypervisor can dynamically adjust the
amount of memory used by the guest.
Answer. With ballooning. The host installs a kernel module (the balloon driver) in the guest, which allocates memory from the guest OS and communicates the page numbers to the host OS. The host OS can then use these pages. To give memory back to the guest, we deflate the balloon by telling the driver in the guest to free some of the allocated memory again.
Ballooning
Technique to reclaim memory from a Guest
Install a “balloon driver” in Guest kernel
Can allocate and free kernel physical memory
Just like any other part of the kernel
Uses HyperCalls to return frames to the Hypervisor, and have them
returned
Guest OS is unaware; to it, the balloon driver simply allocates physical memory
Ballooning: taking RAM away from a VM
1. VMM asks the balloon driver for memory
2. Balloon driver asks the Guest OS kernel for more frames ("inflates the balloon")
3. Balloon driver sends the physical frame numbers to the VMM
4. VMM translates them into machine addresses and claims the frames
[Diagram, built up over four animation steps: the balloon grows inside the guest physical address space as the driver claims physical memory.]
Returning RAM to a VM
1. VMM converts a machine address into a physical address previously allocated by the balloon driver
2. VMM hands the PFN to the balloon driver
3. Balloon driver frees the physical frame back to the Guest OS kernel ("deflates the balloon")
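A toy simulation of the inflate/deflate protocol above, as a sketch in C (all names hypothetical; a real balloon driver allocates through the guest's page allocator and talks to the VMM via actual hypercalls):

/* Toy balloon inflate/deflate simulation (hypothetical sketch). */
#include <stdio.h>

#define GUEST_FRAMES 16

static int balloon[GUEST_FRAMES];      /* PFNs currently held by the balloon */
static int balloon_size = 0;
static int guest_free[GUEST_FRAMES];   /* guest's free-frame stack */
static int guest_free_top = 0;

static void guest_init(void) {
    for (int pfn = 0; pfn < GUEST_FRAMES; pfn++)
        guest_free[guest_free_top++] = pfn;
}

/* Steps 1-4: driver allocates guest frames and reports PFNs to the VMM. */
static void inflate(int n) {
    while (n-- > 0 && guest_free_top > 0) {
        int pfn = guest_free[--guest_free_top];   /* alloc from guest kernel */
        balloon[balloon_size++] = pfn;
        printf("hypercall: guest PFN %d -> VMM claims machine frame\n", pfn);
    }
}

/* Deflate: VMM hands PFNs back; driver frees them to the guest kernel. */
static void deflate(int n) {
    while (n-- > 0 && balloon_size > 0) {
        int pfn = balloon[--balloon_size];
        guest_free[guest_free_top++] = pfn;       /* free back to guest */
        printf("VMM returns PFN %d; driver frees it in the guest\n", pfn);
    }
}

int main(void) {
    guest_init();
    inflate(4);
    deflate(2);
    printf("balloon holds %d frames; guest has %d free\n",
           balloon_size, guest_free_top);
    return 0;
}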
Exam Question:
What problems do we face if we want to use DMA with
a hardware device (e.g., a network card) from within a
virtual machine? Describe two techniques to solve
these problems.
Short answer. The guest OS does not know the host's machine (physical) addresses, which the DMA hardware needs. This can be fixed either using paravirtualization or an IOMMU.
Virtualizing Devices
Familiar by now: trap-and-emulate
I/O space traps
Protect memory and trap
“Device model”: software model of device in VMM
Interrupts → upcalls to Guest OS
Emulate interrupt controller (APIC) in Guest
Emulate DMA with copy into Guest PAS
Significant performance overhead!
Paravirtualized devices
“Fake” device drivers which communicate efficiently with VMM
via hypercalls
Used for block devices like disk controllers
Network interfaces
“VMware tools” is mostly about these
Dramatically better performance!
Networking
Virtual network device in the Guest VM
Hypervisor implements a “soft switch”
Entire virtual IP/Ethernet network on a machine
Many different addressing options
Separate IP addresses
Separate MAC addresses
NAT
Etc.
Exam Question:
Describe how shadow page tables work.
Short answer. The guest sets up its own V→P page tables, which the hardware never uses; the VMM maintains shadow tables mapping guest virtual addresses directly to machine addresses, composed from the guest's V→P mapping and the VMM's P→M mapping, and keeps them consistent by write-protecting the guest's page tables (details on the next slides).
Shadow page tables
Guest OS sets up its own page tables
Not used by the hardware!
VMM maintains shadow page tables
Map directly from Guest VAs to Machine Addresses
Hardware switches to the corresponding shadow table whenever the Guest reloads the PTBR
VMM must keep the V→M table consistent with the Guest's V→P table and
its own P→M table
VMM write-protects all guest page tables
Write trap: apply the write to the shadow table as well
Significant overhead!
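A sketch of the composition the VMM performs, using toy one-level tables (the arrays here are hypothetical; real shadow paging works per page-table page and is driven by the write-protection faults described above):

/* Shadow table = guest V->P composed with VMM P->M (toy sketch). */
#include <stdio.h>

#define N 8
static int guest_vp[N] = { 2, 0, 5, -1, 1, -1, 3, 4 }; /* V -> guest-P (-1 = unmapped) */
static int vmm_pm[N]   = { 9, 4, 7, 1, 6, 3, 0, 8 };   /* guest-P -> machine */

int main(void) {
    int shadow_vm[N];  /* V -> machine: what the MMU actually walks */
    for (int v = 0; v < N; v++)
        shadow_vm[v] = (guest_vp[v] < 0) ? -1 : vmm_pm[guest_vp[v]];
    /* On a write trap to a guest page table, the VMM would recompute
     * the affected shadow entries exactly like this. */
    for (int v = 0; v < N; v++)
        printf("V %d -> M %d\n", v, shadow_vm[v]);
    return 0;
}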
Shadow page tables
[Diagram: shadow page-table mappings for two guests, translating from Guest Virtual AS through Guest Physical AS to Machine Memory.]
Shadow page tables
[Diagram: the Guest OS reads and writes its own Virtual → Guest-Physical tables; the VMM turns guest updates into the Virtual → Machine table the hardware MMU actually uses, and reflects accessed and dirty bits back to the guest.]
• Guest change notifications are optional, but help with batching and knowing when to unshadow
• Latest algorithms work remarkably well
Hardware support
“Nested page tables”
Relatively new in AMD (NPT) and Intel (EPT) hardware
Two-level translation of addresses in the MMU
Hardware knows about:
V→P tables (in the Guest)
P→M tables (in the Hypervisor)
Tagged TLBs to avoid expensive flush on a VM entry/exit
Very nice and easy to code to
One reason kvm is so small
Significant performance overhead…
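A rough account of where that overhead comes from (standard analysis, not from the slides): with an n-level guest table and an m-level nested table, a worst-case TLB miss costs up to nm + n + m memory references, since every guest-physical pointer in the guest walk must itself be translated through the nested tables. For 4-level guest and host tables on x86-64 that is 16 + 4 + 4 = 24 references, versus 4 natively; tagged TLBs and large pages mitigate this in practice.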
Reliable Storage
OSPP Chapter 14
Reliability and Availability
A storage system is:
Reliable if it continues to store data and can read and write it
Reliability: the probability it will remain reliable for some period of time
Available if it responds to requests
Availability: the probability it is available at any given time
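A standard rule of thumb connecting the two, using mean time to failure (MTTF) and mean time to repair (MTTR, both defined later in this deck): Availability = MTTF / (MTTF + MTTR). For example, MTTF = 1000 hours and MTTR = 1 hour gives availability of 1000/1001, about 99.9%.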
What goes wrong?
1. Operation interruption: crash, power failure
   Approach: use transactions to ensure data is consistent
   • Covered in the databases course; see the book for additional material
   • Careful design of file system data structures, with recovery using fsck; superseded by transactions
   • File system transactions: internal to the file system, or exposed to applications; not widely supported (only one atomic operation in POSIX: rename)
2. Loss of data: media failure
   Approach: use redundancy to tolerate loss of media, e.g. RAID storage
Media failures 1: Sector and page failures
Disk keeps working, but a sector doesn’t
Sector writes don’t work, reads are corrupted
Page failure: the same for Flash memory
Approaches:
1. Error correcting codes: encode data with redundancy to recover from errors
   (internally in the drive)
2. Remapping: identify bad sectors and avoid them
   (internally in the disk drive, or externally in the OS / file system)
Media failures 2: Device failure
Entire disk (or SSD) just stops working
Note: always detected by the OS
Explicit failure ⇒ less redundancy required
Expressed as:
Mean Time to Failure (MTTF)
(expected time before disk fails)
Annual Failure Rate = 1/MTTF
(fraction of disks failing in a year)
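Worked example: a disk with a datasheet MTTF of 1,000,000 hours has AFR ≈ 8760 hours/year ÷ 1,000,000 hours ≈ 0.9% per year. (Field studies commonly report higher failure rates than datasheet MTTFs suggest.)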
RAID 1: simple mirroring

Disk 0 and Disk 1 hold identical copies: data blocks 0, 1, 2, …, 11, … appear on both disks.

• Writes go to both disks
• Reads from either disk (may be faster)
• Sector or whole-disk failure: data can still be recovered
Parity disks and striping

Disk 0     Disk 1     Disk 2     Disk 3     Disk 4
Block 0    Block 1    Block 2    Block 3    Parity(0-3)
Block 4    Block 5    Block 6    Block 7    Parity(4-7)
Block 8    Block 9    Block 10   Block 11   Parity(8-11)
…          …          …          …          …

Blocks are striped round-robin across Disks 0-3 (continuing through Block 47 in the original figure); Disk 4 is a dedicated parity disk, holding Parity(12-15) through Parity(44-47) for the corresponding stripes.
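The parity block is simply the bytewise XOR of the data blocks in its stripe, which is what lets any single lost block be rebuilt from the survivors. A minimal sketch in C (toy 4-byte blocks; the layout is hypothetical):

/* RAID-4/5 style parity: compute it, then rebuild a lost block. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NDATA 4
#define BLK   4

int main(void) {
    uint8_t stripe[NDATA][BLK] = { "abc", "def", "ghi", "jkl" };
    uint8_t parity[BLK] = {0};

    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLK; i++)
            parity[i] ^= stripe[d][i];          /* parity = XOR of data blocks */

    /* Suppose the disk holding block 2 fails: rebuild it by XOR-ing
     * the parity with all surviving data blocks. */
    uint8_t rebuilt[BLK];
    memcpy(rebuilt, parity, BLK);
    for (int d = 0; d < NDATA; d++)
        if (d != 2)
            for (int i = 0; i < BLK; i++)
                rebuilt[i] ^= stripe[d][i];

    printf("rebuilt block 2: %s\n", (char *)rebuilt);   /* prints "ghi" */
    return 0;
}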
Parity disks
High overhead for small writes: a small write must read the old data and old parity, then write the new data and new parity (four disk I/Os)
RAID 5: Rotating parity

           Disk 0         Disk 1         Disk 2         Disk 3         Disk 4
Stripe 0:  Parity(0,0)-   Blocks 0-3     Blocks 4-7     Blocks 8-11    Blocks 12-15
           Parity(3,0)
Stripe 1:  Blocks 16-19   Parity(0,1)-   Blocks 20-23   Blocks 24-27   Blocks 28-31
                          Parity(3,1)
Stripe 2:  Blocks 32-35   Blocks 36-39   Parity(0,2)-   Blocks 40-43   Blocks 44-47
                                         Parity(3,2)
…

Each cell is a strip, Strip(disk, stripe), of four sequential blocks on one disk.

• A strip of sequential blocks on each disk balances parallelism with sequential-access efficiency
• The parity strip rotates around the disks with successive stripes
• Can service widely-spaced requests in parallel
Mean time to repair (MTTR)
RAID-5 can lose data in three ways:
1. Two full disk failures (second while the first is recovering)
2. Full disk failure and sector failure on another disk
3. Overlapping sector failures on two disks
MTTR: Mean time to repair
Expected time from disk failure to when new disk is fully rewritten, often
hours
MTTDL: Mean time to data loss
Expected time until 1, 2 or 3 happens
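For case 1, a standard back-of-the-envelope estimate (not from the slides) is MTTDL ≈ MTTF² / (N(N−1) · MTTR) for an N-disk array: some disk fails on average every MTTF/N, and data is lost if one of the remaining N−1 disks also fails within the repair window, which happens with probability roughly (N−1) · MTTR/MTTF per failure.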
Exam Question:
In a RAID 5 file system, explain why reads become
slower when a disk fails.
Answer. Data is striped over several disks. When one disk fails, each read of a block on the failed disk must read the corresponding blocks on all surviving disks and XOR them to reconstruct the contents. This slows down reads (slightly for hardware RAID; significantly for software RAID).