SALVATORE DI GIROLAMO (TA)
Networks and Operating Systems:
Exercise Session 7
Virtual Machine Monitors
Literature: Barham et al., Xen and the Art of Virtualization; and Anderson and Dahlin, Operating Systems: Principles and Practice, Chapter 14
VMMs and hypervisors
[Diagram: applications run on guest operating systems; each guest OS runs on its own VMM; the VMMs run on a hypervisor over the real hardware. The VMM creates the illusion of hardware for its guest.]
Some folks distinguish the Virtual Machine Monitor from the Hypervisor (we won’t)
Running multiple OSes on one machine
[Diagram: several applications, split across guest OSes, over a hypervisor on real hardware.]
 Application compatibility
 I use Debian for almost everything, but I edit slides in PowerPoint
 Some people compile Barrelfish in a Debian VM over Windows 7 with Hyper-V
 Backward compatibility
 Nothing beats a Windows 98 virtual machine for playing old computer games
Server consolidation
[Diagram: several applications consolidated over one hypervisor on one real machine.]
 Many applications assume they have the machine to themselves
 Each machine is mostly idle
 Consolidate servers onto a single physical machine
Resource isolation
[Diagram: applications isolated in separate VMs over a hypervisor on real hardware.]
 Surprisingly, modern OSes do not have an abstraction for a single application
 Performance isolation can be critical in some enterprises
 Use virtual machines as resource containers
Cloud computing
[Diagram: applications in VMs spread across several hypervisors, each on its own physical machine.]
 Selling computing capacity on demand
 E.g. Amazon EC2, GoGrid, etc.
 Hypervisors decouple allocation of resources (VMs) from provisioning of infrastructure (physical machines)
Operating System development
[Diagram: an editor and compiler (e.g., Visual Studio) beside a test OS in a VM, over a hypervisor on real hardware.]
 Building and testing a new OS without needing to reboot real hardware
 VMM often gives you more information about faults than real hardware anyway
Other cool applications…
[Diagram: applications plus a tracer VM over a hypervisor on real hardware.]
 Tracing
 Debugging
 Execution replay
 Lock-step execution
 Live migration
 Rollback
 Speculation
 Etc.
Exam Question:
What is the difference between a Type 1 and Type 2
hypervisor?
Short answer. A Type 1 hypervisor runs directly on the hardware; a Type 2 hypervisor runs atop an existing host OS.
Hypervisor-based VMMs vs. hosted VMMs
[Diagram, left (hypervisor-based): guest operating systems and a console (management) OS, each on its own VMM, all running on a hypervisor over real hardware. Right (hosted): a guest OS on a VMM that runs, alongside ordinary applications, on a host operating system over real hardware.]
Hypervisor-based examples:
• VMware ESX
• IBM VM/CMS
• Xen
Hosted examples:
• VMware Workstation
• Linux KVM
• Microsoft Hyper-V
• VirtualBox
Exam Question:
Explain what it means for an instruction set to be
strictly virtualizable.
Answer. An instruction set is strictly virtualizable if it has no sensitive instructions that fail silently (i.e., without causing a trap) when executed in the wrong privilege mode; every sensitive instruction must trap so the VMM can emulate it.
Virtualizing the CPU
 A CPU architecture is strictly virtualizable if it can be perfectly
emulated over itself, with all non-privileged instructions
executed natively
 Privileged instructions → trap
 Kernel mode (i.e., the VMM) emulates the instruction
 Guest’s kernel mode is actually user mode
Or another, extra privilege level (such as ring 1)
 Examples: IBM S/390, Alpha, PowerPC
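As a concrete picture, here is a minimal sketch of the trap-and-emulate loop a VMM runs on a strictly virtualizable CPU. All the names (vcpu_t, run_guest, emulate, ...) are hypothetical, not any real hypervisor's API:

```c
/* Hypothetical trap-and-emulate core. All names are illustrative;
 * no real hypervisor interface is implied. */
typedef struct vcpu vcpu_t;

enum exit_reason { EXIT_PRIV_INSTR, EXIT_INTERRUPT };

extern enum exit_reason run_guest(vcpu_t *v);        /* run natively until a trap */
extern unsigned long    fetch_instr(vcpu_t *v);      /* instruction that trapped */
extern void emulate(vcpu_t *v, unsigned long instr); /* apply effect to vCPU state */
extern void skip_instr(vcpu_t *v);                   /* advance the guest PC */
extern void deliver_virtual_interrupt(vcpu_t *v);    /* reflect into the guest OS */

void vmm_run(vcpu_t *v)
{
    for (;;) {
        switch (run_guest(v)) {   /* non-privileged instructions execute
                                     natively at full speed */
        case EXIT_PRIV_INSTR:
            /* The guest kernel really runs in user mode (or an extra ring),
             * so its privileged instructions trap here; the VMM emulates
             * them against the virtual CPU state, not the real hardware. */
            emulate(v, fetch_instr(v));
            skip_instr(v);
            break;
        case EXIT_INTERRUPT:
            deliver_virtual_interrupt(v);
            break;
        }
    }
}
```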
Virtualizing the CPU
 A strictly virtualizable processor can execute a complete native
Guest OS
 Guest applications run in user mode as before
 Guest kernel works exactly as before
 Problem: the x86 architecture is not virtualizable
 About 20 instructions are sensitive but not privileged
 Mostly segment loads and processor flag manipulation
Non-virtualizable x86: example
 PUSHF/POPF instructions
 Push/pop condition code register
 Includes interrupt enable flag (IF)
 Unprivileged instructions: fine in user space!
 IF is ignored by POPF in user mode, not in kernel mode
 VMM can’t determine if Guest OS wants interrupts disabled!
 Can’t cause a trap on POPF: it is not a privileged instruction
 Prevents correct functioning of the Guest OS
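You can watch this silent failure from ordinary user space. The following small program (x86-64, GCC/Clang inline assembly; my own illustration, not from the slides) tries to clear IF with POPF and observes that nothing happens and no trap occurs:

```c
/* User-mode demonstration of the x86 POPF problem: clearing the interrupt
 * flag (IF, bit 9 of RFLAGS) via POPF is silently ignored in ring 3 rather
 * than trapping, so a VMM running a guest kernel in user mode never sees it.
 * Build: cc -o popf popf.c   (x86-64 only) */
#include <stdio.h>
#include <stdint.h>

static uint64_t read_flags(void)
{
    uint64_t f;
    __asm__ volatile ("pushfq; popq %0" : "=r"(f));
    return f;
}

int main(void)
{
    uint64_t before  = read_flags();
    uint64_t cleared = before & ~(1ULL << 9);   /* try to clear IF */
    __asm__ volatile ("pushq %0; popfq" :: "r"(cleared) : "cc", "memory");
    uint64_t after = read_flags();
    printf("IF before: %d, IF after POPF: %d\n",
           (int)((before >> 9) & 1), (int)((after >> 9) & 1));
    /* Prints "IF before: 1, IF after POPF: 1":
     * the write to IF was dropped with no trap. */
    return 0;
}
```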
Exam Question:
Describe how a hypervisor can dynamically adjust the
amount of memory used by the guest.
Answer. With ballooning. The host installs a kernel module (the balloon driver) in the guest, which allocates memory from the guest OS and communicates the physical frame numbers to the host OS. The host OS can then reuse those frames. To give memory back to the guest, we deflate the balloon by telling the driver in the guest to free some of the allocated memory again.
Ballooning
 Technique to reclaim memory from a Guest
 Install a “balloon driver” in Guest kernel
 Can allocate and free kernel physical memory
Just like any other part of the kernel
 Uses hypercalls to hand frames back to the Hypervisor, and to reclaim them later
Guest OS is unaware; it simply sees a driver allocating physical memory
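A sketch of what such a driver could look like. The hypercalls and the guest-kernel allocator names are hypothetical; real balloon drivers (e.g., in Xen or VMware) differ in detail:

```c
/* Sketch of a guest balloon driver. alloc_frame/free_frame stand in for the
 * guest kernel's page allocator; the hypercalls are hypothetical. */
#include <stddef.h>

typedef unsigned long pfn_t;                   /* guest physical frame number */

extern pfn_t alloc_frame(void);                /* guest kernel allocator */
extern void  free_frame(pfn_t pfn);
extern void  hypercall_give_frame(pfn_t pfn);  /* VMM claims this frame */
extern pfn_t hypercall_take_frame(void);       /* VMM returns a ballooned frame */

static size_t balloon_size;                    /* frames currently ballooned */

/* Inflate: allocate frames like any other kernel code and hand them to the
 * hypervisor. The guest OS is unaware anything special is happening. */
void balloon_inflate(size_t n)
{
    while (n--) {
        hypercall_give_frame(alloc_frame());
        balloon_size++;
    }
}

/* Deflate: ask the hypervisor for ballooned frames back and free them to
 * the guest kernel (assumes the VMM hands back frames from the balloon). */
void balloon_deflate(size_t n)
{
    while (n-- && balloon_size > 0) {
        free_frame(hypercall_take_frame());
        balloon_size--;
    }
}
```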
Ballooning: taking RAM away from a VM
[Diagram: the balloon driver sits in the guest physical address space; as the balloon inflates, the frames it has claimed are handed to the VMM.]
1. VMM asks balloon driver for memory
2. Balloon driver asks Guest OS kernel for more frames (“inflates the balloon”)
3. Balloon driver sends physical frame numbers to VMM
4. VMM translates them into machine addresses and claims the frames
Returning RAM to a VM
[Diagram: a frame leaves the balloon and is returned to the guest physical address space.]
1. VMM converts a machine address into a physical address previously allocated by the balloon driver
2. VMM hands the PFN to the balloon driver
3. Balloon driver frees the physical frame back to the Guest OS kernel (“deflates the balloon”)
Exam Question:
What problems do we face if we want to use DMA with a hardware device (e.g., a network card) from within a virtual machine? Describe two techniques to solve these problems.
Short answer. The guest OS does not know the host's machine addresses, which the DMA hardware needs. This can be fixed either with paravirtualization or with an IOMMU.
Virtualizing Devices
 Familiar by now: trap-and-emulate
 I/O space traps
 Protect memory and trap
 “Device model”: software model of device in VMM
 Interrupts → upcalls to Guest OS
 Emulate interrupt controller (APIC) in Guest
 Emulate DMA with copy into Guest PAS
 Significant performance overhead!
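To make “device model” concrete, here is a toy sketch of how a trapped MMIO access might be emulated. The register layout and helper names are hypothetical; no real NIC or hypervisor API is implied:

```c
/* Toy "device model" for an emulated NIC. Every guest load/store to the
 * device's MMIO range traps into the VMM and lands here. */
#include <stddef.h>
#include <stdint.h>

enum { REG_STATUS = 0x0, REG_DMA_ADDR = 0x4, REG_CMD = 0x8 };
enum { CMD_RECV = 1, STATUS_RX_DONE = 1 };

struct nic_model { uint32_t status, dma_addr; };   /* software device state */

extern void copy_into_guest(uint32_t gpa, const void *src, size_t n);
extern void inject_guest_interrupt(int vector);    /* emulated APIC upcall */

uint32_t nic_mmio(struct nic_model *nic, uint32_t offset,
                  int is_write, uint32_t val)
{
    switch (offset) {
    case REG_STATUS:                 /* reads return the model's state */
        return nic->status;
    case REG_DMA_ADDR:
        if (is_write) nic->dma_addr = val;
        return nic->dma_addr;
    case REG_CMD:
        if (is_write && val == CMD_RECV) {
            static const uint8_t pkt[64] = { 0xff };  /* pretend packet */
            /* "DMA" is emulated as a plain copy into the guest PAS... */
            copy_into_guest(nic->dma_addr, pkt, sizeof pkt);
            nic->status |= STATUS_RX_DONE;
            inject_guest_interrupt(11);  /* ...followed by a virtual IRQ */
        }
        return 0;
    }
    return 0;
}
```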
Paravirtualized devices
 “Fake” device drivers which communicate efficiently with VMM
via hypercalls
 Used for block devices like disk controllers
 Network interfaces
 “VMware tools” is mostly about these
 Dramatically better performance!
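The performance win comes from replacing one trap per register access with one hypercall per request (or per batch). A sketch of the request path, loosely in the style of shared-memory rings like Xen's; all names here are hypothetical:

```c
/* Sketch of a paravirtual block driver's request path: the guest driver
 * places a request in shared memory and issues a single hypercall. */
#include <stdint.h>

struct pv_req  { uint64_t sector; uint64_t guest_paddr; uint32_t nbytes; };
struct pv_ring {
    uint32_t prod, cons;              /* producer/consumer indices;
                                         cons is advanced by the VMM */
    struct pv_req req[256];           /* shared with the VMM */
};

extern void hypercall_notify(int ring_id);   /* one trap into the VMM */

void pv_block_read(struct pv_ring *r, uint64_t sector,
                   uint64_t buf_paddr, uint32_t nbytes)
{
    uint32_t slot = r->prod % 256;
    r->req[slot] = (struct pv_req){ sector, buf_paddr, nbytes };
    /* publish the request before telling the VMM about it */
    __atomic_store_n(&r->prod, r->prod + 1, __ATOMIC_RELEASE);
    hypercall_notify(0);              /* one exit per request (or batch),
                                         not one per register access */
}
```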
Networking
 Virtual network device in the Guest VM
 Hypervisor implements a “soft switch”
 Entire virtual IP/Ethernet network on a machine
 Many different addressing options
 Separate IP addresses
 Separate MAC addresses
 NAT
 Etc.
Exam Question:
Describe how shadow page tables work.
Shadow page tables
 Guest OS sets up its own page tables
 Not used by the hardware!
 VMM maintains shadow page tables
 Map directly from Guest VAs to Machine Addresses
 Hardware is switched to the shadow table whenever the Guest reloads the PTBR
 VMM must keep the V→M table consistent with the Guest's V→P table and its own P→M table
 VMM write-protects all guest page tables
 Write → trap: apply the write to the shadow table as well
 Significant overhead!
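A sketch of that write-protection trap path. Types and helpers are illustrative; a real VMM must also handle unmapped entries, TLB flushes, and multi-level tables:

```c
/* Sketch: keeping a shadow (V -> M) page table consistent when the guest
 * writes a PTE in its own (V -> P) table. All helpers are hypothetical. */
#include <stdint.h>

typedef uint64_t gva_t, gpa_t, ma_t, pte_t;

extern void  guest_mem_write(gpa_t addr, pte_t val);  /* complete guest's write */
extern gva_t va_mapped_by(gpa_t pte_addr);     /* which VA this PTE covers */
extern ma_t  p_to_m(gpa_t frame);              /* VMM's own P -> M table */
extern void  shadow_set(gva_t va, ma_t frame, pte_t flags);  /* V -> M entry */

#define FRAME_MASK (~0xFFFull)

/* Called on the write-protection fault raised when the guest stores a new
 * PTE into one of its (write-protected) page table pages. */
void on_guest_pte_write(gpa_t pte_addr, pte_t new_pte)
{
    guest_mem_write(pte_addr, new_pte);        /* 1. apply the guest's write */

    /* 2. mirror it: compose guest V -> P with VMM P -> M to get V -> M. */
    gva_t va = va_mapped_by(pte_addr);
    shadow_set(va, p_to_m(new_pte & FRAME_MASK), new_pte & ~FRAME_MASK);
}
```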
Shadow page tables
[Diagram: Guest 1 and Guest 2 each map a virtual page through guest physical frame 5 of their own Guest Physical AS, but end up in different frames of machine memory; the shadow page table mappings go directly from the Guest Virtual AS to Machine Memory.]
Shadow page tables
[Diagram: the guest reads and writes its own Virtual → Guest-Physical table; the VMM intercepts the guest's updates and maintains the Virtual → Machine table that the hardware MMU actually uses, reflecting accessed and dirty bits back to the guest.]
• Guest changes optional, but help with batching, knowing when to unshadow
• Latest algorithms work remarkably well
Hardware support
 “Nested page tables”
 Relatively new in AMD (NPT) and Intel (EPT) hardware
 Two-level translation of addresses in the MMU
 Hardware knows about:
V→P tables (in the Guest)
P→M tables (in the Hypervisor)
 Tagged TLBs to avoid expensive flush on a VM entry/exit
 Very nice and easy to code to
 One reason kvm is so small
 Significant performance overhead…
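Functionally, the two-level translation is just the composition below. This is a sketch: real hardware performs a two-dimensional walk in which every access the guest walk makes to its own page tables is itself translated through the host tables, which is where the overhead comes from.

```c
/* Sketch of nested-paging address translation: guest V -> P, then P -> M.
 * guest_walk/host_walk stand in for full multi-level page table walks. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define OFF_MASK   ((1ull << PAGE_SHIFT) - 1)

extern uint64_t guest_walk(uint64_t gva);   /* guest's V -> P tables */
extern uint64_t host_walk(uint64_t gpa);    /* hypervisor's P -> M (NPT/EPT) */

uint64_t translate(uint64_t gva)
{
    uint64_t gpa = guest_walk(gva);                /* guest-physical address */
    return host_walk(gpa & ~OFF_MASK) | (gpa & OFF_MASK);
}
```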
Reliable Storage
OSPP Chapter 14
Reliability and Availability
A storage system is:
 Reliable if it continues to store data and can read and write it.
 Reliability: probability it will be reliable for some period of
time
 Available if it responds to requests
 Availability: probability it is available at any given time
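These can be made quantitative under the usual assumption of exponentially distributed failures (an assumption, not from the slide): reliability over a period t is R(t) = e^{-t/MTTF}, and availability is A = MTTF / (MTTF + MTTR), i.e., the fraction of time the system is up, where MTTR is the mean time to repair (defined later in these slides).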
What goes wrong?
1. Operation interruption: crash, power failure
 Approach: use transactions to ensure data is consistent
 Covered in the databases course; see the book for additional material
 File system transactions are not widely supported; POSIX offers only one atomic operation: rename
 Traditionally: careful design of file system data structures, with recovery using fsck
 Superseded by transactions, either internal to the file system or exposed to applications
2. Loss of data: media failure
 Approach: use redundancy to tolerate loss of media
 E.g. RAID storage
Media failures 1: Sector and page failures
Disk keeps working, but a sector doesn’t
 Sector writes don’t work, reads are corrupted
 Page failure: the same for Flash memory
Approaches:
1. Error correcting codes:
 Encode data with redundancy to recover from errors
 Internally in the drive
2. Remapping: identify bad sectors and avoid them
 Internally in the disk drive
 Externally in the OS / file system
Media failures 2: Device failure
 Entire disk (or SSD) just stops working
 Note: always detected by the OS
 Explicit failure → less redundancy required
 Expressed as:
 Mean Time To Failure (MTTF)
(expected time before the disk fails)
 Annual Failure Rate = 1/MTTF, with MTTF expressed in years
(fraction of disks failing in a year)
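A worked example with an illustrative figure: for a disk with a quoted MTTF of 1,000,000 hours, AFR = (1 year)/MTTF = 8760 h / 1,000,000 h ≈ 0.9% per year. The 1/MTTF form only works once MTTF is converted into years.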
RAID 1: simple mirroring

Disk 0          Disk 1
Data block 0    Data block 0
Data block 1    Data block 1
Data block 2    Data block 2
…               …
Data block 11   Data block 11

 Writes go to both disks
 Reads from either disk (may be faster)
 Sector or whole disk failure → data can still be recovered
Parity disks and striping

Disk 0     Disk 1     Disk 2     Disk 3     Disk 4
Block 0    Block 1    Block 2    Block 3    Parity(0-3)
Block 4    Block 5    Block 6    Block 7    Parity(4-7)
Block 8    Block 9    Block 10   Block 11   Parity(8-11)
…          …          …          …          …
Block 44   Block 45   Block 46   Block 47   Parity(44-47)
Parity disks
High overhead for small writes
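A sketch of the parity arithmetic (plain XOR; my own illustration): a full-stripe write can compute parity directly, while a small write must first read the old data and old parity, which is the overhead noted above:

```c
/* Sketch of RAID-4/5 parity arithmetic. Parity is the XOR of the data
 * blocks in a stripe; a small write needs a read-modify-write cycle
 * (the "small write penalty"). */
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4096

/* Full-stripe parity: P = D0 ^ D1 ^ ... ^ D(n-1). */
void parity_full(uint8_t *parity, uint8_t *const data[], size_t ndisks)
{
    for (size_t i = 0; i < BLOCK; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndisks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* Small write: update one data block in place.
 * new_P = old_P ^ old_D ^ new_D, so 2 reads + 2 writes are needed. */
void parity_small_write(uint8_t *parity, uint8_t *old_data,
                        const uint8_t *new_data)
{
    for (size_t i = 0; i < BLOCK; i++) {
        parity[i] ^= old_data[i] ^ new_data[i];  /* cancel old, add new */
        old_data[i] = new_data[i];
    }
}
```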
RAID 5: Rotating parity

A strip of sequential blocks on each disk → balances parallelism with sequential-access efficiency. The parity strip rotates around the disks with successive stripes, so widely-spaced requests can be serviced in parallel.

          Disk 0        Disk 1        Disk 2        Disk 3        Disk 4
Stripe 0  Parity(0,0)   Block 0       Block 4       Block 8       Block 12
          Parity(1,0)   Block 1       Block 5       Block 9       Block 13
          Parity(2,0)   Block 2       Block 6       Block 10      Block 14
          Parity(3,0)   Block 3       Block 7       Block 11      Block 15
Stripe 1  Block 16      Parity(0,1)   Block 20      Block 24      Block 28
          Block 17      Parity(1,1)   Block 21      Block 25      Block 29
          Block 18      Parity(2,1)   Block 22      Block 26      Block 30
          Block 19      Parity(3,1)   Block 23      Block 27      Block 31
Stripe 2  Block 32      Block 36      Parity(0,2)   Block 40      Block 44
          Block 33      Block 37      Parity(1,2)   Block 41      Block 45
          Block 34      Block 38      Parity(2,2)   Block 42      Block 46
          Block 35      Block 39      Parity(3,2)   Block 43      Block 47
          …             …             …             …             …
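The placement above can be captured in a few lines. This mapping is illustrative (real RAID-5 layouts vary, e.g., left- vs. right-symmetric), but it reproduces the table's parity rotation:

```c
/* Sketch of the rotating-parity placement shown above: strips of K
 * sequential blocks per disk, one parity strip per stripe, parity starting
 * on disk 0 and rotating with successive stripes. */
#include <stdio.h>

struct loc { int disk, stripe; };

static struct loc raid5_loc(int block, int ndisks, int K)
{
    int strip  = block / K;               /* which data strip overall */
    int stripe = strip / (ndisks - 1);    /* row of strips */
    int idx    = strip % (ndisks - 1);    /* data strip's index in its row */
    int pdisk  = stripe % ndisks;         /* rotating parity disk */
    struct loc l = { idx < pdisk ? idx : idx + 1, stripe };  /* skip parity */
    return l;
}

int main(void)
{
    /* Reproduces the table: blocks 0-3 -> disk 1, blocks 16-19 -> disk 0... */
    for (int b = 0; b < 48; b += 4) {
        struct loc l = raid5_loc(b, 5, 4);
        printf("blocks %2d-%2d -> disk %d, stripe %d\n",
               b, b + 3, l.disk, l.stripe);
    }
    return 0;
}
```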
Mean time to repair (MTTR)
RAID-5 can lose data in three ways:
1. Two full disk failures (second while the first is recovering)
2. Full disk failure and sector failure on another disk
3. Overlapping sector failures on two disks
 MTTR: Mean time to repair
 Expected time from disk failure to when new disk is fully rewritten, often
hours
 MTTDL: Mean time to data loss
 Expected time until 1, 2 or 3 happens
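For failure mode 1 alone there is a common back-of-the-envelope estimate (assuming independent, exponentially distributed failures; not on the slide): the first of N disks fails after about MTTF/N, and data is lost if one of the remaining N−1 disks fails within the repair window, which happens with probability ≈ (N−1)·MTTR/MTTF. Hence MTTDL ≈ MTTF² / (N·(N−1)·MTTR).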
Exam Question:
In a RAID-5 system, explain why reads become slower when a disk fails.
Answer. Data is striped over several disks. When one fails, a block stored on the failed disk must be reconstructed by reading the corresponding blocks on all surviving disks and XORing them together. This slows down reads (slightly for hardware RAID; significantly for software RAID).
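As a sketch of the reconstruction itself (illustrative code, not from the slides):

```c
/* Sketch of a degraded RAID-5 read: a block on the failed disk is rebuilt
 * as the XOR of the corresponding blocks (data and parity) on all the
 * surviving disks, which is the extra work that slows reads down. */
#include <stddef.h>
#include <stdint.h>

void reconstruct_block(uint8_t *out, const uint8_t *const surviving[],
                       size_t ndisks_left, size_t blocklen)
{
    for (size_t i = 0; i < blocklen; i++) {
        uint8_t v = 0;
        for (size_t d = 0; d < ndisks_left; d++)
            v ^= surviving[d][i];   /* parity equation solved for the
                                       missing block */
        out[i] = v;
    }
}
```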