Real Time Issues in Live migration of Virtual Machines

Presented by : Ran Koretzki
Basic Introduction
 What are VMs?
 What is migration?
 What is live migration?
What are VMs?
 VMs (Virtual Machines) - "a completely isolated guest
operating system installation within a normal host
operating system". Modern virtual machines are
implemented with software emulation, hardware
virtualization, or (in most cases) both together.
 This allows multiple independent VMs to run on a single
physical machine.
 A VM's operating system is not hardware dependent.
What are VMs?
[Figure: Traditional Architecture vs. Virtual Architecture]
What are VMs? Benefits
 Hardware independence.
 Encapsulation - a VM can be described in a file.
 Possible to 'snapshot'.
 Easy to move and to back up.
 Easy to clone, which makes it simple to scale out a server application.
 Many VM vendors: VMware, Microsoft, Citrix...
 Enables running multiple operating systems on one machine.
 Consolidation and use of idle computation power.
 Resource management.
 High availability and disaster recovery.
 Easy management.
 Migration - next on the agenda.
Migration
 Definition - the ability to move a VM from one physical host (PH) to another.
 In the past, to move a VM between two physical hosts, it was necessary to
shut down the VM, allocate the needed resources on the new
host, move the VM files, and start the VM on the new host.
 The resources that need to be transferred are: memory and the
internal state of the devices and of the virtual CPU. The most
time-consuming to transfer is memory.
 The problem: downtime.
The solution was at first automation, but the real
improvement came with live migration.
Live Migration
 Wiki definition - "allows a server administrator to move a
running virtual machine or application between different physical
machines without disconnecting the client or application. For a
successful live migration, the memory, storage, and network
connectivity of the virtual machine need to be migrated to the
destination."
 In other words, it allows the server admin to move VMs between physical
hosts transparently to the clients.
 It is usually done for load balancing between hosts and for migration in
case of a hardware failure.
• Live migration of virtual machines
• Zero downtime
By: Fabio Checconi, Tommaso Cucinotta, Manuel Stein.
Presented by: Ran Koretzki
1. Show a heuristic to reduce the downtime of
a VM during live migration by scheduling
which memory pages to transmit first.
VM ≡ {p_1, ..., p_N}, where p_i is a memory page of size P.
Available bandwidth for the transfer - b (bps).
Time needed to transfer a single page, where H is the
per-page transfer overhead:
T = (P + H) / b
Each page p_i is accessed for "write" with a
probability of π_i during each time frame of length T.
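The model above can be sketched in a few lines of Python. This is an illustrative transcription, not the paper's code; the concrete values of H and b are assumptions for the example (the page size matches the evaluation later in the talk).

```python
import random

# Sketch of the paper's page model (naming is mine, values are assumptions).
P = 16 * 1024        # page size in bytes (the evaluation uses 16 KB pages)
H = 64               # assumed per-page protocol overhead, in bytes
b = 100_000_000 / 8  # bandwidth in bytes/sec (100 Mbit/s)

T = (P + H) / b      # time to transfer a single page

def dirtied(pi: float, frames: int) -> bool:
    """In each frame of length T a page is written with probability pi.
    Returns True if the page is dirtied at least once over `frames` frames."""
    return any(random.random() < pi for _ in range(frames))
```

With these numbers T is about 1.3 ms per page, which is the time unit every probability in the following slides is expressed in.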
1. At time t_1 (start of migration), D_1 is the set of pages to be
transmitted. It is initialized to the entire page set used by the VM;
|D_1| = n_1.
2. For k = 1, ..., K: all the pages in D_k are transferred
according to the order specified by φ_k : {1, ..., n_k} → {1, ..., N};
the transfer ends at t_{k+1} = t_k + n_k T;
n_{k+1} pages in D_{k+1} are found dirty again.
3. Stop the VM and transfer the last n_{K+1} pages, using a bandwidth
of b_d bps with b_d ≥ b, up to the migration finishing time
t_f = t_{K+1} + n_{K+1} (P + H) / b_d.
Downtime (when the VM is not available) is t_d = t_f − t_{K+1}:
t_d = n_{K+1} (P + H) / b_d
Overall migration time (t_tot = t_f − t_1):
t_tot = (P + H) / b · Σ_{k=1}^{K} n_k + t_d
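The three steps above can be simulated directly. The sketch below is my own Monte-Carlo rendering of the scheme, not the authors' implementation; `ratio` stands for b_d / b, and the transmission order φ_k is left as a plain fixed order here.

```python
import random

# Monte-Carlo sketch of the K-round pre-copy scheme (naming is mine).
# pi[i] is page i's per-frame write probability, T = (P+H)/b is the
# per-page transfer time, and the stop-and-copy round runs at b_d = ratio*b.

def precopy_times(pi, K, T, ratio):
    """Simulate one migration; return (t_tot, t_d)."""
    N = len(pi)
    dirty = set(range(N))                      # D_1: the whole page set
    t = 0.0
    for _ in range(K):                         # live rounds
        order = sorted(dirty)                  # phi_k: a fixed order
        n_k = len(order)
        t += n_k * T                           # t_{k+1} = t_k + n_k * T
        pos = {p: j for j, p in enumerate(order)}
        nxt = set()
        for p in range(N):
            # a dirty page can only be re-dirtied after its own slot;
            # a clean page is exposed for the whole round (n_k frames)
            frames = n_k - pos[p] if p in pos else n_k
            if any(random.random() < pi[p] for _ in range(frames)):
                nxt.add(p)
        dirty = nxt
    t_d = len(dirty) * T / ratio               # n_{K+1} * (P+H) / b_d
    return t + t_d, t_d
```

For instance, with all π_i = 0 one live round transfers everything and t_d = 0; with all π_i = 1 every page is dirty again at the stop, so only the b_d ≥ b ratio shortens the downtime.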
1. The probability that a page p_i that is clean at time t_1 (start of migration)
becomes dirty, and thus needs to be transmitted in the final migration round, is:
Pr(p_i ∈ D_2 | p_i ∉ D_1) = 1 − (1 − π_i)^{n_1}
2. The probability that a page p_i that is dirty at time t_1 becomes
dirty again after its transmission, and thus needs to be retransmitted, is:
Pr(p_i ∈ D_2 | p_i ∈ D_1) = 1 − (1 − π_i)^{n_1 + 1 − φ_1^{-1}(i)}
(where φ_1^{-1}(·) : {1, ..., N} → {1, ..., n_1} denotes the inverse of the φ_1(·) function)
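The two conditional probabilities above are easy to transcribe as helper functions; this is a direct, illustrative transcription with my own naming.

```python
# The two conditional probabilities from the text (naming is mine).

def p_clean_becomes_dirty(pi: float, n1: int) -> float:
    """Pr(p_i in D_2 | p_i not in D_1) = 1 - (1 - pi)^n1."""
    return 1 - (1 - pi) ** n1

def p_dirty_again(pi: float, n1: int, pos: int) -> float:
    """Pr(p_i in D_2 | p_i in D_1), where pos = phi_1^{-1}(i) is the
    slot (1..n1) in which p_i is transmitted: the page can only be
    re-dirtied in the n1 + 1 - pos frames from its transmission on."""
    return 1 - (1 - pi) ** (n1 + 1 - pos)
```

Note that a dirty page sent first (pos = 1) faces the same exponent n_1 as a clean page, while a page sent last has only one frame in which to be re-dirtied, which is why the transmission order matters at all.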
The expected overall migration time (for K = 1) is then:
E[t_tot] = (P + H)/b · n_1 + (P + H)/b_d · [ n_1 − Σ_{j=1}^{n_1} (1 − π_{φ_1(j)})^{n_1+1−j} + (N − n_1) − Σ_{i∉D_1} (1 − π_i)^{n_1} ]
The order of transmission of the pages that minimizes the
expected number of dirty pages found at the end of the k-th live
migration step must satisfy the following condition:
∀j: π_{φ_k(j)} (1 − π_{φ_k(j)})^{n_k − j} ≤ π_{φ_k(j+1)} (1 − π_{φ_k(j+1)})^{n_k − j}
Conclusion: if the probabilities π_i are lower than 1/(n_k + 1), the
optimum ordering is obtained for increasing values of the
probabilities π_i. On the other hand, if the probabilities are
greater than 1/2, the optimum ordering is obtained for decreasing
values of the π_i.
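A small checker for the adjacent-pairs condition makes the two special cases in the conclusion concrete. This is an illustrative sketch with my own naming, not the paper's code.

```python
# Checker for the adjacent-pairs optimality condition (naming is mine).

def satisfies_condition(pis_in_order):
    """pis_in_order[j-1] = pi_{phi_k(j)}: write probabilities listed in
    transmission order. Verifies, for every adjacent pair, that
    pi_j * (1 - pi_j)^(n_k - j) <= pi_{j+1} * (1 - pi_{j+1})^(n_k - j)."""
    n = len(pis_in_order)
    for j in range(1, n):                    # 1-based positions j, j+1
        a, b = pis_in_order[j - 1], pis_in_order[j]
        if a * (1 - a) ** (n - j) > b * (1 - b) ** (n - j):
            return False
    return True

# If every pi < 1/(n_k + 1), ascending order satisfies the condition:
small = sorted([0.01, 0.05, 0.10, 0.02])             # all < 1/5
assert satisfies_condition(small)

# If every pi > 1/2, descending order satisfies it:
large = sorted([0.6, 0.9, 0.75, 0.55], reverse=True)
assert satisfies_condition(large)
```

There is no single sort key for the general case, because the exponent n_k − j changes with the position j; that is what makes the two threshold regimes in the conclusion useful.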
All pages are equal, but some are more equal
Problem: it is wasteful to transmit, at each step, pages that will almost certainly be dirtied again.
Solution: wait until the end, when the VM is down.
Algorithm: among the n_k pages found dirty at the start of step k, for a
set of pages F_k ⊂ D_k, delay the transmission until the VM is stopped.
Which pages? F_k ≜ {p_i ∈ D_k | π_i ≥ π} (where π is a threshold value)
Denoting by E′[·] the expected values under the delayed-transmission scheme:
E′[t_d] ≤ E[t_d] + |F_1| (1 − π) (P + H) / b_d
E′[t_tot] ≤ E[t_tot] − |F_1| (1 − π) (P + H) / b + |F_1| (1 − π) (P + H) / b_d
Conclusion: it is possible to achieve a negligible increase in downtime with a
substantial decrease of the overall migration time.
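The selection rule itself is a one-line filter; the sketch below is my own rendering, with the threshold defaulting to the value π = 0.30 used later in the evaluation.

```python
# Sketch of the delayed-transmission rule (naming is mine): dirty pages
# whose write probability reaches the threshold are held back and sent
# only during the stop-and-copy phase.

def split_hot_pages(dirty_pis, pi_bar=0.30):
    """dirty_pis: {page_id: pi_i} for the pages in D_k.
    Returns (send_now, F_k): pages worth pre-copying vs. pages so
    write-hot that pre-copying them would almost surely be wasted."""
    send_now = [p for p, pi in dirty_pis.items() if pi < pi_bar]
    delayed = [p for p, pi in dirty_pis.items() if pi >= pi_bar]
    return send_now, delayed
```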
Problem: we need to know, precisely, π_i for each page p_i.
Solution: gather information at run time (statistics).
Problem: non-negligible overheads.
Solution: an LRU-based, frequency-oriented approach.
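One simple way to gather such statistics at run time is to count write faults per page over the observed frames; the sketch below shows this frequency idea only, and is not the LRU approximation the authors actually implement.

```python
# Frequency-based run-time estimate of pi_i (illustrative only; the
# paper's implementation uses an LRU-style approximation instead).

from collections import Counter

class WriteFreqEstimator:
    def __init__(self):
        self.writes = Counter()   # write count per page
        self.frames = 0           # frames observed so far

    def observe_frame(self, written_pages):
        """Record the set of pages written during one frame of length T."""
        self.frames += 1
        self.writes.update(written_pages)

    def pi(self, page):
        """Estimated per-frame write probability of `page`."""
        return self.writes[page] / self.frames if self.frames else 0.0
```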
Computational resources:
Scheduling guarantee by the kernel - (Q, P),
meaning a share Q of CPU time every period P.
Network resources:
b needs to be constant, and it must be possible to reserve it.
An unstable network is not part of this model, but a
migration will still succeed (though it may no longer qualify as
live migration).
The authors have modified the KVM hypervisor.
Page-tracing mechanism:
page accesses are traced within the hypervisor, using a bitmap;
the implementation exploits this information to modify the transfer order.
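A dirty-page bitmap of this kind keeps one bit per guest page, set on each traced write and read-and-cleared once per migration round. The sketch below is a minimal stand-in, not the actual KVM data structure.

```python
# Minimal sketch of a per-page dirty bitmap (not the actual KVM code):
# one bit per guest page, set on a traced write, harvested each round.

class DirtyBitmap:
    def __init__(self, n_pages):
        self.bits = bytearray((n_pages + 7) // 8)

    def mark(self, page):
        """Called by the write-tracing hook when `page` is written."""
        self.bits[page // 8] |= 1 << (page % 8)

    def harvest(self):
        """Return the set of dirty pages and clear the bitmap,
        giving the next round's D_k."""
        dirty = {i for i in range(len(self.bits) * 8)
                 if self.bits[i // 8] >> (i % 8) & 1}
        self.bits = bytearray(len(self.bits))
        return dirty
```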
Simulations:
Virtualized VideoLAN Client (VLC) as a streaming server.
6500 mapped pages (16 KB/page).
Transfer rate of 100 Mbit/s.
8 sec. (!)
Guaranteed bandwidth of 50 Mbit/s.
Standard vs. LRU:
570 -> 300 (47%) (K=1)
360 -> 290 (19.4%) (K=3)
4800 -> 4500 (6.25%) (K=1)
5500 -> 5000 (9.1%) (K=3)
LRU with delayed transmission:
LRU vs. improved LRU (π = 0.30):
300 -> 220 (27%) (K=3)
5000 -> 4400 (12%) (K=3)
• It is possible to minimize downtime and improve QoS with
simple page-ordering algorithms.
• With a guaranteed bandwidth, LRU has proved to be an
effective aid for page ordering, achieving good results.
• Further work needs to be done...