Virtual Machines

Download Report

Transcript Virtual Machines

開源中介軟體:
虛擬機與編譯器
Wei Chung Hsu
1/16/2016
Outline
• Virtual Machines and Translators
• QEMU and LLVM
• Leveraging QEMU and LLVM
• HQEMU/PQEMU
• HQEMU+SIMD
• HSAemu Runtime
• HSA Runtime
• VM-based Software Fault Tolerant Systems
What is Virtual Machine
• Wikipedia:
A Virtual Machine is an emulation of a computer system.
It provides functionality of a physical computer.
Example:
High Level Lang VM
JVM/DVM – Java/Dalvik VM
ARIES/ Rosetta/ IA-32EL
Process VM
BlueStacks or Genymotion
System VM (x-ISA)
– Android emulators
VMWare ESXi, VirtualBox,Xen, KVM based, …
System VM (same ISA)
Virtual Machines
 Process VM: language level (Java, .Net) OS-level (Solaris
Zone), Cross-ISA (Apple Rosetta, IA32EL, FX!32)
System VM: VMware ESX, VirtualBox, Xen, KVM, MS
Hyper-V
Virtualization is an Isomorphism
e(Si)
Sj
Si
Guest
V(Si)
V(Sj)
e’(Si’)
Sj’
Si’
Host
Virtualization is an Isomorphism
State
mapping
e(Si)
Sj
Si
Guest
V(Si)
V(Sj)
e’(Si’)
Sj’
Si’
Host
Emulation
Using Process VM as Example
State
mapping
ADD X10,X10,X11
e(Si)
Si
Sj
X10+=X11
X10
New X10
ARM
V(Si)
X10  RAX
X11  RBX
Or
V(Sj)
ADD RAX, RBX
e’(Si’)
X10  ST[1]
X11  ST[2]
Sj’
Si’
Intel X64
Emulation
Virtualization
• Mapping something virtual (guest) to something real
(host).
 Model the guest, define desired functions
 Emulate the modeled guest on the host
 Functional correctness is the goal, performance is
usually lower than a real machine
 Mapping is
like “table lookup” or “translation”
How to emulate with high performance is
challenging, and calls for good “translation and
optimization”
What is QEMU?
 Quick EMUlator, a free and open-source (GNU GPL License)
 It is FAST, due to the use of DBT.
 Although there are other DBT based tools available, e.g. DynamoRIO, Strata, and
so on, QEMU is widely adopted.
 QEMU is reliable, robust, and retargetable (in fact, it is not the fastest one).
 Created by super-programmer Fabrice Bellard
 QEMU is a hosted VM Monitor. It emulates CPUs through dynamic
binary translation, and provides a set of device models.
 It can be used purely for CPU emulation, as a process VM
 It can also work with KVM to control VMs.
Common Usage of QEMU?
 Operating Modes:
• System-mode emulation – emulation of a full system
• User-mode emulation – launch processes compiled for another CPU (under same
OS)
• Ex. execute an Arm/Linux program on an x86/Linux
 Popular uses:
•
•
•
•
•
For cross-compilation development environments
For system VM, work with KVM, to serve device emulations (KVM hosting)
For system emulation, serve as a CPU emulator, for Virtualbox.
Android Emulators (BlueStacks, Genymotion, …)
Malware analyzers and detectors
Fast Emulation
• Ways of implementing emulation for a machine
• Interpretation: instruction at-a-time
• Binary Translation: block-at-a-time optimized for repeated instruction
executions. (block could be a trace, or even a procedure)
• Interpretation (i.e. Decode-Dispatch)
Decode-Dispatch: Low Efficiency
Decode-dispatch loop
Mostly serial code
Several jumps/branches (some hard-to-predict)
High emulation cost
e.g. executing an add instruction would cost
Approximately 20 target instructions
Several loads/stores
Several shift and mask steps
Optimizing the dispatch loop
Example: DEC/Compaq FX!32
Software pipelined decode-dispatch loop
From executable to executable
Source ISA: assume MIPS
Add r1, r2, r3
Ld r4, 4(sp)
Add r5, r4, r6
…
…
Code
Morphing
addl %eax,%ebx,%ecx
…..
movl %edx,4(%esp)
…..
movl %ecx, 100(%esp);
movl %edx, 104(%esp);
addl %eax, %ecx,%edx;
movl %eax, 108(%esp);
….
….
Target ISA: assume x86
The target executable could be many times
bigger than the source, but can be directly
executed rather than interpreted.
Binary Translation
• Generate custom code for every source instruction. For
example, a load instruction in source code could be translated
into a respective load instruction in native code.
• Get rid of repeated parsing, decoding, and jumping overhead.
• Register mapping is needed to reduce load/stores
significantly.
• Compiled emulation is an early form of binary translation.
Binary Translation Issues
If you really care,
come to my VM class !!
or trace QEMU code
• How to discover code and how to deal with indirect
branches?
• How to deal with self-modifying, self-reference code?
• How to avoid translation time overhead?
• How to control/monitor the translated code?
• How to deal with precise exceptions and external interrupts?
• How to interact with the OS?
• How to handle data misalignment issues, endian issues, data
format issues? ….
Key to Fast Emulation: DBT
What is LLVM?
• LLVM (Low Level Virtual Machine) is a compiler infrastructure,
and a compiling system. A library used to build compiler or
language related software.
• History
Started in 2000 at the University of Illinois by Vikram Adve & Chris
Lattner
First showcase: LLVM: A compilation framework for lifelong
program analysis & transformation (CGO 04)
Target at compile time, link-time, run-time and idle time optimizations.
Chris has been promoting LLVM after he joined Apple in 2005
If you wonder why would a compiler
infrastructure be related to
Low Level Virtual Machine ?
LLVA: The project leads to LLVM
If you’re designing a new processor family …
 Would you like to be able to refine your ISA every year?
 Would you like to add a new optimization without
changing 7 compilers, 4 JITs and 6 debuggers to use it?
 Would you like the compiler to assist your branch
predictor, value predictor, trace cache, or speculation?
 Would you like the program to tell you all loads/stores
are independent in the next 220 static instructions?
In general, none of these is practical
with today’s architectures
VISC: Virtual Instruction Set Computers
Application Software
Operating System
[ IBM AS 400, DAISY,
Transmeta, Strata ]
Kernel
Device drivers
Hardware Processor
Virtual ISA: V-ISA
• s/w representation
Implementation ISA (I-ISA)
• s/w representation
• h/w control
Processor-specific Translator (Software)
2 fundamental
benefits of VISC:
1. V-ISA can be much richer than an I-ISA can be.
2. Translator and processor can be co-designed,
and so truly cooperative.
What is LLVM?
• Collection of industrial strength compiler technology
 Optimizer and Code Generator
 llvm-gcc and Clang frontends
 MSIL and .NET Virtual Machines
• Open Source Project
 Contributors from industry, research groups, individuals
• A set of formats, libraries and tools
 A simple typed IR (bitcode)
 Program analysis and optimization libraries (34 analysis passes, 63 opt pass)
 Machine code generation libraries
Tools that compose the libraries
Open Source C Compilers
• Clang / LLVM
• GCC C
• LCC
• Open64
• Portable C Compiler (PCC)
• Tiny C Compiler
• MS Visual Studio Community 2015
• MS Phoenix Compiler
Why New Compilers (what is wrong with GCC?)
• Existing Open Source C Compilers have stagnated
Based on decade old code generation technology
 No cross-module or link-time optimizations
 No JIT code generation
 Aging code base – hard to change, hard to learn
 Hard to reuse or leverage in other applications
 Getting slower with each new release
• LLVM focus on
New techniques, compile time, and code quality
Modula approach to support many languages and applications
 Components can be used in both static and dynamic translation
systems
LLVM Vision and Approach
• Build a set of modular compiler components
• Reduces the time and effort to build a particular compiler
• Components may be shared by different compilers
• Enable selection of the right component for each job
LLVM Tools
• llvm-as: assemble a human-readable .ll file into bitcode
• llvm-dis: disassemble a bitcode file into a human-readable .ll file
• opt: run a series of LLVM-to-LLVM optimizations on a bitcode file
• llc: generate native machine code for a bitcode file
• lli: directly run a program compiled to bitcode using a JIT compiler or
interpreter
• llvm-link: link several bitcode files into one
• clang: C, C++, Object C front-end for LLVM
• llvm-gcc: GCC-based C front-end for LLVM
• llvm-g++: GCC-based C++ front-end for LLVM
Leveraging LLVM
• LLVM is a great target for new languages
 Well defined, simple to program for
 Easy to retarget existing compiler to use LLVM backend
• LLVM supports Just-In-Time optimizations and compilation
 Optimize code at runtime based on dynamic information
 Easy to retarget existing bytecode interpreter to LLVM JIT
 Great for performance
LLVM IR
From LLVM IR to Target Assembly
● Build flow inside LLVM
LLVM IR
(.ll/.bc)
OPT
LLVM IR
(Optimized)
Target Independent Optimization
(Optional)
LLC
Target
Assembly
Target dependent
Optimization(Optional) &
Code Generation
LLVM IR
● A target independent virtual instruction set
● Three different representations
○
In-memory Format
■
○
On-disk binary format: bitcode
■
■
○
Directly operated by the LLVM API
LLVM IRs are binary encoded
Need to be read into memory before being operated by LLVM
API
On-disk text format: llvm assembly
■
■
Human readable format
Need to be read into memory before being operated by LLVM
API
● SSA-based IR
Single Static Assignment Format
● Each variable is defined at most once
o
There is only one definition for each use
● Every variable is defined before use
● Infinite variable set
o
Unlike the real processor
● Make several optimizations simpler
X=1
X=2
Y=X
X1 = 1
X2 = 2
Y = X2
SSA-Form
X1 can be removed since it has no use
Single Static Assignment Format
● How about this the same variable from different
predecessors?
X=
1
X=
2
X1 =
1
Y=X
X2 =
2
Y = ? SSA-Form
● Phi operation to solve the data flow from multiple
predecessors
X=
1
X=
2
Y=X
X1 =
1
X2 =
2
Y
phi(X1,X2 ) SSA-Form
Structure of an LLVM IR File
Module
Global
Variable
Parameter
Global
Variable
Parameter
Label
…
…
Function
Basic
Block
Instruction
Operand
Function
Basic
Block
Basic
Block
Instruction
Operand
Function
Instruction
…
…
…
…
LLVM IR: Variable Identifier
● Global variable begins with syntax ‘@’
○
Variable
@.str = private constant [13 x i8] c"hello world\0A\00"
○
Functions
■
declaration
■
definition
declare i32 @puts(i8* nocapture) nounwind
define i32 @main() { … }
Example: Hello, World
#include <stdio.h>
int main() {
printf("Hello, World!\n");
return 0;
}
Example: Hello, World(cont.)
declare i32 @printf(i8*, ...)
@.str = private unnamed_addr constant [15 x i8] c"Hello,
World!\0A\00", align 1
define i32 @main() nounwind uwtable {
%1 = alloca i32, align 4
store i32 0, i32* %1
%2 = call i32 (i8*, ...)* @printf(i8* getelementptr
inbounds ([15 x i8]* @.str, i32 0, i32 0))
ret i32 0
}
LLVM IR: Variable Identifier(cont.)
● Local variable begins with syntax ‘%’
○
function argument
define i32 @func(i32 %TorF, i32 %lhs, i32 %rhs){
○
Variables defined by instructions
%4 = load i32* %1, align 4
○
label of basic blocks
br i1 %b, label %BB1, label %BB2
BB1:
…
ret i32 %B
BB2:
Example: Conditional Block
int func(int TorF, int lhs, int rhs) {
int x;
if (TorF) {
x = lhs;
} else {
x = rhs;
}
return x;
}
Example: Conditional Block(cont.)
define i32 @func(i32 %TorF, i32 %lhs, i32 %rhs) #0 {
%1 = icmp ne i32 %TorF, 0
br i1 %1, label %L1, label %L2
L1:
; preds = %0
br label %L3
Define
Use
L2:
; preds = %0
br label %L3
L3:
; preds = %L2, %L1
%x.0 = phi i32 [ %lhs, %L1 ], [ %rhs, %L2 ]
ret i32 %x.0
}
Layout of LLVM Module
; ModuleID = 'main.c'
@varA = external global i32
@varB = external global i32
declare i32 @func(i32, i32) #1
Module
Function
Basic
Block
Global Variable
Global Function
define i32 @main(i32 %argc, i8** %argv) #0 {
%1 = alloca i32, align 4
... Local(Auto) Variable
br i1 %5, label %L1, label %L2
L1:
...
Label
br label %L2
L2:
...
}
Several Examples of LLVM IR
<result> = add <type> <op1>, <op2>
→ %dst = add i32 %src1, %src2
<result> = load <type>* <pointer>
→ %dst = load i32* %ptr
store <type> <op1>, <type>* <pointer>
→ store i32 %src, i32* %dst
br label <dest>
→ br label %dest_label
List of LLVM Types
● Primitive Type
○
○
○
○
Integer: i(#n), e.g. i1, i32, i64, i128
Float: half(16)/float(32)/double(64)/fp128
Pointer: (type)* → i1*, i32*, float*, i32**
Vector: <(#n) x (type)> → <4 x i32>
● Composite Type
○
○
Array: [(#n) x (type)], → [10 x flaot]
Struct: type{#<Type List>} → type{i32, i32}
● Label Type
○
○
Like labels in C language
Used by branch instructions
List of Different Kind of LLVM
Instruction
● Terminator Instruction
o
br/indirectbr/switch/ret/…
● Binary Instruction
o
add/fadd/sub/mul/udiv/urem/…
● Bitwise Binary Instruction
o
shl/lshr/and/or/xor/…
● Memory Access Instruction
o
alloca/load/store/getelementptr/…
● Conversion Instruction
o
zext/sext/fptosi/uitofp/…
Terminator Instruction
● Every basic block ends by a terminator instruction
● Direct Branch
○
br
○
indirectbr/switch/ret
○
invoke/resume
○
unreachable
● Indirect Branch
● Exception Handling
● End of Control Flow
Direct Branch
● Unconditional Branch
o
br label <dest>
br label %BB1
…
%BB1:
● Conditional Branch
o br i1 <cond>, label <iftrue>, label <iffalse>
%cond = icmp eq i32 %a, %b
br i1 %cond, label %BB1, label %BB2
%BB1:
…
%BB2:
…
Return
● Without Return Value (void function)
o
ret void
● With Return Value
o
ret <type> <value>
define i32 @func() {
…
ret i32 %var
}
Switch
switch i32 %a,
i32 0, label
i32 1, label
…
i32 9, label
]
sw.default:
…
sw.bb0:
…
sw.bb1:
…
sw.bb9:
label %sw.default [
%sw.bb0
%sw.bb1
%sw.bb9
LLBT
(LLVM based Binary Translator)
An ARM-to-Andes SBT
An ARM-to-Andes SBT was built before the LLBT project
Target
Source
ARM
Binary
Andes
Binary
SBT
MIPS
Binary ?
x86
Binary ?
X86-64
Binary ?
C2
Binary ?
An LLVM-based Static Binary Translator
48
LLBT (LLVM based SBT)
An LLVM-based Static Binary Translator
• Translate machine instructions (ARMv5TE) into LLVM IR
Source
Binary
Target
LLBT
LLVM IR
LLVM
Binary’
Binary’
Binary’
Binary’
Leverage the LLVM infrastructure
• Exploit existing analysis and optimization passes
• Re-targetable
• X86, x86-64, MIPS, PowerPC, C2, …
LLVM is a widely used open source compiler infrastructure
• Android Renderscript runtime, ChromeOS, NaCl (Native
Client) (running compiled executable in web browser)
An LLVM-based Static Binary Translator
49
LLBT Translation Flow
• Read and disassemble the
input binary
• Translate to LLVM IR
• Apply LLVM optimizations
• Compile the IR to target
assembly
• Assemble the target
assembly
• Link the object files and the
source image together
An LLVM-based Static Binary Translator
50
20
An LLVM-based Static Binary Translator
5.2
6.4
7.5
6.9
2.3
4.1
7.1
7.4
4.5
4.7
7.0
7.8
19.0
23.7
15.9
18.0
64.1
70
8.2
5.6
8.1
4.1
5.3
7.4
3.6
3.3
4.5
4.9
29.3
47.2
X86 (Intel Xeon E5506)
10.0
7.5
5.5
4.2
12.8
10.5
8.5
13.0
40.5
50
4.1
3.2
3.4
6.8
6.0
7.9
7.5
7.4
6.1
8.1
7.7
3.2
4.6
5.3
4.9
4.7
5.1
4.7
8.0
2.7
5.8
5.7
5.0
4.2
5.3
6.0
6.5
4.9
10
29.5
28.8
30
a2time01
aifftr01
aifirf01
aiifft01
autcor00
basefp01
bezier01
bitmnp01
cacheb01
canrdr01
cjpeg
conven00
dither01
djpeg
fbital00
fft00
idctrn01
iirflt01
matrix01
ospf
pktflow
pntrch01
puwmod01
rgbcmy01
rgbhpg01
rgbyiq01
rotate01
routelookup
rspeed01
tblook01
text01
ttsprk01
viterb00
average
Execution time ratio (QEMU / LLBT)
LLBT (SBT) vs. QEMU (DBT)
MIPS (Ingenic JZ4760)
60
40
7.4
7.1
0
Execution time ratio of “QEMU / LLBT” on translating EEMBC from ARM to (a) X86 and (b) MIPS
51
LLBT vs. QEMU (CoreMark)
6.04X
6000
5135.39
Iterations / Sec
5000
4000
2.49X
3000
1.07X
2114.16
2000
1000
850.73
912.86
0
qemu-arm
opt -O0 & llc -O0 (LLBT) opt -O3 & llc -O0 (LLBT) opt -O3 & llc -O3 (LLBT)
An LLVM-based Static Binary Translator
52
HQEMU
(A Hybrid QEMU)
HQEMU ( Hybrid QEMU)
 What is HQEMU?
A Hybrid QEMU, with multi-threaded code generation and
optimization approach.
 Original QEMU trades runtime generated code quality for
fast translation. (a typical tradeoff made by JIT)
 HQEMU uses two threads approach – the original QEMU
thread, plus a new thread to use LLVM to generate higher
quality native code. This enhances QEMU simulation speed
significantly:
 X86-x86/64 emulation: 2.5X
 ARM-x86/64 emulation: 2.4X
Why not optimize generated code?
 Optimization takes time. In DBT, it is part of the
execution time.
 Some applications are short running, optimization time may
not be amortized.

Optimization may generate bugs, which decreases the
reliability of DBT
• QEMU has converted many different ISAs to TCG IRs
• Support ISAs: x86, ARM, MIPS, PowerPC, SPARC ... etc.
• TCG has only ~142 opcodes
• Two-level IR conversion in HQEMU
• Guest ISA  QEMU TCG IR  LLVM IR
• Only implement 142 conversion procedures from TCG IR to
LLVM IR
• All ISAs supported by QEMU is supported by HQEMU
• Reduce the overhead to re-implement direct conversion
from original guest ISAs to LLVM IR. E.g.
• X86: > 2000 opcodes
• ARM: > 100 opcodes
HQEMU Architecture
• Architecture for process-mode emulation
Multi-thread App
QEMU
TCG
Translator
M Cores
Block Code Cache
Optimization Thread Pool
Optimization
Queue
LLVM
Translator
(Trace
Generation+Opti
mization)
N Cores
HPM-Based
Feedback-Dir
Optimizer
P Cores
Trace Code Cache
Execution Profile
57
• SPEC CINT2006
Performance of Single-Thread Apps
CINT2006 results with reference
inputs.
(a) ARM to x86/64
(b) x86/32 to ARM
59
HQEMU/SIMD
HQEMU
Target
Source
Guest SIMD
(ARM Neon
or Intel SSE/AVX)
Helper
function
Scalar
LLVM IR
LLVM
Binary’
Binary’
Binary’
Binary’
HQEMU/SIMD
Target
Source
Guest SIMD
(ARM Neon
or Intel SSE/AVX)
TCG
Vector
IR
LLVM
Vector
IR
LLVM
An LLVM-based Static Binary Translator
Binary’
Binary’
Binary’
Binary’
61
HSA Runtime
Finalizer in a Heterogeneous System
Applications
IR (HSAIL, SPIR, Bytecode)
Finalizer
CPU
GPGPU
DSP
Finalizer in a Heterogeneous System
Good code and runtime
performance depend on
-
Input kernel
Input data
Target ISA
Target microArch
CPU
Applications
IR (HSAIL, SPIR, Bytecode)
Challenge:
How to get good
performance with
compile time and
space constrains ??
Finalizer
GPGPU
DSP
Finalizer in a Heterogeneous System
Many finalizers are planning
to leverage on LLVM
infrastructure.
Applications
LLVM
It will be common for a
kernel function to go through
LLVM twice.
IR (HSAIL, SPIR, Bytecode)
In such cases, how to select
a cost effective configuration
will be crucial.
LLVM /Finalizer