Transcript Paper 20

ReVirt: Enabling Intrusion Analysis through Virtual Machine Logging and Replay
Authors:
George W. Dunlap
Samuel T. King
Sukru Cinar
Murtaza A. Basrai
Peter M. Chen
Presentation by: Will Hrudey
Introduction

ReVirt is an intrusion analysis solution that facilitates post-attack
analysis

ReVirt applies VM and fault-tolerance techniques to enable the
administrator to replay the long-term, instruction-by-instruction
execution of a computer system

ReVirt runs the target operating system (OS) and applications in
a VM whose VMM is a kernel module in a host OS, allowing:
– Migration of logging from the target OS to the host OS below the VM
– Playback of the target system’s execution before, during, and after
an intruder compromises the system
Motivation

The improvement of today’s computer system security is an
urgent and difficult problem

The complexity and rapid change in software systems prevent
developers from verifying their code to eliminate all vulnerabilities

Administrators have to routinely cope with computer break-ins
– CERT Coordination Center reports a steady increase in the number of incidents
handled and in the number of vulnerabilities over the past 4 years
Goals
Solve two problems with current audit logging:
1. Improve the integrity of the logger because:
– Existing loggers depend on the integrity of the OS
– Attackers can disable, modify, or delete system logs
– Kernels are large and complex, so they tend to contain many bugs
Solution: Encapsulate the target system within a VM and place logging below the VM
2. Improve the completeness of the logger because:
– Existing loggers don’t save enough data to replay and analyze attacks, so
the administrator still has to guess what happened
– Existing loggers can’t account for non-determinism
Solution: Utilize checkpointing, logging, and roll-forward recovery
Virtual Machines

A virtual-machine monitor (VMM) is a software layer that
emulates the hardware of a complete computer system

The VMM creates an abstraction called a virtual machine (VM)

The host platform that the VMM runs on can be another OS
(the host OS) or the bare hardware
– So the VMM runs in a separate domain from the guest OS and
applications

Although the VMM can still be compromised, it makes a better
trusted computing base (TCB) than the guest OS due to its
narrow interface and small size
Virtual Machines

The VMM interface is similar to the physical hardware whereas the
interface provided by a typical OS is much richer

The narrower interface restricts an attacker's possible actions, and the
VMM's smaller code base is easier to verify

VMs can be classified by how similar they are to the host hardware
– On one end, VMs such as IBM VM/370 export an interface that is backwards
compatible with the host hardware. OSes and applications intended to run
on the host platform can run on these VMMs without change
– On the other end, language-level VMs such as the Java VM export an interface
completely different from the host hardware. These VMMs can run only
OSes and applications written specifically for them
Virtual Machines
Different VM Configurations
Direct-on-Host (DoH)
OS-on-OS (OoO)
UMLinux



ReVirt uses UMLinux as the virtual machine
– VMM in UMLinux exports an interface similar but not identical to
the host hardware
– Custom VMM optimizations in the underlying OS increase speed

Virtual machine in UMLinux runs as a user process on the host
– Guest OS and guest applications run inside this host user process
– Guest OS uses host services (system calls and signals) as the
interface to peripheral devices, hence the OS-on-OS architecture

The normal structure, in which target applications run directly on the
host OS, is the Direct-on-Host architecture
UMLinux

VMM in UMLinux is a loadable kernel module in the host OS
– Module is called before/after each signal and system call to/from the VM
process
– Most instructions executed within the VM execute directly on the host CPU

Memory accesses are translated by the host’s MMU based on
translations that are set up via the host OS’s memory system calls

A host X application displays console output and reads keyboard
input

The VMM module maintains a virtual privilege level (VPL)
– Set to kernel when transferring control to the guest kernel
– Set to user when transferring control to a guest application
UMLinux

If the current VPL is kernel, the VMM knows the guest OS made the
system call; it checks that the call is one the guest OS should be
making, then passes it on to the host OS

If the current VPL is user, the VMM knows a guest application made the
system call, and it sends SIGUSR1 to the guest OS to notify it
– The SIGUSR1 signal handler in the guest kernel is the equivalent of the system-call
trap handler in a normal OS

SIGALRM, SIGIO, and SIGSEGV signals are used to emulate the
hardware timer, I/O device interrupts, and memory exceptions

UMLinux emulates the enabling/disabling of interrupts by masking signals

The TCB is comprised of the VMM kernel module and the host OS
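
The VPL-based handling of system calls described above can be sketched as a small model. This is an illustration only, not UMLinux code: the class name Vmm, the return strings, and the contents of GUEST_KERNEL_ALLOWED are hypothetical, since the slides say only that the VMM checks whether a call is one the guest OS should be making.

```python
from dataclasses import dataclass, field

KERNEL, USER = "kernel", "user"

# Hypothetical allowlist; the real set of permitted host calls is not given here.
GUEST_KERNEL_ALLOWED = {"read", "write", "mmap", "munmap", "kill"}


@dataclass
class Vmm:
    """Toy model of the VMM module's system-call hook."""
    vpl: str = KERNEL
    trace: list = field(default_factory=list)

    def on_syscall(self, name):
        """Invoked before a system call issued by the VM process is executed."""
        if self.vpl == KERNEL:
            # The guest kernel made the call: vet it, then hand it to the host OS.
            if name not in GUEST_KERNEL_ALLOWED:
                self.trace.append(("denied", name))
                return "denied"
            self.trace.append(("host", name))
            return "passed-to-host"
        # A guest application made the call: notify the guest kernel via SIGUSR1
        # and switch the virtual privilege level to kernel.
        self.vpl = KERNEL
        self.trace.append(("SIGUSR1", name))
        return "redirected-to-guest-kernel"

    def return_to_user(self):
        """Guest kernel transfers control back to a guest application."""
        self.vpl = USER


vmm = Vmm()
print(vmm.on_syscall("read"))   # guest kernel call, vetted and passed on
vmm.return_to_user()
print(vmm.on_syscall("open"))   # guest application call, redirected
```

In this model a guest application never reaches the host OS directly: every call it makes is turned into a notification for the guest kernel, which is what lets the SIGUSR1 handler play the role of the system-call trap handler.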
UMLinux
UMLinux
Attacker strategies:

From above
– DoH: Attacker can cause application processes to exploit any/all host OS
functionality in dangerous ways
– OoO: Attacker can take similar avenues to attack the guest OS; however, the VMM
limits available system calls to < 7%, and the guest OS can access only
a limited number of host files and devices

From below
– DoH: Attacker can send dangerous network packets to the host to compromise lower
levels of the protocol stack
– OoO: Less of the host OS network stack is exposed to the same dangerous packets
Logging And Replaying

Logging is used to recover state
– Start from a checkpoint of a prior state, then roll forward using the log

Most events are deterministic and needn’t be logged; however, any host
system call that can yield non-deterministic results must be logged

Non-deterministic events are categorized as either time or external input
– Time refers to the point in the execution stream at which an event takes place
– External input is data received from a non-logged entity (keyboard, mouse, etc.)

Output to peripherals does not affect the replay process

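Log records are added and saved to disk in a manner similar to the Linux syslogd daemon

The PC and the number of branches executed since the last interrupt are logged
to identify when each asynchronous virtual interrupt was delivered

During replay, new asynchronous virtual interrupts are kept from perturbing the VM process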
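
The checkpoint-plus-log idea above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not ReVirt's implementation: Guest, run_logged, and replay are hypothetical names, and the guest is reduced to a state that depends only on the inputs it receives.

```python
import copy
import random


class Guest:
    """Toy deterministic guest: its state depends only on the inputs fed to it."""

    def __init__(self):
        self.state = 0

    def step(self, value):
        self.state = (self.state * 31 + value) % 1_000_003


def run_logged(guest, steps, source):
    """Run the guest, recording every non-deterministic input it consumes."""
    log = []
    for _ in range(steps):
        value = source()      # stands in for keyboard, network, and so on
        log.append(value)     # only the non-deterministic input is logged
        guest.step(value)     # the deterministic computation is not logged
    return log


def replay(checkpoint, log):
    """Start from a copy of the checkpoint and roll forward using the log."""
    guest = copy.deepcopy(checkpoint)
    for value in log:
        guest.step(value)
    return guest


original = Guest()
checkpoint = copy.deepcopy(original)      # checkpoint of the prior state
rng = random.Random()
log = run_logged(original, 1000, lambda: rng.randrange(256))
assert replay(checkpoint, log).state == original.state
```

The log holds one entry per external input and nothing about the computation itself, which is why deterministic events need not be recorded.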
Logging And Replaying

ReVirt goes through two phases to find the right instruction at which to
deliver the original asynchronous virtual interrupt
– 1st phase has the branch_retired counter generate an interrupt after most of the branches
– 2nd phase is needed to stop at exactly the right instruction

Replay can occur on any host with a processor type similar to that of the original host

Most non-deterministic sources generate small amounts of log data

Received network messages can generate massive logs
– The amount of logged network data can be reduced: the receiver doesn’t
need to log a message if the sender can recreate it via replay
– Requires cooperating computers to trust each other to regenerate the
same message data during replay
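
The two-phase search for the interrupt delivery point described earlier on this slide can be sketched over a recorded instruction trace. This is a simplified model, not ReVirt's code: the trace format, the function names, the slack value, and the breakpoint-style check in the second phase are assumptions made for illustration.

```python
def record_interrupt(trace, k):
    """Logging side: note the PC about to execute and the branches retired so far."""
    pc = trace[k][0]
    branches = sum(is_branch for _, is_branch in trace[:k])
    return pc, branches


def find_delivery_index(trace, target_pc, target_branches, slack=4):
    """Replay side: index of the instruction before which to deliver the interrupt.

    trace is a list of (pc, is_branch) tuples in execution order.
    """
    branches = 0
    i = 0
    # Phase 1: run freely until the branch counter is within `slack` of the target.
    coarse_stop = max(target_branches - slack, 0)
    while i < len(trace) and branches < coarse_stop:
        branches += trace[i][1]
        i += 1
    # Phase 2: stop each time the target PC comes up and compare branch counts.
    while i < len(trace):
        pc, is_branch = trace[i]
        if pc == target_pc and branches == target_branches:
            return i
        branches += is_branch
        i += 1
    return None


# A three-instruction loop body executed ten times; the last instruction branches back.
trace = [(0x10, False), (0x14, False), (0x18, True)] * 10
pc, n = record_interrupt(trace, 16)
assert find_delivery_index(trace, pc, n) == 16
```

The PC alone is ambiguous inside a loop, because the same address executes on every iteration; pairing it with the branch count is what singles out one dynamic instance of the instruction.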
Logging And Replaying
Administrator tools used to understand the attack:
– Tools that run inside the guest VM to probe the VM state
(edit files, list current processes, etc.)
– Tools that run outside the guest VM to analyze the state of a VM
(X server, debuggers, disk analyzer, etc.)
Experiments: Testbed


VM is configured to use 192 MB of physical memory
Virtual hard disk is stored on a raw disk partition
Experiments: Objective

Measure Virtualization Overhead:
– Application runtimes within UMLinux vs. runtimes on the host OS
– Evaluates 5 workloads with a warm cache, averaged over 3 runs

Validate Correctness:
– Micro-benchmarks run in the VM to verify virtual interrupts are being
replayed at the same point at which they occurred during logging
– Macro-benchmark verifies ReVirt faithfully plays back input from
external systems

Measure Logging and Replaying Overhead:
– Quantify the time and space overhead of logging
– Checkpoint overhead is not included

Attack Analysis:
– Exploit the ptrace race condition and verify replay
Experiments: Virtualization Overhead
Experiments: Logging / Replaying
Future Work

Make checkpointing faster and more convenient
– Accelerate disk copy done during checkpointing
– Enable the VMM to checkpoint a running VM

Reduce host OS size used to support UMLinux

Build higher level analysis tools to leverage ability to replay
detailed, long-term executions

Move the X server into another VM

Use ReVirt as a building block for new security services

Cooperative logging in ReVirt?
Conclusion

ReVirt adopts VM and fault-tolerance techniques to enable
replay of long-term, instruction-by-instruction execution to
facilitate attack analysis

Target OS and applications run within the VM

ReVirt can replay execution before, during and after an intrusion

ReVirt logs all non-deterministic events so it can replay non-deterministic attacks and executions

ReVirt provides arbitrarily detailed observations about what
transpired

ReVirt is implemented as a set of modifications to the host OS

ReVirt adds “reasonable?” time and space overhead
Observations

Total overhead for kernel-intensive workloads: up to 66%
– Is this overhead justifiable?
– Should have reported total overhead in tables for increased clarity

Checkpoint time and space overhead not characterized

Host OS can still be compromised
– No quantitative data to support the claim that a narrower interface is more secure

Tests seem to focus on overhead rather than the ability to enable analysis

There are no specific tools to analyze potentially large ReVirt logs

Log growth could be much larger since the SPECweb99 benchmark was
based on only 15 simultaneous connections

Replay must start from a powered-off VM state; is this practical?

How portable is ReVirt to other guest/host OSes?

“No perceptible time overhead” is a weak measurement. Better metric?

No multiprocessor support, even though it was published in late 2002
Discussion
1.
The authors state that they “believe that even an overhead of
58% is not prohibitive for sites that value security.” (p11) I
believe that an overhead of 58% is pretty big, especially for
busy systems. How much of a concern is this really?
2.
They show the average space/day that logging takes. But does
this include the daily snapshot as well? If you're running a lot
of guest OSes concurrently, couldn't this become a bottleneck
(or does ReVirt only run one guest OS at a time)? They give
results for both virtualization overhead and logging overhead,
but not both at the same time (which is the real-world
scenario). Is there any indication as to how much the total
overhead is?
Discussion
3.
The authors talk about checkpointing in a few areas of the
paper. They claim it will be a rare event and so do not test the
time and space overhead to run one. They then say that their
future work is to “make checkpointing faster and more
convenient.”
I wonder how slow and inconvenient checkpointing is at this
point for them to avoid testing it (or releasing the test results)?
I think this should have been included in the paper as, even
though checkpointing may not happen often, it is still part of
the system overhead.
Discussion
4.
If ReVirt detects that non-deterministic events occurred during
the attack, what can it do to prevent further attacks? Is it
possible to isolate them?
5.
Is UMLinux the only guest OS that can be used in ReVirt? Have
any other OSes been ported to ReVirt? And what about the
development of ReVirt or some system like it?
Discussion
6.
The authors introduce ReVirt to address two shortcomings of
current systems - integrity and completeness. They state that
the "current system loggers lack integrity because they
assume the operating system kernel is trustworthy."
However, they also indicate that "even the VMM may be
subject to security breaches," but that the VMM is more
trustworthy than operating system because the interface is
narrower.
Does a narrower interface really make that much of a
difference in securing the system? Can't attackers still do a
lot of damage?
Discussion
7.
They talk about how this approach is useful in analyzing an
attack, and in section 5.4 give an example of this. But to do so
they introduced a vulnerability and then used the logging
method to analyze an attack that they themselves initiated.
While the example may have some validity, it would have
been nice to see something that they didn't set up
themselves.
Discussion
8.
Cooperative logging is cited as being capable of significantly
reducing storage, as no LAN data needs to be logged (it can
just be regenerated); however, you lose the ability to replay
independent machines without running the whole network (or
so it seems). Are there any schemes that let you do both?
9.
They use a modified version of Linux 2.4.18 as the host OS.
I’m wondering how modified it is? They claim that the host
OS is safe from attack, but because it is still just an ordinary
OS, I’m not sure about this. What do you think?
Discussion
10.
ReVirt logs all input from external devices. Could these logs
be used to pick up passwords from keyboard input or other
security input (i.e. fingerprint readers and files from memory
sticks)?
11.
"ReVirt log all input from external entities. These include most
virtual devices: keyboard, mouse, network interface card, ..."
When we want to analyze the intrusion of a highly-used web
server, logging all input from the network device seems quite
expensive (I believe it would be much more than 1.4 GB/day
as shown in the experiment). Any solution for that?
Discussion
12.
So does it make more sense to add this VM layer just so we
can track, or is it just easier? (i.e. what are the arguments for
not having a VM layer?)
13.
When they used ReVirt to analyze an attack, they only
tested it with one attack. I think a broader range of attacks
should have been tested to get an accurate account of what
ReVirt can do. What do you think about this?
14.
What kind of analysis tools do the authors suggest/provide?
They were able to find an error, but only when they themselves
knew exactly what they were looking for.
Discussion
15.
In section 4.4, the paper mentioned alternative architectures
for logging and replay. Basically, they compared OS-on-OS
structure with direct-on-host structure. How about the
direct-on-VMM structure? Does removing the host OS improve the
performance and stability of ReVirt?
16.
In section 6, the paper compared hypervisors with ReVirt and
argued that they are targeting different goals. However, since
Hypervisors already have similar logging functionalities, why
not design ReVirt as a plugin (i.e. a special VM) for some
hypervisors?
Discussion
17.
Is there some other way to improve security that does not
involve loading the VMM as a kernel module?
18.
The guest doesn't run X itself, but rather connects to a remote
X server (say on the host). Doesn't this introduce a hook that
a malicious user could use to gain access to (or at least
destabilize) the host?
Discussion
19.
Why does ReVirt have only a single disk checkpoint, taken while
the virtual machine is powered off? Why did they not
think to add in other checkpoints? Why did they "envision
checkpointing being a rare event?" Is this because they don't
see their system being attacked more frequently than that?