PPT - Computer Science Department, Technion

Transparent Fault-Tolerant Java Virtual Machine
Roy Friedman & Alon Kama
Computer Science — Technion
FT-JVM Goals

Fault-tolerant environment for executing Java applications

Highly Reliable
Apps should execute without interruption, overcoming failures of individual machines
Fault-tolerance can be extended by utilizing more machines

Low Maintenance
Apps should not have to be modified in order to run on the system
Recovery upon failure of individual machines should be swift

Transparency
Failures should be masked and the transition to another machine should be transparent
Fault tolerance by Replication

Replication: coordinating a set of replicas of the computation on processors that fail independently
Potential for a dramatic decrease in Mean Time To Repair (MTTR)
Achieve t-fault-tolerance, where t is the number of replica failures tolerated (t + 1 replicas under crash failures)
Increased cost of hardware for duplication of effort
Overhead and complexity of maintaining consistency

Replication + Transparency (masking of failures, maintaining the illusion of a single copy) = High availability
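To make the link between a lower MTTR and higher availability concrete, the sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). The class name `Availability` and the sample figures are illustrative assumptions, not measurements from the FT-JVM.

```java
// Illustrative only: the numbers used in main() are made-up examples,
// not FT-JVM measurements.
final class Availability {
    private Availability() {}

    /** Steady-state availability = MTTF / (MTTF + MTTR), both in the same time unit. */
    static double of(double mttf, double mttr) {
        if (mttf <= 0 || mttr < 0) {
            throw new IllegalArgumentException("mttf must be > 0 and mttr >= 0");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        double mttfHours = 1000.0;
        // Hypothetical repair times: a manual restart taking one hour
        // versus failover to a running replica taking one second.
        System.out.printf("manual restart:   %.6f%n", of(mttfHours, 1.0));
        System.out.printf("replica failover: %.6f%n", of(mttfHours, 1.0 / 3600.0));
    }
}
```

With MTTF held fixed, shrinking the repair time is what pushes availability toward 1, which is the effect replication aims for.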
Replication for Java

Replication at the Java Virtual Machine level


Replication at this level is cost-effective, portable, and transparent to the application developer and the user
Approach extends Bressoud & Schneider (1995), who implemented active replication below the operating system
T. Bressoud and F. Schneider. Hypervisor-based Fault-Tolerance. SOSP-15, 1995.
Design of the FT-JVM


Replication requires deterministic execution, which is difficult to achieve because of:

Preemptive context switches

Lock contention in SMP

I/O availability differences

Environment-specific attributes
Changes made to the VM:

Deterministic thread scheduling

Deterministic thread switching

Non-deterministic ops relay info to the replication module
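A minimal sketch of why deterministic thread switching makes replicas agree. It assumes switches happen after a fixed number of executed instructions instead of on a timer interrupt. The slides do not say how the FT-JVM picks its switch points, so the quantum and the `DeterministicScheduler` and `SimThread` names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation, not FT-JVM code: threads are modeled as
// instruction counts and scheduled round-robin with a fixed quantum.
final class DeterministicScheduler {
    /** A simulated thread: a name plus a number of instructions left to run. */
    static final class SimThread {
        final String name;
        int remaining;

        SimThread(String name, int instructions) {
            this.name = name;
            this.remaining = instructions;
        }
    }

    /**
     * Runs the threads round-robin, switching after exactly `quantum`
     * instructions. Returns one trace entry per executed instruction.
     * The result depends only on the inputs, never on wall-clock timing.
     */
    static List<String> run(List<SimThread> threads, int quantum) {
        if (quantum <= 0) {
            throw new IllegalArgumentException("quantum must be positive");
        }
        Deque<SimThread> runQueue = new ArrayDeque<>(threads);
        List<String> trace = new ArrayList<>();
        while (!runQueue.isEmpty()) {
            SimThread t = runQueue.pollFirst();
            int slice = Math.min(quantum, t.remaining);
            for (int i = 0; i < slice; i++) {
                trace.add(t.name);
            }
            t.remaining -= slice;
            if (t.remaining > 0) {
                runQueue.addLast(t); // not finished: back of the queue
            }
        }
        return trace;
    }

    /** Two threads, A with 3 instructions and B with 5, quantum of 2. */
    static List<String> demo() {
        List<SimThread> threads = new ArrayList<>();
        threads.add(new SimThread("A", 3));
        threads.add(new SimThread("B", 5));
        return run(threads, 2);
    }

    public static void main(String[] args) {
        System.out.println(String.join("", demo())); // prints AABBABBB
    }
}
```

Because the interleaving is a pure function of the thread set and the quantum, a primary and a backup running the same program produce the same trace without exchanging any scheduling information.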
Design of the FT-JVM

Replication module:



One replication engine per processor, on both primary and backups
Data packages are passed to the engine on the primary and retrieved from it on the backups
Threads waiting for I/O now yield instead, to be re-scheduled at specific intervals

I/O is checked at the beginning of a frame, determined by X context switches or the lack of schedulable application threads
[Figure: primary and backup timelines, with data for frame n and frame n+1]
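The frame mechanism above can be sketched as a record-and-replay log. The primary performs each non-deterministic operation for real and records its result in the current frame, and the backup reads the recorded results back in the same order. Everything here is a hypothetical illustration: the class names, the use of `long` values, and the in-memory queue standing in for the network are assumptions. Only the "X context switches" frame boundary is modeled, not the "no schedulable threads" case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical sketch of frame-based record/replay, not the FT-JVM engine.
final class FrameReplication {
    /** Results of the non-deterministic operations performed during one frame. */
    static final class Frame {
        final int number;
        final List<Long> values = new ArrayList<>();

        Frame(int number) {
            this.number = number;
        }
    }

    /** Primary side: runs operations for real and logs their results per frame. */
    static final class PrimaryEngine {
        private final int switchesPerFrame; // the "X" from the slide
        private final Queue<Frame> outbox = new ArrayDeque<>(); // stands in for the network
        private Frame current = new Frame(0);
        private int switches = 0;

        PrimaryEngine(int switchesPerFrame) {
            if (switchesPerFrame <= 0) {
                throw new IllegalArgumentException("switchesPerFrame must be positive");
            }
            this.switchesPerFrame = switchesPerFrame;
        }

        /** Executes a non-deterministic operation and records its result. */
        long perform(LongSupplier op) {
            long value = op.getAsLong();
            current.values.add(value);
            return value;
        }

        /** Called by the scheduler on every context switch. */
        void onContextSwitch() {
            if (++switches == switchesPerFrame) {
                closeFrame();
            }
        }

        /** Ships the current frame to the backups and starts the next one. */
        void closeFrame() {
            outbox.add(current);
            current = new Frame(current.number + 1);
            switches = 0;
        }

        Queue<Frame> outbox() {
            return outbox;
        }
    }

    /** Backup side: replays recorded results instead of executing the operation. */
    static final class BackupEngine {
        private final Queue<Frame> inbox;
        private Frame current;
        private int cursor;

        BackupEngine(Queue<Frame> inbox) {
            this.inbox = inbox;
        }

        /** Returns the next recorded result; throws NoSuchElementException when the log is exhausted. */
        long perform() {
            while (current == null || cursor == current.values.size()) {
                current = inbox.remove();
                cursor = 0;
            }
            return current.values.get(cursor++);
        }
    }
}
```

A typical use would wrap a call such as `System.currentTimeMillis()` in `primary.perform(...)`, so that the backup observes the same timestamp the primary saw rather than its own clock.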
Performance Results
[Figure: performance charts; one panel titled "SMP Raytrace"]
Conclusion



Ideal for long-running, low-I/O Java applications
Only a small performance degradation even for frequent synchronization between replicas (e.g. every second)
Quick detection and recovery from failure