Fault Detection
Sathish S. Vadhiyar
Source/Credits: From Referenced
Papers
Introduction
Fine-Grained Cycle Sharing (FGCS)
Host computers allow guest jobs to utilize
CPU cycles
Availability of host computers varies
Guest jobs may incur resource failures
Need to predict availability of host
computers
A scheduling system can allocate guest jobs
based on the availability of host computers
Kinds of Non Availabilities
FRC (Failures Caused by Resource
Contention)
A guest job may significantly impact host
processes
Hence a guest job can be removed
FRR (Failures Caused by Resource
Revocation)
A machine owner suspends resource
contribution without notice
Hardware-software failures occur
Resource Failure Prediction
A multi-state failure model and application of a
semi-Markov Process (SMP) to predict the
temporal reliability
Predicting probability that no resource failure will
occur on a machine in a future time window
Observing host resource usage values in a time
window; calculating parameters of SMP based
on host resource usage values
Multi-state resource failure model
FRR – 2 states
A machine is either available or unavailable
FRC
Failures when host processes incur
noticeable slowdown due to contention from
guest processes
The host can first decrease the
priority of guest processes; if this does not
help, the guest process is terminated
Measured host resource usage as indicators
of noticeable slowdown
Initial Experiments
To study relations between host resource usage
and FRC - Experiments conducted to simulate
resource contentions between a guest process
and host processes
Host-group – an aggregated set of host
processes with various resource usages
Slowdown of host group – reduction of its CPU
utilization due to contending guest process
Host programs are run with their isolated CPU
usage between 10% and 100%
Guest process – a CPU bound program
Experiments on CPU contention
Also measured reduction rate of host CPU
usage for a host-group
Experiments repeated with different host groups
with host priority 0, and guest priority 0 and 19
(renice)
Measured reduction rate plotted as function of
isolated host CPU usage, LH
Found 2 thresholds for LH
Th1 – value of LH above which the guest process needs
to be reniced to keep reduction rate below 5%
Th2 – value of LH above which the guest process needs
to be suspended to keep reduction rate below 5%
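A minimal sketch of how the two thresholds could be read off such measurements. The sample data, the function name find_threshold, and the reading of each threshold as the largest LH at which the reduction rate still stays below 5% (guest at priority 0 for Th1, guest at priority 19 for Th2) are assumptions for illustration, not values or code from the referenced paper.

```python
def find_threshold(samples, limit=0.05):
    """Return the largest host load up to which the reduction rate stays below limit.

    samples: list of (isolated host CPU usage LH, measured reduction rate).
    Returns None if even the lowest load violates the limit.
    """
    threshold = None
    for load, reduction in sorted(samples):
        if reduction >= limit:
            break  # from here on the guest slows the host down too much
        threshold = load
    return threshold


# Hypothetical measurements, NOT taken from the paper.
guest_priority_0 = [(0.1, 0.01), (0.2, 0.03), (0.3, 0.08), (0.5, 0.20)]
guest_priority_19 = [(0.1, 0.00), (0.2, 0.01), (0.3, 0.02), (0.5, 0.04), (0.7, 0.09)]

th1 = find_threshold(guest_priority_0)   # above this, the guest must be reniced
th2 = find_threshold(guest_priority_19)  # above this, renicing is no longer enough
```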
State model for FRC
3 states
S1 - When LH < Th1; ignore resource
contention due to guest processes;
slowdown already less than 5%
S2 - When Th1 < LH < Th2; renice guest
processes for slowdown to be < 5%
S3 - When LH > Th2; terminate guest
process
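The three states can be sketched as a simple classification of the measured host CPU usage. The function name, the example threshold values, and the treatment of the boundary cases (LH equal to Th1 or Th2 is placed in S2) are assumptions, since the slides do not specify them.

```python
def cpu_state(lh, th1, th2):
    """Map isolated host CPU usage LH to one of the three contention states."""
    if lh < th1:
        return "S1"  # contention can be ignored
    if lh <= th2:
        return "S2"  # guest must be reniced (boundary handling is an assumption)
    return "S3"      # guest must be terminated


# Hypothetical thresholds, for illustration only.
TH1, TH2 = 0.2, 0.6
states = [cpu_state(lh, TH1, TH2) for lh in (0.1, 0.4, 0.9)]
```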
Experiments on CPU and Memory
Contention
When memory thrashing occurs
Total memory of guest and host processes
exceeds available memory size
Experiments were conducted to verify that
memory thrashing does not depend on guest
priority
S4 – for failure due to memory thrashing
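A short sketch of the memory condition, under the assumption that S4 overrides whatever state the CPU usage alone would give (the slides note that guest priority does not matter here). All names and numbers are hypothetical.

```python
def apply_memory_check(state_from_cpu, host_mem_mb, guest_mem_mb, available_mb):
    """Return "S4" when host plus guest memory exceeds the available memory.

    Otherwise the CPU-based state (for example "S1", "S2" or "S3") is kept.
    """
    if host_mem_mb + guest_mem_mb > available_mb:
        return "S4"
    return state_from_cpu


# Hypothetical figures in megabytes.
fits = apply_memory_check("S2", host_mem_mb=600, guest_mem_mb=300, available_mb=1024)
thrashes = apply_memory_check("S1", host_mem_mb=800, guest_mem_mb=400, available_mb=1024)
```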
Multi-State Failure Model
The proposed prediction algorithm predicts the
probability that a machine will never transfer to
S3, S4, or S5 within a future time window
Transitions
Between S1, S2, S3 – decided by measured host CPU
usage
To S4 – when memory is limited
Semi-Markov Process Model
(SMP)
Applicable when the next transition depends only on
The current state
How long the system has been in the current state
Transition probabilities depend on amount of
time elapsed since last change in state
SMP is defined by a 3-tuple
S – finite set of states
Q – state transition matrix
H – holding time mass function matrix
SMP (Contd…)
The most important statistics of SMP - Interval transition
probabilities, P
To calculate P
Continuous time SMP is expensive
Hence the work develops a discrete time SMP model
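For illustration, the textbook recursion for interval transition probabilities of a discrete-time SMP can be sketched as below. This is the standard formulation, not necessarily the exact algorithm of the referenced paper; the data layout (Q as a list of lists, H[i][k] as a dictionary from holding time in time units to probability) and the tiny two-state example are assumptions.

```python
def interval_transition(Q, H, m_max):
    """Return P where P[m][i][j] is the probability of being in j at time m,
    given the system entered i at time 0.

    Q[i][k]    : probability that the next state after i is k
    H[i][k][l] : probability that the holding time in i before moving to k is l units
    """
    n = len(Q)
    # P[0] is the identity: no time has elapsed, so the state is unchanged.
    P = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for m in range(1, m_max + 1):
        Pm = [[0.0] * n for _ in range(n)]
        for i in range(n):
            left = 0.0  # probability of having left i within m units
            for k in range(n):
                if Q[i][k] == 0.0:
                    continue  # skip impossible transitions
                for l in range(1, m + 1):
                    c = Q[i][k] * H[i][k].get(l, 0.0)
                    if c == 0.0:
                        continue
                    left += c
                    # Move to k after l units, then evolve for the remaining m - l.
                    for j in range(n):
                        Pm[i][j] += c * P[m - l][k][j]
            Pm[i][i] += 1.0 - left  # still holding in i
        P.append(Pm)
    return P


# Hypothetical two-state example: state 0 always moves to state 1 after
# one or two time units; state 1 has no outgoing transitions.
Q = [[0.0, 1.0], [0.0, 0.0]]
H = [[{}, {1: 0.5, 2: 0.5}], [{}, {}]]
P = interval_transition(Q, H, 2)
```

The cost grows with the square of the window length and with the number of non-zero entries in Q, which is why skipping zero entries matters.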
SMP for Resource Availability
TR – probability of never transferring to S3, S4 or S5
within an arbitrary time window, W
Sinit – initial system state
W – Winit + T
Q and H calculated from statistics in history logs
obtained by monitoring host resource usage
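One plausible way to obtain Q and H from a history log is by empirical frequencies, sketched below. The log format (a chronological list of state and holding time pairs), the function name, and the frequency-based estimation itself are assumptions; the slides do not describe the exact procedure.

```python
from collections import defaultdict


def estimate_q_h(sojourns):
    """Estimate Q and H from a list of (state, holding_time) sojourns.

    Q[i][k]    : fraction of departures from i that went to k
    H[i][k][t] : fraction of i -> k transitions whose holding time in i was t
    The last sojourn has no observed successor and is ignored.
    """
    counts = defaultdict(lambda: defaultdict(int))
    holds = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for (state, held), (nxt, _) in zip(sojourns, sojourns[1:]):
        counts[state][nxt] += 1
        holds[state][nxt][held] += 1

    Q, H = {}, {}
    for state, row in counts.items():
        total = sum(row.values())
        Q[state] = {k: c / total for k, c in row.items()}
        H[state] = {
            k: {t: n / row[k] for t, n in holds[state][k].items()} for k in row
        }
    return Q, H


# Hypothetical log: holding times are in discretization units.
log = [("S1", 3), ("S2", 1), ("S1", 3), ("S2", 2), ("S3", 1)]
Q_est, H_est = estimate_q_h(log)
```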
SMP for Resource Availability (Contd…)
Pi,j(m) = Pi,j(Winit, Winit+m)
P1i,k(l) – interval transition probabilities for a one-step
transition
d – time unit of a discretization interval
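A sketch of how TR could follow from the interval transition probabilities, assuming the failure states are absorbing so that the probability of being in one of them after T/d steps equals the probability of having entered it within the window. The function names and the example row of probabilities are hypothetical.

```python
def window_steps(T, d):
    """Number of discretization intervals of length d in a window of length T."""
    return int(T // d)


def temporal_reliability(p_init_row, failure_states):
    """TR = 1 - sum over failure states j of P_init,j(T/d).

    p_init_row: probabilities P_init,j(T/d) for every state j, starting from
    the initial state; failure_states: indices of S3, S4 and S5 in that row.
    """
    return 1.0 - sum(p_init_row[j] for j in failure_states)


# Hypothetical values for states S1..S5 (indices 0..4), not from the paper.
row = [0.6, 0.2, 0.1, 0.05, 0.05]
tr = temporal_reliability(row, failure_states=(2, 3, 4))  # roughly 0.8
m = window_steps(T=3600, d=60)
```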
System Design and Implementation
Client requests job submission
Client’s job scheduler queries
the gateways on available
machines for temporal
availabilities
Chooses a machine and
spawns a guest job
During job execution, monitor
detects state transition and
notifies gateway
Gateway renices or kills the
guest processes accordingly
Resource monitor uses simple
commands such as 'top' to
calculate CPU usage
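The gateway's reaction to a state transition might look like the sketch below on a Unix host. The mapping from states to actions (renice in S2, kill in S3 and S4), the function names, and the use of SIGTERM are assumptions; only the fact that the gateway renices or kills guest processes comes from the slides.

```python
import os
import signal


def action_for(state):
    """Decide what the gateway does with the guest process in a given state."""
    if state == "S2":
        return "renice"
    if state in ("S3", "S4"):
        return "kill"  # treating S4 like S3 is an assumption
    return "none"


def apply_action(action, guest_pid):
    """Carry out the action on a guest process (Unix only)."""
    if action == "renice":
        # Lowest scheduling priority, matching the priority 19 used in the experiments.
        os.setpriority(os.PRIO_PROCESS, guest_pid, 19)
    elif action == "kill":
        os.kill(guest_pid, signal.SIGTERM)


decisions = {state: action_for(state) for state in ("S1", "S2", "S3", "S4")}
```

Keeping the decision separate from the system calls makes the policy easy to check without touching a real process.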
Computation in Solving SMP
Matrix sparsity in SMP is exploited to reduce
computations
The sparse matrix is constructed based on 2
facts:
It takes a finite amount of time to transition from one
state to another
S3, S4, S5 are unrecoverable failure states
Prediction Accuracy
TR gets close to 0 for large time windows
Appropriate Training Size
Comparison with Linear
Regression Techniques
Injecting Noise
References
Resource Failure Prediction in Fine-Grained Cycle Sharing Systems.
X. Ren, S. Lee, R. Eigenmann, S. Bagchi. HPDC 2006.