Correlating Instrumentation data to system states: A building block

Correlating Instrumentation data
to system states: A building block
for automated diagnosis and
control
Ira Cohen, Jeffrey S. Chase et al.
Introduction

Networked systems continue to grow in scale

Complex behavior stems from the interaction of
Workload
Software structure
Hardware
Traffic conditions
System goals
A pervasive management system is needed to manage such systems

Examples?



HP OpenView
IBM Tivoli
(Both aggregate data and display it graphically)
Introduction

Two approaches to building self-managing
systems

A priori models



Event-condition-action rules
Not based on real systems
(Disadvantages?)


Difficult and costly
Unreliable, does not take account of all
Introduction

Statistical learning techniques

Assumes little to no domain knowledge


Hence “general”
Problem!

Still have to identify techniques that are
powerful enough to induce effective models
that are:



Efficient
Accurate
Robust
Goals

Automatically analyze instrumentation data from
network services in order to




Forecast
Diagnose
Repair failure conditions
We use Tree-Augmented Naïve Bayesian
networks (TANs) as the basis for
Diagnosis
Forecasting
Applied to system-level instrumentation in a 3-tier network service.
TANs are widely used in various fields, but have not been used
in the context of computer systems.
Goals

Analyzed data from 124 metrics gathered
from
a 3-tier e-commerce site under synthetic load
httperf as load generator
Java PetStore as platform
TAN models select combinations of metrics and
threshold values that correlate with compliance with Service
Level Objectives (SLOs) for average response time.
Results later
What is a TAN?


A Bayesian network is an annotated directed
acyclic graph encoding a joint probability
distribution
Naïve Bayesian network
State variable S is the only parent of all other vertices
Assumes all metrics are conditionally independent given S
TANs consider relationships among the metrics
themselves, with the constraint that each metric has
at most one parent other than S
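The one-extra-parent constraint can be illustrated with a minimal sketch. It assumes the standard TAN construction (a maximum-weight spanning tree over the metrics, weighted by conditional mutual information given S), which these slides do not spell out. The function names and the dictionary layout are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def conditional_mutual_information(xs, ys, ss):
    """Estimate I(X; Y | S) in nats from three equal-length discrete sequences."""
    n = len(ss)
    c_xys = Counter(zip(xs, ys, ss))
    c_xs = Counter(zip(xs, ss))
    c_ys = Counter(zip(ys, ss))
    c_s = Counter(ss)
    total = 0.0
    for (x, y, s), n_xys in c_xys.items():
        # p(x,y,s) * log( p(x,y|s) / (p(x|s) * p(y|s)) ), written with raw counts
        total += (n_xys / n) * log(n_xys * c_s[s] / (c_xs[(x, s)] * c_ys[(y, s)]))
    return total


def tan_parents(metrics, states):
    """Return {metric: parent metric or None}; S is implicitly a parent of every metric.

    metrics: dict mapping metric name -> list of discretized values (assumed layout)
    states:  list of state labels, e.g. 0 = compliance, 1 = violation
    """
    names = list(metrics)
    weights = {
        frozenset((a, b)): conditional_mutual_information(metrics[a], metrics[b], states)
        for a, b in combinations(names, 2)
    }
    # Prim's algorithm for a maximum-weight spanning tree, rooted at the first metric.
    root = names[0]
    parents = {root: None}
    while len(parents) < len(names):
        child, parent = max(
            ((c, p) for c in names if c not in parents for p in parents),
            key=lambda edge: weights[frozenset(edge)],
        )
        parents[child] = parent
    return parents
```

Because the result is a tree, every metric ends up with at most one metric parent in addition to S, which is exactly the constraint stated above.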
Why Use a TAN?


Based on the premise that a relatively small
subset of metrics and threshold values
is sufficient to approximate the
distribution accurately
Outperforms general Bayesian
networks and other alternatives in both


Cost
Accuracy
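The slides do not say how the small subset of metrics is chosen. One common heuristic is greedy forward selection, sketched below as an assumption rather than as the authors' method. The `score` callback is hypothetical: it stands in for whatever accuracy estimate is used to evaluate a candidate subset.

```python
def greedy_select(candidates, score, max_metrics):
    """Greedy forward selection: repeatedly add the metric that most improves score(subset).

    candidates:  list of metric names
    score:       hypothetical callback taking a list of metric names, returning a number
    max_metrics: upper bound on the size of the selected subset
    """
    selected = []
    best = score(selected)
    while len(selected) < max_metrics:
        gains = [(score(selected + [m]), m) for m in candidates if m not in selected]
        if not gains:
            break
        top_score, top_metric = max(gains, key=lambda g: g[0])
        if top_score <= best:
            break  # no remaining metric improves the score
        selected.append(top_metric)
        best = top_score
    return selected, best
```

The search stops as soon as no additional metric helps, which keeps the resulting model small and cheap to evaluate.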
Why use a TAN?



Useful for forecasting failures and violations
It is possible to induce models that predict SLO violations
in the near future, even when the system is stable
An automated controller can invoke the model directly to
Identify an impending violation
Respond, for example by





Shedding load
Adding resources
Cheap model to induce
Possible to maintain multiple models
Periodic refresh
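One way to frame the forecasting use mentioned on this slide is as ordinary classification over shifted labels: the metrics observed in interval t are paired with the SLO state observed a fixed number of intervals later. The sketch below assumes that framing. The function name and the `horizon` parameter are hypothetical.

```python
def forecasting_pairs(metric_samples, slo_states, horizon):
    """Pair the metrics observed at interval t with the SLO state at interval t + horizon."""
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    if len(metric_samples) != len(slo_states):
        raise ValueError("need one SLO state per metric sample")
    # zip stops at the shorter sequence, so the last `horizon` samples are dropped.
    return list(zip(metric_samples, slo_states[horizon:]))
```

With `horizon=0` this reduces to the plain diagnosis setting, so the same induction code can serve both uses.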
Setup

The system is a 3-tier web service
Apache web server
Middleware (BEA WebLogic)
Oracle database
3 servers, with HP OpenView collecting
statistics
Load generator is httperf
An SLO indicator processes the logs to determine
compliance
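The SLO indicator step can be sketched as follows, assuming the logs have already been parsed into per-interval lists of response times and that the SLO is a bound on the average response time. The function name, the input layout, and the millisecond unit are assumptions made for illustration.

```python
from statistics import mean


def label_intervals(response_times_by_interval, slo_threshold_ms):
    """Return 1 (violation) or 0 (compliance) for each measurement interval.

    response_times_by_interval: list of lists of per-request response times (ms)
    slo_threshold_ms: hypothetical SLO bound on the average response time
    """
    labels = []
    for times in response_times_by_interval:
        if not times:
            labels.append(None)  # no requests observed; state is unknown
            continue
        labels.append(1 if mean(times) > slo_threshold_ms else 0)
    return labels
```

These labels play the role of the state variable S when a classifier is trained on the collected metrics.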
Interpretability and
Modifiability

TANs offer other advantages



Interpretability
Modifiability
The influence of each metric can be quantified in a
probabilistic model
Analysis catalogs each type of violation according to the
metrics and values that correlate with observed instances
The strength of each correlation is given by the probability of the value occurring in different
states
Gives insight into the causes of violations and how to repair them
Workloads

Workloads vary several characteristics
Aggregate request rate
Number of concurrent connections
Fraction of data-intensive vs. app-intensive
requests
This exercises the model-induction
methodology by providing it with a wide
range of (M, P) pairs, where
M = sample of values for system metrics
P = vector of application-level performance
measurements
Workloads


RAMP: increasing concurrency
STEP: background + step function
Constant background traffic
Bursty load in hour-long bursts
BUGGY: increasing aggregate request rate
Results




Varied SLO thresholds to explore their effect
on the induced models
To evaluate the accuracy of the models under
varying conditions
Trained and evaluated a TAN classifier for
each of 31 different SLO definitions
Baselines: the 60th-percentile SLO
classifier (MOD) and a classifier using CPU as the only metric.
Results






Overall balanced accuracy (BA) of TAN is 87-94%
90+% for all experiments
False alarm rate of 6% for two experiments, 17% for BUGGY
A single metric (CPU) is not sufficient to capture the pattern of
SLO violations
A small number of metrics (3-8) is sufficient to capture
the pattern
Models are sensitive to workload and SLO definition (MOD
always has a high detection rate, but generates false
alarms at an increasing rate as the SLO threshold increases)
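For reference, the quantities reported above can be computed from actual and predicted interval labels as sketched below. This assumes the usual definition of balanced accuracy as the average of the detection rate and the rate of correctly identified compliance (one minus the false alarm rate). The function name and label encoding are hypothetical.

```python
def balanced_accuracy(actual, predicted):
    """Return (balanced accuracy, detection rate, false alarm rate).

    Labels: 1 = SLO violation, 0 = SLO compliance.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # violations detected
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # violations missed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # compliance recognized
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one violation and one compliance interval")
    detection = tp / (tp + fn)
    false_alarm = fp / (fp + tn)
    return (detection + (1 - false_alarm)) / 2, detection, false_alarm
```

Unlike plain accuracy, this measure is not inflated when one state (typically compliance) dominates the data, which is why a classifier with a high detection rate can still score poorly if it raises many false alarms.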
Conclusion

TANs are attractive for self-managing
systems





Build system models automatically
No a priori knowledge required
Generalizes to wide range of conditions
Zeroes in on most relevant metrics
Practical
Conclusion



Possible future work: adapting this to changing
conditions
Closing the loop for automated diagnosis
and control
Ultimately, the most successful model is a
hybrid of


Automatically induced models
A priori models
Questions?