Transcript document
Towards Highly Reliable Enterprise
Network Services via Inference of
Multi-level Dependencies
Defense by
Chen, Jiazhen & Chen, Shiqi
Introduction
• Using a network-based service can be a
frustrating experience…
Even inside the network of
a single enterprise, where
traffic does not cross the
open Internet, user-perceptible service
degradations are rampant.
Introduction
When 13% of the requests take 10x longer
than the normal, acceptable processing time,
it's time for us to do something.
Introduction
• Conventional approach: up OR down
• Our approach: up, troubled or down.
Troubled: responses fall significantly
outside the normal response time, including
the case when only a subset of service
requests are performing poorly.
Sherlock
• Detects the existence of faults and
performance problems by monitoring the
response time of services.
• Determines the set of components that
could be responsible.
• Localizes the problem to the most likely
component.
Inference Graph and
Ferret Algorithm
• Inference Graph: Captures the dependencies
between all components of the IT infrastructure.
Determines the set of components that could be responsible.
• Ferret Algorithm: Efficiently localizes faults in
enterprise-scale networks using the Inference
Graph and measurements of service response
times made by agents.
Localizes the problem to the most likely component.
Tradeoffs
• Sherlock falls short of full problem diagnosis.
• It has not been evaluated in an environment where
dependencies are deliberately and frequently
changed.
• …
• BUT, measurements indicate that the vast
majority of enterprise applications do not
fall into these classes.
Inference Graph in detail
Node structure
• Root-cause node: physical components
whose failure can cause an end-user to
experience failures. (Two special root-cause
nodes: AT, "always troubled", and AD, "always down".)
• Observation node: accesses to network
services whose performance can be
measured by Sherlock.
• Meta-nodes: model the dependencies
between root causes and observations.
Node State
• The state of each node is expressed by a
three-tuple: (P_up, P_trouble, P_down)
• P_up + P_trouble + P_down = 1 (see the sketch below)
• The state of the root-cause node is
independent of any other node in the
Inference Graph.
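To make the three-state representation concrete, here is a minimal Python sketch (our own illustration, not Sherlock's actual data structures); the node name is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    """Probabilities of a node being up, troubled, or down; they must sum to 1."""
    p_up: float
    p_trouble: float
    p_down: float

    def __post_init__(self):
        total = self.p_up + self.p_trouble + self.p_down
        assert abs(total - 1.0) < 1e-9, "state probabilities must sum to 1"

# Example: a root-cause node believed to be healthy most of the time.
web_server_host = NodeState(p_up=0.98, p_trouble=0.015, p_down=0.005)
```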
Edges
• Edge from node A to B: the dependency
that node A has to be in the up state for
node B to be up.
• Each edge is given a dependency
probability, indicating how strong the
dependency is.
Meta-Nodes and the
Propagation of State
Noisy-Max Meta-Nodes
• Max: the child node takes the worst state
among its parents.
• Noisy: if the weight of a parent’s edge is
d, then with probability (1-d) the child is
not affected by that parent.
Meta-Nodes and the
Propagation of State
Selector Meta-Nodes
• Used to model load balancing scenarios.
• NLB: Network Load Balancer
• ECMP (Equal-Cost Multi-Path): a commonly used
technique in enterprise networks where routers send
packets to a destination along several paths.
Meta-Nodes and the
Propagation of State
Failover Meta-Nodes
• Failover: Clients access primary
production servers and fail over to backup
servers when the primary server is
inaccessible.
Calculation of State
Propagation
For Noisy-Max Meta-Nodes:
Time complexity: O(n) instead of O(3^n), where n is the number of parents (see the sketch below).
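A minimal Python sketch of the linear-time propagation, assuming the noisy-max rule described above (each parent's influence passes through only with probability equal to its edge weight d, and the child takes the worst propagated state); it reuses the NodeState sketch from the Node State slide and is illustrative rather than the system's actual code.

```python
def noisy_max(parents):
    """Propagate state through a noisy-max meta-node in O(n) time instead of O(3^n).

    parents: list of (NodeState, d) pairs, where d is the edge's dependency probability.
    """
    no_down = 1.0      # probability that no parent propagates a 'down' state
    no_abnormal = 1.0  # probability that no parent propagates 'down' or 'troubled'
    for state, d in parents:
        no_down *= 1.0 - d * state.p_down
        no_abnormal *= 1.0 - d * (state.p_down + state.p_trouble)
    p_down = 1.0 - no_down             # at least one parent pushes the child down
    p_up = no_abnormal                 # no parent pushes anything abnormal through
    p_trouble = no_down - no_abnormal  # something abnormal propagated, but nothing down
    return NodeState(p_up, p_trouble, p_down)
```

For instance, a child whose two parents are each slightly troubled stays mostly up, because each parent's trouble only propagates with probability d.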
Calculation of State
Propagation
For Selector and Failover meta-nodes:
• Still needs O(3^n) time.
• HOWEVER, those two types of meta-nodes have no more than 6 parents (see the sketch below).
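For these meta-nodes, a hedged sketch of the exhaustive computation follows. The truth table that maps a combination of parent states to the child's state is specific to each selector or failover meta-node and is not spelled out on the slides, so it is passed in as a function here.

```python
from itertools import product

STATES = ("up", "trouble", "down")

def propagate_exhaustive(parents, child_state_given):
    """Propagate state through a selector/failover meta-node by enumerating all
    3^n combinations of parent states.

    parents: list of NodeState.
    child_state_given: the meta-node's truth table, mapping a tuple of parent
    states such as ('up', 'down') to the child's state. With n <= 6 parents,
    3^6 = 729 combinations keep this cheap despite the exponential cost.
    """
    result = {"up": 0.0, "trouble": 0.0, "down": 0.0}
    for combo in product(STATES, repeat=len(parents)):
        prob = 1.0
        for parent, s in zip(parents, combo):
            prob *= {"up": parent.p_up,
                     "trouble": parent.p_trouble,
                     "down": parent.p_down}[s]
        result[child_state_given(combo)] += prob
    return NodeState(result["up"], result["trouble"], result["down"])
```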
Fault Localization
• Assignment-vector: An assignment of
state to every root-cause node in the
Inference Graph where each root-cause
node is assigned probability 1 of being
either up, troubled, or down.
• Our target: Find the assignment-vector
that can explain the observation best.
Ferret Algorithm
• Basically, for EACH vector we need to compute a
score for how well that vector explains the
observations, and report the most likely candidates
in sorted order.
• But there are 3^r possible vectors for r root-cause
nodes; how can we search through all of them?
Ferret Algorithm
• IMPORTANT OBSERVATION:
It is very likely that at any point in time only
a few root-cause nodes are troubled or
down.
• The Ferret algorithm evaluates only the vectors with
no more than k root-cause nodes troubled or down,
so it processes at most (2r)^k vectors (sketched below).
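A sketch of how this pruned search could be enumerated; it is illustrative only, `score` stands in for the scoring function described on the following slides, and the root causes are just names.

```python
from itertools import combinations, product

def candidate_vectors(root_causes, k):
    """Yield assignment vectors in which at most k root causes are abnormal
    (troubled or down); every other root cause is assigned 'up'."""
    for num_abnormal in range(k + 1):
        for abnormal in combinations(root_causes, num_abnormal):
            for states in product(("trouble", "down"), repeat=num_abnormal):
                vector = {r: "up" for r in root_causes}
                vector.update(zip(abnormal, states))
                yield vector

# Ferret scores every candidate and reports the best-scoring vectors, e.g.:
#   best = max(candidate_vectors(roots, k=4), key=score)
```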
Ferret algorithm
• The approximation error of the Ferret
algorithm becomes vanishingly small
for k = 4 and onwards.
Ferret Algorithm
• OBSERVATION:
Since a root-cause is assigned to be up in
most assignment vectors, evaluating an
assignment vector only requires re-evaluating
the states at the descendants of root-cause
nodes that are not up.
• So the Ferret algorithm can be further sped up.
Ferret Algorithm
Score for a given assignment vector:
• Track the history of response times and fit two
Gaussian distributions to the data, namely
Gaussian_up and Gaussian_troubled.
• If the observed response time is t and the predicted
state of the observation node is (p_up, p_troubled, p_down),
then the score contributed by this observation is (sketched below):
p_up * Prob(t | Gaussian_up) + p_troubled * Prob(t | Gaussian_troubled)
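A small sketch of that per-observation score, assuming each fitted Gaussian is summarized by a (mean, std) pair; names are illustrative.

```python
import math

def gaussian_pdf(t, mean, std):
    """Density of a normal distribution at t."""
    return math.exp(-((t - mean) ** 2) / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))

def observation_score(t, predicted, hist_up, hist_troubled):
    """Score one observation: t is the measured response time, `predicted` is the
    (p_up, p_troubled, p_down) tuple the assignment vector propagates to this
    observation node, and hist_up / hist_troubled are (mean, std) pairs fitted to
    the response-time history. p_down is unused here, matching the slide's formula
    for an observed response."""
    p_up, p_troubled, _p_down = predicted
    return (p_up * gaussian_pdf(t, *hist_up)
            + p_troubled * gaussian_pdf(t, *hist_troubled))
```

The per-observation scores are then aggregated into the vector's overall score, and the next slide's significance test compares the best vector's score against the score of the all-up null hypothesis.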
Ferret Algorithm
• We use a statistical test to determine if the
prediction is sufficiently meaningful to
deserve attention.
• Null hypothesis: all root causes are up.
• Significance is judged by the difference
Score(best prediction) - Score(null hypothesis).
The Sherlock System
Step1: Discover Service-Level Dependencies
Step2: Construct the Inference Graph
Step3: Fault Localization using Ferret
Remark: Sherlock requires no changes to
routers, applications, or middleware used
in the enterprise.
Discovering Service-Level
Dependencies
• OBSERVATION: If accessing service B
depends on service A, then packets exchanged
with A and B are likely to co-occur.
• Dependency probability: conditional probability
of accessing service A within the dependency
interval, prior to accessing service B.
• Dependency interval: 10ms
• Chance of co-occurrence: 10 ms / I. We retain a
dependency only when its probability is much
greater than the chance of co-occurrence (see the sketch below).
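A rough sketch of this co-occurrence test at a single client, assuming we have sorted lists of access timestamps for the two services (a data layout invented here for illustration):

```python
import bisect

def dependency_probability(times_a, times_b, window=0.010):
    """Estimate P(an access to service A occurred within `window` seconds before
    an access to service B), using sorted access timestamps (in seconds)
    collected at one client."""
    if not times_b:
        return 0.0
    hits = 0
    for tb in times_b:
        # Is there an access to A in the dependency interval [tb - window, tb)?
        i = bisect.bisect_left(times_a, tb - window)
        if i < len(times_a) and times_a[i] < tb:
            hits += 1
    return hits / len(times_b)

# Keep the dependency only if this probability is much greater than the chance
# of co-occurrence (roughly 10 ms / I from the slide).
```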
Aggregating Probabilities across
Clients
• OBSERVATION: many clients in an
enterprise network have similar host,
software, and network configurations and
are likely to have similar dependencies.
• Aggregation allows us to better handle
infrequently used services and false
dependencies.
Constructing the Inference
Graph
A: Create a noisy-max meta-node to represent the
service, and create an observation node for each
client that accesses it.
Constructing the Inference
Graph
B: Identify the set of services D_S that the clients
depend on when accessing S, and recurse on those services.
Constructing the Inference
Graph
C: Create a root-cause node to represent the host on
which the service runs, and make this root-cause node a
parent of the meta-node.
Constructing the Inference
Graph
D: Add network topology information to the Inference
Graph: add a noisy-max meta-node to represent each
network path, and root-cause nodes to represent each
router and link.
Constructing the Inference
Graph
E: Finally, put AT and AD into the Inference Graph. Give
each edge connecting AT or AD to an observation node a
weight of 0.001, and give the edges between a router and
a path meta-node a probability of 0.999. (A sketch of steps A-E follows.)
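A compressed, illustrative sketch of steps A through E. The data shapes, node naming, choice of networkx, and the weight on the host edge are our own assumptions; the 0.001 and 0.999 weights come from the slide; and the recursion of step B is flattened by assuming service_deps is already fully expanded.

```python
import networkx as nx  # one possible graph library; not what Sherlock itself uses

def build_inference_graph(clients, service_deps, host_of, paths):
    """clients: {service: [client, ...]}, service_deps: {service: {dep_service: prob}},
    host_of: {service: host}, paths: {(client, service): [router_or_link, ...]}."""
    g = nx.DiGraph()
    # E: two special root-cause nodes, connected to every observation node below.
    g.add_node("AT", kind="root-cause")  # "always troubled"
    g.add_node("AD", kind="root-cause")  # "always down"
    for service, users in clients.items():
        meta = f"noisy-max:{service}"                    # A: meta-node for the service
        g.add_node(meta, kind="meta")
        g.add_edge(host_of[service], meta, weight=1.0)   # C: host is a root-cause parent (weight assumed)
        for dep, prob in service_deps.get(service, {}).items():
            g.add_edge(f"noisy-max:{dep}", meta, weight=prob)  # B: dependencies on other services
        for client in users:
            obs = f"obs:{client}->{service}"             # A: one observation node per client
            g.add_node(obs, kind="observation")
            g.add_edge(meta, obs, weight=1.0)
            path_meta = f"path:{client}->{service}"      # D: network path meta-node
            g.add_node(path_meta, kind="meta")
            g.add_edge(path_meta, obs, weight=1.0)
            for component in paths.get((client, service), []):
                g.add_node(component, kind="root-cause")
                g.add_edge(component, path_meta, weight=0.999)  # routers/links as root causes
            g.add_edge("AT", obs, weight=0.001)          # E: weights from the slide
            g.add_edge("AD", obs, weight=0.001)
    return g
```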
Implementation
• Implement the Sherlock Agent as a user-level daemon
– Windows XP
– Pcap-based sniffer
Implementation
• Centralized Inference Engine
– Aggregates information easily
– Scalability
• Even for an extremely large network (10^5 agents), the bandwidth requirement is 10 Mbps
Implementation
• Computational Complexity of fault localization
– Scales linearly with graph size
Evaluation
• Deploy Sherlock system
in our enterprise network
– 40 servers
– 34 routers
– 54 IP links
– 2 LANs
– 3 weeks
• Agents periodically send requests to the web and file servers
– Mimic user behavior
• Controlled environments: Testbed and simulation
Evaluation
– Discovering Service Dependencies
• Evaluate Sherlock's algorithm for
– discovering service-level dependency graphs
• Two examples:
– Web Server
– File Server
• In the graph:
– Arrows:
• Point from the server providing a service to the server that relies on it.
– Edges:
• Represent the strength of the dependencies with weights
Evaluation
– Discovering Service Dependencies
• Service-level dependency graphs for visiting
web portal and sales website
– Clients depend on name lookup servers to access
websites
– Both websites share substantial portions of their
back-end dependencies
Evaluation
– Discovering Service Dependencies
• Service-level dependency graphs for visiting file server
– It turns out the file server is the root name of a distributed file system
Evaluation
– Discovering Service Dependencies
• Summary
– Service-level dependencies vary across services
– Similar dependencies can have different strengths
– Dependencies change over time
Evaluation
– Discovering Service Dependencies
• Number of samples
– Dependency probabilities converge after about 4,000 samples.
– Once converged, service-level dependencies are
stable over several days to weeks.
Evaluation
– Discovering Service Dependencies
• Number of clients
– Sherlock aggregates samples from multiple clients
– Aggregating across 20 clients greatly reduces false dependencies
Evaluation
– Localizing Faults
• Evaluate “Ferret”
– Ability to localize faults
– Sensitivity to errors
Evaluation
– Localizing Faults
• Testbeds
– Three web servers,
– SQL server,
– Authentication server
• Failed or Overloaded:
– Overloaded link: drops 5% of packets
– Overloaded server: high CPU and disk utilization
Evaluation
– Inference Graph
• Service-level dependencies
– Arrows at the bottom level
Features of Ferret
• Localizes faults
– By correlating observations from multiple vantage points
• No threshold setting needed
• Uses clients' observations of web servers' performance
– Probabilistic correlation algorithm for reasoning
• Produces a list of suspects for the root cause
• Handles multiple simultaneous failures
Fault Localization Capability
• Affected by
– The number of vantage points
– The number of observations available in a time window
• Ideally, observations from the same time period exercise
all paths in the Inference Graph.
Fault Localization Capability
• In a time window of 10 seconds
– At least 70 unique clients access every one of the top
20 servers
– Top 20 servers alone provide enough observations to
localize faults
Evaluation
- Sherlock Field Deployment
• The resulting Inference Graph contains:
– 2,565 nodes
– 358 components that can fail independently
• Experiment:
– Running the Sherlock system for 120 hours
– Found 1029 performance problems
Evaluation
- Sherlock Field Deployment
• The result (figure)
– X-axis: time
– Y-axis: one row per component
Evaluation
- Sherlock Field Deployment
• Ferret’s computation:
– 87% of the problems are caused by 16 components
– For instance, Server1, Server2, and the link on R1
Features of Sherlock
• Discovers problems overlooked by
traditional threshold-based techniques.
– Requests requiring interaction between the web
server and the SQL backend experience poor
performance
• No threshold-based alerts were raised
• Sherlock caught it!
Features of Sherlock
• Discovers problems overlooked by
traditional threshold-based techniques.
– There is no threshold setting that can catch this problem
while avoiding false alarms
Sherlock vs. prior approaches
• Multi-level inference graph vs. Two-level
• Probabilistic dependencies
• Comparing them requires a large set of observations
where the actual root causes of the problems are known.
– Simulation!
Sherlock vs. prior approaches
• Conduct experiments with simulations
– Known topology and Inference Graph
• Randomly set the state of each root cause
• Repeating this 1,000 times produced 1,000 sets of observations
– Compare different techniques' ability
• To identify the correct root cause given the 1,000
observation sets
• Sherlock finds up to 32% more faults than Shrink and
SCORE
Sherlock vs. prior approaches
– The compared techniques
• Shrink: uses a two-level inference graph
• SCORE: uses minimal set cover
– a dependency either exists or does not
Time to Localize Faults
• Experiment:
– AMD 1.8 GHz, 1.5GB RAM
– Inference Graph:
• 500,000 nodes
• 2,300 clients
• 70 servers
• Ferret takes 24 minutes to localize an injected fault
Time to Localize Faults
• Running time
– Ferret's running time is always less than 4 ms times the
number of nodes in the Inference Graph
• Ferret is easily parallelizable
Impact of
-Errors in Inference Graph
• Errors are unavoidable when constructing
inference graphs
– How sensitive is Ferret to these errors?
• Experiment: perturb the Inference Graph
1. Randomly add new parents to a node
2. Randomly swap one of a node's parents with another node
3. Randomly change weights on edges
4. Randomly add an extra hop to a path or permute its intermediate hops
– The first three types model errors in dependency discovery
– The last type models errors in traceroute
Impact of
-Errors in Inference Graph
• Each point represents the average of 1,000 experiments
• When half of the paths/nodes/weights are perturbed
– Ferret still correctly localizes faults in 74.3% of the cases
• Perturbing the paths has the most impact
Modeling Redundancy Tech.
• Modeling load-balancing and redundancy
– Using specialized meta-nodes
• Experiment:
– Create 24 failure scenarios
• Where the root cause is a component connected to a
specialized meta-node
– Test Ferret using
• Inference graphs with specialized meta-nodes
• Inference graphs with noisy-max meta-nodes instead
– Result
• In the 14 cases involving a secondary server or backup path
– No difference between the two kinds of inference graphs
Summary of Evaluation
• Sherlock is able to discover service
dependencies in a few hours
• Since an enterprise network may contain a huge number of
nodes, this justifies the need for an automatic approach
• Sherlock effectively identifies performance problems and
narrows down the root cause
• Sherlock is robust to noise compared to other approaches.
Discussion – future works
• Virtual Machines
– Many enterprises run multiple servers on a
single piece of hardware via VMs.
– All the algorithms in this paper can simply
treat each VM as a separate host.
Discussion – future works
• Software Failures
– Software failures are not yet modeled
– Extending our Inference Graph to these
common failure modes
– This is our planned next step
Contributions
• Main contribution
– Sherlock system
• Assists IT admins in troubleshooting.
– We contribute
• A multi-level probabilistic inference model
• A technique for automatic construction of the
Inference Graph
• An algorithm that uses the Inference Graph to localize
the root cause.
Thanks~
Any Questions?