IPaddr, port

Download Report

Transcript IPaddr, port

Towards Highly Reliable
Enterprise Network Services
via Inference of
Multi-level Dependencies
Paramvir Bahl, Ranveer Chandra, Albert Greenberg,
Srikanth Kandula, David A. Maltz, Ming Zhang
Microsoft Research
Presented by
Zhenyu Pan
Introduction
Enterprise Network Service
• Network service is an
(IPaddr, port) pair.
• Enterprise network
– network of a single
enterprise.
– traffic does not
cross the open
Internet.
– user-perceptible
service
degradations are
rampant.
Introduction
Sherlock System
• Conventional approach
– treat each service as up or down.
– box-centric, blind to the complex set of dependencies
– meaningless alerts (15,000 alerts a day, almost universally
ignored).
• The new approach
– models service availability as a 3-state value:
• Up: response time is normal;
• Down: requests result in either an error status or no response at
all;
• Troubled: response times fall significantly outside of normal
response times.
– user-centric, does not report problems that do not directly
affect users.
Introduction
Sherlock System
• System components
– Detects faults and performance problems by monitoring the
response time of services. Software agents: run on each host,
analyze the packets, determine the set of services the host
depends.
– Determines the set of components that responsible, a service,
a router, or a link, etc. Sherlock server: assembles an multilevel, 3-state inference graph that captures the dependencies
between all components.
– Localizes the problem to the most likely component. Ferret
Algorithm: localizes faults using the inference graph.
• Main contributions:
– Inference Graph
– Ferret Algorithm
Introduction
Inference Graph
• An example
Inference Graph
Node Types
• Root-cause node: physical components whose failure can
cause an end-user to experience failures.
– computer, service, router, IP link, etc;
– two special root-cause nodes: always troubled (AT) and always
down (AD) to model external factors
• Observation node: accesses to network services whose
performances can be measured by Sherlock.
• Meta-node: model the dependencies between root causes
and observations. Three types of meta-nodes:
– noisy-max
– selector (load-balancers)
– failover (failover redundancy)
Inference Graph
Node States
• The state of each node
– three-tuple: (Pup, Ptrouble, Pdown)
– P stands for probability
– Pup + Ptrouble + Pdown = 1
• The state of the root-cause node is independent of any
other node.
• The state of observation nodes can be uniquely determined
from the state of its ancestors.
Inference Graph
Edges
• Edge from node A to B:
– the dependency that node A has to be in the up state for node
B to be up.
– For example,
• a client cannot retrieve a file from a file server if the path to that
file server is down.
• A client might still be able to retrieve the file even when the DNS
server is down, if the file server’s name to IP address mapping is
found in the client’s local DNS cache.
– dependency probability indicates how strong the dependency
is.
Inference Graph
Propagation of State
• Noisy-Max Meta-Nodes
– Max: The node gets the worst condition of its parents.
– Noisy: if the weight of a parent’s edge is d, then with
probability (1-d) the child is not affected
Inference Graph
Propagation of State
• Selector Meta-Nodes
– Used to model load balancing
• NLB: Network Load Balancer.
• ECMP: routers send packets to a destination along several paths.
Inference Graph
Propagation of State
• Failover Meta-Nodes
– Failover: Clients access primary servers and failover to backup
servers when the primary server is inaccessible.
Inference Graph
Time to Propagate
• the majority of the nodes with more than one parent are
noisy-max meta-nodes.
– For these nodes, computation time is O(n)
• for selector and failover meta-nodes:
– Still needs O(3n) time.
– HOWEVER, those two types of meta-nodes have no more than
6 parents.
Inference Graph
Fault Localization
• Assignment-vector
– An assignment of state to every root-cause node which has
probability of 1 of being either up, troubled or down.
• Our target
– Find the assignment-vector that best explain the observation
• Ferret
– sets the root causes to the states specified in the assignmentvector and then propagate state probabilities downwards until
they reach the observation nodes.
– for each observation node, computes a score based on how
well the probabilities in the state of the observation node
agree with the statistical evidence.
Inference Graph
Fault Localization
• Impossible to traverse all possible assignment vectors to
determine the vector with the highest score
– OBSERVATION 1. It is very likely that at any point in time only
a few root-cause nodes are troubled or down.
– OBSERVATION 2. Since a root-cause is assigned to be up in
most assignment vectors, the evaluation of an assignment
vector only requires re-evaluation of states at the
descendants of root cause nodes that are not up.
Inference Graph
ferret algorithm
Sherlock System
Three-Step Process
• Step1: service-level dependency graph
– We define the dependency probability of a host on service A
when accessing service B as the probability the host needs to
communicate with service A before it can successfully
communicate with service B.
• Step2: Inference Graph
• Step3: Fault Localization using Ferret
• Score for a given assignment vector:
– Track the history of response time and fits two Gaussian
distribution to the data, namely Gaussianup and
Gaussiantroubled.
– If the observation time is t and the predicted observation node
is (pup, ptroubled, pdown), then the score of this vector is
calculated as: pup*Prob(t|Gaussianup) +
ptroubled*Prob(t|Gaussiantroubled)
Sherlock System
Discovering Service-Level Dependencies
• OBSERVATION: If accessing service B depends on service
A, then packets exchanged with A and B are likely to cooccur.
• Dependency probability: conditional probability of accessing
service A within the dependency interval, prior to accessing
service B.
– Dependency interval: 10ms
• Chance of co-occurrence: first calculate the average
interval, I, between accesses to the same service and
estimate the likelihood of “chance co-occurrence” as
(10ms)/I. They then retain only the dependencies where
the dependency probability is much greater than the
likelihood of chance co-occurrence.
Sherlock System
Constructing the Inference Graph
• creates a noisy max meta-node to represent the service.
• creates an observation node and makes the service metanode a parent of the observation node.
• examines the service dependency information recursively.
• creates a root-cause node to represent the host on which
the service runs and makes this root-cause a parent of the
meta-node.
• adds network topology information by using trace route
results. For each path between hosts, it adds a noisy-max
meta node to represent the path and root-cause nodes to
represent every router and link on the path.
• adds each of these root-causes as parents of the path
meta-node.
• put AT and AD. Give each the edges connecting AT/AD to
the observation point a weight 0.001. And give the edges
between a router and a path meta-node a probability
0.999.
Implementation
Agent
Implementation
Service Dependency Graphs
Implementation
Dependency Probabilities
Implementation
Test Bed
•
Evaluation
Root Cause
•
Evaluation
Performance Comparison
•
Evaluation
Error Influence
•
Summary
• Main contribution
– Sherlock system
•
•
•
•
Assist IT Admin for troubleshooting.
A Multi-level probabilistic inference model
Automatic Construction of the Inference Graph
An algorithm to localize root cause.