wespy_1 (dec 2007)

Download Report

Transcript wespy_1 (dec 2007)

A victim-centric peer-assisted
framework for monitoring and
troubleshooting routing
problems
How to Monitor?
Four schemes
How to monitor? Pros
Cons
Monitor devices
such as routers
No overhead
Device status does not
directly translate into
user perceived
performance
Monitor BGP
updates
No overhead; Know
what happens in
other network
Do not see some
data-plane anomaly
Monitor flowlevel traffic
No overhead; Real
traffic; Witness direct
impact of failures
Do not witness
failures directly
Active probing
Extra overhead; May
Witness direct
impact of failures not mimic the real
traffic
What Constraint to Monitor?

Network meets ISP’s goals




Resource utilization
Routing goes as specified by policy
…
Network meets users’ goals



Reachability
 Most fundamental end-to-end property
 Easy to define and formulate
Delay, loss
 Less easier to define and formulate
Application level: Bulk transfer, VOIP
 Depends on reachability, delay, loss, etc
Our Monitor Scheme

Monitor reachability using active probing




Focus on reachability
Use ping – no need for remote cooperation
Trade off between probing efficiency and probing
coverage (challenges)
Disclaimers


Do not monitor delay or loss
Do not consider ISP’s goals
Troubleshooting -Next Step to Monitoring

Goal of troubleshooting



Localize the root cause
How local? Depends on the nature of the cause
Purpose of troubleshooting

Local root cause


Pin-point the problem and fix it
Remote root cause


Contact the responsible networks to solve the problem
By-pass the faulty network
Localize the Root Cause
Topology dimension
AS 1
a->b b->c c->d
AS 2
m->n m->l n->l
AS 3
x->y y->z z->x
Firewalls (those
who prevent
forwarding)
Forwarding paths
(those who do
forwarding)
Control plane
Physical and link
layers
Protocol dimension
Localize
the cause
Link level
at AS level
Localize the cause at protocol
Both AS
level
and protocol level
Troubleshooting:
Three building blocks

Tool


Data: generated by tool


traceroute, ping, netflow, looking glass, etc
e2e reachability, BGP updates, traffic profile, etc
Brain: the intelligent part, usually network
operator


Digest the data, make inference, leverage
dependency, draw from past experience
The key of troubleshooting. Hard problem
What Can We Do to Improve?

Improve the tool




Promote the cooperation among networks
Traceroute -> resilient remote traceroute
BGP feed -> resilient remote BGP feed
Improve the automation of brain

Unify previous work
Automatic Brain

It’s a challenging problem



Fault may occur at multiple levels
Involve machine learning
Example work:
Enterprise network services, sigcomm’07, by
Paramvir Bahl et al.
Dependency Graph Approach


Decompose a large system into components
Infer the dependencies among components



A set of observations on some components


A depends on B: If B fails, A fails
Lead to a hierarchy of dependencies: dependency graph
(like Makefile)
For example, F,H,X works but G fails
Infer the status of other components using
dependencies, finally locate the root cause
component
Dependency Graph
Example 1
Multi-tier dependency graph. Diagnoses multi-level fault but needs
automated construction. [ From Paramvir Bahl et al, sigcomm’07 ]
Dependency Graph
Example 2
Flat dependency graph. Diagnoses simple fault. [From Ramana Kompella
et al, infocom’07 ]
Trade-off in Decomposition


The granularity of decomposition determines
the how specific the troubleshooting is
Fine-grained decomposition



Advantage: more specific
Disadvantage: graph is more complex,
constructing and solve it is challenging
Coarse-grained decomposition


Advantage: graph is simple, constructing and
solving it is less challenging
Disadvantage: less specific
Dependency Graph
Regarding Internet Routing
A
B
p can ping q
A depends on B
p can send packets to q
q can send packets to p
Forwarding path p->q is OK
Physical path p->q is OK
Link u_i->u_{i+1}
is up
……
Control plane info is correctly propagated
AS N_i has correct
route
N_{i+1}
AS N_i imports routes
of prefix p
Path p->q before failure: IP hops: u_0, u_1, …, u_n,
AS hops: N_0, N_1, …., N_m
Dependency Graph Regarding
Internet Routing (cont.)

Account for three common root causes




Link/router failure
Router misconfiguration leading to missing route
(i.e. does not import route)
Router misconfiguration or attack leading to prefix
hijacking
Topology-wise locate the root cause, and also
tell among the three root causes

Reasonably specific
Recent Work on Network
Troubleshooting

Infocom’07, Detection and Localization of Network Black Holes, by
Ramana R. Kompella et al


CoNext’07, NetDiagnoser: Troubleshooting network unreachabilities
using end-to-end probes and routing data, by Amogh Dhamdhere et al


Automate the “brain”. Consider both physical failure and control plane fault.
For inter-domain. Flat dependency graph.
Sigcomm’07, Automating Cross-layer Diagnosis of Enterprise Wireless
Networks, by Cheng et al


Automate the “brain”. Consider only physical failure. Mainly for intra-domain.
Flat dependency graph.
Improving the “tool”. Measure and infer various delays in a wireless
environment
Sigcomm’07, Towards Highly Reliable Enterprise Network Services Via
Inference of Multi-level Dependencies, by Paramvir Bahl et al

Automate the “brain”. Mainly for enterprise network and services. Deal with
multi-level faults. Automatically generate multi-tier dependency graph.
NetDiagnoser: Overview


Troubleshooting unreachability
Fault assumption:





Link failure, router misconfiguration causing partial link
failure (in particular BGP export filter misconfiguration)
Deal with filtered traceroute
More comprehensive than previous work
Infrastructure: sensors, all pair-wise traceroute
Mechanisms:



Binary tomography
Per-neighor-basis logical link modeling control plane
Combining BGP withdraw message
NetDiagnoser: Logical Links
Netdiagnoser:
Dependency Assumption
P can send packets to q
Forwarding path
p->q is OK
Physical path p->q
is OK
Link u_i->u_{i+1}
is up
A
B
Control plane info is
correctly propagated
AS N_{i+1} exports
prefix q to AS N_i
A depends on B
P->q: IP hops: u_0, u_1, …, u_n,
AS hops: N_0, N_1, …., N_m