Sing06-diagnosis

Automated Fault Diagnosis in VoIP
31st March, 2006
Vishal Kumar Singh and Henning Schulzrinne

VoIP Diagnosis

What is automated VoIP diagnosis
  Determining failures in the network
  Automatically finding the root cause of the failure
Why VoIP diagnosis
  Networks are complex, making it difficult to troubleshoot problems
  Automatic fault diagnosis reduces human intervention
Issues in VoIP diagnosis
  Detecting failures/faults
  Finding the cause of failure, determining dependency relationships among different components for diagnosis
  Solution steps and approaches

Issues in Automated VoIP Diagnosis

Increasingly complex and diverse network elements
Complex interactions/relationships between different network elements
Different run time bindings for each application usage instance, e.g., different calls may use different DNS servers, SIP proxy servers, and media paths
A problem in one network element may manifest itself as a user-perceived failure of another element

Fault Identification

Service unavailability reporting
  Node/Device/UA generates faults (failure events), e.g., SNMP traps, failure messages
  A monitoring application, e.g., an SNMP-based application, detects service unavailability and reports the failure event
  The affected user reports service unavailability, e.g., by e-mail, by calling the helpdesk, or automatically by pressing a button on the phone while in a call and experiencing echo
  A dependent application detects service unavailability and generates a fault (failure event)

Fault Localization: Determining the Source of Problem

Fault classification: local vs. global (does it affect only me, or does it affect others also?)
Global failures
  Server failure, e.g., SIP proxy, DNS, or DB failures
  Network failures
Local failures
  Specific source failure, e.g., node A cannot make a call to anyone
  Specific destination or participant failure, e.g., no one can make a call to node B
Locally observed but global failures, e.g., the DNS service failed, but only B observed it

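The local vs. global question above ("does it affect only me?") can be sketched as a comparison between a node's own test result and the results peers get when they repeat the same test. This is only an illustration: the function name, the category labels, and the idea of passing peer results as a list of booleans are assumptions, not part of the slides.

```python
def classify_failure(my_ok, peer_oks):
    """Classify a failure from my own result and my peers' results.

    my_ok: True if the test succeeded for the reporting node.
    peer_oks: list of booleans, one per peer that repeated the same test.
    All labels returned here are illustrative.
    """
    if my_ok:
        return "no failure"
    if not peer_oks:
        return "unknown"  # nobody else tested, so locality cannot be decided
    if not any(peer_oks):
        return "global"   # every peer sees the same failure
    if all(peer_oks):
        return "local"    # only the reporting node is affected
    return "partial"      # some peers fail, some do not
```

For example, `classify_failure(False, [True, True])` returns `"local"`, while `classify_failure(False, [False, False])` returns `"global"`.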
Solution Approach

DYSWIS “Do you see what I see” [1]
Peers (nodes) perform diagnostic tests when another peer reports or detects a failure
Nodes can choose the diagnostic test depending on the dependency encoded as a decision tree
Nodes (at least some) will be initially preloaded with the dependency relationship in some format (e.g., XML based)
Nodes (at least some) may build and update the dependency relationship based on statistical and temporal analysis of the failure events they receive and the diagnostic tests they perform

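A preloaded, XML-based dependency relationship might look roughly like the sketch below. The slides only say the format may be XML based; the element names (`dependencies`, `service`, `depends`), the attribute names, and the test names are all hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element, attribute, and test names are assumptions.
DEPENDENCY_XML = """
<dependencies>
  <service name="sip-call" test="invite-test">
    <depends on="sip-proxy"/>
    <depends on="connectivity"/>
  </service>
  <service name="sip-proxy" test="sip-ping">
    <depends on="dns"/>
  </service>
  <service name="dns" test="dns-test"/>
  <service name="connectivity" test="ping"/>
</dependencies>
"""

def load_dependencies(xml_text):
    """Parse the XML into {service: {"test": ..., "depends_on": [...]}}."""
    root = ET.fromstring(xml_text)
    table = {}
    for service in root.findall("service"):
        table[service.get("name")] = {
            "test": service.get("test"),
            "depends_on": [d.get("on") for d in service.findall("depends")],
        }
    return table

DEPENDENCIES = load_dependencies(DEPENDENCY_XML)
```

A node could consult `DEPENDENCIES["sip-proxy"]["depends_on"]` to decide which element to test next after a proxy failure.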
Solution Approach

Store context information of past failures experienced by each node
  Store locality of past failure instances
    E.g., the specific server that was acting as the proxy server (for my call which failed)
    LAN, domain, subnet
    First hop at each layer, e.g., switch (MAC), default gateway (IP), domain's proxy (application layer)
  Failure count for each network element (statistical)
  Last failure timestamp for each network element
  Last successfully seen timestamp for each network element (why do I need to test the proxy for you, my call just went through)
  Temporal correlation of past failures (proxy seems to be failing after DNS fails)
Each node has a runtime dependency list based on past failures and diagnostic tests

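The per-element context described above (failure count, last failure timestamp, last successfully seen timestamp) can be sketched as a small record. The class name, field names, and the 60-second freshness window are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ElementHistory:
    """Failure context one node keeps for one network element (illustrative)."""
    failure_count: int = 0
    last_failure: Optional[float] = None   # timestamp of the last failure
    last_success: Optional[float] = None   # timestamp when last seen working

    def record(self, ok: bool, now: float) -> None:
        """Update the history with the outcome of one call or test."""
        if ok:
            self.last_success = now
        else:
            self.failure_count += 1
            self.last_failure = now

    def recently_seen_ok(self, now: float, window: float = 60.0) -> bool:
        """True if the element worked within `window` seconds and has not failed since."""
        if self.last_success is None or now - self.last_success > window:
            return False
        return self.last_failure is None or self.last_failure < self.last_success
```

A peer whose call just went through the proxy could use `recently_seen_ok` to answer a diagnostic request without re-running the test.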
Solution Architecture

[Figure: peers P1 to P8, spread across Domain A, Service Provider 1, and Service Provider 2, are linked by P2P connections. The figure also shows a SIP server and a DNS server, and is labeled with a DNS test, a SIP test, and a PESQ test. The call failed at P1.]
Nodes in different domains cooperate to determine the cause of failure

Solution Architecture: Logical View

[Figure: logical entities. Labels: failures in network; dependency graph generation (Bayesian network based, inference, other models); test results; decision tree updates; admin input (dependency relationships and tests, XML); dependencies encoded as decision tree, static and dynamic rules; triggers to perform tests (peer selection and probe selection); alerts.]
The above figure shows the logical entities and the separation between dependency graph generation and the distributed diagnostic infrastructure (enclosed in blue).

Solution Requirements

Request-response protocol between the node that experiences the failure and the peer nodes
Nodes' capability to perform diagnostic tests (probes), with probe selection based on cost/result
Encoding the dependency relationship into a decision tree (given as input by an expert, e.g., as XML)
Peer node discovery, based on
  Location (local network, domain)
  Capability to perform tests (based on specific tests)
Dependency graph generation and updating, based on
  Network failure events
  Diagnostic test results correlated with failures

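Peer node discovery by location and capability can be sketched as a simple filter. The `Peer` record, the domain labels, and the test names are hypothetical; the slides do not specify how peers advertise what they can test.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    """A peer and the diagnostic tests it can run (illustrative fields)."""
    name: str
    domain: str
    tests: set = field(default_factory=set)

def select_peers(peers, test, domain=None):
    """Return peers able to run `test`, optionally restricted to one domain."""
    return [
        p for p in peers
        if test in p.tests and (domain is None or p.domain == domain)
    ]

# Hypothetical peer table; names and capabilities are made up for the example.
PEERS = [
    Peer("P2", "domain-a", {"dns-test", "sip-test"}),
    Peer("P3", "domain-a", {"dns-test"}),
    Peer("P5", "provider-1", {"pesq-test", "sip-test"}),
]
```

With this table, `select_peers(PEERS, "sip-test")` yields P2 and P5, and adding `domain="domain-a"` narrows the result to P2.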
Test/Probe Selection

Which diagnostic probe to run, network layer or application layer, and for what kind of failures
A probe covering a broad range of failures gives faster but cruder, less accurate results
  E.g., PING vs. TCP Connect vs. SIP PING tests
  Cost of probe

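One possible reading of cost-based probe selection is to run probes from cheapest to most expensive and stop at the first one that fails. This is a sketch under that assumption; the cost values and layer labels are invented, and `run_probe` is a hypothetical callable that returns True when a probe succeeds.

```python
# Hypothetical probe table: costs and layer labels are illustrative only.
PROBES = [
    {"name": "PING", "layer": "network", "cost": 1},
    {"name": "TCP Connect", "layer": "transport", "cost": 2},
    {"name": "SIP PING", "layer": "application", "cost": 5},
]

def first_failing_probe(run_probe, probes=PROBES):
    """Run probes in ascending cost order and return the first that fails.

    run_probe: callable taking a probe name and returning True on success.
    Returns None if every probe succeeds.
    """
    for probe in sorted(probes, key=lambda p: p["cost"]):
        if not run_probe(probe["name"]):
            return probe
    return None
```

If PING and TCP Connect succeed but SIP PING fails, the returned probe points at the application layer, which narrows the fault without running every test on every failure.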
Dependency Classifications

Functional dependency
  At the generic service level, e.g., a SIP proxy depends on the DB service and the DNS service
Structural dependency
  Configuration time, e.g., the Columbia CS SIP proxy is configured to use the mysql database on metro-north
Operational dependency
  Runtime dependencies or run time bindings, e.g., the call which failed was using a failover SIP server obtained from DNS, which was running on host a.b.c.d in the IRT lab

Dependency Classifications: Layered Approach

Vertical and lateral dependencies: applications depend on other application layer services (e.g., the SIP service depends on the DB and DNS services) as well as on lower layer services
OSI layers as service dependency layers
  An application layer service also depends on a transport layer service, which in turn depends on a network layer service
  MAC layer: access point, switch
  Network layer: router
  Application layer: DNS, SIP, database
Topology based dependency
  E.g., calls from the CS domain depend on a specific SIP server; calls from lab phones depend on specific switches and routers

Dependency Graph
Dependency Graph Encoded to Decision Tree

[Figure: a dependency graph over A, B, C, and D, and the decision tree derived from it.]
A = SIP Call
B = DNS Server
C = SIP Proxy
D = Connectivity
A failed, use decision tree
Decision nodes C, B, and D each have Yes/No branches and invoke the decision tree for C, B, and D respectively
Cause not known: report, add new dependency

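Walking a decision tree of this kind can be sketched as a recursive check of dependencies. The legend (A, B, C, D) comes from the slide; the check order, the sub-dependencies of C and B, and the function names are assumptions, since the figure itself is not recoverable from the transcript.

```python
# Legend from the slide: A = SIP Call, C = SIP Proxy, B = DNS Server, D = Connectivity.
# The check order and the sub-dependencies below are assumed for illustration.
DEPENDS_ON = {
    "A": ["C", "B", "D"],
    "C": ["B", "D"],
    "B": ["D"],
    "D": [],
}

def diagnose(element, is_working):
    """Return the deepest failed dependency of `element`.

    is_working: callable taking an element name, True if its test passes.
    Returns None when no dependency fails, i.e. the cause is not known.
    """
    for dep in DEPENDS_ON[element]:
        if not is_working(dep):
            # A failed dependency invokes its own decision tree.
            deeper = diagnose(dep, is_working)
            return deeper if deeper is not None else dep
    return None
```

If the SIP proxy and the DNS server both fail their tests while connectivity is fine, `diagnose("A", ...)` returns `"B"`. When it returns None, the slide's "cause not known" step applies: the failure is reported and a new dependency may be added.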
Diagnostic Tests

SIP proxy
  Proxy server availability
    SIP PING
  Call routing availability
    Invite tests
  Call path determination
    SIP TraceRoute
Media path
  Quality related
    Speech quality degradation: MOS
    Echo
    Jitter: MOS, PESQ
    QoS: RTCP
  NAT/Firewall
    Checking binding expiration
    Firewall failure to open a port: one-way media
      How to determine which firewall in the path? SIP signaling?

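The slides do not define "SIP PING". A common way to check proxy availability is a SIP OPTIONS request, so the sketch below assumes that interpretation and only builds the request text without sending it. The function name, the `diag` user, and the example host names are hypothetical.

```python
import uuid

def build_sip_options(target_host, local_host, local_port=5060, user="diag"):
    """Build a minimal SIP OPTIONS request as text (nothing is sent)."""
    # RFC 3261 requires Via branch values to start with the magic cookie z9hG4bK.
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    # Header lines end with CRLF; an empty line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"

message = build_sip_options("proxy.example.com", "client.example.com")
```

A probe would send `message` over UDP to the proxy and treat any SIP response within a timeout as evidence that the proxy is reachable.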
Diagnostic Tests

DNS tests
DHCP
Switch/Router
  ARP/RARP/Multicast
  BGP failures
Conference mixers
Gateway
  Echo return loss: readings, analysis
DB
XCAP server tests
Presence service availability tests

Example

Call failure: possible causes
  SIP proxy server
    Database
    Authentication
  Media path failure
  Gateway
    Specific call legs: ERL, authentication, etc.
  DNS server failure
  End station failure
  Network failure, e.g., router, switch failure
Different calls will have different run time dependencies

Mapping to a Human Medical System

Doctors perform diagnostic tests to find the cause of a disease once the symptoms are reported; they may learn new things about the disease as part of the diagnostic tests
  Failures and triggered tests update the dependency graph
Medical researchers do different types of tests to learn about new diseases and to determine the cause of a disease and its relationship with other physiological systems
  A set of tests that can run periodically and can be used to build the dependency graph independent of failures

Solution Evolution

Learning the dependency graph from failure events and diagnostic tests
Learning using random/periodic testing to identify failures and determine relationships

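One simple way to learn relationships from failure events is to count how often one element fails shortly after another, as in the "proxy seems to be failing after DNS fails" observation. This is a sketch, not the authors' method: the 30-second window, the event format, and the names are assumptions, and the slides also mention Bayesian network based models that this does not attempt to reproduce.

```python
from collections import Counter

def follow_counts(events, window=30.0):
    """Count failures of one element that follow failures of another.

    events: list of (timestamp, element) failure records sorted by timestamp.
    Returns a Counter keyed by (earlier_element, later_element).
    """
    counts = Counter()
    for i, (t1, earlier) in enumerate(events):
        for t2, later in events[i + 1:]:
            if t2 - t1 > window:
                break  # events are sorted, so later ones are outside the window
            if later != earlier:
                counts[(earlier, later)] += 1
    return counts

# Hypothetical failure log: (seconds, element).
EVENTS = [
    (0.0, "dns"), (5.0, "sip-proxy"),
    (100.0, "dns"), (104.0, "sip-proxy"),
    (300.0, "sip-proxy"),
]
COUNTS = follow_counts(EVENTS)
```

Here the pair `("dns", "sip-proxy")` is counted twice and the reverse pair never, which hints that the proxy may depend on DNS. A real system would need far more data before adding such an edge to the dependency graph.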
Future Directions

Self healing
Predicting failures
Protocols for labeling event failures, which would enable automatically incorporating new devices/applications into the dependency system
Decision tree (dependency graph) based event correlation

Reference

[1] User-oriented Management of VoIP Applications (http://www.ibr.cs.tubs.de/projects/nmrg/meetings/2005/nancy/dyswis.pdf)