
Distributed Self Fault-Diagnosis
for SIP Multimedia Applications
Kai X. Miao (Intel)
Henning Schulzrinne (Columbia U.)
Vishal Kumar Singh (Columbia U./Motorola)
Qianni Deng (Shanghai Jiaotong University)
Oct. 2007
MMNS (San Jose)
Overview
• The transition in IT cost metrics
• End-to-end application-visible reliability still poor (~ 99.5%)
  – even though network elements have gotten much more reliable
  – particular impact on interactive applications (e.g., VoIP)
  – transient problems
• Lots of voodoo network management
• Existing network management doesn't work for VoIP and other modern applications
• Need user-centric rather than operator-centric management
• Proposal: peer-to-peer management
  – "Do You See What I See?"
• Using VoIP as running example -- most complex consumer application
  – but also applies to IPTV and other services
• Also use for reliability estimation and statistical fault characterization
Circle of blame
[Diagram: ISP, VSP, OS vendor, and application vendor each pass the blame around the circle]
• "probably packet loss in your Internet connection → reboot your DSL modem"
• "probably a gateway fault → choose us as provider"
• "must be a Windows registry problem → re-install Windows"
• "must be your software → upgrade"
Diagnostic undecidability
• symptom: "cannot reach server"
• more precise: send packet, but no response
• causes:
  – NAT problem (return packet dropped)?
  – firewall problem?
  – path to server broken?
  – outdated server information (moved)?
  – server dead?
• 5 causes → very different remedies (see the probe sketch below)
  – no good way for non-technical user to tell
• Whom do you call?
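As a rough illustration of why these causes are hard to tell apart locally (this sketch is not part of the original slides), the code below runs a few coarse probes against an assumed server; the hostname, port, and timeouts are placeholders, and the conclusions are only hints, which is exactly the undecidability the slide points at.

```java
// Hypothetical triage sketch for "cannot reach server"; host, port, and
// timeouts are assumptions, and the printed conclusions are only rough hints.
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

public class ReachabilityTriage {
    public static void main(String[] args) throws Exception {
        String host = "sip.example.com";   // placeholder server name
        int port = 5060;                   // placeholder SIP port

        InetAddress addr;
        try {
            addr = InetAddress.getByName(host);   // outdated server info shows up here
        } catch (UnknownHostException e) {
            System.out.println("DNS lookup failed: server information may be outdated");
            return;
        }
        boolean icmp = addr.isReachable(2000);    // often blocked, so a negative proves little
        boolean tcp;
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr, port), 2000);
            tcp = true;
        } catch (Exception e) {
            tcp = false;
        }
        if (tcp)       System.out.println("TCP connect works: suspect NAT/firewall on the return path of the original protocol");
        else if (icmp) System.out.println("host answers ICMP but not the port: firewall policy or dead server process");
        else           System.out.println("no response at all: broken path, NAT, firewall, or dead server -- cannot tell locally");
    }
}
```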
Traditional network management model
[Diagram: a central manager polls network elements via SNMP; a failed element is marked with an X]
"management from the center"
Old assumptions, now wrong
• Single provider (enterprise, carrier)
  – has access to most path elements
  – professionally managed
• Problems are hard failures & elements operate correctly
  – element failures ("link dead")
  – substantial packet loss
• Mostly L2 and L3 elements
  – switches, routers
  – rarely 802.11 APs
• Problems are specific to a protocol
  – "IP is not working"
• Indirect detection
  – MIB variable vs. actual protocol performance
• End systems don't need management
  – DMI & SNMP never succeeded
  – each application does its own updates
Managing the protocol stack
[Per-layer failure symptoms:]
• media: echo, gain problems, VAD action
• RTP: protocol problem, playout errors
• UDP/TCP: TCP neg. failure, NAT time-out, firewall policy
• IP: no route, packet loss
• SIP: protocol problem, authorization, asymmetric conn (NAT)
Types of failures
• Hard failures
– connection attempt fails
– no media connection
– NAT time-out
• Soft failures (degradation)
– packet loss (bursts)
• access network? backbone? remote access?
– delay (bursts)
• OS? access networks?
– acoustic problems (microphone gain, echo)
Examples of additional problems
• ping and traceroute no longer work reliably
  – WinXP SP2 turns off ICMP
  – some networks filter all ICMP messages
• Early NAT binding time-out
  – initial packet exchange succeeds, but then the TCP binding is removed ("web-only Internet")
• Policy intent vs. failure
  – "broken by design"
  – "we don't allow port 25" vs. "SMTP server temporarily unreachable"
Fault localization
• Fault classification – local vs. global
– Does it affect only me or does it affect others also?
• Global failures
– Server failure
• e.g., SIP proxy, DNS failure, database failures
– Network failures
• Local failures
– Specific source failure
• node A cannot make call to anyone
– Specific destination or participant failure
• no one can make call to node B
– Locally observed, but global failures
• DNS service failed, but only B observed it
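A minimal sketch of this local-vs-global classification, assuming a hypothetical Peer interface through which other nodes report whether they can reach the same service (the slides only say that nodes ask others for their view; the interface and method names below are illustrative):

```java
// Illustrative only: classify a failure as local or global by polling peers.
// The Peer interface is an assumption, not an API from the paper.
import java.util.List;

interface Peer {
    boolean canReach(String service);   // true if this peer can use the service
}

class FaultLocalizer {
    enum Scope { LOCAL, GLOBAL, MIXED }

    Scope classify(String failedService, List<Peer> peers) {
        int ok = 0, failed = 0;
        for (Peer p : peers) {
            if (p.canReach(failedService)) ok++; else failed++;
        }
        if (failed == 0) return Scope.LOCAL;    // only we see the problem
        if (ok == 0)     return Scope.GLOBAL;   // everyone sees it (e.g., server down)
        return Scope.MIXED;                     // partial outage, needs more probing
    }
}
```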
Proposal: "Do You See What I See?" (DYSWIS)
• Each node has a set of active and passive measurement tools
• Use intercept (NDIS, pcap)
  – to detect problems automatically (see the detector sketch below)
    • e.g., no response to HTTP or DNS request
  – gather performance statistics (packet jitter)
  – capture RTCP and similar measurement packets
• Nodes can ask others for their view
  – possibly also dedicated "weather stations"
• Iterative process, leading to:
  – user indication of cause of failure
  – in some cases, a work-around (application-layer routing) → TURN server, use remote DNS servers
• Nodes collect statistical information on failures and their likely causes
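The "no response to HTTP or DNS request" trigger mentioned above could, for example, sit on top of the intercept layer roughly as sketched here; the capture layer is omitted, and the key format and timeout are assumptions rather than details from the slides.

```java
// Sketch of a request/response matcher fed by an intercept layer (NDIS/pcap).
// Keys such as "DNS:www.example.com" and the 5-second timeout are illustrative
// assumptions; packet capture itself is not shown.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NoResponseDetector {
    private static final long TIMEOUT_MS = 5000;
    private final Map<String, Long> pending = new ConcurrentHashMap<>();

    void onRequest(String key)  { pending.put(key, System.currentTimeMillis()); }
    void onResponse(String key) { pending.remove(key); }

    // Called periodically: requests that never got an answer are reported so
    // the node can start peer diagnostics for them.
    List<String> expired() {
        long now = System.currentTimeMillis();
        List<String> out = new ArrayList<>();
        pending.forEach((key, sent) -> {
            if (now - sent > TIMEOUT_MS) {
                out.add(key);
                pending.remove(key);
            }
        });
        return out;
    }
}
```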
Architecture
[Diagram: a "not working" notification triggers the local node to inspect protocol requests (DNS, HTTP, RTCP, …), request diagnostics, and orchestrate tests, e.g., ping 127.0.0.1 and "can a buddy reach our resolver?"; it contacts other nodes, concludes e.g. "DNS failure for 15m", and notifies the admin (email, IM, SIP events, …). A sketch of this sequence follows.]
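The sketch below only illustrates the ordering shown in the figure (local checks first, then a buddy's view); the Buddy interface, the resolver handling, and the test name www.example.com are assumptions, not the actual DYSWIS logic.

```java
// Illustrative orchestration of the figure's steps; Buddy and the test name
// are assumptions.
import java.net.InetAddress;

interface Buddy {
    boolean canResolve(String name);   // can the buddy resolve this name?
}

class DnsDiagnosis {
    String run(Buddy buddy) throws Exception {
        // step 1: is the local IP stack alive at all?
        if (!InetAddress.getLoopbackAddress().isReachable(1000))
            return "local IP stack broken (ping 127.0.0.1 failed)";

        // step 2: does a lookup through our resolver work?
        boolean localOk;
        try { InetAddress.getByName("www.example.com"); localOk = true; }
        catch (Exception e) { localOk = false; }
        if (localOk) return "DNS looks fine from here";

        // step 3: ask a buddy whether it sees the same failure
        return buddy.canResolve("www.example.com")
            ? "buddy can resolve: suspect our access network or NAT"
            : "buddy fails too: conclude \"DNS failure\", notify admin (email, IM, SIP events)";
    }
}
```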
Solution architecture
[Diagram: peer nodes P1–P8 connected via P2P across Service Provider 1, Service Provider 2, and Domain A; after a call fails at P1, peers run SIP, DNS, and PESQ tests against the SIP server and DNS server.]
Nodes in different domains cooperate to determine the cause of a failure.
Failure detection tools
• STUN server
  – what is your IP address?
• ping and traceroute
• Transport-level liveness and QoS (see the probe sketch below)
  – open TCP connection to port
  – send UDP ping to port
  – measure packet loss & jitter
• Need scriptable tools with dependency graph
  – using DROOLS for now
• TBD: remote diagnostics
  – fixed set ("do DNS lookup") or
  – applets (only remote access)
[Figure: protocol stack covered by the tools — media, RTP, UDP/TCP, IP]
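The transport-level checks in this list could look roughly like the sketch below; the host and port are placeholders, and loss/jitter measurement (which needs repeated timestamped probes) is omitted.

```java
// Rough sketch of the transport-level liveness checks; host/port values are
// assumptions, and a silent UDP service is not necessarily a failed one.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TransportProbe {
    // TCP liveness: can we open a connection to the given port?
    static boolean tcpAlive(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
            return true;
        } catch (Exception e) { return false; }
    }

    // UDP "ping": send a datagram and wait briefly for any reply.
    static boolean udpEcho(String host, int port) {
        try (DatagramSocket s = new DatagramSocket()) {
            s.setSoTimeout(2000);
            byte[] payload = "probe".getBytes();
            s.send(new DatagramPacket(payload, payload.length, InetAddress.getByName(host), port));
            byte[] buf = new byte[512];
            s.receive(new DatagramPacket(buf, buf.length));   // throws on timeout
            return true;
        } catch (Exception e) { return false; }
    }

    public static void main(String[] args) {
        System.out.println("SIP/TCP 5060: " + tcpAlive("sip.example.com", 5060));
        System.out.println("SIP/UDP 5060: " + udpEcho("sip.example.com", 5060));
    }
}
```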
Dependency classification
• Functional dependency
  – at the generic service level
    • e.g., SIP proxy depends on DB service, DNS service
• Structural dependency
  – fixed at configuration time
    • e.g., Columbia CS SIP proxy is configured to use a mysql database on host metro-north
• Operational dependency
  – runtime dependencies or run-time bindings
    • e.g., the call that failed was using the failover SIP server obtained from DNS, running on host a.b.c.d in the IRT lab
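A toy data model (not the authors'), only to make the three kinds concrete using the examples above; the type and names are illustrative.

```java
// Illustrative only: one way to tag the dependencies from the examples above.
import java.util.List;

enum DependencyKind { FUNCTIONAL, STRUCTURAL, OPERATIONAL }

record Dependency(String from, String to, DependencyKind kind) {}

class DependencyExamples {
    static List<Dependency> sipProxyDependencies() {
        return List.of(
            // generic service level
            new Dependency("SIP proxy", "DNS service", DependencyKind.FUNCTIONAL),
            // fixed at configuration time
            new Dependency("Columbia CS SIP proxy", "mysql on metro-north", DependencyKind.STRUCTURAL),
            // bound at run time for one particular call
            new Dependency("failed call", "failover SIP server at a.b.c.d", DependencyKind.OPERATIONAL)
        );
    }
}
```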
Dependency Graph
[Figure: example dependency graph]
Dependency graph encoded as decision tree
[Decision tree: A = SIP call, B = DNS server, C = SIP proxy, D = connectivity. When A fails, its decision tree is used: it invokes the decision trees for C, B, and D in turn (Yes/No branches); if none of them explains the failure, the cause is not known, so report it and add a new dependency. A code sketch of this traversal follows.]
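A compact sketch of this traversal; the class names, the test() callbacks, and the hard-coded outcomes in the demo are assumptions (the slides say the real system is being built with JBoss Rules / Drools 3.0 rather than hand-written Java).

```java
// Sketch of recursive decision-tree evaluation over the dependency graph.
// test() stands in for a diagnostic probe; true means the component is OK.
import java.util.function.Supplier;

class DecisionNode {
    final String name;
    final Supplier<Boolean> test;      // runs a diagnostic, true = component OK
    final DecisionNode[] children;     // sub-trees invoked when this node fails

    DecisionNode(String name, Supplier<Boolean> test, DecisionNode... children) {
        this.name = name; this.test = test; this.children = children;
    }

    // If this component fails, recurse into the components it depends on;
    // if none of them explains the failure, this node is reported itself.
    String diagnose() {
        if (test.get()) return null;                  // this component is fine
        for (DecisionNode child : children) {
            String cause = child.diagnose();
            if (cause != null) return cause;          // deeper cause found
        }
        return name + " failed; no failing dependency found below it";
    }
}

class Demo {
    public static void main(String[] args) {
        DecisionNode d = new DecisionNode("connectivity", () -> true);
        DecisionNode b = new DecisionNode("DNS server", () -> false, d);
        DecisionNode c = new DecisionNode("SIP proxy", () -> true, d);
        DecisionNode a = new DecisionNode("SIP call", () -> false, c, b, d);
        System.out.println(a.diagnose());   // reports the DNS server as the likely root cause
    }
}
```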
Current work
• Building decision tree system
• Using JBoss Rules (Drools 3.0)
Future work
• Learning the dependency graph from failure events and diagnostic tests
• Learning using random or periodic testing to identify failures and determine relationships
• Self healing
• Predicting failures
• Protocols for labeling event failures → enable automatically incorporating new devices/applications into the dependency system
• Decision tree (dependency graph) based event correlation
Conclusion
• Hypothesis: network reliability is the single largest open technical issue → prevents (some) new applications
• Existing management tools are of limited use to most enterprises and end users
• Transition to "self-service" networks
  – support non-technical users, not just NOCs running HP OpenView or Tivoli
• Need better view of network reliability