Root-Cause VoIP Troubleshooting.ppsx


Root-Cause VoIP
Troubleshooting
Optimizing the Process
Tim Titus
CTO, PathSolutions
Agenda
• Business disconnect
• Why is VoIP troubleshooting so hard?
• Troubleshooting methodology
• Tool selection
• Finding the root-cause
• Achieving Total Network Visibility
Business Disconnect
• You’re responsible for the entire VoIP Infrastructure
• Most telecom engineers know less about their network’s health and performance than their user community
“You can’t manage what you can’t measure” -- Peter Drucker
Why is VoIP Troubleshooting so Hard?
Business Reasons
• Networks are getting more complex
• Less staff remains to support the network
Technical Reasons
• Proper methodology is not utilized
• Wrong tools are employed
Troubleshooting Methodologies
What graduates a junior-level engineer to a senior-level engineer is their troubleshooting methodology
Bad Methodology
“Do something to try to fix the problem”
• Reboot the device
• Change the network settings
• Replace hardware
• Re-install the OS
Good Methodology
• Collect information
• Create hypothesis
• Test hypothesis
• Implement fix
• Verify original problem is solved and no new problems exist
• Notify users
• Document fix
• Undo changes
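The steps above can be sketched as a loop. This is only an illustration, and it assumes that "Undo changes" is the fallback taken when verification fails, which the slide does not state explicitly. The `Hypothesis` class, the `troubleshoot` function, and the callbacks are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """One candidate explanation plus the actions tied to it (hypothetical)."""
    description: str
    test: Callable[[], bool]        # True if collected evidence supports it
    apply_fix: Callable[[], None]   # implement the fix
    undo_fix: Callable[[], None]    # roll the fix back


def troubleshoot(hypotheses: List[Hypothesis],
                 problem_is_solved: Callable[[], bool],
                 log: List[str]) -> Optional[Hypothesis]:
    """Try hypotheses in order; return the one whose fix was verified."""
    for h in hypotheses:
        log.append(f"hypothesis: {h.description}")
        if not h.test():
            log.append("rejected by test")
            continue
        h.apply_fix()
        log.append("fix implemented")
        if problem_is_solved():
            # Verification passed: notify users and document the fix.
            log.append("verified, notify users and document fix")
            return h
        # Assumption: a failed verification means the change is undone
        # before the next hypothesis is tried.
        h.undo_fix()
        log.append("verification failed, changes undone")
    return None
```

The point of the undo step is that each attempt starts from a known state, so a later fix is never verified on top of leftover changes from an earlier guess.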
Tool Selection
Types of Tools
• Packet analyzers/capture
• Application Performance Monitoring (Call Simulation)
• CDR Analysis Tools
• SNMP Collectors
Packet Capture
Using a sniffer to solve a call quality problem
[Diagram: an actual VoIP call, with a packet capture attached, on a network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Results of VoIP Call
Latency: 127ms
Jitter: 87ms
Packet loss: 8.2%
You have confirmation that there is a problem, but no idea which device or link caused the packet loss
Packet Capture
Good for:
• Confirming packet loss
(Are we missing packets?)
• Confirming packet contents issues
(No QoS tagging on packets when there should be)
• Determining application-level issues
(Source and destination IP and ports used for a session)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
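As an illustration of what a capture can confirm, the sketch below estimates packet loss from the RTP sequence numbers seen in a capture and computes interarrival jitter using the running estimator from RFC 3550. The function names are hypothetical, both time inputs are assumed to be in milliseconds, and the loss estimate assumes packets are not reordered across a sequence-number wrap.

```python
def rtp_loss_percent(seq_numbers):
    """Estimate loss from received 16-bit RTP sequence numbers."""
    if not seq_numbers:
        return 0.0
    extended = []
    cycles = 0
    prev = seq_numbers[0]
    for s in seq_numbers:
        # A large backwards jump is treated as a 16-bit wrap.
        if s < prev and prev - s > 32768:
            cycles += 65536
        extended.append(s + cycles)
        prev = s
    expected = max(extended) - min(extended) + 1
    received = len(set(extended))   # duplicates are counted once
    return 100.0 * (expected - received) / expected


def rtp_interarrival_jitter(arrival_ms, rtp_time_ms):
    """Running jitter estimate: J += (|D| - J) / 16, as in RFC 3550."""
    jitter = 0.0
    for i in range(1, len(arrival_ms)):
        # D compares spacing at the receiver with spacing at the sender.
        d = (arrival_ms[i] - arrival_ms[i - 1]) - (rtp_time_ms[i] - rtp_time_ms[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

Both numbers describe the call as seen at the capture point only, which is why they confirm a problem without locating it.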
Application Performance Monitoring
Using call simulation to determine performance
[Diagram: a simulated VoIP call between two agents on a network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Results of Simulation
Latency: 127ms
Jitter: 87ms
Packet loss: 8.2%
You have knowledge of the experience across the network, but no understanding of the source or cause of the problem.
Application Performance Monitoring
Good for:
• Measuring user experience across the network
(Are we having problems right now?)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
CDR Analysis Tools
Using Call Detail Records to determine VoIP usage
[Diagram: an actual VoIP call on a network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53, with a PBX, a CDR collector, and a CDR record]
CDR Record
Call from x43 to x53 at 2:45pm
8.3% packet loss
46ms jitter
You have knowledge of a VoIP call and its perception of call quality, but no understanding of where or why there was a problem.
CDR Analysis Tools
Good for:
• Confirming a VoIP problem
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
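A minimal sketch of how CDR data can confirm a problem: flag the calls whose reported loss or jitter exceeds a threshold. The record field names and the default thresholds are assumptions made for this example; real CDR formats differ between PBX vendors, and acceptable limits depend on the deployment.

```python
def poor_quality_calls(cdrs, max_loss_pct=1.0, max_jitter_ms=30.0):
    """Return the CDRs whose loss or jitter exceeds the given limits.

    Each record is assumed to be a dict with "loss_pct" and "jitter_ms"
    keys (hypothetical field names).
    """
    return [
        c for c in cdrs
        if c["loss_pct"] > max_loss_pct or c["jitter_ms"] > max_jitter_ms
    ]


# Example records: the first mirrors the call shown on the slide,
# the second is an invented clean call for contrast.
example_cdrs = [
    {"caller": "x43", "callee": "x53", "time": "2:45pm",
     "loss_pct": 8.3, "jitter_ms": 46.0},
    {"caller": "x41", "callee": "x52", "time": "3:10pm",
     "loss_pct": 0.0, "jitter_ms": 4.0},
]
flagged = poor_quality_calls(example_cdrs)
```

The result says which calls were bad and when, but nothing in a record points at the device or link responsible.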
SNMP Collectors
Collecting information from switches and routers to discover faults
[Diagram: an actual VoIP call, with an SNMP collector attached, on a network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Results of Collection
WAN link is overloaded at 2:35pm
You have data about conditions on some parts of the network, but no analysis of the problem or correlation to events
SNMP Collectors
Good for:
• Tracking packet loss per interface/device
(Are we dropping packets on a link? Why?)
• Monitoring device and link resource limitations
(Are we over-utilizing a link? Is the router CPU pegged?)
Bad for:
• Determining who is using the network
• Finding application layer problems
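The link-utilization question above comes down to simple arithmetic on two polls of an interface octet counter. The sketch below assumes 32-bit counters such as IF-MIB `ifInOctets`, a link speed such as `ifSpeed` in bits per second, and at most one counter wrap between polls. The function names are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter readings, allowing for one wrap."""
    return (current - previous) % modulus


def link_utilization_percent(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Percentage of link capacity used between two polls."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)
```

On fast links a 32-bit octet counter can wrap more than once within a long polling interval, in which case the result is an underestimate; 64-bit counters avoid this where the device supports them.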
Finding the Root-Cause
[Diagram: a poor-quality VoIP call on a network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Step 1:
Identify the involved endpoints and where they are connected into the network
Finding the Root-Cause
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Step 2:
Identify the full layer-2 path through the network from the first phone to the second phone
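Steps 1 and 2 can be sketched as a path search over a list of links. The link list below is invented for illustration, since the slide's diagram does not survive in this transcript, and `find_path` is a hypothetical helper. The sketch assumes a loop-free (spanning-tree) topology, where only one path exists between two endpoints; on a real network the path would be read from the switches' forwarding tables rather than inferred.

```python
from collections import deque


def find_path(links, source, target):
    """Breadth-first search for a path between two nodes, or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous = {source: None}          # node -> node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:    # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical links, not taken from the slide.
example_links = [("x43", "E"), ("E", "D"), ("D", "F"),
                 ("F", "I"), ("I", "x53"), ("D", "B")]
example_path = find_path(example_links, "x43", "x53")
```

Every device and link on the returned path then becomes a candidate for the health and interface checks in the following steps.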
Finding the Root-Cause
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Step 3:
Investigate involved switch and router health (CPU & Memory) for acceptable levels
Finding the Root-Cause
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Step 4:
Investigate involved interfaces for:
• VLAN assignment
• DiffServ/QoS tagging
• Queuing configuration
• 802.1p priority settings
• Duplex mismatches
• Cable faults
• Half-duplex operation
• Broadcast storms
• Incorrect speed settings
• Over-subscription
TRANSIENT PROBLEM WARNING:
If the error condition is no longer occurring when this investigation is performed, you may not catch the problem
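One of the checks above, DiffServ/QoS tagging, can be illustrated with a few lines. The DSCP value is the upper six bits of the IP header's ToS/DS byte. The sketch assumes voice packets are expected to carry Expedited Forwarding (DSCP 46), which is a common convention but not something the slides specify; the function names are hypothetical.

```python
EF_DSCP = 46  # Expedited Forwarding; assumed marking for voice traffic


def dscp_from_tos(tos_byte):
    """Extract the 6-bit DSCP value from the 8-bit ToS/DS byte."""
    return (tos_byte & 0xFF) >> 2


def untagged_voice_packets(tos_bytes, expected_dscp=EF_DSCP):
    """Return the indexes of packets whose DSCP is not the expected value."""
    return [
        index for index, tos in enumerate(tos_bytes)
        if dscp_from_tos(tos) != expected_dscp
    ]
```

For example, a ToS byte of `0xB8` decodes to DSCP 46, while `0x00` is best-effort traffic and would be reported as untagged.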
Optimizing the Methodology
In a perfect world, you want:
• Monitoring of:
  • Every switch, router, and link in the entire infrastructure
  • All error counters on the interfaces
  • QoS configuration and performance
• Continuous collection of information
• Automatic layer-1, 2, and 3 mapping from any IP endpoint to any other IP endpoint
• Problems identified in plain English for rapid remediation
This is what PathSolutions TotalView does
Deployment
Install TotalView
All switches and routers are queried for information
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53]
Result:
One location is able to monitor all devices and links in the entire network for performance and errors
Total Network Visibility®
• Broad: All ports on all routers & switches
• Continuous: Health collected every 5 minutes
• Deep: 18 different error counters collected and analyzed
• Network Prescription engine provides plain-English descriptions of errors:
“This interface is dropping 12% of its packets due to a cable fault”
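To make the idea of a plain-English description concrete, the sketch below turns raw interface counters into a sentence like the one quoted above. It is a hypothetical illustration only and does not represent how the Network Prescription engine works; the function names, the inputs, and the way the cause is supplied are all assumptions.

```python
def loss_percent(error_packets, good_packets):
    """Share of packets lost to errors, as a percentage of all packets."""
    total = error_packets + good_packets
    return 0.0 if total == 0 else 100.0 * error_packets / total


def describe(interface, error_packets, good_packets, cause):
    """Build a one-line, plain-English summary for an interface."""
    pct = loss_percent(error_packets, good_packets)
    return f"{interface} is dropping {pct:.0f}% of its packets due to {cause}"
```

Calling `describe("Interface 7", 12, 88, "a cable fault")` yields a sentence of the same shape as the slide's example, with the interface name and counts invented here.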
Results Within 12 Minutes
Establish Baseline of Network Health
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53, annotated with the faults found]
• 12% loss from alignment errors
• 28% loss from duplex mismatch
• 7% loss from cabling fault
• 11% loss from jumbo frame misconfiguration
Results Within 12 Minutes
Repair Issues
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53, annotated with the faults to repair]
• 12% loss from alignment errors
• 28% loss from duplex mismatch
• 7% loss from cabling fault
• 11% loss from jumbo frame misconfiguration
Path Analysis Report
Investigate a call quality problem between x43 and x51 that happened around 2:35pm
[Diagram: network of nodes A through I connecting phones x41, x42, x43, x51, x52, and x53, annotated: 2:36pm, 18% loss from cable fault]
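The correlation this report performs can be sketched as a filter over continuously collected samples: keep the samples for interfaces on the call's path, taken near the time of the incident, that show loss. The sample layout, interface names, time window, and loss threshold below are all assumptions for illustration and are not TotalView's data model.

```python
from datetime import datetime, timedelta


def suspects(samples, path_interfaces, incident_time,
             window=timedelta(minutes=5), min_loss_pct=1.0):
    """Samples on the path, near the incident, with noticeable loss.

    Each sample is assumed to be a dict with "interface", "time"
    (datetime), and "loss_pct" keys. Worst loss is listed first.
    """
    hits = [
        s for s in samples
        if s["interface"] in path_interfaces
        and abs(s["time"] - incident_time) <= window
        and s["loss_pct"] >= min_loss_pct
    ]
    return sorted(hits, key=lambda s: s["loss_pct"], reverse=True)


# Invented samples; only the 2:36pm / 18% figures echo the slide.
day = datetime(2024, 1, 15)
example_samples = [
    {"interface": "D-F", "time": day.replace(hour=14, minute=36), "loss_pct": 18.0},
    {"interface": "D-F", "time": day.replace(hour=10, minute=0), "loss_pct": 20.0},
    {"interface": "A-B", "time": day.replace(hour=14, minute=36), "loss_pct": 30.0},
    {"interface": "E-D", "time": day.replace(hour=14, minute=35), "loss_pct": 0.0},
]
found = suspects(example_samples, {"E-D", "D-F"},
                 day.replace(hour=14, minute=35))
```

Because collection is continuous, the lookup still works after the fact, which addresses the transient-problem case where the fault has stopped by the time someone investigates.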
Demo
Don’t turtle your network