Finding Root-Cause: Network Troubleshooting
Optimizing the Process
Tim Titus
CTO, PathSolutions
Agenda
• Business disconnect
• Why is troubleshooting so hard?
• Troubleshooting methodology
• Tool selection
• Finding the root-cause
• Achieving Total Network Visibility
Business Disconnect
• You’re responsible for the entire network
• Most network engineers know less about their network’s health and performance than their user community

“You can’t manage what you can’t measure”
-- Peter Drucker
Why is Troubleshooting so Hard?
Business Reasons
• Networks are getting more complex
• Fewer staff remain to support the network
Technical Reasons
• A proper methodology is not used
• The wrong tools are employed
Troubleshooting Methodologies
What graduates a junior-level engineer to a senior-level engineer is their troubleshooting methodology.
Bad Methodology
“Do something to try to fix the problem”
• Reboot the device
• Change the network settings
• Replace hardware
• Re-install the OS
Good Methodology
• Collect information
• Create hypothesis
• Test hypothesis
• Implement fix
• Verify the original problem is solved and no new problems exist (if not, undo changes and start over)
• Notify users
• Document fix
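A rough sketch of the loop above in code form (not from the deck; every step is a placeholder callable supplied by the caller, and the point is only that a failed verification rolls the change back and starts over):

def troubleshoot(problem, collect, hypothesize, test, apply_fix, verify, undo):
    """Outline of the methodology: loop until a verified fix is in place."""
    while True:
        facts = collect(problem)            # gather interface stats, logs, user reports
        hypothesis = hypothesize(facts)     # e.g. "duplex mismatch on the uplink"
        if not test(hypothesis):
            continue                        # hypothesis disproved: collect more information
        change = apply_fix(hypothesis)
        if verify(problem):
            return change                   # then notify users and document the fix
        undo(change)                        # fix didn't hold: roll back and start over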
Tool Selection
Types of Tools
• Cable Testers
• Packet analyzers/capture
• Application Performance Monitoring (APM)
• Flow collectors
• SNMP Collectors
Cable Testers
Using a cable tester to solve a call quality problem
[Diagram: actual VoIP call between phones (x41-x43, x51-x53) across switches A-I; a cable tester is attached to one link in the path]
Results: 4.3 dB of loss, NEXT detected
Takeaway: You have information about Layer 1 on one link in the network
Cable Testers
Good for:
• Confirming physical issues on one link in the network
Bad for:
• Finding which link in the network has a physical issue
• Determining application usage
• Finding bandwidth limitations
• Finding device limitations
Packet Capture
Using a sniffer to solve a call quality problem
[Diagram: actual VoIP call between phones (x41-x43, x51-x53) across switches A-I; packet capture taken on one segment]
Results of VoIP call: Latency 127 ms, Jitter 87 ms, Packet loss 8.2%
Takeaway: You have confirmation that there is a problem, but no idea which device or link caused the packet loss
Packet Capture
Good for:
• Confirming packet loss
(Are we missing packets?)
• Confirming packet contents issues
(No QoS tagging on packets when there should be)
• Determining application-level issues
(Source and destination IP and ports used for a session)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
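As one concrete example of the packet-contents check above, a minimal sketch (assumptions, not part of the deck: scapy is installed, the capture file name, the RTP port range, and the expected EF/DSCP 46 marking for voice):

# Minimal sketch: confirm QoS tagging in a capture with scapy.
from scapy.all import rdpcap, IP, UDP

packets = rdpcap("voip_call.pcap")                       # capture from the suspect segment
voice = [p for p in packets
         if IP in p and UDP in p and 16384 <= p[UDP].dport <= 32767]  # common RTP range

untagged = [p for p in voice if (p[IP].tos >> 2) != 46]  # DSCP 46 (EF) expected for voice
print(f"{len(untagged)} of {len(voice)} voice packets are missing the EF marking")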
Application Performance Monitoring
Using APM to determine performance through the network
[Diagram: agents at both ends generate a simulated VoIP call across switches A-I (phones x41-x43, x51-x53)]
Results of simulation: Latency 127 ms, Jitter 87 ms, Packet loss 8.2%
Takeaway: You have knowledge of the experience across the network, but no understanding of the source or cause of the problem.
Application Performance Monitoring
Good for:
• Measuring user experience across the network
(Are we having problems right now?)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
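A minimal sketch of the kind of measurement an APM agent makes: timestamped UDP probes to a far-end agent that echoes them back (the peer address, port, probe count, and 20 ms pacing are illustrative assumptions, not any vendor's implementation):

import socket, statistics, time

PEER = ("192.0.2.10", 5005)            # far-end echo agent (example address)
COUNT = 100

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)

rtts = []
for seq in range(COUNT):
    sent = time.monotonic()
    sock.sendto(seq.to_bytes(4, "big"), PEER)
    try:
        sock.recvfrom(64)
        rtts.append((time.monotonic() - sent) * 1000.0)   # round-trip time in ms
    except socket.timeout:
        pass                                              # no echo: count it as loss
    time.sleep(0.02)                                      # 20 ms pacing, like an RTP stream

if rtts:
    loss = 100.0 * (COUNT - len(rtts)) / COUNT
    jitter = (statistics.mean(abs(a - b) for a, b in zip(rtts, rtts[1:]))
              if len(rtts) > 1 else 0.0)
    print(f"latency {statistics.mean(rtts):.1f} ms  jitter {jitter:.1f} ms  loss {loss:.1f}%")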
Flow Collectors
Using a flow collector to determine usage of the network
[Diagram: a router along the path of an actual VoIP call (phones x41-x43, x51-x53, switches A-I) exports flow records to a collector]
Results of flow: Source 192.168.1.12:80, Destination 172.16.3.98:3411, Packets: 251, Bytes: 19,386
Takeaway: You have knowledge of a transfer across the network, but no indication of whether there were any problems with the transfer.
Flow Collectors
Good for:
• Determining communications across the network
  - Who is using a link?
  - When do they use it?
  - What do they use it for?
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
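To make the who/when/what questions concrete, a toy sketch that aggregates flow records after they have been decoded (the records below are invented sample data standing in for decoded NetFlow/IPFIX exports):

from collections import defaultdict

flows = [                                               # invented sample records
    {"src": "192.168.1.12", "dport": 3411, "bytes": 19386},
    {"src": "192.168.1.12", "dport": 443,  "bytes": 88210},
    {"src": "192.168.1.40", "dport": 5060, "bytes": 1200},
]

usage = defaultdict(int)
for f in flows:
    usage[(f["src"], f["dport"])] += f["bytes"]          # bytes per talker and service

for (src, dport), total in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{src} -> port {dport}: {total} bytes")       # who is using it, and for what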
SNMP Collectors
Collecting information from switches and routers to discover faults
[Diagram: an SNMP collector polls the switches and routers (A-I) carrying an actual VoIP call (phones x41-x43, x51-x53)]
Results of collection: WAN link is overloaded at 2:35pm
Takeaway: You have data about conditions on some parts of the network, but no analysis of the problem or correlation to events
SNMP Collectors
Good for:
• Tracking packet loss per interface/device
(Are we dropping packets on a link? Why?)
• Monitoring device and link resource limitations
(Are we over-utilizing a link? Is the router CPU pegged?)
Bad for:
• Determining who is using the network
• Finding application layer problems
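A minimal sketch of the kind of data an SNMP poller gathers, using pysnmp's synchronous high-level API to read standard IF-MIB counters for one interface (the device address, community string, and ifIndex 3 are assumptions; the counters are cumulative, so a real collector stores them and works with deltas between polls):

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

err_ind, err_status, _, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData("public", mpModel=1),                  # SNMPv2c, example community
    UdpTransportTarget(("192.0.2.1", 161)),              # example router address
    ContextData(),
    ObjectType(ObjectIdentity("IF-MIB", "ifInErrors", 3)),
    ObjectType(ObjectIdentity("IF-MIB", "ifInUcastPkts", 3)),
))

if err_ind or err_status:
    raise RuntimeError(str(err_ind or err_status))

in_errors, in_ucast = (int(v[1]) for v in var_binds)
total = in_errors + in_ucast
print(f"inbound error rate: {100.0 * in_errors / total:.2f}%" if total else "no traffic counted yet")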
Finding the Root-Cause
[Diagram: poor-quality VoIP call between two phones attached to the network of switches A-I (extensions x41-x43, x51-x53)]
Step 1: Identify the involved endpoints and where they are connected into the network
Finding the Root-Cause
[Diagram: the layer-2 path of the call highlighted through switches A-I]
Step 2: Identify the full layer-2 path through the network from the first phone to the second phone
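One building block for Steps 1 and 2 is asking a switch which port it learned a phone's MAC address on, via the standard BRIDGE-MIB forwarding table. A rough sketch (the switch address, community, and MAC are assumptions; many switches need a per-VLAN community, and the returned bridge port still has to be mapped to an ifIndex through dot1dBasePortIfIndex):

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

FDB_PORT = "1.3.6.1.2.1.17.4.3.1.2"                      # BRIDGE-MIB dot1dTpFdbPort
SWITCH = ("192.0.2.2", 161)                              # example switch
TARGET_MAC = "00:1a:2b:3c:4d:5e"                         # example phone MAC

# The table is indexed by the MAC address expressed as six decimal octets.
mac_index = ".".join(str(int(octet, 16)) for octet in TARGET_MAC.split(":"))

err_ind, err_status, _, var_binds = next(getCmd(
    SnmpEngine(), CommunityData("public", mpModel=1),
    UdpTransportTarget(SWITCH), ContextData(),
    ObjectType(ObjectIdentity(f"{FDB_PORT}.{mac_index}")),
))

if err_ind or err_status:
    raise RuntimeError(str(err_ind or err_status))
# If the switch hasn't learned the MAC, the value prints as "No Such Instance...".
print(f"{TARGET_MAC} learned on bridge port {var_binds[0][1].prettyPrint()}")

Repeating this lookup against each switch (and following the uplinks) traces the layer-2 path hop by hop.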
Finding the Root-Cause
[Diagram: switches and routers along the path, checked for health]
Step 3: Investigate involved switch and router health (CPU & memory) for acceptable levels
Finding the Root-Cause
TRANSIENT PROBLEM WARNING: If the error condition is no longer occurring when this investigation is performed, you may not catch the problem
[Diagram: interfaces along the path, checked for configuration and errors]
Step 4: Investigate involved interfaces for:
• VLAN assignment
• DiffServ/QoS tagging
• Queuing configuration
• 802.1p priority settings
• Duplex mismatches (a check sketch follows this list)
• Cable faults
• Half-duplex operation
• Broadcast storms
• Incorrect speed settings
• Over-subscription
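For the duplex items above, a sketch of checking both ends of a link for agreement using EtherLike-MIB dot3StatsDuplexStatus (values: 1 = unknown, 2 = half, 3 = full). The device addresses, community, and ifIndex values are illustrative assumptions:

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

DUPLEX_OID = "1.3.6.1.2.1.10.7.2.1.19"                   # dot3StatsDuplexStatus

def duplex(host, if_index, community="public"):
    """Read the duplex status one end of the link reports for an interface."""
    err_ind, err_status, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(community, mpModel=1),
        UdpTransportTarget((host, 161)), ContextData(),
        ObjectType(ObjectIdentity(f"{DUPLEX_OID}.{if_index}")),
    ))
    if err_ind or err_status:
        raise RuntimeError(str(err_ind or err_status))
    return int(var_binds[0][1])

a = duplex("192.0.2.11", 49)                             # one end of the suspect link
b = duplex("192.0.2.12", 51)                             # the other end
print("duplex mismatch!" if a != b else "both ends agree on duplex")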
Optimizing the Methodology
In a perfect world, you want:
• Monitoring of:
  - Every switch, router, and link in the entire infrastructure
  - All error counters on the interfaces
  - QoS configuration and performance
• Continuous collection of information
• Automatic layer-1, 2, and 3 mapping from any IP endpoint to any other IP endpoint
• Problems identified in plain English for rapid remediation
This is what PathSolutions TotalView does
Deployment
Install TotalView. All switches and routers are queried for information.
Result: One location is able to monitor all devices and links in the entire network for performance and errors
Total Network Visibility®
• Broad: All ports on all routers & switches
• Continuous: Health collected every 5 minutes
• Deep: 18 different error counters collected and analyzed
• Network Prescription engine provides plain-English descriptions of errors:
  “This interface is dropping 12% of its packets due to a cable fault”
Results Within 12 Minutes
Establish Baseline of Network Health
[Network map with example findings: 12% loss from alignment errors, 7% loss from cabling fault, 28% loss from duplex mismatch, 11% loss from jumbo frame misconfiguration]
Results Within 12 Minutes
Repair Issues
[The same map, highlighting the issues to repair: 12% loss from alignment errors, 7% loss from cabling fault, 28% loss from duplex mismatch, 11% loss from jumbo frame misconfiguration]
Path Analysis Report
[Example findings along the path: 12:02pm, 12% loss from collisions; 7:56am, 18% loss from cable fault; 11:32am, 100% transmit utilization, 15% loss from discards, latency & jitter penalty incurred]
Demo
Don’t turtle your network
Free Network Equipment Magnet Set
With it, you will always have an easy way to map out your network on any whiteboard!
www.PathSolutions.com