Regular Latency Monitoring
Or: How I Learned to Start Worrying and Hate the Jitter
Aaron Brown
Internet2
Regular Monitoring
• Regular throughput testing is already being done
• Regular Iperf tests give administrators a good
view of how users' applications are performing
• ESnet has found numerous “soft failures” by
deploying this kind of infrastructure
• Effective at seeing how a bulk transport
application will perform
Regular Monitoring
• What about latency-sensitive applications?
• Cisco Telepresence Limits
• 10 ms jitter
• 160 ms delay
• 0.05% loss
• Polycom Limits
• 30-35 ms jitter
• 300 ms delay
• <1% loss
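The vendor limits above amount to a simple threshold check. Below is a minimal sketch of such a check in Python; the profile names and function are invented for illustration (the slide quotes Polycom loss as "<1%", simplified here to ≤1%).

```python
# Videoconferencing limits quoted on the slide, keyed by a hypothetical
# profile name. Values: jitter (ms), one-way delay (ms), loss (percent).
LIMITS = {
    "cisco_telepresence": {"jitter_ms": 10, "delay_ms": 160, "loss_pct": 0.05},
    "polycom":            {"jitter_ms": 35, "delay_ms": 300, "loss_pct": 1.0},
}

def meets_limits(profile, jitter_ms, delay_ms, loss_pct):
    """True if the measured path satisfies every limit in the profile."""
    lim = LIMITS[profile]
    return (jitter_ms <= lim["jitter_ms"]
            and delay_ms <= lim["delay_ms"]
            and loss_pct <= lim["loss_pct"])

# A path with 12 ms jitter fails Cisco Telepresence but passes Polycom.
print(meets_limits("cisco_telepresence", 12, 90, 0.01))  # False
print(meets_limits("polycom", 12, 90, 0.01))             # True
```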
Demo Monitoring
• Spring Member Meeting Demo
• Cisco Telepresence
• Endpoints
• Harvard
• Crystal Gateway Hotel
• Goals
• Measure delay/jitter/loss between these points
• Be able to fix any issues that come up
Demo Monitoring
• Deployed measurement machines at the
endpoints
• Set up regular tests between the machines
• perfSONAR-BUOY
• OWAMP
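The regular OWAMP tests boil down to per-packet one-way delays, from which loss and jitter are derived. The sketch below shows that arithmetic only; OWAMP itself reports richer statistics, and the jitter figure here (mean absolute delay variation between consecutive received packets) is one simple choice among several.

```python
# Summarize an OWAMP-style sample: a list of per-packet one-way delays
# in ms, with None marking a lost packet. Illustrative only.

def summarize(delays_ms):
    received = [d for d in delays_ms if d is not None]
    loss_pct = 100.0 * (len(delays_ms) - len(received)) / len(delays_ms)
    # Jitter: mean absolute difference between consecutive received delays.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = sum(diffs) / len(diffs) if diffs else 0.0
    return loss_pct, jitter_ms

loss, jitter = summarize([10.0, 10.5, None, 12.0, 10.2])
# One packet of five lost -> 20% loss; jitter ~= 1.27 ms.
```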
Demo Monitoring – Enabling Debugging
• Path Decomposition
• Deploy more hosts and run regular latency tests on
smaller segments of the path between end hosts
• Shows where on the path to look for the problem’s
cause
• Path Measurements
• Obtain utilization statistics from routers along the end-to-end path.
• Allow drilling down to better understand why
problems are occurring
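The path-decomposition idea above can be sketched as: run the same latency test over each smaller segment, then flag any segment whose jitter exceeds a threshold. Segment names and values below are invented for illustration.

```python
# Hypothetical per-segment jitter results (ms) from regular OWAMP tests
# run over smaller pieces of the end-to-end path.
segment_jitter_ms = {
    ("Hotel", "MAX"): 1.2,
    ("MAX", "Internet2-DC"): 9.8,
    ("Internet2-DC", "Internet2-NY"): 1.0,
    ("Internet2-NY", "NOX"): 0.9,
    ("NOX", "Harvard"): 1.1,
}

def suspect_segments(jitter_by_segment, threshold_ms=5.0):
    """Segments whose jitter meets or exceeds the threshold."""
    return [seg for seg, j in jitter_by_segment.items() if j >= threshold_ms]

suspects = suspect_segments(segment_jitter_ms)
# -> only the MAX -> Internet2-DC segment is flagged
```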
Demo Deployment
Internet2
POP
Harvard
Northern Crossroads
Internet2
POP
Mid-Atlantic Crossroads
Hotel
Analysis Software
• Software was written or modified to make it
easy to view and understand the data.
• Provides a variety of views
  • Status of the entire network
  • Status of a given host
  • Status of a given path
• Alerting mechanism when problems are seen
Network Health
• A grid view of the network describing the
latency, jitter and loss between all hosts
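A grid view of this kind is essentially a matrix of pairwise measurements. The sketch below renders one such grid from a dict of source/destination pairs; host names and numbers are hypothetical, not the demo's actual data.

```python
# Hypothetical pairwise latency measurements (ms) between monitoring hosts.
hosts = ["Harvard", "Hotel", "MAX"]
latency_ms = {
    ("Harvard", "Hotel"): 12.1, ("Hotel", "Harvard"): 12.3,
    ("Harvard", "MAX"): 9.0,    ("MAX", "Harvard"): 9.2,
    ("Hotel", "MAX"): 2.4,      ("MAX", "Hotel"): 2.5,
}

def grid(hosts, metric):
    """Build a source-by-destination table of formatted metric values."""
    rows = [["src\\dst"] + hosts]
    for src in hosts:
        row = [src]
        for dst in hosts:
            v = metric.get((src, dst))
            row.append("-" if src == dst or v is None else f"{v:.1f}")
        rows.append(row)
    return rows

for row in grid(hosts, latency_ms):
    print("  ".join(f"{cell:>8}" for cell in row))
```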
Host Health
• Shows graphs of jitter and loss from a given
host to all other hosts.
Path Status
• Shows graphs of jitter and loss between hosts
along with interface utilization for the path.
Nagios Alarms
• Alerts administrators when problems are
seen
• Easy integration into NOC reporting systems.
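Nagios integration works through the plugin convention that the check's exit code carries the status: 0 = OK, 1 = WARNING, 2 = CRITICAL. A minimal jitter check in that style might look like the following; the function name and thresholds (loosely mirroring the Cisco Telepresence limit) are assumptions.

```python
# Nagios-style check: return (exit_code, status_line) for a jitter reading.
def check_jitter(jitter_ms, warn=8.0, crit=10.0):
    if jitter_ms >= crit:
        return 2, f"CRITICAL - jitter {jitter_ms:.1f} ms >= {crit} ms"
    if jitter_ms >= warn:
        return 1, f"WARNING - jitter {jitter_ms:.1f} ms >= {warn} ms"
    return 0, f"OK - jitter {jitter_ms:.1f} ms"

print(check_jitter(3.0)[1])   # OK
print(check_jitter(12.0)[1])  # CRITICAL
```

A real plugin would print the status line and call `sys.exit()` with the code so Nagios can pick it up.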
Network Performance Analysis
• Several Potential Issues Identified
  • Highly utilized link in the path
  • Cross traffic
  • Test machine capability/quality
  • Other software running on the hosts
  • NTP drift
• All were solved and verified through
diagnostics and monitoring
Highly Utilized Link
• Initial observation: High jitter values observed
between Hotel and Harvard.
• Process: Isolate where the jitter is happening.
Highly Utilized Link – Path Decomposition
• Hotel to Northern Crossroads:
Highly Utilized Link – Path Decomposition
• Hotel to Internet2 (New York):
Highly Utilized Link – Path Decomposition
• Hotel to Internet2 (Washington DC):
Highly Utilized Link – Path Decomposition
• Hotel to Mid-Atlantic Crossroads:
Highly Utilized Link
• What we know via OWAMP
• Jitter is not between Hotel and MAX
• Jitter is somewhere between MAX and New York
• Next Steps
• Drill down using alternate data sources
• What do we have access to?
• SNMP on MAX, Internet2 Backbone (via perfSONAR of
course!)
• Can inquire about NOX/Harvard if necessary
Highly Utilized Link
• Examine each leg of the path:
  • Hotel to College Park
  • College Park Core
  • College Park to Level3 (McLean, VA)
  • Internet2 Uplink
• Identify points of congestion
• 1G uplink from Hotel to College Park
• 10G MAX core
• 2.5G Internet2 uplink
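The SNMP drill-down for congestion comes down to converting two octet-counter readings into a percentage of link capacity. The helper below shows only that arithmetic; polling details (OIDs such as IF-MIB's ifHCInOctets, counter wrap handling) are omitted, and the example numbers are illustrative.

```python
# Interface utilization over a polling interval, from two readings of
# an octet counter. Ignores counter wrap for brevity.
def utilization_pct(octets_t0, octets_t1, interval_s, capacity_bps):
    """Percent of link capacity used between the two readings."""
    bits = (octets_t1 - octets_t0) * 8
    return 100.0 * bits / (interval_s * capacity_bps)

# A 1G uplink moving 37.5 GB in a 5-minute interval is fully saturated:
u = utilization_pct(0, 37_500_000_000, 300, 1_000_000_000)  # 100.0
```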
Highly Utilized Link
• Observed on Internet2 Backbone:
Highly Utilized Link - Results
• Potential Solutions
• Identify flows, re-engineer traffic
• Re-plumb the demo path
• Increase Capacity
• Results
• Increased MAX headroom to 10G
Cross Traffic
• Short Jitter events observed between
backbone hosts (Chicago to New York):
Cross Traffic
• Short Jitter events observed between
backbone hosts (Chicago to Washington):
Cross Traffic
• Events were not directly related (different
times) but showed similar results
• Open question of what this is related to:
• Could be ongoing REDDnet testing
• Could be other research traffic
• Was a large but short flow…
• Proposed Solution: Drill down into SNMP
Data again
Cross Traffic
• Backbone (Chicago to New York)
Cross Traffic
• Backbone (Chicago to Washington)
Cross Traffic – Drilling Down
• Found the specific VLAN with the traffic:
Cross Traffic – Drilling Down
• The specific VLAN in this case was the
Internet2 Observatory VLAN – a 10g BWCTL
test was responsible!
• Potential Solutions:
• Isolate the traffic from the BWCTL machines away from the
OWAMP machines
• Disable BWCTL on New York and Washington
• Solution: Will do both, but settled on the latter for
now.
Test Machine Capability/Quality
• Machine stability is important for OWAMP:
• Stable Clocks
• Capable Hardware (CPU/Memory/Network)
• Capable Software (Operating System, Drivers)
• WAN Testing focused on host types (different
hardware – all running CentOS 5.3):
• Apple Mini
• ‘Cakebox’
• Shuttle
Test Machine Capability/Quality
• Apple Mini (Random Glance)
Test Machine Capability/Quality
• Shuttle (Random Glance)
Test Machine Capability/Quality
• ‘Cakebox’ (Random Glance)
Test Machine Capability/Quality
• Results:
• ‘Cakebox’ is older hardware, lots of jitter to all
sites
• Using a “built-in” NIC
• Mini is somewhat jittery, but works reasonably
• Shuttle worked best, featured better NIC
• Intel Network Card (e1000 Driver)
• Solution is to use the Shuttle as the primary
measurement host with the Mini as a backup.
Other Software Running On The Hosts
• Other software running on the monitoring
hosts can adversely affect monitoring
Other Software Running On The Hosts
• Went through and turned off software that was
unused by the demo.
• Made a list of all software running on the machine
• Used a binary-search approach
• Disable half the list, and see if the problem is fixed.
• Repeatedly pare down the list until the correct process is
found
• Problematic software: brltty
• “a background process which provides access to the
Linux/Unix console (when in text mode) for a blind
person using a refreshable braille display.”
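The binary-search elimination described above can be sketched as follows. The `is_problem_present()` probe stands in for "run a latency test with only these processes running and see whether the bad jitter appears"; it and the process list are purely hypothetical.

```python
# Repeatedly disable half of the suspect process list and re-test,
# paring the list down until one culprit remains. Assumes exactly one
# process is responsible.
def find_culprit(processes, is_problem_present):
    suspects = list(processes)
    while len(suspects) > 1:
        mid = len(suspects) // 2
        # "Disable" the first half; re-test with only the second half running.
        if is_problem_present(suspects[mid:]):
            suspects = suspects[mid:]   # culprit is still running
        else:
            suspects = suspects[:mid]   # culprit was among the disabled half
    return suspects[0]

culprit = find_culprit(
    ["crond", "brltty", "sshd", "ntpd"],
    lambda running: "brltty" in running,  # stand-in for a real latency test
)
# -> "brltty"
```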
Other Software Running On The Hosts
• When does brltty get disabled?
NTP/Clock Quality
• Clock stability is paramount for getting
reliable numbers between hosts
Same Timeframe, Same Destination Host, Different Source Host
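The reason clock stability dominates: a one-way delay is simply the receive timestamp minus the send timestamp, so any uncorrected offset between the two hosts' clocks lands directly in the measurement. A tiny sketch, with illustrative numbers:

```python
# One-way delay as measured between two hosts. Any clock offset on the
# receiver (relative to the sender) is indistinguishable from path delay.
def one_way_delay_ms(send_ts_ms, recv_ts_ms, clock_offset_ms=0.0):
    """Measured delay, including any uncorrected receiver clock offset."""
    return (recv_ts_ms + clock_offset_ms) - send_ts_ms

true_delay = one_way_delay_ms(1000.0, 1010.0)        # 10 ms, clocks synced
skewed = one_way_delay_ms(1000.0, 1010.0, -5.0)      # 5 ms: receiver clock 5 ms slow
```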
Spring Member Meeting Fiber Cut
• Fiber cut between Washington DC and New
York
Spring Member Meeting Fiber Cut
• Momentary packet loss as the path rerouted
Spring Member Meeting Fiber Cut
• Latency jumps as the paths get rerouted
• IP reroutes via Chicago
• Layer2 path rerouted via Cleveland
• Original Layer2 path restored
Questions?