投影片 1 - sigcomm

Download Report

Transcript 投影片 1 - sigcomm

Toward Interactive Debugging
for ISP Networks
Chia-Chi Lin†, Matthew Caesar†,
Jacobus Van der Merwe§
†University of Illinois at Urbana-Champaign
§AT&T Labs – Research
Debugging in ISP Networks
• Internet: most complex distributed system ever created
– Leads to complex failure modes
– Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging in ISP Networks
– Lack of visibility
– High rates of change of protocols
– Complex interdependencies
• These could cause devastating effects
– Long-term outages, slow repair
– February 2009 BGP outage
1
Interactive Debugging is Necessary
• Problems exist with fully automated
techniques
– Focus on detection rather than diagnosis
– Modeling could be inexact
– Logical and semantic errors seems to require
human knowledge to solve
• Our position:
– Humans must be “in-the-loop”
– Tools are required to facilitate the process
2
Pause when
the outage
occurs
A Scenario
Cloned Network
ISP
Customer
3
Our Vision
• Isolation of the operational network
– Prevent diagnostic procedure from interfering with live network
operation
– Solution: virtualization technologies
• Reproducibility of network execution
– Enable operator to replay execution, narrow in on rare events
– Solution: instill a pseudorandom ordering over events, messages
• Interactive stepping through execution
– Operator can slowly step through operation, trace messages
– Solution: protocols providing tight control over distributed execution
4
The Architecture
Virtual Service
Platforms
Physical
Network
Infrastructure
Application 2:
e.g. OSPF
Application 1:
e.g. BGP
Virtual Service
Nodes
Virtual Service
Coordinator
Physical
Network Debugging
Node
Coordinator
User (human
troubleshooter)
5
Key Challenge: Reproducibility
• Reproducibility simplifies interactive debugging
– Can run multiple times, varying inputs to narrow down cause
– When rare bug occurs, don’t need to wait for it to reoccur
• One option: generate comprehensive logs of all events
– e.g., log all packet sends/receives, all data
– Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
– Starting with the same initial state will produce same execution
– Make execution “pseudorandom” to explore different execution paths
– Key challenge: how to eliminate randomness in large-scale software
execution?
6
An Algorithm for Distributed, Reproducible Execution
• Approach:
– Encapsulate software in virtual environment
– Intercept software’s inputs/outputs, instill an ordering over them
– Make sure that ordering is the same, every time software is run
• How this is done:
– Network is run in lockstep fashion
– On every cycle: messages from neighbors are buffered
– Before deliver to application, pseudorandom ordering is instilled by
consistent hash of packet’s contents
– Human sends “step” commands to move to next lockstep cycle
7
Improving Performance for the Production Network
• Problem: running application in lockstep fashion
slows operation
– Might be okay for some protocols (e.g., BGP)
– Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events
– Choose pseudorandom ordering in advance that is likely to
happen anyway
– Don’t buffer packets, deliver them immediately
– If we guess wrong, roll back application to earlier state
8
Example: Running the Lockstep Algorithm in a Cloned Network
Transmission
Phase
App
S
I finished processing.
transmitting.
I am
ready to transmit.
process.
Processing
S
A
Phase
App
L
K
A
S
K
App
App
A
App
App
App
L
K
App
L
1.
2.
3.
4.
S
L
K
……
App
Sending Buffer
Receiving Buffer
9
Example: Live Algorithm in Production Network
S
The
Seattle
Packets
from
live algorithm
does
two things:
Seattle
should of events
• Determine the
ordering
come before
•10
Roll back events
violating the ordering
those
from Los is violated!
Pseudorandom ordering
Chicago
13
L
14
Salt Lake City Angeles
13
S
K
C
KL
C
K
C
Los Angeles
1.
2.
3.
4.
5.
6
14
Seattle
16
Los Angeles
Kansas City
Chicago
……
New York
C
3
9
K
10
Kansas City
8
Washington
7
Atlanta
11
Houston
10
Connecting the Two Algorithms
• We can run the production network using the live
algorithm
– Achieves a fixed ordering over messages
– But how to actually debug it?
• Solution: replay using the lockstep algorithm
– First let the production network run, checkpoint starting
state
– To debug, start lockstep algorithm with same staring state
– Lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on problem, experiment by
changing inputs, etc.
11
Simulation Settings
• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms + (0, 0.5] exponentially
distributed random delay
• Events simulated: Abilene IS-IS traces over the
month of January 2009 (giving 209 events)
• Measure performance overheads of our
approach
12
Results – Overhead in Production Networks
• Live algorithm suffers from rollbacks, incurring 4x
inflation in traffic overhead
• Using delay-estimation optimization reduces
overhead to 0.02x traffic inflation
13
Results – Response Time in Cloned Networks
• Low response time is beneficial to interactive
debugging
• Response time is low for variety of network sizes
14
Conclusion
• Humans are required to be “in-the-loop” to diagnose
problems
• Our architecture is a first step towards interactive
debugging
– Builds on known techniques, e.g., virtualization
technologies and distributed semaphores
– Develop techniques to reproduce distributed executions
• Simulations on real-world events show the scheme
accompanied with low overheads
15
16
The State of the Art: Automated Techniques
• Logging observations
– X-Trace, Friday, etc.
• Model checking
– rcc, OD flow, etc.
• Debugging standalone programs
– Coverity, AVIO, etc.
17
Optimized Ordering in the Production Network
• Goal: avoid rollbacks by selecting ordering likely to happen
anyway
– Events separated by long period will fall into different groups which
means ordering is easy
– Problem: some failure events are correlated
• E.g., multiple overlay links sharing same physical link
– How to order events in same group?
• Solution: if we know link delays, we can reliably estimate
expected arrival of events
– In practice we don’t know exact link delays
– But we can estimate them
– Can improve estimation by giving protocol messages high priority
18
Results – Storage in Production Network
• State required for rolling back packets is small
and increases slowly with network size
19