Internet Routing (COS 598A) Jennifer Rexford Today: Root-Cause Analysis Tuesdays/Thursdays 11:00am-12:20pm


Internet Routing (COS 598A)
Today: Root-Cause Analysis
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm
Outline
• Network troubleshooting
– Motivation for network troubleshooting
– Investigating from the edge vs. inside
• Active probing
– Traceroute
– Mapping IP addresses to AS numbers
• Passive monitoring
– Analyzing BGP update streams
– Identifying location and cause of routing change
– Limitations of the approach
Network Troubleshooting
“Why can’t I reach www.cnn.com?”
“Why is the performance bad?”
Internet
www.cnn.com
Reachability Problems: What Could be Wrong?
• End-host problem
– Web server down
– DNS server down, or misconfigured
• Forwarding-path problem
– Packet filter or firewall restricting access
– Mismatch in Maximum Transmission Unit (MTU)
• Routing problem
– User or server disconnected from Internet
– Blackhole dropping all packets
– Persistent loop
Performance Problem: What Could be Wrong?
• End-host problems
– Overloaded Web server
– Overloaded DNS server
– Overloaded user machine
• Forwarding-path problem
– High round-trip time
– Link congestion
• Routing problem
– Long-term routing instability
– Transient disruption during convergence
Motivation for Troubleshooting
• Improving performance
– Detect, diagnose, and fix the problem
– Pick a path through another provider
– Pick a different path in any overlay network
• Establishing accountability
– Enforce Service Level Agreements
– Rate service providers
• Characterizing the Internet
– Understand causes of performance problems
– Understand challenges of troubleshooting
Troubleshooting Outside vs. Inside
• Outside: from network edge (today's focus)
– Who: users, researchers, and operators troubleshooting problems outside their network
– Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms
– Challenges: inference from very limited data
• Inside: from inside the network
– Who: operators running a network
– Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files
– Challenges: collecting and joining the data
Active Probing
Pros and Cons of Active Probing
• Advantages
– Can run from any end system
– Measure the actual forwarding path
• See black-holes, loops, and delays directly
• Disadvantages
– Effects of routing changes, not the cause
– Current path, not the path used in the past
• Requires frequent probes to observe the changes
– Shows only properties of round-trip path
• Hard to tell if problem is on forward vs. reverse
Traceroute: Measuring the Forwarding Path
• Time-To-Live field in IP packet header
– Source sends a packet with a TTL of n
– Each router along the path decrements the TTL
– “TTL exceeded” sent when TTL reaches 0
• Traceroute tool exploits this TTL behavior
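[Figure: the source sends probes with TTL=1, TTL=2, … toward the destination; the router where a probe expires returns a “time exceeded” message]
Send packets with TTL=1, 2, 3, … and record the source of each “time exceeded” message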
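The TTL mechanism above can be sketched as a small simulation. This is a minimal illustration, not the real tool: it sends no packets, the forwarding path is supplied as a list, and the names `send_probe`, `traceroute`, and `silent` are hypothetical.

```python
def send_probe(path, silent, ttl):
    """Forward one probe along `path` (router addresses, destination last).

    Returns the address that answers, or None if that node stays silent.
    """
    for i, node in enumerate(path):
        ttl -= 1                              # each node decrements the TTL
        at_destination = (i == len(path) - 1)
        if ttl == 0 or at_destination:        # probe expires here, or has arrived
            return None if node in silent else node
    return None


def traceroute(path, silent=frozenset(), max_ttl=30):
    """Send probes with TTL = 1, 2, 3, ... and record who answers each one."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = send_probe(path, silent, ttl)
        hops.append(responder if responder is not None else "*")
        if responder == path[-1]:             # destination answered: stop
            break
    return hops


# Assumed example path; the second router never sends "time exceeded".
path = ["169.229.62.1", "169.229.59.225", "128.32.255.169", "64.236.16.52"]
print(traceroute(path, silent={"169.229.59.225"}))
```

A silent router shows up as "*" while later hops still answer, which is why a missing response alone does not prove the path is broken.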
Example Traceroute Output (Berkeley to CNN)
Hop  IP address       DNS name
1    169.229.62.1     inr-daedalus-0.CS.Berkeley.EDU
2    169.229.59.225   soda-cr-1-1-soda-br-6-2
3    128.32.255.169   vlan242.inr-202-doecev.Berkeley.EDU
4    128.32.0.249     gigE6-0-0.inr-666-doecev.Berkeley.EDU
5    128.32.0.66      qsv-juniper--ucb-gw.calren2.net
6    209.247.159.109  POS1-0.hsipaccess1.SanJose1.Level3.net
7    *                ?  (no response from router)
8    64.159.1.46      ?  (no name resolution)
9    209.247.9.170    pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33    pop2-atm-P0-2.atdn.net
11   *                ?  (no response from router)
12   66.185.136.17    pop1-atl-P4-0.atdn.net
13   64.236.16.52     www4.cnn.com
Example Troubleshooting Results
• No packets go beyond your gateway
– Gateway’s connection to Internet is dead
• Traceroute stops at intermediate point
– Perhaps a blackhole
• Traceroute path has a loop
– Transient or persistent forwarding loop
• Traceroute shows a very long path
– Routing anomaly, route hijacking, etc.
• Traceroute shows very long delays
– Delay or congestion on forward or reverse path
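The first three symptoms above can be sketched as a simple classifier over a list of responding addresses. This is a rough illustration under stated assumptions: `diagnose` is a hypothetical helper, "*" marks a missing response, and the labels are hints rather than conclusions (for example, a path that stops may only reflect routers that do not reply).

```python
def diagnose(hops, destination):
    """Classify a traceroute result given as a list of responder addresses."""
    answered = [h for h in hops if h != "*"]
    if len(set(answered)) < len(answered):
        return "loop"                          # some address appears twice
    if destination in answered:
        return "reached"
    if len(answered) <= 1:
        return "no packets beyond gateway"     # only the first hop answered
    return "stops at intermediate point"       # perhaps a blackhole


print(diagnose(["a", "b", "c", "b", "c"], "d"))   # loop
print(diagnose(["a", "*", "*"], "d"))             # no packets beyond gateway
```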
Problems with Traceroute
• Missing responses
– Routers might not send “Time-Exceeded”
– Firewalls may drop the probe packets
– “Time-Exceeded” reply may be dropped
• Misleading responses
– Probes taken while the path is changing
– Name not in DNS, or DNS entry misconfigured
• Mapping IP addresses
– Mapping interfaces to a common router
– Mapping interface/router to Autonomous System
Map Traceroute Hops to ASes
Traceroute output: (hop number, IP), with the AS each hop maps to
Hop  IP address       AS
1    169.229.62.1     AS25 (Berkeley)
2    169.229.59.225   AS25
3    128.32.255.169   AS25
4    128.32.0.249     AS25
5    128.32.0.66      AS11423 (Calren)
6    209.247.159.109  AS3356 (Level3)
7    *                AS3356
8    64.159.1.46      AS3356
9    209.247.9.170    AS3356
10   66.185.138.33    AS1668 (AOL)
11   *                AS1668
12   66.185.136.17    AS1668
13   64.236.16.52     AS5662 (CNN)
Need accurate IP-to-AS mappings (for network equipment).
Candidate Ways to Get IP-to-AS Mapping
• Routing address registry
– Voluntary public registry such as whois.radb.net
– Used by prtraceroute and “NANOG traceroute”
– Incomplete and quite out-of-date
• Mergers, acquisitions, delegation to customers
• Origin AS in BGP paths
– Public BGP routing tables such as RouteViews
– Used to translate traceroute data to an AS graph
– Incomplete and inaccurate… but usually right
• Multiple Origin ASes, no mapping, wrong mapping
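The origin-AS approach can be sketched as a longest-prefix match over (prefix, origin AS) pairs taken from a BGP table. This is a minimal sketch, assuming a linear scan is acceptable; `build_map` and `ip_to_as` are hypothetical names, and the 3.5.0.0/16 entry with AS 64512 is made up to show that the more specific prefix wins.

```python
import ipaddress


def build_map(origins):
    """origins: iterable of (prefix string, origin AS number) pairs."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in origins]


def ip_to_as(table, address):
    """Return the origin AS of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for net, asn in table:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


table = build_map([
    ("3.0.0.0/8", 80),          # origin AS 80, as in the RouteViews example
    ("9.184.112.0/20", 3786),   # origin AS 3786
    ("3.5.0.0/16", 64512),      # hypothetical more-specific prefix
])
print(ip_to_as(table, "3.1.2.3"))      # 80
print(ip_to_as(table, "3.5.0.1"))      # 64512
print(ip_to_as(table, "9.184.128.1"))  # None: no covering prefix
```

An address with no covering prefix returns None, which corresponds to the "no mapping" case above.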
Example: BGP Table (“show ip bgp” at RouteViews)
   Network          Next Hop        Metric LocPrf Weight Path
*  3.0.0.0/8        205.215.45.50                      0 4006 701 80 i
*                   167.142.3.6                        0 5056 701 80 i
*                   157.22.9.7                         0 715 1 701 80 i
*                   195.219.96.239                     0 8297 6453 701 80 i
*                   195.211.29.254                     0 5409 6667 6427 3356 701 80 i
*>                  12.127.0.249                       0 7018 701 80 i
*                   213.200.87.254     929             0 3257 701 80 i
*  9.184.112.0/20   205.215.45.50                      0 4006 6461 3786 i
*                   195.66.225.254                     0 5459 6461 3786 i
*>                  203.62.248.4                       0 1221 3786 i
*                   167.142.3.6                        0 5056 6461 6461 3786 i
*                   195.219.96.239                     0 8297 6461 3786 i
*                   195.211.29.254                     0 5409 6461 3786 i
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T
AS 3786 is DACOM (Korea), AS 1221 is Telstra
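Reading the origin AS off such a table can be sketched as follows. This is an illustration under assumptions: the path strings exclude the weight column, `origin_as` and `origins_by_prefix` are hypothetical helpers, and AS sets in braces are simply skipped.

```python
def origin_as(path):
    """Last AS number in a path string such as '7018 701 80 i'."""
    tokens = [t for t in path.split() if t.isdigit()]
    return int(tokens[-1]) if tokens else None


def origins_by_prefix(routes):
    """Group origin ASes by prefix; more than one origin signals a MOAS prefix."""
    result = {}
    for prefix, path in routes:
        result.setdefault(prefix, set()).add(origin_as(path))
    return result


routes = [
    ("3.0.0.0/8", "4006 701 80 i"),
    ("3.0.0.0/8", "7018 701 80 i"),
    ("9.184.112.0/20", "1221 3786 i"),
    ("9.184.112.0/20", "5056 6461 6461 3786 i"),
]
print(origins_by_prefix(routes))
```

For the routes shown on the slide, every path to a prefix ends in the same AS, so each prefix has a single origin.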
Why Would IP-to-AS Mapping Be Wrong?
• IP addresses of equipment
– Interfaces on the routers, not end hosts
– Identifies equipment in routing protocols
– Doesn’t need to be globally visible or consistent
• Three reasons the mappings may be “wrong”
– Addresses of Internet Exchange Points
– Sibling ASes that share address space
– ASes that don’t announce their addresses
• Look at traceroute path vs. BGP AS path
– Traceroute path after IP-to-AS mapping
– BGP AS path taken from the BGP table
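The comparison can be sketched by collapsing per-hop AS numbers into a traceroute AS path and listing the ASes that the BGP AS path lacks. This is a minimal sketch: `collapse` and `extra_ases` are hypothetical helpers, the per-hop AS numbers are taken from the Berkeley-to-CNN example, and the last two paths are made up.

```python
def collapse(as_hops):
    """Per-hop AS numbers -> AS path: skip unmapped hops (None), merge repeats."""
    path = []
    for asn in as_hops:
        if asn is not None and (not path or path[-1] != asn):
            path.append(asn)
    return path


def extra_ases(traceroute_path, bgp_path):
    """ASes seen by traceroute that do not appear in the BGP AS path."""
    return [asn for asn in traceroute_path if asn not in bgp_path]


hops = [25, 25, 25, 25, 11423, 3356, 3356, 3356, 3356, 1668, 1668, 1668, 5662]
print(collapse(hops))                    # [25, 11423, 3356, 1668, 5662]
print(extra_ases([1, 99, 2], [1, 2]))    # [99]: candidate IXP, sibling, or bad mapping
```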
Extra AS due to Internet eXchange Points
• IXP: shared place where providers meet
– E.g., Mae-East, Mae-West, PAIX
– Large number of fan-in and fan-out ASes
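[Figure: ASes A through G meeting at an IXP, comparing the traceroute AS path with the BGP AS path; the traceroute AS path contains an extra hop for the IXP]
Ignore extra traceroute AS hop with high fan-in and fan-out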
Extra AS due to Sibling ASes
• Sibling: organizations with multiple ASes
– E.g., Sprint AS 1239 and AS 1791
– One AS numbers its equipment with addresses of another
[Figure: ASes A through H, comparing the traceroute AS path with the BGP AS path when sibling ASes share address space]
Merge sibling ASes that “belong together” and treat them as if they were one AS.
Unannounced Infrastructure Addresses
[Figure: ASes A, B, and C, with prefix 12.0.0.0/8; path labels “A C A C” vs. “A C” and “B A C” vs. “B C”]
C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)
Fix the IP-to-AS map to associate 12.1.2.0/24 with C
Refining Initial IP-to-AS Mapping
• Start with initial IP-to-AS mapping
– Mapping from BGP tables is usually correct
– Good starting point for computing the mapping
• Collect many BGP and traceroute paths
– Signaling and forwarding AS path usually match
– Good way to identify mistakes in IP-to-AS map
• Successively refine the IP-to-AS mapping
– Find add/change/delete that makes big difference
– Base these “edits” on operational realities
http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf
http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf
Research Areas
• Better version of traceroute
– Router support for active measurement
– IPPM (IP Performance Measurement)
– http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html
• Peer-to-peer troubleshooting
[Figure: peers answering “Yes” or “No” about reaching www.cnn.com]
Passive Monitoring
Limitations of Active Measurements
• Active measurements: traceroute-like tools
– Can’t probe in the past
– Shows the effect, not the cause
[Figure: topology of AS 1, AS 2, AS 3, and AS 4, with the user (s) and the Web server (d)]
Appealing to Peek Inside
• Passive measurements: public BGP data
[Figure: BGP update feeds → data collection (RouteViews, RIPE) → data correlation → root cause]
Inspect BGP Routing Changes
• Changes in paths to reach destination d
– AS 1: “1 3 4” → “1 2 4”
– AS 2: “2 4” (no change)
– AS 3: “3 4” → “3 1 2 4”
– AS 4: “4” (no change)
[Figure: topology of AS 1, AS 2, AS 3, and AS 4, with the user (s) and the Web server (d)]
Idea #1: ASes in Paths Undergoing Change
• Key assumption
– “The AS responsible for the change appears in the
old and/or the new AS path to the destination.”
• If an AS has a routing change
– All ASes in old and new paths may be responsible
– Call these ASes the “suspect set”
• Combining across vantage points
– Consider all ASes that had a routing change
– Perform the intersection across the suspect sets
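Idea #1 can be sketched with plain set operations. This is a minimal sketch, assuming each vantage point reports one (old path, new path) pair per change; `suspect_set` and `intersect_suspects` are hypothetical names.

```python
def suspect_set(old_path, new_path):
    """All ASes on the old or the new path may be responsible."""
    return set(old_path) | set(new_path)


def intersect_suspects(changes):
    """Intersect suspect sets across vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes]
    return set.intersection(*sets) if sets else set()


# Changes seen by AS 1 and AS 3 for destination d (AS 2 and AS 4 saw none).
changes = [([1, 3, 4], [1, 2, 4]), ([3, 4], [3, 1, 2, 4])]
print(intersect_suspects(changes))   # {1, 2, 3, 4}
```

For this example the intersection still contains every AS, so the assumption alone does not narrow down the cause.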
Idea #2: Excluding ASes in Non-Changing Paths
• Key assumption
– “If an AS has no routing change, the ASes in the
path are not responsible and can be excluded.”
• Example
– AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}
– AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}
– AS 3: “3 4” (no change): non-suspects {3, 4}
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
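The example can be worked through in code: intersect the suspect sets of the changed paths, then remove every AS that appears on an unchanged path. This is a minimal sketch under the slide's assumption; `root_cause_candidates` is a hypothetical name.

```python
def root_cause_candidates(changed, unchanged):
    """changed: list of (old_path, new_path); unchanged: list of stable paths."""
    suspects = None
    for old, new in changed:
        s = set(old) | set(new)
        suspects = s if suspects is None else suspects & s
    suspects = suspects or set()
    for path in unchanged:
        suspects -= set(path)      # ASes on stable paths are excluded
    return suspects


changed = [([1, 2, 4], [1, 2, 3, 4]), ([2, 4], [2, 3, 4])]
unchanged = [[3, 4]]
print(root_cause_candidates(changed, unchanged))   # {2}
```

Here {1, 2, 3, 4} ∩ {2, 3, 4} = {2, 3, 4}, and removing the non-suspects {3, 4} leaves {2}.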
Idea #3: Blaming the ASes in the Better Path
• Key assumption
– “The better path is the one that contains the AS
responsible for the change.”
• Example
– “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3)
• Heuristics for identifying the “better” path
– E.g., the shorter AS path
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
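With the shorter-AS-path heuristic, Idea #3 reduces to a few lines. This is a sketch only: `better_path_suspects` is a hypothetical name, and breaking ties in favor of the old path is an arbitrary choice made for the illustration.

```python
def better_path_suspects(old_path, new_path):
    """Blame the ASes on the better path, taking 'better' to mean shorter."""
    better = old_path if len(old_path) <= len(new_path) else new_path
    return set(better)


print(better_path_suspects([1, 2, 4], [1, 2, 3, 4]))   # {1, 2, 4}
```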
Idea #4: Combining Across Destinations
• Key assumption
– “All destinations experiencing routing changes in a
short period of time have a common cause.”
• Exploiting the observation
– Form suspect sets for each destination
– Perform intersections of the sets across the
destinations
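Combining across destinations can be sketched as an intersection plus a per-AS count, which is useful when the common-cause assumption fails and the intersection comes out empty. The destination names, suspect sets, and the function name `combine_destinations` are hypothetical.

```python
from collections import Counter


def combine_destinations(suspects_by_destination):
    """Return (ASes suspected for every destination, per-AS suspicion counts)."""
    counts = Counter()
    for suspects in suspects_by_destination.values():
        counts.update(suspects)
    total = len(suspects_by_destination)
    common = {asn for asn, c in counts.items() if c == total}
    return common, counts


# Hypothetical suspect sets for three destinations that changed together.
suspects = {"d1": {1, 2, 4}, "d2": {2, 3, 4}, "d3": {2, 5}}
common, counts = combine_destinations(suspects)
print(common)        # {2}
print(counts[4])     # 2
```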
Difficulties With Root-Cause Analysis
• Misleading BGP routing changes
– Responsible AS not on old or new path
– Looking across destinations doesn’t resolve
• Missing routing changes
– Some routers in an AS don’t have a change
– Some subnets are not visible in BGP
– Some internal changes are not visible in BGP
Misleading BGP Changes
Myth: The AS responsible for the change appears in the old or the new AS path.
[Figure: topology of ASes 1 through 11 with a BGP data collection point; old path: 1,2,8,9,10; new path: 1,4,5,6,7,10]
Misleading BGP Changes
Myth: Looking at routing changes across prefixes resolves causes.
[Figure: AS 1, AS 2, and AS 3 with destinations d1, d2, and d3, nodes A, B, and C, labels 7, 10, and 12, and a BGP data collection point; changes for d2, but not for d1 and d3]
Missing Routing Changes
Myth: The BGP updates from a single router accurately represent the AS
[Figure: AS 1 and AS 2 with destination dst, nodes A, B, C, and D, labels 7, 6, 12, and 10, and a BGP data collection point that sees no change]
Missing Routing Changes
Myth: BGP data from a router accurately represents changes on that router.
[Figure: router A with prefixes 12.1.1.0/24 and 12.1.0.0/16 and a BGP data collection point]
Missing Routing Changes
Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.
[Figure: AS 1 and AS 2 with destination dst, nodes A, B, C, and D, labels 5, 6, 12, 10, and 7, and a BGP data collection point]
Hybrid of Active and Passive Monitoring
[Figure: AS 1 through AS 4, each with its own “Omni” server (Omni 1 through Omni 4), the user (s), and the Web server (d); queries (i,s,d,t) and (j,s,d,t’) both return “failure link (3,4)”]
Research Questions
• Understanding if root-cause analysis can work
– How many vantage points are needed?
– Do the assumptions usually hold?
– Can algorithms tolerate occasional violations?
– Can some additional information help?
• Distributed algorithms for root-cause analysis
– Can ASes cooperate in distributed fashion?
– How to prevent or detect ASes that cheat?
– Do all ASes have to participate?
– Other hybrids of active and passive monitoring?
Conclusions
• Troubleshooting is important
– Detect, diagnose, and fix problems
– Accountability and service-level agreements
• Troubleshooting is hard
– Active measurement (e.g., traceroute) not enough
– Root-cause analysis techniques are not enough
• New innovation necessary
– Hybrid active/passive approaches
– Router support for active measurement
– Routing protocol extensions for troubleshooting
For Next Time: From Inside an AS
• Two papers
– “OSPF monitoring: Architecture, design, and
deployment experience”
– “Finding a needle in a haystack: Pinpointing
significant BGP routing changes in an IP network”
• Optional reading
– Materials from Packet Design and Ipsum Networks
• Review only of first paper
– Summary
– Why accept
– Why reject
– Future work