Internet Routing (COS 598A)
Today: Detecting Anomalies Inside an AS
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm
Outline
• Traffic
– SNMP link statistics
– Packet and flow monitoring
• Network topology
– IP routers and links
– Fault data, layer-2 topology, and configuration
– Intradomain route monitoring
• Interdomain routes
– BGP route monitoring
– Analysis of BGP update data
• Conclusions
Why is Traffic Measurement Important?
• Billing the customer
– Measure usage on links to/from customers
– Apply a billing model to generate the bill
• Traffic engineering and capacity planning
– Measure the traffic matrix (i.e., offered load)
– Tune routing protocol or add new capacity
• Denial-of-service attack detection
– Identify anomalies in the traffic
– Configure routers to block the offending traffic
• Analyze application-level issues
– Evaluate benefits of deploying a Web caching proxy
– Quantify fraction of traffic that is P2P file sharing
Collecting Traffic Data: SNMP
• Simple Network Management Protocol
– Standard Management Information Base (MIB)
– Protocol for querying the MIBs
• Advantage: ubiquitous
– Supported on all networking equipment
– Multiple products for polling and analyzing data
• Disadvantages: dumb
– Coarse granularity of the measurement data
• E.g., number of bytes/packets per interface per 5 minutes
– Cannot express complex queries on the data
– Unreliable delivery of the data using UDP
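For example, two successive polls of an interface byte counter can be turned into an average link utilization. A minimal sketch in Python (the 5-minute interval, link capacity, and counter values are illustrative; a real poller would read ifInOctets/ifHCInOctets over SNMP):

    # Estimate average link utilization from two SNMP polls of a byte counter.
    POLL_INTERVAL_SECS = 300            # typical 5-minute polling granularity
    LINK_CAPACITY_BPS = 1_000_000_000   # assume a 1 Gbps link

    def utilization(bytes_prev, bytes_curr, interval=POLL_INTERVAL_SECS,
                    capacity=LINK_CAPACITY_BPS):
        """Average utilization over the poll interval, in [0, 1]."""
        delta = bytes_curr - bytes_prev
        if delta < 0:                   # 32-bit counter wrapped around
            delta += 2**32
        return (delta * 8) / (interval * capacity)

    print(utilization(10_000_000_000, 10_090_000_000))  # 0.0024, i.e., 0.24%

Note that a 5-minute average of this kind hides any bursts inside the interval, which is exactly the coarse-granularity limitation above.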
Collecting Traffic Data: Packet Monitoring
• Packet monitoring
– Passively collecting IP packets on a link
– Recording IP, TCP/UDP, or application-layer traces
• Advantages: details
– Fine-grain timing information
• E.g., can analyze the burstiness of the traffic
– Fine-grain packet contents
• Addresses, port numbers, TCP flags, URLs, etc.
• Disadvantages: overhead
– Hard to keep up with high-speed links
– Often requires a separate monitoring device
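To make the timing advantage concrete, here is a minimal sketch of a burstiness analysis over packet timestamps (the timestamps are made up; a real study would read them from a captured trace):

    # Compute packet inter-arrival gaps; a high coefficient of variation
    # (stdev/mean > 1) is a simple indicator of bursty traffic.
    import statistics

    timestamps = [0.000, 0.001, 0.002, 0.480, 0.481, 0.950]  # seconds

    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    cv = statistics.stdev(gaps) / mean

    print(f"mean gap {mean * 1000:.0f} ms, coefficient of variation {cv:.2f}")

This kind of analysis needs per-packet timestamps, which SNMP counters and flow records cannot provide.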
Collecting Traffic Data: Flow Statistics
• Flow monitoring (e.g., Cisco Netflow)
– Statistics about groups of related packets (e.g.,
same IP/TCP headers and close in time)
– Recording header information, counts, and time
• Advantages: detail with less overhead
– Almost as good as packet monitoring, except no
fine-grain timing information or packet contents
– Often implemented directly on the interface card
• Disadvantages: trade-off detail and overhead
– Less detail than packet monitoring
– Less ubiquitous than SNMP statistics
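A minimal sketch of the aggregation a Netflow-style monitor performs, folding packets that share a 5-tuple and arrive close in time into a single flow record (the 60-second inactivity timeout and the call interface are assumptions for illustration):

    # Aggregate packets into flow records keyed by the 5-tuple,
    # expiring a flow after a period of inactivity.
    INACTIVE_TIMEOUT = 60.0   # seconds; illustrative value

    flows = {}     # 5-tuple -> [first_ts, last_ts, packet_count, byte_count]
    expired = []   # completed flow records, ready for export

    def observe(ts, src, dst, proto, sport, dport, length):
        key = (src, dst, proto, sport, dport)
        rec = flows.get(key)
        if rec and ts - rec[1] > INACTIVE_TIMEOUT:
            expired.append((key, rec))     # flow went idle: export it
            rec = None
        if rec is None:
            flows[key] = [ts, ts, 1, length]
        else:
            rec[1] = ts                    # extend the flow
            rec[2] += 1
            rec[3] += length

    observe(0.0, "10.0.0.1", "192.0.2.9", 6, 4321, 80, 1500)
    observe(0.2, "10.0.0.1", "192.0.2.9", 6, 4321, 80, 1500)
    print(flows)   # one flow record: 2 packets, 3000 bytes

The per-flow record keeps addresses, ports, counts, and start/end times, but not per-packet timing, matching the trade-off above.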
Using the Traffic Data in Network Operations
• SNMP byte/packet counts: everywhere
– Tracking link utilizations and detecting anomalies
– Generating bills for traffic on customer links
– Inference of the offered load (i.e., traffic matrix)
• Packet monitoring: selected locations
– Analyzing the small time-scale behavior of traffic
– Troubleshooting specific problems on demand
• Flow monitoring: selective, e.g., at the network edge
– Tracking the application mix
– Direct computation of the traffic matrix
– Input to denial-of-service attack detection
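As an illustration of the direct computation, a traffic matrix can be built by summing flow bytes per (ingress, egress) pair. A toy sketch (the router names and records are hypothetical, and the egress point, which in practice is derived from the destination prefix and the routing data, is given directly for brevity):

    # Direct traffic-matrix computation from edge flow records.
    from collections import defaultdict

    records = [("NYC", "SF", 4000), ("NYC", "CHI", 1000), ("DC", "SF", 2500)]

    matrix = defaultdict(int)
    for ingress, egress, nbytes in records:
        matrix[(ingress, egress)] += nbytes

    for (i, e), v in sorted(matrix.items()):
        print(f"{i} -> {e}: {v} bytes")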
Network Topology
IP Topology
• Topology information
– Routers
– Links, and their capacities
• Internal links inside the AS
• Edge links connecting to neighboring domains
• Ways to learn the topology
– Inventory database
– SNMP polling/traps
– Traceroute
– Route monitoring
– Router configuration data
Below IP
• Layer-2 paths
– ATM virtual circuits
– Frame Relay virtual circuits
• Mapping to lower layers
– Specific fibers
– Shared optical amplifiers
– Shared conduits
– Physical length (propagation delay)
• Information not visible to IP
– Stored in an inventory database
– Not necessarily generated/updated automatically
Intradomain Monitoring: OSPF Protocol
• Link-state protocol
– Routers flood Link State Advertisements (LSAs)
– Routers compute shortest paths based on weights
– Routers identify next-hop to reach other routers
[Figure: example network topology annotated with link weights (1–5); each router floods LSAs and computes shortest paths over these weights]
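As a concrete illustration of the computation each router performs over the flooded link-state database, here is a minimal Dijkstra sketch (the toy topology and weights are made up):

    # Dijkstra's shortest-path computation over OSPF-style link weights.
    import heapq

    graph = {   # hypothetical topology: router -> {neighbor: weight}
        "A": {"B": 2, "C": 3},
        "B": {"A": 2, "C": 1, "D": 5},
        "C": {"A": 3, "B": 1, "D": 2},
        "D": {"B": 5, "C": 2},
    }

    def shortest_paths(source):
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue               # stale heap entry
            for v, w in graph[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    print(shortest_paths("A"))   # {'A': 0, 'B': 2, 'C': 3, 'D': 5}

The next hop toward each destination falls out of the same computation by remembering the first edge on each shortest path.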
Intradomain Route Monitoring
• Construct continuous view of topology
– Detect when equipment goes up or down
– Input to traffic-engineering and planning tools
• Detect routing anomalies
– Identify failures, LSA storms, and route flaps
– Verify that LSA load matches expectations
– Flag strange weight settings as misconfigurations
• Analyze convergence delay
– Monitor LSAs at multiple locations
– Compare the times when LSAs arrive
• Detect router implementation mistakes
Passive Collection of LSAs
• OSPF is a flooding protocol
– Every LSA sent on every participating link
– Very helpful for simplifying the monitor
• Can participate in the protocol
– Shared media (e.g., Ethernet)
• Join multicast group and listen to LSAs
– Point-to-point links
• Establish an adjacency with a router
• … or passively monitor packets on a link
– Tap a link and capture the OSPF packets
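On a shared medium, listening is straightforward: join the AllSPFRouters multicast group (224.0.0.5) and read OSPF packets, which run directly over IP as protocol 89. A minimal sketch, assuming a Linux host with root privileges and an interface on the OSPF subnet:

    # Passively receive OSPF packets by joining the AllSPFRouters group.
    import socket
    import struct

    OSPF_PROTO = 89                 # OSPF runs directly over IP
    ALL_SPF_ROUTERS = "224.0.0.5"

    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, OSPF_PROTO)
    mreq = struct.pack("4s4s", socket.inet_aton(ALL_SPF_ROUTERS),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        pkt, (src, _) = sock.recvfrom(65535)
        ihl = (pkt[0] & 0x0F) * 4   # raw IPv4 sockets deliver the IP header too
        msg_type = pkt[ihl + 1]     # OSPF packet type
        if msg_type == 4:           # type 4 = Link State Update, carries LSAs
            router_id = socket.inet_ntoa(pkt[ihl + 4:ihl + 8])
            print(f"LS Update from {src} (router ID {router_id})")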
Reducing the Volume of Information
• Prioritizing the messages
– Router failure over router recovery
– Link failure or weight change over a refresh
– Informational messages about weight settings
• Grouping related messages
– Link failure: group messages for the two ends
– Router failure: group the affected links
– Common failure: group links failing close in time
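A minimal sketch of the time-based part of this grouping, collapsing failure messages that arrive within a short window into one report (the 5-second window and message format are assumptions for illustration):

    # Group link-failure messages that arrive close in time, so one
    # underlying failure (e.g., a router going down) yields one report.
    GROUP_WINDOW = 5.0   # seconds; illustrative

    def group_failures(messages):
        """messages: time-sorted (timestamp, (router_a, router_b)) pairs."""
        groups = []
        for ts, link in messages:
            if groups and ts - groups[-1][-1][0] <= GROUP_WINDOW:
                groups[-1].append((ts, link))
            else:
                groups.append([(ts, link)])
        return groups

    msgs = [(0.0, ("A", "B")), (0.1, ("B", "A")),   # both ends of one link
            (0.3, ("B", "C")),                      # same router: B failing?
            (60.0, ("D", "E"))]                     # unrelated later failure
    print([len(g) for g in group_failures(msgs)])   # [3, 1]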
Anomalies Found in the Shaikh04 paper
• Intermittent hardware problem
– Router periodically losing OSPF adjacencies
– Risk of network partition if 2nd failure occurred
• External link flaps
– Congestion on edge link causing lost messages
– Lost adjacency leading to flapping routes
• Configuration errors
– Two routers assigned the same IP address
– Inefficient config leading to duplicate LSAs
• Vendor implementation bug
– More frequent refreshing of LSAs than specified
Interdomain Route Monitoring
Motivation for BGP Monitoring
• Visibility into external destinations
– What neighboring ASes are telling you
– How you are reaching external destinations
• Detecting anomalies
– Increases in number of destination prefixes
– Lost reachability to some destinations
– Route hijacking
– Instability of the routes
• Input to traffic-engineering tools
– Knowing the current routes in the network
• Workload for testing routers
– Realistic message traces to play back to routers
BGP Monitoring: A Wish List
• Ideally: knowing what the router knows
– All externally-learned routes
– Before policy has modified the attributes
– Before a single best route is picked
• How to achieve this
– Special monitoring session on routers that tells
everything they have learned
– Packet monitoring on all links with BGP sessions
• If you can’t do that, you could always do…
– Periodic dumps of routing tables
– BGP session to learn best route from router
Using Routers to Monitor BGP
• Talk to operational routers using SNMP or telnet at the command line
– (-) BGP table dumps are expensive
– (+) Table dumps show all alternate routes
– (-) Update dynamics lost
– (-) Restricted to interfaces provided by vendors
• Establish a “passive” BGP session (eBGP or iBGP) from a workstation running BGP software
– (+) BGP table dumps do not burden operational routers
– (-) Receives only best routes from the BGP neighbor
– (+) Update dynamics captured
– (+) Not restricted to interfaces provided by vendors
Collect BGP Data From Many Routers
[Figure: U.S. backbone map (Seattle, Chicago, New York, San Francisco, Los Angeles, Dallas, Atlanta, Washington D.C., and other cities) with a route monitor collecting BGP data from routers across the network]
• BGP is not a flooding protocol
– The monitor needs sessions to many routers to see their routing decisions
Detecting Important Routing Changes
• Large volume of BGP updates messages
– Around 2 million/day, and very bursty
– Too much for an operator to manage
• Identify important anomalies
– Lost reachability
– Persistent flapping
– Large traffic shifts
• Not the same as root-cause analysis
– Identify changes and their effects
– Focus on mitigation, rather than diagnosis
– Diagnose causes if they occur in/near the AS
Challenge #1: Excess Update Messages
• A single routing change
– Leads to multiple update messages
– Affects routing decision at multiple routers
[Figure: BGP updates from many border routers feed a BGP update grouping step, which outputs events and a list of persistently flapping prefixes]
Group updates for a prefix with inter-arrival < 70 seconds into one event, and flag prefixes with changes lasting > 10 minutes as persistent flapping (a sketch of this rule follows below).
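A minimal sketch of the grouping rule, assuming a time-sorted stream of (timestamp, prefix) updates:

    # Group per-prefix BGP updates into events (inter-arrival < 70 s);
    # flag prefixes with events lasting > 600 s as persistent flapping.
    EVENT_TIMEOUT = 70.0     # seconds of quiet that ends an event
    FLAP_THRESHOLD = 600.0   # events longer than 10 minutes are flapping

    def group_updates(updates):
        """updates: time-sorted (timestamp, prefix) pairs."""
        events, open_events = {}, {}
        for ts, prefix in updates:
            span = open_events.get(prefix)
            if span and ts - span[1] >= EVENT_TIMEOUT:
                events.setdefault(prefix, []).append(tuple(span))
                span = None                     # quiet period: close event
            if span is None:
                open_events[prefix] = [ts, ts]  # start a new event
            else:
                span[1] = ts                    # extend the current event
        for prefix, span in open_events.items():
            events.setdefault(prefix, []).append(tuple(span))
        flapping = {p for p, evs in events.items()
                    for start, end in evs if end - start > FLAP_THRESHOLD}
        return events, flapping

    # A prefix updating every 60 s for 11 minutes: one long, flapping event.
    _, flap = group_updates([(i * 60.0, "192.0.2.0/24") for i in range(12)])
    print(flap)   # {'192.0.2.0/24'}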
Determine “Event Timeout”
[Figure: cumulative distribution of BGP update inter-arrival times, with BGP beacon updates for comparison; 98% of inter-arrival times fall below 70 seconds, motivating the 70-second event timeout]
Event Duration: Persistent Flapping
[Figure: complementary cumulative distribution of event duration; roughly 0.1% of events last longer than 600 seconds, and these long events are the persistent flapping]
Detecting Persistent Flapping
• Significant persistent flapping
– 15.2% of all BGP update messages
– … though a small number of destination prefixes
– Surprising, especially since flap damping is used
• Types of persistent flapping
– Conservative flap-damping parameters (78.6%)
– Protocol oscillations, e.g., MED oscillation (18.3%)
– Unstable interface or BGP session (3.0%)
Example: Unstable eBGP Session
[Figure: AT&T border routers with an unstable eBGP session to a peer; traffic to customer prefix p keeps shifting as the session flaps]
• Flap-damping parameters are session-based
• Damping not implemented for iBGP sessions
Challenge #2: Identify Important Events
• Major concerns of network operators
– Changes in reachability
– Heavy load of routing messages on the routers
– Flow of the traffic through the network
[Figure: events feed an event classification step that outputs “typed” events]
Classify each event by the type of impact it has on the network: no disruption, internal disruption, single external disruption, multiple external disruption, or loss/gain of reachability.
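A minimal sketch of such a classifier, assuming per-event summaries of reachability and traffic shifts (the input fields are illustrative, not the paper's exact representation):

    # Classify a routing event by the type of impact it has on the network.
    def classify_event(reachability_changed, external_shifts, internal_shifts):
        """external_shifts: border routers whose exit point changed;
        internal_shifts: routers whose internal path changed."""
        if reachability_changed:
            return "loss/gain of reachability"
        if len(external_shifts) > 1:
            return "multiple external disruption"
        if len(external_shifts) == 1:
            return "single external disruption"
        if internal_shifts:
            return "internal disruption"
        return "no disruption"

    print(classify_event(False, {"router_E"}, set()))  # single external disruption

The event categories on the next slides illustrate what these traffic shifts look like.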
Event Category – “No Disruption”
[Figure: AS1 and AS2 advertise prefix p to AT&T; after the event, every border router keeps its exit point, so there is no traffic shift]
“No Disruption”: each of the border routers has no traffic shift
Event Category – “Internal Disruption”
[Figure: a change inside AT&T moves traffic for prefix p between internal paths, while every border router keeps its exit point]
“Internal Disruption”: all of the traffic shifts are internal traffic shifts
Event Category – “Single External Disruption”
[Figure: traffic for prefix p stops using one AT&T exit point and shifts to other exit points]
“Single External Disruption”: traffic at one exit point shifts to other exit points
Statistics on Event Classification
Category                        Events    Updates
No Disruption                    50.3%     48.6%
Internal Disruption              15.6%      3.4%
Single External Disruption       20.7%      7.9%
Multiple External Disruption      7.4%     18.2%
Loss/Gain of Reachability         6.0%     21.9%
Challenge #3: Multiple Destinations
• A single routing change
– Affects multiple destination prefixes
[Figure: “typed” events feed an event correlation step that outputs clusters]
Group events of the same type that occur close in time into clusters (a sketch follows below).
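A minimal sketch of this correlation step, reusing the time-window idea (the 10-second window is an illustrative assumption):

    # Correlate typed events into clusters: same type, close in time.
    from collections import defaultdict

    CLUSTER_WINDOW = 10.0   # seconds; illustrative

    def cluster_events(events):
        """events: time-sorted (timestamp, event_type, prefix) triples."""
        clusters = defaultdict(list)   # event_type -> list of clusters
        for ts, etype, prefix in events:
            buckets = clusters[etype]
            if buckets and ts - buckets[-1][-1][0] <= CLUSTER_WINDOW:
                buckets[-1].append((ts, prefix))
            else:
                buckets.append([(ts, prefix)])
        return clusters

    evts = [(0.0, "single external disruption", "p1"),
            (2.0, "single external disruption", "p2"),   # joins p1's cluster
            (300.0, "internal disruption", "p3")]
    print({t: len(c) for t, c in cluster_events(evts).items()})
    # one cluster of each type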
Main Causes of Large Clusters
• External BGP session resets
– Failure/recovery of external BGP session
– E.g., session to another large tier-1 ISP
– Caused “single external disruption” events
– Validated by looking at syslog reports on routers
• Hot-potato routing changes
– Failure/recovery of an intradomain link
– E.g., leads to changes in IGP path costs
– Caused “internal disruption” events
– Validated by looking at OSPF measurements
Challenge #4: Popularity of Destinations
• Impact of event on traffic
– Depends on the popularity of the destinations
[Figure: clusters, weighted by Netflow traffic data from the border routers, feed a traffic impact prediction step that flags large disruptions]
Weight the group of destinations by the traffic volume.
Traffic Impact Prediction
• Traffic weight
– Per-prefix measurements from Netflow
– 10% of prefixes account for 90% of the traffic
• Traffic weight of a cluster
– The sum of “traffic weight” of the prefixes
• Flag clusters with heavy traffic
– A few large clusters have large traffic weight
– Mostly session resets and hot-potato changes
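A minimal sketch of the weighting step: sum the per-prefix traffic shares over a cluster's prefixes and flag the cluster when the total is large (the share values and the 5% threshold are illustrative):

    # Flag clusters whose destination prefixes carry a large traffic share.
    # Per-prefix shares would come from Netflow; these values are made up.
    traffic_share = {"192.0.2.0/24": 0.04, "198.51.100.0/24": 0.03,
                     "203.0.113.0/24": 0.0001}

    HEAVY_THRESHOLD = 0.05   # flag clusters carrying > 5% of total traffic

    def cluster_weight(prefixes):
        return sum(traffic_share.get(p, 0.0) for p in prefixes)

    cluster = ["192.0.2.0/24", "198.51.100.0/24"]
    weight = cluster_weight(cluster)
    if weight > HEAVY_THRESHOLD:
        print(f"large disruption: cluster carries {weight:.0%} of traffic")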
Conclusions
• Network troubleshooting from the inside
– Traffic, topology, and routing data
– Easier to understand what’s going on
– … though still challenging to collect/analyze data
• Traffic measurement
– SNMP, packet monitoring, and flow monitoring
• Routing monitors
– Track network state and identify anomalies
– Intradomain monitor capturing LSAs
– BGP monitor capturing BGP updates
Next Time: BGP Routing Table Size
• Three papers
– “On characterizing BGP routing table growth”
– “An empirical study of router response to large
BGP routing table load”
– “A framework for interdomain route aggregation”
• Review only of the first paper
– Summary
– Why accept
– Why reject
– Avenues for future work
• Optional
– Vannevar Bush, “As We May Think” (1945)