fast rerouting - CSE Labs User Home Pages

Download Report

Transcript fast rerouting - CSE Labs User Home Pages

Network Failures and Their Impacts;
Fast IGP Routing Convergence and IP
Fast Re-Routing (IP FRR)
• Network Failures and Intra-Domain Routing
(IGP)
– ISP Link Failure Studies
– Failure Characteristics, Causes and Impacts
• IGP Fast Routing Convergence
•Speed up routing convergence after routing
changes
• IP Fast Re-Routing (IP FRR)
– Fast Rerouting Schemes: Failure Insensitive Routing
– Other Schemes
Readings: Please do the required readings
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
1
Why Network Fails
• Many, many possible reasons and causes
• Human Errors
– misconfigurations
– other mistakes: e.g, let’s see what that red button does
• Software Bugs
– buggy implementation, incompatibility, …
• Hardware failures
– flaky interfaces, link errors, fiber cuts, router crashes
due to CPU overload or running of memory, ….
• Malicious attacks
• Network Overload
– traffic surges causing network congestion, …
• Others: e.g., natural disasters, major accidents
– E.g., Baltimore tunnel fire, Ohio train accident, …
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
2
Understanding Network Link
Failure Characteristics
• Failure Characteristics Within an ISP network
–
–
–
–
How often do links/routers failure?
How many? Are they random, correlated?
How long do they last?
…
• What about inter-domain or Internet wide?
– What causes BGP to update/withdraw routes?
– Destination network down? AS internal failures? BGP session
resets? Policy changes? …
• How do we measure, detect and analyze network failures?
• How do we trouble-shoot network failures and perform rootcase analysis?
• How do we design more robust and resilient mechanisms?
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
3
This Lecture: Focusing on Impact
of Failures within an ISP Network
• With IP networks becoming the dominant and
“converged” information delivery substrate, displacing
telephone networks, and eventually cable TV?
– Need to better “service availability”
– Telephone networks: service availability metrics: 5 9’s:
i.e., 99.999%
– What about IP networks?
• Effect of IP network failures:
routers lose “reachability”: i.e., no forward entries
– or existence of transient/permanent forwarding loops
–
• What are impacts of network failures?
– In particular, on VoIP services
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
4
Failures Affect Link Loads
• Many ISP networks are “over-provisioned” so as to
handle network failures:
•
– Many claim: normal load utilization < 50%
But still high variability in link utilization:
– Can find a link w/ load > 50% every 15 minutes; > 90% every 8
days
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
5
Traffic Potholes or Blackholes
Sprint Measurement Study Anecdote:
•
Average delay over 5
sec intervals
•
Traffic was blackholed
for more than 10
minutes
•
It took about 40
minutes for the
network to reach a
stable state
• Root Cause:
Route Misconfigurations!
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
6
Routing Loops under Failures
• Loops due to link failures/new route advertisement
• Measurements from 3 backbone links
– 25% packets caught in a loop in one failure instance
– 1% lost due to expire TTL; those that escape have long delays
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
7
Sprint Link Failure Study
• Link “failures” occur fairly frequently, well spread
over time
– Inter-POP links are more stable than intra-PoP links
– Many intra-PoP link “failures” due to planned events, less
impact on traffic due to “full-mesh” intra-PoP topology
• Most link failures tend to be transient
– Excluding “planned” failures
– Most are single link failures
– Some are correlated link failures
• Link failure characteristics vary depend on links
– Depending causes of failures, e.g., flaky interfaces, router
overloads, fiber cuts, etc.,
• Impact of link failures
– OC48 link down for 6 seconds: 3 million packets may be lost!
– significant impact on applications such as VoIP, on-line gamiing
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
8
Methodology: Integrated Monitoring
Sprint Measurement Study [I+02,M+04]
• Tier-1 ISP backbone (600+ nodes)
• Passive route listener software to collect IS-IS & BGP
updates
• IPMON passive traffic monitoring & active probes
• SONET alarm logs; router configurations and BGP policies
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
9
IGP Failure Events
• IP link: adjacency between two IS-IS routers
• Link Failure: loss of this adjacency
• Results shown in the following slides only include
– US inter-PoP links (OC48)
– Failures less than 24 hrs long
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
10
Sprint Study: Link Failure
Frequencies
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
11
Sprint Study: Duration of Failures
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
12
Sprint Study: Failures across Links
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
13
Scatter Plot of US Failure Events
• Apr. – Nov. 2002
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
14
Maintenance (or Planned Failures)
• Weekly schedule (Mondays 5am – 2pm UC): 20% of failures
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
15
Examples of Planned Failures
• Upgrades
– Changing link to higher capacity
– Loading new operating system on a router
– Swapping out an old interface card
• Maintenance
– Fixing a flaky optical amplifier
– Configuration changes that require a reboot
– Responsible for 50% of intradomain failures
• Cable intrusions
– Construction activities near a fiber
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
16
Failure Classification
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
17
Anomalies Found in Shaikh04 paper
• Intermittent hardware problem
– Router periodically losing OSPF adjacencies
– Risk of network partition if 2nd failure occurred
• External link flaps
– Congestion on edge link causing lost messages
– Lost adjacency leading to flapping routes
• Configuration errors
– Two routers assigned the same IP address
– Inefficient config leading to duplicate LSAs
• Vendor implementation bug
– More frequent refreshing of LSAs than specified
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
18
Converging After a Failure
• Failure detection
– Router recognizes an incident link has failed
• Failure notification
Routing convergence
– Router informs other routers about the change
• Path re-computation
– Routers compute new paths avoiding the link
• Forwarding-table update
Forwarding
convergence
– Routers update their forwarding tables
– Data traffic starts to flow over the new path
• AT&T, Sprint studies show
– convergence time 100s milliseconds up to a few seconds
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
19
All Together: Looking Inside Router
LSA Processing
Route Processor (CPU)
OSPF Process
LSA Flooding
Topology
View
SPF Calculation
SPF Calculation
FIB Update
FIB
LSA
LS Ack
Forwarding
Forwarding
Data packet
Interface card
CSci5221:
LSA
Switching
Fabric
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
Data packet
Interface card
20
Bad Things Happen During
Convergence
• Transient inconsistencies
– Creating “transient forwarding loops” due to
– Routers have different views of the network
– Forwarding decisions may be inconsistent
• Effects on data traffic
–
–
–
–
Black-hole: packet loss
Loops: packets going in circles
Delay: packets going on very long paths
Out-of-order: new packets arrive before old ones
• Want to minimize convergence delay
– … and especially the effects on the data traffic
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
21
Example: Transient Forwarding
Loop (or Micro-loop)
• Set of routers disagree
– One router acting on old information
– Another router acting on new information
Loop!
s
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
d
22
Reducing Impact of Link Failures
Assuming Traditional Link-State Protocols
• Improving convergence time of control/data plane
– Reducing timer value for HELO messages
– Can achieve sub-second convergence time
• 200 msecs common target, threshold for VoIP quality, do-able!
– However,
• Still react to failure events, can’t prevent packet loops or losses
during convergence
• may amplify effect of short “transient” failures that last subseconds
• Prevent “micro loops” during transient routing
convergence periods
– One solution: using “ordered FIB updates”
– requires coordination among routers, adds complexity,
delays convergence time
• Dealing with “Planned Failures” ?
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
23
Reducing Impact of Link Failures
Using MPLS
• Can pre-compute back-up paths
•
– Often done using the “link protection” scheme
– For each link, there is a MPLS protection (back-up) path
But
– Need to change “forwarding plane” of routers
– Many networks don’t have MPLS deployed
Question:
Can we perform fast rerouting using “traditional”
link state routing protocols without resort to
MPLS?
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
24
Fast Re-Routing using Link State
Protocols [Nelakuditi et al]
• Motivations
– Most common link failures are transient single-link failures
– Hastily react to such failures by LSA flooding may do more harm
than good, causing network instability!
– Suppress such failures unless it lasts longer than a threshold
– But we want to be able to re-route affected packets along a
back-up path, not simply dropping them !
• FIFR (failure Insensitive Fast Re-routing): nearly 100%
forwarding continuity
– prepare for (instead of react to) failures
– adapt to changes while ensuring stability
– Other Advantages:
• no change to forwarding plane
• minimal change to routing plane
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
25
What is Interface Specific
Forwarding?
• Interface-independent forwarding
– destination  next-hop
– Each line card has a copy of the same FIB
• Interface-specific forwarding
– <incoming interface, destination>  next-hop
– Different forwarding entries at each line card
• Forwarding operation remains the same
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
26
ISF Enables Local Rerouting
• Infer failures based on interface and destination
– Find the farthest keylink whose failure would cause a
packet to arrive at the unusual interface along the
reverse shortest path to the destination
• Precompute interface-specific forwarding tables
– Avoid the keylink in choosing next hop for a destination
• Failure Inferencing based Fast Rerouting
– IP fast reroute without explicit routing/tunneling
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
27
Illustration: No Failure Scenario
F
CSci5221:
B
B
C
C
D
D
E
B
F
B
A
A
C
AE
D
A
E
E
F
E
F
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
28
Illustration: Local Rerouting without ISF
F
A
A
C
A
D
A
E
A
F
A
new routing table at router B
after detecting the failure
link B – E fails!
CSci5221:
B
B
C
C
D
D
E
B
F
B
F
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
29
Illustration: Local Rerouting with ISF
CSci5221:
B
B
C
C
D
D
E
B
F
B
B
-
A
A
C
C
C
A
D
D
D
A
E
C
E
A
F
D
F
A
F
F
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
30
ISF Table Computation
•
Infer failed links from packet’s arrival at an interface
– K dj i keylink whose failure causes packet to d arrive at i from j
–
A link u -> v is a candidate keylink if
•
•
with u->v, j is a next hop from i to d
without u->v, edge j->i is along the shortest path from u to d
– K dj i is the farthest one from i among candidate keylinks
•
Avoid keylink in choosing the destination’s next hop
– Fd
j i
•
next hops to d from i when packet arrives at i from j
Fjdi  Rid (E \ K dji )
Failure inferencing is not done per packet
–
CSci5221:
ISF table entries computed upon link state updates
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
31
Illustration: ISF Table Computation
CSci5221:
B
-
C
C
D
D
E
C
F
D
B
B
C
C
D
-
E
B
F
B
B
B
C
-
D
D
E
B
F
B
K BC A 
{}
K BD A 
{}
K BE A 
{B-E}
K BF A 
{E-F}
When no more than one link failure is
suppressed in a network with symmetric
weights, FIFR always forwards successfully
to a destination if a path to it exists
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
32
Operations under FIFR
Event
Adjacent nodes
Packet arrival
Interface-specific forwarding
Link down
Initiate local rerouting
Link up before
suppression interval
Resume forwarding on
the recovered link
Link down beyond
suppression interval
Link state update
Recompute interfacespecific forwarding tables
Link up after
suppression interval
Link state update
Recompute interfacespecific forwarding tables
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
Other nodes
33
Handling both Link and Node Failures
• Infer keynodes instead of keylinks
– A node u is a candidate keynode if
• with u, j is a next hop from i to d
• without u, edge j->i is along the shortest path from the upstream
node of u (w.r.t. the path from i to u) to d
– Keynode is the farthest one from i among candidates
• When no route to destination without a node
– Node adjacent to the failure assumes link failure
– Non-adjacent nodes treat it as adjacent node failure
– May cause loops when destination is indeed not reachable
• Protects against non-partitioning single failures
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
34
Networks with Asymmetric Link Weights
• FIFR can handle asymmetric link weights
– By forcing packets to take reverse shortest path
– Provided links are bidirectional
• Keynode computation based on rSPF
– A node u is a candidate keynode if
• with u, j is a next hop from i to d
• without u, edge i->j is along the shortest path from d to the
upstream node of u (w.r.t the path from i to u)
– Keynode is the farthest one from i among candidates
– Works with both symmetric and asymmetric weights
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
35
Networks with Broadcast Links
• FIFR applicable to networks with broadcast links
– A broadcast link is modeled with point to point links from/to
the designated router
• Adjacent failures
– Broadcast link failure treated as that of designated router
• Non-adjacent failures
– Not necessary to know the previous hop of a packet to
compute interface-specific keynode per destination
– Failure inferencing can be done as before
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
36
Summary of FIFR
• Fast reroute under any single failures
– Without changing/encapsulating IP datagram
• May cause loops under multiple failures
– With ISF, guaranteed-protection against single failures
or loop-freedom under multiple failures but not both
– Blacklist-based Interface Specific Forwarding
• Needs interface-specific forwarding
– Two forwarding entries per destination
– O(|E|log2|V|) to compute forwarding entries
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
37
Other Approaches
See the optional reading [GRY07] for more detials.
•Loop-free Alternative (LFA): fast re-routing only when direct
link to (default) next-hop fails
– simpler computation – we know exactly which link to remove
when computed new next-hop; but protection limited
– using IP-tunnels, etc.
•U-turn: allow protection over multiple hops
•Using “Not-Via” Addresses
•Multi-topology routing
– routers and links (with possibly different link weights) belong
to multiple topologies
• E.g., a default topology, plus “back-up” topologies with various
(assumed) links removed (or new link weights)
– packets are “marked” with “topology id” for look-up
• IETF Fast Rerouting and MT-Routing Working Groups
CSci5221:
Network Failures; IGP Fast Convergence
and IP Fast Re-Routing
38