Transcript Slides

Protection and Restoration
• Definitions
• A major application for MPLS
The problem
• Network resources will fail
– Nodes and links
• IGP will re-converge
– But this may take some time
• 10s of seconds
– Fast convergence has a price
• May make IGP more sensitive/unstable
– I may have sensitive traffic that cannot afford
interruptions
• Voice, Consumer TV
• Do something for the time until IGP reconverges
Terminology
• Restoration
– Bring traffic back to normal
• Backup
– Alternative resources to be used when there is a failure
• Protection
– Determine and allocate the backup resources before
the failure
– When there is a failure just activate them
– Can be very fast
• Repair
– Determine, allocate and activate the backup resources
after the failure
– Will be slower
Failure Modes
• Single vs. multiple link failures
– If duration of link failure is short, can assume
that there will be only a single link failure
– Much harder to deal with multiple link failures
• Node vs. link failures
– Can assume that links will fail more frequently
than nodes
– Node failures are harder to handle
Backup resources
• Can be multiple types
– Links
– Paths
– Trees
– Cycles
– Whole topologies
• In order to avoid network overload after a failure
need to have some extra capacity for backup
resources
• Problem is how to engineer them so as not to
make the network too expensive
– Minimize the amount of backup capacity that is
reserved
More jargon
• 1:1
– 1 working, 1 backup
– Wastes a lot of bandwidth for the backups
• 1:N
– N working and 1 backup
– Assume that only 1 working will fail
– Then 1 backup is enough – save bandwidth
• Revertive:
– when the failure is fixed, revert to the primary
• SRLG: Shared Risk Link Group
– A set of network links that fails together
– E.g. fibers that are in the same conduit
• A bulldozer will cut all of them together
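The bandwidth trade-off between 1:1 and 1:N can be sketched with a little arithmetic. This is only an illustration, assuming each working path has a known bandwidth and, for 1:N, that only one working path fails at a time. The function name and the numbers are made up for the example.

```python
def backup_bandwidth(working, mode):
    """Backup bandwidth to reserve for a list of working-path bandwidths."""
    if mode == "1:1":
        # One dedicated backup per working path: the backup total
        # equals the working total.
        return sum(working)
    if mode == "1:N":
        # One shared backup, sized for the largest working path,
        # assuming only one working path fails at a time.
        return max(working)
    raise ValueError(f"unknown mode: {mode}")


working = [10, 20, 5]  # hypothetical bandwidths in Mb/s
print("1:1 backup:", backup_bandwidth(working, "1:1"))
print("1:N backup:", backup_bandwidth(working, "1:N"))
```

If the working paths share an SRLG, the single-failure assumption behind 1:N no longer holds, since they can all fail together.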
Other issues
• How to detect the failure fast
– BFD is one general solution
– There are medium specific solutions
• OAM for ATM
• Alarms for SONET
• Preferable if they exist
– Protocol mechanisms (RSVP HELLOs, OSPF
HELLOs, etc)
• How to activate the backup
– I.e. how to make traffic use an alternate path, or
a tree
Backbone failure analysis
• Sprint backbone ca. March 2002
– Link in class website
• Monitor IS-IS traffic
• Data only for link failures, not node failures
• Failure Duration
– 50% of failures last less than 1 min
– 40% of failures last between 1 and 20 min
• Maintenance
– 50% of failures during maintenance windows
• Mean time between failures (MTBF)
– Mean time between failures varies a lot across links
• “good” and “bad” links
– 3 bad links account for 25% of the failures
More analysis
• Unplanned failure breakdown
– Shared link failures = 30%
• Router related = 16.5%
• Optical related = 11.5%
– Individual link failures = 70%
• Node failures less common than single link
failures
• About 16.5% of failures affect more than 1
link
Handling failures with IP
• Easy case
– ECMP, no need to do anything extra during failure
– But it may not repair all failures
– Coverage: what percentage of the possible failures can
be repaired
• In general activating backup resources is hard with
IP
– Packets will follow the IP route table/FIB
– Forwarding is hop-by-hop
– Even if I compute a backup link for a failure, I have no
control what will happen after the next hop
• May have routing loops
IP protection
• Backup next-hop
– Each node computes a backup nexthop for each
destination
• so that I will not have routing loops
– It may not have 100% coverage
• For more general solutions I need tunneling
– Must force packets to reach their destination
– Without crossing the failed resource
• Tunnel to the node after the failed link
• Tunnel to an intermediate node
– IP tunneling is an expensive operation
• It is packet encapsulation
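The backup next-hop idea and its limited coverage can be sketched as below. The slides do not say which test is used to avoid loops. This sketch assumes the common loop-free condition: a neighbor N is a safe backup for destination D if dist(N, D) < dist(N, S) + dist(S, D), so N's own shortest path to D does not come back through S. It protects against link failures only, and assumes a connected topology with symmetric link costs. All names and the example ring are hypothetical.

```python
import heapq


def dijkstra(graph, src):
    """Shortest-path distances from src; graph is {node: {neighbor: cost}}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def backup_next_hops(graph, s):
    """Map each destination to (primary next hop, backup next hop or None)."""
    dist = {n: dijkstra(graph, n) for n in graph}
    table = {}
    for d in graph:
        if d == s:
            continue
        # Primary next hop: the neighbor on a shortest path (ties broken by name).
        primary = min(graph[s], key=lambda n: (graph[s][n] + dist[n][d], n))
        backup = None
        for n in sorted(graph[s]):
            # Assumed loop-free test: n does not route back through s.
            if n != primary and dist[n][d] < dist[n][s] + dist[s][d]:
                backup = n
                break
        table[d] = (primary, backup)
    return table


def coverage(graph, s):
    """Fraction of destinations for which s has a loop-free backup next hop."""
    table = backup_next_hops(graph, s)
    return sum(1 for _, b in table.values() if b is not None) / len(table)


# Hypothetical 4-node ring with unit costs.
ring = {
    "A": {"B": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1, "A": 1},
}
print(backup_next_hops(ring, "A"))
print(coverage(ring, "A"))
```

On this ring, node A finds a backup only for the destination on the opposite side. For its two direct neighbors the other neighbor would send the packet straight back, which is why tunneling is needed for full coverage.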
Not-Via addresses
• Consider router A, with interfaces A1, A2, A3
– A1 connects to interface B1 of router B
– A2 connects to interface C2 of router C
– B1 has a second address B1-not-via-A
– All routers compute paths to B1-not-via-A by
removing router A from topology and running SPF
– When router A fails, if C wants to reach B it sends
packets to address B1-not-via-A
• Encapsulates the packets
• 100% coverage
• Can handle node and link failures
• Still needs encapsulation
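The not-via computation (remove router A, then run SPF) can be sketched as follows. This is a minimal illustration, not a router implementation. The topology, the link costs, and the function name are assumptions made for the example, with a hypothetical router D providing the detour.

```python
import heapq


def shortest_path(graph, src, dst, excluded=frozenset()):
    """Dijkstra SPF that ignores every node in `excluded`.

    graph is {node: {neighbor: cost}}; returns a list of nodes or None.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if v in excluded:
                continue  # pretend the excluded router is not in the topology
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical topology: C reaches B through A normally, or through D.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 2},
    "C": {"A": 1, "D": 2},
    "D": {"B": 2, "C": 2},
}
normal = shortest_path(graph, "C", "B")
not_via_a = shortest_path(graph, "C", "B", excluded={"A"})
print("normal path:    ", normal)
print("B1-not-via-A:   ", not_via_a)
```

Every router runs the same computation, so packets addressed to B1-not-via-A are forwarded consistently around A without any extra signaling.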
Multi-topology protection
• New approach
• Have multiple subsets of the topology
– IGP protocols already support multi-topology
routing
– Switch to a different topology when there is a
failure
• By modifying the header of the packet
• Or even using an MPLS label
• Allows for more flexible routing of traffic
after a failure
Using MPLS
• MPLS can conveniently direct traffic where
I want
• Ideal for setting up backup resources
– Mostly backup paths
• Can be used to repair both IP and MPLS
failures (i.e. LSP failure)
• LSP protection can be
– Path
– Local
Path protection
• For each LSP (primary) have a backup LSP
– It is already established (with RSVP) but it is
not carrying any traffic
• Primary and backup LSPs should be link
and node disjoint
• When there is a failure the source of the LSP
will start sending traffic to the backup
• Source needs to be notified of the failure
– May take some time for the repair of the traffic
• Can work in both 1:1 and 1:N modes
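One simple way to get a link and node disjoint backup is to pick the primary first, then search again with the primary's links and intermediate nodes removed. The sketch below does that with a hop-count search. It is an assumption for illustration, not how any particular implementation selects paths, and this two-step approach can miss a disjoint pair that exists (algorithms such as Suurballe's avoid that). The topology and names are hypothetical.

```python
from collections import deque


def bfs_path(adj, src, dst, banned_nodes=(), banned_links=()):
    """Fewest-hop path from src to dst, avoiding banned nodes and links."""
    banned_nodes = set(banned_nodes)
    banned_links = {frozenset(link) for link in banned_links}
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v in prev or v in banned_nodes:
                continue
            if frozenset((u, v)) in banned_links:
                continue
            prev[v] = u
            queue.append(v)
    return None


def primary_and_backup(adj, src, dst):
    """Return (primary, backup); backup shares no link or transit node with primary."""
    primary = bfs_path(adj, src, dst)
    if primary is None:
        return None, None
    links = list(zip(primary, primary[1:]))
    backup = bfs_path(adj, src, dst,
                      banned_nodes=primary[1:-1], banned_links=links)
    return primary, backup


# Hypothetical topology: a short path through A and a longer one through B, C.
adj = {
    "S": ["A", "B"],
    "A": ["S", "D"],
    "B": ["S", "C"],
    "C": ["B", "D"],
    "D": ["A", "C"],
}
primary, backup = primary_and_backup(adj, "S", "D")
print("primary:", primary)
print("backup: ", backup)
```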
Local protection
• When a link or node fails the node upstream from the
failure repairs the traffic
– Traffic is put into a backup LSP that does not go over the failed
resource
– Backup LSP merges with the primary LSP
• Repairing router does not send a PATHerr upstream
– Instead notify upstream nodes that it is repairing the failure
• It is very fast
• Can work in 1:1 and 1:N modes
• Can be
– Node
• Bypass a failed node
– Link
• Bypass a failed link
Link local protection
• The node upstream of the failed link
initiates the protection
– Point of local repair (PLR)
• Backup LSP will merge back to the primary
one
– At the next-hop (Nhop) of the PLR
• Can work in 1:1 and 1:N modes
– Usually a single backup LSP protects multiple
primary LSPs
– Else scalability is not good
Node local protection
• When a node fails, assume its links have failed too
• The node upstream of the failed node initiates the
protection
– Point of local repair (PLR)
• Backup LSP will merge back to the primary one
– At the next-next-hop (NNHop) of the PLR
• What label does the NNHop use for the primary
LSP?
– Need RSVP’s help to find out
• Will need multiple backup LSPs for each node
– At least one for each NNHop
– Can optionally configure more
Label stacking
• Each time I send traffic into an LSP I push a
label on the packets
• Packets in the primary LSP already have a
label
– I create a label stack
– Top label is popped by the router just before the
merge point
• A catch
– At the merge point, packet arrives from an
interface different than the expected one
– Must have global (platform) label space
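The label operations for link protection, and the catch at the merge point, can be sketched as below. This is a toy model under stated assumptions: the label values, interface names, and function names are all hypothetical, and a label stack is modeled as a Python list whose last element is the top label.

```python
def plr_forward(stack, primary_out_label, link_up, backup_label):
    """PLR behaviour: swap the primary label, push the backup label if the link is down."""
    stack = stack[:-1] + [primary_out_label]  # normal label swap for the primary LSP
    if link_up:
        return stack
    return stack + [backup_label]  # label stacking: backup label goes on top


def pop_top(stack):
    """The router just before the merge point pops the backup label."""
    return stack[:-1]


def merge_point_lookup(table, stack, in_interface, platform_space):
    """Look up the top label at the merge point; None means the packet is dropped."""
    key = stack[-1] if platform_space else (in_interface, stack[-1])
    return table.get(key)


# Hypothetical labels: the PLR receives 20, the Nhop expects 30,
# and the backup LSP uses 500.
on_backup = plr_forward([20], 30, link_up=False, backup_label=500)
at_merge = pop_top(on_backup)

# Platform label space: the entry is keyed by label only.
platform_table = {30: "swap to 40, forward downstream"}
# Per-interface label space: the entry is keyed by (interface, label).
interface_table = {("from-PLR", 30): "swap to 40, forward downstream"}

print("stack on backup LSP:", on_backup)
print("stack at merge point:", at_merge)
print("platform space:", merge_point_lookup(platform_table, at_merge, "from-backup", True))
print("per-interface space:", merge_point_lookup(interface_table, at_merge, "from-backup", False))
```

With a per-interface label space the merge point has no entry for label 30 arriving from the backup interface, so the lookup fails. That is the reason a platform label space is required.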
Need some RSVP support
• If the LSP is protected do not send errors
upstream/downstream when there is a failure
– Instead notify upstream nodes that repair is in progress
• During failure the PATH and RESV messages for the primary
LSP must continue
– Send them through the backup LSP
• For node protection need to know the label the
NNHop is using for the primary
– Use the record label option for the LSP
– All the labels used in all the hops are recorded in the
RESV message
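How a PLR could use the recorded labels for node protection can be sketched as below. This is a simplified model, not the RSVP message encoding: it assumes the recorded information is available as a list of (node, label) pairs in path order, where each label is the one that node expects on incoming packets of the primary LSP. The node names, labels, and function name are hypothetical.

```python
def nnhop_and_label(recorded, plr):
    """Return (NNHop, label the NNHop expects) for the given PLR, or None.

    recorded: list of (node, incoming_label) pairs along the primary LSP,
    in order from ingress to egress. The ingress has no incoming label.
    """
    nodes = [node for node, _ in recorded]
    i = nodes.index(plr)
    if i + 2 >= len(recorded):
        return None  # no next-next-hop: the PLR is at most one hop from the egress
    return recorded[i + 2]


# Hypothetical primary LSP R1 -> R2 -> R3 -> R4 with recorded labels.
recorded = [("R1", None), ("R2", 20), ("R3", 30), ("R4", 40)]

# If R2 fails, R1 is the PLR and must send packets that R3 will understand.
print(nnhop_and_label(recorded, "R1"))
print(nnhop_and_label(recorded, "R3"))
```

In this model the PLR swaps to the label returned here instead of the label its failed next hop advertised, then pushes the label of the backup LSP on top.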
LSP protecting IP
• Can use the above techniques to also protect
IP traffic
• If a link fails all the traffic that would go
through the link is sent over the backup LSP
• Similar for node failures
– But in this case, do I know the nnhop for IP?
• In general, if I have MPLS in my network
all my traffic will be inside MPLS tunnels
anyway
Observations
• If node degree is d and I have N nodes then
– I need at least O(Nd) tunnels for link protection
– And at least O(Nd^2) for node protection
• Of course I cannot protect from failures of
the ingress or egress node
• The assumption is that failures will be short
lived
– Traffic may be unbalanced during the failure
– Links can get overloaded
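The tunnel counts can be checked on a small topology. This sketch assumes one backup tunnel per (PLR, next hop) pair for link protection and one per (PLR, protected node, NNHop) triple for node protection. That counting rule is an assumption used to illustrate the bounds, and the full-mesh example is hypothetical.

```python
def count_backup_tunnels(adj):
    """Return (link_protection_tunnels, node_protection_tunnels).

    adj is {node: [neighbors]} for an undirected topology.
    """
    # Link protection: every node is a PLR for each of its outgoing links.
    link_tunnels = sum(len(neighbors) for neighbors in adj.values())

    # Node protection: for each PLR and each neighbor that may fail,
    # one tunnel to every other neighbor of the failed node (the NNHops).
    node_tunnels = 0
    for plr, neighbors in adj.items():
        for failed in neighbors:
            node_tunnels += sum(1 for nnhop in adj[failed] if nnhop != plr)
    return link_tunnels, node_tunnels


# Hypothetical full mesh of N = 4 nodes, so every node has degree d = 3.
names = "ABCD"
mesh = {n: [m for m in names if m != n] for n in names}
link_count, node_count = count_backup_tunnels(mesh)
print("link protection tunnels:", link_count)   # N * d
print("node protection tunnels:", node_count)   # N * d * (d - 1)
```

For a regular topology the two counts are N·d and N·d·(d−1), which match the O(Nd) and O(Nd^2) figures.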
The resource allocation problem
• How do I setup the backup tunnels so that
– I do not overload any link after a failure
– I minimize the amount of extra bandwidth that will
need to be reserved for the backups
• It is a form of traffic engineering (TE)
– We will see more on TE later on
• Has been studied a lot
– In optical and telephone networks
– And recently in MPLS type networks
• Solutions can be
– On-line (as the requests arrive)
– Off-line
Example
• Kodialam, Lakshman, 2001
– Local link and node protection
– Assume I know the b/w demands of all LSPs
– Assume that only one link or node can fail at a time
• Find a set of backup paths that minimizes the amount of
bandwidth for both primary and backup LSPs
– Backup LSPs can share bandwidth on some links
• What do I know about the links?
– How much bandwidth is used by each LSP
• Complete but expensive to maintain
– How much bandwidth is available
• Almost zero information
– How much bandwidth is used by backup LSPs
• Little bit better than zero
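Why backup LSPs can share bandwidth follows from the single-failure assumption: two backups that protect different resources are never active at the same time. The sketch below shows only that sharing principle for the complete-information case, where the bandwidth of every LSP is known. It is not the algorithm from the paper, and the data layout, names, and numbers are hypothetical.

```python
from collections import defaultdict


def backup_reservation(backups, share=True):
    """Backup bandwidth to reserve on each link.

    backups: iterable of (protected_resource, bandwidth, backup_links).
    With share=True, reserve the worst case over single failures;
    with share=False, reserve the sum of all backups on the link.
    """
    # load[link][failure] = bandwidth rerouted onto link when that resource fails
    load = defaultdict(lambda: defaultdict(int))
    for protected, bandwidth, links in backups:
        for link in links:
            load[link][protected] += bandwidth
    combine = max if share else sum
    return {link: combine(per_failure.values()) for link, per_failure in load.items()}


# Hypothetical backup LSPs: two protect link A-B, one protects link D-B.
backups = [
    (("A", "B"), 10, [("A", "C"), ("C", "B")]),
    (("A", "B"), 3, [("A", "C"), ("C", "B")]),
    (("D", "B"), 5, [("D", "C"), ("C", "B")]),
]
shared = backup_reservation(backups, share=True)
unshared = backup_reservation(backups, share=False)
print("shared:  ", shared)
print("unshared:", unshared)
```

On link C-B the shared reservation is the larger of the two failure cases rather than their sum. With less information about the links, a node has to assume less sharing and reserve more.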