Reliability and Resilience in Data
Communication Networks
Weidong Cui
EECS, UC Berkeley
CS294-3, Spring 2002
1
Outline
• Overview
• Resilience in Optical Layer
• Resilience in IP Layer
• Resilience in Application Layer
• Resilience in Multilayer Networks
• Case Study: Sprint Long Distance Network Reliability
• Summary
2
Overview
• Network Reliability
– Networks should be able to detect faults and somehow repair
themselves before end-users perceive any problem with their
communications.
• Technical Concerns
– Robustness
– Efficiency
– Expedition (fast recovery)
• Management Concerns
– Cost
– Configurability
– Interoperability
3
Terminology
• Protection
– Uses preassigned capacity to ensure survivability
• Restoration
– Reroutes the affected traffic after a failure occurs,
using available capacity
• Survivability
– Property of a network to be resilient to failures
• Proactive: protection
• Reactive: restoration
• Path-based vs. Link-based
• Dedicated Backup vs. Backup Multiplexing
• 1+1 protection vs. 1:1 protection vs. 1:N protection
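The three linear schemes differ mainly in how much spare capacity they set aside. The sketch below assumes the usual definitions, which the slide does not spell out: in 1+1 the traffic is sent on the working and protection channels at the same time, in 1:1 one protection channel is reserved per working channel and used only after a failure, and in 1:N one protection channel is shared by N working channels. The function name `spare_ratio` is illustrative only.

```python
from fractions import Fraction


def spare_ratio(scheme: str, n: int = 1) -> Fraction:
    """Spare channels per working channel for a linear protection scheme.

    Assumed definitions (not stated on the slide):
      "1+1" and "1:1" reserve one protection channel per working channel,
      "1:N" shares one protection channel among n working channels.
    """
    if scheme in ("1+1", "1:1"):
        return Fraction(1, 1)
    if scheme == "1:N":
        if n < 1:
            raise ValueError("n must be at least 1")
        return Fraction(1, n)
    raise ValueError(f"unknown scheme: {scheme}")


# Spare-capacity overhead for each scheme, with N = 4 as an example value.
overheads = {
    "1+1": spare_ratio("1+1"),
    "1:1": spare_ratio("1:1"),
    "1:4": spare_ratio("1:N", 4),
}
```

The ratio alone does not separate 1+1 from 1:1. Under the assumed definitions, the difference is that 1:1 needs a switching action after the failure, while the 1+1 receiver only has to select the better signal.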
4
Link-based vs. Path-based
• Link-based
– Shorter restoration time
– Less efficient
– Can only fix link failures
• Path-based
– Longer restoration time
– More efficient
Mohan and Murthy, “Lightpath Restoration in WDM Optical Networks”, 2000.
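The contrast between the two approaches can be sketched on a toy topology. The graph, node names, and helper functions below are assumptions made for illustration; they are not taken from the cited paper. Link-based recovery patches a detour between the end nodes of the failed link, while path-based recovery recomputes the route end to end.

```python
from collections import deque


def shortest_path(adj, src, dst, banned=frozenset()):
    """Hop-count shortest path (BFS) that avoids the undirected links in `banned`."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev and frozenset((node, nxt)) not in banned:
                prev[nxt] = node
                queue.append(nxt)
    return None


def link_based(adj, primary, failed):
    """Detour only around the failed link (u, v); assumes primary visits u then v."""
    u, v = failed
    i = primary.index(u)
    detour = shortest_path(adj, u, v, banned={frozenset(failed)})
    if detour is None:
        return None
    return primary[:i] + detour + primary[i + 2:]


def path_based(adj, primary, failed):
    """Recompute the whole route between the end nodes of the primary path."""
    return shortest_path(adj, primary[0], primary[-1], banned={frozenset(failed)})


# Hypothetical topology: primary A-B-C-D, local detour B-E-C, alternate route A-F-D.
adj = {
    "A": ["B", "F"], "B": ["A", "C", "E"], "C": ["B", "D", "E"],
    "D": ["C", "F"], "E": ["B", "C"], "F": ["A", "D"],
}
primary = ["A", "B", "C", "D"]
link_route = link_based(adj, primary, ("B", "C"))   # A-B-E-C-D, 4 hops
path_route = path_based(adj, primary, ("B", "C"))   # A-F-D, 2 hops
```

In this example the link-based route is longer (it keeps the surviving parts of the primary path), which is one way to read "less efficient". Only the nodes next to the failure have to act, which is consistent with the shorter restoration time.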
5
Dedicated Backup vs.
Backup Multiplexing
• Dedicated backup
– More robust
– Less efficient
• Backup multiplexing
– Less robust
– More efficient
Mohan and Murthy, “Lightpath Restoration in WDM Optical Networks”, 2000.
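The efficiency gain of backup multiplexing can be illustrated with a small calculation. The sketch assumes a single-link-failure model: backup paths whose primary paths share no link never need their backup capacity at the same time, so they can share it. The data layout and function name are hypothetical.

```python
from collections import defaultdict


def backup_reservation(connections, share):
    """Bandwidth to reserve on each backup link.

    connections: list of (demand, primary_links, backup_links).
    share=False: dedicated backup, every connection gets its own backup capacity.
    share=True:  backup multiplexing under an assumed single-link-failure model.
    """
    dedicated = defaultdict(int)
    # per_failure[b][f]: load on backup link b if primary link f fails
    per_failure = defaultdict(lambda: defaultdict(int))
    for demand, primary_links, backup_links in connections:
        for b in backup_links:
            dedicated[b] += demand
            for f in primary_links:
                per_failure[b][f] += demand
    if not share:
        return dict(dedicated)
    # Only one primary link fails at a time, so reserve for the worst single failure.
    return {b: max(loads.values()) for b, loads in per_failure.items()}


# Two connections with link-disjoint primaries whose backups both use link C-B.
connections = [
    (10, ["A-B"], ["A-C", "C-B"]),
    (10, ["D-B"], ["D-C", "C-B"]),
]
dedicated = backup_reservation(connections, share=False)
multiplexed = backup_reservation(connections, share=True)
```

With these inputs the shared link C-B needs 20 units under dedicated backup and 10 under multiplexing. The price is robustness: if both primary links fail together, the shared 10 units cannot carry both connections.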
6
Resilience in Optical Layer
• Why do we want resilience in the optical layer?
– The ever increasing bit-rate makes optical layer
failures a significant loss for network
operators.
– Cable cuts are very frequent?
– Fast restoration time, invisible to end-users.
– Easy to achieve “physical diversity”
7
Resilience in Optical Layer
• Linear Systems
– 1+1 protection
– 1:1 protection
– 1:N protection
• Ring-based
– UPSR: Uni-directional Path Switched Rings
– BLSR: Bi-directional Line Switched Rings
• Mesh-based
– Optical mesh networks connected by optical crossconnects
(OXCs) or optical add/drop multiplexers (OADMs)
– Link-based/path-based protection/restoration
• Hybrid Mesh Rings
– Physical: mesh
– Logical: ring
8
UPSR vs. BLSR
• UPSR
– Simple: automatically bridging all traffic counterclockwise at entry nodes
– Not efficient
– No communication between entry and exit nodes
– Switching time is not affected by the number of nodes in the ring.
• BLSR
– Traffic is routed in both directions
– More efficient
– Uses APS messaging
– The requirement of 50 ms switching time restricts the BLSR to 16 nodes.
www.acterna.com, “The Fundamentals of SONET”.
9
Resilience in IP Layer
• Why do we want resilience in the IP layer?
– Survivability in the optical layer is not enough.
• It is less efficient and more expensive.
• Node failures within a service layer can only be dealt
with by the actions of peer-level network elements.
• Some network operating contexts need IP
restoration rather than physical layer restoration.
– No rigid distinction between working and spare
capacity (More Efficient!)
• Spare capacity can be used by best-effort traffic
during normal operation.
10
Virtual Protection Cycles:
p-cycle (I)
Protection for link failures
(on-cycle and straddling failures)
Protection for node failures
(encircling p-cycles)
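The difference between on-cycle and straddling failures can be sketched as a small classifier. It assumes the standard p-cycle property that an on-cycle span is restored over the remainder of the cycle (one path), while a straddling span can use both arcs of the cycle (two paths). The function name and the node-list representation are illustrative, and the sketch assumes there are no parallel links between two cycle nodes.

```python
def classify_span(cycle, span):
    """Classify a failed span against a p-cycle given as an ordered node list.

    Returns (kind, restoration_paths):
      "on-cycle":    the span joins two consecutive cycle nodes, 1 path
      "straddling":  both end nodes are on the cycle but not consecutive, 2 paths
      "unprotected": at least one end node is not on the cycle, 0 paths
    """
    u, v = span
    if u not in cycle or v not in cycle:
        return "unprotected", 0
    n = len(cycle)
    gap = (cycle.index(u) - cycle.index(v)) % n
    if gap in (1, n - 1):
        return "on-cycle", 1
    return "straddling", 2


# Hypothetical five-node p-cycle A-B-C-D-E-A.
cycle = ["A", "B", "C", "D", "E"]
on_cycle = classify_span(cycle, ("A", "B"))
straddling = classify_span(cycle, ("A", "C"))
```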
Stamatelakis and Grover, “IP Layer Restoration and Network Planning Based on Virtual Protection Cycles”.
11
Virtual Protection Cycles:
p-cycle (II)
• p-cycle can be implemented using IP tunneling or label switching.
• Encapsulating original IP packets
• Decapsulating p-cycle packets
– A few more routing entries for p-cycle routing in routing tables
– When the original route cost is larger than the local route cost.
• Formulation
– Objective: minimize worst-case restoration-induced oversubscription
– Mixed integer programming
• Performance
– Max oversubscription is close to 1.0
– Restoration time is dependent on failure detection time.
Stamatelakis and Grover, “IP Layer Restoration and Network Planning Based on Virtual Protection Cycles”.
12
Restoration in Different
Information Scenarios (I)
• No information scenario
– Residual bandwidth on each link
• Complete information scenario
– Full routing information of all the connections currently
in progress
• Partial information scenario
– Residual bandwidth on each link
– Total bandwidth used by active (primary) paths on each
link
– Total bandwidth used by backup paths on each link
Kodialam and Lakshman, “Dynamic Routing of Bandwidth Guaranteed Tunnels with Restoration”, Infocom’00.
Kodialam and Lakshman, “Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using
Aggregated Link Usage Information”, Infocom’01.
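To see why the aggregates of the partial information scenario are enough to allow some backup sharing, consider how much extra backup bandwidth a link would need for a new request. The sketch below is only an illustration of that reasoning under a worst-case assumption; it is not a reproduction of the cited algorithms, and all names are hypothetical.

```python
import math


def extra_backup_bandwidth(demand, active_on_failed, backup_reserved, residual):
    """Extra bandwidth to reserve on a candidate backup link (conservative sketch).

    demand:           bandwidth of the new request
    active_on_failed: total active (primary) bandwidth on the link whose failure
                      is being protected against
    backup_reserved:  total backup bandwidth already reserved on the candidate link
    residual:         residual bandwidth on the candidate link

    Assumption: without per-connection routes, the worst case is that every
    active flow on the failed link is rerouted over this backup link.
    """
    worst_case_need = active_on_failed + demand
    extra = max(0, worst_case_need - backup_reserved)
    # Reserving the full demand is always sufficient for one new request.
    extra = min(extra, demand)
    return extra if extra <= residual else math.inf


free_ride = extra_backup_bandwidth(5, 10, 20, 0)      # existing reservation suffices
partial = extra_backup_bandwidth(5, 10, 12, 10)       # only 3 more units needed
infeasible = extra_backup_bandwidth(5, 30, 10, 2)     # link cannot be used
```

With no information, only `residual` is known, so the full demand would have to be reserved on every backup link. With complete information the worst-case assumption can be replaced by the exact set of affected connections.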
13
Restoration in Different
Information Scenarios (II)
• Objective: optimize the routing of active and backup paths (or
bypass paths, if local restorability is desired)
so that more path requests can be accommodated.
– Formulated as integer programming problems
– Heuristics solve the problem of sharing with partial
information
• Performance
– Efficiency of restorable routing under partial
information scenario is close to the performance under
complete information scenario.
Kodialam and Lakshman, “Dynamic Routing of Bandwidth Guaranteed Tunnels with Restoration”, Infocom’00.
Kodialam and Lakshman, “Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using
Aggregated Link Usage Information”, Infocom’01.
14
Resilience in Application Layer
• Why do we want resilience in the application layer?
– Survivability in IP layer is not widely deployed.
– BGP’s fault recovery mechanisms may take many
minutes before routes converge to a consistent
form.
– Application layer restoration has more
knowledge of application requirements.
– Application service providers can do restoration
only in application layer.
15
Resilient Overlay Networks
(RONs) (I)
• Goal:
– Failure detection and recovery in less than 20 seconds
– Tighter integration of routing and path selection with the
application
– Expressive policy routing
• Basic Idea:
– Detect problems by aggressively probing and monitoring the
paths connecting the nodes
– RON nodes exchange information about the quality of the paths
among themselves via a routing protocol and build forwarding
tables based on a variety of path metrics (latency, loss
rate, throughput)
– Route packets over the RON rather than the underlying
Internet path if the latter is not the best one.
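The basic idea can be sketched as a route choice over probe results. The sketch below assumes latency is the metric in use and, as a simplification, considers only the direct Internet path and detours through a single intermediate RON node. The probe table and function name are hypothetical.

```python
import math


def best_route(latency, src, dst):
    """Choose between the direct path and one-hop detours by measured latency.

    latency[a][b] is the latest probed latency from a to b in milliseconds,
    with math.inf standing for a path that currently fails its probes.
    """
    best_path, best_cost = [src, dst], latency[src][dst]
    for hop in latency:
        if hop in (src, dst):
            continue
        cost = latency[src][hop] + latency[hop][dst]
        if cost < best_cost:
            best_path, best_cost = [src, hop, dst], cost
    return best_path, best_cost


# Hypothetical probe table: the direct A-B path is down, C can reach both.
latency = {
    "A": {"B": math.inf, "C": 20},
    "B": {"A": math.inf, "C": 30},
    "C": {"A": 20, "B": 30},
}
route, cost = best_route(latency, "A", "B")
```

When the direct path is healthy and fastest, the same function returns it unchanged, so traffic leaves the underlying Internet path only when an overlay detour scores better.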
D. Andersen, et al., “Resilient Overlay Networks”, SOSP’01.
16
Resilient Overlay Networks
(RONs) (II)
• Performance
– Average fault detection and recovery time is 18
seconds
– 100% in RON1 and 60% in RON2 of the hundred
significant observed outages are overcome by
RON.
• Limitations
– Not scalable: RON size is 2 to 50 nodes.
D. Andersen, et al., “Resilient Overlay Networks”, SOSP’01.
17
Fault Detection and Recovery for
Wide-Area Service Composition
• Goals
– Availability: detect and recover from failures quickly
– Performance: choose set of service instances
– Scalability: Internet-scale operation
• Design
– Leverage an overlay network of service clusters
– Link-state propagation
• Need full network topology information
• Quick propagation of failure information
• Link-state flooding is acceptable
• Evaluation
– Good recovery time for real-time applications: O(3 seconds)
– Good scalability: minimal additional provisioning for cluster
managers
B. Raman, “Wide-Area Service Composition: Availability, Performance and Scalability”, 2002.
18
Overlay Restoration based on
Correlated Link Failures (I)
• Motivation
– Overlay link failures are correlated!
[Figure: the overlay network (nodes i, j, k; overlay links Lij, Lik) and the underlying network (nodes i, j, k, m; links lim, lmk, lmj)]
• Goal
– Robustness: reserve two paths with minimum joint path
failure probability based on correlated overlay link
failures
– Efficiency: leverage backup bandwidth sharing
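The correlation can be made concrete with a small calculation. The sketch assumes that overlay link Lij rides on underlying links lim and lmj, that Lik rides on lim and lmk (an assumed reading of the figure labels), that underlying links fail independently, and that an overlay link fails when any of its underlying links fails. The failure probabilities are made-up example values.

```python
from itertools import product


def overlay_failure_probs(link_a, link_b, p_fail):
    """Return (P(a fails), P(b fails), P(both fail)) for two overlay links.

    link_a, link_b: sets of underlying link names used by each overlay link.
    p_fail: independent failure probability of each underlying link.
    """
    names = sorted(p_fail)
    p_a = p_b = p_both = 0.0
    # Enumerate every up/down combination of the underlying links.
    for state in product((False, True), repeat=len(names)):
        prob = 1.0
        for name, down in zip(names, state):
            prob *= p_fail[name] if down else 1.0 - p_fail[name]
        failed = {name for name, down in zip(names, state) if down}
        a_down = bool(link_a & failed)
        b_down = bool(link_b & failed)
        if a_down:
            p_a += prob
        if b_down:
            p_b += prob
        if a_down and b_down:
            p_both += prob
    return p_a, p_b, p_both


p_fail = {"lim": 0.01, "lmj": 0.01, "lmk": 0.01}   # example values only
p_ij, p_ik, p_joint = overlay_failure_probs({"lim", "lmj"}, {"lim", "lmk"}, p_fail)
```

Because both overlay links share lim, the joint failure probability (about 0.0101) is far larger than the product of the two marginal probabilities (about 0.0004). That gap is why a backup path should be chosen with correlation in mind rather than by treating overlay links as independent.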
W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.
19
Overlay Restoration based on
Correlated Link Failures (II)
• Backup Path Routing
– Optimal Backup Path Routing (OPR)
• Integer quadratic programming
– Failure Probability Cost Backup Path Routing (FPR)
• Decouple primary path routing and backup path routing
• Primary path: shortest path based on default metric
• Backup path: shortest path based on failure probability cost
– Secondary Shortest Backup Path Routing (SSR)
• Same idea as FPR, but
• Backup path: (secondary) shortest path that is link-disjoint from the primary
path
• Backup Path Bandwidth Allocation
– No backup bandwidth sharing
– Backup bandwidth sharing for single-link-failure recovery
– Backup bandwidth sharing for double-link-failure recovery
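The decoupling used by FPR can be sketched as two independent shortest-path runs with different link weights. The sketch takes the failure probability costs as given inputs; how the cited work derives them is not reproduced here, and the topology and all numbers are hypothetical.

```python
import heapq


def dijkstra(graph, weight, src, dst):
    """Shortest path where weight[frozenset((u, v))] is the cost of link u-v."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v in graph[u]:
            nd = d + weight[frozenset((u, v))]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def link(u, v):
    return frozenset((u, v))


graph = {"S": ["A", "B", "C"], "A": ["S", "T"], "B": ["S", "T"],
         "C": ["S", "T"], "T": ["A", "B", "C"]}
# Default routing metric (assumed values).
default_metric = {link("S", "A"): 1, link("A", "T"): 1, link("S", "B"): 2,
                  link("B", "T"): 2, link("S", "C"): 2, link("C", "T"): 2}
# Assumed failure probability costs, computed elsewhere once the primary is fixed:
# high on the primary's own links, moderate on links correlated with it.
failure_cost = {link("S", "A"): 1.0, link("A", "T"): 1.0, link("S", "B"): 0.3,
                link("B", "T"): 0.3, link("S", "C"): 0.05, link("C", "T"): 0.05}

primary_path = dijkstra(graph, default_metric, "S", "T")
backup_path = dijkstra(graph, failure_cost, "S", "T")
```

SSR would instead remove the primary path's links and rerun the search on the default metric, which in this example cannot tell the S-B-T and S-C-T routes apart.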
W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.
20
Overlay Restoration based on
Correlated Link Failures (III)
• Evaluation
– Metric
• Fatal path failure probability (robustness)
• Number of path requests admitted (efficiency)
– Main conclusions
• FPR is 15% ~ 25% better than SSR and close to OPR
on robustness
• The overlay network can admit 100% more path
requests by using backup bandwidth sharing than
without backup bandwidth sharing.
• FPR is tolerant to inaccurate overlay link failure
estimates.
W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.
21
Resilience in Multilayer Networks
• Why do we want resilience in multilayer
networks?
– Avoid contention between different single-layer recovery schemes.
– Promote cooperation and sharing of spare
capacity
22
PANEL: Protection Across
Network Layers (I)
P. Demeester et al., “Resilience in Multilayer Networks”, 1999.
23
PANEL Guidelines
• Recovery in the highest layer is recommended when:
– Multiple reliability grades need to be provided with fine granularity
– Recovery interworking cannot be implemented
– Survivability schemes in the highest layer are more mature than in the
lowest layer
• Recovery in the lowest layer is recommended when:
– The number of entities to recover has to be limited/reduced
– The lowest layer supports multiple client layers and it is appropriate to
provide survivability to all services in a homogeneous way
– Survivability schemes in the lowest layer are more mature than in the
highest layer
– It is difficult to ensure the physical diversity of working and backup
paths in the higher layer
• Using unprotected or preemptible server (lower) paths to carry the
client (upper) layer spare capacity is recommended to alleviate
redundant protection and remain cost-effective.
P. Demeester et al., “Resilience in Multilayer Networks”, 1999.
24
Protection/Restoration in
IP over WDM Networks (I)
• Goal
– Deliver services reliably among border LSRs
(Label Switched Routers)
• IP Layer Protection
– One physical cut can expand to tens of
thousands of simultaneous logical link failures
at the IP layer.
• WDM Layer Protection
– Very low network utilization
Y. Ye et al., “A Simple Dynamic Integrated Provisioning/Protection Scheme in IP over WDM Networks”, 2001.
25
Protection/Restoration in
IP over WDM Networks (II)
• Dynamic Integrated Approach
– Periodically (globally) optimize the network by using
offline computation, and then use online dynamic path
selection to fine tune between offline calculations
– If the source LSR could not locate a link-disjoint backup
path from existing lightpaths or found one that had no
available bandwidth, the LSR would
• Run the path selection algorithm with constraints
• If failed, request a new lightpath from the WDM layer and
check if it can locate a link-disjoint backup path
• If not, drop the flow request and release all the reserved
resources.
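The fallback sequence at the source LSR can be sketched as plain control flow. The four callables are hypothetical stand-ins for the LSR's real operations, which the slide does not name; each path finder is assumed to return a path or `None`, and `request_lightpath` to return whether the WDM layer granted a new lightpath.

```python
def provision_backup(flow, find_on_existing, constrained_select,
                     request_lightpath, release):
    """Try to obtain a link-disjoint backup path for `flow`.

    Returns the backup path, or None if the flow request is dropped.
    """
    # 1. Look for a backup path with free bandwidth on existing lightpaths.
    backup = find_on_existing(flow)
    if backup is not None:
        return backup
    # 2. Run the path selection algorithm with constraints.
    backup = constrained_select(flow)
    if backup is not None:
        return backup
    # 3. Ask the WDM layer for a new lightpath, then look again.
    if request_lightpath(flow):
        backup = find_on_existing(flow)
        if backup is not None:
            return backup
    # 4. Give up: drop the flow request and release reserved resources.
    release(flow)
    return None


# Stub run in which every step fails, so the flow is dropped and released.
released = []
outcome = provision_backup(
    "flow-1",
    find_on_existing=lambda f: None,
    constrained_select=lambda f: None,
    request_lightpath=lambda f: False,
    release=released.append,
)
```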
Y. Ye et al., “A Simple Dynamic Integrated Provisioning/Protection Scheme in IP over WDM Networks”, 2001.
26
Case Study: Sprint Long
Distance Network Reliability
• Network reliability factors
– Transport architecture: SONET 4F BLSR
– Redundant equipment
• Internal redundancy of ADMs
• 2 independent WDM systems in SONET 4F BLSR
• Redundant IP routers
– Conservative synchronization
• Primary reference sources recover clocking from GPS and
Loran-C receivers.
– Protected power
M.L. Jones et al., “Sprint Long Distance Network Survivability: Today and Tomorrow”, 1999.
27
Summary
• Protection/Restoration in physical layer has
shorter fault switching time (~50ms) but worse
network utilization.
• Protection/Restoration in IP layer or application
layer may take from several seconds to several
minutes but has higher network utilization.
• Protection/Restoration in one layer cannot be
completely replaced by protection/restoration in
another layer.
• Integrated protection/restoration across multiple
layers needs extensive study.
28