Introduction - Rudra Dutta

Introduction
CSC/ECE 772: Survivable Networks
Spring, 2009, Rudra Dutta
Motivation

Failures can affect our lives fairly directly
– Similar to power grid, transportation system, water distribution
Modern society critically depends on communication networks
– Internet evolution: lab curiosity, mil/gov, educational, commercial, business, social, lifeline/ubiquitous
Mission critical business functions must be available 24/7
– Web-based transaction systems
– 1-800 services
– e-commerce
Government services, emergency (911) services
Scientific projects (e-Science, etc.)
Everyday communication services
– BT will switch entirely to IP by 2012 (Spectrum, 1/2007)
Survivability must be foremost consideration in network design, not an afterthought
Copyright Rudra Dutta, NCSU, Spring, 2009
Failure Events

Link failures - fiber cuts
Failure of active components inside network equipment
– Transmitters, receivers, controllers
– Individual channel failures (in a WDM system)
Node failures - due to catastrophic event
– Rare events, but cause widespread disruption
Software failures - due to immense complexity
– Usually dealt with by using proper software design techniques
– Hard to protect against in the network
Failure Causes

Human error - most common cause
– “Backhoe effect”, operator errors, ...
Natural events - floods, snow storms, earthquakes
– 1997 solar storm caused Telstar failure
– 1988 fire at Hinsdale CO (central office)
– 1999 damage from hurricane Floyd
– 2002 fire melted Verizon fiber cable
Animals!
– Rodents gnaw on cable jackets
– Sharks bite undersea cables (TAT-8)
Sabotage - terrorism (9/11)
Wear and tear
Chances of Failure

Fiber optic cables are critical components: we know to
– ...physically protect cables,
– ...bury them suitably deep,
– ...be careful when digging,
– So why do fiber cables get cut at all?
Similar issues in operating many large-scale systems:
– Nuclear reactors, water systems, air traffic control / airplanes
– Lay person: baffled when things go wrong
– Insider: knows how many things can go wrong
Statistical certainty of fairly high rate of failures!
Average life of fiber span - 228 years
– But consider laid fiber-miles
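
The point about laid fiber-miles can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes that spans fail independently at a constant rate of one failure per 228 years each, and the span counts are hypothetical figures, not numbers from the lecture. The function names are likewise made up for this example.

```python
# Illustrative arithmetic only. The 228-year figure comes from the slide;
# the span counts used below are hypothetical, and spans are assumed to
# fail independently at a constant rate.
MEAN_SPAN_LIFE_YEARS = 228.0


def expected_cuts_per_year(num_spans, mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Expected number of failures per year across num_spans spans."""
    return num_spans / mean_life_years


def mean_days_between_cuts(num_spans, mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Average number of days between failures somewhere in the network."""
    return 365.0 * mean_life_years / num_spans


if __name__ == "__main__":
    for spans in (1, 100, 1000):  # hypothetical network sizes
        print(
            spans,
            round(expected_cuts_per_year(spans), 2),
            round(mean_days_between_cuts(spans), 1),
        )
```

Under these assumptions, a network of 1000 such spans would see roughly 4.4 failures per year, or one about every 83 days, even though any single span looks extremely reliable.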
Service Outage Statistics
Engineering Fault Tolerance

Failures may be rare or common, but are inevitable
Should not be baffling!
– (At least not to the designer of system!)
– Should, in fact, be predictable (at least statistically)
Must engineer for failure - common to many disciplines
Most repair times are much larger than acceptable restoration times
– Restoration of service must be engineered to operate with active failure in network
Outage Duration

Revenue loss
– Loss of business (e.g., voice-calling revenue)
– Default on SLAs
Business disruption
– Regular business impacted
– Societal impact/risks (travel, education, financial services, 911)
– Lawsuits, bankruptcies
Network dynamics
– Application/TCP session timeouts, router connectivity loss
– Overloading
Outage Effects

Target Range | Duration | Main Effects
Protection Switching | < 50 ms | Service “hit”; TCP sees no impact
1 | 50 - 200 ms | Few voiceband disconnects; ATM cell rerouting may start
2 | 200 - 2000 ms | Some switched connections drop; TCP protocol backoff
3 | 2 - 10 s | Switched circuit services drop (X.25); TCP session timeouts; “webpage not available” errors; affects router hello protocol
4 | 10 s - 5 min | Calls and data sessions terminated; TCP/IP applications timeout; users attempt mass redials; routers issue LSAs; topology update, network-wide resynch
“Undesirable” | 5 - 30 min | Routers under heavy reattempts load; minor business/societal impact; noticeable “Internet brownout”
“Unacceptable” | > 30 min | Regulatory reporting required; major societal impacts/risks, headlines; SLA clauses triggered, lawsuits
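
The outage ranges above amount to a simple lookup from outage duration to target range. A minimal sketch of that lookup follows. The function and constant names are hypothetical, and assigning a duration that falls exactly on a boundary to the higher range is an assumption made here, since the table does not say.

```python
# Upper bounds (in seconds) transcribed from the Outage Effects table.
# Names are illustrative; boundary handling is an assumption: a duration
# exactly on a boundary is placed in the higher range.
OUTAGE_RANGES = [
    (0.050, "Protection Switching"),
    (0.200, "1"),
    (2.0, "2"),
    (10.0, "3"),
    (300.0, "4"),
    (1800.0, "Undesirable"),
]


def classify_outage(duration_s):
    """Return the target-range label for an outage lasting duration_s seconds."""
    if duration_s < 0:
        raise ValueError("outage duration cannot be negative")
    for upper_bound, label in OUTAGE_RANGES:
        if duration_s < upper_bound:
            return label
    return "Unacceptable"  # anything of 30 minutes or more


if __name__ == "__main__":
    for seconds in (0.03, 0.1, 1.0, 5.0, 60.0, 600.0, 3600.0):
        print(seconds, classify_outage(seconds))
```

A monitoring script could use such a function to tag logged outages, for example to count how many events in a month fell outside the protection-switching range.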
Planning Graded Fault Tolerance

Instantaneous recovery from most significant/frequent failures
– Eliminate human involvement - device level
Fast recovery from other significant or frequent failures
– Also automatic - device or system
Reasonably fast recovery from next tier of failures
– Automated, but may be system / software
Least likely tier - repair and recovery plans
– Includes manual repair, must also plan for liability
Mechanisms for Fault Tolerance

Carefully design-in specific amounts of spare capacity
– spare links/channels/equipment
– bumping low priority traffic
Design network topology for physical diversity
– bi-connected topology (or better)
– diverse routing
– shared risk link group (SRLG) concept
Embed real-time mechanisms to develop/implement “patch plan”
– appropriate protocols and algorithms
– cross-layer interactions
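
As an illustration of the bi-connected topology requirement, the sketch below checks whether a topology stays connected after the loss of any single node. It is a brute-force check written for clarity, not a method taken from the lecture: the function names and the two small example topologies are hypothetical, and a linear-time articulation-point algorithm would be preferable for large networks.

```python
from collections import deque


def build_adjacency(links):
    """Build an undirected adjacency map from a list of (node, node) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_connected(adj, removed=None):
    """Breadth-first reachability check, ignoring the node `removed` if given."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def survives_any_single_node_failure(adj):
    """True if the topology is connected and has no single point of failure."""
    return is_connected(adj) and all(is_connected(adj, removed=n) for n in adj)


# Hypothetical four-node examples: a ring and a chain.
ring = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
chain = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])

if __name__ == "__main__":
    print("ring:", survives_any_single_node_failure(ring))
    print("chain:", survives_any_single_node_failure(chain))
```

With these definitions the ring passes the check, while the chain fails it because removing node B or C splits the remaining nodes. Note that node diversity alone does not capture shared risk link groups: two links that look disjoint in the graph may still share a duct and fail together.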
Stages of Failure Plan Operation

Failure detection: almost certainly starting from device layer
Failure localization: cooperation between device and software
Failure recovery: software
Failure repair: human
Summary

Faults are real, must plan to address
Faults are diverse, plan must be diverse
Fault tolerance is a system concept, not add-on
– Must plan at various levels
– At each level, must be appropriate response
Must address together with network design problem, hard to achieve after the fact