Introduction - Rudra Dutta
Introduction
CSC/ECE 772: Survivable Networks
Spring, 2009, Rudra Dutta
Motivation
Failures can affect our lives fairly directly
– Modern society critically depends on communication networks
– Internet evolution: lab curiosity, mil/gov, educational, commercial, business, social, lifeline/ubiquitous
– Similar to power grid, transportation system, water distribution
Mission critical business functions must be available 24/7
– Web-based transaction systems
– 1-800 services
– e-commerce
– Government services, emergency (911) services
– Scientific projects (e-Science, etc.)
Everyday communication services
– BT will switch entirely to IP by 2012 (Spectrum, 1/2007)
Survivability must be foremost consideration in network design, not an afterthought
Copyright Rudra Dutta, NCSU, Spring, 2009
Failure Events
Link failures - fiber cuts
Failure of active components inside network equipment
– Transmitters, receivers, controllers
– Individual channel failures (in a WDM system)
Node failures - due to catastrophic event
– Rare events, but cause widespread disruption
Software failures - due to immense complexity
– Usually dealt with by using proper software design techniques
– Hard to protect against in the network
Failure Causes
Human error - most common cause
– “Backhoe effect”, operator errors, ...
Natural events - floods, snow storms, earthquakes
– 1997 solar storm caused Telstar failure
– 1988 fire at Hinsdale central office (CO)
– 1999 damage from Hurricane Floyd
– 2002 fire melted Verizon fiber cable
Animals!
– Rodents gnaw on cable jackets
– Sharks bite undersea cables (TAT-8)
Sabotage - terrorism (9/11)
Wear and tear
Chances of Failure
Fiber optic cables are critical components: we know to
– ...physically protect cables,
– ...bury them suitably deep,
– ...be careful when digging,
– So why do fiber cables get cut at all?
Similar issues in operating many large-scale systems:
– Nuclear reactors, water systems, air traffic control / airplanes
– Lay person: baffled when things go wrong
– Insider: knows how many things can go wrong
Statistical certainty of fairly high rate of failures!
Average life of fiber span - 228 years
– But consider laid fiber-miles
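The point about laid fiber-miles can be made concrete with a little arithmetic. The sketch below assumes that the 228-year figure is a mean time between cuts for a single span and that spans fail independently with exponentially distributed lifetimes. The span count of 1,000 is a made-up illustration, not a number from the lecture.

```python
import math

MEAN_SPAN_LIFE_YEARS = 228.0  # figure quoted on the slide


def expected_cuts_per_year(num_spans, mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Expected number of span failures per year across the whole network."""
    return num_spans / mean_life_years


def prob_at_least_one_cut(num_spans, days, mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Probability of one or more cuts within `days`.

    Assumes independent, exponentially distributed span lifetimes
    (a modeling assumption, not something stated on the slide).
    """
    rate_per_day = num_spans / (mean_life_years * 365.0)
    return 1.0 - math.exp(-rate_per_day * days)


if __name__ == "__main__":
    spans = 1000  # hypothetical network size
    print(f"{expected_cuts_per_year(spans):.2f} expected cuts per year")
    print(f"P(at least one cut in 30 days) = {prob_at_least_one_cut(spans, 30):.2f}")
```

Under these assumptions a 1,000-span network sees more than four cuts per year, even though any single span looks extremely reliable.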
Service Outage Statistics
Engineering Fault Tolerance
Failures may be rare or common, but are inevitable
Should not be baffling!
– (At least not to the designer of system!)
– Should, in fact, be predictable (at least statistically)
Must engineer for failure - common to many disciplines
Most repair times are much larger than acceptable restoration times
– Restoration of service must be engineered to operate with active failure in network
Outage Duration
Revenue loss
– Loss of business (e.g., voice-calling revenue)
– Default on SLAs
Business disruption
– Regular business impacted
– Societal impact/risks (travel, education, financial services, 911)
– Lawsuits, bankruptcies
Network dynamics
– Application/TCP session timeouts, router connectivity loss
– Overloading
Outage Effects
Target Range | Duration | Main Effects
Protection switching | < 50 ms | Service “hit”; TCP sees no impact
1 | 50 - 200 ms | Few voiceband disconnects; ATM cell rerouting may start
2 | 200 - 2000 ms | Some switched connections drop; TCP protocol backoff
3 | 2 - 10 s | Switched circuit services drop (X.25); TCP session timeouts; “webpage not available” errors; affects router hello protocol
Outage Effects
Target Range | Duration | Main Effects
4 | 10 s - 5 min | Calls and data sessions terminated; TCP/IP applications time out; users attempt mass redials; routers issue LSAs; topology update, network-wide resynch
“Undesirable” | 5 - 30 min | Routers under heavy reattempts load; minor business/societal impact; noticeable “Internet brownout”
“Unacceptable” | > 30 min | Regulatory reporting required; major societal impacts/risks, headlines; SLA clauses triggered, lawsuits
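The duration ranges on the two Outage Effects slides can be read as a simple lookup. The sketch below encodes them as upper bounds in seconds; the function name and the treatment of exact boundary values (a duration equal to a bound falls into the next, worse range) are assumptions made for illustration.

```python
# Upper bounds in seconds for each target range, taken from the slides.
OUTAGE_RANGES = [
    (0.050, "Protection switching"),
    (0.200, "1"),
    (2.0, "2"),
    (10.0, "3"),
    (300.0, "4"),
    (1800.0, "Undesirable"),
]


def classify_outage(duration_s):
    """Return the target-range label for an outage lasting `duration_s` seconds.

    Boundary handling is an assumption: a duration exactly equal to an
    upper bound is placed in the next (worse) range.
    """
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    for upper_bound, label in OUTAGE_RANGES:
        if duration_s < upper_bound:
            return label
    return "Unacceptable"


if __name__ == "__main__":
    for seconds in (0.03, 0.1, 1.0, 5.0, 60.0, 600.0, 3600.0):
        print(f"{seconds:>8} s -> {classify_outage(seconds)}")
```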
Planning Graded Fault Tolerance
Instantaneous recovery from most significant/frequent failures
– Eliminate human involvement - device level
Fast recovery from other significant or frequent failures
– Also automatic - device or system
Reasonably fast recovery from next tier of failures
– Automated, but may be system / software
Least likely tier - repair and recovery plans
– Includes manual repair, must also plan for liability
Mechanisms for Fault Tolerance
Carefully design-in specific amounts of spare capacity
– spare links/channels/equipment
– bumping low priority traffic
Design network topology for physical diversity
– bi-connected topology (or better)
– diverse routing
– shared risk link group (SRLG) concept
Embed real-time mechanisms to develop/implement “patch plan”
– appropriate protocols and algorithms
– cross-layer interactions
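Two of the physical-diversity ideas above lend themselves to a short check. The sketch below tests whether a topology is bi-connected (it stays connected after any single node is removed) and whether two paths are SRLG-disjoint (they share no risk group). The adjacency-dictionary format, the link and group names, and both function names are hypothetical. The brute-force node-removal test is chosen for clarity and only suits small topologies.

```python
from collections import deque


def is_connected(adj, removed=None):
    """BFS connectivity test on `adj`, ignoring the node `removed` if given."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def is_biconnected(adj):
    """True if the graph survives the loss of any single node.

    `adj` maps each node to the set of its neighbors (undirected, symmetric).
    Graphs with fewer than three nodes are reported as not bi-connected,
    which is a convention chosen for this sketch.
    """
    if len(adj) < 3:
        return False
    return is_connected(adj) and all(is_connected(adj, removed=n) for n in adj)


def srlg_disjoint(path_a, path_b, srlg_of_link):
    """True if two paths (lists of link ids) share no shared risk link group."""
    groups_a = {g for link in path_a for g in srlg_of_link[link]}
    groups_b = {g for link in path_b for g in srlg_of_link[link]}
    return groups_a.isdisjoint(groups_b)


if __name__ == "__main__":
    ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
    chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
    print(is_biconnected(ring), is_biconnected(chain))

    # Hypothetical example: links l1 and l2 run through the same duct.
    srlg = {"l1": {"duct1"}, "l2": {"duct1"}, "l3": {"duct2"}}
    print(srlg_disjoint(["l1"], ["l2"], srlg), srlg_disjoint(["l1"], ["l3"], srlg))
```

The SRLG check shows why link-disjoint routes are not always enough: two distinct links that share a duct fail together, so a working path and its backup should avoid common groups, not just common links.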
Stages of Failure Plan Operation
Failure detection - almost certainly starting from device layer
Failure localization - cooperation between device and software
Failure recovery - software
Failure repair - human
Summary
Faults are real, must plan to address
Faults are diverse, plan must be diverse
Fault tolerance is a system concept, not add-on
– Must plan at various levels
– At each level, must be appropriate response
Must address together with network design problem, hard to achieve after the fact