Transcript Part II

PhD Course: Performance & Reliability Analysis
of IP-based Communication Networks
Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen
• Day 1: Basics & Simple Queueing Models (HPS)
• Day 2: Traffic Measurements and Traffic Models (MC)
• Day 3: Advanced Queueing Models & Stochastic Control (HPS)
• Day 4: Network Models and Application (HS)
• Day 5: Simulation Techniques (HPS, SA)
• Day 6: Reliability Aspects (HPS)
Organized by HP Schwefel & H Schiøler
PHD Course: Performance & Reliability Analysis, Day 6 Part II, Spring04
Page 1
Hans Peter Schwefel
Content
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability
3. Availability and Network Management
   • Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
   a) SW Fault-Tolerance
   b) Network Availability
      • IP Routing, HSRP/VRRP, SCTP
   c) Server/Service Availability
      • Server Farms, Cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems
Motivation: Failure Types in Communication NWs
WANs
• PSTNs
• IP networks designed to be 'robust' but not highly available
• End-to-end availability
  – Public Internet: below 99% (see e.g. http://www.netmon.com/IRI.asp)
  – PSTN: 99.94%
• Operation, Administration, and Maintenance (OAM) is one major source of outage times
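For perspective, these availability levels translate directly into yearly downtime; a quick illustrative computation (Python, added here, not from the slides):

```python
def downtime_per_year(availability):
    """Yearly downtime (in hours) implied by a steady-state availability."""
    return (1.0 - availability) * 365 * 24

print(round(downtime_per_year(0.99), 1))    # 87.6 h/year (public Internet)
print(round(downtime_per_year(0.9994), 1))  # 5.3 h/year  (PSTN)
```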
Overview: Role of OAM
[Diagram: OAM functional areas contributing to system high availability]
• System High Availability: overcome outages, keep the system stable
  – Network HA: network design, network redundancy/fail-over, multi-homing, stacks and protocols
  – Network Element HA: NE interface, IP fail-over, redundancy and diversity, state replication, message distribution
  – Data HA: backup and restore, regional redundancy, rolling data upgrade
• Error Handling: detection and recovery, tracing and logging, escalation and alarming, operational modes
• Supervision & Auditing: resource supervision, health/lifetime supervision, overload protection
• Startup & Shutdown: system startup, graceful shutdown
• Continuous Execution
• Rolling Upgrade: rolling upgrade, patch procedure, migration
• Fault Management: repair and replacement, HA control & verification, system diagnosis
Adapted from S. Uhle, Siemens ICM N
OAM Concepts I: Startup/Upgrade/Shutdown
• Startup
  – Put components into a stable operational state
  – Caution: potential for high load (synchronisation etc.) in the start-up phase
  – A concept for starting up larger sub-systems is needed (e.g. mutual dependencies)
• Upgrades
  – Rolling upgrade
    • Allows upgrading a single component while its replicas remain in operation
    • Consistency/data-compatibility problems
  – Patch concept (SW)
    • Incremental changes instead of full SW re-installation
    • Goal: zero or minimal outage of components
• Graceful shutdown
  – Take components safely out of operation
  – Possible steps:
    • Stop accepting / redirect new tasks
    • Finish existing tasks
    • Synchronize data and isolate the component
• HW: hot plug/swap capability
  – E.g. interface cards
• Life testing
OAM Concepts II: Monitoring/Supervision
• Resource supervision and overload protection
  – E.g. CPU load, queue lengths, traffic volumes
  – Alarming → operator intervention, e.g. upgrades
  – Signalling to reduce overload at the source
  – Graceful degradation
• Logging & tracing
  – For off-line analysis of incidents
  – Correlation of traces for system analysis
  – Problem: adequate granularity of logging data
• Service concepts/contracts
  – Reaction to alarms
  – System recovery modes
  – Spare-parts handling
  – Qualified technicians, availability (24/7?)
• High availability of the OAM system itself
  – Redundancy (frequently separate OAM network(s))
  – Prioritisation of OAM traffic
  – Handling/storing of OAM/logging data
General Approaches: Fault-tolerance
• Basic requirements for fault-tolerance
  – Number of fault types and number of faults is bounded
  – Existence of redundancy (structural: identical/diverse; functional; information; time redundancy/retries)
• Functional parts of fault-tolerant systems
  – Fault detection (& diagnosis)
    • Replication and comparison (if realisations are identical → not suitable for design errors!)
    • Inversion (e.g. mathematical functions)
    • Acceptance tests (necessary conditions on the result, e.g. value ranges)
    • Timing behavior (time-outs)
  – Fault isolation: prevent spreading
    • Isolation of functional components, e.g. atomic actions, layering model
  – Fault recovery
    • Backward: retry, journalling (restart), checkpointing & rollback (non-trivial in distributed cooperating systems!)
    • Forward: move to a consistent, acceptable, safe new state; but loss of the result
    • Compensation, e.g. TMR, FEC
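Two of these mechanisms, acceptance-test-based fault detection and backward recovery via checkpoint and retry, can be sketched as follows (Python; the function names, the value-range check, and the retry limit are illustrative assumptions, not from the slides):

```python
def acceptance_ok(result, lo=0.0, hi=100.0):
    """Fault detection via acceptance test: a necessary condition
    on the result (here: a plausible value range)."""
    return lo <= result <= hi

def with_backward_recovery(task, checkpoint, retries=3):
    """Backward recovery: on a detected fault, roll back to the
    checkpointed state and retry; give up after `retries` attempts."""
    state = dict(checkpoint)          # rollback point
    for _ in range(retries):
        try:
            result = task(state)
        except Exception:
            state = dict(checkpoint)  # rollback after a crash fault
            continue
        if acceptance_ok(result):     # detection via acceptance test
            return result
        state = dict(checkpoint)      # rollback after a value fault
    raise RuntimeError("fault persists after retries")

# Example: a task that fails twice with a transient fault, then succeeds.
attempts = {"n": 0}
def flaky(state):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient fault")
    return state["x"] * 2

print(with_backward_recovery(flaky, {"x": 21}))  # 42
```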
Software fault-tolerance
• Mainly design errors & user interaction (as opposed to production errors, wear-out, etc.)
• Observations/estimates (experience in computing centers; the numbers are somewhat dated):
  – 0.25–10 errors per 1000 lines of code
  – Only about 30% of user error reports are accepted as errors by the vendor
  – Reaction times (updates/patches): weeks to months
  – Reliability has not improved nearly as much as for hardware, for various reasons:
    • immensely increased software complexity/size
    • faster HW → more operations executed per time unit → higher hazard rate
  – However, relatively short down-times (as opposed to HW): ca. 25 min
• Approaches for fault-tolerance
  – Software diversity (N-version programming, back-to-back tests of components)
    • However, the same specification often leads to similar errors
    • Forced diversity as improvement (?)
    • Majority votes for complex result types are not trivial
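N-version programming with a majority vote can be sketched as follows (Python; the three 'versions' and the vote function are illustrative; note the vote relies on results being directly comparable, which is exactly what becomes non-trivial for complex result types):

```python
from collections import Counter

def majority_vote(versions, x):
    """Run independently developed versions of the same function and
    accept the result only if a strict majority agrees."""
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count > len(versions) // 2:
        return value
    raise RuntimeError("no majority among versions: %r" % results)

# Three 'diverse' implementations of squaring, one with a design error:
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: x * x + 1   # faulty version

print(majority_vote([v1, v2, v3], 7))  # 49
```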
Network Connectivity: Basic Resilience
• Loss of network connectivity
  – Link failures (cable cut)
  – Router/component failures along the path
• Basic resilience features on (almost) every protocol layer
  – L1+L2: FEC, link-layer retransmission, resilient switching (e.g. Resilient Packet Ring)
  – L3, IP: dynamic routing
  – L4, TCP: retransmissions
  – L5-7, Application: application-level retransmissions, loss-resilient coding (e.g. VoIP, video)
• IP-layer network resilience: dynamic routing, e.g. OSPF
  – 'Hello' packets used to determine adjacencies and link states
  – Missing hello packets (typically 3) indicate outages of links or routers
  – Link states propagated through link-state advertisements (LSAs)
  – Updated link-state information (adjacencies) leads to modified path selection
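The hello-based outage detection can be sketched as follows (Python; the 10 s hello interval and the threshold of 3 missed hellos follow the slide's "typically 3", but concrete OSPF timers are configuration-dependent):

```python
def dead_neighbors(last_hello, now, hello_interval=10.0, missed=3):
    """Declare a neighbor down once `missed` consecutive hello
    packets are overdue."""
    deadline = missed * hello_interval
    return sorted(n for n, t in last_hello.items() if now - t > deadline)

# Times (in seconds) at which the last hello was heard from each neighbor:
last_hello = {"R1": 95.0, "R2": 60.0, "R3": 88.0}
print(dead_neighbors(last_hello, now=100.0))  # ['R2']
```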
Dynamic Routing: Improvements
• Drawbacks of dynamic routing
  – Long duration until routing tables have re-converged (30 s up to several minutes)
  – Rerouting not possible if the first router (gateway) fails
• Improvements
  – Speed-up: pre-determined secondary paths (e.g. via MPLS)
  – 'Local' router redundancy:
    Hot Standby Router Protocol (HSRP, RFC 2281) & Virtual Router Redundancy Protocol (VRRP, RFC 2338)
    • Multiple routers on the same LAN
    • Master performs packet routing
    • Fail-over by migration of the 'virtual' MAC address
[Figure: NE 1 and NE 2 connect via HUB 1 and HUB 2 to Router 1 and Router 2, which together form one virtual router; a single IP address of the virtual router is visible in the network, for client transparency.]
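The master election behind HSRP/VRRP can be sketched as a priority rule over the live routers (Python; the priority values and router names are illustrative):

```python
def elect_master(routers):
    """VRRP-style election: the live router with the highest priority
    owns the virtual router (and thus its virtual MAC/IP address)."""
    alive = [r for r in routers if r["up"]]
    if not alive:
        return None
    return max(alive, key=lambda r: r["priority"])["name"]

routers = [
    {"name": "Router1", "priority": 200, "up": True},
    {"name": "Router2", "priority": 100, "up": True},
]
print(elect_master(routers))   # Router1 is master
routers[0]["up"] = False       # master fails...
print(elect_master(routers))   # ...Router2 takes over
```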
Stream Control Transmission Protocol (SCTP)
• Defined in RFC 2960 (see also RFC 3257, 3286)
• Initial purpose: signalling transport
• Features
  – Reliable, full-duplex unicast transport (performs retransmissions)
  – TCP-friendly flow control (+ many other features of TCP)
  – Multi-streaming, in-sequence delivery within streams
    → avoids head-of-line blocking (performance issue)
  – Multi-homing: hosts with multiple IP addresses, path monitoring (heart-beat mechanism), transparent failover to secondary paths
• Useful for provisioning of network reliability
[Figure: an SCTP association between Host A (addresses IPa1, IPa2) and Host B (addresses IPb1, IPb2), running over separate networks.]
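The multi-homing fail-over logic can be sketched as follows (Python; the heart-beat outcomes are fed in from outside, and the address names follow the figure):

```python
class MultihomedPeer:
    """Monitor paths with heart-beats; fail over transparently to a
    secondary path when the primary stops answering."""
    def __init__(self, paths):
        self.paths = list(paths)            # ordered: primary first
        self.alive = {p: True for p in paths}

    def heartbeat(self, path, acked):
        """Record the outcome of the latest heart-beat on `path`."""
        self.alive[path] = acked

    def current_path(self):
        """First live path in preference order, or None if all are down."""
        for p in self.paths:
            if self.alive[p]:
                return p
        return None

peer = MultihomedPeer(["IPb1", "IPb2"])
print(peer.current_path())           # IPb1 (primary)
peer.heartbeat("IPb1", acked=False)  # primary path stops answering
print(peer.current_path())           # IPb2 (transparent failover)
```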
Resilience for state-less applications
• State-less applications
  – Server response depends only on the client request
  – Principle in most IETF protocols, e.g. HTTP
  – If at all, state is stored in the client, e.g. 'cookies'
• Redundancy in 'server farms' (same LAN)
  – Multiple servers
  – IP address mapped to (replicated) 'load balancing' units
    • HW load balancers
    • SW load balancers
  – Load balancers forward requests to a specific server MAC address according to a server-selection policy, e.g.
    • Random selection
    • Round-robin
    • Shortest response time first
• Server failure
  – Load balancer no longer selects the failed server
• Examples: web services, firewalls
[Figure: request flow (steps 1-5) from the client through the Internet and a firewall/switch to a gateway node (with backup), which dispatches to the service nodes and returns the response.]
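Two of the server-selection policies above can be sketched in Python (server names and response times are illustrative; 'shortest response time first' here assumes the balancer keeps per-server average response times):

```python
import itertools
import random

def round_robin(servers):
    """Round-robin: cycle through the servers in fixed order."""
    it = itertools.cycle(servers)
    return lambda: next(it)

def random_selection(servers, rng=random.Random(0)):
    """Random selection: pick a server uniformly at random."""
    return lambda: rng.choice(servers)

def shortest_response_time_first(servers, avg_rt):
    """Pick the server with the lowest average response time."""
    return lambda: min(servers, key=lambda s: avg_rt[s])

servers = ["S1", "S2", "S3"]
rr = round_robin(servers)
print([rr() for _ in range(4)])   # ['S1', 'S2', 'S3', 'S1']

srtf = shortest_response_time_first(servers, {"S1": 30, "S2": 12, "S3": 45})
print(srtf())                     # S2

# On server failure, the balancer simply drops the server from the list:
alive = [s for s in servers if s != "S3"]
```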
Cluster Framework
• SW layer across several servers within the same subnetwork
• Functionality
  – Load-balancer/dispatching functionality
  – Node management / node supervision / node elimination
  – Network connection: IP aliasing (unsolicited/gratuitous ARP reply)
    → fail-over times of a few seconds (up to 10 minutes if the router/switch does not support unsolicited ARP!)
• Single-image view (one IP address for the cluster)
  – For communication partners
  – For OAM
• Possible platform: server blades
Reliability middleware: Example RTP
• Software layer on top of the cluster framework
  – All fault-tolerant operations are done within the cluster
  – One virtual IP address for the whole cluster (a UDP dispatcher redirects incoming messages) → "one system image"
  – Transparency for the client
• Redundancy at the level of processes:
  – Each process is replicated among the servers of the cluster
  – Only a faulty process is failed over
• Notion of node disappears:
  – Use of logical names (replicated processes can be in the same node)
• Middleware functionality (example: Reliable Telco Platform, FSC)
  – Context management
  – Reliable inter-process communication
  – Load balancing for network traffic (UDP/SS7 dispatcher)
  – Ticket manager
  – Events/alarming
  – Support of rolling upgrade
  – Node manager
    • Supervision of processes
    • Communication of process status to other nodes
  – Trace manager (debugging)
  – Administrative framework (CLI, GUI, SNMP)
Example Architecture: Cluster middleware
[Figure: layered cluster-middleware architecture. On top: applications (IN, payment, Parlay/OSA, GSM HLR management, third-party applications, agents) on Apps-APIs and a Common Application Framework, providing the single-image view. Below: a Component Framework and the Reliable Telco Platform (RTP) with OPS/RAC, Signalware, Networker, and common CORBA basic functions; a Cluster Framework (PRIMECLUSTER, Sun Cluster); and, on each node, the OS (e.g. Solaris) on the HW.]
Distributed Redundancy: Reliable Server Pooling
• Distributed architecture
  – Servers need only an IP address
  – Entities offering the same service are grouped into pools, accessible by a pool handle
  – [State sharing among pool elements]
• A pool user (PU) contacts servers (pool elements, PE) after receiving the response to a name-resolution request sent to a name server (NS)
• Fail-over
  – Name server monitors the PEs
  – Messages for dynamic registration and de-registration
  – Flat name space
• Architecture and pool-access protocols (ASAP, ENRP) defined in the IETF RSerPool WG
• Failure detection and fail-over performed in the ASAP layer of the client
[Figure: a pool user resolves a pool name via the name servers and accesses pool elements PE (A) and PE (B) in the server pool.]
ASAP = Aggregate Server Access Protocol
ENRP = Endpoint Name Resolution Protocol
RSerPool: More Details
• RSerPool scenario
  – Each PE contains a full implementation of the functionality; no distributed sub-processes (different from RTP)
    → reduced granularity for possible load balancing
• RSerPool name space
  – Flat name space → easier management, better performance (no recursive requests)
  – All name servers in an operational scope hold the same information (about all pools in the operational scope)
• Load balancing
  – Load factors sent by the PE to the name server (initiated by the PE)
  – In resolution requests, the name server communicates the load factors to the PU
  – The PU can use the load factors in its selection policies
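A load-factor-based selection policy on the PU side might look as follows (Python; the inverse-load weighting is one plausible policy, not prescribed by RSerPool, and the pool/load values are illustrative):

```python
import random

def weighted_select(pool, load, rng=random.Random(1)):
    """Pick a pool element with probability inversely proportional to
    its reported load factor (lightly loaded PEs are preferred)."""
    weights = [1.0 / max(load[pe], 1e-9) for pe in pool]
    return rng.choices(pool, weights=weights, k=1)[0]

pool = ["PE-A", "PE-B"]
load = {"PE-A": 0.9, "PE-B": 0.1}   # PE-B is nearly idle
picks = [weighted_select(pool, load) for _ in range(1000)]
print(picks.count("PE-B") > picks.count("PE-A"))  # True
```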
RSerPool Protocols: Functionality
(Current status in IETF)
• ENRP
  + State sharing between name servers
• ASAP
  + (De-)registration of pool elements (PE, name server)
  + Supervision of pool elements by a name server
  + Name resolution (PU, name server)
  + PE selection according to policy (& load factors) (PU)
  + Failure detection based on the transport layer (e.g. SCTP timeout)
  + Support of fail-over to another pool element (PU-PE)
  + Business cards (pool-to-pool communication)
  + Last will
  + Simple cookie mechanism (usable for pool-element state information) (PU-PE)
  + Under discussion: application-layer acknowledgements (PU-PE)
RSerPool and Server Blades
• Server blades
  – Many processor cards in one 19" rack → space efficiency
  – Integrated switch (possibly duplicated) for external communication
  – Backplane for internal communication (duplicated)
  – Management support: automatic configuration & code replication
• Combination: RSerPool on server blades
  – No additional load-balancing SW necessary (but less granularity)
  – Works on any heterogeneous system without additional implementation effort (e.g. server blades + Sun workstations)
  – Standardized protocol stacks, no implementation effort in clients (except for use of the RSerPool API)
References/Acknowledgments
• M. Bozinovski, T. Renier, K. Larsen: 'Reliable Call Control', project reports and presentations, CPK/CTIF and Siemens ICM N, 2001-2004.
• Siemens ICM N, S. Uhle and colleagues.
• Fujitsu Siemens Computers (FSC): 'Reliable Telco Platform', slides.
• E. Jessen: 'Dependability and Fault-Tolerance of Computing Systems', lecture notes (in German), TU Munich, 1996.
Summary
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability
3. Availability and Network Management
   • Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
   a) SW Fault-Tolerance
   b) Network Availability
      • IP Routing, HSRP/VRRP, SCTP
   c) Server/Service Availability
      • Server Farms, Cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems
Course Overview & Feedback
PhD Course: Overview
• Day 1: Basics & Simple Queueing Models (HPS)
• Day 2: Traffic Measurements and Traffic Models (MC)
• Day 3: Advanced Queueing Models & Stochastic Control (HPS)
• Day 4: Network Models and Application (HS)
• Day 5: Simulation Techniques (SA)
• Day 6: Reliability Aspects (HPS, HS)
Day 1: Basics & Simple Queueing Models
1. Intro & review of basic stochastic concepts
   • Random variables, exponential distributions, stochastic processes, Poisson process & CLT
   • Markov processes: discrete time, continuous time, Chapman-Kolmogorov equations, steady-state solutions
2. Simple queueing models
   • Kendall notation
   • M/M/1 queues, birth-death processes
   • Multiple servers, load dependence, service disciplines
   • Transient analysis
   • Priority queues
3. Simple analytic models for bursty traffic
   • Bulk-arrival queues
   • Markov-modulated Poisson processes (MMPPs)
4. MMPP/M/1 queues and quasi-birth-death processes
Day 2: Traffic Measurements and Models
(Mark)
• Morning: Performance Evaluation
  – Marginals, heavy tails
  – Autocorrelation
  – Superposition of flows, alpha and beta traffic
  – Long-range dependence
• Afternoon: Network Engineering
  – Single-link analysis
    • Spectral analysis, wavelets, forecasting, anomaly detection
  – Multiple links
    • Principal component analysis
Day 3: Advanced Queueing Models
1. Matrix-Exponential (ME) Distributions
   • Definition & properties: moments, distribution function
   • Examples
   • Truncated power-tail distributions
2. Queueing Systems with ME Distributions
   • M/ME/1//N and M/ME/1 queues
   • ME/M/1 queue
3. Performance Impact of Long-Range-Dependent Traffic
   • N-Burst model, steady-state performance analysis
4. Transient Analysis
   • Parameters, first-passage-time analysis (M/ME/1)
5. Stochastic Control: Markov Decision Processes
   • Definition
   • Solution approaches
   • Example
Day 4: Network Models (Henrik)
1. Queueing Networks
   • Jackson networks
   • Flow-balance equations, product-form solution
2. Deterministic Network Calculus
   • Arrival curves
   • Service curves: concatenation, prioritization
   • Queue lengths, output flow
   • Loops
   • Examples: leaky buckets
Day 5: Simulation Techniques
(Hans, Soeren)
1. Basic concepts
   • Principles of discrete-event simulation
   • Simulation tools
2. Random number generation
3. Output analysis
   • Terminating simulations
   • Non-terminating/steady-state case
4. Rare-event simulation I: importance sampling
5. Rare-event simulation II: adaptive importance sampling
Day 6: Reliability Aspects
1. Basic concepts
   • Fault, failure, fault-tolerance, redundancy types
   • Availability, reliability, dependability
   • Life-times, hazard rates
2. Mathematical models
   • Availability models, life-time models, repair models
   • Performability models
3. Availability and Network Management
   • Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
   a) SW Fault-Tolerance
   b) Network Availability
      • IP Routing, HSRP/VRRP, SCTP
   c) Server/Service Availability
      • Server Farms, Cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems