Mendosus
A SAN-Based Fault Injection Test-Bed for
Construction of
Highly Available Network Services
Xiaoyan Li, Richard Martin, Kiran Nagaraja,
Thu D. Nguyen and Bin Zhang
Dept. of Computer Science, Rutgers University
http://www.panic-lab.rutgers.edu
Talk Outline
Motivation
Design
Implementation
Benchmarks
Case Studies
Related Work
Future Work
Motivation
Ubiquitous network access has led to exponential growth
in network services
Availability is one key challenge
Networked systems are composed of large numbers of
heterogeneous components
Faults are not uncommon
Complex interaction between components
Examples of costly failures: eBay, Britannica
Currently difficult to assess service availability
How to analyze impact of failures?
How to set up an appropriate test-bed?
Mendosus
Goal: provide infrastructure for service designers
to assess the availability of network services
Overview:
Provide flexible infrastructure to accurately model a
variety of different networking systems from the
application’s point-of-view
Run application in real-time and inject faults to assess
application’s behavior
Two key components:
Real-time emulation of a variety of interconnects
General fault injection infrastructure
Vision
Map available resources to emulated network
Design
Mendosus Architecture
[Figure: Mendosus architecture. Labeled elements: Central Controller; User Level: Mendosus daemon, Applications; Events; Kernel: Emulator Module with Routing, Fault Inclusion, Latency, Network State; Fast & Reliable SAN.]
Design Decisions
Central controller
Advantage: consistent network and fault information
Disadvantage: limits scalability
Not involved in network emulation so should still scale well to
targeted system sizes (thousands or tens of thousands of components)
Entire network state is maintained at each end node
Advantage: performance
Disadvantage: limits scalability
Only maintain state for LAN
Emulation module embedded within kernel
Advantage: no modifications to application code
Disadvantage: more difficult to modify and extend
Functional Components
Topology Maintenance
Fault Injection
Emulation
Topology Maintenance
Specification - simple ns-2 like topology scripts
Specify available resources
Central controller manages topology
Initializes original topology on each node
Consistent view
Real time topology changes
Specified as scripted events
Controller monitors network connectivity
Detects partitions
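The partition detection mentioned above can be sketched as a connected-components search over the current topology. This is a minimal illustration, not the Mendosus controller code: the function name `find_partitions`, the node names, and the link-list representation are all assumptions made for the example.

```python
from collections import deque

def find_partitions(nodes, links, failed_links=frozenset()):
    """Group nodes into connected components, ignoring failed links.

    Hypothetical helper: nodes is a list of names, links is a list of
    (a, b) pairs, failed_links is a set of pairs that are currently down.
    """
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        if (a, b) in failed_links or (b, a) in failed_links:
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)

    partitions, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        partitions.append(sorted(component))
    return partitions

# Illustrative topology: two hosts on sw1, one host on sw2.
nodes = ["n1", "n2", "sw1", "sw2", "n3"]
links = [("n1", "sw1"), ("n2", "sw1"), ("sw1", "sw2"), ("sw2", "n3")]
print(find_partitions(nodes, links))
print(find_partitions(nodes, links, failed_links={("sw1", "sw2")}))
```

With the inter-switch link marked as failed, the search returns two groups, which is the condition a controller would report as a partition.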
Fault Injection
Every network component can have a fault profile
Switches, hubs, NICs, links, end nodes
Fault specification:
trace files or theoretical distributions
Exponential, Weibull, constant
Simulate fail-stop components
E.g. unplugging, port shutdown
MTTR - constant or follow a distribution
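A fault profile of this kind can be turned into a timeline of fail and recover events by alternately sampling a time-to-failure and a time-to-repair. The sketch below assumes a hypothetical tuple format for the distribution specification; it is not the Mendosus fault specification syntax, which the slides do not show.

```python
import random

def sample_time(spec, rng):
    """Draw one duration from a (kind, params...) spec (hypothetical format)."""
    kind = spec[0]
    if kind == "constant":
        return spec[1]
    if kind == "exponential":
        return rng.expovariate(1.0 / spec[1])        # spec[1] is the mean
    if kind == "weibull":
        return rng.weibullvariate(spec[1], spec[2])  # scale, shape
    raise ValueError(f"unknown distribution: {kind}")

def fault_schedule(ttf_spec, ttr_spec, horizon, seed=0):
    """Alternate fail/recover events for one fail-stop component."""
    rng = random.Random(seed)
    now, events = 0.0, []
    while True:
        now += sample_time(ttf_spec, rng)   # time until the next failure
        if now >= horizon:
            break
        events.append((now, "fail"))
        now += sample_time(ttr_spec, rng)   # time until repair (MTTR)
        if now >= horizon:
            break
        events.append((now, "recover"))
    return events

# Exponential failures with a mean of 100 time units, constant repair time of 5.
for when, what in fault_schedule(("exponential", 100.0), ("constant", 5.0), 500.0):
    print(f"{when:8.2f}  {what}")
```

Fixing the seed makes a run repeatable, which is useful when the same fault sequence has to be replayed against different service configurations.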
Emulation
Completely distributed
Every node has enough network state
Emulation Messaging sequence
Application initiates communication
Routing – determine route
Fault Inclusion – effect of injected faults
Latency – corresponding to route taken
We do not implement the innards of network
components
Switching
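The three-step messaging sequence can be illustrated with a small function that looks up a route, checks it against the set of failed components, and sums per-component latency. All names here (`emulate_send`, the route table, the latency values) are assumptions for the example; the real emulator is a kernel module and its data structures are not described in the slides.

```python
def emulate_send(src, dst, routes, failed, latency_us):
    """Return the emulated delay in microseconds, or None if dropped.

    Hypothetical sketch of the per-message steps:
      1. Routing: find the components on the path from src to dst.
      2. Fault inclusion: drop the message if any component has failed.
      3. Latency: add up the delay of each component on the route.
    """
    route = routes.get((src, dst))
    if route is None:
        return None
    if any(component in failed for component in route):
        return None
    return sum(latency_us[component] for component in route)

# Illustrative route through one switch; latencies are made-up values.
routes = {("n1", "n2"): ["nic1", "link1", "sw1", "link2", "nic2"]}
latency_us = {"nic1": 5.0, "link1": 1.0, "sw1": 10.0, "link2": 1.0, "nic2": 5.0}

print(emulate_send("n1", "n2", routes, set(), latency_us))     # 22.0
print(emulate_send("n1", "n2", routes, {"sw1"}, latency_us))   # None
```

Each component is treated as an opaque element with a state and a delay, in line with the choice not to model component internals.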
Implementation
Ethernet LAN Emulation
Routing
Emulate computation of Ethernet spanning tree
Controller chooses root of tree
Emulator on each node computes identical spanning tree
Reconfiguration performed periodically (every 2 secs)
Broadcast & Multicast
Emulate using sequence of unicast
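For every node to compute an identical spanning tree, the computation has to be deterministic given the same topology and root. One way to get that, sketched below, is a breadth-first search from the root that visits neighbors in sorted order. This is a simplification chosen for illustration: it is not the IEEE spanning tree protocol (which uses bridge IDs and path costs) and not necessarily what Mendosus does. The function and switch names are hypothetical.

```python
from collections import deque

def spanning_tree(links, root):
    """Return {node: parent} for a BFS tree rooted at root.

    Neighbors are visited in sorted order, so any node that holds the
    same link set and root derives exactly the same tree.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

# Three switches in a triangle plus one host; the redundant s2-s3 link
# is left out of the tree.
links = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s3", "h1")]
print(spanning_tree(links, root="s1"))
```

Because the result does not depend on the order in which links are stored, nodes only need to agree on the topology and on the root chosen by the controller.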
Ethernet LAN Emulation - Faults
Network partitions
Controller monitors connectivity
Multiple roots - one for each partition
NIC fail-over
Multiple interfaces using IP aliasing support in Linux
Emulation completeness…
Feature                                 Ethernet                 Emulated Ethernet
P-to-P                                  Yes                      Yes
Multicast                               Hardware                 Software (broadcast w/ filters)
Broadcast                               Hardware                 Software (multiple unicast)
Layer 3, 4 services (e.g. VLAN, IGMP)   Some advanced switches   Not implemented
Micro-benchmarks
Emulation Limits
Network            No. of switches in topology   Throughput (MB/sec)   RTT (usec)
Fast Ethernet      1                             11.8                  88.9
Gigabit Ethernet   0                             66.0                  130.0
Emulator           1                             79.6                  53.4
Emulator           8                             79.1                  54.8
Software Broadcast Scaling
Fault View Convergence
Case Studies
Group Membership
Test protocol behavior under faults
subtle interactions in distributed protocols
Three Round Membership algorithm
Robust against multiple node failures, packet drops and
network partitions
Two modes of operation: normal and FCM
Membership Observations
[Figure: test network with nodes A, B, C, and D and link L, annotated with events 1-5 listed below.]
1. NIC failure at B
2. Link L down
3. NIC at B recovers
4. Packet drops at A
5. Link L up
Multi-Level Switched Network
Large enterprise LANs have multiple layers of
network components
Access, core and aggregation switches
How to evaluate availability vs. cost vs.
complexity?
Study service availability with increased
redundancy
Faults following exponential distributions
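As a rough point of comparison for such a study, the textbook steady-state model gives availability as MTTF / (MTTF + MTTR), with redundant copies combined in parallel and dependent layers combined in series. The sketch below assumes independent failures and uses made-up MTTF and MTTR values; it is an analytical back-of-the-envelope model, not the emulation-based measurement performed with Mendosus.

```python
def availability(mttf, mttr):
    """Steady-state availability of a single component."""
    return mttf / (mttf + mttr)

def redundant(a, copies):
    """Availability of `copies` independent replicas, any one of which suffices."""
    return 1.0 - (1.0 - a) ** copies

def series(*parts):
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for a in parts:
        result *= a
    return result

# Illustrative numbers only: each switch has MTTF 990 h and MTTR 10 h.
switch = availability(990.0, 10.0)
for copies in (1, 2, 3):
    # An access layer and a core layer, each with `copies` redundant switches.
    path = series(redundant(switch, copies), redundant(switch, copies))
    print(copies, round(path, 6))
```

The model shows the diminishing return from each added replica, but it cannot capture fail-over delays or protocol interactions, which is where running the real service on an emulated network adds information.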
Enterprise LAN
Availability vs. Redundancy
Related Work
Network Emulation
Distributed emulation
Emulab [Utah], DelayLine
Centralized emulation
NISTNET, Lancaster emulator
Fault injection
Script-based probing and fault injection
Orchestra, DOCTOR
Correlated faults
Loki [UIUC]
Simulation
NS-2, REAL [Cornell], SSFNet, x-sim [Arizona]
Future Work
Extend Mendosus to emulate other networks
WAN: Build in performance dynamics model
Wireless LAN - Realistic fault and performance models
Support pluggable modules within network
components that add functionality and
additional failure modes
Intelligent Routing protocols (E.g. HSRP)
Dynamic DNS, RR DNS
Summary
Test-bed for service designers to systematically
analyze network and protocol design against
failures
Results show that real-time emulation is feasible
given the capability of current SANs
Demonstrated the flexibility and usefulness
of Mendosus through two case studies
Another step towards building highly available
services…