Mendosus: A SAN-Based Fault Injection Test

Mendosus
A SAN-Based Fault Injection Test-Bed for Construction of Highly Available Network Services
Xiaoyan Li, Richard Martin, Kiran Nagaraja, Thu D. Nguyen and Bin Zhang
Dept. of Computer Science, Rutgers University
http://www.panic-lab.rutgers.edu
Talk Outline
Motivation
Design
Implementation
Benchmarks
Case Studies
Related Work
Future Work

Motivation
Ubiquitous network access → exponential growth in network services
Availability is one key challenge
  Networked systems comprise large numbers of heterogeneous components
  Faults are not uncommon
  Complex interactions between components
  Examples of costly failures: eBay, Britannica
Currently difficult to assess service availability
  How to analyze the impact of failures?
  How to set up an appropriate test-bed?

Mendosus
Goal: provide infrastructure for service designers to assess the availability of network services
Overview:
  Provide flexible infrastructure to accurately model a variety of different networking systems from the application’s point of view
  Run application in real time and inject faults to assess application’s behavior
Two key components:
  Real-time emulation of a variety of interconnects
  General fault injection infrastructure

Vision

Map available resources to emulated network
Design
Mendosus Architecture
[Figure: Mendosus architecture. Labels: Central Controller, User Level, Mendosus daemon, Events, Applications, Emulator Module (Routing, Fault Inclusion, Network State, Latency), Kernel, Fast & Reliable SAN]
Design Decisions

Central controller
  Advantage: consistent network and fault information
  Disadvantage: limits scalability
    Not involved in network emulation, so should still scale well to targeted system sizes (thousands or tens of thousands of components)
Entire network state is maintained at each end node
  Advantage: performance
  Disadvantage: limits scalability
    Only maintain state for LAN
Emulation module embedded within kernel
  Advantage: no modifications to application code
  Disadvantage: more difficult to modify and extend

Functional Components

Topology Maintenance

Fault Injection

Emulation
Topology Maintenance

Specification - simple ns-2-like topology scripts
  Specify available resources
Central controller manages topology
  Initializes original topology on each node
  Consistent view
Real-time topology changes
  Specified as scripted events
Controller monitors network connectivity
  Detects partitions
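
One way to picture the controller's partition detection is as a connected-components computation over the links that are currently up. The sketch below assumes exactly that; the function name `find_partitions`, the node names, and the link representation are illustrative and are not taken from Mendosus.

```python
from collections import defaultdict, deque

def find_partitions(nodes, up_links):
    """Group nodes into partitions, given the links that are currently up.

    nodes: iterable of node names; up_links: iterable of (a, b) pairs.
    Returns a list of sorted node lists, one per partition.
    """
    adjacency = defaultdict(set)
    for a, b in up_links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    partitions = []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbor in sorted(adjacency[current]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        partitions.append(sorted(component))
    return partitions
```

More than one returned group means the emulated network is partitioned; a scripted link-down event would simply remove a pair from `up_links` before the next check.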
Fault Injection

Every network component can have a fault profile
  Switches, hubs, NICs, links, end nodes
Fault specification:
  Trace files or theoretical distributions
  Exponential, Weibull, constant
Simulate fail-stop components
  MTTR - constant or following a distribution
  E.g. unplugging, port shutdown
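
As a rough illustration of a distribution-driven fault profile, the sketch below generates fail/repair times for one fail-stop component using Python's standard `random` module. The function name `fault_schedule`, its parameters, and the choice of a constant repair time are assumptions made for this example; they do not reflect the actual Mendosus fault specification format.

```python
import random

def fault_schedule(ttf_param, mttr, horizon, dist="exponential", shape=1.5, seed=0):
    """Return (fail_time, repair_time) pairs for one fail-stop component.

    ttf_param: mean time to failure (exponential), scale parameter (weibull),
               or fixed time to failure (constant).
    mttr: constant repair time (an assumption of this sketch).
    horizon: end of the emulated run; failures at or after it are ignored.
    """
    rng = random.Random(seed)
    samplers = {
        "exponential": lambda: rng.expovariate(1.0 / ttf_param),
        "weibull": lambda: rng.weibullvariate(ttf_param, shape),
        "constant": lambda: ttf_param,
    }
    sample_uptime = samplers[dist]

    events = []
    now = 0.0
    while True:
        fail_at = now + sample_uptime()
        if fail_at >= horizon:
            break
        repair_at = fail_at + mttr
        events.append((fail_at, repair_at))
        now = repair_at  # the component is up again after the repair
    return events
```

A trace-file profile would replace the sampler with times read from the trace; the rest of the loop would stay the same.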

Emulation

Completely distributed
  Every node has enough network state
Emulation messaging sequence
  Application initiates communication
  Routing – determine route
  Fault Inclusion – effect of injected faults
  Latency – corresponding to route taken
We do not implement the innards of network components
  Switching
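
The messaging sequence above can be sketched as a three-step function: look up the route, check it against the injected faults, then charge the latency of the route taken. This is a simplified illustration, not the kernel module's code; `emulate_send`, the route table layout, and the per-component latency map are hypothetical.

```python
def emulate_send(route_table, failed, hop_latency_us, src, dst):
    """Return the one-way latency in microseconds, or None if the packet is dropped.

    route_table maps (src, dst) to the ordered list of components on the path.
    failed is the set of components currently marked faulty.
    hop_latency_us maps each component to its assumed latency contribution.
    """
    # Routing: look up the path for this source/destination pair.
    path = route_table.get((src, dst))
    if path is None:
        return None  # no route, e.g. the nodes are in different partitions

    # Fault inclusion: a fail-stop component anywhere on the path drops the packet.
    if any(component in failed for component in path):
        return None

    # Latency: sum the per-component delays along the chosen route.
    return sum(hop_latency_us[component] for component in path)
```

Each component contributes only an up/down state and a delay here, in keeping with the point that the internals of switches and other components are not modeled.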
Implementation
Ethernet LAN Emulation

Routing
  Emulate computation of Ethernet spanning tree
  Controller chooses root of tree
  Emulator on each node computes identical spanning tree
  Reconfiguration performed periodically (every 2 seconds)
Broadcast & Multicast
  Emulate using a sequence of unicasts
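
To show how every node can arrive at an identical tree without exchanging messages, the sketch below builds a spanning tree by breadth-first search from the controller-chosen root, visiting neighbors in a fixed order. It is a deliberate simplification: real Ethernet spanning tree computation uses bridge IDs and path costs, and the function name `spanning_tree` and the adjacency format are assumptions for this example.

```python
from collections import deque

def spanning_tree(adjacency, root):
    """Return {node: parent} for a BFS spanning tree rooted at root.

    Neighbors are visited in sorted order so that every node running this
    function on the same topology and root derives the same tree.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbor in sorted(adjacency.get(current, ())):
            if neighbor not in parent:
                parent[neighbor] = current
                queue.append(neighbor)
    return parent
```

Because the result depends only on the topology and the root, a periodic reconfiguration amounts to rerunning the function on the latest network state at each node.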
Ethernet LAN Emulation - Faults

Network partitions
  Controller monitors connectivity
  Multiple roots - one for each partition
NIC fail-over
  Multiple interfaces using IP aliasing support in Linux
Emulation completeness…
Feature                                  Ethernet                Emulated Ethernet
P-to-P                                   Yes                     Yes
Multicast                                Hardware                Software (broadcast w/ filters)
Broadcast                                Hardware                Software (multiple unicast)
Layer 3, 4 services (e.g. VLAN, IGMP)    Some advanced switches  Not implemented
Micro-benchmarks
Emulation Limits
Network            No. of switches in topology   Throughput (MB/sec)   RTT (usec)
Fast Ethernet      1                             11.8                  88.9
Gigabit Ethernet   0                             66.0                  130.0
Emulator           1                             79.6                  53.4
Emulator           8                             79.1                  54.8
Software Broadcast Scaling
Fault View Convergence
Case Studies
Group Membership

Test protocol behavior under faults
  Subtle interactions in distributed protocols
Three Round Membership algorithm
  Robust against multiple node failures, packet drops and network partitions
  Two modes of operation: normal and FCM

Membership Observations
[Figure: network diagram with labels A, B, C, D and L, annotated with event markers 1-5]
1. NIC failure at B
2. Link L down
3. NIC at B recovers
4. Packet drops at A
5. Link L up
Multi-Level Switched Network

Large enterprise LANs have multiple layers of network components
  Access, core and aggregation switches
How to evaluate availability vs. cost vs. complexity?
Study service availability with increased redundancy
  Faults following exponential distributions
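
For intuition about why redundancy helps, the sketch below applies the standard steady-state formulas: a repairable component is available for the fraction MTTF / (MTTF + MTTR) of the time, and a group of independent replicas is available when at least one of them is. This is a back-of-envelope model under an independence assumption, not the measurement method used in the case study; the function names are illustrative.

```python
def component_availability(mttf, mttr):
    """Steady-state availability of one repairable component."""
    return mttf / (mttf + mttr)

def redundant_availability(availability, replicas):
    """Availability of `replicas` independent copies where any one suffices."""
    return 1.0 - (1.0 - availability) ** replicas
```

An emulated run is still needed to capture what these formulas ignore, such as fail-over delays and faults that affect several components at once.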
Enterprise LAN
Availability Vs Redundancy
Related Work

Network Emulation
  Distributed emulation
    Emulab [Utah], DelayLine
  Centralized emulation
    NISTNET, Lancaster emulator
Fault injection
  Script-based probing and fault injection
    Orchestra, DOCTOR
  Correlated faults
    Loki [UIUC]
Simulation
  NS-2, REAL [Cornell], SSFNet, x-sim [Arizona]
Future Work

Extend Mendosus to emulate other networks
  WAN: build in performance dynamics model
  Wireless LAN: realistic fault and performance models
Support pluggable modules within network components that add functionality and additional failures
  Intelligent routing protocols (e.g. HSRP)
  Dynamic DNS, RR DNS

Summary
Test-bed for service designers to systematically analyze network and protocol designs against failures
Results show that real-time emulation is feasible given the capabilities of current SANs
Demonstrated the flexibility and usefulness of Mendosus through two case studies
Another step towards building highly available services…
