
Fail-stutter Behavior Characterization of NFS
Jichuan Chang
CS736 Final Project, UW-Madison
December 13, 2002
Motivation
• We want systems to be very Fast and Available!
• Hard to achieve for modern computer systems
– complex interactions among components;
– can’t assume everything is always working perfectly!
• We need a better fault model
– Simpler than the Byzantine model;
– Richer than the fail-stop model;
– Fail-stutter Fault-tolerance [Remzi 01].
[Figure: performance (scale 0 to 10) over time, with levels labeled Stable Performance, Low Performance, and Down. The fail-stutter curve marks a performance fault and a correctness fault; the fail-stop curve marks a single fault.]
Fail-stutter Issues
• Separate performance faults from correctness faults
– What are performance faults?
• Need a performance specification, but how to get the spec.?
• How to distinguish “interference” and performance fault?
– What are correctness faults?
• Correctness should be defined in an end-to-end manner.
• How to diagnose both types of faults?
– Must observe how systems behave!
• Exploit fail-stutter behavior
– Who should be notified about failures, when and how?
– System support: programming tools and runtime support
– Integration with existing systems - less intrusion
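The separation of performance faults from correctness faults raised above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Observation`, `FaultClass`, and `classify`, the throughput-based spec, and the 20% tolerance are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    NONE = "none"
    PERFORMANCE = "performance fault"
    CORRECTNESS = "correctness fault"


@dataclass
class Observation:
    throughput_ops_per_sec: float  # measured end-to-end at the client
    errors: int                    # operations that failed outright


def classify(obs: Observation, spec_ops_per_sec: float,
             tolerance: float = 0.2) -> FaultClass:
    """Classify one measurement window against an assumed performance spec."""
    # Any end-to-end error counts as a correctness fault, whatever the speed.
    if obs.errors > 0:
        return FaultClass.CORRECTNESS
    # No errors, but throughput is well below the spec: a performance fault.
    if obs.throughput_ops_per_sec < spec_ops_per_sec * (1.0 - tolerance):
        return FaultClass.PERFORMANCE
    return FaultClass.NONE
```

The sketch does not address the harder question on this slide, namely telling "interference" apart from a genuine performance fault; both would show up here as reduced throughput.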
Our Approach
• Case study: NFS fail-stutter characterization
– Fault-injection (vs. system monitoring)
– Performance measurement
– Simple, software-based test-bed
• Interesting observations
– Different failed parts have different performance impact
– Different types of clients have different behaviors
• Patient (keep retrying) vs. Impatient (try other servers)
– Transition between performance and correctness faults
• Can be determined proactively by fault-injection;
• Performance spec. could be application-specific.
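The patient / impatient distinction above can be illustrated with two small client loops. This is a minimal sketch, not the clients used in the study: the `send` callback, the attempt limit, and both function names are assumptions introduced for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical transport: returns the reply, or None when the request times out.
Request = Callable[[str], Optional[bytes]]


def patient_client(send: Request, server: str,
                   max_attempts: int) -> Tuple[Optional[bytes], int]:
    """Keep retrying the same server; return (reply, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        reply = send(server)
        if reply is not None:
            return reply, attempt
    return None, max_attempts


def impatient_client(send: Request,
                     servers: Sequence[str]) -> Tuple[Optional[bytes], Optional[str]]:
    """Try each server once, moving on at the first timeout."""
    for server in servers:
        reply = send(server)
        if reply is not None:
            return reply, server
    return None, None
```

A patient client turns packet loss into added latency (more attempts), while an impatient one turns it into failover or an error, which is one way to see why the two client types behave differently under the same fault.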
Experimental Settings
[Diagram: NFS client applications, Click S/W router, NFS server, and storage system, with X marks at three points in the diagram.]
Workloads - SpecSFS97, file (micro-benchmark).
Data to collect - throughput, response time, errors.
Faulty components - network, server, disk, bus, etc.
Fault injection - network packet dropping
– drop k% of Ethernet packets,
– drop k% of IP packets coming from the server.
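Dropping k% of packets amounts to an independent random decision per packet. The sketch below shows that idea in plain Python; it is not Click router configuration and not the test-bed's actual code, and the name `drop_packets` is an assumption.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def drop_packets(packets: Iterable[T], k_percent: float,
                 rng: random.Random) -> Iterator[T]:
    """Yield each packet unless a random draw falls inside the k% drop band."""
    if not 0.0 <= k_percent <= 100.0:
        raise ValueError("k_percent must be between 0 and 100")
    for packet in packets:
        # rng.random() is uniform on [0, 1), so this drops with probability k/100.
        if rng.random() * 100.0 < k_percent:
            continue  # packet is dropped
        yield packet
```

Passing an explicitly seeded `random.Random` keeps a fault-injection run repeatable, which helps when comparing workloads at the same drop rate.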
Results (1) - Patient Client
1. Performance degradation scales with drop probability.
2. Ethernet dropping is less harmful than IP dropping.
3. Performance data is less meaningful once errors occur.
4. Different operations switch to correctness faults at different points (e.g. 5%, 15%, 20%). Total execution time can hide such differences.
[Figure legend: X = error occurred]
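The per-operation switch from performance fault to correctness fault can be located from fault-injection runs by finding the lowest drop rate at which each operation reports an error. A minimal sketch, assuming each run is recorded as a hypothetical (operation, drop rate, error count) tuple:

```python
from typing import Dict, Iterable, Optional, Tuple

# One run: (operation name, drop rate in percent, number of errors observed).
Run = Tuple[str, float, int]


def transition_points(runs: Iterable[Run]) -> Dict[str, Optional[float]]:
    """Map each operation to the lowest drop rate at which it reported an error.

    Operations that never reported an error map to None.
    """
    points: Dict[str, Optional[float]] = {}
    for op, drop_percent, errors in runs:
        points.setdefault(op, None)
        if errors > 0 and (points[op] is None or drop_percent < points[op]):
            points[op] = drop_percent
    return points
```

Reporting one value per operation, rather than a single total execution time, is what keeps the differences between operations visible.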
Results (2) - Impatient Client
1. Throughput decreases linearly as the dropping probability increases.
2. Throughput drops manifest under heavy loads.
3. Response time doesn’t change as much!
4. Ethernet dropping less harmful.
[Figure labels: SpecSFS97; Retry once!]
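One way to check a claim such as "throughput decreases linearly with drop probability" against measured points is an ordinary least-squares line fit. The sketch below is an assumption about how one might do that, not the analysis behind these slides; `fit_line` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

With drop rates as `xs` and measured throughput as `ys`, a negative slope with small residuals would be consistent with the linear trend; the intercept estimates the fault-free throughput.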
Summary
• Modern computer system design needs a
better fault-tolerance model.
• Using fault-injection to characterize NFS fail-stutter behavior.
• Preliminary observations address some of
the fail-stutter issues
– How to separate different types of faults?
– They suggest that we can extract a performance
specification by fault-injection and probing.
Future Work
• Very-short-term
– More classes of faults
– More realistic fault injection
• Short-term
– Separate “interference” and performance fault
– Extract/refine performance specifications
– Performance-fault diagnosis
• Long-term
– Detailed model for a specific workload / system
– System support for fail-stutter failures