CS300 Fall `00

CompSci 296.2
Self-Managing Systems
Shivnath Babu
Today
• Wrap up sample projects
• ROC discussion
Sample Projects
• NIMO
• Fa
• Combining structured & unstructured data
• Projects using Nagios
• Projects using IBM autonomic computing toolkit
NIMO: NonInvasive Modeling for Optimization
• Build performance models for scientific apps
– Automatic, online, and noninvasive
• Projects
– Study many scientific apps (e.g., 140 bio apps in BioPortal) → characterize behavior, good models
– “Steal app”, build and refine model
– Incorporate NIMO in a “grid” scheduler (Condor, Globus)
– Optimization problems in scheduling workflows
Fa
• Testbed to study:
– Whether we can automate problem prediction and diagnosis
– Relationship among problems, causes, data, & models
• Projects
– Models for predicting performance problems (online)
– Models and mechanisms for root-cause queries
– Others
Structured and Unstructured Data
• Combined querying/mining of structured and unstructured system data
– Structured data: time series of CPU utilization
– Unstructured data (free text): system error log
• Ex: Characterize system state when a specific error occurs
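A minimal sketch of the example above, assuming a CPU-utilization time series and a free-text error log that both carry timestamps. The sample data, the log format, the error string, and the helper names (`error_times`, `cpu_before`, `characterize`) are hypothetical and chosen only for illustration: the code finds when a given error appears in the log and summarizes CPU utilization in a window just before each occurrence.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical structured data: (timestamp, CPU utilization in percent).
cpu_samples = [
    (datetime(2024, 1, 1, 12, 0, 0), 35.0),
    (datetime(2024, 1, 1, 12, 1, 0), 40.0),
    (datetime(2024, 1, 1, 12, 2, 0), 90.0),
    (datetime(2024, 1, 1, 12, 3, 0), 95.0),
    (datetime(2024, 1, 1, 12, 4, 0), 30.0),
]

# Hypothetical unstructured data: free-text log lines (format is an assumption).
error_log = [
    "2024-01-01 12:00:30 INFO  request served",
    "2024-01-01 12:03:10 ERROR connection pool exhausted",
]


def error_times(log_lines, needle):
    """Return the timestamps of log lines that contain the text `needle`."""
    times = []
    for line in log_lines:
        if needle in line:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS" in this format.
            times.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    return times


def cpu_before(samples, when, window):
    """Return CPU readings taken in the interval [when - window, when]."""
    return [value for ts, value in samples if when - window <= ts <= when]


def characterize(samples, log_lines, needle, window=timedelta(minutes=2)):
    """Average CPU utilization shortly before each occurrence of the error."""
    values = []
    for when in error_times(log_lines, needle):
        values.extend(cpu_before(samples, when, window))
    return mean(values) if values else None


avg_cpu_near_error = characterize(cpu_samples, error_log, "connection pool exhausted")
overall_avg_cpu = mean([value for _, value in cpu_samples])
```

Comparing `avg_cpu_near_error` with `overall_avg_cpu` gives a rough picture of how the system state around the error differs from normal operation; a real project would use many more metrics and a proper text-mining step instead of substring matching.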
Add New Features to Current Systems
• Add problem-prediction capability to Nagios
• Add root-cause querying to Nagios
• Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework
• Remember the “mechanism projects”
– Undo, virtualization, active probing
ROC: Recovery-Oriented Computing
• Complaints about current systems
– Focus only on performance → availability & maintainability are neglected
– Focus on MTTF of individual components → MTTR is neglected
– MTTF of system << MTTF of individual components
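The last point can be illustrated with a small calculation. This is a sketch under simplifying assumptions that the slides do not state: components fail independently, each has a constant failure rate (1 / MTTF), and the system fails as soon as any one component fails. Under those assumptions the failure rates add up, so the system MTTF is the reciprocal of the summed rates. The function name and the numbers are hypothetical.

```python
def series_system_mttf(component_mttfs):
    """MTTF of a system that fails when any component fails.

    Assumes independent components with constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    total_failure_rate = sum(1.0 / mttf for mttf in component_mttfs)
    return 1.0 / total_failure_rate


# Hypothetical example: 100 components, each with an MTTF of 100,000 hours.
system_mttf_hours = series_system_mttf([100_000.0] * 100)
```

With these numbers the system MTTF comes out at about 1,000 hours, two orders of magnitude below that of any single component.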
ROC Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time”
— Shimon Peres (“Peres’s Law”)
• People/HW/SW failures are facts, not problems
• Recovery/repair is how we cope with the above facts
ROC focuses on fast repair vs. the old focus on longer time between failures
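Why fast repair matters as much as rare failure can be seen from the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical numbers to compare doubling MTTF with halving MTTR; the function name is an assumption made for illustration.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1,000 hours, takes 10 hours to repair.
baseline = availability(1000.0, 10.0)

# Old focus: double the time between failures.
doubled_mttf = availability(2000.0, 10.0)

# ROC focus: halve the repair time instead.
halved_mttr = availability(1000.0, 5.0)
```

Both changes yield the same availability (2000/2010 equals 1000/1005), so cutting repair time is as effective as a proportional increase in MTTF.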
ROC Principles
• Recovery experiments: benchmarking recovery
• Pinpoint: Automatic problem diagnosis
• Recursive restart: Innovative use of reboot
• App and system undo
• Defense in depth: ROC at hardware level
Discussion
• Strong point: Comprehensive; relates to other fields
• Margin of safety for systems
– Current examples?
– How to incorporate?
• Negative point: Evolution vs. revolution?
– What approach is the project taking?
• At what level should we support Undo?
– Transaction, application, system
– Pros and cons
• Benchmarking availability/recovery (TOC?)
– How can you claim that a system is 99.999% available?
• Dealing with the automation irony
– Fire drills
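To make the 99.999% availability question above concrete, it helps to convert an availability figure into a downtime budget. This is a minimal sketch assuming a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_fraction):
    """Downtime per year permitted by a given availability level."""
    return (1.0 - availability_fraction) * MINUTES_PER_YEAR


five_nines_budget = allowed_downtime_minutes(0.99999)
three_nines_budget = allowed_downtime_minutes(0.999)
```

Five nines leaves roughly 5.3 minutes of downtime per year, while three nines leaves about 8.8 hours. A claim this tight is hard to back up without long observation periods or a recovery benchmark.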