Optimization of Fault Mitigation for Large Commodity Clusters

Cluster Reliability Project
ISIS
Vanderbilt University
Purpose
Put together a set of software tools that
allows us to easily create “pluggable”
components that
– Monitor our clusters (hardware, utilization,
errors, jobs, etc.)
– Recognize anomalous conditions (inference
rules, model comparison, probabilities)
– Take actions to correct or alleviate the
problems (triggered by other components)
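One way the "pluggable" monitor / recognizer / reactor split could look is sketched below. This is a minimal illustration, not the project's API: the registries, the decorator names, the `Reading` type, and the 85 °C threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Reading:
    node: str
    metric: str
    value: float


# Hypothetical plug-in registries (illustrative names, not from the slides).
MONITORS: List[Callable[[], List[Reading]]] = []
RECOGNIZERS: List[Callable[[Reading], Optional[str]]] = []
REACTORS: Dict[str, Callable[[Reading], str]] = {}


def monitor(fn):
    MONITORS.append(fn)
    return fn


def recognizer(fn):
    RECOGNIZERS.append(fn)
    return fn


def reactor(condition):
    def register(fn):
        REACTORS[condition] = fn
        return fn
    return register


@monitor
def fake_temperature_monitor():
    # Stand-in for a real hardware probe; values are made up.
    return [Reading("node01", "cpu_temp_c", 91.0),
            Reading("node02", "cpu_temp_c", 55.0)]


@recognizer
def overheating(reading):
    # Simple inference rule: flag readings above an assumed threshold.
    if reading.metric == "cpu_temp_c" and reading.value > 85.0:
        return "overheat"
    return None


@reactor("overheat")
def drain_node(reading):
    # A real reactor would call the scheduler; here we only describe the action.
    return f"drain {reading.node}"


def run_cycle():
    """Collect readings, recognize conditions, and trigger matching reactors."""
    actions = []
    for mon in MONITORS:
        for reading in mon():
            for rec in RECOGNIZERS:
                condition = rec(reading)
                if condition in REACTORS:
                    actions.append(REACTORS[condition](reading))
    return actions
```

New monitors or reactors are added by decorating a function, so the core loop does not change when a component is plugged in.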
Goals
Increase the availability, utilization, and
reliability of the computing clusters
Reduce the time it takes to diagnose
problems
Reduce the administrative workload
associated with operating the clusters
Automate routine administration tasks
Monitor the use of the clusters
Allow other systems and user scripts to
easily interact with this system
Some basic features include:
Identifying types of information that must be collected or
communicated,
APIs for communicating information and the creation of monitors
and reactors,
Popular programming language bindings,
An environment for writing, testing, and releasing new reactors and
monitors,
A basic set of problems that must be addressed, the monitors,
reactors and data that they will communicate,
Recording of monitoring information and actions taken,
Administration tools that allow for single-point release distribution
and installation, and control of the runtime environment,
A configuration system that allows for uniform parameter setting and
allows for tuning to adjust the performance impact on the system.
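The configuration item above could be approached as shared defaults plus per-monitor overrides, with a single knob that stretches every polling interval to reduce overhead. The sketch below assumes a hypothetical `CONFIG` layout and a `slowdown` parameter; neither comes from the slides.

```python
# Hypothetical configuration layout: uniform defaults, per-monitor overrides.
CONFIG = {
    "defaults": {"interval_s": 60, "enabled": True},
    "monitors": {
        "ipmi": {"interval_s": 300},
        "user_proc": {"interval_s": 30},
    },
}


def settings_for(name, config, slowdown=1.0):
    """Merge defaults with overrides for one monitor.

    `slowdown` is an assumed tuning knob: values above 1.0 poll less often,
    trading freshness of data for lower impact on the cluster.
    """
    merged = dict(config["defaults"])
    merged.update(config["monitors"].get(name, {}))
    merged["interval_s"] = merged["interval_s"] * slowdown
    return merged
```

For example, `settings_for("disk", CONFIG, slowdown=2.0)` falls back to the default interval and doubles it.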
ISIS Goals
Monitor the health (performance,
utilization, state) of all processors and
networks in the system (leveraging
existing tools and standards).
It should closely monitor performance and
the status of a job, and work together with
the workflow subsystem, ensuring good
progress for larger analysis campaigns
that are being conducted.
It should be coupled to the application.
Mitigation actions depend on the
properties of the application and its overall
workflow.
ISIS Research
In this project we will address the critical difficulties in achieving a
fault mitigation framework for a large cluster that is configurable
and strives to minimize its effect on the performance of the cluster.
To this end, we will use a model-based design approach that relies on
domain-specific modeling languages and model transformers to
enable system design with domain-specific, higher-level
abstractions, and that builds on the monitoring and control framework.
Overall planned deliverables
– Customizable Monitoring & Control framework.
– Mitigation engine.
– Monitoring and mitigation design tool.
– Monitoring and mitigation system generator.
Current activities
Strong desire to leverage existing
infrastructures for networked-system
health monitoring.
The evaluation criteria are:
– Architecture: centralized vs. hierarchical
– Monitoring: available sensors as well as
configurability for new sensors.
– Scalability: cluster size vs. resource
consumption
– Handling: smart data mining and virtual
sensors.
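A "virtual sensor" in the sense of the last criterion can be read as a value derived from one or more raw sensors rather than measured directly. The sketch below is one possible interpretation; the function name and the raw sensor names are assumptions.

```python
def virtual_sensor(raw, inputs, combine):
    """Derive a value from raw readings.

    raw:     dict mapping raw sensor name -> latest value
    inputs:  names of the raw sensors this virtual sensor depends on
    combine: function applied to those values, in order
    """
    return combine(*(raw[name] for name in inputs))


# Illustrative raw readings (made-up numbers).
raw = {"mem_used_mb": 6144.0, "mem_total_mb": 8192.0}

# Virtual sensor: fraction of memory in use, derived from two raw sensors.
mem_fraction = virtual_sensor(
    raw, ["mem_used_mb", "mem_total_mb"], lambda used, total: used / total
)
```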
Near-term major work
Nov/Dec 06:
– Get all job information and some currently measured
attributes into a database
– Complete evaluation of products
Jan/Feb 07:
– Additions to current code to record resource
utilization, IPMI information, errors, and actions
– Create test system for selected monitoring product
Mar/Apr 07:
– Replicate existing functionality in the new framework
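The first milestone, getting job information, errors, and actions into a database, might start from something like the sketch below. It uses Python's standard `sqlite3` module; the table and column names are assumptions chosen for illustration, not the project's schema.

```python
import sqlite3


def open_job_db(path=":memory:"):
    """Open (or create) a small job-tracking database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, owner TEXT, node_count INTEGER, state TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "job_id TEXT, kind TEXT, detail TEXT)"
    )
    return conn


def record_job(conn, job_id, owner, node_count, state):
    # INSERT OR REPLACE keeps one row per job as its state changes.
    conn.execute(
        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
        (job_id, owner, node_count, state),
    )
    conn.commit()


def record_event(conn, job_id, kind, detail):
    # kind might be "error", "action", or "utilization".
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (job_id, kind, detail))
    conn.commit()


def events_for(conn, job_id):
    """Return (kind, detail) pairs for a job in insertion order."""
    cur = conn.execute(
        "SELECT kind, detail FROM events WHERE job_id = ? ORDER BY rowid",
        (job_id,),
    )
    return cur.fetchall()
```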
General architecture
[Figure: Head Node Functions – Final System. Labels recovered from the diagram: Coordinator; To/From Subordinates; Database; Bookkeeping Database; Archivers; Alarm Presenters; Email Senders; Action Takers; Job Scanner; Job Checker; PBS; Maui; Acct Log; qstat; monitors for Phys Attr, Help Ticket, IB Fabric, IP Network, Email, User Proc, Service, Disk, IPMI, dCache, PBS, IB HCA, and Storage.]
[Figure: Worker Functions – Final System. Labels recovered from the diagram: Coordinator; To/From Manager; Bookkeeping Database; Action Takers (restart services, report success/fail, recycle drivers, reboot machine); monitors for IP Network, Phys Attr, CPU state, IPMI, User Proc, Driver, Uptime, Job Class/Profile, Job Resource, and Job Activity (activity timing; running, staging, etc.).]
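The worker-side actions named in this diagram (restart services, recycle drivers, reboot machine, report success/fail) suggest an escalation ladder. The sketch below assumes that actions are tried from least to most disruptive and that a health check runs after each one; that ordering and the function names are assumptions, not something the slides state.

```python
def escalate(actions, is_healthy):
    """Try actions in order until the node reports healthy.

    actions:    list of (name, callable) pairs, least disruptive first
    is_healthy: callable returning True once the problem is resolved
    Returns a success/fail report listing the actions attempted.
    """
    attempted = []
    for name, act in actions:
        act()
        attempted.append(name)
        if is_healthy():
            return {"success": True, "attempted": attempted}
    return {"success": False, "attempted": attempted}


# Simulated node: restarting services does not help, recycling drivers does.
state = {"healthy": False}

ladder = [
    ("restart_services", lambda: None),
    ("recycle_drivers", lambda: state.update(healthy=True)),
    ("reboot_machine", lambda: state.update(healthy=True)),
]

report = escalate(ladder, lambda: state["healthy"])
```

In this simulated run the reboot is never attempted, because the second step already restores the node.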
Backup slides follow
Model-Based Approach
[Figure: model-based approach. Design / Modeling Environment (modeling in GME: fault handling, process dataflow, hardware configuration); System Health Monitor (status/diagnostics/prognostics, monitoring); Analysis Campaigns; Fault Mitigation Engine; System Planner (Global Manager, Resource Planner, Multi-Campaign Manager); Actuator; Job Control; Cluster Computing Resources/Grid; Dynamic Job Scheduling & Resource Allocation. Run time: System Health Monitoring (OpenNMS / Aware), PBS Monitor, Service Monitor, Fault Mitigation Engines. The Worker Functions – Final System diagram is repeated as an inset, with IB HCA and Storage monitors added.]
Proposed Architecture
[Figure: proposed architecture. Design / Modeling Environment; System Health Monitor (status/diagnostics/prognostics, monitoring); Analysis Campaigns; Fault Mitigation Engine; System Planner (Global Manager, Resource Planner, Multi-Campaign Manager); Actuator; Job Control; Cluster Computing Resources/Grid; Dynamic Job Scheduling & Resource Allocation.]
System Health Monitoring
– Cluster
– Campaign
– Application
– Leverage existing tools
and standards
Mitigation
– Application
– Campaign
– Cluster
Both monitoring and
mitigation must be
synchronized across
related jobs.
Design/Modeling
environment for deploying
campaigns, monitoring
and mitigation policies.
Motivation
Jobs on the LQCD cluster are usually long-running and
interdependent.
A failure on one node can have a domino effect on
other nodes.
We cannot rely on job-level fault tolerance, because it:
– is computationally expensive
– decreases performance
– makes synchronization between related jobs
difficult.
We need a cluster-wide fault-tolerant framework that does
monitoring and mitigation and is integrated with the
scheduling framework.
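The domino effect described above can be made concrete by treating job interdependence as a graph: a node failure hits the jobs running on it, and then every job that depends on them. The sketch below is an illustration under assumed data structures (`job_nodes`, `depends_on`) and made-up job names.

```python
from collections import deque


def affected_jobs(failed_node, job_nodes, depends_on):
    """Return jobs hit by a node failure, directly or through dependencies.

    job_nodes:  job name -> set of nodes the job runs on
    depends_on: job name -> list of jobs it needs to finish first
    """
    # Invert the dependency map: prerequisite -> jobs waiting on it.
    dependents = {}
    for job, prereqs in depends_on.items():
        for prereq in prereqs:
            dependents.setdefault(prereq, []).append(job)

    # Start from jobs running on the failed node, then walk downstream.
    start = sorted(j for j, nodes in job_nodes.items() if failed_node in nodes)
    seen = set(start)
    order = list(start)
    queue = deque(start)
    while queue:
        job = queue.popleft()
        for dep in dependents.get(job, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Hypothetical campaign: three chained jobs plus one unrelated job.
job_nodes = {
    "gen_configs": {"n1", "n2"},
    "propagators": {"n3"},
    "analysis": {"n4"},
    "unrelated": {"n5"},
}
depends_on = {
    "propagators": ["gen_configs"],
    "analysis": ["propagators"],
    "unrelated": [],
}

hit = affected_jobs("n2", job_nodes, depends_on)
```

A cluster-wide framework can use such a view to mitigate at the campaign level instead of treating each job in isolation.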
Fault Mitigation Engine
Two mitigation schemes:
– Reflex-based scheme, e.g. relocate jobs,
shut down jobs, rewire nodes.
– Planning-based scheme, e.g. optimize the
campaign, reschedule jobs.
Mitigation schemes integrated with the
workflow and job scheduling system.
Constraints: deadlines, resource
consumption.
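The contrast between the two schemes can be sketched as follows: a reflex is a fixed fault-to-action lookup, while planning re-orders remaining work against deadline constraints. Everything here is an assumption for illustration: the rule table, the fallback to the planner, and the use of earliest-deadline-first on a single resource as a stand-in for a real campaign optimizer.

```python
# Reflex scheme: immediate, table-driven responses (hypothetical rules).
REFLEX_RULES = {
    "node_down": "relocate_job",
    "runaway_process": "shutdown_job",
    "link_down": "rewire_node",
}


def reflex(fault):
    """Return the immediate action for a fault, or defer to the planner."""
    return REFLEX_RULES.get(fault, "escalate_to_planner")


def plan(jobs, now=0.0):
    """Planning scheme: reschedule jobs earliest-deadline-first.

    jobs: list of (name, duration, deadline) run one after another on a
          single resource. Returns (name, start_time, meets_deadline) tuples.
    """
    schedule = []
    t = now
    for name, duration, deadline in sorted(jobs, key=lambda j: j[2]):
        start = t
        t += duration
        schedule.append((name, start, t <= deadline))
    return schedule


# Made-up remaining jobs after a fault: (name, duration, deadline).
remaining = [("b", 4.0, 10.0), ("a", 3.0, 5.0), ("c", 5.0, 11.0)]
new_schedule = plan(remaining)
```

In this example job "c" would miss its deadline, which is the kind of result a planner would feed back to the workflow system.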
Approach: model-based generators to
transform the designs into components
and configurations for the runtime system,