RM3G: Next Generation Recovery Manager

Steve Zhang and Armando Fox
Stanford University
Design Goals

Overall goal: manage the detection of and recovery from system failures

New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
- The previous generation used end-to-end and exception monitors; RM3G moves to SLTs

Design principles:
- Don't tie ourselves to any particular algorithms; make new algorithms easy to plug in
- Standardize the APIs for observation, analysis, and control of system components
- Provide common services and abstractions to SLT algorithms
- RM itself must also be resilient to failures
© 2004 Steve Zhang
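The "easy to plug in" goal can be illustrated with a minimal sketch. Everything here is hypothetical (the `SLTPlugin` interface and the toy running z-score detector are illustrations, not RM3G's actual API):

```python
from abc import ABC, abstractmethod

class SLTPlugin(ABC):
    """Hypothetical interface an SLT algorithm implements to plug into RM3G."""

    @abstractmethod
    def observe(self, sample: dict) -> None:
        """Consume one observation-point sample (online learning)."""

    @abstractmethod
    def anomaly_score(self) -> float:
        """Current belief that the monitored component is failing, in [0, 1]."""

class MeanDeviationPlugin(SLTPlugin):
    """Toy plug-in: flags samples far from the running mean of one metric."""

    def __init__(self, metric: str, threshold: float = 3.0):
        self.metric = metric
        self.threshold = threshold
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations (Welford's method)
        self.last = 0.0

    def observe(self, sample: dict) -> None:
        x = sample[self.metric]
        self.last = x
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def anomaly_score(self) -> float:
        if self.n < 2:
            return 0.0
        std = (self.m2 / (self.n - 1)) ** 0.5
        if std == 0:
            return 0.0
        z = abs(self.last - self.mean) / std
        return min(1.0, z / self.threshold)
```

Because the plug-in only sees generic `(metric, value)` samples, any algorithm honoring this interface can be swapped in without touching observation or control code.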
RADS Architecture
[Architecture diagram: users and operators interact with clients and servers through distributed middleware, with SLT Services (RM3G) attached to the middleware on both sides. An application-specific overlay network connects PNE edge networks and routers over the commodity Internet and IP networks.]
Design Diagram
[Design diagram: SLT processes, spawned by the SLT Process Server, run SLT plug-ins against components A, B, and C. Supporting services: Data Store Server, SLT Selection Server, Control Server (holding control/observation point descriptors and control policies), RM Process Server, the RMDB, and the Naming & Registration Server. Observation points feed data into the SLT processes; control points act back on the components.]
Collaboration with ACME

ACME is an infrastructure for monitoring, analyzing, and controlling Internet-scale systems
- ACME sensors = RM3G observation points
- ACME actuators = RM3G control points

RM potentially benefits from two ACME features:
- An in-network aggregator that combines data from sensors as it is routed through an overlay network
- A configuration language that specifies under what conditions to trigger actuators

Conversely, ACME could benefit from more powerful sensor data analysis using SLTs
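The trigger idea, conditions over sensor data firing actuators, can be sketched as follows. This is illustrative Python, not ACME's actual configuration language, and the rules shown are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    # Predicate over (aggregated) sensor data: should this rule fire?
    condition: Callable[[dict], bool]
    # Control-point action to trigger; returns a description of the action
    actuator: Callable[[], str]

def evaluate(rules: list, sensor_data: dict) -> list:
    """Fire every actuator whose condition holds on the current data."""
    return [r.actuator() for r in rules if r.condition(sensor_data)]

# Hypothetical rules in the spirit of "under these conditions, trigger this actuator"
rules = [
    Rule(lambda d: d["cpu"] > 0.95, lambda: "restart component"),
    Rule(lambda d: d["heap_mb"] > 900, lambda: "restart app server"),
]
```

For example, `evaluate(rules, {"cpu": 0.97, "heap_mb": 100})` returns `["restart component"]`.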
Observation Points

We want to avoid requiring every component to be individually instrumented
- Components may directly provide their own observation data if they wish (e.g. D-Store and SSM provide their own data for monitoring with Pinpoint)
- Several types of observation data can be collected in an application-generic way:
  - The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports in use)
  - Middleware can provide intra-application data (e.g. interactions between different components of an application)
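As a sketch of the OS-provided, application-generic data described above, the following samples a few metrics for the current process using Python's Unix-only `resource` module. A real observation point would target the monitored component rather than itself, and the field names are illustrative:

```python
import os
import resource

def os_observation_sample() -> dict:
    """One application-generic observation sample, gathered from the OS
    without any cooperation from the observed application."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "pid": os.getpid(),
        "max_rss": usage.ru_maxrss,    # peak resident set size (KB on Linux)
        "user_cpu_s": usage.ru_utime,  # CPU seconds spent in user mode
        "sys_cpu_s": usage.ru_stime,   # CPU seconds spent in kernel mode
    }
```

Samples like this can be fed to any SLT plug-in, which is exactly what makes the approach application-generic.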
SLT Data Services

Abstract information from observation points
- SLT algorithms are spawned for each component in the system as the components are instantiated
- Observation data is stored by the SLT Data Server, possibly in a streaming database

Listen for feedback from SLT algorithms to adjust the data stream as necessary
- Increase the data sampling rate if an anomaly is suspected
- Stop reporting certain data if it is deemed irrelevant

Provide persistent data storage for SLT algorithms
- Remember properties learned from previous analysis of observation data
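The feedback loop above might look like this minimal sketch; the class, method, and metric names are assumptions, not the real Data Server API:

```python
class DataStream:
    """Per-component data stream managed by the SLT Data Server."""

    def __init__(self, base_hz: float = 1.0):
        self.base_hz = base_hz
        self.rate_hz = base_hz
        # Metrics currently being reported to SLT algorithms
        self.reported = {"mem", "cpu", "open_files"}

    def on_feedback(self, anomaly_suspected: bool, irrelevant=frozenset()):
        # Sample faster while an SLT algorithm suspects an anomaly
        self.rate_hz = self.base_hz * 10 if anomaly_suspected else self.base_hz
        # Stop reporting metrics the algorithms deem irrelevant
        self.reported -= set(irrelevant)
```

The key point is the direction of control: algorithms shape the stream they consume, rather than passively receiving a fixed feed.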
Control Points

Assume crash-only components
- Components can be reliably restarted through external means (we can't rely on components restarting themselves cleanly)

Initially, only restart control points are supported:
- Instrument the application server (JBoss) to restart applications and application components
- The OS can restart application servers
- IP-addressable power strips can restart entire nodes

Components can specify a custom control policy
- Leverage ACME's configuration language
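The three restart levels above form an escalation ladder, cheapest restart first. A sketch, with the returned strings standing in for real JBoss, OS, and power-strip hooks:

```python
# Escalating restart control points, cheapest first. The lambdas are
# placeholders for real hooks (JBoss redeploy, OS service restart,
# IP-addressable power strip power-cycle).
RESTART_LADDER = [
    ("component", lambda t: f"jboss: redeploy {t}"),
    ("app_server", lambda t: f"os: restart app server hosting {t}"),
    ("node", lambda t: f"power strip: power-cycle {t}"),
]

def escalate(target: str, failed_attempts: int) -> str:
    """Choose the cheapest restart that hasn't already failed;
    failed_attempts counts lower-level restarts tried so far."""
    level, action = RESTART_LADDER[min(failed_attempts, len(RESTART_LADDER) - 1)]
    return action(target)
```

This matches the crash-only assumption: every level is an external kill-and-restart, never a request for the component to shut itself down cleanly.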
Future Work

"Master" SLT
- Multiple SLTs are run for each component; choosing which SLTs to believe is itself an interesting SLT problem

Support additional types of control points
- Multi-level settings that tune component parameters (e.g. filter level)

Support additional types of observation points
- Use programming-language techniques (e.g. source code transformation) to instrument applications in a generic way

Online SLT algorithms for anomaly detection are not yet mature
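One plausible shape for the "master" SLT is a weighted vote over per-algorithm anomaly scores, discounting algorithms whose past verdicts proved wrong. This is a toy illustration of the open problem, not a proposed design:

```python
def master_score(scores, weights):
    """Combine per-SLT anomaly scores into one belief via weighted average."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

def update_weights(weights, was_correct, eta=0.5):
    """Multiplicative-weights style update: discount SLTs whose last
    verdict disagreed with the observed ground truth."""
    return [w if ok else w * eta for w, ok in zip(weights, was_correct)]
```

Over time the master comes to "believe" the algorithms with the best track record, which is one standard framing of the whom-to-trust problem.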