RM3G: Next Generation Recovery Manager
Steve Zhang and Armando Fox
Stanford University
Design Goals
Overall Goal: Manage the detection of and
recovery from system failures
New in 3G: Focus on online Statistical
Learning Theory (SLT) algorithms for
application-generic failure detection
The previous generation used end-to-end and
exception monitors
Avoid tying ourselves to any particular algorithm,
and make new algorithms easy to plug in
Standardize the APIs for observation, analysis,
and control of system components
Provide common services and abstractions to
SLT algorithms
RM itself must also be resilient to failures
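The plug-in and standard-API goals above can be pictured as a small registry behind a common interface. This is only a sketch under stated assumptions, not RM3G's actual API: the names `SLTPlugin`, `register_plugin`, `spawn_plugins`, and `ThresholdDetector`, and the `observe`/`is_anomalous` methods, are all hypothetical, and the threshold check is a toy stand-in rather than a real SLT algorithm.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class SLTPlugin(ABC):
    """Hypothetical common interface that every detection algorithm implements."""

    @abstractmethod
    def observe(self, sample: Dict[str, float]) -> None:
        """Consume one observation from a component."""

    @abstractmethod
    def is_anomalous(self) -> bool:
        """Report whether the component currently looks faulty."""


# Hypothetical registry mapping a plug-in name to a factory.
_PLUGINS: Dict[str, Callable[[], SLTPlugin]] = {}


def register_plugin(name: str):
    """Class decorator that makes an algorithm available by name."""
    def wrap(cls):
        _PLUGINS[name] = cls
        return cls
    return wrap


@register_plugin("threshold")
class ThresholdDetector(SLTPlugin):
    """Toy stand-in for an SLT algorithm: flags memory above a fixed limit."""

    def __init__(self, limit_mb: float = 512.0) -> None:
        self.limit_mb = limit_mb
        self.last = 0.0

    def observe(self, sample: Dict[str, float]) -> None:
        self.last = sample.get("memory_mb", 0.0)

    def is_anomalous(self) -> bool:
        return self.last > self.limit_mb


def spawn_plugins(names: List[str]) -> List[SLTPlugin]:
    """Instantiate the requested plug-ins for one component."""
    return [_PLUGINS[n]() for n in names]
```

Adding a new algorithm then means writing one class and decorating it; the code that spawns and queries detectors does not change.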
© 2004 Steve Zhang
RADS Architecture
[Architecture diagram. Labels: User; Operator; Client; Server; Distributed Middleware (two instances); SLT Services (RM3G); Edge Network with PNE (two instances); Application-Specific Overlay Network; Router (two instances); Commodity Internet & IP networks]
Design Diagram
[Design diagram. Labels: Comp A, Comp B, Comp C; Observation Points; Control Points; SLT Processes (spawned by SLT Proc Srv); SLT Plug-ins; Data Store Srv; SLT Select Srv; Ctrl Srv; Name & Reg Srv; RM; RMDB; Ctrl/Obsrv point descriptors; Control policies]
Collaboration with ACME
Infrastructure for monitoring, analyzing, and controlling
Internet-scale systems
Sensors = Observation Points
Actuators = Control Points
RM potentially benefits from two ACME features
An in-network aggregator combines sensor data as it is
routed through an overlay network
Configuration language that specifies under what conditions to
trigger actuators
ACME could benefit from more powerful sensor data
analysis using SLTs
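In-network aggregation of the kind described above is commonly built on mergeable partial summaries: each overlay node folds its own sensor readings into the summaries forwarded by its children and passes a single summary upward. The sketch below illustrates that general idea only; it does not reflect ACME's actual interfaces, and `Summary` and `aggregate_at_node` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Summary:
    """Mergeable partial aggregate of sensor readings."""
    count: int
    total: float
    maximum: float

    @staticmethod
    def of(value: float) -> "Summary":
        """Summary of a single reading."""
        return Summary(1, value, value)

    def merge(self, other: "Summary") -> "Summary":
        """Combine two summaries without needing the raw readings."""
        return Summary(self.count + other.count,
                       self.total + other.total,
                       max(self.maximum, other.maximum))

    @property
    def mean(self) -> float:
        return self.total / self.count


def aggregate_at_node(local_readings: Iterable[float],
                      child_summaries: Iterable[Summary]) -> Summary:
    """Fold this node's readings together with its children's summaries."""
    parts = [Summary.of(v) for v in local_readings] + list(child_summaries)
    if not parts:
        raise ValueError("nothing to aggregate")
    result = parts[0]
    for part in parts[1:]:
        result = result.merge(part)
    return result
```

Because `merge` is associative and commutative, the result at the root does not depend on the shape of the overlay tree.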
Observation Points
We want to avoid requiring every component to be
individually instrumented
Components may directly provide their own observation data if
they wish (e.g. D-store and SSM provide their own data for
monitoring with Pinpoint)
Several types of observation data can be collected in an
application-generic way
OS can provide application-level data (e.g. memory usage,
number of open files, etc.) and system-level data (e.g. size of
swap space, network ports in use, etc.)
Middleware can provide intra-application data (e.g. interactions
between different components of an application)
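As an illustration of application-generic observation through the OS, the sketch below samples memory usage and open-file count for an arbitrary process without instrumenting it. It assumes a Linux-style `/proc` filesystem and returns `None` for any value it cannot read; `sample_process` and its field names are hypothetical and not part of RM3G.

```python
import os
from typing import Dict, Optional


def _read_vm_rss_kb(pid: int) -> Optional[int]:
    """Resident memory in kB from /proc/<pid>/status, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def _count_open_files(pid: int) -> Optional[int]:
    """Number of entries in /proc/<pid>/fd, or None if unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def sample_process(pid: int) -> Dict[str, Optional[int]]:
    """Collect one observation for a process, using only OS-provided data."""
    return {
        "pid": pid,
        "rss_kb": _read_vm_rss_kb(pid),
        "open_files": _count_open_files(pid),
    }
```

For example, `sample_process(os.getpid())` samples the calling process itself; any other readable process ID works the same way.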
SLT Data Services
Abstracts information from observation points
SLT algorithms are spawned for each component in the system
as that component is instantiated
Observation data is stored by the SLT Data Server, possibly in a
streaming database
Listens for feedback from SLT algorithms to adjust the
data stream as necessary
Increase data sampling rate if anomaly is suspected
Stop reporting certain data if it is deemed to be irrelevant
Provide persistent data storage for SLT algorithms
Remember properties learned from previous analysis of
observation data
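The feedback loop described above, where SLT algorithms ask for a faster sampling rate or drop irrelevant data, might look like the following. This is a minimal sketch under assumptions: `ObservationStream`, its method names, the halving policy, and the default intervals are all hypothetical choices made for illustration.

```python
from typing import Dict, Set


class ObservationStream:
    """Hypothetical per-component data stream that SLT feedback can adjust."""

    def __init__(self, metrics: Set[str], interval_s: float = 10.0,
                 min_interval_s: float = 1.0) -> None:
        self.metrics = set(metrics)
        self.interval_s = interval_s
        self.min_interval_s = min_interval_s

    def suspect_anomaly(self) -> None:
        """Halve the sampling interval (double the rate), down to a floor."""
        self.interval_s = max(self.min_interval_s, self.interval_s / 2)

    def mark_irrelevant(self, metric: str) -> None:
        """Stop reporting a metric that an algorithm deems irrelevant."""
        self.metrics.discard(metric)

    def filter(self, sample: Dict[str, float]) -> Dict[str, float]:
        """Keep only the metrics that are still being reported."""
        return {k: v for k, v in sample.items() if k in self.metrics}
```

A real data server would also need a way to relax the rate again once the suspicion clears; that is left out here to keep the sketch short.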
Control Points
Assumes crash-only components
Components can be reliably restarted through external means
(can’t rely on components restarting themselves cleanly)
Initially, only restart control points are supported
Instrument application server (JBoss) to restart applications and
application components
OS can restart application servers
IP addressable power strips can restart entire nodes
Components can specify a custom control policy
Leverage ACME’s configuration language
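The restart control points listed above operate at three scopes: application component, application server, and node. A control service could treat them uniformly through a small descriptor, as sketched below. All names here (`RestartControlPoint`, `ControlService`, `make_stub`) are hypothetical, and the restart actions are placeholders that only record the request; no real JBoss, OS, or power-strip call is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RestartControlPoint:
    """Hypothetical descriptor for one externally triggered restart action."""
    scope: str                      # "component", "app_server", or "node"
    target: str                     # name of the thing to restart
    action: Callable[[str], None]   # performs the restart from outside the target


class ControlService:
    """Looks up registered control points and invokes them on request."""

    def __init__(self) -> None:
        self._points: Dict[Tuple[str, str], RestartControlPoint] = {}
        self.history: List[Tuple[str, str]] = []

    def register(self, point: RestartControlPoint) -> None:
        self._points[(point.scope, point.target)] = point

    def restart(self, scope: str, target: str) -> None:
        point = self._points[(scope, target)]  # KeyError if never registered
        point.action(target)
        self.history.append((scope, target))


def make_stub(kind: str, calls: List[str]) -> Callable[[str], None]:
    """Placeholder action that only records what a real restart would do."""
    def action(target: str) -> None:
        calls.append(f"{kind}:{target}")
    return action
```

Keeping the action outside the component matches the crash-only assumption: the service never asks a component to restart itself.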
Future Work
“Master” SLT
Multiple SLTs are run for each component. Choosing which SLTs
to believe is itself an interesting SLT problem.
Support additional types of control points
Multiple level settings that tune component parameters (e.g. filter
level)
Support additional types of observation points
Use programming language techniques (e.g. source code
transformation) to instrument applications in a generic way
Online SLT algorithms for anomaly detection are not
mature
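The "Master" SLT item poses the question of which of several per-component SLTs to believe. The slides do not say how this would be done. One standard online-learning technique that fits the problem is the weighted majority algorithm, sketched below as an illustration only. It assumes that ground truth about each verdict eventually becomes available (for example, from operator confirmation); the class name, the detector names, and `beta = 0.5` are hypothetical.

```python
from typing import Dict, Iterable


class WeightedMajority:
    """Weighted majority vote over detectors, penalizing those that err."""

    def __init__(self, detectors: Iterable[str], beta: float = 0.5) -> None:
        if not 0.0 < beta < 1.0:
            raise ValueError("beta must lie strictly between 0 and 1")
        self.beta = beta
        self.weights: Dict[str, float] = {name: 1.0 for name in detectors}

    def predict(self, votes: Dict[str, bool]) -> bool:
        """Return True (anomaly) only if the 'anomaly' side has more weight."""
        yes = sum(self.weights[n] for n, v in votes.items() if v)
        no = sum(self.weights[n] for n, v in votes.items() if not v)
        return yes > no

    def update(self, votes: Dict[str, bool], truth: bool) -> None:
        """Multiply the weight of every detector that voted wrongly by beta."""
        for name, vote in votes.items():
            if vote != truth:
                self.weights[name] *= self.beta
```

Detectors that are repeatedly wrong lose influence geometrically, so the combined verdict tracks the most reliable detector over time.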