Autonomous Recovery in Componentized Internet Applications
Candea et al.
Vikram Negi
Introduction
• Autonomic Problem
• Approach
• Results
• Discussion
The Autonomic Problem
• To allow the application to recover automatically
from transient and intermittent software failures.
The Approach
• Introduce the ideas:
– Macroanalysis (fault detection)
– Microrebooting (rapid recovery)
– External management (recovery actions)
• Integrate and test with JBoss
Design Overview
• Autonomous Process
– Monitoring
• Java probes
– Fault detection
• Generate Anomaly report
– Recovery
• Takes action
• Total time to recovery.
J2EE Review
• J2EE enterprise apps = collection of reusable Java modules
• JSPs / servlets invoke EJBs, which invoke other EJBs, ...
• EJB = Java component that complies with a certain interface and provides a service
• Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application
JBoss Design
• Open-source J2EE app server
• Written entirely in Java
• Microkernel with components held together by JMX (management support)
JAGR = ROC-ified JBoss with Application-Generic Recovery
• 3-Tier Architecture
• Key Components
– Macroanalysis Engine
– Microrebooting Hook
– Recovery Manager
Pinpoint: Detection and Localization
• Store observations
– IP address of machine, timestamp
– Globally unique request ID
– Number of calls/returns to EJBs
– Association between sender and receiver
– Collected SQL queries (updates, reads)
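As a rough illustration of the observations listed above, the sketch below models one monitoring record and groups call events by request ID. The class name, field names, and sample values are hypothetical and are not taken from Pinpoint itself; SQL observations are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One monitoring record (hypothetical field names)."""
    host_ip: str      # IP address of the machine that logged the event
    timestamp: float  # time of the event, in seconds
    request_id: str   # globally unique ID of the end-user request
    sender: str       # calling component
    receiver: str     # called component
    kind: str         # "call" or "return"


def interactions_per_request(observations):
    """Count (sender, receiver) calls observed for each request ID."""
    counts = {}
    for obs in observations:
        if obs.kind == "call":
            per_request = counts.setdefault(obs.request_id, Counter())
            per_request[(obs.sender, obs.receiver)] += 1
    return counts


# Made-up sample log: two requests crossing two components.
log = [
    Observation("10.0.0.1", 1.0, "req-1", "servlet", "ViewItem", "call"),
    Observation("10.0.0.1", 1.1, "req-1", "ViewItem", "servlet", "return"),
    Observation("10.0.0.1", 1.2, "req-1", "servlet", "ViewItem", "call"),
    Observation("10.0.0.2", 2.0, "req-2", "servlet", "CommitBid", "call"),
]
paths = interactions_per_request(log)
```

Grouping by request ID is what lets an analysis engine relate a failed request to the set of components it touched.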
Pinpoint: Analysis
• Analysis Engine
– Centralized engine
– Plugin-based architecture
• Modeling Components
– Assume that present component behavior and historical (normal) behavior follow the same probability distribution.
– Chi-square test to determine whether the two distributions differ.
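A minimal sketch of the chi-square comparison described above, assuming a component's behavior is summarized as counts of calls to each of its neighbors. The function names and sample counts are illustrative. Dividing the statistic by the critical value, so that a score above 1.0 means "anomalous", is an assumed normalization and not necessarily the one used in the paper.

```python
# Standard chi-square critical values at significance level 0.05,
# keyed by degrees of freedom.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic of observed counts against a
    historical probability distribution."""
    total = sum(observed)
    stat = 0.0
    for count, prob in zip(observed, expected_probs):
        expected = total * prob
        stat += (count - expected) ** 2 / expected
    return stat


def anomaly_score(observed, expected_probs):
    """Statistic divided by the critical value (assumed normalization):
    a score above 1.0 rejects 'same distribution' at the 0.05 level."""
    degrees_of_freedom = len(observed) - 1
    critical = CRITICAL_0_05[degrees_of_freedom]
    return chi_square_statistic(observed, expected_probs) / critical


# Historical share of calls from one component to three neighbors.
historical = [0.5, 0.3, 0.2]
normal_now = [52, 29, 19]  # close to the historical mix
faulty_now = [90, 5, 5]    # one neighbor dominates after a fault

normal_score = anomaly_score(normal_now, historical)
faulty_score = anomaly_score(faulty_now, historical)
```

With these made-up counts, the near-historical sample scores well below 1.0 and the skewed sample scores well above it.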
Recovery: microreboot is not expensive
• State Segregation
– Store important state outside the application, in a database.
– Persistent State
• CMP (container-managed persistence, J2EE) is a requirement for the prototype.
– Session State
• Stored in a modified SSM (external session state store)
• Containment and Reintegration
– Microreboot the transitive closure of all inter-EJB references
– XML deployment descriptors determine the grouping for the closure
– Complete reboot or microreboot
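The grouping step above can be sketched as a graph traversal. The sketch assumes the inter-EJB references have already been extracted from the deployment descriptors into a dictionary, and it treats references as links in both directions (an assumption). The bean names and the reference map are hypothetical.

```python
from collections import deque


def recovery_group(faulty_bean, references):
    """Return every bean transitively linked to faulty_bean.

    `references` maps a bean name to the beans it references.
    Links are followed in both directions (an assumption).
    """
    neighbours = {}
    for bean, targets in references.items():
        for target in targets:
            neighbours.setdefault(bean, set()).add(target)
            neighbours.setdefault(target, set()).add(bean)

    group = {faulty_bean}
    queue = deque([faulty_bean])
    while queue:
        bean = queue.popleft()
        for other in neighbours.get(bean, ()):
            if other not in group:
                group.add(other)
                queue.append(other)
    return group


# Hypothetical reference map, as if read from deployment descriptors.
references = {
    "CommitBid": ["Item", "User"],
    "ViewItem": ["Item"],
    "BrowseCategories": ["Category"],
}
group = recovery_group("CommitBid", references)
```

Here a fault in CommitBid would pull in Item, User, and ViewItem, while BrowseCategories and Category stay untouched.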
Recovery
• Enabling microreboot
– Method in the JBoss EJB container
– Preserve the class loader
Manage Recovery
• Recovery Policy
– Read the failure report; consider components with a score > 1.0
– Microreboot the top n, or all components with a score > 1.0
– Allow a delay (~30 sec)
– If the error is still present, retry a few times or reboot completely
– Finally, report it to the sysadmin
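One possible reading of the policy above, written as a small control loop. All callbacks (microreboot, full_reboot, still_failing, notify_admin, wait) are hypothetical hooks supplied by the caller. The escalation order (retry microreboots, then a full reboot, then the administrator) and the default values of top_n and max_attempts are assumptions of this sketch.

```python
def recover(report, microreboot, full_reboot, still_failing, notify_admin,
            wait, top_n=2, max_attempts=3, delay_s=30):
    """Apply the escalating recovery policy to a failure report.

    `report` maps component names to anomaly scores.
    """
    # Consider only components whose score exceeds 1.0, highest first.
    suspects = sorted((name for name, score in report.items() if score > 1.0),
                      key=lambda name: report[name], reverse=True)[:top_n]
    if not suspects:
        return "no-action"

    # Try microreboots a few times, waiting for the system to settle.
    for _ in range(max_attempts):
        for component in suspects:
            microreboot(component)
        wait(delay_s)
        if not still_failing():
            return "microreboot"

    # Escalate to a complete reboot, then to a human.
    full_reboot()
    wait(delay_s)
    if not still_failing():
        return "full-reboot"
    notify_admin(suspects)
    return "escalated"


# Usage with fake hooks: the fault clears after the second microreboot.
actions = []
failures = iter([True, False])
outcome = recover(
    {"ViewItem": 2.4, "CommitBid": 0.3},
    microreboot=lambda name: actions.append(("microreboot", name)),
    full_reboot=lambda: actions.append(("full_reboot",)),
    still_failing=lambda: next(failures),
    notify_admin=lambda names: actions.append(("notify", tuple(names))),
    wait=lambda seconds: None,
)
```

Passing the hooks in as parameters keeps the policy external to the application server, in the spirit of the external-management idea.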
Evaluation Test Framework
• Applications
– Petstore 1.1 (12 components, 233 Java files, 11K LOC)
– Petstore 1.3.1 (47 components, 310 Java files, 10K LOC)
– RUBiS (21 components, 500 Java files, 25K LOC)
• Workload
– Implement simulators with a transition table.
– 350 clients (max utilization principle)
• Faultload
– Based on industry experience
– No low-level hardware or OS faults.
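The transition-table workload simulator mentioned above can be sketched as a small Markov-chain client emulator. The page names and probabilities below are made up for illustration and are not the tables used in the evaluation.

```python
import random

# Hypothetical transition table: current page -> [(next page, probability)].
TRANSITIONS = {
    "Home":     [("Browse", 0.7), ("Search", 0.3)],
    "Browse":   [("ViewItem", 0.6), ("Home", 0.4)],
    "Search":   [("ViewItem", 0.8), ("Home", 0.2)],
    "ViewItem": [("Bid", 0.3), ("Browse", 0.5), ("Home", 0.2)],
    "Bid":      [("Home", 1.0)],
}


def simulate_session(start, steps, rng):
    """Walk the transition table for `steps` clicks, starting at `start`."""
    page = start
    visited = [page]
    for _ in range(steps):
        next_pages, weights = zip(*TRANSITIONS[page])
        page = rng.choices(next_pages, weights=weights)[0]
        visited.append(page)
    return visited


# One emulated client; a seeded generator keeps the run repeatable.
session = simulate_session("Home", 10, random.Random(42))
```

Running many such sessions concurrently (350 in the slides) would produce the client load.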
Evaluation: Detection
• Results similar to other detectors
• No discussion of absolute numbers?
• Forced Java runtime/declared exceptions, call omission, and source code bugs
• (1) How well was the fault detected? (2) How well was a major outage detected?
Evaluation: Localization
• Localization % for each algorithm per fault type: CIA > 85%
• No absolute data again?
Evaluation: Recovery
• Introduce faults in SSM-RUBiS.
• Restart SSM-RUBiS or microreboot the component.
• Observations from 10 trials, each with 350 concurrent clients.
Full vs. Micro Reboot
• Injected a null reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories and SB CommitUserFeedback.
• Microreboot maintains steady response.
• 425 vs. 3916 failed requests
• 61527 vs. 56028 successful requests
• What error conditions did the other trials have?
Total Recovery Time
• Corrupt SB_ViewItem: set it to NULL.
• 19.4 sec TRT
• 18.5 sec in analysis
• Pinpoint is the bottleneck in microreboot.
Is Pinpoint app-generic?
• Upgrade to Petstore v.1.3.2
– Works for the confidence interval
• How different was the updated version?
Performance Overhead
• Results for a 30-min fault-free run with 350 clients
• In-memory vs. external (SSM) session state
• Marshalling costs
Assumptions
• Well-defined interfaces for components (.NET, J2EE)
• Deterministic call paths between components
• No critical service request
• Training data for the statistical model
• Guidelines (Crash-Only Software)
Discussion
• Overall, one of the good papers, though maybe a bit verbose in the introduction.
• Integrating framework for earlier work by Candea.
• Limitations of the present statistical model.
• Shared EJB state
– Modify JIT, disable microreboots (references, static variables)
• Application: global data not scrubbed.
• Cost-benefit: microreboot vs. total reboot
Supplementary
• Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with web server to make app web-accessible)
• http://people.epfl.ch/george.candea