Failure Avoidance through Fault
Prediction Based on Synthetic
Transactions
Mohammed Shatnawi 1, 2
Matei Ripeanu 2
1 – Microsoft Online Ads,
Microsoft Corporation
Redmond, Washington
2 – Electrical and Computer Engineering Department,
University of British Columbia
Vancouver BC, Canada
This work was supported in part by the Institute for
Computing, Information and Cognitive Systems (ICICS) at UBC.
Microsoft Ad Exchange
• The Exchange Market Place Model
• Advertisers, Publishers, and Brokers
• Ad Networks – Aggregators of Advertising Demands
• Value-Added Providers
• Exchange Operator
• Exchange Characteristics
• Liquidity
• Auction – Bidding and Pricing
• Eligible Participants
• Federation
• Fairness and Neutrality vs. Arbitraging
• Strict Requirements
• Performance, Reliability and Strict SLA
The Problem with Logs
• Complexity
• Contain large disparate content
• Verbose and Large Size
• Often Incomplete for data analysis and mining
• Record problems after they have happened
• Users already endured the bad experience
• Enterprises may lose customers’ trust in the service
Related Work
• Lack of holistic approach
• Focus on specific aspects of the problems
• Pre-process logs to reduce their complexity
• Log Aggregations – Snyder et al. [5]
• Event Categorization – Zhen et al. [10]
• Log parsing and text mining – Xu et al. [7, 9]
• Proactive fault avoidance
• System monitoring at run time – Pietrantuono [1]
• Runtime fault injection to ensure faults are detected – Cotroneo [2]
• Problems with such approaches
» They are not holistic: each addresses only one aspect of the problem
» Solving one problem may cause another (e.g. fault injection at runtime may interfere with the system behavior and may add to log complexity)
Suggested Approach Goals
• Address the problem holistically:
• Proactivity
• Using synthetic transactions before going to production
• Log Design Simplicity and Completeness
• Use of specialized logs for specific set of metrics
• Log data is complete for data analysis (through iteration)
• Data mining in Mind
• Log schema is designed with data analysis and mining in mind
Suggested Approach Flowchart
• System Test Complete and Deployment Ready
• Proposed Approach Start
• Synthetic Transactions and Log Design
• Synthetic Workload Execution and Result Logging
• Analyze Log, Ensure Only Relevant Data is Included
• Log Has All Required Data? (No: iterate; Yes: continue)
• Synthetic Data Analysis and Mining
• Enough Training and Test Data? (No: iterate; Yes: continue)
• Pre-Deployment Configuration Updates
• System Deployment To Production
• Proposed Approach End
Suggested Approach Goals in Detail
• Proactivity – System functionality emulation in a test environment
• System Replica in Pre-production Environment
• Replica of the production hardware system in a test environment
• Replica of the production software system in a test environment
• Synthetic Transactions
• Use of a software client to emulate the expected workload through the distribution of function calls and data load intensity
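A minimal sketch of such a software client, assuming the workload is described by a weighted mix of operations and a payload size range. The operation names, weights, and the `generate_workload` helper are hypothetical illustrations, not values or code from the talk:

```python
import random
from collections import Counter

# Hypothetical operation mix; the weights are illustrative, not measured values.
OPERATION_MIX = {"create": 0.10, "read": 0.50, "update": 0.15,
                 "delete": 0.05, "query": 0.20}


def generate_workload(n_transactions, mix=OPERATION_MIX, max_payload_kb=64, seed=0):
    """Return a list of (operation, payload_kb) synthetic transactions.

    Operations are drawn according to the weights in `mix` (the distribution
    of function calls); payload sizes are drawn uniformly from
    1..max_payload_kb (the data load intensity).
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    operations = list(mix)
    weights = [mix[op] for op in operations]
    chosen = rng.choices(operations, weights=weights, k=n_transactions)
    return [(op, rng.randint(1, max_payload_kb)) for op in chosen]


workload = generate_workload(1000)
print(Counter(op for op, _ in workload))
```

In a real pre-production run, each generated tuple would be turned into an actual call against the replica system, and the outcome would be logged.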
Suggested Approach Goals in Detail
• Log Design Simplicity and Completeness
• Tailored to the analysis of the problem(s) at hand (e.g. response time, correctness, error handling, etc.)
• Log schema design is guided by dimensional modeling, which ensures that all impactful data is accounted for
• The schema design is iterative, to allow isolating the parameters with the most impact on the metrics at hand
» This leads to more compact logs
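As an illustration of a specialized log shaped like a dimensional model, the sketch below separates dimensions (the context of a transaction) from a single measure (response time). The `LatencyLogRecord` type and its field names are assumptions made for this example; the talk does not give its actual schema:

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class LatencyLogRecord:
    # Dimensions (hypothetical names): the context a transaction ran in.
    timestamp: str
    operation: str
    cpu_load_band: str
    recent_trend: str
    # Measure (the fact being analyzed).
    response_time_ms: float


def write_log(records, stream):
    """Write records as CSV with one column per dimension or measure."""
    names = [f.name for f in fields(LatencyLogRecord)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


buffer = io.StringIO()
write_log(
    [LatencyLogRecord("2024-01-01T00:00:00", "read", "high", "increasing", 412.0)],
    buffer,
)
print(buffer.getvalue())
```

Under this layout, dropping a dimension that turns out to have little impact, or adding a new one, is a one-field change, which fits the iterative schema design described above.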
Suggested Approach Goals in Detail
• Data Mining in Mind
• The dimensional model ensures data set completeness
• It also ensures:
– Amenability to data mining
– Completeness of data mining data requirements
– Easy addition of dimensions and new data
• Use of any available mining solutions
• The goal is not to create new data mining models/techniques
• Generating data with data mining in mind simplifies the mining process
Pre-Deployment Based on Mining Findings
• Before deployment, configure the system based on the mining results
• The findings identify the system parameters and conditions that cause faults
• Configure the system to guard against the conditions that cause problems, and so prevent them from happening
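One way to turn a mining finding into a configuration limit is sketched below. It assumes the finding is that CPU load drives SLA misses, and that each synthetic log sample carries a CPU load percentage and an SLA-miss flag. The `find_cpu_limit` helper, the 5% tolerance, and the sample data are hypothetical; the talk does not describe how its limits were derived:

```python
def find_cpu_limit(samples, max_miss_rate=0.05, step=5):
    """Return the highest CPU limit (a multiple of `step`, in percent) such
    that transactions observed at or below that limit missed the SLA at a
    rate no greater than `max_miss_rate`. Returns None if no limit qualifies.

    samples: list of (cpu_load_percent, missed_sla) pairs.
    """
    best = None
    for limit in range(step, 101, step):
        below = [missed for cpu, missed in samples if cpu <= limit]
        if not below:
            continue  # no observations yet at this load level
        miss_rate = sum(below) / len(below)
        if miss_rate <= max_miss_rate:
            best = limit
    return best


# Illustrative samples, not data from the experiment.
samples = [(20, False), (40, False), (60, False), (80, True), (90, True)]
config = {"max_cpu_load_percent": find_cpu_limit(samples)}
print(config)
```

The resulting value could then feed whatever throttling or load-shedding setting the deployed system exposes.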
Experiment
• Experiment Goals
• Predict the conditions that cause delays in executing CRUDQ operations (and so missed SLAs)
• Accordingly, set operational limits and system configuration before deployment to prevent these problems from happening
• Methodology:
• Devise synthetic transactions to emulate the CRUDQ ops
• Log results from the synthetic system
• Build a data mining model from these logs
• Find the system conditions with the most impact on system failure (SLA latency)
• Configure the system to guard against them
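The step of finding the most impactful conditions can be approximated, before any model is built, by comparing SLA-miss rates across the values of each logged condition. The sketch below is a simple stand-in for that step, not the mining technique used in the experiment; the 500 ms SLA, the field names, and the records are hypothetical:

```python
from collections import defaultdict

SLA_MS = 500  # hypothetical SLA threshold in milliseconds


def label_sla_miss(record, sla_ms=SLA_MS):
    """A transaction misses the SLA when its response time exceeds the threshold."""
    return record["response_time_ms"] > sla_ms


def rank_conditions(records, features, sla_ms=SLA_MS):
    """Rank features by the spread (max - min) of the SLA-miss rate across
    their values; a larger spread suggests a stronger link to SLA misses.
    Expects a non-empty list of records."""
    ranking = []
    for feature in features:
        totals = defaultdict(int)
        misses = defaultdict(int)
        for record in records:
            totals[record[feature]] += 1
            misses[record[feature]] += label_sla_miss(record, sla_ms)
        rates = [misses[value] / totals[value] for value in totals]
        ranking.append((feature, max(rates) - min(rates)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Illustrative records, not data from the experiment.
records = [
    {"cpu_load": "high", "operation": "read", "response_time_ms": 900},
    {"cpu_load": "high", "operation": "update", "response_time_ms": 700},
    {"cpu_load": "low", "operation": "read", "response_time_ms": 120},
    {"cpu_load": "low", "operation": "update", "response_time_ms": 200},
]
ranking = rank_conditions(records, ["cpu_load", "operation"])
print(ranking)
```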
Experiment Baseline
• Compare the data mining ability of the synthetic system to actual log data
• We used synthetic log data to train and test the model
» 8 hours’ worth of data with 28k transactions
» Log size was 11MB
• We used actual log data to verify our results
» Five weeks’ worth of data with 650M transactions
» Log size was about 88GB of data a day
Results
• Naïve Bayes prediction using synthetic data
» “CPU load” (77%) and “Recent historical trend” (increasing) impacted results the most
» Using this model on actual log data showed 91% prediction accuracy
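To make the train-on-synthetic, verify-on-actual workflow concrete, here is a minimal categorical Naïve Bayes classifier with Laplace smoothing in plain Python. It is a sketch, not the mining tool used in the experiment (the talk relies on existing mining solutions), and the feature names, labels, and toy rows are hypothetical:

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """rows: list of dicts mapping feature -> categorical value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> values seen in training
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the label with the highest log-posterior for `row`."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in row.items():
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            numerator = value_counts[(label, feature)][value] + 1
            denominator = count + len(feature_values[feature])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, rows, labels):
    hits = sum(predict(model, row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


# Toy "synthetic" training rows (hypothetical).
train_rows = [
    {"cpu_load": "high", "trend": "increasing"},
    {"cpu_load": "high", "trend": "stable"},
    {"cpu_load": "low", "trend": "increasing"},
    {"cpu_load": "low", "trend": "stable"},
    {"cpu_load": "high", "trend": "increasing"},
    {"cpu_load": "low", "trend": "stable"},
]
train_labels = ["miss", "miss", "ok", "ok", "miss", "ok"]
model = train_naive_bayes(train_rows, train_labels)

# Toy "actual" rows used only for verification (hypothetical).
actual_rows = [{"cpu_load": "high", "trend": "increasing"},
               {"cpu_load": "low", "trend": "stable"}]
actual_labels = ["miss", "ok"]
print(accuracy(model, actual_rows, actual_labels))
```

The same `accuracy` helper applies to any other classifier trained on the synthetic logs, which makes it easy to compare models on the held-out production data.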
Results
• Decision Trees prediction using synthetic data
» “CPU load” impacted the results the most (77%)
» Using this model on actual log data showed 89% prediction accuracy
Summary and Conclusions
• Approach to enhance an online service’s reliability, availability and performance through
• Use of synthetic transactions in pre-production environments
• Use of specialized logs for failure prediction
• Use of data mining on the compact specialized logs
• Identifying the environment conditions that correlate to failures, and guarding against them
• Advantages
• Analysis and predictor training occur before the system goes to production
• Data sets used in creating the synthetic predictor systems are orders of magnitude smaller, easier to use, and faster to process than their production counterparts
Challenges and Limitations
• Main challenges:
• Understanding the service at hand,
• Identifying the quality of service requirements
• Tactical Challenges and Limitations
• Producing the service in pre-production environments,
– Replication makes it easy
– Emulation techniques required if replication is not feasible
• Generating accurate synthetic transactions
– Requires understanding of service, APIs, and usage patterns.
– Exacerbated by the complexity of service and its inter-system dependencies.
• Isolating the measures of interest
– May not always be attainable.
– Measures of interest may be grouped.
Q&A