Machine Learning for Automated Diagnosis of Distributed Systems Performance
Ira Cohen
HP-Labs
June 2006
http://www.hpl.hp.com/personal/Ira_Cohen
© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice
Intersection of systems and ML/data mining: Growing (research) area
• Berkeley's RAD lab (Reliable Adaptable Distributed systems lab) got $7.5M from Google, Microsoft and Sun for:
"…adoption of automated analysis techniques from Statistical Machine Learning (SML), control theory, and machine learning, to radically improve detection speed and quality in distributed systems"
• Workshops devoted to the area (e.g., SysML), papers in leading systems and data mining conferences
• Part of IBM's "Autonomic Computing" and HP's Adaptive Enterprise visions
• Startups (e.g., Splunk, LogLogic)
• And more…
SLIC project at HP-Labs*: Statistical Learning, Inference and Control
• Research objective: Provide technology enabling automated decision making, management and control of complex IT systems.
− Explore statistical learning, decision theory and machine learning as the basis for automation.
I'll focus today on performance diagnosis.
*Participants/Collaborators: Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, Steve Zhang, Jeff Chase, Rob Powers, Chengdu Huang, Blaine Nelson
Intuition: Why is performance diagnosis hard?
• What do you do when your PC is slow?
Why care about performance?
• Answer: It costs companies BIG money. Analysts estimate that poor application performance costs U.S.-based companies approximately $27 billion each year.
• Revenue from performance management software products is growing by double-digit percentages every year!
Challenges today in diagnosing/forecasting IT performance problems
• Distributed systems/services are complex
− Thousands of systems/services/applications are typical
− Multiple levels of abstraction and interactions between components
− Systems/applications change rapidly
• Multiple levels of responsibility (infrastructure operators, application operators, DBAs, …) --> a lot of finger pointing
− Problems can take days/weeks to resolve
• Loads of data, no actionable information
− Operators manually search for a needle in a haystack
− Multiple types of data sources --- lack of unifying tools to even view the data
• Operators hold past diagnosis efforts in their heads --> the history of diagnosis efforts is mostly lost
Translation to Machine Learning Challenges
• Transforming data to information: classification and feature selection methods, with a need for explanation
• Adaptation: learning with concept drift
• Leveraging history: transforming diagnosis into an information retrieval problem, clustering methods, etc.
• Using multiple data sources: combining structured and semi-structured data
• Scalable machine learning solutions: distributed analysis, transfer learning
• Using human feedback (human in the loop): semi-supervised learning (active learning, semi-supervised clustering)
Outline
• Motivation (already behind us…)
• Concrete example: the state of distributed performance management today
• ML challenges
− examples of research results
• Bringing it all together as a tool: providing diagnostic capabilities as a centrally managed service
• Discussion/Summary
Example: A real distributed HP application architecture
• A geographically distributed 3-tier application
• Results shown today are from the last 19+ months of data collected from this service
Application performance "management": Service Level Objectives (SLO)
Unhealthy = SLO violation
Detection is not enough…
• Triage:
− What are the symptoms of the problem?
− Who do I call?
• Leverage history:
− Did we see similar problems in the past?
− What were the repair actions?
− Do/did they occur in other data centers?
• Problem prioritization:
− How many different problems are there, and what is their severity?
− Which are recurrent?
• Can we forecast these problems?
Challenge 1: Transforming data to information…
• Many measurements (metrics) are available on IT systems (OpenView, Tivoli, etc.)
− System/application metrics: CPU, memory, disk and network utilizations, queues, etc.
− Measured on a regular basis (1-5 minutes with commercial tools)
• Other semi-structured data (log files)
Where is the relevant information?
ML Approach: Model using Classifiers
Leverage all the data collected in the infrastructure to:
1) Use classifiers: F(M) -> SLO state
2) Classification accuracy is a measure of success
3) Use feature selection to find the most predictive metrics of SLO state
[Figure: classifier F(M, SLO) mapping the metric vector M to SLO state (healthy/unhealthy)]
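As a rough illustration of steps 1)-3), the sketch below trains a classifier F(M) -> SLO state and greedily selects the most predictive metrics. This is not the SLIC code: the synthetic data, the Gaussian naive Bayes model, and the forward-selection heuristic are illustrative assumptions (the actual approach uses Bayesian network classifiers, described next).

```python
# Sketch only: train F(M) -> SLO state and greedily pick predictive metrics.
# Data, model choice, and stopping rule are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000                                       # 5-minute samples (~1 week)
M = rng.gamma(2.0, 1.0, size=(n, 50))          # 50 system/application metrics
slo = (M[:, 3] + M[:, 17] > 6.0).astype(int)   # 1 = SLO violation (toy rule)

selected, remaining, best_acc = [], list(range(M.shape[1])), 0.0
while remaining:
    # Forward selection: add the metric that improves CV accuracy the most.
    scores = {j: cross_val_score(GaussianNB(), M[:, selected + [j]], slo,
                                 cv=5).mean() for j in remaining}
    j, acc = max(scores.items(), key=lambda kv: kv[1])
    if acc <= best_acc + 1e-3:                 # stop when no real improvement
        break
    selected.append(j); remaining.remove(j); best_acc = acc

print(f"selected metrics: {selected}, CV accuracy ~ {best_acc:.2f}")
```

On data like this, a handful of metrics typically suffices, mirroring the 3-10 metrics reported below.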
But we need an explanation, not just classification accuracy...
Our approach: Learn a joint probability distribution P(M, SLO) (Bayesian network classifiers).
Inferences ("metric attribution"), based on P(M|SLO):
− Normal: the metric has a value associated with healthy behavior
− Abnormal: the metric has a value associated with unhealthy behavior
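A minimal sketch of metric attribution for the simplest such model, a Gaussian naive Bayes classifier, where each metric can be scored independently; the per-class parameters and example values are illustrative assumptions:

```python
# Sketch: flag a metric as abnormal if its observed value is more likely
# under the violation-conditional distribution P(m|SLO=violation) than
# under the compliance one. Gaussian class-conditionals are an assumption.
import numpy as np
from scipy.stats import norm

def attribute(x, mu, sigma):
    """x: observed metric vector; mu, sigma: per-class Gaussian parameters,
    shape (2, d), row 0 = compliance, row 1 = violation.
    Returns +1 (abnormal) or -1 (normal) per metric."""
    ll_ok  = norm.logpdf(x, mu[0], sigma[0])
    ll_bad = norm.logpdf(x, mu[1], sigma[1])
    return np.where(ll_bad > ll_ok, 1, -1)

mu    = np.array([[0.3, 0.4], [0.9, 0.5]])    # toy means for 2 metrics
sigma = np.array([[0.1, 0.2], [0.1, 0.2]])
print(attribute(np.array([0.85, 0.42]), mu, sigma))   # -> [ 1 -1 ]
```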
Bayesian network classifiers: Results
• "Fast" (in the context of 1-5 minute data collection):
− Models take 2-10 seconds to train on days' worth of data
− Metric attribution takes 1-10 ms to compute
• Found that on the order of 3-10 metrics (out of hundreds) are needed to accurately capture a performance problem
• Accuracy is high (~90%)*
• Experiments showed the metrics are useful for diagnosing certain problems on real systems
• It is hard to capture multiple types of performance problems with a single model!
[Figure: Bayesian network with the SLO state node connected to metrics M3, M5, M8, M30, M32]
Additional issues
• How much data is needed to get accurate models?
• How to detect model validity?
• How to present models/results to operators?
Challenge 2: Adaptation
• Systems and applications change
• Reasons for performance problems change over time (and sometimes recur)
Different? Same problem?
Learning with "concept drift"
Adaptation: Possible approaches
• Single omniscient model: "Train once, use forever"
− Assumes the training data provides all information
• Online updating of the model
− E.g., parameter/structure updating of Bayesian networks; online learning of neural networks, support vector machines, etc.
− Potentially wasteful retraining when similar problems reoccur
• Maintain an ensemble of models
− Requires criteria for choosing the subset of models used in inference
− Criteria for adding new models to the ensemble
− Criteria for removing models from the ensemble
Our approach: Managing an ensemble of models for our classification approach
Construction:
1. Periodically induce a new model
2. Check whether the model adds new information (classification accuracy)
3. Update the ensemble of models
Inference: Use the Brier score for selection of models
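For reference, the Brier score of a model over N labeled samples is BS = (1/N) * sum_i (p_i - y_i)^2, where p_i is the predicted probability of violation and y_i the observed SLO state; lower is better. Below is a minimal sketch of the winner-takes-all selection step; the sklearn-style predict_proba interface and the sliding window of recent samples are assumptions:

```python
# Sketch: winner-takes-all model selection via the Brier score (lower is
# better), computed over a window of recently labeled samples.
import numpy as np

def brier(probs, labels):
    """probs: predicted P(violation); labels: observed SLO state (0/1)."""
    return np.mean((np.asarray(probs) - np.asarray(labels)) ** 2)

def pick_winner(models, recent_X, recent_y):
    """Return the ensemble member with the lowest Brier score."""
    return min(models,
               key=lambda m: brier(m.predict_proba(recent_X)[:, 1], recent_y))
```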
Adaptation: Results

Approach                                                 Accuracy (%)   Total Processing Time (mins)
Single model: no adaptation                              61.4            0.2
Single model trained with all history (no forgetting)   82.4           71.5
Single model with sliding window                         84.2            0.9
Ensemble of models                                       90.7            7.1

(~7500 samples, 5 mins/sample (one month), ~70 metrics)
• Classifying a sample with the ensemble of BNCs:
− Used the model with the best Brier score for predicting the class (winner takes all)
• The Brier score was better than other selection measures (e.g., accuracy, likelihood)
• Winner-takes-all was more accurate than other combination approaches (e.g., majority voting)
Adaptation: Results
• The "single adaptive" model is slower to adapt to recurrent issues
− It must re-learn the behavior, instead of just selecting a previous model
Additional issues
• Need criteria for "aging" models
• Periods of "good" behavior also change: need robustness to those changes as well
Challenge 3: Leveraging history
• It would be great to have the following system:
Diagnosis: Stuck thread due to insufficient database connections
Repair: Increase connections to +6
Periods: …
Severity: SLO time increases up to 10 secs
…
Location: Americas. Not seen in Asia/Pacific
Leveraging history
• Main challenge: find a representation (a signature) that captures the main characteristics of the system behavior and is:
− Amenable to distance metrics
− Generated automatically
− In machine-readable form
Our approach to defining signatures
1) Learn probabilistic classifiers P(SLO, M)
2) Inferences: metric attribution
3) Define these as signatures of the problems
[Figure: models flag abnormal metrics, e.g., app cpu util, app alive proc high, app active proc high, DB cpu util high]
Example: Defining a signature
• For a given SLO violation, the models provide a list of metrics that are attributed to the violation.
• A metric has value 1 if it is attributed to the violation, -1 if it is not attributed, and 0 if it is not relevant, e.g.:
[Figure: example attribution vector over the metrics]
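A sketch of how such a signature vector might be assembled; the metric names and the helper function are hypothetical, not the project's code:

```python
# Sketch: signature over the metric universe.
#   +1 = attributed to the violation, -1 = in the model but not attributed,
#    0 = not relevant (not in the model selected for this epoch).
import numpy as np

def make_signature(all_metrics, model_metrics, attributed):
    sig = np.zeros(len(all_metrics))          # default: 0 = not relevant
    for i, name in enumerate(all_metrics):
        if name in model_metrics:
            sig[i] = 1.0 if name in attributed else -1.0
    return sig

sig = make_signature(
    all_metrics=["app_cpu_util", "db_cpu_util", "net_in", "mem_free"],
    model_metrics={"app_cpu_util", "db_cpu_util", "mem_free"},
    attributed={"app_cpu_util", "db_cpu_util"})
print(sig)                                    # -> [ 1.  1.  0. -1.]
```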
Results: With signatures…
• We were able to accurately retrieve past occurrences of similar performance problems, together with their diagnosis efforts
• ML technique: information retrieval
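Retrieval itself then reduces to nearest-neighbor search over the signature database. A minimal sketch; L1 distance is an assumption, and any distance over the {+1, -1, 0} vectors would do:

```python
# Sketch: rank past annotated incidents by distance to the query signature.
import numpy as np

def retrieve(query_sig, signature_db, top_k=10):
    """signature_db: list of (signature_vector, annotation) pairs."""
    ranked = sorted(signature_db,
                    key=lambda rec: np.abs(query_sig - rec[0]).sum())
    return ranked[:top_k]
```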
Results: Retrieval accuracy
[Figure: precision-recall curve for retrieval of the "Stuck Thread" problem, compared against the ideal P-R curve; top 100 retrieved: 92 vs. 51]
Results: With signatures we can also…
• Automatically identify groups of different problems and their severity
• Identify which problems are recurrent
• ML technique: clustering
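A sketch of the clustering step; the choice of k-means and the number of clusters are illustrative assumptions:

```python
# Sketch: cluster signatures to separate distinct problem types; cluster
# sizes over time indicate which problems recur.
import numpy as np
from sklearn.cluster import KMeans

def group_problems(signatures, k=5):
    sigs = np.vstack(signatures)              # one row per violation epoch
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(sigs)
    counts = np.bincount(labels, minlength=k) # cluster sizes ~ recurrence
    return labels, counts
```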
Additional issues
• Can we generalize and abstract signatures for different systems/applications?
• How to incorporate human feedback for retrieval and clustering?
− Semi-supervised learning: results not shown today
Challenge 4: Combining multiple data sources
• We have a lot of semi-structured text logs, e.g.,
− Problem tickets
− Event/error logs (application/system/security/network…)
− Other logs (e.g., operator actions)
• Logs can help obtain more accurate diagnoses and models; sometimes system/application metrics are not enough
• Challenges:
− Transforming logs into "features": information extraction
− Doing it efficiently!
Properties of logs
• Log events have relatively short text messages
• Much of the diversity in messages comes from different "parameters" (dates, machine/component names); the message core is less varied than free text
• The number of events can be huge (e.g., >100 million events per day for large IT systems)
Processing events needs to compress the logs significantly, and do so efficiently!
Our approach: Processing application error-logs
Example raw entries (excerpt):
2006-02-26T00:00:06.461 ES_Domain:ES_hpat615_01:2257913:Thread43.ES82|commandchain.BaseErrorHandler.logException()|FUNCTIONAL|0||FatalException occurred type=com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException, message=Connection timed out, class=com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|2706||KNIGHT system unavailable: java.io.IOException
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand message: Connection timed out causing exception type: java.io.IOException KNIGHT URL accessed: http://vccekntpro.cce.hp.com/knight/knightwarrantyservice.asmx
…
• Over 4,000,000 error log entries; 200,000+ distinct error messages
• Similarity-based sequential clustering reduces these to 190 "feature messages"
− Significant reduction of messages: 200,000 → 190
− Accurate: clustering results validated with a hierarchical tree clustering algorithm
• Use counts of appearances of the feature messages over 5-minute intervals as metrics for learning
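A sketch of the single-pass idea; the tokenizer, Jaccard similarity, and 0.7 threshold are illustrative assumptions rather than the exact published algorithm:

```python
# Sketch: similarity-based sequential (single-pass) clustering of log
# messages into "feature messages".
import re

def tokens(msg):
    # Drop obvious parameters (any token containing digits: dates, IDs,
    # host names) so the message "core" drives similarity.
    return {t for t in re.split(r"[\s|:=,()]+", msg.lower())
            if t and not any(c.isdigit() for c in t)}

def sequential_cluster(messages, threshold=0.7):
    reps, assignment = [], []                 # one representative per cluster
    for msg in messages:
        t = tokens(msg)
        best, best_sim = None, 0.0
        for i, r in enumerate(reps):
            sim = len(t & r) / max(len(t | r), 1)   # Jaccard similarity
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            assignment.append(best)           # reuse existing feature message
        else:
            reps.append(t)                    # new feature message
            assignment.append(len(reps) - 1)
    return reps, assignment
```

Counting how often each cluster fires within every 5-minute interval then yields the log-based metrics used for learning.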
Learning Probabilistic Models
• Construct probabilistic models of the log-based metrics using a "hybrid-gamma distribution" (a Gamma distribution with a point mass at zero)
[Figure: PDF of the hybrid-gamma over the # of appearances]
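Written out, the density is p(x) = p0·1[x=0] + (1-p0)·Gamma(x; k, θ): a spike at zero (intervals where a feature message never appears) mixed with a Gamma over positive counts. A sketch of fitting it, using method-of-moments for the Gamma part (an illustrative choice):

```python
# Sketch: fit a "hybrid-gamma" (zero-inflated Gamma) to per-interval counts.
import numpy as np

def fit_hybrid_gamma(counts):
    counts = np.asarray(counts, dtype=float)
    p_zero = np.mean(counts == 0)             # weight of the spike at zero
    pos = counts[counts > 0]
    mean, var = pos.mean(), pos.var()         # assumes var > 0
    shape, scale = mean**2 / var, var / mean  # Gamma(k, theta) via moments
    return p_zero, shape, scale
```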
Results: Adding log-based metrics
• Signatures using error-log metrics pointed to the right causes in 4 out of 5 "High" severity incidents in the past 2 months
− System metrics were not related to the problems in these cases
From Operator Incident Report: "Diagnosis and Solution: Unable to start SWAT wrapper. Disk usage reached 100%. Cleaned up disk and restarted the wrapper…"
From Application Error Log: "CORBA access failure: IDL:hpsewrapper/SystemNotAvailableException:… com.hp.es.wrapper.corba.hpsewrapper.SystemNotAvailableException"
Additional issues
• With multiple instances of an application, how to do joint, efficient processing of the logs?
• Treating events as sequences in time could lead to more accuracy and compression
Challenge 5: Scaling up Machine Learning techniques
• Large-scale distributed applications have various levels of dependencies
− Multiple instances of components
− Shared resources (DB, network, software components)
− Thousands to millions of metrics (features)
[Figure: dependent components A, B, C, D, E]
Challenge 5: Possible approaches
• Scalable approach: ignore dependencies between components
− Putting our heads in the sand?
− See Werner Vogels' (Amazon's CTO) thoughts on it…
• Centralized approach: use all available data together for building models
− Not scalable
• A different approach: transfer models, not metrics
− Good for components that are similar and/or have similar measurements
Example: Diagnosis with Multiple Instances
• Method 1: diagnosing multiple instances by sharing measurement data (metrics)
[Figure: instances A and B exchanging metrics]
Diagnosis with Multiple Instances
• Method 1: diagnosing multiple instances by sharing measurement data (metrics)
[Figure: metric sharing across many instances A-H]
Diagnosis with Multiple Instances
• Method 2: diagnosing multiple instances by sharing learning experience (models)
− A form of transfer learning
[Figure: instances A and B exchanging models]
Diagnosis with Multiple Instances
• Method 2: diagnosing multiple instances by sharing learning experience (models)
[Figure: model sharing across many instances A-H]
Metric Exchange: Does it help?
Building models based on the metrics of other instances:
[Figure: online prediction over time epochs for Instance 1 and Instance 2, showing violation detection and false alarms with and without metric exchange]
• Observation: metric exchange does not improve model performance for load-balanced instances
Model Exchange: Does it help?
Apply models trained on other instances:
[Figure: online prediction over time epochs, showing violation detection and false alarms with and without model exchange; models imported from other instances improve detection accuracy]
• Observation 1: model exchange enables quicker recognition of previously unseen problem types
• Observation 2: model exchange reduces model training cost
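A sketch of the selection step behind model exchange: an instance pools its locally trained models with models imported from peer instances and, per epoch, uses whichever has the best recent Brier score, so problems already seen elsewhere need no retraining. The sklearn-style predict_proba interface is an assumption:

```python
# Sketch: model exchange. Choose among local and imported models by
# Brier score on this instance's recently labeled samples.
import numpy as np

def brier(probs, labels):
    return np.mean((np.asarray(probs) - np.asarray(labels)) ** 2)

def select_with_exchange(local_models, imported_models, recent_X, recent_y):
    candidates = list(local_models) + list(imported_models)
    return min(candidates,
               key=lambda m: brier(m.predict_proba(recent_X)[:, 1], recent_y))
```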
Additional issues
• How do we (and can we) do transfer learning on similar but not identical instances?
• More efficient methods for detecting which data is needed from related components during diagnosis
Providing diagnosis as a web service: SLIC's IT-Rover
[Architecture: monitored services feed metrics/SLO monitoring data into a signature-construction engine; signatures are stored in a signature DB, served by a retrieval engine and a clustering engine, with an admin interface on top]
A centralized diagnosis web service allows:
• Retrieval across different data centers/different services/possibly different companies
• Fast deployment of new algorithms
• Better understanding of real problems for further development of algorithms
• The value of the portal is in the information ("Google" for systems)
Discussion: Additional issues, opportunities, and challenges
• Beyond the "black box": using domain knowledge
− Expert knowledge
− Topology information
− Use known dependencies and causal relationships between components
• Provide solutions in cases where SLOs are not known
− Learn the relationship between business objectives and IT performance
− Anomaly detection methods with feedback mechanisms
• Beyond diagnosis: automated control and decision making
− HP-Labs work on applying adaptive controllers for controlling systems/applications
− IBM Labs work using reinforcement learning for resource allocation
Summary
• Presented several challenges at the intersection of machine learning and automated IT diagnosis
• A relatively new area for machine learning and data mining researchers and practitioners
• Many more opportunities and challenges ahead, both research-wise and product/business-wise…
Read more: www.hpl.hp.com/research/slic
− SOSP-05, DSN-05, HotOS-05, KDD-05, OSDI-04
Publications:
• Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, "Capturing, Indexing, Clustering, and Retrieving System History", SOSP 2005.
• Rob Powers, Ira Cohen, Moises Goldszmidt, "Short term performance forecasting in enterprise systems", KDD 2005.
• Moises Goldszmidt, Ira Cohen, Armando Fox, Steve Zhang, "Three research challenges at the intersection of machine learning, statistical induction, and systems", HotOS 2005.
• Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, Armando Fox, "Ensembles of models for automated diagnosis of system performance problems", DSN 2005.
• Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, Jeff Chase, "Correlating instrumentation data to system states: A building block for automated diagnosis and control", OSDI 2004.
• George Forman, Ira Cohen, "Beware the null hypothesis", ECML/PKDD 2005.
• Ira Cohen, Moises Goldszmidt, "Properties and Benefits of Calibrated Classifiers", ECML/PKDD 2004.
• George Forman, Ira Cohen, "Learning from Little: Comparison of Classifiers given Little Training", ECML/PKDD 2004.