
A Statistical Learning Approach to Diagnosing eBay’s Site
Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer
[email protected]
Motivation
• Fast failure detection and diagnosis are critical to high availability
  – But the exact root cause may not be required for many recovery techniques
• Many potential causes of failures
  – Software bugs, hardware, configuration, network, database, etc.
  – Manual diagnosis is slow and inconsistent
• Statistical approaches are ideal
  – Simultaneously examine many possible causes of failures
  – Robust to noise
Challenges
• Lots of (noisy) data
• Near real-time detection and diagnosis
• Multiple independent failures
• Root cause might not be captured in logs
Talk Outline
• Introduction
• eBay’s infrastructure
• 3 statistical approaches
• Early results
eBay’s Infrastructure
• 2 physical tiers
  – Web server/app server + DB
  – Migrating from C++ to Java (WebSphere)
• SuperCAL (Centralized Application Logging)
  – API for app developers to log anything to CAL
  – Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
  – Supports nested txns
  – A path can be identified via thread ID + host ID (see the sketch below)
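To make the last bullet concrete: a minimal Python sketch of stitching records into paths by (host, thread). The record shape here is an assumption for illustration; SuperCAL's actual schema is not shown in the talk.

    from collections import defaultdict

    # Hypothetical record shape -- not SuperCAL's real schema.
    records = [
        {"host": "cgi0-134", "thread": 17, "name": "ViewFeedback",    "status": "NullPointer"},
        {"host": "cgi0-134", "thread": 17, "name": "FeedbackDB.read", "status": "Error"},
        {"host": "cgi2-231", "thread": 9,  "name": "Bid",             "status": "Success"},
    ]

    # Records that share (host, thread) belong to the same request path.
    paths = defaultdict(list)
    for rec in records:
        paths[(rec["host"], rec["thread"])].append(rec)

    for path_id, txns in paths.items():
        print(path_id, "->", [t["name"] for t in txns])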
SuperCAL Architecture
[Diagram: clients enter through a load-balancing switch in front of the app servers; a real-time message bus carries log events to the SuperCAL machines for detection and diagnosis]
• Stats
  – 2K app servers, 40 SuperCAL machines
  – 1B URLs/day
  – 1TB raw logs/day (150GB gzipped), 200Mbps peak
Failure Analysis
• Summarize each transaction into features and a class label:

  ID | Type | Name         | Pool | Host | Version | DB                    | Status (class)
  ---+------+--------------+------+------+---------+-----------------------+---------------
  1  | URL  | ViewFeedback | Cgi0 | 134  | 1.2.1   | FeedbackDB, UserDB, … | NullPointer
  2  | URL  | Bid          | Cgi2 | 231  | 1.0.3   | PriceDB               | Success
  3  | XML  | …            | …    | …    | …       | …                     | …

• What features are causing requests to fail?
  – Txn type, txn name, pool, host, version, DB, or a combination of these?
  – Different causes require different recovery techniques
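Before any of the learners below can run, rows like these must become a numeric matrix. A minimal sketch using scikit-learn's DictVectorizer for one-hot encoding; the encoding choice is an assumption (not from the talk), and the multi-valued DB field is simplified to a single value.

    from sklearn.feature_extraction import DictVectorizer

    # Rows mirror the table above (DB field simplified to one value).
    txns = [
        {"Type": "URL", "Name": "ViewFeedback", "Pool": "Cgi0",
         "Host": "134", "Version": "1.2.1", "DB": "FeedbackDB"},
        {"Type": "URL", "Name": "Bid", "Pool": "Cgi2",
         "Host": "231", "Version": "1.0.3", "DB": "PriceDB"},
    ]
    y = [1, 0]  # class label: 1 = failed (NullPointer), 0 = Success

    # One-hot encode the categorical features so a decision tree
    # (or any off-the-shelf classifier) can consume them.
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(txns)
    print(vec.get_feature_names_out())  # e.g. ['DB=FeedbackDB', 'DB=PriceDB', ...]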
3 Approaches
• Machine learning
  – Decision trees
  – MinEntropy – eBay’s greedy variant of decision trees
• Data mining
  – Association rules
Decision Trees
• Classifiers developed in the statistical machine learning field
• Example: go skiing tomorrow?

  New snow?
  ├─ New snow    → Y
  └─ No new snow → Sunny?
       ├─ Sunny  → Y
       └─ Cloudy → N

• “Learning” => inferring the decision-tree rules from data
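As a concrete (hypothetical) version of that last bullet, a few lines of scikit-learn can infer the skiing tree from labeled examples. The encoding below is mine, not from the talk.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hand-encoded skiing data: columns are (new_snow, sunny), label 1 = go.
    X = [[1, 1], [1, 0], [0, 1], [0, 0]]
    y = [1, 1, 1, 0]  # new snow -> go; otherwise go only if sunny

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=["new_snow", "sunny"]))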
Decision Trees
• Feature selection
  – Look for the features that best separate the classes
  – Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)

  TxnName      | Failed      Machine | Failed
  MyEBay       | 636         Attila  | 2985
  MyEBaySeller | 512         Lenin   | 20
  MyEBayLogin  | 736         Marcus  | 4
  …            | …           Scipio  | 5
                             …       | …

• The goal of the decision tree algorithm
  – Split nodes until leaves are “pure” enough or until no further split is possible
    • i.e. pure => all data points have the same class label
  – Use pruning heuristics to control over-fitting
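For reference, information gain is just the entropy drop from a split. A minimal sketch with illustrative counts (the numbers are invented, not eBay's):

    import math

    def entropy(counts):
        """Shannon entropy of a class distribution given as raw counts."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, children):
        """Entropy drop from splitting `parent` into `children` (class-count lists)."""
        total = sum(sum(child) for child in children)
        remainder = sum(sum(child) / total * entropy(child) for child in children)
        return entropy(parent) - remainder

    # Illustrative: 1000 txns, 200 failed, split by a binary feature.
    parent = [800, 200]                  # [succeeded, failed]
    children = [[790, 10], [10, 190]]    # this split isolates most failures
    print(information_gain(parent, children))  # high gain => good split feature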
Decision Trees – Sample Output

  (correct, incorrect)
  Pool = icgi1
  |  TxnName = LeaveFeedback: failed (8,1)
  |  TxnName = MyFeedback: failed (205,3)
  Pool = icgi2
  |  TxnName = Respond: failed (1)
  |  TxnName = ViewFeedback: failed (3554,52)

• Naïve diagnosis:
  1. Pool=icgi1 and TxnName=LeaveFeedback (8 failures)
  2. Pool=icgi1 and TxnName=MyFeedback (205 failures)
  3. Pool=icgi2 and TxnName=Respond (1 failure)
  4. Pool=icgi2 and TxnName=ViewFeedback (3554 failures)
Feature Selection Heuristics
1. Ignore leaf nodes with no failed transactions
2. Problem: noisy leaves
   – keep the top N leaves, or ignore nodes with < M% failures
3. Problem: features may not be independent
   – drop ancestor nodes that are “subsumed” by the leaves
4. Rank by impact
   – sort the predicted causes by failure count (see the sketch after the diagram below)
[Diagram: the naïve diagnosis from the previous slide pruned step by step — noisy leaves are filtered out, subsumed ancestors are dropped, and the surviving causes are ranked by failure count: ViewFdbk (3554), MyFdbk (205)]
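A minimal sketch of heuristics 1, 2 and 4 over the leaf summaries from the sample-output slide. Heuristic 3 needs the tree structure and is omitted; heuristic 2's threshold is read here as a share of all failures, which is one plausible interpretation — the talk does not pin it down.

    # Candidate causes from the sample-output slide: (predicate, failed count).
    candidates = [
        ("Pool=icgi1 and TxnName=LeaveFeedback", 8),
        ("Pool=icgi1 and TxnName=MyFeedback", 205),
        ("Pool=icgi2 and TxnName=Respond", 1),
        ("Pool=icgi2 and TxnName=ViewFeedback", 3554),
    ]

    M = 1  # noise threshold in percent (illustrative)
    total_failed = sum(failed for _, failed in candidates)

    diagnosis = [
        (pred, failed) for pred, failed in candidates
        if failed > 0                              # 1. drop leaves with no failures
        and failed >= M / 100 * total_failed       # 2. drop noisy leaves
    ]
    diagnosis.sort(key=lambda c: c[1], reverse=True)  # 4. rank by impact
    for pred, failed in diagnosis:
        print(f"{failed:5d}  {pred}")   # keeps ViewFeedback (3554), MyFeedback (205)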
MinEntropy
• Entropy measures the randomness of data
  – E.g. if failures are evenly distributed (very random), then entropy is high
• Rank features by their normalized entropy (sketched after the example on the next slide)
  – Greedy approach searches for the leaf node with the most failures
• Always produces exactly one diagnosis
• Deployed on the entire eBay site
  – Sends real-time alerts to ops
  – Pros: fast (<1s for 100K txns, and scales linearly)
  – Cons: optimized for single faults
MinEntropy example

  TxnType | Errors      TxnName      | Errors      Pool | Errors
  URL     | 4350        MyEBay       | 636         Cgi0 | 12
  SQL     | 47          MyEBaySeller | 512         Cgi1 | 4002
  EMAIL   | 12          MyEBayLogin  | 736         Cgi2 | 30
  XSLT    | 0           …            | …           Cgi3 | 8
  …       | …                                      Cgi4 | 5
                                                   …    | …

  Machine | Errors      Version | Errors
  Attila  | 1985        E293    | 3987
  Lenin   | 2002        E291    | 15
  Marcus  | 4
  Scipio  | 0
  …       | …

  Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
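One plausible reading of the ranking step, using the error counts above: compute each feature's entropy normalized by its number of values, and flag the feature whose failures are most concentrated. eBay's exact formula is not given in the talk; this is a sketch.

    import math

    def normalized_entropy(error_counts):
        """Entropy of the error distribution over a feature's values, normalized
        by log2(arity) so features with different numbers of values compare fairly."""
        total = sum(error_counts)
        h = -sum(c / total * math.log2(c / total) for c in error_counts if c > 0)
        return h / math.log2(len(error_counts))

    # Error counts from the tables above (abbreviated).
    features = {
        "TxnType": [4350, 47, 12, 0],
        "Pool":    [12, 4002, 30, 8, 5],
        "Machine": [1985, 2002, 4, 0],
        "Version": [3987, 15],
    }

    # Lower normalized entropy = failures concentrated in few values = more suspect.
    for name in sorted(features, key=lambda n: normalized_entropy(features[n])):
        print(f"{normalized_entropy(features[name]):.3f}  {name}")  # Version ranks first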
Association Rules
• Data mining technique to compute item sets
  – e.g. shoppers who bought this item also shopped for …
• Metrics
  – Confidence: (# of A & B) / (# of A)
    • Conditional probability of B given A
  – Support: (# of A & B) / (total # of txns)
• Generates rules for all possible sets
  – e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
• Applied to failure diagnosis
  – Find all rules that have a failed status on the right, then rank by confidence
  – Pros: looks at combinations of features
  – Cons: generates many rules
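Confidence and support are cheap to compute directly. A toy sketch over itemized transactions, as a stand-in for a real miner such as Apriori; the items are invented.

    # Toy itemized transactions, each a set of feature=value items.
    txns = [
        {"machine=abc", "txn=login", "status=NullPointer"},
        {"machine=abc", "txn=login", "status=Success"},
        {"machine=xyz", "txn=bid",   "status=Success"},
    ]

    def rule_metrics(antecedent, consequent, txns):
        """Confidence and support for the rule `antecedent => consequent`."""
        n_a  = sum(1 for t in txns if antecedent <= t)
        n_ab = sum(1 for t in txns if (antecedent | consequent) <= t)
        return n_ab / n_a, n_ab / len(txns)

    conf, supp = rule_metrics({"machine=abc", "txn=login"},
                              {"status=NullPointer"}, txns)
    print(f"conf={conf:.2f}, support={supp:.2f}")  # conf=0.50, support=0.33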
Association Rules – Sample Output
• Sample output (rules containing failures):

  TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

• Problem: features may not be independent
  – e.g. all LeaveFeedback txns are of type URL
  – Drop the more specific rules that are subsumed by a general rule with the same confidence
• Diagnosis: TxnName=LeaveFeedback
Experimental Setup
• Dataset
  – About 1/8 of the whole site; distinct values per feature:

    Type | Name | Pool | Machine | Version | Database | Status
    10   | 300  | 15   | 260     | 7       | 40       | 8

  – 10 one-minute traces, 4 with 2 concurrent faults
    • total of 14 independent faults:

    Fault type | Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
    Count      | 2    | 4  | 1          | 1        | 1        | 1

  – True faults identified through post-mortems, ops chat logs, application logs, etc.
• Metrics
  – Precision: (# of correctly identified faults) / (# of predicted faults)
  – Recall: (# of correctly identified faults) / (# of true faults)
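A tiny sketch of how those two metrics score a diagnosis; the fault names are invented for illustration.

    # Toy diagnosis: 3 predicted causes, 2 of which match the 4 true faults.
    true_faults = {"HostA", "UserDB", "HostB", "SW-bug"}
    predicted   = {"HostA", "UserDB", "Cgi3"}

    identified = true_faults & predicted
    precision = len(identified) / len(predicted)    # 2/3 ~= 0.67
    recall    = len(identified) / len(true_faults)  # 2/4 = 0.50
    print(precision, recall)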
Results: DBs in Dataset
• True causes for DB-related failures are captured in the dataset
  – Variable number of DBs used by each txn
• Feature selection heuristics applied:
  1. Ignore leaf nodes with no failed transactions
  2. Noise filtering – ignore nodes with < M% failures (in this case, M = 10)
  3. Path trimming – drop ancestor nodes subsumed by the leaf nodes

[Bar chart: precision and recall (0–100%) for C4.5 naïve, C4.5 with noise filtering, and C4.5 with noise filtering + path trimming]
Results: DBs not in Dataset
• True cause not captured for DB-related failures

[Bar chart: precision and recall (0–100%) for C4.5, MinEntropy, Association Rules (N=5), and Association Rules (N=10)]

• C4.5 suffers from the unbalanced dataset
  – i.e. produces a single rule that predicts every txn to be successful
What’s next?
• ROC curves
  – Show the tradeoff between precision and recall
• Transient failures
  – Up-sample to balance the dataset, or use a cost matrix
• Some measure of the “confidence” of the prediction
• More data points
  – Have 20 hrs of logs that contain failures
Open Questions
• How to deal with multiple symptoms?
  – E.g. a DB outage causing multiple types of requests to fail
  – Treat it as multiple failures?
• Failure importance (count vs. rate)
  – Two failures may have similar failure counts
  – Low volume with a higher failure rate vs. high volume with a lower failure rate