Fa: A System for Automating Failure Diagnosis

Fa: A System for Automating Failure
Diagnosis
Songyun Duan, Shivnath Babu, Kamesh Munagala
Department of Computer Science, Duke University
(ICDE09)
1
Outline
 Motivation
 Introduction
 Anomaly-based clustering
 Diagnose(F,H)
 Diagnose(F,L)
 Fa to generate signature DB
 Conclusion
2
Introduction
 Fa is a tool that diagnoses the cause of failures
quickly and automatically from system-monitoring
data.
 Fa uses monitoring data to construct a database of
failure signatures against which data from
undiagnosed failures can be matched.
 Fa uses a new technique called anomaly-based
clustering when the signature database has no
high-confidence match for an undiagnosed failure.
3
Anomaly-based clustering
 Fa is a system for mining the large volumes of high-dimensional
and noisy monitoring data generated by databases.
 Q = Diagnose(F, H ∪ L ∪ U):
 F is monitoring data from the system during the failure (or
just before the failure in the case of a system crash).
 H ∪ L ∪ U is the historic data collected so far: healthy data (H),
failure data annotated with its cause (L), and unannotated data (U).
4
Anomaly-based clustering
5
Anomaly-based clustering
6
Diagnose(F,H) -- Anomaly-based clustering

 Anomaly-based clustering places two instances into the
same cluster iff they have similar deviations from F.
 This strategy gives the right answer for the example in the figure
on the right: it generates a single cluster for H and links the failure
to attribute x only.
7
Diagnose(F,H) -- Margin Classifiers
 Diagnosis Vectors and Margin Classifiers
-- Computing the Diagnosis Vector
Fa processes a Diagnose(F, H) query by first clustering the
healthy data H into a set of clusters C1, C2, ..., and then
outputting the deviation of each cluster from F: <W1,C1>, <W2,C2>, ...

8

Diagnose(F,H) -- Margin Classifiers
 A margin classifier finds the hyperplane, defined by the weighted sum
$\sum_{j=1}^{n} w_j x_j$, that produces the maximum separation
between C and F. This maximum separation is called
the margin.
9
Diagnose(F,H) -- MAC
 Margin-based Agglomerative Clustering
 MAC is an agglomerative hierarchical clustering algorithm.

10
// dilute the "clusteredness"
Diagnose(F,H) -- PCM
 Partition-Check-Merge (PCM) combines Margin-based
Agglomerative Clustering (accurate but not efficient, with
O(|H|^2) cost) and distance-based partitioning (efficient but less
accurate).
 PCM: partition H with DPC, then run MAC on the partitions.
 If the partitions are good enough, then possibly consolidate several small
clusters into a minimal set of clusters.
 If they are not good enough, then increase the input parameter k to
the DPC algorithm, which specifies the number of clusters to
generate.
11
Diagnose(F,L)
 Four distinct annotations
 Clustering
-- the blue point in each cluster is its centroid; let f1 = <32,41>
 f1 can be matched against SD1 to find the centroid (signature)
nearest to f1.
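The matching step against SD1 can be sketched as a nearest-centroid lookup. This is only a minimal illustration: the annotation names and centroid coordinates below are hypothetical (the slide gives only f1 = <32,41>), and Euclidean distance is an assumed choice of metric.

```python
import math

# Hypothetical signature database SD1: annotation -> cluster centroid.
# These labels and coordinates are invented for illustration only.
SD1 = {
    "A1": (10.0, 12.0),
    "A2": (30.0, 40.0),
    "A3": (55.0, 15.0),
    "A4": (60.0, 60.0),
}

def nearest_signature(f, signatures):
    """Return (annotation, distance) for the centroid closest to f."""
    return min(
        ((label, math.dist(f, centroid)) for label, centroid in signatures.items()),
        key=lambda pair: pair[1],
    )

f1 = (32, 41)  # the failure instance from the slide
label, distance = nearest_signature(f1, SD1)
print(label, round(distance, 3))
```

With these made-up centroids, f1 lands closest to the centroid labeled A2, so that annotation would be reported as the diagnosis.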
12
Diagnose(F,L)
 Separating function
-- Signature Database 2 (SD2) is a matrix with each
row representing the
signature of some failure
 Using the Hamming distance

    1000 0100 0010 0001  (annotation)
XOR 0100 0100 0100 0100  (f1)
  =    2    0    2    2  (Hamming distance)
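The Hamming-distance match on this slide can be reproduced in a few lines. The four signature rows and the bit vector for f1 are taken from the slide; the function and variable names are illustrative only.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Rows of SD2 as shown on the slide: one signature per annotation.
SD2 = ["1000", "0100", "0010", "0001"]
f1_bits = "0100"  # outputs of the four separating functions on f1

distances = [hamming(row, f1_bits) for row in SD2]
closest = min(distances)
# Keep every row at the minimum distance so that ties stay visible.
matches = [i for i, d in enumerate(distances) if d == closest]

print(distances)  # [2, 0, 2, 2]
print(matches)    # [1]
```

Returning all rows at the minimum distance, rather than a single winner, makes an ambiguous match easy to detect.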
13
Diagnose(F,L)
 If f2 = <39,41>, SD2 yields
<0000> => the distance to every signature is 1
 Handling errors
-- S5, S6
-- for f2 = <000010>
 Why did SD3 diagnose f2
correctly, while SD2 did not?

14
Fa to generate signature DB
 Error-correcting codes: transmit some selected extra bits along with
regular data so that the receiver can reconstruct
the original data in the presence of errors
caused by noise or other impairments during
transmission.
 Generating the Binary Matrix
-- Generate M at random and, given a radius threshold Rt, reject it if r < Rt
(I) Each row should be distinct, since no two failures can have the same
signature.
(II) Remove columns containing all 0s or all 1s (no differentiation among failures).
(III) Two columns cannot be the same or complementary.
(IV) The radius r of M is defined as half the minimum Hamming distance between
any two rows; the higher the radius, the higher the error-correction ability of M.
 Generating the Separating Functions
-- Fa learns each separating function as a binary classifier; a classification tree (CART) works best
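Properties (I) to (IV) of the binary matrix can be checked mechanically. The sketch below assumes simple rejection sampling (draw a random matrix, discard it if a property fails or the radius is below the threshold); the slides do not say how Fa searches for M, so the procedure, function names, and parameters are assumptions.

```python
import itertools
import random

def hamming(a, b):
    """Number of positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))

def radius(matrix):
    """(IV) Half the minimum Hamming distance between any two rows."""
    return min(hamming(r1, r2) for r1, r2 in itertools.combinations(matrix, 2)) / 2

def is_valid(matrix):
    rows = [tuple(r) for r in matrix]
    cols = [tuple(c) for c in zip(*matrix)]
    if len(set(rows)) != len(rows):            # (I) rows must be distinct
        return False
    if any(len(set(c)) == 1 for c in cols):    # (II) no all-0 or all-1 column
        return False
    for c1, c2 in itertools.combinations(cols, 2):
        same = c1 == c2
        complementary = all(a != b for a, b in zip(c1, c2))
        if same or complementary:              # (III) no equal or complementary columns
            return False
    return True

def generate_matrix(n_failures, n_functions, r_t, seed=0, max_tries=10000):
    """Assumed rejection sampling: retry until M is valid and radius >= r_t."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        m = [[rng.randint(0, 1) for _ in range(n_functions)]
             for _ in range(n_failures)]
        if is_valid(m) and radius(m) >= r_t:
            return m
    raise RuntimeError("no matrix found; lower r_t or add columns")

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(is_valid(identity), radius(identity))

M = generate_matrix(n_failures=4, n_functions=6, r_t=1.5)
print(radius(M))
```

The 4x4 identity matrix passes (I) to (III) but has a minimum Hamming distance of only 2, so a single flipped bit can leave a vector equally close to several rows. Adding columns lets the generator reach a larger radius.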
15
Fa to generate signature DB
 Weighting the Separating Functions
-- machine-learning ,holds

Fa uses Support Vector Machines (SVMs) to learn the weights
 Confidence estimate:
16
Fa to generate signature DB
 Setting the confidence threshold (Ct)
A low Ct can lead to incorrect diagnoses, while a high Ct can
invoke the more expensive Diagnose(F,H) more often than needed.
 The main idea is to generate an accuracy-confidence curve (AC-
Curve) for the signature database.
 A point (x, y) on the curve means that if the confidence threshold is Ct = x,
the signature database has an expected accuracy of y% for matches having
confidence >= x.
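One way to read the AC-Curve is as a table built from validation matches. In the hedged sketch below, each match is a (confidence, was_correct) pair; the sample values, the function names, and the rule of choosing the smallest threshold that meets a target accuracy are all illustrative assumptions, not details from the slides.

```python
def ac_curve(matches):
    """For each observed confidence x, accuracy (%) over matches with confidence >= x."""
    points = []
    for x in sorted({conf for conf, _ in matches}):
        kept = [ok for conf, ok in matches if conf >= x]
        points.append((x, 100.0 * sum(kept) / len(kept)))
    return points

def pick_threshold(curve, target_accuracy):
    """Smallest Ct whose expected accuracy reaches the target, or None."""
    for x, y in curve:
        if y >= target_accuracy:
            return x
    return None

# Hypothetical validation results: (confidence of the match, diagnosis correct?)
validation = [
    (0.2, False), (0.4, True), (0.5, False),
    (0.7, True), (0.8, True), (0.9, True),
]

curve = ac_curve(validation)
ct = pick_threshold(curve, target_accuracy=90.0)
print(ct)  # 0.7
```

Picking the smallest qualifying Ct keeps the accuracy target while sending as few failures as possible to the more expensive Diagnose(F,H) path.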
17
Conclusion
 In Diagnose(F,L), is a larger signature database always better?
18
