Fa: A System for Automating Failure Diagnosis
Songyun Duan, Shivnath Babu, Kamesh Munagala
Department of Computer Science, Duke University
(ICDE09)
1
Outline
Motive
Introduction
Anomaly-based clustering
Diagnose(F,H)
Diagnose(F,L)
Fa to generate signature DB
Conclusion
2
Introduction
Fa is a tool that diagnoses the cause of failures
quickly and automatically from system-monitoring
data.
Fa uses monitoring data to construct a database of
failure signatures against which data from
undiagnosed failures can be matched.
Fa uses a new technique called anomaly-based
clustering when the signature database has no
high-confidence match for an undiagnosed failure.
3
Anomaly-based clustering
Fa is a system for mining the large volumes of high-dimensional
and noisy monitoring data generated by databases.
Q = Diagnose(F, H ∪ L ∪ U):
F is the monitoring data from the system during the failure (or
just before the failure in the case of a system crash).
H ∪ L ∪ U is the historic data collected so far.
4
Diagnose(F,H) -- Anomaly-based clustering
Anomaly-based clustering places two instances into the
same cluster iff they have similar deviations from F.
This strategy gives the right answer for the example in the figure
on the right: it generates a single cluster for H and links the failure
to attribute x only.
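To make the "similar deviations from F" idea concrete, here is a minimal Python sketch. It is a simplification, not Fa's actual algorithm: it assumes a deviation can be summarized per attribute as a z-score against F, and the function names, the threshold of 3, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def deviation_pattern(point, failure_rows, threshold=3.0):
    """Per attribute: -1, 0 or +1 depending on how `point` deviates from F.

    Assumption: a z-score against the failure data stands in for "deviation".
    """
    pattern = []
    for j, value in enumerate(point):
        column = [row[j] for row in failure_rows]
        mu, sigma = mean(column), pstdev(column)
        z = (value - mu) / (sigma if sigma > 0 else 1.0)
        pattern.append(0 if abs(z) < threshold else (1 if z > 0 else -1))
    return tuple(pattern)


def anomaly_based_groups(healthy_rows, failure_rows, threshold=3.0):
    """Group healthy instances that deviate from F in the same way."""
    groups = {}
    for point in healthy_rows:
        key = deviation_pattern(point, failure_rows, threshold)
        groups.setdefault(key, []).append(point)
    return groups


# Made-up (x, y) data: H splits into two groups by distance (low y, high y),
# but every healthy point differs from F in x only.
F = [(90, 10), (92, 50), (91, 12), (89, 48)]
H = [(10, 10), (12, 11), (11, 50), (9, 49)]
groups = anomaly_based_groups(H, F)
print(groups)
```

On this toy data all healthy points share one deviation pattern (x lower than in F, y not significantly different), so they end up in a single group even though a distance-based method would likely split them by y.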
7
Diagnose(F,H) -- Margin Classifiers
Diagnosis Vectors and Margin Classifiers
-- Computing the Diagnosis Vector
Fa processes a Diagnose(F, H) query by first clustering the
healthy data H into a set of clusters C1, C2, ...
and then outputting the deviation of F from each cluster as
pairs <W1,C1>, <W2,C2>, ..., where Wi is the diagnosis vector for Ci.
8
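Diagnose(F,H) -- Margin Classifiers
For a cluster $C_i$, the margin classifier finds the linear
function $\sum_{j=1}^{n} w_{ij} x_j$ of the $n$ attributes
that produces the maximum separation
between $C_i$ and $F$. This maximum separation is called
the margin.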
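As a small illustration of what "margin" means here, the sketch below measures the separation that a given weight vector achieves between a cluster and F. It does not search for the maximum-margin weights (an SVM solver would do that); the offset `c`, the function name, and the toy data are assumptions added for the example.

```python
import math


def geometric_margin(w, c, cluster_rows, failure_rows):
    """Smallest distance from any point to the hyperplane sum_j w[j]*x[j] = c.

    Cluster points are expected on the negative side and failure points on
    the positive side; a negative result means the hyperplane misclassifies.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))

    def signed_distance(x):
        return (sum(wj * xj for wj, xj in zip(w, x)) - c) / norm

    distances = [-signed_distance(x) for x in cluster_rows]
    distances += [signed_distance(x) for x in failure_rows]
    return min(distances)


# Made-up (x, y) data: the cluster and F differ in attribute x only.
C = [(10, 10), (12, 50)]
F = [(90, 10), (92, 50)]

x_only = geometric_margin((1, 0), 51, C, F)   # weight on x only
diagonal = geometric_margin((1, 1), 81, C, F) # equal weight on x and y
print(x_only, diagonal)
```

On this data the separator that puts all its weight on x achieves a margin of 39, clearly larger than the diagonal one, which matches the intuition that the weights of a large-margin separator point at the attributes that distinguish F from the cluster.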
9
Diagnose(F,H) -- MAC
Margin-based Agglomerative Clustering
MAC is an agglomerative hierarchical clustering algorithm
10
// dilute the “clusterdness”
Diagnose(F,H) -- PCM
Partition-Check-Merge (PCM) combines Margin-based
Agglomerative Clustering (accurate but not efficient:
O(|H|^2)) with distance-based partitioning (efficient but less
accurate).
PCM: partition H with DPC, then run MAC within each part.
If the result is good enough, PCM may consolidate several small
clusters into a minimal set of clusters.
If it is not good enough, PCM increases the input parameter k of
the DPC algorithm, which specifies the number of clusters to
generate.
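The partition, check, and merge steps above can be sketched as plain control flow. This is only an outline under stated assumptions: every step is passed in as a hypothetical callable, and the toy stand-ins (splitting 1-D values at their largest gaps, treating each part as one cluster, a spread-based check, an identity merge) are not the DPC, MAC, or quality test used by Fa.

```python
def pcm(rows, partition, cluster_part, good_enough, merge, k=2):
    """Partition-Check-Merge control flow with hypothetical stand-in steps."""
    while True:
        parts = partition(rows, k)                      # partition (DPC role)
        clusters = [c for part in parts
                    for c in cluster_part(part)]        # per-part clustering (MAC role)
        if good_enough(clusters) or k >= len(rows):     # check
            return merge(clusters)                      # merge small clusters
        k += 1                                          # otherwise ask for more parts


def split_at_largest_gaps(values, k):
    """Toy partitioner: cut sorted 1-D values at their k-1 largest gaps."""
    ordered = sorted(values)
    by_gap = sorted(range(1, len(ordered)),
                    key=lambda i: ordered[i] - ordered[i - 1], reverse=True)
    parts, start = [], 0
    for cut in sorted(by_gap[:k - 1]):
        parts.append(ordered[start:cut])
        start = cut
    parts.append(ordered[start:])
    return parts


def tight(clusters, max_spread=3):
    """Toy check: every cluster spans at most `max_spread`."""
    return all(max(c) - min(c) <= max_spread for c in clusters)


result = pcm([1, 2, 3, 10, 11, 12, 20, 21],
             split_at_largest_gaps,
             lambda part: [part],      # toy per-part clustering: one cluster per part
             tight,
             lambda clusters: clusters)  # toy merge: leave clusters unchanged
print(result)
```

With k = 2 the toy check fails because one part spans 1 to 12, so the loop retries with k = 3 and returns three tight clusters.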
11
Diagnose(F,L)
The example has four distinct annotations
Clustering
-- the blue point in each cluster is its centroid; let f1 = <32,41>
-- f1 can be matched against SD1 to find the centroid (signature)
nearest to f1
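Matching f1 against centroid signatures amounts to a nearest-neighbor lookup, sketched below. The centroid coordinates and annotation names in `sd1` are invented for illustration (the slide's figure is not available), and Euclidean distance is an assumption.

```python
import math


def nearest_signature(f, signature_db):
    """Return the annotation whose centroid is closest to f (Euclidean)."""
    return min(signature_db, key=lambda name: math.dist(f, signature_db[name]))


# Hypothetical SD1: one centroid per annotation.
sd1 = {"A1": (10, 12), "A2": (30, 40), "A3": (55, 20), "A4": (70, 65)}

f1 = (32, 41)
best = nearest_signature(f1, sd1)
print(best)
```

With these made-up centroids, f1 = <32,41> is matched to "A2".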
12
Diagnose(F,L)
Separating function
-- Signature Database 2 (SD2)
is a matrix with each
row representing the
signature of some failure
Using the Hamming distance:
signature (annotation):  1000  0100  0010  0001
f1:                      0100  0100  0100  0100
Hamming distance:        2     0     2     2
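The Hamming-distance match above can be reproduced with a few lines of Python. The signatures are the four rows from the slide; the annotation names A1 to A4 and the function names are placeholders.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match(bits, signature_db):
    """Return (annotations at minimum distance, that distance)."""
    distances = {name: hamming(bits, sig) for name, sig in signature_db.items()}
    best = min(distances.values())
    return [name for name, d in distances.items() if d == best], best


sd2 = {"A1": "1000", "A2": "0100", "A3": "0010", "A4": "0001"}

print([hamming("0100", sig) for sig in sd2.values()])  # [2, 0, 2, 2]
print(match("0100", sd2))  # (['A2'], 0)
print(match("0000", sd2))  # (['A1', 'A2', 'A3', 'A4'], 1)
```

The last call shows the weakness of this matrix: a bit vector such as 0000 is at distance 1 from every signature, so no unique match exists.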
13
Diagnose(F,L)
If f2 = <39,41>, SD2 maps it to
<0000> => the Hamming distance to every signature is 1, so there is no unique match
Handling errors
-- S5, S6
-- for f2 = <000010>
Why did SD3 diagnose f2
correctly, while SD2 did not?
14
The idea is the same as in error-correcting codes:
transmit some selected extra bits along with the
regular data so that the receiver can reconstruct
the original data in the presence of errors
caused by noise or other impairments during
transmission.
Fa to generate signature DB
Generating the Binary Matrix
-- M is generated at random and its radius r is compared with a threshold Rt (the test r < Rt); M must satisfy:
(I) Each row should be distinct since no two failures can have the same
signature.
(II) Remove columns containing all 0s or all 1s (no differentiation among failures).
(III) No two columns can be the same or complementary.
(IV) The radius r of M is defined as half the minimum Hamming distance between its rows;
the higher the radius, the higher the error-correction ability of M.
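Conditions (I) to (IV) are easy to check mechanically. The sketch below does so for a matrix given as tuples of 0/1 values; the function names and the returned dictionary layout are illustrative choices, and the radius follows the definition on the slide (half the minimum Hamming distance, taken here between rows).

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def check_matrix(rows):
    """Check conditions (I)-(III) and compute the radius (IV) of a 0/1 matrix."""
    columns = list(zip(*rows))
    complement = lambda col: tuple(1 - bit for bit in col)
    return {
        # (I) every failure has its own signature
        "distinct_rows": len(set(rows)) == len(rows),
        # (II) no column is all 0s or all 1s
        "informative_columns": all(0 < sum(col) < len(col) for col in columns),
        # (III) no two columns are identical or complementary
        "independent_columns": all(a != b and a != complement(b)
                                   for a, b in combinations(columns, 2)),
        # (IV) half the minimum Hamming distance between rows
        "radius": min(hamming(a, b) for a, b in combinations(rows, 2)) / 2,
    }


# The 4x4 matrix from the Hamming-distance example.
identity = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
report = check_matrix(identity)
print(report)
```

The identity matrix passes (I) to (III), but its minimum row distance is only 2 (radius 1.0), which is why a single flipped bit can leave a vector equally close to several signatures.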
Generating the Separating Functions
-- Fa learns each separating function as a binary classifier; a classification tree (CART) works best
15
Fa to generate signature DB
Weighting the Separating Functions
-- machine-learning ,holds
Fa uses Support Vector Machines (SVMs) to learn the weights
Confidence estimate:
16
Fa to generate signature DB
Setting the confidence threshold ( Ct )
A low Ct can lead to incorrect diagnoses, while a high Ct can
invoke the more expensive Diagnose(F,H) more often than needed.
The main idea is to generate an accuracy-confidence curve
(AC-Curve) for the signature database.
If the confidence threshold is Ct = x, the signature database has an
expected accuracy of y% for matches having confidence >= x.
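An AC-Curve of this kind can be computed from a set of past matches whose correctness is known. The sketch below assumes such validation data is available as (confidence, was_correct) pairs; the data values, the function names, and the rule for picking Ct (smallest threshold that reaches a target accuracy) are assumptions for illustration.

```python
def ac_curve(matches):
    """For each observed confidence x, accuracy (%) of matches with confidence >= x."""
    curve = []
    for x in sorted({conf for conf, _ in matches}):
        kept = [ok for conf, ok in matches if conf >= x]
        curve.append((x, 100.0 * sum(kept) / len(kept)))
    return curve


def pick_threshold(curve, target_accuracy):
    """Smallest threshold whose expected accuracy reaches the target, else None."""
    for x, accuracy in curve:
        if accuracy >= target_accuracy:
            return x
    return None


# Made-up validation matches: (confidence, diagnosis was correct).
matches = [(0.2, False), (0.4, True), (0.5, False), (0.7, True), (0.9, True)]
curve = ac_curve(matches)
ct = pick_threshold(curve, 90.0)
print(curve)
print(ct)
```

Choosing the smallest qualifying threshold keeps the signature database in use as often as possible while still meeting the accuracy target, which reflects the trade-off described above.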
17
Conclusion
In Diagnose(F,L), is a larger signature database always better?
18