A statistical approach for anomaly detection: preliminary results


DOTS-LCCI PM 5th Meeting
ROMA
30-31 May 2011
A statistical anomaly-based algorithm for
on-line fault detection in complex software
critical systems
A. Bovenzi – F. Brancati
Università degli Studi di Napoli "Federico II"
Dipartimento di Informatica e Sistemistica
Università degli Studi di Firenze
Dipartimento di Sistemi e Informatica
Motivations and conclusion
From last Meeting
Detect Failures in complex & critical SW systems
Legacy and OTS (Off-The-Shelf) based
Several interacting components
Different configurations
Detection performed at process (thread) level
Crash failures
which cause a process (thread) to terminate unexpectedly
Hang failures (active and passive)
which cause a process (thread) to be suspended while its external state remains constant
Fail-halt (or Fail-stop) Systems
Motivations and conclusion
From last Meeting
Definition of Anomaly
With respect to a monitored variable characterizing the behavior of the system, the term anomaly denotes a change in this variable caused by specific, non-random factors [Montgomery 00]
e.g., overload, the activation of faults, malicious attacks
On-line anomaly detection is an essential means to guarantee the dependability of complex and critical software systems
Difficult task because of system properties
Complexity (lots of interacting components)
Highly dynamic (frequent reconfigurations, updates)
Several sources of non-determinism
Motivations and conclusion
From last Meeting
Anomaly Detectors can take advantage of the possibility to evaluate on-line the expected behavior of monitored variables
• Internal R&SAClock algorithm (SPS) adapted for anomaly detection
• Comparison with respect to static thresholds
• Preliminary results improve:
  • Fewer False Positives
  • Better Precision and Recall
What has been done in the meantime?
Outline
1. SPS-based detection framework
a) Static vs Adaptive Thresholds
2. Experimental Evaluation
a) Case study
b) Monitored variables
c) The Experimental Phase
3. Metrics
4. Experimental results
5. Conclusion and future work
The Detection Framework
[Figure: Detection framework. A monitoring tool collects indicators α1, α2, …, αi at the application, middleware, and kernel levels; after a training phase, the per-monitor alarms are combined as Σ ai wi to produce the detection output D.]
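Below is a minimal sketch of the alarm-combination step shown in the figure, assuming each monitor i raises a binary alarm ai that is weighted by wi; comparing the weighted sum against a decision threshold is an assumption here, since the slide only shows Σ ai wi feeding the output D, and the weights and threshold below are illustrative.

```python
def combine_alarms(alarms, weights, decision_threshold=0.5):
    """Combine per-monitor binary alarms a_i into the global verdict D."""
    score = sum(a * w for a, w in zip(alarms, weights))
    return score >= decision_threshold

# Example: three monitors, the second and third raise an alarm.
print(combine_alarms(alarms=[0, 1, 1], weights=[0.5, 0.3, 0.2]))  # True
```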
Static thresholds limitation: they assume that the operational conditions of the system are similar to those of the training set
Static vs Adaptive Thresholds
[Plot: monitored variable over time, with a failure occurring at ~160 sec; SPS adaptive bounds vs. static thresholds]
• SPS signals the failure
• Static thresholds signal the failure but produce lots of False Positives
SPS-based Detection
[Figure: Detection framework with SPS. Each indicator αi collected by the monitoring tool from the application, middleware, and kernel feeds a dedicated SPSi instance; the resulting alarms are combined as Σ ai wi to produce the detection output D.]
SPS algorithm
T_u(t) = x(t_0) + P(t) + SM(t_0)
T_l(t) = x(t_0) − P(t) − SM(t_0)
• Copes with variable and non-stationary behavior
• Works without any initial training phase
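As a rough illustration of the threshold formula above, the sketch below computes adaptive upper and lower bounds over a sliding window of the last m samples; modelling P(t) as a linear drift estimate and SM(t_0) as a coverage-dependent multiple of the sample standard deviation are assumptions made only to keep the sketch self-contained, not the actual SPS internals.

```python
from collections import deque
import statistics

class AdaptiveThresholds:
    """Adaptive bounds T_u(t) = x(t0) + P(t) + SM(t0), T_l(t) = x(t0) - P(t) - SM(t0).

    P(t) is a simple linear drift estimate and SM(t0) is k times the sample
    standard deviation over the last m samples; both are assumptions for this
    sketch, not the statistical machinery of the real SPS algorithm.
    """

    def __init__(self, memory_depth=20, k_coverage=3.0):
        self.window = deque(maxlen=memory_depth)  # last m samples as (time, value)
        self.k = k_coverage                       # grows with the required coverage c

    def update(self, t, x):
        self.window.append((t, x))

    def bounds(self, t):
        """Return (T_u(t), T_l(t)); call update() at least once first."""
        times, values = zip(*self.window)
        t0, x0 = times[-1], values[-1]
        slope = (values[-1] - values[0]) / max(times[-1] - times[0], 1e-9)
        p_t = abs(slope) * (t - t0)               # P(t): predicted variation since t0
        sm = self.k * statistics.pstdev(values)   # SM(t0): safety margin
        return x0 + p_t + sm, x0 - p_t - sm
```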
Case Study: the SWIM-BOX ®
ATC domain
[Figure: From fragmented systems towards a network of integrated co-operators. Legacy systems A and B, each at its own legacy site, connect through Adapter A and Adapter B to the SWIM-BOX, which communicates over the common SWIM network infrastructure.]
SWIM-BOX: to cooperate & share information between distributed and heterogeneous ATC legacy systems
• Web Services
• Publisher/Subscriber
• OTS-based:
  • JBoss AS, OSPL, RTI DDS, MySQL DB
Monitored variables
Capture the application behavior indirectly
Breakpoints placed in specific kernel functions
Probe handlers to quickly collect data
e.g., input parameters, return values
Probe | Trigger condition for event registration
System call error code | An error code is returned
Time scheduling of process | Timeout exceeded since the process was preempted
Signal | A signal is received
Process/Thread creation/termination | Creation or termination of a process (thread)
I/O on Disk | Timeout exceeded since last disk read/write
I/O on Socket | Timeout exceeded since last socket read/write
Holding time for Mutex/Semaphore | Timeout exceeded for mutex/semaphore possession
Waiting time for Mutex/Semaphore acquisition | Timeout exceeded for mutex/semaphore acquisition
Disk Throughput | A byte is read/written
Network Throughput | A byte is sent/received
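As a purely illustrative, user-space sketch of the timeout-style trigger conditions listed above (disk/socket I/O, scheduling, mutex holding/waiting): the real framework collects these events with breakpoints and probe handlers placed in kernel functions, and the timeout value below is an arbitrary assumption.

```python
import time

class TimeoutTrigger:
    """Raises an alarm if no probed event has been observed within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_event = time.monotonic()

    def record_event(self):
        # Call whenever the probed event is observed (e.g., a disk write completes).
        self.last_event = time.monotonic()

    def fired(self):
        # True once the timeout has been exceeded since the last event.
        return time.monotonic() - self.last_event > self.timeout_s

# Hypothetical monitor: alarm if no disk I/O happened for 2 seconds.
disk_io = TimeoutTrigger(timeout_s=2.0)
```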
The Experimental Phase
1. Workload selection (SWIM-BOX Validation Plan)
Differing in:
message rate
messages per burst
time between bursts
2. Experiments execution
Golden Run
Faulty Run
Source code mutation tool (http://www.mobilab.unina.it/SFI.htm)
3. Post processing phase
Both algorithms applied to the monitored data
Varying several algorithm parameters
Post processing methodology
Experimental dataset was divided into
training set (used for parameter tuning)
validation set (used for testing both algorithms)
SPS algorithm parameters:
coverage c: {0.9, 0.99, 0.9999}
memory depth m (number of samples considered in the statistics): {10, 20}
time for detection d: {m, m/2, m/3}
Static Thresholds algorithm parameter:
method for the evaluation of the thresholds for each monitor Mi:
ri ∈ { [min, max], [μ − σ, μ + σ] }
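A small sketch of the two static-threshold rules above, computed per monitor from training data (the training samples below are placeholders):

```python
import statistics

def static_thresholds(training_samples, rule="minmax"):
    """Per-monitor static bounds: [min, max] or [mu - sigma, mu + sigma]."""
    if rule == "minmax":
        return min(training_samples), max(training_samples)
    mu = statistics.mean(training_samples)
    sigma = statistics.pstdev(training_samples)
    return mu - sigma, mu + sigma

train = [0.8, 1.1, 0.9, 1.3, 1.0]            # placeholder training samples
lo, hi = static_thresholds(train, rule="musigma")
is_anomalous = lambda x: x < lo or x > hi     # alarm whenever the sample leaves the bounds
```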
Quality metrics
Basic:
True Positive (TP): if a failure occurs and the detector triggers an alarm;
False Positive (FP): if no failure occurs and an alarm is given;
True Negative (TN): if no real failure occurs and no alarm is raised;
False Negative (FN): if the algorithm fails to detect an occurring failure.
Derived:
Metric | Formula
Coverage (C) | TP / (TP + FN)
Precision (P) | TP / (TP + FP)
F-Measure | (2 · P · C) / (P + C)
FPR (False Positive Rate) | FP / (FP + TN)
Accuracy (A) | (TP + TN) / (TP + FP + TN + FN)
A·C (Accuracy-Coverage Trade-off) | A · C
aTM (Average mistake duration) | E(T_M)
aMR (Average Mistake Rate) | 1 / E(T_MR)
aPA (Query accuracy probability) | E(T_MR − T_M) / E(T_MR)
MDD (Mean Delay for Detection) | E(T_D)
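For clarity, the count-based metrics in the table can be computed as follows from the TP, FP, TN, FN totals of an experiment; the time-based metrics (aTM, aMR, aPA, MDD) require the mistake/detection time series and are omitted from this sketch, and the counts in the example call are illustrative only.

```python
def derived_metrics(tp, fp, tn, fn):
    """Count-based detection metrics derived from TP, FP, TN, FN totals."""
    coverage = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "coverage": coverage,
        "precision": precision,
        "f_measure": 2 * precision * coverage / (precision + coverage),
        "fpr": fp / (fp + tn),
        "accuracy": accuracy,
        "a_times_c": accuracy * coverage,  # accuracy-coverage trade-off
    }

print(derived_metrics(tp=90, fp=5, tn=900, fn=10))
```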
Experimental results
Three different scenarios:
SPS Algorithm
Static Thresholds Algorithm with a training phase
Static Thresholds Algorithm without a training phase
Best configuration comparison
Synthesis = mean( ||aPA|| + ||Coverage|| + ||Accuracy|| + (1 − ||aTM||) + (1 − ||aMR||) )
Most relevant and non-correlated metrics
Weights depend on system requirements
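A minimal sketch of the Synthesis score, assuming the ||·|| normalisation maps each metric to [0, 1] across the compared configurations (an interpretation, not stated on the slide), with aTM and aMR inverted because lower values are better:

```python
def synthesis(aPA, coverage, accuracy, aTM_norm, aMR_norm):
    """Mean of the normalised metrics, with aTM and aMR inverted (lower is better)."""
    terms = [aPA, coverage, accuracy, 1 - aTM_norm, 1 - aMR_norm]
    return sum(terms) / len(terms)

# Placeholder, already-normalised inputs:
print(synthesis(aPA=0.95, coverage=0.90, accuracy=0.96, aTM_norm=0.0, aMR_norm=0.0))
```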
Best configurations:
Algorithm | d | t | (c, m)
SPS | 20 | 0.4 | (0.99, 20)
ST-Training | 3 | 0.5 | n/a
ST-No Training | 3 | 0.3 | n/a
Experimental results
Algorithm | aTM (sec) | aMR | aPA | A·C | Synthesis
SPS Algorithm | 3.611111 | 0.009804 | 0.95098 | 0.96004 | 0.914551
Static T. without training | 11.63951 | 0.032724 | 0.674494 | 0.582356 | 0.59768
Static T. with training | 4.75 | 0.023715 | 0.85584 | 0.974436 | 0.842296
Conclusion
Error Detection exploiting OS-level indicators can be improved by means of the SPS algorithm
• Experimental results (achieved via fault injection) show the limitations of static-threshold algorithms in scenarios where the operational conditions of the system differ from those of the training phase
• A detector equipped with SPS:
  • copes with variable and non-stationary systems
  • needs no training phase
  • performs better in terms of Coverage, Query accuracy probability, Mistake rate and Mistake duration
Future work
Investigate SPS-based Detector performance by varying the number and
type of monitored variables
Is the detection framework application independent?
Explore how the detection framework performs under different OSs
Is the detection framework OS independent? Which OS is best suited for the
proposed approach?
New experimental campaign planned
Same case study under Windows Server 2008
Compare the Detector performance by varying Predictors
Is the SPS-based predictor the best choice?
Compare SPS with ARIMA models, neural networks, …
Thank you for your attention
Questions?
Insights for future work?