
Working Group 2 (Tavolo 2) - Big Data
Adaptive Monitoring
CIS, UNINA, UNICAL, UNIFI
First results
• The working group's first results were published in the paper
Big Data for Security Monitoring: Challenges and Opportunities in Critical Infrastructures Protection
L. Aniello (2), A. Bondavalli (3), A. Ceccarelli (3), C. Ciccotelli (2), M. Cinque (1), F. Frattini (1), A. Guzzo (4), A. Pecchia (1), A. Pugliese (4), L. Querzoni (2), S. Russo (1)
(1) UNINA, (2) CIS, (3) UNIFI, (4) UNICAL
• Presented at the BIG4CIP workshop @ EDCC (12 May 2014)
Data-Driven Security Framework
[Framework diagram: a RAW DATA COLLECTION layer gathers environmental data, node resource data, network audit, application/system logs, and IDS alerts from the critical infrastructure; a DATA PROCESSING layer with a MONITORING ADAPTER feeds the DATA ANALYSIS layer (attack modeling, invariant-based mining, conformance checking, fuzzy logic, Bayesian inference, ...), backed by a KNOWLEDGE BASE; the analysis layer drives the protection actions and the adaptive monitoring loop.]
ADAPTIVE MONITORING
Scenario
• Problem:
need to analyze more data coming from distinct sources in
order to improve the capability to detect faults/cyber attacks
o Excessively large volumes of information to transfer and analyze
o Negative impact on the performance of monitored systems
• Proposed solution:
dynamically adapt the granularity of monitoring
o Normal case: coarse-grained monitoring (low-overhead)
o Upon anomaly detection: fine-grained monitoring (higher overhead)
• Two distinct scenarios
o Fault detection → current CIS research direction
o Cyber attack detection
Anomaly Detection
• Metrics Selection
o Find correlated metrics (invariants) to be used as anomaly signals
o Learn which invariants hold when the system is healthy
▪ Profile the healthy behavior of the monitored system
• Anomaly Detection
o Monitor the health of the system by looking at a few metrics
▪ How to choose these metrics? (a selection sketch follows the references below)
o When an invariant no longer holds, adapt the monitoring
▪ The aim is to detect the root cause of the problem
▪ Possibility of false positives
[1] J., M., R., W., "Information-Theoretic Modeling for Tracking the Health of Complex Software Systems", 2008
[2] J., M., R., W., "Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis", 2008
[3] M., J., R., W., "Filtering System Metrics for Minimal Correlation-Based Self-Monitoring", 2009
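To make the metrics-selection step concrete, here is a minimal Python sketch of correlation-based invariant mining in the spirit of [3]; the metric names, sample values, and the 0.95 threshold are illustrative assumptions, not taken from the papers:

    import numpy as np

    def find_invariant_pairs(metrics, min_corr=0.95):
        """Return metric pairs that stay strongly correlated during healthy runs."""
        names = list(metrics)
        pairs = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                x = np.asarray(metrics[names[i]], dtype=float)
                y = np.asarray(metrics[names[j]], dtype=float)
                r = np.corrcoef(x, y)[0, 1]
                if abs(r) >= min_corr:
                    pairs.append((names[i], names[j], r))
        return pairs

    # Illustrative healthy-run profile: CPU load tracks the request rate.
    healthy = {
        "req_rate": [10, 20, 30, 40, 50],
        "cpu_load": [11, 19, 31, 42, 49],
        "disk_temp": [35, 35, 36, 35, 36],
    }
    for a, b, r in find_invariant_pairs(healthy):
        print(f"invariant: {a} ~ {b} (r={r:.2f})")

Pairs that stay correlated on healthy runs are candidate invariants; a small representative subset of the involved metrics can then be monitored at low overhead.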
Adapt the Monitoring
• Two dimensions in adapting the monitoring
o Change the set of monitored metrics
o Change the frequency of metric retrieval
• How to choose the way of adapting the monitoring on the basis of the detected anomaly?
• Additional issue
o The goal of the adaptation is discovering the root cause of the problem
o Need to zoom in on specific portions of the system
▪ Very likely to increase the amount of data to transfer/analyze
▪ Risk of a negative impact on system performance
▪ Possible solution: keep the volume of monitored data limited by zooming out other portions of the system (a toy sketch follows the references below)
[4] M., R., J., A., W., "Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis", 2008
[5] M., W., "Leveraging Many Simple Statistical Models to Adaptively Monitor Software Systems", 2014
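A toy sketch of the two adaptation dimensions under a data-volume budget; the component names, metric counts, and sampling periods are hypothetical. On an anomaly the suspected component is zoomed in (more metrics, shorter period) while the others are zoomed out to partially offset the extra volume:

    # Hypothetical monitoring plan: per-component (num_metrics, period_s).
    BASE = {"web": (4, 60), "app": (4, 60), "db": (4, 60)}
    FINE = (16, 5)      # zoomed-in: many metrics, short period
    COARSE = (2, 120)   # zoomed-out: few metrics, long period

    def data_rate(plan):
        # samples per second, a rough proxy for the volume of monitoring data
        return sum(m / p for m, p in plan.values())

    def adapt(plan, anomalous_component):
        """Zoom in on the suspect component, zoom out everywhere else."""
        return {comp: (FINE if comp == anomalous_component else COARSE)
                for comp in plan}

    plan = adapt(BASE, "db")
    print(f"rate before: {data_rate(BASE):.3f} -> after: {data_rate(plan):.3f} samples/s")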
Fault Localization
• Goal: given a set of alerts, determine which fault occurred and which component originated it
• Problems
o The same alert may be due to different faults (Ambiguity)
o A single fault may cause several alerts (Domino Effect)
o Concurrent alerts may be generated by concurrent unrelated faults
o Tradeoff: monitoring granularity vs. precision of fault identification
• Approaches (a toy model-based sketch follows the references below):
o Probabilistic models (e.g., HMM, Bayesian Networks)
o Machine learning techniques (e.g., Neural Networks, Decision Trees)
o Model-based techniques (e.g., Dependency Graphs, Causality Graphs)
[6] S., S., "A survey of fault localization techniques in computer networks", 2004
[7] D., G., B., C., "Hidden Markov Models as a Support for Diagnosis: ...", 2006
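As a toy illustration of the model-based option (the fault/alert catalogue below is invented, not taken from [6] or [7]): each candidate fault is scored by how well the alerts it is expected to cause overlap the observed ones, which makes the Ambiguity and Domino Effect problems directly visible:

    # Hypothetical causality map: (component, fault) -> alerts it should raise.
    CAUSES = {
        ("db", "disk_full"): {"db_write_err", "app_timeout", "web_5xx"},
        ("app", "mem_leak"): {"app_timeout", "app_gc_stall"},
        ("net", "link_down"): {"web_5xx", "net_loss"},
    }

    def rank_faults(observed):
        """Score candidate faults by Jaccard overlap with the observed alert set."""
        scores = []
        for fault, expected in CAUSES.items():
            score = len(observed & expected) / len(observed | expected)
            scores.append((score, fault))
        return sorted(scores, reverse=True)

    observed = {"app_timeout", "web_5xx"}
    for score, (component, fault) in rank_faults(observed):
        print(f"{component}/{fault}: {score:.2f}")

Here two observed alerts are partially explained by three different faults: exactly the ambiguity that finer-grained monitoring is meant to resolve.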
Prototype - Work in Progress
Monitoring of a JBoss cluster using Ganglia
[Deployment diagram: four hosts (Host #1 to Host #4), each running a JBoss AS instance and a Ganglia gmond daemon; the monitored metrics are aggregated by gmetad on a dedicated monitoring host, where the Adaptive Monitoring component runs.]
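As a first building block for such a prototype, the monitored metrics can be read from gmetad, which dumps the current cluster state as XML on its TCP port (8651 by default). A minimal polling sketch; the metric subset in the filter is purely illustrative:

    import socket
    import xml.etree.ElementTree as ET

    def poll_gmetad(host="localhost", port=8651):
        """Read the full cluster snapshot that gmetad dumps on connect."""
        chunks = []
        with socket.create_connection((host, port)) as sock:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return ET.fromstring(b"".join(chunks))

    root = poll_gmetad()
    for node in root.iter("HOST"):
        for metric in node.iter("METRIC"):
            if metric.get("NAME") in ("load_one", "mem_free"):  # illustrative subset
                print(node.get("NAME"), metric.get("NAME"), metric.get("VAL"))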
Prototype - Goals
• Identify a small set of metrics to monitor on a JBoss cluster to
detect possible faults
o Find existing correlations
o Profile healthy behavior
• Inject faults into JBoss with Byteman (http://byteman.jboss.org/)
• For each fault, identify the set of additional metrics to monitor
• Implement the prototype in order to evaluate
o The effectiveness of the approach
o The reactivity of the adaptation
o The overhead of the adaptation
OPERATING SYSTEMS AND APPLICATION SERVERS MONITORING
Data collection and processing
• Collects a selection of attributes from the OS and the AS, through probes installed on the machines
– The current implementation observes Tomcat 7 and CentOS 6
• Executes the Statistical Prediction and Safety Margin algorithm on the collected data (a sketch of the idea follows below)
• The Esper CEP engine is used to apply rules to events (it performs the detection of anomalies)
• Work partially done within the context of the Secure! project (see later today)
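A rough sketch of the prediction-plus-safety-margin idea (the window size and margin factor are illustrative; the actual algorithm and its Esper rules may differ): each new sample is checked against a moving-average prediction plus a margin derived from recent variability:

    from collections import deque
    from statistics import mean, stdev

    def detect(stream, window=30, k=3.0):
        """Yield (index, value, prediction) for samples outside the safety margin."""
        recent = deque(maxlen=window)
        for i, x in enumerate(stream):
            if len(recent) == window:
                pred = mean(recent)
                margin = k * stdev(recent)
                if abs(x - pred) > margin:
                    yield i, x, pred
            recent.append(x)

    # Illustrative CPU-utilisation stream with an injected spike at the end.
    samples = [0.30 + 0.01 * (i % 5) for i in range(60)] + [0.95]
    for i, x, pred in detect(samples):
        print(f"anomaly at sample {i}: value={x:.2f}, predicted~{pred:.2f}")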
High-level view
INVARIANTS MINING
Why invariants?
• Invariants are properties of a program that are
guaranteed to hold for all executions of the program.
– If those properties are broken at runtime, it is possible to
raise an alarm for immediate action
• Invariants can be useful to
– detect transient faults, silent errors and failures
– report performance issues
– avoid SLA violations
– help operators to understand the runtime behavior of the app
• Pretty natural properties for apps performing batch
work
An example of flow intensity invariant
• A platform for the batch processing of files: the processing time is proportional to the file size
• Measuring the file size and the time spent in a stage, i.e., the flow intensities I(x) and I(y), the equation
I(y) = k · I(x)
is an invariant relationship characterising the expected behaviour of the batch system.
– If there is an execution problem (e.g., file processing hangs), the equation no longer holds (broken invariant)
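A small sketch of how such an invariant can be learned and checked, on synthetic data: k is fitted by least squares on healthy executions, and a new measurement breaks the invariant when it deviates from k · I(x) by more than a tolerance:

    import numpy as np

    # Healthy executions: processing time (s) vs. file size (MB), roughly y = 2x.
    size = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
    time = np.array([21.0, 39.0, 82.0, 159.0, 321.0])

    # Least-squares fit of y = k * x (no intercept).
    k = float(size @ time) / float(size @ size)

    def invariant_holds(x, y, tol=0.2):
        """True if the measured time is within tol (relative) of k * x."""
        return abs(y - k * x) <= tol * k * x

    print(f"fitted k = {k:.2f}")
    print(invariant_holds(50.0, 101.0))   # normal run  -> True
    print(invariant_holds(50.0, 600.0))   # hung stage  -> False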
Research questions
RQ1: How to discover invariants out of the hundreds of properties observable from an application log?
RQ2: How to detect broken invariants at runtime?
Our contribution
AUTOMATED MINING
A framework and a tool for mining invariants automatically from application logs
• tested on 9 months of logs collected from a real-world Infosys CPG SaaS application
• able to automatically select 12 invariants out of 528 possible relationships
IMPROVED DETECTION
An adaptive threshold scheme defined to significantly shrink the number of broken invariants (a sketch follows below)
• from thousands to tens of broken invariants w.r.t. static thresholds on our dataset
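A minimal sketch of an adaptive threshold in this spirit (the rolling-quantile scheme below is an illustrative stand-in, not necessarily the one implemented in the tool): instead of a fixed bound on the invariant residual, the bound follows a high quantile of recently observed residuals, so noisy invariants do not flood the operator with broken-invariant reports:

    from collections import deque
    import statistics

    class AdaptiveThreshold:
        """Flag a residual only when it exceeds a rolling quantile of recent ones."""
        def __init__(self, window=200, factor=1.5):
            self.history = deque(maxlen=window)
            self.factor = factor

        def is_broken(self, residual):
            r = abs(residual)
            if len(self.history) >= 20:
                q95 = statistics.quantiles(self.history, n=20)[-1]  # ~95th percentile
                broken = r > self.factor * q95
            else:
                broken = False   # not enough history yet: stay silent
            self.history.append(r)
            return broken

    at = AdaptiveThreshold()
    noisy = [0.1, 0.3, 0.2] * 20 + [5.0]
    print([r for r in noisy if at.is_broken(r)])   # only the 5.0 outlier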
BAYESIAN INFERENCE
Data-driven Bayesian Analysis
• Security monitors may produce a large number of false alerts
• A Bayesian network can be used to correlate alerts coming
from different sources and to filter out false notifications
• This approach has been successfully used to detect credential
stealing attacks
– Raw alerts generated during the progression of an attack (e.g.
user-profile violations and IDS notifications) are correlated
– The approach was able to remove around 80% of false positives (i.e., non-compromised users being declared compromised) without missing any compromised user
Data-driven Bayesian Analysis
• Vector extraction starting from raw data:
– each vector represents a security event, e.g., attack,
compromised user, etc…
– suitable for post-mortem forensics and runtime analysis;
– event logs, network audit, environmental sensors.
[Diagram: each raw event passes through VECTOR EXTRACTION and becomes one of the vectors v1 ... vN of binary features (0/1), with a ✓ marking each feature present in the event.]
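A minimal sketch of the vector-extraction step; the feature names and the event layout are invented for illustration:

    # Hypothetical binary features extracted from raw alerts/logs.
    FEATURES = ["unknown_address", "multiple_logins", "ids_alert", "suspicious_download"]

    def extract_vector(event):
        """Map a raw event (dict of observations) to a fixed-order 0/1 vector."""
        return [1 if event.get(f) else 0 for f in FEATURES]

    event = {"unknown_address": True, "ids_alert": True}
    print(extract_vector(event))   # [1, 0, 1, 0]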
Bayesian network
• Allows estimating the probability of the hypothesis variable (the attack event), given the evidence in the raw data
[Diagram: the hypothesis variable C ("the user is compromised") is linked to the information variables A1, A2, ..., A14, i.e., the alerts, such as "unknown address", "multiple logins", "suspicious download".]
• Network parameters:
– the a-priori probability P(C);
– a conditional probability table (CPT) for each alert Ai.
• Inference follows Bayes' rule:
P(C | A) = P(A | C) × P(C) / P(A), with P(A) = Σ_{i=1}^{N} P(A | Ci) × P(Ci)
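A small numeric sketch of this inference, assuming the alerts are conditionally independent given C (a naive Bayes structure); the prior and CPT values are invented for illustration:

    # P(C=compromised) prior and per-alert CPTs: P(alert=1 | C=1), P(alert=1 | C=0).
    P_C = 0.01
    CPT = {
        "unknown_address":     (0.70, 0.05),
        "multiple_logins":     (0.60, 0.10),
        "suspicious_download": (0.50, 0.02),
    }

    def posterior(vector):
        """P(C=1 | alert vector) under conditional independence of the alerts."""
        like_c, like_not_c = P_C, 1.0 - P_C
        for alert, (p_if_c, p_if_not) in CPT.items():
            a = vector.get(alert, 0)
            like_c *= p_if_c if a else (1.0 - p_if_c)
            like_not_c *= p_if_not if a else (1.0 - p_if_not)
        return like_c / (like_c + like_not_c)   # denominator plays the role of P(A)

    p = posterior({"unknown_address": 1, "multiple_logins": 1, "suspicious_download": 1})
    print(f"P(C=1 | three alerts raised) = {p:.2f}")   # ~0.95 with these numbers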
Incident analysis
• Estimate the probability that the vector represents an attack, given its features
[Example: a feature vector V_i with several alert features set yields P(C) = 0.31.]
Preliminary testbed
• A preliminary implementation with Apache Storm
[Topology: LogStreamer (spout) → FactorCompute (bolt) → AlertProcessor (bolt)]
• Tested with synthetic logs emulating the activity of 2.5 million users, generating 5 million log entries per day (IDS logs and user access logs)

Log lines     Time (ms)
4,300,000     140,886
4,400,000     143,960
4,600,000     147,024
4,500,000     150,448
4,700,000     153,551
4,800,000     159,567
4,900,000     162,642
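From the table, the sustained rate is roughly constant: 4,300,000 lines in ~141 s and 4,900,000 lines in ~163 s both work out to about 30,000 log lines per second, well above the average ingest rate implied by 5 million entries per day (~58 lines per second).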