
Anomaly Detection
Presented by: Anupam Das
CS 568MCC Spring 2013
Network Security
Some of the content is taken from the authors of "Anomaly Detection: A Survey", ACM Computing Surveys, Vol. 41(3), Article 15, July 2009
Department of Computer Science, UIUC
Introduction
We are drowning in the deluge of data being collected world-wide, while starving for knowledge at the same time.
Anomalous events occur relatively infrequently.
However, when they do occur, their consequences can be quite dramatic, and quite often in a negative sense.
"Mining needle in a haystack. So much hay and so little time"
What are Anomalies?
• An anomaly is a pattern in the data that does not conform to the expected behavior
• Also referred to as outliers, exceptions, peculiarities, surprises, etc.
• Anomalies translate to significant (often critical) real-life entities
Real World Anomalies
• Credit Card Fraud
• Cyber Intrusions
• Healthcare Informatics / Medical
diagnostics
• Industrial Damage Detection
• Image Processing / Video
surveillance
• Novel Topic Detection in Text Mining
Simple Example
• N1 and N2 are
regions of normal
behavior
• Points o1 and o2 are
anomalies
• Points in region O3
are anomalies
[Figure: points in the X–Y plane with normal regions N1 and N2, isolated anomalies o1 and o2, and a small anomalous region O3]
Key Challenges
• Defining a representative normal region is challenging
• The boundary between normal and outlying behavior is
often not precise
• The exact notion of an outlier is different for different
application domains
• Availability of labeled data for training/validation
• Malicious adversaries
• Data might contain noise
• Normal behavior keeps evolving
Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, structural
• Output of anomaly detection
• Evaluation of anomaly detection techniques
Input Data
• Input Data could be
– Univariate : single variable
– Multivariate: multiple variables
• Nature of attributes
– Binary
– Categorical
– Continuous
– Hybrid
Example (network connection records):

  Tid  SrcIP          Duration  Dest IP          Number of bytes  Internal
  1    206.163.37.81    0.10    160.94.179.208        150            No
  2    206.163.37.99    0.27    160.94.179.235        208            No
  3    160.94.123.45    1.23    160.94.179.221        195            Yes
  4    206.163.37.37  112.03    160.94.179.253        199            No
  5    206.163.37.41    0.32    160.94.179.244        181            No
Input Data – Complex Data Types
• Relationship among data instances
– Sequential
• Temporal
– Spatial
– Spatio-temporal
– Graph
[Example of sequential data: a genomic sequence such as GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG ...]
Data Labels
• Supervised Anomaly Detection
– Labels available for both normal data and anomalies
– Similar to rare class mining
• Semi-supervised Anomaly Detection
– Labels available only for normal data
• Unsupervised Anomaly Detection
– No labels assumed
– Based on the assumption that anomalies are very
rare compared to normal data
Type of Anomaly
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous if it deviates significantly from the rest of the data set.

[Figure: the X–Y plot from the earlier example, with o1, o2, and the points in region O3 marked as anomalies]
Contextual Anomalies
• An individual data instance is anomalous within a
context
• Requires a notion of context
• Also referred to as conditional anomalies*
[Figure: the same kind of value marked "Normal" in one context and "Anomaly" in another]
Collective Anomalies
• A collection of related data instances is anomalous
• Requires a relationship among data instances
– Sequential Data
– Spatial Data
– Graph Data
• The individual instances within a collective anomaly are
not anomalous by themselves
[Figure: a time series with an anomalous subsequence highlighted]
Output of Anomaly Detection
• Label
– Each test instance is given a normal or anomaly label
– This is especially true of classification-based
approaches
• Score
– Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter
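A minimal sketch of turning scores into labels with a threshold; the scores array and the cut-off value below are made up purely for illustration:

```python
import numpy as np

# Hypothetical anomaly scores from some detector (higher = more anomalous).
scores = np.array([0.1, 0.3, 0.2, 4.5, 0.4, 7.2])

threshold = 1.0                                   # application-specific choice
labels = np.where(scores > threshold, "anomaly", "normal")
ranking = np.argsort(-scores)                     # scores also allow ranking instances
print(labels)
print(ranking)
```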
Evaluation of Anomaly Detection
Accuracy is not a sufficient metric for evaluation
– Example: a network traffic data set with 99.9% normal data and 0.1% intrusions
– A trivial classifier that labels everything as normal achieves 99.9% accuracy!
Confusion matrix (anomaly class = C, normal class = NC):

                 Actual NC   Actual C
  Predicted NC      TN          FN
  Predicted C       FP          TP

• Focus on both recall and precision
  – Recall / Detection rate (R) = TP / (TP + FN)
  – Precision (P) = TP / (TP + FP)
  – False alarm rate (F) = FP / (TN + FP)

[Figure: ROC curves (detection rate vs. false alarm rate) for different outlier detection techniques; the area under the curve (AUC) summarizes each technique]
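A minimal sketch computing these three metrics from confusion-matrix counts; the counts below are toy numbers, not results from the slides:

```python
def detection_metrics(tp, fp, tn, fn):
    """Recall/detection rate, precision, and false alarm rate from confusion-matrix counts."""
    recall = tp / (tp + fn)        # R = TP / (TP + FN)
    precision = tp / (tp + fp)     # P = TP / (TP + FP)
    false_alarm = fp / (tn + fp)   # F = FP / (TN + FP)
    return recall, precision, false_alarm

# Toy counts for a data set that is 99.9% normal and 0.1% intrusions (illustrative only).
print(detection_metrics(tp=8, fp=20, tn=9970, fn=2))
```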
Taxonomy*
Anomaly Detection
• Point Anomaly Detection
  – Classification Based: Rule Based, Neural Networks Based, SVM Based
  – Nearest Neighbor Based: Density Based, Distance Based
  – Clustering Based
  – Statistical: Parametric, Non-parametric
  – Others: Information Theory Based, Spectral Decomposition Based, Visualization Based
• Contextual Anomaly Detection
• Collective Anomaly Detection
• Online Anomaly Detection
• Distributed Anomaly Detection

* Outlier Detection – A Survey, Varun Chandola, Arindam Banerjee, and Vipin Kumar, Technical Report TR07-17, University of Minnesota (Under Review)
Department of Computer Science, UIUC
March 27, 2016
Classification Based Techniques
• Main idea: build a classification model for normal (and anomalous) events based on labeled training data, and use it to classify each new unseen event
• Classification models must be able to handle skewed (imbalanced) class distributions
• Categories:
  – Supervised classification techniques
    • Require knowledge of both the normal and the anomaly class
    • Build a classifier to distinguish between normal and known anomalies
  – Semi-supervised classification techniques
    • Require knowledge of the normal class only!
    • Use a modified classification model to learn the normal behavior and then detect any deviations from it as anomalous
Classification Based Techniques
• Some techniques
– Neural network based approaches
– Support Vector machines (SVM) based approaches
– Bayesian networks based approaches
– Rule based techniques
– Fuzzy Logic
– Genetic Algorithms
– Principal Component Analysis
Using Neural Networks
Multi-layer NNs
• Create hyper-planes that separate the various classes
• Good when dealing with huge data sets; handles noisy data well
• Bad because learning takes a long time
x y
x y
Department of Computer Science, UIUC
March 27, 2016
Neural Networks
• INPUT LAYER – X = {x1, x2, …, xn}, where n is the number of attributes. There are as many nodes as inputs.
• HIDDEN LAYER – the number of nodes in each hidden layer and the number of hidden layers depend on the implementation.
• OUTPUT LAYER – corresponds to the class attribute. There are as many nodes as classes.
• Back-propagation learns by iteratively processing a set of training data (samples).
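A minimal sketch of a back-propagation-trained multi-layer network used as a supervised anomaly classifier, using scikit-learn's MLPClassifier; the synthetic data and layer sizes are assumptions, not the slides' setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 4))   # synthetic "normal" records
X_anomaly = rng.normal(5.0, 1.0, size=(20, 4))   # synthetic known anomalies
X = np.vstack([X_normal, X_anomaly])
y = np.array([0] * 500 + [1] * 20)               # 1 = anomaly

# One hidden layer; training uses back-propagation internally.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(rng.normal(5.0, 1.0, size=(3, 4))))   # new points near the anomaly class
```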
SVM
A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification.
[Figure: two classes of points in the (x1, x2) plane, one class denoted +1 and the other -1]

• How would you classify these points using a linear discriminant function in order to minimize the error rate?
  – There are an infinite number of answers!
• Which one is the best?
Linear SVM
[Figure: the maximum-margin separating line between the +1 and -1 classes, with the support vectors lying on the margin]

• The linear discriminant function (classifier) with the maximum margin is the best
• Support vectors are the data points that the margin pushes up against
• Why is it the best?
  – It is robust to outliers and thus has strong generalization ability
Linear SVM
• Formulation:
  – Margin hyperplanes: $w^{T}x^{+} + b = +1$ and $w^{T}x^{-} + b = -1$
  – The margin width is $M = (x^{+} - x^{-}) \cdot \mathbf{n} = \dfrac{2}{\lVert w \rVert}$
• Goal: maximize $\dfrac{2}{\lVert w \rVert}$ such that $y_{i}\,(w^{T}x_{i} + b) \ge 1$ for all $i$
  – This is quadratic programming with linear constraints

[Figure: the +1 and -1 classes in the (x1, x2) plane, the two margin hyperplanes, the margin width M, and its normal direction n]
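A minimal sketch (synthetic two-class data, assumed use of scikit-learn) that fits a maximum-margin linear SVM and recovers the margin width 2/||w|| and the support vectors described above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_pos = rng.normal(+2.0, 0.5, size=(50, 2))   # "+1" class
X_neg = rng.normal(-2.0, 0.5, size=(50, 2))   # "-1" class
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
margin = 2.0 / np.linalg.norm(w)              # margin width M = 2 / ||w||
print("margin:", margin, "support vectors:", len(svm.support_vectors_))
```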
Nearest Neighbor Based Techniques
• Key assumption: normal points have close neighbors
while anomalies are located far from other points
• General two-step approach
1. Compute the neighborhood of each data record
2. Analyze the neighborhood to determine whether the data record is an anomaly or not
• Categories:
– Distance based methods
  • Anomalies are the data points most distant from other points
– Density based methods
  • Anomalies are data points in low-density regions
Distance Based Anomaly Detection
For each object o, examine the number of other objects in the r-neighborhood of o, where r is a user-specified distance threshold.

An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o.

An object o is a DB(r, π) outlier if
  |{o' | dist(o, o') ≤ r}| / |D| ≤ π
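A minimal brute-force sketch of the DB(r, π) test stated above; the data set D and the values of r and π below are illustrative assumptions:

```python
import numpy as np

def db_outliers(D, r, pi):
    """Indices of DB(r, pi) outliers: objects whose r-neighborhood covers at most
    a fraction pi of the other objects in D."""
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    in_r = (dists <= r).sum(axis=1) - 1     # neighbors within r, excluding the object itself
    frac = in_r / (len(D) - 1)
    return np.where(frac <= pi)[0]

rng = np.random.default_rng(2)
D = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one far-away point
print(db_outliers(D, r=2.0, pi=0.05))                        # should include index 100
```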
Density Based Anomaly Detection
Compute local densities of particular regions and declare
instances in low density regions as potential anomalies
Approach: Local Outlier Factor (LOF)
[Figure: clusters of different densities containing points p1, p2, and p3; arrows mark the distance from p2 and from p3 to their nearest neighbors]

• In the NN (distance-based) approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
• The NN approach may consider p3 an outlier, but the LOF approach does not
Local Outlier Factor (LOF)
For each data point o, compute k-distance(o) (the distance to its k-th nearest neighbor) and its k-distance neighborhood N_k(o), the set of points within k-distance(o).

Compute the reachability distance (reachdist) of data example o with respect to data example o' as:
  reachdist_k(o, o') = max{ k-distance(o'), dist(o, o') }

Compute the local reachability density (lrd):
  lrd_k(o) = |N_k(o)| / Σ_{o' ∈ N_k(o)} reachdist_k(o, o')

LOF is the average of the ratios of the local reachability density of o's k-nearest neighbors to the local reachability density of o:
  LOF_k(o) = (1 / |N_k(o)|) Σ_{o' ∈ N_k(o)} lrd_k(o') / lrd_k(o)

The higher the LOF, the more likely it is that o is an outlier.
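A minimal sketch using scikit-learn's LocalOutlierFactor, which implements the k-NN/LOF computation outlined above; the synthetic data and the n_neighbors value are assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # dense cluster
               rng.normal(5, 2.0, (100, 2)),   # sparse cluster
               [[2.5, 2.5]]])                  # isolated point between them

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_          # higher score = more likely an outlier
print(np.argsort(scores)[-5:])                  # indices of the most outlying points
```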
Clustering Based Techniques
• Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters
• Categorization according to labels
  – Semi-supervised – cluster normal data to create modes of normal behavior. If a new instance does not belong to any of the clusters, or is not close to any cluster, it is an anomaly
  – Unsupervised – post-processing is needed after the clustering step: the size of a point's cluster and its distance from the clusters determine whether the point is an anomaly
• Anomalies detected using clustering based methods can be points that (see the sketch below):
  – do not belong to any cluster,
  – are a large distance from their closest cluster, or
  – belong to a small or sparse cluster
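A minimal sketch of the "large distance from the closest cluster" criterion, not the slides' exact method: cluster with k-means and score each point by its distance to the nearest centroid (data and k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(6, 0.5, (200, 2)),
               [[3.0, 3.0]]])                    # far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(np.argsort(dist_to_centroid)[-3:])         # most anomalous points by this criterion
```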
Cluster Based Local Outlier Factor
• FindCBLOF: detect outliers in small clusters
  – Find clusters and sort them in decreasing size
  – Assign each data point a cluster-based local outlier factor (CBLOF):
    • If point p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
    • If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
• Example: in the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small
Statistics Based Techniques
• Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
• Idea: learn a generative model fitting the given data set, and
then identify the objects in low probability regions of the model
as outliers.
• Advantage
– Utilize existing statistical modeling techniques to model various types of distributions
• Challenges
– With high dimensions, difficult to estimate distributions
– Parametric assumptions often do not hold for real data sets
Types of Statistical Techniques
• Parametric Techniques
– Assume that the normal data is generated from an underlying
parametric distribution
– Learn the parameters from the normal sample
– Determine the likelihood that a test instance was generated from this distribution, in order to detect anomalies
• Non-parametric Techniques
– Do not assume any knowledge of the parameters
– Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
– Examples: histogram and kernel density estimation
Parametric Techniques
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution,
learn the parameters from the input data, and identify the points
with low probability as outliers
• Ex: Avg. temp.: {24.0, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
– Compute μ and σ from the samples
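A minimal sketch for this temperature example: fit μ and σ, then score each value by how far it lies from the fitted normal model (the cut-off, e.g. 3σ, is an application-specific choice not fixed by the slides):

```python
import numpy as np

temps = np.array([24.0, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
mu, sigma = temps.mean(), temps.std(ddof=1)
z = np.abs(temps - mu) / sigma        # deviation from the mean in units of sigma
print(mu, sigma)
print(temps[np.argmax(z)], z.max())   # 24.0 deviates the most from the fitted model
```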
Parametric Techniques
• Univariate outlier detection: Grubbs' test (another statistical method assuming a normal distribution)
• For each object x in a data set, compute its z-score:
  z = |x − x̄| / s
• x is declared an outlier if
  z ≥ ((N − 1) / √N) · √( t²_{α/(2N), N−2} / (N − 2 + t²_{α/(2N), N−2}) )
  where t_{α/(2N), N−2} is the value taken by a t-distribution (with N − 2 degrees of freedom) at a significance level of α/(2N), and N is the number of objects in the data set
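A minimal sketch of this test using scipy's t-distribution; α = 0.05 and the sample (the temperature data from the previous slide) are illustrative choices:

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return (suspect value, is_outlier) for the most extreme value under Grubbs' test."""
    n = len(x)
    mean, s = np.mean(x), np.std(x, ddof=1)
    z = np.abs(x - mean) / s
    g = z.max()                                          # largest z-score in the sample
    t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)  # t value at significance alpha/(2N)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return x[np.argmax(z)], g > g_crit

print(grubbs_outlier(np.array([24.0, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])))
```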
Non-Parametric Techniques
The model of normal data is learned from the input data without assuming any a priori structure.

Outlier detection using a histogram:
• The figure shows the histogram of purchase amounts in transactions
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
• Problem: it is hard to choose an appropriate bin size for the histogram
• Solution: adopt kernel density estimation to estimate the probability density distribution of the data
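A minimal sketch of both ideas on synthetic purchase amounts (the bin count and bandwidth are assumptions): score each transaction by how common its histogram bin is, and compare with a kernel density estimate.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(5)
amounts = np.concatenate([rng.gamma(2.0, 200.0, 5000), [7500.0]])  # one extreme purchase

# Histogram score: fraction of transactions that fall in the same bin.
counts, edges = np.histogram(amounts, bins=50)
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
hist_score = counts[bin_idx] / len(amounts)            # rare bins -> low score

# Kernel density estimation avoids picking a single bin size (bandwidth still matters).
kde = KernelDensity(bandwidth=100.0).fit(amounts[:, None])
log_density = kde.score_samples(amounts[:, None])
print(hist_score[-1], log_density[-1])                 # both are low for the $7,500 record
```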
Anomaly Detection on Real Network Data
• Anomaly detection was used at U of Minnesota and Army Research Lab to detect
various intrusive/suspicious activities
• Many of these could not be detected using widely used intrusion detection tools
like SNORT
• Anomalies/attacks picked by MINDS
– Scanning activities
– Non-standard behavior
• Policy violations
• Worms
MINDS – Minnesota Intrusion Detection System
[Figure: the MINDS architecture. Network data is captured (net flow tools / tcpdump) and filtered, features are extracted, and records pass through a known attack detection module (producing detected known attacks and labels) and an anomaly detection module (producing anomaly scores). A human analyst inspects the scored connections, and association pattern analysis produces a summary and characterization of the detected novel attacks (MINDSAT).]
Feature Extraction
• Three groups of features
–Basic features of individual TCP connections
  • source & destination IP
  • source & destination port
  • protocol
  • duration
  • bytes per packet
  • number of bytes
–Time based features
(Features 1–8 above are the basic features of an individual connection.)

[Example: constructing a time-based feature. For each connection, %S0 is the percentage of recent connections to the same destination host whose flag is S0.]

  dst  service  flag   %S0
  h1   http     S0     70
  h1   http     S0     72     <- syn flood: the basic features alone are useless,
  h1   http     S0     75        but the constructed feature has high information gain
  h2   http     S0      0
  h4   http     S0      0     <- normal
  h2   ftp      S0      0
• For the same source (destination) IP address, number of unique destination (source) IP
addresses inside the network in last T seconds – Features 9 (13)
• Number of connections from source (destination) IP to the same destination (source) port in
last T seconds – Features 11 (15)
–Connection based features
• For the same source (destination) IP address, number of unique destination (source) IP
addresses inside the network in last N connections - Features 10 (14)
• Number of connections from source (destination) IP to the same destination (source) port in
last N connections - Features 12 (16)
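A minimal sketch of the time-based features (features 9 and 11) on a few hypothetical flow records; the record format, window length T, and values are assumptions for illustration:

```python
# (timestamp, src_ip, dst_ip, dst_port) -- hypothetical flow records
flows = [
    (0.0, "10.0.0.1", "192.168.1.10", 80),
    (1.0, "10.0.0.1", "192.168.1.11", 80),
    (2.0, "10.0.0.1", "192.168.1.12", 80),
    (2.5, "10.0.0.2", "192.168.1.10", 22),
    (3.0, "10.0.0.1", "192.168.1.13", 80),
]

T = 5.0  # window length in seconds (illustrative)
for i, (t, src, dst, dport) in enumerate(flows):
    window = [f for f in flows[:i] if f[0] >= t - T and f[1] == src]
    feat9 = len({f[2] for f in window})               # unique dst IPs from this src in last T s
    feat11 = sum(1 for f in window if f[3] == dport)  # connections from src to same dst port
    print(src, "->", dst, "feature 9:", feat9, "feature 11:", feat11)
```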
Typical Anomaly Detection Output
– 48 hours after the “slammer” worm
[Table excerpt: the highest-scoring connections, 48 hours after the "slammer" worm. Nearly all of the top-ranked records are UDP (protocol 17) connections from source 63.150.X.253, source port 1161, to destination port 1434 on many different internal hosts.]

  score      srcIP          sPort  dstIP           dPort  protocol
  37674.69   63.150.X.253   1161   128.101.X.29    1434   17
  26676.62   63.150.X.253   1161   160.94.X.134    1434   17
  24323.55   63.150.X.253   1161   128.101.X.185   1434   17
  21169.49   63.150.X.253   1161   160.94.X.71     1434   17
  19525.31   63.150.X.253   1161   160.94.X.19     1434   17
  ...        ...            ...    ...             ...    ...
Conclusions
• Anomaly detection can detect critical information in data
• Highly applicable in various application domains
• The nature of the anomaly detection problem depends on the application domain
• Different approaches are needed to solve particular problem formulations
Related problems
• Rare Class Mining
• Chance discovery
• Novelty Detection
• Exception Mining
• Noise Removal
• Black Swan*