Mining Anomalies Using Traffic Feature Distributions


Mining Anomalies Using Traffic
Feature Distributions
Anukool Lakhina
Mark Crovella
Christophe Diot
in ACM SIGCOMM 2005
Presented by:
Sailesh Kumar
Overview


Introduction to PCA
Application of PCA to OD flows
» Lakhina et al. SIGMETRICS’04


Volume Anomalies
Subspace Analysis of Link Volume Traffic
» Lakhina et al. SIGCOMM’04


Feature based Anomaly Detection
Anomaly Classification
» Lakhina et al. SIGCOMM’05
Principal Component Analysis



PCA is a useful statistical technique that has found
application in fields such as face recognition and
image compression.
It is a technique for finding patterns in
high-dimensional data.
Just as variance captures the relationship between
data items in a single dimension, covariance captures
the relationship between different dimensions of
multi-dimensional data.
Two important terms:
» Covariance Matrix
» Eigenvector/Eigenvalue
• An n x n covariance matrix has n eigenvectors.
• Since the covariance matrix is symmetric, its
eigenvectors are orthogonal to each other.
Principal Component Analysis




PCA is a way of identifying patterns in data, and
expressing the data in such a way as to highlight their
similarities and differences.
Since patterns can be hard to find in high-dimensional
data, where graphical representation is not available,
PCA is a powerful tool for analysis.
It essentially reduces the dimensionality when data
along multiple dimensions are correlated.
A simple example of 2-D correlated data:
» (No. of hours studied, Marks obtained in exam)
» (Traffic entering a network, Traffic exiting a network)
» Such data sets may exhibit strong correlation (positive
correlation in these instances).
» Thus, it might be worthwhile describing such a data set
along a single dimension, as sketched below.
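As an illustrative sketch (not from the paper), the following Python/numpy snippet builds an invented correlated 2-D data set, computes the principal components from the covariance matrix, and projects the data onto the first component; all names and data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D correlated data: hours studied vs. exam marks.
hours = rng.uniform(0, 10, 200)
marks = 8 * hours + rng.normal(0, 5, 200)       # strongly positively correlated
X = np.column_stack([hours, marks])

# Center the data, then compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)                  # 2 x 2 covariance matrix

# Eigenvectors of the covariance matrix are the principal axes.
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component: 2-D data described in 1-D.
proj_1d = Xc @ eigvecs[:, 0]
print(f"variance captured by PC1: {eigvals[0] / eigvals.sum():.1%}")
```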
2-D data and Principal Components
One Dimensional Projection
Principal Component Analysis



Consider a p-dimensional data set.
If data along the p dimensions are correlated (high
positive or negative covariance), then the set can be
represented with fewer dimensions (k).
In general, any p-dimensional data set can be mapped
onto the first k principal axes:
» the k principal components with the highest eigenvalues.

Data mapped onto these k dimensions is usually called
the normal component.
The remaining data is called the residual component.
OD Flows


An OD flow is the traffic that enters at an origin PoP and
exits at a destination PoP of a backbone network.
The relationship between link traffic and OD flow traffic is
captured by the routing matrix A.
» A has size (# links) x (# OD flows)
» Aij = 1 if OD flow j traverses link i.
» Traffic engineering is essentially adjusting the matrix A.

A network with n PoPs will have n² OD flows.
» Thus OD flows are high-dimensional data.
– 20 PoPs result in 400 dimensions.

However, quite intuitively, OD flows are correlated.
» Hence they can be represented with far fewer dimensions.
» Lakhina et al. (SIGMETRICS’04) show this; a sketch of the
link/OD-flow relationship follows.
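As a minimal sketch of the routing-matrix relationship (the topology below is invented for illustration), link traffic y relates to OD-flow traffic x via y = Ax:

```python
import numpy as np

# Toy example (invented): 3 links, 4 OD flows.
# A[i, j] = 1 if OD flow j traverses link i.
A = np.array([
    [1, 0, 1, 0],   # link 0 carries OD flows 0 and 2
    [0, 1, 1, 0],   # link 1 carries OD flows 1 and 2
    [0, 0, 0, 1],   # link 2 carries OD flow 3
])

x = np.array([10.0, 5.0, 7.0, 3.0])   # traffic volume on each OD flow
y = A @ x                             # resulting traffic volume on each link
print(y)                              # [17. 12.  3.]
```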
OD Flows

Lakhina et al. (SIGMETRICS’04) show that only 5-10
dimensions are sufficient to capture 95+% of the
traffic.
Why care about OD Flows


A volume anomaly typically arises on an OD flow (traffic
arriving at one PoP and destined for another PoP).
If we only monitor traffic on network links, volume
arising from an OD flow may not be noticeable, since it is
diluted across the links the flow traverses.
» Thus, a naïve approach won’t work if OD flow info isn’t available.
Subspace Analysis of Link Traffic


Even if OD flow information is not available and only
link traffic information is available, PCA can be applied,
and the subspace technique can detect volume anomalies.
What is the data?
» Data consist of time samples of traffic volumes at all m links in
the network.
» Thus, Y is the t x m traffic measurement matrix.
– An arbitrary row y of Y denotes one sample (the traffic on
all m links at one point in time).

Use PCA to separate normal and anomalous traffic:
construct the principal axes and map the data onto them,
then consider the set of the first k axes, which capture the
highest variance.
» The projection of y onto these k axes is called the normal
traffic, while the remaining traffic is the residual traffic.
Subspace Analysis of Link Traffic




An approach to separate normal traffic from
anomalous traffic:
» Normal subspace S: space spanned by the
first k principal components.
» Anomalous subspace S̃: space spanned by
the remaining principal components.

Then, decompose the traffic on all links by projecting
onto S and S̃ to obtain:

y = ŷ + ỹ

where y is the traffic vector of all links at a particular
point in time, ŷ = PPᵀy is the normal traffic vector, and
ỹ = (I − PPᵀ)y is the residual traffic vector (P = [v1 … vk]
holds the first k principal components as columns). A code
sketch of this decomposition follows.
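A minimal numpy sketch of the subspace decomposition, assuming a synthetic t x m link-traffic matrix Y (the data and dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic t x m link traffic: a few underlying "eigenflows" plus noise.
t, m, k = 500, 20, 4
latent = rng.normal(size=(t, k)) @ rng.normal(size=(k, m))
Y = latent + 0.1 * rng.normal(size=(t, m))

# Principal axes of the (column-centered) measurements.
Yc = Y - Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
P = Vt[:k].T                      # m x k: first k principal components

# Decompose each traffic vector y into normal + residual parts.
normal = Yc @ P @ P.T             # projection onto the normal subspace
residual = Yc - normal            # projection onto the anomalous subspace

# Anomalies show up as large residual energy (squared prediction error).
spe = np.sum(residual**2, axis=1)
print("mean SPE:", spe.mean())
```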
Subspace Analysis Results

Note that during an anomaly, the normal component doesn’t
change much, while the residual component changes
quite a lot.
» Thus, anomalies can be detected by thresholding the size
(squared norm) of the residual component.
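Continuing the earlier sketch (where spe[i] is the squared norm of the residual traffic vector at timebin i), anomalous timebins can be flagged by a threshold; the percentile used here is an invented stand-in, not the paper's statistical threshold:

```python
# Flag timebins whose residual energy exceeds a threshold.
threshold = np.percentile(spe, 99.5)   # invented stand-in for a proper threshold
anomalous_bins = np.flatnonzero(spe > threshold)
print(anomalous_bins)
```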
Background Over

Let’s talk about today’s paper now!

The objective is to build an anomaly diagnosis system that
» detects a diverse range of anomalies,
» distinguishes between different types of anomalies,
» and groups similar anomalies.

Clearly these goals are ambitious:
» Anomalies are a moving target (malicious anomalies adapt).
» New anomalies will continue to arise.
» In general this is a difficult problem.

This paper takes significant steps towards a system
that fulfills these criteria.
Typical Characteristics of Anomaly

Most anomalies induce a change in distributional
aspects of packet header fields (called features).
» DOS attack – multiple source IP addresses concentrated on a
single destination IP address.
» Network scan – dispersed distribution of destination addresses.
» Most worms/viruses also induce some change in the distribution
of certain features.
» However, these changes can be very subtle, and mining them is
like searching for needles in a haystack.

Unlike many previous approaches, this paper aims to
detect events which disturb the distribution of traffic
features rather than traffic volume.
Limitations of Volume Based Detection

Port scan anomaly (traffic features change, but
traffic volume remains more or less the same).
We can use sample entropy to capture variations in a
traffic feature’s distribution:

H(X) = − Σᵢ (nᵢ/S) log₂(nᵢ/S)

where nᵢ is the number of packets taking the i-th of the N
distinct values of the feature, and S = Σᵢ nᵢ is the total
number of packets.
• Takes value 0 when the distribution is maximally concentrated.
• Takes value log₂N when the distribution is maximally dispersed.
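A minimal Python sketch of sample entropy over a feature's empirical distribution (the packet counts below are invented for illustration):

```python
import numpy as np

def sample_entropy(counts):
    """Entropy of an empirical distribution given per-value packet counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()     # empirical probabilities
    return -np.sum(p * np.log2(p))

# Maximally concentrated: all packets share one destination IP -> H = 0.
print(sample_entropy([1000, 0, 0, 0]))        # 0.0

# Maximally dispersed over N = 4 values -> H = log2(4) = 2 bits.
print(sample_entropy([250, 250, 250, 250]))   # 2.0
```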
Effectiveness of Feature Entropy
Port scan is dwarfed in volume metrics, but stands out in
feature entropy, which also reveals its structure.
Entropy based scheme



In the volume-based scheme, the number of packets or bytes
per time slot was the variable.
In the entropy-based scheme, in every time slot, the
entropy of every traffic feature is the variable.
This gives us a three-way data matrix H.
» H(t, p, k) denotes the entropy, at time t, of OD flow p,
for traffic feature k.

To apply the subspace method,
we need to unfold it into a
single-way representation.
(Figure: the three-way matrix H, with axes # timebins,
# OD pairs, and the four features H(srcIP), H(srcPort),
H(dstIP), H(dstPort).)
Multi-way to single-way
Unfold H by arranging the four t x p per-feature matrices
side by side, giving a single t x 4p matrix:
» H = [H(srcIP), H(srcPort), H(dstIP), H(dstPort)]
» (rows: timebins; columns: OD pairs x feature types)


Now apply the usual subspace decomposition to the
single-way matrix.
» Every row h of the matrix (the entropy vector at a
particular point in time) is decomposed into

h = ĥ + h̃

where ĥ is the normal entropy vector and h̃ is the
residual entropy vector. A sketch of the unfolding step
follows.
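A minimal sketch of the unfolding step, assuming four per-feature entropy matrices of shape (t, p); the dimensions and data are synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
t, p = 1000, 121                    # timebins, OD pairs (e.g. 11x11 PoPs)

# Hypothetical three-way matrix: entropy per timebin, OD pair, feature.
features = ["srcIP", "srcPort", "dstIP", "dstPort"]
H3 = rng.random((t, p, len(features)))

# Unfold into a single-way t x 4p matrix by placing the
# per-feature t x p slices side by side.
H = np.hstack([H3[:, :, k] for k in range(len(features))])
print(H.shape)                      # (1000, 484)

# Each row of H is the entropy vector h for one point in time;
# the subspace method then splits it into h = ĥ + h̃.
```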
Benefits of using Features + OD flow

Applying entropy to both traffic features and the
ensemble of OD flows has a key benefit: anomalies that
are correlated across OD flows and features stand out.

Moreover, as shown in the first example,
entropy is an effective summarization tool when
traffic volume changes are not significant.

We now evaluate how this scheme improves over
volume-based schemes.
Entropy Based versus Volume Based
Detection Rates
Anomaly Classification

Cluster anomalies that lie close to each other in space.

Anomalies can be thought of as points in 4-D space
with coordinate vector
» h = [H(srcIP), H(dstIP), H(srcPort), H(dstPort)]

Do anomalies of similar type appear next to each
other in the entropy space?
» YES!

Before digging into details, a brief introduction to
clustering algorithms.
Clustering Algorithms


The objective is to cluster data points that are close together.
Two general approaches to clustering:
» K-means and Hierarchical

K-means is one of the simplest unsupervised
learning algorithms (see the sketch after this list):
1. Place K points into the space represented by the data that
are being clustered. These points form the initial group centroids.
2. Assign each object to the group that has the closest
centroid.
3. When all objects have been assigned, recalculate the
positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.

The hierarchical approach either begins with one cluster and
splits it into multiple clusters (divisive), or begins with n
clusters and merges clusters (agglomerative).
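A minimal numpy sketch of the K-means steps above, run on invented 4-D points (matching the 4-D entropy coordinates used for anomalies):

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    # Step 1: pick K data points as the initial group centroids.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the K centroids.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
labels, centroids = kmeans(X, k=2, rng=rng)
print(np.bincount(labels))   # roughly [50, 50]
```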
Nearest Neighbor hierarchical clustering, Levels 2 through 8:
k = 7, 6, 5, 4, 3, 2, and 1 clusters.
(Figures from http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt)
Clustering Anomalies
(Figure: known labels vs. cluster results, plotted in entropy
space; axes include H(srcIP) and H(dstIP). Legend: Code Red,
Scanning, Single-source DOS attack, Multi-source DOS attack.)
Summary: Correctly classified 292 of 296 injected anomalies.
Clustering Anomalies
(Figure: anomaly clusters in entropy space; axes include
H(srcIP), H(srcPort), and H(dstIP).)
» Results of both clustering algorithms are consistent.
» Heuristics identify about 10 clusters in the dataset.
Questions?