Unsupervised intrusion detection using clustering approach
Unsupervised Intrusion Detection
Using Clustering
Approach
Muhammet Kabukçu
Sefa Kılıç
Ferhat Kutlu
Teoman Toraman
1/29
Outline
Introduction
Using Clustering for Intrusion Detection
Methodology
Overall Summary
Conclusion
References
2/29
Introduction
• Intrusion detection is the process of monitoring the events
occurring in a computer system or network and analyzing
them for signs of possible incidents.
• Incidents are violations or
imminent threats of violation of:
* computer security policies,
* acceptable use policies,
* standard security practices.
3/29
Introduction
• An intrusion detection
system (IDS) is software
that automates the
intrusion detection
process.
• IDSs primarily focus on identifying possible
incidents and detecting when an attacker has
successfully compromised a system by exploiting
a vulnerability in the system.
4 /29
Introduction
Methodologies of IDS Technologies:
• Signature-Based Detection
• Anomaly-Based Detection
• Stateful Protocol Analysis
5 /29
Signature-Based Detection
A signature is a pattern that corresponds to a known
threat (e.g. a telnet attempt with a username of "root",
which is a violation of an organization's security policy).
Signature-based detection is the process of comparing
signatures against observed events to identify possible
incidents.
Advantage: Very effective at detecting known threats.
Disadvantage: Ineffective at detecting previously
unknown threats.
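The signature comparison described above can be sketched as a field-by-field match. This is only an illustration: the event fields (`service`, `username`), the `ROOT_TELNET` signature, and the `matches_signature` helper are hypothetical names, not part of any real IDS.

```python
# Minimal sketch of signature matching; all names here are illustrative assumptions.
def matches_signature(event, signature):
    """Return True if every field in the signature equals the event's field."""
    return all(event.get(key) == value for key, value in signature.items())

# Hypothetical signature for the slide's example: a telnet attempt as "root".
ROOT_TELNET = {"service": "telnet", "username": "root"}

events = [
    {"service": "telnet", "username": "alice"},
    {"service": "telnet", "username": "root"},
    {"service": "http", "username": "root"},
]

# Only events that match the known pattern are flagged; anything else passes,
# which is why previously unknown threats go undetected.
flagged = [e for e in events if matches_signature(e, ROOT_TELNET)]
```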
6 /29
Anomaly-Based Detection
The process of comparing definitions of what activity is
considered normal against observed events to identify
significant deviations.
Capable of detecting previously unknown threats.
Uses host or network-specific profiles.
7 /29
Detection by Stateful Protocol Analysis
The process of comparing predetermined profiles of
generally accepted definitions of benign protocol activity
for each protocol state against observed events to
identify deviations.
Relies on vendor-developed universal profiles that
specify how particular protocols should and should not
be used.
8 /29
Using Clustering for Intrusion Detection
Methods other than Signature-Based Detection use data
mining and machine learning algorithms to train on
labeled network data.
For training data, there are two major paradigms:
• Misuse Detection
• Anomaly Detection
Which one to use?
9 /29
Using Clustering for Intrusion Detection
- Misuse Detection
In misuse detection, machine learning algorithms
are trained on labeled data.
Network data is classified using features
extracted from labeled network traffic.
Detection models are retrained on new data
that includes new types of attacks.
10 /29
Using Clustering for Intrusion Detection
- Anomaly Detection
In anomaly detection, models are built by
training on normal data, and observed traffic is
checked for deviations from the normal model.
Generating purely normal data is
very difficult and costly in practice.
It is very hard to guarantee that
no attacks occur while the traffic
is being collected from the network.
11 /29
Using Clustering for Intrusion Detection
Misuse Detection
Anomaly Detection.
Use a mechanism that detects
intrusions using unlabeled
data as the training set.
Find the intrusions buried within that
data.
12/29
Using Clustering for Intrusion Detection
[Diagram: A set of unlabeled data → Unsupervised anomaly detection algorithm → Detected intrusion clusters → Connection comparison with detected clusters → Detected malicious attacks]
Assumptions for unsupervised anomaly detection algorithm:
1. The intrusions are rare with respect to normal network
traffic.
2. The intrusions are different from normal network traffic.
As a result:
The intrusions will appear as outliers in the data.
13 /29
Using Clustering for Intrusion Detection
The unsupervised anomaly
detection algorithm groups
the unlabeled data instances
into clusters using a
simple distance-based metric.
14 /29
Using Clustering for Intrusion Detection
Once the data is clustered, all of the
instances that appear in
small clusters are labeled as
anomalies, because:
• The normal instances should
form large clusters compared to
the intrusions.
• Malicious intrusions and normal
instances are qualitatively
different, so they do not fall into
the same cluster.
[Figure labels: Normal cluster, Intrusion cluster]
15 /29
Methodology
1. Description of the dataset
2. Metric & Normalization
3. Clustering Algorithm
a) Portnoy et al.
b) Y-means Algorithm
4. Labeling Clusters
5. Intrusion Detection
16 /29
Description of the dataset
• KDD Cup 1999 Data
• Main attack categories
– DOS: Denial of Service (e.g. SYN flood)
– R2L: Unauthorized access from a remote machine
(e.g. guessing password)
– U2R: Unauthorized access to local superuser
(root) privileges (e.g. various buffer overflow
attacks)
– Probing: Surveillance and other probing (e.g. port
scanning)
• In total, 24 attack types in training data; 14
additional ones in test data...
17/29
Metric & Normalization
• Euclidean Metric
(for distance computation)
• Feature Normalization
(to eliminate the difference in the scale of features)
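A minimal sketch of these two steps follows. The slide does not say which normalization is used, so z-score standardization (subtract the column mean, divide by the column standard deviation) is an assumption here, and the function names are illustrative.

```python
import math

def normalize(rows):
    """Z-score each feature column (assumed scheme): (value - mean) / std."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    # A constant column has zero deviation; map it to 0.0 instead of dividing by zero.
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(dims)]
        for r in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two features on very different scales end up comparable after normalization.
rows = [[0.0, 1000.0], [2.0, 3000.0]]
scaled = normalize(rows)
```

Without normalization, the second feature would dominate every distance simply because its raw values are larger.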
18/29
Clustering Algorithm (Portnoy et al.)
[Diagram: an instance Xi from the training set, at distances d1, d2, d3 from the existing clusters; the set of clusters starts out empty]
- The nearest cluster, at distance d1, is selected.
- If d1 < W (predefined threshold value),
then Xi is assigned to that cluster.
- Else, a new cluster is created and Xi is assigned to it.
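The single-pass procedure above can be sketched as follows. Representing each cluster by its first instance is an assumption made for this sketch, and `fixed_width_clustering` and `width` are illustrative names for the procedure and for W.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fixed_width_clustering(instances, width):
    """Single pass: join the nearest cluster if it is closer than `width`, else start a new one."""
    clusters = []  # starts as the empty set of clusters
    for x in instances:
        # Select the nearest existing cluster (None while no cluster exists yet).
        nearest = min(clusters, key=lambda c: euclidean(x, c["center"]), default=None)
        if nearest is not None and euclidean(x, nearest["center"]) < width:
            nearest["members"].append(x)
        else:
            # Assumption: the first instance of a cluster serves as its center.
            clusters.append({"center": x, "members": [x]})
    return clusters

data = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (0.0, 0.4), (10.2, 10.0)]
clusters = fixed_width_clustering(data, width=1.0)
sizes = [len(c["members"]) for c in clusters]
```

The number of clusters falls out of the data and the chosen width, so it never has to be specified up front.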
19/29
Clustering Algorithm (Portnoy et al.)
• Advantage: No need to know the initial no. of
clusters.
• Disadvantage: W must be chosen in advance, and a poor
choice may mislabel instances in some cases.
• However…
20/29
Clustering Algorithm (Y-means Algorithm)
• 3 main parts:
1. assigning instances to k clusters
2. splitting clusters
3. merging clusters
21/29
Clustering Algorithm (Y-means Algorithm)
1. assigning instances to k clusters
[Diagram: the dataset is partitioned into k clusters; the cluster centroids are redefined after assignment]
k: no. of clusters
n: no. of instances
1 < k < n
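The assignment step can be sketched as a plain k-means style loop: assign each instance to its nearest centroid, redefine the centroids, and repeat until nothing changes. This is a sketch under assumptions, not the Y-means implementation itself; in particular, keeping the old centroid for an empty cluster is a simplification chosen here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_and_update(instances, centroids, max_iter=100):
    """Assign instances to the nearest of k centroids, then redefine the centroids."""
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for x in instances:
            idx = min(range(len(centroids)), key=lambda i: euclidean(x, centroids[i]))
            groups[idx].append(x)
        # Redefine each centroid as the mean of its members.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, groups

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, groups = assign_and_update(points, [(0.0, 0.0), (10.0, 10.0)])
```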
22/29
Clustering Algorithm (Y-means Algorithm)
2. splitting clusters
t (normal threshold) = 2.32σ
σ = standard deviation
[Diagram: instance Xi at distance di from the cluster center; the region within distance t is the confident area]
• If di > t, Xi is an outlier.
• New clusters are created starting
with the farthest outliers.
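The outlier test can be sketched as below. The slide does not spell out how σ is computed, so taking it as the root-mean-square distance of a cluster's members from its center is an assumption, and `find_outliers` is an illustrative name.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_outliers(members, center, factor=2.32):
    """Return the threshold t = factor * sigma and the outliers, farthest first."""
    distances = [euclidean(x, center) for x in members]
    # Assumption: sigma is the root-mean-square distance from the cluster center.
    sigma = math.sqrt(sum(d * d for d in distances) / len(distances))
    t = factor * sigma
    outside = [(d, x) for d, x in zip(distances, members) if d > t]
    # Farthest outliers come first, since they seed new clusters first.
    outside.sort(key=lambda pair: pair[0], reverse=True)
    return t, [x for _, x in outside]

members = [(1.0, 0.0)] * 9 + [(10.0, 0.0)]
t, outliers = find_outliers(members, center=(0.0, 0.0))
```

Instances inside the confident area (distance at most t) stay in the cluster; only those beyond it are candidates for seeding a new cluster.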
23/29
Clustering Algorithm (Y-means Algorithm)
3. merging clusters
[Diagram: an instance Xi lying in the overlap of two clusters' confident areas]
If Xi is in the confident area of two clusters, merge these
clusters back.
24/29
Labeling Clusters
• Our first assumption:
# of normal instances >> # of intrusions
• Label instances in large clusters: normal
• Label instances in small clusters: intrusion
• Starting from the largest cluster, label clusters as normal
until 99% of the data is labeled normal; label the rest as intrusion.
Normal cluster
Intrusion cluster
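The labeling rule can be sketched as follows. It reads the rule as "walk the clusters from largest to smallest and stop once at least 99% of the instances are covered", which is one interpretation of the slide; `label_clusters` and `normal_fraction` are illustrative names.

```python
def label_clusters(cluster_sizes, normal_fraction=0.99):
    """Label the largest clusters 'normal' until the target fraction is covered."""
    total = sum(cluster_sizes)
    # Visit clusters from largest to smallest.
    order = sorted(range(len(cluster_sizes)), key=lambda i: cluster_sizes[i], reverse=True)
    labels = ["intrusion"] * len(cluster_sizes)
    covered = 0
    for i in order:
        if covered >= normal_fraction * total:
            break  # enough data is already labeled normal
        labels[i] = "normal"
        covered += cluster_sizes[i]
    return labels

# Four clusters with 5, 900, 3 and 92 instances (1000 in total).
labels = label_clusters([5, 900, 3, 92])
```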
25/29
Intrusion Detection
For test instance x,
Measure the distance to each cluster.
Select the nearest cluster C.
If C is a normal cluster, label x as normal;
otherwise, label x as intrusion.
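The detection step can be sketched as a nearest-cluster lookup. The cluster representation (a dict with `centroid` and `label` keys) and the `detect` function are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def detect(x, clusters):
    """Label a test instance with the label of its nearest cluster."""
    nearest = min(clusters, key=lambda c: euclidean(x, c["centroid"]))
    return nearest["label"]

# Hypothetical labeled clusters produced by the earlier steps.
clusters = [
    {"centroid": (0.0, 0.0), "label": "normal"},
    {"centroid": (8.0, 8.0), "label": "intrusion"},
]
result_a = detect((0.5, -0.2), clusters)
result_b = detect((7.0, 9.0), clusters)
```

A test instance should be normalized with the same statistics as the training data before distances are measured.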
26/29
Overall Summary
• IDS & IDS Technologies
• Using Clustering for Intrusion Detection
• Methodology
1. Description of the dataset
2. Metric & Normalization
3. Clustering Algorithm
4. Labeling Clusters
5. Intrusion Detection
Conclusion
• Unsupervised clustering is chosen.
• KDD Cup 1999 Data
• The Y-means algorithm is used to build the intrusion detection system.
27/29
References
[1] KDD Cup 1999 data.
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[2] Y. Guan and A. A. Ghorbani. Y-means: A clustering method for
intrusion detection. In Proceedings of Canadian Conference
on Electrical and Computer Engineering, pages 1083-1086,
2003.
[3] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion detection with
unlabeled data using clustering. In Proceedings of ACM CSS
Workshop on Data Mining Applied to Security (DMSA-2001),
2001.
[4] K. Scarfone and P. Mell. Guide to intrusion detection and
prevention systems (IDPS), 2007.
28/29
Questions?
29/29