Transcript Paper

Survey Presentation on Four Selected Research Papers
on Data Mining Based Intrusion Detection System
60-564: Security and Privacy on the Internet
Instructor: Dr. A. K. Aggarwal
Presented By:
Ahmedur Rahman
Zillur Rahman
Lawangeen Khan
Date: April 05, 2006
Selected Research Papers
 Exploiting Efficient Data Mining Techniques to
Enhance Intrusion Detection Systems
 Packet Vs Session Based Modeling For
Intrusion Detection System
 Detecting Denial-of-Service Attacks with
Incomplete Audit Data
 ADAM: Detecting Intrusions by Data Mining
Table of Contents
 Introduction
 Paper 1
 Paper 2
 Paper 3
 Paper 4
 Testing Methodology
 Conclusion
 Bibliography
Introduction
 Intrusion detection is the process of gathering
intrusion-related knowledge by monitoring events
and analyzing them for signs of intrusion.
 Intrusions are detected using two common
practices: misuse detection and anomaly
detection.
 To apply data mining techniques in intrusion
detection, first, the collected data needs to be
preprocessed and converted to the format suitable
for mining processing. Next, the reformatted data
will be used to develop a clustering or
classification model.
Cont.
Introduction
 Paper 1 discusses various types of data
mining based intrusion detection system models.
 Paper 2 discusses packet-based vs. session-based
modeling for intrusion detection.
 Paper 3 discusses a model named SCAN,
which can work with good accuracy even with
an incomplete dataset.
 Paper 4 discusses ADAM, which was one
of the leading data mining based intrusion
detection systems.
Paper - 1
 The main motivation behind using data mining in
intrusion detection is automation. Patterns of normal behavior
and patterns of intrusion can be computed using data
mining.
 Two steps for applying data mining techniques to intrusion
detection:
– The collected monitoring data needs to be preprocessed and
converted to a format suitable for mining.
– The reformatted data is then used to develop a clustering or
classification model.
 Data mining is helpful in detecting new vulnerabilities and
intrusions, discovering previously unknown patterns of attacker
behavior, and providing decision support for intrusion
management.
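As a rough illustration of the preprocessing step, the sketch below turns audit records into numeric vectors that a clustering or classification routine could consume. The field names (`duration`, `protocol`) and the helper `encode_records` are hypothetical and are not taken from Paper 1; one-hot encoding is only one possible target format.

```python
def encode_records(records, categorical, numeric):
    """Convert dict-style audit records into numeric feature vectors.

    Numeric fields are copied as floats; each categorical field is
    one-hot encoded over the values seen in the data (an assumed encoding).
    """
    vocab = {f: sorted({r[f] for r in records}) for f in categorical}
    vectors = []
    for r in records:
        vec = [float(r[f]) for f in numeric]
        for f in categorical:
            vec.extend(1.0 if r[f] == v else 0.0 for v in vocab[f])
        vectors.append(vec)
    return vectors, vocab

# Hypothetical audit records for illustration only.
records = [
    {"duration": 2, "protocol": "tcp"},
    {"duration": 0, "protocol": "udp"},
    {"duration": 5, "protocol": "tcp"},
]
vectors, vocab = encode_records(records, categorical=["protocol"], numeric=["duration"])
```

The resulting vectors can then be passed to whichever clustering or classification model is chosen in the second step.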
Cont.
Paper - 1
 Data Mining and Intrusion Detection
– Different data mining approaches are
frequently used to analyze network data to
gain intrusion-related knowledge:
• Clustering.
• Classification.
• Outlier Detection.
• Association Rule.
Cont.
Paper - 1
 Clustering
– Most clustering techniques follow the same basic steps
to identify intrusions:
• Find the largest cluster, i.e., the one with the largest number of
instances, and label it normal.
• Sort the remaining clusters in ascending order of their
distances to the largest cluster.
• Select the first K1 clusters so that the number of data instances
in these clusters sums to η × N, and label them as normal,
where η is the percentage of normal instances and N is the total
number of instances.
• Label all the other clusters as attacks.
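The labeling steps above can be sketched as follows. This is a minimal illustration rather than the procedure of any specific system in Paper 1: the function name `label_clusters`, the Euclidean distance between centroids, and the choice to count the largest cluster toward the normal quota are all assumptions.

```python
import math

def label_clusters(centroids, sizes, eta):
    """Label each cluster 'normal' or 'attack'.

    centroids: one coordinate tuple per cluster
    sizes:     number of instances in each cluster
    eta:       assumed fraction of normal instances (0..1)
    """
    n_total = sum(sizes)
    largest = max(range(len(sizes)), key=lambda i: sizes[i])
    labels = ["attack"] * len(sizes)
    labels[largest] = "normal"
    covered = sizes[largest]  # assumption: the largest cluster counts toward eta * N

    # Remaining clusters, nearest to the largest cluster first.
    others = sorted(
        (i for i in range(len(sizes)) if i != largest),
        key=lambda i: math.dist(centroids[i], centroids[largest]),
    )
    for i in others:
        if covered >= eta * n_total:
            break
        labels[i] = "normal"
        covered += sizes[i]
    return labels

labels = label_clusters(
    centroids=[(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)],
    sizes=[60, 25, 10, 5],
    eta=0.8,
)
```

With these made-up sizes, the two clusters closest to the origin are labeled normal and the two distant ones are labeled attacks.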
Cont.
Paper - 1
 Classification
– Classification is similar to clustering in that it also partitions
records into distinct segments called classes.
– Unlike clustering, classification analysis requires that the end-user/analyst know ahead of time how classes are defined.
– Classification algorithms fall into three types:
• Extensions to linear discrimination.
• Decision tree
• Rule-based methods
Cont.
Paper - 1
 Association
– The association rule is specifically designed for use in data analysis.
– The objective behind using association rule based data mining is to
derive multi-feature (attribute) correlations from a database table.
– Association rule algorithms are classified into two categories:
• Candidate-generation-and-test approach such as Apriori.
• Pattern-growth approach.
– Basic steps for incorporating association rules into intrusion detection
are as follows:
• First, network data needs to be formatted into a database table where
each row is an audit record and each column is a field of the audit
records.
• There is evidence that intrusions and user activities show frequent
correlations among network data.
• Rules mined from network data can be merged continuously: the rules
from each new run are added to the aggregate rule set.
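To make the candidate-generation-and-test idea concrete, here is a small Apriori-style sketch over audit records stored as rows of a table. The function name `frequent_itemsets`, the field names, and the sample records are hypothetical; the sketch only finds frequent itemsets and leaves out rule generation and rule merging.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(records, min_support, max_size=2):
    """Level-wise search for frequent sets of (field, value) pairs."""
    transactions = [frozenset(r.items()) for r in records]
    min_count = min_support * len(transactions)
    frequent = {}
    # Level 1 candidates: every single (field, value) pair seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    size = 1
    while candidates and size <= max_size:
        # Test step: count how many records contain each candidate.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Generation step: join frequent sets, keep candidates whose
        # subsets of the previous size are all frequent.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent

# Hypothetical audit records: one row per record, one key per field.
records = [
    {"service": "http", "flag": "SF", "dst_port": 80},
    {"service": "http", "flag": "SF", "dst_port": 80},
    {"service": "http", "flag": "S0", "dst_port": 80},
    {"service": "ssh", "flag": "SF", "dst_port": 22},
]
result = frequent_itemsets(records, min_support=0.5)
```

Each frequent itemset with more than one pair is a multi-feature correlation of the kind the slide describes, for example `service=http` together with `dst_port=80`.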
Cont.
Paper - 1
 Data Mining Based IDS:
– Data mining is becoming one of the popular
techniques for detecting intrusion. IDS can be
classified on the basis of their strategy of
detection. There are two categories under this
classification.
• Misuse Detection Based IDS.
• Anomaly Detection Based IDS.
Cont.
Paper - 1
 IDS Using both Misuse and Anomaly
Detection:
– Some IDSs use both misuse and anomaly
detection techniques and are therefore
capable of detecting both known and unknown
intrusions.
• IIDS (Intelligent Intrusion Detection System
Architecture).
• RIDS-100 (Rising Intrusion Detection System).
Paper - 2
 In this paper the authors report the findings of
their research on anomaly-based
intrusion detection systems, using data-mining techniques to create a decision tree
model of their network from the 1999
DARPA Intrusion Detection Evaluation data
set.
Cont.
Paper - 2
 Types Of IDS:
– Misuse Detectors
– Anomaly Based Detectors
 Problem with IDS:
– False Negative
– False Positive
Cont.
Paper - 2
 Dynamic Modeling:
– Data Preparation
 Studied the data sets’ patterns and modeled the traffic patterns
around a generated target variable, “TGT”.
 They used this variable as a predictor target variable for
setting the stage.
– Two separate data sets:
 Packet-based Modeling.
 Session Based Modeling
Cont.
Paper - 2
 Classification Tree Modeling:
– Advantage:
The advantage of this method over traditional pattern-recognition
methods is that the classification tree is an intuitive
tool and relatively easy to interpret.
Cont.
Paper - 2
 Data Modeling And Assessment:
– They accomplished their data-mining goals through the use of supervised
learning techniques on the binary target variable, “TGT”.
– From the revised data sets they further deleted a few extraneous
variables: date/time, source IP address, and destination IP
address.
– They partitioned both data sets using stratified random sampling.
– The TGT variable was then specified to form subsets of the
original data to improve the classification precision of their
model.
– They created a decision tree using the chi-square splitting criterion.
– After five leaves of depth, the tree maximizes its profit.
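For readers unfamiliar with chi-square splitting, the sketch below computes the Pearson chi-square statistic of a candidate split against the binary TGT variable; a tree builder would prefer the split with the larger statistic. The counts and the split names are invented for illustration and do not come from Paper 2, and the authors' tool may apply the criterion differently.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: the two branches of a candidate split.
# Columns: TGT = 0 (non-attack), TGT = 1 (attack). Counts are hypothetical.
split_a = [[40, 10], [10, 40]]   # separates the classes well
split_b = [[26, 24], [24, 26]]   # barely separates the classes

stat_a = chi_square(split_a)
stat_b = chi_square(split_b)
```

Here `split_a` scores far higher than `split_b`, so it would be chosen as the more informative split.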
Paper - 3
 Detecting Denial-of-Service Attacks with
Incomplete Audit Data
 SCAN (Stochastic Clustering Algorithm for
Network anomaly detection)
 Improved version of the Expectation-Maximization (EM) algorithm
 Can handle missing data in audit dataset
Cont.
Paper - 3
 Features of SCAN:
– Improvement in speed of convergence
• Combination of Data Summaries, Bloom Filters, Arrays
– Ability to detect anomaly in absence of complete audit data
• EM computes the maximum likelihood estimates in a parametric
model based on prior information
 Components of SCAN:
– Online Sampling and Flow Aggregation
– Clustering and Data Reduction
• EM based Clustering Algorithm
• Data Summaries for Data Reduction
• Handling Missing Data in a dataset
– Anomaly Detection
Cont.
Paper - 3
 System Model Overview:
Cont.
Paper - 3
 Online Sampling and Flow Aggregation:
– Traffic is sampled and classified into flows
• A flow is the set of all connections with a particular destination IP address and destination port
– Connection records provide the following fields:
• (SrcIP,SrcPort,DstIP,DstPort,ConnStatus,Duration)
– First, identify the flow that the packet belongs to
– If not found: generate a new flow ID based on
connection information
– If found: update the corresponding flow array
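The flow lookup described above might look like the following sketch. It assumes connection records arrive as tuples in the field order listed on the slide; the function name `aggregate_flows`, the sequential flow IDs, and the sample addresses are hypothetical, and sampling is left out.

```python
from collections import defaultdict

def aggregate_flows(connections):
    """Group connection records into flows keyed by (DstIP, DstPort)."""
    flow_ids = {}               # (DstIP, DstPort) -> flow ID
    flows = defaultdict(list)   # flow ID -> connection records in that flow
    for conn in connections:
        src_ip, src_port, dst_ip, dst_port, status, duration = conn
        key = (dst_ip, dst_port)
        if key not in flow_ids:
            # Not found: generate a new flow ID from the connection information.
            flow_ids[key] = len(flow_ids)
        # Found (or just created): update the corresponding flow array.
        flows[flow_ids[key]].append(conn)
    return flow_ids, dict(flows)

# Hypothetical connection records:
# (SrcIP, SrcPort, DstIP, DstPort, ConnStatus, Duration)
connections = [
    ("192.168.1.2", 40001, "10.0.0.5", 80, "SF", 1.2),
    ("192.168.1.3", 40002, "10.0.0.5", 80, "SF", 0.4),
    ("192.168.1.2", 40003, "10.0.0.9", 22, "S0", 0.0),
]
flow_ids, flows = aggregate_flows(connections)
```

The first two records share a destination IP and port, so they land in the same flow.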
Cont.
Paper - 3
 Clustering and Data Reduction
– Uses EM based clustering algorithm
– Input: Dataset D and initial estimates
– Output: Clustered set of Data Points
– EM Algorithm has two steps:
• Expectation Step: Finds the expected value of the complete-data
log-likelihood
• Maximization Step: Maximizes the expectation computed in
the E step
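The two steps can be illustrated with plain EM on a one-dimensional mixture of two Gaussians. This is the textbook algorithm, not SCAN's improved variant; the function names, the data, and the initial estimates are made up for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, means, variances, weights, iterations=50):
    """Fit a Gaussian mixture by alternating E and M steps."""
    for _ in range(iterations):
        # E step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M step: re-estimate parameters from the weighted data.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a zero variance
            weights[k] = nk / len(data)
    return means, variances, weights

# Input: dataset D and initial estimates (all values hypothetical).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 8.0, 8.3, 7.9, 8.1]
means, variances, weights = em_two_gaussians(
    data, means=[0.0, 5.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this toy data the two component means settle near 1.0 and 8.075, which are the averages of the two visible groups of points.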
Cont.
Paper - 3
 Clustering and Data Reduction
– Data Summaries for Data Reduction
– Executed to build a summary for each time slice
– This enhancement reduces the dataset
– Consists of 3 stages
– 1st Stage: At the end of each time slice, a pass is made
over the connection dataset to build data summaries.
– 2nd Stage: Another pass is made over the dataset to build
cluster candidates using the data summaries collected in the
previous stage. After the frequently appearing data have been
identified, the algorithm builds all cluster candidates during
this pass.
– 3rd Stage: Clusters are selected from the set of
candidates.
Cont.
Paper - 3
 Clustering and Data Reduction
– Handling missing data in a dataset
– At the E step the expected value of each missing entry is computed
– At the M step that expected value is inserted into the
dataset
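A simplified picture of this fill-in loop, assuming just two features with a linear-Gaussian relationship: the expected value of a missing `y` given its `x` is computed from the current parameter estimates, inserted into the dataset, and the parameters are re-estimated. The function name `em_impute` and the data are hypothetical, and this sketch omits the conditional-variance correction that a full EM treatment would include; it is not SCAN's actual procedure.

```python
def em_impute(points, iterations=100):
    """Fill missing y values (None) in a list of (x, y) pairs.

    Assumes the x values are not all identical.
    """
    xs = [x for x, _ in points]
    observed = [y for _, y in points if y is not None]
    start = sum(observed) / len(observed)        # initial estimate
    ys = [start if y is None else y for _, y in points]
    missing = [i for i, (_, y) in enumerate(points) if y is None]
    n = len(points)
    for _ in range(iterations):
        # Re-estimate means, variance and covariance from the completed data.
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs) / n
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
        # Expected value of each missing y given its x, inserted into the data.
        for i in missing:
            ys[i] = mean_y + cov_xy / var_x * (xs[i] - mean_x)
    return ys

# Hypothetical records where y = 2x and one y value is missing.
filled = em_impute([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None)])
```

On this toy input the missing entry converges to 8.0, the value consistent with the other three records.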
 Anomaly Detection:
 Anomalous distribution A
 Normal distribution N
 Determine which group the traffic belongs to
 The likelihood of each of the two cases is calculated to determine the
distribution to which the incoming flow belongs.
 Statistical functions
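The decision between the normal distribution N and the anomalous distribution A can be sketched as a likelihood comparison. The example assumes a single numeric feature per flow and Gaussian models for both cases; the parameter values and the function names are invented for illustration and are not taken from Paper 3.

```python
import math

def log_likelihood(x, mean, std):
    """Log-density of x under a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def classify_flow(x, normal, anomalous):
    """Assign the flow to whichever distribution gives it the higher likelihood."""
    ll_n = log_likelihood(x, *normal)
    ll_a = log_likelihood(x, *anomalous)
    return "anomalous" if ll_a > ll_n else "normal"

# Hypothetical (mean, std) of connections per time slice under each model.
N = (20.0, 5.0)
A = (200.0, 50.0)

quiet_flow = classify_flow(25.0, N, A)
busy_flow = classify_flow(180.0, N, A)
```

A flow with 25 connections per slice is far more likely under N, while one with 180 is far more likely under A.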
Paper - 4
 ADAM: Detecting Intrusion by Data Mining
– Audit Data Analysis and Mining
– Combination of Association Rule and Classification
Rule
– First, ADAM collects known frequent itemsets
– Second, ADAM runs an online algorithm
• Finds the frequent connection records in the most recent traffic
• Compares them with the known mined data
• Discards those that appear to be normal
• Suspicious ones are forwarded to the classifier
• The trained classifier then classifies the suspicious data as
one of the following:
– Known type of attack
– Unknown type of attack
– False alarm
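The online filtering step can be pictured with the simplified sketch below. It treats each connection as a single (DstIP, DstPort) item rather than mining multi-attribute itemsets, and it stops before the classifier stage; the function name `suspicious_items`, the profile, and the window contents are hypothetical and only approximate what ADAM does.

```python
from collections import Counter

def suspicious_items(window, normal_profile, min_support):
    """Return items frequent in the recent window but absent from the profile."""
    counts = Counter(window)
    threshold = min_support * len(window)
    frequent = {item for item, n in counts.items() if n >= threshold}
    # Items already known to be normal are discarded; the rest are suspicious.
    return frequent - normal_profile

# Hypothetical profile of known frequent items, built offline.
normal_profile = {("10.0.0.5", 80), ("10.0.0.5", 443)}
# Hypothetical window of the most recent connections.
window = [("10.0.0.5", 80)] * 6 + [("10.0.0.7", 23)] * 3 + [("10.0.0.5", 443)]

suspicious = suspicious_items(window, normal_profile, min_support=0.2)
```

Only the unexpected burst of connections to port 23 survives the filter and would be forwarded to the classifier.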
Cont.
Paper - 4
 ADAM has two phases in its model
 1st Phase: Train the classifier
– Offline process
– Takes place only once
– Before the main experiment
 2nd Phase: Using the trained classifier
– Trained classifier is then used to detect anomalies
– Online process
Cont.
Paper - 4
 Phase 1:
Cont.
Paper - 4
 Phase 2:
Testing Methodology
 Paper 1:
– In this paper the authors did not use any testing
methodology. They described different kinds of data
mining techniques and rules to implement in various
kinds of data mining based IDS.
 Paper 2:
– The authors of this paper used the MIT Lincoln Lab 1999
intrusion detection evaluation (IDEVAL) data sets.
– From this data set they over-sampled and created a new
data set that contained a mix of 43.6% attack sessions
and 56.4% non-attack sessions.
– From both data sets they drew random samples,
allocating 67% of the observations to the training data set
and 33% to the validation data set.
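A 67%/33% split that preserves the attack/non-attack mix can be sketched as below. The function name `stratified_split`, the fixed seed, and the 1,000 synthetic rows are assumptions made for the example; the authors' actual tooling is not described in this transcript.

```python
import random

def stratified_split(rows, label_of, train_fraction=0.67, seed=0):
    """Split rows into training and validation sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, valid = [], []
    for members in by_class.values():
        members = members[:]          # copy so the input order is untouched
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Synthetic sessions mirroring the 43.6% / 56.4% mix: (session_id, TGT).
rows = [(i, 1) for i in range(436)] + [(i, 0) for i in range(436, 1000)]
train, valid = stratified_split(rows, label_of=lambda row: row[1])
```

Because each class is split separately, the training and validation sets keep roughly the same proportion of attack sessions as the full data set.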
Cont.
Testing Methodology
 Paper 2: (Continued)
– Packet-Based Results
• They scored the UCF data with the packet-based network
model and found that approximately 2.5% of the packets
had a probability of 1.0000 of being an attack packet.
• Conversely, a score of 0.0000 does
not necessarily mean that the packet is a “good” packet that poses
no threat to their campus networks.
• The following figure shows that more than 70% of the packets
captured have an attack probability of 0.0000 and 97% of the
packets have an attack probability of 0.5000 or less.
Cont.
Testing Methodology
 Paper 2: (Continued)
– Packet-Based Results (Continued)
• Overall, out of the approximately 500,000 packets with a 1.0000
probability, there are at least 50,000 packets that require further
study. Retraining their model and readjusting its prior
probabilities will show whether those remaining packets are truly
attack packets or simply false alarms.
Cont.
Testing Methodology
 Paper 2: (Continued)
– Session-Based Results
• They also scored the UCF data with the session-based network
model and found that approximately 32.9% of the sessions
were identified as having a probability of 1.0000 of being an
attack session.
• Conversely, a score lower than
1.0000 does not necessarily mean that the session is a “good”
session that poses no threat to their campus networks.
• The vast majority of the sessions captured had a low or nonexistent probability of being an attack session. Their studies
showed that more than 66% of the sessions captured have an
attack probability of 0.0129.
Cont.
Testing Methodology
 Paper 3:
– The authors of this paper evaluated SCAN using the 1999 DARPA
intrusion detection evaluation data.
– The dataset consists of five weeks of TCPDump data.
– Data from weeks 1 and 3 consist of normal, attack-free network
traffic.
– Week 2 data consists of network traffic with labeled attacks.
– The week 4 and 5 data are the “Test Data” and contain 201
instances of 58 different unlabelled attacks, 177 of which are
visible in the inside TCPDump data.
– They trained SCAN on the DARPA dataset using week 1 and 3
data, then evaluated the detector on weeks 4 and 5.
– Evaluation was done in an off-line manner.
Cont.
Testing Methodology
 Paper 3: (Continued)
– Simulation Results with Complete Audit Data
• An IDS is evaluated on the basis of accuracy and efficiency. To judge
the efficiency and accuracy of SCAN, they used Receiver Operating
Characteristic (ROC) curves.
• An ROC curve graphs the detection rate against the false
positive rate.
• The performance of SCAN at detecting an SSH Process Table attack is
shown in Fig. 5.
• The attack is similar to the Process
Table attack in that the goal of the
attacker is to cause the SSHD daemon
to spawn the maximum number of
processes that the operating system
will allow.
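For reference, the points of an ROC curve like the ones mentioned above can be computed from detector scores and ground-truth labels as in this sketch. The scores and labels are made up and have no connection to the SCAN results; `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, detection_rate) pairs, one per threshold.

    labels: 1 for attack, 0 for normal. Assumes both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is flagged as an attack.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical anomaly scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
```

Plotting the detection rate against the false positive rate for these points gives the curve; a detector whose curve lies closer to the top-left corner performs better.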
Cont.
Testing Methodology
 Paper 3: (Continued)
– Simulation Results with Complete Audit Data (Continued)
• In Fig. 6 the performance of SCAN at detecting a SYN flood attack is
evaluated. A SYN flood attack is a type of DoS attack.
• The flood of half-open connection requests causes the data structure
in the ‘tcpd’ daemon in the server to overflow.
• They compared the performance of
SCAN with that of the K-Nearest
Neighbors (KNN) clustering algorithm.
• Comparison of the ROC curves seen in
Fig. 5 and 6 suggests that SCAN
outperforms the KNN algorithm.
• The two techniques are similar to each
other in that both are unsupervised
techniques that work with unlabelled
data.
Cont.
Testing Methodology
 Paper 4:
– The authors of this paper report that ADAM participated in the
DARPA 1999 intrusion detection evaluation.
– It focused on detecting DoS and PROBE attacks from tcpdump
data and performed quite well.
– Figures 3 and 4 show the results on the DARPA 1999
test data.
Cont.
Testing Methodology
 Paper 4: (Continued)
Conclusion
 In this report we have studied the details of four papers in
this area.
 We have tried to summarize those four papers, their
system models, their technologies and their validation
methods.
 We did not go through all the cross-references given in
those papers; rather, we kept the scope of this paper limited
to these four papers.
 We strongly believe that this paper will give the
reader an overview of current developments in this area
and of how data mining is evolving in the field of network
intrusion detection.
Bibliography




[1] Chang-Tien Lu, Arnold P. Boedihardjo, Prajwal Manalwar, “Exploiting Efficient Data Mining
Techniques to Enhance Intrusion Detection Systems”, Information Reuse and Integration,
2005. IRI-2005 IEEE International Conference on.
[2] Caulkins, B.D.; Joohan Lee; Wang, M, “Packet- vs. session-based modeling for intrusion
detection system”, Information Technology: Coding and Computing, 2005. ITCC 2005.
International Conference on Volume 1, 4-6 April 2005 Page(s):116 - 121 Vol. 1 Digital Object
Identifier 10.1109/ITCC.2005.222 .
[3] Patcha, A.; Park, J.-M., “Detecting denial-of-service attacks with incomplete audit data”,
Computer Communications and Networks, 2005. ICCCN 2005. Proceedings. 14th International
Conference on 17-19 Oct. 2005 Page(s):263 - 268 Digital Object Identifier
10.1109/ICCCN.2005.1523864.
[4] Daniel Barbara, Julia Couto, Sushil Jajodia, Leonard Popyack, Ningning Wu, “ADAM:
Detecting Intrusions by Data Mining”, Proceedings of the 2001 IEEE Workshop on Information
Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001.
Questions