Lecture20 - The University of Texas at Dallas

Data and Applications Security
Developments and Directions
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #20
Guest Lecture
Data Mining for Intrusion Detection
Data Mining & Intrusion Detection Systems
Mamoun Awad
Dept. of Computer Science
University of Texas at Dallas
Outline
Intrusion Detection
Data Mining
Approach
Data set & Results
What is an intrusion?
An intrusion can be defined as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”.
Intrusion Examples
• Virus
• Buffer overflows
  2000 Outlook Express vulnerability.
• Denial of Service (DOS)
  An explicit attempt by attackers to prevent legitimate users of a service from using that service.
• Address spoofing
  A malicious user uses a fake IP address to send malicious packets to a target.
• Many others
  R2L, U2R, Probe, …
Intrusion Detection System (IDS)
An Intrusion Detection System (IDS)
inspects all inbound and outbound network
activity and identifies suspicious patterns
that may indicate a network or system
attack from someone attempting to break
into or compromise a system.
Attack Types
Host-based attacks
Gain access to privileged services or resources on a
machine.
Network-based attacks
Make it difficult for legitimate users to access various
network services
IDS Categories
Intrusion detection systems are split into
two groups:
Anomaly detection systems
Identify malicious traffic based on deviations from established normal network behavior.
Misuse detection systems
Identify intrusions based on known patterns (signatures) of malicious activity.
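The two categories can be contrasted with a minimal sketch. Everything here is illustrative: the signature strings, the request-rate baseline, and the 3-standard-deviation threshold are assumptions, not part of the lecture.

```python
from statistics import mean, stdev

# Hypothetical signatures: substrings that mark a known attack payload.
SIGNATURES = ["/etc/passwd", "' OR '1'='1"]


def misuse_detect(payload):
    """Misuse detection: flag a payload containing any known signature."""
    return any(sig in payload for sig in SIGNATURES)


def anomaly_detect(baseline, observed, k=3.0):
    """Anomaly detection: flag a value more than k standard deviations
    away from the mean of the established normal baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return abs(observed - mu) > k * sigma


# Assumed baseline: requests per second observed during normal operation.
normal_rates = [98.0, 102.0, 100.0, 101.0, 99.0]

known_attack = misuse_detect("GET /../../etc/passwd")
benign_request = misuse_detect("GET /index.html")
traffic_spike = anomaly_detect(normal_rates, 500.0)
usual_traffic = anomaly_detect(normal_rates, 100.5)
```

Misuse detection can only catch what is already in the signature list, while anomaly detection can flag unseen behavior at the cost of also flagging unusual but harmless activity.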
Problem Statement
• Goal of Intrusion Detection Systems (IDS):
  • To detect an intrusion as it happens and be able to respond to it.
• False positives:
  • A false positive is a situation where something abnormal (as defined by the IDS) happens, but it is not an intrusion.
• Too many false positives:
  • Users will quit monitoring the IDS because of the noise.
• False negatives:
  • A false negative is a situation where an intrusion is really happening, but the IDS doesn't catch it.
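As a small worked example of these two error types, the sketch below counts them over a list of events. The slides do not say which denominators are used for the rates, so this assumes the common convention: false positives divided by all non-intrusions, and false negatives divided by all intrusions. The event data is made up.

```python
def fp_fn_rates(actual, flagged):
    """Return (false positive rate, false negative rate).

    actual[i]  is True when event i really is an intrusion.
    flagged[i] is True when the IDS raised an alert for event i.
    """
    fp = sum(1 for a, f in zip(actual, flagged) if f and not a)
    fn = sum(1 for a, f in zip(actual, flagged) if a and not f)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fp_rate = fp / negatives if negatives else 0.0
    fn_rate = fn / positives if positives else 0.0
    return fp_rate, fn_rate


# Six hypothetical events: four normal, two intrusions.
actual = [False, False, False, False, True, True]
flagged = [True, False, False, False, True, False]
fp_rate, fn_rate = fp_fn_rates(actual, flagged)
print(f"FP rate: {fp_rate:.2f}, FN rate: {fn_rate:.2f}")
```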
Layered Security Mechanism
Problem Statement
• Misuse Detection
Firewalls
Firewall Rules
Order | Protocol | Source IP | Source Port | Destination IP | Destination Port | Action
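A rule table with these columns is typically evaluated top to bottom. The sketch below assumes first-match semantics, a "*" wildcard, and a default action of "deny"; none of these are stated on the slide, and the addresses are made-up examples.

```python
from dataclasses import dataclass

ANY = "*"  # wildcard marker (an assumption of this sketch)


@dataclass
class Rule:
    order: int
    protocol: str
    src_ip: str
    src_port: str
    dst_ip: str
    dst_port: str
    action: str


def field_matches(rule_value, packet_value):
    """A rule field matches if it is the wildcard or equals the packet field."""
    return rule_value == ANY or rule_value == str(packet_value)


def decide(rules, packet, default="deny"):
    """Return the action of the first matching rule, scanning by Order."""
    for rule in sorted(rules, key=lambda r: r.order):
        if (field_matches(rule.protocol, packet["protocol"])
                and field_matches(rule.src_ip, packet["src_ip"])
                and field_matches(rule.src_port, packet["src_port"])
                and field_matches(rule.dst_ip, packet["dst_ip"])
                and field_matches(rule.dst_port, packet["dst_port"])):
            return rule.action
    return default


rules = [
    Rule(2, "tcp", ANY, ANY, "10.0.0.5", ANY, "deny"),
    Rule(1, "tcp", ANY, ANY, "10.0.0.5", "80", "allow"),
]
web = {"protocol": "tcp", "src_ip": "192.0.2.7", "src_port": 40000,
       "dst_ip": "10.0.0.5", "dst_port": 80}
ssh = dict(web, dst_port=22)
```

Because the first match wins, the Order column matters: the more specific "allow port 80" rule has to come before the broader "deny" rule for the same destination.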
Hierarchical Distributed Firewall Setup
Problem Statement
• Anomaly Detection
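Our Approach
[Figure: flow diagram with the labels Training Data, SVM Class Training, Testing, Testing Data, and Class, annotated "Problem???"]
Our Approach
[Figure: the same flow diagram with a Hierarchical Clustering (DGSOT) stage added; labels: Training Data, Hierarchical Clustering (DGSOT), SVM Class Training, Testing, Testing Data, and Class]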
Dynamically Growing Self-Organizing
Tree Algorithm (DGSOT)
DGSOT
• Learning Process
Winner Node
$$c : \lVert x - n_c \rVert = \min_i \{ \lVert x - n_i \rVert \}$$
Update the Tree
Stopping Criteria
$$AD_j = \frac{1}{N} \sum_{i=1}^{N} d(x_i, n_k)$$
$$\frac{\lvert AD_{j+1} - AD_j \rvert}{AD_j} < \varepsilon$$
2
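The winner-node rule and the stopping criterion can be sketched as follows. This is only an illustration of those two formulas, not the full DGSOT algorithm: it assumes Euclidean distance for d, takes n_k to be the winner node of x_i, uses the absolute relative change in average distortion, and picks an arbitrary threshold `eps`.

```python
import math


def distance(x, n):
    """Euclidean distance between a data point and a node's reference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, n)))


def winner(x, nodes):
    """Index c of the node closest to x: ||x - n_c|| = min_i ||x - n_i||."""
    return min(range(len(nodes)), key=lambda i: distance(x, nodes[i]))


def average_distortion(data, nodes):
    """AD = (1/N) * sum_i d(x_i, n_k), where n_k is the winner node for x_i."""
    total = sum(distance(x, nodes[winner(x, nodes)]) for x in data)
    return total / len(data)


def should_stop(ad_prev, ad_next, eps=0.05):
    """Stop growing when the relative change in average distortion is below eps."""
    return abs(ad_next - ad_prev) / ad_prev < eps


# Made-up data: two nodes and four points.
nodes = [(0.0, 0.0), (10.0, 10.0)]
data = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0), (10.0, 12.0)]
ad = average_distortion(data, nodes)
```

With this data the point distances to their winner nodes are 1, 1, 1, and 2, so `ad` is 1.25.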
Support Vector Machine
Support Vector Machines (SVM)
One of the most powerful classification techniques
Finds a hyperplane that separates the classes
Based on the idea of mapping data points to a high-dimensional feature space where a separating hyperplane can be found
The value of support vectors and non-support vectors
The effect of adding new data points on the margins
Feature Mapping
Feature mapping from a two-dimensional input space to a two-dimensional feature space.
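To illustrate why such a mapping helps, the sketch below uses an assumed map phi(x1, x2) = (x1^2, x2^2). The map actually shown on the slide is not recoverable, so this choice is purely illustrative. Points inside and outside a circle cannot be split by a line in the input space, but after the mapping the line z1 + z2 = 1 separates them. The hyperplane here is hand-picked, not learned by an SVM.

```python
def phi(point):
    """Assumed feature map from 2-D input space to 2-D feature space."""
    x1, x2 = point
    return (x1 * x1, x2 * x2)


def side(z, w=(1.0, 1.0), b=-1.0):
    """Which side of the hyperplane w . z + b = 0 the mapped point lies on."""
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1


# Made-up points: `inner` lies inside the unit circle, `outer` outside it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

inner_sides = [side(phi(p)) for p in inner]
outer_sides = [side(phi(p)) for p in outer]
```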
SVM Limitations
Long training time limits its use.
Clustering has a positive impact on the training of an SVM: each cluster is represented by only one reference
• Reduces training time
• Degrades generalization, because fewer points are used.
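The idea of replacing each cluster with a single reference can be sketched as below. Plain k-means is used here as a stand-in for DGSOT, and the cluster centroid plays the role of the reference; both are assumptions for illustration, as are the labels and data points.

```python
import math


def kmeans(points, k, iters=10):
    """Plain k-means with deterministic initialization (first k points)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


def reduce_training_set(labeled, k_per_class):
    """Cluster each class separately and keep one reference per cluster."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    reduced = []
    for y, pts in by_class.items():
        for ref in kmeans(pts, min(k_per_class, len(pts))):
            reduced.append((ref, y))
    return reduced


labeled = [
    ((0, 0), "normal"), ((0, 2), "normal"),
    ((10, 10), "normal"), ((10, 12), "normal"),
    ((20, 0), "attack"), ((22, 0), "attack"),
]
reduced = reduce_training_set(labeled, 2)
```

The reduced set would then be passed to the SVM trainer in place of the full data. Fewer references mean faster training, but the SVM sees less detail near the class boundary, which is the trade-off listed above.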
Hierarchical clustering with SVM flow chart
Training set
• 1998 DARPA data that originated from the MIT Lincoln Lab
• http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Size: 1,012,477 data points
Data set / Attack Types
• DOS
  Denial of service.
• R2L
  Unauthorized access from a remote machine, e.g., guessing a password.
• U2R
  Unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks.
• Probing
  Surveillance and other probing, e.g., port scanning.
Results
Method               | Weighted Accuracy | Average Accuracy | Average Training Time | Average FP rate | Average FN rate
Random Selection     | 62.5%             | 62.61%           | 0.049 hours           | 22.40%          | 37.38%
Pure SVM             | 62.74%            | 62.75%           | 0.51 hours            | 30.75%          | 37.24%
SVM+Rocchio Bundling | 63.09%            | 63.11%           | 0.93 hours            | 30.98%          | 36.89%
SVM + DGSOT          | 63.34%            | 63.36%           | 0.26 hours            | 51.56%          | 36.64%
Relevant and Important Publications
• “A Dynamically Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Feng Luo, Latifur Khan, Farokh Bastani, I-Ling Yen, and J. Zhou, Bioinformatics, Oxford University Press, UK, 20(16), November 2004, pp. 2605-2617.
• “Automatic Image Annotation and Retrieval using Weighted Feature Selection,” Lei Wang and Latifur Khan, to appear in a special issue of Multimedia Tools and Applications, Kluwer Publishers.
• “Hierarchical Clustering for Complex Data,” Latifur Khan and Feng Luo, to appear in International Journal on Artificial Intelligence Tools, World Scientific Publishers.
• “A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering,” Latifur Khan, Mamoun Awad, and Bhavani Thuraisingham, to appear in VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.
• “The 1999 DARPA Off-line Intrusion Detection Evaluation,” R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, Computer Networks, 34, pp. 579-595, 2000.