Data Mining Approach for Network Intrusion Detection
Zhen Zhang
Advisor: Dr. Chung-E Wang
04/24/2002
Department of Computer Science
California State University, Sacramento
Outline
Background
– Intrusion Detection: promises and challenges
– Data Mining in IDS: how can it help
Motivation
Approaches, tasks, problems and my
contributions
Results
Conclusion and future work
Intrusion Detection
- Building a Secure Network
Primary assumptions
– System activities are observable
– Normal and intrusive activities have distinct evidence
Main techniques
– Misuse detection: patterns of well-known
attacks
– Anomaly detection: deviation from normal
usage
Data Mining in IDS
Shortfalls of current IDS (mostly misuse
detection)
– Variants: intrusions change easily and frequently, so
fixed signatures miss them.
– False positives: normal activity is flagged as an attack,
making real intrusions difficult to pick out.
– False negatives: attacks for which there are no known
signatures go undetected.
– Data overload: the amount of data grows rapidly.
What is Data Mining
Data Mining:
Extract patterns or deviations from data.
Many different types of algorithms:
Decision tree, link analysis, clustering, association, rule
induction, deviation analysis, and sequence analysis.
Software and Tools:
– MS SQL Server 2000
– Ripper and many others
How can Data Mining help
Variants
– Anomaly detection does not depend on signatures, so
variants of an exploit code are of little concern.
False positives
– Identify recurring sequences of alarms to help
recognize valid network activity.
False negatives
– Attacks for which signatures have not been developed
might be detected.
Data overload
– Data mining plays a vital role.
Summary of my work
Identify objective
– Distinguish network attacks from normal traffic
– New area, several research projects, no commercial products
– Focus on the principle and basic implementation of concepts
Data Collection
Data Pre-processing on tcpdump dataset
Apply data mining on processed data
Investigate results
Software packages used: Visual Basic, Microsoft
SQL Server 2000 with Analysis Server, Tcpdump
Data Collection
Tcpdump data (http://iris.cs.uml.edu:8080/)
– Tcpdump was executed on the gateway to capture the
traffic between the LAN and external networks, as well as
broadcast packets within the LAN
– Headers only, no user data
– Filters were used to keep only TCP and UDP packets
– Baseline and 4 simulated attacks
Tcpdump data format
TCP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Flags (SYN, FIN, PUSH, RST, or .)
– Data sequence number of this packet
– Data sequence number of the data expected in return
– Number of bytes of receive buffer space available
– Indication of whether or not the data is urgent
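The TCP fields listed above can be pulled out of one line of tcpdump text output with a regular expression. The sketch below is a minimal one, assuming numeric addresses (as produced by `tcpdump -n`) and the classic one-line layout `time src.port > dst.port: flags seq:seq(len) ack N win N urg N`. The exact layout varies between tcpdump versions, and the sample line and the name `parse_tcp_line` are hypothetical, not taken from the dataset.

```python
import re

# Assumed layout of one TCP line; real tcpdump output varies by version.
TCP_LINE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+)\s+>\s+"
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+):\s+"
    r"(?P<flags>[SFPR.]+)"
    r"(?:\s+(?P<seq>\d+):\d+\((?P<length>\d+)\))?"
    r"(?:\s+ack\s+(?P<ack>\d+))?"
    r"\s+win\s+(?P<win>\d+)"
    r"(?P<urgent>\s+urg\s+\d+)?"
)


def parse_tcp_line(line):
    """Return the header fields of one TCP line as a dict, or None if the
    line does not match the assumed layout."""
    m = TCP_LINE.match(line.strip())
    if m is None:
        return None
    f = m.groupdict()
    return {
        "ts": f["ts"],
        "src_ip": f["src_ip"],
        "src_port": int(f["src_port"]),
        "dst_ip": f["dst_ip"],
        "dst_port": int(f["dst_port"]),
        "flags": f["flags"],
        "seq": int(f["seq"]) if f["seq"] else None,
        "ack": int(f["ack"]) if f["ack"] else None,
        "win": int(f["win"]),
        "urgent": f["urgent"] is not None,
    }


# Hypothetical sample line, for illustration only.
sample = "09:32:43.910000 10.0.0.5.1173 > 10.0.0.9.21: S 62697789:62697789(0) win 512"
record = parse_tcp_line(sample)
```

Lines that do not fit the assumed pattern return `None`, so they can be counted and inspected instead of being silently dropped.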
Tcpdump data format
UDP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Length of the packet
Example tcpdump data
Data Pre-processing
- 80% to 90% of the work
Packet-level information to connection
level
– Group packets by the same source/destination IP/port
– Use flags and ACKs to determine the status of the connection
» SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH,
SHR, OOS1, OOS2
– Record start time, duration, protocol
– Calculate bytes in, bytes out, resend rate
– UDP is connectionless, so simply treat each packet as
a connection
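The packet-to-connection step can be sketched as follows. This is a simplified illustration, not the implementation used in this work: it assumes packets are dicts with a numeric `ts` in seconds plus the header fields listed earlier, and it derives only three of the states. S0 and REJ follow the definitions given in this talk (SYN never answered, SYN met by RST); the reading of SF as "SYN seen and FIN seen from both sides" is an assumption. Byte counts and resend rate are left out, and all names are hypothetical.

```python
from collections import defaultdict


def connection_key(pkt):
    """Direction-independent key: the two (ip, port) endpoints, ordered."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (a, b) if a <= b else (b, a)


def summarize_connections(packets):
    """Group packets into connections and derive a coarse final state."""
    groups = defaultdict(list)
    for pkt in packets:
        groups[connection_key(pkt)].append(pkt)

    connections = []
    for pkts in groups.values():
        pkts.sort(key=lambda p: p["ts"])
        first = pkts[0]
        # Assumption: the sender of the earliest packet is the originator.
        originator = (first["src_ip"], first["src_port"])
        orig_flags, resp_flags = set(), set()
        for p in pkts:
            sender = (p["src_ip"], p["src_port"])
            side = orig_flags if sender == originator else resp_flags
            side.update(p["flags"])

        if "S" in orig_flags and not resp_flags:
            state = "S0"      # SYN sent, nothing came back
        elif "S" in orig_flags and "R" in resp_flags and "S" not in resp_flags:
            state = "REJ"     # SYN met by RST
        elif "S" in orig_flags and "F" in orig_flags and "F" in resp_flags:
            state = "SF"      # assumed: opened and closed by both sides
        else:
            state = "OTHER"   # remaining states are not modelled here

        connections.append({
            "src_ip": first["src_ip"], "src_port": first["src_port"],
            "dst_ip": first["dst_ip"], "dst_port": first["dst_port"],
            "start": first["ts"],
            "duration": pkts[-1]["ts"] - first["ts"],
            "state": state,
        })
    return connections


# Hypothetical packets: one unanswered SYN and one SYN answered by RST.
demo_packets = [
    {"ts": 0.0, "src_ip": "10.0.0.5", "src_port": 1173,
     "dst_ip": "10.0.0.9", "dst_port": 21, "flags": "S"},
    {"ts": 1.0, "src_ip": "10.0.0.5", "src_port": 1174,
     "dst_ip": "10.0.0.9", "dst_port": 23, "flags": "S"},
    {"ts": 1.1, "src_ip": "10.0.0.9", "src_port": 23,
     "dst_ip": "10.0.0.5", "dst_port": 1174, "flags": "R"},
]
demo_connections = summarize_connections(demo_packets)
```

Sorting the two endpoints makes the key independent of direction, so requests and replies land in the same group.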
First round of processing
Intrinsic Features
Establish more information
– Count_per_dest: # of connections to this destination IP
– REJ_count_per_dest: # of connections that get the flag “REJ”
– S01_count_per_dest: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1).
– Diff_Services_per_dest: # of unique services
– Diff_Service_Rate: Diff_Services / Count
Same Destination Temporal and Statistical Attributes (last 2 seconds)
Establish more information
– Count_per_service: # of connections to this type of service
– REJ_count_per_service: # of connections that get the flag “REJ” (SYN met by RST)
– S01_count_per_service: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1).
– Diff_Hosts_per_service: # of unique destination hosts
– Diff_Hosts_Rate: Diff_Hosts / Count
Same Service Temporal and Statistical Attributes (last 2 seconds)
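Both sets of attributes (same destination and same service) share one shape: look back over the last 2 seconds, keep the connections with the same grouping value, and count. A minimal sketch under stated assumptions: each connection is a dict with `start` (seconds), `dst_ip`, `service`, and `state`; the current connection is counted in its own window, so the rate never divides by zero; the function and key names are hypothetical.

```python
def window_features(connections, group_key, diff_key, window=2.0):
    """For each connection, aggregate the connections of the last `window`
    seconds that share its `group_key` value (itself included)."""
    ordered = sorted(connections, key=lambda c: c["start"])
    rows = []
    left = 0  # index of the oldest connection still inside the window
    for i, cur in enumerate(ordered):
        while ordered[left]["start"] < cur["start"] - window:
            left += 1
        recent = [c for c in ordered[left:i + 1]
                  if c[group_key] == cur[group_key]]
        count = len(recent)
        distinct = len({c[diff_key] for c in recent})
        rows.append(dict(
            cur,
            count=count,
            rej_count=sum(c["state"] == "REJ" for c in recent),
            s01_count=sum(c["state"] in ("S0", "S1") for c in recent),
            diff=distinct,
            diff_rate=distinct / count,
        ))
    return rows


# Hypothetical connections, for illustration only.
demo = [
    {"start": 0.0, "dst_ip": "A", "service": "ftp", "state": "S0"},
    {"start": 0.5, "dst_ip": "A", "service": "telnet", "state": "REJ"},
    {"start": 1.0, "dst_ip": "A", "service": "ftp", "state": "SF"},
    {"start": 5.0, "dst_ip": "A", "service": "http", "state": "SF"},
]
per_dest = window_features(demo, "dst_ip", "service")     # same destination
per_service = window_features(demo, "service", "dst_ip")  # same service
```

Passing the window length as a parameter makes it straightforward to experiment with values other than 2 seconds.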
Second round of processing
Same Destination Temporal and Statistical Attributes
Final round of processing
Final, but important
– Reduce the amount of data
– Remove noise or trivial information
– Reorganize data, add new features if necessary
Challenges
– Hard to tell which data to reduce/remove
– Requires tremendous domain knowledge
– Needs experiments and adjustments
Data Mining
Decision Tree Algorithm
Microsoft SQL Server 2000 Analysis Server
Steps:
– Use 80% of the baseline (normal) dataset as training data
– Use the remaining 20% as validation data, compute
misclassification.
– Use 20% of each of the four intrusion datasets as
prediction data, compute misclassification.
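The split and the error measure in the steps above can be sketched in a few lines. The decision tree itself was built with Microsoft SQL Server 2000 Analysis Server and is not reproduced here; in this sketch `predict` is a hypothetical callable standing in for the trained model, the random shuffle before splitting is an assumption, and the target field name `state` is illustrative.

```python
import random


def split_train_validation(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it, by default 80/20."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def misclassification_rate(predict, records, target="state"):
    """Fraction of records whose predicted target differs from the actual one."""
    if not records:
        raise ValueError("no records to evaluate")
    wrong = sum(1 for r in records if predict(r) != r[target])
    return wrong / len(records)


# Hypothetical data and a trivial stand-in model that always predicts "SF".
demo_records = [{"state": "SF"}] * 8 + [{"state": "S0"}] * 2
train, validation = split_train_validation(demo_records)
demo_rate = misclassification_rate(lambda r: "SF", demo_records)
```

The same `misclassification_rate` call applies unchanged to the validation data and to each intrusion sample, which keeps the two figures comparable.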
Dependency Network
Decision Tree
Apply Data Mining Model to Validate/Predict
Results
% misclassification (by final state)
– Normal: 149/1510 = 9.86%
– Intrusion1: 443/2324 = 19.06%
– Intrusion2: 376/1968 = 19.10%
– Intrusion3: 386/2011 = 19.19%
– Intrusion4: 437/2298 = 19.01%
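The percentages above follow directly from the counts. A quick check with integer arithmetic suggests that they are truncated, not rounded, to two decimals (149/1510 is about 9.8675%, for example). The function name `truncated_percent` is illustrative only.

```python
def truncated_percent(wrong, total):
    """Percentage with two decimals, truncated instead of rounded."""
    hundredths = wrong * 10000 // total  # percent times 100, as an integer
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


# Counts as reported on the results slide.
results = {
    "Normal": (149, 1510),
    "Intrusion1": (443, 2324),
    "Intrusion2": (376, 1968),
    "Intrusion3": (386, 2011),
    "Intrusion4": (437, 2298),
}
for name, (wrong, total) in results.items():
    print(name, truncated_percent(wrong, total))
```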
Conclusion and future improvement
Accuracy
– Preliminary experiments using DM on the tcpdump
data showed promising results
– Accuracy depends on sufficient training data and the right feature set.
Performance
– 6 hours on one dataset (628775 records)
Size of time window
– 2 seconds or larger?
Automated process
– Call MSSQL DM and DTS procedures within VB
– Real-time monitor and alarm
References
Intrusion Detection, Rebecca Gurley Bace, Macmillan Technical
Publishing, 2000
Data Mining: Concepts and Techniques, Jiawei Han and Micheline
Kamber, Morgan Kaufmann Publishers, 2001
Data Mining with Microsoft SQL Server 2000, Claude Seidman,
Microsoft Press, 2001
http://www.cs.columbia.edu/~sal/hpapers/USENIX/usenix.html
http://iris.cs.uml.edu:8080/network.html
http://www-nrg.ee.lbl.gov/. Network Research Group (NRG) of the
Information and Computing Sciences Division (ICSD) at Lawrence
Berkeley National Laboratory (LBNL) in Berkeley, California.
Thank You!