CS 590M: Security Issues in Data Mining

Download Report

Transcript CS 590M: Security Issues in Data Mining

KDD Cup ’99: Classifier Learning
Predictive Model for Intrusion Detection
Charles Elkan
1999 Conference on Knowledge
Discovery and Data Mining
Presented by Chris Clifton
KDD Cup Overview
• Held Annually in conjunction with Knowledge
Discovery and Data Mining Conference (now
ACM-sponsored)
• Challenge problem(s) released well before
conference
– Goal is to give best solution to problem
– Relatively informal “contest”
– Gives “standard” test for comparing techniques
• Winner announced at KDD conference
– Lots of recognition to winner
Classifier Learning for
Intrusion Detection
• One of two KDD’99 challenge problems
– Other was a knowledge discovery problem
• Goal is to learn a classifier to define TCP/IP
connections as intrusion/okay
– Data: Collection of features describing TCP
connection
– Class: Non-attack or type of attack
• Scoring: Cost per Test Sample
– Wrong answers penalized based on type of “wrong”
Data: TCP “connection”
information
• Dataset developed for 1998 DARPA Intrusion Detection
Evaluation Program
–
–
–
–
Nine weeks of raw TCP dump data from simulated USAF LAN
Simulated attacks to give positive examples
Processed into 5 million training “connections”, 2 million test
Some “attributes” derived from raw data
• Twenty-four attack types in training data, four classes:
– DOS: denial-of-service, e.g. syn flood;
– R2L: unauthorized access from a remote machine, e.g. guessing
password;
– U2R: unauthorized access to local superuser (root) privileges, e.g.,
various ``buffer overflow'' attacks;
– probing: surveillance and other probing, e.g., port scanning.
• Test set includes fourteen attack types not found in training set
Basic features of individual TCP
connections
feature name
description
type
duration
length (number of seconds) of the connection
continuous
protocol_type
type of the protocol, e.g. tcp, udp, etc.
discrete
service
network service on the destination, e.g., http,
telnet, etc.
discrete
src_bytes
number of data bytes from source to destination
continuous
dst_bytes
number of data bytes from destination to source
continuous
flag
normal or error status of the connection
discrete
land
1 if connection is from/to the same host/port; 0
otherwise
discrete
wrong_fragment
number of ``wrong'' fragments
continuous
urgent
number of urgent packets
continuous
Content features within a connection
suggested by domain knowledge
feature name
description
type
hot
number of ``hot'' indicators
continuous
num_failed_logins
number of failed login attempts
continuous
logged_in
1 if successfully logged in; 0 otherwise
discrete
num_compromised
number of ``compromised'' conditions
continuous
root_shell
1 if root shell is obtained; 0 otherwise
discrete
su_attempted
1 if ``su root'' command attempted; 0 otherwise
discrete
num_root
number of ``root'' accesses
continuous
num_file_creations
number of file creation operations
continuous
num_shells
number of shell prompts
continuous
num_access_files
number of operations on access control files
continuous
num_outbound_cmds
number of outbound commands in an ftp session
continuous
is_hot_login
1 if the login belongs to the ``hot'' list; 0 otherwise
discrete
is_guest_login
1 if the login is a ``guest''login; 0 otherwise
discrete
Traffic features computed using a
two-second time window
feature name
description
type
count
number of connections to the same host as the current
connection in the past two seconds
continuous
Note: The following features refer to these same-host
connections.
serror_rate
% of connections that have ``SYN'' errors
continuous
rerror_rate
% of connections that have ``REJ'' errors
continuous
same_srv_rate
% of connections to the same service
continuous
diff_srv_rate
% of connections to different services
continuous
srv_count
number of connections to the same service as the current
connection in the past two seconds
continuous
Note: The following features refer to these same-service
connections.
srv_serror_rate
% of connections that have ``SYN'' errors
continuous
srv_rerror_rate
% of connections that have ``REJ'' errors
continuous
% of connections to different host
continuous
srv_diff_host_rate
Scoring
• Each prediction gets a score:
– Row is correct answer
– Column is prediction made
• Score is average over all predictions
normal
probe
DOS
U2R
R2L
normal
0
1
2
2
2
probe
1
0
2
2
2
DOS
2
1
0
2
2
U2R
3
2
2
0
2
R2L
4
2
2
2
0
Results
• Twenty-four entries, scores:
0.2331 0.2356 0.2367 0.2411 0.2414
0.2443 0.2474 0.2479 0.2523 0.2530
0.2531 0.2545 0.2552 0.2575 0.2588
0.2644 0.2684 0.2952 0.3344 0.3767
0.3854 0.3899 0.5053 0.9414
• 1-Nearest Neighbor scored 0.2523
Winning Method:
Bagged Boosting
• Submitted by Bernhard Pfahringer, ML Group, Austrian Research
Institute for AI
• 50 samples from the original 5 million odd examples set
–
–
–
–
Contrary to standard bagging the sampling was slightly biased:
all of the examples of the two smallest classes U2R and R2L
4000 PROBE, 80000 NORMAL, and 400000 DOS examples
duplicate entries in the original data set removed
• Ten C5 decision trees induced from each sample
– used both C5's error-cost and boosting options.
• Final predictions computed from 50 single predictions of each
training sample by minimizing “conditional risk”
– minimizes sum of error-costs times class-probabilities
• Took approximately 1 day of 200MHz 2 processor Sparc to train
Confusion Matrix
(Breakdown of score)
Winning Entry:
Predicted
actual
0
1
2
3
4
correct
0
1
2
3
4
60262
511
5299
168
14527
74.6%
243
3471
1328
20
294
64.8%
78
184
223226
0
0
99.9%
4
0
0
30
8
71.4%
6
0
0
10
1360
98.8%
Predicted
actual
0
1
2
3
4
0
1
2
3
4
%correct
60322
697
6144
209
15785
72.5%
212
3125
76
5
308
83.9%
57
342
223633
1
1
99.8%
1
0
0
8
0
88.9%
1
2
0
5
95
92.2%
%correct
99.5%
83.3%
97.1%
13.2%
8.4%
For 1-NN:
%correct
99.6%
75.0%
97.3%
3.5%
0.6%
Analysis of winning entry
• Result comparable to 1-NN except on
“rare” classes
– Training sample of winner biased to rare
classes
– Does this give us a general principle?
• Misses badly for some attack categories
– True for 1-NN as well
– Problem with feature set?
Second and Third places
(Probably not statistically significant)
• Itzhak Levin, LLSoft, Inc.: Kernel Miner
– Link broken?
• Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin,
MP13, Moscow, Russia
• Verbal rules constructed by an expert
• First echelon of voting decision trees
• Second echelon of voting decision trees
– Steps sequentially
– Branch to the next step occurs whenever the current one has
failed to recognize the connection
– Trees constructed using their own (previously developed) tree
learning algorithm