
Learning Rules and Clusters for
Network Anomaly Detection
Philip Chan, Matt Mahoney, Muhammad Arshad
Florida Institute of Technology
Outline
• Related work in anomaly detection
• Rule learning algorithm: LERAD
• Cluster learning algorithm: CLAD
• Summary and ongoing work
Related Work in Anomaly Detection
• Host-based
– STIDE (Forrest et al., 96): system calls, instance-based
– (Lane & Brodley, 99): user commands, instance-based
– ADMIT (Sequeira & Zaki, 02): user commands,
clustering
• Network-based
– NIDES (SRI, 95): addresses and ports, probabilistic
– SPADE (Silicon Defense, 01): addresses and ports,
probabilistic
– ADAM (Barbara et al., 01): hybrid anomaly-misuse
detection
LERAD: Learning Rules for
Anomaly Detection
(ICDM 03)
Probabilistic Models
• Anomaly detection:
– P(x | D_NoAttacks)
– Given training data with no attacks, estimate the
probability of seeing event x
– Easier if event x was observed during training
• actually, since x is normal, we aren’t interested in its likelihood
– Harder if event x was not observed (zero frequency
problem)
• we are interested in the likelihood of anomalies
Estimating Probability with Zero
Frequency
• r = number of unique values in an attribute in the
training data
• n = number of instances with the attribute in the
training data
• Likelihood of observing a novel value in an
attribute is estimated by:
p = r / n
(Witten and Bell, 1991)
Anomaly Score
• Likelihood of novel event = p
• During detection, if a novel event (unobserved
during training) actually occurs:
– anomaly score = 1/p [surprise factor]
Example
• Training Sequence1 = a, b, c, d, e, b, f, g, c, h
– P(NovelLetter) = 8/10
– Z is observed during detection, anomaly score = 10/8
• Training Sequence2 = a, a, b, b, b, a, b, b, a, a
– P(NovelLetter) = 2/10
– Z is observed during detection, anomaly score = 10/2
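A minimal sketch of this estimator in Python, applied to the two training sequences above (the function name is ours, not from the paper):

    def novel_value_probability(values):
        # Witten-Bell style estimate: p = r / n, where r is the number
        # of unique values and n is the number of training instances.
        return len(set(values)) / len(values)

    seq1 = list("abcdebfgch")            # Training Sequence1
    p1 = novel_value_probability(seq1)   # 0.8
    print(1 / p1)                        # surprise factor 1.25 (= 10/8)

    seq2 = list("aabbbabbaa")            # Training Sequence2
    p2 = novel_value_probability(seq2)   # 0.2
    print(1 / p2)                        # surprise factor 5.0 (= 10/2)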
Nonstationary Model
• More likely to see a novel value if novel values
were seen recently (e.g., during an attack)
• During detection, record when the last novel value
was observed
• t_i = number of seconds since the last novel value in attribute A_i
• Anomaly score for A_i: Score_i = t_i / p_i
• Anomaly score for an instance = Σ_i Score_i
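A hedged sketch of the nonstationary scorer in Python (class and method names are ours; the p_i values come from training as above):

    import time

    class NonstationaryScorer:
        # Tracks, per attribute A_i, when the last novel value was seen,
        # plus the training-time novel-value probability p_i = r_i / n_i.
        def __init__(self, p_per_attribute):
            self.p = dict(p_per_attribute)   # attribute -> p_i
            self.last_novel = {}             # attribute -> timestamp

        def attribute_score(self, attribute, now):
            # Score_i = t_i / p_i; t_i is 0 for the first novel value.
            t = now - self.last_novel.get(attribute, now)
            self.last_novel[attribute] = now
            return t / self.p[attribute]

        def instance_score(self, novel_attributes, now=None):
            # Anomaly score for an instance = sum of Score_i over the
            # attributes whose value was never seen in training.
            now = time.time() if now is None else now
            return sum(self.attribute_score(a, now) for a in novel_attributes)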
LEarning Rules for Anomaly
Detection (LERAD)
• PHAD uses prior probabilities: P(Z)
• ALAD uses conditional probabilities: P(Z|A)
• More accurate to learn probabilities that are
conditioned on multiple attributes: P(Z|A,B,C…)
• Combinatorial explosion
• Fast algorithm based on sampling
Rules in LERAD
• A = a, B = b, C = c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied, the Z attribute has
one of the values z1, z2, z3…
• Unlike association rules, our rules allow a set of
values in the consequent
• Unlike classification rules, our rules don’t require
a fixed attribute as the consequent
Semantics of a Rule
• A = a, B = b, C = c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied but none of the values
in the Z attribute is matched, the anomaly score is
n/r (similar to PHAD/ALAD)
• r = size of Z (# of unique values in Z)
• n = # of tuples that satisfy the antecedent and have
the Z attribute (support)
• P(Z ∉ {z1, z2, z3, …} | A = a, B = b, C = c) ≈ r / n
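One possible Python representation of such a rule and its n/r semantics (a sketch under our own naming, not the authors' code):

    from dataclasses import dataclass, field

    @dataclass
    class Rule:
        antecedent: dict                            # e.g. {"A": "a", "B": "b"}
        consequent_attr: str                        # e.g. "Z"
        allowed: set = field(default_factory=set)   # {z1, z2, z3, ...}
        n: int = 0                                  # support for the antecedent

        def matches(self, tup):
            return all(tup.get(a) == v for a, v in self.antecedent.items())

        def anomaly_score(self):
            # n/r, charged when the antecedent holds but the consequent
            # value falls outside the allowed set (r = unique values).
            r = len(self.allowed)
            return self.n / r if r else 0.0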
Overview of the Algorithm
• Randomly select pairs of tuples (packets,
connections, …) from a sample of the training
data
• Create candidate rules based on each pair
• Estimate the score of each candidate rule based on
a sample of the training data
• Prune the candidate rules
• Update the consequent and calculate the score for
each rule using the entire training set
Creating Candidate Rules
• Find the matching attributes; for example, given
this randomly selected pair of tuples:
• <A=1,B=2,C=3,D=4> and <A=1,B=2,C=3,D=6>
• Attributes A, B, and C match
• Create these rules:
• A = 1, B = 2 ⇒ C = ?
• B = 2, C = 3 ⇒ A = ?
• A = 1, C = 3 ⇒ B = ?
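A minimal sketch of this step, returning (antecedent, consequent) pairs (the full algorithm also varies antecedent sizes; this sketch simply uses all matching attributes but one):

    def candidate_rules(t1, t2):
        # Matching attributes are the raw material: each one in turn
        # becomes the consequent, the rest form the antecedent.
        matching = [a for a in t1 if a in t2 and t1[a] == t2[a]]
        return [({a: t1[a] for a in matching if a != c}, c) for c in matching]

    t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
    t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
    for antecedent, consequent in candidate_rules(t1, t2):
        print(antecedent, "=>", consequent, "= ?")
    # {'B': 2, 'C': 3} => A = ?
    # {'A': 1, 'C': 3} => B = ?
    # {'A': 1, 'B': 2} => C = ?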
Estimating Rule Scores
• Randomly pick a sample from the training set to
estimate the score (n/r) for each rule
• The consequent of each rule is now estimated
• A = 1, B = 2 ⇒ C ∈ {2, 3, 4}           n/r = 100/3
• B = 2, C = 3 ⇒ A ∈ {1, 5}              n/r = 10/2
• A = 1, C = 3 ⇒ B ∈ {1, 2, 3, …, 100}   n/r = 200/100
• The larger the score (n/r), the higher the
confidence that the rule captures normal behavior
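A sketch of scoring one candidate rule on a small training sample (flattening the Rule sketch above into a function):

    def estimate_score(antecedent, consequent, sample):
        # Collect the consequent's observed values over sample tuples that
        # satisfy the antecedent; return (allowed_set, n/r).
        allowed, n = set(), 0
        for tup in sample:
            if all(tup.get(a) == v for a, v in antecedent.items()):
                n += 1
                allowed.add(tup[consequent])
        r = len(allowed)
        return allowed, (n / r if r else 0.0)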
Pruning Candidate Rules
• To reduce the time spent learning from the entire training set and during detection
• High-scoring rules: more confidence in the top rules
• Redundancy check: some rules are not necessary
• Coverage check: minimum set of rules that
describe the data
Redundancy Check
• Rule 1: A = 1, B = 2 ⇒ C ∈ {3, 4}
• Rule 2: A = 1 ⇒ C ∈ {3, 4}
• Rule 2 is more general than Rule 1, which is redundant and can be removed
• Rule 3: B = 2 ⇒ C ∈ {3, 4, 5, 6}
• Rule 2 and Rule 3 don't overlap
• Rule 4: * ⇒ C ∈ {3, 4, 5, 6}
• Rule 4 is more general than Rule 3, so Rule 3 is removed
Coverage Check
• A rule can cover multiple tuples, but a tuple can only be covered by one rule (the highest-scoring rule)
• Rules are checked in descending order of scores
• For each rule in the candidate rule set
– mark tuples that are covered by the rule
• Rules that don’t cover any tuples are removed
• Our coverage check includes the redundancy
check
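A sketch of the coverage check, with rules represented as dicts carrying an "antecedent" and a precomputed "score" (n/r); the representation is ours:

    def coverage_check(rules, sample):
        # Visit rules in descending n/r order; keep a rule only if it covers
        # at least one tuple not already claimed by a higher-scoring rule.
        def matches(rule, tup):
            return all(tup.get(a) == v for a, v in rule["antecedent"].items())

        kept, covered = [], set()
        for rule in sorted(rules, key=lambda r: r["score"], reverse=True):
            newly = {i for i, tup in enumerate(sample)
                     if i not in covered and matches(rule, tup)}
            if newly:                     # rules covering nothing are dropped
                kept.append(rule)
                covered.update(newly)
        return kept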
Final Training
• The selected rules are trained on the entire training
set: consequent and score are updated
• A = 1, B = 2 ⇒ C ∈ {2, 3, 4, 5, 6}   n/r = 100000/5
• B = 2, C = 3 ⇒ A ∈ {1, 5}            n/r = 4000/2
• 90% of the data is used for training the rules
• 10% is used for validating the rules: rules that cause false alarms are removed (this is conservative; the remaining rules are highly predictive)
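A sketch of the validation step on the held-out 10% (same dict representation as above, with "consequent" and "allowed" fields):

    def validate(rules, holdout):
        # Drop any rule violated on attack-free validation data, since it
        # would only raise false alarms; the survivors are highly predictive.
        def violated(rule, tup):
            return (all(tup.get(a) == v for a, v in rule["antecedent"].items())
                    and tup.get(rule["consequent"]) not in rule["allowed"])
        return [r for r in rules if not any(violated(r, t) for t in holdout)]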
Scoring during Detection
• Score for a matched rule that is violated:
S = t * n/r
where t is the duration since the last time the rule was violated (i.e., since the last anomaly with respect to the rule)
• Anomaly score for the tuple = Σ_i S_i
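Putting detection-time scoring together, a sketch (rules as mutable dicts that remember their last violation time; field names are ours):

    def tuple_score(rules, tup, now):
        # Sum S = t * n/r over matched rules whose consequent value is
        # outside the allowed set.
        total = 0.0
        for rule in rules:
            matched = all(tup.get(a) == v
                          for a, v in rule["antecedent"].items())
            if matched and tup.get(rule["consequent"]) not in rule["allowed"]:
                t = now - rule.get("last_violation", now)  # 0 on first violation
                rule["last_violation"] = now
                total += t * rule["score"]                 # score holds n/r
        return total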
Attributes Used in LERAD-tcp
• TCP connections are reassembled (similar to
ALAD)
• Last 2 bytes of the destination IP address
• 4 bytes of the source IP address
• Source and destination ports
• Duration (from the first packet to the last)
• Length, TCP flags
• First 8 strings in the payload (delimited by space/newline)
Attributes used in LERAD-all
• Attributes used in LERAD-tcp
• UDP and ICMP header fields
Experimental Data and
Parameters
• DARPA 99 data set
• Training: Week 3; Testing: Weeks 4 & 5
• Training: 35K tuples (LERAD-tcp); 69K tuples
(LERAD-all)
• Testing: 178K tuples (LERAD-tcp); 941K tuples
(LERAD-all)
• 1,000 pairs of tuples were sampled to form
candidate rules (more didn’t help much)
• 100 tuples were sampled to estimate scores for
candidate rules (more didn’t help much)
Experimental Results
• Average of 5 runs
• 10 false alarms per day
• 201 attacks; 74 "hard-to-detect" attacks (Lippmann, 2000)
• LERAD-tcp: ~117 detections (58%); ~45 "hard-to-detect" (60%)
• LERAD-all: ~112 detections (56%); ~41 "hard-to-detect" (55%)
LERAD-all vs. LERAD-tcp

Detections (10 FA/Day):

         LERAD-all only   Both   LERAD-tcp only
PROBE          7           23          0
DoS           10           30          8
R2L            0           30          5
U2R            0           10         10
Total         17           93         23
Experimental Time Statistics
• Preprocessing: ~7.5 minutes (2.9GB, training set),
~20 minutes (4GB, test set)
• LERAD-tcp: ~6 seconds (4MB, training), ~17
seconds (17MB, testing)
• LERAD-all: ~12 seconds (8MB, training), ~95
seconds (91MB, testing)
• 50-75 final rules were learned
Results from Mixed Data
(RAID 03)
• DARPA 99 data set: attacks are real, background
is simulated
• Compared the simulated background with real collected traffic
• Simulation artifacts: smaller range of values, little "crud", values stop growing
• Modified LERAD: 87 detections, 49 (56%)
legitimate
• Mixed data: 30 detections, 25 (83%) legitimate
CLAD: Clustering for Anomaly
Detection
(In Data Mining against Cyber Threats,
Kumar et al., 03)
Finding Outliers
• Cluster the data points
• Outliers: points in distant and sparse clusters
• Inter-cluster distance: average distance
from the rest of the clusters
• Density: number of data points in a fixed-volume cluster
CLAD
• Simple, efficient clustering algorithm (handles large amounts of data)
• Clusters with fixed radius
• If a point is within the radius of an existing cluster
– Add the point to the cluster
• Else
– The point becomes the centroid of a new cluster
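A sketch of this fixed-radius scheme (the distance function is a parameter; in CLAD it would be the frequency-based distance discussed on the next slide):

    def clad_cluster(points, radius, distance):
        # Single pass: join the first cluster whose centroid is within
        # `radius`, otherwise the point becomes a new centroid.
        centroids, members = [], []
        for p in points:
            for i, c in enumerate(centroids):
                if distance(p, c) <= radius:
                    members[i].append(p)
                    break
            else:
                centroids.append(p)
                members.append([p])
        return centroids, members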
CLAD Issues
• Distance for discrete attributes
– Values that are more frequent are likely to be more normal and are considered "closer"
– Difference in frequency of discrete values
• Power-law distributions: logarithm
• Radius of clusters
– Select a small random sample
– Calculate the distance of all pairs
– Average of the smallest 1%
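The radius heuristic in the last bullet translates directly into code; a sketch (the sample size and 1% fraction are parameters):

    import itertools
    import random

    def estimate_radius(points, distance, sample_size=100, fraction=0.01):
        # Average of the smallest `fraction` of pairwise distances over a
        # small random sample of the data.
        sample = random.sample(points, min(sample_size, len(points)))
        dists = sorted(distance(a, b)
                       for a, b in itertools.combinations(sample, 2))
        k = max(1, int(len(dists) * fraction))
        return sum(dists[:k]) / k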
Sparse and Dense Regions
• Outliers are in distant and sparse regions
• However, an attack might generate many connections, making its neighborhood dense rather than sparse
• (distant and sparse) or (distant and dense)
– Distant: distance > avg(distance) + sd(distance)
– Sparse: density < avg(density) – sd(density)
– Dense: density > avg(density) + sd(density)
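A sketch of these tests, assuming each cluster carries its inter-cluster distance and density (field names are ours):

    import statistics

    def flag_outliers(clusters):
        # Flag clusters that are (distant and sparse) or (distant and dense),
        # with mean +/- one standard deviation as the thresholds.
        dist = [c["inter_distance"] for c in clusters]
        dens = [c["density"] for c in clusters]
        distant = statistics.mean(dist) + statistics.stdev(dist)
        sparse = statistics.mean(dens) - statistics.stdev(dens)
        dense = statistics.mean(dens) + statistics.stdev(dens)
        return [c for c in clusters
                if c["inter_distance"] > distant
                and (c["density"] < sparse or c["density"] > dense)]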
Experiments
• Weeks 1, 2, 4, and 5
• No explicit training-testing, looking for
outliers
• A model for each port
• Ports with less than 1% of the traffic are lumped into an "Others" model
• Anomaly scores are normalized in standard deviations (SDs); the "Combined" model simply merges the scores from the different models
HTTP (Port 80), CD > 0.8 [scatter plot: Ln(Inter-Distance) vs. Ln(Count)]
HTTP (Port 80), CD < 0.2 [scatter plot: Ln(Inter-Distance) vs. Ln(Count)]
SMTP (Port 25), CD > 0.8 [scatter plot: Ln(Inter-Distance) vs. Ln(Count)]
SMTP (Port 25), CD < 0.2 [scatter plot: Ln(Inter-Distance) vs. Ln(Count)]
Results

Attack Type   Attacks   Detections (10 FA/Day)
Probe            28     19 (70%)
DoS              45     25 (55%)
R2L              41     15 (37%)
U2R/Data         37     14 (38%)
Total           151     74 (49%)
LERAD vs. CLAD

LERAD                                  CLAD
Assumes all training data are normal   Training data can have unlabeled attacks
Off-line algorithm                     On-line algorithm
Concise and comprehensible models      Harder to explain alerts
Efficient detection                    Compares a large # of centroids
Ongoing Work
• On-line, noise-tolerant LERAD
• Applying LERAD to system calls, including
arguments
• Tokenizing payload to create features
Data Mining for Computer Security
Workshop at ICDM03
Melbourne, FL
Nov 19, 2003
www.cs.fit.edu/~pkc/dmsec03/
http://www.cs.fit.edu/~pkc/id/
Thank you
Questions?