Machine Learning for Network Anomaly Detection

An Analysis of the 1999
DARPA/Lincoln Laboratory
Evaluation Data for Network
Anomaly Detection
Matt Mahoney
[email protected]
Feb. 18, 2003
Is the DARPA/Lincoln Labs IDS
Evaluation Realistic?
• The most widely used intrusion detection evaluation data
set.
• 1998 data used in KDD cup competition with 25
participants.
• 8 participating organizations submitted 18 systems to the
1999 evaluation.
• Tests host or network based IDS.
• Tests signature or anomaly detection.
• 58 types of attacks (more than any other evaluation).
• 4 target operating systems.
• Training and test data released after evaluation to
encourage IDS development.
Problems with the LL Evaluation
• Background network data is synthetic.
• SAD (Simple Anomaly Detector) detects
too many attacks.
• Comparison with real traffic – range of
attribute values is too small and static
(TTL, TCP options, client addresses…).
• Injecting real traffic removes suspect
detections from PHAD, ALAD, LERAD,
NETAD, and SPADE.
1. Simple Anomaly Detector (SAD)
• Examines only inbound client TCP SYN packets.
• Examines only one byte of the packet.
• Trains on attack-free data (week 1 or 3).
• A value never seen in training is an anomaly.
• If there have been no anomalies for 60 seconds, then output an alarm with score 1.
[Diagram: Train: 001110111; Test: 010 03001 23011, with 60 sec. intervals marked. The values 2 and 3 never appear in training, so each is an anomaly.]
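A minimal sketch of SAD in Python, assuming packets are supplied as (timestamp, bytes) pairs and that the monitored byte offset is chosen in advance; the names are illustrative, not the original code.

def train_sad(train_packets, offset):
    # Record every value of the monitored byte seen in attack-free training data.
    return {pkt[offset] for pkt in train_packets}

def run_sad(allowed, test_packets, offset, quiet=60.0):
    # Flag values never seen in training; output an alarm (score 1) only if
    # there were no anomalies in the preceding 60 seconds.
    last_anomaly = float("-inf")
    for time, pkt in test_packets:          # (timestamp in seconds, raw bytes)
        if pkt[offset] not in allowed:
            if time - last_anomaly >= quiet:
                yield (time, 1)
            last_anomaly = time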
DARPA/Lincoln Labs Evaluation
• Weeks 1 and 3: attack free training data.
• Week 2: training data with 43 labeled attacks.
• Weeks 4 and 5: 201 test attacks.
[Test bed diagram: attacks arrive from the Internet through a router; an inside sniffer monitors the four victim hosts running SunOS, Solaris, Linux, and NT.]
SAD Evaluation
• Develop on weeks 1-2 (available in
advance of 1999 evaluation) to find good
bytes.
• Train on week 3 (no attacks).
• Test on weeks 4-5 inside sniffer (177
visible attacks).
• Count detections and false alarms using
1999 evaluation criteria.
SAD Results
• Variants (bytes) that do well: source IP
address (any of 4 bytes), TTL, TCP
options, IP packet size, TCP header size,
TCP window size, source and destination
ports.
• Variants that do well on weeks 1-2
(available in advance) usually do well on
weeks 3-5 (evaluation).
• Very low false alarm rates.
• Most detections are not credible.
SAD vs. 1999 Evaluation
• The top system in the 1999 evaluation,
Expert 1, detects 85 of 169 visible attacks
(50%) at 100 false alarms (10 per day)
using a combination of host and network
based signature and anomaly detection.
• SAD detects 79 of 177 visible attacks
(45%) with 43 false alarms using the third
byte of the source IP address.
1999 IDS Evaluation vs. SAD
[Bar chart: recall (%) and precision for the 1999 systems Expert 1, Expert 2, Dmine, and Forensics versus the SAD variants TTL, TCP header size, and source IP byte 3, on a 0–100 scale.]
SAD Detections by Source Address
(that should have been missed)
• DOS on public services: apache2, back,
crashiis, ls_domain, neptune, warezclient,
warezmaster
• R2L on public services: guessftp, ncftp,
netbus, netcat, phf, ppmacro, sendmail
• U2R: anypw, eject, ffbconfig, perl, sechole,
sqlattack, xterm, yaga
2. Comparison with Real Traffic
• Anomaly detection systems flag rare
events (e.g. previously unseen addresses
or ports).
• “Allowed” values are learned during
training on attack-free traffic.
• Novel values in background traffic would
cause false alarms.
• Are novel values more common in real
traffic?
Measuring the Rate of Novel Values
• r = Number of values observed in training.
• r1 = Fraction of values seen exactly once (Good-Turing probability estimate that the next value will be novel).
• rh = Fraction of values seen only in the second half of training.
• rt = Fraction of training time to observe half of all values.
Larger values in real data would suggest a higher false alarm rate.
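A sketch of how these statistics could be computed from a chronological list of (timestamp, value) pairs taken from training traffic; the function name and the handling of edge cases are my own assumptions.

from collections import Counter

def novelty_stats(observations):
    values = [v for _, v in observations]
    counts = Counter(values)
    r = len(counts)                                         # values observed in training
    r1 = sum(1 for c in counts.values() if c == 1) / r      # seen exactly once (Good-Turing)
    first_half = set(values[: len(values) // 2])
    rh = sum(1 for v in counts if v not in first_half) / r  # seen only in second half
    t0, t_end = observations[0][0], observations[-1][0]
    seen, rt = set(), 1.0
    for t, v in observations:                               # time to observe half of all values
        seen.add(v)
        if 2 * len(seen) >= r:
            rt = (t - t0) / (t_end - t0)
            break
    return r, r1, rh, rt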
Network Data for Comparison
• Simulated data: inside sniffer traffic from
weeks 1 and 3, filtered from 32M packets
to 0.6M packets.
• Real data: collected from www.cs.fit.edu
Oct-Dec. 2002, filtered from 100M to 1.6M.
• Traffic is filtered and rate limited to extract
start of inbound client sessions (NETAD
filter, passes most attacks).
Attributes measured
• Packet header fields (all filtered packets)
for Ethernet, IP, TCP, UDP, ICMP.
• Inbound TCP SYN packet header fields.
• HTTP, SMTP, and SSH requests (other
application protocols are not present in
both sets).
Comparison results
• Synthetic attributes are too predictable:
TTL, TOS, TCP options, TCP window size,
HTTP, SMTP command formatting.
• Too few sources: Client addresses, HTTP
user agents, ssh versions.
• Too “clean”: no checksum errors,
fragmentation, garbage data in reserved
fields, malformed commands.
TCP SYN Source Address
             Packets, n        r       r1      rh      rt
Simulated         50650       29        0      3%    0.1%
Real             210297    24924      45%     53%     49%
r1 ≈ rh ≈ rt ≈ 50% is consistent with a Zipf distribution and a constant growth rate of r.
Real Traffic is Less Predictable
[Graph: r (number of values) versus time – r keeps growing for the real traffic but levels off for the synthetic traffic.]
3. Injecting Real Traffic
• Mix equal durations of real traffic into weeks 3-5
(both sets filtered, 344 hours each).
• We expect r ≥ max(rSIM, rREAL) (realistic false
alarm rate).
• Modify PHAD, ALAD, LERAD, NETAD, and
SPADE not to separate data.
• Test at 100 false alarms (10 per day) on 3 mixed
sets.
• Compare fraction of “legitimate” detections on
simulated and mixed traffic for median mixed
result.
PHAD
• Models 34 packet header fields –
Ethernet, IP, TCP, UDP, ICMP
• Global model (no rule antecedents)
• Only novel values are anomalous
• Anomaly score = tn/r where
– t = time since last anomaly
– n = number of training packets
– r = number of allowed values
• No modifications needed
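A sketch of the tn/r score for one header field, with training and testing as separate phases; the class and attribute names are assumptions, not PHAD's actual code.

class FieldModel:
    def __init__(self):
        self.allowed = set()        # r = len(self.allowed): values seen in training
        self.n = 0                  # training packets observed
        self.last_anomaly = 0.0     # time of the previous anomaly for this field

    def train(self, value):
        self.n += 1
        self.allowed.add(value)

    def score(self, value, time):
        # Only novel values are anomalous; score = t * n / r.
        if value in self.allowed:
            return 0.0
        t = time - self.last_anomaly
        self.last_anomaly = time
        return t * self.n / len(self.allowed)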
ALAD
• Models inbound TCP client requests –
addresses, ports, flags, application
keywords.
• Score = tn/r
• Conditioned on destination port/address.
• Modified to remove address conditions
and protocols not present in real traffic
(telnet, FTP).
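A sketch of ALAD-style conditioning: a separate allowed-value set, count, and anomaly clock per destination port, scored with the same tn/r formula; the record layout and names are assumptions.

from collections import defaultdict

class ConditionedModel:
    def __init__(self):
        self.state = defaultdict(lambda: {"allowed": set(), "n": 0, "last": 0.0})

    def train(self, dest_port, value):
        s = self.state[dest_port]
        s["n"] += 1
        s["allowed"].add(value)

    def score(self, dest_port, value, time):
        # t * n / r, computed within the model conditioned on this port.
        s = self.state[dest_port]
        if not s["allowed"] or value in s["allowed"]:
            return 0.0
        t = time - s["last"]
        s["last"] = time
        return t * s["n"] / len(s["allowed"])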
LERAD
• Models inbound client TCP (addresses,
ports, flags, 8 words in payload).
• Learns conditional rules with high n/r.
• Discards rules that generate false alarms
in last 10% of training data.
• Modified to weight rules by fraction of real
traffic.
If port = 80 then word1 = GET, POST (n/r = 10000/2)
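One way a LERAD-style rule could be represented, following the example rule above; this sketch leaves out rule generation and the step that discards rules causing false alarms in the last 10% of training.

class Rule:
    def __init__(self, antecedent, attribute):
        self.antecedent = antecedent        # e.g. {"port": 80}
        self.attribute = attribute          # e.g. "word1"
        self.allowed = set()                # e.g. {"GET", "POST"}
        self.n = 0                          # matching training records

    def matches(self, record):
        return all(record.get(k) == v for k, v in self.antecedent.items())

    def train(self, record):
        if self.matches(record):
            self.n += 1
            self.allowed.add(record[self.attribute])

    def n_over_r(self):
        # Rule quality: rules with high n/r are preferred.
        return self.n / len(self.allowed) if self.allowed else 0.0

    def violated_by(self, record):
        # A test record that matches the antecedent but takes a novel value.
        return self.matches(record) and record[self.attribute] not in self.allowed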
NETAD
• Models inbound client request packet
bytes – IP, TCP, TCP SYN, HTTP, SMTP,
FTP, telnet.
• Score = tn/r + ti/fi allowing previously seen
values.
– ti = time since value i last seen
– fi = frequency of i in training.
• Modified to remove telnet and FTP.
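One plausible reading of the NETAD score in Python: novel values contribute the tn/r term, previously seen values the ti/fi term; the class is a sketch under these assumptions, not NETAD's actual implementation.

class NetadField:
    def __init__(self):
        self.freq = {}              # fi: training frequency of each value i
        self.last_seen = {}         # last time value i was seen
        self.n = 0                  # training packets
        self.last_anomaly = 0.0

    def train(self, value, time):
        self.n += 1
        self.freq[value] = self.freq.get(value, 0) + 1
        self.last_seen[value] = time

    def score(self, value, time):
        if value not in self.freq:                       # novel value: t * n / r
            t = time - self.last_anomaly
            self.last_anomaly = time
            s = t * self.n / max(len(self.freq), 1)
        else:                                            # seen before: ti / fi
            s = (time - self.last_seen[value]) / self.freq[value]
        self.last_seen[value] = time
        return s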
SPADE (Hoagland)
• Models inbound TCP SYN.
• Score = 1/P(src IP, dest IP, dest port).
• Probability by counting.
• Always in training mode.
• Modified by randomly replacing real destination IP with one of 4 simulated targets.
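A sketch of the SPADE score, estimating P(src IP, dest IP, dest port) by counting and updating the counts on every packet since the system is always in training mode; the names are illustrative.

from collections import Counter

class Spade:
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score(self, src_ip, dest_ip, dest_port):
        key = (src_ip, dest_ip, dest_port)
        self.total += 1
        self.counts[key] += 1                  # always in training mode
        p = self.counts[key] / self.total      # probability by counting
        return 1.0 / p                         # rare tuples score high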
Criteria for Legitimate Detection
• Source address – target server must
authenticate source.
• Destination address/port – attack must use
or scan that address/port.
• Packet header field – attack must
write/modify the packet header (probe or
DOS).
• No U2R or Data attacks.
Mixed Traffic: Fewer Detections,
but More are Legitimate
Detections out of 177 at 100 false alarms
[Bar chart: total and legitimate detections for PHAD, ALAD, LERAD, NETAD, and SPADE, on a scale of 0–140.]
Conclusions
• SAD suggests the presence of simulation
artifacts and artificially low false alarm
rates.
• The simulated traffic is too clean, static
and predictable.
• Injecting real traffic reduces suspect
detections in all 5 systems tested.
Limitations and Future Work
• Only one real data source tested – may
not generalize.
• Tests on real traffic cannot be replicated
due to privacy concerns (root passwords
in the data, etc).
• Each IDS must be analyzed and modified
to prevent data separation.
• Is host data affected (BSM, audit logs)?
Limitations and Future Work
• Real data may contain unlabeled attacks. We found over 30 suspicious HTTP requests in our data (to a Solaris-based host).
IIS exploit with double URL encoding (IDS
evasion?)
GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir
Probe for Code Red backdoor.
GET /MSADC/root.exe?/c+dir HTTP/1.0
Further Reading
An Analysis of the 1999 DARPA/Lincoln
Laboratories Evaluation Data for
Network Anomaly Detection
By Matthew V. Mahoney and Philip K. Chan
Dept. of Computer Sciences Technical
Report CS-2003-02
http://cs.fit.edu/~mmahoney/paper7.pdf