Data Mining: Introduction

Intrusion Detection
Outline
Intrusion detection and computer security
Current intrusion detection approaches
Data Mining Approaches for Intrusion Detection
Summary
Intrusion Detection and Computer Security

Computer security goals:


Confidentiality, integrity, and availability
An intrusion is a set of actions aimed at
compromising these security goals

Intrusion prevention (authentication,
encryption, etc.) alone is not sufficient

Intrusion detection is needed
Intrusion Examples

Intrusions: Any set of actions that threaten the
integrity, availability, or confidentiality of a
network resource

Examples

Denial of service (DoS): attempts to starve a host of
resources needed to function correctly

Scan: reconnaissance on the network or a particular
host

Worms and viruses: replicating on other hosts

Compromises: obtain privileged access to a host by
known vulnerabilities
Intrusion Detection

Intrusion detection: The process of monitoring
and analyzing the events occurring in a computer
and/or network system in order to detect signs of
security problems

Primary assumption: User and program activities
can be monitored and modeled

Steps

Monitoring and analyzing traffic

Identifying abnormal activities

Assessing severity and raising alarm
Monitoring and Analyzing Traffic

TCPdump and Windump


Provide insight into the traffic activity on a network

ftp://ftp.ee.lbl.gov/tcpdump.tar.Z

http://netgroupserv.polito.it/windump
Ethereal

GUI to interpret all layers of the packet
Goals of Intrusion Detection System (IDS)


Detect wide variety of intrusions

Previously known and unknown attacks

Suggests need to learn/adapt to new attacks or changes in
behavior
Detect intrusions in timely fashion

May need to be real-time, especially when system
responds to intrusion


Problem: analyzing commands may impact response
time of system
May suffice to report intrusion occurred a few minutes or
hours ago
Goals of Intrusion Detection System (IDS) (2)

Present analysis in simple, easy-to-understand
format

Be accurate

Minimize false positives, false negatives

False positive: An event, incorrectly identified by
the IDS as being an intrusion when none has
occurred

False negative: An event that the IDS fails to
identify as an intrusion when one has in fact
occurred

Minimize time spent verifying attacks, looking for them
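
The two error types above can be measured once an IDS has been run on labelled events. The following is a minimal sketch, assuming each event carries a ground-truth label and an IDS verdict (True means "intrusion"); the function names are illustrative and not part of any IDS.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN) where True means 'intrusion'."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1          # alarm raised, but no intrusion occurred
        elif not p and a:
            fn += 1          # intrusion occurred, but no alarm raised
        else:
            tn += 1
    return tp, fp, tn, fn


def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate)."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

A high false positive rate translates directly into analyst time spent verifying alarms, which is why both rates are usually reported together.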
IDS Architecture

Sensors (agent)

To collect data and forward info to the analyzer
 network packets
 log files
 system call traces

Analyzers (detector)

To receive input from one or more sensors or from other
analyzers

To determine if an intrusion has occurred

User interface

To enable a user to view output from the system or
control the behavior of the system
IDS Architecture
[Figure: IDS architecture. Labels: Sensor 1, Sensor 2, ..., Sensor n; sensor events; network; analyser (classifier, clustering); human analyst]
Signature-Based Intrusion Detection

Human analysts investigate suspicious traffic

Extract signatures

Features of known intrusions

Use pre-defined signatures to discover malicious
packets

Examples

LaBrea Tarpit by Tom Liston

Snort and Snort rules by Marty Roesch
Snort by Marty Roesch


An open source free network intrusion detection
system

Signature-based; uses a combination of rules and
preprocessors

Runs on many platforms, including UNIX and Windows

www.snort.org
Preprocessors

IP defragmentation, port-scan detection, web traffic
normalization, TCP stream reassembly, …

Can analyze streams, not only a single packet at a time
Problems in Signature-Based Intrusion
Detection Systems

Many false positives: prone to generating alerts
when there is no problem in fact

Signatures are not specific enough

A packet is not examined in context with those that
precede it or those that follow

Cannot detect unknown intrusions

Rely on signatures extracted by human experts
Misuse vs. Anomaly Detection

Misuse detection: use patterns of well-known
attacks to identify intrusions


Classification based on known intrusions

E.g., three consecutive login failures: password guessing.
Anomaly detection: use deviation from normal
usage patterns to identify intrusions

Any significant deviations from the expected behavior
are reported as possible attacks
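
The misuse example above (three consecutive login failures suggest password guessing) can be sketched as a small pattern matcher. This is an illustration only: the event format `(user, success)` and the function name are assumptions, not taken from any particular IDS.

```python
from collections import defaultdict


def password_guessing_alerts(events, threshold=3):
    """Flag users with `threshold` consecutive failed logins.

    events: iterable of (user, success) pairs in time order.
    Returns the users flagged, in the order the pattern was matched.
    """
    streak = defaultdict(int)   # current run of failures per user
    flagged = []
    for user, success in events:
        if success:
            streak[user] = 0    # a successful login resets the run
        else:
            streak[user] += 1
            if streak[user] == threshold:
                flagged.append(user)
    return flagged
```

The pattern is fixed in advance, so an attacker who stays under the threshold is not detected; this is the limitation that anomaly detection tries to address.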
Misuse vs. Anomaly Detection

Definition
 Misuse detection: matching the sequence of “signature actions” of known intrusion scenarios
 Anomaly detection: using statistical measures on system features

Shortcoming
 Misuse detection: known patterns have to be hand-coded; unable to detect any future intrusion
 Anomaly detection: relies upon experience in selecting the system features; has to study sequential interrelation between transactions

Example
 Misuse detection: STAT [HLMS90]
 Anomaly detection: IDES [LTG+92]
Host-based vs. Network-based

According to data sources

Host-based detection: the data is collected from
an individual host


Directly monitor the host data files and OS processes

Can determine exactly which host resources are the
targets of a particular attack
Network-based detection: the data is traffic
across the network

A set of traffic sensors within the network

Can easily be hardened against attacks and hidden from the
attackers
OUTLINE

Intrusion detection and computer
security

Current intrusion detection approaches

Data Mining Approaches for Intrusion
Detection

Summary
Current Intrusion Detection
Approaches—Misuse Detection

Misuse detection :

Record the specific patterns of intrusions

Monitor current audit trails (event sequences) and
match them against the recorded patterns

Report the matched events as intrusions

Representation models: expert rules, Colored Petri Net,
and state transition diagrams, etc.
Misuse Detection Example

Expert systems: use a set of rules to describe
attacks

IDES, ComputerWatch, NIDX, P-BEST, ISOA

Signature analysis: capture features of attacks in
audit trail

Haystack, NetRanger, RealSecure, MuSig

State-transition analysis: use state-transition
diagrams

STAT, USTAT and NetSTAT

Other approaches

Colored Petri nets, e.g., IDIOT

Case-based reasoning, e.g., AUTOGUARD
Current Intrusion Detection
Approaches—Anomaly Detection

Anomaly detection:

Establishing the normal behavior profiles

Observing and comparing current activities with the
(normal) profiles

Reporting significant deviations as intrusions

Statistical measures as behavior profiles: ordinal and
categorical (binary and linear)
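
The three anomaly detection steps above (establish a profile, compare, report deviations) can be sketched with a single numeric measure. This is a minimal sketch, assuming the profile is just the mean and standard deviation of one feature and that "significant deviation" means more than `max_z` standard deviations; both choices are assumptions for illustration.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Profile of normal behavior: (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, profile, max_z=3.0):
    """Report a value that deviates too far from the normal profile."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_z
```

For example, a profile built from the number of logins per hour on attack-free days would flag an hour with several times the usual count.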
Anomaly Detection Example

Statistical methods: multivariate, temporal
analysis

IDES, NIDES, EMERALD

Expert systems

ComputerWatch, Wisdom & Sense
Problems of Current Intrusion
Detection Approaches

Main problems: manual and ad-hoc

Misuse detection:



Known intrusion patterns have to be hand-coded
Unable to detect any new intrusions (that have no
matched patterns recorded in the system)
Anomaly detection:


Selecting the right set of system features to be
measured is ad hoc and based on experience
Unable to capture sequential interrelation between
events
OUTLINE

Intrusion detection and computer
security

Current intrusion detection approaches

Data Mining Approaches for Intrusion
Detection

Summary

Why Can Data Mining Help?

Data mining: applying specific
algorithms to extract patterns from data

Normal and intrusive activities leave
evidence in audit data

From the data-centric point of view,
intrusion detection is a data analysis
process
Why Can Data Mining Help?
Successful applications in related
domains, e.g., fraud detection,
fault/alarm management
 Learn from traffic data




Supervised learning: learn precise models
from past intrusions
Unsupervised learning: identify suspicious
activities
Maintain or update models on dynamic
data
Frequent Patterns

Patterns that occur frequently in a database

Mining Frequent patterns – finding regularities

Process of Mining Frequent patterns for intrusion
detection

Phase I: mine a repository of normal frequent itemsets
from attack-free data

Phase II: find frequent itemsets in the last n connections
and compare the patterns to the normal profile
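
The two phases above can be sketched as follows. This is a brute-force illustration rather than an efficient miner such as Apriori: it enumerates all itemsets up to `max_size` items. The item strings (for example `"flag=S0"`) and function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=2):
    """Return itemsets whose support (fraction of records) >= min_support.

    records: list of sets of items, one set per connection.
    """
    counts = Counter()
    for rec in records:
        items = sorted(rec)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(records)
    return {s for s, c in counts.items() if c / n >= min_support}


def novel_patterns(normal_records, recent_records, min_support=0.5):
    """Phase I: mine the normal profile. Phase II: mine the recent
    connections and keep the frequent itemsets not seen in the profile."""
    normal = frequent_itemsets(normal_records, min_support)
    recent = frequent_itemsets(recent_records, min_support)
    return recent - normal
```

Itemsets returned by `novel_patterns` are frequent in the last n connections but not in attack-free data, so they are candidates for closer inspection.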
Frequent Pattern Mining in MINDS

MINDS: an IDS using data mining
techniques

University of Minnesota

Summarizing attacks using association
rules

{Src IP=206.163.27.95, Dest Port=139,
Bytes in [150, 200)} => {ATTACK}
Patterns About Alerts
Ning et al. CCS’02
 Find correlated alerts – the frequent
patterns of alerts




Attack scenarios – the logical connections
between alerts
A hyper-alerts correlation graph approach
Use the correlation of intrusion alerts to
identify high level attacks
Association rules

Used for link analysis

E.g.:

If the number of failed login attempts
(num_failed_login_attempts) and the network service
on the destination (service) are features, an example
of rule is:

num_failed_login_attempts = 6, service = FTP =>
attack = DoS [1, 0.28 ]
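
A rule of this form is usually scored by its support and confidence. The sketch below computes both for a rule over a list of records; the record layout (a dict per connection) and the function name are assumptions, and the sketch does not assume which of the two bracketed numbers in the rule above is which.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    records: list of dicts mapping feature name to value.
    antecedent, consequent: dicts of feature values that must match.
    """
    def matches(rec, cond):
        return all(rec.get(k) == v for k, v in cond.items())

    n_ante = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence
```

Support is the fraction of all records that satisfy both sides of the rule; confidence is the fraction of records matching the left side that also match the right side.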
Sequential Pattern Analysis

Models sequence patterns

(Temporal) order is important in many situations


Time-series databases and sequence databases

Frequent patterns => (frequent) sequential patterns
Sequential patterns for intrusion detection

Capture the signatures for attacks in a series of packets
Classification: A Two-Step Process

Model construction: describe a set of
predetermined classes

Training dataset: tuples for model construction



Each tuple/sample belongs to a predefined class
Classification rules, decision trees, or math formulae
Model application: classify unseen objects

Estimate accuracy of the model using an independent
test set

Acceptable accuracy => apply the model to classify data
tuples with unknown class labels
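
The two steps above can be sketched with a deliberately simple classifier. A nearest-centroid model is swapped in here purely for brevity (it is not one of the methods named in these slides); the feature tuples and labels are made up for the example.

```python
from collections import defaultdict


def train_centroids(samples):
    """Step 1, model construction: one centroid per predefined class.

    samples: list of (feature tuple, class label).
    """
    sums = {}
    counts = defaultdict(int)
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for i, v in enumerate(x):
            sums[y][i] += v
        counts[y] += 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}


def classify(model, x):
    """Step 2, model application: pick the class with the closest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist2(model[y]))


def accuracy(model, test_samples):
    """Estimate accuracy on an independent test set."""
    correct = sum(1 for x, y in test_samples if classify(model, x) == y)
    return correct / len(test_samples)
```

Only if `accuracy` on the held-out test set is acceptable would the model be applied to tuples whose class labels are unknown.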
Classification Methods

Basic algorithm: ID3

Neural networks

Bayesian classification

Naïve Bayesian classification

Bayesian belief network

Support vector machines
Classification for Intrusion Detection

Misuse detection


Classification based on known intrusions
Example: Sinclair et al. “An application of
machine learning to network intrusion detection”

Use decision trees and ID3 on host session data

Use genetic algorithms to generate rules

If <pattern> then <alert>
HIDE

“A hierarchical network intrusion detection
system using statistical processing and neural
network classification” by Zheng et al.

Five major components

Probes collect traffic data

Event preprocessor preprocesses traffic data and feeds
the statistical model

Statistical processor maintains a model for normal
activities and generates vectors for new events

Neural network classifies the vectors of new events

Post processor generates reports
Intrusion Detection by NN and SVM

S. Mukkamala et al., IEEE IJCNN May 2002

Discover useful patterns or features that
describe user behavior on a system

Use the set of relevant features to build
classifiers

SVMs have great potential to be used in place of
NNs due to their scalability and faster training and
running time

NNs are especially suited for multi-category
classification
Clustering

Group data into clusters

What is a good clustering?

High intra-class similarity and low inter-class similarity

Depending on the similarity measure

The ability to discover some or all of the hidden patterns
Clustering Approaches

K-means

Hierarchical Clustering

Density-based methods

Grid-based methods

Model-based
Clustering for Intrusion Detection

Anomaly detection

Any significant deviations from the expected behavior
are reported as possible attacks

Build clusters as models for normal activities

“A scalable clustering for intrusion signature
recognition” by Ye and Li

Use description of clusters as signatures of intrusions
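
Building clusters as models of normal activity can be sketched with a small k-means. This is an illustration under simplifying assumptions: points are numeric tuples, the first k points seed the centroids (so the result is deterministic), and a fixed `radius` decides what counts as a significant deviation. It is not the method of Ye and Li.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers); returns the k centroids."""
    centroids = [tuple(p) for p in points[:k]]   # deterministic seeding
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def is_outlier(point, centroids, radius):
    """A point far from every normal cluster is a possible attack."""
    return min(math.dist(point, c) for c in centroids) > radius
```

The choice of `k` and `radius` depends on the similarity measure and the data, and would need tuning on attack-free traffic.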
Alert Correlation

F. Cuppens and A. Miege, in IEEE S&P’02

Use clustering and merging functions to
recognize alerts that correspond to the same
occurrence of an attack

Create a new alert that merges the data contained in these
various alerts

Generate global and synthetic alerts to reduce
the number of alerts further
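
The merging idea above can be sketched with a very simple similarity test. This is an assumption-heavy illustration, not the similarity functions of Cuppens and Miege: two alerts are treated as the same occurrence of an attack when they share source, target, and type and arrive within `window` seconds of each other. The alert fields are hypothetical.

```python
def merge_alerts(alerts, window=60):
    """Merge alerts that likely belong to one occurrence of an attack.

    alerts: list of dicts with keys 'time', 'src', 'dst', 'type'.
    Returns merged alerts with 'start', 'end', and 'count' fields.
    """
    merged = []
    latest = {}   # (src, dst, type) -> most recent merged alert for that key
    for a in sorted(alerts, key=lambda a: a["time"]):
        key = (a["src"], a["dst"], a["type"])
        group = latest.get(key)
        if group is not None and a["time"] - group["end"] <= window:
            group["end"] = a["time"]      # extend the existing merged alert
            group["count"] += 1
        else:
            group = {"src": a["src"], "dst": a["dst"], "type": a["type"],
                     "start": a["time"], "end": a["time"], "count": 1}
            latest[key] = group
            merged.append(group)
    return merged
```

Each merged alert summarizes how many raw alerts it absorbed, which is one way to reduce the volume an analyst has to read.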
Mining Data Streams

Continuously arriving data in multiple, rapid, time-varying, possibly unpredictable and unbounded
streams

Many applications

Financial applications, network monitoring, security,
telecommunications data management, web application,
manufacturing, sensor networks, etc.
Mining Data Streams for Intrusion Detection

Maintaining profiles of normal activities


The profiles of normal activities may drift
Identifying novel attacks

Identifying clusters and outliers in traffic data
streams
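
A profile that drifts with the stream can be sketched with exponentially weighted statistics. This is a minimal sketch under stated assumptions: one numeric measure per event, a smoothing factor `alpha`, a short warm-up during which nothing is flagged, and outliers excluded from the profile update. The class and parameter names are illustrative.

```python
class DriftingProfile:
    """Exponentially weighted mean/variance that follows slow drift."""

    def __init__(self, alpha=0.1, tolerance=3.0, warmup=5):
        self.alpha = alpha
        self.tolerance = tolerance
        self.warmup = warmup
        self.count = 0
        self.mean = None
        self.var = 0.0

    def update(self, value):
        """Return True if value is an outlier; otherwise fold it in."""
        self.count += 1
        if self.mean is None:
            self.mean = float(value)
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        outlier = (self.count > self.warmup and std > 0
                   and abs(deviation) > self.tolerance * std)
        if not outlier:
            # Only normal-looking values are allowed to move the profile.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return outlier
```

Because each update touches only a few numbers, the profile can be maintained in one pass over an unbounded stream.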
A Systematic Framework—J.Stolfo et al.

Build good models:

select appropriate features of audit data to build
intrusion detection models

Build better models:

architect a hierarchical detector system that combines
multiple detection models

Build updated models:

dynamically update and deploy new detection system as
needed
A Systematic Framework

Support for the feature selection and model
construction:

Apply data mining algorithms to find consistent inter- and intra-audit record (event) patterns

Use the features and time windows in the discovered
patterns to build detection models

A support environment to semi-automate this process
A Systematic Framework

Combining multiple detection models:




Each (base) detector model monitors one aspect of the system
They can employ different techniques and be independent of
each other
The learned (meta) detector combines evidence from a number
of base detectors
An intelligent agent-based architecture:


learning agents: continuously compute (learn) the detection
models
detection agents: use the (updated) models to detect intrusions
A Systematic Framework
Building Classifiers for Intrusion
Detection—J.Stolfo et al.

Experiments in constructing classification models
for anomaly detection

Two experiments:


sendmail system call data

network tcpdump data
Use meta classifier to combine multiple
classification models
Classification Models on sendmail

The data: sequence of system calls made by
sendmail.

Classification models (rules): describe the
“normal” patterns of the system call sequences.

The rule set is the normal profile of sendmail

Detection: calculate the deviation from the profile

large number/high scores of “violations” to the rules in a
new trace suggests an exploit
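
The detection idea above can be sketched by scoring how much of a new trace falls outside the normal profile. Note the substitution: instead of learned classification rules, this sketch uses the plain set of short system-call windows seen in normal traces as the profile. The window length `n` and the function names are assumptions.

```python
def windows(trace, n):
    """All length-n sliding windows over a sequence of system call numbers."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_normal_profile(normal_traces, n=3):
    """Profile = every window observed in the normal traces."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, n))
    return profile


def violation_rate(trace, profile, n=3):
    """Fraction of windows in a new trace that violate the profile."""
    w = windows(trace, n)
    if not w:
        return 0.0
    return sum(1 for x in w if x not in profile) / len(w)
```

A trace whose violation rate is well above what normal traces produce would be reported as a possible exploit; where to put that threshold is a separate tuning question.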
Classification Models on sendmail

The sendmail data:



Each trace has two columns: the process ids
and the system call numbers
Normal traces: sendmail and sendmail daemon
Abnormal traces: sunsendmailcap, syslogremote, syslog-remote, decode, sm5x and
sm56a attacks
Classification Models on sendmail

Lessons learned:




Normal behavior can be established and used
to detect anomalous usage
Need to collect near “complete” normal data in
order to build the “normal” model
But how do we know when to stop collecting?
Need tools to guide the audit data gathering
process
Classification Models on tcpdump

The tcpdump data (part of a public data
visualization contest):

Packets of incoming, out-going, and internal
broadcast traffic

One trace of normal network traffic

Three traces of network intrusions
Data Preprocessing



Extract the “connection” level features:

Record connection attempts

Watch how connection is terminated
Each record has:

start time and duration

participating hosts and ports (applications)

statistics (e.g., # of bytes)

flag: normal or a connection/termination error

protocol: TCP or UDP
Divide connections into 3 types: incoming, outgoing, and inter-lan
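
The connection record described above can be sketched as a small data structure, together with the split into three types. The field names, the `local_prefix` test for "inside the LAN", and the example addresses are assumptions; the slides do not specify how local hosts are identified.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    start_time: float
    duration: float
    src_host: str
    src_port: int
    dst_host: str
    dst_port: int      # identifies the destination service (application)
    n_bytes: int
    flag: str          # "normal" or a connection/termination error
    protocol: str      # "TCP" or "UDP"


def connection_type(conn, local_prefix="10."):
    """Classify a connection as incoming, outgoing, or inter-lan."""
    src_local = conn.src_host.startswith(local_prefix)
    dst_local = conn.dst_host.startswith(local_prefix)
    if src_local and dst_local:
        return "inter-lan"
    if dst_local:
        return "incoming"
    if src_local:
        return "outgoing"
    return "other"     # neither endpoint is local; outside the three types
```

Splitting the records this way lets a separate model be built for each type of connection.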
Building Classifier for Each Type of
Connections

Use the destination service (port) as the class
label

Training data: 80% of the normal connections

Testing data: 20% of the normal connections and
connections in the 3 intrusion traces

Apply RIPPER to learn rules
Lessons Learned

Data preprocessing requires extensive domain
knowledge

Adding temporal features improves classification
accuracy

Need tools to guide (temporal) feature selection
Meta Classifier that Combines Evidence
from Multiple Detection Models

Build base classifiers that each model one aspect
of the system

The meta learning task:

each record has a collection of evidence from base
classifiers, and a class label “normal” or “abnormal” on
the state of the system

Apply a learning algorithm to produce the meta
classifier
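
The meta learning task above can be sketched with the simplest possible learner: a lookup table from each combination of base-classifier outputs to the majority class label seen in training. This stands in for whatever learning algorithm is actually applied; the evidence tuples and function names are assumptions.

```python
from collections import Counter, defaultdict


def train_meta(records):
    """Learn a meta classifier from (evidence tuple, class label) records.

    Each evidence tuple holds the outputs of the base classifiers.
    """
    by_evidence = defaultdict(Counter)
    overall = Counter()
    for evidence, label in records:
        by_evidence[evidence][label] += 1
        overall[label] += 1
    table = {ev: c.most_common(1)[0][0] for ev, c in by_evidence.items()}
    default = overall.most_common(1)[0][0]   # fallback for unseen evidence
    return table, default


def meta_predict(model, evidence):
    """Combine base-classifier evidence into one verdict."""
    table, default = model
    return table.get(evidence, default)
```

Unlike a fixed voting scheme, the table can learn that one base classifier firing alone is usually harmless while another firing alone is not.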
