One Class Support Vector Machines for Detecting Anomalous

Download Report

Transcript One Class Support Vector Machines for Detecting Anomalous

One Class Support Vector Machines
for Detecting Anomalous Windows
Registry Accesses
Krysta Svore
Joint Work with Katherine Heller, Angelos Keromytis, and
Salvatore Stolfo
Anomaly Detection


We would like to be able to detect anomalous
behavior by monitoring Windows registry
queries
We studied a host-based intrusion detection
system that applies a one class support
vector machine algorithm within the Registry
Anomaly Detection (RAD) system
Key Results


We show that the Probabilistic Anomaly
Detection (PAD) algorithm dramatically
outperforms the OCSVM algorithm due to its
hierarchical prior
To improve the performance of the OCSVM
algorithm, we need knowledge on how to
construct well-defined kernels
Host-based Intrusion Detection Systems
(IDS)


Most often attacked operating system is
Microsoft Windows
Combative methods

Virus scanner and security patches



Unable to combat unknown attacks
Requires frequent updates
Host-based IDS


Detects intrusions by monitoring system accesses
Used with data mining techniques for improved
detection ability
Our Approach



Use the Registry Anomaly Detection (RAD)
system to monitor Windows registry queries
Apply a One Class Support Vector Machine
(OCSVM) algorithm within the RAD system
Compare our system with previous work
using the Probabilistic Anomaly Detection
(PAD) algorithm within the RAD system
The Windows Registry



Stores configuration settings for programs,
security information, user profiles, and other
system information
Consists of registry keys and their associated
values
Queried by programs for information by
accessing a specific registry key





Process: EXPLORER.EXE
Query: OpenKey
Key: HKCR\CLSID\B41DB860-8EE4-11D2-9906E49FADC173CA\shellex\MayChangeDefaultMenu
Response: SUCCESS
ResultValue: NOTFOUND
The Windows Registry


Activity varies
from several
accesses per
minute to several
thousand
accesses per
minute
Almost all
programs use the
registry
Registry Anomaly Detection (RAD)
System

Three components:

Audit Sensor


Model Generator


Analyzes the registry access
Develops a model of normal behavior
Anomaly Detector

Classifies a new registry access as normal or attack
according to the generated model
Probabilistic Anomaly Detection (PAD)
Algorithm


Developed by E. Eskin et al., 2002
Relies on consistency checks over normal data and
labels a record anomalous if it fails any one of these
tests

First order consistency check



Verifies a value is consistent with observed values of that
feature in the normal data set
5 total checks
Second order consistency check


Determines the conditional probability of a feature value given
another feature value
20 total checks
PAD

Probability of an observed element
 P(X = i) = C(Ni + α)/(kα+N)


where N - total number of observations
Ni - number of observations of symbol i
α - “pseudo count” for each observed symbol
k - number of observed symbols
Probability of an unobserved element
 P(X = i) = (1-C)/(L-k)
L - number of possible symbols

Scaling factor
 C= N/(N+L-k)
An Example

LAN


We expect the probability of seeing a previously
unobserved IP address to be small
Peering Center

We expect the probability of seeing a previously
unobserved IP address to be almost 1
One Class Support Vector Machine
(OCSVM) Algorithm



Maps input data into a high dimensional
feature space
Iteratively finds the maximal margin in the
hyperplane which best separates the training
data from the origin
Solves optimization problem to find rule f with
maximal margin


f(x)=‹w,x›+b
If f(x)<0 , label x as anomalous
OCSVM
OCSVM: Kernels

Equivalent to solving the dual quadratic programming
problem

minα (1/2) ∑i,j αiαjK(xi,xj) s.t. 0≤αi≤1/(νl) , ∑i αi = 1


where αi - Lagrange multiplier
v - parameter to control trade-off between distance of
hyperplane from the origin and number of points in
training dataset
l - number of points in training dataset
Kernel function projects input vectors into a feature
space allowing for nonlinear decision boundaries


Feature map:
Kernel Function:
Φ: X → RN
K(xi,xj) = ‹Φ(xi), Φ(xj)›
Efficiency Comparison

Complexity of PAD


Time: O(v 2 R 2 )
Space: O(vR 2 )


where v – number of unique record values
R – number of record components
Complexity of OCSVM


Time: O(dL3 )
Space: O(d(L+T))

where d – number of dimensions
L – number of records in training dataset
T – number of records in test dataset
Experimental Data



Collected training data on Windows NT 4.0
Used approximately 500,000 attack-free
records to train system
Tested system using approximately 300,000
records with 2,000 labeled as attack
Experiments

Used three common kernels




Linear: K(x,y) = (x·y)
Polynomial: K(x,y) = (x·y+1)d , where d is the degree of
the polynomial
Gaussian: K(x,y) = e -║x-y║2/(2σ2) , where σ2 is the
variance
Used two types of feature vectors

Binary


One dimension for every unique entry for each of the five
given record values
Frequency-based

Each feature corresponds to the number of occurrences of
the corresponding record component in the training set
Dimensionality of Binary and Frequencybased Feature Vectors

Binary feature vector


Dimension = Number of possible entries for
feature value 1 * number of possible entries for
feature value 2 * … * number of possible entries
for feature value 5
Frequency-based feature vector

Dimension = 5
Evaluation

Used two statistics to compare the system’s
accuracy

Detection rate


False positive rate


Percentage of attack records that have been correctly
identified
Percentage of normal records that have been mislabeled as
anomalous
Plotted results with a Receiver Operator
Characteristic (ROC) curve

Plots percentage of false positives versus percentage of
true positives
Results
Results
Results

PAD algorithm outperforms OCSVM algorithm due
to the use of a hierarchical prior to estimate the
probabilities



Knows the likelihood of encountering a previously
unencountered feature value
Ability of OCSVM to detect anomalies relies on the
choice of the kernel
Need a novel, well-defined kernel which accounts
for highly discriminative information
Conclusions




Registry activity provides a good platform for
anomaly detection
PAD is more reliable than OCSVM
Currently no known way to learn a “most
optimal” kernel
By analyzing different algorithms, we can
identify what needs to be captured in the
kernel definition
Future Work


Improve kernel definition to incorporate more
discriminative information
Update the model as new data is labeled


Requires an efficient way of remodeling the data
over time
Test our system on the UNIX platform