Outlier Detection - Department of Computer Science


Data Mining
Anomaly/Outlier Detection
Lecture Notes for Chapter 10
Introduction to Data Mining
by
Tan, Steinbach, Kumar
New slides have been added and the original slides have
been significantly modified by Christoph F. Eick
Lecture Organization
0. Anomaly/Outlier Detection
1. Graphic-based Approaches
2. Model-based Statistical Approaches
3. One-Class SVM Approach
4. Distance-Based Approaches
Anomaly Detection
Coverage in COSC 4355
Please note: 7/8 of an iceberg is under water!
Eick, Tan,Steinbach,Kumar
COSC 4335: Data Mining
4/25/2016
0. Anomaly/Outlier Detection

What are anomalies/outliers?
– The set of data points that are considerably different from the
remainder of the data

Variants of Anomaly/Outlier Detection Problems
– Given a database D, find all the data points x ∈ D with anomaly
scores greater than some threshold t
– Given a database D, find all the data points x ∈ D having the top-n
largest anomaly scores f(x)
– Given a database D containing mostly normal (but unlabeled) data
points, and a test point x, compute the anomaly score of x with
respect to D
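The three problem variants above can be sketched in a few lines of Python, assuming an anomaly score has already been computed for each point (the function names below are illustrative, not from the slides; the nearest-neighbor score in the third variant is just one simple choice):

```python
def threshold_outliers(scores, t):
    """Variant 1: indices of points with anomaly score greater than threshold t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n_outliers(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def score_wrt_database(x, D):
    """Variant 3: a simple anomaly score of a test point x w.r.t. D --
    here (an illustrative choice) the distance to x's nearest neighbor in D."""
    return min(abs(x - d) for d in D)

scores = [0.1, 0.3, 2.5, 0.2, 1.9]
print(threshold_outliers(scores, 1.0))           # points above the threshold
print(top_n_outliers(scores, 2))                 # two highest-scoring points
print(score_wrt_database(10.0, [1.0, 2.0, 3.0]))
```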

Applications:
– Credit card fraud detection, telecommunication fraud detection,
network intrusion detection, fault detection, data cleaning, sensor
fusion,…
Importance of Anomaly Detection
Ozone Depletion History

In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels

Why did the Nimbus 7 satellite,
which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?

The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Anomaly Detection

Challenges
– How many outliers are there in the data?
– The method is unsupervised
 Validation can be quite challenging (just like for clustering)
– Finding a needle in a haystack

Working assumption:
– There are considerably more “normal” observations
than “abnormal” observations (outliers/anomalies) in
the data
Anomaly Detection Schemes

General Steps
– Build a profile of the “normal” behavior
 Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
 Anomalies are observations whose characteristics
differ significantly from the normal profile

Types of anomaly detection schemes
1. Graphical
2. Model-based, relying on parametric models
3. One-Class SVM Approach
4. Distance-based
1. Graphical Approaches



Idea: the user identifies outliers by visual inspection
Examples: scatter plot (2-D), spin plot (3-D)
Limitations
– Time consuming
– Subjective
Convex Hull Method



Extreme points are assumed to be outliers
Use the convex hull method to detect extreme values
http://cgm.cs.mcgill.ca/~godfried/teaching/projects97/belair/alpha.html
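A minimal sketch of this idea in Python: compute the convex hull (here with Andrew's monotone-chain algorithm; function names are my own, not from the linked page) and flag its vertices as candidate extreme points/outliers:

```python
def convex_hull(points):
    """Return the vertices of the 2-D convex hull (monotone-chain algorithm)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:                      # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints shared, drop duplicates

data = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 2), (10, 10)]
extremes = convex_hull(data)           # hull vertices = candidate outliers
print(extremes)
```

Note the stated limitation applies here too: hull vertices are extreme in every direction, so a single distant point like (10, 10) shows up, but interior anomalies would not.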
Box-Plot Approach for Outlier Detection
[Figure: box plot; observations beyond 1.5*IQR outside the box are marked as outliers]

Mixture of a graphical and a statistical approach
Observations that are more than c*IQR (e.g., c = 1.5) above
the third quartile or below the first quartile are outliers.
Decent approach for 1D/single-attribute outlier detection!
Sad news: cannot be used for multivariate data!
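The box-plot rule for a single attribute can be sketched as follows (a minimal sketch, assuming c = 1.5 as the IQR multiplier and quartiles computed with the standard library's `statistics.quantiles`):

```python
import statistics

def boxplot_outliers(values, c=1.5):
    """Return values more than c*IQR above Q3 or below Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - c * iqr, q3 + c * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 12, 12, 13, 13, 14, 15, 50]
print(boxplot_outliers(data))  # → [50]
```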
2. Model-based Statistical Approaches



Fit a parametric model M to the data, capturing the distribution
of the data (e.g., normal distribution)
[Figure: data and the fitted density function]
Apply a statistical test that depends on
– Data distribution
– Parameters of the distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
Alternatively, rank points by their likelihood with respect to M
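The "rank by likelihood" alternative can be sketched for a 1-D normal model (a minimal sketch; the model choice and the data are illustrative):

```python
import math
import statistics

def fit_normal(data):
    """Fit a 1-D normal model M: estimate mean and standard deviation."""
    return statistics.mean(data), statistics.stdev(data)

def log_density(x, mu, sigma):
    """Log of the normal density at x under the fitted model."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

data = [4.9, 5.0, 5.1, 5.2, 4.8, 5.0, 9.0]
mu, sigma = fit_normal(data)
ranked = sorted(data, key=lambda x: log_density(x, mu, sigma))
print(ranked[0])  # the least likely point under M -- the best outlier candidate
```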
Grubbs’ Test
Detects outliers in univariate data
 Assumes the data come from a normal distribution
 Detects one outlier at a time: remove the outlier
and repeat
– H0: There is no outlier in the data
– HA: There is at least one outlier

Grubbs’ test statistic:
$$G = \frac{\max_i \lvert X_i - \bar{X}\rvert}{s}$$
Reject H0 if:
$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$
Statistical Approaches for Assignment4
1. Fit a model M to the dataset D, e.g.
– a Bivariate Gaussian Model, or
– a Bivariate Gaussian Mixture Model (GMM) obtained by
running the EM clustering algorithm
2. Plug each point p into the density function dM of
model M and compute dM(p) or, preferably, log(dM(p)),
called the log likelihood of p, and add this value in
a new column ols (“outlier score”) to D, obtaining
D’; the smaller this value is, the more likely p is an
outlier with respect to M.
3. Sort D’ in ascending order; the first record is the
record with the smallest value of log(dM(p)).
4. Perform the remaining tasks using D’.
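Steps 1-3 can be sketched for the single bivariate-Gaussian variant (the GMM variant would sum weighted component densities instead; the dataset below is illustrative, not the assignment data):

```python
import math

def fit_bivariate_gaussian(points):
    """Step 1: fit mean vector and 2x2 covariance matrix (model M)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    vx = sum((p[0] - mx) ** 2 for p in points) / n
    vy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), [[vx, cxy], [cxy, vy]]

def log_density(p, mean, cov):
    """log(dM(p)) for a bivariate Gaussian with covariance [[vx,cxy],[cxy,vy]]."""
    (mx, my), ((vx, cxy), (_, vy)) = mean, cov
    det = vx * vy - cxy * cxy
    dx, dy = p[0] - mx, p[1] - my
    # quadratic form d^T Sigma^{-1} d for a 2x2 covariance
    q = (vy * dx * dx - 2 * cxy * dx * dy + vx * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * q

D = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.1, 1.1), (0.9, 0.9), (6.0, 1.0)]
mean, cov = fit_bivariate_gaussian(D)
# Steps 2-3: attach the log likelihood as outlier score and sort ascending
D_scored = sorted((log_density(p, mean, cov), p) for p in D)
print(D_scored[0][1])  # record with the smallest log likelihood
```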
Density Plot for Assignment4 Dataset
Remark: Using a model-based approach points on the same density contour line should
have the same likelihood to be outliers with respect to the underlying statistical model M.
Another “Better” Density Contour Plot
General Idea of the EM Algorithm
The EM algorithm learns Gaussian Mixture Models:
http://pypr.sourceforge.net/mog.html
http://scikit-learn.org/stable/modules/mixture.html
(Slide credit: 600.465 - Intro to NLP - J. Eisner)
Likelihood-Based Removal Approach


Assume the data set D contains samples from a mixture of two
probability distributions:
– M (majority distribution)
– A (anomalous distribution)
General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, explore the effect of
moving it to A:
 Let Lt+1(D) be the new log likelihood after removing xt
 Compute the difference Δ = Lt+1(D) – Lt(D)
 If Δ > c (some threshold), then xt is declared an
anomaly and moved permanently from M to A
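A toy sketch of this loop, assuming a 1-D normal majority distribution M that is refit after each removal (the threshold c and the data are hypothetical choices for illustration):

```python
import math
import statistics

def log_likelihood(data):
    """Log likelihood of data under a normal distribution fit to it."""
    mu, sigma = statistics.mean(data), statistics.pstdev(data)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def likelihood_removal(M, c):
    """Move points from M to A while doing so raises the log likelihood by > c."""
    A = []
    changed = True
    while changed:
        changed = False
        for x in list(M):
            rest = [y for y in M if y != x]
            delta = log_likelihood(rest) - log_likelihood(M)  # gain from removing x
            if delta > c:
                M.remove(x)
                A.append(x)          # x is declared an anomaly
                changed = True
                break                # refit M and rescan
    return M, A

M, A = likelihood_removal([5.0, 5.1, 4.9, 5.2, 4.8, 15.0], c=5.0)
print(A)
```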
Limitations of Statistical Approaches
Most statistical tests are for a single
attribute
 In many cases, the data distribution/model may not be
known
 For high-dimensional data, it may be difficult to
estimate the “true” density function. However,
mixtures of Gaussians in conjunction with EM
have been successfully used in practice for some
outlier detection tasks that involve multivariate
data.
3. One-Class SVM Approach
for Outlier Detection


Consider a sphere with center a and radius R
Minimize R plus the error resulting from points outside
the sphere; the error of such a point is its distance to the sphere.

$$\min_{R,\,a,\,\xi}\; R^2 + C \sum_t \xi_t
\quad \text{subject to} \quad
\lVert x_t - a\rVert^2 \le R^2 + \xi_t,\;\; \xi_t \ge 0$$

(ξ is the lowercase Greek letter xi, pronounced ‘ksi’; the slack ξt measures the error of point xt.)
More information: http://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/
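A crude illustrative sketch of the sphere idea (this is not the actual SVDD quadratic program; real implementations such as scikit-learn's `sklearn.svm.OneClassSVM` solve the dual QP, and the `inlier_fraction` parameter below is my own simplification of the role of C):

```python
import math

def sphere_outliers(points, inlier_fraction=0.9):
    """Center a at the data mean, pick R so most points fall inside,
    and flag points with positive slack (outside the sphere)."""
    n = len(points)
    dim = len(points[0])
    a = tuple(sum(p[i] for p in points) / n for i in range(dim))
    dists = sorted(math.dist(p, a) for p in points)
    R = dists[int(inlier_fraction * (n - 1))]   # empirical quantile as radius
    return [p for p in points if math.dist(p, a) > R]

data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (9, 9)]
print(sphere_outliers(data))
```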
One Class SVM with Kernel Functions
Again, kernel functions (i.e., a mapping to a higher-dimensional space) can be employed, in which
case the class-boundary shapes change as depicted.
4. Distance-based Approaches

Data is represented as a vector of features

Three major approaches
1. K-Nearest-neighbor based
2. Density-based approaches, relying on nonparametric density estimation techniques
3. Clustering based
Nearest-Neighbor Based Approach

Approach:
– Compute the distance between every pair of data
points
– There are various ways to define outliers:
 Data points for which there are fewer than p neighboring
points within a distance r
 The top n data points whose distance to the kth nearest
neighbor is greatest
 The top n data points whose average distance to the k
nearest neighbors is greatest
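The second variant (distance to the kth nearest neighbor) can be sketched as follows (function names and the tiny dataset are illustrative):

```python
import math

def knn_outliers(points, k=2, n=1):
    """Top n points whose distance to their k-th nearest neighbor is greatest."""
    def kth_nn_dist(p):
        dists = sorted(math.dist(p, q) for q in points if q != p)
        return dists[k - 1]
    return sorted(points, key=kth_nn_dist, reverse=True)[:n]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outliers(data, k=2, n=1))
```

Note this computes all pairwise distances, matching the approach above; for large datasets a spatial index would be used instead.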
Density-based: LOF approach



For each point, compute the density of its local
neighborhood; e.g., use DBSCAN’s approach
Compute the local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the
density of its nearest neighbors
Outliers are points with the largest LOF values

[Figure: in the nearest-neighbor approach, p2 is not considered
an outlier, while the LOF approach finds both p1 and p2 as outliers]
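A simplified LOF sketch (this uses 1 / average-kNN-distance as the density estimate, a simplification of the reachability-distance definition in the original LOF paper; names are my own):

```python
import math

def knn(points, p, k):
    """The k nearest neighbors of p among the other points."""
    return sorted((q for q in points if q != p),
                  key=lambda q: math.dist(p, q))[:k]

def density(points, p, k):
    """Local density: inverse of the average distance to the k nearest neighbors."""
    nbrs = knn(points, p, k)
    return 1.0 / (sum(math.dist(p, q) for q in nbrs) / k)

def lof(points, p, k=3):
    """Average neighbor density divided by p's own density; >> 1 means outlier."""
    nbrs = knn(points, p, k)
    return (sum(density(points, q, k) for q in nbrs) / k) / density(points, p, k)

cluster = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
data = cluster + [(5.0, 5.0)]
print(lof(data, (5.0, 5.0)))   # much larger than 1 -> outlier
print(lof(data, (0.1, 0.1)))   # close to 1 -> inlier
```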
Clustering-Based




Basic Idea: run a clustering algorithm; objects that do
not belong to any cluster are considered to be outliers!
Problem: this requires a clustering algorithm that has some
notion of outliers!
Problem: what parameters should be chosen for the
algorithm? E.g., we could just run DBSCAN and report
the noise points as outliers!
Rule of Thumb: less than x% of the data should be
outliers (with x typically chosen between 0.1 and 10); x
might be determined with other methods, e.g., statistical
tests.
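The DBSCAN idea can be sketched in miniature: points never assigned to a cluster (noise) are reported as outliers. This is a bare-bones sketch, not a production DBSCAN; `eps` and `min_pts` are hypothetical parameter choices:

```python
import math

def dbscan_noise(points, eps=0.5, min_pts=3):
    """Tiny DBSCAN sketch: return the noise points (the outliers)."""
    labels = {}          # point -> cluster id, or -1 for noise
    cluster_id = 0

    def region(p):
        return [q for q in points if math.dist(p, q) <= eps]

    for p in points:
        if p in labels:
            continue
        nbrs = region(p)
        if len(nbrs) < min_pts:
            labels[p] = -1          # noise (may later become a border point)
            continue
        cluster_id += 1             # p is a core point: start a new cluster
        labels[p] = cluster_id
        seeds = [q for q in nbrs if q != p]
        while seeds:
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cluster_id   # noise reached from a core -> border point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_nbrs = region(q)
            if len(q_nbrs) >= min_pts:   # q is also a core point: keep expanding
                seeds.extend(x for x in q_nbrs if x not in labels)
    return [p for p, lbl in labels.items() if lbl == -1]

data = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(dbscan_noise(data))
```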
Assignment4 Dataset
R-Code Used to Create the Complex9_gn8 Displays
#R-code for Assignment4 COSC 4335 Spring 2016
#Code for Scatter Plots
setwd("C:/Users/8yetula8/Desktop")
a<-read.csv("complex9_gn8.txt")
d<-data.frame(x=a[,1], y=a[,2], class=factor(a[,3]))
plot(d$x, d$y)
require("lattice")
require("ggplot2")
#scatter plots: one panel per class / colored by class
xyplot(y ~ x | class, d, groups=d$class, pch=20)
ggplot(d, aes(x=x, y=y, colour=class)) + geom_point()
ggplot(d, aes(x=x, y=y)) + geom_point() + facet_grid(~class)
#2D density contours
ggplot(d, aes(x=x, y=y, colour=class)) + stat_density2d()
p <- ggplot(d, aes(x=x, y=y))
p + geom_point() + geom_density2d()
#another approach; seems to get more meaningful contours
require("MASS")
require("KernSmooth")
y <- cbind(d$x, d$y)
#this approach uses kernel density estimation
est <- bkde2D(y, bandwidth=c(20, 18))
contour(est$x1, est$x2, est$fhat)   #contour plot of the estimated density
persp(est$fhat)                     #3D perspective plot