Deterministic Space Partition for Efficient Anomaly Mining

Download Report

Transcript Deterministic Space Partition for Efficient Anomaly Mining

Deterministic Space Partition
for Efficient Anomaly Mining
Xiao Yu, Lu An Tang, Jiawei Han
05/05/2009
Motivation
• Distance and density based methods achieve
high precision in complex datasets but very
time consuming.
• Sample based and random methods are very
efficient but can’t detect local anomalies in
complex datasets.
• Anomalies may have different meanings in
specific domains, but yet no methods are
proposed to distinct different anomalies.
4/12/2016
2
Related Works
• Statistic Methods.
– Detect anomalies based on distribution and test.
• LOF: Local Outlier Factor.
– Density based method. Detect outliers based on
neighbors.
• Clustering based Methods.
– DBSCAN, OPTICS, etc.
• Isolation Forest.
– Sample based method. Random Space Partition.
4/12/2016
3
Algorithm Overview
• Filter and Refine Framework.
• 1. Filter Step.
– Use Deterministic Space Partition to filter out
normal data and generate anomaly candidates
(more than 80% normal data could be eliminated).
• 2. Refine Step.
– Calculate final results and also attributes of the
anomalies
4/12/2016
4
Deterministic Space Partition
• D=>current dataset, d=>instance.
• Calculate a deterministic test T based on
current partition status.
– A test consists of an attribute q and a split value p
such that it could divide D into two subset D1 and
D2, by examining whether d.q is less than p or not.
• Repeat this process recursively until the
partition reaches a predefined depth or
instances are isolated.
4/12/2016
5
Filter Tree
• DSP could be stored and represented by a full
binary tree, defined as Filter Tree.
4/12/2016
6
Deterministic Test
• A deterministic test contains a dimension and
a split value. Build histogram to summarize
data on different dimensions.
• Select a dimension:
TDim  Ti  Tk
4
(
x


)
f ( x)dx

4
E
[(
C

E
(
C
))
]
1
4
d
d
Ti 
; Tk  4  xDim

2 2
2
2

P
log
P

E
[(
C

E
(
C
))
]
(
(
x


)
f
(
x
)
dx
)
 d 2 d
d
d

xDim
Best _ dim  arg max( TDim )
Dim
4/12/2016
7
Deterministic Test
• After select a dimension, then calculate the
best split point.
• Select a split point:
Ts   Pv log 2 ( Pv ) 
v s
P
v ' s
v'
log 2 ( Pv ' )
count (v)
count (v' )
Pv 
; Pv ' 
 count (v)
 count (v' )
v s
v ' s
Best _ value  arg max (Ts )
4/12/2016
sDim
8
Density based Refine Stage
• Check each anomaly candidate using two
measures.
• K-distance: For at least k objects o’ ∈ D{p} it holds that d(p, o’)
≤d(p, o), and for at most k-1 objects, o’ ∈ D{p} it holds that
d(p, o’) < d(p, o).
1
knn(o)  { p  D | d ( p, o)  k  dist (o)};  kd ( o ) 
d ( p, o)

k pknn( o )
1
Ti 
k
Td 
4/12/2016

 kd ( o )
pknn ( o )
 kd ( p )
size (c)  kd ( o )

pC
Isolation Degree
Distance to normal dataset
kd ( p )
9
Experiment 1
• 2-D Complex synthetic datasets with different
density, distribution and local outliers.
4/12/2016
10
Experiment 1 False Negative
4/12/2016
11
Experiment 1 False Positive
4/12/2016
12
Experiment 1 Time Complexity
4/12/2016
13
Thanks.
4/12/2016
14