Deterministic Space Partition for Efficient Anomaly Mining

Transcript Deterministic Space Partition for Efficient Anomaly Mining

Deterministic Space Partition
for Efficient Anomaly Mining
Xiao Yu, Lu An Tang, Jiawei Han
05/05/2009
Motivation
• Distance and density based methods achieve
high precision in complex datasets but very
time consuming.
• Sample based and random methods are very
efficient but can’t detect local anomalies in
complex datasets.
• Anomalies may have different meanings in
specific domains, but yet no methods are
proposed to distinct different anomalies.
4/12/2016
2
Related Works
• Statistic Methods.
– Detect anomalies based on distribution and test.
• LOF: Local Outlier Factor.
– Density based method. Detect outliers based on
neighbors.
• Clustering based Methods.
– DBSCAN, OPTICS, etc.
• Isolation Forest.
– Sample based method. Random Space Partition.
4/12/2016
3
Algorithm Overview
• Filter and Refine Framework.
• 1. Filter Step.
– Use Deterministic Space Partition to filter out
normal data and generate anomaly candidates
(more than 80% normal data could be eliminated).
• 2. Refine Step.
– Calculate final results and also attributes of the
anomalies
4/12/2016
4
Deterministic Space Partition
• D=>current dataset, d=>instance.
• Calculate a deterministic test T based on
current partition status.
– A test consists of an attribute q and a split value p
such that it could divide D into two subset D1 and
D2, by examining whether d.q is less than p or not.
• Repeat this process recursively until the
partition reaches a predefined depth or
instances are isolated.
4/12/2016
5
Filter Tree
• DSP could be stored and represented by a full
binary tree, defined as Filter Tree.
4/12/2016
6
Deterministic Test
• A deterministic test contains a dimension and
a split value. Build histogram to summarize
data on different dimensions.
• Select a dimension:
TDim  Ti  Tk
4
(
x


)
f ( x)dx

4
E
[(
C

E
(
C
))
]
1
4
d
d
Ti 
; Tk  4  xDim

2 2
2
2

P
log
P

E
[(
C

E
(
C
))
]
(
(
x


)
f
(
x
)
dx
)
 d 2 d
d
d

xDim
Best _ dim  arg max( TDim )
Dim
4/12/2016
7
Deterministic Test
• After select a dimension, then calculate the
best split point.
• Select a split point:
Ts   Pv log 2 ( Pv ) 
v s
P
v ' s
v'
log 2 ( Pv ' )
count (v)
count (v' )
Pv 
; Pv ' 
 count (v)
 count (v' )
v s
v ' s
Best _ value  arg max (Ts )
4/12/2016
sDim
8
Density based Refine Stage
• Check each anomaly candidate using two
measures.
• K-distance: For at least k objects o’ ∈ D{p} it holds that d(p, o’)
≤d(p, o), and for at most k-1 objects, o’ ∈ D{p} it holds that
d(p, o’) < d(p, o).
1
knn(o)  { p  D | d ( p, o)  k  dist (o)};  kd ( o ) 
d ( p, o)

k pknn( o )
1
Ti 
k
Td 
4/12/2016

 kd ( o )
pknn ( o )
 kd ( p )
size (c)  kd ( o )

pC
Isolation Degree
Distance to normal dataset
kd ( p )
9
Experiment 1
• 2-D Complex synthetic datasets with different
density, distribution and local outliers.
4/12/2016
10
Experiment 1 False Negative
4/12/2016
11
Experiment 1 False Positive
4/12/2016
12
Experiment 1 Time Complexity
4/12/2016
13
Thanks.
4/12/2016
14

Deterministic Space Partition for Efficient Anomaly Mining

Transcript Deterministic Space Partition for Efficient Anomaly Mining

Directory