A Unified Approach to Spatial Outliers Detection

Download Report

Transcript A Unified Approach to Spatial Outliers Detection

A Unified Approach to Spatial
Outliers Detection
Chang-Tien Lu
Spatial Database Lab
Department of Computer Science
University of Minnesota
[email protected]
http://www.cs.umn.edu/research/shashi-group
Outline

Introduction






Motivation
General Definition of Spatial Outlier
Related Work
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Conclusions
Spatial Data Mining

Spatial Databases are too large to analyze manually




Spatial Data Mining



NASA Earth Observation System (EOS)
National Institute of Justice – Crime mapping
Census Bureau, Dept. of Commerce - Census Data
Discover frequent and interesting spatial patterns for post
processing (knowledge discovery)
Pattern examples: outliers, crime hot spots, land-use
classification
Historical Example

London, 1854

Cholera & water pump
Spatial Outlier
Definition: A data point that is extreme relative to its neighbors
Application Domain: Traffic Data
An Example of Spatial Outlier
Spatial outlier: S, global outlier: G, L
Spatial Outlier Detection: Z
Function:
1


S (x )  f (x )   y N ( x ) (f (y ))
k


If
Z s( x)
S ( x)   s


s
Declare x as a spatial
outlier
s(x)
approach
Evaluation of Statistical Assumption


Distribution of traffic station attribute f(x) looks normal
Distribution of S (x )  f (x )  1  y N ( x ) (f (y )) looks normal too!

k

Outlier Detection Tests
Outlier Detection Tests

Related work



1-dimension: ignores geographic location
Homogeneous multi-dimension: mixes location with
attributes
Spatial outlier




2 classes of dimensions : location, attribute
Neighborhood : based on location dimension
Difference : compares attribute dimension
Comparison of outlier detection methods
Issues

Numerous tests



Each has custom algorithm
Add complexity to implement Spatial Database
Management System
Desirable



Unified test
A general algorithm to perform different tests
High performance
Our Contribution

A general definition of spatial outlier


S - outlier
Show that existing definitions are special cases of
s -outlier

Design efficient algorithms

Analyze the computation structures

Develop I/O cost models

Evaluate alternate disk page clustering methods
General Definition of Spatial Outlier

Given
A spatial framework SF consisting of locations s1, s2, …, sn
 An attribute function f : si  R
(R : set of real numbers)
 A neighborhood relationship N  SF  SF
N
 A neighborhood aggregation function
: RN  R
f aggr
 A difference function Fdiff : R  R  R
 Statistic test function ST : R  { True, False }



N
Test is based on Fdiff (f, f aggr
(f, N) )
General definition: S -outlier

N
An object O  SF is a S -outlier (f, f aggr
, Fdiff , ST )
if ST = = TRUE
Related Work- Spatial Outlier Tests

Different spatial outlier tests





Spatial statistic approach
Scatter plot approach (Luc Anselin ’94)
Moran Scatter plot approach (Luc Anselin ’95)
Variogram cloud approach (Graphic)
All these are special cases of S -outlier

Show this for one case: scatter plot

Lemma


Scatter plot is a special case
of S -outlier
Given



An attribute function f(x)
A neighborhood relationship N(x)
An aggregation function
N
f aggr
: E (x ) 


E(x)
Scatter Plot Approach
1

y N x f (y )
k
 ( )
A difference function
Fdiff : є = E(x) – (m  f(x) + b)
Detect spatial outlier by

Statistic test function
ST :     


f(x)
Outline


Introduction
Proposed Approach and Algorithm






Problem formulation
Our approach
Efficient algorithm
Cost model
Evaluation of Proposed Approach
Conclusions
Problem Formulation

General definition: S -outlier


Design


An efficient algorithm to detect S -outlier
i,e., O= {si si  SF , si is a spatial outlier}
Objective


N
An object O is a S -outlier (f, f aggr
, Fdiff, ST )
if ST == TRUE
Efficiency: to minimize the computation time
Constraints



Fdiff and ST are algebraic aggregate functions of values
N
of f and f aggr
The size of the data set >> the main memory size
Computation time is determined by I/O time
Aggregate Function

Distributive aggregate function F


Global F value can computed by applying the G function to the
value of F in each partition of the data set, F = G for most cases
Algebraic aggregate function F

Global F value can be computed using a fixed number of subaggregates from each partition of the data set
Our Approach

Separate two phases



Model building
Testing (a node or a set of nodes)
Computation structure of model building

Key insights:



Spatial self join using N(x) relationship
Algebraic aggregate functions can be computed in
one disk scan of spatial join
Computation structure of testing

Single node: spatial range query

Get-All-Neighbors(x) operation
An Example of Our Approach : Scatter plot

Model building



An attribute function f(x)
Neighborhood aggregate function E ( x)  1  y  N ( x ) f ( y )
k
Distributive aggregate functions


 
Testing

(x ),  E 2 (x )
Sy y  (m 2Sxx )
(n  2)
where
f ( x )E 2 ( x )   f ( x )  (f ( x )E ( x ) )

b
N  f 2 ( x ) (  f ( x )) 2
N
Sxx   f (x ) 
2
( f (x )) 2
Difference function


2
Algebraic aggregate functions

N
f ( x )E ( x )  f ( x )  E ( x )
m   N f 2 ( x ) 
(  f ( x )) 2



f (x ), E (x ), f (x )E (x ), f
 E (x )  (m  f (x )  b ) where
Statistic test function

 


E (x ) 
n
1
k
y
,
N ( x )
Sy y 
f (y )
E
2
(x ) 
( E (x )) 2
n
Model Building Algorithm

Steps

For each node x


Retrieve data record of x : (f(x), list of neighbors(x))
Get-All-Neighbors(x): Retrieve data records of neighbor(x)

If neighbor y is not in the memory buffer, request another I/O
operation
N
Compute neighborhood aggregate function f aggr
 Accumulate distributive aggregate function:
D1
D2
Dk
f aggr
, f aggr
,.., f aggr



Compute algebraic aggregate function:
A2
An
A1
f aggr
, f aggr
,.. , f aggr
I/O cost


Dominant operation: Get-All-Neighbors(x)
I/O cost of Get-All-Neighbors(x) is determined by the
clustering efficiency
Attribute Table
Node Attribute: volume
x
List of Neighbors
f(x)
N(x)
V1
125
V3, V5, V10
V2
130
V4, V6
V3
140
V1, V7
….
….
….
V10
120
V1, V7, V12
Computation structure

Lemma:


In model building: algebraic aggregate function can be
computed in one disk scan of the spatial self-join
Proof:



A fixed k number of distributive aggregate functions
G1
Dk
D2
, f aggr
,.., f aggr
are used to store the aggregate values
f aggr
Algebraic aggregate functions are then computed using
these aggregate values
In all cases, one needs a very small number of distributive
aggregate functions (k  10)
Testing Algorithm

Steps

For each node x


Retrieve data record of x : (f(x), list of neighbors(x))
Get-All-Neighbors(x): Retrieve data records of
neighbor(x)




If neighbor y is not in the memory buffer
Request another I/O operation
Compute difference function Fdiff
If test function ST == True

Declare x as a spatial outlier
I/O Cost Model

Definition






CE: Clustering efficiency
N: Total number of nodes
Bfr: Blocking factor (number of nodes in a disk page)
K: Avg. number of neighbors for each node
L: Number of nodes in a route
Cost model of A1

N / Bfr   N  K



 (1  CE )
The cost to retrieve all nodes: N / Bfr 
The cost to retrieve neighbors of all nodes: N  K  (1  CE )
Cost model of A2

L  (1  CE )  L  K  (1  CE )
Clustering Efficiency

CE definition:



Probability [vi and a neighbor of vi are stored in
the same disk page]
(Total number of unsplit edges)/(Total number
of edges)
Computation cost (I/O cost) is determined
by Clustering Efficiency (CE)
Clustering Efficiency

An example


CE 
93 6
  0.66
9
9
CE depends on



Disk page size
Node record size,
edge distribution
over nodes
Clustering method
I/O Cost Model

Definition






CE : Clustering efficiency
N : Total number of nodes
Bfr : Blocking factor (number of nodes in a disk page)
K : Avg. number of neighbors for each node
L : Number of nodes in a route
Cost model of Model Building

N / Bfr   N  K



 (1  CE )
The cost to retrieve all nodes: N / Bfr 
The cost to retrieve neighbors of all nodes: N  K  (1  CE )
Cost model of Testing

L  (1  CE )  L  K  (1  CE )
Outline



Introduction
Proposed Approach and Algorithm
Evaluation of Proposed Approach




Candidates (Clustering Methods)
Experiment Design
Results
conclusions
Experimental Evaluation (Summary)

Hypothesis:


Physical Data Page Clustering Method




Minneapolis- St. Paul traffic data (loop-detector)
Benchmark tasks



CCAM
Cell Tree
Z-order
Benchmark data


I/O cost of the algorithm is determined by the clustering
efficiency
Model building
Testing
Metrics: Clustering efficiency(CE), I/O cost
Clustering Method: CCAM



Connectivity Clustered Access Method
Cluster the nodes via min-cut graph partitioning
Use B+ tree with Z-order as the secondary index
Clustering Method: CCAM
Clustering Methods: Cell Tree



Binary Space Partitioning (BSP)
Decompose universe into disjoint convex subspaces
Cannot exploit edge information, pure geometric
Clustering Method: Cell Tree
Clustering Method: Z-order


Impose a total order on the nodes
Z-order = interleave (bits of X, bits of Y)
Clustering Method: Z-order
Experiment Design

Questions/Hypotheses




What is the ranking of candidate clustering
methods?
Is CE a predictor of relative performance of
clustering methods?
Does cost model predict observed ranking?
What are the effects of parameters:


Disk page size
Memory buffer size
Experiment Design
Model Building: Effect of Page Size
Configuration:


Trends:



Fixed parameters: buffer size
Variable parameters: page size, clustering strategy
CCAM has the best performance, the highest CE value
High CE => Low I/O cost

Cost model: (N/Bfr)+N*K*(1-CE)
Increase page size => reduce number of page accesses
Model Building: Effect of Buffer Size

Fixed parameters




Page size: 2 Kbytes
Clustering Efficiency:
 CCAM=0.81, Cell=0.69,
Z-ord=0.51
Variable parameters

Number of buffers

Clustering strategies
Trends:

Increase buffer size
=> reduce number of page accesses

CCAM has the best performance
Testing: Effect of Page Size
Configuration:
Trends:





Fixed parameters: buffer size
Variable parameters: page size, clustering strategy
CCAM has the best performance, the highest CE value
High CE => Low I/O cost
 Cost model: L*(1-CE)+L*K*(1-CE)
Increase page size
 Reduce number of page access, performance gap reduces
Summary of Experimental Results

Model building and testing



Higher CE leads to lower I/O cost


CCAM achieves higher clustering efficiency
CCAM has lower I/O than Cell-Tree and Z-order
CE is a good predictor of relative I/O performance
Page size improves clustering efficiency of all
methods

Reduces performance gap between methods
Outlier Station Detected
Minneapolis–St. Paul Traffic Data
Outlier Station Detected
Conclusion


A general definition of spatial outlier
S -outlier models existing spatial outlier definitions


Efficient algorithm to detect spatial outlier




Model building & Testing
Recognize the computation structure of algorithms


Scatter plot, Moran Scatterplot, Spatial Statistic, ..
Algebraic aggregate functions on  self join
Get-All-Neighbor(x) dominates I/O cost
Develop algebraic cost models
Evaluate alternate disk page clustering methods

CCAM, Cell-Tree, Z-order
Future Directions

Extend Spatial Outlier Detection Test




Explore other spatial patterns beyond spatial outlier



Multi-attributes
 Combination of traffic volume, speed, occupancy
Location attribute includes time
Temporal and Spatial-Temporal Outliers
Land-use classification
Co-locations
 Fire ignition source feature
 Needle vegetation type feature
 Drought feature
Iterative Spatial Outlier Detection
Application Domain: Traffic Volume Matrix
Spatial Outlier Detection
Iterative Spatial Outlier Detection
Related Publications



Detecting Graph-based Spatial Outliers: Algorithms and
Applications, ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, September, 2001
Detecting Graph-based Spatial Outliers, the International
Journal of Intelligent Data Analysis (IDA), Vol. 6, No 3. 2002
A Unified Approach to Spatial Outliers Detection, IEEE
Transactions on Knowledge and Data Engineering. (under
review)
http://www.cs.umn.edu/~ctlu
Thank you !!!