A Unified Approach to Spatial Outliers Detection
Download
Report
Transcript A Unified Approach to Spatial Outliers Detection
A Unified Approach to Spatial
Outliers Detection
Chang-Tien Lu
Spatial Database Lab
Department of Computer Science
University of Minnesota
[email protected]
http://www.cs.umn.edu/research/shashi-group
Outline
Introduction
Motivation
General Definition of Spatial Outlier
Related Work
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Conclusions
Spatial Data Mining
Spatial Databases are too large to analyze manually
Spatial Data Mining
NASA Earth Observation System (EOS)
National Institute of Justice – Crime mapping
Census Bureau, Dept. of Commerce - Census Data
Discover frequent and interesting spatial patterns for post
processing (knowledge discovery)
Pattern examples: outliers, crime hot spots, land-use
classification
Historical Example
London, 1854
Cholera & water pump
Spatial Outlier
Definition: A data point that is extreme relative to its neighbors
Application Domain: Traffic Data
An Example of Spatial Outlier
Spatial outlier: S, global outlier: G, L
Spatial Outlier Detection: Z
Function:
1
S (x ) f (x ) y N ( x ) (f (y ))
k
If
Z s( x)
S ( x) s
s
Declare x as a spatial
outlier
s(x)
approach
Evaluation of Statistical Assumption
Distribution of traffic station attribute f(x) looks normal
Distribution of S (x ) f (x ) 1 y N ( x ) (f (y )) looks normal too!
k
Outlier Detection Tests
Outlier Detection Tests
Related work
1-dimension: ignores geographic location
Homogeneous multi-dimension: mixes location with
attributes
Spatial outlier
2 classes of dimensions : location, attribute
Neighborhood : based on location dimension
Difference : compares attribute dimension
Comparison of outlier detection methods
Issues
Numerous tests
Each has custom algorithm
Add complexity to implement Spatial Database
Management System
Desirable
Unified test
A general algorithm to perform different tests
High performance
Our Contribution
A general definition of spatial outlier
S - outlier
Show that existing definitions are special cases of
s -outlier
Design efficient algorithms
Analyze the computation structures
Develop I/O cost models
Evaluate alternate disk page clustering methods
General Definition of Spatial Outlier
Given
A spatial framework SF consisting of locations s1, s2, …, sn
An attribute function f : si R
(R : set of real numbers)
A neighborhood relationship N SF SF
N
A neighborhood aggregation function
: RN R
f aggr
A difference function Fdiff : R R R
Statistic test function ST : R { True, False }
N
Test is based on Fdiff (f, f aggr
(f, N) )
General definition: S -outlier
N
An object O SF is a S -outlier (f, f aggr
, Fdiff , ST )
if ST = = TRUE
Related Work- Spatial Outlier Tests
Different spatial outlier tests
Spatial statistic approach
Scatter plot approach (Luc Anselin ’94)
Moran Scatter plot approach (Luc Anselin ’95)
Variogram cloud approach (Graphic)
All these are special cases of S -outlier
Show this for one case: scatter plot
Lemma
Scatter plot is a special case
of S -outlier
Given
An attribute function f(x)
A neighborhood relationship N(x)
An aggregation function
N
f aggr
: E (x )
E(x)
Scatter Plot Approach
1
y N x f (y )
k
( )
A difference function
Fdiff : є = E(x) – (m f(x) + b)
Detect spatial outlier by
Statistic test function
ST :
f(x)
Outline
Introduction
Proposed Approach and Algorithm
Problem formulation
Our approach
Efficient algorithm
Cost model
Evaluation of Proposed Approach
Conclusions
Problem Formulation
General definition: S -outlier
Design
An efficient algorithm to detect S -outlier
i,e., O= {si si SF , si is a spatial outlier}
Objective
N
An object O is a S -outlier (f, f aggr
, Fdiff, ST )
if ST == TRUE
Efficiency: to minimize the computation time
Constraints
Fdiff and ST are algebraic aggregate functions of values
N
of f and f aggr
The size of the data set >> the main memory size
Computation time is determined by I/O time
Aggregate Function
Distributive aggregate function F
Global F value can computed by applying the G function to the
value of F in each partition of the data set, F = G for most cases
Algebraic aggregate function F
Global F value can be computed using a fixed number of subaggregates from each partition of the data set
Our Approach
Separate two phases
Model building
Testing (a node or a set of nodes)
Computation structure of model building
Key insights:
Spatial self join using N(x) relationship
Algebraic aggregate functions can be computed in
one disk scan of spatial join
Computation structure of testing
Single node: spatial range query
Get-All-Neighbors(x) operation
An Example of Our Approach : Scatter plot
Model building
An attribute function f(x)
Neighborhood aggregate function E ( x) 1 y N ( x ) f ( y )
k
Distributive aggregate functions
Testing
(x ), E 2 (x )
Sy y (m 2Sxx )
(n 2)
where
f ( x )E 2 ( x ) f ( x ) (f ( x )E ( x ) )
b
N f 2 ( x ) ( f ( x )) 2
N
Sxx f (x )
2
( f (x )) 2
Difference function
2
Algebraic aggregate functions
N
f ( x )E ( x ) f ( x ) E ( x )
m N f 2 ( x )
( f ( x )) 2
f (x ), E (x ), f (x )E (x ), f
E (x ) (m f (x ) b ) where
Statistic test function
E (x )
n
1
k
y
,
N ( x )
Sy y
f (y )
E
2
(x )
( E (x )) 2
n
Model Building Algorithm
Steps
For each node x
Retrieve data record of x : (f(x), list of neighbors(x))
Get-All-Neighbors(x): Retrieve data records of neighbor(x)
If neighbor y is not in the memory buffer, request another I/O
operation
N
Compute neighborhood aggregate function f aggr
Accumulate distributive aggregate function:
D1
D2
Dk
f aggr
, f aggr
,.., f aggr
Compute algebraic aggregate function:
A2
An
A1
f aggr
, f aggr
,.. , f aggr
I/O cost
Dominant operation: Get-All-Neighbors(x)
I/O cost of Get-All-Neighbors(x) is determined by the
clustering efficiency
Attribute Table
Node Attribute: volume
x
List of Neighbors
f(x)
N(x)
V1
125
V3, V5, V10
V2
130
V4, V6
V3
140
V1, V7
….
….
….
V10
120
V1, V7, V12
Computation structure
Lemma:
In model building: algebraic aggregate function can be
computed in one disk scan of the spatial self-join
Proof:
A fixed k number of distributive aggregate functions
G1
Dk
D2
, f aggr
,.., f aggr
are used to store the aggregate values
f aggr
Algebraic aggregate functions are then computed using
these aggregate values
In all cases, one needs a very small number of distributive
aggregate functions (k 10)
Testing Algorithm
Steps
For each node x
Retrieve data record of x : (f(x), list of neighbors(x))
Get-All-Neighbors(x): Retrieve data records of
neighbor(x)
If neighbor y is not in the memory buffer
Request another I/O operation
Compute difference function Fdiff
If test function ST == True
Declare x as a spatial outlier
I/O Cost Model
Definition
CE: Clustering efficiency
N: Total number of nodes
Bfr: Blocking factor (number of nodes in a disk page)
K: Avg. number of neighbors for each node
L: Number of nodes in a route
Cost model of A1
N / Bfr N K
(1 CE )
The cost to retrieve all nodes: N / Bfr
The cost to retrieve neighbors of all nodes: N K (1 CE )
Cost model of A2
L (1 CE ) L K (1 CE )
Clustering Efficiency
CE definition:
Probability [vi and a neighbor of vi are stored in
the same disk page]
(Total number of unsplit edges)/(Total number
of edges)
Computation cost (I/O cost) is determined
by Clustering Efficiency (CE)
Clustering Efficiency
An example
CE
93 6
0.66
9
9
CE depends on
Disk page size
Node record size,
edge distribution
over nodes
Clustering method
I/O Cost Model
Definition
CE : Clustering efficiency
N : Total number of nodes
Bfr : Blocking factor (number of nodes in a disk page)
K : Avg. number of neighbors for each node
L : Number of nodes in a route
Cost model of Model Building
N / Bfr N K
(1 CE )
The cost to retrieve all nodes: N / Bfr
The cost to retrieve neighbors of all nodes: N K (1 CE )
Cost model of Testing
L (1 CE ) L K (1 CE )
Outline
Introduction
Proposed Approach and Algorithm
Evaluation of Proposed Approach
Candidates (Clustering Methods)
Experiment Design
Results
conclusions
Experimental Evaluation (Summary)
Hypothesis:
Physical Data Page Clustering Method
Minneapolis- St. Paul traffic data (loop-detector)
Benchmark tasks
CCAM
Cell Tree
Z-order
Benchmark data
I/O cost of the algorithm is determined by the clustering
efficiency
Model building
Testing
Metrics: Clustering efficiency(CE), I/O cost
Clustering Method: CCAM
Connectivity Clustered Access Method
Cluster the nodes via min-cut graph partitioning
Use B+ tree with Z-order as the secondary index
Clustering Method: CCAM
Clustering Methods: Cell Tree
Binary Space Partitioning (BSP)
Decompose universe into disjoint convex subspaces
Cannot exploit edge information, pure geometric
Clustering Method: Cell Tree
Clustering Method: Z-order
Impose a total order on the nodes
Z-order = interleave (bits of X, bits of Y)
Clustering Method: Z-order
Experiment Design
Questions/Hypotheses
What is the ranking of candidate clustering
methods?
Is CE a predictor of relative performance of
clustering methods?
Does cost model predict observed ranking?
What are the effects of parameters:
Disk page size
Memory buffer size
Experiment Design
Model Building: Effect of Page Size
Configuration:
Trends:
Fixed parameters: buffer size
Variable parameters: page size, clustering strategy
CCAM has the best performance, the highest CE value
High CE => Low I/O cost
Cost model: (N/Bfr)+N*K*(1-CE)
Increase page size => reduce number of page accesses
Model Building: Effect of Buffer Size
Fixed parameters
Page size: 2 Kbytes
Clustering Efficiency:
CCAM=0.81, Cell=0.69,
Z-ord=0.51
Variable parameters
Number of buffers
Clustering strategies
Trends:
Increase buffer size
=> reduce number of page accesses
CCAM has the best performance
Testing: Effect of Page Size
Configuration:
Trends:
Fixed parameters: buffer size
Variable parameters: page size, clustering strategy
CCAM has the best performance, the highest CE value
High CE => Low I/O cost
Cost model: L*(1-CE)+L*K*(1-CE)
Increase page size
Reduce number of page access, performance gap reduces
Summary of Experimental Results
Model building and testing
Higher CE leads to lower I/O cost
CCAM achieves higher clustering efficiency
CCAM has lower I/O than Cell-Tree and Z-order
CE is a good predictor of relative I/O performance
Page size improves clustering efficiency of all
methods
Reduces performance gap between methods
Outlier Station Detected
Minneapolis–St. Paul Traffic Data
Outlier Station Detected
Conclusion
A general definition of spatial outlier
S -outlier models existing spatial outlier definitions
Efficient algorithm to detect spatial outlier
Model building & Testing
Recognize the computation structure of algorithms
Scatter plot, Moran Scatterplot, Spatial Statistic, ..
Algebraic aggregate functions on self join
Get-All-Neighbor(x) dominates I/O cost
Develop algebraic cost models
Evaluate alternate disk page clustering methods
CCAM, Cell-Tree, Z-order
Future Directions
Extend Spatial Outlier Detection Test
Explore other spatial patterns beyond spatial outlier
Multi-attributes
Combination of traffic volume, speed, occupancy
Location attribute includes time
Temporal and Spatial-Temporal Outliers
Land-use classification
Co-locations
Fire ignition source feature
Needle vegetation type feature
Drought feature
Iterative Spatial Outlier Detection
Application Domain: Traffic Volume Matrix
Spatial Outlier Detection
Iterative Spatial Outlier Detection
Related Publications
Detecting Graph-based Spatial Outliers: Algorithms and
Applications, ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, September, 2001
Detecting Graph-based Spatial Outliers, the International
Journal of Intelligent Data Analysis (IDA), Vol. 6, No 3. 2002
A Unified Approach to Spatial Outliers Detection, IEEE
Transactions on Knowledge and Data Engineering. (under
review)
http://www.cs.umn.edu/~ctlu
Thank you !!!