April27`04project_present
Spatial Outlier Detection and
implementation in Weka
Implemented by:
Shan Huang
Jisu Oh
CSCI8715 Class Project, April 27 2004
Presented by Jisu Oh (Group 2)
Slides Available at
http://www.users.cs.umn.edu/~joh/csci8715/HW-list.htm
1
Topics:
Motivation
Problem Statement
Key Concepts
Major Contributions
Validation Methodology
Assumptions
Conclusions
Future work
2
Motivation
Machine learning / Data mining
- Enables a computer program to analyze large-scale data
- Identifies important information that can be used to make predictions or decisions faster and more accurately
3
Motivation
Weka
- A collection of machine learning algorithms for solving real-world data mining problems
- Provides data mining functions (e.g., regression, association rules, and clustering algorithms)
Limitation:
- Operates on traditional non-spatial databases
4
Problem Statement
Input
- Minneapolis/St. Paul traffic data set
Output: detected outliers as
- Plain text (time slot, time, station, Zs(x))
- Overall traffic volume
- Neighbor relationship graph between stations
5
Problem Statement (contd.)
Constraints
- Algorithm from the paper "A Unified Approach to Detecting Spatial Outliers"
- Data set should be numeric
Objective
- To find sets of spatial outliers and show the results visually
6
Key Concepts
Spatial outliers
Definition: spatially referenced objects whose non-spatial attribute values are significantly different from the values of their neighborhood.
Example: a new house in an old neighborhood of a growing metropolitan area
In this project, an outlier is a station that has a high volume compared to its neighboring stations at a certain time slot.
7
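Key Concepts (contd.)
Algorithm
Proposed in the paper "A Unified Approach to Detecting Spatial Outliers", by S. Shekhar, C. T. Lu, and P. Zhang
$S(x) = f(x) - E_{y \in N(x)}(f(y))$
: difference between
- $f(x)$, the attribute value of a sensor located at $x$
- $E_{y \in N(x)}(f(y))$ (abbreviated $E_y$), the average attribute value of $x$'s neighbors
$Z_{s(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
: spatial statistic, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$ over all locations, and $\theta$ is a z-score for a user-specified confidence interval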
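The test above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's Weka code: the function name `spatial_outliers`, the dictionary-based inputs, the default threshold of 2.0, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def spatial_outliers(values, neighbors, theta=2.0):
    """Return {location: z} for locations with |(S(x) - mu_s) / sigma_s| > theta.

    values    -- dict: location -> attribute value f(x)
    neighbors -- dict: location -> list of neighboring locations N(x)
    theta     -- z-score threshold (hypothetical default)
    """
    # S(x) = f(x) minus the average value of x's neighbors
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())  # assumption: population standard deviation
    if sigma_s == 0:
        return {}  # every location looks like its neighbors
    z = {x: abs((s[x] - mu_s) / sigma_s) for x in s}
    return {x: zx for x, zx in z.items() if zx > theta}

# Made-up data: ten stations in a row, all reporting 5 except station 5.
values = {i: 5 for i in range(1, 11)}
values[5] = 100
neighbors = {i: [j for j in (i - 1, i + 1) if 1 <= j <= 10] for i in range(1, 11)}
outliers = spatial_outliers(values, neighbors)
print(sorted(outliers))  # prints [5]
```

Note that the two stations next to the spike also get a large negative S(x), but their z-scores stay below the threshold in this made-up data.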
8
Key Concepts (contd.)
Algorithm (example)
[Figure: grid of stations and their attribute values, shown before and after the outlier value 100 is replaced by $E_y = 5$]
$S(x) = f(x) - E_y = 100 - (2+8)/2 = 95$
$\mu_s = 0.22$
$\sigma_s = 23.8$
$Z_{s(x)} = |S(x) - \mu_s| / \sigma_s = 3.98$
Z-score for 95% C.I. = 2
3.98 > 2
Thus, 100 is an outlier
Outlier is replaced by $E_y$: 100 -> 5
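The arithmetic of this example can be checked with a short Python snippet. The numbers are taken from the slide; the variable names are chosen here for illustration only.

```python
# Values from the worked example on the slide.
f_x = 100                   # attribute value at the suspected station
neighbor_values = [2, 8]    # values of its two neighbors
mu_s, sigma_s = 0.22, 23.8  # mean and standard deviation of S(x)
theta = 2                   # z-score used for the 95% C.I.

e_y = sum(neighbor_values) / len(neighbor_values)  # average neighbor value
s_x = f_x - e_y                                     # difference S(x)
z = abs(s_x - mu_s) / sigma_s                       # spatial statistic
print(e_y, s_x, round(z, 2), z > theta)  # prints 5.0 95.0 3.98 True
```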
9
Major Contributions
Top k outliers query processing
User interface similar to the Weka UI
Visualization of outliers
- Plain text (time slot, time, station, Zs(x))
- Overall traffic volume
- Neighbor relationship graph between stations
Keeping user-specified results
10
Major Contributions (contd.)
Top k outliers query processing
Fig.1. Top 3 outliers from dataset 19970115N.dat
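A top k query over the computed Zs(x) values can be sketched as below. The slides do not say how the project implements it, so this is only one possible approach; the function name `top_k_outliers`, the (time slot, station) keys, and the scores in the usage lines are made up for illustration.

```python
import heapq

def top_k_outliers(z_scores, k):
    """Return the k (key, z) pairs with the largest z, highest first.

    z_scores -- dict: (time slot, station) -> Zs(x) value
    """
    return heapq.nlargest(k, z_scores.items(), key=lambda item: item[1])

# Made-up scores, not taken from the traffic data set.
scores = {(1, 24): 3.98, (2, 24): 2.5, (3, 7): 4.1, (4, 9): 1.0}
for (timeslot, station), z in top_k_outliers(scores, 3):
    print(timeslot, station, z)
```

`heapq.nlargest` avoids sorting every score when k is small compared to the number of (time slot, station) pairs.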
11
Major Contributions (contd.)
User Interface
Fig.2 User interface of the spatial outlier detection application vs. Weka
12
Major Contributions (contd.)
Visualization of outliers
Fig.3 Plain text results of detected outliers
13
Major Contributions (contd.)
Visualization of outliers
Detected outliers
Fig.4 Overall traffic volume and Neighbor relationship graph between stations
14
Major Contributions (contd.)
Visualization of outliers
Fig.4 Overall traffic volume and Neighbor relationship graph between stations
15
Major Contributions (contd.)
Keeping Results
- Users can save and print user-specified results
Let’s go to the DEMO!
16
Validation Methodology
Experiments with three different data sets
Data set         Most outliers found at station
19970115N.dat    24
19970116N.dat    24
19970125N.dat    124
17
Assumptions
The data format is fixed
- The original data consists of traffic volume and occupancy.
- Outlier detection is based on volume.
- Data format:
  @relation 19970115N
  @station 150
  @timeslot 288
  1 3 4 7 45 100 ….
Users are familiar with statistical concepts (e.g., confidence interval, C.I.)
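A reader for the data format listed above might look like the following sketch. The slide shows only the three header lines and the start of the values, so the layout assumed here (whitespace-separated volume values, one value per station per time slot, stored time slot by time slot) is a guess, and `parse_traffic_file` is a hypothetical name.

```python
def parse_traffic_file(text):
    """Parse '@' header lines and the numeric values that follow.

    Returns (header, rows) where rows[t][s] is the value for time slot t
    and station s. The slot-by-slot layout is an assumption.
    """
    header = {}
    numbers = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # e.g. "@station 150" -> header["station"] = "150"
            key, _, value = line[1:].partition(" ")
            header[key] = value.strip()
        else:
            numbers.extend(float(token) for token in line.split())
    stations = int(header["station"])
    timeslots = int(header["timeslot"])
    if len(numbers) != stations * timeslots:
        raise ValueError("expected %d values, found %d"
                         % (stations * timeslots, len(numbers)))
    rows = [numbers[t * stations:(t + 1) * stations] for t in range(timeslots)]
    return header, rows

# Tiny made-up file: 3 stations, 2 time slots.
sample = "@relation demo\n@station 3\n@timeslot 2\n1 3 4\n7 45 100\n"
header, rows = parse_traffic_file(sample)
print(header["relation"], rows)
```

Rejecting non-numeric tokens (via `float`) matches the constraint that the data set should be numeric.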
18
Conclusion
Added one more package to Weka to find sets of spatial outliers
Shows results visually
- in a user interface similar to that of Weka
- by top k outliers query processing
- by providing visualization of outliers
- by allowing users to keep user-specified results
19
Future work
Upgrade to allow various file formats and data types
Experiment with different outlier detection algorithms to find a more efficient one
Add more spatial data mining options
- e.g., SAR (Spatial Auto-Regression), co-location
20
Thanks!
21