April27`04project_present


Spatial Outlier Detection and Implementation in Weka
Implemented by:
Shan Huang
Jisu Oh
CSCI8715 Class Project, April 27 2004
Presented by Jisu Oh (Group 2)
Slides Available at
http://www.users.cs.umn.edu/~joh/csci8715/HW-list.htm
Topics:
- Motivation
- Problem Statement
- Key Concepts
- Major Contributions
- Validation Methodology
- Assumptions
- Conclusions
- Future work

Motivation

Machine learning / Data mining
- Enables a computer program to analyze large-scale data
- Identifies important information that can be used to make predictions or decisions faster and more accurately

Motivation (contd.)

Weka
- A collection of machine learning algorithms for solving real-world data mining problems
- Provides data mining functions (e.g., regression, association rules, and clustering algorithms)
- Limitation: operates on traditional non-spatial databases

Problem Statement

Input: data set
- Minneapolis/St. Paul traffic data set

Output: detected outliers as
- Plain text (time slot, time, station, Zs(x))
- Overall traffic volume
- Neighbor relationship graph between stations

Problem Statement (contd.)

Constraints
- Algorithm from the paper “A Unified Approach to Detecting Spatial Outliers”
- Data set should be numeric

Objective
- To find sets of spatial outliers and show the results visually

Key Concepts

Spatial outliers
- Definition: spatially referenced objects whose non-spatial attribute values are significantly different from those of their neighbors.
- Example: a new house in an old neighborhood of a growing metropolitan area.
- In this project, an outlier is a station whose volume is high compared to its neighboring stations at a certain time slot.

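Key Concepts (contd.)

Algorithm
- Proposed in the paper “A Unified Approach to Detecting Spatial Outliers” by S. Shekhar, C. T. Lu, and P. Zhang
- $S(x) = f(x) - E_{y \in N(x)}(f(y))$: the difference between
  - $f(x)$, the attribute value of the sensor located at $x$, and
  - $E_{y \in N(x)}(f(y))$ (abbreviated $E_y$), the average attribute value of $x$'s neighbors
- $Z_{S(x)} = |(S(x) - \mu_s) / \sigma_s| > \theta$: the spatial statistic (written Zs(x) in the output), where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$, and $\theta$ is the z-score for a user-specified confidence interval
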
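The two statistics above can be sketched in a few lines of Python. This is only an illustration, not the project's Weka code: the function name `spatial_outliers`, the dictionary-based neighbor lists, and the sample readings are assumptions made for the example, and the population standard deviation is assumed for σs.

```python
from statistics import mean, pstdev

def spatial_outliers(values, neighbors, theta=2.0):
    """Return {location: z} for locations whose z-score exceeds theta.

    values    -- dict mapping location -> attribute value f(x)
    neighbors -- dict mapping location -> list of neighboring locations N(x)
    """
    # S(x) = f(x) minus the average of f over the neighbors of x
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())  # assumption: population standard deviation
    if sigma_s == 0:
        return {}  # all differences are identical, so nothing stands out
    z = {x: abs((s[x] - mu_s) / sigma_s) for x in s}
    return {x: zx for x, zx in z.items() if zx > theta}

# Hypothetical readings: ten stations along a road, each adjacent to the next.
readings = dict(enumerate([5, 6, 4, 5, 100, 6, 5, 4, 6, 5]))
adjacent = {i: [j for j in (i - 1, i + 1) if j in readings] for i in readings}
print(spatial_outliers(readings, adjacent))  # only station 4 is flagged
```

Note that the neighbors of the flagged station also get large negative S(x) values, because the outlier inflates their neighborhood average; with θ = 2 they stay below the threshold in this sample.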
Key Concepts (contd.)

Algorithm (example)
[Figure: grid of sensor readings, shown before and after the outlier is replaced; one sensor reads 100 and its neighbors read 2 and 8]
- $S(x) = f(x) - E_y = 100 - (2 + 8)/2 = 95$
- $\mu_s$ (mean of $S(x)$): 0.22; $\sigma_s$: 23.8
- $Z_{S(x)} = |S(x) - \mu_s| / \sigma_s = 3.98$
- Z-score for 95% C.I. = 2; since 3.98 > 2, the reading 100 is an outlier
- The outlier is replaced by $E_y$: 100 -> 5

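The arithmetic of this example can be checked directly. The snippet below is a sketch that reuses only the numbers quoted on the slide (the reading 100, the neighbor values 2 and 8, the mean 0.22 and σs 23.8); the variable names are assumptions. It also computes the two-sided 95% z-score, which is about 1.96 and is presumably what the slide rounds to 2.

```python
from statistics import NormalDist

f_x = 100                      # reading at the suspect sensor
neighbor_values = [2, 8]       # readings of its neighbors
mean_s, sigma_s = 0.22, 23.8   # mean and standard deviation of S(x), as quoted

e_y = sum(neighbor_values) / len(neighbor_values)  # neighborhood average
s_x = f_x - e_y                                    # S(x)
z = abs(s_x - mean_s) / sigma_s                    # Zs(x)

# Two-sided 95% confidence level: the 97.5th percentile of the standard normal.
theta = NormalDist().inv_cdf(0.975)

print(e_y, s_x, round(z, 2), round(theta, 2), z > theta)
# prints: 5.0 95.0 3.98 1.96 True

# Replace the outlier by the neighborhood average, as on the slide.
if z > theta:
    f_x = e_y
```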
Major Contributions

- Top-k outlier query processing
- User interface similar to Weka's
- Visualization of outliers
  - Plain text (time slot, time, station, Zs(x))
  - Overall traffic volume
  - Neighbor relationship graph between stations
- Keeping user-specified results

Major Contributions (contd.)

Top k outliers query processing
Fig.1. Top 3 outliers from dataset 19970115N.dat
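A top-k query over the computed scores can be sketched as follows. The function name `top_k_outliers`, the (station, time slot) keys, and the sample scores are hypothetical and are not taken from the project's data.

```python
import heapq

def top_k_outliers(z_scores, k):
    """Return the k (key, z) pairs with the largest z, highest first."""
    return heapq.nlargest(k, z_scores.items(), key=lambda item: item[1])

# Hypothetical scores keyed by (station, time slot).
scores = {(24, 97): 4.1, (24, 98): 3.7, (124, 12): 2.9, (7, 200): 1.2}
print(top_k_outliers(scores, 3))
# prints: [((24, 97), 4.1), ((24, 98), 3.7), ((124, 12), 2.9)]
```

`heapq.nlargest` avoids sorting the full score table when k is small relative to the number of (station, time slot) pairs.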
Major Contributions (contd.)

User Interface
Fig.2 User interface of the spatial outlier detection application vs. Weka

Major Contributions (contd.)

Visualization of outliers
Fig.3 Plain text results of detected outliers

Major Contributions (contd.)

Visualization of outliers
Detected outliers
Fig.4 Overall traffic volume and neighbor relationship graph between stations

Major Contributions (contd.)

Keeping results
- Users can save and print user-specified results

Let’s go to the demo!

Validation Methodology

Experiments with three different data sets

Data set         Most outliers found at station
19970115N.dat    24
19970116N.dat    24
19970125N.dat    124

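The "most outliers found at station" figure can be obtained by tallying detected outliers per station. The sketch below assumes outliers are available as (station, time slot) pairs; the function name and the sample list are hypothetical and do not reproduce the experiment's data.

```python
from collections import Counter

def station_with_most_outliers(outliers):
    """Return (station, count) for the station with the most outliers, or None."""
    counts = Counter(station for station, _timeslot in outliers)
    return counts.most_common(1)[0] if counts else None

# Hypothetical detections as (station, time slot) pairs.
detected = [(24, 97), (24, 98), (124, 12), (24, 150)]
print(station_with_most_outliers(detected))  # prints: (24, 3)
```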
Assumptions

- The data format is fixed
  - The original data consists of traffic volume and occupancy.
  - Outlier detection is based on volume.
  - Data format:
    @relation 19970115N
    @station 150
    @timeslot 288
    1 3 4 7 45 100 ….
- Users are familiar with statistical concepts (e.g., confidence interval, C.I.)

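A reader for the data format above might look like the following sketch. It assumes that header lines have the form `@key value` and that the remaining lines hold whitespace-separated integer volumes, one per station per time slot; the slide does not show how the values are laid out across lines, so the parser returns them as a flat list. The function name and the tiny sample file are hypothetical.

```python
def parse_traffic_file(text):
    """Parse '@key value' header lines and whitespace-separated integer readings."""
    header, values = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line, e.g. "@station 150" -> header["station"] = "150"
            key, _, value = line[1:].partition(" ")
            header[key] = value.strip()
        else:
            values.extend(int(token) for token in line.split())
    return header, values

# Hypothetical miniature file: 2 stations, 3 time slots.
sample = """@relation demo
@station 2
@timeslot 3
1 3 4
7 45 100
"""
header, values = parse_traffic_file(sample)
# Assumption: one volume per station per time slot.
expected = int(header["station"]) * int(header["timeslot"])
print(header["relation"], len(values) == expected)  # prints: demo True
```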
Conclusion

- Added a package to Weka that finds sets of spatial outliers
- Shows results visually in a user interface similar to Weka's, by
  - top-k outlier query processing
  - providing visualization of outliers
  - allowing users to keep user-specified results

Future work



- Support additional file formats and data types
- Experiment with different outlier detection algorithms to find a more efficient one
- Add more spatial data mining options, e.g., SAR (Spatial Auto-Regression) and co-location

Thanks!