Mirror Outlier Detection in Foreign Trade Data


Markos Fragkakis
NTTS 2009
Introduction
• Foreign Trade (FT) data
• Improving FT data quality is essential
• Quality can be assessed along several dimensions (e.g. accuracy, timeliness, clarity)
• We focus on accuracy, using outlier detection
• Methods for outlier detection (e.g. threshold-based, model-based)
• Presentation of the Mirror Outlier Detection application

Methodology
• Univariate detection in time series (value, quantity, supplementary quantity)
• Median Absolute Deviation

$$ T_i = \frac{x_i - M_1}{c \, M_2}, \qquad M_2 = \operatorname{Median}_j \left( |x_j - M_1| \right) $$

• Robust
◦ median, not mean
◦ non-parametric
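
The MAD score above can be sketched in a few lines of Python. This is only an illustration, not the MOD implementation: it assumes M_1 is the series median, that c sits in the denominator, and it picks c = 1.4826 and a cut-off of 3 as common defaults. Neither value is given on the slide, and the sample series is made up.

```python
from statistics import median


def mad_scores(series, c=1.4826):
    """Return T_i = (x_i - M1) / (c * M2) for every observation.

    M1 is the median of the series, M2 the median absolute deviation.
    c = 1.4826 is an assumed scaling constant, not taken from the slides.
    """
    m1 = median(series)
    m2 = median([abs(x - m1) for x in series])
    if m2 == 0:
        # At least half of the values equal the median, so the score is undefined.
        raise ValueError("MAD is zero; scores are undefined")
    return [(x - m1) / (c * m2) for x in series]


def flag_outliers(series, threshold=3.0, c=1.4826):
    """Return (index, score) pairs whose |T_i| exceeds the assumed threshold."""
    scores = mad_scores(series, c)
    return [(i, t) for i, t in enumerate(scores) if abs(t) > threshold]


# Hypothetical monthly trade values with one obvious spike at index 5.
monthly_values = [100, 102, 98, 101, 99, 250, 100, 97]
flagged = flag_outliers(monthly_values)
```

Because both M_1 and M_2 are medians, the single spike barely moves them, which is the robustness property the slide refers to.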
Mirror Outlier Detection
• Characterization of outliers according to the mirror flow
• Possible outlier types:
◦ Green: outlier appears in mirror (same sign)
◦ Red: outlier does not appear in mirror
◦ Violet: outlier appears in mirror (opposite sign)
◦ Black: mirror series not present
◦ Pink: mirror series not present (confidentiality)
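
The colour scheme above amounts to a small decision rule. The sketch below is one possible encoding of it; the function name, the status labels ("present", "missing", "confidential") and the sign convention are assumptions made for illustration and are not taken from the MOD application.

```python
def classify_mirror_outlier(mirror_status, outlier_sign, mirror_sign=None):
    """Return the colour category of an outlier given its mirror flow.

    mirror_status: "present", "missing" or "confidential" (hypothetical labels).
    outlier_sign:  +1 or -1, direction of the outlier in the reported flow.
    mirror_sign:   +1 or -1 if the mirror flow has an outlier in the same
                   period, None if it has none.
    """
    if mirror_status == "confidential":
        return "pink"    # mirror series withheld for confidentiality
    if mirror_status == "missing":
        return "black"   # mirror series not present at all
    if mirror_sign is None:
        return "red"     # mirror exists but shows no outlier
    # Mirror shows an outlier: compare directions.
    return "green" if mirror_sign == outlier_sign else "violet"
```

For example, `classify_mirror_outlier("present", +1, -1)` yields `"violet"`, since the mirror flow shows an outlier of the opposite sign.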
Additional functionalities

• Outlier classification (error in dimension, not in observed values)
◦ Swapping of observations between series
◦ Copy of observations
◦ Time delay (hidden green outlier)
• Outlier detection in short series (product code changes)
• Reporting for
◦ Detected outliers per country (e-mailed)
◦ Summary reporting
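
The slides do not say how swaps, copies and time delays are recognised, so the following is a guess at plausible heuristics rather than a description of MOD. All function names, the MAD-based `is_outlier` helper, its constants and the `max_lag` parameter are hypothetical, and the sample series are made up.

```python
from statistics import median


def is_outlier(series, t, threshold=3.0, c=1.4826):
    """MAD-based check of observation t (threshold and c are assumed values)."""
    m1 = median(series)
    m2 = median([abs(x - m1) for x in series])
    if m2 == 0:
        # Assumption: with zero MAD, any value off the median counts as an outlier.
        return series[t] != m1
    return abs(series[t] - m1) / (c * m2) > threshold


def looks_like_swap(a, b, t):
    """True if both series have an outlier at t that vanishes once a[t] and b[t] are exchanged."""
    if not (is_outlier(a, t) and is_outlier(b, t)):
        return False
    a_fixed = a[:t] + [b[t]] + a[t + 1:]
    b_fixed = b[:t] + [a[t]] + b[t + 1:]
    return not is_outlier(a_fixed, t) and not is_outlier(b_fixed, t)


def looks_like_copy(a, b, t):
    """True if the same value appears in both series at t and is an outlier in at least one."""
    return a[t] == b[t] and (is_outlier(a, t) or is_outlier(b, t))


def delayed_mirror_match(t, sign, mirror_outliers, max_lag=2):
    """Return the lag of the nearest same-sign mirror outlier within max_lag periods.

    mirror_outliers maps period index -> outlier sign (+1 / -1).
    Returns None if no shifted match exists (lag 0 is the ordinary green case).
    """
    for lag in range(1, max_lag + 1):
        for candidate in (t + lag, t - lag):
            if mirror_outliers.get(candidate) == sign:
                return candidate - t
    return None


# Hypothetical series: the values at index 4 were recorded in each other's series.
series_a = [10, 11, 9, 10, 500, 10, 11]
series_b = [500, 498, 503, 499, 10, 501, 500]
```

Here `looks_like_swap(series_a, series_b, 4)` is true because exchanging the two values removes both outliers, and `delayed_mirror_match(4, +1, {5: +1})` returns 1, i.e. a red outlier that becomes a "hidden green" once a one-period delay is allowed.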
Example of detected outlier
Example of error due to swap
Error due to time delay
Technical Information
• MOD-DB has an RDBMS repository for storing outlier data (Oracle and MySQL are supported)
• Implemented in Java (portability, maintainability)
• Command Line Interface
• Performance issues
◦ Large volumes of data cause a bottleneck in the DB
◦ Storage is a concern (several GB per month)
Architecture
Proposal for new platform
• Use a multi-dimensional viewer
• Enable OLAP functions (slice, dice, roll-up, drill-down)
• Create dynamic charts from data
• Estimated variables (indices from raw outlier data)
• Data mining could be performed to draw inferences from the data
◦ Log-linear models
• Pinpointing of poor data involving high values
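
To make the OLAP vocabulary concrete, the sketch below applies slice and roll-up to a handful of outlier records in plain Python. The record layout, the dimension names and the sample rows (placeholder country codes "AA" and "BB") are invented for illustration; the proposed platform would presumably rely on a dedicated OLAP tool rather than code like this.

```python
from collections import Counter

# Hypothetical outlier records: one row per detected outlier.
outliers = [
    {"reporter": "AA", "product": "0101", "period": "2008-01", "colour": "red"},
    {"reporter": "AA", "product": "0101", "period": "2008-02", "colour": "green"},
    {"reporter": "AA", "product": "0202", "period": "2008-01", "colour": "red"},
    {"reporter": "BB", "product": "0101", "period": "2008-01", "colour": "red"},
]


def slice_records(records, **fixed):
    """Slice/dice: keep the records matching every fixed dimension value."""
    return [r for r in records if all(r[k] == v for k, v in fixed.items())]


def rollup(records, *dims):
    """Roll-up: count outliers aggregated over the chosen dimensions.

    Passing more dimensions gives the finer-grained (drill-down) view.
    """
    return Counter(tuple(r[d] for d in dims) for r in records)


red_only = slice_records(outliers, colour="red")
per_reporter = rollup(outliers, "reporter")
red_by_reporter_product = rollup(red_only, "reporter", "product")
```

Counts such as `per_reporter` are the kind of raw material from which the estimated indices mentioned above could be derived.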
Conclusions
• Use of mirror flow for outlier characterisation
• New features
• Improving quality
• Enables building a new platform for data exploration
• Expansion of MOD to FT data outside the EU and to other domains

Questions
Thank you for your attention