Interactive Workshop – ESPON Database 3


The identification of exceptional
values in the ESPON database
Paul Harris
Martin Charlton
National Centre for Geocomputation
NUIM Maynooth Ireland
Madrid seminar - 10/6/10
Outline
1. ESPON DB data
2. Identifying exceptional values
3. Case study 1 (detecting logical input errors)
4. Case study 2 (detecting statistical outliers)
5. Next things to do…
1. ESPON DB data

Socio-economic, land cover,…

Continuous, categorical, nominal, ordinal,….
Spatial support:
Area units – NUTS 0/1/2/23/3
(whose boundaries may also change over time)

Temporal support:
Commonly, yearly units (with only a short time series)

2. Identifying exceptional values
Define two types:
1. Logical input errors
(e.g. a negative unemployment rate)
2. Statistical outliers
(e.g. an unusually high unemployment rate)
Two-stage identification algorithm:
Stage 1: identify input errors via mechanical techniques
Stage 2: identify outliers via statistical techniques
Stage 1:
Identify logical Input Errors
Logical input errors…

Usually detected using some logical, mathematical approach

Statistical detection may also help…
Typical input errors:

Impossible values (e.g. negatives, fractions…)

Repeated data for different variables

Data displaced between or within columns

Data swapped between or within columns

Wrong NUTS code or name

Wrong NUTS regions used (e.g. for 1999 instead of 2006)

Missing value code (e.g. 9999 treated as a true value)

Etc.
Our approach…
Detect input errors mathematically (& statistically)

Flag observations if they are likely input errors

If possible - correct them

More likely - consult an expert on the data

Once happy - go to stage 2 - assume data is error-free
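A minimal sketch of the kind of mechanical check stage 1 relies on, assuming a rate stored in percent. The function name `flag_input_errors`, the `over_100` check and the sample values are illustrative assumptions, not the authors' implementation; 9999 is the example missing-value code from the list above.

```python
import numpy as np

MISSING_CODE = 9999.0  # example code from the list above; real codes vary by dataset

def flag_input_errors(rate_percent):
    """Return one boolean mask per logical check (True = suspect value)."""
    x = np.asarray(rate_percent, dtype=float)
    is_missing_code = x == MISSING_CODE
    return {
        "missing_code": is_missing_code,           # code stored as if it were data
        "negative": x < 0,                         # impossible for a rate
        "over_100": (x > 100) & ~is_missing_code,  # impossible for a percentage
    }

# Made-up values for illustration only.
flags = flag_input_errors([5.2, -3.1, 9999, 140.0, 7.8])

# An observation is suspect if any single check flags it.
suspect = np.logical_or.reduce(list(flags.values()))
```

Keeping one mask per check, rather than a single combined flag, lets the expert see why a value was flagged before deciding whether to correct it.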

Stage 2:
Identify statistical outliers
Types of outliers….
Our approach…
There is no single ‘best’ outlier detection technique, so…

Apply a representative selection of outlier detection
techniques (which are simple & robust)

Flag an observation if it is a likely outlier according to each
technique

Build up a weight of evidence for the likelihood of a given
observation being statistically outlying

Suggest what type of outlier it is likely to be
- aspatial, spatial, temporal, relationship, some mixture…

Consult an expert on the data to decide on the appropriate
course of action

Here’s an example using nine techniques & three observations…

Identification technique                                   Identification type
1. Boxplot statistics                                      Aspatial & univariate
2. Hawkins’ spatial test statistic                         Spatial & univariate
3. Time series statistics                                  Temporal & univariate
4. Large residuals from multiple linear regression*        Aspatial & multivariate, linear relationships
5. Large residuals from locally weighted regression*       Aspatial & multivariate, nonlinear relationships
6. Large residuals from geographically weighted regression*  Spatial & multivariate, nonlinear relationships
7. Principal component analysis*                           Aspatial & multivariate, linear relationships
8. Locally weighted principal component analysis*          Aspatial & multivariate, nonlinear relationships
9. Geographically weighted principal component analysis*   Spatial & multivariate, nonlinear relationships

* Can have a spatial, univariate form if the coordinate data are used as variables

[Columns Obs. 1, Obs. 2 and Obs. 3 carry a ‘Yes’ against each technique that flags the observation; the individual cell positions are not recoverable in this transcript.]
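The weight-of-evidence idea can be sketched as a simple tally: each technique contributes one flag per observation, and the share of techniques that flag an observation is its score. The flag matrix below is made up for illustration (it is not the table from the slide), and `weight_of_evidence` is a hypothetical name.

```python
import numpy as np

# Rows = observations, columns = detection techniques (True = flagged).
# Illustrative values only; they do not reproduce the slide's table.
flags = np.array([
    [True,  False, True,  True],
    [True,  True,  False, False],
    [False, False, False, True],
])

def weight_of_evidence(flags):
    """Share of techniques that flag each observation (0 = none, 1 = all)."""
    return np.asarray(flags, dtype=bool).mean(axis=1)

scores = weight_of_evidence(flags)
```

Which columns fire (aspatial, spatial, temporal, relationship-based) then hints at the type of outlier, which is the information passed on to the data expert.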
3. Case study 1 (detecting logical input errors)
Data
Data at NUTS3 level (1351 observations/regions)
Variables:
- GDP evolution (2000 to 2005) (%age)
- Calculated using 4 other variables:
E0005



GDP2005 POP2005 GDP2005 POP2000



GDP2000 POP2000 GDP2000 POP2005
205 logical input errors deliberately introduced to:
NUTS codes & the 4 variables used to calculate GDP
evolution only
~ 15% of data infected
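Because GDP evolution is derived from the four other variables, one mechanical check is to recompute it and compare against the stored value. This is a sketch under the assumption that E0005 is the ratio of GDP per head in 2005 to GDP per head in 2000; the function names, the tolerance and the sample numbers are hypothetical.

```python
import numpy as np

def gdp_evolution(gdp2000, pop2000, gdp2005, pop2005):
    """Assumed definition: GDP per head in 2005 divided by GDP per head in 2000."""
    return (gdp2005 / pop2005) / (gdp2000 / pop2000)

def inconsistent(reported, gdp2000, pop2000, gdp2005, pop2005, rtol=1e-6):
    """True where the stored value disagrees with the recomputed one."""
    recomputed = gdp_evolution(gdp2000, pop2000, gdp2005, pop2005)
    return ~np.isclose(reported, recomputed, rtol=rtol)

# Two made-up regions; the second stored value has been corrupted.
gdp2000 = np.array([100.0, 200.0])
pop2000 = np.array([10.0, 20.0])
gdp2005 = np.array([120.0, 200.0])
pop2005 = np.array([10.0, 25.0])
reported = np.array([1.2, 1.8])

mismatch = inconsistent(reported, gdp2000, pop2000, gdp2005, pop2005)
```

A mismatch says only that one of the five values is wrong, not which one, so it is a flag for follow-up rather than a correction.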
Performance results
False negatives - 13.2% (e.g. in Italy)
False positives - 2.0% (e.g. in Spain)
Overall misclassification rate - 3.7%
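The three rates can be cross-checked against each other. The sketch below assumes the false-negative rate is measured over the 205 infected regions and the false-positive rate over the remaining clean regions; under that reading the slide's figures combine to the quoted overall rate.

```python
n_regions = 1351    # NUTS3 regions in the case study
n_infected = 205    # regions given a deliberate input error
n_clean = n_regions - n_infected

false_negative_rate = 0.132  # assumed: share of infected regions that were missed
false_positive_rate = 0.020  # assumed: share of clean regions wrongly flagged

# Misclassified regions as a share of all regions.
overall = (false_negative_rate * n_infected
           + false_positive_rate * n_clean) / n_regions

print(f"{overall:.1%}")  # prints 3.7%
```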
Consequences if we had ignored input
errors….
4. Case study 2
(detecting statistical outliers)
Data
Data at NUTS23 level for eight years: 2000-2007

For each year - ‘unemployment rate’ calculated
[(Unemployed population)/(Active population)]

8 variables at each of 790 regions = 6320 obs.

Data checked for input errors - i.e. stage 1 done
Presentation of results…



For brevity…
Let’s say - we only need at least one of 8
time-specific unemployment values in a
region to be outlying…
(But we can identify outliers by year too)
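The "at least one of 8 years" rule amounts to collapsing a regions-by-years table of flags along the year axis. A minimal sketch, with a hypothetical function name and made-up flags:

```python
import numpy as np

def region_is_outlying(flags_by_year):
    """(regions, years) boolean array -> one boolean per region."""
    return np.asarray(flags_by_year, dtype=bool).any(axis=1)

# Three made-up regions over eight years (2000-2007).
flags_by_year = np.array([
    [False] * 8,                                            # never flagged
    [False, False, True, False, False, False, False, False],  # flagged once
    [True] * 8,                                             # flagged every year
])

region_flags = region_is_outlying(flags_by_year)
```

Keeping the full `flags_by_year` table alongside the collapsed result preserves the by-year view mentioned above.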
Results: 1 boxplot statistics
(aspatial & univariate)
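As a reminder of the boxplot rule, the sketch below flags values lying more than 1.5 interquartile ranges outside the quartiles. The 1.5 multiplier is the conventional default and is an assumption here; the slides do not state which setting was used.

```python
import numpy as np

def boxplot_outliers(values, whisker=1.5):
    """Flag values beyond the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - whisker * iqr) | (x > q3 + whisker * iqr)

# Made-up unemployment rates (%); only the last value is unusually high.
rates = [4, 5, 5, 6, 6, 7, 7, 8, 30]
is_outlier = boxplot_outliers(rates)
```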
Results: 2 Hawkins’ test
(spatial & univariate)
Results: 3 time series statistics
(temporal & univariate)
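The slides do not say which time series statistics were used. One simple and robust option for a short series is to score each year against the region's own median, scaled by the median absolute deviation (MAD); the sketch below shows that swapped-in technique, with a hypothetical function name and threshold.

```python
import numpy as np

def temporal_outliers(series, threshold=3.0):
    """Flag years far from each region's own median, in robust (MAD) units.

    series: (regions, years) array, or a single series of years.
    """
    x = np.atleast_2d(np.asarray(series, dtype=float))
    med = np.median(x, axis=1, keepdims=True)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = 1.4826 * np.median(np.abs(x - med), axis=1, keepdims=True)
    mad = np.where(mad == 0, np.nan, mad)  # constant series: nothing is flagged
    return np.abs(x - med) / mad > threshold

# One made-up region over eight years; the fifth year is a spike.
flags = temporal_outliers([5.0, 5.2, 4.9, 5.1, 12.0, 5.0, 5.3, 4.8])
```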
Results: 4 MLR residuals
(aspatial linear relationships)
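A minimal sketch of flagging large residuals from a multiple linear regression, fitted here by ordinary least squares. Residuals are scaled by the residual standard error; the threshold of 2 and the function name are assumptions, and a fuller treatment might use studentised residuals instead.

```python
import numpy as np

def large_mlr_residuals(X, y, threshold=2.0):
    """Flag observations whose scaled OLS residual exceeds the threshold."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])  # intercept + predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma = np.sqrt(resid @ resid / dof)  # residual standard error
    return np.abs(resid / sigma) > threshold

# Made-up data on an exact line, with one response value corrupted.
x = np.arange(20.0)
y = 2.0 + 3.0 * x
y[7] += 50.0
flags = large_mlr_residuals(x, y)
```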
Results: 5 LWR residuals
(aspatial nonlinear relationships)
Results: 6 GWR residuals
(spatial nonlinear relationships)
Results: 7 PCA residuals
(aspatial linear relationships & model-free)
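One common reading of "PCA residuals" is the reconstruction error left after projecting each observation onto the leading components; observations that break the dominant linear relationship have large residuals. The sketch below follows that reading, which is an assumption, as are the function name and the choice of one retained component.

```python
import numpy as np

def pca_residual_norms(X, n_components=1):
    """Distance of each standardised observation from the leading PCA subspace."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardise each variable
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # rows of Vt are the components
    V = Vt[:n_components].T
    reconstruction = Z @ V @ V.T
    return np.linalg.norm(Z - reconstruction, axis=1)

# Two made-up variables in a fixed ratio, except for observation 4.
x = np.arange(10.0)
X = np.column_stack([x, 2.0 * x])
X[4, 1] += 10.0
norms = pca_residual_norms(X)
```

A threshold still has to be chosen for `norms`, for example by applying the boxplot rule to the residual norms themselves.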
Results: 8 LWPCA residuals
(aspatial nonlinear relationships & model-free)
Results: 9 GWPCA residuals
(spatial nonlinear relationships & model-free)
Summary of results: weight of evidence
Preliminary performance results







Infected ~ 5% of the data with ‘outliers’ &
repeated the analysis on this ‘infected’ data…
False negatives: 10.3%
False positives: 34.3%
Overall misclassification rate: 26.1%
Problems:
Difficult to guarantee that our infections actually
produce outliers…
The data already contains outliers (as shown)
5. Next things to do…
1. Other ways of performance testing our approach
- Simulated data with known properties?
- Statistical theory (or properties)?
2. Refining each of our nine chosen techniques
- Robust extensions
Thank You!