Interactive Workshop – ESPON Database 3
The identification of exceptional values in the ESPON database
Paul Harris
Martin Charlton
National Centre for Geocomputation
NUIM Maynooth Ireland
Madrid seminar - 10/6/10
Outline
1. ESPON DB data
2. Identifying exceptional values
3. Case study 1 (detecting logical input errors)
4. Case study 2 (detecting statistical outliers)
5. Next things to do…
1. ESPON DB data
Socio-economic, land cover,…
Continuous, categorical, nominal, ordinal,….
Spatial support:
Area units – NUTS 0/1/2/23/3
(whose boundaries may also change over time)
Temporal support:
Commonly, yearly units (with only a short time series)
2. Identifying exceptional values
Define two types:
1. Logical input errors (e.g. a negative unemployment rate)
2. Statistical outliers (e.g. an unusually high unemployment rate)
Two-stage identification algorithm:
Stage 1: identify input errors via mechanical techniques
Stage 2: identify outliers via statistical techniques
Stage 1:
Identify logical Input Errors
Logical input errors…
Usually detected using some logical, mathematical approach
Statistical detection may also help…
Typical input errors:
Impossible values (e.g. negatives, fractions…)
Repeated data for different variables
Data displaced between or within columns
Data swapped between or within columns
Wrong NUTS code or name
Wrong NUTS regions used (e.g. for 1999 instead of 2006)
Missing value code (e.g. 9999 treated as a true value)
Etc.
Our approach…
Detect input errors mathematically (& statistically)
Flag observations if they are likely input errors
If possible - correct them
More likely - consult an expert on the data
Once happy - go to stage 2 - assume data is error-free
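A minimal sketch of the mechanical stage 1 checks, assuming a few simple rules drawn from the list of typical input errors above. The field names (`nuts_code`, `unemployment_rate`), the sentinel set and the example records are hypothetical, not taken from the ESPON database.

```python
# Assumed sentinel values; the slides mention 9999 as an example.
MISSING_CODES = {9999, -9999}


def flag_input_errors(records, rate_field="unemployment_rate"):
    """Return {nuts_code: [reasons]} for records failing simple logical checks."""
    flags = {}
    seen_codes = set()
    for rec in records:
        reasons = []
        code = rec["nuts_code"]
        value = rec[rate_field]
        # A NUTS code that appears twice suggests a wrong or repeated code.
        if code in seen_codes:
            reasons.append("duplicate NUTS code")
        seen_codes.add(code)
        # Impossible values for a percentage rate.
        if value in MISSING_CODES:
            reasons.append("missing-value code treated as data")
        elif value < 0:
            reasons.append("negative rate")
        elif value > 100:
            reasons.append("rate above 100%")
        if reasons:
            flags.setdefault(code, []).extend(reasons)
    return flags


# Hypothetical records, for illustration only.
records = [
    {"nuts_code": "XX001", "unemployment_rate": 7.4},
    {"nuts_code": "XX002", "unemployment_rate": -3.1},
    {"nuts_code": "XX003", "unemployment_rate": 9999},
    {"nuts_code": "XX001", "unemployment_rate": 6.0},
]
flags = flag_input_errors(records)
```

Flagged records would then go to a data expert rather than being corrected automatically.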
Stage 2:
Identify statistical outliers
Types of outliers….
Our approach…
There is no single ‘best’ outlier detection technique, so…
Apply a representative selection of outlier detection
techniques (which are simple & robust)
Flag an observation if it is a likely outlier according to each
technique
Build up a weight of evidence for the likelihood of a given
observation being statistically outlying
Suggest what type of outlier it is likely to be
- aspatial, spatial, temporal, relationship, some mixture…
Consult an expert on the data to decide on the appropriate
course of action
Here’s an example using nine techniques & three
observations…
Identification technique | Identification type | Flagged (Obs. 1 / Obs. 2 / Obs. 3)
1. Boxplot statistics | Aspatial & univariate | Yes, Yes
2. Hawkins' spatial test statistic | Spatial & univariate | Yes
3. Time series statistics | Temporal & univariate | Yes, Yes
4. Large residuals from multiple linear regression* | Aspatial & multivariate, linear relationships | Yes, Yes
5. Large residuals from locally weighted regression* | Aspatial & multivariate, nonlinear relationships | Yes
6. Large residuals from geographically weighted regression* | Spatial & multivariate, nonlinear relationships | Yes
7. Principal component analysis* | Aspatial & multivariate, linear relationships | Yes
8. Locally weighted principal component analysis* | Aspatial & multivariate, nonlinear relationships | Yes
9. Geographically weighted principal component analysis* | Spatial & multivariate, nonlinear relationships | Yes
* Can have a spatial, univariate form if the coordinate data are used as variables
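The weight-of-evidence idea can be sketched as a simple vote count: each technique contributes one vote per observation it flags, and the types of the techniques that voted hint at the kind of outlier. This is an illustrative sketch only; the technique names, type labels and flagged observations below are hypothetical and do not reproduce the example above.

```python
def weight_of_evidence(flags_by_technique, technique_type):
    """Summarise, per observation, how many techniques flagged it and of which type."""
    summary = {}
    for technique, flagged in flags_by_technique.items():
        for obs in flagged:
            entry = summary.setdefault(obs, {"votes": 0, "types": set()})
            entry["votes"] += 1
            entry["types"].add(technique_type[technique])
    return summary


# Hypothetical mapping of techniques to outlier types.
technique_type = {
    "boxplot": "aspatial",
    "hawkins": "spatial",
    "time_series": "temporal",
}

# Hypothetical flags: which observations each technique marked as outlying.
flags_by_technique = {
    "boxplot": {"obs_a", "obs_b"},
    "hawkins": {"obs_a"},
    "time_series": {"obs_a", "obs_c"},
}

summary = weight_of_evidence(flags_by_technique, technique_type)
# Rank observations by number of votes (ties broken by name).
ranked = sorted(summary, key=lambda obs: (-summary[obs]["votes"], obs))
```

An equal vote per technique is an assumption here; techniques could also be weighted differently.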
3. Case study 1 (detecting logical input errors)
Data
Data at NUTS3 level (1351 observations/regions)
Variables:
GDP evolution (2000 to 2005) (%age)
Calculated using 4 other variables:
E0005 = (GDP2005 / POP2005) / (GDP2000 / POP2000) = (GDP2005 × POP2000) / (GDP2000 × POP2005)
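A minimal sketch of the calculation, reading the slide's formula as GDP per capita in 2005 divided by GDP per capita in 2000. Any further scaling to a percentage is not shown on the slide, so the sketch returns the plain ratio. The consistency check against a reported value, the function names and the tolerance are assumptions for illustration.

```python
import math


def gdp_evolution(gdp2000, pop2000, gdp2005, pop2005):
    """Ratio of GDP per capita in 2005 to GDP per capita in 2000."""
    return (gdp2005 * pop2000) / (gdp2000 * pop2005)


def is_consistent(reported, gdp2000, pop2000, gdp2005, pop2005, rel_tol=1e-6):
    """Check a reported evolution value against the four input variables."""
    # Non-positive GDP or population is treated as a logical input error.
    if min(gdp2000, pop2000, gdp2005, pop2005) <= 0:
        return False
    expected = gdp_evolution(gdp2000, pop2000, gdp2005, pop2005)
    return math.isclose(reported, expected, rel_tol=rel_tol)


# Hypothetical figures, for illustration only.
ratio = gdp_evolution(gdp2000=100.0, pop2000=50.0, gdp2005=150.0, pop2005=60.0)
ok = is_consistent(1.25, 100.0, 50.0, 150.0, 60.0)
bad = is_consistent(1.25, 100.0, 50.0, 150.0, -60.0)
```

A mismatch between the reported and recomputed value points to an error in one of the four inputs, though not to which one.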
205 logical input errors deliberately introduced to:
NUTS codes & the 4 variables used to calculate GDP
evolution only
~ 15% of data infected
Performance results
False negatives - 13.2% (e.g. in Italy)
False positives - 2.0% (e.g. in Spain)
Overall misclassification rate - 3.7%
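The three rates can be computed from a list of known (introduced) errors and a list of flags. The definitions below are an assumption: false negatives as a share of introduced errors, false positives as a share of clean observations, and misclassification as a share of all observations. They are arithmetically consistent with the reported figures (about 27 of 205 errors missed and about 23 of 1146 clean observations flagged gives roughly 50/1351 ≈ 3.7%). The toy labels are hypothetical.

```python
def error_rates(truth, flagged):
    """Compute detection error rates from two equal-length lists of booleans.

    truth[i]   -- True if observation i really contains an introduced error
    flagged[i] -- True if the detection procedure flagged observation i
    """
    tp = sum(t and f for t, f in zip(truth, flagged))
    fn = sum(t and not f for t, f in zip(truth, flagged))
    fp = sum(f and not t for t, f in zip(truth, flagged))
    tn = sum(not t and not f for t, f in zip(truth, flagged))
    return {
        "false_negative_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
        "misclassification_rate": (fn + fp) / len(truth),
    }


# Hypothetical toy labels: 4 true errors, 8 clean observations.
truth = [True] * 4 + [False] * 8
flagged = [True, True, True, False] + [True, True] + [False] * 6
rates = error_rates(truth, flagged)
```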
Consequences if we had ignored input
errors….
4. Case study 2
(detecting statistical outliers)
Data
Data at NUTS23 level for eight years: 2000-2007
For each year - ‘unemployment rate’ calculated
[(Unemployed population)/(Active population)]
8 variables at each of 790 regions = 6320 obs.
Data checked for input errors - i.e. stage 1 done
Presentation of results…
For brevity…
Let's say a region counts as outlying if at least
one of its 8 year-specific unemployment values
is outlying…
(But we can identify outliers by year too)
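A sketch of this presentation rule combined with the simplest of the nine techniques, boxplot statistics. It assumes the usual Tukey fences (1.5 × IQR beyond the quartiles) applied across regions within each year; the slides do not state the exact boxplot rule used, and the regions and rates below are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Flag values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def regions_with_any_outlier(rates_by_year):
    """Return regions whose rate is a boxplot outlier in at least one year.

    rates_by_year -- {year: {region: unemployment rate}}
    """
    outlying = set()
    for rates in rates_by_year.values():
        regions = list(rates)
        flags = boxplot_outliers([rates[r] for r in regions])
        outlying.update(r for r, is_out in zip(regions, flags) if is_out)
    return outlying


# Hypothetical rates (%) for six regions in two years.
rates_by_year = {
    2000: {"R1": 5.0, "R2": 6.0, "R3": 5.5, "R4": 6.5, "R5": 5.8, "R6": 25.0},
    2001: {"R1": 5.2, "R2": 0.5, "R3": 5.6, "R4": 6.4, "R5": 5.9, "R6": 6.1},
}
outlying_regions = regions_with_any_outlier(rates_by_year)
```

Keeping the per-year flags instead of collapsing them into a set would give the by-year view mentioned above.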
Results: 1 boxplot statistics
(aspatial & univariate)
Results: 2 Hawkins’ test
(spatial & univariate)
Results: 3 time series statistics
(temporal & univariate)
Results: 4 MLR residuals
(aspatial linear relationships)
Results: 5 LWR residuals
(aspatial nonlinear relationships)
Results: 6 GWR residuals
(spatial nonlinear relationships)
Results: 7 PCA residuals
(aspatial linear relationships & model-free)
Results: 8 LWPCA residuals
(aspatial nonlinear relationships & model-free)
Results: 9 GWPCA residuals
(spatial nonlinear relationships & model-free)
Summary of results: weight of evidence
Preliminary performance results
Infected ~ 5% of the data with ‘outliers’ &
repeated the analysis on this ‘infected’ data…
False negatives: 10.3%
False positives: 34.3%
Overall misclassification rate: 26.1%
Problems:
Difficult to guarantee that our infections actually
produce outliers…
The data already contains outliers (as shown)
5. Next things to do…
1. Other ways of performance testing our approach
Simulated data with known properties?
Statistical theory (or properties)?
2. Refining each of our nine chosen techniques
Robust extensions
Thank You!