Errors and Data Quality Management


Geospatial errors can cause real-life problems!
CS 128/ES 228 - Lecture 14a
http://www.brownsmarina.com/fun.html
Data Quality Management
One management strategy …
Murphy’s Law
Ignoring data quality issues usually doesn’t work very well
Some geospatial goofs
This one’s worse…
Mars Climate Orbiter (MCO) was lost on 23 Sep 1999 when it failed to enter an orbit around Mars, instead crashing into the planet, destroying the $125 million craft, part of a $328 million mission
http://www.boeing.com/companyoffices/gallery/images/space/d2_mars_climate_orbiter_01.htm
The root cause of the failure was a computer program that was supposed to provide its output in newton seconds (N·s) but instead provided pound-force seconds (lbf·s).
http://lamar.colostate.edu/~hillger/unit-mixups.html#mco
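One way to guard against this kind of unit mix-up is to carry the unit along with every number and convert explicitly. The sketch below is only an illustration: the function name, the unit labels, and the sample reading are assumptions, not anything taken from the MCO software.

```python
# Exact by definition: 1 lbf = 0.45359237 kg * 9.80665 m/s^2
LBF_TO_NEWTON = 4.4482216152605


def impulse_to_newton_seconds(value: float, unit: str) -> float:
    """Convert an impulse to N·s, refusing values whose unit is unknown."""
    factors = {"N*s": 1.0, "lbf*s": LBF_TO_NEWTON}
    if unit not in factors:
        raise ValueError(f"unknown impulse unit: {unit!r}")
    return value * factors[unit]


# Hypothetical reading: the same number stands for very different impulses
# depending on which unit the producing program actually used.
reading = 100.0
as_si = impulse_to_newton_seconds(reading, "N*s")
as_imperial = impulse_to_newton_seconds(reading, "lbf*s")
print(f"lbf·s value is {as_imperial / as_si:.2f} times the N·s value")
```

A bare number passed between two programs carries no unit at all, so a mismatch of this size (a factor of about 4.45) can go unnoticed unless one side checks.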
And these are really bad!
Just a 'map error'?
The China Daily website carries a cartoon of the damaged US plane at Hainan Island's airbase and asks sarcastically if Sunday's collision "might be due to another map error" - a reference to the US bombing of the Chinese embassy in Belgrade in 1999. "Last time it's due to a map error, and this time another map error? What about the next?"
It might be due to another map error
China Daily
http://news.bbc.co.uk/1/hi/world/monitoring/media_reports/1260185.stm
What is error?

“Error is the physical difference between the real world and the GIS facsimile”
-Heywood, Cornelius, & Carver, p. 178

Errors are impossible to avoid, but can be managed
A Data Management Model
Data acquisition
Data representation & analysis
Data outputs
Data acquisition errors
Scientists use the term “error” for two very different concepts:
• natural variability
• actual mistakes
Take a sidewalk …
What’s its width? 1.77, 1.82, 1.69 … meters
a. “Error” (natural variability):
mean width = 1.76 m, range 1.69 - 1.82
b. “Error” (actual mistake):
mean = 1.67 ft
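The arithmetic behind case (a) can be checked in a few lines. This is a minimal sketch using only the three widths quoted on the slide; the variable names are illustrative.

```python
from statistics import mean

# The three sidewalk widths quoted above, in meters.
widths_m = [1.77, 1.82, 1.69]

mean_width = mean(widths_m)
low, high = min(widths_m), max(widths_m)
print(f"mean width = {mean_width:.2f} m, range {low:.2f} - {high:.2f}")
```

The spread between 1.69 and 1.82 is natural variability: every value is a legitimate measurement. A result such as case (b) is a different kind of "error", because taking more measurements would not make it go away.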
Accuracy vs. Precision
Figure 10.1, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver
Random error vs. Bias
Where does lack of precision come from?

Natural variability

Poor input assumptions

Imprecise equipment

Sloppy measurement

Accumulated error
Random error is often “normal”
[Figure: normal curve, labeled with the mean and the standard deviation]
About 95% of observations lie within ±2 s.d. of the mean
[Figure: normal curve marked at mean - 2 s.d., mean, and mean + 2 s.d.]
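The ±2 s.d. rule can be checked by simulation. The sketch below draws normally distributed values and counts how many land within two standard deviations of the mean; the mean, standard deviation, sample count, and seed are all made-up values chosen for illustration.

```python
import random

random.seed(228)  # fixed seed so the run is repeatable

# Assumed parameters, loosely modeled on the sidewalk example.
MEAN, SD, N = 1.76, 0.07, 100_000

samples = [random.gauss(MEAN, SD) for _ in range(N)]
within = sum(1 for x in samples if abs(x - MEAN) <= 2 * SD)
fraction = within / N
print(f"{fraction:.1%} of simulated observations fall within ±2 s.d.")
```

The printed share should come out close to 95%. The rule only holds when the random error really is approximately normal.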
Means have smaller variability than single measurements
S.E. (mean) = standard deviation / √n
If n = 4, √n = ?
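The standard-error formula is short enough to try directly. In this minimal sketch the function name and the 0.08 m standard deviation are assumptions for illustration.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)


# With n = 4, sqrt(n) = 2, so the mean varies half as much as one measurement.
sd_single = 0.08  # hypothetical standard deviation of a single measurement, in m
print(standard_error(sd_single, 4))
```

Because of the square root, halving the standard error again would take 16 measurements, not 8.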
Where does lack of accuracy come from?
Dubious source data
• Incompatible source data

Data collected at different times through different methods, possibly in different formats

Bias
How can we fix it?

Benchmarks
e.g. the National Geodetic Survey maintains a database of survey “monuments” at http://www.ngs.noaa.gov/cgi-bin/datasheet.prl
http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/USCGS-E134.jpg/617px-USCGSE134.jpg

Otherwise – just measure variability
Data representation errors

Transference error

Data storage errors

Analysis errors
Where does transference error come from?

Typos, etc.
• Less likely with automated data collection and transformation
• Can be prevented through diligence and software “sanity” checks

Format conversion
• Many inter-format conversions cause loss/corruption of data/information
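A software “sanity” check can be as simple as rejecting values that cannot possibly be right. The sketch below range-checks a latitude/longitude pair; the function name and the sample coordinates are hypothetical.

```python
def check_coordinate(lat: float, lon: float) -> list[str]:
    """Return a list of problems found; an empty list means the pair passed."""
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude {lat} is outside [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude {lon} is outside [-180, 180]")
    return problems


# A correctly keyed point, then the same point with a misplaced decimal.
print(check_coordinate(42.08, -78.48))
print(check_coordinate(420.8, -78.48))
```

A range check catches only gross typos. A value that is wrong but still plausible (say, 42.80 entered for 42.08) passes, which is why the slide pairs software checks with diligence.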
Something got lost in the translation

“geographic information systems is an interesting course”
“지리적인 정보 시스템은 재미있는 과정 이다 ”
“The geography information system is the process which is fun”
Thanks to http://babelfish.altavista.com/babelfish/tr
Raster/Vector conversions
Aliasing is an intrinsic problem of GIS’s
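Aliasing is easy to see by converting a simple vector shape to a raster and measuring what comes back. The sketch below rasterizes a circle by keeping every cell whose center falls inside it; that cell-center rule, the radius, and the cell sizes are assumptions chosen for illustration.

```python
import math


def rasterized_circle_area(radius: float, cell: float) -> float:
    """Total area of the grid cells whose centers lie inside a circle at the origin."""
    half = math.ceil(radius / cell)  # number of cells needed on each side of an axis
    count = 0
    for i in range(-half, half):
        for j in range(-half, half):
            x = (i + 0.5) * cell  # cell-center coordinates
            y = (j + 0.5) * cell
            if x * x + y * y <= radius * radius:
                count += 1
    return count * cell * cell


true_area = math.pi * 10.0 ** 2
for cell in (5.0, 1.0, 0.1):
    area = rasterized_circle_area(10.0, cell)
    print(f"cell size {cell}: raster area {area:.2f} vs true area {true_area:.2f}")
```

With 5-unit cells the circle becomes a blocky 12-cell shape whose area is noticeably off. Finer cells shrink the discrepancy, but the staircase edge never disappears entirely.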
Digitization errors
Topology errors
Figure 10.5, An Introduction to Geographic Information Systems by Heywood, Cornelius, and Carver
Data storage/retrieval errors
Hardware failure
Hardware Limitations
What is a hardware limitation?

Numbers in a computer are stored in a finite number of bits.

Using too few bits can cause roundoff error.
Box 9.2, Principles of Geographic Information Systems by Burrough and McDonnell
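Roundoff from too few bits can be demonstrated by squeezing a coordinate into a 32-bit float. In the sketch below the northing value is made up, and the helper name is illustrative; the round trip uses Python's standard struct module.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Hypothetical UTM-style northing in meters, recorded to the millimeter.
northing_m = 4651234.567
stored = as_float32(northing_m)
print(f"original {northing_m} m, stored {stored} m, "
      f"error {abs(stored - northing_m):.3f} m")
```

At this magnitude a 32-bit float can only represent steps of 0.5 m, so the millimeter digits are lost no matter how carefully the point was surveyed.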
Where do errors of data rot come from?

Link rot
Not Found
The requested URL /cs/dlevine/ was not found on this server.
Apache/1.3.27 Server at www.xxx.edu Port 80

Poor “style”

E.g. “Employees may appeal to Sr. Carney” as opposed to “Employees may appeal to the President of the University”
Where do errors of analysis come from?
How long do you have? …

Mistaken queries

Analyzing layers with different datums or coordinate systems

Comparing attributes with incompatible units
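The datum/coordinate-system mistake can be caught before any analysis runs by comparing layer metadata. This is only a sketch: the Layer class, its fields, and the layer names are hypothetical stand-ins for whatever a real GIS package records about each layer.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    crs: str  # coordinate reference system label, e.g. an EPSG code


def assert_same_crs(*layers: Layer) -> None:
    """Raise ValueError if the layers do not all share one coordinate system."""
    codes = {layer.crs for layer in layers}
    if len(codes) > 1:
        detail = ", ".join(f"{layer.name}={layer.crs}" for layer in layers)
        raise ValueError(f"layers use different coordinate systems: {detail}")


roads = Layer("roads", "EPSG:4326")      # WGS 84
parcels = Layer("parcels", "EPSG:4269")  # NAD83
try:
    assert_same_crs(roads, parcels)
except ValueError as err:
    print(err)
```

The check only flags the mismatch. Fixing it means reprojecting one layer into the other's system before overlaying them.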
More errors of analysis …

Inappropriate resolution

Combining rasters/vectors with different resolutions

Using exact/abrupt surface fits when approx./gradual is appropriate (or vice versa)
Data output errors

Maps

Reports
Junket at taxpayers’ expense?
Did a politician misuse federal funds to visit Alaska on the way to official business in Japan?
Muehrcke, Map Use, 2nd ed., p. 395
No - Intentional map error*
*More like lying with maps!
Muehrcke, Map Use, 2nd ed., p. 395
Should maps be as accurate as possible?

Map simplification
• Features are omitted
• Area features become lines or points

Exaggeration
• Features’ apparent size is “increased” (e.g. hydrants)
• Features’ separation is increased on the map for visibility

Must Mapquest be accurate?
Reporting significance of findings

Hypothesis testing

What does the term “significant” mean to scientists?
Are two means really different?
http://www.steve.gb.com/science/statistics.html#t
These two normal distributions have a very large overlap. The means of the two populations are not significantly different, because the overlap is > 5% of the area under the curves. t would be very small.
http://www.steve.gb.com/science/statistics.html#t
What about these two means?
http://www.steve.gb.com/science/statistics.html#t
These means are also significantly different - why?
How do we actually test for statistical differences?
Student’s t-test
t = (difference in means) / (measure of variability)
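The ratio above can be computed by hand for two small samples. The slide does not say which "measure of variability" it has in mind, so this sketch assumes the unequal-variance (Welch) form, in which the denominator is the standard error of the difference between the means. The sample values are made up.

```python
import math
from statistics import mean, variance


def t_statistic(a: list[float], b: list[float]) -> float:
    """t = (difference in means) / (standard error of that difference)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Two hypothetical samples with clearly separated means.
sample_a = [1.0, 2.0, 3.0]
sample_b = [4.0, 5.0, 6.0]
print(f"t = {t_statistic(sample_a, sample_b):.3f}")
```

A large |t| means the difference in means is big relative to the scatter. Turning t into a significance statement still requires the degrees of freedom and a t table or a statistics package.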
Three Commandments of Data Reporting
Thou Shalt Not …
I. Report insignificant digits (or omit significant trailing zeros)
II. Report means without also reporting sample sizes and variability
III. Report results as “significant” (or even worth talking about) without doing the appropriate statistical tests.
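Commandments I and II can be folded into one small reporting helper. This is a minimal sketch: the function name and output layout are assumptions, and the data are the sidewalk widths from earlier in the lecture.

```python
from statistics import mean, stdev


def summarize(values: list[float], unit: str, decimals: int) -> str:
    """Report the mean with its standard deviation and sample size,
    rounded to the number of decimals the measurements justify."""
    return (f"{mean(values):.{decimals}f} ± {stdev(values):.{decimals}f} "
            f"{unit} (n = {len(values)})")


# Widths were recorded to the centimeter, so report two decimals, not ten.
print(summarize([1.77, 1.82, 1.69], "m", 2))
```

The number of decimals is passed in deliberately: the code cannot know how precise the instrument was, so that judgment stays with the person reporting.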
“CONSTANT VIGILANCE”
-- “Mad Eye” Moody
Defense Against The Dark Arts Instructor
Hogwarts School of Witchcraft and Wizardry
http://news.bbc.co.uk/1/shared/spl/hi/pop_ups/05/entertainment_goblet_of_fire/html/3.stm
How do we minimize (NOT avoid) error?