Metadata challenges: providing stronger assessments of data quality
Dr Lex Comber
PUBLIC PROTECTION AND ETHICAL GEOSPATIAL DATA DISSEMINATION
AN INITIATIVE OF GEOIDE (PROJECT IV-23)
Acknowledgements
• The ideas in this presentation are the result of
an ongoing collaboration
– Mark Gahegan
• This is a work in progress...
Aims
• To expand on current notions of metadata for spatial data
• To explore metadata objectives, content and roles
• To consider possible metadata developments
• To propose an agenda for evolving metadata
Statement
“Data quality can only be determined in light of its intended use:
quality is not absolute but relative to its use”
• Data is frequently (mostly?) used for purposes other than its original one
• 3rd party data; more users; greater access, e.g. SDIs, INSPIRE, GRID etc.
• Users need to understand the uncertainties when they use the data
• A dataset will have different ‘quality’ for different users (and uses)
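The point that one dataset has different 'quality' for different uses can be illustrated with a minimal sketch. The quality fields, the two uses and all threshold values below are hypothetical and chosen only for illustration; they are not taken from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetQuality:
    """Producer-side quality figures, as might appear in metadata."""
    positional_error_m: float  # typical horizontal error, in metres
    year_surveyed: int


@dataclass(frozen=True)
class UseRequirement:
    """What one particular use needs from the data (hypothetical)."""
    name: str
    max_positional_error_m: float
    min_year_surveyed: int


def fit_for_use(quality: DatasetQuality, need: UseRequirement) -> bool:
    """Return True if the dataset meets this use's requirements."""
    return (quality.positional_error_m <= need.max_positional_error_m
            and quality.year_surveyed >= need.min_year_surveyed)


# One hypothetical dataset, assessed against two hypothetical uses.
land_cover = DatasetQuality(positional_error_m=25.0, year_surveyed=2000)
uses = [
    UseRequirement("national reporting", 100.0, 1995),
    UseRequirement("site planning", 5.0, 2005),
]
verdicts = {u.name: fit_for_use(land_cover, u) for u in uses}
print(verdicts)
```

The producer's figures are identical in both cases; only the requirement changes, and with it the verdict.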
Outline
• Introduction
– Spatial data variability: semantics, measurement
& abstraction
– Examples
• Context
– Users, prototypes & semiotic triangles
– Standards
• A research agenda for more nuanced
metadata
Introduction: spatial data variability
• Many different ways of conceptualising the world
– Grounded in semantics and meaning
– Different meanings and understandings
– Sometimes called an ‘ontology’
• Geographic representation
– Real world infinitely complex
– Representation involves
• Abstraction, Aggregation, Simplification etc
• Examples
Example: UN FRA
Grainger, A. (2007). The influence of end-users on the temporal consistency of an international statistical
process: the case of tropical forest statistics. Journal of Official Statistics, 23(4): 553-592
Spatial characterization can change
Example: sea level
Differences in sea level (cm)
Fact: A bridge collapsed!
Where: Laufenburg on the river Rhine
Why: The already completed bridge on the Swiss side has a difference in altitude (level) of 0.54 metres compared to its German counterpart
How: The two neighbouring countries use different measuring methods
Source: http://www.laufenburg.ch
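How two height systems can disagree for the same physical level can be sketched as follows. This is a minimal illustration, assuming each country reports heights relative to its own zero level and that a single constant offset relates the two. The offsets and the nominal deck height are hypothetical; only the 0.54 m figure echoes the slide.

```python
def to_common_datum(height_m: float, datum_offset_m: float) -> float:
    """Convert a height in a national system to a shared reference.

    datum_offset_m is the (hypothetical) height of the national zero
    level above the shared reference level.
    """
    return height_m + datum_offset_m


# Hypothetical offsets of each national zero level above a shared reference.
OFFSET_A_M = 0.00
OFFSET_B_M = 0.54  # chosen to echo the 0.54 m figure on the slide

# Both bridge ends are built to a nominal "330.00 m" in their own system.
deck_a = to_common_datum(330.00, OFFSET_A_M)
deck_b = to_common_datum(330.00, OFFSET_B_M)

mismatch_m = round(deck_b - deck_a, 2)
print(f"Mismatch between bridge ends: {mismatch_m} m")
```

The same number means two different physical levels unless the reference is recorded with the data and applied before comparison.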
Example: what is a forest?
From Comber, A.J., Fisher, P.F., Wadsworth, R.A., (2005). What is land cover?
Environment and Planning B: Planning and Design, 32:199-209
[Figure: scatter plot of Tree Height (m), axis 0 to 16, against Canopy Cover (%), axis 0 to 90, with one labelled point per forest definition: Zimbabwe, Sudan, Turkey, United Nations FRA 2000, PNG, Luxembourg, Malaysia, Belgium, New Zealand, Netherlands, Namibia, Somalia, Israel, United States, Gambia, Mexico, Tanzania, Mozambique, Morocco, Ethiopia, Denmark, SADC, Cambodia, Australia, Japan, UNESCO, Jamaica, Switzerland, South Africa, Kyrgyzstan, Kenya, Portugal, Estonia. Does not include species, area, strip width.]
Data source: http://home.comcast.net/~gyde/DEFpaper.htm
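The practical consequence of differing forest definitions can be sketched in a few lines. This is an illustration only: the three definitions and their thresholds are placeholders, not values read from the figure or from any national standard, and the rule of combining canopy cover and tree height as joint minimum thresholds is an assumption.

```python
from typing import NamedTuple


class ForestDefinition(NamedTuple):
    source: str
    min_canopy_cover_pct: float
    min_tree_height_m: float


# Thresholds are illustrative placeholders, NOT values read from the figure.
DEFINITIONS = [
    ForestDefinition("definition A", 10.0, 5.0),
    ForestDefinition("definition B", 30.0, 2.0),
    ForestDefinition("definition C", 60.0, 8.0),
]


def is_forest(canopy_cover_pct: float, tree_height_m: float,
              definition: ForestDefinition) -> bool:
    """Apply one definition's minimum thresholds to a stand of trees."""
    return (canopy_cover_pct >= definition.min_canopy_cover_pct
            and tree_height_m >= definition.min_tree_height_m)


# One hypothetical stand: 20% canopy cover, trees 6 m tall.
labels = {d.source: is_forest(20.0, 6.0, d) for d in DEFINITIONS}
print(labels)
```

The stand itself does not change; whether it is recorded as 'forest' depends entirely on which definition the data producer used.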
Introduction: spatial data variability
• Much variation in how the world is represented
• Choices about representation vary depending on
– Commissioning, scientific & policy context (who paid
for it?)
– Observer (what did you see?)
– Institution (why did you see it that way?)
– Measurement (how did you record it?)
• So… almost everything in Geography is a matter of
interpretation
– The same processes may be recorded (represented) in different
ways
→ Variation in representation & concepts
Context
• Now: many more users of spatial data
• Obtaining data is easy and quick
– Web, INSPIRE, SDIs (click through download)
– No gatekeeper, no negotiation
• Users may assume that data about ‘forest’ or
‘height above sea level’ etc matches their
concept, their understanding
– Prototypes in cognitive science
Context
• Semiotic triangle
– Real world
– GI conceptualisations
– User prototypes
• GI is interpreted from personal & group conceptualizations of the world
• Geographical data are mapped into those conceptualizations
• Then provided to users
[Diagram: triangle labelled ‘real world’, User and GI]
Context
• How does the user
– Understand the data-to-real world link?
– Avoid mis-matches with their Prototype, Conceptual model, Analytical objectives, or Existing data?
– Determine data quality?
– Ensure robust analysis?
• Users might expect metadata to support their activity…
• The meta-descriptions of metadata in standards support that view…
[Diagram: triangle labelled ‘real world’, User and GI, with metadata]
Context
• Geo-spatial data quality and metadata standards:
– Positional Accuracy, Attribute Accuracy, Lineage, Logical consistency,
Completeness
– In many early standards: DCDSTF, 1988; FGDC, 1998; ANZLIC, 2001;
ISO, 2003, OGC, INSPIRE
– Distilled into the Dublin Core
• The Dublin Core Metadata Element Set identifies 15 elements
– Contributor, Coverage, Creator, Date, Description, Format, Identifier,
Language, Publisher, Relation, Rights, Source, Subject, Title, Type
• Relate to mainstream information sources
– Books, web pages
– Based on IP, cataloguing, retrieval & discovery
– How to document information
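A record built from the elements listed above can be sketched as a simple mapping. This is a minimal sketch: the dataset and all its values are hypothetical, and the helper `missing_elements` is introduced here for illustration rather than taken from any metadata toolkit.

```python
# The 15 Dublin Core elements and the five quality elements named above.
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)
QUALITY_ELEMENTS = (
    "positional_accuracy", "attribute_accuracy", "lineage",
    "logical_consistency", "completeness",
)

# Hypothetical record for a hypothetical dataset.
record = {
    "title": "Example land cover map",
    "creator": "Example Mapping Agency",
    "date": "2000",
    "subject": "land cover",
    "format": "GeoTIFF",
    "coverage": "Example region",
}


def missing_elements(rec: dict, elements: tuple) -> list:
    """List the elements that are absent or empty in a record."""
    return [name for name in elements if not rec.get(name)]


missing_core = missing_elements(record, DUBLIN_CORE_ELEMENTS)
missing_quality = missing_elements(record, QUALITY_ELEMENTS)
print(f"{len(missing_core)} of {len(DUBLIN_CORE_ELEMENTS)} core elements missing")
print(f"{len(missing_quality)} of {len(QUALITY_ELEMENTS)} quality elements missing")
```

Even a fully populated record of this kind documents the data for cataloguing and discovery; none of these keys states whether the data suit a particular use.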
Context
• Metadata objectives:
– “Data about data or a service. Metadata is the
documentation of data. In human-readable form, it has
primarily been used as information to enable the manager
or user to understand, compare and interchange the
content of the described data set” (ISO, 2003a)
• BUT standards reflect the process of data production
– Lineage from methods and data sources
– Accuracy, Consistency and Completeness from assessment
of results
• Little focus on use
• Little focus on assessments of data quality
Context
• Currently metadata does NOT close the
semiotic loop
• In part this is the nature of standards...
... in theory provide a common language
... But their specification (content) is always a
compromise and lags behind research & practice
– E.g. a recent book on spatial data standards took
10 years from inception to being published.
Research Agenda
• Can users make sense of the metadata provided?
– Does it meet their needs?
• Are the various MD fields relevant in this new context?
– Are there important omissions?
• Are there opportunities for further richness provided by
recent innovations in information science?
• Will data producers be able to keep up with metadata
production at ever-increasing data rates?
• In short: does metadata need to be re-envisioned for these
new technologies and use-cases?
Research Agenda to support user evaluations of
data quality
1. Metadata for what purpose, what roles?
– Currently based on Archive, Discovery, Citations and Browsing.
– Is this complete?
– What about data quality assessments? Semantics?
2. Metadata for what kinds of resources (not just data)?
– Just datasets? Too shortsighted? What about:
• Methods? Workflows? Research Questions? Researchers?
– There are syntactic and semantic issues for each of the above: e.g. Methods
can be described by syntactic signatures but that does not describe to the user
what they do…
3. Actionable metadata?
– Today’s information systems are poor consumers of metadata…
• Do the tools we use make effective use of metadata?
• E.g. the GIS community has spent much time and effort on uncertainty metadata,
even though the systems cannot analyze and propagate uncertainty during analysis
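What 'actionable' uncertainty metadata might look like can be sketched with a value type that carries its uncertainty through a calculation. This is a minimal sketch, assuming independent errors expressed as one standard deviation and combined in quadrature; the `Measured` type and the elevation values are hypothetical and do not describe how any GIS package works.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value with a one-sigma standard uncertainty, in the same units."""
    value: float
    sigma: float

    def __add__(self, other: "Measured") -> "Measured":
        # Independent errors combine in quadrature for sums.
        return Measured(self.value + other.value,
                        math.hypot(self.sigma, other.sigma))

    def __sub__(self, other: "Measured") -> "Measured":
        # The same rule applies to differences.
        return Measured(self.value - other.value,
                        math.hypot(self.sigma, other.sigma))


# Hypothetical elevations (m) with uncertainties taken from metadata.
summit = Measured(850.0, 3.0)
valley = Measured(400.0, 4.0)

relief = summit - valley
print(f"Relief: {relief.value} m, uncertainty {relief.sigma} m")
```

Here the uncertainty recorded in the metadata changes the result the user sees, rather than sitting unused in a separate document.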
Research Agenda to support user evaluations of
data quality
4. Does the role of Standards need to change?
– Many metadata standards, for a variety of purposes
– Re-invented by different disciplines/groups
– Who gets to make the standards?
– Should standards cover all metadata needs for science communities?
5. Cost and time for creating metadata standards?
– How long does it take (examples from EU and ISO)?
– What is the typical cost?
– What does the metadata standards development process look like?
– Do communities always accept them?
6. The burden of metadata production?
– Often an ‘unfunded mandate’
– Documenting standards ignored to various degrees
– Are metadata standards failing? (e.g. NSDI)
– Are we sure we are collecting information that is useful?
Research Agenda to support user evaluations of
data quality
7. Conveying understanding: Capturing and representing domain semantics?
– There are many realities…each user of some resource brings a different
understanding and potentially different metadata needs
– Representing data semantics: (i) for users, (ii) for foreign systems
• Using meta-models, where some domain semantics are first defined, then used to construct
information schemas (e.g. NADM: North American Geologic Map Data Model)
• Using ontologies for knowledge domains and tasks (e.g. NASA’s SWEET ontology of Earth
processes and regions)
8. Mining situational metadata from use-cases (provenance)?
– User ranking and feedback:
• What works? What is missing? What is known? What is unknown?
– Use-case logging: monitor use via a web portal / library, warehouse…
• Use counts by web domains: differentiate user communities
– Use-case mining and analysis
• Discover significant usage patterns, use these to infer relevance, e.g. recommender systems
– Genesis, derivation, workflows
• By exposing, analyzing and documenting the means by which the dataset was produced
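Use-case logging with use counts by web domain can be sketched as follows. This is an illustration under stated assumptions: the log format, the host names, the dataset identifiers and the suffix-to-community mapping are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical download log: (dataset identifier, requesting host name).
DOWNLOAD_LOG = [
    ("landcover-2000", "gis.example.ac.uk"),
    ("landcover-2000", "planning.example.gov.uk"),
    ("landcover-2000", "lab.example.ac.uk"),
    ("heights-dem", "maps.example.com"),
    ("heights-dem", "geo.example.ac.uk"),
]

# Hypothetical mapping from host-name suffix to user community.
COMMUNITY_BY_SUFFIX = {
    ".ac.uk": "academic",
    ".gov.uk": "government",
    ".com": "commercial",
}


def community_of(host: str) -> str:
    """Assign a host to a user community by its domain suffix."""
    for suffix, community in COMMUNITY_BY_SUFFIX.items():
        if host.endswith(suffix):
            return community
    return "unknown"


def use_counts(log):
    """Count downloads of each dataset per user community."""
    counts = defaultdict(Counter)
    for dataset_id, host in log:
        counts[dataset_id][community_of(host)] += 1
    return counts


counts = use_counts(DOWNLOAD_LOG)
for dataset_id, by_community in counts.items():
    print(dataset_id, dict(by_community))
```

Counts of this kind could be attached to a dataset as situational metadata, giving later users a rough indication of which communities have relied on it.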
Research Agenda to support user evaluations of
data quality
9. Mining semantic metadata from resources and schemas?
– Ontology mining
• inferred from schema (metadata) - mappings built from exposed data schema
• inferred from data in some cases - schema and data to construct ontology
10. Evolving metadata?
– the way we describe the world keeps changing
• …and we learn more about how things are used
– The way we think about metadata now has evolved considerably over
the last 20 years
• we should expect that to continue.
• Metadata schemas need to be designed for expansion and replacement as science
evolves.
– Meta-models help a lot, but are they flexible enough? Will emergent
use patterns lead to new insights?
Final remarks
• Assertion 1: Current attempts to gather and utilize metadata for
data quality assessments are failing...
• Assertion 2: The burden of tagging existing and future data with
user-relevant metadata to do this is overwhelming
– We cannot realistically expect data producers to carry this burden alone
• Many different approaches to metadata creation are open to us
– Some are new, facilitated by ‘grid’ and web service ‘brokered’ access to e-resources
– We need to try some of these on a large scale.
• These research ideas are intended to augment the ongoing work of
INSPIRE / ESDIN, etc (not a critique)
• The stakes are high: our success in sharing data – of which data
quality assessments are a key part – will have big repercussions for
research and policy for years to come