- Tetherless World Constellation

Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration
Peter Fox
Data Science – CSCI/ERTH/ITWS-6961
Week 12, November 20, 2012
1
Contents
• Review of reading assignment
• Webs of data and semantic web
• Data on the web, linked data
• Deep web
• Data discovery
• Data integration
• Summary
• Next week
2
Reading
• Data Quality European Union Presentation
• ISO Technical Standards - General Reference
3
Webs of data
• Early Web - Web of pages
• http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html
• Semantic web started as a way to facilitate
“machine accessible content”
– Initially was available only to those with familiarity
with the languages and tools, e.g. your parents
could not use it
• Webs of data grew out of this
– One specific example is W3C’s Linked Open
Data
4
Semantic Web
• http://www.w3.org/2001/sw/
• “The Semantic Web provides a common
framework that allows data to be shared and
reused across application, enterprise, and
community boundaries. It is a collaborative
effort led by W3C with participation from a
large number of researchers and industrial
partners. It is based on the Resource
Description Framework (RDF). See also the
separate FAQ for further information.”
5
Terminology
• Semantic Web
– An extension of the current web in which information is
given well-defined meaning, better enabling computers
and people to work in cooperation, www.semanticweb.org
– Primer: http://www.ics.forth.gr/isl/swprimer/
• Semantic Grid
– Semantic services to use the resources of many
computers connected by a network to solve large scale
computational/ data problems
• Ontology (n.d.). The Free On-line Dictionary of Computing.
http://dictionary.reference.com/browse/ontology
– An explicit formal specification of how to represent the
objects, concepts and other entities that are assumed to
exist in some area of interest and the relationships that
hold among them.
6
Semantic Web Layers
[Figure: Semantic Web layer diagram. Sources: http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/]
7
Application Areas for SW
• Smart search
• Annotation (even simple forms), smart tagging
• Geospatial
• Implementing logic (rules), e.g. in workflows
• Data integration
• Verification …. and the list goes on
• Web services
• Web content mining with natural language parsing
• User interface development (portals)
• Semantic desktop
• Wikis - OntoWiki, SemanticMediaWiki
• Sensor Web
• Software engineering
• Explanation
8
Semantic Web Basics
• The triple: {subject-predicate-object}
– Interferometer is-a optical instrument
– Optical instrument has focal length
• W3C is the primary (but not sole) governing organization
– RDF
– OWL 1.0 and 2.0 - Web Ontology Language
• RDF
– programming environments for 14+ languages, including C, C++,
Python, Java, Javascript, Ruby, PHP, ... (no Cobol or Ada yet ;-( )
• OWL programming for Java
• Closed World - where complete knowledge is known
(encoded); AI relied on this
• Open World - where knowledge is incomplete/evolving;
SW promotes this
9
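The triple and the closed/open world distinction on the Semantic Web Basics slide can be sketched in a few lines of Python. This is only an illustration that assumes a plain in-memory set of tuples. The helper names (`is_a`, `holds_closed`, `holds_open`) and the extra "OpticalInstrument is-a Instrument" statement are hypothetical additions, not part of RDF, OWL, or the slides.

```python
# Triples as (subject, predicate, object) tuples in a hypothetical in-memory store.
# The first and third come from the slide; the second is added for illustration.
triples = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "is-a", "Instrument"),
    ("OpticalInstrument", "has", "focalLength"),
}


def is_a(store, subject, target):
    """Follow is-a links transitively from subject until target is reached."""
    seen, frontier = set(), [subject]
    while frontier:
        node = frontier.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(o for s, p, o in store if s == node and p == "is-a")
    return False


def holds_closed(store, triple):
    """Closed world: anything that is not stated is taken to be false."""
    return triple in store


def holds_open(store, triple):
    """Open world: anything that is not stated is unknown (None), not false."""
    return True if triple in store else None
```

Under the closed-world reading an unstated triple is simply false, while the open-world reading leaves room for another source to assert it later.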
Ontology Spectrum
[Figure: the ontology spectrum, from least to most formal: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]
Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
– updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
10
SW != ontologies on the web (!)
• Ontologies are important, but use them only when necessary as
identified by use cases
• The Semantic Web is about integrating data on the Web; ontologies
(and/or rules) are tools to achieve that when necessary
• SW ontologies != some big (central) ontology
– The ethos of the Semantic Web is sharing, i.e., sharing possibly many
small ontologies
– A huge, central ontology could be difficult to manage in terms of
maintenance.
– Semantic web languages such as OWL contain primitives for equivalence
and disjointness of terms and meta primitives for versioning info
• The practice:
– SW applications using ontologies mix large number of ontologies and
vocabularies (FOAF, DC, and others)
– the real advantage comes from this mix: that is also how new relationships
may be discovered
One readable background article from the metadata world is available at:
http://www.metamodel.com/article.php?story=20030115211223271
11
Semantic Web Myths
• ‘the Semantic Web is a reincarnation of Artificial Intelligence
on the Web’ (closed world versus open world)
• ‘it relies on giant, centrally controlled ontologies for
"meaning" (as opposed to a democratic, bottom-up control of
terms)’
• ‘one has to add metadata to all Web pages, convert all
relational databases, and XML data to use the Semantic
Web’
• ‘one has to learn formal logic, knowledge representation
techniques, description logic, etc, to use it’
• ‘it is, essentially, an academic project, of no interest for
industry’
12
Integrating Multiple Data Sources
• The Semantic Web lets us merge
statements from different sources
• The RDF Graph Model allows
programs to use data uniformly
regardless of the source
• Figuring out where to find such
data is a motivator for Semantic
Web Services
[Figure: RDF graph about #Ionosphere merged from several sources. Labels: name “Terrestrial Ionosphere”; hasCoordinates #magnetic; hasLowerBoundaryValue “100”; hasLowerBoundaryUnit “km”. Different line & text colors represent different data sources.]
13
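Merging statements from different sources, as described on the Integrating Multiple Data Sources slide, amounts to a set union once everything is a triple. The sketch below is a minimal illustration. Which source holds which statement is an assumption (the colors in the original figure are lost), and `merge` and `describe` are hypothetical helper names.

```python
# Two hypothetical sources describing the same resource.
source_a = {
    ("#Ionosphere", "name", "Terrestrial Ionosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
}


def merge(*graphs):
    """Union of all triples; duplicates collapse because graphs are sets."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect predicate -> object for one subject (keeps one value per predicate)."""
    return {p: o for s, p, o in graph if s == subject}


merged = merge(source_a, source_b)
info = describe(merged, "#Ionosphere")
```

A program reading `info` does not need to know which source contributed the unit and which contributed the name.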
Drill Down /Focused Perusal
• The Semantic Web uses Uniform Resource Identifiers (URIs) to name things
• These can typically be resolved to get more information about the resource
• This essentially creates a web of data analogous to the web of text created by the World Wide Web
• Ontologies are represented using the same structure as content
– We can resolve class and property URIs to learn about the ontology
[Figure: graph of linked resources reached over the Internet. Node labels: …#NeutralTemperature, ...#ISR, ...#FPI, ...#MilllstoneHill, …#EISCAT, …#Norway. Edge labels: measuredby, type, operatedby, locatedIn.]
14
Statements about Statements
• The Semantic Web allows us to make statements about statements
– Timestamps
– Provenance / Lineage
– Authoritativeness / Probability / Uncertainty
– Security classification
– …
• This is an unsung virtue of the Semantic Web
[Figure: a statement about #Aurora annotated with further statements. Labels: #Danny’s, #Aurora, hasSource, hasDateTime, hascolor, 20031031, Red.]
Ontologies Workshop, APL May 26, 2006
15
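One way to picture "statements about statements" is to let a whole statement act as the subject of further triples. The sketch below assumes the figure on the slide reads as "#Aurora hascolor Red", with a source of #Danny’s and a date of 20031031. That reading, the `Statement` class, and the `about` helper are illustrative assumptions rather than RDF reification syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """An immutable, hashable triple so it can itself be used as a subject."""
    subject: str
    predicate: str
    obj: str


aurora_is_red = Statement("#Aurora", "hascolor", "Red")

# Triples whose subject is the statement above (provenance and timestamp).
annotations = {
    (aurora_is_red, "hasSource", "#Danny's"),
    (aurora_is_red, "hasDateTime", "20031031"),
}


def about(statement, notes):
    """Return predicate -> value for every annotation attached to a statement."""
    return {p: o for s, p, o in notes if s == statement}
```

Calling `about(aurora_is_red, annotations)` returns the source and timestamp recorded for that one observation, leaving the observation itself unchanged.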
‘Collecting’ the ‘data’
• Part of the (meta)data information is present in tools
... but thrown away at output. E.g., a business chart
can be generated by a tool: it ‘knows’ the structure,
the classification, etc. of the chart, but, usually, this
information is lost. Storing it in web data would be
easy!
• SW-aware tools are around (even if you do not
know it...), though more would be good:
– Photoshop CS stores metadata in RDF in, say, jpg files
(using XMP)
– RSS 1.0 feeds are generated by (almost) all blogging
systems (a huge amount of RDF data!)
16
‘Collecting’ the ‘data’
• Scraping - different tools, services, etc, come
around every day:
– get RDF data associated with images, for
example: service to get RDF from flickr images
– service to get RDF from XMP
– XSLT scripts to retrieve microformat data from
XHTML files
– RSS scraping in use in VO projects in Japan
– scripts to convert spreadsheets to RDF – e.g. see
the tools, tutorials, demos at http://logd.tw.rpi.edu
17
‘Collecting’ the ‘data’
• SQL - A huge amount of data in Relational
Databases
– Although tools exist, it is not feasible to convert that data
into RDF
– Instead: SQL ⇋ RDF ‘bridges’ are being developed: a
query to RDF data is transformed into SQL on-the-fly
– Reading for this week, article by Berners-Lee and Sahoo
et al.
– RDB2RDF W3C working group http://www.w3.org/2001/sw/rdb2rdf/
– D2RQ/ D2RServer
– Commercial solutions appearing
• NoSQL
• Other ‘graph’ forms…
18
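The SQL ⇋ RDF ‘bridge’ idea on the slide above, where a query over RDF is turned into SQL on the fly, can be sketched with the standard-library `sqlite3` module. This is a toy under stated assumptions: the `MAPPING` table, the `instrument` schema, and the sample row are made up, and the code does not reproduce how D2RQ or any RDB2RDF tool actually works.

```python
import sqlite3

# Hypothetical mapping: predicate -> (table, subject column, object column).
# Identifiers come only from this trusted dict, so formatting them into SQL is safe here.
MAPPING = {
    "name": ("instrument", "id", "name"),
    "locatedIn": ("instrument", "id", "country"),
}


def pattern_to_sql(predicate, subject=None):
    """Translate the triple pattern (subject?, predicate, ?object) into SQL."""
    table, key, value = MAPPING[predicate]
    sql = f"SELECT {key}, {value} FROM {table}"
    params = []
    if subject is not None:
        sql += f" WHERE {key} = ?"
        params.append(subject)
    return sql, params


def query(conn, predicate, subject=None):
    """Run the translated SQL and hand the rows back as triples."""
    sql, params = pattern_to_sql(predicate, subject)
    return [(s, predicate, o) for s, o in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instrument (id TEXT PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO instrument VALUES ('#EISCAT', 'EISCAT radar', 'Norway')")
rows = query(conn, "locatedIn", "#EISCAT")
```

The relational data never leaves the database; only the answer to each pattern is expressed as triples.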
More Collecting
• RDFa (formerly known as RDF/A) extends XHTML
by:
– extending the link and meta elements to include child elements
– adding metadata to any element (a bit like the class in
microformats, but via dedicated properties)
• It is very similar to microformats, but with more
rigor:
– it is a general framework (instead of an ‘agreement’ on
the meaning of, say, a class attribute value)
– terminologies can be mixed more easily
• GRDDL - Gleaning Resource Descriptions from
Dialects of Languages
• ATOM (follow on to RSS)
19
Linked open data
• http://linkeddata.org/guides-and-tutorials
• http://tomheath.com/slides/2009-02-austinlinkeddata-tutorial.pdf (we will look at some of
these slides now, #1-25 and 30-37)
• And of course:
– http://logd.tw.rpi.edu/
– http://data-gov.tw.rpi.edu/wiki
20
http://richard.cyganiak.de/2007/10/lod/
Number of datasets in each version of the diagram:
• Latest: 295
• 2011-09-19: 295
• 2010-09-22: 203
• 2009-07-14: 95
• 2009-03-27: 93
• 2009-03-05: 89
• 2008-09-18: 45
• 2008-03-31: 34
• 2008-02-28: 32
• 2007-11-10: 28
• 2007-11-07: 28
• 2007-10-08: 25
• 2007-05-01: 12
21
[Figure: Linking Open Data cloud diagram, 2009-03-05 (Chris Bizer)]
22
[Figure: Linking Open Data cloud diagram, September 2011. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”]
23
(Class 2) Management
• Creation of logical collections
• Physical data handling
• Interoperability support
• Security support
• Data ownership
• Metadata collection, management and access
• Persistence
• Knowledge and information discovery
• Data dissemination and publication
24
Data Management and WOD
• Is this the grand solution?
• How is the data managed?
• Found?
• Curated?
• What about the metadata?
• What problems are introduced?
• See: Parsons and Fox (2012): http://mpdatamatters.blogspot.com/
25
Data on the Web, Internet
• Data behind web services
• Data files on web sites
• We have covered data as service approaches
• Thinking you have found data when you have
really only found information and metadata
• The real difference between this topic and the
next one is:
– Access and dissemination
– Level of curation (and often description)
26
Data on the internet
• http://www.dataspaceweb.org/
• Data files on other protocols
– FTP
– RFTP
– GridFTP
– SABUL
– XMPP/AMQP
– Others…
27
Deep web
• Data behind web services
• Data behind query interfaces (databases or
files)
• Introduces a different curation problem
28
The loose definition
• Something that a crawler cannot find and/or
index
– Creates the other definition of shallow web
• Has many implications for discovery, access
and use
• Curation is more complex to satisfy this
definition, i.e. not a matter of just putting files
‘on the web’
• 50, 100, 1000 times the ‘shallow web’?
29
Managing (in) the deep web
• Sometimes, the deep web aspect of a data
source can be due to extreme obscurity,
language peculiarities, NO metadata, NO
documentation
• There are no known studies of how effective
data management (what you are learning)
could change the percentage of deep/
shallow
• Semantics are often put forward as a solution
http://www.mkbergman.com/458/new-currents-in-the-deep-web/
30
Internet impacts on management
• Management of data that is… on the Internet!
• Web – ‘stateless’
• Curation, Preservation – highly stateful (by
definition)
• You will hear terms such as digital curation
and digital preservation (search on these) but
what about internet curation and internet
preservation (Internet Archive?)
• What others??
31
(Class 2) Management
• Creation of logical collections
• Physical data handling
• Interoperability support
• Security support
• Data ownership
• Metadata collection, management and access
• Persistence
• Knowledge and information discovery
• Data dissemination and publication
32
Thus data frameworks are appearing
• Many – meaning they go beyond web sites,
they incorporate many of the data
management functions
• Initially syntactic – e.g. OPeNDAP, ADDE,
ODATA, OODT
• Application oriented – e.g. virtual
observatories
• Semantic – e.g. Virtual Solar-Terrestrial
Observatory
• ALL of these are changing the nature of data
management and role of data ‘providers’ cf. ?
33
34
Some Definitions
DAP = Data Access Protocol
• Model used to describe the data;
• Request syntax and semantics; and
• Response syntax and semantics.
OPeNDAP
• The software;
• Numerous reference implementations;
• Core/libraries and services (servers and clients).
OPeNDAP Inc.
• OPeNDAP is a 501(c)(3) non-profit corporation;
• Formed to maintain, evolve and promote the
discipline neutral DAP that was the DODS core
infrastructure.
BOM, Melbourne, VIC
35
Considerations with regard to the
development of DAP and OPeNDAP
• Many data providers
• Many data formats
• Many different client types
• Many different semantic representations of
the data
• Many different security requirements
36
Broad Vision
A world in which a single data access protocol
is used for the exchange of data between
network based applications regardless of
discipline.
A layer above TCP/IP providing for syntactic and
semantic consistency not available in existing
protocols such as FTP.
37
Practical Considerations
The broad vision:
• Is syntactically achievable, but
• Was not semantically achievable, at least
not fully, but perhaps in the near term.
38
OPeNDAP Inc. Mission
Statement
To maintain, evolve and promote a data
access protocol (DAP) and reference
implementation software (OPeNDAP) for the
syntactically consistent exchange of data over
the network.
The DAP should provide syntactic interoperability
across disciplines and allow for semantic
interoperability within disciplines.
39
The Data Access Protocol (DAP)
• The DAP has been designed to be as
general as possible without being
constrained to a particular discipline or
world view.
• The DAP is a discipline neutral data access
protocol; it is being used in astronomy,
medicine, earth science,…
• Provides data format and location, and data organization
transparency
• Is metadata neutral
40
DAP comparisons
• File-based
– GridFTP/FTP
– HTTP
– SRB
• Service-based
– Open-Geospatial Consortium, WCS, WMS, WFS, …
– Virtual Observatory (Astronomy), SIAP, SSAP, STAP,…
41
Who is using DAP/ OPeNDAP?
• Science examples
– PMEL with their Tsunami inundation modeling
– Ocean regional modelers to extract open
boundary conditions
– Visualization of data sets using MATLAB/IDL/…
• Service examples
– Live Access Server
– Mapserver – OGC services and OPeNDAP data
access (future)
– Digital Library Service - metadata and catalogue
info
42
Data Access Protocol (DAP2) - Current
• DAP2 currently a NASA/ESE ‘Standard’
• Current servers implement DAP2
DAP3
• DAP2 + XML responses (implemented)
43
DAP4
• DAP4 improvements over DAP3:
– Additional datatypes: Swath; Blob - GIF, MPEG,…
– Additional functionality: Check sum; Modulo
• The additional datatypes will enable the DAP to
be used in a wider variety of circumstances and
are a direct response to users’ requests.
44
What DAP means to me
• Data access and transport
• Response types: DAP objects versus file type
– A DAP URL is essentially an HTTP URL with
additional restrictions placed on the abs-path
component.
– DAP2-URL = "http://" host [ ":" port ] [ abs-path ]
• abs-path = server-path data-source-id [ "." ext [ "?" query ] ]
• server-path = [ "/" token ]
• data-source-id = [ "/" token ]
• ext = "das" | "dds" | "dods"
– The server-path is the pathname to the server,
whereas data-source-id is the pathname to the data.
45
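The DAP2 URL grammar on the "What DAP means to me" slide can be exercised with a short Python sketch built on the standard-library `urllib.parse`. The host `example.org`, the path `/opendap/data/sst.nc`, and the query string `sst` are made-up placeholders, and `dap2_url` and `split_dap2_url` are hypothetical helper names, not part of any OPeNDAP client library. The query is treated as an opaque string because the slide does not define its syntax.

```python
from urllib.parse import urlsplit

# The three response extensions named in the grammar.
DAP2_EXTENSIONS = ("das", "dds", "dods")


def dap2_url(base, ext, query=None):
    """Append a DAP2 extension (and optional query) to a data source URL."""
    if ext not in DAP2_EXTENSIONS:
        raise ValueError(f"unknown DAP2 extension: {ext}")
    url = f"{base}.{ext}"
    if query:
        url += f"?{query}"
    return url


def split_dap2_url(url):
    """Split a URL into host, port, path (server-path + data-source-id), ext, query."""
    parts = urlsplit(url)
    path, dot, ext = parts.path.rpartition(".")
    if not dot or ext not in DAP2_EXTENSIONS:
        # No recognised DAP2 extension: keep the whole path as-is.
        path, ext = parts.path, None
    return {
        "host": parts.hostname,
        "port": parts.port,
        "path": path,
        "ext": ext,
        "query": parts.query or None,
    }


base = "http://example.org/opendap/data/sst.nc"  # placeholder data source
dds_url = dap2_url(base, "dds")
```

The grammar alone does not say where server-path ends and data-source-id begins, so the sketch keeps them together in `path`.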
OPeNDAP V3 Architecture
[Figure: Client, CGI-style access, Data]
• CGI-style access
• Uses web server
• HTTP protocol
• Several request and response types
• Reads data files, databases, etc., returns info
• May return DAP2 objects or other data
• Client can be application, web browser or
specialized server/service
46
OPeNDAP V4 (Hyrax)
Architecture
[Figure: architecture diagram with Client, OLFS, BES, and Data]
• OPeNDAP Lightweight Front end Server (OLFS)
– Receives requests and asks the BES to fill them
– Uses Java Servlets
– Does not directly ‘touch’ data
– Multi-protocol
• Back End Server (BES)
– Reads data files, databases, etc., returns info
– May return DAP2 objects or other data
– Does not require web server
47
Binaries Generated
There are approximately 80 binaries built on a nightly basis.
They are built for the following platforms/operating systems:
• Linux
– FC4
– FC5
• MacOS-X (universal binaries when possible)
• Windows XP, win32
• Java 1.5 (Tomcat 5.5)
• IRIX (in four variants), Solaris, AIX, OSF
48
OPeNDAP System Elements
The OPeNDAP data access protocol is
used by a variety of system elements.
• Clients
• Browser Interfaces
• Data System Integrators (ODC)
• Servers
• Processing Servers
• Aggregating Servers - OPeNDAP chains
• Ancillary Information Services
49
Clients
• Clients make requests and receive
responses via the DAP.
• Clients convert data from the OPeNDAP
data model to the form required in the client
application.
50
OPeNDAP Clients
[Figure: OPeNDAP clients connecting over the Internet. Labels: netCDF Java, netCDF C, Ferret, GrADS, IDV, VisAD, NCL, IDL Client, Matlab Client, ncBrowse, pyDAP, Web Browser, Access, Excel, OPeNDAP Data Connector, ArcGIS]
51
OC (2009)
• A pure OPeNDAP C API (OC) for the client side
• Applications:
– DAP-aware ‘commands’ for commercial
analysis programs (e.g., IDL, matlab)
– Scripting tools (e.g., Perl, python)
52
Browser interfaces
54
Servers
• Servers receive requests and provide
responses via the DAP.
• Servers convert the data from the form in
which they are stored to the DAP.
• Servers provide for subsetting of the data
and more.
56
OPeNDAP Servers
[Figure: OPeNDAP servers on the Internet, each in front of its own data store. Labels: CDM, ESML, General, netCDF, HDF4, HDF5, DSP, Tables, SQL, FITS, CDF, Flat Binary, CEDAR, JGOFS, JDBC, FreeForm]
57
OPeNDAP Servers
(specialized processing)
[Figure: specialized OPeNDAP servers on the Internet, each in front of its own data store. Servers: pyDAP, ESG, FDS, GDS, DAPPER, CODAR, TDS. Data labels: General, netCDF, OPeNDAP, GRIB, BUFR, CODAR]
58
Servers
• Servers may also provide other services
– Directory traversal.
– Browser-based form to build URL.
– ASCII or other representations of data.
– Metadata associated with the data.
– Server side functions.
59
OPeNDAP Aggregation Servers
[Figure: OPeNDAP aggregation servers on the Internet, each in front of its own data store. Servers: pyDAP, ESG, FDS, GDS, DAPPER, CODAR, TDS, JGOFS. Data labels: General, netCDF, OPeNDAP, GRIB, BUFR, CODAR]
60
The Aggregation Server: An Example
[Figure: aggregation server example. Labels: netCDF Data Set (files), DSP Data Set (files), DSP, Aggregation Server, Local, OPeNDAP, HTML, GIF, Matlab Client, Matlab]
61
OPeNDAP’s Hyrax
(‘Server4’)
• Uses a modular architecture to support
different application-level protocols
– Data access using DAP2 (DAP3)
– Catalogs using THREDDS
– Browsing using HTML and ASCII
• Modules for data access
– Different file types
– Potential for database and scripting
• Modules for commands
– Commands provide varying operations for
different protocols
62
[Figure: OPeNDAP Lightweight Front end Server. Labels: Request from client; Request Formulation; BES; Response to client; DAP2 (GridFTP, HTTP); SOAP-DAP (HTTP); THREDDS; Info output; HTML form; ASCII output]
64
BES
[Figure: BES Framework. Labels: BES Commands/XML Documents; PPT*; Initialization/Termination; Network Protocol and Process start/stop activities; DAP2 Access; Data Catalogs; Commands**; Data Store Interfaces (NetCDF3, HDF4, FreeForm, …), each in front of Data]
*PPT is built in (other protocols)
**Some commands are built in
65
Ancillary Information Service
• Current capability: Attributes only
• Client-side only
• Local and remote resources
• Local resource databases
The AIS enables users to augment the metadata for a data
source in a controlled way without requiring write access to
the original data. By using the DAP, users are also isolated
from data format issues.
66
AIS Server
[Figure: Client linked w/DAP Software, AIS Server, Data Source, AIS Resource; arrows numbered 0 to 3 match the steps below]
0. Client requests metadata from the AIS server (which appears no different from
any other DAP server).
1. The AIS server gets metadata from the data source.
2. The AIS server gets the matching AIS resource using the AIS database and
merges it into the metadata.
3. The AIS server returns the resulting metadata object.
67
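Steps 1 to 3 of the AIS Server slide boil down to a merge of two attribute collections that leaves the original data source untouched. The sketch below assumes metadata is a dict of variable name to attribute dict, and that AIS attributes win when both sides define the same attribute. Both the data shape and the conflict rule are assumptions for illustration, and `ais_response` is a hypothetical name.

```python
def ais_response(source_metadata, ais_resource):
    """Return source attributes augmented with AIS attributes.

    Neither input is modified, mirroring the point that the AIS needs
    no write access to the original data.
    """
    # Copy each variable's attributes so the source stays untouched.
    merged = {var: dict(attrs) for var, attrs in source_metadata.items()}
    for var, extra in ais_resource.items():
        # Assumed policy: AIS values override source values on conflict.
        merged.setdefault(var, {}).update(extra)
    return merged


# Made-up example: the source knows the units, the AIS adds a long name.
source = {"sst": {"units": "K"}}
ais = {"sst": {"long_name": "Sea surface temperature"}}
augmented = ais_response(source, ais)
```

The client only ever sees `augmented`, which is why the AIS server looks like any other DAP server from the outside.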
Lessons (Re)Learned
1. Modularity provides for flexibility
The more modular the underlying
infrastructure the more flexible the
system. This is particularly important for
network based systems for which the
technology, software and hardware, are
changing rapidly.
68
4/11/2016
Bureau of Meteorology, Melbourne Australia
Lessons (Re)Learned
2. Data of interest will be stored in a variety
of formats.
Regardless of how much one might want
to define the format to be used by system
participants, in the end the data will be
stored in a variety of formats.
2a. The same is true of use metadata!
69
Lessons Learned
3. Structural representation of sequence
data sets is a major obstacle to
interoperability
Care must be given to the organizational
structure (as opposed to the format) of the
data.
70
Lessons Learned
7. The lack of a consistent structure for
data inventories is a major obstacle to
the use of distributed systems.
71
Lesser Lessons Learned
9. Some surprises/observations
encountered in the OPeNDAP effort
• Metadata focus in the past has been on data discovery
not on data use, but metadata for use is where it’s at.
• Number of variables increases almost linearly with the
number of data sets.
• Users will take advantage of all of the flexibility offered
by a system sometimes to the disadvantage of all.
• Incredible variability in the structural organization of data.
72
Lessons Learned
10. Time to maturity is order 10 years not 3
Developing new infrastructure takes
time, particularly to iron out all of the
%^*% little details.
73
Summary
[Figure: summary diagram. Labels: Discovery, Inventory, Detail; Search, Catalog, Data]
74
Data discovery
• Free text search on the internet/ web
• Data portals
• What makes discovery work?
– For Deep Web?
– For Linked Data?
75
Data discovery
• What makes discovery work?
– Metadata
– Logical organization
– Attention to the fact that someone would want to
discover it
– It turns out that file types are a key enabler or
inhibitor to discovery
• What does not work?
– Result ranking using *any* conventional
algorithms
76
Smart search
• Semantically aware search, e.g.
http://noesis.itsc.uah.edu
• Faceted search, e.g.
– mspace (http://mspace.fm )
– jSpace (Clark and Parsia)
– Exhibit (MIT)
– S2S – e.g. International Open Government
Dataset Catalog (IOGDC; http://logd.tw.rpi.edu )
77
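Faceted search, as listed on the Smart search slide, narrows a result set by picking values for metadata fields and shows how many records remain under each value. A minimal sketch follows, assuming made-up catalog records and facet names; it does not reflect how mspace, Exhibit, or S2S are implemented.

```python
from collections import Counter

# Made-up catalog records; the facet names are illustrative only.
catalog = [
    {"title": "Budget 2011", "country": "US", "format": "CSV"},
    {"title": "Air quality", "country": "US", "format": "RDF"},
    {"title": "Census", "country": "UK", "format": "CSV"},
]


def facet_counts(records, facet):
    """Count how many records carry each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


def refine(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in selected.items())]


us_only = refine(catalog, country="US")
remaining_formats = facet_counts(us_only, "format")
```

The counts depend entirely on the metadata attached to each record, which is the point made on the Data discovery slide about what makes discovery work.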
NOESIS
78
Search Application integration!
Deep web dashboards…
80
International Open Government Dataset Catalog
http://logd.tw.rpi.edu
Federated search
• “is the simultaneous search of multiple online
databases or web resources and is an emerging
feature of automated, web-based library and
information retrieval systems. It is also often
referred to as a portal or a federated search
engine.” (Wikipedia)
• Libraries have been doing this for a long time
(Z39.50, ISO 23950)
• Key is consistent search metadata fields (keywords)
• E.g. Geospatial One Stop http://www.geodata.gov
82
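The federated search definition above (one query sent to several sources at once, relying on consistent keyword fields) can be sketched with the standard-library thread pool. The sources here are in-memory stand-ins with made-up records; `make_source` and `federated_search` are hypothetical names, and a real system would speak a protocol such as Z39.50 to remote catalogs.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(name, records):
    """Build a stand-in search function for one catalog."""
    def search(keyword):
        # Every source exposes the same 'keywords' field, which is what
        # makes a single query meaningful across all of them.
        return [dict(r, source=name) for r in records if keyword in r["keywords"]]
    return search


def federated_search(sources, keyword):
    """Send the same keyword to every source concurrently and pool the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        batches = pool.map(lambda search: search(keyword), sources)
        return [hit for batch in batches for hit in batch]


library = make_source("library", [
    {"title": "Ionosphere atlas", "keywords": ["ionosphere"]},
])
archive = make_source("archive", [
    {"title": "Radar runs", "keywords": ["ionosphere", "radar"]},
    {"title": "Tide tables", "keywords": ["ocean"]},
])
hits = federated_search([library, archive], "ionosphere")
```

Results come back grouped in the order the sources were listed; ranking across sources is a separate problem that the sketch does not attempt.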
Data integration
• “involves combining data residing in different
sources and providing users with a unified
view of these data. This process becomes
significant in a variety of situations both
commercial (when two similar companies
need to merge their databases) and scientific
(combining research results from different
bioinformatics repositories, for example). ”
83
Data integration
• “Data integration appears with increasing
frequency as the volume and the need to
share existing data explodes. It has become
the focus of extensive theoretical work, and
numerous open problems remain unsolved.
In management circles, people frequently
refer to data integration as "Enterprise
Information Integration" (EII).” (Wikipedia)
• Is this a data science/ management challenge
(rhetorical question)?
84
Value Chain –data.gov – Integration Context
[Figure: data.gov value chain. Supply side (community of suppliers): Acquire Data, Build Dataset, Publish, Enable Discovery. Use side (community of users): Discover, Connect, Participate, Enable Dataset Use. Supply Chain Management: no geo integration focus. Data.gov: Access and Interoperability Focused.]
Simple supply side questions that are very hard to answer?
• Who produces the information I need?
• Are they “the” recognized authority? How can I tell?
• How often will it be re-published?
– Is the supply predictable and reliable? Can I count on it?
• Do the data have a geospatial characteristic?
– What are its geospatial qualities (specs) and provenance?
– Is it consistently defined in its meaning?
– What is the scope of its coverage?
• Will the data be maintained?
– Geometry and models
– Attributes and metadata
• Where do I get it and in what forms?
They should not have to ask whether
it has been integrated.
87
What is stopping us from
answering these basic
questions?
88
Barriers to integration
• What is preventing our data from being
integrated?
– Acquisition:
• Uncoordinated data acquisition strategies at national level
• Barrier between business data and geospatial data, e.g. schools,
minerals
• Few means to broker and optimize requirements from consumers
– Production
• Quality of our metadata and when and how we get it
• Unclear operational roles in a national data framework. (NSDI)
• Absence of a granular or meaningful trustworthy data chain of
authority?
• Absence of a schedule to communicate what is going to be
happening?
89
Barriers
• What is preventing our data from being
integrated?
– Data Management
• Cataloging
• Fundamental Semantics (A16)
– Policy, Organization and Culture
• Federated political and government collection and production
environments
• divergent data quality requirements – national, state, local, regional
• Stove-piped national Geodetic policy (A16)
• Shifting market expectations and tolerances for lower quality in
favor of access?
• Legacy institutional barriers and thinking
• They are national assets, not just a program’s data.
Where are the problems occurring in the Value Chain?
[Figure: the data.gov value chain annotated with problem areas. Supply side (community of suppliers): Acquire Data, Build / Intra Dataset Integration, Publish, Enable Discovery. Use side (community of users): Discover, Connect, Participate, Enable Dataset Downstream Use. Annotations: Gap in what gets integrated; Ambiguous Cataloging and semantics; Gap in planning view of Acquisition; Data Integration $$$. Supply Chain Management: Data Integration Focused. Data.gov: Access and Interoperability Focused.]
We resemble this!
Why we need to think differently!
Aiding data integration
• Standards – formats for sure but also
– Metadata
– Semantics
– Designing for integratability!
• The goal should be to REDUCE the curation
barrier to data integration
• What would you do? What have you done?
94
Summary
• Theme of data management in the chaotic
and enabling environment of the web, internet
• Emergence of frameworks that encompass
some aspects of data management
• Unlocking data in an integratable way is an
immense challenge
• Anything/ everything you can do by following
what you have learned in this course will help
• http://tw.rpi.edu/web/Workshop/Community/GeoData2011
95
What is next
• Nov. 27 – project presentations
• Final assignment to be handed in
• Reading for this week:
– Semantic Deep Web, James Geller, Soon Ae
Chun, and Yoo Jung An,
– The Deep Web (Internet Tutorials)
– Digital Image Resources on the Deep Web
– Parsons and Fox: Is Data Publication the Right
Metaphor?
96