What are the prerequisites for data integration?

Data, Data everywhere:
the Need for Data Integration
Nicolas Spyratos
Professor Emeritus
University of Paris South
France
«Data, data everywhere» : The Economist, February 25, 2010
the relevant questions
What is data integration?
collecting and combining information from multiple sources into a single information source
Why is it needed?
to get more informative answers to important questions and/or analyse it to make decisions
How is it done?
by following a well-disciplined approach, not necessarily using computers
What are the technical problems when using computers to do it?
many, difficult and costly
What are the prerequisites for data integration?
datasets should be open and preferably linked
Example: writing a summary report on rice-production, transportation and commercialization
(using available information from Japan and Vietnam)
[Figure: datasets from Japan (in Japanese) and Vietnam (in Vietnamese) → translation (into Thai) → integration (in Thai) → decision making (minister of agriculture, Thai)]
one important difficulty though: the data sources are autonomous, heterogeneous and geographically dispersed
sharing the integrated information
[Figure: the information integrated from Japan and Vietnam is shared among the minister of agriculture, the minister of transport and the minister of commerce]
this is the concept of data integration independently of whether we use computers or not
let’s summarize (before going to computer-assisted integration)
[Figure: datasets (Japan, Vietnam) → translators → integrator → specialists → decision makers]
translators need to know the language of the dataset and the language of the integrator
we can now replace all intermediate activities with software modules and either store the knowledge of
the integrator in a database (called a data warehouse) or “simulate” it in software (called a mediator)
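The role of the software translators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the common schema RiceRecord, the source field names, the units and the figures are all invented here and do not come from any real Japanese or Vietnamese dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiceRecord:
    """Common schema understood by the integrator (hypothetical)."""
    country: str
    year: int
    production_tonnes: float


def translate_a(row):
    # Hypothetical source A: year stored as text, output in kilograms.
    return RiceRecord(row["region"], int(row["yr"]), row["output_kg"] / 1000)


def translate_b(row):
    # Hypothetical source B: output in thousands of tonnes.
    return RiceRecord(row["country_name"], row["year"], row["output_kt"] * 1000)


def integrate(*batches):
    # The integrator only ever sees records in the common schema.
    merged = [rec for batch in batches for rec in batch]
    return sorted(merged, key=lambda r: (r.country, r.year))


# Invented sample rows, for illustration only.
source_a = [{"region": "Japan", "yr": "2009", "output_kg": 5_000_000}]
source_b = [{"country_name": "Vietnam", "year": 2009, "output_kt": 2.5}]

integrated = integrate(
    [translate_a(r) for r in source_a],
    [translate_b(r) for r in source_b],
)
```

Each translator knows exactly two "languages": that of its own source and that of the integrator. Adding a new source then means writing one new translator, without touching the integrator.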
using computers for data integration – the data warehouse approach (data in advance)
[Figure: data warehouse architecture, drawn by analogy with a supply chain of “goods”: production → processing/transport → wholesaler → distribution (to retailers) → consumption]
datasets (dataset-1 ... dataset-n): databases, file systems, tweet sets, XML docs, etc.
translators (Translator-1 ... Translator-n): extract/transform
integrator: filters/loads
data warehouse (integrator database, with metadata): stores integrated data and answers queries
data marts (specialists): filter/answer
decision makers and analysts
Real world example: the Walmart data warehouse contains 2.5 petabytes of data
a few remarks about data warehouses
a data warehouse is above all a database, but of a specific nature:
• its users are mainly analysts and decision makers (i.e. not computer specialists)
• it is accessed in read-only mode (usually through data marts)
• updates happen only at the source datasets and are propagated to the data warehouse
periodically
• it stores mostly historical data (usually records), so its data volumes are
orders of magnitude higher than in traditional databases (the Walmart data warehouse stores
2.5 petabytes of data, i.e. 167 times the information contained in all the books in the US Library of Congress)
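The "data in advance" idea behind these remarks can be sketched with Python's built-in sqlite3 module. This is a toy illustration, not a description of any real warehouse: the table layout, the source names and the amounts are assumptions made up for the example.

```python
import sqlite3


def build_warehouse(conn):
    # One integrated table; every row remembers which source it came from.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (source TEXT, day TEXT, amount REAL)"
    )


def load(conn, source_name, rows):
    """Periodic load step: append already-translated rows from one source."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(source_name, day, amount) for day, amount in rows],
    )
    conn.commit()


def total_by_source(conn):
    """Analyst-side query: reads the integrated data, never writes."""
    cur = conn.execute(
        "SELECT source, SUM(amount) FROM sales GROUP BY source ORDER BY source"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_warehouse(conn)

# Invented rows standing in for two autonomous source datasets.
load(conn, "store_a", [("2010-01-01", 10.0), ("2010-01-02", 5.0)])
load(conn, "store_b", [("2010-01-01", 7.5)])

totals = total_by_source(conn)
```

The point of the sketch is the separation of roles: only load() writes, and it runs periodically on behalf of the sources, while analysts and decision makers only call query functions such as total_by_source().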
using computers for data integration – the mediator approach (data on demand)
[Figure: mediator architecture]
datasets (dataset-1 ... dataset-n): databases, file systems, tweet sets, XML docs, etc.
translators (Translator-1 ... Translator-n): extract/transform
mediator (software module): query decomposition, synthesis of answers
decision makers and analysts
Real world example: mediating a car dealers network
a few remarks on mediators
a mediator is not a database but a software module that allows querying multiple sources
• its users are mainly analysts and decision makers (i.e. not computer specialists)
• it answers the queries of its users
• users cannot update the sources through the mediator (just as they cannot update through a data warehouse)
• it does not store data; it only answers queries
• the translators are complex pieces of software, and writing generic translators is hard
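The "data on demand" behaviour described above can be sketched as follows, loosely inspired by the car dealers example. Everything concrete here is hypothetical: the two dealers, their internal formats, the stock and the prices are invented, and the common answer format (dealer, model, price_eur) is an assumption of the sketch.

```python
def dealer_a_translator(max_price_eur):
    # Hypothetical dealer A keeps prices in euro cents under its own field names.
    stock = [
        {"name": "hatchback", "cents": 1_200_000},
        {"name": "sedan", "cents": 2_500_000},
    ]
    return [
        {"dealer": "A", "model": s["name"], "price_eur": s["cents"] / 100}
        for s in stock
        if s["cents"] / 100 <= max_price_eur
    ]


def dealer_b_translator(max_price_eur):
    # Hypothetical dealer B keeps (model, price in euros) tuples.
    stock = [("coupe", 18000.0), ("van", 9000.0)]
    return [
        {"dealer": "B", "model": model, "price_eur": price}
        for model, price in stock
        if price <= max_price_eur
    ]


class Mediator:
    """Stores no data: it forwards queries and combines the answers."""

    def __init__(self, translators):
        self.translators = list(translators)

    def cars_under(self, max_price_eur):
        # Query decomposition: one sub-query per source, via its translator.
        partial_answers = [t(max_price_eur) for t in self.translators]
        # Synthesis of answers: merge and order the partial results.
        merged = [car for answer in partial_answers for car in answer]
        return sorted(merged, key=lambda car: car["price_eur"])


mediator = Mediator([dealer_a_translator, dealer_b_translator])
offers = mediator.cars_under(15000)
```

Unlike a data warehouse, nothing is copied in advance: each call to cars_under() reaches the sources at query time, so the answer reflects their current content, at the price of depending on every source being reachable.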
prerequisites for data integration
(whether in data warehousing or mediating)
a minimal requirement for data integration is that the datasets should be collections of
data, published or curated by a single agent, and available for access or download in one or
more formats
Example of such a dataset: The Credit Institutions Register
It is published by the European Banking Authority (EBA) and contains a list of credit institutions to which
authorization has been granted to operate within the European Union and European Economic Area countries
(EEA).
if the datasets to be integrated are also linked and open, then integration can release
social and commercial value (e.g. through data mining on the integrated datasets)
linked data
linked data is about publishing and connecting structured data on the Web, using standard Web technologies
(such as HTTP, RDF and URIs) to make the connections readable by computers; this enables data from different
sources to be connected and queried, allowing for better interpretation and analysis
an open dataset is a collection of data that can be freely used, modified, and shared by anyone for any purpose
most datasets on the web are linked (e.g. DBpedia)
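The way shared URIs let a program connect data from different sources can be sketched with plain Python tuples standing in for RDF-style (subject, predicate, object) triples. No RDF library is used, and all URIs below are placeholders under example.org invented for this illustration.

```python
# Hypothetical vocabulary and identifiers (example.org is a reserved domain).
NAME = "http://example.org/prop/name"
ABOUT = "http://example.org/prop/about"
JAPAN = "http://example.org/country/JP"
VIETNAM = "http://example.org/country/VN"
REPORT = "http://example.org/dataset/rice-2009"

# Source A describes countries; source B describes a dataset.
source_a = {
    (JAPAN, NAME, "Japan"),
    (VIETNAM, NAME, "Vietnam"),
}
source_b = {
    (REPORT, ABOUT, VIETNAM),
}


def objects(triples, subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Because both sources use the same URI for Vietnam, merging is a set union.
merged = source_a | source_b

# Follow the link from the dataset (source B) to the country name (source A).
country_uri = objects(merged, REPORT, ABOUT)[0]
country_name = objects(merged, country_uri, NAME)[0]
```

The two sources were never designed together; the connection works only because they agree on the identifier for the same real-world entity, which is the role URIs play in linked data.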
open data
a dataset is called open if it can be freely used, modified, and shared by anyone for any purpose
• most datasets on the web are not open (or, if they are, their quality is low)
• the following site contains a list of open datasets, most of which have since been closed!
https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
• however, within controlled user communities, openness is extremely useful
(e.g. collaborative working environments, big companies, government agencies)
concluding remarks
• data integration is a basic tool in a large number of social and commercial activities
(e.g. hotel or airplane bookings, e-learning, digital libraries, e-Government etc.)
• data warehouses and mediators constitute the common supporting technology for data
integration
• data integration is especially important to governments, where large amounts of data
reside in isolated information silos
linking, integrating and opening government data can help drive the creation
of innovative businesses and services that deliver social and commercial value
thank you for your attention