Information Mediation


Information Mediation: Integrating Information from Multiple Information Sources
Naveen Ashish
Amit P. Sheth
Department of Computer Science and
Large Scale Distributed Information Systems Lab
University of Georgia, Athens
What is an Information Agent/Mediator ?





A software system that provides integrated and structured
query access to multiple distributed information sources
Sources may be databases of various kinds or Web sources
Sources are autonomously created and heterogeneous
Accessible via a network
Mediator provides the illusion of a single information
source
Information Agents aka Mediators
Example: Restaurant and Theatre Info on the Web
[Figure: the Ariadne Mediator integrating Map Servers, Geocoders, Zagat, Movies, and Health Ratings]
Why the Interest in Building Such Systems ?
[Figure: a MEDIATOR over Oracle, Sybase, IBM DB2, a Legacy System, and an Object-Oriented DB]
Mediators on the Web
[Figure: a MEDIATOR accessing DB1, DB2, and a Wrapper]
Organization of Remainder of Talk




Introduction
– Information Agents, System Architecture
Research Issues
– Information Modeling
– Query Planning
– Semi-automatic Wrapper Generation
– Performance Optimization by Materialization
– Resolving Inconsistencies
Industry Products for Data Extraction and Integration
Start-up Ventures
Representative Systems (Research Projects)
SIMS/Ariadne, University of Southern California/ISI
TSIMMIS, Stanford
Information Manifold, AT&T Research
Garlic, IBM Almaden
Tukwila, University of Washington
InfoSleuth, MCC
DISCO, University of Maryland/INRIA
HERMES, University of Maryland
InfoMaster, Stanford
InfoQuilt, University of Georgia
Information Modeling





Multiple, heterogeneous, autonomously created
information sources
User sees an integrated (global) view
– Queries a “mediated schema”
A uniform model for all sources
– Must be (at least) expressive enough to model the most
complex information source
Each source provides a set of relations or classes
– Translation (model) is done by wrapper at each source
Integration
– Global as view, Local as view
Global as View

For each relation (class) in mediated schema we
specify how to obtain its tuples from the sources
[Figure: mediated relation RESTAURANT defined over sources ZAGAT, FODORS, GEOCODER, and DOH Ratings; attributes shown include Name, Phonenumber, Address, Lat, Lon, Rating, Telephone, Phone #, and Reviews]
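The global-as-view idea can be sketched in a few lines of Python. This is a minimal illustration, assuming toy in-memory source tables: the mediated RESTAURANT(name, phonenumber) relation is defined as a query over ZAGAT and FODORS. The dictionary keys and the review strings are hypothetical and do not reproduce the actual source schemas.

```python
# Hypothetical source relations; keys and review text are made up for illustration.
ZAGAT = [
    {"name": "Chinois", "address": "2720 Main St", "telephone": "310-777-9876"},
]
FODORS = [
    {"name": "Chinois", "phone": "310-777-9876", "reviews": "sample review"},
    {"name": "Peking Star", "phone": "213-999-7676", "reviews": "sample review"},
]


def restaurant():
    """GAV definition: RESTAURANT(name, phonenumber) as a union over the sources."""
    tuples = set()
    for row in ZAGAT:
        tuples.add((row["name"], row["telephone"]))
    for row in FODORS:
        tuples.add((row["name"], row["phone"]))
    return sorted(tuples)  # duplicates across sources collapse in the set


if __name__ == "__main__":
    for name, phone in restaurant():
        print(name, phone)
```

With a definition like this, a query on the mediated relation is answered by unfolding it into the source accesses that the definition names.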
Heterogeneity Resolution



Sources may use different models
– OO, Relational, Legacy, …..
– May be Web sources
– Wrapper “exports” contents in a uniform model
Structural and schematic differences
(name, address)
(name, street, city, state, zip)
Semantic
(name, phonenumber)
(name, telephone)
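A wrapper can hide both kinds of difference from the mediator. The sketch below is one possible approach, assuming source records arrive as Python dictionaries; the function name to_mediated and the target attribute names are hypothetical.

```python
def to_mediated(record):
    """Map one source record onto mediated attributes (name, address, phonenumber)."""
    out = {"name": record["name"]}
    # Structural difference: the address is either one field or split into parts.
    if "address" in record:
        out["address"] = record["address"]
    else:
        parts = [record.get(key, "") for key in ("street", "city", "state", "zip")]
        out["address"] = ", ".join(p for p in parts if p)
    # Semantic difference: "telephone" and "phonenumber" name the same attribute.
    out["phonenumber"] = record.get("phonenumber") or record.get("telephone")
    return out
```

Real sources usually need more than renaming and concatenation, for example unit or format conversion, so this covers only the two cases listed above.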
Global as View: Models


KR based models (SIMS, Ariadne, ….)
– LOOM, CLASSIC
OO, based on ODMG (DISCO, Garlic …)
interface Restaurant {
    attribute string name;
    attribute string address;
    attribute string cuisine;
    attribute string review;
}
extent restaurant0 of Restaurant wrapper w0 repository r0
map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c))
Local as View

For every information source S describe it in terms of
relations in the mediated schema
v1(name,address,cuisine,rating) :- Restaurant(name,address,cuisine,rating) ^ city = “Santa Monica”
v2(name, foodrating) :- Restaurant(name,address,cuisine,rating)
….
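With local-as-view descriptions, the mediator can reason about which sources are relevant to a query. The sketch below is a simplified relevance test, not a full algorithm for answering queries using views. It assumes each view is recorded as the attributes it exports plus the constraints its tuples satisfy, and it treats v2's foodrating as the mediated rating attribute. Both are assumptions made for illustration.

```python
# Each source is described as a view over the mediated Restaurant relation.
SOURCE_VIEWS = {
    "v1": {"attrs": {"name", "address", "cuisine", "rating"},
           "constraints": {"city": "Santa Monica"}},
    "v2": {"attrs": {"name", "rating"}, "constraints": {}},
}


def relevant_sources(wanted_attrs, query_constraints):
    """Return views that export the wanted attributes and do not contradict the query."""
    chosen = []
    for view, desc in SOURCE_VIEWS.items():
        if not wanted_attrs <= desc["attrs"]:
            continue  # the view cannot supply every requested attribute
        contradicts = any(
            attr in query_constraints and query_constraints[attr] != value
            for attr, value in desc["constraints"].items()
        )
        if not contradicts:
            chosen.append(view)
    return chosen
```

A query for names and ratings in a city other than Santa Monica would skip v1, because v1 is described as holding only Santa Monica restaurants.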
Query Planning and Optimization




Mediator must generate an information gathering plan
Constraints on execution
– Binding patterns ....
Optimization of query plans
Current areas of work
– Optimization
– Approximate answers (incomplete sources)
– Query planning for other sources such as simulations,
computer programs etc.
– Query execution engines
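Binding patterns constrain the order in which sources can be accessed: a source that needs an address as input cannot be called until some earlier step has produced one. The sketch below shows one simple way to find an executable order, assuming each source is described by the inputs it requires and the outputs it produces. The source names and attributes in SOURCES are hypothetical, and a real planner would also weigh cost.

```python
def order_sources(sources, initially_bound):
    """Order source accesses so each source's required inputs are bound first.

    `sources` maps a source name to (required_inputs, produced_outputs), both sets.
    Raises ValueError if no executable order exists.
    """
    bound = set(initially_bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        ready = [name for name, (needs, _) in remaining.items() if needs <= bound]
        if not ready:
            raise ValueError(f"cannot satisfy binding patterns for: {sorted(remaining)}")
        name = sorted(ready)[0]  # deterministic tie-break for this sketch
        plan.append(name)
        bound |= remaining.pop(name)[1]  # its outputs are now available as inputs
    return plan


# Hypothetical binding patterns for three sources.
SOURCES = {
    "map_server": ({"lat", "lon"}, {"map"}),
    "geocoder": ({"address"}, {"lat", "lon"}),
    "zagat": ({"cuisine"}, {"name", "address"}),
}

if __name__ == "__main__":
    print(order_sources(SOURCES, {"cuisine"}))
```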
Query Plans and Plan Quality
[Figure: Low-Quality Plan vs. High-Quality Plan]
Accessing Sources via Wrappers
SELECT address, tel
FROM Restaurant
WHERE cuisine = “chinese”
Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
.....
Semi-Automatic Wrapper Generation



Need wrappers for several sites
– Building wrappers by hand is tedious and time consuming
Approaches to automating the process
– Exploit format information (structure, HTML etc. )
– Template based approaches
– Machine learning techniques
XML
<name> Peking Star </name>
<address> 1 Broad Street, Los Angeles </address>
<phone>31-822-1511 </phone>
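Once a page has been marked up this way, extracting a record is straightforward. The sketch below uses Python's standard xml.etree.ElementTree module and assumes the three elements are enclosed in a root element, here called restaurant. That root is an assumption, since the fragment above shows only the fields.

```python
import xml.etree.ElementTree as ET

# The field values are copied from the fragment above; <restaurant> is assumed.
FRAGMENT = """
<restaurant>
  <name> Peking Star </name>
  <address> 1 Broad Street, Los Angeles </address>
  <phone>31-822-1511 </phone>
</restaurant>
"""


def extract_record(xml_text):
    """Turn one record element into a dict of whitespace-stripped field values."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


if __name__ == "__main__":
    print(extract_record(FRAGMENT))
```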
Wrappers .... Work in Progress





Database wrappers
Variety of techniques for Web wrappers
“Upmarking”
– To XML
Building “Web-bases”
Other Artificial Intelligence techniques
– Natural Language Processing
– IR
– Classifiers
Performance Issue




Query processing time is typically very high
Despite the mediator generating efficient query plans
Cost of fetching data and pages from remote sources dominates
– Have to typically fetch a large number of Web pages
– The Web sources are not designed for database like query access
– The Web sources can be slow
Further improve performance by materializing data at the mediator
side.
Store and Materialize Data Locally
Wrapped Web Source (SLOW)
MEDIATOR
Materialized Data (FAST)
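The fast and slow paths can be sketched as follows. This is a minimal illustration, assuming queries are hashable keys and the wrapped source is any callable that returns tuples; the class name MaterializingMediator and its methods are hypothetical.

```python
class MaterializingMediator:
    """Answers queries from a local store when possible, else from the slow source."""

    def __init__(self, fetch_from_source):
        self._fetch = fetch_from_source  # callable: query -> list of tuples
        self._store = {}                 # materialized results keyed by query
        self.remote_calls = 0            # counts accesses to the wrapped source

    def materialize(self, query):
        """Prefetch the answer to `query` and keep it at the mediator."""
        self._store[query] = self._fetch(query)
        self.remote_calls += 1

    def answer(self, query):
        if query in self._store:
            return self._store[query]    # fast path: local data
        self.remote_calls += 1
        return self._fetch(query)        # slow path: go to the wrapped source
```

The sketch ignores source updates, so materialized answers can go stale.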
Selective Materialization



Why not simply materialize all the data in all the Web sources
being integrated and have a really fast mediator ??
– Will not scale, amount of space needed may be too much
– Web sources can get updated
 Cost of keeping data consistent can get prohibitive
– We are building a mediator, not a data warehouse !
Approach then is to selectively materialize data
How do we automatically identify the portion of data most
useful to materialize ?
Selecting Data to Materialize
Inputs to SELECTING CLASSES:
– Distribution of User Queries (identify frequently accessed classes)
– Structure of Sources (prefetch data to speed up expensive queries)
– Updates (have to consider maintenance cost)
Output: Classes of Data to Materialize
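One way to combine these factors is a greedy selection under a space budget. The scoring below is an illustrative assumption, not the method used in any particular system: net benefit is the query time saved minus the cost of keeping the copy consistent, and candidates are ranked by net benefit per unit of space. All field names are hypothetical.

```python
def select_classes(candidates, space_budget):
    """Greedily pick classes of data to materialize under a space budget.

    Each candidate is a dict with: name, query_freq, fetch_cost,
    update_rate, refresh_cost, size (size must be positive).
    """
    def net_benefit(c):
        saved = c["query_freq"] * c["fetch_cost"]         # time saved on queries
        maintenance = c["update_rate"] * c["refresh_cost"]  # cost of staying consistent
        return saved - maintenance

    ranked = sorted(
        (c for c in candidates if net_benefit(c) > 0),
        key=lambda c: net_benefit(c) / c["size"],
        reverse=True,
    )
    chosen, used = [], 0
    for c in ranked:
        if used + c["size"] <= space_budget:
            chosen.append(c["name"])
            used += c["size"]
    return chosen
```

A frequently queried but rapidly changing class can end up with a negative net benefit and is then never materialized, which matches the concern about maintenance cost.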
Inconsistency Resolution



Same object in different formats
“United States” and “US”
“Red Lobster” and “The Red Lobster”
“John Smith”, “Smith, J.” , “J. Smith”, “Dr. John Smith” ...
Has appeared in other database and IR contexts
Solutions
– Mapping tables
 For finite domains (such as cities, countries, companies …)
 Simply maintain an enumerated list of possible formats for
each object
 (New York, N.Y., NYC, New York City, Big Apple)
Mapping Functions


Mapping functions
– When domain is not finite (person names)
– Domain specific mapping transformations
 Stemming common words (Inc., Corp., The etc.)
 Matching full word and abbreviation
 Match 2 formats with a score
Current work
– Learning mapping functions from example matches
– IR based approaches
– Building “metabases”
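Mapping tables and mapping functions can be sketched together. The table entries below come from the examples in this talk. The scoring step uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure, which is an assumption made for illustration rather than the measure used by any system described here. The common-word step simply drops the listed words.

```python
import re
from difflib import SequenceMatcher

# Mapping table: enumerated formats for objects from a finite domain.
MAPPING_TABLE = {
    "New York": ["New York", "N.Y.", "NYC", "New York City", "Big Apple"],
    "United States": ["United States", "US"],
}
# Inverted once so each lookup is a single dictionary access.
CANONICAL = {
    variant.lower(): canonical
    for canonical, variants in MAPPING_TABLE.items()
    for variant in variants
}

# Common words dropped before comparison (lowercased, punctuation removed).
COMMON_WORDS = {"inc", "corp", "the"}


def canonicalize(value):
    """Mapping-table lookup: canonical name for a known format, else unchanged."""
    return CANONICAL.get(value.strip().lower(), value)


def normalize(name):
    """Lowercase, strip punctuation, and drop common words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in COMMON_WORDS)


def match_score(a, b):
    """Mapping function: similarity in [0, 1] between two formats."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

A mediator would typically try the table first and fall back to the scored match, accepting a pair only above some threshold chosen for the domain.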
Mediator Prototypes and Software









Software and tools from mediator research projects
What may be available:
– Mediator kernels (integration engines)
– Data modeling tools, Description Logic systems
– Wrapper and extractor toolkits and software
– Plenty of papers !
Ariadne, USC/ISI, http://www.isi.edu/ariadne
TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
Tukwila, U Washington, http://data.cs.washington.edu/integration/tukwila/
Applications of Mediators






Heterogeneous and Distributed Database Integration
– Legacy systems integration
Web Sources Integration
Data Integration for E-commerce
– Integrating product catalogs, multiple vendors
Data Warehousing
– For populating data warehouses
Bioinformatics
Information Management Environments
 Digital Libraries
 Healthcare Information Systems
Industry Products (IBM DB2 DataJoiner)






IBM DB2 DataJoiner
http://www-4.ibm.com/software/data/datajoiner/
Enterprise data integration middleware
DataJoiner functionality now incorporated in IBM DB2 UDB
http://www-4.ibm.com/software/data/db2/udb/about.html
Native support for popular relational data sources
– DB2, Informix, SQL Server, Sybase, Teradata and others
– Supports non relational data sources
– Support for Web data
– Available on variety of platforms and OS
Start-up ventures: Junglee Corp






Website: www.amazon.com (Acquired)
Researcher Founders: Rajaraman, Gupta, Harinarayan, Mathur
Products and Services:
– Tools for data extraction and integration
– Building warehouse from multiple Web sources
 Integrating apartment listings from multiple sources
 Integrating job postings from multiple online job sources
Market focus: Online shopping
Current Status: Acquired by Amazon
Similar ventures: Netbot Inc. (www.excite.com), acquired by Excite
Cohera





Website: www.cohera.com
Researcher Founders: Stonebraker, Hellerstein
Products and Services:
– Cohera E-Catalog System
– Integrates product data from multiple sellers and product catalogs
– Set of software servers and tools for building and running live
“e-catalogs”
Market(s) Targeted: E-Commerce
Customers: E-Commerce communities - ThomasNet, Trapezo,
LiveListings, FoodService.Com

Current Status: Founded October 1997, Privately Held

Similar ventures: Enosys Markets Inc. (www.enosysmarkets.com)
Mergent Inc. (www.mergent.com)
Nimble Technology





Website: www.nimble.com
Researcher Founders: Levy, Weld
Products and Services:
– Nimble Data Integration Suite
– XML-based integration approach
– Current focus on multiple information sources integration
– Tools for data extraction and Data Integration Engine
Market focus: CRM, Business Intelligence, B2B, Portals
Current Status: Founded June 1999, Privately Held
WhizBang! Labs






Website: www.whizbanglabs.com
Researcher Founders: Quass, Geddes, Mitchell
Products and Services:
– Technology for building “Webbases” - databases created by
extracting data from Web pages
– Topic specific
– Topic specific crawler for retrieving pages
– Tools for extracting data from Web pages, cleaning data
and loading into database
Market focus: Content providing portals
Current Status: Founded March 1999, Privately held
Similar ventures: Fetch Technologies (www.fetch.com)
Bioinformatics: A Data Integration Grand Challenge







Mapping of Human Genetic Code complete
– New, revolutionary, computational approach to drug discovery
Huge amounts of genetic, chemical and biological data being
generated at an exponential rate in biotech/pharma R&D
– Complex structures, maps, sequence data etc.
Drug discovery scientists need integrated access to this data
– Look for patterns across data sources
Need to integrate data from multiple labs
Lab procedures (and thus the data) keep changing
A good amount of genomic data is free text
DiscoveryLink: State of the art Life Sciences data integration
middleware from IBM
http://www-4.ibm.com/software/webservers/lifesciences/discovery.html
Conclusion






Information mediation
Issues in building such systems
Research projects
Industry products
Start-up ventures
Applicable to a wide range of areas, such as E-commerce, database and
legacy systems integration, Web source extraction, content
management, portals, digital libraries, and bioinformatics.