Opportunities and Challenges along the ñInformation SuperHighwayî

Download Report

Transcript Opportunities and Challenges along the ñInformation SuperHighwayî

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SLOAN SCHOOL OF MANAGEMENT
INFORMATION TECHNOLOGIES GROUP
SEMANTIC INTEGRATION:
Context Interchange (COIN) Project
Presentation & Demonstration
Stuart Madnick ([email protected])
Michael Siegel ([email protected])
1
2003-07-01
Agenda
• Introduction to the Context Interchange
(COIN) Project – Motivation
"[A]lthough there are many private and public databases that
contain information potentially relevant to counterterrism
programs, they lack the necessary context definitions (i.e.,
metadata) and access tools to enable interoperation with other
databases and the extraction of meaningful and timely
information."
National Research Council (2002), "Making the Nation Safer" (emphasis added)
• Description of the COIN Context Mediation
Technology
• Demonstration of the COIN Technology
2
COntext INterchange (COIN) Project
Applications
Receivers
OUTPUT
PROCESSING
CONTEXT
MEDIATION
INPUT
PROCESSING
ODBC Driver
* Automatic
conflict
detection and
conversion
* Automatic
web
wrapping
- Derived data
Web Publishing
- Source selection
- Source attribution
TRUSTED
AGENTS
Browsers
- Semistructured text
-Multi-source
query plan and
execution
APPLICATIONS: Financial services, electronic
commerce, asset visibility, in-transit visibility.
Web
Pages
Sources
Data
bases
3
Key COIN Technologies

Web Wrapper

Extract selected information from web (HTML+XML)

Allows web to be treated as large relational SQL database

Handles dynamic web sites, cookies, “login”, etc.

Performs SQL Joins & Unions involving DB’s + Web sources
Context Mediator


Resolve semantic (meaning) differences

Enable meaningful aggregation & comparison
4
Background on DARPA Support
for Context Mediation Research
• Initial efforts funded as part of DARPA Intelligent
Integration of Information (I3) Program
• Period: July 1993 - Sept 1998
• Started under: Gio Wiederhold
• then under: Dave Gunning & Bob Neches
• I3 Program discontinued …
Other related activity:
• MIT Total Data Quality Management (TDQM)
• Since 1991 (web.mit.edu/tdqm)
5
Multiple Perspectives . . .
old lady or young lady ?
6
Role Of Context
02-01-03
Context
Context
$
01-02-03
£
?
Context
¥

03-02-01
CONTEXT VARIATIONS:
- GEOGRAPHIC ( US vs. UK )
- FUNCTIONAL (CASH MGMT vs. LOANS )
- ORGANIZATIONAL ( CITIBANK vs. CHASE )
Data:
Databases
Web data
E-mail
7
Types of Context
Representational Ontological
Temporal
Example
Temporal
Representational Currency: $ vs €
Francs before 2000,
Scale factor: 1 vs 1000 € thereafter
Ontological
Revenue: Includes vs
excludes interest
Revenue: Excludes
interest before 1994
but incl. thereafter
8
Example : Context Differences
( from multiple web sources)
Daimler Benz ( DCX ) Financial Data
P/E Ratio
11.6
ABC
Bloomberg 5.57
DBC
19.19
MarketGuide 7.46
9
Complementary Aggregation Example
• Q: How did CO2 emissions
(total, per GDP, per capita)
change over time
(between 1990 and 2000)
in Yugoslavia?
– Context 1: YUG as a
geographic region bounded
before the breakup
– Context 2: YUG as a legal
autonomous state
Related effort:
- Laboratory for Information Globalization and Harmonization Technologies and
10
Studies ( LIGHTS ) Project
World Bank’s World Dev.
Indicator DB; UN Statistic
Division; Statistics Bureaus
OAK Ridge’s CDIAC DB;
WRI; GSSD; EPAs
Olsen (Web)
CO2 Emission
1990
2000
Countr
y
GDP
Pop
GDP
Pop
YUG
698.3
23.
7
1627.
8
10.
6
BIH
13.6
3.9
HRV
266.9
4.5
MKD
608.7
GDP in billions local currency;
SVN
7162
Population in millions
2.0
Fro
m
To
1990
2000
10.5
67.267
Country
1990
2000
YUG
35604
1548
0
USD
YU
G
BIH
1279
USD
BIH
2.086
HRV
5405
USD
8.089
MKD
3378
HR
V
USD
MK
D
64.757
InSVN
1000 tons per year3981
USD
SV
N
Context 1
2.0
225.93
Context 2
Country
Code
Currency
CurCod
e
Country
1990
2000
1990
2000
Yugoslavia
YUG
New
Yug.
Dinar
YUN
CO2
35604
29523
35604
1548
0
GDP
66.5
104.8
66.5
24.2
Marka
BAM
CO2/capita
1.5
1.28
1.5
1.46
CO2/GDP
535
282
535
640
Bosnia and
Herzegovia
BIH
Croatia
HRV
Kuna
HRK
Macedonia
MKD
Denar
MKD
Slovenia
Tolarneeded:
SIT
ManySVN
sources
Meanings in sources & users might differ
GDP/Capit 2800
4560
2800
1100
Total
CO2
in
1000
tons
per
year;
GDP
in
billions
USD;
a
CO2/Capita in tons per person; CO2/GDP in tons per
11
million USD; GDP/Capita in USD per person
The 1999 Overture
Unit-of-measure mixup tied to
loss of $125Million Mars Orbiter
“NASA’s Mars Climate Orbiter was lost because
engineers did not make a simple conversion from English
units to metric, an embarrassing lapse that sent the $125
million craft off course. . . .
. . . The navigators ( JPL ) assumed metric units of
force per second, or newtons. In fact, the numbers were
in pounds of force per second as supplied by Lockheed
Martin ( the contractor ).”
Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.
12
The Context Interchange Approach
Concept:
Length
Meters
f()
meters
feet
Feet
Shared
Ontologies
Source
Context
Conversion
Libraries
Context
Mediator
Receiver
Context
part
length
Context
Transformation
17
Source
Context Management
Application
Select partlength
From catalog
Where partno=“12AY”
Receiver
13
COIN Conceptual Model
(Ontology)
14
Another Context Example
(Basis for Demo)
Company Name DAIMLER-BENZ
Net Income
614,995
97,736,992
Sales
Context Mediation
Services
*
Datastream
Company Name DAIMLER-BENZ AG
Net Income
346,577
Sales
*
56,268,168
WorldScope
Company Name DAIMLER BENZ CORP
Net Income
615,000,000
Sales
*
97,737,000,000
Disclosure
O&A DEM-USD Exchange Rate
1.00 German Mark= 0.58 US Dollar as 12/31/93
*
OANDA
Web Server
Users & Appl.
Systems
* Wrapper Services
15
Some Context Differences
Context Definitions
Disclosure
Country of
Incorporation
Money Amount
As_Of_Date
3 Letters
Currency
Used
Currency
Conversion
Currency
Symbols
Scale Factor 1
Disclosure Names
Company
Names
American with ‘/’ as
Date Style
separator
Worldscope
USD
Money Amount
As_Of_Date
3 Letters
DataStream
Country of
Incorporation
Money Amount
As_Of_Date
2 Letters
1000
Worldscope Names
1000
DataStream Names
American with ‘/’ as
separator
European with ‘-’ as
separator
Olsen (OANDA) Web Source uses 3 Letter Currency Symbols and European Date Style
with ‘/’ as a separator
16
Domain Model
exchangeRate
scaleFactor
number
string
countryName
currencyType
date
companyFinancials
Inheritance
Attribute
Modifier
company
companyName
Some currency context possibilities:
• Currency is stated explicitly as part of record
• Currency not stated, but the same for all (e.g., US $)
17
• Currency not stated or constant, but inferred by country
COIN System Architecture
SERVER PROCESSES
MEDIATOR PROCESSES
CLIENT PROCESSES
Web Client
SQL Query
Context
Mediator
Datalog
Query
WWW Gateway
Optimizer
HTTPD-Daemon
Wrapper
Executioner
Mediated
Query
Optimized
Query Plan
(cgi-scripts)
N
HTTPD-Daemon
SQL Compiler
SQL Query
HTTPD-Daemon
COIN
Repository
N
ODBC-compliant Apps
Results
(e.g Microsoft Excel)
ODBC-Driver
HTTPD-Daemon
Web-site
Data Store for
Intermediate
Results
18
System Demonstration
Single Source Queries with Mediation
Q6. Scenario: Using Context Interchange, you can look at
the Disclosure data using Datastream Context.
Query: Find out from Disclosure what Net Income for
DAIMLER-BENZ was. Use Datastream Context.
Capabilities Demonstrated:
Ability to perform Scale Factor Conversion, Date Format
Conversion, Company Name Conversion.
19
Demonstration @ context2.mit.edu
Source
Context
20
Context Metadata (Partial)
21
Conflict Detection and Mediation
Mediated Query in Datalog
Date convert
Scale factor convert
Name convert
22
Mediated SQL Query & Result
Mediated SQL Query
Adjust scale factor
Date format
conversion
Name conversion
Final results – from Disclosure but in Datastream context
23
More Complex Example (4 sources: DB + Web)
Databases
Web source
select WorldcAF.TOTAL_ASSETS, DiscAF.NET_SALES,
DiscAF.NET_INCOME, DStreamAF.TOTAL_EXTRAORD_ITEMS_PRE_TAX,
quotes.Last
from WorldcAF, DiscAF, DStreamAF, quotes
where WorldcAF.COMPANY_NAME = "DAIMLER-BENZ AG"
and DStreamAF.AS_OF_DATE = "01/05/94"
and WorldcAF.COMPANY_NAME = DStreamAF.NAME
24
and WorldcAF.COMPANY_NAME = DiscAF.COMPANY_NAME
and WorldcAF.COMPANY_NAME = quotes.Cname;
Conflict Table (1st part)
25
Conflict Table (2nd part)
26
Generated SQL (1st Part)
select worldcaf.total_assets, discaf.net_sales, ((discaf.net_income*0.001)*olsen.rate),
(dstreamaf2.total_extraord_items_pre_tax*olsen2.rate), quotes.Last
from (select date1, 'European Style -', '01/05/94', 'American Style /'
from datexform
where format1='European Style -'
and date2='01/05/94'
and format2='American Style /') datexform,
(select dt_names, 'DAIMLER-BENZ AG'
from name_map_dt_ws
where ws_names='DAIMLER-BENZ AG') name_map_dt_ws,
(select ds_names, 'DAIMLER-BENZ AG'
from name_map_ds_ws
where ws_names='DAIMLER-BENZ AG') name_map_ds_ws,
(select 'DAIMLER-BENZ AG', ticker, exc
from ticker_lookup2
where comp_name='DAIMLER-BENZ AG') ticker_lookup2,
(select 'DAIMLER-BENZ AG', latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets,
country_of_incorp
from worldcaf
where company_name='DAIMLER-BENZ AG') worldcaf,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes,
(select exchanged, 'USD', rate, date
from olsen
where expressed='USD') olsen,
(select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp
from discaf) discaf,
27
Generated SQL (Continued - Partial)
(select as_of_date, name, total_sales, total_extraord_items_pre_tax,
earned_for_ordinary, currency
from dstreamaf) dstreamaf,
(select as_of_date, name, total_sales,
total_extraord_items_pre_tax, earned_for_ordinary, currency
from dstreamaf) dstreamaf2,
(select char3_currency, char2_currency
from currency_map
where char3_currency <> 'USD') currency_map,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes2,
(select exchanged, 'USD', rate, '01/05/94'
from olsen
where expressed='USD'
and date='01/05/94') olsen2,
(select Cname, Last
from quotes) quotes
where currencytypes.country = discaf.location_of_incorp
and currencytypes.currency = olsen.exchanged
and dstreamaf.currency = dstreamaf2.currency
and dstreamaf2.currency = currency_map.char2_currency
and olsen.date = discaf.latest_annual_data
and currency_map.char3_currency = currencytypes2.currency
and currencytypes2.currency = olsen2.exchanged
and name_map_dt_ws.dt_names = dstreamaf2.name
and name_map_ds_ws.ds_names = discaf.company_name
and ticker_lookup2.ticker = quotes.Cname
and datexform.date1 = dstreamaf2.as_of_date
and currencytypes.currency <> 'USD'
and currency_map.char3_currency <> 'USD'
union
select worldcaf2.total_assets, discaf2.net_sales,
((discaf2.net_income*0.001)*olsen3.rate),
dstreamaf4.total_extraord_items_pre_tax, quotes2.Last
from (select date1, 'European Style -', '01/05/94', 'American Style /'
from datexform
where format1='European Style -'
and date2='01/05/94'
and format2='American Style /') datexform2,
(select dt_names, 'DAIMLER-BENZ AG'
from name_map_dt_ws
where ws_names='DAIMLER-BENZ AG') name_map_dt_ws2,
(select ds_names, 'DAIMLER-BENZ AG'
from name_map_ds_ws
where ws_names='DAIMLER-BENZ AG') name_map_ds_ws2,
(select 'DAIMLER-BENZ AG', ticker, exc
from ticker_lookup2
where comp_name='DAIMLER-BENZ AG') ticker_lookup22,
(select 'DAIMLER-BENZ AG', latest_annual_financial_date,
current_outstanding_shares, net_income, sales, total_assets,
country_of_incorp
from worldcaf
where company_name='DAIMLER-BENZ AG') worldcaf2,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes3,
(select exchanged, 'USD', rate, date
from olsen
where expressed='USD') olsen3,
(select company_name, latest_annual_data,
current_shares_outstanding, net_income, net_sales, total_assets,
location_of_incorp
from discaf) discaf2,
(select as_of_date, name, total_sales,
total_extraord_items_pre_tax, earned_for_ordinary, currency
from dstreamaf) dstreamaf3,
(select 'USD', char2_currency
from currency_map
where char3_currency='USD') currency_map2,
etc
28
Final Result
29
Execution Trace (1st Part - Partials)
Parallel Execution
...
Retrieving data
From Web source
30
Execution Trace (Continued - Partials)
...
Stock price returned
From Web source
Another Web source used
(for currency conversion)
...
31
The 1805 Overture
In 1805, the Austrian and Russian Emperors agreed
to join forces against Napoleon. The Russians promised
that their forces would be in the field in Bavaria by Oct. 20.
The Austrian staff planned its campaign based on
that date in the Gregorian calendar. Russia, however, still
used the ancient Julian calendar, which lagged 10 days
behind.
The calendar difference allowed Napoleon to
surround Austrian General Mack's army at Ulm and force
its surrender on Oct. 21, well before the Russian forces
could reach him, ultimately setting the stage for Austerlitz.
Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, pg. 390.
32
Summary
• Tremendous opportunity to gather and integrate
information from many diverse sources
• But … need to overcome many context challenges
• Context-type “metadata” plays a critical role
• COIN technology can be an important aid for
semantically meaningful information integration:
- Scalable
- Extensible
- Application Domain Merging
- Reuse and extension of ontologies and contexts
33
Appendix – Some Useful Reference Material
• Documents
Overview: http://computer.org/conferen/proceed/meta/1999/papers/84/smadnick.html
"Metadata Jones and the Tower of Babel: The Challenge of Large-Scale
Semantic Heterogeneity", (IEEE Meta-Data Conference)
Theory of COIN: http://web.mit.edu/smadnick/www/wp/1997-03.pdf
“Context Interchange: New Features and Formalisms for the Intelligent Integration of
Information” (ACM TOIS)
Contact us for more …
• Web sites
Main COIN web site: http://context2.mit.edu
Miscellaneous demos: http://context2.mit.edu/coin/demos/
Self-explanatory demo: http://interchange.mit.edu:8080/gcms_v4/airCarMergeTop.html
(Airfare and Car rental applications, includes ontology merging.
34
Caution: still under development)
Appendix: Sample Applications
•
•
•
•
•
•
•
•
Airfare, Car Rental and Merged Travel
Weather
Global Price Comparison
Airfare Aggregation
Disaster Relief
TASC Financial Example
Web Services Demo
Corporate Householding
35
Appendix:
User or
Program (via SQL Query)
COIN Web-Wrapper Technology
Select Edgar.Net_income
From Edgar
Where Edgar.Ticker=intc
and Edgar.Form=10-Q
Web page
spec file *
SQL
Side
Ticker
INTC
Web
Wrapper
Generator
HTML
Side
Net Income
1,983
Data record returned
* Spec file contains:
Schema, Navigation rules,
and Extraction rules. 36