2006 COIN-ICEWS

Download Report

Transcript 2006 COIN-ICEWS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SLOAN SCHOOL OF MANAGEMENT
INFORMATION TECHNOLOGIES GROUP
Data Transformation & Model Integration
in support of ICEWS :
COntext INterchange (COIN) Project
Presentation & Demonstration
Stuart Madnick ([email protected])
Michael Siegel ([email protected])
August 22, 2006
1
2003-07-01
Data Transformation & Model Integration
Issue: Automation for Scalability & Maintainability
ICESW Challenges (from the ICEWS preliminary proposal, dated 4 May 2006):
• “no inter-agency system exists to integrate broad data sets.”
• need “proper, seamless integration of existing tools, methodologies, and data ...”
• “Social network models, dynamic events data systems and state stability models all
validated – just need to be integrated seamlessly.”
Problem:
• Easy to solve (manually) when only 3-4 data sources and 2-3 models
• But, if grow to many data sources and models – becomes major effort
• Hard to scale to large number of sources and models
• Hard to change (to incrementally add new sources and models)
• Hard to evolve (changing specifications of existing sources and models)
• Hard to automate the integration and maintainance
Solution:
• Use COntext INterchange (COIN) Technology (for Data Extraction and Integration)
• Basic theory funded as part of DARPA Intelligent Integration of Information (I3)
(Uses declarative knowledge about data sources/models and reasoning)
• Since then, further refinement and increasingly complex prototypes
• Ready to extend to model integration application and large-scale prototyping
2
Stuart Madnick ([email protected]), Michael Siegel ([email protected]), Nazli Choucri ([email protected])
Data Transformation & Integration in Support of
Model Operation – Hypothetical Example
Specific Data Needed for Each Model
Semi-Processed
Data Sources
Raw Data Sources
Civilian casualties
(per week)
Civilian casualties
(per day)
Web sites
Filtered
Military casualties
(per week)
Individual
Economic Sector
Performance
(in Euros)
Military casualties
(per day)
Newspapers
Semi-Structured
Data Sources
Some Data Extraction
can be Automated
via COIN Technology
Total casualties
(per day)
Total
Economic Sector
Performance
(in local currency)
Databases
Individual
Economic Sector
Performance
(in local currency1)
Model
M1
Model
M2
Data Transformation
& Integration Automated
via COIN Technology
Interesting challenge, especially with time series data, is that semantics change over time – e.g., currency in France 3
changed from Francs to Euros; Yugoslavia changed several times. COIN handles this easily as part of its temporal
semantics capabilities.
1
Model Input/Output Transformation & Integration
– Hypothetical Example
Recent Individual
Economic
Sector-by-Sector
Performance
(in local currency)
Model
M3
Data Transformation
& Integration Automated
via COIN Technology
Recent
Civilian Casualties1
(per day)
Recent
Military Casualties
(per day)
Model
M4
Projected Individual
Economic
Sector-by-Sector
Performance
(in local currency)
Projected Total
Economic
Sector-by-Sector
Performance
(in Euros)
Projected
Total Casualties
(per week)
Projected
Civilian Casualties
(per day)
Projected
Military Casualties
(per day)
Model
M5
The automation of model
integration – especially to
deal with many autonomous
evolving models, with
differing input/outputs, is an
important technological
advance proposed.
Many more subtleties to data semantics, such as differing definitions of what is a “casualty” in
different data sources and as used in different models - that must be reconciled.
1
Refined
Projected
Total Casualties
(per week)
4
Simplified Overview of COntext INterchange
(COIN) Advanced Technology
Concept:
Time
Modifiers:
Days Weeks
f()
Weeks
Light-weight
Shared
Ontologies with
Ontologies
Context Modifiers
Declarative description of
Source’s actual semantics
Source
Context
Days
Conversion
Creation
Context
Mediator
Mediation & Transformation uses an
integrated framework of abductive and
constraint logic programming
Receiver
Context
2
civ_
casualties
Select civ_casualties / 7
From source1
Where region=“12”
Context
Transformation
140
Source
Specialized symbolic equation solving
techniques used to dynamically create
comprehensive conversion programs
from small conversion components
Declarative description of
Receiver’s desired/expected
semantics
1 Select civ_casualties
From source1
Where region=“12”
3
20
Receiver
5
Agenda
• Introduction to the Context Interchange
(COIN) Project – Motivation
"[A]lthough there are many private and public databases that
contain information potentially relevant to counterterrism
programs, they lack the necessary context definitions (i.e.,
metadata) and access tools to enable interoperation with other
databases and the extraction of meaningful and timely
information."
National Research Council (2002), "Making the Nation Safer" (emphasis added)
• Description of the COIN Context Mediation
Technology
• Demonstration of the COIN Technology
6
COntext INterchange (COIN) Project
Applications
Receivers
OUTPUT
PROCESSING
CONTEXT
MEDIATION
INPUT
PROCESSING
ODBC Driver
* Automatic
conflict
detection and
conversion
* Automatic
web
wrapping
- Derived data
Web Publishing
- Source selection
- Source attribution
TRUSTED
AGENTS
Browsers
- Semistructured text
-Multi-source
query plan and
execution
APPLICATIONS: Financial services, electronic
commerce, asset visibility, in-transit visibility.
Web
Pages
Sources
Data
bases
7
Key COIN Technologies

Web Wrapper

Extract selected information from web (HTML+XML)

Allows web to be treated as large relational SQL database

Handles dynamic web sites, cookies, “login”, etc.

Performs SQL Joins & Unions involving DB’s + Web sources
Context Mediator


Resolve semantic (meaning) differences

Enable meaningful aggregation & comparison
8
Multiple Perspectives . . .
old lady or young lady ?
9
Role Of Context
07-08-09
Context
Context
$
08-07-09
£
?
Context
¥

09-07-08
CONTEXT VARIATIONS:
- GEOGRAPHIC ( US vs. UK )
- FUNCTIONAL (CASH MGMT vs. LOANS )
- ORGANIZATIONAL ( CITIBANK vs. CHASE )
Data:
Databases
Web data
E-mail
10
Types of Context
Representational Ontological
Temporal
Example
Temporal
Representational Currency: $ vs €
Francs before 2000,
Scale factor: 1 vs 1000 € thereafter
Ontological
Revenue: Includes vs
excludes interest
Revenue: Excludes
interest before 1994
but incl. thereafter
11
Example : Context Differences
( from multiple web sources)
Daimler Benz ( DCX ) Financial Data
P/E Ratio
11.6
ABC
Bloomberg 5.57
DBC
19.19
MarketGuide 7.46
12
Complementary Aggregation Example
• Q: How did CO2 emissions
(total, per GDP, per capita)
change over time
(between 1990 and 2000)
in Yugoslavia?
– Context 1: YUG as a
geographic region bounded
before the breakup
– Context 2: YUG as a legal
autonomous state
Related effort:
- Laboratory for Information Globalization and Harmonization Technologies and
13
Studies ( LIGHTS ) Project
World Bank’s World Dev.
Indicator DB; UN Statistic
Division; Statistics Bureaus
OAK Ridge’s CDIAC DB;
WRI; GSSD; EPAs
Olsen (Web)
CO2 Emission
1990
2000
Countr
y
GDP
Pop
GDP
Pop
YUG
698.3
23.
7
1627.
8
10.
6
BIH
13.6
3.9
HRV
266.9
4.5
MKD
608.7
GDP in billions local currency;
SVN
7162
Population in millions
2.0
Fro
m
To
1990
2000
10.5
67.267
Country
1990
2000
YUG
35604
1548
0
USD
YU
G
BIH
1279
USD
BIH
2.086
HRV
5405
USD
8.089
MKD
3378
HR
V
USD
MK
D
64.757
InSVN
1000 tons per year3981
USD
SV
N
Context 1
2.0
225.93
Context 2
Country
Code
Currency
CurCod
e
Country
1990
2000
1990
2000
Yugoslavia
YUG
New
Yug.
Dinar
YUN
CO2
35604
29523
35604
1548
0
GDP
66.5
104.8
66.5
24.2
Marka
BAM
CO2/capita
1.5
1.28
1.5
1.46
CO2/GDP
535
282
535
640
Bosnia and
Herzegovia
BIH
Croatia
HRV
Kuna
HRK
Macedonia
MKD
Denar
MKD
Slovenia
Tolarneeded:
SIT
ManySVN
sources
Meanings in sources & users might differ
GDP/Capit 2800
4560
2800
1100
Total
CO2
in
1000
tons
per
year;
GDP
in
billions
USD;
a
CO2/Capita in tons per person; CO2/GDP in tons per
14
million USD; GDP/Capita in USD per person
The 1999 Overture
Unit-of-measure mixup tied to
loss of $125Million Mars Orbiter
“NASA’s Mars Climate Orbiter was lost because
engineers did not make a simple conversion from English
units to metric, an embarrassing lapse that sent the $125
million craft off course. . . .
. . . The navigators ( JPL ) assumed metric units of
force per second, or newtons. In fact, the numbers were
in pounds of force per second as supplied by Lockheed
Martin ( the contractor ).”
Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.
15
The Context Interchange Approach
Concept:
Length
Meters
f()
meters
feet
Feet
Shared
Ontologies
Conversion
Libraries
Context Management
Administrator
Context
Mediator
Source
Context
Receiver
Context
2
part
length
Select partlength x 3.35
From catalog
Where partno=“12AY”
Context
Transformation
17
Source
1 Select partlength
From catalog
Where partno=“12AY”
3
55.25
Receiver
16
COIN Conceptual Model
(Ontology)
17
Another Context Example
(Basis for Demo)
Company Name DAIMLER-BENZ
Net Income
614,995
97,736,992
Sales
Context Mediation
Services
*
Datastream
Company Name DAIMLER-BENZ AG
Net Income
346,577
Sales
*
56,268,168
WorldScope
Company Name DAIMLER BENZ CORP
Net Income
615,000,000
Sales
*
97,737,000,000
Disclosure
O&A DEM-USD Exchange Rate
1.00 German Mark= 0.58 US Dollar as 12/31/93
*
OANDA
Web Server
Users & Appl.
Systems
* Wrapper Services
18
Some Context Differences
Context Definitions
Disclosure
Country of
Incorporation
Money Amount
As_Of_Date
3 Letters
Currency
Used
Currency
Conversion
Currency
Symbols
Scale Factor 1
Disclosure Names
Company
Names
American with ‘/’ as
Date Style
separator
Worldscope
USD
Money Amount
As_Of_Date
3 Letters
DataStream
Country of
Incorporation
Money Amount
As_Of_Date
2 Letters
1000
Worldscope Names
1000
DataStream Names
American with ‘/’ as
separator
European with ‘-’ as
separator
Olsen (OANDA) Web Source uses 3 Letter Currency Symbols and European Date Style
with ‘/’ as a separator
19
Domain Model
exchangeRate
scaleFactor
number
string
countryName
currencyType
date
companyFinancials
Inheritance
Attribute
Modifier
company
companyName
Some currency context possibilities:
• Currency is stated explicitly as part of record
• Currency not stated, but the same for all (e.g., US $)
20
• Currency not stated or constant, but inferred by country
COIN System Architecture
SERVER PROCESSES
MEDIATOR PROCESSES
CLIENT PROCESSES
Web Client
SQL Query
Context
Mediator
Datalog
Query
WWW Gateway
Optimizer
HTTPD-Daemon
Wrapper
Executioner
Mediated
Query
Optimized
Query Plan
(cgi-scripts)
N
HTTPD-Daemon
SQL Compiler
SQL Query
HTTPD-Daemon
COIN
Repository
N
ODBC-compliant Apps
Results
(e.g Microsoft Excel)
ODBC-Driver
HTTPD-Daemon
Web-site
Data Store for
Intermediate
Results
21
System Demonstration
Single Source Queries with Mediation
Q6. Scenario: Using Context Interchange, you can look at
the Disclosure data using Datastream Context.
Query: Find out from Disclosure what Net Income for
DAIMLER-BENZ was. Use Datastream Context.
Capabilities Demonstrated:
Ability to perform Scale Factor Conversion, Date Format
Conversion, Company Name Conversion.
22
Demonstration @ context2.mit.edu
Source
Context
23
Context Metadata (Partial)
24
Conflict Detection and Mediation
Mediated Query in Datalog
Date convert
Scale factor convert
Name convert
25
Mediated SQL Query & Result
Mediated SQL Query
Adjust scale factor
Date format
conversion
Name conversion
Final results – from Disclosure but in Datastream context
26
More Complex Example (4 sources: DB + Web)
Databases
Web source
select WorldcAF.TOTAL_ASSETS, DiscAF.NET_SALES,
DiscAF.NET_INCOME, DStreamAF.TOTAL_EXTRAORD_ITEMS_PRE_TAX,
quotes.Last
from WorldcAF, DiscAF, DStreamAF, quotes
where WorldcAF.COMPANY_NAME = "DAIMLER-BENZ AG"
and DStreamAF.AS_OF_DATE = "01/05/94"
and WorldcAF.COMPANY_NAME = DStreamAF.NAME
27
and WorldcAF.COMPANY_NAME = DiscAF.COMPANY_NAME
and WorldcAF.COMPANY_NAME = quotes.Cname;
Conflict Table (1st part)
28
Conflict Table (2nd part)
29
Generated SQL (1st Part)
select worldcaf.total_assets, discaf.net_sales, ((discaf.net_income*0.001)*olsen.rate),
(dstreamaf2.total_extraord_items_pre_tax*olsen2.rate), quotes.Last
from (select date1, 'European Style -', '01/05/94', 'American Style /'
from datexform
where format1='European Style -'
and date2='01/05/94'
and format2='American Style /') datexform,
(select dt_names, 'DAIMLER-BENZ AG'
from name_map_dt_ws
where ws_names='DAIMLER-BENZ AG') name_map_dt_ws,
(select ds_names, 'DAIMLER-BENZ AG'
from name_map_ds_ws
where ws_names='DAIMLER-BENZ AG') name_map_ds_ws,
(select 'DAIMLER-BENZ AG', ticker, exc
from ticker_lookup2
where comp_name='DAIMLER-BENZ AG') ticker_lookup2,
(select 'DAIMLER-BENZ AG', latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets,
country_of_incorp
from worldcaf
where company_name='DAIMLER-BENZ AG') worldcaf,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes,
(select exchanged, 'USD', rate, date
from olsen
where expressed='USD') olsen,
(select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp
from discaf) discaf,
30
Generated SQL (Continued - Partial)
(select as_of_date, name, total_sales, total_extraord_items_pre_tax,
earned_for_ordinary, currency
from dstreamaf) dstreamaf,
(select as_of_date, name, total_sales,
total_extraord_items_pre_tax, earned_for_ordinary, currency
from dstreamaf) dstreamaf2,
(select char3_currency, char2_currency
from currency_map
where char3_currency <> 'USD') currency_map,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes2,
(select exchanged, 'USD', rate, '01/05/94'
from olsen
where expressed='USD'
and date='01/05/94') olsen2,
(select Cname, Last
from quotes) quotes
where currencytypes.country = discaf.location_of_incorp
and currencytypes.currency = olsen.exchanged
and dstreamaf.currency = dstreamaf2.currency
and dstreamaf2.currency = currency_map.char2_currency
and olsen.date = discaf.latest_annual_data
and currency_map.char3_currency = currencytypes2.currency
and currencytypes2.currency = olsen2.exchanged
and name_map_dt_ws.dt_names = dstreamaf2.name
and name_map_ds_ws.ds_names = discaf.company_name
and ticker_lookup2.ticker = quotes.Cname
and datexform.date1 = dstreamaf2.as_of_date
and currencytypes.currency <> 'USD'
and currency_map.char3_currency <> 'USD'
union
select worldcaf2.total_assets, discaf2.net_sales,
((discaf2.net_income*0.001)*olsen3.rate),
dstreamaf4.total_extraord_items_pre_tax, quotes2.Last
from (select date1, 'European Style -', '01/05/94', 'American Style /'
from datexform
where format1='European Style -'
and date2='01/05/94'
and format2='American Style /') datexform2,
(select dt_names, 'DAIMLER-BENZ AG'
from name_map_dt_ws
where ws_names='DAIMLER-BENZ AG') name_map_dt_ws2,
(select ds_names, 'DAIMLER-BENZ AG'
from name_map_ds_ws
where ws_names='DAIMLER-BENZ AG') name_map_ds_ws2,
(select 'DAIMLER-BENZ AG', ticker, exc
from ticker_lookup2
where comp_name='DAIMLER-BENZ AG') ticker_lookup22,
(select 'DAIMLER-BENZ AG', latest_annual_financial_date,
current_outstanding_shares, net_income, sales, total_assets,
country_of_incorp
from worldcaf
where company_name='DAIMLER-BENZ AG') worldcaf2,
(select country, currency
from currencytypes
where currency <> 'USD') currencytypes3,
(select exchanged, 'USD', rate, date
from olsen
where expressed='USD') olsen3,
(select company_name, latest_annual_data,
current_shares_outstanding, net_income, net_sales, total_assets,
location_of_incorp
from discaf) discaf2,
(select as_of_date, name, total_sales,
total_extraord_items_pre_tax, earned_for_ordinary, currency
from dstreamaf) dstreamaf3,
(select 'USD', char2_currency
from currency_map
where char3_currency='USD') currency_map2,
etc
31
Final Result
32
Execution Trace (1st Part - Partials)
Parallel Execution
...
Retrieving data
From Web source
33
Execution Trace (Continued - Partials)
...
Stock price returned
From Web source
Another Web source used
(for currency conversion)
...
34
The 1805 Overture
In 1805, the Austrian and Russian Emperors agreed
to join forces against Napoleon. The Russians promised
that their forces would be in the field in Bavaria by Oct. 20.
The Austrian staff planned its campaign based on
that date in the Gregorian calendar. Russia, however, still
used the ancient Julian calendar, which lagged 10 days
behind.
The calendar difference allowed Napoleon to
surround Austrian General Mack's army at Ulm and force
its surrender on Oct. 21, well before the Russian forces
could reach him, ultimately setting the stage for Austerlitz.
Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, pg. 390.
35
Summary
• Tremendous opportunity to gather and integrate
information from many diverse sources
• But … need to overcome many context challenges
• Context-type “metadata” plays a critical role
• COIN technology can be an important aid for
semantically meaningful information integration:
- Scalable
- Extensible
- Application Domain Merging
- Reuse and extension of ontologies and contexts
36
Appendix – Some Useful Reference Material
• Documents
Overview: http://computer.org/conferen/proceed/meta/1999/papers/84/smadnick.html
"Metadata Jones and the Tower of Babel: The Challenge of Large-Scale
Semantic Heterogeneity", (IEEE Meta-Data Conference)
Theory of COIN: http://web.mit.edu/smadnick/www/wp/1997-03.pdf
“Context Interchange: New Features and Formalisms for the Intelligent Integration of
Information” (ACM TOIS)
Contact us for more …
• Web sites
Main COIN web site: http://context2.mit.edu
Miscellaneous demos: http://context2.mit.edu/coin/demos/
Self-explanatory demo: http://interchange.mit.edu:8080/gcms_v4/airCarMergeTop.html
(Airfare and Car rental applications, includes ontology merging.
37
Caution: still under development)
Appendix: Sample Applications
•
•
•
•
•
•
•
•
•
Airfare, Car Rental and Merged Travel
Weather
Global Price Comparison
Airfare Aggregation
Disaster Relief
TASC Financial Example
Web Services Demo
Corporate Householding
Counter-Terrorism Intelligence
38
Appendix:
User or
Program (via SQL Query)
COIN Web-Wrapper Technology
Select Edgar.Net_income
From Edgar
Where Edgar.Ticker=intc
and Edgar.Form=10-Q
Web page
spec file *
SQL
Side
Ticker
INTC
Web
Wrapper
Generator
HTML
Side
Net Income
1,983
Data record returned
* Spec file contains:
Schema, Navigation rules,
and Extraction rules. 39