Mediated-schema - Computer Sciences User Pages

Download Report

Transcript Mediated-schema - Computer Sciences User Pages

Evolving and Self-Managing
Data Integration Systems
AnHai Doan
Dept. of Computer Science
Univ. of Illinois at Urbana-Champaign
Spring 2004
Data Integration at UIUC

Two main players
– Kevin Chang and AnHai Doan

10 students, 30001 cups of coffees, 3 SIGMOD-04 papers

Four supporting players
–
–
–
–

Chengxiang Zhai: IR, bioinformatics, text/data integration
Dan Roth: AI, question answering, text/data integration
Jiawei Han: data mining
Marianne Winslett: security/privacy issues in data sharing
Many supporting departments and local organizations
– NCSA, Information Science, Genome Institute, Fire Service Institute
2
Data Integration Challenge
Find houses with
3 bedrooms
priced under
300K
New faculty
member
realestate.com
homeseekers.com
homes.com
3
Architecture of Data Integration System
Find houses with 3 bedrooms
priced under 300K
price, num-beds,
location, agent-name
mediated schema
list-price, bdrms,
address
source schema 1
source schema 2
source schema 3
wrapper
wrapper
wrapper
realestate.com
homeseekers.com
homes.com
Think “comparison shopping systems on steroid” ...
4
The Need for Data Integration is
Ubiquitous!

In virtually all domains
– data are distributed & stored in heterogeneous formats

WWW
– hundreds/thousands of sources in bioinformatics, real estate, book,etc.

Enterprises
– avg. organization has 49 databases [Ives-01]
– organizations frequently merge, exchange data



Government: e.g., digital government initiatives
Military, cultural & international exchange, Semantic Web,
information agents, etc.
Long-standing challenge in the database community
– recent explosion of distributed data adds urgency
5
Current State of Affairs


Vibrant research & industrial landscape
Research
– dated back to the 70-80s, accelerated in the 90s
– Stanford, UPenn, AT&T Labs, Maryland, UWashington,
Wisconsin, IBM Almaden, ISI, Arizona State U, Ireland, CMU, etc.
– many workshops in AI and DB communities: e.g., SIGMOD/VLDB-04
– focused on
– conceptual & algorithmic aspects
– building systems in specific domains (bio, geo-spatial,
rapid emergency response, virtual organization, etc.)

Industry
– more than 50 startups in 2001, new startups in 2004
Despite much R&D activities, however …
6
Current State of Affairs (cont.)


… Most DI systems are still built & maintained manually
Manual deployment is extremely labor-intensive ...
– construct mediated- & source schemas,
– find semantic mappings between schemas,
– constantly monitor & adjust to changes at
hundreds or thousands of data sources, ...

... and has become a key bottleneck

Emerging technologies
– XML, Web services, Semantic Web, ...
will further fuel DI applications & exacerbate the problem
Slashing the astronomical cost of ownership
is now crucial!
7
The AIDA Project

Recently started at Univ of Illinois
– AIDA = Automatic Integration of Data

Goal: evolving and self-managing data integration systems

Easy to start
– takes hours instead of weeks or months
– perhaps with just a few sources

Learn to continuously improve
– expand to cover new sources
– add novel query capabilities, better query performance

Adjust automatically to changes
– detect and fix broken wrappers, semantic matches, etc.

Require minimal efforts from system admin
– some efforts at the start
– far less as system has been learning more and more
8
The AIDA Project (cont.)

In line with trends in broader computing landscape
–
–
–
–
–

Key differences
–
–
–
–

autonomic systems (IBM initiative)
recovery-oriented computing (Berkeley)
cognitive computer systems (DARPA)
from cycles to RASS (Stanford)
self-tuning databases (MSR, IBM Almaden, Oracle)
applied to distributed data management systems
must attack difficult semantics/meta-data issues
heavy involvement of human
must handle large scale
Need techniques from multiple fields
– databases, machine learning, AI, IR, data mining
9
Project Overview

Thrust 1: automate current labor-intensive tasks
– schema matching
– mediated schema construction
– entity matching

Thrust 2: develop new capabilities
– entity integration


Thrust 3: monitor & adjust to changes
Thrust 4: reduce cost of system admin
– by leveraging the mass of users

Thrust 5: design sources for interoperability
10
Schema Matching
Mediated-schema
price agent-name
1-1 match
homes.com
listed-price
320K
240K
contact-name
Jane Brown
Mike Smith
address
complex match
city
state
Seattle WA
Miami FL
11
Why Schema Matching is Difficult

Schema & data never fully capture semantics!
– not adequately documented

Must rely on clues in schema & data
– using names, structures, types, data values, etc.

Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity:
area & address => location

Intended semantics can be subjective
– house-style = house-description?
– military apps require committees!

Cannot be fully automated, needs work from system admin!
12
Current State of Affairs

Largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
– 40 databases, 27000 elements, estimated time: 12 years

Need semi-automatic approaches to scale up!

Numerous prior & current research projects
– Databases: SemInt (Northwestern), DELTA (MITRE),
IBM Almaden, Microsoft Research, Wisconsin,
Toronto, UC-Irvine, BYU, George Mason, U of Leipzig, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ISI, ...

Many startups
13
Our Prior & Ongoing Work [2000-date]

Joint work with
– Robin Dhamanka, Yoonkyong Lee, Wensheng Wu, Rob McCann, Warren Shen, Alex
Kramnik, Olu Sobulo, Vanitha Varadarajan (Illinois), Pedro Domingos, Alon Halevy (U
Washington)

Learning 1-1 matches for relational & XML schemas
– LSD (Learning Source Description) system
[WebDB-00, SIGMOD-01, Machine Learning Journal-03]

Learning 1-1 & complex matches for ontologies
– GLUE [WWW-02, VLDB Journal-03, Ontology Handbook-03]

Learning 1-1 matches by mass collaboration
– MOBS [WebDB-03, IJCAI-03 Workshop]

Learning complex matches for relational schemas: iMAP [SIGMOD-04]
Large-scale matching via clustering: IceQ [SIGMOD-04]
Corpus-based schema matching [submitted]

Further resources


– brief survey talk at http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt
– "Learning to Match Structured Representations of Data"
[book by Springer-Velag, to appear]
14
Mediated Schema Construction

Joint work with
– Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng (SUNY Binghamton)

ICeQ project
– given a set of source query interfaces
– construct a mediated schema

Step 1: find matches among source
query interfaces
– use clustering [SIGMOD-04]

Step 2: use the found matches to construct mediated
schema (ongoing work)

Future work
– given lot of text in the domain, construct a mediated schema
15
Project Overview

Thrust 1: automate current labor-intensive tasks
– schema matching
– mediated schema construction
– entity matching

Thrust 2: develop new capabilities
– entity integration


Thrust 3: monitor & adjust to changes
Thrust 4: reduce cost of system admin
– by leveraging the mass of users

Thrust 5: design sources for interoperability
16
Entity Matching
(400K, Queen Ann – Seattle, 206-616-1842, Mike Brown)
...
PRICE
LOCATION
PHONE
NAME
(400K, Queen Ann – Seattle, 206-616-1842, M. Brown)
(320K, S. W. Champaign,
217-727-1999, Jane Smith)
...
PRICE LOCATION
PHONE
NAME
(250K, Decatur, 317-727-2459, P. Robertson)
(400K, Seattle, 616-1842, Mike Brown)
...
17
Prior Work

Very active area of research
– databases: [Hernandez&Stolfo,SIGMOD-95],
[Cohen,SIGMOD-98],
[Elfeky&Verykios&Elmagarmid,ICDE-02], ...
– AI: [Cohen&Richman,KDD-02],[Bilenko&Mooney,02], Dan Roth group,
[Tejada et. al., 01],[Tejada et. al. KDD-02], [Michalowski et. al. 03], ...

Much progress
– very effective techniques for many applications
– covered a broad range of scenarios

Key commonality
– assume entities from disparate sources have same set of attributes
– e.g., (price,location,phone,name) vs. (price,location,phone,name)
– match entities based on similarity of corresponding attributes
18
Our PROM Approach

Key observation 1: Entities often have disjoint attributes
– source V1: (age, name)
– source V2:
(name, salary)
– source S1: (location ,description,phone,name)
– source S2:
(description,phone,name, price,sq-feet)

Key observation 2: Correlations among disjoint attributes
can be exploited to maximize matching accuracy!
– e.g.,
(9, “Mike Brown”) vs. (“M. Brown”, $200K)
a 9-year-old is unlikely to make $200K!
<movie, pyear, actor, rating>
<movie, genre, review, ryear, rrating, reviewer>
19
A Profile-Based Solution

Consider again matching persons
– source V1: (age, name)
– source V2:
(name, salary)
– (9, “Mike Brown”) vs. (“M. Brown”, $200K)

Step 1: build a person profile
– what does a “typical” person “look” like?
– build from data & user input

Step 2: match person names
– “Mike Brown” vs. “M. Brown” => 0.7
– discard if confidence score is low, otherwise ...

Step 3: feed both tuples into profile
– (9, “Mike Brown”, “M. Brown”, $200K) => 0.3
20
Advantages of Profile-Based Solution

Can exploit disjoint attributes to
improve accuracy
Table T1
Tuple t1

Profiles capture taskindependent knowledge
– created from task data, domain
experts, external data
– created once, used anywhere
– an example of “knowledge
construction and reuse”

Yields an extensible, modular
architecture
– plug and play with new profiles
Table T2
Tuple t2
Similarity
Estimator
Hard
Profile
Combiner
Soft
Profile
Combiner
Training
data
Hard
Expert
profilers
knowledge
Domain
User
data
specified
Previous
constraints
matching
tasks
Soft
profilers
Matching pairs
21
Profilers
Completeness
Profiler
• Categorical rules
based on complete data
Eg: Color US movies are
produced only after 1917
• Learn from external
Manual Profiler
• Manually encoded rules
Eg: debut-year  b-year
• Domain Expert
Specified
Association
Rule Profiler
• Encodes interesting
association rules having
high confidence
Eg: (birth-year < 1900) implies
(#ODI-matches = 0)
PROFILERS
• Employs Association
data that is complete in
encode information about Rule Mining Techniques
some aspect
domain concepts and can
Histogram Profiler be constructed in many Instance Profiler
• Characteristics of a
ways
• All possible value
few frequent entities
combinations for a set
Eg: Profilers for 10
of attributes
Classifier
most productive
Eg: (studio,movie-genre)
• Learn from
director
training data
• Learn from external
• External data
Eg:
Decision
tree
data that is complete
•Encodes high confidence
in some aspect
22
rules relating disjoint attributes
Entity Matching: Empirical Evaluation
Improve accuracy significantly across six real-world domains
Precision (Baseline vs PROM)
F1 ( Baseline vs PROM)
1
0.8
0.6
1
0.8
0.6
0.4
0.2
0
0.4
0.2
0
Movie1
Movie2
Cricket
Restaurant Citeseer
Baseline
DBLP
Movie1
Movie2
Cricket
Restaurant Citeseer
Baseline
PROM
DBLP
PROM
More profilers result in better performance
F1
Precision
0.95
0.9
Exclude Profiler 4
0.85
0.8
Exclude Profiler 3
Exclude Profiler 2
0.75
Exclude Profiler 1
0.7
0.65
0.6
Complete System
23
0.55
Movie 1
Cricket
Movie 2
Movie 1
Cricket
Movie 2
Entity Integration

Problem: find all tuples related to a real world entity
.
– given a seed paper
Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998
find all papers by Chris C. Zhai from DBLP-Lite
DBLP-Lite data source
(1) Christopher Zhai, A. Kramnik, “Data Warehousing”, SIGMOD, 1998
(2) C. C. Zhai, H. Fang, “Data Mining”, VLDB, 1999
(3) C. Zhai, D. Salesin, “Motion Capturing”, SIGGRAPH, 1998
(4) C. Zhai, “Search Optimization”, SIGIR, 1999
(5) Cheng Zhai, Bruce Croft, Jiawei Han “Text Clustering”, SIGIR, 1999
(6) Cheng Zhai, Bruce Croft, “Language Models”, SIGIR, 2001
(7) A. Doan, H. Fang, “Semantic Integration”, SIGMOD, 2000

Desired result: papers (1)-(2)
24
Baseline Solutions: Pairwise Matching
(1) Christopher Zhai, A. Kramnik, “Data Warehousing”, SIGMOD, 1998
(2) C. C. Zhai, H. Fang, “Data Mining”, VLDB, 1999
(3) C. Zhai, D. Salesin, “Motion Capturing”, SIGGRAPH, 1998
(4) C. Zhai, “Search Optimization”, SIGIR, 1999
(5) Cheng Zhai, Bruce Croft, Jiawei Han “Text Clustering”, SIGIR, 1999
(6) Cheng Zhai, Bruce Croft, “Language Models”, SIGIR, 2001
(7) A. Doan, H. Fang, “Semantic Integration”, SIGMOD, 2000

Seed paper:
Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998


If match papers based only on author names
=> retrieve (1)-(6)
If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)
25
Better Solution: Apply Profilers to Pairwise Matching
(1) Christopher Zhai, A. Kramnik, “Data Warehousing”, SIGMOD, 1998
(2) C. C. Zhai, H. Fang, “Data Mining”, VLDB, 1999
(3) C. Zhai, D. Salesin, “Motion Capturing”, SIGGRAPH, 1998
(4) C. Zhai, “Search Optimization”, SIGIR, 1999
(5) Cheng Zhai, Bruce Croft, Jiawei Han “Text Clustering”, SIGIR, 1999
(6) Cheng Zhai, Bruce Croft, “Language Models”, SIGIR, 2001
(7) A. Doan, H. Fang, “Semantic Integration”, SIGMOD, 2000

Seed paper:
Chris C. Zhai, A. Kramnik, Hui Fang, “Query Processing”, SIGMOD, 1998


If match papers based only on author names
=> retrieve (1)-(6)
If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)
Doesn't fit profile of
a typical researcher!
Aggregate Property:
very active in both
DB and IR, with 3 SIGMOD/VLDB papers
and 3 SIGIR papers in 3 years
26
Even Better Solution: Global Matching
(1) Christopher Zhai, A. Kramnik, “Data Warehousing”, SIGMOD, 1998
(2) C. C. Zhai, H. Fang, “Data Mining”, VLDB, 1999
(3) C. Zhai, D. Salesin, “Motion Capturing”, SIGGRAPH, 1998
(4) C. Zhai, “Search Optimization”, SIGIR, 1999
(5) Cheng Zhai, Bruce Croft, Jiawei Han “Text Clustering”, SIGIR, 1999
(6) Cheng Zhai, Bruce Croft, “Language Models”, SIGIR, 2001
(7) A. Doan, H. Fang, “Semantic Integration”, SIGMOD, 2000
seed
paper
(5), (6)
(4)
Cheng Zhai in IR
Chris C. Zhai, A. Kramnik, Hui Fang,
“Query Processing”, SIGMOD, 1998
C. Zhai,
“Search Optimization”, SIGIR, 1999
27
Empirical Evaluation
Clustering improves
performance over
pair-wise matching
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.9
DBLP
IMDB
0.8
0.7
0.6
0.5
BL
BL+Prof
SLink
Slink+Prof
ALink
Alink+Prof
0.4
0.3
0.2
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.1
0
DBLP
Baseline(BL)
DBLP
IMDB
SharedMatch
AllAttrMatch
IMDB
BL+SLink
BL+ALink
Adding profilers
improves performance
over both clustering
and pair-wise
matching.
28
More Information on
Entity Matching and Integration

Context-based entity matching and integration
– Tech. Report UIUC-03-2004

Profile-based object matching for information integration
– A. Doan, Y. Lu, Y. Lee, and J. Han
– IEEE Intelligent Systems,
special issue on information integration, 2003

Object matching for data integration: a profile-based
approach
– A. Doan, Y. Lu, Y. Lee, and J. Han
– Proc. of the IJCAI-03 workshop on
information integration on the Web, 2003
29
Project Overview

Thrust 1: automate current labor-intensive tasks
– schema matching
– mediated schema construction
– entity matching

Thrust 2: develop new capabilities
– entity integration


Thrust 3: monitor & adjust to changes
Thrust 4: reduce cost of system admin
– by leveraging the mass of users

Thrust 5: design sources for interoperability
30
The Problem

Numerous automatic tools have been developed for
– schema matching, wrapper construction, source discovery, etc.

No matter how good these tools are,
system admin still needs to
– verify predictions of tools
– correct wrong ones

These tasks are still extremely labor intensive
– even worse when considering system maintenance

System complexity overwhelms capacity of human admin

Reduce the labor cost of system admin is critical!
– perhaps most important issue, to enable practical systems!
31
Solution: Shift Some Labor to Users

Take some task or part of some task
– e.g., schema matching, wrapper construction, source discovery

Convert it into a series of very simple questions
– such that knowing the answers = solving the task

Ask users to answer questions
– such that each user has to do very little work
 Spread the task labor thinly over a mass of users !
32
Example: Mass Collaboration for
Schema Matching
Author
?
Writer
Amount
Author
Price
?
Price
?
Author
Cost
Writer
Amount
Author
Cost
John
Steinbeck
$14.99
Upton
Sinclair
$6.99
John
Steinbeck
$14.99
Upton
Sinclair
$6.99
Joseph
Heller
$21.99
George
Orwell
$7.99
Joseph
Heller
$21.99
George
Orwell
$7.99
Aldous
Huxley
$12.99
Aldous
Huxley
33
$12.99
Mass Collaboration is not New

Successfully applied to
–
–
–
–
–
open source software construction
knowledge base construction
collaborative software bug detection
collaborative filtering
annotating online pictures [CMU]

Leverage both implicit and explicit feedback from users

But has not been applied to data integration settings
Can use both implicit and explicit feedback

– focus here on explicit one
34
MOBS Project: Mass Collaboration
to Build DI Systems

Joint work with
– Rob McCann, Alex Kramnik, Warren Shen, Vanitha Varadarajan,
Olu Sobulo

If succeeds
– can dramatically reduce cost & time
– launch numerous DI systems on Web & enterprises

Key challenges
– how to break a task into a series of questions
– how to entice users to answer questions
– how to combine user answers
(e.g., what to do with malicious users?)

Illustrate baseline solutions via schema matching
35
Example: Book Domain
Mediated Schema
title
a1 | a 2 | a 3 | a 4 | a 5
Schema S1
|
author
| year |
b1 | b 2 | b 3 | b 4 | b 5
Schema S2
price
|
category
c1 | c2 | c3 | c4 | c5
Schema S3
d1 | d 2 | d 3 | d 4 | d 5
Schema S4
36
Build Partial Correct System
Mediated Schema
title
a1 | a 2 | a 3 | a 4 | a 5
Schema S1
|
author
| year |
b1 | b 2 | b 3 | b 4 | b 5
Schema S2
price
|
category
c1 | c2 | c3 | c4 | c5
Schema S3
d1 | d 2 | d 3 | d 4 | d 5
Schema S4
37
Solicit User Answers
1
3
38
Detect & Remove Bad Users
39
Combine User Answers
Mediated Schema
title
a1 | a 2 | a 3 | a 4 | a 5
Schema S1
|
author
| year |
b1 | b 2 | b 3 | b 4 | b 5
Schema S2
price
|
category
c1 | c2 | c3 | c4 | c5
Schema S3
d1 | d 2 | d 3 | d 4 | d 5
Schema S4
40
Combine User Answers
41
MOBS Challenges Revisited

How to decompose a task into a series of questions?
– task dependent, currently works for source discovery, 1-1 matching
– if can’t solve the whole task, ok for part of the task (e.g., wrapper)

How to entice users to answer questions?
– incentive models: monopoly or better-service applications
use helper applications
use volunteers

How to evaluate users and combine their answers?
–
–
–
–

use machine learning
build a dynamic Bayesian network model
solicit user answers to questions with known answers
use these as training data to learn network parameters
More detail in [McCann et. al. Tech Report 04, WebDB-03]
42
MOBS Applicability

Applied MOBS in many settings ...
– scale: small-community intranet to high-traffic website
– users: unpredictable novice users to cooperative experts

... and to several DI tasks
– Deep Web: form recognition, query interface matching
– Surface Web: hub discovery, data extraction, mini-Citeseer
43
Simulation and Real-World Results
P1 – Uniform [0,1]
P4 – Bell [0,1]
P5 – Bell [0.3,0.7]
P6 – Bell [0.5,0.9]
Interface Matching
Interface Matching
P2 – Uniform [0.3,0.7] P3 – Uniform [0.5,0.9]
10
1.0
8
0.8
0.6
6
P7 – Bimodal {0.2,0.8}
0.4
4
P8 – 90% Uniform [0,0.4], 10% {0.8}
0.2
2
0.0
P9 – 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]
0
ML P1
P1
P10 – 10% {0.3}, 90% Uniform [0.7,1]
Name
P2
P3
P4
P5
Answers per User per Source
Helper Application Duration
& Status
P6
P7
P8
P2
P3
P4
P5
P6
P7
P8
P9 P10
P9 P10
Precision
Answers per Trusted User per Source
Recall
Current Progress
Precision
Recall
Avg User
Workload
Form
Recognition
DB course website,
132 undergrad students
5 days,
Completed
Completed 24/24 interfaces,
Found 17 bookstore interfaces
1.0
(0.7 ML)
0.89
(0.89
ML)
7.4 answers
Interface
Matching
DB course website,
132 undergrad students
7 days,
Stopped
Completed 10/17 interfaces,
Matched 65 total attributes
0.97
(0.63 ML)
0.97
(0.63
ML)
12.5
answers
Hub Discovery
IR course website,
28 undergrad students
21 days,
Stopped
Completed 15/30 sites
Found 15 hubs
0.87
(0.27 ML)
0.87
(0.27
ML)
16.1
answers
Mini-Citeseer
Google search engine,
21 researchers, friends,
family
19 days,
Completed
Completed 17/17 pages (94
lists)
Found 19 pubs
1.0
0.86
8.7 answers
44
Project Overview

Thrust 1: automate current labor-intensive tasks
– schema matching
– mediated schema construction
– entity matching

Thrust 2: develop new capabilities
– entity integration


Thrust 3: monitor & adjust to changes
Thrust 4: reduce cost of system admin
– by leveraging the mass of users

Thrust 5: design sources for interoperability
45
Summary




The need for data integration is pervasive
Manual data integration is a key bottleneck
Our solution: AIDA project on autonomic DI systems
Discussed problems
–
–
–
–

schema matching [SIGMOD-04]
mediated schema construction [SIGMOD-04]
entity matching & integration [Tech report 04]
mass collaboration [Tech report 04]

Machine learning is the underlying technique
Many implications beyond data integration context

More information: “anhai” on Google
46