From Data Integration to Community Information Management


Transcript From Data Integration to Community Information Management

From Data Integration to
Community Information Management
AnHai Doan
University of Illinois
Joint work with Pedro DeRose, Robert McCann, Yoonkyong Lee, Mayssam
Sayyadian, Warren Shen, Wensheng Wu, Quoc Le, Hoa Nguyen, Long Vu,
Robin Dhamankar, Alex Kramnik, Luis Gravano, Weiyi Meng, Raghu
Ramakrishnan, Dan Roth, Arnon Rosenthal, Clement Yu
Data Integration Challenge
[Figure: a new researcher asks "find houses with 4 bedrooms, priced under 300K" across realestate.com, homeseekers.com, and homes.com]
2
Actually Bought a House in 2004

Buying period
– queried 7-8 data sources over 3 weeks
– some of the sources are local, not "indexed" by national sources
– 3 hours/night → 60+ hours
– huge amount of time on querying and post-processing
Buyer-remorse period
– repeated the above for another 3 weeks!
We really need to automate data integration ...
3
Architecture of Data Integration Systems
[Figure: the query "find houses with 4 bedrooms priced under 300K" is posed against a mediated schema, which maps via wrappers to source schemas of homes.com, realestate.com, and houses.com]
4
Current State of Affairs


Vibrant research & industrial landscape
– research since the 70s, accelerated in the past decade: database, AI, Web, KDD, Semantic Web communities
– 14+ workshops in past 3 years: ISWC-03, IJCAI-03, VLDB-04, SIGMOD-04, DILS-04, IQIS-04, ISWC-04, WebDB-05, ICDE-05, DILS-05, IQIS-05, IIWeb-06, etc.
– main database focuses: modeling, architecture, query processing, schema/tuple matching; building specialized systems (life sciences, Deep Web, etc.)

Industry
– 53 startups in 2002 [Wiederhold-02]
– many new ones in 2005
Despite much R&D activity, however …
5
DI Systems are Still Very Difficult
to Build and Maintain

Builder must execute multiple tasks:
– select data sources
– create wrappers
– create mediated schemas
– match schemas
– eliminate duplicate tuples
– monitor changes
– etc.
Most tasks are extremely labor intensive
– total cost often at 35% of IT budget [Knoblock et al. 02]
– systems often take months or years to develop
High cost severely limits deployment of DI systems
6
Data Integration @ Illinois

Directions:
– automate tasks to minimize human labor
– leverage users to spread out the cost
– simplify tasks so that they can be done quickly
7
Sample Research on Automating
Integration Tasks: Schema Matching
[Figure: a mediated schema (price, agent-name, address) vs. the homes.com schema (listed-price, contact-name, city, state), with sample tuples such as (320K, Jane Brown, Seattle, WA) and (240K, Mike Smith, Miami, FL). listed-price → price is a 1-1 match; address → concat(city, state) is a complex match]
8
Schema Matching is Ubiquitous!


Fundamental problem in numerous applications
Databases
– data integration, model management
– data translation, collaborative data sharing
– keyword querying, schema/view integration
– data warehousing, peer data management, …
AI
– knowledge bases, ontology merging, information gathering agents, ...
Web
– e-commerce, Deep Web, Semantic Web, Google Base, next version of My Web 2.0?
eGovernment, bio-informatics, e-sciences
9
Why Schema Matching is Difficult

Schema & data never fully capture semantics!
– not adequately documented
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names → different entities: area → location or square-feet
– different names → same entity: area & address → location
Intended semantics can be subjective
– house-style = house-description?
Cannot be fully automated; needs user feedback
10
Current State of Affairs

Schema matching is now a key bottleneck!
– largely done by hand, labor intensive & error prone
– e.g., data integration at GTE [Li & Clifton, 2000]: 40 databases, 27,000 elements, estimated time 12 years

Numerous matching techniques have been developed
– Databases: IBM Almaden, Wisconsin, Microsoft Research, Purdue,
BYU, George Mason, Leipzig, NCSU, Illinois, Washington, ...
– AI: Stanford, Toronto, Rutgers, Karlsruhe University, NEC, USC, …
"everyone and his brother is doing ontology mapping"

Techniques are often synergistic, leading to
multi-component matching architectures
– each component employs a particular technique
– final predictions combine those of the components
11
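The multi-component idea, several base matchers whose predictions a combiner merges and a selector finalizes, can be sketched in a few lines of Python. Everything here (the token-overlap name matcher, the averaging combiner, the threshold selector) is an illustrative assumption, not LSD's actual components:

```python
import re

# Illustrative sketch of a multi-component matcher
# (base matchers -> combiner -> match selector).

def name_matcher(src_attr, med_attr):
    """Base matcher: token overlap between attribute names."""
    a = set(re.split(r"[-_]", src_attr.lower()))
    b = set(re.split(r"[-_]", med_attr.lower()))
    return len(a & b) / len(a | b)

def avg_combiner(scores):
    """Combiner: average the per-matcher scores for one candidate match."""
    return sum(scores) / len(scores)

def match_selector(candidates, threshold=0.3):
    """Selector: best-scoring mediated attribute, if confident enough."""
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None

mediated = ["address", "agent-name", "agent-phone"]
matchers = [name_matcher]  # a real system plugs in many matchers here

for src in ["contact-agent", "area"]:
    scores = {m: avg_combiner([f(src, m) for f in matchers]) for m in mediated}
    print(src, "=>", match_selector(scores))
```

A constraint enforcer (e.g., "only one source attribute matches address") would sit between the combiner and the selector; it is omitted here for brevity.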
Example: LSD [Doan et al. SIGMOD-01]
[Figure: LSD matching homes.com to the mediated schema (address, name, agent-name). Base matchers make predictions, e.g. the Naive Bayes matcher outputs area → (address, 0.7), (description, 0.3) and contact-agent → (agent-phone, 0.7), (agent-name, 0.3); a Combiner merges per-matcher scores, e.g. comments → (address, 0.6), (desc, 0.4); a Constraint Enforcer applies constraints such as "only one attribute of the source schema matches address"; a Match Selector outputs the final matches: area = address, contact-agent = agent-phone, comments = desc]
12
Multi-Component Matching Solutions

Introduced in [Doan et al. WebDB-00, SIGMOD-01], [Do & Rahm VLDB-02], [Embley et al. 02]
Now commonly adopted, with industrial-strength systems
– e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]
[Figure: three multi-component architectures built from base matchers (Matcher 1 … Matcher n), combiners, constraint enforcers, and match selectors, composed in different orders in LSD, COMA, and LSD-SF (which adds the SF matcher)]
Such systems are very powerful ...
– maximize accuracy; highly customizable

... but place a serious tuning burden on domain users
13
Tuning Schema Matching Systems

Given a particular matching situation
– how to select the right components?
– how to adjust the multitude of knobs?
[Figure: a library of matching components with their knobs. Match selectors: threshold selector, bipartite-graph selector, …; constraint enforcers: A* search enforcer, relaxation labeler, ILP; combiners: average, min, max, weighted-sum; matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naïve Bayes matcher, SVM matcher. Example knobs of the decision tree matcher: characteristics of attributes, split measure, post-prune?, size of validation set]
Untuned versions produce inferior accuracy
14
But Tuning is Extremely Difficult

Large number of knobs
– e.g., 8-29 in our experiments
Wide variety of techniques
– database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare quality of knob configs
Long-standing problem since the 80s, getting much worse with multi-component systems
⇒ Developing efficient tuning techniques is now crucial
15
The eTuner Solution [VLDB-05a]

Given schema S & matching system M
– tune M to maximize average accuracy of matching S with future schemas
– such settings commonly occur in data integration, warehousing, supply chains
Challenge 1: evaluation
– score each knob config K of matching system M; return K*, the one with the highest score
– but how to score knob config K?
– if we know a representative workload W = {(S,T1), ..., (S,Tn)} and the correct matches between S and T1, …, Tn → we can use W to score K
Challenge 2: huge or infinite search space
16
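Challenge 1 can be made concrete. Below is a hedged sketch of scoring a knob config K against a workload W with known correct matches; `run_system` is a hypothetical stand-in for running matching system M under config K, and F1 is one plausible accuracy measure:

```python
# Sketch of scoring a knob config K over W = [(S, T, gold_matches), ...].

def f1_score(predicted, gold):
    """F1 of predicted matches vs. the correct matches."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    p = len(predicted & gold) / len(predicted)
    r = len(predicted & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def score_config(run_system, config, workload):
    """Average matching accuracy of config K over all (S, Ti) pairs in W."""
    scores = [f1_score(run_system(S, T, config), gold)
              for S, T, gold in workload]
    return sum(scores) / len(scores)

# eTuner then returns K* = argmax over candidate configs, e.g.:
# best = max(candidate_configs, key=lambda K: score_config(run, K, W))
```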
Solving Challenge 1:
Generate Synthetic Input/Output

Need workload W = {(S,T1), (S,T2), …, (S,Tn)}
To generate W:
– start with S
– perturb S to generate T1
– perturb S to generate T2
– etc.
Know the perturbation → know the matches between S & Ti
17
Generate Synthetic Input/Output
[Figure: schema S has table Employees(id, first, last, salary ($)) with tuples (1, Bill, Laup, 40,000 $) and (2, Mike, Brown, 60,000 $). Perturbations: (1) perturb table and column names, e.g. Employees → Emps, last → emp-last; (2) perturb the number of columns, e.g. drop first; (3) perturb data tuples, e.g. 40,000 $ → 45200, making sure tables do not share tuples. The result Emps(emp-last, id, wage) comes with known matches: Employees = Emps, last = emp-last, salary ($) = wage, first = NONE. Rules are applied probabilistically]
18
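The perturbation procedure above might be sketched as follows; the rename dictionary, probabilities, and rule set are illustrative assumptions, far simpler than eTuner's actual perturbation rules:

```python
import random

# Sketch of the workload generator: perturb schema S into a target Ti,
# recording the correct matches implied by each perturbation.

RENAMES = {"last": ["emp-last", "surname"], "salary": ["wage", "pay"]}

def perturb(schema, drop_prob=0.25, rename_prob=0.5, rng=random):
    """Return (perturbed column list, gold matches src_col -> tgt_col)."""
    target, gold = [], {}
    for col in schema:
        if rng.random() < drop_prob:          # perturb # of columns
            gold[col] = None                  # dropped column matches nothing
            continue
        new = col
        if col in RENAMES and rng.random() < rename_prob:
            new = rng.choice(RENAMES[col])    # perturb column names
        target.append(new)
        gold[col] = new                       # match known by construction
    return target, gold

S = ["id", "first", "last", "salary"]
workload = [perturb(S) for _ in range(5)]     # W = {(S,T1), ..., (S,T5)}
```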
The eTuner Architecture
[Figure: the eTuner architecture. A Workload Generator applies perturbation rules to schema S, producing a synthetic workload of pairs (S, Ti) with known matches Ωi; a Staged Searcher then uses this workload and the tuning procedures to turn matching tool M into a tuned matching tool M]
More details / experiments in [Sayyadian et al., VLDB-05]
19
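The Staged Searcher can be illustrated with a greedy, one-knob-at-a-time search, which sidesteps the cross-product explosion of Challenge 2. This is only a sketch under assumed knob names and domains; eTuner's actual staging is more sophisticated:

```python
# Greedy, component-at-a-time tuning: fix all knobs at defaults, then
# optimize one knob per stage while holding the others fixed.

KNOBS = {                       # per-component knob domains (assumed)
    "combiner": ["avg", "min", "max"],
    "threshold": [0.3, 0.5, 0.7],
    "post_prune": [True, False],
}

def staged_search(score, knobs):
    """`score(config) -> float`; returns the greedily tuned config."""
    config = {k: values[0] for k, values in knobs.items()}   # defaults
    for knob, values in knobs.items():                       # one stage/knob
        best = max(values, key=lambda v: score({**config, knob: v}))
        config[knob] = best
    return config
```

In eTuner the `score` function would be the workload-based scorer from the previous slide; the greedy order trades optimality for a search cost linear, rather than exponential, in the number of knobs.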
eTuner: Current Status

Only the first step
– but now we have a line of attack on a long-standing problem
Current directions:
– find optimal synthetic workloads
– develop faster search methods
– extend to other matching scenarios
– adapt the ideas beyond schema matching: wrapper maintenance [VLDB-05b], domain-specific search engines?
20
Automate Integration Tasks: Summary

Schema matching
– architecture: WebDB-00, SIGMOD-01, WWW-02
– long-standing problems: SIGMOD-04a, eTuner [VLDB-05a]
– learning/other techniques: CIDR-03, VLDBJ-03, MLJ-03, WebDB-03, SIGMOD-04b, ICDE-05a, ICDE-05b
– novel problem: debug schemas for interoperability [ongoing]
– industry transfer: involving 2 startups
– promote research area: workshop at ISWC-03, special issues in SIGMOD Record-04 & AI Magazine-05, survey
Query reformulation: ICDE-02
Mediated schema construction: SIGMOD-04b, ICDM-05, ICDE-06
Duplicate tuple removal: AAAI-05, Tech Report 06a, 06b
Wrapper maintenance: VLDB-05b
21
Research Directions

Automate integration tasks
– to minimize human labor

Leverage users
– to spread the cost

Simplify integration tasks
– so that they can be done quickly
22
The MOBS Project

Learn from a multitude of users to improve tool accuracy, thus significantly reducing builder workload
[Figure: the tool poses questions to many users and learns from their answers]
MOBS = Mass Collaboration to Build Systems
23
Mass Collaboration

Build software artifacts
– Linux, Apache server, other open-source software
Knowledge bases, encyclopedias
– wikipedia.com
Review & technical support websites
– amazon.com, epinions.com, quiq.com
Detect software bugs
– [Liblit et al. PLDI 03 & 05]
Label images/pages on the Web
– the ESP game, flickr, del.icio.us, My Web 2.0
Improve search engines, recommender systems
Why not data integration systems?
24
Example: Duplicate Data Matching

Serious problem in many settings (e.g., e-commerce):
– "Dell laptop X200 with mouse ..."
– "Mouse for Dell laptop 200 series ..."
– "Dell X200; mouse at reduced price ..."
Hard for machines, but easy for humans
25
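A minimal illustration of why this is "hard for machines": a simple token-overlap score, sketched below, is not the AAAI-05 solution and has no notion that "X200" and "200 series" refer to the same product line; real systems use richer, often learned, similarity measures.

```python
import re

# Token-overlap heuristic for flagging likely duplicate listings.

def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb)

listings = [
    "Dell laptop X200 with mouse ...",
    "Mouse for Dell laptop 200 series ...",
    "Dell X200; mouse at reduced price ...",
]

# Flag every pair whose similarity exceeds a (tuned) threshold.
likely_dups = [(x, y) for i, x in enumerate(listings)
               for y in listings[i + 1:] if jaccard(x, y) > 0.3]
```

With a 0.3 threshold this links the first listing to the other two but misses the pair (second, third), whose word overlap is low even though a human immediately sees the duplication.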
Key Challenges


How to modify tools to learn from users?
How to combine noisy user answers?
– multiple noisy oracles: build user models, learn them via interaction with users
– a novel form of active learning
How to obtain user participation?
– data experts, often willing to help (e.g., Illinois Fire Service)
– may be asked to help (e.g., e-commerce)
– volunteers (e.g., online communities), "payment" schemes
26
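The "multiple noisy oracles" idea can be sketched as reliability-weighted voting: estimate each user's accuracy on questions with known answers, then weight their votes accordingly. MOBS's user models are richer than this; the data shapes here are assumptions that just show the principle:

```python
from collections import defaultdict

# Combine answers from multiple noisy "oracles" (users).

def estimate_reliability(user_answers, gold):
    """Fraction of a user's answers on known questions that are correct."""
    asked = [q for q in user_answers if q in gold]
    if not asked:
        return 0.5                      # unknown user: treat as a coin flip
    correct = sum(user_answers[q] == gold[q] for q in asked)
    return correct / len(asked)

def weighted_vote(answers_by_user, reliability):
    """answers_by_user: {user: answer}. Returns the highest-weighted answer."""
    votes = defaultdict(float)
    for user, answer in answers_by_user.items():
        votes[answer] += reliability.get(user, 0.5)
    return max(votes, key=votes.get)
```

Interleaving known-answer questions with real ones is one way to keep the reliability estimates current as users come and go.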
Current Status

Developed first-cut solutions
– built a prototype, experimented with 3-132 users, for source discovery and schema matching
– improved accuracy by 9-60%, reduced workload by 29-88%
Built two simple DI systems on the Web
– almost exclusively with users
Building a real-world application
– DBlife (more later)
See [McCann et al., WebDB-03, ICDE-05, AAAI Spring Symposium-05, Tech Report-06]
27
Research Directions

Automate integration tasks
– to minimize human labor

Leverage users
– to spread the cost

Simplify integration tasks
– so that they can be done quickly
28
Simplify Mediated Schema →
Keyword Search over Multiple Databases
Novel problem
– very useful for urgent / one-time DI needs
– also when users are SQL-illiterate (e.g., Electronic Medical Records)
– also on the Web (e.g., when data is tagged with some structure)
Solution [Kite, Tech Report 06a]
– combines IR, schema matching, data matching, and AI planning
29
Simplify Wrappers →
Structured Queries over Text/Web Data
SELECT ... FROM ... WHERE ... over e-mails, text, Web data, news, etc.
Novel problem
– attracts attention from database / AI / Web researchers at Columbia, IBM TJ Watson/Almaden, UCLA, IIT-Bombay
[SQOUT, Tech Report 06b], [SLIC, Tech Report 06c]
30
Research Directions

Automate integration tasks
– to minimize human labor

Leverage users
– to spread the cost

Simplify integration tasks
– so that they can be done quickly
[Slide annotations: integration is difficult → do best-effort integration, integrate with text, leverage humans; build on this to promote community information management]
31
Community Information Management

Numerous communities on the Web
– database researchers, movie fans, legal professionals, bioinformatics, etc.
– enterprise intranets, tech support groups
Each community = many disparate data sources + people
Members often want to query, monitor, and discover information:
– any interesting connection between researchers X and Y?
– list all courses that cite this paper
– find all citations of this paper in the past one week on the Web
– what is new in the past 24 hours in the database community?
– which faculty candidates are interviewing this year, where?
Current integration solutions fall short of addressing such needs
32
Cimple Project @ Illinois/Wisconsin

Software platform that can be rapidly deployed and customized to manage data-rich online communities
[Figure: community data sources (researcher homepages, group pages, conference pages, the DBworld mailing list, DBLP, Web pages, text documents) are integrated into an entity-relationship graph, e.g. Jim Gray → give-talk → SIGMOD-04. Services over the graph: keyword search, SQL querying, question answering, browsing, mining, alerting/monitoring, news summaries. Users can import & personalize data, share/aggregate, tag entities/relationships, create new content, and receive context-dependent services]
33
Prototype System: DBlife

1,164 data sources, crawled daily: 11,000+ pages/day → 160+ MB; 121,400+ people mentions → 5,600+ persons
34
Structure Related Challenges
[Figure: the Cimple architecture, repeated from the previous slide]
Extraction
– better blackboxes, compose blackboxes, exploit domain knowledge

Maintenance
– critical, but very little has been done

Exploitation
– keyword search over extracted structure? SQL queries?
– detect interesting events?
35
User Related Challenges

Users should be able to:
– import whatever they want
– correct/add to the imported data
– extend the ER schema
– create new content to share/exchange
– ask for context-dependent services
Examples
– user imports a paper, the system provides a bib item
– user imports a movie, adds a description, tags it for exchange

Challenges
– provide incentives, payment
– handle malicious/spam users
– share / aggregate user activities/actions/content
36
Comparison to Current My Web 2.0

Cimple focuses on domain-specific communities
– not the entire Web
Besides the page level
– also considers finer granularities of entities / relations / attributes
– leverages automatic "best-effort" data integration techniques
Leverages user feedback to further improve accuracy
– thus combines both automatic techniques and human efforts
Considers the entire range of keyword search + structured queries
– and how to seamlessly move between them
Allows personalization and sharing
– considers context-dependent services beyond keyword search (e.g., selling, exchange)
37
Applying Cimple to My Web 2.0:
An Example


Going beyond just sharing Web pages
Leveraging My Web 2.0 for other actions
– e.g., selling, exchanging goods (turning it into a classified-ads platform?)
E.g., want to sell my house:
– create a page describing the house
– save it to my account on My Web 2.0
– tag it with "sell:house, sell, house, champaign, IL"
– took me less than 5 minutes (not including creating the page)
– now if someone searches for any of these keywords …
38
[Slides 39-41: screenshots] Here a button can be added to facilitate the "sell" action → provide context-dependent services
The Big Picture [Speculative Mode]
[Figure: three worlds and their query paradigms. Structured data (relational, XML) → databases, SQL; unstructured data (text, Web, email) → IR/Web/AI/Mining, keyword search, QA; and a multitude of users, plus the Semantic Web and industry/the real world. Many apps will involve all three. Exact integration will be difficult: best-effort is promising, and we should leverage humans. Apps will want a broad range of services: keyword search, SQL queries, buy, sell, exchange, etc.]
42
Summary

Data integration: a crucial problem
– at the intersection of database, AI, Web, IR
Integration @ Illinois in my group:
– automate tasks to minimize human labor
– leverage users to spread out the cost
– simplify tasks so that they can be done quickly
Best-effort integration should leverage humans
The Cimple project @ Illinois/Wisconsin
– builds on current work to study Community Information Management
A step toward managing structured + text + users synergistically!
See "anhai" on Yahoo for more details
43