Transcript MetaQuerier

Making Holistic Schema Matching
Robust: An Ensemble Approach
Bin He
Joint work with: Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign
Background: MetaQuerier – large-scale
integration of the deep Web
Query
Result
MetaQuerier
The Deep Web
MetaQuerier
2
MetaQuerier: System architecture [CIDR’05]
MetaQuerier
Front-end: Query Execution
Result
Compilation
Query
Translation
Query Web databases
Source
Selection
Find Web databases
Deep Web Repository
Query Interfaces
Query Capabilities
Subject Domains
Unified Interfaces
The Deep Web
Back-end: Semantics Discovery
Database
Crawler
MetaQuerier
Interface
Extraction
Source
Organization
Schema
Matching
3
Matching query interfaces (QIs)
Book Domain
m:n complex matching
1:1 simple matching
Music Domain
MetaQuerier
4
Traditional approaches of schema matching –
Pairwise attribute correspondence
Pairwise Matching
S1:
author
title
subject
ISBN

Typical matching approaches

S2:
writer
title
category
format
S3:
name
title
keyword
binding


Scale is a challenge


Pairwise Attribute
Correspondence

Cupid [VLDB’01]
LSD [SIGMOD’01]
Only small scale
Large-scale is a must for our task
Scale is an opportunity

Context information is not exploited

S1.author  S3.name
S1.subject  S2.category
MetaQuerier

similar attributes across multiple
schemas
co-occurrence patterns among
attributes
5
Emerging paradigm: Holistic schema matching
approach

Match many schemas at the same time and find all
the matchings at once
Input:
a set of schemas
S1:
author
title
subject
ISBN
S2:
writer
title
category
format
S3:
name
title
keyword
binding
MetaQuerier
Output:
a ranked list of matchings
Holistic
Schema
Matching
author = writer = name
subject = category
format = binding
6
Various techniques to realize holistic matching

Matching as hidden model discovery: Model
generative behavior of schemas from attributes and their
semantic relationships


The MGS framework [SIGMOD’03]
Matching as correlation mining: The correlation of
attributes across sources reflect complex relationships


The DCM framework [KDD’04]
Matching as clustering: Attributes in two schemas may
be similar through attributes in other schemas


Interactive clustering based matcher [SIGMOD’04]
WISE-Integrator [VLDB’03]
MetaQuerier
7
Holistic matching is, in essence– Data mining to
discover semantics for information integration
Generation
Observations
(attribute
occurrences)
Hidden
Regularities
 Hypothesis
Semantics
(semantic
correspondences)
Statistical Analysis
 Holistic matching approach
hidden model discovery
correlation mining
clustering
MetaQuerier
8
The baseline holistic matching architecture
with matching as correlation mining
AA.com
United.com
Expedia.com
Delta.com
The DCM matcher
{adult, child, senior} = passenger
departure date = depart
MetaQuerier
9
The challenge in holistic input: Noisy data quality
Database
Crawler
Interface
Extraction
The Deep
Web
Holistic
Schema
Matching
Source
Organization
With the mining nature, holistic matching suffers the
inherent problem of noisy data quality!

Noisy input is inevitable


extraction of QIs may contain errors
organization of QIs may not be fully accurate
MetaQuerier
10
Example of errors in interface extraction
Result of extraction:
AA.com
The correlation between (adult, children) and passenger is affected by a
single extraction error!
MetaQuerier
11
The impact of noises: Error cascade
Accuracy Ai
Si
Accuracy Aj
Error Cascade
(e.g., Interface Extraction)
Accuracy = Aj?
Sj
Accuracy = Ai*Aj?
(e.g., Holistic Schema Matching)
Q: Errors are often minority, why cascade?
A: The technique of a semantics related task, e.g., data integration, is
often context-sensitive: constraints, heuristics, measures,
parameters, procedures
Si
A general solution
Sj
Sampling and voting techniques: The ensemble framework
MetaQuerier
12
The intuition of the ensemble idea

Sampling: a way to reduce noises in the input
Sampling

1) Contain sufficient good
schemas to mine matchings
2) Contain fewer noises to
have more chance to sustain
the holistic matcher
Voting: a single sampling may be biased, so let us repeat it
multiple times and then vote
It is likely that the holistic matcher can be sustained in most samples
MetaQuerier
13
The ensemble framework for holistic schema matching
S1:
author
title
subject
ISBN
S2:
name
title
keyword
binding
S3:
writer
title
category
format
Holistic
Schema
Matching
S1:
author
title
subject
ISBN
1st trial
S2:
name
title
keyword
binding
S3:
writer
title
category
format
Tth trial
Sampling
Sampling
Holistic
Schema
Matching
Holistic
Schema
Matching
Voting
author = name = writer
subject = category
MetaQuerier
Multiple
Sampling
Rank
Aggregation
author = name = writer
subject = category
14
How the ensemble framework works: An example
Holistic
Schema
Matching
1. author = ISBN
2. publisher= category
3. author = name
Holistic
Schema
Matching
1. author = name
2. subject = category
3. author = ISBN
Holistic
Schema
Matching
1. subject = category
2. author = ISBN
3. author = name
Holistic
Schema
Matching
1. author = name
2. publisher = category
3. author = ISBN
1. author = name
2. subject = category
3. author = ISBN
Please refer to our paper for more formal analysis
MetaQuerier
15
The ensemble idea is inspired by bagging
predictors



Bagging is used in machine learning to maintain
the accuracy of a classifier with the presence of
biased distribution of input data
We are essentially applying bagging techniques in
a new scenario of schema matching
However, we are different in



setting: supervised vs. unsupervised
technique: sampling and voting tech
analytic model: our modeling is specific to matching
MetaQuerier
16
Configuration of multiple sampling

The configuration dilemma

Sample size S



Number of trials T



If S is too small, the sampled data may not be sufficiently
representative
If S is too large, the sampled data may contain too many noises
If T is too small, the voting result may not be sufficiently
convincing
If T is too large, more execution time is needed
Two ways to choose S and T



ST: first choose an S, then derive an appropriate T
TS: first choose an T, then derive an appropriate S
TS is better than ST, since the accuracy is very sensitive
to S, not T
MetaQuerier
17
Aggregating matchings from all trials:
Enforcing the majority matching results


Each trial outputs a ranked list of matchings
Voting is thus to aggregate a set of ranked list into
a single ranked list R, which reflects the ranking
results in the majority

Candidate selection


If the majority of trials do not find a matching M, M is
not considered as a correct matching and thus does
not appear in R
Ranking aggregation

If the majority of trials ranks M1 higher than M2, it will
be good if we can also rank M1 higher than M2 in R
MetaQuerier
18
An example of voting
T1: 1. author = name
2. subject = category
3. author = ISBN
T2: 1. subject = category
2. author = ISBN
3. author = name
T3: 1. author = name
2. publisher = category
3. author = ISBN
All Matchings:
M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category
Candidate Selection:
M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category
Rank Aggregation:
Borda’s aggregation: B(Mi) = Σ rank of Mi in Tj
B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7
Rank matchings according to B(Mi)
MetaQuerier
M1. author = name
M2. subject = category
M3. author = ISBN
19
Experimental setup

Subsystems integration scenario

Interface Extraction + Holistic Schema Matching



Interface Extractor [SIGMOD’04]
The DCM Matcher [KDD’04]
Datasets

Two representative domains in the TEL-8 dataset in
UIUC Web Integration Repository


Books and Airfares
http://metaquerier.cs.uiuc.edu/repository/
MetaQuerier
20
Experimental result: Baseline vs. Ensemble
Domain
Noisy input
P
R
Books
0.73
0.75
Airfares
0.67
0.68
Baseline approach
Domain
(a) Precision of Books
(c) Recall of Books
MetaQuerier
(b) Precision of Airfares
(d) Recall of Airfares
Average accuracy
P
R
Books
0.83
0.89
Airfares
0.79
0.79
Domain
Most frequent
accuracy
P
R
Books
0.9
1.0
Airfares
0.71
0.82
Ensemble approach
21
Experimental result: Outliers vs. Missing Data


Upper bound exists
Two types of data
quality problems



Outliers

(a) Precision of Books
(b) Precision of Airfares



MetaQuerier
data ideally should
not be observed, but
observed
can be solved by the
ensemble approach
Missing data

(c) Recall of Books
Outliers (noises)
Missing data
data ideally should
be observed, but not
cannot be solved by
the ensemble
approach
(d) Recall of Airfares
22
Contributions

Problem



noisy data quality is an inherent challenge for large
scale schema matching
critical for sustaining holistic schema matching as a
practical and viable technique
Solution


an ensemble framework with sampling and voting
techniques, inspired by bagging predictors
we are essentially applying bagging techniques in a
new scenario of schema matching
MetaQuerier
23
Thank You!
MetaQuerier
24