Probabilistic schema mappings
Uncertainty in Data Integration
Ai Jing
2007-11-10
Outline
Data Integration with Uncertainty
Overview of Workshop on
Management of Uncertain Data
Uncertainty in Deep Web
Data Integration with Uncertainty
Motivation and overview
Definition of probabilistic mappings
Query answering w.r.t. p-mappings
Complexity of query answering
Contributions
Traditional Data Integration Systems
SELECT P.title AS title, P.year AS year,
A.name AS author
FROM Author A, Paper P, AuthoredBy B
WHERE A.aid = B.aid
AND P.pid = B.pid
[Figure: query Q over the mediated schema and queries Q1 to Q5 over the data sources]
Uncertainty Can Occur at Three Levels in Data Integration Applications
I. Data Level
II. Mapping Level
III. Query Level
Focus of the paper: Probabilistic schema mappings
Example Probabilistic Mappings
m1 (probability 0.5):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
m2 (probability 0.4):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
m3 (probability 0.1):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
[Figure: attribute correspondences between S and T for each mapping]
Top-k Query Answering w.r.t. Probabilistic Mappings
Q (over the mediated schema): SELECT mailing-addr FROM T
Q1 (probability 0.5): SELECT current-addr FROM S
Q2 (probability 0.4): SELECT permanent-addr FROM S
Q3 (probability 0.1): SELECT email-addr FROM S
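The rewriting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows are made up, and the mappings m1, m2, m3 are assumed to send mailing-addr to current-addr, permanent-addr and email-addr with probabilities 0.5, 0.4 and 0.1. The function name by_table_top_k is hypothetical. Each rewriting is run on S, and a value's probability is the sum of the probabilities of the rewritings that return it.

```python
from collections import defaultdict

# Hypothetical instance of S(pname, email-addr, current-addr, permanent-addr).
S_ROWS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Assumed rewritings of "SELECT mailing-addr FROM T": (source column, probability).
REWRITINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]


def by_table_top_k(rows, rewritings, k):
    """Return the k most probable answers under by-table semantics."""
    prob = defaultdict(float)
    for column, p in rewritings:
        # One rewriting is applied to the whole table; set semantics per rewriting.
        for value in {row[column] for row in rows}:
            prob[value] += p
    ranked = sorted(prob.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


top2 = by_table_top_k(S_ROWS, REWRITINGS, k=2)
print(top2)
```

With this data, "Sunnyvale" is returned by both Q1 and Q2, so it ranks above "Mountain View", which only Q1 returns.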
Definition of probabilistic mappings
Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
one-to-one schema matching
have exact knowledge of mapping
Probabilistic Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
pname -> name: 1.0
email-addr -> mailing-addr: 0.1
home-addr -> mailing-addr: 0.5
office-addr -> mailing-addr: 0.4
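By-Table Semantics
[Figure: target instance DT obtained by applying a single mapping m to the whole source table; probability 0.5]
By-Tuple Semantics
[Figure: target instance DT obtained by choosing a mapping for each source tuple; for the sequence <m1,m3>, Pr(<m1,m3>)=0.05]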
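The value Pr(<m1,m3>)=0.05 can be checked with a short sketch. It assumes the mapping probabilities 0.5, 0.4 and 0.1 from the earlier example and assumes that, under by-tuple semantics, the mapping for each tuple is chosen independently, so the probability of a sequence is the product of the probabilities of its mappings. The names P_MAPPING and sequence_probability are hypothetical.

```python
from itertools import product
from math import prod

# Assumed mapping probabilities, taken from the example p-mapping.
P_MAPPING = {"m1": 0.5, "m2": 0.4, "m3": 0.1}


def sequence_probability(seq):
    """Probability of a mapping sequence: one mapping per source tuple."""
    return prod(P_MAPPING[m] for m in seq)


# Two source tuples: the first uses m1, the second uses m3.
pr_m1_m3 = sequence_probability(("m1", "m3"))

# The probabilities of all sequences over two tuples sum to 1.
total = sum(sequence_probability(s) for s in product(P_MAPPING, repeat=2))

print(pr_m1_m3, total)
```

Under by-table semantics only the sequences that repeat a single mapping for every tuple are possible, which is why there are just three possible target instances in that case.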
By-Table Query Answering
By-Tuple Query Answering
Complexity of query answering
More on By-Tuple Query Answering
The high complexity comes from computing probabilities
[Figure: one of the possible target instances DT]
The number of mapping sequences is exponential in the size of the input data
n tuples, m mappings: m^n mapping sequences
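A brute-force sketch makes the m^n blow-up concrete. It answers SELECT mailing-addr FROM T under by-tuple semantics by enumerating every mapping sequence. The source rows are made up, the per-attribute probabilities are assumed from the p-mapping example, and the function name by_tuple_answers is hypothetical. This is the naive enumeration, not the PTIME rewriting mentioned for special query classes.

```python
from collections import defaultdict
from itertools import product

# Assumed p-mapping for mailing-addr: (source column, probability).
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

# Hypothetical instance of S(pname, email-addr, home-addr, office-addr).
S_ROWS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]


def by_tuple_answers(rows, mappings):
    """Enumerate all m^n mapping sequences and sum probabilities per answer."""
    prob = defaultdict(float)
    n_sequences = 0
    for seq in product(mappings, repeat=len(rows)):
        n_sequences += 1
        pr = 1.0
        answer = set()
        for row, (column, p) in zip(rows, seq):
            pr *= p                  # independent choice per tuple
            answer.add(row[column])  # value this tuple contributes
        for value in answer:
            prob[value] += pr
    return dict(prob), n_sequences


answers, n_sequences = by_tuple_answers(S_ROWS, MAPPINGS)
print(n_sequences, answers)
```

With m = 3 mappings and n = 2 tuples the loop runs 3^2 = 9 times; every extra tuple multiplies the work by m.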
There are two subsets of queries that can be
answered in PTIME by query rewriting
SELECT mailing-addr FROM T
SELECT mailing-addr FROM T,V
WHERE T.mailing-addr = V.hightech
In general, query answering cannot be done by
query rewriting
Extensions to More Expressive Mappings
The complexity results for query answering carry over
to three extensions to more expressive mappings
Complex mappings
GLAV mappings
Conditional mappings
Contributions
Definition of probabilistic mappings
Semantics: by-table vs. by-tuple
Complexity of query answering
Overview of MUD 2007
Theory
A New Language and Architecture to Obtain Fuzzy Global Dependencies
About the Processing of Division Queries Addressed to Possibilistic Databases
Making Aggregation Work in Uncertain and Probabilistic Databases
Materialized Views in Probabilistic Databases
Application
Flexible matching of Ear Biometrics
Consistent Joins Under Primary Key Constraints
A New Language and Architecture to Obtain Fuzzy
Global Dependencies
SQL does not satisfy the minimum requirements
to be a true data mining (DM) language
A New Language: dmFSQL (data mining Fuzzy
Structured Query Language)
Fuzzy Database
Data mining
About the Processing of Division Queries Addressed
to Possibilistic Databases
They devised a data model which is a strong
representation system for operations in
possibilistic databases
A possibilistic database D can be interpreted
as a weighted disjunctive set of regular
databases
Division Queries
Making Aggregation Work in
Uncertain and Probabilistic Databases
Trio is a prototype database management
system for storing and querying data with
uncertainty and lineage
Trio’s query language: TriQL
Trio data model and query semantics
Aggregation function in the Trio system for
uncertain and probabilistic data
Materialized Views in Probabilistic
Databases
Materialized views over probabilistic data
may not define a unique probability
distribution
view representation
Answer queries on large probabilistic
data set more efficiently with
materialized views
Flexible matching of Ear Biometrics
Research area
Image Recognition (or Identification)
Scenario
identifying found bodies in a large-scale disaster
Challenge
fast and cheap identification
no DNA-databases or fingerprint
databases are at hand
Consistent Joins Under Primary Key
Constraints
Inconsistent database
primary key
will the natural join of the repaired relations
always be nonempty, no matter which
tuples are selected?
game theory, winning strategy
Uncertainty in Deep Web
No “perfect” data
Noise
Dirty
Redundancy
……
No “perfect” solution
Web data extraction
Interface integration
……
Uncertainty in Deep Web Data Integration(1)
[Figure: Deep Web data integration architecture, annotated with "Robust" and "Evaluable"]
Interface Integration Module: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration
Query Process Module: WDB Selection, Query Translation, Query Submission
Result Process Module: Results Extraction, Results Annotation, Data Merging
Other components: Deep Web, Web DB, Integrated Interface, RDB
Uncertainty in Deep Web Data Integration(2)
[Figure: Deep Web data integration architecture, annotated with "Tuning", "Feedback" and "Evaluable"]
Interface Integration Module: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration
Query Process Module: WDB Selection, Query Translation, Query Submission
Result Process Module: Results Extraction, Results Annotation, Data Merging
Other components: Deep Web, Web DB, Integrated Interface, RDB
Uncertainty in Jobtong(1)
Data level
Uncertainty in Jobtong(2)
Query level
How can we give every result a probability to show its
importance?
Uncertainty in Jobtong(3)
The automatic maintenance of
configuration files
<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a/span</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a/span</xpath>
    </item>
  </items>
</record>
<record>
  <xpath>/html/body//table/tr[@class='list2' or @class='list3']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a</xpath>
    </item>
  </items>
</record>
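How such a record might drive extraction can be sketched with Python's standard xml.etree.ElementTree. Everything beyond the first record itself is an assumption: the sample page is made up, the function name extract is hypothetical, and the meaning of the combination element is not stated on the slide, so it is ignored here. ElementTree supports only a subset of XPath and rejects absolute paths, so the sketch re-anchors the row path at the root html element; the "or" predicate in the second record would need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# First record from the slide, copied as-is.
CONFIG = """
<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item><name>title</name><xpath>td[2]/a/span</xpath></item>
    <item><name>company</name><xpath>td[3]/a/span</xpath></item>
  </items>
</record>
"""

# Hypothetical, well-formed result page.
PAGE = """
<html><body><div><table>
  <tr class="head"><td>No</td><td>Job</td><td>Company</td></tr>
  <tr class="nob"><td>1</td><td><a><span>Engineer</span></a></td>
      <td><a><span>Acme</span></a></td></tr>
  <tr class="nob"><td>2</td><td><a><span>Analyst</span></a></td>
      <td><a><span>Initech</span></a></td></tr>
</table></div></body></html>
"""


def extract(config_xml, page_xml):
    """Apply one <record> rule to a page and return one dict per matched row."""
    record = ET.fromstring(config_xml)
    page = ET.fromstring(page_xml)
    row_path = record.findtext("xpath").strip()
    # Assumption: the page root is <html>, so make the path relative to it.
    if row_path.startswith("/html/"):
        row_path = "./" + row_path[len("/html/"):]
    items = [(item.findtext("name"), item.findtext("xpath").strip())
             for item in record.iter("item")]
    return [{name: row.findtext(path) for name, path in items}
            for row in page.findall(row_path)]


results = extract(CONFIG, PAGE)
print(results)
```

If the site changes its markup (for example the class name 'nob'), the row path matches nothing and the result list is empty, which is one simple signal that a configuration file needs maintenance.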
Q&A
Thank you!