Probabilistic schema mappings


Uncertainty in Data Integration
Ai Jing
2007-11-10
Outline
• Data Integration with Uncertainty
• Overview of Workshop on Management of Uncertain Data
• Uncertainty in Deep Web
Data Integration with Uncertainty
• Motivation and overview
• Definition of probabilistic mappings
• Query answering w.r.t. p-mappings
• Complexity of query answering
• Contributions
Traditional Data Integration Systems
SELECT P.title AS title, P.year AS year,
       A.name AS author
FROM Author A, Paper P, AuthoredBy AB
WHERE A.aid = AB.aid
  AND P.pid = AB.pid
[Figure: a query Q over the mediated schema and the queries Q1 to Q5 sent to the data sources]
Uncertainty Can Occur at Three Levels in Data Integration Applications
I. Data Level
II. Mapping Level
III. Query Level
Focus of the paper: probabilistic schema mappings (II. Mapping Level)
Example Probabilistic Mappings
m1 (probability 0.5):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
m2 (probability 0.4):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
m3 (probability 0.1):
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
[Figure: in each mapping, lines connect attributes of T to attributes of S; the three mappings differ in these correspondences]
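Top-k Query Answering w.r.t. Probabilistic Mappings
Mediated schema:
Q: SELECT mailing-addr FROM T
Source schema, one rewritten query per candidate mapping:
Q1: SELECT current-addr FROM S (probability 0.5)
Q2: SELECT permanent-addr FROM S (probability 0.4)
Q3: SELECT email-addr FROM S (probability 0.1)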
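The top-k idea on this slide can be sketched in a few lines of Python. This is only an illustration under by-table semantics, assuming that each mapping sends mailing-addr to the source attribute used in Q1 to Q3 and carries the probability of m1 to m3 (0.5, 0.4, 0.1). The source rows and the function name by_table_answers are hypothetical.

```python
from collections import defaultdict

# Assumed pairing: the source attribute that each mapping sends to
# T.mailing-addr, with the probability of that mapping (m1, m2, m3).
P_MAPPING = [
    ("current-addr", 0.5),    # m1, rewritten query Q1
    ("permanent-addr", 0.4),  # m2, rewritten query Q2
    ("email-addr", 0.1),      # m3, rewritten query Q3
]

# Hypothetical source instance of S (not from the slides).
S_ROWS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]


def by_table_answers(rows, p_mapping, k=None):
    """Answer SELECT mailing-addr FROM T under by-table semantics."""
    prob = defaultdict(float)
    for source_attr, p in p_mapping:
        # One rewritten query per mapping; each distinct answer gains p.
        for value in {row[source_attr] for row in rows}:
            prob[value] += p
    # Rank by descending probability, breaking ties alphabetically.
    ranked = sorted(prob.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if k is None else ranked[:k]


top2 = by_table_answers(S_ROWS, P_MAPPING, k=2)
```

With these rows, "Sunnyvale" is returned by both Q1 and Q2, so its probability is the sum of the two mapping probabilities, and it ranks first.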
Definition of probabilistic mappings
• Schema Mapping
  S=(pname, email-addr, home-addr, office-addr)
  T=(name, mailing-addr)
  - one-to-one schema matching
  - have exact knowledge of mapping
• Probabilistic Mapping
  S=(pname, email-addr, home-addr, office-addr)
  T=(name, mailing-addr)
  [Figure: correspondence lines between S and T labeled 1.0, 0.1, 0.5, 0.4]
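One possible way to hold a p-mapping in code is sketched below: a list of distinct one-to-one mappings whose probabilities sum to 1. The class and function names are hypothetical. The attachment of the labels is also an assumption: the sketch reads 1.0, 0.1, 0.5, 0.4 in the order of the attributes of S, so that pname matches name in every mapping and mailing-addr is matched by email-addr, home-addr, or office-addr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    # Each pair is (source attribute, target attribute).
    correspondences: frozenset
    probability: float


def is_one_to_one(mapping):
    """No source or target attribute may appear in two correspondences."""
    sources = [s for s, _ in mapping.correspondences]
    targets = [t for _, t in mapping.correspondences]
    return len(set(sources)) == len(sources) and len(set(targets)) == len(targets)


def validate_p_mapping(mappings, tol=1e-9):
    """Distinct one-to-one mappings, probabilities in [0, 1] that sum to 1."""
    if len({m.correspondences for m in mappings}) != len(mappings):
        return False
    if not all(is_one_to_one(m) for m in mappings):
        return False
    if not all(0.0 <= m.probability <= 1.0 for m in mappings):
        return False
    return abs(sum(m.probability for m in mappings) - 1.0) <= tol


def correspondence_marginals(mappings):
    """Total probability of the mappings that contain each correspondence."""
    marginals = {}
    for m in mappings:
        for pair in m.correspondences:
            marginals[pair] = marginals.get(pair, 0.0) + m.probability
    return marginals


# Assumed reading of the slide's labels (see the note above).
P_MAPPING = [
    Mapping(frozenset({("pname", "name"), ("home-addr", "mailing-addr")}), 0.5),
    Mapping(frozenset({("pname", "name"), ("office-addr", "mailing-addr")}), 0.4),
    Mapping(frozenset({("pname", "name"), ("email-addr", "mailing-addr")}), 0.1),
]
is_valid = validate_p_mapping(P_MAPPING)
marginals = correspondence_marginals(P_MAPPING)
```

Under this reading the marginal of (pname, name) comes out as 1.0, which agrees with the label on the slide.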
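By-Table Semantics
[Figure: target instance DT obtained by applying a single mapping m to the whole source table; probability shown: 0.5]
By-Tuple Semantics
[Figure: target instance DT obtained by applying a sequence of mappings, one per source tuple; Pr(<m1,m3>)=0.05]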
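The value Pr(<m1,m3>)=0.05 is the product of the two mapping probabilities, because under by-tuple semantics a mapping is chosen independently for each tuple. A small check, assuming the probabilities 0.5, 0.4, 0.1 for m1, m2, m3 from the earlier example (the function name is hypothetical):

```python
from itertools import product
from math import prod

# Mapping probabilities from the example p-mapping.
MAPPING_PROB = {"m1": 0.5, "m2": 0.4, "m3": 0.1}


def sequence_probability(seq, mapping_prob=MAPPING_PROB):
    """Probability of a mapping sequence: one independent choice per tuple."""
    return prod(mapping_prob[m] for m in seq)


# The sequence <m1, m3> for a source table with two tuples.
p_m1_m3 = sequence_probability(("m1", "m3"))

# All sequences for two tuples; their probabilities sum to 1.
all_sequences = list(product(MAPPING_PROB, repeat=2))
total = sum(sequence_probability(s) for s in all_sequences)
```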
By-Table Query Answering
By-Tuple Query Answering
Complexity of query answering
More on By-Tuple Query Answering
• The high complexity comes from computing probabilities
  - the number of mapping sequences is exponential in the size of the input data
  - n tuples, m mappings: m^n mapping sequences
• There are two subsets of queries that can be answered in PTIME by query rewriting
  - SELECT mailing-addr FROM T
  - SELECT mailing-addr FROM T,V WHERE T.mailing-addr = V.hightech
• In general query answering cannot be done by query rewriting
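To make the m^n blow-up concrete, the sketch below answers SELECT mailing-addr FROM T under by-tuple semantics in two ways: by enumerating every mapping sequence, and by a per-tuple product that works for this simple projection because the mapping choices are independent across tuples. It is an illustration only, not the rewriting algorithm of the paper. The source rows, the function names, and the pairing of mappings with source attributes are assumptions.

```python
from itertools import product
from math import prod

# Assumed: the source attribute each mapping sends to T.mailing-addr.
P_MAPPING = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

# Hypothetical source tuples of S.
S_ROWS = [
    {"email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]


def by_tuple_brute_force(rows, p_mapping):
    """Enumerate all m^n mapping sequences (exponential in the number of tuples)."""
    answers = {}
    for seq in product(p_mapping, repeat=len(rows)):
        p_seq = prod(p for _, p in seq)
        # Values of mailing-addr produced when tuple i uses mapping seq[i].
        values = {row[attr] for row, (attr, _) in zip(rows, seq)}
        for v in values:
            answers[v] = answers.get(v, 0.0) + p_seq
    return answers


def by_tuple_projection(rows, p_mapping):
    """Polynomial-time shortcut that is valid for this projection query only."""
    candidates = {row[attr] for row in rows for attr, _ in p_mapping}
    answers = {}
    for v in candidates:
        miss = 1.0  # probability that no tuple produces v
        for row in rows:
            hit = sum(p for attr, p in p_mapping if row[attr] == v)
            miss *= 1.0 - hit
        answers[v] = 1.0 - miss
    return answers


brute = by_tuple_brute_force(S_ROWS, P_MAPPING)
fast = by_tuple_projection(S_ROWS, P_MAPPING)
```

Both functions return the same probabilities on this input; the first visits 3^2 = 9 sequences, and the count grows as m^n with more tuples.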
Extensions to More Expressive Mappings
• The complexity results for query answering carry over to three extensions to more expressive mappings
  - Complex mappings
  - GLAV mappings
  - Conditional mappings
Contributions
• Definition of probabilistic mappings
  - Semantics: by-table vs. by-tuple
• Complexity of query answering
Overview of MUD 2007
• Theory
  - A New Language and Architecture to Obtain Fuzzy Global Dependencies
  - About the Processing of Division Queries Addressed to Possibilistic Databases
  - Making Aggregation Work in Uncertain and Probabilistic Databases
  - Materialized Views in Probabilistic Databases
• Application
  - Flexible Matching of Ear Biometrics
  - Consistent Joins Under Primary Key Constraints
A New Language and Architecture to Obtain Fuzzy Global Dependencies
• SQL does not satisfy the minimum requirements to be a true data mining (DM) language
• A New Language: dmFSQL (data mining Fuzzy Structured Query Language)
• Fuzzy Database
• Data mining
About the Processing of Division Queries Addressed to Possibilistic Databases
• The authors devised a data model which is a strong representation system for operations in possibilistic databases
• A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases
• Division Queries
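The "weighted disjunctive set of regular databases" reading can be sketched directly: list the candidate regular databases with their possibility degrees, run an ordinary division query in each, and combine the results with max. This is a naive illustration and not the compact data model of the workshop paper. The worlds, the degrees, and the function names are hypothetical.

```python
def divide(r, s):
    """Relational division: every a such that (a, b) is in r for all b in s."""
    candidates = {a for a, _ in r}
    return {a for a in candidates if all((a, b) in r for b in s)}


# Hypothetical possibilistic database: (possibility degree, regular database).
# Each regular database is a set of (supplier, part) pairs.
WORLDS = [
    (1.0, {("s1", "p1"), ("s1", "p2"), ("s2", "p1")}),
    (0.7, {("s1", "p1"), ("s2", "p1"), ("s2", "p2")}),
    (0.3, {("s1", "p1"), ("s1", "p2"), ("s2", "p1"), ("s2", "p2")}),
]
DIVISOR = {"p1", "p2"}  # "suppliers that supply all of these parts"


def possibility_and_necessity(worlds, divisor):
    """For each candidate answer, return (possibility, necessity)."""
    candidates = {a for _, world in worlds for a, _ in world}
    result = {}
    for a in candidates:
        # Possibility: the best degree among worlds where a is an answer.
        poss = max((w for w, world in worlds if a in divide(world, divisor)),
                   default=0.0)
        # Necessity: 1 minus the possibility that a is not an answer.
        poss_not = max((w for w, world in worlds if a not in divide(world, divisor)),
                       default=0.0)
        result[a] = (poss, 1.0 - poss_not)
    return result


degrees = possibility_and_necessity(WORLDS, DIVISOR)
```

Enumerating worlds like this does not scale, which is one reason a compact representation system is of interest.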
Making Aggregation Work in Uncertain and Probabilistic Databases
• Trio is a prototype database management system for storing and querying data with uncertainty and lineage
• Trio’s query language: TriQL
• Trio data model and query semantics
• Aggregation functions in the Trio system for uncertain and probabilistic data
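As a rough illustration of what aggregating uncertain data can mean, the sketch below computes three summaries of SUM over tuples that have alternative values: the lowest possible sum, the highest possible sum, and the expected sum. The choice of these three summaries is an assumption, not something stated on the slide; the code is not TriQL and not Trio's implementation, and the data and function names are hypothetical.

```python
# Hypothetical uncertain relation: each tuple is a list of alternatives
# (value, probability). If the probabilities sum to less than 1, the tuple
# may be absent, in which case it contributes nothing to the sum.
X_TUPLES = [
    [(10, 0.6), (20, 0.4)],
    [(5, 0.5)],
    [(7, 1.0)],
]


def _options(alternatives, tol=1e-9):
    """Possible contributions of one tuple to the sum (0 if it may be absent)."""
    options = [v for v, _ in alternatives]
    if sum(p for _, p in alternatives) < 1.0 - tol:
        options.append(0)
    return options


def low_sum(x_tuples):
    """Smallest sum over all possible worlds (tuples assumed independent)."""
    return sum(min(_options(alts)) for alts in x_tuples)


def high_sum(x_tuples):
    """Largest sum over all possible worlds (tuples assumed independent)."""
    return sum(max(_options(alts)) for alts in x_tuples)


def expected_sum(x_tuples):
    """Expected sum, by linearity of expectation."""
    return sum(v * p for alts in x_tuples for v, p in alts)


low, high, expected = low_sum(X_TUPLES), high_sum(X_TUPLES), expected_sum(X_TUPLES)
```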
Materialized Views in Probabilistic Databases
• Materialized views for probabilistic databases may not define a unique probability distribution
• View representation
• Answer queries on large probabilistic data sets more efficiently with materialized views
Flexible Matching of Ear Biometrics
• Research area
  - Image Recognition (or Identification)
• Scenario
  - identifying found bodies in a large-scale disaster
• Challenge
  - fast and cheap identification
  - no DNA databases or fingerprint databases are at hand
Consistent Joins Under Primary Key Constraints
• Inconsistent database
• primary key
• will the natural join of the repaired relations always be nonempty, no matter which tuples are selected?
• game theory, winning strategy
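The question on this slide can be stated as a brute-force check: build every repair (keep exactly one tuple per primary-key value) and test whether the natural join is nonempty in all of them. The sketch below does exactly that and is exponential in the number of conflicting keys; it is not the game-theoretic method mentioned on the slide. The relations and function names are hypothetical.

```python
from itertools import product


def repairs(relation, key):
    """Yield every repair: exactly one tuple kept per primary-key value."""
    groups = {}
    for row in relation:
        groups.setdefault(row[key], []).append(row)
    for choice in product(*groups.values()):
        yield list(choice)


def natural_join(r, s):
    """Join rows that agree on all attributes the two rows share."""
    joined = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[c] == b[c] for c in common):
                joined.append({**a, **b})
    return joined


def join_always_nonempty(r, r_key, s, s_key):
    """True if the join is nonempty in every combination of repairs."""
    return all(natural_join(rr, sr)
               for rr in repairs(r, r_key)
               for sr in repairs(s, s_key))


# Hypothetical inconsistent relations: "ann" and "hr" violate their keys.
WORKS = [{"emp": "ann", "dept": "sales"}, {"emp": "ann", "dept": "hr"}]
LOCATED = [{"dept": "sales", "city": "Paris"},
           {"dept": "hr", "city": "Rome"},
           {"dept": "hr", "city": "Oslo"}]

always = join_always_nonempty(WORKS, "emp", LOCATED, "dept")
# Without any "hr" row, the repair that keeps (ann, hr) joins with nothing.
not_always = join_always_nonempty(WORKS, "emp", LOCATED[:1], "dept")
```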
Uncertainty in Deep Web
• No “perfect” data
  - Noise
  - Dirty
  - Redundancy
  - …
• No “perfect” solution
  - Web data extraction
  - Interface integration
  - …
Uncertainty in Deep Web Data Integration (1)
[Figure: Deep Web data integration architecture.
Interface Integration Module: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration.
Query Process Module: WDB Selection, Query Translation, Query Submission.
Result Process Module: Results Extraction, Results Annotation, Data Merging.
Other labels: Deep Web, Web DB, RDB, Integrated Interface.
Callouts: Robust, Evaluable.]
Uncertainty in Deep Web Data Integration (2)
[Figure: Deep Web data integration architecture.
Interface Integration Module: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration.
Query Process Module: WDB Selection, Query Translation, Query Submission.
Result Process Module: Results Extraction, Results Annotation, Data Merging.
Other labels: Deep Web, Web DB, RDB, Integrated Interface.
Callouts: Tuning, Feedback, Evaluable.]
Uncertainty in Jobtong (1)
• Data level
Uncertainty in Jobtong (2)
• Query level
  - How can we give every result a probability to show its importance?
Uncertainty in Jobtong (3)
• The automatic maintenance of configuration files
<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a/span</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a/span</xpath>
    </item>
  </items>
</record>
<record>
  <xpath>/html/body//table/tr[@class='list2' or @class='list3']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a</xpath>
    </item>
  </items>
</record>
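A first step toward maintaining such files automatically is to load them into a structure that a program can inspect and rewrite. The sketch below parses the two records with Python's standard xml.etree.ElementTree. The enclosing <records> element and the function name load_rules are assumptions added for the example, and the meaning of <combination> is not stated on the slide, so it is kept as a plain integer.

```python
import xml.etree.ElementTree as ET

# The two records from the slide, wrapped in a hypothetical <records> root
# so that the text is a single well-formed XML document.
CONFIG = """
<records>
  <record>
    <xpath>/html/body//table/tr[@class='nob']</xpath>
    <combination>2</combination>
    <items>
      <item><name>title</name><xpath>td[2]/a/span</xpath></item>
      <item><name>company</name><xpath>td[3]/a/span</xpath></item>
    </items>
  </record>
  <record>
    <xpath>/html/body//table/tr[@class='list2'
or @class='list3']</xpath>
    <combination>2</combination>
    <items>
      <item><name>title</name><xpath>td[2]/a</xpath></item>
      <item><name>company</name><xpath>td[3]/a</xpath></item>
    </items>
  </record>
</records>
"""


def load_rules(xml_text):
    """Return one dict per <record>: row XPath, combination, item XPaths."""
    root = ET.fromstring(xml_text)
    rules = []
    for record in root.findall("record"):
        rules.append({
            # Collapse line breaks inside the XPath text into single spaces.
            "row_xpath": " ".join(record.findtext("xpath").split()),
            "combination": int(record.findtext("combination")),
            "items": {
                item.findtext("name"): item.findtext("xpath").strip()
                for item in record.findall("items/item")
            },
        })
    return rules


rules = load_rules(CONFIG)
```

The XPath strings are only stored here, not evaluated; ElementTree supports a limited XPath subset, so applying them to pages would need a fuller XPath engine.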
Q&A
Thank you!