iTrails: Pay-as-you-go Information Integration in Dataspaces
Presented By Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
2008-02-22
Summarized by Sungchan Park
Problem: Querying Several Sources
Center for E-Business Technology
Copyright  2008 by CEBT
Solution #1: Use a Search Engine
Solution #2: Use an Information Integration System
iTrail Core Idea
• Is there an integration solution in-between these two extremes?
• Declaratively add lightweight ‘hints’ to a search engine, thus allowing gradual enrichment of loosely integrated data sources
Example Scenario
Query: “pdf yesterday”
Hints (Trails):
1. The date attribute is mapped to the modified attribute
2. The date attribute is mapped to the received attribute
3. The yesterday keyword is mapped to a query for values of the date attribute equal to yesterday's date
4. The pdf keyword is mapped to a query for elements whose names end in pdf
Where do hints come from?
• Given by the user
  – Explicitly
  – Via relevance feedback
• (Semi-)Automatically
  – Information extraction techniques
  – Automatic schema matching
  – Ontologies and thesauri (e.g., WordNet)
  – User communities (e.g., trails on gene data, bookmarks)
• All these aspects are beyond the scope of this paper
Data and Query Model
• Data Model
  – Assume that all data is represented by a logical graph G
  – A query is also represented by a graph
Query Syntax
Query Example
• //home/projects//*[“Mike”]
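To make the query example concrete, here is a minimal sketch of how such a path-plus-keyword query could be evaluated over a small graph. The Node class, the sample data, and the function names are assumptions made for illustration; they are not taken from the slides or from iMeMex, and the semantics are simplified (a tree instead of a general graph, substring matching for the keyword).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node: a name, optional text content, child nodes."""
    name: str
    content: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node strictly below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)


def query_projects(root, keyword):
    """Simplified evaluation of //home/projects//*["keyword"]."""
    hits = []
    for home in [root, *descendants(root)]:       # //home
        if home.name != "home":
            continue
        for projects in home.children:            # /projects
            if projects.name != "projects":
                continue
            # //*["keyword"]: any descendant whose content mentions the keyword
            hits += [n.name for n in descendants(projects) if keyword in n.content]
    return hits


# Invented sample data, for illustration only.
GRAPH = Node("root", children=[
    Node("home", children=[
        Node("projects", children=[
            Node("PIM", children=[Node("notes.txt", "Meeting with Mike")]),
            Node("budget.xls", "Q1 figures"),
        ]),
        Node("photos", children=[Node("mike.jpg", "Mike at the lake")]),
    ]),
])

print(query_projects(GRAPH, "Mike"))  # ['notes.txt']
```

Only notes.txt is returned: mike.jpg also mentions the keyword, but it lies outside //home/projects.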
Basic Form of a Trail
• A unidirectional trail
• A bidirectional trail
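One possible way to represent the two trail forms in code is sketched below. The Trail class and its field names are hypothetical, and the sketch assumes that a bidirectional trail simply permits rewriting in both directions; the slide's own formal definitions are not preserved in this transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    """Hypothetical trail record: left query, right query, direction flag."""
    left: str
    right: str
    bidirectional: bool = False

    def directions(self):
        """Return the (from, to) rewrites this trail permits."""
        pairs = [(self.left, self.right)]
        if self.bidirectional:
            # Assumption: a bidirectional trail also applies right-to-left.
            pairs.append((self.right, self.left))
        return pairs


# The unidirectional example uses the trail expression shown later in the deck;
# the bidirectional keyword pair is invented for illustration.
one_way = Trail("//home/*.name", "//calendar//*.tuple.category")
two_way = Trail("car", "auto", bidirectional=True)

print(one_way.directions())
print(two_way.directions())
```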
Trail Example
• Trails in an example scenario
  – Trails
  – Given query: “pdf yesterday”
  – Transformed query: //*.pdf[modified=yesterday() OR received=yesterday()]
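The rewrite of “pdf yesterday” can be reproduced with a small string-level sketch. The two dictionaries are a hypothetical encoding of the four hints from the example scenario, and the rewrite function is an assumption for illustration; the actual system rewrites logical query plans rather than strings.

```python
# Hypothetical encoding of the keyword hints: keyword -> query fragment.
KEYWORD_TRAILS = {
    "pdf": "//*.pdf",                 # names ending in pdf
    "yesterday": "date=yesterday()",  # date attribute equal to yesterday
}
# Hypothetical encoding of the attribute hints: attribute -> mapped attributes.
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}


def rewrite(keywords):
    """Rewrite a keyword query into a path query using the hints above."""
    path, predicates = "//*", []
    for kw in keywords.split():
        target = KEYWORD_TRAILS.get(kw)
        if target is None:
            # No hint matches: keep the plain keyword predicate.
            predicates.append(f'"{kw}"')
        elif target.startswith("//"):
            # The hint replaces the path part of the query.
            path = target
        else:
            # The hint yields an attribute predicate; expand mapped attributes.
            attr, value = target.split("=", 1)
            alternatives = ATTRIBUTE_TRAILS.get(attr, [attr])
            predicates.append(" OR ".join(f"{a}={value}" for a in alternatives))
    return path + (f"[{' AND '.join(predicates)}]" if predicates else "")


print(rewrite("pdf yesterday"))
# //*.pdf[modified=yesterday() OR received=yesterday()]
```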
iTrail Query Processing
1. Matching
2. Transforming
3. Merging
iTrail Query Processing Example
Given query:
Q1 = //home/projects//*[“Mike”]
Trail:
Ψ8 := //home/*.name -> //calendar//*.tuple.category
Resulting query:
Q1{Ψ8} = //home/projects//*[“Mike”] ∪ //calendar//*[category=“project”]//*[“Mike”]
• Utilizing G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. In PODS, 2002.
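The three processing steps (matching, transforming, merging) can be traced on this example with a rough sketch. It works on query strings with a regular expression, which is only a stand-in: the slide indicates that the real matching builds on XPath containment. The function name is hypothetical. Because the sketch copies the matched name verbatim, it produces category="projects", whereas the slide writes “project”.

```python
import re


def apply_psi8(query):
    """Apply a string-level stand-in for the trail
    //home/*.name -> //calendar//*.tuple.category to one query."""
    # 1. Matching: does the query address a named child of //home?
    match = re.match(r"//home/(\w+)(.*)", query)
    if match is None:
        return query  # the trail does not match; the query is left unchanged
    name, rest = match.groups()

    # 2. Transforming: the matched name becomes a category value in the calendar.
    transformed = f'//calendar//*[category="{name}"]{rest}'

    # 3. Merging: the result is the union of the original and transformed query.
    return f"{query} U {transformed}"


print(apply_psi8('//home/projects//*["Mike"]'))
```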
Applying Multiple Trails
• MMCA (Multiple Match Colouring Algorithm)
  – Without a restriction, a trail could be applied infinitely often
  – To prevent infinite recursion, a trail must not be rematched to nodes in a logical plan that it generated itself
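The colouring rule can be illustrated with a much simplified sketch. It is not the MMCA from the paper: plan nodes are reduced to keywords, trails to keyword pairs, and all names and sample trails are invented. It only shows the rule stated on the slide, namely that every generated node remembers which trails produced it and is never rematched by those trails.

```python
def expand_with_colours(terms, trails):
    """terms: iterable of keywords; trails: dict trail_id -> (left, right)."""
    # A plan node is (keyword, colours); colours = ids of trails that produced it.
    plan = {(term, frozenset()) for term in terms}
    frontier = set(plan)
    while frontier:
        new_nodes = set()
        for keyword, colours in frontier:
            for trail_id, (left, right) in trails.items():
                # Colouring rule: a trail is not rematched to a node it helped create.
                if keyword == left and trail_id not in colours:
                    new_nodes.add((right, colours | {trail_id}))
        # Colour sets only grow and are bounded by the number of trails,
        # so the loop terminates even when trails form a cycle.
        frontier = new_nodes - plan
        plan |= frontier
    return sorted({keyword for keyword, _ in plan})


# Hypothetical trails; t1 and t2 form a cycle that would otherwise never stop.
TRAILS = {
    "t1": ("car", "auto"),
    "t2": ("auto", "car"),
    "t3": ("auto", "vehicle"),
}

print(expand_with_colours(["car"], TRAILS))  # ['auto', 'car', 'vehicle']
```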
Other Issues
• Trail Pruning
  – Problem: MMCA is exponential in number of levels
  – Solution: Trail Pruning
    – Prune by number of levels
    – Prune by top-K trails matched in each level (give weight and prob. to trails)
    – Prune by both top-K trails and number of levels
• Trail Indexing
  – Precompute trail expressions in order to speed up query processing
  – Trail materialization
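The two pruning strategies can be combined as in the following sketch. The trail records, their weights, and the function names are made up for illustration. The slide mentions both weights and probabilities, but their exact ranking formula is not given here, so the sketch ranks by a single weight only.

```python
import heapq


def prune_top_k(matched_trails, k):
    """Keep the k matched trails with the highest weight."""
    return heapq.nlargest(k, matched_trails, key=lambda t: t["weight"])


def expand(query_terms, trails, k, max_levels):
    """Expand keywords level by level, pruning by top-K and by number of levels."""
    terms = set(query_terms)
    for _ in range(max_levels):                   # prune by number of levels
        matched = [t for t in trails
                   if t["left"] in terms and t["right"] not in terms]
        kept = prune_top_k(matched, k)            # prune by top-K per level
        if not kept:
            break
        terms |= {t["right"] for t in kept}
    return sorted(terms)


# Hypothetical weighted trails.
TRAILS = [
    {"left": "pdf", "right": "document", "weight": 0.9},
    {"left": "pdf", "right": "ebook", "weight": 0.4},
    {"left": "document", "right": "report", "weight": 0.7},
    {"left": "report", "right": "memo", "weight": 0.6},
]

print(expand(["pdf"], TRAILS, k=1, max_levels=2))  # ['document', 'pdf', 'report']
print(expand(["pdf"], TRAILS, k=2, max_levels=1))  # ['document', 'ebook', 'pdf']
```

Smaller values of k and max_levels bound the number of rewrites per query, at the cost of ignoring lower-weighted trails.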
Experiments
• Setting
  – Configured iMeMex to act in three modes
    – Baseline: Graph / IR search engine
    – iTrails: Rewrite search queries with trails
    – Perfect Query: Semantics-aware query
  – Data
Experiment, Quality
• Compare with baseline
Experiment, Overhead
• Compare with perfect query
  – Overhead is not negligible
  – However, this can be fixed by exploiting trail materializations
Experiment, Scalability #1
• Rewrite Time
  – Query-rewrite time can be controlled with pruning
Experiment, Scalability #2
• Quality
  – Pruning improves precision
Conclusion
• Our Contributions
  – iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  – We propose a framework and algorithms for Pay-as-you-go Information Integration
  – Smooth transition between search and data integration
• Future Work
  – Trail Creation
    – Use collections (ontologies, thesauri, Wikipedia)
    – Work on automatic mining of trails from the dataspace
  – Other types of trails