iTrails: Pay-as-you-go Information Integration in Dataspaces
Presented By Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
2008-02-22
Summarized by Sungchan Park
Problem: Querying Several Sources
Center for E-Business Technology
Copyright 2008 by CEBT
Solution #1: Use a Search Engine
Solution #2: Use an Information Integration System
iTrails Core Idea
Is there an integration solution in between these two extremes?
Declaratively add lightweight ‘hints’ to a search engine, thus allowing gradual enrichment of loosely integrated data sources
Example Scenario
Query
“pdf yesterday”
Hints (Trails)
1. The date attribute is mapped to the modified attribute
2. The date attribute is mapped to the received attribute
3. The yesterday keyword is mapped to a query for values of the date attribute equal to yesterday's date
4. The pdf keyword is mapped to a query for elements whose names end in pdf
Where do hints come from?
Given by the user
Explicitly
Via Relevance Feedback
(Semi-)Automatically
Information extraction techniques
Automatic schema matching
Ontologies and thesauri (e.g., WordNet)
User communities (e.g., trails on gene data, bookmarks)
All these aspects are beyond the scope of this paper
Data and Query Model
Data Model
Assume that all data is represented by a logical graph G
A query is also represented by a graph
Query Syntax
Query Example
“//Home/projects//*[“Mike”]”
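A minimal sketch of how a query like the one above could be evaluated over a small graph. The `Node` class, the `(axis, name)` step encoding, and the sample data are assumptions made for illustration, not the iMeMex data model; the keyword predicate is simplified to a substring test on node content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical graph node: a name, some text content, and child nodes."""
    name: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def descendants(node):
    """Yield every node strictly below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def evaluate(root, steps, keyword=None):
    """Evaluate (axis, name) steps; axis is "/" (child) or "//" (descendant)."""
    current = [Node("", children=[root])]  # virtual root above the data
    for axis, name in steps:
        selected, seen = [], set()
        for node in current:
            candidates = node.children if axis == "/" else descendants(node)
            for cand in candidates:
                if (name == "*" or cand.name == name) and id(cand) not in seen:
                    seen.add(id(cand))
                    selected.append(cand)
        current = selected
    if keyword is not None:
        # Simplified keyword predicate: substring match on the content.
        current = [n for n in current if keyword in n.content]
    return current


home = Node("Home", children=[
    Node("projects", children=[
        Node("PIM", children=[Node("report.tex", "Mike wrote the intro")]),
        Node("notes.txt", "call Anna"),
    ]),
    Node("photos", children=[Node("mike.jpg")]),
])

# //Home/projects//*["Mike"]
query = [("//", "Home"), ("/", "projects"), ("//", "*")]
matches = evaluate(home, query, keyword="Mike")
print([n.name for n in matches])
```

Only nodes below `projects` whose content mentions the keyword are returned; `mike.jpg` is outside the path and is never considered.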
Basic Form of a Trail
A unidirectional trail
A bidirectional trail
Trail Example
Trails in an example scenario
Trails
Given query
– “pdf yesterday”
Transformed query
– “//*.pdf[modified=yesterday() OR received=yesterday()]”
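The transformation above can be sketched as a small string rewriter. The three dictionaries are a hypothetical encoding of the four example trails, and `rewrite` is an illustrative helper, not the iTrails rewriting algorithm; it only reproduces the transformed query shown here.

```python
# Hypothetical encoding of the four example trails (names are illustrative).
KEYWORD_TO_PATH = {"pdf": "//*.pdf"}                            # trail 4
KEYWORD_TO_PREDICATE = {"yesterday": ("date", "yesterday()")}   # trail 3
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}           # trails 1 and 2


def rewrite(keyword_query):
    """Rewrite a keyword query into a path query using the trails above."""
    path = "//*"
    predicates = []
    for kw in keyword_query.split():
        if kw in KEYWORD_TO_PATH:
            path = KEYWORD_TO_PATH[kw]
        elif kw in KEYWORD_TO_PREDICATE:
            attr, value = KEYWORD_TO_PREDICATE[kw]
            # Follow attribute trails (date -> modified, date -> received).
            targets = ATTRIBUTE_TRAILS.get(attr, [attr])
            predicates.append(" OR ".join(f"{t}={value}" for t in targets))
        else:
            # No trail matches: keep the keyword as a plain keyword predicate.
            predicates.append(f'"{kw}"')
    if not predicates:
        return path
    return f"{path}[{' AND '.join(predicates)}]"


print(rewrite("pdf yesterday"))
```

Keywords without a matching trail are left as ordinary keyword predicates, so the rewriter degrades to plain search when no hints apply.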
iTrails Query Processing
1. Matching
2. Transforming
3. Merging
iTrails Query Processing Example
Given Query
Q1 = //home/projects//*[“Mike”]
Trail
Ψ8 := //home/*.name -> //calendar//*.tuple.category
Resulting Query
Q1{Ψ8} = //home/projects//*[“Mike”] U //calendar//*[category=“project”]//*[“Mike”]
Utilizing G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. In PODS, 2002.
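The three steps (matching, transforming, merging) can be sketched on query strings as follows. This is an illustration under strong assumptions: the trail is hard-coded, and matching uses a regular expression as a stand-in for real containment checking. The sketch binds the matched element name verbatim, so it emits category="projects" where the slide writes “project”.

```python
import re

# Stand-in for the left side of the trail (//home/*.name): a regex that
# captures the element name directly under //home and the rest of the query.
TRAIL_LHS = re.compile(r'//home/(?P<name>[^/\[]+)(?P<rest>.*)$')


def match(query):
    """Step 1, matching: return the bindings if the trail applies, else None."""
    m = TRAIL_LHS.match(query)
    return m.groupdict() if m else None


def transform(binding):
    """Step 2, transforming: instantiate the right side with the bound name."""
    return f'//calendar//*[category="{binding["name"]}"]{binding["rest"]}'


def merge(original, transformed):
    """Step 3, merging: union the original query with the transformed one."""
    return f"{original} U {transformed}"


def apply_trail(query):
    binding = match(query)
    if binding is None:
        return query  # trail does not match; query is left unchanged
    return merge(query, transform(binding))


print(apply_trail('//home/projects//*["Mike"]'))
```

Because merging keeps the original query in the union, applying a trail can only add results; it never removes what plain search would have found.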
Applying Multiple Trails
MMCA (Multiple Match Colouring Algorithm)
Without a safeguard, trails can be applied recursively without end
To prevent infinite recursion, a trail should not be rematched to nodes in a logical plan generated by itself
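The colouring rule can be sketched as follows, assuming a heavily simplified setting: plain keywords stand in for logical plan nodes, and each trail is a hypothetical `(lhs, rhs)` keyword pair. Each derived term carries the set of trails that produced it (its colours), and a trail is never rematched to a term carrying its own colour.

```python
def expand(term, trails):
    """Apply trails level by level until no trail matches any new term.

    `trails` maps a trail id to an (lhs, rhs) keyword pair (illustrative only).
    """
    frontier = [(term, frozenset())]  # (term, colours of trails that made it)
    results = [term]
    while frontier:
        next_frontier = []
        for current, colours in frontier:
            for trail_id, (lhs, rhs) in trails.items():
                # Colouring rule: skip trails that already generated this term.
                if current == lhs and trail_id not in colours:
                    next_frontier.append((rhs, colours | {trail_id}))
                    if rhs not in results:
                        results.append(rhs)
        frontier = next_frontier
    return results


# Two trails that rewrite into each other would loop forever without colours.
synonyms = {"t1": ("car", "auto"), "t2": ("auto", "car")}
print(expand("car", synonyms))
```

Termination follows because every rewrite adds one colour to the set and the set is bounded by the number of trails.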
Other Issues
Trail Pruning
Problem: MMCA is exponential in the number of levels
Solution: Trail Pruning
– Prune by number of levels
– Prune by top-K trails matched in each level (give a weight and probability to each trail)
– Prune by both top-K trails and number of levels
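A sketch of combining both pruning strategies, under the same simplification of keywords standing in for plan nodes. Each trail is a hypothetical `(lhs, rhs, score)` triple; how the score is derived from a trail's weight and probability is not specified here, so it is treated as a single given number.

```python
import heapq


def expand_pruned(term, trails, k=2, max_levels=2):
    """Expand `term`, keeping at most `k` matches per level for `max_levels` levels."""
    frontier = [term]
    results = [term]
    for _ in range(max_levels):              # prune by number of levels
        matches = [(rhs, score)
                   for current in frontier
                   for (lhs, rhs, score) in trails
                   if lhs == current]
        # Prune by top-K: keep only the k highest-scoring matches of this level.
        kept = heapq.nlargest(k, matches, key=lambda m: m[1])
        frontier = []
        for rhs, _score in kept:
            if rhs not in results:
                results.append(rhs)
                frontier.append(rhs)
    return results


scored_trails = [
    ("paper", "pdf", 0.9),
    ("paper", "article", 0.6),
    ("paper", "draft", 0.2),   # dropped by top-K with k=2
    ("pdf", "ps", 0.5),
    ("ps", "dvi", 0.4),        # never reached with max_levels=2
]
print(expand_pruned("paper", scored_trails, k=2, max_levels=2))
```

Both parameters bound the work per query: `k` caps the fan-out of each level and `max_levels` caps the depth.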
Trail Indexing
Precompute trail expressions in order to speed up query processing
Trail materialization
Experiments
Setting
Configured iMeMex to act in three modes
– Baseline: Graph / IR search engine
– iTrails: Rewrite search queries with trails
– Perfect Query: Semantics-aware query
Data
Experiment, Quality
Compare with baseline
Experiment, Overhead
Compare with perfect query
Overhead is not negligible
However, this can be fixed by exploiting trail materializations
Experiment, Scalability #1
Rewrite Time
Query-rewrite time can be controlled with pruning
Experiment, Scalability #2
Quality
Pruning improves precision
Conclusion
Our Contributions
iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
We propose a framework and algorithms for Pay-as-you-go Information Integration
Smooth transition between search and data integration
Future Work
Trail Creation
– Use collections (ontologies, thesauri, Wikipedia)
– Work on automatic mining of trails from the dataspace
Other types of trails