Part 2 - PPT

Download Report

Transcript Part 2 - PPT

RDF languages and storages
part 2 - indexing semi-structure data
Maciej Janik
Conrad Ibanez
CSCI 8350, Fall 2004
Outline


Jena storage
Indexing techniques
Jena






Implemented in Java
One of the most popularly used RDF
storages and query engines
Supports RDF, RDFS and OWL
In memory and persistent storage
(Oracle, MySQL, PostgreSQL)
RDQL
Reasoning/inference engine
Jena - storage schema

Previous version used normalized relational
DB tables





statements
literals
resources
Taken approach to store triples as (Subject,
Predicate, Object) in denormalized tables
Optimization for common statement patterns
- grouping of properties
Jena - storage

Normalized tables

Denormalized
„Efficient RDF Storage and Retrieval in Jena2” - Wilkinson et al.
Jena - storage



Do certain trade-off
for space and search
time
Cluster properties
that are likely to be
accessed together optimize for
common patterns
Special treatment of
reified statements
Jena - graph abstraction



Graph interface is
separated from
(persistent) triple
storage layer
Special support for
different types of
graphs - optimized for
performance
Support operations like
add, delete, find.
Jena - query processing




Converting multiple patterns in query into one
query to DB
Use DB query optimizer instead of executing
multiple queries from Jena level (as it was in
Jena1)
Associate a table with pattern (best) or span
pattern between tables (requires join operation)
Query may span between different graphs, but it
can be optimized only if they are in the same
database
What to index? How to index?
Indexing semistructured data


XML cannot be indexed directly as
relational DB
Indexing may take advantage of tree
structure




depth of node
common path from the root
convert each path to string expression
precalculate the path tree
Indexing semistructured data



Idea is based on
Particia’s trie
Index should scale
with the growth of
data
Path together with leaf
is encoded into string
-> the Index Fabric
„A Fast Index for Semistructured Data” - Brian F. Cooper et al.
A Layered Index
„A Fast Index for Semistructured Data” - Brian F. Cooper et al.
Index Fabric


Index is used to accelerate path expressions - mainly
for queries that ask for root-to-leaf path
Idea of prefix encoding






xml: <A>alpha<B>beta<C>gamma</C></B></A>
paths: <A>alpha ; <A><B>beta ; <A><B><C>gamma
encoded: A alpha ; A B beta ; A B C gamma
infix (not common): A alpha B beta C gamma
Convert path to string for fast searches
Replace tags with ‘non-terminal’ characters (like in
automata)
Index Fabric - raw paths
„A Fast Index for Semistructured Data” - Brian F. Cooper et al.
Graphs - how to index?
Backbone
http://www.aisee.com/
Graphs - how to index?
Tree-type
- prefixes
- tries
http://www.aisee.com/
Graphs - how to index?
T-index
Path templates
2-index
1-index
„Index Structure for Path Expressions” - Tova Milo, Dan Suciu
Graphs - how to index?
Landmarks
http://www.aisee.com/
Indexing - summary

Indexing semistructure data







index fabric - encoding, multilayered
common prefixes - trie structure
backbone - highways between points
landmarks - county division
path templates - precalculated expressions
clustering - grouping by theme access
Indexing such data is NOT easy, solution
depends how you want to search the graph
References



„Efficient RDF Storage and Retrieval in Jena2”
- Kevin Wilkinson, Craig Sayers, Harumi Kuno,
Dave Reynolds
„A Fast Index for Semistructured Data” Brian F. Cooper, Neal Sample, Michael J.
Franklin, Gisli Hjaltason, Moshe Shadmon
„Index Structures for Path Expressions” Tova Milo, Dan Suciu