Storing, Indexing and Querying Large Provenance
Data Sets as RDF Graphs in Apache HBase
Artem Chebotko
Joint work with
John Abraham and Pearl Brazier
University of Texas – Pan American
Anthony Piazza
Piazza Consulting
Andrey Kashlev and Shiyong Lu
Wayne State University
7th IEEE International Workshop on Scientific Workflows, July 2, 2013
1
Provenance in eScience
Metadata that captures history of an experiment
Problem diagnosis
Result interpretation
Experiment reproducibility
Scientific Workflow Community Provenance Challenges
2006: understanding and sharing information about
provenance representations and capabilities
2006: interoperability of different provenance systems
2009: evaluating various aspects of OPM
2010: showcase OPM in the context of novel applications
Open Provenance Model (2007 - 2010)
PROV-DM: The PROV Data Model (W3C
Recommendation 30 April 2013)
2
SWFMS and Provenance
Support provenance collection
Use proprietary or third-party systems to manage
provenance
Differ in provenance models, provenance vocabularies,
inference support, and query languages.
May eventually converge to W3C PROV specifications
Examples: Taverna, Galaxy, Kepler, Triana, View, OPMProv, VisTrails, Karma, Pegasus, RDFProv, Swift, etc.
3
Sample OPM Provenance Graph
Nodes: artifacts, processes, agents
Edges: used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom
[Figure: sample OPM graph with nodes labeled Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Create Database Schema, Schema, Dataset, Load Data, Instance]
4
Sample Graph Serialization: OPMV and Terse RDF Triple Language
utpb:schema rdf:type opmv:Artifact .
utpb:instance rdf:type opmv:Artifact .
utpb:dataset rdf:type opmv:Artifact .
utpb:loadData rdf:type opmv:Process .
utpb:loadData opmv:used utpb:schema,
                        utpb:dataset .
utpb:instance opmv:wasGeneratedBy utpb:loadData .
utpb:instance opmv:wasDerivedFrom utpb:schema,
                                  utpb:dataset .
5
Provenance Serialization and Querying
Both OPM and PROV-DM can be serialized in RDF
Queried in SPARQL
Find all artifacts and their values, if any, in a provenance graph
with identifier http://cs.panam.edu/utpb#opmGraph
6
This Work - Motivation
Single provenance graph as an RDF graph
In general, readily manageable in main memory of a single
machine
Hundreds of thousands or even millions of provenance
graphs as a provenance (RDF) dataset
Challenging to manage
Our Focus/Problem: Efficient and scalable storage and
querying of large collections of provenance graphs
serialized as RDF graphs (in an Apache HBase database)
7
This Work - Contributions
Novel storage and indexing schemes for RDF data in
HBase that are suitable for provenance datasets
Novel and efficient querying algorithms to evaluate
SPARQL queries in HBase that are optimized to make use
of bitmap indices and numeric values instead of triples
Empirical evaluation of our approach using provenance
graphs and test queries of the University of Texas
Provenance Benchmark (UTPB)
8
Talk Outline
RDF Data and Queries
Indexing Scheme
Storage Scheme
Query Processing
Performance Study
Related Work
Summary and Future work
9
RDF Data and Queries
10
Indexing Scheme
Selection Indices: Is, Ip, Io
Find a triple with known s, p and o:
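The lookup expression from this slide did not survive in the transcript. As a minimal sketch only, assume each selection index maps a term to a bitmap in which bit i is set when the triple at position i has that term as its subject (Is), predicate (Ip) or object (Io); a fully bound lookup then intersects three bitmaps. Python integers stand in for bitmaps, and the names build_selection_indices, positions and find_triple are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def build_selection_indices(triples):
    """Build Is, Ip, Io: term -> bitmap (Python int) of triple positions."""
    Is, Ip, Io = defaultdict(int), defaultdict(int), defaultdict(int)
    for i, (s, p, o) in enumerate(triples):
        Is[s] |= 1 << i
        Ip[p] |= 1 << i
        Io[o] |= 1 << i
    return Is, Ip, Io


def positions(bitmap):
    """Return the positions of the set bits in ascending order."""
    out, i = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(i)
        bitmap >>= 1
        i += 1
    return out


def find_triple(indices, s, p, o):
    """Positions of triples matching s, p and o: one AND per index."""
    Is, Ip, Io = indices
    return positions(Is.get(s, 0) & Ip.get(p, 0) & Io.get(o, 0))


# Triples of the sample graph; the position in the list is the triple's number.
triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]
indices = build_selection_indices(triples)
print(find_triple(indices, "utpb:loadData", "opmv:used", "utpb:schema"))  # [4]
```

Leaving one of the three bitmaps out of the intersection would answer a lookup with that position unbound, which is presumably why the three indices are kept separately.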
12
Indexing Scheme
Join Indices: Iss, Iso, Ios, Ioo
Find triples with the same object as subject in triple at
position i:
Iso(i)
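A minimal sketch of the join indices, assuming that Ixy(i) is a bitmap of the positions j whose role y (subject or object) equals role x of the triple at position i, which is how the Iso(i) description above reads. The quadratic construction, the inclusion of position i itself in Iss(i) and Ioo(i), and the names build_join_indices and bits are assumptions made for illustration.

```python
def build_join_indices(triples):
    """Return {"Iss", "Iso", "Ios", "Ioo"} -> list of bitmaps, one per position i."""
    n = len(triples)
    idx = {name: [0] * n for name in ("Iss", "Iso", "Ios", "Ioo")}
    for i, (si, _, oi) in enumerate(triples):
        for j, (sj, _, oj) in enumerate(triples):
            bit = 1 << j
            if sj == si:
                idx["Iss"][i] |= bit  # same subject as the subject of triple i
            if oj == si:
                idx["Iso"][i] |= bit  # object equals the subject of triple i
            if sj == oi:
                idx["Ios"][i] |= bit  # subject equals the object of triple i
            if oj == oi:
                idx["Ioo"][i] |= bit  # same object as the object of triple i
    return idx


def bits(bitmap, n):
    """Decode a bitmap into a list of positions below n."""
    return [j for j in range(n) if (bitmap >> j) & 1]


triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
]
join = build_join_indices(triples)
# Triples whose object is the subject of triple 1 (utpb:loadData):
print(bits(join["Iso"][1], len(triples)))  # [2]
```

With such bitmaps precomputed, a join between two triple patterns reduces to reading one bitmap per candidate position rather than comparing terms.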
13
Storage Scheme
One table with two column families for data and indices
Each row stores one complete provenance graph
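The layout can be pictured with a plain dictionary standing in for the HBase table. This is a sketch under stated assumptions: the family names "data" and "index", the column qualifiers, and the helper to_row are all hypothetical; the slide only states that there is one table, two column families, and one row per provenance graph.

```python
def to_row(graph_id, triples):
    """Pack one provenance graph into a single row with two column families."""
    row = {"data": {}, "index": {}}
    for i, (s, p, o) in enumerate(triples):
        # Data family: one column per triple, keyed by its position.
        row["data"][str(i)] = " ".join((s, p, o))
        # Index family: one bitmap per (index name, term) pair.
        for name, term in (("Is", s), ("Ip", p), ("Io", o)):
            key = f"{name}:{term}"
            row["index"][key] = row["index"].get(key, 0) | (1 << i)
    return graph_id, row


triples = [
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
]
table = {}
key, row = to_row("http://cs.panam.edu/utpb#opmGraph", triples)
table[key] = row  # the row key is the graph identifier
```

Keeping a graph and its indices in the same row means a query over one graph needs a single row read, which fits the minimal data transfers mentioned in the performance discussion.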
14
Query Processing
Four efficient algorithms/functions:
application of selection indices
application of join indices
handling of special cases not supported by the indices
basic graph pattern evaluation
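To make the last item concrete, here is a naive reference sketch of basic graph pattern evaluation over one graph. It is not the index-based algorithm from the paper: candidate triples for each pattern are collected into a bitmap by a linear scan, and bindings are extended with a nested loop where the paper would use its join indices. The function names and the convention that variables start with "?" are assumptions.

```python
def match_pattern(triples, pattern):
    """Bitmap of positions whose triple agrees with the constants in pattern."""
    bitmap = 0
    for i, triple in enumerate(triples):
        if all(q.startswith("?") or q == t for q, t in zip(pattern, triple)):
            bitmap |= 1 << i
    return bitmap


def evaluate_bgp(triples, patterns):
    """Return a list of variable bindings (dicts) satisfying every pattern."""
    solutions = [{}]
    for pattern in patterns:
        candidates = match_pattern(triples, pattern)
        extended = []
        for binding in solutions:
            for i, triple in enumerate(triples):
                if not (candidates >> i) & 1:
                    continue
                new = dict(binding)
                # A variable already bound must keep its earlier value.
                if all(
                    new.setdefault(q, t) == t
                    for q, t in zip(pattern, triple)
                    if q.startswith("?")
                ):
                    extended.append(new)
        solutions = extended
    return solutions


triples = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
]
# Which artifacts were used, and by which process?
result = evaluate_bgp(
    triples,
    [("?a", "rdf:type", "opmv:Artifact"), ("?p", "opmv:used", "?a")],
)
```

Here result holds two bindings, one for utpb:schema and one for utpb:dataset, each paired with utpb:loadData; utpb:instance is dropped because no triple uses it.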
15
Performance Study
Implementation
Java, Hadoop 1.0.0, HBase 0.94
Cluster setup
One HBase Master
Eight HBase Region Servers
All commodity machines
Benchmark – UTPB (5 datasets, 11 queries)
22
Performance Study
Q1 – simplest, yet most expensive query due to a large
result set
Q1. Find all provenance graph identifiers.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE { ?graph rdf:type owl:Thing . }
23
Performance Study
Q2 – Q11 – different complexity, yet similar performance
Example: Q8. Find all artifacts and their values, if any, in a
particular provenance graph.
PREFIX opmv: <http://purl.org/net/opmv/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX opmo: <http://openprovenance.org/model/opmo#>
PREFIX utpb: <http://cs.panam.edu/utpb#>
SELECT ?artifact ?value
FROM NAMED <http://cs.panam.edu/utpb#opmGraph>
WHERE {
  GRAPH utpb:opmGraph {
    ?artifact rdf:type opmv:Artifact .
    OPTIONAL { ?artifact opmo:annotation ?annotation .
               ?annotation opmo:property ?property .
               ?property opmo:value ?value . } .
    OPTIONAL { ?artifact opmo:avalue ?artifactValue .
               ?artifactValue opmo:content ?value . } .
  }
}
24
Performance Study
Please see the paper for the other queries: they are very efficient and
scalable (query time stays nearly constant as the dataset grows, due to
minimal data transfers and fast index-based join processing)
25
Related Work
HBase, BigTable, Cassandra
Hadoop, Hive, Pig, CouchDB, MongoDB, etc.
NoSQL solutions to RDF data management
Provenance management systems
RDF data indexing
26
Summary and Future Work
Designed novel storage and indexing schemes for RDF
data in HBase that are suitable for provenance datasets
Empirical evaluation results are promising
Future work
Compare, compare, compare
More experiments with multi-user workloads
More optimizations
PROV-DM benchmark anyone?
27
THANK YOU! Questions?
My contact information:
Artem Chebotko, Department of Computer Science,
University of Texas – Pan American
http://www.cs.panam.edu/~artem
28