Managing Large RDF Graphs - The University of Texas at Dallas
Download
Report
Transcript Managing Large RDF Graphs - The University of Texas at Dallas
Managing Large RDF Graphs
Vaibhav Khadilkar
Dr. Bhavani Thuraisingham
Department of Computer Science,
The University of Texas at Dallas
December 2008
Introduction
The Provost for the University of Texas at Dallas, Dr. B.
Hobson Wildenthal, in conjunction with the Vice President
for Research and Development, Dr. Bruce Gnade made a
commitment on becoming a leader in emerging technologies
recognizing that the university did not want to compete in
legacy technologies.
After a detailed analysis and examination of unsolved
problems the university committed to the Semantic Web and
Cloud Computing as research areas. This was vetted through
a large number of government and industrial clients. This
resulted in the creation of the Semantic Web Lab.
Our Projects on Semantic Web
Confidentiality, Privacy and Trust for the Semantic Web
Texas Enterprise Funds, 2005; NSF 2007
Building Geospatial Semantic Web
Raytheon, 2006; NGA, 2007
Blackbook Experimentation
Texas Enterprise Funds, 2007
Ontology Mining – part of Text mining project
NASA 2007
Assured Information Sharing
AFOSR MURI, 2008
Managing Large RDF Graphs and Ontology
Homogenization
IARPA, 2008
Managing Large RDF Graphs
Managing Large RDF Graphs
Current problems
Semantic web does not scale
Hinders ability to do reasoning and large graph
processing
Current work focuses on load balancing and fault
tolerance, but the big bottleneck is memory
Current systems can be broken with even 100,000
triples
We work on load balancing and polynomial reasoning
but memory management breaks the systems even
before any of the other problems can be addressed
Managing Large RDF Graphs
Solution History
To solve this problem we only look at history
In the 1960’s Dijkstra invented the multiprocess
operating system
This gave us general purpose resource management
for files and memory
In the 1970’s efforts were directed to taking the
general purpose OS and placing database applications
on top of them
The drawback was that these systems did not scale
In the 1980’s Robert Epstein and
Michael Stonebreaker from
UC Berkley defined specific
algorithms for database processing
like LRU/MRU
These principles are accepted as a
solved solution space resulting in ORACLE,
MySQL and others
Managing Large RDF Graphs
Solution History
Mem Mgt
In 2001 we started with the Semantic
LRU/MRU
Web
Oracle, HP and others tried to apply
database algorithms to graph processing
C
We worked to expand resource
A
B
management to use specific graph
algorithms
The solution is constructed so that
memory is boundless (infinite
graph) with deterministic reads
that are an order of magnitude
slower than pure memory
solutions
Managing Large RDF Graphs
Relevance of problem
This was an unsolved problem
Critical in handling terabytes of data relevant in
today’s times
Virtualize from memory space to disk space
Managing Large RDF Graphs
Tools Used
Jena
An open source Semantic Web framework used to
build and manipulate large RDF graphs
Also gives the capability to handle RDFS and OWL
Provides a query language SPARQL and a rule
based inference engine
Developed by HP Labs
Can represent RDF graphs as a model
Managing Large RDF Graphs
Tools Used
Lucene
Lucene is a Java based text search engine library
Is suitable for any application and is platform independent
Does indexing and retrieval in a few milliseconds across
terabytes of data
MySQL
An open source RDBMS used with the various database
representations in Jena (RDB, SDB, and, TDB)
An easy to use alternative compared to other RDBMS’s
Managing Large RDF Graphs
In-memory Jena Model
This solution formed the basis of the solution that we
will use for the RDB problem
As nodes are added to the in-memory graph, memory
fills up
Therefore we can handle medium sized graphs
After a certain point when memory is full we get an out
of memory exception stopping program execution
We want to solve this out of memory problem
Managing Large RDF Graphs
Memory Management Algorithm
Graph representation
http://www.johnSmith.com author
http://www.johnSmith.com/paper1
Age
35
Phone
123-456-7890
Journal Society
Journal Name
ACM Society Semantic Web Journal
Time
Managing Large RDF Graphs
Memory Management Algorithm
Graph Representation
The graph is constructed in Jena by specifying nodes and their
properties.
Triples are added in a monotonically increasing fashion.
Nodes may be accessed at any time (this is a key point in the
algorithm)
Data structure used in the algorithm
Create an in-memory LRU based cache
For each node in the graph store an index number, a timestamp
value for when it was last accessed, and, the number of
connections for that node
Each time the node is accessed or a triple added, update the
associated cache entry
This structure will be used to determine the candidate node that
will be written to disk
Managing Large RDF Graphs
Memory Management Algorithm
Algorithm
We use the LIMIT clause in MySQL to get back
only a part of the results at a time
The triples retrieved are added to the revised inmemory Jena model
This leverages the memory management algorithm
for the in-memory model
Since the revised in-memory model never runs out
of memory this RDB solution does not run out of
memory
Managing Large RDF Graphs
Conclusions from In-Memory Jena Model
As threshold increases the time required for the
calculations reduces
As the memory size increases the time needed for the
calculations increases since more triples can be stored
in memory
A node in memory takes about 35 ms whereas one
cached to lucene takes about 300ms
The goal is for usage patterns to pull from memory.
Managing Large RDF Graphs
Conclusions from the RDF Jena Model
Database creation times are almost the same as with the original
Jena implementation
Database querying times vary depending upon the threshold value
set in the algorithm
General Conclusions
Implemented an in-memory based LRU/Connectivity memory
management algorithm
Solves the in-memory and RDB based models in Jena by creating
an infinite memory impression for the user
Managing Large RDF Graphs
Future Work
Implement the memory management algorithms
for cloud computing
Generalize the algorithm for all models
Try various other memory management
algorithms which effect usage