Managing Large RDF Graphs - The University of Texas at Dallas

Download Report

Transcript Managing Large RDF Graphs - The University of Texas at Dallas

Managing Large RDF Graphs
Vaibhav Khadilkar
Dr. Bhavani Thuraisingham
Department of Computer Science,
The University of Texas at Dallas
December 2008
Introduction
The Provost for the University of Texas at Dallas, Dr. B.
Hobson Wildenthal, in conjunction with the Vice President
for Research and Development, Dr. Bruce Gnade made a
commitment on becoming a leader in emerging technologies
recognizing that the university did not want to compete in
legacy technologies.
After a detailed analysis and examination of unsolved
problems the university committed to the Semantic Web and
Cloud Computing as research areas. This was vetted through
a large number of government and industrial clients. This
resulted in the creation of the Semantic Web Lab.
Our Projects on Semantic Web
 Confidentiality, Privacy and Trust for the Semantic Web
 Texas Enterprise Funds, 2005; NSF 2007
 Building Geospatial Semantic Web
 Raytheon, 2006; NGA, 2007
 Blackbook Experimentation
 Texas Enterprise Funds, 2007
 Ontology Mining – part of Text mining project
 NASA 2007
 Assured Information Sharing
 AFOSR MURI, 2008
 Managing Large RDF Graphs and Ontology
Homogenization
 IARPA, 2008
Managing Large RDF Graphs
Managing Large RDF Graphs
 Current problems
 Semantic web does not scale
 Hinders ability to do reasoning and large graph
processing
 Current work focuses on load balancing and fault
tolerance, but the big bottleneck is memory
 Current systems can be broken with even 100,000
triples
 We work on load balancing and polynomial reasoning
but memory management breaks the systems even
before any of the other problems can be addressed
Managing Large RDF Graphs
 Solution History
 To solve this problem we only look at history
 In the 1960’s Dijkstra invented the multiprocess
operating system
 This gave us general purpose resource management
for files and memory
 In the 1970’s efforts were directed to taking the
general purpose OS and placing database applications
on top of them
 The drawback was that these systems did not scale
 In the 1980’s Robert Epstein and
Michael Stonebreaker from
UC Berkley defined specific
algorithms for database processing
like LRU/MRU
 These principles are accepted as a
solved solution space resulting in ORACLE,
MySQL and others
Managing Large RDF Graphs
 Solution History
Mem Mgt
 In 2001 we started with the Semantic
LRU/MRU
Web
 Oracle, HP and others tried to apply
database algorithms to graph processing
C
 We worked to expand resource
A
B
management to use specific graph
algorithms
 The solution is constructed so that
memory is boundless (infinite
graph) with deterministic reads
that are an order of magnitude
slower than pure memory
solutions
Managing Large RDF Graphs
 Relevance of problem
 This was an unsolved problem
 Critical in handling terabytes of data relevant in
today’s times
 Virtualize from memory space to disk space
Managing Large RDF Graphs
 Tools Used
 Jena
 An open source Semantic Web framework used to
build and manipulate large RDF graphs
 Also gives the capability to handle RDFS and OWL
 Provides a query language SPARQL and a rule
based inference engine
 Developed by HP Labs
 Can represent RDF graphs as a model
Managing Large RDF Graphs
 Tools Used
 Lucene
 Lucene is a Java based text search engine library
 Is suitable for any application and is platform independent
 Does indexing and retrieval in a few milliseconds across
terabytes of data
 MySQL
 An open source RDBMS used with the various database
representations in Jena (RDB, SDB, and, TDB)
 An easy to use alternative compared to other RDBMS’s
Managing Large RDF Graphs
 In-memory Jena Model
 This solution formed the basis of the solution that we
will use for the RDB problem
 As nodes are added to the in-memory graph, memory
fills up
 Therefore we can handle medium sized graphs
 After a certain point when memory is full we get an out
of memory exception stopping program execution
 We want to solve this out of memory problem
Managing Large RDF Graphs
 Memory Management Algorithm
 Graph representation
http://www.johnSmith.com author
http://www.johnSmith.com/paper1
Age
35
Phone
123-456-7890
Journal Society
Journal Name
ACM Society Semantic Web Journal
Time
Managing Large RDF Graphs
 Memory Management Algorithm
 Graph Representation
 The graph is constructed in Jena by specifying nodes and their
properties.
 Triples are added in a monotonically increasing fashion.
 Nodes may be accessed at any time (this is a key point in the
algorithm)
 Data structure used in the algorithm
 Create an in-memory LRU based cache
 For each node in the graph store an index number, a timestamp
value for when it was last accessed, and, the number of
connections for that node
 Each time the node is accessed or a triple added, update the
associated cache entry
 This structure will be used to determine the candidate node that
will be written to disk
Managing Large RDF Graphs
 Memory Management Algorithm
 Algorithm
 We use the LIMIT clause in MySQL to get back
only a part of the results at a time
 The triples retrieved are added to the revised inmemory Jena model
 This leverages the memory management algorithm
for the in-memory model
 Since the revised in-memory model never runs out
of memory this RDB solution does not run out of
memory
Managing Large RDF Graphs
 Conclusions from In-Memory Jena Model
 As threshold increases the time required for the
calculations reduces
 As the memory size increases the time needed for the
calculations increases since more triples can be stored
in memory
 A node in memory takes about 35 ms whereas one
cached to lucene takes about 300ms
 The goal is for usage patterns to pull from memory.
Managing Large RDF Graphs
 Conclusions from the RDF Jena Model
 Database creation times are almost the same as with the original
Jena implementation
 Database querying times vary depending upon the threshold value
set in the algorithm
 General Conclusions
 Implemented an in-memory based LRU/Connectivity memory
management algorithm
 Solves the in-memory and RDB based models in Jena by creating
an infinite memory impression for the user
Managing Large RDF Graphs
 Future Work
 Implement the memory management algorithms
for cloud computing
 Generalize the algorithm for all models
 Try various other memory management
algorithms which effect usage