PPT - OpenLink Virtuoso
Linked Data Driven Data Virtualization
for Web-scale Integration
Orri Erling
Program Manager, Virtuoso
© 2009 OpenLink Software, All rights reserved
Situation Analysis
Agility via ad hoc data access has prevailed throughout the
history of IT.
Data volume and heterogeneity are growing exponentially across
Intranets, Extranets, and the Internet
Processing windows remain static (we still only have 24
hrs. in a day for personal and professional activities)
Individual and Enterprise Agility remains totally dependent
on data access, manipulation, and dissemination
Data remains dirty and its context remains necessary for
extracting meaning.
Data Virtualization (in the form of heterogeneous Linked
Data Spaces) remains the only viable way forward.
What is Linked Data?
RDF (Resource Description Framework) Data Model - a
graph model where records take the form of
3-tuples, i.e., subject-predicate-object or entity-attribute-value
RDF Data Serialization Formats - (X)HTML+RDFa, Turtle,
N3, TriX, RDF/XML, and others
RDF Data Item Identity - is HTTP URI based
RDF is inherently schema-last and self-describing
Linked Data - application of the RDF model where record
identifiers, fields, and optionally field values are endowed
with HTTP scheme URIs, whether instance data (ABox) or
data dictionary data (TBox)
Linked Data enables follow-your-nose traversal of RDF
data records where every record identifier, field, or field
value is a data pathway
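The triple model and follow-your-nose traversal described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Virtuoso stores or serves data: the example.org URIs, the sample records, and the helper names (is_uri, describe, follow_your_nose) are all assumptions made for the example, and no URI is actually dereferenced over HTTP.

```python
from collections import deque

# Hypothetical namespaces and data, for illustration only.
EX = "http://example.org/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Each record is a (subject, predicate, object) 3-tuple.
# Literal values are written with surrounding double quotes.
TRIPLES = [
    (EX + "alice", FOAF + "name", '"Alice"'),
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "bob", FOAF + "name", '"Bob"'),
    (EX + "bob", FOAF + "knows", EX + "carol"),
    (EX + "carol", FOAF + "name", '"Carol"'),
]


def is_uri(term):
    """Treat any term that starts with an HTTP(S) scheme as a URI."""
    return term.startswith("http://") or term.startswith("https://")


def describe(subject, triples):
    """Return the (predicate, object) pairs recorded for one subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def follow_your_nose(start, triples, max_hops=2):
    """Breadth-first walk: every URI-valued field is a pathway to another record.

    Returns the URIs reached within max_hops links, in discovery order.
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for _, obj in describe(node, triples):
            if is_uri(obj) and obj not in seen:
                seen.append(obj)
                queue.append((obj, depth + 1))
    return seen
```

Calling follow_your_nose(EX + "alice", TRIPLES) starts at one record and reaches the others purely by following URI-valued fields. On the live Web each hop would be an HTTP GET on the URI rather than a list lookup.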
The Linked Data Landscape
Core vocabularies - common terms facilitate integration:
FOAF for Personal Profile
SIOC for Social Networking
Dublin Core for Bibliography
GoodRelations for eCommerce
Geonames
Domain specific vocabularies for all verticals:
OBO Foundry for biology
DBpedia, OpenCyc, YAGO, SUMO, Geonames, etc. define
URIs for talking about almost any well-known real-world
entity or class of entities.
The Linked Open Data Cloud
What Linked Data Offers for Data Integration
In RDF, all things have a single-part global HTTP-based
identifier: anything can join with anything else through its
URI.
Many people will use a different identifier for the same
thing.
Whether two things can be considered the same depends
on context. owl:sameAs is a generic way of stating identity
co-reference.
Literal values can be tagged by type or
language, allowing explicit representation of units of
measure etc.
RDF triples are contained in Named Graphs.
The graph usually denotes provenance, and it has a URI,
about which further statements can be made
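The points above about co-reference, typed literals, and named graphs can be sketched together. The Python below is an illustrative assumption, not a description of any product's inference engine: it stores quads (triple plus graph URI), represents a typed literal as a (value, datatype URI) pair, and merges descriptions of co-referent identifiers only when the owl:sameAs statement comes from a graph the caller chooses to trust. All example.org URIs, including the kilogram datatype, are made up.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical graph URIs; the graph records where each statement came from.
GRAPH_A = "http://example.org/graph/A"
GRAPH_B = "http://example.org/graph/B"
GRAPH_LINKS = "http://example.org/graph/links"

# (subject, predicate, object, graph) quads. A typed literal is modelled as
# a (lexical value, datatype URI) tuple so the unit of measure is explicit.
QUADS = [
    ("http://example.org/a/widget", "http://example.org/v/weight",
     ("2.5", "http://example.org/unit/kilogram"), GRAPH_A),
    ("http://example.org/b/widget", "http://example.org/v/maker",
     "http://example.org/b/acme", GRAPH_B),
    ("http://example.org/a/widget", OWL_SAME_AS,
     "http://example.org/b/widget", GRAPH_LINKS),
]


def same_as_classes(quads, trusted_graphs):
    """Union-find over owl:sameAs statements found in trusted graphs only."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o, g in quads:
        if p == OWL_SAME_AS and g in trusted_graphs:
            parent[find(s)] = find(o)
    return find


def merged_description(uri, quads, trusted_graphs):
    """Collect (predicate, object, graph) for every identifier co-referent with uri."""
    find = same_as_classes(quads, trusted_graphs)
    root = find(uri)
    return [(p, o, g) for s, p, o, g in quads
            if p != OWL_SAME_AS and find(s) == root]
```

With GRAPH_LINKS trusted, merged_description returns statements from both source graphs for either identifier; with an empty trust set, each identifier keeps only its own statements. Making trust a parameter is one way to express that sameness depends on context.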
RDF vs. Relational
When the data is ragged and highly heterogeneous, with
schema-last needs, use RDF and Linked Data
The more different sources of data, the more you will need
RDF and Linked Data
If data is highly regular and uniform, relational offers higher
performance: application-specific indices etc. are faster
than putting everything in a generic index scheme
Incentives for Publishing
If one is on the web, one is there in order to be found
Publishing data in standard vocabularies allows
applications to mesh data from many Web-addressable
Data Spaces (e.g., pages)
In the end, Linked Data will enhance the end user
experience by adding serendipitous discovery and
increased relevance
Models for Publishing
Linked data is usually published in large dumps which have
a release cycle
Any relational database's contents can be published as
linked data through generating RDF on demand via a
relational to RDF schema mapping
Whether one generates RDF on demand or ETLs RDBs as
RDF depends on use case
If one publishes data – whether as a product, for promotion,
or regulatory compliance – RDF/Linked Data is attractive
because of a critical mass of reusable terms and a ready
base of technology. As more data is published, the link
density increases, leading to more novel ways of deriving
value from the data.
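Generating RDF on demand through a relational-to-RDF schema mapping, as mentioned above, can be illustrated with a small Python sketch. It is an assumption-laden toy, not Virtuoso's mapping language: the customer table, the BASE URI, the MAPPING dictionary layout, and the rows_as_triples helper are hypothetical, and an in-memory SQLite database stands in for the relational source.

```python
import sqlite3

BASE = "http://example.org/crm/"  # hypothetical base URI for minted identifiers
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping: table -> key column, class URI, column -> predicate URI.
MAPPING = {
    "customer": {
        "key": "id",
        "class": BASE + "schema/Customer",
        "columns": {
            "name": "http://xmlns.com/foaf/0.1/name",
            "homepage": "http://xmlns.com/foaf/0.1/homepage",
        },
    }
}


def rows_as_triples(conn, table, mapping):
    """Yield triples for every row of a mapped table, computed at query time."""
    spec = mapping[table]
    cols = [spec["key"]] + list(spec["columns"])
    # Table and column names come from the trusted mapping, never from user input.
    sql = f"SELECT {', '.join(cols)} FROM {table} ORDER BY {spec['key']}"
    for row in conn.execute(sql):
        record = dict(zip(cols, row))
        subject = f"{BASE}{table}/{record[spec['key']]}"
        yield (subject, RDF_TYPE, spec["class"])
        for col, predicate in spec["columns"].items():
            if record[col] is not None:  # NULL columns produce no triple
                yield (subject, predicate, record[col])


def demo():
    """Build a tiny sample database and return its rows as a list of triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, homepage TEXT)"
    )
    conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'http://acme.example/')")
    conn.execute("INSERT INTO customer VALUES (2, 'Globex', NULL)")
    triples = list(rows_as_triples(conn, "customer", MAPPING))
    conn.close()
    return triples
```

Because the triples are produced when requested, they always reflect the current table contents. Writing the same generator's output to a file instead would correspond to the ETL alternative.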
Use Case: CRM and MIS
At OpenLink internal IT, all CRM, Support, Blogs, and Wikis
are available as linked data
Interactive drill down from products to support tickets to
customers to docs, etc.
Current projects focus on exposing enterprise
CRM data as linked data
Use Case: The Neurocommons
[Figure: data sources integrated in the Neurocommons. Labels: Publications, CCDB, SAO, OBO Ontologies, Neuronbank, PDSPki, Reactome, AddGene, Plasmids, Gene ontology annotations, Antibodies, Neurocommons text mining, Entrez Gene, SWAN, AlzGene, NeuronDB, Coriell catalog, BAMS, BrainPharm, Allen Brain Atlas, MESH, Mammalian Phenotype, NeuroMorpho, PubChem, Homologene]
Bio2RDF - some of the larger datasets
Name            Triple count
PubMed *        797,000,000
NCBI GeneID     172,931,628
UniProt         797,000,000
UniRef *        242,000,000
UniParc *       490,000,000
iProClass       149,342,977
Use Case: BBC Programs and Music Service
Data Harvested via Sitemap and Web Crawling
20M Triples
Integrated with Last.fm, DBpedia, MusicBrainz: See what
any of these has to say about an artist or work.
http://bbc.openlinksw.com
Use Case: Linked Open Data Cloud Service
DBpedia, Freebase, Geodata, Neurocommons, Bio2RDF,
Govtrack, US Census, RKB Explorer,
PingTheSemanticWeb, GoodRelations and more
Entity Ranks
Full Text, SPARQL, Faceted Browsing
http://lod2.openlinksw.com, http://lod.openlinksw.com
7.59 billion triples
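As a rough sketch of what SPARQL access to such a service looks like from a client, the Python below builds a SPARQL Protocol GET request URL. It only constructs the URL and sends nothing. The /sparql path on the host named above is an assumption, as are the helper names describe_query and request_url and the DBpedia entity used in the example. The "query" parameter name comes from the SPARQL Protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a SPARQL Protocol endpoint lives at /sparql on this host.
ENDPOINT = "http://lod.openlinksw.com/sparql"


def describe_query(entity_uri, limit=10):
    """Build a SPARQL query listing the properties and values of one entity."""
    return f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }} LIMIT {int(limit)}"


def request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL Protocol GET."""
    return endpoint + "?" + urlencode({"query": query})


# Illustrative entity URI; any HTTP URI known to the store could be used.
QUERY = describe_query("http://dbpedia.org/resource/Berlin", limit=5)
URL = request_url(ENDPOINT, QUERY)
```

A client would fetch URL with an Accept header naming the desired result format, for example application/sparql-results+json.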
The Generations of the Web
Web 1.0 - Publishing for all
Web 2.0 - User generated content, mashups, the citizen
journalist
Linked Data Web - Big data, integration and analytics for
all.