PPT - OpenLink Virtuoso

Linked Data Driven Data Virtualization
for Web-scale Integration
Orri Erling
Program Manager, Virtuoso
© 2009 OpenLink Software, All rights reserved
Situation Analysis
• Agility through ad hoc data access has prevailed throughout the
history of IT.
• Data volume and heterogeneity are growing exponentially across
intranets, extranets, and the Internet.
• Processing windows remain static (we still have only 24
hours in a day for personal and professional activities).
• Individual and enterprise agility remains totally dependent
on data access, manipulation, and dissemination.
• Data remains dirty, and its context remains necessary for
extracting meaning.
• Data Virtualization (in the form of heterogeneous Linked
Data Spaces) remains the only viable way forward.
What is Linked Data?
• RDF (Resource Description Framework) Data Model - a
graph model where records take the form of 3-tuples, i.e.,
subject-predicate-object or entity-attribute-value
• RDF Data Serialization Formats - (X)HTML+RDFa, Turtle,
N3, TriX, RDF/XML, and others
• RDF Data Item Identity - HTTP URI based
• RDF is inherently schema-last and self-describing
• Linked Data - an application of the RDF model in which record
identifiers, fields, and optionally field values are HTTP-scheme
URIs, whether the data is instance data (ABox) or
data dictionary data (TBox)
• Linked Data enables follow-your-nose traversal of RDF
data records, where every record identifier, field, or field
value is a data pathway
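As a rough illustration of the triple model and follow-your-nose traversal described on this slide, the sketch below stores a few subject-predicate-object triples in memory and walks from one record to the records it links to. The example.org URIs, the data, and the function names `describe` and `follow_your_nose` are all made up for illustration; only the FOAF namespace is a real vocabulary. On the live Web each hop would be an HTTP dereference of the URI rather than a list lookup.

```python
# FOAF is a real vocabulary namespace; every example.org URI is hypothetical.
FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"
CAROL = "http://example.org/people/carol"

# Each record field is one (subject, predicate, object) triple.
TRIPLES = [
    (ALICE, FOAF + "name", "Alice"),
    (ALICE, FOAF + "knows", BOB),
    (BOB, FOAF + "name", "Bob"),
    (BOB, FOAF + "knows", CAROL),
    (CAROL, FOAF + "name", "Carol"),
]


def describe(subject, triples):
    """Return every (predicate, object) pair recorded for one subject URI."""
    return [(p, o) for s, p, o in triples if s == subject]


def follow_your_nose(start, triples, max_hops=2):
    """Breadth-first walk: any object that is itself a subject URI in the
    data is treated as a link to follow (an in-memory stand-in for an
    HTTP dereference)."""
    subjects = {s for s, _, _ in triples}
    seen = [start]
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for _, obj in describe(node, triples):
                if obj in subjects and obj not in seen:
                    seen.append(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return seen
```

Calling `follow_your_nose(ALICE, TRIPLES)` starts at one record identifier and reaches the other two purely by following field values that happen to be URIs.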
The Linked Data Landscape
• Core vocabularies - common terms facilitate integration:
  • FOAF for personal profiles
  • SIOC for social networking
  • Dublin Core for bibliography
  • GoodRelations for eCommerce
  • GeoNames
• Domain-specific vocabularies for all verticals:
  • OBO Foundry for biology
• DBpedia, OpenCyc, YAGO, SUMO, GeoNames, etc. define
URIs for talking about almost any well-known real-world
entity or class of entities.
The Linked Open Data Cloud
What Linked Data Offers for Data Integration
• In RDF, all things have a single-part, global, HTTP-based
identifier: anything can join with anything else through its
URI.
• Many people will use a different identifier for the same
thing.
• Whether two things can be considered the same depends
on context. owl:sameAs is a generic way of stating identity
(co-reference).
• Literal values can be tagged by type or
language, allowing explicit representation of units of
measure, etc.
• RDF triples are contained in Named Graphs.
The graph usually denotes provenance, and it has a URI
about which further statements can be made.
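The interplay of co-reference and provenance on this slide can be sketched as follows. Statements are stored as quads (triple plus graph URI), and owl:sameAs links are honored only when they come from graphs the caller trusts, which is one simple way to make "sameness depends on context" concrete. All example.org URIs, the data, and the names `canonical_ids` and `merged_description` are hypothetical; this is a minimal union-find sketch, not how any particular RDF store implements identity.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URIs for two sources describing the same city.
A_BERLIN = "http://example.org/a/berlin"
B_BERLIN = "http://example.org/b/Berlin"
G_A = "http://example.org/graphs/source-a"
G_B = "http://example.org/graphs/source-b"
G_LINKS = "http://example.org/graphs/links"

# Quads: (subject, predicate, object, named graph).
QUADS = [
    (A_BERLIN, "http://example.org/vocab/population", "3400000", G_A),
    (B_BERLIN, "http://example.org/vocab/country", "http://example.org/b/Germany", G_B),
    (A_BERLIN, OWL_SAME_AS, B_BERLIN, G_LINKS),
]


def canonical_ids(quads, trusted_graphs=None):
    """Map each URI named in a trusted owl:sameAs statement to one
    representative URI. trusted_graphs=None trusts every graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o, g in quads:
        if p != OWL_SAME_AS:
            continue
        if trusted_graphs is not None and g not in trusted_graphs:
            continue
        root_s, root_o = find(s), find(o)
        if root_s != root_o:
            # Keep the lexicographically smaller URI as the representative.
            keep, drop = sorted((root_s, root_o))
            parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}


def merged_description(uri, quads, trusted_graphs=None):
    """All (predicate, object, graph) statements about uri or its aliases."""
    canon = canonical_ids(quads, trusted_graphs)
    target = canon.get(uri, uri)
    return [
        (p, o, g)
        for s, p, o, g in quads
        if p != OWL_SAME_AS and canon.get(s, s) == target
    ]
```

With the links graph trusted, asking about either identifier returns statements from both sources, each still tagged with the graph it came from; with `trusted_graphs=set()`, the two identifiers stay separate.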
RDF vs. Relational
• When the data is ragged and highly heterogeneous, with
schema-last needs, use RDF and Linked Data
• The more different sources of data you have, the more you will need
RDF and Linked Data
• If data is highly regular and uniform, relational offers higher
performance: application-specific indices, etc. are faster
than putting everything in a generic index scheme
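To make "ragged" concrete, the sketch below takes a few records that share almost no fields and compares two shapes for them: one triple per field that is actually present, versus a single wide table with one column per distinct attribute. The records, URIs, and function names are hypothetical, and a single wide table is only one of several possible relational designs, so treat the numbers as an illustration of the trade-off rather than a benchmark.

```python
# Hypothetical records from different sources; no two share the same fields.
RECORDS = {
    "http://example.org/item/1": {"name": "Widget", "price": "9.99"},
    "http://example.org/item/2": {"name": "Gadget", "color": "red", "weight_kg": "1.2"},
    "http://example.org/item/3": {"label": "Gizmo", "voltage": "12"},
}


def to_triples(records):
    """Schema-last: one (subject, attribute, value) triple per present field."""
    return [
        (subject, attribute, value)
        for subject, fields in records.items()
        for attribute, value in fields.items()
    ]


def wide_table_stats(records):
    """Shape of one relational table able to hold every record: a column
    per distinct attribute, NULL wherever a record lacks that field."""
    columns = sorted({a for fields in records.values() for a in fields})
    cells = len(columns) * len(records)
    filled = sum(len(fields) for fields in records.values())
    return {"columns": columns, "cells": cells, "nulls": cells - filled}
```

Here the triple form holds only the fields that exist, while the wide table is mostly NULLs and needs a schema change for every new attribute. With regular, uniform records the NULL count drops to zero and the fixed schema can carry application-specific indices, which is the case where the slide favors relational.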
Incentives for Publishing
• If one is on the web, one is there in order to be found
• Publishing data in standard vocabularies allows
applications to mesh data from many Web-addressable
Data Spaces (e.g., pages)
• In the end, Linked Data will enhance the end-user
experience through added serendipitous discovery and
increased relevance
Models for Publishing
• Linked data is usually published in large dumps that have
a release cycle
• Any relational database's contents can be published as
linked data by generating RDF on demand via a
relational-to-RDF schema mapping
• Whether one generates RDF on demand or ETLs relational data into
RDF depends on the use case
If one publishes data, whether as a product, for promotion,
or for regulatory compliance, RDF/Linked Data is attractive
because of a critical mass of reusable terms and a ready
base of technology. As more data is published, the link
density increases, leading to more novel ways of deriving
value from the data.
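The idea of generating RDF on demand through a relational-to-RDF schema mapping can be sketched with Python's built-in sqlite3 module. The `customer` table, the example.org namespace, and the `MAPPING` dictionary format are assumptions invented for this example; they are not Virtuoso's mapping language or any standard one. The point is only that triples are produced at request time from the live rows, so nothing is copied or goes stale.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical URI namespace

# Hypothetical mapping: which table to read, its key column, how to mint
# a subject URI from the key, and which predicate URI each column maps to.
MAPPING = {
    "table": "customer",
    "key": "id",
    "subject_template": BASE + "customer/{key}",
    "predicates": {
        "name": BASE + "vocab/name",
        "city": BASE + "vocab/city",
    },
}


def make_demo_db():
    """Build a small in-memory relational database to map from."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
        [(1, "Acme", "Boston"), (2, "Globex", None)],
    )
    return conn


def rows_as_triples(conn, mapping):
    """Yield triples at request time; NULL columns produce no triple.
    Table and column names come from the trusted mapping, not user input."""
    columns = list(mapping["predicates"])
    sql = "SELECT {key}, {cols} FROM {table} ORDER BY {key}".format(
        key=mapping["key"], cols=", ".join(columns), table=mapping["table"]
    )
    for row in conn.execute(sql):
        subject = mapping["subject_template"].format(key=row[0])
        for column, value in zip(columns, row[1:]):
            if value is not None:
                yield (subject, mapping["predicates"][column], value)
```

The ETL alternative would run the same generator once and load its output into an RDF store; which of the two fits depends on query load and how fresh the data must be.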
Use Case: CRM and MIS
• In OpenLink's internal IT, all CRM, support, blogs, and wikis
are available as linked data
• Interactive drill-down from products to support tickets to
customers to docs, etc.
• Currently working on projects to expose enterprise
CRM as linked data
Use Case: The Neurocommons
[Diagram: data sources integrated in the Neurocommons: Publications,
CCDB, SAO, OBO Ontologies, Neuronbank, PDSPki, Reactome, AddGene,
Plasmids, Gene Ontology annotations, Antibodies, Neurocommons text
mining, Entrez Gene, SWAN, AlzGene, NeuronDB, Coriell catalog, BAMS,
BrainPharm, Allen Brain Atlas, MeSH, Mammalian Phenotype, NeuroMorpho,
PubChem, Homologene]
Bio2RDF - some of the larger datasets
Name           Triple count
PubMed *       797,000,000
NCBI GeneID    172,931,628
UniProt        797,000,000
UniRef *       242,000,000
UniParc *      490,000,000
iProClass      149,342,977
Use Case: BBC Programs and Music Service
• Data harvested via sitemap and web crawling
• 20M triples
• Integrated with Last.fm, DBpedia, and MusicBrainz: see what
any of these has to say about an artist or work.
• http://bbc.openlinksw.com
Use Case: Linked Open Data Cloud Service
• DBpedia, Freebase, Geodata, Neurocommons, Bio2RDF,
Govtrack, US Census, RKB Explorer
• PingTheSemanticWeb, GoodRelations, and more
• Entity ranks
• Full text, SPARQL, faceted browsing
• http://lod2.openlinksw.com, http://lod.openlinksw.com
• 7.59 billion triples
The Generations of the Web
• Web 1.0 - Publishing for all
• Web 2.0 - User-generated content, mashups, the citizen
journalist
• Linked Data Web - Big data, integration, and analytics for
all.