The Semantic Web: Challenges for KR (and others)

The Semantic Web:
New-style data-integration
(and how it works for life-scientists too!)
Frank van Harmelen
AI Department
Vrije Universiteit Amsterdam
What’s the problem?
(data-mess in bio-inf)
Kenneth Griffiths and Richard Resnick
Tutorial at Intelligent Systems for Molecular Biology (ISMB), 2003
Life Science Data
Recent focus on genetic data
“genomics: the study of genes and their function. Recent advances in genomics are bringing
about a revolution in our understanding of the molecular mechanisms of disease, including the
complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery
of breakthrough healthcare products by revealing thousands of new biological targets for the
development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and
DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein
drugs, and potentially gene therapy.”
The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html
Study of genes and their function
Understanding molecular mechanisms of disease
Development of drugs, vaccines, and diagnostics
The Study of Genes...
• Chromosomal location
• Sequence
• Sequence Variation
• Splicing
• Protein Sequence
• Protein Structure
… and Their Function
• Homology
• Motifs
• Publications
• Expression
• HTS
• In Vivo/Vitro Functional Characterization
Understanding Mechanisms of Disease
Metabolic and regulatory pathway induction
Development of Drugs, Vaccines, Diagnostics
Differing types of Drugs, Vaccines, and Diagnostics
• Small molecules
• Protein therapeutics
• Gene therapy
• In vitro, In vivo diagnostics
Development requires
• Preclinical research
• Clinical trials
• Long-term clinical research
All of which often feeds back into ongoing Genomics
research and discovery.
The Industry’s Problem
Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying
mechanism (no common interface)
– and poor interaction with other data sources
What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
– prolactin stimulates mammary gland
development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of menstrual cycle
– can lead to conception difficulty
Understanding transcription factors for
prolactin production
“Show me all genes in the public literature that are putatively
related to hyperprolactinemia, have more than 3-fold
expression differential between hyperprolactinemic and normal
pituitary cells, and are homologous to known transcription
factors.”
(Q1 ∩ Q2 ∩ Q3)
Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
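One way to read the combined query is as an intersection of three result sets, each coming from a different kind of source. The sketch below assumes each source returns a set of gene identifiers and that one source uses its own naming; the identifiers, the alias table and the helper names are placeholders invented for illustration, not real findings or a real API.

```python
# Hypothetical alias table: maps source-specific names to one shared identifier.
ALIASES = {"gene-b": "GENE_B", "GeneC": "GENE_C"}


def normalise(identifiers):
    """Rewrite source-specific names to the shared identifier where one is known."""
    return {ALIASES.get(i, i) for i in identifiers}


def integrate(*result_sets):
    """Return the identifiers present in every result set (Q1 and Q2 and Q3)."""
    if not result_sets:
        return set()
    return set.intersection(*(normalise(r) for r in result_sets))


# Placeholder answers from the three kinds of source.
q1_sequence = {"GENE_A", "GENE_B", "GENE_C"}    # homologous to transcription factors
q2_expression = {"gene-b", "GeneC", "GENE_D"}   # > 3-fold expression differential
q3_literature = {"GENE_B", "GENE_C", "GENE_E"}  # linked to the disease in literature

candidates = integrate(q1_sequence, q2_expression, q3_literature)
print(sorted(candidates))
```

Without the `normalise` step the intersection would be empty, which is the "no standard naming convention" problem in miniature.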
The Complexity of Biological Data
Pharmaceutical Productivity
Source: PhRMA & FDA 2003
Stitching this all together by hand?
Source: Stephens et al. J Web Semantics 2006
The Medical tower of Babel
• MeSH
  – Medical Subject Headings, National Library of Medicine
  – 22.000 descriptions
• EMTREE
  – Commercial Elsevier, Drugs and diseases
  – 45.000 terms, 190.000 synonyms
• UMLS
  – Integrates 100 different vocabularies
• SNOMED
  – 200.000 concepts, College of American Pathologists
• Gene Ontology
  – 15.000 terms in molecular biology
• NCI Cancer Ontology:
  – 17,000 classes (about 1M definitions)
Problem with the Current WWW
Why would Semantic Web technology help?
machine accessible meaning
(What it’s like to be a machine)
alleviates
META-DATA
<treatment>
<name>
<symptoms>
IS-A
<drug>
<drug administration>
<disease>
What is meta-data?
name
symptoms
disease
drug administration
it's just data
it's data describing other data
it's meant for machine consumption
Required are:
1. one or more standard vocabularies
   – so search engines, producers and consumers all speak the same language
2. a standard syntax,
   – so meta-data can be recognised as such
3. lots of resources with meta-data attached
• mechanisms for attribution and trust
  – is this page really about Pamela Anderson?
What are ontologies &
what are they used for
world
concept
language
no shared understanding
Conceptual and terminological confusion
Agree on a conceptualization
Make it explicit in some language.
Actors: both humans and machines
standard vocabularies
(“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
– humans and humans
– humans and machines
– machines and machines
Shared content-vocabularies: Ontologies
“Formal, explicit specification of a shared conceptualisation”
– Formal: machine processable
– explicit specification: concepts, properties, relations, functions
– shared: Consensual knowledge
– conceptualisation: Abstract model of some domain
Real life examples
• handcrafted
  – music: CDnow (2410/5), MusicMoz (1073/7)
  – biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k)
  – Systems biology
• ranging from lightweight (Yahoo, UNSPSC, Open directory (400k)) to heavyweight (Cyc (300k))
• ranging from small (METAR) to large (UNSPSC)
Biomedical ontologies (a few..)
• MeSH
  – Medical Subject Headings, National Library of Medicine
  – 22.000 descriptions
• EMTREE
  – Commercial Elsevier, Drugs and diseases
  – 45.000 terms, 190.000 synonyms
• UMLS
  – Integrates 100 different vocabularies
• SNOMED
  – 200.000 concepts, College of American Pathologists
• Gene Ontology
  – 15.000 terms in molecular biology
• NCI Cancer Ontology:
  – 17,000 classes (about 1M definitions)
What’s inside an ontology?
• terms + specialisation hierarchy
• classes + class-hierarchy
• instances
• slots/values
• inheritance (multiple? defaults?)
• restrictions on slots (type, cardinality)
• properties of slots (symm., trans., …)
• relations between classes (disjoint, covers)
• reasoning tasks: classification, subsumption
(down the list: increasing semantic “weight”)
NB: we’re not doing philosophy
Ontologies are not definitive descriptions of what exists in the world (= philosophy)
Ontologies are models of the world, constructed to facilitate communication
Yes, ontologies exist (because we build them)
Remember “required are”:
1. one or more standard vocabularies
   – so search engines, producers and consumers all speak the same language
2. a standard syntax,
   – so meta-data can be recognised as such
3. lots of resources with meta-data attached
Stack of languages
XML:
  – Surface syntax, no semantics
XML Schema:
  – Describes structure of XML documents
RDF:
  – Datamodel for “relations” between “things”
RDF Schema:
  – RDF Vocabulary Definition Language
OWL:
  – A more expressive Vocabulary Definition Language
RDF Triples in Life Sciences
Bluffer’s guide to RDF (1)
Object --Attribute-> Value triples
pers05 --Author-of-> ISBN...
objects are web-resources
Value is again an Object:
  – triples can be linked
  – data-model = graph
pers05 --Author-of-> ISBN... --Publby-> MIT
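Because the value of one triple can be the object of the next, a set of triples is a directed, labelled graph. The following is a minimal sketch of that idea in plain Python, using the two triples from the slide ("ISBN..." is left elided as in the source); the function names are illustrative and not part of any RDF library.

```python
from collections import defaultdict

# The two triples shown on the slide; "ISBN..." stays elided as in the source.
triples = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publby", "MIT"),
]


def build_graph(triples):
    """Index triples by their object (subject) so edges can be followed."""
    graph = defaultdict(list)
    for obj, attribute, value in triples:
        graph[obj].append((attribute, value))
    return graph


def follow(graph, start, path):
    """Follow a chain of attributes from `start`; return every value reached."""
    frontier = {start}
    for attribute in path:
        frontier = {
            value
            for node in frontier
            for attr, value in graph.get(node, [])
            if attr == attribute
        }
    return frontier


graph = build_graph(triples)
publishers = follow(graph, "pers05", ["Author-of", "Publby"])
print(publishers)
```

Linking the triples is what lets a question about pers05 be answered with a fact that was only stated about the book.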
Bluffer’s guide to RDF (2)
• Every identifier is a URL = world-wide unique naming!
• Has XML syntax
<rdf:Description rdf:about="#pers05">
  <authorOf>ISBN...</authorOf>
</rdf:Description>
• Any statement can be an object
  • graphs can be nested
NYT --claims-> (pers05 --Author-of-> ISBN...)
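To make the XML syntax concrete, here is a hedged sketch that reads a complete version of the fragment above back into triples using only Python's standard library. The `ex:` namespace URI is a made-up placeholder, the wrapping `rdf:RDF` element and namespace declarations are added so the fragment parses, and the function handles only this simple "description with literal properties" pattern, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms#"  # hypothetical vocabulary namespace

# The slide's fragment, wrapped so that it is a well-formed XML document.
doc = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:ex="{EX_NS}">
  <rdf:Description rdf:about="#pers05">
    <ex:authorOf>ISBN...</ex:authorOf>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Return (subject, property, value) for each property of each Description."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree expands the prefix, so prop.tag is "{namespace}localname".
            triples.append((subject, prop.tag, prop.text))
    return triples


for triple in extract_triples(doc):
    print(triple)
```

The expanded property name shows the first bullet at work: once the prefix is resolved, the attribute is identified by a full URL rather than a local string.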
What does RDF Schema add?
• Defines vocabulary for RDF
• Organizes this vocabulary in a typed hierarchy
• Class, subClassOf, type
• Property, subPropertyOf
• domain, range
Teacher --subClassOf-> Person
Student --subClassOf-> Person
supervises --domain-> Teacher
supervises --range-> Student
Frank --type-> Teacher
Marta --type-> Student
Frank --supervises-> Marta
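What the schema buys is inference: from "Frank supervises Marta" plus the domain, range and subClassOf statements, a machine can derive the type statements on its own. The sketch below is a naive fixed-point loop over three RDFS-style rules (subClassOf transitivity, type propagation, domain/range typing). It assumes the reading of the diagram in which Teacher and Student are subclasses of Person; it is an illustration, not a complete RDFS reasoner.

```python
def rdfs_closure(triples):
    """Apply three RDFS-style rules repeatedly until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p == "subClassOf" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "subClassOf", o2))   # subClassOf is transitive
                if p == "subClassOf" and p2 == "type" and o2 == s:
                    new.add((s2, "type", o))         # instances move up the hierarchy
                if p == "domain" and p2 == s:
                    new.add((s2, "type", o))         # subject gets the domain class
                if p == "range" and p2 == s:
                    new.add((o2, "type", o))         # value gets the range class
        if new <= inferred:
            return inferred
        inferred |= new


schema = {
    ("Teacher", "subClassOf", "Person"),
    ("Student", "subClassOf", "Person"),
    ("supervises", "domain", "Teacher"),
    ("supervises", "range", "Student"),
}
data = {("Frank", "supervises", "Marta")}  # no explicit type statements given

closure = rdfs_closure(schema | data)
for triple in sorted(closure - schema - data):
    print(triple)
```

Only the supervises statement was asserted; the types of Frank and Marta, including their membership of Person, are all derived.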
OWL: things RDF Schema can’t do
• equality
• enumeration
• number restrictions
  – Single-valued/multi-valued
  – Optional/required values
• inverse, symmetric, transitive
• boolean algebra
  – Union, complement
• …
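The "inverse, symmetric, transitive" item can be illustrated with a small closure computation: declaring such characteristics of a property lets a reasoner add statements nobody wrote down. This is a sketch in plain Python, not an OWL reasoner; the property names (`partOf`, `hasPart`, `interactsWith`) and the facts are hypothetical examples.

```python
def apply_property_axioms(facts, transitive=(), symmetric=(), inverse_of=None):
    """Close a set of triples under transitive, symmetric and inverse properties."""
    inverses = dict(inverse_of or {})
    # An inverse declaration holds in both directions.
    inverses.update({v: k for k, v in list(inverses.items())})
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverses:
                new.add((o, inverses[p], s))
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical facts and property declarations.
facts = {
    ("nucleus", "partOf", "cell"),
    ("cell", "partOf", "tissue"),
    ("proteinX", "interactsWith", "proteinY"),
}
closed = apply_property_axioms(
    facts,
    transitive={"partOf"},
    symmetric={"interactsWith"},
    inverse_of={"partOf": "hasPart"},
)
print(sorted(closed - facts))
```

RDF Schema has no way to state any of the three declarations, so none of the derived statements would be available at that level.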
OWL: more expressivity
• OWL Lite
  – (sub)classes, individuals
  – (sub)properties, domain, range
  – conjunction
  – (in)equality
  – cardinality 0/1
  – datatypes
  – inverse, transitive, symmetric
  – hasValue
  – someValuesFrom
  – allValuesFrom
• OWL DL
  – Negation
  – Disjunction
  – Full Cardinality
  – Enumerated types
• OWL Full
  – Allow meta-classes etc
(Diagram labels: RDF Schema, Lite, DL, Full)
Remember “required are”:
1. one or more standard vocabularies
   – so search engines, producers and consumers all speak the same language
2. a standard syntax,
   – so meta-data can be recognised as such
3. lots of resources with meta-data attached
Question: who writes the ontologies?
• Professional bodies, scientific communities, companies, publishers, ….
  – See previous slide on Biomedical ontologies
  – Same developments in many other fields
• Good old fashioned Knowledge Engineering
• Convert from DB-schema, UML, etc.
Question: Who writes the meta-data?
– Automated learning
– shallow natural language analysis
– Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
(extracted concepts: trade, antwerp, europe, amsterdam, merchant, netherlands, center, city, town)
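As a rough illustration of what "shallow" analysis can mean, the sketch below extracts candidate concepts by counting the non-stopword terms in a text. This is a deliberately crude stand-in and not the method behind the slide's example; the sample sentence and the stopword list are invented for the illustration and are not taken from the Encyclopedia Britannica.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword list; a real system would use a much larger one.
STOPWORDS = {"is", "a", "in", "the", "as", "of", "and", "across"}


def extract_concepts(text, top_n=5):
    """Return the top_n most frequent non-stopword terms as candidate concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]


# Invented sample text, used only to exercise the function.
sample = (
    "Amsterdam is a city in the Netherlands. The city grew as a center of "
    "trade, and Amsterdam merchants traded across Europe."
)

concepts = extract_concepts(sample, top_n=2)
print(concepts)
```

Note that "merchants" and "traded" are counted separately from "merchant" and "trade"; mapping word forms onto thesaurus concepts is where the vocabularies come in.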
Question: Who writes the meta-data?
• exploit existing legacy-data
  – Amazon
  – Lab equipment?
• side-effect from user interaction
  – MIT Lab photo-annotator
• NOT from manual effort
Web 2.0 community/social interaction
Remember “required are”
1. one or more standard vocabularies
   – so search engines, producers and consumers all speak the same language
2. a standard syntax,
   – so meta-data can be recognised as such
3. lots of resources with meta-data attached
Some working examples?
• DOPE
• HCLS (http://www.w3.org/2001/sw/hcls/)
DOPE: Background
• Vertical Information Provision
  – Buy a topic instead of a Journal!
  – Web provides new opportunities
• Business driver: drug development
  – Rich, information-hungry market
  – Good thesaurus (EMTREE)
The Data
• Document repositories:
  – ScienceDirect: approx. 500.000 fulltext articles
  – MEDLINE: approx. 10.000.000 abstracts
• Extracted Metadata
  – The Collexis Metadata Server: concept extraction ("semantic fingerprinting")
• Thesauri and Ontologies
  – EMTREE: 60.000 preferred terms, 200.000 synonyms
Architecture:
(Diagram labels: Query interface; RDF Schema: EMTREE; RDF: Datasource 1, …, Datasource n)
Architecture:
(Diagram labels: GUI: Spectacle (Aduna); http requests; Mediator: Sesame (Aduna); SeRQL; Source Model (RDF); Gene Thesaurus (RDFS); EMTREE Thesaurus (RDFS); Document Model (RDFS); Additional Source of Data; SOAP; Java Client; Metadata Server (Collexis))
Summarising…
• Data integration on the Web:
  – machine processable data besides human processable data
• Syntax for meta-data
  – XML (not much meaning)
  – RDF (some meaning)
  – RDF Schema (some meaning)
  – OWL (more meaning)
• Vocabularies for meta-data
  – Lots of them in bio-inf.
• Actual meta-data:
  – Lots in bio-inf.
• Will enable:
  – Better search engines (recall, precision, concepts)
  – Combining information across pages (inference)
  – …
Things to do for you
• Practical: Use existing software to construct new use scenarios
• Conceptual: Create an ontology for some area of bio-medical expertise
  – from scratch
  – as a refinement of an existing ontology
• Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
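For the technical task, a possible starting point is sketched below: each row of a table becomes a set of triples (row key as object, column name as attribute, cell as value), and a pattern-matching function serves as a very small query interface. The table contents, column names and function names are hypothetical; a real solution would more likely emit RDF and use an existing store and query language.

```python
import csv
import io

# Hypothetical data-set; identifiers and columns are placeholders.
TABLE = """id,name,treats
drug1,DrugOne,disease1
drug2,DrugTwo,disease1
"""


def rows_to_triples(csv_text, key="id"):
    """Turn each non-empty cell into a (row key, column name, cell value) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[key]
        for column, value in row.items():
            if column != key and value:
                triples.append((subject, column, value))
    return triples


def query(triples, s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


triples = rows_to_triples(TABLE)
matches = query(triples, p="treats", o="disease1")
for match in matches:
    print(match)
```

The same `query` function can back a form for humans and an HTTP endpoint for machines, since both only need to supply a triple pattern.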