Semantic Web for life scientists


The Semantic Web:
New-style data-integration
(and how it works for life-scientists too!)
Frank van Harmelen
AI Department
Vrije Universiteit Amsterdam
What’s the problem?
(data-mess in bio-inf)
Pharmaceutical Productivity
Source: PhRMA & FDA 2003
Kenneth Griffiths and Richard Resnick
Tut. At Intell. Systems for Molec. Biol., 2003
The Industry’s Problem
Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying
mechanism (no common interface)
– and poor interaction with other data sources
What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
– prolactin stimulates mammary gland
development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of menstrual cycle
– can lead to conception difficulty
Understanding transcription factors for
prolactin production
“Show me all genes in the public literature that are putatively
related to hyperprolactinemia, have more than 3-fold
expression differential between hyperprolactinemic and normal
pituitary cells, and are homologous to known transcription
factors.”
(Q1 ∩ Q2 ∩ Q3)
Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
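The combined question is the intersection of three per-source answers, which only works once the sources agree on gene names. A minimal sketch in Python, assuming each source returns a plain list of names; the alias table and every gene identifier below are invented for illustration and come from no real database:

```python
# Hypothetical alias table: each source names the same gene differently.
# All identifiers are made up for this sketch.
ALIASES = {
    "seq_g17": "GENE_A", "lit-gene-17": "GENE_A", "expr17": "GENE_A",
    "seq_g42": "GENE_B", "lit-gene-42": "GENE_B",
    "expr99": "GENE_C", "lit-gene-99": "GENE_C",
}


def normalise(names):
    """Map source-specific names onto shared identifiers, dropping unknowns."""
    return {ALIASES[n] for n in names if n in ALIASES}


def integrate(*result_sets):
    """Return the genes that every source agrees on (set intersection)."""
    normalised = [normalise(r) for r in result_sets]
    return sorted(set.intersection(*normalised)) if normalised else []


q1_sequence = ["seq_g17", "seq_g42"]                          # SEQUENCE source
q2_expression = ["expr17", "expr99"]                          # EXPRESSION source
q3_literature = ["lit-gene-17", "lit-gene-42", "lit-gene-99"]  # LITERATURE source

print(integrate(q1_sequence, q2_expression, q3_literature))
```

Without the `normalise` step the three lists share no strings at all, which is the "no standard naming convention" problem in miniature.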
The Medical tower of Babel
• MeSH
– Medical Subject Headings, National Library of Medicine
– 22.000 descriptions
• EMTREE
– Commercial Elsevier, Drugs and diseases
– 45.000 terms, 190.000 synonyms
• UMLS
– Integrates 100 different vocabularies
• SNOMED
– 200.000 concepts, College of American Pathologists
• Gene Ontology
– 15.000 terms in molecular biology
• NCI Cancer Ontology
– 17,000 classes (about 1M definitions)
Stitching this all together by hand?
Source: Stephens et al. J Web Semantics 2006
Why would Semantic technology help?
machine accessible meaning
(What it’s like to be a machine)
[Figure: a page annotated with META-DATA tags <name>, <symptoms>, <disease>, <drug>, <drug administration>, <treatment>; relations shown: IS-A, alleviates]
What is meta-data?
name
symptoms
disease
drug
administration
it's just data
it's data describing other data
it's meant for machine consumption
Required are:
1. one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached
• mechanisms for attribution and trust
– is this page really about Pamela Anderson?
What are ontologies & what are they used for
[Figure: world, concept, language]
No shared understanding: conceptual and terminological confusion
Agree on a conceptualization
Make it explicit in some language
Actors: both humans and machines
standard vocabularies
(“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
– humans and humans
– humans and machines
– machines and machines
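The three ingredients listed above (concepts, a vocabulary for them, relations between them) can be sketched as a small data structure. This is an illustration only: the concept names, synonyms and IS-A links below are assumptions made for the example, not entries copied from MeSH, EMTREE or any other vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred: str                               # the agreed name for the concept
    synonyms: set = field(default_factory=set)   # vocabulary: other words for it
    broader: set = field(default_factory=set)    # relations: IS-A parents


# Hypothetical three-concept ontology.
ONTOLOGY = {
    "disease": Concept("disease"),
    "endocrine disease": Concept("endocrine disease", broader={"disease"}),
    "hyperprolactinemia": Concept(
        "hyperprolactinemia",
        synonyms={"hyperprolactinaemia", "prolactin excess"},
        broader={"endocrine disease"},
    ),
}


def find_concept(term):
    """Resolve a preferred term or synonym to its concept, or None."""
    wanted = term.lower()
    for concept in ONTOLOGY.values():
        if wanted == concept.preferred or wanted in concept.synonyms:
            return concept
    return None


def ancestors(name):
    """Collect every concept reachable by following IS-A links upwards."""
    seen = set()
    stack = list(ONTOLOGY[name].broader)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ONTOLOGY[current].broader)
    return seen


print(find_concept("Hyperprolactinaemia").preferred)
print(sorted(ancestors("hyperprolactinemia")))
```

Because the structure is explicit, a person and a program resolve "hyperprolactinaemia" to the same concept and reach the same broader categories.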
Biomedical ontologies (a few..)
• MeSH
– Medical Subject Headings, National Library of Medicine
– 22.000 descriptions
• EMTREE
– Commercial Elsevier, Drugs and diseases
– 45.000 terms, 190.000 synonyms
• UMLS
– Integrates 100 different vocabularies
• SNOMED
– 200.000 concepts, College of American Pathologists
• Gene Ontology
– 15.000 terms in molecular biology
• NCI Cancer Ontology
– 17,000 classes (about 1M definitions)
Remember “required are”:
1. one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached
Stack of languages
• XML:
– Surface syntax, no semantics
• XML Schema:
– Describes structure of XML documents
• RDF:
– Datamodel for “relations” between “things”
• RDF Schema:
– RDF Vocabulary Definition Language
• OWL:
– A more expressive Vocabulary Definition Language
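To make the RDF and RDF Schema layers concrete: RDF states facts as (subject, predicate, object) triples, and RDF Schema adds vocabulary such as rdfs:subClassOf from which further triples follow. The sketch below imitates this with plain Python tuples rather than a real RDF library; the `ex:` names (`ex:drugX`, `ex:diseaseY`, and so on) are hypothetical placeholders.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# A tiny graph: every fact is one (subject, predicate, object) triple.
triples = {
    ("ex:drugX", RDF_TYPE, "ex:Drug"),
    ("ex:Drug", SUBCLASS, "ex:Treatment"),
    ("ex:Treatment", SUBCLASS, "ex:MedicalConcept"),
    ("ex:drugX", "ex:alleviates", "ex:diseaseY"),
}


def infer_types(graph):
    """Apply the subclass rule until nothing new appears:
    if x has type C and C is a subclass of D, then x has type D."""
    inferred = set(graph)
    while True:
        new = {
            (s, RDF_TYPE, d)
            for (s, p, c) in inferred if p == RDF_TYPE
            for (c2, p2, d) in inferred if p2 == SUBCLASS and c2 == c
        } - inferred
        if not new:
            return inferred
        inferred |= new


closure = infer_types(triples)
print(("ex:drugX", RDF_TYPE, "ex:Treatment") in closure)
```

Nobody wrote down that `ex:drugX` is a treatment; the statement follows from the schema, which is the kind of machine-accessible meaning that XML's surface syntax alone does not provide.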
Remember “required are”:
1. one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached
Question: who writes the ontologies?
• Professional bodies, scientific communities, companies, publishers, ….
– See previous slide on Biomedical ontologies
– Same developments in many other fields
• Good old fashioned Knowledge Engineering
• Convert from DB-schema, UML, etc.
Question: who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
[Figure: extracted concepts: trade, antwerp, europe, amsterdam, merchant, netherlands, center, city, town]
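Shallow concept extraction of this kind can be approximated by counting content words. The following is a deliberately naive sketch, not the method used in the example above: the sample sentence is written for this illustration (it is not a Britannica quotation) and the stop-word list is a small assumption.

```python
import re
from collections import Counter

# Small, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "the", "is", "in", "as", "and", "across", "of", "to"}

# Sample text written for this example only.
TEXT = (
    "Amsterdam is a city in the Netherlands. The city grew as a trade "
    "center, and Amsterdam merchants traded across Europe."
)


def extract_concepts(text, top_n):
    """Return the top_n most frequent non-stop-words as candidate concepts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


print(extract_concepts(TEXT, 2))
```

A real extractor would also map the surviving words onto thesaurus concepts (so that "merchants" and "merchant" land on the same entry), which plain counting does not do.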
Question: who writes the meta-data?
• exploit existing legacy-data
– Databases
– Lab equipment
– (Amazon)
• side-effect from user interaction
– email keyword extraction
• NOT from manual effort
Remember “required are”
1. one or more standard vocabularies
– so search engines, producers and consumers all speak the same language
2. a standard syntax
– so meta-data can be recognised as such
3. lots of resources with meta-data attached
Some working examples?
• DOPE
DOPE: Background
• Vertical Information Provision
– Buy a topic instead of a Journal!
– Web provides new opportunities
• Business driver: drug development
– Rich, information-hungry market
– Good thesaurus (EMTREE)
The Data
• Document repositories:
– ScienceDirect: approx. 500.000 fulltext articles
– MEDLINE: approx. 10.000.000 abstracts
• Extracted Metadata
– The Collexis Metadata Server: concept extraction ("semantic fingerprinting")
• Thesauri and Ontologies
– EMTREE: 60.000 preferred terms, 200.000 synonyms
Architecture:
[Figure: Query interface; RDF Schema: EMTREE; RDF: Datasource 1, …., Datasource n]
Ontology disambiguates query
Ontology groups results
Ontology clusters results
Ontology refines query
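How an ontology can refine a query and group its results is sketched below. This is not the DOPE implementation: the thesaurus fragment, the documents and the function names are assumptions for illustration, and matching is limited to single-word terms.

```python
import re
from collections import defaultdict

# Hypothetical thesaurus fragment: preferred term -> synonyms and narrower terms.
THESAURUS = {
    "hormone": {"synonyms": [], "narrower": ["prolactin"]},
    "prolactin": {"synonyms": ["lactotropin", "PRL"], "narrower": []},
}

# Hypothetical document collection.
DOCS = {
    "doc1": "PRL levels in pituitary cells",
    "doc2": "Lactotropin and milk production",
    "doc3": "Hormone signalling overview",
    "doc4": "Trade routes of Amsterdam",
}


def term_index(root):
    """Refine the query: map every term under `root` to its preferred concept."""
    index, stack = {}, [root]
    while stack:
        concept = stack.pop()
        entry = THESAURUS[concept]
        index[concept.lower()] = concept
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = concept
        stack.extend(entry["narrower"])
    return index


def search(docs, query):
    """Return matching documents grouped by the concept they mention."""
    index = term_index(query)
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        for term, concept in index.items():
            if term in words:
                groups[concept].append(doc_id)
                break  # file each document under one concept only
    return dict(groups)


print(search(DOCS, "hormone"))
```

A plain keyword search for "hormone" would return only `doc3`; following the synonym and narrower-term links also finds the two prolactin documents and files them under their own concept.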
Some working examples?
• DOPE
• HCLS (http://www.w3.org/2001/sw/hcls/)
Architecture:
[Figure: Query interface; RDF Schema: Gene Ontology, …., EMTREE; RDF: Datasource 1, …., Datasource n]
Summarising…
• Data integration on the Web:
– machine processable data besides human processable data
• Syntax for meta-data
– Representation
– Inference
• Vocabularies for meta-data
– Lots of them in bio-inf.
• Actual meta-data:
– Lots in bio-inf.
• Will enable:
– Better search engines (recall, precision, concepts)
– Combining information across pages (inference)
– …
Things to do for you
• Practical: Use existing software to construct new use scenarios
• Conceptual: Create an ontology for some area of bio-medical expertise
– from scratch
– as a refinement of an existing ontology
• Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
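As a starting point for the technical task, the sketch below turns a small table into subject-predicate-object triples and offers a pattern-matching query function. The column names, the `ex:` prefix and all values are invented for illustration; a real project would more likely emit RDF with an existing toolkit.

```python
import csv
import io

# Hypothetical expression data, standing in for an existing data-set.
RAW = """gene,fold_change,source
geneA,4.2,array1
geneB,1.5,array1
geneC,3.8,array2
"""


def rows_to_triples(text):
    """Turn each table row into triples about the gene in that row."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "ex:" + row["gene"]
        triples.append((subject, "ex:foldChange", float(row["fold_change"])))
        triples.append((subject, "ex:measuredIn", "ex:" + row["source"]))
    return triples


def match(triples, s=None, p=None, o=None):
    """Query interface: None acts as a wildcard for that position."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


TRIPLES = rows_to_triples(RAW)

# "Which genes show more than a 3-fold expression differential?"
over_threefold = sorted(
    s for s, _, value in match(TRIPLES, p="ex:foldChange") if value > 3
)
print(over_threefold)
```

The same `match` function serves a program directly and could sit behind a web form for human users.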