Semantic Web for life scientists
The Semantic Web:
New-style data-integration
(and how it works for life-scientists too!)
Frank van Harmelen
AI Department
Vrije Universiteit Amsterdam
What’s the problem?
(data-mess in bio-inf)
Pharmaceutical Productivity
Source: PhRMA & FDA 2003
Kenneth Griffiths and Richard Resnick
Tutorial at Intelligent Systems for Molecular Biology (ISMB), 2003
The Industry’s Problem
Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying
mechanism (no common interface)
– and poor interaction with other data sources
What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
– prolactin stimulates mammary gland
development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of menstrual cycle
– can lead to conception difficulty
Understanding transcription factors for
prolactin production
“Show me all genes in the public literature that are putatively
related to hyperprolactinemia, have more than 3-fold
expression differential between hyperprolactinemic and normal
pituitary cells, and are homologous to known transcription
factors.”
(Q1 ∩ Q2 ∩ Q3)
Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
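The combined question can be read as three sub-queries, one each against a sequence source, an expression source, and a literature source, whose answers are then intersected. The sketch below illustrates only that idea; the gene identifiers, the fold-change values, and the function name `integrated_query` are made up for the example and do not come from any real data set.

```python
# Hypothetical answers from three separate sources (illustrative data only).
sequence_hits = {"GENE_A", "GENE_C", "GENE_D"}    # homologous to known transcription factors
expression_fold = {                               # fold change, hyperprolactinemic vs. normal cells
    "GENE_A": 4.2, "GENE_B": 1.5, "GENE_C": 3.6, "GENE_D": 5.0,
}
literature_hits = {"GENE_A", "GENE_B", "GENE_C"}  # putatively related to hyperprolactinemia

def integrated_query(sequence, fold_changes, literature, min_fold=3.0):
    """Intersect the three per-source answers into one result set."""
    # Keep only genes with strictly more than min_fold expression differential.
    expression_hits = {gene for gene, fold in fold_changes.items() if fold > min_fold}
    return sequence & expression_hits & literature

result = integrated_query(sequence_hits, expression_fold, literature_hits)
print(sorted(result))  # ['GENE_A', 'GENE_C']
```

The intersection itself is trivial. The hard part, as the slides argue, is that the three sources rarely agree on how a gene is named, so the identifiers cannot be compared directly without a shared vocabulary.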
The Medical tower of Babel
MeSH
Medical Subject Headings, National Library of Medicine
22.000 descriptions
EMTREE
Commercial Elsevier, Drugs and diseases
45.000 terms, 190.000 synonyms
UMLS
Integrates 100 different vocabularies
SNOMED
200.000 concepts, College of American Pathologists
Gene Ontology
15.000 terms in molecular biology
NCI Cancer Ontology:
17,000 classes (about 1M definitions)
Stitching this all together by hand?
Source: Stephens et al. J Web Semantics 2006
Why would
Semantic technology
help?
machine accessible meaning
(What it’s like to be a machine)
alleviates
META-DATA
<treatment>
<name>
<symptoms>
IS-A
<drug>
<drug administration>
<disease>
What is meta-data?
name
symptoms
disease
drug
administration
it's just data
it's data describing other data
it's meant for machine consumption
Required are:
1. one or more standard vocabularies
so search engines, producers and consumers
all speak the same language
2. a standard syntax,
so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust
is this page really about Pamela
Anderson?
What are ontologies &
what are they used for
world
concept
language
no shared understanding
Conceptual and
terminological confusion
Agree on a
conceptualization
Make it explicit
in some language.
Actors: both humans and machines
standard vocabularies
(“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough
so that they can be shared between
humans and humans
humans and machines
machines and machines
Biomedical ontologies (a few..)
MeSH
Medical Subject Headings, National Library of Medicine
22.000 descriptions
EMTREE
Commercial Elsevier, Drugs and diseases
45.000 terms, 190.000 synonyms
UMLS
Integrates 100 different vocabularies
SNOMED
200.000 concepts, College of American Pathologists
Gene Ontology
15.000 terms in molecular biology
NCI Cancer Ontology:
17,000 classes (about 1M definitions)
Remember “required are”:
1. one or more standard vocabularies
so search engines, producers and consumers
all speak the same language
2. a standard syntax,
so meta-data can be recognised as such
3. lots of resources with meta-data attached
Stack of languages
XML:
Surface syntax, no semantics
XML Schema:
Describes structure of XML documents
RDF:
Data model for “relations” between “things”
RDF Schema:
RDF Vocabulary Definition Language
OWL:
A more expressive
Vocabulary Definition Language
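To make the RDF and RDF Schema layers concrete: RDF states relations between things as (subject, predicate, object) triples, and a schema vocabulary adds statements such as "a drug IS-A treatment" from which a machine can infer further facts. The following is a minimal plain-Python sketch of that idea, not real RDF syntax and not a real RDF library. The resource names (`Aspirin`, `Headache`) and the predicate names (`type`, `subClassOf`, `treats`) are illustrative assumptions.

```python
# Each statement is a (subject, predicate, object) triple; all names are illustrative.
triples = {
    ("Aspirin", "type", "Drug"),
    ("Drug", "subClassOf", "Treatment"),  # the IS-A link from the meta-data slide
    ("Aspirin", "treats", "Headache"),
    ("Headache", "type", "Disease"),
}

def superclasses(cls, triples):
    """All classes reachable from cls by following subClassOf links transitively."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found

def types_of(resource, triples):
    """Directly stated types of a resource plus the types inferred via subClassOf."""
    direct = {o for s, p, o in triples if s == resource and p == "type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred

print(sorted(types_of("Aspirin", triples)))  # ['Drug', 'Treatment']
```

Nobody stated that Aspirin is a treatment. That fact follows from the schema-level triple, which is the kind of inference a vocabulary definition language is meant to enable.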
Remember “required are”:
1. one or more standard vocabularies
so search engines, producers and consumers
all speak the same language
2. a standard syntax,
so meta-data can be recognised as such
3. lots of resources with meta-data attached
Question:
who writes the ontologies?
Professional bodies, scientific communities,
companies, publishers, ….
See previous slide on Biomedical ontologies
Same developments in many other fields
Good old fashioned Knowledge Engineering
Convert from DB-schema, UML, etc.
Question:
Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
trade
antwerp
europe
amsterdam
merchant
netherlands
center
city
town
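A very shallow form of concept extraction can be approximated by counting content words and keeping the most frequent ones. The sketch below shows only that baseline. The sample text is written for this example (it is not the Encyclopedia Britannica entry), and the stop-word list and the name `extract_concepts` are assumptions. Real systems would map the extracted words onto ontology concepts rather than return raw strings.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption made for this example).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "was", "to",
             "for", "its", "as", "it", "with", "on", "by"}

def extract_concepts(text, top_n=5):
    """Return the top_n most frequent content words as candidate concepts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

# Made-up sample text, used only to exercise the function.
sample = ("Amsterdam is the capital city of the Netherlands. "
          "The city grew as a trade center, and Amsterdam merchants "
          "took over much of the trade of Antwerp.")

print(extract_concepts(sample, top_n=3))
```

On this sample the three most frequent content words are "amsterdam", "city" and "trade", each occurring twice.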
Question:
Who writes the meta-data?
exploit existing legacy-data
Databases
Lab equipment
(Amazon)
side-effect from user interaction
email keyword extraction
NOT from manual effort
Remember “required are”:
1. one or more standard vocabularies
so search engines, producers and consumers
all speak the same language
2. a standard syntax,
so meta-data can be recognised as such
3. lots of resources with meta-data attached
Some working examples?
• DOPE
DOPE: Background
Vertical Information Provision
Buy a topic instead of a journal!
Web provides new opportunities
Business driver: drug development
Rich, information-hungry market
Good thesaurus (EMTREE)
The Data
Document repositories:
ScienceDirect: approx. 500.000 fulltext articles
MEDLINE: approx. 10.000.000 abstracts
Extracted Metadata
The Collexis Metadata Server: concept extraction ("semantic fingerprinting")
Thesauri and Ontologies
EMTREE:
60.000 preferred terms, 200.000 synonyms
Architecture:
Query interface
RDF Schema: EMTREE
RDF: Datasource 1, …., Datasource n
Ontology disambiguates query
Ontology groups results
Ontology clusters results
Ontology refines query
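The roles listed above can be sketched with a toy thesaurus: synonyms are mapped to a preferred term (disambiguation), a query term is expanded with its narrower terms (refinement), and documents are grouped under the concepts they are annotated with (grouping). This is only an illustration under stated assumptions. The thesaurus entries below are invented and are not taken from EMTREE, and the function names are hypothetical and do not describe the DOPE implementation.

```python
# Hypothetical mini-thesaurus; the entries are NOT taken from EMTREE.
SYNONYMS = {"asa": "aspirin", "acetylsalicylic acid": "aspirin"}
NARROWER = {
    "analgesic agent": ["aspirin", "ibuprofen"],
    "aspirin": [],
    "ibuprofen": [],
}

def normalise(term):
    """Map a user term to its preferred term (disambiguation step)."""
    term = term.lower().strip()
    return SYNONYMS.get(term, term)

def expand(term):
    """Preferred term followed by all of its narrower terms (refinement step)."""
    preferred = normalise(term)
    terms = [preferred]
    for child in NARROWER.get(preferred, []):
        terms.extend(expand(child))
    return terms

def group_by_concept(documents):
    """Group document ids under each annotated concept (grouping step).

    documents maps a document id to the list of concept labels attached to it.
    """
    groups = {}
    for doc_id, concepts in documents.items():
        for concept in concepts:
            groups.setdefault(normalise(concept), []).append(doc_id)
    return groups

print(expand("Analgesic agent"))  # ['analgesic agent', 'aspirin', 'ibuprofen']
print(group_by_concept({"doc1": ["ASA"], "doc2": ["aspirin", "ibuprofen"]}))
```

In this toy setting a query for "ASA" and a query for "acetylsalicylic acid" reach the same documents, because both are normalised to one preferred term before matching.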
Some working examples?
• DOPE
• HCLS (http://www.w3.org/2001/sw/hcls/)
Architecture:
Query interface
RDF Schema: Gene Ontology, …., EMTREE
RDF: Datasource 1, …., Datasource n
Summarising…
Data integration on the Web:
machine processable data besides
human processable data
Syntax for meta-data
Representation
Inference
Vocabularies for meta-data
Lots of them in bio-inf.
Actual meta-data:
Lots in bio-inf.
Will enable:
Better search engines (recall, precision, concepts)
Combining information across pages (inference)
…
Things to do for you
Practical:
Use existing software
to construct new use scenarios
Conceptual:
Create an ontology
for some area of bio-medical expertise
from scratch
as a refinement of an existing ontology
Technical:
Transform an existing data-set
into meta-data format,
and provide a query interface
(for humans and machines)
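As a starting point for the technical exercise, a tabular data set can be turned into subject-predicate-object triples, with a small pattern-matching function serving as a machine-usable query interface. The sketch below is one possible approach, assuming a CSV input with an identifier column. The column names, the sample rows, and the function names `rows_to_triples` and `match` are hypothetical. A real solution would emit RDF and use a proper query language.

```python
import csv
import io

def rows_to_triples(csv_text, id_column):
    """Turn each non-empty CSV cell into a (row id, column name, value) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[id_column]
        for column, value in row.items():
            if column != id_column and value:
                triples.append((subject, column, value))
    return triples

def match(triples, s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Made-up sample data; the second row has no fold_change value.
data = "gene,organism,fold_change\nGENE_A,human,4.2\nGENE_B,mouse,\n"
triples = rows_to_triples(data, "gene")

print(match(triples, p="organism"))
# [('GENE_A', 'organism', 'human'), ('GENE_B', 'organism', 'mouse')]
```

Empty cells produce no triple, so missing values are simply absent from the result rather than stored as empty strings.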