Life Sciences:
a case study for the
Semantic Web
Professor Carole Goble
Information Management Group
University of Manchester
UK
Pioneers and incubators
• The Web -> Physics
– well-organised microcosm of the general
community.
– definite and clearly articulated
information dissemination needs.
– smart motivated people prepared to cooperate, and with the means and desires
to do so.
• The Semantic Web -> Life Sciences
Why Life Sciences?
• Knowledge-based discipline
– Collaborative history
– Publication shift: articles -> data -> knowledge
– Content with extensive metadata -> annotation &
controlled vocabularies
– Highly contextual, unstable and fuzzy
• In silico experiments
– Information harvesting & PSE
– Orchestrating resources -> workflow
– Services that exploit enriched content
– Support for scientific/research method = SW issues
– Transparent collection of annotation
Why Life Sciences?
• Strong enthusiastic cohesive community
– I3C use cases
– Grass roots ontologies and annotation
– Distributed annotation services
– NEED for provenance, audit, security …
– A chance of concrete articulation
– Sanger, EBI & NCBI
– ISCB
Disease Genetics & Pharmacogenomics
[Figure labels: Data Capture; Hypotheses; Design; Model & Analysis; Libraries; Clinical Resources; Individualised Medicine; Clinical; Image/Signal; Genomic/Proteomic; Knowledge Repositories; Data Mining; Case-Base Reasoning; Analysis; Information Sources; Information Fusion; Integration; Annotation / Knowledge Representation]
Cows to Proteins
• Jim Hendler-> how many cows in Texas?
Q: What ATPase superfamily proteins are
found in mouse?
A:
1. P21958 (from Swiss-Prot)
2. InterPro is a pattern database and could
tell you
3. Attwood’s lab expertise is in nucleotide
binding proteins ….
Which compounds interact with (alpha-adrenergic
receptors) ((over-expressed in (bladder epithelial
cells)) but not (smooth muscle tissue)) of
((patients with urinary flow dysfunction) and a
sensitivity to the (quinazoline family of compounds))?
[Figure labels: Drug formulary; High thro’put screening; Expressn. database; Tissue database; Chemical database; Enzyme database; Clinical trials database; SNPs database; Receptor database]
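The bracketing in the question above can be read as set operations over several databases. The following is a minimal sketch of that reading, assuming two toy lookup tables in place of real expression and chemical databases. Every identifier in it is invented for illustration, and the patient and quinazoline-sensitivity constraints are left out.

```python
# Toy stand-ins for separate databases. Every identifier below is invented.
interacts_with = {            # "chemical database": compound -> receptors
    "compound_A": {"receptor_1"},
    "compound_B": {"receptor_2"},
    "compound_C": {"receptor_1", "receptor_2"},
}
overexpressed_in = {          # "expression database": tissue -> receptors
    "bladder epithelial cells": {"receptor_1", "receptor_2"},
    "smooth muscle tissue": {"receptor_2"},
}


def candidate_compounds(wanted_tissue, excluded_tissue):
    """Compounds hitting receptors over-expressed in one tissue but not the other."""
    # Set difference implements the "but not" part of the question.
    targets = overexpressed_in[wanted_tissue] - overexpressed_in[excluded_tissue]
    # Keep any compound that interacts with at least one remaining receptor.
    return sorted(
        compound
        for compound, receptors in interacts_with.items()
        if receptors & targets
    )


result = candidate_compounds("bladder epithelial cells", "smooth muscle tissue")
print(result)
```

The hard part in practice is not the set algebra but agreeing that "receptor_1" means the same thing in every source, which is where shared vocabularies come in.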
Webs of Knowledge
Interoperating e-Services
[Figure: five service providers]
Interoperation is by hand or Perl scripts
But surely this is just all about querying and
linking (lots of) databases?
Isn’t the information all computationally
accessible already?
The document publishing
navigation interface
legacy
Navigation-based interaction
Identity
“Inaccessible” Descriptions
• Evolving
• Nonpredictive
• The structured part of the schema is open to change
• Hence flat file mark up’s prevalence
• XML is king.
Swiss-Prot Flat file
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [6]
RP   STRUCTURE BY NMR OF 23-231.
RX   MEDLINE=97424376 [NCBI, ExPASy, Israel, Japan]; PubMed=9280298;
RA   Riek R., Hornemann S., Wider G., Glockshuber R., Wuethrich K.;
RT   "NMR characterization of the full-length recombinant murine prion protein, mPrP(23-231).";
RL   FEBS Lett. 413:282-288(1997).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC   EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC   NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC   DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC   FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC   SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC   CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC   (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC   PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC   SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC   OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.
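Each line of a Swiss-Prot flat-file entry starts with a two-letter line code (ID, AC, DR, KW and so on) followed by free-form content. The sketch below shows one way a script might group lines by code and pull the cross-references out of the DR lines. It assumes the code occupies the first two columns, and the function names `parse_flat_file` and `cross_references` are made up for this illustration rather than taken from any Swiss-Prot tool.

```python
from collections import defaultdict


def parse_flat_file(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank or malformed lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


def cross_references(fields):
    """Split each DR line into a (database, identifier) pair."""
    pairs = []
    for dr in fields.get("DR", []):
        parts = [p.strip() for p in dr.split(";")]
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


# A few lines copied from the PRIO_HUMAN entry on the slide.
sample = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
OS   Homo sapiens (Human).
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
"""

entry = parse_flat_file(sample)
print(entry["AC"])
print(cross_references(entry))
```

The cross-references come out cleanly because they are structured. The CC comment lines are free text, which is the part a script cannot interpret without further markup.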
Literature holds knowledge
Consequence -> information extraction big business & metadata is required.
Community-wide markup
Annotation and Curation
Expressed Sequence Tags: millions
nrdb: 503,479
TrEMBL: 234,059
Swiss-Prot: 85,661
InterPro: 2990
PRINTS: 1310
“the elucidation and description of
biologically relevant features”
• Computationally formed – e.g. cross references to other database entries, date collected;
• Intellectually formed – the accumulated knowledge of an expert distilling the aggregated information drawn from multiple data sources and analyses, and the annotator's knowledge.
Swiss-Prot Annotation
[The PRIO_HUMAN (P04156) entry shown under "Swiss-Prot Flat file" is repeated here in full.]
PRINTS Annotation
gc; PRION
gx; PR00341
gt; Prion protein signature
gp; INTERPRO; IPR000817
gp; PROSITE; PS00291 PRION_1; PS00706 PRION_2
gp; BLOCKS; BL00291
gp; PFAM; PF00377 prion
bb;
gr; 1. STAHL, N. AND PRUSINER, S.B.
gr; Prions and prion proteins.
gr; FASEB J. 5 2799-2807 (1991).
gr;
gr; 2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M.
gr; The scrapie agent and the prion hypothesis.
gr; TRENDS BIOCHEM.SCI. 13 309-313 (1988).
gr;
gr; 3. PRUSINER, S.B.
gr; Scrapie prions.
gr; ANNU.REV.MICROBIOL. 43 345-374 (1989).
bb;
gd; Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with
gd; certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy (BSE),
gd; and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP is
gd; encoded in the host genome and is expressed both in normal and infected cells. During infection, however, the
gd; PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.
gd;
gd; PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are
gd; anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also
gd; expressed in other tissues, indicating that it may have different functions depending on its location.
gd;
gd; The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain
gd; containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an
gd; essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken
gd; glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been
gd; suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is
gd; not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate susceptibility.
gd;
gd; PRION is an 8-element fingerprint that provides a signature for the prion proteins. The fingerprint was
gd; derived from an initial alignment of 5 sequences: the motifs were drawn from conserved regions spanning
gd; virtually the full alignment length, including the 3 hydrophobic domains and the octapeptide repeats
gd; (WGQPHGGG). Two iterations on OWL18.0 were required to reach convergence, at which point a true set comprising
gd; 9 sequences was identified. Several partial matches were also found: these include a fragment (PRIO_RAT)
gd; lacking part of the sequence bearing the first motif, and the PrP homologue found in chicken - this matches
gd; well with only 2 of the 3 hydrophobic motifs (1 and 5) and one of the other conserved regions (6), but has an
gd; N-terminal signature based on a sextapeptide repeat (YPHNPG) rather than the characteristic PrP octapeptide.
The “Annotation Workflow”
[Figure: EMBL, TrEMBL, SwissProt, PRINTS and GPCRDB, connected by "Analysis" steps]
In silico experiments
Nicola:
Domain; Task; Events ontologies
Simon:
Support of research itself
In silico experiments
• Resource discovery, interoperation,
fusion, sharing, finding, filtering
• Work flows
• Science is dynamic – change
propagation
• Problem Solving Environments
• Collaborative and dynamic virtual
organisations
Annotating the annotations
• Transparent annotation by side effect
• Provenance, Trust, Authentication
• Audit
• Versioning, roll-backs and snap shots
• Confidentiality
• Credit – digital signatures
• Authorisation & security …
• Automated side effects as part of the PSE
• All potentials for Semantic Web Markup
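"Annotation by side effect" can be pictured as the environment recording what was run, on what, and when, without the scientist doing anything extra. The following is a minimal sketch of that idea using a Python decorator. The names `records_provenance`, `provenance_log` and `count_residues` are hypothetical, and a real PSE would need persistent storage, identity and signatures on top of this.

```python
import functools
from datetime import datetime, timezone

provenance_log = []  # in-memory stand-in for a provenance store


def records_provenance(step):
    """Wrap an analysis step so that each call is logged as a side effect."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        provenance_log.append({
            "step": step.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "when": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@records_provenance
def count_residues(sequence, residue):
    """Toy analysis step: count occurrences of one residue in a sequence."""
    return sequence.count(residue)


# The octapeptide repeat quoted in the PRINTS entry serves as sample input.
n = count_residues("WGQPHGGG", "G")
print(n, provenance_log[0]["step"])
```

The caller only sees the return value. The log entry is what later supports audit, versioning and credit.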
Not just data and tools…
Teams
Laboratories
Repositories
People
Problem Space
• Ability to store and retrieve huge volumes of information
• Ability to capture, enrich, classify, publish and structure knowledge about
– Domains
– Organisations
– Individuals
– Research Collaborations
– Experiments
– Results
– Services
Share info -> share meaning
[Figure: five service providers]
Ontologies are big news
• Gene Ontology
– Marking up annotation of major databases
– Identity, Linking databases together
– Classification/index framework for instances &
results
– It is sloppy but it is used by everybody!
– Gene Ontology -> DAML+OIL -> inference!
• http://www.geneontology.org
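To give a feel for what "inference" buys over plain keyword annotation, the sketch below follows is-a links upward so that a protein annotated with a specific term also answers queries for more general ones. It is a much weaker stand-in for DAML+OIL reasoning (just transitive closure over a graph), and the term names, links and `protein_X` are made up for illustration rather than taken from a Gene Ontology release.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following is-a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical is-a links, invented for this example.
is_a = {
    "ATPase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}

# Hypothetical annotation of one protein with one specific term.
annotations = {"protein_X": ["ATPase activity"]}


def has_function(protein, term):
    """True if the protein is annotated with `term` or any descendant of it."""
    direct = annotations.get(protein, [])
    return any(term == t or term in ancestors(t, is_a) for t in direct)


print(has_function("protein_X", "catalytic activity"))
```

A description-logic reasoner goes further by classifying terms from their definitions, which is where sufficiency conditions start to matter.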
BioOntology Consortium
• 150 people attended the last BOC meeting
• GSK and BOC mandated DAML+OIL
• Plethora of other ontologies
– Bioinformatics
• Many ontologies but under control
– Medical informatics
• Tons of ontologies, out of control
• Representing the natural world is tough!!
– Sufficiency conditions …
[Figure labels: Functional genomics; Tissue; Structural Genomics; Disease; Population Genetics; Genome sequence; Clinical Data; Clinical trial]
• Data resources have been
built introspectively for
human researchers
• Information is machine
readable not machine
understandable
• Sharing vocabulary is a
step towards unification
“The technical advantages of knowledge modeling
are obvious. Knowledge bases can be
automatically checked for consistency;
they support inference mechanisms which
derive data which have not been explicitly
stored; they also offer extensive request and
navigation facilities. However, the most
immediate benefit of knowledge base design
lies in the modeling process itself, through the
effort of explication, organization and
structuration [sic] of the knowledge it
requires.”
Editorial: Bioinformatics, July 2000
Quality & Stability
• Open Knowledge &
transparency
• Data quality
• Inconsistency,
incompleteness
• Provenance
• Contamination, noise,
experimental rigour
• Data irregularity
• Evolution, Audit, Versioning
“ … the problem in the field is
not a lack of good integrating
software, Smith says. The
packages usually end up leading
back to public databases. "The
problem is: the databases are
God-awful," he told
BioMedNet.
If the data is still
fundamentally flawed, then
better algorithms add little”
Temple Smith, director of the
Molecular Engineering
Research Center at Boston
University, BioMedNet 2000
Supporting Science
• All the great stuff Simon talked about
• Information is contextual
• Personalisation
– My view of a metabolic pathway
– My experimental process flows
• Science is not linear
– What did we know then
– What do we know now
• Longevity of data
– It has to be available in 50 years' time.
The Grid
• Large scale distributed
data management
• Large scale distributed
computation
• High speed
communications
• Dynamic collaborative
virtual organisations
• UK Govt £120 million
• http://www.gridform.org
Eating our own dog food
myGrid
• UK research council funded e-Science Project
• Start 1st October for 36-42 months
• £3.4 million
• 6 academic partners, 8 commercial
• 19 FTEs
• Web Services + Semantic Web + Grid
• http://www.mygrid.org.uk
myGrid Objectives
• Straightforward discovery, interoperation,
sharing
– information AND processes AND best practice
• Improving quality of both experiments and
data
– provenance through information <-> process linkage
– propagating change
• Individual creativity & collaborative working
• Enabling genomic level bioinformatics
myGrid Technologies
• Database access from the Grid
• Process enactment on the Grid
• Personalisation services
• Metadata services & Ontologies
– DAML+OIL !!
• Laying the foundations for Agent Services
• Collaboration Environments
• Service composition
• Ontologies, Protocols & APIs
Grid + Services + Semantic Web
“Bioinformatics is a knowledge-based
discipline. Many predictions, and
interpretations, of data in biology are made
by comparing the data in hand against
existing knowledge”
Dr. Andy Brass, ad nauseam
• Analogy/knowledge-based rather than
axiom-based
Remarks
• Semantic Web literacy in biology weak
• Grid literacy in biology strong
• Biology loves XML and ignores RDF
– Annotations sit in other (non RDF) databases.
• Role of (legacy) databases and semantic
web markup
– Lots of metadata already in databases
– Will we really mark up every database instance?
– Exporting results as RDF
– Using inference over results of queries
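One reading of "exporting results as RDF" is that the database stays as it is and only query results are turned into triples on the way out. The sketch below writes one result row as N-Triples using plain string formatting. The `example.org` URIs and property names are placeholders invented for this illustration, and a real exporter would use an agreed vocabulary and handle full literal escaping.

```python
def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_ntriples(subject_uri, properties, base):
    """Emit one triple per (property, value) pair of a query result row."""
    lines = []
    for name, value in properties.items():
        lines.append(
            '<%s> <%s%s> "%s" .' % (subject_uri, base, name, escape_literal(value))
        )
    return "\n".join(lines)


BASE = "http://example.org/vocab#"  # hypothetical vocabulary namespace
row = {"accession": "P04156", "organism": "Homo sapiens (Human)"}

triples = to_ntriples("http://example.org/protein/P04156", row, BASE)
print(triples)
```

Once results are in this form they can be merged with triples from other sources and handed to an inference engine, without marking up every database instance in advance.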
Remarks
• Change management
– What did we know then?
• Custodianship, guardianship, longevity…
• Performance, robustness, scale.
• Tools & easy to use environments
• Demonstrators
How does this bit fit?
?