Migrating to the Semantic Web: Bioinformatics as a case study.
Phillip Lord,
Dept of Computer Science,
University of Manchester
What is the Semantic Web
We are here!
OWL
RDF
XML
The talk
• Three (and a half) example case studies
• Two different technologies.
• Why we chose the different technologies.
RDF in a nutshell
1989
Tim Berners-Lee’s original vision…
OWL in a nutshell
The Motivation
“At the doctor’s office, Lucy instructed her semantic web agent. It promptly retrieved information about her Mom’s prescribed treatment, looked up a list of several providers within 20 miles of home, with a good trust rating.”
Scientific American, May 2001
The Motivating Example
Lucy
Doctor
myGrid
• UK e-Science Pilot Project.
• Oct 2001 – April 2005.
• £3.4 million.
• £0.4 million studentships.
Newcastle
Sheffield
Manchester
Nottingham
Hinxton
Southampton
Data(type)-intensive bioinformatics
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
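The record above mixes several kinds of data (identifiers, taxonomy, keywords, positional features, sequence) in one flat file, keyed by two-letter line codes. As a rough illustration only, the sketch below groups lines by code and splits the FT lines into typed fields. The function names `parse_flat_file` and `features` are hypothetical, the parsing rules are simplified assumptions rather than the full flat-file specification, and the sample is a shortened copy of the entry shown on the slide.

```python
from collections import defaultdict

# Shortened copy of the entry from the slide (sequence omitted).
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
"""


def parse_flat_file(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if code:  # lines starting with a space (sequence data) are skipped
            fields[code].append(value.strip())
    return dict(fields)


def features(fields):
    """Split each FT line into (key, start, end, description)."""
    result = []
    for value in fields.get("FT", []):
        key, start, end, description = value.split(None, 3)
        result.append((key, int(start), int(end), description))
    return result


entry = parse_flat_file(RECORD)
# Keywords are a semicolon-separated list terminated by a full stop.
keywords = [k.strip() for k in entry["KW"][0].rstrip(".").split(";")]
```

Even this small fragment yields free text, controlled keywords, and integer coordinates, which is the sense in which the slide calls the data type-intensive.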
Service Stack
[Figure: layered myGrid architecture diagram; the labels it contains are listed below.]
• Applications: Work bench, Taverna, Talisman, Web Portal
• Gateway
• Core services: Registries; Service and Workflow Discovery; Ontologies; Ontology Mgt; Views; Metadata Mgt; FreeFluo Workflow Enactment Engine; Personalisation; Provenance; Event Notification; myGrid Information Repository; OGSA-DQP Distributed Query Processor
• External services: SoapLab (legacy apps); GowLab (legacy apps); Native Web Services; AMBIT Text Extraction Service
• Web Service (Grid Service) communication fabric
• Roles shown: Tool Providers, Service Providers, Bioinformaticians
WBS Workflows:
[Figure: workflow diagram starting from a query nucleotide sequence; the labels it contains are listed below.]
• Legend: Pink: outputs/inputs of a service; Purple: tailor-made services; Green: Emboss soaplab services; Yellow: Manchester soaplab services; Grey: unknowns
• Services: RepeatMasker, ncbiBlastWrapper, GenScan, Seqret, prettyseq, sixpack, transeq, epestfind, pscan, pepstats, pepcoil, SignalP, TargetP, PSORTII, restrict, cpgreport, InterPro, PFAM, Prosite, Smart, Pepwindow?, Octanol?
• Inputs/outputs: GenBank Accession No; URL inc GB identifier; GenBank Entry; Nucleotide seq (Fasta); Coding sequence; ORFs; 6 ORFs; Amino Acid translation; Translation/sequence file (good for records and publications)
• Analyses: identifies PEST seq; identifies FingerPRINTS; MW, length, charge, pI, etc; predicts coiled-coil regions; predicts cellular location; identifies functional and structural domains/motifs; hydrophobic regions; restriction enzyme map; CpG island locations and %; repetitive elements
• Searches: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr; Blastn vs nr, est databases
• Sort for appropriate sequences only
Semantic discovery
• Query-ontology – discovering workflows and services described in the registry by building a query in Taverna.
• A common ontology is used to annotate and query.
• Look for all workflows that accept an input of semantic type nucleotide sequence.
• Aim to have semantic discovery over public view on the Web.
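A query of the kind described above ("all workflows that accept an input of semantic type nucleotide sequence") can be sketched as a lookup over annotated registry entries plus a walk up the shared ontology. Everything in this sketch is an illustrative assumption: the ontology fragment, the registry entries, and the function names are hypothetical and are not taken from the myGrid registry or ontology.

```python
# Hypothetical fragment of a shared ontology: child term -> parent term.
IS_A = {
    "nucleotide sequence": "biological sequence",
    "protein sequence": "biological sequence",
    "DNA sequence": "nucleotide sequence",
}

# Hypothetical registry: workflow name -> semantic type of its input.
REGISTRY = {
    "repeat_masking": "nucleotide sequence",
    "orf_finding": "DNA sequence",
    "protein_stats": "protein sequence",
}


def is_a(term, ancestor):
    """True if term equals ancestor or reaches it by following is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def workflows_accepting(semantic_type):
    """Workflows whose annotated input type is the query type or a subtype."""
    return sorted(
        name
        for name, input_type in REGISTRY.items()
        if is_a(input_type, semantic_type)
    )


matches = workflows_accepting("nucleotide sequence")
```

Because the same ontology is used for annotation and for querying, the workflow annotated with the narrower term "DNA sequence" is found as well, which a plain keyword match on "nucleotide sequence" would miss.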
Service annotation
• Adding structured metadata to a workflow registration to enable others to discover and reuse it more effectively, e.g. what semantic type of input it accepts.
Semantic Discovery
Pedro data capture tool
View annotations on workflow
Drag a workflow entry into the explorer pane and the workflow loads.
Drag a service/workflow to the scavenger window for inclusion into the workflow.
Biologist
Ontologist
Service Providers
Problems when doing In Silico Experiments
Experiments being performed repeatedly, at different sites, at different times, by different users or groups;
A large repository of records about experiments!
Scientists
In silico experiments:
• verification of data;
• “recipes” for experiment designs;
• explanation for the impact of changes;
• ownership;
• performance of services;
• data quality.
The Current State of the Art
Tim Berners-Lee’s original vision…
1989
XML
A Semantic Web of Provenance
[Figure: web of interlinked provenance resources; the labels it contains are listed below.]
• Questions answered: what; how/which/when/where; how; who; why
• Provenance record of a workflow run
• Interlinking graph of the workflow that generates the provenance logs
• Literature relevant to provenance study or data in this workflow
• Experiment Notes
• Web page of people who have related interests as the owner of the workflow
• DAML+OIL Ontologies linking provenance documents
• Document formats shown: XML, HTML, PDF
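The idea of a web of provenance, where a workflow run is linked to who ran it, what it ran, when, and why, can be illustrated with RDF-style subject/predicate/object triples. This is a minimal sketch under stated assumptions: the `ex:` identifiers, predicate names, URLs, and dates are all invented for illustration and do not come from the myGrid provenance schema, and a plain Python set stands in for an RDF store.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# All identifiers and values below are hypothetical.
TRIPLES = {
    ("ex:run1", "ex:executedWorkflow", "ex:workflowA"),      # what / how
    ("ex:run1", "ex:startedBy", "ex:scientist1"),            # who
    ("ex:run1", "ex:startedAt", "2004-01-01T10:00:00"),      # when
    ("ex:run1", "ex:motivatedBy", "ex:note1"),               # why
    ("ex:scientist1", "ex:homePage", "http://example.org/people/scientist1.html"),
    ("ex:note1", "ex:storedAs", "http://example.org/notes/note1.pdf"),
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def who_ran(run):
    """Answer the 'who' question for one workflow run."""
    return [o for _, _, o in match(s=run, p="ex:startedBy")]


def linked_documents(run):
    """Follow links one hop out from the run to documents about its resources."""
    docs = []
    for _, _, resource in match(s=run):
        for _, _, doc in match(s=resource):
            docs.append(doc)
    return sorted(docs)


people = who_ran("ex:run1")
documents = linked_documents("ex:run1")
```

The point of the graph form is that HTML pages, PDF notes, and XML logs can all be reached from the run by following links, without the documents sharing a format.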
Population Semantic Data
[Figure: components shown: Data Repository, Web Services, FreeFluo, Taverna, Metadata Repository, LaunchPad]
Haystack
Haystack from IBM
Biologist
Database
Biologist
Gene Ontology Next Generation Project (GONG)
• Demonstrate the utility of finer grained concept descriptions in DAML+OIL (OWL-DL)
• Develop methodologies and tools to support the process
Translating theory into practice
• Gene Ontology provides a service to the model organism database community
• Description logic (DL) is a technology born out of computer science research
• OWL is a standard ontology interchange language underpinned by DL
GONG - proof of concept
• Maintaining an exhaustive is-a structure
Parent
Is-a relationship
GO concept
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
  [i] heparin biosynthesis (GO:0030210)
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] glycosaminoglycan biosynthesis (GO:0006024)
        [i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
  [i] heparin biosynthesis (GO:0030210)
Is this important?
• Missing is-a not noticed by users
• BUT… improves fidelity of DB record retrieval.
  – Asking for gene products involved in ‘glycosaminoglycan biosynthesis’ will lead to an additional result:
    O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
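The effect on retrieval can be sketched as a query over the transitive closure of is-a links, run once without and once with the link from heparin biosynthesis to glycosaminoglycan biosynthesis. The term names come from the slides; the exact parent sets, the assumption that O94923 is annotated directly to heparin biosynthesis, and the function names are illustrative assumptions.

```python
# is-a links before the missing link is added (child -> set of parents).
PARENTS_BEFORE = {
    "heparin biosynthesis": {"aminoglycan biosynthesis", "heparin metabolism"},
    "glycosaminoglycan biosynthesis": {"aminoglycan biosynthesis"},
    "aminoglycan biosynthesis": {"carbohydrate biosynthesis"},
    "carbohydrate biosynthesis": {"biosynthesis"},
}

# The same links with heparin biosynthesis placed under
# glycosaminoglycan biosynthesis.
PARENTS_AFTER = dict(PARENTS_BEFORE)
PARENTS_AFTER["heparin biosynthesis"] = {
    "glycosaminoglycan biosynthesis",
    "heparin metabolism",
}

# Assumed annotation: gene product -> the term it is annotated to.
ANNOTATIONS = {"O94923": "heparin biosynthesis"}


def ancestors(term, parents):
    """All terms reachable from term by following is-a links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def gene_products_for(term, parents):
    """Gene products annotated to term or to any of its descendants."""
    return sorted(
        product
        for product, annotated in ANNOTATIONS.items()
        if annotated == term or term in ancestors(annotated, parents)
    )


before = gene_products_for("glycosaminoglycan biosynthesis", PARENTS_BEFORE)
after = gene_products_for("glycosaminoglycan biosynthesis", PARENTS_AFTER)
```

Under these assumptions the query returns nothing before the link is added and O94923 afterwards, while a query for aminoglycan biosynthesis is unaffected.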
Paraphrased reasoning process
• heparin biosynthesis
  – class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
• glycosaminoglycan biosynthesis
  – class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
Is-a
Inferring a new is-a link
• heparin biosynthesis
  – class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
Is-a
• glycosaminoglycan biosynthesis
  – class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
Is-a
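The paraphrased reasoning can be imitated with a much simpler structural check: if two classes are defined with the same parent and an acts_on restriction, and one chemical is a kind of the other, then an is-a link between the two processes follows. This is a hand-rolled sketch, not a description logic reasoner and not the tooling used in GONG; the data structures and function names are hypothetical, and the heparin is-a glycosaminoglycan link is taken as the asserted chemical hierarchy.

```python
# Asserted is-a links between chemicals (child -> parent).
CHEMICAL_IS_A = {"heparin": "glycosaminoglycan"}

# Defined classes, paraphrasing the slide:
# name -> (superclass, filler of the acts_on restriction).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def chemical_is_a(chemical, ancestor):
    """True if chemical equals ancestor or reaches it via is-a links."""
    while chemical is not None:
        if chemical == ancestor:
            return True
        chemical = CHEMICAL_IS_A.get(chemical)
    return False


def infer_is_a(definitions):
    """Infer (child, parent) links between defined classes."""
    links = set()
    for child, (child_super, child_chem) in definitions.items():
        for parent, (parent_super, parent_chem) in definitions.items():
            if (
                child != parent
                and child_super == parent_super
                and chemical_is_a(child_chem, parent_chem)
            ):
                links.add((child, parent))
    return links


inferred = infer_is_a(DEFINITIONS)
```

A real reasoner handles far more than this single pattern, but the sketch shows why the link can be derived from the definitions instead of being maintained by hand.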
Results
• Carbohydrate metabolism ~250 concepts
  – 22 additional is-a links, 17 of which now in GO
• Amino acid metabolism ~250 concepts
  – Further 17 additional is-a links now in GO
• GO team will be reviewing results for metabolism as a whole once we have the tools to support the process
• Useful results come from even a partial coverage
Build a practical environment
• Tools needed for:
  – Creating OWL definitions
  – Tracking changes
  – Reporting reasoning results
  – Viewing definitions
Reporting tools
OWL for GONG
Biologist
Ontologist
Conclusions
• Three problems, three different solutions, all making use of semantic web technologies.
• A little semantics can go a long way.
• The expressivity of the language has to be chosen at least in part based on the tasks to be performed, and the user base.
• Tools, tools, tools.
Acknowledgments
Chris Wroe, Robert Stevens,
Carole Goble
University of Manchester, UK
Michael Ashburner
EBI, Hinxton, UK
• Jane Lomax and Midori Harris of the GO editorial team for help and advice and responding to the suggested changes
• UMLS and MeSH which provided valuable resources for chemical information
• Sean Bechhofer for development on OilEd
• Project funded as a subcontract of the DARPA DAML programme
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
myGrid
People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker