Transcript Web Service

Middleware for Bioinformaticians:
Lessons from the myGrid Project
Carole Goble and the myGrid consortium
University of Manchester, UK
http://www.mygrid.org.uk
SIAM Conference on Computational Science and Engineering
EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
“e-Science is about global
collaboration in key areas of
science and the next generation
of [computing] infrastructure that
will enable it.”
Sir John Taylor,
Director Office of Science and
Technology, UK
Science’ = Science + e-Science
• Discovery increasingly done in silico on results obtained from experiments using computational analysis & data repositories.
• A new era of collection based and simulation based science, in addition to hypothesis driven and experimental science
[Diagram labels: hypothesis, prediction, experiment, results; integration, analysis, mining]
Bioinformatics
“The application of
computer technology to
the management of
biological information.
Specifically, it is the
science of developing
computer databases and
algorithms to facilitate and
expedite biological
research, particularly in
genomics.”
http://www.informatics.jax.org/mgihome/other/glossary.shtml
What does a bioinformatician do
all day?
Williams-Beuren Syndrome (WBS)
• Contiguous sporadic gene deletion
disorder
• 1/20,000 live births, caused by
unequal crossover (homologous
recombination) during meiosis
• Haploinsufficiency of the region
results in the phenotype
• Multisystem phenotype – muscular,
nervous, circulatory systems
• Characteristic facial features
• Unique cognitive profile
• Mental retardation (IQ 40-100,
mean~60, ‘normal’ mean ~ 100 )
• Outgoing personality, friendly nature,
‘charming’
Williams-Beuren Syndrome Microdeletion on Chromosome 7q11.23
[Figure: physical map of 7q11.23 on Chr 7 (~155 Mb). Labels: Patient deletions (WBS, SVAS); ~1.5 Mb; Block A, Block B and Block C, with copies A-cen, B-cen, C-cen, A-mid, B-mid, C-mid, A-tel, B-tel, C-tel; Gap; clones CTA-315H11 and CTB-51J22.]
Genes labelled on the map: GTF2IRD2, NCF1, GTF2I, POM121, NOLR1, FKBP6T, GTF2IRD2P, GTF2IP, NCF1P, STAG3, PMS2L, GTF2IRD1, CYLN2, RFC2, WBSCR5/LAB, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, WBSCR21, STX1A, WBSCR18, WBSCR22, WBSCR14, TBL2, BCL7B, BAZ1B, FZD9, FKBP6, NOLR1, POM121
[Figure: Physical Map with the ‘Gap’ and clones CTA-315H11 and CTB-51J22; picture of Lab Scientist; picture of Workflow; Predict; Verify.]
Candidate genes in the WBS Critical Region: NCF1, GTF2IRD2, GTF2I, GTF2IRD1, CYLN2, WBSCR1/E1f4H, WBSCR5/LAB, RFC2, LIMK1, STX1A, WBSCR21, CLDN3, CLDN4, ELN, WBSCR14, WBSCR18, WBSCR22, BCL7B, TBL2, BAZ1B, POM121, NOLR1, FKBP6, FZD9
Workflow stages:
• Identification of overlapping sequence
• Characterisation of nucleotide sequence
• Characterisation of protein sequence
Filling a genomic gap in silico
Services published on the web, many without programmatic interfaces
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Filling a genomic gap in silico
Services published on the web, many without programmatic interfaces
• Public and local databases and data sets
• Protein-protein interaction algorithms
• Sequence alignment algorithms
• Visualisation tools
• Stochastic models for clustering gene expression data
• Ontology services
• Protein folding simulations
• Gene prediction algorithms
• Literature searches
Reuse
adapting and sharing best practice and know-how across a community
• Graves’ Disease: Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Williams-Beuren Syndrome: Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Trypanosomiasis in cattle: Steve Kemp, University of Liverpool; Andy Brass, University of Manchester, UK
• Chicken genome: Roslin Institute, UK
No single application
Small molecules
Proteomics
Clinical records
Computational steerage of
heart simulation codes
Cardiac Vulnerability to Acute
Ischemia
http://www.bioeng.auckland.ac.nz
Cardiac Vulnerability to Acute Ischemia, Simulation Step
Blanca Rodriguez, Oxford
• Mechanical model
• Blood perfusion bath model
• Electrophysio models
• Simulation protocol: “pace at 250 ms…”
• Initial conditions: “K+ 5.4 mmol/l…”
• Parameters: “Shock strength 50 A..”
Finite Element Bidomain Solver: 1 week to run per simulation
• Monitor, Stop, Checkpoint, Discard
• Restart with different parameters
• Perturb initial conditions: Stage 1 and stage 2 hypoxia
Result file produced for every 1ms, 7.3MB; 200ms simulation
Data Analysis
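The result-file figures on this slide imply a rough output volume. A back-of-envelope sketch, assuming (the slide does not say so explicitly) that every 1 ms step writes one result file of about 7.3 MB; the function name is illustrative only:

```python
# Back-of-envelope estimate of simulation output volume.
# Assumption (not stated on the slide): one ~7.3 MB result file per 1 ms step.
FILE_SIZE_MB = 7.3    # size of one result file, from the slide
STEP_MS = 1           # one result file per millisecond of simulated time
SIMULATED_MS = 200    # simulated duration, from the slide


def total_output_mb(simulated_ms: int, step_ms: int, file_size_mb: float) -> float:
    """Return the total output in MB for a run that writes one file per step."""
    n_files = simulated_ms // step_ms
    return n_files * file_size_mb


if __name__ == "__main__":
    total = total_output_mb(SIMULATED_MS, STEP_MS, FILE_SIZE_MB)
    print(f"{SIMULATED_MS // STEP_MS} files, about {total / 1000:.2f} GB")
```

Under that assumption a single 200 ms run produces on the order of a gigabyte and a half of results, which is why data management sits alongside steering in this use case.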
Williams-Beuren Workflows
• Identification of overlapping sequence
• Characterisation of nucleotide sequence
• Characterisation of protein sequence
[Figure: workflow diagrams starting from a query nucleotide sequence.]
Legend: Pink: Outputs/inputs of a service; Purple: Tailor-made services; Green: Emboss soaplab services; Yellow: Manchester soaplab services
Data items: Query nucleotide sequence; GenBank Accession No; GenBank Entry; URL inc GB identifier; Nucleotide seq (Fasta); Coding sequence; ORFs
Services, with the annotation each carries where the pairing is recoverable:
• RepeatMasker: Repetitive elements
• ncbiBlastWrapper: Blastn Vs nr, est databases.
• BLASTwrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr
• Seqret
• GenScan
• Promoter Prediction, TF binding Prediction, Regulation Element Prediction: Identify regulatory elements in genomic sequence
• restrict: Restriction enzyme map
• cpgreport: CpG Island locations and %
• sixpack: 6 ORFs
• transeq: Amino Acid translation
• prettyseq: Translation/sequence file. Good for records and publications
• pepstats: MW, length, charge, pI, etc
• epestfind: Identifies PEST seq
• pscan: Identifies FingerPRINTS
• pepcoil: Predicts Coiled-coil regions
• SignalP, TargetP, PSORTII: Predicts cellular location
• InterPro: Identifies functional and structural domains/motifs
• Pepwindow? Octanol?: Hydrophobic regions
• Sort for appropriate Sequences only
Experiment life cycle
• Forming experiments
• Personalisation
• Discovering and reusing experiments and resources
• Executing and monitoring experiments
• Sharing services & experiments
• Managing lifecycle, provenance and results of experiments
Middleware for bioinformaticians
• Construct, manage and publish in silico experiments, chiefly as workflows, to link up your own and others’ resources
• Data intensive, upstream analysis
• Workflow Reuse - foundations for sharing and adapting workflows and resources, and their outcomes, based on semantic descriptions
• Whole experiment lifecycle, including provenance
[Figure: experiment life cycle diagram: Forming experiments; Personalisation; Discovering and reusing experiments and resources; Executing and monitoring experiments; Sharing services & experiments; Managing lifecycle, provenance and results of experiments]
Middleware for bioinformaticians
• Open domain services and resources
• Open community
• Open application
• Open model and open data
• Open architecture
– Service Oriented Architecture
– Loosely coupled
– Web services based
– Assemble your own components
– Designed to work together
[Figure: experiment life cycle diagram: Forming experiments; Personalisation; Discovering and reusing experiments and resources; Executing and monitoring experiments; Sharing services & experiments; Managing lifecycle, provenance and results of experiments]
[Figure: myGrid architecture diagram. Component labels: Applications; Taverna e-Science workbench; Utopia; Haystack; LSID Launchpad; Web portals; Third party tools; myGrid Core Services; Service & workflow discovery; Feta semantic discovery; GRIMOIRES federated UDDI+ registry; Pedro semantic publication; myGrid ontology; Metadata Management; KAVE metadata store; KAVE provenance capture; Data Management; mIR myGrid information repository; LSID support; information model; Workflow enactment; Freefluo workflow engine; e-Science coordination; e-Science mediator; e-Science process patterns; e-Science events; Notification service; Web Service (Grid Service) communication fabric; External Services; Soaplab; Legacy applications; Executable codes with an IDL; Gowlab; Web Sites; Java applications; Web Services; OGSA-DAI DQP service; OGSA-DAI databases; AMBIT text extraction service; CScience Outcomes]
Making, wrapping, publishing and discovering services
Workflow Components
• Scufl: Simple Conceptual Unified Flow Language
• Taverna: Writing, running workflows & examining results
• SOAPLAB: Makes applications available
• Freefluo: Workflow engine to run workflows
[Diagram labels: Web Service, e.g. DDBJ BLAST; SOAPLAB Web Service; Any Application; SeqHound Service]
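The division of labour among these components (a workflow description, an engine that runs it, and wrapped services as the steps) can be illustrated with a toy dataflow runner. This is a minimal sketch only: it is not Scufl syntax and not the Freefluo API, and `run_workflow`, `mask_repeats` and `to_fasta` are hypothetical names standing in for real services.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# A processor is any callable that takes the outputs of its upstream nodes.
Processor = Callable[..., str]


def run_workflow(processors: Dict[str, Processor],
                 links: Dict[str, List[str]],
                 inputs: Dict[str, str]) -> Dict[str, str]:
    """Run every processor once all of its upstream results are available.

    `links` maps a processor name to the names it reads from; those names
    are either other processors or workflow inputs.
    """
    results = dict(inputs)
    # static_order() yields each node after all of its predecessors.
    for name in TopologicalSorter(links).static_order():
        if name in results:  # a workflow input: nothing to run
            continue
        args = [results[src] for src in links.get(name, [])]
        results[name] = processors[name](*args)
    return results


# Hypothetical stand-ins for wrapped services.
def mask_repeats(seq: str) -> str:
    return seq.replace("tttt", "nnnn")  # toy stand-in for a repeat masker


def to_fasta(seq: str) -> str:
    return ">query\n" + seq


if __name__ == "__main__":
    out = run_workflow(
        processors={"mask": mask_repeats, "fasta": to_fasta},
        links={"mask": ["sequence"], "fasta": ["mask"]},
        inputs={"sequence": "acatttttac"},
    )
    print(out["fasta"])
```

The point of the sketch is the separation: the `links` dictionary plays the role of the workflow description, and the runner knows nothing about what each step does.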
Data and Metadata Management
• Life Science Identifiers
• Information Repository and Common Information model for e-Science
• RDF Knowledge Added Value to Experiment
• OWL & RDFS Ontologies: to annotate and classify entities with a common vocabulary based on a common understanding.
[Figure: UML class diagram of the information model. Classes: Programme; <<Resource>> Study; Investigation; <<Resource>> ExperimentDesign; ExperimentInstance; <<Resource>> Operations.Operation; Resources.Resource; ProgrammeResource; LabBookView; Agent; Subject; Object; Annotation.SemanticConcept. Attributes shown: +name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +status:String, +rule:String, +getId:URIString. Association labels: contains, researchFocus, uses, selected studies, labBooks, has participants, method, has instances, scmInvestiga, init. Multiplicities: 1, 0..*, 1..*.]
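Life Science Identifiers are URNs of the form `urn:lsid:authority:namespace:object`, optionally followed by a revision; the deck's own examples (such as `urn:lsid:taverna:datathing:13`) follow this layout. A minimal parsing sketch, with `LSID` and `parse_lsid` as illustrative names rather than any myGrid API:

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts, raising ValueError if malformed."""
    parts = text.split(":")
    # The "urn:lsid" prefix is matched case-insensitively.
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:taverna:datathing:13"))
```

Because the identifier carries its authority and namespace, data and metadata stored in different places can be tied together by name alone.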
Layering models
• Service: name, description, author, organisation
• Operation: name, description, task, method, resource, application
• Parameter (hasInput, hasOutput): name, description, semantic type, format, transport type, collection type, collection format
• Subclasses: WSDL based operation; workflow; bioMoby service; WSDL based Web service; Soaplab service; Local Java code
[Figure: workflow enactment diagram. Labels: Workflow script; Enactor; Service Registry; Service Discovery; Failure policy; Alternates list; Services; Invocation + Data; Metadata template; LSID Service; LSID; Semantic Annotation; Events; Event Notification Service; External Data Store; LSID + Data; LSIDs + Metadata; Data; Info Repository; KAVE]
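The enactor diagram names a failure policy and an alternates list. One plausible reading, sketched here under that assumption, is: retry a failing service a bounded number of times, then fall back to the next equivalent service. The names below (`invoke_with_alternates`, `AllServicesFailed`, `make_demo_services`) are hypothetical and do not come from the myGrid code base.

```python
from typing import Callable, List, Sequence, Tuple

Service = Callable[[str], str]


class AllServicesFailed(Exception):
    """Raised when the primary service and every alternate have failed."""


def invoke_with_alternates(services: Sequence[Service],
                           data: str,
                           retries: int = 2) -> str:
    """Try each service in order, allowing `retries` extra attempts each."""
    errors: List[Exception] = []
    for service in services:
        for _attempt in range(1 + retries):
            try:
                return service(data)
            except Exception as exc:  # a real policy would be more selective
                errors.append(exc)
    raise AllServicesFailed(errors)


def make_demo_services() -> Tuple[Service, Service, List[str]]:
    """Build a failing primary, a working mirror, and a shared call log."""
    calls: List[str] = []

    def primary(seq: str) -> str:
        calls.append("primary")
        raise ConnectionError("primary service is down")

    def mirror(seq: str) -> str:
        calls.append("mirror")
        return f"report for {seq}"

    return primary, mirror, calls


if __name__ == "__main__":
    primary, mirror, calls = make_demo_services()
    print(invoke_with_alternates([primary, mirror], "acgt", retries=1))
    print(calls)
```

Keeping the policy in the enactor rather than in each workflow means a workflow author does not have to hand-code recovery around every unreliable third-party service.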
Biological Outcomes
• Four workflow cycles totalling
• The gap was correctly closed and all known features identified
• A pseudogene missed when working by hand was discovered
• All nine known genes identified (40/45 exons identified)
[Figure: clones CTA-315H11, CTB-51J22, RP11-622P13, RP11-148M21, RP11-731K22; 314,004bp extension; genes ELN, WBSCR28, WBSCR27, CLDN4, CLDN3, WBSCR21, STX1A, WBSCR22, WBSCR18, WBSCR24, WBSCR14]
[Figure: Physical Map with the ‘Gap’ and clones CTA-315H11 and CTB-51J22; candidate genes in the WBS Critical Region; picture of Lab Scientist; Predict; Verify]
Robert Stevens, Hannah J Tipney, Chris Wroe, Tom Oinn, Martin Senger, Phillip Lord, Carole A Goble, Andy Brass and May Tassabehji, Exploring Williams-Beuren Syndrome Using myGrid, in Bioinformatics 20:i303-310. Proc of 12th Intelligent Systems in Molecular Biology (ISMB), 31st Jul-4th Aug 2004, Glasgow, UK
Bioinformatics e-Science
Outcomes
• Elapsed time to perform one pipeline cut from 2 weeks to 2 hours
• Data collection improved
• Other people have used and want to develop the workflows
– Which means describing them so they can be understood
• Changed work practices
• Analysis all at once
• Service interoperability -> results integration
[Figure: workflow reuse cycle. Roles and labels: Bioinformaticians; Biologists; Providers; 3rd party annotation providers; Workflow]
• Search existing work: services, workflows, workflow fragments
• Create or wrap services, especially shim services
• Adapt workflow structure
• Parameterise services
• Edit workflow
• Try out workflow
• Deploy workflow
• Fragment workflow
• Register and annotate workflow and new services for reuse
• Annotate with free text, ontology
• Maintain reuse/repurpose history
Results Integration
Keeping track – a Web of science
[Figure: RDF provenance graph centred on a BLAST report.]
Node types: project; experiment definition; workflow definition; workflow invocation; service description; service invocation; person; group; organisation
Data items: urn:lsid:taverna:datathing:13; urn:lsid:taverna:datathing:15; nucleotide sequence “>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence”
Graph labels: rdf:type; ..part_of; ..author; ..works_for; ..run_during; ..run_for; ..invocation_of; ..described_by; ..created_by; ..BLAST_Report; ..nucleotide_sequence; ..masked_sequence_of; ..similar_sequences_to; ..filtered_version_of
BLAST report excerpt: top hit AC005089.3 (score 831), Homo sapiens BAC clone CTA-315H11 from 7, complete sequence; then AC073846.6 (score 815) and further hits scoring 46.1 and 44.1
Legend: A: Relationship BLAST report has with other; B: Other classes of information related to BLAST report
Jun Zhao, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, Mark Greenwood, Using Semantic Web Technologies for Representing e-Science Provenance, in Proc 3rd International Semantic Web Conference, Hiroshima, Japan, Nov 2004
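The graph on this slide is a set of subject, predicate, object statements. The sketch below shows the idea with plain tuples instead of an RDF library. The predicate names are taken from the slide; which LSID plays which role, and the `*_1` identifiers, are assumptions made up for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative statements only: the roles of the two LSIDs and the "*_1"
# identifiers are assumptions, not data from the talk.
PROVENANCE: List[Triple] = [
    ("urn:lsid:taverna:datathing:13", "rdf:type", "nucleotide_sequence"),
    ("urn:lsid:taverna:datathing:15", "rdf:type", "BLAST_Report"),
    ("urn:lsid:taverna:datathing:15", "similar_sequences_to",
     "urn:lsid:taverna:datathing:13"),
    ("urn:lsid:taverna:datathing:15", "created_by", "service_invocation_1"),
    ("service_invocation_1", "part_of", "workflow_invocation_1"),
    ("workflow_invocation_1", "invocation_of", "workflow_definition_1"),
    ("workflow_invocation_1", "run_for", "person_1"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching every field that is not None."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def trace(triples: List[Triple], start: str, predicates: List[str]) -> List[str]:
    """Follow `predicates` in order from `start`; stop at the first dead end."""
    path = [start]
    node = start
    for predicate in predicates:
        hits = match(triples, s=node, p=predicate)
        if not hits:
            break
        node = hits[0][2]
        path.append(node)
    return path


if __name__ == "__main__":
    report = "urn:lsid:taverna:datathing:15"
    print(trace(PROVENANCE, report, ["created_by", "part_of", "invocation_of"]))
```

Walking from a result back to the workflow definition that produced it is the kind of "keeping track" question the provenance graph is meant to answer.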
Building a data model and viewing results
Leaky pipes with prior process path dependencies and state
[Figure: Data Objects and Provenance Records]
Integrative Biology Project
http://www.integrativebiology.ac.uk
• Scientist designs, initiates and steers simulation from Taverna Workbench
• Steering of simulations by manipulation of service state
• Workflow definition sent to enactor
• Process and data provenance captured and stored by metadata services
[Diagram labels: Scientists; Workflow Workbench; myGrid; Enactor; Steering Control; Process 1; Process 2; Process 3; Metadata Stores]
Activation Energy
• Important for take up and
community building.
• And take up leads to much
better understanding.
• 1 hour to learn how to use the
workflow environment
• Service scavenge and go
• Deal with legacy
Services suck
• The workflows are only as good as the services they link together. myGrid ships with access to > 1000 services.
• Bootstrapping services.
• Reliability. Stability. Alternates.
• Service provider partners.
Sharing takes effort.
• Unanticipated reuse by people you don’t know in
automated workflows.
• The metadata needed pays off but it’s challenging and costly to obtain.
• Automated, service providers, network effects
• Quality control. Misuse. Inappropriate use.
• Competitive advantage, Intellectual property.
• Workflow design - local or licensed services
An NCBI-BLAST Description
Service Name: Blast
Operation: execute
    task: pairwise_local_aligning
    resource: EMBL
    application: blastn
Parameter:
    Input:
        Name: accession
        semantic type: EMBL Nucleotide sequence id
        transport data type: string
    Output:
        Name: Result
        semantic type: sequence alignment report
        transport data type: string
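A description like this one can be held as a small data structure and searched by task or by semantic type. The sketch below uses the field values from the slide; the class names and `find_operations` are illustrative, and the matching is plain string equality, not the ontology-based reasoning a semantic discovery service would perform.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    semantic_type: str
    transport_type: str


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    task: str
    resource: str
    application: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


# The description from the slide, field for field.
BLAST = Operation(
    service="Blast",
    name="execute",
    task="pairwise_local_aligning",
    resource="EMBL",
    application="blastn",
    inputs=(Parameter("accession", "EMBL Nucleotide sequence id", "string"),),
    outputs=(Parameter("Result", "sequence alignment report", "string"),),
)


def find_operations(registry: List[Operation],
                    task: Optional[str] = None,
                    input_type: Optional[str] = None,
                    output_type: Optional[str] = None) -> List[Operation]:
    """Return operations matching every criterion that is not None."""
    found = []
    for op in registry:
        if task is not None and op.task != task:
            continue
        if input_type is not None and all(
                p.semantic_type != input_type for p in op.inputs):
            continue
        if output_type is not None and all(
                p.semantic_type != output_type for p in op.outputs):
            continue
        found.append(op)
    return found


if __name__ == "__main__":
    for op in find_operations([BLAST], output_type="sequence alignment report"):
        print(op.service, op.name)
```

Note that both parameters travel as plain strings; only the semantic type says what the string means, which is what makes discovery by meaning possible.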
Tiered specifications
• Task: Sequence similarity search
• Service class: BLAST service
• Specific services: IBM Life Sciences service; SOAPLAB service
Classes of services: Domain “semantic”; “Unexecutable”; “Potentials”
Instances of services: Business “operational”; “Executable”; “Actuals”
[Diagram labels on the specific services: setProgram(); setDatabase(); setE_value(); createJob(); run(); getResults(); blastQuery(); “or”]
Wroe C, Goble CA, Greenwood M, Lord P, Miles S, Papay J, Payne T, Moreau L, Automating Experiments Using Semantic Data on a Bioinformatics Grid, in IEEE Intelligent Systems Jan/Feb 2004
Disposable SW
• Plan to throw away
• Separate e-Science research from e-Science development
• Support your e-science pioneers
[Diagram labels: Lash up; Technology driven; Prototype 1, internal; User driven, pioneers; Prototype 2, external; User driven, Early adopters; Research track; Development track; Migration track]
Reusable SW
• Design for extensibility and reuse – open systems
• Design for the generic but build from the specific
• Separate CS research and development tracks
• When you are interoperating, standards aren’t boring, they are necessary.
• Standards mean you can use everyone else’s stuff.
Science – Computer
complexity mismatch
• Interoperability and
execution complexity
• Layers of detail
Shim Services
‘I want to identify new sequences which overlap with my query sequence and determine if they are useful’
[Figure: workflow with shim services. Steps: Mask; BLAST; Simplify and Compare; Lister; Retrieve; BLAST2. Data: Sequence, i.e. last known 3000bp; Old BLAST result; Sequence database entry; Genbank format sequence; Fasta format sequence. Annotations: Identify new sequences and determine their degree of identity; Alignment of full query sequence V full ‘new’ sequence]
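A shim is a small adapter between two services whose formats do not line up, for example turning GenBank-style sequence lines into FASTA before a BLAST step. A minimal sketch, assuming the input is just the numbered sequence lines of a record (position, then blocks of ten bases) rather than a full GenBank entry; `genbank_origin_to_fasta` is a hypothetical name:

```python
from typing import Iterable, List


def genbank_origin_to_fasta(origin_lines: Iterable[str],
                            identifier: str,
                            width: int = 60) -> str:
    """Convert numbered GenBank-style sequence lines to a FASTA record."""
    residues: List[str] = []
    for line in origin_lines:
        fields = line.split()
        # Drop the leading position number, if the line has one.
        if fields and fields[0].isdigit():
            fields = fields[1:]
        residues.extend(fields)
    sequence = "".join(residues)
    # Re-wrap the sequence at a fixed line width.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + body)


if __name__ == "__main__":
    lines = [
        "12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt",
        "12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt",
    ]
    print(genbank_origin_to_fasta(lines, "query"))
```

Shims like this carry no scientific meaning of their own, which is why they are easy to overlook when describing a workflow and yet account for much of the effort in building one.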
The devil is in the detail
• Experiment provenance
• Simple classifications of services
• Descriptions in biological language
• Simple workflow
• Descriptions for automatic service execution and fault management
• Workflows for automagical execution – implicit iteration, generous typing …
• Debugging and rerunning provenance logs
• Expressive ontologies to match up services automatically
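"Implicit iteration" can be read as follows: a step written for a single item is applied element by element when the upstream step hands it a list. The sketch below covers only that single-input case and is an illustration of the idea, not the enactor's implementation; `implicit_iteration` is a hypothetical name.

```python
from typing import Any, Callable


def implicit_iteration(processor: Callable[[Any], Any], value: Any) -> Any:
    """Apply `processor` to one item, or element-wise to (nested) lists."""
    if isinstance(value, list):
        return [implicit_iteration(processor, item) for item in value]
    return processor(value)


if __name__ == "__main__":
    # A step written for one sequence works unchanged on a list of hits.
    print(implicit_iteration(str.upper, "acgt"))
    print(implicit_iteration(str.upper, ["acgt", ["tt", "ga"]]))
```

The workflow author never writes the loop, which keeps the workflow simple on the surface while the detail is handled underneath.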
[Figure: Taverna architecture. Labels: Taverna Workbench; Scufl language parser; Freefluo Workflow Enactor Core; one Processor per service type: BioMART, SeqHound, Plain Web Service, Soaplab, BioMOBY, Local App, Enactor]
• Yellow – Soaplab
• Green – WSDL Web Service
Scientists are from Venus and
Computer Scientists are from Mars
They have different needs and motivations.
Mars vs Venus
• Not my problem: Let’s solve this other problem which isn’t your problem but is fun and leads to interesting software.
• Over-complication: Let’s solve this harder problem rather than take the easier route that solves your problem. Hendler Principle – a little semantics goes a long way.
• Size matters: Well, it works for my toy test set that I synthesised.
• Mother knows best: Tell us what you want and then go away and we will build it for you.
• Fin: You can’t use it until it’s finished.
• Suits me: I can understand it -- I just need to train you to be just like me.
Venus vs Mars
• The parent principle: Repeating the same old mistakes despite our experiences. Simplifications, hackery and monoliths now store up trouble down the road.
• It works ‘cos I say so: it works in my application/hack, thus it is good.
• Short termism: It just about holds together to get the results for my paper. Let’s hope the PhD student doesn’t leave... You have to invest now for the future.
• Isolationism: It doesn’t matter if only I can understand what I am doing, no one else will want to know. Oh yeah?
Science
Biology
Conversation
Respect
Understanding
Compromise
Collaboration
e
Computer
Science
e-Science
Bioinformatics
Middleware
Bioinformaticians
“You have been working with us too long - I understood
you perfectly”
Mike Sternberg
Head of Structural Bioinformatics Group &
Director of Imperial College Centre for Bioinformatics
Thanks to
• myGrid: Chris Wroe, Katy Wolsencroft, Tom Oinn, Antoon Goderis, Peter Li, Anil Wipat
• WBS: Hannah Tipney, May Tassabehji
• Graves’ Disease: Clare Jennings
• Integrative Biology: David Gavaghan