Taverna workflow editor overview


Taverna
the story from up-above
Antoon Goderis
The University of Manchester, UK
http://www.mygrid.org.uk/taverna
http://www.omii.ac.uk
DART workshop, Brisbane, Australia, 14 December 2006
Overview



The situation in –omics
Creating new biology using Taverna
Taverna


Key traits
Features on the OMII roadmap

Including today’s release
Bioinformaticians & co.
Open environment
Data, Data, Data
National Center for Biotechnology Information (USA)
EBI, Cambridge, UK
Tokyo, Japan
SRS
SeqHound
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa






The situation in {genomics, transcriptomics, proteomics, metabolomics ..}
Lots of data
Lots of parameters to choose
An analysis takes a long time
The analysis services are unreliable
Lots of analysis steps
Need to record and explain your steps
Enter workflows






Lots of data [high throughput]
Lots of parameters to choose [best practice]
An analysis takes a long time [long running]
The analysis services are unreliable [fault tolerance]
Lots of analysis steps [data and control flow]
Need to record and explain your steps [provenance]
Workflow-based middleware
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
myGrid
http://www.mygrid.org.uk
UK e-Science pilot project since 2001
Part of the Open Middleware Infrastructure Institute UK
Build middleware for Life Scientists that enables them to undertake in silico experiments and share those experiments and their results.
Individual scientists, in under-resourced labs, who use other people’s applications.
Open source.
Workflows & Semantic Technologies for metadata management.
Data flows. Ad hoc & exploratory
Phenotype
Genotype
?
200
Genes captured in microarray experiment and present in QTL region
Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping
Microarray + QTL
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
Key:
A – Retrieve genes in QTL region
B – Annotate genes with external database IDs
C – Cross-reference IDs with KEGG gene IDs
D – Retrieve microarray data from MaxD database
E – For each KEGG gene get the pathways it’s involved in
F – For each pathway get a description of what it does
G – For each KEGG gene get a description of what it does
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
Result



Captured the pathways returned by QTL and Microarray workflows over the MaxD microarray database
Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate.
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
Trichuris muris (mouse whipworm) infection
Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
Manual experimentation: two-year study of candidate genes, processes unidentified
Workflows: the trypanosomiasis cattle experiment workflow was reused without change. Analysis of the data by a biologist found the processes in a couple of days.
[Joanne Pennock, Paul Fisher, 2006]
Changing scientific practice

Systematic and comprehensive automation.


Dry people hypothesise, wet people validate.


“make sense of this data” -> “does this make sense?”
Workflow factories.


Eliminated user bias and premature filtering of datasets and results leading to single-sided, expert-driven hypotheses
Different dataset, different result
Accurate provenance.
User Uptake

~25000 downloads

Systems biology
Proteomics
Gene/protein annotation
Microarray data analysis
Medical image analysis
Heart simulations
High throughput screening
Phenotypical studies
Plants, Mouse, Human
Astronomy
Dilbert Cartoons










Finding and Sharing Tools
3rd Party Applications and Portals
DAS
Taverna Workbench
myExperiment
Utopia
Feta
Workflow Enactor
Clients
Service Management
LSIDs
Provenance log
Metadata
KAVE
Default Data Store
BAKLAVA
Custom Store
Results Management
Taverna workbench

3000+ services

Open domain services and resources, Third party.
Enforce NO common data model.
No common typing, Missing metadata.
Soaplab
InstantSoap
Services Landscape
User Interaction


Allows a workflow to call out to an expert human user
E.g. used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline
[University of Bergen]
Tools, Tools, Tools
Pedro Annotation tool
Feta Search tool
Capture and Curation Effort
Ontology and Annotation Curation Team
Franck Tanoh and Katy Wolstencroft
Community Scientists
Community Service Providers
Shielding & Extensible plug-ins
Taverna Workbench Application
Scufl Model
Simple Conceptual Unified Flow Language
Nested workflows, Automatic iterations, Best guess data type handling
Workflow Execution
Workflow enactor
Processors: BioMART, SeqHound, Plain Web Service, Soaplab, BioMOBY, Local Java App, WF Enactor, WSRF, Beanshell
Duncan Hull, myGrid
Khalid Belhajjame, ISPIDER
Service incompatibility



Fix up the services to be compatible or….
Shims – libraries of adapters.
Automated data type matching using reasoning over a mismatch and service ontology
Mismatch detection
Shim identification
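
The shim idea on this slide can be sketched as a small adapter registry. This is only an illustration: the type labels ("fasta", "raw_sequence") and the function names are assumptions, and a plain dictionary lookup stands in for the ontology reasoning mentioned on the slide.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type labels: the real mismatch and service ontologies are not
# shown in the slides, so plain strings stand in for ontology concepts.
SHIMS: Dict[Tuple[str, str], Callable[[str], str]] = {}


def register_shim(source_type: str, target_type: str):
    """Register an adapter that converts source_type data into target_type."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        SHIMS[(source_type, target_type)] = fn
        return fn
    return decorator


@register_shim("fasta", "raw_sequence")
def fasta_to_raw(text: str) -> str:
    """Drop FASTA header lines and join the remaining sequence lines."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def connect(output_type: str, input_type: str, value: str) -> str:
    """Pass a value between two services, inserting a shim on a mismatch."""
    if output_type == input_type:  # mismatch detection: types already agree
        return value
    shim = SHIMS.get((output_type, input_type))  # shim identification
    if shim is None:
        raise LookupError(f"no shim from {output_type} to {input_type}")
    return shim(value)


print(connect("fasta", "raw_sequence", ">seq1\nacatttctac\ncaacagtgga\n"))
```

A library of such adapters keeps the third-party services untouched, which matters when the services are owned by other people.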
Service failure?
Most services are owned by other people
  No control over service failure
  Some are research level
Workflows only as good as the services they connect.
  Notify failures
  Instigate retries
  Set criticality
  Substitute services
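
The four fault-handling options listed above (notify, retry, criticality, substitution) can be sketched in a few lines. This is a hedged illustration, not Taverna's enactor code: the function name, its parameters and the example services are all assumptions.

```python
import time
from typing import Any, Callable, Optional, Sequence


def invoke_with_fallback(
    services: Sequence[Callable[[Any], Any]],
    payload: Any,
    retries: int = 2,
    delay: float = 0.0,
    critical: bool = True,
    notify: Callable[[str], None] = print,
) -> Optional[Any]:
    """Try each service in order; the first is primary, the rest substitutes."""
    for service in services:
        name = getattr(service, "__name__", repr(service))
        # One initial attempt plus `retries` further attempts per service.
        for attempt in range(1, retries + 2):
            try:
                return service(payload)
            except Exception as exc:
                notify(f"{name} failed (attempt {attempt}): {exc}")
                if delay and attempt <= retries:
                    time.sleep(delay)
    if critical:
        # A critical step aborts the workflow when every service has failed.
        raise RuntimeError("all services failed for a critical step")
    return None  # a non-critical step lets the workflow carry on


def primary_blast(sequence: str) -> str:
    """Hypothetical primary service that is currently down."""
    raise IOError("service unavailable")


def mirror_blast(sequence: str) -> str:
    """Hypothetical substitute service."""
    return f"report for {sequence}"


print(invoke_with_fallback([primary_blast, mirror_blast], "acatttctac", retries=1))
```

Passing `critical=False` models a step whose failure should not stop the rest of the workflow.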
Provenance Collection



Observes events from the workflow engine
Populates an RDF triple store with information from these events
Browse interface
  Simple browser replicates Taverna’s existing result and status browser
  Graphical browser
[Figure: provenance graph]
  Data nodes: urn:hit1 …, urn:hit2 …, urn:hit5…, urn:hit8…, urn:hit10 …, urn:hit50 …, urn:data1, urn:data2, urn:data:3, urn:data:f1, urn:data:f2
  Invocation nodes: urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5
  Concepts: Sequence_hit, SwissProt_seq, Blast_report, DatumCollection, LSDatum
  Edge labels: [instanceOf], [type], [input], [output], [contains], [hasHits], [hasName], [performsTask], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom]
  Annotations: Find similar sequence, Missed sequence, New sequence
  Legend: Data generated by services/workflows, Properties, Concepts, Services, literals
ProQA Query API
[Zhao et al 07 provenance challenge paper]
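
The collection step described on this slide (observe enactor events, record them as triples, query them later) can be sketched with an in-memory set of triples. This is an assumption-laden illustration: the class, the event callback and the identifiers are hypothetical, only the predicate names are taken from the figure, and a real deployment would use an RDF store rather than a Python set.

```python
from typing import List, NamedTuple, Optional, Sequence, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class ProvenanceStore:
    """Tiny in-memory stand-in for an RDF triple store."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def on_invocation(self, invocation: str, task: str,
                      inputs: Sequence[str], outputs: Sequence[str]) -> None:
        """Hypothetical callback fired once per service invocation event."""
        self.triples.add(Triple(invocation, "performsTask", task))
        for datum in inputs:
            self.triples.add(Triple(invocation, "input", datum))
        for datum in outputs:
            self.triples.add(Triple(invocation, "output", datum))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = ProvenanceStore()
# Made-up identifiers, used only to exercise the sketch.
store.on_invocation("urn:example:invocation1", "Find similar sequence",
                    inputs=["urn:example:data1"],
                    outputs=["urn:example:hit1", "urn:example:hit2"])
for triple in store.match(predicate="output"):
    print(triple.subject, triple.predicate, triple.obj)
```

A browser over such a store only needs pattern queries like `match`, which is roughly the role a query API plays on top of the triple store.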
Provenance Tracking
Which Ensembl gene does pathway mmu004620 come from?
Workflows over Results
Automatically backtrack through the data provenance graph
[Figure: provenance graph with dF edges over Entrez, Pathway_id, KEGG_id, Uniprot and Ensembl_gene_id]
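
Backtracking through a provenance graph, as named above, amounts to following derivation links from a result back to its sources. The sketch below assumes the links are available as a simple mapping from each item to the items it was derived from; the mapping, the identifiers (other than the pathway id quoted on the previous slide) and the chain order are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Sequence


def ancestors(derived_from: Dict[str, Sequence[str]], start: str) -> List[str]:
    """Breadth-first walk over derivation links, nearest ancestors first."""
    seen = set()
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical derivation links; the real graph comes from the provenance store.
derived_from = {
    "pathway:mmu004620": ["kegg:gene-a"],
    "kegg:gene-a": ["entrez:gene-a"],
    "entrez:gene-a": ["ensembl:gene-a"],
    "uniprot:protein-a": ["ensembl:gene-a"],
}

sources = [item for item in ancestors(derived_from, "pathway:mmu004620")
           if item.startswith("ensembl:")]
print(sources)
```

The same walk answers the question for any result item, which is why it can itself be packaged as a workflow over results.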
A workflow marketplace
webTaverna GUI - main
[Diagram: myGrid and OMII-UK release process. Labels: myGrid, Source-forge community, Alliance, Ingest, Evaluation, Pre-release, Prioritise & Plan, Software Engineering XP, Production, OMII-UK Release, myGrid Release, Software Engineering Quality & Test, OMII Software Engineering Quality & Test, Applications & Professional Services, Pioneers, Early adopters, Conservatives]
Who are the OMII Users?
Different scientific/research domains
Different activities
End Users
Application Developers
Service and Middleware Developers
Middleware Deployers
Systems Administrators
Increasing variation in requirements with the scientific domain.
Taverna is now part of OMII-UK



Taverna 1.5 – Today!
Taverna 1.6
myExperiment
Taverna 1.5





Integrated provenance
Raven release mechanism to simplify updates for the user
~300 semantic annotations for core services
Patterns for using proxies for bulk data transactions
Redeveloped plug-in and enactor framework, improved iteration events, data management
Taverna 1.5
~300 semantic annotations for core services
Add_ncbi_to_string: beanshell script, need to ask Paul for more details
  Input:
  Output:
Kegg_gene_ids_all_species (bconv): converts external IDs to KEGG IDs [mapping]
  string: External ID, e.g. NCBI ID [Genebank_GI]
  return: KEGG gene ID [KEGG_record_id]
Get_pathways_by_genes: Search all pathways which include all the given genes [Searching]
  Input: List of KEGG gene IDs [KEGG_gene_id]
  Output: List of pathway_id values for the specified KEGG gene IDs
Merge_pathways
  Stringlist
  Concatenated
This workflow takes in Entrez gene ids, then adds the string "ncbi-geneid:" to the start of each gene id. These gene ids are then cross-referenced to KEGG gene ids. Each KEGG gene id is then sent to the KEGG pathway database and its relevant pathways returned.
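
The data flow of this workflow can be sketched as three steps in plain Python. Only the "ncbi-geneid:" prefix comes from the slide; the two lookup functions are passed in as parameters because the actual KEGG service calls are not shown here, and the lookup tables in the usage part are made-up stand-ins.

```python
from typing import Callable, Dict, Iterable, List


def add_ncbi_prefix(entrez_ids: Iterable[str]) -> List[str]:
    """Step 1: prepend "ncbi-geneid:" to each Entrez gene id."""
    return [f"ncbi-geneid:{gene_id}" for gene_id in entrez_ids]


def pathways_for_entrez_genes(
    entrez_ids: Iterable[str],
    to_kegg_ids: Callable[[str], List[str]],
    get_pathways: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Steps 2 and 3: cross-reference to KEGG gene ids, then fetch pathways.

    to_kegg_ids and get_pathways are placeholders for the remote services.
    """
    pathways: Dict[str, List[str]] = {}
    for external_id in add_ncbi_prefix(entrez_ids):
        for kegg_id in to_kegg_ids(external_id):
            pathways[kegg_id] = get_pathways(kegg_id)
    return pathways


# Made-up lookup tables standing in for the KEGG services.
FAKE_XREF = {"ncbi-geneid:1001": ["kegg:gene-a"]}
FAKE_PATHWAYS = {"kegg:gene-a": ["path:example-1", "path:example-2"]}

result = pathways_for_entrez_genes(
    ["1001", "1002"],
    to_kegg_ids=lambda ext: FAKE_XREF.get(ext, []),
    get_pathways=lambda kegg: FAKE_PATHWAYS.get(kegg, []),
)
print(result)
```

Injecting the lookups keeps the sketch runnable offline and mirrors how a workflow wires independent services together.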
Taverna 1.6

Due out Summer 2007
  Revised enactment core
  Native support for long-running workflows
  Data proxy to deal with bulk data transactions
  Improved service discovery and provenance management
Obtaining Taverna

Taverna is available under the LGPL from our project site on Sourceforge.net
http://taverna.sourceforge.net
Win32, Solaris / Linux & OS-X
Includes online and downloadable user manual, examples etc.
Support via project mailing lists
Conclusions
See plans for Taverna 2.0 on myGrid wiki
Taverna development is user-driven



Please keep in touch and tell us what you would like to see via the myGrid mailing lists: Taverna Users, Taverna Hackers
Taverna http://taverna.sourceforge.net
myGrid http://www.mygrid.org.uk
OMII-UK http://www.omii.ac.uk
Acknowledgements



Phase1 myGrid researchers, Phase2 OMII-UK, myGrid Research Team
Peter Li, Paul Fisher, Andy Brass, Robert Stevens, Mark Wilkinson
EPSRC, Wellcome Foundation, EU