Taverna workflow management system

http://taverna.org.uk/
Stian Soiland-Reyes
myGrid, School of Computer Science
University of Manchester, UK
UKOLN DevSci: Workflow Tools
Bath, 2010-11-30
What is myGrid?
• An e-Science collaboration since 2001
• Not a grid!
• Numerous partners involved:
  - University of Manchester
  - University of Southampton
  - University of Oxford
  - EMBL-EBI
• Provides sustainable, production-quality software
  - Supported by OMII-UK, EPSRC and BBSRC
• Mixture of developers, bioinformaticians and researchers
Software | Services | Content | Skills | Community

myGrid
http://mygrid.org.uk/
http://taverna.org.uk/

Motivation
• Challenge: Bioinformatics
  - Large amounts of data
  - Many open questions
  - Numerous freely available public datasets and analysis tools

Huge amounts of data
• Microarray: 1000+ genes
• QTL regions: 100+ genes
• Next-gen sequencing: 10,000+ genes
How do I look at all the genes systematically?

Manual approach
• Search using public web sites and databases
  - PubMed
  - UniProt
  - EBI BioMart
• Copy and paste to web tools for analysis
  - NCBI BLAST
  - EBI InterPro
• Further processing locally
  - R
  - Perl
  - Python

Manual: disadvantages
• Scale of the analysis task overwhelms researchers: lots of data
• User bias and premature filtering of datasets: cherry picking
• Hypothesis-driven approach to data analysis
• Constant changes in data: problems with reanalysis of data
• Implicit methodologies (hyper-linking through web pages)
• Error proliferation from any of the listed issues, notably human error

Web services and workflows
• Web services
  - Technology and standards for exposing code and data resources that can be programmatically consumed by a remote third party
  - Description of how to interact with the service: parameters, documentation
• Workflows
  - General technique for describing and executing a process
  - Describe what you want to do and which services to run

Taverna workflows
• A set of (local and remote) services to analyze or manage data
• Nested workflows are also services
• Data links connect services
  - i.e. output from service A is input to services B and C
  - Describes the desired dataflow instead of process coordination
• Automatic iterations
• Can customize list handling and control links
[Figure: example Taverna workflow diagram. Workflow inputs: chromosome_name, start_position, end_position. Nodes include genes_in_qtl, mmusculus_gene_ensembl, Kegg_gene_ids, Get_pathways, get_pathways_by_genes1 and create_report, joined by list-handling steps (merge, split, remove duplicates, remove nulls). Workflow outputs include gene_descriptions, genes_pathways, merged_pathways, pathway_descriptions, pathway_ids and kegg_external_gene_reference.]

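
The dataflow idea on the "Taverna workflows" slide (output from service A feeds services B and C, with automatic iteration when a list arrives where a single value is expected) can be sketched in a few lines of Python. This is only an illustration of the concept, not Taverna's engine or API: the names `Workflow`, `add_service`, `link` and `call_with_iteration`, the three toy services and the input string are all hypothetical, and each service is assumed to take exactly one input.

```python
from typing import Any, Callable, Dict, List, Tuple

Service = Callable[[Any], Any]


def call_with_iteration(service: Service, value: Any) -> Any:
    """Call the service once per item when handed a list, otherwise once."""
    if isinstance(value, list):
        return [call_with_iteration(service, item) for item in value]
    return service(value)


class Workflow:
    """Toy dataflow: services are functions, links say whose output feeds whom."""

    def __init__(self) -> None:
        self.services: Dict[str, Service] = {}
        self.links: List[Tuple[str, str]] = []  # (source, sink) pairs

    def add_service(self, name: str, service: Service) -> None:
        self.services[name] = service

    def link(self, source: str, sink: str) -> None:
        self.links.append((source, sink))

    def run(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        results: Dict[str, Any] = dict(inputs)
        pending = set(self.services)
        while pending:
            # A service is ready as soon as the data it depends on exists;
            # no explicit ordering of steps is ever written down.
            ready = [
                (sink, source)
                for source, sink in self.links
                if sink in pending and source in results
            ]
            if not ready:
                raise ValueError(f"no data available for: {sorted(pending)}")
            for sink, source in ready:
                results[sink] = call_with_iteration(self.services[sink], results[source])
                pending.discard(sink)
        return results


wf = Workflow()
wf.add_service("split_gene_ids", lambda text: text.split(","))  # service A
wf.add_service("to_upper", str.upper)                           # service B
wf.add_service("length", len)                                   # service C
wf.link("gene_ids", "split_gene_ids")
wf.link("split_gene_ids", "to_upper")
wf.link("split_gene_ids", "length")

results = wf.run({"gene_ids": "g1,gene22"})  # made-up identifiers
print(results["to_upper"], results["length"])
```

Here `to_upper` and `length` are written for a single string, yet both run over every element of the list produced by `split_gene_ids`. One simplification to note: this sketch iterates over any list it sees, so a service that genuinely wants a whole list as input cannot be expressed.
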
What types of services?
• Public/private/secured WSDL/SOAP web services
• RESTful web services
• Spreadsheet import
• Command line tools (local/ssh)
• Inline scripts (Beanshell, R)
• Java APIs
• Customizations:
  - BioMart, BioMoby / SADI
  - Soaplab
  - Grid services (Globus, EGEE gLite, caGrid)
  - … your tool (Plugin tutorial on wiki)

Which services?
• Taverna is general and can connect to standard web services for any domain
• Bioinformatics:
  - From professional third-party organisations providing robust & open data/analysis services
  - ... to under-the-desk web services for one particular purpose, run by PhD students
  - http://biocatalogue.org/ : 1730 services from 130 providers, crowd-sourced and quality-monitored

Taverna workbench
• Graphical desktop tool
• No server installation required
• Drag-and-drop services into diagram
• Connect services, run, reconnect, rerun
• Integrates diverse set of tools

Sharing workflows
• myExperiment.org allows users to share, find, download and rate workflows
  - “Facebook for the scientist”
  - 3000 members, 1100 workflows

Extensible UI and engine
• Plugins can provide new “perspectives”
  - e.g. BioCatalogue, myExperiment
• Provide service-specific customization
  - BioMart interface replicates web site
• Adding new functionality
  - Looping, branching, dynamic service resolution
  - New service types
  - Design helpers, “Find matching service”

Taverna 3 “Next-gen”
• Under development for 2011
  - Interactive, component-centric and data-centric workflow design
  - Pre-packaged workflow components
  - Searching for workflow components from BioCatalogue and myExperiment
  - New myGrid workflow components library

Taverna command line
• Executes from Windows/Linux/OS X shells
• Takes a predefined workflow with files as inputs and outputs
• Quick way to “productionize” a workflow

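
As a sketch of what "productionizing" a workflow from a script could look like, the snippet below assembles an invocation of the command line tool in Python. The launcher name `executeworkflow.sh` and the `-inputfile` / `-outputdir` options are quoted from memory of the Taverna 2 command line tool and should be treated as assumptions to verify against the installed version; the helper `build_taverna_command`, the workflow file and the input file names are hypothetical. The command is only built and printed, not executed.

```python
import shlex
from typing import Dict, List


def build_taverna_command(
    workflow: str,
    input_files: Dict[str, str],
    output_dir: str,
    launcher: str = "executeworkflow.sh",  # assumed launcher name
) -> List[str]:
    """Build an argument list: one file per workflow input port, one output dir."""
    argv = [launcher]
    for port, path in sorted(input_files.items()):
        # Assumed option: bind the contents of a file to a named input port.
        argv += ["-inputfile", port, path]
    # Assumed option: write each workflow output into this directory.
    argv += ["-outputdir", output_dir, workflow]
    return argv


command = build_taverna_command(
    "pathways.t2flow",  # hypothetical workflow file
    {"chromosome_name": "chromosome.txt", "start_position": "start.txt"},
    "results",
)
print(shlex.join(command))
```

Once the options have been checked, the list could be handed to `subprocess.run(command, check=True)` from a cron job or batch script.
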
Taverna Server
• REST/SOAP interface to execute workflows
  - Client libraries for Ruby and Java
  - Two demonstration web interfaces
    - Ruby
    - Java Portlets
• Future
  - Detailed execution support and control
  - Security delegation

Taverna portlet
• Example portlet implementation
• Executes workflows using Taverna Server

Ruby web interface
• Example customized web interface
• Uses Ruby gem t2-server

Taverna on the cloud
• Use-case:
  - SNP analysis and annotation of genomes sequenced from breeds of cows in Africa: why are some of them resistant to X?
  - Amazon EC2 with Taverna Server and local services
  - Custom (built-in-a-week) Ruby on Rails web interface
  - Runs through 31 chromosomes in 6.5 hours using 10 instances: $26

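
The figures for the cloud run imply a price per instance-hour, which a short calculation makes explicit. It assumes that all 10 instances ran for the full 6.5 hours and that the $26 covers instance time only; neither assumption is stated on the slide.

```python
chromosomes = 31
wall_clock_hours = 6.5
instances = 10
total_cost_usd = 26.0

# Assumption: every instance was busy for the whole wall-clock time.
instance_hours = wall_clock_hours * instances
cost_per_instance_hour = total_cost_usd / instance_hours
cost_per_chromosome = total_cost_usd / chromosomes

print(f"instance-hours:         {instance_hours:.1f}")
print(f"USD per instance-hour:  {cost_per_instance_hour:.2f}")
print(f"USD per chromosome:     {cost_per_chromosome:.2f}")
```

Under these assumptions the run used 65 instance-hours at about $0.40 each, or a little under a dollar per chromosome.
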
Open source, open development
• The Taverna tools are all open source and free to use
  - Large user community, active mailing lists
  - Lead developers: myGrid in Manchester
  - Contributors from across the world
  - PAL programme
  - myGrid provides training, tutorials and documentation

Acknowledgements
More information

http://www.mygrid.org.uk/

http://www.taverna.org.uk/

http://www.myexperiment.org/

http://www.biocatalogue.org/