Taverna workflow management system
http://taverna.org.uk/
Stian Soiland-Reyes
myGrid, School of Computer Science
University of Manchester, UK
UKOLN DevSci: Workflow Tools
Bath, 2010-11-30
What is myGrid?
An e-Science Collaboration Since 2001
Not a grid!
Numerous partners involved:
University of Manchester
University of Southampton
University of Oxford
EMBL-EBI
Provides sustainable and production quality software
Supported by OMII-UK, EPSRC and BBSRC
Mixture of developers, bioinformaticians and researchers
Software | Services | Content | Skills | Community
http://mygrid.org.uk/
http://taverna.org.uk/
Motivation
Challenge:
Bioinformatics
Large amounts of data
Many open questions
Numerous freely available public datasets and analysis tools
Huge amounts of data
Microarray: 1000+ genes
QTL regions: 100+ genes
Next Gen Sequencing: 10,000+ genes
How do I look at all the genes systematically?
Manual approach
Search using public web sites and databases
Pubmed
Uniprot
EBI BioMart
Copy and paste to web tools for analysis
NCBI Blast
EBI InterPro
Further processing locally
R
Perl
Python
Manual: disadvantages
• Scale of analysis task overwhelms researchers – lots of data
• User bias and premature filtering of datasets – cherry picking
• Hypothesis-driven approach to data analysis
• Constant changes in data – problems with reanalysis of data
• Implicit methodologies (hyper-linking through web pages)
• Error proliferation from any of the listed issues – notably human error
Web services and workflows
Web services
Technology and standards for exposing code and data resources that can be programmatically consumed by a remote third party
Description of how to interact with the service: parameters, documentation
Workflows
General technique for describing and executing a process
Describe what you want to do and which services to run
Taverna workflows
A set of (local and remote) services to analyze or manage data
Nested workflows are also services
Data links connect services
i.e. output from service A is input to service B and C
Describes the desired dataflow instead of process coordination
Automatic iterations
Can customize list handling and control links
[Figure: example workflow diagram. Workflow inputs: chromosome_name, start_position, end_position. Processors include genes_in_qtl, mmusculus_gene_ensembl, Get_pathways and create_report. Workflow outputs include gene_descriptions, genes_pathways, merged_pathways, pathway_descriptions, pathway_ids and kegg_external_gene_reference.]
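The dataflow idea on this slide can be illustrated with a small sketch in plain Python. This is not Taverna code or the Taverna API: the function names loosely echo processor names from the example diagram (split_by_regex, remove duplicates), while `describe_gene`, `iterate` and `run_workflow` are hypothetical stand-ins invented for illustration. The output of service A feeds both service B and service C, and a single-value service is applied once per list item, which mimics automatic iteration.

```python
import re
from typing import Callable, Dict, List


def split_by_regex(text: str, regex: str) -> List[str]:
    """Service A: split one string into a list of non-empty items."""
    return [item for item in re.split(regex, text) if item]


def remove_duplicates(items: List[str]) -> List[str]:
    """Service B: consumes the whole list, keeping first occurrences in order."""
    return list(dict.fromkeys(items))


def describe_gene(gene_id: str) -> str:
    """Service C: consumes a single value (hypothetical stand-in for a remote lookup)."""
    return f"description of {gene_id}"


def iterate(service: Callable[[str], str], values: List[str]) -> List[str]:
    """Mimic automatic iteration: call a single-value service once per list item."""
    return [service(value) for value in values]


def run_workflow(gene_ids: str, regex: str) -> Dict[str, List[str]]:
    """Wire the services together by their data links only."""
    split = split_by_regex(gene_ids, regex)        # service A
    unique = remove_duplicates(split)              # data link A -> B
    described = iterate(describe_gene, split)      # data link A -> C, iterated
    return {"unique_ids": unique, "gene_descriptions": described}


if __name__ == "__main__":
    print(run_workflow("g1,g2,g1", ","))
```

The workflow function only states which output feeds which input; it says nothing about scheduling, which is the sense in which a dataflow description differs from process coordination.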
What types of services?
Public/private/secured WSDL/SOAP web services
RESTful web services
Spreadsheet import
Command line tools (local/ssh)
Inline scripts (Beanshell, R)
Java APIs
Customizations:
BioMart, BioMoby / SADI
Soaplab
Grid services (Globus, EGEE gLite, caGrid)
… your tool (Plugin tutorial on wiki)
Which services?
Taverna is general and can connect to standard web services for any domain
Bioinformatics:
From professional third-party organisations providing robust & open data/analysis services
..to under-the-desk web services for one particular purpose, run by PhD students
http://biocatalogue.org/ - 1730 services from 130 providers – crowd sourced and quality monitored
Taverna workbench
Graphical desktop tool
No server installation required
Drag-and-drop services into diagram
Connect services, run, reconnect, rerun
Integrates diverse set of tools
Sharing workflows
myExperiment.org allows users to share, find, download and rate workflows
“Facebook for the scientist”
3000 members, 1100 workflows
Extensible UI and engine
Plugins can provide new “perspectives”
e.g. BioCatalogue, myExperiment
Provide service-specific customization
BioMart interface replicates web site
Adding new functionality
Looping, branching, dynamic service resolution
New service types
Design helpers, “Find matching service”
Taverna 3 “Next-gen”
Under development for 2011
Interactive, component-centric and data-centric workflow design
Pre-packaged workflow components
Searching for workflow components from BioCatalogue and myExperiment
New myGrid workflow components library
Taverna command line
Executes from Windows/Linux/OSX shells
Takes a predefined workflow with files as inputs and outputs
Quick way to “productionize” a workflow
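As a rough illustration of running a predefined workflow with files as inputs and outputs, the sketch below assembles a command line from Python without executing it. The launcher name `executeworkflow.sh` and the options `-inputfile` and `-outputdir` are assumptions based on the Taverna 2 command line tool and should be checked against the help output of the installed version. The workflow file, port name and paths are hypothetical.

```python
import shlex
from typing import Dict, List


def build_taverna_command(
    launcher: str,
    workflow: str,
    input_files: Dict[str, str],
    output_dir: str,
) -> List[str]:
    """Assemble an argument list; nothing is executed here."""
    cmd = [launcher]
    for port, path in input_files.items():
        # Assumed option: bind a workflow input port to the contents of a file.
        cmd += ["-inputfile", port, path]
    # Assumed option: directory that receives one entry per output port.
    cmd += ["-outputdir", output_dir, workflow]
    return cmd


if __name__ == "__main__":
    cmd = build_taverna_command(
        launcher="executeworkflow.sh",                     # assumed launcher name
        workflow="pathways.t2flow",                        # hypothetical workflow file
        input_files={"chromosome_name": "chromosome.txt"}, # hypothetical input file
        output_dir="results",
    )
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes the call easy to log, review and schedule, for example from cron or a batch system.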
Taverna Server
REST/SOAP interface to execute workflows
Client libraries for Ruby and Java
Two demonstration web interfaces
Ruby
Java Portlets
Future
Detailed execution support and control
Security delegation
Taverna portlet
Example portlet implementation
Executes workflows using Taverna Server
Ruby web interface
Example customized web interface
Uses Ruby gem t2-server
Taverna on the cloud
Use-case: SNP analysis and annotation of genomes sequenced from breeds of cows in Africa – why are some of them resistant to X?
Amazon EC2 with Taverna Server and local services
Custom (built-in-a-week) Ruby on Rails web interface
Runs through 31 chromosomes in 6.5 hours using 10 instances - $26
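The figures on this slide can be broken down into per-unit costs. The sketch below is simple arithmetic under two assumptions that the slide does not state: all 10 instances ran for the full 6.5 hours, and the $26 covers instance time only. The function name and dictionary keys are made up for illustration.

```python
from typing import Dict


def cloud_run_breakdown(
    chromosomes: int,
    wall_clock_hours: float,
    instances: int,
    total_cost_usd: float,
) -> Dict[str, float]:
    """Derive per-unit figures from the totals quoted on the slide."""
    # Assumption: every instance was up for the whole wall-clock duration.
    instance_hours = instances * wall_clock_hours
    return {
        "instance_hours": instance_hours,
        "usd_per_instance_hour": total_cost_usd / instance_hours,
        "usd_per_chromosome": total_cost_usd / chromosomes,
        "instance_hours_per_chromosome": instance_hours / chromosomes,
    }


if __name__ == "__main__":
    figures = cloud_run_breakdown(
        chromosomes=31, wall_clock_hours=6.5, instances=10, total_cost_usd=26.0
    )
    for name, value in figures.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the run used 65 instance-hours at about $0.40 per instance-hour, or roughly $0.84 and 2.1 instance-hours per chromosome.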
Open source, open development
The Taverna suite of tools is open source and free to use
Large user community, active mailing lists
Lead developers: myGrid in Manchester
Contributors from across the world
PAL programme
myGrid provides training, tutorials and documentation
Acknowledgements
More information
http://www.mygrid.org.uk/
http://www.taverna.org.uk/
http://www.myexperiment.org/
http://www.biocatalogue.org/