Transcript Workflows

Bioinformatics Workflows
Chris Wroe
(based on material from the myGrid team &
May Tassabehji / Hannah Tipney
Medical Genetics, St Marys)
Bioinformatics pipelines on the web
[Diagram labels: RepeatMasker, BLASTn, Twinscan]
• Copying and pasting from one web-based application to
the next, with annotation by hand
• Advantages: quick, easy access to distributed
resources
• Disadvantages: time-consuming, error-prone, tacit
procedure so difficult to share both protocol and results
Automating pipelines
• Using Perl/Matlab scripts to implement a
pipeline
• Advantages: automation, quick to write,
significant community resources (e.g. BioPerl)
• Disadvantages: hard to explain, hard to
relocate, hard to tinker with.
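A minimal sketch of what such a scripted pipeline tends to look like, written in Python rather than Perl or Matlab. The three functions are hypothetical stand-ins that only manipulate strings; they do not call RepeatMasker, BLASTn or Twinscan. The point is that the order of the steps and the data passed between them exist only inside run_pipeline.

```python
from typing import List


def repeat_masker(sequence: str) -> str:
    """Stand-in: treat lowercase bases as repeats and mask them with 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


def blastn(sequence: str) -> List[str]:
    """Stand-in: report one fake hit per unmasked run of bases."""
    runs = [run for run in sequence.split("N") if run]
    return [f"hit_{i}_{len(run)}bp" for i, run in enumerate(runs, start=1)]


def twinscan(sequence: str, hits: List[str]) -> List[str]:
    """Stand-in: pretend each hit supports one predicted gene."""
    return [f"gene_from_{hit}" for hit in hits]


def run_pipeline(sequence: str) -> List[str]:
    # The pipeline structure is implicit in this function body, which is
    # why scripts like this are hard to explain or rearrange.
    masked = repeat_masker(sequence)
    hits = blastn(masked)
    return twinscan(masked, hits)


if __name__ == "__main__":
    print(run_pipeline("ACGTacgtGGCC"))
```

Changing the order of steps, or swapping one tool for another, means editing the script itself.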
Workflows
[Diagram: Sequence in → RepeatMasker Web Service → BLASTn Web Service → Twinscan Web Service → Predicted genes out]
• Simple scripting language aims to specify how steps of a
pipeline link together
• High level picture of the pipeline separated from any low
level fiddling
• Application logic and low level fiddling encapsulated in
remote web services
• Advantages: automation, quick to write, easier to
explain, share and relocate; provenance of results is
recorded in a standard way
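The separation described above can be sketched in Python as follows. This is an illustration only, not Scufl syntax and not how Taverna or FreeFluo work internally: WORKFLOW, run_workflow and the three service functions are hypothetical names, and the "services" are local stand-ins rather than remote web services. The sketch also assumes the steps are listed in an order that satisfies their dependencies.

```python
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical stand-ins for remote web services (application logic).
def mask_repeats(sequence: str) -> str:
    return "".join("N" if base.islower() else base for base in sequence)


def find_hits(masked: str) -> List[str]:
    return [run for run in masked.split("N") if run]


def predict_genes(masked: str, hits: List[str]) -> List[str]:
    return [f"gene@{masked.index(hit)}" for hit in hits]


# High-level picture: (output name, service, names of its inputs).
Step = Tuple[str, Callable[..., Any], List[str]]
WORKFLOW: List[Step] = [
    ("masked", mask_repeats, ["sequence"]),
    ("hits", find_hits, ["masked"]),
    ("genes", predict_genes, ["masked", "hits"]),
]


def run_workflow(workflow: List[Step], inputs: Dict[str, Any]):
    """Run each step in order and keep a simple provenance log."""
    data = dict(inputs)
    provenance = []
    for output_name, service, input_names in workflow:
        args = [data[name] for name in input_names]
        data[output_name] = service(*args)
        provenance.append(
            {"step": service.__name__, "inputs": input_names, "output": output_name}
        )
    return data, provenance


if __name__ == "__main__":
    results, log = run_workflow(WORKFLOW, {"sequence": "ACGTacgtGGCC"})
    print(results["genes"])
    for entry in log:
        print(entry)
```

Because the links between steps are data rather than code, the same description can be shared, redrawn as a diagram, or rerun, and the engine can log which step produced which result.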
Workflow components in myGrid
• Scufl – Simple Conceptual Unified Flow Language
– Developed by myGrid members at EBI.
– Designed to be as simple as possible, just enough features to
support bioinformatics workflows
• Taverna – a tool for writing and running
workflows and examining results
(http://taverna.sourceforge.net)
• FreeFluo – workflow engine to run
workflows
(http://freefluo.sourceforge.net)
Workflow use
• Newcastle University (Anil Wipat, Peter Li)
– Affymetrix Microarray Analysis Workflow
– Gene annotation workflow
• Manchester University
May Tassabehji, PhD student Hannah Tipney, Medical Genetics,
St Marys (Wellcome Trust funded)
– Gene alerting service workflow (GAS)
– Gene and protein annotation workflow
• And others
Workflow experience: positives
• Easy to get started with Taverna (1-2 hour
tutorial)
• Sharing does happen
• Cuts the time taken to run one
pipeline from 2 weeks to 2 hours
Workflow experience: outstanding issues
• Early days: web services are rare; significant time is
taken to wrap applications as web services
(licensing, installation, maintenance)
– Soaplab and Gowlab try to help
(http://industry.ebi.ac.uk/soaplab)
• Fiddly bits don’t go away: Many ‘shim’ services
needed to ensure the output of one step fits the
expected input of another
• Automation produces many results in a short
amount of time, which raises issues of result
management and display
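As an illustration of the 'shim' services mentioned above, here is a minimal Python sketch of two format adapters. The scenario is an assumption made for the example: one step emits a FASTA record while the next expects a bare sequence string, or the reverse. The function names fasta_to_plain and plain_to_fasta are hypothetical and do not belong to any myGrid component.

```python
def fasta_to_plain(fasta_text: str) -> str:
    """Shim: drop FASTA header lines and join the sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def plain_to_fasta(sequence: str, identifier: str, width: int = 60) -> str:
    """Shim: wrap a bare sequence as a single FASTA record."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + chunks)


if __name__ == "__main__":
    record = plain_to_fasta("ACGTGGCC", "seq1", width=4)
    print(record)
    print(fasta_to_plain(record))
```

Each shim is trivial on its own; the cost comes from needing one for nearly every pair of steps whose formats do not line up.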
Other workflow systems
• Commercial bioinformatics – drug
discovery
– Incogen VIBE
– TurboWorx
– Pipeline Pilot (SciTegic)
• eScience
– DiscoveryNet (bioinformatics – proprietary)
– Kepler (US ecology)
– Triana (UK physics: astronomy, signal
processing)
Workflow standards
• Can’t have enough of them! All currently come
from e-Business rather than science community
• BPEL – Business Process Execution Language
• WS – Orchestration
• XML Process Definition Language (XPDL)
• Business Process Markup Language (BPML)