

Introduction to Workflows with
Taverna and myExperiment
Aleksandra Pawlik
University of Manchester
materials by Dr Katy Wolstencroft
Why are workflows important?

The 21st century is the century of information

More data will be produced in the next 5 years
than in the entire history of humankind

NeSC e-Science strategy 2008
Data Deluge




eGovernment
World bank data
Climate change data
Large scale physics




Large Hadron collider
Astronomy
‘Omics data
Next Gen Sequencing
Lots of Resources
NAR 2012 – 1500 databases
Next Generation Sequencing

1000 Genomes Project
A Deep Catalog of Human Genetic Variation

10,000 Genome Project (Genome 10K)
A genomic zoo: DNA sequences of 10,000 vertebrate
species, approximately one for every vertebrate
genus.
Human Microbiome

Characterise the microbial communities found at
several different sites on the human body
Where is the data?





In repositories run by major service providers
(e.g. NCBI, EBI)
In local project stores
On web pages
On ftp servers
No defined formats
Distribution



Data resources
Computational power
Researchers and collaborators
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
What that means for
Bioinformatics

Sequential use of distributed tools

Analysing large data sets

Incompatible input and output formats

Difficult to record parameter selections

It's OK for one gene or one protein, but what about
10,000?
Workflow as a Solution

Sophisticated analysis
pipelines
A set of services to analyse
or manage data (either local
or remote)

Data flow through services

Control of service
invocation

Iteration

Automation

Workflows as a solution





Flow of data from one tool to the next is
automatic
Incompatibilities overcome in the workflow with
‘helper’ services (known as shims)
Workflow records parameter values and
algorithms
Workflows can include data integration and
visualisation without the loss of information
Iteration over large data sets automatic – ideal
for high throughput analysis (e.g. omics)
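
The points above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Taverna itself is implemented: the `Workflow` class, the `gc_content` and `round_value` "services" and the `genbank_to_plain` shim are all hypothetical names invented for this example. The sketch shows data passing automatically from one step to the next, a shim bridging incompatible formats, parameter values being recorded, and iteration over several inputs.

```python
from dataclasses import dataclass, field


def genbank_to_plain(block: str) -> str:
    """Shim (hypothetical): drop position numbers and spaces from
    GenBank-style sequence lines so the next service gets a plain string."""
    return "".join(tok for tok in block.split() if not tok.isdigit())


def gc_content(sequence: str) -> float:
    """Toy analysis service (hypothetical): fraction of G/C bases."""
    seq = sequence.lower()
    return (seq.count("g") + seq.count("c")) / len(seq)


def round_value(value: float, digits: int) -> float:
    """Toy post-processing service with one parameter."""
    return round(value, digits)


@dataclass
class Workflow:
    steps: list = field(default_factory=list)  # ordered (name, function, params)
    log: list = field(default_factory=list)    # record of steps and parameters used

    def add(self, name, func, **params):
        self.steps.append((name, func, params))
        return self

    def run(self, data):
        # The output of each step becomes the input of the next one.
        for name, func, params in self.steps:
            data = func(data, **params)
            self.log.append({"step": name, "params": params})
        return data

    def run_all(self, inputs):
        # Iteration: the same workflow is applied to every input.
        return [self.run(item) for item in inputs]


wf = (Workflow()
      .add("shim", genbank_to_plain)
      .add("gc", gc_content)
      .add("round", round_value, digits=2))
results = wf.run_all(["1 ggggcccc", "1 aaaatttt", "1 acgt"])
```

Because every step and its parameters end up in `wf.log`, the same analysis can be re-run later on one sequence or on ten thousand.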
Reproducible Research
Preventing non-reproducible research
An array of errors
http://www.economist.com/node/21528593
Duke University, 2006: prediction of the course
of a patient’s lung cancer using expression arrays,
and recommendations on different
chemotherapies from cell cultures, reported in
Nature Medicine
Three different groups could not reproduce the
results and uncovered mistakes in the original
work

If the analyses were done
using workflows...



Reviewers could re-run experiments and see
results for themselves
Methods could be properly examined and
criticised
Mistakes could be pinpointed
Different Workflow Systems
VisTrails
Kepler
Triana
Ptolemy II
Taverna
BPEL
Pipeline Pilot
Galaxy
Taverna Workbench
http://www.taverna.org.uk/
Freely available, open source
Current version 2.4
80,000+ downloads across versions
Part of the myGrid Toolkit
Windows/Mac OS X/Linux/Unix
Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T.
Taverna: a tool for building and running workflows of services.
Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.
Taverna Workflows







Part of the UK e-Science myGrid project
Started in 2001 as a collaboration across the UK
Now: Manchester (Goble),
Oxford/Southampton (De Roure)
http://www.taverna.org.uk
Taverna desktop Client
Taverna Server
Taverna on the cloud
Taverna Workbench
Workflow engine to run workflows
List of services
Construct and visualise workflows
Web Services (e.g. KEGG)
Scripts (e.g. Beanshell, R)
Programming libraries (e.g. libSBML)
What are Web Services?
NOT the same as services on the web (i.e. web
forms)
Web services support machine-to-machine
interaction over a network
Therefore, you can connect to and use remote
services from your own computer automatically
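
To make the difference from a web form concrete, here is a minimal Python sketch of machine-to-machine interaction. Everything in it is an assumption made for illustration: the toy reverse-complement service, its `seq` query parameter and its JSON reply are invented, and the service runs on the local machine so that the example is self-contained. It is a plain HTTP/JSON (REST-style) call, not a WSDL service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen


class ReverseComplementHandler(BaseHTTPRequestHandler):
    """Hypothetical toy service: reverse-complements ?seq=... and replies in JSON."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        seq = query.get("seq", [""])[0].lower()
        complement = {"a": "t", "t": "a", "c": "g", "g": "c"}
        result = "".join(complement.get(base, "n") for base in reversed(seq))
        body = json.dumps({"input": seq, "reverse_complement": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def call_service(port: int, seq: str) -> dict:
    """Client side: a program, not a person filling in a form, makes the request."""
    with urlopen(f"http://127.0.0.1:{port}/?seq={seq}") as response:
        return json.load(response)


def demo() -> dict:
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), ReverseComplementHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return call_service(server.server_port, "acatttctac")
    finally:
        server.shutdown()
        server.server_close()
```

The reply is structured data that the next program can consume directly, which is what lets a workflow engine chain such calls together.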
Using Remote Tools and Services
with Taverna

Web Services









WSDL
REST
BioMart
R-processor
Grid Services
Local services
Beanshell (small, local scripts)
Workflows
And more.....
Who Provides the Services?
Open domain services and resources
• Taverna accesses thousands of services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model
How do you use the services?
Simple WSDL services
BioMoby ‘Semantic’ Services
Asynchronous services
Service catalogue entry: monitoring, provider, tags, submitter, service description
What do Scientists use Taverna for?
Astronomy
Music
Meteorology
Social Science
Cheminformatics
Workflows are …
... records and protocols (i.e. your in silico
experimental method)
... know-how and intellectual property
... hard work to develop and get right
... re-usable methods (i.e. you can build on the
work of others)
So why not share and re-use them?
Workflow Repository
Just Enough Sharing….


myExperiment can provide a central location for
workflows from one community/group
myExperiment allows you to say:





Who can look at your workflow
Who can download your workflow
Who can modify your workflow
Who can run your workflow
Ownership and attribution
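
The four questions above can be pictured as a small permission model. The Python sketch below is purely illustrative and is not myExperiment's actual data model or API: the `Permission` flags, the `SharedWorkflow` class and the user names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Permission(Flag):
    NONE = 0
    VIEW = auto()      # who can look at the workflow
    DOWNLOAD = auto()  # who can download it
    MODIFY = auto()    # who can modify it
    RUN = auto()       # who can run it


@dataclass
class SharedWorkflow:
    title: str
    owner: str                                  # ownership and attribution
    grants: dict = field(default_factory=dict)  # per-user permissions
    everyone: Permission = Permission.NONE      # what anybody may do

    def allowed(self, user: str, permission: Permission) -> bool:
        if user == self.owner:
            return True  # in this sketch the owner may do everything
        granted = self.grants.get(user, Permission.NONE) | self.everyone
        return permission in granted


shared = SharedWorkflow(
    title="Example workflow",
    owner="alice",
    grants={"bob": Permission.VIEW | Permission.RUN},
    everyone=Permission.VIEW,
)
```

Here `shared.allowed("bob", Permission.RUN)` is true while `shared.allowed("bob", Permission.MODIFY)` is false, which is the "just enough sharing" idea: each right is granted separately.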
Spectrum of Users
Advanced users (informaticians) design and
build workflows
Intermediate users reuse
and modify existing
workflows or components
Others “replay” workflows
through a web page (Load Data, Run Workflow)
http://www.myexperiment.org
A Collection of Tools
Workflow Repository
Workflow GUI Workbench and 3rd party plug-ins
Client User Interfaces
Web Portals
Service Catalogue
E-Laboratories
Provenance Store
Workflow Server
Activity and Service Plug-in Manager
Open Provenance Model
Secure Service Access, and Programming APIs
Programming and APIs
Summary – Workflow Advantages




Informatics often relies on data integration and
large-scale data analysis
Workflows are a mechanism for linking together
resources and analyses
Promote reproducible research
Easy to find and use successful analysis methods
developed by others with myExperiment
More Information

Taverna: http://www.taverna.org.uk
myExperiment: http://www.myexperiment.org
BioCatalogue: http://www.biocatalogue.org
Tutorial


Using Taverna to design and build workflows
Reusing workflows from myExperiment

Analyse a gene set from a ChIP-Seq
experiment by finding and reusing existing
workflows

Tutorials are available in the myExperiment
group: Cranfield Course - January 2014