
An Introduction to Running, Reusing and
Sharing Workflows with Taverna – part 2
Aleksandra Pawlik
materials by
Katy Wolstencroft
University of Manchester



This tutorial will give you a basic introduction to reusing
workflows in Taverna and myExperiment.
We will also explore nested workflows and the workflow
engine (iteration, looping, parallel invocation).
As in the previous tutorial, the workflows in this practical
use small data sets and are designed to run in a few
minutes. In the real world, you would be using larger
data sets and workflows would typically run for longer.
Bigger Workflows: Enrichment Analysis



The previous examples were trivial, small tasks.
Taverna’s real power is in iterating over large data sets.
Many experiments result in a list of genes (e.g.
microarray analysis, ChIP-Seq, SNP identification).
In this exercise, we will use Taverna to analyse a gene
set from a ChIP-Seq experiment by finding and reusing
existing workflows
We will enrich our dataset by discovering:
1. Which pathways our genes are involved in
2. The functions of the genes
3. Literature evidence for the phenotype/trait of interest
Exercise 3: Re-using workflows from
myExperiment




Go to http://www.myexperiment.org and click on ‘find
workflows’
You will see a list of the most viewed and downloaded
workflows. See what the most popular workflow does by
reading its description
Change the rank to ‘Latest’ and see what has been
uploaded in the last few weeks
We will now find and download a workflow to identify the
pathways each gene in our gene set is involved in
Exercise 3: Re-using workflows from
myExperiment





Find the workflow called “UnigeneID to KEGG Pathways” and
look at the workflow entry page (uploaded by “Aleksandra
Pawlik”)
Download the workflow by clicking on the link: “Download
Workflow File/Package (T2FLOW)” and find out what it does
by reading the descriptions in myExperiment
Open the workflow in Taverna by going to ‘File -> Open
Workflow’
Run the workflow using the example values supplied (Hint:
when you run the workflow, the example values will be
added by default in the input window)
Look at the workflow output – now you will see pathway
information and pathway diagrams
Exercise 4: Combining workflows from
myExperiment







To analyse all the genes from our ChIP-Seq study, we
need to extract the gene list from our results file
To make it easier to work through the example, we have
provided a ChIP-Seq gene list on myExperiment. You
can find it under “GalaxyGeneList - short : datafile for
training”
Save this file to your local machine
Open the file in Excel
Save the file with a .csv extension
As you can see, the list of genes is in column D
Taverna can process and extract this column
automatically
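If you would like to see what this extraction step amounts to outside Taverna, here is a minimal Python sketch. It assumes the saved .csv file has a header row and that column D holds the gene identifiers. The function name `extract_column` and the sample rows are illustrative placeholders, not data from the training file.

```python
import csv
import io


def extract_column(csv_text, column_letter="D", skip_header=True):
    """Return the non-empty values of one spreadsheet column from CSV text."""
    index = ord(column_letter.upper()) - ord("A")  # "D" -> 3 (zero-based)
    rows = list(csv.reader(io.StringIO(csv_text)))
    if skip_header:
        rows = rows[1:]
    # Skip rows that are too short or whose cell in this column is empty.
    return [
        row[index].strip()
        for row in rows
        if len(row) > index and row[index].strip()
    ]


# Placeholder rows, not real data from the training file.
sample = "chrom,start,end,gene\nchr1,100,200,ID_A\nchr2,300,400,ID_B\n"
genes = extract_column(sample)
print(genes)
```

To try it on your own file, read the file contents into a string first and pass that string to `extract_column`.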
Exercise 4: Combining workflows from
myExperiment





In myExperiment, find and download the workflow
called “Import and convert gene list”
This workflow will extract the list of genes in column D
using Taverna’s built-in spreadsheet import tool (which
can be found in the services panel, for future
reference)
The next step in the workflow converts RefSeq IDs into
UniGene IDs (required for the pathways workflow –
converting between different types of identifiers is a
common problem in bioinformatics!)
Run the workflow. This time, in the input window,
select “set file location” and set the location to the
saved .csv gene list.
Look at the workflow results
Exercise 4: Combining workflows from
myExperiment




We will now combine the two workflows
While you are still in the “import and convert” workflow, go to
the top of the workbench and select “insert -> Nested workflow”
In the pop-up window, select “import from file” and find the
pathways workflow you downloaded earlier.
Click on “import workflow” and the pathways workflow will
appear in the main workflow diagram.
Exercise 4: Combining workflows from
myExperiment

Connect the workflows up by linking the output of the
‘Merge_Gene_List’ with the nested workflow input
Exercise 4: Combining workflows from
myExperiment
Create new output ports for the Nested workflow and connect the
Nested workflow outputs to the new outputs
NOTE: you don’t need to connect them all, just pathway
descriptions, pathway images and gene descriptions



Save the workflow
Run the workflow (it may take a few minutes)
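One way to think about a nested workflow is as a function called from inside another function. The sketch below is only an analogy: `import_and_convert` and `pathways_workflow` are hypothetical stand-ins that return placeholder strings, not the real services, and the output names are simplified. It shows the merged gene list feeding the nested workflow once per ID, with only the outputs you chose to connect being exposed.

```python
def import_and_convert(refseq_ids):
    """Stand-in for the outer workflow: one converted ID per input ID."""
    return [f"converted:{rid}" for rid in refseq_ids]


def pathways_workflow(unigene_id):
    """Stand-in for the nested pathways workflow; handles a single ID."""
    return {
        "pathway_description": f"pathways for {unigene_id}",
        "gene_description": f"description of {unigene_id}",
        "unconnected_output": f"ignored for {unigene_id}",
    }


def combined_workflow(refseq_ids):
    merged = import_and_convert(refseq_ids)
    # The nested workflow is invoked once per item in the merged list.
    results = [pathways_workflow(uid) for uid in merged]
    # Only the outputs that were connected to workflow output ports are kept.
    return {
        "pathway_descriptions": [r["pathway_description"] for r in results],
        "gene_descriptions": [r["gene_description"] for r in results],
    }


outputs = combined_workflow(["X1", "X2"])
print(outputs["pathway_descriptions"])
```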
Exercise 5: GO Associations
There are many different tools we could use to find Gene
Ontology associations for your gene list
For example, we could simply modify the BioMart/Ensembl
service in the ‘Import and convert gene list’ workflow we
have already used
Reload the ‘Import and Convert gene list’ workflow
Right-click on the ‘mmusculus_gene_ensembl’ service and
select ‘Copy’
Paste an extra copy of this service into the same workflow
diagram
Exercise 5: GO Associations
This is a BioMart service. It allows you to retrieve omics
data from Ensembl and other genomics resources. If you
are familiar with BioMart, you will see the interface in
Taverna is very similar to the web interface
We will modify the BioMart query to find all GO
associations for each gene associated with a ChIP-Seq
peak
Right-click on the new copy of the service and select
‘Configure BioMart Query’
Exercise 5: GO Associations
The inputs (or filters) already accept RefSeq IDs from our
input file, but we need to modify the outputs (or attributes)
Select ‘Attributes’ and expand the ‘External’ section.
Select ‘GO Term Accession’, ‘GO Name’ and ‘GO Domain’
Deselect ‘UniGeneID’ and select ‘RefSeq mRNA’
At the top of the page, change the output format from
multiple to single (TSV format)

(See screenshot on the next slide for an example)
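For reference, the same kind of query can be written by hand as a BioMart XML query. The sketch below only builds the XML text and does not contact any server. It assumes the XML query format used by BioMart's web service. The internal attribute and filter names (`refseq_mrna`, `go_id`, `name_1006`, `namespace_1003`) are assumptions that may differ between Ensembl releases, and the RefSeq IDs are placeholders. Whether Taverna sends exactly this XML is not something this tutorial states.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filter_name, filter_values, attributes):
    """Build a BioMart XML query string (no network access is performed)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",  # single tab-separated output
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Filters are the inputs: here, a comma-separated list of IDs.
    ET.SubElement(ds, "Filter",
                  {"name": filter_name, "value": ",".join(filter_values)})
    # Attributes are the outputs (the columns of the returned table).
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


query_xml = build_biomart_query(
    dataset="mmusculus_gene_ensembl",
    filter_name="refseq_mrna",                       # assumed internal name
    filter_values=["REFSEQ_ID_1", "REFSEQ_ID_2"],    # placeholder IDs
    attributes=["refseq_mrna", "go_id", "name_1006", "namespace_1003"],
)
print(query_xml)
```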
Exercise 5: GO Associations
Click ‘apply’ to save your changes, then ‘close’ to go back
to Taverna
At the top of the workflow diagram, change the workflow
view to show all ports by clicking on the table icon
Exercise 5: GO Associations
Connect your new service to the workflow by linking
the ‘D’ output port of the spreadsheet service to
the input of your new service
Make the new output ports and connect them
as shown to your new service
Exercise 5: GO Associations
Save the workflow by going to ‘File -> Save Workflow’
Run the workflow
Download and view the GO report
Exercise 6: Simple Text Mining
So far we have looked at enriching the genomic
information, but we could also use workflows for running
data analyses (e.g. aligning mouse genes with human
homologues) or performing literature searches
Think about the ways you could extend this analysis with
literature searches (e.g. correlations between pathways,
genes, GO terms and phenotypes)
Search myExperiment for workflows involving text mining,
using the search terms “text mining” and “Pubmed”
Exercise 6: Text Mining

Find and open the workflow “Phenotype to pubmed”
One of the services in the nested workflow is no longer
available (the faded-out service). Taverna checks the
availability of each service when you load the workflow and
when you run it
In this case, the workflow will still run without the final
nested workflow (‘clean text’)
Delete the ‘clean text’ nested workflow (by selecting it and
right-clicking), and reconnect the workflow output
Run the workflow with the search term ‘erythropoiesis’ (or
a phenotype term to describe the disease you are
studying)
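If you want to try a comparable search outside Taverna, NCBI's E-utilities offer an `esearch` endpoint for PubMed. The sketch below only builds the request URL and does not send it. The workflow itself may use a different service internally, so treat this as an illustration; the function name and the `max_results` default are assumptions.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Return an esearch URL for a PubMed query (the URL is not fetched)."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    # urlencode escapes spaces and special characters in the search term.
    return f"{ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("erythropoiesis")
print(url)
```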
Advanced Exercises

These exercises have given you a brief introduction to
Taverna, but we have just scratched the surface.

The Taverna engine can also help you control the data flow
through your workflows. It allows you to manage iterations
and loops, add your own scripts and tools, and make your
workflows more robust
The following exercises give you a brief introduction to
some of these features

Exercise 7: Iteration
As you have already seen, Taverna can automatically
iterate over sets of data.
When 2 sets of iterated data are combined, however,
Taverna needs extra information about how they should
be combined. You can have:
A cross product – combining every item from list 1 with
every item from list 2 - all against all
A dot product – only combining item 1 from list 1 with
item 1 from list 2, and so on – line against line
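The two strategies can be sketched in a few lines of Python, using a list of colours and a list of animals as example inputs. This is an analogy rather than Taverna's implementation: `itertools.product` plays the role of the cross product and `zip` the role of the dot product. Note that `zip` stops at the end of the shorter list.

```python
from itertools import product

colours = ["red", "green", "blue", "yellow"]
animals = ["cat", "donkey", "koala"]

# Cross product: every colour with every animal (all against all).
cross = [f"{c} {a}" for c, a in product(colours, animals)]

# Dot product: first with first, second with second (line against line).
dot = [f"{c} {a}" for c, a in zip(colours, animals)]

print(len(cross))  # 12
print(dot)         # ['red cat', 'green donkey', 'blue koala']
```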
Exercise 7: Iteration



Find and load the workflow ‘Demonstration of
configurable iteration’ from myExperiment
Read the workflow metadata to find out what the
workflow does (by looking at the ‘Details’)
Select the ‘ColourAnimals’ service, open ‘Details’ in the
workflow explorer, and choose ‘configure list
handling’
Click on ‘dot product’ in the pop-up window. This allows
you to switch to cross product
Exercise 7: Iteration


Run the workflow twice – once with ‘dot product’ and once
with ‘cross product’.
Save the first results so you can compare them – what is
the difference? What does it mean to specify dot or cross
product?
NOTE: The iteration strategies are very important. Setting
cross product instead of dot when you have 2000 data
items can cause large and unnecessary increases in
computation!
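To put a number on this, here is a small sketch that counts service invocations for each strategy. It assumes two input lists of 2000 items each, which is one reading of the note above; the function name is illustrative.

```python
def invocation_count(len_a, len_b, strategy):
    """Number of service calls needed to combine two input lists."""
    if strategy == "cross":
        return len_a * len_b       # every item against every item
    if strategy == "dot":
        return min(len_a, len_b)   # items paired by position
    raise ValueError(f"unknown strategy: {strategy}")


print(invocation_count(2000, 2000, "dot"))    # 2000
print(invocation_count(2000, 2000, "cross"))  # 4000000
```

With these assumed list sizes, cross product makes 2000 times as many calls as dot product.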
Exercise 7: Iteration
Two input lists, e.g. red, green, blue, yellow and cat, donkey, koala
How does Taverna combine them?
Exercise 7:
Cross product (all against all)
List 1: Red, Green, Blue, Yellow
List 2: Cat, Donkey, Koala
Result:
Red cat, red donkey, red koala
Green cat, green donkey, green koala
Blue cat, blue donkey, blue koala
Yellow cat, yellow donkey, yellow koala
Exercise 7:
Dot product (line against line)
List 1: Red, Green, Blue, Yellow
List 2: Cat, Donkey, Koala
Result:
Red cat
Green donkey
Blue koala
There is no yellow animal because the list lengths don’t match!
Exercise 7:

The default in Taverna is cross product

Be careful! All against all in large iterations gives
very big numbers!
Exercise 8: Looping




From myExperiment, find the workflow ‘InterproScan
without Looping’ by Katy Wolstencroft
InterProScan analyses a given protein sequence (or set of
sequences) for functional motifs and domains
This workflow is asynchronous. This means that when
you submit data to the ‘runInterproScan’ service, it will
return a jobID and place your job in a queue (this is very
useful if your job will take a long time!)
The ‘Status’ nested workflow will query your job ID to find
out if it is complete
Exercise 8: Looping


The default behaviour in a workflow is to call each
service only once for each item of data – so what if your
job has not finished when the ‘Status’ workflow asks?
Run the workflow, using the default protein sequence
and your own email address (the EBI requires an
academic email address for it to run)
Almost every time, the workflow will fail because the
results have not been returned before the workflow
reaches the ‘get_results’ service
Exercise 8: Looping



This is where looping is useful. Taverna can keep running
the ‘status’ service until it reports that the job is done.
Select the ‘Status’ nested workflow and right-click. Select
‘configure running’ from the drop-down list (you could
also just click on ‘details’ in the workflow explorer).
Select ‘advanced’ and click on ‘add looping’
Use the drop-down boxes in the looping window to set
‘get_status_output_status’ ‘is_not_equal_to’ RUNNING
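The looping condition above behaves like a polling loop: keep asking for the status until it is no longer RUNNING. Here is a minimal Python sketch of that idea. The status source is a fake iterator rather than the real EBI service, and the `max_polls` safety limit is an addition for the sketch, not something configured in the exercise.

```python
import time


def poll_until_not_running(get_status, interval_seconds=0.0, max_polls=100):
    """Call get_status() repeatedly until it returns something other than RUNNING."""
    for _ in range(max_polls):
        status = get_status()
        if status != "RUNNING":
            return status  # e.g. DONE or ERROR
        time.sleep(interval_seconds)  # wait before asking again
    raise TimeoutError("job still RUNNING after max_polls checks")


# Fake status source: the job reports RUNNING twice, then DONE.
fake_statuses = iter(["RUNNING", "RUNNING", "DONE"])
final_status = poll_until_not_running(lambda: next(fake_statuses))
print(final_status)  # DONE
```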
Exercise 8: Looping



Save the workflow and run it again
This time, the workflow will run until the ‘Status’ nested
workflow reports that it is either DONE, or it has an
ERROR.
You will see results for ‘TextResults’, but you will still get
an error for ‘Graphical_results’. This is because there is
one more configuration to change – we also need ‘Control
Links’
Exercise 9: Control Links


A control link specifies that there is a dependency of one
service on another even though there is no data flowing
between them.
A control link is a line with a white circle at the end that
connects two services (see the link between the ‘Status’
nested workflow and ‘get_Result_input’)
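A control link can be pictured as an extra ordering constraint in a dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to show the effect. The service names come from this exercise, but the data links drawn between them are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order for a {service: prerequisites} map."""
    return list(TopologicalSorter(dependencies).static_order())


# Assumed data links: both services receive the job ID from runInterproScan.
data_links = {
    "Status": {"runInterproScan"},
    "getResult_graphical_input": {"runInterproScan"},
}

# The control link adds 'Status' as a prerequisite even though
# no data flows from 'Status' to 'getResult_graphical_input'.
with_control_link = {
    **data_links,
    "getResult_graphical_input": {"runInterproScan", "Status"},
}

order = run_order(with_control_link)
print(order)
```

With data links alone, nothing forces the result service to wait for ‘Status’; with the control link there is only one valid order.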
Exercise 9: Control Links





We will add control links to the other output type
Right-click on getResult_graphical_input and select ‘Run
after’ from the drop down menu.
Set it to ‘Run after’ -> ‘Status’
Save and run the workflow
Now you will see each result returned
Exercise 10: Retries: Making your Workflow
Robust




Web services can sometimes fail due to network
connectivity problems
If you are iterating over lots of data items, you can guard
against these temporary interruptions by adding retries to
your workflow
Load the ‘v2_Retry-Example’ workflow from the
myExperiment Next Generation Sequencing Tutorial
group. This workflow is designed to fail sometimes.
Run the workflow as it is and count the number of failed
iterations
Exercise 10: Retries: Making your Workflow
Robust





Now, select the ‘sometimes_fails’ service and select the
‘details’ tab in the workflow explorer panel
Click on ‘advanced’ and ‘configure’ for retries
In the pop-up box, change it so that it retries each service
iteration 2 times
Run the workflow again – how many failures do you get
this time?
Change the workflow to retry 5 times – does it work every
time now?
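The retry setting can be sketched as a small wrapper that calls a service again when it fails. In this sketch the flaky service is a deterministic stand-in that fails a fixed number of times per item, so the behaviour is repeatable; the real ‘sometimes_fails’ service presumably fails at random. All names here are illustrative.

```python
def call_with_retries(service, item, max_retries):
    """Call service(item); on a connection failure, retry up to max_retries times."""
    retries_used = 0
    while True:
        try:
            return service(item)
        except ConnectionError:
            if retries_used >= max_retries:
                raise  # give up: this iteration counts as failed
            retries_used += 1


def make_flaky_service(failures_before_success):
    """Build a stand-in service that fails a fixed number of times per item."""
    calls = {}

    def service(item):
        calls[item] = calls.get(item, 0) + 1
        if calls[item] <= failures_before_success:
            raise ConnectionError(f"temporary failure for {item}")
        return f"result for {item}"

    return service


flaky = make_flaky_service(failures_before_success=2)
print(call_with_retries(flaky, "item1", max_retries=2))  # result for item1
```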
Exercise 11: Parallel Service Invocation






If Taverna is iterating over lots of independent input data,
you can improve the efficiency of the workflow by running
those iterated jobs in parallel
Run the Retry workflow again and time how long it takes
Go back to the design window, right-click on the
‘sometimes_fails’ service, and select ‘configure running’
This time select ‘Parallel jobs’ and change the maximum
number to 20
Run the workflow again
Does it run faster?
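The ‘Parallel jobs’ setting is comparable to limiting the number of worker threads in a pool. Here is a minimal Python sketch using the standard `concurrent.futures` module. The slow service is a stand-in that just sleeps, and the timings you see will depend on your machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_service(item):
    """Stand-in for a remote call that takes a little while."""
    time.sleep(0.05)
    return item * 2


def run_iterations(items, max_parallel_jobs):
    """Run slow_service over items with at most max_parallel_jobs at once."""
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(slow_service, items))


items = list(range(20))
for jobs in (1, 20):
    start = time.perf_counter()
    results = run_iterations(items, max_parallel_jobs=jobs)
    elapsed = time.perf_counter() - start
    print(f"{jobs} parallel job(s): {elapsed:.2f} s")
```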
Exercise 11: Parallel Service Invocation:
Use with Caution


Setting parallel jobs makes your workflows run faster, but
you should be careful if you are using remote services.
Sometimes they have policies on the number of
concurrent jobs an individual should run (e.g. the EBI asks
that you do not submit more than 25 at once).
If you exceed this number, your service invocations may
be blocked by the provider. In extreme cases, the provider
may block your whole institution!