Transcript Tutorial

Taverna Workflows
myExperiment
Paul Fisher
University of Manchester
http://www.cs.man.ac.uk/~fisherp/NewcastleTutorial.ppt
Taverna
This tutorial is designed to introduce you to the Taverna 1.7
workflow workbench
Prerequisites - 1
Java







In order to run Taverna 1.7 on your computer you will need to have the latest Java installed. If you do
not have Java already installed, you can download it from this URL:
http://java.sun.com/javase/downloads/index.jsp
You will have a choice of the download you would like. Download the JDK with Java EE packaged up
too. This will give you the opportunity to develop web services and use the ones deployed by Java
developers at a later date. The Java Runtime Environment (JRE) being downloaded should be 1.5 or
later for Taverna to work.
If you have Java installed, but it is an earlier version, you will need to update it to 1.5 or later
otherwise Taverna will NOT work.
The minimal installation you will need is the standard JDK package.
Download the desired JDK by following the link on the website and choose a location on your
computer to save it to.
Open the saved file and follow the installation instructions to install Java on your computer
Restart your computer to complete the installation.
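The version requirement above (1.5 or later) can be checked mechanically. The following is a minimal sketch, assuming version strings in the 1.x.y_z form; the function names and the sample version strings are illustrative only and are not part of Taverna.

```python
import re

def parse_java_version(version):
    """Split a version string such as '1.5.0_22' into integers (1, 5, 0, 22)."""
    parts = re.split(r"[._]", version)
    return tuple(int(p) for p in parts if p.isdigit())

def meets_minimum(version, minimum="1.5"):
    """Return True if `version` is the same as or newer than `minimum`."""
    return parse_java_version(version) >= parse_java_version(minimum)

# Illustrative version strings, not output captured from a real machine.
for candidate in ("1.4.2_19", "1.5.0_22", "1.6.0_07"):
    status = "OK" if meets_minimum(candidate) else "too old"
    print(candidate, status)
```

The comparison works on tuples of integers, so 1.10 would correctly count as newer than 1.5, which a plain string comparison would get wrong.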
Prerequisites - 2
A zip package





You will also need a tool to unzip the downloaded workbench. There are various tools available on
the internet, including WinZip, 7-Zip, and a few others. Personally I prefer 7-Zip, which is free and
easy to use, available at the following URL:
http://www.7-zip.org/download.html
You will need to choose the appropriate file to download for your operating system, i.e. Windows,
Linux, or Mac.
Choose a location to save the file in and save it.
Locate your saved file and follow the installation instructions to install it on your computer.
Restart your computer to complete the installation.
Prerequisites - 3
Linux users - Graphviz

Those who are installing Taverna on Linux will also have to install Graphviz onto the system. This is
available at the following URL:
http://www.graphviz.org/

At the time of writing, I have no installation instructions for this package, so please refer to the user
documentation provided on the web site.
Downloading Taverna

Open your usual web browser and go to the myGrid homepage at the following URL:
http://www.mygrid.org.uk

Find and follow the link to download Taverna 1 on the web page:
http://www.mygrid.org.uk/tools/taverna/taverna-1/taverna-download/



Once on the ‘Download’ page, identify the relevant Taverna distribution you need.
Follow the link to download the workbench. The web page should redirect you to the
SourceForge page.
Choose a location to save the file and click OK.
Unzipping the workbench




Choose to “Unzip/Extract the files”, but not into the current directory.
You will need to choose a directory in which to unzip the files. I recommend
somewhere in the root drive of your computer so you can easily access it,
e.g. C:\myGrid\ .
You can change the name of the folder at this stage, e.g. to “Taverna”.
If you are using Taverna on Linux, please be sure that you have the relevant
access permissions to install and run Taverna in the desired directory.
If you need a Zip package – download and install “7-ZIP”
(find it using Google)
Opening Taverna





Locate your Taverna installation and open the Taverna folder.
Start Taverna by double-clicking on “runme.bat” (Windows users) or “runme.sh”
(Linux and Mac users).
If you have successfully installed Java, you should see a dialog box or command window
open, shortly followed by the Taverna application.
The first time Taverna starts after installation, it will need to update all of its
components. You do not need to do anything for this, as this happens as the workbench is
opening. You should see a graphic in the centre of your screen, with a download
progress. Each component will be shown loading in this progress bar in turn. Once this
has completed (depending on connection speed – about 5 minutes), the Taverna
workbench will open.
The Taverna workbench consists of 3 main panels for constructing workflows:



The Available services pane (Top Left side)
The Advanced Model Explorer pane (Bottom Left side)
The Diagram pane (Right side)
The 3 Panes of Taverna



The Available services pane is used to display the web services to the user. This list contains default
services from when the workbench starts. Once you become more experienced with the workbench,
you will be able to add your own services, including adding default services so they load
automatically when Taverna opens. This list contains WSDL web services, local BioJava widgets,
Soaplab services, and BioMoby objects. Each of these can be added to the workflow model
(workflow being constructed) so that a task can be achieved.
The Advanced Model Explorer (AME) pane contains the services used in the current workflow,
including the inputs, outputs, and data links between each service. Once populated with services, each
service can be expanded using the “+” button. This provides a list of the inputs and outputs that the
service takes in and expels. It is these inputs and outputs that allow you to connect services together.
The Diagram pane shows a graphical representation of the workflow being used/constructed. The
diagram can be adapted to view different aspects of the current workflow, to show all the ports for
all the services, only those ports that have been connected or bound, or to change the layout of the
workflow from portrait to landscape.
[Figure: the 3 panes of Taverna, labelled Available services, Advanced Model Explorer, and Diagram pane]
Advanced Model Explorer

AME – (bottom left panel)
The AME is the primary editing component within Taverna.
Through it you can load, save, and edit any property of a
workflow.
It enables you to:
build a workflow
add nested workflows
edit workflows by connecting services
add metadata to a workflow
Diagram Pane


Shows inputs / outputs, services and control flows
It allows you to change the view of a workflow,
save the visual representation, and explode or
implode nested workflows
Available services
Lists services available by default in Taverna – top left

~ 3500 services







Local java services
Simple web services
Soaplab services – legacy command-line application
R Processor
BioMart database services
BioMoby services
Beanshell processor
Allows the user to add new services or workflows from the web or from
file systems
New services can be gathered from anywhere on the web – the
default list are just a few we already know about – importing
others is very straightforward
Go to the DDBJ list of available web services at:
http://xml.nig.ac.jp/wsdl/index.jsp
These services were not designed for use in Taverna, but Taverna can use
them if you supply the address of the WSDL file

Click on the DDBJ blast service (http://xml.nig.ac.jp/wsdl/Blast.wsdl) and
copy the web page address

Go to the services panel in Taverna, and right-click on
‘Available Processors’ (at the top of the list). For each type of
service, you are given the option to add a new service, or set of services.

Select ‘Add new WSDL scavenger’. A window will pop up asking for
a web address


Enter the Blast Web service address you just copied
Scroll down to the bottom of the Services list and look at the
new DDBJ service that is now included, clicking on the “+” icon
next to the service
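A WSDL file is an XML document that lists the operations a web service offers, and those operations are what appear under the new service when you expand it. The sketch below shows the idea, assuming a tiny made-up WSDL fragment: the service name and the operation names ‘runQuery’ and ‘getStatus’ are hypothetical and are not the real DDBJ Blast operations, and this is not how Taverna itself is implemented.

```python
import xml.etree.ElementTree as ET

# Standard WSDL 1.1 namespace.
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Hypothetical, heavily simplified WSDL used only for illustration.
SAMPLE_WSDL = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="ExampleService">
  <portType name="ExamplePortType">
    <operation name="runQuery"/>
    <operation name="getStatus"/>
  </portType>
</definitions>
"""

def list_operations(wsdl_text):
    """Return the operation names declared in the portType elements of a WSDL document."""
    root = ET.fromstring(wsdl_text.strip())
    path = "{%s}portType/{%s}operation" % (WSDL_NS, WSDL_NS)
    return [op.get("name") for op in root.findall(path)]

print(list_operations(SAMPLE_WSDL))
```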
Go to the Services Panel
Type ‘binfo’ into the search box at the top of the panel (we will
start with simple information retrieval from KEGG)
You may see several services highlighted in red
Scroll down to the KEGG services, to ‘binfo’
This service returns information about the KEGG databases,
depending on the information you supply to it, e.g. the word
‘pathway’ gives info on the KEGG pathway database



Right click on the ‘binfo’ service and select ‘Invoke service’
In the pop-up ‘Run workflow’ window add the word “pathway”
by clicking on the input document ‘db’ and selecting to ‘add
new input’ from the dialog menu.
Click ‘Run workflow’ and the service is invoked

Click on the ‘Results’ tab in the Taverna tool bar. The database information is displayed
on the right when you select ‘click to view’

Click on the ‘Process Report’ tab and look at the processes. This shows the experimental
provenance: where and when processes were run, and their times

Click on the ‘Status’ tab and look at the options. As workflows run, you can monitor their
progress here (Note: this workflow was probably too fast to see this feature properly; we will
come back to it later)
The processes for running and invoking a single service are the
basics for any workflow and the tracking of processes and
generation of results are the same however complicated a
workflow becomes
In the next few exercises, we will look at some example
workflows and build some of our own from scratch
Installing The Whip Plug-in





You are going to use the ‘new’ myExperiment plug-in
First you need to install WHIP - http://www.whipplugin.org/
This allows you to interact with the myExperiment server
In Taverna, go to “Tools” and then select “Plug-in Manager”
Click “Find New Plug-ins”, and select the “myExperiment and
WHIP (beta) plug-in” from the list
Then click “Install” to install the plug-in




You should now see the myExperiment plug-in in the
toolbar menu
Browse through the example workflows in the first
tab of the plug-in
To view a workflow, select “Preview” from the
buttons under the workflow diagram
To open a workflow in the workbench, click on the
open button under the workflow diagram

Previewing a workflow allows you to see all the metadata
associated with the workflow on the myExperiment website,
including:






TAGS
AUTHOR
CREDITS
DESCRIPTION
You can also view the latest workflows, search for keywords, and
even browse using a tag cloud
Choose a workflow to load and click on “Open”
Opening from a URL

Select ‘Open Workflow Location’ from the File menu at the top
of the workbench. In the pop-up window, add the following web
address to load a workflow from the web
http://www.myexperiment.org/workflows/16/download?version=3


The ‘Mouse Pathways and Gene annotations for QTL Phenotype’ workflow will
be loaded
View the workflow diagram - you will see services in a couple
of different colours
[Figure: (1) Open from URL option; (2) Paste in the file location (the URL); (3) Populated Diagram and populated AME]

In the Advanced Model explorer panel – click on the name of
the workflow at the top of the window (just above Inputs) – in
this case ‘Pathways and Gene annotations for QTL Phenotype’ and then
select the ‘workflow metadata’ tab at the top of the AME.
You will see a text description of the workflow, its author and its unique LSID
(Life Science Identifier). When publishing workflows for others, this
annotation is useful information and allows the acknowledgement of
intellectual property





Now that you have loaded your workflow you can execute it
To execute your workflow open the “File” Menu at the top of the
Workbench
Choose “Run Workflow” from the options given – this will open a
pop-up box to input your data
Each input requires you to enter data – to enter data into each of
the inputs, click on one input and then click on the “New Data” option
in the pop-up menu system
Once you have entered these details, press the “Run Workflow”
button at the bottom of the pop-up box
[Figure: (1) Run Workflow option; (2) Input pop-up box; (3) Click on input; (4) Click on “New Input”; (5) Run Workflow]
Viewing Results







Once you have executed the workflow, the Taverna workbench will change views from “Design” to
“Results”. You should see this change behind your Input pop-up box
You can minimise the Input pop-up box to view the progress of the workflow being executed – the
different colours indicate whether a service has run or not

Green = Completed

Purple = Currently being executed

Grey = Awaiting execution
Once completed, the results will appear as separate tabs at the top of the workflow diagram
(indicated in the following diagram as workflow outputs)
Each tab contains an output file of results – the results can be viewed by clicking on the file in the left
hand pane where it says “click to view”
The file can then be searched through using the right hand pane, allowing you to verify the results – if
they are wrong simply maximise the pop-up window and hit the “Run workflow” button again, making
sure that the inputs are correct
Each file can then be saved to the local machine – to do this simply click on the button marked “Save
to disk” and enter the location to save the files
Then click OK
[Figure: Results pane, with numbered callouts for Workflow Outputs, Workflow progress, Result file, and Save results to disk]

Import the ‘get_genes_by_pathway’ service into a new
workflow model. First, you will need to either close the current workflow
from the file menu, or select ‘New Workflow’ then find the above service
again in the ‘services’ search panel.


Right-click on ‘get_genes_by_pathway’ and import it into the
workbench by selecting ‘Add to Model’
Go to the AME and expand the [+] next to the newly imported
service. You will see:
1
input (Green arrow pointing up)
 1 output (purple arrow pointing down)



Define a new workflow input by right-clicking on ‘Workflow
Input’ and selecting ‘Create New Input’
Supply a suitable name e.g. ‘pathway_identifier’
Connect this new input to the ‘get_genes_by_pathway’ service
by right-clicking on ‘pathway_identifier’ and selecting
‘get_genes_by_pathway ->pathway_id’
You always build workflows with the flow of data




Define a new workflow output by right-clicking on ‘workflow
output’ and selecting ‘create new output’
Supply a suitable name e.g. ‘gene_outputs’
Connect the ‘get_genes_by_pathway’ service to the new output,
remembering to build with the flow of data
You have now built a simple workflow from scratch!
Run the workflow by selecting ‘run workflow’ from the ‘File’
menu at the very top of the workbench. You will again need to
supply a KEGG pathway identifier – “path:mmu03010”
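The workflow you have just built is a single chain: workflow input, then service, then workflow output. The sketch below models that flow of data in Python. It is only an illustration: the stand-in function does not call KEGG, and the lookup table, the identifier ‘path:example’ and the gene names in it are placeholders invented for the example.

```python
def get_genes_by_pathway(pathway_id):
    """Stand-in for the KEGG service of the same name (placeholder data only)."""
    placeholder_genes = {"path:example": ["gene:a", "gene:b", "gene:c"]}
    return placeholder_genes.get(pathway_id, [])

def run_workflow(inputs):
    """Pass the workflow input to the service and collect the workflow output."""
    # Data link: workflow input 'pathway_identifier' -> service input 'pathway_id'
    genes = get_genes_by_pathway(inputs["pathway_identifier"])
    # Data link: service output -> workflow output 'gene_outputs'
    return {"gene_outputs": genes}

print(run_workflow({"pathway_identifier": "path:example"}))
```

Each data link is drawn from the producer of the data to its consumer, which is what building ‘with the flow of data’ means.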









Select a ‘string constant’ from the ‘Available Services’ list (by
searching for ‘constant’ in the text search box)
Right-click and select ‘add to model with name…’
Insert ‘pathway_id’ in the pop-up window
In the AME, right-click on ‘pathway_id’ and select ‘edit me’
Edit the text to ‘path:mmu03010’.
Replace the workflow input with this string constant
Run the workflow – it runs in the same way
Add a description and your name as author to the metadata
section
Save the workflow by selecting ‘save’ in the file menu
Exercise 7 Defining Output Formats

So far, most of the outputs we have seen have been text, but in
bioinformatics, we often want to view a graph, a 3D structure,
an alignment etc. Taverna is able to display results using a
specific type of renderer if the workflow output is configured
correctly.
Load the ‘Fetch PDB flatfile from RCSB server’ workflow from
http://www.myexperiment.org/workflows/167/download?version=1

Run the workflow with the ID ‘1crn’, or another PDB id you
know of

Look at the results. For ‘pdbFlatFile’, you will see the results are
displayed graphically. This is achieved by specifying a particular MIME
type in the output, given as ‘chemical/x-pdb’ in the service metadata tab.



Go back to the AME and look at the metadata for ‘pdbFlatFile’.
HINT: when you click on something in the AME, a metadata tab
will appear at the top of the window
Click on the Metadata window and select the MIME Types tab.
As you can see, the output has a MIME type associated with it. If you
wish to render results in anything other than plain text, you MUST specify
the MIME type in the workflow output, e.g. for PDF.
Exercise 7 Taverna MIME-Types
The following mime-types are currently used by Taverna
text/plain=Plain Text
text/xml=XML Text
text/html=HTML Text
text/rtf=Rich Text Format
text/x-graphviz=Graphviz Dot File
image/png=PNG Image
image/jpeg=JPEG Image
image/gif=GIF Image
application/zip=Zip File
chemical/x-swissprot=SWISSPROT Flat File
chemical/x-embl-dl-nucleotide=EMBL Flat File
chemical/x-ppd=PPD File
chemical/seq-aa-genpept=Genpept Protein
chemical/seq-na-genbank=Genbank Nucleotide
chemical/x-pdb=Protein Data Bank Flat File
chemical/x-mdl-molfile
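The list above is effectively a lookup table from MIME type to a kind of result. The sketch below shows how such a table could be used to decide how to display an output, falling back to plain text when no recognised type has been declared. It is an assumption-laden illustration, not Taverna's actual renderer code: the function name is hypothetical and only a few entries from the list are included.

```python
# A few entries copied from the list above.
KNOWN_MIME_TYPES = {
    "text/plain": "Plain Text",
    "text/html": "HTML Text",
    "image/png": "PNG Image",
    "chemical/x-pdb": "Protein Data Bank Flat File",
}

def choose_renderer(declared_mime_types):
    """Return the description of the first recognised non-text MIME type, else plain text."""
    for mime_type in declared_mime_types:
        if mime_type in KNOWN_MIME_TYPES and mime_type != "text/plain":
            return KNOWN_MIME_TYPES[mime_type]
    return KNOWN_MIME_TYPES["text/plain"]

print(choose_renderer(["chemical/x-pdb"]))
print(choose_renderer([]))
```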
Exercise 8 Sharing Workflows




Go to http://www.myexperiment.org
myExperiment is a social networking site for sharing
workflows and workflow expertise and experiences
Browse around the site and see what it contains
Create yourself an account and join the group
called “Newcastle MSc.” (this will be necessary for
the next exercise)





Find all the workflows containing BLAST searches. How did you
find them? How many are there? Can they all be downloaded?
Which is the most downloaded workflow?
Which is the most viewed workflow? Is it the same?
How many workflows are tagged with ‘protein_structure’ ?
If you wish to share your workflows with the rest of the class,
upload them and set the permissions so that only those in the
‘Newcastle MSc.’ group can see them – make sure you add a
description and author details to the workflow metadata first!
Exercise 9
Workflow Reuse – Nested Workflows




Reload your KEGG workflow from exercise 6
We will extend this workflow to get descriptions of
each gene identifier, and find the pathways for
each gene.
In the myExperiment plug-in, find all the workflows
that are tagged with KEGG
Select the ‘Get Kegg Gene information’ workflow
http://www.myexperiment.org/workflows/611





Go back to Taverna and look at the original workflow
In the AME, click on ‘add nested workflow’.
Go back to the myExperiment plug-in, and choose to
“import from URL” for the workflow you found in
myExperiment
You can change the name of the nested workflow by
right-clicking on its processor and selecting ‘rename’
You need to connect up the workflow as if it was any
other kind of service



The nested workflow has 1 input and 2 outputs. We
have to connect the input, but we can choose which
outputs to display
In the outer workflow create a new output called
‘gene_descriptions’ - hint: to switch between
workflows, use the “Workflows” option in the file
menu system
Connect gene_descriptions to the nested workflow
output ‘gene_descriptions’


Save the workflow (remembering to embed the
nested workflow, using the supplied check box) and
run the workflow
Look at the results
Exercise 10 Iteration
Taverna has an implicit iteration framework. If you connect a
set of data objects (for example, a set of fasta sequences) to
a process that expects a single data item at a time, the process
will iterate over each sequence
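Implicit iteration can be pictured as follows. This is a minimal sketch of the idea, not Taverna's implementation; the helper name ‘invoke’ and the toy service are hypothetical.

```python
def invoke(service, data):
    """Call `service` once per item when given a list, otherwise call it once."""
    if isinstance(data, list):
        # The service expects a single item, so iterate over the collection.
        return [invoke(service, item) for item in data]
    return service(data)

def sequence_length(sequence):
    """Toy service that accepts exactly one sequence."""
    return len(sequence)

print(invoke(sequence_length, "ACGT"))            # single item: one call
print(invoke(sequence_length, ["ACGT", "ACGTT"]))  # list: one call per item
```

The service itself never sees the list; the surrounding framework unpacks it and collects the individual results back into a list.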

Load the ‘Mouse Pathways and Gene annotations for QTL Phenotype’
workflow from the myExperiment plug-in using any of the
previously used import methods
http://www.myexperiment.org/workflows/16/download?version=3

Watch the progress report. You will see several services with
‘Invoking with Iteration’




The user can also specify more complex iteration strategies
using the service metadata tab
Find and load the workflow ‘Demonstration of configurable
iteration’ from the myExperiment plug-in
Read the workflow metadata to find out what the workflow
does
Select the ‘ColourAnimals’ service and read the metadata for
that service. Under the description is the iteration strategy
Click on ‘dot product’. This allows you to switch to cross product
Run the workflow twice: once with ‘dot
product’ and once with ‘cross product’.
Save the first results so you can compare them.
What is the difference? What does it mean to
specify dot or cross product?
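Once you have compared your two sets of results, the sketch below may help to explain them. It assumes two input lists and models the dot product as pairing items by position and the cross product as every possible combination. The list contents are invented for the example and are not taken from the demonstration workflow.

```python
from itertools import product

def dot_product(first, second):
    """Pair the items position by position: first with first, second with second."""
    return list(zip(first, second))

def cross_product(first, second):
    """Combine every item of the first list with every item of the second."""
    return list(product(first, second))

colours = ["red", "green"]
animals = ["cat", "dog"]

print(dot_product(colours, animals))    # 2 pairs
print(cross_product(colours, animals))  # 2 x 2 = 4 pairs
```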

Exercise 11 Substituting Services
Taverna does not own many of the bioinformatics services it provides. This means
that it cannot control their reliability. Instead, Taverna provides strategies for
dealing with services being unavailable

Load the ‘BiomartAndEMBOSSAnalysis’ from the myExperiment website this time,
using the ‘Launch in Taverna’ button.
http://www.myexperiment.org/workflows/158


Look at the metadata for the ‘emma’ service. It is an implementation of clustalw
Find the DDBJ clustalw service – HINT: go to the DDBJ services homepage, and
import the service from URL into the Available Services palette
http://xml.nig.ac.jp/index.html




Instead of adding the new service normally, right-click and
select ‘add as alternate’
In the resulting menu select ‘emma’
The DDBJ version of the ClustalW service is now added as an
alternative to emma in the AME. It will appear at the bottom
of the input/output list of the Emma service
Select the new service (which should be called ‘analyzeSimple’)
and look at the inputs and outputs. These need to be connected
to the correct inputs and outputs in emma (it is unlikely the
inputs and outputs will have the same names, so see if you can
figure them out)



Right-click on the ‘query’ input in analyzeSimple and map it to
‘sequence_direct_data’. In both services, these inputs expect a
set of fasta sequences.
Right-click on the ‘result’ output and map it to ‘outseq’ in emma
in the same way.
Now you have a workflow which will run using emma when it is
available, but will substitute DDBJ clustalw for it if emma
fails!
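The behaviour of an alternate service can be sketched as a simple try-then-fall-back pattern. This is an illustration under stated assumptions, not Taverna's code: the two stand-in functions below only imitate a failing ‘emma’ and a working alternate, and they perform no alignment.

```python
def run_with_alternate(primary, alternate, data):
    """Try the primary service; if it raises an error, use the alternate instead."""
    try:
        return primary(data)
    except Exception:
        return alternate(data)

def unavailable_emma(sequences):
    """Stand-in for a primary service that is currently down."""
    raise ConnectionError("service unavailable")

def stand_in_alternate(sequences):
    """Stand-in for the alternate service; just reports what it was given."""
    return "aligned %d sequences with the alternate" % len(sequences)

print(run_with_alternate(unavailable_emma, stand_in_alternate, ["ACGT", "ACGA"]))
```

Mapping the ports in the exercise corresponds to making sure both functions accept the same input and return the same kind of output.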
Exercise 12 Failover
Taverna also allows the user to specify the number of times a
service is retried before it is considered to have failed.
Sometimes network traffic is heavy, so a working service needs to be
retried

Select ‘tmap’ from the same workflow. To the right of the service
name are a series of 0s and 1s. By simply typing the numbers, the user can
specify the number of retries and the time between the retries

Change it to 3 retries for ‘tmap’ and set the status to ‘critical’
using the final tickbox. Now that it is critical, the whole workflow
will be aborted if ‘tmap’ fails after 3 retries. Failures in non-critical services
will not abort the workflow run.
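The retry and critical settings can be sketched as follows. This is a hedged illustration, not Taverna's implementation: the function name, the parameter names and the exception class are hypothetical, and the sketch assumes that ‘3 retries’ means one initial attempt followed by up to three further attempts.

```python
import time

class WorkflowAborted(Exception):
    """Raised when a critical service still fails after all of its retries."""

def invoke_with_retries(service, data, retries=0, delay=0.0, critical=False):
    """Call `service`, retrying on failure; abort only if the service is critical."""
    last_error = None
    for attempt in range(retries + 1):  # the first attempt plus `retries` retries
        try:
            return service(data)
        except Exception as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # wait before the next retry
    if critical:
        raise WorkflowAborted("critical service failed") from last_error
    return None  # non-critical failure: the rest of the workflow carries on

def always_down(data):
    """Stand-in for a service that never responds."""
    raise ConnectionError("service down")

print(invoke_with_retries(always_down, "sequence", retries=3))
```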
Spotlight on BioMart
Exercise 13 Spotlight on BioMart



Biomart enables the retrieval of large amounts of
genomic data e.g. from Ensembl and Sanger, as well
as Uniprot and MSD datasets
After saving any workflows you want to keep, reset
the workbench in the AME (by closing open workflows
in the File menu)
Keep open the workflow
‘BiomartAndEMBOSSAnalysis’
Run the Workflow
This Workflow Starts by fetching all gene IDs from Ensembl
corresponding to human genes on chromosome 22 implicated
in known diseases and with homologous genes in rat and
mouse.
For each of these gene IDs it fetches the 200bp after the five-prime end of the genomic sequence in each organism and
performs a multiple alignment of the sequences using the
EMBOSS tool 'emma' (a wrapper around ClustalW). It then
returns PNG images of the multiple alignment along with three
columns containing the human, rat and mouse gene IDs used in
each case.




Right-click on the ‘hsapiens_gene_ensembl’ service
and select ‘configure BioMart query’
By selecting ‘Filters’ and then ‘Region’ – change the
chromosome from 22 to 21 – now the workflow will
retrieve all disease genes from chromosome 21 with
rat and mouse homologues
Run the workflow and look at the results
See how some of the other options were configured
by finding them in the other pull-down lists (Gene,
Multi-species comparison etc)



Find out which Gene Ontology terms are associated with the
genes in your region by adding a new Biomart query
processor
Select another copy of ‘hsapiens_gene_ensembl’ from the
services panel (under Biomart and Ensembl 50 genes (Sanger))
and select ‘add to model with name….’ (as there is already a
service with that name!) and call the service ‘hsapiens_GO’
Configure ‘hsapiens_GO’ by right-clicking and selecting
‘configure Biomart query’ and selecting ‘filters’. In filters,
select ‘gene’ and the ‘id list limit’ tick-box next to ‘ensembl gene
IDs’.
Configure the output (by selecting attributes) and select ‘GO
ID’ for each GO partition under the ‘External -> GO
Attributes’ tab in the attributes section




Connect the input to the ‘hsapiens_gene_ensembl’
service via the ‘ensembl_gene_id’
Create 3 new workflow outputs, ‘CCGOID’, ‘MFGOID’
and ‘BPGOID’. Connect the outputs of the biomart
processor to them
Re-run the workflow and view which GO terms are
associated with your chromosomal region
NOTE: Having 3 outputs for related terms like this is
inefficient and hard to read – we will come back to a
solution to fix this problem in the next session
This exercise highlights the services that do not perform
biological functions, but are vital for running life science
workflows
Exercise 14


A shim is a service that doesn’t perform an
experimental function, but acts as a connector, or
glue when 2 experimental services have
incompatible outputs and inputs
A shim can be any type of service – WSDL,
Soaplab etc. Many are simple BeanShell scripts
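As an illustration of what a shim does, the sketch below glues together a service that outputs separate lists of identifiers and sequences with a service that expects a single block of FASTA text. The function name ‘to_fasta’ and the sample data are hypothetical, and this is not the script used in any of the tutorial workflows.

```python
def to_fasta(identifiers, sequences):
    """Join matching identifiers and sequences into one FASTA-formatted string."""
    records = []
    for identifier, sequence in zip(identifiers, sequences):
        # A FASTA record is a header line starting with '>' followed by the sequence.
        records.append(">" + identifier + "\n" + sequence + "\n")
    return "".join(records)

print(to_fasta(["gene1", "gene2"], ["ACGT", "TTGA"]))
```

The shim performs no analysis of its own; it only reshapes the data so that the next service can read it.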
Exercise 14 – Finding Shims


In the ‘BiomartandEmbossAnalysis’, work out which
services are shims
What do the shims do?
Exercise 14 Other Shims

There are many myGrid shim services. These are
currently being described in a shim library, but for
now, a small collection are documented here:
http://www.cs.man.ac.uk/~hulld/shims.html


Find a shim that will return a DNA file in Fasta format
from an id. Load the example workflow and run it in
Taverna
Find a shim that will translate DNA
HINT: these services might be in the feta registry
The EMBOSS suite of programs has a
subdivision called ‘edit’
All the edit services are shims
Experiment with the edit services
Find a service that will remove gaps from
sequences

Exercise 14 Beanshell




Open Taverna and load the workflow
‘BiomartAndEMBOSSAnalysis’
Look at the diagram. Each brown service is a
BeanShell script
In the ‘Advanced Model Explorer’ (AME) select the
BeanShell ‘CreateFasta’
Right-click and select ‘configure beanshell’



Look at the script and see if you can work out its
function
Look at the ports and their types as well as the
script
Note the names of the ports and where they
appear in the script, you will need to know how to
specify an input/output in the next exercise
Exercise 14
Beanshell – Writing your Own
Beanshell scripts allow users to write small, bespoke Java scripts
to allow incompatible services to work together
Create a new workflow by selecting ‘File’ and ‘New Workflow’


Add a new beanshell processor by right-clicking “Beanshell scripting host” in
the service panel and selecting “Add to model” (you may change the name
of the processor)
Right-click the beanshell processor created and select “Configure
beanshell…”

Create 2 input ports named: myName and mySurname

Create 1 output port named: myFullname
Note that these ports are automatically added to the AME window
Select the script tab and paste the following script:
myFullname = myName + "\t" + mySurname;
Create 2 workflow inputs and 1 workflow output by going to
the port menu, and choosing to add a new port for both input
and output.
Connect them to the configured beanshell processor.
Run the workflow
You should get your full name printed in the output
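If you want to check what the script should produce before running the workflow, the same logic can be tried outside Taverna. The sketch below is a Python rendering of the one-line Beanshell script, keeping the port names as variable names. The wrapper function and the sample names are assumptions made for the example.

```python
def full_name_processor(myName, mySurname):
    """Mirror the Beanshell script: join the two inputs with a tab character."""
    myFullname = myName + "\t" + mySurname
    return myFullname

# Sample inputs chosen for illustration.
print(full_name_processor("Ada", "Lovelace"))
```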

BioCatalogue
BioCatalogue is a social networking site that allows you to discover Web
Services, to include in your workflows

Go to http://www.biocatalogue.org

Familiarise yourself with the page

Go to ‘Project information’ and look at the roadmap to see what
features are coming

If you want to try BioCatalogue, you can sign up to the friends email
list (found on the front page at the bottom left), and you can try the
Pilot out by signing up for the beta testing:
http://beta.biocatalogue.org/
Username: biocat
Password: biodog
FINISH