myGridIntroduction - School of Computer Science

Download Report

Transcript myGridIntroduction - School of Computer Science

An Introduction to Taverna Workflows
Franck Tanoh
myGrid
University of Manchester
What is myGrid?
•
myGrid
is a suite components to support in silico
experiments in biology
• Taverna workbench = myGrid user interface
• Originally designed to support bioinformatics
Expanded into new areas:
Chemoinformatics
Health Informatics
Medical Imaging
Integrative Biology
• Open source – and always will be
History
EPSRC funded UK eScience Program Pilot Project
OMII-UK
• University of Manchester (myGrid) joined with the
Universities of Edinburgh (OGSA-DAI) and Southampton
(OMII phase 1) in March 2006
• OMII-UK aims to provide software and support to enable
a sustained future for the UK e-Science community and
its international collaborators.
• A guarantee of development and support
The Life Science Community
In silico Biology is an open Community
• Open access to data
• Open access to resources
• Open access to tools
• Open access to applications
Global in silico biological research
The Community Problems
• Everything is Distributed
– Data, Resources and Scientists
• Heterogeneous data
• Very few standards
– I/O formats, data representation, annotation
– Everything is a string!
Integration of data and interoperability of resources is
difficult
Lots of Resources
NAR 2007 – 968 databases
Traditional Bioinformatics
12181
12241
12301
12361
12421
12481
12541
12601
12661
12721
12781
acatttctac
cagtctttta
gaccatccta
gactaattat
taggtgactt
aggagctatt
ttcttataag
tggttaagta
tggcattaag
atccaatacc
taacccattt
caacagtgga
aattttaacc
atagatacac
gttgagcttg
gcctgttttt
tatatattct
tctgtggttt
tacatgacat
tacatccaca
cattaagctg
tctgtctcta
tgaggttgtt
tttagagaag
agtggtgtct
ttaccattta
ttttaattgg
ggatacaagt
ttatattaat
aaaacggatt
atattgtgca
tcactcccca
tggatttgcc
ggtctatgtt
agtcatacag
cactgtgatt
gacaacttca
gatcttaatt
tctttatcag
gtttttattg
atcttaacca
actatcacca
atctcccatt
tgttctggat
ctcaccaaat
tcaatagcct
ttaatttgca
ttagagaagt
tttttaaatt
atacacagtt
atgactgttt
ttttaaaatg
ctatcatact
ttcccacccc
attcatatta
ttggtgttgt
tttttagctt
ttttcctgct
gtctaatatt
attgatttgt
tgtgactatt
tttacaattg
taaaattcga
ccaaaagggc
tgacaatcaa
atagaatcaa
Cutting and Pasting
• Advantages:
– Low Technology on both server and client side
– Very Robust: Hard to break.
– Data Integration happens along the way
• Disadvantages:
– Time Consuming (and painful!)
• Can be repeated rarely
• Limited to small data sets.
– Error Prone:
• Poor repeatability
How do you do this for a genome/proteome/metabolome of
information!
Pipeline Programming
• Advantages
– Repeatable
– Allows automation
– Quick, reliable, efficient
• Disadvantages
– Requires programming skills
– Difficult to modify
– Requires local tool and database installation
– Requires tool and database maintenance!!!
What we want as a solution
A system that is:
• Allows automation
• Allows easy repetition, verification and sharing of
experiments
• Works on distributed resource
• Requires few programming skills
• Runs on a local desktop / laptop
myGrid
as a solution
myGrid
allows the automated orchestration of in silico
experiments over distributed resources from the
scientist’s desktop
Built on computer science technologies of:
• Web services
• Workflows
• Semantic web technologies
Web Services
Web services support machine-to-machine interaction over a
network. Note: NOT the same as services on the web
Web services are a:
– technology and standard for exposing code / databases with an API that
can be consumed by a third party remotely.
– describes how to interact with it.
They are:
• Self-contained
• Self-describing
• Modular
• Platform independent
Workflows
– General technique for describing and enacting a process
– Describes what you want to do, not how you want to do it
– High level description of the experiment
Repeat
Masker
Web service
GenScan
Web Service
Blast
Web Service
Workflows
Workflow language specifies how bioinformatics
processes fit together.
High level workflow diagram separated from any
lower level coding – you don’t have to be a
coder to build workflows.
Workflow is a kind of script or protocol that you
configure when you run it.
Easier to explain, share, relocate, reuse and
repurpose.
Workflow <=> Model
Workflow is the integrator of knowledge
The METHODS section of a scientific publication
Workflow Advantages
• Automation
– Capturing processes in an explicit manner
– Tedium! Computers don’t get bored/distracted/hungry/impatient!
– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation
• Easy to share, explain, relocate, reuse and build
• Releases Scientists/Bioinformaticians to do other work
• Record
– Provenance: what the data is like, where it came from, its quality
– Management of data (LSID - Life Science Identifiers)
Different Workflow Systems
•
•
•
•
•
•
•
Kepler
Triana
DiscoveryNet
Taverna
Geodise
Pegasus
Pipeline Pilot
Each has differences in action, language, access
restrictions, subject areas
Taverna Workflow Components
Scufl Simple Conceptual Unified Flow Language
Taverna Writing, running workflows & examining results
SOAPLAB Makes applications available
Web Service
e.g. DDBJ BLAST
SOAPLAB
Web Service
Any Application
An Open World
•
•
•
•
Open domain services and resources.
Taverna accesses 3000+ services
Third party – we don’t own them – we didn’t build them
All the major providers
– NCBI, DDBJ, EBI …
•
Enforce NO common data model.
•
Quality Web
Services
considered
desirable
Adding your own web services
• SoapLab
• Java API Consumer
import Java API
of libSBML as
workflow
components
http://www.ebi.ac.uk/soaplab/
Shield the Scientist – Bury the Complexity
Taverna
Workbench
Application
Scufl Model
Simple Conceptual Unified Flow Language
Workflow Execution
Workflow enactor
Processor Processor Processor Processor Processor Processor Processor
Bio
Bio WSRF Plain Soap
MART
Web
lab MOBY
Service
Local
Java
App
Styx
Processor
Enactor Styx R
client package
...
...
What can you do with myGrid?
• ~37000 downloads
• Users worldwide
US, Singapore, UK, Europe,
Australia
•
•
•
•
•
•
•
•
•
•
•
•
Systems biology
Proteomics
Gene/protein annotation
Microarray data analysis
Medical image analysis
Heart simulations
High throughput screening
Genotype/Phenotype studies
Health Informatics
Astronomy
Chemoinformatics
Data integration
Steve Kemp
Trypanosomiasis in Africa
Andy Brass
Paul Fisher
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Trypanosomiasis Study
• A form of Sleeping sickness in cattle – Known as n’gana
• Caused by Trypanosoma brucei
• Can we breed cattle resistant to n’gana infection?
• What are the causes of the differences between resistant
and susceptible strains?
Trypanosomiasis Study
Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays
Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL
Need to access microarray data,
genomic sequence information,
pathway databases AND integrate
the results
Phenotype
Genotype
200
?
Phenotypic response investigated
using microarray in form of
expressed genes or evidence
provided through QTL mapping
Genes captured in microarray
experiment and present in QTL
region
Microarray + QTL
Key:
A – Retrieve genes in QTL
region
B – Annotate genes with
external database Ids
C – Cross-reference Ids with
KEGG gene ids
D – Retrieve microarray data
from MaxD database
E – For each KEGG gene
get the pathways it’s
involved in
F – For each pathway get a
description of what it does
G – For each KEGG gene
get a description of what it
does
Results
• Identified a pathway for which its correlating gene (Daxx) is believed
to play a role in trypanosomiasis resistance.
• Manual analysis on the microarray and QTL data had failed to
identify this gene as a candidate.
Why was the Workflow Approach
Successful?
• Workflow analysed each piece of data systematically
– Eliminated user bias and premature filtering of datasets and
results leading to single sided, expert-driven hypotheses
• The size of the QTL and amount of the microarray data
made a manual approach impractical
• Workflows capture exactly where data came from and
how it was analysed
• Workflow output produced a manageable amount of data
for the biologists to interpret and verify
– “make sense of this data” -> “does this make sense?”
Trichuris muris
(mouse whipworm) infection
parasite model of the human parasite Trichuris trichuria)
• Identified the biological pathways involved in sex dependence in
the mouse model, previously believed to be involved in the ability
of mice to expel the parasite.
• Manual experimentation: Two year study of candidate genes,
processes unidentified
• Workflows: trypanosomiasis cattle experiment was reused
without change.
• Analysis of the resulting data by a biologist found the processes
in a couple of days.
Joanne Pennock, Richard Grencis
University of manchester
Workflow Reuse – Workflows are
Scientific Protocols – Share them!
Addisons Disease
SNP design
Protein annotation
Microarray analysis
A workflow marketplace
A Practical Guide to Building and
Managing in silico Experiments
Semantic Web Technologies
•
myGrid
built on Web Services, Workflows AND semantic
web technologies
• Semantic web technologies are used to:
– Find appropriate services during workflow design
– Find similar workflows for reuse and repurposing
– Record the process and outcome of an experiment, in context
->>>> the experimental provenance
Finding Services
There are over 3000 distributed services. How do we
find an appropriate one?
Find services by their function instead of their name
• We need to annotate services by their functions.
• The services might be distributed, but a registry of
service descriptions can be central and queried
Feta Semantic Discovery
• Feta is the myGrid component that can query the service
annotations and find services
Questions we can ask:
Find me all the services that perform a multiple sequence
alignment And accept protein sequences in FASTA
format as input
myGrid
Ontology
Specialises
Upper level
ontology
Task
ontology
Informatics
ontology
Contributes to
Molecular Biology
ontology
Bioinformatics
ontology
sequence
Web Service
ontology
protein_structure_feature
biological_sequence
Similarity Search Service
protein_sequence
BLAST service
nucleotide_sequence
BLASTp service
InterProScan service
DNA_sequence
Annotations
•
•
•
•
Feta has been available for over a year
Only just been included in the release
Need critical mass of service annotations before release
By demonstrating the use of service annotation, we aim
to encourage service providers to provide the
annotations in the future
• Annotation experiments with users and domain experts
• Domain expert annotations much better
– We now have a domain expert for full-time service annotation
Data Management
• Workflows can generate vast amount of data - how can
we manage and track it?
• We need to manage
– data AND
– metadata AND
– experiment provenance
• Workflow experiments may consist of many workflows of
the same, or different experiments.
• Scientists need to check back over past results, compare
workflow runs and share workflow runs with colleagues
Provenance – the myGrid logbook
Smart Tea
• Who, What, Where, When,
Why?, How?
•
•
•
•
•
•
•
•
•
•
Context
Interpretation
Logging & Debugging
Reproducibility and repeatability
Evidence & Audit
Non-repudiation
Credit and Attribution
Credibility
Accurate reuse and interpretation
Just good scientific practice
BioMOBY
Advanced Provenance Features
Smart re-running
Experiment mining
Cross experiment mining
From which Ensembl gene
does pathway mmu004620
come from?
Conclusions
Web services and workflows are powerful technologies
for in silico science
–
–
–
–
automation
high throughput experiments
systematic analysis
Interoperability of distributed resources
Contact Us
• Taverna development is user-driven
• Please tell us what you would like to see via the mailing
lists:
– Taverna-Users and Taverna-Hackers
• Download software and find out more at:
http://www.mygrid.org.uk
http://taverna.sourceforge.net
myGrid
acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
•
•
•
•
•
•
•
OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David
Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis,
Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid
project, Bergen people, EMBRACE people.
User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney,
May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock,
Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil
Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris
Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren
Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis,
Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick
Sharman, Victor Tan, Paul Watson, and Chris Wroe.
Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
Funding EPSRC, Wellcome Trust.
Changes to Scientific Practice
– Systematic and comprehensive automation
• Eliminated user bias and premature filtering of datasets and
results leading to single sided, expert-driven hypotheses
– Dry people hypothesise, wet people validate
• “make sense of this data” -> “does this make sense?”
– Workflow factories
• Different dataset, different result
– Workflow market
– Accurate provenance