sys-bio-pres - University of Edinburgh

Towards Grid-Based System Biology
Dr Richard Sinnott
Technical Director National e-Science Centre
Deputy Director (Technical) Bioinformatics Research Centre
University of Glasgow
24th February 2005
Sys-Bio Talk,
24th Feb 2005
Grids? E-Science? E-Research?
methodologies transforming science, engineering, medicine
and business
driven by exponential growth in data, compute demands

enabling a whole-system approach
Grid connecting: computers, software, sensor nets, instruments, colleagues, shared data archives
NeSC in the UK
Previous work on UK e-Science Grid based on GT2
– Demonstrated broad set of applications across it
» Monte Carlo simulations of ionic diffusion through radiation damaged crystal structures
» Integrated Earth system modelling
» BLAST on the Grid
» Grid Integration Test Script Suite
» …
Transition to WSRF/OGSA under discussion
Two UK OGSA Test Grid projects started in January
– UCL, Imperial College, Universities of Edinburgh and Newcastle
– Universities of Portsmouth, Reading, Manchester, Westminster and CCLRC
The next Grid software
Challenges/Opportunities
– There are still issues to be resolved
– OGSA definition and delivery?
– Standards (OGSI, WSRF, …) and technologies (GT3, GT4, …)
– Hosting environments & platforms
– Combinations of services supported
– Material and grids to support adopters
[Map of UK e-Science sites: Glasgow, Edinburgh, Newcastle, Belfast, Daresbury Lab, Manchester, White Rose Grid, Cardiff, Cambridge, Hinxton, Oxford, RAL, London, Southampton; also marked: Core National Grid Service, CSAR, HPC(x)]
Life Sciences
Extensive Research Community
>1000 per research university
Extensive Applications
Many people care about them

Health, Food, Environment
Interacts with virtually every discipline
Physics, Chemistry, Maths/Stats, Nano-engineering, …
450+ databases relevant to bioinformatics (and growing!)
Heterogeneity, Interdependence, Complexity, Change, …
Systems Biology?
Populations
Organisms
Physiology
Tissues
Protein-protein interaction (pathways)
Protein Structures
Gene expressions
Nucleotide structures
+ links to plant/crops, environmental, health, … information sources
More genomes …
Yersinia pestis
Arabidopsis thaliana
Buchnera sp. APS
Caenorhabditis elegans
Campylobacter jejuni
Chlamydia pneumoniae
Helicobacter pylori
Mycobacterium leprae
rat
mouse
Aquifex aeolicus
Man
Archaeoglobus fulgidus
Borrelia burgdorferi
Mycobacterium tuberculosis
Drosophila melanogaster
Escherichia coli
Thermoplasma acidophilum
Neisseria meningitidis Z2491
Plasmodium falciparum
Pseudomonas aeruginosa
Ureaplasma urealyticum
Rickettsia prowazekii
Saccharomyces cerevisiae
Salmonella enterica
Bacillus subtilis
Thermotoga maritima
Xylella fastidiosa
Distributed and Heterogeneous data
Structure
Sequence
LPSYVDWRSA ECGGCWAFSA TSGSLISLSE NTRGCDGGYI GGINTEENYP
GAVVDIKSQG IATVEGINKI QELIDCGRTQ TDGFQFIIND YTAQDGDCDV
Function
Gene expression
Morphology
Database Growth
PDB Content Growth
• DBs growing exponentially!!!
• Bibliographic (MedLine, …)
• Amino Acid Seq (SWISS-PROT, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL, …)
• Biochemical Pathways (KEGG, WIT, …)
• Molecular Classifications (SCOP, CATH, …)
• Motif Libraries (PROSITE, Blocks, …)
Is Grid the Answer?
Some key problems to be addressed
Tools that simplify access to and usage of data

Internet hopping is not ideal!
Tools that simplify access to and usage of large
scale HPC facilities

qsub [-a date_time] [-A account_string] [-c interval] [-C
directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list]
[-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q
destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V]
[-W additional_attributes] [-z] [script]
Tools designed to aid understanding of complex data
sets and relationships between them

e.g. through visualisation
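
One way to picture a tool that hides the qsub synopsis above is a thin wrapper that accepts a few named settings and assembles the command line. The following is a minimal sketch, not part of the original talk: the function name, the script name `blast_job.sh` and the queue name `long` are hypothetical, and only options that appear in the synopsis (-N, -l, -q, -o, -e) are used. It builds the command but does not submit it.

```python
import shlex


def build_qsub_command(script, name=None, resources=None, queue=None,
                       stdout_path=None, stderr_path=None):
    """Assemble a qsub argument list from a few named settings."""
    cmd = ["qsub"]
    if name:
        cmd += ["-N", name]
    if resources:
        # -l expects a comma-separated resource list such as walltime=01:00:00
        resource_list = ",".join(f"{key}={value}" for key, value in resources.items())
        cmd += ["-l", resource_list]
    if queue:
        cmd += ["-q", queue]
    if stdout_path:
        cmd += ["-o", stdout_path]
    if stderr_path:
        cmd += ["-e", stderr_path]
    cmd.append(script)
    return cmd


# Hypothetical job: script name and queue name are placeholders.
example = build_qsub_command(
    "blast_job.sh",
    name="blast",
    resources={"walltime": "01:00:00"},
    queue="long",
)
print(shlex.join(example))
```

Returning a list rather than a single string keeps the arguments safe to pass to a process launcher without further quoting.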
Access to and Usage of Data
Grid technology should make it possible to
hide heterogeneity,
provide location transparency,
address security concerns,
…
Data Access and Integration Specification (DAIS)
being defined by GGF
OGSA-DAI and DAIT projects play a key role in shaping these
standards
Other commercial solutions

IBM Information Integrator, …
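
To make "hiding heterogeneity" concrete, the sketch below queries two sources that store the same kind of record under different schemas and returns results in one common shape. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with in-memory databases, it is not the OGSA-DAI or Information Integrator API, and every table, column and gene name is hypothetical.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an in-memory database standing in for one remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    return conn


# Two hypothetical sources with different table and column names.
source_a = make_source(
    "CREATE TABLE genes (symbol TEXT, chromosome TEXT)",
    "INSERT INTO genes VALUES (?, ?)",
    [("GENE1", "1"), ("GENE2", "7")],
)
source_b = make_source(
    "CREATE TABLE loci (gene_name TEXT, chr TEXT, species TEXT)",
    "INSERT INTO loci VALUES (?, ?, ?)",
    [("GENE1", "4", "rat")],
)

# Each adapter maps one source's schema onto the common (gene, chromosome) shape.
ADAPTERS = [
    ("source_a", source_a, "SELECT symbol, chromosome FROM genes WHERE symbol = ?"),
    ("source_b", source_b, "SELECT gene_name, chr FROM loci WHERE gene_name = ?"),
]


def find_gene(symbol):
    """Query every source and return records in one common format."""
    results = []
    for label, conn, query in ADAPTERS:
        for name, chromosome in conn.execute(query, (symbol,)):
            results.append({"source": label, "gene": name, "chromosome": chromosome})
    return results


print(find_gene("GENE1"))
```

The caller only sees `find_gene`; where each record lives and how its source names its columns stay inside the adapters.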
Access to and Usage of HPC facilities
Consider whole genome-genome (2*3*10^9 bp)
comparisons between two species
Current strategy essentially chops up one genome
and fires searches for those fragments in the other
then re-assembles results


messy approximate matching - re-assembly difficult
important correlations can be lost
– to make this tractable, so-called junk DNA is ignored
– chopping may introduce artefacts or hide phenomena
Better to put both full genomes in memory and perform a useful
complete comparison
Only possible with very high-end machines (available via grids)
Should not have to be script writer/Linux sys-admin
to use these facilities
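
The loss of correlations under the chop-and-search strategy can be shown on a toy scale. The sketch below is an illustration only: it uses exact substring search where real tools use approximate matching, the sequences are invented, and the function names are hypothetical. A shared 10-base region straddles a fragment boundary, so no fragment matches, while a comparison of the two full sequences recovers it. For scale, at one byte per base two 3×10^9 bp genomes occupy about 6 GB before any index is built.

```python
from difflib import SequenceMatcher


def fragment_hits(query, target, size):
    """Chop `query` into fixed-size fragments and look each one up in `target`."""
    hits = []
    for offset in range(0, len(query), size):
        fragment = query[offset:offset + size]
        position = target.find(fragment)
        if position != -1:
            hits.append((offset, position, len(fragment)))
    return hits


def longest_shared_region(a, b):
    """Compare the two full sequences and return their longest common substring."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Invented toy sequences: the shared region starts at offset 4 in both,
# so with 8-base fragments it is cut in two.
SHARED = "ACGTACGGTC"
genome_a = "TTTT" + SHARED + "GG"
genome_b = "CCCC" + SHARED + "AAAA"

print(fragment_hits(genome_a, genome_b, 8))
print(longest_shared_region(genome_a, genome_b))
```

The whole-sequence comparison needs both sequences in memory at once, which is the reason the slide points to very high-end machines for genome-scale inputs.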
Cognitive aspects of Data
Life science data can be “ugly”
Raw data sets messy
Requires significant effort to understand
Schemas/data models evolving
…
Tools needed to
Simplify understanding
Improve analysis
Navigate through potentially huge data sets

e.g. to find genes of interest in chromosomes of different species
[Figure: projects BRIDGES, SBRN, JDSS, DyVOSE, GHI and VOTES positioned against the levels of biological information: Populations, Organisms, Physiology, Tissues, Protein-protein interaction (pathways), Protein Structures, Gene expressions, Nucleotide structures]
Overview of BRIDGES
Biomedical Research Informatics Delivered by Grid
Enabled Services (BRIDGES)
NeSC (Edinburgh and Glasgow) and IBM
Started October 2003
Supporting project for CFG project
Generating data on hypertension
Rat, Mouse, Human genome databases
Variety of tools used
BLAST, BLAT, Gene Prediction, visualisation, …
Variety of data sources and formats
Microarray data, genome DBs, project partner research data, …
Aim is integrated infrastructure supporting
Data federation
Security
Bridges Project
[Figure: CFG Virtual Organisation (Glasgow, Edinburgh, Oxford, London, Leicester, Netherlands, each holding private data) with VO Authorisation in front of the Synteny Service and MagnaVista Service; a Data Hub built on Information Integrator and OGSA-DAI federates the private data with publicly curated data: Ensembl, OMIM, SWISS-PROT, MGI, HUGO, RGD, …]
JDSS Project
Public data resources openness
Often cannot query directly
Often not easy/possible to find schemas
Joint Data Standards Study investigating this



Started on 1st June and involves
– Digital Archiving Consultancy
– Bioinformatics Research Centre (Glasgow)
– NeSC (Edinburgh and Glasgow)
Look at technical, political, social, ethical etc issues involved in
accessing and using public life science resources
– Interview relevant scientists, data curators/providers
8 month project with final report due imminently
– Funded by MRC, BBSRC, Wellcome Trust, JISC, NERC, DTI
DyVOSE Project
Dynamic Virtual Organisations for e-Science Education
(DyVOSE) project
Two year project started 1st May 2004 funded by JISC
Exploring advanced authorisation infrastructures for security

… in Grid Computing Module as part of advanced MSc at Glasgow
– Provide insight into rolling Grid out to the masses!
[Figure: Education VO policies feeding PERMIS based authorisation checks and authorisation decisions for ScotGrid, the GU Condor pool and other (known!) Grid resources]
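
The idea of policy-driven authorisation checks for a virtual organisation can be sketched as a role-based lookup. This is a generic illustration under stated assumptions, not the PERMIS API: the roles, actions, resource names and the `is_authorised` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical VO policy: each role maps to the (action, resource) pairs it permits.
POLICY = {
    "student": {("submit_job", "condor_pool")},
    "lecturer": {("submit_job", "condor_pool"), ("submit_job", "scotgrid")},
}


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset


def is_authorised(user, action, resource, policy=POLICY):
    """Grant the request if any of the user's roles permits it; deny otherwise."""
    return any((action, resource) in policy.get(role, set()) for role in user.roles)


student = User("alice", frozenset({"student"}))
lecturer = User("bob", frozenset({"lecturer"}))

print(is_authorised(student, "submit_job", "condor_pool"))
print(is_authorised(student, "submit_job", "scotgrid"))
```

Keeping the policy as data, separate from the decision function, lets the VO change who may do what without touching the services that ask for a decision.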
Scottish Bioinformatics Research Network
Four year proposal expected to start imminently
Funded (£2.4M) by Scottish Enterprise, Scottish Higher Education
Funding Council, Scottish Executive Environment and Rural Affairs
Department

Involves Glasgow, Dundee, Edinburgh, Scottish Bioinformatics Forum
Aim to provide bioinformatics infrastructure for Scottish health,
agriculture and industry



Infrastructure support at Dundee, Edinburgh and Glasgow to support first-rate
research in bioinformatics at each academic institute
Infrastructure support at three institutes, to support inter-institutional sharing of
compute and data resources through application of Grid computing
Outreach and training activities mediated by the Scottish Bioinformatics Forum
VOTES
Virtual Organisations for Trials and Epidemiological Studies
3 year MRC (£2.8M) funded project expected to start imminently
Plans to develop Grid infrastructure to address key components of
clinical trial/observational study



Recruitment of potentially eligible participants
Data collection during the study
Study administration and coordination
– Involves Glasgow, Oxford, Leicester, Nottingham, Manchester
Clinical Virtual Organisation Framework
– Used to realise CVO-1 (e.g. for data collection) and CVO-2 (e.g. for recruitment)
[Figure: sites GLA, OX, IMP, Lei and Nott linked by a Transfer Grid to GPs, clinical trial data sets, disease registries and hospital databases]
Genetics and Healthcare Initiative
Five (2+3) year proposal (£4.4M) expected to start imminently
Funded by Health Department and Department for Enterprise and
Lifelong Learning

Involves Glasgow, Dundee, Edinburgh, Aberdeen
– focus on genetics as applied to healthcare
– first two years emphasis on providing a platform for research into the genetic
basis of common complex diseases in Scotland
» Mental health, cardiovascular, …
» Plan to establish 15,000 family-based intensively-phenotyped cohort recruited from
the East and West of Scotland
– basis for neutralising heritable (genetic) risk factors in disease surveillance,
treatment optimisation, avoidance of adverse drug events and prediction of
response to therapy, health care planning and drug discovery, …
Systems Biology?
Once we have (securely) connected all relevant
data sets, simplified access to and usage of
HPC resources, and wrapped your favourite
bioinformatics applications as Grid services...
what questions would you like to ask?
– How does a cell work?
– Why do people who eat less tend to live longer?
– How many people across Scotland who had a heart attack in the last 5
years took drug X, and of those that did, were genes A or B
influenced by this drug?
– Who has performed an experiment similar to mine and were their
results similar?
– …
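
Once clinical and prescribing data are connected, the first part of the heart attack question above reduces to a join across data sets. The sketch below is a toy version under stated assumptions: it uses Python's built-in sqlite3 module, the tables, columns and rows are invented, and the gene expression part of the question is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (patient_id INTEGER, kind TEXT, year INTEGER);
CREATE TABLE prescriptions (patient_id INTEGER, drug TEXT);
INSERT INTO events VALUES
    (1, 'heart attack', 2003),
    (2, 'heart attack', 1995),
    (3, 'heart attack', 2004),
    (4, 'stroke', 2004);
INSERT INTO prescriptions VALUES (1, 'X'), (2, 'X'), (3, 'Y'), (4, 'X');
""")


def count_patients(conn, drug, since_year):
    """Count distinct patients with a heart attack since `since_year` who took `drug`."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT e.patient_id)
        FROM events AS e
        JOIN prescriptions AS p ON p.patient_id = e.patient_id
        WHERE e.kind = 'heart attack' AND e.year >= ? AND p.drug = ?
        """,
        (since_year, drug),
    ).fetchone()
    return row[0]


print(count_patients(conn, "X", 2000))
```

In practice the two tables would sit in different organisations behind different access controls, which is where the federation and authorisation work described earlier comes in.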
www.nesc.ac.uk
Bridges Portal
MagnaVista
QTL upload
QTL browsing
Grid Blast Client
• Allows ‘genome scale’ blasting
• Uses ScotGrid and idle compute resources of training lab Condor pool
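
Spreading a large BLAST run over a pool of machines usually starts by splitting the query sequences into batches, one per job. The sketch below shows that step only and is an assumption about the general approach, not the Grid Blast Client's actual implementation; the function names and the sample sequences are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, lines = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(lines)))
            header, lines = line[1:], []
        else:
            lines.append(line)
    if header is not None:
        records.append((header, "".join(lines)))
    return records


def split_into_batches(records, n_batches):
    """Deal records round-robin into at most n_batches non-empty batches."""
    batches = [[] for _ in range(n_batches)]
    for index, record in enumerate(records):
        batches[index % n_batches].append(record)
    return [batch for batch in batches if batch]


# Invented sample input with three query sequences.
fasta = ">q1\nACGT\nAC\n>q2\nGGTT\n>q3\nTTAA\n"
records = parse_fasta(fasta)
print(split_into_batches(records, 2))
```

Each batch could then be written to its own file and handed to a separate job, with the per-batch results concatenated afterwards.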