Transcript my Grid

Bioinformatics and Grids
Professor Carole Goble,
University of Manchester,UK
Director of myGrid e-Science project
Co-director ESNW e-Science Regional Centre
Roadmap
• Post Genome biology
• Challenges for bioinformatics
• Why biology isn’t physics
• Information-centric Grids
• An example: myGrid
• Other projects
• Take home
Take home
Take home
Complexity & Diversity - Size isn’t everything.
• Computation is important but information and knowledge services dominate.
• Integration, curation, annotation, fusion
• Automating support for integration and fusion means moving from…
  … human interaction to machine interaction.
  … machine readable to machine-understandable.
• Metadata using ontologies for finding, managing & controlling services & content

Functional Genomics
• An integrated view of how organisms work and interact in growth, development and pathogenesis
• From single gene to whole genome
• From single biochemical reactions to whole physiological and developmental systems
• What do genes do?
• How do they interact?
Genotype to Phenotype
[Figure: stages shown are DNA, protein sequence, protein structure, function, organism, population. Other labels: DNA ‘chips’, Modelling, Expression, Folding, Synchrotron, Proteomics, Domain analysis, SNP, Gene prediction, HTP Sequencing]
Link the observable behaviour of an
organism with its genotype
Drug Discovery
Pharmacogenomics
Knowledge/Information Flow
[Figure labels: Data Capture; Hypotheses; Design; Model & Analysis; Libraries; Clinical Resources; Individualised Medicine; Clinical Image/Signal; Genomic/Proteomic; Knowledge Repositories; Data Mining; Case-Base Reasoning; Analysis; Information Sources; Information Fusion; Integration; Annotation / Knowledge Representation]
Use Cases (I3C)
• Show me all the genes in the glucose metabolism pathway and get their GenBank accession numbers
• Find all the citations for the HOX gene family for human and mouse
• Find all the kinase genes from Wormbase and retrieve the DNA sequence
Use Cases
Show me Nucleotide binding proteins in mouse
Answer:
• P12345 in Swiss-Prot is an ATPase
• Terri Attwood is an expert on this
• Jackson labs have a database but you need to register
• A paper has just been published in Proteins by the Stanford lab on this.
Which compounds interact with (alpha-adrenergic
receptors) ((over expressed in (bladder epithelial
cells)) but not (smooth muscle tissue)) of ((patients
with urinary flow dysfunction) and a sensitivity to the
(quinazoline family of compounds))?
Drug formulary
High thro’put screening
Expressn. database
Tissue database
Chemical database
Enzyme database
Clinical trials database
SNPs database
Receptor database
http://www3.ebi.ac.uk/Services/DBStats/
Large amounts of data
• EMBL July 2001: 150 Gbytes
• Microarray: 1 Petabyte per annum
• Sanger Centre: 20 terabytes of data
• Genome sequences increase 4x per annum
High throughput experimental methods
• Micro arrays for gene expression
• Robot-based capture
• 10K data points per chip
• 20 x per chip
• Cottage industry to industrial scale
100,000 genes x 320 cell types x 2000 stimuli x 3 time points x 2 concentrations x 2 replicates
= approx. 8 x 10^11 data points
= approx. 1 x 10^15 bytes = 1 petabyte
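The petabyte estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the six factors come from the slide, while the bytes-per-measurement figure is an inference (the slide does not state how 8 x 10^11 becomes 10^15).

```python
from math import prod

# Experimental dimensions as listed on the slide.
dimensions = {
    "genes": 100_000,
    "cell types": 320,
    "stimuli": 2000,
    "time points": 3,
    "concentrations": 2,
    "replicates": 2,
}

# Total number of measurements: 768,000,000,000, i.e. about 8 x 10^11.
measurements = prod(dimensions.values())

# Assumption (not on the slide): reaching 10^15 bytes from 8 x 10^11
# measurements implies a bit over a kilobyte stored per measurement.
PETABYTE = 10**15
bytes_per_measurement = PETABYTE / measurements

print(f"{measurements:.2e} measurements")
print(f"{bytes_per_measurement:.0f} bytes per measurement to reach 1 PB")
```

The product itself is exact; only the storage cost per measurement is a guess at what the slide intended.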
Heterogeneity
• Data types & forms
• Community
• Autonomy
• Over 500 different databases
  • Different formats, structure, schemas, coverage…
  • Web interfaces, flat file distribution,…
Heterogeneity
• Complexity
• Diversity
[Figure: interlinked resource types: Phenotype, Gene sequence, Genome sequence, Disease, Drug, Gene expression, Proteome, Protein Structure, Protein Sequence, Clinical trial, P-P interactions, homology]
Heterogeneity
• Complexity
• Diversity
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns & motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …
Heterogeneous Data
• Multimedia
• Images & Video
• Text annotations & literature
• Descriptive as well as numeric
• Knowledge-based
Text Extraction
SWISSPROT:TET9_ENTFA
ID TET9_ENTFA STANDARD; PRT; 639 AA.
AC P21598;
DT 01-MAY-1991 (REL. 18, CREATED)
DT 01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT 01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN TETM(916).
OS ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA BURDETT V.;
RL NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR EMBL; X56353; G47062; -.
DR PIR; S13142; S13142.
DR PROSITE; PS00301; EFACTOR_GTP; 1.
KW PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW TRANSPOSABLE ELEMENT.
FT NP_BIND 10 17 GTP (BY SIMILARITY).
FT NP_BIND 74 78 GTP (BY SIMILARITY).
SQ SEQUENCE 639 AA; 72464 MW; 523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAHVDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTRILFHALRKMG
IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVELYPNVCVTNFTESEQWDTVIE
GNDDLLEKYMSGKSLEALELEQEESIRFQNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS
STHRGPSELCGNVFKIEYTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING
ELCKIDRAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPEQREM
LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQEKYHVEIEITEPTVIY
MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPLGSGMQYESSVSLGYLNQSFQNAVMEG
IRYGCEQGLYGWNVTDCKICFKYGLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL
SFKIYAPQEYLSRAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR
SVCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
Swiss-Prot
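The Swiss-Prot entry above is easy for a program to read, because every line starts with a two-letter code. The following is a minimal sketch of that kind of extraction, using a shortened copy of the entry; the helper name `parse_entry` is hypothetical and this is not an official Swiss-Prot parser.

```python
from collections import defaultdict

# A shortened copy of the entry shown on the slide.
ENTRY = """\
ID TET9_ENTFA STANDARD; PRT; 639 AA.
AC P21598;
DE TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
OS ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
DR EMBL; X56353; G47062; -.
DR PIR; S13142; S13142.
DR PROSITE; PS00301; EFACTOR_GTP; 1.
KW PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW TRANSPOSABLE ELEMENT.
"""

def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) == 2 and value.strip():
            fields[code].append(value.strip())
    return fields

fields = parse_entry(ENTRY)

# Accession number, with the trailing semicolon removed.
accession = fields["AC"][0].rstrip(";")

# Keywords span several KW lines and are separated by semicolons.
keywords = [k.strip() for k in " ".join(fields["KW"]).rstrip(".").split(";") if k.strip()]

# Cross-references: database name -> primary identifier.
cross_refs = {ref.split(";")[0]: ref.split(";")[1].strip() for ref in fields["DR"]}

print(accession, keywords, cross_refs)
```

Pulling out the accession, keywords and cross-references is mechanical. The FUNCTION and SIMILARITY comments, by contrast, remain free text that a program can read but not interpret, which is the gap the following slides address.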
Heterogeneity
Lymphocyte associated receptor of death
• LARD
• WSL-LR WSL-S1 WSL-S2 proteins
• WSL-1 protein precursor
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death Domain receptor 3
• WSL protein
• apoptosis inducing receptor AIR
• APO-3
[Figure labels: Functional genomics; Tissue; Structural Genomics; Disease; Population Genetics; Genome sequence; Clinical Data; Clinical trial]



Data resources have been built introspectively for human researchers
Information is machine readable not machine understandable
CONTROLLED VOCABULARIES & ONTOLOGIES
Shared data -> shared meaning
[Figure: five service providers]
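As a rough illustration of how a controlled vocabulary turns shared data into shared meaning, the sketch below maps the synonyms listed earlier for the lymphocyte associated receptor of death onto one preferred term. The choice of preferred term and the function names are assumptions made for the example; the slides do not say which name is canonical.

```python
# Hypothetical preferred term; the slide does not say which name is canonical.
PREFERRED = "Death Domain receptor 3"

# Synonyms taken from the slide.
SYNONYMS = [
    "LARD",
    "Lymphocyte associated receptor of death",
    "WSL-1 protein precursor",
    "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP",
    "Death Domain receptor 3",
    "WSL protein",
    "apoptosis inducing receptor AIR",
    "APO-3",
]

def normalise(name):
    """Lower-case a name and collapse whitespace so lookups are forgiving."""
    return " ".join(name.lower().split())

VOCABULARY = {normalise(s): PREFERRED for s in SYNONYMS}

def canonical(name):
    """Return the preferred term for a known synonym, or None if unknown."""
    return VOCABULARY.get(normalise(name))

print(canonical("apo-3"))
print(canonical("Apoptosis-mediating receptor TRAMP"))
print(canonical("BRCA1"))
```

A flat synonym table is the simplest case. An ontology goes further by also recording relationships between terms, so that services can reason over them.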
Complexity
• Multiple views
• Interrelated
• Intra and inter cell interactions and bioprocesses
"Courtesy U.S. Department of Energy Genomes to Life program (proposed) DOEGenomesToLife.org."
Instability & Quality
• Exploring the unknown
  • At least 5 definitions of a gene
  • The sequence is a model
  • Other models are “work in progress”
• Names unstable
• Data unstable
• Models unstable

“the problem in the field is not a lack of good integrating software, Smith says. The packages usually end up leading back to public databases. "The problem is: the databases are God-awful," he told BioMedNet. … If the data is still fundamentally flawed, then better algorithms add little.”
Temple Smith, director of the Molecular Engineering Research Center at Boston University
Curation
[Figure labels: SWISSPROT; MEDLINE papers; annotation; BLOCKS. Counts: nrdb 503,479; TrEMBL 234,059; Swiss-Prot 85,661; InterPro 2990; PRINTS 1310; Expressed Sequence Tags: millions]
Infrastructure & Integration
Technologies:
•CORBA and the OMA
•Java and JavaBeans
•Data mining
•Algorithm development
•Knowledge discovery
•Knowledge representation
•Visualisation
•Query tools and services
•Database replication
•OO technology
•OO databases
•Networks and security
•Data cleaning & validation
CORBA / Java / BSA / SRS
[Figure labels: Structural Genomics; SNPs; Sequence Data; Expression Data; Gene Analysis; Mutation/Variation; Differential; Pattern Discovery; Temporal; Gene Prediction; In situ; Functional Genomics; Splice Sites; Promoters; EST; Gene Identification; Gene Networks; Proteomics; Gene Annotation and Function; Regulation of Metabolism; Biochemical Pathways; Signal Transduction; Metabolomics]
Bioinformatics Analysis
• Different algorithms
  • BLAST, FASTA, pSW
• Different implementations
  • WU-BLAST, NCBI-BLAST
• Different service providers
  • NCBI, EBI, DDBJ
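One way to cope with this choice of algorithms, implementations and providers is to describe each service with metadata and select by description. The sketch below is illustrative only: the `ServiceDescription` class and `find_services` function are hypothetical, and the pairing of providers with implementations in the registry is made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceDescription:
    algorithm: str
    implementation: str
    provider: str

# Illustrative registry: the names come from the slide, but which provider
# hosts which implementation is invented for this example.
REGISTRY = [
    ServiceDescription("BLAST", "NCBI-BLAST", "NCBI"),
    ServiceDescription("BLAST", "WU-BLAST", "EBI"),
    ServiceDescription("BLAST", "NCBI-BLAST", "DDBJ"),
    ServiceDescription("FASTA", "FASTA", "EBI"),
]

def find_services(registry, **criteria):
    """Return every service whose description matches all given criteria."""
    return [
        s for s in registry
        if all(getattr(s, key) == value for key, value in criteria.items())
    ]

for service in find_services(REGISTRY, algorithm="BLAST"):
    print(service.implementation, "at", service.provider)
```

Exact string matching is the weakest form of discovery. Describing services with ontology terms would let a query for "sequence similarity search" find BLAST and FASTA services alike.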
In silico experimentation
[Figure, built up over successive slides: myProteins; BLAST; Interpro; Swiss-Prot; PIR; Go-Blast; visualisation; medline]
In silico experimentation
• Discovery, interoperation, fusion, sharing of data, knowledge and workflows
• Explicit management of workflows
• Improving quality of experiments & data
• provenance & propagating change
• Scientific discovery is personal & global
• information & processes & best practice
• personalisation & collaborative working
• Security, ownership -> valuable assets
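To make "explicit management of workflows" and "provenance" concrete, here is a minimal sketch of a workflow runner that records what each step consumed and produced. It is not myGrid code: the step functions are local stand-ins for remote analysis services, and "Q00001" is a placeholder identifier.

```python
from datetime import datetime, timezone

def run_workflow(steps, data):
    """Apply each (name, function) step in order, logging provenance."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": data,
            "output": result,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance

# Stand-in steps; a real workflow would call remote analysis services.
def deduplicate(ids):
    return sorted(set(ids))

def keep_first_two(ids):
    return ids[:2]

steps = [("deduplicate", deduplicate), ("filter", keep_first_two)]
result, log = run_workflow(steps, ["P21598", "P12345", "P21598", "Q00001"])

for entry in log:
    print(entry["step"], entry["input"], "->", entry["output"])
```

Because each log entry keeps its input and output, a change in an upstream database can be traced forward to the results that depend on it.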
myGrid
Personalised extensible environments for data-intensive in silico experiments in biology
http://www.mygrid.org.uk



myGrid
• UK e-Science Grid programme pilot (EPSRC)
• Generic middleware
• Bioinformatics & Genomics setting
• 1st October 2001 -- 31st March 2005
  • (36 months funded in a 42-month execution period)
• 16 full-time researchers/developers
myGrid
Partners
A Desiderata (cf. Grid)
• Software development toolkits
• Standard protocols, services & APIs
• A modular “bag of technologies”
• Enable incremental development of grid-enabled tools and applications
• Reference implementations
• Learn through deployment and applications
• Open source
[Figure: layers: Applications; Diverse global services; Core services; Local OS]
myGrid
Stack
Approach
Applications
Toolkits/Portals
Metadata
Personalisation
Agent-based Interoperation layer
Governance mgt Process/workflow mgt
Communication fabric
I.E
Data mgt
myGrid Outcomes
1. e-Scientists
• Environment built on toolkits for service access, personalisation & community
• Gene function expression analysis using S. cerevisiae
• Annotation workbench for the PRINTS pattern database
2. Developers
• Protocols and service descriptions
• myGrid-in-a-Box developers kit
• Re-purposing DAS, AppLab and OpenBSA …
• Integrating ISYS & GlaxoSmithKline platforms
myGrid tech outcomes
1. Services, service descriptions (ontologies), message protocols & APIs
2. Database access from the Grid
3. Process enactment on the Grid
4. Personalisation services
5. Provenance services
6. Metadata services ~ DAML+OIL, RDF(S)
7. Laying the foundations for Agent Services
Converging technologies
• Grid Computing: Globus, Sun Grid Engine, Condor, DS (Jini, Corba)
• Web Service & Semantic Web Technologies: SOAP, WSDL, UDDI, WSIL, DAML+OIL, OWL, RDF(S), WSFL
• Agents: ACL, methodology
[Figure: myGrid service architecture. Labels: Service Functionality; Metadata; User Directory; Service Discovery; Ontological Definitions; Ontological Reasoning; Workflow Provenance; Validation; Provenance Repository; User Agent; User Repository; Workflow Personalisation; Databases; Workflow Enactment; Workflow Resolution; Distributed Queries; Serialised Workflow Repository; Information Extraction; Job Scheduling; Resource Mgt; Services; Notification; Authentication; Workflow Definition Repository]
Standards and Activities
• Open Source: Open Bio Foundation BioJava, BioPerl …
• (De facto) Standards: OMG LSR, I3C, MGED, Gene Ontology
• Consortium Expertise: View propagation, reasoning, workflow …
• Semantic Web: RDF, RDFS, DAML+OIL
• Bioinformatics integration platforms: DAS, OpenBSA, ISYS, OpenMMS, Kleisli, Ensembl, AppLab, SRS, BioNavigator, DiscoveryLink, K1, TAMBIS, MOBY …
• Web Services: XML, SOAP, WSDL, UDDI
• Distributed Computing Environments: CORBA, RMI, JavaOne
• GRID: Globus/SRB/Condor/Sun Grid Engine
Other BioGrids
• BioOpera
• North Carolina BioGrid
• Novartis Grid
• Scientific Annotation Middleware project
• Entropia AIDS modelling Grid …
• DiscoveryNet
• Proteomics analysis
• Protein structure prediction
• Biodiversity
• CLEF Clinical records …
myGrid Summary
• myGrid aims to develop infrastructure middleware for an e-Biologist’s workbench
• The setting is bioinformatics but the results are intended to be generally applicable to e-Science
• A mix of standard, vanguard and bleeding edge technologies, advanced development and (some) research
• Academic & commercial partnership
• myGrid project is timely & reflects a community desire to “collaborate, or die”
Take home reprise
Complexity & Diversity - Size isn’t everything.
• Computation is important but information and knowledge services dominate.
• Integration, curation, annotation, fusion
• Automating support for integration and fusion means moving from…
  … human interaction to machine interaction.
  … machine readable to machine-understandable.
• Metadata using ontologies for finding, managing & controlling services & content

Acknowledgements
• Colleagues on myGrid
• Robert Stevens
• Norman Paton
• Alan Robinson at EMBL-EBI
• I3C Interoperable Informatics Infrastructure Consortium http://www.i3c.org
URLs
• EBI: http://www.ebi.ac.uk/
• LSR: http://www.omg.org/homepages/lsr/
• Open-Bio: http://www.open-bio.org/
• I3C: http://www.i3c.org/
"Molecular biologists appear to have eyes for
data that are bigger than their stomachs. As
genomes near completion, as DNA arrays on
chips begin to reveal patterns of gene
sequences and expressions, as researchers
embark on characterising all known proteins,
the anticipated flood of data vastly exceeds in
scale anything biologists have been used to."
(Editorial Nature, June 10, 1999)




Presented over the AccessGrid to the CSC Finnish IT Centre for Science Grid Seminar
Otaniemi, Espoo, Finland
6th March 2002
http://www.csc.fi/suomi/tapahtumat/GridSeminar/