Proteome data integration: characteristics and challenges
K. Belhajjame1, R. Cote4, S.M. Embury1, H. Fan2, C. Goble1, H.
Hermjakob, S.J. Hubbard1, D. Jones3, P. Jones4, N. Martin2, S. Oliver1,
C. Orengo3, N.W. Paton1, M. Pentony3, A. Poulovassilis2, J. Siepen,
R.D. Stevens1, C. Taylor4, L. Zamboulis2, and W. Zhu4
1University of Manchester
2Birkbeck College
3University College London
4European Bioinformatics Institute
Outline
Experimental proteomics
ISPIDER architecture
Example use cases
Conclusion
All Hands Meetings, 2005
Experimental proteomics
An essential component for elucidation of the biological functions of proteins
The study of the set of proteins produced by an organism with the aim of understanding their behaviour under varying conditions
[Figure: protein identification pipeline. Separation (2D gel electrophoresis) -> Protein digestion (enzymatic digestion) -> Mass spectrometry (MALDI-TOF) -> Identification (Protein DB -> Protein ID)]
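The digestion, mass spectrometry and identification steps of this pipeline can be illustrated with a minimal peptide mass fingerprinting sketch. It is not the method used by any ISPIDER service. It assumes a trypsin-like cleavage rule (the slide only says "enzymatic digestion"), uses hypothetical function names (`digest`, `peptide_mass`, `score`, `identify`) and an arbitrary 0.5 Da tolerance, and needs Python 3.7 or later because it splits on empty regex matches.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of the water added to each free peptide


def digest(sequence):
    """Cleave after K or R unless the next residue is P (assumed trypsin-like rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def score(observed, sequence, tol=0.5):
    """Count observed masses that lie within tol of some theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in digest(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def identify(observed, protein_db, tol=0.5):
    """Return the ID of the database protein whose digest explains the most masses."""
    return max(protein_db, key=lambda pid: score(observed, protein_db[pid], tol))


# Toy protein database (made-up sequences) and a simulated spectrum.
protein_db = {"P1": "AAAKGGGR", "P2": "CCCKWWWR"}
observed = [peptide_mass("AAAK"), peptide_mass("GGGR")]
best = identify(observed, protein_db)
```

Real search engines score matches statistically and handle missed cleavages and modifications; the simple match count here only shows the shape of the computation.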
Experimental proteomics
Development of new technologies for:
– protein separation (2D-SDS-PAGE, HPLC, Capillary Electrophoresis)
– mass spectrometry (Multi-Dimensional protein identification)
Availability of publicly accessible protein sequence databases
Proteomics databases (PedroDB, gpmDB, PepSeeker, Pride, …)
Building experiments that involve orchestration of analysis services, data processing, and data integration
Objectives of ISPIDER
A Grid dedicated to the creation of bioinformatics experiments for proteomics
Develop new proteome databases and services, or make existing ones Grid-enabled
Develop middleware support for developing and executing new proteome analyses, based on distributed query processing and workflow technologies
Undertake proteomic studies that demonstrate the effectiveness of the resulting infrastructure
Outline
Experimental proteomics
ISPIDER architecture
Example use cases
Conclusion and future directions
ISPIDER architecture
[Figure: ISPIDER architecture diagram. Labels recovered from the diagram, by group:
ISPIDER Proteomics Clients: Vanilla Query Client, 2D Gel Visualisation Client, PPI Validation + Analysis Client, Protein ID Client, + Phosph. Extensions, + Aspergil. Extensions
ISPIDER Proteomics Grid Infrastructure: Web services, Proteome Request Handler, Proteomic Ontologies/Vocabularies, Source Selection Services, Instance Ident/Mapping Services, Data Cleaning Services
Existing E-Science Infrastructure: myGrid Ontology Services, myGrid DQP, myGrid Workflows, AutoMed
ISPIDER Resources: PEDRo, PID, Phos, PRIDE (each behind WS)
Public Proteomics Resources / Existing Resources: GS, TR, PS, PF, FA, PPI (each behind WS)]
KEY: WS = Web services, GS = genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package
Outline
Experimental proteomics
ISPIDER architecture
Example use cases
Conclusion and future directions
Value-added protein datasets
Motivation
The results of protein identification experiments are usually used as input to further analysis processes.
– Gathering evidence for a biological hypothesis
– Suggesting new hypotheses
Objective
Augment the identification results with additional information on the identified proteins
Implementation
Taverna workflow system
Value-added protein datasets
[Figure: workflow diagram. Labelled components: PepMapper Web Service, Auxiliary Services, GO Services]
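The idea of augmenting identification results can be sketched as a join between identification hits and an annotation lookup. This is only an illustration, not the Taverna workflow itself: `annotate`, `fake_lookup`, the record fields and the accession-to-term sample data are all hypothetical, and a real workflow would call the GO services rather than an in-memory dictionary.

```python
def annotate(identifications, lookup_go_terms):
    """Return copies of the hits with GO terms attached; the input is not mutated."""
    return [
        {**hit, "go_terms": sorted(lookup_go_terms(hit["accession"]))}
        for hit in identifications
    ]


# Hypothetical stand-in for a GO annotation service (invented sample mapping).
FAKE_GO = {"P12345": {"GO:0005524", "GO:0004672"}}


def fake_lookup(accession):
    """Return the GO terms known for an accession, or an empty set."""
    return FAKE_GO.get(accession, set())


# Made-up identification results, as a protein identification service might return.
hits = [
    {"accession": "P12345", "score": 87.0},
    {"accession": "Q00001", "score": 41.5},
]
enriched = annotate(hits, fake_lookup)
```

Passing the lookup as a function keeps the enrichment step independent of where annotations come from, which mirrors how a workflow swaps one service for another.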
Genome-focused protein identification
Motivation
Currently, protein identification searches are performed over large data sets. This means fewer false negatives, but false positives are also more likely.
Objective
More focused and thus more efficient protein identification
Implementation
Taverna workflow system
DQP, a service-based query processor
Genome-focused protein identification
select p.Name, p.Seq
from p in db_proteinSequences
where p.OS='HomoSapiens';
[Figure: workflow diagram. Labelled components: PepMapper web service, DQP Web Service, GOA Web Service, IPI]
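The effect of the OQL query above, restricting the search space to one organism before identification, can be mimicked in a few lines. This sketch is a plain in-memory filter, not DQP: the `ProteinEntry` class, the `select_by_organism` helper and the sample entries are hypothetical, and only the attribute names (Name, Seq, OS) and the 'HomoSapiens' value come from the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinEntry:
    name: str
    seq: str
    os: str  # organism, mirroring the OS attribute in the OQL query


def select_by_organism(entries, organism):
    """In-memory analogue of:
    select p.Name, p.Seq from p in entries where p.OS = organism
    """
    return [(p.name, p.seq) for p in entries if p.os == organism]


# Made-up entries standing in for db_proteinSequences.
db_protein_sequences = [
    ProteinEntry("protA", "MKTAYIAK", "HomoSapiens"),
    ProteinEntry("protB", "MSLLTEVE", "MusMusculus"),
    ProteinEntry("protC", "MGSSHHHH", "HomoSapiens"),
]
human = select_by_organism(db_protein_sequences, "HomoSapiens")
```

Only the filtered name and sequence pairs would then be handed to the identification step, which is what makes the search more focused.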
Integrated access to proteome databases
Motivation
The ability to analyse existing proteomics results en masse is limited by heterogeneities between the schemas of the different databases
Objective
Provide integrated access to proteome databases through a common schema
Implementation
AutoMed, a framework for mapping heterogeneous schemata
DQP, a service-based query processor
Integrated access to proteome databases
[Figure: integrated access architecture. Labelled components: User query / Result; AutoMed Query Processor; AutoMed Repository; AutoMed Wrappers; AutoMed DQP Wrapper; OQL query / OQL result; OGSA Distributed Query Processor; three OGSA-DAI Activity boxes; gpmDB, PedroDB, PRIDE]
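What "integrated access through a common schema" means in practice can be shown with a toy field-renaming layer. This is a deliberately simplified sketch and not how AutoMed works: the per-source field names in `MAPPINGS`, the common-schema fields and the helpers `to_common` and `query_all` are all invented for illustration. Only the three database names come from the slides.

```python
# Hypothetical per-source mappings: source field name -> common-schema field name.
MAPPINGS = {
    "PedroDB": {"accession_num": "accession", "description": "protein_name"},
    "gpmDB": {"label": "accession", "desc": "protein_name"},
    "PRIDE": {"acc": "accession", "name": "protein_name"},
}


def to_common(source, record):
    """Rename one source record's fields into the common schema and tag its origin."""
    out = {target: record[field] for field, target in MAPPINGS[source].items()}
    out["source"] = source
    return out


def query_all(sources, predicate):
    """Apply one predicate, written against the common schema, to every source."""
    results = []
    for source, records in sources.items():
        for record in records:
            row = to_common(source, record)
            if predicate(row):
                results.append(row)
    return results


# Made-up records in each source's own (invented) layout.
sources = {
    "PedroDB": [{"accession_num": "P12345", "description": "kinase"}],
    "gpmDB": [{"label": "Q00001", "desc": "transporter"}],
    "PRIDE": [{"acc": "P12345", "name": "kinase"}],
}
kinases = query_all(sources, lambda row: row["protein_name"] == "kinase")
```

The point is that the caller writes one query against `accession` and `protein_name` and never sees the source-specific field names.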
Conclusions
+ Available e-science technologies provide rapid prototyping facilities for bioinformatics analyses
+ Combining such technologies is possible and opens up more possibilities
– Taverna + DQP
– AutoMed + DQP
- Writing custom code is usually required
– Processing service output to extract inputs for following services
– Transforming results between data formats
– Dealing with mismatches between identifiers
Developing a user-guided environment for the detection and resolution of mismatches
Development of Proteomics client applications (PepMapper, PepSeeker and PRIDE)
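The kind of custom glue code mentioned above for identifier mismatches might look like the following sketch. It assumes, purely for illustration, that mismatches come from letter case, stray whitespace and a trailing ".N" version suffix; the helper names `normalise` and `find_mismatches` and the sample identifiers are hypothetical.

```python
def normalise(identifier):
    """Strip whitespace, upper-case, and drop a trailing '.N' version suffix."""
    base = identifier.strip().upper()
    head, dot, tail = base.rpartition(".")
    return head if dot and tail.isdigit() else base


def find_mismatches(produced, accepted):
    """List identifiers from one service that the next service would not recognise."""
    accepted_norm = {normalise(a) for a in accepted}
    return [p for p in produced if normalise(p) not in accepted_norm]


# Made-up identifiers: output of one service and the vocabulary of the next.
produced = ["IPI00012345.2", "IPI00099999.1"]
accepted = ["ipi00012345"]
unresolved = find_mismatches(produced, accepted)
```

Identifiers left in `unresolved` are the ones a user-guided resolution step would need to present for manual mapping.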