CVA for NMR data - National e


Capture, integration, and sharing of
functional genomic data
Steve Oliver
Professor of Genomics
School of Biological Sciences
University of Manchester
http://www.cogeme.man.ac.uk
http://www.bioinf.man.ac.uk
What are biologists interested in?
Complete organisms are much too complicated.
Only very well-understood systems have well-defined pathways.
Many biologists focus on one or a small number of genes.
GENOME
TRANSCRIPTOME
PROTEOME
METABOLOME
The nature of proteomics experiment data
• Sample generation
– Origin of sample
• hypothesis, organism, environment,
preparation, paper citations
• Sample processing
– Gels (1D/ 2D) and columns
• images, gel type and ranges, band/spot
coordinates
• stationary and mobile phases, flow rate,
temperature, fraction details
• Mass Spectrometry
• machine type, ion source, voltages
• In Silico analysis
• peak lists, database name + version,
partial sequence, search parameters,
search hits, accession numbers
A Systematic Approach to
Modelling, Capturing and
Disseminating Proteomics
Experimental Data
http://pedro.man.ac.uk/
The PEDRo UML schema in reduced form
[Figure: UML class diagram. Classes shown: Organism, TaggingProcess, OntologyEntry, PercentX, MobilePhaseComponent, AssayDataPoint, SampleOrigin, GradientStep, Column, OtherAnalyteProcessingStep, ChemicalTreatment, Fraction, AnalyteProcessingStep, OtherAnalyte, Analyte, Sample, TreatedAnalyte, Experiment, MassSpecMachine, RelatedGelItem, mzAnalysis, IonSource, GelItem, Electrospray, BoundaryPoint, DiGEGelItem, Gel1D, Gel, Detection, Spot, Gel2D, DiGEGel, TandemSequenceData, MSMSFraction, IonTrap, PeakList, MALDI, Band, MassSpecExperiment, DBSearch, ToF, DBSearchParameters, ListProcessing, OtherIonisation, Hexapole, PeptideHit, Peak, ProteinHit, OthermzAnalysis, Quadrupole, CollisionCell, Chromatogram Point, Peak-Specific ChromatogramIntegration, Protein]
The Framework Around PEDRo
1. Lab-generated data is encoded using the PEDRo data entry tool, producing an XML (PEML) file for local storage or submission.
2. Locally stored PEML files may be viewed in a web browser (with XSLT), allowing web pages to be quickly generated from datasets.
3. Upon receipt of a PEML file at the repository site, a validation tool checks the file before it is entered into the database.
4. The repository (a relational database) holds submitted data, allowing various analyses to be performed, or data to be extracted as a PEML file or in another format.
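Steps 3 and 4 (validate a submitted file, then load it into a relational store) can be sketched as follows. This is only an illustration: the element and attribute names, the validation rules, and the table layout are assumptions made for the example and are not the real PEML schema or the PEDRo repository design.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a PEML submission.
PEML_EXAMPLE = """
<Experiment hypothesis="Heat shock alters the proteome">
  <Sample name="S1" organism="Saccharomyces cerevisiae"/>
  <Sample name="S2" organism="Saccharomyces cerevisiae"/>
</Experiment>
""".strip()

REQUIRED_SAMPLE_ATTRIBUTES = ("name", "organism")  # assumed rule set


def validate(xml_text):
    """Return a list of problems; an empty list means the file passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != "Experiment":
        problems.append(f"unexpected root element: {root.tag}")
    samples = root.findall("Sample")
    if not samples:
        problems.append("no Sample elements found")
    for i, sample in enumerate(samples, start=1):
        for attr in REQUIRED_SAMPLE_ATTRIBUTES:
            if not sample.get(attr):
                problems.append(f"Sample {i} is missing '{attr}'")
    return problems


def load(xml_text, conn):
    """Validate first, then store the samples; reject invalid submissions."""
    problems = validate(xml_text)
    if problems:
        raise ValueError("; ".join(problems))
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample "
        "(name TEXT, organism TEXT, hypothesis TEXT)"
    )
    rows = [
        (s.get("name"), s.get("organism"), root.get("hypothesis"))
        for s in root.findall("Sample")
    ]
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load(PEML_EXAMPLE, conn)  # number of Sample rows stored
```

Keeping validation separate from loading means the same check can be run by a submitter before the file ever reaches the repository.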
INTEGRATION
Why integrate data?
“These 200 genes are up-regulated in my
experiment. Are any of their protein products
known to interact?”
• Data is stored at a variety of sites and in a variety of formats.
• Existing databases are designed mainly for browsing (MIPS, SGD, BIND, SCPD, KEGG).
• We need databases that allow complex queries.
• They need to be easily usable by biologists.
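The biologist's question quoted above is, in effect, a join between an expression result and an interaction data set. A minimal sketch, assuming both are already available in one place; the gene names and interaction pairs below are invented placeholders, not real data:

```python
from itertools import combinations

# Placeholder data: a set of up-regulated genes from one experiment and
# a set of known pairwise protein interactions (unordered pairs).
upregulated = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
known_interactions = {
    frozenset({"GENE_A", "GENE_C"}),
    frozenset({"GENE_B", "GENE_X"}),
    frozenset({"GENE_D", "GENE_A"}),
}


def interacting_pairs(genes, interactions):
    """Return pairs from `genes` whose protein products are known to interact."""
    hits = []
    for a, b in combinations(sorted(genes), 2):
        if frozenset({a, b}) in interactions:
            hits.append((a, b))
    return hits


pairs = interacting_pairs(upregulated, known_interactions)
```

The query is trivial once the two data sets share identifiers and sit in one store; the difficulty the slide points at is getting them there.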
Genome Information
Management System (GIMS)
Paton NW, Khan SA, Hayes A, Moussouni F, Brass A, Eilbeck K, Goble CA, Hubbard SJ, Oliver SG (2000) Conceptual modelling of genomic information. Bioinformatics 16, 548-557.
GIMS
• Integrates genomic and functional
data.
• Consists of two parts:
– GIMS Database
– GIMS User Interface
GIMS data warehouse
[Diagram labels: Browser, Canned Queries, Analysis Library, GIMS Database, SGD, MIPS, maxD]
Database implementation
• Uses the object database FastObjects.
• All database classes and analysis programs
are written in Java.
• Allows close integration of the programming
language with the database.
• Allows fast access to database data from
application programs.
• Allows data to be stored in a way that reflects
the underlying mechanisms in the organism.
• Very flexible and extensible.
GIMS Contents
(each data type is followed by its data source)
• DNA sequences, chromosome locations of coding regions, e.g. ORFs, tRNAs, centromeres, telomeres etc.
– Source: MIPS
• Predicted protein sequences, pI, mol weight, number of transmembrane regions.
– Source: MIPS
• Protein attributes (e.g. cellular location, function, protein class, Prosite motifs, phenotype).
– Source: MIPS
• Protein interaction data (affinity purification, yeast two-hybrid, genetic interactions).
– Source: Ho et al. (2002), Gavin et al. (2002), MIPS, Uetz et al. (2000), Ito et al. (2001)
GIMS Contents (continued)
(each data type is followed by its data source)
• Metabolic data (reactions, compounds and enzymes).
– Source: L-compound, L-enzyme
• Transcription factor.
– Source: SCPD
• Transcriptome data
– Source: Stanford Microarray Database, University of Manchester (BBSRC COGEME Project)
• Ontology Data
– Source: GO
• Sequence similarity
– Source: SGD
GIMS User Interface
• Java application.
• Can download from
http://img.cs.man.ac.uk/gims
• Communicates with database via RMI.
• On start-up, application is sent information
about database classes and canned queries.
• Very flexible.
• Allows user to browse database, ask canned
queries, and store and combine data sets.
• Can save results as txt, html or xml.
Selecting Canned Queries
Query categories.
Queries in selected category.
Initially empty store.
Parameterising a Query
Previously selected query.
Parameters for specific run: selects downregulated genes in the nucleus.
Viewing the Results
Result collection.
Operations on collections.
Selecting a Second Query
Setting Its Parameters
Parameters for specific run: selects downregulated genes in the same experiment that are transcription factors.
Obtaining Its Results
Inter-relating Results
Collections selected for operating on.
Remove one result from the other.
Result of Difference
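The walkthrough above (parameterise a canned query, keep its result in the store, run a second query, then take the difference) can be sketched as set operations. This is an illustration only: the `Gene` record, the query functions, and the data are assumptions made for the example and do not reflect the GIMS classes or its Java API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    location: str
    is_transcription_factor: bool
    log_ratio: float  # expression change in one experiment; negative = down


# Invented example data.
GENES = [
    Gene("G1", "nucleus", True, -1.8),
    Gene("G2", "nucleus", False, -2.1),
    Gene("G3", "cytoplasm", True, -1.2),
    Gene("G4", "nucleus", False, 0.9),
]


def downregulated_in(location, threshold=-1.0):
    """Canned query 1: downregulated genes in a given cellular location."""
    return {g.name for g in GENES
            if g.location == location and g.log_ratio <= threshold}


def downregulated_transcription_factors(threshold=-1.0):
    """Canned query 2: downregulated genes that are transcription factors."""
    return {g.name for g in GENES
            if g.is_transcription_factor and g.log_ratio <= threshold}


# The store starts empty and accumulates named result collections.
store = {}
store["nuclear_down"] = downregulated_in("nucleus")
store["tf_down"] = downregulated_transcription_factors()

# "Remove one result from the other": downregulated nuclear genes
# that are not transcription factors.
store["difference"] = store["nuclear_down"] - store["tf_down"]
```

Because every result is an ordinary collection, union and intersection combine stored results in the same way as the difference shown here.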
GIMS empowers the biologist
Resources at the centre
• People who have registered an interest in this data
• Workflows that could be used to generate this data
• Literature relevant
• Related Data
• Data holdings
• Annotations
• Provenance record on how the data was produced
• Ontologies describing data
• Services that can use or produce this data
Biologists at the centre
• Workflows they wrote or used
• Literature
• Provenance record of workflow runs they have made
• Notes
• People
• People they collaborate with
• Data holdings
• Ontologies
• Preferences for Services
myGrid
• EPSRC UK e-Science pilot project.
• Open Source Upper Middleware for Bioinformatics.
• (Web) Service-based architecture -> Grid services.
• 42 months, 24 months in.
• Prototype v1 Release Sept 2004; some services available now.
www.mygrid.org.uk
Workflows are in silico experiments
Annotation Pipeline
What is known about my candidate gene?
[Diagram labels: Medline, EMBL, GO, Query, OMIM, BLAST, DQP]
Application: Work bench demonstrator
The myGrid service components are used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services.
We can select data from the myGrid Information Repository (mIR), select a workflow based on its semantic description, and examine the results.
e-Science: Provenance
Like a bench experiment, myGrid records the materials and methods it has used for an in silico experiment in a provenance log.
This is the where, what, when and how the experiment was run.
Derivation paths ~ workflows, queries
Annotations ~ notes
Evolution paths ~ workflow → workflow
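A provenance log of this kind can be pictured as a record written for every step of a workflow run. The sketch below is a minimal illustration under stated assumptions: the `ProvenanceLog` class, its field names, and the two toy services are hypothetical and are not the myGrid provenance format or API.

```python
from datetime import datetime, timezone


class ProvenanceLog:
    """Records where, what, when and how for each step that is run."""

    def __init__(self):
        self.entries = []

    def run(self, service, func, inputs):
        started = datetime.now(timezone.utc).isoformat()
        output = func(*inputs)
        self.entries.append({
            "where": service,                 # which service did the work
            "what": func.__name__,            # which operation was invoked
            "when": started,                  # UTC timestamp of the call
            "how": {"inputs": list(inputs)},  # parameters that were used
            "output": output,
        })
        return output

    def derivation_path(self):
        """The ordered list of operations that produced the final result."""
        return [entry["what"] for entry in self.entries]


# Two toy workflow steps standing in for remote services.
def find_homologues(gene):
    return [gene + "_h1", gene + "_h2"]


def fetch_annotations(genes):
    return {g: "annotation for " + g for g in genes}


log = ProvenanceLog()
homologues = log.run("similarity-service", find_homologues, ["GENE_A"])
annotations = log.run("annotation-service", fetch_annotations, [homologues])
```

Reading `log.derivation_path()` back gives the sequence of steps that led to a result, which is the in silico counterpart of the materials and methods of a bench experiment.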
e-Science: Notification
A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR.
Notifications are presented to the user with a client in the workbench environment.
User registers interest in notification topics.
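Registering interest in topics and being told when something changes is the publish/subscribe pattern. A minimal in-process sketch follows; the class, method, and topic names are assumptions made for illustration, not the myGrid notification service interface.

```python
from collections import defaultdict


class NotificationService:
    """Tiny topic-based publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register_interest(self, topic, callback):
        """Ask to be called whenever something is published on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver `message` to every subscriber; return how many were told."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


service = NotificationService()
inbox = []  # stands in for the notification client in the workbench

service.register_interest(
    "data/changed", lambda topic, msg: inbox.append((topic, msg))
)

delivered = service.publish("data/changed", "sequence entry updated")
ignored = service.publish("workflow/changed", "new workflow version")
```

Only the topic the user registered for reaches the inbox; a change on any other topic is published but delivered to no one.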
The myGrid Team
Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay
Dialani, Alvaro Fernandes, Justin Ferris, Rob
Gaizauskas, Kevin Glover, Carole Goble, Chris
Greenhalgh, Mark Greenwood, Claire Jennings,
Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon
Mee, Simon Miles, Luc Moreau, Juri Papay,
Norman Paton, Simon Pearce, Steve Pettifer,
Milena Radenkovic, Peter Rice, Angus Roberts, Alan
Robinson, Martin Senger, Nick Sharman, Paul
Watson, Anil Wipat and Chris Wroe.
Need GRID to empower the biologist