Proteomics databases for comparative studies: Transactional and Data Warehouse approaches
Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall
GeneProt, Switzerland
What is Proteomics?
[Workflow diagram: Sample -> Separation (CEX, RP) -> MS and MS/MS -> BioInformatics processes -> Manual analysis -> DB, with identification against Protein (P1-P5), EST (E1-E3), Genomic (G1) and Peptide data.]
About DBs for Proteomics at GeneProt
- Needs
- Data
- Transactional DB
- Data Warehouse
- Data Mining
Data Management Challenges
- A high-throughput environment requires near-real-time processing.
- Quick response to evolving laboratory procedures and evolving user needs.
- Accommodate heterogeneous data types.
- Manage a constantly rising flood of data.
- Need for convenient data access at all levels of granularity via analysis software and web front ends.
- Adapt to demand for global queries across all proteomics studies.
- Adapt and innovate to offer new tools: statistics, data mining.
Data Flow
[Diagram: experimental data (LIMS) -> XML -> DB; identification of peptides and proteins; annotation from external data sources; data export.]
Data details
- Experimental data: store MS and MS/MS peak lists; store all metadata.
- External data sources: import information from external data sources: taxonomy, ontologies, bibliography…
- Identification: load peptide matches, identified proteins, scores.
- Automatic annotation and analysis: give access to data, store results.
- Expert annotation: give interactive access to data using a Web interface; store manual validation and annotation.
- Export data: export all or a subset of data, as flat files or a database dump.
- Misc: access control, security and confidentiality; data consistency/integrity checks; error checks and corrections; run statistics; backup and archive.
Data production per project
- Raw data (spectra): 330 000 -> 1 500 000
- Identified peptides: 45 000 -> 145 000
- Identified sequences: 10 000 -> 120 000
- Database size: 15 GB -> 140 GB
- Number of projects: 16
- 1 TB of database files
Implementation: transactional
- Intended to capture all relevant information from proteomics experiments: protein identification, automatic and manual annotation, and validation.
- Each proteome is isolated in its own ProtDB (16 at present).
- Complex and generic data model for efficient data storage.
- Built-in data consistency and error checks.
- A layer of « views » provides fast query access.
- Web front end: interactive means to visualize, update and validate data.
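As an illustrative sketch only: ProtDB runs on Oracle, but the two ideas named above, built-in consistency checks and a layer of views for fast query access, can be shown with SQLite from the Python standard library. All table, column and view names here are hypothetical, not GeneProt's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT NOT NULL);
    CREATE TABLE peptide_match (
        id         INTEGER PRIMARY KEY,
        protein_id INTEGER NOT NULL REFERENCES protein(id),
        score      REAL    NOT NULL CHECK (score >= 0)  -- built-in consistency check
    );
    -- The "views" layer: a precomputed join gives fast query access.
    CREATE VIEW protein_summary AS
        SELECT p.accession, COUNT(m.id) AS n_matches, MAX(m.score) AS best
        FROM protein p JOIN peptide_match m ON m.protein_id = p.id
        GROUP BY p.accession;
""")
db.execute("INSERT INTO protein VALUES (1, 'P12345')")
db.executemany("INSERT INTO peptide_match VALUES (?, 1, ?)", [(1, 0.9), (2, 0.7)])
print(db.execute("SELECT * FROM protein_summary").fetchall())
# -> [('P12345', 2, 0.9)]
```

An insert with a negative score or a dangling protein_id would be rejected by the engine itself, which is the point of pushing the checks into the schema.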
Limitations
We have 16 projects on-line:
- High cost of maintenance to keep all database schemas compatible.
- Space: could we archive some of the projects?
- New spectrometers produce more data.
- Inter-database queries: the technique « exists », but implementation is often awkward and there is no efficient solution in our case.
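A hypothetical illustration of the inter-database query problem. Oracle has its own mechanisms (database links); SQLite's ATTACH shows the same shape of the issue: with each project in its own database, a cross-project query must name and join every database explicitly, which is workable for two and awkward for sixteen.

```python
import sqlite3

main = sqlite3.connect(":memory:")
main.execute("ATTACH ':memory:' AS proj1")
main.execute("ATTACH ':memory:' AS proj2")
for schema, acc in [("proj1", "P12345"), ("proj2", "P12345")]:
    main.execute(f"CREATE TABLE {schema}.protein (accession TEXT)")
    main.execute(f"INSERT INTO {schema}.protein VALUES (?)", (acc,))

# Proteins seen in both projects: every new project means editing the query.
rows = main.execute("""
    SELECT p1.accession FROM proj1.protein p1
    JOIN proj2.protein p2 ON p1.accession = p2.accession
""").fetchall()
print(rows)  # -> [('P12345',)]
```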
What about overcoming these limitations and taking advantage of this wealth of data?
- Decide what data are actually important in the long term.
- Merge the data from all the projects.
- Clean and consolidate the data.
- Implement an update procedure to keep this « merged data system » up to date (archive old projects).
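The merge-and-consolidate steps above can be sketched under assumed data shapes: suppose each project exports its identified proteins as (accession, score) rows, and consolidation keeps one record per accession with the best score and the list of projects it was seen in. All names and the record layout are illustrative, not the real ProtDB export.

```python
from collections import defaultdict

def consolidate(projects):
    """projects: dict mapping project name -> list of (accession, score)."""
    merged = defaultdict(lambda: {"best_score": 0.0, "projects": []})
    for name, rows in projects.items():
        for accession, score in rows:
            rec = merged[accession]
            rec["best_score"] = max(rec["best_score"], score)  # keep final result only
            rec["projects"].append(name)                       # keep provenance
    return dict(merged)

merged = consolidate({
    "proj1": [("P12345", 0.9), ("Q67890", 0.4)],
    "proj2": [("P12345", 0.7)],
})
print(merged["P12345"])
# -> {'best_score': 0.9, 'projects': ['proj1', 'proj2']}
```

Re-running the same function weekly over fresh exports is the simplest form of the update procedure; archived projects just drop out of the input dict.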
Data Warehouse?
This looks very much like the definition of a data warehouse!
- Data consolidation and integration
- Non-instantaneous accuracy, non-volatility
- Comprehensive data structure
- Query throughput
ProtWare: proteomics data warehouse
1. Stores consolidated and final analysis results; centralises data common to proteins in all proteome studies.
2. Is read-only, not real time; asynchronous updates are run weekly.
3. Data model is focused on proteome-to-proteome comparisons.
4. Comprehensive data structure which enhances the performance of analysis queries.
5. Ideally suited for statistical analysis and data mining tools.
6. Provides a decision support system.
ProtDB and ProtWare data flow
[Diagram: ProtDB instances P1, P2, … Pn -> XML -> Extraction, Transformation, Loading (with classification, taxonomy…) -> ProtWare -> analyses & statistical queries, website, flat file and DB dump exports; data volumes annotated as 10^11, 10^8 and 10^5 bytes.]
ProtDB vs ProtWare
ProtDB: transactional system
- Data input, real-time access to data
- Data updates, annotation, validation
- Error and consistency checks
- Stores experimental data
- Stores all steps of data annotation and validation (keeps history)
- In-depth queries on a given proteome
ProtWare: data warehouse
- Read-only, asynchronous updates from ProtDB
- Consolidated data and final results of annotation and validation (no history)
- No experimental data
- Queries oriented to proteome comparisons, statistics, data mining
- Decision support system
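A hypothetical sketch of the kind of query ProtWare's model favours: once the projects are merged into one structure, "which proteins appear in every proteome?" becomes a single set operation rather than a sixteen-way cross-database join. The proteome names and accessions are invented.

```python
# One merged structure, keyed by proteome study (names are made up).
proteomes = {
    "plasma": {"P12345", "Q67890", "O11111"},
    "serum":  {"P12345", "O11111"},
    "csf":    {"P12345", "Q67890"},
}

# Proteome-to-proteome comparison: proteins present in every study.
shared = set.intersection(*proteomes.values())
print(sorted(shared))  # -> ['P12345']

# Per-proteome counts feed directly into statistics and data mining.
counts = {name: len(accs) for name, accs in proteomes.items()}
```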
The needle in a haystack
Of course we are looking for the Holy Grail! Find the interesting proteins in all our data that:
- can be used for diagnostics,
- can explain a disease,
- can be used to cure a disease.
KDD and Data Mining
Knowledge Discovery in Databases is « the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data ».
Data Mining is the discovery stage of the KDD process.
Data mining tools provide additional possibilities to explore a database.
Data Mining tools
- ProtWare: the data warehouse model is protein-query oriented.
- R package: statistics and clustering tools.
- Oracle 10g: new data mining functions.
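The slide names R and Oracle 10g for the actual mining; as a language-neutral sketch of the clustering step those tools provide, here is a tiny k-means over made-up per-proteome abundance profiles, grouping proteins whose profiles behave alike. The data, the fixed starting centres and the function itself are illustrative.

```python
def kmeans(points, centres, iters=10):
    """Plain k-means: assign each point to its nearest centre, then
    recompute each centre as the mean of its group."""
    for _ in range(iters):
        groups = {c: [] for c in range(len(centres))}
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centres[c])))
            groups[nearest].append(p)
        centres = [tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else centres[c]
                   for c, g in groups.items()]
    return groups

# Invented abundance profiles (one tuple per protein, one value per proteome).
profiles = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)]
clusters = kmeans(profiles, centres=[(0.0, 0.0), (5.0, 5.0)])
print(clusters[0])  # -> [(0.1, 0.2), (0.2, 0.1)]
```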
Database infrastructure
- Data input files use XML.
- RDBMS: Oracle 9i, moving to Oracle 10g on Linux.
- ProtWare uses ANSI SQL, portable to other ANSI SQL compliant systems (PostgreSQL).
- Web interface built using standard technologies: PERL, CGI, DBI, HTML, Javascript, SVG.
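Since data input files use XML, loading one can be sketched with the standard library. The element and attribute names below are invented; the real LIMS export format is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical LIMS export: one experiment, spectra with peak lists.
doc = """
<experiment project="proj1">
  <spectrum id="s1"><peak mz="500.3" intensity="1200"/></spectrum>
</experiment>
"""

root = ET.fromstring(doc)
# Flatten the tree into rows ready for a bulk database load.
rows = [(root.get("project"), s.get("id"),
         float(p.get("mz")), float(p.get("intensity")))
        for s in root.iter("spectrum") for p in s.iter("peak")]
print(rows)  # -> [('proj1', 's1', 500.3, 1200.0)]
```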