
Proteomics databases for comparative studies:
Transactional and Data Warehouse approaches
Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall
GeneProt, Switzerland
What is Proteomics ?
[Workflow diagram: Sample → Separation (CEX, RP) → MS and MS/MS → BioInformatics processes (peptide identification against Protein, EST and Genomic sequence databases) → Manual analysis → DB]
About DBs for Proteomics at GeneProt
• Needs
• Data
• Transactional DB
• Data Warehouse
• Data Mining
Data Management Challenges
• A high-throughput environment requires near real time processing
• Quick response to evolving laboratory procedures and evolving user needs
• Accommodate heterogeneous data types
• Manage a constantly rising flood of data
• Need for convenient data access at all levels of granularity via analysis software and web front ends
• Adapt to demand for global queries across all proteomics studies
• Adapt and innovate to offer new tools:
  - Statistics,
  - Data mining.
Data Flow
[Data flow diagram: experimental data (LIMS) exported as XML and loaded into the DB; identification of peptides and proteins; annotation using external data sources; data export]
Data details
• Experimental data:
  - Store MS and MS/MS peak lists
  - Store all meta data
• Identification:
  - Load peptide matches, identified proteins, scores
• Automatic annotation and analysis:
  - Give access to data, store results
• Expert annotation:
  - Give interactive access to data using a Web interface, store manual validation and annotation
• External data sources:
  - Import information from external data sources: taxonomy, ontologies, bibliography…
• Export data:
  - Export all or a subset of data
  - Flat file
  - Database dump
• Misc:
  - Access control, security and confidentiality
  - Data consistency/integrity checks (a sketch follows this list)
  - Error checks and corrections
  - Run statistics
  - Backup and archive
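As an illustration only, here is a minimal sketch, in Oracle-style SQL, of the kind of declarative consistency/integrity checks listed above. The table and column names are invented for this example and are not the actual ProtDB schema.

  -- Hypothetical, simplified tables illustrating built-in consistency checks
  -- (invented names, not the real ProtDB data model).
  CREATE TABLE spectrum (
    spectrum_id  NUMBER        PRIMARY KEY,
    sample_name  VARCHAR2(64)  NOT NULL,
    ms_level     NUMBER(1)     NOT NULL CHECK (ms_level IN (1, 2))  -- MS or MS/MS
  );

  CREATE TABLE peptide_match (
    match_id     NUMBER        PRIMARY KEY,
    spectrum_id  NUMBER        NOT NULL REFERENCES spectrum (spectrum_id),
    peptide_seq  VARCHAR2(200) NOT NULL,
    score        NUMBER        NOT NULL CHECK (score >= 0),
    validated    CHAR(1)       DEFAULT 'N' CHECK (validated IN ('Y', 'N'))
  );

Foreign keys and CHECK constraints of this kind let the database itself reject inconsistent loads before any downstream analysis sees them.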
Data production per project
• Raw data (spectra): 330 000 -> 1 500 000
• Identified peptides: 45 000 -> 145 000
• Identified sequences: 10 000 -> 120 000
• Database size: 15 GB -> 140 GB
• Number of projects: 16
• 1 TB of database files
Implementation: transactional
• Intended to capture all relevant information from proteomics experiments: protein identification, automatic and manual annotation and validation.
• Each proteome is isolated in its own ProtDB (16 at present).
• Complex and generic data model for efficient data storage.
• Built-in data consistency and error checks.
• A layer of « views » provides fast query access (see the sketch below).
• Web front end: interactive means to visualize, update and validate data.
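A minimal sketch of what one such « view » could look like, reusing the hypothetical tables sketched earlier; the view name and columns are invented, not the actual ProtDB views.

  -- Hypothetical view giving fast, readable access to validated peptide matches.
  CREATE OR REPLACE VIEW v_validated_peptides AS
  SELECT s.sample_name,
         pm.peptide_seq,
         pm.score
  FROM   peptide_match pm
  JOIN   spectrum s ON s.spectrum_id = pm.spectrum_id
  WHERE  pm.validated = 'Y';

The web front end and analysis software can then query such views directly instead of navigating the generic underlying model.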
Limitations
• We have 16 projects on-line:
  - High cost of maintenance to keep all database schemas compatible.
  - Space: could we archive some of the projects?
  - New spectrometers produce more data.
• Inter-database queries:
  - Technique « exists » but implementation is often awkward and there is no efficient solution in our case (see the sketch after this list).
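For concreteness, one common way to express inter-database queries in Oracle is through database links; the link, account and table names below are hypothetical. A query of this shape, repeated over 16 ProtDBs, is exactly what becomes awkward and slow.

  -- Hypothetical database link to a second project database, then a
  -- cross-database query for peptides validated in both projects.
  CREATE DATABASE LINK protdb_p2
    CONNECT TO protdb_reader IDENTIFIED BY secret
    USING 'PROTDB_P2';

  SELECT a.peptide_seq
  FROM   peptide_match a
  JOIN   peptide_match@protdb_p2 b ON b.peptide_seq = a.peptide_seq
  WHERE  a.validated = 'Y'
  AND    b.validated = 'Y';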
What about overcoming these limitations and taking advantage of this wealth of data?
• Decide what data are actually important in the long term.
• Merge the data from all the projects.
• Clean and consolidate the data.
• Implement an update procedure to keep this « merged data system » up to date (a sketch follows this list).
• (archive old projects)
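A minimal sketch, with invented table names, of what one periodic update step of such a « merged data system » might look like: an upsert of consolidated, validated peptides from one project database.

  -- Hypothetical upsert of the best validated score per peptide from project 2
  -- into a consolidated table (names are illustrative only).
  MERGE INTO warehouse_peptide w
  USING (
    SELECT peptide_seq, MAX(score) AS best_score
    FROM   peptide_match@protdb_p2
    WHERE  validated = 'Y'
    GROUP BY peptide_seq
  ) src
  ON (w.peptide_seq = src.peptide_seq AND w.project_id = 2)
  WHEN MATCHED THEN
    UPDATE SET w.best_score = src.best_score
  WHEN NOT MATCHED THEN
    INSERT (project_id, peptide_seq, best_score)
    VALUES (2, src.peptide_seq, src.best_score);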
Data Warehouse ?
• This looks very much like the definition of a data warehouse!
  - Data consolidation and integration
  - Non-instantaneous accuracy, non-volatility
  - Comprehensive data structure
  - Query throughput
ProtWare: proteomics data warehouse
1. Stores consolidated and final analysis results, centralises data common to proteins in all proteome studies.
2. Is read-only, not real time; asynchronous updates are run weekly.
3. Data model is focused on proteome to proteome comparisons (a sketch of such a comparison query follows this list).
4. Comprehensive data structure which enhances the performance of analysis queries.
5. Ideally suited for statistical analysis and data mining tools.
6. Provides a decision support system.
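A sketch of the kind of proteome-to-proteome comparison the warehouse model is meant to make easy; table and column names are hypothetical.

  -- Proteins identified in project 1 but never seen in project 2.
  SELECT DISTINCT w1.protein_ac
  FROM   warehouse_protein w1
  WHERE  w1.project_id = 1
  AND    NOT EXISTS (
           SELECT 1
           FROM   warehouse_protein w2
           WHERE  w2.project_id = 2
           AND    w2.protein_ac = w1.protein_ac
         );

Because all projects share one consolidated structure, such comparisons are plain single-database queries rather than awkward cross-database joins.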
ProtDB and ProtWare data flow
[Data flow diagram: project databases P1, P2, … Pn (≈10^11 bytes) → XML export → Extraction, Transformation, Loading (classification, taxonomy, …) → ProtWare (≈10^8 bytes) → analyses & statistical queries, website, flat file, DB dump (≈10^5 bytes)]
ProtDB vs ProtWare
ProtDB: transactional system
• Data input, real time access to data
• Data updates, annotation, validation
• Error and consistency checks
• Stores experimental data
• Stores all steps of data annotation and validation (keeps history)
• In-depth queries on a given proteome
ProtWare: data warehouse
• Read-only, asynchronous updates from ProtDB
• Consolidated data and final results of annotation and validation (no history)
• No experimental data
• Queries oriented to proteome comparisons, statistics, data mining
• Decision support system
The needle in a haystack
• Of course we are looking for the Holy Grail!
• Find the interesting proteins in all our data that:
  - Can be used for diagnostics,
  - Can explain a disease,
  - Can be used to cure a disease.
KDD and Data Mining
• Knowledge Discovery in Databases is « the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data ».
• Data Mining is the discovery stage of KDD.
• Data mining tools provide additional possibilities to explore a database.
Data Mining tools
• ProtWare: the data warehouse model is protein-query oriented.
• R package: statistics and clustering tools.
• Oracle 10g new data mining functions (a hedged example follows this list).
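As a hedged illustration: Oracle 10g added SQL scoring functions such as CLUSTER_ID and CLUSTER_PROBABILITY. Assuming a clustering model (here called protein_clusters, an invented name) has been built over warehouse protein attributes, cluster membership can be queried directly in SQL.

  -- Assign each protein to a cluster of the (hypothetical) protein_clusters model.
  SELECT protein_ac,
         CLUSTER_ID(protein_clusters USING *)          AS cl_id,
         CLUSTER_PROBABILITY(protein_clusters USING *) AS cl_prob
  FROM   warehouse_protein;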
Database infrastructure
• Data input files use XML.
• RDBMS: Oracle 9i, moving to Oracle 10g on Linux.
• ProtWare uses ANSI SQL, portable to other ANSI SQL compliant systems (PostgreSQL).
• Web interface built using standard technologies:
  - PERL, CGI, DBI, HTML, Javascript, SVG.