Transcript data model

Database Issues in
Nutritional Genomics
Tony Travis &
Peter Gray
Rowett Research Institute &
University of Aberdeen
Jan 2005
NuGO partners: Un Oslo, Rowett, Un. Ulster, Un Newcastle, Un Lund, Trinity, Un Cork, EBI, DiFE, IFR, Rivm, Rikilt, TNO, Un Reading, Un Wageningen, Un Maastricht, Un Krakow, Un Munich, Inserm Marseille, Un Balearic Illes, Un Florence
Utopian view
• Share data freely
• Everyone benefits
• Ideas develop
• Science prospers
Big pharma disagree!
• Sell data commercially
• Big pharma benefits
• Ideas are exploited
• Science is a business
Scientists are confused…
• Intellectual freedom?
– Curiosity-driven science
– Poor funding
• Intellectual property?
– Commercially driven science
– Good funding
Preserving intellectual
property
• Autonomy
– Scientist or institution controls to whom their data is disclosed
– Control data sharing by collaborators who share their IP
– Needs federated solution
• Security
– Prevent unauthorised
access to data
– Prevent unauthorised
use of data
– Maintain integrity and
provenance of data
Typical NutriGenomics Use
Case
• Example of pragmatic solution
– DNA microarray work at RRI
• Autonomy
– Data held locally on PC spreadsheets
– Completely under control of investigator
• Collaborators
– Each create spreadsheet of local results
– All collaborators exchange spreadsheets
Spreadsheet microarray
data
Distribution of one spreadsheet
[Diagram: collaborators A, B, C and D]
Exchange of all spreadsheets
[Diagram: collaborators A, B, C and D]
Manual replication of
database
• Advantages
– Simple peer-to-peer
data transfer via
email
– Each collaborator has
entire database locally
– Local analysis tools
are readily available
– Complete control of IP
within collaboration
• Disadvantages
– N(N-1) solution: with N collaborators, each sends its spreadsheet to the N-1 others
– Does not scale well
– Each collaborator
must merge data into
local database replica
– No control over data
integrity or
provenance
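The N(N-1) growth can be made concrete with a minimal sketch. The function name and the collaborator counts below are illustrative assumptions, not figures from the talk.

```python
def transfers_needed(n_collaborators: int) -> int:
    """Emails needed when every collaborator sends a spreadsheet to every other."""
    return n_collaborators * (n_collaborators - 1)

# Illustrative collaboration sizes (assumed, not from the slides).
for n in (2, 4, 10, 20):
    print(f"{n} collaborators -> {transfers_needed(n)} transfers")
```

Four collaborators, as in the A/B/C/D diagrams, already need 12 transfers per round of updates.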
Spreadsheet Replicated
Data Model
• Distributed
– Data originates at each collaborator’s site
• Replicated
– Copy of the entire database at each site
• Manually updated
– Data and corrections are pushed from each collaborator to all others by emailing Excel spreadsheets of expression data, which each recipient merges into a single spreadsheet
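The manual merge step can be sketched as follows. This is only an illustration under stated assumptions: the column names (gene, sample, value), the CSV layout, and the function merge_replicas are hypothetical, and real spreadsheets would need far more checking.

```python
import csv
import io

def merge_replicas(sheets):
    """Merge per-collaborator CSV text into one table keyed by (gene, sample).

    `sheets` maps a collaborator name to CSV text with the assumed columns
    gene, sample, value. Returns (merged, conflicts).
    """
    merged = {}      # (gene, sample) -> (value, source collaborator)
    conflicts = []   # cells that two sources report with different values
    for source, text in sheets.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["gene"], row["sample"])
            value = float(row["value"])
            if key in merged and merged[key][0] != value:
                # Keep the first value; a person has to decide which is right.
                conflicts.append((key, merged[key], (value, source)))
                continue
            merged.setdefault(key, (value, source))
    return merged, conflicts

# Hypothetical spreadsheets from two collaborators.
sheets = {
    "A": "gene,sample,value\nGeneA,Rat1,1.5\nGeneB,Rat1,2.0\n",
    "B": "gene,sample,value\nGeneA,Rat4,1.1\nGeneA,Rat1,1.7\n",
}
merged, conflicts = merge_replicas(sheets)
print(len(merged), "cells merged;", len(conflicts), "conflict(s) to resolve by hand")
```

Recording the source next to each value is the smallest step towards provenance; plain spreadsheet exchange gives no such record.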
Local analysis tools: maxd
• Microarray Bioinformatics Group, University of Manchester (UK)
• Java-based
• maxdView
– Visualise and analyse gene expression data.
• maxdLoad2
– Store and curate gene expression data to MIAME
standards
• Export in MAGE-ML format for submission to ArrayExpress.
Import spreadsheet data
into maxd
Analyse expression
profiles
• 10,000 genes
• Four experiments by
one collaborator
• Normalised
• Clustered
• Comparison of gene
expression profiles
between
experiments
Upgrade spreadsheet
solution
• maxdLoad2
– Replace spreadsheets
– Use MIAME standard
– JDBC-compliant interface
– SQL92 (MySQL, Postgres)
Candidate Mediator
middleware
• maxd
– Designed for use with single database
• P/FDM
– Integration of heterogeneous data sources
– Federated union/join of relations
• BioMart
– MartShell scripting language
– Federate database instances
Example
federated
DB
MartShell
• Command-line (text mode) user interface to BioMart that can be used by programs
• Mart Query Language (MQL)
• Queries can be executed in ‘batch’ mode using
stored procedures in MQL scripts
BioArray Software
Environment
• BASE is a comprehensive database server to
manage massive amounts of data generated
by microarray analysis
• Lund University + Oklahoma University
• Data can be analysed through a web-based GUI to server-side PHP scripts, or extracted from the BASE database by applications such as Genespring
Querying a Federated DB
There are two kinds of distributed query that you
can send out to the federation:
• Federated Join - like adding extra columns
with cross-referenced information on the same
object or related objects.
• Federated Union – like adding extra rows with
the same column headings – the same kinds
of experiments but done at different sites.
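The two kinds of query can be sketched with plain Python lists standing in for the tables held at each site. All table contents, column names and function names here are hypothetical; this is not the P/FDM or BioMart API.

```python
# Hypothetical per-site tables; each row is a dict of column -> value.
lab1_expression = [
    {"rat": "Rat 1", "gene": "GeneA", "level": 1.5},
    {"rat": "Rat 2", "gene": "GeneA", "level": 0.4},
]
lab2_expression = [
    {"rat": "Rat 4", "gene": "GeneA", "level": 1.0},
]
lab1_physiology = [
    {"rat": "Rat 1", "weight_g": 310},
    {"rat": "Rat 2", "weight_g": 295},
]

def federated_union(*site_tables):
    """Extra rows: stack tables from sites that share the same column headings."""
    columns = set(site_tables[0][0])
    rows = []
    for table in site_tables:
        for row in table:
            if set(row) != columns:
                raise ValueError("sites disagree on column headings")
            rows.append(dict(row))
    return rows

def federated_join(left, right, key):
    """Extra columns: attach cross-referenced data about the same object."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

union = federated_union(lab1_expression, lab2_expression)
joined = federated_join(union, lab1_physiology, key="rat")
print(len(union), "rows after union;", len(joined), "rows after join")
```

The join keeps only rows whose identifier appears on both sides, which is why common unique identifiers matter so much for a federation.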
Comparing expression profiles
(e.g. looking for co-regulation)
Sites and animals: Lab 1 (Rat 1, Rat 2, Rat 3), Lab 2 (Rat 4, Rat 5, Rat 6), Lab 3 (Rat 7, Rat 8, Rat...), Lab...
Expression rows (one row per animal):
1.032028  1.320651  5.806003  1.389428  3.625239  …
1.482212  1.157023  6.857593  1.678806  2.142907  …
1.291634  1.06932   1.061052  2.083518  1.146157  …
1.808716  2.388491  1.47649   0.412969  2.225646  …
1.217205  1.114725  1.218257  2.560339  2.202825  …
…
Profile by gene (three values per gene):
GeneA  1.488212  0.374015  0.642141
GeneB  1.152023  1.746107  1.260019
GeneC  6.957593  4.191758  1.079844
GeneD  1.678806  2.666562  1.651717
GeneE  2.172907  3.184208  1.359549
Gene…  …         …         …
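One common way to look for co-regulation is to correlate gene profiles across animals. The sketch below uses Pearson correlation as an assumed similarity measure (the slides do not name one), and the expression values, function names and 0.9 threshold are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression levels: one value per animal, same animal order.
profiles = {
    "GeneA": [1.0, 2.0, 3.0, 4.0],
    "GeneB": [2.1, 3.9, 6.2, 8.0],  # rises with GeneA
    "GeneC": [4.0, 3.0, 2.0, 1.0],  # falls as GeneA rises
}

def co_regulated(profiles, target, threshold=0.9):
    """Genes whose profile correlates with `target` at or above `threshold`."""
    return [
        gene
        for gene, values in profiles.items()
        if gene != target and pearson(profiles[target], values) >= threshold
    ]

print(co_regulated(profiles, "GeneA"))
```

The comparison is only meaningful if the animals are listed in the same order and measured on comparable scales at every site.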
Conditions for making a
Federated DB work
Needs Common Ontology
for data of same type.
BEWARE of measurements made in different units,
or using a very different experimental procedure,
or qualitative measurements such as "large" or "medium".
Conditions for making a
Federated DB work
Need Common Unique Identifiers:
if no property allows you to tell that one entity instance is the same as another, then integration is UNSAFE!
(Note: it might be OK for, say, 95 percent of identifiers...)
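Before attempting a federated join it helps to measure how many identifiers actually match. A minimal sketch, assuming identifiers are plain strings; the identifier values and the function name are hypothetical.

```python
def identifier_coverage(left_ids, right_ids):
    """Return (fraction of left identifiers found on the right, unmatched ones)."""
    left, right = set(left_ids), set(right_ids)
    if not left:
        return 0.0, set()
    matched = left & right
    return len(matched) / len(left), left - matched

# Hypothetical gene identifiers held at two sites.
site_a = [f"GENE{i:04d}" for i in range(20)]
site_b = [f"GENE{i:04d}" for i in range(19)] + ["gene-19"]  # one spelled differently

coverage, unmatched = identifier_coverage(site_a, site_b)
print(f"{coverage:.0%} of identifiers match; unmatched: {sorted(unmatched)}")
```

The unmatched set is the part that needs a human decision before the data can be joined safely.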
Conditions for making a
Federated DB work
Mechanisation of Value Mapping:
• If data values can only be compared or made compatible with others using the judgement of an experienced scientist, then one must use a Warehouse (as in early PDB).
• Otherwise, if you can mechanise it using rules or equations, then it can be done by a view,
• or by a mediator accessing the Federation.
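A value mapping that can be mechanised looks like a view that applies fixed rules. The sketch below assumes a simple unit conversion; the units, column names and the dose_view function are hypothetical examples, not part of any NuGO schema.

```python
# Hypothetical conversion rules: source unit -> factor to the common unit (mg).
TO_MG = {"mg": 1.0, "g": 1000.0, "ug": 0.001}

def dose_view(rows):
    """Yield rows with every dose in mg; refuse values the rules cannot map."""
    for row in rows:
        unit = row["unit"]
        if unit not in TO_MG:
            # No mechanical rule exists: this value needs a scientist's judgement.
            raise ValueError(f"no rule for unit {unit!r}")
        yield {"sample": row["sample"], "dose_mg": row["dose"] * TO_MG[unit]}

rows = [
    {"sample": "s1", "dose": 0.5, "unit": "g"},
    {"sample": "s2", "dose": 250.0, "unit": "mg"},
]
print(list(dose_view(rows)))
```

A qualitative value such as "large" has no entry in the rule table, so it is rejected rather than silently converted.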
Conditions for making a
Federated DB work
Need Standard Interchange Formats:
• Formats such as mmCIF helped reduce human intervention in PDB. The widely used MIAME format may do the same for microarray data.
• However, such data is much harder to integrate as it may be measured under different conditions with different technology.
Difficulties of Federated
Approach
• Reliability - sites must be available continuously, and not crash too often;
• Support costs - sites must be proof against virus attacks, etc., and have people able to bring them back up again promptly
Difficulties of Federated
Approach
• Compatibility - sites must provide a common interface. It may be possible to share development of some downloadable server software (like Java WebStart) that responds to SOAP protocol messages and commands, is configurable through web forms, and keeps logs of errors.
Difficulties of Federated
Approach
• Performance: Warehouses will provide better performance for data mining programs and other programs with a high hit rate.
• Federated systems compete well on more focused queries which allow the use of indexes in remote systems.
Having it Both Ways:
• A Federated Solution can include some sites
that are adopting Warehouse technology to
collect and vet large volumes of data of a
particular kind.
• The NuGO data model and ontologies are bound to change a lot in ways we cannot foresee. Thus it makes sense to be flexible to start, allowing site autonomy, and to delay committing to large warehouses until we understand more about the data model and IPR issues.
Discovering the Model
Birney & Clamp (2004) say –
"the true biological interpretation
of data stored in a database will
change over time, and discovering
new relationships between aspects
of the data is an important part of
the motivation for storing it.."
Conclusion (1) Spreadsheets
• Spreadsheets are easy and popular
• Integrating Spreadsheets manually is time-wasting and can easily lead to errors and wrong conclusions
• Scientists need the discipline of a
shared Data Model and the automation
of data transfer and conversion, usually
provided by a Mediator
Conclusion (2) – Shared
Data Model
• Agreement on a shared Ontology is mainly a
problem of agreeing Standards for names,
units, and specialised types.
• Agreeing a shared Data Model is more subtle.
It may need experimentation in advance of a
standard.
• The Data Model, based on the Entity-Relationship Model with SubTypes, must be able to evolve, not be fixed in stone, coping with the unforeseen.
Conclusion (2) – Shared
Data Model
• The Data Model must be at a Conceptual Level, independent of Storage Technique (arrays, ASN.1, XML, tables etc.). Otherwise agreeing a Shared Model becomes too hard!
• The Data Model must provide External Views both to restrict access and to provide a consistent API to External Applications; these may be Spreadsheets or Statistical Packages or maxd or Genespring etc.
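An External View can be sketched as a function that exposes only the columns a given audience may see, in a fixed shape that applications can rely on. The roles, column names and external_view function below are hypothetical.

```python
# Hypothetical roles and the columns each one is allowed to see.
VIEW_COLUMNS = {
    "public": ("gene", "level"),
    "collaborator": ("gene", "level", "rat", "lab"),
}

def external_view(rows, role):
    """Return rows restricted to the columns granted to `role`."""
    try:
        columns = VIEW_COLUMNS[role]
    except KeyError:
        raise PermissionError(f"no view defined for role {role!r}") from None
    return [{column: row[column] for column in columns} for row in rows]

# Hypothetical stored rows, including a field that no view exposes.
stored = [
    {"gene": "GeneA", "level": 1.5, "rat": "Rat 1", "lab": "Lab 1", "notes": "unpublished"},
]
print(external_view(stored, "public"))
print(external_view(stored, "collaborator"))
```

Because applications only ever see the view, the stored representation can change without breaking them.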
Conclusion (3) – Federating
Microarray Data
• Usually, a federation is based on a federated
Join, through common identifiers, because
irrelevant joins can be left out, to speed up
the query.
• Federated Joins suit integrating other types of data with Microarray data, e.g. physiological, epidemiological data.
• This is easily done, on the fly; it allows us to evolve the data model and experiment with it without making changes to a centralised warehouse. Once the data model is more stable, parts of it can be stored in a warehouse.
Conclusion (3) – Federating
Microarray Data
• Queries that want to compare Gene
Expression Profiles across many Experiments
need a federated Union of data from different
experimenters.
• Comparing one profile against those from
many experimental sites could be done in
parallel. Trusted methods could work with an
encrypted profile to keep it confidential.
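The parallel comparison can be sketched with a thread pool, where each call stands in for a query sent to one site. The site contents, the Euclidean distance measure and all function names are assumptions, and the sketch does not attempt the encryption mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote sites: site name -> {profile id: values}.
SITES = {
    "Lab 1": {"rat1": [1.0, 2.0, 3.0], "rat2": [3.0, 1.0, 2.0]},
    "Lab 2": {"rat4": [1.1, 2.1, 2.9]},
}

def distance(a, b):
    """Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_match_at_site(site, query):
    """Runs 'at' one site: only the best match is returned, not the site's data."""
    profiles = SITES[site]
    name = min(profiles, key=lambda p: distance(profiles[p], query))
    return site, name, distance(profiles[name], query)

def compare_everywhere(query):
    """Send the query profile to every site in parallel; keep the closest match."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda site: best_match_at_site(site, query), SITES))
    return min(results, key=lambda result: result[2])

print(compare_everywhere([1.0, 2.0, 3.0]))
```

Each site returns a single summary result, so its full data set never leaves the site, which fits the autonomy requirement.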
Conclusion (4) – IPR and
Federation
• Scientists want to retain their autonomy and
right to recognised authorship of the data,
otherwise they may not share it!
• If Database Right (EU proposal) becomes
established, scientists may wish to keep data
in their own DB in order to take advantage of
it. Thus we may need to make more use of
federated techniques to bring such data
together.
• Revenue-Raising Potential may become
important (iTunes for example).