Data Provenance Workshop
Natalia Maltsev
MCS
Argonne National Laboratory
Why a Biotechnological Revolution?
So much data!!
High-throughput technologies provide huge amounts of biological data:
Sequence data
Data describing functional networks (metabolism, regulation, gene expression)
Dynamic data
Progress in computer science, computer technologies, and bioinformatics makes it possible to analyze this data
Hmmm…
98 published genomes
652 on-going genomes
Biology in a Nutshell
(for people with little knowledge but infinite intelligence)
Genome (ROM): assembly code on how to build proteins
Genome consists of genes
Instructions: A, C, T, G
Variables: amino acids
Gene: object description
Protein: object instantiation
Enzymes: proteins that catalyze biochemical reactions
Pathway: sequence of reactions
Network (directed graph): set of pathways with metabolites as vertices and enzymes as edges
Diagram labels: Genomes; Gene Products; Structure & Function; Pathways & Physiology; Gene; Protein; Functions
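The network definition on this slide (a directed graph with metabolites as vertices and enzymes as edges) can be sketched in a few lines. This is only an illustration: the metabolite and enzyme names, and the helper functions `build_network` and `find_pathway`, are hypothetical placeholders rather than anything taken from the talk.

```python
from collections import defaultdict

# Each reaction is (substrate, product, enzyme). All names are illustrative placeholders.
reactions = [
    ("metabolite_A", "metabolite_B", "enzyme_1"),
    ("metabolite_B", "metabolite_C", "enzyme_2"),
    ("metabolite_B", "metabolite_D", "enzyme_3"),
]

def build_network(reactions):
    """Build a directed graph: vertex = metabolite, edge label = enzyme."""
    graph = defaultdict(list)
    for substrate, product, enzyme in reactions:
        graph[substrate].append((product, enzyme))
    return dict(graph)

def find_pathway(graph, start, goal, seen=None):
    """Depth-first search; returns the enzymes linking start to goal, or None."""
    if start == goal:
        return []
    seen = (seen or set()) | {start}
    for product, enzyme in graph.get(start, []):
        if product not in seen:
            rest = find_pathway(graph, product, goal, seen)
            if rest is not None:
                return [enzyme] + rest
    return None

network = build_network(reactions)
pathway = find_pathway(network, "metabolite_A", "metabolite_D")
print(pathway)
```

In this representation a pathway is simply a path through the graph, read off as the sequence of enzymes on its edges.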
Data: Classes
Sequence data
DNA sequences, Protein Sequences – NCBI, GenBank, SwissProt, TIGR, sequencing projects
Data describing Networks
Metabolic Networks (EMP database, KEGG, etc)
Regulatory Networks (Sentra, TransFac, etc)
Gene Expression data (Experimental)
Other experimental data
Dynamic Data (experimental and literature)
Organisms data
Phenotypic data
Physiological data
General Systems Biology Project Architecture
Stages of analysis:
Determine components of the system (assign functions to the genes)
Establish relationships between components: reconstruct biological networks (develop a static model)
Develop a dynamic model of the system
Diagram labels: Gene Functions Predictions; Gene Functions Assignments; Identification of Components; Biological Networks Reconstructions; Static Models; Dynamic Models; Phenotypes Predictions; Biological Engineering
Step 1: Identification of Components
Data sources:
Un-annotated genomes
Genome Annotations from public databases
Experimental Results
Sequence Analysis results
Problems:
Genome Sequencing and Assembly Errors
Wrong Assignment of functions to the genes
Step 2: Biological Networks Reconstructions (Static Models)
Data sources:
Metabolic data from public databases (EMP, KEGG, EcoCyc, Brenda, etc)
Regulatory data from public databases (RegulonDB, Sentra, etc)
Experimental Results
Networks Analysis results
Problems:
Wrong/incomplete information about metabolic or regulatory networks
Wrong info from step 1
Step 3: Dynamic Models
Data sources:
Enzymatic and enzyme kinetic data from EMP
Experimental Results
Networks Analysis results
Problems:
Wrong info from step 1 & 2
Wrong dynamic data
Wrong procedures
Hmm… This microbe does everything wrong…
Data Sources
Public and private databases (GenBank, SwissProt, EMP, KEGG, etc)
Results of data analysis
Updates and versioning? (data and annotation updates, developed models)
Prediction of Gene functions
Gene functions are predicted by comparing an unknown sequence with the sequences of genes whose functions are established
Seq1 – Function: alcohol dehydrogenase
Seq2 – Function? Alcohol dehydrogenase?
Seq1_Mus.musculus
GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE
Seq2_Homo_sapiens
GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE
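To make the comparison concrete, here is a minimal sketch that computes position-by-position identity for the two sequences shown on the slide. It assumes the sequences are already aligned, of equal length, and that any whitespace in the printed sequences is a layout artifact. The function name `percent_identity` is hypothetical, and real tools such as Blast do considerably more (gaps, substitution scoring, statistics).

```python
def percent_identity(seq_a, seq_b):
    """Ungapped identity between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for this naive comparison")
    # Record 1-based positions where the residues differ.
    mismatches = [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]
    identity = 100.0 * (len(seq_a) - len(mismatches)) / len(seq_a)
    return identity, mismatches

# Sequences copied from the slide, whitespace removed.
seq1 = "GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE"
seq2 = "GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE"

identity, mismatches = percent_identity(seq1, seq2)
print(f"{identity:.1f}% identical, differences: {mismatches}")
```

High identity supports, but does not prove, the hypothesis that Seq2 shares the function of Seq1. That gap is exactly where a record of how an assignment was derived becomes useful.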
Example 1 Gene Function Assignments
Query sequence: Function Unknown!!!
Bioinformatics tools: Blast, InterPro, Blocks
KNOWLEDGE BASE
Tool results: F1, F2, F3
VOTING
ALGORITHM
F1 with probability P1
F2 with probability P2
F2
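The slide does not say how the voting algorithm works, so the following is only a sketch of one simple possibility: each tool casts a (possibly weighted) vote for a function, and vote shares are reported as probabilities. The function `vote`, the weights, and the mapping of tools to the labels F1 and F2 are all assumptions made for illustration.

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """predictions: {tool: function label}. Returns [(function, probability)], best first."""
    weights = weights or {}
    totals = defaultdict(float)
    for tool, function in predictions.items():
        totals[function] += weights.get(tool, 1.0)  # unweighted tools count as 1.0
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(function, weight / grand_total) for function, weight in ranked]

# Hypothetical tool outputs; the slide does not state which tool returned which function.
predictions = {"Blast": "F2", "InterPro": "F2", "Blocks": "F1"}
ranking = vote(predictions)
for function, probability in ranking:
    print(f"{function} with probability {probability:.2f}")
```

Keeping the per-tool votes alongside the final call is a simple form of provenance: it shows which evidence an assignment rests on.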
An Example on Pathways Reconstruction
Enzyme 1: 5.1.1.5, present
Enzyme 2: 1.13.12.2, not found
Enzyme 3: 3.5.1.30, weak evidence for enzyme
Enzyme 4: 2.6.1.48, present
Enzyme 5: 1.2.1.20, present
How reliably can we predict this pathway?
What approach will increase our confidence the most?
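One way to approach both questions is to attach a numeric score to each evidence level and look for the weakest step. The sketch below does that for the five enzymes on the slide. The scores in `EVIDENCE_SCORE` and the averaging rule are assumed values chosen for illustration, not a method from the talk.

```python
# Assumed, illustrative scores for each evidence level.
EVIDENCE_SCORE = {"present": 1.0, "weak": 0.5, "not found": 0.0}

# EC numbers and evidence levels as listed on the slide.
pathway = [
    ("5.1.1.5", "present"),
    ("1.13.12.2", "not found"),
    ("3.5.1.30", "weak"),
    ("2.6.1.48", "present"),
    ("1.2.1.20", "present"),
]

def pathway_confidence(steps):
    """Average evidence score across all steps of the pathway."""
    return sum(EVIDENCE_SCORE[evidence] for _, evidence in steps) / len(steps)

def best_step_to_verify(steps):
    """The step with the lowest score: confirming it would raise the average most."""
    return min(steps, key=lambda step: EVIDENCE_SCORE[step[1]])

confidence = pathway_confidence(pathway)
target = best_step_to_verify(pathway)
print(f"confidence {confidence:.2f}; verify EC {target[0]} first ({target[1]})")
```

Under these assumptions the missing enzyme 1.13.12.2 is the first candidate for experimental verification, followed by the weakly supported 3.5.1.30.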
Another Problem:
Control of Data flow
ftp to NCBI, TREMBL for updates on annotated databases (e.g. nr, SwissProt, PDB)
Updated DB? (yes / no; no: Exit)
ftp to NCBI, TIGR, JGI for new and updated genomes
New genome?
yes: Download genome to Chiba City directory
no: Check genome timestamp
Updated genome?
yes: Download genome to Chiba City directory
no: Exit
Data Acquisition
How reliable?
Create multiple files for each genome containing information that will help the user decide whether to analyze the genome or not
Organism Name: Corynebacterium_glutamicum
Version and GI Number: NC_003450.1 GI:19551250
Definition: Corynebacterium glutamicum, complete genome.
User interface: Genome Analyzer (which genomes to run through tools)
Create multiple files for each tool and for each genome selected to be run through Chiba City (or TeraGrid)
Organism Name: Corynebacterium_glutamicum
Sequence Qty: 3456
Path to fasta file: /nfs/chiba-homes01/……….
Tool: ChibaBlocks
Definition: Corynebacterium glutamicum, complete genome.
(RunOnChiba) Get information from each file generated to submit to Chiba in Parallel
Data Analysis
How reliable?
(GetCDSinfo) Parse information from each gbk file of each genome. Output to Oracle Databases
(SubmitToChiba) Submit each genome and each tool to Chiba in Parallel (capable of doing all genomes at the same time)
CHIBA OR TERAGRID processing the jobs
Create multiple files for submitted job to check output. Similar to above files created.
Output generated
Output correct? (yes / no)
Data Storage
How reliable?
(OracleParsers) Parse output from each tool. Output to Oracle Databases
ORACLE DB
Tables: GENOME, BLOCKS, BLAST, PFAM, CDS, UPDATE, etc.
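The per-job description files on this slide (organism name, sequence quantity, path to the fasta file, tool) suggest a simple check before submission. The sketch below parses such a file and reports missing fields. The one "Key: value" pair per line layout, the `REQUIRED_FIELDS` list, and both function names are assumptions. The fasta path is deliberately left out of the example because the slide truncates it.

```python
# Assumed list of fields a job file should carry before submission.
REQUIRED_FIELDS = ["Organism Name", "Sequence Qty", "Path to fasta file", "Tool"]

def parse_job_file(text):
    """Parse 'Key: value' lines into a dict; lines without a colon are skipped."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        record[key.strip()] = value.strip()
    return record

def missing_fields(record):
    """Names of required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

example = """Organism Name: Corynebacterium_glutamicum
Sequence Qty: 3456
Tool: ChibaBlocks
Definition: Corynebacterium glutamicum, complete genome."""

record = parse_job_file(example)
print(missing_fields(record))
```

A check like this is one small answer to the "How reliable?" question: a job with incomplete metadata can be flagged before it reaches the cluster.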
What can provenance
do?
Help plan experiments
by suggesting “weak” facts
to be tested in a wetlab
Find “weak” spots in a
model
Prioritize certain steps of
model building
Evaluate data flows
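As an illustration of finding “weak” facts, the sketch below keeps a source and a confidence score with every assertion and lists the least supported ones first. The `Assertion` record, the threshold, the scores, and the example statements are all hypothetical. The talk does not specify a provenance data model.

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    statement: str     # the fact used in the model
    source: str        # where the fact came from
    confidence: float  # assumed score in [0, 1]

def weak_facts(assertions, threshold=0.6):
    """Assertions below the threshold, weakest first: candidates for wetlab testing."""
    weak = [a for a in assertions if a.confidence < threshold]
    return sorted(weak, key=lambda a: a.confidence)

# Hypothetical model contents.
model = [
    Assertion("gene_1 encodes enzyme 5.1.1.5", "public database annotation", 0.9),
    Assertion("gene_2 encodes enzyme 3.5.1.30", "sequence analysis", 0.4),
    Assertion("enzyme 1.13.12.2 is present", "pathway gap filling", 0.1),
]

to_test = weak_facts(model)
for fact in to_test:
    print(f"{fact.confidence:.1f}  {fact.statement}  (from {fact.source})")
```

The same records could be grouped by source to evaluate a data flow, for example to see whether one database or tool contributes most of the weak facts.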