Bioinformatics Needs for the
post-genomic era
Dr. Erik Bongcam-Rudloff
The Linnaeus Centre for
Bioinformatics
From Egg to Adult in 3 x 10^9
Bases
• A single cell, the fertilized egg, eventually
differentiates into the ~300 different types of cells
that make up an adult body.
• With a few exceptions all of these cells contain the
complete human genome, but express only a
subset of the genes.
• Gene expression patterns are determined largely
by cell type, and vice versa.
The “body” has:
• The genome
• A comprehensive list of genes
• Gene expression data
• Protein localization in the cells
• Information about protein/protein and
protein/DNA interactions.
• Ways to store, display and query masses of
data so activity can focus on relevant bits.
Primary Flows of Information
and Substance in the Cell
[Diagram; labels: DNA, mRNA, creation, regulation, transcription factors, splicing factors, receptors, enzymes, structural proteins, signaling molecules, structural sugars, structural lipids, environment & other cells]
Why a Grid?
• The growth of molecular biology problems is
outpacing Moore's Law
• Growing interest in Bioinformatics from
other disciplines
• New experimental approaches (genomics,
proteomics, etc.) require new and more
demanding solutions
Comparative Genomics
• Comparative genomics: comparison of
whole genomes (e.g. human and mouse)
and new techniques for phylogenetic
footprinting.
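As a rough illustration of the idea behind phylogenetic footprinting, the sketch below scans a pairwise alignment with a sliding window and reports windows that are highly conserved. The function name conserved_windows, the window size, the identity threshold and the two toy sequences are all assumptions made for this example; real footprinting tools work on multiple alignments and use evolutionary models.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for every alignment window that meets the threshold.

    aln_a and aln_b are two rows of a pairwise alignment ('-' marks a gap).
    A gap never counts as a match, even when both rows have one.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy, made-up aligned fragments (not real human or mouse sequence).
human = "ACGTTGACCTGAATTCGGCA"
mouse = "ACTTAGACCTGAATTAGCCA"
print(conserved_windows(human, mouse, window=8, min_identity=0.9))
```

Windows that stay conserved across species are candidates for functional elements; on whole genomes this scan has to be repeated over billions of positions, which is where distributed computation becomes attractive.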
RNomics
• RNomics: tertiary structure prediction and
novel RNA gene location in whole genomes
• We are conducting genome-wide scans for
RNA regulatory elements and RNA genes
using state-of-the-art comparative genomics
tools.
• The analysis involves comparison of the
human and mouse genomes using tools such
as stochastic context-free grammars.
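To give a feel for how a stochastic context-free grammar scores RNA structure, here is a minimal sketch of CYK (Viterbi) parsing with a toy one-nonterminal grammar: S -> x S y (base pair), S -> x S (unpaired, left), S -> S x (unpaired, right), S -> empty. The grammar, all probabilities and the name cyk_viterbi are assumptions chosen for illustration; this is not the grammar used in the work described above, and practical grammars add bifurcation rules and trained parameters.

```python
import math

# Hypothetical rule probabilities for the single nonterminal S (they sum to 1).
RULES = {"pair": 0.4, "left": 0.25, "right": 0.25, "end": 0.1}
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}
SINGLE_EMISSION = 0.25  # uniform over A, C, G, U


def pair_emission(x, y):
    # 6 canonical pairs share 0.96 of the mass; the other 10 pairs share 0.04.
    return 0.16 if x + y in CANONICAL else 0.004


def cyk_viterbi(seq):
    """Return (log-probability, dot-bracket string) of the most probable parse."""
    n = len(seq)
    # best[i][j]: best log-probability of deriving seq[i:j] from S.
    best = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    back = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][i] = math.log(RULES["end"])  # empty span: S -> empty
        back[i][i] = "end"
    unpaired = math.log(RULES["left"] * SINGLE_EMISSION)  # same weight as "right"
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            candidates = [
                (unpaired + best[i + 1][j], "left"),
                (unpaired + best[i][j - 1], "right"),
            ]
            if length >= 2:
                p = RULES["pair"] * pair_emission(seq[i], seq[j - 1])
                candidates.append((math.log(p) + best[i + 1][j - 1], "pair"))
            best[i][j], back[i][j] = max(candidates)
    # Trace back the winning rules to recover the structure.
    structure = ["."] * n
    i, j = 0, n
    while back[i][j] != "end":
        if back[i][j] == "pair":
            structure[i], structure[j - 1] = "(", ")"
            i, j = i + 1, j - 1
        elif back[i][j] == "left":
            i += 1
        else:
            j -= 1
    return best[0][n], "".join(structure)


log_p, structure = cyk_viterbi("GGGAAACCC")
print(structure, log_p)
```

The dynamic programme fills a table over all subsequences, so even this rule set without bifurcation costs time quadratic in sequence length; grammars with bifurcation rules are cubic, which makes genome-wide scans computationally heavy.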
Molecular Interactions
• Large-scale in silico maps of the molecular
interactions over entire proteomes and
genomes. These maps provide quantitative
functional models that bridge the biological
with the chemical.
• We are developing models of gene participation in
biological processes. Such models are developed
from microarray-based gene expression data and
background knowledge, e.g. as provided by the
so-called Gene Ontology. The GRID Test Bed will
be an excellent computational environment for
finding molecular classifiers associated with
major diseases such as cancer, atherosclerosis
and other diseases that kill many people in
Europe.
What is needed?
• Standard, stable interfaces to conceptual
problem solvers / data / objects
• A distributed way to store and analyse
information
• Security for user data
• Avoiding duplication of implementation and
computation
Protein structure prediction:
an example
• There are over 1.3 million sequences in the
non-redundant protein database managed by the NCBI
and over 19 thousand structures in the Protein Data
Bank (PDB)
• Using this data we have built a library of common
protein substructures linking structure and
sequence on a local level
• Our library consists of over 4000 unique
substructures, each associated with between seven
and two thousand example sequence fragments
• In order to extract properties that recognize
proteins containing particular substructures, we
iteratively test different (combinations of)
properties on proteins containing and proteins not
containing the substructure of interest.
• Calculating properties for all groups takes one
week on ten Athlon XP 1700+ (1.46 GHz, 1 GB
RAM) processors
• In a more realistic search space, without the
drastic search space reductions, we estimate that
we would need approximately 700 processor days
with 2 GB RAM. Depending on the available
resources, we would like to run several such trials
in order to test different parameter settings. Thus
our upper estimates may be multiplied by a factor
of 5 to 10.
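The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted on the slide (one week on ten processors, 700 processor days, a factor of 5 to 10); the helper wall_clock_days and the example pool of 100 processors are assumptions added for illustration, and the calculation assumes perfect parallel scaling.

```python
def wall_clock_days(processor_days, processors):
    """Elapsed days if the work divides evenly over the processors (ideal scaling)."""
    return processor_days / processors


# Reduced search space: one week on ten processors.
reduced_run = 7 * 10  # processor days

# More realistic search space, as estimated on the slide.
realistic_run = 700  # processor days

# Several trials with different parameter settings: factor of 5 to 10.
campaign = (realistic_run * 5, realistic_run * 10)  # processor days

print("reduced run:", reduced_run, "processor days")
print("realistic run:", realistic_run, "processor days")
print("full campaign:", campaign, "processor days")

# Hypothetical pool of 100 processors, purely for illustration.
print("upper estimate on 100 processors:", wall_clock_days(campaign[1], 100), "days")
```

On the ten processors mentioned above, the upper estimate would occupy the machines for about 700 days, which is the practical argument for a larger shared resource.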