From Bio-Informatics
towards e-BioScience
L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group
Department of Computer Science
Universiteit van Amsterdam
Background information
experimental sciences
• There is a tendency to look ever deeper into:
Matter e.g. Physics
Universe e.g. Astronomy
Life e.g. Life sciences
• The instrumental consequence is an increase in detector:
Resolution & sensitivity
Automation & robotization
• Therefore experiments change in nature & become
increasingly complex
Impact in the life sciences
• Impact of high throughput methods e.g. Omics
experimentation
genome ===> genomics
New technologies in
Life Sciences research
Cell component ===> Methodology/Technology
DNA ===> Genomics
RNA ===> Transcriptomics
Protein ===> Proteomics
Metabolites ===> Metabolomics
Omics impact
• Instrumentation being used in omics
experimentation:
Transcriptomics via, among others, micro-arrays
Proteomics via, among others, Mass Spectroscopy (MS)
Metabolomics via, among others, MS & Nuclear Magnetic
Resonance (NMR)
Results in Paradigm shift in Life
sciences
• Past experiments were hypothesis
driven
Evaluate hypothesis
Complement existing knowledge
• Present experiments are data driven
Discover knowledge from large amounts
of data
Life sciences research: from gene to function
[Figure: from gene to function. In the cell nucleus a gene (DNA) is expressed by RNA synthesis into mRNA (with poly-A tail); mRNA is translated by protein synthesis into a protein (NH2 to COOH) with functions 1, 2, ..., n]
• DNA: whole-genome sequence projects
• mRNA: genome-wide micro-array analysis
• Protein: “high-throughput” protein analysis
• Protein function:
-prediction by bioinformatics
-proof by laboratory research
Developments towards Bioinformatics & e-Science
• Experiments become increasingly complex
• Driven by increase of detector developments
• Results in an increase in amount and complexity
of data
• Something has to be done to harness this
development
Bio-informatics to translate data into useful biological,
medical, pharmaceutical & agricultural knowledge
The what of Bioinformatics
Bioinformatics is redefining rules and
scientific approaches, resulting in the
‘new biology’. Within this new paradigm
the traditional scientific boundaries are
blurred, leaving no clear line between
‘dry or computational’ and ‘wet-based’
approaches
Role of bioinformatics
[Figure: bioinformatics spans the omics layers of the cell (DNA ===> Genomics, RNA ===> Transcriptomics, protein ===> Proteomics, metabolites ===> Metabolomics) and Integrative/System Biology. Labels: Data generation/validation; Data integration/fusion; methodology; Data usage/user interfacing]
Two sides of Bioinformatics
• The scientific responsibility to develop the underlying
computational concepts and models to convert
complex biological data into useful biological and
chemical knowledge
• The technological responsibility to manage and integrate
huge amounts of heterogeneous data from
high throughput experimentation
Need for e-Science support
Virtualization of experimental resources
enabling sharing & leading to e-BioScience
Life science application areas
[Figure: layered diagram with the labels below]
Life science/genomics research consortia and industry
e-Bioscience and life science innovation domain
Bioinformatics
e-Bioscience & research infrastructure
Generic e-Science ICT development and support
e-Science & research infrastructure
Grid infrastructure
Network infrastructure and computing capacity
Why e-BioScience
• There is an increasing necessity to use
results from other scientists, i.e. to share data &
information:
Re-use and sharing of biological data (1)
The information content of omics data is extremely high; however:
• Data subject to noise, biological and technical variation
• How to induce biological principles from these genome-wide data sets?
Approach: develop methodology for “reverse engineering” of biological
mechanisms.
• Biggest challenge in bioinformatics today.
Need for external data sources for in-silico experimentation
• Two practices for re-use and sharing of data
Collectively compile huge amounts of relevant data and make these
available to the community. Examples: Bio-banking, compendia (e.g.
NIH’s Affymetrix SNP repository).
Re-use information from different and diverse experiments to discover
phenomena
Re-use and sharing of biological data (2)
Compendium example: re-use and sharing of Huntington data
• Datasets: 404 Affymetrix Gene chips of measurements on extremely rare
human brain samples (Hodges et al. Hum. Mol. Genetics, 2006)
• Available from NCBI GEO database (MIAME)
• Goal: find genes involved in Huntington’s Disease
• Approach:
Reanalyze gene expression data
Combine genotype data and clinical data (e.g. using SigWin)
Extend experiments with own ChIP on chip data
Resource Identification software
Repository of relevant meta-information from:
• Data warehouses e.g. GEO, ArrayExpress, Protein Interaction database
• Literature (Mining of PubMed using Collexis)
• Information resources specialized on diseases, genes,
proteins, e.g. OMIM, GenBank, Ensembl
Why e-BioScience
• There is an increasing necessity to use results from
other scientists, i.e. to share data & information:
Data repositories
Cohort studies in
Bio-banking
Biodiversity
Expensive and complex equipment
Mass Spectroscopy
MRI
Other
Problems for the realization
of e-BioScience
• The Life Science field is still in an early stage of
development and:
First principles are not understood at all
• As a consequence experimental methods are not
well established and will not be for some time to come
• Because of the new forms of omics instrumentation
there is a need for methods for the design of
experiments
Correct logging of the conditions under which experiments
are done is lacking
Large amounts of data are produced that require, among
others, statistical techniques for interpretation
• As a consequence results are open to multiple interpretations
Problems for the realization
of e-BioScience
• Problems for bioinformatics & e-Bioscience:
Rationalisation at this early stage is almost impossible
Pre-standardization & standardization are almost non-existent
Where there are standards they are inadequate because they are
open to multiple interpretations (like MIAME for micro-arrays)
• In addition there are commercial end-user products
that are difficult to integrate
• Users lack the training necessary to handle these
complex experimental situations
• The only possible solution is to create a flexible
experimentation environment for the end-users
Role of ICT in e-BioScience
• e-Science is a new form of science methodology
complementing theoretical and experimental sciences.
• It is using generic methods and an ICT infrastructure to support
this methodology.
Web services as a paradigm/way of using/accessing
information
Grid as a method of accessing & sharing computing
resources by virtualization
• What is missing in e-BioScience:
Connection between biological problem & e-Bioscience
User oriented tools that can be re-used and extended
General model of ICT based integration
Semantic support
ontologies and semantic support for workflows to make user
knowledge explicit
Consequences for bioinformatics & e-BioScience
• A considerable amount of experimentation is
necessary before a well-established methodology
will emerge
• The VL-e approach might be a good model &
produces an environment in which the necessary
experimentation can be realized
Enhancing the scientific process: e-BioLab
Motivation:
• Interacting with the problem domain requires an environment in which the
domain can be opened up and ideas, hunches and notions on the data and
crude models of the biology can be visualized
• A tangible space in which biologists, aided by e-scientists, will have the full
potential of VL-e at their disposal.
An actual laboratory in which:
• Problem domain experts (biologists, medical doctors) and scientists from
enabling disciplines jointly and in a creative manner work on the analyses
and design of –omics experiments.
Basic concept of e-BioLab:
• Problem domain experts can focus on the biology because they are shielded
from technical details by e-scientists.
• Viewpoints on the research question and the data can be expressed and
visualized semi-instantaneously.
• Ideas and analyses can be retained and documented.
• Facilities for remote collaboration are present*.
[Figure labels: Basic model of problem area; Small integration experiments; Readily accessible data + models + integration methods; data mining; Easy visualization; Vague results; e-BioOperator; Biologists; e-BioScientist]
* Rauwerda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
Enhancing the scientific process: e-BioLab (2)
Realization:
• Large high resolution display (26.2 Mpixel) with high bandwidth (10 Gbit/s)
connection to render cluster
• Full access to computational facilities and GRID middleware of VL-e
• e-whiteboards and tablet PCs to share and store ideas
• High definition video cameras for remote collaboration
• Highly adaptable lab configuration.
Research into:
• Problem Solving Environments for biology under study
• formulation of scientific workflows that allow for sufficient interactivity and
guarantee reproducibility
• Maintaining an electronic lab journal for e-science experimentation
• Methods for:
• Information Management of omics data
• Biological Domain Interaction / Resource Identification
• Modeling of Biological Information and Knowledge
• Remote scientific co-operation
• Man-machine interaction
High resolution displays in e-bioscience
[Figure: high resolution display wall with panels for Remote whiteboard, GSEA, SOM, Video remote collaboration, Literature Mining, Clustering, Gene lists, Interesting Pathways, GO categories]
Example: concurrently display in a discussion with a remote partner
• Clustering results of microarray experiments
• Interesting pathways that are predominant in certain clusters
• Gene Ontology categories
• Results from literature mining
• Gene Set Enrichment of categories identified in literature mining
• Notions depicted on the e-whiteboards
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research and development of
related ICT infrastructure
• Generic application support
Application cases are drivers for computer & computational
science and engineering research
Problem solving partly generic and partly specific
Re-use of components via generic solutions whenever
possible
[Figure: layered service architecture. Labels: Application pull; Domain specific tools (Micro-array / Transcriptomics pipeline, Mass spectroscopy / Proteomics pipeline); Domain generic e-BioScience services (Microarray pipeline, Mass spectroscopy pipeline, Pathway visualization, Protein annotation); Generic e-Science services; Technology push; Grid Services: harness multi-domain distributed resources]
Bioinformatics methods in VL-e (1)
Example 1 – An application specific method modified by e-science
into a generic one: SigWin*
• Starting point:
Application specific method for detecting windows of increased gene
expression on chromosomes** (implemented in C and perl for SAGE technology)
• Motivation:
Broad interest from molecular biology in positional behaviour of any
measurement data that can be mapped onto DNA sequences
• SigWin e-Science version:
GRID-based modular workflow for detecting windows of significance in
any sequence of values
Widely applicable from gene expression to meteorology data
Modules reusable for alternative workflows, e.g. protein modification
Scalable to very large datasets
* Inda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
** Versteeg et al, Genome Research, 2003
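The idea of detecting windows of significance in any sequence of values can be sketched as follows. This is not the SigWin implementation: it is a minimal Python illustration that assumes a sliding-window median as the window statistic and a permutation test as the null model. The function names, the default parameters and the one-sided test for elevated windows are all hypothetical choices made for this sketch.

```python
import random
import statistics


def sliding_medians(values, window):
    """Median of every contiguous window of the given length."""
    return [statistics.median(values[i:i + window])
            for i in range(len(values) - window + 1)]


def significant_windows(values, window, permutations=200, alpha=0.01, seed=0):
    """Return start indices of windows whose median is unexpectedly high.

    The null distribution is built from window medians of randomly
    shuffled copies of the sequence, so positional structure is destroyed
    while the value distribution is kept (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = sliding_medians(values, window)

    null = []
    shuffled = list(values)
    for _ in range(permutations):
        rng.shuffle(shuffled)
        null.extend(sliding_medians(shuffled, window))
    null.sort()

    # Value at the (1 - alpha) quantile of the null window medians.
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    return [i for i, m in enumerate(observed) if m > threshold]


# A flat sequence with one elevated stretch at positions 45..54.
signal = [0.0] * 45 + [1.0] * 10 + [0.0] * 45
hits = significant_windows(signal, window=9)
```

Because the function only needs an ordered list of numbers, the same call works for expression values along a chromosome or for a temperature series, which is the sense in which the slide calls the method generic.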
Bioinformatics methods: SigWin
[Figure: the significant window detector, a generalisation of the RIDGE method, applied to three data sets: human gene expression, DNA curvature of the Escherichia coli chromosome, and temperature in Amsterdam]
Bioinformatics methods in VL-e (2)
Example 2 – An application specific method composed of generic and
specific modules in a workflow: OligoRAP*
• Purpose: a re-annotation workflow for oligo libraries
• Motivation: rapidly evolving knowledge in genome analysis requires
frequent re-assessment of the molecules which are used to measure
gene-expression.
• OligoRAP
Uses set of application generic (BIOMOBY) BLAT and BLAST sequence
alignment (web)services.
Uses application specific (BIOMOBY) annotation analysis service
BIOMOBY: de-facto standard for bio-informatics webservices.
Joint work of sequence analysis lab and micro-array lab
Workflow:
• Adjustable filtering criteria make quality level of oligos explicit
• Workflow provenance makes re-annotation reproducible.
* P. Neerincx, H. Rauwerda, F. Verster, A. Kommadath, T.M. Breit, J.A.M. Leunissen, Poster ISMB 2006
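How adjustable filtering criteria can make the quality level of each oligo explicit, and how recording those criteria supports reproducibility, can be sketched as follows. This is a hypothetical Python illustration and not OligoRAP code: the record fields (best_identity, genome_hits), the threshold defaults and the three quality labels are assumptions invented for this sketch, and no BIOMOBY service is called.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OligoHits:
    """Hypothetical summary of the alignment results for one oligo."""
    oligo_id: str
    best_identity: float  # fraction of matching bases in the best alignment
    genome_hits: int      # number of genomic locations the oligo aligns to


@dataclass(frozen=True)
class FilterCriteria:
    """Adjustable thresholds (assumed values, not OligoRAP defaults)."""
    min_identity: float = 0.97
    max_genome_hits: int = 1


def quality_level(oligo, criteria):
    """Assign an explicit quality label under the given criteria."""
    if oligo.genome_hits == 0 or oligo.best_identity < criteria.min_identity:
        return "rejected"
    if oligo.genome_hits <= criteria.max_genome_hits:
        return "unique"
    return "ambiguous"


def reannotate(oligos, criteria):
    """Label every oligo and keep the criteria used as simple provenance."""
    return {
        "criteria": asdict(criteria),
        "levels": {o.oligo_id: quality_level(o, criteria) for o in oligos},
    }


library = [
    OligoHits("oligo_a", 1.00, 1),
    OligoHits("oligo_b", 0.98, 3),
    OligoHits("oligo_c", 0.90, 1),
]
strict = reannotate(library, FilterCriteria())
relaxed = reannotate(library, FilterCriteria(min_identity=0.85, max_genome_hits=3))
```

Storing the criteria next to the labels means a later run with the same input and the same thresholds can be checked against the earlier result.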
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research and development of related ICT
infrastructure
• Generic application support
Application cases are drivers for computer & computational science and
engineering research
Problem solving partly generic and partly specific
Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
Reproducible & comparable
• Two research experimentation environments
Proof of concept for application experimentation
Rapid prototyping for computer & computational science experimentation
Medical Diagnosis and Imaging
Problem Solving Environment
Partners:
• Universiteit van Amsterdam (UvA)
• Academisch Medisch Centrum (AMC)
• Vrije Universiteit Medisch Centrum (VUMC)
• Philips Research
• Philips Medical Systems
• TU Delft
• IBM
Applications:
1. Eddy current reduction
2. Matched Masked Bone Elimination
3. Functional brain imaging, DWI and fiber tracking
4. MR virtual colonoscopy
5. Parallel MEG data analyses
6. Grid-based data storage, retrieval and sharing
7. Interactive 3D medical visualization
Objective:
To study the design and implementation of
a PSE for medical diagnosis and imaging to
support and enhance the clinical diagnostic
and therapeutic decision process.
Brain Imaging and Fiber
Tractography
• Diffusion Weighted Imaging (DWI)
Restricted Brownian motion results in anisotropy that can be
measured
>= 6 measurements, reduced to a tensor per voxel
The eigenvector with the largest eigenvalue gives the main diffusion direction
• Whole volume fiber tracking can take
many hours
Depends on size of volume and number
of measurements per voxel
Suitable for parallelization
• Visualization techniques
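The per-voxel step described above, going from a diffusion tensor to a main diffusion direction and a measure of anisotropy, can be sketched in plain Python. This is an illustrative sketch only: it assumes the symmetric 3x3 tensor has already been fitted from the measurements, uses power iteration as one simple way to find the dominant eigenvector, and computes fractional anisotropy (a standard anisotropy measure, not named on the slide) from the trace and the squared Frobenius norm. The function names and the example tensor values are hypothetical.

```python
import math


def principal_direction(tensor, iterations=200):
    """Dominant eigenvector of a symmetric positive semi-definite 3x3 tensor.

    Uses power iteration; it assumes the start vector is not orthogonal
    to the dominant eigenvector and that the largest eigenvalue is unique.
    """
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(tensor[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero tensor: no preferred direction
        v = [x / norm for x in w]
    return v


def fractional_anisotropy(tensor):
    """Fractional anisotropy in [0, 1], computed without eigenvalues.

    For a symmetric tensor the sum of eigenvalues is the trace and the
    sum of squared eigenvalues is the squared Frobenius norm.
    """
    trace = sum(tensor[i][i] for i in range(3))
    frob2 = sum(tensor[i][j] ** 2 for i in range(3) for j in range(3))
    if frob2 == 0.0:
        return 0.0
    spread = max(0.0, frob2 - trace * trace / 3.0)
    return math.sqrt(1.5 * spread / frob2)


# Example tensor (arbitrary units): diffusion is strongest along x.
voxel_tensor = [[1.7, 0.0, 0.0],
                [0.0, 0.3, 0.0],
                [0.0, 0.0, 0.3]]
direction = principal_direction(voxel_tensor)
fa = fractional_anisotropy(voxel_tensor)
```

Each voxel is processed independently of its neighbours in this step, which is one reason the slide can call the workload suitable for parallelization.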
Medical Diagnosis and Imaging
Problem Solving Environment
Application specific services:
• Access to PACS, DICOM
• Interfaces to medical scanners (MRI)
• In-house developed algorithms:
Eddy Current Reduction
Matched Masked Bone Elimination
• Patient privacy
VL-e generic services:
• Provides:
Scientific visualization techniques
Image processing algorithms
• Uses:
Experiment editor
Parallel processing techniques
Grid services:
• Storage facilities (SRB)
• High Performance Computing platforms
• High Performance Visualization platforms
[Figure: VL-e Environment stack: … Medical Applications …; Virtual Laboratory; Grid Middleware; Surfnet]
Eddy current reduction
• Shear, magnification and translation as a result of residual
currents in DWI
2D matching to correct
Computationally expensive
• Parallelization through
domain decomposition
Computing cycles via Grid
Integrated PACS solution
Effects of residual eddy currents on
Philips 3T Intera with DWI.
Figure by Erik-Jan Vlieger, AMC.
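Parallelization through domain decomposition can be sketched as follows: the volume is cut into contiguous blocks of slices, each block is corrected independently, and the results are put back in order. This is a minimal Python illustration under stated assumptions and not the VL-e implementation: correct_slice is a hypothetical stand-in for the real 2D matching step, and a local thread pool stands in for Grid worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_items, n_chunks):
    """Contiguous (start, stop) index ranges that together cover range(n_items)."""
    base, extra = divmod(n_items, n_chunks)
    ranges = []
    start = 0
    for k in range(n_chunks):
        stop = start + base + (1 if k < extra else 0)
        if stop > start:  # skip empty chunks when there are few items
            ranges.append((start, stop))
        start = stop
    return ranges


def correct_slice(slice_2d):
    """Hypothetical placeholder for the per-slice correction.

    Here it only subtracts the slice minimum; the real step would
    estimate and undo shear, magnification and translation.
    """
    lowest = min(min(row) for row in slice_2d)
    return [[value - lowest for value in row] for row in slice_2d]


def process_volume(volume, n_workers=4):
    """Correct all slices of a volume, one block of slices per worker."""
    ranges = split_into_chunks(len(volume), n_workers)

    def work(bounds):
        start, stop = bounds
        return [correct_slice(s) for s in volume[start:stop]]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work, ranges))  # map keeps the input order
    return [s for part in parts for s in part]


# Tiny synthetic volume: 5 slices of 2 x 3 voxels.
volume = [[[z + y + x for x in range(3)] for y in range(2)] for z in range(5)]
corrected = process_volume(volume, n_workers=3)
```

The decomposition works here because slices do not depend on each other; for CPU-bound Python code a process pool or separate Grid jobs would replace the thread pool, with the same splitting logic.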
Medical Diagnosis and Imaging
Problem Solving Environment
[Figure: VL experiment topology. Labels: Data retrieval, acquisition; Filtering, analyses, simulation; Image processing; Data storage; 2D/3D visualization]
The situation in the
Netherlands
• Netherlands Bio-Informatics Center (NBIC) was set up as part of the
Netherlands Genomics Initiative (NGI)
• Its aim was to organize bio-informatics in the Netherlands and to generate
sufficient critical mass, also to support the other genomics initiatives
as a technology center
• Organizational structure:
Board of directors
Dr van Kampen scientific director
Drs R. Kok executive director
Prof. Dr. Hertzberger adjunct scientific director
Supervisory board
International Advisory board
Scientific Committee
Program Steering Group
Current NBIC activities
• NBIC currently runs three programs, and took the initiative for and
participates in another three joint activities, besides collaborations such
as those with SURF (networking) and VL-e (e-Science):
• NBIC programs:
BioRange: a bio-informatics research program of 25 M$ & 25 M$
matching
BioAssist: a 10 M$ support program
BioWise: a 3 M$ education program
• Participation in:
Computational life sciences: a 5 M$ program with among others physics,
chemistry and computational science
Pilot grid roll out: a 3M$ Grid rollout & support with Dutch Foundation for
computing (NCF) and others
BIG GRID: a 35M$ GRID and e-Science program in the Netherlands
together with NCF, physics, VL-e and others
Program activities
• BioRange has four program lines:
Micro array related bio-informatics
Proteomics related bio-informatics
Integrated bio-informatics
Informatics research for Bio-informatics
• All program lines comprise a number of collaborative projects with
participation of groups all over the Netherlands
• BioAssist runs two program lines:
Establishment of e-bioscience support environment
Establishment of generic e-science infrastructure
• In the future also an extension towards biomedical applications, as was illustrated
The VL-e infrastructure
[Figure: VL-e infrastructure. Service layers: Application specific service; Application potential generic service & Virtual Lab. services; Grid & Network Services. VL-e Proof of Concept Environment: Telescience, Medical Application, Bio Informatics Applications; Virtual Laboratory; Grid Middleware; Surfnet. VL-e Certification Environment: Test & Cert. VL-software; Test & Cert. Grid Middleware; Test & Cert. Compatibility. VL-e Experimental Environment: Virtual Lab. rapid prototyping (interactive simulation); Additional Grid Services (OGSA services); Network Service (lambda networking)]
[Figure labels: BioAssist; Total 25M$ support + 25M$ matching; e-Science Roll out; Big Grid; Total 35 M$ support; VL-e Proof of Concept Environment (Telescience, Bio and Medical Applications; Virtual Laboratory; Grid Middleware; Surfnet) with Stable Application & VL-e component; Application feedback; VL-e Experimental Environment (Rapid prototyping (interactive simulation); Additional Grid Services (OGSA services); Network Service (lambda networking)) with Unstable Application & VL-e component]
Conclusions
• Omics experiments change the face of life sciences
• Bioinformatics can be considered an essential
enabler and is a form of e-Science
• It will help to realize the necessary paradigm shift in Life
Science experimentation
• Better support of experimentation & optimal use of
ICT infrastructure requires rationalization of the
experimentation process
• Information management is an essential technology
• Bioinformatics cannot be decoupled from e-Bioscience applications
• e-Bioscience also has to comprise biomedical
applications