Integration of Data and Resources using Industry


From Biological Data
to Biological Knowledge
Volker Stümpflen
Group for Biological Information Systems
MIPS / Institute for Bioinformatics
GSF – National Research Center for Environment and Health
TMRA 06
Something About Our Problem


For a long time we focused on individual genes / proteins …

… but humans, for example, don't have many more genes than “simple” organisms …

… because complexity arises at the level of biological networks
We can’t understand anything without understanding the context
Small Scale “Knowledge Generation”

Accessing some of the several hundred (web) resources (publicly available data > 2 petabytes)
=> Compilation of required knowledge by hand
Large Scale Assessment of
Information and Knowledge

R. Shamir et al., Revealing modularity and organization in the yeast
molecular network by integrated analysis of highly heterogeneous
genomewide data, PNAS, Vol. 101, No. 9, 2004, pp. 2981-2986


“To gain deeper understanding of the [biological]
systems, it is pertinent to analyze heterogeneous data
sources in a truly integrated fashion and shape the
analysis results into one body of knowledge.”
“By integrating experimental data of heterogeneous
sources and types, we are able to perform analysis
on a much broader scope than previous studies.”
Technical Problems

Information integration from heterogeneous and distributed data sources (databases AND applications)

Solvable with n-tier architectures

E.g. GenRE at MIPS
  J2EE-based middleware
  Enterprise Java Beans (EJBs) and Web Services (WS)
Semantic Problems






Sloppy definitions, e.g. Gene has Function
Homonym / synonym problems, e.g. gene identifiers
Ambiguity of terms
  Differences in the meaning of terms between different biological communities
Results of in vitro experiments often differ within the experimental scope (e.g. protein interactions)
…
Strategies
• Complete semantic annotation of (all) resources
  Funding?
  Data models?
• Modeling of individual domains
  Suited for biologists (Topic Maps)
  Access of relevant data sources
  Merging of individual domains to obtain the “complete picture”
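
The merging step in the second strategy can be illustrated with a minimal sketch. In Topic Maps, topics that share a subject identifier are treated as the same topic and merged. The class name, the map-based representation, and the identifiers used in any example data are assumptions made for illustration only; they are not taken from the talk or from any Topic Maps engine.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a "domain map" is reduced to
// subject identifier -> set of characteristics (names, occurrences, ...).
final class DomainMergeSketch {

    // Merges two domain maps. Topics with the same subject identifier
    // become one topic whose characteristics are the union of both sides.
    static Map<String, Set<String>> merge(Map<String, Set<String>> first,
                                          Map<String, Set<String>> second) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, Set<String>> domain : Arrays.asList(first, second)) {
            for (Map.Entry<String, Set<String>> topic : domain.entrySet()) {
                merged.computeIfAbsent(topic.getKey(), key -> new LinkedHashSet<>())
                      .addAll(topic.getValue());
            }
        }
        return merged;
    }
}
```

A real topic map carries typed names, occurrences, and association roles; the sketch only shows the identity rule that lets separately modeled domains combine.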
Static Generation of Topic Maps
[Figure labels: Extract (three times), TM4J, XTM File, Omnigator]

+ Highly flexible data model
+ Straightforward process
+ Intuitive user interface
+ Finding the right information is easy
- Topic maps tend to be very large
- Redundant information in DBs and Topic Map files
- Update problems
=> Dynamic generation of topic maps
Dynamic Topic Map Generation

Dynamic information retrieval via EJBs / Web Services
  Each topic type is mapped to an EJB / Web Service
  Each association type is also represented by an EJB / Web Service
Straightforward extension of the data model
  User adjustments are possible afterwards
Intuitive navigation of related information

[Figure labels: Protein, EC Number, Protein – ECNum Association; Protein "has" EC Number; EC Number "is associated to" Protein; Protein Web Services, EC Number Web Services, Protein – ECNum Association Web Services]
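
The mapping described on this slide (one service per topic type, one service per association type) can be sketched as a small registry that assembles a topic map fragment on request. This is only an illustration under stated assumptions: the class, the method names, and the use of plain functions in place of EJBs or Web Services are hypothetical and do not reflect the actual GenRE or DTMG interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: plain functions stand in for the EJBs / Web Services.
final class DynamicFragmentSketch {

    // Topic type -> service that resolves a topic id to a display name.
    private final Map<String, Function<String, String>> topicServices = new HashMap<>();

    // Association type -> service that resolves a topic id to associated topic ids.
    private final Map<String, Function<String, List<String>>> associationServices = new HashMap<>();

    void registerTopicType(String type, Function<String, String> service) {
        topicServices.put(type, service);
    }

    void registerAssociationType(String type, Function<String, List<String>> service) {
        associationServices.put(type, service);
    }

    // Builds a fragment on demand: the requested topic plus every topic
    // reachable through one association type. Nothing is stored in advance.
    List<String> fragment(String topicType, String topicId,
                          String associationType, String targetType) {
        List<String> lines = new ArrayList<>();
        lines.add(topicType + ": " + topicServices.get(topicType).apply(topicId));
        for (String targetId : associationServices.get(associationType).apply(topicId)) {
            lines.add("  " + associationType + " -> " + targetType + ": "
                    + topicServices.get(targetType).apply(targetId));
        }
        return lines;
    }
}
```

Extending the data model in this sketch means registering one more topic type or association type, which mirrors the "straightforward extension" point on the slide.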
Interface Definition

Information retrieval via EJBs (Web Services)
  Each topic type is mapped to an EJB / WS
  Each association type is also represented by an EJB / WS
Straightforward extension of the data model
  User adjustments are possible afterwards
…
…
…
…
DTMG Architecture
(Extension of GenRE)
[Architecture diagram.
Tier labels: Web Presentation Tier, Semantic Tier, Syntax Tier, Integration Tier, Enterprise Information System Tier.
Components shown: XSL Transformation, Other Types of Clients, TopicMap Manager, ProteinTopicType EJB, ProteinPfam AssType EJB, Protein Extractor, ProteinPfam Extractor, Resource Manager, GISE EJBs, SIMAP Access EJB, …
Data sources shown: PEDANT DBs (e.g. Arabidopsis thaliana), SIMAP, FunCat.]
K. Nenova
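
The architecture names an XSL Transformation step in front of the clients. As a rough illustration of such a step, the sketch below renders an XML fragment to an HTML list with the standard `javax.xml.transform` API. The `<fragment>`/`<topic>` input format and the stylesheet are invented for this example; they are not XTM and not the stylesheets used in the DTMG framework.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of a presentation step: XML fragment in, HTML out.
final class FragmentRenderer {

    // Illustrative stylesheet: one <li> per <topic name="..."/> element.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/fragment\">"
        + "<ul><xsl:for-each select=\"topic\">"
        + "<li><xsl:value-of select=\"@name\"/></li>"
        + "</xsl:for-each></ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String render(String fragmentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(fragmentXml)),
                                  new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Keeping presentation in a stylesheet means other client types can consume the same fragment without touching the tiers below.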
“Worst Case” Example
• Combination of two large resources at MIPS
  Annotated Proteins: calculated properties of genes / proteins from various organisms
  Orthologs: calculated similarities of proteins (all against all)
K. Nenova / R. Gregory
Large Scale Annotation with PEDANT
(Protein Extraction, Description and Analysis Tool)


Currently covers > 400 genomes
~ 1000 expected by the end of this year
SIMAP: Precalculated Sequence Homologies
SIMAP database:
• 450 proteomes
• 4 sequence collections
• 7.5 million protein entries
• 3.5 million sequences
• 8 billion FASTA hits
BOINC:
• 12600 hosts
• 2.3 TeraFLOPS
[Figure labels: SIMAP database, Database-/Fileserver, NFS-Server, Grid Master, Grid execution hosts, LAN, Webserver, Internet, BOINC daemons, SIMAP client, BOINC core (Linux, Windows, Mac), External users: MIPS + WWW users]
R. Arnold, T. Rattei, P. Tischler, V. Stümpflen, M-D. Truong and HW. Mewes;
Bioinformatics in press
Topic Map Schema
[Schema diagram.
Topic types shown: Genome, Protein, PFAM Domain, EC Number, FunCat.
Association labels: contains, has, is associated to, belongs, is represented by, has orthologs.
Attributes shown: Description, Length, Molecular Weight, Sequence, Classification, Status, Strain, Taxonomy Id, Contig Name, Pedant URL, Pfam URL, KEGG URL, FunCat URL, URL.]
Some Screenshots
Context
Improvements

Parallel searches based on Message Driven Beans

[Figure labels: Stateless Session Bean, Search Queue, Request Message, Message Driven Beans (one each for Pedant DB 1, Pedant DB 2, Pedant DB 3, … Pedant DB n), Database Connection, Response Message, Response Queue, Message Driven Bean; steps numbered 1 to 4]
R. Gregory
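
The fan-out idea behind the parallel search can be sketched without an application server. The example below swaps the JMS queues and Message Driven Beans for a plain `ExecutorService`: one task per database, with results gathered once all tasks finish. The class name and the in-memory "databases" are assumptions for illustration, not the PEDANT schema or the actual bean code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a thread pool stands in for queues and Message Driven Beans.
final class ParallelSearchSketch {

    // databases: database name -> entries (in-memory stand-in for real databases).
    static List<String> search(Map<String, List<String>> databases, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, databases.size()));
        try {
            // One search task per database, comparable to one request message each.
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Map.Entry<String, List<String>> db : databases.entrySet()) {
                tasks.add(() -> {
                    List<String> hits = new ArrayList<>();
                    for (String entry : db.getValue()) {
                        if (entry.contains(term)) {
                            hits.add(db.getKey() + ": " + entry);
                        }
                    }
                    return hits;
                });
            }
            // invokeAll runs the tasks concurrently and returns the futures
            // in task order, comparable to collecting the response messages.
            List<String> allHits = new ArrayList<>();
            for (Future<List<String>> response : pool.invokeAll(tasks)) {
                allHits.addAll(response.get());
            }
            return allHits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a message-based setup the same shape holds, with the added benefit that the searching components are decoupled from the caller through the queues.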
Further Improvements

More Maps
  Diseases, Metabolism
Combination with Text Mining
Inference Engines, Reasoners
…

Computer: “Show me all proteins in Mus musculus involved in transmembrane signal transduction, and show me the orthologs in Rattus norvegicus.”
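
The query above amounts to a filter on two properties followed by one step along an ortholog association. A minimal sketch of that traversal is shown below. The `Protein` record, the ortholog map, and all identifiers used with it are placeholders invented for this example; no real protein data is implied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: filter proteins, then follow the "has orthologs" association.
final class OrthologQuerySketch {

    static final class Protein {
        final String id;
        final String organism;
        final String function;

        Protein(String id, String organism, String function) {
            this.id = id;
            this.organism = organism;
            this.function = function;
        }
    }

    // Returns "sourceId -> orthologId" for every protein of the given organism
    // and function that has an ortholog in the target organism.
    static List<String> query(List<Protein> proteins, Map<String, List<String>> orthologs,
                              String organism, String function, String targetOrganism) {
        Map<String, Protein> byId = new HashMap<>();
        for (Protein protein : proteins) {
            byId.put(protein.id, protein);
        }
        List<String> result = new ArrayList<>();
        for (Protein protein : proteins) {
            if (!protein.organism.equals(organism) || !protein.function.equals(function)) {
                continue;
            }
            List<String> candidates =
                    orthologs.getOrDefault(protein.id, Collections.<String>emptyList());
            for (String candidateId : candidates) {
                Protein candidate = byId.get(candidateId);
                if (candidate != null && candidate.organism.equals(targetOrganism)) {
                    result.add(protein.id + " -> " + candidate.id);
                }
            }
        }
        return result;
    }
}
```

In the setting of the talk, each lookup would be answered by a service behind a topic type or association type rather than by in-memory collections.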
Conclusion

• Topic Maps are suitable for semantic information integration
• Development of a Dynamic Topic Map Generation (DTMG) framework
• Generation of fragments based on component- and service-oriented architectures
• Capable of providing a deeper understanding of biological entities and systems in a truly integrated fashion
Acknowledgements


Filka Nenova
Richard Gregory
Matthias Oesterheld
Roland Arnold
Octave Noubibou
Marisa Thoma
Konrad Schreiber
…
Thomas Rattei

Ulrich Güldener
Martin Münsterkötter

Funding
Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.