Architectural Constraints on Current Bioinformatics Integration


Architectural Constraints on Current Bioinformatics Integration Systems
Norman Paton
Department of Computer Science
University of Manchester
Manchester, UK
<norm>@cs.man.ac.uk
Structure of Presentation

Current integration proposals.
  What they support.
  What they don’t support, and why.
Requirements for integration.
  What could be useful, and why.
Grid opportunities.
  Relevant Grid technologies.
  Absent Grid technologies.
Current Integration Proposals
Classification
Feature            Values
Data Location      In-situ, Replicated, Reorganised
Integration Model  None, Relational, Semi-Structured, Object-Oriented
Architecture       Thin Client, Client-Server, Multi-Tier
Analysis Support   Function Call, Query, Workflow
SRS
Sequence Retrieval System
http://srs.ebi.ac.uk/
SRS In Use
[Screenshot callouts: list of databases; search interfaces; selected databases.]
SRS Results
[Screenshot callout: links to result records.]
Classification of SRS
Feature            Values
Data Location      Replicated
Integration Model  None
Architecture       Thin Client
Analysis Support   Function Call, Query
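The four-feature classification can also be written down as a small data structure, which makes it easy to record where each system sits. The following is a minimal Python sketch, not part of the original presentation: the class and member names are assumptions made for illustration, the permitted values are those listed in the classification, and the SRS entry follows the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class DataLocation(Enum):
    IN_SITU = "In-situ"
    REPLICATED = "Replicated"
    REORGANISED = "Reorganised"


class IntegrationModel(Enum):
    NONE = "None"
    RELATIONAL = "Relational"
    SEMI_STRUCTURED = "Semi-Structured"
    OBJECT_ORIENTED = "Object-Oriented"


class Architecture(Enum):
    THIN_CLIENT = "Thin Client"
    CLIENT_SERVER = "Client-Server"
    MULTI_TIER = "Multi-Tier"


class AnalysisSupport(Enum):
    FUNCTION_CALL = "Function Call"
    QUERY = "Query"
    WORKFLOW = "Workflow"


@dataclass(frozen=True)
class Classification:
    """One system's position in the four-feature classification."""
    name: str
    data_location: FrozenSet[DataLocation]
    integration_model: IntegrationModel
    architecture: Architecture
    analysis_support: FrozenSet[AnalysisSupport]

    def describe(self) -> str:
        # Sort multi-valued features so the description is deterministic.
        locations = ", ".join(sorted(v.value for v in self.data_location))
        analyses = ", ".join(sorted(v.value for v in self.analysis_support))
        return (
            f"{self.name}: location={locations}; "
            f"model={self.integration_model.value}; "
            f"architecture={self.architecture.value}; "
            f"analysis={analyses}"
        )


# SRS as classified in the table above.
SRS = Classification(
    name="SRS",
    data_location=frozenset({DataLocation.REPLICATED}),
    integration_model=IntegrationModel.NONE,
    architecture=Architecture.THIN_CLIENT,
    analysis_support=frozenset(
        {AnalysisSupport.FUNCTION_CALL, AnalysisSupport.QUERY}
    ),
)

print(SRS.describe())
```

Data location and analysis support are modelled as sets because a system may take several values for those features, while integration model and architecture take a single value in every classification shown here.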
BioNavigator


BioNavigator combines data sources and the tools that act over them.
As tools act on specific kinds of data, the interface makes available only tools that are applicable to the data in hand.
Online trial from: https://www.bionavigator.com/
Initiating Navigation
[Screenshot callouts: select database; enter accession number.]
Viewing Selected Data
[Screenshot callouts: relevant display options; navigate to related programs.]
Chaining Analyses in Macros
Chained collections of navigations can be saved as macros and restored for later use.
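The two ideas described for BioNavigator, offering only the tools applicable to the data in hand and replaying a saved chain of navigations, can be sketched together. This Python sketch is an illustration under stated assumptions and does not reflect how BioNavigator is actually implemented: the tool names, data kinds and functions below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Tool:
    """A tool that consumes one kind of data and produces another."""
    name: str
    input_kind: str
    output_kind: str
    run: Callable[[str], str]


def reverse_complement(seq: str) -> str:
    """Reverse a DNA string and swap each base for its complement."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


# Hypothetical tools and data kinds, chosen only for illustration.
TOOLS = [
    Tool("reverse_complement", "dna", "dna", reverse_complement),
    Tool("translate", "dna", "protein", lambda s: f"protein({s})"),
    Tool("summarise", "protein", "report", lambda s: f"report on {s}"),
]


def applicable_tools(kind: str) -> List[Tool]:
    """Offer only the tools whose input matches the data in hand."""
    return [tool for tool in TOOLS if tool.input_kind == kind]


def run_macro(steps: List[str], kind: str, value: str) -> Tuple[str, str]:
    """Replay a saved chain of tool names, checking applicability at each step."""
    for step in steps:
        offered = {tool.name: tool for tool in applicable_tools(kind)}
        if step not in offered:
            raise ValueError(f"{step!r} is not applicable to {kind!r} data")
        tool = offered[step]
        kind, value = tool.output_kind, tool.run(value)
    return kind, value


# A macro is just the saved list of steps; it can be restored and rerun later.
macro = ["reverse_complement", "translate", "summarise"]
result = run_macro(macro, "dna", "ATGC")
print(result)
```

Storing a macro as a list of tool names keeps it independent of any particular input, so the same chain can be applied to other data of the right kind.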
Classification of BioNavigator
Feature            Values
Data Location      Replicated
Integration Model  None
Architecture       Thin Client
Analysis Support   Function Call, Workflow
Current Public Integration Systems
Location: data is replicated – under control.
Integration model: often minimal.
Architecture: often two-tier.
Analysis support: query and analysis access is carefully contained.
Only very careful instantiation of the classification yields sufficiently predictable performance.
GIMS
GIMS – recent experience
Feature            Values
Data Location      Reorganised
Integration Model  Object-Oriented
Architecture       Multi-tier
Analysis Support   Function Call
Example Analysis

Data:
  Yeast genome sequence.
  Protein-protein interaction data.
  350 transcriptome experiments.
  Overall database ~350 Mb.
Analysis:
  Correlate transcription of interacting proteins.
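One way to picture this analysis is as a Pearson correlation between the expression profiles of each pair of interacting proteins. The Python sketch below is an assumption about the shape of such a computation, not the GIMS implementation: the function names are hypothetical, and the expression values and interaction pairs are made-up toy data rather than results from the yeast data set. The pairs are processed in batches, which is one simple way to split a large run into smaller pieces.

```python
from math import sqrt
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def batches(items: List[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield successive slices of at most `size` pairs."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def correlate_interactions(
    expression: Dict[str, List[float]],
    interactions: List[Pair],
    batch_size: int = 1000,
) -> Dict[Pair, float]:
    """Correlate expression profiles for every interacting pair, batch by batch."""
    results: Dict[Pair, float] = {}
    for batch in batches(interactions, batch_size):
        for a, b in batch:
            results[(a, b)] = pearson(expression[a], expression[b])
    return results


# Toy data for illustration only: one value per (hypothetical) experiment.
expression = {
    "protA": [1.0, 2.0, 3.0, 4.0],
    "protB": [2.0, 4.0, 6.0, 8.0],
    "protC": [4.0, 3.0, 2.0, 1.0],
}
interactions = [("protA", "protB"), ("protA", "protC")]

correlations = correlate_interactions(expression, interactions, batch_size=1)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

In a real setting each batch could be fetched, computed and stored separately, so that no single run has to hold every profile in memory at once.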
Features of Experience




It is challenging to conduct single runs of analyses; they must be broken into pieces.
These are modest data sets compared with what is coming.
The environment has been designed with analysis in mind.
These analyses will never make it into the public release!
Requirements for Integration




Location: replication is transparent.
Integration model: standards.
Architecture: Flexible, multiple tier.
Analysis support: Arbitrary analyses over diverse data sets.
True integration in bioinformatics should not just be data oriented, but involve integration of analyses.
Three Tier Architecture



Clients handle user interaction and presentation.
Application servers perform computation and analysis.
Data servers manage and query databases.
[Diagram: Client, Application Server, Data Server.]
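The division of responsibilities across the three tiers can be sketched in a few lines. This Python example is illustrative only: the class names mirror the tiers described above, while the table layout, the accessions, the sequences and the GC-content analysis are hypothetical choices made to keep the sketch small. It uses the standard-library `sqlite3` module with an in-memory database to stand in for the data server.

```python
import sqlite3


class DataServer:
    """Data tier: manages and queries the database."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE sequence (accession TEXT PRIMARY KEY, residues TEXT)"
        )
        # Hypothetical records, for illustration only.
        self._db.executemany(
            "INSERT INTO sequence VALUES (?, ?)",
            [("X1", "ATGCGC"), ("X2", "ATATAT")],
        )

    def fetch_residues(self, accession: str) -> str:
        row = self._db.execute(
            "SELECT residues FROM sequence WHERE accession = ?", (accession,)
        ).fetchone()
        if row is None:
            raise KeyError(accession)
        return row[0]


class ApplicationServer:
    """Middle tier: performs computation and analysis."""

    def __init__(self, data_server: DataServer) -> None:
        self._data = data_server

    def gc_content(self, accession: str) -> float:
        residues = self._data.fetch_residues(accession)
        return sum(base in "GC" for base in residues) / len(residues)


class Client:
    """Client tier: handles user interaction and presentation."""

    def __init__(self, app_server: ApplicationServer) -> None:
        self._app = app_server

    def show_gc_content(self, accession: str) -> str:
        fraction = self._app.gc_content(accession)
        return f"{accession}: GC content {fraction:.0%}"


client = Client(ApplicationServer(DataServer()))
print(client.show_gc_content("X1"))
```

The client never issues a query itself: it sees only the application server's interface, so the storage behind `DataServer` could change without the client being touched.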
Three Tier Architecture

Scalability:
  Replace/upgrade components as needed.
  Replace/upgrade layers independently.
Flexibility:
  Application server layer protects clients from changes in database layer.
Classical three-tier architectures are configured statically, and are adapted slowly as needs evolve.
Grid Opportunities
Necessary and Missing

Necessary:
  Directory services.
  Discovery services.
  Co-allocation.
  Data replication.
  Workload management.
  Accounting and payment.
Missing:
  Databases.
  Data models.
  Heterogeneity resolution.
  Personalisation.
  Web services.
  Standards.
Dynamic Multi-Tier
Resources need to be identified, selected and scheduled dynamically.
[Diagram: one Client, three Application Servers, two Data Servers.]
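The three steps named on this slide (identify, select, schedule) can be illustrated with a toy directory of resources. The Python sketch below is a simplification under stated assumptions and does not use any real Grid middleware API: the `Directory` class, the `schedule` function, the resource names and the least-loaded selection policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Resource:
    """A server that has been advertised in the directory."""
    name: str
    role: str                    # "application" or "data"
    capabilities: FrozenSet[str]
    load: int = 0                # number of jobs currently scheduled


class Directory:
    """Toy directory service: resources register, clients discover."""

    def __init__(self) -> None:
        self._resources: List[Resource] = []

    def register(self, resource: Resource) -> None:
        self._resources.append(resource)

    def discover(self, role: str, capability: str) -> List[Resource]:
        """Identify every resource with the requested role and capability."""
        return [
            r for r in self._resources
            if r.role == role and capability in r.capabilities
        ]


def schedule(directory: Directory, role: str, capability: str) -> Resource:
    """Select the least-loaded matching resource and assign it one more job."""
    candidates = directory.discover(role, capability)
    if not candidates:
        raise LookupError(f"no {role} resource offers {capability!r}")
    chosen = min(candidates, key=lambda r: r.load)
    chosen.load += 1
    return chosen


# Hypothetical resources, for illustration only.
directory = Directory()
directory.register(Resource("app-1", "application", frozenset({"correlation"})))
directory.register(
    Resource("app-2", "application", frozenset({"correlation", "alignment"}))
)
directory.register(Resource("data-1", "data", frozenset({"yeast"})))

assigned = [schedule(directory, "application", "correlation").name for _ in range(3)]
print(assigned)
```

Because selection happens at request time, adding or removing a server changes only the directory contents, in contrast to a statically configured three-tier system.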
Grid Classification
Feature            Values
Data Location      In-situ, Replicated
Integration Model  None
Architecture       Multi-Tier
Analysis Support   Function Call, Workflow
The current Grid is not the answer, but the answer subsumes the current facilities of the Grid.
Summary

Current integration facilities in biology:
  Are cunningly restrictive.
  Make the most of limited distributed computational architectures.
The Grid is bringing to the table:
  Resource description facilities.
  Resource scheduling and workflow management facilities.
The Grid does not directly address current needs in biology, but its descendants may.