The Informatics in BioInformatics


Informatics perspectives in Bio-Informatics
Atul P Agarwal
Apt Software Avenues Pvt Ltd, Unit G302 Block DC, City Centre, Salt Lake, Kolkata 700064
Two aspects of Informatics


Computational Biology
All the plumbing needed to put a Bioinformatics application together
Application architecture

Standalone





Local computation
Needs to be installed on individual machines
Can connect to a web service
Updates are difficult to manage
Web based




Runs in a browser
Needs no install
Updates are easy
Can connect to other web services
Web application architecture
Client: Browser (HTML, XHTML, DHTML, Javascript, AJAX) or Application (Proprietary, SOAP::Lite)
Client to web server: HTTP, MIME, SOAP XML
Web server: Apache, JBoss, IIS
Web server to application logic: CGI/ASP.NET/JSP
Application logic: Perl, Python, PHP, C/C++, C#
Application logic to database: Database driver, SQL
Database: MySQL, PostgreSQL, SQLServer, Oracle
Platforms - Two camps

Open source

LAMP
Linux
 Apache, JBoss
 MySQL
 Perl, Python, PHP, Java


Microsoft

.Net
SQLServer
 ASP.NET (C, C++, C#, VB.net)

World Wide Web

The World Wide Web (WWW, or simply Web)
is an information space in which the items of
interest, referred to as resources, are identified
by global identifiers called Uniform Resource
Identifiers (URI).
Browsers – the display




Responsible for user input and result display
No algorithmic computation
Displays HTML
Some programmability through Javascript
Browser Operation





The browser recognizes that what a user has typed is a URI.
The browser performs an information retrieval action in accordance with its
configured behavior for resources identified via the "http" URI scheme.
The authority responsible for handling the URI provides information in a
response to the retrieval request.
The browser interprets the response, identified as HTML by the server, and
performs additional retrieval actions for inline graphics and other content as
necessary.
The browser displays the retrieved information, which includes hypertext
links to other information. The user can follow these hypertext links to
retrieve additional information.
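
The first two steps above, recognizing a URI and choosing a retrieval action from its scheme, can be sketched with Python's standard urllib.parse module. The helper name describe_uri is hypothetical and the URL is only an example.

```python
from urllib.parse import urlsplit

def describe_uri(uri):
    """Split a URI into the parts a browser needs before it can retrieve it."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # selects the retrieval protocol, e.g. "http"
        "authority": parts.netloc,   # the host responsible for the resource
        "path": parts.path or "/",   # what to ask that host for
    }

info = describe_uri("http://www.somehost.com/path/file.html")
print(info)
```

A real browser does far more (caching, redirects, rendering); this only shows how the scheme and authority are read from the URI.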
Portability across Browsers

There are many browsers out there
IE
 Firefox
 Safari
 Opera



They have their own idiosyncrasies
Application needs lots of testing
Web Server



Handle multiple incoming requests
Process the HTTP requests
Serve the requests

Multiple possibilities







static pages
cgi-bin
jsp
servlets
Form the HTTP responses
Send back the responses
Maintain sessions
HTTP (Hypertext transfer protocol)




RFC 2616 (the official HTTP/1.1 specification, since superseded by RFC 9110 and RFC 9112)
A request/response protocol.
A client sends a request to the server in the form of a
request method, URI, and protocol version, followed
by a MIME-like message containing request modifiers,
client information, and possible body content over a
connection with a server.
The server responds with a status line, including the
message's protocol version and a success or error code,
followed by a MIME-like message containing server
information, entity meta-information, and possible
entity-body content.
HTTP Message format

The formats of the request and response
messages are similar, and English-oriented. Both
kinds of messages consist of:
an initial line,
 zero or more header lines,
 a blank line (i.e. a CRLF by itself), and
 an optional message body (e.g. a file, or query data,
or query output).
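
The four parts listed above can be separated with a few lines of Python. This is a simplified sketch: the function name parse_http_message is hypothetical, and it ignores details such as repeated headers.

```python
def parse_http_message(raw):
    """Split a raw HTTP message into (initial_line, headers, body)."""
    # The first blank line (a CRLF by itself) separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    initial_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return initial_line, headers, body

raw = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>hello</p>\n"
)
initial_line, headers, body = parse_http_message(raw)
```

The same function works for a request, where the initial line holds the method, URI and protocol version instead of the status.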

Example request

To retrieve the file at the URL
http://www.somehost.com/path/file.html
open a connection to the host www.somehost.com and
send something like the following through the
connection:
GET /path/file.html HTTP/1.0
From: [email protected]
User-Agent: HTTPTool/1.0
[blank line here]

Example response

The server will respond with something like
HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

<html>
<body>
<h1>Happy New Millennium!</h1>
</body>
</html>

After sending the response, the server closes the network connection.
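
To watch one request/response cycle without touching the network, the sketch below starts a throwaway server on a free local port and fetches one page from it, using only the Python standard library. The names DemoHandler, PAGE and fetch_once are hypothetical; the default handler speaks HTTP/1.0, so the server closes the connection after each response.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Happy New Millennium!</h1></body></html>"

class DemoHandler(BaseHTTPRequestHandler):
    """Answers every GET with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)                      # status line
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()                           # the blank line
        self.wfile.write(PAGE)                       # the message body

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def fetch_once(path="/path/file.html"):
    """Start a local server, send one GET request, return the parsed response."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)   # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        result = (response.status, response.reason,
                  response.getheader("Content-Type"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()

status, reason, content_type, body = fetch_once()
```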
HTML (Hypertext Markup Language)



A markup language which consists of tags
embedded in the text of a document.
The browser reading the document interprets
these markup tags to help format the document
for subsequent display to a reader.
However, many of the decisions about layout
are made by the browser.
Basic HTML tags
Tag             Description
<html>          Defines an HTML document
<body>          Defines the document's body
<h1> to <h6>    Defines heading 1 to heading 6
<p>             Defines a paragraph
<br>            Inserts a single line break
<hr>            Defines a horizontal rule
<!--...-->      Defines a comment
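
As a small illustration of how a program sees these tags, the sketch below feeds a made-up page to Python's standard html.parser and records the tags and comments it meets. The class name TagCollector and the page content are hypothetical.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records start tags and comments in the order they appear."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data.strip())

page = """<html>
<body>
<!-- a comment is not displayed -->
<h1>Sequence report</h1>
<p>First line<br>second line</p>
<hr>
</body>
</html>"""

collector = TagCollector()
collector.feed(page)
collector.close()
print(collector.tags)
```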
Evolution of HTML

Emergence of new platforms


Mobiles, TVs, Digital phones
Dynamic HTML
Interactive web pages
 Combines HTML, Javascript, DOM, CSS


XHTML

Stricter and cleaner version of HTML
Evolution of the Web technologies








Static content
Cgi-bin
Servlets
JSP
ASP
Struts
JSF
AJAX
AJAX



Asynchronous JavaScript and XML
Improve the User experience
The browser can continue to communicate with the
web server while the user interacts with the page



The User can do something during long running
computationally intensive jobs
The User can manipulate complex data in a more friendly
manner
Aggregate data from multiple sources into a single view
Enhancing the User experience


iPhone has set a new standard
More demands from the Browser

Rich Internet Applications (RIA)
Silverlight – Microsoft
 Flex – Adobe
 GWT – Google


Web 2.0

Communities and sharing
Building your application

Choice of programming language
 Lightweight: Perl, Ruby, Python
 Heavyweight: C#, Java, C++
 Specialized: R, Matlab, Mathematica
Choice of architecture/framework
Costs
Perl – The language





An interpreted language
Easy and fast
Very good for prototyping
Powerful text manipulation features
Has been used a lot for “plumbing”
Disadvantages of Perl



Interpreted, hence slow
Poor GUI support, screen based or command
line user interaction only
Novices can be caught on the wrong foot
Variables can be used without initialization
 No type checking of variables

BioPerl




A collection of Perl modules
Specifically for Bio-Informatics
Object oriented
Can be a little difficult to get started with
Objects in BioPerl




Sequences
Databases
Alignments
Features and genes on sequences
Parallel Computing


Advent of cheap multi-core CPUs
Availability of libraries to help parallel processing
 STAPL: Standard Template Adaptive Parallel Library
  Protein folding problem using STAPL
 Intel TBB: Intel Threading Building Blocks
 Google MapReduce
Specialized hardware
 Parallelized version of Smith-Waterman algorithm
  http://cmgm.stanford.edu/~brutlag/Papers/brutlag93.pdf
 FPGA implementation of BLAST
  http://www.hicomb.org/papers/HICOMB2004-03.pdf
Very hard to program parallel algorithms
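
One case that is comparatively easy to parallelize is scoring a single query against many database sequences, since each comparison is independent. The sketch below is a plain Python illustration of that idea, not the method of either paper linked above: it computes Smith-Waterman local alignment scores with a simple linear gap penalty (the scoring values are assumptions) and spreads the targets over worker processes with the standard concurrent.futures module.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (score only, no traceback)."""
    previous = [0] * (len(target) + 1)   # previous row of the scoring matrix
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best

def score_database(query, targets, workers=2):
    """Score one query against many targets, one target per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(partial(smith_waterman_score, query), targets))
```

On platforms that start workers by spawning a new interpreter, score_database should be called from under an `if __name__ == "__main__":` guard. Parallelizing inside a single alignment matrix is much harder, because each cell depends on its neighbours.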
CGI (Common Gateway Interface)




a standard way for a web server to invoke a script,
passing certain environment variables and user input
data to the script, and allow the script to return a
result.
one of the oldest ways of providing dynamic web
content.
supported on innumerable low cost web hosting
services
included out of the box with many Apache installations,
such as that provided on Red Hat Linux.
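
A minimal sketch of the script side of CGI, written in Python: the web server puts the query string in the QUERY_STRING environment variable, and the script prints headers, a blank line, and the document. The function name cgi_response and the parameter name id are hypothetical.

```python
import html
import os
from urllib.parse import parse_qs

def cgi_response(environ):
    """Build the text a CGI script would print to standard output."""
    # The web server passes the part of the URL after '?' in QUERY_STRING.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    seq_id = query.get("id", ["(none)"])[0]
    body = "<html><body><p>Requested id: {}</p></body></html>".format(
        html.escape(seq_id)   # never echo user input into HTML unescaped
    )
    # Headers first, then a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # When run by the web server, os.environ holds the CGI variables.
    print(cgi_response(os.environ), end="")
```

Keeping the logic in a function that takes the environment as an argument makes the script testable without a web server.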
CGI in operation
XML (eXtensible Markup Language)



XML is a data format that represents data in a
structured form
XML is a simple, standard way for interchange
of structured textual data between multi-vendor
platforms
XML can be used to store data
XML is used to create new languages






XHTML the latest version of HTML
WSDL for describing available web services
WAP and WML as markup languages for
handheld devices
RSS languages for news feeds
RDF and OWL for describing resources and
ontology
SMIL for describing multimedia for the web
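
As a small illustration of XML as a structured interchange format, the sketch below builds a record with Python's standard xml.etree.ElementTree, serializes it to text, and reads it back. The element names, attribute names and sequence are made up for the example and do not follow any published schema.

```python
import xml.etree.ElementTree as ET

# Build a small record; element and attribute names are illustrative only.
record = ET.Element("sequence", {"id": "demo1", "type": "dna"})
ET.SubElement(record, "description").text = "Made-up example record"
ET.SubElement(record, "residues").text = "ATGGCCATTG"

# Serialize to text, as it would be stored or sent to another system.
xml_text = ET.tostring(record, encoding="unicode")

# Any XML-aware program can read the structure back.
parsed = ET.fromstring(xml_text)
residues = parsed.findtext("residues")
print(xml_text)
```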
Domain Specific XML

WITSML: Oil drilling
JDF: Printing
Gen2Phen: http://www.pageom.org
XML documents

Well formed: Conform to the XML syntax
Valid: Well formed, and also conform to a DTD or schema
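
Well-formedness can be checked with any XML parser. A minimal sketch using Python's standard library follows; the function name is_well_formed and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, i.e. obeys the XML syntax rules."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<gene><name>abc1</name></gene>"
bad = "<gene><name>abc1</gene></name>"   # tags closed in the wrong order
print(is_well_formed(good), is_well_formed(bad))
```

Checking validity needs the DTD or schema the document claims to follow and a validating parser; the standard library module used here does not validate, so a third-party library would be needed for that step.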
Data Models in BioInformatics



Not much standardization so far
Laboratory specific modeling
New initiative for genome data modeling


http://www.pageom.org
Based on XML
Databases

Open source databases: MySQL, PostgreSQL
Commercial databases: Oracle, SQLServer
SQL is the language
The heart and soul of BioInformatics applications
Commercial deployments are expensive!
RDBMS (Relational Database Management System)




Based on a “Relational” model proposed by Codd
A “Relation” is a formal mathematical concept
The operations on Relations are based on “Relational Algebra”
Implemented as tables
 Each table represents a relation; each row is one tuple of it
Relational Algebra

3 primitive operations
 Projection: Select a subset of columns
 Selection: Select a subset of rows
 Join: Cross product of two tables
Set Operations



Union
Intersection
Difference
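
The three primitive operations can be sketched in a few lines of Python, treating a table as a list of rows. Here projection keeps a subset of columns, selection keeps a subset of rows, and join is the cross product of two tables filtered on a shared column. The function names, tables and column names are made up for the example.

```python
def project(rows, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in rows:
        reduced = tuple((c, row[c]) for c in columns)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result

def select(rows, predicate):
    """Selection: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def join(left, right, on):
    """Join: cross product of two tables, kept where column `on` matches."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]

genes = [
    {"gene": "g1", "organism_id": 1},
    {"gene": "g2", "organism_id": 2},
    {"gene": "g3", "organism_id": 1},
]
organisms = [
    {"organism_id": 1, "name": "E. coli"},
    {"organism_id": 2, "name": "S. cerevisiae"},
]

# Which genes belong to E. coli? Join, then select rows, then project columns.
ecoli_genes = project(
    select(join(genes, organisms, on="organism_id"),
           lambda row: row["name"] == "E. coli"),
    ["gene"],
)
```

A real RDBMS evaluates the same operations with indexes and a query optimizer rather than nested loops.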
SQL (Structured Query Language)

For manipulating an RDBMS

Data Definition Language (DDL) statements


To build and modify the structure of tables
Data Manipulation Language (DML) statements
To work with the data in the tables
 4 basic statements





SELECT
INSERT
UPDATE
DELETE
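
A short sketch of one DDL statement followed by the four DML statements. To stay self-contained it uses SQLite through Python's standard sqlite3 module rather than MySQL; the table, columns and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database

# DDL: define the table structure.
conn.execute(
    "CREATE TABLE sequence (id TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# DML: the four basic statements.
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)", ("s1", "E. coli", 1200))
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)", ("s2", "H. sapiens", 3400))
conn.execute("UPDATE sequence SET length = ? WHERE id = ?", (1250, "s1"))
conn.execute("DELETE FROM sequence WHERE id = ?", ("s2",))
rows = conn.execute(
    "SELECT id, organism, length FROM sequence ORDER BY id"
).fetchall()
conn.close()
```

The `?` placeholders let the driver pass values safely instead of pasting them into the SQL text.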
Transaction




RDBMS are multi-user systems
Different programs may be updating the
database at the same time
A DML operation that changes the database is
“effected” only when a COMMIT is issued
To undo a DML change, you can use the
ROLLBACK command instead
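
COMMIT and ROLLBACK can be tried out with a small sketch. It uses SQLite through Python's standard sqlite3 module as a stand-in for a multi-user RDBMS; the table and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO sample (status) VALUES ('received')")
conn.commit()                      # make the insert permanent

conn.execute("UPDATE sample SET status = 'discarded'")
conn.rollback()                    # undo the uncommitted update
after_rollback = conn.execute("SELECT status FROM sample").fetchone()[0]

conn.execute("UPDATE sample SET status = 'sequenced'")
conn.commit()                      # this change is kept
after_commit = conn.execute("SELECT status FROM sample").fetchone()[0]
conn.close()
```

After the rollback the row still reads 'received'; only the committed update to 'sequenced' survives.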
Datatype


An RDBMS has its own type system
The service provider “maps” from the
programming language types to the database
types
MySQL – the database



The ‘M’ in LAMP architecture
Free (GPL License)
Many enterprise features
Distributed databases
 Triggers and stored procedures


Poor XML support
Some MySQL DataTypes







INT          Integer
FLOAT        Small floating-point number
DOUBLE       Double-precision floating-point number
CHAR(N)      Text N characters long (N=1..255)
VARCHAR(N)   Variable length text up to N characters long
TEXT         Text up to 65535 characters long
LONGTEXT     Text up to 4294967295 characters long
DBI (Database Interface) Perl

to access databases from different vendors transparently
 e.g., MySQL, Oracle, Sybase (even plain text files)
 relies on proper DBD (DataBase Driver) modules to talk to the real databases
 there is one DBD module for every different type of database
to connect to different databases (of different types) at the same time and easily move data between them
single generalized API for all types of databases
program at a "higher level" than the API provided by the database system
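
The idea of one generalized API over different drivers is not specific to Perl. The sketch below uses Python's DB-API, which plays a comparable role, rather than DBI itself: it copies rows between two open connections using only generic calls. Two in-memory SQLite databases stand in for two different systems, and the function name copy_rows, the table and the ids are made up.

```python
import sqlite3

def copy_rows(source, destination, table, columns):
    """Copy all rows of `table` from one open connection to another.

    Only generic DB-API calls are used (cursor, execute, executemany, commit).
    The '?' placeholder style is what sqlite3 uses; other drivers may use a
    different style, so this is not fully driver-independent.
    `table` and `columns` are trusted identifiers, not user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    read = source.cursor()
    read.execute("SELECT {} FROM {}".format(column_list, table))
    write = destination.cursor()
    write.executemany(
        "INSERT INTO {} ({}) VALUES ({})".format(table, column_list, placeholders),
        read.fetchall(),
    )
    destination.commit()

# Two independent in-memory databases stand in for two different systems.
lab_db = sqlite3.connect(":memory:")
archive_db = sqlite3.connect(":memory:")
for db in (lab_db, archive_db):
    db.execute("CREATE TABLE ids (ebi_id TEXT)")
lab_db.executemany("INSERT INTO ids VALUES (?)", [("A00001",), ("A00002",)])
lab_db.commit()

copy_rows(lab_db, archive_db, "ids", ["ebi_id"])
copied = archive_db.execute("SELECT ebi_id FROM ids ORDER BY ebi_id").fetchall()
```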
DBD (Database Driver) Perl


convert the general DBI API into the database
system-specific API.
also provide mechanism to access database
specific functionality directly (won’t be used)
Future Databases in Bioinformatics





Parallel database architectures
Data mining
Data warehousing
Improved query techniques
Object oriented databases?
Web Services

Simulates a remote function invocation
A calling program wants to use function hosted on
another machine
 Inputs are passed to a remote function
 The remote function is executed
 The output is returned to the calling program



WSDL to define services
SOAP/XML to invoke services
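
What "passing inputs to a remote function" looks like on the wire can be sketched by building a SOAP 1.1 request envelope by hand with Python's standard library. The service namespace, operation name and parameter are hypothetical; a real service defines them in its WSDL, and a toolkit would normally generate this XML.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

def build_soap_request(service_ns, operation, params):
    """Return a SOAP 1.1 envelope that calls `operation` with named inputs."""
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    # The remote function is an element in the service's own namespace.
    call = ET.SubElement(body, "{%s}%s" % (service_ns, operation))
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)   # one child per input
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical service namespace, operation, and parameter.
request_xml = build_soap_request(
    "urn:example:sequence-service", "getSequence", {"id": "A00001"}
)
print(request_xml)
```

The resulting text would be sent to the service in the body of an HTTP POST, and the output comes back in a similar envelope.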
SOAP::Lite





a collection of Perl modules
provides a simple and lightweight interface to
the Simple Object Access Protocol (SOAP)
on client and server side
the programmer doesn’t have to worry about the
details of the SOAP protocol
http://www.soaplite.com/
Service Oriented Architecture

Structuring large applications as an ad hoc collection of smaller modules called "services"

encapsulation: Many web-services are consolidated to be used under the SOA.
loose coupling: Services maintain a relationship that minimizes dependencies and only requires that they maintain an awareness of each other
contract: Services adhere to a communications agreement, as defined collectively by one or more service description documents
abstraction: Beyond what is described in the service contract, services hide logic from the outside world
reusability: Logic is divided into services with the intention of promoting reuse
composability: Collections of services can be coordinated and assembled to form composite services
autonomy: Services have control over the logic they encapsulate
discoverability: Services are designed to be outwardly descriptive so that they can be found and assessed via available discovery mechanisms
Cloud Computing


Thin clients
Software as a service


Pay per use?
Data stored on servers
Web 3.0 (wiki)








transformation of the Web from a network of separately siloed applications
and content repositories to a more seamless and interoperable whole
ubiquitous connectivity, broadband adoption, mobile Internet access and
mobile devices
network computing, software-as-a-service business models, Web services
interoperability, distributed computing, grid computing and cloud computing
open technologies, open APIs and protocols, open data formats, open-source
software platforms and open data (e.g. Creative Commons)
open identity, OpenID, open reputation, roaming portable identity and
personal data
the intelligent web, Semantic Web technologies such as RDF, OWL, semantic
application platforms, and statement-based datastores
distributed databases, the "World Wide Database" (enabled by Semantic Web
technologies)
intelligent applications, natural language processing, machine learning,
machine reasoning, autonomous agents
Example Bio-workflow

Quickly integrate different web services
 PDB
 EBI
 KEGG




AJAX and Microsoft Atlas technologies
All data exchanged as XML
http://203.197.120.150:82/aptbiocom/
The Lab




A simple cgi-bin application
Reads some EBI sequence ids from a local
MySQL database
Retrieves the DNA sequence from EBI
corresponding to an id
Transcribes the DNA to RNA
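
The last step of the lab, transcription, is small enough to sketch here in Python. The function name transcribe is hypothetical; the input is assumed to be the coding strand, so transcription amounts to replacing T with U, and the example sequence is made up.

```python
def transcribe(dna):
    """Return the RNA for a DNA coding-strand sequence (T becomes U)."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGTN")   # N is accepted as "unknown base"
    if unexpected:
        raise ValueError("not a DNA sequence: %s" % "".join(sorted(unexpected)))
    return dna.replace("T", "U")

rna = transcribe("atggccattg")
print(rna)
```

Retrieving the sequence from EBI and reading the ids from the database are separate steps and are not shown.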