The Informatics in BioInformatics
Informatics perspectives in Bio-Informatics
Atul P Agarwal
Apt Software Avenues Pvt Ltd
Apt Software Avenues Pvt Ltd, Unit G302 Block DC,
City Centre, Salt Lake, Kolkata 700064
Two aspects of Informatics
Computational Biology
All the plumbing needed to put a Bioinformatics application together
Application architecture
Standalone
Local computation
Needs to be installed on individual machines
Can connect to a web service
Updates are difficult to manage
Web based
Runs in a browser
Needs no install
Updates are easy
Can connect to other web services
Web application architecture
Application: Proprietary, SOAP Lite
Browser: HTML, XHTML, DHTML, Javascript, AJAX
Protocols: SOAP XML (application), HTTP, MIME (browser)
Web server: Apache, JBoss, IIS
Application logic: Perl, Python, PHP, C/C++, C#, via CGI/ASP.NET/JSP
Database driver, SQL
Database: MySQL, PostgreSQL, SQL Server, Oracle
Platforms - Two camps
Public domain
LAMP
Linux
Apache, JBoss
MySQL
Perl, Python, PHP, Java
Microsoft
.Net
SQLServer
ASP.NET (C, C++, C#, VB.net)
World Wide Web
The World Wide Web (WWW, or simply Web)
is an information space in which the items of
interest, referred to as resources, are identified
by global identifiers called Uniform Resource
Identifiers (URI).
Browsers – the display
Responsible for user input and result display
No algorithmic computation
Displays HTML
Some programmability through Javascript
Browser Operation
The browser recognizes that what a user has typed is a URI.
The browser performs an information retrieval action in accordance with its
configured behavior for resources identified via the "http" URI scheme.
The authority responsible for handling the URI provides information in a
response to the retrieval request.
The browser interprets the response, identified as HTML by the server, and
performs additional retrieval actions for inline graphics and other content as
necessary.
The browser displays the retrieved information, which includes hypertext
links to other information. The user can follow these hypertext links to
retrieve additional information.
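The first two steps above can be sketched in Python. This is only an illustration of how a URI breaks down into the parts a browser needs, not how any particular browser is implemented. The function name describe_uri and the default port of 80 are assumptions made for the example.

```python
from urllib.parse import urlsplit

def describe_uri(uri):
    """Split a URI into the parts needed before a request can be sent."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # selects the retrieval protocol, e.g. http
        "host": parts.hostname,      # the authority responsible for the resource
        "port": parts.port or 80,    # assumption: default HTTP port when none is given
        "path": parts.path or "/",   # what to ask the server for
    }

if __name__ == "__main__":
    print(describe_uri("http://www.somehost.com/path/file.html"))
```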
Portability across Browsers
There are many browsers out there
IE
Firefox
Safari
Opera
They have their own idiosyncrasies
Application needs lots of testing
Web Server
Handle multiple incoming requests
Process the HTTP requests
Serve the requests
Multiple possibilities
static pages
cgi-bin
jsp
servlets
Form the HTTP responses
Send back the responses
Maintain sessions
HTTP (Hypertext transfer protocol)
RFC 2616 (the official specification)
A request/response protocol.
A client sends a request to the server in the form of a
request method, URI, and protocol version, followed
by a MIME-like message containing request modifiers,
client information, and possible body content over a
connection with a server.
The server responds with a status line, including the
message's protocol version and a success or error code,
followed by a MIME-like message containing server
information, entity meta-information, and possible
entity-body content.
HTTP Message format
The formats of the request and response
messages are similar and English-oriented. Both
kinds of messages consist of:
an initial line,
zero or more header lines,
a blank line (i.e. a CRLF by itself), and
an optional message body (e.g. a file, or query data,
or query output).
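The four parts listed above can be separated with a few lines of Python. This is a minimal sketch, assuming the whole message is already in memory as text and that no header is folded across several lines. The function name parse_http_message is made up for the example.

```python
def parse_http_message(raw):
    """Split a raw HTTP message into (initial_line, headers, body)."""
    # The first blank line (CRLF by itself) separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    initial_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # header names are case-insensitive
    return initial_line, headers, body

if __name__ == "__main__":
    raw = ("HTTP/1.0 200 OK\r\n"
           "Content-Type: text/html\r\n"
           "Content-Length: 13\r\n"
           "\r\n"
           "<html></html>")
    print(parse_http_message(raw))
```

The same function works for requests, because both kinds of message share this layout.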
Example request
To retrieve the file at the URL
http://www.somehost.com/path/file.html
open a connection to the host www.somehost.com
send something like the following through the
connection:
GET /path/file.html HTTP/1.0
From: [email protected]
User-Agent: HTTPTool/1.0
[blank line here]
Example response
The server will respond with something like
HTTP/1.0 200 OK
Date: Fri, 31 Dec 1999 23:59:59 GMT
Content-Type: text/html
Content-Length: 1354

<html>
<body>
<h1>Happy New Millennium!</h1>
</body>
</html>
After sending the response, the server closes the network connection.
HTML (Hypertext Markup Language)
A markup language which consists of tags
embedded in the text of a document.
The browser reading the document interprets
these markup tags to help format the document
for subsequent display to a reader.
However, many of the decisions about layout
are made by the browser.
Basic HTML tags
Tag             Description
<html>          Defines an HTML document
<body>          Defines the document's body
<h1> to <h6>    Defines heading 1 to heading 6
<p>             Defines a paragraph
<br>            Inserts a single line break
<hr>            Defines a horizontal rule
<!-- ... -->    Defines a comment
Evolution of HTML
Emergence of new platforms
Mobiles, TVs, Digital phones
Dynamic HTML
Interactive web pages
Combines HTML, Javascript, DOM, CSS
XHTML
Stricter and cleaner version of HTML
Evolution of the Web technologies
Static content
Cgi-bin
Servlets
JSP
ASP
Struts
JSF
AJAX
AJAX
Asynchronous JavaScript and XML
Improve the User experience
The browser can continue to communicate with the
web server while the user interacts with the page
The User can do something during long running
computationally intensive jobs
The User can manipulate complex data in a more friendly
manner
Aggregate data from multiple sources into a single view
Enhancing the User experience
iPhone has set a new standard
More demands from the Browser
Rich Internet Applications (RIA)
Silverlight – Microsoft
Flex – Adobe
GWT – Google
Web 2.0
Communities and sharing
Building your application
Choice of programming language
Lightweight: Perl, Ruby, Python
Heavyweight: C#, Java, C++
Specialized: R, Matlab, Mathematica
Choice of architecture/framework
Costs
Perl – The language
An interpreted language
Easy and fast
Very good for prototyping
Powerful text manipulation features
Has been used a lot for “plumbing”
Disadvantages of Perl
Interpreted, hence slow
Poor GUI support, screen based or command
line user interaction only
Novice can be caught on the wrong foot
Variables can be used without initialization
No type checking of variables
BioPerl
A collection of Perl modules
Specifically for Bio-Informatics
Object oriented
Can be a little difficult to get started with
Objects in BioPerl
Sequences
Databases
Alignments
Features and genes on sequences
Parallel Computing
Advent of cheap multi-core CPUs
Availability of libraries to help parallel processing
STAPL (Standard Template Adaptive Parallel Library)
Protein folding problem using STAPL
Intel TBB (Intel Threading Building Blocks)
Google MapReduce
Parallelized version of Smith Waterman algorithm
http://cmgm.stanford.edu/~brutlag/Papers/brutlag93.pdf
Specialized hardware
FPGA implementation of Blast
http://www.hicomb.org/papers/HICOMB2004-03.pdf
Very hard to program parallel algorithms
CGI (Common Gateway Interface)
a standard way for a web server to invoke a script,
passing certain environment variables and user input
data to the script, and allow the script to return a
result.
one of the oldest ways of providing dynamic web
content.
supported on innumerable low cost web hosting
services
included out of the box with many Apache installations,
such as that provided on Red Hat Linux.
CGI in operation
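A minimal sketch of what a CGI script does, written in Python for illustration. The web server passes the request to the script through environment variables such as QUERY_STRING, and the script writes headers, a blank line and a body to standard output. The form field name "id" and the function name handle_request are assumptions made for this example.

```python
#!/usr/bin/env python3
import os
import sys
from urllib.parse import parse_qs

def handle_request(environ):
    """Build a CGI response from the environment variables set by the web server."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    seq_id = query.get("id", ["(none)"])[0]   # "id" is a hypothetical form field
    body = "Requested sequence id: " + seq_id + "\n"
    # A CGI response is: header lines, a blank line, then the body.
    return "Content-Type: text/plain\r\n\r\n" + body

if __name__ == "__main__":
    sys.stdout.write(handle_request(os.environ))
```

Placed in a server's cgi-bin directory and marked executable, a script of this shape would answer a request such as /cgi-bin/script.py?id=ABC123, where the path and id are made up.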
XML (eXtensible Markup Language)
XML is a data format that represents data in a
structured form
XML is a simple, standard way for interchange
of structured textual data between multi-vendor
platforms
XML can be used to store data
XML is used to create new languages
XHTML the latest version of HTML
WSDL for describing available web services
WAP and WML as markup languages for
handheld devices
RSS languages for news feeds
RDF and OWL for describing resources and
ontology
SMIL for describing multimedia for the web
Domain Specific XML
WITSML: Oil drilling
JDF: Printing
Gen2Phen: http://www.pageom.org
XML documents
Well formed
Conform to the syntax
Valid
Conform to a DTD or schema, in addition to being well formed
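Well-formedness can be checked with any XML parser. The sketch below uses Python's standard library and tests only the syntax. Checking validity needs a DTD or schema and a validating parser, which the standard library module used here does not provide. The element names in the example are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

if __name__ == "__main__":
    print(is_well_formed("<sequence id='s1'><dna>ATGC</dna></sequence>"))
    print(is_well_formed("<sequence><dna>ATGC</sequence>"))   # mismatched closing tag
```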
Data Models in BioInformatics
Not much standardization so far
Laboratory specific modeling
New initiative for genome data modeling
http://www.pageom.org
Based on XML
Databases
Public domain databases: MySQL, PostgreSQL
Commercial databases: Oracle, SQL Server
SQL is the language
The heart and soul of BioInformatics applications
Commercial deployments are expensive!
RDBMS (Relational Database Management System)
Based on the “Relational” model proposed by Codd
A “Relation” is a formal mathematical concept
The operations on Relations are based on “Relational Algebra”
Relations are implemented as tables
Each row is one tuple of the relation
Relational Algebra
3 primitive operations
Projection: select a subset of columns
Selection: select a subset of rows
Join: cross product of two tables, usually filtered by a join condition
Set Operations
Union
Intersection
Difference
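The primitive operations can be sketched on small in-memory tables, with each row held as a Python dictionary. This is an illustration of the algebra, not of how an RDBMS executes queries. The table contents are invented, and cross_product assumes the two tables have no column names in common.

```python
def select(rows, predicate):
    """Selection: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    result = []
    for row in rows:
        picked = {c: row[c] for c in columns}
        if picked not in result:      # a relation is a set, so duplicates collapse
            result.append(picked)
    return result

def cross_product(left, right):
    """Cross product: pair every row of left with every row of right."""
    return [{**l, **r} for l in left for r in right]

if __name__ == "__main__":
    genes = [{"gene": "g1", "organism": "human"},
             {"gene": "g2", "organism": "mouse"}]
    labs = [{"lab": "A"}, {"lab": "B"}]
    print(select(genes, lambda r: r["organism"] == "human"))
    print(project(genes, ["organism"]))
    print(len(cross_product(genes, labs)))
```

A join can then be written as a cross product followed by a selection on the matching columns.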
SQL (Structured Query Language)
For manipulating an RDBMS
Data Definition Language (DDL) statements
To build and modify the structure of tables
Data Manipulation Language (DML) statements
To work with the data in the tables
4 basic statements
SELECT
INSERT
UPDATE
DELETE
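The statements above can be tried without installing a database server. The sketch below uses SQLite through Python's standard library as a stand-in for MySQL or another RDBMS. The table and column names are invented for the example, and SQL dialects differ in details between vendors.

```python
import sqlite3

def sql_demo():
    """Run one DDL statement and the four basic DML statements on a scratch database."""
    conn = sqlite3.connect(":memory:")   # in-memory database, discarded on close
    cur = conn.cursor()
    # DDL: define the structure of a table
    cur.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, organism TEXT, length INTEGER)")
    # DML: work with the data in the table
    cur.execute("INSERT INTO sequences VALUES (?, ?, ?)", ("seq1", "human", 1200))
    cur.execute("INSERT INTO sequences VALUES (?, ?, ?)", ("seq2", "mouse", 900))
    cur.execute("UPDATE sequences SET length = ? WHERE id = ?", (950, "seq2"))
    cur.execute("DELETE FROM sequences WHERE id = ?", ("seq1",))
    cur.execute("SELECT id, organism, length FROM sequences ORDER BY id")
    rows = cur.fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(sql_demo())
```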
Transaction
RDBMS are multi-user systems
Different programs may be updating the
database at the same time
A DML operation that changes the database is
“effected” only when a COMMIT is issued
To undo a DML change, you can use the
ROLLBACK command instead
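A small sketch of COMMIT and ROLLBACK, again using SQLite from Python's standard library as a stand-in for a multi-user RDBMS. The table name samples and its contents are made up. The sketch relies on the sqlite3 module's default behaviour of opening a transaction before an INSERT or UPDATE.

```python
import sqlite3

def transaction_demo():
    """Show that a rolled-back change disappears and a committed change stays."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO samples (status) VALUES ('new')")
    conn.commit()                                   # the insert is now permanent

    conn.execute("UPDATE samples SET status = 'processed'")
    conn.rollback()                                 # undo the uncommitted update
    after_rollback = conn.execute("SELECT status FROM samples").fetchone()[0]

    conn.execute("UPDATE samples SET status = 'processed'")
    conn.commit()                                   # this time keep the change
    after_commit = conn.execute("SELECT status FROM samples").fetchone()[0]
    conn.close()
    return after_rollback, after_commit

if __name__ == "__main__":
    print(transaction_demo())
```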
Datatype
An RDBMS has its own type system
The service provider “maps” from the
programming language types to the database
types
MySQL – the database
The ‘M’ in LAMP architecture
Free (GPL License)
Many enterprise features
Distributed databases
Triggers and stored procedures
Poor XML support
Some MySQL DataTypes
INT          integer
FLOAT        Small floating-point number
DOUBLE       Double-precision floating-point number
CHAR(N)      Text N characters long (N=1..255)
VARCHAR(N)   Variable length text up to N characters long
TEXT         Text up to 65535 characters long
LONGTEXT     Text up to 4294967295 characters long
DBI (Database Interface) Perl
to access databases from different vendors
transparently
e.g., MySQL, Oracle, Sybase (even Plain text files)
relies on proper DBD (DataBase Driver) modules to talk to
the real databases
there is one DBD module for every different type of database
to connect to different databases (of different types) at
the same time and easily move data between them.
single generalized API for all types of databases
program at a "higher level" than the API provided by
the database system
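Python has a comparable arrangement, the DB-API 2.0 convention (PEP 249), and it is used here only as an analogy to Perl's DBI and DBD. In the sketch below, fetch_sequence_ids uses only calls that every DB-API driver provides, while open_demo_database is the driver-specific part, here SQLite. The function names and the sequences table are assumptions for the example.

```python
import sqlite3

def fetch_sequence_ids(conn):
    """Driver-independent part: uses only cursor(), execute() and fetchall()."""
    cur = conn.cursor()
    cur.execute("SELECT id FROM sequences ORDER BY id")
    ids = [row[0] for row in cur.fetchall()]
    cur.close()
    return ids

def open_demo_database():
    """Driver-specific part: creates a scratch SQLite database with two rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (id TEXT)")
    conn.executemany("INSERT INTO sequences VALUES (?)", [("seq2",), ("seq1",)])
    conn.commit()
    return conn

if __name__ == "__main__":
    print(fetch_sequence_ids(open_demo_database()))
```

Switching to another database would mean replacing open_demo_database with that driver's connect call. Parameter placeholder styles can differ between drivers, so queries with parameters may need small changes.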
DBD (Database Driver) Perl
convert the general DBI API into the database
system-specific API.
also provide mechanism to access database
specific functionality directly (won’t be used)
Future Databases in Bioinformatics
Parallel database architectures
Data mining
Data warehousing
Improved query techniques
Object oriented databases ?
Web Services
Simulates a remote function invocation
A calling program wants to use function hosted on
another machine
Inputs are passed to a remote function
The remote function is executed
The output is returned to the calling program
WSDL to define services
SOAP/XML to invoke services
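To show what "SOAP/XML to invoke services" looks like on the wire, the sketch below builds a SOAP 1.1 request envelope with Python's standard library. It does not send anything. The operation name getSequence, the namespace urn:example:sequences and the parameter id are all hypothetical. A real service publishes its actual names in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"   # SOAP 1.1 envelope namespace

def build_soap_request(operation, service_ns, params):
    """Wrap a remote function call and its inputs in a SOAP envelope."""
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    call = ET.SubElement(body, "{%s}%s" % (service_ns, operation))
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)      # one child element per input
    return ET.tostring(envelope, encoding="unicode")

if __name__ == "__main__":
    print(build_soap_request("getSequence", "urn:example:sequences", {"id": "ABC123"}))
```

The resulting text would be sent as the body of an HTTP POST to the service, and the output comes back in a similar envelope.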
SOAP::Lite
a collection of Perl modules
provides a simple and lightweight interface to
the Simple Object Access Protocol (SOAP)
on client and server side
the programmer doesn’t have to worry about the
details of the SOAP protocol
http://www.soaplite.com/
Service Oriented Architecture
Structuring large applications as an ad hoc collection of smaller modules called "services"
encapsulation: Many web services are consolidated to be used under the SOA.
loose coupling: Services maintain a relationship that minimizes dependencies and only requires that they maintain an awareness of each other
contract: Services adhere to a communications agreement, as defined collectively by one or more service description documents
abstraction: Beyond what is described in the service contract, services hide logic from the outside world
reusability: Logic is divided into services with the intention of promoting reuse
composability: Collections of services can be coordinated and assembled to form composite services
autonomy: Services have control over the logic they encapsulate
discoverability: Services are designed to be outwardly descriptive so that they can be found and assessed via available discovery mechanisms
Cloud Computing
Thin clients
Software as a service
Pay per use ?
Data stored on servers
Web 3.0 (wiki)
transformation of the Web from a network of separately siloed applications
and content repositories to a more seamless and interoperable whole
ubiquitous connectivity, broadband adoption, mobile Internet access and
mobile devices
network computing, software-as-a-service business models, Web services
interoperability, distributed computing, grid computing and cloud computing
open technologies, open APIs and protocols, open data formats, open-source
software platforms and open data (e.g. Creative Commons)
open identity, OpenID, open reputation, roaming portable identity and
personal data
the intelligent web, Semantic Web technologies such as RDF, OWL, semantic
application platforms, and statement-based datastores
distributed databases, the "World Wide Database" (enabled by Semantic Web
technologies)
intelligent applications, natural language processing, machine learning,
machine reasoning, autonomous agents
Example Bio-workflow
Quickly integrate different web services
PDB
EBI
KEGG
AJAX and Microsoft Atlas technologies
All data exchanged as XML
http://203.197.120.150:82/aptbiocom/
The Lab
A simple cgi-bin application
Reads some EBI sequence ids from a local
mysql database
Retrieves the DNA sequence from EBI
corresponding to an id
Transcribes the DNA to RNA
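The last step of the lab can be sketched as follows. The slides describe a cgi-bin application without showing its code, so this Python version is only an illustration. It assumes the sequence is the coding strand and contains only A, C, G and T. Sequences retrieved from a real database may also contain ambiguity codes such as N, which this sketch rejects.

```python
def transcribe(dna):
    """Transcribe a coding-strand DNA sequence to RNA by replacing T with U."""
    dna = dna.upper()
    invalid = set(dna) - set("ACGT")
    if invalid:
        raise ValueError("unexpected characters: " + "".join(sorted(invalid)))
    return dna.replace("T", "U")

if __name__ == "__main__":
    print(transcribe("ATGCTT"))
```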