K. SEKAR, Ph.D.
BIOINFORMATICS CENTRE
INDIAN INSTITUTE OF SCIENCE
BANGALORE 560 012
INDIA
[email protected]
Voice: (91)-80-3942469 FAX : (91)-80-3600683
(91)-80-3601409
(91)-80-3600551
APPROACHES TO DEVELOPING DATA MINING TOOLS
Abstract
Bioinformatics is one of the fastest
growing interdisciplinary areas in the
biological sciences and has exploded in
such a way that we need powerful tools
to organize and analyze the data. An
overview will be presented of the general
features of data mining tools and techniques
and their applications.
Bioinformatics is the fashionable new
name for the field previously called
computational biology. The name is
preferred by many because it puts
the emphasis on data storage
and analysis, rather than on the
biology, and the field is really data
driven.
The term Bioinformatics is used to encompass almost all
computer applications in the biological sciences, but was
originally coined in the mid-1980s for the analysis of
biological sequence data.
The quantity of known sequence data outweighs protein
structural data and, by virtue of the genome projects,
sequence databases are doubling in size every year.
A key challenge of bioinformatics is to analyze the wealth of
sequence data in order to understand the amassed information
in terms of protein structure, function and evolution.
Wherever possible, a range of different methods should be
used, and the results should be married with all available
biological information.
Bioinformatics has provided us with a communication
channel to reach and decode all this information in a
comprehensive manner.
Both the large information repositories and the
specialized tools to query them are held on
distributed internet sites; therefore Bioinformatics
requires sound internet navigation skills.
The primary integrating technology that facilitates
access to copious data is the world wide web.
Refers to database-like activities involving
persistent sets of data that are maintained in a
consistent state over essentially indefinite
periods of time.
Encompasses the use of algorithmic tools to
facilitate biological database analyses.
Comprises the entire collection of information
management systems, analysis tools and
communication networks supporting biology.
DATA MINING
Data mining is defined as
“exploration and analysis by
automatic and semi-automatic
means, of large quantities of data
in order to discover meaningful
patterns and rules”.
The central challenge is to derive
maximum results from the wealth
of data. This can be achieved by
establishing and maintaining
databases and providing search
and analysis tools to interpret the
data.
DATABASE
A database is nothing but a collection of
quantitative data resulting from experimental
measurements or observations in various fields of
science. Recently, interest in databases
has been kindled through international
efforts to organize and analyze the data
and update the knowledge.
A database is essentially just a store
of information, usually in the
form of simple files (just a flat file,
say). You can shove information into
this store or retrieve it from the store.
Derived Database
One of the greatest challenges in database
research is to analyze the database in depth and
create derived databases to meet the needs
or demands of users without compromising the
sustainability and quality of the existing
database. Such a derived database is
expected to dramatically reduce
the workload of the user community and to
serve as a highly focused database.
DBREF 1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES 1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES 2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES 3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES 4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES 5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES 6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES 7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES 8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES 9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2+
FORMUL 3 HOH *134(H2 O1)
HELIX 1 1 LEU 2 LYS 12 1 11
HELIX 2 2 PRO 18 ASP 21 1 4
HELIX 3 3 ASP 40 LYS 57 1 18
HELIX 4 4 ASP 59 VAL 63 1 5
HELIX 5 5 ALA 90 LYS 108 1 19
HELIX 6 6 LYS 113 HIS 115 5 3
SHEET 1 A 2 TYR 75 SER 78 0
SHEET 2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS 77
SSBOND 2 CYS 27 CYS 123
SSBOND 3 CYS 29 CYS 45
SSBOND 4 CYS 44 CYS 105
SSBOND 5 CYS 51 CYS 98
SSBOND 6 CYS 61 CYS 91
SSBOND 7 CYS 84 CYS 96
LINK CA CA 124 O TYR 28
LINK CA CA 124 O GLY 32
CRYST1 47.120 64.590 38.140 90.00 90.00 90.00 P 21 21 21 4
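As an illustration of how a derived database might be built from records like these, the sketch below collects the SEQRES lines of an entry into a one-letter sequence. It is a minimal sketch, not the Centre's software: the function name `seqres_to_sequence` is hypothetical, and the parser assumes the whitespace-separated layout shown on this slide (record name, serial number, residue count, then residue names, with no chain identifier) rather than the fixed-column layout of a real PDB file.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequence(pdb_text):
    """Collect the residues of all SEQRES records into a one-letter string.

    Assumes the layout 'SEQRES <serial> <count> RES RES ...' (no chain ID).
    Residue names that are not in the table are written as 'X'.
    """
    letters = []
    for line in pdb_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "SEQRES":
            continue  # ignore every other record type
        for residue in fields[3:]:
            letters.append(THREE_TO_ONE.get(residue, "X"))
    return "".join(letters)


sample = (
    "SEQRES 1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE\n"
    "SEQRES 10 123 LEU ASP LYS LYS ASN CYS\n"
    "HET CA 124 1\n"
)
sequence = seqres_to_sequence(sample)
```

Running the same function over every entry of a local PDB mirror and storing the identifier together with the sequence would give a small sequence-only database derived from the primary one.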
SUB-DERIVED DATABASE
EXAMPLE-1
RADHASEKAR
SHAMIASEKAR
SARADASEKAR
EXAMPLE-2
XAXAXA
KAMALA
SARADA
YAMAHA
KANAGA
MANASA
VANASA
PANAMA
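One way to read the two examples above is as pattern-based selections: EXAMPLE-1 groups strings that share the segment SEKAR at the end, and EXAMPLE-2 groups six-letter strings that follow the template XAXAXA. The sketch below expresses both as regular expressions. This interpretation, the helper name `select` and the extra non-matching test words are assumptions made for illustration.

```python
import re

# EXAMPLE-1: strings that end with the shared segment "SEKAR".
SHARED_SUFFIX = re.compile(r"SEKAR$")

# EXAMPLE-2: six letters in which every second letter is "A" (X A X A X A).
ALTERNATING_A = re.compile(r"^(?:[A-Z]A){3}$")


def select(pattern, words):
    """Return the words that match the given compiled pattern."""
    return [word for word in words if pattern.search(word)]


names = ["RADHASEKAR", "SHAMIASEKAR", "SARADASEKAR", "KAMALA"]
words = ["KAMALA", "SARADA", "YAMAHA", "KANAGA",
         "MANASA", "VANASA", "PANAMA", "BANGALORE"]

suffix_hits = select(SHARED_SUFFIX, names)
template_hits = select(ALTERNATING_A, words)
```

The same idea carries over to sequence data, where the template would describe a motif instead of a name.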
Adding information to the database
Software to collate the required information from the database
Analyze the collated information
WHY A TOOL?
The amount of information in the world is
growing exponentially, and it is becoming
impossible to manage the data
effectively. Machine assistance is clearly necessary,
but the difficulty lies in designing systems and
software capable of discovering
“useful” information with minimal human
intervention.
PROTEIN DATA BANK (PDB)
GENOME DATABASE (GDB)
STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP)
CAMBRIDGE STRUCTURAL DATABASE (CSD)
Protein Data Bank (PDB) & Genome Database (GDB)
PDB (Protein Data Bank)
Anonymous FTP - SERVER
The PDB anonymous FTP server is up to date and
contains all 20,317 sets of atomic coordinates
of macromolecules (proteins, nucleic acids and
carbohydrates) that have been deposited in the
Protein Data Bank so far.
For Weekly update
http://iris.physics.iisc.ernet.in/index.html
For complete entries click on
“COMPLETE LIST OF ALL PDB ENTRIES”
PDB-MIRROR site machine
3.06 GHz PIV machine
1 GB RD RAM
240 GB Hard Disk
32 MB Graphics Card
Powered by Red Hat 7.3 Linux Operating
System
Given PDB-Id : 1une
HEADER HYDROLASE 05-NOV-97 1UNE
TITLE CARBOXYLIC ESTER HYDROLASE, 1.5 ANGSTROM ORTHORHOMBIC FORM
TITLE 2 OF THE BOVINE RECOMBINANT PLA2
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: PHOSPHOLIPASE A2;
COMPND 3 CHAIN: NULL;
COMPND 4 EC: 3.1.1.4;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: BOS TAURUS;
SOURCE 3 ORGANISM_COMMON: BOVINE;
SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: BL21 (DE3) PLYSS;
SOURCE 6 EXPRESSION_SYSTEM_PLASMID: PTO-A2MBL21;
SOURCE 7 EXPRESSION_SYSTEM_GENE: MATURE PLA2
KEYWDS HYDROLASE, ENZYME, CARBOXYLIC ESTER HYDROLASE
EXPDTA X-RAY DIFFRACTION
AUTHOR M.SUNDARALINGAM
REVDAT 1 06-MAY-98 1UNE 0
REMARK 1 REFERENCE 1
REMARK 1 AUTH K.SEKAR,A.KUMAR,X.LIU,M.-D.TSAI,M.H.GELB,
REMARK 1 AUTH 2 M.SUNDARALINGAM
REMARK 1 TITL CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH A TRANSITION STATE
REMARK 1 TITL 3 ANALOGUE
REMARK 1 REF TO BE PUBLISHED
REMARK 1 REFN 0353
REMARK 1 REFERENCE 2
REMARK 1 AUTH K.SEKAR,C.SEKARUDU,M.-D.TSAI,M.SUNDARALINGAM
REMARK 1 TITL 1.72A RESOLUTION REFINEMENT OF THE TRIGONAL FORM OF
REMARK 1 TITL 2 BOVINE PANCREATIC PHOSPHOLIPASE A2
REMARK 1 REF TO BE PUBLISHED
REMARK 1 REFN 0353
REMARK 1 REFERENCE 3
REMARK 1 AUTH K.SEKAR,S.ESWARAMOORTHY,M.K.JAIN,M.SUNDARALINGAM
REMARK 1 TITL CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH THE INHIBITOR
REMARK 1 TITL 3 1-HEXADECYL-3-(TRIFLUOROETHYL)-SN-GLYCERO-2-
REMARK 1 TITL 4 PHOSPHOMETHANOL
REMARK 1 REF BIOCHEMISTRY V. 36 14186 1997
REMARK 2 RESOLUTION. 1.5 ANGSTROMS.
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.5
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 10.0
REMARK 3 DATA CUTOFF (SIGMA(F)) : 1.0
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 0.1
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 1000000.0
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 92.
REMARK 3 NUMBER OF REFLECTIONS : 17572
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : NULL
REMARK 3 FREE R VALUE TEST SET SELECTION : X-PLOR
REMARK 3 R VALUE (WORKING SET) : 0.184
REMARK 3 FREE R VALUE : 0.228
REMARK 3 FREE R VALUE TEST SET SIZE (%) : 7.
REMARK 3 FREE R VALUE TEST SET COUNT : 1198
REMARK 3 ESTIMATED ERROR OF FREE R VALUE : 0.24
REMARK 3 PARAMETER FILE 1 : PARHCSDX.PRO
REMARK 3 PARAMETER FILE 2 : NULL
REMARK 3 TOPOLOGY FILE 1 : TOPHCSDX.PRO
REMARK 3 TOPOLOGY FILE 2 : NULL
REMARK 3 OTHER REFINEMENT REMARKS: NULL
REMARK 4 1UNE COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : 26-JAN-1996
REMARK 200 TEMPERATURE (KELVIN) : 291
REMARK 200 PH : 7.2
REMARK 200 NUMBER OF CRYSTALS USED : 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) : N
REMARK 200 RADIATION SOURCE : NULL
REMARK 200 BEAMLINE : NULL
REMARK 200 X-RAY GENERATOR MODEL : R-AXIS IIC
REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M
REMARK 200 WAVELENGTH OR RANGE (A) : 1.5418
REMARK 200 MONOCHROMATOR : GRAPHITE
REMARK 200 OPTICS : NULL
REMARK 200
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : 1.5
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE LOW (A) : 1.55
REMARK 200 COMPLETENESS FOR SHELL (%) : 63.
REMARK 200 DATA REDUNDANCY IN SHELL : 3.7
REMARK 200 R MERGE FOR SHELL (I) : 0.172
REMARK 200 R SYM FOR SHELL (I) : NULL
REMARK 200 <I/SIGMA(I)> FOR SHELL : NULL
REMARK 200
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: THE HIGH RESOLUTION
REMARK 200 ATOMIC COORDINATES OF THE WILD TYPE (PDB ENTRY 1BP2)
REMARK 200 WERE USED AS THE STARTING MODEL FOR REFINEMENT.
REMARK 200 SOFTWARE USED: X-PLOR
REMARK 200 STARTING MODEL: WILD TYPE (PDB ENTRY 1BP2)
REMARK 200
REMARK 200 REMARK: NULL
REMARK 280
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: P 21 21 21
REMARK 290
REMARK 290 SYMOP SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555 X,Y,Z
REMARK 290 2555 1/2-X,-Y,1/2+Z
REMARK 290 3555 -X,1/2+Y,1/2-Z
REMARK 290 4555 1/2+X,1/2-Y,-Z
DBREF 1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES 1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES 2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES 3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES 4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES 5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES 6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES 7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES 8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES 9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2+
FORMUL 3 HOH *134(H2 O1)
HELIX 1 1 LEU 2 LYS 12 1 11
HELIX 2 2 PRO 18 ASP 21 1 4
HELIX 3 3 ASP 40 LYS 57 1 18
HELIX 4 4 ASP 59 VAL 63 1 5
HELIX 5 5 ALA 90 LYS 108 1 19
HELIX 6 6 LYS 113 HIS 115 5 3
SHEET 1 A 2 TYR 75 SER 78 0
SHEET 2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS 77
…
REMARK 3 FIT IN THE HIGHEST RESOLUTION BIN.
REMARK 3 TOTAL NUMBER OF BINS USED : 8
REMARK 3 BIN RESOLUTION RANGE HIGH (A) : 1.5
REMARK 3 BIN RESOLUTION RANGE LOW (A) : 1.55
REMARK 3 BIN COMPLETENESS (WORKING+TEST) (%) : 63.
REMARK 3 REFLECTIONS IN BIN (WORKING SET) : 1176
REMARK 3 BIN R VALUE (WORKING SET) : 0.340
REMARK 3 BIN FREE R VALUE : 0.352
REMARK 3 BIN FREE R VALUE TEST SET SIZE (%) : 7.
REMARK 3 BIN FREE R VALUE TEST SET COUNT : 81
REMARK 3 ESTIMATED ERROR OF BIN FREE R VALUE : NULL
REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK 3 PROTEIN ATOMS : 957
REMARK 3 NUCLEIC ACID ATOMS : 0
REMARK 3 HETEROGEN ATOMS : 1
REMARK 3 SOLVENT ATOMS : 134
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A**2) : NULL
REMARK 3 MEAN B VALUE (OVERALL, A**2) : NULL
REMARK 3 LOW RESOLUTION CUTOFF (A) : NULL
REMARK 3 CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
ATOM   1 N   ALA 1 13.830 17.835 32.697 1.00 11.41
ATOM   2 CA  ALA 1 12.869 16.725 32.889 1.00 11.31
ATOM   3 C   ALA 1 12.106 16.547 31.592 1.00 12.00
ATOM   4 O   ALA 1 12.366 17.226 30.614 1.00 11.37
ATOM   5 CB  ALA 1 11.891 17.029 34.056 1.00 11.89
ATOM   6 N   LEU 2 11.150 15.638 31.585 1.00 13.43
ATOM   7 CA  LEU 2 10.392 15.362 30.376 1.00 14.98
ATOM   8 C   LEU 2  9.556 16.543 29.879 1.00 14.65
ATOM   9 O   LEU 2  9.465 16.764 28.657 1.00 13.62
ATOM  10 CB  LEU 2  9.522 14.116 30.561 1.00 15.03
ATOM  11 CG  LEU 2  8.919 13.539 29.291 1.00 17.13
ATOM  12 CD1 LEU 2 10.038 13.103 28.360 1.00 17.29
ATOM  13 CD2 LEU 2  8.027 12.361 29.656 1.00 17.65
ATOM  14 N   TRP 3  8.960 17.305 30.796 1.00 14.18
ATOM  15 CA  TRP 3  8.157 18.443 30.347 1.00 16.10
ATOM  16 C   TRP 3  8.998 19.448 29.543 1.00 14.26
ATOM  17 O   TRP 3  8.580 19.864 28.472 1.00 14.34
ATOM  18 CB  TRP 3  7.359 19.103 31.491 1.00 19.02
ATOM  19 CG  TRP 3  8.163 19.810 32.534 1.00 24.63
ATOM  20 CD1 TRP 3  8.699 19.262 33.683 1.00 25.51
ATOM  21 CD2 TRP 3  8.505 21.199 32.555 1.00 27.29
ATOM  22 NE1 TRP 3  9.348 20.230 34.403 1.00 27.56
ATOM  23 CE2 TRP 3  9.253 21.428 33.743 1.00 28.36
ATOM  24 CE3 TRP 3  8.258 22.278 31.686 1.00 27.60
ATOM  25 CZ2 TRP 3  9.754 22.695 34.083 1.00 28.94
ATOM  26 CZ3 TRP 3  8.761 23.542 32.026 1.00 28.78
ATOM  27 CH2 TRP 3  9.503 23.735 33.216 1.00 29.43
GDB-MIRROR site machine
3.06 GHz PIV machine
1 GB RD RAM
240 GB Hard Disk
32 MB Graphics Card
Powered by Red Hat 7.3 Linux Operating
System
Current directory is /Pub/Genome
Up to higher level directory
A thaliana/
C elegans/
H sapiens/
MITOCHONDRIA/
P falciparum/
README
S cerevisiae/
Bacteria/
*.faa= FASTA Amino Acid file
*.fna= FASTA Nucleic Acid file
*.gbk= GenBank flat file format
*.gbs= GenBank summary file format
*.ptt= ProTein Table
*.tab= Table to assemble genome
*.val= ASN.1 binary format
CAMBRIDGE STRUCTURAL DATABASE
• The CAMBRIDGE STRUCTURAL DATABASE
• Software for search, retrieval, display and
analysis of CSD contents
The CSD records bibliographic, 2D chemical and 3D
structural results from crystallographic analyses of
organics, organometallics and metal complexes. Both
X-ray and neutron diffraction studies are included for
small and medium-sized compounds containing up to
500 atoms (including hydrogens).
THREE DBA
COMPONENTS
Database Integrity
Database Security
Database Recovery
DATABASE INTEGRITY
The major issue for database management is to ensure
that the data in the database is accurate, correct, valid and
consistent. Any inconsistency between two or more entries that
represent the same entity demonstrates a lack of integrity.
Database technology cannot do very much to protect users
against data errors made in the outside world before the data
has been entered in the system.
However, certain safety measures can be built into a database
to ensure that errors within the system are minimized.
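One such safety measure can be sketched as a consistency check that groups entries by the entity they describe and reports any field on which they disagree. This is only an illustration: the function name `find_conflicts`, the field names and the deliberately misspelt organism in the sample are made up for the example.

```python
from collections import defaultdict


def find_conflicts(entries, key_field):
    """Report, per entity, the fields whose values differ between entries.

    entries   -- list of dicts, one per database entry
    key_field -- name of the field that identifies the entity
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[key_field]].append(entry)

    conflicts = {}
    for key, group in groups.items():
        # A field is in conflict when the group holds more than one value for it.
        differing = sorted(
            field for field in group[0]
            if field != key_field and len({e.get(field) for e in group}) > 1
        )
        if differing:
            conflicts[key] = differing
    return conflicts


# Illustrative entries; the second one carries a typing error on purpose.
entries = [
    {"pdb_id": "1UNE", "molecule": "PHOSPHOLIPASE A2", "organism": "BOS TAURUS"},
    {"pdb_id": "1UNE", "molecule": "PHOSPHOLIPASE A2", "organism": "BOS TARUS"},
    {"pdb_id": "1BP2", "molecule": "PHOSPHOLIPASE A2", "organism": "BOS TAURUS"},
]
conflicts = find_conflicts(entries, "pdb_id")
```

A check of this kind can only flag that two entries disagree; deciding which value is correct still needs a curator or an external reference.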
DATA RECOVERY
The process of recovery involves restoring the database, following
some kind of failure, to a state which is known to be correct.
The technique of redundancy is used, in the sense that it has to
be possible to recover the database to its correct state from
information available somewhere else in the system.
The most common way to achieve this is to dump the contents
of the database at a defined frequency onto another
medium, such as magnetic tape or optical disk, which is then stored in
a safe place.
DATABASE SECURITY
The DBA has to ensure that adequate
measures are taken to prevent unauthorized
disclosure, alteration or destruction of both the
data within the database and the database
software itself.
A password and a list of privileges attached to it
are most commonly used to control user
access rights to database information.
THREE COMPONENTS OF
DATABASE
Development of a database structure that
allows the storage and maintenance of the
required data.
Data entry, maintenance and management.
Retrieval of the data by end users equipped
with suitable analysis and display tools.
DATABASE ADMINISTRATION
The database administrator (DBA) is a person or a
group of persons responsible for overall control of
database systems.
The DBA is usually not only answerable for the
design of the database, but also for choice of DBMS
used, its implementation and training of all involved
in the database running and use.
Once the data is entered, it has to be maintained and
kept up to date.
Data -> [Selection] -> Target data
-> [Preprocessing & transformation] -> “Cleaned” data
-> [Data Mining] -> Patterns
-> [Interpretation, evaluation & validation] -> Knowledge
PROBLEMS WITH THE DATA
Incomplete data
Noisy data
Temporal data
An extremely large amount of data
Non-textual data
INCOMPLETE DATA
Some data may be missing (e.g., some
fields may be left blank)
Sometimes, the fact that data is missing
is itself a valuable piece of information.
NOISY DATA
The field may contain incorrectly entered
information.
We do not know how this affects the
certainty factor (or confidence level) of
the results.
TEMPORAL DATA
Since databases grow rapidly, how can data
be incrementally added to our results?
What effect should this have on the
knowledge discovery process?
AN EXTREMELY LARGE
AMOUNT OF DATA
Some datasets can grow significantly over time.
How should such datasets be processed?
One option is parallel processing, where each of n
processors handles approximately 1/n of the
data in approximately 1/n of the time.
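The idea of giving each of n processors about 1/n of the data can be sketched with Python's standard `concurrent.futures` module. The task chosen here (counting cysteine residues in one-letter sequences) and the function names are assumptions made for illustration. The 1/n running time is an ideal that real programs only approach, because starting workers and moving data between them costs time.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n):
    """Split items into n contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_cysteines(sequences):
    """Work done by one processor: count 'C' residues in its chunk."""
    return sum(sequence.count("C") for sequence in sequences)


def parallel_count(sequences, n_workers=4):
    """Give each worker process roughly 1/n of the data and add the results."""
    chunks = split_into_chunks(sequences, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_cysteines, chunks))
```

In a script, `parallel_count(sequences)` would normally be called from inside an `if __name__ == "__main__":` block so that worker processes can import the module safely on every platform.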
NON-TEXTUAL DATA
There are many types of data
that need to be manipulated,
including image data, multimedia
data (Video and Sound), spatial
data in GIS and user defined
data types.
Stand alone machine
application
Web Application
PERL
Very powerful for string manipulation
Uses CGI as the interface
JAVA
Application programming (standalone machine)
Applet programming (web oriented)
Useful for graphics applications over the WWW
WHAT IS PERL?
PERL is an interpreted language optimized for scanning
arbitrary text files and extracting information from those text files.
The language is intended to be practical (easy to use, efficient,
complete) rather than beautiful (tiny, elegant and minimal).
PERL uses sophisticated pattern matching techniques to scan
large amounts of data very quickly. Although optimized for
scanning text, PERL can also deal with binary data and can
make dbm files look like associative arrays.
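The kind of pattern-based text scanning described above can be illustrated with regular expressions, shown here in Python rather than PERL purely for the sake of the example. The sketch pulls the start residue, end residue and length out of HELIX records. It assumes the whitespace-separated layout used on these slides (no chain identifiers), which differs from the fixed-column layout of real PDB files, and the function name `helix_segments` is hypothetical.

```python
import re

# HELIX <serial> <id> <start name> <start no.> <end name> <end no.> <class> <length>
HELIX = re.compile(
    r"^HELIX +(\d+) +(\S+) +([A-Z]{3}) +(\d+) +([A-Z]{3}) +(\d+) +(\d+) +(\d+) *$",
    re.MULTILINE,
)


def helix_segments(text):
    """Return (start_residue, end_residue, length) for every HELIX record."""
    return [
        (m.group(3) + m.group(4), m.group(5) + m.group(6), int(m.group(8)))
        for m in HELIX.finditer(text)
    ]


sample = (
    "HELIX 2 2 PRO 18 ASP 21 1 4\n"
    "HELIX 3 3 ASP 40 LYS 57 1 18\n"
    "SHEET 1 A 2 TYR 75 SER 78 0\n"
)
segments = helix_segments(sample)
```

Lines that do not fit the pattern, such as the SHEET record in the sample, are simply skipped.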
CGI ( Common Gateway Interface)
The Common Gateway Interface (CGI), as its name
implies, provides a gateway between a user
(client) and a command/logic oriented server.
CGI performs the task of translation: it
translates the needs of clients into server
requests and then translates the server
replies back for the clients.
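The two-way translation can be sketched as a single function that turns a client's query string into a server-side call and turns the reply into an HTML response. This is a minimal sketch under stated assumptions: the parameter name `id`, the function names and the one-entry lookup table are hypothetical. A real CGI program would read the query from the `QUERY_STRING` environment variable and print the response to standard output.

```python
from html import escape
from urllib.parse import parse_qs


def lookup(pdb_id):
    """Stand-in for the server-side logic (hypothetical one-entry table)."""
    titles = {"1UNE": "PHOSPHOLIPASE A2"}
    return titles.get(pdb_id)


def cgi_gateway(query_string):
    """Translate a client query into a server call and the reply into HTML."""
    params = parse_qs(query_string)             # e.g. "id=1une" -> {"id": ["1une"]}
    pdb_id = params.get("id", [""])[0].upper()  # what the client asked for
    answer = lookup(pdb_id)                     # request passed to the server
    if answer is None:
        body = "<p>No match</p>"
    else:
        body = "<p>" + escape(pdb_id) + ": " + escape(answer) + "</p>"
    # A CGI response is a header block, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n<html><body>" + body + "</body></html>"


response = cgi_gateway("id=1une")
```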
Client
CGI
Server
Client
Java Servlet
Server
RMI concept is very
useful for multitier
architecture
EXAMPLE
www.hotmail.com
www.google.com
Software
(Search Engine)
RMI
WEB-Page
Java Server Pages (Sun Microsystems)
Active Server Pages (Microsoft Corporation)
Useful for dynamic web page creation
GRAPHICAL USER
INTERFACE
(GUI)
The programmer can quickly design the user
interface by drawing and arranging the screen
elements rather than writing the raw code.
A GUI is easy for users to visualize.
It is user friendly.
Example:
MS-WINDOWS OPERATING SYSTEMS
GUI (Graphical User Interface)
Active X
(Microsoft corporation)
Java swing
(Sun micro systems)
Buttons, boxes and pull down
menus (windows based)
VB (Visual Basic)
Application development language.
Supports graphics.
Good for standalone applications.
Web programming is not possible directly, but
script languages (VBScript or JavaScript)
can be used to make it web oriented.
VC++
System & Application
Programming
Almost same as VB
Additional advantage
System side
WORLD WIDE WEB (W W W)
The World Wide Web is the best known and fastest growing Internet
function. It is a way of accessing information already on the
Internet, using the concept of hypertext to link information. As with
FTP, any type of digital document, image, artwork, movie
or sound on a remote computer can be reached, here through
hyperlinks. The protocol used for accessing such information is
HTTP (Hyper Text Transfer Protocol).
The hyperlinked documents are known as HTML documents.
They are written in a special language called HTML, which stands for
Hyper Text Markup Language. HTML is nothing but ASCII
text with embedded tags.
DBMS & RDBMS
DBMS:
Dbase
MS-Access
Mysql-server
FoxPro
(partially RDBMS)
RDBMS:
Sybase
Oracle
SQL-server
DATABASE
a bunch of tables
TABLES
Store numerous rows of information
FIELDS
The little boxes inside a table
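The three terms can be seen together in a few lines using SQLite through Python's standard `sqlite3` module. SQLite is not one of the systems named on these slides and is used here only because it needs no server; the table name `structure` and its field names are likewise assumptions for the example.

```python
import sqlite3

# DATABASE: here an in-memory one that disappears when the program ends.
conn = sqlite3.connect(":memory:")

# TABLE with three FIELDS: identifier, molecule name and resolution.
conn.execute(
    "CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, molecule TEXT, resolution REAL)"
)

# One row of information, taken from the 1UNE header.
conn.execute(
    "INSERT INTO structure VALUES (?, ?, ?)",
    ("1UNE", "PHOSPHOLIPASE A2", 1.5),
)
conn.commit()

# Retrieve two fields of the row that matches the identifier.
rows = conn.execute(
    "SELECT molecule, resolution FROM structure WHERE pdb_id = ?", ("1UNE",)
).fetchall()
```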
SQL Server is an expensive whopper of a database system,
used in corporations that need to store huge
wads of information.
ORACLE is another database format.
The best way to create your own Access database is by using
Microsoft Access. This tool ships with the professional edition of
Office 97 and enables you to graphically design your own
tables and individual fields.
Yet another one is MySQL.
Typical Web Search
Keywords
Search Engine
Output
Web Browser
Form
O/p (in HTML)
HTML
WWW
HTML
Form
O/p (in HTML)
CGI-Program
Flat file
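The flow in the diagram above (keywords from a form, a program that searches a flat file, output returned as HTML) can be sketched as two small functions. The `code|description` layout of the flat-file lines and the function names are assumptions made for this example; the descriptions are the search-engine names listed in this talk.

```python
from html import escape


def search_flat_file(lines, keyword):
    """Return the flat-file lines that contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def render_html(keyword, hits):
    """Format the hits as a small HTML page for the web browser."""
    items = "".join("<li>" + escape(hit) + "</li>" for hit in hits)
    return (
        "<html><body><h1>Results for " + escape(keyword) + "</h1>"
        "<ul>" + items + "</ul></body></html>"
    )


# Hypothetical flat file: one record per line, fields separated by "|".
flat_file = [
    "psst|Protein Sequence Search Tool",
    "bsdd|Biomolecules Segment Display Device",
    "msgs|Motif Search in Genome Sequence",
    "ssep|Secondary Structural Elements in Protein",
]
hits = search_flat_file(flat_file, "search")
page = render_html("search", hits)
```

Escaping the keyword and the hits before placing them in the page keeps any markup characters in the data from being interpreted as HTML tags.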
Packages developed at the
Bioinformatics Centre
Raman Building
Indian Institute of Science
Bangalore 560 012
Principal Investigator
Dr. K. Sekar
E-mail [email protected]
Search Engines
144.16.71.10/psst Protein Sequence Search Tool
144.16.71.2/bsdd Biomolecules Segment Display Device
144.16.71.10/msgs Motif Search in Genome Sequence
144.16.71.2/ssep Secondary Structural Elements in Protein
Programmers
1. S.Saravanan
2. A.Ajmal Khan
3. C.K.Rajesh
4. T.Kamaraj
5. P.Selvarani
6. V.Shanthi
7. S.Sirajuddin Sheik
Database with Search facility
144.16.71.2/lsdb Lipase Structural Database
144.16.71.2/lysdb Lysozyme Structural Database
144.16.71.2/asdb 3D-Amylase Database
144.16.71.2/gsdb Globin Structural Database
Programmers
1. C.K.Rajesh
2. T.Kamaraj
3. P.Sundrarajan
4. P.Selvarani
5. V.Shanthi
6. A.S.Zahir Hussain
7. S.Sirajuddin Sheik
Software for
Structure analysis & manipulation
144.16.71.146/cap Conformation Angles Package
144.16.71.146/rp Ramachandran Plot
144.16.71.146/wap Water Analysis Package
144.16.71.146/sem Symmetry Equivalent Molecules Generator
144.16.71.146/pdbgoodies PDBGOODIES
144.16.71.10/gpsm Geometrical Parameters for Small Molecules
144.16.71.146/mbd Measurability of Bijvoet difference
144.16.71.146/dtf Distribution of Temperature Factor
Programmers
1. C.K.Rajesh
2. T.Kamaraj
3. P.Sundarajan
4. P.Selvarani
5. V.Shanthi
6. S.Sirajuddin Sheik
Protein Sequence Search Tool (PSST 1.1)
S.Saravanan,A.Ajmul Khan & K.Sekar
CURR.SCIENCE, (2000) 550 – 552
PDB Goodies – A Web based GUI to manipulate
Protein Data Bank files
A.S.Z.Hussain,V.Shanthi,S.S.Sheik,J.Jeyakanthan,P.Selvarani &K.Sekar
ACTA CRYST. (2002), D58, 1385 – 1386
Ramachandran Plot (RP)
S.Sheik,P. Sundararajan,A.S.Z Hussain & K.Sekar
BIOINFORMATICS (2002) (in the press)
Biomolecules Segment Display Device (BSDD)
P.Selvarani,V.Shanthi,C.K.Rajesh,S.Saravanan & K.Sekar
J.MOL. GRAPHICS & MODELLING (2002) (in press)
Water Analysis Package (WAP)
V.Shanthi, C.K.Rajesh,J.Jayalakshmi,V.G.Vijay & K.Sekar
J.APPL.CRYST. (2002) (in the press)
Take Home Message
Datamining is nothing but exploiting the
Hidden Trends in your data.
Create your own derived database.
No one tool or set of tools is universally
applicable.
Present the data in a useful format such as
a graph or a table.
Department of Biotechnology
Ministry of Science & Technology
Govt. of India
India
&
Jai Vigyan National Science Foundation
Govt. of India
India
Professor M. Vijayan
Professor N. Balakrishnan
Professor S.M. Rao
Professor S. Ramakumar
Colleagues and Friends