Biological Databases


Structure Databases
DNA/Protein structure-function
analysis and prediction
Lecture 6
Bioinformatics Section, Vrije Universiteit, Amsterdam
The dictionary definition
Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Function: noun
Date: circa 1962
: a usually large collection of data organized
especially for rapid search and retrieval (as by
a computer)
- Webster dictionary
WHAT is a database?
A collection of data that needs to be:

Structured

Searchable

Updated (periodically)

Cross referenced
Challenge:

To change “meaningless” data into useful information that can be
accessed and analysed the best way possible.
For example:
HOW would YOU organise all biological sequences so that the
biological information is optimally accessible?
You need an appropriate database management system (DBMS)
DBMS
A unity of programs that store, extract and modify data
Internal organization: controls speed and flexibility
[Diagram: USER(S) <-> DBMS (store, extract, modify) <-> Database]
DBMS organisation types
Flat file databases (flat DBMS)

Simple, restrictive, table
Hierarchical databases (hierarchical DBMS)

Simple, restrictive, tables
Relational databases (RDBMS)

Complex, versatile, tables
Object-oriented databases (ODBMS)

Complex, versatile, objects
Relational databases
Data is stored in multiple related tables
Data relationships across tables can be
either many-to-one or many-to-many
A few rules allow the database to be
viewed in many ways
Let's convert the “course details” to a
relational database
Our flat file database
FLAT DATABASE 2: Course details

Name       Depart.    Course   E1  E2  E3  P1  P2  …
Student 1  Chemistry  Biology  A   B   B   A   C   …
Student 1  Chemistry  Maths    C   C   B   A   A   …
Student 1  Chemistry  English  A   A   A   A   A   …
.          .          .        .
Student 2  Ecology    Biology  A   B   A   A   A   …
Student 2  Ecology    Maths    A   D   A   A   A   …
.          .          .        .
Normalize (1NF) …
We remove repeating records (rows)

sID  Name      dID
1    Student1  1
2    Student2  2

sID  cID  E1  E2  E3  P1  P2  …
1    1    A   B   B   A   C   …
1    2    C   C   B   A   A   …
1    3    A   A   A   A   A   …
.    .    .   .
2    1    A   B   A   A   A   …
2    2    A   D   A   A   A   …
.    .    .   .

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

Primary keys: sID, cID and dID in the student, course and department tables
Foreign keys: dID in the student table; sID and cID in the grades table
Normalize (2NF) …
We remove redundant fields (columns)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

gID  Grade
1    A
2    B
3    C

sID  cID  gID  wID
1    1    1    1
1    1    2    2
1    1    2    3
1    1    1    4
1    1    3    5
2    1    1    1
2    1    2    2
2    1    1    3
2    1    1    4
2    1    1    5
Relational Databases
What have we achieved?





No repeating information
Less storage space
Better reality representation
Easy modification/management
Easy usage of any combination of records
Remember:
the DBMS has programs to access and edit this
information, so it does not matter that the
primary keys are hard for a human to read
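The normalized layout can be tried out with a small sketch using Python's built-in sqlite3 module. The table and column names (student, assessment, result and so on) are assumptions made for this illustration, and only the Biology rows of the course-details example are loaded. A single JOIN query then rebuilds the human-readable flat view from the keys.

```python
import sqlite3

# In-memory database; table and column names are illustrative choices.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student    (sID INTEGER PRIMARY KEY, name TEXT,
                         dID INTEGER REFERENCES department(dID));
CREATE TABLE course     (cID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE assessment (wID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE grade      (gID INTEGER PRIMARY KEY, letter TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),
    cID INTEGER REFERENCES course(cID),
    gID INTEGER REFERENCES grade(gID),
    wID INTEGER REFERENCES assessment(wID),
    PRIMARY KEY (sID, cID, wID)
);
""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Chemistry"), (2, "Ecology")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Student1", 1), (2, "Student2", 2)])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])
conn.executemany("INSERT INTO assessment VALUES (?, ?)",
                 [(1, "E1"), (2, "E2"), (3, "E3"), (4, "P1"), (5, "P2")])
conn.executemany("INSERT INTO grade VALUES (?, ?)",
                 [(1, "A"), (2, "B"), (3, "C")])
# (sID, cID, gID, wID): the Biology grades of both students.
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 2, 3), (1, 1, 1, 4), (1, 1, 3, 5),
    (2, 1, 1, 1), (2, 1, 2, 2), (2, 1, 1, 3), (2, 1, 1, 4), (2, 1, 1, 5),
])

# Follow the foreign keys to recover readable rows.
rows = conn.execute("""
    SELECT s.name, d.name, c.name, a.name, g.letter
    FROM result r
    JOIN student s    ON s.sID = r.sID
    JOIN department d ON d.dID = s.dID
    JOIN course c     ON c.cID = r.cID
    JOIN assessment a ON a.wID = r.wID
    JOIN grade g      ON g.gID = r.gID
    ORDER BY r.sID, r.cID, r.wID
""").fetchall()

for row in rows:
    print(row)
```

Each department, course and grade name is stored once; the result table holds only keys.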
Accessing database information
A request for data from a database is
called a query
Queries can be of three forms:



Choose from a list of parameters
Query by example (QBE)
Query language
Query Languages
The standard
SQL (Structured Query Language), originally
called SEQUEL (Structured English QUEry
Language)
Developed by IBM in 1974; introduced
commercially in 1979 by Oracle Corp.
Standard interactive and programming
language for getting information from and
updating a database.

RDBMS (SQL), ODBMS (Java, C++, OQL, etc.)
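As a minimal sketch of the two uses named above, getting information and updating, the following runs one SELECT and one UPDATE through Python's sqlite3 module. The course table is an assumed toy example, not part of any real biological database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cID INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])

# Getting information: a SELECT with a bound parameter.
maths = conn.execute(
    "SELECT cID, name FROM course WHERE name = ?", ("Maths",)
).fetchone()

# Updating: change one record, identified by its primary key.
conn.execute("UPDATE course SET name = ? WHERE cID = ?", ("Mathematics", 2))

names = [row[0] for row in conn.execute("SELECT name FROM course ORDER BY cID")]
print(maths, names)
```

The `?` placeholders let the database driver insert the values, which avoids building SQL strings by hand.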
Distributed databases
From local to global attitude
Data appears to be in one location but is most definitely
not
A definition: Two or more data files in different locations,
periodically synchronized by the DBMS to keep data in
all locations consistent (A,B,C)
An intricate network for combining and sharing
information
Administrators praise fast network technologies!!!
Users praise the internet!!!
Data warehouse
Periodically, one imports data from databases and stores
it (locally) in the data warehouse.
Now a local database can be created, containing for
instance protein family data (sequence, structure,
function and pathway/process data integrated with the
gene expression and other experimental data).
Disadvantage: expensive, intensive, needs to be
updated.
Advantage: easy control of integrated data-mining
pipeline.
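The import step can be pictured with a small sketch. The source names, record identifiers and values below are all hypothetical; a real warehouse would pull from remote databases and rerun the import periodically.

```python
from collections import defaultdict

def import_into_warehouse(sources):
    """Merge records from several source databases into one local store.

    `sources` maps a source name to {record_id: {field: value}}.
    """
    warehouse = defaultdict(dict)
    for source_name, records in sources.items():
        for record_id, fields in records.items():
            for field, value in fields.items():
                # Prefix each field with its source so provenance is kept.
                warehouse[record_id][f"{source_name}.{field}"] = value
    return dict(warehouse)

# Hypothetical source databases sharing a protein identifier.
sources = {
    "sequence_db": {
        "prot1": {"sequence": "MKVLA"},
        "prot2": {"sequence": "MSTNP"},
    },
    "expression_db": {
        "prot1": {"level": 7.2},
    },
}

warehouse = import_into_warehouse(sources)
print(warehouse["prot1"])
```

Because everything ends up in one local structure, a data-mining pipeline can read integrated records without contacting the original databases.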
So why do biologists care?
Three main reasons
Database proliferation

Dozens to hundreds at the moment
More and more scientific discoveries result
from inter-database analysis and mining
Rising complexity of required data combinations

E.g. translational medicine: “from bench to
bedside” (genomic data vs. clinical data)
Biological databases
Like any other database

Data organization for optimal analysis
Data is of different types


Raw data (DNA, RNA, protein sequences)
Curated data (DNA, RNA and protein
annotated sequences and structures,
expression data)
Raw Biological data
Nucleic Acids (DNA)
Raw Biological data
Amino acid residues (proteins)
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Expression data
Mass spectrometry
Curated Biological Data
Proteins, residue sequences
Extended sequence information
MCTUYTCUYFSTYRCCTYFSCD
Mass spectrometry
(metabolomics, proteomics)
Secondary structure
Post-Translational protein
Modification (PTM)
Hydrophobicity, motif data
Protein-protein interaction
Curated Biological data
3D Structures, folds
Biological Databases
The 2003 NAR Database Issue: http://nar.oupjournals.org/content/vol31/issue1/
Distributed information
Pearson’s Law: The usefulness of a column of
data varies as the square of the number of
columns it is compared to.
A few biological databases
Nucleotide Databases
Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome,
MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations,
IMGT
Genome Databases
Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
Protein Databases
Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis,
HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases
PDB, MSD, FSSP, DALI
Microarray Database
ArrayExpress
Literature Databases
MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases
BAliBASE, Homstrad, FSSP
Structural Databases
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
Structural Classification of Proteins
(SCOP)
http://scop.berkeley.edu
http://scop.mrc-lmb.cam.ac.uk/scop/
PDB
3D Macromolecular structural data
Data originates from NMR or X-ray
crystallography techniques
Total number of structures: 34,626 (17/01/2006)
If the 3D structure of a protein is solved ...
they have it
PDB content
PDB information
The PDB files have a standard format
Key features
Informative descriptors
PDB-mirror on the WWW …
e.g. 1AE5
Example output: 1AE5
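Because PDB files use fixed column positions, an ATOM record can be read by slicing each line. The sketch below handles only a few fields of the ATOM/HETATM records. The sample record is invented for illustration and is not taken from entry 1AE5.

```python
# One ATOM record assembled field by field so the column positions are explicit.
# The values are made up for this example.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: unused
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: unused
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and unused columns
    "  27.340"    # columns 31-38: x coordinate (angstroms)
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
)

def parse_atom_record(line):
    """Return a dict for an ATOM/HETATM line, or None for other records."""
    if line[:6] not in ("ATOM  ", "HETATM"):
        return None
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def read_atoms(lines):
    """Parse every coordinate record in an iterable of PDB lines."""
    atoms = (parse_atom_record(line) for line in lines)
    return [atom for atom in atoms if atom is not None]

atom = parse_atom_record(SAMPLE_LINE)
print(atom)
```

Splitting on whitespace would be shorter, but it fails when adjacent fields touch, which is why the sketch slices by column.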
SCOP
Structural Classification Of Proteins
3D Macromolecular structural data grouped
based on structural classification
Data originates from the PDB
Current version (v1.69)
25973 PDB Entries (July 2005).
70859 Domains
SCOP levels bottom-up
1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this
means that pairwise residue identities between the proteins are 30% and greater. However, in
some cases similar functions and structures provide definitive evidence of common descent in
the absence of high sequence identity; for example, many globins form a family though some
members have sequence identities of only 15%.
2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features
suggest that a common evolutionary origin is probable are placed together in superfamilies.
For example, actin, the ATPase domain of the heat shock protein, and hexokinase together
form a superfamily.
3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary
structures in the same arrangement and with the same topological connections. Different
proteins with the same fold often have peripheral elements of secondary structure and turn
regions that differ in size and conformation. In some cases, these differing peripheral regions
may comprise half the structure. Proteins placed together in the same fold category may not
have a common evolutionary origin: the structural similarities could arise just from the physics
and chemistry of proteins favouring certain packing arrangements and chain topologies.
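The 30% figure quoted for families refers to pairwise residue identity, which can be computed from two already-aligned sequences. The sketch below uses one possible definition (identical residues divided by the number of columns without a gap); other definitions use a different denominator. The sequences are made up, and the threshold check is only the rule of thumb from the text, since SCOP assignments themselves also rely on structure and function.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted in this definition
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def meets_family_rule_of_thumb(aligned_a, aligned_b, threshold=30.0):
    """True if identity reaches the 30% guideline quoted for SCOP families."""
    return percent_identity(aligned_a, aligned_b) >= threshold

print(percent_identity("ACDEFGHIKL", "ACDEYGHIKV"))
print(percent_identity("AC-DE", "ACGDF"))
```

A pair below the threshold may still belong to one family, as the globin example with 15% identity shows.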
SCOP-mirror on the WWW …
Enter SCOP at the top of the hierarchy
Keyword search of SCOP entries
CATH
Class, derived from secondary structure content, is
assigned for more than 90% of protein structures
automatically.
Architecture, which describes the gross orientation of
secondary structures, independent of connectivities, is
currently assigned manually.
The Topology level clusters structures according to their
topological connections and numbers of secondary
structures.
The Homologous superfamilies cluster proteins with
highly similar structures and functions. The assignments
of structures to topology families and homologous
superfamilies are made by sequence and structure
comparisons.
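CATH labels each node of this hierarchy with a dotted numeric code, one number per level (Class.Architecture.Topology.Homologous superfamily). A small sketch of working with such codes follows. The domain names and their code assignments are invented for illustration.

```python
from collections import defaultdict

LEVELS = ("class", "architecture", "topology", "homologous_superfamily")

def split_cath_code(code):
    """Map each CATH level to the cumulative code prefix that identifies it."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} dotted numbers, got {code!r}")
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}

def group_by_level(domains, level):
    """Group domain identifiers that share the same node at the given level."""
    groups = defaultdict(list)
    for domain_id, code in domains.items():
        groups[split_cath_code(code)[level]].append(domain_id)
    return dict(groups)

# Hypothetical domains with illustrative codes.
domains = {"domA": "3.40.50.720", "domB": "3.40.50.300", "domC": "1.10.8.10"}

print(split_cath_code("3.40.50.720"))
print(group_by_level(domains, "topology"))
```

Two domains with the same topology prefix but different final numbers share a fold yet sit in different homologous superfamilies.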
CATH-mirror on the WWW …
DSSP
Dictionary of secondary structure of proteins
The DSSP database comprises the secondary
structures of all PDB entries
DSSP is actually software that translates the
PDB structural co-ordinates into secondary
(standardized) structure elements
A similar example is STRIDE
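DSSP writes one secondary-structure letter per residue (H, G, I for helices, E and B for strands and bridges, T and S for turns and bends). Many analyses collapse these into three states. The mapping below is one common convention and is an assumption here, since other groupings are also in use; the input string is made up.

```python
# One common reduction of DSSP states to helix (H), strand (E) and coil (C).
DSSP_TO_THREE_STATE = {
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi helix
    "E": "E",  # extended strand
    "B": "E",  # isolated beta bridge
    "T": "C",  # turn
    "S": "C",  # bend
    "-": "C",  # no assignment
    " ": "C",  # blank, as written in DSSP output files
}

def reduce_dssp(states):
    """Collapse a DSSP state string to three states; unknown letters become coil."""
    return "".join(DSSP_TO_THREE_STATE.get(state, "C") for state in states)

print(reduce_dssp("--HHHHGGTTEEEEBS"))
```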
WHY bother???
Researchers create and use the data
Use of known information for analyzing
new data
New data needs to be screened
Structural/Functional information
Extends the knowledge and information on
a higher level than DNA or protein
sequences
In the end ….
Computers can figure out all kinds of
problems, except the things in the
world that just don't add up.
James Magary
We should add:
For that we employ the human brain,
experts and experience.
Bio-databases: A short word on
problems
Even today we face some key limitations

There is no standard format
Every database or program has its own format

There is no standard nomenclature
Every database has its own names

Data is not fully optimized
Some datasets have missing information without indications
of it

Data errors
Data is sometimes of poor quality, erroneous, misspelled
Error propagation resulting from computer annotation
What to take home
Databases are a collection of data

Need to access and maintain easily and flexibly
Biological information is vast and sometimes
very redundant
Distributed databases bring it all together with
quality controls, cross-referencing and
standardization
Computers can only create data; they do not
give answers
Suggested review: “Integrating biological
databases”, Stein, Nature Reviews Genetics, 2003