Biological Databases


Structure Databases
DNA/Protein structure-function
analysis and prediction
Lecture 6
Bioinformatics Section, Vrije Universiteit, Amsterdam
Some pictures were taken from http://www.umanitoba.ca/afs/plant_science/courses
The dictionary definition
Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Origin: circa 1962
: a usually large collection of data organized
especially for rapid search and retrieval
(as by a computer)
- Webster dictionary
WHAT is a database?

A collection of data that needs to be:
• Structured (standardized data representation)
• Searchable
• Updated (periodically)
• Cross-referenced

Challenge:
• To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.
Organizing data into knowledge
HOW would YOU organise all biological sequences so that the
biological information is optimally accessible?
You need an appropriate database management system (DBMS)
DBMS

Internal organization
• Controls speed and flexibility

A unity of programs that
• Store
• Extract
• Modify

[Diagram: USER(S) store, extract and modify data in the Database through the DBMS]
DBMS organisation types

• Flat file databases (flat DBMS)
  - Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  - Simple, restrictive, tables
• Relational databases (RDBMS)
  - Complex, versatile, tables
• Object-oriented databases (ODBMS)
  - Complex, versatile, objects
A flat file database
Cell_Stock : "SK11.pEA215.3"
Species "Escherichia coli"
Plasmid "pEA215.3"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pI206KS"
Species "Escherichia coli"
Plasmid "pI206KS"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pEA46.2"
Species "Escherichia coli"
Plasmid "pEA46.2"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"
...

A collection of records, each containing several data fields.

Disadvantages:
• Redundancy
• Forces a single view of the data (‘organizer’ and ‘attributes’)
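
The redundancy in such a flat file can be made visible with a few lines of Python. This is only a minimal sketch: the regular expression, the function name `parse_records` and the two-record sample are assumptions made for illustration, based on the record layout shown on the slide, not a parser for any real database format.

```python
import re
from collections import Counter

# Two records in the layout shown on the slide (sample shortened for illustration).
FLAT_FILE = '''\
Cell_Stock : "SK11.pEA215.3"
Species "Escherichia coli"
Plasmid "pEA215.3"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"

Cell_Stock : "SK11.pI206KS"
Species "Escherichia coli"
Plasmid "pI206KS"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
'''

# Assumed line shape: a field name, an optional colon, then a quoted value.
FIELD = re.compile(r'^(\w+)\s*:?\s*"(.*)"$')


def parse_records(text):
    """Split the flat file into one dict per Cell_Stock record."""
    records = []
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        key, value = match.groups()
        if key == "Cell_Stock":
            records.append({})  # the 'organizer' field starts a new record
        if records:
            records[-1][key] = value
    return records


records = parse_records(FLAT_FILE)

# Count how often each (field, value) pair is stored, ignoring the record name.
value_counts = Counter(
    (key, value)
    for record in records
    for key, value in record.items()
    if key != "Cell_Stock"
)
redundant = {pair: n for pair, n in value_counts.items() if n > 1}

for (key, value), n in sorted(redundant.items()):
    print(f"{key} = {value!r} is stored {n} times")
```

Every value reported here is typed out again in each record, and the file can only be read "by cell stock"; asking for everything in one freezer means scanning all records.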
Relational databases

Data is stored in multiple related tables

Data relationships across tables can be either
many-to-one or many-to-many

A few rules allow the database to be viewed in
many ways

Let's convert the “course details” to a relational
database
Our flat file database
FLAT DATABASE 2: Course details

Name       Depart.    Course   E1  E2  E3  P1  P2
Student 1  Chemistry  Biology  A   B   B   A   C   …..
Student 1  Chemistry  Maths    C   C   B   A   A   …..
Student 1  Chemistry  English  A   A   A   A   A   …..
Student 2  Ecology    Biology  A   B   A   A   A   …..
Student 2  Ecology    Maths    D   A   A   A   A   …..
.          .          .        .
Normalization 1: remove repeating records (rows)

sID  Name      dID
1    Student1  1
2    Student2  2

sID  cID  E1  E2  E3  P1  P2
1    1    A   B   B   A   C   …..
1    2    C   C   B   A   A   …..
1    3    A   A   A   A   A   …..
2    1    A   B   A   A   A   …..
2    2    D   A   A   A   A   …..
.    .    .

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

Primary keys: sID in the student table, cID in the course table, dID in the department table
Foreign keys: dID in the student table; sID and cID in the grades table
Normalization 2: remove repeating records (columns)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

gID  Grade
1    A
2    B
3    C

sID  cID  gID  wID
1    1    1    1
1    1    2    2
1    1    2    3
1    1    1    4
1    1    3    5
2    1    1    1
2    1    1    2
2    1    2    3
2    1    1    4
2    1    1    5
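
The normalized tables above can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table names (`student`, `course`, `department`, `assessment`, `grade`, `result`) are invented for illustration, only the ID columns and values come from the slide, and only Student1's Biology rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table names; the key columns follow the slide (sID, cID, dID, wID, gID).
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE student    (sID INTEGER PRIMARY KEY, name TEXT,
                         dID INTEGER REFERENCES department(dID));
CREATE TABLE course     (cID INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE assessment (wID INTEGER PRIMARY KEY, project TEXT);
CREATE TABLE grade      (gID INTEGER PRIMARY KEY, grade TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),
    cID INTEGER REFERENCES course(cID),
    gID INTEGER REFERENCES grade(gID),
    wID INTEGER REFERENCES assessment(wID),
    PRIMARY KEY (sID, cID, wID)
);
""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Chemistry"), (2, "Ecology")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Student1", 1), (2, "Student2", 2)])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])
conn.executemany("INSERT INTO assessment VALUES (?, ?)",
                 [(1, "E1"), (2, "E2"), (3, "E3"), (4, "P1"), (5, "P2")])
conn.executemany("INSERT INTO grade VALUES (?, ?)",
                 [(1, "A"), (2, "B"), (3, "C")])
# Student1's Biology results as (sID, cID, gID, wID).
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 2, 3),
                  (1, 1, 1, 4), (1, 1, 3, 5)])

# Joining on the keys rebuilds the human-readable rows of the flat file.
rows = conn.execute("""
    SELECT s.name, d.department, c.course, a.project, g.grade
    FROM result AS r
    JOIN student    AS s ON s.sID = r.sID
    JOIN department AS d ON d.dID = s.dID
    JOIN course     AS c ON c.cID = r.cID
    JOIN assessment AS a ON a.wID = r.wID
    JOIN grade      AS g ON g.gID = r.gID
    ORDER BY r.sID, r.cID, r.wID
""").fetchall()

for row in rows:
    print(row)
```

Each name (department, course, grade letter) is stored once; the `result` table holds only keys, and the join turns them back into readable text on demand.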
Relational Databases

What have we achieved?





No repeating information
Less storage space
Better reality representation
Easy modification/management
Easy usage of any combination of records
Remember: the DBMS has programs to access and edit
this information, so ignore the fact that the primary
keys are hard for a human to read
Accessing database
information

A request for data from a database is
called a query

Queries can take three forms:
• Choose from a list of parameters
• Query by example (QBE)
  - A QBE build wizard lets the user choose which data to display
• Query language
Query Languages

The standard




SQL (Structured Query Language) originally called
SEQUEL (Structured English QUEry Language)
Developed by IBM in 1974; introduced commercially
in 1979 by Oracle Corp.
Standard interactive and programming language for
getting information from and updating a database.
RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
Querying our biological
relational database

Many views are possible …
Plasmid View

Plasmid    Species           Cell Stock
pEA25      Escherichia coli  SK10.2.pEA25
pEA46.2    Escherichia coli  SK11.pEA46.2
pEA207.2   Escherichia coli  SK11.pEA207.2
pEA214.6   Escherichia coli  MB123.pEA214.6
pEA215.3   Escherichia coli  SK11.pEA215.3
pEA238.2   Escherichia coli  MB123.3.PEA238.2
pEA238.11  Escherichia coli  MB123.3.pEA238.11
pEA277.11  Escherichia coli  SK11.pEA277.11
pEA303.4   Escherichia coli  SK11.pEA303.4
pEA315.2   Escherichia coli  MB123.3.pEA315.2
peB4       Escherichia coli  VB1.eB4
Experiment View

Experiment  Cell Stock      Box           Freezer
SK4         SK4.pPS-IAA4-5  Pisum ESTs I  AG334 -80C
SK4         SK4.pPS-IAA6    Pisum ESTs I  AG334 -80C
SK4         SK4.pTic110     Pisum ESTs I  AG334 -80C
SK4         SK4.pToc34      Pisum ESTs I  AG334 -80C
SK4         SK4.pToc86      Pisum ESTs I  AG334 -80C
SK5         SK5.pAB96.3     Pisum ESTs I  AG334 -80C
SK5         SK5.pABR17.10   Pisum ESTs I  AG334 -80C
SK5         SK5.pABR18.2    Pisum ESTs I  AG334 -80C
SK5         SK5.pI39        Pisum ESTs I  AG334 -80C
SK5         SK5.pI49KS      Pisum ESTs I  AG334 -80C
SK5         SK5.pI176KS     Pisum ESTs I  AG334 -80C
SK5         SK5.pI225KS     Pisum ESTs I  AG334 -80C
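
The two views above can be expressed as SQL views over the same stored tables. The following is a minimal sketch with `sqlite3`; the schema (`species`, `storage`, `cell_stock`) and the view names are hypothetical, and the two rows are taken from the flat-file example earlier in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each fact is stored once, and views present it in different ways.
conn.executescript("""
CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE storage (storage_id INTEGER PRIMARY KEY, freezer TEXT, box TEXT);
CREATE TABLE cell_stock (
    name       TEXT PRIMARY KEY,
    plasmid    TEXT,
    experiment TEXT,
    species_id INTEGER REFERENCES species(species_id),
    storage_id INTEGER REFERENCES storage(storage_id)
);

CREATE VIEW plasmid_view AS
    SELECT cs.plasmid, sp.name AS species, cs.name AS cell_stock
    FROM cell_stock AS cs
    JOIN species AS sp ON sp.species_id = cs.species_id;

CREATE VIEW experiment_view AS
    SELECT cs.experiment, cs.name AS cell_stock, st.box, st.freezer
    FROM cell_stock AS cs
    JOIN storage AS st ON st.storage_id = cs.storage_id;
""")

conn.execute("INSERT INTO species VALUES (1, 'Escherichia coli')")
conn.execute("INSERT INTO storage VALUES (1, 'AG334 -80C', 'Pisum ESTs II')")
conn.executemany(
    "INSERT INTO cell_stock VALUES (?, ?, ?, ?, ?)",
    [("SK11.pEA215.3", "pEA215.3", "SK11", 1, 1),
     ("SK11.pEA46.2", "pEA46.2", "SK11", 1, 1)],
)

plasmid_rows = conn.execute(
    "SELECT * FROM plasmid_view ORDER BY cell_stock").fetchall()
experiment_rows = conn.execute(
    "SELECT * FROM experiment_view ORDER BY cell_stock").fetchall()

print(plasmid_rows)
print(experiment_rows)
```

Neither view stores any data of its own; both are computed from the same three tables, which is what makes many views possible without duplicating records.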
Distributed databases


From local to global attitude
Data appears to be in one location but is most
definitely not

A definition: Two or more data files in different
locations, periodically synchronized by the DBMS to
keep data in all locations consistent (A,B,C)

An intricate network for combining and sharing
information
Administrators praise fast network technologies!!!
Users praise the internet!!!


Data warehouse

Periodically, one imports data from databases and stores it
(locally) in the data warehouse.

Now a local database can be created, containing for instance
protein family data (sequence, structure, function and
pathway/process data integrated with the gene expression and
other experimental data).

Disadvantage: expensive, intensive, needs to be updated.

Advantage: easy control of integrated data-mining pipeline.
So why do biologists care?
Three main reasons
• Database proliferation
  - Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  - E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)
Biological databases

Like any other database


Data organization for optimal analysis
Data is of different types
• Raw data (DNA, RNA, protein sequences)
• Curated data (annotated DNA, RNA and protein sequences and structures, expression data)

Raw Biological data
Nucleic Acids (DNA)
Raw Biological data
Amino acid residues (proteins)
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Expression data
Mass spectrometry
Identify unknown compounds
Curated Biological Data
Proteins, residue sequences
Extended sequence information
MCTUYTCUYFSTYRCCTYFSCD
Mass spectrometry
(metabolomics, proteomics)
Secondary structure
Post-Translational protein
Modification (PTM)
Hydrophobicity, motif data
Protein-protein interaction
Curated Biological data
3D Structures, folds
Biological Databases
The NAR Database Issue: http://www.oxfordjournals.org/nar/database/c/
Distributed information

Pearson’s Law: The usefulness of a column
of data varies as the square of the number of
columns it is compared to.
A few biological databases







Nucleotide Databases
Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server,
Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries,
Parasites, Mutations, IMGT
Genome Databases
Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites
Protein Databases
Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome
Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases
PDB, MSD, FSSP, DALI
Microarray Database
ArrayExpress
Literature Databases
MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases
BAliBASE, Homstrad, FSSP
Structural Databases

Protein Data Bank (PDB)
http://www.rcsb.org/pdb/

Structural Classification of Proteins
(SCOP)
http://scop.berkeley.edu
http://scop.mrc-lmb.cam.ac.uk/scop/
PDB

3D Macromolecular structural data

Data originates from NMR or X-ray
crystallography techniques
Total number of structures: 48,891
(at the time of the lecture)
If the 3D structure of a protein is
solved ... they have it

PDB content
PDB information

The PDB files have a standard format

Key features

Informative descriptors
PDB-mirror on the WWW
e.g. 1AE5
Example output: 1AE5
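
Because PDB files use a fixed-column text format, coordinates can be read with simple string slicing. The sketch below reads the main fields of `ATOM` records. The two sample lines are hypothetical (invented coordinates, not taken from entry 1AE5), and the class and function names are assumptions for illustration; the column positions follow the published PDB format for `ATOM` records.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Read one ATOM record using the fixed column positions of the PDB format."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain=line[21],                # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue sequence number
        x=float(line[30:38]),          # columns 31-38: x coordinate (Angstrom)
        y=float(line[38:46]),          # columns 39-46: y coordinate
        z=float(line[46:54]),          # columns 47-54: z coordinate
    )


# Hypothetical two-atom fragment; the values are made up for illustration.
SAMPLE = """\
HEADER    HYPOTHETICAL EXAMPLE
ATOM      1  N   GLY A   1      12.345  -4.210   7.800  1.00 20.15           N
ATOM      2  CA  GLY A   1      13.100  -3.050   8.420  1.00 19.80           C
END
"""

atoms = [
    parse_atom_line(line)
    for line in SAMPLE.splitlines()
    if line.startswith("ATOM  ")
]

for atom in atoms:
    print(atom)
```

For real work an established parser (for example the one in Biopython) is the safer choice, since actual entries also contain `HETATM` records, alternate locations, multiple models and other cases this sketch ignores.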
Protein Structure Initiative (PSI)
Aims at determining the 3D structure of all proteins




Organize known protein sequences into families.
Select family representatives as targets.
Solve the 3D structure of targets by X-ray crystallography
or NMR spectroscopy.
Build models for other proteins by homology to solved 3D
structures.
+ many structures solved;
- many redundant structures (40%)
SCOP






Structural Classification Of Proteins
3D Macromolecular structural data grouped
based on structural classification
Data originates from the PDB
Current version (v1.73)
34494 PDB Entries (Feb 2008).
97178 Domains
SCOP levels bottom-up
1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this
means that pairwise residue identities between the proteins are 30% and greater. However, in
some cases similar functions and structures provide definitive evidence of common descent in
the absence of high sequence identity; for example, many globins form a family though some
members have sequence identities of only 15%.
2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features
suggest that a common evolutionary origin is probable are placed together in superfamilies.
For example, actin, the ATPase domain of the heat shock protein, and hexokinase together
form a superfamily.
3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary
structures in the same arrangement and with the same topological connections. Different
proteins with the same fold often have peripheral elements of secondary structure and turn
regions that differ in size and conformation. In some cases, these differing peripheral regions
may comprise half the structure. Proteins placed together in the same fold category may not
have a common evolutionary origin: the structural similarities could arise just from the physics
and chemistry of proteins favouring certain packing arrangements and chain topologies.
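
The 30% pairwise identity mentioned for the family level can be computed from a pairwise alignment. This is a sketch only: the function name, the two aligned sequences and the choice of denominator (aligned, non-gap positions) are assumptions, and other definitions of percent identity are in use. SCOP itself is curated by experts and does not assign families from a single threshold.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over the positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments, invented for illustration.
seq_a = "MKT-AYIAK"
seq_b = "MKSQAYLAK"

identity = percent_identity(seq_a, seq_b)
# Rule of thumb from the text above; a hint, not a classification.
above_family_guideline = identity >= 30.0

print(f"{identity:.1f}% identity, above the 30% guideline: {above_family_guideline}")
```

As the globin example shows, a pair below the guideline can still belong to one family when structure and function give clear evidence of common descent.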
SCOP-mirror on the WWW …
Enter SCOP at the top of the hierarchy
Keyword search of SCOP entries
CATH




Class, derived from secondary structure content, is
assigned for more than 90% of protein structures
automatically.
Architecture, which describes the gross orientation of
secondary structures, independent of connectivities, is
currently assigned manually.
The Topology level clusters structures according to their
topological connections and numbers of secondary
structures.
The Homologous superfamilies cluster proteins with
highly similar structures and functions. The
assignments of structures to topology families and
homologous superfamilies are made by sequence and
structure comparisons.
CATH-mirror on the WWW …
DSSP

Dictionary of secondary structure of proteins

The DSSP database comprises the
secondary structures of all PDB entries

DSSP is actually software that translates
the PDB structural co-ordinates into
secondary (standardized) structure
elements

A similar example is STRIDE
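
DSSP assigns one of several one-letter codes to every residue (for example H for alpha helix, E for extended strand, T for turn). Many analyses reduce these to three states: helix, strand and coil. The sketch below shows one commonly used reduction; the mapping is a convention chosen by the user rather than part of DSSP, and the input string is hypothetical, with "-" standing for residues that DSSP leaves blank.

```python
from itertools import groupby

# One common reduction (an assumption, other groupings exist):
# helices H, G, I -> H; strands E, B -> E; everything else -> C (coil).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", " ": "C", "-": "C",
}


def reduce_dssp(ss: str) -> str:
    """Map a string of DSSP codes to the three states H, E and C."""
    return "".join(DSSP_TO_3STATE.get(code, "C") for code in ss)


def segments(ss3: str):
    """Return (state, start_index, length) for each run of identical states."""
    result = []
    position = 0
    for state, run in groupby(ss3):
        length = len(list(run))
        result.append((state, position, length))
        position += length
    return result


# Hypothetical per-residue DSSP string, invented for illustration.
dssp_string = "--HHHHGGTT-EEEE-S"

three_state = reduce_dssp(dssp_string)
print(three_state)
print(segments(three_state))
```

The segment list gives the standardized secondary structure elements (which state, where it starts, how long it is) that downstream tools typically compare or predict.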
WHY bother???
Researchers create and use the data
• Use of known information for analyzing new data
• New data needs to be screened
• Structural/functional information
• Extends the knowledge and information to a higher level than DNA or protein sequences

In the end ….
Computers can figure out all kinds of
problems, except the things in the
world that just don't add up.
James Magary
We should add:
For that we employ the human brain,
experts and experience.
Bio-databases: A short word
on problems

Even today we face some key limitations
• There is no standard format
  - Every database or program has its own format
• There is no standard nomenclature
  - Every database has its own names
• Data is not fully optimized
  - Some datasets have missing information without any indication of it
• Data errors
  - Data is sometimes of poor quality, erroneous or misspelled
  - Error propagation resulting from computer annotation
What to take home

Databases are collections of data

They need to be accessed and maintained easily and
flexibly
Biological information is vast and
sometimes very redundant
Distributed databases bring it all together
with quality controls, cross-referencing and
standardization
Computers can only create data, they do
not give answers
Review suggestion: “Integrating biological
databases”, Stein, Nature Reviews Genetics 2003