Biological Databases

Biology outside the lab
Why do we need Bioinformatics?
Over the past few decades, major
advances in the field of molecular
biology, coupled with advances in
genomic technologies, have led
to an explosive growth in the
biological information generated
by the scientific community. This
deluge of genomic information
has, in turn, led to an absolute
requirement for computerized
databases to store, organize, and
index the data and for specialized
tools to view and analyze the
data.
Information flux from data to decision
Biology, chemistry, and
pharmaceutical research generate a
huge amount of data. Data are
analyzed more slowly than they are
produced.
Human Genome Project:
22.1 billion bases sequenced, but …
what do we really know about it?
Bioinformatics
- Building and managing biological databases (nucleotides, proteins, structures,
small molecules, pathways, literature, …)
- Data mining and data analysis (Computational Biology)
- Ab initio protein modelling, homology modelling, simulations (Molecular
Modeling)
Literature databases
http://www.ncbi.nlm.nih.gov/
Nucleotide databases
Protein databases

UniProt databases:
- Swiss-Prot: provides a high level of annotation, a minimal level of redundancy,
and a high level of integration with other databases.
- TrEMBL: a computer-annotated supplement to Swiss-Prot that contains all the
translations of EMBL nucleotide sequence entries not yet integrated into Swiss-Prot.
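The two sections can be told apart in sequence files: in UniProtKB FASTA headers the first field is `sp` for Swiss-Prot entries and `tr` for TrEMBL entries. The following is a minimal sketch of reading that field; the function name `classify_uniprot_header` is hypothetical, and the TrEMBL header in the usage lines is a made-up placeholder, not a real entry.

```python
def classify_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header into section, accession and entry name.

    Assumes the usual layout '>db|ACCESSION|ENTRY_NAME description...'.
    """
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Split into at most three fields: db code, accession, remainder.
    db_code, accession, rest = header[1:].split("|", 2)
    sections = {"sp": "Swiss-Prot", "tr": "TrEMBL"}
    entry_name = rest.split(" ", 1)[0]
    return {
        "section": sections.get(db_code, "unknown"),
        "accession": accession,
        "entry_name": entry_name,
    }


# Usage: one reviewed entry and one placeholder unreviewed header.
reviewed = classify_uniprot_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha")
unreviewed = classify_uniprot_header(">tr|PLACEHOLDER|PLACEHOLDER_EXAMPLE Uncharacterized protein")
print(reviewed["section"], unreviewed["section"])
```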

NCBI Protein database (a meta-database containing sequences
from UniProt entries, PDB-derived sequences, and translations of predicted
ORFs in GenBank)
Structural Database
Protein structures obtained by crystallography or
NMR are stored in PDB.
Microarray Databases
GEO (Gene Expression Omnibus)
SMD (Stanford Microarray Database)
Gene expression databases provide raw
microarray expression data.
Data from different experiments can
be merged to obtain previously unidentified
results.
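As a rough illustration of merging data from different experiments, the sketch below joins per-gene expression values on the gene identifier and keeps only the genes measured in every experiment. The function name `merge_expression`, the gene names, and all values are hypothetical toy data. Real microarray data would also need normalization before values from different experiments can be compared.

```python
def merge_expression(*experiments):
    """Join experiments on gene identifier.

    Each experiment is a dict mapping gene id -> expression value.
    Returns gene id -> list of values, one per experiment, restricted
    to the genes present in all experiments.
    """
    if not experiments:
        return {}
    shared = set(experiments[0])
    for exp in experiments[1:]:
        shared &= set(exp)
    return {gene: [exp[gene] for exp in experiments] for gene in sorted(shared)}


# Toy values, invented for illustration only.
experiment_1 = {"geneA": 1.0, "geneB": 2.0}
experiment_2 = {"geneB": 3.0, "geneC": 4.0}
print(merge_expression(experiment_1, experiment_2))
```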
EST Databases

EST: Expressed Sequence Tags
5’ EST: generated from the 5' end of a transcript, which
usually falls within the coding region. These regions tend
to be conserved across species and do not change much
within a gene family.
3’ EST: Because these ESTs are generated
from the 3' end of a transcript, they are likely
to fall within non-coding, or untranslated
regions (UTRs), and therefore tend to exhibit
less cross-species conservation than do
coding sequences.
Sequence Tagged Site (STS): helps to
locate a gene in the genome. 3’ ESTs are a
good source of STSs.
Available DBs:
GenBank, dbEST, UniGene
Tools
ORF finder
BLAST
Multiple alignment
Conserved domain identification
Secondary structure and folding prediction
Example 1
A clone containing a recombinant plasmid shows an
interesting phenotype
-> Sequencing: raw sequence
-> ORF identification: in-frame sequence (CDS)
-> BLAST: phylogenetically similar sequences, conserved domains
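The ORF identification step of Example 1 can be sketched as a six-frame scan for stretches that start with ATG and end at the first in-frame stop codon. This is a simplified illustration, not the algorithm of any particular ORF finder tool. The names `find_orfs` and `min_codons` are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end, orf) tuples. start and end are
    0-based offsets on the scanned strand; end is exclusive and the ORF
    includes its stop codon. min_codons counts start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through complete codons only.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


# Toy sequence, invented for illustration.
print(find_orfs("CCATGAAATTTTAGGG"))
# → [('+', 2, 2, 14, 'ATGAAATTTTAG')]
```

The in-frame sequence returned here is what would then be translated and submitted to BLAST in the last step of the example.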
Example 2
Tune the method
a) Increase the window size used to evaluate the score
- increases local information by integrating “environmental” data
- 2-residue window -> 2 frames
- 3-residue window -> 3 frames
- ….
b) Use degenerate matching methods (based on size, polarity, H-bond
behavior, …)
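Both tuning options can be sketched for a simple pairwise comparison of two protein sequences. The slides do not state which method is being tuned, so a dot-plot style comparison is assumed here. The residue classes below are a common textbook grouping chosen for illustration, not taken from the slides, and the names `window_score`, `dot_matrix` and `threshold` are hypothetical.

```python
# Illustrative residue classes (an assumption, not from the slides).
RESIDUE_CLASSES = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}
CLASS_OF = {aa: name for name, group in RESIDUE_CLASSES.items() for aa in group}


def residues_match(a, b, degenerate=False):
    """Exact match, or same residue class when degenerate matching is on."""
    if a == b:
        return True
    return degenerate and a in CLASS_OF and CLASS_OF[a] == CLASS_OF.get(b)


def window_score(seq1, seq2, i, j, window=1, degenerate=False):
    """Count matching positions in a window starting at seq1[i] and seq2[j]."""
    return sum(residues_match(seq1[i + k], seq2[j + k], degenerate)
               for k in range(window))


def dot_matrix(seq1, seq2, window=1, threshold=None, degenerate=False):
    """Boolean matrix: True where the windowed score reaches the threshold."""
    if threshold is None:
        threshold = window  # by default require every position to match
    rows = len(seq1) - window + 1
    cols = len(seq2) - window + 1
    return [[window_score(seq1, seq2, i, j, window, degenerate) >= threshold
             for j in range(cols)]
            for i in range(rows)]


# Exact matching finds nothing; class-based matching scores all three positions.
print(window_score("LKD", "IRE", 0, 0, window=3))                   # → 0
print(window_score("LKD", "IRE", 0, 0, window=3, degenerate=True))  # → 3
```

A larger window suppresses isolated chance matches, while degenerate matching recovers similarity between residues with comparable properties.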