Database Searching
DB Web addresses
Orthology and paralogy
A practical approach
Searching the primaries
Searching the secondaries
Significance of database matches
Software Web addresses
Why Search Databases?
• To find out if a new DNA sequence is already
deposited in the databanks.
• To find proteins homologous to a putative
coding ORF.
Why Search Databases?
• To find similar non-coding DNA stretches in
the database (for example: repeat elements,
regulatory sequences).
• To locate false priming sites for a set of
PCR oligonucleotides.
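As a rough illustration of the last point, the sketch below scans both strands of a target sequence for sites where an oligonucleotide could anneal with at most a few mismatches. The function names, the mismatch threshold, and the toy sequences are assumptions made for this example. Real primer checks are normally run with a database search program and also weigh the 3' end of the primer, which this sketch ignores.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def priming_sites(primer, target, max_mismatches=1):
    """List (position, strand, mismatches) for every window of the target
    that matches the primer with at most max_mismatches substitutions.

    Position is the 0-based start of the window on the forward strand.
    """
    primer = primer.upper()
    target = target.upper()
    # What the forward strand looks like at a site on either strand:
    # the primer itself ("+") or its reverse complement ("-").
    probes = [("+", primer), ("-", reverse_complement(primer))]
    sites = []
    for start in range(len(target) - len(primer) + 1):
        window = target[start:start + len(primer)]
        for strand, probe in probes:
            mismatches = sum(1 for a, b in zip(window, probe) if a != b)
            if mismatches <= max_mismatches:
                sites.append((start, strand, mismatches))
    return sites
```

Calling `priming_sites("ACGTTG", "TTACGTTGAACCCAACGTGG", max_mismatches=0)` reports only the intended site on each strand. Raising `max_mismatches` to 1 also reports a near match at position 14, which is the kind of candidate false priming site a database search is meant to expose.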
What Databases Are Available?
• DNA (nucleotide sequences):
The big databases: GenBank, EMBL, DDBJ and
their weekly updates. These databases exchange
information routinely.
• Genomic databases such as Human (GDB),
Mouse (MGD), Yeast (SGD), etc.
• Special databases:
ESTs (expressed sequence tags)
STSs (sequence-tagged sites)
EPD (eukaryotic promoter database)
REPBASE (repetitive sequence database)
and many others.
What Databases Are Available?
• Protein (amino acid sequences):
The big databases are:
Swiss-Prot (high level of annotation)
PIR (protein identification resource)
• Translated databases like:
SPTREMBL (translated EMBL)
GenPept (translation of coding regions in
GenBank)
• Special databases like:
PDB (sequences derived from the 3D structures in
the Brookhaven PDB)
Web Addresses
• http://www.ncbi.nlm.nih.gov/Entrez/
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
– http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Let us go
http://www.ncbi.nlm.nih.gov/Entrez/
What is GenBank?
• http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
• GenBank® is the NIH genetic sequence
database, an annotated collection of all
publicly available DNA sequences …
Access to GenBank
• http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
• GenBank is available for searching at NCBI via
several methods.
• The GenBank database is designed to provide
and encourage access within the scientific
community to the most up-to-date and
comprehensive DNA sequence information.
Therefore, NCBI places no restrictions on the use
or distribution of the GenBank data.
NCBI databases
• http://www.ncbi.nlm.nih.gov/Database/index.html
Let us try a tutorial
http://www.ncbi.nlm.nih.gov/Database/tut1.html
Web Addresses
• http://www.ebi.ac.uk/Databases/
– http://www.ebi.ac.uk/embl/index.html
– http://www.ebi.ac.uk/swissprot/index.html
– http://www.ebi.ac.uk/microarray/ArrayExpress/arrayexpress.html
Homology and Analogy
It is important to understand a concept that
underpins sequence analysis: homology.
The term homology is confounded and
abused in the literature.
Simply, sequences are said to be
homologous if they are related by divergence
from a common ancestor.
What Is Homology?
(from the Technion course)
• Similarity or likeness between
properties in species.
• Before Darwin, homology was defined
morphologically:
• Example:
Homology
Bats and butterflies fly, but are different.
Bats fly and whales swim, yet the bones in
a bat's wing and a whale's flipper are
strikingly alike.
Bat wings and butterfly wings are not
homologous.
Bat wings and whale flippers are
homologous.
Homology Interpretation from Darwin to the 21st Century
• Darwin (1859) explained homology as
the result of descent with
modification from a common ancestor.
• Modern genetics: Homology
information is in the genes.
• Two sequences are homologous if they
are both similar and have a common
ancestor.
When Does Similarity Imply Homology?
• Similarity by itself is not enough: for
example, similarity between short sequences
could arise by chance (in sequences descended
from different ancestors).
• Large enough similarities typically imply
homology (and usually we do not have
direct evidence of descent).
• Sequence similarity comes with a
significance measure.
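To see why short matches carry little weight, consider a back-of-the-envelope model, assumed here purely for illustration: bases are independent and equally frequent, so one specific word of length k starts at any given position with probability (1/4)^k. The expected number of exact chance occurrences in a database of N bases is then roughly N x (1/4)^k. The function name and the database size are hypothetical, and the significance measures reported by real search programs rest on more careful statistics than this.

```python
def expected_chance_hits(word_length, database_bases, alphabet_size=4):
    """Expected number of exact occurrences of one specific word in random sequence.

    Assumes independent, equally frequent letters and ignores edge effects.
    """
    per_position = (1.0 / alphabet_size) ** word_length
    return database_bases * per_position
```

Under this model, with a hypothetical database of 10**9 bases, a 10-base word is expected to turn up by chance hundreds of times, while a 30-base word is expected far less than once. Only the longer match is evidence of anything beyond chance.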
Homology and Analogy
Understanding homology allows us to
appreciate the concept of analogy; this is
encountered in protein structures that share
similar folds but have no demonstrable
sequence similarity; or that share groups of
catalytic residues with almost exactly
equivalent spatial geometries, but otherwise
have neither sequence nor structural
similarity. Such relationships are thought to
result from convergence to similar biological
solutions from different evolutionary starting
points.
Homology and Analogy
The essence of sequence analysis is
the inference of homology.
Homology is not a measure of similarity, but
an absolute statement that sequences have a
divergent rather than a convergent
relationship.
Thus, phrases that quantify homology are
meaningless.
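What can legitimately be quantified is similarity. As a minimal sketch, the function below computes percent identity for two sequences that are assumed to be already aligned (same length, "-" marking gaps). The function name and the toy alignment in the note are invented for this example, and skipping gapped columns is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (ungapped) columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # columns containing a gap are skipped
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared
```

For the toy alignment `percent_identity("ACGT-A", "ACCTTA")`, four of the five ungapped columns agree. That figure describes similarity; whether the two sequences are homologous remains a yes-or-no inference drawn from it.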
Orthology and Paralogy
Homologous proteins may perform the same
function in different species (orthologues) or
different but related functions within one
organism (paralogues).
Comparison of orthologues allows study of
molecular palaeontology, while paralogues
have provided deeper insights into the
underlying mechanisms of evolution.
Orthology and Paralogy
Paralogues arose from single genes via
successive duplication events.
The duplicated genes followed separate
evolutionary pathways, and new specificities
evolved through variation and adaptation.
Complete genomes
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
• Let us walk around among genomes
COGs
Phylogenetic classification of proteins encoded in complete genomes
Clusters of Orthologous Groups of proteins (COGs) were
delineated by comparing protein sequences encoded in 43
complete genomes, representing 30 major phylogenetic
lineages. Each COG consists of individual proteins or
groups of paralogs from at least 3 lineages and thus
corresponds to an ancient conserved domain. Proteins
from two eukaryotic genomes (Drosophila melanogaster
and Caenorhabditis elegans) were assigned to COGs and
can be reached from each individual COG page.
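The "at least 3 lineages" criterion can be pictured with a small check. This is only a sketch of that one rule, not the procedure used to build COGs. The function name, the protein identifiers, and the lineage labels are invented for the example.

```python
def spans_enough_lineages(cluster, min_lineages=3):
    """True if the proteins in a cluster come from at least min_lineages lineages.

    cluster is a list of (protein_id, lineage) tuples; several proteins from
    the same lineage (for example paralogs) count as one lineage.
    """
    lineages = {lineage for _protein_id, lineage in cluster}
    return len(lineages) >= min_lineages


# Hypothetical cluster: two paralogs from lineage A plus one protein each from B and C.
example_cluster = [
    ("protA1", "lineage A"),
    ("protA2", "lineage A"),
    ("protB1", "lineage B"),
    ("protC1", "lineage C"),
]
```

Here `spans_enough_lineages(example_cluster)` is true because three distinct lineages are represented, even though one of them contributes two proteins.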
COGs
• http://www.ncbi.nlm.nih.gov/COG/
• Cognitor
• http://www.ncbi.nlm.nih.gov/COG/xognitor.html
• COG Help
• http://www.ncbi.nlm.nih.gov/COG/COGhelp.html#top
»FTP
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycobacterium_leprae/