Comparative Analysis


Creating NCBI
• The late Senator Claude Pepper recognized
the importance of computerized information
processing methods for the conduct of
biomedical research and sponsored
legislation that established the National
Center for Biotechnology Information
(NCBI) on November 4, 1988, as a division
of the National Library of Medicine (NLM) at
the National Institutes of Health (NIH).
What does NCBI do?
• Established in 1988 as a national resource
for molecular biology information, NCBI
creates public databases, conducts
research in computational biology, develops
software tools for analyzing genome data,
and disseminates biomedical information, all for the better understanding of molecular
processes affecting human health and
disease.
OMIM, Online Mendelian
Inheritance in Man.
• This database is a catalog of human genes and
genetic disorders authored and edited by Dr.
Victor A. McKusick and his colleagues at Johns
Hopkins and elsewhere, and developed for the
World Wide Web by NCBI, the National Center
for Biotechnology Information. The database
contains textual information and references. It also
contains copious links to MEDLINE and sequence
records in the Entrez system, and links to
additional related resources at NCBI and
elsewhere.
Entrez is a search and retrieval
system that integrates information
from databases at NCBI.
GenBank is the NIH genetic
sequence database.
• GenBank (at NCBI), together with the DNA Data
Bank of Japan (DDBJ) and the European Molecular
Biology Laboratory (EMBL), makes up the
International Nucleotide Sequence Database
Collaboration. These three organizations exchange
data on a daily basis.
GenBank grows at an exponential rate, with the
number of nucleotide bases doubling approximately
every 14 months. Currently, GenBank contains
more than 28 billion bases from over 250,000
species.
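The doubling figure above corresponds to a simple exponential model, N(t) = N0 * 2^(t / 14) with t in months. The sketch below only illustrates that arithmetic: it assumes the 14-month doubling time keeps holding, which the slide does not claim, and the function name `projected_bases` is made up for this example.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 14.0) -> float:
    """Project database size, ASSUMING steady exponential growth.

    doubling_months defaults to the 14-month figure quoted on the slide.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


# Starting point: "more than 28 billion bases", as quoted on the slide.
start = 28e9
for months in (0, 14, 28, 42):
    size = projected_bases(start, months)
    print(f"{months:>2} months ahead: about {size:.3g} bases")
```

Each 14-month step doubles the total, so three steps give an eightfold increase over the starting figure.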
PubMed
• PubMed, a service of the National
Library of Medicine, provides access to
over 12 million MEDLINE citations dating
back to the mid-1960s, as well as to
additional life science journals. PubMed
includes links to many sites providing
full-text articles and other related resources.
What is BLAST?
BLAST® (Basic Local Alignment Search Tool) is a set of
similarity search programs designed to explore all of the
available sequence databases regardless of whether the query
is protein or DNA. The BLAST programs have been designed
for speed, with a minimal sacrifice of sensitivity to distant
sequence relationships. The scores assigned in a BLAST
search have a well-defined statistical interpretation, making
real matches easier to distinguish from random background
hits. BLAST uses a heuristic algorithm which seeks local as
opposed to global alignments and is therefore able to detect
relationships among sequences which share only isolated
regions of similarity (Altschul et al., 1990).
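To make "local as opposed to global" concrete, here is a minimal sketch of local alignment scoring using the Smith-Waterman dynamic programming method. This is not BLAST's algorithm: BLAST uses a faster heuristic, while Smith-Waterman is the exact method it approximates. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score (Smith-Waterman, linear gaps).

    Scoring parameters are illustrative, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            curr[j] = max(0,
                          prev[j - 1] + step,   # align a[i-1] with b[j-1]
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only an isolated region (ACGTACG).
print(local_alignment_score("TTTTACGTACGTTTTT", "GGGGACGTACGGGG"))
```

Because scores never drop below zero, the unrelated flanking regions do not drag down the score of the shared region, whereas a global alignment would be penalized for them.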
Comparative Analysis
Darwin’s comparison of morphological
features of the Galapagos finches led him
to postulate the theory of natural
selection. When you compare the
sequences of genes and proteins, you are
performing the same type of analysis, just
at another level.
The most common type of
comparative method is
sequence alignment.
• Comparison of one sequence to the
entire database of known sequences
is an important discovery technique
for molecular biologists.
Explain:
• “One goal of sequence alignment is to enable the
researcher to determine whether two sequences display
sufficient similarity such that an inference of homology
is justified.”
• Similarity = an observable quantity, often expressed as
% identity.
• Homology = ? (Hint: there are no degrees of homology.)
• BLAST tutorial: Introduction
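Since similarity is an observable quantity, it can be computed directly from an alignment. The sketch below calculates % identity for two already-aligned sequences, with "-" marking gaps. It assumes one common convention, identical columns divided by alignment length; other conventions use a different denominator. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two aligned sequences ('-' marks a gap).

    ASSUMED convention: identical columns / alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Four of the six aligned columns are identical.
print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Note that the number measures similarity only. Whether it justifies an inference of homology is a separate judgment.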
Questions that might be
answered from a BLAST
search
• 1. How long is the sequence that you used
to search the database?
• 2. What is the most likely identity of this
sequence? What data supports this
conclusion?
• 3. What organism is the source of the
sequence? What is the common name for
this organism?
Questions that might be
answered from a BLAST
search
• What phylum contains this organism?
• What is the accession number for this sequence?
• Is this sequence expressed? How do you know?
• If your sequence is expressed, where (tissue) and
when is it expressed?
• Is anything known about factors that cause your
sequence to be expressed?
/BlastTutorial/
What is the difference between
RefSeq and GenBank?
• The GenBank archival sequence database includes
publicly available DNA sequences submitted from
individual laboratories and large-scale sequencing
projects. GenBank accession numbers are assigned
to these submitted sequences. Submitted sequence
data is exchanged among NCBI's GenBank, the
EMBL Data Library (EMBL), and the DNA Data
Bank of Japan (DDBJ) to achieve comprehensive
worldwide coverage. As an archival database,
GenBank can be very redundant for some loci.
GenBank sequence records are owned by the
original submitter and cannot be altered by a third
party.
What is the difference between
RefSeq and GenBank?
• RefSeq sequences are derived from GenBank and provide
non-redundant curated data representing our current
knowledge of known genes. Some records include
additional sequence information that was never submitted to
an archival database but is available in the literature. Some
sequence records are provided through collaboration; the
underlying primary sequence data is available in GenBank,
but may not be available in any one GenBank record.
RefSeq sequences are not submitted primary sequences.
RefSeq records are owned by NCBI and therefore can be
updated as needed to maintain current annotation or to
incorporate additional sequence information.
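In practice, the two kinds of record can often be told apart by the accession number alone. RefSeq accessions begin with a two-letter prefix followed by an underscore (for example NM_ for mRNA or NP_ for protein), and GenBank accessions contain no underscore. The sketch below is a rough heuristic built on that format, not an official validator. The function name and pattern are assumptions, and the two accessions are used only as examples.

```python
import re

# ASSUMED heuristic: two capital letters, an underscore, an alphanumeric
# body, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def looks_like_refseq(accession: str) -> bool:
    """Return True if the accession has the RefSeq 'XX_' prefix format."""
    return bool(REFSEQ_PATTERN.match(accession.strip()))


for acc in ("NM_000546.6", "U49845.1"):
    kind = "RefSeq-style" if looks_like_refseq(acc) else "GenBank-style"
    print(f"{acc}: {kind}")
```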
UniGene
• UniGene is an experimental system for
automatically partitioning GenBank
sequences into a non-redundant set of
gene-oriented clusters. Each UniGene
cluster contains sequences that
represent a unique gene, as well as
related information such as the tissue
types in which the gene has been
expressed and map location.
The End