Nucleotide Sequence Databases
Your guide to genes & genomes
Nucleotide Sequence Databases
• First generation
– GenBank is a representative example
– started as sort of a museum to preserve
knowledge of a sequence from first discovery
– great repositories, particularly for long-term study
of bioinformatic data
– flat files; not built for (and not great at) querying
Nucleotide Sequence Databases
• Second generation:
– Entrez gene is an example
– information is gene-centric (not just sequence-centric)
– all sequence information for a given gene can be
found in one place
Nucleotide Sequence Databases
• Third generation:
– Ensembl is a good example
– Information is organized
around whole genomes; not
only a specific gene’s
structure, but its context:
• position of this gene relative
to others
• strand orientation
• how gene relates to presence
or absence of biochemical
functions in organism
Prokaryotes (& Archaea)
• microscopic
organisms
• single cell
• no nucleus
• simple genome:
– single, circular DNA
molecule
– 600,000 – 8 million
base pairs
• 70% of genome
codes for proteins
Prokaryotes (& Archaea)
• genes generally don’t overlap
• no introns; mRNA is
collinear with gene
sequence
• protein sequences
derived by
translating longest
ORF (ATG to STOP)
spanning the gene transcript sequence
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
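The "translate the longest ORF" rule can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes the standard genetic code, scans only the forward strand, and the function name `longest_orf` is made up for this example.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def longest_orf(seq):
    """Return (start, end, protein) for the longest ATG..STOP ORF.

    Only the given strand is scanned (an assumption of this sketch).
    start is 0-based; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    best = (0, 0, "")
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if pos + 3 - start > best[1] - best[0]:
                    protein = "".join(
                        CODON_TABLE.get(seq[i:i + 3], "X")
                        for i in range(start, pos, 3)
                    )
                    best = (start, pos + 3, protein)
                start = None
    return best


# Toy sequence, invented for the example.
print(longest_orf("CCATGAAATTTTAAGG"))
```

Because prokaryotic mRNA is collinear with the gene, no splicing step is needed before translation; the same function would not be appropriate for an unspliced eukaryotic gene.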
Thought for today … (cartoon not reproduced; source: http://www.scicomics.com/uploads/prokaryote.jpg)
Eukaryotes
• way more complicated
– genes found in cell
nucleus
– genome size: 10 million to
670 billion base pairs
• much lower gene
density than
prokaryotes: in human
chromosomes, about
one gene for every
100,000 base pairs
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
Eukaryotes
• genome is much less compact
than in prokaryotes; less than
5% of human genome
codes for protein
• transcription starts
downstream of a promoter
region, but the process may
be strongly influenced by
sequence elements relatively
far away (e.g. enhancers)
source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/
Eukaryotes
• Gene sequences and
mRNA/protein sequences
not collinear; only exons
are retained in mature
mRNA that encodes
protein
• A single gene may (and
often does) give rise to more
than one mRNA and
protein form (e.g. through
alternative splicing)
GenBank
• First example: prokaryotic gene
– point your browser to:
http://www.ncbi.nlm.nih.gov/entrez
– choose Nucleotide from the Search pull-down
menu
– in For box, type X01714 and click Go
– Click the link labeled X01714
– Can “Send To Text” if you want to save the file
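The same record can also be retrieved without a browser. The sketch below assumes NCBI's E-utilities `efetch` endpoint and its `db`, `id`, `rettype` and `retmode` parameters; the helper names `efetch_url` and `fetch_record` are made up for this example, and the network call is shown but not run here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build the URL for one Nucleotide record ("gb" = GenBank flat file)."""
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,   # "gb" or "fasta"
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gb"):
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("X01714"))
```

Calling `fetch_record("X01714")` should return the same flat file that "Send To Text" saves, provided the service is reachable.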
GenBank fields
• LOCUS
– size of sequence (in base pairs)
– nature of molecule (e.g. DNA or RNA)
– topology (linear or circular)
• DEFINITION: brief description of gene
• ACCESSION: unique identifier for this (and
some other) databases
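• VERSION: accession number plus a version suffix
(accession.version); the suffix increases when the
sequence changes, and older records also list a GI number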
GenBank fields
• KEYWORDS: list of terms related to entry; can
be used for keyword searching for related data
• SOURCE: common name of relevant organism
• ORGANISM: complete id, with taxonomic
classification
– note that ORGANISM is indented under SOURCE;
this indicates that ORGANISM is a subordinate
term, or subsection, of SOURCE
GenBank fields
• REFERENCE: credits author(s) who initially
determined the sequence; includes
subsections:
– AUTHORS
– TITLE
– JOURNAL
– PUBMED
• COMMENT: free-formatted text that doesn’t
fit in another category
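The header layout described above (keyword on the left, subordinate keywords indented, text on the right) is regular enough to parse by column. The sketch below assumes the usual flat-file convention that the keyword occupies the first 12 columns and the value starts at column 13. The sample record is invented for illustration, and `row` and `parse_header` are hypothetical helper names.

```python
def row(key, value):
    """Format one flat-file line: keyword padded to 12 columns, then the value."""
    return key.ljust(12) + value


def parse_header(text):
    """Return a list of (keyword, value, is_subfield) up to the FEATURES table."""
    fields = []
    for line in text.splitlines():
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break
        key, value = line[:12], line[12:].strip()
        if key.strip():
            # An indented keyword (e.g. ORGANISM) is a subsection of the
            # preceding top-level keyword (e.g. SOURCE).
            fields.append([key.strip(), value, key.startswith(" ")])
        elif fields:
            # Blank keyword columns mean a continuation of the previous field.
            fields[-1][1] += " " + value
    return [tuple(field) for field in fields]


# Made-up record, not a real GenBank entry.
SAMPLE = "\n".join([
    row("LOCUS", "DEMO0001 120 bp DNA linear"),
    row("DEFINITION", "Made-up demo gene, complete"),
    row("", "cds."),
    row("SOURCE", "Escherichia coli"),
    row("  ORGANISM", "Escherichia coli"),
    row("FEATURES", "         Location/Qualifiers"),
])

for keyword, value, is_subfield in parse_header(SAMPLE):
    print(("  " if is_subfield else "") + keyword, "=>", value)
```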
GenBank fields
• FEATURES: table describing gene regions and
associated biological properties
– source: origin of specific regions of sequence; useful
for distinguishing cloning vectors from host sequences
– promoter: precise coordinates of promoter element in
the sequence; may be more than one of these
– misc_feature: in this example, indicates (putative)
location of transcription start (mRNA synthesis)
– RBS (ribosome binding site): location of last upstream
element
– CDS (CoDing Sequence): describes the ORF
GenBank fields: FEATURES: CDS
• gives coordinates from initial nucleotide (ATG)
to last nucleotide of stop codon (TAA)
• several lines follow, listing protein products,
reading frame to use, genetic code to apply
and several IDs for the protein sequence
• /translation section gives computer
translation of sequence into amino acid
sequence
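CDS coordinates are 1-based and inclusive, which is easy to get wrong when slicing a sequence in code. The sketch below handles only a simple `start..end` location on the forward strand; the function names and the toy sequence are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_cds(sequence, location):
    """Slice a simple 'start..end' location (1-based, inclusive) out of a sequence."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end].upper()


def looks_like_complete_cds(cds):
    """Check the properties described above: ATG first, stop codon last."""
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


toy = "ggATGAAATAAcc"               # invented sequence
cds = extract_cds(toy, "3..11")
print(cds, looks_like_complete_cds(cds))
```

Translating the extracted segment with the genetic code named in the record should reproduce the /translation text.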
Last Section: sequence itself
• This is the most important section in terms of
analysis using other tools
• Can isolate just this section and save the file, as
follows:
– Choose FASTA from the Display pull-down menu (top
of page)
– Choose Text in the Send To pull-down menu
– Use File/Save As to save the file
• use “Text” as file type
• give the file a name that you’ll know to associate with this
particular sequence
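Once the FASTA file is saved, reading it back takes only a few lines. The sketch below assumes the usual FASTA layout (a header line starting with ">", followed by sequence lines); `parse_fasta` is a hypothetical helper and the record shown is invented.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">demo made-up record\nATGC\nGGTA\n"
print(parse_fasta(example))
```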
Example 2: eukaryotic mRNA
• Can obtain this example by searching Nucleotide
database for U90223
• Similar to prokaryote example, because we’re looking
at a direct coding sequence for a protein – not DNA, in
other words
• Notes on example:
– KEYWORD field is empty: this is an example of an
incomplete annotation
– remember, you’re looking at a primary database!
– FEATURES field contains some new terms:
• sig_peptide: location of mitochondrial targeting sequence
• mat_peptide: exact boundaries of mature peptide
Example 3: Eukaryotic gene
• Can obtain this record by searching Nucleotide
for AF018430
• General information:
– LOCUS: same info as previous examples – note the
locus name is different from the accession number
this time
– DEFINITION: specifies exon; remember, protein-coding
regions in eukaryotes are not contiguous as in
prokaryotes
– SEGMENT: indicates this is the second of 4; you’d need
all 4 to reconstruct the mRNA that codes for the
protein
Eukaryotic gene: FEATURES section
• source subsection includes a /map section:
– indicates chromosome (15)
– arm (q means long arm)
– cytogenetic band (q21.1)
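The three parts of a map position can be pulled apart with a small pattern. This sketch handles only the simple form chromosome + arm + band (for example "15q21.1"); ranges and other notations are deliberately left unsupported, and `parse_map` is a made-up helper name.

```python
import re

# chromosome (number, X or Y), arm (p or q), band (e.g. 21.1)
MAP_PATTERN = re.compile(
    r"^(?P<chromosome>\d+|X|Y)(?P<arm>[pq])(?P<band>\d+(?:\.\d+)?)$"
)


def parse_map(value):
    """Split a simple cytogenetic position; return None if it is not recognized."""
    match = MAP_PATTERN.match(value.strip())
    if match is None:
        return None
    return {
        "chromosome": match.group("chromosome"),
        "arm": "short" if match.group("arm") == "p" else "long",
        "band": match.group("arm") + match.group("band"),
    }


print(parse_map("15q21.1"))
```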
Eukaryotic gene: FEATURES section
• gene subsection: describes how to reconstruct
the mRNAs found in this and separate entries:
– the strings that begin “AF” refer to the GenBank
entries (remember, this one was AF018430), and the
numbers represent the nucleotide positions from the
entries
– if a set of numbers (example: 1..1177) is NOT
preceded by an entry indicator, it’s from the current
entry
– The < and > signs mark a partial feature: the true
start or stop point lies beyond the position shown
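The reading rules above can be turned into a small parser. This sketch covers only a flat `join(...)` of `start..end` segments, each optionally prefixed with another entry's accession and optionally marked with < or >. The coordinates in the example are invented, not copied from AF018430, and `parse_join` is a hypothetical helper.

```python
import re

SEGMENT = re.compile(
    r"^(?:(?P<entry>[A-Za-z0-9_.]+):)?"   # optional "ACCESSION:" prefix
    r"(?P<partial_start><)?(?P<start>\d+)"
    r"\.\."
    r"(?P<partial_end>>)?(?P<end>\d+)$"
)


def parse_join(location):
    """Parse 'join(a..b,ACC:c..d,...)' into a list of segment dictionaries."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]
    segments = []
    for part in inner.split(","):
        match = SEGMENT.match(part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append({
            "entry": match.group("entry"),   # None means the current entry
            "start": int(match.group("start")),
            "end": int(match.group("end")),
            "partial_start": match.group("partial_start") is not None,
            "partial_end": match.group("partial_end") is not None,
        })
    return segments


# Invented coordinates for illustration only.
for segment in parse_join("join(AF018429:<1..20,1..1177,AF018431:1..>50)"):
    print(segment)
```

To rebuild the mRNA, each segment's sequence would be fetched from the named entry and the pieces concatenated in the order listed.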
Eukaryotic gene: FEATURES section
• mRNA section: can be read in a similar
manner to the gene section
• note that there are two mRNA sections (each
followed by a CDS section)
– first section describes the mRNA for the mitochondrial
form of the protein
– second section describes the mRNA for the nuclear form
• exon section: indicates position of exon(s) in
sequence
Retrieving GenBank entries without
accession numbers
• Search Nucleotide for specific product you’re
interested in; for example:
human[organism] AND dUTPase[Protein name]
– this search yields several entries; can click the Links link to
the right of one of these (AF018432) and choose Related
Sequences from the pull-down that appears
– retrieves several more entries, some DNA and some mRNA
– terms used in the titles of these entries can give us
additional search criteria:
human[organism] AND "dUTP pyrophosphatase"[Title]
– yields somewhat different set of entries
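Fielded queries like the two above can also be assembled and submitted programmatically. The sketch below assumes NCBI's E-utilities `esearch` endpoint with its `db`, `term` and `retmax` parameters; `build_term` and `esearch_url` are made-up helper names, and only the URL is built here (no request is sent).

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism, value, field):
    """Combine an organism limit with one fielded term, quoting phrases."""
    if " " in value:
        value = f'"{value}"'
    return f"{organism}[organism] AND {value}[{field}]"


def esearch_url(term, retmax=20):
    """Build a search URL against the Nucleotide database."""
    return ESEARCH + "?" + urlencode({"db": "nuccore", "term": term, "retmax": retmax})


first = build_term("human", "dUTPase", "Protein name")
second = build_term("human", "dUTP pyrophosphatase", "Title")
print(first)
print(second)
print(esearch_url(first))
```

The response lists record identifiers, which can then be retrieved one by one as with any accession number.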