- Cal State LA - Instructional Web Server

Sequence Databases – 21 June 2007
Learning objectives
Be able to describe how information is stored in GenBank.
Be able to read a GenBank flat file.
Be able to search GenBank for information.
Be able to explain the content difference between a header, features and sequence.
Be able to say what distinguishes a primary database from a secondary database.
Be able to access and navigate the ENTREZ platform for biological data analysis.
BIOSEQs – entry common to all sequence databases
BIOSEQ = Biological sequence
Central element in the NCBI database model.
Found in both the nucleotide and protein databases.
Comprises the sequence of a single continuous molecule of nucleic acid or protein.
Entry must have:
At least one sequence identifier (Seq-id)
Information on the physical type of molecule (DNA, RNA, or protein)
Descriptors, which describe the entire Bioseq
Annotations, which provide information regarding specific locations within the Bioseq
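The four required parts of a Bioseq entry can be pictured as a small data structure. This is only an illustrative sketch: the class and field names below are assumptions made for this example and are not the NCBI data model's own definitions.

```python
from dataclasses import dataclass, field
from typing import List

# Physical molecule types named on the slide.
MOLECULE_TYPES = {"DNA", "RNA", "protein"}


@dataclass
class Annotation:
    """Information about a specific location within the sequence."""
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    note: str


@dataclass
class Bioseq:
    """Hypothetical model of a Bioseq entry (names are illustrative)."""
    seq_ids: List[str]   # at least one Seq-id is required
    molecule: str        # "DNA", "RNA" or "protein"
    sequence: str        # one continuous molecule
    descriptors: List[str] = field(default_factory=list)        # describe the whole Bioseq
    annotations: List[Annotation] = field(default_factory=list)  # describe locations

    def __post_init__(self):
        if not self.seq_ids:
            raise ValueError("a Bioseq needs at least one Seq-id")
        if self.molecule not in MOLECULE_TYPES:
            raise ValueError(f"unknown molecule type: {self.molecule!r}")


example = Bioseq(
    seq_ids=["U49845.1"],
    molecule="DNA",
    sequence="gatcctccat",
    descriptors=["Saccharomyces cerevisiae"],
    annotations=[Annotation(1, 10, "illustrative region")],
)
```

An entry without a Seq-id, or with an unknown molecule type, is rejected when it is constructed.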
What is GenBank?
The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
Each record represents a single contiguous stretch of DNA or RNA.
DNA stretches may have more than one coding region (gene).
RNA sequences are presented with T, not U.
Records are generated from direct submissions by the investigators (authors) to the DNA sequence databases.
GenBank is part of the International Nucleotide Sequence Database Collaboration.
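Because RNA sequences are presented with T rather than U, an RNA sequence written with U has to be converted before it is compared with a GenBank record. A minimal sketch (the function name is an assumption for this example):

```python
def to_genbank_alphabet(rna: str) -> str:
    """Rewrite an RNA sequence with T in place of U, preserving case."""
    return rna.replace("U", "T").replace("u", "t")


print(to_genbank_alphabet("AUGGCUUAA"))  # prints ATGGCTTAA
```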
General Comments on GBFF
A GenBank flat file (GBFF) has three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence: each record ends with // on its last line.
A nucleic acid (DNA or RNA (cDNA)) sequence translated to an amino acid sequence is a "feature".
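The three sections can be separated by looking at the keyword that starts each line. The sketch below assumes the features section starts at the FEATURES line, the sequence section starts at BASE COUNT or ORIGIN, and the record stops at //. The function name and the shortened sample record are made up for illustration.

```python
def split_flat_file(record: str) -> dict:
    """Split one flat-file record into header, features and sequence lines."""
    header, features, sequence = [], [], []
    current = header  # everything before FEATURES belongs to the header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}


# Shortened excerpt, for illustration only.
SAMPLE = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
ORIGIN
        1 gatcctccat atacaacggt
//
"""

parts = split_flat_file(SAMPLE)
for name, lines in parts.items():
    print(name, len(lines))
```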
GenBank Flat File (MyoD1 as an example)
Feature Keys
Purpose:
1) Indicates the biological nature of the sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".
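A simple location and its qualifiers can be read mechanically. The sketch below handles only the plain start..end form and /key="value" qualifiers; the function names are assumptions for this example.

```python
import re


def parse_simple_location(location: str) -> tuple:
    """Parse a 'start..end' location into 1-based inclusive coordinates."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"not a simple start..end location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def parse_qualifier(text: str) -> tuple:
    """Split '/key="value"' into (key, value) with the quotes removed."""
    key, _, value = text.strip().lstrip("/").partition("=")
    return key, value.strip('"')


start, end = parse_simple_location("23..400")
length = end - start + 1  # both ends are included
print(start, end, length)
print(parse_qualifier('/gene="adhI"'))
```

Because both end points are included, the CDS at 23..400 spans 378 bases.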
Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
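Joining the indicated elements amounts to cutting each segment out of the sequence and concatenating the pieces in order. A minimal sketch, assuming only plain start..end segments inside join(...); the function names are illustrative.

```python
import re


def parse_join(location: str) -> list:
    """Parse 'join(a..b,c..d,...)' into a list of (start, end) pairs."""
    match = re.fullmatch(r"join\s*\((.*)\)", location.strip())
    if match is None:
        raise ValueError(f"not a join location: {location!r}")
    segments = []
    for part in match.group(1).split(","):
        start, end = part.strip().split("..")
        segments.append((int(start), int(end)))
    return segments


def joined_length(segments: list) -> int:
    """Total number of bases covered (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in segments)


def splice(sequence: str, segments: list) -> str:
    """Concatenate the segments, converting 1-based coordinates to slices."""
    return "".join(sequence[start - 1:end] for start, end in segments)


segments = parse_join("join(544..589,688..1032)")
print(segments, joined_length(segments))
```

For the location on the slide the two segments contribute 46 and 345 bases, 391 in total.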
(For MyoD1 – Accession number X61655)
Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
Annotations:
LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal), modification date (21-JUN-1999).
DEFINITION: "cds" = coding region.
ACCESSION: unique identifier (never changes).
VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in the sequence).
GI: GeneInfo identifier (changes whenever the sequence changes).
KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.
SOURCE: common name for the organism.
ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.
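The LOCUS and VERSION lines can be split on whitespace to recover the fields annotated above. This sketch assumes the seven-field LOCUS layout of this particular record (other records may carry extra fields), and the function names are made up for the example.

```python
def parse_locus(line: str) -> dict:
    """Parse a LOCUS line laid out like the SCU49845 example."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout")
    return {
        "locus_name": fields[1],
        "length_bp": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "modification_date": fields[6],
    }


def parse_version(line: str) -> dict:
    """Split 'VERSION accession.version GI:number' into its parts."""
    fields = line.split()
    accession, _, version = fields[1].rpartition(".")
    gi = fields[2].split(":", 1)[1] if len(fields) > 2 else None
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
)
version = parse_version("VERSION     U49845.1  GI:1293613")
print(locus)
print(version)
```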
Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA
Annotations:
MEDLINE: Medline UID.
REFERENCE 3: submitter of the sequence (always the last reference).
Record from GenBank (cont.2)
Each feature has three parts: a key (a keyword indicating the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
Annotations:
Keys: source, CDS.
Location: 1..5028.
Qualifiers: the entries beginning with "/", such as /chromosome and /map; the text after "=" is the value.
Database cross-refs: /db_xref.
<1..206: the 5' end of the coding sequence begins upstream of the first nucleotide of the sequence. The 3' end is complete.
/codon_start=3: start of the open reading frame.
/product="TCP1-beta": descriptive free text must be in quotation marks.
/protein_id="AAA98665.1": protein sequence ID number.
/translation: note that this is only a partial sequence.
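The effect of /codon_start=3 can be checked by translating the start of the record's sequence from its third base. The sketch below assumes the standard genetic code and uses only the first 60 bases of the ORIGIN section of this record; the function and variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, codon_start: int = 1) -> str:
    """Translate complete codons, starting at the 1-based codon_start."""
    dna = dna.upper()
    offset = codon_start - 1
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))  # X = unknown codon
    return "".join(protein)


# First 60 bases of U49845 as shown in the ORIGIN section.
FIRST_60 = "".join(
    "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg".split()
)

print(translate(FIRST_60, codon_start=3))  # prints SSIYNGISTSGLDLNNGTI
```

The output matches the beginning of the /translation qualifier of the TCP1-beta CDS.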
Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."
Annotations:
687..3158 and complement(3300..4037): further locations.
Both /translation values are cut off on the slide.
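A complement(...) location means the feature lies on the opposite strand, so the bases in that range are read as the reverse complement. A minimal sketch, assuming only the plain and complement forms of a location; the function names and the toy sequence are made up for illustration.

```python
import re

# Translation table for complementing DNA bases in either case.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def extract(sequence: str, location: str) -> str:
    """Return the bases for 'a..b' or 'complement(a..b)' (1-based, inclusive)."""
    match = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        return reverse_complement(sequence[start - 1:end])
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        return sequence[start - 1:end]
    raise ValueError(f"unsupported location: {location!r}")


toy = "aaacgtttt"  # toy sequence, not from the record
print(extract(toy, "4..6"))              # prints cgt
print(extract(toy, "complement(4..6)"))  # prints acg
```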
Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
       . . .
//
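The BASE COUNT line can be recomputed from the sequence lines. The sketch below reads ORIGIN-style lines, drops the position numbers and spacing, and counts each base. It runs on a 20-base excerpt only, so its counts are not the record's totals; the function names are assumptions for this example.

```python
from collections import Counter


def read_origin(lines: list) -> str:
    """Join the bases from ORIGIN-style lines, stopping at the // marker."""
    residues = []
    for line in lines:
        if line.startswith("//"):
            break
        if line.startswith("ORIGIN"):
            continue
        # Keep the 10-base blocks; skip position numbers and other tokens.
        residues.extend(token for token in line.split() if token.isalpha())
    return "".join(residues)


def base_count(sequence: str) -> dict:
    """Count occurrences of each base, as on the BASE COUNT line."""
    return dict(Counter(sequence.lower()))


excerpt = ["ORIGIN", "        1 gatcctccat atacaacggt", "//"]
sequence = read_origin(excerpt)
print(sequence)
print(base_count(sequence))
```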
Primary databases vs. Secondary databases
Primary database
comprises information submitted directly by the experimenter.
is called an archival database.

Secondary database
comprises information derived from a primary database.
is a curated database.
Types of primary databases carrying biological information
GenBank/EMBL/DDBJ
PDB-Three-dimensional structure coordinates of biological molecules
PROSITE-database of protein domain/function relationships.

http://www.expasy.org/prosite/
Types of secondary databases carrying biological information
dbSTS-Non-redundant db of sequence-tagged sites (useful for physical mapping)
Genome databases (there are over 20 genome databases that can be searched)
EPD:eukaryotic promoter database
 http://www.epd.isb-sib.ch/
NR-non-redundant GenBank+EMBL+DDBJ+PDB. Entries with
100% sequence identity are merged as one.
ProDom
 http://protein.toulouse.inra.fr/prodom/current/html/home.php
PRINTS
 http://bioinf.man.ac.uk/dbbrowser/PRINTS/
BLOCKS
 http://bioinformatics.weizmann.ac.il/blocks/
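The NR rule in the list above, where entries with 100% sequence identity are merged as one, can be sketched by grouping identifiers under their exact sequence. The identifiers and sequences below are invented for illustration, and this is not a description of how NCBI actually builds NR.

```python
from collections import defaultdict


def merge_identical(entries: list) -> dict:
    """Group identifiers whose sequences are exactly identical."""
    merged = defaultdict(list)
    for identifier, sequence in entries:
        merged[sequence.upper()].append(identifier)
    return dict(merged)


# Hypothetical entries from different source databases.
entries = [
    ("genbank|EX0001", "MKTAYIAK"),
    ("embl|EX0002", "MKTAYIAK"),  # identical to the first entry
    ("pdb|EX0003", "MKTAYIAR"),   # differs at one residue, so it stays separate
]

for sequence, identifiers in merge_identical(entries).items():
    print(sequence, identifiers)
```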
Secondary Databases
DNA databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•dbGSS (genome survey sequences)
•dbHTGS (high throughput)
•dbSTS (sequence tagged site)
•LocusLink
RNA (cDNA) databases derived from GenBank containing data for a single gene
•dbEST (expressed sequence tag)
•UniGene
•LocusLink
Protein databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•Swissprot
•PIR (Int'l. protein sequence)
•LocusLink
References for understanding the NCBI sequence database model
Here is the website for NCBI developer tools.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML
[Figure: DNA → RNA → PROTEIN. Labels: RNA processing; RNA, but NOT mRNA; mature mRNA.]