NCBI Molecular Biology Resources
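The International Sequence Database Collaboration
[Diagram: three partner databases exchange submissions and updates with one another.]
• GenBank (NCBI, NIH): submissions via Sequin and BankIt; retrieval via Entrez
• EMBL (EBI): submissions via WebIn; retrieval via SRS
• DDBJ (CIB, NIG): submissions via SAKURA; retrieval via getentry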
Primary vs. Derivative Databases
[Diagram: data flow from primary to derivative databases.]
• GenBank (primary): sequences from sequencing centers and labs; updated ONLY by submitters; divisions INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT, EST, STS, GSS, HTG
• UniGene, UniSTS, RefSeq (derivative): built by algorithms and curators; updated continually by NCBI
• RefSeq sources: Annotation Pipeline; LocusLink and Genomes Pipelines
Types of Databases
Primary Databases
Original submissions by experimentalists
Remember biology's Central Dogma: DNA → RNA → protein.
Primary refers to one-dimensional 'symbol' information written in
sequential order, as needed to specify a particular biological
molecular entity, be it polypeptide or polynucleotide.
Content controlled by the submitter
• Examples: GenBank, SNP, GEO, PubChem Substance
Derivative Databases
Built from primary data
Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, Protein,
Structure, Conserved Domain, PubChem Compound
What is Entrez?
Entrez Global Query is an integrated search and retrieval
system for the databases of the National Center for Biotechnology
Information (NCBI).
It provides access to all NCBI databases simultaneously with a
single query string and user interface.
It supports Boolean operators and search term tags that limit parts
of the search statement to particular fields.
It returns a unified results page that shows the number of
hits for the search in each database; each count also
links to the actual search results for that database.
A text search / retrieval engine, NOT A DATABASE.
A tool for finding biologically linked data.
A virtual workspace for manipulating large datasets.
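As a minimal sketch of the field tags and Boolean operators described above, the snippet below builds a query string and the matching request URL. It assumes NCBI's E-utilities ESearch endpoint, which the slides do not mention; the helper names (field_term, build_query, esearch_url) are hypothetical, and the URL is only built, not sent.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def field_term(value, field):
    """Return one search term limited to a field, e.g. 'Homo sapiens[organism]'."""
    return f"{value}[{field}]"

def build_query(*terms, operator="AND"):
    """Join field-limited terms with an upper-case Boolean operator."""
    return f" {operator} ".join(terms)

def esearch_url(db, query, retmax=20):
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})

query = build_query(
    field_term("Homo sapiens", "organism"),
    field_term("thyroid peroxidase", "title"),
)
print(query)
print(esearch_url("nucleotide", query))
```

The same query string can be pasted into the Entrez search box; the URL form is only useful for scripted access.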
Entrez Databases
Each record is assigned a UID
• unique integer identifier for internal tracking
• GI number for Nucleotide
Each record is given a Document Summary
• a summary of the record's content (DocSum)
Each record is assigned links to biologically related UIDs
Each record is indexed by data fields
• [author], [title], [organism], and many others
The Entrez System
[Diagram: Entrez at the center, linking the databases below.]
Journals, UniGene, Books, PubMed Central, SNP, PubMed, UniSTS,
Nucleotide, Protein, PopSet, ProbeSet, Genome, Structure, Taxonomy,
CDD, 3D Domains, OMIM
The Entrez System: Text Searches
Entrez Taxonomy
The backbone of NCBI
[organism]
An Entrez Database - Nucleotide
GenBank: Primary Data (97.9%)
original submissions by experimentalists
submitters retain editorial control of
records
archival in nature
RefSeq: Derivative Data (2.1%)
curated by NCBI staff
NCBI retains editorial control of records
record content is updated continually
What is GenBank?
NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
Each record is assigned a stable accession number
GenBank Data
Direct submissions (traditional records )
Batch submissions (EST, GSS, STS)
ftp accounts (genome data)
Three collaborating databases
GenBank
DNA Database of Japan (DDBJ)
European Molecular Biology Laboratory (EMBL) Database
A Traditional GenBank Record
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
1921 aaaaaaaaaa a
//
The Flatfile Format
Header
Feature Table
Sequence
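The three-part layout can be illustrated with a short sketch that splits one record into header, feature table, and sequence. It assumes the standard flatfile markers (FEATURES, ORIGIN, and the closing //); the function name split_flatfile is hypothetical, and the sample is an abbreviated excerpt of the AY182241 record, not the full entry.

```python
def split_flatfile(text):
    """Split one GenBank flatfile record into header, feature table, sequence."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table starts here
            section = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            section = sequence
            continue
        section.append(line)
    # Sequence lines carry a position counter and blanks; keep letters only.
    bases = "".join(ch for ln in sequence for ch in ln if ch.isalpha())
    return "\n".join(header), "\n".join(features), bases

# Abbreviated excerpt of AY182241, for illustration only.
record = """\
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
FEATURES             Location/Qualifiers
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
//
"""
header, features, bases = split_flatfile(record)
print(header.splitlines()[0].split()[1])  # accession taken from the LOCUS line
print(len(features.splitlines()))         # number of feature-table lines
print(bases[:10], len(bases))
```

For real work a maintained parser is the safer choice; this sketch only shows where the section boundaries fall.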
An Example Record – M17755
Indexing for Nucleotide UID 4680720
Field                  Indexed Terms
[primary accession]    M17755
[title]                Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]             Homo sapiens
[sequence length]      3060
[modification date]    1999/04/26
[properties]           biomol mrna; gbdiv pri; srcdb genbank
M17755: Feature Table
[Annotated record: callouts mark the indexed feature-table terms.]
• TPO: [gene name]
• thyroiditis: [text word]
• thyroid peroxidase: [protein name]
• CDS position in bp
• protein accession
Sequence: 99.99% Accurate
The sequence itself
is not indexed…
Use BLAST for that!
RefSeq: NCBI’s Derivative Sequence Database
The RefSeq database is a collection of taxonomically diverse, non-redundant, and richly annotated sequences representing naturally
occurring molecules of DNA, RNA, and protein.
Non-redundant nucleotide and protein sequences from plasmids,
organelles, viruses, archaea, bacteria, and eukaryotes.
Updated to reflect current sequence data and biology
Each RefSeq is constructed wholly from sequence data submitted
to the International Nucleotide Sequence Database Collaboration
(INSDC).
Similar to a review article, a RefSeq integrates information across
multiple sources at a given time, and so provides a foundation for
uniting sequence data with genetic and functional information.
They are generated to provide reference standards for multiple
purposes ranging from genome annotation to reporting locations
of sequence variation in medical records.
Common RefSeq accession prefixes
Accession prefix    Molecular type
NC_                 Complete genomic molecule (chromosome; microbial or organelle genome)
NT_                 Genomic contig
NM_                 Curated mRNA
XM_                 mRNA (computed)
NP_                 Curated protein
XP_                 Protein (computed)
NR_                 Curated RNA
XR_                 RNA (computed)
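The prefix table lends itself to a simple lookup. The sketch below is illustrative only: the function name classify_accession and the returned dictionary layout are hypothetical, and the mapping covers just the eight prefixes listed on this slide.

```python
# Prefix-to-type mapping copied from the slide; other prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NT_": "Genomic contig",
    "NM_": "Curated mRNA",
    "XM_": "mRNA (computed)",
    "NP_": "Curated protein",
    "XP_": "Protein (computed)",
    "NR_": "Curated RNA",
    "XR_": "RNA (computed)",
}

def classify_accession(accession):
    """Split 'accession.version' and look up the RefSeq molecule type, if any."""
    base, _, version = accession.strip().partition(".")
    return {
        "accession": base,
        "version": int(version) if version.isdigit() else None,
        # None means the prefix is not one of the listed RefSeq prefixes.
        "refseq_type": REFSEQ_PREFIXES.get(base[:3]),
    }

for acc in ("NM_000537", "AY182241.2", "M17755"):
    print(classify_accession(acc))
```

Accessions such as AY182241 and M17755 have no underscore prefix, which is a quick visual cue that they are GenBank records and not RefSeqs.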
Entrez Gene and RefSeq
GenBank
RefSeq
Gene
Nucleotide
• Entrez Gene is the central depository for information about a gene
available at NCBI, and often provides links to sites beyond NCBI
• Entrez Gene includes records for organisms that have NCBI Reference
Sequences (RefSeqs)
• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic
DNA (if known) for a gene locus, plus links to other Entrez databases
Entrez Gene: RefSeq Annotations
Beyond RefSeq
If your organism does not have RefSeqs…
UniGene : gene-based clusters of cDNAs and ESTs
WGS sequences in Entrez Nucleotide (wgs[prop])
Trace Archive
What is UniGene?
A gene-oriented view of sequence entries
•MegaBlast based automated sequence clustering
•Now informed by genome hits
•Nonredundant set of gene oriented clusters
•Each cluster a unique gene
•Information on tissue types and map locations
•Includes known genes and uncharacterized ESTs
•Useful for gene discovery and selection of
mapping reagents
Organisms in UniGene
Top Ten
1. Human
2. Rice
3. Mouse
4. Cow
5. Wheat
6. Zebrafish
7. Pig
8. Chicken
9. Frog (X. laevis)
10. Frog (X. tropicalis)
Finding UniGene Clusters
by link
by Entrez search
UniGene Cluster for TPO
Entrez Protein
Source                           Records
GenPept (DDBJ, EMBL, GenBank)    4,444,405
RefSeq                           1,753,167
PIR                                222,395
Swiss Prot                         189,005
PDB                                 68,621
PRF                                 12,079
Third Party Annotation               4,219
Total                            6,693,891
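As a quick arithmetic check, the seven source counts on this slide add up to the stated total. The sketch below assumes each count pairs with its source in the order the slide lists them (an inference from the flattened layout), and prints each source's share of Entrez Protein.

```python
# Counts as read off the slide; the source-to-count pairing is an assumption.
counts = {
    "GenPept (DDBJ, EMBL, GenBank)": 4_444_405,
    "RefSeq": 1_753_167,
    "PIR": 222_395,
    "Swiss Prot": 189_005,
    "PDB": 68_621,
    "PRF": 12_079,
    "Third Party Annotation": 4_219,
}
stated_total = 6_693_891

total = sum(counts.values())
print("sum of sources:", f"{total:,}", "matches stated total:", total == stated_total)

# Share of each source, largest first.
for source, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{source:<32}{n:>12,}{n / total:>8.1%}")
```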
Protein Sources and Links
[Diagram: protein records for TPO and the nucleotide record each one links to.]
• PIR: no mRNA!
• RefSeq: NM_000537
• SWISS-PROT: no mRNA!
• GenPept: M17755
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,000
journals published world-wide. It has 12 million records dating back to 1966.
To impose uniformity and consistency on the indexing of biomedical
literature, the MeSH vocabulary is used for indexing journal articles for MEDLINE.
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used for subject analysis of
biomedical literature at NLM.
Taxonomy Browser is…
• browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• Vector Alignment Search Tool (VAST)
OMIM is…
•Online Mendelian Inheritance in Man
•catalog of human genes and genetic disorders
•edited by Dr. Victor McKusick, others at JHU
General Protein Databases
SWISS-PROT
• Manually curated
• High-quality annotations, less data
GenPept/TrEMBL
• Translated coding sequences from GenBank/EMBL
• Few annotations, more up to date
PIR
• Phylogenetic-based annotations
All 3 now combining efforts to form UniProt
(http://www.uniprot.org)
Swiss-Prot format
http://us.expasy.org/sprot/userman.html
Non-redundant Databases
Sequence data only: cannot be browsed, can only be searched using a
sequence
Combine sequences from more than one database
Examples:
NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)
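The idea of combining sequences from several databases without duplicates can be sketched as below. This is a toy illustration, not NCBI's actual build procedure: identical sequences are collapsed into one entry that remembers every source record. The function name build_nonredundant and all identifiers and sequences in the sample data are made up.

```python
from collections import defaultdict

def build_nonredundant(databases):
    """Merge records with identical sequences; keep every source identifier."""
    merged = defaultdict(list)
    for db_name, records in databases.items():
        for identifier, sequence in records:
            # Compare case-insensitively so 'mkp...' and 'MKP...' collapse.
            merged[sequence.upper()].append(f"{db_name}:{identifier}")
    return dict(merged)

# Hypothetical records: (identifier, sequence) pairs per source database.
databases = {
    "SWISS-PROT": [("sp_demo_1", "MEFRVHLQADNEQK")],
    "GenPept": [("gp_demo_1", "MEFRVHLQADNEQK"), ("gp_demo_2", "MKPEPEASYLINQR")],
    "PDB": [("pdb_demo_1", "mkpepeasylinqr")],
}
nr = build_nonredundant(databases)
for sequence, sources in nr.items():
    print(sequence, sources)
```

Four input records collapse to two entries here, which is why a search against a non-redundant set returns each distinct sequence once.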
Protein domain databases
Pfam (http://www.sanger.ac.uk/Software/Pfam/)
• Collection of multiple sequence alignments and hidden Markov models covering many
common protein domains and families
SMART (http://smart.embl-heidelberg.de/help/smart_about.shtml)
• A Simple Modular Architecture Research Tool
• Identification and annotation of genetically mobile domains and the analysis of domain
architectures
CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
• Combines SMART and Pfam databases
• Easier and quicker search
Sequence Motif Databases
ScanProsite (http://www.expasy.org/prosite)
and PRINTS
(http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
Store conserved motifs occurring in nucleic acid or
protein sequences
Motifs can be stored as consensus sequences,
alignments, or using statistical representations such as
residue frequency tables
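Two of the representations named above, a residue frequency table and a consensus sequence, can be derived from the same alignment. The sketch below uses a made-up four-sequence toy alignment; the function names are hypothetical, and ties in the consensus go to the residue seen first.

```python
from collections import Counter

def frequency_table(alignment):
    """Return one Counter of residues per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]

def consensus(table):
    """Pick the most frequent residue in each column (ties: first seen)."""
    return "".join(column.most_common(1)[0][0] for column in table)

# Toy alignment of four short sites, for illustration only.
alignment = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
table = frequency_table(alignment)

for position, column in enumerate(table, start=1):
    # Relative frequency of each residue at this position.
    shares = {residue: n / len(alignment) for residue, n in column.items()}
    print(position, shares)
print("consensus:", consensus(table))
```

The consensus string discards the per-position variation that the frequency table keeps, which is the trade-off between the two storage forms.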