NCBI FieldGuide
NCBI Molecular Biology Resources
A Field Guide
NCBI Resources
• About NCBI
• NCBI Sequence Databases
• Other NCBI Databases
• Entrez Databases and Text Searching
• Genomic Resources
• BLAST Services
The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
The Entrez System
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ     35,116,960
Derivative
• RefSeq                       259,219
• Third Party Annotation         3,182
• PDB                            4,703
Total                       35,384,248
Entrez Protein
• GenPept (GB, EMBL, DDBJ)   3,178,346
• RefSeq                       933,905
• Third Party Annotation         4,338
• Swiss-Prot                   146,978
• PIR                          282,821
• PRF                           12,079
Total                        4,314,705
BLAST nr                     2,724,717
NCBI’s Primary Sequence Database
What is GenBank?
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– FTP accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
International Sequence Database Collaboration
[Figure: three collaborating databases, each receiving submissions and updates: GenBank (NCBI, NIH; Entrez), EMBL (EBI; SRS), and DDBJ (CIB, NIG; getentry).]
The Growth of GenBank
[Figure: sequence records (millions) and total base pairs (billions) by year, ’83 to ’04.]
Release 140: 32.5 million records, 37.9 billion nucleotides
Average doubling time ≈ 12 months
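The doubling-time figure on the growth slide can be turned into a quick growth estimate. The sketch below is only an illustration of steady exponential doubling; the function name growth_factor is hypothetical, and a constant 12-month doubling time is an assumption taken from the slide's average, not a forecast.

```python
def growth_factor(elapsed_months: float, doubling_months: float = 12.0) -> float:
    """Factor by which a quantity grows if it doubles every `doubling_months`.

    Assumes a constant doubling time (the slide only reports an average).
    """
    return 2.0 ** (elapsed_months / doubling_months)


# Three years at a 12-month doubling time is three doublings.
print(growth_factor(36))  # 8.0
```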
Organization of GenBank: GenBank Divisions
Records are divided into 17 Divisions: 1 Patent (11 files), 5 Bulk, 11 Traditional.
Traditional Divisions:
• Direct Submission (Sequin and BankIt)
• Accurate
• Well characterized
PRI (27) Primate
PLN (10) Plant and Fungal
BCT (8) Bacterial and Archaeal
INV (6) Invertebrate
ROD (11) Rodent
VRL (3) Viral
VRT (4) Other Vertebrate
MAM (1) Mammalian (ex. ROD and PRI)
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
BULK Divisions:
• Batch Submissions (Email and FTP)
• Inaccurate
• Poorly characterized
EST (288) Expressed Sequence Tag
GSS (98) Genome Survey Sequence
HTG (61) High Throughput Genomic
STS (3) Sequence Tagged Site
HTC (3) High Throughput cDNA
Entrez query: gbdiv_xxx[Properties]
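The gbdiv_xxx[Properties] pattern can be filled in mechanically from the division codes listed on the slide. This is a minimal sketch that only builds query strings; the helper names division_term and exclude_bulk are hypothetical, and writing the code in lowercase simply follows the slide's gbdiv_xxx placeholder.

```python
# Division codes as listed on the slide (the Patent division is not coded there).
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL", "VRT", "MAM", "PHG", "SYN", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC"}


def division_term(code: str) -> str:
    """Return the Entrez Properties term for one GenBank division code."""
    code = code.strip().upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown division code: {code!r}")
    return f"gbdiv_{code.lower()}[Properties]"


def exclude_bulk(query: str) -> str:
    """Wrap a query so that records from the five bulk divisions are excluded."""
    exclusions = " OR ".join(division_term(c) for c in sorted(BULK))
    return f"({query}) NOT ({exclusions})"


print(division_term("EST"))  # gbdiv_est[Properties]
print(exclude_bulk("MutL"))
```

The strings are meant to be pasted into an Entrez search box; nothing here contacts NCBI.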
Traditional GenBank Divisions
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (vectors, synth. genes)
VRL  Viral
VRT  Other Vertebrate
A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
   PUBMED   9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
GenBank Record: Locus
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
Locus name: AF062069
Length: 3808 bp
Molecule type: mRNA
Division: INV
Modification Date: 23-OCT-2002
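The fields called out on the LOCUS line can be pulled apart with a few lines of code. The sketch below splits on whitespace, which works for the example line in this record; real LOCUS lines can vary (for instance in how the molecule type is written), so this is an illustration, not a general parser. The names Locus and parse_locus are hypothetical.

```python
from datetime import date
from typing import NamedTuple

MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()


class Locus(NamedTuple):
    name: str
    length: int          # sequence length in bp
    molecule_type: str
    topology: str
    division: str        # three-letter GenBank division code
    modified: date


def parse_locus(line: str) -> Locus:
    """Parse a LOCUS line laid out like the example record (8 whitespace-separated tokens)."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, topology, division, stamp = tokens
    day, month, year = stamp.split("-")
    modified = date(int(year), MONTHS.index(month.upper()) + 1, int(day))
    return Locus(name, int(length), molecule, topology, division, modified)


record = parse_locus("LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002")
print(record.division, record.length, record.modified.isoformat())  # INV 3808 2002-10-23
```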
GenBank Record: Identifiers
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
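The VERSION line carries three identifiers: the accession, its version suffix, and the GI number. The record's COMMENT notes that this sequence version replaced gi:3132700, so the version suffix and GI are tied to a specific sequence. A minimal sketch for splitting the line, assuming the layout shown in this record (parse_version is a hypothetical helper):

```python
from typing import Optional, Tuple


def parse_version(line: str) -> Tuple[str, int, Optional[int]]:
    """Split a VERSION line into (accession, version number, GI or None)."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, dot, version = tokens[1].partition(".")
    if not dot:
        raise ValueError(f"no version suffix in {tokens[1]!r}")
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return accession, int(version), gi


print(parse_version("VERSION     AF062069.2  GI:7144484"))  # ('AF062069', 2, 7144484)
```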
GenBank Record: Organism
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
NCBI’s Taxonomy
GenBank Record: Feature Table
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
GenPept IDs: /protein_id="AAC16332.2", /db_xref="GI:7144485"
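Two quick arithmetic checks tie the feature table back to the LOCUS line: the BASE COUNT entries should add up to the 3808 bp sequence length, and the CDS span 258..3302 should be a whole number of codons. The sketch below handles only simple start..end locations (span_length is a hypothetical helper). Reading the last codon as the stop codon is an assumption based on the record being described as a complete cds.

```python
def span_length(location: str) -> int:
    """Length of a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(end) - int(start) + 1


base_count = {"a": 1201, "c": 689, "g": 782, "t": 1136}
total_bases = sum(base_count.values())

cds_length = span_length("258..3302")
codons, remainder = divmod(cds_length, 3)

print(total_bases)        # 3808, matching the LOCUS line
print(cds_length)         # 3045
print(codons, remainder)  # 1015 0
```

With 1015 codons and one of them assumed to be the stop, the translation would be 1014 residues long; the slide shows only part of it.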
What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast based automated sequence clustering
• Now informed by genome hits (New!)
• Nonredundant set of gene oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents
UniGene
Human UniGene
UniGene Build 168, Feb. 24, 2004
      132,990  mRNAs
        6,327  models
        7,235  HTC
    1,408,949  EST, 3'reads
    2,082,199  EST, 5'reads
+     774,927  EST, other/unknown
    ---------
    4,412,627  total sequences in clusters
Final Number of Clusters (sets)
===============================
105,651  total
 27,511  contain at least one mRNA
  5,613  contain at least one HTC
104,397  contain at least one EST
 26,291  contain both mRNAs and ESTs
Callouts: 3,000,000,000 bp; 30 K expected genes; transcripts; 75% uncharacterized
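The build statistics can be sanity-checked with simple arithmetic. The sketch below sums the six sequence categories and computes the share of clusters that contain no mRNA. Pairing each cluster count with its description follows the order in which they appear on the slide, which is an assumption, and reading the no-mRNA share as the "75% uncharacterized" callout is likewise an interpretation rather than something the slide states.

```python
sequences_in_clusters = {
    "mRNAs": 132_990,
    "models": 6_327,
    "HTC": 7_235,
    "EST, 3' reads": 1_408_949,
    "EST, 5' reads": 2_082_199,
    "EST, other/unknown": 774_927,
}
total_sequences = sum(sequences_in_clusters.values())

clusters_total = 105_651
clusters_with_mrna = 27_511  # assumed pairing of count and label
share_without_mrna = (clusters_total - clusters_with_mrna) / clusters_total

print(total_sequences)               # 4412627
print(round(share_without_mrna, 2))  # 0.74
```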
Genome Sequencing - HTG, GSS, (WGS)
[Figure: a whole BAC insert (or genome) goes through shredding, cloning and isolating, and sequencing. Reads go to the GSS division or the trace archive; assembly yields draft sequence (HTG division) or whole genome shotgun assemblies (traditional division).]
Other Genome Sequencing Products
Trace Archive
Whole Genome Shotgun
Trace Archive
• Primary reads from WGS and EST projects
• Many not available in GenBank
• Earliest access to genome data
Derivative Sequence Databases
RefSeq
TPA
NCBI Derivative Sequence Data
[Figure labels: Labs, GenBank, Curators, RefSeq, Genome Assembly, Algorithms, UniGene; example sequence fragments TATAGCCG, AGCTCCGATA, CCGATGACAA.]
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, Arabidopsis
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
• Chromosome records
– human genome
– microbial
– organelle
srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456   Curated mRNA
NP_123456   Curated Protein
NR_123456   Curated non-coding RNA
XM_123456   Predicted mRNA
XP_123456   Predicted Protein
XR_123456   Predicted non-coding RNA
Gene Records
NG_123456   Reference Genomic Sequence
Chromosome
NC_123455   Microbial replicons, organelle
Assemblies
NT_123456   Contig
NW_123456   WGS Supercontig
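Because each RefSeq record type has its own two-letter prefix, an accession can be classified by looking only at the text before the underscore. The sketch below uses only the prefixes and descriptions listed on the slide; describe_refseq is a hypothetical helper, and any other accession simply returns None.

```python
from typing import Optional

# Prefix table copied from the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the slide's description for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.strip().upper().partition("_")
    if not underscore:
        return None  # no underscore, e.g. a GenBank accession such as AF062069
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_000249"))  # Curated mRNA
print(describe_refseq("AF062069"))   # None
```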
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
– BankIt
– Sequin
tpa[Properties]
Other NCBI Databases
• dbSNP: nucleotide polymorphism
• GEO: Gene Expression Omnibus
  microarray and other expression data
• Gene: gene records
  Unifies LocusLink and Microbial Genomes
• Structure: imported structures (PDB)
  Cn3D viewer, NCBI curation
• CDD: conserved domain database
  Protein families (COGs and KOGs)
  Single domains (PFAM, SMART, CD)
NCBI Structures and Domains
MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation
– Inclusion of Taxonomy, Citation,
– Conversion to ASN.1 data description language
• Structure neighbors determined by Vector Alignment Search Tool (VAST)
Structure Summary
Cn3D viewer
Structure Neighbors
Conserved Domains
3D Domain Neighbors
NCBI’s Conserved Domain Database
• Multiple sequence alignments
• PSI-BLAST-based score matrices
• Sources: SMART, PFAM, COGs, KOGs
• New NCBI curated domains
– structure informed alignments
• Stats:
– COGS 4,873
– KOGS 4,852
– Pfam 5,193
– Smart 653
– NCBI CDD 316
WWW Access
Entrez & BLAST
NCBI Web Traffic
[Figure: NCBI web traffic by year, 1997 to 2001, on a scale of 0 to 250,000; one point is labeled Christmas Day.]
Using Entrez
An integrated database search and retrieval system
Entrez: Database Integration
[Figure: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy. Link labels: Word weight, BLAST, VAST, Phylogeny.]
Database Searching with Entrez
Using limits and field restriction to find human MutL homolog
Linking and neighboring with MutL
Mapping SNPs onto structure and the genome
Global Entrez Search
MutL[All Fields]
Document Summaries: Limits & Preview/Index
Entrez Nucleotides: MutL
Entrez Nucleotides: Limits
Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
Exclude bulk sequences
Entrez Nucleotides: Limits
MutL
Title == Definition
Exclude Bulk Sequences
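Field restriction in Entrez comes down to attaching a field name in square brackets to each term and joining the terms with Boolean operators. The sketch below builds such a query string for the human MutL example; the helper names field_term and all_of are hypothetical, the field names come from the Limits list, and srcdb_refseq[Properties] is the RefSeq filter shown on the RefSeq slide. Quoting multi-word terms is a choice made here to keep them together as a phrase.

```python
def field_term(term: str, field: str) -> str:
    """Attach an Entrez field tag to a term, quoting multi-word terms."""
    term = term.strip()
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def all_of(*terms: str) -> str:
    """Join already-tagged terms so that every one must match."""
    return " AND ".join(terms)


query = all_of(
    field_term("MutL", "Title"),              # Title == Definition line
    field_term("Homo sapiens", "Organism"),   # restrict to human
    "srcdb_refseq[Properties]",               # RefSeq records only
)
print(query)  # MutL[Title] AND "Homo sapiens"[Organism] AND srcdb_refseq[Properties]
```

The resulting string is intended for the Entrez Nucleotide search box; the code does not contact NCBI.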
Document Summaries: Limits
Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
Human MutL Search Results
GenBank Records
Human MutL RefSeq
NM_000249: Links
Literature Links
PubMed
OMIM
Books
NM_000249: PubMed
Books Link
Conserved Domain
OMIM: Human Disease Genes
Sequence Links
Nucleotide
Protein
Genome Project BAC
similarity
Original GenBank mRNAs
Original GenBank genomic
NM_000249: Related Sequences
Taxonomy Link
The Tax Browser
NCBI’s Taxonomy
Taxonomy Link
NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated high quality protein reviews
• PIR: Protein Information Resource, Georgetown University
• PRF: Protein Research Foundation
• PDB: Protein Data Bank, sequences from structures
BLAST Link
Conserved Domains
Protein Link
Related Proteins: Redundancy
Redundant Sequences
Related Proteins: Links
Sequence from MutL structure
Arabidopsis homolog
Conserved Domain
BLink: non-redundant relatives
NM_000249: Genome Links