SEQUENCE

In the name of God, the Most Gracious, the Most Merciful
Using NCBI Resources for Gene Discovery
Lecturer: Dr. Farkhondeh Poursina, PhD
1392
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
http://www.ncbi.nlm.nih.gov/
PRIMARY BIOLOGICAL DATABASES

Nucleic acid & Protein
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA Data Bank of Japan)
• GenBank (NCBI, the National Center for Biotechnology Information)
EMBL/GENBANK/DDBJ
• These three databases contain mainly the same information (with a few differences in format)
• They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
− Genome projects and sequencing centers
− Individual scientists
• Non-confidential data are exchanged daily
• Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from > 50,000 different species

THE ‘PERFECT’ DATABASE

Comprehensive, but easy to search.

Annotated, but not “too annotated”.

A simple, easy to understand structure.

Cross-referenced.

Minimum redundancy.

Easy retrieval of data.
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New pages!
New Homepage
Common footer
TYPES OF MOLECULAR DATABASES
(SEQUENCE) AT NCBI

Primary Databases
• Original submissions by experimentalists
• Content controlled by the submitter
• Examples: GenBank, Trace, SRA, SNP, GEO
Derivative Databases
• Derived from primary data
• Curated/expert review (content controlled by a third party, NCBI)
• Compilation and correction of data
• Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain
PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. RefSeq (curators, genome assembly) and UniGene (algorithms) are derived from it and are updated continually by NCBI.]
THE PROBLEM
• Rapidly growing databases with complex and changing relationships
• Rapidly changing interfaces to match the above
Result
• Many people don’t know:
− Where to begin
− Where to click on a Web page
− Why it might be useful to click there

DERIVATIVE SEQUENCE DATABASES
ENTREZ
FINDING RELEVANT INFORMATION IN NCBI DATABASES
YOU CAN SEARCH
DNA SEQUENCE DATABASE
Retrieve known sequences with ENTREZ
• http://www.ncbi.nlm.nih.gov/Entrez/
• Click – Nucleotide
• Search by accession number OR by keyword
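The same two kinds of lookup (by accession number or by keyword) can also be scripted. The sketch below only builds the request URLs for NCBI's E-utilities service, which the slides do not mention; the base address, the `nuccore` database name and the helper names `fetch_url` and `search_url` are assumptions made for illustration, and nothing is downloaded.

```python
from urllib.parse import urlencode

# Assumption: base address of the NCBI E-utilities (not given on the slide).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def fetch_url(accession, db="nuccore", rettype="fasta"):
    """Build a URL that retrieves one record by its accession number."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def search_url(keywords, db="nuccore", retmax=20):
    """Build a URL that runs a keyword search and lists matching record IDs."""
    params = {"db": db, "term": keywords, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # U07418 is the accession used as the example GenBank record in this lecture.
    print(fetch_url("U07418"))
    print(search_url("MLH1 Homo sapiens"))
```

Opening either printed URL in a browser should return plain text or XML rather than the Entrez web page.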

Entrez is Internally Cross-linked
• DNA and protein sequences are linked to other similar sequences
• Medline citations are linked to other citations that contain similar keywords
• 3-D structures are linked to similar structures

DATABASES CONTAIN MORE THAN JUST DNA &
PROTEIN SEQUENCES

Retrieve all sequences for an organism or taxon
• Starting with an organism or taxon name...
How to: Download the complete genome for an organism
• Starting at the Genomes
How to: Find transcript sequences for a gene
• Starting with...
− A GENE NAME, PRODUCT NAME, OR SYMBOL
How to: Obtain genomic sequence for/near a gene, marker, transcript or protein
• Starting with...
− A GENE NAME OR SYMBOL
ENTREZ TIP: START SEARCHES IN GENE
[Figure: links from Gene to Entrez Protein, other Entrez DBs, BLink, HomoloGene, Gene Neighbors, and UniGene]
How to: Display genomic annotation graphically
• Starting with...
− A NUCLEOTIDE RECORD (e.g. NC_000001)
BY APPLYING LIMITS, THERE ARE NOW JUST TWO
ENTRIES
Precise Results
A TRADITIONAL GENBANK RECORD
Locus Field
ACCESSION NO
ACCESSION VERSION
Molecular weight
Definition Line
GI (GenInfo)
Taxonomy
Submission Field
Molecule Type
Modification Date
GenBank Division
TRADITIONAL GENBANK RECORD
ACCESSION U07418
• Accession: stable, reportable, universal
Coding sequence
VERSION U07418.1
• Version: tracks changes in sequence (the sequence is the data)
GI:466461
• GI number: NCBI internal use
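The relationship between ACCESSION and VERSION can be sketched in a few lines: the part before the dot is the stable accession, and the number after it is the sequence version. The function names `split_version` and `same_record` are hypothetical, and `U07418.2` in the usage lines is a made-up later version used only for comparison.

```python
def split_version(identifier):
    """Split 'U07418.1' into ('U07418', 1); version is None when absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


def same_record(a, b):
    """True when two identifiers share an accession, whatever the version."""
    return split_version(a)[0] == split_version(b)[0]


if __name__ == "__main__":
    print(split_version("U07418.1"))            # accession plus sequence version
    print(same_record("U07418.1", "U07418.2"))  # hypothetical second version
```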
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
RNA
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
protein
NP_007635    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
Feature Table
GenPept Record
Genomic DNA
Sequence
GENPEPT: GENBANK CDS TRANSLATIONS
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

GenPept record (the CDS translation as a separate protein entry):
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
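The CDS location in the feature table (22..2292) is enough to work out how long the translated GenPept protein should be. The sketch below assumes a simple, complete `start..end` location on the forward strand that includes the stop codon; it does not handle `join(...)` or `complement(...)` locations, and the function names are hypothetical.

```python
def parse_span(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(start), int(end)


def cds_summary(location):
    """Nucleotide, codon and residue counts for a complete CDS span."""
    start, end = parse_span(location)
    length = end - start + 1  # both endpoints are part of the feature
    if length % 3 != 0:
        raise ValueError("a complete CDS should be a multiple of 3 nucleotides")
    codons = length // 3
    # Assumption: the last codon is a stop codon, so it adds no residue.
    return {"nucleotides": length, "codons": codons, "residues": codons - 1}


if __name__ == "__main__":
    print(cds_summary("22..2292"))
    # {'nucleotides': 2271, 'codons': 757, 'residues': 756}
```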
REFSEQ
• Reference Sequences
− Nucleotide sequences and protein translation
− Curated by NCBI or NCBI-approved programs.
• Difference between GenBank and RefSeq
− GenBank has raw data and duplicated records
− Metadata in GenBank can be incomplete
− RefSeq is annotated, curated and non-redundant.
− NCBI takes the best sequences from GenBank and curates them for RefSeq records
SELECTED REFSEQ ACCESSION NUMBERS
mRNAs and Proteins
NM_123456    Curated mRNA
NP_123456    Curated Protein
NR_123456    Curated non-coding RNA
XM_123456    Predicted mRNA
XP_123456    Predicted Protein
XR_123456    Predicted non-coding RNA
Gene Records
NG_123456    Reference Genomic Sequence
Chromosome
NC_123455    Microbial replicons, organelle
AC_123455    Alternate assemblies
Assemblies
NT_123456    Contig
NW_123456    WGS Supercontig
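Because the two-letter prefix carries the meaning, a small lookup table is enough to classify a RefSeq accession. The dictionary below copies the descriptions from the slide; the function name `describe_refseq` is hypothetical, and prefixes that are not on the slide simply return `None`.

```python
# Prefix descriptions as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession):
    """Return the slide's description for a RefSeq accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in RefSeq style (e.g. U07418)
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_007635", "NT_030059", "U07418"):
        print(acc, "->", describe_refseq(acc))
```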
Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq
HOW TO SAVE?
• Choose FASTA from the Display drop-down menu.
• Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
• Save the FASTA sequence by using the following protocol:
a. In the Edit menu of your Web browser, click Select All and then click Copy.
b. Open a default Word document and, in the Edit menu of Word, click Paste.
c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
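If the FASTA text is already in hand (for example, copied from the browser window), the copy-and-paste protocol above can be replaced by a short script. This is only a sketch: the function name `save_fasta_text` is hypothetical, and the record in the usage lines is a placeholder, not the real dUTPase sequence.

```python
from pathlib import Path


def save_fasta_text(fasta_text, path="dUTPaseDNA.txt"):
    """Save FASTA text as a plain-text (*.txt) file and return its path."""
    text = fasta_text.strip() + "\n"
    if not text.startswith(">"):
        raise ValueError("FASTA text must begin with a '>' description line")
    target = Path(path)
    # ASCII keeps the file "text only", as the Save as type option does.
    target.write_text(text, encoding="ascii")
    return target


if __name__ == "__main__":
    # Placeholder record for illustration only.
    saved = save_fasta_text(">placeholder description\nACGTACGT\n")
    print("saved to", saved)
```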
FASTA FORMAT DESCRIPTION
• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.
• The FASTA format is popular and commonly used.
• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
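The format rules above are simple enough to check in code. The following is a minimal sketch of a reader and a writer for FASTA text; the function names and the default line width of 70 are illustrative choices, not part of any NCBI tool.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def format_fasta(description, sequence, width=70):
    """Format one record, wrapping sequence lines below 80 characters."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    record = format_fasta("demo record", "ACGT" * 50)
    print(record, end="")
    print(parse_fasta(record)[0][0])
```

Reading a record back with `parse_fasta` after writing it with `format_fasta` should return the original description and sequence unchanged.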
Shokouh Riazi, Faranak Kazemi
