Entrez Slide Show - WV IDeA Network of Biomedical Research


Using Entrez
The Life Sciences Search Engine
Searching NCBI Databases Efficiently
• Knowing how to retrieve the exact information you
need in an efficient way is the fundamental and
most important skill in Bioinformatics.
• Each NCBI database is designed and built for
a specific purpose.
• A common mistake Bioinformatics novices make is
searching for information in an inappropriate
database.
• Entrez links among and within databases, making it
easier to search for information.
What is Entrez?
• Entrez is an NCBI retrieval system designed
for searching several linked databases.
• Entrez is a search tool for integrated access
to the biological literature and sequence data.
• Entrez is extremely powerful, enabling the
user to quickly move between the different
specialized databases.
Entrez
• Entrez is divided into sites for nucleotide,
protein, structure, genomes, OMIM, and
more. You can use limits (such as RefSeq) to
focus your Entrez search.
• When you conduct a search via Entrez, your
query generates this screen, telling you the
number of hits to your query.
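The same hit count can be read programmatically. The sketch below assumes the search is run through NCBI's E-utilities ESearch service, which replies in XML; the sample reply is hand-written for illustration and its numbers are not real search results.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an E-utilities ESearch reply;
# the count and IDs are illustrative, not real search results.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
    <Id>333</Id>
  </IdList>
</eSearchResult>"""


def summarize_hits(xml_text):
    """Return (total hit count, list of record IDs) from an ESearch reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


count, ids = summarize_hits(SAMPLE_ESEARCH_XML)
print(f"{count} hits; IDs returned: {ids}")
```

The count is the total number of matching records; the ID list may be shorter, because a reply returns only one page of IDs at a time.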
The Entrez System
[Diagram, "The Big Picture": Entrez and the resources around it: Books, PubMed, PopSet, ProbeSet, Ensembl (e!), GDB, MGC, Protein, LocusLink, OMIM, Structure, HGMD, Homologene, SNP, CDD, 3D Domains, UCSC, Nucleotide, Genome, Taxonomy, UniGene, UniSTS, and MapViewer.]
Entrez and LocusLink
• Entrez doesn’t link to all the databases that
contain sequences, however!
• LocusLink has its own groups of links to
specialty databases, since it doesn’t cover all
the genomes yet.
Entrez: Database Integration
[Diagram: PubMed abstracts, 3-D Structure, Taxonomy, Genomes, Nucleotide sequences, and Protein sequences, joined by links labeled Word weight, Phylogeny, VAST, and BLAST.]
The (ever) Expanding Entrez System
[Diagram: Entrez linked to UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, SNP, PopSet, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, and Books.]
Entrez Databases
PubMed       Biomedical literature
Books        Online textbooks
Nucleotide   GenBank, EMBL, DDBJ, RefSeq, PDB
Protein      [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
Genome       Complete genomes
Taxonomy     Organisms in NCBI sequence databases
Structure    MMDB: experimental 3D structures
Domains      CDD: conserved protein domains
3D Domains   Compact 3D protein domains in MMDB
OMIM         Online Mendelian Inheritance in Man
SNP          Single nucleotide polymorphisms
UniSTS       Sequence Tagged Site markers
ProbeSet     Gene expression and microarray datasets
PopSet       Population study datasets
UniGene      Gene-based expressed sequence clusters
Nucleotide Database
• The Nucleotide database contains sequence data from
GenBank, EMBL, and DDBJ, the members of the
tripartite, international collaboration of sequence
databases.
• The EMBL database is maintained by the European Bioinformatics
Institute (EBI), an outstation of the European Molecular Biology
Laboratory, at Hinxton Hall, UK;
• DDBJ is the DNA Data Bank of Japan in Mishima, Japan.
• Sequence data are also incorporated from the Genome
Sequence Data Base (GSDB), Santa Fe, NM.
• Patent sequences are incorporated through
arrangements with the U.S. Patent and Trademark
Office (USPTO) and via the collaborating international
databases from other international patent offices.
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ     35,116,960
Derivative
• RefSeq                    259,219
• Third Party Annotation    3,182
• PDB                       4,703
Total                       35,384,248
Database Searching with
Entrez
Using limits and field restriction to find
plant g6pdh
Linking and neighboring with g6pdh
Entrez Nucleotides
glucose 6 phosphate dehydrogenase
The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to
6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate),
while reducing nicotinamide adenine dinucleotide phosphate (NADP+) to
NADPH. In terms of electron transfer, glucose-6-phosphate loses two
electrons and NADP+ gains two electrons to become NADPH. This is the
first step in the pentose phosphate pathway. This pathway, or shunt, as
it is sometimes called, produces the 5-carbon sugar ribose, which is the
sugar component of RNA and the precursor of the deoxyribose in DNA.
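Written as a reaction, the G6PD step looks roughly like this. Note that the immediate product is the lactone, 6-phosphogluconolactone; a separate enzyme then hydrolyzes it to 6-phosphogluconate. The arrow macro assumes the amsmath package is loaded.

```latex
\[
\text{glucose-6-phosphate} + \mathrm{NADP^{+}}
\;\xrightarrow{\;\text{G6PD}\;}\;
\text{6-phosphogluconolactone} + \mathrm{NADPH} + \mathrm{H^{+}}
\]
```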
Limits Are Helpful
• Limits allow restriction of a search to a defined subset of
the database.
• Limits can be set to restrict a search to a particular
database field (e.g., the Author field).
• Limits can be set to search everything but a particular type
of data (e.g., “exclude patent records”).
• Alternatively, limits can be set to search only a particular
type of data (e.g., Genomic RNA/DNA) or to search only data
from a particular source database (e.g., EMBL). Date limits
and sequence length limits are also possible.
• The contents of each Entrez database differ, and therefore
the Limits available for each database differ.
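Limits can also be typed directly into the query as field tags in square brackets. The sketch below composes such a query and the matching E-utilities ESearch URL without sending it. The helper names are made up for illustration, and the specific tags used (green plants[Organism], biomol_mrna[Properties], gbdiv_pat[Properties]) are assumptions about the Nucleotide database's indexing that should be checked against its current field list.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(text, field=None, include=(), exclude=()):
    """Combine a search phrase with Entrez-style limits.

    field   -- restrict the phrase to one field, e.g. "Title"
    include -- tagged terms that must also match (joined with AND)
    exclude -- tagged terms that must not match (joined with NOT)
    """
    term = f"{text}[{field}]" if field else text
    for extra in include:
        term += f" AND {extra}"
    for unwanted in exclude:
        term += f" NOT {unwanted}"
    return term


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term(
    "glucose 6 phosphate dehydrogenase",
    field="Title",
    include=["green plants[Organism]", "biomol_mrna[Properties]"],
    exclude=["gbdiv_pat[Properties]"],  # assumed tag for patent records
)
print(term)
print(esearch_url("nucleotide", term))
```

Each element of the query corresponds to one choice on the Limits screen: a field restriction, "search only" limits, and an "exclude" limit.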
Entrez Nucleotides: Limits & Preview/Index
glucose 6 phosphate dehydrogenase
Try using the Limits and Preview/Index functions to hone your search
and find the plant G6PD genes.
Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Option: Exclude bulk sequences
Entrez Nucleotides: Limits
glucose 6 phosphate dehydrogenase
Title == Definition
Exclude Bulk Sequences
Nuclear gene
mRNA molecule type
Document Summaries: Limits
Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Term added: green plants
Plant cytosolic g6pdh mRNAs
Database Neighbors and Interlinking
• What makes Entrez more powerful than many
services is that most of its records are
linked to other records, both within a given
database (such as Nucleotide) and between
databases.
• Links within a database are called “neighbors”
(e.g., Nucleotide neighbors).
Links Between Databases
• Protein and Nucleotide neighbors are determined by
performing similarity searches using the BLAST
algorithm to compare the entry amino acid or DNA
sequence to all other amino acid or DNA sequences
in the database. We will discuss more about
BLAST later.
• Nucleotide sequence records in the Nucleotide
database are linked to the PubMed citation of the
article in which the sequences were published.
• Protein sequence records are linked to the
nucleotide sequence from which the protein was
translated.
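These cross-database links can also be requested through the E-utilities ELink service. The sketch below only builds the request URLs; the record ID is a hypothetical placeholder, and the database names (nuccore, protein, pubmed) are the E-utilities identifiers assumed for the Nucleotide, Protein, and PubMed databases.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(dbfrom, db, uid):
    """Build (but do not send) an ELink request from one database to another."""
    return ELINK_URL + "?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})


PLACEHOLDER_UID = "123456"  # hypothetical record ID, for illustration only

# Nucleotide record -> the PubMed citation in which it was published
print(elink_url("nuccore", "pubmed", PLACEHOLDER_UID))
# Protein record -> the nucleotide sequence it was translated from
print(elink_url("protein", "nuccore", PLACEHOLDER_UID))
```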
Plant cytosolic g6pdh mRNAs
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
LinkOut
• LinkOut is a feature of Entrez that is designed to
provide users with links from PubMed and other
Entrez databases to a wide variety of relevant
web-accessible online resources:
– Full-text publications
– Other biological databases
– Consumer health information
– Research tools
• The goal is to facilitate access to relevant online
resources beyond the Entrez system to extend,
clarify, or supplement information found in the
Entrez databases.
Protein Database
• The Protein database includes proteins from translated regions of
DNA in GenBank as well as sequences from PIR
• The entry includes:
– The name of the protein
– How the protein sequence was derived
– An accession and a PID number
– The number of amino acids
Protein Entry
The entry also includes:
• Structural information for the protein (if known)
– Helices and sheets
– Domains
– Etc.
• The sequence of amino acids comprising the protein
Setting Protein Database search limits
• Choose Protein from the drop-down menu
– Can do a Boolean search
– Or can set LIMITS
• Fields (e.g., Author, Journal, etc.)
• Gene Location (genomic, mitochondrial, etc.)
• Segmented Sequence
• Only from (Database to check)
• Modification date
Linking Between Databases
• Sometimes you will pull up a record and have no idea
what organism the gene you are looking at comes from.
• For example, in the following record: what is
Medicago sativa?
Entrez GenBank / GenPept
Taxonomy to the Rescue
• Entrez lets you click a live link from the
record and determine what organism
Medicago sativa is.
• It is alfalfa.
• You can also tell what it is related to
taxonomically, because sometimes the
common name isn’t very useful either!
Taxonomy Link
Advanced Neighbors: BLink
What is BLink
• BLink - BLAST Link
• Someone has done a BLAST search already,
and you can just retrieve it!
• BLink displays the graphical output of precomputed blastp results against the protein
non-redundant (nr) database.
This graphical output includes:
• Alignment of up to 200 BLAST hits on the query
sequence
• Best Hits to each organism
• List of known protein domains in the query sequence
• Filter hits by selecting the BLAST cutoff score
• Distribution of hits by taxonomic grouping
• Display of similar sequences with known 3D
structure
• Filter hits by database and/or by taxonomic
grouping
• Display a taxonomic tree of all organisms with
similar sequences
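Two of the BLink views listed above, filtering by cutoff score and best hit per organism, amount to simple operations on a list of hits. The sketch below uses a few made-up hits to show the idea; the accessions and scores are hypothetical, and real BLink output carries more columns.

```python
from collections import namedtuple

# Hypothetical, hand-made hits for illustration only.
Hit = namedtuple("Hit", "accession organism score")

HITS = [
    Hit("P1", "Medicago sativa", 950),
    Hit("P2", "Medicago sativa", 610),
    Hit("P3", "Homo sapiens", 480),
    Hit("P4", "Escherichia coli", 120),
]


def filter_by_score(hits, cutoff):
    """Keep hits whose BLAST score is at least the cutoff."""
    return [h for h in hits if h.score >= cutoff]


def best_hit_per_organism(hits):
    """Return the highest-scoring hit for each organism."""
    best = {}
    for h in hits:
        if h.organism not in best or h.score > best[h.organism].score:
            best[h.organism] = h
    return best


strong = filter_by_score(HITS, cutoff=200)
for organism, hit in best_hit_per_organism(strong).items():
    print(organism, hit.accession, hit.score)
```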
PopSet Links
• The PopSet database contains aligned
sequences submitted as a set resulting from a
population, phylogenetic, or mutation study.
• These alignments describe such events as
evolution and population variation.
• The PopSet database contains both
nucleotide and protein sequence data.
Protein Neighbors->PopSet Links
Protein Neighbors->Genome Links
PopSet search results
• The results of a PopSet search
• The PopSet database includes alignments of genes from multiple
organisms OR different gene families OR mutational analyses
PopSet Entry
• The PopSet entry includes:
– The title of the paper/study
– The length of the sequence(s) aligned
– The number of aligned sequences
PopSet Entry without alignment
• The PopSet entry without an alignment
– Title of the study
– The number of sequences included
– Links to the sequences
Entrez Structures
Protein structures can also be found in databases.
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html is a useful review
tutorial.
Entrez links to structure databases
• The Structure database or Molecular Modeling
Database (MMDB) contains experimental data from
crystallographic and NMR structure determinations.
• The data for MMDB are obtained from the Protein
Data Bank (PDB).
• The NCBI has cross-linked structural data to
bibliographic information, to the sequence
databases, and to the NCBI taxonomy.
• Use Cn3D, the NCBI 3D structure viewer, for easy
interactive visualization of molecular structures
from Entrez.
Structure Search results
• Protein structures are also stored in a database
• Search as before
• Your search results are similar
Structure Entry
• The Structure entry has links to the other databases
• It also lets you download a file to open with a structure
viewer program
• Proteins with similar structures and functions
have been identified in the databases
BLink: Advanced Protein Neighbors
BLink: Related Structures
Viewing Structure in Cn3D
• You can download Cn3D (a structure viewer program) from NCBI
• This will allow you to view the structures from the Structure
database
Cn3D Text Window
• The Text window of Cn3D will align two or more proteins so you
can compare the structures of multiple proteins
BLink: Human Homologue
Human RefSeqs: Genome Reagents
MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation
– Inclusion of Taxonomy, Citation,
and other information
– Conversion to ASN.1 data description language
• Structure neighbors determined by
Vector Alignment Search Tool (VAST)
Structure Summary
Cn3D viewer
Structure Neighbors
Conserved Domains
3D Domain Neighbors
Cn3D 4.1
Cn3D 4.1: Structural Alignment
Conserved ATP binding site
Src Kinase H. sapiens
Casein kinase S. pombe
Cn3D: Simple Homology Modeling
human
swordtail
Using Cn3D to model domains
Other services and databases
from the NCBI
• LocusLink links to all available information from
NCBI and beyond for a few well-characterized
model organisms.
• LocusLink is a great starting point: it collects
key information on each gene/protein from
major databases. It now covers 8 organisms.
• RefSeq provides a curated, optimal accession
number for each mRNA (NM_006744) or
protein (NP_007635).
Locus Links
• Results of a LocusLink search include:
– Locus ID
– Species
– Locus symbol
– Locus name
– Locus location
– Links
• Protein Database
• OMIM
• Reference Sequence
• Related GenBank Sequences
• Homologene Data
• UniGene
• Variation Data
LocusLink: Selected Higher Genomes
[Screenshot labels: OMIM, PubMed, Map Viewer, HomoloGene, UniGene, RefSeq, Full report, GenBank, dbSNP, Protein]
Protein Database
• The Protein database contains sequence data
from the translated coding regions from DNA
sequences in GenBank, EMBL, and DDBJ as
well as protein sequences submitted to:
– Protein Information Resource (PIR)
– SWISS-PROT
– Protein Research Foundation (PRF)
– Protein Data Bank (PDB) (sequences from solved structures)
NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated, high-quality protein reviews
• PIR: Protein Information Resource, Georgetown University
• PRF: Protein Research Foundation
• PDB: Protein Data Bank, sequences from structures
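The RefSeq prefixes mentioned in these slides (NM_, NP_, XP_) can be used to sort accessions at a glance. The sketch below covers only those three prefixes; RefSeq uses others that are simply reported as "unknown" here, and the XP_ and non-RefSeq accessions in the example are made-up placeholders.

```python
# Prefix meanings taken from the slides; other RefSeq prefixes exist
# and fall through to "unknown" in this sketch.
REFSEQ_PREFIXES = {
    "NM_": "mRNA (nucleotide) record",
    "NP_": "protein, mRNA based",
    "XP_": "protein, genome based",
}


def describe_accession(accession):
    """Classify an accession by its RefSeq-style three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# NM_006744 and NP_007635 appear in the slides; the other two are
# hypothetical placeholders.
for acc in ["NM_006744", "NP_007635", "XP_000001", "P12345"]:
    print(acc, "->", describe_accession(acc))
```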
Entrez Protein
• GenPept (GB, EMBL, DDBJ)    3,442,298
• RefSeq                      856,191
• Third Party Annotation      3,834
• Swiss Prot                  144,508
• PIR                         282,821
• PRF                         12,079
Total                         3,442,298
BLAST nr                      1,642,191
Protein Link
BLAST Link
Conserved Domains
Related Proteins: Redundancy
Redundant Sequences
Related Proteins: Links
Sequence from MutL structure
BLink: non-redundant relatives
Arabidopsis homolog
Conserved Domain
MLH1 Domain Structure:
CDD
ATPase Domain
Mismatch Repair Domain
MLH1: ATPase Domain
1BGQ: ATPase Domain in Cn3D
Yeast HSP90
ATP Binding site helix
Variations Human MLH1
BLink
Finding structural models
Mapping Variation Onto Structure
Loads sequence alignment and structure in Cn3D
Bacterial DNA mismatch repair proteins
Mapping Variation Onto Structure
Asn
Ile – Val
Conserved Asn
Ile
NCBI Genome Databases
• The Genome database provides views for a
variety of genomes, complete chromosomes,
sequence maps with contigs, and integrated
genetic and physical maps.
Microbial Genomes
ZWF
Genome search results
• Genome search results
• The Genome database includes full (and some partial) genomes
from viruses to complex organisms
Genome Entry
• Genome entries include
– Maps of the genome
– Links to the sequence
– The organism for the genome
Genes Database: All Genomes
Coming soon!
But wait! There’s more!
• There is even more at NCBI than I have
covered here.
• This site map is also a guide to NCBI
resources. Each link leads to a brief
description of the resource on this page, then
to the resource itself.
http://www.ncbi.nlm.nih.gov/Sitemap/
There are many bioinformatics
servers outside NCBI.
• Try ExPASy’s sequence retrieval system at
http://www.expasy.ch/
• (ExPASy = Expert Protein Analysis System)
• Or try ENSEMBL at www.ensembl.org for a
premier human genome web browser.