Transcript Databases
Databases
Databases in Bioinfo
• Central in bioinformatics, as shown by two of the
definitions from the previous lecture:
– The use of computers in solving information problems in
the life sciences. It mainly involves the creation of
extensive electronic databases on genomes, protein
sequences etc. Also involves techniques such as threedimensional modelling of biomolecules and biological
systems (www.universityscience.ie/pages/glossary.php).
– The collection, organization and analysis of large amounts
of biological data, using networks of computers and
databases
(www.abc.net.au/science/slab/genome2001/glossary.htm).
Databases in Bioinfo
• There are a few central databases which
are used often.
• There are many different databases which
specialize in specific fields.
Databases in Bioinfo
• Central databases:
– NCBI/GenBank (http://www.ncbi.nlm.nih.gov/)
• Nucleotide + protein sequences and much more
– EBI/EMBL (http://www.ebi.ac.uk/)
• Similar to GenBank
– Ensembl (http://www.ensembl.org/index.html)
• Whole genomes browsing
– ExPASy - Swiss-Prot/TrEMBL
(http://www.expasy.ch/sprot/)
• Manually annotated and reviewed protein DB.
Databases in Bioinfo
• Specialised databases (examples):
– Relevant to a particular gene (e.g. RDP - 16S
gene)
– Groups of sequences sharing common
properties (e.g. Pfam)
– Particular organisms (e.g. TAIR - The
Arabidopsis Information Resource)
– Protein structures (PDB)
– And many more…
Public Sequence Databases
• Three main databases
• Sequences are pooled (identical between
DBs)
– GenBank (NCBI – National Center for
Biotechnology Information)
– EMBL (EBI – European Bioinformatics
Institute)
– DDBJ (DNA Data Bank of Japan, at the Center for
Information Biology)
NCBI - GenBank
• A collection of nucleotide sequences and their
translation to proteins
• Data resources:
– Submission of sequences from research groups
– Bulk submission of sequences from large sequencing
centers.
• An accession number is given to each
sequence, including translated protein
sequences.
• Most journals require indication of a
GenBank/EMBL/DDBJ accession number for
any newly described sequence as an obligatory
condition for publication.
Types of sequences
• Nucleotide sequences
– Genomic sequences
– cDNA
– EST sequences
• Protein sequences
NCBI - RefSeq
• The Reference Sequence (RefSeq)
collection aims to provide a
comprehensive, integrated, non-redundant, well-annotated set of
sequences, including genomic DNA,
transcripts, and proteins. (From the NCBI RefSeq
website)
• Site: http://www.ncbi.nlm.nih.gov/RefSeq/
NCBI - RefSeq
• The RefSeq database is a curated collection of
DNA, RNA, and protein sequences built by
NCBI.
• RefSeq provides only one example of each
natural biological molecule for major organisms
ranging from viruses to bacteria to eukaryotes.
• For each model organism, RefSeq aims to
provide separate and linked records for the
genomic DNA, the gene transcripts, and the
proteins arising from those transcripts.
NCBI - RefSeq
• RefSeq is limited to major organisms for which
sufficient data is available (16,248 distinct
organisms as of Sep. 2011).
• GenBank includes sequences for any organism
submitted (more than 300,000 different named
organisms).
• RefSeq records appear in a format similar to that of the
GenBank records from which they are derived.
They can be distinguished by their accession
prefix, which includes an underscore.
RefSeq accession numbers
Examples:
• NC_123456
– Complete genomic molecules including genomes,
chromosomes, organelles, plasmids.
• NM_123456
– Transcript products; mature messenger RNA (mRNA)
transcripts.
• NP_123456
– Protein products; primarily full-length precursor
products but may include some partial proteins and
mature peptide products.
• For other prefixes, see http://www.ncbi.nlm.nih.gov/RefSeq/key.html
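The prefix rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the lookup table are made up for this example, and the table covers just the three prefixes listed on the slide (the key page linked above lists more).

```python
# Prefix meanings copied from the three examples above; other prefixes exist.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NM": "mRNA transcript",
    "NP": "protein product",
}


def classify_accession(accession: str) -> str:
    """Return a rough description of an accession based on its prefix."""
    prefix, separator, _rest = accession.partition("_")
    if not separator:
        # No underscore, so this is not a RefSeq-style accession.
        return "not RefSeq"
    return REFSEQ_PREFIXES.get(prefix.upper(), "RefSeq (other prefix)")


print(classify_accession("NM_123456"))
print(classify_accession("A4QJC3"))
```

Checking for the underscore first mirrors the way RefSeq records are told apart from the GenBank records they are derived from.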
Primary vs. Derivative Sequence Databases
• GenBank (primary): sequences submitted by labs and
sequencing centers; updated ONLY by submitters.
• RefSeq, Genome Assembly, UniGene (derivative): built from
GenBank by curators and algorithms; updated continually
by NCBI.
(Figure from NCBI field guides)
NCBI - Entrez
• Entrez Global Query is an integrated search and
retrieval system that provides access to all
databases simultaneously with a single query
string and user interface.
• Entrez can efficiently retrieve related sequences,
structures, and references. The Entrez system
can provide views of gene and protein
sequences and chromosome maps.
• http://www.ncbi.nlm.nih.gov/sites/gquery
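The slide describes the Entrez web interface. The same kind of query can also be sent programmatically through NCBI's E-utilities, which are not covered in this lecture. The sketch below only builds an ESearch URL and does not contact NCBI; the helper name and the example query are illustrative assumptions.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities (an assumption here: the slides
# only mention the web interface, not this programmatic one).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a search URL for one Entrez database (e.g. 'protein', 'gene')."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: the rbcL protein used as the example entry in this lecture.
url = build_esearch_url("protein", "rbcL Aethionema cordifolium")
print(url)
```

Note that, unlike the global query page, each ESearch request targets a single database, so searching several databases means building one URL per database.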
NCBI - Entrez
• Some of the databases searched by Entrez:
– Nucleotide: sequence database (GenBank)
– Protein: sequence database
– Gene: gene-centered information
– PubMed: biomedical literature citations and abstracts
– OMIM: Online Mendelian Inheritance in Man
– Genome: whole genome sequences and mapping
– Structure: three-dimensional macromolecular
structures
– UniGene: gene-oriented clusters of transcript
sequences
– GEO Profiles: expression and molecular abundance
profiles
NCBI - PubMed
• PubMed is a free search engine for accessing
the MEDLINE database of citations, abstracts
and some full text articles on life sciences and
biomedical topics
• MEDLINE (Medical Literature Analysis and
Retrieval System Online) is a literature database
of life sciences and biomedical information,
compiled by the U.S. National Library of
Medicine (NLM).
• http://www.ncbi.nlm.nih.gov/pubmed/
NCBI - Gene
• Integrated Access to Genes of Genomes
in the Reference Sequence (RefSeq)
Collection.
• Supplies key connections in the nexus of
map, sequence, expression, structure,
function, citation, and homology data.
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
EMBL-EBI
• The European Bioinformatics Institute
(EBI) is part of European Molecular
Biology Laboratory (EMBL).
• Hosts two main databases:
– For nucleotide sequences (EMBL-Bank)
– For protein sequences (UniProt).
• Provides data resources in all the major
molecular domains.
EMBL-EBI
• Indexes of databases and services:
– Databases
– Tools for Data Analysis Index
– Services
EMBL-EBI
• “The EMBL Nucleotide Sequence Database is
Europe's primary nucleotide sequence resource.
The main sources of the DNA and RNA
sequences in the database are direct
submissions from individual researchers,
genome sequencing projects and patent
applications.”
• Similar to GenBank
• Growth stats
UniProt Knowledgebase
• UniProtKB/Swiss-Prot:
– an annotated protein sequence database.
– Contains high-quality annotation, is non-redundant
and cross-referenced to many other databases
• UniProtKB/TrEMBL:
– a computer-annotated supplement to
UniProtKB/Swiss-Prot.
– Contains the translations of all coding sequences
present in the EMBL Nucleotide Sequence Database,
which are not yet integrated into Swiss-Prot.
Database entries
• Sequence entries are composed of
different line types, each with their own
format.
• Built in such a way that they are readable
both to humans and computers
• Three main formats:
– GenBank
– EMBL (used by EBI and Swiss-Prot)
– FASTA
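Because every line of an EBI/Swiss-Prot style entry starts with a two-letter line code, a computer can read an entry simply by grouping lines on that code. The following is a minimal sketch under that assumption; the function name is made up for this example, and real entries contain further line types and a sequence block that this sketch ignores.

```python
from collections import defaultdict


def group_by_line_code(entry_text: str) -> dict[str, list[str]]:
    """Group the lines of an EBI/Swiss-Prot style entry by two-letter code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip blank lines, the '//' terminator and indented sequence lines.
        if not code.strip() or code == "//":
            continue
        grouped[code].append(line[2:].strip())
    return dict(grouped)


# A few lines taken from the example entry shown in this lecture.
sample = """\
ID   RBL_AETCO   Reviewed;   483 AA.
AC   A4QJC3;
DE   RecName: Full=Ribulose bisphosphate carboxylase large chain;
DE   Short=RuBisCO large subunit;
OS   Aethionema cordifolium (Lebanon stonecress).
"""

fields = group_by_line_code(sample)
print(fields["AC"])
print(fields["DE"])
```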
Database entries
EBI and SWISSPROT
ID   RBL_AETCO   Reviewed;   483 AA.
AC   A4QJC3;
DT   11-SEP-2007, integrated into UniProtKB/Swiss-Prot.
DT   15-MAY-2007, sequence version 1.
DT   13-OCT-2009, entry version 18.
DE   RecName: Full=Ribulose bisphosphate carboxylase large chain;
DE   Short=RuBisCO large subunit;
DE   EC=4.1.1.39;
DE   Flags: Precursor;
GN   Name=rbcL;
OS   Aethionema cordifolium (Lebanon stonecress).
OG   Plastid; Chloroplast.
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; eurosids II; Brassicales; Brassicaceae; Aethionema.
OX   NCBI_TaxID=434059;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Hosouchi T., Tsuruoka H., Kotani H.;
RT   "Sequencing analysis of Aethionema coridifolium chloroplast DNA.";
RL   Submitted (MAR-2007) to the EMBL/GenBank/DDBJ databases.
CC   -!- FUNCTION: RuBisCO catalyzes two reactions: the carboxylation of D-
CC       ribulose 1,5-bisphosphate, the primary event in carbon dioxide
CC       fixation, as well as the oxidative fragmentation of the pentose
CC       substrate in the photorespiration process. Both reactions occur
CC       simultaneously and in competition at the same active site (By
CC       similarity).
CC   -!- CATALYTIC ACTIVITY: 2 3-phospho-D-glycerate + 2 H(+) = D-ribulose
CC       1,5-bisphosphate + CO(2) + H(2)O.
CC   -!- CATALYTIC ACTIVITY: 3-phospho-D-glycerate + 2-phosphoglycolate =
GenBank
LOCUS       A4QJC3   483 aa   linear   PLN 13-OCT-2009
DEFINITION  RecName: Full=Ribulose bisphosphate carboxylase large chain;
            Short=RuBisCO large subunit; Flags: Precursor.
ACCESSION   A4QJC3
VERSION     A4QJC3.1 GI:158513601
DBSOURCE    UniProtKB: locus RBL_AETCO, accession A4QJC3;
            class: standard.
            created: Sep 11, 2007.
            sequence updated: May 15, 2007.
            annotation updated: Oct 13, 2009.
            xrefs: AP009366.1, BAF49778.1, YP_001122954.1
            xrefs (non-sequence databases): GeneID:4968541, GO:0009507,
            GO:0000287, GO:0004497, GO:0016984, GO:0055114, GO:0009853,
            GO:0019253, HAMAP:MF_01338, InterPro:IPR000685, InterPro:IPR017443,
            InterPro:IPR017444, Gene3D:G3DSA:3.20.20.110,
            Gene3D:G3DSA:3.30.70.150, Pfam:PF00016, Pfam:PF02788,
            PROSITE:PS00157
KEYWORDS    Acetylation; Calvin cycle; Carbon dioxide fixation; Chloroplast;
            Disulfide bond; Lyase; Magnesium; Metal-binding; Methylation;
            Monooxygenase; Oxidoreductase; Photorespiration; Photosynthesis;
            Plastid.
SOURCE      chloroplast Aethionema cordifolium
  ORGANISM  Aethionema cordifolium
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids II; Brassicales; Brassicaceae; Aethionema.
FASTA
>gi|158513601|sp|A4QJC3.1|RBL_AETCO RecName: Full=Ribulose bisphosphate carboxylase large chain; Short=RuBisCO large subunit; Flags: Precursor
MSPQTETKASVGFKAGVKEYKLTYYTPEYETKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVW
TDGLTSLDRYKGRCYHIEPVPGEESQFIAYVAYPLDLFEEGSVTNMFTSIVGNVFGFKALAALRLEDLRI
PPAYTKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQP
FMRWRDRFLFCAEAIYKSQAETGEIKGHYLNATAGTCEEMIKRAVFARELGVPIVMHDYLTGGFTANTSL
AHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGKLEGDRESTLGFVDLLR
DDYVEKDRSRGIFFTQDWVSLPGVLPVASGGIHVWHMPALTEIFGDDSVLQFGGGTLGHPWGNAPGAVAN
RVALEACVQARNEGRDLAVEGNEIIREACKWSPELAAACEVWKEIRFNFPTIDKLDPSAEKVA
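FASTA is the simplest of the three formats: a header line starting with ">" followed by the sequence split over several lines. A minimal reader might look like the sketch below. The function name is made up for this example, and the sample is shortened to the first two sequence lines of the entry above.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened copy of the example entry (first two sequence lines only).
sample = """\
>gi|158513601|sp|A4QJC3.1|RBL_AETCO RecName: Full=Ribulose bisphosphate carboxylase large chain
MSPQTETKASVGFKAGVKEYKLTYYTPEYETKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVW
TDGLTSLDRYKGRCYHIEPVPGEESQFIAYVAYPLDLFEEGSVTNMFTSIVGNVFGFKALAALRLEDLRI
"""

header, sequence = parse_fasta(sample)[0]
print(header.split()[0])  # the identifier part of the header
print(len(sequence))
```

In practice a library such as Biopython is the usual choice for this job; the hand-written loop is shown only to make the structure of the format visible.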
GenBank/GenPept entry
• Records in GenBank are divided into two
parts:
– Annotation:
• General information: Accession, length & type
(aa/bp), Version, Database from which derived,
Source organism.
• References to articles and comments about the
sequence
• Features: Notes pertaining to sections of the
whole sequence.
– The sequence itself.
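The two-part structure can be illustrated with a short sketch. It assumes the standard GenBank flat-file layout, in which the sequence follows a line starting with ORIGIN and the record ends with "//"; these markers are not shown on the slides. The function name and the shortened sample record are illustrative only.

```python
def split_genbank_record(record: str) -> tuple[str, str]:
    """Split a GenBank/GenPept flat-file record into (annotation, sequence)."""
    annotation_lines, residues = [], []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines carry position numbers and spaces; keep letters only.
            residues.extend(ch for ch in line if ch.isalpha())
        else:
            annotation_lines.append(line)
    return "\n".join(annotation_lines), "".join(residues).upper()


# Shortened, hypothetical record based on the example entry in this lecture.
sample = """\
LOCUS       A4QJC3   483 aa   linear   PLN 13-OCT-2009
ACCESSION   A4QJC3
VERSION     A4QJC3.1 GI:158513601
ORIGIN
        1 mspqtetkas vgfkagvkey kltyytpeye tkdtdilaaf rvtpqpgvpp eeagaavaae
//
"""

annotation, sequence = split_genbank_record(sample)
print(annotation.splitlines()[0])
print(sequence)
```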
GenBank/GenPept entry
• Options to control search results: (Grey bar on
top).
• Format to display:
– GenPept (for protein) – full annotation and features.
– FASTA – just tag and sequence.
– Graph – graphical presentation of sequence and
features.
• How many results to show on the page (relevant
when browsing query results).
• Results can be sent to:
– The screen (default)
– A plain text display
– A file (for downloading many sequences at once)
GenBank/GenPept entry
• You can display a subsection of
the sequence. This is useful for
whole genomes or for large
contigs.
• “Links” opens a dropdown of
links to many other databases
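Displaying a subsection of a sequence comes down to slicing by position. Sequence coordinates in GenBank records count from 1 and include both ends, whereas Python slices count from 0 and exclude the end, so a small conversion is needed. The helper below is a hypothetical illustration and not part of any NCBI tool.

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Return residues start..end using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates out of range")
    # Shift the start by one; the exclusive slice end equals the inclusive end.
    return sequence[start - 1:end]


# First 20 residues of the example protein used in this lecture.
seq = "MSPQTETKASVGFKAGVKEY"
print(subsequence(seq, 3, 8))  # PQTETK
```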
GenBank/GenPept entry
• Further details in hands-on session
• The best way to learn is by trial and error…