Transcript gene - LICH

SEQUENCE
DATABASES
Daniel Svozil
Primary sequence databases
• All published genome sequences are available over the
internet
• requirement of every scientific journal
• Main resources (primary databases, big three):
• NCBI database (GenBank) (www.ncbi.nlm.nih.gov)
• European Molecular Biology Laboratory (EMBL) database
(www.ebi.ac.uk/embl)
• DNA Database of Japan (DDBJ) (www.ddbj.nig.ac.jp)
• DDBJ/EMBL/GenBank – form The International Nucleotide
Sequence Database Collaboration (INSDC, http://insdc.org)
• Contain all publicly available nucleotide sequences and their
protein translations.
• They exchange data nightly, so contain essentially the same data.
GenBank
• http://www.ncbi.nlm.nih.gov/genbank/, Nucleotide in drop-down menu
• local copy of DB – release (every 2 month in GenBank)
• 15 February 2012, release 188.0
• 137,384,889,783 bases, from 149,819,246 reported sequences, cca 100,000
organisms
• exponential growth, doubling every 18 months
• Direct submissions from individual laboratories, as well as
bulk submissions from large-scale sequencing centers
• disadvantage
• Primary database contain experimental results (with some
interpretation – annotation) but are not curated.
• There is no guarantee about data quality.
• Curated reviews are found in secondary databases.
• Sequences are identified by an accession number.
• Unique, reported in scientific papers describing that sequence,
combination of letters and numbers
• e.g. X01714 (1+5 variety), GL000191 (6+2 variety)
GenBank - prokaryotic gene
• Prokaryotes
• genome: circular DNA
• genome size: 0.6-8 Mb
• gene density: 1 gene per 1 kb
• 70% coding for proteins
• no overlap between genes
• genes transcribed right after the promoter
• genes are single piece (no splicing)
• Low variability in prokaryotic promoters. Typical promoter: Pribnow
box, -10, T80A95T45A60A50T96
• Protein sequences are derived by translating the longest open
reading frame ORF (from ATG to STOP) spanning the gene-transcript
sequence.
• The mRNA sequence gets translated into a protein after a special
signal, called the Ribosome Binding Site (RBS).
Bioinformatics for Dummies
TATA
sense strand
antisense strand
Transcription runs in the 5’ → 3’ of newly synthesized RNA strand.
Is assymetric – only one DNA strand is transcribed ([-], template, non-coding, antisense)
[+], nontemplate, coding, sense – sequence identical with the RNA
coding, sense – term is related to the resulting protein (mRNA is coding for protein, it
makes sense by determining the amino-acid sequence)
1st nucleotide of transcribed RNA corresponds to DNA nucleotide +1.
Sequence before this point (i.e. opposite to the flow of transcription) – upstream (-)
Sequence behind this point (i.e. in the direction of transcription) – downstream (+)
Tara Robinson, Genetics for Dummies
GenBank - prokaryotic gene
• http://www.ncbi.nlm.nih.gov/genbank/
• Search for X01714.
• GenBank entry is refered to as flat-file format.
• It’s called so because you can read it in linear fashion, it does not
involve indexes, pointers (well, actually it contains several
hyperlinks, it is not 100% flat).
The header
• LOCUS – the locus name (an arbitrary name), sequence
•
•
•
•
•
length, molecule type, division code (classification), date
of last modification.
DEFINITION – short definition of the gene.
ACCESSION – refered to as the primary accession
number.
VERSION – accession.version, GI (geninfo identifier,
GenBank specific, accession.version is now preferred)
KEYWORDS – list of terms broadly characterizing the
entry, historical reasons, no controlled vocabulary, not
used in new records
SOURCE – common name of the organism.
The header
• ORGANISM – formal scientific name for the source
organism (genus and species) and its lineage, based on
the phylogenetic classification scheme used in the NCBI
Taxonomy Database http://www.ncbi.nlm.nih.gov/Taxonomy/
• SOURCE vs. ORGANISM: baker's yeast vs. Saccharomyces
cerevisiae, search for each of these will yield the same results
• REFERENCE – at least one
• COMMENT – optional, some comment 
The features table
The features table
• Direct representation of the biological information in the
•
•
•
•
•
record.
Feature key (which biological property), location
information (where the feature is located), additional
qualifiers.
source – mandatory, origin of specific regions of the
sequence, useful when you want to distinguish cloning
vectors from host sequences.
promoter – coordinates of a promoter element. In X01714,
a –35 region is in 286..291, another promoter -10 region
misc_feature – in this case the putative location of the
transcription start (mRNA synthesis)
RBS (Ribosome Binding Site) – the location of the last
upstream element
The features table
• CDS (CoDing Segment) – describes the gene’s open
reading frame (ORF):
• The first line indicates the coordinates of the ORF from its initial
ATG to the last nucleotide of the first stop codon TAA (343 to 798).
• Each of the following lines (indented at the same level) gives the
name of a protein product, indicates the reading frame to use (here,
343 is the first base of the first codon), the genetic code to apply
(/transl_table), and a number of IDs for the protein sequence.
• /translation introduces the conceptual amino-acid sequence of the
coding segment. This sequence is a computer translation that uses
the coordinates, reading frame, and genetic code indicated in the
preceding lines.
The sequence section
• Starts with ORIGIN, ends with //
• Each line contains 60 nucleotides
• 1st nucleotide gets number 1
• Save in FASTA format (Display/FASTA (text))
• single line description starts with >, should be shorter than
80 characters
• *.fasta, *.fa, *.seq, *.fsa
GenBank - eukaryotic gene
• Eukaryotes
• genome: multiple pieces – chromosomes
• genome size: 10 Mb – 670 Gbp
• gene density: 1 gene per 100 kb in human, much lower than prok.
• genome is not efficient – less than 5% codes for proteins in human
• genes on opposite DNA strands might (rarely) overlap
• genes transcribed right after the promoter, but sequence elements
located far away can have a strong influence on this process
• splicing – exons + introns
• alternative splicing (1 gene = more proteins, 30 000 genes in
human result in 90 000 proteins)
Tara Robinson, Genetics for Dummies
GenBank - eukaryotic gene
• http://www.ncbi.nlm.nih.gov/genbank/
• Search for U90223.
• I carefully chose mRNA sequence, not genomic
sequence. Thus this entry is not that complex, no exons
etc.
• sig_peptide – location of a mitochondrial targeting
sequence
• mat_peptide – exact boundaries of the mature peptide
GenBank - eukaryotic gene
• http://www.ncbi.nlm.nih.gov/genbank/
• Search for AF018430 – gene from which the previous
mRNA sequence originated.
• This sequence is still rather simple, but it already contains
eukaryotic-specific entries.
• SEGMENT: This field relates to the mosaic structure of
eukaryotic genes. It indicates that this current GenBank
entry is the second segment of a super entry made of
four. You need all four entries to reconstruct the complete
mRNA sequence used as a template for producing the
protein
• The source section contains a /map section. For
AF018430, it indicates that the sequence belongs to
chromosome 15, and was more precisely mapped on the
long arm (q) of this chromosome, within th q21.1
cytogenetic band.
• gene – describe precisely the reconstruction of the
various mRNAs spread over several separate entries
• order – exon splicing recipe: take nucleotides from positions 1 to
1735 from entry AF018429, add nucleotides from positions 1 to
1177 from the current entry, …
• The < indicates that the gene might actually start before the
indicated position, the > indicates that the gene might actually
continue beyond the indicated position.
• mRNA – alternative splicing
Bioinformatics for Dummies
• exon – the position of the sole exon present in this
sequence
• search AF018432 – multiple exons in a single entry
• You get accession numbers by reading articles reporting
about the sequence.
• After you’ve accessed the first GenBank entry relevant to
your work, you can retrieve other related genes.
• search U90223
• Display – Summary – Related sequences
• Retrieved: various mRNA forms and partial sequences, partial
genomic sequences (around exons), and two large (154kb and
192kb) sequences of the 15q21.1 genomic region. There are some
monkey sequences as well!
Sample GenBank Record
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
No accession number
• GenBank is not the best database for keyword-based
searches (gene-centric databases are).
• Querying database by gene or protein keywords is still
possible, but not that reliable.
• you want to find the nucleotide sequence encoding the
human dUTPase
• sarch for human [organism] AND dUTPase [Protein name]
• find related sequences to AF018432. How many entries?
• dUTPase is a nickname for which protein?
• “dUTP pyrophosphatase”
• sarch for human [organism] AND “dUTP pyrophosphatase” [Title]
• Are the resulting entries same as in the previous search?
• This illustrates the general difficulty in retrieving all entries relevant to a
given subject, due to inconsistent usage of synonymous terms.
• many of the resulting entries are ESTs
• Limits – Exclude ESTs
RefSeq
• Many sequences are more than once in GenBank –
redundancy.
• NCBI developed RefSeq collection – a curated
secondary database
• aim: provide comprehensive, integrated, nonredundant set of
sequences
• For each model organism, RefSeq provides separate and
linked records for the genomic DNA, the gene transcripts,
and the proteins arising from those transcripts.
• RefSeq is limited to major organisms for which sufficient
data is available (more than 16,000 distinct “named”
organisms as of Jan 2012), while GenBank includes
sequences for any organism submitted.
RefSeq
• RefSeq entries – distinct accession number format, “2 +
6” with underscore
• e.g. NC_001477
Category
Description
NT
genomic contigs
NC
complete genomic molecules
NG
incomplete genomic region
NM
mRNA
NR
ncRNA
NP
protein
Reference: http://www.ncbi.nlm.nih.gov/RefSeq/key.html
Practise
Search the nucleotide domain of Entrez for breast cancer.
Provide the following information:
1. The number of Core nucleotide sequence records associated
with breast cancer
2. Number of the above sequence records that are from the
RefSeq database
3. Number of Core nucleotide sequences associated with
breast cancer that are mRNAs
4. Number of the above mRNA sequence records that are in
RefSeq and the words breast cancer appears in their titles
5. Number of Human gene BRCA1 RefSeq mRNA sequence
records with the words breast cancer in their titles
Practise
6. Accession numbers of the mRNA records for human
7.
8.
9.
10.
11.
12.
BRCA1a gene in RefSeq
Total number of nucleotides reported in 1st transcript
variant
Exact chromosomal location of BRCA1a on the human
genome
Number of times this sequence was updated
Total number of amino acids that this mRNA encodes
Identify the last amino acid in the encoded protein
Total number of Exons and Introns in this gene variant
(BRCA1a)
Practise
13. BRCA1a encodes full length BRCA1 protein (isoform 1).
14.
15.
16.
17.
How many variants (isoforms) there exist for BRCA1?
Also provide the accession numbers of their mRNA
sequences in RefSeq.
How is the “isoform 2” different from “isoform 1”?
Exact location (nucleotide position) of the BRCA1a start
codon
Exact location and sequence of the stop codon in
BRCA1a
What is the sequence of polyA signal for BRCA1a?
Querying the NCBI database
Gene-centric databases
• Sequence databases are great tools when you want to
come up with a bibliography for a particular sequence.
• However, they do not provide easy access to sequence
data when your query deals with broader issues related to
a gene or function.
• The second-generation nucleotide-sequence databases
have adopted a more gene-centric perspective.
• all the sequence information relevant to a given gene is made
accessible at once
• NCBI Entrez Gene
Genome-centric databases
• Nucleotide sequences are routinely determined at the
whole genome or chromosome scale – at least for
microorganisms
• We now have information not only about individual gene
sequences, but also e.g. about their relative positions or
strand orientation.
• To take advantage of this more global information,
researchers have had to design state-of-the-art genomecentric sequence-information management systems that
can connect specialized sequence collections with
browsing tools.