Finding Eukaryotic Open reading frames.

Introduction
• The open reading frame (ORF) in prokaryotic DNA (test your application)
• The structure of the eukaryotic gene.
• Finding genes in eukaryotes
– ORF and problems with ORF
– First exon and first intron
– Distinguishing introns/exons (splice sites)
– Proximity of promoters (mentioned)
– Base pair patterns
– Homology with existing sequences.
ORF in prokaryotes (pal gene, E. coli)
Adapted from Understanding Bioinformatics 9.3
Using your assignment code
• Open the file: ORF pal gene.fasta
• Find all open reading frames. (This time you must modify your code to translate each codon; copy from convertor_hashtable.txt.)
• Compare to the file: pal protein sequence.fasta.
– Visually inspect the files.
• What conclusion can you draw?
– In which reading frame is the true ORF?
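One way the exercise above could be approached is sketched below. It is only an illustration, not the assignment solution: the function names are hypothetical, and the codon table is the standard genetic code built inline, standing in for whatever convertor_hashtable.txt contains.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
# Assumption: this replaces the table in convertor_hashtable.txt.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one upper-case string."""
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=1):
    """Yield (strand, frame, start, protein) for each ATG...stop ORF in six frames.

    frame is 1, 2 or 3; start is 0-based on the strand being read.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    # Unknown codons (e.g. containing N) are shown as X.
                    protein = "".join(
                        CODON_TABLE.get(seq[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    if len(protein) >= min_codons:
                        yield strand, frame + 1, start, protein
                    start = None
```

For the exercise, something like `find_orfs(read_fasta("ORF pal gene.fasta"), min_codons=50)` could be compared by eye with the protein file; the threshold of 50 codons is an arbitrary choice.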
Structure of a eukaryotic “gene”
The “basic” transcription/translation of a eukaryotic gene
An ORF in a eukaryote is a region of the DNA which “could be” a coding sequence (CDS) of a gene. It has a start codon (ATG) and an end codon, one of three (TAA, TAG, TGA).
The diagram shows the DNA sequence of a eukaryotic gene including promoter, UTR….
Eukaryotic expression showing exons/introns…, adapted from Zhang 2002
Structure of a eukaryotic CDS
ALDH10 gene: exon 1 contains a 5’ UTR
ORF in Eukaryotes
• In comparison to prokaryotic DNA, eukaryotic DNA differs as follows:
• Gene density is much lower; genes are further apart and density can vary significantly between chromosomes (~1.5% of human DNA is CDS).
• The mRNA is monocistronic (one promoter per gene; N.B. prokaryotic genes are generally organised in operons); moreover, a DNA sequence is transcribed into one mRNA sequence [this may not be true of viral DNA].
• The “ORF” in the DNA sequence contains both exons (translated sequence, CDS) and introns (non-CDS).
• The introns are spliced out to leave only exons.
Global Sequence
ORF in Eukaryotes
• Some impacts of these differences:
• You can no longer reliably translate an ORF into an AA sequence to give you the “true” protein (amino acid) sequence.
• The DNA sequence of an intron is like any non-coding region of the DNA in that the bases are just bases and should not be read as elements of a “codon”, so in an intron sequence:
– ATG does not represent a start codon
– TAA/TAG/TGA do not represent stop codons.
• This increases the complexity of determining the true ORF in DNA sequences because of “false positive” start and stop codons in introns; thus many attempts to find genes/ORFs are now done via mRNA (not pre-mRNA).
• The sizes of introns and exons need not be multiples of three. The impact of this on DNA ORF analysis is to “shift” the DNA reading frames.
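The “false positive” stop codon problem can be illustrated with a small sketch. The toy gene below is invented for illustration (two 6 bp exons around a 12 bp intron), and the helper names translate and splice are assumptions, not part of the course code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}


def translate(dna):
    """Translate from position 0 up to the first stop codon (or the end)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def splice(dna, exons):
    """Join exons given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(dna[start:end] for start, end in exons)


# Invented toy gene: exon 1, an intron containing an in-frame TAG, exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTAGCAG", "AAATGA"
gene = exon1 + intron + exon2

naive = translate(gene)                               # reads straight through the intron
spliced = translate(splice(gene, [(0, 6), (18, 24)]))  # introns removed first

print("DNA translated directly:", naive)    # MAVS
print("Spliced mRNA translated:", spliced)  # MAK
```

Translating the DNA directly stops at the TAG inside the intron and reports residues that are not in the real protein, which is the reason for working from mRNA where possible.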
Figure 9.2a: the CDK10 gene
Note that in this figure the ATG is shown in a red box (it is 12 nucleotides into the first exon). {This will not impact on the ORF but will mean an incorrect gene annotation: why?}
Only the first exon and intron are shown in full; for the rest, partial sequences of the introns are shown.
The full sequences can be found in the PISSRLE DNA sequence.
Frame shifts using the ORF technique
• When a pre-mRNA is spliced into mRNA all the exons have to be in one reading frame.
• However:
– the splice sites need not fall at the beginning or end of a codon;
– introns need not be multiples of three in size.
• What is the net effect of this when trying to “predict” the translation of proteins using DNA sequences?
– It can affect the translation of an exon.
– It can affect the starting residue of the following exon….
– It can mean the “translation” of one or more exons is carried out in the incorrect reading frame.
• The effect on manual translation depends on the starting position of the exon (correct reading frame), the length of the exon and the length of the intron.
– Refer to chapter 9 of Understanding Bioinformatics.
Predictive translation: effect of exon/intron length (example 1)
• Consider the following:
• We have an mRNA CDS 60 bp in length (start…stop)
[Diagram: DNA strand from ATG to TAA, with labels Exon A and Exon B: exon (9 bp), intron (30 bp), exon (21 bp), intron (45 bp), exon (30 bp)]
• Let us assume that intron 1:
– occurs at the end of codon three (position 9)
– is 30 bp in length.
• Intron 2:
– occurs at the end of codon 10 (position 30)
– is 45 bp in length.
• What is the effect on the translation of Exon A and Exon B?
Predictive translation: effect of exon/intron length (example 2)
• Consider the following:
• We have an mRNA CDS 60 bp in length (start…stop)
[Diagram: DNA strand from ATG to TAA, with labels Exon A and Exon B: exon (9 bp), intron (30 bp), exon (20 bp), intron (45 bp), exon (31 bp)]
• Let us assume that intron 1:
– occurs at the end of codon three (position 9)
– is 30 bp in length.
• Intron 2:
– occurs after position 29 (before the 3rd bp of codon 10)
– is 45 bp in length.
• What is the effect on the translation of Exon A and Exon B?
Predictive translation: effect of exon/intron length (example 3)
• Consider the following:
• We have an mRNA CDS 60 bp in length (start…stop)
[Diagram: DNA strand from ATG to TAA, with labels Exon A and Exon B: exon (9 bp), intron (30 bp), exon (21 bp), intron (43 bp), exon (30 bp)]
• Let us assume that intron 1:
– occurs at the end of codon three (position 9)
– is 30 bp in length.
• Intron 2:
– occurs at position 30 (the end of codon 10)
– is 43 bp in length.
• What is the effect on the translation of Exon A and Exon B?
Effect on translation
• Example 1: no effect, all lengths are multiples of 3.
• Example 2: the last residue of exon 2 is incorrect. The residues for exon 3 are correct (but its first complete codon starts at bp 2 of the exon).
• Example 3: the last exon is in a different reading frame.
• Refer to Incorrect_translation_examples.rar
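The arithmetic behind the three examples can be checked with a short sketch. The function name exon_frames is hypothetical; it uses only the exon and intron lengths given in the examples and assumes the ATG is the first base of the first exon.

```python
def exon_frames(exon_lengths, intron_lengths):
    """Return one (frame, ends_mid_codon) pair per exon.

    frame is the DNA reading frame (0, 1 or 2, counted from the A of the ATG)
    in which the exon's complete codons are read correctly.
    ends_mid_codon is True when the exon stops part-way through a codon.
    """
    results = []
    genomic_pos = 0  # offset from the ATG in the DNA (introns included)
    cds_pos = 0      # offset from the ATG in the spliced mRNA
    for i, exon_len in enumerate(exon_lengths):
        # Bases at the start of this exon that finish a codon split by the intron.
        lead = (-cds_pos) % 3
        frame = (genomic_pos + lead) % 3
        cds_pos += exon_len
        genomic_pos += exon_len
        results.append((frame, cds_pos % 3 != 0))
        if i < len(intron_lengths):
            genomic_pos += intron_lengths[i]
    return results


examples = {
    "Example 1": ([9, 21, 30], [30, 45]),
    "Example 2": ([9, 20, 31], [30, 45]),
    "Example 3": ([9, 21, 30], [30, 43]),
}
for name, (exons, introns) in examples.items():
    print(name, exon_frames(exons, introns))
# Example 1 [(0, False), (0, False), (0, False)]
# Example 2 [(0, False), (0, True), (0, False)]
# Example 3 [(0, False), (0, False), (1, False)]
```

Example 2 keeps every exon in frame 0 but splits the last codon of the second exon across the intron; example 3 moves the last exon into frame 1 because the 43 bp intron is not a multiple of three.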
Predicting exons: ALDH10 gene
The diagram shows exon 1 and exon 2 of the ALDH10 gene.
The correct coding sequence is shown in upper case; the second ATG is the actual start codon.
The sequences can be found in the sample sequence files.
What is the length of each exon (CDS)?
Consider what may happen if you applied a translation to each of the reading frames.
Exon 1 is at positions 1352-1762; exon 2 is at 2169-2400.
The position of the actual ATG is 1610.
Figure 9.7: Understanding Bioinformatics
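As a rough worked calculation for the questions above, the sketch below derives lengths from the quoted positions. It assumes the coordinates are 1-based and inclusive and that the intron is exactly the gap between the two exons; neither assumption is stated on the slide.

```python
# Positions quoted on the slide (assumed 1-based, inclusive).
exon1 = (1352, 1762)
exon2 = (2169, 2400)
atg = 1610


def length(start, end):
    """Length of an inclusive coordinate range."""
    return end - start + 1


exon1_len = length(*exon1)              # 411 bp
exon2_len = length(*exon2)              # 232 bp
utr5_len = atg - exon1[0]               # 258 bp of exon 1 lie before the ATG
exon1_coding_len = length(atg, exon1[1])  # 153 bp = 51 complete codons
intron1_len = exon2[0] - exon1[1] - 1   # 406 bp, not a multiple of three

# Frame of exon 2 in the DNA, relative to the frame set by the ATG.
exon2_frame_offset = (exon2[0] - atg) % 3  # 1

print(exon1_len, exon2_len, utr5_len, exon1_coding_len, intron1_len)
print("exon 2 frame offset relative to the ATG:", exon2_frame_offset)
```

Under these assumptions the coding part of exon 1 ends on a codon boundary, but the 406 bp intron puts exon 2 one base out of frame in the DNA, so a single reading frame cannot translate both exons correctly.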
Finding exons / coding regions
• To avoid the previous issue(s) it is imperative to identify splice sites:
• Identify start and stop signals (refer to Zhang 2002, Chasin 2007)
– Initial exon (start and 5’ splice site)
– Internal exon (3’ and 5’ splice sites)
– Terminal exon (3’ splice site and stop codon)
• Identify splice junctions:
– The 5’ splice junction is in general GT.
– The 3’ splice junction is in general AG.
• Refer to exon 1 and exon 2 of the ALDH10 gene in the previous slide:
– Exon 1 is at positions 1352-1762; exon 2 is at 2169-2400.
– The position of the actual ATG is 1610.
– (1352-1610: the 5’ UTR of exon 1, up to the translation initiation site)
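The GT/AG rule can be checked against a known exon layout with a few lines of code. This is a sketch only: the helper name intron_boundaries and the toy sequence are illustrative assumptions, and exon coordinates are taken to be 1-based and inclusive.

```python
def intron_boundaries(dna, exons):
    """Return the (first two, last two) bases of each intron.

    exons is a list of 1-based, inclusive (start, end) pairs in order.
    """
    pairs = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        # 1-based inclusive intron end1+1 .. start2-1 as a 0-based slice.
        intron = dna[end1:start2 - 1]
        pairs.append((intron[:2], intron[-2:]))
    return pairs


# Invented toy gene: 6 bp exon, 12 bp intron, 6 bp exon.
toy_gene = "ATGGCC" + "GTAAGTTAGCAG" + "AAATGA"
print(intron_boundaries(toy_gene, [(1, 6), (19, 24)]))  # [('GT', 'AG')]
```

The same call could be made with the ALDH10 sequence and the exon positions quoted above to see whether the intron follows the general GT…AG pattern.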
Splice site prediction
• While GT and AG are the general 5’ and 3’ splice signals, such dinucleotides are clearly not uncommon in any sequence: in fact there is a high rate of false positives (Understanding Bioinformatics p. 392).
Figure 9.10, Understanding Bioinformatics: SpliceView… are prediction programs.
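To get a feel for the false positive rate, the sketch below lists every GT…AG pair in a sequence as a candidate intron. This is a naive enumeration for illustration, not how prediction programs such as those in the figure work; the function name and the minimum length of 4 bp are arbitrary assumptions.

```python
import re


def candidate_introns(dna, min_len=4):
    """Return (start, end) for every GT...AG span of at least min_len bases.

    start is the 0-based index of the G of GT; end is exclusive (just after AG).
    """
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Invented toy gene: the only real intron is bases 6-17 (0-based, end-exclusive 18).
toy_gene = "ATGGCC" + "GTAAGTTAGCAG" + "AAATGA"
candidates = candidate_introns(toy_gene)
print(len(candidates), candidates)
# 5 [(6, 11), (6, 15), (6, 18), (10, 15), (10, 18)]
```

Even in 24 bases there are five candidates for a single true intron, which is why the dinucleotides alone are not enough to predict splice sites.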
Proximity of promoters
• Basically a true CDS (ORF) will have a promoter region nearby:
– Promoters in prokaryotes have well-defined bp sequences (motifs) upstream of the CDS (true ORF):
• The Pribnow box: TATAAT at about position -10
• TTGACA at about position -35
• An AT-rich region before this box.
– Eukaryotic promoters are more complex: there is more than one…
• TATA box
• CAAT box
• GC-rich regions
• Conversely, the presence of an ORF indicates that there should be a promoter close by. (Bioinformatics 1 will cover promoter prediction in greater detail in the next lecture.)
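A very simple way to look for a promoter motif such as the Pribnow box is to slide the consensus along the upstream sequence and allow a small number of mismatches. The sketch below is an assumption-laden illustration (the function name and the one-mismatch tolerance are arbitrary); real promoter prediction is more involved.

```python
def find_motif(dna, motif, max_mismatches=1):
    """Return (position, matched text, mismatches) for each near-match of motif.

    position is the 0-based index of the first base of the match.
    """
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Invented upstream fragments, one exact and one imperfect Pribnow box.
print(find_motif("CCTATAATCC", "TATAAT", max_mismatches=0))  # [(2, 'TATAAT', 0)]
print(find_motif("CCTATGATCC", "TATAAT"))                    # [(2, 'TATGAT', 1)]
```

A hit would then be judged by its distance from the putative start of the ORF, for example whether it lies roughly 10 bp upstream.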
Base pair sequences in exons/introns
• The DNA sequence of a gene’s CDS contains a different ratio of bases from the non-CDS parts of the gene or from non-genic DNA. (The student is expected to research this.)
• So the ratios of bases to each other, and specific base sequences, differ between exons, introns and other non-coding DNA. (Remember that in non-CDS regions there are no codons.)
• If the student requires greater detail, supplementary material can be found in Zhang 2002 and the other references at the end of chapters 9 and 10 in Understanding Bioinformatics.
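As a starting point for researching base ratios, the sketch below computes base fractions and GC content, overall and in windows along a sequence. The function names and the window sizes are illustrative assumptions; it expects sequences containing at least one A, C, G or T.

```python
from collections import Counter


def base_fractions(dna):
    """Return the fraction of A, C, G and T in a sequence (other symbols ignored)."""
    counts = Counter(dna.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(dna):
    """Return the combined fraction of G and C."""
    fractions = base_fractions(dna)
    return fractions["G"] + fractions["C"]


def windowed_gc(dna, window, step):
    """Return (start, GC content) for each full window along the sequence."""
    return [
        (i, gc_content(dna[i:i + window]))
        for i in range(0, len(dna) - window + 1, step)
    ]


print(base_fractions("AAAT"))         # {'A': 0.75, 'C': 0.0, 'G': 0.0, 'T': 0.25}
print(windowed_gc("GGGGAAAA", 4, 4))  # [(0, 1.0), (4, 0.0)]
```

Comparing such profiles for known exons, introns and intergenic regions is one way to see whether the composition differs in a given genome.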
Homology in coding regions
• The CDS sequences of genes are generally highly conserved: hypothesise why this is the case.
• As in prokaryotic DNA, the CDS sequence is highly conserved, so database searches can help determine exons and thus the ORF.
• A possible exon region can be extracted and submitted to a search for similar sequences (BLAST search) to see what it may reveal.
• If there is highly probable similarity to existing exons then it is likely to be a true exon.
• An exon can also be translated and the translated sequence submitted to a search for homologs. (The SWISS-Prot BLAST search should be used as it contains experimentally determined AA sequences.)
Alternative splicing
• The diagram shows the four main types of alternative splicing.
• It clearly indicates that the pre-mRNA is not the same as the mRNA (so direct translation from the DNA is fraught with danger).
• Homology analysis and the use of expressed sequence tags (from mRNA produced by genes at different times and in different tissue types) can help determine the different splice variants.
• Can you think of any issues that may arise, using ORF, if there is alternative splicing?
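One issue worth exploring for the question above is what happens to the reading frame when an internal exon is skipped. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the lengths are the coding lengths of each exon, and only single-exon skipping is considered.

```python
def skipped_exon_effect(exon_lengths):
    """Map each internal exon number (1-based) to whether skipping it keeps the frame.

    Skipping an exon leaves the downstream exons in frame only when the
    skipped length is a multiple of three.
    """
    return {
        i + 1: exon_lengths[i] % 3 == 0
        for i in range(1, len(exon_lengths) - 1)
    }


# Exon lengths reused from the earlier frame-shift examples.
print(skipped_exon_effect([9, 21, 30]))  # {2: True}
print(skipped_exon_effect([9, 20, 31]))  # {2: False}
```

A single ORF read from the DNA cannot represent both outcomes, so each splice variant needs its own mRNA-level translation.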
References
• Baxevanis, A.D. 2005. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley; Chapter 5. [book is in the library]
• Klug, W.A. et al. 2010. Concepts of Genetics. Pearson Education; p. 596-597.
• Zhang, M.Q. 2002. Computational prediction of eukaryotic coding genes. Nat. Rev. Genet. 3: 698-709.
• Chasin, L.A. 2007. Searching for splicing motifs. Adv. Exp. Med. Biol. 623: 85-106.
• Zvelebil, M. Understanding Bioinformatics; Chapter 9. [book is in the library]