Gene Finding

Biological Background
The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein
Background
*Essential Cell Biology; p.268
Non-coding regions: gene regulation
 Vicinity of the TSS (transcription start site): direct interactions with the Pol II complex
 Larger vicinity – indirect interactions (chromatin remodelling)
The Genetic Code
[Figure: codon table, indexed by first letter, second letter, and third letter]
tRNA – Responsible for Translation
Adapted from Genetic Analysis V, p.388
Frame Shifts
 Code triplets (“codons”) do not overlap
 There are 3 x 2 = 6 possible reading frames: three
start offsets on each of the two strands
 This is not just our concern when looking for genes;
it is also the cell’s concern, because an insertion or
deletion shifts the reading frame:
 Original: THE FAT CAT ATE THE BIG RAT
 Delete C: THE FAT ATA TET HEB IGR AT
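
The six reading frames can be enumerated with a short Python sketch. The helper names (`reverse_complement`, `codons`, `six_frames`) are illustrative and not part of the lecture, and the input is assumed to be an uppercase string over A, C, G, T.

```python
# Translation table for complementing an uppercase DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (uppercase ACGT only)."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into non-overlapping triplets starting at offset.

    An incomplete triplet at the end is dropped.
    """
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map (strand, offset) to the codon list of each of the 3 x 2 frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = codons(s, offset)
    return frames

if __name__ == "__main__":
    for key, frame in six_frames("ATGGCCTAAGT").items():
        print(key, " ".join(frame))
```

The same `codons` helper reproduces the frame-shift example: calling it on the sentence with the C removed regroups every later triplet.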

Prokaryotic Gene Finding
 No nucleus
 Most DNA is coding (e.g. 70% in H. influenzae)
 Each gene is one continuous DNA sequence (no
introns)
 In eukaryotes: Pol I – rRNA, Pol II – mRNA, Pol III – tRNA (prokaryotes use a single RNA polymerase)
Detecting ORFs
Simple idea:
If no gene is encoded, STOP codons are expected at a
frequency of 3/64, i.e. about one in every 21 codons
ORF – open reading frame: a sequence of
codons with no STOP codon
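
To see why long ORFs are suspicious, the 3/64 figure can be turned into numbers. The sketch below assumes codons are drawn independently and uniformly, which is a simplification of real non-coding DNA. The function names are illustrative.

```python
# 3 of the 64 codons are stops (TAA, TAG, TGA).
STOP_FRACTION = 3 / 64

def prob_no_stop(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons

def expected_codons_until_stop():
    """Mean number of codons read up to and including the first stop.

    This is the mean of a geometric distribution with success
    probability STOP_FRACTION.
    """
    return 1 / STOP_FRACTION

if __name__ == "__main__":
    for n in (20, 50, 100, 300):
        print(n, prob_no_stop(n))
    print(expected_codons_until_stop())
```

Under this model a stop-free run of 100 codons arises by chance less than 1% of the time, which is the rationale for using a length threshold.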
Simple algorithm:
1. Scan until you find a stop codon, in all reading
frames.
2. Scan back to find a start codon.
3. If it’s long enough, report this ORF as a putative
gene.
Cons:
Can’t detect short genes
High false-positive rate (E. coli has 6500 ORFs but only 1100 genes)
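
The simple algorithm above can be sketched in Python as follows. The details are assumptions, not taken from the lecture: ATG is the only start codon, "scan back" is read as taking the furthest upstream ATG since the previous stop, and the `min_codons` threshold of 100 is an arbitrary default. ORFs that run off the end of the sequence without a stop are not reported.

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, n_codons) for each putative ORF.

    start and end are 0-based, end-exclusive indices on the scanned
    strand (for '-', indices refer to the reverse complement).
    n_codons counts the start codon through the last codon before the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            start = None  # earliest ATG seen since the previous stop
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if start is not None:
                        n = (i - start) // 3
                        if n >= min_codons:
                            orfs.append((strand, start, i + 3, n))
                    start = None
                elif codon == START and start is None:
                    start = i
    return orfs

if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCC" * 5 + "TAA" + "G"
    print(find_orfs(demo, min_codons=3))
```

Lowering `min_codons` finds shorter genes but raises the false-positive rate, which is the trade-off listed under the cons.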
Coding vs. Non-coding Regions
Codon frequencies
 Codon usage in coding regions differs from that of random sequence
 Leucine, Alanine, and Tryptophan are coded by 6, 4, and 1
different codons, respectively
 Expect to see a ratio of 6:4:1 in random sequence
 In proteins they appear in a 6.9:6.5:1 ratio
 Another example:
A or T appears as the last letter of a codon in 90% of
cases in protein-coding regions
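
Both statistics are easy to measure on a candidate sequence. The following is a minimal sketch, assuming the input is an in-frame, uppercase coding sequence. The function names are illustrative. The codon sets are those of the standard genetic code.

```python
from collections import Counter

# Codons for each amino acid in the standard genetic code.
LEU = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}
ALA = {"GCT", "GCC", "GCA", "GCG"}
TRP = {"TGG"}

def codon_counts(cds):
    """Count non-overlapping codons in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def third_position_at_fraction(cds):
    """Fraction of codons whose third letter is A or T."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    at = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at / total if total else 0.0

def leu_ala_trp_ratio(cds):
    """Observed Leu : Ala : Trp codon counts, scaled so that Trp = 1.

    Returns None when the sequence contains no Trp codon.
    """
    counts = codon_counts(cds)
    leu = sum(counts[c] for c in LEU)
    ala = sum(counts[c] for c in ALA)
    trp = sum(counts[c] for c in TRP)
    if trp == 0:
        return None
    return (leu / trp, ala / trp, 1.0)

if __name__ == "__main__":
    demo = "CTGCTAGCTTGGTTA"
    print(leu_ala_trp_ratio(demo), third_position_at_fraction(demo))
```

Comparing these values between a candidate ORF and the expected random ratio of 6:4:1 gives a crude coding signal.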
Nucleotide MM (Markov Model) for Gene Detection
2nd-order MM
Idea: extend the model to capture codons
Results: poor. The triplets overlap in this model, so codon boundaries are not respected
MM over codons
Idea:
Transform the sequence into codons, then use a 1st-order MM
Why not use codon frequencies directly?
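
A first-order Markov model over codons can be sketched as below. This is one possible reading of the idea, not the lecture's implementation: the additive pseudocount, the uniform distribution for the first codon, and all function names are assumptions. Input is assumed to be uppercase ACGT and in frame.

```python
import math
from collections import defaultdict
from itertools import product

# All 64 codons, used as the states of the chain.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def to_codons(seq):
    """Split an in-frame sequence into non-overlapping codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def train(sequences, pseudocount=1.0):
    """Estimate log P(next codon | previous codon) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        cs = to_codons(seq)
        for prev, nxt in zip(cs, cs[1:]):
            counts[prev][nxt] += 1
    log_p = {}
    for prev in CODONS:
        total = sum(counts[prev].values()) + pseudocount * len(CODONS)
        log_p[prev] = {
            nxt: math.log((counts[prev][nxt] + pseudocount) / total)
            for nxt in CODONS
        }
    return log_p

def log_likelihood(seq, log_p):
    """Log-probability of the codon chain, with a uniform first codon."""
    cs = to_codons(seq)
    if not cs:
        return 0.0
    score = -math.log(len(CODONS))
    for prev, nxt in zip(cs, cs[1:]):
        score += log_p[prev][nxt]
    return score

if __name__ == "__main__":
    model = train(["ATGGCCATGGCC"])
    print(log_likelihood("ATGGCC", model), log_likelihood("ATGAAA", model))
```

To discriminate coding from non-coding sequence, one would train one such model on each class and compare the two log-likelihoods for a candidate region.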
“Codon Preferences” program:
Uses a window of 25 codons around each point
Score: $\log\left(\frac{P}{1-P}\right)$
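
The slide does not define P, so the following Python sketch rests on an assumption: P is taken to be the posterior probability that a window is coding, given equal priors and two codon frequency tables (coding and background). Under that assumption, log(P / (1 - P)) reduces to a sum of per-codon log-likelihood ratios. The tables `coding_freq` and `background_freq` are hypothetical inputs. This is not a reimplementation of the actual “Codon Preferences” program.

```python
import math

def window_scores(seq, coding_freq, background_freq, window=25):
    """Log-odds score log(P / (1 - P)) for each window of `window` codons.

    Scans one reading frame (offset 0) and returns a list of
    (codon_index_of_window_centre, score). Both frequency tables must
    give a positive value for every codon that occurs in seq.
    """
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Per-codon log-likelihood ratio: coding model vs background model.
    llr = [math.log(coding_freq[c] / background_freq[c]) for c in codons]
    scores = []
    for start in range(0, len(codons) - window + 1):
        centre = start + window // 2
        scores.append((centre, sum(llr[start:start + window])))
    return scores

if __name__ == "__main__":
    # Toy two-codon tables, for illustration only.
    coding = {"GCC": 0.8, "AAA": 0.2}
    background = {"GCC": 0.5, "AAA": 0.5}
    print(window_scores("GCCGCCAAA", coding, background, window=2))
```

Positive scores mark windows that look more like the coding table, negative scores windows that look more like background.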
Using the Promoter’s Signal
 We are still far from perfect…
 Idea: try to detect signals in the promoter regions,
to help discriminate real genes among ORFs
 Prokaryotes:
~-35 from the TSS: TTGACA
~-10 from the TSS: TATAAT (“TATA box” signal)
 No single promoter has the exact consensus
 Nearly all promoters have 2-3 of the three conserved letters in TAxyzT
 80-90% have all 3
 In 50%, xyz = TAA
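
Since no promoter matches the consensus exactly, a scan has to tolerate mismatches. The sketch below searches an upstream region for approximate matches to the two hexamers from the slide and pairs them when the gap between them is plausible. The spacer limits (15 to 19 bases), the mismatch budget, and the function names are assumptions chosen for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_consensus(seq, consensus, max_mismatches=1):
    """Return (position, site, mismatches) for each approximate match."""
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        site = seq[i:i + k]
        d = hamming(site, consensus)
        if d <= max_mismatches:
            hits.append((i, site, d))
    return hits

def candidate_promoters(upstream, min_spacer=15, max_spacer=19,
                        max_mismatches=1):
    """Pair -35-like and -10-like sites separated by a plausible spacer.

    Returns (pos35, site35, pos10, site10) tuples, with 0-based
    positions in `upstream`.
    """
    minus35 = scan_consensus(upstream, "TTGACA", max_mismatches)
    minus10 = scan_consensus(upstream, "TATAAT", max_mismatches)
    pairs = []
    for i, site35, _ in minus35:
        for j, site10, _ in minus10:
            spacer = j - (i + 6)  # bases between the two hexamers
            if min_spacer <= spacer <= max_spacer:
                pairs.append((i, site35, j, site10))
    return pairs

if __name__ == "__main__":
    demo = "GG" + "TTGACA" + "C" * 17 + "TATGAT" + "GGGG"
    print(candidate_promoters(demo))
```

A hit upstream of a candidate ORF adds evidence that the ORF is a real gene, while an ORF with no promoter-like signal nearby is a weaker candidate.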
Summary So Far
 We have seen the problems in trying to find genes
in a genome-wide scan – in prokaryotes!
 The bottom line is that the problem is not really
solved, but most research in gene finding focuses on
eukaryotes, where the main interest lies…
 Next lecture: much more sophisticated models, to
handle the much more complex situation in
eukaryotes in general, and humans in particular