Gene Finding
Biological Background
The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein
Background
*Essential Cell Biology; p.268
Non-coding regions: gene regulation
Vicinity of the TSS (transcription start site): direct interactions with the Pol II complex
Larger vicinity: indirect interactions (chromatin remodelling)
The Genetic Code
[Figure: codon table, indexed by the first, second and third letter of each codon]
tRNA – Responsible for Translation
Adapted from Genetic Analysis V, p.388
Frame Shifts
Code triplets (“codons”) do not overlap
3x2 = 6 possible ways of reading, depending on
the strand and the relative position where reading
starts
This is not just our concern when looking for genes;
it is also the cell’s concern in terms of mutations:
Original: THE FAT CAT ATE THE BIG RAT
Delete C: THE FAT ATA TET HEB IGR AT
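The six reading frames and the effect of a single deletion can be sketched in a few lines of Python. This is only an illustration: the helper names (`reverse_complement`, `triplets`, `six_frames`) are made up for this example and are not part of the lecture.

```python
def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3' (assumes only A, C, G, T)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(dna))


def triplets(seq: str, offset: int = 0, keep_partial: bool = False) -> list[str]:
    """Split seq into non-overlapping triplets starting at offset.

    A trailing fragment shorter than 3 letters is dropped unless
    keep_partial is True.
    """
    end = len(seq) if keep_partial else len(seq) - 2
    return [seq[i:i + 3] for i in range(offset, end, 3)]


def six_frames(dna: str) -> dict[tuple[str, int], list[str]]:
    """Codons for all 3 x 2 reading frames: 3 offsets on each of 2 strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = triplets(seq, offset)
    return frames


# The slide's frameshift example: deleting one letter shifts every later triplet.
original = "THEFATCATATETHEBIGRAT"
mutated = original[:6] + original[7:]  # delete the C of "CAT"
print(" ".join(triplets(original, keep_partial=True)))
print(" ".join(triplets(mutated, keep_partial=True)))
```

The two printed lines reproduce the "Original" and "Delete C" readings above; `six_frames` gives the six candidate codon sequences a gene finder has to consider for any stretch of DNA.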
Prokaryotes Gene Finding
No nucleus
Most DNA is coding (e.g. 70% in H. influenzae)
Each gene is one contiguous DNA sequence (no
introns)
Pol I – rRNA, Pol II – mRNA, Pol III – tRNA (these are the eukaryotic RNA polymerases; prokaryotes use a single RNA polymerase)
Detecting ORFs
Simple Idea:
If no gene is encoded, the expected frequency
of STOP codons is 3/64 (3 of the 64 possible codons)
ORF (open reading frame): a sequence of
codons with no STOP codon
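To see why length is a useful signal, one can compute how likely a stop-free run is by chance. The sketch below assumes that all 64 codons are equally likely and independent in non-coding DNA, which is a simplification; the function names are hypothetical.

```python
import math

P_STOP = 3 / 64  # chance that a random codon is a STOP, under the uniform assumption


def prob_no_stop(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no STOP."""
    return (1 - P_STOP) ** n_codons


def min_codons_for(alpha: float) -> int:
    """Smallest run length whose chance probability is at most alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - P_STOP))


print(1 / P_STOP)            # expected spacing between STOP codons, in codons
print(prob_no_stop(100))     # chance of a 100-codon open stretch
print(min_codons_for(0.01))  # length needed for a 1% chance level
```

Under this model a STOP is expected roughly every 21 codons, so long open stretches are unlikely to arise by chance, while short genes cannot be distinguished from background.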
Simple Algorithm:
1. Scan until you find a stop codon, in all reading
frames.
2. Scan back to find a start codon.
3. If it’s long enough, report this ORF as a putative
gene.
Cons:
Can’t detect short genes
High false-positive rate (E. coli has 6500 ORFs but only 1100 genes)
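The simple algorithm above can be sketched as follows. Several choices here are assumptions rather than part of the lecture: ATG is the only start codon considered, "scan back" is taken to mean the most upstream ATG since the previous stop, and the `min_codons` threshold is arbitrary.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def reverse_complement(dna: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(dna))


def find_orfs(dna: str, min_codons: int = 100) -> list[tuple[str, int, int, int]]:
    """Return (strand, start, end, n_codons) for each putative gene.

    start/end are 0-based, end-exclusive indices into the scanned strand
    ('-' means the reverse complement); end includes the stop codon,
    n_codons does not. Open stretches that run off the end of the
    sequence without a stop codon are not reported.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            open_start = frame  # first codon position after the previous stop
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] not in STOP_CODONS:
                    continue
                # Step 2: look for the most upstream start codon in the
                # stop-free stretch that ends at this stop codon.
                for start in range(open_start, pos, 3):
                    if seq[start:start + 3] == START_CODON:
                        n_codons = (pos - start) // 3
                        # Step 3: report only if long enough.
                        if n_codons >= min_codons:
                            orfs.append((strand, start, pos + 3, n_codons))
                        break
                open_start = pos + 3
    return orfs


toy = "CC" + "ATG" + "AAA" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` trades missed short genes for fewer false positives, which is exactly the pair of drawbacks listed above.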
Coding vs. non-coding regions
Codon frequencies
Codon usage in coding regions differs from that in random sequence
Leucine, Alanine and Tryptophan are coded by 6, 4 and 1
different codons, respectively
Expect to see a ratio of 6:4:1 in random sequence
In proteins they appear in a 6.9:6.5:1 ratio
Another example:
A or T appears in 90% of cases as the last letter
of a codon in protein-coding regions
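The 6:4:1 expectation follows directly from the standard genetic code, and the third-letter statistic is easy to measure on any candidate region. A small sketch, assuming the standard code (written in the usual T, C, A, G table order) and hypothetical helper names:

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order
# (third letter varying fastest); '*' marks a STOP codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid: the "random sequence" expectation.
SYNONYMS = Counter(CODON_TABLE.values())


def amino_acid_counts(cds: str) -> Counter:
    """Count amino acids encoded by cds, read in frame 0."""
    return Counter(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def third_letter_fraction(cds: str, letters: str = "AT") -> float:
    """Fraction of codons in cds (frame 0) whose last letter is in letters."""
    thirds = [cds[i + 2] for i in range(0, len(cds) - 2, 3)]
    return sum(base in letters for base in thirds) / len(thirds)


print(SYNONYMS["L"], SYNONYMS["A"], SYNONYMS["W"])  # Leu, Ala, Trp codon counts
print(third_letter_fraction("ATGAAATTT"))
```

Comparing `amino_acid_counts` on a candidate ORF against the `SYNONYMS` baseline is one way to quantify how far a region departs from what random sequence would give.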
Nucleotide Markov Model (MM) for Gene Detection
2nd-order MM
Idea: extend the model to capture codons
Results: poor; codons overlap in this model
MM over codons
Idea:
Transform the sequence into codons, then use a 1st-order MM
Why not use codon frequencies directly?
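A first-order Markov model over codons can be sketched as below. The training set, the add-one pseudocount and the uniform background used for the log-odds are all assumptions made for illustration; the lecture does not specify them.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def to_codons(seq: str) -> list[str]:
    """Read seq in frame 0 as non-overlapping codons (assumes only A, C, G, T)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_mm(coding_seqs: list[str], pseudocount: float = 1.0) -> dict:
    """Estimate P(next codon | previous codon) from known coding sequences."""
    counts = {prev: {cur: pseudocount for cur in CODONS} for prev in CODONS}
    for seq in coding_seqs:
        cods = to_codons(seq)
        for prev, cur in zip(cods, cods[1:]):
            counts[prev][cur] += 1
    trans = {}
    for prev, row in counts.items():
        total = sum(row.values())
        trans[prev] = {cur: c / total for cur, c in row.items()}
    return trans


def log_odds(seq: str, trans: dict) -> float:
    """Log-likelihood ratio of seq: codon MM versus uniform codon background."""
    cods = to_codons(seq)
    background = math.log(1 / 64)
    return sum(math.log(trans[p][c]) - background for p, c in zip(cods, cods[1:]))


# Toy training data; a real model would be trained on annotated genes.
model = train_codon_mm(["ATGGCTGCTGCTTAA"] * 5)
print(log_odds("GCTGCTGCT", model))
```

Because the states are whole codons, the overlap problem of the nucleotide-level model does not arise, at the cost of a 64 x 64 transition table that needs much more training data.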
“Codon Preferences” program:
Uses a window of 25 codons around each point
Score: $\log\left(\frac{P}{1 - P}\right)$
Using Promoter’s Signal
We are still far from perfect…
Idea: try to detect signals in the promoter regions,
to help discriminate real genes among ORFs
Prokaryotes:
~35 bp upstream of the TSS (-35): TTGACA
~10 bp upstream of the TSS (-10): TATAAT (“TATA box” signal)
No single promoter has the exact consensus
Nearly all promoters have 2-3 of the three conserved bases in TAxyzT
80-90% have all 3
In 50%, xyz = TAA
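Since no promoter matches the consensus exactly, a scan has to tolerate mismatches. The sketch below simply counts matching positions against the two hexamers near a candidate TSS. Treating -35 and -10 as the start offsets of the hexamers, the `slack` window and the additive score are assumptions for illustration, not an established scoring scheme.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def matches(hexamer: str, consensus: str) -> int:
    """Number of positions at which hexamer agrees with consensus."""
    return sum(a == b for a, b in zip(hexamer, consensus))


def best_hit(seq: str, consensus: str, lo: int, hi: int):
    """Best-matching window whose start lies in [lo, hi).

    Returns (n_matches, start), or None if no full window fits.
    """
    k = len(consensus)
    lo = max(lo, 0)
    hi = min(hi, len(seq) - k + 1)
    hits = [(matches(seq[i:i + k], consensus), i) for i in range(lo, hi)]
    return max(hits) if hits else None


def promoter_score(seq: str, tss: int, slack: int = 3):
    """Sum of the best -35 and -10 match counts (0 to 12) upstream of tss.

    tss is the 0-based index of the candidate transcription start site;
    each hexamer may start up to slack bases away from its nominal offset.
    """
    hit35 = best_hit(seq, MINUS_35, tss - 35 - slack, tss - 35 + slack + 1)
    hit10 = best_hit(seq, MINUS_10, tss - 10 - slack, tss - 10 + slack + 1)
    if hit35 is None or hit10 is None:
        return None
    return hit35[0] + hit10[0]


# Synthetic example with both consensus hexamers placed at their nominal offsets.
toy = "G" * 10 + "TTGACA" + "G" * 19 + "TATAAT" + "G" * 4 + "ATGAAA"
print(promoter_score(toy, tss=45))
```

A score like this could be combined with ORF length to rank candidates, so that long ORFs without any promoter-like signal upstream are treated with more suspicion.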
Summary so far
We have seen the problems in trying to find genes
in a genome-wide scan of prokaryotes
The bottom line is that the problem is not really
solved, but most research in gene finding focuses on
eukaryotes, where the main interest lies …
Next lecture: much more sophisticated models, to
handle the much more complex situation in
eukaryotes in general, and humans in particular