Gene Finding in Prokaryotes

Biological Motivation
Gene Finding
Anne R. Haake
Rhys Price Jones
Gene Finding
Why do it?
• Find and annotate all the genes within the large
volume of DNA sequence data
– how many genes in an organism? homologies?
• Gain understanding of problems in basic science
– e.g. gene regulation: what are the mechanisms involved in
transcription, splicing, etc.?
• Different emphasis in these goals has some effect on
the design of computational approaches for gene
finding.
Gene Finding by Biological Methods:
• Extract mRNA
• Reverse transcribe to cDNA
• Label cDNA
• Probe a DNA library with the labeled cDNA
• Gene found
Gene Finding by Computational
Methods
• Dependent on good experimental data to
build reliable predictive models
• Various aspects of gene structure/function
provide information used in gene finding
programs
Figure 12.3
The Informatics View of Genes
• Genes are character strings embedded in
much larger strings called the genome
• Genes are composed of ordered elements
associated with the fundamental genetic
processes including transcription, splicing,
and translation.
Gene Finding
• Cells recognize genes from DNA sequence
– find genes via their bioprocesses
• Not so easy for us:
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCT
CTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGA
AGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAG
GAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGT
TTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGT
GGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAG
AATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAA
CTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACT
TGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATA
AGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGG
ACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCAT
ATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAAC
AAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTAT
TGTTATGAGACTGGATATAT...
Types of Genes
• Protein coding
– most genes
• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)
3 Major Categories of Information used in
Gene Finding Programs
• Signals/features: sequence patterns with functional
significance, e.g. splice donor & acceptor sites, start and
stop codons, promoter features such as TATA boxes,
TF binding sites, CpG islands
• Content/composition: statistical properties of coding
vs. non-coding regions
– e.g. codon bias; length of ORFs in prokaryotes; GC content
• Similarity: compare the DNA sequence to known
sequences in databases
– Not only known proteins but also ESTs, cDNAs
Looking for Protein Coding Genes
• Look for ORF (begins with start codon, ends with
stop codon, no internal stops!)
– long (usually > 60-100 aa)
– more likely to be a real gene if homologous to a “known” protein
• Look for basal signals
– Transcription, splicing, translation
• Look for regulatory signals
– Depends on organism
• Prokaryotes vs Eukaryotes
• Vertebrate vs fungi
Easier problem:
Gene Finding in Bacterial Genomes
Why?
• Dense Genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
– Complete genomes are available for many bacteria
What do Prokaryotic Genes look like?
[Figure: a prokaryotic gene drawn 5’ to 3’: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe)]
Prokaryotic Gene Expression
[Figure: an operon (promoter, cistron 1, cistron 2, ... cistron N, terminator) is transcribed by RNA polymerase into one polycistronic mRNA (5’ to 3’); each cistron in the message has its own SD sequence; ribosomes, tRNAs and protein factors translate the cistrons into separate polypeptides (N to C terminus)]
Slide modified from:
http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a
protein
• The identification of an ORF is the first
indication that a segment of DNA may be part
of a functional gene
Open Reading Frames
Sequence: A C G T A A C T G A C T A G G T G A A T
Frame 1:  ACG TAA CTG ACT AGG TGA AT
Frame 2:  CGT AAC TGA CTA GGT GAA T
Frame 3:  GTA ACT GAC TAG GTG AAT
Each grouping of the nucleotides into consecutive
triplets constitutes a reading frame. There are three
different reading frames in the 5’->3’ direction and a
further three in the reverse direction on the opposite
strand.
A sequence of triplets that contains no stop codon is an
Open Reading Frame (ORF)
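The reading-frame idea can be sketched in a few lines of Python. This is only an illustration, not a gene finder: the function names are made up for this example, the standard stop codons TAA, TAG and TGA are assumed, and incomplete triplets at the end of a frame are dropped.

```python
# Illustrative sketch: reading frames and stop-free stretches of a DNA string.
# Function names are hypothetical; standard stop codons are assumed.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq):
    """Return the three forward frames as lists of complete triplets."""
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


def six_frames(seq):
    """Three frames on the given strand plus three on the opposite strand."""
    return reading_frames(seq) + reading_frames(reverse_complement(seq))


def open_stretches(codons):
    """Split one frame into maximal runs of codons that contain no stop."""
    runs, current = [], []
    for codon in codons:
        if codon in STOP_CODONS:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(codon)
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    example = "ACGTAACTGACTAGGTGAAT"  # the 20-nucleotide example sequence
    for number, frame in enumerate(six_frames(example), start=1):
        print(number, " ".join(frame), open_stretches(frame))
```

On a real genome the stop-free runs would then be filtered by length and by the presence of a start codon.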
ORFs as gene candidates
• A gene candidate is an open reading frame that begins with a start codon
(usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60
or more amino acids in length
• The probability that a random sequence of
n codons (3n nucleotides) has no stop codons is
(61/64)^n, since 61 of the 64 triplets are not stops
• When n is 50, there is a probability of about 91% that the
random sequence contains a stop codon
• When n is 100, this probability exceeds 99%
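The stop-codon argument can be checked numerically. The sketch below assumes, as the slide does, that every one of the 64 triplets is equally likely and that codons are independent; real genomes deviate from this, so the numbers are only a rough guide. The function names are illustrative.

```python
# Sketch under a uniform, independent-codon assumption (not a genome model).

def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon."""
    return (61 / 64) ** n_codons


def prob_at_least_one_stop(n_codons):
    """Probability that n random codons contain one or more stop codons."""
    return 1 - prob_no_stop(n_codons)


if __name__ == "__main__":
    for n in (50, 60, 100):
        print(n, round(prob_at_least_one_stop(n), 4))
```

The longer a stop-free stretch is, the less likely it arose by chance, which is why ORF length is a useful first filter.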
Codon Bias
• The genetic code is degenerate
– Several synonymous triplet codons can code for the same amino acid
– http://www.pangloss.com/seidel/Protocols/codon.html
• Codon usage varies
– organism to organism
– gene to gene
• Biological basis
– Avoidance of codons similar to stop
– Preference for codons that correspond to abundant tRNAs
within the organism
Codon Bias
Gene Differences
Codon      GAL4   ADH1
Gly GGG    0.21   0
Gly GGA    0.17   0
Gly GGT    0.38   0.93
Gly GGC    0.24   0.07
Slide modified from:
http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
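Fractions like the ones for the glycine codons can be computed by counting in-frame codons in a coding sequence. The following is a minimal sketch; the function names and the short test sequence are invented for illustration, and the input is assumed to start in frame.

```python
# Sketch: relative usage of synonymous codons in one coding sequence.
# Assumes the sequence starts in frame; names are hypothetical.
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")


def codon_counts(cds):
    """Count complete in-frame codons of a coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def synonymous_fractions(cds, codons=GLY_CODONS):
    """Fraction of each codon among the given set of synonymous codons."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


if __name__ == "__main__":
    toy_cds = "ATGGGTGGTGGCGGTTAA"  # made-up sequence, not a real gene
    print(synonymous_fractions(toy_cds))
```

Comparing such fractions between a candidate ORF and known genes of the same organism is one way the content signal is used.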
Codon Bias
Organism differences
• Yeast genome: Arg is specified by AGA 48% of
the time (the other five equivalent codons ~10%
each)
• Fruit fly genome: Arg is specified by CGC 33%
of the time (the other five ~13% each)
• Complete set of codon usage biases can be
found at: http://www.kazusa.or.jp/codon/
GC content
• GC relative to AT is a distinguishing factor of
bacterial genomes
• Varies dramatically across species
– Serves as a means to identify bacterial species
• For various biological reasons
– Mutational bias of particular DNA polymerases
– DNA repair mechanisms
– horizontal gene transfer (transformation, transduction,
conjugation)
GC Content
• GC content may be different in recently
acquired genes than elsewhere
• This can lead to variations in the frequency of
codon usage within coding regions
– There may be significant differences in codon bias within
different genes of a single bacterium’s genome
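GC content is simple to measure, and a sliding window gives a rough view of how it varies along a genome. This sketch uses arbitrary window and step sizes chosen only for illustration; the function names are hypothetical.

```python
# Sketch: overall and windowed GC content. Window/step sizes are arbitrary.

def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=1000, step=500):
    """Return (start, gc_fraction) for successive windows along seq."""
    last_start = max(len(seq) - window + 1, 1)
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, last_start, step)]


if __name__ == "__main__":
    toy = "GGGGCCCC" + "ATATATAT"  # made-up sequence with a GC-rich half
    print(gc_content(toy))
    print(windowed_gc(toy, window=8, step=8))
```

Windows whose GC fraction differs markedly from the genome-wide value are candidates for recently acquired DNA, although other explanations are possible.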
Ribosome Binding Sites
• RBS is also known as a Shine-Dalgarno
sequence (species-dependent) that should
bind well with the 3’ end of 16S rRNA (part of
the ribosome)
• Usually found within 4-18 nucleotides upstream of the
start codon of a true gene
Shine-Dalgarno Sequence
• A nucleotide sequence (consensus =
AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for
ribosomes and is thought to influence the
reading frame.
• If a subsequence aligning well with the Shine-Dalgarno
sequence is found within 4-18 nucleotides of an ORF’s
start codon, that improves the ORF’s candidacy.
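One simple way to look for a Shine-Dalgarno-like site is to slide the consensus along the region upstream of a candidate start codon and count mismatches. The sketch below makes several assumptions: "within 4-18 nucleotides" is read as the gap between the end of the motif and the start codon, matching is by plain mismatch count rather than by binding energy to the 16S rRNA, and all names are illustrative.

```python
# Sketch: best match to the SD consensus upstream of a start codon.
# Gap definition and mismatch scoring are assumptions made for illustration.
SD_CONSENSUS = "AGGAGG"


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_sd_match(seq, start_pos, min_gap=4, max_gap=18,
                  consensus=SD_CONSENSUS):
    """Return (mismatch_count, motif_start, gap) for the best window, or None.

    start_pos is the 0-based index of the first base of the start codon;
    gap is the number of bases between the motif's end and the start codon.
    """
    k = len(consensus)
    best = None
    for gap in range(min_gap, max_gap + 1):
        motif_start = start_pos - gap - k
        if motif_start < 0:
            break  # ran off the 5' end of the sequence
        score = mismatches(seq[motif_start:motif_start + k], consensus)
        if best is None or score < best[0]:
            best = (score, motif_start, gap)
    return best


if __name__ == "__main__":
    toy = "CC" + "AGGAGG" + "TTTTTTT" + "ATGAAATAA"  # made-up sequence
    print(best_sd_match(toy, start_pos=15))
```

A low mismatch count at a plausible gap raises confidence in the ORF; a threshold would have to be chosen per organism.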
Bacterial Promoter
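-35 region: T82 T84 G78 A65 C54 A45
… (16-18 bp spacer) …
-10 region: T80 A95 T45 A60 A50 T96
… +1 (transcription start): A or G
(numbers give the percentage of promoters with that base at that position)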
Not so simple: remember, these are
consensus sequences
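Because promoters only resemble the consensus, a search has to tolerate mismatches. The sketch below takes the most frequent base at each position of the -35 and -10 boxes (TTGACA and TATAAT), allows a few mismatches in each box, and requires a 16-18 bp spacer. The mismatch threshold and the function names are arbitrary choices for illustration; practical tools typically use position weight matrices instead.

```python
# Sketch: naive promoter candidate search with consensus boxes and mismatches.
# Threshold and names are illustrative assumptions, not a standard method.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(seq, max_mismatches=2, spacers=(16, 17, 18)):
    """Return (pos_35, pos_10, spacer, total_mismatches) for each candidate."""
    hits = []
    k35, k10 = len(MINUS_35), len(MINUS_10)
    for i in range(len(seq) - k35 + 1):
        m35 = mismatches(seq[i:i + k35], MINUS_35)
        if m35 > max_mismatches:
            continue
        for spacer in spacers:
            j = i + k35 + spacer
            if j + k10 > len(seq):
                continue  # -10 box would run past the end
            m10 = mismatches(seq[j:j + k10], MINUS_10)
            if m10 <= max_mismatches:
                hits.append((i, j, spacer, m35 + m10))
    return hits


if __name__ == "__main__":
    toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGA"  # made-up
    print(find_promoter_candidates(toy))
```

With such a loose definition, many hits in a real genome will be false positives, which is the point of the caution about consensus sequences.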
Termination Sequences
• 3’-U tail
• Stem/loop
– Inverted repeat immediately preceding the runs of uracil
Termination sequence