Transcript Slide 1

Chapter 8
Gene Prediction
•Automated sequencing of genomes requires automated gene assignment
•Includes detection of open reading frames (ORFs)
•Identification of the introns and exons
•Gene prediction is a very difficult problem in pattern recognition
•Coding regions generally do not have conserved sequences
•Much progress made with prokaryotic gene prediction
•Eukaryotic genes more difficult to predict correctly
Ab initio methods
•Predict genes on given sequence alone
•Uses gene signals
•Start/stop codon
•Intron splice sites
•Transcription factor binding sites
•Ribosomal binding sites
•Poly-A sites
•Codon structure demands an ORF length that is a multiple of three nucleotides
•Gene content
•Nucleotide composition – use HMMs
Homology based methods
•Matches to known genes
•Matches to cDNA
Consensus based
•Uses output from more than one program
Prokaryotic gene structure
•ATG (GTG or TTG less frequent) is start codon
•Ribosome binding site (Shine-Dalgarno sequence)
complementary to 16S rRNA of ribosome
•AGGAGGT
•TAA, TAG or TGA stop codon
•Transcription termination site (ρ-independent termination)
•Stem-loop secondary structure followed by string of Ts
•Translate sequence into 6 reading frames
•In random sequence a stop codon occurs about every 20 codons
•Look for a frame longer than 30 codons (normally 50-60 codons)
•Presence of start codon and Shine-Dalgarno sequence
•Translate putative ORF into protein, and search databases
•Non-randomness of 3rd base of codon, more frequently G/C
•Plotting wobble base GC% can identify ORFs
•3rd base also repeats, thus repetition gives clue on gene location
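The ORF search steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names (find_orfs, gc3) and the default minimum length are assumptions chosen for the example, and coordinates are reported on the strand that was scanned.

```python
# Minimal six-frame ORF scan. Names and thresholds are illustrative assumptions.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, orf) tuples for all six frames.

    start and end are 0-based offsets on the scanned strand; end is
    exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3  # codons before the stop
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


def gc3(orf):
    """Fraction of G or C at the third (wobble) position of each codon."""
    thirds = orf[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds)
```

A putative ORF returned by find_orfs can then be checked for a high wobble-base GC fraction with gc3 before it is translated and searched against protein databases.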
Markov chains and HMMs
•Order depends on k previous positions
•The higher the order of a Markov model to describe a gene, the more non-randomness the model includes
•Genes described in codons or hexamers
•HMMs trained with known genes
•Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs – 5th-order Markov chain
•5th-order HMM gives very accurate gene predictions
•Problem may be that in short genes there are not enough hexamers
•Interpolated Markov Model (IMM) samples different length Markov chains
•Weighting scheme places less weight on rare k-mers
•Final probability is the combined probability of all weighted k-mers
•Typical and atypical genes
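To make the idea of a k-th order Markov chain concrete, here is a hedged sketch that estimates the probability of each base given the previous k bases and scores a sequence by its log-odds under a coding versus a non-coding model. It is a plain fixed-order chain, not a full HMM or an interpolated model, and it ignores codon phase. The function names and the pseudocount are assumptions made for the example.

```python
import math
from collections import defaultdict


def train_markov(seqs, k=5, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous k bases) from training sequences.

    Returns a function prob(context, base). The pseudocount keeps
    unseen k-mers from getting zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(alphabet)
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, k=5):
    """Sum of log(P_coding / P_noncoding) over all positions after the first k."""
    seq = seq.upper()
    score = 0.0
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score
```

With k=5 each probability is conditioned on a pentamer, so every scored unit is a hexamer. A short gene supplies few hexamers, which is the sparsity problem that an interpolated model addresses by mixing in lower-order chains.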
GeneMark (http://exon.gatech.edu/genemark/)
Trained on complete microbial genomes
Most closely related organism used for predictions
Glimmer (Gene Locator and Interpolated Markov Modeler)
(http://www.cbcb.umd.edu/software/glimmer/)
FGENESB (http://linux1.softberry.com/)
5th-order HMM
Trained with bacterial sequences
Linear discriminant analysis (LDA)
RBSFinder (ftp://ftp.tigr.org)
Takes output from Glimmer and searches for S-D sequences
close to start sites
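The idea of looking for a Shine-Dalgarno sequence just upstream of a predicted start can be sketched as follows. This is not RBSFinder's actual algorithm. The function name find_sd, the 20-nt search window and the mismatch allowance are assumptions for illustration, and the motif is the AGGAGGT consensus given earlier.

```python
def find_sd(seq, start, motif="AGGAGGT", window=20, max_mismatch=1):
    """Find the best Shine-Dalgarno-like match upstream of a start codon.

    start is the 0-based index of the start codon. Returns
    (position, mismatches) for the best match that lies entirely within
    `window` nucleotides upstream of start, or None if nothing qualifies.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    best = None
    for i in range(lo, start - len(motif) + 1):
        candidate = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if mismatches <= max_mismatch and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best
```

When several start codons are possible for one ORF, the candidate with a nearby match is the more plausible translation start.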
Performance evaluation
•Sensitivity Sn = TP/(TP+FN)
•Specificity Sp = TP/(TP+FP)
•Correlation coefficient CC = (TP·TN − FP·FN) / √[(TP+FP)(TN+FN)(TP+FN)(TN+FP)]
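The three measures are easy to compute from the four counts. The sketch below uses the definitions on this slide, with specificity defined as TP/(TP+FP) as is conventional in gene prediction. The function name accuracy is an assumption.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) for the given prediction counts.

    Assumes every denominator is non-zero, i.e. each class is
    represented among both the true and the predicted labels.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom
    return sn, sp, cc
```

For example, accuracy(80, 900, 10, 10) gives Sn = Sp = 80/90 and CC = 71900/81900.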
Gene prediction in Eukaryotes
Low gene density (3% in humans)
Space between genes very large with multiply repeated sequences
and transposable elements
Eukaryotic genes are split (introns/exons)
Transcript is capped (methylation of 5’ residue)
Splicing in spliceosome
Alternative splicing
Polyadenylation (~250 As added) downstream of CAATAAA(T/C)
consensus box
Major issue is identification of splice sites
GT-AG rule (GTAAGT/ Y12NCAG 5’/3’ intron splice junctions)
Codon use frequencies
ATG start codon
Kozak sequence (CCGCCATGG)
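A short sketch shows why the GT-AG rule alone is a weak signal: every GT followed at some distance by an AG is a candidate intron. The function name and the minimum intron length are assumptions for illustration only.

```python
def candidate_introns(seq, min_len=20):
    """List every (start, end) span that begins with GT and ends with AG.

    Coordinates are 0-based with end exclusive, so seq[start:end] is the
    candidate intron. min_len is an arbitrary illustrative cutoff.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]
```

On real genomic sequence the number of candidates grows roughly with the product of GT and AG counts, which is why the extended consensus sequences and coding statistics are needed to rank them.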
•Ab initio programs
•Gene signals
•Start/stop
•Putative splice signals
•Consensus sequences
•Poly-A sites
•Gene content
•Coding statistics
•Non-random nucleotide distributions
•Hexamer frequencies
•HMMs
Discriminant analysis
•Plot 2D graph of coding length versus 3’ splice site signal
•Place diagonal line (LDA) that separates true coding from
non-coding sequences based on learnt knowledge
•QDA fits quadratic curve
•FGENES uses LDA
•MZEF (Michael Zhang’s Exon Finder) uses QDA
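The separating line of LDA can be sketched with Fisher's linear discriminant on two features. This is a generic textbook formulation, not the code of FGENES. The function names and the idea of feeding it (coding length, 3' splice site score) pairs are assumptions for the example, and the training values must be supplied by the user.

```python
def fit_lda(pos, neg):
    """Fit a two-feature Fisher linear discriminant.

    pos and neg are lists of (x, y) pairs for true coding and non-coding
    examples. Returns (w, threshold): a point p is called coding when
    w[0]*p[0] + w[1]*p[1] > threshold.
    """
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    m1, m0 = mean(pos), mean(neg)
    a1, b1, c1 = scatter(pos, m1)
    a0, b0, c0 = scatter(neg, m0)
    # Pooled within-class scatter matrix [[a, b], [b, c]]
    a, b, c = a1 + a0, b1 + b0, c1 + c0
    det = a * c - b * b  # assumed non-zero for this sketch
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = inverse(scatter) applied to the difference of class means
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    # Decision threshold halfway between the projected class means
    threshold = 0.5 * (w[0] * (m1[0] + m0[0]) + w[1] * (m1[1] + m0[1]))
    return w, threshold


def is_coding(point, w, threshold):
    """Classify one (x, y) point with a fitted discriminant."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

QDA differs in that each class keeps its own scatter matrix, which turns the straight boundary into a quadratic curve.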
Neural Nets
•A series of input, hidden and output layers
•Gene structure information is fed to input layer, and is separated
into several classes
•Hexamer frequencies
•splice sites
•GC composition
•Weights are calculated in the hidden layer to generate output of
exon
•When the input layer is challenged with a new sequence, the rules that
were generated to output exon are applied to the new sequence
HMMs
•GenScan (http://genes.mit.edu/GENSCAN.html)
5th-order HMM
•Combined hexamer frequencies with coding signals
•Initiation codons
•TATA boxes
•CAP site
•Poly-A
•Trained on vertebrate, Arabidopsis and maize data
•Extensively used in human genome project
•HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
•Identifies sub-regions of exons from cDNA or protein matches
•Locks such regions and uses HMM extension into neighboring regions
Homology based programs
•Uses translations to search for EST, cDNA and proteins in
databases
•GenomeScan (http://genes.mit.edu/genomescan.html)
•Combines GENSCAN with BLASTX
•EST2Genome
(http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
•Compares EST and cDNA to user sequence
•TwinScan
•Similar to GenomeScan
Consensus-based programs
•Uses several different programs to generate lists of predicted
exons
•Only common predicted exons are retained
•GeneComber
(http://www.bioinformatics.ubc.ca/gencombver/index.php)
•Combines HMMgene with GenScan
•DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
•Combines FGENESH, GENSCAN and HMMgene
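The consensus step itself is simple set intersection. The sketch below keeps only exons whose coordinates are reported identically by every program. Real consensus tools may use more tolerant matching, so exact matching is a simplifying assumption, as is the function name.

```python
def consensus_exons(*predictions):
    """Return the sorted exons (start, end) predicted by every program.

    Each argument is one program's list of (start, end) exon coordinates.
    """
    common = set(predictions[0])
    for exons in predictions[1:]:
        common &= set(exons)
    return sorted(common)
```

Retaining only the shared exons raises specificity at the cost of sensitivity, since an exon found by only one program is discarded even when it is real.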
Accuracy
            Nucleotide level      Exon level
Program     Sn    Sp    CC        Sn    Sp    (Sn+Sp)/2  ME    WE
FGENES      0.86  0.88  0.83      0.67  0.67  0.67       0.12  0.09
GeneMark    0.87  0.89  0.83      0.53  0.54  0.54       0.13  0.11
Genie       0.91  0.90  0.88      0.71  0.70  0.71       0.19  0.11
GENSCAN     0.95  0.90  0.91      0.71  0.70  0.70       0.08  0.09
HMMgene     0.93  0.93  0.91      0.76  0.77  0.76       0.12  0.07
Morgan      0.75  0.74  0.74      0.46  0.41  0.43       0.20  0.28
MZEF        0.70  0.73  0.66      0.58  0.59  0.59       0.32  0.23
(ME: missed exons; WE: wrong exons)
Chapter 9
Promoter and regulatory element prediction
•Promoters are short regions upstream of transcription start
site
•Contain short (6-8 nt) transcription factor recognition sites
•Extremely laborious to define by experiment
•Sequence is not translated into protein, so no homology
matching is possible
•Each promoter is unique with a unique combination of factor
binding sites – thus no consensus promoter
Prokaryotic gene
[Figure: prokaryotic gene showing TF site with bound TF, -35 box, -10 box, polymerase and ORF]
•σ70 factor binds to -35 and -10 boxes and recruits the full polymerase enzyme
•-35 box consensus sequence: TTGACA
•-10 box consensus sequence: TATAAT
•Transcription factors that activate or repress transcription
•Bind to regulatory elements
•DNA loops to allow long-distance interactions
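The two consensus boxes can be searched for directly. The sketch below reports every position where a -35-like hexamer is followed by a -10-like hexamer. The 15-19 nt spacer range and the mismatch allowance are assumptions not given on the slide, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_promoters(seq, max_mismatch=2, spacer=range(15, 20)):
    """Return (pos35, pos10, mismatches) for candidate sigma-70 promoters.

    pos35 and pos10 are 0-based start positions of the two hexamers.
    mismatches is the combined mismatch count against TTGACA and TATAAT.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = hamming(seq[i:i + 6], "TTGACA")
        if m35 > max_mismatch:
            continue
        for gap in spacer:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            m10 = hamming(seq[j:j + 6], "TATAAT")
            if m35 + m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits
```

Because the boxes are short and degenerate, loosening max_mismatch quickly produces many false positives on genomic sequence.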
Eukaryotic gene structure
[Figure: eukaryotic gene showing upstream TF sites, TATA box, Inr and Pol II]
Polymerase I, II and III
Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
TATA box (TATA(A/T)A(A/T))
“Housekeeping” genes often do not contain TATA boxes
Initiator site (Inr) (C/T) (C/T) CA(C/T) (C/T) coincides with
transcription start
Many TF sites
Activation/repression
Ab initio methods
•Promoter signals
•TATA boxes
•Hexamer frequencies
•Consensus sequence matching
•PSSM
•Numerous FPs
•HMMs incorporate neighboring information
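A PSSM scan can be sketched as follows: build log-odds scores from a set of aligned example sites, then slide the matrix along a sequence. The function names, the uniform background of 0.25 and the pseudocount are assumptions, and the input is assumed to contain only A, C, G and T.

```python
import math


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length aligned sites."""
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pssm[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits
```

The threshold controls the trade-off noted above: a low cutoff recovers weak sites but returns numerous false positives.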
Promoter prediction in prokaryotes
•Find operon
•Upstream of first gene is the promoter
•Wang rules (distance between genes, no ρ-independent
termination, number of genomes that display linkage)
•BPROM (http://www.softberry.com)
•Based on arbitrary setting of operon gene distances
•200 bp upstream of first gene
•Many FPs
•FindTerm (http://sun1.softberry.com)
•Searches for ρ-independent termination signals
Prediction in eukaryotes
•Searching for consensus sequences in databases (TransFac)
•Increase specificity by searching for CpG islands
•High density of transcription factor binding sites
•CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
•CG% in moving window
•Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
•Matches TATA box, CCAAT box, CpG island to PSSM
•Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
•Detects high concentrations of TF sites
•FirstEF (http://rulai.cshl.org/tools/FirstEF/)
•QDA of first exon boundary
•McPromoter (http://genes.mit.edu/McPromoter.html)
•Neural net of DNA bendability, TATA box, initiator box
•Trained for Drosophila and human sequences
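The moving-window CG% calculation can be sketched as below. This is not CpGProD's implementation. The function name, the window size, and the cutoffs (GC fraction of at least 0.5 and observed/expected CpG of at least 0.6) are commonly quoted values used here as assumptions.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Return (start, gc_fraction, obs_exp) for windows that look CpG-rich.

    obs_exp is the observed CpG count divided by the count expected from
    the C and G content of the window: (C * G) / window length.
    """
    seq = seq.upper()
    hits = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        c, g = w.count("C"), w.count("G")
        gc = (c + g) / window
        expected = c * g / window
        obs_exp = w.count("CG") / expected if expected else 0.0
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((i, gc, obs_exp))
    return hits
```

Overlapping windows that pass can be merged into a single island, and a predicted first exon or TF-site cluster that falls inside one is a stronger promoter candidate.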
Phylogenetic footprinting technique
•Identify conserved regulatory sites
•Human-chimpanzee too close
•Human-fish too distant
•Human-mouse appropriate
•ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
•Align two sequences by global alignment algorithm
•Identify conserved regions and compare to TRANSFAC database
•High scoring hits returned as positives
•rVISTA (http://rvista.dcode.org)
•Identifies TRANSFAC sites in two orthologous sequences
•Aligns sequences with local alignment algorithm
•Highest identity regions returned as hits
•Bayes aligner
(http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
•Aligns two sequences with Bayesian algorithm
•Even weakly conserved regions identified
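Once two orthologous sequences have been aligned, finding conserved regions reduces to a sliding-window identity count. The sketch assumes the alignment has already been produced by some other tool and is passed in as two equal-length gapped strings. The function name and thresholds are illustrative assumptions.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows above min_identity.

    aln_a and aln_b are the two rows of a pairwise alignment, with '-'
    for gaps. A gap column never counts as a match.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for i in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[i:i + window], aln_b[i:i + window])
        matches = sum(a == b and a != "-" for a, b in pairs)
        identity = matches / window
        if identity >= min_identity:
            hits.append((i, identity))
    return hits
```

The conserved windows are the footprints. Comparing them against a database of known binding sites is the second step performed by the tools listed above.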
Expression-profiling based method
Microarray analysis allows identification of co-regulated genes
Assume that promoters contain similar regulatory sites
Find such sites by EM and Gibbs sampling using iteration of PSSM
Co-expressed genes may be regulated at higher levels
MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
Gibbs sampling algorithm
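A bare-bones Gibbs sampler for one motif occurrence per promoter can be sketched as follows. It illustrates the iteration of a PSSM described above and is not the AlignACE or MEME implementation. The function name, the fixed motif width, the uniform background and the iteration count are all assumptions, and the input sequences are assumed to be uppercase A, C, G and T only.

```python
import random


def gibbs_motif(seqs, w, iterations=200, seed=0, pseudocount=1.0):
    """Sample one motif start position of width w in each sequence.

    Repeatedly holds out one sequence, builds a profile from the current
    sites in the others, and resamples the held-out site in proportion
    to how well each window fits that profile.
    """
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for held in range(len(seqs)):
            others = [seqs[j][pos[j]:pos[j] + w]
                      for j in range(len(seqs)) if j != held]
            total = len(others) + 4 * pseudocount
            profile = []
            for k in range(w):
                column = [site[k] for site in others]
                profile.append({b: (column.count(b) + pseudocount) / total
                                for b in "ACGT"})
            s = seqs[held]
            weights = []
            for i in range(len(s) - w + 1):
                p = 1.0
                for k in range(w):
                    p *= profile[k][s[i + k]] / 0.25  # odds against background
                weights.append(p)
            pos[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Because the procedure is stochastic, it is usually run from several random starts and the highest-scoring alignment is kept. A single run may settle on a local optimum rather than the true shared site.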