Lecture 12 - School of Science and Technology

Bioinformatics
Lecture 12
• Splicing and gene prediction in eukaryotes
• Critical splice signals
• Coding statistics: DNA differences between
exons and introns
• Discriminant function and combined approach
Splicing and gene prediction in eukaryotes
• Any type of gene prediction, and ab initio prediction in particular, is
greatly complicated in eukaryotes by splicing.
• The difficult task is to predict the positions of exon-intron boundaries for
eukaryotic genes that have multiple introns, and to predict the absence
of introns for intronless genes.
• Eukaryotic genomes differ significantly in a number of ways, which
requires species-specific prediction programs.
• The major differences include: a) variation in GC-content (e.g.
mammalian genomes show large regional variation in GC-content; such regions
are referred to as isochores), b) variation in codon usage frequencies.
• All these factors, if not taken into consideration, diminish the quality of
prediction.
AT/GC ratios in coding regions in some eukaryotes
[Figure: AT% and CG% (scale 0 to 0.7) in A. thaliana, C. elegans, D. melanogaster and H. sapiens]
The number of correct and incorrect (in parentheses) whole-gene model
predictions shared among the 3 programs, from a test set of 1783 genes
[Figure: predictions shared among GeneMark.hmm (GM), Genscan+ (GS) and GlimmerM (GA)]
An incorrect gene refers to cases in which all coding exons in the gene are in
perfect agreement among the gene finders but not with the true gene.
mRNA splicing
Critical splice signals
[Figure: EXON 1, INTRON, EXON 2 with the consensus signals listed below]
• Donor site (5' splice junction): A G G U A/G A G U (100%)
• Branch site: U U A/G A U/C (62-68%)
• Acceptor site (3' splice junction): U/C A G G/A (100%)
Frequencies of nucleotides at the ends of exons
[Figure: two sets of panels, the first 10 nucleotides of exons (5' end) and the last 10 nucleotides of exons (3' end), each shown for C. elegans, D. melanogaster and H. sapiens]
Recognition of variable splice sites and gene prediction
• At least 3 critical signals/motifs (donor, acceptor and branch sites) should
be recognised in order to predict the position of an intron and both splice
junctions.
• Significant sequence variation in these sites between species and between
genes reduces the quality of predictions.
• The best average error rate (false positives + false negatives) for either
donor or acceptor site prediction is about 5%. This may be acceptable if the
search is restricted to a short region. However, searching a large region
produces an unacceptable number of false positives, because for every true site
there are hundreds of pseudo-sites.
• For example, if a large region has 40 true sites and 4000 pseudo-sites, one
true site would be missed (2.5% false negatives) and 100 pseudo-sites
would be predicted as true sites (2.5% false positives)!
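The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the function name and the derived "fraction of predicted sites that are real" are illustrative additions, not part of the lecture.

```python
def site_prediction_counts(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Return (missed true sites, pseudo-sites called true, precision)."""
    false_negatives = true_sites * fn_rate      # true sites that are missed
    false_positives = pseudo_sites * fp_rate    # pseudo-sites predicted as true
    true_positives = true_sites - false_negatives
    precision = true_positives / (true_positives + false_positives)
    return false_negatives, false_positives, precision


# Numbers from the example: 40 true sites, 4000 pseudo-sites, 2.5% each way.
fn, fp, precision = site_prediction_counts(40, 4000, 0.025, 0.025)
print(f"missed true sites: {fn:.0f}")
print(f"pseudo-sites predicted as true: {fp:.0f}")
print(f"fraction of predicted sites that are real: {precision:.2f}")
```

With these inputs only 39 of the 139 predicted sites are real, which is why a seemingly small error rate becomes a problem over large regions.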
Recognition of variable splice sites and gene prediction
• Since an adjacent donor site and acceptor site are not independent, this
correlation can be exploited to eliminate further false positives.
• For short introns, occurring mostly in lower eukaryotes, an intron is
recognized by the interaction of splicing factors binding across the
intron ends (hence the 5'ss - 3'ss correlation).
• In vertebrates, where exons are much shorter than introns, the key is
recognition of exons by the interaction of splicing factors binding across
the exon ends (hence the 3'ss - 5'ss correlation).
• Therefore mammalian functional splice sites can only be effectively
identified simultaneously, through exon recognition.
• There are also several additional signals/motifs essential for correct
splicing, which are recognised by certain proteins involved in splicing.
Identifying such sites and using them in prediction programs
should increase the quality of eukaryotic gene predictions.
Coding statistics: DNA differences between exons and introns
• Apart from splicing signals and ORFs, there are several additional
characteristics that may help to discriminate between exons and
introns.
• These features include DNA periodicity in exons, codon
preferences, hexamer usage, codon prototype, and compositional bias
between codon positions.
DNA periodicity in exons
[Figure: Frequency of nucleotide A in phase 0 H. sapiens exons aligned at the 5' end; x-axis: Position (1 to 49), y-axis: Frequency (0 to 0.4)]
DNA periodicity in exons
[Figure: Curve of best fit in H. sapiens phase 0 exons, dinucleotide 'AG'; x-axis: Nucleotide position (1 to 96), y-axis: Frequency (0 to 0.16)]
Periodic structure in DNA sequences.
The absolute frequency of the AA pair with 0 to 5 nucleotides between the two A's in the first 200
base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3
pattern appears in coding regions, which is absent in non-coding regions. A similar periodic pattern
appears in coding regions for the other fifteen possible pairs of nucleotides.
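A count of this kind could be sketched as follows. The function name, the default gap range (0 to 12) and the toy sequence are assumptions made for illustration; real input would be sets of annotated exons and introns.

```python
from collections import Counter


def pair_counts_by_gap(sequences, first="A", second="A", max_gap=12, window=200):
    """Count `first`...`second` pairs with k bases in between, for k = 0..max_gap.

    Only the first `window` bases of each sequence are examined.
    """
    counts = Counter()
    for seq in sequences:
        s = seq[:window].upper()
        for i, base in enumerate(s):
            if base != first:
                continue
            for gap in range(max_gap + 1):
                j = i + gap + 1          # position of the partner base
                if j >= len(s):
                    break
                if s[j] == second:
                    counts[gap] += 1
    return [counts[gap] for gap in range(max_gap + 1)]


# Toy sequence (made up) with an A at every third position.
toy_counts = pair_counts_by_gap(["ACG" * 10])
for gap, n in enumerate(toy_counts):
    print(gap, n)
```

In the toy input the counts are non-zero only for gaps 2, 5, 8 and 11, an exaggerated version of the period-3 pattern; on real exons the signal would appear as peaks over a non-zero background.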
Codon Preference
• A coding statistic was introduced to measure solely the uneven usage of
synonymous codons.
• From a codon usage table, we can compute the relative probability
of each synonymous codon coding for a given amino acid.
• For instance, GAG and GAA, the two codons coding for glutamic acid,
are used in coding regions with probabilities 0.03882 and 0.02751, which
results in relative probabilities of 0.59 and 0.41, respectively.
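The glutamic acid example can be reproduced directly. This is a minimal sketch; the function name is illustrative, and the two frequencies are the ones quoted above.

```python
def relative_synonymous_probabilities(codon_freqs):
    """Normalise the usage frequencies of one amino acid's codons to sum to 1."""
    total = sum(codon_freqs.values())
    return {codon: freq / total for codon, freq in codon_freqs.items()}


# Usage frequencies of the two glutamic acid codons, as quoted in the text.
glu = {"GAG": 0.03882, "GAA": 0.02751}
rel = relative_synonymous_probabilities(glu)
for codon, p in rel.items():
    print(codon, round(p, 2))
```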
Hexamer usage correlation
• Bias in the distribution of oligonucleotides longer than codons can also be used to
discriminate between coding and non-coding regions. Bias in the usage of hexamers may be
the most discriminant one (probably because of dependence between adjacent amino acids in
proteins). Bias in hexamer usage can be computed exactly as bias in codon usage, since the
background information for codon frequencies is known and the frequency of each of the
$64^2 = 4096$ hexamers can be found.
• There are several ways to construct a frame-specific hexamer score, including the log-odds
score $L_E(w,i) = \log[f_E(w,i)/f_I(w)]$ and the preference score
$P_E(w,i) = f_E(w,i)/[f_E(w,i) + f_I(w)]$, where $f_E(w,i)$ is the frequency of hexamer $w$
in frame $i$, calculated from known exon training data, and $f_I(w)$ is the frequency of $w$
in known introns.
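The two scores can be sketched in Python as below. The helper names, the toy sequence and the choice of the natural logarithm are assumptions (the text does not fix the base of the log), and a real implementation would need pseudocounts for hexamers that never occur in the training data.

```python
import math
from collections import Counter


def exon_hexamer_freqs(exons, frame):
    """Relative frequencies of hexamers starting at positions frame, frame+3, ..."""
    counts = Counter()
    for seq in exons:
        for i in range(frame, len(seq) - 5, 3):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def intron_hexamer_freqs(introns):
    """Relative frequencies of hexamers at every position (introns have no frame)."""
    counts = Counter()
    for seq in introns:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_odds(f_exon, f_intron):
    """L_E(w,i) = log[f_E(w,i) / f_I(w)] (natural log assumed)."""
    return math.log(f_exon / f_intron)


def preference(f_exon, f_intron):
    """P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]."""
    return f_exon / (f_exon + f_intron)


# Toy training sequence (made up), read in frame 0.
f_e = exon_hexamer_freqs(["ATGGCCATGGCC"], frame=0)
print(f_e)
# A hexamer twice as frequent in exons as in introns:
print(log_odds(0.002, 0.001), preference(0.002, 0.001))
```

The two scores carry the same information: the preference score is the logistic transform of the log-odds, P = 1 / (1 + exp(-L)), so it is bounded between 0 and 1 while the log-odds is not.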
Probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide
in the preceding codon position. Estimated from a set of human exon and intron sequences.

Codon position 1
       A     C     G     T
A     .36   .21   .19   .24
C     .27   .23   .14   .35
G     .35   .24   .23   .19
T     .18   .27   .23   .31

Codon position 2
       A     C     G     T
A     .16   .28   .40   .16
C     .19   .44   .12   .25
G     .15   .41   .27   .17
T     .07   .33   .45   .16

Codon position 3
       A     C     G     T
A     .22   .21   .44   .13
C     .33   .29   .15   .22
G     .24   .27   .37   .12
T     .13   .21   .53   .13
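One way to use such conditional probabilities is as a position-dependent first-order Markov chain that scores a sequence in each of the three possible reading frames. The sketch below makes two assumptions that the slide does not state: that each row of the table is the preceding nucleotide and each column the nucleotide at the given codon position (the rows sum to about 1, which supports this reading), and that the values describe coding sequence. The function names are illustrative.

```python
import math

BASES = "ACGT"

# COND[p][prev] lists P(A), P(C), P(G), P(T) at codon position p,
# given that the preceding nucleotide is `prev` (assumed table orientation).
COND = {
    1: {"A": [.36, .21, .19, .24], "C": [.27, .23, .14, .35],
        "G": [.35, .24, .23, .19], "T": [.18, .27, .23, .31]},
    2: {"A": [.16, .28, .40, .16], "C": [.19, .44, .12, .25],
        "G": [.15, .41, .27, .17], "T": [.07, .33, .45, .16]},
    3: {"A": [.22, .21, .44, .13], "C": [.33, .29, .15, .22],
        "G": [.24, .27, .37, .12], "T": [.13, .21, .53, .13]},
}


def frame_log_likelihood(seq, frame):
    """Log-likelihood of `seq` if index `frame` (0, 1 or 2) is a codon position 1."""
    total = 0.0
    for i in range(1, len(seq)):
        codon_pos = (i - frame) % 3 + 1
        prob = COND[codon_pos][seq[i - 1]][BASES.index(seq[i])]
        total += math.log(prob)
    return total


def best_frame(seq):
    """Frame with the highest log-likelihood under the chain."""
    return max(range(3), key=lambda f: frame_log_likelihood(seq, f))


seq = "ATGGCCGAGAAGCTG"  # made-up sequence, for illustration only
for f in range(3):
    print(f, round(frame_log_likelihood(seq, f), 3))
```

A full coding statistic would compare this likelihood with the likelihood under a non-coding model; only the coding side is sketched here.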
Codon Prototype, Markov model measure and Average Mutual Information
• A measure can be introduced that shows how similar the observed
distribution of base frequencies at the three codon positions in a sequence
(exon or intron) is to the prototypical distribution (see the table).
• Dependencies between nucleotide positions in coding regions can be
explicitly described by means of Markov models.
• Average Mutual Information can be computed from the probability of
finding the pair of nucleotides i and j at a distance of k nucleotides in the
sequence.
Nucleotide    Codon position
              1      2      3
A             0.27   0.31   0.18
C             0.24   0.24   0.31
G             0.32   0.20   0.29
T             0.17   0.26   0.22
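As a rough illustration of how the prototype table above could be turned into a score, the sketch below sums, over all bases of a sequence, the log-ratio between the prototype frequency at that base's codon position and a uniform background of 0.25. This is a simplified stand-in, not necessarily the published codon prototype measure; the uniform background and the function name are assumptions.

```python
import math

# Prototype base frequencies at codon positions 1, 2, 3 (from the table).
PROTOTYPE = {
    "A": (0.27, 0.31, 0.18),
    "C": (0.24, 0.24, 0.31),
    "G": (0.32, 0.20, 0.29),
    "T": (0.17, 0.26, 0.22),
}
BACKGROUND = 0.25  # assumption: uniform base composition as the null model


def prototype_score(seq, frame=0):
    """Sum of log(prototype / background), with index `frame` as codon position 1."""
    score = 0.0
    for i, base in enumerate(seq.upper()):
        codon_pos = (i - frame) % 3          # 0, 1, 2 for positions 1, 2, 3
        score += math.log(PROTOTYPE[base][codon_pos] / BACKGROUND)
    return score


# Made-up sequence whose bases match the prototype best in frame 0.
scores = [prototype_score("GACGACGAC", f) for f in range(3)]
print([round(s, 3) for s in scores])
```

For this toy input the score is highest in frame 0 and negative in frame 2, mirroring the coding frame versus non-coding frame contrast that such statistics are meant to capture.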
Values of different coding statistics in the 223 bp long 2nd coding exon of the human globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene

                              Exon sequence                      Intron sequence
                              Coding     Non-coding frames       Frame 1   Frame 2   Frame 3
                              frame
Codon Usage                    24.06     -16.13     -3.16        -14.36    -23.74    -19.67
Hexamer Usage                  27.62     -11.64     -6.51        -20.90    -27.56    -22.07
(row label missing)            39.98     -14.58     -8.46        -26.73    -27.81    -25.87
Codon Preference               15.97      -1.32      7.24         -7.96    -12.70    -14.93
Amino Acid Usage                8.17     -14.87    -10.17         -6.15    -10.69     -4.57
Codon Prototype                 9.87     -11.23    -10.30        -11.45    -17.44    -14.49
Markov Model, order 1          29.92      -2.69     -3.31        -35.44    -42.40    -41.73
Markov Model, order 2          34.73     -18.26     -7.77        -29.61    -41.76    -40.05
Markov Model, order 5          72.69     -21.38     13.56        -37.63    -30.99    -36.40

                              Exon sequence    Intron sequence
Position Asymmetry             0.0957           0.0211
Periodic Asymmetry Index       1.159            1.009
Average Mutual Information     0.00681          0.000344
Fourier Spectrum               2.278            0.892
Pattern discriminant analysis
• A number of different pattern features of sequences are used to
discriminate coding (exon) from non-coding sequences. Linear and quadratic
discriminant analyses are shown, the latter being more efficient. In the
example, EPS is the 6-mer exon preference score and 3'SS is the 3' splice site.
[Figure: linear and quadratic discriminant analysis; axis label: EPS]
COMBINER: computational gene prediction using multiple sources of evidence
• A next generation of computational methods for constructing gene
models is currently being developed. These methods take as input, and combine,
a genomic sequence and the locations of gene predictions from ab initio gene
finders, protein sequence alignments, expressed sequence tag (EST) and cDNA
alignments, splice site predictions, and other evidence.
• An example of such a program is COMBINER, which uses rigorous
statistical assessment to evaluate candidate gene models and estimates
probabilities using so-called decision trees.