JIGSAW: a better way to
combine predictions
J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and
GlimmerHMM: puzzling out the features of human genes in the ENCODE regions.
Genome Biology 2006, 7(Suppl 1):S9.
J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for
gene prediction. Bioinformatics 21(18): 3596-3603, 2005.
J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple
sources of evidence. Genome Research, 14(1), 2004.
Collecting gene structure evidence for
JIGSAW
Figure 1. Evidence from the UCSC genome browser used as input to JIGSAW. Evidence includes:
computational gene finders, alignments from gene expression evidence and evidence of cross-species
sequence conservation.
Representing gene structure
evidence in JIGSAW
• Each evidence source can predict up to
six gene features:
– Start codon
– Stop codon
– Intron
– Protein coding nucleotides
– Donor site
– Acceptor site
Figure 3. Four evidence sources mapped to sequence S: gene prediction (GP1) with no
confidence score, gene prediction with confidence score 0.65 (GP2), cDNA aligned with
86% identity and an EST aligned with 95% identity. Examples of the different feature vector
types are shown: start codon (sta), stop codon (stp), donor site (don), acceptor site (acc),
intron (inr) and amino acid codon (cod). Each element in the feature vector is an evidence
source’s prediction for that feature type. The possible exon boundaries are k0, k1, …, k6.
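A minimal sketch of how a feature vector like those in Figure 3 could be assembled, one element per evidence source. The class name, the coordinates, and the choice to encode an unscored prediction as 1.0 and a missing prediction as 0.0 are assumptions made for illustration, not JIGSAW's actual encoding. Interval features (intron, coding) are simplified to single positions here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Feature type abbreviations as used in the Figure 3 caption.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")


@dataclass
class EvidenceSource:
    name: str
    score: Optional[float]  # None when the source reports no confidence
    # Hypothetical layout: feature type -> positions where it is predicted.
    predictions: Dict[str, Set[int]] = field(default_factory=dict)


def feature_vector(sources: List[EvidenceSource], ftype: str, pos: int) -> List[float]:
    """Return one value per source for feature `ftype` at position `pos`."""
    if ftype not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {ftype}")
    vec = []
    for src in sources:
        if pos in src.predictions.get(ftype, set()):
            # Assumption: an unscored prediction counts as 1.0.
            vec.append(1.0 if src.score is None else src.score)
        else:
            vec.append(0.0)  # Assumption: 0.0 means "no prediction here".
    return vec


# The four sources from Figure 3; the coordinates are made up.
sources = [
    EvidenceSource("GP1", None, {"sta": {100}, "don": {250}}),
    EvidenceSource("GP2", 0.65, {"sta": {100}, "don": {260}}),
    EvidenceSource("cDNA", 0.86, {"don": {250}}),
    EvidenceSource("EST", 0.95, {"don": {250}}),
]

start_vec = feature_vector(sources, "sta", 100)
donor_vec = feature_vector(sources, "don", 250)
```

Here the two gene predictions agree on the start codon, while the donor site at position 250 is supported by GP1 and both alignments but not by GP2.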
Training
[Figure: training examples S1, S2, …, Sm with evidence tracks (Gene pred. 1, Gene pred. 2, cDNA, EST alignment) and their scores. Start site, stop site, donor site, acceptor site, coding and intron feature vectors are extracted from initial, internal, terminal and single exons.]
Schematic of the JIGSAW training procedure. Known genes are used to evaluate the accuracy of the
different combinations of evidence. Prediction accuracy for each feature type (start codon, stop codon,
acceptor, donor, amino acid codon and intron) is measured separately.
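The idea of measuring accuracy separately per feature type and per combination of evidence can be sketched as a simple tally over training examples. This is a simplification under stated assumptions: the tuple layout and the use of a presence/absence pattern as the "combination" are hypothetical, and the example data are made up.

```python
from collections import defaultdict


def accuracy_by_combination(examples):
    """Estimate accuracy for each (feature type, evidence pattern) pair.

    `examples` is an iterable of (feature_type, pattern, is_correct), where
    `pattern` is a tuple marking which sources made the prediction and
    `is_correct` says whether it matched the known gene.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for ftype, pattern, is_correct in examples:
        entry = counts[(ftype, pattern)]
        entry[0] += int(is_correct)
        entry[1] += 1
    return {key: hit / total for key, (hit, total) in counts.items()}


# Made-up training observations for two sources.
examples = [
    ("don", (1, 1), True),
    ("don", (1, 1), True),
    ("don", (1, 1), False),
    ("don", (1, 0), False),
    ("sta", (1, 1), True),
]
accuracy = accuracy_by_combination(examples)
```

Keeping the feature type in the key is what keeps donor accuracy separate from start codon accuracy, even when the same sources agree.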
Figure 4a. The plot shows the accuracy of predictions based on alignments to non-human
sequences that overlap a gene finder’s predictions. Each point is a pair of alignments
observed in training and their percent identity to the genomic sequence. ‘+’ points are
labeled ‘accurate’ and ‘x’ points are labeled ‘inaccurate.’ The two lines correspond to
the non-leaf nodes in the decision tree.
Figure 4b. Decision tree used to partition the feature
vector space from Figure 4a into three sub-regions. This
decision tree indicates that non-human cDNA
alignments with > 95% identity to the human sequence
(region “V1”) are accurate protein coding predictors.
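A two-split decision tree of the kind described in Figure 4b can be sketched as below. Only the 95% cDNA identity threshold and the region name V1 come from the caption. The second split variable, its threshold, and the region names "V2" and "V3" are hypothetical, so the threshold is left as a required argument rather than given a default.

```python
def region(cdna_identity: float, other_identity: float, other_threshold: float) -> str:
    """Assign a pair of alignment identities (in percent) to a sub-region."""
    # First non-leaf node: the split stated in the Figure 4b caption.
    if cdna_identity > 95.0:
        return "V1"
    # Second non-leaf node: hypothetical split on the other alignment.
    if other_identity > other_threshold:
        return "V2"
    return "V3"


# Made-up points; 80.0 is a placeholder threshold, not a value from the slides.
labels = [
    region(97.0, 60.0, other_threshold=80.0),
    region(90.0, 85.0, other_threshold=80.0),
    region(90.0, 70.0, other_threshold=80.0),
]
```

Two non-leaf nodes give three leaves, which matches the three sub-regions of the feature vector space in Figure 4a.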
JIGSAW dynamic programming
[Figure: sequence S divided into intervals t0, t1, t2 with states q0, q1, q2; interval t0 runs from b0 to e0.]
Interval: ti = (bi, ei, qi) assigns state qi to the subsequence of S from bi to ei.
• Dynamic programming algorithm:
• at the end of each interval (e0, for example), store the
score of the best parse ending at that location
• Modification: store scores for every parse “type” ending
at e0
• Types are start, stop, coding, intron, donor, acceptor
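The bullets above can be sketched as a dynamic program over candidate intervals that keeps, at every end position, the best score for each parse type instead of a single overall best. This is an illustration under assumptions, not JIGSAW's implementation: additive interval scores, the `allowed` transition set, the start/end type sets, and the example intervals are all hypothetical.

```python
from collections import defaultdict


def best_parse(intervals, allowed, start_types, end_types):
    """Find the best-scoring chain of contiguous intervals.

    intervals: list of (b, e, q, score) with b < e.
    allowed:   set of (previous_type, next_type) pairs that may be adjacent.
    Returns (score, path), or None if no valid parse exists.
    """
    # best[e][q] = (score, path) of the best parse of type q ending at e.
    best = defaultdict(dict)
    # Process by end position so every predecessor is already scored.
    for b, e, q, s in sorted(intervals, key=lambda t: (t[1], t[0])):
        candidates = []
        if q in start_types:
            candidates.append((s, [(b, e, q)]))
        # Extend every stored parse type that ends exactly where this begins.
        for prev_q, (prev_s, prev_path) in best.get(b, {}).items():
            if (prev_q, q) in allowed:
                candidates.append((prev_s + s, prev_path + [(b, e, q)]))
        if not candidates:
            continue
        top = max(candidates, key=lambda c: c[0])
        if q not in best[e] or top[0] > best[e][q][0]:
            best[e][q] = top
    finals = [entry for by_type in best.values()
              for q, entry in by_type.items() if q in end_types]
    return max(finals, key=lambda c: c[0]) if finals else None


intervals = [
    (0, 100, "coding", 0.9),
    (0, 120, "coding", 0.6),
    (100, 200, "intron", 0.8),
    (120, 200, "intron", 0.9),
    (200, 300, "coding", 0.7),
]
allowed = {("coding", "intron"), ("intron", "coding")}
score, path = best_parse(intervals, allowed, {"coding"}, {"coding"})
```

Storing one entry per type matters because the next interval may only be allowed to follow a particular type, so the overall best parse at a position is not always the one that can be extended.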
JIGSAW GHMM gene model
Evidence types for JIGSAW
experiments on human DNA
• cDNA from human genes
• UniGene transcripts
• GenBank cDNAs matching SwissProt proteins w/at least 98% identity
• RefSeq genes from non-human species
• TIGR Gene Index (human and other)
• Ab initio gene finders
– Genscan, GeneID, GeneZilla, GlimmerHMM
– NOTE: JIGSAW allows you to use the same gene finder as multiple
“lines” of evidence - e.g., GlimmerHMM with different parameter
settings
• Alignment-based gene finders
– Twinscan
– SGP
• Predicted conserved elements from phylogenetic analysis (PhastCons)
Effects of different evidence sources
[Bar chart: sensitivity, specificity and F-score (percent, 0 to 100) for each evidence combination: gene finders, non-human EST, human mRNA, curated, all, KnownGene (no JIGSAW).]
Figure 6. JIGSAW prediction performance using different combinations of evidence. Gene finders = ab
initio gene finders only; non-human EST = gene finders + non-human expression evidence; human
mRNA = gene finders + human mRNA; curated cDNA = gene finders + KnownGene; all = all evidence.
KnownGene = cDNA evidence from curated proteins (from UCSC) without using JIGSAW.
Comparison of JIGSAW and other
methods on human ENCODE regions
[Bar chart: sensitivity, specificity and F-score (percent, 0 to 100) for JIGSAW, Pairagon, Augustus, Fgenesh++, Exogean, ExonHunter and Ensembl.]
Sensitivity (Sn) = % of exons correctly predicted
Specificity (Sp) = % of exon predictions that are correct
F-score = (2 x Sn x Sp) / (Sn + Sp)
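The exon-level sensitivity, specificity and F-score used in these comparisons can be computed as in the sketch below. It assumes an exon counts as correct only when both of its boundaries match an annotated exon exactly. The function name and the coordinates are made up for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, F) in percent for two collections of (start, end) exons."""
    predicted, annotated = set(predicted), set(annotated)
    # Assumption: a prediction is correct only on an exact boundary match.
    correct = len(predicted & annotated)
    sn = 100.0 * correct / len(annotated) if annotated else 0.0
    sp = 100.0 * correct / len(predicted) if predicted else 0.0
    f = (2 * sn * sp) / (sn + sp) if (sn + sp) > 0 else 0.0
    return sn, sp, f


annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 600), (650, 800), (900, 950)]
sn, sp, f = exon_accuracy(predicted, annotated)
```

With three of four annotated exons recovered and three of five predictions correct, Sn is 75%, Sp is 60%, and the F-score is about 66.7%.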
Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Top panel: dotplot for sensitivity versus
specificity at the exon level for CDS evaluation. Each dot represents the overall value for each program on the 31 test
sequences. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2
Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Bottom panel: boxplots of the average
sensitivity and specificity for each program. Each dot corresponds to the average in each of the test sequences for
which GENCODE annotation existed. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2.
EGASP results:
Gene level
accuracy
JIGSAW on other species