Genomic Annotation

Genes and Pseudogenes in Primates
So Far…

Understand the basics of genetic homology
    interpret score & e-value
    combine local alignments
How to use homology from various databases to improve annotation
    protein, EST, and neighbor-species homology can all add more evidence
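
To make the link between score and e-value concrete, here is a minimal sketch using the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The function name and the example numbers are illustrative assumptions, and real BLAST reports use effective (corrected) lengths that this sketch ignores.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value: hits with at least this bit score expected by chance.

    Assumes the simple relation E = m * n * 2**(-S') with raw (uncorrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 300-residue query against a 1e9-residue database.
for score in (30, 50, 100):
    print(score, expected_hits(score, 300, 10**9))
```

Each extra bit of score halves the expected number of chance hits, which is why a modest gain in bit score can move a hit from uninformative to convincing.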
Ab Initio gene finders

Ab Initio: “From the beginning”
Computer programs that attempt to find and annotate genes based solely on the nucleotide sequence
High success rate for prokaryotes (70-80%)
Low success rate for eukaryotes (15-25%)
Most failures for eukaryotes involve the ends of the gene (fused & split genes, wrong start or stop)
Ab initio gene finders do pretty well at getting at least part of a gene right
Strategy: start with ab initio predictions & modify based on other evidence; gather as much evidence as you can to support your conclusion
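
To show what "based solely on the nucleotide sequence" can mean in its simplest form, here is a toy forward-strand ORF scan. It is only a sketch: real ab initio gene finders model splice sites, exon/intron structure, and codon statistics, none of which appears here. The function name and the `min_codons` default are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30):
    """Return (start, end) pairs for ATG..stop ORFs on the forward strand.

    Coordinates are 0-based with an exclusive end; all three frames are scanned.
    min_codons counts the ATG but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs


# Tiny made-up sequence: ATG + 5 codons + TAA, offset by two bases.
print(find_orfs("CC" + "ATG" + "GCT" * 5 + "TAA" + "GG", min_codons=3))
```

A scan like this has no notion of introns, which is one reason purely sequence-based prediction is so much harder for eukaryotes than for prokaryotes.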
Genscan

Good “basic” gene finder
Provides useful predictions even without species-specific training
Can be improved if you have a set of known genes from that or related species to optimize the algorithm for those gene characteristics
Many other gene finders are available; most of these automate the incorporation of other forms of evidence that must also be provided (EST data, conservation among neighbor species)
Basic Strategy for Annotation

Use ab initio prediction to focus attention on genomic features of interest
Add as much other evidence as you can to refine and support your conclusion
What other evidence is there?
1. Basic gene structure
2. Motif information
3. BLAST homologies: nr, protein, EST
4. Other species or other proteins
Chimpanzee annotation
1. Basic gene structure

Only ~15% of known mammalian genes have 1 exon
Many pseudogenes are mRNAs that have been retrotransposed back into the genome; many of these will appear as single-exon genes
Increase vigilance for signs of a pseudogene for any single-exon gene
Alternatively, there may be missing exons
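
The single-exon rule of thumb is easy to apply mechanically. The sketch below flags predictions that deserve a closer pseudogene check; the `Prediction` record and its fields are hypothetical, not the output format of any particular gene finder.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one predicted gene."""
    name: str
    exons: list  # list of (start, end) coordinate pairs


def needs_pseudogene_check(pred: Prediction) -> bool:
    """Single-exon predictions may be retrotransposed mRNAs or be missing exons."""
    return len(pred.exons) == 1


# Made-up predictions for illustration only.
preds = [
    Prediction("pred_1", [(1200, 2100)]),
    Prediction("pred_2", [(5000, 5150), (6100, 6400), (7020, 7300)]),
]
for p in preds:
    print(p.name, "check for pseudogene" if needs_pseudogene_check(p) else "multi-exon")
```

A flag here is a prompt to look harder, not a verdict: the prediction may be a genuine single-exon gene or a multi-exon gene with missed exons.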
Chimpanzee annotation
2. Motif information

Genscan uses statistical methods to predict genes and will tag all apparent ORFs of sufficient length
Since the genome is very large, statistical methods will give some false positives (sequence looks like a gene simply by chance)
If the predicted gene has protein motifs found in other proteins, it is much less likely to be a false positive and more likely to be a real gene or a real pseudogene
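
A back-of-the-envelope calculation shows why chance alone produces gene-like stretches in a large genome. The sketch assumes all 64 codons are equally likely and independent, which real genomes are not, so treat the numbers as rough intuition only.

```python
def p_no_stop(n_codons: int) -> float:
    """Probability that n random codons contain no stop codon.

    Assumes 61 of 64 equally likely codons are non-stop (a simplification).
    """
    return (61 / 64) ** n_codons


# Longer open stretches are rarer, but a large genome offers many chances.
for n in (50, 100, 300):
    print(n, p_no_stop(n))
```

The per-position probability drops quickly with length, yet multiplied across many millions of possible start positions it still leaves room for some purely accidental ORFs.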
Chimpanzee annotation
3. BLAST homology: nr, protein, EST

Homology to known proteins argues against a false positive
Mammals have many gene families and many pseudogenes (both of these can show high similarity to your predicted gene)
Consider length and percent identity when examining alignments. Human vs. chimp orthologs should differ by <1%; most paralogs will differ by more than this
Without good EST evidence you can never be sure; make your best guess and be able to defend it!
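
The percent-identity check can be sketched as follows. The function names are assumptions, the input is a pair of already-aligned sequences with `-` for gaps, and the 1% cutoff simply mirrors the rule of thumb above rather than any formal threshold.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_a.upper(), aln_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def looks_like_ortholog(aln_a: str, aln_b: str, max_divergence: float = 1.0) -> bool:
    """True if divergence is under the (assumed) human vs. chimp rule-of-thumb cutoff."""
    return 100.0 - percent_identity(aln_a, aln_b) < max_divergence


# Made-up alignment: one mismatch in 200 columns.
print(percent_identity("A" * 200, "A" * 199 + "C"))
print(looks_like_ortholog("A" * 200, "A" * 199 + "C"))
```

Alignment length matters as much as the percentage: a short, near-perfect match says far less than a full-length one.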
Chimpanzee annotation
4. Other species or other proteins

For any similarity hit, look for even better hits elsewhere in the genome; orthologs and pseudogenes will look similar, but for a pseudogene there will usually be an even better hit somewhere else
If you are convinced you have a gene and it is a member of a multi-gene family, be sure to pick the right ortholog
Look at synteny with an appropriately distant species (mouse or rat); evidence for a transposition suggests a pseudogene
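
The "look for even better hits elsewhere" check can be sketched as a simple comparison of scores by locus. The `Hit` record, the locus labels, and the use of bit score as the ranking criterion are all assumptions for illustration; in practice you would also weigh alignment length and percent identity.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical similarity hit: where it falls in the genome and how well it scores."""
    locus: str
    bit_score: float


def better_hits_elsewhere(hits, candidate_locus):
    """Return hits at other loci that outscore the best hit at candidate_locus."""
    here = [h.bit_score for h in hits if h.locus == candidate_locus]
    if not here:
        # No hit at the candidate locus at all: every hit is "elsewhere".
        return sorted(hits, key=lambda h: h.bit_score, reverse=True)
    best_here = max(here)
    others = [h for h in hits
              if h.locus != candidate_locus and h.bit_score > best_here]
    return sorted(others, key=lambda h: h.bit_score, reverse=True)


# Made-up hits for one query protein.
hits = [Hit("chrA:100", 250.0), Hit("chrB:500", 410.0), Hit("chrC:900", 120.0)]
print(better_hits_elsewhere(hits, "chrA:100"))
```

A non-empty result suggests the candidate locus may be a paralog or pseudogene rather than the true ortholog, and is a cue to check synteny.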
Group Practice

Follow the handout in which we analyze two genes from a 170 kb region of the chimpanzee genome
To save time, the GENSCAN analysis is completed for you and can be retrieved from Goose
