Survey of Misannotations and
Pseudogenes in the Arabidopsis Genome
Tanmay Prakash
Objectives
•Find Possible Misannotations
•Find Possible Pseudogenes
Why
•Misannotation can hinder research
•Pseudogenes can be used to study
natural selection
Misannotations
[Diagram: gene structure, UTR | CDS | Intron | CDS | UTR]
Many misannotations are the result of gene prediction programs
mislabeling a region as an intron because of the presence of
a stop codon.
Pseudogenes
Pseudogenes are DNA sequences that no longer function but
resemble the functional genes they once were. There are
two types:
•Processed
•Non-processed
Common Properties of Pseudogenes
•Stop Codons
•Frameshift mutations
•Lack of Selective Pressure
agtacatgcataggactcgatcgactc
STCIGLDRL
agtacatgataggactcgatcgactc
ST..DSID
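The two sequences above can be checked with a short translation sketch. This is illustrative only, not the tool used in the project: the helper name `translate` is hypothetical, the standard genetic code is assumed, and stop codons are printed as `*` where the slide uses `.`.

```python
# Standard genetic code, built from the conventional t/c/a/g codon ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in reading frame 0; '*' marks a stop codon.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


intact = "agtacatgcataggactcgatcgactc"   # sequence from the slide
mutated = "agtacatgataggactcgatcgactc"   # same sequence with one 'c' deleted

print(translate(intact))   # STCIGLDRL
print(translate(mutated))  # ST**DSID
```

The single-base deletion shifts the reading frame, so every downstream codon changes and two stop codons appear immediately after the deletion.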
Pipeline
•BLAST search: query protein domains against Arabidopsis
introns -> genes matching in introns
•HMMER search: query protein domains against Arabidopsis
CDS -> genes matching in CDS (exons)
•Genes matching in both -> possibly misannotated genes
•Check possibly misannotated genes for stop codons and
frameshifts, then check Ka/Ks -> possible pseudogenes
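The "genes matching in both" step of the pipeline amounts to intersecting two sets of hits. A minimal sketch follows, assuming the BLAST and HMMER results have already been parsed into dictionaries mapping each gene to the set of domains it matched. The function name, the gene names, and the example hits are hypothetical and are not taken from the project's data.

```python
def candidate_misannotations(intron_hits, cds_hits):
    """Return {gene: shared domains} for genes whose introns and CDS
    both match at least one common domain."""
    candidates = {}
    for gene, intron_domains in intron_hits.items():
        shared = intron_domains & cds_hits.get(gene, set())
        if shared:
            candidates[gene] = shared
    return candidates


# Hypothetical parsed search results (gene -> matched domains).
intron_hits = {
    "geneA": {"PF01657"},
    "geneB": {"PF02902"},
    "geneC": {"PF07734"},
}
cds_hits = {
    "geneA": {"PF01657", "PF06721"},
    "geneB": {"PF07734"},
    "geneD": {"PF01657"},
}

print(candidate_misannotations(intron_hits, cds_hits))
# {'geneA': {'PF01657'}}
```

Only geneA is kept, because it is the only gene where the same domain is found in both an intron and the coding sequence.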
Results
346 genes (different gene models not counted) had
matches to the same domain in both the
introns and the exons.
Of these, 299 genes (different gene models not
counted) had matches to the same
domain in an intron and its flanking exons.
These are most likely misannotations.
The 4 domains with the most possible misannotations:
Domain       Possible Misannotations   #Domains
PF01657.7    16                        76
PF02902.8    15                        32
PF06721.1    13                        3
PF07734.2    15                        113
[Chart: Domain Family Size vs Misannotations. x-axis: Number of Domains in Family; y-axis: Number of Misannotations]
[Chart: Misannotation Frequency. x-axis: Number of Genes Matching Domain; y-axis: Percentage Misannotation]
[Chart: Domain Gene Frequency. x-axis: Number of Genes Matching Domain; y-axis: Number of Misannotations]
Future Research
•Identify pseudogenes by looking for stop codons and
frameshift mutations in the introns and by checking the
Ka/Ks value
•Use a more recent database of domains
•Follow the same process for the rice genome
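The Ka/Ks check in the first item above could be read along these lines. This is a sketch only: it assumes Ka and Ks have already been estimated by a separate tool, and the function name and the `tolerance` cutoff are hypothetical choices for illustration, not values from the project. The underlying idea is the usual one: a ratio well below 1 suggests purifying selection on a functional gene, while a ratio near 1 is what relaxed selection on a pseudogene would be expected to produce.

```python
def selection_regime(ka, ks, tolerance=0.2):
    """Classify a Ka/Ks ratio.

    ka: nonsynonymous substitution rate (estimated elsewhere)
    ks: synonymous substitution rate (estimated elsewhere)
    tolerance: hypothetical cutoff for calling a ratio "close to 1"
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return "relaxed"      # consistent with a lack of selective pressure
    return "purifying" if ratio < 1.0 else "positive"


print(selection_regime(0.05, 0.5))  # purifying
print(selection_regime(0.45, 0.5))  # relaxed
```

A ratio near 1 would not by itself establish that a sequence is a pseudogene, so it would be combined with the stop-codon and frameshift checks.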
Acknowledgement
Dr. Shin-Han Shiu
Dr. Kosuke Hanada
Dr. Melissa Lehti-Shiu
Dr. Gail Richmond
HSHSP