Coding Domain Sequence Prediction and Alternative Splicing

Coding Domain Sequence Prediction and Alternative Splicing Detection in the Human Malaria Vector Anopheles gambiae
Jun Li¹, Bing-Bing Wang², Jose M. Ribeiro³, Kenneth D. Vernick¹,⁴
1. Dept. of Microbiology, University of Minnesota, St. Paul, MN. 2. Pioneer Hi-Bred International, Johnston, IA. 3. LMVR/NIAID, NIH, MD. 4. UGGIV, Institut Pasteur, Paris, France
Introduction
• Nearly 2/3 of the world's population is at risk for malaria.
• 1.5 to 2.5 million children die annually.
• A. gambiae is the major malaria vector.
• Genome-wide research needs good CDS structure prediction and alternative splicing (AS) information.
• The A. gambiae CDS structures currently in use were predicted with comparative algorithms that are too conservative, so many genes are missing.
• Comparative gene prediction algorithms also have trouble predicting terminal exons; as a result, >40% of the CDS predicted this way lack start and/or stop codons.
• The purpose of this work is to create an A. gambiae-specific gene model, complete the incomplete CDS, and provide AS information.
Combinational Gene Prediction Algorithm
• Gold gene set to train GlimmerHMM: EST clusters with a perfect match to known proteins, perfectly mapped to the A. gambiae gold path.
[Figure: bar chart, vertical axis 0 to 200,000; bar labels not recoverable]
• Open-Reading-Frame-Selection Algorithm: takes the Union CDS as input and first asks whether it contains an internal stop.
• Exon-Gene-Union Algorithm
$\{x \in A\} \cup \{x \in P\} \Rightarrow x \in C$
where x is a base pair, A is the ab initio predicted CDS, P is the comparative predicted CDS, and C is the combinational CDS.
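A minimal sketch of the Exon-Gene-Union step, assuming each prediction is given as a list of inclusive (start, end) exon intervals on the same strand. The function name `union_exons` and the interval representation are illustrative assumptions and are not taken from the authors' package.

```python
def union_exons(ab_initio, comparative):
    """Base-pair union of two exon predictions.

    Each argument is a list of inclusive (start, end) intervals
    (hypothetical representation). A base pair x belongs to the result
    if it is covered by either input, i.e. C = A union P.
    """
    merged = []
    for start, end in sorted(ab_initio + comparative):
        # Overlapping or directly adjacent intervals cover one contiguous run.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


A = [(100, 200), (300, 400)]   # ab initio exons (made-up coordinates)
P = [(150, 250), (500, 600)]   # comparative exons (made-up coordinates)
C = union_exons(A, P)          # [(100, 250), (300, 400), (500, 600)]
```

Merging sorted intervals gives the same base-pair set as the per-base definition while keeping the result in exon form.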
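Open-Reading-Frame-Selection flowchart (decision nodes in order; only the branch labels listed here survive from the figure):
1. Union CDS: any internal stop? No → CDS set.
2. Yes → is there a frame spanning the whole region of the Union CDS?
3. No → multiple CDS found by the comparative algorithm?
4. No → multiple CDS found by the ab initio algorithm?
5. No → the longest transcript.
Outcome nodes: CDS set, Alternative Splicing.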
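The selection flowchart can be sketched as a chain of checks. This is only an illustration under stated assumptions: the "Yes" branches of the later decisions are not legible in the figure, so the sketch assumes that a spanning frame yields a CDS and that multiple CDS from one algorithm are reported as alternative splicing. The function names and return labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_internal_stop(seq, frame=0):
    """True if a stop codon occurs before the last complete codon of the frame."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def open_frames(seq):
    """Frames (0, 1, 2) that read through the whole sequence without an internal stop."""
    return [frame for frame in range(3) if not has_internal_stop(seq, frame)]


def select_cds(union_cds, comparative, ab_initio):
    """Walk the decision chain; branch targets marked 'assumed' are not from the poster."""
    if not has_internal_stop(union_cds):
        return "cds_set", [union_cds]
    frames = open_frames(union_cds)
    if frames:                                   # assumed: a spanning frame gives the CDS
        return "cds_set", [union_cds[frames[0]:]]
    if len(comparative) > 1:                     # assumed: multiple CDS -> AS
        return "alternative_splicing", comparative
    if len(ab_initio) > 1:                       # assumed: multiple CDS -> AS
        return "alternative_splicing", ab_initio
    candidates = (comparative + ab_initio) or [union_cds]
    return "longest_transcript", [max(candidates, key=len)]


blocked = "TAAATAGATGAAAAAAA"  # toy sequence with an internal stop in all three frames
result = select_cds(blocked, ["ATGAAA"], ["ATGCCCAAATAA"])
```

Here `result` falls through to the last branch and keeps the longest candidate transcript.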
Combinational algorithm improves single algorithm prediction

Algorithm                 Sensitivity   Specificity   Complete rate
GlimmerHMM                95%           90%           100%
Ensembl                   92%           99%           60%
Combinational algorithm   96%           99%           95%

[Figure: Combinational vs Ensembl. Comparison of CDS structure from the combinational algorithm and Ensembl. Categories: Novel, Internal exon changed, Extension, Identical.]
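The poster does not say how the three measures were computed. The sketch below assumes the convention common in gene prediction: nucleotide-level sensitivity TP/(TP+FN) and specificity TP/(TP+FP), with "complete" meaning a CDS that has both a start and a stop codon. All function names and the toy data are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def covered(exons):
    """Set of base positions covered by inclusive (start, end) exon intervals."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def sensitivity_specificity(predicted, reference):
    """Nucleotide-level measures (assumed convention, see lead-in)."""
    pred, ref = covered(predicted), covered(reference)
    true_pos = len(pred & ref)
    return true_pos / len(ref), true_pos / len(pred)


def complete_rate(cds_seqs):
    """Fraction of CDS sequences with both an ATG start and a stop codon."""
    complete = [s for s in cds_seqs if s[:3] == "ATG" and s[-3:] in STOP_CODONS]
    return len(complete) / len(cds_seqs)


sens, spec = sensitivity_specificity([(1, 10)], [(1, 20)])   # toy intervals
rate = complete_rate(["ATGAAATAA", "AAATAA", "ATGAAA", "ATGCCCTGA"])
```

In the toy case the prediction covers half of the reference with no extra bases, and two of the four sequences have both terminal codons.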
Alternative splicing detection in A. gambiae
[Figure: AS distribution in A. gambiae. Stacked bars (0% to 100%) for the Raw and Curated sets; categories: IntronR, AltA, AltD, AltP, ExonS, Others.]
EST-aided AS detection algorithm
1. Align ESTs to the genome, process the alignments, and extract exon/intron information.
2. Upload to MySQL DB.
3. Quality control: build EST clusters and merge introns and exons from individual alignments.
4. Compare intron/intron and intron/exon pairs, find overlapping events, and classify AS events.
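The exon/intron extraction and the intron/intron and intron/exon comparisons of the EST-aided pipeline can be sketched as follows. This is a simplified illustration, not the authors' code. It assumes forward-strand, inclusive coordinates and reads the category labels as alternative acceptor (AltA), alternative donor (AltD), alternative position (AltP), exon skipping (ExonS), and intron retention (IntronR); the poster does not expand these abbreviations. Function names are hypothetical.

```python
def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons of one spliced alignment."""
    exons = sorted(exons)
    return [(end1 + 1, start2 - 1) for (_, end1), (start2, _) in zip(exons, exons[1:])]


def classify_intron_pair(intron_a, intron_b):
    """Compare two introns, each given as (donor side, acceptor side)."""
    (d1, a1), (d2, a2) = intron_a, intron_b
    if intron_a == intron_b or a1 < d2 or a2 < d1:
        return None          # identical or non-overlapping: no event
    if d1 == d2:
        return "AltA"        # same donor, different acceptor
    if a1 == a2:
        return "AltD"        # same acceptor, different donor
    return "AltP"            # both splice sites differ


def classify_intron_exon(intron, exon):
    """Compare an intron with an exon taken from a different alignment."""
    if exon[0] < intron[0] and intron[1] < exon[1]:
        return "IntronR"     # intron lies inside an exon
    if intron[0] < exon[0] and exon[1] < intron[1]:
        return "ExonS"       # exon lies inside an intron
    return None


introns = introns_from_exons([(1, 100), (201, 300), (401, 500)])
# introns == [(101, 200), (301, 400)]
```

A full pipeline would also need to resolve precedence when one intron takes part in several comparisons (for example, an exon-skipping intron that shares its donor with a shorter intron); the sketch leaves that out.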
Conclusion: 1512 CDS show alternative splicing. Most AS events fall within the CDS region, which would enrich protein structure and function. Manual curation shows that the false-positive rate (due to EST contamination) is low (10%). The AS type distribution indicates that the mosquito is closer to plants than to mammals.
Software package and web presentation
The combinational CDS prediction and alternative splicing detection pipelines have been integrated into our open-source package (collaboration is welcome). Results are also accessible through the web.