Transcript T. brucei

Genome sequence update
Plasmodium falciparum (3D7) - published in 2002. Draft coverage. No sequence updates for a
year. No new annotation since?
Leishmania major Friedlin - version 3.0 data release containing 32.436 Mb in 71 contigs. Release
of version 4.0, which will contain each of the 36 chromosomes in a single contig, is anticipated
within the next few weeks.
Trypanosoma cruzi CL-Brener - current assembly contains 8,780 contigs, which can be linked
into 5,517 scaffolds, representing a total of 67.557 Mb.
T. brucei (TREU927) - latest official release (version 2.0) contained 125 contigs, representing
25.544 Mb, and a new release containing essentially single sequence contigs (with the exception
of sub-telomeric repeat regions) for each chromosome is expected within the next month.
Plasmodium vivax (Salvador I) - currently at 10X coverage. Genome closure is pending funding,
which, if successful, will allow gap closure and finishing by Spring 2005. In the absence of
funding, annotation of the genome will begin this autumn.
L. infantum (clone JPCM5 (MCAN/ES/98/LLM-877)) - 5X sequencing was completed in October
2003. Annotation of this sequence has not yet begun.
Annotation update
With the exception of P. vivax and L. infantum, these genome sequences have been annotated
for protein-coding genes.
L. major - manual examination of predictions carried out at both SBRI and WTSI refined the
number of likely protein-coding genes to 8021 for the version 3.0 release. Addition of new
sequence in version 3.1 has brought the current total number in GeneDB (the “official”
repository for LmjF annotation) to 8151.
T. cruzi - AutoMAGI was used to predict probable protein-coding genes. Due to the complex
organization of the T. cruzi genome discussed above, a total of 25,235 genes have been predicted.
Automated annotation using a variety of different approaches, such as BLASTP and Pfam analysis,
has been carried out at TIGR.
T. brucei - 13,321 predicted protein-coding genes. It is believed that this number is a significant
over-prediction, and the sequencing centers are now working to exclude a relatively large number
of small genes, which are unlikely to be protein coding.
As is the case for both L. major and T. cruzi, the gene predictions in T. brucei are currently under
refinement, based upon comparison between the gene predictions from each of the three
organisms. In the case of T. brucei and T. cruzi, a lower number of genes is anticipated in the next
version release, expected next month, whilst the number is anticipated to be marginally higher for
L. major.
Impact on SGPP
The importance of these continuing efforts for the SGPP project is clear; the numbers of
possible target proteins have increased dramatically over the last year.
However, the complement of putative protein coding genes from each of these genomes is still in
the process of being refined; the current datasets contain significant numbers of false positives (as
well as a smaller, but significant number of false negatives).
The sequencing centers involved in these projects are currently engaged in the resolution of a
number of these issues in the trypanosomatid genomes, and it is expected that they will be
resolved to a great degree within a matter of weeks.
Progress
Targets have been selected from all species under review (with the exception of the newly
included L. infantum genome).
The table provided below shows the number of proteins, for each species, flowing through each
part of the process from downloading from the relevant data source to confirmation as a viable
target.
Species          Downloaded   Unique*   Confirmed
L. major         13,939       9,773     5,087
T. brucei        13,340       10,963    566
T. cruzi         25,070       13,982    6,014
P. falciparum    9,639        4,556     2,683
P. vivax         N/A          N/A       169
* The number of unique genes identified through this methodology is an overestimate. For any
given gene, previous versions of the annotation may well differ from later versions due to either
changes in the underlying sequence, or alterations in the prediction of the start codon. Such
changes are likely to occur for both L. major and T. brucei genes, due to initial annotation of
unfinished sequence, followed by sequence alteration, and in T. brucei and P. falciparum due to
initial automated annotation, followed by manual inspection. We are currently implementing a
system that aims to identify genes whose annotations have altered, in order to ensure that the
failure of the resultant protein to express is correctly identified as being due to initial sequence
errors.
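
As an illustration only, the kind of comparison such a system might perform can be sketched as follows. This is a minimal sketch, not the system under development: it assumes each annotation release is available as a mapping from gene identifier to protein sequence, and the function names, category labels and example sequences are all hypothetical. A changed start codon is approximated here as one version of the protein being a suffix of the other.

```python
def classify_annotation_change(old_seq, new_seq):
    """Classify how one gene's predicted protein differs between two releases."""
    if old_seq == new_seq:
        return "unchanged"
    # If one sequence is a suffix of the other, only the N-terminus differs,
    # which is what a re-predicted start codon would produce (assumption).
    if old_seq.endswith(new_seq) or new_seq.endswith(old_seq):
        return "start_changed"
    return "sequence_changed"


def compare_releases(old, new):
    """Compare two releases given as {gene_id: protein_sequence} dicts."""
    report = {}
    for gene_id, old_seq in old.items():
        if gene_id not in new:
            report[gene_id] = "removed"
        else:
            report[gene_id] = classify_annotation_change(old_seq, new[gene_id])
    for gene_id in new:
        if gene_id not in old:
            report[gene_id] = "added"
    return report


# Made-up example data; identifiers and sequences are not real genes.
old_release = {"g1": "MKTAYL", "g2": "MAAAMKT", "g3": "MKKK", "g4": "MQQ"}
new_release = {"g1": "MKTAYL", "g2": "MKT", "g3": "MKRK", "g5": "MWW"}
report = compare_releases(old_release, new_release)
```

Genes flagged as "start_changed" or "sequence_changed" are the ones for which an expression failure could plausibly be traced back to the earlier annotation rather than to the protein itself.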
Selection criteria used to date
Selection for both soluble proteins and integral membrane proteins; TM prediction algorithms
used to differentiate between these two classes, and also to allow cleavage of predicted signal
peptides and targeting signals.
Target selection generally applies a length threshold of 800 amino acids (although
shorter thresholds have been used and are currently being used).
Parsing of large proteins from Baker lab (David K).
Selection of Plasmodium targets based on interactions identified by the Fields lab
(Marissa/Doug).
Multi-leish approach (Chris M/Peter M).
In order to concentrate on proteins that are likely to represent novel folds - PDB search to
quantify identity and similarity of the target sequence to proteins of known structure (David K).
In order to strengthen our drug target strategy, we have also imposed restraints upon the degree of
identity to human proteins (currently set at 50%) beyond which proteins are excluded from
further analysis (Frank).
Targeting of known and putative enzymes (COGs, EC numbers, BRENDA) (David K).
Targeting of proteins with possible medical relevance through selection of proteins with
sequence identity to proteins under patent (Wes/Fred).
Identification of a set of P. vivax targets, homologous to previously attempted P. falciparum
proteins (David K).
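
Two of the criteria above are simple numerical cut-offs, and their combined effect can be sketched in a few lines. This is a hedged illustration rather than the pipeline actually used: the Candidate record, its field names and the example values are hypothetical, the 800 amino acid threshold is assumed to be an upper bound, and proteins sitting exactly on either cut-off are assumed to be retained.

```python
from dataclasses import dataclass

MAX_LENGTH_AA = 800        # length threshold from the text, assumed to be an upper bound
MAX_HUMAN_IDENTITY = 0.50  # identity to human proteins beyond which a protein is excluded


@dataclass
class Candidate:
    gene_id: str           # hypothetical identifier
    length_aa: int         # protein length in amino acids
    human_identity: float  # best identity to any human protein, as a fraction 0.0-1.0


def passes_filters(c, max_length=MAX_LENGTH_AA, max_human_identity=MAX_HUMAN_IDENTITY):
    """Return True if the candidate is within both cut-offs (boundaries inclusive)."""
    return c.length_aa <= max_length and c.human_identity <= max_human_identity


def select_targets(candidates, **cutoffs):
    """Return the gene_ids of candidates that pass the length and identity filters."""
    return [c.gene_id for c in candidates if passes_filters(c, **cutoffs)]


# Made-up candidates for illustration only.
candidates = [
    Candidate("geneA", 350, 0.22),   # passes both filters
    Candidate("geneB", 1200, 0.10),  # too long
    Candidate("geneC", 410, 0.71),   # too similar to a human protein
    Candidate("geneD", 800, 0.50),   # exactly on both cut-offs, retained
]
selected = select_targets(candidates)
```

Passing a different cut-off, for example select_targets(candidates, max_length=400), corresponds to the shorter length thresholds mentioned above.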