Slide 1 - LBGI Bioinformatique et Génomique Intégratives


ARPAnno: a dedicated web tool for Annotation of Actin Related Proteins
Jean Muller (1,3), Yukako Oma (2), Laurent Vallar (3), Evelyne Friederich (3), Olivier Poch (1) and Barbara Winsor (2)
1 Laboratoire de Biologie et Génomique Structurales, IGBMC, CNRS/INSERM/ULP, BP 163, 67404 Illkirch cedex, France.
2 Laboratoire Modèles Levure de Pathologies Humaines, FRE2375, IPCB, CNRS, 21 rue Descartes, 67084 Strasbourg, France.
3 Laboratoire de Biologie Moléculaire, d'Analyse Génique et de Modélisation, CRP-Santé, 42, rue du Laboratoire, L-1911, Luxembourg.
[email protected]
Introduction
Actin Related Proteins (ARPs) are key players in major biological processes essential to cell life. In cytoskeletal activities, the ARP2/3 complex is essential for actin dynamics, while ARP1 and ARP11 are involved in microtubule-based vesicle trafficking. In nuclear functions (transcriptional activation, tumor suppression…), ARP4-ARP9 are components of many chromatin modulation complexes (SWI2/SNF2, SWR1, HAT). Conventional actins and ARPs together define a large family of homologous proteins, the actin superfamily, which share a tertiary structure known as the “actin fold”. Since 1997 (Poch and Winsor), the unified classification of ARPs has comprised 11 families, ranked primarily by decreasing sequence similarity to conventional actin sequences: ARP1 is the most similar and ARP11 the least similar. Because ARP and actin sequences are so closely related, it is often difficult to annotate ARP sequences unambiguously using classical database searches. Discriminative tools that distinguish ARPs from actin are therefore of high interest for understanding the mechanisms in which these proteins are involved. An initial dataset has been defined, forming the basis of a multiple alignment of all ARP sequences. This set allows us to characterise each ARP family (sequence identity, specific residues and insertions, phylogenetic distribution) and to implement ARPAnno (http://bips.u-strasbg.fr/ARPAnno), a web server dedicated to ARP sequence annotation.
Initial set
In-depth searches of the protein database (Uniprot) were run to retrieve the maximum number of different ARP sequences, using for each family distinct queries from distantly related organisms (i.e. H. sapiens, D. melanogaster and S. cerevisiae) and the PipeAlign program (blastp, ballast, DbClustal, Rascal, DPC; http://bips.u-strasbg.fr/PipeAlign).
73340 proteins were detected, representing 4200 non-redundant, “non-fragment” sequences. Unrelated sequences and proteins with ≤ 15% amino acid identity were not included in the final alignment.
High quality ARP Multiple Alignment of Complete Sequences (MACS) containing 692 sequences and 146 ARPs.
The number of ARP sequences in the protein database (Uniprot) increased from 29 (1997) to 146 (July 2004). The families fall into 3 groups: >19 sequences for ARP1-4; >10 for ARP5, ARP6 and ARP8; and ≤ 10 for ARP7, ARP9, ARP10 and ARP11.

ARP families characterisation

Basic sequence analysis
IniID: initial percent identity used to classify ARP families (% identity to the group of 29 actins).
RefID: mean ARP family percent identity to the reference actin:
RefID = \frac{1}{n} \sum_{i=1}^{n} ID(S_i, S_{REF})
FamID: mean percent identity inside a family:
FamID = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} ID(S_i, S_j)
Here n is the number of sequences in the family, S_i is the i-th sequence of the family and S_REF is the reference actin.
Assessment of the 11 ARP family classifications.
Decreasing percent identity to reference actin (RefID) for ARP1 to ARP11.
High family conservation (FamID) for ARP1-3, the main cytoplasmic ARPs, in contrast to the nuclear ARPs and the most divergent ARP10 and ARP11 families.

Definition of ARP family features
[Figure legend: actin sequence; actin subdomains 1, 2 and 3, 4; deletion; insertion; specific insertion; specific residue or motif; hot spot of insertion/deletion (A)]
The alignment highlights specific features such as conserved residues or motifs and insertions for ARP1-9. No specific features have been defined for the divergent ARP10 and ARP11.
4 hot spots of insertions (A, B, C, D) can be seen at positions peripheral to the core fold.
Creation of an ARP family Knowledge Filter, which is a cornerstone of the ARP annotation process.

Distribution of ARP families among eukaryotes
Actin is present in all eukaryotic organisms explored.
ARP4 and ARP6 are present in all organisms tested. Nuclear ARPs are the minimum package for eukaryotic organisms.
Presence and absence patterns reveal pairs of ARPs (ARP2 with ARP3, ARP4 with ARP6, and ARP5 with ARP8). This strongly correlates with the biological data available for ARP-containing complexes.
S. pombe and Y. lipolytica have no ARP7 but are the only yeasts out of 31 to have a second ARP4 (ARP4*).
The eukaryotic presence and absence distribution is cross-validated using proteome searches (blastp in Uniprot) and genome exploration (tblastn) in 19 different organisms ranging from T. pseudonana (algae) to H. sapiens (mammals).
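The RefID and FamID means used in the basic sequence analysis can be illustrated with a minimal Python sketch. The poster does not state how the pairwise identity ID(S_i, S_j) was computed, so the `percent_identity` function below is an assumption: it counts identical residues over the alignment columns where neither sequence has a gap. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def ref_id(family, reference):
    """RefID: mean percent identity of the family members to a reference sequence."""
    if not family:
        raise ValueError("family must contain at least one sequence")
    return sum(percent_identity(s, reference) for s in family) / len(family)


def fam_id(family):
    """FamID: mean percent identity over all unordered pairs inside a family."""
    n = len(family)
    if n < 2:
        raise ValueError("family must contain at least two sequences")
    total = sum(percent_identity(a, b) for a, b in combinations(family, 2))
    return 2.0 * total / (n * (n - 1))


if __name__ == "__main__":
    # Toy aligned sequences, invented for illustration only.
    reference = "AAAA"
    family = ["AAAA", "AAAT", "AATT"]
    print("RefID:", ref_id(family, reference))
    print("FamID:", fam_id(family))
```

The factor 2 / (n(n - 1)) is simply one over the number of unordered pairs, which is why the sketch iterates over `combinations(family, 2)` rather than over all ordered pairs.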
ARPAnno web server

A multi-step process
Input: the FASTA sequence of an unknown potential actin-like protein, for example:
>Q5ZM58_CHICK Hypothetical protein.
MESYDVIANQPVVIDNGSGVIKAGFAGDQIPKYCFPNYVGRPKH
VRVMAGALEGDIFIGPKAEEHRGLLSIRYPMEHGIVKDWNDMER
IWQYVYSKDQLQTFSEEHPVLLTEAPLNPRKNRERAAEVFFETF
NVPALFISMQAVLSLYATGRTTGVVLDSGDGVTHAVPIYEGFAM
PHSMRIDIAGRDVSRFLRLYLRKEGYDFHTTSEFEIVKTIKERACY
LSINPQKDETLETEKAQYYLPDGSTIEIGSARFRAPELLFRPDLIG
EECEGLHEVLVFAIQKSDMDLRRTLFSNIVLSGGSTLFKGFGDRL
LSEVKKLAPKDVKIRISAPQERLYSTWIGGSILASLDTFKKMWVS
KKEYEEDGARAIHRKTF
1. Local alignment with blastp and determination of the families eligible for the next step, using GID (global percent identity) and pCover (percent sequence coverage).
2. Global alignment with the reference alignment of the eligible families using clustalw.
3. Filtering with the Knowledge Filter for specific residues and motifs (pDR, percent of specific residues) and insertions (pDI, percent of specific insertions).
4. Calculation of one score (ScoreARP) for each eligible family and determination of the most suitable ARP family:
S_{ARP_i} = 0.2 \, GID_{ARP_i} + 0.1 \, pCover_{ARP_i} + 0.4 \, pDR_{ARP_i} + 0.3 \, pDI_{ARP_i}

Web interface
http://bips.u-strasbg.fr/ARPAnno
[Screenshot of the web interface. Recoverable labels: Fasta sequence; Actin, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP8, ARP9, ARP10, ARP11; Table results; Coloured multiple alignment available.]

Validation
All 146 available ARP sequences have been correctly annotated.
68 new sequences from a recent version of Uniprot: 36 conventional actins, 3 orphans, 6 ARP1, 7 ARP2, 6 ARP3, 8 ARP4, 1 ARP9 and 1 ARP10, from diverse organisms such as Y. lipolytica, D. hansenii, P. tetraurelia, X. tropicalis or G. gallus.
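The final scoring step of the ARPAnno process (one weighted score per eligible family, with weights 0.2, 0.1, 0.4 and 0.3 for GID, pCover, pDR and pDI) can be sketched in Python as follows. This is only an illustration of the weighted sum and of picking the highest-scoring family: the class and function names are hypothetical, all four quantities are assumed to be percentages on a 0 to 100 scale, and the example values are invented, not ARPAnno output.

```python
from dataclasses import dataclass

# Weights taken from the score formula on the poster.
WEIGHTS = {"gid": 0.2, "pcover": 0.1, "pdr": 0.4, "pdi": 0.3}


@dataclass(frozen=True)
class FamilyMetrics:
    """Per-family measurements for one query sequence (assumed to be percentages)."""
    gid: float     # global percent identity
    pcover: float  # percent sequence coverage
    pdr: float     # percent of specific residues found
    pdi: float     # percent of specific insertions found


def arp_score(m):
    """Weighted sum of the four measurements for one eligible family."""
    return (WEIGHTS["gid"] * m.gid
            + WEIGHTS["pcover"] * m.pcover
            + WEIGHTS["pdr"] * m.pdr
            + WEIGHTS["pdi"] * m.pdi)


def best_family(candidates):
    """Return (family name, score) for the highest-scoring eligible family."""
    if not candidates:
        raise ValueError("no eligible family")
    scores = {name: arp_score(m) for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Invented values for illustration only.
    candidates = {
        "Actin": FamilyMetrics(gid=55.0, pcover=98.0, pdr=20.0, pdi=0.0),
        "ARP2": FamilyMetrics(gid=60.0, pcover=97.0, pdr=90.0, pdi=100.0),
    }
    print(best_family(candidates))
```

Because the weights sum to 1, the score stays on the same 0 to 100 scale as its inputs. The family-specific terms (pDR and pDI) carry 70% of the weight, so a sequence with high overall identity to actin but none of a family's signature residues or insertions scores low for that family.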
Conclusions and perspectives
• The development of a high quality multiple alignment of ARP sequences permits the validation of the ARP classification and the definition of family features (residues and insertions).
• The major ARP families are the nuclear ARP4 and ARP6.
• Correlation of the distribution of ARPs among organisms with functional data is a benchmark case for phylogenetic profiling methods.
• ARPAnno, a new web server for the unambiguous identification of ARP sequences, is available.
• In future: keep the ARP MACS up to date and add some structural features to ARPAnno.
• Extend the genome exploration.

References
Poch, O., and Winsor, B. (1997). Who's who among the Saccharomyces cerevisiae actin-related proteins? A classification and nomenclature proposal for a large family. Yeast 13, 1053-1058.
Plewniak, F., et al. (2003). PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res 31, 3829-3832.
Altschul, S.F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402.
Thompson, J.D., et al. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680.

Acknowledgments: Ministère de la Culture, de l’Enseignement Supérieur et de la Recherche du Luxembourg, Fonds National de la recherche du Luxembourg, CNRS, INSERM, France