Transcript Slides

ProRepeat
a comprehensive directory of
exact tandem repeats in proteins
prorepeat.bioinformatics.nl
PolyQ and neurodegenerative diseases
9 diseases caused by polyQ repeats
- HD
- DRPLA
- SCA 1,2,3,6,7,17
- Kennedy’s disease (SBMA)
www.bioinformatics.nl
Androgen receptor (AR)
Transcription Factor
[Domain diagram, NH3- to -COOH: TRANSCRIPTIONAL REGULATION, DNA BINDING, HORMONE BINDING]
polyQ tract length has important consequences
■ shorter tracts : prostate cancer susceptibility
■ longer tracts : feminization syndromes
■ over 40 residues : SBMA (spinal and bulbar muscular atrophy) or Kennedy’s disease
9-35 residues, average of 20-25 depending on ethnic origin
PolyQ in AR
[Figure labels: T1, T2, T3; Region 1, Region 2, Region 3]

Collection of polyQ repeats



792 human individuals
available from earlier study
(Edwards, 1992)
26 armadillo individuals
sequenced by CP
77 mammals and marsupials
from protein database
Céline Poux, RU
What about repeats in other proteins?



ProRepeat database
Data sources: UniProt and RefSeq
Limited to exact tandem repeats



Standard, linear-time suffix tree algorithm
Stored in Oracle 10g
Interface in PHP5
Maarten van den Bosch, WUR
unit length    repetitions
1              ≥5
2              ≥4
3              ≥3
4 .. N         ≥2
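The minimum-repetition thresholds above can be applied with a simple scan. The following is a minimal sketch only: it uses a naive search, not the linear-time suffix tree algorithm mentioned for the database, and the names `min_repetitions` and `find_exact_tandem_repeats` are illustrative. How ProRepeat treats overlapping or nested repeats is not stated on the slides; here units that are themselves repetitions of a shorter unit (e.g. "QQ") are skipped, which is an assumption.

```python
def min_repetitions(unit_length):
    """Minimum number of repetitions for a unit length (thresholds from the slide)."""
    return {1: 5, 2: 4, 3: 3}.get(unit_length, 2)


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    return unit not in (unit + unit)[1:-1]


def find_exact_tandem_repeats(seq, max_unit_length=None):
    """Return (start, unit, repetitions) tuples; start is 0-based.

    Naive scan (assumption: not the suffix tree method used by the database).
    """
    n = len(seq)
    if max_unit_length is None:
        max_unit_length = n // 2
    hits = []
    for k in range(1, max_unit_length + 1):
        i = 0
        while i + k <= n:
            unit = seq[i:i + k]
            if not is_primitive(unit):
                i += 1
                continue
            # Count back-to-back copies of the unit starting at position i.
            reps = 1
            while seq[i + reps * k:i + (reps + 1) * k] == unit:
                reps += 1
            if reps >= min_repetitions(k):
                hits.append((i, unit, reps))
                i += reps * k  # continue after the reported repeat
            else:
                i += 1
    return hits


if __name__ == "__main__":
    for hit in find_exact_tandem_repeats("MAQQQQQQDEDEDEDEK"):
        print(hit)
```

Each reported tuple gives the position, the repeat unit, and the number of repetitions; hits are grouped by unit length.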
Simple query syntax:
e.g. “Q” or “DE”
DE is equivalent
to ED; DEF is
equivalent to
EFD and FDE
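The equivalence of rotated units can be made concrete by mapping every unit to one canonical rotation. This is a hedged sketch: the slides do not say how ProRepeat normalises queries internally, and `canonical_unit` and `same_repeat` are hypothetical helper names.

```python
def canonical_unit(unit):
    """Return the lexicographically smallest rotation of a repeat unit."""
    if not unit:
        return unit
    rotations = [unit[i:] + unit[:i] for i in range(len(unit))]
    return min(rotations)


def same_repeat(a, b):
    """True if two units are rotations of each other (e.g. DE and ED)."""
    return len(a) == len(b) and canonical_unit(a) == canonical_unit(b)


if __name__ == "__main__":
    print(same_repeat("DE", "ED"))
    print(canonical_unit("EFD"), canonical_unit("FDE"))
    print(same_repeat("DEF", "DFE"))
```

Note that DEF and DFE are not equivalent under this rule: only cyclic shifts count, not arbitrary reorderings.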
Or use ProSite syntax:
e.g. “[DE]-{P}-X(0,1).”
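To illustrate what such a pattern means, here is a minimal sketch that translates a subset of ProSite syntax into a Python regular expression. It is not the ProRepeat implementation; `prosite_to_regex` is a hypothetical name, only `[...]`, `{...}`, single residues, the wildcard and `(n)` / `(n,m)` counts are handled, and `X` is treated as the wildcard as in the slide example (an assumption).

```python
import re

# One pattern element: [ABC], {ABC}, or a single letter, with optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a ProSite-style pattern (subset) into a regex string."""
    pattern = pattern.strip().rstrip(".")  # a trailing period ends the pattern
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        core, lo, hi = m.groups()
        if core in ("x", "X"):
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            regex = core                     # [DE] or a single residue
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi is not None else "") + "}"
        parts.append(regex)
    return "".join(parts)


if __name__ == "__main__":
    rx = prosite_to_regex("[DE]-{P}-X(0,1).")
    print(rx)
    print(bool(re.fullmatch(rx, "DA")), bool(re.fullmatch(rx, "DP")))
```

Read this way, the example pattern describes D or E, followed by any residue except P, followed by at most one further residue.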
Taxonomic distributions of hits
Sorting/grouping options










Identifier
Repeat unit
Repetitions
Unit length
Length
Start location
End location
Protein
Taxonomy
Ontology
Link to DNA data

DNA coding sequences of available repeats also
stored in the database

Extracted from EMBL
and/or RefSeq
Hong Luo, WUR
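As an illustration of how a protein repeat relates to its coding sequence, the sketch below cuts the matching nucleotides out of a CDS. It rests on assumptions that are not stated on the slides: start and end locations are 1-based and inclusive, and the CDS starts at the first codon with no gaps or frameshifts. The function names are hypothetical.

```python
def repeat_cds_slice(cds, start, end):
    """Return the nucleotides coding for protein residues start..end.

    Assumes 1-based inclusive protein coordinates and an uninterrupted CDS.
    """
    nt_start = 3 * (start - 1)
    nt_end = 3 * end
    if start < 1 or end < start or nt_end > len(cds):
        raise ValueError("repeat coordinates fall outside the CDS")
    return cds[nt_start:nt_end]


def codons(nt):
    """Split a nucleotide string into consecutive codons."""
    return [nt[i:i + 3] for i in range(0, len(nt), 3)]


if __name__ == "__main__":
    # Toy CDS: M, five Q codons (one of them CAA), stop.
    cds = "ATG" + "CAG" * 3 + "CAA" + "CAG" + "TAA"
    print(codons(repeat_cds_slice(cds, 2, 6)))
```

In the toy example the five-residue polyQ repeat is encoded by four CAG codons and one CAA codon, the kind of detail that only the DNA-level data can show.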
Link to DNA data / errors


Approximately 3% of corresponding nucleotide
sequences cannot be retrieved
Errors caused by

No links to nucleotide database (35%)
• NO_ANNOTATED_CDS
• No EMBL links

Annotation errors in the nucleotide database (65%)
Number of different units
[Figure: Number of different units per unit size per proteome. x-axis: Unit length (1 to 19, >20); y-axis: 0 to 900. Series: H. sapiens, A. thaliana, C. elegans, S. cerevisiae, P. troglodytes, G. gallus, R. norvegicus, M. musculus, E. coli]
Guido Kappé, RU
Single amino acid (SAA) repeat length distribution in Homo sapiens
[Figure: x-axis: Total SAA repeat length (aa), 5 to 20 and >20; y-axis: Percentage (%), 0% to 100%; one series per amino acid (legend A to Z); labelled in the plot: T, S, Q, G, P, A, E]
Amino acid distribution Homo sapiens
[Figure: x-axis: Amino acid (A to Z); y-axis: Percentage (%), 0 to 30; series: All prot. - Rep., Rep. - SAA, SAA]
Amino acid distribution Arabidopsis thaliana
[Figure: x-axis: Amino acid (A to Z); y-axis: Percentage (%), 0 to 30; series: All prot. - Rep., Rep. - SAA, SAA]
Current work




Annotation of repeats versus function
Adding imperfect tandem repeats, a.k.a. approximate tandem repeats (ATR), to the database
Offering remote access via web services (WSDL and BioMoby)
Expansion of the analysis capabilities of the interface
PolyQ in AR (reprise)


Impure tracts (impurities mainly CAA, CCG, and CGG) are longer and more
variable than pure CAG tracts
Presence of other codons is better explained by
codon duplication than by multiple point mutations

interrupting codons are part of the elongation process,
rather than hampering tract dynamics as proposed
previously
Negative correlation between lengths of the
different CAG tracts

suggests a maximal expansion length that the protein can tolerate
without deleterious effects
Céline Poux, RU
Acknowledgements

Wageningen University and Research Centre





Maarten van den Bosch
Hong Luo
Mark Kramer
Harm Nijveen
Radboud University, Nijmegen



Guido Kappé
Céline Poux
Wilfried W. de Jong
This work was supported in
part by project grants from
NWO/BMI (GK, CP) and the
NBIC/BioAssist program (HN)
Thank you for your attention!
See also our posters on phylogenetic domain visualisation
(TreeDomViewer) and microarray (re)annotation at the ISMB
Post-doc positions available: contact [email protected]
or [email protected]
prorepeat.bioinformatics.nl