invited talk

Identification
of specificity-determining positions
in protein alignments
Mikhail Gelfand
Research and Training Center “Bioinformatics”
Institute for Information Transmission Problems, RAS
ECCB2005, Madrid
Motivation
• Large protein families with general function assigned by
homology, not much functional information
• Much less structural data. Not many structures with
substrates, cofactors etc.
• Some specificity assignments from comparative genomics
=>
• Search for specificity-determining positions in alignments
– identification of functional sites
– prediction of specificity
– understanding and eventually re-design of function
Specificity (of transporters) from
comparative genomics – three
examples. 1. New specificities in a
little studied family
[Figure: phylogenetic tree of the transporter family. Legend of regulatory elements: S-box (rectangle frame), MetJ (circle frame), LYS-element (circles), Tyr-T-box (rectangles). Labeled branches: LysT, TyrT, MetT, MleN (malate/lactate). Taxon labels: Pasteurellaceae, clostridia, Archaea.]
2. Misleading homology: The PnuC family of transporters
The THI
elements
The RFN
elements
3. A nightmare. The NiCoT family of nickel-
cobalt transporters
SDP (Specificity-Determining Position)
Alignment position that is conserved within
groups of proteins having the same specificity
(specificity groups) but differs between them
SDP is not
equivalent to
a functionally
important
position
Measure of specificity: mutual information
$$I_p = \sum_{\text{specificity groups } i} \; \sum_{\text{amino acids } \alpha} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$
• $f_p(\alpha, i)$ = count of amino acid α in group i at position p divided by the total number of sequences
• $f_p(\alpha)$ = frequency of amino acid α in position p
• $f(i)$ = fraction of proteins in group i
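The mutual information of a single alignment column can be computed directly from the three frequencies defined above. A minimal sketch in Python; the function name and the input layout (one residue and one group label per sequence) are assumptions made for illustration, and gaps are simply treated as one more symbol.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Mutual information I_p between residues and specificity groups.

    column: residues at position p, one per sequence (assumed layout)
    groups: specificity-group labels, in the same sequence order
    """
    n = len(column)
    if n == 0 or n != len(groups):
        raise ValueError("column and groups must be non-empty and equally long")
    joint = Counter(zip(column, groups))   # counts behind f_p(alpha, i)
    residues = Counter(column)             # counts behind f_p(alpha)
    sizes = Counter(groups)                # counts behind f(i)
    total = 0.0
    for (aa, grp), count in joint.items():
        f_joint = count / n
        total += f_joint * math.log(f_joint / ((residues[aa] / n) * (sizes[grp] / n)))
    return total


groups = ["AQP"] * 3 + ["GLP"] * 3
perfect_sdp = mutual_information(list("AAAGGG"), groups)   # differs between groups
conserved = mutual_information(list("AAAAAA"), groups)     # same everywhere
```

For two equal groups, a column that is conserved within each group but differs between them reaches the maximum, log 2, while a fully conserved column scores 0.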
Taking into account the structure of the phylogenetic tree: random shuffling and linear regression
• linear regression: … → min
• Z-score:
$$Z_p = \frac{I_p - I_p^{\exp}}{\sigma(I_p^{\exp})}$$
=> positions that are more specific than expected given the tree
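The shuffling part of this step can be sketched as follows. This is a simplified illustration only: the expected value and standard deviation of the mutual information are estimated by permuting the residues of one column, and the tree-aware linear regression mentioned on the slide is not implemented. The function names, the number of shuffles and the fixed seed are assumptions.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev


def mutual_information(column, groups):
    """Mutual information between residues and group labels for one column."""
    n = len(column)
    joint, residues, sizes = Counter(zip(column, groups)), Counter(column), Counter(groups)
    return sum(
        (c / n) * math.log((c / n) / ((residues[aa] / n) * (sizes[g] / n)))
        for (aa, g), c in joint.items()
    )


def shuffle_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z = (I - mean(I_shuffled)) / sd(I_shuffled) for one column (no regression step)."""
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    shuffled = list(column)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)              # break the residue-group association
        samples.append(mutual_information(shuffled, groups))
    sigma = pstdev(samples)
    if sigma == 0.0:                       # e.g. a fully conserved column
        return 0.0
    return (observed - mean(samples)) / sigma


groups = ["g1"] * 10 + ["g2"] * 10
z_sdp = shuffle_z_score(list("A" * 10 + "G" * 10), groups)
z_conserved = shuffle_z_score(list("A" * 20), groups)
```

A column that separates the two groups perfectly gets a large positive Z-score; a conserved column carries no group information and gets 0.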
Smoothing: pseudocounts and
similarity between amino acid residues
• m(ab) = amino acid substitution matrix
• n(a,i) = count of amino acid a at position i
Automated threshold setting: the Bernoulli estimator
Are 5 SDPs with Z-score > 12 better than 10 SDPs with Z-score > 9?
Z-scores sorted in decreasing order: $Z_1 \ge Z_2 \ge \dots$
$$k^* = \arg\min_k P(\text{there are at least } k \text{ observed Z-scores } Z \ge Z_k) = \arg\min_k \left[ 1 - \sum_{i=n-k+1}^{n} C_n^i \, q^i \, p^{n-i} \right]$$
$$p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} \exp(-Z^2/2)\, dZ, \qquad q = 1 - p$$
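The cutoff selection can be sketched in a few lines: for each k, take the k-th largest Z-score, compute the standard normal tail probability p of exceeding it, and evaluate the binomial probability that at least k of the n scores exceed it by chance; the k with the smallest probability is chosen. A minimal sketch under those assumptions. The function names are hypothetical, and the tail is summed directly in log space, which is algebraically the same as one minus the complementary sum but avoids cancellation for very small probabilities.

```python
import math


def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * log_p + (n - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def bernoulli_cutoff(z_scores):
    """Return (k*, probability): the number of top-scoring positions to keep."""
    ranked = sorted(z_scores, reverse=True)
    n = len(ranked)
    best_k, best_prob = 0, float("inf")
    for k in range(1, n + 1):
        prob = binomial_tail(n, k, normal_tail(ranked[k - 1]))
        if prob < best_prob:
            best_k, best_prob = k, prob
    return best_k, best_prob


# Toy data (hypothetical): three clear outliers among background scores.
scores = [8.0, 7.5, 7.0, 1.5, 1.2, 0.9, 0.7, 0.5, 0.3, 0.1, 0.0, -0.1,
          -0.2, -0.4, -0.5, -0.7, -0.8, -1.0, -1.1, -1.3, -1.5, -1.8, -2.0]
k_star, probability = bernoulli_cutoff(scores)
```

On the toy data the three outliers are selected: adding a fourth position would require accepting a background-level score, which makes the observed set far more probable by chance.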
Other similar techniques
• Evolutionary trace (Lichtarge et al. 1996, 1997) –
need structure; gradual construction of group-specific
consensus
• Evolutionary rate shifts (DIVERGE, Gu et al. 2002) –
positions with group-specific evolutionary rate
• Surface patches of slowly evolving residues
(Rate4Site, Pupko et al. 2002) – need structure
• PCA in the sequence space (Casari et al., 1995)
• Correlated mutations (Pazos and Valencia, 2002)
• Prediction of functional sub-types (Hannenhalli and
Russell, 2000) – relative entropy of HMM profiles for
groups
SDPpred: Web interface
Input: multiple alignment of proteins
divided into specificity groups
=== AQP ===
%sp|Q9L772|AQPZ_BRUME
-------------------------------------mlnklsaeffgtfwlvfggcgsa
ilaa--afp-------elgigflgvalafgltvltmayavggisg--ghfnpavslgltv
iiilgsts------------------------------slap-----------------qlwlfwvaplvgavigaiiwkgllgrd-------------------------------------
%sp|P48838|AQPZ_ECOLI
-------------------------------------mfrklaaecfgtfwlvfggcgsa
vlaa--gfp-------elgigfagvalafgltvltmafavghisg--ghfnpavtiglwa
lvihgatd------------------------------kfap-----------------qlwffwvvpivggiiggliyrtllekrd------------------------------------
%tr|Q92ZW9
-------------------------------------mfkklcaeflgtcwlvlggcgsa
vlas--afp-------qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglav
iiilgsth------------------------------rrvp-----------------qlwlfwiaplfgaaiagivwksvgeefrpvd---------------------------------
=== GLP ===
%sp|P11244|GLPF_ECOLI
----------------------------msqt---stlkgqciaeflgtglliffgvgcv
aalkvag---------a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwl
glilaltd------------------------------dgn--------------g-vpr
-flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl------------
%sp|P44826|GLPF_HAEIN
----------------------------mdks-----lkancigeflgtalliffgvgcv
…
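The input layout shown above (a `=== NAME ===` line opening each specificity group, a `%` line naming each sequence, and the aligned sequence wrapped over the following lines) can be read with a short parser. A minimal sketch: the grammar is inferred from the slide example only, not from SDPpred documentation, and the function name is hypothetical.

```python
def parse_grouped_alignment(text):
    """Parse a grouped alignment into {group name: {sequence id: aligned sequence}}."""
    groups = {}
    group = None
    seq_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===") and line.endswith("==="):
            group = line.strip("=").strip()      # "=== AQP ===" -> "AQP"
            groups[group] = {}
            seq_id = None
        elif line.startswith("%"):
            if group is None:
                raise ValueError("sequence header before any group header")
            seq_id = line[1:].strip()
            groups[group][seq_id] = ""
        else:
            if seq_id is None:
                raise ValueError("sequence data before any sequence header")
            groups[group][seq_id] += line        # join wrapped sequence lines
    return groups


# Toy input (hypothetical identifiers and sequences).
toy = """=== G1 ===
%seqA
ac-d
ef
%seqB
ac-d
eg
=== G2 ===
%seqC
tc-d
ef
"""
parsed = parse_grouped_alignment(toy)
```

Group and sequence order are preserved because Python dictionaries keep insertion order.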
SDPpred: Output
• Alignment of the family with the SDPs highlighted (Alignment view)
• Detailed description of each SDP (List of SDPs)
• Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)
Transcription factors from the LacI family
• Training set: 459 sequences, average length: 338 amino acids, 85 specificity groups
– 44 SDPs
10 residues contact NPF (analog of the effector)
7 residues in the effector contact zone (5 Å < dmin < 10 Å)
6 residues in the intersubunit contacts
5 residues in the intersubunit contact zone (5 Å < dmin < 10 Å)
7 residues contact the operator sequence
6 residues in the operator contact zone (5 Å < dmin < 10 Å)
LacI from E.coli
SDP clusters at the subunit contact region
Cluster I
Effector
Cluster II
DNA operator
LacI (lactose repressor) from E.coli (1jwl)
Overall statistics (LacI of E. coli)
• Total 348 amino acids
• 44 SDPs
[Chart legend]
Contacting residues (distance to the DNA, effector, or the other subunit < 5 Å)
Contact zone (may be functional)
Non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å)
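The three residue classes used in these statistics follow from the minimal distance between a residue and its partner (DNA, effector, or the other subunit). A minimal sketch, assuming atom coordinates are given as (x, y, z) tuples in Å. The function names are hypothetical, and the exact boundary values 5 Å and 10 Å, which the slides leave open, are assigned here to the contact zone.

```python
import math


def min_distance(residue_atoms, partner_atoms):
    """Smallest atom-atom distance between a residue and a partner molecule."""
    return min(math.dist(a, b) for a in residue_atoms for b in partner_atoms)


def classify_by_distance(d_min, contact_cutoff=5.0, zone_cutoff=10.0):
    """Assign a residue to one of the three classes by its minimal distance."""
    if d_min < contact_cutoff:
        return "contacting"
    if d_min <= zone_cutoff:
        return "contact zone"
    return "non-contacting"


# Toy coordinates (hypothetical), in Å.
partner = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
near = classify_by_distance(min_distance([(4.0, 0.0, 0.0)], partner))
middle = classify_by_distance(min_distance([(9.0, 0.0, 0.0)], partner))
far = classify_by_distance(min_distance([(30.0, 0.0, 0.0)], partner))
```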
Membrane channels of the MIP family
• Training set: 17 sequences, average length 280 amino acids, 2 specificity groups: aquaporins & glyceroaquaporins
– 21 SDPs
8 residues contact glycerol (substrate) (dmin < 5 Å)
8 residues oriented to the channel
5 residues in the contacts with other subunits
GlpF from E.coli
Two SDP clusters at the contact of
subunits forming the tetramer
Cluster I
20Leu, 24Ile, 108Tyr of
one subunit, 193Ser of
another subunit
Cluster II
Glu43
Substrate
(glycerol)
Subunit I
GlpF (glycerol facilitator) from E. coli (1fx8)
Overall statistics (GlpF from E.coli)
• Total 281 amino acids
• 21 SDPs
[Chart legend]
Contacting residues (distance to the substrate, or another subunit < 5 Å)
Contact zone (may be functional)
Non-contacting residues (distance to the substrate, or another subunit > 10 Å)
isocitrate/isopropylmalate dehydrogenases :
combinations of specificities towards
substrate and cofactor
• IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA) using either NAD or NADP as a cofactor in organisms from prokaryotes to higher eukaryotes
• IMDH: catalyzes oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi, the cofactor is NAD
[Figure: tree with branches labeled Mitochondria, Eukaryota, Archaea, Bacteria]
Selecting specificity groups
1. By substrate: all IDHs vs. all IMDHs
2. By cofactor: all NAD-dependent vs. all NADP-dependent
3. Four groups
[Figure: the tree shown for each grouping, with clades labeled IDH (NADP) type I, IDH (NADP) type II, IDH (NAD), IMDH (NAD)]
Predicted SDPs
most SDPs near the substrate
SDPs near the substrate
and the cofactor
SDPs near the substrate,
the cofactor and the other
subunit
SDPs, the cofactor and the substrate
Substrate (isocitrate)
Cofactor (NADP)
Nicotinamide nucleotide
100Lys, 104Thr, 105Thr,
107Val, 337Ala, 341Thr:
substrate-specific and
four group SDPs,
functionally not
characterized
Adenine nucleotide
344Lys, 345Tyr, 351Val:
cofactor-specific SDPs,
known determinants of
specificity to cofactor
NADP-dependent IDH from E. coli (1ai2)
SDPs predicted for different groupings
[Figure: overlap of the SDP sets predicted for three groupings: cofactor-specific SDPs, substrate-specific SDPs, four groups]
Residues shown: 208Arg, 337Ala, 100Lys, 300Ala, 105Thr, 229His, 154Glu, 103Leu, 233Ile, 158Asp, 115Asn, 305Asn, 308Tyr, 155Asn, 231Gly, 327Asn, 344Lys, 287Gln, 164Glu, 351Val, 345Tyr, 241Phe, 38Gly, 40Asp, 104Thr, 107Val, 152Phe, 323Ala, 245Gly, 161Ala, 232Asn, 162Gly, 36Gly, 45Met, 31Tyr, 341Thr, 97Val, 98Ala
Color code:
• Contacts cofactor
• Contacts substrate AND cofactor
• Contacts substrate
• Contacts substrate AND the other subunit
• Contacts the other subunit
Overview
• Transcription factors: contacts with the
cofactor and the DNA
• Transporters: contacts with the substrate
• Enzymes: contacts with the substrate and
the cofactor
And all:
• contacts between subunits
Protein-DNA interactions
Entropy at aligned sites (blue plots) and the number of contacts
(red: heavy atoms in a base pair at a distance <cutoff from a protein atom)
CRP
PurR
IHF
TrpR
The observed correlation does not
depend on the distance cutoff
CRP/FNR family of regulators
• CooA (Desulfovibrio): TGTCGGCnnGCCGACA
• CRP (Gamma): TTGTGAnnnnnnTCACAA
• FNR (Gamma): TTGATnnnnATCAA
• HcpR (Desulfovibrio): TTGTgAnnnnnnTcACAA
Correlation between contacting
nucleotides and amino acid residues
• CooA in Desulfovibrio spp.: TGTCGGCnnGCCGACA
• CRP in Gamma-proteobacteria: TTGTGAnnnnnnTCACAA
• HcpR in Desulfovibrio spp.: TTGTgAnnnnnnTcACAA
• FNR in Gamma-proteobacteria: TTGATnnnnATCAA
COOA DD ALTTEQLSLHMGATRQTVSTLLNNLVR
COOA DV ELTMEQLAGLVGTTRQTASTLLNDMIR
CRP  EC KITRQEIGQIVGCSRETVGRILKMLED
CRP  YP KXTRQEIGQIVGCSRETVGRILKMLED
CRP  VC KITRQEIGQIVGCSRETVGRILKMLEE
HCPR DD DVSKSLLAGVLGTARETLSRALAKLVE
HCPR DV DVTKGLLAGLLGTARETLSRCLSRMVE
FNR  EC TMTRGDIGNYLGLTVETISRLLGRFQK
FNR  YP TMTRGDIGNYLGLTVETISRLLGRFQK
FNR  VC TMTRGDIGNYLGLTVETISRLLGRFQK
Contacting residues: REnnnR
TG: 1st arginine
GA: glutamate and 2nd arginine
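The residue triplet at the REnnnR positions can be read off the aligned helix fragments and compared between regulators. A minimal sketch: the column indices (0-based 14, 15 and 19) are an assumption, located by matching the REnnnR pattern in the CRP fragment, and the pairing of the species codes with the sequences follows the order in which they appear on the slide.

```python
# Aligned recognition-helix fragments from the slide: (factor, species code, fragment).
HELIX_ALIGNMENT = [
    ("CooA", "DD", "ALTTEQLSLHMGATRQTVSTLLNNLVR"),
    ("CooA", "DV", "ELTMEQLAGLVGTTRQTASTLLNDMIR"),
    ("CRP", "EC", "KITRQEIGQIVGCSRETVGRILKMLED"),
    ("CRP", "YP", "KXTRQEIGQIVGCSRETVGRILKMLED"),
    ("CRP", "VC", "KITRQEIGQIVGCSRETVGRILKMLEE"),
    ("HcpR", "DD", "DVSKSLLAGVLGTARETLSRALAKLVE"),
    ("HcpR", "DV", "DVTKGLLAGLLGTARETLSRCLSRMVE"),
    ("FNR", "EC", "TMTRGDIGNYLGLTVETISRLLGRFQK"),
    ("FNR", "YP", "TMTRGDIGNYLGLTVETISRLLGRFQK"),
    ("FNR", "VC", "TMTRGDIGNYLGLTVETISRLLGRFQK"),
]

# Assumed 0-based columns of the R, E and second R of the REnnnR pattern in CRP.
CONTACT_COLUMNS = (14, 15, 19)


def contact_triplet(fragment, columns=CONTACT_COLUMNS):
    """Residues found at the assumed DNA-contacting columns."""
    return "".join(fragment[c] for c in columns)


def triplets_by_factor(alignment):
    """Map each factor to the set of triplets observed in its sequences."""
    result = {}
    for factor, _species, fragment in alignment:
        result.setdefault(factor, set()).add(contact_triplet(fragment))
    return result


triplets = triplets_by_factor(HELIX_ALIGNMENT)
for factor, observed in triplets.items():
    print(factor, sorted(observed))
```

Under these assumptions CRP and HcpR, which share the TTGTGA half-site, carry the same triplet, while CooA and FNR each carry a different one.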
The correlation holds for other factors
in the family
Factor | Organisms | Consensus | Specific aa | Metabolic system | Inducer
CRP | Enterobacteria & Vibrio & Pasteurellaceae | TTGTGAnnnnnnTCACAA | R E R | catabolic repression | cAMP
VFR | Pseudomonas sp. | TTGTGAnnnnnnTCACAA | R E R | virulence | cAMP
CLP | Xanthomonas & Xylella sp. | nTGTGAnnnnnnTCACAn | R E R | phytopathogenicity | ? (not cAMP)
FNR & ANR | Gamma-proteobacteria | nnTTGATnnnnATCAAnn | V E R | response to anaerobiosis | O2, NO
FNR | Beta-proteobacteria | nnTTGATnnnnATCAAnn | L E R | response to anaerobiosis | O2
FNR & FixK | Alpha-proteobacteria | nnTTGATnnnnATCAAnn | I/L E R | nitrogen fixation | O2
DNR & Nnr | Pseudomonas & Paracoccus | nnTTGATnnnnATCAAnn | P E R | denitrification | NO, NO2
FNR | Bacillus sp. | nTGTGAnnTAnnTCACAn | R E R | response to anaerobiosis | O2-low conditions
PrfA | Listeria | nnTTAACAnnTGTTAAnn | S S R | virulence | ?
NtcA | Cyanobacteria | ntGTAnCnnnnGnTACan | R V R | nitrogen metabolism | 2-oxoglutarate
CysR | Cyanobacteria | ? | R V R | sulfate utilization | sulfate?
CooA | Desulfovibrio sp. and R. rubrum | nTGTCGGCnnGCCGACAn | R Q T | CO utilization | CO
HcpR* | Desulfovibrio sp. | TTGTgAnnnnnnTcACAA | R E R | prismane & sulfate reduction | ?
HcpR* | Desulfuromonas acetoxidans, Desulfotalea psychrophila | atTTGAccnnggTCAAat | S/P E R | prismane | ?
HcpR* | Clostridia, Bacteroides, Thermotogales, Fusobacteria, Treponema | ctGTAACawwtCTTACag | R P R | prismane | ?
HcpR* | ~P. gingivalis | nTGTCGCnnnnGCGACAn | R A R | prismane | ?
HcpR* | ~C. difficile | nnGGATnnnnnnATCCnn | R S R | prismane | ?
HcpR* | ~T. tengcongensis, D. hafniense | nTGTGAnnnnnnTCACAn | R E R | prismane | ?
HcpR* | ~Acidithiobacillus ferrooxidans | nCTTGATTnnAATCAAGn | P E R | prismane | ?
ArcR | Bacillus, Enterococcus sp. | nTGTGAnATATnTCACAn | R E A/S | arginine catabolism | O2
CprK | Desulfitobacterium dehalogenans | nnTTAnTGnnCAnTAAnn | H V R/K | halorespiration | aromatics
FlpA&B | Lactococcus lactis | nnTTGATnnnnATCAAnn | P E R | ? | Eh, O2
Plans and perspectives. Protein-DNA interactions
LacI family of transcriptional regulators (each branch represents a subfamily)
[Figure: tree of the LacI family with subfamilies numbered 1 to 43; leaf labels include EC_LacI, EC_RbsR, EC_PurR, EC_FruR, BS_CcpA, EC_GalR, EC_GalS, EC_CytR, EC_TreR. Effector labels: D-galactose & galactosides, maltose & trehalose, sucrose, D-fructose, D-ribose, D-xylose]
… and their signals
1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites
• A new family of Ni/Co transporters
• No structural data
• Specificity predicted by comparative genomics
• Predicted SDPs form several clusters in the alignment, are located on the same sides of alpha-helices
• Mutational analysis
Plans and perspectives.
Experimental verification
Terminators of translation in prokaryotes / decoding of stop-codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA)
Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.
“Interesting” positions: invariant, SDPs, variable rate.
SDPs and invariant positions: two decoding sites?
Plans and perspectives
• Use of 3D structures, when available.
Identification of functional sites as spatial
clusters of SDPs and conserved positions
• Automated identification of specificity
groups based on the analysis of the
phylogenetic tree
• Protein-DNA interactions
• Identification of protein-protein contact
surfaces
Publications
• N.J.Oparina, O.V.Kalinina, M.S.Gelfand, L.L.Kisselev (2005) Common and
specific amino acid residues in the prokaryotic polypeptide release factors RF1
and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
• O.V.Kalinina, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) Automated
selection of positions determining functional specificity of proteins by comparative
analysis of orthologous groups in protein families. Protein Science 13: 443-456.
• O.V.Kalinina, P.S.Novichkov, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova
(2004) SDPpred: a tool for prediction of amino acid residues that determine
differences in functional specificity of homologous proteins. Nucleic Acids
Research 32: W424-W428.
• O.V.Kalinina, M.S.Gelfand, A.A.Mironov, A.B.Rakhmaninova (2003) Amino acid
residues forming specific contacts between subunits in tetramers of the membrane
channel GlpF. Biophysics (Moscow) 48: S141-S145.
• L.A.Mirny, M.S.Gelfand (2002) Using orthologous and paralogous proteins to
identify specificity determining residues in bacterial transcription factors. Journal
of Molecular Biology 321: 7-20.
• L.Mirny, M.S.Gelfand (2002) Structural analysis of conserved base-pairs in
protein-DNA complexes. Nucleic Acids Research 30: 1704-1711.
• http://math.belozersky.msu.ru/~psn/
Acknowledgements
• Leonid Mirny (Harvard, MIT)
• Olga Kalinina
• Andrei A. Mironov
• Alexandra B. Rakhmaninova
• Dmitry Rodionov
• Olga Laikova
• Howard Hughes Medical Institute
• Ludwig Institute of Cancer Research
• Russian Fund of Basic Research
• Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”