NESG structures submitted to the PDB are eukaryotic proteins


Sequence Similarity Analysis Often Misses Evolutionary Relationships
Which Can Be Detected by Combined Analysis of 3D Structure and
Sequence
Figure: % sequence identity vs. residues aligned, for homologous
relationships established by both 3D structure and sequence
(points marked homologous or non-homologous)
Adapted from work by Sanders and co-workers
Structure can often provide valuable
clues to biochemical and biophysical
aspects of protein function
Structure-based Functional Genomics
Biological Functions
of Genes and Proteins
• Genetic Function / Phenotype
• Cellular Function
• Biochemical Function
• Detailed Atomic Mechanism
An Important Approach to the
Protein Folding Problem is to
Characterize the
“Natural Language of Proteins”
Representative 3D Structure
from Each of Several Thousand
Sequence Families of Domains
National Institutes of Health
Protein Structure Initiative (PSI)
Long-Range Goal
To make the three-dimensional atomic level
structures of most proteins easily available
from knowledge of their corresponding DNA
sequences
http://www.nigms.nih.gov/psi.html/
J. Norvell
Expected PSI Benefits
• Structure provides information on function
and will aid in the design of experiments
• Development of better therapeutic targets
from comparisons of protein structures from:
– Pathogens vs. hosts
– Diseased vs. normal tissues
J. Norvell
PSI Benefits (cont'd)
• Collection of structures will address key
biochemical and biophysical problems
– Protein folding, prediction, folds, evolution, etc.
• Benefits to biologists
– Technology developments
– Structural biology facilities
– Availability of reagents and materials
– Experimental outcome data on protein production
and crystallization
J. Norvell
PSI Pilot Phase
• 5-year pilot phase, September, 2000
• Pilot phase Goals
– Development of high throughput structure
genomics pipeline to produce unique, nonredundant protein structures
– Pilots for testing all facets and strategies of
structural genomics
• PSI target selection policy
– Representatives of protein sequence families
– Public release of all targets, progress, results, and
structures
J. Norvell
PSI Pilot Research Centers
• Seven research centers funded in
FY2000
• Two additional research centers funded
in FY2001
• Co-funding by NIAID for two of the nine
research centers
• Many subprojects
J. Norvell
PSI Pilot Phase -- Lessons Learned
• Structural genomics pipelines can be constructed
and scaled up
• High throughput operation works for many proteins
• Genomic approach works for structures
• Bottlenecks remain for some proteins
• A coordinated, 5-year target selection policy
must be developed
• Homology modeling methods need
improvement
J. Norvell
Northeast Structural Genomics Consortium:
A SG Research Network
Bioinformatics
Barry Honig, Columbia University
Mark Gerstein, Yale University
Sharon Goldsmith, Columbia University
Chern Goh, Yale University
Igor Jurisica, Ontario Cancer Inst.
Andrew Laine, Columbia University
Jessica Lau, Rutgers University
Jinfeng Liu, Columbia University
Diana Murray, Cornell Medical School
Burkhard Rost, Columbia University
Mike Wilson, Yale University
Protein Production / Biophysics
Gaetano Montelione, Rutgers University
Thomas Acton, Rutgers University
Stephen Anderson, Rutgers University
Cheryl Arrowsmith, Ontario Cancer Inst.
YiWen Chiang, Rutgers University
Natasha Dennisova, Rutgers University
Masayori Inouye, RWJMS - UMDNJ
Lichung Ma, Rutgers University
Rong Xiao, Rutgers University
Adlinda Yee, Ontario Cancer Inst.
X-ray Crystallography
Wayne Hendrickson, Columbia University
Peter Allen, Columbia University
George DeTitta, Hauptman-Woodward
John Hunt, Columbia University
Rich Karlin, Columbia University
Joe Luft, Hauptman-Woodward
Alex Kuzin, Columbia University
Phil Manor, Columbia University
Liang Tong, Columbia University
Kalyan Das, Rutgers University
Protein NMR
Thomas Szyperski, SUNY Buffalo
James Aramani, Rutgers University
Cheryl Arrowsmith, Ontario Cancer Inst.
John Cort, Pacific Northwest Natl Labs
Michael Kennedy, Pacific Northwest Natl Labs
Gaouhua Liu, SUNY Buffalo
Theresa Ramelot, Pacific Northwest Natl Labs
Janet Huang, Rutgers University
Gaetano Montelione, Rutgers University
GVT Swapna, Rutgers University
Bin Wu, Ontario Cancer Inst.
Goals of the NESG Consortium
Short Term
Develop a Scalable Platform for
Structural and Functional Proteomics of
Prokaryotic and Eukaryotic Proteins
Long Term
Characterize the repertoire of eukaryotic
protein structural domain families
The NESG Publication Network
PubNet
Douglas, Montelione, Gerstein
Bioinformatics, 2005 in press
Target Selection
Strategy
Target Selection for Structural Proteomics
C. Orengo, Snowbird, UT 4.17.04
How many protein families can we identify in the
genomes with/without structural representatives?
Which families should we target to maximise
structural coverage of the genomes?
Can we select families to optimise function coverage?
Rost Clusters:
Structural Genomics Targets
~20,000 “NESG Clusters”
• Protein domain families / clusters
• Full length proteins < 340 amino acids
• No member > 30% identity to PDB structures
• No regions of low complexity
• Not predicted to be membrane associated
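The selection criteria above can be read as a simple filter over candidate proteins. The following is a minimal sketch of that reading, not the NESG or Rost pipeline itself: the `Candidate` record, its field names, and the sample entries are hypothetical, and the upstream steps that would compute identity to PDB structures, low-complexity regions, or membrane predictions are assumed to have been run already.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    """Hypothetical record for one full-length protein in a domain cluster."""
    name: str
    length: int                # number of amino acids
    max_pdb_identity: float    # highest % identity of any cluster member to a PDB structure
    has_low_complexity: bool   # any low-complexity region predicted
    predicted_membrane: bool   # predicted to be membrane associated


MAX_LENGTH = 340          # "Full length proteins < 340 amino acids"
MAX_PDB_IDENTITY = 30.0   # "No member > 30% identity to PDB structures"


def passes_selection(c: Candidate) -> bool:
    """Apply the four criteria listed on the slide."""
    return (
        c.length < MAX_LENGTH
        and c.max_pdb_identity <= MAX_PDB_IDENTITY
        and not c.has_low_complexity
        and not c.predicted_membrane
    )


def select_targets(candidates: Iterable[Candidate]) -> List[str]:
    """Return the names of candidates that satisfy every criterion."""
    return [c.name for c in candidates if passes_selection(c)]


# Made-up entries, for illustration only.
EXAMPLES = [
    Candidate("cand_ok", 210, 22.0, False, False),
    Candidate("cand_long", 512, 18.0, False, False),
    Candidate("cand_known", 150, 45.0, False, False),
    Candidate("cand_membrane", 180, 12.0, False, True),
]

if __name__ == "__main__":
    print(select_targets(EXAMPLES))
```

Keeping each criterion as its own boolean field makes it easy to report which rule excluded a given protein.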
Target genomes
Eukaryotes
Arabidopsis thaliana (A)
Caenorhabditis elegans (W)
Drosophila melanogaster (F)
Homo sapiens (H)
Saccharomyces cerevisiae (Y)
(Mus musculus)
Reagent genomes (prokaryotes):
Eubacteria
Aquifex aeolicus (Q)
Bacillus subtilis (S)
Escherichia coli (E)
Haemophilus influenzae (I)
Helicobacter pylori (P)
Staphylococcus aureus (Z)
Thermotoga maritima (V)
Campylobacter jejuni (B)
Neisseria meningitidis (M)
Thermus thermophilus (U)
Archaea
Aeropyrum pernix (X)
Archaeoglobus fulgidus (G)
Methanobacterium thermoautotrophicum (T)
Pyrococcus horikoshii (J)
NESG Domain Clusters
Aeropyrum pernix
Aquifex aeolicus
Arabidopsis thaliana
Archaeoglobus fulgidus
Bacillus subtilis
Brucella melitensis
Caenorhabditis elegans
Campylobacter jejuni
Caulobacter crescentus
Drosophila melanogaster
Deinococcus radiodurans
Escherichia coli
Fusobacterium nucleatum
Haemophilus influenzae
Helicobacter pylori
Homo sapiens
• Protein domain families / clusters
• Full length proteins < 340 amino acids
• No member > 30% identity to PDB structures
• No regions of low complexity
• Not predicted to be membrane associated
Human cytomegalovirus
Lactococcus lactis
M. thermoautotrophicum
Neisseria meningitidis
Other
Pyrococcus furiosus
Pyrococcus horikoshii
Saccharomyces cerevisiae
Staphylococcus aureus
Streptococcus pyogenes
Streptomyces coelicolor
Thermoplasma acidophilum
Thermotoga maritima
Thermus thermophilus
Vibrio cholerae
WR41
ET8
1 Euka: 2 Proka
Cloned / Expressed
> 1000 Human Proteins
Liu, Hegi, Acton, Montelione, & Rost PROTEINS 2004. 56: 188-200
Wunderlich et al. PROTEINS 2004 56: 181-187
Acton et al. Methods Enzymol. 2005 in press
Protein Structure
Production
Primer Prim'er Program
http://www-nmr.cabm.rutgers.edu/bioinformatics/index.html
Auto-Steps with the Biorobot 8000
PCR Reaction
Qiaquick Purify
DNA Mini-preps
Set up-96 well
PCR Purification
Cycle Sequencing
Big Dye removal
96- Well Expression
Overnight culture
Transfer ~200 µl of
overnight culture to
appropriate well
24 Well Blocks
2 ml of MJ9
HSQC and HetNOE Screening
Amenability to Structural Determination by NMR
Is Determined on NiNTA-Purified Samples
HR969
# Targets    Good        Excellent
314          60 (20%)    25 (8%)
Critical NMR Observation From SPiNE
Some 30% of full-length, expressed,
soluble eukaryotic proteins
from the Rost Clusters
produced in E. coli by NESG
are DISORDERED based on
Heteronuclear 1H-15N NOE Data
It may not be possible to
determine 3D structures of a
large portion of the Rost domain
families in isolation!
Sample Optimization - Buffer Screening
Microdialysis Buttons- Optimization for NMR
Vary Buffer Conditions - Stability
Buffer                           NaCl     DTT      Arginine
50 mM Ammonium Acetate pH 5.0    0        0        0
50 mM Ammonium Acetate pH 5.0    0        10 mM    0
50 mM Ammonium Acetate pH 5.0    0.1 M    10 mM    0
50 mM Ammonium Acetate pH 5.0    0        10 mM    0.1 M
50 mM MES pH 6.0                 0        0        0
50 mM MES pH 6.0                 0        10 mM    0
50 mM MES pH 6.0                 0.1 M    10 mM    0
50 mM MES pH 6.0                 0        10 mM    0.1 M
50 mM Bis-Tris pH 6.5            0        0        0
50 mM Bis-Tris pH 6.5            0        10 mM    0
50 mM Bis-Tris pH 6.5            0.1 M    10 mM    0
50 mM Bis-Tris pH 6.5            0        10 mM    0.1 M
100 mM Arginine
Bagby S, Tong KI, Liu D, Alattia JR, Ikura M. 1997. J Biomol NMR.
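The buffer screen above is a small grid: three buffers, each combined with the same four NaCl / DTT / arginine settings, giving twelve conditions. A minimal sketch of generating that grid is shown below. The function and variable names are illustrative assumptions and not part of any NESG software.

```python
from itertools import product
from typing import Dict, List

# Buffers and additive combinations as listed in the screening table.
BUFFERS = [
    "50 mM Ammonium Acetate pH 5.0",
    "50 mM MES pH 6.0",
    "50 mM Bis-Tris pH 6.5",
]

ADDITIVES = [
    {"NaCl": "0", "DTT": "0", "Arginine": "0"},
    {"NaCl": "0", "DTT": "10 mM", "Arginine": "0"},
    {"NaCl": "0.1 M", "DTT": "10 mM", "Arginine": "0"},
    {"NaCl": "0", "DTT": "10 mM", "Arginine": "0.1 M"},
]

SAMPLE_MASS_UG_PER_BUTTON = 50  # "Small sample mass (50 ug/button)"


def build_screen() -> List[Dict[str, str]]:
    """Enumerate every buffer / additive combination, buffer-major order."""
    return [{"Buffer": b, **a} for b, a in product(BUFFERS, ADDITIVES)]


def total_sample_mass_ug(conditions: List[Dict[str, str]]) -> int:
    """Protein needed for one button per condition, in micrograms."""
    return len(conditions) * SAMPLE_MASS_UG_PER_BUTTON


if __name__ == "__main__":
    screen = build_screen()
    for row in screen:
        print(row)
    print(len(screen), "conditions,", total_sample_mass_ug(screen), "ug protein")
```

Listing buffers and additive sets separately means a new buffer or additive can be added to the screen by editing one list.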
Screen for ppt.
Small sample mass
(50 µg/button)
Analytical Gel Filtration with Light Scattering
Aggregation Screening - Crystallization
LS
RI
Proterion - 96 Well
Monodisperse Conditions
Less Sample
More Conditions
Philip Manor, Roland Satterwhite and John Hunt
ÄKTAxpress™
4 modules in parallel
16 samples AC-GF
Affinity Chromatography (AC)
HiTrap™ Chelating HP, 1 and 5 ml
Gel Filtration (GF)
HiLoad 16/60 Superdex 200 pg
AC
AC/GF
5 hours
12 hours
Solubility / 2004 Stats
Solubility vs Organism
Organism               Cloned   % Sol*   PDB
A. aeolicus (Q)            85       46     3
A. thaliana (A)            35       29     1
A. fulgidus (G)            23       74     2
B. subtilis (S)           158       49     4
B. melitensis (L)          15       67     0
C. elegans (W)             90       50     6
C. jejuni (B)              20       55     0
D. melanogaster (F)       113       15     1
E. faecalis (Ef)           23      100     0
E. coli (E)               118       50    12
H. influenzae (I)         101       57     4
H. pylori (P)              75       21     1
H. sapiens (H)            548       43     4
N. meningitidis (M)        22       54     1
P. furiosus (Pf)           48       46     2
P. horikoshii (J)          19       63     1
S. pyogenes (D)            12       50     1
2004 Production
             Total   Week   Goal
Cloned         511     51   50
Fermented      183     20   ~20-24
Purified       180     20   ~20-24
2004 HR Success
Many HR (Human)
proteins in advanced
stages of NMR
3 HR Crystal structures
*defined as greater than 60% soluble by SDS-PAGE
analysis
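The footnote's definition can be turned into a small calculation: a target counts as soluble when more than 60% of the expressed protein is in the soluble fraction, and the "% Sol" value for an organism is the share of its targets that meet that bar. The sketch below assumes per-target soluble fractions have already been estimated from SDS-PAGE. The function names and input values are hypothetical.

```python
from typing import Sequence

SOLUBLE_THRESHOLD = 60.0  # "greater than 60% soluble by SDS-PAGE analysis"


def is_soluble(percent_in_soluble_fraction: float) -> bool:
    """A target is called soluble only if it is strictly above the threshold."""
    return percent_in_soluble_fraction > SOLUBLE_THRESHOLD


def percent_soluble_targets(fractions: Sequence[float]) -> float:
    """Share of targets (in %) classified as soluble; 0.0 for an empty list."""
    if not fractions:
        return 0.0
    n_soluble = sum(1 for f in fractions if is_soluble(f))
    return 100.0 * n_soluble / len(fractions)


if __name__ == "__main__":
    # Made-up soluble fractions for four targets from one organism.
    print(percent_soluble_targets([80.0, 60.0, 10.0, 95.0]))
```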
T. Acton et al
Internet-based Data Management
NESG PROGRESS SUMMARY Jan 1, 2005
Cloned Targets                  4,220
Purified Targets                1,458
Crystal Structures in PDB          84
NMR Structures in PDB              72
Structures in PDB                 147
Total Structures                  160
In Refinement (NMR + Xray)         13
New Folds                          12
Publications                      209
Intrinsically Disordered Proteins
Intrinsically Unfolded Proteins Produced in E. coli
> 70 Full-length Proteins
Organism      % Unfolded
E. coli        8%
yeast         18%
fly / worm    25%
human         35%
Phylogenetic Distribution of
160 NESG Structures
Most (>95%) completed
NESG structures are
members of eukaryotic
protein domain families
Archaea
Eubacteria
Some 35 (~20%) NESG
structures submitted to
the PDB are eukaryotic
proteins
Uniqueness
of NESG Structures
Leverage of NESG Structures
Total Leverage ~20,000 Structures
Novel Leverage ~ 4,000 Structures
Upper panel: number of new models that could be built for ten
entirely sequenced eukaryotes (tan) and for the human genome (green).
Lower panel: number of proteins for which the sequence-unique
structures experimentally determined by each consortium (red)
could be used to build homology models (light green).
Liu and Rost