Automated Model-Building with TEXTAL

TEXTAL - Automated Crystallographic Protein Structure
Determination Using Pattern Recognition
Principal Investigators:
Thomas Ioerger (Dept. Computer Science)
James Sacchettini (Dept. Biochem/Biophys)
Other contributors:
Tod D. Romo, Kreshna Gopal, Erik McKee,
Lalji Kanbi, Reetal Pai & Jacob Smith
Funding: National Institutes of Health
Texas A&M University
X-ray crystallography
• Most widely used method for
protein modeling
• Steps:
– Grow crystal
– Collect diffraction data
– Generate electron density map
(Fourier transform)
– Interpret map, i.e. infer atomic
coordinates
– Refine structure
• Model-building
– Currently: crystallographers
– Challenges: noise, resolution
– Goal: automation
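
The "Generate electron density map (Fourier transform)" step can be illustrated with a toy one-dimensional Fourier synthesis. This is only a sketch under simplifying assumptions: point atoms on a 1D unit cell, made-up positions and weights, and known phases (in a real experiment the phases must be estimated). The function names are hypothetical and not part of TEXTAL.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """Toy 1D structure factors F(h) for point atoms.

    atoms: list of (fractional coordinate, scattering weight) pairs.
    """
    return {
        h: sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
        for h in range(-h_max, h_max + 1)
    }

def density(factors, x):
    """Fourier synthesis: rho(x) = sum_h F(h) exp(-2*pi*i*h*x), up to a scale factor."""
    total = sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items())
    return total.real

# Made-up atoms: (fractional position, weight)
atoms = [(0.25, 6.0), (0.60, 8.0)]
factors = structure_factors(atoms, h_max=10)   # h_max limits the "resolution"

grid = [i / 100 for i in range(100)]
rho = [density(factors, x) for x in grid]
peak = max(range(100), key=lambda i: rho[i])   # grid index of the strongest density peak
```

Lowering `h_max` broadens the peaks, which is the 1D analogue of building into a lower-resolution map.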
Overview of TEXTAL
• Automated model-building program
Electron density map
(or structure factors)
TEXTAL
Protein model
(may need refinement)
• Can we automate the kind of visual processing of
patterns that crystallographers use?
– Intelligent methods to interpret density, despite noise
– Exploit knowledge about typical protein structure
• Focus on medium-resolution maps
– optimized for 2.8A (actually, 2.6-3.2A is fine)
– typical for MAD data (useful for high-throughput)
– other programs exist for higher-res data (ARP/wARP)
Crystal
→ Collect data
→ Diffraction data
→ Electron density map
→ CAPRA: models backbone
   (SCALE MAP, TRACE MAP, CALCULATE FEATURES, PREDICT Cα’s, BUILD CHAINS, PATCH & STITCH CHAINS, REFINE CHAINS)
→ Model of backbone
→ LOOKUP: models side chains
→ Model of backbone & side chains
→ POST-PROCESSING
   (SEQUENCE ALIGNMENT, REAL SPACE REFINEMENT)
→ Corrected & refined model
CAPRA:
C-Alpha Pattern-Recognition Algorithm
tracing
Best-first search with heuristic
scoring function based on:
• neural net scores
• density
• connectivity
• secondary structure
linking
Neural network:
estimates which
pseudo-atoms are
closest to true Cα’s
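
A minimal sketch of best-first search over candidate Cα positions, to make the linking idea concrete. It is not CAPRA's actual scoring function: here the heuristic is simply the sum of hypothetical neural-net scores along the chain, and two pseudo-atoms are linkable when they are roughly 3.8 A apart (the typical spacing of consecutive Cα atoms). Density, connectivity and secondary-structure terms are left out. `CA_SPACING`, `TOLERANCE` and the function names are assumptions for illustration.

```python
import heapq
import math

CA_SPACING = 3.8   # typical distance between consecutive C-alpha atoms (angstroms)
TOLERANCE = 0.5    # assumed tolerance on that spacing

def neighbours(points, i):
    """Indices of pseudo-atoms at roughly one C-alpha spacing from point i."""
    return [j for j, p in enumerate(points)
            if j != i and abs(math.dist(points[i], p) - CA_SPACING) <= TOLERANCE]

def best_first_chain(points, scores, start, max_expansions=1000):
    """Expand the highest-scoring partial chain first; return the best chain seen.

    scores: hypothetical neural-net scores, higher = more likely a true C-alpha.
    """
    best_chain, best_total = [start], scores[start]
    queue = [(-scores[start], [start])]     # min-heap, so totals are negated
    expansions = 0
    while queue and expansions < max_expansions:
        neg_total, chain = heapq.heappop(queue)
        expansions += 1
        if -neg_total > best_total:
            best_chain, best_total = chain, -neg_total
        for j in neighbours(points, chain[-1]):
            if j not in chain:
                heapq.heappush(queue, (neg_total - scores[j], chain + [j]))
    return best_chain

# Four pseudo-atoms on a line plus one low-scoring side branch (made-up data)
points = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0), (3.8, 3.8, 0)]
scores = [0.9, 0.8, 0.9, 0.7, 0.2]
chain = best_first_chain(points, scores, start=0)
```

In this toy case the search follows the straight run of four points and ignores the branch, because the branch ends in a lower total score.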
Example of Cα-chains fit by CAPRA
Rat α2 urinary protein (P. Adams)
data: 2.5A MR
map generated at 2.8A
% built: 84%
# chains: 2
lengths: 47, 88
RMSD: 0.82A
Stage 2: LOOKUP
• LOOKUP is based on Pattern Recognition
– Given a local (5A-spherical) region of density, have we
seen a pattern like this before (in another map)?
– If so, use similar atomic coordinates.
• Use a database of maps with known structures
– 200 proteins from PDB-Select (non-redundant)
– back-transformed (calculated) maps at 2.8A (no noise)
– regions centered on 50,000 Cα’s
• Use feature extraction to match regions efficiently
– features (e.g. moments) represent local density patterns
– features must be rotation-invariant (independent of 3D
orientation)
– use density correlation for more precise evaluation
BUILD CHAINS: Examines network of Cα’s
and uses heuristic search to connect them to
form backbone chains
CAPRA
LOOKUP: Uses case-based reasoning to
find, for each Cα, the best matching local
region in a database
The LOOKUP Process
Find optimal
rotation
Database
of known
maps
Region in map to
be interpreted
Two-step filter:
1) by features
2) by density
correlation
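
The two-step filter can be sketched as follows: first shortlist database regions by a weighted Euclidean distance between feature vectors, then re-rank the shortlist by density correlation. This is an illustration under assumptions, not TEXTAL's code: the record layout (`features`, `density`), the toy numbers and the function names are hypothetical, and the search for the optimal rotation before computing the correlation is omitted.

```python
import math

def weighted_distance(f1, f2, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2)))

def correlation(d1, d2):
    """Pearson correlation between two equally sampled (flattened) density grids."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    var1 = sum((a - m1) ** 2 for a in d1)
    var2 = sum((b - m2) ** 2 for b in d2)
    return cov / math.sqrt(var1 * var2)

def lookup(query, database, weights, k):
    """Step 1: keep the k nearest regions by features. Step 2: pick the best by density."""
    shortlist = sorted(
        database,
        key=lambda r: weighted_distance(query["features"], r["features"], weights),
    )[:k]
    return max(shortlist, key=lambda r: correlation(query["density"], r["density"]))

# Toy data: three database regions with 3 features and 4 density samples each
query = {"features": [1.72, -0.39, 1.04], "density": [0.1, 0.9, 0.4, 0.2]}
database = [
    {"name": "A", "features": [1.79, -0.43, 0.88], "density": [0.2, 0.8, 0.5, 0.1]},
    {"name": "B", "features": [1.58, 0.18, 1.09], "density": [0.9, 0.1, 0.2, 0.4]},
    {"name": "C", "features": [0.90, 0.65, -1.40], "density": [0.1, 0.9, 0.4, 0.2]},
]
best = lookup(query, database, weights=[1.0, 1.0, 1.0], k=2)
```

The cheap feature comparison prunes the database so that the expensive density comparison runs on only k candidates. The cost of that design shows in the toy data: region C has identical density but is never examined, because its features are too far from the query.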
“2-norm”: weighted Euclidean
distance metric for retrieving matches:
$$\mathrm{dist}(R_1, R_2) = \sqrt{\sum_i w_i \,\bigl(F_i(R_1) - F_i(R_2)\bigr)^2}$$
Examples of Numeric Density Features
•Distance from center-of-sphere to center-of-mass
•Moments of inertia - relative dispersion
along orthogonal axes
•Geometric features like “Spoke angles”
•Local variance and other statistics
Features are designed to be rotation-invariant, i.e. same
values for region in any orientation/frame-of-reference.
TEXTAL uses 19 distinct numeric features to represent
the pattern of density in a region, each calculated over
4 different radii, for a total of 76 features.
F=<1.72,-0.39,1.04,1.55...>
F=<0.90,0.65,-1.40,0.87...>
F=<1.58,0.18,1.09,-0.25...>
F=<1.79,-0.43,0.88,1.52...>
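
To illustrate what rotation-invariant means here, the sketch below computes two simple features (distance from the sphere center to the center of mass, and a density-weighted mean squared radius) over four radii, then checks that rotating the region leaves them unchanged. The radii, the sample points and the choice of these two features are assumptions for illustration; TEXTAL's 19 features are not reproduced.

```python
import math

RADII = (3.0, 4.0, 5.0, 6.0)   # assumed radii (angstroms); the slides only say "4 different radii"

def features(samples, radii=RADII):
    """samples: (x, y, z, density) tuples relative to the region center.

    Returns two features per radius as one flat list.
    """
    out = []
    for r in radii:
        inside = [s for s in samples if math.hypot(s[0], s[1], s[2]) <= r]
        mass = sum(s[3] for s in inside)
        com = [sum(s[i] * s[3] for s in inside) / mass for i in range(3)]
        com_dist = math.hypot(*com)              # center-of-sphere to center-of-mass
        spread = sum(s[3] * (s[0] ** 2 + s[1] ** 2 + s[2] ** 2)
                     for s in inside) / mass     # density-weighted mean squared radius
        out += [com_dist, spread]
    return out

def rotate_z(samples, angle):
    """Rotate all sample points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z, rho) for x, y, z, rho in samples]

# Made-up density samples: (x, y, z, density)
samples = [
    (1.0, 0.5, 0.0, 2.0),
    (-2.0, 1.0, 1.5, 1.0),
    (0.5, -3.0, 2.0, 0.5),
    (4.0, 2.0, -1.0, 1.5),
    (-1.0, -4.5, 3.0, 0.8),
]
original = features(samples)
rotated = features(rotate_z(samples, 0.7))
invariant = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
                for a, b in zip(original, rotated))
```

Both features depend only on distances from the center, which is why any rigid rotation about the center leaves them unchanged.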
SLIDER: Feature-weighting algorithm
• Euclidean distance metric used for retrieval:
$$\mathrm{dist}(R_1, R_2) = \sqrt{\sum_i w_i \,\bigl(F_i(R_1) - F_i(R_2)\bigr)^2}$$
• Weights w_i reflect importance of relevant features, avoid noisy features
• Goal: find optimal weight vector w that generates highest
probability of hits (matches) in top K candidates from database
• Concept of Slider:
– analyze distances between representative matches and mismatches
– adjust feature weights so that most matches are ranked higher than mismatches
Slider Algorithm(w, F, {Ri}, matches, mismatches)
  choose feature f ∈ F at random
  for each <Ri,Rj,Rk>, Rj ∈ matches(Ri), Rk ∈ mismatches(Ri)
    compute cross-over point λi where:
      dist’(Ri,Rj) = dist’(Ri,Rk)
      dist’(X,Y) = λ(Xf−Yf)² + (1−λ) dist\f(X,Y)
  pick λ that is best compromise among λi
    (ranks most matches above mismatches)
  update weight vector: w’ ← update(w,f,λ), w’f = λ
  repeat until convergence
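
One Slider iteration for a single feature might look like the sketch below. It is a reading of the pseudocode under stated assumptions, not the published implementation: dist\f is taken to be the weighted sum of squared differences over all features except f, the "best compromise" is taken to be the midpoint of the interval between cross-over points that ranks the most matches above mismatches, and the `update` renormalization step is left out. All names and data are hypothetical.

```python
def slider_step(f, weights, triples):
    """Choose a weight lambda in (0, 1) for feature f.

    triples: list of (Ri, Rj, Rk) feature vectors, where Rj is a match
    and Rk a mismatch of Ri.
    """
    def parts(x, y):
        on_f = (x[f] - y[f]) ** 2
        off_f = sum(w * (p - q) ** 2
                    for i, (w, p, q) in enumerate(zip(weights, x, y)) if i != f)
        return on_f, off_f

    terms = [(parts(ri, rj), parts(ri, rk)) for ri, rj, rk in triples]

    # Cross-over points: lambda at which match and mismatch are equally distant
    cuts = {0.0, 1.0}
    for (a_j, b_j), (a_k, b_k) in terms:
        denom = (a_j - a_k) + (b_k - b_j)
        if denom != 0:
            lam = (b_k - b_j) / denom
            if 0.0 < lam < 1.0:
                cuts.add(lam)
    cuts = sorted(cuts)

    # Rankings only change at cross-overs, so test one lambda per interval
    candidates = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]

    def n_correct(lam):
        return sum(lam * a_j + (1 - lam) * b_j < lam * a_k + (1 - lam) * b_k
                   for (a_j, b_j), (a_k, b_k) in terms)

    return max(candidates, key=n_correct)

# Toy data: feature 0 separates matches from mismatches, feature 1 is noise
triples = [
    ([0.0, 0.0], [0.1, 2.0], [1.0, 0.5]),
    ([0.0, 0.0], [0.2, 1.0], [0.8, 0.2]),
]
lam = slider_step(f=0, weights=[0.5, 0.5], triples=triples)
```

With this data the chosen weight for the informative feature lands in the top interval, above both cross-over points, so both matches outrank their mismatches.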
SLIDER Results
[Figure: Accuracy of case retrieval. Accuracy vs. number of matches retrieved for SLIDER, SFS, SBS and DIET.]
[Figure: Convergence of feature selection/weighting algorithms. Accuracy of ranking vs. iterations for SLIDER, SBS, DIET and SFS, with uniform weights for comparison.]
[Figure: Speed of convergence. Time (seconds) for SLIDER, SFS, SBS and DIET.]
[Figure: Effectiveness of retrieval using Euclidean distance (tolerance = .02). Average no. of matches caught in top k, uniform-weighted vs. Slider-weighted.]
Stage 3: Post-Processing
Quality of TEXTAL models
• Typically builds >80% of the protein
atoms
• Accuracy of coordinates: ~1Å error
(RMSD)
– Depends on resolution and quality of map
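
For reference, the RMSD figures quoted in these slides correspond to a calculation like the one below. This is a generic sketch: it assumes the model and reference atoms are already paired one-to-one and expressed in the same coordinate frame (no superposition step), and the coordinates are made up.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired atomic coordinates (same units as input)."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, reference))
    return math.sqrt(squared / len(model))

# Two atoms, each displaced by exactly 1 angstrom from its reference position
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (3.8, 1.0, 0.0)]
error = rmsd(model, reference)
```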
PcaA
• Mycolic acid cyclopropyl synthase (Smith&Sacchettini)
• original structure solved at 2.0A via MAD
R-value = 0.22, R-free = 0.27
• 287 residues, α/β fold
Example of density quality
(~1σ contour with Cα trace)
Electron density map (2.8A)
Results of tracing
Strip off branches of trace (linearize)
Linearized trace shows backbone connectivity
Pick Cα’s using neural net; link together
Results of CAPRA
Comparison to backbone of true structure (white)
Percent built = 89%
(missing: 15-residue N-terminus, 17-residue disordered loop)
4 single-atom insertions; 5 single-atom deletions
RMSD = 0.81A
CAPRA model consists of 3 chains
Chain lengths: 14, 96, 145 residues
Results of LOOKUP (modeling side-chains)
Comparison of TEXTAL model to true structure
Percent amino acid identity = 87.5%
(mistakes: small frame-shifts around gaps in alignment)
all-atom RMSD = 0.92A
Closeup of β-strand (TEXTAL model in green)
Closeup of another β-strand and turn
Implementation
• Project started in 1998
– Collaboration between TAMU Computer Science & Biochemistry
departments
• 100,000 lines of C/C++, Perl, Python code
• ~8 developers
• CVS for version management
• Platforms: Irix, Linux, OSX, Win32
• Speed: 1-3 hours for medium-sized proteins
Deployment
• September 2004: Linux and OSX distributions
– Can be downloaded from http://textal.tamu.edu:12321
– 40 trial licenses granted so far
• June 2002: WebTex (http://textal.tamu.edu:12321)
– Till May 2005: TB Structural Genomics Consortium members only
– Recently open to the public
– ~500 jobs successfully processed
– 120 users from 70 institutions in 20 countries
• July 2003: Model building component of PHENIX
– Python-based Hierarchical ENvironment for Integrated Xtallography
– Consortium members:
• Lawrence Berkeley National Lab
• University of Cambridge
• Los Alamos National Lab
• Texas A&M University
– April 2005: Alpha release - over 300 downloads so far
PHENIX: Python-based Hierarchical ENvironment for Integrated
Xtallography
(diffraction data → PHENIX → refined molecular model)
• HYSS, CCTBX (Lawrence Berkeley Lab): crystallography toolbox, heavy atom search, refinement
• PHASER (University of Cambridge): maximum likelihood phasing
• SOLVE/RESOLVE (Los Alamos National Lab): statistical density modification, minimum bias phasing
• TEXTAL™ (Texas A&M University): model building
Conclusions
• Pattern recognition is a successful technique for
macromolecular model-building
• Future directions:
– recognizing disulfide bridges, metal ions, detergents...
– building ligands, co-factors, etc.
– using models built to iteratively improve phases
– building at higher or lower resolutions
– intelligent agent for guiding model-completion
– detecting and exploiting non-crystallographic symmetry
– building nucleic acids (RNA and DNA)
• Importance and challenges of interdisciplinary
research
Acknowledgements
• Funding:
– National Institutes of Health
• Our group:
– Jacob Smith, Kreshna Gopal, Lalji Kanbi, Erik McKee,
Reetal Pai, Tod Romo
• Our association with the PHENIX group:
– Paul Adams (Lawrence Berkeley National Lab)
– Randy Read (Cambridge University)
– Tom Terwilliger (Los Alamos National Lab)