TEXTAL - TAMU Computer Science Faculty Pages

The TEXTAL System:
Automated Model-Building Using
Pattern Recognition Techniques
Dr. Thomas R. Ioerger
Department of Computer Science
Texas A&M University
Collaboration with: Dr. James C. Sacchettini,
Center for Structural Biology, Texas A&M Univ.
With support from: National Institutes of Health
Automated Structure Determination
• Key step to high-throughput Structural Genomics,
structure-based drug design, etc.
• Many computational tools to generate a map, but...
• Given electron density map, how to extract atomic
coordinates automatically?
• Currently requires humans (+O): potential bottleneck
• Sources of difficulty: complexity, low resolution,
phase errors, weak density
• Related methods: Shake&Bake, ARP/wARP, XPowerfit, template convolution...
Overview of TEXTAL
• Apply pattern recognition techniques
• Exploit database of previously-solved maps
• Model molecular structures in local regions
(e.g. spheres of 5 Angstrom radius)
• Intuitive principles:
1) Have I ever seen a region with a
pattern of density like this before?
2) If so, what were previous
local atomic coordinates?
Overview (cont’d)
• Divide-and-Conquer:
1) identify alpha-carbon positions (chain-tracing)
2) model regions around alpha-carbons (CAs),
including backbone and side-chain atoms
3) concatenate local models back together,
resolve any conflicts
• Database contains many regions centered on
CAs from previous maps
• ~5A radius is about the right scale for “structural repetition”
Main Stages of TEXTAL
electron density map
→ CAPRA
→ C-alpha chains
→ LOOKUP (build in side-chain and main-chain atoms locally around each CA)
→ model (initial coordinates)
→ Post-processing routines (example: real-space refinement)
→ model (final coordinates)
→ Human Crystallographer (editing)
Feedback loop (model back to map): Reciprocal-space refinement/ML DM
Feature Extraction
• Database: ~10^5 regions from ~100 maps
• How to identify closest match (efficiently)?
• Calculate numerical features that represent the
pattern in each region
• Must be rotation-invariant
• Search can be very fast: just compare features
F=<1.72,-0.39,1.04,1.55...>
F=<0.90,0.65,-1.40,0.87...>
F=<1.58,0.18,1.09,-0.25...>
F=<1.79,-0.43,0.88,1.52...>
Rotation-Invariant Features
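• Average density: $\mu = (1/n)\sum_i \rho_i$, where $\rho_i$ is the
density at each lattice point in the region
• Other statistical features: standard
deviation, kurtosis…
• Distance to center of mass:
– $\langle x_c, y_c, z_c \rangle = (1/n)\,\langle \sum_i x_i\rho_i/\mu,\ \sum_i y_i\rho_i/\mu,\ \sum_i z_i\rho_i/\mu \rangle$
– $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$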
More Features
• Moments of inertia
– measures dispersion around axes of symmetry
in a density distribution
– calculate 3x3 inertia matrix
– diagonalize to get eigenvalues
– sort from largest to smallest
– take magnitudes and ratios of moments
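The moment-of-inertia steps listed above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the inertia matrix is taken about the density-weighted center of mass, density values act as masses, and the names `inertia_moments` and `moment_features` are hypothetical.

```python
import numpy as np

def inertia_moments(coords, rho):
    """Return the three principal moments, sorted largest to smallest."""
    coords = np.asarray(coords, dtype=float)
    rho = np.asarray(rho, dtype=float)
    # Assumption: moments are computed about the density-weighted center.
    com = (coords * rho[:, None]).sum(axis=0) / rho.sum()
    r = coords - com
    sq = (r ** 2).sum(axis=1)
    # 3x3 inertia matrix: sum_i rho_i * (|r_i|^2 * E - r_i r_i^T)
    inertia = (rho * sq).sum() * np.eye(3) - (r * rho[:, None]).T @ r
    # The matrix is symmetric, so eigvalsh applies; sort descending.
    return np.sort(np.linalg.eigvalsh(inertia))[::-1]

def moment_features(moments):
    """Magnitude of the largest moment plus two ratios of moments."""
    m1, m2, m3 = moments
    return {"moment_1": m1, "ratio_1_2": m1 / m2, "ratio_1_3": m1 / m3}

# Toy region (made-up values): unit densities at +/-1, +/-2, +/-3 on the axes.
coords = [(1, 0, 0), (-1, 0, 0), (0, 2, 0), (0, -2, 0), (0, 0, 3), (0, 0, -3)]
rho = [1.0] * 6
print(moment_features(inertia_moments(coords, rho)))
```

Because eigenvalues do not change when the region is rotated, the magnitudes and ratios are rotation-invariant. A real implementation would need to guard against a zero smallest moment before taking ratios.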
More Features
• Spoke angles
– if region centered on CA, should have 3
“spokes” of density emanating from center
– find best-fit vectors; calc. angles among them
• surface area of contours
• connectivity of density/bones in region
• other geometrical features...
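For the spoke-angle features, a minimal sketch is shown below. It assumes the three best-fit spoke vectors have already been found (that fitting step is not shown here) and only computes the pairwise angles and their minimum, median, and maximum. The names `angle_deg` and `spoke_angle_features` are hypothetical.

```python
import math
from itertools import combinations

def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to protect acos from rounding error.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))

def spoke_angle_features(spokes):
    """spokes: exactly three vectors pointing from the region center."""
    angles = sorted(angle_deg(u, v) for u, v in combinations(spokes, 2))
    return {"min": angles[0], "median": angles[1], "max": angles[2]}

# Toy spokes (made-up values).
spokes = [(1, 0, 0), (0, 1, 0), (-1, 0, 0)]
print(spoke_angle_features(spokes))
```

Angles between vectors depend only on their relative orientation, so these three numbers are unchanged when the whole region is rotated.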
Feature Weights
Feature                       Weight   Radius (A)
Distance to center of mass    0.183    5
ratio of moments 1 and 3      0.153    4
ratio of moments 1 and 3      0.136    5
skewness                      0.080    3
skewness                      0.055    6
ratio of moments 1 and 2      0.055    4
median spoke angle            0.052    6
minimum spoke angle           0.051    4
skewness                      0.049    5
ratio of moments 1 and 2      0.038    5
maximum spoke angle           0.037    4
ratio of moments 1 and 3      0.031    3
magnitude of moment 1         0.022    6
median spoke angle            0.019    4
minimum spoke angle           0.015    6
CAPRA: C-Alpha Pattern-Recognition Algorithm
Density map
→ Trace
→ pseudo atoms
→ Neural Network
→ predictions of distance to true CA
→ Linking into C-alpha chains
→ C-alpha coordinates
• Tracer - remove lattice points from map (lowest density
first) without breaking connectivity
• Neural network - for each pseudo atom, extract features,
input to network, predict distances to CAs (1:10 in trace),
trained on example points in real maps
• Linking - desire long chains, good CA predictions (not in
side-chains), “structurally plausible” (e.g. linear, helical)
Example of the CAPRA Process
Example of CAPRA chains
The LOOKUP Process
Database Construction
• Ideally would use solved MAD/MIR maps
• Using “back-transformed” maps works well
• PDB → structure factors (include B-factors)
• keep reflections down to 2.8A
• Fourier transform → electron density map
• 50 proteins from PDBSelect (non-homol.)
• about 50,000 regions
• Feature extraction done offline
Details of Matching Process
• Feature-based matching:
– Euclidean distance metric between feature vectors.
– $dist(R_1,R_2) = \sum_i w_i\,(F_i(R_1) - F_i(R_2))^2$
• Must weight features by relevance
– less-relevant features add noise
– Slider algorithm: optimize weights by comparing
features in matching regions versus mismatches
• Verify selections by density correlation
– requires search for optimal rotation
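The feature-based matching step can be sketched as a weighted nearest-neighbor search. The example below is only illustrative: the function names, the region labels, and the unit weights are assumptions, and the vectors reuse just the first four components of the truncated example vectors shown on the Feature Extraction slide. The density-correlation check with its rotation search is not implemented here.

```python
def weighted_distance(f1, f2, weights):
    """Weighted sum of squared feature differences, as in dist(R1, R2)."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def top_matches(query, database, weights, k=3):
    """Return the names of the k database regions closest to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: weighted_distance(query, item[1], weights),
    )
    return [name for name, _ in ranked[:k]]

# Toy data: region names are hypothetical, and each vector is cut to four features.
database = {
    "region_A": [0.90, 0.65, -1.40, 0.87],
    "region_B": [1.58, 0.18, 1.09, -0.25],
    "region_C": [1.79, -0.43, 0.88, 1.52],
}
query = [1.72, -0.39, 1.04, 1.55]
weights = [1.0, 1.0, 1.0, 1.0]  # placeholder; real weights would come from Slider

print(top_matches(query, database, weights, k=2))
```

In a pipeline of this shape, the top few candidates from the fast feature comparison would then be passed to the slower density-correlation check, so the expensive rotation search runs on only a handful of regions.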
Post-Processing Routines
• Imperfections in the initial model:
– backbone atoms not necessarily juxtaposed
between adjacent residues, or in same direction
– side-chains occasionally “flipped” into backbone
– residue identities often incorrect (based on dens.)
• Fixing “flips” and direction - take candidate
match with next highest correlation
• Real-space refinement: regularizes backbone
• Use sequence alignment to fix identities?
New Results on Real MAD Maps
protein   size (#aa)   type   source   reso.
CZRA      95           α      MAD      2.3A
M01       317          α+β    MAD      2.4A

                    CZRA        M01
# residues built    86          286
CAs missed          11 (a)      34 (b)
incorrect CAs       2           5
# chains            4           8
longest chains      39,27,18    96,85,65
CA RMSD             0.79        0.97
overall RMSD        0.84        1.07
(a) CZRA: missed a 5-res loop (weak density) and C-terminus
(b) M01: missed a 17-res helix, 9 deletions, 5 due to breaks, 3-res false backbone
Histograms of Distances
Between Matched Atoms
[Figure: two histograms, one for CZRA and one for M01, of distances between matched atoms; horizontal axis binned from 0.2 to 3, vertical axis showing counts]
Analysis of Amino Acid Types
protein   % identity   % struct. sim.
CZRA      27.4         71.4
M01       16.3         56.2
Confusion Matrix for CZRA:
(columns: amino acid in TEXTAL model; rows: amino acid in true structure)
     G  A  C  S  P  V  T  I  D  N  L  Q  E  M  H  K  F  Y  R  W
G |  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A |  0  3  0  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
S |  0  2  0  5  0  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V |  0  2  0  1  0  3  1  0  0  0  0  0  0  0  0  0  0  0  0  0
T |  0  0  0  0  0  2  1  1  0  0  0  0  0  0  0  0  0  0  0  0
I |  0  0  0  0  0  3  1  1  0  0  1  0  0  0  0  0  0  0  0  0
D |  0  1  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0
N |  0  0  0  1  0  1  0  0  0  0  2  0  0  0  0  1  0  0  0  0
L |  0  0  0  2  0  2  0  1  0  1  6  0  0  0  0  0  0  0  0  0
Q |  0  1  0  1  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0
E |  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0  2  0  0  0  0
M |  0  0  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
H |  0  2  0  2  0  0  0  0  1  0  0  0  0  0  0  2  0  0  0  0
K |  0  0  0  1  0  1  0  0  1  0  1  0  1  0  0  1  0  0  0  0
F |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
Y |  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0
R |  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0  0  1  0
W |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
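As a worked check, the percent identity reported for CZRA can be recomputed from the confusion matrix above: identity is taken here to be the diagonal count (same amino acid in the model and in the true structure) divided by the total number of residues compared. That definition, and the variable names, are assumptions for this sketch. The counts are transcribed from the matrix, one row per amino acid in the order G A C S P V T I D N L Q E M H K F Y R W.

```python
AMINO_ACIDS = "GACSPVTIDNLQEMHKFYRW"

# One row per amino acid, in the same order as AMINO_ACIDS.
ROWS = """
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 3 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 2 0 5 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 2 0 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 3 1 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0
0 0 0 1 0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0
0 0 0 2 0 2 0 1 0 1 6 0 0 0 0 0 0 0 0 0
0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
0 2 0 2 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0
0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""

matrix = [[int(v) for v in line.split()] for line in ROWS.strip().splitlines()]
assert len(matrix) == 20 and all(len(row) == 20 for row in matrix)

total = sum(sum(row) for row in matrix)          # residues compared
identical = sum(matrix[i][i] for i in range(20))  # diagonal entries
percent_identity = round(100.0 * identical / total, 1)

print(identical, total, percent_identity)  # 23 84 27.4
```

The result, 23 of 84 residues or 27.4%, agrees with the % identity listed for CZRA. The total of 84 also equals the 86 residues built minus the 2 incorrect CAs.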