Usha's presentation - The University of Texas at Dallas
Pair-wise Structural Comparison Using the DaliLite Software of DALI
Rajalekshmy Usha
Overview
History
Protein Structure Comparison
Comparison Algorithm
Input and Output Interface
Demo on the software
Analysis of the Result
References
History
Earliest resources (1970s) were sequence data
Pioneered by Dayhoff
Structural databases appeared in the mid-1990s
Structural data is sparse
PDB (Protein Data Bank) has 39,464 structural entries to date
NCBI (National Center for Biotechnology Information) has over 12 million sequence entries
Popular structural classifications of proteins:
Structural Classification of Proteins (SCOP)
Distance Matrix Alignment (DALI)
CATH
Others are DDBase, 3Dee and DaliDD (Dali Domain Database)
Protein Structure Comparison
Popularized by Liisa Holm and Chris Sander (1993)
DALI
Created by Liisa Holm
Completely automated
The full Dali server is too large and complex to be installed at external sites
Uses distance matrices
DaliLite is the standalone version of the Dali server's search engine
Why use structural data?
The 3D structures of proteins have been conserved over evolutionary time
This leads to interesting evolutionary observations and to the prediction of
structure and function
Comparison Algorithm
Exhaustive, all-against-all 3D structure comparison
Helps to understand the distribution of known structures in shape space
Uses protein structures from the PDB
Uses distance matrices
Start from the three-dimensional coordinates of each protein's residues (i.e., the C-α atoms)
Compute the pair-wise distances between the residue centers (a 2D
representation of the 3D structure)
The distance matrices (contact maps) of the two structures are overlaid
and shifted horizontally and vertically relative to each other
Overlap along the diagonal represents similar backbone
conformations (secondary structure)
Off-diagonal similarity represents
tertiary structure similarity
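As a minimal sketch of the distance-matrix representation described above, the following Python computes all pair-wise C-α distances from a list of coordinates. The function name `distance_matrix` and the three toy coordinates are illustrative assumptions; this is not DaliLite code, and real input would come from a parsed PDB file.

```python
import math

def distance_matrix(ca_coords):
    """Return the symmetric matrix of pair-wise C-alpha distances."""
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            matrix[i][j] = matrix[j][i] = d  # distances are symmetric
    return matrix

# Toy coordinates (in angstroms), made up for illustration only.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dm = distance_matrix(coords)
```

Because the matrix depends only on distances within one structure, it is unchanged by rotating or translating the protein, which is why two structures can be compared without first superimposing them.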
Underlying Algorithms
Branch and Bound Search to find the optimal
alignment
Uses distance matrices
Collapsed into regions of overlap (sub-matrices) of fixed
size
The sub-matrices are stitched together if there is an
overlap with the neighboring fragments
Uses similarity score
Monte Carlo Optimization Algorithm
To optimize the alignment
Understanding the Formula Used
Similarity Score
core is the set of structurally equivalent residue pairs
between proteins A and B
Δ is the deviation of the intramolecular Cα-Cα distances between (iA,jA) and
(iB,jB), relative to their arithmetic mean d
θ is the similarity threshold, set empirically to 0.2
ω is the envelope function, ω = exp(-d^2/r^2), where r = 20 Å
High score means good fit
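With the symbols defined above, the score can be read as a double sum over the core: S = sum over i and j in core of (θ - Δ) · ω(d), following the form published in Holm and Sander (1993). The Python below is a sketch of that sum under stated assumptions: `dali_score`, `THETA` and `R_ANGSTROM` are illustrative names, the inputs are precomputed distance matrices, and the i = j term is assigned the maximum per-pair value θ to avoid dividing by a zero distance.

```python
import math

THETA = 0.2        # similarity threshold from the slide
R_ANGSTROM = 20.0  # envelope radius r from the slide

def dali_score(dist_a, dist_b, core):
    """Sum (theta - delta) * omega(d) over all pairs of equivalences.

    dist_a, dist_b: C-alpha distance matrices of proteins A and B.
    core: list of (index_in_A, index_in_B) equivalent residue pairs.
    """
    total = 0.0
    for ia, ib in core:
        for ja, jb in core:
            d_a, d_b = dist_a[ia][ja], dist_b[ib][jb]
            d_mean = (d_a + d_b) / 2.0
            if d_mean == 0.0:
                total += THETA  # assumption: a residue paired with itself scores theta
                continue
            delta = abs(d_a - d_b) / d_mean  # relative deviation
            omega = math.exp(-(d_mean ** 2) / (R_ANGSTROM ** 2))
            total += (THETA - delta) * omega
    return total

same = [[0.0, 10.0], [10.0, 0.0]]
stretched = [[0.0, 15.0], [15.0, 0.0]]
core = [(0, 0), (1, 1)]
score_identical = dali_score(same, same, core)
score_distorted = dali_score(same, stretched, core)
```

Pairs whose distances agree to within 20% of their mean contribute positively; larger deviations contribute negatively, and the envelope ω down-weights distant pairs.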
Branch and Bound Search
Consider only nongapped segment pairs
This reduces the complexity of structure alignment
Natural segmentation uses the secondary structures of the
query structure
Diagonal lines represent the nongapped segment pairings
E.g. α helices and β strands
Pairing between segments of query structure (horizontal)
and the proteins being aligned to it (vertical).
Compute an alignment score (similarity score) within the segments and
between the segments
Split the search space into smaller subsets of candidate pairings
(matrices)
Compute an upper bound on the sum-of-pairs score for each subset
The subset with the highest bound is split further until it contains a single alignment, which is then optimal
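The bounding idea above can be sketched on a toy problem. In the Python below, each query segment has a list of candidate placements, the score is a sum over segment pairs, and the upper bound of a subspace is the sum of the best achievable score for each pair taken independently. The names `upper_bound`, `branch_and_bound` and `toy_pair_score`, the halving rule, and the toy scoring function are all assumptions made for illustration; this is a generic best-first branch and bound, not the DaliLite implementation.

```python
import heapq
from itertools import combinations

def upper_bound(space, pair_score):
    """Sum, over segment pairs, of the best score any candidates could reach."""
    total = 0.0
    for i, j in combinations(range(len(space)), 2):
        total += max(pair_score(i, a, j, b) for a in space[i] for b in space[j])
    return total

def branch_and_bound(candidates, pair_score):
    """Return (assignment, score) maximizing the sum-of-pairs score.

    Assumes at least one segment and a non-empty candidate list per segment.
    """
    start = tuple(tuple(c) for c in candidates)
    counter = 0  # tie-breaker so the heap never compares subspaces directly
    heap = [(-upper_bound(start, pair_score), counter, start)]
    while heap:
        neg_bound, _, space = heapq.heappop(heap)  # highest bound first
        widest = max(range(len(space)), key=lambda k: len(space[k]))
        if len(space[widest]) == 1:
            # One candidate per segment: the bound is the exact score,
            # and no remaining subspace can exceed it.
            return [c[0] for c in space], -neg_bound
        mid = len(space[widest]) // 2
        for half in (space[widest][:mid], space[widest][mid:]):
            child = space[:widest] + (half,) + space[widest + 1:]
            counter += 1
            heapq.heappush(heap, (-upper_bound(child, pair_score), counter, child))
    raise ValueError("empty search space")

def toy_pair_score(i, a, j, b):
    # Made-up score: best when placements are spaced 2 apart per segment step.
    return 1.0 - abs((b - a) - 2 * (j - i))

candidates = [[0, 1, 2, 3], [1, 2, 3, 4], [3, 4, 5, 6]]
best, best_score = branch_and_bound(candidates, toy_pair_score)
```

The search stops as soon as the subspace with the highest bound contains a single assignment, because every other subspace is bounded from above by a value that is no larger.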
Branch and Bound Search
Image source: Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16, 567
Monte Carlo Optimization Algorithm
A basic move is made
The move is random
Probability of accepting a move is p = exp(beta * (S' - S)), where S' = new
score, S = old score and beta is a parameter
Involves addition or deletion of residue equivalence assignment
Two basic modes of operation
Expansion mode
Alignment is incremented by using overlapping contact patterns
Extend the alignment by including all pairs of matching contact
patterns with the same residue pairs (iA ,iB)
Adding a new fragment requires tentative removal of inconsistent
previous equivalence assignments
The removal is permanent
Trimming mode
Removal of fragments that give a net negative contribution to the
similarity score
Performed after the first expansion cycle and after every fifth cycle thereafter
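The acceptance rule p = exp(beta * (S' - S)) can be sketched as a small Python function. The name `accept_move` and the injectable random generator are illustrative assumptions; the sketch covers only the accept/reject decision, not the expansion or trimming moves themselves.

```python
import math
import random

def accept_move(old_score, new_score, beta, rng=random):
    """Decide whether to accept a random move.

    Moves that do not lower the score are always accepted (p >= 1);
    worse moves are accepted with probability exp(beta * (S' - S)).
    """
    if new_score >= old_score:
        return True  # also avoids overflow in exp() for large gains
    p = math.exp(beta * (new_score - old_score))
    return rng.random() < p

rng = random.Random(0)  # fixed seed so the example is repeatable
improving = accept_move(10.0, 12.0, beta=1.0, rng=rng)
much_worse = accept_move(10.0, -1000.0, beta=1.0, rng=rng)
```

A larger beta makes the search greedier, since score-lowering moves are accepted less often; with beta set to 0 every move is accepted.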
The Monte Carlo Optimization
Thick black line indicates the optimum found after branch and bound
algorithm
Red dashed line indicates final alignment after Monte Carlo
Optimization
Image source: Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16, 567
DaliLite Database Search Input Interface
DaliLite Database Server Output
DaliLite Database Server Output : 2
DaliLite Pair wise Comparison Input Interface
Statistical Analysis of the Result
Z-score:
X is the raw score to be standardized
σ is the standard deviation
μ is the mean
Pairs with a Z-score < 2.0 are considered structurally dissimilar
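With the symbols listed above, the standard Z-score is Z = (X - μ) / σ. The Python below is a minimal sketch of that standardization and of the 2.0 cutoff; the function names and the example numbers are assumptions for illustration, and the mean and standard deviation are taken as given rather than estimated the way DaliLite does.

```python
def z_score(raw, mean, std_dev):
    """Standardize a raw score: Z = (X - mu) / sigma."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / std_dev

def is_structurally_similar(z, cutoff=2.0):
    """Apply the slide's rule: Z-scores below 2.0 count as dissimilar."""
    return z >= cutoff

# Made-up numbers for illustration only.
z_example = z_score(raw=150.0, mean=100.0, std_dev=20.0)
```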
RMSD (Root Mean Square Deviation)
Average distance between the backbones of the superimposed proteins
δ is the distance between each of the N pairs of equivalent Cα atoms
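In its usual form, RMSD = sqrt((1/N) · Σ δ²), summed over the N equivalent Cα pairs. The sketch below assumes the two coordinate lists are already superimposed and listed in equivalent order; the function name `rmsd` and the toy coordinates are illustrative, and the superposition step itself is not shown.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation over equivalent, superimposed atom pairs."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy coordinates: structure B is structure A shifted by (3, 4, 0).
struct_a = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
struct_b = [(3.0, 4.0, 0.0), (4.0, 5.0, 1.0)]
shifted_rmsd = rmsd(struct_a, struct_b)
```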
Sequence Identity
percentage of identical amino acids over all structurally equivalent
residues
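Sequence identity over the structural alignment can be sketched as follows. The function name `sequence_identity`, the index-pair format of the equivalences, and the toy sequences are assumptions for illustration.

```python
def sequence_identity(seq_a, seq_b, equivalences):
    """Percentage of structurally equivalent residue pairs with identical amino acids.

    equivalences: list of (index_in_seq_a, index_in_seq_b) pairs.
    """
    if not equivalences:
        return 0.0
    identical = sum(1 for i, j in equivalences if seq_a[i] == seq_b[j])
    return 100.0 * identical / len(equivalences)

# Toy one-letter sequences with four aligned positions, three identical.
identity = sequence_identity("ACDE", "ACGE", [(0, 0), (1, 1), (2, 2), (3, 3)])
```

The percentage is taken over aligned positions only, so unaligned residues do not lower it.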
DaliLite Output
DaliLite Output : 2 – cont’d
DaliLite Output : 3 – cont’d
Demo on Using DaliLite
http://www.ebi.ac.uk/dali/index.html
1CDK and 1CJA
[Figure: structures of chains 1CDK:A and 1CJA:A. Image source: PDB.org]
1CDK is a cAMP-dependent protein kinase and 1CJA is an actin-fragmin kinase
1CPC and 1KTP
[Figure: structures of chains 1CPC:A and 1KTP:A. Image source: PDB.org]
1CPC and 1KTP belong to the same phycocyanin family (light harvesting protein complex); both
have six helices sequentially aligned.
References
Holm L., Sander C. (1993) Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology 233(1): 123-138
Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16: 566-567
Holm L., Sander C. (1996) Mapping the protein universe. Science 273: 595-602
Bourne P.E., Weissig H. Structural Bioinformatics. Wiley-Liss, Hoboken, New Jersey
http://wikipedia.org