Presentation (PowerPoint File)
Download
Report
Transcript Presentation (PowerPoint File)
Data Mining for Protein Structure
Prediction
Mohammed J. Zaki
SPIDER Data Mining Project:
Scalable, Parallel and Interactive
Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki
Outline of the Talk
How do proteins form?
Protein folding problem
Contact map mining
Using HMMs based on local motifs
Mining “physical” dense frequent patterns
(non-local motifs)
Future directions
Heuristic rules
Folding pathways
How do Proteins Form?
How do Proteins Form?
Building Blocks of Biological Systems
DNA (nucleotides, 4 types): information carrier/encoder
RNA: bridge from DNA to protein
Protein (amino acids, 20 types): action molecules.
Processes
Replication of DNA
Transcription of gene (DNA) to messenger RNA (mRNA)
Splicing of non-coding regions of the genes (introns)
Translation of mRNA into proteins
Folding of proteins into 3D structure
Biochemical or structural functions of proteins
Protein Folding Problem
Protein Structures
Primary structure
Un-branched polymer
20 side chains (residues or amino acids)
Higher order structures
Secondary: local (consecutive) in sequence
Tertiary: 3D fold of one polypeptide chain
Quaternary: Chains packing together
Amino Acid
Polypeptide Chain
Torsion Angles
The Protein Folding Problem
Contact Map Mining
Contact Map
Amino acids Ai and Aj are in contact if their
3D distance is less than threshold (7å)
Sequence separation is given as |i-j|
Contact map C is an N x N matrix, where
C(i,j) = 1 if Ai and Aj are in contact
C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
Protein 2igd: 3D Structure
Anti-parallel Beta Sheets
Alpha Helix
Parallel Beta Sheets
Contact Map (2igd PDB)
Amino Acid Aj
Parallel Beta Sheets
Anti-parallel Beta Sheets
Alpha Helix
Amino Acid Ai
How much information in Amino Acids
Alone: Classification Problem
A pair of amino acids (Ai,Aj) is an instance
The class: C (1) or NC (0), i.e., contact or
non-contact
Highly skewed class distribution
1.7% C and 98.3% NC; 300K C vs 17,3M NC
Features for each instance
Ai and Aj
Class: C or NC
Predicting Protein Contacts
Predict contacts for new sequence
A D T S
A
0 1 0
D
0 1
T
1
S
K
C
P
K
0
0
0
0
C
1
0
0
1
0
P
0
1
0
0
1
0
Ai
A
A
D
D
T
S
K
Aj
T
C
S
P
S
C
P
F1
..
..
..
..
..
..
..
Fn
..
..
..
..
..
..
..
Classification via Association Mining
Association mining good for skewed data
Mining: Mine frequent itemsets in C data (Dc)
P(X | Dc) = Frequency(X | Dc) / |Dc|
Counting: find P(X | Dnc)
Pruning
Likelihood of a contact: r = P(X|Dc) / P(X|Dnc)
Prune pattern X if ratio r of contact to non-contact
probability is less then some threshold
i.e., keep only the patterns highly predictive of
contacts
Testing Phase
90-10 split into training and testing
2.4 million pairs, with 36K contacts (1.5%)
Evidence calculation:
Find matching patterns P for each instance
Compute cumulative frequency in C and NC
Sc = Sum of frequency (X | Dc) where X in P
Snc = Sum of frequency (X | Dnc) where X in P
Compute evidence: ratio of Sc / Snc
Prediction: Sort instances on evidence
Predict top PR fraction as contacts
Experiments
794 Proteins from Protein Data Bank
Distinct structures (< 25% similarity)
Longest: 907, Smallest: 35 amino acids
90-10 split for training-testing
Total pairs: 20 million (> 2.5 GB)
Contacts: 330 thousand (1.6%)
Highly uneven class distribution
Evaluation Metrics
Na: set of all pairs
Na*: all pairs with positive evidence
Ntc: true contacts in test data
Ntc*: true contacts with positive evidence
Npc: predicted contacts
Ntpc: correctly predicted contacts
Accuracy = Ntpc / Npc
Coverage = Ntpc / Ntc
Prediction Ratio (PR): Ntc*/Na*
Random Predictor Accuracy: Ntc/Na
Results (Amino Acids; All Lengths)
Crossover: 7% accuracy and 7% coverage; 2 times over Random
Results (Amino Acids; by length)
1-100: 12% accuracy(A) and coverage (C); 100-170: 6% A and C
170-300: 4.5% A and C; 300+: 2% A and C
Using HMMs based on
Local Motifs to Improve
Classification
An HMM for Local Predictions
HMMSTR (Chris Bystroff, Biology, RPI)
Build a library of short sequences that tend to
fold uniquely across protein families: the ISites Library
Treat each motif as a Markov chain
Merge the motifs into a global HMM for local
structure prediction
Training the HMM
Build I-sites Library
Short sequence motifs (3 to 19)
Exhaustive clustering of sequences
Non-redundant PDB dataset (< 25% similarity)
Build an HMM
Each of 262 motifs is a chain of Markov states
Each state has sequence and structure for
one position
Merge I-sites motifs hierarchically to get one
global HMM for all the motifs
HMM Output
Total of 282 States in the HMM
Each state produces or “emits”:
Amino acid profile (20 probability values)
Secondary structure (D) (helix, strand or loop)
Backbone angles (R) (11 dihedral angle symbols)
Finer structural context (C) (10 context symbols)
I-Sites Motifs (Initiation Sites)
Beta Hairpin
Beta to Alpha
Helix C-Cap
Data Format and Preparation
Take the 794 PDB proteins
Compute optimal alignment to HMM
Find best state sequence for the observed
acids
Output probability distribution of a residue over
all the 282 HMM states
Integrate the 3 datasets
Alignment probability distribution (Nx282)
Amino acid and context information (D, R, C)
Contact map (NxN)
HMMSTR Output (per Protein)
PDB Name: 153l_
Sequence Length: 185
Position: 1
Residue: R
Coordinates: 0.0, -73.2, 17
AA Profile (20 values): 0.0
HMMSTR State Probabili
0.0 0.7 0.3 0.0
Distances (185 values): 0
...
Position: 185
Residue: Y
Coordinates: -88.7 , 0.0, 0
AA Profile (20 values): 0.0
HMMSTR State Probabili
0.0 0.2 0.5 0.3 0.0
Distances (185 values): 15
Adding features from HMMSTR
The class: C (1) or NC (0)
Highly skewed class distribution
Approx 1.5% C and 98.5% NC
Features for each instance
Ai Aj Di Dj Ri Rj Ci Cj
Profile: pi1 pi2 … pi20 pj1 pj2 … pj20
HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282
Class: C or NC
HMM and AA + (R,D,C) ; All Lengths
Left Crossover: 19% accuracy and coverage; 5.3 times over Random
Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random
HMM + AA + R,D,C (by length)
1-100: 30% accuracy(A) and coverage (C); 100-170: 17% A and C
170-300: 10% A and C; 300+: 6% A and C
Predicted Contact Map (2igd)
Summary of Classification Results
Challenging prediction problem
In essence, we have to predict a contact
matrix for a new protein
Hybrid HMM/Associations approach
Best results to-date: 19% overall
accuracy/coverage, 30% for short proteins
14.4% Accuracy (Fariselli, Casadio ‘99; NN)
13% Accuracy (Thomas et al ‘96)
Short proteins: 26% (Olmea, Valencia, ‘97)
Mining “Physical” Dense
Frequent Patterns (nonlocal motifs)
Characterizing Physical, Protein-like
Contact Maps
A very small subset of all contact maps code
for physically possible proteins (self-avoiding,
globular chains)
A contact map must:
Satisfy geometric constraints
Represent low-energy structure
What are the typical non-local interactions?
Frequent dense 0/1 submatrices in contact maps
3-step approach: 1) data generation, 2) dense
pattern mining, and 3) mapping to structure space
Dense Pattern Mining
12,524 protein-like 60 residue structures
Use HMMSTR to generate protein-like sequences
Use ROSETTA to generate their structures
Monte Carlo fragment insertion (from I-sites library)
Up to 5 possible low-energy structures retained
Frequent 2D Pattern Mining
Use WxW sliding window; W window size
Measure density under each window
(N-W)^2 / 2 possible windows per N length protein
Look for “minimum density”; scale away from diag
Try different window sizes
Counting Dense Patterns
Naïve Approach:for W=5, N=60 there are
1485 windows per protein. Total 15 Million
possible windows for 12,524 proteins
Test if two submatrices are equal
Linear search: O(P x W^2) with P current dense patterns
Hash based: O(W^2)
Our Approach: 2-level Hashing
O(W) time
Pattern (WxW Submatrix) Encoding
Encode submatrix as string (W integers)
Submatrix
Integer Value
00000
0
01100
12
01000
8
01000
8
00000
0
Concatenated String: 0.12.8.8.0
Two-level Hashing
v1.v2 .....vW
String ID (M) =
W
Level 1 (approximate): h1( M ) v
i
i 1
Level2 (exact) : h2 (M) = StringID (M)
Binding Patterns to Proteins Sequence and
Structure
Using window size, W=5
StringID:0.12.8.8.0, Support = 170
00000
01100
01000
01000
00000
Occurrences:
pdb-name (X,Y)
1070.0
52,30
1145.0
51,13
1251.2
42,6
1312.0
54,11
1732.0
49,6
2895.0
49,7
...
X_sequence
ILLKN
VFALH
EVCLR
HGYDE
HRFAK
SRCLD
Y_sequence
TFVRI
GFHIA
GSKFG
ATFAK
KELAG
DTIYY
Interaction
alpha::beta
alpha::strand
alpha::strand
alpha::beta
alpha::beta
alpha::beta
Frequent Dense Local Patterns
Submatrix
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
1
0 0
1
0 0
1
0
0
0
0
0
0
0
0
0
1
0
1
0
1
0
0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 1
0 1 1
0 1
0
1 1
0
1 1
0
1 0
0
0 0
0
0 0
0
0 0
0
0 0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0
0
0 1
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
1
0 0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0 0 0
0 0 0
0 0 0
1 0 0
0 1 0
0 0 1
0 0 0
0 0 0
Frequent Dense Non-Local Patterns
Alpha – Alpha
Alpha – Beta Sheet
Frequent Dense Non-Local Patterns
Alpha – Beta Turn
Beta Sheet – Beta Turn
Future Directions
Mining Physicality Rules
Comprehensive list of non-local motifs
I-sites library catalogs local motifs
Mining heuristic rules for “physicality”
Based on simple geometric constraints
Rules governing contacts and non-contacts
Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then
C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1,
then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1,
then C(i+2,j) = 0
Heuristic Rules of Physicality
Anti-parallel Beta Sheets
i+2
j
i
j+2
Parallel Beta Sheets
i+2
j+2
i
j
If C(i,j+2) = 1 and C(i+2,j) = 1,
then C(i,j) = 0 and C(i+2,j+2) = 0
If C(i,j) = 1 and C(i+2,j+2) = 1,
then C(i,j+2) = 0 and C(i+2,j) = 0
Protein Folding Pathways
Rules for Pathways in Contact Map Space
Pathway is time-ordered sequence of contacts
Condensation rule: New contact within Smax
U(i,j) <= Smax; U(i,j) is unfolded residues from i to j
Pathway prediction is complementary to structure
prediction
Contact Map Folding Pathways