MSAProteinStructurePred

MSA and protein structure
prediction.
Nov. 20, 2012
Learning objectives-Become familiar with the cluster
alignment approach to multiple sequence alignment.
Understand the biochemical basis of secondary structure
prediction programs. Become familiar with the Protein
Data Bank. Understand neural networks and how they help
to predict secondary structure.
Workshop-Predict secondary structure of p53.
Quiz on Nov. 27
Next HW due on Nov. 27
• Problems 2 and 3 from Chapter 6
• Problem 1 from Chapter 7
Multiple Sequence Alignment
Chapter 6
Multiple Sequence Alignment
Collection of three or more amino acid (or
nucleic acid) sequences partially or
completely aligned.
Aligned residues tend to occupy
corresponding positions in the 3-D structure
of each aligned protein.
Example of Sequence Alignment
using Clustal W
Asterisk represents identity
: represents high similarity
. represents low similarity
General steps to multiple alignment.
Create Alignment
Edit the alignment to ensure that regions that are
obviously dissimilar are removed
USED FOR:
• Phylogenetic analysis
• Structure analysis
• Finding conserved motifs to deduce function
• Design of PCR primers
Practical use of MSA
Helps to place protein into a group of
related proteins. It will provide insight into
function, structure and evolution.
Identifies sequencing errors
Identifies important regulatory regions in
the promoters of genes.
Clustal W (Thompson et al.,
1994)
CLUSTAL=Cluster alignment
The underlying concept is that groups of
sequences are phylogenetically related. If they
can be aligned, then one can construct a
phylogenetic tree.
Phylogenetic tree-a tree showing the evolutionary
relationships among various biological species or
other entities that are believed to have a common
ancestor.
Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise alignment: calculation of distance matrix
Creation of unrooted neighbor-joining tree
Rooted NJ tree (guide tree) and calculation of sequence weights
Progressive alignment following the guide tree
Preliminary pairwise alignments
Compare each pair of sequences.
Different sequences:
      A     B     C
A     -
B    .87    -
C    .59   .60    -
Each element represents the number
of exact matches divided by the
sequence length (ignoring gaps).
Thus, the higher the value of the element
the more closely related the two sequences.
In this matrix, sequence A is 87% identical to sequence B
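A minimal Python sketch of the identity calculation described above. The function name and the example strings are hypothetical, and it assumes that "ignoring gaps" means columns with a gap in either sequence are left out of both the match count and the length.

```python
def fraction_identity(aligned_a: str, aligned_b: str) -> float:
    """Exact matches divided by the number of gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs)


# Hypothetical toy alignment: 6 gap-free columns, 5 of them identical.
print(fraction_identity("ACDE-FG", "ACDQWFG"))
```

A higher value means the two sequences are more closely related, as in the A/B/C matrix above.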
Step 1-Calculation of Distance
Matrix
Use the Distance Matrix to create a Guide Tree to
determine the “order” of the sequences.
            1     2     3     4     5     6     7
Hbb-Hu  1   -
Hbb-Ho  2  .17    -
Hba-Hu  3  .59   .60    -
Hba-Ho  4  .59   .59   .13    -
Myg-Ph  5  .77   .77   .75   .75    -
Gib-Pe  6  .81   .82   .73   .74   .80    -
Lgb-Lu  7  .87   .86   .86   .88   .93   .90    -
I = number of identical aa’s in the pairwise global alignment
D = 1 – I / (total number of aa’s in the shorter sequence)
D = Difference score
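A short Python sketch of the difference score, followed by a scan of the slide's matrix for the most similar pair. The function names and the example counts (83 identical residues, lengths 100 and 120) are hypothetical; the matrix values are copied from the slide.

```python
def difference_score(identical: int, len_a: int, len_b: int) -> float:
    """D = 1 - I / (length of the shorter sequence)."""
    return 1.0 - identical / min(len_a, len_b)


# Lower triangle of the difference matrix (row i holds D against sequences 0..i-1).
NAMES = ["Hbb-Hu", "Hbb-Ho", "Hba-Hu", "Hba-Ho", "Myg-Ph", "Gib-Pe", "Lgb-Lu"]
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]


def closest_pair(names, lower):
    """Return (D, name_1, name_2) for the smallest difference score."""
    best = None
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            if best is None or d < best[0]:
                best = (d, names[j], names[i])
    return best


print(difference_score(83, 100, 120))  # hypothetical counts, about 0.17
print(closest_pair(NAMES, LOWER))      # (0.13, 'Hba-Hu', 'Hba-Ho')
```

The pair with the smallest D (the two alpha-globins) is the most closely related pair in the matrix.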
Step 2-Create an unrooted NJ tree
[Figure: unrooted neighbor-joining tree with leaves Myg-Ph, Hba-Ho, Hba-Hu, Hbb-Ho, Gib-Pe, Hbb-Hu, Lgb-Lu]
Step 3-Create Rooted NJ Tree
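[Figure: rooted NJ tree labeled with sequence weights and alignment order]
Order of alignment (A through E label the intermediate alignments produced by the earlier steps):
1 Hba-Hu vs Hba-Ho
2 Hbb-Hu vs Hbb-Ho
3 A vs B
4 Myg-Ph vs C
5 Gib-Pe vs D
6 Lgb-Lu vs E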
Table 6.2 Sequence weight calculations
Sequence number  Sequence name  Raw sequence weight  Normalized sequence weight
1                Hbb-Hu         0.223                0.506
2                Hbb-Ho         0.226                0.511
3                Hba-Hu         0.193                0.437
4                Hba-Ho         0.203                0.459
5                Myg-Ph         0.411                0.930
6                Gib-Pe         0.399                0.903
7                Lgb-Lu         0.442                1.000
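A small Python sketch of how the normalized column can be reproduced. It assumes, as the table suggests (0.442 maps to 1.000), that each raw weight is divided by the largest raw weight; the function name is hypothetical. Results agree with the table to within rounding of the printed raw values.

```python
RAW_WEIGHTS = {
    "Hbb-Hu": 0.223,
    "Hbb-Ho": 0.226,
    "Hba-Hu": 0.193,
    "Hba-Ho": 0.203,
    "Myg-Ph": 0.411,
    "Gib-Pe": 0.399,
    "Lgb-Lu": 0.442,
}


def normalize_weights(raw):
    """Divide every raw weight by the largest one so the maximum becomes 1.0."""
    largest = max(raw.values())
    return {name: weight / largest for name, weight in raw.items()}


for name, weight in normalize_weights(RAW_WEIGHTS).items():
    print(f"{name}  {weight:.3f}")
```

The most divergent sequence (Lgb-Lu) gets the largest weight, and the closely related globin pairs get the smallest.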
Step 4-Progressive alignment
(From Baxevanis and Ouellette, 2004)
Scoring during progressive alignment
M(t,v) = 0; M(t,i) = -1; M(l,v) = 1; M(l,i) = 2
Following the steps in the above figure, the score for the comparison of A and B at the
outlined position is calculated as:
0 * 0.506 * 0.437 = 0
-1 * 0.506 * 0.459 = -0.232
1 * 0.511 * 0.437 = 0.223
2 * 0.511 * 0.459 = 0.469
(0 + (-0.232) + 0.223 + 0.469)/4 = 0.460/4 = 0.115
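The same calculation as a Python sketch: every residue pair across the two groups is scored, multiplied by both sequence weights, and the products are averaged. The names are hypothetical; the matrix values and weights are the ones on the slide. Note that the four products sum to 0.460, so their average is about 0.115.

```python
# Substitution scores for the residue pairs at the outlined position.
MATRIX = {("t", "v"): 0, ("t", "i"): -1, ("l", "v"): 1, ("l", "i"): 2}

# (residue, normalized sequence weight) for each sequence in the column.
COLUMN_1 = [("t", 0.506), ("l", 0.511)]  # Hbb-Hu, Hbb-Ho
COLUMN_2 = [("v", 0.437), ("i", 0.459)]  # Hba-Hu, Hba-Ho


def weighted_column_score(col_1, col_2, matrix):
    """Average of matrix score * weight * weight over all cross-group pairs."""
    total = 0.0
    for res_1, w_1 in col_1:
        for res_2, w_2 in col_2:
            total += matrix[(res_1, res_2)] * w_1 * w_2
    return total / (len(col_1) * len(col_2))


print(round(weighted_column_score(COLUMN_1, COLUMN_2, MATRIX), 3))  # 0.115
```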
Amino acid weight matrices
As we know, there are many scoring
matrices that one can use depending on the
relatedness of the aligned proteins.
As the alignment proceeds to longer
branches the aa scoring matrices are
changed to those more suitable for distant
evolutionary relationships. The length of
the branch is used to determine which
matrix to use and contributes to the
alignment score.
Rules for alignment
Short stretches of 5 hydrophilic residues often indicate loop or random
coil regions (not essential for structure), and therefore gap penalties are
reduced for such stretches.
Gap penalties for closely related sequences are lowered compared to
more distantly related sequences (“once a gap always a gap” rule). It
is thought that those gaps occur in regions that do not disrupt the
structure or function.
Alignments of proteins of known structure show that gaps do
not occur more frequently than every eight residues. Therefore the
penalty is increased for a gap that falls within eight residues of
another gap. This gives a lower alignment score in that region.
A gap weight is assigned after each aa according to the frequency with which
such a gap naturally occurs after that aa when comparing the structures
of two close homologs.
Multiple Alignment
Considerations
Quality of guide tree. It would be good to have a set of
closely related sequences in the alignment to set the pattern
for more divergent sequences.
If the initial alignments have a problem, the problem is
magnified in subsequent steps.
CLUSTAL W is best when aligning sequences that are
related to each other over their entire lengths.
Remove variable N- and C- terminal regions.
If protein is enriched for G,P,S,N,Q,E,K,R then these
residues should be removed from gap penalty list. (what
types of residues are these?)
Protein structure prediction
Chapter 7
What is secondary structure?
Three major types:
Alpha Helical Regions
Beta Strand Regions (sometimes
called Sheet Regions)
Loops
Coils, Turns, Extended (anything else)
Can we predict the final structure?
http://en.wikipedia.org/wiki/Protein_folding
Some Prediction Methods
Computationally intense methods
• Based on physical properties of aa’s and bonding
patterns. Quantum mechanics and molecular
mechanics-based methods.
Statistics of amino acid distributions in known
structures
• Chou-Fasman, GOR
Sequence similarity to sequences with known
structures
• PSI-PRED
Chou-Fasman
First widely used procedure
Output-helix, strand or turn
Percent accuracy: 60-65%
Psi-BLAST Predict Secondary
Structure (PSIPRED)
Three steps:
1) Generation of position specific scoring matrix.
2) Prediction of initial secondary structure
3) Filtering of predicted structure
PSIPRED
Uses multiple aligned sequences for prediction.
Uses training set of folds with known structure.
Uses a two-stage neural network to predict
structure based on position specific scoring
matrices generated by PSI-BLAST (Jones, 1999)
• First network converts a window of 15 aa’s into a raw
score of h (helix), e (sheet), c (coil) or terminus
• Second network filters the first output. For example, an
output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value of 70-78% (may be the
highest achievable)
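To illustrate what the filtering step accomplishes, here is a deliberately simple Python stand-in. It is not the PSIPRED second network (which is a trained neural network); it is a hypothetical rule that rewrites a single residue whose two neighbors agree with each other but not with it.

```python
def smooth_isolated_states(pred: str) -> str:
    """Replace a lone state flanked by two identical neighbors with that neighbor."""
    states = list(pred)
    for i in range(1, len(states) - 1):
        # e.g. h e h -> h h h
        if states[i - 1] == states[i + 1] != states[i]:
            states[i] = states[i - 1]
    return "".join(states)


print(smooth_isolated_states("hhhhehhhh"))  # hhhhhhhhh
```

A single strand residue in the middle of a helix is physically implausible, which is why a filter of some kind improves the raw per-residue output.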
Neural networks
• Computer neural networks are based on simulation of adaptive
learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions, which are either
stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right
combinations of stimulatory and inhibitory synapses so that a set
of inputs produces an appropriate output.
Neural Networks (cont. 2)
90% of the proteins of known structure were used as the training set.
The remaining 10% were used to evaluate the performance of the neural
network after the training session.
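A minimal Python sketch of such a 90/10 split. The function name, the fixed random seed, and the placeholder protein identifiers are all hypothetical; the slide does not say how the held-out proteins were chosen.

```python
import random


def split_training_set(protein_ids, train_fraction=0.9, seed=0):
    """Shuffle the known structures and hold out the remainder for evaluation."""
    shuffled = list(protein_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for proteins of known structure.
ids = [f"protein_{i:02d}" for i in range(20)]
train, held_out = split_training_set(ids)
print(len(train), len(held_out))  # 18 2
```

The held-out proteins must never be seen during training, otherwise the measured accuracy is too optimistic.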
Neural Networks (cont. 3)
• During the training phase, selected sets of proteins of known
structure were scanned, and if the decisions were incorrect, the
input weightings were adjusted by the software to produce the
desired result.
• Training runs were repeated until the success rate was
maximized.
• Careful selection of the training set is an important aspect of
this technique. The set must contain as wide a range of
different fold types as possible without duplications of
structural types that may bias the decisions.
Neural Networks (cont. 4)
• An additional component of the PSIPRED procedure involves
sequence alignment with similar proteins.
• The rationale is that some amino acid positions in a sequence
contribute more to the final structure than others. (This has been
demonstrated by systematic mutation experiments in which each
consecutive position in a sequence is substituted by a spectrum of amino
acids. Some positions are remarkably tolerant of substitution, while
others have unique requirements.)
• To predict secondary structure accurately, one should place less weight
on the tolerant positions, which clearly contribute little to the structure.
• One must also put more weight on the intolerant positions.
[Figure: PSIPRED neural network architecture (Jones, 1999)]
Input: each row specifies an aa position; 15 groups of 21 units
(1 unit for each aa plus one specifying the end). Provides info
on tolerant or intolerant positions.
Filtering network: the three outputs are helix, strand or coil.
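A Python sketch of the input layout described in the figure: a window of 15 positions, each with 21 units, giving 315 inputs. As a simplifying assumption it fills the 20 amino acid units with a one-hot code, whereas PSIPRED feeds in scores from the PSI-BLAST position specific scoring matrix. The names and the alphabet ordering are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is arbitrary)
WINDOW = 15
UNITS = len(AMINO_ACIDS) + 1          # 20 aa units + 1 "end of chain" unit = 21


def encode_window(sequence: str, center: int) -> list:
    """Return the 15 x 21 = 315 inputs for the window centered on `center`."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        units = [0.0] * UNITS
        if 0 <= pos < len(sequence):
            units[AMINO_ACIDS.index(sequence[pos])] = 1.0  # one-hot stand-in
        else:
            units[-1] = 1.0  # window hangs past the N- or C-terminus
        inputs.extend(units)
    return inputs


vector = encode_window("KDIQLLNVSYDPTRELYEQY", center=0)
print(len(vector))  # 315
```

The extra unit lets the network know when part of the window lies beyond the end of the chain.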
Example of Output from
PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
Conf: 923788850068899998538983213555268822788714786424388875156215
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD
              10        20        30        40        50        60
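A short Python sketch that turns the Pred line of the example output into a list of segments with 1-based start and end positions. The function name is hypothetical; the prediction string is copied from the example above.

```python
from itertools import groupby

PRED = "CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC"


def segments(pred: str) -> list:
    """Collapse runs of identical states into (state, start, end) tuples (1-based)."""
    result = []
    position = 1
    for state, run in groupby(pred):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


for state, start, end in segments(PRED):
    print(state, start, end)
```

Listing segments this way makes it easier to compare the prediction with a known structure than reading the raw string.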
How to calculate Q3?
Sequence:           MEETHAPYRGVCNNM
Actual Structure:   CCCCCHHHHHHEEEE
PSIPRED Prediction: CCCCCHHHHHHEEEH
Q3 = 14/15 x 100 = 93%
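The Q3 calculation as a Python sketch (the function name is hypothetical): count the positions where the predicted state equals the actual state and express that as a percentage of all residues.

```python
def q3(actual: str, predicted: str) -> float:
    """Percentage of residues whose three-state prediction is correct."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("structures must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


print(round(q3("CCCCCHHHHHHEEEE", "CCCCCHHHHHHEEEH")))  # 93
```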
3D structure data
The largest 3D structure database is the
Protein Data Bank (PDB)
It contains over 30,000 records
• Each record contains 3D coordinates for
macromolecules
• 80% of the records were obtained from X-ray
diffraction studies, 20% from NMR.

Part of a record from the PDB
Columns: record type, atom number, atom name, residue, chain, residue number, x, y, z, occupancy, temperature factor, element
ATOM      1  N   ARG A  14      22.451  98.825  31.990  1.00 88.84           N
ATOM      2  CA  ARG A  14      21.713 100.102  31.828  1.00 90.39           C
ATOM      3  C   ARG A  14      22.583 101.018  30.979  1.00 89.86           C
ATOM      4  O   ARG A  14      22.105 101.989  30.391  1.00 89.82           O
ATOM      5  CB  ARG A  14      21.424 100.704  33.208  1.00 93.23           C
ATOM      6  CG  ARG A  14      20.465 101.880  33.215  1.00 95.72           C
ATOM      7  CD  ARG A  14      20.008 102.147  34.637  1.00 98.10           C
ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30           N
ATOM      9  CZ  ARG A  14      18.344 103.507  35.833  1.00100.29           C
ATOM     10  NH1 ARG A  14      18.580 102.835  36.952  1.00 99.51           N
ATOM     11  NH2 ARG A  14      17.441 104.479  35.827  1.00100.79           N
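PDB ATOM records are fixed-width, so they should be read by column position rather than by splitting on whitespace (note how occupancy 1.00 and temperature factor 100.30 run together). Below is a Python sketch that assumes the standard PDB column layout; the class and function names are hypothetical, and the example line is atom 8 (NE) from the record above, built from its fixed-width pieces.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM record by column position (0-based Python slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        temp_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


LINE = "".join([
    "ATOM  ", "    8", " ", " NE ", " ", "ARG", " ", "A", "  14", "    ",
    "  18.999", " 103.196", "  34.718",  # x, y, z (8 columns each)
    "  1.00", "100.30",                  # occupancy, temperature factor
    " " * 10, " N",                      # unused columns, element symbol
])

print(parse_atom_line(LINE))
```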
Show how to do the homework
assignment from Chapter 7
http://www.rcsb.org/pdb/home/home.do
Quiz #3 prep
BLAST
• Three steps
• Gapped BLAST
• Heuristic program
CLUSTAL W
• Pairwise alignments
• Difference matrix
• Guide tree
• Importance of having highly similar sequences
Secondary Structure prediction
• Chou-Fasman
• PSIPRED
Good for secondary str