Biology and computers

Scoring Matrices
April 23, 2009
Learning objectives
1) Last word on Global Alignment
2) Understand how the Smith-Waterman algorithm can be
applied to perform local alignment.
3) Have a general understanding about PAM and BLOSUM
scoring matrices.
Homework 3 and 4 due today
Quiz 1 today
Writing topic due today
Homework 5 due Thursday, April 30.
Global Alignment
output file
Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50
HBA_HUMAN   1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
              |:| :|: | | |||| : | | ||| |: : :| |: :|
HBB_HUMAN   1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43

HBA_HUMAN  45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
              | ||| |: :|| ||||| | :: :||:|:: : |
HBB_HUMAN  44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88

HBA_HUMAN  84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
              |:|| || ||| ||:|| : |: || | |||| | |: |
HBB_HUMAN  89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

HBA_HUMAN 129 LASVSTVLTSKYR 141
              :| |: | ||
HBB_HUMAN 134 VAGVANALAHKYH 146
%id = 45.32
%similarity = 63.31 (88/139 *100)
Overall %id = 43.15; Overall %similarity = 60.27 (88/146 *100)
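The percentages in the output above are simple ratios, and the arithmetic can be checked with a short Python sketch. The counts 88, 139 and 146 come from the output file. The identity count of 63 is an assumption: it is not printed in the output and is inferred here because it reproduces both reported identity values. The helper name percent is hypothetical.

```python
def percent(count, length):
    """Return count/length as a percentage rounded to two decimals."""
    return round(100.0 * count / length, 2)

similar = 88    # similar positions, from the output file
aligned = 139   # denominator used for %id and %similarity
overall = 146   # denominator used for the "Overall" values
identical = 63  # ASSUMPTION: inferred, not printed in the output file

pct_id = percent(identical, aligned)
pct_similarity = percent(similar, aligned)
overall_id = percent(identical, overall)
overall_similarity = percent(similar, overall)

print(pct_id, pct_similarity, overall_id, overall_similarity)
```

The two sets of values differ only in the denominator: 139 for the first pair and 146 for the "Overall" pair.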
Smith-Waterman Algorithm
Advances in Applied Mathematics, 2:482-489 (1981)
Smith-Waterman algorithm: can be used for local alignment
-Memory intensive
-Common searching programs such as BLAST use a faster heuristic approximation of the SW algorithm
Smith-Waterman (cont. 1)
a. Initializes edges of the matrix with zeros
b. It searches for sequence matches.
c. Assigns a score to each pair of amino acids
-uses similarity scores
-uses positive scores for related residues
-uses negative scores for dissimilar substitutions and gaps
d. Scores are summed for placement into M(i,j). If any
sum result is below 0, a 0 is placed into M(i,j).
e. Backtracing begins at the maximum value found
anywhere in the matrix.
f. Backtrace continues until it meets an M(i,j) value of 0.
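Steps a through f can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind these slides: the names smith_waterman and simple_score are hypothetical, and simple_score (+5 for a match, -4 for a mismatch) is an assumed stand-in for a real substitution matrix such as BLOSUM45. Only the gap penalty of -8 is taken from the lecture example. The sketch keeps the whole matrix in memory, which is why the method is memory intensive.

```python
def simple_score(x, y):
    # ASSUMPTION: toy scoring used in place of a real BLOSUM/PAM matrix.
    return 5 if x == y else -4

def smith_waterman(a, b, score=simple_score, gap=-8):
    """Local alignment of a (columns) and b (rows) with a linear gap penalty."""
    rows, cols = len(b) + 1, len(a) + 1
    # Step a: the top row and left column stay at zero.
    M = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = M[i - 1][j - 1] + score(b[i - 1], a[j - 1])
            up = M[i - 1][j] + gap
            left = M[i][j - 1] + gap
            # Step d: any sum below zero is replaced by zero.
            M[i][j] = max(0, diag, up, left)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Steps e and f: backtrace from the maximum until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(b[i - 1], a[j - 1]):
            top.append(a[j - 1]); bottom.append(b[i - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i][j - 1] + gap:
            top.append(a[j - 1]); bottom.append("-")
            j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(smith_waterman("AWGHE", "AWHE"))
```

Because the toy scores differ from BLOSUM45, the numeric score from this sketch will not match the lecture example, but the same gapped alignment shape (AWGHE over AW-HE) appears for the short input shown.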
Smith-Waterman (cont. 2)
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   3   0  20  12   4   0   0
H    0  10   2   0   0   1  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  19  12   4   0   4  16  26

Put zeros on top row and left column.
Assign initial scores based on a scoring matrix.
Calculate new scores based on adjacent cell scores.
If sum is less than zero or equal to zero, begin new scoring with next cell.
This example uses the BLOSUM45 Scoring Matrix with a gap
penalty of -8.
Smith-Waterman (cont. 3)
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   3   0  20  12   4   0   0
H    0  10   2   0   0   1  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  19  12   4   0   4  16  26

AWGHE
|| ||
AW-HE
Score=28

Begin backtrace at the maximum value found anywhere on the matrix.
Continue the backtrace until score falls to zero.
Calculation of similarity score and
percent similarity

A    W    G    H    E
A    W    -    H    E
5   15   -8   10    6

5, 15, 10, 6 = BLOSUM45 scores
-8 = gap penalty (novel)

Similarity score = 5 + 15 - 8 + 10 + 6 = 28

% similarity = number of positive scores
divided by number of AAs in region x 100
% similarity = 4/5 x 100 = 80%
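The calculation above can be reproduced with a few lines of Python. This is only a sketch: the per-column scores are the ones listed on the slide for AWGHE aligned to AW-HE, and the function name summarize_alignment is hypothetical.

```python
def summarize_alignment(column_scores):
    """Return (similarity score, percent similarity) for a list of column scores."""
    similarity_score = sum(column_scores)
    # A column counts as "similar" when its score is positive.
    positives = sum(1 for s in column_scores if s > 0)
    percent_similarity = 100.0 * positives / len(column_scores)
    return similarity_score, percent_similarity

# Column scores from the slide: A/A, W/W, gap, H/H, E/E
slide_scores = [5, 15, -8, 10, 6]
score, pct = summarize_alignment(slide_scores)
print(score, pct)
```

The gap column lowers the similarity score and also counts in the denominator of the percent similarity, which is why the result is 4/5 rather than 4/4.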
Why search sequence databases?
1. I have just sequenced something. What is
known about the thing I sequenced?
2. I have a unique sequence. Does it have
similarity to another gene of known
function?
3. I found a new protein sequence in a lower
organism. Is it similar to a protein from
another species?
Perfect searches for similar
sequences in a database
First “hit” should be an exact match.
Next “hits” should contain all of the
genes that are related to your gene
(homologs).
Next “hits” should be similar but are
not homologs
How does one achieve the
“perfect search”?
Consider the following:
Scoring Matrices (PAM vs. BLOSUM)
Local alignment algorithm
Database
Search Parameters
-Expect Value: change threshold for score reporting
-Translation: of DNA sequence into protein
-Filtering: remove repeat sequences

Which Scoring Matrix to use?
PAM-1, BLOSUM-100: small evolutionary distance;
high identity within short sequences
PAM-250, BLOSUM-20: large evolutionary distance;
low identity within long sequences
BLOSUM Scoring Matrices
Which BLOSUM Matrix to use?
BLOSUM    Identity (up to)
80        80%
62        62% (usually default value)
35        35%
If you are comparing sequences that are very similar, use
BLOSUM 80. Sequences that are more divergent (dissimilar)
than 20% are given very low scores in this matrix.
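The table above can be read as a simple lookup, sketched below in Python. The function name choose_blosum is hypothetical, and treating the "up to" column as hard cut-offs is an assumption made for illustration. In practice the choice is a judgment call and BLOSUM62 is the usual default.

```python
def choose_blosum(expected_identity_percent):
    """Pick a BLOSUM matrix name from the expected percent identity.

    ASSUMPTION: the "Identity (up to)" column is used as a strict threshold.
    """
    for limit in (35, 62):
        if expected_identity_percent <= limit:
            return f"BLOSUM{limit}"
    # Very similar sequences fall through to BLOSUM80.
    return "BLOSUM80"

for identity in (30, 50, 75, 95):
    print(identity, choose_blosum(identity))
```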