Transcript ppt

Sequence similarity (II)
Schedule
• Mar 23 midterm assigned
Mar 30 midterm due
April 6 teams assigned
April 13
April 20
April 27
May 4
alignment
prot struct/drugs
prot struct/drugs
RNA structure
RNA structure
team rpts
team rpts
• May 11
final (in class)
General gap penalties
• Alignments can no longer be scored as the
sum of their parts
• They still are the sum of blocks with one
matched letter or one gap each
• Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
Smith Waterman – local alignment
A
C
A
D
A
B
R
A
A
B
R
A
B
R
A
A
A
C
C
A
D
A
A
B
R
A
A
A
D
A
A
A
A
B
A
B
R
A
A
R
A
A
A
A
A
A
B
B
R
A
A
A
R
A
A
A
A
DP for general gaps
• Requires three arrays, one for each block type
• Time complexity is cubic
• This is expensive at best, prohibitive for large
problems
Affine gap penalty
• Charge h for each gap, plus g * (len(gap))
• This still has quadratic complexity!
Point accepted mutations
• Some mutations are more likely than others
• In proteins, some amino acids are more
similar than others (size, charge,
hydrophobicity)
• A point accepted mutation matrix is a table
with probability of each transition in fixed
time
PAM matrices
• The entire matrix sums to 1
• A ‘unit of evolution’ is time in which 1/100
amino acids is expected to change
Scoring matrix
•
•
•
•
Consider aligned letters a,b
Pr(b is a mutation of a) = Mab
Pr(b is a random occurrence) = pb
Score(a,b) = 10log(Mab / pb)
Blast
• Basic Local Alignment Search Tool
• Def: ‘segment’ is a subsequence (without
gaps)
• Def: ‘segment pair’ is two segments of
equal length
• Rem: the score of a segment pair is the sum
of its aligned letters
What Blast does
• Input:
– a PAM matrix
– a database of sequences B
– a query sequence A
– a threshhold S
• Output:
– all segment pairs(A,B) with score > S
How Blast works
• Compile short, high-scoring strings (words)
• Search for hits -- each hit gives a seed
• Extend seeds
Z-scores
• Given an alignment of A, B, how significant
is it?
• Permute A many times
Align each permutation with B
Collect the scores
Z-score = score – mean / standard deviation
Blast on proteins
• Words are w-mers which score at least T
against A
• Use hashing or dfa to search for hits
• Extend seed until heuristically determined
limit is reached
Blast on nucleic acids
• Words are w-mers in query A
• Letters compressed, four to byte
• Filter database B for very common words to
avoid false positives
• Extend seeds as in proteins
What does Blast give you?
• Efficiency
• A rigorous statistical theory which gives the
probability of a segment pair occurring by
chance