Pairwise Sequence Analysis-V
• Relatedness
– “Not just important, but everything”
• Modeling Alignment Scores
– Coin Tosses
– Unit Distributions
– Extreme Value Distribution
– Lambda and K revealed
– Loose Ends
Lecture 6 CS566
Modeling Expectation
• Reduced model: Coin tosses
– Given:
• N coin tosses
• Probability of heads p
– Problem:
• What is the expected length of the longest run of heads?
– Solution:
• Experimental: Perform several repetitions and count
• Theoretical: E(Runmax) = log_{1/p}(N)
– For example, for a fair coin and 64 tosses, E(Runmax) = log_2(64) = 6
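The experimental and theoretical estimates above can be compared with a short simulation (a sketch; the trial count and seed are arbitrary choices):

```python
import math
import random

def longest_run(tosses):
    """Length of the longest run of heads (True) in a sequence of tosses."""
    best = cur = 0
    for heads in tosses:
        cur = cur + 1 if heads else 0
        best = max(best, cur)
    return best

def mean_longest_run(n, p, trials=2000, seed=0):
    """Experimental estimate of E(Runmax) over repeated trials."""
    rng = random.Random(seed)
    total = sum(longest_run([rng.random() < p for _ in range(n)])
                for _ in range(trials))
    return total / trials

n, p = 64, 0.5
print(math.log(n, 1 / p))      # theoretical E(Runmax): log_2(64) = 6
print(mean_longest_run(n, p))  # experimental estimate (the formula is an approximation)
```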
Random alignment as Coin tosses
• Head = Match
• Assume
– Score = Run of matches
– Maximum score = Longest run of matches
• Therefore
– Same model of expectation
– For example: For DNA sequences of length N,
• E(matchlengthmax) = Expected longest run of matches = log_{1/p}(N)
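The coin-toss model carries over directly: compare two random DNA sequences position by position, where a "head" is a matching base (a sketch; sequence length and seed are arbitrary):

```python
import math
import random

def longest_match_run(s, t):
    """Longest run of positions at which two equal-length sequences agree."""
    best = cur = 0
    for a, b in zip(s, t):
        cur = cur + 1 if a == b else 0
        best = max(best, cur)
    return best

rng = random.Random(1)
n = 4096
s = "".join(rng.choice("ACGT") for _ in range(n))
t = "".join(rng.choice("ACGT") for _ in range(n))
p = 0.25                        # chance that two random bases match
print(math.log(n, 1 / p))       # E(matchlengthmax): log_4(4096) = 8? no: 6
print(longest_match_run(s, t))  # one random trial, typically near the expectation
```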
Local alignment as Coin tosses
• Assume
– Score in local alignment = Run of matches
– Maximum score = Longest run of matches
• Therefore
– Similar model of expectation
– For DNA sequences of length n & m
• E(Matchlengthmax) ~ log_{1/p}(nm) (Why not just n or m? A local match can start at any of the ~nm pairs of positions, one in each sequence)
~ log_{1/p}(K’nm)
• Var(Matchlengthmax) = C (i.e., independent of the size of the search space)
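The nm search space can be seen in a small simulation: the longest exact common substring between two random DNA sequences tracks log_{1/p}(nm), not log_{1/p}(n). A sketch (lengths and seed are arbitrary; the dynamic program is a simple O(nm) scan):

```python
import math
import random

def longest_common_substring(s, t):
    """Length of the longest exact common substring (an ungapped local match)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0] * (len(t) + 1)
        for j, b in enumerate(t, 1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

rng = random.Random(2)
n = m = 256
s = "".join(rng.choice("ACGT") for _ in range(n))
t = "".join(rng.choice("ACGT") for _ in range(m))
print(math.log(n * m, 4))              # log_4(nm) = log_4(65536) = 8
print(longest_common_substring(s, t))  # typically near 8, not log_4(256) = 4
```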
Refining Model
• S = AS (amino acid substitution) matrix-based score between unrelated sequences
• E(S) ~ log_{1/p}(K’nm) ~ [ln(Knm)]/λ
(where λ = log_e(1/p))
• Holy Grail: Need P(S > x), probability of a
score between unrelated sequences
exceeding x
Poisson distribution estimate of P(S > x)
• Consider Coin Toss Example
Given [x >> E(Runmax)]
Define Success = (Runmax ≥ x)
Define Pn = probability of n successes
Define y = E[Successes], i.e., average number of successes
Then the number of successes follows a Poisson distribution:
Pn = (e^-y · y^n)/n!
The probability of 0 successes (no score exceeding x) is given by P0 = e^-y.
Then the probability of at least one score exceeding x is
P(S > x) = Σ_{i≥1} Pi = 1 – P0 = 1 – e^-y
For this Poisson distribution, y = Kmn·e^-λx. Therefore,
P(S > x) = 1 – exp(–Kmn·e^-λx)
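This final formula is straightforward to evaluate. In the sketch below, the parameter values are illustrative assumptions, not fitted constants: λ = ln(1/p) for ungapped DNA matching with p = 1/4, and an arbitrary K:

```python
import math

def p_value(x, lam, K, m, n):
    """Poisson approximation: P(S > x) = 1 - exp(-K*m*n*e^(-lam*x))."""
    y = K * m * n * math.exp(-lam * x)   # y = expected number of successes
    return 1.0 - math.exp(-y)

# Illustrative (assumed) parameters: lambda = ln(1/p) with p = 1/4.
lam, K, m, n = math.log(4), 0.1, 1000, 1000
for x in (5, 10, 15):
    print(x, p_value(x, lam, K, m, n))   # probability falls sharply with x
```

Note that for small y, P(S > x) ≈ y, i.e., the P-value approaches the expected number of high-scoring matches.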
Unit Distributions
• Normalize Gaussian and EVD
– Area under curve = 1
– Curve maximum at 0
• Then
– For Gaussian
• Mean = 0; SD = 1
• P(S > x) = (1/2)·erfc(x/√2) (no simpler closed form)
– For EVD
• Mean = 0.577 (Euler’s constant); Variance = π²/6 = 1.645
• P(S > x) = 1 – exp(–e^–(x–u)), where u is the mode; for the unit EVD (u = 0), P(S > x) = 1 – exp(–e^–x)
– Z-score representation in terms of SDs
• P(Z > z) = 1 – exp(–e^(–1.2825z – 0.5772))
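The Z-score form can be evaluated directly; a minimal sketch:

```python
import math

def p_at_least_z(z):
    """P(Z > z) for an EVD score expressed in standard deviations from the
    mean: 1 - exp(-e^(-1.2825*z - 0.5772))."""
    return 1.0 - math.exp(-math.exp(-1.2825 * z - 0.5772))

# The EVD is right-skewed, so P(Z > 0) is below 0.5, unlike a Gaussian.
print(round(p_at_least_z(0.0), 2))   # → 0.43
print(p_at_least_z(3.0))             # roughly 0.012: z = 3 is far less
                                     # significant than for a Gaussian
```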
Lambda and K
• λ = Scale factor for the scoring system
– Effectively converts AS matrix values to actual natural-log likelihoods
• K = Scale factor that reduces the search space to compensate for the non-independence of local alignments
• Estimated by fitting to the Poisson approximation or to the equation for E(S)
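For an ungapped scoring system, λ can also be obtained analytically as the unique positive solution of Σ_ij p_i·p_j·e^(λ·s_ij) = 1 (the Karlin–Altschul condition). A sketch using bisection, with an assumed +1/−2 DNA scoring scheme and uniform base frequencies:

```python
import math

def solve_lambda(scores, probs, lo=1e-6, hi=10.0, tol=1e-9):
    """Solve sum_ij p_i p_j exp(lambda * s_ij) = 1 for lambda > 0 by bisection.
    scores[i][j]: substitution score; probs[i]: background letter frequency.
    Requires a negative expected score and at least one positive score, so
    that f is negative just above 0 and positive at hi."""
    def f(lam):
        return sum(pi * pj * math.exp(lam * s)
                   for pi, row in zip(probs, scores)
                   for pj, s in zip(probs, row)) - 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Assumed example: +1 match / -2 mismatch, uniform base frequencies.
match, mismatch = 1, -2
scores = [[match if i == j else mismatch for j in range(4)] for i in range(4)]
lam = solve_lambda(scores, [0.25] * 4)
print(round(lam, 3))   # → 1.333
```

For this scoring the condition reduces to 0.25·e^λ + 0.75·e^(−2λ) = 1, whose positive root is λ = ln((3 + √21)/2) ≈ 1.333.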
Treasure Trove of Probabilities
• Probability distribution of scores between
unrelated sequences P(Sunrel)
• Probability distribution of the number of scores from P(Sunrel) exceeding some cut-off; its mean is the number of scores exceeding the cut-off expected on average
• Probability of observing a score x between unrelated sequences, P(S ≥ x)
Loose Ends
• What about gap parameters?
– Short answer: No formal theory
– Long answer:
• Found empirically
• The choice of gap parameters can be used to convert a local alignment algorithm into a global one
• What about gapped alignment?
– Not formally proven, but simulations show statistical
behavior similar to ungapped alignment
• Effective sequence length n’ = n – E(matchLengthmax)