Probabilities and Probabilistic Models

Download Report

Transcript Probabilities and Probabilistic Models

Probabilities and Probabilistic
Models
Probabilistic models
• A model means a system that simulates an
object under consideration.
• A probabilistic model is a model that produces
different outcomes with different probabilities – it
can simulate a whole class of objects, assigning
each an associated probability.
• In bioinformatics, the objects usually are DNA or
protein sequences and a model might describe a
family of related sequences.
1.
Examples
The roll of a six-sided die – six parameters p , p , …, p ,
1
2
where pi is the probability of rolling the number i. For
probabilities, pi > 0 and pi  1
6
i
2. Three rolls of a die: the model might be that the rolls
are independent, so that the probability of a sequence
such as [2, 4, 6] would be p2 p4 p6.
3. An extremely simple model of any DNA or protein
sequence is a string over a 4 (nucleotide) or 20 (amino
acid) letter alphabet. Let qa denote the probability, that
residue a occurs at a given position, at random,
independent of all other residues in the sequence.
n
Then, for a given length n, the probability of the
P ( x1 ,, xn )   q xi
sequence x1,x2,…,xn is
i 1
Conditional, joint, and marginal
probabilities
• two dice D1 and D2. For j = 1,2, assume that the
probabi-lity of using die Dj is P(Dj ), and for iPD=j (i ) .
1,2,, 6, assume that the probability of rolling an
i with dice Dj is
P (i | D j )  PD j (i ) .
• In this simple two dice model, the conditional
probability of rolling an i with dice Dj is:
P( i , D j )  P( D j ) P( i | D j ) .
• The joint probability of picking die Dj and rolling
an i is:
2
2
P (i )   P (i , D j )   P ( D j ) P (i | D j )
j 1 i – marginal
j 1
• The probability of rolling
probability
Maximum likelihood estimation
• Probabilistic models have parameters that are usually
estimated from large sets of trusted examples, called
a training set.
• For example, the probability qa for seeing amino acid
a in a protein sequence can be estimated as the
observed frequency fa of a in a database of known
protein sequences, such as SWISS-PROT.
• This way of estimating models is called Maximum
likelihood estimation, because it can be shown that
using the observed frequencies maximizes the total
probability of the training set, given the model.
• In general, given a model with parameters and a set
of data D, the maximum likelihood estimate (MLE) for
 is the value which maximizes P(D | ).
Model comparison problem
• An occasionally dishonest casino uses two kinds
of dice, of which 99% are fair, but 1% are
loaded, so that a 6 appears 50% of the time.
• We pick up a dice and roll [6, 6, 6]. This looks like
a loaded die, is it? This is an example of a model
comparison problem.
• I.e., our hypothesis Dloaded is that the die is
loaded. The other alternative is Dfair. Which
model fits the observed data better? We want to
calculate:
P(Dloaded | [6, 6, 6])
Prior and posterior probability
• P(Dloaded | [6, 6, 6]) is the posterior probability that
the dice is loaded, given the observed data.
• Note that the prior probability of this hypothesis
is 1/100 – prior because it is our best guess
about the dice before having seen any
information about the it.
1 1 1
1
P ([6,6,6] | Dloaded )     
• The likelihood of the hypothesis Dloaded
2 :2 2
8
P (Y | X ) P ( X )
P ( X | Y )  – using Bayes’ theorem
• Posterior probability
P (Y )
Comparing models using Bayes’
theorem
• We set X = Dloaded and Y = [6, 6, 6], thus obtaining
P ([6,6,6] | Dloaded ) P ( Dloaded )
P ( Dloaded | [6,6,6]) 
P ([6,6,6])
• The probability P(Dloaded) of picking a loaded die is 0.01.
• The probability P([6, 6, 6] | Dloaded) of rolling three sixes
using a loaded die is 0.53 = 0.125.
• The total probability P([6, 6, 6]) of three sixes is
P([6, 6, 6] | Dloaded) P(Dloaded) + P([6, 6, 6] |
Dfair)P(Dfair).
(0.5 3 )( 0.01)
 0.21
• Now, P ( Dloaded | [6,6,6]) 
3
1 3
(0.5 )( 0.01)  ( 6 ) (0.99 )
• Thus, the die is probably fair.
Biological example
• Lets assume that extracellular (ext) proteins have
a slightly different composition than intercellular
(int) ones. We want to use this to judge whether
a new protein sequence x1,…, xn is ext or int.
• To obtain training data, classify
allintproteins in
ext
fa
SWISS-PROT into ext, intfaand unclassifiable
ones.
• Determine the frequencies
and
of
each amino acid a in ext and int proteins,
respectively.
• To be able to apply Bayes’ theorem, we need to
determine the priors pint and pext, i.e. the
probability that a new (unexamined) sequence is
extracellular or intercellular, respectively.
Biological example - cont.
n
P ( x | ext)  
• We have:
i 1
n
q ext
xi
and
P ( x | int )   q ixnti
i 1
• If we assume that any sequence is either
extracellular or intercellular, then we have
P(x) = pext P(x | ext) + pintP(x | int).
ext
ext
p
q
P (ext) P (we
x | ext
)
 i xi
• ByPBayes’
obtain
(ext | x )theorem,


P( x)
int
int
pext i q ext
xi  p  i q xi
the posterior probability that a sequence is
extracellular.
• (In reality, many transmembrane proteins have both
Probability vs. likelihood
pravdepodobnosť vs. vierohodnosť
• If we consider P( X | Y ) as a function of X, then
this is called a probability.
• If we consider P( X | Y ) as a function of Y , then
this is called a likelihood.
Sequence comparison by compression
Motivation
• similarity as a marker for homology. And
homology is used to infer function.
• Sometimes, we are only interested in a numerical
distance between two sequences. For example,
to infer a phylogeny.
Figure adapted from http://www.inf.ethz.ch/personal/gonnet/acat2000/side2.html
Text- vs DNA compression
• compress, gzip or zip – routinely used to
compress text files. They can be applied also to a
text file containing DNA.
• E.g., a text file F containing chromosome 19 of
human in fasta format – |F|= 61 MB, but
|compress(F)|= 8.5 MB.
• 8 bits are used for each character. However, DNA
consists of only 4 different bases – 2 bits per
base are enough:
A = 00, C = 01, G = 10,
and T = 11.
• Applying a standard compression algorithm to a
file containing DNA encoded using two bits per
base will usually not be able to compress the file
The repetitive nature of DNA
• Take advantage of the repetitive nature of DNA!!
• LINEs (Long Interspersed Nuclear Elements),
SINEs.
•
UCSC Genome Browser: http://genome.ucsc.edu
DNA compression
• DNA sequences are very compressible, especially for
higher eukaryotes: they contain many repeats of different
size, with different numbers of instances and different
amounts of identity.
• A first idea: While processing the DNA string from left to
right, detect exact repeats and/or palindromes (reversecomplemented repeats) that possess previous instances in
the already processed text and encode them by the length
and position of an earlier occurrence. For stretches of
sequence for which no significant repeat is found, use
two-bit encoding. (The program Biocompress is based on
this idea.)
– Data structure for fast access to sequence patterns already
encountered.
– Sliding window along unprocessed sequence.
DNA compression
• A second idea: Build a suffix tree for the whole
sequence and use it to detect maximal repeats of
some fixed minimum size. Then code all repeats
as above and use two-bit encoding for bases not
contained inside repeats. (The program Cfact is
based on this idea.)
• Both of these algorithms are lossless, meaning
that the original sequences can be precisely
reconstructed from their encodings. An number
of lossy algorithms exist, which we will not
discuss here.
• In the following we will discuss the GenCompress
algorithm due to Xin Chen, Sam Kwong and Ming
Edit operations
•
•
The main idea of GenCompress is to use inexact
matching, followed by edit operations. In other
words, instances of inexact repeats are encoded
by a reference to an earlier instance of the
repeat, followed by some edit operations that
modify the earlier instance to obtain the current
instance.
Three standard edit operations:
1. Replace: (R, i, char) – replace the character at position
i by character char.
2. Insert: (I, i, char) – insert character char at position i.
3. Delete: (D, i) – delete the character at position i.
Edit operations
• different edit operation sequences:
– (a) C C C C R C C C C C
or
(b) C C C C D C I C C C C
gaccgtcatt
gaccttcatt
gaccgt catt
gacc ttcatt
• infinite number of ways to convert one string into another.
• Given two strings q and p. An edit transcript (q, p) is a list
of edit operations that transforms q into p.
• E.g., in case (a) the edit transcript is:
(gaccgtcatt, gaccttcatt) = (R, 4, t),
• whereas in case (b) it is:
(gaccgtcatt, gaccttcatt) = (D, 4), (I, 4, g).
• (positions start at 0 and are given relative to current state
of the
• string, as obtained by application of any preceding edit
Encoding DNA
1.
2.
Using the two-bit encoding method, gaccttcatt can
be encoded in 20 bits:
10 00 01 01 11 11 01 00
11 11
The following three encode gaccttcatt relative to
gaccgtcatt:
In the exact matching method we use a pair of
numbers (repeat − position, repeat − length) to
represent an exact repeat. We can encode
gaccttcatt as (0, 4), t, (5, 5), relative to gaccgtcatt.
Let 4 bits encode an integer, 2 bits encode a base and
one bit to indicate whether the next part is a pair
(indicating a repeat) or a base. We obtain an encoding
in 21 bits:
0 0000
Encoding DNA
3. In the approximate matching method we can
encode gaccttcatt as (0, 10), (R, 4, t) relative to
gaccgtcatt. Let use encode R by 00, I by 01, D
by 11 and use a single bit to indicate whether
the next part is a pair or a triplet. We obtain an
encoding in 18 bits:
0
0000 1010 1 00 0100 11
4. For approximate matching, we could also use
the edit sequence (D, 4), (I, 4, t), for example,
yielding the relative encoding (0, 10), (D, 4), (I, 4,
t), which uses 25 bits:
0 0000 1010 1 11
0100 1 01 0100 11.
GenCompress:
• a one-pass algorithm based on approximate matching
• For a given input string w, assume that a part v has
already been compressed and the remaining part is u,
with w = vu. The algorithm finds an “optimal prefix” p of u
that approximately matches some substring of v such that
p can be encoded economically. After outputting the
encoding of p, remove
w the prefix p from u and append it
to v. If no optimal prefix is found, output the next base
v
u
and then move it from u to v. Repeat until u = .
p’
p
The condition C
• How do we find an “optimal prefix” p? The
following condition will be used to limit the
search.
• Given two numbers k and b. Let p be a prefix of
the unprocessed part u and q a substring of the
processed part v. If |q| > k, then any transcript (q,
p) is said to satisfy the condition C = (k, b) for
compression, if its number of edit operations is 
b.
• Experimental studies indicate that C = (k, b) = (12,
3) gives good results.
• In other words, when attempting to determine the
optimal prefix for compression, we will only
The compression gain function
• We define a compression gain function G to
determine whether a particular approximate repeat q,
p and edit transcript are beneficial for the encoding:
• G(q, p, ) = max {2|p| − |(i, |q|)| − w · |(q, p)| − c, 0}.
• where
– p is a prefix of the unprocessed part u,
– q is a substring of the processed part v of length |q|
that starts at position i,
– 2|p| is the number of bits that the two-bit encoding
would use,
– |(i, |q|)| is the encoding size of (i, |q|),
– w is the cost of encoding an edit operation,
– |(q, p)| is the number of edit operations in (q, p),
The GenCompress algorithm
Input: A DNA sequence w, parameter C = (k, b)
Output: A lossless compressed encoding of w
Initialization: u = w and v = 
while u   do
Search for an optimal prefix p of u
if an optimal prefix p with repeat q in v is found then
Encode the repeat representation (i, |q|), where i is
the position of q in v, together with the shortest edit
transcript (q, p). Output the code.
else Set p equal to the first character of u,
encode and output it.
Remove the prefix p from u and append it to v
end