MCMC - Computer Science Department, Technion

Download Report

Transcript MCMC - Computer Science Department, Technion

Markov Chain Monte Carlo
Hadas Barkay
Anat Hashavit
Motivation

Models built to represent diverse types of genetic
data tend to have a complex structure. Computation
on these structures are often intractable.


We see this, for instance, in pedigree analysis.
Stochastic simulation techniques are good ways to
obtain approximation.

Markov Chain Monte Carlo methods have proven to be
quite useful for such simulations.
Markov chain


Markov chains are random processes X1,X2,…,having a
discrete 1-dimensional index set (1,2,3…)
domain D of m states {s1,…,sm} from which it takes
it’s values

Satisfying the markov property:
 The distribution of xt+1 given all the previous states
xt xt-1 … depends only on the immediately preceding
state xt
P[X t 1  x t 1 | X t  x t , X t -1  x t -1 , . . . , X1  x1 , X 0  x 0 ]
 P[X t 1  x t 1 | X t  x t ]
X
S1
X2
S2
X3
S3
Xi-1
Si-1
Xi
Si
Xi+1
Si+1
X
S1
X2
S2
X3
S3
Xi-1
Si-1
Xi
Si
Xi+1
Si+1
A markov chain also consists of
1. An m dimensional initial distribution vector
(  (s1 ),
,  (s m ))
2. An m×m transition probabilities transition matrix
P= (Aij) s.t P(i,j)=Pr[xt+1=si|xt=sj]
Notice that the transition probability between Xt
and Xt+2 are also derived in the following way:
Pr( xt  2  sk | xt  si )   Pr(xt  2  sk , xt 1  s j | xt  si )
j
  Pr( xt  2  sk | xt  s j ) Pr(xt 1  s j | xt  si )
j
  pij p jk
j


The multiplications shown in the previous slide are
equivalent to multiplying M by itself .
We conclude that P2ik= Pr(xt+2=sk|xt=si)
more generally Pnik = Pr(xt+n=sk|xt=si)
Stationary distribution
A distribution P = (1,…, m) is stationary if P P = P
•
•
When P is unique for a transition matrix P , the rows
of lim n  P n all coinside with P.
in other words , the long-run probability of being in
state j is Pj, regardless of the initial state.
The gambler example




a gambler wins 1 dollar with probability
losses 1 dollar with probability 1-p
the gambler plays 5 rounds.
The transition matrix of that markov chain is:
0
1
2
3
4 5
0 1
1 1  p

2 0
3 0

4 0

5 0
0
0
1 p
0
0
0
0
p
0
1 p
0
0
0
0
p
0
1 p
0
0
0
0
p
0
0
0
0

0
0

p

1



there is no unique stationary distribution for this
example
any distribution  for which 1+6=1 will satisfy P=.
For example:
0
0
0
 1
1  p 0
p
0

p
 0 1 p 0


 0.5 0 0 0 0 0.5  
0
0 1 p 0

 0
0
0 1 p

0
0
0
 0
0
0
0
p
0
0
0
0

0 


0.5
0
0
0
0
0.5



0

p

1

For p=0.5 we can see that
0
0
0 0 0 
1
1
0.5 0 0.5 0 0 0 
 0.8



 0 0.5 0 0.5 0 0 
0.6
 
0


0 0.5 0 p 0
0.4



0
0.2
0
0 0.5 0 0.5



0
0
0
0
0
1


0

0 0 0 0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
The long run frequency of being in a given state
depends on the initial state
0
0
0
0
0
0 
0.2 

0.4 
0.6 

0.8 

1 
Some properties of Markov chains





A Markov chain is aperiodic if the chances of getting
from state to state are not periodic in the number of
states needed.
States i,j communicate if state j is reachable from
state i, i.e there exists a finite number n s.t pn(i,j)>0.
A Markov chain is said to be irreducible if all its
states communicate.
A Markov chain is said to be ergodic if it is
irreducible and aperiodic.
An ergodic Markov chain always has a unique
stationary distribution
MCMC - Monte Carlo Markov Chain
Goal:
Calculating an Expectatoin Value  
   (i)
i
i
When the probabilities i are given up to an
unknown normalization factor Z.
Examples :
 E 
1
1. Boltzman distribution:  i  exp   i 
Z
 kbT 
 Ei 
where Z   exp  

k
T
i
 b 
is the canonical Partitioning function,
which is sometimes incalculable.
2. Constructing the likelihood function …
Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) =
Probability of data (sum over all
states of all hidden variables)
Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) =
 l11m, l11f … s23f [P(l11m) P(l11f) P(x11 | l11m, l11f,) … P(s13m)
P(s13f) P(s23m | s13m, 2) P(s23m | s13m, 2) ]
Method: Simulate
don’t go over all space.
Calculate an estimator to the expectation value.
We want to calculate the expectation value     s ( s )
s
Instead, we calculate an estimator to the
expectation value, by sampling from the space .
1 n 1
  E  ( S )    ( Si )
n i 0
MCMC tells us how to carry out the sampling to create
a good (unbiased, low variance) estimator.
1 n 1
lim   ( Si )  lim   
n  n
n 
i 0
MCMC: create an ergodic (=irreducible, aperiodic)
Markov Chain and simulate from it.
We need to define an ergodic transition matrix P   pij 
for which  is a unique stationary distibution.
If P satisfies detailed balance, i.e.  i pij   j p ji
for all i  j , then  is stationary for P:
 p  
i
ij
j
j
j
  i    j p ji
j
p ji
Metropolis (1953) suggested a simulating algorithm, in
which one defines P indirectly via a symmetric
transition matrix Q in the following way:
If you are in state i, propose a state j with probability qij
 j
Move from i to j with probability aij  min 1, . 
 i 
This assures detailed balance of P :
 i pij   j p ji
 i 
 j
 i qij min 1,    j q ji min 1, 
 i 
 j
qij min  i ,  j   q ji min  j ,  i 
Hastings (1970) generalized for a non-symmetric Q :
  j q ji 
aij  min 1,

  i qij 
Metropolis – Hastings Algorithm:
• Start with any state x0 .
• Step t: current state is xt 1  i .
Propose state j with probablity qij.
  j q ji 
compute aij  min 1,

  i qij 
Accept state j if aij  1.
else, draw uniformly from [0,1] a random
aij
number cc. If
state j
, accept
else , stay in state
Simulated Annealing
Goal: Find the most probable state of a
Markov Chain.
Use the Metropolis Algorithm with a
different acceptance probability:
1








aij  min 1,  j  
   i  
Where  is the temperature.
Cool gradually, i.e., allow the chain to run m
times with an initial temperature  0 , then
set a new temperature  1   0   and allow
the chain to run again. Repeat n times, until   0
Simulated Annealing (Kirkpatrick,1983)
Goal: Find the global minimum of a multi
-dimensional surface S ( x1 , x2 , , xn ), when
all states have an equal probability.
Use another variation of the Metropolis
Algorithm.
Suppose you are in state i. Draw a new state j,
and accept using the acceptance probabilty:

 ( S j  Si )  
aij  min 1 , C  exp  

kT



Where T is the temperature.
Cool gradually.
Parameters to play with:

n (number of loops)

m (number of iterations in each loop)

T0 (initial temperature, shouldn’t be too high or
too low)

T (cooling speed, shouldn’t be too fast)

k (“Boltzman constant” – scaling parameter)

C (first step scaling parameter)

x0  ( x0,1 , x0,2 ,

qij (how to draw the next step)
, x0,n ) (initial state)
MCMC Example: Descent Graph



Corresponds to a specific inheritance vector.
Vertices: the individuals’ genes (2 genes for each
individual in the pedigree).
Edges: represent the gene flow specified by the
inheritance vector. A child’s gene is connected by
an edge to the parent’s gene from which it flowed.
(Taken from tutorial 6)
1
2
11
12
13
14
21
22
23
24
a/b
a/b
a/b
a/b
a/c
Descent Graph
1
2
(a,b)
Assume that the descent
graph vertices below
represent the pedigree on
the left.
b/d
3
4
(a,b)
(a,b)
5
6
(a,b)
(a,c)
7
8
(b,d)
(Taken from tutorial 6)
Descent Graph
1
2
(a,b)
3
4
(a,b)
(a,b)
5
6
(a,b)
(a,c)
7
8
(b,d)
Assume that paternally inherited genes are on the left.
2. Assume that non-founders are placed in increasing
order.
3. A ‘1’ (‘0’) is used to denote a paternally (maternally)
originated gene.
 The gene flow above corresponds to the inheritance
vector: v = ( 1,1; 0,0; 1,1; 1,1; 1,1; 0,0 )
1.
(Taken from tutorial 6)
Creating a Markov Chain of descent graphs:
•A Markov Chain state is a (legal) descent graph.
•We need to define transition rules from one state to
another.
•We will show 4 possible transition rules :
T0 , T1 , T2a , T2b .
Transition rule T0 :
Choose parent, child.
Transition rule T0 :
Choose parent, child.
Swap paternal arc to maternal arc (or vice versa)
Small change in the graph
slow mixing
Transition rule T1 :
Choose person i.
Transition rule T1 :
Choose person i.
Perform a T0 transition from i to each one of
his/her children.
Transition rule T2a :
Choose a couple i.j with common children.
Transition rule T2a :
Choose a couple i.j with common children.
Exchange the subtree rooted at the maternal node
of i with the subtree rooted at the maternal node
of j. exchange also the paternally rooted subtrees.
Transition rule T2b :
Choose a couple i.j with common children.
Transition rule T2b :
Choose a couple i.j with common children.
Exchange the subtree rooted at the maternal node
of i with the subtree rooted at the paternal node
of j, and vice versa.
Problem :
a
1/1
2/2
3/3
b
1/1
2/2
1/1
3/3
3/3
1/2
2/2
1/1
2/2
3/3
1/2
Two legal graphs of inheritance.
No way to move from a to b using allowed transitions
(the states don’t communicate).
Solution: allow passage through illegal graphs (tunneling)
b
1/1
2/2
1/1
3/3
1/2
1/1
2/2
1/1
3/3
3/3
2/2
3/3
1/1
2/2
2/2
3/3
2/2
3/3
1/2
1/1
1/2
Applying tunneling in MCMC:
In order to move from state i to state j (one step of
the chain),
(1) Select a transition (+person+locus) and perform it.
(2) Toss a coin. If it says “head”, stop.
If it says “tail”, goto (1).
(3) If the proposed state j is legal, accept it with the
Metropolis probability
 j
aij  min 1, 
 i 
This assures that the Markov Chain is irreducible.
One use of descent graphs MCMC is the calculation of the
location score for location d, given marker evidence M and
trait phenotypes T :
 Prd T M  
log10 

Pr(
T
)


The difficult term to calculate is the conditional probability
Prd T M  .
Assuming linkage equilibrium we can write
Prd T M    Prd T Gˆ Pr  Gˆ M  ,where Ĝ is a descent graph.
Gˆ


MCMC: the stationary distribution of the chain is the
conditional distribution Pr Gˆ M .


If a sequence of legal descent graphs Gˆ1 ,
, Gˆ n is generated
by running the chain,
then for n large enough the estimate
n
1
is: Prd T M    Prd T Gˆ . The term Prd T Gˆ can be
n i 1
calculated quickly, using MENDEL.




The Gibbs sampler




Special case of the Metropolis - Hastings algorithm
for Cartesian product state spaces
Suppose that each sample point i=(i1,i2,i3..im) has m
components, for instance, a pedigree of m individuals
The Gibbs sampler updates one component of i at a
time.
If the component is chosen randomly and resampled
conditional on the components, then the acceptance
probability is 1 .
The Gibbs sampler
Proof:
 Let ic be the uniformly chosen component
 Denote the remaining components by i-c
 If j is a neighbor of i reachable by changing only
component ic then j-c=i-c.
j
1
 The proposal probability is q 
ij
m  {k : k c  i c } k

and satisfies piqij=pjqji
The ratio appearing in the acceptance probability is
i
1
j
 j q ji
m  {k : k c  j c } k

1
j
1
 i qij
i
m  {k : k c  i c } k
Example:
MCMC with ordered genotypes


Suppose we want to generate a pedigree for which we
know only the unordered genotype of 4 individuals in
the last generation .
We make the following simplifying assumptions:

The population is in HW equilibrium

The three loci are in linkage equilibrium

There is no crossover interference between the
three loci

One option is to first simulate the founders
genotype according to population frequencies,
then “drop” genes along the tree according to
Mendel’s law and the recombination fractions
1
2
11
12
13
14
21
22
23
24
2,4
1,2
2,2
2,4
1,2
1,2
1,2
1,3
1,3
3,4
1,2
1,3

One realization of an ordered genotype consisting
with the data we have is the following pedigree :
1
2
2|1
1|2
1|3
1|4
3|1
2|1
11
12
13
14
2|1
1|3
1|2
4|2
2|3
2|3
2|4
1|3
3|2
3|3
21
22
23
2|4
1|2
2|2
2|4
1|2
1|2
2|1
3|1
2|1
1|3
1|3
24
4|3
1|2
1|3

Computing the probability of a certain realization of
this pedigree is easy but even for this small tree the
proposal we saw is just one of ((43)2)4x(23)12=284
options.

It is obvious we need a better method .
The Gibbs sampler
The Gibbs sampler updates the genotype of one
individual at one locus at a time.
 This is easy given all other data due to the local nature
of the dependency.
We need to compute
* Pr[xim(l), xip(l)|xim(-l), xip(-l), x-i,y]
Where xim(l), xip(l) are the maternal and paternal alleles
of locus l of individual i,
xim(-l), xip(-l) are the paternal and maternal alleles of all
other loci of individual i,
y is the phenotype data,
and x-i denotes the data of all other individuals in the
pedigree.




This calculation involves only the phenotype of i
and the genotypes of his parents and children
under the no crossover assumption the genotype
of on locus across a chromosome depends only
on it’s adjacent loci
We get the following expression to calculate *
Pr[x (l ), x (l ) | x ( l ), x ( l ), x , y ] 
im
ip
im
ip
i
Pr[x | xm ]Pr[x | x p ]Pr[y (l ) | x (l )]
Pr[x
|x ]

im
ip
i
i
km
k kids (i )
Pr[x
|x ]
x (l ) Pr[xim | xm ]Pr[xip | x p ]Pr[y (l ) | xi (l )]

i
km
i
k kids (i )

Lets return to the pedigree we showed earlier.
Assume that the ordered genotype is the current
state, and we want to update individual 13 at locus
number 2.

There are 3 genotypes consistent with the genotype
configuration of other individuals in the pedigree
1|1,2|1, 1|3

We need to calculate the probability of each one of
these states according to the equation shown in the
last slide

We first compute the paternal and maternal
transmission probabilities for each allele in these
three genotypes.

Then we need the transmission probabilities from
individual 13 to his kids.

Once the three have been computed ,they are
normalized by dividing each one by the total sum.

The genotype is sampled according to that
distribution .
Problems :


Updating is done a single genotype at a time.
For a large pedigree this is a very slow process
The Gibbs sampler may also be reducible, i.e. some
legal states will never be reached.
For example:
A|B
A|A
A|C
B|C
A|C
by Gibbs
A|A
A|B
C|B

One solution to this problem is allowing illegal states
in the chain which will later be discarded.

Another is to identify the irreducible components
called islands and to define metropolis states
allowing to jump from one island to another.

However the whole process of finding the islands
and setting up proposal for the metropolis jumps is
not fully automated

the work involved for each different pedigree
prevents the widespread use of this method.
Other MCMC that use the Gibbs Sampler

The L-sampler (Heath, 1997)

A Whole Locus Gibbs sampler combining
Monte Carlo simulation with single-locus
peeling
Other MCMC that use the Gibbs
Sampler

The M-sampler (Thompson and Heath, 1999;
Thompson, 2000a)

A whole-meiosis Gibbs sampler which jointly
updates an entire meiosis from the full
conditional distribution
References



K. Lange, Mathematical and Statistical Methods for
Genetic Analysis, chapter 9, Springer (1997).
A. Bureau, T.P. Speed, Course stat 260: Statistics in
Genetics, week 7.
A.W. George, E. Thompson, Discovering Disease
Genes: Multipoint Linkage Analysis via a New Markov
Chain Monte Carlo Approach, Statistical Science,
Vol. 18, No.4 (2003)

Lecture notes.

Tutorial notes.
The
End!!