Introduction to Hidden Markov Models


Introduction to Hidden Markov Models
Wang Rui
CAD&CG State Key Lab
2004-5-26
Outline

• Example: Video Texture
• Markov Chains
• Hidden Markov Models
• Example: Motion Texture
Example: Video Texture

• Problem statement: turn a finite video clip into a video texture, a stream that plays continuously without a visible repeat.
• The approach: replay the clip's frames, occasionally jumping between similar frames. How do we find good transitions?

Finding good transitions

• Compute the L2 distance $D_{i,j}$ between all pairs of frames i and j.
• Similar frames make good transitions.

[Video example: Fish Tank]
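As a concrete illustration, here is a minimal NumPy sketch of the pairwise frame-distance computation; the function name and the assumed (num_frames, height, width) array layout are ours, not from the slides.

```python
import numpy as np

def frame_distances(frames):
    """L2 distance D[i, j] between every pair of frames.

    frames: array of shape (num_frames, height, width[, channels]).
    A small D[i, j] means frames i and j look alike, so cutting from
    frame i to the successor of frame j gives a hard-to-notice transition.
    """
    flat = frames.reshape(len(frames), -1).astype(np.float64)
    sq = np.sum(flat ** 2, axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, evaluated for all pairs at once
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T
    return np.sqrt(np.maximum(d2, 0.0))
```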
Mathematical Model of Video Texture

• A sequence of random variables: {ADEABEDADBCAD}
• A sequence of random variables: {BDACBDCACDBCADCBADCA}
Mathematical Model: Markov Model

"The future is independent of the past given the present."
Markov Property

• Formal definition
  – Let $X = \{X_n\}_{n=0}^{N}$ be a sequence of random variables taking values in a set of states $\{s_k\}$. If
    $$P(X_m = s_m \mid X_0 = s_0, \dots, X_{m-1} = s_{m-1}) = P(X_m = s_m \mid X_{m-1} = s_{m-1})$$
    for all $m$, then $X$ fulfills the Markov property.
• Informal definition
  – The future is independent of the past given the present.
History

• Markov chain theory developed around 1900.
• Hidden Markov Models developed in the late 1960s.
• Used extensively in speech recognition in the 1960s and 70s.
• Introduced to computer science in 1989.
Applications

• Bioinformatics
• Signal processing
• Data analysis and pattern recognition
Markov Chain

• A Markov chain is specified by
  – A state space S = { s1, s2, ..., sn }
  – An initial distribution $a_0$
  – A transition matrix $A$, where $A_{ij} = a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)$
• Graphical representation: a directed graph where
  – Vertices represent states
  – Edges represent transitions with positive probability
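A minimal sketch of sampling from such a chain, assuming states are encoded as integer indices; the 3-state chain below is a made-up example, not one from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 3-state chain over S = {s1, s2, s3}.
a0 = np.array([1.0, 0.0, 0.0])      # initial distribution a0
A = np.array([[0.7, 0.2, 0.1],      # A[i, j] = a_{ij} = P(q_t = s_j | q_{t-1} = s_i)
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def sample_chain(a0, A, length):
    """Draw a state sequence from the Markov chain {S, a0, A}."""
    states = [rng.choice(len(a0), p=a0)]
    for _ in range(length - 1):
        states.append(rng.choice(A.shape[1], p=A[states[-1]]))
    return states

print(sample_chain(a0, A, 10))
```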
Probability Axioms

• Marginal probability: sum the joint probability, $P(X = x) = \sum_y P(X = x, Y = y)$.
• Conditional probability: $P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$.
Calculating with Markov Chains

• Probability of an observation sequence:
  – Let $X = \{x_t\}_{t=0}^{L}$ be an observation sequence from the Markov chain $\{S, a_0, A\}$. Then
    $$P(X) = a_0(x_0) \prod_{t=1}^{L} a_{x_{t-1} x_t}.$$
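A short sketch of this product, under the same integer-state encoding as above; the function name is illustrative.

```python
def chain_sequence_prob(a0, A, x):
    """P(X) = a0[x_0] * prod_t a_{x_{t-1} x_t}, with states as integer indices."""
    p = a0[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= A[prev, cur]
    return p
```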
Motivation for Hidden Markov Models

• Hidden states
  – The state of the entity we want to model is often not observable; the state is then said to be hidden.
• Observables
  – Sometimes we can instead observe the state of entities influenced by the hidden state.
• A system can be modeled by an HMM if:
  – The sequence of hidden states is Markov
  – The sequence of observations is independent (or Markov) given the hidden states
Hidden Markov Model

• Definition
  – Set of states S = { s1, s2, ..., sN }
  – Observation symbols V = { v1, v2, ..., vM }
  – Transition probabilities A between any two states: $a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i)$
  – Emission probabilities B within each state: $b_j(O_t) = P(O_t = v_k \mid q_t = s_j)$
  – Start probabilities $\pi = \{a_{0i}\}$

Use $\lambda = (A, B, \pi)$ to denote the parameter set of the model.
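A minimal sketch of one way to hold $\lambda = (A, B, \pi)$ as NumPy arrays; the 2-state, 3-symbol numbers are invented for illustration.

```python
import numpy as np

# A hypothetical 2-state, 3-symbol HMM.
pi = np.array([0.6, 0.4])            # start probabilities, pi[i] = a_{0i}
A = np.array([[0.7, 0.3],            # transitions, A[i, j] = P(q_t = s_j | q_{t-1} = s_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # emissions, B[j, k] = P(O_t = v_k | q_t = s_j)
              [0.1, 0.3, 0.6]])

# Each distribution must sum to one.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```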
Generating a Sequence from the Model

Given an HMM, we can generate a sequence of length n as follows:

1. Start at state $q_1$ according to probability $a_{0 q_1}$
2. Emit letter $o_1$ according to probability $b_{q_1}(o_1)$
3. Go to state $q_2$ according to probability $a_{q_1 q_2}$
4. ... until emitting $o_n$

[Trellis diagram: states 1...N at each of the n time steps, with start probabilities $a_{0i}$, transitions between columns, and emissions $b_i(o_t)$ producing $o_1, o_2, \dots, o_n$.]
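A sketch of this generative procedure, assuming the $\lambda = (A, B, \pi)$ array layout from the earlier sketch; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmm_generate(pi, A, B, n):
    """Generate a state path and observation sequence of length n from lambda = (A, B, pi)."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # 1. start at q1 according to a_{0,q1}
    for _ in range(n):
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))  # 2. emit o_t according to b_{q_t}(o_t)
        q = rng.choice(A.shape[1], p=A[q])          # 3. move to q_{t+1} according to a_{q_t q_{t+1}}
    return states, obs
```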
Example
Calculating with the Hidden Markov Model

Consider one such fixed state sequence
$$Q = q_1 q_2 \cdots q_T.$$
The probability of the observation sequence O given Q is
$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T).$$
The probability of such a state sequence Q can be written as
$$P(Q \mid \lambda) = a_{0 q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}.$$
The probability that O and Q occur simultaneously is simply the product of the above two terms:
$$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda) = a_{0 q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2)\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} b_{q_T}(O_T).$$
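A direct transcription of this product into code, under the same array conventions as the sketches above.

```python
def joint_prob(pi, A, B, Q, O):
    """P(O, Q | lambda) = a_{0 q1} b_{q1}(O1) a_{q1 q2} b_{q2}(O2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)."""
    p = pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    return p
```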
Example
The Three Main Questions on HMMs

1. Evaluation
   GIVEN an HMM $\lambda = (S, V, A, B, \pi)$ and a sequence O,
   FIND $P(O \mid \lambda)$.
2. Decoding
   GIVEN an HMM $\lambda = (S, V, A, B, \pi)$ and a sequence O,
   FIND the sequence Q of states that maximizes $P(O, Q \mid \lambda)$.
3. Learning
   GIVEN an HMM $(S, V, A, B, \pi)$ with unspecified transition/emission probabilities and a sequence O,
   FIND the parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$.
Evaluation

Find the likelihood that a sequence is generated by the model.

• The straightforward way
  The probability of O is obtained by summing over all possible state sequences Q:
  $$P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \sum_{q_1, q_2, \dots, q_T} a_{0 q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2)\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} b_{q_T}(O_T).$$
  – The complexity is $O(N^T)$, so the direct calculation is infeasible.
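For contrast with the Forward Algorithm that follows, a brute-force sketch of this sum; it enumerates all $N^T$ state sequences and is only usable for toy sizes.

```python
import itertools

def brute_force_evaluate(pi, A, B, O):
    """P(O | lambda): sum P(O, Q | lambda) over all N^T state sequences Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total
```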
The Forward Algorithm

• A more efficient, dynamic-programming approach: the Forward Algorithm. For example, at the second step,
  $$\alpha_2(1) = \Big[\sum_{i=1}^{N} \alpha_1(i)\, a_{i1}\Big] b_1(O_2), \qquad P(O_1 O_2 \mid \lambda) = \sum_{i=1}^{N} \alpha_2(i),$$
  and at the final step
  $$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).$$

[Trellis diagram: the forward pass propagates $\alpha$ values left to right through states 1...N while consuming $o_1, o_2, \dots, o_n$.]
The Forward Algorithm

The forward variable:
$$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\, q_t = S_i \mid \lambda)$$
We can compute $\alpha_t(i)$ for all $i$ and $t$:

Initialization:
$$\alpha_1(i) = a_{0i}\, b_i(O_1), \qquad i = 1 \dots N$$
Iteration:
$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(O_{t+1}), \qquad t = 1 \dots T-1$$
Termination:
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

[Trellis diagram as before.]
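A sketch of the forward pass under the array conventions used above; rows of alpha are indexed from 0, so alpha[t-1] holds $\alpha_t$.

```python
import numpy as np

def forward(pi, A, B, O):
    """Forward pass: row t of alpha holds alpha_{t+1}(.) in the slide's 1-based notation."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialization: alpha_1(i) = a_{0i} b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # alpha_{t+1}(j) = [sum_i alpha_t(i) a_{ij}] b_j(O_{t+1})
    return alpha

def evaluate(pi, A, B, O):
    return forward(pi, A, B, O)[-1].sum()           # termination: P(O | lambda) = sum_i alpha_T(i)
```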
The Backward Algorithm

The backward variable:
$$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda)$$
Similarly, we can compute the backward variable for all $i$ and $t$:

Initialization:
$$\beta_T(i) = 1, \qquad i = 1 \dots N$$
Iteration:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, T-2, \dots, 1, \quad 1 \le i \le N$$
Termination:
$$P(O \mid \lambda) = \sum_{j=1}^{N} a_{0j}\, b_j(O_1)\, \beta_1(j)$$

[Trellis diagram as before.]
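A matching sketch of the backward pass; the second function shows the alternative expression for $P(O \mid \lambda)$ from the termination step.

```python
import numpy as np

def backward(pi, A, B, O):
    """Backward pass: row t of beta holds beta_{t+1}(.) in the slide's 1-based notation."""
    T, N = len(O), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # beta_t(i) = sum_j a_{ij} b_j(O_{t+1}) beta_{t+1}(j)
    return beta

def evaluate_backward(pi, A, B, O):
    # termination: P(O | lambda) = sum_j a_{0j} b_j(O_1) beta_1(j)
    return (pi * B[:, O[0]] * backward(pi, A, B, O)[0]).sum()
```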
Consider
$$\alpha_T(i) = P(O_1 O_2 \cdots O_T,\, q_T = S_i \mid \lambda) = P(O, q_T = S_i \mid \lambda).$$
Thus
$$P(q_T = S_i \mid O) = \frac{P(O, q_T = S_i)}{P(O)} = \frac{\alpha_T(i)}{\sum_i \alpha_T(i)}.$$
Also, for any $t$,
$$P(q_t = S_i \mid O) = \frac{P(O, q_t = S_i)}{P(O)}
= \frac{P(O_1 \cdots O_t,\, q_t = S_i,\, O_{t+1} \cdots O_T)}{P(O)}
= \frac{P(O_1 \cdots O_t,\, q_t = S_i)\, P(O_{t+1} \cdots O_T \mid O_1 \cdots O_t,\, q_t = S_i)}{P(O)}.$$
By the Markov property, the second factor depends only on $q_t$, so
$$P(q_t = S_i \mid O) = \frac{P(O_1 \cdots O_t,\, q_t = S_i)\, P(O_{t+1} \cdots O_T \mid q_t = S_i)}{P(O)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_i \alpha_T(i)},$$
where the first factor is the forward variable $\alpha_t(i)$ and the second is the backward variable $\beta_t(i)$.
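Combining the two passes gives the posterior state probabilities; this sketch reuses forward() and backward() from the sketches above.

```python
def state_posteriors(pi, A, B, O):
    """gamma[t, i] = P(q_t = S_i | O, lambda) = alpha_t(i) beta_t(i) / P(O | lambda)."""
    gamma = forward(pi, A, B, O) * backward(pi, A, B, O)
    return gamma / gamma.sum(axis=1, keepdims=True)  # each row already sums to P(O | lambda)
```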
Decoding

GIVEN an HMM and a sequence O. Suppose that we know the parameters $\lambda$ of the Hidden Markov Model and the observed sequence of observations $O_1, O_2, \dots, O_T$.
FIND the sequence Q of states that maximizes $P(Q \mid O, \lambda)$, i.e. determine the sequence of states $q_1, q_2, \dots, q_T$ that is optimal in some meaningful sense (best "explains" the observations).

Consider
$$P(Q \mid O, \lambda) = \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda)}.$$
Since $P(O \mid \lambda)$ does not depend on Q, maximizing $P(Q \mid O, \lambda)$ is equivalent to maximizing
$$P(O, Q \mid \lambda) = a_{0 i_1} b_{i_1 o_1}\, a_{i_1 i_2} b_{i_2 o_2}\, a_{i_2 i_3} b_{i_3 o_3} \cdots a_{i_{T-1} i_T} b_{i_T o_T}.$$
A Best-Path-Finding Problem

$$\max_Q P(O, Q \mid \lambda) \iff \max_Q \ln P(O, Q \mid \lambda) = \max_Q \big( \ln(a_{0 i_1} b_{i_1 o_1}) + \ln(a_{i_1 i_2} b_{i_2 o_2}) + \cdots + \ln(a_{i_{T-1} i_T} b_{i_T o_T}) \big)$$

[Trellis diagram: finding the maximizing Q is finding the best path through the state trellis, where each edge carries the log-probability of the corresponding transition and emission.]
Viterbi Algorithm

• A dynamic-programming algorithm.

Initialization:
$$\delta_1(i) = a_{0i}\, b_i(O_1), \qquad \psi_1(i) = 0, \qquad i = 1 \dots N$$
Recursion:
$$\delta_t(j) = \max_i\, [\delta_{t-1}(i)\, a_{ij}]\, b_j(O_t), \qquad \psi_t(j) = \arg\max_i\, [\delta_{t-1}(i)\, a_{ij}], \qquad t = 2 \dots T,\ j = 1 \dots N$$
Termination:
$$P^* = \max_i \delta_T(i), \qquad q_T^* = \arg\max_i \delta_T(i)$$
Traceback:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \dots, 1.$$
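A log-space sketch of this recursion, again under the conventions above; log(0) entries become $-\infty$, which max/argmax handle naturally.

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Most likely state path argmax_Q P(O, Q | lambda), computed in log space."""
    T, N = len(O), len(pi)
    with np.errstate(divide="ignore"):               # log(0) -> -inf is fine here
        logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, O[0]]                 # delta_1(i) = ln(a_{0i} b_i(O_1))
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # scores[i, j] = delta_{t-1}(i) + ln a_{ij}
        psi[t] = scores.argmax(axis=0)               # psi_t(j)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                    # q_T* = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]            # traceback: q_t* = psi_{t+1}(q_{t+1}*)
    return path, delta[-1].max()                     # best path and ln P*
```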
Learning
• Estimation of the parameters of a Hidden Markov Model:
  1. Both the sequence of observations O and the sequence of states Q are observed:
     learn $\lambda = (A, B, \pi)$
  2. Only the sequence of observations O is observed:
     learn Q and $\lambda = (A, B, \pi)$

• Given O and Q, the likelihood is
$$L(A, B, \pi) = a_{0 i_1} b_{i_1 o_1}\, a_{i_1 i_2} b_{i_2 o_2}\, a_{i_2 i_3} b_{i_3 o_3} \cdots a_{i_{T-1} i_T} b_{i_T o_T},$$
and the log-likelihood is
$$l(A, B, \pi) = \ln L(A, B, \pi) = \ln a_{0 i_1} + \ln b_{i_1 o_1} + \ln a_{i_1 i_2} + \ln b_{i_2 o_2} + \cdots + \ln a_{i_{T-1} i_T} + \ln b_{i_T o_T}$$
$$= \sum_{i=1}^{N} f_{i0} \ln a_{0i} + \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \ln a_{ij} + \sum_{i=1}^{N} \sum_{o_t:\, q_t = S_i} \ln b_i(o_t),$$
where $f_{i0}$ is the number of times state i occurs as the first state, $f_{ij}$ is the number of times state i changes to state j, and the last sum, for each state i, runs over all observations $o_t$ where $q_t = S_i$ (with emission density $b_i(y) = f(y \mid i)$, or $p(y \mid i)$ in the discrete case).

In this case the parameters computed by Maximum Likelihood estimation are:
$$\hat a_{0i} = \frac{f_{i0}}{1} = f_{i0}, \qquad \hat a_{ij} = \frac{f_{ij}}{\sum_{j=1}^{N} f_{ij}},$$
and $\hat b_i$ = the MLE of $b_i$ computed from the observations $o_t$ where $q_t = S_i$.
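A sketch of these count-based estimates for a single fully observed (Q, O) pair; the guard against empty rows is our addition, for states that never occur in Q.

```python
import numpy as np

def mle_from_counts(Q, O, N, M):
    """Count-based MLE of (pi, A, B) when both the states Q and observations O are seen."""
    pi = np.zeros(N); A = np.zeros((N, N)); B = np.zeros((N, M))
    pi[Q[0]] = 1.0                                     # f_{i0} for a single sequence
    for prev, cur in zip(Q, Q[1:]):
        A[prev, cur] += 1                              # f_{ij}: count of transitions i -> j
    for q, o in zip(Q, O):
        B[q, o] += 1                                   # per-state emission counts
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1)   # a_hat_{ij} = f_{ij} / sum_j f_{ij}
    B /= np.maximum(B.sum(axis=1, keepdims=True), 1)   # b_hat from the o_t with q_t = S_i
    return pi, A, B
```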
• Only the sequence of observations O is observed:
$$L(A, B, \pi) = \sum_{i_1, i_2, \dots, i_T} a_{0 i_1} b_{i_1 o_1}\, a_{i_1 i_2} b_{i_2 o_2}\, a_{i_2 i_3} b_{i_3 o_3} \cdots a_{i_{T-1} i_T} b_{i_T o_T}$$
  – It is difficult to find the Maximum Likelihood Estimates directly from this likelihood function.
  – The techniques that are used are:
    1. The Segmental K-means Algorithm
    2. The Baum-Welch (E-M) Algorithm
The Baum-Welch (E-M) Algorithm

• The E-M algorithm was designed originally to handle "missing observations".
• In this case the missing observations are the states {q1, q2, ..., qT}.
• Assuming a model, the states are estimated by finding their expected values under this model (the E part of the E-M algorithm).
• With these values, the model is re-estimated by Maximum Likelihood Estimation (the M part of the E-M algorithm).
• The process is repeated until the estimated model converges.
The E-M Algorithm

Let $f(O, Q \mid \lambda) = L(O, Q, \lambda)$ denote the joint distribution of (O, Q). Consider the function:
$$Q(\lambda', \lambda) = E\big[\ln L(O, Q, \lambda') \,\big|\, O, \lambda\big] \qquad (1)$$
Starting with an initial estimate $\lambda^{(0)}$, a sequence of estimates $\{\lambda^{(m)}\}$ is formed by finding $\lambda^{(m+1)}$ to maximize $Q(\lambda', \lambda^{(m)})$ with respect to $\lambda'$.

The sequence of estimates $\{\lambda^{(m)}\}$ converges to a local maximum of the likelihood $L(\lambda) = f(O \mid \lambda)$.