
Hidden Markov Model
Instructor: Saeed Shiry
Based on Chapter 13, Ethem Alpaydin, © The MIT Press, 2004
Introduction




In a word, successive letters are dependent: in English, 'h' is very likely to follow 't' but not 'x'. Such processes, where there is a sequence of observations (for example, letters in a word, or base pairs in a DNA sequence), cannot be modeled as simple probability distributions.
A similar example is speech recognition, where speech utterances are composed of speech primitives called phonemes; only certain sequences of phonemes are allowed, namely the words of the language.
At a higher level, words can be written or spoken in certain sequences to form a sentence, as defined by the syntactic and semantic rules of the language.
A sequence can be characterized as being generated by a parametric random process. In this chapter, we discuss how this modeling is done and also how the parameters of such a model can be learned from a training sample of example sequences.
Objectives


Modeling dependencies in the input sequences:
 Temporal: in speech, phonemes in a word (dictionary) and words in a sentence (syntax and semantics of the language); in handwriting, pen movements.
 Spatial: in a DNA sequence, base pairs.
History
 Markov chain theory was developed around 1900.
 Hidden Markov models were developed in the late 1960s.
 Used extensively in speech recognition in 1960-70.
 Introduced to computer science in 1989.

Applications
 Bioinformatics
 Signal processing
 Data analysis and pattern recognition
HMMs and their Usage
 HMMs are very common in computational linguistics:
 Speech recognition (observed: acoustic signal; hidden: words)
 Handwriting recognition (observed: image; hidden: words)
 Part-of-speech tagging (observed: words; hidden: part-of-speech tags)
 Machine translation (observed: foreign words; hidden: words in the target language)
Discrete Markov Processes
 Consider a system that at any time is in one of a set of N distinct states:
S = { S1, S2, ..., SN }
 The state at time t is denoted as
qt, t = 1, 2, ...,
so for example qt = Si means that at time t the system is in state Si.
Discrete Markov Processes
 At regularly spaced discrete times, the system moves to a state with a given probability, depending on the values of the previous states:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...)
 For the special case of a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at earlier times:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)
First-order Markov: "Today is the first day of the rest of your life."
Discrete Markov Processes
 Transition probabilities:
aij ≡ P(qt+1=Sj | qt=Si)
aij ≥ 0 and Σj=1N aij = 1
 Initial probabilities:
πi ≡ P(q1=Si)
Σi=1N πi = 1
 Going from Si to Sj has the same probability no matter when it happens, or where it happens in the observation sequence. A = [aij] is an N × N matrix whose rows sum to 1.
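As a quick illustrative sketch (not part of the chapter), the snippet below samples a state sequence from a first-order Markov chain; π and A are the urn-example numbers used later in these slides, and `sample_chain` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.2, 0.3])          # initial probabilities pi_i
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])         # transition matrix a_ij

assert np.allclose(A.sum(axis=1), 1.0)  # each row of A must sum to 1

def sample_chain(pi, A, T, rng):
    """Sample q_1 ~ pi, then q_{t+1} ~ A[q_t] for t = 1..T-1."""
    q = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        q.append(rng.choice(len(pi), p=A[q[-1]]))
    return q

print(sample_chain(pi, A, 10, rng))
```

Because the chain is first-order, each new state is drawn using only the current state's row of A, exactly as in the Markov property above.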
Stochastic Automaton
 This can be seen as a stochastic automaton.
Observable Markov Model
 In an observable Markov model, the states are observable. At any time t, we know qt, and as the system moves from one state to another, we get an observation sequence that is a sequence of states.
 The output of the process is the set of states at each instant of time, where each state corresponds to a physical observable event.
 We have an observation sequence O that is the state sequence O = Q = {q1 q2 ∙∙∙ qT}, whose probability is given as:
P(O = Q | A, π) = P(q1) ∏t=2T P(qt | qt-1) = πq1 aq1q2 ∙∙∙ aqT-1qT
Example: Balls and Urns
 Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green
π = [0.5, 0.2, 0.3]T

    [0.4 0.3 0.3]
A = [0.2 0.6 0.2]
    [0.1 0.1 0.8]

O = {S1, S1, S3, S3}
P(O | A, π) = P(S1) ∙ P(S1 | S1) ∙ P(S3 | S1) ∙ P(S3 | S3)
            = π1 ∙ a11 ∙ a13 ∙ a33
            = 0.5 × 0.4 × 0.3 × 0.8 = 0.048
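A minimal check of this calculation, using the slide's π and A (state indices are 0-based in the code):

```python
import numpy as np

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

O = [0, 0, 2, 2]                 # the state sequence S1, S1, S3, S3

p = pi[O[0]]                     # P(q1)
for prev, cur in zip(O, O[1:]):
    p *= A[prev, cur]            # multiply in a_{q_t q_{t+1}}
print(p)                         # ≈ 0.048
```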
Balls and Urns: Learning
 Given K example sequences of length T, where qtk denotes the state at time t of sequence k:

π̂i = #{sequences starting with Si} / #{sequences}
   = Σk 1(q1k = Si) / K

âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1T-1 1(qtk = Si and qt+1k = Sj) / Σk Σt=1T-1 1(qtk = Si)
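These counting estimators can be sketched directly; the three example sequences below are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

# Hypothetical observed state sequences (0-based state indices), K = 3.
seqs = [[0, 0, 2, 2], [0, 1, 1, 2], [2, 2, 2, 0]]
N = 3

pi_hat = np.zeros(N)
counts = np.zeros((N, N))
for q in seqs:
    pi_hat[q[0]] += 1                    # count sequences starting in each state
    for s, s_next in zip(q, q[1:]):
        counts[s, s_next] += 1           # count transitions S_i -> S_j

pi_hat /= len(seqs)                                  # divide by K
A_hat = counts / counts.sum(axis=1, keepdims=True)   # normalize each row
print(pi_hat)
print(A_hat)
```

Each row of A_hat is normalized by the total number of transitions out of that state, so the rows sum to 1 as required.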
Hidden Markov Models
 States are not observable,
 but when we visit a state, an observation is recorded that is a probabilistic function of the state. We assume a discrete observation in each state from the set {v1, v2, ..., vM}.
 Emission probabilities:
bj(m) ≡ P(Ot=vm | qt=Sj)
bj(m) is the observation, or emission, probability that we observe vm, m = 1, ..., M, in state Sj.
 The state sequence Q is not observed, which is what makes the model "hidden," but it should be inferred from the observation sequence O.


 For each observation sequence, there are multiple possible state sequences.
 In a hidden Markov model, there are two sources of randomness: in addition to randomly moving from one state to another, the observation in a state is also random.
Example: Balls and Urns
 The hidden case: each urn contains balls of different colors.
 Let bj(m) be the probability of drawing a ball of color m from urn j.
 We observe a sequence of ball colors, but without knowing the sequence of urns from which the balls were drawn.
 The number of ball colors may be different from the number of urns. For example, let us say we have three urns and the observation sequence is
O = {red, red, green, blue, yellow}
 In the hidden case, a ball could have been picked from any urn. So, for the same observation sequence O, there may be many possible state sequences Q that could have generated it.
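The generative process can be sketched as below. The emission matrix B is hypothetical (colors 0: red, 1: blue, 2: green), while π and A reuse the urn numbers; `sample_hmm` is an illustrative helper, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],    # hypothetical: urn 1 is mostly red, ...
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def sample_hmm(pi, A, B, T, rng):
    """Two sources of randomness: the urn (state) walk and the ball draw."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))  # draw a ball color from urn q
        q = rng.choice(len(pi), p=A[q])             # move to the next urn
    return states, obs

states, obs = sample_hmm(pi, A, B, 8, rng)
print(states)   # hidden: which urn was used at each step
print(obs)      # observed: the ball colors
```

Only `obs` would be visible to a learner; recovering `states` from it is exactly the inference problem the following slides address.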
HMM Unfolded in Time
Elements of an HMM
 N: number of states
 M: number of observation symbols
 A = [aij]: N × N state transition probability matrix
 B = [bj(m)]: N × M observation probability matrix
 Π = [πi]: N × 1 initial state probability vector
 λ = (A, B, Π): parameter set of the HMM
Three Basic Problems of HMMs
Given a number of sequences of observations, we are interested in three problems:
 Evaluation: Given a model λ, evaluate the probability of any given observation sequence O = {O1O2 ∙∙∙ OT}, namely P(O | λ).
 State sequence: Given λ and O, find the state sequence Q = {q1q2 ∙∙∙ qT} that has the highest probability of generating O, i.e., find Q* that maximizes P(Q | O, λ).
 Learning: Given a training set of observation sequences, X = {Ok}k, find λ* such that P(X | λ*) = maxλ P(X | λ).
(Rabiner, 1989)
Evaluation
 Given an observation sequence O = {O1O2 ∙∙∙ OT} and a state sequence Q = {q1q2 ∙∙∙ qT}, the probability of observing O given the state sequence Q is simply
P(O | Q, λ) = ∏t=1T P(Ot | qt, λ) = bq1(O1) bq2(O2) ∙∙∙ bqT(OT)
 The probability of the state sequence Q is
P(Q | λ) = P(q1) ∏t=2T P(qt | qt-1) = πq1 aq1q2 ∙∙∙ aqT-1qT
Forward Variable
 We define the forward variable αt(i) as the probability of observing the partial sequence {O1 ∙∙∙ Ot} until time t and being in Si at time t, given the model λ:
αt(i) ≡ P(O1 ∙∙∙ Ot, qt = Si | λ)
 Initialization:
α1(i) = πi bi(O1)
 Recursion:
αt+1(j) = [Σi=1N αt(i) aij] bj(Ot+1)
Forward Variable
 Once the forward variables are calculated, it is easy to compute the probability of the observation sequence:
P(O | λ) = Σi=1N αT(i)
 αT(i) is the probability of generating the full observation sequence and ending up in state Si, so we sum over all such possible final states.
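A minimal sketch of the forward recursion. The matrix B here is hypothetical (the slides do not fix emission numbers); π and A are the urn-example values:

```python
import numpy as np

def forward(pi, A, B, O):
    """alpha[t, i] = P(O_1 ... O_t, q_t = S_i | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]   # recursion
    return alpha

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],    # hypothetical emission probabilities
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
O = [0, 1, 2, 2]                  # an example observation sequence

alpha = forward(pi, A, B, O)
print(alpha[-1].sum())            # P(O | lambda) = sum_i alpha_T(i)
```

The point of the recursion is efficiency: it costs O(N²T), whereas summing P(O, Q | λ) over all Nᵀ state sequences explicitly would be exponential in T.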

Backward Variable
 βt(i) ≡ P(Ot+1 ∙∙∙ OT | qt = Si, λ)
 Initialization:
βT(i) = 1
 Recursion:
βt(i) = Σj=1N aij bj(Ot+1) βt+1(j)
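The backward pass can be sketched alongside the forward pass; a useful sanity check is that Σi αt(i)βt(i) equals P(O | λ) at every t. As before, B is a hypothetical emission matrix:

```python
import numpy as np

def forward(pi, A, B, O):
    alpha = np.zeros((len(O), len(pi)))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, len(O)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(A, B, O):
    """beta[t, i] = P(O_{t+1} ... O_T | q_t = S_i, lambda)."""
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # recursion
    return beta

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],    # hypothetical emission probabilities
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
O = [0, 1, 2, 2]

alpha, beta = forward(pi, A, B, O), backward(A, B, O)
p = (alpha * beta).sum(axis=1)    # same value P(O | lambda) at every t
print(p)
```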
Finding the State Sequence
 Let us define γt(i) as the probability of being in state Si at time t, given O and λ, which can be computed as
γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1N αt(j) βt(j)
 Choose the state that has the highest probability, for each time step:
qt* = arg maxi γt(i)
 No! Choosing the most probable state at each time step individually may produce a sequence with zero overall probability, for example one containing a transition with aij = 0. We need the single most likely state sequence, which is found by the Viterbi algorithm.
Viterbi’s Algorithm
δt(i) ≡ maxq1q2∙∙∙qt-1 P(q1q2∙∙∙qt-1, qt = Si, O1∙∙∙Ot | λ)
 Initialization:
δ1(i) = πi bi(O1), ψ1(i) = 0
 Recursion:
δt(j) = [maxi δt-1(i) aij] bj(Ot), ψt(j) = arg maxi δt-1(i) aij
 Termination:
p* = maxi δT(i), qT* = arg maxi δT(i)
 Path backtracking:
qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
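The recursions above transcribe almost line for line into code; π and A are the urn numbers and B is the same hypothetical emission matrix used in the earlier sketches:

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Return the most likely state sequence and its probability p*."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)               # best predecessor for each j
        delta[t] = scores.max(axis=0) * B[:, O[t]]   # recursion
    q = [int(delta[-1].argmax())]                    # termination: q_T*
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))                 # path backtracking
    return q[::-1], delta[-1].max()

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],    # hypothetical emission probabilities
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
O = [0, 1, 2, 2]

path, p_star = viterbi(pi, A, B, O)
print(path, p_star)
```

Replacing max with a sum in the recursion would give back the forward algorithm; the two differ only in that operation.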
Learning
 We define ξt(i, j) as the probability of being in Si at time t and in Sj at time t+1, given the whole observation sequence O and λ:
ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ)
        = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)
 Baum-Welch algorithm (EM). Define the indicator variables:
zit = 1 if qt = Si, and 0 otherwise
zijt = 1 if qt = Si and qt+1 = Sj, and 0 otherwise
Baum-Welch (EM)
 E-step:
E[zit] = γt(i)
E[zijt] = ξt(i, j)
 M-step:
π̂i = Σk=1K γ1k(i) / K
âij = Σk=1K Σt=1Tk-1 ξtk(i, j) / Σk=1K Σt=1Tk-1 γtk(i)
b̂j(m) = Σk=1K Σt=1Tk γtk(j) 1(Otk = vm) / Σk=1K Σt=1Tk γtk(j)
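One full E-step plus M-step can be sketched as below for a single observation sequence (K = 1). The starting parameters are the hypothetical urn numbers used throughout these sketches; since this is EM, the likelihood P(O | λ) cannot decrease from one iteration to the next:

```python
import numpy as np

def forward(pi, A, B, O):
    alpha = np.zeros((len(O), len(pi)))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, len(O)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(A, B, O):
    beta = np.zeros((len(O), A.shape[0]))
    beta[-1] = 1.0
    for t in range(len(O) - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, O):
    N, M = len(pi), B.shape[1]
    alpha, beta = forward(pi, A, B, O), backward(A, B, O)
    pO = alpha[-1].sum()
    gamma = alpha * beta / pO                              # E[z_i^t] = gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / pO     # E[z_ij^t] = xi_t(i, j)
    new_pi = gamma[0]                                      # K = 1 case of pi-hat
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for m in range(M):                                     # b-hat_j(m)
        new_B[:, m] = gamma[np.array(O) == m].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
O = [0, 1, 2, 2, 1, 0]

pi2, A2, B2 = baum_welch_step(pi, A, B, O)
print(forward(pi, A, B, O)[-1].sum(), "->", forward(pi2, A2, B2, O)[-1].sum())
```

In practice the products of small probabilities underflow for long sequences, so real implementations work with scaled variables or log probabilities.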
References
 Concept:
 Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
 Application:
 Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS’04), 193-204.