
Corpora and Statistical Methods
Lecture 8
Albert Gatt
Part 2
Markov and Hidden Markov Models: Conceptual
Introduction
In this lecture
 We focus on (Hidden) Markov Models
 conceptual intro to Markov Models
 relevance to NLP
 Hidden Markov Models
 algorithms
Acknowledgement
 Some of the examples in this lecture are taken from a tutorial
on HMMs by Wolfgang Maass
Talking about the weather
 Suppose we want to predict tomorrow’s weather. The
possible predictions are:
 sunny
 foggy
 rainy
 We might decide to predict tomorrow’s outcome based on
earlier weather
 if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it
had been rainy all week
 how far back do we want to go to predict tomorrow’s weather?
Statistical weather model
 Notation:
 S: the state space, a set of possible values for the weather: {sunny, foggy,
rainy}
 (each state is identifiable by an integer i)
 X: a sequence of random variables, each taking a value from S
 these model weather over a sequence of days
 t is an integer standing for time
 (X1, X2, X3, ... XT) models the value of a series of random variables
 each takes a value from S with a certain probability P(X=si)
 the entire sequence tells us the weather over T days
Statistical weather model
 If we want to predict the weather for day t+1, our model might
look like this:
P( X t 1  sk | X 1... X t )
 E.g. P(weather tomorrow = sunny), conditional on the weather
in the past t days.
 Problem: the larger t gets, the more calculations we have to
make.
Markov Properties I: Limited horizon
 The probability that we’re in state si at time t+1 only depends on where we were at time t:
P(Xt+1 = si | X1, ..., Xt) = P(Xt+1 = si | Xt)
 Given this assumption, the probability of any sequence is just:
P(X1, ..., XT) = ∏_{i=1..T} P(Xi | Xi-1)
Markov Properties II: Time invariance
 The probability of being in state si given the previous state does
not change over time:
P( X t 1  si | X t )  P( X 2  si | X 1 )
Concrete instantiation
Day t \ Day t+1    sunny    rainy    foggy
sunny              0.8      0.05     0.15
rainy              0.2      0.6      0.2
foggy              0.2      0.3      0.5
This is essentially a transition matrix, which gives us
probabilities of going from one state to the other.
We can denote state transition probabilities as aij (prob.
of going from state i to state j)
Graphical view
 Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. initial probability distribution for states
Essentially, a non-deterministic finite
state automaton.
Example continued
 If the weather today (Xt) is sunny, what’s the probability that
tomorrow (Xt+1) is sunny and the day after (Xt+2) is rainy?
P(Xt+1 = s, Xt+2 = r | Xt = s)
= P(Xt+2 = r | Xt+1 = s, Xt = s) × P(Xt+1 = s | Xt = s)
= P(Xt+2 = r | Xt+1 = s) × P(Xt+1 = s | Xt = s)    (by the Markov assumption)
= 0.05 × 0.8
= 0.04
Formal definition
 A Markov Model is a triple (S, Π, A) where:
 S is the set of states
 Π are the probabilities of being initially in some state
 A are the transition probabilities
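As a rough sketch of the triple in code (my own; the initial probabilities Π below are made-up values, since the lecture does not give any), the model can be run as a non-deterministic automaton to generate weather sequences:

```python
import random

# The triple (S, Pi, A) for the weather example
S = ["sunny", "rainy", "foggy"]                       # states
Pi = {"sunny": 0.5, "rainy": 0.25, "foggy": 0.25}     # hypothetical initial probabilities
A = {                                                 # transition probabilities a_ij
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def generate(T):
    """Sample a weather sequence of length T from the Markov model."""
    state = random.choices(S, weights=[Pi[s] for s in S])[0]
    seq = [state]
    for _ in range(T - 1):
        state = random.choices(S, weights=[A[state][s] for s in S])[0]
        seq.append(state)
    return seq

print(generate(7))  # e.g. ['sunny', 'sunny', 'foggy', ...]
```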
Hidden Markov Models
A slight variation on the example
 You’re locked in a room with no windows
 You can’t observe the weather directly
 You only observe whether the guy who brings you food is carrying an umbrella
or not
 Need a model telling you the probability of seeing the umbrella, given the weather
 distinction between observations and their underlying emitting state.
 Define:
 Ot as an observation at time t
 K = {+umbrella, -umbrella} as the possible outputs
 We’re interested in P(Ot=k|Xt=si)
 i.e. p. of a given observation at t given that the underlying weather state at t is si
Symbol emission probabilities
weather    Probability of umbrella
sunny      0.1
rainy      0.8
foggy      0.3
This is the hidden model, telling us the probability that
Ot = k given that Xt = si
We assume that each underlying state Xt = si emits an
observation with a given probability.
Using the hidden model
 Model gives: P(Ot=k|Xt=si)
 Then, by Bayes’ Rule we can compute: P(Xt=si|Ot=k)
P(Xt = si | Ot = k) = P(Ot = k | Xt = si) P(Xt = si) / P(Ot = k)
 Generalises easily to an entire sequence
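A small sketch of this inversion (my own illustration; the prior P(Xt = si) below is a made-up example, since the lecture does not fix one):

```python
# Emission probabilities P(umbrella | weather) from the table above
p_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Hypothetical prior over today's weather, P(X_t = s_i)
prior = {"sunny": 0.5, "rainy": 0.25, "foggy": 0.25}

# Bayes' rule: P(X_t = s_i | O_t = umbrella)
evidence = sum(p_umbrella[s] * prior[s] for s in prior)    # P(O_t = umbrella)
posterior = {s: p_umbrella[s] * prior[s] / evidence for s in prior}
print(posterior)  # rainy comes out as the most probable state given the umbrella
```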
HMM in graphics
 Circles indicate states
 Arrows indicate probabilistic dependencies between states
HMM in graphics
 Green nodes are hidden states
 Each hidden state depends only on the
previous state (Markov assumption)
Why HMMs?
 HMMs are a way of thinking of underlying events
probabilistically generating surface events.
 Example: Parts of speech
 a POS is a class or set of words
 we can think of language as an underlying Markov Chain of
parts of speech from which actual words are generated
(“emitted”)
 So what are our hidden states here, and what are the
observations?
HMMs in POS Tagging
[Diagram: hidden POS states DET, ADJ, N, V linked in a chain]
 Hidden layer (constructed through
training)
 Models the sequence of POSs in the
training corpus
HMMs in POS Tagging
[Diagram: the hidden states DET, ADJ, N, V emitting the observed words “the”, “tall”, “lady”, “is”]
 Observations are words.
 They are “emitted” by their corresponding
hidden state.
 The state depends on its previous state.
Why HMMs
 There are efficient algorithms to train HMMs using
Expectation Maximisation
 General idea:
 training data is assumed to have been generated by some HMM
(parameters unknown)
 try to learn the unknown parameters from the data
 Similar idea is used in finding the parameters of some n-gram
models, especially those that use interpolation.
Formalisation of a Hidden Markov model
Crucial ingredients (familiar)
 Underlying states: S = {s1,…,sN}
 Output alphabet (observations): K = {k1,…,kM}
 State transition probabilities:
A = {aij}, i,j ∈ S
 State sequence: X = (X1,…,XT+1)
+ a function mapping each Xt to a state s
 Output sequence: O = (O1,…,OT)
 where each Ot ∈ K
Crucial ingredients (additional)
 Initial state probabilities:
Π = {πi}, i ∈ S
(tell us the initial probability of each state)
 Symbol emission probabilities:
B = {bijk}, i,j ∈ S, k ∈ K
(tell us the probability b of seeing observation Ot=k, given
that Xt=si and Xt+1 = sj)
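Gathering the ingredients into one structure (a sketch of my own: the lecture defines arc emissions bijk, whereas for brevity this example uses the common simplification in which the emission depends only on the current state, and the Π values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """The model mu = (A, B, Pi) over states S and output alphabet K."""
    S: list    # hidden states
    K: list    # output alphabet (possible observations)
    A: dict    # transition probabilities A[i][j] = P(X_{t+1}=j | X_t=i)
    B: dict    # emission probabilities B[i][k] = P(O_t=k | X_t=i)
    Pi: dict   # initial probabilities Pi[i] = P(X_1=i)

weather_hmm = HMM(
    S=["sunny", "rainy", "foggy"],
    K=["+umbrella", "-umbrella"],
    A={"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
       "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
       "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}},
    B={"sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
       "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
       "foggy": {"+umbrella": 0.3, "-umbrella": 0.7}},
    Pi={"sunny": 0.5, "rainy": 0.25, "foggy": 0.25},  # hypothetical values
)
```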
Trellis diagram of an HMM
[Diagram: states s1, s2, s3, with transition arcs a1,1, a1,2, a1,3 leaving s1]
Trellis diagram of an HMM
[Diagram: the same states and transitions unrolled over time, aligned with the observation sequence o1, o2, o3 at times t1, t2, t3]
Trellis diagram of an HMM
[Diagram: the trellis again, with symbol emission probabilities b1,1,k, b1,2,k, b1,3,k added alongside the transitions, aligned with observations o1, o2, o3 at times t1, t2, t3]
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O|μ)?
2. Given an observation sequence O, and model μ, which is the state sequence (X1,…,XT+1) that best explains the observations?
 This is the decoding problem
3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
Application of question 1 (ASR)
 Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O|μ)?
 Input of an ASR system: a continuous stream of sound waves, which is ambiguous
 Need to decode it into a sequence of phones:
 is the input the sequence [n iy d] or [n iy]?
 which sequence is the most probable?
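To make question 1 concrete, here is a deliberately naive sketch (my own; it is exponential in the sequence length, which is precisely why the next part introduces better algorithms): the likelihood sums, over every possible state sequence, the joint probability of that sequence and the observations.

```python
from itertools import product

def likelihood(obs, S, A, B, Pi):
    """Naive P(O | mu): sum P(O, X | mu) over all state sequences X.
    Emissions here depend only on the current state (a common simplification)."""
    total = 0.0
    for X in product(S, repeat=len(obs)):          # every possible state sequence
        p = Pi[X[0]] * B[X[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[X[t - 1]][X[t]] * B[X[t]][obs[t]]
        total += p
    return total

# e.g. with the weather HMM sketched earlier:
# likelihood(["+umbrella", "-umbrella"], weather_hmm.S, weather_hmm.A, weather_hmm.B, weather_hmm.Pi)
```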
Application of question 2 (POS Tagging)
 Given an observation sequence O, and model μ, which is the state sequence (X1,…,XT+1) that best explains the observations?
 this is the decoding problem
 Consider a POS Tagger
 Input observation sequence:
 I can read
 need to find the most likely sequence of underlying POS tags:
 e.g. is can a modal verb, or the noun?
 how likely is it that can is a noun, given that the previous word is a pronoun?
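A matching naive sketch for question 2 (again my own; the tag set and model values in the comment are hypothetical placeholders): decoding picks the state sequence with the highest joint probability with the observations.

```python
from itertools import product

def decode(obs, S, A, B, Pi):
    """Naive decoding: argmax over all state sequences X of P(O, X | mu)."""
    best_seq, best_p = None, 0.0
    for X in product(S, repeat=len(obs)):
        p = Pi[X[0]] * B[X[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[X[t - 1]][X[t]] * B[X[t]][obs[t]]
        if p > best_p:
            best_seq, best_p = X, p
    return best_seq

# e.g. with a toy tagging model over S = ["PRON", "MD", "N", "V"] and suitable A, B, Pi,
# decode(["I", "can", "read"], S, A, B, Pi) would return the most likely tag sequence
```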
Summary
 HMMs are a way of representing:
 sequences of observations arising from
 sequences of states
 states are the variables of interest, giving rise to the
observations
 Next up:
 algorithms for answering the fundamental questions about
HMMs