Probabilistic Model of Sequences

Download Report

Transcript Probabilistic Model of Sequences

Probabilistic Model of Sequences
Ata Kaban
The University of Birmingham
Sequence
•
•
•
•
•
•
•
•
Example1: a b a c a b a b a c
Example2: 1 0 0 1 1 0 1 0 0 1
Example3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3
Roll a six-sided die N times. You get a sequence.
Roll it again: You get another sequence.
Here is a sequence of characters, can you see it?
What is a sequence?
Alphabet1 = {a,b,c}, Alphabet2={0,1},
Alphabet3={1,2,3,4,5,6}
Probabilistic Model
• Model = system that simulates the sequence under
consideration
• Probabilistic model = model that produces
different outcomes with different probabilities
– It includes uncertainty
– It can therefore simulate a whole class of sequences &
assigns a probability to each individual sequence
• Could you simulate any of the sequences on the
previous slide?
Random sequence model
• Back to the die example (can possibly be loaded)
– Model of a roll: has 6 parameters: p1,p2,p3,p4,p5,p6
– Here, p_i is the probability of throwing i
– To be probabilities, these must be non-negative and must
sum to one.
– What is the probability of the sequence [1, 6, 3]?
p1*p6*p3
• NOTE: in the random sequence model, the
individual symbols in a sequence do not depend on
each other. This is the simplest sequence model.
Maximum Likelihood
parameter estimation
• The parameters of a probabilistic model are
typically estimated from large sets of trusted
examples, called training set.
• Example (t=tail, h=head) : [t t t h t h h t]
– Count up the frequencies: t5, h3
– Compute probabilities:
• p(t)=5/(5+3), p(h)=3/(5+3)
– These are the Maximum Likelihood (ML) estimates of
the parameters of the coin.
– Does it make sense?
– What if you know the coin is fair?
Overfitting
• A fair coin has probabilities p(t)=0.5, p(h)=0.5
• If you throw it 3 times and get [t, t, t], then the ML
estimates for this sequence are p(t)=1, p(h)=0.
• Consequently, from these estimates, the probability of e.g.
the sequence [h, t, h, t] = ………….
• This is an example of what is called overfitting. Overfitting
is the greatest enemy of Machine Learning!
• Solution1: get more data
• Solution2: build in what you already know into the model.
(Will return to it during the module)
Why is it called Maximum Likelihood?
• It can be shown that using the frequencies
to compute probabilities maximises the total
probability of all the sequences given the
model (the likelihood). P(Data|parameters)
Probabilities
• Have two dice D1 and D2
• The probability of rolling I given die D1 is called
P(i|D1). This is a conditional probability
• Pick a die at random with probability P(Dj), j=1 or
2. The probability for picking die Dj and rolling i
is is called joint probability and is
P(I,Dj)=P(Dj)P(I|Dj).
• For any events X and Y, P(X,Y)=P(X|Y)P(Y)
• If we know P(X,Y), then the so-called marginal
probability p(X) can be computed as P( X )   P( X , Y )
Y
• Now, we show that maximising P(Data|parameters) for the
random sequence model leads to the frequency-based
computation that we did intuitively.
Notations :
S : sequence of symbols s1 ,..., s L
L : length of sequence
T : size of alphabet
p (1),..., p (T ) : parameters ( probabilit ies )
xt : frequency of occurrence of symbol t , (t  1,...T )
L
T
l 1
t 1
P ( S | p (1),..., p (T ))   p ( sl )   p (t ) xt 
 maximise
 maximise the logarithm of the likelihood
L
T
l 1
t 1
log P ( S | p (1),..., p (T ))   log p ( sl )   xt log p (t )
T
Remember t hat p(t) need to be probabilit ies, so  p (t )  1
t 1
T
This constraint can be imposed by adding a Lagrangian term   ( p (t )  1)
t 1
T
T
t 1
t 1
Therefore we need to maximise Obj   xt log p(t )   ( p(t )  1)
Remember, at a maximum, the derivative of a function is zero.
x
x
Obj
 t    0  p(t )  t
p(t) p(t )

Now to compute  , multiply both sides by p(t) and add up both sides when t  1,...T
xt  p(t )
T
x
t 1
t
T
   p(t )  
t 1
So, p(t ) 
xt
T
x
t 1
t
Why did we bother?
Because in more complicated models we cannot ‘guess’ the
result.
Markov Chains
• Further examples of sequences:
– Bio-sequences
– Web page request sequences while browsing
• These are not anymore random sequences, but have a time-structure.
• How many parameters would such a model have?
• We need to make simplifying assumptions to end up with a reasonable
number of parameters
• The first order Markov assumption: the observation only depends on
the immediately previous one, no longer history
P( sl | sl 1 ,..., s1 )  P( sl | sl 1 )
• Markov Chain = sequence model which makes the Markov assumption
Markov Chains
• The probability of a Markov sequence:
P( S )  P( s1 ) P( s2 | s1 ) P( s3 | s2 )...P( s L | sL 1 )
L
 P( s1 ) P( sl | sl 1 )
l 2
• The alphabet’s symbols are also called states
• Once the parameters are estimated from training
data, the Markov chain can be used for prediction
• Amongst others, Markov Chains are successful for
web browsing behavior prediction
Markov Chains
• A Markov Chain is stationary if at any time,
it has the same transition probabilities.
P(st  i | st 1  j )  (shorthand )  P(i | j )
• We assume stationary models here.
• Then the parameters of the model consist of
the transition probability matrix & initial
state probabilities.
ML parameter estimation
• We can derive how to compute the
parameters of a Markov Chain from data,
using Maximum Likelihood, as we did for
random sequences.
• The ML estimate of the transition matrix
will be again very intuitive:
xij
Remember that
P(i | j )  T
 P(i | j )  1, j
 xij
T
i 1
i 1
Simple example
• If it is raining today, it will rain tomorrow with probability
0.8 implies the contrary has probability 0.2
• If it is not raining today, it will rain tomorrow with
probability 0.6 implies the contrary has probability 0.4
• Build the transition matrix
P(no rain tomorrow | rain today)   0.8 0.2 
 P(rain tomorrow | rain today)

  

P
(
rain
tomorrow
|
no
rain
today
)
P
(
no
rain
tomorrow
|
no
rain
today
)
0
.
6
0
.
4

 

• Be careful which numbers need to sum to one and which
don’t. Such a matrix is called stochastic matrix.
• Q: It rained all week, including today. What does this
model predict for tomorrow? Why? What does it predict
for a day from tomorrow? (*Homework)
Examples of Web Applications
• HTTP request prediction:
– To predict the probabilities of the next requests from the same user
based on the history of requests from that client.
• Adaptive Web navigation:
– To build a navigation agent which suggests which other links
would be of interest to the user based on the statistics of previous
visits.
– The predicted link does not strictly have to be a link present in the
Web page currently being viewed.
• Tour generation:
– Is given as input the starting URL and generates a sequence of
states (or URLs) using the Markov chain process.
Building Markov Models from
Web Log Files
• A Web log file is a collection of records of user requests
for documents on a Web site, an example:
177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html
HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0
(compatible; MSIE 4.01; Windows 95)"
• Transition matrix can be seen as a graph
– Link pair: (r - referrer, u - requested page, w - hyperlink
weight)
– Link graph: it is called the state diagram of the MarkovChain
• a directed weighted graph
• a hierarchy from the homepage down to multiple levels
Link Graph: an example (University of Ulster site)
Zhu et al. 2002
1
University of Ulster
1800
2700
Department
720
880
CS
5
3
200
Science
&Arts
810
International
Office
6
9000
Start
Information
2
4500
Student
300
4
S
2390
1800
Undergraduate
Library
8
7
72
2390
2400
Graduate
10
9
1800
2400
9000
648
880
600
State diagram:
282
E
- Nodes: states
Jobs
Exit
11
2128
12
Register
2128
- Weighted arrows:
number of transitions
Experimental Results
(Sarukkai, 2000)
• Simulations :
– ‘Correct link’ refers to the actual link chosen at the next step.
– ‘depth of the correct link’ is measured by counting the umber of
links which have a probability greater than or equal to the correct
link.
– Over 70% of correct links are in the top 20 scoring states.
– Difficulties: very large state space
Simple exercise
• Build the Markov transition matrix of the following
sequence:
[a b b a c a b c b b d e e d e d e d]
State space: {…………….}
 P(a | a) P(b | a) P(c | a) P(d | a) P(e | a)   0

 
 P(a | b) P(b | b) P(c | b) P(d | b) P(e | b)  
 P(a | c) P(b | c) P(c | c) P(d | c) P(e | c)   

 
P
(
a
|
d
)
P
(
b
|
d
)
P
(
c
|
d
)
P
(
d
|
d
)
P
(
e
|
d
)

 
 P(a | e) P(b | e) P(c | e) P(d | e) P(e | e)  

 








Further topics
• Hidden Markov Model
– Does not make the Markov assumption on the observed
sequence
– Instead, it assumes that the observed sequence was
generated by another sequence which is unobservable
(hidden), and this other sequence is assumed to be
Markovian
– More powerful
– Estimation is more complicated
• Aggregate Markov model
– Useful for clustering sub-graphs of a transition graph
K
P( st | st 1 )   P( st | k ) P(k | st 1 )
k 1
HMM at an intuitive level
• Suppose that we know all the parameters of the following
HMM, as shown on the state-diagram below. What is the
probability of observing the sequence [A,B] if the initial
state is S1? The same question if the initial state is chosen
randomly with equal probabilities.
ANSWER:
If the initial state is S1:
0.2*(0.4*0.8+0.6*0.7) = 0.148.
In the second case:
0.5*0.148+0.5*0.3*(0.3*0.7+0.7
*0.8) = 0.1895.
Conclusions
• Probabilistic Model
• Maximum Likelihood parameter estimation
• Random sequence model
• Markov chain model
--------------------------------• Hidden Markov Model
• Aggregate Markov Model
Any questions?