White parts from: Technical overview for the machine-learning researcher – slides from the UAI 1999 tutorial
1
Important Distinctions in Learning BNs
• Complete data versus incomplete data
• Observed variables versus hidden variables
• Learning parameters versus learning structure
• Scoring methods versus conditional-independence-test methods
• Exact scores versus asymptotic scores
• Search strategies versus optimal learning of trees/polytrees/TANs
2
Assumptions of this lecture
The lecture today assumes: complete data, no hidden variables, exact scores, and general search.
3
The Maximum Likelihood approach:
maximize the probability of the data with respect to the unknown parameter(s).
Probability of the data: P(data | θ) = θ^#h (1−θ)^#t.
The maximum for binomial sampling data is obtained when θ = #h / (#h + #t).
An easier yet equivalent method is to maximize the log-likelihood function.
4
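A minimal sketch of the maximum-likelihood computation above, assuming Python with NumPy and a hypothetical ten-flip thumbtack record; it checks the closed-form estimate θ = #h / (#h + #t) against a numerical maximization of the log-likelihood.

```python
import numpy as np

# Hypothetical thumbtack record: 1 = heads, 0 = tails (7 heads, 3 tails).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
h = int(data.sum())
t = len(data) - h

# Closed-form maximum-likelihood estimate: theta = #h / (#h + #t).
theta_closed_form = h / (h + t)

# Equivalent route: maximize the log-likelihood log P(data | theta) over a grid.
thetas = np.linspace(0.001, 0.999, 999)
log_likelihood = h * np.log(thetas) + t * np.log(1.0 - thetas)
theta_numeric = thetas[np.argmax(log_likelihood)]

print(theta_closed_form, theta_numeric)   # both come out at 0.7
```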
The Bayesian approach:
Use a priori knowledge rather than data alone.
Encode uncertainty about the parameter.
Choose the median or average of the posterior as a point estimate.
5
(As before)
p(θ | data) ∝ p(θ) · p(data | θ)
6
If we had 100 more heads, the peak would move much further to the right.
If we had 50 heads and 50 tails, the peak would just sharpen considerably.
p(θ | data) = p(θ | hhth…ttth) ∝ p(θ) · p(hhth…ttth | θ)
p(θ | data) ∝ p(θ) · θ^#h · (1−θ)^#t, hence p(θ | data) = p(θ | #h, #t)
(#h, #t) are sufficient statistics for binomial sampling
7
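A small sketch of the posterior computation above, assuming Python with NumPy, a grid over θ, and a hypothetical flip record; note that the posterior depends on the record only through the sufficient statistics (#h, #t).

```python
import numpy as np

thetas = np.linspace(0.001, 0.999, 999)   # grid over the parameter theta
prior = np.ones_like(thetas)              # e.g. a flat prior p(theta)
prior /= prior.sum()

record = "hthhtthhhh"                     # hypothetical coin/thumbtack record
n_h, n_t = record.count("h"), record.count("t")

# p(theta | data) is proportional to p(theta) * theta^#h * (1 - theta)^#t
posterior = prior * thetas**n_h * (1.0 - thetas)**n_t
posterior /= posterior.sum()              # normalize over the grid

# Any record with the same (#h, #t) gives the same posterior:
# (#h, #t) are sufficient statistics for binomial sampling.
print(n_h, n_t, thetas[np.argmax(posterior)])
```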
Example: ht … htthh
9
From Prior to Posterior
Observation 1: If the prior is Beta(θ; a, b) and we have seen A heads and B tails, then the posterior is Beta(θ; a+A, b+B).
Consequence: If the prior is Beta(θ; a, b) and we use the point estimate θ = a/N, then after seeing the data our point estimate changes to θ = (a+A)/(N+N'), where N' = A+B.
So what is a good choice for the hyper-parameters {a, b} ?
For a random coin, maybe (100,100)?
For a random thumbtack maybe (7,3)?
a and b are imaginary counts, N = a+b is the equivalent sample size, while A and B are the data counts and N' = A+B is the data size.
15
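A sketch of Observation 1 and its Consequence, assuming Python with SciPy and hypothetical data counts; the prior hyper-parameters are the (7, 3) suggested above for a random thumbtack.

```python
from scipy import stats

a, b = 7, 3            # imaginary counts, e.g. the "random thumbtack" prior Beta(theta; a, b)
A, B = 40, 10          # hypothetical data: A heads and B tails
N, N_prime = a + b, A + B

prior = stats.beta(a, b)
posterior = stats.beta(a + A, b + B)      # Observation 1: Beta(theta; a+A, b+B)

# The point estimate (posterior mean) moves from a/N to (a+A)/(N+N').
print(prior.mean())        # a / N          = 0.7
print(posterior.mean())    # (a+A) / (N+N') = 47/60, roughly 0.783
```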
From Prior to Posterior in Blocks
Observation 2: If the prior is Dir(a1, …, an) and we have seen counts A1, …, An of the respective states, then the posterior is Dir(a1+A1, …, an+An).
Consequence: If the prior is Dir(a1, …, an) and we use the point estimate θ = (a1/N, …, an/N), then after seeing the data our point estimate changes to
θ = ((a1+A1)/(N+N'), …, (an+An)/(N+N')).
Note that the posterior distribution can be updated after each data point (that is, sequentially or "online") and that the posterior can serve as the prior for future data points.
21
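A sketch of Observation 2, assuming Python with NumPy, a hypothetical three-state variable, and made-up counts; it also illustrates the note above that the update can be applied one data point at a time, with each posterior serving as the prior for the next point.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 6.0])     # hypothetical Dirichlet prior Dir(a1, ..., an)
N = alpha.sum()                       # equivalent sample size N = a1 + ... + an

data = [2, 0, 2, 2, 1, 2, 0, 2]       # hypothetical observed states (indices 0 .. n-1)

# Batch update: add the data counts A1, ..., An to the prior parameters.
counts = np.bincount(data, minlength=len(alpha))
posterior = alpha + counts            # Dir(a1+A1, ..., an+An)

# Online update: one data point at a time, the posterior serving as the next prior.
online = alpha.copy()
for state in data:
    online[state] += 1
assert np.allclose(posterior, online)

# Point estimate moves from a_i/N to (a_i + A_i)/(N + N'), with N' = len(data).
print(alpha / N)
print(posterior / (N + len(data)))
```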
Another view of the update
Recall the Consequence that from θ = (a1/N, …, an/N) we move on, after seeing the data, to θ_i = (a_i + A_i)/(N + N') for each i.
This update can be viewed as a mixture of the prior and data estimates:
θ_i = [N/(N+N')] · (a_i/N) + [N'/(N+N')] · (A_i/N') = (a_i + A_i)/(N + N').
22
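A quick numerical check of the mixture view, assuming Python with NumPy and the same hypothetical prior and data counts as in the previous sketch.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 6.0])     # prior counts a_i (hypothetical)
counts = np.array([2.0, 1.0, 5.0])    # data counts A_i (hypothetical)
N, N_prime = alpha.sum(), counts.sum()

prior_estimate = alpha / N            # a_i / N
data_estimate = counts / N_prime      # A_i / N'

# theta_i = N/(N+N') * (a_i/N) + N'/(N+N') * (A_i/N')
mixture = (N / (N + N_prime)) * prior_estimate + (N_prime / (N + N_prime)) * data_estimate
direct = (alpha + counts) / (N + N_prime)

print(np.allclose(mixture, direct))   # True: the two forms agree
```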
Learning Bayes Net parameters
p(θ), p(h | θ) = θ and p(t | θ) = 1 − θ
23
Learning Bayes Net parameters
24
P(X = x | θ_x) = θ_x
P(Y = y | X = x, θ_y|x, θ_y|~x) = θ_y|x
P(Y = y | X = ~x, θ_y|x, θ_y|~x) = θ_y|~x
Global and local parameter independence ⇒ three separate, independent thumbtack estimation tasks, assuming complete data.
25
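A sketch of the decomposition above, assuming Python and a hypothetical complete-data sample for a two-node network X → Y with binary variables; thanks to global and local parameter independence, θ_x, θ_y|x, and θ_y|~x are estimated as three separate thumbtack-style counting tasks (shown here as maximum-likelihood estimates, without a prior).

```python
# Hypothetical complete data for a two-node network X -> Y; each pair is (x, y), 1 = true, 0 = false.
data = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1), (0, 0), (1, 0)]

# Task 1: theta_x from the X column alone.
theta_x = sum(x for x, _ in data) / len(data)

# Task 2: theta_{y|x} from the rows where X = 1.
y_when_x = [y for x, y in data if x == 1]
theta_y_given_x = sum(y_when_x) / len(y_when_x)

# Task 3: theta_{y|~x} from the rows where X = 0.
y_when_not_x = [y for x, y in data if x == 0]
theta_y_given_not_x = sum(y_when_not_x) / len(y_when_not_x)

print(theta_x, theta_y_given_x, theta_y_given_not_x)
```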