BayesLearn273AFall06


Sample Midterm question.
Sue wants to build a model to predict movie ratings. She has a matrix of data,
where for M movies and U users she has collected ratings between 1 and 5.
Not all users have rated all movies; missing entries are indicated by a "?".
Sue has built a model that can predict unrated movies for any user u in the database.
The model always predicts a real number, q, between 1 and 5. To further optimize her predictions
she decides to train another model that maps her predictions q to new rating estimates q'.
To train this model she uses all observed ratings in the database (i.e. ignoring "?")
and fits a neural network as follows:
C = Σ_i | r_i − q'_i |    with    q'_i = a·q_i + b
Here “i” runs over all observed ratings, and |.| denotes the absolute value.
1) Derive the gradients
dC /da , dC / db
2) Give pseudo-code for a stochastic gradient descent algorithm for learning “a” and “b”.
3) Given a fixed step-size, explain whether this algorithm will converge after infinitely many gradient updates.
4) Jimmy has another algorithm that also predicts ratings. Sue and Jimmy decide to combine their
models and compute a bagged estimate. Calling Sue's prediction q_sue and Jimmy's prediction q_jim,
give an expression for a combined prediction using bagging.
5) Explain whether bagging increases or decreases variance and why.
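Parts 1–2 can be sketched as follows (my own sketch, not a model answer): using subgradients of the absolute loss, dC/da = Σ_i sign(q'_i − r_i)·q_i and dC/db = Σ_i sign(q'_i − r_i), and a stochastic update processes one observed rating at a time. All data below is made up.

```python
import random

def sgd_fit(q, r, eta=0.01, epochs=100, seed=0):
    """Fit q' = a*q + b by stochastic (sub)gradient descent on C = sum_i |r_i - q'_i|."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0                      # start at the identity map q' = q
    idx = list(range(len(q)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            pred = a * q[i] + b
            # subgradient of |r_i - pred| w.r.t. pred is sign(pred - r_i) (0 at the kink)
            s = (pred > r[i]) - (pred < r[i])
            a -= eta * s * q[i]          # per-sample dC/da = sign(q'_i - r_i) * q_i
            b -= eta * s                 # per-sample dC/db = sign(q'_i - r_i)
    return a, b

# toy data: ratings that truly satisfy r = 0.5*q + 1.0 (hypothetical numbers)
qs = [1.0, 2.0, 3.0, 4.0, 5.0, 2.5, 3.5]
rs = [0.5 * x + 1.0 for x in qs]
a, b = sgd_fit(qs, rs, eta=0.005, epochs=2000)
initial_loss = sum(abs(r0 - q0) for q0, r0 in zip(qs, rs))           # loss at a=1, b=0
final_loss = sum(abs(r0 - (a * q0 + b)) for q0, r0 in zip(qs, rs))
```

With a fixed step size the iterates keep jittering in a small neighborhood of the optimum rather than settling exactly, which is what part 3 is probing.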
Bayesian Learning
Instructor: Max Welling
Read chapter 6 in book.
Probabilities
Building models with probability distributions is important because:
• We can naturally include prior knowledge
• We can naturally encode uncertainty
• We can build models that are naturally protected against overfitting.
• We define multivariate probability distributions over discrete sample spaces by
P(x, y, ...)   with   Σ_{x,y,...} P(x, y, ...) = 1
• Probability densities are different beasts. They are defined over continuous
sample spaces and we have
∫ dx dy ... P(x, y, ...) = 1
Can P(x) > 1 for probability densities? How about discrete distributions?
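As a quick numeric sanity check (my own example, not from the lecture): a uniform density on [0, 0.1] takes the value 10 everywhere on that interval yet still integrates to 1, while a discrete distribution can never assign probability greater than 1 to any outcome.

```python
# Uniform density on [0, 0.1]: p(x) = 10 inside the interval, 0 outside.
def p(x):
    return 10.0 if 0.0 <= x <= 0.1 else 0.0

# Riemann-sum check that the density integrates to (approximately) 1
# even though p(x) > 1 on its support.
dx = 1e-4
integral = sum(p(k * dx) * dx for k in range(int(1.0 / dx)))
```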
Conditional Distributions
• A conditional distribution P(x|y) expresses the remaining uncertainty in x
after we know the value for y.
• Bayes rule:
P(x, y) = P(x|y) P(y) = P(y|x) P(x)   ⇒   P(x|y) = P(y|x) P(x) / P(y)
where the normalizer follows by marginalization:
P(y) = Σ_x P(y|x) P(x)
Useful for assessing diagnostic probability from causal probability:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck: P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
Note1: even though the probability of having a stiff neck given meningitis is very large (0.8), the
posterior probability of meningitis given a stiff neck is still very small (why?).
Note2: P(s|m) only depends on meningitis (a stable fact), but P(m|s) depends on whether e.g. the flu is
around.
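The arithmetic of the stiff-neck example can be checked directly (a trivial sketch; the numbers are the ones quoted above):

```python
# Numbers from the meningitis / stiff-neck example.
p_s_given_m = 0.8     # P(s|m): stiff neck given meningitis
p_m = 0.0001          # P(m): prior probability of meningitis
p_s = 0.1             # P(s): marginal probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes rule: P(m|s)
```

The posterior is tiny because the prior P(m) is four orders of magnitude smaller than P(s), which answers the "why?" above.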
(Conditional) Independence
• There are two equivalent ways you can test for independence between two
random variables:
P(x, y) = P(x) P(y)   ⇔   P(x|y) = P(x)   ⇔   P(y|x) = P(y)
• Conditional independence is a very powerful modeling assumption. It says:
P(x, y | z) = P(x|z) P(y|z)
• Note that this does not mean that P(x,y) = P(x)P(y); x and y are only
independent given the third variable z.
Example C.I.
• Asthma and lung cancer are not
independent (more people with asthma
also suffer from lung cancer).
• However, there is a third cause that
explains why: smog causes both asthma
and lung cancer.
• Given that we know the presence of
smog, asthma and lung cancer become
independent.
• This type of independency can be represented
graphically using a graphical model.
[Figure: graphical model with "smog" as the parent of "asthma" and "lung cancer"]
Bayesian Networks
[Figure: graphical model with "smog" as the parent of "asthma" and "lung cancer"]
• To every graphical model corresponds
a probability distribution:
P(As, Lc, Sm) = P(As|Sm) P(Lc|Sm) P(Sm)
• More generally:
P(x1, x2, ...) = Π_i P(xi | Parents[xi])
[Figure: graphical model with "earthquake" and "bike falls" as the parents of "alarm"]
• To every graphical model corresponds a list
of (conditional) independency relations that
we can either read off from the graph, or prove
using the corresponding expression.
• In this example we have:
P(Al, Eq, Bf) = P(Al|Eq, Bf) P(Eq) P(Bf)
(Prove this.)
This implies marginal independence between
Eq and Bf: P(Eq, Bf) = P(Eq) P(Bf)
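A small numeric sketch of this implication, using made-up conditional probability tables for binary Eq, Bf, Al: summing the factorized joint over Al leaves exactly P(Eq)P(Bf).

```python
from itertools import product

# Hypothetical CPTs for binary variables Eq, Bf, Al (values 0/1).
p_eq = {0: 0.99, 1: 0.01}
p_bf = {0: 0.9, 1: 0.1}
p_al1 = {(0, 0): 0.001, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}  # P(Al=1|Eq,Bf)

def joint(al, eq, bf):
    """Factorized joint P(Al,Eq,Bf) = P(Al|Eq,Bf) P(Eq) P(Bf)."""
    p = p_al1[(eq, bf)] if al == 1 else 1.0 - p_al1[(eq, bf)]
    return p * p_eq[eq] * p_bf[bf]

# P(Eq,Bf) obtained by summing out Al, compared against P(Eq)P(Bf).
max_gap = max(
    abs(sum(joint(al, eq, bf) for al in (0, 1)) - p_eq[eq] * p_bf[bf])
    for eq, bf in product((0, 1), repeat=2)
)
```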
Naive Bayes Classifier
[Figure: class label y with attributes x1, ..., x5 as its children]
P(x1, x2, x3, x4, x5, y) = [ Π_{i=1}^{5} P(xi|y) ] P(y)
NB Classifier
• First we learn the conditional probabilities P(xi|y) and P(y).
• To classify we use Bayes rule and maximize over y:
P(y | x1, ..., xK) = [ Π_{i=1}^{K} P(xi|y) ] P(y) / P(x1, ..., xK)
• We can equivalently maximize:
argmax_y [ Σ_{i=1}^{K} log P(xi|y) + log P(y) ]
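The log-space argmax can be sketched as follows, with made-up tables for two classes and three binary attributes (none of these numbers come from the lecture):

```python
import math

# Hypothetical NB model: 2 classes, 3 binary attributes.
prior = {0: 0.6, 1: 0.4}          # P(y)
p_attr = {                        # P(x_i = 1 | y)
    0: [0.1, 0.7, 0.5],
    1: [0.8, 0.2, 0.5],
}

def classify(x):
    """Return argmax_y [ sum_i log P(x_i|y) + log P(y) ]."""
    best_y, best_score = None, -math.inf
    for y in prior:
        score = math.log(prior[y])
        for xi, p1 in zip(x, p_attr[y]):
            score += math.log(p1 if xi == 1 else 1.0 - p1)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

Working in log space avoids underflow when the product runs over many attributes.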
Example: Text
• Data consists of documents from a certain class y.
x_i is a count of the number of times word "i" is present
in the document (bag-of-words).
• We define the probability that a word in a document of class c is equal to
vocabulary word "i" to be:
P(w = i | y = c) = q_ic,   with   Σ_i q_ic = 1  for every c
• Also, the probability that a document is from class c is given by:
P(y = c) = α_c,   with   Σ_c α_c = 1
• The probability of a document is then:
P({x_i}, y = c) = [ Π_i q_ic^{x_i} ] α_c
• So, classification for a new test document (with unknown c) boils down to:
argmax_c [ Σ_i x_i^test log q_ic + log α_c ]
(W is the # of words in the test doc.)
Learning NB
• One can maximize the log-probability of the data under the model:
argmax_{q, α}  Σ_doc Σ_c y_{doc,c} [ Σ_i x_i^doc log q_ic + log α_c ]
(here y_{doc,c} = 1 if the document belongs to class c, and 0 otherwise)
• Taking derivatives and imposing the normalization constraints, one finds:
α_c = ( Σ_doc y_{doc,c} ) / N = D_c / D
q_ic = ( Σ_doc x_i^doc y_{doc,c} ) / W_c = W_ic / W_c
with
D_c = # docs in class c,   D = N = total # docs
W_ic = # times word "i" is found in docs of class "c"
W_c = total # words in docs of class "c"
• So learning is really easy: it's just counting!
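The counting estimates above can be sketched on a tiny made-up corpus (the word ids and class labels are hypothetical):

```python
from collections import Counter

# Hypothetical corpus: each doc is (list of word ids, class label).
docs = [([0, 0, 1], "sport"), ([0, 2], "sport"), ([2, 2, 1], "politics")]

D = len(docs)                                     # total # docs
D_c = Counter(label for _, label in docs)         # D_c: # docs in class c
W_ic = {c: Counter() for c in D_c}                # W_ic: per-class word counts
for words, c in docs:
    W_ic[c].update(words)
W_c = {c: sum(cnt.values()) for c, cnt in W_ic.items()}  # total words per class

alpha = {c: D_c[c] / D for c in D_c}                     # alpha_c = D_c / D
q = {c: {i: W_ic[c][i] / W_c[c] for i in W_ic[c]} for c in D_c}  # q_ic = W_ic / W_c
```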
Smoothing
• With a large vocabulary, there may not be enough documents in class c to have
every word in the data. E.g. the word mouse was not encountered in documents
on computers.
• This means that when we happen to encounter a test document on computers
that mentions the word mouse, the probability of it belonging to the class
computers is 0.
• This is precisely over-fitting (with more data this would not have happened).
• Solution: smoothing (Laplace correction):
α_c = ( D_c + d·α_c^0 ) / ( D + d )
q_ic = ( W_ic + w_c·q_ic^0 ) / ( W_c + w_c )
with
d = # of imaginary extra docs
α_c^0 = smooth a-priori estimate of α_c
w_c = # of imaginary extra words in class c
q_ic^0 = smooth a-priori estimate of q_ic
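A sketch of the smoothed estimates with uniform a-priori estimates (α_c^0 = 1/C, q_ic^0 = 1/V) and made-up counts; the unseen word now receives a small but nonzero probability:

```python
# All counts below are hypothetical.
V, C = 1000, 2          # vocabulary size, number of classes
d, w_c = 2.0, float(V)  # imaginary extra docs / imaginary extra words per class

D, D_c = 100, 60        # total docs, docs in class c
W_c = 5000              # total words seen in docs of class c
W_ic = 0                # the word "mouse" was never observed in class c

alpha_c = (D_c + d * (1.0 / C)) / (D + d)       # smoothed alpha_c
q_ic = (W_ic + w_c * (1.0 / V)) / (W_c + w_c)   # smoothed q_ic, > 0 for unseen words
```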
Bayesian Networks
• If all variables are observed
learning just boils down to counting
• Sometimes variables are never observed.
These are called "hidden" or "latent" variables.
• Learning is now a lot harder, because
plausible fill-in values for these variables
need to be inferred.
• BNs are very powerful expert systems.
Full Bayesian Approaches
• The idea is to not fit anything (so you can’t over-fit).
• Instead we consider our parameters as random variables.
• If we place a prior distribution P(θ) on the parameters, we can simply
integrate them out (now they are gone!):
∫ dθ P(x|θ) P(θ) = P(x)
• Remember though that bad priors lead to bad models, it’s not a silver bullet.
• In the limit of large numbers of data-items, one can derive the MDL penalty:
LL − (1/2) · (# parameters) · log(N)
• Computational overhead for full Bayesian approaches can be large.
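A numeric sketch of integrating out a parameter (my own toy example): for a coin with P(x = heads | θ) = θ and a uniform prior over θ, discretized on a midpoint grid, the marginal P(x = heads) comes out to 1/2.

```python
# Discretize theta in (0,1) on a midpoint grid with uniform prior P(theta) = 1/N.
# The coin shows heads with P(x=1|theta) = theta; sum out theta to get P(x=1).
N = 100_000
p_x = sum(((k + 0.5) / N) * (1.0 / N) for k in range(N))
```

No single θ is ever "fit" here; uncertainty about the parameter is averaged over instead.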
Conclusions
• Bayesian learning is learning with probabilities and using Bayes rule.
• Full Bayesian “learning” marginalizes out parameters
• Naive Bayes models are "generative models" in that you imagine how data
is generated and then invert it using Bayes rule to classify new data.
Separate models are trained for each class!
• All other classifiers seen so far are discriminative. Decision surfaces were
trained on all classes jointly. With many data-items discriminative is expected
to do better, but for small datasets generative (NB) is better.
• When we don’t know the class label y in NB, we have a hidden variable.
Clustering is like fitting a NB model with hidden class label. MoG uses
Gaussian conditional distributions. If we use discrete distributions (q) we are
fitting a “mixture of multinomials”. This is a good model to cluster text documents.