slides lec11,12


Sample Midterm question.
Sue wants to build a model to predict movie ratings. She has a matrix of data,
where for M movies and U users she has collected ratings between 1 and 5.
Not all users have rated all movies; these entries are indicated by a “?”.
Sue has built a model that can predict unrated movies for any user u in the database.
The model always predicts a real number, q, between 1 and 5. To further optimize her predictions
she decides to train another model that maps her predictions q to new rating estimates q’.
To train this model she uses all observed ratings in the database (i.e. ignoring “?”)
and fits a neural network as follows:
$$C = \sum_i \big|\, r_i - q'_i \,\big| \qquad \text{with} \qquad q'_i = a\, q_i + b$$
Here “i” runs over all observed ratings, and |.| denotes the absolute value.
1) Derive the gradients dC/da and dC/db.
2) Give pseudo-code for a stochastic gradient descent algorithm for learning “a” and “b” (a sketch follows after question 5).
3) Given a fixed step-size, explain whether this algorithm will converge after infinitely many gradient updates.
4) Jimmy has another algorithm that also predicts ratings. Sue and Jimmy decide to combine their
models and compute a bagged estimate. Calling Sue’s prediction q_sue and Jimmy’s prediction q_jim,
give an expression for a combined prediction using bagging.
5) Explain whether bagging increases or decreases variance and why.
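A minimal sketch of one possible answer to question 2 (not part of the slides or a solution key): stochastic gradient descent on C = Σ_i |r_i − (a·q_i + b)|. The step size `eta`, the initialization, and the convention of using subgradient 0 at the kink are my own assumptions.

```python
import random

def sgd_fit(ratings, predictions, eta=0.01, epochs=10):
    """Fit q' = a*q + b by minimizing C = sum_i |r_i - (a*q_i + b)| with SGD."""
    a, b = 1.0, 0.0                      # assumed initialization: start at the identity map
    idx = list(range(len(ratings)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            r, q = ratings[i], predictions[i]
            err = r - (a * q + b)
            s = (err > 0) - (err < 0)    # sign(r_i - q'_i); subgradient 0 at the kink
            # per-example gradients: dC_i/da = -s * q_i,  dC_i/db = -s
            a -= eta * (-s * q)
            b -= eta * (-s)
    return a, b

print(sgd_fit([5, 3, 4], [4.2, 2.8, 3.9]))
```

With a fixed step size the updates keep bouncing by a constant amount near the optimum, which is what question 3 is probing.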
Bayesian Learning
Instructor: Max Welling
Read chapter 6 in book.
Probabilities
Building models with probability distributions is important because:
• We can naturally include prior knowledge
• We can naturally encode uncertainty
• We can build models that are naturally protected against overfitting.
• We define multivariate probability distributions over discrete sample spaces by $P(x, y, \ldots)$ with
$$\sum_{x, y, \ldots} P(x, y, \ldots) = 1$$
• Probability densities are different beasts. They are defined over continuous sample spaces and we have
$$\int dx\, dy \ldots\, P(x, y, \ldots) = 1$$
Can P(x) > 1 for probability densities? How about discrete distributions?
Conditional Distributions
• A conditional distribution $P(x \mid y)$ expresses the remaining uncertainty in x after we know the value for y.
• Bayes rule:
$$P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x) \;\;\Rightarrow\;\; P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} = \frac{P(y \mid x)\,P(x)}{\sum_x P(y \mid x)\,P(x)}$$
Useful for assessing diagnostic probability from causal probability:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck: P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
Note1: even though the probability of having a stiff neck given meningitis is very large (0.8), the
posterior probability of meningitis given a stiff neck is still very small (why?).
Note2: P(s|m) only depends on meningitis (a stable fact), but P(m|s) depends on whether e.g. the flu is
around.
(Conditional) Independence
• There are two equivalent ways you can test for independence between two random variables:
$$P(x, y) = P(x)\,P(y) \;\Leftrightarrow\; P(x \mid y) = P(x) \;\Leftrightarrow\; P(y \mid x) = P(y)$$
• Conditional independence is a very powerful modeling assumption. It says:
$$P(x, y \mid z) = P(x \mid z)\,P(y \mid z) \;\Leftrightarrow\; P(x \mid y, z) = P(x \mid z) \;\Leftrightarrow\; P(y \mid x, z) = P(y \mid z)$$
• Note that this does not mean that P(x,y) = P(x)P(y): x and y are only independent given the third variable z.
Example C.I.
• Asthma and lung cancer are not independent (more people with asthma also suffer from lung cancer).
• However, there is a third cause that explains why: smog causes both asthma and lung cancer.
[Graphical model: smog → asthma, smog → lung cancer]
• Given that we know the presence of smog, asthma and lung cancer become independent.
• This type of independence can be represented graphically using a graphical model.
Bayesian Networks
[Graphical model: smog → asthma, smog → lung cancer]
• To every graphical model corresponds a probability distribution:
$$P(As, Lc, Sm) = P(As \mid Sm)\,P(Lc \mid Sm)\,P(Sm)$$
• More generally:
$$P(x_1, x_2, \ldots) = \prod_i P(x_i \mid \mathrm{Parents}[x_i])$$
[Graphical model: earthquake → alarm ← cat enters house]
• To every graphical model corresponds a list of (conditional) independence relations that we can either read off from the graph, or prove using the corresponding expression.
• In this example we have:
$$P(Al, Eq, Ct) = P(Al \mid Eq, Ct)\,P(Eq)\,P(Ct)$$
(Prove this.) This implies marginal independence between Eq and Ct: $P(Eq, Ct) = P(Eq)\,P(Ct)$.
Explaining Away
[Graphical model: earthquake → alarm ← cat enters house]
• If we don’t know whether the alarm went off, Eq and Ct are independent:
$$P(Eq, Ct) = P(Eq)\,P(Ct)$$
• If we observe “alarm goes off”, are Eq and Ct still independent in this case?
$$P(Eq, Ct \mid Al) \;\overset{?}{=}\; P(Eq \mid Al)\,P(Ct \mid Al)$$
• Answer: no!
So, the alarm went off. Since earthquakes are very unlikely, you thought it
must have been the cat again touching the alarm: $P(Eq = 1 \mid Al = 1) = 0.0001$.
However, now you observe that the cat was with friends (you now observe
information about Ct). Do you now think an earthquake was more likely?
(Note: alarms can also go off at random.)
$$P(Eq = 1 \mid Ct = 0, Al = 1) = \;?$$
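A small numerical illustration of explaining away, not from the slides: the conditional probability tables below are made-up values chosen only to make the effect visible, and brute-force enumeration of the joint is just one convenient way to compute the conditionals.

```python
from itertools import product

# Hypothetical priors and alarm model (illustrative values only).
P_Eq = {1: 0.0001, 0: 0.9999}          # earthquakes are very unlikely
P_Ct = {1: 0.3,    0: 0.7}             # the cat often wanders near the alarm
P_Al_given = {                          # P(Al=1 | Eq, Ct); alarms can also go off at random
    (0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95,
}

def joint(eq, ct, al):
    """P(Eq, Ct, Al) = P(Al | Eq, Ct) P(Eq) P(Ct)."""
    p_al1 = P_Al_given[(eq, ct)]
    return (p_al1 if al == 1 else 1 - p_al1) * P_Eq[eq] * P_Ct[ct]

def conditional(eq_val, evidence):
    """P(Eq = eq_val | evidence) by brute-force enumeration of the joint."""
    num = den = 0.0
    for eq, ct, al in product((0, 1), repeat=3):
        assignment = {"Eq": eq, "Ct": ct, "Al": al}
        if all(assignment[k] == v for k, v in evidence.items()):
            p = joint(eq, ct, al)
            den += p
            if eq == eq_val:
                num += p
    return num / den

print(conditional(1, {"Al": 1}))             # small: the cat explains the alarm
print(conditional(1, {"Al": 1, "Ct": 0}))    # much larger: with the cat ruled out, Eq becomes more likely
```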
Naive Bayes Classifier
[Graphical model: class label y with arrows to attribute 1, ..., attribute 5]
$$P(x_1, x_2, x_3, x_4, x_5, y) = \prod_{i=1}^{5} P(x_i \mid y)\; P(y)$$
NB Classifier
• First we learn the conditional probabilities $P(x_i \mid y)$ and $P(y)$ (later).
• To classify we use Bayes rule and maximize over y:
$$P(y \mid x_1, \ldots, x_K) = \frac{\prod_{i=1}^{K} P(x_i \mid y)\; P(y)}{P(x_1, \ldots, x_K)}$$
• We can equivalently solve:
$$\operatorname{argmax}_y \left[ \sum_{i=1}^{K} \log P(x_i \mid y) + \log P(y) \right]$$
Multinomial Distribution
• For count data the multinomial distribution is often appropriate:
$$P(x_1 = n_1, x_2 = n_2, \ldots, x_D = n_D) = \frac{N!}{n_1!\, n_2! \cdots n_D!}\; \alpha_1^{n_1}\, \alpha_2^{n_2} \cdots \alpha_D^{n_D}$$
$$\sum_{i=1}^{D} n_i = N, \quad \sum_{i=1}^{D} \alpha_i = 1, \quad E[x_i] = N\alpha_i, \quad \mathrm{Var}(x_i) = N\alpha_i(1 - \alpha_i), \quad \mathrm{Cov}(x_i, x_j) = -N\alpha_i\alpha_j$$
• Example: a,a,b,a,b,c,b,a,a,b with $\alpha_a = 0.6,\ \alpha_b = 0.3,\ \alpha_c = 0.1$.
The probability of this particular sequence is: Q = 0.6×0.6×0.3×0.6×0.3×0.1×0.3×0.6×0.6×0.3.
The probability of a sequence with 5 a’s, 4 b’s and 1 c is: 10!/(5! 4! 1!) × Q (see the numerical check below).
• Longer sequences have less probability because there are more of them and:
$$\sum_{\substack{n_1, n_2, \ldots, n_D \\ \text{s.t. } \sum_i n_i = N}} P(x_1 = n_1, \ldots, x_D = n_D) = 1$$
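A quick numerical check of the example above (a minimal sketch; it only uses the counts 5, 4, 1 and the parameters 0.6, 0.3, 0.1 from the slide):

```python
from math import factorial

alpha = {"a": 0.6, "b": 0.3, "c": 0.1}
sequence = list("aababcbaab")            # the example sequence a,a,b,a,b,c,b,a,a,b

# Probability of this particular ordered sequence.
Q = 1.0
for s in sequence:
    Q *= alpha[s]

# Probability of *any* sequence with 5 a's, 4 b's and 1 c (multinomial).
counts = {"a": 5, "b": 4, "c": 1}
N = sum(counts.values())
coef = factorial(N)
for n in counts.values():
    coef //= factorial(n)                # 10!/(5!4!1!) = 1260

print(Q)            # ~6.30e-05
print(coef * Q)     # ~0.0794, the multinomial probability
```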
Example: Text
• Data consists of documents from a certain class y.
$x_i$ is a count of the number of times word “i” is present in the document.
We can imagine that we throw all words in a certain class on one big pile
and forget about the particular document they came from (it’s like one very long document).
We use multinomials, but “forget” about the counting factor (it doesn’t matter).
Every word can be thought of as a sample from the multinomial.
• We describe the probability that a word in a document in class y is equal to vocabulary word “i” to be:
$$P(w = i \mid y = c) = q_{ic}, \qquad \sum_i q_{ic} = 1 \;\;\forall c$$
• Also, the probability that a document is from class c is given by:
$$P(y = c) = \pi_c, \qquad \sum_c \pi_c = 1$$
• The probability of a document is (in a given word order):
$$P(\{x_i\}, y = c) = \prod_i q_{ic}^{\,x_i}\; \pi_c$$
• So, classification for a new test document (with unknown c) boils down to:
$$\operatorname{argmax}_c \left[ \sum_i x_i^{test} \log q_{ic} + \log \pi_c \right]$$
Learning NB
• One can maximize the log-probability of the data under the model:
$$\operatorname{argmax}_{q,\,\pi} \; \sum_{doc,\,c} \delta_{y_{doc},\,c} \left( \sum_i x_{i,doc} \log q_{ic} + \log \pi_c \right)$$
• Taking derivatives and imposing the normalization constraints, one finds:
$$\pi_c = \frac{\sum_{doc} \delta_{y_{doc},\,c}}{D} = \frac{D_c}{D} = \frac{\#\text{docs in class } c}{\text{total } \#\text{docs}}$$
$$q_{ic} = \frac{\sum_{doc} \delta_{y_{doc},\,c}\; x_{i,doc}}{W_c} = \frac{W_{ic}}{W_c} = \frac{\#\text{times word “}i\text{” is found in docs of class “}c\text{”}}{\text{total } \#\text{words in docs of class “}c\text{”}}$$
• So learning is really easy: It’s just counting!
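A minimal sketch of the counting estimates above together with the earlier argmax classification rule. The data layout (each document as a class label plus a list of words), the variable names, and the tiny toy corpus are illustrative, not from the slides:

```python
import math
from collections import Counter, defaultdict

def learn_nb(docs):
    """docs: list of (class_label, [word, word, ...]).
    Returns pi[c] = D_c / D and q[c][word] = W_ic / W_c (no smoothing yet)."""
    D = len(docs)
    D_c = Counter(label for label, _ in docs)              # docs per class
    W = defaultdict(Counter)                               # W[c][word] = W_ic
    for label, words in docs:
        W[label].update(words)
    pi = {c: D_c[c] / D for c in D_c}
    q = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()} for c, cnt in W.items()}
    return pi, q

def classify(words, pi, q, floor=1e-12):
    """argmax_c [ sum_i x_i log q_ic + log pi_c ]; unseen words get a tiny floor
    probability here just so the sketch runs (smoothing is the proper fix)."""
    counts = Counter(words)
    scores = {c: math.log(pi[c]) + sum(n * math.log(q[c].get(w, floor))
                                       for w, n in counts.items())
              for c in pi}
    return max(scores, key=scores.get)

pi, q = learn_nb([("sports", ["ball", "goal", "ball"]),
                  ("politics", ["vote", "law", "vote"])])
print(classify(["goal", "ball", "ball"], pi, q))   # -> "sports"
```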
Smoothing
• With a large vocabulary, there may not be enough documents in class c to have
every word in the data. E.g. the word mouse was not encountered in documents
on computers.
• This means that when we happen to encounter a test document on computers
that mentions the word mouse, the probability of it belonging to the class
computers is 0.
• This is precisely over-fitting (with more data this would not have happened).
• Solution: smoothing (Laplace correction):
$$\pi_c = \frac{D_c + d\,\tilde{\pi}_c}{D + d}, \qquad q_{ic} = \frac{W_{ic} + w_c\,\tilde{q}_{ic}}{W_c + w_c}$$
where $d$ = # of imaginary extra docs, $\tilde{\pi}_c$ = smooth a priori estimate of $\pi_c$, $w_c$ = # of imaginary extra words in class c, and $\tilde{q}_{ic}$ = smooth a priori estimate of $q_{ic}$.
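A sketch of the smoothed word probability, assuming a uniform a priori estimate $\tilde{q}_{ic} = 1/V$ over a vocabulary of size V (the uniform choice and the function name are my assumptions, not stated on the slide):

```python
def smoothed_q(W_ic, W_c, V, w_c=1.0):
    """q_ic = (W_ic + w_c * (1/V)) / (W_c + w_c): Laplace-style smoothing
    with a uniform prior estimate 1/V over V vocabulary words."""
    return (W_ic + w_c * (1.0 / V)) / (W_c + w_c)

# A word never seen in class c still gets a non-zero probability:
print(smoothed_q(W_ic=0, W_c=1000, V=50000))
```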
Document Clustering
• In the NB classifier, we knew the class-label for each document
(supervised learning). What if we don’t know the class label? Can we
find plausible class-labels? (clustering / unsupervised learning).
• This is precisely the same as mixtures of Gaussians. But this time we replace
Gaussians with discrete probabilities qic
• The algorithm again alternates 2 phases:
M-step: Given cluster assignments (labels), update parameters as in NB.
E-step: infer plausible cluster labels given your current parameters.
“hard” M-step:
$$\pi_c = \frac{D_c}{D}; \qquad q_{ic} = \frac{W_{ic}}{W_c}$$
“hard” E-step:
$$y_{doc} = \operatorname{argmax}_c \left[ \sum_i x_{i,doc} \log q_{ic} + \log \pi_c \right]$$
Soft Assignment Clustering
• We can generalize these equations to soft assignments $r_{c,doc} = P(y_{doc} = c)$.
“soft” E-step:
$$r_{c,doc} = \frac{P(x_{doc} \mid y = c)\,\pi_c}{\sum_{c'} P(x_{doc} \mid y = c')\,\pi_{c'}} = \frac{\pi_c \prod_i q_{ic}^{\,x_{i,doc}}}{\sum_{c'} \pi_{c'} \prod_{i'} q_{i'c'}^{\,x_{i',doc}}}$$
“soft” M-step:
$$\pi_c = \frac{\sum_{doc} r_{c,doc}}{D}; \qquad q_{ic} = \frac{\sum_{doc} r_{c,doc}\, x_{i,doc}}{\sum_{doc} r_{c,doc}\, W_{doc}}, \qquad W_{doc} = \sum_i x_{i,doc} = \#\text{words in document “doc”}$$
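A compact sketch of the soft E/M steps above for a mixture of multinomials. The data layout (a plain list of word-count vectors), the random initialization, the fixed number of iterations, and the function name `em_cluster` are my own choices; the E-step is done in log space to avoid underflow.

```python
import math, random

def em_cluster(X, C, iters=50, eps=1e-9):
    """X: list of documents, each a list of word counts x_{i,doc} over a fixed vocabulary.
    C: number of clusters. Returns (pi, q, r) after `iters` soft E/M steps."""
    D, V = len(X), len(X[0])
    r = [[random.random() for _ in range(C)] for _ in range(D)]   # random soft assignments
    r = [[v / sum(row) for v in row] for row in r]
    for _ in range(iters):
        # soft M-step: pi_c and q_ic from responsibility-weighted counts
        pi = [sum(r[d][c] for d in range(D)) / D for c in range(C)]
        q = []
        for c in range(C):
            num = [sum(r[d][c] * X[d][i] for d in range(D)) + eps for i in range(V)]
            q.append([n / sum(num) for n in num])
        # soft E-step: r_{c,doc} proportional to pi_c * prod_i q_ic^{x_{i,doc}} (log space)
        for d in range(D):
            logp = [math.log(pi[c]) + sum(X[d][i] * math.log(q[c][i]) for i in range(V))
                    for c in range(C)]
            m = max(logp)
            w = [math.exp(l - m) for l in logp]
            r[d] = [v / sum(w) for v in w]
    return pi, q, r

# Toy usage: 4 documents over a 3-word vocabulary, 2 clusters.
pi, q, r = em_cluster([[5, 0, 1], [4, 1, 0], [0, 6, 1], [1, 5, 0]], C=2)
```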
Semi-Supervised Learning
• Now imagine that for a small subset of the data you know
the labels, but for the remainder you don’t.
• This is the general setup for semi-supervised learning.
• The EM algorithm from the previous slides is very well
suited to deal with this problem.
• Run soft E and M steps, but whenever you happen to know the label for document “doc”, use $r_{c,doc} = \delta_{y_{doc},\,c}$ (see the sketch after this list).
• This is just a mix between NB and clustering.
• Caution: Local minima are sometimes problematic.
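A sketch of that clamping step, reusing the hypothetical `r` from the EM sketch above; the `labels` dict is illustrative and simply maps a document index to its known class, if any:

```python
def clamp_known_labels(r, labels):
    """After each soft E-step, overwrite r_{c,doc} with the indicator
    delta_{y_doc, c} for every document whose label is known."""
    C = len(r[0])
    for d, y in labels.items():
        r[d] = [1.0 if c == y else 0.0 for c in range(C)]
    return r

r = clamp_known_labels(r, labels={0: 0, 3: 1})   # docs 0 and 3 have known labels
```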
Bayesian Networks
• If all variables are observed, learning just boils down to counting.
• Sometimes variables are never observed. These are called “hidden” or “latent” variables.
• Learning is now a lot harder, because
plausible fill-in values for these variables
need to be inferred.
• BNs are very powerful expert systems.
Full Bayesian Approaches
• The idea is to not fit anything (so you can’t over-fit).
• Instead we consider our parameters as random variables.
• If we place a prior distribution $P(\theta)$ on the parameters, we can simply integrate them out (now they are gone!):
$$P(x) = \int d\theta\; P(x \mid \theta)\, P(\theta)$$
• Remember though that bad priors lead to bad models; it’s not a silver bullet.
• In the limit of large numbers of data-items, one can derive the MDL penalty:
$$LL - \tfrac{1}{2}\,(\#\text{parameters})\,\log N$$
• Computational overhead for full Bayesian approaches can be large.
Conclusions
• Bayesian learning is learning with probabilities and using Bayes rule.
• Full Bayesian “learning” marginalizes out parameters
• Naive Bayes models are “generative models” in that you imagine how data
is generated and then invert it using Bayes rule to classify new data.
Separate models are trained for each class!
• All other classifiers seen so far are discriminative. Decision surfaces were
trained on all classes jointly. With many data-items a discriminative classifier is expected
to do better, but for small datasets a generative one (NB) is better.
• When we don’t know the class label y in NB, we have a hidden variable.
Clustering is like fitting a NB model with hidden class label. MoG uses
Gaussian conditional distributions. If we use discrete distributions (q) we are
fitting a “mixture of multinomials”. This is a good model to cluster text documents.