Transcript p(x)

E.g.: from an audio (voice) signal recording, the computer is able to infer the phonemes pronounced by the subject and recognise the words spoken.
Divide your data into a training set X and a test set. Hand-label the digits in the training set using a target vector t. Then learn the function y(x), which will take a new digit x and output the category label that x belongs to.
1. A validation set must be added in case of too few data or too many different models.
2. Preprocessing is needed for the digits (feature extraction):
• Translation and scaling inside a box of fixed size.
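The workflow above can be sketched with a toy classifier. This is a minimal, hypothetical example: the "digits" are short hand-made feature vectors (stand-ins for preprocessed images), and y(x) is a simple nearest-neighbour rule rather than the method the lecture has in mind.

```python
import math

# Toy stand-in for preprocessed digit features (hypothetical data):
# each x in X_train is a feature vector, each entry of t_train a hand label.
X_train = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
t_train = [0, 1, 0, 1]

def y(x):
    """Learned function y(x): return the label of the nearest training point."""
    dists = [math.dist(x, xn) for xn in X_train]
    return t_train[dists.index(min(dists))]

# A new "digit" x from the test set is mapped to a category label.
X_test = [(0.15, 0.15), (0.85, 0.85)]
print([y(x) for x in X_test])  # → [0, 1]
```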
Find material from within a large unstructured collection (e.g. the Internet) that satisfies the user's need.
It can be thought of as an inference problem:
given the user's query, what are the relevant items in the data collection?
(e.g. inference for stock markets)
Graphical models & conditional probabilities.
In CBU:
• Eleftherios – DTI (clustering, finding the paths of fibers)
• Ausaf – fMRI (MVPA)
• Hamed – brain modelling
(DARPA Challenge: $2 million award)
Latest winner, for 2007: Carnegie Mellon's Tartan Racing – 96 km (60 miles) in 4h 10m.
Stanley was the 2005 winner.
Goal: NO DRIVER. Obey all traffic laws.
Planning involves Bayesian maths.
Online DVD rental.
(100 million ratings from over 480,000 users for 18,000 movies)
Goal: predict user ratings for films based on previous ratings and reduce the root-mean-squared error (RMSE).
Supervised learning (training with labels) vs. unsupervised learning (no training labels).
Unlabeled Data!
Semi-Supervised Learning
Often, it is easy and cheap to obtain large
amounts of unlabelled data (e.g. images, text
documents), while it is hard or expensive to obtain
labelled data.
Semi-supervised learning methods attempt
to use the unlabelled data to improve the
performance on supervised learning tasks,
such as classification.
The Rules of Probability
• Sum Rule: p(X) = Σ_Y p(X, Y)
• Product Rule: p(X, Y) = p(Y|X) p(X)
Bayes' Theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
posterior ∝ likelihood × prior
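The rules above can be checked numerically on a tiny joint distribution. The numbers in the table are illustrative assumptions, not data from the lecture.

```python
# Verify the sum and product rules and Bayes' theorem on a small
# hand-made joint distribution p(X, Y) (hypothetical numbers).
p_xy = {('x0', 'y0'): 0.2, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.3, ('x1', 'y1'): 0.4}

# Sum rule: p(X) = sum over Y of p(X, Y)
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in ('x0', 'x1')}

# Product rule: p(X, Y) = p(Y | X) p(X)
p_y0_given_x0 = p_xy[('x0', 'y0')] / p_x['x0']
assert abs(p_y0_given_x0 * p_x['x0'] - p_xy[('x0', 'y0')]) < 1e-12

# Bayes' theorem: p(X | Y) = p(Y | X) p(X) / p(Y)
p_y0 = p_xy[('x0', 'y0')] + p_xy[('x1', 'y0')]
p_x0_given_y0 = p_xy[('x0', 'y0')] / p_y0          # posterior via joint/marginal
bayes = (p_y0_given_x0 * p_x['x0']) / p_y0         # posterior via Bayes' rule
print(round(p_x0_given_y0, 3), round(bayes, 3))    # both 0.4
```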
Bayesian ML
• Bayesian statistics provides a framework for building intelligent learning systems.
• Bayes' rule states that:
P(M|D) = P(D|M)P(M)/P(D)
“the probability of the model given the data, P(M|D), is the probability of the data given the model, P(D|M), times the prior probability of the model, P(M), divided by the probability of the data, P(D)”.
• Cox's theorems: we should treat degrees of belief in exactly the same way as we treat probabilities.
– P(M) represents numerically how much we believe model M to be the true model of the data before we actually observe the data.
– P(M|D) represents the same belief after observing the data.
• Think of ML as learning models of data.
• If our beliefs are not coherent, then according to the Dutch Book theorem we are guaranteed to lose money. The only way to avoid this is to be Bayesian, i.e. to represent and manipulate beliefs using the rules of probability.
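As a concrete sketch of P(M|D) = P(D|M)P(M)/P(D): below, two candidate models of coin-flip data are compared. The models, priors, and flip sequence are illustrative assumptions.

```python
# Posterior over two candidate models of coin-flip data:
# 'fair' (p_heads = 0.5) vs. 'biased' (p_heads = 0.9); both illustrative.
prior = {'fair': 0.5, 'biased': 0.5}          # P(M)
p_heads = {'fair': 0.5, 'biased': 0.9}

data = ['H', 'H', 'H', 'T', 'H', 'H']         # observed flips (hypothetical)

def likelihood(model):
    """P(D | M): product of per-flip probabilities under the model."""
    lik = 1.0
    for flip in data:
        lik *= p_heads[model] if flip == 'H' else 1 - p_heads[model]
    return lik

# Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)
p_data = sum(likelihood(m) * prior[m] for m in prior)          # P(D), normalizer
posterior = {m: likelihood(m) * prior[m] / p_data for m in prior}
print({m: round(p, 3) for m, p in posterior.items()})
# → {'fair': 0.209, 'biased': 0.791}
```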
Bayesian ML (cont.)
• The Bayesian framework states that:
– Start by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of these models.
– Then, upon observing the data D, evaluate how probable the data was under each of these models to compute P(D|M).
– Multiplying this likelihood by the prior and normalizing results in the posterior P(M|D), which encapsulates everything that you have learned from the data regarding the possible models under consideration.
– To compare two models we compute their relative probability given the data:
P(M)P(D|M) / P(M′)P(D|M′).
• In practice, applying Bayes' rule exactly is usually impractical because it involves summing or integrating over too large a space of models.
Solution: Approximate Bayesian methods
• Laplace approximation
• Variational approximations
• Expectation Propagation
• Markov Chain Monte Carlo …
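To make the last item concrete, here is a minimal Metropolis sampler (the simplest MCMC variant): it draws samples from an unnormalized density without ever computing the normalizer, which is exactly the intractable quantity (the analogue of P(D)) that motivates these methods. The target density and tuning constants are illustrative choices, not from the lecture.

```python
import math
import random

def target(x):
    """Unnormalized target density: a standard normal, up to a constant."""
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose symmetric moves, accept with
    probability min(1, target(proposal) / target(current))."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        if rng.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)
print(round(mean, 2))   # sample mean; should be close to the true mean 0
```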
Bayesian decision theory deals with the problem of making optimal decisions, i.e. decisions that minimize our expected loss.
• With k possible actions and m possible models, where action j under model i loses L_ij dollars, the optimal action is the one that minimizes the expected loss Σ_i L_ij P(M_i|D), summed over all models.
• Reinforcement learning.
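The expected-loss rule can be sketched in a few lines. The loss matrix and posterior probabilities below are illustrative assumptions.

```python
# Choose the action j minimizing the expected loss sum_i L[i][j] * P(M_i | D).
posterior = [0.7, 0.3]        # P(M_1|D), P(M_2|D): hypothetical posteriors
L = [[0, 10],                 # L[i][j]: loss of action j if model i is true
     [5, 1]]

def expected_loss(j):
    return sum(L[i][j] * posterior[i] for i in range(len(posterior)))

losses = [expected_loss(j) for j in range(len(L[0]))]
best_action = losses.index(min(losses))
print(losses, best_action)    # → [1.5, 7.3] 0
```

Note that the best action hedges against both models: action 0 is chosen not because it is best under the most probable model alone, but because it minimizes the loss averaged over the whole posterior.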
Common mistakes
Conditioning is not the same as implication
P(A | B) = p has a very different meaning from the logical statement "B implies A with certainty p".
– The logical statement means that whenever B is true then A is true with certainty p. The probability statement applies when the only thing we know is B. If anything else is known, e.g. C, then we must refer to P(A | B, C) instead. The only exception is when we can prove that C is conditionally independent of A given B, so that P(A | B, C) = P(A | B).
Example
A = "It rained last night"
B = "My grass is wet"
C = "The sprinkler was on last night"
Given only B, it is reasonable to conclude A. But if B is deduced from C, then it is not
reasonable to conclude A.
– You must condition on the event actually observed, not its logical implications.
Common mistakes (cont.)
Randomness is subjective
In English, randomness is an intrinsic property of an event. For example,
"Boston weather is random." But in probability theory, randomness is a function of
the observer; in particular, the amount of information you have.
A basic assumption of probability theory is that given enough information, the status
of any event can be reduced to a certainty. Randomness is therefore the absence of
information, and therefore subjective.
A common, but flawed, rebuttal to the subjectivist argument is that the
success of quantum physics "proves" that some things are intrinsically random. But
quantum theory does not prove intrinsic randomness any more than the fact that
coin flipping, despite being in the realm of Newton's laws, is best described
statistically, or that random number algorithms, which are completely deterministic,
may pass statistical tests. The convenience of a mental model does not prove that
the model is correct.
Regression Example
Sum-of-Squares Error Function:
E(w) = (1/2) Σ_n {y(x_n, w) − t_n}²
Regularization: penalize large coefficient values:
Ẽ(w) = (1/2) Σ_n {y(x_n, w) − t_n}² + (λ/2) ‖w‖²
Maximum Likelihood: determine w_ML by minimizing the sum-of-squares error E(w).
MAP: taking the negative logarithm of the posterior determines the weights of maximum posterior probability (maximum a posteriori): determine w_MAP by minimizing the regularized sum-of-squares error Ẽ(w). α here is a hyperparameter.
Bayesian Curve Fitting
Over-fitting can be avoided; the effective number of parameters adapts automatically to the size of the data.
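The ML-vs-MAP contrast above can be sketched for the simplest case, a straight line y(x, w) = w0 + w1·x, where the (regularized) normal equations reduce to a 2×2 solve. The data points and λ value are illustrative assumptions.

```python
# Least-squares (ML) vs. regularized (MAP-style) fit of a straight line,
# solving the 2x2 normal equations (A + lam * I) w = b directly.
xs = [0.0, 1.0, 2.0, 3.0]      # hypothetical inputs x_n
ts = [0.1, 1.1, 1.9, 3.2]      # hypothetical targets t_n

def fit(lam):
    """Minimize (1/2) sum_n (w0 + w1*x_n - t_n)^2 + (lam/2) * ||w||^2."""
    n = len(xs)
    a00 = n + lam
    a01 = sum(xs)
    a11 = sum(x * x for x in xs) + lam
    b0 = sum(ts)
    b1 = sum(x * t for x, t in zip(xs, ts))
    det = a00 * a11 - a01 * a01
    w0 = (b0 * a11 - b1 * a01) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1

w_ml = fit(0.0)     # maximum likelihood = plain least squares
w_map = fit(10.0)   # the penalty shrinks the coefficients toward zero
print(w_ml, w_map)
```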
[Figure: entropy h plotted as a function of the probability p, for p from 0 to 1]
H(X, Y) = H(Y|X) + H(X)
• Assuming x is discrete with 8 possible states, how many bits are needed to transmit the state of x?
– All states equally likely: H(x) = log2 8 = 3 bits.
– States with different probabilities: fewer bits on average.
• The uniform distribution has maximum entropy.
"N i.i.d. random variables each with entropy H(X) can be compressed into more than NH(X) bits with negligible risk of information loss, as N tends to infinity; but conversely, if they are compressed into fewer than NH(X) bits it is virtually certain that information will be lost." (Shannon's source coding theorem)
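The 8-state example works out neatly in code. The skewed distribution below is one conventional choice (powers of 1/2) that makes the entropy come out to a round number; it is an assumption, not data from the lecture.

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), in bits; 0*log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8                                         # all 8 states equally likely
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]        # unequal probabilities

print(entropy(uniform))   # → 3.0 bits
print(entropy(skewed))    # → 2.0 bits: fewer bits needed on average
```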
KL divergence
An unknown distribution p(x) is modelled using an approximating distribution q(x). What is the average additional amount of information required to specify the value of x?
KL(p‖q) = −Σ_x p(x) ln { q(x) / p(x) }
If x, y are independent then p(x,y) = p(x)p(y)
MI
Mutual information satisfies I(x, y) ≥ 0, with equality iff x and y are independent, and it is related to the conditional entropy through the following formula:
I(x, y) = H(x) − H(x|y) = H(y) − H(y|x)
Mutual information represents the reduction in uncertainty about x as a consequence of the new observation y.
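The KL and MI definitions connect directly: mutual information is the KL divergence between the joint p(x, y) and the product of marginals p(x)p(y), so it vanishes exactly when the independence condition p(x, y) = p(x)p(y) holds. The joint table below is a hypothetical example.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) * ln(p(x)/q(x)), in nats; 0 * ln 0 taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint p(x, y) as a 2x2 table (hypothetical numbers).
p_xy = [[0.3, 0.2],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]             # marginal p(x)
p_y = [sum(col) for col in zip(*p_xy)]       # marginal p(y)

# I(x; y) = KL( p(x,y) || p(x)p(y) ): zero iff x and y are independent.
joint = [p for row in p_xy for p in row]
indep = [px * py for px in p_x for py in p_y]
mi = kl(joint, indep)
print(mi >= 0, round(mi, 4))

# For a genuinely independent joint, MI is exactly 0.
assert abs(kl([0.25] * 4, [0.25] * 4)) < 1e-12
```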
Bernoulli: used in a Bernoulli trial with random outcome {success, failure}.
Beta: conjugate prior of the Bernoulli distribution.
Dirichlet: generalization of the beta distribution and conjugate prior of the multinomial distribution.
Binomial: describes the number of successes in a series of independent yes/no trials.
Multinomial: as in the binomial, but each trial can have more than 2 outcomes.
Student's t: a probability distribution that arises in the problem of estimating the mean of a normally distributed population when the sample size is small (t-test). Much more robust to outliers than the Gaussian.
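Conjugacy is what makes the Beta–Bernoulli pair convenient: the posterior has the same form as the prior, with closed-form updated parameters. A minimal sketch, with illustrative prior counts and data:

```python
# Beta-Bernoulli conjugacy: a Beta(a, b) prior on the success probability,
# after observing h successes and t failures, gives the posterior
# Beta(a + h, b + t). Numbers below are illustrative assumptions.
a, b = 2, 2                    # prior pseudo-counts
heads, tails = 7, 3            # observed Bernoulli trials

a_post, b_post = a + heads, b + tails
posterior_mean = a_post / (a_post + b_post)   # mean of Beta(a', b') = a'/(a'+b')
print(a_post, b_post, round(posterior_mean, 3))   # → 9 5 0.643
```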
Take-away message: Learning is about Optimization & Integration … again.
Books:
"Pattern Recognition and Machine Learning", C. M. Bishop, 2006.
"Information Theory, Inference, and Learning Algorithms", D. J. C. MacKay, 2003.
"Probability Theory: The Logic of Science", E. T. Jaynes, 2003.