Transcript Document

Machine Learning
Neil Lawrence
Schedule
• Today: Introduction and Background

30th Sept: Introduction and Background
7th Oct: Maximum Likelihood & Optimization
14th Oct
21st Oct
28th Oct
4th Nov: Bayesian Methods
11th Nov
18th Nov: Gaussian Processes
25th Nov
2nd Dec
Website:
Analysis methods
Machine Learning Applications
• Classification
  – Character recognition
    • Neural Networks
    • Support Vector Machines
  – Face detection
  – Face recognition
• Regression
  – Time series prediction
    • Financial forecasting
    • Pollution Monitoring
• Other
  – Text processing (visualisation, classification, statistical parsing …)
  – Bioinformatics (visualisation, classification …)
  – Face Tracking
  – Games
Character Recognition
• Classification Problem
  – Given a digit representation, what is its class?
• AT&T have used
  – Neural Networks
  – Support Vector Machines
• Error rates ~1.4%
• Inputs are 28x28 greyscale images.
Face Detection
• Classification Problem
• Useful for
  – Organising collections of holiday photos.
  – A pre-processing step to recognition.
Regression
• Predicting one variable as a function of another.
These examples are of a noisy sine wave prediction and a noisy sinc function.
Time Series Prediction
In time series prediction we try to predict the function at time t+1 given the function up to time t.
This is strongly related to regression: you can treat time series prediction as a regression problem through autoregression, regressing the value at time t+1 on the preceding values.
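A minimal sketch of autoregression (my own illustration, not from the slides): fit an ordinary least-squares regression from a window of past values to the next value; the window length and the synthetic noisy sine-wave data are assumptions chosen for the example.

    import numpy as np

    # Synthetic noisy sine wave (assumed data, for illustration only).
    rng = np.random.default_rng(0)
    t = np.arange(200)
    series = np.sin(0.1 * t) + 0.1 * rng.standard_normal(t.size)

    # Autoregression: predict the value at time t+1 from the previous `order` values.
    order = 5
    X = np.column_stack([series[i:len(series) - order + i] for i in range(order)])
    y = series[order:]

    # Ordinary least squares gives the autoregressive coefficients.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

    # One-step-ahead prediction from the most recent window.
    print("predicted next value:", series[-order:] @ coeffs)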
Natural Language Processing
• Where is the time in these sentences?
The meeting will be just after lunch.
The meeting will be at 2pm.
There will be no meeting this month.
Bioinformatics
[Figure: DNA in different solutions.]
Probabilistic Modelling of Spots
Learning and Generalisation
• What is machine learning?
Given a dataset, I can simply store it in my computer and I have
learnt it. Is this machine learning?
Machine learning is about going beyond this and getting something useful out of the
dataset, like good predictions on previously unseen data.
This is known as generalisation.
• Learning as compression.
There are some relationships between generalising from a dataset and trying to
compress it. However, compression lies in the domain of information theory.
Probability Distributions
• Question: What is the probability that you are male given that you are a computer scientist?

Assign sex the letter Y (what we wish to predict) and profession the letter X (what we have observed).

P(Y=female|X=computer scientist) = 0.27

Here the data is from a web article in IEEE Expert: 27% of Computer Science graduates are female. This is a probability distribution. It must be normalised, i.e.

Σ_Y P(Y|X=computer scientist) = 1

Thus, since Y can only take on two values (male or female),

P(Y=male|X=computer scientist) = 1 – 0.27 = 0.73
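A minimal sketch of this as code (my own illustration): store the conditional distribution as a small table, check that it is normalised, and use the complement to get the probability of male.

    # P(Y | X = computer scientist) as a table.
    p_y_given_cs = {"female": 0.27, "male": 0.73}

    # Normalisation: the probabilities over Y must sum to one.
    assert abs(sum(p_y_given_cs.values()) - 1.0) < 1e-12

    # Since Y takes only two values: P(male | cs) = 1 - P(female | cs).
    print(1.0 - p_y_given_cs["female"])  # 0.73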
More on Sex
• This distribution may be written as a table:

  P(Y=f|X=cs)    P(Y=m|X=cs)
  0.27           0.73

• Similarly we may specify a distribution for non-Computer Scientists:

  P(Y=f|X=~cs)   P(Y=m|X=~cs)
  0.51           0.49

• Finally, what is the probability of being a Computer Scientist?

  P(X=cs) = 0.01
Manipulating Probabilities
• Question: What is the probability that you are a computer scientist given that you are female?

P(X=cs|Y=f)

To answer this we need to introduce some of the laws of probability. First, though, some notation: so far we have denoted the 'random variables' by capital letters, X (profession) and Y (sex), and specified the values they take (cs, f, m). We will continue to use the capitals, but we will abstract the problem by replacing the values with their corresponding small letters.

P(X=x|Y=y)
Product Rule of Probability
• So far we have looked at conditional distributions; what about the joint distribution?

The probability of Y=y given that X=x is written as P(Y=y|X=x). If we wish to know the probability that Y=y and X=x we need P(Y=y, X=x), the joint distribution.

The joint distribution is found by multiplying the conditional distribution by the marginal distribution of the variable it is conditioned on:

P(Y=y, X=x) = P(Y=y|X=x)P(X=x)

From our example above: the probability that someone is male and a computer scientist is equal to the probability that someone is male given that they are a computer scientist, multiplied by the probability that they are a computer scientist, or in our previous notation:

P(Y=m, X=cs) = P(Y=m|X=cs)P(X=cs)
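A minimal numerical sketch of the product rule, plugging in the figures quoted earlier (the variable names are my own):

    # Product rule: P(Y=m, X=cs) = P(Y=m | X=cs) * P(X=cs)
    p_m_given_cs = 0.73
    p_cs = 0.01
    p_m_and_cs = p_m_given_cs * p_cs
    print(p_m_and_cs)  # 0.0073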
Bayes’ Rule
• Is this operation symmetric?

The above statement leads to the following question: does this operation commute, i.e. does it work the other way around? Can we write

P(X=x, Y=y) = P(X=x|Y=y)P(Y=y) = P(Y=y, X=x)?

The answer is yes, and this means that we can write

P(X=x|Y=y)P(Y=y) = P(Y=y|X=x)P(X=x)

which may be rearranged to form

P(X=x|Y=y) = P(Y=y|X=x)P(X=x) / P(Y=y)

This is known as Bayes' Rule and is very important for this course!
The Normalisation
• Can we answer the original question yet?

In the above formula we required P(Y=y|X=x), P(Y=y) and P(X=x). These are sometimes known as the likelihood, P(Y=y|X=x), the prior, P(X=x), and the marginalised likelihood or evidence, P(Y=y).

Whilst above we specified the likelihood and the prior, we didn't specify the marginalised likelihood. Fortunately we can obtain it through a process of marginalisation:

P(Y=y) = Σ_x P(Y=y, X=x)

This is known as the sum rule of probability. Recalling the formula for the joint distribution we may write

P(Y=y) = Σ_x P(Y=y|X=x)P(X=x)
The Answer to the Question
• We asked the question: what is P(X=x|Y=y)? Here is the full answer through Bayes' rule:

P(X=x|Y=y) = P(Y=y|X=x)P(X=x) / P(Y=y)

• Thus we can write

P(X=x|Y=y) = P(Y=y|X=x)P(X=x) / Σ_x' P(Y=y|X=x')P(X=x')
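A minimal sketch of this answer with the numbers given earlier (my own code, assuming P(f|cs) = 0.27, P(f|~cs) = 0.51 and P(cs) = 0.01):

    # Evidence by the sum rule: P(f) = P(f|cs)P(cs) + P(f|~cs)P(~cs).
    p_f_given_cs = 0.27
    p_f_given_not_cs = 0.51
    p_cs = 0.01
    p_f = p_f_given_cs * p_cs + p_f_given_not_cs * (1 - p_cs)

    # Bayes' rule: P(cs|f) = P(f|cs)P(cs) / P(f).
    p_cs_given_f = p_f_given_cs * p_cs / p_f
    print(p_f, p_cs_given_f)  # roughly 0.5076 and 0.0053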
Notational Shorthand
Very often the set to which an observation belongs is obvious: computer scientist is a profession, not a sex; male is a sex, not a profession. Thus it is common to use a notational shorthand to represent our variables. Rather than always writing P(sex=male) we write P(male), and similarly rather than writing P(profession=computer scientist) we write P(computer scientist). In the latter case it should be obvious that the distribution is over professions and not sexes.

Thus, if it is obvious that x is from the set X and y is from the set Y, we may write a shorthand version of Bayes' rule:

P(x|y) = P(y|x)P(x) / P(y)
Trial Example
Alan M. Dershowitz, a Harvard professor and advisor to the O. J.
Simpson defense team stated on U.S. television that only about
0.1% of wife batterers actually murder their wives and claimed that
therefore evidence of abuse and battering should not be admissible
in a murder trial. But what is the probability that the husband was
the murderer, given that he battered his wife and the wife was killed
( Good, 1995 ; Koehler, 1997 )?
P(m|b) = 0.001 (See here for article)
P(b) = 0.023 (1992 Report)
P(m, b) = 0.001 × 0.023 = 2.3e-5
P(m) = 1e-4 (See here for numbers)
P(b|m) = 2.3e-5 / 1e-4 = 0.23

Or 23% of murdered women were battered by their husbands, versus approximately 2.3% of unmurdered women.
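A minimal sketch reproducing this calculation (my own code; the probabilities are those quoted on the slide):

    # P(murdered | battered), P(battered) and P(murdered).
    p_m_given_b = 0.001
    p_b = 0.023
    p_m = 1e-4

    # Product rule, then Bayes' rule for P(battered | murdered).
    p_m_and_b = p_m_given_b * p_b   # 2.3e-5
    p_b_given_m = p_m_and_b / p_m   # 0.23
    print(p_b_given_m)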
Graphical Representation
It is sometimes convenient to represent how we have parameterised a distribution as a graph. We can move between the parameterisations through Bayes' rule (shaded variables are ‘observed’, unshaded are unobserved or ‘latent’).
[Diagram: two graphical models for the same pair of variables. In one, x is the parent of y and the joint is parameterised as P(y|x)P(x); in the other, y is the parent of x and it is parameterised as P(x|y)P(y). Bayes' Rule converts between the two.]

P(x):
  x:  cs = 0.01,  ~cs = 0.99

P(y|x):
            y = m    y = f
  x = ~cs    0.49     0.51
  x = cs     0.73     0.27

P(y):
  y:  m = a,  f = (1-a)

P(x|y):
            x = cs   x = ~cs
  y = m       b       (1-b)
  y = f       c       (1-c)
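A minimal sketch (my own code, not part of the slides) of moving between the two parameterisations using the sum and product rules: it converts the P(y|x) and P(x) tables above into P(y) and P(x|y), which is one way of checking the values of a, b and c.

    # P(x) and P(y|x) as given in the tables above.
    p_x = {"cs": 0.01, "~cs": 0.99}
    p_y_given_x = {
        "~cs": {"m": 0.49, "f": 0.51},
        "cs":  {"m": 0.73, "f": 0.27},
    }

    # Sum rule: P(y) = sum over x of P(y|x) P(x).
    p_y = {y: sum(p_y_given_x[x][y] * p_x[x] for x in p_x) for y in ("m", "f")}

    # Bayes' rule: P(x|y) = P(y|x) P(x) / P(y).
    p_x_given_y = {
        y: {x: p_y_given_x[x][y] * p_x[x] / p_y[y] for x in p_x}
        for y in ("m", "f")
    }

    print(p_y)          # a and (1 - a)
    print(p_x_given_y)  # b, (1 - b), c and (1 - c)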
Bayesian Inference
• By next time.
– Give values for a, b and c.
Graphical Representation
Let's rename our old feature as X1 and add another feature, X2, which indicates whether the person is a football fan (ff).
[Diagram: graphical model with x1 and x2 as parents of y; the joint is parameterised as P(y|x1, x2)P(x1)P(x2).]

P(x1):
  x1:  cs = 0.01,  ~cs = 0.99

P(x2):
  x2:  ff = 0.20,  ~ff = 0.80

P(y|x1, x2):
                        y = m    y = f
  x1 = ~cs, x2 = ~ff     0.30     0.70
  x1 = ~cs, x2 = ff      0.85     0.15
  x1 = cs,  x2 = ~ff     0.73     0.27
  x1 = cs,  x2 = ff      0.74     0.28
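A minimal sketch (my own code) of the summation the next slide is concerned with: obtaining P(y) from this parameterisation means summing P(y|x1, x2)P(x1)P(x2) over every combination of the features, four terms here but growing exponentially with the number of features.

    # Tables from above.
    p_x1 = {"cs": 0.01, "~cs": 0.99}
    p_x2 = {"ff": 0.20, "~ff": 0.80}
    p_y_given_x = {
        ("~cs", "~ff"): {"m": 0.30, "f": 0.70},
        ("~cs", "ff"):  {"m": 0.85, "f": 0.15},
        ("cs", "~ff"):  {"m": 0.73, "f": 0.27},
        ("cs", "ff"):   {"m": 0.74, "f": 0.28},
    }

    # Sum rule: P(y) = sum over x1, x2 of P(y|x1, x2) P(x1) P(x2).
    p_y = {
        y: sum(
            p_y_given_x[(x1, x2)][y] * p_x1[x1] * p_x2[x2]
            for x1 in p_x1 for x2 in p_x2
        )
        for y in ("m", "f")
    }
    print(p_y)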
Practical Limitation
• In digit recognition we could have many more features, e.g. for 16x16 binary images there are 256 features.
• This leads to 2^256 rows in the previous table.
• This also means that to determine P(y) we need to sum across 2^256 values, approximately 1.2 × 10^77 floating point operations.
• IBM Blue Pacific is a 4 teraflop machine: 4 × 10^12 floating point operations per second. It would take about 9 × 10^47 billion years to complete this computation (and it still cost $94 million).
• Many techniques in machine learning involve approximations which enable us to undertake these tasks efficiently.
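A minimal check of this arithmetic (my own, for illustration):

    # Rough cost of summing over every configuration of 256 binary features.
    operations = 2 ** 256                  # about 1.2e77
    flops = 4e12                           # 4 teraflop machine (IBM Blue Pacific)
    seconds = operations / flops
    years = seconds / (60 * 60 * 24 * 365)
    print(f"{operations:.1e} operations, roughly {years / 1e9:.1e} billion years")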