Learning from Data


Learning from Data
COMP61011 : Machine Learning and Data Mining
Dr Gavin Brown
Machine Learning and Optimization Research Group
Learning from Data
Data is recorded from some real-world phenomenon.
What might we want to do with that data?
Prediction
- what can we predict about this phenomenon?
Description
- how can we describe/understand this phenomenon in a new way?
Optimization
- how can we control and optimize this phenomenon for our own objectives?
COMP61011: Machine Learning & Data Mining (Period 1, Oct/Nov)
COMP61021: Modeling & Visualization of High Dimensional Data (Period 2, Nov/Dec)
COMP61032: Optimization for Learning, Planning & Problem Solving (Period 3, Feb/Mar)
Prediction
Lecturer:
Dr Gavin Brown
Machine Learning and Data Mining
Medical Records / Novel Drugs
What characteristics of a patient indicate they may react well/badly to a new drug?
How can we predict whether it will potentially hurt rather than help them?
AstraZeneca Project Research Bursaries
Limited number of eligible MSc projects, announced Dec 2011
Machine Learning and Data Mining
Handwriting Recognition
Google Books is currently digitizing millions of books.
Smartphones need to process non-European
handwriting to tap into the Asian market.
How can we recognize handwritten digits in a
huge variety of handwriting styles, in real-time?
Learning from Data
Where does all this fit?
[Diagram: Learning from Data at the overlap of Statistics / Mathematics, Artificial Intelligence, Data Mining, Computer Vision, and Robotics]
(No definition of a field is perfect – the diagram above is just one interpretation, mine ;-)
Learn your trade…
Learning from Data ….. Prerequisites
MATHEMATICS
This is a mathematical subject.
You must be comfortable with probabilities and algebra.
A maths primer is on the course website for review.
PROGRAMMING
You must be able to program, and pick up a new language relatively easily.
We use Matlab for the first 2 modules.
In the 3rd module, you may use any language.
http://www.cs.manchester.ac.uk/pgt/COMP61011
Module codes in this theme:
61011 (prediction)
61021 (description)
61032 (optimization)
COMP61011 topic structure
Week 1: Some Data and Simple Predictors
Week 2: Support Vector Machines / Model Selection
Week 3: Decision Trees / Feature Selection
Week 4: Bayes Theorem / Probabilistic Classifiers
Week 5: Ensemble Methods / Industry Guest Lectures
Week 6: No lecture.
COMP61011 assessment structure
50% January exam
50% coursework, broken down as
10% + 10% lab exercises (weeks 2,3)
30% mini-project (weeks 4,5,6)
Lab exercises will be marked at the START of the following lab session.
You should NOT still be working on the previous week’s exercise.
Extensions will require a medical note.
Matlab
MATrix LABoratory
• Interactive scripting language
• Interpreted (i.e. no compiling)
• Objects possible, not compulsory
• Dynamically typed
• Flexible GUI / plotting framework
• Large libraries of tools
• Highly optimized for maths
Available free from the University, but usable only when connected to our network (e.g. via VPN).
Module-specific software is supported on school machines only.
Books (not compulsory purchase, but recommended)
“Introduction to Machine Learning” by Ethem Alpaydin
Technical. Contains all necessary material for modules 1+2 of this theme.

“Very Short Introduction to Statistics” by David Hand
Not technical at all. More of a motivational, big-picture read.
Some Data, and Simple Predictors
A Problem to Solve
Distinguish rugby players from ballet dancers.
You are provided with some data.
Fallowfield rugby club (16).
Rusholme ballet troupe (10).
Task
Generate a program which will correctly classify ANY player/dancer in the world.
Hint
We shouldn’t “fine-tune” our system too much so it only works on the local clubs.
Taking measurements….
We have to process the people with a computer, so the data needs to be in a computer-readable form.
What are the distinguishing characteristics?
1. Height
2. Weight
3. Shoe size
4. Sex
Taking measurements….
    Person   Weight   Height
    1        63kg     190cm
    2        55kg     185cm
    3        75kg     202cm
    4        50kg     180cm
    5        57kg     174cm
    …        …        …
    16       85kg     150cm
    17       93kg     145cm
    18       75kg     130cm
    19       99kg     163cm
    20       100kg    171cm
    …        …        …

[Scatter plot of the data: height vs. weight]
The Nearest Neighbour Rule
“TRAINING” DATA

    Person   Weight   Height   Class
    1        63kg     190cm    -1
    2        55kg     185cm    -1
    3        75kg     202cm    -1
    4        50kg     180cm    -1
    5        57kg     174cm    -1
    …        …        …        …
    16       85kg     150cm    +1
    17       93kg     145cm    +1
    18       75kg     130cm    +1
    19       99kg     163cm    +1
    20       100kg    171cm    +1
    …        …        …        …

[Scatter plot of the training data: height vs. weight]

“TESTING” DATA

Who’s this guy? - player or dancer?
height = 180cm
weight = 78kg
The Nearest Neighbour Rule
“TRAINING” DATA: [table and scatter plot as above]

height = 180cm
weight = 78kg

1. Find nearest neighbour
2. Assign the same class
The K-Nearest Neighbour Classifier
Testing point x
For each training datapoint x’
measure distance(x,x’)
End
Sort distances
Select K nearest
Assign most common class
“TRAINING” DATA: [table and scatter plot as above]
Quick reminder: Pythagoras’ theorem
. . .
    measure distance(x,x’)
. . .

[Figure: a right-angled triangle with sides $a$, $b$ and hypotenuse $c$]

$a^2 + b^2 = c^2$

So....

$c = \sqrt{a^2 + b^2}$

a.k.a. “Euclidean” distance:

$\mathrm{distance}(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_i (x_i - x'_i)^2}$

[Scatter plot: height vs. weight, showing the distance between two points]
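In Matlab (the course language) this distance is a one-liner. A minimal sketch, using the query point from the slides and an arbitrary second point:

    % Euclidean distance between two feature vectors [height, weight]: a minimal sketch
    x      = [180 78];                  % the query point from the slides: 180cm, 78kg
    xprime = [174 57];                  % an arbitrary training point
    d = sqrt(sum((x - xprime).^2));     % square root of the sum of squared differences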
The K-Nearest Neighbour Classifier
Testing point x
For each training datapoint x’
measure distance(x,x’)
End
Sort distances
Select K nearest
Assign most common class
“TRAINING” DATA: [table and scatter plot as above]

Seems sensible.
But what are the disadvantages?
The K-Nearest Neighbour Classifier
“TRAINING” DATA: [table and scatter plot as above]
Here I chose k=3.
What would happen if I chose k=5?
What would happen if I chose k=26?
The K-Nearest Neighbour Classifier
“TRAINING” DATA: [table and scatter plot as above, with a boundary drawn between the two classes]
Any point on the left of this “boundary” is closer to the red circles.
Any point on the right of this “boundary” is closer to the blue crosses.
This is called the “decision boundary”
Where’s the decision boundary?
[Scatter plot: height vs. weight, with a curved decision boundary]
Not always a simple straight line!
Where’s the decision boundary?
[Scatter plot: height vs. weight, with a decision boundary in several disjoint pieces]
Not always contiguous!
So, we have our first “machine learning” algorithm
The K-Nearest Neighbour Classifier
Testing point x
For each training datapoint x’
measure distance(x,x’)
End
Sort distances
Select K nearest
Assign most common class
Make your own notes on its
advantages / disadvantages.
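The pseudocode above maps almost line-for-line onto Matlab (the course language). The sketch below is mine, not course code, and assumes the training data is held as a numeric matrix with one row per person:

    function yhat = knn_classify(Xtrain, ytrain, xtest, K)
    % K-nearest-neighbour classifier: a minimal sketch.
    %   Xtrain : N-by-d matrix of training points (one row per person)
    %   ytrain : N-by-1 vector of class labels (+1 / -1)
    %   xtest  : 1-by-d testing point
    %   K      : number of neighbours to consult

    % For each training datapoint, measure distance(x, x')
    dists = sqrt(sum(bsxfun(@minus, Xtrain, xtest).^2, 2));

    % Sort distances and select the K nearest
    [~, order] = sort(dists);

    % Assign the most common class among those K neighbours
    yhat = mode(ytrain(order(1:K)));
    end

For example, the query from the slides could be classified with something like yhat = knn_classify([heights weights], labels, [180 78], 3), where heights, weights and labels are column vectors built from the table.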
The most important concept in Machine Learning

Looks good so far…
Oh no! Mistakes!
What happened?

We didn’t have all the data.
We can never assume that we do.

This is called “OVER-FITTING” to the small dataset.
Overfitting
Overfitting happens when the classifier is too “flexible” for the problem.
If we’d drawn a simpler decision boundary, maybe a straight line, we might have got a lower error.
Break for 10 mins
Possible uses of your break:
1. Ensure you have a working login for the computer lab this afternoon.
2. Talk to me or a demonstrator about the material.
3. Read ahead in the notes.
4. Go get a coffee.
A simpler, more compact rule?

[Scatter plot: height vs. weight, with a vertical threshold θ on the weight axis]

if (weight > θ) then "player" else "dancer"
What’s an algorithm to find a good threshold?
    θ = 40
    while ( numMistakes != 0 )
    {
        θ = θ + 1
        numMistakes = error( θ )
    }

if (weight > θ) then "player" else "dancer"

[Scatter plot: height vs. weight, with the learned threshold θ]
We have our second Machine Learning procedure.
The threshold classifier (also known as a “Decision Stump”)
if (weight > θ) then "player" else "dancer"

    θ = 40
    while ( numMistakes != 0 )
    {
        θ = θ + 1
        numMistakes = error( θ )
    }
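The same search written as a Matlab sketch; weights and labels are assumed vectors (body weights in kg, labels +1 for player and -1 for dancer), and, like the pseudocode, it only terminates if some threshold gives zero training mistakes:

    % Threshold classifier ("decision stump") learning: a minimal sketch.
    % weights : N-by-1 vector of body weights in kg (assumed to exist)
    % labels  : N-by-1 vector of classes, +1 = player, -1 = dancer (assumed to exist)
    error_count = @(theta) sum( (2*(weights > theta) - 1) ~= labels );  % mistakes at this threshold

    theta = 40;                          % start the threshold at 40, as in the slides
    while error_count(theta) ~= 0        % nudge the threshold up until no training mistakes
        theta = theta + 1;
    end
    % Final rule:  if (weight > theta) then "player" else "dancer"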
Three “ingredients” of a Machine Learning procedure
“Model”
The final product, the thing you have to package up and
send to a customer. A piece of code with some
parameters that need to be set.
“Error function”
The performance criterion: the function you use to judge
how well the parameters of the model are set.
“Learning algorithm”
The algorithm that optimises the model parameters, using
the error function to judge how well it is doing.
Three “ingredients” of a Threshold Classifier
Error function:
    numMistakes = error( θ )

Learning algorithm:
    θ = 40
    while ( numMistakes != 0 )
    {
        θ = θ + 1
        numMistakes = error( θ )
    }

Model:
    if (weight > θ) then "player" else "dancer"
What’s the “model” for the k-NN classifier?
For the k-NN, the model is the training data itself!
- very good accuracy
- very computationally intensive!
Testing point x
For each training datapoint x’
    measure distance(x,x’)
End
Sort distances
Select K nearest
Assign most common class

[Scatter plot: height vs. weight]
New data: what’s an algorithm to find a good threshold?
[Scatter plot: height vs. weight, new data with the threshold θ marked]

Our model does not match the problem!

if (weight > θ) then "player" else "dancer"

1 mistake…
New data: what’s an algorithm to find a good threshold?
[Scatter plot: height vs. weight, with a sketched boundary that the threshold rule cannot produce]

But our current model cannot represent this…

if (weight > θ) then "player" else "dancer"
We need a more sophisticated model…
if (weight > θ) then "player" else "dancer"
if ( f(x) > θ ) then "player" else "dancer"
x1 = height (cm)
x2 = weight (kg)
The Linear Classifier
$f(\mathbf{x}) = (w_1 \times x_1) + (w_2 \times x_2) = \sum_{i=1}^{d} w_i x_i$

[Scatter plot: height vs. weight]
The Linear Classifier
if $f(\mathbf{x}) > \theta$ then "player" else "dancer"

$f(\mathbf{x}) = (w_1 \times x_1) + (w_2 \times x_2) = \sum_{i=1}^{d} w_i x_i$

[Two scatter plots: height vs. weight, each showing a different decision boundary]

$w_1$, $w_2$ and $\theta$ change the position of the DECISION BOUNDARY.
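As a concrete Matlab sketch; the weight values and threshold below are made-up for illustration (how to learn them comes next):

    % Linear classifier decision: a minimal sketch with hypothetical parameters
    w     = [0.5; 1.0];      % one weight per feature: [height; weight]
    theta = 120;             % hypothetical threshold
    x     = [180; 78];       % a person: 180cm, 78kg

    f = w' * x;              % f(x) = w1*x1 + w2*x2
    if f > theta
        disp('player')
    else
        disp('dancer')
    end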
Geometry of the Linear Classifier (1)
if $f(\mathbf{x}) > \theta$ then +1 else -1

"player" → +1
"dancer" → -1

In 2-d, this is a line. In higher dimensions, it is a decision “hyper-plane”.
Any point on the plane evaluates to 0.
Points not on the plane evaluate to +1/-1.

The decision boundary is always ORTHOGONAL to the weight vector $[w_1, w_2]$.
See if you can prove this for yourself before going to the notes.

[Figure: the weight vector $[w_1, w_2]$, the boundary $f(\mathbf{x}) = 0$, and the regions where $f(\mathbf{x}) = +1$ and $f(\mathbf{x}) = -1$]
Geometry of the Linear Classifier (2)
We can rearrange the decision rule:

if $\sum_{i=1}^{d} w_i x_i > \theta$ then +1 else -1

$\sum_{i=1}^{d} w_i x_i - \theta > 0$

$\sum_{i=1}^{d} w_i x_i + (-1 \cdot w_0) > 0$    (defining $w_0 = \theta$)

$\sum_{i=0}^{d} w_i x_i > 0$    (treating the $-1$ as an extra input $x_0 = -1$)

if $\sum_{i=0}^{d} w_i x_i \ge 0$ then +1 else -1
Geometry of the Linear Classifier (3)
On the plane:

$f(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i - \theta = 0$

In 2-dimensions:

$f(\mathbf{x}) = w_1 x_1 + w_2 x_2 - \theta = 0$

$\frac{w_1}{w_2} x_1 + x_2 = \frac{\theta}{w_2}$

$x_2 = -\frac{w_1}{w_2} x_1 + \frac{\theta}{w_2}$

This now follows the geometry of a straight line $y = mx + c$, with $m = -\frac{w_1}{w_2}$ and $c = \frac{\theta}{w_2}$.

[Figure: the weight vector $[w_1, w_2]$, the boundary $f(\mathbf{x}) = 0$, and the region $f(\mathbf{x}) = +1$]
The Linear Classifier
Model:            if $\sum_{i=0}^{d} w_i x_i \ge 0$ then $\hat{y} = +1$ else $\hat{y} = -1$

Error function:   $e = \frac{1}{2}\big(f(\mathbf{x}) - y\big)^2$

Learning algo.:   ??? .... need to optimise the w values...

[Scatter plot: height vs. weight. $\mathbf{x}$ = inputs, $y$ = class]

Note the terminology! See notes for details!
Gradient Descent
$e = \frac{1}{2}\big(f(\mathbf{x}) - y\big)^2$

$\frac{\partial e}{\partial w_i} = \frac{\partial e}{\partial f}\,\frac{\partial f}{\partial w_i} = (f - y)\,x_i$
Follow the NEGATIVE gradient.
Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1

for n = 1 to NUM_ITERATIONS
    for each training example (x, y)
        calculate $f(\mathbf{x})$
        for each weight i
            $w_i = w_i - \alpha\,(f(\mathbf{x}) - y)\,x_i$
        end
    end
end

$\alpha$ = a small constant, the “learning rate”
Convergence theorem:
If the data is linearly separable, then application of the learning rule will
find a separating decision boundary, within a finite number of iterations.
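A minimal Matlab sketch of this loop, assuming a data matrix X (one row per example, with a first column of -1s absorbing the threshold, as in the rearrangement earlier) and a target vector y of +1/-1 values; the learning rate and iteration count are illustrative:

    % Stochastic gradient descent for the linear model: a minimal sketch
    [N, d] = size(X);                    % X and y assumed to exist
    w      = 2*rand(d, 1) - 1;           % random weights in range -1 to +1
    alpha  = 0.001;                      % the "learning rate" (illustrative value)
    NUM_ITERATIONS = 100;

    for n = 1:NUM_ITERATIONS
        for j = 1:N                                  % for each training example (x, y)
            f = X(j, :) * w;                         % calculate f(x)
            w = w - alpha * (f - y(j)) * X(j, :)';   % update all the weights at once
        end
    end
    % Predict:  yhat = sign(X * w)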
A problem…
initialise weight values to random numbers in range -1 to +1
. . .
Depending on the random initialisation, the linear classifier will converge to one of the valid boundaries… but randomly!

[Scatter plot: height vs. weight, showing several equally valid separating boundaries]
Break for 30 mins
Possible uses of your break:
1. Ensure you have a working login for the computer lab this afternoon.
2. Talk to me or a demonstrator about the material.
3. Read ahead in the notes.
4. Go get a coffee.
Another model : logistic regression
Our model f(x) has range plus/minus INFINITY!
Is this really necessary?
What is the confidence of our decisions? Can we estimate PROBABILITIES?
Logistic regression estimates p( y=1 | x )
Sigmoid function
Output in range [0,1]
$p(y = 1 \mid \mathbf{x}) = f(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} - \theta)}}$
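A one-line Matlab version of the sigmoid (a sketch, assuming column vectors w and x and a scalar theta; the names are mine):

    % Sigmoid output in range [0,1]
    sigmoid = @(w, x, theta) 1 ./ (1 + exp(-(w' * x - theta)));
    % e.g.  p = sigmoid(w, [180; 78], theta)   is interpreted as p(y=1 | x)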
Another error : cross entropy
$e = -\sum_{j=1}^{N}\Big[\, y_j \ln f(\mathbf{x}_j) + (1 - y_j)\ln\big(1 - f(\mathbf{x}_j)\big) \Big]$
Above we assume y is either 0 or 1.
Derived from the statistical principle of Likelihood.
We’ll see this again in a few weeks.
Gradient Descent
$e = -\sum_{j=1}^{N}\Big[\, y_j \ln f(\mathbf{x}_j) + (1 - y_j)\ln\big(1 - f(\mathbf{x}_j)\big) \Big]$

$\frac{\partial e}{\partial w_i} = \frac{\partial e}{\partial f}\,\frac{\partial f}{\partial w_i} = (f - y)\,x_i$
Follow the NEGATIVE gradient.
SAME update as for squared error!
Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1

for n = 1 to NUM_ITERATIONS
    for each training example (x, y)
        calculate $f(\mathbf{x})$
        for each weight i
            $w_i = w_i - \alpha\,(f(\mathbf{x}) - y)\,x_i$
        end
    end
end

$\alpha$ = a small constant, the “learning rate”
A natural ‘pairing’ of error function to model
Cross-entropy error

$e = -\sum_{j=1}^{N}\Big[\, y_j \ln f(\mathbf{x}_j) + (1 - y_j)\ln\big(1 - f(\mathbf{x}_j)\big) \Big]$

pairs naturally with the logistic model

$f(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} - \theta)}}$

Squared error

$e = \frac{1}{2}\big(f(\mathbf{x}) - y\big)^2$

pairs naturally with the linear model

$f(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i - \theta$

Either way, the gradient is the same:

$\frac{\partial e}{\partial w_i} = \frac{\partial e}{\partial f}\,\frac{\partial f}{\partial w_i} = (f - y)\,x_i$
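Because the gradient is the same, the training loop is the earlier sketch with one change: f(x) is passed through the sigmoid, and the targets y are 0/1 rather than +1/-1. Again a sketch with assumed X and y and illustrative constants:

    % Stochastic gradient descent for logistic regression: a minimal sketch
    [N, d] = size(X);                    % X (with a -1 column absorbing theta) and y in {0,1} assumed to exist
    w     = 2*rand(d, 1) - 1;            % random weights in range -1 to +1
    alpha = 0.001;                       % learning rate (illustrative value)

    for n = 1:100
        for j = 1:N
            f = 1 / (1 + exp(-X(j, :) * w));          % sigmoid output, p(y=1 | x)
            w = w - alpha * (f - y(j)) * X(j, :)';    % same (f - y) * x update as before
        end
    end
    % Predict probability for a new augmented row xnew:  p = 1 / (1 + exp(-xnew * w))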
Still a problem…
initialise weight values to random numbers in range -1 to +1
. . .
Depending on the random initialisation, the logistic regression classifier will converge to one of the valid boundaries… but randomly!

[Scatter plot: height vs. weight, showing several equally valid separating boundaries]
Geometry of Linear Models (see notes)
Another problem - new data…. “non-linearly separable”
[Scatter plot: height vs. weight, a dataset that no single straight line can separate]

Our model does not match the problem!

We’ll deal with this next week!
End of Day 1
Now… read the notes.
Read the “Surrounded by Statistics” chapter in the handouts.
The fog will clear.
This afternoon… learn MATLAB.
This week’s exercise is unassessed, but you are strongly advised to get in as much practice as you can….