Transcript Slide 1

Pattern recognition – basic concepts
Sample (Iris data, first five instances):

Inst. | Input attributes                                        | Output attribute
      | Sepal Length | Sepal Width | Petal Length | Petal Width | Species
1     | 5.1          | 3.5         | 1.4          | 0.2         | setosa
2     | 4.9          | 3.0         | 1.4          | 0.2         | setosa
3     | 4.7          | 3.2         | 1.3          | 0.2         | setosa
4     | 4.6          | 3.1         | 1.5          | 0.2         | setosa
5     | 5.0          | 3.6         | 1.4          | 0.2         | setosa
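The first five instances of the table above can be represented directly in code; a minimal sketch (the variable names `samples`, `X`, and `t` are illustrative, not from the source):

```python
# First five instances of the Iris data shown above: each sample pairs
# the input attributes (sepal/petal measurements) with the output
# attribute (species).
samples = [
    # sepal_len, sepal_wid, petal_len, petal_wid, species
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
]

# Split into input attributes (features) and output attribute (class):
X = [s[:4] for s in samples]
t = [s[4] for s in samples]
print(X[0], t[0])  # (5.1, 3.5, 1.4, 0.2) setosa
```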
• input attribute, attribute, feature, input variable, independent variable
  (Czech: atribut, rys, příznak, vstupní proměnná, nezávisle proměnná)
• class, output variable, dependent variable
  (Czech: třída, výstupní proměnná, závislá proměnná)
• sample
  (Czech: vzorek)
Handwritten digits
• each digit 28 x 28 pixels
– so each digit can be represented by a vector x
comprising 784 real numbers
• goal:
– build a machine that will take x as input and will
produce the identity of the digit 0 … 9 as the output
– non-trivial problem due to the wide variability of
handwriting
– could be tackled by using rules for distinguishing the
digits based on the shapes of the strokes
– in practice such an approach leads to a proliferation of
rules and of exceptions to the rules and so on, and
invariably gives poor results
• better way – adopt a machine learning algorithm (i.e. use some adaptive model)
[Figure: input → model → output; the model's internal parameters, which influence its behavior, must be adjusted]
• tune the parameters of the model using the
training set
– training set is a data set of N digits {x1, …, xN}
• the categories of the digits in the training set are known in advance (inspected manually); they are expressed as a target vector t, one for each digit
e.g. the target vector for the digit "3":

digit:           0  1  2  3  4  5  6  7  8  9
target vector t: 0  0  0  1  0  0  0  0  0  0
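The 1-of-10 target encoding described above can be sketched as a small helper (the function name `one_hot` is our own, not from the source):

```python
def one_hot(digit, num_classes=10):
    """Encode a digit label as a target vector t with a 1 at the
    position of the digit and 0 everywhere else."""
    t = [0] * num_classes
    t[digit] = 1
    return t

print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```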
The result of running the machine learning algorithm
can be expressed as a function y(x) which takes a
new digit image x as input and that generates an
output vector y, encoded in the same way as the
target vectors.
[Figure: the input vector x = (x1, x2, …, x784) is fed into the function y(x), which produces the output vector y]
• The precise form of the function y(x) is
determined during the training phase
(learning phase).
– Model adapts its parameters (i.e. learns) on the
basis of the training data {x1, …, xN}.
• Trained model can then determine the
identity of new, previously unseen, digit
images which are said to comprise a test set.
• The ability to correctly categorize new examples that differ from those used for training is known as generalization.
For most practical applications, the original input variables are
typically preprocessed to transform them into some new
space of variables where, it is hoped, the pattern recognition
problem will be easier to solve.
[Figure: the input vector x = (x1, x2, …, x784) passes through a preprocessing stage before the function y(x) produces the output vector y]
• Translate and scale the images of the digits so
that each digit is contained within a box of a
fixed size.
• This greatly reduces the variability within each
digit class, because the location and scale of
all the digits are now the same.
• This pre-processing stage is sometimes also
called feature extraction.
• Test data must be pre-processed using the
same steps as the training data.
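As an illustration of applying the same pre-processing steps to both sets, here is a minimal sketch assuming simple standardization (mean/std) as the pre-processing step; the data are randomly generated stand-ins for the 784-pixel vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 784))  # stand-in training images
X_test = rng.normal(5.0, 2.0, size=(20, 784))    # stand-in test images

# Fit the pre-processing on the training set only ...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ... and apply exactly the same transformation to both sets.
X_train_p = (X_train - mu) / sigma
X_test_p = (X_test - mu) / sigma  # NOT the test set's own mean/std
```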
Feature selection and feature extraction
[Figure: feature selection keeps a subset of the original attributes unchanged (e.g. x1, x5, x103 out of x1, …, x784); feature extraction first maps all original attributes x1, …, x784 to new variables x*1, …, x*784, from which a subset (e.g. x*18, x*152, x*309, x*666) may then be kept]
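The difference between selection and extraction can be sketched as follows (a toy example; the random linear map standing in for extraction is our assumption – real extraction methods such as PCA learn the mapping from data):

```python
import numpy as np

x = np.arange(1.0, 785.0)          # a hypothetical 784-dimensional input

# Feature selection: keep a subset of the original attributes unchanged.
selected_idx = [0, 4, 102]         # e.g. x1, x5, x103 (0-based indices)
x_sel = x[selected_idx]

# Feature extraction: build new attributes x* as functions (here: random
# linear combinations) of ALL original attributes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 784))
x_star = W @ x                     # 3 new features, each mixes all 784 inputs

print(x_sel)                       # the original values survive selection
```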
Dimensionality reduction
• We want to reduce the number of dimensions because:
– Efficiency
• measurement costs
• storage costs
• computation costs
– Problem may be solved more easily in the new
space
– Improved classification performance
– Ease of interpretation
Curse of dimensionality
Bishop, Pattern Recognition and Machine Learning
Supervised learning
• training data comprises examples of the input
vectors along with their corresponding target
vectors (e.g. digit recognition)
– classification – aim: assign an input vector to one
of a finite number of discrete categories
– regression (data/curve fitting) - desired output
consists of one or more continuous variables
Unsupervised learning
• training data consists of a set of input vectors
x without any corresponding target value
• goals:
– discover groups of similar examples within the
data – clustering
– project the data from a high-dimensional space
down to two or three dimensions for the purpose
of visualization
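The clustering goal can be illustrated with a toy one-dimensional k-means loop (a method chosen here for illustration, not discussed in the source):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical groups of 1-D input vectors; no target values are given.
X = np.concatenate([rng.normal(0.0, 0.5, 50), rng.normal(5.0, 0.5, 50)])

# Minimal k-means with k = 2: repeatedly assign each point to the nearest
# center, then move each center to the mean of its assigned points.
centers = np.array([0.0, 1.0])
for _ in range(10):
    labels = np.abs(X[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([X[labels == k].mean() for k in range(2)])

print(np.sort(centers))  # the centers settle near the two group means
```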
Polynomial curve fitting
• regression problem, supervised
• we observe a real-valued input variable x and
we wish to use this observation to predict the
value of a real-valued target variable t
• artificial example - sin(2πxn) + random noise
• training set: N observations of x written as x =
(x1, … , xN)T + corresponding observations of
the values of t: t = (t1, … , tN)T
[Figure: N = 10 training points sampled from sin(2πx) plus random noise; each input xn is paired with its target value tn (x1 → t1, x2 → t2, etc.), forming the training data set {x, t}]
adapted from Bishop, Pattern Recognition and Machine Learning
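Generating such a training set can be sketched as follows (the noise level 0.3 and the uniform sampling of x are our assumptions, not taken from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = rng.uniform(0.0, 1.0, N)                          # inputs x1, ..., xN
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)   # noisy targets
# {x, t} is the training data set of N observations.
```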
• goal: exploit the training set in order to make a prediction of the target variable t’ for a new value x’ of the input variable
• this is generally difficult, as
– we have to generalize from the finite data set
– data are corrupted by the noise, so for the given x’ there is
uncertainty in the value of t’
[Figure: for a new input value x’, the noise in the data makes the corresponding target value t’ uncertain]
adapted from Bishop, Pattern Recognition and Machine Learning
• decision: which method to use?
• From the plethora of possibilities (which you do not know about yet), we choose a simple one: the data will be fitted using the polynomial function
y(x, w) = w0 + w1·x + w2·x^2 + … + wM·x^M = Σ_{j=0}^{M} wj·x^j
• polynomial coefficients w0, …, wM form a
vector w
– they represent parameters of the model that
must be set in the training phase
– the polynomial model is still linear regression!!!
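The reason the polynomial model is still linear regression is that y(x, w) is linear in the coefficients w; stacking the powers of x into a design matrix makes this explicit (a small numerical check, not from the source):

```python
import numpy as np

M = 3
x = np.array([0.1, 0.4, 0.7, 0.9])
# Columns of the design matrix are x^0, x^1, ..., x^M:
Phi = np.vander(x, M + 1, increasing=True)

w = np.array([1.0, -2.0, 0.5, 3.0])
y = Phi @ w  # same as w0 + w1*x + w2*x^2 + w3*x^3: linear in w
assert np.allclose(y, w[0] + w[1]*x + w[2]*x**2 + w[3]*x**3)
```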
• The values of coefficients will be determined
by minimizing the error function
• It measures the misfit between the function
y(x, w) and the training set data points
• one simple choice: the sum of squared errors (SSE) between the predictions y(xn, w) for each data point xn and the corresponding values tn
SSE = (1/2) · Σ_{n=1}^{N} (y(xn, w) − tn)^2

[Figure: the fitted function y(x, w) and the vertical displacement of each data point tn from it]
Bishop, Pattern Recognition and Machine Learning
• solving the curve fitting problem means choosing the value of w for which SSE(w) is as small as possible … w* → y(x, w*)
– "as small as possible" means finding the minimum of SSE(w) (i.e. the point where its derivatives with respect to the coefficients are zero)
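Because the model is linear in w, minimizing the sum of squared errors has a closed-form least-squares solution; a minimal sketch (the synthetic data and polynomial order are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

Phi = np.vander(x, M + 1, increasing=True)        # design matrix
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes the SSE

def sse(w):
    """SSE = (1/2) * sum_n (y(xn, w) - tn)^2."""
    return 0.5 * np.sum((Phi @ w - t) ** 2)

# w_star is the minimizer: perturbing it cannot decrease the SSE.
assert sse(w_star) <= sse(w_star + 0.01) + 1e-12
```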