Machine Learning – Lecture 2
Probability Density Estimation
21.04.2009
Bastian Leibe
RWTH Aachen
http://www.umic.rwth-aachen.de/multimedia
[email protected]
Many slides adapted from B. Schiele
Announcements
• Course webpage
http://www.umic.rwth-aachen.de/multimedia (→ Teaching → Machine Learning)
Slides will be made available on the webpage
• L2P electronic repository
Exercises and supplementary materials will be posted on the L2P
• Please subscribe to the lecture on the Campus system!
Important to get email announcements and L2P access!
Announcements
• Lecture slot on Wednesday 16:00-17:30h
Who has conflicts there?
Do we need an alternative slot?
Course Outline
• Fundamentals (2 weeks)
Bayes Decision Theory
Probability Density Estimation
• Discriminative Approaches (5 weeks)
Linear Discriminant Functions
Statistical Learning Theory & SVMs
Boosting, Decision Trees
• Generative Models (5 weeks)
Bayesian Networks
Markov Random Fields
• Regression Problems (2 weeks)
Gaussian Processes
Recap: Bayes Decision Theory
[Figure: class-conditional densities p(x|a), p(x|b) (likelihood); p(x|a)p(a), p(x|b)p(b) (likelihood × prior) with the decision boundary; posteriors p(a|x), p(b|x).]

Posterior = (Likelihood × Prior) / Normalization Factor
Recap: Bayes Decision Theory
• Optimal decision rule
Decide for C1 if
p(C_1 | x) > p(C_2 | x)
This is equivalent to
p(x | C_1) p(C_1) > p(x | C_2) p(C_2)
Which is again equivalent to (likelihood-ratio test)
\frac{p(x | C_1)}{p(x | C_2)} > \frac{p(C_2)}{p(C_1)}
where the right-hand side acts as the decision threshold.
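As an aside (not part of the slides), the likelihood-ratio test is easy to express in code. The sketch below assumes two classes with 1D Gaussian class-conditional densities; the means, variances, and priors are made-up illustrative values.

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # 1D Gaussian density N(x | mu, sigma^2)
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    # Assumed example class models p(x|C1), p(x|C2) and priors p(C1), p(C2)
    mu1, sigma1, prior1 = 0.0, 1.0, 0.7
    mu2, sigma2, prior2 = 2.0, 1.0, 0.3

    def decide(x):
        # Likelihood-ratio test: decide C1 if p(x|C1)/p(x|C2) > p(C2)/p(C1)
        likelihood_ratio = gaussian_pdf(x, mu1, sigma1) / gaussian_pdf(x, mu2, sigma2)
        threshold = prior2 / prior1
        return "C1" if likelihood_ratio > threshold else "C2"

    for x in [-1.0, 0.5, 1.5, 3.0]:
        print(x, decide(x))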
Recap: Bayes Decision Theory
• Decision regions: R1, R2, R3, …
Recap: Classifying with Loss Functions
• In general, we can formalize this by introducing a loss matrix L_{kj} with
L_{kj} = loss for decision C_j if the truth is C_k.
• Example: cancer diagnosis
L_{cancer diagnosis} = a 2×2 matrix whose rows are indexed by the truth and whose columns by the decision: zero loss on the diagonal, a small loss for diagnosing cancer when the patient is healthy, and a large loss for diagnosing healthy when the truth is cancer.
Recap: Minimizing the Expected Loss
• Optimal solution is the one that minimizes the loss.
But: loss function depends on the true class, which is unknown.
• Solution: Minimize the expected loss
E[L] = \sum_k \sum_j \int_{R_j} L_{kj} \, p(x, C_k) \, dx
• This can be done by choosing the regions R_j such that each x is assigned to the class C_j that minimizes \sum_k L_{kj} \, p(C_k | x), which is easy to do once we know the posterior class probabilities p(C_k | x).
Recap: The Reject Option
• Classification errors arise from regions where the largest
posterior probability p(Ck jx) is significantly less than 1.
These are the regions where we are relatively uncertain about
class membership.
For some applications, it may be better to reject the automatic
decision entirely in such a case and e.g. consult a human expert.
Topics of This Lecture
• Probability Density Estimation
General concepts
Gaussian distribution
• Parametric Methods
Maximum Likelihood approach
Bayesian vs. Frequentist views on probability
Bayesian Learning
• Non-Parametric Methods
Histograms
Kernel density estimation
K-Nearest Neighbors
k-NN for Classification
Bias-Variance tradeoff
Probability Density Estimation
• Up to now
Bayes optimal classification
Based on the probabilities
p(x | C_k) p(C_k)
• How can we estimate (=learn) those probability
densities?
Supervised training case: data and class labels are known.
Estimate the probability density for each class Ck separately:
p(x | C_k)
(For simplicity of notation, we will drop the class label Ck in the
following.)
Probability Density Estimation
• Data: x_1, x_2, x_3, x_4, …
• Estimate: p(x)
• Methods
Parametric representations
Non-parametric representations
Mixture models (next lecture)
The Gaussian (or Normal) Distribution
• One-dimensional case
Mean μ, variance σ²:
N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
• Multi-dimensional case
Mean μ, covariance Σ:
N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
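As a side note, both densities translate directly into code. The following NumPy sketch (illustrative only; the example values for x, mu, and Sigma are assumptions, not from the lecture) evaluates the 1D and the multi-dimensional formula.

    import numpy as np

    def gauss_1d(x, mu, sigma2):
        # N(x | mu, sigma^2)
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    def gauss_nd(x, mu, Sigma):
        # N(x | mu, Sigma) for a D-dimensional x
        D = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

    print(gauss_1d(1.0, mu=0.0, sigma2=2.0))
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    print(gauss_nd(np.array([0.5, 0.5]), mu, Sigma))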
Gaussian Distribution – Properties
• Central Limit Theorem
“The distribution of the sum of N i.i.d. random variables
becomes increasingly Gaussian as N grows.”
In practice, the convergence to a Gaussian can be very rapid.
This makes the Gaussian interesting for many applications.
• Example: N uniform [0,1] random variables.
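A quick simulation (not from the slides) makes this concrete: average N uniform [0,1] variables many times and compare the empirical mean and standard deviation of the average with the values the Central Limit Theorem predicts (mean 1/2, variance 1/(12N)).

    import numpy as np

    rng = np.random.default_rng(0)
    for N in [1, 2, 10]:
        # Means of N uniform [0,1] variables, repeated many times
        means = rng.uniform(0, 1, size=(100000, N)).mean(axis=1)
        # CLT prediction: mean 0.5, standard deviation sqrt((1/12)/N)
        print(N, means.mean(), means.std(), np.sqrt(1.0 / 12.0 / N))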
Gaussian Distribution – Properties
• Quadratic Form
N(x | μ, Σ) depends on x through the quadratic form in the exponent
\Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)
Here, Δ is often called the Mahalanobis distance from μ to x.
• Shape of the Gaussian
Σ is a real, symmetric matrix. We can therefore decompose it into its eigenvectors
\Sigma = \sum_{i=1}^{D} \lambda_i u_i u_i^T
and thus obtain
\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}  with  y_i = u_i^T (x - \mu).
Constant density on ellipsoids with main directions along the eigenvectors u_i and scaling factors \sqrt{\lambda_i}.
Gaussian Distribution – Properties
• Special cases
Full covariance matrix: \Sigma = [\sigma_{ij}] → general ellipsoid shape
Diagonal covariance matrix: \Sigma = \mathrm{diag}\{\sigma_i\} → axis-aligned ellipsoid
Uniform variance: \Sigma = \sigma^2 I → hypersphere
Gaussian Distribution – Properties
• The marginals of a Gaussian are again Gaussians:
Topics of This Lecture
• Probability Density Estimation
General concepts
Gaussian distribution
• Parametric Methods
Maximum Likelihood approach
Bayesian vs. Frequentist views on probability
Bayesian Learning
• Non-Parametric Methods
Histograms
Kernel density estimation
K-Nearest Neighbors
k-NN for Classification
Bias-Variance tradeoff
Parametric Methods
• Given
Data X = \{x_1, x_2, \dots, x_N\}
Parametric form of the distribution p(x | \theta) with parameters θ
E.g. for a Gaussian distribution: \theta = (\mu, \sigma)
• Learning
Estimation of the parameters θ
• Likelihood of θ
Probability that the data X have indeed been generated from a probability density with parameters θ:
L(\theta) = p(X | \theta)
Maximum Likelihood Approach
• Computation of the likelihood
Single data point: p(x_n | \theta)
Assumption: all data points are independent
L(\theta) = p(X | \theta) = \prod_{n=1}^{N} p(x_n | \theta)
Log-likelihood
E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n | \theta)
Estimation of the parameters θ (Learning)
– Maximize the likelihood
– Minimize the negative log-likelihood
Maximum Likelihood Approach
• Likelihood: L(\theta) = p(X | \theta) = \prod_{n=1}^{N} p(x_n | \theta)
• We want to obtain \hat{\theta} such that L(\hat{\theta}) is maximized.
[Figure: p(X | \theta) as a function of θ, with its maximum at \hat{\theta}.]
Maximum Likelihood Approach
• Minimizing the log-likelihood
How do we minimize a function?
Take the derivative and set it to zero.
\frac{\partial}{\partial\theta} E(\theta) = -\frac{\partial}{\partial\theta} \sum_{n=1}^{N} \ln p(x_n | \theta) = -\sum_{n=1}^{N} \frac{\partial p(x_n | \theta) / \partial\theta}{p(x_n | \theta)} \overset{!}{=} 0
• Log-likelihood for Normal distribution (1D case)
E(\theta) = -\sum_{n=1}^{N} \ln p(x_n | \mu, \sigma) = -\sum_{n=1}^{N} \ln\left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\|x_n - \mu\|^2}{2\sigma^2} \right\} \right]
Maximum Likelihood Approach
• Minimizing the log-likelihood (derivative w.r.t. the mean μ), using p(x_n | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\|x_n - \mu\|^2}{2\sigma^2} \right\}:
\frac{\partial}{\partial\mu} E(\mu, \sigma) = -\sum_{n=1}^{N} \frac{\partial p(x_n | \mu, \sigma)/\partial\mu}{p(x_n | \mu, \sigma)} = -\sum_{n=1}^{N} \frac{x_n - \mu}{\sigma^2} = -\frac{1}{\sigma^2} \left( \sum_{n=1}^{N} x_n - N\mu \right)
Setting this derivative to zero yields
\frac{\partial}{\partial\mu} E(\mu, \sigma) \overset{!}{=} 0 \quad \Leftrightarrow \quad \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n
Maximum Likelihood Approach
• We thus obtain
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n    (“sample mean”)
• In a similar fashion, we get
\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2    (“sample variance”)
• \hat{\theta} = (\hat{\mu}, \hat{\sigma}) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution.
• This is a very important result.
• Unfortunately, it is wrong…
Maximum Likelihood Approach
• Or not wrong, but rather biased…
• Assume the samples x_1, x_2, …, x_N come from a true Gaussian distribution with mean μ and variance σ².
We can now compute the expectations of the ML estimates with
respect to the data set values. It can be shown that
E[\hat{\mu}_{ML}] = \mu
E[\hat{\sigma}^2_{ML}] = \frac{N - 1}{N} \sigma^2
The ML estimate will underestimate the true variance.
• Corrected estimate:
\tilde{\sigma}^2 = \frac{N}{N - 1} \hat{\sigma}^2_{ML} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2
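This bias is easy to check numerically. The sketch below (illustrative values, not from the lecture) repeatedly draws small samples from a known Gaussian and compares the average ML variance estimate, which should come out near (N-1)/N · σ², with the corrected estimate.

    import numpy as np

    rng = np.random.default_rng(1)
    mu_true, sigma2_true, N = 0.0, 4.0, 5

    ml_vars, corrected_vars = [], []
    for _ in range(20000):
        x = rng.normal(mu_true, np.sqrt(sigma2_true), size=N)
        mu_hat = x.mean()
        ml_vars.append(np.mean((x - mu_hat) ** 2))                  # divides by N
        corrected_vars.append(np.sum((x - mu_hat) ** 2) / (N - 1))  # divides by N-1

    # Expect roughly (N-1)/N * sigma2_true for the ML estimate
    print(np.mean(ml_vars), (N - 1) / N * sigma2_true, np.mean(corrected_vars))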
Maximum Likelihood – Limitations
• Maximum Likelihood has several significant limitations
It systematically underestimates the variance of the distribution!
E.g. consider the case N = 1, X = \{x_1\}:
Maximum-likelihood estimate: \hat{\mu} = x_1 and \hat{\sigma} = 0!
We say ML overfits to the observed data.
We will still often use ML, but it is important to know about this
effect.
Deeper Reason
• Maximum Likelihood is a Frequentist concept
In the Frequentist view, probabilities are the frequencies of
random, repeatable events.
These frequencies are fixed, but can be estimated more
precisely when more data is available.
• This is in contrast to the Bayesian interpretation
In the Bayesian view, probabilities quantify the uncertainty
about certain states or events.
This uncertainty can be revised in the light of new evidence.
• Bayesians and Frequentists do not like
each other too well…
Bayesian vs. Frequentist View
• To see the difference…
Suppose we want to estimate the uncertainty whether the Arctic
ice cap will have disappeared by the end of the century.
This question makes no sense in a Frequentist view, since the
event cannot be repeated numerous times.
In the Bayesian view, we generally have a prior, e.g. from
calculations how fast the polar ice is melting.
If we now get fresh evidence, e.g. from a new satellite, we may
revise our opinion and update the uncertainty from the prior.
Posterior ∝ Likelihood × Prior
This generally allows us to obtain better uncertainty estimates in many situations.
• Main Frequentist criticism
The prior has to come from somewhere and if it is wrong, the
result will be worse.
Bayesian Approach to Parameter Learning
• Conceptual shift
Maximum Likelihood views the true parameter vector θ as unknown but fixed.
In Bayesian learning, we consider θ to be a random variable.
• This allows us to use knowledge about the parameters θ,
i.e. to use a prior for θ.
Training data then converts this prior distribution on θ into a posterior probability density.
The prior thus encodes knowledge we have about the type of distribution we expect to see for θ.
Bayesian Learning Approach
• Bayesian view:
Consider the parameter vector θ as a random variable.
When estimating the parameters, what we compute is
p(x | X) = \int p(x, \theta | X) \, d\theta
p(x, \theta | X) = p(x | \theta, X) \, p(\theta | X)
Assumption: given θ, the first factor does not depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf), so p(x | \theta, X) = p(x | \theta). Therefore
p(x | X) = \int p(x | \theta) \, p(\theta | X) \, d\theta
Bayesian Learning Approach
p(x | X) = \int p(x | \theta) \, p(\theta | X) \, d\theta
p(\theta | X) = \frac{p(X | \theta) \, p(\theta)}{p(X)} = \frac{p(\theta)}{p(X)} L(\theta)
p(X) = \int p(X | \theta) \, p(\theta) \, d\theta = \int L(\theta) \, p(\theta) \, d\theta
• Inserting this above, we obtain
p(x | X) = \int \frac{p(x | \theta) \, L(\theta) \, p(\theta)}{p(X)} \, d\theta = \frac{\int p(x | \theta) \, L(\theta) \, p(\theta) \, d\theta}{\int L(\theta) \, p(\theta) \, d\theta}
Bayesian Learning Approach
• Discussion
p(x | X) = \frac{\int p(x | \theta) \, L(\theta) \, p(\theta) \, d\theta}{\int L(\theta) \, p(\theta) \, d\theta}
Here p(x | \theta) is the estimate for x based on the parametric form θ, L(\theta) is the likelihood of the parametric form θ given the data set X, p(\theta) is the prior for the parameters θ, and the denominator is the normalization: an integral over all possible values of θ.
If we now plug in a (suitable) prior p(\theta), we can estimate p(x | X) from the data set X.
Bayesian Density Estimation
• Discussion
p(x | X) = \int p(x | \theta) \, p(\theta | X) \, d\theta = \frac{\int p(x | \theta) \, L(\theta) \, p(\theta) \, d\theta}{\int L(\theta) \, p(\theta) \, d\theta}
The probability p(\theta | X) makes the dependency of the estimate on the data explicit.
If p(\theta | X) is very small everywhere, but large for one \hat{\theta}, then
p(x | X) \approx p(x | \hat{\theta})
The more uncertain we are about θ, the more we average over all parameter values.
Bayesian Density Estimation
• Problem
In the general case, the integration over θ is not possible (or only possible stochastically).
• Example where an analytical solution is possible
Normal distribution for the data, σ² assumed known and fixed.
Estimate the distribution of the mean:
p(\mu | X) = \frac{p(X | \mu) \, p(\mu)}{p(X)}
Prior: We assume a Gaussian prior over μ: p(\mu) = N(\mu | \mu_0, \sigma_0^2).
Bayesian Learning Approach
• Sample mean:
\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n
• Bayes estimate: the posterior p(\mu | X) is again Gaussian, with
\mu_N = \frac{\sigma^2 \mu_0 + N \sigma_0^2 \bar{x}}{N \sigma_0^2 + \sigma^2}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
• Note: [Figure: posterior p(\mu | X) for increasing N, for a prior with \mu_0 = 0.]
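A minimal sketch of these update equations (the data, the prior parameters μ_0 and σ_0², and the known variance σ² are assumed example values, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2 = 1.0              # known data variance
    mu0, sigma2_0 = 0.0, 0.5  # Gaussian prior over the mean

    x = rng.normal(0.8, np.sqrt(sigma2), size=10)  # data with true mean 0.8
    N, x_bar = len(x), x.mean()

    # Posterior p(mu | X) = N(mu | mu_N, sigma_N^2)
    mu_N = (sigma2 * mu0 + N * sigma2_0 * x_bar) / (N * sigma2_0 + sigma2)
    sigma2_N = 1.0 / (1.0 / sigma2_0 + N / sigma2)
    print(x_bar, mu_N, sigma2_N)

Note how the posterior mean interpolates between the prior mean μ_0 and the sample mean, and how the posterior variance shrinks as N grows.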
Summary: ML vs. Bayesian Learning
• Maximum Likelihood
Simple approach, often analytically possible.
Problem: estimation is biased, tends to overfit to the data.
Often needs some correction or regularization.
But:
– Approximation becomes accurate for N → ∞.
• Bayesian Learning
General approach, avoids the estimation bias through a prior.
Problems:
– Need to choose a suitable prior (not always obvious).
– Integral over µ often not analytically feasible anymore.
But:
– Efficient stochastic sampling techniques available (see Lecture 15).
(In this lecture, we’ll use both concepts wherever appropriate)
Topics of This Lecture
• Probability Density Estimation
General concepts
Gaussian distribution
• Parametric Methods
Maximum Likelihood approach
Bayesian vs. Frequentist views on probability
Bayesian Learning
• Non-Parametric Methods
Histograms
Kernel density estimation
K-Nearest Neighbors
k-NN for Classification
Bias-Variance tradeoff
Non-Parametric Methods
• Non-parametric representations
Often the functional form of the distribution is unknown
• Estimate probability density from data
Histograms
Kernel density estimation (Parzen window / Gaussian kernels)
k-Nearest-Neighbor
Histograms
• Basic idea:
Partition the data space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin.
[Figure: histogram of N = 10 data points.]
Often, the same width is used for all bins, Δ_i = Δ.
This can be done, in principle, for any dimensionality D…
…but the required number of bins grows exponentially with D!
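In code, a histogram density estimate is one call to a histogram routine with density normalization; a minimal NumPy sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(0.0, 1.0, size=1000)

    # Partition [-4, 4] into bins of equal width Delta and normalize the counts
    counts, edges = np.histogram(data, bins=20, range=(-4, 4), density=True)
    delta = edges[1] - edges[0]
    print(delta, counts.sum() * delta)  # the estimate integrates to 1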
Histograms
• The bin width Δ acts as a smoothing factor.
[Figure: histogram estimates of the same data for three bin widths: too small (not smooth enough), intermediate (about OK), too large (too smooth).]
Summary: Histograms
• Properties
Very general. In the limit (N → ∞), every probability density can be represented.
No need to store the data points once histogram is computed.
Rather brute-force
• Problems
High-dimensional feature spaces
– D-dimensional space with M bins/dimension will require M^D bins!
Requires an exponentially growing number of data points
“Curse of dimensionality”
Discontinuities at bin edges
Bin size?
– too large: too much smoothing
– too small: too much noise
Statistically Better-Founded Approach
• Data point x comes from pdf p(x)
Probability that x falls into a small region R:
P = \int_R p(y) \, dy
• If R is sufficiently small, p(x) is roughly constant
Let V be the volume of R:
P = \int_R p(y) \, dy \approx p(x) V
• If the number N of samples is sufficiently large, we can estimate P as
P = \frac{K}{N} \quad \Rightarrow \quad p(x) \approx \frac{K}{NV}
Statistically Better-Founded Approach
p(x) \approx \frac{K}{NV}
– fixed V, determine K → Kernel Methods
– fixed K, determine V → K-Nearest Neighbor
• Kernel methods
Example: Determine the number K of data points inside a fixed hypercube…
Kernel Methods
• Parzen Window
Hypercube of dimension D with edge length h:
k(u) = \begin{cases} 1, & |u_i| \le \frac{1}{2}, \; i = 1, \dots, D \\ 0, & \text{else} \end{cases}    (“kernel function”)
K = \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right), \qquad V = \int k\!\left(\frac{u}{h}\right) du = h^D
Probability density estimate:
p(x) \approx \frac{K}{NV} = \frac{1}{N h^D} \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right)
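A direct NumPy translation of this estimate (a sketch; the data and the window size h are made-up example values):

    import numpy as np

    def parzen_window_density(x, data, h):
        # k(u) = 1 if all |u_i| <= 1/2, else 0; volume V = h^D
        u = (x - data) / h                           # shape (N, D)
        inside = np.all(np.abs(u) <= 0.5, axis=1)    # points falling in the hypercube
        N, D = data.shape
        return inside.sum() / (N * h ** D)

    rng = np.random.default_rng(4)
    data = rng.normal(0.0, 1.0, size=(500, 2))
    print(parzen_window_density(np.array([0.0, 0.0]), data, h=0.5))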
Kernel Methods: Parzen Window
• Interpretations
1. We place a kernel window k at
location x and count how many
data points fall inside it.
2. We place a kernel window k around
each data point x_n and sum up
their influences at location x.
Direct visualization of the density.
• Still, we have artificial discontinuities at the cube
boundaries…
We can obtain a smoother density model if we choose a
smoother kernel function, e.g. a Gaussian
Kernel Methods: Gaussian Kernel
• Gaussian kernel
Kernel function:
k(u) = \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{u^2}{2h^2} \right\}
K = \sum_{n=1}^{N} k(x - x_n), \qquad V = \int k(u) \, du = 1
Probability density estimate:
p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}
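The same estimate with the Gaussian kernel, again as a sketch with assumed example data and bandwidth h:

    import numpy as np

    def gaussian_kde(x, data, h):
        # p(x) = 1/N sum_n N(x | x_n, h^2 I)
        N, D = data.shape
        sq_dist = np.sum((data - x) ** 2, axis=1)
        norm = (2 * np.pi * h ** 2) ** (D / 2)
        return np.mean(np.exp(-sq_dist / (2 * h ** 2)) / norm)

    rng = np.random.default_rng(5)
    data = rng.normal(0.0, 1.0, size=(500, 2))
    print(gaussian_kde(np.array([0.0, 0.0]), data, h=0.3))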
Gauss Kernel: Examples
[Figure: Gaussian kernel density estimates for three bandwidths: h too small (not smooth enough), intermediate (about OK), h too large (too smooth).]
h acts as a smoother.
Kernel Methods
• In general
Any kernel k(u) such that k(u) \ge 0 and \int k(u) \, du = 1 can be used. Then
K = \sum_{n=1}^{N} k(x - x_n)
And we get the probability density estimate
p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} k(x - x_n)
Statistically Better-Founded Approach
p(x) \approx \frac{K}{NV}
– fixed V, determine K → Kernel Methods
– fixed K, determine V → K-Nearest Neighbor
• K-Nearest Neighbor
Increase the volume V until the K next data points are found.
K-Nearest Neighbor
• Nearest-Neighbor density estimation
Fix K, estimate V from the data.
Consider a hypersphere centred on x and let it grow to a volume V* that includes K of the given N data points. Then
p(x) \approx \frac{K}{N V^*}
[Figure: hypersphere around x grown until it contains K = 3 points.]
• Side note
Strictly speaking, the model produced by K-NN is not a true
density model, because the integral over all space diverges.
E.g. consider K = 1 and a sample exactly on a data point x = x_j.
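A minimal sketch of this estimate in D dimensions (illustrative values, not from the lecture; the hypersphere volume uses the standard formula π^{D/2} R^D / Γ(D/2 + 1)):

    import numpy as np
    from math import gamma, pi

    def knn_density(x, data, K):
        # Grow a hypersphere around x until it contains the K nearest points
        N, D = data.shape
        dists = np.sort(np.linalg.norm(data - x, axis=1))
        R = dists[K - 1]                               # radius reaching the K-th neighbour
        V = pi ** (D / 2) * R ** D / gamma(D / 2 + 1)  # hypersphere volume V*
        return K / (N * V)

    rng = np.random.default_rng(6)
    data = rng.normal(0.0, 1.0, size=(1000, 2))
    print(knn_density(np.array([0.0, 0.0]), data, K=10))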
k-Nearest Neighbor: Examples
[Figure: k-NN density estimates for three values of K: too small (not smooth enough), intermediate (about OK), too large (too smooth).]
K acts as a smoother.
Summary: Kernel and k-NN Density Estimation
• Properties
Very general. In the limit (N → ∞), every probability density can be represented.
No computation involved in the training phase
Simply storage of the training set
• Problems
Requires storing and computing with the entire dataset.
Computational cost linear in the number of data points.
This can be improved, at the expense of some computation
during training, by constructing efficient tree-based search
structures.
Kernel size / K in K-NN?
– Too large: too much smoothing
– Too small: too much noise
K-Nearest Neighbor Classification
• Bayesian Classification
p(C_j | x) = \frac{p(x | C_j) \, p(C_j)}{p(x)}
• Here we have
p(x) \approx \frac{K}{NV}, \qquad p(x | C_j) \approx \frac{K_j}{N_j V}, \qquad p(C_j) \approx \frac{N_j}{N}
p(C_j | x) \approx \frac{K_j}{N_j V} \cdot \frac{N_j}{N} \cdot \frac{N V}{K} = \frac{K_j}{K}
⇒ k-Nearest Neighbor classification
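In practice this amounts to a majority vote among the K nearest training points, and the fractions K_j/K directly give the posterior estimates. A sketch with made-up 2D data:

    import numpy as np

    def knn_classify(x, data, labels, K):
        # p(C_j | x) ~ K_j / K: count the class labels among the K nearest neighbours
        dists = np.linalg.norm(data - x, axis=1)
        nearest = labels[np.argsort(dists)[:K]]
        classes, counts = np.unique(nearest, return_counts=True)
        return classes[np.argmax(counts)], counts / K   # predicted class, posterior estimates

    rng = np.random.default_rng(7)
    class0 = rng.normal([0, 0], 1.0, size=(50, 2))
    class1 = rng.normal([2, 2], 1.0, size=(50, 2))
    data = np.vstack([class0, class1])
    labels = np.array([0] * 50 + [1] * 50)
    print(knn_classify(np.array([1.0, 1.0]), data, labels, K=3))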
K-Nearest Neighbors for Classification
[Figure: classification of a query point for K = 3 and K = 1.]
K-Nearest Neighbors for Classification
• Results on an example data set
• K acts as a smoothing parameter.
• Theoretical guarantee
For N → ∞, the error rate of the 1-NN classifier is never more
than twice the optimal error (obtained from the true conditional
class distributions).
Bias-Variance Tradeoff
• Probability density estimation
Histograms: bin size?
– Δ too large: too smooth
– Δ too small: not smooth enough
Kernel methods: kernel size?
– h too large: too smooth
– h too small: not smooth enough
K-Nearest Neighbor: K?
– K too large: too smooth
– K too small: not smooth enough
In each case, too smooth means too much bias; not smooth enough means too much variance.
• This is a general problem of many probability density
estimation methods
Including parametric methods and mixture models
Discussion
• The methods discussed so far are all simple and easy to
apply. They are used in many practical applications.
• However…
Histograms scale poorly with increasing dimensionality.
Only suitable for relatively low-dimensional data.
Both k-NN and kernel density estimation require the entire data
set to be stored.
Too expensive if the data set is large.
Simple parametric models are very restricted in the forms of distributions they can represent.
Only suitable if the data actually has that general form.
• We need density models that are efficient and flexible!
Next lecture…
References and Further Reading
• More information in Bishop’s book
Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4
Bayesian Learning: Ch. 1.2.3 and 2.3.6
Nonparametric methods: Ch. 2.5
• Additional information can be found in Duda & Hart
ML estimation: Ch. 3.2
Bayesian Learning: Ch. 3.3-3.5
Nonparametric methods: Ch. 4.1-4.5
Christopher M. Bishop
Pattern Recognition and Machine Learning
Springer, 2006
R.O. Duda, P.E. Hart, D.G. Stork
Pattern Classification
2nd Ed., Wiley-Interscience, 2000