
Machine Learning – Lecture 16
Repetition
10.07.2012
Bastian Leibe
RWTH Aachen
http://www.vision.rwth-aachen.de
[email protected]
Announcements
• Today, I'll summarize the most important points from the lecture.
  - It is an opportunity for you to ask questions…
  - …or to get additional explanations about certain topics.
  - So, please do ask.
• Today's slides are intended as an index for the lecture.
  - But they are not complete and won't be sufficient as the only study tool.
  - Also look at the exercises; they often explain algorithms in detail.
• Oral exam procedure
  - Oral exam; the exact form depends on B.Sc./M.Sc./Diplom specifics.
  - Procedure: 4 questions, of which you will have to answer 3.
  - Special rule for the Diplom V4 exam.
Course Outline
• Fundamentals
  - Bayes Decision Theory
  - Probability Density Estimation
  - Mixture Models and EM
• Discriminative Approaches
  - Linear Discriminant Functions
  - Statistical Learning Theory & SVMs
  - Ensemble Methods & Boosting
  - Decision Trees & Randomized Trees
• Generative Models
  - Bayesian Networks
  - Markov Random Fields
  - Exact Inference
Recap: Bayes Decision Theory
  - [Figure: class-conditional densities p(x|a), p(x|b); likelihood × prior p(x|a)p(a), p(x|b)p(b) with the decision boundary; posterior probabilities p(a|x), p(b|x).]

        Posterior = (Likelihood × Prior) / Normalization Factor

Slide credit: Bernt Schiele
Image source: C.M. Bishop, 2006
Recap: Bayes Decision Theory
• Optimal decision rule
  - Decide for C1 if
        p(C_1 \mid x) > p(C_2 \mid x)
  - This is equivalent to
        p(x \mid C_1)\, p(C_1) > p(x \mid C_2)\, p(C_2)
  - Which is again equivalent to (likelihood-ratio test)
        \frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}
    where the right-hand side acts as the decision threshold.
Slide credit: Bernt Schiele
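The following small sketch (my addition, not from the slides) makes the likelihood-ratio test concrete; the Gaussian class-conditionals, the priors, and the test points are assumed example values:

```python
from scipy.stats import norm

# Assumed example class-conditional densities and priors (not from the lecture)
p_x_given_c1 = norm(loc=0.0, scale=1.0)    # p(x|C1)
p_x_given_c2 = norm(loc=2.0, scale=1.5)    # p(x|C2)
prior_c1, prior_c2 = 0.7, 0.3              # p(C1), p(C2)

def decide(x):
    # Likelihood-ratio test: decide C1 if p(x|C1)/p(x|C2) > p(C2)/p(C1)
    likelihood_ratio = p_x_given_c1.pdf(x) / p_x_given_c2.pdf(x)
    threshold = prior_c2 / prior_c1
    return "C1" if likelihood_ratio > threshold else "C2"

print(decide(0.5), decide(2.5))
```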
Recap: Minimizing the Expected Loss
• Optimal solution minimizes the loss.
  - But: the loss function depends on the true class, which is unknown.
• Solution: Minimize the expected loss
        E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}
• This can be done by choosing the regions R_j such that each x is assigned to the class j that minimizes \sum_k L_{kj}\, p(C_k \mid \mathbf{x}), which is easy to do once we know the posterior class probabilities p(C_k \mid x).
Recap: Gaussian (or Normal) Distribution
• One-dimensional case
  - Mean \mu, variance \sigma^2:
        \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
• Multi-dimensional case
  - Mean \boldsymbol{\mu}, covariance \boldsymbol{\Sigma}:
        \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}
Image source: C.M. Bishop, 2006
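A minimal numpy sketch (added for illustration) that evaluates the multi-dimensional density above; the mean and covariance are arbitrary example values:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Evaluate the multivariate normal density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)
    return norm_const * np.exp(exponent)

mu = np.array([0.0, 1.0])                 # example mean
cov = np.array([[1.0, 0.3], [0.3, 2.0]])  # example covariance
print(gaussian_density(np.array([0.5, 0.5]), mu, cov))
```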
Recap: Maximum Likelihood Approach
• Computation of the likelihood
  - Single data point: p(x_n \mid \theta)
  - Assumption: all data points X = \{x_1, \dots, x_N\} are independent
        L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)
  - Log-likelihood
        E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)
• Estimation of the parameters \theta (learning)
  - Maximize the likelihood (= minimize the negative log-likelihood).
  - Take the derivative and set it to zero:
        \frac{\partial}{\partial \theta} E(\theta) = -\sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta)/\partial \theta}{p(x_n \mid \theta)} \overset{!}{=} 0
Slide credit: Bernt Schiele
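As a worked example (added, assuming a 1-D Gaussian model), setting this derivative to zero gives the familiar closed-form ML estimates, i.e. the sample mean and the (biased) sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)    # synthetic data from N(3, 2^2)

# Closed-form ML estimates for a 1-D Gaussian (derivative of -log L set to zero)
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()            # note: biased estimator (divides by N)
print(mu_ml, sigma2_ml)
```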
Recap: Bayesian Learning Approach
• Discussion
        p(x \mid X) = \int p(x \mid \theta)\, \frac{L(\theta)\, p(\theta)}{\int L(\theta)\, p(\theta)\, d\theta}\, d\theta
  - p(x \mid \theta): estimate for x based on the parametric form \theta.
  - L(\theta): likelihood of the parametric form \theta given the data set X.
  - p(\theta): prior for the parameters \theta.
  - Denominator: normalization, integrate over all possible values of \theta.
  - The more uncertain we are about \theta, the more we average over all possible parameter values.
Recap: Histograms
• Basic idea:
  - Partition the data space into distinct bins with widths \Delta_i and count the number of observations, n_i, in each bin.
  - [Figure: histogram density estimate of N = 10 samples on the interval [0, 1].]
  - Often, the same width is used for all bins, \Delta_i = \Delta.
  - This can be done, in principle, for any dimensionality D…
  - …but the required number of bins grows exponentially with D!
Image source: C.M. Bishop, 2006
Recap: Kernel Density Estimation
• Approximation formula:
        p(\mathbf{x}) \approx \frac{K}{N V}
  - Kernel methods: fix the volume V and determine the count K.
  - K-nearest neighbor: fix the count K and determine the volume V.
• Kernel methods
  - Place a kernel window k at location x and count how many data points fall inside it.
• K-nearest neighbor
  - Increase the volume V until the K nearest data points are found.
Slide adapted from Bernt Schiele
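An illustrative 1-D sketch (my addition) of the two strategies, using an assumed Gaussian kernel with bandwidth h for the kernel estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=200)          # example samples
x = 0.5                              # point at which we estimate p(x)
N = len(data)

# Kernel density estimate: fixed "volume" (bandwidth h), count via a Gaussian kernel
h = 0.3
kde = np.mean(np.exp(-0.5 * ((x - data) / h) ** 2) / (np.sqrt(2 * np.pi) * h))

# K-nearest-neighbor estimate: fixed K, the volume grows until K neighbors are enclosed
K = 10
radius = np.sort(np.abs(data - x))[K - 1]   # distance to the K-th nearest sample
knn = K / (N * 2 * radius)                  # 1-D "volume" of the interval = 2 * radius
print(kde, knn)
```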
Recap: Mixture of Gaussians (MoG)
• "Generative model"
  - Weight of mixture component j: p(j) = \pi_j
  - Mixture component: p(x \mid \theta_j)
  - Mixture density:
        p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)
  - [Figure: mixture weights p(j), the individual mixture components, and the resulting mixture density p(x).]
Slide credit: Bernt Schiele
Recap: MoG – Iterative Strategy
• Assuming we knew the values of the hidden variable…
  - [Figure: data points with assumed known hard assignments to the two Gaussians.]
  - ML for Gaussian #1 uses h(j=1 \mid x_n) ∈ {0, 1}; ML for Gaussian #2 uses h(j=2 \mid x_n) ∈ {0, 1} (assumed known):
        \mu_1 = \frac{\sum_{n=1}^{N} h(j=1 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j=1 \mid x_n)}, \qquad \mu_2 = \frac{\sum_{n=1}^{N} h(j=2 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j=2 \mid x_n)}
Slide credit: Bernt Schiele
Recap: MoG – Iterative Strategy
• Assuming we knew the mixture components…
  - [Figure: assumed known mixture components; each point is assigned by evaluating p(j=1|x) and p(j=2|x).]
• Bayes decision rule: Decide j = 1 if
        p(j=1 \mid x_n) > p(j=2 \mid x_n)
Slide credit: Bernt Schiele
Recap: K-Means Clustering
• Iterative procedure (a small code sketch follows below)
  1. Initialization: pick K arbitrary centroids (cluster means).
  2. Assign each sample to the closest centroid.
  3. Adjust the centroids to be the means of the samples assigned to them.
  4. Go to step 2 (until no change).
• The algorithm is guaranteed to converge after a finite number of iterations.
  - But it only finds a local optimum.
  - The final result depends on the initialization.
Slide credit: Bernt Schiele
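A minimal k-means sketch (added for illustration; no handling of empty clusters or random restarts) following the four steps above:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means sketch."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]     # 1. pick K arbitrary centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # 2. assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):            # 4. repeat until no change
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
print(kmeans(X, K=2)[0])
```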
Recap: EM Algorithm
• Expectation-Maximization (EM) Algorithm
  - E-Step: softly assign samples to mixture components
        \gamma_j(x_n) \leftarrow \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \qquad \forall j = 1,\dots,K,\; n = 1,\dots,N
  - M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments (see the sketch below):
        \hat{N}_j \leftarrow \sum_{n=1}^{N} \gamma_j(x_n) \qquad \text{(soft number of samples labeled } j\text{)}
        \hat{\pi}_j^{\text{new}} \leftarrow \frac{\hat{N}_j}{N}
        \hat{\mu}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, x_n
        \hat{\Sigma}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^T
Slide adapted from Bernt Schiele
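A compact EM sketch for a Mixture of Gaussians (added for illustration; assumes scipy is available and adds a small ridge to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iters=50, seed=0):
    """Minimal EM for a Mixture of Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                       # mixture weights
    mu = X[rng.choice(N, K, replace=False)]        # initialize means from data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma_j(x_n)
        resp = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                         for j in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, sigma from the soft assignments
        Nj = resp.sum(axis=0)
        pi = Nj / N
        mu = (resp.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            sigma[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pi, mu, sigma

X = np.vstack([np.random.default_rng(2).normal(m, 0.5, size=(100, 2)) for m in (0.0, 4.0)])
print(em_mog(X, K=2)[1])
```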
Recap: Linear Discriminant Functions
• Basic idea
  - Directly encode the decision boundary.
  - Minimize the misclassification probability directly.
• Linear discriminant functions
        y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0
  - w: weight vector; w_0: "bias" (= threshold).
  - The hyperplane y = 0 separates the regions y > 0 and y < 0.
  - w and w_0 define a hyperplane in R^D.
  - If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable.
Slide adapted from Bernt Schiele
Recap: Least-Squares Classification
• Simplest approach
  - Directly try to minimize the sum-of-squares error
        E(\mathbf{w}) = \sum_{n=1}^{N} (y(\mathbf{x}_n; \mathbf{w}) - t_n)^2
        E_D(\widetilde{W}) = \frac{1}{2} \operatorname{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}
  - Setting the derivative to zero yields
        \widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T
  - We then obtain the discriminant function as
        y(\mathbf{x}) = \widetilde{W}^T \widetilde{\mathbf{x}} = T^T (\widetilde{X}^{\dagger})^T \widetilde{\mathbf{x}}
  ⇒ Exact, closed-form solution for the discriminant function parameters.
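A small sketch (added) of this closed-form least-squares classifier; the toy data, the one-hot target coding, and the use of the pseudo-inverse are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-class data; targets as one-hot rows of T (illustrative values)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
T = np.vstack([np.tile([1, 0], (50, 1)), np.tile([0, 1], (50, 1))])

X_tilde = np.hstack([np.ones((len(X), 1)), X])     # prepend a constant bias feature
W_tilde = np.linalg.pinv(X_tilde) @ T              # closed-form least-squares solution

def classify(x):
    y = W_tilde.T @ np.concatenate([[1.0], x])     # y(x) = W^T x_tilde
    return int(np.argmax(y))                       # pick the class with the largest output

print(classify(np.array([0.0, 0.0])), classify(np.array([3.0, 3.0])))
```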
Recap: Problems with Least Squares
• Least-squares is very sensitive to outliers!
  - The error function penalizes predictions that are "too correct".
Image source: C.M. Bishop, 2006
Recap: Generalized Linear Models
• Generalized linear model
        y(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + w_0)
  - g(·) is called an activation function and may be nonlinear.
  - The decision surfaces correspond to
        y(\mathbf{x}) = \text{const.} \;\Leftrightarrow\; \mathbf{w}^T \mathbf{x} + w_0 = \text{const.}
  - If g is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of x.
• Advantages of the nonlinearity
  - Can be used to bound the influence of outliers and "too correct" data points.
  - When using a sigmoid for g(·), we can interpret the y(x) as posterior probabilities:
        g(a) \equiv \frac{1}{1 + \exp(-a)}
Recap: Extension to Nonlinear Basis Fcts.
• Generalization
  - Transform vector x with M nonlinear basis functions \phi_j(\mathbf{x}):
        y_k(\mathbf{x}) = \sum_{j=1}^{M} w_{kj}\, \phi_j(\mathbf{x}) + w_{k0}
• Advantages
  - The transformation allows non-linear decision boundaries.
  - By choosing the right \phi_j, every continuous function can (in principle) be approximated with arbitrary accuracy.
• Disadvantage
  - The error function can in general no longer be minimized in closed form.
  ⇒ Minimization with gradient descent.
Recap: Classification as Dim. Reduction
  - [Figure: projections of two classes onto a line, showing bad separation vs. good separation.]
• Classification as dimensionality reduction
  - Interpret linear classification as a projection onto a lower-dimensional space:
        y = \mathbf{w}^T \mathbf{x}
  ⇒ Learning problem: try to find the projection vector w that maximizes class separation.
Image source: C.M. Bishop, 2006
Recap: Fisher’s Linear Discriminant Analysis
• Maximize distance between classes
• Minimize distance within a class
  - [Figure: two classes projected onto the direction w.]
• Criterion:
        J(\mathbf{w}) = \frac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S_W\, \mathbf{w}}
  - S_B … between-class scatter matrix
  - S_W … within-class scatter matrix
• The optimal solution for w can be obtained as:
        \mathbf{w} \propto S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)
• Classification function:
        y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0, \quad\text{where}\quad w_0 = -\mathbf{w}^T \mathbf{m}
Slide adapted from Ales Leonardis
Recap: Probabilistic Discriminative Models
• Consider models of the form
        p(C_1 \mid \phi) = y(\phi) = \sigma(\mathbf{w}^T \phi)
  with
        p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)
• This model is called logistic regression.
• Properties
  - Probabilistic interpretation
  - But a discriminative method: it only focuses on the decision hyperplane.
  - Advantageous for high-dimensional spaces: requires fewer parameters than explicitly modeling p(\phi \mid C_k) and p(C_k).
Recap: Logistic Regression
• Let's consider a data set \{\phi_n, t_n\} with n = 1,…,N, where \phi_n = \phi(\mathbf{x}_n) and t_n \in \{0, 1\}, \mathbf{t} = (t_1, \dots, t_N)^T.
• With y_n = p(C_1 \mid \phi_n), we can write the likelihood as
        p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\, \{1 - y_n\}^{1 - t_n}
• Define the error function as the negative log-likelihood
        E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
  - This is the so-called cross-entropy error function.
Recap: Iteratively Reweighted Least Squares
• Update equations (a code sketch follows below)
        \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - (\Phi^T R \Phi)^{-1} \Phi^T (\mathbf{y} - \mathbf{t})
                              = (\Phi^T R \Phi)^{-1} \left\{ \Phi^T R \Phi \mathbf{w}^{(\tau)} - \Phi^T (\mathbf{y} - \mathbf{t}) \right\}
                              = (\Phi^T R \Phi)^{-1} \Phi^T R \mathbf{z}
  with
        \mathbf{z} = \Phi \mathbf{w}^{(\tau)} - R^{-1} (\mathbf{y} - \mathbf{t})
• Very similar in form to the pseudo-inverse (normal equations)
  - But now with a non-constant weighting matrix R (which depends on w).
  - Need to apply the normal equations iteratively.
  ⇒ Iteratively Reweighted Least Squares (IRLS)
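A minimal IRLS sketch for logistic regression (added for illustration; no regularization and no safeguards against a near-singular weighting matrix):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iters=10):
    """Minimal IRLS sketch for logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1 - y))                    # weighting matrix, depends on w
        z = Phi @ w - np.linalg.solve(R, y - t)     # working targets
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
        # equivalently: w = w - (Phi^T R Phi)^{-1} Phi^T (y - t)
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])             # basis functions: bias + raw features
print(irls_logistic(Phi, t))
```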
Recap: Generalization and Overfitting
  - [Figure: the training error keeps decreasing while the test error eventually increases again.]
• Goal: predict class labels of new observations
  - Train the classification model on a limited training set.
  - The further we optimize the model parameters, the more the training error will decrease.
  - However, at some point the test error will go up again.
  ⇒ Overfitting to the training set!
Image source: B. Schiele
Recap: Statistical Learning Theory
• Idea
  - Compute an upper bound on the actual risk based on the empirical risk:
        R(\alpha) \leq R_{emp}(\alpha) + \epsilon(N, p^*, h)
    where
    – N: number of training examples
    – p*: probability that the bound is correct
    – h: capacity of the learning machine ("VC-dimension")
Slide adapted from Bernt Schiele
Recap: VC Dimension
• Vapnik-Chervonenkis dimension
  - A measure for the capacity of a learning machine.
• Formal definition:
  - If a given set of ℓ points can be labeled in all possible 2^ℓ ways, and for each labeling a member of the set {f(α)} can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions.
  - The VC dimension of the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}.
Recap: Upper Bound on the Risk
• Important result (Vapnik 1979, 1995)
  - With probability (1 - \eta), the following bound holds:
        R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}
    where the second term is the "VC confidence".
  - This bound is independent of P_{X,Y}(x, y)!
  - If we know h (the VC dimension), we can easily compute the risk bound
        R(\alpha) \leq R_{emp}(\alpha) + \epsilon(N, p^*, h)
Slide adapted from Bernt Schiele
Recap: Support Vector Machine (SVM)
• Basic idea
  - The SVM tries to find the classifier which maximizes the margin between positive and negative data points.
  - Up to now: consider linear classifiers  \mathbf{w}^T \mathbf{x} + b = 0.
• Formulation as a convex optimization problem
  - Find the hyperplane satisfying
        \arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2
    under the constraints
        t_n (\mathbf{w}^T \mathbf{x}_n + b) \geq 1 \quad \forall n
    based on training data points x_n and target values t_n \in \{-1, 1\}.
Recap: SVM – Primal Formulation
• Lagrangian primal form
        L_p = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \right\}
            = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 \right\}
• The solution of L_p needs to fulfill the KKT conditions
  - Necessary and sufficient conditions:
        a_n \geq 0                                  (KKT: \lambda \geq 0)
        t_n y(\mathbf{x}_n) - 1 \geq 0              (KKT: f(\mathbf{x}) \geq 0)
        a_n \{ t_n y(\mathbf{x}_n) - 1 \} = 0       (KKT: \lambda f(\mathbf{x}) = 0)
Recap: SVM – Support Vectors
• The training points for which a_n > 0 are called "support vectors".
• Graphical interpretation:
  - The support vectors are the points on the margin.
  - They define the margin and thus the hyperplane.
  ⇒ All other data points can be discarded!
Slide adapted from Bernt Schiele
Image source: C. Burges, 1998
Recap: SVM – Dual Formulation
• Maximize
        L_d(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m\, (\mathbf{x}_m^T \mathbf{x}_n)
  under the conditions
        a_n \geq 0 \quad \forall n, \qquad \sum_{n=1}^{N} a_n t_n = 0
• Comparison
  - L_d is equivalent to the primal form L_p, but only depends on a_n.
  - L_p scales with O(D^3).
  - L_d scales with O(N^3) – in practice between O(N) and O(N^2).
Slide adapted from Bernt Schiele
Recap: SVM for Non-Separable Data
• Slack variables
  - One slack variable \xi_n \geq 0 for each training data point.
• Interpretation
  - \xi_n = 0 for points that are on the correct side of the margin.
  - \xi_n = |t_n - y(\mathbf{x}_n)| for all other points.
  - Point on the decision boundary: \xi_n = 1.  Misclassified point: \xi_n > 1.
  - We do not have to set the slack variables ourselves!
  ⇒ They are jointly optimized together with w.
Recap: SVM – New Dual Formulation
• New SVM Dual: Maximize
        L_d(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m\, (\mathbf{x}_m^T \mathbf{x}_n)
  under the conditions
        0 \leq a_n \leq C \quad \text{(this is all that changed!)}, \qquad \sum_{n=1}^{N} a_n t_n = 0
• This is again a quadratic programming problem
  ⇒ Solve as before…
Slide adapted from Bernt Schiele
Recap: Nonlinear SVMs
• General idea: The original input space can be mapped to some higher-dimensional feature space where the training set is separable:
        \Phi: \mathbf{x} \to \phi(\mathbf{x})
Slide credit: Raymond Mooney
Recap: The Kernel Trick
• Important observation
  - \phi(\mathbf{x}) only appears in the form of dot products \phi(\mathbf{x})^T \phi(\mathbf{y}):
        y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b = \sum_{n=1}^{N} a_n t_n\, \phi(\mathbf{x}_n)^T \phi(\mathbf{x}) + b
  - Define a so-called kernel function k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^T \phi(\mathbf{y}).
  - Now, in place of the dot product, use the kernel instead:
        y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n\, k(\mathbf{x}_n, \mathbf{x}) + b
  - The kernel function implicitly maps the data to the higher-dimensional space (without having to compute \phi(\mathbf{x}) explicitly)!
Recap: Nonlinear SVM – Dual Formulation
• SVM Dual: Maximize
        L_d(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m\, k(\mathbf{x}_m, \mathbf{x}_n)
  under the conditions
        0 \leq a_n \leq C, \qquad \sum_{n=1}^{N} a_n t_n = 0
• Classify new data points using
        y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n\, k(\mathbf{x}_n, \mathbf{x}) + b
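An illustrative sketch (added) that trains a kernel SVM with scikit-learn and then re-evaluates the dual decision function by hand using the stored support vectors; the data and the RBF parameters are example values:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.7, (50, 2)), rng.normal(1, 0.7, (50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, t)

def decision(x):
    # y(x) = sum_n a_n t_n k(x_n, x) + b, where the sum runs over the support vectors only
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_test = np.array([0.8, 1.2])
print(decision(x_test), clf.decision_function([x_test])[0])   # the two values should agree
```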
Recap: Stacking
• Idea
  - Learn L classifiers (based on the training data).
  - Find a meta-classifier that takes as input the outputs of the L first-level classifiers.
  - [Figure: the data is fed to Classifier 1 … Classifier L, whose outputs feed a combination classifier.]
• Example
  - Learn L classifiers with leave-one-out.
  - Interpret the predictions of the L classifiers as an L-dimensional feature vector.
  - Learn a "level-2" classifier based on the examples generated this way.
Slide credit: Bernt Schiele
Recap: Bayesian Model Averaging
• Model Averaging
  - Suppose we have H different models h = 1,…,H with prior probabilities p(h).
  - Construct the marginal distribution over the data set:
        p(X) = \sum_{h=1}^{H} p(X \mid h)\, p(h)
• Average error of committee
        E_{COM} = \frac{1}{M} E_{AV}
  - This suggests that the average error of a model can be reduced by a factor of M simply by averaging M versions of the model!
  - Unfortunately, this assumes that the errors are all uncorrelated. In practice, they will typically be highly correlated.
Recap: Boosting (Schapire 1989)
• Algorithm (3-component classifier):
  1. Sample N1 < N training examples (without replacement) from training set D to get set D1.
     – Train weak classifier C1 on D1.
  2. Sample N2 < N training examples (without replacement), half of which were misclassified by C1, to get set D2.
     – Train weak classifier C2 on D2.
  3. Choose all data in D on which C1 and C2 disagree to get set D3.
     – Train weak classifier C3 on D3.
  4. Get the final classifier output by majority voting of C1, C2, and C3.
  (Recursively apply the procedure to C1 to C3.)
Image source: Duda, Hart, Stork, 2001
Recap: AdaBoost – Intuition
  - Consider a 2D feature space with positive and negative examples.
  - Each weak classifier splits the training examples with at least 50% accuracy.
  - Examples misclassified by a previous weak learner are given more emphasis at future rounds.
  - The final classifier is a combination of the weak classifiers.
Slide credit: Kristen Grauman
Figure adapted from Freund & Schapire
Recap: AdaBoost – Algorithm
1. Initialization: Set w_n^{(1)} = \frac{1}{N} for n = 1,…,N.
2. For m = 1,…,M iterations (a code sketch follows below)
   a) Train a new weak classifier h_m(\mathbf{x}) using the current weighting coefficients W^{(m)} by minimizing the weighted error function
        J_m = \sum_{n=1}^{N} w_n^{(m)}\, I(h_m(\mathbf{x}_n) \neq t_n)
   b) Estimate the weighted error of this classifier on X:
        \epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)}\, I(h_m(\mathbf{x}_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}
   c) Calculate a weighting coefficient for h_m(\mathbf{x}):
        \alpha_m = \ln\left\{ \frac{1 - \epsilon_m}{\epsilon_m} \right\}
   d) Update the weighting coefficients:
        w_n^{(m+1)} = w_n^{(m)} \exp\left\{ \alpha_m\, I(h_m(\mathbf{x}_n) \neq t_n) \right\}
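A compact AdaBoost sketch with decision stumps as weak learners (added for illustration; assumes the weighted error never becomes exactly zero):

```python
import numpy as np

def train_stump(X, t, w):
    """Weighted decision stump: threshold on one feature, minimizing the weighted error."""
    best = None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, d] > thr, 1, -1)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, d, thr, sign)
    return best[1:]  # (feature, threshold, sign)

def stump_predict(stump, X):
    d, thr, sign = stump
    return sign * np.where(X[:, d] > thr, 1, -1)

def adaboost(X, t, M=10):
    N = len(t)
    w = np.full(N, 1.0 / N)                        # 1. initialize weights
    stumps, alphas = [], []
    for m in range(M):                             # 2. boosting rounds
        stump = train_stump(X, t, w)               # a) weak learner on weighted data
        miss = (stump_predict(stump, X) != t)
        eps = np.sum(w * miss) / np.sum(w)         # b) weighted error
        alpha = np.log((1 - eps) / eps)            # c) classifier weight
        w = w * np.exp(alpha * miss)               # d) re-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
t = np.concatenate([-np.ones(100), np.ones(100)])
stumps, alphas = adaboost(X, t, M=20)
print((predict(stumps, alphas, X) == t).mean())    # training accuracy
```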
  - [Figure: comparison of error functions plotted against z = t · y(x).]
  - Ideal misclassification error function
  - "Hinge error" used in SVMs
  - Exponential error function (used in AdaBoost)
  - "Cross-entropy error"
        E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
    – Similar to the exponential error for z > 0.
    – Only grows linearly with large negative values of z.
  ⇒ Make AdaBoost more robust by switching the error function ⇒ "GentleBoost"
Image source: Bishop, 2006
Recap: CART Framework
• Six general questions
1. Binary or multi-valued problem?
– I.e. how many splits should there be at each node?
2. Which property should be tested at a node?
– I.e. how to select the query attribute?
3. When should a node be declared a leaf?
– I.e. when to stop growing the tree?
4. How can a grown tree be simplified or pruned?
– Goal: reduce overfitting.
5. How to deal with impure nodes?
– I.e. when the data itself is ambiguous.
6. How should missing attributes be handled?
Recap: Picking a Good Splitting Feature
• Goal
  - Select the query (= split) that decreases impurity the most (see the sketch below):
        \Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)
• Impurity measures
  - Entropy impurity (information gain):
        i(N) = -\sum_j p(C_j \mid N) \log_2 p(C_j \mid N)
  - Gini impurity:
        i(N) = \sum_{i \neq j} p(C_i \mid N)\, p(C_j \mid N) = \frac{1}{2}\left[ 1 - \sum_j p^2(C_j \mid N) \right]
Image source: R.O. Duda, P.E. Hart, D.G. Stork, 2001
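A small sketch (added) of the two impurity measures and the impurity decrease of a split; the class fractions in the example are made up:

```python
import numpy as np

def entropy_impurity(p):
    """i(N) = -sum_j p_j log2 p_j (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def gini_impurity(p):
    """i(N) = 0.5 * (1 - sum_j p_j^2)."""
    p = np.asarray(p, dtype=float)
    return 0.5 * (1.0 - np.sum(p ** 2))

def impurity_decrease(parent, left, right, p_left, impurity=entropy_impurity):
    """Delta i(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R)."""
    return impurity(parent) - p_left * impurity(left) - (1 - p_left) * impurity(right)

# Example: a split sending class fractions (0.5, 0.5) into (0.9, 0.1) and (0.1, 0.9)
print(impurity_decrease([0.5, 0.5], [0.9, 0.1], [0.1, 0.9], p_left=0.5))
```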
Recap: ID3 Algorithm
• ID3 (Quinlan 1986)
  - One of the first widely used decision tree algorithms.
  - Intended to be used with nominal (unordered) variables.
    – Real variables are first binned into discrete intervals.
  - General branching factor.
    – Uses gain ratio impurity based on the entropy (information gain) criterion.
• Algorithm
  - Select the attribute a that best classifies the examples; assign it to the root.
  - For each possible value v_i of a:
    – Add a new tree branch corresponding to the test a = v_i.
    – If example_list(v_i) is empty, add a leaf node with the most common label in example_list(a).
    – Else, recursively call ID3 for the subtree with attributes A \ a.
Recap: Randomized Decision Trees
• Decision trees: main effort on finding good split
  - Training runtime: O(D N^2 \log N)
  - This is what takes most effort in practice.
  - Especially cumbersome with many attributes (large D).
• Idea: randomize attribute selection
  - No longer look for the globally optimal split.
  - Instead, randomly use a subset of K attributes on which to base the split.
  - Choose the best splitting attribute e.g. by maximizing the information gain (= reducing entropy):
        \Delta E = \sum_{k=1}^{K} \frac{|S_k|}{|S|} \sum_{j=1}^{N} p_j \log_2(p_j)
Recap: Random Forests (Breiman 2001)
• General ensemble method

Idea: Create ensemble of many (50 - 1,000) trees.
• Empirically very good results


Often as good as SVMs (and sometimes better)!
Often as good as Boosting (and sometimes better)!
• Injecting randomness

Bootstrap sampling process
– On average only 63% of training examples used for building the tree
– Remaining 37% out-of-bag samples used for validation.

Random attribute selection
– Randomly choose subset of K attributes to select from at each node.
– Faster training procedure.
• Simple majority vote for tree combination
B. Leibe
81
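A tiny simulation (added) of the bootstrap sampling argument: drawing N samples with replacement leaves roughly 63% of the examples in-bag and 37% out-of-bag:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
indices = rng.integers(0, N, size=N)    # bootstrap sample: draw N indices with replacement
in_bag = np.unique(indices)
print(len(in_bag) / N)                  # ~0.632 (i.e. 1 - 1/e), the "63%" in-bag fraction
print(1 - len(in_bag) / N)              # ~0.368 out-of-bag fraction
```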
Recap: Extremely Randomized Decision Trees
• Random queries at each node…
  - The tree gradually develops from a classifier into a flexible container structure.
  - The node queries define the (randomly selected) structure.
  - Each leaf node stores posterior probabilities.
• Learning
  - Patches are "dropped down" the trees.
    – Only pairwise pixel comparisons at each node.
    – Directly update the posterior distributions at the leaves.
  ⇒ Very fast procedure, only a few pixel-wise comparisons.
  ⇒ No need to store the original patches!
Image source: Wikipedia
Recap: Graphical Models
• Two basic kinds of graphical models
  - Directed graphical models, or Bayesian Networks
  - Undirected graphical models, or Markov Random Fields
• Key components
  - Nodes: random variables
  - Edges: directed or undirected
  - The value of a random variable may be known (observed) or unknown.
  - [Figure: example of a directed vs. an undirected graphical model; shaded nodes are known, open nodes unknown.]
Slide credit: Bernt Schiele
Recap: Factorization of the Joint Probability
• Computing the joint probability
        p(x_1, \dots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)
• General factorization:
        p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)
  ⇒ We can directly read off the factorization of the joint from the network structure!
Image source: C. Bishop, 2006
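An illustrative sketch (added) of reading the joint off the structure for the seven-variable example above; the conditional probability tables are made-up example numbers, only their normalization matters here:

```python
from itertools import product

# Illustrative CPTs for binary variables x1..x7 (made-up example numbers)
p_x1 = {1: 0.6, 0: 0.4}
p_x2 = {1: 0.3, 0: 0.7}
p_x3 = {1: 0.5, 0: 0.5}
def p_x4(x4, x1, x2, x3): return 0.9 if x4 == (x1 and x2 or x3) else 0.1
def p_x5(x5, x1, x3):     return 0.8 if x5 == (x1 or x3) else 0.2
def p_x6(x6, x4):         return 0.7 if x6 == x4 else 0.3
def p_x7(x7, x4, x5):     return 0.6 if x7 == (x4 and x5) else 0.4

def joint(x1, x2, x3, x4, x5, x6, x7):
    # p(x1..x7) = p(x1) p(x2) p(x3) p(x4|x1,x2,x3) p(x5|x1,x3) p(x6|x4) p(x7|x4,x5)
    return (p_x1[x1] * p_x2[x2] * p_x3[x3] *
            p_x4(x4, x1, x2, x3) * p_x5(x5, x1, x3) *
            p_x6(x6, x4) * p_x7(x7, x4, x5))

# Sanity check: the factorized joint sums to 1 over all 2^7 assignments
print(sum(joint(*a) for a in product([0, 1], repeat=7)))
```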
Recap: Factorized Representation
• Reduction of complexity
  - The joint probability of n binary variables requires O(2^n) terms to represent by brute force.
  - The factorized form obtained from the graphical model only requires O(n \cdot 2^k) terms,
    – k: maximum number of parents of a node.
  ⇒ It's the edges that are missing in the graph that are important! They encode the simplifying assumptions we make.
Slide credit: Bernt Schiele, Stefan Roth
Recap: Conditional Independence
• X is conditionally independent of Y given V
  - Definition:
        p(X \mid Y, V) = p(X \mid V)
  - Also:
        p(X, Y \mid V) = p(X \mid V)\, p(Y \mid V)
  - Special case: marginal independence,
        p(X, Y) = p(X)\, p(Y)
  - Often, we are interested in conditional independence between sets of variables.
Recap: Conditional Independence
• Three cases
  - Divergent ("tail-to-tail"):
    – Conditional independence when c is observed.
  - Chain ("head-to-tail"):
    – Conditional independence when c is observed.
  - Convergent ("head-to-head"):
    – Conditional independence when neither c nor any of its descendants are observed.
Image source: C. Bishop, 2006
Recap: D-Separation
• Definition
  - Let A, B, and C be non-intersecting subsets of nodes in a directed graph.
  - A path from A to B is blocked if it contains a node such that either
    – the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
    – the arrows meet head-to-head at the node, and neither the node nor any of its descendants are in the set C.
  - If all paths from A to B are blocked, A is said to be d-separated from B by C.
• If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C.
  - Read: "A is conditionally independent of B given C."
Slide adapted from Chris Bishop
Recap: “Bayes Ball” Algorithm
• Graph algorithm to compute d-separation
  - Goal: get a ball from X to Y without being blocked by V.
  - Depending on its direction and the previous node, the ball can
    – pass through (from a parent to all children, from a child to all parents),
    – bounce back (from any parent/child to all parents/children),
    – be blocked.
• Game rules
  - An unobserved node (W ∉ V) passes through balls from parents, but also bounces back balls from children.
  - An observed node (W ∈ V) bounces back balls from parents, but blocks balls from children.
Slide adapted from Zoubin Ghahramani
Recap: The Markov Blanket
• Markov blanket of a node xi

  - Minimal set of nodes that isolates x_i from the rest of the graph.
  - This comprises the set of
    – parents,
    – children, and
    – co-parents of x_i.
  - This is what we have to watch out for!
Image source: C. Bishop, 2006
Recap: Undirected Graphical Models
• Undirected graphical models (“Markov Random Fields”)
  - Given by an undirected graph.
• Conditional independence for undirected graphs
  - If every path from any node in set A to set B passes through at least one node in set C, then A ⊥ B | C.
  - Simple Markov blanket: the set of direct neighbors of a node.
Image source: C. Bishop, 2006
Recap: Factorization in MRFs
• Joint distribution
  - Written as a product of potential functions over the maximal cliques in the graph:
        p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)
  - The normalization constant Z is called the partition function:
        Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)
• Remarks
  - BNs are automatically normalized. But for MRFs, we have to perform the normalization explicitly.
  - The presence of the normalization constant is a major limitation!
    – Evaluation of Z involves summing over O(K^M) terms for M nodes!
Recap: Converting Directed to Undirected Graphs
• Problematic case: multiple parents
  - [Figure: a node with several parents; the undirected version must be fully connected, with no conditional independences.]
  - Need a clique of x_1,…,x_4 to represent this factor.
  - Need to introduce additional links ("marry the parents").
  ⇒ This process is called moralization. It results in the moral graph.
Slide adapted from Chris Bishop
Image source: C. Bishop, 2006
Recap: Factor Graphs
• Joint probability


  - Can be expressed as a product of factors:
        p(\mathbf{x}) = \frac{1}{Z} \prod_{s} f_s(\mathbf{x}_s)
  - Factor graphs make this explicit through separate factor nodes.
• Converting a directed polytree
  - Conversion to an undirected tree creates loops due to moralization!
  - Conversion to a factor graph again results in a tree!
Image source: C. Bishop, 2006
Recap: Sum-Product Algorithm
• Two kinds of messages
  - Message from a factor node to a variable node: sum of factor contributions
        \mu_{f_s \to x}(x) \equiv \sum_{X_s} F_s(x, X_s) = \sum_{X_s} f_s(\mathbf{x}_s) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)
  - Message from a variable node to a factor node: product of incoming messages
        \mu_{x_m \to f_s}(x_m) \equiv \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)
  ⇒ Simple propagation scheme.
Recap: Sum-Product from Leaves to Root
  - [Figure: example factor graph with factors f_a, f_b, f_c; messages are passed from the leaves towards the root.]
  - Message definitions:
        \mu_{f_s \to x}(x) \equiv \sum_{X_s} f_s(\mathbf{x}_s) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)
        \mu_{x_m \to f_s}(x_m) \equiv \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)
Image source: C. Bishop, 2006
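A tiny sum-product sketch (added) on a three-variable chain, computing the marginal of the middle variable by message passing and checking it against brute force; the factor tables are arbitrary example values:

```python
import numpy as np

# Chain x1 - f12 - x2 - f23 - x3 of binary variables (illustrative factor tables)
f12 = np.array([[1.0, 0.5],
                [0.5, 2.0]])      # f12[x1, x2]
f23 = np.array([[2.0, 1.0],
                [1.0, 3.0]])      # f23[x2, x3]

# Messages towards x2: leaf variables send constant-1 messages, factors sum them out
mu_x1_to_f12 = np.ones(2)
mu_x3_to_f23 = np.ones(2)
mu_f12_to_x2 = f12.T @ mu_x1_to_f12            # sum over x1
mu_f23_to_x2 = f23 @ mu_x3_to_f23              # sum over x3

# Marginal of x2 = product of incoming messages, normalized by Z
p_x2 = mu_f12_to_x2 * mu_f23_to_x2
p_x2 /= p_x2.sum()

# Brute-force check
joint = f12[:, :, None] * f23[None, :, :]      # joint[x1, x2, x3] up to Z
print(p_x2, joint.sum(axis=(0, 2)) / joint.sum())
```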
Recap: Max-Sum Algorithm
• Objective: an efficient algorithm for finding
  - the value x^{max} that maximises p(x);
  - the value of p(x^{max}).
  ⇒ Application of dynamic programming in graphical models.
• Key ideas
  - We are interested in the maximum value of the joint distribution:
        p(\mathbf{x}^{max}) = \max_{\mathbf{x}} p(\mathbf{x})
  ⇒ Maximize the product p(x).
  - For numerical reasons, use the logarithm.
  ⇒ Maximize the sum (of log-probabilities).
Slide adapted from Chris Bishop
Recap: Max-Sum Algorithm
• Initialization (leaf nodes)
• Recursion
  - Messages
  - For each node, keep a record of which values of the variables gave rise to the maximum state.
Slide adapted from Chris Bishop
Recap: Max-Sum Algorithm
• Termination (root node)
  - Score of the maximal configuration.
  - Value of the root node variable giving rise to that maximum.
  - Back-track to get the remaining variable values:
        x_{n-1}^{max} = \phi(x_n^{max})
Slide adapted from Chris Bishop
Recap: Junction Tree Algorithm
• Motivation
  - Exact inference on general graphs.
  - Works by turning the initial graph into a junction tree and then running a sum-product-like algorithm.
  - Intractable on graphs with large cliques.
• Main steps
  1. If starting from a directed graph, first convert it to an undirected graph by moralization.
  2. Introduce additional links by triangulation in order to reduce the size of cycles.
  3. Find the cliques of the moralized, triangulated graph.
  4. Construct a new graph from the maximal cliques.
  5. Remove minimal links to break cycles and get a junction tree.
  ⇒ Apply regular message passing to perform inference.
Recap: MRF Structure for Images
• Basic structure
  - [Figure: grid MRF with noisy observations y_i attached to the "true" image content nodes x_i.]
• Two components
  - Observation model
    – How likely is it that node x_i has label L_i given observation y_i?
    – This relationship is usually learned from training data.
  - Neighborhood relations
    – Simplest case: 4-neighborhood.
    – Serve as smoothing terms: discourage neighboring pixels from having different labels.
    – This can either be learned or be set to fixed "penalties".
Recap: Graph Cuts for Binary Problems
  - [Figure: s-t graph for binary segmentation with terminal links D_p(s), D_p(t), n-links w_pq, and a cut separating the source s from the sink t.]
  - Unary terms based on the "expected" intensities of object and background, I_s and I_t:
        D_p(s) \propto \exp\left( -\|I_p - I_s\|^2 / 2\sigma^2 \right)
        D_p(t) \propto \exp\left( -\|I_p - I_t\|^2 / 2\sigma^2 \right)
  - I_s and I_t can be re-estimated ⇒ EM-style optimization.
Slide credit: Yuri Boykov
[Boykov & Jolly, ICCV’01]
Recap: s-t-Mincut Equivalent to Maxflow
  - [Figure: example flow network from source to sink with edge capacities; initially Flow = 0.]
• Augmenting-path based algorithms
  1. Find a path from source to sink with positive capacity.
  2. Push the maximum possible flow through this path.
  3. Repeat until no path can be found.
  - The algorithms assume non-negative capacities.
Slide credit: Pushmeet Kohli
Recap: -Expansion Move
• Basic idea:
  - Break the multi-way cut computation into a sequence of binary s-t cuts (label α vs. all other labels).
  - No longer a globally optimal result, but guaranteed approximation quality, and it typically converges in a few iterations.
Slide credit: Yuri Boykov
Recap: Simple Binary Image Denoising Model
• MRF Structure
  - Noisy observations y_i, an observation process, and the "true" image content x_i, with x_i, y_i ∈ {0, 1}, plus "smoothness constraints" between neighboring x_i.
  - Example: simple energy function ("Potts model")
        E(\mathbf{x}, \mathbf{y}) = h \sum_i x_i + \beta \sum_{\{i,j\}} \delta(x_i \neq x_j) + \eta \sum_i \delta(x_i \neq y_i)
    – The three terms are the prior, smoothness, and observation terms.
    – Smoothness term: fixed penalty β if neighboring labels disagree.
    – Observation term: fixed penalty η if label and observation disagree.
Image source: C. Bishop, 2006
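A toy sketch (added) of the energy above on a binary image; note that it uses simple greedy ICM-style updates for denoising rather than the graph-cut optimization discussed earlier, just to show the energy at work:

```python
import numpy as np

def potts_energy(x, y, h=0.0, beta=1.0, eta=2.0):
    """E(x, y) = h*sum(x) + beta*#[neighbors disagree] + eta*#[x != y], with x, y in {0,1}."""
    prior = h * x.sum()
    smooth = beta * (np.sum(x[1:, :] != x[:-1, :]) + np.sum(x[:, 1:] != x[:, :-1]))
    obs = eta * np.sum(x != y)
    return prior + smooth + obs

rng = np.random.default_rng(0)
clean = np.zeros((20, 20), dtype=int)
clean[5:15, 5:15] = 1                                                # a simple "true" image
noisy = np.where(rng.random(clean.shape) < 0.1, 1 - clean, clean)    # 10% flipped pixels

# ICM-style denoising: greedily flip a pixel whenever that lowers the energy
x = noisy.copy()
for _ in range(5):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = 1 - x[i, j]
            e_flip = potts_energy(x, noisy)
            x[i, j] = 1 - x[i, j]
            if e_flip < potts_energy(x, noisy):
                x[i, j] = 1 - x[i, j]
print((x != clean).mean(), (noisy != clean).mean())   # error rate after vs. before
```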
Any Questions?
So what can you do with all of this?
Any More Questions?
Good luck for the exam!