Transcript Lecture 6
Overview
• Basis expansion
• Splines
• (Natural) cubic splines
• Smoothing splines
• Nonparametric logistic regression
• Multidimensional splines
• Wavelets
Linear basis expansion (1)
Linear regression
x1    x2    x3    y
 1    -3     6    12
 …     …     …     …
True model:

    y = f(x) = β_1 x_1 + β_2 x_2 + β_3 x_3

Question: How do we find f̂ ?
Answer: Solve a system of linear equations (the least-squares normal equations) to obtain β̂_1, β̂_2, β̂_3.
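For concreteness, a minimal sketch of this step in NumPy; the data and coefficients below are invented purely for illustration:

```python
import numpy as np

# Toy data: y = b1*x1 + b2*x2 + b3*x3 + noise (all values made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # columns x1, x2, x3
beta_true = np.array([1.0, -2.0, 0.5])         # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=100)

# Solve the normal equations X^T X b = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                # close to beta_true
```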
Linear basis expansion (2)
Nonlinear model
x1    x2    x3    y
 1    -3    -1    12
 …     …     …     …
True model:

    y = β_1 x_1 x_2 + β_2 x_2 e^{x_3} + β_3 sin(x_3) + β_4 x_1^2

Question: How do we find f̂ ?
Answer: A) Introduce new variables

    u_1 = x_1 x_2,   u_2 = x_2 e^{x_3},   u_3 = sin(x_3),   u_4 = x_1^2
Linear basis expansion (3)
Nonlinear model
B) Transform the data set
u1     u2      u3     u4    y
-3    -1.1   -0.84    1    12
 …      …      …      …     …

True model:

    y = β_1 u_1 + β_2 u_2 + β_3 u_3 + β_4 u_4

C) Apply linear regression to obtain β̂_1, β̂_2, β̂_3, β̂_4
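Putting steps A)–C) together, a hedged sketch in NumPy (all data and coefficients invented for illustration):

```python
import numpy as np

# Steps A)-C): transform the inputs into u = h(x), then fit by least squares.
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 200))

U = np.column_stack([x1 * x2,              # u1
                     x2 * np.exp(x3),      # u2
                     np.sin(x3),           # u3
                     x1**2])               # u4
beta_true = np.array([1.0, 2.0, -1.5, 0.5])    # hypothetical coefficients
y = U @ beta_true + 0.1 * rng.normal(size=200)

beta_hat, *_ = np.linalg.lstsq(U, y, rcond=None)   # ordinary least squares
```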
Linear basis expansion (4)
Conclusion:
We can easily fit any model of the form

    f(X) = Σ_{m=1}^{M} β_m h_m(X)

i.e., we can easily undertake a linear basis expansion in X.
Example: If the model is known to be nonlinear, but the exact form is unknown, we can try to introduce quadratic and interaction terms:

    f(X) = β_1 X_1 + … + β_p X_p + β_{11} X_1^2 + β_{12} X_1 X_2 + …
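If scikit-learn is available, its PolynomialFeatures transformer is one convenient way to generate all such degree-2 terms automatically; a small sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(2).normal(size=(50, 3))   # toy data, p = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)   # X1..X3 plus squares and interactions (9 columns)
```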
Piecewise polynomial functions
Assume X is one-dimensional.
Def. Assume the domain [a, b] of X is split into intervals [a, ξ_1], [ξ_1, ξ_2], ..., [ξ_K, b]. Then f(X) is said to be piecewise polynomial if f(X) is represented by a separate polynomial on each interval.
Note: The points ξ_1, ..., ξ_K are called knots.
Piecewise polynomials
Example. Continuous piecewise linear function
Alternative A. Introduce a linear function on each interval plus a set of continuity constraints:

    y_1 = α_1 + β_1 X,   y_2 = α_2 + β_2 X,   y_3 = α_3 + β_3 X

subject to

    y_1(ξ_1) = y_2(ξ_1),   y_2(ξ_2) = y_3(ξ_2)

(6 parameters − 2 constraints = 4 free parameters)

[Insert Fig. 5.1, lower left]
Alternative B. Use a basis expansion (4 free parameters):

    h_1(X) = 1,   h_2(X) = X,   h_3(X) = (X − ξ_1)_+,   h_4(X) = (X − ξ_2)_+

Theorem. The two formulations are equivalent.
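A minimal sketch of Alternative B in NumPy, assuming hypothetical knots at 0.3 and 0.6:

```python
import numpy as np

def linear_spline_basis(x, knots=(0.3, 0.6)):
    """Design matrix for a continuous piecewise linear fit."""
    cols = [np.ones_like(x), x]                        # h1, h2
    cols += [np.maximum(x - k, 0.0) for k in knots]    # (X - xi)_+ terms
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 100)
H = linear_spline_basis(x)   # 100 x 4; fit the coefficients by least squares
```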
Splines
Definition. A piecewise polynomial is called an order-M spline if it has continuous derivatives up to order M−2 at the knots.
Alternative definition. An order-M spline is a function which can be represented by the truncated power basis (K = #knots):

    h_j(X) = X^{j−1},   j = 1, …, M
    h_{M+l}(X) = (X − ξ_l)_+^{M−1},   l = 1, …, K

Theorem. The definitions above are equivalent.
Terminology. An order-4 spline is called a cubic spline.

[Insert Fig. 5.2, LR]
(look at the bases and compare the numbers of free parameters)

Note. For cubic splines, the discontinuity at the knots is not visible to the eye.
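A sketch of the truncated power basis for a cubic spline (M = 4); the knot positions are hypothetical:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, X, X^2, X^3, then (X - xi_l)_+^3 per knot."""
    cols = [x**j for j in range(4)]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 200)
H = cubic_spline_basis(x, knots=[0.25, 0.50, 0.75])   # 200 x 7 design matrix
```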
Variance of spline estimators – boundary effects
[Insert Fig. 5.3]
Natural cubic spline
Def. A cubic spline f is called a natural cubic spline if its 2nd and 3rd derivatives are zero at a and b.
Note: This implies that f is linear on the two extreme intervals.
Basis functions of natural cubic splines:

    N_1(X) = 1,   N_2(X) = X,   N_{k+2}(X) = d_k(X) − d_{K−1}(X),   k = 1, …, K−2

where

    d_k(X) = [ (X − ξ_k)_+^3 − (X − ξ_K)_+^3 ] / (ξ_K − ξ_k)
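A sketch of this basis in NumPy (the knots are hypothetical):

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis N_1, ..., N_K for K knots."""
    knots = np.asarray(knots)
    K = len(knots)
    def d(k):   # d_k(X), with 0-based index k
        return ((np.maximum(x - knots[k], 0.0) ** 3
                 - np.maximum(x - knots[-1], 0.0) ** 3)
                / (knots[-1] - knots[k]))
    cols = [np.ones_like(x), x]
    cols += [d(k) - d(K - 2) for k in range(K - 2)]   # N_{k+2}
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 200)
N = natural_spline_basis(x, knots=np.linspace(0.1, 0.9, 5))   # 200 x 5
```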
Fitting smooth functions to data
Minimize a penalized sum of squared residuals:

    RSS(f, λ) = Σ_{i=1}^{N} ( y_i − f(x_i) )^2 + λ ∫ ( f''(t) )^2 dt

where λ is the smoothing parameter.
λ = 0 : any function interpolating the data
λ = +∞ : the least-squares line fit
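In practice this need not be coded by hand; for example, SciPy ships smoothing splines. A sketch with invented data (UnivariateSpline's parameter s controls the same smoothness/fit trade-off but is parameterized differently from λ; recent SciPy versions also offer scipy.interpolate.make_smoothing_spline, which takes λ directly):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Toy data: noisy sine curve.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 100)
y = np.sin(x) + 0.3 * rng.normal(size=100)

# s = 0 interpolates; larger s gives a smoother fit.
spline = UnivariateSpline(x, y, s=len(x) * 0.3**2)
y_smooth = spline(x)
```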
Optimality of smoothing splines
Theorem. The function f minimizing RSS for a given λ is a natural cubic spline with knots at all unique values of x_i (NOTE: N knots!).
The optimal spline can be computed as follows. Write

    f(x) = Σ_{j=1}^{N} N_j(x) θ_j = N(x)^T θ

Then

    RSS(θ, λ) = (y − Nθ)^T (y − Nθ) + λ θ^T Ω_N θ

where N_{ij} = N_j(x_i) and (Ω_N)_{ij} = ∫ N_i''(t) N_j''(t) dt, and the minimizer is

    θ̂ = (N^T N + λ Ω_N)^{−1} N^T y
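The solve itself is a generalized ridge regression; a minimal sketch, assuming the basis matrix N and penalty matrix Ω_N have already been computed:

```python
import numpy as np

def smoothing_spline_coefs(Nmat, Omega, y, lam):
    """Solve (N^T N + lam * Omega) theta = N^T y."""
    A = Nmat.T @ Nmat + lam * Omega
    return np.linalg.solve(A, Nmat.T @ y)

# Fitted values: f_hat = Nmat @ smoothing_spline_coefs(Nmat, Omega, y, lam)
```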
A smoothing spline is a linear smoother
The fitted vector

    f̂ = N (N^T N + λ Ω_N)^{−1} N^T y = S_λ y

is linear in the response values.
Degrees of freedom of smoothing splines
The effective degrees of freedom are

    df_λ = trace(S_λ)

i.e., the sum of the diagonal elements of S_λ.
Smoothing splines and eigenvectors
It can be shown that

    S_λ = (I + λK)^{−1}

where K is the so-called penalty matrix.
Furthermore, the eigen-decomposition is

    S_λ = Σ_{k=1}^{N} ρ_k(λ) u_k u_k^T,   with   ρ_k(λ) = 1 / (1 + λ d_k)

Note: d_k and u_k are the eigenvalues and eigenvectors, respectively, of K.
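A sketch of this decomposition, assuming a symmetric positive semi-definite penalty matrix K is available:

```python
import numpy as np

def smoother_eigen(K, lam):
    """Eigenvalues d_k of K and shrinkage factors rho_k(lam) of S_lambda."""
    d, U = np.linalg.eigh(K)          # K assumed symmetric PSD
    rho = 1.0 / (1.0 + lam * d)
    return rho, U

# Then S_lambda = U @ np.diag(rho) @ U.T, and df_lambda = rho.sum().
```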
Smoothing splines and shrinkage
    S_λ y = Σ_{k=1}^{N} u_k ρ_k(λ) ⟨u_k, y⟩

• The smoothing spline decomposes the vector y with respect to the basis of eigenvectors and shrinks the respective contributions.
• The eigenvectors, ordered by decreasing ρ_k, increase in complexity; the higher the complexity, the more the contribution is shrunk.
Smoothing splines and local curve fitting
• The eigenvalues ρ_k(λ) are decreasing functions of λ: the higher λ, the stronger the penalization.
• The smoother matrix has a banded structure → the smoothing spline is a local fitting method.
• df_λ = trace(S_λ) = Σ_{k=1}^{N} 1 / (1 + λ d_k)
[Insert Fig. 5.8]
Fitting smoothing splines in practice (1)
Reinsch form:

    S_λ = (I + λK)^{−1}
Theorem. If f is a natural cubic spline with values f at the knots and second derivatives γ at the interior knots, then

    Q^T f = R γ

where Q and R are banded matrices that depend on the knots ξ only.
Theorem.

    K = Q R^{−1} Q^T
Data mining and statistical
learning - lecture 6
18
Fitting smoothing splines in practice (2)
Reinsch algorithm (a sketch in code follows below)
• Evaluate Q^T y
• Form R + λ Q^T Q and compute its Cholesky decomposition (in linear time, thanks to bandedness!)
• Solve (R + λ Q^T Q) γ = Q^T y (in linear time!)
• Obtain f̂ = y − λQγ
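A dense sketch of these steps; a real implementation would exploit the banded structure of Q and R (e.g. via scipy.linalg.cholesky_banded) to achieve the linear running time claimed above:

```python
import numpy as np

def reinsch_fit(Q, R, y, lam):
    """Q: N x (N-2), R: (N-2) x (N-2); both depend on the knots only."""
    A = R + lam * Q.T @ Q                 # symmetric positive definite
    gamma = np.linalg.solve(A, Q.T @ y)   # dense stand-in for a banded solve
    return y - lam * Q @ gamma            # fitted values f_hat
```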
Automated selection of smoothing parameters (1)
What can be selected:
Regression splines
• Degree of the spline
• Placement of the knots
→ the MARS procedure automates these choices
Smoothing splines
• Penalization parameter λ
Automated selection of smoothing parameters (2)
Fixing the degrees of freedom

    df_λ = trace(S_λ) = Σ_{k=1}^{N} 1 / (1 + λ d_k)

• If we fix df_λ, we can find λ by solving this equation numerically.
• One could try a few different values of df_λ and choose among them based on F-tests, residual plots, etc.
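Since df_λ is monotone decreasing in λ, one-dimensional root finding suffices; a sketch, assuming the eigenvalues d_k of the penalty matrix K are available:

```python
import numpy as np
from scipy.optimize import brentq

def lambda_for_df(d, df_target):
    """Solve trace(S_lambda) = df_target for lambda."""
    gap = lambda lam: np.sum(1.0 / (1.0 + lam * d)) - df_target
    return brentq(gap, 1e-12, 1e12)   # bracket assumed to contain the root
```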
Automated selection of smoothing parameters (3)
The bias-variance trade-off

    df_λ = trace(S_λ) = Σ_{k=1}^{N} 1 / (1 + λ d_k)

[Insert Fig. 5.9]
EPE – integrated squared prediction error
CV – cross-validation
Nonparametric logistic regression
Logistic regression model

    log [ Pr(Y = 1 | X = x) / Pr(Y = 0 | X = x) ] = f(x)

Note: X is one-dimensional.
What can f be?
• Linear → ordinary logistic regression (Chapter 4)
• Sufficiently smooth → nonparametric logistic regression (splines, among others)
• Other choices are possible
Nonparametric logistic regression
Problem formulation:
Maximize the penalized log-likelihood (equivalently, minimize its negative):

    ℓ_p(f, λ) = ℓ(f; y) − (λ/2) ∫ ( f''(t) )^2 dt

Good news: the solution is still a natural cubic spline.
Bad news: there is no closed-form expression for that spline.
Nonparametric logistic regression
How to proceed?
Use Newton–Raphson to compute the spline numerically:
1. Compute the first and second derivatives ∂ℓ_p/∂θ and ∂²ℓ_p/∂θ∂θ^T analytically.
2. Compute the Newton direction using the current value of the parameter and the derivative information.
3. Compute the new value of the parameter from the old value via the update formula

    θ^{new} = θ^{old} − ( ∂²ℓ_p/∂θ∂θ^T )^{−1} ∂ℓ_p/∂θ

(see the sketch below)
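A minimal sketch of this iteration, assuming the basis matrix N and penalty matrix Ω from the smoothing-spline slides, labels y ∈ {0, 1}, and f = Nθ:

```python
import numpy as np

def penalized_logistic(Nmat, Omega, y, lam, n_iter=20):
    """Newton-Raphson for f = N @ theta, penalty (lam/2) * theta^T Omega theta."""
    theta = np.zeros(Nmat.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Nmat @ theta))         # fitted probabilities
        W = p * (1.0 - p)                               # diagonal weights
        grad = Nmat.T @ (y - p) - lam * Omega @ theta   # gradient of l_p
        neg_hess = Nmat.T @ (Nmat * W[:, None]) + lam * Omega
        theta += np.linalg.solve(neg_hess, grad)        # Newton step
    return theta
```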
Multidimensional splines
How to fit data smoothly in higher dimensions?
A) Use bases of one-dimensional functions and form a basis by tensor products:

    g_{jk}(X) = h_{1j}(X_1) h_{2k}(X_2),    g(X) = Σ_{j,k} θ_{jk} g_{jk}(X)

Problem: the size of the basis grows exponentially with the dimension.

[Insert Fig. 5.10]
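A sketch of the tensor-product construction, reusing a one-dimensional basis builder such as the cubic_spline_basis sketched earlier:

```python
import numpy as np

def tensor_product_basis(H1, H2):
    """Columns h_{1j}(X1) * h_{2k}(X2) for all pairs (j, k).

    H1: (n, M1) basis evaluated at X1; H2: (n, M2) basis evaluated at X2.
    Returns an (n, M1 * M2) design matrix -- note the multiplicative growth.
    """
    return np.einsum('nj,nk->njk', H1, H2).reshape(H1.shape[0], -1)
```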
Multidimensional splines
How to fit data smoothly in higher dimensions?
B) Formulate a new problem:

    min_f  Σ_i ( y_i − f(x_i) )^2 + λ J[f]

• The solution is a thin-plate spline.
• It has properties analogous to the one-dimensional smoothing spline (e.g., λ = 0 yields an interpolating function).
• In two dimensions the solution is essentially a sum of radial basis functions:

    f(x) = β_0 + β^T x + Σ_j α_j η( ‖x − x_j‖ )
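SciPy's RBFInterpolator covers this case; a sketch with invented data, where the smoothing argument plays the role of the penalty parameter:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(4)
xy = rng.uniform(size=(100, 2))                          # toy 2-D inputs
z = np.sin(4.0 * xy[:, 0]) * xy[:, 1] + 0.1 * rng.normal(size=100)

tps = RBFInterpolator(xy, z, kernel='thin_plate_spline', smoothing=1.0)
z_hat = tps(xy)   # evaluate the fitted surface at the data points
```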
Wavelets
Introduction
• The idea: fit a bumpy function by removing noise
• Application areas: signal processing, compression
• How it works: the function is represented in a basis of bumpy functions, and the small coefficients are then filtered out
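A sketch of this filter-the-small-coefficients idea using the PyWavelets package (the threshold value is an arbitrary illustrative choice):

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(5)
signal = np.repeat([0.0, 1.0, -0.5, 2.0], 64) + 0.2 * rng.normal(size=256)

coeffs = pywt.wavedec(signal, 'haar')                     # decompose
coeffs = [coeffs[0]] + [pywt.threshold(c, value=0.5, mode='soft')
                        for c in coeffs[1:]]              # shrink small details
denoised = pywt.waverec(coeffs, 'haar')                   # reconstruct
```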
Wavelets
Basis functions (Haar Wavelets, Symmlet-8 Wavelets)
[Insert Fig. 5.13]
Wavelets
Example
[Insert Fig. 5.14]