
One-Dimensional Curve-Fitting
Presented by Wenli Li, Shuhong Li,
and Vivian Tam
Venables and Ripley Section 8.7
November 22, 2004
INTRODUCTION
• Curve-fitting:
• Sample data: {(x0, y0), (x1, y1), ..., (xn, yn)}
• Interpolation & extrapolation
• One-dimensional curve-fitting (Section 8.7):
• The functional form is not pre-specified
• SPLINES (ns, smooth.spline)
• Local Regression (LOESS, SUPSMU, KERNEL SMOOTHER and LOCPOLY)
• Data set:
• One independent & one dependent variable
• Examples: GAGurine & Mercury level
GAGurine (MASS)
• Dataset: sample size 314
• Variables:
• Age: independent
• GAG: dependent
Classical way:
library(MASS)
attach(GAGurine)
plot(Age, GAG, main="Degree 6 polynomial")
# Degree-8 fit, used to test which powers are needed
GAG.lm <- lm(GAG ~ Age + I(Age^2) + I(Age^3) + I(Age^4)
             + I(Age^5) + I(Age^6) + I(Age^7) + I(Age^8))
anova(GAG.lm)
# Degree-6 fit, drawn over the scatterplot
GAG.lm2 <- lm(GAG ~ Age + I(Age^2) + I(Age^3) + I(Age^4)
              + I(Age^5) + I(Age^6))
xx <- seq(0, 17, len=200)
lines(xx, predict(GAG.lm2, data.frame(Age=xx)), col="red")
Data (excerpt):
Age: 0.00 0.00 ... 0.46 0.47 ... 17.30 17.67
GAG: 23.0 23.8 ... 18.6 26.4 ... 1.9 9.3
=======================================
Terms added sequentially (first to last)
          Df  Sum of Sq  Mean Sq  F-value    Pr(F)
Age        1      12590    12590   593.58   0.0000
I(Age^2)   1       3751     3751   176.84   0.0000
I(Age^3)   1       1492     1492    70.32   0.0000
I(Age^4)   1        449      449    21.18   0.00001
I(Age^5)   1        174      174     8.22   0.00444
I(Age^6)   1        286      286    13.48   0.00028
I(Age^7)   1         57       57     2.70   0.10151
I(Age^8)   1         45       45     2.12   0.14667
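The sequential table suggests that terms beyond degree 6 add little. One way to check this directly, sketched here under the assumption that R with the MASS package is used, is to fit the degree-6 and degree-8 models with orthogonal polynomials (poly(), which spans the same model space as the raw powers but is numerically better behaved) and compare them with a single F test. The names fit6, fit8, and cmp are illustrative only.

```r
library(MASS)  # provides the GAGurine data frame

# Orthogonal-polynomial versions of the degree-6 and degree-8 fits
fit6 <- lm(GAG ~ poly(Age, 6), data = GAGurine)
fit8 <- lm(GAG ~ poly(Age, 8), data = GAGurine)

# F test for the two extra terms (degrees 7 and 8) taken together
cmp <- anova(fit6, fit8)
print(cmp)
```

The second row of cmp reports the joint test for the degree-7 and degree-8 terms, which complements the term-by-term table above.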
SPLINES
• Algorithm:
• Function: ns( )
• Generate a Basis Matrix for Natural Cubic Splines
• Usage: ns(x, df, knots, intercept=F, Boundary.knots, derivs)
• Arguments:
• Required: x, the predictor variable.
• Optional:
• df: degrees of freedom. One can supply df rather than knots; ns then chooses df-1-intercept knots at suitably chosen quantiles of x. This argument is ignored if knots is supplied.
• knots: breakpoints that define the spline.
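The relationship between df and the number of knots can be seen by inspecting the basis matrix itself. This is a small sketch using R's splines package (whose ns() signature differs slightly from the usage line above); the grid x and the name B are made up for illustration.

```r
library(splines)

# Illustrative predictor: an evenly spaced grid
x <- seq(0, 10, length.out = 101)

# Natural cubic spline basis with 5 degrees of freedom, no intercept column
B <- ns(x, df = 5)

dim(B)            # one row per x value, one column per basis function
attr(B, "knots")  # interior knots chosen at quantiles of x
```

With df = 5 and no intercept, ns() places df - 1 = 4 interior knots, which matches the rule stated in the arguments list.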
SPLINES
Function: smooth.spline( )
• Fits a cubic B-spline smooth to the input data.
• Usage: smooth.spline(x, y, w = <<see below>>, df = <<see below>>, spar = 0, cv = F, all.knots = F, df.offset = 0, penalty = 1)
• Arguments:
• Required: x, values of the predictor variable. There should be at least ten distinct x values.
• Optional:
• y: response variable, of the same length as x.
• df: a number which supplies the degrees of freedom = trace(S), where S is the smoother matrix, rather than a smoothing parameter.
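To illustrate the df argument, the sketch below fits the GAGurine data twice with R's stats::smooth.spline (its defaults differ from the usage line above, which appears to be the S-PLUS signature): once letting the function pick the smoothing parameter itself, and once requesting about 5 equivalent degrees of freedom. The object names are illustrative.

```r
library(MASS)  # GAGurine data

# Smoothing parameter chosen automatically (generalized cross-validation in R)
fit_auto <- smooth.spline(GAGurine$Age, GAGurine$GAG)

# Smoothing parameter chosen so that trace(S) is close to 5
fit_df5 <- smooth.spline(GAGurine$Age, GAGurine$GAG, df = 5)

# Equivalent degrees of freedom actually achieved by each fit
c(auto = fit_auto$df, requested5 = fit_df5$df)
```

A smaller df gives a stiffer curve; a larger df lets the curve follow the data more closely.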
SPLINES
library(splines)
plot(Age, GAG, type="n", main="Spline")  # splines
lines(Age, fitted(lm(GAG ~ ns(Age, df=5))), col="red")
lines(Age, fitted(lm(GAG ~ ns(Age, df=10))), lty=3, col="green")
lines(Age, fitted(lm(GAG ~ ns(Age, df=20))), lty=4, col="blue")
lines(smooth.spline(Age, GAG), lwd=3, col="black")  # smoothing spline
legend(12, 50, c("red: df=5", "green: df=10", "blue: df=20", "Smoothing"),
       lty=c(1, 3, 4, 1), lwd=c(1, 1, 1, 3), bty="n")
KERNEL SMOOTH
Function: ksmooth( )
• Estimates a probability density or performs scatterplot smoothing using kernel estimates.
• Usage: ksmooth(x, y=NULL, kernel="box", bandwidth=0.5, range.x=range(x), n.points=length(x), x.points=<<see below>>)
• Arguments:
• Required: x, vector of x data
• Optional:
• y: vector of y data. This must be the same length as x, and missing values are not accepted.
• kernel: "box", "triangle", "parzen", "normal"
• bandwidth: larger values of bandwidth make smoother estimates; smaller values of bandwidth make less smooth estimates.
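The effect of bandwidth can be put into numbers as well as pictures. The sketch below, assuming R's stats::ksmooth and the GAGurine data, measures the roughness of each fitted curve as the sum of squared second differences; rough() is a hypothetical helper written for this illustration, not part of any package.

```r
library(MASS)  # GAGurine data

# Hypothetical roughness measure: sum of squared second differences
# of the fitted values (larger means a wigglier curve)
rough <- function(fit) sum(diff(fit$y, differences = 2)^2, na.rm = TRUE)

k1 <- ksmooth(GAGurine$Age, GAGurine$GAG, "normal", bandwidth = 1)
k5 <- ksmooth(GAGurine$Age, GAGurine$GAG, "normal", bandwidth = 5)

c(bandwidth1 = rough(k1), bandwidth5 = rough(k5))
```

The wider bandwidth averages over more neighbouring points, so its curve should come out markedly less rough.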
Kernel Smoother
# kernel smoother:
plot(Age, GAG, type="n", main="ksmooth")
lines(ksmooth(Age, GAG, "normal", bandwidth=1), col="red")
lines(ksmooth(Age, GAG, "normal", bandwidth=5))
legend(12, 50, c("red: bandwidth=1", "black: bandwidth=5"), bty="n")
LOESS
• Fits a curve determined by one or more numerical predictors, using local polynomial regression
• Gets a predicted value at each point by fitting a weighted linear regression, where the weights decrease with distance from the point of interest
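The idea of a weighted regression around each point can be sketched by hand for a single target value. This is only an illustration of the principle, not a reimplementation of loess: the helper local_linear() is hypothetical, and the tricube weights and nearest-neighbour window are assumptions based on the usual description of the method.

```r
library(MASS)  # GAGurine data

# Hypothetical helper: local linear estimate at one point x0.
# Points in the nearest fraction `span` of the data get tricube weights
# that shrink to zero at the edge of the window; all others get weight 0.
local_linear <- function(x, y, x0, span = 2/3) {
  d <- abs(x - x0)                           # distance to the target point
  k <- floor(span * length(x))               # number of points in the window
  h <- sort(d)[k]                            # window half-width
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)   # tricube weights
  fit <- lm(y ~ x, weights = w)              # weighted linear regression
  unname(predict(fit, data.frame(x = x0)))
}

# Estimated GAG level at Age = 5
est <- local_linear(GAGurine$Age, GAGurine$GAG, x0 = 5)
est
```

Repeating this calculation over a grid of x0 values traces out a smooth curve, which is the essence of what the loess functions do more efficiently.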
LOESS Parameters
• f: controls the window size
• weights: decrease with distance from the point x being estimated
• span: the parameter alpha, which controls the degree of smoothing
• degree: the degree of the polynomials to be used, up to 2
LOESS
Code:
library(MASS)
attach(GAGurine)
plot(Age,GAG,type="n",main="loess")
# degree is limited to 2, as noted under LOESS Parameters
lines(loess.smooth(Age,GAG,span=2/3,degree=1),col="red",lwd=1)
lines(loess.smooth(Age,GAG,span=2/3,degree=2),col="blue",lwd=2)
lines(loess.smooth(Age,GAG,span=1/3,degree=2),col="green",lwd=1)
legend(10,45, c("Red: span=2/3,deg=1","Blue: span=2/3,deg=2","green: span=1/3,deg=2"),bty="n")
SUPSMU
• Serves a purpose similar to that of the function loess
• Runs three running-lines smoothers with different spans; the best of the three is chosen by cross-validation for each prediction
• If there are substantial serial correlations between observations close in x-value, then a pre-specified fixed-span smoother should be used. Reasonable span values are 0.2 to 0.4
SUPSMU Parameters:
• span: the fraction of the observations in the span of the running-lines smoother, or "cv" to choose this by leave-one-out cross-validation
• bass: controls the smoothness of the fitted curve. Values of up to 10 indicate increasing smoothness
• periodic: if TRUE, the smoother assumes x is a periodic variable with values in the range [0.0, 1.0] and period 1.0. An error occurs if x has values outside this range
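The span options above can be tried side by side. The sketch below uses R's stats::supsmu on the GAGurine data, once with the default cross-validated span and once with a fixed span of 0.3 (a value picked from the "reasonable" range quoted earlier, purely for illustration).

```r
library(MASS)  # GAGurine data

# Default: span = "cv", the span is chosen by cross-validation
s_cv <- supsmu(GAGurine$Age, GAGurine$GAG)

# Fixed span: 30% of the observations in each running-lines window
s_fix <- supsmu(GAGurine$Age, GAGurine$GAG, span = 0.3)

# Each result is a list of x (sorted, duplicates removed) and smoothed y
str(s_cv)
str(s_fix)
```

Because the result holds one fitted value per distinct x, it can be passed straight to lines(), as in the plotting code on the next slide.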
References:
Friedman, J. H. (1984). A variable span scatterplot smoother. Laboratory for Computational Statistics, Stanford University, Technical Report No. 5.
Code:
plot(Age,GAG,type="n",main="supsmu")
lines(supsmu(Age,GAG))
lines(supsmu(Age,GAG,bass=3),lty=3)
lines(supsmu(Age,GAG,bass=10),lty=4)
legend(12,50,c("default","bass=3","bass=10"),lty=c(1,3,4),bty="n")
LOCPOLY
• Estimates a probability density function, regression function, or their derivatives using local polynomials
• Uses a fast binned implementation over an equally-spaced grid
• In a simple form: locpoly(x, y, degree=#, bandwidth=#)
Parameters:
• locpoly(x, y, drv=0, degree=<<see below>>, kernel="normal", bandwidth, gridsize=401, bwdisc=25, range.x=<<see below>>, binned=FALSE, truncate=TRUE)
• drv: order of derivative to be estimated
• degree: degree of local polynomial used
• bandwidth: the kernel bandwidth smoothing parameter
• range.x: vector containing the minimum and maximum values of 'x' at which to compute the estimate
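The drv argument is easiest to understand next to an ordinary fit. The sketch below, assuming the KernSmooth package and the GAGurine data, estimates both the regression curve and its first derivative with the same plug-in bandwidth from dpill(); the object names are illustrative.

```r
library(MASS)        # GAGurine data
library(KernSmooth)  # locpoly() and dpill()

# Plug-in bandwidth for local linear regression
h <- dpill(GAGurine$Age, GAGurine$GAG)

# Regression curve: local linear fit (drv = 0 by default)
curve_fit <- locpoly(GAGurine$Age, GAGurine$GAG, degree = 1, bandwidth = h)

# First derivative of the curve; degree is left at its default for this drv
slope_fit <- locpoly(GAGurine$Age, GAGurine$GAG, drv = 1, bandwidth = h)

# Both results are lists of x and y evaluated on the default grid
length(curve_fit$x)
length(slope_fit$x)
```

Plotting slope_fit shows how quickly GAG changes with Age, which is often easier to read than judging slopes by eye from the fitted curve.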
LOCPOLY
Code:
library(MASS)
attach(GAGurine)
library(KernSmooth)
plot(Age, GAG, type="n", main="(Age, GAG) Locpoly")
# select bandwidth with the plug-in rule
(h <- dpill(Age, GAG))
lines(locpoly(Age, GAG, degree=0, bandwidth=h), col="red", lty=1, lwd=2)
lines(locpoly(Age, GAG, degree=1, bandwidth=h), col="blue", lty=3, lwd=3)
lines(locpoly(Age, GAG, degree=2, bandwidth=h), col="green", lty=4, lwd=3)
legend(10,40,c("const=0 red","linear=1 blue","quad=2 green"),lty=c(1,3,4),bty="n")
# name the data frame: a bare detach() would remove the most recently
# attached item (here KernSmooth), not GAGurine
detach(GAGurine)
LOCPOLY : GAGurine
Example: Mercury Level
• Model: Mercury and Alkalinity
• In 1990 to 1991, largemouth bass were studied in 53 different Florida lakes to examine the mercury contamination level and the factors that influenced the level of mercury absorption in the fish
• One factor studied was the alkalinity level of the water
• The graph of Mercury level against Alkalinity level is plotted to study the relationship
Mercury Level Graphs Coding:
#1 loess
plot( Alkalinity, Mercury, main="Alkalinity and Mercury, Loess")
lines(loess.smooth(Alkalinity,Mercury,span = 2/3, degree = 1),
col="red",lwd=2)
lines(loess.smooth(Alkalinity,Mercury,span = 2/3, degree = 2),
col="blue",lwd=2)
legend(65,1.0, c("deg=1 Red","deg=2 Blue"),bty="n")
#2 supsmu
plot( Alkalinity, Mercury, main="Alkalinity and Mercury, Supsmu")
lines(supsmu(Alkalinity,Mercury, bass=1), lty=1,col="red",lwd=2)
lines(supsmu(Alkalinity,Mercury, bass=10), lty=3,col="blue",lwd=3)
legend(58,1.0, c("bass=1 red","bass=10 blue"),lty=c(1,3),bty="n",lwd=2)
#3 ksmooth
plot(Alkalinity, Mercury, type="n", main="Alkalinity and Mercury, Ksmooth")
lines(ksmooth(Alkalinity, Mercury, "normal", bandwidth=1),col="green",lwd=2)
lines(ksmooth(Alkalinity, Mercury, "normal", bandwidth=5),col="red",
lty=2,lwd=2)
legend(75,1.0, c("bw=1","bw=5"),lty=c(1,2),bty="n")
#4 locpoly
library(KernSmooth)
plot( Alkalinity, Mercury, type="n",main="Alkalinity and Mercury, Locpoly")
#select bandwidth
(h <- dpill(Alkalinity,Mercury))
lines(locpoly(Alkalinity,Mercury,degree=0, bandwidth=h),lty=1,col="green",lwd=2)
lines(locpoly(Alkalinity,Mercury,degree=1, bandwidth=h),lty=2,col="red",lwd=2)
lines(locpoly(Alkalinity,Mercury,degree=2, bandwidth=h),lty=3,col="purple",lwd=3)
legend(75,1.0, c("const","linear","quad"),lty=c(1,2,3),bty="n")
SUMMARY
• Use one-dimensional curve-fitting when:
– The scatter plot does not suggest a linear model
– Data transformation does not give a satisfactory linear model
– The fit must accommodate future data
– The fit must include previous outliers
– Business applications
• Several methods discussed, including:
1. SPLINES
2. LOESS
3. SUPSMU
4. KSMOOTH
5. LOCPOLY
• Parameters such as bandwidth, df, derivative, smoothness, and degree can be tuned to improve the curve fit.