Basic principles of probability theory


Basics of regression analysis
• Purpose of linear models
• Least-squares solution for linear models
• Analysis of diagnostics

Reason for linear models
The purpose of regression is to reveal statistical relations between input and output variables.
Statistics cannot reveal a functional relationship; that is the purpose of other scientific studies.
Statistics can help validate various functional relationships (models). Let us assume that
we suspect that the functional relationship is
y = f(x, \beta, \epsilon)
where β is a vector of unknown parameters, x = (x_1, x_2, ..., x_p) is a vector of controllable parameters, and
y is the output; ε is an error associated with the experiment. Then we can set up experiments for various values
of x and get the output (or response) for them. If the number of experiments is n then
we will have n output values. Denote them as a vector y = (y_1, y_2, ..., y_n). The purpose of statistics is
to evaluate the parameter vector using the input and output values. If the function f is a linear
function of the parameters and the errors are additive then we are dealing with a linear model.
For this model we can write
y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \epsilon_i
A linear model is linear in the parameters but not necessarily in the input variables. For example
y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon
is a linear model. But
y = \beta_0 + \beta_1 x_1 + \beta_1^2 x_2 + \epsilon
is not.
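As a quick illustration, here is a minimal sketch in R with made-up data (the variable names are hypothetical): the squared term is just another column of the design matrix, so lm() fits the first model above by ordinary least squares.

```r
## Hypothetical data: a response that is quadratic in x1 but linear in the parameters
set.seed(1)
x1 <- runif(30)
y  <- 1 + 2 * x1 - 3 * x1^2 + rnorm(30, sd = 0.1)

## I(x1^2) adds the squared term as an extra column of the design matrix
fit <- lm(y ~ x1 + I(x1^2))
coef(fit)   # estimates of beta_0, beta_1, beta_2
```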
Assumptions
The basic assumptions for the analysis of a linear model are:
1) the model is linear in the parameters;
2) the error structure is additive;
3) the random errors have zero mean, equal variances and are uncorrelated.
These assumptions are sufficient to deal with linear models. The uncorrelatedness and equal-variance
assumptions (number 3) can be removed, but then the treatment becomes a little more complicated.
Note that the general solution does not use the normality assumption. That assumption is
needed to design test statistics. If it does not hold, we can use the bootstrap to design
test statistics.
These assumptions can be written in vector form:
y = X\beta + \epsilon, \quad E(\epsilon) = 0, \quad V(\epsilon) = \sigma^2 I
where y, 0 and ε are vectors and X is a matrix. This matrix is called the design matrix,
input matrix, etc. I is the n×n identity matrix.

One parameter case
When we have one predictor variable (x) and one response, the problem
becomes considerably simpler. In this case we have:
y = \beta_0 + \beta_1 x + \epsilon
Now let us assume that we have observations at n values of x, (x_1, ..., x_n), and the
corresponding responses are (y_1, ..., y_n). If we assume that the errors are independent,
have equal variances and are normally distributed then we can use least squares:
\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2 \rightarrow \min
If we solve this minimisation problem we get the following estimates for the coefficients:

\hat{\beta}_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum y_i x_i}{n\sum x_i^2 - (\sum x_i)^2}

\hat{\beta}_1 = \frac{n\sum y_i x_i - \sum y_i \sum x_i}{n\sum x_i^2 - (\sum x_i)^2}
One parameter case: Cont.
If you divide the numerator and the denominator by n² and use the definitions of correlation and
standard deviation, then the second equation can be written as:
\hat{\beta}_1 = \rho(x, y)\,\frac{s(y)}{s(x)}
In some sense the slope is a description of the correlation between the input and output variables.
If the slope is equal to 0 then the intercept becomes the mean value of the observations.
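A minimal numerical sketch in R (hypothetical data) of these closed-form estimates, including the correlation form of the slope, checked against lm():

```r
## Hypothetical data for a single predictor
set.seed(2)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)
n <- length(x)

## Closed-form least-squares estimates
b1 <- (n * sum(x * y) - sum(y) * sum(x)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / (n * sum(x^2) - sum(x)^2)

## Equivalent form of the slope via correlation and standard deviations
b1_corr <- cor(x, y) * sd(y) / sd(x)

c(b0 = b0, b1 = b1, b1_corr = b1_corr)
coef(lm(y ~ x))   # should agree
```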
Solution for general case
The solution to least squares with a linear model and the given assumptions is:
\hat{\beta} = (X^T X)^{-1} X^T y
Let us show this. If we use the form of the model and write the least-squares criterion (since we want
to find the solution with the minimum least-squares error):
S = (y - X\beta)^T (y - X\beta) \rightarrow \min
If we take the first derivative and solve the resulting equation, we can see that this solution is correct:

\frac{\partial S}{\partial \beta} = 2(X^T X\beta - X^T y) = 0

\hat{\beta} = (X^T X)^{-1} X^T y
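Here is a minimal sketch in R (hypothetical data and design matrix) of the normal-equations solution, compared with lm():

```r
## Hypothetical design matrix: intercept plus two predictors
set.seed(3)
n <- 50
X <- cbind(1, runif(n), runif(n))
beta <- c(1, 2, -1)
y <- drop(X %*% beta) + rnorm(n, sd = 0.2)

## Normal-equations solution (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)

## Same fit via lm(); "-1" because X already contains the intercept column
coef(lm(y ~ X - 1))
```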

If we use the formula for the solution and the expression for y then we can write:
E(\hat{\beta}) = E((X^T X)^{-1} X^T y) = E((X^T X)^{-1} X^T (X\beta + \epsilon)) = E(\beta + (X^T X)^{-1} X^T \epsilon) = \beta + (X^T X)^{-1} X^T E(\epsilon) = \beta
So the solution is unbiased. The variance of the estimator is:
V(\hat{\beta}) = E((\hat{\beta} - \beta)(\hat{\beta} - \beta)^T) = E((X^T X)^{-1} X^T \epsilon\epsilon^T X (X^T X)^{-1}) = (X^T X)^{-1} X^T E(\epsilon\epsilon^T) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}
Here we used the form of the solution and assumption 3).
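A small sketch in R (hypothetical data) checking that σ²(X^T X)^{-1}, with σ² estimated, matches the covariance matrix reported by lm() via vcov():

```r
## Hypothetical data with one predictor
set.seed(4)
n <- 50
x <- runif(n)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)

fit <- lm(y ~ x)
s2  <- summary(fit)$sigma^2     # estimate of the error variance
s2 * solve(t(X) %*% X)          # sigma^2 (X'X)^{-1}
vcov(fit)                       # should agree
```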
Variance
To calculate the covariance matrix we need to be able to estimate σ². Since it is the
variance of the error term we can find it using the form of the solution. For the
estimated errors (denoted by r) we can write:
r = y - X\hat{\beta} = y - X(X^T X)^{-1} X^T y = (I - X(X^T X)^{-1} X^T) y
If we use y = X\beta + \epsilon it gives:
r = (I - X(X^T X)^{-1} X^T)\epsilon = M\epsilon
Since the matrix M is idempotent and symmetric, i.e. M^2 = M = M^T, we can write:
E(r^T r) = E(\epsilon^T M \epsilon) = E(tr(\epsilon^T M \epsilon)) = E(tr(M\epsilon\epsilon^T)) = tr(M E(\epsilon\epsilon^T)) = \sigma^2 tr(M) = \sigma^2(tr(I) - tr(X(X^T X)^{-1} X^T)) = \sigma^2(n - tr((X^T X)^{-1} X^T X)) = \sigma^2(n - p)
where n is the number of observations and p is the number of fitted parameters.
Then for the unbiased estimator of the residual variance we can write:
\hat{\sigma}^2 = (y - X\hat{\beta})^T (y - X\hat{\beta}) / (n - p)
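A minimal sketch in R (hypothetical data) of the unbiased variance estimate RSS/(n - p), compared with the value lm() reports:

```r
## Hypothetical data; p counts all fitted parameters (intercept + slope = 2)
set.seed(5)
n <- 40
p <- 2
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
sum(residuals(fit)^2) / (n - p)   # RSS / (n - p)
summary(fit)$sigma^2              # should agree
```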


Singular case: SVD and pseudoinversion
The form of the solution given above is valid if the matrices X and X^T X are non-singular, i.e. the
rank of the matrix X is equal to the number of parameters. If this is not true then either
singular value decomposition or eigenvalue filtering techniques are used. Fortunately
most of the good properties of the linear model remain.
Singular value decomposition (SVD): any n×p matrix can be decomposed in the form
X = UDV
where U is an n×n and V a p×p orthogonal matrix (the inverse is equal to the transpose), and D is an n×p
diagonal matrix of the singular values. If X is singular then the number of non-zero diagonal
elements of D is less than p. Then for X^T X we can write:
X^T X = (UDV)^T UDV = V^T D^T U^T U D V = V^T D^T D V
D^T D is a p×p diagonal matrix. If this matrix is non-singular then we can write:
(X^T X)^{-1} = V^T (D^T D)^{-1} V
Since D^T D is a diagonal matrix, its inverse is also a diagonal matrix. The main trick used in
the SVD technique for solving the equations is that when diagonal elements are 0 or close to 0,
zero is used instead of their inverse, i.e. the pseudoinverse is calculated using:
(D^T D)^{-1*}_{ii} = \begin{cases} 0 & \text{if the corresponding diagonal element of } D^T D \text{ is } 0 \\ 1/(D^T D)_{ii} & \text{otherwise} \end{cases}
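A sketch in R of the SVD-based pseudoinverse solution for a rank-deficient design (hypothetical data; note that R's svd() uses the convention X = U D V^T, i.e. with V transposed compared with the X = UDV form above):

```r
## Hypothetical rank-deficient design: the third column duplicates the second
set.seed(6)
n  <- 30
x1 <- runif(n)
X  <- cbind(1, x1, x1)                               # p = 3 but rank 2
y  <- 2 + 3 * x1 + rnorm(n, sd = 0.1)

s    <- svd(X)                                       # X = U D V^T
dinv <- ifelse(s$d > 1e-8 * max(s$d), 1 / s$d, 0)    # invert only non-negligible singular values
beta_pinv <- s$v %*% (dinv * (t(s$u) %*% y))         # pseudoinverse solution V D^+ U^T y
drop(beta_pinv)
```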
Singular case: Ridge regression
Another technique for dealing with the singular case is ridge regression. In this technique a
constant value λ is added to the diagonal terms of X^T X before inverting it:
\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y
Mathematically it is equivalent to Tikhonov regularisation for ill-posed problems.

One of the problems with this technique is how to find the regularisation parameter λ. That
is usually done by trial and error.
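A sketch in R (hypothetical, nearly collinear data; λ = 0.1 chosen arbitrarily) of the ridge formula applied directly, together with the lm.ridge command from the MASS package. The two results will not match exactly, since lm.ridge centres and scales the data internally.

```r
library(MASS)   # provides lm.ridge

## Hypothetical data with two almost collinear predictors
set.seed(7)
n  <- 30
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 1e-3)
y  <- 1 + x1 + x2 + rnorm(n, sd = 0.1)

## The plain formula (X'X + lambda I)^{-1} X'y from the text
X      <- cbind(x1, x2)
lambda <- 0.1
drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))

## Ridge regression via MASS
lm.ridge(y ~ x1 + x2, lambda = lambda)
```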
R-squared
One of the statistics used for goodness of fit is R-squared. It is defined as:
R^2 = 1 - \frac{SS_e}{SS_t}, \quad SS_e = \sum (y_i - \hat{y}_i)^2, \quad SS_t = \sum (y_i - \bar{y})^2
where \hat{y}_i are the fitted values.
Adjusted R-squared:
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
where n is the number of observations and p is the number of predictors (excluding the intercept).
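A minimal sketch in R (hypothetical data) computing R² and adjusted R² from these definitions, checked against summary.lm():

```r
## Hypothetical data with one predictor (p = 1)
set.seed(8)
n <- 25
p <- 1
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)

fit   <- lm(y ~ x)
sse   <- sum((y - fitted(fit))^2)
sst   <- sum((y - mean(y))^2)
r2    <- 1 - sse / sst
r2adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, r2adj = r2adj)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # should agree
```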

Analysis of diagnostics
Residuals and hat matrix: residuals are the differences between the observed and fitted
values:
r = y - \hat{y} = y - X\hat{\beta} = y - X(X^T X)^{-1} X^T y = (I - X(X^T X)^{-1} X^T) y = (I - H) y
H is called the hat matrix. Its diagonal terms h_i are the leverages of the observations. If such a
value is close to one then that fitted value is determined almost entirely by this single observation.
Sometimes h_i' = h_i/(1 - h_i) is used to enhance high leverages.
A Q-Q plot can be used to check the normality assumption. A Q-Q plot is a plot of the quantiles of
two distributions. If the assumption about the distribution is correct then this plot should be
nearly linear. If the distribution is normal then tests designed for normal distributions can be
used; otherwise the bootstrap can be used to derive the desired distributions.
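A small sketch in R (hypothetical data with one deliberately isolated point) showing the leverages from the hat matrix and a Q-Q plot of the residuals:

```r
## Hypothetical data; the last x value is far from the rest, giving it high leverage
set.seed(9)
x <- c(runif(20), 5)
y <- 1 + 2 * x + rnorm(21)
fit <- lm(y ~ x)

h <- hatvalues(fit)          # diagonal elements h_i of the hat matrix
h[which.max(h)]              # the isolated point has the largest leverage

## Q-Q plot to check the normality assumption
qqnorm(residuals(fit))
qqline(residuals(fit))
```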
Analysis of diagnostics: Cont.
Other analysis tools include:
r_i' = r_i / (s(1 - h_i)^{1/2})   (standardized (studentized) residual)
r_i^* = r_i / (s_i(1 - h_i)^{1/2})   (externally standardized residual)
C_i = (h_i')^{1/2} r_i^*   (DFFITS)
where h_i is the leverage, h_i' is the enhanced leverage, s² is the unbiased estimator of σ², and
s_i² is the unbiased estimator of σ² after removal of the i-th observation.
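These quantities are available directly in R; a minimal sketch with hypothetical data:

```r
## Hypothetical data
set.seed(10)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

rstandard(fit)   # standardized (studentized) residuals r_i'
rstudent(fit)    # externally standardized residuals r_i* (i-th observation left out)
dffits(fit)      # DFFITS influence measure
```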
Bootstrap
The simplest application of the bootstrap to this problem is as follows (a sketch in R is given below):
1) Calculate the residuals using r = y - X\hat{\beta}
2) Sample with replacement from the residual vector and denote the result r_random
3) Design new "observations" using y = X\hat{\beta} + r_random
4) Estimate the parameters
5) Repeat steps 2, 3 and 4
6) Estimate the bootstrap estimates, variances, covariance matrix or the distribution
Another technique for bootstrapping is to resample the observations and the corresponding
rows of the design matrix simultaneously, (y_i, x_{1i}, x_{2i}, ..., x_{pi}), i = 1, ..., n. It is meant to
be less sensitive to misspecified models.
Note that for some samples the matrix may become singular and the problem may
become ill defined.
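A minimal sketch in R of the residual bootstrap described above (hypothetical data; B = 1000 replicates chosen arbitrarily):

```r
## Hypothetical data and initial fit (step 1)
set.seed(11)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit  <- lm(y ~ x)
res  <- residuals(fit)
yhat <- fitted(fit)

B <- 1000
boot_coef <- replicate(B, {
  r_random <- sample(res, replace = TRUE)   # step 2: resample residuals
  y_star   <- yhat + r_random               # step 3: new "observations"
  coef(lm(y_star ~ x))                      # step 4: re-estimate the parameters
})                                          # step 5: repeated B times

apply(boot_coef, 1, sd)   # step 6: bootstrap standard errors of the coefficients
cov(t(boot_coef))         # bootstrap covariance matrix
```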
Limitations
Statistical, not causal, links
Regression analysis gives statistical links and may not say anything about causal links. An
example: if you do a regression analysis where the number of fire-fighters is the predictor
variable and the fire size is the response, then you will certainly find a good fit. But it does not
mean that bigger fires are caused by the number of fire-fighters.
Processes are rarely linear
In nature it is rarely the case that processes are linear. Have a look at the anscombe example in R; it
demonstrates this very nicely (a short sketch follows below). Linearity works for explaining processes in the
vicinity of a point or in small regions of the space. If linearity does not hold then non-linear regression
might be more useful.
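The anscombe data set ships with base R: four x-y pairs whose linear fits are almost identical even though only the first relationship is genuinely linear. A quick look:

```r
data(anscombe)
## Nearly identical coefficients for all four pairs despite very different shapes
sapply(1:4, function(i)
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])))
## plot(anscombe$x2, anscombe$y2) and so on shows the non-linear patterns
```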
Assumptions may not hold
Some or all of the assumptions may not hold. If the violation concerns the normality assumption then
it is usually not very serious; most statistics (F, t and others) are robust in this sense. If it
is serious then maximum likelihood could be used.
The independence and equal-variance assumptions should also be carefully analysed. Violation
of the independence assumption can be very serious.
R commands
lm - general linear model
lm.ridge - ridge regression