Basic principles of probability theory

Download Report

Transcript Basic principles of probability theory

Linear and generalised linear models
•
•
•
•
•
Purpose of linear models
Solution for linear models
Some statistics related with the linear model
Analysis of diagnostics
Exponential family and generalized linear model
Reason for linear models
Purpose of regression is to reveal statistical relations between input and output variables.
Statistics cannot reveal functional relationship. It is purpose of other scientific studies.
Statistics can help validation various functional relationship. Let us assume that we suspect
that functional relationship is
y  f ( x,  ,  )
where  is a vector of unknown parameters, x=(x1,x2,,,xp) a vector of controllable parameters, and
y is output,  is an error associated with the experiment. Then we can set for various values
of x experiment and get output (or response) for them. If number of experiments is n then
we will have n output values. Denote them as a vector y=(y1,y2,,,,yn). Purpose of statistics
is to evaluate parameter vector using input and output values. If function f is a linear
function of the parameters and errors are additive then we are dealing with linear model.
For this model we can write
yi  1x1i  2 x2i  ...   p x pi  i
Note that linear model is linearly dependent on parameters but not on input variables. For
example
y  0  1x1  2 x12  
is a linear model. But
y  0  1x1  12 x2  
is not a linear model.
Assumptions
Basic assumptions for analysis of linear model are:
1)
the model is linear in parameters
2)
the error structure is additive
3)
Random errors have 0 mean and equal variances and errors are uncorrelated.
These assumptions are sufficient to deal with linear models. Uncorrelated with equal
variance assumption can be removed. Then the treatments becomes a little bit
more complicated.
Note that for general solution normality assumption is not used. This assumption is
necessary to design test statistics.
These assumption can be written in a vector form:
y  Xβ  ε, E(ε)  0, V(ε)  σ2I
where y, 0, I,  are vectors and X is a matrix. This matrix is called design matrix,
input matrix etc. I is nxn identity matrix.
Solution
Solution with given model and assumptions is:
βˆ  (XTX)1 XTy
If we use the form of the model and write least squares equation (since we want to find solution
with minimum least-squares error):
S  εTε  ( y  Xβ)T ( y  Xβ)  min
and get the first and the second derivatives and solve the equation then we can see that this
solution is correct. S
T
T
β
 2( X Xβ  X y )
2S
 2X T X
2
ββ
This solution is unbiased. If we use the formula for the solution and the expression of y then we
can write:
E (βˆ )  E ((X T X ) 1 X T y )  E ((X T X ) 1 X T ( Xβ  ε))  E ((X T X ) 1 X T Xβ  ( X T X ) 1 X Tε)
 E (β)  ( X T X ) 1 X T E (ε)  β
So solution is unbiased. Variance of estimation is:
V (βˆ )  E((βˆ  β)(βˆ  β)T )  E((XTX)1 XTεεTX(XTX)1 )  (XTX)1 XT E(εεT )X(XTX)1   2 (XTX)1
Here we used the form of the solution and assumption 3)
Variance
To calculate variance we need to be able to calculate 2. Since it is variance of the error
term we can find it using form of the solution. For the estimated error (denoted by
e) we can write:
r  y  Xβˆ  y  X(XT X)1 XTy  (I  X(XT X)1 XT )y
If we use:
y  Xβ  ε
Immediately gives
r  (I  X(XT X)1 XT )ε  Mε
Since matrix M is idempotent and symmetric i.e. M2=M=MT we can write:
E (r Tr )  E (ε T Mε)  E (tr (ε T Mε))  E (tr (MεεT )  tr (ME (εεT ) 
 2tr (M )   2 (tr (I)  tr ( X(XT X) 1 X T )   2 (n  tr ((X T X) 1 X T X)) 
 2 (n  p)
Where n is the number of the observations and p is the number of the fitted parameters.
Then for unbiased estimator for variance of the residual we can write:
ˆ 2  (y  Xβˆ )T (y  Xβˆ ) /(n  p)
Singular case
This forms of the solution is true if matrices X and XTX are non-singular. I.e. rank of matrix X
is equal to the number of parameters. If it is not true then either singular value
decomposition or eignevalue filtering techniques are used. Fortunately most good
properties of the linear model remains.
Singular value decomposition: Any nxp matrix can be decomposed in a form:
X  UDV
Where U is nxn and V is pxp orthogonal matrices. I.e.multiplication of transpose of the matrix
with itself gives unit matrix. D is nxp diagonal matrix of the singular values. If X is
singular then number of non-zero diagonal elements of D is less than p. Then for XTX we
can write:
XTX  (UDV)T UDV  VTDTUTUDV  VTDTDV
DTD is pxp diagonal matrix. If the matrix is non-singular then we can write:
(XTX)1  VT (DTD)1 V
Since DTD is diagonal its inverse is the diagonal matrix with diagonals inversed. Main trick
used in singular value decomposition techniques for equation solution is that when
diagonals are 0 or close to 0 then instead of their inversion 0 is used. I.e. if E is the
inverse of the DTD then pseudo inverse is calculated:

0
if corresponding diagonal of DTD is 0
E 
T
otherwise
1/(D D)ii
*
ii
Test of hypothesis
Sometimes question arises if some of the parameters are significant. To test this type of
hypothesis it is necessary to understand elements of likelihood ratio test. Let us assume
that we want to test the following null-hypothesis vs alternative hypothesis:
H0 : β1  0, vs, H1 : β1  0
where 1 is a subvector of the parameter vector. It is equivalent to saying that we want to test of
one or several parameters are 0 or not. Likelihood ratio test for this case works like that.
Let us assume we have the likelihood function for the parameters:
L( y | β)
where parameters are partitioned into to subvectors:
β  (β1 , β2 )
Then maximum likelihood estimators are found for two cases. 1st case is when whole parameter
vector is assumed to be variable. 2nd case is when subvector 1 is fixed to a value defined
by the null hypothesis. Then values of the likelihood function for this two cases is found
and their ratio is calculated. Assume that L0 is value of the likelihood under null
hypothesis (subvector is fixed to the given value) and L1 is under the alternative
hypothesis. Then ratio of these values is found and statistics related to this ratio is found
and is used for testing. Ratio is:
L1
L0
If this ratio is sufficiently small then null-hypothesis
is rejected. It is not always possible to find
lr 
distribution of this ratio.
Likelihood ratio test for linear model
Let us assume that we have found maximum likelihood values for the variances under null and
alternative hypothesis and they are:
ˆˆ and ˆ
furthermore let us assume that n is the number of the observations, p is the number of all
parameters and r is the number of the parameters we want test. Then it turns out that
relevant likelihood ratio test statistic for this case is related with F distribution. Relevant
random variable is:
n  p ˆˆ 2  ˆ 2
F
r
ˆ 2
This random variable has F distribution with (r,n-p) degrees of freedom. It is true if the
distribution of the errors is normal. As we know in this case maximum likelihood and
least-squares coincide.
Note: Distribution becomes F distribution if null-hypothesis is true. If it is not true then
distribution becomes non-central F distribution
Note: if there are two random variables distributed by 2 distribution with n and m degrees of
freedom respectively then their ration has F distribution with (n,m) degrees of freedom.
Analysis of diagnostics
Residuals and hat matrix: Residuals are differences between observation and fitted
values:
r  y  yˆ  y  Xβˆ  y  X(XTX) 1 XTy  (I  X(XTX) 1 XT )y  (I  H)y
H is called a hat matrix. Diagonal terms hi are leverage of the observations. If these
values are close to one then that fitted value is determined by this observation.
Sometimes hi’=hi/(1-hi) is used to enhance high leverages.
Q-Q plot can be used to check normality assumption. Q-Q plot is plot of quantiles of
two distributions (in this case frequency distribution of residuals vs normal
distribution). If assumption on distribution is correct then this plot should be nearly
linear.
Cook’s distance: Observations in turn are removed (it is equivalent to removing
corresponding row from the design matrix X). Then parameters estimated without
this observation. Cook’s distance is defined as:
(βˆ i  β)T X T X( βˆ i  β)
Di 
pˆ 2
βˆ is estimatedwithout i - th observation
i
Analysis of diagnostics: Cont.
Other analysis' tools include:1/ 2
ri  ri /( s(1  hi ) ) standardized (stutendized) residual
ri*  ri /( si (1  hi )1/ 2 ) externallystandardized residual
Ci  (hi ' )1/ 2 ri*
DFFIT S
Di  (ri ' ) 2 hi ' / p
Where hi is leverage, hi’ is enhanced leverage, s2 is unbiased estimator of 2, si2 is
unbiased estimator of 2 after removal of i-th observation
Exponential family
Natural exponential family of distributions has a form:
L( y |  , )  e( yA( )G( y ) B( )) / S ( )
S() is called a scale parameter. When A is identity then it is called. By change of
variables A can be changed by identity. So it is usual to use
L( y |  , )  e( y G( y ) B( )) / S ( )
Many distributions including normal, binomial, Poisson, exponential distributions
belong to this family.
Moment generating function if exists for this distribution is:
M (t )  e( B( )B( tS ( ))) / S ( )
Then the first moment (mean value) and the second central moments can be calculated:
E( y)  
dB( )
d
d 2 B ( )
  E ( y  E ( y ))  
S ( )
d 2
2
2
Generalised linear model
If the distribution of errors is one of the distributions from the exponential family and
some function of the expected value of the observations is linear function of the
parameters then generalised linear models are used:
g ( E( y))  g(B' (1 ),, ,B' (n ))  Xβ
Function g is called the link function. Here is list of the popular distribution and
corresponding link functions:
binomial - logit = ln(p/(1-p))
normal - identity
Gamma - inverse
poisson - log
All good statistical packages have implementation of many generalised linear models.
To use them finding initial values might be necessary.
To fit using generalised linear model the likelihood function is used (if  is constant
and we ignore constants not depending on parameter of interest):
1
l( y |  ) 
 ( y   B( ))
C ( )
n
i 1
i i
i
Then expression for natural exponential family can be used to fit a model. Most
natural way is using =X
Additive model and non-linear models
Let us consider several non-linear models briefly.
1)
Additive model. If model is described as
p
y j   0   si ( xij )
i 1
then model is called an additive model. Where si can be some set of functions.
Usually they are smooth functions. These type of models are used for
smoothening.
2) If model is a non-linear function of the parameters and the input variables then it is
called non-linear model. In general form it can be written:
yi  f ( x1i , , , xmi , 1, , ,  p , i )
Form of the function depends on the subject studied. This type of models do not have
closed form and elegant solutions. Non-linear least-squares may not have
unique solution or may have many local minimum. This type of models are
usually solved iteratively. I.e. initial values of the parameters are found and
then using some optimisation techniques they are iteratively updated. Statistical
properties of non-linear models is not straightforward to derive.
Exercise: linear model
Consider hypothesis testing. We have n observation. Parameter vector has dimension p.
We have partitioning of the parameter vector like (dimension of 1 is r):
β  (β1 , β2 )
Corresponding partitioning of the design (input) matrix is
X  (X1 , X 2 )
Assume that all observations are distributed with equal variance normally and they are
uncorrelated. Find maximum likelihood estimators for parameters and variance under
null and alternative hypothesis:
H0 : β1  0 vs H1 : β1  0
Hint: -loglikelihood function under null hypothesis is (since 1=0):
l
n
1
ln  2 
( y  X 2β2 )T ( y  X 2β2 )
2
2
2
and under the alternative hypothesis:
l
n
1
ln  2 
( y  Xβ)T ( y  Xβ)
2
2
2
Find minimum of these functions. They will be maximum likelihood estimators.