Statistical Weather Forecasting
Independent Study
Daria Kluver
From Statistical Methods in the Atmospheric Sciences by Daniel Wilks
Outline
 Review of Least-Squares Regression
 Simple Linear Regression
 Distribution of the Residuals
 The Analysis-of-variance table
 Goodness-of-fit Measures
 Sampling Distributions of the Regression Coefficients
 Examining Residuals
 Prediction Intervals
 Multiple Linear Regression
 Derived predictor variables in multiple regression
 Objective Forecasts- Without NWP
 Stratification and Compositing
 When the Predictand is a Probability
 Predictor Selection
 Screening regression
 Stopping rules
 Cross-validation
 Analog Forecasting
Background
 Statistical methods are necessary because the atmosphere
is a nonlinear dynamical system, and not perfectly
predictable in a deterministic sense.
 Classical statistical forecast methods do not operate with information from fluid-dynamical NWP models. They are still useful at very short lead times (hours in advance) or very long lead times (weeks or more).
Review of Least-squares regression
 Least-squares regression describes the linear relationship between two variables.
 Usually more than one predictor (independent) variable is needed in practical forecasts, but the ideas of multiple linear regression are introduced here in the simple case.
Simple Linear Regression
 Basically, simple linear regression seeks to summarize the relationship between two variables by a straight line.
 The regression procedure chooses the line that produces the least error for predictions of y based on x.
 Why least-squares?
   Fitting by least squares is fairly tolerant of small discrepancies between the line and the points, but it is not resistant to outliers.
Example
 Station: 46°N, 121°W, 15.0 miles (24.1 km) NW of Goldendale, Klickitat, WA, USA; approx. altitude 1089 m (3572 ft).
 [Figure: "Total Seasonal Snowfall" — scatterplot of total snowfall (cm) versus snow season, 1899-1999.]
The linear regression as calculated by SPSS:
 The fitted equation is ŷ = 40.888x - 76956.869.
 [Figure: "Total Seasonal Snowfall" — total snowfall (cm) versus snow season, 1899-1999, with the fitted regression line.]
 The line is ŷ = a + bx.
 The errors, or residuals, are e_i = y_i - ŷ(x_i).
 Combine the line and the residuals to get the regression equation: y_i = ŷ(x_i) + e_i = a + bx_i + e_i.
 This means that the true value of the predictand is the sum of the predicted value and the residual.
 To minimize the sum of squared residuals, Σe_i² = Σ[y_i - ŷ(x_i)]² = Σ(y_i - [a + bx_i])²,
 set the derivatives with respect to the parameters a and b to zero and solve; rearranging gives the normal equations:
   Σy_i = na + bΣx_i
   Σx_i y_i = aΣx_i + bΣx_i²
 Solving the normal equations for the regression parameters:
   b = [nΣx_i y_i - (Σx_i)(Σy_i)] / [nΣx_i² - (Σx_i)²]
   a = ȳ - bx̄
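As a quick numerical illustration (a minimal sketch, not from the report; the snowfall values below are hypothetical placeholders), the normal-equation formulas above can be evaluated directly:

```python
import numpy as np

# Hypothetical example data: snow season (year) and total seasonal snowfall (cm).
x = np.array([1899, 1919, 1939, 1959, 1979, 1999], dtype=float)
y = np.array([3200, 2800, 2500, 2100, 1600, 1200], dtype=float)

n = len(x)
# Slope and intercept from the normal equations:
# b = [n*sum(x*y) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2], a = ybar - b*xbar
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
a = y.mean() - b * x.mean()

y_hat = a + b * x          # predicted values
e = y - y_hat              # residuals
print(f"a = {a:.2f}, b = {b:.2f}")
```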
Distribution of the Residuals
 It is conventional to assume that the e_i are independent random variables with zero mean and constant variance. It is also sometimes assumed that these residuals follow a Gaussian distribution.
 The constant-variance assumption really means that the variance of the residuals is constant in x, so a given residual is equally likely to occur at any part of the regression line. Fig. 6.2: the residual distributions are the same, but their means are shifted depending on the level of the regression line.
 In order to make statistical inferences in the regression setting, an estimate of this constant residual variance must be made from the sample of residuals.
 [Figure: residuals (e_i) versus x value (season), 1880-2020.]
 The estimated residual variance is the sample average of the squared residuals:
   s_e² = [1/(n-2)] Σe_i² = 1847651.6
 Remembering that e_i = y_i - ŷ(x_i), this is computed as
   s_e² = [1/(n-2)] Σ[y_i - ŷ(x_i)]²
 The partition of the variation is commonly written as SST = SSR + SSE.
 SST, the total sum of squares:
   SST = Σ(y_i - ȳ)² = Σy_i² - nȳ²
   Proportional (by a factor of n-1) to the sample variance of y.
   Measures the overall variability of the predictand.
 SSR, the regression sum of squares:
   SSR = Σ[ŷ(x_i) - ȳ]², the sum of squared differences between the regression predictions and the sample mean of y.
   It relates to the regression equation by SSR = b²Σ(x_i - x̄)² = b²[Σx_i² - nx̄²].
 SSE, the sum of squared errors:
   The sum of squared differences between the residuals and their mean (which is zero).
   SSE = Σe_i²
 So the computational form is
   s_e² = [1/(n-2)]{SST - SSR} = [1/(n-2)]{Σy_i² - nȳ² - b²[Σx_i² - nx̄²]}
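Continuing the hypothetical sketch above (a, b, x, y, y_hat, e, n already defined), the sum-of-squares decomposition and s_e² can be checked numerically:

```python
# Continues the sketch above.
SST = np.sum((y - y.mean())**2)          # total sum of squares
SSR = np.sum((y_hat - y.mean())**2)      # regression sum of squares
SSE = np.sum(e**2)                        # sum of squared errors

assert np.isclose(SST, SSR + SSE)         # SST = SSR + SSE

s2_e = SSE / (n - 2)                      # estimated residual variance
print(f"SST={SST:.1f}, SSR={SSR:.1f}, SSE={SSE:.1f}, s_e^2={s2_e:.1f}")
```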
Analysis-of-variance table
 Output from regression analysis is often given in an
ANOVA table.
 Source       df       SS     MS
 Total        n-1      SST
 Regression   k        SSR    MSR = SSR/k
 Residual     n-k-1    SSE    MSE = SSE/(n-k-1)
 F = MSR/MSE

 ANOVA table for the snowfall example (SPSS):
 Model 1      Sum of Squares   df   Mean Square      F        Sig.
 Regression   139304361.091     1   139304361.091    75.405   .000(a)
 Residual     181046193.251    98   1847410.135
 Total        320350554.342    99
Goodness-of-Fit Measures
 The ANOVA table allows computation of three related measures of the “fit” of a regression.
 Fit = the correspondence between the regression line and a scatterplot of the data.
 MSE is fundamental because it indicates the variability of the observed (predictand) y values around the forecast regression line.
   It reflects the accuracy of the resulting forecasts.
   MSE = s_e², and shows how closely the residuals cluster around the regression line.
 R², the coefficient of determination:
   R² = SSR/SST = 1 - SSE/SST, the proportion of the variation of the predictand “described” by the regression.
   √R² = |Pearson correlation| between x and y.
 F ratio = MSR/MSE: a strong x-y relationship increases MSR and decreases MSE.
   In multiple regression, problems of multiplicity invalidate it as a formal test.
 Model Summary(b) for the snowfall example (SPSS):
 Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
 1       .659(a)   .435       .429                1359.19466
 Change Statistics: R Square Change = .435, F Change = 75.405, df1 = 1, df2 = 98, Sig. F Change = .000; Durbin-Watson = 1.333
Sampling Distributions of the
Regression Coefficients
 Estimating the sampling distributions of a and b allows you to construct confidence intervals around the parameter values obtained (and is also used for hypothesis testing about the population values).
 We assumed Gaussian distributions for the residuals.
 The equations for the standard errors of the intercept and slope show that the precision with which b and a can be estimated depends on the estimated standard deviation of the residuals, s_e.
 Some packages output a “t ratio” = the ratio of each estimated parameter to its standard error, so a one-sample t test is implied, the null hypothesis being that the underlying population value of the parameter is zero. Reject the null hypothesis at the 5% level if the estimated slope is roughly twice as large as its standard error.
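Under the usual assumptions, the standard errors and t ratios can be computed as in this continuation of the hypothetical sketch (the formulas are the standard simple-linear-regression expressions, not quoted from the report):

```python
# Continues the sketch: standard errors and t ratios for the slope and intercept.
s_e = np.sqrt(s2_e)                       # standard deviation of the residuals
Sxx = np.sum((x - x.mean())**2)

se_b = s_e / np.sqrt(Sxx)                 # standard error of the slope
se_a = s_e * np.sqrt(1.0/n + x.mean()**2 / Sxx)   # standard error of the intercept

t_b = b / se_b                            # "t ratio" for the slope
t_a = a / se_a                            # "t ratio" for the intercept
print(f"slope t ratio = {t_b:.2f}, intercept t ratio = {t_a:.2f}")
```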
Examining Residuals
 It is important to check that all the assumptions you made hold true, and these pertain to the residuals.
 Plot the residuals as a function of the predicted value ŷ.
 [Figure: residuals as a function of the predictand — residuals versus ŷ (predicted total snowfall).]
 Heteroscedasticity: non-constant residual variance.
   Solution: apply a logarithmic transformation to y, which reduces the larger values more strongly.
   (Or y², which stretches the small values?)
   **Remember to recover the predictand: y = e^(ln y) = e^(a+bx).
 To determine graphically whether the residuals follow a Gaussian distribution, make a quantile-quantile (Q-Q) plot.
 Investigate the degree to which the residuals are mutually independent.
   Plot the regression residuals as a function of time. If positive or negative residuals cluster together, time correlation is suspected.
 Formal test: the Durbin-Watson test. d = 1.33 for this example, so we reject the null hypothesis.
   H0: the residuals are serially independent.
   HA: the residuals are consistent with a first-order autoregressive process.
   d = Σ(e_i - e_{i-1})² / Σe_i². If the e's are positively correlated, d is small; if the e's are randomly distributed, d is large and H0 is not rejected.
   Fig. 6.7 shows critical d values at the 5% level.
 **A trend in the residuals may indicate that the nature of the relationship is changing with time, and another predictor variable may be needed.
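A minimal sketch of the Durbin-Watson statistic, computed for the residuals from the hypothetical example above (not the actual SPSS calculation):

```python
# Durbin-Watson statistic: d = sum((e_i - e_{i-1})^2) / sum(e_i^2).
# Small d suggests positively correlated residuals; d near 2 suggests independence.
d = np.sum(np.diff(e)**2) / np.sum(e**2)
print(f"Durbin-Watson d = {d:.2f}")
```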
Prediction Intervals
 Prediction intervals are confidence intervals around forecast values.
 The residuals must follow a Gaussian distribution.
 MSE = s_e², the estimated residual variance.
 So a 95% prediction interval is approximately bounded by ŷ ± 2s_e (Table B.1); this works well with a large sample size.
   The square root of s_e² is the standard deviation of the residuals, here 1359.3.
   The interval is therefore ±2718.6.
 But the fitted intercept and slope are themselves subject to sampling variations.
 For a forecast of y using the predictor value x_0, the prediction variance is
   s_ŷ² = s_e²[1 + 1/n + (x_0 - x̄)²/Σ(x_i - x̄)²]
 which is valid for simple linear regression.
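Continuing the sketch, the prediction variance and an approximate 95% interval for a new predictor value (x0 here is a hypothetical choice) could be computed as:

```python
# Prediction variance for a new predictor value x0 (hypothetical):
# s_yhat^2 = s_e^2 * [1 + 1/n + (x0 - xbar)^2 / Sxx]
x0 = 2005.0
s2_yhat = s2_e * (1.0 + 1.0/n + (x0 - x.mean())**2 / Sxx)

y0 = a + b * x0                                   # point forecast
lower, upper = y0 - 2*np.sqrt(s2_yhat), y0 + 2*np.sqrt(s2_yhat)
print(f"forecast {y0:.0f} cm, approx. 95% interval [{lower:.0f}, {upper:.0f}]")
```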
 [Figure: "Total Seasonal Snowfall" — total snowfall (cm) versus snow season, 1899-1999, with the regression line and the upper and lower prediction bounds.]
Multiple Linear Regression
 The prediction equation (where k = the number of predictors) is
   ŷ = b_0 + b_1x_1 + b_2x_2 + … + b_kx_k
 Several things are the same as in simple linear regression, but with vectors and matrix algebra.
 MSE = SSE/(n-k-1)
 R² is defined the same way, but it is NOT the square of the Pearson correlation coefficient between y and any one of the x variables.
 (A fitting sketch follows the SPSS tables below.)
 Model Summary(c):
 Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
 1       .659(a)   .435       .429                1359.19466                   .435              75.405     1     98    .000
 2       .714(b)   .510       .499                1272.73544                   .075              14.767     1     97    .000
 Durbin-Watson = 1.554
 a Predictors: (Constant), season
 b Predictors: (Constant), season, stations
 c Dependent Variable: total

 ANOVA(c):
 Model 1   Regression   Sum of Squares 139304361.091   df 1    Mean Square 139304361.091   F 75.405   Sig. .000(a)
           Residual     181046193.251                  df 98   Mean Square 1847410.135
           Total        320350554.342                  df 99
 Model 2   Regression   163224572.001                  df 2    Mean Square 81612286.000    F 50.382   Sig. .000(b)
           Residual     157125982.341                  df 97   Mean Square 1619855.488
           Total        320350554.342                  df 99
 a Predictors: (Constant), season
 b Predictors: (Constant), season, stations
 c Dependent Variable: total

 Coefficients(a):
 Model 1   (Constant)   B -76956.869   Std. Error 9180.468                 t -8.383   Sig. .000   95% CI (-95175.210, -58738.528)
           season       B 40.888       Std. Error 4.709       Beta .659    t 8.684    Sig. .000   95% CI (31.544, 50.232)     Tolerance 1.000   VIF 1.000
 Model 2   (Constant)   B -27394.476   Std. Error 15499.904                t -1.767   Sig. .080   95% CI (-58157.494, 3368.542)
           season       B 14.915       Std. Error 8.070       Beta .241    t 1.848    Sig. .068   95% CI (-1.102, 30.931)     Tolerance .299    VIF 3.350
           stations     B 228.467      Std. Error 59.454      Beta .500    t 3.843    Sig. .000   95% CI (110.468, 346.466)   Tolerance .299    VIF 3.350
 a Dependent Variable: total
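A minimal sketch of fitting a two-predictor equation like the one in the tables above, using hypothetical data (the season, stations, and snow values here are placeholders, not the report's dataset):

```python
import numpy as np

# Hypothetical two-predictor example: yhat = b0 + b1*season + b2*stations.
season   = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000], dtype=float)
stations = np.array([10, 12, 11, 14, 15, 13, 16], dtype=float)
snow     = np.array([2500, 2300, 2400, 2100, 1900, 2000, 1700], dtype=float)

X = np.column_stack([np.ones_like(season), season, stations])  # design matrix
coeffs, *_ = np.linalg.lstsq(X, snow, rcond=None)               # least-squares fit
b0, b1, b2 = coeffs

y_hat = X @ coeffs
n, k = len(snow), 2
MSE = np.sum((snow - y_hat)**2) / (n - k - 1)                   # MSE = SSE/(n-k-1)
print(f"b0={b0:.1f}, b1={b1:.2f}, b2={b2:.2f}, MSE={MSE:.1f}")
```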
Derived Predictor variables in MR
 “Derived” predictors are mathematical transformations of the predictor variables (see the sketch below).
   Example: powers, e.g. x_2 = x_1²
   Trigonometric: sin, cos
   Binary or “dummy” variables (1 or 0 based on a threshold)
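A brief illustrative sketch of constructing such derived predictor columns (the day-of-year values are hypothetical):

```python
import numpy as np

day_of_year = np.array([15, 74, 135, 196, 258, 319], dtype=float)  # hypothetical predictor
x1 = day_of_year

x_power  = x1**2                                      # power-derived predictor
x_sin    = np.sin(2*np.pi*x1/365.25)                  # trigonometric predictors
x_cos    = np.cos(2*np.pi*x1/365.25)
x_binary = (x1 > 182).astype(float)                   # dummy variable from a threshold

X = np.column_stack([np.ones_like(x1), x1, x_power, x_sin, x_cos, x_binary])
print(X.shape)   # design matrix ready for multiple regression
```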
Objective Forecasts- Without NWP
 Classical statistical forecasting = the construction of weather forecasts through purely statistical means, without information from NWP models.
   It is mostly based on multiple regression equations.
   Objective: a particular set of inputs will always produce the same forecast, once the equations are developed.
   **However, several subjective decisions go into the development of the forecast equations.
   You need developmental data: past values of the predictand and a matching collection of potential predictors, whose values will be known prior to the forecast time.
   The regression equation is fit to this historical “training” data.
   *Predictors that are not available in time for operational production of forecasts are of NO VALUE.
Stratification and Compositing
 Physical and statistical relationships may change according to some known condition (e.g., the seasons).
   One option is a predictor that is a function of the day of the year.
   It is sometimes better to stratify the data according to time of year and produce separate forecast equations for each month or season.
   You can also stratify by meteorological or climatological criteria. Examples:
     Separate temperature forecasts for snow-covered versus non-snow-covered ground
     Long-range forecasts for El Niño versus non-El Niño conditions
 You can also composite datasets to increase sample size.
   For example, data from nearby (climatologically similar) stations.
When the Predictand is a Probability
 *An advantage of statistical over (deterministic) dynamical forecasting is the capacity to produce probability forecasts.
   This is an explicit expression of the inherent uncertainty.
   It allows users to extract more value in decision making.
 To produce probability information about a predictand:
   First, transform the predictand to a binary variable.
   Second, do the regression. Two regression approaches:
     (Simplest) Regression Estimation of Event Probabilities (REEP):
       Multiple linear regression is used to derive a forecast equation for the binary predictand.
       The predicted values, between 0 and 1, are treated as specifications of the probability that y = 1.
       No more computationally demanding than any other linear regression problem.
     Logistic regression (more theoretically satisfying; see the sketch after the tables below):
       Also uses a binary predictand.
       Fits the regression parameters to a nonlinear equation.
       Guaranteed to produce properly bounded probability estimates.
       The parameters must be estimated using iterative techniques, which is computationally intensive.
 REEP example: y = 1 if total snowfall is more than the 100-year mean.

 ANOVA(b):
 Model 1   Regression   Sum of Squares .679   df 3    Mean Square .226   F 1.317   Sig. .280(a)
           Residual     7.901                 df 46   Mean Square .172
           Total        8.580                 df 49
 a Predictors: (Constant), EA, stations, season
 b Dependent Variable: VAR00017

 Model Summary(b):
 Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
 1       .281(a)   .079       .019                .41445                       .079              1.317      3     46    .280            2.154
 a Predictors: (Constant), EA, stations, season
 b Dependent Variable: VAR00017

 Coefficients(a):
 Model 1   (Constant)   B 15.208   Std. Error 14.380                t 1.058    Sig. .296   95% CI (-13.737, 44.153)
           season       B -.007    Std. Error .007    Beta -.230    t -.945    Sig. .350   95% CI (-.021, .007)
           stations     B -.166    Std. Error .092    Beta -.404    t -1.816   Sig. .076   95% CI (-.351, .018)
           EA           B .027     Std. Error .068    Beta .066     t .402     Sig. .689   95% CI (-.110, .164)
 a Predictors: (Constant), EA, stations, season
 b Dependent Variable: VAR00017
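In contrast to the REEP output above, here is a minimal sketch of logistic regression for a binary predictand, fit by a few Newton (iteratively reweighted least squares) steps; the data are hypothetical placeholders:

```python
import numpy as np

# Hypothetical binary predictand (1 if seasonal snowfall exceeds its mean) and one predictor.
x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000], dtype=float)
y = np.array([1, 1, 1, 0, 1, 0, 0], dtype=float)

X = np.column_stack([np.ones_like(x), (x - x.mean()) / x.std()])  # centered/scaled predictor
beta = np.zeros(2)

# Newton / IRLS iterations for the logistic model p = 1 / (1 + exp(-X @ beta)).
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

print("fitted parameters:", beta)
print("fitted probabilities:", np.round(1/(1 + np.exp(-X @ beta)), 3))
```

Unlike REEP, the fitted probabilities here are guaranteed to lie between 0 and 1, at the cost of the iterative estimation loop.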
Predictor Selection
 How do we find a good subset of potential predictors?
 There are dangers in including too many predictor variables, illustrated by an example with nonsense predictors and a perfect fit.
 Development sample, dependent sample, or training sample: the portion of the available data used to produce the forecast equation.
 Any K = n-1 predictors will produce a perfect regression fit to any predictand for which there are n observations. This is called “overfitting.”
 Such a fit will not work for independent, or verification, data: data not used in the development of the equation.
 Only the first 50 years:
 Model Summary:
 Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
 1       .805(a)   .648       .633                904.77237                    .648              42.369     2     46    .000
 a Predictors: (Constant), stations1, year
 b Dependent Variable: total1

 ANOVA(b):
 Model 1   Regression   Sum of Squares 69367224.396    df 2    Mean Square 34683612.198   F 42.369   Sig. .000(a)
           Residual     37656200.290                   df 46   Mean Square 818613.050
           Total        107023424.685                  df 48
 a Predictors: (Constant), stations1, year
 b Dependent Variable: total1

 Coefficients(a):
 Model 1   (Constant)   B -55797.734   Std. Error 25901.806                t -2.154   Sig. .036   95% CI (-107935.366, -3660.102)
           year         B 29.564       Std. Error 13.508     Beta .283     t 2.189    Sig. .034   95% CI (2.374, 56.753)
           stations1    B 584.760      Std. Error 131.753    Beta .574     t 4.438    Sig. .000   95% CI (319.554, 849.965)
 a Dependent Variable: total1
 Lessons learned:
   Choose only physically reasonable or meaningful predictors.
   A tentative regression equation needs to be tested on a sample of data not involved in its development (a very large difference in performance between the dependent and independent samples would lead to suspicion that the equation was overfit); see the sketch after this list.
   You need a reasonably large developmental sample if the resulting regression equation is to be stable, i.e., if the regression coefficients are to be applicable to independent (future) data and not the result of an overfit. *In forecasting, little is gained from more than about 12 predictors.
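A minimal sketch of comparing dependent-sample and independent-sample performance; the 50/50 split and the synthetic record below are hypothetical choices, not the report's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 100-year record: predictand with a weak trend plus noise.
year = np.arange(1900, 2000, dtype=float)
snow = 3000 - 5*(year - 1900) + rng.normal(0, 800, size=year.size)

# Fit on the dependent (training) sample, verify on the independent sample.
train, verify = slice(0, 50), slice(50, 100)
X = np.column_stack([np.ones_like(year), year])
coef, *_ = np.linalg.lstsq(X[train], snow[train], rcond=None)

mse_dep = np.mean((snow[train] - X[train] @ coef)**2)
mse_ind = np.mean((snow[verify] - X[verify] @ coef)**2)
print(f"dependent-sample MSE {mse_dep:.0f}, independent-sample MSE {mse_ind:.0f}")
```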
Screening Regression
 Relevant predictor variables are almost always mutually correlated, so they contain redundant information.
 Screening regression: the problem of selecting a good set of predictors from a pool of potential predictors.
 Common types of screening (see the sketch below):
   Forward selection, AKA stepwise regression: add the predictor giving the highest R², lowest MSE, or largest F ratio.
   Backward elimination: remove the predictor with the smallest t ratio.
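A minimal sketch of forward selection by smallest residual MSE, using hypothetical data and a fixed number of steps in place of a formal stopping rule:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_pool = 80, 6
pool = rng.normal(size=(n, n_pool))                  # hypothetical candidate predictors
y = 2.0*pool[:, 0] - 1.0*pool[:, 2] + rng.normal(size=n)

def fit_mse(cols):
    """Least-squares fit of y on the chosen columns (plus intercept); return MSE."""
    X = np.column_stack([np.ones(n)] + [pool[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return np.sum(resid**2) / (n - len(cols) - 1)

selected = []
for _ in range(3):                                    # add three predictors, one per pass
    remaining = [c for c in range(n_pool) if c not in selected]
    best = min(remaining, key=lambda c: fit_mse(selected + [c]))
    selected.append(best)
    print(f"selected predictor {best}, MSE now {fit_mse(selected):.3f}")
```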
Stopping Rules
 Both forward selection and backward elimination require a stopping criterion.
 For example, a formal F test is awkward because every time you recalculate the regression, the F ratios change.
 In practice, other less rigorous stopping criteria are used:
   Stop adding predictors when none of the remaining predictors would increase the R² by some minimum amount (e.g., 0.05%).
   Use the MSE: sqrt(MSE) directly reflects the anticipated forecast accuracy of the regression. For example, if MSE = 0.01 °F², why add more? This indicates a ±2 standard deviation (~95%) confidence interval of about ±2·sqrt(0.01) = ±0.2 °F. The stopping criterion is the point where the MSE does not decline appreciably with the addition of more predictors.
 Artificial skill: the underestimate of the operational MSE provided by the performance of a forecast equation on a developmental data set.
   The MSE minimum sometimes occurs at a smaller number of predictors K for the independent data.
   For a “black box” forecast, you just minimize MSE and do not worry about the identities of the predictors.
   For scientific understanding, it is not the reduction of MSE that is most important, but the relationships among the variables suggested by the regression procedure.
Cross-Validation
 Cross-validation is a resampling technique.
 The available data are repeatedly divided into developmental and verification data subsets.
 Often the developmental data sets are of size n-1 and each verification data set contains the single remaining observation of the predictand, giving n partitions of the data.
 The cross-validation estimate of the prediction MSE is computed by forecasting each omitted observation, computing the squared difference between the prediction and the predictand, and averaging the n squared differences (see the sketch below).
 This could also be done for any m withheld observations and developmental data sets of size n-m.
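A minimal sketch of leave-one-out cross-validation for a simple linear regression, using hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1950, 2000, dtype=float)                 # hypothetical years
y = 2500 - 8*(x - 1950) + rng.normal(0, 400, size=x.size)

n = len(x)
sq_errors = np.empty(n)
for i in range(n):                                     # leave observation i out
    keep = np.ones(n, dtype=bool)
    keep[i] = False
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    pred = coef[0] + coef[1]*x[i]                      # forecast the omitted observation
    sq_errors[i] = (y[i] - pred)**2

print(f"cross-validation MSE estimate: {sq_errors.mean():.0f}")
```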
Analog Forecasting
 The idea is to search archives of climatological synoptic data for maps closely resembling the current observations, and to assume that the future evolution of the atmosphere will be similar to the flows that followed the historical analog.
 “The atmosphere is a perfect model of itself.”
 The atmosphere apparently does not ever exactly repeat itself.
 You must match all relevant fields (not just heights and temperature, but SST, etc.).
 Statistics can be used to search for analogs and rank historical cases by closeness to the current state (see the sketch below).
 Take the average outcome of all good analogs.
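A minimal sketch of ranking historical cases by root-mean-square difference from the current map; the archive and the current map are random placeholders standing in for gridded anomaly fields:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical archive: 60 historical "maps", each flattened to 500 grid points.
archive = rng.normal(size=(60, 500))
current = rng.normal(size=500)                       # today's map

# Root-mean-square difference between the current map and each historical map.
rmsd = np.sqrt(np.mean((archive - current)**2, axis=1))

best = np.argsort(rmsd)[:5]                          # the five closest analogs
print("closest analog cases (indices):", best)
print("their RMS differences:", np.round(rmsd[best], 3))
```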