Statistical Weather Forecasting
INDEPENDENT STUDY
DARIA KLUVER
FROM STATISTICAL METHODS IN THE ATMOSPHERIC SCIENCES BY DANIEL WILKS
Outline
Review of Least-Squares Regression
Simple Linear Regression
Distribution of the Residuals
The Analysis-of-variance table
Goodness-of-fit Measures
Sampling Distributions of the Regression Coefficients
Examining Residuals
Prediction Intervals
Multiple Linear Regression
Derived predictor variables in multiple regression
Objective Forecasts Without NWP
Stratification and Compositing
When the Predictand is a Probability
Predictor Selection
Screening regression
Stopping rules
Cross-validation
Analog Forecasting
Background
Statistical methods are necessary because the atmosphere
is a nonlinear dynamical system, and not perfectly
predictable in a deterministic sense.
Classical statistical forecast methods do not operate with information from fluid-dynamical NWP models. They are still useful at very short lead times (hours in advance) or very long lead times (weeks or more).
Review of Least-squares regression
Least-squares regression describes the linear relationship between 2 variables.
Usually more than 1 predictor (independent) variable is needed in practical forecasts, but the ideas behind multiple linear regression are most easily introduced in the simple case.
Simple Linear Regression
Basically, simple linear regression seeks to summarize the relationship between 2 variables by a straight line.
The regression procedure chooses the line that
produces the least error for predictions of y based
on x.
Why least-squares?
Line-fitting is fairly tolerant of small discrepancies between
line and points, but is not resistant to outliers.
Example
46°N 121°W, 15.0 miles (24.1 km) NW of Goldendale, Klickitat, WA, USA; approx. altitude: 1089 m (3572 ft)
[Figure: total seasonal snowfall (cm) plotted by snow season, 1899-1999.]
The linear regression as calculated by SPSS:
Eqn is: $\hat{y} = 40.88x - 76956.87$
[Figure: total seasonal snowfall (cm) by snow season, 1899-1999, with the fitted regression line.]
The regression line is $\hat{y} = a + bx$,
and the errors, or residuals, are $e_i = y_i - \hat{y}(x_i)$.
Combine the line and the residuals to get the regression eqn:
$y_i = \hat{y}(x_i) + e_i = a + bx_i + e_i$,
meaning that the true value of the predictand is the sum of the predicted value and the residual.
To minimize the sum of squared residuals,
$\sum e_i^2 = \sum [y_i - \hat{y}(x_i)]^2 = \sum (y_i - [a + bx_i])^2$,
set the derivatives with respect to the parameters a and b to zero and solve; rearranging gives the normal equations:
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
Solving the normal equations for the regression parameters:
$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$ and $a = \bar{y} - b\bar{x}$
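A minimal sketch of these formulas in Python/NumPy, using made-up numbers in place of the snowfall record (the arrays and values are illustrative, not from the slides):

```python
import numpy as np

# Simple least-squares regression via the normal equations (illustrative data).
season = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)  # predictor x
snow = np.array([3100., 2800., 3500., 2600., 3900., 3300.])           # predictand y

n = len(season)
b = (n * np.sum(season * snow) - np.sum(season) * np.sum(snow)) / (
    n * np.sum(season**2) - np.sum(season)**2)
a = snow.mean() - b * season.mean()

y_hat = a + b * season          # fitted values
residuals = snow - y_hat        # e_i = y_i - yhat(x_i)
print(f"a = {a:.2f}, b = {b:.2f}")
```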
Distribution of the Residuals
Conventional to assume the $e_i$'s are independent random variables with zero mean and constant variance. It is also sometimes assumed that these residuals follow a Gaussian distribution.
A constant-variance assumption really means that the variance of the residuals is constant in x, so a given residual is equally likely to occur at any part of the regression line. Fig 6.2: the distributions are the same, but their means are shifted depending on the level of the regression line.
In order to make statistical inferences in the regression setting, an estimate of this constant residual variance must be made from the sample of residuals.
[Figure: residuals $e_i$ plotted against x value (season), 1880-2020.]
Sample average of the squared residuals:
$s_e^2 = \frac{1}{n-2} \sum e_i^2 = 1847651.6$
Remember $e_i = y_i - \hat{y}(x_i)$, so
$s_e^2 = \frac{1}{n-2} \sum [y_i - \hat{y}(x_i)]^2$
is used to compute the estimated residual variance.
The partitioning of the variation of the predictand is commonly written as SST = SSR + SSE:
SST
Total sum of squares
$SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
Proportional (by a factor of n-1) to the sample variance of y.
Measures the overall variability of the predictand.
SSR
Regression sum of squares
$SSR = \sum [\hat{y}(x_i) - \bar{y}]^2$, the sum of squared differences between the regression predictions and the sample mean of y.
Relates to the regression eqn by $SSR = b^2 \sum (x_i - \bar{x})^2 = b^2 [\sum x_i^2 - n\bar{x}^2]$
SSE
Sum of squared errors: the sum of squared differences between the residuals and their mean (zero),
$SSE = \sum e_i^2$
So $s_e^2 = \frac{1}{n-2}(SST - SSR) = \frac{1}{n-2}\{\sum y_i^2 - n\bar{y}^2 - b^2[\sum x_i^2 - n\bar{x}^2]\}$,
the computational form.
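A sketch of this sum-of-squares decomposition for a fitted simple regression, again with illustrative data rather than the snowfall record:

```python
import numpy as np

# Sum-of-squares decomposition SST = SSR + SSE and residual variance s_e^2.
x = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
y = np.array([3100., 2800., 3500., 2600., 3900., 3300.])
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # same slope as the normal equations
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean())**2)        # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)    # regression sum of squares
sse = np.sum((y - y_hat)**2)           # sum of squared errors
s2_e = sse / (len(y) - 2)              # estimated residual variance

print(f"SST={sst:.1f}  SSR+SSE={ssr + sse:.1f}  s_e^2={s2_e:.1f}")
```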
Analysis-of-variance table
Output from regression analysis is often given in an
ANOVA table.
Source       df       SS     MS
Total        n-1      SST
Regression   k        SSR    MSR = SSR/k
Residual     n-k-1    SSE    MSE = SSE/(n-k-1)
F = MSR/MSE
For the snowfall example (SPSS output):

Model 1        Sum of Squares    df    Mean Square      F        Sig.
Regression      139304361.091     1    139304361.091    75.405   .000(a)
Residual        181046193.251    98      1847410.135
Total           320350554.342    99
Goodness-of-Fit Measures
The ANOVA table allows computation of 3 related measures of the "fit" of a regression.
Fit = correspondence between the regression line and a scatterplot of the data.
MSE is fundamental because it indicates the variability of the observed (predictand) y values around the forecast regression line, and so reflects the accuracy of the resulting forecasts.
MSE = $s_e^2$ and shows how tightly the residuals cluster around the regression line.
R², the coefficient of determination:
$R^2 = SSR/SST = 1 - SSE/SST$, the portion of the variation of the predictand "described" by the regression.
$\sqrt{R^2}$ = |Pearson correlation| between x and y (in simple linear regression).
F ratio = MSR/MSE; a strong x-y relationship increases MSR and decreases MSE.
In multiple regression, problems of test multiplicity invalidate the nominal significance of the F ratio.
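A short sketch of these goodness-of-fit measures computed from the ANOVA quantities (illustrative arrays, k = 1 predictor):

```python
import numpy as np

# Goodness-of-fit measures (MSE, R^2, F ratio) for a simple linear regression.
x = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
y = np.array([3100., 2800., 3500., 2600., 3900., 3300.])
b, a = np.polyfit(x, y, 1)          # slope, intercept
y_hat = a + b * x
n, k = len(y), 1                    # k = number of predictors

sst = np.sum((y - y.mean())**2)
ssr = np.sum((y_hat - y.mean())**2)
sse = np.sum((y - y_hat)**2)
mse = sse / (n - k - 1)             # reflects anticipated forecast accuracy
r2 = ssr / sst                      # fraction of predictand variation described
f_ratio = (ssr / k) / mse           # F = MSR / MSE
print(f"MSE={mse:.1f}  R^2={r2:.3f}  F={f_ratio:.2f}")
```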
Model Summary (SPSS output):

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .659(a)  .435       .429                1359.19466

Change Statistics: R Square Change = .435, F Change = 75.405, df1 = 1, df2 = 98, Sig. F Change = .000
Durbin-Watson = 1.333
Sampling Distributions of the
Regression Coefficients
Estimating the sampling distributions of a and b allows you to construct confidence intervals around the parameter values obtained (and is also used for hypothesis testing about the population values).
We assumed Gaussian distributions.
The equations for the intercept and slope show that the precision with which b and a can be estimated depends on the estimated standard deviation of the residuals, $s_e$.
Some packages output a "t ratio", the ratio of each estimated parameter to its standard error, so a one-sample t test is implied, with the null hypothesis that the underlying population value of the parameter is zero. The null hypothesis is rejected at the 5% level when the estimated slope is roughly twice as large (or more) as its standard error.
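A hedged sketch of these standard errors and t ratios, using the standard simple-regression formulas Var(b) = s_e²/Σ(x_i − x̄)² and Var(a) = s_e²·Σx_i²/[n·Σ(x_i − x̄)²], with illustrative data:

```python
import numpy as np

# Standard errors and t ratios for the slope and intercept of a simple regression.
x = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
y = np.array([3100., 2800., 3500., 2600., 3900., 3300.])
n = len(x)
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s2_e = np.sum(resid**2) / (n - 2)                  # residual variance
sxx = np.sum((x - x.mean())**2)

se_b = np.sqrt(s2_e / sxx)                         # standard error of the slope
se_a = np.sqrt(s2_e * np.sum(x**2) / (n * sxx))    # standard error of the intercept

t_b = b / se_b   # "t ratio": reject H0 (slope = 0) at ~5% when |t_b| is roughly 2 or more
t_a = a / se_a
print(f"b = {b:.2f} (se {se_b:.2f}, t {t_b:.2f}); a = {a:.1f} (se {se_a:.1f}, t {t_a:.2f})")
```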
Examining Residuals
It is important to check that the assumptions made about the residuals actually hold true.
Plot the residuals as a function of the predicted value $\hat{y}$.
[Figure: residuals plotted as a function of $\hat{y}$ (predicted total snowfall).]
Heteroscedasticity: non-constant residual variance.
One solution is to apply a logarithmic transformation to y, which reduces larger values more strongly (or possibly y², which stretches the small values?).
Remember to recover the predictand: $y = e^{\ln y} = e^{a + bx}$
To determine graphically whether the residuals follow a Gaussian distribution, make a quantile-quantile (Q-Q) plot.
Investigate the degree to which the residuals are mutually independent.
Plot the regression residuals as a function of time. If positive or negative residuals cluster together, time correlation is suspected.
Formal test: the Durbin-Watson test.
H0: the residuals are serially independent.
HA: the residuals are consistent with a first-order autoregressive process.
$d = \sum_{i=2}^{n} (e_i - e_{i-1})^2 / \sum_{i=1}^{n} e_i^2$. If the residuals are positively correlated, d is small; if they are randomly distributed, d is large and H0 is not rejected.
Fig 6.7 shows critical d values at the 5% level. For this example d = 1.33, so the null hypothesis is rejected (a sketch of the computation follows).
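A minimal sketch of the Durbin-Watson statistic for a vector of regression residuals (illustrative numbers, not the snowfall residuals):

```python
import numpy as np

# Durbin-Watson statistic d = sum (e_i - e_{i-1})^2 / sum e_i^2.
# Small d suggests positively autocorrelated residuals.
e = np.array([120., 95., 140., -30., -80., -60., 10., 55.])  # illustrative residuals

d = np.sum(np.diff(e)**2) / np.sum(e**2)
print(f"Durbin-Watson d = {d:.2f}")  # compare with tabulated critical values
```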
**A trend in the residuals may indicate that the nature of the relationship is changing with time, and that another predictor variable may be needed.
Prediction Intervals
To calculate confidence intervals around forecast values, the residuals must follow a Gaussian distribution.
MSE = $s_e^2$, the estimated residual variance.
So the 95% prediction interval is approximately bounded by $\hat{y} \pm 2 s_e$ (Table B.1); this works well with a large sample size.
The square root of $s_e^2$ is the standard deviation of the residuals, 1359.3, so the interval is ±2718.6.
But the intercept and slope are themselves subject to sampling variations.
For a forecast of y using the predictor value $x_0$, the prediction variance is
$s_{\hat{y}}^2 = s_e^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]$,
valid for simple linear regression.
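A sketch of this prediction-variance formula, again with illustrative data and a hypothetical forecast year:

```python
import numpy as np

# Prediction interval for a new predictor value x0 in simple linear regression.
x = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
y = np.array([3100., 2800., 3500., 2600., 3900., 3300.])
n = len(x)
b, a = np.polyfit(x, y, 1)
s2_e = np.sum((y - (a + b * x))**2) / (n - 2)

x0 = 1997.0                                   # hypothetical forecast year
s2_yhat = s2_e * (1 + 1/n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2))
y0 = a + b * x0
lo, hi = y0 - 2*np.sqrt(s2_yhat), y0 + 2*np.sqrt(s2_yhat)   # approx. 95% bounds
print(f"forecast {y0:.0f} cm, ~95% interval [{lo:.0f}, {hi:.0f}]")
```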
[Figure: total seasonal snowfall (cm) by snow season, 1899-1999, showing the observed snow (cm), the regression line, and the upper and lower prediction-interval bounds.]
Multiple Linear Regression
Prediction eqn (where k = number of predictors):
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
Several things are the same as in simple regression, but with vectors and matrix algebra.
MSE = SSE/(n-k-1)
R² is defined the same way, but is NOT the square of the Pearson correlation coefficient between y and any one of the x variables.
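A minimal multiple-regression sketch using NumPy's least-squares solver; "season" and "stations" echo the slide's predictor names, but the numbers are made up:

```python
import numpy as np

# Multiple linear regression yhat = b0 + b1*x1 + b2*x2 via least squares.
season = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
stations = np.array([12, 13, 13, 14, 15, 15], dtype=float)
snow = np.array([3100., 2800., 3500., 2600., 3900., 3300.])

X = np.column_stack([np.ones_like(season), season, stations])  # design matrix with intercept
coeffs, *_ = np.linalg.lstsq(X, snow, rcond=None)              # [b0, b1, b2]
y_hat = X @ coeffs

n, k = len(snow), X.shape[1] - 1
mse = np.sum((snow - y_hat)**2) / (n - k - 1)
print("coefficients:", np.round(coeffs, 3), " MSE:", round(mse, 1))
```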
Model Summary(c)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .659(a)  .435       .429                1359.19466                   .435              75.405     1     98    .000
2       .714(b)  .510       .499                1272.73544                   .075              14.767     1     97    .000            1.554

a Predictors: (Constant), season
b Predictors: (Constant), season, stations
c Dependent Variable: total

ANOVA(c)

Model           Sum of Squares   df   Mean Square     F        Sig.
1  Regression   139304361.091     1   139304361.091   75.405   .000(a)
   Residual     181046193.251    98     1847410.135
   Total        320350554.342    99
2  Regression   163224572.001     2    81612286.000   50.382   .000(b)
   Residual     157125982.341    97     1619855.488
   Total        320350554.342    99

a Predictors: (Constant), season
b Predictors: (Constant), season, stations
c Dependent Variable: total

Coefficients(a)

Model          B            Std. Error   Beta   t        Sig.   95% CI Lower   95% CI Upper   Tolerance   VIF
1  (Constant)  -76956.869    9180.468           -8.383   .000   -95175.210     -58738.528
   season           40.888      4.709    .659    8.684   .000       31.544         50.232    1.000       1.000
2  (Constant)  -27394.476   15499.904           -1.767   .080   -58157.494       3368.542
   season           14.915      8.070    .241    1.848   .068       -1.102         30.931     .299       3.350
   stations        228.467     59.454    .500    3.843   .000      110.468        346.466     .299       3.350

a Dependent Variable: total
Derived Predictor variables in MR
"Derived" predictors are mathematical transformations of the original predictor variables.
Examples: powers, e.g. $x_2 = x_1^2$
Trig functions: sin, cos
Binary or "dummy" variables (1 or 0 based on a threshold)
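A sketch of building a few derived predictors for a regression design matrix (illustrative day-of-year and base-predictor arrays):

```python
import numpy as np

# Derived predictors: a power term, annual-cycle trig terms, and a binary
# (dummy) variable based on a threshold. Arrays are illustrative.
day_of_year = np.array([15, 74, 135, 196, 258, 319], dtype=float)
x1 = np.array([2.0, 5.5, 11.0, 17.5, 12.0, 4.5])         # some base predictor

x_power = x1**2                                           # power transformation
x_sin = np.sin(2 * np.pi * day_of_year / 365.25)          # annual harmonic
x_cos = np.cos(2 * np.pi * day_of_year / 365.25)
x_dummy = (x1 > 10.0).astype(float)                       # 1/0 based on a threshold

X = np.column_stack([np.ones_like(x1), x1, x_power, x_sin, x_cos, x_dummy])
print(X.round(3))
```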
Objective Forecasts Without NWP
Classical statistical forecasting = construction of weather forecasts through purely statistical means, without information from NWP models.
Mostly based on multiple regression eqns.
Objective: a particular set of inputs will always produce the same forecast, once the eqns are developed.
**However, several subjective decisions go into the development of the forecast eqns.
You need developmental data: past values of the predictand and a matching collection of potential predictors, whose values will be known prior to the forecast time.
The regression eqn is fit to this historical "training" data.
*Predictors that are not available in time for operational production of forecasts are of NO VALUE.
Stratification and Compositing
Physical and statistical relationships may change according to some known condition (e.g. the season).
One option is to include a predictor that is a function of the day of year.
Sometimes it is better to stratify the data according to time of year and produce separate forecast eqns for each month or season.
Data can also be stratified by meteorological or climatological criteria, for example:
Separate temperature forecasts for snow-covered vs. non-snow-covered ground
Long-range forecasts for El Nino vs. non-El Nino conditions
You can also composite datasets to increase sample size, e.g. data from nearby (climatologically similar) stations.
When the Predictand is a Probability
*An advantage of statistical over (deterministic) dynamical forecasting is the capacity to produce probability forecasts:
Explicit expression of the inherent uncertainty
Allows users to extract more value in decision making.
To produce probability info about a predictand:
1st, transform the predictand to a binary variable.
2nd, do the regression. Two regression approaches (see the sketch after this list):
(Simplest) Regression Estimation of Event Probabilities (REEP)
MLR is used to derive a forecast eqn for the binary predictand.
Values between 0 and 1 are treated as specifications of the probability of y = 1.
No more computationally demanding than any other linear regression problem.
Logistic Regression (more theoretically satisfying)
Also a binary predictand.
Fits regression parameters to a nonlinear eqn ...
Guaranteed to produce properly bounded probability estimates.
Parameters must be estimated using iterative techniques, which is more computationally intensive.
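A hedged REEP sketch: ordinary least squares on a 0/1 predictand, with the fitted values read as probabilities and clipped to [0, 1] (unlike logistic regression, REEP does not guarantee bounded estimates). Data are illustrative.

```python
import numpy as np

# REEP: linear regression on a binary predictand; fitted values ~ P(y = 1).
season = np.array([1990, 1991, 1992, 1993, 1994, 1995], dtype=float)
snow = np.array([3100., 2800., 3500., 2600., 3900., 3300.])
y_binary = (snow > snow.mean()).astype(float)          # 1 if above-average snow season

X = np.column_stack([np.ones_like(season), season])
coeffs, *_ = np.linalg.lstsq(X, y_binary, rcond=None)
prob = np.clip(X @ coeffs, 0.0, 1.0)                   # REEP "probabilities", clipped to [0, 1]
print(np.round(prob, 2))
```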
ANOVA(b)

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression        .679         3       .226      1.317   .280(a)
   Residual         7.901        46       .172
   Total            8.580        49
y = 1 if the total snowfall is more than the 100-year mean.
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .281(a)  .079       .019                .41445                       .079              1.317      3     46    .280            2.154

a Predictors: (Constant), EA, stations, season
b Dependent Variable: VAR00017

Coefficients(a)

Model          B        Std. Error   Beta    t        Sig.   95% CI Lower   95% CI Upper
1  (Constant)  15.208   14.380                1.058   .296   -13.737        44.153
   season        -.007     .007      -.230    -.945   .350      -.021         .007
   stations       .027     .068       .066     .402   .689      -.110         .164
   EA            -.166     .092      -.404   -1.816   .076      -.351         .018

a Predictors: (Constant), EA, stations, season
b Dependent Variable: VAR00017
Predictor Selection
How do we find a good subset of potential predictors? There are dangers in including too many predictor variables, illustrated by an example with nonsense predictors and a perfect fit (see the sketch below).
Development sample, dependent sample, or training sample: the portion of the available data used to produce the forecast eqn.
Any K = n-1 predictors will produce a perfect regression fit to any predictand for which there are n observations. This is called "overfitting".
Such an equation will not work for independent, or verification, data: data not used in the development of the equation.
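A small demonstration of this overfitting effect: with n observations and K = n-1 nonsense (random) predictors plus an intercept, the fit is perfect on the training data but useless on independent data (all numbers are random and purely illustrative):

```python
import numpy as np

# Overfitting demo: K = n - 1 random "nonsense" predictors fit n observations perfectly.
rng = np.random.default_rng(0)
n = 10
y = rng.normal(size=n)                                           # predictand: pure noise
X = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])   # intercept + K = n-1 predictors

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
r2_train = 1 - np.sum((y - X @ coeffs)**2) / np.sum((y - y.mean())**2)

# Independent (verification) data drawn the same way: the apparent "skill" vanishes.
y_new = rng.normal(size=n)
X_new = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
r2_verif = 1 - np.sum((y_new - X_new @ coeffs)**2) / np.sum((y_new - y_new.mean())**2)
print(f"training R^2 = {r2_train:.3f}, verification R^2 = {r2_verif:.3f}")
```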
Only the first 50 years:
Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .805(a)  .648       .633                904.77237                    .648              42.369     2     46    .000

a Predictors: (Constant), stations1, year
b Dependent Variable: total1

ANOVA(b)

Model           Sum of Squares   df   Mean Square     F        Sig.
1  Regression    69367224.396     2   34683612.198    42.369   .000(a)
   Residual      37656200.290    46     818613.050
   Total        107023424.685    48
Coefficients(a)

Model          B            Std. Error   Beta   t        Sig.   95% CI Lower   95% CI Upper
1  (Constant)  -55797.734   25901.806           -2.154   .036   -107935.366    -3660.102
   year             29.564      13.508   .283    2.189   .034         2.374       56.753
   stations1       584.760     131.753   .574    4.438   .000       319.554      849.965

a Dependent Variable: total1
Lessons learned:
Choose only physically reasonable or meaningful predictors.
A tentative regression eqn needs to be tested on a sample of data not involved in its development (a very large difference in performance between the dependent and independent samples would lead to suspicion that the eqn was overfit).
You need a reasonably large developmental sample if the resulting regression eqn is to be stable, i.e., if the regression coefficients are also to be applicable to independent (future) data rather than artifacts of overfitting. *In forecasting, little is gained from using more than about 12 predictors.
Screening Regression
Relevant predictor variables are almost always mutually correlated, so there is redundant information among them.
Screening regression: the problem of selecting a good set of predictors from a pool of potential predictors.
Common types of screening (see the sketch after this list):
Forward selection, AKA stepwise regression: at each step add the candidate predictor giving the highest R², lowest MSE, or largest F ratio.
Backward elimination: at each step remove the predictor with the smallest t ratio.
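A hedged sketch of forward selection by MSE, using plain NumPy least squares; the data are random and illustrative, and the loop simply stops after adding a fixed number of predictors rather than applying a formal stopping rule:

```python
import numpy as np

# Forward selection: greedily add the candidate predictor that most reduces MSE.
rng = np.random.default_rng(1)
n, n_candidates = 60, 6
X_pool = rng.normal(size=(n, n_candidates))            # pool of candidate predictors
y = 2.0 * X_pool[:, 0] - 1.0 * X_pool[:, 2] + rng.normal(size=n)

def fit_mse(cols):
    """MSE of a least-squares fit using the predictor columns in `cols`."""
    X = np.column_stack([np.ones(n)] + [X_pool[:, j] for j in cols])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return np.sum(resid**2) / (n - len(cols) - 1)

selected = []
for _ in range(3):                                     # add 3 predictors, one per pass
    remaining = [j for j in range(n_candidates) if j not in selected]
    best = min(remaining, key=lambda j: fit_mse(selected + [j]))
    selected.append(best)
    print(f"selected predictors {selected}, MSE = {fit_mse(selected):.3f}")
```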
Stopping Rules
Both forward selection and backward elimination require a stopping criterion.
(For example, the F ratios change every time the regression is recalculated with a different predictor set, so the nominal F test is problematic.)
In practice, other, less rigorous stopping criteria are used:
Stop adding predictors when none of the remaining predictors would increase the R² by more than some small amount (e.g. 0.05%).
Use the MSE: √MSE directly reflects the anticipated forecast accuracy of the regression. For example, if the MSE is already 0.01 °F², why add more predictors? It already indicates a ±2 standard deviation (~95%) confidence interval of ±2√0.01 = ±0.2 °F. The stopping criterion is then the point where the MSE does not decline appreciably with the addition of more predictors.
Artificial skill: the underestimate of the operational MSE provided by the performance of a forecast eqn on its developmental data set.
Sometimes the value of K that minimizes the MSE is smaller for independent data than for the developmental data.
For a "black box" forecast you just minimize the MSE and do not worry about the identities of the predictors. For scientific understanding it is not the reduction of MSE that is most important, but the relationships among the variables suggested by the regression procedure.
Cross-Validation
A resampling technique.
The available data are repeatedly divided into developmental and verification data subsets.
Often the developmental data sets are of size n-1 and each verification data set contains the single remaining observation of the predictand, so there are n partitions of the data (leave-one-out).
The cross-validation estimate of the prediction MSE is computed by forecasting each omitted observation, computing the squared difference between the prediction and the predictand, and averaging the n squared differences (see the sketch below).
This could also be done for any m withheld observations and developmental data sets of size n-m.
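A leave-one-out cross-validation sketch for the simple regression (illustrative data again):

```python
import numpy as np

# Leave-one-out cross-validation of a simple linear regression:
# refit with one observation withheld, forecast it, average the squared errors.
x = np.array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997], dtype=float)
y = np.array([3100., 2800., 3500., 2600., 3900., 3300., 3000., 3600.])
n = len(x)

sq_errors = []
for i in range(n):
    keep = np.arange(n) != i                 # developmental sample of size n - 1
    b, a = np.polyfit(x[keep], y[keep], 1)
    forecast = a + b * x[i]                  # forecast the omitted observation
    sq_errors.append((y[i] - forecast)**2)

cv_mse = np.mean(sq_errors)
print(f"cross-validation MSE estimate = {cv_mse:.1f}")
```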
Analog Forecasting
To search archives of climatological synoptic data for maps closely resembling the current observations, and assume that the future evolution of the atmosphere will be similar to the flows that followed the historical analog.
The atmosphere is a perfect model of itself, but:
The atmosphere apparently does not ever exactly repeat itself.
You must match all relevant fields (not just heights and temperature, but SST etc.).
Statistics can be used to search for analogs and rank the historical cases by closeness to the current state (a simple distance-based sketch follows).
Take the average outcome of all the good analogs.
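A minimal analog-search sketch: rank historical fields by root-mean-square difference from the current field and average the outcomes that followed the closest analogs (random, purely illustrative fields):

```python
import numpy as np

# Analog forecasting sketch: find historical fields closest (in RMS difference)
# to the current field, then average the outcomes that followed them.
rng = np.random.default_rng(2)
n_hist, n_grid = 200, 50
history = rng.normal(size=(n_hist, n_grid))      # archived synoptic fields (flattened maps)
outcomes = rng.normal(size=n_hist)               # what followed each historical field
current = rng.normal(size=n_grid)                # today's field

rms_dist = np.sqrt(np.mean((history - current)**2, axis=1))
best = np.argsort(rms_dist)[:10]                 # the 10 closest analogs
forecast = outcomes[best].mean()                 # average outcome of the good analogs
print(f"analog forecast = {forecast:.2f}")
```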