SPATIAL ANALYSIS OF PM2.5 DATA


FOUR METHODS OF ESTIMATING
PM2.5 ANNUAL AVERAGES
Yan Liu and Amy Nail
Department of Statistics
North Carolina State University
EPA
Office of Air Quality Planning and Standards
Emissions, Monitoring, and Analysis Division
Project Objectives

• Estimation of the annual average PM2.5 concentration
• Estimation of the standard errors associated with the annual average estimates
• Estimation of the probability that a site's annual average exceeds 15 µg/m³
• All at 2400 lattice points, for 2000 and 2001
• Comparison of 4 different methodologies:
  1. Quarter-based analysis (Yan)
  2. Annual-based analysis (Yan)
  Daily-based analyses:
  3. Doug Nychka's method (Bill)
  4. Generalized least squares in SAS Proc Mixed (Amy)
Why are Standard Errors Important?

We may estimate that the annual average for lattice point 329 is 16 µg/m³, which exceeds the standard of 15. But since our estimate has some uncertainty, or standard error, we'd like to take this uncertainty into account in order to determine the probability that lattice point 329 exceeds 15.

In addition to maps like this [prediction map]...
...we also want maps like this [probability map].

Note: this map is WRONG, so don't show it to anyone! We haven't figured out the correct way to determine the errors, so we cannot correctly draw a probability map yet.
Data Description

• Concentrations of PM2.5 measured during 2000 and 2001
• Domain analyzed: the portion of the U.S. east of −100° longitude
• Concentrations measured every third day
Map of 2400 Lattice Points
Method 1 – Quarterly Analysis

• 3 months in each quarter:
  Q1 (Jan.-Mar.), Q2 (Apr.-Jun.), Q3 (Jul.-Sep.), Q4 (Oct.-Dec.)
• Required 75% completeness within each quarter
• Found the quarterly mean concentration at each site
• For each quarter, kriged the mean concentration over the lattice
• Averaged the four quarterly predictions to get the annual average estimate
Annual Average Predictions
Method 2 – Annual Analysis

• Used the sites common to all 4 quarters of the quarterly analysis
• Found the annual mean concentration at each site
• Kriged the annual mean concentration over the lattice
The Number of Sites

              2000    2001
  Quarter 1    510     631
  Quarter 2    575     642
  Quarter 3    619     682
  Quarter 4    613     666
  Annual       394     517
Model for Quarterly and Annual Analyses

Predicted value = quadratic surface prediction (SP) + error prediction (KP)
Estimating the Quadratic Surface

• Model:
  Conc = β0 + β1·lat + β2·lon + β3·lat² + β4·lon² + β5·lat·lon + ε
• Assume: 1) E(ε) = 0, Var(ε) = σ²I
          2) the betas are estimated by SAS assuming the errors are iid
• Fit the parameters using ordinary least squares in SAS proc reg
• Obtained the surface predictions (SP), their standard errors (SEsp), and the residuals ε̂ (a sketch in code follows below)
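The slides do this step in SAS proc reg; below is a minimal sketch of the same ordinary-least-squares surface fit in Python. The data are simulated stand-ins, and all variable names (lat, lon, conc) are illustrative rather than from the original analysis.

```python
import numpy as np

# Illustrative stand-in data: site coordinates and mean concentrations.
rng = np.random.default_rng(0)
lat = rng.uniform(25.0, 50.0, size=200)      # degrees latitude
lon = rng.uniform(-100.0, -67.0, size=200)   # degrees longitude
conc = 10.0 + 0.1 * lat + rng.normal(0.0, 2.0, size=200)

# Design matrix for the quadratic surface:
# Conc = b0 + b1*lat + b2*lon + b3*lat^2 + b4*lon^2 + b5*lat*lon + error
X = np.column_stack([np.ones_like(lat), lat, lon, lat**2, lon**2, lat * lon])

# Ordinary least squares (what proc reg does under iid errors).
beta, *_ = np.linalg.lstsq(X, conc, rcond=None)

sp = X @ beta        # surface predictions (SP) at the sites
resid = conc - sp    # residuals, to be kriged in the next step

# Standard errors of the fitted surface: SEsp_i = sqrt(s^2 * x_i' (X'X)^-1 x_i)
s2 = resid @ resid / (len(conc) - X.shape[1])
XtX_inv = np.linalg.inv(X.T @ X)
se_sp = np.sqrt(s2 * np.einsum('ij,jk,ik->i', X, XtX_inv, X))
```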
Kriging the Error Surface

• Model: ε = {ε(s) : s ∈ R²}
  E(ε(s)) = 0
  Var(ε(s) − ε(s′)) = 0 if s = s′
  Var(ε(s) − ε(s′)) = σ²n + σ²(1 − e^(−dist/θ)) if s ≠ s′
  where σ²n is the nugget, σ² the partial sill, and θ the range parameter
• Estimated the variogram parameters using nonlinear least squares in Splus (a sketch follows below)
• Obtained the kriging predictions (KP) and their standard errors (SEkp)
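A sketch of the variogram-estimation step in Python rather than Splus: bin the squared residual differences into an empirical semivariogram, then fit the exponential model by nonlinear least squares. The binning choices, starting values, and simulated inputs are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial.distance import pdist

def exp_semivariogram(h, nugget, psill, theta):
    # gamma(h) = sigma_n^2 + sigma^2 * (1 - exp(-dist/theta)), as on the slide
    return nugget + psill * (1.0 - np.exp(-h / theta))

# Illustrative stand-ins: site locations (km) and OLS residuals.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1500.0, size=(150, 2))
resid = rng.normal(0.0, 1.0, size=150)

# Classical empirical semivariogram in distance bins.
d = pdist(coords)                                          # pairwise distances
half_sq = pdist(resid[:, None], metric="sqeuclidean") / 2  # semivariance pairs
edges = np.linspace(0.0, d.max() / 2.0, 16)
which = np.digitize(d, edges)
h = np.array([d[which == b].mean() for b in range(1, 16)])
gamma = np.array([half_sq[which == b].mean() for b in range(1, 16)])

# Nonlinear least squares fit of (nugget, partial sill, range parameter).
(nugget, psill, theta), _ = curve_fit(
    exp_semivariogram, h, gamma, p0=[0.1, 1.0, 200.0], bounds=(0.0, np.inf)
)
```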
Variogram Models

3 commonly used variogram models (standardized to a sill of 1):
• Exponential: γ(h) = 1 − exp(−3h/a)
• Spherical: γ(h) = 1.5·(h/a) − 0.5·(h/a)³ if h ≤ a
             γ(h) = 1 otherwise
• Gaussian: γ(h) = 1 − exp(−3h²/a²)

where h is the distance and a is the range.

[Figure: γ(h) versus h for the spherical, exponential, and Gaussian models, showing the sill and the range a.]
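For reference, the three standardized models transcribed directly into code (the function names are ours):

```python
import numpy as np

def gamma_exponential(h, a):
    # gamma(h) = 1 - exp(-3h/a); effectively reaches the sill near h = a
    return 1.0 - np.exp(-3.0 * h / a)

def gamma_spherical(h, a):
    # gamma(h) = 1.5(h/a) - 0.5(h/a)^3 for h <= a, and 1 beyond the range
    r = np.minimum(h / a, 1.0)
    return 1.5 * r - 0.5 * r**3

def gamma_gaussian(h, a):
    # gamma(h) = 1 - exp(-3h^2/a^2)
    return 1.0 - np.exp(-3.0 * h**2 / a**2)
```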
Cross Validation to Select the Variogram Model

• Idea: temporarily remove the sample value at a particular location, one location at a time, and estimate this value from the remaining data under each of the variogram models.
• Prediction error = observed − predicted
• MSE = (1/(n−1)) · Σ(prediction error)²
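A sketch of this leave-one-out procedure, using a simple ordinary-kriging predictor built from a fitted variogram. The helper functions and inputs (coords, resid, the range value) are illustrative, not from the original analysis; the gamma_* functions are the ones defined above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def ok_predict(coords, values, target, gamma, a):
    """Ordinary kriging of one target point from the remaining points."""
    n = len(values)
    # Kriging system in semivariogram form, bordered for the unbiasedness
    # constraint (the weights must sum to 1).
    G = np.ones((n + 1, n + 1))
    G[:n, :n] = gamma(cdist(coords, coords), a)
    G[n, n] = 0.0
    g = np.ones(n + 1)
    g[:n] = gamma(cdist(coords, target[None, :]), a).ravel()
    w = np.linalg.solve(G, g)
    return w[:n] @ values

def loo_mse(coords, values, gamma, a):
    """Leave-one-out cross-validation MSE for one variogram model."""
    errs = np.empty(len(values))
    for i in range(len(values)):
        keep = np.arange(len(values)) != i
        pred = ok_predict(coords[keep], values[keep], coords[i], gamma, a)
        errs[i] = values[i] - pred        # prediction error = observed - predicted
    return errs @ errs / (len(errs) - 1)  # MSE = sum(error^2) / (n - 1)

# Comparison across models, e.g.:
# mse_exp = loo_mse(coords, resid, gamma_exponential, a=600.0)
# mse_sph = loo_mse(coords, resid, gamma_spherical,  a=600.0)
# mse_gau = loo_mse(coords, resid, gamma_gaussian,   a=600.0)
```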
Cross Validation MSE for Three Variogram Models

[Figure: bar charts of cross-validation MSE for the exponential, Gaussian, and spherical models, for Q1-Q4 and annual.real, 2000 and 2001.]

• The exponential model has the smallest MSE.
• Conclusion: use the exponential model.
Calculating Predicted Annual Averages

• Quarterly averages:
  P_Qi = SP_Qi + KP_Qi
• Annual average from the quarterly analysis:
  P_annual = (Σ i=1..4 P_Qi) / 4
• Annual average from the annual analysis:
  P_annual = SP_annual + KP_annual
Calculation of Standard Errors for Annual Averages

• Standard errors of the quarterly averages:
  SE_Qi = √((SEsp_i)² + (SEkp_i)²)
• Standard error of the annual average from the quarterly analysis:
  SE_annual = √((1/16) · Σ (SE_Qi)²)
• Standard error of the annual average from the annual analysis:
  SE_annual = √((SEsp)² + (SEkp)²)
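Putting the last two slides together: a sketch, with illustrative numbers, of how one lattice point's annual estimate, its standard error, and the exceedance probability from the project objectives would be computed under these formulas (including the independence assumptions the next slides question, and a normal approximation for the exceedance probability).

```python
import numpy as np
from scipy.stats import norm

# Illustrative per-quarter predictions and standard errors at one lattice point.
sp_q = np.array([13.2, 15.1, 16.4, 14.0])   # quarterly surface predictions SP_Qi
kp_q = np.array([0.8, -0.3, 0.5, 0.2])      # quarterly kriging predictions KP_Qi
se_sp = np.array([0.20, 0.25, 0.22, 0.18])  # SEsp_i
se_kp = np.array([0.90, 1.10, 1.00, 0.85])  # SEkp_i

p_q = sp_q + kp_q                            # P_Qi = SP_Qi + KP_Qi
p_annual = p_q.mean()                        # P_annual = (sum of P_Qi) / 4

se_q = np.sqrt(se_sp**2 + se_kp**2)          # SE_Qi
se_annual = np.sqrt((se_q**2).sum() / 16.0)  # SE_annual, quarterly analysis

# Probability that the annual average exceeds 15 ug/m^3, assuming normality:
p_exceed = 1.0 - norm.cdf(15.0, loc=p_annual, scale=se_annual)
```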
Sources of Error

[Figure: prediction standard error split into surface-fitting and kriging components, plus the total, for Q1-Q4, annual.comp, and annual.real, 2000 and 2001.]

• Less than 5% of the total error comes from fitting the quadratic surface.
• The kriging prediction error dominates.
Problems With Quarterly & Annual Analyses

• The surface prediction and the kriging prediction are not independent.
• Var(SP + KP) ≠ Var(SP) + Var(KP)
[Figure: scatterplots of kriging prediction (KP) versus surface prediction (SP) for the annual analyses and for Quarters 1-4 of 2000 and 2001.]
More Problems With Quarterly and Annual Analyses

• Not using all of the available data
• When kriging residuals, the estimated variogram is biased low (Kim and Boos 2002). (This problem could be solved by using generalized least squares.)
• Ignored the standard deviation of the annual and/or quarterly averages in the calculation of the kriging prediction error
• The quarterly averages may not be independent
Methods 3 & 4 – Daily-Based Analyses

• Used the every-third-day data (122 days per year)
• Kriged each day to obtain predictions at the 2400 lattice points
• At each lattice point, fit a time series to the 122 daily estimates to estimate the annual average
• Calculated the time-series error for the annual average using proc arima (a sketch follows below)
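The slides use SAS proc arima for the time-series step; as a sketch under stated assumptions, statsmodels' ARIMA can play the same role at one lattice point. The daily series below is a simulated stand-in for the 122 kriging predictions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulated stand-in for the 122 every-third-day predictions at one point.
rng = np.random.default_rng(2)
y = np.empty(122)
y[0] = 14.0
for k in range(1, 122):  # AR(1)-like series fluctuating around 14
    y[k] = 14.0 + 0.4 * (y[k - 1] - 14.0) + rng.normal(0.0, 2.0)

# Fit an AR(1) with a mean term; the annual-average estimate is the fitted
# mean, and its time-series standard error comes from the same fit.
fit = ARIMA(y, order=(1, 0, 0), trend="c").fit()
annual_mean = fit.params[0]      # the constant (mean) term
se_annual_mean = fit.bse[0]      # its standard error
```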
Method 3 – "Doug's Method"

• Fit a quadratic surface using the Krig function in Splus
• Used an algorithm that minimizes the generalized cross-validation error to estimate all parameters, including both the quadratic surface parameters and the covariance parameters
• Did not assume iid errors when fitting the quadratic surface, so the surface coefficients were estimated based on the covariance structure
• Specified an exponential covariance structure with a nugget
• Provided a fixed value of 200 km for the range parameter for all 122 days
Method 4 – "Amy's Method"

• Fit a quadratic surface using generalized least squares in SAS Proc Mixed
• Used restricted (or residual) maximum likelihood (REML) to estimate all parameters
• Did not assume iid errors when fitting the quadratic surface, so the surface coefficients were estimated based on the covariance structure (a sketch of the GLS step follows below)
• Specified an exponential covariance structure with a nugget
• Estimated each parameter on each day
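The GLS step sketched directly in numpy for a single day: given covariance parameters, build the exponential-plus-nugget covariance matrix V and solve the GLS normal equations. In Method 4 the parameters are estimated by REML in Proc Mixed; here they are simply passed in, and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gls_quadratic_surface(coords, conc, nugget, psill, theta):
    """GLS fit of the quadratic surface under an exponential-plus-nugget
    covariance with given (not REML-estimated) parameters."""
    lat, lon = coords[:, 0], coords[:, 1]
    X = np.column_stack([np.ones_like(lat), lat, lon, lat**2, lon**2, lat * lon])
    d = cdist(coords, coords)
    V = psill * np.exp(-d / theta) + nugget * np.eye(len(conc))
    Vi = np.linalg.inv(V)
    # beta_hat = (X' V^-1 X)^-1 X' V^-1 y, with covariance (X' V^-1 X)^-1
    cov_beta = np.linalg.inv(X.T @ Vi @ X)
    beta = cov_beta @ (X.T @ Vi @ conc)
    return beta, cov_beta
```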
Problems with Doug's Method

• Using the same value of the range parameter every day requires the assumption that the range parameter is constant over time, which is not a valid assumption. Amy's method does not make this assumption.
• Ignored the kriging prediction error in the calculation of the time-series error for the annual average.
Problems with Amy's Method

• REML assumes the data for each day are normally distributed; they aren't. This can be fixed by using a transformation, but one must be careful not to introduce bias in the back-transform. There is an unbiased back-transform predictor and an associated error estimate in Cressie, Section 3.2.2. One must also decide whether to transform each day using the same function. Doug's method does not require the normality assumption.
• Ignored the kriging prediction error in the calculation of the time-series error for the annual average.
What if we "propagate" errors?

At a given lattice point we have 122 days' worth of predictions, each with a kriging prediction error. What if we treat the 122 days as independent observations (they aren't; they are AR(1)) and combine the errors accordingly? We would then do this for each of our 2400 lattice points.
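A sketch of that naive propagation, and of how far off it can be: for the mean of an AR(1) series, the independence formula understates the standard error by roughly a factor of √((1+ρ)/(1−ρ)) for large n. The numbers below are illustrative.

```python
import numpy as np

se_daily = np.full(122, 2.0)  # illustrative kriging SEs for the 122 days

# Naive propagation, treating the 122 daily predictions as independent:
se_mean_indep = np.sqrt((se_daily**2).sum()) / 122

# Under AR(1) dependence with lag-1 correlation rho, the standard error of
# the mean is larger by about sqrt((1 + rho) / (1 - rho)) for large n:
rho = 0.4
se_mean_ar1 = se_mean_indep * np.sqrt((1 + rho) / (1 - rho))
```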
The Big Problem

• None of our standard error estimates are correct!
• They are all underestimates!
• We need to learn how to put the spatial error components together with the temporal error components.
Model for one day

• Y_ij = β0 + β1·i + β2·i² + β3·j + β4·j² + β5·ij + ε_ij
  where i = latitude and j = longitude
• E(ε_ij) = 0
• Cov(ε_ij, ε_i′j′) = σ²n + σ²·e^(−dist/θ) if i = i′ and j = j′
  Cov(ε_ij, ε_i′j′) = σ²·e^(−dist/θ) if i ≠ i′ or j ≠ j′
Model for one site

• Y_k = μ + ρ·(Y_(k−1) − μ) + e_k, k = 1, …, 122
• where E(e_k) = 0 and Var(e_k) = σ²
• Note: this is an AR(1) model. The errors are iid (0, σ²) because the temporal correlation is accounted for by the ρ·(Y_(k−1) − μ) term.
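Since the annual average at a site is the mean of this AR(1) series, its time-series variance follows from the model; for large n (a standard result, not spelled out on the slide):

```latex
\operatorname{Var}(\bar{Y})
  \approx \frac{\operatorname{Var}(Y_k)}{n}\cdot\frac{1+\rho}{1-\rho}
  = \frac{\sigma^2}{n\,(1-\rho)^2},
\qquad \operatorname{Var}(Y_k) = \frac{\sigma^2}{1-\rho^2},
\qquad n = 122.
```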
Model for all sites and days?

• Y_ijk = β0,k + β1,k·i + β2,k·i² + β3,k·j + β4,k·j² + β5,k·ij + ε_ijk + e_ijk
  where E(ε_ijk) = 0 and E(e_ijk) = 0
• We've assumed isotropy and stationarity for simplicity.
• But how do we model Cov(ε_ijk, ε_i′j′k′), Cov(e_ijk, e_i′j′k′), and Cov(ε_ijk, e_i′j′k′)?
Separability

We've been treating the covariance structure as separable, meaning that the 1-D temporal and 2-D spatial covariance structures can be estimated separately and then mathematically combined to obtain a 3-D space-time covariance structure (see the sketch below). We need to test for separability, and if the covariance components are separable, we need to combine them appropriately. We are just now learning how to do this.
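What "mathematically combined" means here, sketched for small illustrative matrices: under separability, the space-time covariance is the Kronecker product of a spatial covariance matrix and a temporal one. All parameter values below are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Temporal piece: AR(1) covariance for n_t days.
rho, sigma2_t, n_t = 0.4, 1.0, 5
k = np.arange(n_t)
C_time = sigma2_t * rho ** np.abs(k[:, None] - k[None, :])

# Spatial piece: exponential covariance for a few sites (km coordinates).
coords = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 200.0]])
theta, sigma2_s = 200.0, 1.0
C_space = sigma2_s * np.exp(-cdist(coords, coords) / theta)

# Separability: Cov((s, t), (s', t')) = C_space[s, s'] * C_time[t, t'],
# so the full (n_s * n_t) x (n_s * n_t) covariance is a Kronecker product.
C_spacetime = np.kron(C_space, C_time)
```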
Next Steps…

• Re-do the quarterly and annual analyses using generalized least squares
• Perform Amy's analysis using transformations, making sure to use an unbiased estimator in the back-transform and the appropriate error estimator. How much does the lack of normality in the original analysis affect the results?
More next steps…

• Investigate the separability of the covariance structure and the correct method for combining the space and time covariance components.
• Attempt a 3-dimensional kriging. No assumption of separability is required to do this. We must, however, write our own code for this project, because there is no software package (to our knowledge) that performs such an analysis. This method would also allow us to use even more data than we are using now, as we would not be restricted to every third day.
That’s all, folks!