Incorporating spatial information in the standard

Lecture 22
Spatial Modelling 1 : Incorporating
spatial modelling in a random
effects structure
Lecture Contents
• Introduction to spatial modelling
• Nested random effect levels
• House price dataset
• Including distance as a fixed effect
• Direction effects
• Focused clustering (Falkirk dataset)
Spatial statistical modelling
• Here we require a statistical approach that
accounts for the spatial location at which a
response is collected. This means that the
model that is fitted to the data needs to account
for the spatial effects.
• This may be to account for any effects due to
location in the model, or to predict values of the
response at other locations via some form of
interpolation that accounts for other predictor
variables, the spatial location, or both.
Types of spatial data
There are many forms of spatial data but we can
broadly divide these into three types: (Cressie
1993)
1. Geostatistical data – here measurements are
taken at a fixed number of chosen locations in
a geographical area.
2. Lattice data – here measurements are taken on
a regular lattice, with one measurement
collected at each point of the lattice.
3. Point process data – here each observation is
the location of a response and its co-ordinates
are also recorded.
Geostatistical data
Such data are collected in various fields,
particularly mining and earth sciences.
A measurement, e.g. percentage coal ash, is taken
at each of a number of locations.
Methods such as variograms and spatial Kriging
are used to analyse such data.
Other application areas include weather maps and
agricultural field trials.
Note such data is not ideally suited to standard
random effect modelling.
Disease mapping
One particular type of spatial modelling that is
often linked with random effect modelling is
disease mapping.
Here cases of a disease (either human or animal)
are observed over a chosen region e.g. a
country. We then wish to infer the relative risk of
the disease for a particular individual at a
particular location based on the data collected.
Both our practicals this afternoon will consider
disease mapping datasets. The other two types
of spatial data (lattice and point process) both
arise with disease data.
Lattice Data
Such data is common in many fields, for example
image analysis where the pixels in an image are
found on a regular rectangular lattice.
More importantly we will consider disease count
data where counts of a disease are recorded for
contiguous regions on a map.
Although a map is not regular we can construct a
lattice from a map by identifying neighbouring
regions and linking neighbouring regions to form
a lattice.
Example
Here we see a map of 5 regions in the left hand
picture, and on the right it has been converted to a
lattice with connections between regions that share
boundaries.
→
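A minimal sketch of this construction, assuming a hypothetical map of 5 regions labelled A to E. The shared boundaries listed are made up for illustration, since the borders in the picture are not recoverable from the text.

```python
# Hypothetical pairs of regions that share a boundary (illustrative only).
shared_boundaries = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_lattice(pairs):
    """Return a dict mapping each region to the sorted list of its neighbours."""
    neighbours = {}
    for a, b in pairs:
        # A shared boundary links the two regions in both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {region: sorted(adj) for region, adj in sorted(neighbours.items())}


lattice = build_lattice(shared_boundaries)
for region, adj in lattice.items():
    print(region, "->", adj)
```

Each region becomes a node of the lattice and each shared boundary becomes a link, so the irregular shapes of the regions no longer matter, only which regions are neighbours.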
Point process data
This data is also commonly found in disease
mapping, although it may arise in any
application where cases of an event are seen
at particular locations.
Each item of data consists of the location of an
event, the response (type of event) and
potentially predictor variables for the event.
Note Rasmus has worked more extensively in this
area and will be happy to answer questions
here. 
Disease point process modelling
In disease mapping our data is typically binary i.e. people
are infected (or die from) a disease or are not.
The data occur in point process form but there are 2
problems with analysing them as a point process:
1. All our responses are 1 as we only observe the
infected/dead people!
2. Due to confidentiality and the sensitive nature of
medical data, the data often cannot be released as
individual records.
To counter point 1 we could sample control cases at
random from the population. However, point 2 means
that we typically total up cases for fixed areas and use
a Poisson model on the lattice data that this creates.
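The totalling-up step can be sketched as follows. The area labels, case records and expected counts are all hypothetical; the point is only that individual case locations are reduced to one count per fixed area, including zeros for areas with no cases.

```python
from collections import Counter

# Hypothetical case records: each case is tagged with the area it falls in.
case_areas = ["area1", "area3", "area1", "area2", "area1", "area3"]

# Hypothetical expected counts per area (e.g. derived from population size).
expected = {"area1": 2.0, "area2": 1.5, "area3": 2.5, "area4": 1.0}


def area_counts(cases, areas):
    """Total the cases in each area, keeping areas with no cases as zero."""
    counts = Counter(cases)
    return {area: counts.get(area, 0) for area in areas}


observed = area_counts(case_areas, expected)
for area in expected:
    print(area, observed[area], expected[area])
```

The observed count for each area is then the response in the Poisson model, with the expected count usually entering as an offset.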
Why might there be spatial effects?
This depends on the response variable and
application area.
It is possible that geography is itself a predictor for
our response or is a surrogate for other factors.
Many factors can be linked to location e.g.
weather, deprivation, altitude, pollution, wealth
which might influence the response.
So if our response is influenced by any of these
factors then accounting for spatial effects may
improve our model.
Nested random effects/ levels of
geography
The simplest link to random effect models is to
consider nested random effects.
We have considered pupils nested in schools and
cows nested in herds.
In some sense the schools and herds are spatial
units in that schools generally take children from
their locality and a herd is based on a particular
farm. However we could also fit where the pupils
live as another classification of the data which is
more spatial.
On the next slide we consider a dataset with more
levels of geography.
UK house prices dataset
An MMath student of mine (David Goodacre)
studied a dataset of house prices in the UK. The
data supplied by the Nationwide building society
consists of average house prices in areas of the
UK over a 12 year period (1992-2003). The data
is for 753 towns in the UK and there are 3 levels
of geography (towns nested in counties nested
in regions.)
Note that if we had individual house sale
information then we could have considered point
process approaches but here we consider
random effect modelling.
A 4-level VC model for the house
price dataset
The following model was fitted to the data
$$y_{ijkl} = \beta_0 + \beta_1\,\text{year}_i + \beta_2\,\text{year}_i^2 + f_l + v_{kl} + u_{jkl} + e_{ijkl}$$
$$f_l \sim N(0, \sigma_f^2), \quad v_{kl} \sim N(0, \sigma_v^2), \quad u_{jkl} \sim N(0, \sigma_u^2), \quad e_{ijkl} \sim N(0, \sigma_e^2)$$
where i indexes year, j indexes town, k indexes county
and l indexes region. The response, y, is the log of the
average price.
This model can be fitted using both frequentist and
likelihood methods in packages that allow four levels in
the model.
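To make the nesting concrete, here is a small simulation sketch of data with this structure. All numbers (the coefficients, the standard deviations, the group sizes and the coding of year as 0 to 11) are illustrative assumptions, not the fitted values or the real dataset.

```python
import random


def simulate(n_regions=3, counties_per_region=2, towns_per_county=4, n_years=12,
             beta=(4.0, -0.02, 0.01), sd=(0.2, 0.12, 0.2, 0.1), seed=1):
    """Simulate log prices from a 4-level nested variance components model.

    beta = (beta0, beta1, beta2); sd = (sigma_f, sigma_v, sigma_u, sigma_e).
    Every value here is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for l in range(n_regions):
        f = rng.gauss(0, sd[0])              # region effect, shared by all towns in region l
        for k in range(counties_per_region):
            v = rng.gauss(0, sd[1])          # county effect, nested in region
            for j in range(towns_per_county):
                u = rng.gauss(0, sd[2])      # town effect, nested in county
                for i in range(n_years):
                    e = rng.gauss(0, sd[3])  # year-to-year residual
                    y = beta[0] + beta[1] * i + beta[2] * i ** 2 + f + v + u + e
                    rows.append((l, k, j, i, y))
    return rows


data = simulate()
print(len(data))  # 3 regions x 2 counties x 4 towns x 12 years = 288 rows
```

Each random effect is drawn once per unit and then reused for everything nested beneath it, which is what induces the correlation between towns in the same county and between counties in the same region.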
Links with other topics
It is worth noting that this house price
dataset is a repeated measures dataset as
you considered yesterday.
It also contains missing data: any year in
which there were fewer than 50 sales in a
postal town leads to a missing observation.
Here, however, we assume the data are missing
at random (MAR) conditional on the model we
are fitting.
Estimates for house price dataset
Below are given IGLS estimates for the model:
Parameter   Estimate (SE)
β0          4.036 (0.067)
β1          -0.020 (0.002)
β2          0.009 (0.0001)
σ²f         0.045 (0.021)
σ²v         0.016 (0.004)
σ²u         0.045 (0.003)
σ²e         0.013 (0.0002)
Here we see that the model consists of parallel
curves with both year and year² very significant.
The variance is greatest between regions and
between postal towns.
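As a quick check of the statement about the variances, the share of the total variance at each level can be computed directly from the four variance estimates given above. The variable names are my own.

```python
# Variance estimates taken from the IGLS results above.
variances = {"region": 0.045, "county": 0.016, "town": 0.045, "residual": 0.013}

total = sum(variances.values())
shares = {level: v / total for level, v in variances.items()}

for level, share in shares.items():
    print(f"{level:8s} {share:.3f}")
```

Regions and postal towns each account for roughly 38% of the total variance, counties for about 13% and the residual year-to-year variation for about 11%.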
Region Level Effects
Here we see that the south east of the UK and
London are the most expensive whilst Scotland,
the North and Wales are the cheapest.
[Figure: map of values for pred, with north arrow and 200.0km scale bar. Legend bands (number of areas): < -0.225 (5); -0.225 to -0.1 (21); -0.1 to 0.0 (9); 0.0 to 0.1 (8); 0.1 to 0.2 (7); 0.2 to 0.35 (1); >= 0.35 (12).]
County level effects
After accounting for regions the pattern of
county effects is more sporadic. We can
however pick up 2 counties, Cheshire in the
North West and Surrey in the South East, that
are more expensive than their neighbours.
[Figure: map of values for u3, with north arrow and 200.0km scale bar. Legend bands (number of areas): < -0.1 (8); -0.1 to 0.0 (24); 0.0 to 0.1 (21); 0.1 to 0.2 (8); >= 0.2 (2).]
Region level predictions
Here we see a graph of
region level
predictions:
Further Modelling
In his project Dave looked at random slopes
models at the various levels of the model,
so that we could pick out whether the
increase in prices was different in different
regions.
He also looked at fitting models of a more
spatial nature! See next lecture.
Why are spatial effects different?
The main difference with spatial effects is that we have
additional information about each (spatial) unit.
For example if we observe the average house price of a
town in Grampian, a town in Surrey and 2 towns in
Berkshire then we know something of the spatial relation
of these towns.
We might expect the prices in the 2 towns in Berkshire to
be similar and to be more similar to Surrey which is also
in the South East than Grampian that is in Scotland.
In our current models we will fit an effect for Berkshire
which will capture some of the relationship between its 2
towns and a South East effect that will capture the link
with the Surrey town.
Problems with the nested
classification approach
As we have seen the nested classification approach can
capture much of the spatial variability however we have
to decide on the geographic definitions of areas.
We generally use easily available definitions e.g. county
and region but there is no guarantee that these are the
best classifications.
We also have the problem of border effects, for example
two towns on either side of a region border will not share
either region or county effects but may have very similar
prices.
We will look at another approach here before studying
more complex spatial approaches in the next lecture.
Including location in fixed effects
It may be the case that there is a trend in the response;
for example, house prices in the UK generally fall as we
move North and West. We could therefore add in two
(fixed effect) predictors giving the N/S and E/W
co-ordinates of each point.
If the unit of observation is an area e.g. postal town we
would generally use the co-ordinates of the centroid of
the unit.
If a linear relationship is not sensible then we could
consider polynomial terms in each direction. For
example (excluding random effects)
$$y_i = \beta_0 + \beta_1\,\text{North}_i + \beta_2\,\text{North}_i^2 + \beta_3\,\text{West}_i + e_i$$
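A sketch of how the fixed-effect predictors for such a model could be built from centroid co-ordinates. The co-ordinates and their units are hypothetical, and `trend_row` is an illustrative helper rather than part of any package.

```python
def trend_row(north, west):
    """Predictors for one unit: intercept, North, North squared, West."""
    return [1.0, north, north ** 2, west]


# Hypothetical centroid co-ordinates (north, west), e.g. in hundreds of km.
centroids = [(1.0, 0.5), (2.0, 1.5), (3.5, 0.2)]

X = [trend_row(n, w) for n, w in centroids]
for row in X:
    print(row)
```

The quadratic term is simply an extra column computed from the same co-ordinate, so the model remains linear in its coefficients and can be fitted in the usual way.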
Distance effects
Another possibility in terms
of UK house prices is to
consider the distance
from London. This
distance can be
constructed from the co-ordinates of each point.
The graph to the left
gives the combined
region and county effects
and suggests a distance
from London effect might
be appropriate.
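[Figure: map of values for pred (combined region and county effects), with north arrow and 200.0km scale bar. Legend bands (number of areas): < -0.25 (10); -0.25 to 0.0 (24); 0.0 to 0.25 (19); 0.25 to 0.5 (6); >= 0.5 (4).]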
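Constructing the distance predictor is straightforward once each point has planar co-ordinates. In this sketch the reference point for London is an approximate, assumed (easting, northing) pair in km, and the function name is illustrative.

```python
import math

# Assumed reference co-ordinates for London (easting, northing) in km.
LONDON = (530.0, 180.0)


def distance_from(point, origin=LONDON):
    """Straight-line distance between a centroid and the origin, in km."""
    return math.hypot(point[0] - origin[0], point[1] - origin[1])


# Hypothetical town centroid 3 km east and 4 km north of the reference point.
print(distance_from((533.0, 184.0)))
```

The resulting distance can then be added as a fixed-effect predictor in the same way as the N/S and E/W co-ordinates.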
Distance and direction effects
In some scenarios the direction as well as the
distance from a particular point is important.
This is not the case with house prices. In
pollution data, however, direction can be very
important: a dominant wind direction means
that particular directions away from the
source will experience more pollution than
others.
We will next look at a dataset from Falkirk in
Scotland that is analysed in Lawson, Browne &
Vidal Rodeiro (2003).
Focused Clustering
One research area in public health looks at the impact of
sources of pollution on the health status of communities.
The detection of patterns of health events associated
with pollution sources is known as focused clustering.
The statistical modelling involved usually relates to the
point process nature of such data.
Lawson, Browne & Vidal Rodeiro (2003) devote a whole
chapter to Focused clustering and include some fairly
complex models that can be considered in WinBUGS.
Here we will look at some simpler models that can be
fitted in MLwiN to a dataset from Falkirk in Scotland.
Respiratory cancer in Falkirk
The figure to the right shows
the census geographies of
26 regions found around a
foundry (marked by *) in
Falkirk, Scotland. It is
thought conceivable that
the foundry was an air
pollution hazard in the early
1970s prior to the study.
This could have an impact
on the respiratory cancer
experience of those living in
the areas close to the
foundry.
Falkirk dataset
The data consists of observed and expected
counts of respiratory cancer cases in the
time period 1978-1983.
We first compare the standardized mortality
ratios (SMRs), defined as observed/expected,
against the locations of the centroids of the 26
areas in Falkirk (relative to the foundry) to
look for patterns.
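The SMR calculation itself is a simple division per area. The counts below are hypothetical and are not the Falkirk data.

```python
# Hypothetical observed and expected counts for four areas.
observed = [4, 2, 7, 3]
expected = [2.5, 4.0, 5.0, 3.0]


def smr(obs, exp):
    """SMR for one area: observed count divided by expected count."""
    return obs / exp


smrs = [smr(o, e) for o, e in zip(observed, expected)]
for o, e, s in zip(observed, expected, smrs):
    print(o, e, round(s, 2))
```

An SMR above 1 means more cases were observed than expected in that area, and an SMR below 1 means fewer.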
Position of the sites
Note in the graphs to
the right that the 3
highest SMRs are
close to the source
both in the N/S and
E/W directions.
We can convert these
locations to distance
and direction
measures.
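A sketch of this conversion, assuming each centroid is given as (east, north) offsets from the foundry, so that the source sits at the origin. Direction is expressed here as a compass bearing, which is one convention among several.

```python
import math


def polar(east, north):
    """Distance and compass bearing (degrees clockwise from north) from the source."""
    distance = math.hypot(east, north)
    # atan2(east, north) measures the angle from the north axis, clockwise.
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    return distance, bearing


# Hypothetical centroid offsets (east, north) relative to the source.
for offset in [(0.0, 1.0), (1.0, 0.0), (-1.0, 0.0), (3.0, 4.0)]:
    print(offset, polar(*offset))
```

Because a bearing wraps around at 360 degrees, direction is often entered into a model through the sine and cosine of the angle rather than the angle itself; whether that was done in this analysis is not stated here.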
Distance and direction
Here we see that there
appears to be a
negative relationship
between distance and
SMR but no obvious
pattern with regard to
direction.
(Extra) Poisson modelling
We have modelled the effects of deprivation, distance
and direction in the following Poisson model:
Note that we have used 1st order MQL in MLwiN and
allowed extra-Poisson variation. The fit shows less
variation than a Poisson distribution would imply
(under-dispersion), so we will also try fitting SMR
as a Normally distributed response.
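As a sketch only, a Poisson model of this kind with extra-Poisson variation is commonly written in the form below. The notation is assumed rather than taken from the lecture: O_i and E_i are the observed and expected counts in area i, dep, dist and dir stand for the deprivation, distance and direction predictors, and φ is the scale parameter.

```latex
O_i \sim \text{Poisson}(\mu_i), \qquad
\log \mu_i = \log E_i + \beta_0 + \beta_1\,\text{dep}_i + \beta_2\,\text{dist}_i + \beta_3\,\text{dir}_i, \qquad
\operatorname{Var}(O_i \mid \mu_i) = \phi\,\mu_i
```

Under this form φ = 1 gives ordinary Poisson variation, so an estimated scale below 1 corresponds to less variation than a Poisson distribution.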
Normal response model for SMR
Here we see that none of the predictors has a significant
effect, which is probably because the dataset is so
small.
We do see, however, that the estimated risk reduces as
distance from the foundry increases and as the
deprivation score increases (suggesting higher rates
in less deprived areas, but not significantly).
Information for the practical
In the practical we will return to using nested
random effects to account for spatial effects.
Our data is from the European Community and
consists of male deaths from malignant
melanoma in 9 countries in the EU.
The practical is a (modified) chapter from Browne
(2003) and looks at MCMC methods for this
dataset. It is also analysed using quasi-likelihood
methods in the MLwiN User's Guide and you are
welcome to also try these methods.