Exploring Data for Regression Modeling


Best Model
Dylan Loudon
Linear Regression Results
Erin Alvey
Who will you trust?
• Field technicians?
• Software programmers?
• Statisticians?
• Instructors?
• GIS technicians?
• Other researchers?
• Yourself?
Regression (Correlation) Modeling
• Creates a model in N-Dimensional
“Hyper-Space”
• Defined by:
– Covariates
– Response variables
– Mathematics used to create the model
– Statistics used to optimize parameters
– Options for model evaluation
– Predictor variables
Multiple Linear Regression
Linear Regression:
2 Predictors
Mathworks.com
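A linear regression with two predictors fits a plane, y = b0 + b1*x1 + b2*x2, to the data. The sketch below shows one minimal way to estimate the coefficients with the normal equations. The function names and the data are hypothetical; the data were generated from y = 1 + 2*x1 + 3*x2 so the fit can be checked against known values.

```python
# Minimal sketch: ordinary least squares with two predictors.
# Data are hypothetical, generated from y = 1 + 2*x1 + 3*x2.

def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_two_predictors(x1, x2, y):
    """Return [b0, b1, b2] from the normal equations (X'X) b = X'y."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefficients = fit_two_predictors(x1, x2, y)
print(coefficients)
```

A statistics package would normally do this fit and also report residuals and standard errors; the hand-written solver is only there to keep the sketch self-contained.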
Non-Linear Regression
Regression Methods
• Continuous Regression:
– Linear Regression
– Generalized Linear Models (GLM)
– Generalized Additive Models (GAMs)
• Categorical Regression (trees):
– Regression Trees
– Classification and regression trees (CART)
• Machine Learning:
– Maximum Entropy (Maxent)
– NPMR, HEMI, BRTs, etc.
Brown Shrimp Size
• Add graph from work
Terminology
• Plant uses:
– Measured value and response variable
– Explanatory variable
• I prefer:
– Response variable
– I’ll use “measured value” to identify
measured values in field data
– Covariate: Explanatory variable used to
build the model
– Predictor: Explanatory variable used to
predict
Douglas Fir Habitat Model
[Figure: Habitat Quality (0 to 1) vs. Precipitation (0 to 1000 mm)]
[Diagram: Predictor, Model, Prediction]
[Diagram: Field Data, Covariate, Model Selection and Parameter Estimation, Model, Predictor, Prediction]
[Diagram: Field or Sample Data, Covariate, Model Selection and Parameter Estimation, Model, Predictor, Model Validation, Prediction]
Douglas-Fir sample data
Lat        Lon          F3  MeanTemp  Precip
40.893634  -121.802272  41  69        1070
40.987702  -122.117088  45  96        1406
40.987702  -122.117088  40  96        1406
40.987702  -122.117088  43  96        1406
40.987702  -122.117088  42  96        1406
40.987702  -122.117088  46  96        1406
Create the Model
[Diagram labels: Model “Parameters”, Precip, Extract To Points, Text File, Prediction, To Raster]
Text File (Attributes: MeanTemp, Precip):
X         Y         MeanTemp  Precip  Predict
-123.677  41.61906  71        1548    193.6
-123.344  41.61906  55        1212    150.4
-123.011  41.61906  79        887     187.5667
-122.677  41.61906  68        584     155.4667
-122.344  41.61906  102       513     221.1
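The prediction step applies the model "parameters" to the predictor values extracted at each point. As a sketch, the Predict values on this slide are reproduced by 2 * MeanTemp + Precip / 30. Those two coefficients are an assumption inferred from the tabulated numbers; the slides do not state them, and the function name is hypothetical.

```python
# Rows copied from the slide: (x, y, mean_temp, precip, predict_shown).
rows = [
    (-123.677, 41.61906, 71, 1548, 193.6),
    (-123.344, 41.61906, 55, 1212, 150.4),
    (-123.011, 41.61906, 79, 887, 187.5667),
    (-122.677, 41.61906, 68, 584, 155.4667),
    (-122.344, 41.61906, 102, 513, 221.1),
]

# ASSUMPTION: coefficients inferred from the table, not given in the slides.
TEMP_COEFFICIENT = 2.0
PRECIP_COEFFICIENT = 1.0 / 30.0

def predict(mean_temp, precip):
    """Apply the assumed linear model to one set of predictor values."""
    return TEMP_COEFFICIENT * mean_temp + PRECIP_COEFFICIENT * precip

# Largest gap between the recomputed and the tabulated predictions.
max_difference = max(abs(predict(t, p) - shown) for _, _, t, p, shown in rows)
print(max_difference)
```

Predicting "to raster" is the same calculation repeated for every pixel of the predictor layers instead of for a list of points.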
Data
• Response Variable
– From the field data (sample data)
• Covariates
– From the field or remotely sensed
• Predictors
– Typically remotely sensed
– Sample as covariates for training
– Can be different for predicting to new
scenarios
Response Variable
• What is the:
– Spatial uncertainty?
– Temporal uncertainty?
– Measurement uncertainty?
• Will it answer your question?
Covariate Variables
• What is the:
– Spatial uncertainty?
– Temporal uncertainty?
– Measurement uncertainty?
• How well does the collection time of the
covariates match the field data?
• Do they co-vary with the phenomena?
• Do the covariates “correlate”?
Types of uncertainty
• Accuracy (bias)
• Precision (repeatability)
• Reliability (consistency of a set of
measurements)
• Resolution (fineness of detail)
• Logical consistency
– Adherence to structural rules, attributes,
and relationships
• Completeness
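Accuracy and precision can be separated when the same quantity is measured repeatedly against a known reference. The sketch below is a minimal illustration; the readings, the 100.0 reference value, and the function name are hypothetical.

```python
from statistics import mean, stdev

def accuracy_and_precision(measurements, reference):
    """Return (bias, spread) for repeated measurements of a known value."""
    bias = mean(measurements) - reference  # accuracy: systematic offset
    spread = stdev(measurements)           # precision: repeatability
    return bias, spread

# Hypothetical repeated readings of a 100.0 m baseline.
readings = [101.9, 102.1, 102.0, 101.8, 102.2]
bias, spread = accuracy_and_precision(readings, 100.0)
print(bias, spread)
```

Here the readings agree closely with each other (good precision) but sit about 2 m above the reference (poor accuracy).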
Types of Errors
• Gross errors
– Transcription
– Sinks in DEMs
• Random
– Estimated using probability theory
• Systematic errors
– “Drift” in instruments
– Dropped lines in Landsat
Gross Errors
• Lat/Lon:
– Reversed
– 0, names, dates, etc.
• Dates:
– Extended in databases
• Measurements:
– Inconsistent units
– Inconsistent protocols
– What can you expect from a field team?
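Several of these gross errors can be screened for automatically before modeling. The sketch below flags a few of the coordinate problems listed above; the function name and the flag wording are hypothetical, and the checks are deliberately simple.

```python
def check_coordinate(lat, lon):
    """Return a list of gross-error flags for one latitude/longitude pair."""
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        return ["non-numeric"]  # names, dates, etc. in a coordinate field
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero coordinates")
    if abs(lat) > 90 and abs(lon) <= 90:
        flags.append("possibly reversed")  # swapping would give a valid pair
    elif abs(lat) > 90 or abs(lon) > 180:
        flags.append("out of range")
    return flags

print(check_coordinate(40.893634, -121.802272))
print(check_coordinate(-121.802272, 40.893634))
print(check_coordinate(0, 0))
print(check_coordinate("site A", "2011-05-01"))
```

A reversed pair is only detectable this way when the longitude is larger than 90 degrees in magnitude; otherwise a map is still the best check.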
Occurrences of Polar Bears
From The Global Biodiversity Information Facility (www.gbif.org, 2011)
Systematic Errors
Landsat Scan line Error
Response Variable Qualification Tools
• Maps (various resolutions)
• Examine the data values:
– How many digits?
– Repeating patterns, gross errors?
• “Documentation”
• Measurements:
– Occurrences?
– Binary: Histogram
– Categorical: Histogram
– Continuous: Histogram
What’s the Impact on Models?
Significant Digits
• How many digits to represent 1 meter?
– Geographic: Lat/Lon?
– UTM: Eastings/Northings?
Significant Digits
• Geographic:
– 1 digit = 1 degree
– 1 degree ~ 110 km
– 0.00001 ~ 1.1 meters
• UTM:
– 1 digit = 1 meter
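The arithmetic above can be written out as a short sketch that converts the number of decimal places in a geographic coordinate into an approximate ground distance. It uses the slide's figure of roughly 110 km per degree; the function names are hypothetical.

```python
METERS_PER_DEGREE = 110_000  # approximate, from "1 degree ~ 110 km"

def ground_resolution_m(places):
    """Approximate ground distance represented by the last decimal place."""
    return METERS_PER_DEGREE * 10 ** -places

def count_decimal_places(text):
    """Count the digits after the decimal point in a coordinate string."""
    return len(text.split(".")[1]) if "." in text else 0

for value in ["41", "-123.677", "40.893634"]:
    places = count_decimal_places(value)
    print(value, places, ground_resolution_m(places))
```

The 110 km figure holds for latitude; a degree of longitude covers less ground away from the equator, so this is an upper bound for longitude.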
Covariate Qualification
• Maps
• Documentation
• Examine the data:
– How many digits?
• Integer or floating point?
– Repeating patterns?
• Histograms
CONUS Annual Precip.
Covariate Uncertainty
[Histogram: Min Temp of Coldest Month; x-axis: Degrees C Times 10 (-231 to 102); y-axis: Number of Pixels Scaled to 1 (0.00 to 1.20)]
Min Temp: Environment
[Histogram: Min Temp of Coldest Month; x-axis: Degrees C Times 10 (-230 to 103); y-axis: Number of Occurrences Scaled to 1 (0.00 to 1.20)]
Histograms
hist(Temp,breaks=400)
Covariate Correlation
• Correlation Plots
• Pearson product-moment correlation
coefficient
• Spearman’s rho – nonparametric rank
correlation coefficient
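Both coefficients can be computed in a few lines. The sketch below is a minimal implementation with hypothetical function names: Pearson's coefficient on the raw values, and Spearman's rho as Pearson's coefficient on the ranks. The example values are the MeanTemp and Precip columns from the prediction table earlier in these slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

mean_temp = [71, 55, 79, 68, 102]
precip = [1548, 1212, 887, 584, 513]
print(pearson(mean_temp, precip), spearman(mean_temp, precip))
```

Because Spearman's rho uses only the ordering of the values, it is less sensitive to outliers and to monotonic but non-linear relationships.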
Correlation plots
California Correlations
California Predictors
Response vs. Covariates
• For Occurrences:
– Histogram covariates at occurrences vs.
overall covariates
• For Binary Data:
– Histogram covariates for each value
• For Categorical Data:
– Histogram covariates for each value
– Or scatter plots
• For Continuous Data:
– Scatter plots
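For occurrence data, the comparison comes down to binning the covariate twice with the same bins: once for the whole study area and once for the occurrence locations only. The sketch below is one minimal way to do that. It assumes "scaled to 1" means dividing by the tallest bin; the function name and the precipitation values are hypothetical.

```python
def scaled_histogram(values, low, high, bin_count):
    """Count values in equal-width bins and scale so the tallest bin is 1."""
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for value in values:
        if low <= value <= high:
            # Values equal to `high` go into the last bin.
            index = min(int((value - low) / width), bin_count - 1)
            counts[index] += 1
    peak = max(counts)
    return [count / peak if peak else 0.0 for count in counts]

# Hypothetical precipitation values (mm).
all_pixels = [200, 300, 350, 500, 600, 650, 900, 1200, 1400, 1500]
occurrences = [900, 1200, 1400, 1400, 1500]

overall = scaled_histogram(all_pixels, 0, 1600, 4)
at_occurrences = scaled_histogram(occurrences, 0, 1600, 4)
print(overall)
print(at_occurrences)
```

Bins where the occurrence histogram is high relative to the overall histogram suggest conditions the species favors; bins that are empty in both say nothing either way.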
Covariate Occurrence Histograms
Precipitation with Douglas-Fir Occurrences
Douglas Fir Model In HEMI 2
Green: Histogram of all of California
Red: Histogram of Douglas-Fir Occurrences
Doug-Fir Height vs. Precip.
Douglas Fir Height
Terrestrial Predictors
• Elevation:
– Slope
– Aspect
– Absolute Aspect
• Distance to:
– Roads
– Streams (streamline)
• Climate
– Precip
– Temp
• Soil Type
• RS:
– Landsat
– MODIS
– NDVI, etc.
Marine Predictors
• Temp
• DO2
• Salinity
• Depth
• Rugosity (roughness)
• Current (at depths)
• Wind
More Complicated
• Associated species
• Trophic levels
• Temporal
• Cyclical
Predictor Layers
• Means, mins, maxes
• Range of values
• Heterogeneity
• Spatial layers:
– Distance to…
– Topography: elevation, slope, aspect
Field Data and Predictors
• As close to field measurements as
possible
• Clean and aggregate data as needed
– Documenting as you go
• Estimate overall uncertainty
• Answer the question:
– What spatial, temporal, and measurement
scales are appropriate to model at given the
data?
Temporal Issues
• Divide data into months, seasons, years,
decades.
– Consistent between predictors and
response
• Extract predictors as close to sample
location and dates as possible
• Use the “best” predictor layers
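Both ideas, binning samples at a consistent temporal scale and picking the predictor layer closest in time, can be sketched as below. The function names, dates, and layer list are hypothetical. The season table assumes Northern Hemisphere meteorological seasons and keeps December with its calendar year.

```python
from datetime import date

# ASSUMPTION: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def temporal_bin(sample_date, scale):
    """Return a grouping key for a date at the requested temporal scale."""
    if scale == "month":
        return (sample_date.year, sample_date.month)
    if scale == "season":
        return (sample_date.year, SEASON_BY_MONTH[sample_date.month])
    if scale == "year":
        return (sample_date.year,)
    if scale == "decade":
        return (sample_date.year // 10 * 10,)
    raise ValueError(f"unknown scale: {scale}")

def closest_layer_date(sample_date, layer_dates):
    """Pick the predictor layer date nearest to the sample date."""
    return min(layer_dates, key=lambda d: abs((d - sample_date).days))

sample = date(2011, 7, 14)  # hypothetical field sample date
layers = [date(2011, 1, 1), date(2011, 6, 1), date(2012, 1, 1)]
print(temporal_bin(sample, "season"), closest_layer_date(sample, layers))
```

Using the same `temporal_bin` function for the response and for the predictors is one way to keep the two consistent.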
Additional Slides
Dimensions of uncertainty
• Space
• Time
• Attribute
• Scale
• Relationships
Basic Tools
• Histograms: What is the distribution of
occurrences of values (range and shape)
• Scattergrams: What is the relationship
between response and predictor
variables and between predictor
variables
• QQPlots: Are the residuals normally
distributed?
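A QQ plot pairs each sorted residual with the value a normal distribution would give at the same cumulative position; points near a straight line suggest approximately normal residuals. The sketch below computes those pairs without plotting them. The function name and residuals are hypothetical, and the plotting position (i + 0.5) / n is one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical quantile, observed residual) pairs."""
    n = len(residuals)
    fitted = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    # ASSUMPTION: plotting position (i + 0.5) / n for the i-th sorted value.
    return [(fitted.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]

# Hypothetical model residuals.
residuals = [0.3, -1.2, 0.1, 1.2, -0.4]
points = qq_points(residuals)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

Plotting the first element of each pair against the second gives the QQ plot; strong curvature or outlying end points indicate departures from normality.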
Types of Data
• “God does not play dice”
– Einstein
• “the end of certainty”
– Prigogine, 1977 Nobel Prize
• What remains is:
– Quantifiable probability with uncertainty
Uncertainty Factors
• Inherent uncertainty in the world
• Limitation of human cognition
• Limitation of measurement
• Uncertainty in processing and analysis