Transcript AIC

Robert Plant != Richard Plant
[Workflow diagram: Field Data (response, coordinates) and direct or remotely sensed Predictors (covariates) are qualified and prepped into Sample Data (response, covariates); these may be the same data. A random split creates Training Data and Test Data. The Training Data is used to Build Model; The Model produces Statistics for Validation against the Test Data and is used to Predict, giving Predicted Values that are Summarized into a Predictive Map and Uncertainty Maps. Randomness enters through the inputs, and the processes are repeated over and over.]
Cross-Validation
• Split the data into training (build model) and test (validate) data sets
• Leave-p-out cross-validation
  – Validate on p samples, train on the remainder
  – Repeated for all possible combinations of p samples
• Non-exhaustive cross-validation
  – Leave-p-out cross-validation, but only on a subset of the possible combinations
  – Randomly splitting into 30% test and 70% training is common (see the sketch below)
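A minimal sketch of that random 70/30 split using scikit-learn; the data frame and column names here are made-up examples, not from the slides:

```python
# Minimal sketch of the common random 70/30 split, assuming a table of
# response + covariates (made-up data; any real sample data could be used).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
samples = pd.DataFrame({
    "response": rng.random(100),   # made-up response values
    "elev": rng.random(100),       # made-up covariates
    "precip": rng.random(100),
})

X = samples[["elev", "precip"]]
y = samples["response"]

# 30% held out for testing (validation), 70% used to build the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(len(X_train), "training rows,", len(X_test), "test rows")
```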
K-fold Cross-Validation
• Break the data into K sections ("folds")
• Test on fold K_i, train on the remainder
• Repeat for all K_i
• 10-fold is common
[Diagram: the data divided into folds 1–10; one fold is the Test set, the other nine are the Training set.]
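A minimal sketch of 10-fold cross-validation with scikit-learn; the estimator and the made-up data are placeholders, not the course's model:

```python
# Minimal sketch of 10-fold cross-validation (any model could stand in
# for LinearRegression; the data are made up).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.random((100, 3))                        # 100 samples, 3 covariates
y = X @ [1.0, -2.0, 0.5] + rng.normal(0, 0.1, 100)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=42).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold

print("mean R^2 across the 10 folds:", np.mean(scores))
```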
Bootstrapping
• Drawing N samples from the sample
data (with replacement)
• Building the model
• Repeating the process over and over
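A minimal sketch of the bootstrap idea; here the "model" is just the sample mean for brevity, and the data are made up:

```python
# Minimal sketch of bootstrapping a model statistic.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=2, size=200)     # made-up sample data

boot_stats = []
for _ in range(1000):                            # repeat over and over
    resample = rng.choice(data, size=data.size, replace=True)  # N samples with replacement
    boot_stats.append(resample.mean())           # "build the model" on the resample

print("bootstrap mean:", np.mean(boot_stats))
print("bootstrap 95% interval:", np.percentile(boot_stats, [2.5, 97.5]))
```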
Random Forest
• N samples drawn from the data with
replacement
• Repeated to create many trees
– A “random forest”
• Predictions are made by aggregating the outputs of all the trees (majority
vote or average)
• Bootstrap aggregation or “Bagging”
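A minimal sketch of fitting a random forest with scikit-learn's RandomForestRegressor, which handles the bootstrap resampling and tree building internally (the data and settings here are made up):

```python
# Minimal sketch of a random forest fit on hypothetical data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.random((200, 4))              # made-up covariates
y = rng.random(200)                   # made-up response

forest = RandomForestRegressor(
    n_estimators=500,                 # many bootstrapped trees
    bootstrap=True,                   # draw N samples with replacement per tree
    random_state=42).fit(X, y)

print(forest.predict(X[:5]))          # predictions aggregated over all trees
```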
Boosting
• Can a set of weak learners create a single strong learner? (Wikipedia)
  – Lots of "simple" trees used to create a really complex tree
• "Convex potential boosters cannot withstand random classification noise"
  – 2008, Phillip Long (at Google) and Rocco A. Servedio (Columbia University)
Boosted Regression Trees
• BRTs combine thousands of trees to
reduce deviance from the data
• Currently popular
• More on this later
Sensitivity Testing
• Injecting small amounts of “noise” into
our data to see the effect on the model
parameters.
– Plant
• The same approach can be used to
model the impact of uncertainty on our
model outputs and to make uncertainty
maps
• Note: This is not the same as sensitivity
testing for model parameters
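A minimal sketch of the idea: perturb the data with small amounts of noise, refit, and summarize how much the fitted coefficients move (the data and model here are hypothetical):

```python
# Minimal sketch of sensitivity testing by injecting noise into the data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.random((100, 2))                             # made-up covariates
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, 100)

coefs = []
for _ in range(500):
    X_noisy = X + rng.normal(0, 0.02, X.shape)       # inject small amounts of noise
    coefs.append(LinearRegression().fit(X_noisy, y).coef_)

coefs = np.array(coefs)
print("coefficient means:   ", coefs.mean(axis=0))
print("coefficient std devs:", coefs.std(axis=0))    # spread = sensitivity
```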
Jackknifing
• Trying all combinations of covariates
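As a rough illustration of trying covariate combinations, a sketch with hypothetical predictor names (all-subsets refitting, one simple way to do it):

```python
# Minimal sketch: refit a model on every combination of candidate covariates
# and compare the fits (column names and data are made up).
import itertools
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((100, 4)), columns=["elev", "slope", "precip", "temp"])
df["response"] = 2 * df["elev"] - df["precip"] + rng.normal(0, 0.1, 100)

predictors = ["elev", "slope", "precip", "temp"]
for r in range(1, len(predictors) + 1):
    for combo in itertools.combinations(predictors, r):
        model = LinearRegression().fit(df[list(combo)], df["response"])
        print(combo, round(model.score(df[list(combo)], df["response"]), 3))
```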
Extrapolation vs. Prediction
[Diagram: predictions fall within the range of the data used to build the model; extrapolation extends beyond that range.]
Modeling: creating a model that allows us to estimate values between our data points.
Extrapolation: using existing data to estimate values outside the range of our data.
Building Models
• Selecting the method
• Selecting the predictors (“Model
Selection”)
• Optimizing the coefficients/parameters of
the model
[The modeling workflow diagram from the start of the deck is repeated here.]
Model Selection
• Need a method to select the “best” set of
predictors
– Really to select the best method, predictors,
and coefficients (parameters)
• Should be a balance between fitting the
data and simplicity
– R2 – only considers fit to data (but linear
regression is pretty simple)
Simplicity
• Everything should
be made as simple
as possible, but not
simpler.
– Albert Einstein
"Albert Einstein Head" by Photograph by
Oren Jack Turner, Princeton, licensed
through Wikipedia
Parsimony
• “…too few parameters and the model will
be so unrealistic as to make prediction
unreliable, but too many parameters and
the model will be so specific to the
particular data set so to make prediction
unreliable.”
– Edwards, A. W. F. (2001). Occam’s bonus. p. 128–
139; in Zellner, A., Keuzenkamp, H. A., and
McAleer, M. Simplicity, inference and modelling.
Cambridge University Press, Cambridge, UK.
Parsimony
• Under fitting: model structure is …included in the residuals
• Over fitting: residual variation is included as if it were structural
  – Anderson
Akaike Information Criterion
• AIC
• k = number of estimated parameters in the model
• L = maximized value of the likelihood function for the estimated model

AIC = 2k − 2 ln(L)
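A minimal sketch of computing AIC = 2k − 2 ln(L) by hand for an ordinary least-squares fit, assuming Gaussian errors (the data are made up; k counts the intercept, slope, and error variance):

```python
# Minimal sketch of computing AIC for a simple linear regression.
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(50)
y = 2.0 * x + rng.normal(0, 0.3, 50)

# Fit y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
n = len(y)
sigma2 = np.mean(resid**2)                       # ML estimate of the error variance

# Gaussian log-likelihood evaluated at the fitted parameters
lnL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
k = 3                                            # b0, b1, and sigma^2
print("AIC =", 2 * k - 2 * lnL)
```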
AIC
• Only a relative meaning
• Smaller is “better”
• Balance between complexity:
– Over fitting or modeling the errors
– Too many parameters
• And bias
– Under fitting or the model is missing part of
the phenomenon we are trying to model
– Too few parameters
Likelihood
• Likelihood of a set of parameter values given some observed data = the
  probability of the observed data given those parameter values
• Definitions:
  – x = all sample values
  – x_i = one sample value
  – θ = set of parameters
  – p(x|θ) = probability of x, given θ
• See:
  – ftp://statgen.ncsu.edu/pub/thorne/molevoclass/pruning2013cme.pdf
Likelihood
[Plot: −2 times the log likelihood.]

p(x) for a fair coin
AIC = 2k − 2 ln(L)
L = p(x1|θ) * p(x2|θ) * …
[Bar chart: p(Heads) = 0.5, p(Tails) = 0.5]
What happens as we flip a "fair" coin?

p(x) for an unfair coin
AIC = 2k − 2 ln(L)
L = p(x1|θ) * p(x2|θ) * …
[Bar chart: p(Heads) = 0.8, p(Tails) = 0.2]
What happens as we flip a "fair" coin?

p(x) for a coin with two heads
AIC = 2k − 2 ln(L)
L = p(x1|θ) * p(x2|θ) * …
[Bar chart: p(Heads) = 1.0, p(Tails) = 0.0]
What happens as we flip a "fair" coin?
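As a small illustration (the simulation below is mine, not from the slides): flipping a fair coin and tracking −2 ln(L) under each candidate model, the fair model tends to keep the smallest value, while the two-headed model jumps to infinity as soon as a tail appears.

```python
# Minimal sketch of the coin example: simulate flips of a fair coin and
# compute -2 ln(L) under three candidate models for p(Heads).
import numpy as np

rng = np.random.default_rng(42)
flips = rng.random(100) < 0.5            # True = heads, data from a fair coin
models = {"fair (0.5)": 0.5, "unfair (0.8)": 0.8, "two heads (1.0)": 1.0}

for name, p_heads in models.items():
    # p(x_i | theta) for each flip; the product becomes a sum of logs
    p_each = np.where(flips, p_heads, 1.0 - p_heads)
    with np.errstate(divide="ignore"):   # log(0) -> -inf for the two-headed model
        neg2lnL = -2.0 * np.sum(np.log(p_each))
    print(name, "-2 ln(L) =", neg2lnL)
```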
Does likelihood from p(x) work?
• If the likelihood is the probability of the data given the parameters,
• and a response function provides the probability of a piece of data (i.e.,
  the probability that this is suitable habitat),
• then we can use the probability that a specific occurrence is suitable as
  p(x|ParameterValues).
• Thus the likelihood of a habitat model (while disregarding bias) can be
  computed by
  L(ParameterValues|Data) = p(Data1|ParameterValues) * p(Data2|ParameterValues) * …
• This does not work directly: the highest likelihood would go to a model
  that is 1.0 everywhere. You have to divide the model by its area so the
  area under the model = 1.0.
• Remember: this only works when comparing models on the same dataset!
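A rough sketch of that normalization step, assuming the model output is a suitability grid and we know the row/column locations of occurrence records (everything here is made up for illustration):

```python
# Minimal sketch: normalize a suitability surface so the area under the
# model equals 1.0, then compute the log likelihood at occurrence locations.
import numpy as np

rng = np.random.default_rng(42)
suitability = rng.random((50, 50))            # model output, 0..1 per cell
normalized = suitability / suitability.sum()  # area under the model = 1.0

# Row/column indices of hypothetical occurrence records
occ_rows = rng.integers(0, 50, size=20)
occ_cols = rng.integers(0, 50, size=20)

lnL = np.sum(np.log(normalized[occ_rows, occ_cols]))
print("log likelihood of the habitat model:", lnL)
```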
Akaike…
• Akaike showed that:
  log(L(θ|data)) − K ≈ E_y E_x[log(g(x|θ(y)))]
• Which is equivalent to:
  log(L(θ|data)) − K = constant − E_θ[I(f, g)]
• Akaike then defined:
  AIC = −2 log(L(θ|data)) + 2K
AICc
• Additional penalty for more parameters:
  AICc = AIC + 2k(k + 1) / (n − k − 1)
• Recommended when n is small or k is large
BIC
• Bayesian Information Criterion
• The penalty term uses n (the number of samples)

BIC = k ln(n) − 2 ln(L)
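A quick sketch computing all three criteria from the formulas above; the k, n, and log-likelihood values are made up for illustration:

```python
# Minimal sketch comparing AIC, AICc, and BIC for a single fitted model.
import numpy as np

k, n, lnL = 3, 40, -52.7     # parameters, sample size, maximized log likelihood

aic = 2 * k - 2 * lnL
aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
bic = k * np.log(n) - 2 * lnL
print(f"AIC = {aic:.1f}, AICc = {aicc:.1f}, BIC = {bic:.1f}")
```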
Extra slides
• Discrete:
  D_KL(P||Q) = Σ_i P(i) ln(P(i) / Q(i))
• Continuous:
  D_KL(P||Q) = ∫ p(x) ln(p(x) / q(x)) dx   (integral from −∞ to ∞)
• Justification:
  D_KL(P||Q) = −Σ p(x) log(q(x)) + Σ p(x) log(p(x))
• The distance can also be expressed as:
  I(f, g) = ∫ f(x) log(f(x)) dx − ∫ f(x) log(g(x|θ)) dx
• ∫ f(x) log(f(x)) dx is the expectation of log(f(x)) under f, so:
  I(f, g) = E_f[log(f(x))] − E_f[log(g(x|θ))]
• Treating E_f[log(f(x))] as an unknown constant C:
  I(f, g) − C = −E_f[log(g(x|θ))] = relative distance between g and f
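A minimal sketch of the discrete form, with two made-up distributions standing in for the truth f and the model g:

```python
# Minimal sketch of the discrete Kullback-Leibler distance D_KL(P||Q).
import numpy as np

P = np.array([0.5, 0.3, 0.2])     # "true" distribution f
Q = np.array([0.4, 0.4, 0.2])     # approximating model g

D_KL = np.sum(P * np.log(P / Q))  # sum_i P(i) ln(P(i)/Q(i))
print("D_KL(P||Q) =", D_KL)
```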