Likelihood and Information Theoretic Methods in Forest Ecology

Lecture 3
Hypothesis Testing and Statistical Inference using Likelihood:
The Central Role of Models

Outline…

Statistical inference:
- it’s what we use statistics for, but there are some surprisingly
tricky philosophical difficulties that have plagued statisticians
for over a century…

The “frequentist” vs. “likelihoodist” solutions

Hypothesis testing as a process of comparing alternate
models

Examples – ANOVA and ANCOVA

The issue of parsimony
Inference defined...
“a : the act of passing from one proposition,
statement, or judgment considered as true to
another whose truth is believed to follow from
that of the former
b : the act of passing from statistical sample
data to generalizations (as of the value of
population parameters) usually with calculated
degrees of certainty”
Source: Merriam-Webster Online Dictionary
Statistical Inference...
... Typically concerns inferring properties of an
unknown distribution from data generated by
that distribution ...
Components:
-- Point estimation
-- Hypothesis testing
-- Model comparison
Probability and Inference

How do you choose the “correct inference” from your data,
given inevitable uncertainty and error?

Can you assign a probability to your certainty in the
correctness of a given inference?
- (hint: if this is really important to you, then you should
consider becoming a Bayesian, as long as you can accept what I
consider to be some fairly objectionable baggage…)

How do you choose between alternate hypotheses?
- Can you assess the strength of your evidence for alternate
hypotheses?
The crux of the problem...
“Thus, our general problem is to assess the relative
merits of rival hypotheses in the light of
observational or experimental data that bear upon
them....” (Edwards, pg 1).
Edwards, A.W.F. 1992. Likelihood. Expanded Edition.
Johns Hopkins University Press.
Assigning Probabilities to Hypotheses

Unfortunately, hypotheses (or even different parameter
estimates) can not generally be treated as “data” (outcomes
of trials)

Statisticians have debated alternate solutions to this
problem for centuries
- (with no generally agreed-upon solution)
One Way Out: Classical “Frequentist”
Statistics and Tests of Null Hypotheses

Probability is defined in terms of the outcome of a series of
repeated trials.

Hypothesis testing via “significance” of pre-defined “statistics”
- What is the probability of observing a particular value of a
predefined test statistic, given an assumed hypothesis about the
underlying scientific model, and assumptions about the
probability model of the test statistic?
- Hypotheses are never “accepted”, but are “rejected” (categorically)
if the probability of obtaining a test statistic at least as extreme
as the observed value, given the assumed hypothesis, is very small
(the “p-value”)
An Implicit Assumption

The data are an approximate “sample” of an underlying
“true” reality –
i.e., there is a true population mean, and the
sample provides an estimate of it...
Limitations of Frequentist Statistics

They do not provide a means of measuring the relative strength of
observational support for alternate hypotheses (they merely
help decide when to “reject” individual hypotheses in
comparison to a single “null” hypothesis...)
- So you conclude the slope of the line is not 0. How strong is
your evidence that the slope is really 0.45 vs. 0.50?

Extremely non-intuitive: just what is a “confidence interval”
anyway...
Confidence Intervals
A typical definition:
“...If a series of samples are drawn
and the mean of each calculated,
95% of the means would be
expected to fall within the range of
two* standard errors above and
two below the mean of these
means...”
*actually, 1.96
Source: http://bmj.bmjjournals.com/collections/statsbk/4.shtml
[Figure: Standard Normal Distribution. x-axis: Standard Error of the Mean (-3 to 3); y-axis: Probability; central region marked "cumulative prob. = 95%"]
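The quoted definition can be checked with a small simulation. This is only a sketch under stated assumptions: the population mean, standard deviation, sample size, and number of samples are hypothetical values chosen for illustration, and the known population mean stands in for "the mean of these means".

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population and sampling design (illustrative values only)
POP_MEAN, POP_SD = 50.0, 10.0
SAMPLE_SIZE, N_SAMPLES = 25, 2000

# Standard error of the mean for samples of this size
std_error = POP_SD / math.sqrt(SAMPLE_SIZE)

inside = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    # Does this sample mean fall within 1.96 standard errors of the true mean?
    if abs(sample_mean - POP_MEAN) <= 1.96 * std_error:
        inside += 1

coverage = inside / N_SAMPLES
print(f"Fraction of sample means within 1.96 SE: {coverage:.3f}")
```

The fraction should land close to 0.95. Note that this is a statement about repeated sampling, not about the probability that any one hypothesis is true.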
The “null hypothesis” approach

When and where is “strong inference” really useful?

When is it just an impediment to progress?
Platt, J. R. 1964. Strong inference. Science 146:347-353
Stephens et al. 2005. Information theory and hypothesis testing: a call
for pluralism. Journal of Applied Ecology 42:4-12.
Chamberlin’s alternative:
multiple working hypotheses

Science rarely progresses through a series of dichotomously
branched decisions…

Instead, we are constantly trying to choose among a large
set of alternate hypotheses
- Concept is very old, but the computational power needed to
adopt this approach has only recently become available…
Chamberlin, T. C. 1890. The method of multiple working hypotheses.
Science 15:92.
Hypothesis testing and
“significance”
Nester’s (1996) Creed:
•TREATMENTS: all treatments differ
•FACTORS: all factors interact
•CORRELATIONS: all variables are correlated
•POPULATIONS: no two populations are identical in any respect
•NORMALITY: no data are normally distributed
•VARIANCES: variances are never equal
•MODELS: all models are wrong
•EQUALITY: no two numbers are the same
•SIZE: many numbers are very small
Nester, M. R. 1996. An applied statistician’s creed. Applied Statistics
45:401-410
Hypothesis testing vs. estimation
“The problem of estimation is of more central
importance, (than hypothesis testing).. for in almost all
situations we know that the effect whose significance
we are measuring is perfectly real, however small;
what is at issue is its magnitude.” (Edwards, 1992, pg.
2)
“An insignificant result, far from telling us that the
effect is non-existent, merely warns us that the
sample was not large enough to reveal it.” (Edwards,
1992, pg. 2)
Hypothesis testing and probability:
the likelihood compromise

Probability (of the data) can not generally be used
directly to test alternate hypotheses (about
parameters)...
$$P(\theta \mid x) \neq P(x \mid \theta)$$
The “Likelihood Principle”
$$L(\theta \mid x) \propto P(x \mid \theta)$$
In plain English: “The likelihood (L) of the set of
parameters (θ) (in the scientific model), given an
observation (x) is proportional to the probability of
observing the data, given the parameters...”
{and this probability is something we can calculate, using
the appropriate underlying probability model (i.e. a PDF)}
$$\text{Log-likelihood} = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln g(x_i \mid \theta)$$
The most important point of
the course…
Any hypothesis test can be framed as
a comparison of alternate models…
(and being free of the constraints imposed by the alternate
models embedded in classical statistical tests is perhaps
the most important benefit of the likelihood approach…)
A simple example:
The likelihood alternative to 1-way ANOVA

Basic model: a set of observations (j=1..n) that can be classified
into i = 1..a distinct groups (i.e. levels of treatment A)
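$$y_{ij} = \mu + A_i + \varepsilon_{ij}, \quad \text{for } i = 1..a \text{ groups}$$
$$\varepsilon_{ij} \sim \text{I.I.D. } N(0, \sigma^2)$$

A likelihood alternative
$$y_{ij} = A_i + \varepsilon_{ij}$$
$$\varepsilon_{ij} \sim \text{I.I.D. } N(0, \sigma^2)$$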
Differences in Frequentist vs.
Likelihood Approaches


Traditional Frequentist Approach:
- Report “significance” of a test that …… based on a test statistic
calculated from sums of squares (F statistic), with a necessary
assumption of a homogeneous and normally distributed error

Likelihood Approach:
- Compare a set of alternate models, assess the strength of
evidence in your data for each of them, and identify the “best”
model
- If the assumption about the error term isn’t appropriate, use a
different error term!
So, what would make sense as
alternate models?
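Our first model:
$$y_{ij} = A_i + \varepsilon_{ij}, \quad \text{for } i = 1..a \text{ groups}$$
$$\varepsilon_{ij} \sim \text{I.I.D. } N(0, \sigma^2)$$
A “null” model:
$$y_{ij} = A + \varepsilon_{ij}$$
$$\varepsilon_{ij} \sim N(0, \sigma^2)$$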
Could and should you test additional models that lump some
groups together (particularly if that lumping is based on looking
at the estimated group means)?
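The comparison of alternate models can be sketched as follows. This is an illustration, not a prescribed procedure: the three groups and their values are hypothetical, and the function uses the closed-form maximum likelihood estimates for normal errors (group means, and a variance that divides by n).

```python
import math

def max_log_likelihood(groups):
    """Maximized normal log-likelihood for a model with one mean per group
    and a single shared error variance. Returns (log-likelihood, K)."""
    residuals = []
    for values in groups.values():
        group_mean = sum(values) / len(values)      # MLE of A_i
        residuals.extend(y - group_mean for y in values)
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n      # MLE of sigma^2 (divides by n)
    loglik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return loglik, len(groups) + 1                  # group means plus sigma

# Hypothetical observations for three treatment levels
a = [10.1, 11.3, 9.8, 10.6]
b = [12.9, 13.4, 12.2, 13.8]
c = [10.4, 10.9, 11.5, 10.0]

models = {
    "separate means": {"A": a, "B": b, "C": c},
    "A and C lumped": {"A+C": a + c, "B": b},
    "null": {"all": a + b + c},
}
results = {name: max_log_likelihood(g) for name, g in models.items()}
for name, (loglik, k) in results.items():
    print(f"{name}: ln L = {loglik:.2f}, K = {k}")
```

Because the models are nested, the log-likelihood can only go up as parameters are added. That is why likelihood alone cannot settle which model is "best".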
Remember that the error term is
part of the model…
And you don’t just have to accept that a simple, normally
distributed, homogeneous error is appropriate…
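Estimate a separate error term for each group:
$$y_{ij} = A_i + \varepsilon_{ij}$$
$$\varepsilon_{ij} \sim N(0, \sigma_i^2)$$
Or an error term that varies as a function of the predicted value:
$$y_{ij} = A_i + \varepsilon_{ij}$$
$$\varepsilon_{ij} \sim N(0, \sigma^2), \quad \sigma = f(\hat{y})$$
Or where the error isn’t normally distributed:
$$y_{ij} = A_i + \varepsilon_{ij}$$
$$y_{ij} \sim \mathrm{Gamma}(\text{shape}, \text{scale})$$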
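A rough sketch of how alternate error terms change the likelihood, assuming hypothetical data in which the groups differ in spread. The gamma case fixes the shape parameter at an arbitrary value rather than estimating it, and sets scale = mean / shape so that the group mean is the expected value. Both choices are assumptions made only to keep the example short.

```python
import math

def normal_loglik(values, mean, sigma):
    """Log-likelihood of values under N(mean, sigma^2)."""
    n = len(values)
    ss = sum((y - mean) ** 2 for y in values)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - ss / (2.0 * sigma ** 2)

def gamma_log_pdf(y, shape, scale):
    """Log density of a gamma distribution (requires y > 0)."""
    return ((shape - 1.0) * math.log(y) - y / scale
            - shape * math.log(scale) - math.lgamma(shape))

# Hypothetical groups with visibly different spread
groups = {
    "A": [10.1, 11.3, 9.8, 10.6],
    "B": [12.0, 15.4, 10.2, 14.8],
    "C": [10.4, 10.5, 10.6, 10.3],
}
means = {g: sum(v) / len(v) for g, v in groups.items()}

# Model 1: one shared sigma for all groups
n_total = sum(len(v) for v in groups.values())
pooled_ss = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
shared_sigma = math.sqrt(pooled_ss / n_total)
ll_shared = sum(normal_loglik(v, means[g], shared_sigma) for g, v in groups.items())

# Model 2: a separate sigma estimated for each group
ll_separate = 0.0
for g, v in groups.items():
    sigma_g = math.sqrt(sum((y - means[g]) ** 2 for y in v) / len(v))
    ll_separate += normal_loglik(v, means[g], sigma_g)

# Model 3: gamma-distributed observations (shape is a hypothetical fixed value)
SHAPE = 50.0
ll_gamma = sum(gamma_log_pdf(y, SHAPE, means[g] / SHAPE)
               for g, v in groups.items() for y in v)

print(f"shared sigma:   ln L = {ll_shared:.2f}")
print(f"separate sigma: ln L = {ll_separate:.2f}")
print(f"gamma errors:   ln L = {ll_gamma:.2f}")
```

In practice the gamma shape parameter would be estimated along with the group means, typically by numerical optimization.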
A more general notation for the model…
The “scientific model”
$$y_i = \hat{y}_i + \varepsilon_i, \quad \text{where } \hat{y}_i = f(x_i)$$
$$\varepsilon_i \sim \text{I.I.D.}, \quad \text{and } y_i \sim N(\hat{y}_i, \sigma^2)$$
And a likelihood function [ $g(y_i \mid \theta)$ ] specifies the
probability of observing $y_i$, given the predicted value
for that observation ( $\hat{y}_i$ ) (i.e. calculated as a function of
the parameters in the scientific model), and any
parameters in the PDF (i.e. σ)
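The separation between the scientific model and the PDF can be made explicit in code. In this sketch the straight-line model, the parameter names `a` and `b`, and the data are all hypothetical; any function of x could be passed in as the scientific model, and any log density as the error model.

```python
import math

def normal_log_pdf(y, predicted, sigma):
    """Log of g(y | theta): normal density centered on the predicted value."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (y - predicted) ** 2 / (2.0 * sigma ** 2))

def linear_model(x, params):
    """A hypothetical scientific model f(x): a straight line."""
    return params["a"] + params["b"] * x

def log_likelihood(xs, ys, params, sigma, model, log_pdf):
    """Sum over observations of ln g(y_i | predicted y_i, sigma)."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = model(x, params)     # y-hat from the scientific model
        total += log_pdf(y, predicted, sigma)
    return total

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ll_close = log_likelihood(xs, ys, {"a": 0.0, "b": 2.0}, 0.5,
                          linear_model, normal_log_pdf)
ll_far = log_likelihood(xs, ys, {"a": 0.0, "b": 1.0}, 0.5,
                        linear_model, normal_log_pdf)
print(f"slope 2.0: ln L = {ll_close:.2f}")
print(f"slope 1.0: ln L = {ll_far:.2f}")
```

Swapping in a different `model` or `log_pdf` changes the hypothesis being evaluated without touching the rest of the calculation.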
Another Example:
Analysis of Covariance

A traditional ANCOVA model (homogeneous slopes):
$$y_{ij} = a_j + b x_{ij} + \varepsilon_{ij}, \quad \text{for } j = 1..A \text{ groups}$$

What is restrictive about this model?

How would you generalize this in a likelihood framework?
- What alternate models are you testing with the standard
frequentist statistics?
- What more general alternate models might you like to test?
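One obvious generalization is to let the slope vary by group. The sketch below compares the homogeneous-slopes model with a separate-slopes model, assuming normal errors with a single shared variance and using least squares (which gives the maximum likelihood estimates in that case). The two "sites" and their values are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def loglik_from_rss(rss, n):
    """Maximized normal log-likelihood given the residual sum of squares."""
    sigma2 = rss / n                      # MLE of sigma^2
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def centered_sums(xs, ys):
    """Return (Sxx, Sxy, Syy) computed about the group means."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy

# Hypothetical data: (x values, y values) for each group
groups = {
    "site 1": ([1.0, 2.0, 3.0, 4.0, 5.0], [2.2, 2.9, 4.1, 5.2, 5.8]),
    "site 2": ([1.0, 2.0, 3.0, 4.0, 5.0], [3.1, 5.2, 6.8, 9.1, 11.0]),
}
sums = [centered_sums(xs, ys) for xs, ys in groups.values()]
n = sum(len(xs) for xs, _ in groups.values())

# Homogeneous slopes: separate intercepts a_j, one common slope b
b_common = sum(sxy for _, sxy, _ in sums) / sum(sxx for sxx, _, _ in sums)
rss_common = sum(syy - 2.0 * b_common * sxy + b_common ** 2 * sxx
                 for sxx, sxy, syy in sums)
ll_common = loglik_from_rss(rss_common, n)
k_common = len(groups) + 2               # intercepts + slope + sigma

# Separate slopes: each group gets its own intercept and slope
rss_separate = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in sums)
ll_separate = loglik_from_rss(rss_separate, n)
k_separate = 2 * len(groups) + 1         # intercepts + slopes + sigma

print(f"common slope:    ln L = {ll_common:.2f}, K = {k_common}")
print(f"separate slopes: ln L = {ll_separate:.2f}, K = {k_separate}")
```

Other candidates, such as a common intercept with separate slopes or a nonlinear response, could be added to the same comparison.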
But is likelihood enough?
The challenge of parsimony
The importance of seeking simple answers...
“It will not be sufficient, when faced with a mass of
observations, to plead special creation, even though, as
we shall see, such a hypothesis commands a higher
numerical likelihood than any other.”
(Edwards, 1992, pg. 1, in explaining the need for a
rigorous basis for scientific inference, given uncertainty
in nature...)
Models, Truth, and “Full Reality”
(The Burnham and Anderson view...)
“We believe that “truth” (full reality) in the biological sciences
has essentially infinite dimension, and hence ... cannot be
revealed with only ... finite data and a “model” of those data...
... We can only hope to identify a model that provides a good
approximation to the data available.”
(Burnham and Anderson 2002, pg. 20)
The “full” model

What I irreverently call the “god” model: everything is
the way it is because it is…

In statistical terms, this is simply a model with as many
parameters as observations
- i.e.: $x_i = \theta_i$
This will always be the model with the highest likelihood!
(but it won’t be the most parsimonious)…
Parsimony, Ockham’s razor, and
drawing elephants...
William of Ockham (1285-1349):
“Pluralitas non est ponenda sine necessitate”
“entities should not be multiplied
unnecessarily”
“Parsimony: ... 2 : economy in the use of means to an end;
especially : economy of explanation in conformity with Occam's
razor”
(Merriam-Webster Online Dictionary)
So how many parameters DOES it
take to draw an elephant...?*
Information Theory perspective:
“How much information is lost when using a simple
model to approximate reality?”
Answer: the Kullback-Leibler Distance (generally
unknowable)
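More Practical Answer: Akaike’s Information Criterion
(AIC) estimates the relative expected KL distance, so the model
with the lowest AIC is the one estimated to lose the least information
$$\mathrm{AIC} = -2 \ln L(\theta \mid x) + 2K$$
(with the likelihood evaluated at the maximum likelihood estimates, and K the number of estimated parameters)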
*30 would “carry a chemical engineer into preliminary design”
(Wei, 1975) (cited in B&A, pg 30)
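The AIC calculation itself is a one-liner. In this sketch the model names, log-likelihoods, and parameter counts are hypothetical numbers chosen to show the trade-off between fit and parsimony.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: -2 ln(L) + 2K."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical maximized log-likelihoods and parameter counts (K)
candidates = {
    "null (one mean)": (-25.4, 2),
    "separate group means": (-17.9, 4),
    "separate means and variances": (-17.1, 6),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

# Differences from the best model are what matter, not the raw AIC values
delta = {name: score - scores[best] for name, score in scores.items()}

for name in candidates:
    print(f"{name}: AIC = {scores[name]:.1f}, delta = {delta[name]:.1f}")
print(f"Lowest AIC: {best}")
```

Here the model with the highest likelihood is not the one with the lowest AIC: its two extra parameters buy too little improvement in fit.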
The brave new world…

Science is the development of simplified models as
explanations (approximations) of reality…

The “quality” of the explanation (the model) will be a
balance of many factors (both quantitative and qualitative)
Consilience… E.O. Wilson’s view of science

In his book Consilience: The Unity of Knowledge (1998),
Wilson asserts that the sciences, humanities, and arts have a
common goal: to give a purpose to understanding the
details, to lend to all inquirers "a conviction, far deeper
than a mere working proposition, that the world is
orderly and can be explained by a small number of
natural laws."
Source: Wikipedia