Modeling
Wim Buysse
RUFORUM 1 December 2006
Research Methods Group
Part 1. General Linear Models
General Linear Models
Dataset from
p. 89 - 95
Effects of three levels of sorbic acid (Sorbic) and six levels of water activity (Water) on survival of Salmonella typhimurium (Density)
Water density = log(density/ml)
ANOVA approach
Results
The same data, but each treatment is presented
as a ‘dummy variable’. (Warning: for educational
purposes only.)
Regression with a first independent variable.
We add a second independent variable.
We add a third one.
We add a fourth one.
We continue to construct the model.
Finally, the results.
Comparison of the two approaches:
- They give the same results (in terms of sums of squares, SS).
- The approach to choose depends on what you want to know.
- The regression approach still works when the ANOVA approach is no longer possible (for instance when there are missing values).
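The equality of the sums of squares can be checked on a small example. The sketch below uses made-up responses for three hypothetical treatments (A, B, C), not the Salmonella dataset, and a hand-written least-squares solver rather than GenStat; it only illustrates that a one-way ANOVA and a regression on dummy variables give the same treatment SS.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical data: three treatments with three replicates each.
groups = {"A": [4.1, 3.9, 4.4], "B": [5.0, 5.3, 4.8], "C": [6.2, 5.9, 6.4]}
y = [v for vals in groups.values() for v in vals]
grand_mean = sum(y) / len(y)

# ANOVA approach: treatment SS from the group means.
ss_anova = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())

# Regression approach: intercept plus dummy variables for B and C (A is the reference).
labels = [g for g, vals in groups.items() for _ in vals]
X = [[1.0, float(g == "B"), float(g == "C")] for g in labels]
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(xtx, xty)  # least-squares estimates from the normal equations
fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
ss_regression = sum((f - grand_mean) ** 2 for f in fitted)

print(round(ss_anova, 4), round(ss_regression, 4))
```

The two printed values agree; only the presentation of the treatment effects differs (group means versus coefficients relative to a reference level).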
Example: modelling approach with
normally distributed data.
Protocol and dataset.
Data: Screening of suitable species for three-year
fallow
file = Fallow N.xls
Protocol: p. 13
The analysis approach is written down in chapter 19 of
‘Good statistical practice for natural resources research’
Modelling approach: general
5 steps:
1. (Visual) exploration to discover trends and relationships
2. Choose a possible model, based on:
• The trend you see
• Knowledge of the experimental design
• Biological/scientific knowledge of the process
3. Fitting = estimation of parameters
4. Check = assessing the ‘fit’
5. Interpretation to answer the objectives.
Expanding the model
ANOVA and regression
• Same calculations
• Data
= pattern + noise
= systematic component + random component
• Same assumptions
• Systematic components are additive
• Variability of the groups is similar
• The random component is (rather) normally
distributed. The random variability of “y”
around the systematic component is not
affected by this systematic component.
GENERAL LINEAR MODELS
Data = pattern + noise
Pattern: explained by a linear combination of the independent variables
(Data ≈ N(m, v), with the variance roughly constant across the different groups)
Noise: N(0, v), with the variance roughly constant across the different groups
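Written as a formula, the decomposition looks as follows. This is a sketch in notation that is assumed here and does not appear on the slides: y_i is observation i, x_{ij} are the independent variables, β_j the parameters to estimate, and σ² the constant variance (the "v" above).

```latex
y_i \;=\; \underbrace{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}}_{\text{pattern}}
\;+\; \underbrace{\varepsilon_i}_{\text{noise}},
\qquad \varepsilon_i \sim N(0, \sigma^2)
```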
Expanding the model
If the data are not normally distributed or if the variance of the different groups is not similar:
Possible approach = transformation of the data = ‘linearising’ the model
Problems:
- You no longer work on a scale that has a biological meaning.
- Transforming the standard errors back to the original scale is no longer possible.
Better solution: GENERAL LINEAR MODELS => GENERALIZED LINEAR MODELS
Fewer restrictions; two essential differences:
1. Data can be distributed according to the exponential family of distributions = Normal, Binomial, Poisson, Gamma, Negative binomial
2. Link function: E(Y) itself no longer has to be a linear combination of the independent variables. Instead, a function of E(Y) (the link function) is modelled as a linear combination of the independent variables. (We don’t transform the dependent variable but include the transformation in the model.)
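In symbols, the link function g connects the expected value to the linear combination. The notation below is assumed for illustration (it is not on the slides); the three links listed are the standard canonical choices for the corresponding distributions.

```latex
g\bigl(E(Y)\bigr) \;=\; \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p

\text{Normal: } g(\mu) = \mu \ \text{(identity)} \qquad
\text{Binomial: } g(\mu) = \log\frac{\mu}{1-\mu} \ \text{(logit)} \qquad
\text{Poisson: } g(\mu) = \log \mu \ \text{(log)}
```

With the identity link and a Normal distribution, this reduces to the general linear model.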
Also:
- The systematic component (linear combination of
independent variables) can include both continuous and
categorical variables and even polynomials
But still:
- The variance is constant across the different groups (or
has become constant because of the transformation
through the link function)
Generalised linear models
The statistical theory is more difficult, but the menus in GenStat and the way you interpret the output are very similar to what we know from ANOVA and regression.
Example 1. Logistic regression
Example: cardiovascular disease according to age
File: age and chd.xls
Example: same data but according to age group
Example: the linear regression is not an
appropriate model and the predictions at the
extremes will not be correct
Example: the χ2 test gives only limited information
• Bernoulli process: an (independent) event that can have two possible outcomes (1 – 0, success – failure, …), with a given probability of success
• Tossing a coin: head or tail; p = 0.5
• Throwing a 6 with a die (success) compared to throwing any other number; p = 1/6
• Conducting a survey: is the head of the household male or female?; calculate p from the proportion found in the collected data
• Screening of cardiovascular diseases: p(disease) = 43 out of 100 individuals = 0.43
• In GenStat
• Logistic function
• Sigmoid form
• Linear in the middle
• The probability is restricted to between 0 and 1
• Small values: flattens towards 0; large values: flattens towards 1
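These properties can be seen by evaluating the logistic function at a few points. A minimal sketch; the function name and the chosen x values are for illustration only.

```python
import math

def logistic(x):
    """Logistic function: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Flat near 0 for small x, roughly linear around x = 0, flat near 1 for large x.
for x in (-6, -2, -1, 0, 1, 2, 6):
    print(x, round(logistic(x), 4))
```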
• GenStat output
• Similar, but ‘deviance’ instead of ‘variance’ and χ2 test instead of F test
• Model:
• Logit(CHD) = -5.31 + 0.1109 AGE
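To read the fitted model as a probability, the logit has to be transformed back with the logistic function. The sketch below uses the two coefficients from the model above; the ages are arbitrary illustrations and the function name is hypothetical.

```python
import math

INTERCEPT = -5.31  # constant of the fitted model
SLOPE = 0.1109     # change in logit(CHD) per year of age

def chd_probability(age):
    """Predicted probability of CHD at a given age under the fitted model."""
    logit = INTERCEPT + SLOPE * age
    return 1.0 / (1.0 + math.exp(-logit))

for age in (30, 50, 70):  # illustrative ages
    print(age, round(chd_probability(age), 3))

# Age at which the predicted probability is exactly 0.5 (logit = 0).
age_at_half = -INTERCEPT / SLOPE
print(round(age_at_half, 1))
```

The predicted probability rises with age but never leaves the interval from 0 to 1, which is what the straight-line regression could not guarantee at the extremes.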
• Binomial distribution: when we repeat the Bernoulli process, the successes and failures can occur in different orders
• Example: head of household in a survey
• Calculation of probabilities if success = female-headed household with p = 0.2
• Calculated probabilities for obtaining success
• We can now construct a frequency distribution of
obtaining success
• Probability = long-run frequency = frequency when there are very many data
• = binomial distribution
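The calculation can be sketched for the household example with p = 0.2. The number of households per sample (n = 5) is a hypothetical value chosen for illustration; it is not given on the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.2  # probability of a female-headed household
n = 5    # hypothetical number of households surveyed

distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(distribution):
    print(k, round(prob, 4))
```

The factor comb(n, k) counts the different orders in which k successes can occur among n trials.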
• Binomial distribution
• Counts of a categorical variable
• Example: experiment of survival of trees from
different provenances
• File: survival trees.xls
• Several approaches possible: 1, 2 and 3
• The Bernoulli distribution is a special case of the
binomial distribution
• There exist ‘families of distributions’.
• There is of course a difference in the variability that is explained (approaches 1, 2 and 3).
Example 2. Modelling counts
• We used logistic regression to analyse counts.
• Bernoulli distribution: distribution of success of
events that follow a Bernoulli process (1 or 0,
yes or no)
• Binomial distribution: distribution of possible
(and independent) combinations of Bernoulli
events
• So, more like analysis of proportions.
• Next: Poisson distribution: distribution of counts
of Bernoulli events
• Poisson distribution: distribution of counts of
Bernoulli events
• BUT:
• p is very small
• n is very big
• p*n < 5
• Events happen randomly and independently of each other.
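Under these conditions the binomial probabilities are closely approximated by Poisson probabilities with mean n*p. The values of n and p below are hypothetical and only chosen to satisfy the conditions listed above.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of a count of k under a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.002  # hypothetical: n very big, p very small, n*p = 2 < 5
lam = n * p

for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))

# Largest difference between the two distributions over k = 0..10.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
```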
• Poisson distribution = distribution of rare events
• Number of civil airplane crashes (when there is
no war) in the whole world during several
years.
• Number of infected seeds in seed lots that are
certified by a controlling agency.
• Number of individuals of a rare tree species in
a square kilometre in the same Agro Ecological
Zone.
THUS
• The distribution that best describes counts is not
automatically a Poisson distribution.
• It depends on the context.
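Some mathematical statistics
For a Poisson distribution the variance equals the mean, so the ratio variance/mean must be 1.
Poisson index in GenStat: (s2-m)/m, with m the sample mean and s2 the sample variance; it is 0 when the variance equals the mean.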
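The index is easy to compute by hand. The sketch below applies the formula (s2-m)/m to a made-up set of counts; the function name and the data are hypothetical, and the code is not GenStat's own implementation.

```python
from statistics import mean, variance

def poisson_index(counts):
    """(s2 - m) / m: zero when the sample variance equals the sample mean."""
    m = mean(counts)
    s2 = variance(counts)  # sample variance, divisor n - 1
    return (s2 - m) / m

counts = [0, 1, 2, 1, 0, 3, 1, 2, 0, 1]  # hypothetical counts
index = poisson_index(counts)
print(round(index, 4))
```

A value clearly above 0 points to more variability than a Poisson distribution allows (overdispersion); a value clearly below 0 points to less.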
We have already briefly seen another analysis of counts: the χ2 test
χ2 test: is there evidence of an association between two discrete variables?
H0: no association
H1: association
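As a reminder of the mechanics, the Pearson χ2 statistic compares the observed counts with the counts expected under H0 (row total × column total / grand total). The 2 × 2 table below is made up for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count expected under H0
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

table = [[30, 10], [20, 40]]  # hypothetical counts
chi2, df = chi_square_statistic(table)
print(round(chi2, 3), df)
```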
We could use another kind of probability to
calculate the test statistic
But now we look at the table in another way. If
we consider the counts in the table as a variable,
we could construct a frequency distribution.
Long run frequency distribution = probability
distribution
We just expanded the binomial distribution into
the multinomial distribution.
Binomial distribution:
• Independent observations
• p success = everywhere the same. The
probability that an individual observation falls
into a specific cell of the table is the same for
all cells.
Multinomial distribution:
• + The total number of observations is fixed.
If the total number of observations is not fixed => Poisson distribution
BUT
Thanks to a lot of difficult statistical theory: we can also use the Poisson distribution even if the total number of observations is fixed.
CONCLUSION
The context remains important when deciding whether we can use the Poisson distribution to analyse counts (‘distribution of rare events’).
But generally:
Analysis of ‘multiway contingency tables’ =>
Poisson distribution + logarithm link
= LOGLINEAR MODELING
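For a two-way table, the loglinear model can be written as follows. The notation is assumed here for illustration: Y_ij is the count in row i and column j of a table classified by two factors A and B, and μ_ij is its expected value.

```latex
Y_{ij} \sim \text{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j}
\quad \text{(no association)}

\log \mu_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{AB}_{ij}
\quad \text{(with the interaction, i.e. association)}
```

The logarithm turns the multiplicative structure of expected cell counts into an additive one, which is why the systematic component can again be treated as a linear combination.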
Analysis of counts =
• Often we can use the Poisson distribution
• But not always
Example 2. Loglinear modelling
Adding interactions
χ2 test = loglinear modelling
Modelling of complex datasets:
• Adding or dropping terms and interactions
in the model and changing their order
• Good model (‘good fit’) when the ‘residual deviance’ becomes almost equal to the number of degrees of freedom (or ‘mean deviance’ ≈ 1)
• At that point we can assume that the remaining residual variability is caused by random variability (noise)
• Adding too many terms: ‘residual deviance’ => 0
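The comparison of residual deviance with degrees of freedom can be illustrated for the simplest loglinear model, the one without interaction in a two-way table. For that model the fitted counts are row total × column total / grand total, so the residual deviance can be computed directly. The table is made up; this is a sketch and not the GenStat output.

```python
from math import log

def independence_deviance(table):
    """Residual deviance and degrees of freedom of the no-interaction loglinear model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    deviance = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            fitted = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero count contributes nothing to the sum
                deviance += 2 * observed * log(observed / fitted)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return deviance, df

table = [[30, 10], [20, 40]]  # hypothetical counts
deviance, df = independence_deviance(table)
print(round(deviance, 3), df, round(deviance / df, 3))
```

Here the residual deviance is far larger than the degrees of freedom, so the model without interaction fits poorly and the interaction term should be added. Adding it to a two-way table uses up all degrees of freedom and brings the residual deviance to 0.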
Example: lambs.xls