Bayesian estimation
Why and How to Run Your
First Bayesian Model
Rens van de Schoot
rensvandeschoot.com
Classical null hypothesis testing
Wainer: "One Cheer for Null-Hypothesis Significance Testing" (1999; Psychological Methods, 4, 212-213)
… however …
NHT vs. Bayes
Pr(Data | H0) ≠ Pr(Hi | Data)
Bayes Theorem
Pr(Hi | Data) = Pr(Data | Hi) × Pr(Hi) / Pr(Data)
Posterior ∝ prior × data
Posterior probability is proportional to the product of the prior probability and the likelihood.
Bayes theorem: prior, data and posterior
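That proportionality can be made concrete with a small numerical sketch (not part of the original slides; the prior, the measurement SD of 10, and the three scores are made-up values): evaluate prior × likelihood on a grid of candidate IQ values and normalize.

import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

iq_grid = np.linspace(40, 180, 1401)                    # candidate values for the true IQ
prior = normal_pdf(iq_grid, 100, 15)                    # prior belief: IQ ~ N(100, 15^2)
scores = np.array([105.0, 98.0, 112.0])                 # made-up observed test scores
# likelihood of the scores for every candidate IQ, assuming a measurement SD of 10
likelihood = np.prod(normal_pdf(scores, iq_grid[:, None], 10), axis=1)

posterior = prior * likelihood                          # posterior ∝ prior × likelihood
posterior /= posterior.sum()                            # normalize over the (evenly spaced) grid
print("posterior mean:", (iq_grid * posterior).sum())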
Intelligence (IQ)
[Figure: the IQ scale drawn as an axis from -∞ to ∞]
Prior Knowledge 1
[Figure: a first prior drawn on the IQ axis from -∞ to ∞]
Intelligence Interval   Cognitive Designation
40 – 54                 Severely challenged (< 1% of test takers)
55 – 69                 Challenged (2.3% of test takers)
70 – 84                 Below average
85 – 114                Average (68% of test takers)
115 – 129               Above average
130 – 144               Gifted (2.3% of test takers)
145 – 159               Genius (less than 1% of test takers)
160 – 175               Extraordinary genius
[Figures: Prior Knowledge 1–5, five different prior distributions drawn on the IQ axis (40–180, with 100 marked), followed by an overview slide showing all five priors together on one axis]
[Figures: three slides on the IQ axis building up the ingredients of Bayes theorem — the Prior alone, then the Data next to the Prior, then the Posterior together with the Data and the Prior]
Prior – Data
[Figures: two slides on the IQ axis (40–180) showing the prior and the data together]
How to obtain the posterior?
In complex models the posterior is often intractable (impossible to compute exactly)
Solution: approximate the posterior by simulation
– Simulate many draws from the posterior distribution
– Compute the mode, median, mean, 95% interval, et cetera from the simulated draws (see the sketch below)
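As a minimal sketch of that last point (the draws below are made up rather than produced by a real sampler), posterior summaries are just descriptive statistics of the stored draws:

import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(102, 3, size=10_000)      # made-up draws standing in for posterior output

print("mean        :", draws.mean())
print("median      :", np.median(draws))
print("95% interval:", np.percentile(draws, [2.5, 97.5]))
counts, edges = np.histogram(draws, bins=100)          # crude mode: midpoint of the fullest bin
i = counts.argmax()
print("mode (approx):", (edges[i] + edges[i + 1]) / 2)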
ANOVA example
4 unknown parameters μj (j=1,...,4) and
one common but unknown σ2.
Statistical model:
Y = μ1*D1 + μ2*D2 + μ3*D3 + μ4*D4 + E
with E ~ N(0, σ2)
The Gibbs sampler
Specify the priors:
Pr(μ1, μ2, μ3, μ4, σ2)
Prior(μj) ~ N(μ0, var0), here N(0, 10000)
Prior(σ2) ~ IG(0.001, 0.001), an inverse gamma with shape a = 0.001 and scale b = 0.001
The Gibbs sampler
Combining the prior with the likelihood gives the posterior:
Post(μ1, μ2, μ3, μ4, σ2 | data)
…this is a 5-dimensional distribution…
The Gibbs sampler
Iterative evaluation via conditional distributions:
Post(μ1 | μ2, μ3, μ4, σ2, data) ∝ Prior(μ1) × Data(μ1)
Post(μ2 | μ1, μ3, μ4, σ2, data) ∝ Prior(μ2) × Data(μ2)
Post(μ3 | μ1, μ2, μ4, σ2, data) ∝ Prior(μ3) × Data(μ3)
Post(μ4 | μ1, μ2, μ3, σ2, data) ∝ Prior(μ4) × Data(μ4)
Post(σ2 | μ1, μ2, μ3, μ4, data) ∝ Prior(σ2) × Data(σ2)
The Gibbs sampler
1. Assign starting values
2. Sample μ1 from its conditional distribution
3. Sample μ2 from its conditional distribution
4. Sample μ3 from its conditional distribution
5. Sample μ4 from its conditional distribution
6. Sample σ2 from its conditional distribution
7. Go to step 2 until enough iterations have been run (a minimal sketch of this algorithm follows below)
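The slides give no code, but a minimal Python sketch of this scheme for the ANOVA model above could look as follows; the generated data, group sizes and number of iterations are assumptions for illustration, while the priors match the ones specified earlier. With a normal prior on each μj and an inverse-gamma prior on σ2, every conditional posterior has a known closed form:

import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 4 groups of 25 observations with arbitrary true means and SD 1
y = np.concatenate([rng.normal(m, 1.0, 25) for m in (4.0, 3.5, 5.0, 6.5)])
g = np.repeat(np.arange(4), 25)              # group indicator (plays the role of D1..D4)

mu0, var0 = 0.0, 10_000.0                    # prior mu_j ~ N(0, 10000)
a0, b0 = 0.001, 0.001                        # prior sigma2 ~ IG(0.001, 0.001)

n_iter = 2000
mu, sigma2 = np.zeros(4), 1.0                # step 1: starting values
draws = np.empty((n_iter, 5))

for t in range(n_iter):
    # steps 2-5: sample each mu_j from its conditional posterior (normal-normal conjugacy)
    for j in range(4):
        yj = y[g == j]
        prec = len(yj) / sigma2 + 1.0 / var0
        mean = (yj.sum() / sigma2 + mu0 / var0) / prec
        mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
    # step 6: sample sigma2 from its conditional posterior (inverse gamma)
    sse = ((y - mu[g]) ** 2).sum()
    sigma2 = 1.0 / rng.gamma(a0 + len(y) / 2, 1.0 / (b0 + sse / 2))
    draws[t] = (*mu, sigma2)                 # step 7: store and repeat

burn_in = n_iter // 2                        # discard the first half as burn-in
print(draws[burn_in:].mean(axis=0))          # posterior means of mu1..mu4 and sigma2

After discarding the burn-in, the rows of draws play exactly the role of the iteration table shown next.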
The Gibbs sampler

Iteration   μ1     μ2     μ3     μ4     σ2
1           3.00   5.00   8.00   3.00   10
2           3.75   4.25   7.00   4.30   8
3           3.65   4.11   6.78   5.55   5
...
15          4.45   3.19   5.08   6.55   1.1
...
199         4.59   3.75   5.21   6.36   1.2
200         4.36   3.45   4.65   6.99   1.3
Trace plot
[Figures: a trace plot of the sampled values across iterations, and the corresponding posterior distribution]
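A trace plot and the posterior histogram can be produced directly from the stored draws; here is a minimal matplotlib sketch, again with made-up draws standing in for real sampler output:

import numpy as np
import matplotlib.pyplot as plt

# Made-up draws standing in for one column of the stored Gibbs output (e.g. mu1)
rng = np.random.default_rng(3)
draws_mu1 = rng.normal(4.5, 0.3, 1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.plot(draws_mu1)                          # trace plot: sampled value per iteration
ax1.set(xlabel="iteration", ylabel="mu1")
ax2.hist(draws_mu1, bins=40)                 # histogram approximates the posterior distribution
ax2.set(xlabel="mu1", ylabel="count")
plt.tight_layout()
plt.show()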
Burn In
The Gibbs sampler must run t 'burn-in' iterations before it reaches the target distribution f(Z)
– How many iterations are needed to converge on the target distribution?
Diagnostics
– Examine a graph of the burn-in
– Try different starting values
– Run several chains in parallel
Convergence
[Figures: five slides of trace plots illustrating convergence of the chains]
Conclusion about convergence
Burn-in: Mplus deletes the first half of each chain
Run multiple chains (Mplus default: 2)
– Decrease BCONVERGENCE: the default is .05, but .01 is better
ALWAYS do a graphical evaluation of each and every parameter (see the PSR sketch below)
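Mplus's BCONVERGENCE criterion is based on the potential scale reduction (PSR) factor computed across chains. As a rough illustration of what that diagnostic does (the two chains below are made up), here is a minimal sketch of the basic Gelman–Rubin computation for a single parameter:

import numpy as np

def potential_scale_reduction(chains):
    # chains: array of shape (n_chains, n_draws) for a single parameter
    n = chains.shape[1]
    b = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    var_hat = (n - 1) / n * w + b / n          # pooled estimate of the posterior variance
    return np.sqrt(var_hat / w)                # values close to 1 suggest convergence

# Two made-up chains for one parameter; in practice these come from the sampler.
rng = np.random.default_rng(0)
chains = rng.normal(4.5, 0.3, size=(2, 1000))
print(potential_scale_reduction(chains))       # ~1.00 here; compare against the chosen cutoff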
Summing up
Probability: degree of belief
Prior: what is known before observing the data
Posterior: what is known after observing the data
Informative prior: a tool to include subjective knowledge
Non-informative prior: tries to express the absence of prior knowledge; the posterior is then mainly determined by the data
MCMC methods: simulation (sampling) techniques to obtain the posterior distribution and all posterior summary measures
Convergence: important to check
IQ
Data are generated: N = 20, mean = 102, SD = 15
[Figures: two slides showing the generated IQ data]
IQ

Prior type          Posterior mean IQ    95% C.I. / C.C.I.
ML                  102.00               94.42 – 109.57
Prior 1 (A)         101.99               94.35 – 109.62
Prior 2a (M or A)   101.99               94.40 – 109.42
Prior 2b (M or A)   101.99               94.89 – 109.07
Prior 2c (M or A)   102.00               100.12 – 103.87
Prior 3 (A)         102.03               94.22 – 109.71
Prior 4 (W)         102.00               97.76 – 106.80
Prior 5 (W)         102.00               100.20 – 103.90
Prior 6a (W)        99.37                92.47 – 106.10
Prior 6b (W)        86.56                80.17 – 92.47

Prior variances used ranged from large (SD = 100) through medium (SD = 10) to small (SD = 1).
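The pattern in this table follows from the conjugate normal–normal update for a mean with known SD. The sketch below reproduces the qualitative behaviour; the prior mean of 100 and the three prior SDs are illustrative assumptions, not the exact priors behind the table:

import numpy as np

# Conjugate normal-normal update for a mean with known SD.
# Data summary matches the slides: N = 20, sample mean = 102, SD = 15.
n, ybar, sd = 20, 102.0, 15.0

for prior_sd in (100.0, 10.0, 1.0):           # large, medium and small prior variance
    prior_mean = 100.0                        # illustrative prior mean (an assumption)
    post_prec = n / sd**2 + 1 / prior_sd**2   # posterior precision = data precision + prior precision
    post_mean = (n * ybar / sd**2 + prior_mean / prior_sd**2) / post_prec
    post_sd = np.sqrt(1 / post_prec)
    lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
    print(f"prior SD {prior_sd:5.1f}: posterior mean {post_mean:6.2f}, 95% interval {lo:6.2f} - {hi:6.2f}")

With a small prior variance the interval shrinks sharply, which is exactly what the narrow intervals in the table show.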
Uncertainty in Classical Statistics
Uncertainty = sampling distribution
– Estimate the population parameter θ by θ̂
– Imagine drawing an infinity of samples
– Distribution of θ̂ over samples
Problem: we have only one sample
– Estimate θ̂ and its sampling distribution
– Estimate the 95% confidence interval
Inference in Classical Statistics
What does a 95% confidence interval actually mean?
– Over an infinity of samples, 95% of these contain the true population value θ
– But we have only one sample
– We never know whether our present estimate θ̂ and confidence interval is one of those 95% or not
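A quick simulation makes the long-run statement concrete (a sketch under an assumed population, N(102, 15), with repeated samples of size 20):

import numpy as np

# Assumed population for illustration: IQ ~ N(102, 15); repeated samples of size 20
rng = np.random.default_rng(7)
true_mean, n, reps, hits = 102.0, 20, 10_000, 0

for _ in range(reps):
    sample = rng.normal(true_mean, 15.0, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    hits += lo <= true_mean <= hi            # did this interval cover the true value?

print(hits / reps)                           # close to 0.95 (a bit lower, since 1.96 ignores the t correction)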
Inference in Classical Statistics
What does a 95% confidence interval NOT mean?
It does NOT mean that we have a 95% probability that the true population value θ is within the limits of our confidence interval.
We only have an aggregate assurance that, in the long run, 95% of our confidence intervals contain the true population value.
Uncertainty in Bayesian Statistics
Uncertainty = probability distribution for the population parameter
In classical statistics the population parameter θ has one single true value
In Bayesian statistics we imagine a distribution of possible values of the population parameter θ
Inference in Bayesian Statistics
What does a 95% central credibility interval mean?
We have a 95% probability that the population value θ is within the limits of our credibility interval
What have we learned so far?
Results are a compromise of prior and data
However, pay attention to:
-> non/low-informative priors
-> informative priors
-> misspecification of the prior
-> convergence
Results are easier to communicate
(e.g., the CCI compared to the confidence interval)
Software
WinBUGS / OpenBUGS
– Bayesian inference Using Gibbs Sampling
– Very general, but the user must set up the model
R packages
– LearnBayes, R2WinBUGS, MCMCpack
MLwiN
– Special implementation for multilevel regression
AMOS
– Special implementation for SEM
Mplus
– Very general (SEM + ML + many other models)
MPLUS - ML
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS ML;
MODEL:
[IQ];
MPLUS – BAYES: default settings
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
[IQ];
MPLUS – BAYES: default settings
Prior for IQ:
Prior mean = 0
Prior variance = 10^10
[Figure: the default prior for IQ, a normal distribution centered at 0]
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
[IQ] (p1);
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
[IQ] (p1);
MODEL PRIOR:
p1 ~ N(a,b);
a = prior mean
b = prior precision
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
[IQ] (p1);
MODEL PRIOR:
p1 ~ N(100,10);
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
MODEL:
[IQ] (p1);
MODEL PRIOR:
p1 ~ N(100,10);
PLOT: type is plot2;
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
CHAINS = 4;
BITERATIONS = (1000);
BCONVERGENCE = .01;
MODEL:
[IQ] (p1);
MODEL PRIOR:
p1 ~ N(100,10);
PLOT: type is plot2;
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE:
NAMES ARE IQ;
ANALYSIS:
ESTIMATOR IS BAYES;
CHAINS = 4;
BITERATIONS = (1000);
BCONVERGENCE = .01;
MODEL:
[IQ] (p1);
MODEL PRIOR:
p1 ~ N(100,10);
PLOT: type is plot2;
OUTPUT: stand sampstat
TECH4 TECH8;
Bayesian updating
Dynamic interactionism: adolescents are believed to develop through a dynamic and reciprocal transaction between personality and the environment.
Bayesian updating
In 1998, Asendorpf and Wilpers stated that
"empirical evidence on the relative strength of personality effects on relationships and vice versa is surprisingly limited"
Back in 1998, there had been very few longitudinal studies of personality development. Personality was not often used as an outcome variable because it was seen as stable.
These authors were the first to investigate personality and relationships over time in a sample of young students (n = 132) after their transition to university. The main conclusion of their analyses was that personality influenced change in social relationships, but not vice versa.
Bayesian updating
In 2001, Neyer and Asendorpf replicated the personality–relationship
model, but now using a large representative sample of young
adults
Based on the previous results Neyer and Asendorpf
"[…] hypothesized that personality effects would have a clear superiority over relationships effects"
In line with Asendorpf and Wilpers, they concluded that
“Path analyses showed that once initial correlations were
controlled, personality traits predicted change in various
aspects of social relationships, whereas effects of antecedent
relationships on personality were rare and restricted to very
specific relationships with one's pre-school children"
Bayesian updating
[Figure: cross-lagged panel model. β1: T1 → T2 Extraversion (stability); β2: T1 → T2 Friends (stability); β3: T1 Extraversion → T2 Friends, hypothesized to be > 0; β4: T1 Friends → T2 Extraversion, hypothesized to be 0; e1, e2 residuals; r1, r2 correlations]
Bayesian updating
In 2003 Asendorpf and van Aken continued working on studies of personality–relationship transaction. The authors stated that
"The aim of the present study was to apply the methodology
used by Asendorpf and Wilpers (1998) and Neyer and
Asendorpf (2001) to the study of personality–relationship
transaction over adolescence, to try to replicate key
findings of these earlier studies, particularly the dominance
of […] traits over relationship quality“
Asendorpf and van Aken confirmed previous findings:
"The stronger effect was an extraversion effect on perceived
support from peers. This result replicates, once more, similar
findings in adulthood." (p.653)
Bayesian updating
In 2010, Sturaro, Denissen, van Aken, and Asendorpf, once again,
investigated the personality–relationship transaction model
Sturaro et al. found some contradictory results compared to the
previously described studies
"[The Five-Factor theory] predicts significant paths from personality
to change in social relationship quality, whereas it does not
predict social relationship quality to have an impact on personality
change. Contrary to our expectation, however, personality did not
predict changes in relationship quality"
Bayesian updating
In conclusion, the four papers described above clearly illustrate
how theory building works in daily practice.
Asendorpf and Wilpers (1998) started with testing theoretical
ideas on the association between personality and social
relationships, tracing back to McCrae and Costa (1996),
and although their results were replicated by Neyer and
Asendorpf (2001), and Asendorpf and van Aken (2003),
Sturaro, Denissen, van Aken, and Asendorpf (2010) were not
able to do so. This latter finding led to re-formulations of the
original theoretical ideas.
Bayesian updating
Why not update the results instead of testing the null hypothesis
over and over again?
Let's use Bayesian updating and impose subjective priors.
In the first scenario we only focus on those data sets with similar age groups.
Therefore we first re-analyze the data of Neyer and Asendorpf (2001) without using prior knowledge. Thereafter, we re-analyze the data of Sturaro et al. (2010) using prior information based on the data of Neyer and Asendorpf; both data sets contain young adults between 17 and 30 years of age.
Bayesian updating
In the second scenario we assume the relation between personality and social relationships is independent of age, and we re-analyze the data of Sturaro et al. using prior information taken from Neyer and Asendorpf and from Asendorpf and van Aken.
In this second scenario we make a strong assumption, namely that the cross-lagged effects for young adolescents are equal to the cross-lagged effects of young adults.
This assumption implies similar developmental trajectories across age groups and amounts to a full replication study.
Bayesian updating
[Figure: the same cross-lagged panel model as before, with β3 (T1 Extraversion → T2 Friends) hypothesized to be > 0 and β4 (T1 Friends → T2 Extraversion) hypothesized to be 0]
Scenario 1
Model 1: Neyer & Asendorpf data, without prior knowledge

     Estimate (SD)    95% PPI
β1   0.605 (0.037)    0.532 – 0.676
β2   0.293 (0.047)    0.199 – 0.386
β3   0.131 (0.046)    0.043 – 0.222
β4   -0.026 (0.039)   -0.100 – 0.051
Scenario 1
Model 1: Neyer & Asendorpf data, without prior knowledge
Model 2: Sturaro et al. data, without prior knowledge

     Model 1                           Model 2
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.605 (0.037)   0.532 – 0.676     0.291 (0.063)   0.169 – 0.424
β2   0.293 (0.047)   0.199 – 0.386     0.157 (0.103)   -0.042 – 0.364
β3   0.131 (0.046)   0.043 – 0.222     0.029 (0.079)   -0.132 – 0.180
β4   -0.026 (0.039)  -0.100 – 0.051    0.303 (0.081)   0.144 – 0.462
Scenario 1
Model 1: Neyer & Asendorpf data, without prior knowledge
Model 2: Sturaro et al. data, without prior knowledge
Model 3: Sturaro et al. data, with priors based on Model 1

     Model 1                           Model 2                           Model 3
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.605 (0.037)   0.532 – 0.676     0.291 (0.063)   0.169 – 0.424     0.337 (0.058)   0.228 – 0.449
β2   0.293 (0.047)   0.199 – 0.386     0.157 (0.103)   -0.042 – 0.364    0.287 (0.082)   0.130 – 0.448
β3   0.131 (0.046)   0.043 – 0.222     0.029 (0.079)   -0.132 – 0.180    0.106 (0.072)   -0.038 – 0.247
β4   -0.026 (0.039)  -0.100 – 0.051    0.303 (0.081)   0.144 – 0.462     0.249 (0.067)   0.111 – 0.375
Scenario 2
Model 4: Asendorpf & van Aken data, without prior knowledge

     Estimate (SD)   95% PPI
β1   0.512 (0.069)   0.376 – 0.649
β2   0.115 (0.083)   -0.049 – 0.277
β3   0.217 (0.106)   0.006 – 0.426
β4   0.072 (0.055)   -0.036 – 0.179
Scenario 2
Model 4: Asendorpf & van Aken data, without prior knowledge
Model 5: Asendorpf & van Aken data, with priors based on Model 1

     Model 4                           Model 5
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.512 (0.069)   0.376 – 0.649     0.537 (0.059)   0.424 – 0.654
β2   0.115 (0.083)   -0.049 – 0.277    0.140 (0.071)   0.005 – 0.283
β3   0.217 (0.106)   0.006 – 0.426     0.212 (0.079)   0.057 – 0.361
β4   0.072 (0.055)   -0.036 – 0.179    0.073 (0.051)   -0.030 – 0.171
Scenario 2
Model 4: Asendorpf & van Aken data, without prior knowledge
Model 5: Asendorpf & van Aken data, with priors based on Model 1
Model 6: Sturaro et al. data, with priors based on Model 5

     Model 4                           Model 5                           Model 6
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.512 (0.069)   0.376 – 0.649     0.537 (0.059)   0.424 – 0.654     0.313 (0.059)   0.199 – 0.427
β2   0.115 (0.083)   -0.049 – 0.277    0.140 (0.071)   0.005 – 0.283     0.246 (0.087)   0.079 – 0.420
β3   0.217 (0.106)   0.006 – 0.426     0.212 (0.079)   0.057 – 0.361     0.100 (0.076)   -0.052 – 0.248
β4   0.072 (0.055)   -0.036 – 0.179    0.073 (0.051)   -0.030 – 0.171    0.259 (0.070)   0.116 – 0.393
Final results Sturaro et al.
Scenario 1 – Model 3: Sturaro et al. data, with priors based on Model 1
Scenario 2 – Model 6: Sturaro et al. data, with priors based on Model 5

     Model 3                           Model 6
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.337 (0.058)   0.228 – 0.449     0.313 (0.059)   0.199 – 0.427
β2   0.287 (0.082)   0.130 – 0.448     0.246 (0.087)   0.079 – 0.420
β3   0.106 (0.072)   -0.038 – 0.247    0.100 (0.076)   -0.052 – 0.248
β4   0.249 (0.067)   0.111 – 0.375     0.259 (0.070)   0.116 – 0.393
Final results Sturaro et al.
Model 2: Sturaro et al. data, without prior knowledge
Scenario 1 – Model 3: Sturaro et al. data, with priors based on Model 1
Scenario 2 – Model 6: Sturaro et al. data, with priors based on Model 5

     Model 2                           Model 3                           Model 6
     Estimate (SD)   95% PPI           Estimate (SD)   95% PPI           Estimate (SD)   95% PPI
β1   0.291 (0.063)   0.169 – 0.424     0.337 (0.058)   0.228 – 0.449     0.313 (0.059)   0.199 – 0.427
β2   0.157 (0.103)   -0.042 – 0.364    0.287 (0.082)   0.130 – 0.448     0.246 (0.087)   0.079 – 0.420
β3   0.029 (0.079)   -0.132 – 0.180    0.106 (0.072)   -0.038 – 0.247    0.100 (0.076)   -0.052 – 0.248
β4   0.303 (0.081)   0.144 – 0.462     0.249 (0.067)   0.111 – 0.375     0.259 (0.070)   0.116 – 0.393
Conclusions
The updating procedure in both scenarios leads us to conclude that using subjective priors shortens the intervals.
=> More certainty about the relations
However…
Conclusions
Using subjective priors never changed the real issue, namely that Sturaro et al. found opposite effects to Neyer and Asendorpf.
The results supported the robustness of the conclusion that effects occurring between ages 17 and 23 are different from those occurring between ages 18 and 30, i.e., the clearly higher age range in the Neyer and Asendorpf data.
Overall Conclusions
An excellent tool to include prior knowledge if it is available
Estimates (including intervals) always lie in the sample space if the prior is chosen wisely
Results are easier to communicate
Better small-sample performance; large-sample theory is not needed
Analyses can be made less computationally demanding
BUT: Bayes does not solve misspecification of the model