Statistical Inference


Introduction to Statistical Inference:
A Comparison of Classical Orthodoxy with the Bayesian Approach

Statistical Inference
• Statistical inference is a problem in which data have
been generated in accordance with some unknown
probability distribution, which must be analyzed so that
some type of inference about the unknown distribution
can be made.
• In other words, in a statistics problem, there are two or
more probability distributions which may have generated
the data.
• By analyzing the data, we attempt to learn about the
unknown distribution, to make some inferences about
certain properties of the distribution, and to determine
the relative likelihood that each possible distribution is
actually the correct one.
Introduction to Classical
Hypothesis Testing
Hypothesis Testing is the standard procedure that social
scientists use to determine the empirical value of their hypotheses.
Hypothesis Testing is a form of proof by statistical contradiction:
evidence is mustered in favor of a theory by demonstrating
that the data would be unlikely to be observed if the postulated
theoretical model were false.
Why do we do it this way?
Epistemological Foundations of
Classical Hypothesis Testing
Foundation 1. There exists one and only one
process that generates the actions of a
population with respect to some variable.
Foundation 2. There are many examples of long-accepted
scientific theories losing credibility, of once-objective
“truths” being rejected.
Foundation 3. If we cannot be sure that a theory is
“true”, then the next best thing is to judge the
probability that a theory is true.
How do we express the probability that
a theory is true?
• We’d like to be able to express our uncertainty as:
P ( Model is True | Observed Data )
• But, based on our epistemological foundations,
we cannot state that the model is true with
Probability X. Either the model is true, or not.
• Instead, we are limited to a knowledge of:
P ( Observed Data | Model is True )
Interpretation of P( Observed Data | Model is True )
If P( Data | Model) is close to one, then the data is
consistent with the model, and we would not reject it as
an objective interpretation of reality.
Hypothesis: men have higher wages than women
Data: The median income for a male is $38,275
The median income for a female is $29,215.
We would say that the data is consistent with the model.
That is, P( Data | Model) is close to one.
Interpretation of probabilities cont.
If P( Data | Model) is not close to one, then the data is
inconsistent with the model’s predictions, and we reject
the model.
Hypothesis: People born in the U.S. have higher incomes
than immigrants.
Data: The median income for someone who is native born
is $42,917.
The median income for a naturalized immigrant is
We would say that the data is not consistent with the
model. That is, P( Data | Model) is not close to one and
the model is not a useful representation of reality.
The Classical Hypothesis Testing Setup
Step 1. Define the Research Hypothesis.
A Research or Alternative Hypothesis is a statement derived from
theory about what the researcher expects to find in the data.
Step 2. Define the Null Hypothesis.
The Null Hypothesis is a statement of what you would not expect to
find if your research or alternative hypothesis were consistent with
reality.
Step 3. Conduct an analysis of the data to determine whether or not you
can reject the null hypothesis with some pre-determined probability.
If you can reject the null hypothesis with some probability, then
the data is consistent with the model.
If you cannot reject the null hypothesis with some probability,
then the data is not consistent with the model.
The Bayesian Approach
Bayesians, in contrast, try to do the following:
1) Make inferences based on all information at our disposal
2) See how new data affects our (old) inferences
3) Need to identify all hypotheses (or states of nature)
that may be true
4) Need to know what each hypothesis (or state of
nature) predicts that we will observe
5) Need to know how to compute the consequences, i.e.,
we need to know how to update our old inferences in
light of our observations
In sum, we try to do statistics the way scientists think.
A schematic representation
of Bayesian reasoning
Epistemic Relationships
Induction and Deduction
Deduction: Deduce outcomes from hypotheses.
If A, then B
A
Therefore, B
Induction: Infer hypotheses from outcomes.
If A, then we are likely to observe B and C
B and C are observed
Therefore, A is supported
How new data supports our hypotheses
Suppose we have the following hypotheses:
H1: A → B
H2: A → C
H3: A → D
What do we infer if we observe A, B, and C?
• We infer that we have evidence in support of
H1 and H2.
• We infer that we have evidence to refute H3.
The key difference between
classical and Bayesian reasoning
The key difference between classical and
Bayesian reasoning is that the Bayesian
believes that knowledge is subjective.
Consequently, the Bayesian rejects the
epistemological foundation that there
exists a “true” data-generating process
that can be revealed through the process of
hypothesis testing.
Motivating Example of Bayesian
Approach to Inference
“What is Bayesian statistics and
why everything else is wrong.”
Paper by Michael Lavine
Discuss a simple example that illustrates:
1) The likelihood principle.
2) Bayesian and classical inference.
3) why classical statistics should not be
used to compare rival hypotheses.
The Example
Cancer at Slater School. Example taken from an article by Paul Brodeur in the New Yorker in Dec. 1992.
 Slater School is an elementary school where the staff was concerned that their
high cancer rate could be due to two nearby high voltage transmission lines.
Key Facts
 there were 8 cases of invasive cancer over a long time among 145 staff members
whose average age was between 40 and 44
 based on the national cancer rate among women this age (approximately 3/100),
the expected number of cancers is 4.2
Assume that:
1) the 145 staff members developed cancer independently of each other
2) the chance of cancer, θ, was the same for each staff person
Therefore, the number of cancers, X, follows a binomial distribution: X ~ bin(145, θ)
How well does each of four simplified competing theories explain the data?
Theory A: θ = .03
Theory B: θ = .04
Theory C: θ = .05
Theory D: θ = .06
The Likelihood of Theories A-D
To compare the theories, we see how well each explains the data.
That is, for each hypothesized θ, we use elementary results about
the binomial distribution to calculate:

Pr(X = 8 | θ) = (145 choose 8) · θ^8 · (1 − θ)^137

Theory A: Pr(X = 8 | θ = .03) ≈ .036
Theory B: Pr(X = 8 | θ = .04) ≈ .096
Theory C: Pr(X = 8 | θ = .05) ≈ .134
Theory D: Pr(X = 8 | θ = .06) ≈ .136
This is a ratio of approximately 1:3:3:4. So, Theory B explains the
data about 3 times as well as theory A.
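As a sketch of this calculation, using only Python's standard library (the helper name `binom_pmf` is ours, not from the slides), the four likelihoods can be computed directly; the results agree closely with the rounded figures above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass: Pr(X = k) for X ~ bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood of the observed 8 cancers among 145 staff under each theory
for theta in (0.03, 0.04, 0.05, 0.06):
    print(f"theta = {theta:.2f}: Pr(X = 8 | theta) = {binom_pmf(8, 145, theta):.3f}")
```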
The Likelihood Principle
Pr(X | θ, 145) = (145 choose X) · θ^X · (1 − θ)^(145 − X)

Initially, Pr(X | θ) is a function of two variables: X and θ.
Once X = 8 has been observed, Pr(X | θ) describes how well each
theory, or value of θ, explains the data. No other value of X is
relevant. This is an example of the likelihood principle.
The Likelihood Principle says that once X has been observed,
say X = x, then no other value of X matters and we should treat
Pr(X | θ) simply as Pr(X = x | θ).
The likelihood principle is central to Bayesian reasoning.
A Bayesian Analysis
There are other sources of information about whether cancer can be
induced by proximity to high-voltage transmission lines.
- Epidemiologists have shown positive correlations between cancer and
proximity to high-voltage transmission lines.
- Other epidemiologists have not found these correlations, and physicists
and biologists maintain that the energy in magnetic fields
associated with high-voltage power lines is too small to have an
appreciable biological effect.
Suppose we judge the pro and con sources equally reliable.
Therefore, Theory A (no effect) is as likely as Theories B, C, and D
together, and we judge Theories B, C, and D to be equally likely.
So, Pr(A) = .5 = Pr(B) + Pr(C) + Pr(D)
Also, Pr(B) = Pr(C) = Pr(D) = 1/6
These quantities will represent our prior beliefs.
Bayes’ Theorem
Based on the definition of conditional probability, we know that:
Pr(A | X = 8) = Pr(A and X = 8) / Pr(X = 8)

= Pr(A) Pr(X = 8 | A) / [ Pr(A) Pr(X = 8 | A) + Pr(B) Pr(X = 8 | B)
+ Pr(C) Pr(X = 8 | C) + Pr(D) Pr(X = 8 | D) ]

= (1/2)(.036) / [ (1/2)(.036) + (1/6)(.096) + (1/6)(.134) + (1/6)(.136) ]

≈ 0.23

Similarly:
Pr(B | X = 8) ≈ .21
Pr(C | X = 8) ≈ .28
Pr(D | X = 8) ≈ .28
Accordingly, we’d say that each of these four theories is roughly equally likely,
and the odds are about 3:1 that the cancer rate at Slater is greater than .03.
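The whole Bayesian update can be sketched in a few lines of standard-library Python (the names `priors`, `likelihoods`, and `binom_pmf` are ours; the posteriors land close to the slide's rounded values):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass: Pr(X = k) for X ~ bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Prior beliefs: Theory A (theta = .03, no effect) is as likely as
# Theories B, C, and D combined; B, C, and D are judged equally likely.
priors = {0.03: 1/2, 0.04: 1/6, 0.05: 1/6, 0.06: 1/6}

# Likelihood of the observed data (8 cancers among 145 staff)
likelihoods = {th: binom_pmf(8, 145, th) for th in priors}

# Bayes' theorem: posterior = prior * likelihood / normalizing constant
evidence = sum(priors[th] * likelihoods[th] for th in priors)
posteriors = {th: priors[th] * likelihoods[th] / evidence for th in priors}

for th, post in posteriors.items():
    print(f"Pr(theta = {th:.2f} | X = 8) = {post:.2f}")
print(f"Pr(theta > .03 | X = 8) = {1 - posteriors[0.03]:.2f}")
```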
A non-Bayesian Analysis
Classical statisticians, to test the null hypothesis H0: θ = .03 against
the alternative hypothesis θ > .03, calculate the p-value, defined as the
probability under H0 of observing an outcome at least as extreme as
that actually observed.
i.e., for the Slater problem, we find:
p-value = Pr(X = 8 | θ = .03) + Pr(X = 9 | θ = .03) + Pr(X = 10 | θ = .03)
+ … + Pr(X = 145 | θ = .03)
≈ .07
Note that under a classical hypothesis test at the .10 level, we would
reject the null hypothesis of no effect from the power lines at Slater.
By comparison, the Bayesian analysis revealed that
Pr(θ > .03 | X = 8) ≈ .77,
which would not be sufficient to reject the null hypothesis.
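The p-value sum above can be checked with standard-library Python alone (the helper name `binom_pmf` is ours); the result is consistent with the slide's ≈ .07:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass: Pr(X = k) for X ~ bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One-sided p-value under H0 (theta = .03): the probability of observing
# 8 or more cancers among 145 staff if the national rate applied.
p_value = sum(binom_pmf(k, 145, 0.03) for k in range(8, 146))
print(f"p-value = {p_value:.3f}")
```

Note that, in line with the Bayesian critique that follows, every summand except the first involves a value of X that was never observed.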
The Bayesian critique of p-values
Bayesians claim that the p-value should not be used to compare
hypotheses because:
1) hypotheses should be compared by how well they explain the data
2) the p-value does not account for how well the alternative
hypotheses explain the data
3) the p-value summands are irrelevant because they don’t describe
how well any hypothesis explains the observed data.
In short, the p-value does not obey the likelihood principle because it
uses Pr(X = x | θ) for values of x other than the observed value x = 8.
The same is true of all classical hypothesis tests and confidence intervals.
Criticisms of the Bayesian Approach
The results are subjective. With only a few observations, the parameter
estimates may be sensitive to the choice of priors. In the Slater School
case, the range of possible posterior probabilities would be large.
Bayesian reply: Bayesians use “diffuse priors,” sensitivity analysis, etc. to
mitigate the influence of priors on their results.
The Bayesian analysis is philosophically unsound. Bayesians treat θ
as a random variable, whereas classical analysis treats θ as a fixed but
unknown constant. The classical analyst believes that θ either is, or is
not, equal to .03.
Bayesian reply: treating θ as random does not necessarily mean that θ is
random; rather, it expresses our uncertainty about θ. It is a statement about
our knowledge of the quantity of interest.