#### Transcript Statistical Inference

Introduction to Statistical Inference: A Comparison of Classical Orthodoxy with the Bayesian Approach

Statistical Inference

• A problem of statistical inference is one in which data generated in accordance with some unknown probability distribution must be analyzed, and some type of inference about that unknown distribution must be made.
• In other words, in a statistics problem, there are two or more probability distributions which may have generated the data.
• By analyzing the data, we attempt to learn about the unknown distribution, to make some inferences about certain properties of the distribution, and to determine the relative likelihood that each possible distribution is actually the correct one.

Introduction to Classical Hypothesis Testing

Hypothesis testing is the standard procedure that social scientists use to determine the empirical value of their theories. It is a form of proof by statistical contradiction: evidence is mustered in favor of a theory by demonstrating that the data would be unlikely to be observed if the postulated theoretical model were false. Why do we do it this way?

Epistemological Foundations of Classical Hypothesis Testing

Foundation 1. There exists one and only one process that generates the actions of a population with respect to some variable.

Foundation 2. There are many examples of long-accepted scientific theories losing credibility. Once-objective “truths” are rejected.

Foundation 3. If we cannot be sure that a theory is “true”, then the next best thing is to judge the probability that a theory is true.

How do we express the probability that a theory is true?

• We’d like to be able to express our uncertainty as: P(Model is True | Observed Data)
• But, based on our epistemological foundations, we cannot state that the model is true with probability X. Either the model is true, or it is not.
• Instead, we are limited to a knowledge of: P(Observed Data | Model is True)

Interpretation of P(Observed Data | Model is True)

If P(Data | Model) is close to one, then the data are consistent with the model, and we would not reject it as an objective interpretation of reality.

Hypothesis: Men have higher wages than women.
Data: The median income for a male is $38,275. The median income for a female is $29,215.

We would say that the data are consistent with the model. That is, P(Data | Model) is close to one.

Interpretation of probabilities (cont.)

If P(Data | Model) is not close to one, then the data are inconsistent with the model’s predictions, and we reject the model.

Hypothesis: People born in the U.S. have higher incomes than immigrants.
Data: The median income for someone who is native born is $42,917. The median income for a naturalized immigrant is $43,968.

We would say that the data are not consistent with the model. That is, P(Data | Model) is not close to one, and the model is not a useful representation of reality.

The Classical Hypothesis Testing Setup

Step 1. Define the Research Hypothesis. A Research or Alternative Hypothesis is a statement derived from theory about what the researcher expects to find in the data.

Step 2. Define the Null Hypothesis. The Null Hypothesis is a statement of what you would not expect to find if your research or alternative hypothesis were consistent with reality.

Step 3. Conduct an analysis of the data to determine whether or not you can reject the null hypothesis with some pre-determined probability. If you can reject the null hypothesis with that probability, then the data are consistent with the model. If you cannot, then the data are not consistent with the model.
The Bayesian Approach

Bayesians, in contrast, try to do the following:

1) Make inferences based on all information at our disposal.
2) See how new data affect our (old) inferences.
3) Identify all hypotheses (or states of nature) that may be true.
4) Know what each hypothesis (or state of nature) predicts that we will observe.
5) Know how to compute the consequences, i.e. how to update our old inferences in light of our observations.

In sum, we try to do statistics the way scientists think.

A schematic representation of Bayesian reasoning

[Figure: diagram with the labels Theory/Hypothesis/Model, Predictions, Observation/Data, Creativity, Deduction, Induction, Inference, Verification/Falsification, and Epistemic Relationships.]

Induction and Deduction

Deduction: Deduce outcomes from hypotheses.

• If A, then B
• A
• Therefore, B

Induction: Infer hypotheses from outcomes.

• If A, then we are likely to observe B and C
• B and C are observed
• Therefore, A is supported

How new data supports our hypotheses

Suppose we have the following hypotheses:

H1: A → B
H2: A → C
H3: A → D

What do we infer if we observe A, B, and C? We infer that we have evidence in support of H1 and H2, and evidence to refute H3.

The key difference between classical and Bayesian reasoning

The key difference between classical and Bayesian reasoning is that the Bayesian believes that knowledge is subjective. Consequently, the Bayesian rejects the epistemological foundation that there exists a “true” data-generating process that can be revealed through a process of elimination.

Motivating Example of Bayesian Approach to Inference

“What is Bayesian statistics and why everything else is wrong.” Paper by Michael Lavine.

Agenda

Discuss a simple example that illustrates:

1) The likelihood principle.
2) Bayesian and classical inference.
3) Why classical statistics should not be used to compare rival hypotheses.

The Example

Cancer at Slater School. Example taken from an article by Paul Brodeur in the New Yorker in Dec. 1992. Slater School is an elementary school where the staff was concerned that their high cancer rate could be due to two nearby high-voltage transmission lines.

Key Facts

• There were 8 cases of invasive cancer over a long time among 145 staff members whose average age was between 40 and 44.
• Based on the national cancer rate among women this age (approximately 3/100), the expected number of cancers is 4.2.

Assumptions:

1) The 145 staff members developed cancer independently of each other.
2) The chance of cancer, $\theta$, was the same for each staff person.

Therefore, the number of cancers, $X$, follows a binomial distribution: $X \sim \mathrm{Bin}(145, \theta)$.

How well does each of four simplified competing theories explain the data?

Theory A: $\theta = .03$
Theory B: $\theta = .04$
Theory C: $\theta = .05$
Theory D: $\theta = .06$

The Likelihood of Theories A-D

To compare the theories, we see how well each explains the data. That is, for each hypothesized $\theta$, we use elementary results about the binomial distribution to calculate:

$$\Pr(X = 8 \mid \theta) = \binom{145}{8}\,\theta^{8}\,(1-\theta)^{137}$$

Theory A: $\Pr(X = 8 \mid \theta = .03) \approx .036$
Theory B: $\Pr(X = 8 \mid \theta = .04) \approx .096$
Theory C: $\Pr(X = 8 \mid \theta = .05) \approx .134$
Theory D: $\Pr(X = 8 \mid \theta = .06) \approx .136$

This is a ratio of approximately 1:3:4:4. So, Theory B explains the data about 3 times as well as Theory A.

The Likelihood Principle

$$\Pr(X \mid \theta, 145) = \binom{145}{X}\,\theta^{X}\,(1-\theta)^{145-X}$$

Initially, $\Pr(X \mid \theta)$ is a function of two variables: $X$ and $\theta$. Once $X = 8$ has been observed, $\Pr(X \mid \theta)$ describes how well each theory, or value of $\theta$, explains the data. No other value of $X$ is relevant. This is an example of the likelihood principle.

The likelihood principle says that once $X$ has been observed, say $X = x$, then no other value of $X$ matters and we should treat $\Pr(X \mid \theta)$ simply as $\Pr(X = x \mid \theta)$. The likelihood principle is central to Bayesian reasoning.
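The likelihood comparison above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the binomial model with 145 staff members and 8 observed cases; the names `binomial_likelihood`, `THEORIES`, `likelihoods`, and `ratios` are illustrative and do not come from the slides. Exact evaluation may differ somewhat from the rounded figures quoted above.

```python
from math import comb

N_STAFF = 145  # number of staff members (from the example)
N_CASES = 8    # observed number of cancers (from the example)


def binomial_likelihood(theta, n=N_STAFF, x=N_CASES):
    """Return Pr(X = x | theta) for X ~ Bin(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


# The four simplified competing theories, each a hypothesized cancer rate.
THEORIES = {"A": 0.03, "B": 0.04, "C": 0.05, "D": 0.06}

# Likelihood of the observed data under each theory.
likelihoods = {name: binomial_likelihood(theta) for name, theta in THEORIES.items()}

# How many times better each theory explains the data than Theory A.
ratios = {name: lik / likelihoods["A"] for name, lik in likelihoods.items()}

for name, theta in THEORIES.items():
    print(
        f"Theory {name}: theta = {theta:.2f}, "
        f"likelihood = {likelihoods[name]:.3f}, "
        f"ratio to A = {ratios[name]:.1f}"
    )
```

Only the observed value X = 8 enters the calculation, which is the likelihood principle in action: no other value of X is ever evaluated.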
A Bayesian Analysis

There are other sources of information about whether cancer can be induced by proximity to high-voltage transmission lines.

- Some epidemiologists show positive correlations between cancer and proximity.
- Other epidemiologists don’t show these correlations, and physicists and biologists maintain that the energy in magnetic fields associated with high-voltage power lines is too small to have an appreciable biological effect.

Suppose we judge the pro and con sources equally reliable. Then Theory A (no effect) is as likely as Theories B, C, and D together, and we judge Theories B, C, and D to be equally likely. So,

$$\Pr(A) \approx .5 \approx \Pr(B) + \Pr(C) + \Pr(D)$$

Also,

$$\Pr(B) \approx \Pr(C) \approx \Pr(D) \approx 1/6$$

These quantities represent our prior beliefs.

Bayes’ Theorem

Based on the definition of conditional probability, we know that:

$$\Pr(A \mid X = 8) = \frac{\Pr(A \text{ and } X = 8)}{\Pr(X = 8)} = \frac{\Pr(A)\Pr(X = 8 \mid A)}{\Pr(A)\Pr(X = 8 \mid A) + \Pr(B)\Pr(X = 8 \mid B) + \Pr(C)\Pr(X = 8 \mid C) + \Pr(D)\Pr(X = 8 \mid D)}$$

$$\Pr(A \mid X = 8) = \frac{(1/2)(.036)}{(1/2)(.036) + (1/6)(.096) + (1/6)(.134) + (1/6)(.136)} \approx 0.23$$

Likewise,

$\Pr(B \mid X = 8) \approx .21$
$\Pr(C \mid X = 8) \approx .28$
$\Pr(D \mid X = 8) \approx .28$

Accordingly, we’d say that each of these four theories is roughly equally likely, and the odds are roughly 3:1 that the cancer rate at Slater is greater than .03.

A non-Bayesian Analysis

Classical statisticians, to test the hypothesis $H_0: \theta = .03$ against the alternative hypothesis $H_a: \theta > .03$, calculate the p-value, defined as the probability under $H_0$ of observing an outcome at least as extreme as that actually observed. For the Slater problem, we find:

$$\text{p-value} = \Pr(X = 8 \mid \theta = .03) + \Pr(X = 9 \mid \theta = .03) + \dots + \Pr(X = 145 \mid \theta = .03) \approx .07$$

Note that under a classical hypothesis test, a p-value of about .07 would lead us to reject the null hypothesis of no effect from the power lines at Slater at the 10% significance level, though not at the 5% level. By comparison, the Bayesian analysis gives $\Pr(\theta > .03 \mid X = 8) \approx .77$, which would not be sufficient to reject the null hypothesis.
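Both analyses can be checked numerically. The sketch below assumes the binomial model and the prior weights stated above (1/2 for Theory A, 1/6 each for B, C, and D); the function names `binom_pmf`, `posterior`, and `upper_tail_p_value` are illustrative, not part of the original slides. Because it evaluates the binomial probabilities exactly, the posterior values it prints may differ slightly from the rounded figures quoted above.

```python
from math import comb

N, X_OBS = 145, 8  # staff size and observed cancers (from the example)

# Hypothesized cancer rates and the prior weight given to each theory.
THETAS = {"A": 0.03, "B": 0.04, "C": 0.05, "D": 0.06}
PRIOR = {"A": 1 / 2, "B": 1 / 6, "C": 1 / 6, "D": 1 / 6}


def binom_pmf(x, n, theta):
    """Return Pr(X = x | theta) for X ~ Bin(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def posterior(prior, thetas, x=X_OBS, n=N):
    """Apply Bayes' theorem over a finite set of theories."""
    # Numerators: prior times likelihood of the observed count.
    joint = {k: prior[k] * binom_pmf(x, n, thetas[k]) for k in thetas}
    # Denominator: Pr(X = x), summed over all theories.
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}


def upper_tail_p_value(theta0, x=X_OBS, n=N):
    """Return Pr(X >= x | theta0), the one-sided classical p-value."""
    return sum(binom_pmf(k, n, theta0) for k in range(x, n + 1))


post = posterior(PRIOR, THETAS)
p_value = upper_tail_p_value(0.03)
prob_rate_above_003 = 1 - post["A"]  # Pr(theta > .03 | X = 8)

for name, value in post.items():
    print(f"Pr({name} | X = 8) = {value:.2f}")
print(f"Pr(theta > .03 | X = 8) = {prob_rate_above_003:.2f}")
print(f"p-value = {p_value:.3f}")
```

Note the structural difference: `posterior` touches only the observed count, while `upper_tail_p_value` sums over every count from 8 to 145, including values that were never observed.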
The Bayesian critique of p-values

Bayesians claim that the p-value should not be used to compare hypotheses because:

1) Hypotheses should be compared by how well they explain the data.
2) The p-value does not account for how well the alternative hypotheses explain the data.
3) The p-value summands are irrelevant because they do not describe how well any hypothesis explains any observed data.

In short, the p-value does not obey the likelihood principle because it uses $\Pr(X = x \mid \theta)$ for values of $x$ other than the observed value $x = 8$. The same thing is true of all classical hypothesis tests and confidence intervals.

Criticisms of the Bayesian Approach

1) The results are subjective. With only a few observations, the parameter estimates may be sensitive to the choice of priors. In the Slater School case, the range of possible posterior probabilities would be large.

Bayesian reply: Bayesians use “diffuse priors,” sensitivity analysis, etc. to mitigate the influence of priors on their results.

2) The Bayesian analysis is philosophically unsound. Bayesians treat $\theta$ as a random variable, whereas classical analysis treats $\theta$ as a fixed but unknown constant. The classical analyst believes that $\theta$ either is, or is not, equal to .03.

Bayesian reply: Treating $\theta$ as random does not necessarily mean that $\theta$ is random; rather, it expresses our uncertainty about $\theta$. It is a statement about our knowledge of the quantity of interest.
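The first criticism and the sensitivity-analysis reply can be illustrated with a small sketch. It assumes the Slater School binomial model and varies the prior weight on the no-effect theory ($\theta = .03$), splitting the remaining weight equally among the rates .04, .05, and .06. The grid of prior weights and the function names `binom_pmf` and `posterior_no_effect` are hypothetical choices made for illustration; they are not taken from the slides.

```python
from math import comb

N, X_OBS = 145, 8  # staff size and observed cancers (from the example)
THETA_NULL = 0.03  # the "no effect" cancer rate
THETA_ALTERNATIVES = (0.04, 0.05, 0.06)  # the remaining theories


def binom_pmf(x, n, theta):
    """Return Pr(X = x | theta) for X ~ Bin(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


def posterior_no_effect(prior_null):
    """Posterior probability of theta = .03 given X = 8.

    The no-effect theory receives prior weight `prior_null`; the
    alternatives share the remaining weight equally (an assumption
    made for this illustration).
    """
    prior_alt = (1 - prior_null) / len(THETA_ALTERNATIVES)
    joint_null = prior_null * binom_pmf(X_OBS, N, THETA_NULL)
    joint_alt = sum(prior_alt * binom_pmf(X_OBS, N, t) for t in THETA_ALTERNATIVES)
    return joint_null / (joint_null + joint_alt)


# Hypothetical grid of prior weights on the no-effect theory.
PRIOR_GRID = (0.1, 0.3, 0.5, 0.7, 0.9)
sensitivity = {p: posterior_no_effect(p) for p in PRIOR_GRID}

for prior_null, post_null in sensitivity.items():
    print(f"prior Pr(no effect) = {prior_null:.1f} -> posterior = {post_null:.2f}")
```

With only one observed count, the posterior tracks the prior closely, which is the substance of the criticism; reporting the whole table rather than a single number is one simple form of the sensitivity analysis mentioned in the reply.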