#### Transcript Producing Data: Sampling

Producing Data: Sampling
Inference: Probabilities and Distributions
Week 8, March 1-2, 2011 (REVISED March 8 22:30)

To start…
• To start today I just want to briefly go over the material we did last time on Chapter 8 and then spend the bulk of tonight on material from Chapters 10 and 11.

Producing Data
• MY BEST ADVICE REGARDING THE GENERATION OF DATA SUITABLE FOR STATISTICAL ANALYSIS IS DON’T!!!!!!!
• HOWEVER, having noted that, you do need to be aware of how people go about collecting the data you are using and of the limits these methods impose on your work.

The first half of today’s class will look at sampling
• Some Key Terms
– Population (the entire group of individual cases about which we want information)
– Sample (the part of the population from which we actually collect information and from which we will draw conclusions about the entire population)
– Sampling Design (the method we use to choose a sample from the population)
– Inference (the process of generalizing our results from our sample to the wider population)

The Ideal Sample
• A sound sampling design is essential if we are going to produce results that are generalizable from our sample to the entire population.
• In the ideal sample design:
1. Each individual case has an equal probability of being drawn as part of the sample.
2. Each characteristic of relevance to the study is present in the same proportion in our sample as in the population at large.
• NOTE: by the end of today’s lesson we will see that this second characteristic is not essential as long as:
1. The first point is realized
2. and the sample we draw is large enough.
• However, for now we will assume it is a good idea to aim for both.

Garbage In… Garbage Out…
• The truth of the matter is:
– It is very hard to fully achieve both of the ideal characteristics of a sample.
• For some populations it might be impossible, such as relatively small groups (think about policy-makers and advisors, for example).
– It is getting harder all the time to do so.
– The best we can probably ever hope to do is “approximate” the two characteristics.
• The more closely we approximate the two characteristics, the more accurate our inferences will be.

Sample types
• Convenience Samples
• Voluntary Response Samples
• Purposive Samples
– Snowball Samples
• Simple Random Samples
– We not only need to choose the individual cases at random, but also need to settle on a sample size, so that every possible sample of that size has an equal chance of being chosen.
• Stratified Random Sample
– Sometimes you want to go beyond a Simple Random Sample so that the representation of certain characteristics is not left to chance. This is common in public opinion polling, where several Simple Random Samples are drawn for populations identified along geographic lines.
• The next slide portrays a multistage stratified random sample.

[Figure: National Stratified Random Sample. Regions: BC, Prairie, ON, QU, Atlantic. Within BC: Greater Vancouver, Greater Victoria, Other Van. Island, The South Interior, The North. Randomly select a set of Postal Sorting Areas (PSAs) for each region, then randomly select people in those PSAs.]

• Stratification need not be done on geographic lines (though it often is).
• Any characteristic can be used as the basis of stratification.
• What matters is that it is something of importance to your research.

Caution
• As the book notes, a sample’s quality is built on more than just the sampling technique.
• We need to have reliable information about the population before we can even begin to select a decent sample.
• Non-response, response bias, and many other factors can also degrade a sample’s quality and the quality of the data we can get out of it.

A funny little thing called probability
• As we noted earlier, when we take a sample and conduct a study we want to generalize (or infer) the results of our study to the wider population our sample is drawn from.
• If 40 percent of our sample say they will vote Conservative, we would like to estimate that this is the situation among the general population.
• However, we know that if we had drawn two separate samples and done two simultaneous studies, we might have gotten different results for each sample.
• If the variation in results between samples is too great, then we cannot generalize our results to the wider population.
• Therefore we need to know about probability to estimate the chance that our results will vary from the actual situation in the population at large.
• The statistician might tell us there is a 95% chance that our survey result is accurate to within plus or minus 3 percentage points (meaning there is a 95% chance that the real support for the Conservatives among the general population is between 37% and 43%).

In the Long-Run
• The law of probability is based on a regularly documented observation: while chance can produce erratic results over the short term or when small numbers are looked at, it generates regular and predictable outcomes over the long term and as numbers increase.
• Random – Individual outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions (this is not a pattern).
• Probability – The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.

Examples of Commonly Known Probable Long-term Results
• Coin Tossing – Fifty/Fifty
• Stock market returns – Reversion to the mean
• The fall of cards in a game (if you are able to count them properly and quickly enough)

Randomness
• Perfect randomness is very rare, and the ability to select numbers totally at random (so that every number has just as good a chance of being selected as any other, and no pattern can ever be predicted) is valuable.

Probability Models
• The Sample Space “S” of a random phenomenon is the set of all possible outcomes.
• An Event is an outcome or a set of outcomes of a random phenomenon (the roll of the dice, the flip of the coin, etc.).
• A Probability Model is a mathematical description of a random phenomenon consisting of:
– A sample space S
– A way of assigning probabilities to events

As in figure 10.2 in the book: there are 36 possible combinations if you roll two standard dice, and together they make up the sample space S. If we wanted to define the event A, “rolling a five”, it would comprise the four outcomes in S that add up to 5:
A = {roll 1 & 4, roll 2 & 3, roll 3 & 2, roll 4 & 1}

Some Formal Probability Rules
• A probability is a number between 0 and 1.
– An event with a probability of 0 ought never to occur; an event with a probability of 1 ought always to occur.
• The probabilities of all possible outcomes together must add up to 1.
• If two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities. E.g., if one event occurs in 40% of cases and the other in 25%, and the two cannot occur together, then the probability of one or the other occurring is 65%.
• The probability that an event does not occur is 1 minus the probability that it does occur.

Discrete vs. Continuous Models
• Discrete Probability Models
– Assume that the sample space is finite.
– To assign probabilities, list the probabilities of all the individual outcomes (each must be between 0 and 1, and together they must add up to 1). The probability of an event is the sum of the probabilities of the outcomes making up the event.
– Think of our dice example: what is the probability of rolling a five?
• roll 1 & 4 = 1/36
• roll 2 & 3 = 1/36
• roll 3 & 2 = 1/36
• roll 4 & 1 = 1/36
• Total Probability = 4/36 = 1/9 = 0.111
• Continuous Probability Models
– Assign probabilities as areas under a density curve (such as the normal curve).
– The area under the curve and above any range of values is the probability of an outcome in that range.
– This is what we did in chapter 3!

Sampling Distributions
• Some Key Words
– Parameter: a number that describes the population. We often can only speculate on this, as we only have data for a sample.
– Statistic: a number that can be computed from the sample data without making use of any unknown parameters. We often use statistics to estimate parameters.
• The Law of Large Numbers:
– Draw observations at random from any population with finite mean $\mu$.
– As the number of observations drawn increases, the mean $\bar{x}$ of the observed values gets closer and closer to the mean $\mu$ of the population.
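
The two-dice example and the law of large numbers can be tried out together. What follows is only a minimal sketch, assuming two fair dice and Python's standard library; the names `sample_space`, `event_a`, `p_five` and `simulate_rolls` are made up for this illustration and do not come from the book. It first lists the 36 outcomes to get the exact probability of rolling a five, then simulates more and more rolls to see the long-run results settle down.

```python
import random
from fractions import Fraction
from itertools import product

# Discrete model: the 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the outcomes whose faces add up to 5.
event_a = [roll for roll in sample_space if sum(roll) == 5]
p_five = Fraction(len(event_a), len(sample_space))  # 4/36, which reduces to 1/9


def simulate_rolls(n_rolls, seed=0):
    """Roll two dice n_rolls times; return (mean of the sums, share of sums equal to 5)."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    total = 0
    fives = 0
    for _ in range(n_rolls):
        roll_sum = rng.randint(1, 6) + rng.randint(1, 6)
        total += roll_sum
        if roll_sum == 5:
            fives += 1
    return total / n_rolls, fives / n_rolls


print("exact P(sum is 5):", p_five, "=", round(float(p_five), 3))
for n in (10, 1_000, 100_000):
    mean_sum, share_five = simulate_rolls(n)
    print(n, "rolls: mean sum", round(mean_sum, 3), "share of fives", round(share_five, 3))
```

With only 10 rolls the results can be well off. By 100,000 rolls the mean sum should sit close to 7 (the population mean for two fair dice) and the share of fives close to 1/9.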
Two types of distribution of variables (be careful)
• The Population Distribution of a variable is the distribution of values of the variable in the population.
• The Sampling Distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population (in other words, how a statistic varies across many samples drawn from the same population).

For those who like math
Suppose that $\bar{x}$ is the mean of an SRS of size $n$ drawn from a large population with mean $\mu$ and standard deviation $\sigma$. Then the sampling distribution of $\bar{x}$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

Bottom Line and Caution
• If we can compute the average in a sample, we have a decent guess as to what the average is in the population.
• The average of a sample is generally a better guess of the average of the population than any single case drawn from the population. In other words, based on a survey of incomes, I can tell you with a reasonable level of certainty what the average income of Canadians is. I cannot tell you just from that what the income of any specific Canadian is.

The Central Limit Theorem
• When the sample size is large enough, the sampling distribution of the mean will look like a normal distribution, even if the population from which the sample is drawn does not exhibit a normal distribution for the variable, as long as the population has a finite standard deviation.
• This means we can use the known properties of the normal curve to explore and study our data, even if we do not know for certain what the parameter values are for variables in the general population.
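
To see the sampling distribution and the Central Limit Theorem at work, here is a rough simulation sketch. It makes assumptions that are not in the book: the population is taken to be exponential with mean 1 and standard deviation 1 (strongly skewed, so clearly not normal), and the helper names `sample_means` and `share_within_one_sd` are invented for this illustration. It draws many samples of a given size, computes each sample's mean, and compares the spread of those means with $\sigma/\sqrt{n}$.

```python
import random
import statistics


def sample_means(n, n_samples=5000, seed=0):
    """Draw n_samples random samples of size n and return the mean of each one.

    Population (an assumption for illustration): exponential with mean 1 and sd 1.
    """
    rng = random.Random(seed)  # seeded so that the run is repeatable
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


def share_within_one_sd(means, mu, sd):
    """Share of the sample means that fall within one sd of mu."""
    return sum(abs(m - mu) <= sd for m in means) / len(means)


MU, SIGMA = 1.0, 1.0  # parameters of the assumed population
for n in (2, 10, 50):
    means = sample_means(n)
    predicted_sd = SIGMA / n ** 0.5
    print(
        "n =", n,
        "| mean of sample means:", round(statistics.fmean(means), 3),
        "| sd of sample means:", round(statistics.stdev(means), 3),
        "| sigma/sqrt(n):", round(predicted_sd, 3),
        "| share within 1 sd:", round(share_within_one_sd(means, MU, predicted_sd), 3),
    )
```

A normal curve puts about 68% of its area within one standard deviation of the mean, so as n grows the last column should drift toward roughly 0.68, while the observed standard deviation of the sample means should track $\sigma/\sqrt{n}$.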