#### Transcript

Stat 651 Lecture 4. Copyright (c) Bani Mallick.

Topics in Lecture #4
- Probability
- The bell-shaped (normal) curve
- Normal probability plots (the q-q plot) to check for normality of continuous data
- Use of Table 1 in the back of the book

Topics in Lecture #4 (continued)
- Normal probability calculations
- Data transformations
- Sampling distributions: sample means are random variables!
- Standard error of the sample mean
- Central Limit Theorem
- A simple confidence interval

Book Sections Covered in Lecture #4
- Chapter 4.10, in detail
- Chapter 4.11 (read on your own)
- Chapter 4.12, in detail
- Chapter 5.1
- Chapter 5.2

Lecture 3 Review
- Box plots are probably the best way to compare populations graphically
- You can detect shifts and changes in variation
- They also identify outliers

Lecture 3 Review
- q-q plots are a simple way to understand whether the data are approximately bell-shaped
- [Figure: population relative frequency histogram, a bell-shaped curve; vertical axis: normal density, horizontal axis: X]
- If the points are sort of straight, then normality of the population relative frequency histogram is not too badly off

q-q plot for the healthy women
- [Figure: Normal Q-Q plot of log(saturated fat); horizontal axis: observed value, vertical axis: expected normal value]

Lecture 3 Review
- For bell-shaped populations, we have empirical rules
- Approximately 68% (90%) (95%) of the population lies within 1 (1.645) (1.96) population standard deviations $\sigma$ of the population mean $\mu$

Lecture 3 Review
- In many of our examples, we have seen that there look to be differences among populations
- How can we tell if the differences are real?
- We will say that populations are different if the differences we observe are larger than can be expected from sample-to-sample variability

Lecture 3 Review
- A random variable is any outcome (qualitative or numerical) from an experiment involving random sampling from a population
- The idea of a model is to write down a formula for the population histogram as a function of 1-2 parameters, which are estimated from the data
- If you know the parameters of the model, then you know everything about probabilities in that population

Using the Normal Model
- The entire point of the normal model is to make probability statements
- In practice, we estimate the population mean $\mu$ by the sample mean
- We estimate the population standard deviation $\sigma$ by the sample standard deviation
- Then we estimate probabilities by pretending the sample quantities equal the population ones

Various Cases
- Suppose we want to know what % of a population lies below a specified value, $c$
- We write this by asking: what is $\Pr(X < c)$?
- The value $c$ is any arbitrary value, e.g., 6
- $X$ is any random variable with a population mean $\mu$ and a population standard deviation $\sigma$

Pr(X < c) for Normal Populations
- Compute the z-score $z = \frac{c - \mu}{\sigma}$
- Look up the value in Table 1, page 1091 (white board explanation)

Mechanics
- NHANES: suppose healthy women's ages are normally distributed with mean $\mu = 40$ and standard deviation $\sigma = 6$
- What is the chance that a randomly selected person from this population is aged $c = 43.3$ or less?
- We write this in symbols as $\Pr(X < 43.3)$

Mechanics
- $\mu = 40$, $\sigma = 6$
- $\Pr(X < 43.3)$ is what we want
- z-score: $z = (43.3 - \mu)/\sigma = 0.55$
- Look up in Table 1: the value 0.55 is on page 1092. The first column is 0.5 and the first row is 0.05; add them to get 0.55, and look up the value
- $\Pr(X < 43.3) = 0.7088$

Various Cases
- Suppose we want to know what % of a population lies above a specified value, $c$
- We write this by asking: what is $\Pr(X > c)$?
- The value $c$ is any arbitrary value, e.g., 6
- $X$ is any random variable with a population mean $\mu$ and a population standard deviation $\sigma$

Pr(X > c) for Normal Populations
- This is simply $1 - \Pr(X \le c)$
- Compute the z-score $z = (c - \mu)/\sigma$
- Look up the value for $z$ in Table 1
- Subtract this value from 1.0

Mechanics
- $\mu = 40$, $\sigma = 6$
- Chance that a randomly selected person from this population is aged 46 or more: $\Pr(X > 46)$
- $z = (46 - \mu)/\sigma = 1$
- Look up in Table 1 for 1.00: get 0.8413
- Because you are asking for > 46, subtract from 1 to get $\Pr(X > 46) = 1 - 0.8413 = 0.1587$

Mechanics
- $\mu = 40$, $\sigma = 6$
- Chance that a randomly selected person from this population is aged 46 or less: $\Pr(X \le 46)$
- $z = (46 - \mu)/\sigma = 1$
- Look up in Table 1: the chance is 0.8413 = 84.13%

Mechanics
- $\mu = 40$, $\sigma = 6$
- Chance that a randomly selected person from this population is aged 34 or less: $\Pr(X \le 34)$
- $z = (34 - \mu)/\sigma = -1$
- Look up in Table 1: the chance is 0.1587 = 15.87%

Aortic Stenosis Data
- Two populations: healthy kids and kids with aortic stenosis
- Two outcomes: body surface area (BSA) and aortic valve area (AVA)
- The size-adjusted aortic valve area is the ratio of aortic valve area to body surface area

Stenosis Data, AVA to BSA Ratio
- [Figure: box plots of the AVA to BSA ratio by health status; healthy N = 70, stenotic N = 56]
- Note the huge outlier in the stenotic kids. He/she has a huge aortic valve area relative to his/her body size

Aortic Stenosis Data
- Healthy kids and AVA/BSA ratio
- Sample mean = 1.38, sample standard deviation $s = 0.51$
- Let's pretend the population has $\mu = 1.4$, $\sigma = 0.5$
- As it turns out, the sample mean of stenotic kids is 0.7
- So, let's ask: for healthy kids, what is $\Pr(X < 0.7)$?
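
The table lookups in the worked examples above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for Table 1; the helper names `prob_below` and `prob_above` are illustrative and do not come from the lecture.

```python
from statistics import NormalDist

# Standard normal distribution, playing the role of Table 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def prob_below(c, mu, sigma):
    """Pr(X < c) for a normal population: compute the z-score, then look it up."""
    z = (c - mu) / sigma
    return STANDARD_NORMAL.cdf(z)


def prob_above(c, mu, sigma):
    """Pr(X > c) = 1 - Pr(X <= c)."""
    return 1.0 - prob_below(c, mu, sigma)


# Ages of healthy women: mu = 40, sigma = 6 (values from the slides).
p_age_below_43_3 = prob_below(43.3, mu=40, sigma=6)  # z = 0.55
p_age_above_46 = prob_above(46, mu=40, sigma=6)      # z = 1
p_age_below_34 = prob_below(34, mu=40, sigma=6)      # z = -1

# Healthy kids' AVA/BSA ratio: pretend mu = 1.4, sigma = 0.5.
p_ratio_below_0_7 = prob_below(0.7, mu=1.4, sigma=0.5)  # z = -1.4

print(f"Pr(age < 43.3)  = {p_age_below_43_3:.4f}")
print(f"Pr(age > 46)    = {p_age_above_46:.4f}")
print(f"Pr(age <= 34)   = {p_age_below_34:.4f}")
print(f"Pr(ratio < 0.7) = {p_ratio_below_0_7:.4f}")
```

Rounded to four decimals, these agree with the Table 1 values used on the slides (0.7088, 0.1587, 0.1587), and the last line gives the probability asked for in the aortic stenosis question.
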

Aortic Stenosis Data
- Healthy kids and AVA/BSA ratio
- $\mu = 1.4$, $\sigma = 0.5$
- For healthy kids, what is $\Pr(X \le 0.7)$?
- $z = (0.7 - \mu)/\sigma = -1.4$; look up in Table 1
- You should get 0.0808

Aortic Stenosis Data
- For healthy kids, $\Pr(X \le 0.7) = 0.0808$
- Stenotic kids have a mean AVA/BSA ratio of 0.7
- Thus, the average stenotic kid has a lower AVA/BSA ratio than 91.92% of healthy kids
- 91.92% = 100% - 8.08%

Not all Data are Normally Distributed
- "Time to an event", e.g., time to a heart attack
- Number of things that happen, e.g., number of heart attacks
- These typically have a skewed shape
- [Figure: a skewed density curve; vertical axis: density, horizontal axis: X]
- Statisticians have special models to handle this (Gamma, Poisson)
- You will usually try to eliminate some of the skewness by data transformation

Not all Data are Normally Distributed
- The standard data transformations are:
- Square root
- Logarithm: but if you have zeros in the data set, you have to add a small constant, since $\log(0) = -\infty$

Inference
- The basic building blocks for inference are statistics
- Let's start with the population mean $\mu$, the sample mean $\bar{X}$ and the sample standard deviation $s$
- The standard error (of the mean) is $s/\sqrt{n}$

Inference
- The sample mean is a random variable
- This means that it varies from sample to sample
- Of course, if we were able to "sample" the entire population, the sample mean would equal the population mean $\mu$

Inference
- The sample mean $\bar{X}$ is a random variable
- Its own "population" mean is $\mu$
- Its standard deviation is $\sigma/\sqrt{n}$
- Note how the standard deviation of the sample mean becomes smaller as the sample size becomes larger
- Why does this make sense?
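
One way to see why the standard deviation of the sample mean shrinks with the sample size is to simulate repeated sampling. The sketch below is an illustration only: it assumes a normal population with the lecture's example values ($\mu = 40$, $\sigma = 6$), and the function name `sd_of_sample_means`, the number of repetitions, and the seed are all arbitrary choices, not part of the lecture.

```python
import math
import random
import statistics


def sd_of_sample_means(n, mu=40.0, sigma=6.0, reps=5000, seed=0):
    """Draw `reps` samples of size n and return the SD of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)


results = {}
for n in (4, 16, 64):
    simulated = sd_of_sample_means(n)
    theory = 6.0 / math.sqrt(n)  # sigma / sqrt(n)
    results[n] = (simulated, theory)
    print(f"n = {n:2d}: simulated SD of mean = {simulated:.3f}, "
          f"sigma/sqrt(n) = {theory:.3f}")
```

Each sample mean averages out some of the individual variation, so larger samples give means that cluster more tightly around $\mu$; the simulated values should track $\sigma/\sqrt{n}$ closely.
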

Central Limit Theorem
- The sample mean $\bar{X}$ is a random variable
- Its own "population" mean is $\mu$
- Its standard deviation is $\sigma/\sqrt{n}$
- In "large enough" samples, the sample mean is very nearly normally distributed, i.e., has a bell-shaped histogram
- What does this mean?

Warning
- It is incredibly easy to have difficulty understanding that the sample mean is itself a random variable
- But it is the crucial concept
- If I take repeated samples and compute the sample mean each time, I will not get the same number. Thus, the sample mean is a random variable

Women's Interview Survey of Health
- Funny case-control study
- Seemed to indicate that women who ate a lot of non-chocolate sweets were at higher risk of breast cancer
- 271 women controls were interviewed about their diets
- They completed six 24-hour recalls

Women's Interview Survey of Health
- Hawthorne effect: the more you ask people about their lives, the more they will change
- Does this happen here?
- If so, we'd expect that their caloric intake decreased the more they were asked about their diet

Women's Interview Survey of Health
- To test the Hawthorne effect, we took the average caloric intake from the first two interviews and subtracted it from the average caloric intake from the last two interviews
- $X$ = (average of recalls 5 and 6) - (average of recalls 1 and 2)
- Do you think the population mean of $X$ is positive or negative?

Women's Interview Survey of Health (WISH)
- My guess was that because of various factors (societal pressure, awareness of diet, Hawthorne effect), they will report fewer calories at the second time period
- My hypothesis is that the population mean of $X$ is < 0

WISH: Change in Caloric Intake
- [Figure: box plot of the change in mean energy, N = 271, with several flagged outliers]
- Does it look like a big change?

WISH: Change in Calories
- [Figure: Normal Q-Q plot of change in mean energy; horizontal axis: observed value, vertical axis: expected normal value]
- Does this look straight enough to be happy thinking that $X$ is approximately normally distributed?

WISH Descriptives
- What does an IQR of 838 mean?
- Variable: change in mean energy, last 2 recalls minus first 2 recalls

| Statistic | Value | Std. Error |
|---|---|---|
| Mean | -180.1262 | 37.2202 |
| 95% confidence interval for mean, lower bound | -253.4050 | |
| 95% confidence interval for mean, upper bound | -106.8474 | |
| 5% trimmed mean | -171.6543 | |
| Median | -128.2150 | |
| Variance | 375428.7 | |
| Std. deviation | 612.7223 | |
| Minimum | -2235.00 | |
| Maximum | 1567.96 | |
| Range | 3802.96 | |
| Interquartile range | 838.1900 | |
| Skewness | -.253 | .148 |
| Kurtosis | .608 | .295 |

WISH
- The sample size is $n = 271$
- The sample mean change = -180 calories!
- The sample standard deviation = 612
- The sample standard error = 37
- By the empirical rule, the chance is 95% that the population mean is within 1.96 × 37 ≈ 73 of -180, i.e., between about -253 and -107

WISH
- By the empirical rule, the chance is 95% that the population mean is between about -253 and -107
- What does this mean?
- Is there a Hawthorne effect going on?
- Can you attach a probability to this?
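
The simple confidence interval for the WISH data can be reproduced from the reported sample size, mean and standard deviation. This is a minimal sketch using the normal multiplier 1.96 from the empirical rule; the variable names are illustrative. The interval in the descriptives output (-253.4 to -106.8) is slightly wider, presumably because the software used a t-distribution multiplier, which is an assumption since the slides do not say.

```python
import math

# Values taken from the WISH descriptives output.
n = 271
sample_mean = -180.1262
sample_sd = 612.7223

# Standard error of the mean: s / sqrt(n).
standard_error = sample_sd / math.sqrt(n)

# Simple 95% interval: sample mean +/- 1.96 standard errors.
margin = 1.96 * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"standard error = {standard_error:.2f}")
print(f"95% interval   = ({lower:.1f}, {upper:.1f})")
```

The whole interval lies below zero, which is the fact the closing questions about the Hawthorne effect ask you to interpret.
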