#### Transcript Smoking and Lung Cancer

Basic statistical concepts
[email protected]
http://dambe.bio.uottawa.ca

Simpson's paradox

| Stone size | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| Pooled | 78% (273/350) | 83% (289/350) |

C. R. Charig et al. 1986. Br Med J (Clin Res Ed) 292 (6524): 879–882.
Treatment A: all open procedures. Treatment B: percutaneous nephrolithotomy.
Question: which treatment is better?

Applied Biostatistics
• What is biostatistics: Biostatistics is the discipline of collecting, organizing, summarizing, presenting and analyzing data of biological importance, with the objective of drawing conclusions and facilitating decision-making.
• Statistical estimation/description
– Point estimation (e.g., mean X = 3.4, slope = 0.37)
– Interval estimation (e.g., 0.5 < mean X < 8.5)
• Significance tests
– A statistic, e.g., t, F, or χ², is an index measuring the difference between the observed value and the expected value derived from the null hypothesis
– Significance level and p value (e.g., p < 0.01)
– Distribution of the statistic (we cannot obtain the p value without the distribution)

Descriptive Statistics
• Normal distribution:
– Central tendency
– Dispersion
– Skewness
– Kurtosis
• Confidence limits
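The reversal in the kidney-stone table above can be checked numerically. A minimal sketch in Python (used here purely as a calculator; the course itself uses R and SAS):

```python
# Kidney-stone data from Charig et al. 1986: (successes, total) per group.
a_small, a_large = (81, 87), (192, 263)
b_small, b_large = (234, 270), (55, 80)

def rate(successes, total):
    """Observed success proportion."""
    return successes / total

# Within each stone-size stratum, Treatment A has the higher success rate...
a_wins_small = rate(*a_small) > rate(*b_small)   # 93% vs 87%
a_wins_large = rate(*a_large) > rate(*b_large)   # 73% vs 69%

# ...yet after pooling the strata, Treatment B looks better: Simpson's paradox.
a_pooled = rate(81 + 192, 87 + 263)   # 273/350 = 78%
b_pooled = rate(234 + 55, 270 + 80)   # 289/350 ≈ 83%

print(a_wins_small, a_wins_large, a_pooled < b_pooled)   # True True True
```

The reversal arises because Treatment A was applied disproportionately to the harder (large-stone) cases, which drags its pooled rate down.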
• Degrees of freedom, e.g., N vs. (N − 1):
– Exercise: use a random number generator to draw 40 variates from the standard normal distribution and compute the variance by dividing SS by N and by (N − 1).

Formulas for raw and grouped data:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}; \qquad \bar{x} = \frac{\sum_{i=1}^{N} f_i x_i}{N}$$

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}; \qquad s^2 = \frac{\sum_{i=1}^{N} f_i (x_i - \bar{x})^2}{N-1}; \qquad s_{\bar{x}} = \frac{s}{\sqrt{N}}$$

[Figure: standard normal curve]

Confidence limits: $\bar{x} \pm t_{\alpha,\mathrm{df}}\, s_{\bar{x}}$

Example of data analysis

[Figure: histogram of the chest data (Number vs. Chest, 30–50 inches) with expected counts E(N) overlaid]

• Probability density function (PDF)
• Cumulative distribution function (CDF)
• Grouped data
• Null hypothesis
• Significance test and p value

| Chest | Number | (X−mean)² | E(P_cumul) | E(P) | E(N) | E(N) pooled | χ² |
|---|---|---|---|---|---|---|---|
| 33 | 3 | 46.6738 | 0.0010 | 0.0010 | 5.7513 | 5.7513 | 1.3162 |
| 34 | 18 | 34.0102 | 0.0046 | 0.0036 | 20.8698 | 20.8698 | 0.3946 |
| 35 | 81 | 23.3465 | 0.0173 | 0.0126 | 72.4853 | 72.4853 | 1.0002 |
| 36 | 185 | 14.6829 | 0.0520 | 0.0347 | 199.2925 | 199.2925 | 1.0250 |
| 37 | 420 | 8.0192 | 0.1276 | 0.0756 | 433.7970 | 433.7970 | 0.4388 |
| 38 | 749 | 3.3556 | 0.2579 | 0.1303 | 747.6065 | 747.6065 | 0.0026 |
| 39 | 1073 | 0.6919 | 0.4357 | 0.1778 | 1020.1789 | 1020.1789 | 2.7349 |
| 40 | 1079 | 0.0283 | 0.6278 | 0.1921 | 1102.3292 | 1102.3292 | 0.4937 |
| 41 | 934 | 1.3646 | 0.7922 | 0.1644 | 943.1520 | 943.1520 | 0.0888 |
| 42 | 658 | 4.7010 | 0.9035 | 0.1114 | 638.9692 | 638.9692 | 0.5668 |
| 43 | 370 | 10.0373 | 0.9633 | 0.0597 | 342.7579 | 342.7579 | 2.1652 |
| 44 | 92 | 17.3737 | 0.9886 | 0.0254 | 145.5710 | 145.5710 | 19.7144 |
| 45 | 50 | 26.7101 | 0.9972 | 0.0085 | 48.9444 | 48.9444 | 0.0228 |
| 46 | 21 | 38.0464 | 0.9994 | 0.0023 | 13.0264 | 16.2278 | 1.4034 |
| 47 | 4 | 51.3828 | 0.9999 | 0.0005 | 2.7440 | | |
| 48 | 1 | 66.7191 | 1.0000 | 0.0001 | 0.4574 | | |
| Sum | 5738 | | | 1 | 5737.9328 | | 31.3675 |

Mean = 39.8318; Var = 4.2002; Std = 2.0494; χ² = 31.3675, p = 0.0010.

The last three classes (46–48) have small expected counts and are pooled for the χ² test: E(N) pooled = 13.0264 + 2.7440 + 0.4574 = 16.2278.

$$p(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Assignment 1

Original data:

X: 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5

Fill in: Mean ___, Std ___, SE ___, CV ___, 95% LL ___, 95% UL ___

Grouped data:

| X | Num |
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |

Fill in: Mean ___, Var ___, Std ___, SE ___, CV ___, 95% LL ___, 95% UL ___

• Compute the mean, standard deviation (Std), standard error (SE), coefficient of variation (CV), and 95% confidence interval (lower and upper limits, or LL and UL).
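The grouped-data formulas above can be applied directly to the chest data. A minimal sketch in Python (the course uses R and SAS; treat this as an illustration — the class boundary x + 0.5 used for E(P_cumul) is inferred from the table's values):

```python
from statistics import NormalDist

# Chest circumferences (inches) and frequencies from the table above.
chest = list(range(33, 49))
freq  = [3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658, 370, 92, 50, 21, 4, 1]

n = sum(freq)                                            # 5738 men
mean = sum(f * x for x, f in zip(chest, freq)) / n       # grouped mean
ss = sum(f * (x - mean) ** 2 for x, f in zip(chest, freq))
var = ss / (n - 1)                                       # grouped sample variance
std = var ** 0.5
se = std / n ** 0.5                                      # standard error of the mean

# Expected cumulative probability for the class ending at 40 (upper class
# boundary 40.5), matching the E(P_cumul) column: about 0.6278.
p_cum_40 = NormalDist(mean, std).cdf(40.5)

print(round(mean, 4), round(std, 4), round(p_cum_40, 4))   # 39.8318 2.0496 0.6278
```

With the N − 1 divisor this reproduces the SAS output further below (Mean 39.83182, Variance 4.200925); the slide's Var = 4.2002 uses the N divisor instead.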
• You can use the EXCEL function T.INV.
• Group the data and do the same calculation for the grouped data.
• Print this slide, fill in your answers, and hand it in at the beginning of class next Tuesday. Name: ____ ID: ____

Assignment 3
• Data: same pattern as the chest data, but smaller N:

| Grade (mid-point) | Number |
|---|---|
| 400 | 24 |
| 750 | 74 |
| 1250 | 38 |
| 1750 | 21 |
| 2250 | 11 |
| 2750 | 8 |
| 3250 | 11 |
| 3750 | 5 |
| 4250 | 2 |
| 4750 | 1 |
| 5250 | 3 |
| 5750 | 1 |
| 6250 | 0 |
| 6750 | 0 |
| 7250 | 0 |
| 7750 | 1 |

[Figure: histogram of the grade data (Number vs. Grade, 0–8000)]

• Do the same test of normality as on the previous slide, with comparable N.
• No need to hand in, but be ready to discuss the result.

R and distributions
• Discrete variables: Poisson, binomial, geometric, …
• Continuous variables: normal, t, F, χ², …
• Functions:
– CDF: prefix p, e.g., pnorm, pt, pf, pchisq
– PDF: prefix d, e.g., dnorm, dt, df, dchisq
– Random numbers: prefix r, e.g., rnorm, rt, rf, rchisq
– Quantile (inverse of CDF): prefix q, e.g., qnorm, qt, qf, qchisq
– Empirical cumulative distribution function: ecdf(x)
– Empirical density: density(x)
• Graphic functions:
– qqnorm(x)
– hist(x,n); truehist(x,n) in MASS

Descriptive stats in R
• Built-in: mean, sd, var, …
• R package fBasics, with skewness, kurtosis, etc.
• Tests of normality
– Univariate:
  • Significance test (Shapiro–Wilk test): shapiro.test(xVec)
  • Graphic: qqnorm(xVec), plot(density(xVec))
– Multivariate (energy test in library "energy"): mvnorm.etest(xMatrix)
• Custom-made:
– Describe in Describe.R
– StatsGroupedData in StatsGroupData.R

Decision making and risks

| H0 (e.g., $\bar{x}_1 = \bar{x}_2$) | Accept | Reject |
|---|---|---|
| True | Correct | Type I error (false positive) |
| False | Type II error (false negative) | Correct |

1. A Type I error is also called producer's risk (rejecting a good product). To limit the chance of committing a Type I error, we typically set its rate (α) small. α is referred to as the significance level in a significance test.
2. A Type II error is often referred to as consumer's risk (accepting an inferior product), and its rate is typically represented by β. One can avoid making Type II errors by making no decision until we have a sample size large enough to give us sufficient power to reject the null hypothesis.
3. The power of a test is 1 − β, which depends on sample size and effect size.

Inference: Population and Sample

| ID | Weight (kg) | Length (m) |
|---|---|---|
| 1 | 2.3 | 0.3 |
| 2 | 2.5 | 0.3 |
| 3 | 2.5 | 0.5 |
| 4 | 2.4 | 0.4 |
| 5 | 2.4 | 0.4 |
| 6 | 2.3 | 0.5 |
| Mean (sample statistic) | 2.4 | 0.4 |

Each column is a variable and each cell a variate (an individual observation); the six individuals form a sample drawn from the population.

Essential definitions
• Statistic: any one of many computed or estimated statistical quantities, such as the mean, the standard deviation, the correlation coefficient between two variables, or the t statistic for a two-sample t-test.
• Parameter: a numerical descriptive measure (attribute) of a population.
• Population: a specified set of individuals (or individual observations) about which inferences are to be made.
– Idealized population
– Operationally defined population
• Sample: a subset of individuals (or individual observations), generally used to make inferences about the population from which the sample is taken.

Optional materials
• Central tendency: arithmetic mean, geometric mean, harmonic mean
• Dispersion: moments, skewness, kurtosis
• Basic probability theory: events and event space, independent events, mutually exclusive events, joint events, empirical probability, conditional probability

Elementary Probability Theory
• The empirical probability of an event is taken as the relative frequency of occurrence of the event when the number of observations is very large.
• A coin is tossed 10000 times, and we observe heads 5136 times. The empirical probability of observing a head when the coin is tossed is then 5136/10000 = 0.5136.
• A die is tossed 10000 times and we observe the number 2 up 1703 times. What is the empirical probability of getting a 2 when the die is tossed?
• If the coin and the die are fair, what are the expected probabilities of getting a head or a number 2?
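The empirical-probability idea above is easy to simulate. A sketch in Python (the course uses R; the sample size mirrors the 10000-toss examples, and the seed is arbitrary):

```python
import random

random.seed(42)   # arbitrary seed, only for reproducibility

n = 10000
tosses = [random.randint(1, 6) for _ in range(n)]   # fair six-sided die
p_two = tosses.count(2) / n                          # empirical probability of a 2

# The relative frequency approaches the theoretical 1/6 ≈ 0.1667 as n grows.
print(p_two)
```

Re-running with different seeds gives values scattered around 1/6, much like the observed 1703/10000 = 0.1703 in the slide's example.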
Mutually Exclusive Events
• Two or more events are mutually exclusive if the occurrence of one of them excludes the occurrence of the others.
• Examples:
– observing a head and observing a tail in a single coin-tossing experiment
– events represented by the null hypothesis and the alternative hypothesis
– being a faithful husband and having extramarital affairs
• Binomial distribution

Coin-Tossing Experiment
Suppose I ask someone to toss a coin 6 times and record the number of heads, and he comes back to tell me that the number of heads is exactly 3. I ask him to repeat the tossing experiment three more times, and he always comes back to say that the number of heads in each experiment is exactly 3. What would you think?

| Experiment | Outcome (number of heads out of 6 tosses) |
|---|---|
| 1 | 3 |
| 2 | 3 |
| 3 | 3 |
| 4 | 3 |

The probability of getting 3 heads out of 6 coin tosses is 0.3125 for a fair coin following the binomial distribution (0.5 + 0.5)⁶, and the probability of getting this result 4 times in a row is 0.3125⁴ ≈ 0.0095. The person might not have done the experiment at all!

Thinking Critically
Now suppose Mendel obtained the following results:

| Breeding experiment | Number of round seeds | Number of wrinkled seeds |
|---|---|---|
| 1 | 21 | 7 |
| 2 | 24 | 8 |
| 3 | 18 | 6 |

Based on (0.75 + 0.25)ⁿ: P1 = 0.171883; P2 = 0.161041; P3 = 0.185257; P = P1·P2·P3 = 0.0051.
Edwards, A. W. F. 1986. Are Mendel's results really too close? Biol. Rev. 61:295–312.

Compound Event
• A compound event, denoted by E1E2 or E1E2…EN, refers to two or more events occurring together.
• For independent events, Pr{E1E2} = Pr{E1}·Pr{E2}.
• For dependent events, Pr{E1E2} = Pr{E1}·Pr{E2|E1}.

Probability of joint events

| Criterion | Prob. |
|---|---|
| Between 25 and 45 | 1/2 |
| Very bright | 1/25 |
| Liberal | 1/3 |
| Relatively nonreligious | 2/3 |
| Self-supporting | 1/2 |
| No kids | 1/3 |
| Funny, sense of humor | 1/3 |
| Warm, considerate | 1/2 |
| Sexually assertive | 1/2 |
| Attractive | 1/2 |
| Doesn't drink or smoke | 1/2 |
| Is not presently attached | 1/2 |
| Would fall in love quickly | 1/5 |

The probability of meeting a person satisfying all criteria is 1/648,000; i.e., if you meet one new candidate per day, it will take you, on average, about 1775 years to find your partner. Fortunately, many criteria are correlated; e.g., a very bright adult is almost always self-supporting.

Conditional Probability
• Let E1 be the event of observing the number 2 when a die is tossed, and E2 the event of observing an even number. Pr{E1|E2} is called the conditional probability of E1 given that E2 has occurred.
• What is the expected value of Pr{E1|E2} with a fair die?
• What is the expected value of Pr{E2|E1}?

Independent Events
• Two events E1 and E2 are independent if the occurrence or non-occurrence of E1 does not affect the probability of occurrence of E2, so that Pr{E2|E1} = Pr{E2}.
• When one person tosses a coin in Hong Kong and another person throws a die in the US, the event of observing a head and the event of getting a number 2 can be assumed to be independent.
• The event of grading students unfairly and the event of students making an appeal can be assumed to be dependent.

Various Kinds of Means
• Arithmetic mean
• Geometric mean
• Harmonic mean
• Quadratic mean (or root mean square)

Geometric Mean
• The geometric mean ($G_x$) is expressed as:

$$G_x = \sqrt[n]{x_1 x_2 \cdots x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

where $\prod$ is called the product operator (and, as you know, $\sum$ is called the summation operator).

When to Use the Geometric Mean
• The geometric mean is frequently used with rates of change over time, e.g., the rate of increase in population size, or the rate of increase in wealth.
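A quick numeric illustration of the definition above, using yearly growth rates of 2, 4, and 1 (Python used only as a calculator; the mouse-population slide works this same example in detail):

```python
# Geometric mean of growth rates 2, 4, 1 (population: 1000 -> 2000 -> 8000 -> 8000).
rates = [2, 4, 1]

product = 1.0
for r in rates:
    product *= r
g = product ** (1 / len(rates))   # geometric mean: cube root of 8
a = sum(rates) / len(rates)       # arithmetic mean: 7/3

print(round(g, 6))                # 2.0
print(round(1000 * g ** 3))       # recovers the final size: 8000
print(round(1000 * a ** 3, 1))    # arithmetic mean overshoots: 12703.7
```

Compounding the geometric mean reproduces the observed final population size, which is exactly why it is the right average for ratio-type data.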
• Suppose we have a population of 1000 mice in the 1st year (x1 = 1000), 2000 mice in the 2nd year (x2 = 2000), 8000 mice in the 3rd year (x3 = 8000), and 8000 mice in the 4th year (x4 = 8000). This scenario is summarized in the following table:

| Year | Population size (t) | Population size (t+1) | Rate of increase PS(t+1)/PS(t) |
|---|---|---|---|
| 1 | 1000 | 2000 | 2 (population size doubled) |
| 2 | 2000 | 8000 | 4 (population size quadrupled) |
| 3 | 8000 | 8000 | 1 (population size stable) |

What is the mean rate of increase? (2 + 4 + 1) / 3?

Wrong Use of the Arithmetic Mean
• The arithmetic mean is (2 + 4 + 1) / 3 = 7/3, which might lead us to conclude that the population is increasing with an average rate of 7/3.
• This is a wrong conclusion because 1000 × (7/3)³ ≠ 8000.
• The arithmetic mean is not good for ratio variables.

Using the Geometric Mean
• The geometric mean is:

$$G_x = \sqrt[3]{2 \times 4 \times 1} = \sqrt[3]{8} = 2$$

• This is the correct average rate of increase. On average, the population size has doubled every year over the last three years, so that x4 = 1000 × 2 × 2 × 2 = 8000 mice.
• Alternative: solve 1000·r³ = 8000 for r.

The Ratio Variable
• Example (milk price per quart ÷ bread price per loaf): Year 1: 3; Year 2: 2.
• The arithmetic mean ratio is r1 = 2.5.
• What is the mean ratio of bread price to milk price?
– The ratios are 1/3 and 1/2; the arithmetic mean ratio is r2 = (1/3 + 1/2) / 2 = 5/12 = 0.4167.
• But r1 ≠ 1/r2. What's wrong?
• Conclusion: the arithmetic mean is no good for ratios.

Using the Geometric Mean
• Geometric mean of the milk/bread ratios: $\bar{r}_1 = \sqrt{3 \times 2} = \sqrt{6} = 2.4495$
• Geometric mean of the bread/milk ratios: $\bar{r}_2 = \sqrt{(1/3)(1/2)} = \sqrt{1/6} = 0.4082$, and $1/\bar{r}_2 = 2.4495 = \bar{r}_1$.

Moments and distribution
• The r-th moment:

$$m_r = \frac{X_1^r + X_2^r + \cdots + X_N^r}{N} = \frac{\sum_{i=1}^{N} X_i^r}{N}$$

• The r-th central moment:

$$\mu_r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^r}{N}$$

• The first moment is the arithmetic mean.
• The second central moment
– is the population variance when N is equal to the population size (typically assumed to be infinitely large);
– gives the sample variance when the divisor is n − 1, where n is the sample size.
• The standardized moment (αr) is the moment of the standardized x:

$$\alpha_r = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{X_i - \bar{X}}{\sigma}\right)^r = \frac{\mu_r}{\sigma^r}$$

– α1 = 0
– α2 = 1
– α3 is the population skewness; the sample skewness with sample size n is

$$g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

Skewness

[Figure: right-skewed (+) and left-skewed (−) distributions]

Kurtosis

[Figure: leptokurtic (kurtosis > 0), normally distributed, and platykurtic (kurtosis < 0) curves]

The sample excess kurtosis is

$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

Empirical frequency distributions
• Chest data: chest circumference 33–48 inches, with numbers of men 3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658, 370, 92, 50, 21, 4, 1 (as tabulated above).
• Marks data: mid-points 400–7750, with numbers of candidates 24, 74, 38, 21, 11, 8, 11, 5, 2, 1, 3, 1, 0, 0, 0, 1 (as tabulated above).

SAS Program and Output

```
data chest;
input chest number;
cards;
33 3
34 18
35 81
36 185
37 420
38 749
39 1073
40 1079
41 934
42 658
43 370
44 92
45 50
46 21
47 4
48 1
;
proc univariate normal plot;
freq number;
var chest;
run;
```

```
Univariate Procedure

Variable=CHEST

N            5738        Sum Wgts    5738
Mean         39.83182    Sum         228555
Std Dev      2.049616    Variance    4.200925
Skewness     0.03333     Kurtosis    0.06109
USS          9127863     CSS         24100.71
CV           5.145674    Std Mean    0.027058
T:Mean=0     1472.102    Pr>|T|      0.0001
Num ^= 0     5738        Num > 0     5738
M(Sign)      2869        Pr>=|M|     0.0001
Sgn Rank     8232596     Pr>=|S|     0.0001
D:Normal     0.098317    Pr>D        <.01
```

USS = Σxᵢ²; CSS = Σ(xᵢ − x̄)².

SAS Program and Output

```
data Grade;
input marks number;
cards;
400 24
750 74
1250 38
1750 21
2250 11
2750 8
3250 11
3750 5
4250 2
4750 1
5250 3
5750 1
6250 0
6750 0
7250 0
7750 1
;
proc univariate normal plot;
freq number;
var marks;
run;
```

```
Univariate Procedure

Variable=marks

N            200         Sum Wgts    200
Mean         1465.5      Sum         293100
Std Dev      1179.392    Variance    1390965
Skewness     2.031081    Kurtosis    5.180086
USS          7.0634E8    CSS         2.768E8
CV           80.47708    Std Mean    83.39558
T:Mean=0     17.57287    Pr>|T|      0.0001
Num ^= 0     200         Num > 0     200
M(Sign)      100         Pr>=|M|     0.0001
Sgn Rank     10050       Pr>=|S|     0.0001
W:Normal     0.767621    Pr<W        0.0001
```

SAS Graph

```
DATA;
DO X=-5 TO 5 BY 0.25;
DO Y=-5 TO 5 BY 0.25;
DO Z=SIN(SQRT(X*X+Y*Y));
OUTPUT;
END;
END;
END;
PROC G3D;
PLOT Y*X=Z/CAXIS=BLACK CTEXT=BLACK;
TITLE 'Hat plot';
FOOTNOTE 'Fig. 1, Xia';
RUN;
```
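The sample skewness and kurtosis formulas from the moments section can be checked against the SAS output above for the chest data. A Python sketch (an illustration, not the course's own code; the frequency expansion mirrors SAS's FREQ statement):

```python
# Grouped chest data: circumference (inches) and number of men.
chest = list(range(33, 49))
freq  = [3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658, 370, 92, 50, 21, 4, 1]

# Expand frequencies into individual observations, as SAS's FREQ statement does.
x = [v for v, f in zip(chest, freq) for _ in range(f)]
n = len(x)
mean = sum(x) / n
s = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5   # sample std dev

z3 = sum(((v - mean) / s) ** 3 for v in x)
z4 = sum(((v - mean) / s) ** 4 for v in x)

# Adjusted sample skewness and excess kurtosis (the g1 and g2 formulas above);
# these should reproduce the SAS values Skewness 0.03333 and Kurtosis 0.06109.
skew = n / ((n - 1) * (n - 2)) * z3
kurt = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * z4 \
       - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

print(round(skew, 5), round(kurt, 5))
```

Both values are close to zero, consistent with the near-normal shape of the chest data seen in the χ² goodness-of-fit test earlier.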