Smoking and Lung Cancer


Basic statistical concepts
[email protected]
http://dambe.bio.uottawa.ca
Simpson’s paradox

                Treatment A      Treatment B
Small stones    93% (81/87)      87% (234/270)
Large stones    73% (192/263)    69% (55/80)
Pooled          78% (273/350)    83% (289/350)

C. R. Charig et al. 1986. Br Med J (Clin Res Ed) 292 (6524): 879–882
Treatment A: all open procedures
Treatment B: percutaneous nephrolithotomy
Question: which treatment is better?
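The paradox can be verified directly from the counts in the table; a short Python check using the Charig et al. numbers:

```python
from fractions import Fraction

# Success counts / totals for each treatment, stratified by stone size
# (numbers from Charig et al. 1986, as in the table above).
a_small, a_large = Fraction(81, 87), Fraction(192, 263)
b_small, b_large = Fraction(234, 270), Fraction(55, 80)

# Within each stratum, Treatment A has the higher success rate...
print(float(a_small), float(b_small))  # ~0.931 vs ~0.867
print(float(a_large), float(b_large))  # ~0.730 vs ~0.687

# ...but pooling reverses the ranking, because A was applied mostly to
# the harder (large-stone) cases and B mostly to the easier ones.
a_pooled = Fraction(81 + 192, 87 + 263)   # 273/350
b_pooled = Fraction(234 + 55, 270 + 80)   # 289/350
print(float(a_pooled), float(b_pooled))   # 0.78 vs ~0.826
```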
Applied Biostatistics
• What is biostatistics: Biostatistics is for collecting,
organizing, summarizing, presenting and analyzing
data of biological importance, with the objective of
drawing conclusions and facilitating decision-making.
• Statistical estimation/description
– point estimation (e.g., mean X = 3.4, slope = 0.37)
– interval estimation (e.g., 0.5 < mean X < 8.5)
• Significance tests
– Statistics, e.g., t, F, χ², which are indices measuring the
difference between the observed value and the expected
value derived from the null hypothesis
– Significance level and p value (e.g., p < 0.01)
– Distribution of the statistic (we cannot obtain the p value
without the distribution)
Descriptive Statistics
• Normal distribution:
– central tendency
– dispersion
– skewness
– kurtosis
• Confidence limits
• Degrees of freedom, e.g., N vs. (N−1):
– Use a random number generator to get
40 variates from the standard normal
distribution and compute the variance
as SS/N and as SS/(N−1)
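The degrees-of-freedom exercise can be sketched in Python (the course uses R and Excel; this standard-library version is offered only as an illustration, with an arbitrary seed for reproducibility):

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable
n = 40
x = [random.gauss(0, 1) for _ in range(n)]  # 40 standard-normal variates

xbar = sum(x) / n
ss = sum((xi - xbar) ** 2 for xi in x)      # sum of squared deviations

var_biased = ss / n          # divides by N: underestimates the variance
var_unbiased = ss / (n - 1)  # divides by N - 1: the usual sample variance
print(var_biased, var_unbiased, statistics.variance(x))
```

Note that `statistics.variance` agrees with the SS/(N−1) version, which is why (N−1) is the divisor used for sample variance.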
\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}; \quad \bar{x} = \frac{\sum_{i=1}^{N} f_i x_i}{N}

s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}; \quad s^2 = \frac{\sum_{i=1}^{N} f_i (x_i - \bar{x})^2}{N-1}

s_{\bar{x}} = \frac{s}{\sqrt{N}}

[Figure: standard normal density curve, x from −6 to 6]

Confidence limits: mean ± t_{α,df} · SE
Example of data analysis
[Figure: histogram of chest girth (30–50 inches) against number of men (0–1200), with expected counts E(N) overlaid]

Probability density function (PDF)
Cumulative distribution function (CDF)
Grouped data
Null hypothesis
Significance test and p value
Chest  Number  (X−meanX)²  E(P_cumul)  E(P)    E(N)       E(N) pooled  Chi-sq
33     3       46.6738     0.0010      0.0010  5.7513     5.7513       1.3162
34     18      34.0102     0.0046      0.0036  20.8698    20.8698      0.3946
35     81      23.3465     0.0173      0.0126  72.4853    72.4853      1.0002
36     185     14.6829     0.0520      0.0347  199.2925   199.2925     1.0250
37     420     8.0192      0.1276      0.0756  433.7970   433.7970     0.4388
38     749     3.3556      0.2579      0.1303  747.6065   747.6065     0.0026
39     1073    0.6919      0.4357      0.1778  1020.1789  1020.1789    2.7349
40     1079    0.0283      0.6278      0.1921  1102.3292  1102.3292    0.4937
41     934     1.3646      0.7922      0.1644  943.1520   943.1520     0.0888
42     658     4.7010      0.9035      0.1114  638.9692   638.9692     0.5668
43     370     10.0373     0.9633      0.0597  342.7579   342.7579     2.1652
44     92      17.3737     0.9886      0.0254  145.5710   145.5710     19.7144
45     50      26.7101     0.9972      0.0085  48.9444    48.9444      0.0228
46     21      38.0464     0.9994      0.0023  13.0264    16.2278      1.4034
47     4       51.3828     0.9999      0.0005  2.7440
48     1       66.7191     1.0000      0.0001  0.4574
Sum    5738                            1       5737.9328               31.3675

Mean  39.8318        p = 0.0010
Var   4.2002
Std   2.0494

(The last three classes are pooled so that the expected count per class is not too small.)
p(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
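The expected counts in the table come from the normal CDF evaluated at the class boundaries (x ± 0.5). The slide does this in Excel; the Python sketch below uses `statistics.NormalDist` as a stand-in (and, unlike the slide, does not fold the tails into the first and last classes):

```python
from statistics import NormalDist

chest = list(range(33, 49))
number = [3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658,
          370, 92, 50, 21, 4, 1]
n = sum(number)                                   # 5738 men

# Mean and standard deviation of the grouped data
mean = sum(x * f for x, f in zip(chest, number)) / n
s2 = sum(f * (x - mean) ** 2 for x, f in zip(chest, number)) / (n - 1)
std = s2 ** 0.5

# Expected count for each class: N * [F(x + 0.5) - F(x - 0.5)]
nd = NormalDist(mean, std)
expected = [n * (nd.cdf(x + 0.5) - nd.cdf(x - 0.5)) for x in chest]
print(round(mean, 4), round(std, 4))   # ~39.8318, ~2.0496
```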
Assignment 1
Original data (X): 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5

Compute the mean, standard deviation (Std), standard error (SE),
coefficient of variation (CV), and 95% confidence interval (lower and
upper limits, or LL and UL). You can use the EXCEL function T.INV.

Group the data:

X    Num
1    2
2    3
3    4
4    3
5    2

and do the same calculations (mean, Var, Std, SE, CV, 95% LL and UL)
for the grouped data.

Print this slide, fill in your answers, and hand it in at the
beginning of class next Tuesday.

Name:
ID:
Assignment 3

Data: same pattern as before but smaller N. Do the same test of
normality as on the previous slide, with comparable N.

Grade (mid-point)  Number
400                24
750                74
1250               38
1750               21
2250               11
2750               8
3250               11
3750               5
4250               2
4750               1
5250               3
5750               1
6250               0
6750               0
7250               0
7750               1

[Figure: histogram of Number (0–80) against Grade (0–8000)]

No need to hand in, but get ready to discuss the result.
R and distributions
• Discrete variables: Poisson, binomial, geometric, …
• Continuous variables: normal, t, F, χ², …
• Functions:
– CDF: add p, e.g., pnorm, pt, pf, pchisq
– PDF: add d, e.g., dnorm, dt, df, dchisq
– Random numbers: add r, e.g., rnorm, rt, rf, rchisq
– Quantile (inverse of CDF): add q, e.g., qnorm, qt, qf, qchisq
– Empirical cumulative distribution function: ecdf(x)
– Empirical density: density(x)
• Graphic functions
– qqnorm(x)
– hist(x,n), truehist(x,n) in MASS
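For comparison, Python's standard library offers a small analogue of R's d/p/q functions for the normal distribution (an aside for illustration; it is not part of the course's R toolkit):

```python
from statistics import NormalDist

z = NormalDist()          # standard normal: mean 0, sd 1
p = z.cdf(1.96)           # like pnorm(1.96) in R
d = z.pdf(0)              # like dnorm(0)
q = z.inv_cdf(0.975)      # like qnorm(0.975)
print(round(p, 4), round(d, 4), round(q, 2))  # 0.975 0.3989 1.96
```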
Descriptive stats in R
• Built-in: mean, sd, var, …
• R package: fBasics with skewness, kurtosis,
etc.
• Test of normality
– Univariate
• Significance test (Shapiro–Wilk test): shapiro.test(xVec)
• Graphic: qqnorm(xVec), plot(density(xVec))
– Multivariate (energy test in library "energy"):
mvnorm.etest(xMatrix)
• Custom-made:
– Describe in Describe.R
– StatsGroupedData in StatsGroupData.R
Decision making and risks
                       Decision
H0 (e.g., x̄1 = x̄2)    Accept                          Reject
True                   Correct                         Type I error (false positive)
False                  Type II error (false negative)  Correct

1. A Type I error is also called producer's risk (rejecting a good product). To limit
the chance of committing a Type I error, we typically set its rate (α) small. α is
referred to as the significance level in a significance test.
2. A Type II error is often referred to as consumer's risk (accepting an inferior
product), and its rate is typically represented by β. One can avoid making Type
II errors by making no decision until we have a sample size large enough to
give us sufficient power to reject the null hypothesis.
3. The power of a test is 1 − β, which depends on sample size and effect size.
Inference: Population and Sample
ID    Weight (in kg)   Length (in m)
1     2.3              0.3
2     2.5              0.3
3     2.5              0.5
4     2.4              0.4
5     2.4              0.4
6     2.3              0.5
Mean  2.4              0.4

[Figure labels: each row is an individual drawn from the population into
the sample; each cell is a variate (an individual observation); each
column is a variable; the column mean is a statistic of the sample.]
Essential definitions
• Statistic: any one of many computed or estimated statistical
quantities, such as the mean, the standard deviation, the
correlation coefficient between two variables, or the t statistic for
a two-sample t-test.
• Parameter: a numerical descriptive measure (attribute) of a
population.
• Population: a specified set of individuals (or individual
observations) about which inferences are to be made.
– Idealized population
– Operationally defined population
• Sample: a subset of individuals (or individual observations),
generally used to make inferences about the population from
which the sample is taken.
Optional materials
• Central tendency
– Arithmetic mean
– Geometric mean
– Harmonic mean
• Dispersion
– Moments
– Skewness
– Kurtosis
• Basic probability theory
– Events and event space
– Independent events
– Mutually exclusive events
– Joint events
– Empirical probability
– Conditional probability
Elementary Probability Theory
• Empirical probability of an event is taken as the relative
frequency of occurrence of the event when the number of
observations is very large.
• A coin is tossed 10000 times, and we observed head 5136
times. The empirical probability of observing a head when a
coin is tossed is then
5136/10000 = 0.5136.
• A die is tossed 10000 times and we observed number 2 up
1703 times. What is the empirical probability of getting a 2
when the die is tossed?
• If the coin and the die are fair, what are the expected
probabilities of getting a head or a number 2?
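Empirical probabilities are easy to explore by simulation. A Python sketch (the seed is an arbitrary choice to make the run repeatable):

```python
import random

random.seed(0)                    # arbitrary seed for reproducibility
n = 10000
tosses = [random.randint(1, 6) for _ in range(n)]  # a fair die
p_two = tosses.count(2) / n       # empirical probability of a 2
print(p_two)                      # close to the expected 1/6 ≈ 0.1667
```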
Mutually Exclusive Events
• Two or more events are mutually exclusive if the
occurrence of one of them excludes the occurrence of
the others.
• Example:
– observing a head and observing a tail in a single coin
tossing experiment
– events represented by null hypothesis and the alternative
hypothesis
– being a faithful husband and having extramarital affairs.
• Binomial distribution
Coin-Tossing Expt.
Suppose I ask someone to toss a coin 6 times and record the number of heads, and he
comes back to tell me that the number of heads is exactly 3. I then ask him to repeat
the tossing experiment three more times, and he always comes back to say that
the number of heads in each experiment is exactly 3. What would you think?

Experiment   Outcome (number of heads out of 6 tosses)
1            3
2            3
3            3
4            3

The probability of getting 3 heads out of 6 tosses is 0.3125 for a fair coin
following the binomial distribution (0.5 + 0.5)^6, and the probability of getting
this result 4 times in a row is 0.0095.

The person might not have done the experiment at all!
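The two probabilities can be checked with the binomial formula; a short Python sketch:

```python
from math import comb

# P(3 heads in 6 tosses of a fair coin) = C(6,3) * 0.5^6
p_three = comb(6, 3) * 0.5 ** 6
print(p_three)                 # 0.3125

# Probability of exactly 3 heads in all four independent runs
p_four_runs = p_three ** 4
print(round(p_four_runs, 4))   # 0.0095
```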
Thinking Critically
Now suppose Mendel obtained the following results:

Breeding experiment   Number of round seeds   Number of wrinkled seeds
1                     21                      7
2                     24                      8
3                     18                      6

Based on (0.75 + 0.25)^n: P1 = 0.171883; P2 = 0.161041; P3 = 0.185257; P = 0.0051
Edwards, A. W. F. 1986. Are Mendel’s results really too close? Biol. Rev. 61:295-312.
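The per-experiment probabilities follow from the binomial pmf with p = 0.75; a Python sketch reproducing the figures above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Each experiment: probability of exactly that perfect 3:1 split
p1 = binom_pmf(21, 28, 0.75)   # ~0.1719
p2 = binom_pmf(24, 32, 0.75)   # ~0.1610
p3 = binom_pmf(18, 24, 0.75)   # ~0.1853
print(p1 * p2 * p3)            # ~0.0051: all three at once is unlikely
```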
Compound Event
• A compound event, denoted by E1E2 or E1E2…EN, refers to
the event in which two or more events occur together.
• For independent events, Pr{E1E2} = Pr{E1}Pr{E2}
• For dependent events, Pr{E1E2} = Pr{E1}Pr{E2|E1}
Probability of joint events

Criteria                      Prob.
Between 25 and 45             1/2
Very bright                   1/25
Liberal                       1/3
Relatively nonreligious       2/3
Self-supporting               1/2
No kids                       1/3
Funny, sense of humor         1/3
Warm, considerate             1/2
Sexually assertive            1/2
Attractive                    1/2
Doesn't drink or smoke        1/2
Is not presently attached     1/2
Would fall in love quickly    1/5

The probability of meeting such a person satisfying all criteria is
1/648,000, i.e., if you meet one new candidate per day, it will take
you, on average, 1775 years to find your partner.

Fortunately, many criteria are correlated, e.g., a very bright
adult is almost always self-supporting.
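The 1/648,000 figure is simply the product of the individual probabilities (assuming the criteria are independent). A Python check with exact fractions:

```python
from fractions import Fraction as F

# Probabilities of the 13 criteria from the table, assumed independent
probs = [F(1, 2), F(1, 25), F(1, 3), F(2, 3), F(1, 2), F(1, 3),
         F(1, 3), F(1, 2), F(1, 2), F(1, 2), F(1, 2), F(1, 2), F(1, 5)]

p_all = F(1, 1)
for p in probs:
    p_all *= p
print(p_all)                      # 1/648000
print(float(1 / p_all / 365.25))  # ~1774 years at one candidate per day
```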
Conditional Probability
• Let E1 be the event of observing a number 2 when a die
is tossed, and E2 the event of observing an even
number. The probability denoted by Pr{E1|E2} is called
the conditional probability of E1 given that E2 has
occurred.
• What is the expected value of the conditional
probability Pr{E1|E2} with a fair die?
• What is the expected value of the conditional
probability Pr{E2|E1}?
Independent Events
• Two events (E1 and E2) are independent if the
occurrence or non-occurrence of E1 does not affect
the probability of occurrence of E2, so that Pr{E2|E1}
= Pr{E2}.
• When one person tosses a coin in Hong Kong, and
another person tosses a die in the US, the event of
observing a head and the event of getting a number 2
can be assumed to be independent.
• The event of grading students unfairly and the event
of students making an appeal can be assumed to be
dependent.
Various Kinds of Means
• Arithmetic mean
• Geometric mean
• Harmonic mean
• Quadratic mean (or root mean square)
Geometric Mean
• The geometric mean (Gx) is expressed as:

G_x = \sqrt[n]{x_1 x_2 \ldots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}

• where Π is called the product operator (and you
know that Σ is called the summation operator).
When to Use Geometric Mean
• The geometric mean is frequently used with rates of change
over time, e.g., the rate of increase in population size, the rate
of increase in wealth.
• Suppose we have a population of 1000 mice in the 1st year (x1
= 1000), 2000 mice the 2nd year (x2 = 2000), 8000 mice the 3rd
year (x3 = 8000), and 8000 mice the 4th year (x4 = 8000). This
scenario is summarized in the following table:
Year   Population size (t)   Population size (t+1)   Rate of increase PSt+1 / PSt
1      1000                  2000                    2 (population size doubled)
2      2000                  8000                    4 (population size quadrupled)
3      8000                  8000                    1 (population size stable)

What is the mean rate of increase? (2 + 4 + 1) / 3?
Wrong Use of Arithmetic Mean
• The arithmetic mean is (2 + 4 + 1) / 3 = 7/3, which
might lead us to conclude that the population is
increasing at an average rate of 7/3.
• This is a wrong conclusion because
1000 × 7/3 × 7/3 × 7/3 ≈ 12,700 ≠ 8000
• The arithmetic mean is not good for ratio variables.
Using Geometric Mean
• The geometric mean is:

G_x = \sqrt[3]{2 \times 4 \times 1} = \sqrt[3]{8} = 2

• This is the correct average rate of increase. On
average, the population size has doubled every year
over the last three years, so that x4 = 1000 × 2 × 2 × 2 =
8000 mice.
• Alternative: solve 1000 × r³ = 8000 for r.
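The same computation in a short Python sketch:

```python
import math

rates = [2, 4, 1]                      # yearly rates of increase

# The arithmetic mean overstates growth for ratio variables...
am = sum(rates) / len(rates)           # 7/3 ≈ 2.33

# ...while the geometric mean recovers the true average rate.
gm = math.prod(rates) ** (1 / len(rates))
print(am, gm)                          # ≈2.33, ≈2.0
print(1000 * gm ** 3)                  # back to ≈8000 mice
```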
The Ratio Variable
• Example:
– Year 1: milk price per quart / bread price per loaf = 3
– Year 2: milk price per quart / bread price per loaf = 2
• The arithmetic mean ratio is r1 = (3 + 2) / 2 = 2.5
• What is the mean ratio of bread price to milk price?
– Ratio in Year 1 = 1/3; ratio in Year 2 = 1/2
– Arithmetic mean ratio is r2 = (1/3 + 1/2) / 2 = 5/12 =
0.4167
• But r1 ≠ 1/r2. What's wrong?
• Conclusion: the arithmetic mean is no good for ratios
• Conclusion: Arithmetic mean is no good for ratios
Using Geometric Mean
• Geometric mean of the milk/bread ratios:

r_1 = \sqrt{3 \times 2} = \sqrt{6} = 2.4495

• Geometric mean of the bread/milk ratios:

r_2 = \sqrt{(1/3) \times (1/2)} = \sqrt{1/6} = 0.4082

\frac{1}{r_1} = 0.4082 = r_2
Moments and distribution
• The moment (m_r):

m_r = \frac{X_1^r + X_2^r + \ldots + X_N^r}{N} = \frac{\sum_{i=1}^{N} X_i^r}{N}

• The central moment (\mu_r):

\mu_r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^r}{N}

• The first moment is the arithmetic mean.
• The second central moment
– is the population variance when N is equal to the population
size (typically assumed to be infinitely large)
– is the sample variance when the divisor N is replaced by n − 1,
where n is the sample size
• The standardized moment (\alpha_r) is the moment of the
standardized X:

\alpha_r = \frac{\mu_r}{\sigma^r} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma} \right)^r

– \alpha_1 = 0
– \alpha_2 = 1
– \alpha_3 is the population skewness; the sample skewness with
sample size n is

G_1 = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\hat{\sigma}} \right)^3

(with \hat{\sigma} computed using divisor n).
Skewness

[Figure: right-skewed (+) and left-skewed (−) distributions, x from −6 to 6]
Kurtosis

[Figure: leptokurtic (kurtosis > 0), normally distributed, and
platykurtic (kurtosis < 0) curves, x from −6 to 6]

The sample (excess) kurtosis with sample size n is:

G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
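These adjusted sample estimators (the same ones Excel's SKEW and KURT functions compute, which is an assumption about the slide's intent) can be checked directly; a Python sketch:

```python
import statistics

def skew_kurt(x):
    """Sample skewness (G1) and excess kurtosis (G2), using the
    adjusted Fisher-Pearson estimators (as in Excel's SKEW and KURT)."""
    n = len(x)
    m = statistics.mean(x)
    s = statistics.stdev(x)                      # divisor n - 1
    z3 = sum(((xi - m) / s) ** 3 for xi in x)
    z4 = sum(((xi - m) / s) ** 4 for xi in x)
    g1 = n / ((n - 1) * (n - 2)) * z3
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2

g1, g2 = skew_kurt([1, 2, 3, 4, 5])
print(g1, g2)    # 0.0 (symmetric) and -1.2 (flatter than normal)
```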
Empirical frequency distributions
Chest (inches)  Number of men
33              3
34              18
35              81
36              185
37              420
38              749
39              1073
40              1079
41              934
42              658
43              370
44              92
45              50
46              21
47              4
48              1

Marks (mid-point)  Number of candidates
400                24
750                74
1250               38
1750               21
2250               11
2750               8
3250               11
3750               5
4250               2
4750               1
5250               3
5750               1
6250               0
6750               0
7250               0
7750               1
SAS Program and Output
data chest;
input chest number;
cards;
33 3
34 18
35 81
36 185
37 420
38 749
39 1073
40 1079
41 934
42 658
43 370
44 92
45 50
46 21
47 4
48 1
;
proc univariate normal plot;
freq number;
var chest;
run;

Univariate Procedure
Variable=CHEST

N            5738        Sum Wgts    5738
Mean         39.83182    Sum         228555
Std Dev      2.049616    Variance    4.200925
Skewness     0.03333     Kurtosis    0.06109
USS          9127863     CSS         24100.71
CV           5.145674    Std Mean    0.027058
T:Mean=0     1472.102    Pr>|T|      0.0001
Num ^= 0     5738        Num > 0     5738
M(Sign)      2869        Pr>=|M|     0.0001
Sgn Rank     8232596     Pr>=|S|     0.0001
D:Normal     0.098317    Pr>D        <.01

USS = Sum(xi²)
CSS = Sum(xi − MeanX)²
SAS Program and Output
Univariate Procedure
Variable=marks

N            200         Sum Wgts    200
Mean         1465.5      Sum         293100
Std Dev      1179.392    Variance    1390965
Skewness     2.031081    Kurtosis    5.180086
USS          7.0634E8    CSS         2.768E8
CV           80.47708    Std Mean    83.39558
T:Mean=0     17.57287    Pr>|T|      0.0001
Num ^= 0     200         Num > 0     200
M(Sign)      100         Pr>=|M|     0.0001
Sgn Rank     10050       Pr>=|S|     0.0001
W:Normal     0.767621    Pr<W        0.0001
data Grade;
input marks number;
cards;
400 24
750 74
1250 38
1750 21
2250 11
2750 8
3250 11
3750 5
4250 2
4750 1
5250 3
5750 1
6250 0
6750 0
7250 0
7750 1
;
proc univariate normal plot;
freq number;
var marks;
run;
SAS Graph
DATA;
DO X=-5 TO 5 BY 0.25;
DO Y=-5 TO 5 BY 0.25;
Z=SIN(SQRT(X*X+Y*Y));
OUTPUT;
END;
END;
PROC G3D;
PLOT Y*X=Z/CAXIS=BLACK CTEXT=BLACK;
TITLE 'Hat plot';
FOOTNOTE 'Fig. 1, Xia';
RUN;