Stat 651
Lecture 4
Copyright (c) Bani Mallick
Topics in Lecture #4
Probability
The bell-shaped (normal) curve
Normal probability plots (the q-q plot) to
check for normality of continuous data
Use of Table 1 in the back of the book
Topics in Lecture #4
Normal probability calculations
Data Transformations
Sampling distributions: sample means are
random variables!
Standard error of the sample mean
Central Limit Theorem
A simple confidence interval
Book Sections Covered in Lecture #4
Chapter 4.10, in detail
Chapter 4.11 (read on your own)
Chapter 4.12, in detail
Chapter 5.1
Chapter 5.2
Lecture 3 Review
Box plots are probably the best way to
compare populations graphically
You can detect shifts and changes in variation
Also identifies outliers
Lecture 3 Review
q-q plots are a simple way to understand
whether the data are approximately bell-shaped
Population Relative Frequency Histogram
Bell-shaped curve!!
[Figure: normal density curve. Vertical axis: Normal Density; horizontal axis: X, from -4 to 4.]
Lecture 3 Review
q-q plots are a simple way to understand
whether the data are approximately bell-shaped
If they are sort of straight, then normality of
the population relative frequency histogram is
not too badly off
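A minimal sketch of how the points on a normal q-q plot could be computed by hand. The data values are made up, and the plotting positions (i - 0.5)/n are one common convention assumed here; statistical packages may use a slightly different formula.

```python
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Return (observed, expected normal) pairs for a normal q-q plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Normal model fitted with the sample mean and sample standard deviation
    fitted = NormalDist(mu=mean(ordered), sigma=stdev(ordered))
    # Assumed plotting positions (i - 0.5)/n for i = 1..n
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, expected))


# Hypothetical log(saturated fat) values, for illustration only
log_fat = [2.1, 2.6, 2.9, 3.0, 3.2, 3.3, 3.5, 3.9]
for observed, expected in qq_points(log_fat):
    print(f"{observed:5.2f}  {expected:5.2f}")
```

If the printed pairs fall close to a straight line when plotted, the bell-shaped model is not too badly off.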
q-q plot for the healthy women
[Figure: Normal Q-Q Plot of Log(Saturated Fat). Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]
Lecture 3 Review
For bell-shaped populations, we have
empirical rules
Approximately 68% (90%) (95%) of the
population lies within 1 (1.645) (1.96)
population standard deviations σ of the
population mean μ
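The empirical rules can be checked against the normal model directly. This is a small sketch using Python's standard library; the function name `coverage` is just an illustrative choice.

```python
from statistics import NormalDist


def coverage(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1.0, 1.645, 1.96):
    print(f"within {k} standard deviations: {coverage(k):.4f}")
```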
Lecture 3 Review
In many of our examples, we have seen that
there look to be differences among
populations. How can we tell if the
differences are real?
We will say that populations are different if
the differences we observe are more than can
be expected by sample-to-sample variability.
Lecture 3 Review
A random variable is any outcome
(qualitative or numerical) from an experiment
involving random sampling from a population
The idea of a model is to write down a
formula for the population histogram as a
function of 1-2 parameters which are
estimated from the data.
If you know the parameters of the model,
then you know everything about probabilities
in that population
Using the Normal Model
The entire point of the normal model is to
make probability statements
In practice, we estimate the population mean
μ by the sample mean X̄
We estimate the population standard
deviation σ by the sample standard deviation s
Then we estimate probabilities, by pretending
the sample quantities = the population ones
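A rough sketch of this plug-in idea: compute the sample mean and sample standard deviation, then treat them as if they were the population values. The ages below are invented for illustration, and `plug_in_prob_below` is a hypothetical helper name.

```python
from statistics import NormalDist, mean, stdev


def plug_in_prob_below(sample, c):
    """Estimate Pr(X < c) by pretending sample mean and sd are the population ones."""
    fitted = NormalDist(mu=mean(sample), sigma=stdev(sample))
    return fitted.cdf(c)


# Made-up ages, for illustration only
ages = [34.0, 38.5, 41.0, 45.5, 36.0, 47.0, 39.5, 42.5]
print(f"estimated Pr(X < 43.3) = {plug_in_prob_below(ages, 43.3):.4f}")
```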
Various Cases
Suppose we want to know what % of a
population lies below a specified value, c
We write this by asking: what is
Pr(X < c)
The value c is any arbitrary value, e.g., 6
X is any random variable with a population
mean μ and a population standard deviation
σ
Pr(X < c) for Normal Populations
Compute the z-score
z = (c - μ)/σ
Look up value in Table 1, page 1091
(white board explanation)
Mechanics
NHANES: suppose healthy women’s ages are
normally distributed with mean μ = 40 and
standard deviation σ = 6
What is the chance that a randomly selected
person from this population is aged c = 43.3
or less?
We write this in symbols as pr(X < 43.3)
Mechanics
μ = 40, σ = 6
pr(X < 43.3) is what we want
z = (43.3 - μ)/σ = (43.3 - 40)/6 = 0.55 = z-score
Look up in Table 1:
The value 0.55 is on page 1092: first column
is 0.5, first row is 0.05: add them to get
0.55, and look up the value
Pr(X < 43.3) = 0.7088
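The table lookup can be reproduced in software. The sketch below uses the standard normal distribution from Python's standard library in place of Table 1; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, c = 40.0, 6.0, 43.3

# Step 1: convert c to a z-score
z = (c - mu) / sigma

# Step 2: "look up" z in the standard normal table
p_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, Pr(X < {c}) = {p_below:.4f}")
```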
Various Cases
Suppose we want to know what % of a
population lies above a specified value, c
We write this by asking: what is
Pr(X > c)
The value c is any arbitrary value, e.g., 6
X is any random variable with a population
mean μ and a population standard deviation
σ
Pr(X > c) for Normal Populations
This is simply 1 – Pr(X <= c).
Compute the z-score (c - μ)/σ
Look up the value for z in Table 1
Subtract this value from 1.0
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from
this population is aged 46 or more
pr(X > 46)
z = (46 - μ)/σ = 1
Look up in Table 1 for 1.00: get 0.8413
Because you are asking for > 46, subtract
from 1 to get pr(X > 46) = 1 – 0.8413 =
0.1587
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from
this population is aged 46 or less
pr(X <= 46)
z = (46 - μ)/σ = 1
Look up in Table 1: chance is 0.8413 = 84.13%
Mechanics
μ = 40, σ = 6
Chance that a randomly selected person from
this population is aged 34 or less
pr(X <= 34)
z = (34 - μ)/σ = -1
Look up in Table 1: chance is 0.1587 =
15.87%
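The three "Mechanics" cases for ages 46 and 34 can be put side by side. This sketch works with the normal model for ages directly rather than standardizing first. It also shows that the chance of being 46 or more equals the chance of being 34 or less, because the bell-shaped curve is symmetric about its mean.

```python
from statistics import NormalDist

# The assumed population from the slides: mean 40, standard deviation 6
ages = NormalDist(mu=40.0, sigma=6.0)

p_46_or_less = ages.cdf(46.0)        # z = 1
p_46_or_more = 1.0 - ages.cdf(46.0)  # upper tail = 1 - lower tail
p_34_or_less = ages.cdf(34.0)        # z = -1

print(f"pr(X <= 46) = {p_46_or_less:.4f}")
print(f"pr(X > 46)  = {p_46_or_more:.4f}")
print(f"pr(X <= 34) = {p_34_or_less:.4f}")
```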
Aortic Stenosis Data
Two populations: healthy kids and kids with
aortic stenosis
Two outcomes: body surface area and aortic
valve area
Size-adjusted aortic valve area is the ratio of
aortic valve area to body surface area
Stenosis Data, AVA to BSA Ratio: Note the huge outlier in the stenotic kids. He/she has a huge aortic valve area relative to his/her body size
[Figure: box plots of the AVA/BSA ratio by Health Status. Healthy: N = 70; Stenotic: N = 56.]
Aortic Stenosis Data
Healthy kids and AVA/BSA Ratio
Sample mean = 1.38, s = 0.51
Let’s pretend the population has μ = 1.4, σ = 0.5
As it turns out, the sample mean of stenotic
kids is 0.7
So, let’s ask: for healthy kids, what is
pr(X < 0.7)?
Aortic Stenosis Data
Healthy kids and AVA/BSA Ratio
μ = 1.4, σ = 0.5
For healthy kids, what is pr(X <= 0.7)?
z = (0.7 - μ)/σ = -1.4
Look up in Table 1
You should get 0.0808
Aortic Stenosis Data
For healthy kids, pr(X <= 0.7) = 0.0808
Stenotic kids have a mean ava/bsa ratio of
0.7
Thus, the average stenotic kid has a lower
ava/bsa ratio than 91.92% of healthy kids
91.92% = 100% - 8.08%
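The stenosis comparison is the same calculation with different numbers. A short sketch, using the pretend population values for healthy kids from the slides:

```python
from statistics import NormalDist

mu, sigma = 1.4, 0.5   # pretend population values for healthy kids
stenotic_mean = 0.7    # sample mean AVA/BSA ratio of stenotic kids

z = (stenotic_mean - mu) / sigma
p_below = NormalDist().cdf(z)   # healthy kids with a ratio at or below 0.7
p_above = 1.0 - p_below         # healthy kids with a ratio above 0.7

print(f"z = {z:.1f}, below = {p_below:.4f}, above = {p_above:.4f}")
```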
Not all Data are Normally Distributed
“Time to an event”, e.g., time to a heart
attack
Number of things that happen, e.g., number
of heart attacks
These typically have a skew shape
[Figure: a skewed density curve. Vertical axis: DENSITY; horizontal axis: X.]
Not all Data are Normally Distributed
These typically have a skew shape
Statisticians have special models to handle
this (Gamma, Poisson)
You will usually try to eliminate some of the
skewness by data transformation
Not all Data are Normally Distributed
The standard data transformations are
Square root
Logarithm: but if you have zeros in the data
set, you have to add a small constant, since
log(0) = -∞
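A small sketch of the two standard transformations. The counts are invented, and the shift of 0.5 is only an example of a "small constant"; the slides do not say which value to use.

```python
import math


def sqrt_transform(values):
    """Square-root transformation (values must be non-negative)."""
    return [math.sqrt(v) for v in values]


def log_transform(values, shift=0.0):
    """Natural log of (value + shift); the shift must make every term positive."""
    return [math.log(v + shift) for v in values]


# Made-up skewed counts that include a zero
counts = [0, 1, 1, 2, 3, 5, 12]

print(sqrt_transform(counts))
print(log_transform(counts, shift=0.5))  # 0.5 is an assumed small constant
```

Calling `log_transform(counts)` without a shift fails on the zero, which is why the constant is needed.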
Inference
The basic building blocks for inference are
statistics
Let’s start with the population mean μ, the
sample mean X̄, and the sample standard
deviation s
Standard error (of the mean) is s/√n
Inference
The sample mean X̄ is a random variable
This means that it varies from sample to
sample
Of course, if we were able to “sample” the
entire population, the sample mean would
equal the population mean μ
Inference
The sample mean X̄ is a random variable
Its own “population” mean is μ
Its standard deviation is σ/√n
Note how the standard deviation of the
sample mean becomes smaller as the
sample size becomes larger
Why does this make sense?
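One way to see this is by simulation. The sketch below repeatedly draws samples from a normal population (mean 40 and standard deviation 6, borrowed from the age example), computes the sample mean each time, and compares the spread of those means with σ/√n. The number of repetitions and the seed are arbitrary choices.

```python
import math
import random
from statistics import stdev


def sd_of_sample_means(n, mu=40.0, sigma=6.0, reps=2000, seed=0):
    """Standard deviation of `reps` simulated sample means, each from a sample of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return stdev(means)


for n in (4, 16, 64):
    simulated = sd_of_sample_means(n)
    theory = 6.0 / math.sqrt(n)
    print(f"n = {n:2d}: simulated {simulated:.3f}, sigma/sqrt(n) = {theory:.3f}")
```

Each time the sample size is multiplied by 4, the standard deviation of the sample mean is roughly halved.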
Central Limit Theorem
The sample mean X̄ is a random variable
Its own “population” mean is μ
Its standard deviation is σ/√n
In “large enough” samples, the sample mean
is very nearly normally distributed, i.e., has a
bell-shaped histogram
What does this mean?
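A simulation sketch of what this means. It assumes a strongly skewed population (an exponential distribution, chosen here only for illustration) and measures the skewness of the histogram of sample means. A bell-shaped histogram has skewness near 0, so the skewness should shrink as the sample size grows.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skewness_of_sample_means(n, reps=4000, seed=1):
    """Skewness of `reps` sample means, each from n draws of a skewed population."""
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    return skewness(means)


for n in (2, 10, 50):
    print(f"n = {n:2d}: skewness of sample means = {skewness_of_sample_means(n):.2f}")
```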
Warning
It is incredibly easy to have difficulty
understanding that the sample mean is itself
a random variable
But it is the crucial concept
If I take repeated samples and compute the
sample mean each time, I will not get the
same number.
Thus, the sample mean is a random variable
Women’s Interview Survey of Health
Funny case-control study
Seemed to indicate that those women who
ate a lot of non-chocolate sweets were at
higher risk of breast cancer
271 women controls were interviewed about their
diets
They completed six 24-hour recalls
Women’s Interview Survey of Health
271 women controls were interviewed about their
diets and completed six 24-hour recalls
Hawthorne effect: the more you ask
people about their lives, the more they will
change
Does this happen here?
If so, we’d expect that their caloric intake
decreased the more they were asked about
their diet
Women’s Interview Survey of Health
To test the Hawthorne effect, we took the
average caloric intake from the first two
interviews, and subtracted it from the
average caloric intake from the last 2
interviews
X = (average of 5 & 6) – (average of 1 & 2)
Do you think the population mean of X is
positive or negative?
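A sketch of how X could be computed for one woman from her six recalls. The calorie values are invented, and `change_in_intake` is a hypothetical helper, not part of the study's software.

```python
def change_in_intake(recalls):
    """X = (average of recalls 5 & 6) - (average of recalls 1 & 2)."""
    if len(recalls) != 6:
        raise ValueError("expected six 24-hour recalls")
    first_two = (recalls[0] + recalls[1]) / 2
    last_two = (recalls[4] + recalls[5]) / 2
    return last_two - first_two


# Made-up caloric intakes for one woman, recalls 1 through 6
example = [2000, 2100, 1900, 1950, 1800, 1700]
print(change_in_intake(example))
```

A negative X means the woman reported fewer calories in the last two interviews than in the first two.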
WOMEN’S INTERVIEW SURVEY OF
HEALTH (WISH)
My guess was that because of various factors
(societal pressure, awareness of diet,
Hawthorne effect), they will report fewer
calories at the second time period
My hypothesis is that the population mean of
X is < 0.
WISH: Change in Caloric Intake
Does it look like a big change?
[Figure: box plot of the change in mean energy (last 2 recalls minus first 2 recalls), N = 271. Vertical axis runs from -3000 to 2000.]
WISH: Change in Calories
[Figure: Normal Q-Q Plot of Change in mean Energy. Horizontal axis: Observed Value; vertical axis: Expected Normal Value.]
Does this look straight enough to be happy thinking that X is approximately normally distributed?
WISH
What does an IQR of 838 mean?
Descriptives: Change in mean Energy (last 2 recalls minus first 2 recalls)

Statistic                            Value        Std. Error
Mean                                 -180.1262    37.2202
95% Confidence Interval for Mean
  Lower Bound                        -253.4050
  Upper Bound                        -106.8474
5% Trimmed Mean                      -171.6543
Median                               -128.2150
Variance                             375428.7
Std. Deviation                       612.7223
Minimum                              -2235.00
Maximum                              1567.96
Range                                3802.96
Interquartile Range                  838.1900
Skewness                             -.253        .148
Kurtosis                             .608         .295
WISH
The sample size is n = 271
The sample mean change = -180 calories!
The sample standard deviation = 613
The sample standard error = 37
Empirical rule, the chance is 95% that the
population mean is within 1.96 * 37 ≈ 73 of -180, i.e., between about -253 and -107
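The arithmetic behind this simple confidence interval, sketched with the unrounded mean and standard deviation from the descriptives output. The output's own interval (-253.4 to -106.8) is slightly wider, presumably because the software uses a t multiplier instead of 1.96; that explanation is an assumption, not something stated in the slides.

```python
import math

n = 271
sample_mean = -180.1262  # from the descriptives output
sample_sd = 612.7223     # from the descriptives output

# Standard error of the mean: s / sqrt(n)
standard_error = sample_sd / math.sqrt(n)

# Empirical-rule (1.96) interval for the population mean
half_width = 1.96 * standard_error
lower = sample_mean - half_width
upper = sample_mean + half_width

print(f"standard error = {standard_error:.2f}")
print(f"95% interval: {lower:.1f} to {upper:.1f}")
```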
WISH
Empirical rule, the chance is 95% that the
population mean is between about
-253 and -107
What does this mean?
Is there a Hawthorne effect going on?
Can you attach a probability to this?