Another Powerpoint with some information about the normal

Download Report

Transcript Another Powerpoint with some information about the normal

Problem: Diagnosing Spina Bifida
The procedure of amniocentesis involves
drawing a sample of the amniotic fluid that
surrounds an unborn child in its mother’s
womb.
High concentration of alpha fetoprotein can
indicate the condition spina bifida.
Concentration of alpha fetoprotein tends to
increase with the size of the foetus.
Amniocentesis results in miscarriage for 1%.
Preliminary tests involve measuring the level
of alpha fetoprotein in the mother’s urine.
Problem: Diagnosing Spina Bifida
For mothers with normal foetuses, the mean level of
alpha fetoprotein is 15.73 moles/litre with a
standard deviation of 0.72 moles/litre.
For mothers carrying foetuses with spina bifida, the
mean is 23.05 and the standard deviation is 4.08.
In both groups the distribution of alpha fetoprotein
appears to be approximately Normally distributed.
15.73
23.05
Problem: Diagnosing Spina Bifida
To operate a diagnostic test for spina bifida,
set a threshold concentration of alpha
fetoprotein, T, say.
If the alpha fetoprotein level is below T, then
the foetus is diagnosed as not having spina
bifida.
If the level is above T, then further testing is
required.
Problem: Diagnosing Spina Bifida
If T was set at 17.80 moles/litre:
What is the probability that a foetus with spina
bifida is correctly diagnosed?
What is the probability that a foetus not suffering
from spina bifida is correctly diagnosed?
If they wanted to ensure that 99% of foetuses with
spina bifida were correctly diagnosed, at what level
should they set T ? What are the implications of
setting T at this level?
Continuous Random Variables
If a random variable, X, can take any value in
some interval of the real line it is called a
continuous random variable.
E.g.
Hg levels, height, weight,
alpha fetoprotein concentration, cell
radius, etc. (i.e. usually quantities that
are ‘measurements’ )
The Standardized Histogram
Example: Dietary Carbohydrate in the
Workforce
The average daily intake of carbohydrate in the
diet of 5929 people.
The Standardized Histogram
(a) The histogram of the data shows the
carbohydrate intake:
.004
.002
.000
0
200
400
600
Carbohydrate (g/day)
800
* Is unimodal (modal class 200 – 225 g/day)
The Standardized Histogram
(a) The histogram of the data shows the
carbohydrate intake:
.004
.002
.000
0
200
400
600
Carbohydrate (g/day)
800
* Skewed to larger values (skewed right)
The Standardized Histogram
(a) The histogram of the data shows the
carbohydrate intake:
.004
.002
.000
0
200
400
600
Carbohydrate (g/day)
800
* Has huge variability
(highest consumers more than 10 times that
of lowest consumers)
The Standardized Histogram
(b) Area between a = 225 and b = 375 shaded
Shaded area = 0.483
(Corresponds to 48.3% of
observations)
.004
.002
.000
0
2 25
600
375
800
The Standardized Histogram
Rel. Freq/Width
(b) The standardized histogram adjusts the
height of the rectangle or bar to relative freq. or
proportion divided by interval width so that
Area = Estimated Probability
Shaded area = 0.483
(Corresponds to 48.3% of
observations)
.004
.002
.000
0
225
375
60 0
80 0
The Standardized Histogram
Rel. Freq/Width
(b) i.e. The area of the ith rectangle tells us
what proportion of the data lie in the ith
class interval.
Shaded area = 0.483
(Corresponds to 48.3% of
observations)
.004
.002
.000
0
225
375
60 0
80 0
The Standardized Histogram
For a standardized histogram:
The vertical scale is :
Relative frequency / interval width
(density scale)
Total area under the histogram = 1
The proportion of the data between a and b
is the area “under” the histogram
between a and b.
The Standardized Histogram
(c) With approximating curve
You can do this with
Fit Distribution >
Smooth Curve in JMP.
. 004
. 002
0
200
400
Carbohydrate (g/day)
600
800
The Standardized Histogram
(d) Area between a = 225 and b = 375 shaded
Shaded area = .486
.04
(cf. area = .483 for histogram)
.002
0
225
375
600
800
This area is calculated to be 0.486 and is very
close to the proportion of people who had
carbohydrate intake of between 225 and 375 g/day.
Radius of Maliginant Tumor Cells
In JMP select Histogram Options > Density
Axis to create a standardized histogram
The histogram on the left is for cell radii of malignant
tumor fine needle aspirations in the breast cancer study
from your 2nd assignment.
X = radius of a randomly selected malignant tumor cell
We estimate that,
P(14 < X < 15) = .10
or a 10% chance
AFP Levels in Spina Bifida Cases
In JMP select Histogram Options > Density
Axis to create a standardized histogram
The histogram on the left is AFP levels
found in the urine of mothers carrying a
fetus with spina bifida.
X = AFP level of random select mother carrying
fetus with spina bifida.
We estimate that,
P(22.5 < X < 25) = 2.5 X .10 = .25
or a 25% chance
Smooth Density Curves
Take a standardized histogram, decrease the
width of the class intervals and increase the
number of observations. Then the top of the
histogram tends to a smooth curve.
Histogram  Smooth Density Curves as
sample size increases! (AFP Levels)
n = 500
n = 100
n = 10,000
n = 100,000
n = 1,000,000
0.08
0.05
0.03
10
20
30
40
Density
0.10
Properties of the Probability Density
Function (p.d.f.)
1. f(x)  0 (i.e. the p.d.f. curve stays above
the x-axis)
2. P(a  X  b) = area from a to b
beneath the p.d.f curve
3. Area under the p.d.f. curve = 1
Endpoints of Intervals
For a continuous random variable, X,
endpoints of intervals are unimportant.
P(a  X  b) = P(a < X  b)
= P(a  X < b)
= P(a < X < b)
= area from a to b between the
p.d.f. curve and the x-axis.
(Inclusion or exclusion of the endpoints will not
change the area.)
The Normal Distribution
Limiting smooth bell shaped symmetric curve
is called the Normal p.d.f. curve.
Is symmetric about
the mean.
Mean = Median
50%
50%
Mean 
If a random variable, X, has a Normal
distribution with a mean and a standard
deviation we write:
X ~ Normal ( ,  )
parameters
The Normal Distribution
• The Normal distribution is important because:
– it fits a lot of data reasonably well;
– it can be used to approximate other distributions;
e.g. the binomial distribution
– it is important assumption made about the
distributional shape of continuous variables we
are working with when using parametric tools for
statistical inference (e.g. t-Tests & F-Tests).
The Normal Distribution
• A Normal distribution is solely determined by  and .
(a) Changing 
Shifts the curve along the axis
The Normal Distribution
A Normal distribution is solely determined by  and .
(b) Increasing

Increases the spread and flattens the curve
Spina Bifida Example
Let X be the AFP level found in the urine of
mother carrying a foetus with spina bifida.
We will assume that the AFP level is normally
distributed with a mean of  = 23.05 moles/L and
a standard deviation of  = 4.08 moles/L .
AFP Levels for Mothers Carrying Spina Bifida Foetus
Spina Bifida Example (Empirical Rule)
Approximately 68 % of mothers in this population
will have a AFP levels within 1 standard deviation of
the mean.
i.e., approximately 68 % of mothers in this
population will have AFP levels
between 23.05 – 4.08 and 23.05 + 4.08
=
between
18.97 and 27.13
Spina Bifida Example (Empirical Rule)
Approximately 95 % of mothers in this population
will have a AFP levels within 2 standard deviation of
the mean.
i.e., approximately 95 % of mothers in this
population will have AFP levels
between 23.05 - 2  4.08 and 23.05 + 2  4.08
= between 14.89 and 31.21
Spina Bifida Example (Empirical Rule)
Approximately 99.73 % of mothers in this
population will have a AFP levels within 3 standard
deviation of the mean.
i.e., approximately 99.73% of mothers in this
population will have AFP levels
between 23.05 - 3  4.08 and 23.05 - 3  4.08
= between 10.81 and 35.29
The Normal Distribution
For the Normal Distribution:
A random observation has approximately:
– 68% chance of falling within 1 of  ;
– 95% chance of falling within 2 of  ;
– 99.7% chance of falling within 3 of  .
Or:
In a Normal distribution, approximately:
– 68% of observations are within 1 of  ;
– 95% of observations are within 2 of  ;
– 99.7% of observations are within 3 of  .
The Normal Distribution
Probabilities and numbers of standard
deviations
Shaded area = 0.683
-  +
68% chance of
falling between
 -  and  + 
Shaded area = 0.954
 - 2

 + 2
95% chance of
falling between
 - 2 and  + 2
Shaded area = 0.997
 - 3

 + 3
99.7% chance of
falling between
 - 3 and  + 3
Problem: Diagnosing Spina Bifida
For mothers with normal foetuses, the mean level of
alpha fetoprotein is 15.73 moles/litre with a
standard deviation of 0.72 moles/litre.
For mothers carrying foetuses with spina bifida, the
mean is 23.05 and the standard deviation is 4.08.
In both groups the distribution of alpha fetoprotein
appears to be approximately Normally distributed.
15.73
23.05
Given this
For example
weinformation
might like to
want>to17.8)
be able or
to
find:we P(X
find probabilities
P(19 <with
X < these
25) etc…
associated
distributions.
for either
group.
Obtaining Probabilities
Normal distribution probabilities can be
obtained from all statistical packages by
giving the mean and standard deviation of
the distribution.
– Most tables in books give the value of P(X  x).
– i.e., cumulative or lower tail probabilities.
OR
Area = P(X  x)
x
Problem: There are infinitely many normal
distributions we might be interested in
• Because there are infinitely many values for  and
 that we might be interested in we would need a
book infinitely thick to contain all the possible tables
we might need.
• Even for our motivating example we would need to
two separate tables to find probabilities for mothers
with healthy fetuses and another for mothers
carrying fetus with spina bifida.
• We need to have “standard” normal table that
allows us to easily find probabilities for all
situations.
Standardization and the
Standard Normal Distribution
If X ~ N(,) then if we define a new random
variable Z as,
X-
Z = _______ ~

N(0,1), i.e. a normal
distribution with
mean 0 and standard
deviation 1.
We say Z has a standard normal distribution.
This process is called standardization.
Obtaining Probabilities
Basic method for obtaining probabilities
1. Sketch a Normal curve, marking on the mean
and values of interest.
2. Shade the area under the curve corresponding
to the required probability.
3. Convert all values to their z-scores
4. Obtain the desired probability using a standard
normal table or better yet use JMP.
Standard Normal Distribution
• a) P(Z > 2.25)
0
• b) P(Z < 1.28)
0
Standard Normal Distribution
• c) P(Z > .50)
0
• d) P(Z <-2.33)
0
Standard Normal Distribution
e) P(-1.96 < Z < 1.96)
0
0
0
Standard Normal Distribution
• f) Find z so that P(Z < z) = .95, i.e. what is
the 95th percentile of the standard normal
distribution?
0
Original problem:
Diagnosing Spina Bifida
15.73 23.05
Recall:
• For normal foetuses  =15.73,  = 0.72 and
for foetuses with spina bifida  = 23.05 and
 = 4.08.
• Assume the threshold for detecting
spina bifida is set at 17.8.
– (A foetus would be diagnosed as not having
spina bifida if the fetoprotein level is below 17.8)
Original problem:
Diagnosing Spina Bifida
15.73 23.05
a) What is the probability that a foetus not
suffering from spina bifida is correctly
diagnosed?
Therefore 99.80% of healthy foetuses
Let X be level of fetoprotein in normal foetus
will be diagnosed correctly. This is
X ~ Normal (15.73, 0.72) What is P(X < 17.8)?
called the specificity of the test, i.e.
= (17.8for
– 15.73)/.72
P(X < 17.8) = z-score
P(Z < z-score
17.8)
P(T |D ).
= 2.07/.72 = 2.88
P(X < 17.8) = P(Z < 2.88)
= .9980
0
15.73
2.88
17.8
Original problem:
Diagnosing Spina Bifida
15.73 23.05
b) What is the probability that a foetus with
spina bifida is correctly diagnosed?
Let Y be the
of fetoprotein
spina
Therefore
thelevel
probability
that ina afoetus
bifida foetus. Y ~ Normal (23.05, 4.08)
with spina bifida is correctly diagnosed
P(Y > 17.8) = P(Z > z-score for Y = 17.8)
is .901 or 90.1% of foetuses with spina
P(Z>-1.29) =z-score
1-P(Z <= -1.29)
(17.8 – 23.05)/4.08
bifida will= be
This is called
1 – detected.
0.099
= -5.25/4.08 = -1.29+
= 0.901
the sensitivity
of the test, i.e. P(T |D+).
17.8
-1.29
023.05
Original problem:
Diagnosing Spina Bifida
15.73 23.05
If they wanted to ensure that 99% of foetuses with
spina bifida were correctly diagnosed, at what
level should they set T ?
Find a value T so that if
First
theensures
z-score
associated
with T by
T = find
13.54
From
Normal
Table
we 99%
find of foetuses
finding
z~ so
that
Xspina
Normal
(23.05,
4.08)
with
bifida
will
be
identified.
P(Z < -2.33) = .0100 thus
P(Z < z) = .0100
TAgain
=we +

xhave
z probability
= 23.05 – 4.08
x 2.33 the
= 13.54
will
this
is called
sensitivity.
P(X > T) = .9900 or P(X < T) = .0100
Standard Normal Probabilities in JMP
• Normal Probability Calculator in JMP from Tutorials
section of course website.
• Here it is ready to calculate probabilities for the
standard normal distribution. ( = 0,  = 1)
Arbitrary Normal Probabilities in JMP
• Change the mean and standard deviation columns to
contain the desired values. For mothers carrying
foetus with spina bifida: X ~ N(23.05,4.08), i.e.
 = 23.05 moles/liter &  = 4.08 moles/liter
Here we have found P(X < 17.8) and P(X > 17.8)