Lecture 4 - ECOL 2006


The Normal Distribution and
Inferential Statistics
BIOL 2608 Biometrics
Tabular Presentation of Data
e.g. 1. The number of sparrows' nests per hectare was counted for each of 36 hectares:

1 1 0 1 3 0
0 0 1 1 0 1
0 1 4 0 1 2
2 1 2 1 1 3
1 2 1 0 2 1
1 0 0 1 1 1
e.g. 1. A frequency distribution table for the number of sparrows' nests per hectare

No. of sparrows' nests:   0    1    2    3    4
No. of hectares:         10   18    5    2    1
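As a sketch (not part of the lecture), the frequency table of e.g. 1 can be reproduced with Python's standard library; the 36 nest counts are copied from the data above.

```python
# Build the frequency distribution of e.g. 1 (sparrows' nests per hectare).
from collections import Counter

nests = [1, 1, 0, 1, 3, 0,
         0, 0, 1, 1, 0, 1,
         0, 1, 4, 0, 1, 2,
         2, 1, 2, 1, 1, 3,
         1, 2, 1, 0, 2, 1,
         1, 0, 0, 1, 1, 1]

freq = Counter(nests)            # maps each nest count to its frequency
for n_nests in sorted(freq):
    print(n_nests, freq[n_nests])
```

This prints exactly the table above: 0→10, 1→18, 2→5, 3→2, 4→1.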
Frequency Distribution
e.g. 2. The particle sizes (μm) of 37 grains from a sample of sediment from an estuary. Define convenient classes (equal width) and class intervals, e.g. 1 μm:

8.2 6.3 6.8 6.4 8.1 6.3
5.3 7.0 6.8 7.2 7.2 7.1
5.2 5.3 5.4 6.3 5.5 6.0
5.5 5.1 4.5 4.2 4.3 5.1
4.3 5.8 4.3 5.7 4.4 4.1
4.2 4.8 3.8 3.8 4.1 4.0
4.0
e.g. 2. Frequency distribution for the size of particles collected from the estuary

Particle size (μm)    Frequency
3.0 to under 4.0          2
4.0 to under 5.0         12
5.0 to under 6.0         10
6.0 to under 7.0          7
7.0 to under 8.0          4
8.0 to under 9.0          2
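The class counts of e.g. 2 can be checked by binning the 37 particle sizes into the equal-width 1-μm classes; a minimal stdlib sketch (the variable names are mine):

```python
# Bin the e.g. 2 particle sizes (μm) into "a to under a+1" classes.
import math
from collections import Counter

sizes = [8.2, 6.3, 6.8, 6.4, 8.1, 6.3,
         5.3, 7.0, 6.8, 7.2, 7.2, 7.1,
         5.2, 5.3, 5.4, 6.3, 5.5, 6.0,
         5.5, 5.1, 4.5, 4.2, 4.3, 5.1,
         4.3, 5.8, 4.3, 5.7, 4.4, 4.1,
         4.2, 4.8, 3.8, 3.8, 4.1, 4.0, 4.0]

# Each class "a to under a+1" is identified by its lower bound floor(x).
classes = Counter(math.floor(x) for x in sizes)
for lower in sorted(classes):
    print(f"{lower} to under {lower + 1}: {classes[lower]}")
```

The output matches the table: 2, 12, 10, 7, 4, 2.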
Frequency Histogram
[Figure: frequency histogram of the e.g. 2 data; x-axis: particle size (μm) in classes 3 to <4 through 8 to <9; y-axis: frequency, 0 to 15]
e.g. 2. Frequency distribution for the size of particles collected from the estuary
Stem-and-leaf Displays
e.g. 2. A stem-and-leaf plot for the size of particles collected from the estuary

Stem | Leaf
 3.  | 88
 4.  | 523334128100
 5.  | 3234551187
 6.  | 3843830
 7.  | 0221
 8.  | 21

This can show the actual data and the general shape of the distribution, so it is useful for exploring data.
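A stem-and-leaf display like the one above is easy to build by hand; here is a sketch using the integer part as stem and the first decimal digit as leaf (leaves kept in data order, as on the slide):

```python
# Minimal stem-and-leaf display for the e.g. 2 particle sizes.
from collections import defaultdict

sizes = [8.2, 6.3, 6.8, 6.4, 8.1, 6.3, 5.3, 7.0, 6.8, 7.2, 7.2, 7.1,
         5.2, 5.3, 5.4, 6.3, 5.5, 6.0, 5.5, 5.1, 4.5, 4.2, 4.3, 5.1,
         4.3, 5.8, 4.3, 5.7, 4.4, 4.1, 4.2, 4.8, 3.8, 3.8, 4.1, 4.0, 4.0]

plot = defaultdict(list)
for x in sizes:
    stem, leaf = str(x).split(".")   # e.g. "6.3" -> stem "6", leaf "3"
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(f"{stem}. | {''.join(plot[stem])}")
```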
Cumulative Frequency Distribution
e.g. 3. Cases of meningitis in England and Wales (1989), by age

Age         No. of cases   Cumulative frequency   % CF
<1               673               673            25.18
1 to <2          354              1027            38.42
2 to <3          193              1220            45.64
3 to <4          129              1349            50.47
4 to <5           79              1428            53.42
5 to <10         204              1632            61.05
10 to <15        144              1776            66.44
15 to <25        345              2121            79.35
>= 25            552              2673           100.00
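The cumulative-frequency and %CF columns of e.g. 3 follow mechanically from the case counts; a stdlib sketch with `itertools.accumulate`:

```python
# Compute cumulative frequency and cumulative % for the e.g. 3 table.
from itertools import accumulate

ages = ["<1", "1 to <2", "2 to <3", "3 to <4", "4 to <5",
        "5 to <10", "10 to <15", "15 to <25", ">= 25"]
cases = [673, 354, 193, 129, 79, 204, 144, 345, 552]

cum = list(accumulate(cases))        # running totals down the table
total = cum[-1]                      # 2673 cases in all
for age, cf in zip(ages, cum):
    print(f"{age:10s} {cf:5d} {100 * cf / total:6.2f}%")
```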
e.g. 4. Frequency distribution of height of the last-year students (n = 52: 30 females & 22 males). Why bimodal-like?
[Figure: frequency histogram; x-axis: height (cm) in classes >149-153 through >181-185; y-axis: frequency, 0 to 14]
e.g. 4. Frequency distribution of female height (cm) of the class (n = 30)
Ideal class number = 5 log10 n, e.g. 5 log10(30) = 7.4;
then class width = (max − min)/7.4 = (170 − 149)/7.4 = 2.8 cm
[Figure: frequency histogram; x-axis: height (cm) in classes >149-152 through >167-170; y-axis: frequency, 0 to 8]
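The class-width rule quoted above is a one-liner; this sketch reproduces the numbers for the female-height data (variable names are mine):

```python
# Ideal number of classes ≈ 5·log10(n); class width ≈ range / that number.
import math

n = 30                     # number of female heights
h_max, h_min = 170, 149    # range of the data (cm)

k = 5 * math.log10(n)          # ideal class number, ≈ 7.4
width = (h_max - h_min) / k    # class width, ≈ 2.8 cm
print(round(k, 1), round(width, 1))
```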
Normal curve

f(x) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)]

[Figure: the same height data drawn as two frequency histograms with different bins (145-180 cm vs 151-171 cm), each with a fitted normal curve]
Changing bin size can modify the histogram.
Normal curve

f(x) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)]

Parameters μ and σ determine the position of the curve on the x-axis and its shape.
The normal curve was first expressed on paper (for astronomy) by A. de Moivre in 1733. Not until the 1950s was it applied to environmental problems. (P.S. non-parametric statistics were developed in the 20th century.)
[Figure: probability density vs height (cm), 140-190 cm, with separate male and female normal curves]
f(x) = [1/(2)]exp[(x  )2/(22)]
0.50
Probability density
0.40
N(10,1)
N(20,1)
0.30
0.20
N(20,2)
N(10,3)
0.10
0.00
0
10
20
X
• Normal distribution N(,)
• Probability density function: the area under the
curve is equal to 1.
30
The standard normal curve
• μ = 0, σ = 1, and the total area under the curve = 1
• units along the x-axis are measured in σ units
• Figures: (a) for ±1σ, area = 0.6826 (68.26%); (b) for ±2σ, area = 95.44%; (c) the shaded area = 100% − 95.44%
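These areas can be verified without a table: the standard normal CDF is Φ(z) = (1 + erf(z/√2))/2, and `math.erf` is in the Python standard library. A sketch (not part of the lecture):

```python
# Check the areas under the standard normal curve within ±1σ and ±2σ.
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

within_1sd = phi(1) - phi(-1)    # area between μ−σ and μ+σ, ≈ 0.6827
within_2sd = phi(2) - phi(-2)    # area between μ−2σ and μ+2σ, ≈ 0.9545
print(f"±1σ: {within_1sd:.4f}")
print(f"±2σ: {within_2sd:.4f}")
```

(The slide's 68.26% and 95.44% are the values read from a 4-decimal Z table; the exact areas are 0.68269 and 0.95450.)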
Application of the Standard Normal Distribution
For example: we have a large data set (e.g. n = 200) of normally distributed suspended-solids determinations for a particular site on a river: x̄ = 18.3 ppm and s = 8.2 ppm. We are asked to find the probability of a random sample containing 30 ppm suspended solids or more.
Application of the standard normal distribution
• The standardized deviate (Z value): Z = (Xi − μ)/σ
• Z = (30 − 18.3)/8.2 = 1.43
• Check the Z table (Table B2 in Zar's book): the probability of a sample having ≥ 30 ppm = 0.0764, or 7.64%
• i.e. for n = 200, about 15 samples would be expected to have ≥ 30 ppm
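The worked example can be reproduced exactly with `math.erf` instead of Table B2; a sketch under the slide's figures (x̄ = 18.3 ppm, s = 8.2 ppm):

```python
# P(X >= 30 ppm) for X ~ N(18.3, 8.2²), via the standard normal deviate.
import math

mean, sd = 18.3, 8.2
x = 30.0

z = (x - mean) / sd                               # ≈ 1.43
p = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))    # upper-tail area
print(f"Z = {z:.2f}, P(X >= 30) = {p:.4f}")       # ≈ 0.077
print(f"expected out of 200 samples: {200 * p:.1f}")   # about 15
```

(Table B2 with Z rounded to 1.43 gives 0.0764; the unrounded Z gives a nearly identical tail area.)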
Central Limit Theorem
• As sample size (n) increases, the means of samples (i.e. subsets or replicate groups) drawn from a population of any distribution will approach the normal distribution.
• By taking the mean of the means, we smooth out the extreme values within the sets, while the mean of the means remains close to the population mean.
• As the number of subsets increases, the standard deviation of the mean of the means is reduced, and the frequency distribution comes very close to the normal distribution.
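The smoothing described above is easy to see by simulation; a sketch drawing sample means from a decidedly non-normal (uniform) population:

```python
# Central limit theorem by simulation: means of uniform samples cluster
# around the population mean (0.5) with small spread (≈ σ/√n).
import random
import statistics

random.seed(42)    # any seed; fixed here for reproducibility

def sample_mean(n):
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

means = [sample_mean(50) for _ in range(1000)]

print(round(statistics.mean(means), 3))    # close to 0.5
print(round(statistics.stdev(means), 3))   # close to (1/√12)/√50 ≈ 0.041
```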
Inferential statistics - testing the null hypothesis
• Inferential = "that may be inferred." Infer = conclude or reach an opinion.
• The hypothesis under test, the null hypothesis, will be that Z has been chosen at random from the population represented by the curve.
• Frequencies of Z values close to the mean (μ = 0) are high, while frequencies away from the mean decline.
e.g. two values of Z are shown: Z = 1.96 and Z = 2.58. From Table B2, the corresponding probabilities (area beyond Z) are 0.025 (2.5%) and 0.0049 (0.5%).
Inferential statistics - testing the null hypothesis
As the curve is symmetrical about the mean, the p of obtaining a value of Z < −1.96 is also 2.5%; so the total p of obtaining a value of Z between −1.96 and +1.96 is 95%. Likewise, between Z = ±2.58, the total p = 99%.
Then we can state a null hypothesis that a random observation of the population will have a value between −1.96 and +1.96.
Inferential statistics - testing the null hypothesis
Alternatively, suppose a random observation of Z lies outside the limits −1.96 or +1.96. There are 2 possibilities: either we have chosen an 'unlikely' value of Z, or our hypothesis is incorrect.
Conventionally, when performing a significance test, we make the rule that if the Z value lies outside the range ±1.96, then the null hypothesis is rejected, and the Z value is termed significant at the 5% level, i.e. α = 0.05 (or p < 0.05); ±1.96 is the critical value of the statistic.
For Z = ±2.58, the value is termed significant at the 1% level.
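The decision rule above amounts to a single comparison against the critical value; a minimal sketch (function name is mine):

```python
# Conventional two-tailed decision rule with the 5% and 1% critical Z values.
def z_test(z, alpha=0.05):
    """Reject Ho if |z| exceeds the critical value for the chosen alpha."""
    critical = {0.05: 1.96, 0.01: 2.58}[alpha]
    return abs(z) > critical        # True => significant at that level

print(z_test(2.10))               # True: significant at the 5% level
print(z_test(2.10, alpha=0.01))   # False: not significant at the 1% level
```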
Statistical Errors in Hypothesis Testing
• Consider court judgements, where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e. Ho = innocent)

                             If the accused is innocent   If the accused is guilty
                             (Ho is true)                 (Ho is false)
Court's decision: Guilty     Wrong judgement              OK
Court's decision: Innocent   OK                           Wrong judgement
Statistical Errors in Hypothesis Testing
• As in court judgements, in testing a null hypothesis in statistics we risk the same kinds of errors:

                     If Ho is true   If Ho is false
If Ho is rejected    Type I error    No error
If Ho is accepted    No error        Type II error
Statistical Errors in Hypothesis Testing
e.g. Ho = responses of cancer patients to a new drug and a placebo are similar
• If Ho is indeed a true statement about a statistical population, it will be concluded (erroneously) to be false 5% of the time (in case α = 0.05).
• Rejection of Ho when it is in fact true is a Type I error (also called an α error).
• If Ho is indeed false, our test may occasionally fail to detect this fact, and we accept Ho.
• Acceptance of Ho when it is in fact false is a Type II error (also called a β error).
Power of a Statistical Test
• Power is defined as 1 − β
• β is the probability of making a Type II error
• Power (1 − β) is the probability of rejecting the null hypothesis when it is in fact false and should be rejected
• The probability of a Type I error is specified as α
• But β is a value that we neither specify nor know
Power of a statistical test
• However, for a given sample size n, the α value is related inversely to the β value
• A lower probability of committing a Type I error is associated with a higher probability of committing a Type II error.
• The only way to reduce both types of error simultaneously is to increase n.
• For a given α, a large n will result in a statistical test with greater power (1 − β).
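The last point can be illustrated numerically. A sketch for a two-tailed Z test of Ho: μ = 0 at α = 0.05, when the true mean is 0.5σ away from 0 (the effect size and sample sizes here are my own illustrative choices):

```python
# Power (1 − β) of a two-tailed Z test grows with n, for fixed α = 0.05.
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(n, effect=0.5, alpha_z=1.96):
    """P(|Z| > 1.96) when the true standardized mean shift is `effect`."""
    shift = effect * math.sqrt(n)    # mean of Z under the alternative
    return (1 - phi(alpha_z - shift)) + phi(-alpha_z - shift)

for n in (10, 30, 100):
    print(n, round(power(n), 2))     # power rises toward 1 as n grows
```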
IMPORTANT NOTES
• If n is large enough, frequency histograms of interval and ratio measurements often approximate the normal distribution.
• The normal distribution is a mathematical curve whose shape and location are determined by two population parameters (μ, σ).
• Areas beneath the standard normal curve (μ = 0, σ = 1) correspond to the probability of occurrence of normally distributed measurements with specific values.
• The probability of occurrence of specified measurements can be estimated using the Z table (Table B2 in Zar's book).
IMPORTANT NOTES
• The central limit theorem states that the sample mean is (for large n, approximately) a normally distributed quantity.
• Significance testing involves setting up a null hypothesis (Ho) and then providing evidence for its acceptance or rejection. If Ho is rejected on the statistical evidence, the alternative hypothesis (HA) must be accepted.
• In terms of probability of occurrence, Ho is rejected if its probability value < α (e.g. α = 0.05 or 5%; i.e. data this extreme would arise less than one time in 20 if Ho were true).
IMPORTANT NOTES
• Significance testing may make errors:
  – Rejection of Ho when it is in fact true is a Type I error (also called an α error).
  – Acceptance of Ho when it is in fact false is a Type II error (also called a β error).
• Power (1 − β) is the probability of rejecting the null hypothesis when it is in fact false and should be rejected.
• For a given n, α is inversely related to β.
• The only way to reduce both types of error simultaneously is to increase n.
• For a given α, a large n will result in a statistical test with greater power (1 − β).