Transcript Document

Statistics
The usual course of events for conducting scientific work
“The Scientific Method”
Observation
Develop a Working Hypothesis
Conduct an experiment or a series of controlled systematic observations
Appropriate statistical tests
Reformulate or extend hypothesis
Confirm or reject hypothesis
“The Scientific Method” - an example
Observation: In a group of crickets, small ones seem to avoid large ones
Develop a Working Hypothesis: There will be movement away from large crickets by small ones
Conduct an experiment or a series of controlled systematic observations: Record the number of times that small crickets move away from small and large crickets
Appropriate statistical tests: Chi square test
Reformulate or extend hypothesis: Avoidance may depend on previous experience
Confirm or reject hypothesis: There is a significant difference in the number of times small crickets move away from large vs. small ones
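A rough sketch of what the chi square step in the cricket example could look like. The counts are hypothetical (the slides give no data), and the null hypothesis assumed here is that small crickets move away equally often from large and small ones. The critical value 3.841 is the standard tabulated value for 1 degree of freedom at the 0.05 level.

```python
# Hypothetical counts (NOT from the slides): number of times a small
# cricket moved away from a large cricket vs. from another small one.
observed = {"from large": 30, "from small": 10}


def chi_square(observed_counts, expected_counts):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected_counts))


total = sum(observed.values())
# Assumed null hypothesis: no avoidance, so moves split evenly.
expected = [total / len(observed)] * len(observed)

chi2 = chi_square(list(observed.values()), expected)
df = len(observed) - 1

# Tabulated critical value for df = 1 at the 0.05 significance level.
CRITICAL_0_05_DF1 = 3.841
significant = chi2 > CRITICAL_0_05_DF1

print(f"chi-square = {chi2:.2f}, df = {df}, significant at 0.05: {significant}")
# chi-square = 10.00, df = 1, significant at 0.05: True
```

With these made-up counts the statistic exceeds the critical value, so the "no avoidance" hypothesis would be rejected.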
Imagine that you are collecting samples (i.e. a number of
individuals) from a population of little ball creatures - Critterus
sphericales
Little ball creatures come in 3 sizes:
Small =
Medium =
Large =
You end up with a total of five samples
-sample 1
-sample 2
-sample 3
-sample 4
-sample 5
The real population
(all the little ball creatures that exist)
Your samples
Each sample is a representation of the population
BUT
No single sample can be expected to accurately represent
the whole population
To be statistically valid, each sample must be:
1) Random:
Thrown quadrat??
Guppies netted from an aquarium?
Assign numbers from a random number table
(Figure: individuals numbered 1 to 15)
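One way to picture the random number table approach, as a small sketch. It assumes 15 individuals labelled 1 to 15 and a sample of 5; the sample size and the seed are arbitrary choices for illustration, and a seeded generator stands in for the printed table.

```python
import random

# Assumption: 15 individuals, labelled 1..15.
individuals = list(range(1, 16))

# A seeded generator plays the role of the random number table, so the
# draw can be repeated. The seed value itself is arbitrary.
rng = random.Random(42)

# Draw 5 distinct individuals; each one is equally likely to be chosen.
sample = rng.sample(individuals, k=5)
print(sorted(sample))
```

Because every label has the same chance of being drawn, the choice does not depend on which individuals are easiest to catch.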
To be statistically valid, each sample must be:
2) Replicated:
But not - ‘Pseudoreplication’
Not pseudoreplication: 10 samples from 10 different trees (Sample size = 10)
Pseudoreplication: 10 samples from the same tree (Sample size = 1)
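The tree example can be sketched in a few lines. The tree labels and the helper name `effective_sample_size` are made up for illustration; the idea is only that the sample size counts independent units (trees), not the number of measurements.

```python
# Hypothetical records: each sample is tagged with the tree it came from.
replicated = [f"tree{i}" for i in range(1, 11)]  # 10 different trees
pseudoreplicated = ["tree1"] * 10                # the same tree, 10 times


def effective_sample_size(tree_ids):
    """Count independent units (distinct trees), not individual samples."""
    return len(set(tree_ids))


print(effective_sample_size(replicated))        # 10
print(effective_sample_size(pseudoreplicated))  # 1
```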
TYPES OF DATA
RATIO DATA
- constant size interval
- a zero point with some reality
e.g. Heights, rates, time, volumes, weights
INTERVAL DATA
- constant size interval
- no true zero point; the zero point depends on the scale used
e.g. Temperature (°C or °F)
ORDINAL SCALE
- ranked data
e.g. Grades, preference surveys
NOMINAL SCALE
e.g. Team numbers, Drosophila eye colour
The kind of data you are dealing with is one determining factor in the kind of statistical test you will use.
Statistics and Parameters
Measures of:
Central tendency - mean, median, mode
Dispersion - range, mean deviation, variance, standard deviation, coefficient of variation
Central tendency - Mean
Central Tendency
1) Arithmetic mean
At Population level
Measuring the diameters of all the little ball creatures that exist
$\mu = \frac{\sum X_i}{N}$
$\mu$ - population mean
$X_i$ - every measurement in the population
$N$ - population size
Your samples
Sample mean (calculated separately for each of the five samples)
$\bar{X} = \frac{\sum X_i}{n}$
$\sum X_i$ - sum of all measurements in the sample
$n$ - sample size
If you have sampled in an unbiased fashion, each of the five sample means $\bar{X}$ roughly equals $\mu$
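To see the population mean and the five sample means side by side, here is a small simulation. Everything numeric in it is an assumption: the slides give no diameters, so the population is invented (1000 creatures with diameters 1, 2 or 3 units), and the sample size of 50 is arbitrary.

```python
import random

# Hypothetical population of diameters (arbitrary units).
rng = random.Random(1)
population = [rng.choice([1.0, 2.0, 3.0]) for _ in range(1000)]

# Population mean: mu = (sum of all X_i) / N
mu = sum(population) / len(population)

# Five unbiased (simple random) samples, each of size n = 50.
samples = [rng.sample(population, k=50) for _ in range(5)]

# Sample mean: X-bar = (sum of X_i in the sample) / n
sample_means = [sum(s) / len(s) for s in samples]

print(f"mu = {mu:.3f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i}: X-bar = {x_bar:.3f}")
```

The sample means differ from one another, but each should land close to the population mean.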
Central tendency - Median
Median - middle value of a population or sample
e.g. Lengths of Mayfly (Ephemeroptera) nymphs
Odd number of values: Median = middle value, e.g. the 5th value (middle of 9): 1 2 3 4 [5] 6 7 8 9
Even number of values: Median = (sum of the two middle values) / 2
Or - to put it more formally
Odd number of values (i.e. n is odd)
$\text{Median} = X_{(n+1)/2}$
Even number of values (i.e. n is even)
$\text{Median} = \frac{X_{n/2} + X_{(n/2)+1}}{2}$
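The two cases translate directly into code. This is only a sketch: the function sorts the values first, since the rank formulas assume ranked data, and the example lengths are invented rather than real mayfly measurements.

```python
def median(values):
    """Median from the rank formulas for odd and even n (ranks are 1-based)."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        # Odd n: the value at rank (n + 1) / 2
        return ranked[(n + 1) // 2 - 1]
    # Even n: mean of the values at ranks n/2 and n/2 + 1
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


# Hypothetical nymph lengths in mm (not data from the slides).
print(median([7, 3, 9, 5, 4, 8, 6, 2, 1]))  # 5
print(median([4, 1, 3, 2]))                 # 2.5
```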
Central tendency - Mode
Mode - the most frequently occurring measurement
(Figure: frequency plot with the mode marked. Vertical axis: Frequency (= number of times each measurement appears in the population). Horizontal axis: Values (= measurements taken))
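Counting frequencies is all it takes to find the mode. The sketch below uses invented measurements and returns a list, on the assumption that ties should all be reported rather than picking one.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)          # value -> number of occurrences
    top = max(counts.values())        # the highest frequency
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical measurements (not from the slides).
print(modes([2, 3, 3, 4, 4, 4, 5]))  # [4]
print(modes([1, 1, 2, 2, 3]))        # [1, 2]
```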
Measures of Dispersion
Why worry about this??
-because not all populations are created equal
The distributions of values in the populations are clearly different
BUT
their means and medians are the same
Measures of Dispersion 1. Range
Range - difference between the highest and lowest values
Remember little ball creatures and the five samples
Range = highest value - lowest value
Range - crude measure of dispersion
Note - three samples do not include the highest value, and two samples do not include the lowest
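A quick simulation shows why the range is a crude measure: a sample can only match the population range if it happens to include both extremes. The population here is invented (500 diameters between 1 and 10 units), as are the sample size and seed.

```python
import random


def value_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)


# Hypothetical diameters (not from the slides).
rng = random.Random(7)
population = [rng.uniform(1.0, 10.0) for _ in range(500)]
samples = [rng.sample(population, k=10) for _ in range(5)]

print(f"population range = {value_range(population):.2f}")
for i, s in enumerate(samples, start=1):
    # A sample range can never exceed the population range.
    print(f"sample {i} range = {value_range(s):.2f}")
```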
Measures of Dispersion 2. Mean Deviation
$\bar{X}$ is a measure of central tendency
Take difference between each measure and the mean
$X_i - \bar{X}$
BUT
$\sum (X_i - \bar{X}) = 0$
So this is not useful as it stands
Measures of Dispersion 2. Mean Deviation (cont’d)
But if you take the absolute value
- get a measure of dispersion
$\sum |X_i - \bar{X}|$
and
$\frac{\sum |X_i - \bar{X}|}{n}$ = mean deviation
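A small worked example makes the point concrete. The five values are invented; they were chosen so the arithmetic stays exact.

```python
# Hypothetical sample (not from the slides).
sample = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(sample)
x_bar = sum(sample) / n                      # 25 / 5 = 5.0

# Signed deviations from the mean: -3, -1, -1, 0, +5
deviations = [x - x_bar for x in sample]
print(sum(deviations))                       # 0.0, so not useful as it stands

# Absolute deviations: 3, 1, 1, 0, 5 (sum = 10)
mean_deviation = sum(abs(d) for d in deviations) / n
print(mean_deviation)                        # 2.0
```

The positive and negative deviations cancel exactly, which is why the absolute value (or, in the next measure, the square) is needed.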
Measures of Dispersion 3. Variance
- eliminate the sign from deviation from mean
Square the difference
$(X_i - \bar{X})^2$
And if you add up the squared differences
- get the “sum of squares”
$\sum (X_i - \bar{X})^2$
(hint: you’ll be seeing this a lot!)
Measures of Dispersion 3. Variance (cont’d)
Sum of squares can be considered at both the population and sample level
Population: $SS = \sum (X_i - \mu)^2$
Sample: $ss = \sum (X_i - \bar{X})^2$
Measures of Dispersion 3. Variance (cont’d)
If you divide by the population or sample size
- get the mean squared deviation or VARIANCE
Population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$ (population variance)
Sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ (sample variance)
Measures of Dispersion 3. Variance (cont’d)
Note something about the sample variance
$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$
The denominator, $n - 1$, is the degrees of freedom, or df, or $\nu$
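The difference between dividing by the size and dividing by the degrees of freedom is easy to check numerically. The data are invented, and for the sake of comparison the same five values are treated once as a whole population and once as a sample. The results are cross-checked against Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical measurements (not from the slides).
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)
mean = sum(data) / n                                  # 5.0

# Sum of squares: squared deviations 9, 1, 1, 0, 25
sum_of_squares = sum((x - mean) ** 2 for x in data)   # 36.0

# Treated as a whole population: divide by N.
population_variance = sum_of_squares / n              # 36 / 5
# Treated as a sample: divide by the degrees of freedom, n - 1.
sample_variance = sum_of_squares / (n - 1)            # 36 / 4 = 9.0

print(population_variance, sample_variance)

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```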
Measures of Dispersion 4. Standard Deviation
- just the square root of the variance
Population: $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$
Standard Deviation - very useful
In a normally distributed population, most data (about 68%) are within one standard deviation of the mean
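As a sketch, the sample standard deviation and the share of values within one standard deviation of the mean can be computed as follows. The five values are invented, and such a tiny sample is only meant to show the mechanics, not a typical proportion.

```python
import math

# Hypothetical sample (not from the slides).
sample = [3.0, 4.0, 4.0, 5.0, 9.0]
n = len(sample)
x_bar = sum(sample) / n                                # 25 / 5 = 5.0

# Sample standard deviation: square root of the sample variance.
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)  # 4 + 1 + 1 + 0 + 16 = 22
s = math.sqrt(sum_of_squares / (n - 1))                 # sqrt(5.5)

# Count the values lying within one standard deviation of the mean.
within_one_sd = sum(1 for x in sample if abs(x - x_bar) <= s)
print(f"s = {s:.3f}; {within_one_sd} of {n} values within one SD")
```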
NORMAL DISTRIBUTION
From previous slide show
Type of data
Discrete
- 2 categories & Bernoulli process: use a Binomial model to calculate expected frequencies
- > 2 categories: use a Poisson distribution to calculate expected frequencies
Continuous
- Other distributions
Now we’re dealing with:
Normal Distribution
- bell curve
Central Limit Theorem
Any continuous variable influenced by numerous independent random factors, each with a small additive effect, will show an approximately normal distribution.
Normal curve is used for:
2) Continuous random data
Weight, blood pressure, length, area, rates
Data points that would be affected by a large number of random (=unpredictable) events
e.g. Blood pressure is affected by physical activity, genes, age, smoking, diet, stress
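The idea that many random influences add up to a bell curve can be tried out with a simulation. This is a toy model, not a description of blood pressure: it assumes 50 independent factors, each nudging the value by a uniformly random amount, and 10,000 simulated individuals. All of these numbers are arbitrary.

```python
import random
import statistics

rng = random.Random(0)


def simulated_value(n_factors=50):
    """One individual's value: the sum of many small independent effects."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_factors))


values = [simulated_value() for _ in range(10_000)]
mean = statistics.mean(values)
sd = statistics.pstdev(values)

# For a normal curve, roughly 68% of values lie within one SD of the mean.
within_one_sd = sum(abs(v - mean) <= sd for v in values) / len(values)
print(f"fraction within one SD of the mean: {within_one_sd:.3f}")
```

Although each factor is uniformly (not normally) distributed, the fraction printed should come out near the 0.68 expected for a normal curve.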
Normal curves can come in different shapes
So, for comparison between them, we need to
standardize their presentation in some way
Standardize by calculating a Z-Score
$Z = \frac{\text{value of a random variable} - \text{mean}}{\text{standard deviation}}$
or
$Z = \frac{X - \mu}{\sigma}$
Example of a z-score calculation
The mean grade on the Biometrics midterm is 78.4
and the standard deviation is 6.8. You got a 59.7 on
the exam. What is your z-score?
$Z = \frac{X - \mu}{\sigma}$
$Z = \frac{59.7 - 78.4}{6.8} = -2.75$
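The same calculation as a short function, using the midterm numbers from the example. The function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, sd):
    """Z = (X - mean) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Midterm example: mean 78.4, standard deviation 6.8, your grade 59.7.
z = z_score(59.7, 78.4, 6.8)
print(round(z, 2))  # -2.75
```

The negative sign says the grade is below the mean.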
If you look at the formula for z-scores:
$z = \frac{\text{value of a random variable} - \text{mean}}{\text{standard deviation}}$
z is also the number of standard deviations a
value is from the mean
Each standard deviation away from the mean defines
a certain area of the normal curve
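The areas can be computed from the error function in Python's standard `math` module, since for a normal curve the area within k standard deviations of the mean is erf(k / sqrt(2)). A minimal sketch; the function name `area_within` is made up for illustration.

```python
import math


def area_within(k):
    """Area of a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.4f}")
# within 1 SD of the mean: 0.6827
# within 2 SD of the mean: 0.9545
# within 3 SD of the mean: 0.9973
```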