Understanding Your Data Set

• Statistics are used to describe data sets
• They give us a metric in place of a graph
• What are some types of statistics used to describe data sets?
– Average, range, variance, standard deviation, coefficient of variation, standard error
Table 1. Total length (cm) and average length of spotted gar collected from a local farm pond and from a local lake.

             Length (cm)
Number       Pond     Lake
1            34       38
2            78       82
3            48       58
4            24       76
5            64       60
6            58       70
7            34       99
8            66       40
9            22       68
10           44       91
Average =    47.2     68.2
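
As a quick check on the averages in Table 1, here is a minimal Python sketch. The lengths are copied from the table; the names `pond`, `lake`, and `average` are illustrative choices, not part of the original slides.

```python
# Lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]


def average(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


pond_mean = average(pond)
lake_mean = average(lake)

print(f"Pond average = {pond_mean:.1f} cm")
print(f"Lake average = {lake_mean:.1f} cm")
```

The two printed values should agree with the Average row of the table.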
• Are the two samples equal?
– What about 47.2 and 47.3?
• If we sampled all of the gar in each water body, would the average be different?
– How different?
• Would the lake fish average still be larger?
Range
• Simply the distance between the smallest and
largest value
[Figure: horizontal bars showing the range for Lake, Pond, and their Overlap; x-axis: Length (cm), 0 to 100]
Figure 1. Range of spotted gar length collected from a pond and a lake. The dashed line represents the overlap in range.
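
The range and the overlap drawn in Figure 1 can be worked out directly from the Table 1 lengths. This is a sketch only; the helper name `value_range` and the way the overlap is computed (largest minimum to smallest maximum) are assumptions about how the figure was built.

```python
# Lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]


def value_range(values):
    """Return the smallest and largest value in the sample."""
    return min(values), max(values)


pond_min, pond_max = value_range(pond)
lake_min, lake_max = value_range(lake)

# Range = distance between the smallest and largest value.
pond_range = pond_max - pond_min
lake_range = lake_max - lake_min

# Overlap = the stretch of lengths covered by both samples (assumed definition).
overlap_low = max(pond_min, lake_min)
overlap_high = min(pond_max, lake_max)
overlap = max(0, overlap_high - overlap_low)

print(f"Pond: {pond_min} to {pond_max} cm (range {pond_range} cm)")
print(f"Lake: {lake_min} to {lake_max} cm (range {lake_range} cm)")
print(f"Overlap: {overlap_low} to {overlap_high} cm ({overlap} cm)")
```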
• Does the difference in average length (47.2 vs. 68.2) seem to be as large as before?
Variance
• An index of variability used to describe the dispersion among the measures of a population sample.
• Need the distance between each sample point and the sample mean.
[Figure: Length (cm, 0 to 100) plotted against fish Number (0 to 10), with the distance from each point to the sample mean marked]
Figure 2. Length (cm) of each spotted gar collected from the pond. The horizontal solid line represents the sample mean length.
• We can easily put this new data set into a spreadsheet table.
• By adding up all of the differences, we can get a number that is a reflection of how scattered the data points are.

#        Length   Mean    Difference
1        34       47.2    -13.2
2        78       47.2    30.8
3        48       47.2    0.8
4        24       47.2    -23.2
5        64       47.2    16.8
6        58       47.2    10.8
7        34       47.2    -13.2
8        66       47.2    18.8
9        22       47.2    -25.2
10       44       47.2    -3.2
Sum =                     0

• After adding up all of the differences, we get zero.
– This is true of all calculations like this
– The closer each number is to the mean, the smaller its difference.
• What can we do to get rid of the negative values?
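
To see why the differences always sum to zero, a short sketch using the pond lengths from Table 1 may help. The variable names are illustrative; a small tolerance is used because floating-point arithmetic can leave a tiny remainder instead of an exact zero.

```python
import math

# Pond lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

mean = sum(pond) / len(pond)

# Signed distance from each point to the sample mean.
differences = [x - mean for x in pond]
total = sum(differences)

for number, (length, diff) in enumerate(zip(pond, differences), start=1):
    print(f"{number:2d}  {length:3d}  {mean:.1f}  {diff:6.1f}")

# The positive and negative differences cancel out.
print("Sum is zero:", math.isclose(total, 0.0, abs_tol=1e-9))
```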
Sum of Squares
#        Length   Mean    Difference   Difference²
1        34       47.2    -13.2        174.24
2        78       47.2    30.8         948.64
3        48       47.2    0.8          0.64
4        24       47.2    -23.2        538.24
5        64       47.2    16.8         282.24
6        58       47.2    10.8         116.64
7        34       47.2    -13.2        174.24
8        66       47.2    18.8         353.44
9        22       47.2    -25.2        635.04
10       44       47.2    -3.2         10.24
Sum =                     0            3233.6

Now 3233.6 is a number we can use! This value is called the SUM OF SQUARES.
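
The sum of squares for the pond sample can be reproduced with a few lines of Python. This is a minimal sketch; `sum_of_squares` is a hypothetical helper name, not something from the slides.

```python
# Pond lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]


def sum_of_squares(values):
    """Sum of the squared differences between each value and the sample mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


sos = sum_of_squares(pond)
print(f"Sum of squares = {sos:.1f}")
```

Squaring removes the sign of each difference, so the terms no longer cancel.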
Back to Variance
• Sum of Squares (SOS) will continue to increase as we increase our sample size.
– A sample of 10 replicates that are highly variable would have a higher SOS than a sample of 100 replicates that are not highly variable.
• To account for sample size, we need to divide SOS by the number of samples minus one (n-1).
– We'll get to the reason for using (n-1) instead of n later
Calculate Variance (σ²)
σ² = S² = Σ (Xi – Xm)² / (n – 1)
– Numerator, Σ (Xi – Xm)²: SOS
– Denominator, (n – 1): Degrees of Freedom
Variance for Pond = S² = 3233.6 / 9 = 359.29
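
The calculation above can be sketched in Python and cross-checked against the standard library's `statistics.variance`, which also divides by (n – 1). The variable names are illustrative.

```python
import math
import statistics

# Pond lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

n = len(pond)
mean = sum(pond) / n

# Sum of squares divided by the degrees of freedom (n - 1).
sos = sum((x - mean) ** 2 for x in pond)
variance = sos / (n - 1)

# statistics.variance computes the same sample variance.
library_variance = statistics.variance(pond)

print(f"Variance for pond = {variance:.2f}")
print("Matches statistics.variance:", math.isclose(variance, library_variance))
```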
More on Variance
• Variance tends to increase as the sample mean increases
– For our sample, the largest difference between any point and the mean was 30.8 cm. Imagine measuring a plot of cypress trees. How large of a difference would you expect (if measured in cm)?
• The variance for the lake sample = 400.18.
Standard Deviation
• Calculated as the square root of the variance.
– Variance is not a linear distance (we had to square it). Think about the difference in shape of a meter stick versus a square meter.
• By taking the square root of the variance, we return our index of variability to something that can be placed on a number line.
Calculate SD
• For our gar sample, the Variance was 359.29. The square root of 359.29 = 18.95.
– Reported with the mean as: 47.2 ± 18.95 (mean ± SD).
• Standard Deviation is often abbreviated as σ (sigma) or as SD.
• SD is a unit of measurement that describes the scatter of our data set.
– Also increases with the mean
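
As a sketch of the SD calculation, the square root of the sample variance can be compared with the standard library's `statistics.stdev`, which uses the same (n – 1) denominator. The variable names are illustrative.

```python
import math
import statistics

# Pond lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

mean = statistics.mean(pond)

# SD is the square root of the sample variance.
sd = math.sqrt(statistics.variance(pond))

# statistics.stdev does both steps in one call.
library_sd = statistics.stdev(pond)

print(f"{mean:.1f} ± {sd:.2f} (mean ± SD)")
print("Matches statistics.stdev:", math.isclose(sd, library_sd))
```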
Coefficient of Variation
• A statistic used to compare variations of populations that have different means
• Calculated as:
– CoV = (SD / mean) * 100

               Pond     Lake
Mean           47.2     68.2
SD             18.95    20.00
Coeff. Var.    40.15    29.33
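
The table values can be reproduced with the sketch below. One assumption is worth flagging: the SD is rounded to two decimals before dividing, which appears to be how the table was calculated; using the unrounded SD can shift the last digit slightly. The function name `coefficient_of_variation` is illustrative.

```python
import statistics

# Lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]


def coefficient_of_variation(values):
    """CoV = (SD / mean) * 100, with SD rounded to 2 decimals first (assumption)."""
    mean = statistics.mean(values)
    sd = round(statistics.stdev(values), 2)
    return sd / mean * 100


pond_cov = coefficient_of_variation(pond)
lake_cov = coefficient_of_variation(lake)

print(f"Pond CoV = {pond_cov:.2f}")
print(f"Lake CoV = {lake_cov:.2f}")
```

Although the lake sample has the larger SD, its CoV is smaller because its mean is larger.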
Standard Error
• Calculated as: SE = σ / √(n)
– Indicates how close we are to estimating the true population mean
– For our pond example: SE = 18.95 / √10 = 5.993
– Reported with the mean as 47.2 ± 5.993 (mean ± SE).
– Based on the formula, the SE decreases as sample size increases.
• Why is this not a mathematical artifact, but a true reflection of the population we are studying?
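
The pond SE can be sketched as follows. As an assumption, the SD is rounded to 18.95 before dividing so the result lines up with the slide; the unrounded SD gives a slightly different third decimal. The loop at the end uses hypothetical sample sizes, with the SD held fixed, purely to show how the formula behaves.

```python
import math
import statistics

# Pond lengths (cm) copied from Table 1.
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

n = len(pond)
mean = statistics.mean(pond)
sd = round(statistics.stdev(pond), 2)  # rounded as on the slide (assumption)

# Standard error = SD divided by the square root of the sample size.
se = sd / math.sqrt(n)
print(f"{mean:.1f} ± {se:.3f} (mean ± SE)")

# Hypothetical sample sizes with the same SD: SE shrinks as n grows.
for hypothetical_n in (10, 40, 160):
    print(hypothetical_n, round(sd / math.sqrt(hypothetical_n), 3))
```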
Sample Size
• The number of individuals within a population you measure/observe.
– Usually impossible to measure the entire population
• As sample size increases, we get closer to the true population mean.
– Remember, when we take a sample we assume it is representative of the population.
Effect of Increasing Sample Size
• I measured the length of 100 gar
• Calculated SD and SE for the first 10, then added the next 10 and recalculated, and so on until all 100 individuals were included.
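
The procedure described above can be sketched in Python. The 100 measured lengths are not given in the slides, so this example uses synthetic lengths drawn from a normal distribution (mean 60 cm, SD 20 cm, fixed seed); those numbers are assumptions for illustration only and are not the author's data.

```python
import math
import random
import statistics

# ASSUMPTION: synthetic lengths stand in for the 100 measured gar.
rng = random.Random(42)
lengths = [rng.gauss(60, 20) for _ in range(100)]

results = []
for n in range(10, 101, 10):
    subset = lengths[:n]            # first 10, first 20, ..., all 100
    sd = statistics.stdev(subset)   # sample SD, (n - 1) denominator
    se = sd / math.sqrt(n)          # SE = SD / sqrt(n)
    results.append((n, sd, se))

for n, sd, se in results:
    print(f"n = {n:3d}  SD = {sd:6.2f}  SE = {se:5.2f}")
```

With data like these, SD would be expected to settle near a stable value as n grows, while SE keeps shrinking because of the √(n) in its denominator.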
Raw Data
[Figure: raw data plotted against Sample Size]
SD = Square root of the variance (Var = Σ (Xi – Xm)² / (n – 1))
[Figure: SD plotted against Sample Size (0 to 100)]
SE = SD / √(n)
[Figure: SE plotted against Sample Size (0 to 100)]