Topic 1_Statistical Analysis

Topic 1: Statistical Analysis
Warm-Up:
1. What is the importance of standard deviation
with regard to the mean?
2. What do error bars indicate?
3. What percentage of values fall within 1
standard deviation of the mean? Within 2?
Error Bars
1. State that error bars are a graphical representation
of the variability of data.

There is almost always variation in biological data.
Graphs
 Despite the variety of graphs used in business
and the popular press, there are only a few basic
styles used in biology, and generally
straightforward criteria for which to use in each
situation.
 The object of graphing is to depict numeric data
visually, so it is important to avoid visual
elements that do not add to seeing the data, and
to choose a graph design that visually shows
the comparisons you intend to make.
Bar Graphs
 Bar graphs: These are best used to show numeric
data that represent discrete items or experiments.
 Bars imply that there are no intermediate values
(contrast with lines below), and in many (but not all)
cases the order of the bars along the X-axis will be
arbitrary.
Bar Graphs
 Side-by-side - Bar graphs
can contain but a single
series of data, but when
they contain more than
one, the additional series
can be arranged in two
ways.
 In a side-by-side graph,
the bars for each item sit
next to one another.
This allows the series to
be visually compared on
an item-by-item basis.
Bar Graphs
 Stacked - Sometimes the numeric
values for an item accumulate
between series, and the important
visual comparison is between
items rather than series.
 In this case, a stacked bar graph
is more appropriate. In this
example, the bars are oriented
horizontally, because the flow of
time is often represented
horizontally, and the X-axis is now
the dependent variable.
 As a general rule, horizontal bars
should only be used if there is a
reason to do so.
Bar Graphs
 Error bars - When the numeric
value of a bar is a mean, it is
often important to show
variability.
 A common way of doing this is
with error bars: lines extending
above and below the top of the
bar to show some aspect of
variability, such as the standard
deviation, the standard error of
the mean, or the 95% confidence
interval of the mean.
 The error bars can extend up
away from the top of the bar
only, or both above and below
(in that case the bar should have
no fill).
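As a rough illustration of the three measures named above, the following Python sketch computes the standard deviation, the standard error of the mean, and a 95% confidence interval for one small sample. The measurements are made-up values, and the critical t value of 2.776 (4 degrees of freedom, P = 0.05, two-tailed) is an assumption taken from a standard t table, not from this presentation.

```python
import math
import statistics

# Hypothetical measurements for one bar (illustrative values only).
measurements = [4.0, 5.0, 6.0, 5.0, 5.0]

n = len(measurements)
mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)      # sample standard deviation (n - 1)
sem = sd / math.sqrt(n)                  # standard error of the mean

# Critical t for n - 1 = 4 degrees of freedom at P = 0.05 (two-tailed),
# read from a standard t table; valid for this sample size only.
T_CRITICAL_DF4 = 2.776
ci_half_width = T_CRITICAL_DF4 * sem     # half-width of the 95% confidence interval

# Each pair is a (lower, upper) error bar drawn around the mean.
error_bars = {
    "standard deviation": (mean - sd, mean + sd),
    "standard error": (mean - sem, mean + sem),
    "95% confidence interval": (mean - ci_half_width, mean + ci_half_width),
}

for label, (low, high) in error_bars.items():
    print(f"{label}: {low:.2f} to {high:.2f}")
```

For any sample with more than one value, the standard error is smaller than the standard deviation, so a graph should state which measure its error bars show.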
Graphs
 Floating error bars - The same
graph can be constructed without
the bars: the error bars remain,
but the mean is now represented
by a symbol.
 The choice between this and the
graph above is not straightforward,
and different disciplines
characteristically use one or the
other.
 The bar graph visually stresses
comparison of the means over
comparison of the variation; this
example stresses comparison of
the variation and de-emphasizes
comparison of the means.
Box Plots
 Box plots - Sometimes it is useful
to show a visual representation of
variability in data without
resorting to parametric measures
of variation.
 A box plot depicts the median,
rather than the mean (although
many graph programs substitute
the mean), and the quartiles (the
box spans the 25% of the data
just above the median and the
25% just below the median).
 This example adds thin lines
including 90% of the data, and
those individual data points that
are outliers. Note that these are
not confidence intervals; they are
measures of the actual data.
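The median and quartiles that a box plot draws can be computed directly. This is a minimal sketch using Python's standard `statistics` module on a made-up data set. Graphing programs use slightly different conventions for quartiles, so their values may differ a little from these.

```python
import statistics

# Hypothetical data set (illustrative values only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

median = statistics.median(data)
# quantiles(n=4) returns the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
interquartile_range = q3 - q1

print("median:", median)
print("box runs from", q1, "to", q3)
print("interquartile range:", interquartile_range)
```

The box covers the middle half of the data, from the first quartile to the third, with the median marked inside it.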
Line Graph
 Line graph: Line graphs best represent
data that are samples from continuous
phenomena.
 The visual implication of the line is that
intermediate points exist, but were not
sampled. Values taken over time or
through space fit this criterion, as do
observations at different dosages
(assuming that the dosage could be
varied continuously).
 The order of the data along the X-axis is
of course not arbitrary with a line graph.
In this example, there are error bars for
the individual samples. The samples are
also connected by straight lines; they
could also be connected by spline
curves, which would give a smoother
appearance, but which are no better
predictors of intermediate values.
2. Calculating Mean
 The arithmetic mean is another name for the average of a set
of scores. The mean can be found by dividing the sum of the
scores by the number of scores.
 For example, the mean of 5, 8, 2, and 1 can be found by first
adding up the numbers. 5 + 8 + 2 + 1 = 16. The mean is then
found by taking this sum and dividing it by the number of
scores. Our data set 5, 8, 2, and 1 has 4 different numbers,
hence the mean is 16 ÷ 4 = 4.
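The same calculation can be written as a short Python sketch. The function name `mean` is chosen here for illustration.

```python
def mean(scores):
    """Arithmetic mean: the sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

scores = [5, 8, 2, 1]
total = sum(scores)      # 5 + 8 + 2 + 1 = 16
result = mean(scores)    # 16 / 4 = 4.0
print(total, result)
```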
2. Calculating Standard Deviation
 Variance and Standard Deviation - The variance
and standard deviation of a data set measure the
spread of the data about the mean of the data set.
 The variance of a sample of size n, represented by
$s^2$, is given by:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the mean of the sample.
 The standard deviation (s) can be calculated by
taking the square root of the variance.
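As a worked check of the formula, this Python sketch applies it to the data set 5, 8, 2, 1 from the mean example. The function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_standard_deviation(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [5, 8, 2, 1]
# Deviations from the mean (4) are 1, 4, -2, -3; their squares sum to 30.
variance = sample_variance(data)                      # 30 / 3 = 10.0
standard_deviation = sample_standard_deviation(data)  # square root of 10
print(variance, round(standard_deviation, 2))
```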
Standard Deviation
3. State that the term standard deviation is used to
summarize the spread of values around the mean,
and that 68% of the values fall within one
standard deviation of the mean.
 For normally distributed data, about 68% of the values
fall within one standard deviation of the mean.
 For normally distributed data, about 95% of the values
fall within two standard deviations of the mean.
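These percentages can be checked against the normal distribution itself. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = fraction_within(1)
within_two = fraction_within(2)
print(f"within 1 SD: {within_one:.1%}")   # 68.3%
print(f"within 2 SD: {within_two:.1%}")   # 95.4%
```

The figures of 68% and 95% are rounded values, and they apply only when the data are approximately normally distributed.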
Why use Standard Deviation?
4. Explain how the standard deviation is useful for
comparing the means and the spread of data
between two or more samples.
 A small standard deviation means that the data
are clustered closely around the mean value.
 A large standard deviation indicates a wider
spread around the mean.
 Standard deviation can be used to compare the
means and spread of two or more data sets.
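To illustrate, the two made-up samples below share the same mean but differ in spread, which only the standard deviation reveals. This is a sketch using Python's standard `statistics` module.

```python
import statistics

# Two hypothetical samples with the same mean (illustrative values only).
sample_a = [4, 5, 6]
sample_b = [1, 5, 9]

mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
sd_a, sd_b = statistics.stdev(sample_a), statistics.stdev(sample_b)

print("A: mean", mean_a, "SD", sd_a)   # clustered closely around the mean
print("B: mean", mean_b, "SD", sd_b)   # same mean, wider spread
```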
t-Test
5. Deduce the significance of the difference between two sets
of data using calculated values for t and the appropriate
tables.
t-Test
 The t-test assesses whether the means of two groups are
statistically different from each other.
 The larger the difference between the two means, the
larger t is.
 The larger the standard deviations, the smaller t is.
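The presentation does not give a formula for t. One common form, assuming the equal-variance (Student's) two-sample t-test, which matches the equal-variance assumption and the degrees of freedom used in this topic, is:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```

Here $\bar{x}_1$ and $\bar{x}_2$ are the two sample means, $s_1$ and $s_2$ the sample standard deviations, and $n_1$ and $n_2$ the sample sizes. The difference between the means is in the numerator and the standard deviations are in the denominator, which is why a larger difference increases t and larger standard deviations decrease it.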
t-Test Assumptions
 Normally distributed data
 Equal Variances
 Large sample size (at least 10 individuals)
t-Test
 Normality assumption. The data come from a distribution
that has one of those nice bell-shaped curves known as a
normal distribution. People worry about violating the
assumption of normality because data often look skewed.
 Fortunately, it has been shown that if the sample size is
even moderate for each group, quite severe departures
from normality don't seem to affect the conclusions
reached.
t-Test
 Equality of variance. Some researchers have argued that
equality of variance is actually more important than the
assumption of normality.
 Equality of variance means that the standard deviations
of the two groups are close to equal.
t-Test
1. Enter the values in a graphic display calculator or a
spreadsheet program, with values for the two
populations entered separately.
2. Use the calculator function keys or computer software to
calculate t.
3. Find the number of degrees of freedom.
This will be the total number of values in both populations,
minus 2.
t-Test
4. Find the critical value for t either using the computer
software or a table of values of t. The level of
significance (P) chosen should be 0.05 (5%) and the
appropriate row should be selected according to the
number of degrees of freedom.
5. Compare the calculated value of t with the critical value. If
the critical value is exceeded, there is evidence of a
significant difference between the means, at the 5%
level.
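The steps above can be sketched in Python. This is an illustration under stated assumptions, not the method of any particular calculator: the measurements are made up, the function `pooled_t` is a hypothetical helper implementing the equal-variance two-sample t-test, and the critical values are a few rows copied from a standard two-tailed t table for P = 0.05.

```python
import math
import statistics

# Step 1: hypothetical measurements for two groups (illustrative values only).
group_a = [10, 11, 11, 12, 12, 12, 12, 13, 13, 14]
group_b = [15, 16, 16, 17, 17, 17, 17, 18, 18, 19]

# Critical values of t for P = 0.05 (two-tailed), keyed by degrees of freedom.
# Only a few rows of a standard t table are reproduced here.
CRITICAL_T_P05 = {8: 2.306, 10: 2.228, 18: 2.101, 20: 2.086}

def pooled_t(sample_1, sample_2):
    """Return (t, degrees of freedom) for an equal-variance two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2                      # Step 3: total number of values minus 2
    pooled_variance = (
        (n1 - 1) * statistics.variance(sample_1)
        + (n2 - 1) * statistics.variance(sample_2)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample_1) - statistics.mean(sample_2)) / standard_error
    return t, df

t_value, df = pooled_t(group_a, group_b)  # Step 2: calculate t
critical = CRITICAL_T_P05[df]             # Step 4: look up the critical value
significant = abs(t_value) > critical     # Step 5: compare |t| with the critical value

print(f"t = {t_value:.2f}, df = {df}, critical value = {critical}")
print("significant at the 5% level:", significant)
```

The sign of t only reflects which group was entered first, so the comparison uses its absolute value.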
Correlations
6. Explain that the existence of a correlation does
not establish that there is a causal relationship
between two variables.
 A correlation on its own cannot be validly used to infer a
causal relationship between variables.
 This does not mean that correlations cannot
point to possible causal relations.
 However, the causes underlying the correlation, if
any, may be indirect and unknown.
 Consequently, establishing a correlation between
two variables is not a sufficient condition to
establish a causal relationship (in either direction).
Correlations
 Here is a simple example: hot weather may cause both
crime and ice-cream purchases.
 Therefore crime is correlated with ice-cream purchases.
 But crime does not cause ice-cream purchases, and
ice-cream purchases do not cause crime.
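The example can be imitated with made-up numbers. In this Python sketch both series are built from temperature alone, so neither can be causing the other, yet they are strongly correlated. All figures are invented for illustration, and `pearson_r` is a hand-written helper for the Pearson correlation coefficient.

```python
import math

# Made-up daily figures (illustrative only). Both series are built from
# temperature alone: neither one appears in the other's formula.
temperature = [10, 15, 20, 25, 30]
ice_cream_sales = [3 * t + bump for t, bump in zip(temperature, [1, -1, 0, 1, -1])]
crime_reports = [0.5 * t + bump for t, bump in zip(temperature, [-1, 1, 0, -1, 1])]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

r = pearson_r(ice_cream_sales, crime_reports)
print(f"correlation between ice-cream sales and crime: r = {r:.2f}")
```

The correlation is high only because both quantities rise with temperature, the shared underlying cause.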
Correlations
 A correlation between age and height in children
is fairly causally transparent, but a correlation
between mood and health in people is less so.
 Does improved mood lead to improved health? Or
does good health lead to good mood? Or does
some other factor underlie both? Or is it pure
coincidence?
 In other words, a correlation can be taken as
evidence for a possible causal relationship, but
cannot indicate what the causal relationship, if
any, might be.