Measures of Variability
Data analysis
1
The first step in any data analysis strategy is to calculate
summary measures to get a general feel for the data.
Summary measures for a data set are often referred to
as descriptive statistics. Descriptive statistics fall into
three main categories:
measures of position (or central tendency)
measures of variability
measures of skewness
2
The purpose of descriptive statistics is to describe the
data.
The type of data will determine which descriptive statistic
is appropriate.
Specifically, one can only calculate a mean with interval
or ratio data, whereas a mode can be calculated with
nominal, ordinal, interval or ratio data.
3
Measures of Position
Measures of position (or central tendency) describe where the
data are concentrated.
Mean
The mean is simply the arithmetic average of the data.
The mean provides you with a quick way of describing your
data, and is probably the most used measure of central
tendency.
However, the mean is greatly influenced by outliers. For
example, consider the following set: 1 1 2 4 5 5 6 6 7 150
While the mean for this data set is 18.7, it is obvious that nine
out of ten of the observations lie below the mean because of
the large final observation.
Consequently, the mean is not always the best measure of
central tendency.
4
Median:
The median is the middle observation in a data set. That
is, 50% of the observations are above the median and
50% are below the median (for sets with an even
number of observations, the median is the average of the
middle two observations).
The median is often used when a data set is not
symmetrical, or when there are outlying observations.
For example, median income is generally reported
rather than mean income because of the outlying
observations.
5
To get the median, first put your numbers in ascending or
descending order. Then check to see which of
the following two rules applies:
Rule One. If you have an odd number of numbers, the median is
the center number (e.g., three is the median for the numbers 1,
1, 3, 4, 9).
Rule Two. If you have an even number of numbers, the median
is the average of the two innermost numbers (e.g., 2.5 is the
median for the numbers 1, 2, 3, 7).
6
Mode:
The Mode is the value around which the
greatest number of observations are
concentrated, or quite simply the most
common observation.
Mode is often used with nominal data, but
is not the preferred measure for other
types of data.
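The three measures of position can be checked on the ten-value data set from the Mean slide. This is a minimal sketch using Python's standard statistics module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Data set from the Mean slide (the final value, 150, is the outlier)
data = [1, 1, 2, 4, 5, 5, 6, 6, 7, 150]

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # average of the two middle values (even n)
modes = statistics.multimode(data)  # every value tied for most frequent

print(mean, median, modes)
```

The median stays with the bulk of the data while the mean is pulled upward by the single value 150. This particular set has three tied modes, which is one reason the mode is rarely preferred for numeric data.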
7
The mean, median, and mode are
affected differently by skewness (i.e.,
lack of symmetry) in the data.
8
When a variable is normally distributed, the mean,
median, and mode are the same number.
9
When the variable is skewed to the left (i.e.,
negatively skewed), the mean is pulled to the
left the most, the median is pulled to the left the
second most, and the mode is the least affected.
Therefore, mean < median < mode.
10
When the variable is skewed to the right (i.e., positively
skewed), the mean is pulled to the right the most, the
median is pulled to the right the second most, and the
mode is the least affected.
Therefore, mean > median > mode.
11
Measures of Variability
While measures of position describe where the data points are
concentrated, measures of variability measure the dispersion (or
spread) of the data set.
Range:
The range is the difference between the largest and the smallest
observations in the data set. However, this is a limited measure
because it depends on only two of the numbers in the data set.
Using the above data set again, the range is 149, but that does not
provide any information regarding the concentration of the data at
the low end of the scale. Another limitation of range is that it is
affected by the number of observations in the data set.
Generally, the more observations there are, the more spread out they
will be. One use of range in everyday life is in newspaper stock
market summaries, which give the day's high and low numbers.
12
Measures of Variability
Measures of variability tell you how "spread out" or how
much variability is present in a set of numbers.
For example, which set of the following numbers
appears to be the most spread out?
Set A. 93, 96, 98, 99, 99, 99, 100
Set B. 10, 29, 52, 69, 87, 92, 100
Right! The numbers in set B are more "spread out."
One crude indicator of variability is the range (i.e., the
difference between the highest and lowest numbers).
13
Two commonly used indicators of
variability are the variance and the
standard deviation.
Variance:
Unlike range, variance takes into consideration all the
data points in the data set. If all the observations are the
same, the variance would be zero. The more spread out
the observations are, the larger the variance.
The variance tells you the average squared deviation
from the mean, so it is expressed in "squared units."
14
Standard Deviation:
Standard deviation is the positive square root of the
variance, and is the most common measure of variability.
Standard deviation indicates how close to or how far from
the mean the numbers tend to be. The larger the
standard deviation, the more variation there is in the data
set.
(If the standard deviation is 7, then the numbers tend to be
about 7 units from the mean. If the standard deviation is 1500,
then the numbers tend to be about 1500 units from the mean.)
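The range, variance, and standard deviation can be compared on Set A and Set B from the earlier slide. This is a sketch under one assumption: the slides do not say whether the sample formula (divide by n - 1) or the population formula (divide by n) is intended, so the sample versions are used here. The helper name describe is hypothetical.

```python
import statistics

set_a = [93, 96, 98, 99, 99, 99, 100]
set_b = [10, 29, 52, 69, 87, 92, 100]

def describe(values):
    """Return (range, sample variance, sample standard deviation)."""
    value_range = max(values) - min(values)
    variance = statistics.variance(values)  # sample variance, divides by n - 1
    std_dev = statistics.stdev(values)      # positive square root of that variance
    return value_range, variance, std_dev

range_a, var_a, sd_a = describe(set_a)
range_b, var_b, sd_b = describe(set_b)

print("Set A:", range_a, round(var_a, 2), round(sd_a, 2))
print("Set B:", range_b, round(var_b, 2), round(sd_b, 2))
```

All three measures come out larger for Set B, which agrees with the visual impression that Set B is more spread out. Swapping in statistics.pvariance and statistics.pstdev gives the population versions.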
15
Virtually everyone in education is already
familiar with the normal curve.
An easy rule applying to data that follow the
normal curve is the "68, 95, 99.7 percent rule."
That is . . .
Approximately 68% of the cases will fall within one
standard deviation of the mean.
Approximately 95% of the cases will fall within two
standard deviations of the mean.
Approximately 99.7% of the cases will fall within three
standard deviations of the mean.
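The three percentages can be reproduced from the normal curve itself. For a normal distribution, the share of cases within k standard deviations of the mean equals erf(k/√2); the sketch below evaluates this with Python's math module. The function name share_within is illustrative.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * share_within(k):.2f}%")
```

The results round to the 68, 95, and 99.7 percent figures quoted in the rule.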
16
Higher values for both of these indicators
stand for a larger amount of variability.
Zero stands for no variability at all (e.g., for
the data 3, 3, 3, 3, 3, 3, the variance and
standard deviation will equal zero).
17
Frequency Distributions
One useful way to view information in a variable
is to construct a frequency distribution (i.e., an
arrangement in which the frequencies, and
sometimes percentages, of the occurrence of
each unique data value are shown).
When a variable has a wide range of values, you
may prefer using a grouped frequency
distribution (i.e., where the data values are
grouped into intervals, 0-9, 10-19, 20- 29, etc.,
and the frequencies of the intervals are shown).
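Both kinds of frequency distribution can be built with a counter. The sketch below uses made-up scores (an assumption for illustration only) and an interval width of 10 to match the 0-9, 10-19, 20-29 grouping mentioned above.

```python
from collections import Counter

# Hypothetical scores, for illustration only
scores = [3, 12, 15, 27, 8, 21, 12, 19, 5, 27, 12, 24]

# Ungrouped: frequency and percentage of each unique value
freq = Counter(scores)
for value in sorted(freq):
    pct = 100 * freq[value] / len(scores)
    print(value, freq[value], f"{pct:.1f}%")

# Grouped: intervals of width 10 (0-9, 10-19, 20-29)
width = 10
grouped = Counter((s // width) * width for s in scores)
for low in sorted(grouped):
    print(f"{low}-{low + width - 1}", grouped[low])
```

Integer division by the interval width maps each score to the lower bound of its interval, so changing width regroups the data without other edits.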
18
Graphic Representations of
Data
Another excellent way to clearly
describe your data (especially for
visually oriented learners) is to
construct graphical
representations of the data (i.e.,
pictorial representations of the
data in two-dimensional space).
A bar graph uses vertical bars to
represent the data. The height of
the bars usually represents the
frequencies for the categories
shown on the X axis (i.e., the
horizontal axis). (By the way, the Y
axis is the vertical axis.)
19
A line graph uses one or more
lines to depict information
about one or more variables.
A simple line graph might be
used to show a trend over
time (e.g., with the years on
the X axis and the population
sizes on the Y axis).
Line graphs are used for many
different purposes in research.
For example, GPA is on the X
axis and frequency is on the Y
axis.
[Figure: sample line graph with series East, West, and North; X axis: 1st Qtr to 4th Qtr; Y axis: 0 to 100]
20
A scatterplot is used to
depict the relationship
between two quantitative
variables.
Typically, the independent
or predictor variable is
represented by the X axis
(i.e., on the horizontal
axis) and the dependent
variable is represented by
the Y axis (i.e., on the
vertical axis).
21
The relationship is not always positive.
The correlation coefficient ranges between -1
and +1.
Interpretation of Pearson r
• Close to +1: highly positively correlated
• Close to -1: highly negatively correlated
• Close to zero: no linear correlation
22
Correlation does not necessarily indicate
causation.
A correlation of +.82 tells us that a person with an average
score on one test will probably obtain an
average score on the other test.
23
How to Interpret the Values of Correlations.
The correlation coefficient (r) represents the linear
relationship between two variables. If the correlation
coefficient is squared, then the resulting value (r², the
coefficient of determination) will represent the proportion
of common variation in the two variables (i.e., the
"strength" or "magnitude" of the relationship).
In order to evaluate the correlation between variables, it
is important to know this "magnitude" or "strength" as
well as the significance of the correlation.
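The relationship between r and r² can be made concrete with a short calculation. This sketch computes Pearson r from its definition on hypothetical paired data (hours studied and test score are assumptions for illustration) and then squares it to get the coefficient of determination.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]

r = pearson_r(hours, score)
r_squared = r ** 2  # proportion of common variation in the two variables

print(round(r, 3), round(r_squared, 3))
```

Because r² is a proportion, it is always between 0 and 1 and is never larger in magnitude than r itself.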
24
Outliers.
Outliers are atypical (by definition), infrequent
observations.
Outliers have a profound influence on the slope
of the regression line and consequently on the
value of the correlation coefficient.
A single outlier is capable of considerably
changing the slope of the regression line and,
consequently, the value of the correlation, as
demonstrated in the following example.
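The effect of a single outlier can be sketched numerically. The data below are made up for illustration (they are not the data from the original figure): six points with almost no linear relationship, followed by the same points plus one extreme observation at (20, 20).

```python
import math

def slope_and_r(x, y):
    """Least-squares slope and Pearson r for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [4, 2, 5, 3, 4, 3]

slope_before, r_before = slope_and_r(x, y)
slope_after, r_after = slope_and_r(x + [20], y + [20])  # add one outlier

print(round(slope_before, 2), round(r_before, 2))
print(round(slope_after, 2), round(r_after, 2))
```

Without the outlier the correlation is close to zero; with it, both the slope and r become strongly positive, even though the other six points are unchanged.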
25
26
Analyses for Comparison
Nominal Data: Chi-Square
Interval Data: t-Test
Interval Data: One-Way ANOVA
Interval Data: Factorial ANOVA
Analyses for Association
Interval Data: Pearson Product-Moment Correlation (r)
Nominal Data: Phi Coefficient
Ordinal Data: Spearman Rank-Order Correlation
27
Parametric Methods                                   Nonparametric Methods
t-test for independent samples                       Mann-Whitney U test
ANOVA/MANOVA (multiple groups)                       Kruskal-Wallis analysis of ranks and the Median test
t-test for dependent samples (two variables          Sign test and Wilcoxon's matched pairs test
measured in the same sample)
28
t-test for independent samples
Purpose, Assumptions.
The t-test is the most commonly used method to
evaluate the differences in means between two groups.
For example, the t-test can be used to test for a
difference in test scores between a group of patients
who were given a drug and a control group who received
a placebo.
Theoretically, the t-test can be used even if the sample
sizes are very small (e.g., as small as 10; some
researchers claim that even smaller n's are possible), as
long as the variables are normally distributed within each
group and the variation of scores in the two groups is not
reliably different.
29
The normality assumption can be evaluated by
looking at the distribution of the data (via
histograms) or by performing a normality test.
The equality of variances assumption can be
verified with the F test, or you can use the more
robust Levene's test.
If these conditions are not met, then you can
evaluate the differences in means between two
groups using one of the nonparametric
alternatives to the t-test.
30
Independent-samples t test
DV: Talk
IV: Stress condition (low stress, high stress)

Group statistics
Talk          N    Mean    Std. Deviation   Std. Error Mean
Low stress    15   45.20   24.97            6.45
High stress   15   22.07   27.14            7.01
Std. Error Mean is the standard deviation of the sample means: Sx = SD/√15

Levene's test for equality of variances (tested at α = .05)
F      Sig.
.023   .881
Here you want the variances to be equal.
The larger the F value, the more dissimilar the variances are.
You want a small F.
In this case, the variances are similar.

t test for equality of means
                              t       df       Sig. (2-tailed)
Equal variances assumed       2.43    28       .022
Equal variances not assumed   2.430   27.808   .022
(The Mean Difference and Std. Error Difference columns have no values in this transcript.)
31
An independent-samples t test was conducted to evaluate the
hypothesis that students talk differently (amount of talking)
under different stress conditions. The test was significant,
t(28) = 2.43, p = .022. Students in the high-stress condition
talked less (M = 22.07, SD = 27.14) than students in the
low-stress condition (M = 45.20, SD = 24.97).
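The reported t value can be checked from the summary statistics alone. This is a sketch of the pooled-variance ("equal variances assumed") formula, using the means and standard deviations given in the write-up with n = 15 per group; the function name independent_t is hypothetical.

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# Standard error of one group mean: Sx = SD / sqrt(n)
se_low = 24.97 / math.sqrt(15)

# Low-stress group versus high-stress group
t, df = independent_t(45.20, 24.97, 15, 22.07, 27.14, 15)

print(round(se_low, 2), round(t, 2), df)
```

The result agrees with the reported t(28) = 2.43. Obtaining the p value would additionally require the t distribution, which is not in Python's standard library.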
32
t-test for dependent samples (paired-sample t-test)
Used when the two groups of observations (that are to be compared)
are based on the same sample of subjects who were
tested twice (e.g., before and after a treatment).
           Mean   N    Std. Deviation   Std. Error Mean
PAY        5.67   30   1.49             .27
SECURITY   4.50   30   1.83             .33
Std. Error Mean is the standard deviation of the sample means: Sx = SD/√30
33
Paired differences (Pay - Security)
Mean   Std. Dev.   Std. Err.   Lower   Upper   t       df   Sig. (2-tailed)
1.17   2.26        .41         .32     2.01    2.827   29   .008
A paired-sample t test was conducted to evaluate
whether employees were more concerned with pay or
job security. The results indicated that the mean concern
for pay (M = 5.67, SD = 1.49) was significantly greater
than the mean concern for security (M = 4.50, SD =
1.83), t (29) = 2.83, p = .008.
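The paired t statistic follows directly from the mean and standard deviation of the differences. This sketch uses the rounded values from the output (mean difference 1.17, standard deviation 2.26, n = 30), so the result differs slightly from the reported 2.827 because of rounding. The function name paired_t is hypothetical.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """Paired-sample t statistic and degrees of freedom from difference scores."""
    se_diff = sd_diff / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se_diff, n - 1

t, df = paired_t(1.17, 2.26, 30)

print(round(t, 3), df)
```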
34
Marija J. Norusis suggests that when reporting your
results, you give the exact observed significance level.
It will help the reader evaluate your findings.
E.g., p = .008: there are 8 chances in 1,000 that you would observe a
difference this large between the two samples if there were no real difference.
E.g., p = .08: there are 8 chances in 100, but you have decided in advance
that you will accept only 5 chances in 100, so the result is not significant.
35
Pearson Chi-square.
The Pearson Chi-square is the most common test for significance of the
relationship between categorical variables.
This measure is based on the fact that we can compute the expected
frequencies in a two-way table (i.e., frequencies that we would expect if
there was no relationship between the variables).
For example, suppose we ask 20 males and 20 females to choose between
two brands of jeans (brands A and B).
If there is no relationship between preference and gender, then we would
expect about an equal number of choices of brand A and brand B for each
sex.
The Chi-square test becomes increasingly significant as the numbers
deviate further from this expected pattern; that is, the more this pattern of
choices for males and females differs.
36
The Goodness of Fit test: used to find out whether the population
under study follows a specified distribution
Ho: the population distribution is uniform, that is, each brand of cola
drink is preferred by an equal percentage of the population
Ha: the population distribution is not uniform, that is, the brands of
cola drink are not all preferred by an equal percentage of the population
37
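Brand   O     E     O-E   (O-E)²   (O-E)²/E
A       50    60    -10   100      1.67
B       65    60      5    25      0.42
C       45    60    -15   225      3.75
D       70    60     10   100      1.67
E       70    60     10   100      1.67
Total   300   300     0            9.18
χ²(df = 4) = 9.18. The critical value at α = .05 is 9.49. Because 9.18 is
smaller than 9.49, Ho cannot be rejected, and we cannot say that the cola
brands are preferred by unequal percentages of the population.
For a goodness-of-fit test, df = k - 1, where k is the number of categories (here 5 - 1 = 4).
For a test of independence, df = (r - 1)(c - 1).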
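The goodness-of-fit statistic for the cola data can be recomputed step by step. This is a minimal sketch under the uniform Ho, where every brand has the same expected count. For a goodness-of-fit test with k categories, the degrees of freedom are k - 1, and 9.488 is the tabled critical value for df = 4 at α = .05.

```python
# Observed counts for the five cola brands
observed = {"A": 50, "B": 65, "C": 45, "D": 70, "E": 70}

total = sum(observed.values())
expected = total / len(observed)  # uniform Ho: same expected count per brand

# Sum of (O - E)^2 / E over all brands
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

critical_05 = 9.488  # tabled critical value for df = 4, alpha = .05
reject_h0 = chi_square > critical_05

print(round(chi_square, 2), df, reject_h0)
```

The unrounded statistic is about 9.17; summing the five rounded terms gives 9.18. Either way it falls just short of the critical value, so Ho is not rejected at the .05 level.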
38
Test of independence (we can test the relationship
between nominal variables)
The data are obtained from a random
sample
We use count data (frequencies)
We want to test whether perception of life is independent
of gender, that is, whether men and women find life equally exciting
39
Life excitement   Male   Female   Total
Excited           300    384      684
Not excited       296    481      777
Total             596    865      1461
Chi-square = 4.76, df = 1, p = .0290
What can you conclude?
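One way to work toward an answer is to recompute the statistic from the table. The sketch below calculates the Pearson chi-square for the 2 x 2 table with and without Yates' continuity correction. The slide does not say which version was used; the corrected value appears to be the one that matches the reported figure, which is an inference rather than something the source states. For df = 1 the p value can be obtained from the normal distribution with math.erfc.

```python
import math

# Observed counts: rows = excited / not excited, columns = male / female
table = [[300, 384], [296, 481]]

def chi_square_2x2(table, yates=False):
    """Chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            stat += diff ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

pearson = chi_square_2x2(table)
corrected = chi_square_2x2(table, yates=True)
p_corrected = p_value_df1(corrected)

print(round(pearson, 2), round(corrected, 2), round(p_corrected, 3))
```

Because the p value is below .05, the hypothesis of independence is rejected at that level: the share of respondents who find life exciting differs between men (300 of 596, about 50%) and women (384 of 865, about 44%).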
40