Module 10 - Prevention Research Center for Healthy Neighborhoods
Sept. 3, 2014
Agenda
Stats Lecture
1) Univariate analysis (looking at one variable)
… central tendencies, and variability (dispersion)
2) Bivariate analysis (comparing two variables)
… correlation, t-test, chi-square association
3) Additional context for stats assignment in group project
Applications in SPSS (handout)
Discussion
Where do we start?
Univariate Analyses
Need to make sure all our variables (e.g., scores on a
scale, income figures, gender, ethnicity) are behaving
appropriately for statistical testing
Each must have some variability (e.g., if all participants are
women, there is no variability, so we cannot compare outcomes by gender)
Need to check how much variability there is and what the typical
values are for each variable
For example, a typical value may be its average or mean value
These analyses are called univariate analyses.
Univariate analysis involves the examination across cases
of one variable at a time.
Summarizing Univariate Distributions
Any set of measurements that summarizes a variable
should have two important properties:
1. The Central Tendency (or typical value)
mode, median, mean
2. The Spread (variability or dispersion) about that value
range, variance, standard deviation
(That is, how much does each data value differ from
the mean or median value?)
Example of central tendency and variation
[Scatter plot: data points varying around an assumed mean of 5.0]
Assume mean = 5.0
Each point varies around the mean.
This variation contributes to the overall standard deviation (SD)
More on standard deviations, later…
Measures of Central Tendency
An estimate of the center of a distribution of
values; how much our data are similar
A way to determine what is most typical,
common, and routine
Central tendency is usually summarized with
one of three statistics:
1) Mode
2) Median
3) Mean
Measures of Central Tendency 1
The Mode
The mode, the most frequent value in a distribution, is the
least often used, as it can easily give a misleading impression:
mnemonic - mode = most.
If two values tie for the most frequent, the distribution is
called bimodal.
Can be used for all four levels of measurement (for nominal data,
it is simply the most common response: e.g. whether females or
males are more numerous in a study)
May not be effective in describing what is typical in the
distribution of a variable
Measures of Central Tendency 1
The Mode example
What is the most frequent value?
28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55,
56, 56, 58, 59, 59, 59
(this listing of the data set is called an array)
Where is the mode in each of these distributions?
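As a quick check outside SPSS (the slides use SPSS; Python here is only an illustrative alternative), the mode of this array can be found with the standard-library statistics module:

```python
from statistics import mode, multimode

ages = [28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55,
        56, 56, 58, 59, 59, 59]

print(mode(ages))       # 42 — it occurs four times, more than any other value
print(multimode(ages))  # [42] — a bimodal array would return two values here
```

Note that 59 occurs three times, so it is frequent but not the mode.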
Measures of Central Tendency 2
The Median
The median, the point that divides the distribution in
half; the midpoint of a set of numbers
To find the median value of a data set, arrange the
data in order from smallest to largest
Must be used for at least ordinal level of
measurement – why?
Unlike the mode, the median does not always
coincide with an actual value in the set (unless the
set has an odd number of values)
Measures of Central Tendency 2
The Median Example
2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
19 points, so the 10th one is the Median
Median = 9
If the number of points is even – then average the two
values around the middle (n = 18):
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
Median = (9 + 10) / 2 = 9.5
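Both cases (odd n, where the median is the middle value, and even n, where it is the average of the two middle values) can be checked with Python's statistics module (an illustrative alternative to the SPSS handout):

```python
from statistics import median

odd_n = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
even_n = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

print(median(odd_n))   # 9 — the 10th of 19 sorted values
print(median(even_n))  # 9.5 — the average of the two middle values, (9 + 10) / 2
```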
Measures of Central Tendency 3
Mean
The mean, or statistical average, takes into account
the values of each case in the distribution
It is the sum of all of the values divided by the total #
of the values.
Must be interval or ratio level measurements (e.g.,
weight, age, miles driven).
Should not be computed for ordinal level – why?
Mean can promote accuracy or distortion depending
on whether the distribution is symmetrical or
skewed.
Measures of Central Tendency 3
The Mean Example
2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
ANSWER:
Mean = SUM of all values / N
= (2+2+3+3+4+5+5+7+8+9+10+11+11+14+14+15+16+18+20) / 19
= 177 / 19 = 9.32
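The same calculation, written out in Python (again just an illustration; the course software is SPSS):

```python
values = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

mean = sum(values) / len(values)   # SUM of all values / N = 177 / 19
print(round(mean, 2))              # 9.32
```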
What is the Normal Distribution?
It looks like a bell with one “hump” in the middle,
centered around the population mean, and the
number of cases (data) tapering off to both sides of
the mean;
the symmetrical distribution of scores around the mean
Normal Distribution (aka, Bell Curve) – where
is the mean, median, and mode?
In a perfect normal distribution, mean,
median and mode are equal!
[Bell curve with the mode, median, and mean all at the center]
Means and variances are best measures
for symmetric or normal distributions
Describe by using:
arithmetic MEAN
VARIANCE (standard deviation)
Secondarily:
Range
Mode (most common value)
Skew (left or right)
Kurtosis (thickness of tails)
Normal Distribution - Skewness
• Skewness is used in describing asymmetric (non-normal) distributions.
• In a normal curve, the right and left halves of the curve
are mirror images of each other.
• If this is not the case, the curve is said to be skewed, either
positively (to the right) or negatively (to the left).
• If the scores tend to be concentrated toward the high
end of the score scale, the curve is negatively skewed.
• If they are concentrated toward the low end of the score
scale, they are positively skewed.
Skewness is measured from -3.0 to +3.0
0 skew score = symmetrical distribution
Normal Distribution - Skewness
[Figure: examples of negatively and positively skewed curves]
Example. Means and standard deviations
for all study variables

Variable                          | Mean  | Std. Deviation | N
SF-36 Scale                       | 80.47 | 20.37          | 257
Number of people in Household     | 2.81  | 1.35           | 257
Number of hours housework (sqrt)  | 29.68 | 10.31          | 257
Financial stress scale            | 5.07  | 1.95           | 257
The Outlier Effect
Outlier: a result that is far different from most of the results for
the group; extreme value(s) that can skew the overall results
Median and mode are not sensitive to outliers. That is, they
tend not to change with outliers.
Mean is sensitive to outliers. Mean can change greatly with
outliers.

Array          | Mean | Median | Mode
1, 1, 1, 1, 50  | 10.8 | 1      | 1
1, 1, 1, 1, 100 | 20.8 | 1      | 1
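The two rows of that demonstration are easy to reproduce (Python shown as an illustrative stand-in for SPSS): only the mean moves when the outlier grows.

```python
from statistics import mean, median, mode

for array in ([1, 1, 1, 1, 50], [1, 1, 1, 1, 100]):
    # mean shifts with the outlier; median and mode stay at 1
    print(mean(array), median(array), mode(array))
# 10.8 1 1
# 20.8 1 1
```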
To Address Outliers in Mean Calculations…
Trimmed mean: do not use the top and bottom five
percent of scores
In this example, we have 20 values. The lowest and highest values
reflect the lowest 5% and highest 5% of values in this list:
2 40 45 46 52 52 55 59 60 61 61 63 64 66 66 66 67
69 70 259
Mean for n = 20 is 66.2
Trimmed mean for n = 18 (dropping 2 and 259) is 59.0
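A small sketch of the trimming step (dropping the lowest value, 2, and the highest, 259, leaves a sum of 1,062 over 18 values). This is an illustration in Python, not the SPSS procedure:

```python
def trimmed_mean(values, trim=0.05):
    """Drop the lowest and highest `trim` fraction of values before averaging."""
    data = sorted(values)
    k = int(len(data) * trim)           # number of values to drop at each end
    if k:
        data = data[k:len(data) - k]
    return sum(data) / len(data)

scores = [2, 40, 45, 46, 52, 52, 55, 59, 60, 61, 61, 63, 64, 66,
          66, 66, 67, 69, 70, 259]

print(sum(scores) / len(scores))   # 66.15 — untrimmed mean, pulled up by 259
print(trimmed_mean(scores))        # 59.0 — with 2 and 259 removed (1062 / 18)
```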
Which measure of central tendency should
we use?
Both the median and mean are used to summarize
the central tendency of quantitative variables.
To decide which to use, consider these issues:
1. Level of measurement:
the median can be used with ordinal level data (often used
in scales); but,
the mean requires interval or ratio level data.
the mode should be used for nominal level data. (Think
Yes=1 and No=0 data. What would 0.36 mean? And 0.72?)
Which measure of central tendency should
we use?
Both the median and mean are used to summarize the
central tendency of quantitative variables.
To decide which to use, consider these issues:
2. The shape of the distribution
the median should be used when the data is skewed or has many
outliers
the mean should be used when the data is fairly “bell shaped” or
normal.
Tip: Use the mean when the mean and median are
very similar.
Mean or Median?
Shape of variable’s distribution:
The mean and median will be the same when the
distribution is perfectly symmetric.
When the distribution is not symmetric, the mean is pulled
in the direction of extreme values, but the median is not
affected in any way by extreme values.
Purpose of the statistical summary:
If the purpose is to report the middle position, then the
median is the appropriate statistic.
If the purpose is to report a mathematical average, the
mean is the appropriate statistic.
Normal distributions: means and
medians are very close
The arithmetic MEAN (average value) is nearly
the same as the MEDIAN (50th percentile, the
value where half of the ranked data points lie
above and half below).
Measures of Variability
(Variation/Dispersion)
How different the data are from each other,
reported by how the scores fall around the mean
For nominal data, simply look at how many cases are
in each category; for the rest…
Captures how widely and densely spread a variable’s
distribution is.
Measures of Variability
Variability is usually summarized with one of four
statistics:
1) The Percent of responses in each category (nominal data)
2) The Range (ordinal and higher)
3) The Variance (interval and ratio)
4) The Standard Deviation (interval and ratio)
Measures of Variability 1
Percentage & Range
For nominal data, simply report percentage in categories
(51% female, 22% social workers)
For ordinal, interval & ratio data, the range is calculated
as the difference between the highest value in a
distribution and the lowest value.
It can be drastically altered by an extreme value (an outlier)
Here the range is computed as “maximum value minus the minimum value + 1”
Example:
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20
Range is 20 – 2 + 1 = 19
2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100
Range is 100 – 2 + 1 = 99 (outlier effect)
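Both range calculations can be sketched in one small Python function (using the slides' inclusive "max − min + 1" convention; Python itself is just an illustrative stand-in for SPSS):

```python
def inclusive_range(values):
    # "maximum value minus the minimum value + 1", as defined in the slides
    return max(values) - min(values) + 1

data = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

print(inclusive_range(data))               # 19  (20 - 2 + 1)
print(inclusive_range(data[:-1] + [100]))  # 99  (100 - 2 + 1) — one outlier inflates the range
```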
Measures of Variability 2
Variance
Variance
The variance is the average of the squared differences
from the mean.
It takes into account all the scores to determine the
spread.
To calculate the variance, follow these steps:
1) Work out the mean (the simple average of the
numbers)
2) For each number: subtract the mean and then
square the result (the squared difference)
3) Work out the average of those squared differences.
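The three steps above can be sketched in Python (an illustration only; the slides do these calculations in SPSS or Excel). Dividing by n gives the population variance, while dividing by n − 1 gives the sample variance used in the formula on the next slide:

```python
def variance(values, sample=True):
    """Steps from the slides: find the mean, square each difference, average."""
    m = sum(values) / len(values)              # 1) work out the mean
    sq_diffs = [(v - m) ** 2 for v in values]  # 2) squared difference for each number
    denom = len(values) - 1 if sample else len(values)
    return sum(sq_diffs) / denom               # 3) average the squared differences

print(variance([2, 4, 4, 4, 5, 5, 7, 9], sample=False))  # 4.0
```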
Example of central tendency and variation
First, calculate the mean.
Find the difference at each point.
Square each difference and sum.

Variance = Σ(DIFF)² / (n − 1)
         = Σ(each value − mean)² / (n − 1)
SD = √[ Σ(each value − mean)² / (n − 1) ]

(Dividing by n − 1 gives the sample variance; dividing by n, as in
the dog example below, gives the population variance.)
Calculations in Excel table
Calculations in Excel table
Variance – Example
You and your friends have just measured the heights of your dogs.
The heights are: 600mm, 470mm, 170mm, 430mm and 300mm.
1. Find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 394
2. Calculate each dog's difference from the
Mean: (600 − 394 = 206), (470 − 394 = 76),
(170 − 394 = −224)…
3. To calculate the Variance, take each difference, square it, and then average the results:
Variance: σ² = [206² + 76² + (−224)² + 36² + (−94)²] / 5 = 108,520 / 5 = 21,704
Measures of Variability 3
Standard Deviation
Standard Deviation
Standard deviation is the square root of the variance: √(variance)
SD tells us to what degree the values cluster around the mean.
Standard Deviation:
σ = √21,704 = 147.32… ≈ 147
Now we can show which heights are within one Standard Deviation
(147mm) of the Mean: between 247mm and 541mm.
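The dog example can be verified with Python's statistics module, whose pvariance/pstdev functions divide by n (the population convention the example uses); this is only an illustrative alternative to SPSS or Excel:

```python
from statistics import pvariance, pstdev

heights = [600, 470, 170, 430, 300]   # dog heights in mm

print(pvariance(heights))             # 21704 — population variance (divide by n = 5)
print(round(pstdev(heights)))         # 147 — standard deviation, sqrt(21704)
```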
The variance and standard deviation are
calculated by software programs like
SPSS, Excel, SAS and others, even on hand
calculators. Thank goodness for modern
technology!
So, using the Standard Deviation
we have a "standard" way of
knowing what is normal, and what
is extra large or extra small.
Rottweilers are tall dogs, and
Dachshunds are a bit short.
Overview

Level             | Central Tendency (best represents all cases) | Variability (spread; dispersion)
Nominal           | Mode                                         | Percent of cases in categories
Ordinal           | Median                                       | Range
Interval or Ratio | Mean; Median; Mode                           | Variance; Standard deviation; Range
Bivariate Statistics
Now that we know a bit about each of our variables,
we can start comparing them to each other
We can also look at differences among groups
When comparing two variables or groups, use
bivariate statistics
Multivariate statistics look at the relationships among
many variables or groups at one time, beyond the
scope of our class
Comparing variables and groups…
Parametric Statistics
Parametric statistics require certain assumptions/qualities in
data/variables:
Normal distributions
Dependent variable is interval/ratio
Good sample size (at least 30)
Examples of parametric statistics:
1. Correlation: Is there a relationship between variables?
2. T-Tests: Are there mean differences in outcomes between two groups?
3. Analysis of Variance (ANOVA): Are there mean differences in outcomes
among groups? (two or more groups; will not do in this class)
Probability Value
Reports how likely it is that the relationship indicated is
statistically significant rather than having happened by
chance
In other words, how sure are we that what we found was
not just a fluke?
Most researchers set the level for statistical
significance at 0.05 or smaller (or 0.01, 0.001)
Indicated by the P value, e.g. p < .05 means there is less
than a 1 in 20 chance the results are due to sampling error
p < .01: less than a 1 in 100 chance; p < .001: less than a 1 in 1,000 chance
Correlation
To determine if a relationship exists between two
variables and the direction of the relationship
“What is the actual strength and direction of the relationship
between variables within the sample?”
To determine the degree to which the variables are
related and the probability that this relationship
occurred by chance
“What is the probability that the relationship between variables
within the sample is due to sampling error?”
These variables must be measured at the interval or
ratio level, and the relationship examined is linear.
[Scatter plots illustrating positive and negative linear relationships]
Correlation
Strength is indicated by a correlation coefficient (Pearson's r)
Correlation Coefficient (r) = the numerical value that
indicates both the strength and direction of the relationship:
(–) 1.0 = perfect negative relationship
(+) 1.0 = perfect positive relationship
The closer the coefficient is to either +1.0 or –1.0, the
stronger the linear relationship
Middle = moderate / weaker relationship
Close to 0 = no relationship
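Pearson's r can be written out from its definition (covariance divided by the product of the spreads). This Python sketch, with made-up data, shows the two endpoints of the range; in the course itself SPSS produces these values:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient: ranges from -1.0 to +1.0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]     # hypothetical study hours
score = [2, 4, 6, 8, 10]    # hypothetical test scores

print(round(pearson_r(hours, score), 10))        # 1.0 — perfect positive relationship
print(round(pearson_r(hours, score[::-1]), 10))  # -1.0 — perfect negative relationship
```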
Range of Correlation Coefficients (r)
-1.0 (perfect negative) … 0.0 (no correlation) … +1.0 (perfect positive)
Correlation Matrix
All variables are listed in the left side column &
repeated in a row on the top.
Find the direction & strength of the correlation
between variables by noting the correlation
coefficient and probability that appears in the
following matrix.
The row in which the first variable appears intersects with
the column headed by the second variable:
Example. Correlations among study variables
[Correlation matrix: lower-triangle Pearson r values among the study variables, with significance flags]
1. Life satisfaction scale
2. Dummy coded marital status
3. Number of people in household
4. Number of hours of housework per year (transformed using square root)
5. Centered financial stress scale
6. Friend/relative negative support/burden
** Correlation is significant at the 0.01 level (2-tailed).
* Correlation is significant at the 0.05 level (2-tailed).
t-tests
A statistical procedure that tests the means of two
groups to determine if they are statistically different.
Two common types:
1) Independent samples t-test
2) Paired samples t-test (used when you are comparing
means on the same subjects over time, each subject
having two measures)
E.g. useful when comparing a linear measure over two test
events. These are dependent samples, not independent
samples.
Independent samples t-tests
Compares independent groups (i.e., men and
women, control and experimental group) in terms of
outcomes
Independent t-test is useful in studies that employ
experimental designs.
Compares the means of two samples, but the
samples must be independently drawn from a
population – random selection for experiments
Tells us if the difference between the groups is
statistically significant
Example. Independent samples t-tests
t = 18.343; p < 0.001; the two groups are significantly different in the amount of exercise,
indicating that males exercise more per week than females.
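The t statistic behind such an example is the difference in group means divided by its standard error. The data behind the t = 18.343 result are not reproduced in the slides, so this Python sketch uses small hypothetical groups and the pooled-variance formula for equal-sized independent samples:

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic for two independent groups (a sketch)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))

men = [5, 6, 7, 6, 5, 7]      # hypothetical hours of exercise per week
women = [2, 3, 2, 4, 3, 2]

print(round(independent_t(men, women), 2))  # t ≈ 6.74; SPSS would also report p
```

In practice the t value and its p value come straight from SPSS; this only unpacks what the test compares.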
Nonparametric Statistics
Nonparametric statistics:
Do not depend on a distribution shape
Do not include nor use means, variance and standard
deviations
Use frequencies and percentages to describe the data.
Sometimes, medians, percentiles, and the difference
between the 75th and 25th percentiles (Interquartile range,
or IQR) are used when the data can be sorted and ranked in
order and displayed.
Nonparametric Statistics
Nonparametric statistics are used for:
samples too small for parametric statistics,
ordinal and nominal level dependent variables
Chi square, Mann-Whitney U, Kruskal-Wallis, etc.
Chi Square Tests (also called cross tabulation)
Not a cause and effect relationship.
It is a test of association between nominal variables
(similar to the correlation for interval level variables).
This test compares expected frequencies with observed
frequencies, easily seen in contingency tables.
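The "expected vs. observed" comparison can be sketched for a 2×2 table: expected cell counts come from the row and column totals, and the statistic sums the squared deviations. The counts below are hypothetical, and Python stands in for the SPSS Crosstabs procedure:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic comparing observed vs. expected counts in a 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # expected count for each cell = (row total * column total) / grand total
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical counts: exposed/not exposed (rows) by disease present/absent (columns)
print(round(chi_square_2x2(30, 20, 10, 40), 2))  # 16.67 — large deviation from expected
```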
Tests of Association
Are very useful when comparing two nominal
variables
Yes/No: Owning a car or having ready access to a car
Yes/No: Routine access to healthy foods
Yes/No: Regular exposure to Cigarette Smoke (exposure)
Yes/No: Occurrence of influenza (disease)
High/Mod/Low: Satisfaction with job
Yes/No: Less than college degree vs. Bachelors or higher
Tests of Association
1. The effect of one variable on another, assuming
that there are no other variables affecting the
association.
e.g. Is the development of lung cancer (Y/N) associated with at least 20
years of cigarette smoking (≥20 years vs. < 20 years)
The strength of the association can be measured comparing the odds of
developing lung cancer given long-term smoking compared to the odds of
developing disease given no long-term smoking.
2. The statistical dependence between two variables.
In particular, the presence of an association generally
implies that two characteristics occur together in the same
individuals more often than expected by chance alone.
ODDS
Odds of an event = # events / # non-events

            | Disease Present (cases) | Disease Absent (controls) | Row totals
Exposed     | a                       | b                         | a+b
Not Exposed | c                       | d                         | c+d
Column sums | a+c                     | b+d                       | a+b+c+d

Odds of disease given exposure = a/b
  = # dx present given exposure / # dx absent given exposure
Odds of disease given no exposure = c/d
  = # dx present given no exposure / # dx absent given no exposure
ODDS ratio or cross product
Odds ratio (OR) = (a/b) / (c/d) = (a*d) / (b*c) = ad / bc
  = odds of disease given exposure / odds of disease given no exposure
Note: range [0, ∞)
(if c or b = 0, then add 0.5 to all cells)
Calculated from the same 2×2 table of exposure by disease (cells a, b, c, d).
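The cross-product calculation, including the slides' "add 0.5 to all cells" fix for a zero cell, can be sketched as follows (hypothetical counts; Python is an illustrative stand-in):

```python
def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) = ad / bc, with the slides' 0.5 correction for zero cells."""
    if 0 in (b, c):  # a zero in b or c makes the ratio undefined; add 0.5 to all cells
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

print(odds_ratio(30, 20, 10, 40))  # 6.0 — exposed group has six times the odds of disease
print(odds_ratio(25, 25, 25, 25))  # 1.0 — no association (ad = bc)
```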
Measures for the strength of the
association
Odds ratio or relative risk
The stronger the association (i.e. the greater the
magnitude of the increased or decreased risk observed)
between two characteristics (e.g., exposure &
disease), the less likely it is that the
relationship is due merely to the effect of
some unsuspected confounding variable.
An odds ratio equal to 1 (one) means no
association at all.
Causation is not inferred from association.
Stronger odds ratio generally implies
stronger positive association
Odds ratio (OR) = ad / cb
Larger values of a and d versus smaller values of b and c
will yield very large odds ratios and strong associations
between:
• Disease Present with Exposure, versus Disease Absent with Non-Exposure.
Stronger odds ratio generally implies
stronger positive association
Odds ratio (OR) = ad / cb
OR = 1: No association (ad = bc)
OR > 1 when ad > bc (positive association); OR in (1, ∞)
OR < 1 when bc > ad (negative association); OR in (0, 1)
Let’s take time to summarize these slides
and ask some questions…