Descriptive Data Analysis - Gail Johnson`s Research Demystified
Data Analysis for Description
Research Methods for Public Administrators
Dr. Gail Johnson
Dr. G. Johnson,
www.ResearchDemystified.org
Simple But Concrete
The Children’s Defense Fund reports on
each day in America:
Four children are killed by abuse or neglect
Five children or teens commit suicide
Eight children or teens are killed by firearms
Seventy-five babies die before their 1st birthday
http://www.childrensdefense.org/child-research-data-publications/each-day-inamerica.html
Simple But Concrete
A million seconds = about 11 ½ days
A billion seconds = about 32 years
A trillion seconds = about 32,000 years
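The conversions on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the 365.25-day year is an assumption made here, not something the slide states.

```python
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def seconds_to_days(seconds):
    """Convert a number of seconds to days."""
    return seconds / SECONDS_PER_DAY


def seconds_to_years(seconds):
    """Convert a number of seconds to years."""
    return seconds / (SECONDS_PER_DAY * DAYS_PER_YEAR)


print(f"A million seconds:  {seconds_to_days(1_000_000):.1f} days")
print(f"A billion seconds:  {seconds_to_years(1_000_000_000):.1f} years")
print(f"A trillion seconds: {seconds_to_years(1_000_000_000_000):,.0f} years")
```

The results land close to the rounded figures on the slide, which is the point: a rounded, concrete comparison is easier to grasp than the raw number.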
Simple But Concrete
A $700 billion bailout translates into $2,333 IOU
from every person in the U.S.
Or, using a different metric, it comes to $45 per
week for each person in the U.S.
Going one step further, it comes out to $6 a day
Framing: are you willing to pay $6 a day to have a
functioning financial system?
Read more:
http://www.time.com/time/business/article/0,8599,1870699,00.html#ixzz0aqek0mRZ
Going Too Far?
Six dollars a day is also 25 cents an hour, or less
than half a penny a minute.
Framing: Would you be willing to pay less than
half a penny a minute?
Key Point: Does the comparison point make a
difference in what you would be willing to pay?
Read more:
http://www.time.com/time/business/article/0,8599,1870699,00.html#ixzz0aqf9HSQ9
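The per-person framing figures can be reproduced roughly as follows. This is a sketch under one stated assumption: a U.S. population of about 300 million, which is the figure implied by a $2,333 share of $700 billion but is not given on the slides.

```python
BAILOUT = 700e9      # dollars, from the slide
POPULATION = 300e6   # assumption: roughly 300 million people in the U.S.

per_person = BAILOUT / POPULATION
per_week = per_person / 52
per_day = per_person / 365
per_hour = per_day / 24
per_minute_cents = per_hour / 60 * 100

print(f"Per person: ${per_person:,.0f}")
print(f"Per week:   ${per_week:.2f}")
print(f"Per day:    ${per_day:.2f}")
print(f"Per hour:   ${per_hour:.2f}")
print(f"Per minute: {per_minute_cents:.2f} cents")
```

The total never changes; only the unit of comparison does, which is exactly what the framing question is probing.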
Common Descriptive Analysis
Counts: how many
Decennial census
Percents
Women earned 77% of what men earned in
2006, up from 59% in 1970
Parts of a whole
Percents (75%) and proportions (.75 or three-quarters)
Common Descriptive Analysis
But be mindful of “bigger pie” distortions when
working with percents and proportions
If the pie grows much faster than the slice, the slice will
appear relatively smaller as a percent even though it
still grew
Best example is the budget deficit as a percent of GDP:
if GDP grows much faster than the budget deficit, the deficit
will appear smaller even though it has also grown.
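A small numeric sketch of the "bigger pie" effect. All figures below are hypothetical and chosen only to make the arithmetic easy to follow; they are not actual deficit or GDP data.

```python
# Hypothetical figures in billions of dollars (illustration only)
deficit_then, gdp_then = 100.0, 2_000.0
deficit_now, gdp_now = 150.0, 5_000.0

share_then = deficit_then / gdp_then * 100          # slice as a percent of the pie, earlier
share_now = deficit_now / gdp_now * 100             # slice as a percent of the pie, later
deficit_growth = (deficit_now / deficit_then - 1) * 100

print(f"Deficit as a percent of GDP, then: {share_then:.1f}%")
print(f"Deficit as a percent of GDP, now:  {share_now:.1f}%")
print(f"Growth in the deficit itself:      {deficit_growth:.0f}%")
```

In this made-up case the deficit grows by half while its share of GDP falls, because the pie grew faster than the slice.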
Common Descriptive Analysis
Rates: number of occurrences that are
standardized
Deaths of infants per 100,000 births
Crop yields per acre
Crime rates
Rates provide an apples-to-apples comparison
between places of different size or populations
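To illustrate why rates allow an apples-to-apples comparison, here is a minimal sketch. The helper name `rate_per` and the two sets of counts are hypothetical, introduced only for this example.

```python
def rate_per(events, population, base=100_000):
    """Standardize a count of events to a rate per `base` people."""
    return events / population * base


# Hypothetical counts for two places of very different size
big_city_rate = rate_per(events=5_000, population=1_000_000)
small_town_rate = rate_per(events=100, population=10_000)

print(f"Big city:   {big_city_rate:,.0f} per 100,000")
print(f"Small town: {small_town_rate:,.0f} per 100,000")
```

The big city has fifty times as many events, yet the small town has the higher rate once both are put on the same base.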
Common Descriptive Analysis
Ratio: numbers presented in relationship to
each other
Student to teacher ratio: 15:1
Divide the number of students by the number of
teachers
1,500 students and 45 teachers equals roughly a 33 to 1
student to teacher ratio (1,500 divided by 45)
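The ratio calculation from the slide, written out as a short sketch. The exact reduced form is shown alongside the rounded "to 1" form.

```python
from fractions import Fraction

students, teachers = 1_500, 45   # figures from the slide

ratio = students / teachers            # students per teacher
exact = Fraction(students, teachers)   # reduced exact ratio

print(f"{ratio:.1f} to 1")
print(f"{exact.numerator}:{exact.denominator}")
```

Reporting "about 33 to 1" is a rounding of 33.3; the exact ratio is 100 students for every 3 teachers.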
Common Descriptive Analysis
Rates of change
Percentage change from one time period to
the other
For example: The budget increased 23% from FY
2006 to FY 2007.
Three Steps:
1. Divide newest data by oldest data
2. Subtract 1
3. Multiply by 100 to get the percentage change
Common Descriptive Analysis
Rates of change: applied
What was the rate of change in the 1992 budget
deficit as compared to 1980?
1. Divide the 1992 budget deficit ($290 billion) by the 1980
budget deficit ($73.8 billion) = 3.93
2. 3.93 - 1 = 2.93
3. 2.93 x 100 = 293 percent
The budget deficit in current dollars (meaning not
adjusted for inflation) increased 293 percent.
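The three steps can be wrapped in a small function. This is a sketch; the function name `percent_change` is chosen here for illustration.

```python
def percent_change(oldest, newest):
    """Percentage change from the oldest value to the newest value."""
    ratio = newest / oldest   # step 1: divide newest by oldest
    change = ratio - 1        # step 2: subtract 1
    return change * 100       # step 3: multiply by 100


# Budget deficit in billions of current dollars, from the slide
change = percent_change(oldest=73.8, newest=290)
print(f"{change:.0f} percent")
```

Applied to the slide's figures, the function reproduces the 293 percent increase.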
Common Descriptive Analysis
Frequency Distributions
Number and percents of a single variable
In The News: Women Now Are Majority
of College Graduates
Interpretation?
How would you interpret these percentages
in the comparative trend analysis?
Are you surprised by the changes over
time?
Why or why not?
Frequency and Percent
Distributions
Survey data: analyzed by distributions
How many men and women are in the program?
Distribution of Respondents by Gender:
Gender    Number    Percent
Male      100       33%
Female    200       67%
Total     300       100%
Frequency and Percent
Distributions
How many men and women are in the
program?
Write-up:
Of the 300 people in this program, 67% are
women and 33% are men.
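A frequency and percent distribution like this one can be tallied in a few lines. In this sketch the list of responses is reconstructed from the counts on the slide; in practice it would come from the survey data file.

```python
from collections import Counter

# Reconstructed from the slide's counts: 100 men and 200 women
responses = ["Male"] * 100 + ["Female"] * 200

counts = Counter(responses)
total = sum(counts.values())

for category, n in counts.items():
    print(f"{category:<8}{n:>5}{n / total:>8.0%}")
print(f"{'Total':<8}{total:>5}")
```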
Different Analysis Tools For
Different Situations
Frequency/percent distributions make sense when
working with nominal and ordinal data
But frequency/percent distributions for
interval/ratio data can result in a ridiculously long
table that is impossible to interpret
If I ask 500 people how many years they lived in an
area, I can get a wide range of answers.
For this type of data, I would then look at means,
medians, modes to describe that variable.
Describing Distributions
Central tendency
Means, Medians, Modes
How similar are the characteristics?
Example: Use when we want to describe the
similarity of the ages of a group of people.
Dispersion
Range, standard deviation
How dissimilar are the characteristics?
Example: how much variation in the ages?
Measures of Central Tendency
The 3-Ms: Mode, Median, Mean.
Mode: most frequent response.
Median: mid-point of the distribution.
Mean: arithmetic average.
Basic Concepts Revisited
Levels of Measurement
Nominal Level Data: names, categories
Eg. Gender, religion, state, country
Ordinal Level Data: data with an order, going from low
to high
Eg. Highest educational degree, income categories, agree-disagree scales
Interval Level Data: numbers with equal intervals but no true zero point
Eg. IQ scores, GRE scores
Ratio Level Data: real numbers with a true zero point
Eg. Age, weight, income, temperature on an absolute scale (Kelvin)
Which Measure of Central
Tendency to Use?
Depends on the type of data you have:
Nominal data: mode
Ordinal data: mode and median
Interval/ratio: mode, median and mean
For Interval Or Ratio Data:
Which One To Use?
Concept of the normal distribution, also
called the bell-shaped curve
In a normal distribution, the mean, median and
mode should be very similar
Use mean if distribution is normal
Use median if distribution is not normal
Normal Distribution:
Bell-Shaped Curve
[Figure: bell-shaped curve centered on the mean. Source: http://en.wikipedia.org/wiki/Normal_distribution]
Office contributions
$10, $1, $.50, $.25, $.25.
The mean is $2.40 (add up and divide by 5)
The median is $.50 (the mid-point of this
distribution)
The mode is $.25 (the most frequently
reported contribution)
Best description of contributions is median.
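The three measures for the office contributions can be checked with Python's standard `statistics` module. This is a minimal sketch using the five amounts from the slide.

```python
from statistics import mean, median, mode

contributions = [10.00, 1.00, 0.50, 0.25, 0.25]   # dollars, from the slide

print(f"Mean:   ${mean(contributions):.2f}")
print(f"Median: ${median(contributions):.2f}")
print(f"Mode:   ${mode(contributions):.2f}")
```

The single $10 contribution pulls the mean well above what most people gave, which is why the median describes this group better.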
Salaries
Assume that you had 11 teachers. 10
teachers earned $21,000 per year and one
earned $1,000,000.
What would be the best measure to describe
this data?
Salaries
The average salary would be $110,000.
The median and mode are both $21,000.
The curve would be positively skewed, i.e.
the mean is higher than the mode and median
The median would do the best job at
describing the center of the salaries
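The salary example works out as follows. This sketch also recomputes the mean without the one very high salary, to show how much a single outlier moves it.

```python
from statistics import mean, median, mode

# 10 teachers at $21,000 and one at $1,000,000, as on the slide
salaries = [21_000] * 10 + [1_000_000]

print(f"Mean:   ${mean(salaries):,.0f}")
print(f"Median: ${median(salaries):,.0f}")
print(f"Mode:   ${mode(salaries):,.0f}")

# Mean of the ten teachers once the outlier is set aside
without_outlier = [s for s in salaries if s < 1_000_000]
print(f"Mean without the $1,000,000 salary: ${mean(without_outlier):,.0f}")
```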
Skewed Data
1. Negative skew: The mass of the distribution is
concentrated on the right of the figure. It has
relatively few low values. The distribution is
said to be left-skewed.
2. Positive skew: The mass of the distribution is
concentrated on the left of the figure. It has
relatively few high values. The distribution is
said to be right-skewed. The $1 million salary
pulls the average up.
Wikipedia: http://en.wikipedia.org/wiki/Skewness
Skewed Distributions:
Negative and Positive
http://en.wikipedia.org/wiki/File:Skewness_Statistics.svg
Using Means With Survey Data?
Survey data is typically coded using numbers:
Gender: Male is coded 1
Female is coded 2
It is faster and less error-prone to code variables using
numbers
But the computer could treat these as numbers and
will compute a mean if asked
How would you interpret a mean for gender of 1.6? Or
a mean for religion of 2.8?
Do Not Use Means With
Nominal Data
Gender (and religion) are nominal variables
and should only be reported in terms of
distributions:
Frequency distribution: 10 men and 12 women
Percentage distribution: 45% men and 55%
women
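A short sketch of the problem, using the 10 men and 12 women above with the 1 = male, 2 = female coding. The software will happily compute a mean of the codes; only the distribution says anything meaningful.

```python
from collections import Counter
from statistics import mean

LABELS = {1: "men", 2: "women"}   # numeric codes are labels, not quantities
codes = [1] * 10 + [2] * 12       # 10 men and 12 women

# Computable, but not interpretable for a nominal variable
print(f"Mean of the codes: {mean(codes):.2f}")

# The appropriate summary: frequency and percentage distribution
counts = Counter(LABELS[code] for code in codes)
total = len(codes)
for label, n in counts.items():
    print(f"{label}: {n} ({n / total:.0%})")
```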
Using Means With Survey Data?
Scales (very satisfied <-> very dissatisfied) are
ordinal scales
But they are coded into the computer using numbers
5 for very satisfied <-> 1 for very dissatisfied
The computer will compute a mean if asked:
The mean was 3.8 for job satisfaction.
The mean satisfaction with faculty performance was 4.2
on a scale from 1-5
Grade-point averages are an example of means based
on an ordinal scale (A-F, on a scale of 0-4)
Using Means With Ordinal Data?
There is disagreement in the field, partly based on
academic discipline, about whether to use means with
ordinal data.
Things like GPA or faculty ratings are often shown as
means
It is often helpful for researchers to look at the means
initially when working with a lot of data: researchers are
looking for unusually high or low means.
It is also true that sometimes it is easier to show the means
than the percentage distribution for every variable
Washington Employee Survey
Question                                                          2006   2007   2009   Percent reporting 4 or 5 (positive)
I know what is expected of me at work.                            4.28   4.25   4.31   87%
I receive recognition for a job well done.                        3.34   3.43   3.47   54%
I have the tools and resources I need to do my job effectively.   3.76   3.75   3.80   70%
Using Means With Ordinal Data?
But most people are more familiar with polling
results, which report percent distributions.
We tend to see something like 55% report supporting
cap and trade legislation rather than a mean of 3.4 on a
scale of 5 (for) to 1 (against).
The decision about whether means or percent
distributions are used to report ordinal data should
reflect audience preference and ease of audience
understanding.
Not an ideological stance
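Both ways of reporting the same ordinal item can be computed side by side. The ten responses below are hypothetical, made up only to show the two summaries; they are not from the Washington survey.

```python
from statistics import mean

# Hypothetical responses on a scale of 5 (very satisfied) to 1 (very dissatisfied)
responses = [5, 5, 4, 4, 4, 3, 3, 2, 2, 1]

average = mean(responses)
positive = sum(1 for r in responses if r >= 4)     # count of 4s and 5s
percent_positive = positive / len(responses) * 100

print(f"Mean: {average:.1f} on a scale of 1-5")
print(f"Percent reporting 4 or 5 (positive): {percent_positive:.0f}%")
```

Neither number is wrong; the choice is about which one the audience will read more easily.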
Measures of Dispersion
Used with Interval and Ratio Data
Simple Description: The Range
Reported salaries ranged from $21,000 to $1,000,000
Ages in the group ranged from 18 to 32
Standard Deviation
Measures the dispersion in terms of the distance
from the mean
Small standard deviation: not much dispersion
Large standard deviation: lots of dispersion
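A sketch comparing two groups that share the same mean age but differ in dispersion. Both lists of ages are hypothetical, and the population standard deviation (`pstdev`) is used here as an assumption; a sample would normally call for `stdev` instead.

```python
from statistics import mean, pstdev

# Hypothetical ages; both groups have a mean age of 25
tight_group = [24, 25, 25, 26, 25]
spread_group = [18, 32, 20, 30, 25]

for name, ages in (("Tight", tight_group), ("Spread", spread_group)):
    print(
        f"{name}: mean {mean(ages)}, "
        f"range {min(ages)} to {max(ages)}, "
        f"standard deviation {pstdev(ages):.1f}"
    )
```

The means are identical, so only the range and standard deviation reveal that the second group is far more varied.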
Standard Deviation
Normal Distribution: Bell-shaped curve
About 68% of the values fall within 1 standard
deviation of the mean
About 95% of the values fall within 2 standard
deviations of the mean
Normal Distribution
[Figure: normal curve showing the mean at the center, standard deviations on either side, and the central 95% of the distribution]
Applying the
Standard Deviation
Average test score = 60.
The standard deviation is 10.
Therefore, if the scores are normally distributed,
about 95% of the scores are between 40 and 80.
Calculation (2 standard deviations = 20):
60 + 20 = 80
60 - 20 = 40
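The 68% and 95% rules of thumb can be checked against a normal distribution with the test-score figures. This sketch assumes the scores are exactly normally distributed, which real test scores only approximate, and it requires Python 3.8 or later for `NormalDist`.

```python
from statistics import NormalDist

MEAN, SD = 60, 10                      # from the slide
scores = NormalDist(mu=MEAN, sigma=SD)  # assumption: scores follow a normal curve

low, high = MEAN - 2 * SD, MEAN + 2 * SD
within_two = scores.cdf(high) - scores.cdf(low)
within_one = scores.cdf(MEAN + SD) - scores.cdf(MEAN - SD)

print(f"Between {low} and {high}: {within_two:.1%} of scores")
print(f"Between {MEAN - SD} and {MEAN + SD}: {within_one:.1%} of scores")
```

The exact shares are slightly above 95% and 68%, which is why those figures are best read as "about".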
Standard Deviation with Means
The Standard Deviation is used with interval/ratio
level data
Typically, standard deviations are presented with
means so the reader can tell whether there is a lot
or a little variation in the distribution.
Note: the standard deviation is sometimes used in
other statistical calculations, such as z-scores and
confidence intervals
Describing Two Variables
Simultaneously
Cross-tabulations (cross tabs, contingency
tables)
Used when working with nominal and
ordinal data
It provides great detail
Describing Two Variables
Simultaneously
Detail about the race and gender of the 233
people in the workplace:
Race     White   Black   Other
Men      21%     15%     14%
Women    31%     11%     6%
Describing Race and Gender
Write-up:
Of the 233 employees, the greatest
proportion are white women (31%),
followed by white men (21%). Fifteen
percent of the employees are black men and
11% are black women; 14% are men of
other race identity and 6% are women of
other race identity.
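A cross-tabulation is a count of every combination of two variables, expressed here as a percent of the total. The slides give only percentages, so the cell counts below are hypothetical: they were chosen to sum to 233 and to round to the percentages in the table.

```python
from collections import Counter

# Hypothetical cell counts (not from the source); they total 233 employees
cell_counts = {
    ("White", "Men"): 50, ("White", "Women"): 73,
    ("Black", "Men"): 36, ("Black", "Women"): 26,
    ("Other", "Men"): 33, ("Other", "Women"): 15,
}

# Expand to one (race, gender) record per employee, as raw data would look
employees = [cell for cell, n in cell_counts.items() for _ in range(n)]

crosstab = Counter(employees)
total = len(employees)

for gender in ("Men", "Women"):
    cells = "  ".join(
        f"{race}: {crosstab[(race, gender)] / total:.0%}"
        for race in ("White", "Black", "Other")
    )
    print(f"{gender:<6} {cells}")
```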
Describing Two Variables
Simultaneously
Comparison of Means
Used when one variable is nominal or ordinal,
and the second variable is at the interval/ratio level
of measurement.
Examples:
Men in the MPA program have a GPA of 3.2 as
compared to 3.0 for women.
The mean overall citizen satisfaction score is 4.2 this
year as compared to 3.5 last year.
Mean salary for women was $35,000 as compared to
$38,000 for men last year.
Key Points
These simple descriptive analysis techniques can
be effective:
They illuminate, provide feedback, inform and might
persuade.
The math is generally straightforward.
Descriptive data is generally easier for many people
to understand than more complex statistics
(stay tuned).
Complex statistics are not inherently better!
The Tough Question
If descriptive data is distorted, it tends to be in
the way things are being counted and measured.
The math is usually correct.
Example: The federal debt is often presented just in
terms of percent of debt held by the public but the total
debt includes money borrowed from other government
funds.
As a result, the debt looks smaller than what it
actually is.
The Tough Question
If descriptive data is distorted, it tends to
be in the way things are being counted and
measured. The math is usually correct.
Example: Health insurance profits look
different when calculated as a percent of
corporate revenue than when calculated as a
percent of all spending on health care.
The profit will look smaller when presented as a percent of
all health care spending, which is larger than just
corporate insurance revenue.
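The effect of the chosen denominator is easy to see with numbers. All three figures below are hypothetical, picked for round arithmetic; they are not actual industry data.

```python
# Hypothetical figures in billions of dollars (illustration only)
profit = 10.0
insurer_revenue = 200.0
total_health_spending = 2_000.0

as_percent_of_revenue = profit / insurer_revenue * 100
as_percent_of_spending = profit / total_health_spending * 100

print(f"Profit as a percent of insurer revenue:      {as_percent_of_revenue:.1f}%")
print(f"Profit as a percent of all health spending:  {as_percent_of_spending:.1f}%")
```

The profit is the same in both lines and the math is correct in both; what changes the impression is what the profit is being compared to.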
The Tough Question
Always ask: what exactly is being
measured and counted?
Consider whether there are other ways of
counting and other ways of doing the
analysis that might yield different results (or
create different perceptions).
Do the choices reflect a political agenda?
Creative Commons
This powerpoint is meant to be used and
shared with attribution
Please provide feedback
If you make changes, please share freely
and send me a copy of changes:
[email protected]
Visit www.creativecommons.org for more
information