Class_08 - UNC School of Information and Library Science


wed, sept 16, 2015
using large data sets
announcements
• Presentation this Friday, Sept. 18 @ noon in
our classroom, room 208
– Should we archive Facebook? Why the users are
wrong and the NSA is right
– Cathy Marshall, Texas A&M (formerly of Microsoft
Research)
Assignment: Analyzing a research article
• Select one of the seven research articles linked from our
class schedule (listed under today’s readings, Sept 16)
• Using the Analyzing Research Articles handout (need one?),
critique the article
• Focus on the purpose of the study, description of study
design (participants, methods, how they collected data),
data analysis and conclusions
• Don’t worry about specific statistical analysis methods
• Due next Wednesday, Sept 23 – print or email to me by
class time
• Format – whatever works for you (bullets, address some
but not necessarily all questions/points from handout)
• Counts as one pop quiz (worth up to 2 points)
Rationale for assignment
• Use evaluative strategies and vocabulary to
examine a scholarly research product
– (the “analyzing information” part of our class title)
• Explore research that has been conducted
using the same dataset that we will be using
for our project, the General Social Survey
today’s line-up…
• Survey research as a method
• General Social Survey GSS (dataset for our project)
• Project details, variables, think about groups
• SPSS & Virtual Lab
What is a survey?
• A survey is a:
– systematic method for gathering information
– from (a sample of) entities
– for the purposes of constructing quantitative
descriptors
– of the attributes of the larger population of which
the entities are members
typically, surveys…
• Gather information by asking people
questions
• Collect information by either (1) interviewers
asking questions and recording responses or
(2) respondents reading and recording their
own answers
• Collect information from a subset of the
population, a sample, rather than from all
members
Based on probability design*
*units are selected using a method that ensures that each
unit has a known, nonzero probability of being included
The SAGE Handbook of Online Research Methods (2008)
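A minimal sketch of the probability-design idea in the footnote above: in a simple random sample, every unit has the same known, nonzero chance of selection. The frame of 1,000 household IDs, the sample size, and the function name are assumptions made up for illustration; this is not how the GSS draws its sample.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement from the sampling frame.

    Every unit has the same known inclusion probability: n / len(frame).
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: 1,000 household IDs (illustrative only).
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 50, seed=42)

# Known, nonzero probability of inclusion for each unit in the frame.
inclusion_probability = 50 / len(frame)
print(len(sample), inclusion_probability)
```

Fixing the seed only makes the draw repeatable for class discussion; a real survey would not publish a sample this way.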
The reading spends a great length of time
introducing the history of survey development.
Why is it important to understand the history
of survey?
-Wanyi
1936 election and the
Literary Digest survey
• Magazine had predicted every
election since 1916
• Sent out 10 million surveys, and 2.4
million responded
• They said: Landon would win 57% of
the vote
• What happened: 62% Roosevelt
landslide
What went wrong?
• Sample not representative
• Lists came from subscriptions,
phone directories, club members
• Phones were a luxury in 1936
• Selection Bias toward the rich
• Voluntary response: Republicans
were angry and more likely to respond
• Context: Great Depression
– 9 million unemployed
– Real income down 33%
– Massive discontent, strike waves
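A rough sketch of how the two problems listed above (an unrepresentative list and voluntary response) can combine to flip a result. Every share and response rate below is a made-up number chosen for illustration, not 1936 data, and the function names are hypothetical.

```python
# All numbers are hypothetical, chosen only to illustrate the mechanism.
groups = {
    # People reachable through subscriptions, phone directories, club lists.
    "on_lists": {"population_share": 0.25, "roosevelt_support": 0.45},
    # Everyone else (never had a chance to be surveyed).
    "not_on_lists": {"population_share": 0.75, "roosevelt_support": 0.68},
}


def population_support(groups):
    """True support: each group's support weighted by its population share."""
    return sum(g["population_share"] * g["roosevelt_support"]
               for g in groups.values())


def respondent_support(support, rate_supporters, rate_opponents):
    """Support among people who mail the survey back (voluntary response)."""
    responding_for = support * rate_supporters
    responding_against = (1 - support) * rate_opponents
    return responding_for / (responding_for + responding_against)


true_support = population_support(groups)
frame_support = groups["on_lists"]["roosevelt_support"]  # selection bias
# Assume opponents are angrier and respond more often (30% vs. 20%).
poll_estimate = respondent_support(frame_support, 0.20, 0.30)

print(f"true: {true_support:.1%}  list only: {frame_support:.1%}  "
      f"respondents: {poll_estimate:.1%}")
```

In this toy setup a candidate with majority support in the population looks like the loser among respondents, even though millions of surveys come back.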
Polls vs. Surveys
• No clear distinction between the two terms
– “Poll” often used for private sector opinion studies
• Use many of the same design features as studies that would be
called surveys
– “Poll” rarely used to describe government or scientific
surveys
• To me, the term poll implies either
– A commercial or less-scientific study, or
– A quick turn-around survey whose results may be of short-term interest
Steps in conducting a survey
• Clearly state research objective(s)
• Decide on survey mode(s)
– How will you contact potential respondents?
(web, email, phone, etc.)
– In what media will the survey be given? (web,
email, phone, etc.)
• Determine fielding strategy (how to maximize
response rates?)
• Design the survey questions and the survey
instrument
Steps in conducting a survey
• Determine sample size and sampling strategy
• Obtain Institutional Review Board (IRB) or other
approval as necessary
– Are respondents promised confidentiality?
– What is the impact if their survey responses become
known?
• Pre-test
– Give the survey to some test subjects and get their
feedback
• What works and what doesn’t?
• Are you getting correct data/information?
– Revise and re-pre-test as necessary
General Social Survey
• The GSS (General Social Survey) is a biennial
personal interview survey of U.S. households
conducted by the National Opinion Research
Center (NORC). The first survey took place in
1972.
• Approximately 3000 American adults are
interviewed in person for about 90 minutes
and asked around 450 questions.
http://www3.norc.org/gss+website/
Purpose of GSS
• gather data on contemporary American
society in order to monitor and explain trends
and constants in attitudes, behaviors, and
attributes over time
• compare the United States to other
societies
General Social Survey
• demographics & attitudes
– The questionnaire contains a standard core of
demographic and attitudinal variables, plus
certain topics of special interest selected for
rotation (called "topical modules")
– Items include national spending priorities,
drinking behavior, marijuana use, crime and
punishment, race relations, quality of life,
confidence in institutions, and membership in
voluntary associations
Data to Story Project
• Go over project description
• Components
– Project description
– Search log
– Annotated bibliography
– Presentation
– Participation in group
[handout]
variables
• variable – a characteristic that can vary in
value among subjects in a sample or a
population. We are interested in similarities
and differences - variance
• types of variables
– categorical
– quantitative
categorical variable
• scale for measurement is a set of categories
• examples:
– Racial-ethnic group (white, black, Hispanic)
– Political party identification (Dem., Repub., Indep.)
– Vegetarian? (yes, no)
– Mental health evaluation (well, mild symptom formation,
moderate symptom formation, impaired)
– Happiness (very happy, pretty happy, not too happy)
– Religious affiliation
– Major
SPANKING: Categorical (Single)
Do you strongly agree, agree, disagree, or strongly
disagree that it is sometimes necessary to discipline a
child with a good, hard spanking?
Categories:            Label:               Code as:
{strongly_agree}       Strongly agree       5
{agree}                Agree                4
{disagree}             Disagree             3
{strongly_disagree}    Strongly disagree    2
{dontknow}             DON'T KNOW           1
{refused}              REFUSED              0
Sample question from GSS
scales of measurement
for categorical variables, two types:
• nominal scale – unordered categories
– preference for president, race, gender, religious
affiliation, major, opinion items (favor vs. oppose,
yes vs. no)
• ordinal scale – ordered categories
– political ideology (very liberal, liberal, moderate,
conservative, very conservative)
– anxiety, stress, self esteem (high, medium, low)
– mental impairment (none, mild, moderate, severe)
– government spending on environment (up, same,
down)
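One way to see the practical difference between the two scales: nominal categories can only be counted, while ordinal categories can also be ranked, so a "middle" response is meaningful. The sketch below uses the political ideology categories from the list above; the responses, the majors, and the helper names are hypothetical.

```python
from collections import Counter

# Ordinal scale: the order of the categories carries information.
IDEOLOGY_ORDER = ["very liberal", "liberal", "moderate",
                  "conservative", "very conservative"]

# Hypothetical responses from five people (illustrative only).
ideology = ["moderate", "liberal", "very conservative",
            "moderate", "conservative"]
major = ["biology", "history", "biology", "English", "psychology"]


def most_common_category(values):
    """Works for nominal and ordinal data: just count the categories."""
    return Counter(values).most_common(1)[0][0]


def ordinal_median(values, order):
    """Only meaningful for ordinal data: rank the responses, take the middle.

    For an even number of responses this returns the upper-middle category.
    """
    ranks = sorted(order.index(v) for v in values)
    return order[ranks[len(ranks) // 2]]


print(most_common_category(major))               # counting is all we can do
print(ordinal_median(ideology, IDEOLOGY_ORDER))  # ranking is allowed here
```

Asking for the "median major" would make no sense, because majors have no natural order.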
nominal scale – unordered categories
PRES12: Categorical (Single)
Did you vote for Obama or Romney?
Categories:                   Code as:
Obama                         5
Romney                        4
Other Candidate (Specify)     3
Didn’t vote for president     2
Don’t know                    1
Refused                       0
ordinal scale – ordered categories
POLVIEWS: Categorical (Single)
We hear a lot of talk these days about liberals and conservatives.
I'm going to show you a seven-point scale on which the political
views that people might hold are arranged from extremely
liberal--point 1--to extremely conservative--point 7. Where
would you place yourself on this scale?
Categories:
Extremely liberal
Liberal
Slightly liberal
Moderate, middle of the road
Slightly conservative
Conservative
Extremely conservative
DON'T KNOW
REFUSED
Code as:
7
6
5
4
3
2
1
0
8
quantitative variable
• possible values differ in magnitude
• examples:
– Age, height
– Annual income
– Time spent online each week
– Reaction time to a stimulus
(e.g., cell phone while driving in experiment)
– Number of “life events” in past year
statistics
• descriptive
• inferential
– hypothesis testing
– estimation
descriptive statistics
• use of statistics to describe, summarize, and
explain or make sense of a given set of data
• frequency distribution
• measures of central tendency
– mode
– mean
– median
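A short sketch of these descriptive statistics using Python's standard library. The weekly hours online are hypothetical values invented for the example, not GSS data.

```python
from collections import Counter
import statistics

# Hypothetical answers to "hours spent online each week" from 8 respondents.
hours_online = [5, 10, 10, 12, 15, 20, 10, 8]

# Frequency distribution: how many respondents gave each value.
frequency = Counter(hours_online)

# Measures of central tendency.
mode = statistics.mode(hours_online)      # most frequent value
mean = statistics.mean(hours_online)      # sum of values / number of values
median = statistics.median(hours_online)  # middle value of the sorted data

for value in sorted(frequency):
    print(f"{value:>2} hours: {frequency[value]} respondent(s)")
print("mode:", mode, "mean:", mean, "median:", median)
```

In SPSS the same summaries come from the Frequencies procedure; the Python version is only meant to show what is being computed.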
Comparison of mean and median
• Mean
– Uses all of the data
– Has desirable statistical properties
– Affected by extreme high or low values (outliers - example)
– May not best characterize skewed distributions
• Median
– Not affected by outliers
– May better characterize skewed distributions
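UNC Geography Majors Salaries Example
In the mid-1980s at the University of North Carolina, the average
starting salary of geography students was well over $100,000.
A single extremely high salary can produce an average like this
even when the typical graduate earns far less.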
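To make the geography salary example concrete, here is a sketch with invented numbers: nine graduates with modest starting salaries and one with a very large one. The salaries are assumptions for illustration only, not actual UNC figures.

```python
import statistics

# Hypothetical starting salaries: nine typical graduates and one outlier.
salaries = [20_000] * 9 + [1_000_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # unaffected by the outlier

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")

# Drop the outlier and the mean falls back to the typical value.
mean_without_outlier = statistics.mean(salaries[:-1])
print(f"mean without outlier: ${mean_without_outlier:,.0f}")
```

Here the mean says "well over $100,000" while the median says $20,000, which is why the median often characterizes a skewed distribution better.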
Correlation
Causation
sample patterns from GSS data
– median income of female respondents compared with
average income of male respondents
– median level of education of respondents who own a
gun
– number of female respondents who own a gun
compared with number of male respondents who
own a gun
– average age of respondents who indicated the
government should spend more on space exploration
– self-reported level of happiness compared with
income level
sample characteristics of the GSS
• The sampling frame of the General Social Survey
is all U.S. adults living in households. The
sampling frame includes 97.3% of all U.S. adults.
• Who does not live in a household?
– college students in dorms
– military personnel in barracks
– prisoners
– elderly persons in retirement homes
Does the GSS sample really draw from all the adults in
its sample frame?
• After the GSS is sampled, only 70% of persons in
the sample actually respond to the survey
– 23% refuse or cut the survey off in the middle
– 2% are unavailable or can’t be found
– 5% are missing for other reasons
• In general, a response rate of 60% or more is
considered minimally acceptable, but you should
check your results in any way you can.
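A quick arithmetic check of the figures above. The outcome percentages, the 97.3% frame coverage, and the 60% threshold come from the slides; multiplying coverage by response rate is only a rough back-of-the-envelope way to see how much of the adult population is both in the frame and responding, and the variable names are hypothetical.

```python
# Outcome shares from the slide, in percent of sampled persons.
outcomes = {
    "responded": 70,
    "refused or broke off": 23,
    "unavailable or not found": 2,
    "missing for other reasons": 5,
}

MINIMUM_ACCEPTABLE_RATE = 60  # percent, rule of thumb from the slide
FRAME_COVERAGE = 97.3         # percent of U.S. adults living in households

total = sum(outcomes.values())
response_rate = outcomes["responded"]
nonresponse_rate = total - response_rate
meets_threshold = response_rate >= MINIMUM_ACCEPTABLE_RATE

# Rough share of all adults who are both in the frame and responding.
covered_and_responding = (FRAME_COVERAGE / 100) * (response_rate / 100)

print("outcomes sum to", total, "percent")
print("nonresponse:", nonresponse_rate, "percent")
print("meets 60% rule of thumb:", meets_threshold)
print(f"in frame and responding: {covered_and_responding:.1%}")
```

The 30% who do not respond may differ from those who do, which is why a response rate above the threshold still calls for checking results against other sources.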
Let’s look at a GSS questionnaire
Start at page 31
where can you access SPSS?
• Odum Institute
– Davis Library 2nd floor – ask lab assistant
• https://virtuallab.unc.edu
• Lab in the Undergraduate Library (need to
confirm)
Notes…
• Bring a flash drive to the Odum lab on
Monday – you may want to save your work
• Davis Library >> Room 219
• The dataset that we are using (GSS 20124) is
available for download on our class website
– schedule>>sept 21
– the dataset is in .sav format (SPSS's native format), so plan to open it in SPSS