Data Analysis and Statistics Course Using ACS

ACS Data Analysis and Statistics
Course
6.16.2016
-Mickey
Motivation
• Where are they after graduation?
I told you so!
More about the problem
-Unemployment rate: 7%
-Unemployment rate for recent grads: 8.6%
-Unemployment rate for graduate degree holders: 3.5%
-Median salary: $45,000
-Median salary for recent grads: $32,000
-Median salary for grad degree holders: $60,000
-Predicted growth in job field: 18%
-Likelihood to work in retail: 1.4 times the average
Source: ACS 2009-2010
Recent grads: 22-26 years old
Experienced workers: 30-54 years old
Undergrad and Grad Student
Differences
• Quantitative skills
– Data acquisition
– Data analysis
– Interpretation of findings
• More education
– Eligible for jobs with greater pay
Searching for Jobs
• Idealist.org search terms:
– Sociology: 9 results
– Social science: 75 results
– Program assistant: 231 results
– Research: 753 results
Source: Vitullo, Margaret Weigers. "Searching for a Job with an Undergraduate Degree in
Sociology." American Sociological Association: Footnotes 37, no. 7 (2009).
Solution
New Coursework
• Teach students more about data acquisition
and analysis
– Mining data
– Using statistical software
• STATA, R, SPSS, Excel
– Using mapping software
• ArcGIS
– Look at real world problems
About Sociology Undergrads
• Not very quant-focused
• Love their subfield courses
• Detest Soc 210 (Intro to Stats) and Soc 310
(Research Methods).
• Accustomed to courses dealing with
specialized topics and literature
• Think methods are boring
The Challenge
• How do we get them to take this course?
Why we really want them to take the
course
• We want them to have
the best “Life chances”
– Def: A probabilistic
concept describing how
likely it is, given certain
factors, that an individual’s
life will turn out a certain
way.
-Max Weber
Course Content
• 14-week course meeting 3 times a week for an
hour (3 credits)
• 3 main components/competencies
– Data acquisition
– Data analysis
– Data presentation
• Pre-reqs: Intro stats
Classroom requirements
• Projector with PowerPoint and a whiteboard
• Computers for students with the following
software
– ArcGIS
– SPSS, R, STATA
– Excel
Breaking down the weeks
• First 4 weeks
– Learn how to collect sources of secondary data
• ACS | American Community Survey (demography)
• CDC | Centers for Disease Control and Prevention (health)
• GSS | General Social Survey (attitudes)
• NCES | National Center for Education Statistics (education)
– Assignments will focus on cleaning and downloading data
sets (STATA, R, Excel, or SPSS can be used)
– Lecture about codebooks
– Lecture about research ethics (e.g., FERPA) and why they
can’t always get the data they want.
• Top-coding
• Anonymity
Breaking down the weeks
• Next six weeks (weeks 5-10)
• Teach analysis skills using stats software
Review of Intro Stats
– Descriptive statistics
– Inferential statistics
• SPSS/STATA for
Breaking down the weeks
• Last four weeks (weeks 11-14)
– Teach data presentation skills
– Contingency tables
– Pivot tables (Excel)
– Graphs in Excel and STATA
– Mapping in ArcGIS
• Final project: A poster presentation using any
of the 4 data sets we will use in class.
Potential Jobs for These Students
• GS 5-7 research analysis for the government
– Intro level jobs
• Research assistant
• Program assistant
• Copy editor (fact checking)
• Future graduate student
Problems
• How to make this course appeal to students
beyond insecurities.
– Content to make the course more interesting
– How to name the course.
• Secondary data analysis doesn’t sound great
An example lesson
Given the time we have we will go through this quickly.
Module 3: Data Mining
Overview
• What is data mining
• Where to acquire data
• Acquiring data (6 different sources)
• Our goal is to get those 6 sources of data and
make them into STATA .dta files if they are not
available in that format.
What is it?
• Data mining is a euphemism that researchers use
when acquiring data. It refers to the tedious
process of obtaining data from the internet,
searching for the specific data you would like to
use, and downloading it.
• Most commonly it is used when referencing the
acquisition of secondary data.
• Secondary data is data that has already been
collected by someone else (person, group, or
organization).
Common Places for Data Mining
Population
• United States Census Bureau (American FactFinder)
Economy
• Bureau of Labor Statistics (closely affiliated with the Census)
Education
• National Center for Education Statistics
Attitudes
• General Social Survey
Health
• Centers for Disease Control and Prevention
Global
• United Nations
Census Bureau Data
• The main purpose of the United States Census Bureau is
to conduct a decennial census (every ten years) so that
Congressional Representatives and taxes can be
appropriately allocated across the country.
• Although the Census was started for this reason it has done
much more in terms of data collection.
• The U.S. Census Bureau conducts more than 130 surveys
each year.
• Their most common surveys are:
– The Decennial Census (a.k.a. the Census)
– The American Community Survey (yearly)
– Annual Retail Trade
– Survey of Income and Program Participation (rotating panel)
– Current Population Survey (monthly; also source of Bureau of
Labor Statistics Data)
American Community Survey
• Done every year since 2005
• Why did it start?
– The Decennial Census used to have 2 versions: a short form that was sent to the U.S.
population, and a long form that was used to acquire more data from an additional
random subsample representative of the U.S. population.
– The ACS was started to collect this information on a yearly basis, and to do away with
the long form of the Decennial Census.
• What it does:
– “Through the ACS, we know more about jobs and occupations, educational attainment,
veterans, whether people own or rent their home, and other topics. Public officials,
planners, and entrepreneurs use this information to assess the past and plan the future.
When you respond to the ACS, you are doing your part to help your community plan
hospitals and schools, support school lunch programs, improve emergency services,
build bridges, and inform businesses looking to add jobs and expand to new markets,
and more.” –Census Bureau Website
Getting the Data
• Go to:
http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
• Click on Advanced Search
• Click on SHOW ME ALL
Getting the Data
• Click on Topics, then click on Dataset, and click
on 2013 ACS 5-year estimates.
• Also within Topics click on Age and Sex, click
on Age, and then click on Sex
Getting the Data
• Instead of clicking Topics click on Geographies,
select a geographic type (State-040), Select
one or more geographic areas (All States
within United States and Puerto Rico), click
ADD TO YOUR SELECTIONS. Now close the
window.
Getting the Data
• Now you will see this
• Click on the “AGE AND SEX” table in blue
• By age group you will see the number of males and females, and the total
number of people in each state.
• There is an estimate and a margin of error. You are given a margin of error
because there is sampling involved. This survey didn’t go out to everyone
in the United States. Because of this there is most likely a little bit of error
attributable to the fact that the true population and the population
represented by the sample are different. However, because the number of
people surveyed was high these errors are low (about + or – 0.01%).
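To make the estimate and margin of error concrete, here is a minimal sketch in Python (used only for illustration; the course itself works in STATA, R, SPSS, or Excel). It assumes the margin of error is published at the 90% confidence level, which is the ACS convention. The estimate and margin of error are made-up numbers, not values from the table above, and the function names are hypothetical.

```python
# Illustrative only: the numbers below are invented, not taken from an ACS table.
Z_90 = 1.645  # assumption: ACS margins of error use the 90% confidence level


def confidence_interval(estimate, moe):
    """Return the (lower, upper) bounds implied by an estimate and its margin of error."""
    return estimate - moe, estimate + moe


def standard_error(moe, z=Z_90):
    """Convert a published margin of error back into a standard error."""
    return moe / z


estimate = 250_000  # hypothetical number of people in one age group
moe = 1_500         # hypothetical margin of error for that estimate

low, high = confidence_interval(estimate, moe)
print(f"Interval: {low:,} to {high:,}")
print(f"Standard error: {standard_error(moe):.1f}")
print(f"Margin of error as a share of the estimate: {moe / estimate:.2%}")
```

Reading the margin of error as a share of the estimate is a quick way to judge how much sampling error matters for a given row of the table.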
Getting the Data
• You will want to download the data and convert
it into STATA format
– (see Module 1 if you forgot how to do this)
Bureau of Labor Statistics
• Founded in 1884 by the Bureau of Labor Act to collect
information about employment and labor.
• The Bureau of Labor Statistics of the U.S. Department of
Labor is the principal Federal agency responsible for
measuring labor market activity, working conditions, and
price changes in the economy. Its mission is to collect,
analyze, and disseminate essential economic information to
support public and private decision-making. As an
independent statistical agency, BLS serves its diverse user
communities by providing products and services that are
objective, timely, accurate, and relevant.
• Its biggest survey, the Current Population Survey, is a joint
product of the BLS and the Census Bureau.
Getting the Data
• Working with the BLS is extremely difficult
• They have a search tool which is not user
friendly at all called the BLS Data Finder
– http://beta.bls.gov/dataQuery/find
• It’s easier to find what you want using Google
and typing “BLS” and “whatever data you are
looking for”
Example
• Let’s say you want data on employment
status.
• Your search should lead you here:
– http://www.bls.gov/gps/home.htm#tables
– However, you might have to mine around a while
before finding the right place for the data you
want.
Example
• Click on GPS tables
• Now click on the 2014 Annual Averages (XLS) link. It should
produce an Excel table.
• You might do a lot of data cleaning depending on your
needs.
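As a rough idea of what that cleaning can involve, here is a minimal sketch in Python (illustration only; the same steps can be done in STATA or Excel). It assumes the spreadsheet has been saved as CSV text and that it contains a title line, a header row, data rows with thousands separators, and a trailing footnote. The layout, the state figures, and the function names are all hypothetical, not taken from the real BLS file.

```python
import csv
import io

# Hypothetical export: title line, blank line, header row, data rows, footnote.
RAW = """Employment status by state (hypothetical layout)

State,Labor force,Unemployed
Alabama,"2,150",146
Alaska,365,25
Note: numbers in thousands.
"""


def to_number(cell):
    """Strip thousands separators and whitespace, then convert to an integer."""
    return int(cell.replace(",", "").strip())


def clean(raw_text, header_first_cell="State"):
    """Keep only the header row and the data rows that match its width."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    # Locate the header row by its first cell; everything above it is title text.
    start = next(i for i, row in enumerate(rows)
                 if row and row[0] == header_first_cell)
    header = rows[start]
    records = []
    for row in rows[start + 1:]:
        if len(row) != len(header):
            continue  # skip blank lines and footnotes
        record = {header[0]: row[0]}
        for name, cell in zip(header[1:], row[1:]):
            record[name] = to_number(cell)
        records.append(record)
    return records


for record in clean(RAW):
    print(record)
```

Real tables usually need more than this (merged header cells, footnote markers inside cells), so treat it as a starting point rather than a recipe.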
National Center for Education Statistics
• The functions of the National Center for Education Statistics have
existed since the US Department of Education was founded in 1867.
• The Department of Education has since changed status, from Department
to Office, among other changes.
• There has been a lot of opposition to the Department of Education
from the Republican party. Many (e.g., Reagan and Dole) have
pledged to dismantle it.
• The National Center for Education Statistics (NCES) is the primary
federal entity for collecting and analyzing data related to education
in the U.S. and other nations. NCES is located within the U.S.
Department of Education and the Institute of Education Sciences.
NCES fulfills a Congressional mandate to collect, collate, analyze,
and report complete statistics on the condition of American
education; conduct and publish reports; and review and report on
education activities internationally.
Getting the Data
• Go to:
– https://nces.ed.gov/
– Place your cursor over the “Data & Tools” tab and
click on the “Custom Datasets & Tables” option
Getting the Data
• Click on the Education Data Analysis Tool
(EDAT)
• You will have to create a log in to have access
to the data
• Where it says “select survey”, click on NELS
(National Education Longitudinal Study of
1988)
About NELS
• A nationally representative sample of eighth-graders were first
surveyed in the spring of 1988. A sample of these respondents were
then resurveyed through four follow-ups in 1990, 1992, 1994, and
2000. On the questionnaire, students reported on a range of topics
including: school, work, and home experiences; educational
resources and support; the role in education of their parents and
peers; neighborhood characteristics; educational and occupational
aspirations; and other student perceptions. Additional topics
included self-reports on smoking, alcohol and drug use and
extracurricular activities.
• It basically collects information from students when they are eighth
graders, tenth graders (sophomores), twelfth graders (seniors), 2
years after high school when many were enrolled in college, and lastly 6
years later when many had finished college.
About NELS
• This data set is a bit hard to understand.
– BY= Base Year
– F1= 10th grade
– F2= 12th grade
– F3= Roughly 2 years after high school
– F4= Roughly 8 years after high school (the 2000 follow-up)
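One way to keep the prefixes straight is a small lookup table. The sketch below is in Python for illustration only. The survey years come from the NELS description above; the helper name `wave_of` is hypothetical, and the rule that the first two characters identify the wave is an assumption that holds for the variable names used in this module but may not hold for every NELS variable.

```python
# Wave prefixes paired with the survey years given in the NELS description.
WAVES = {
    "BY": "Base year (8th grade, 1988)",
    "F1": "First follow-up (10th grade, 1990)",
    "F2": "Second follow-up (12th grade, 1992)",
    "F3": "Third follow-up (1994)",
    "F4": "Fourth follow-up (2000)",
}


def wave_of(variable_name):
    """Guess the survey wave from the first two characters of a variable name."""
    prefix = variable_name[:2].upper()
    return WAVES.get(prefix, "No wave prefix")


# Variable names used in this module.
for name in ["BYSES", "BY2XMTH", "F3DIPLOM", "RACE", "SEX"]:
    print(f"{name:10s} {wave_of(name)}")
```

Variables such as RACE and SEX carry no prefix, so the codebook is still the place to confirm when they were measured.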
Getting the Data
• Let’s just get a few quick things: race, sex, SES, and an
indicator of academic ability, all at the base year (BY).
• Then we’ll get a high school graduation variable from F3
• Under variable search, click and select “Search by Variable
Name”
Getting the Data
• Race Search “race” and select the “RACE”
variable
• Sex Search “sex” and select the “SEX” variable
• SES Search “socio-economic” and select the
“BYSES” variable
• Academic ability Search “BY2XMTH” and select
the “BY2XMTH” variable
• High school graduate Search “F3DIPLOM” and
select the “F3DIPLOM” variable
Getting the Data
• Click on the “Temporary Tag File” to
make sure you have all of the variables
you are looking for.
• Place your cursor over the
“DOWNLOAD OPTIONS” and click on
the “Download Data and Syntax Files”
link
• Click “Next Step”
• Fill in the STATA bubble and click “Next
Step”
• Click on the burnt orange download
button.
• Done!
General Social Survey
• The General Social Survey is housed at NORC (originally the National
Opinion Research Center) at the University of Chicago.
• “Understanding our society – and taking action on the issues that
confront it – requires insight gained through objective, high-quality
social science research. That’s why decision makers and policy
leaders turn to NORC at the University of Chicago, an independent
research organization known for excellence, innovation, and
effective collaboration. Working with NORC experts, clients obtain
the data and analysis needed to drive evidence-based decisions and
improve public policy in fields such as health, education,
economics, crime, justice, energy, security, and the environment.
Dedicated to the public interest for 70 years, NORC has helped a
wide range of clients identify and address society’s most urgent
challenges” –NORC website
Getting the Data
• Go here:
– http://www3.norc.org/GSS+Website/
– Click on “Browse
Variables”
Getting the Data
• Let’s say we want
data on race-based
affirmative action
• You can filter by
subject or keyword
• Filtering by subject
we see that
affirmative action is
there so let’s click it.
Getting data
• To make things simple, let’s click on the “add to
cart” icon for the first three variables and the
last one.
• Please note that the years for these variables
are not the best. There is no trend data for the
college variables. We only have one time point
(i.e. cross-sectional data).
Getting the Data
• Again, you will have to sign up to use this data.
• To trigger the sign-in, go to your cart, click on
“VARIABLES”, and in the drop-down menu click
“VIEW ALL”.
Getting the Data
• Now click on “Actions”
– Under the drop-down menu click “Extract data”
– On the next screen name your extract and give it a
description
– Then click “NEXT”
– Under choose variables click add all
Getting the Data
• For case selection don’t select anything. Just click next.
• For “Choose output options” click on the “Stata Control
File” bubble
– Click “Save”
• On the next screen click on the download icon after it
finishes loading (i.e. the arrows stop turning)
Centers for Disease Control and
Prevention
• The Center’s name was formerly the Communicable
Disease Center.
• It was founded in 1946 as the successor of the WW2
program for Malaria Control in War Areas.
• It would take on a greater role as the Communicable
Disease Center.
• It kept growing and adding STDs and Tuberculosis to its list
of responsibilities.
• As its role changed so did its name. It briefly became the
National Communicable Disease Center in 1967, and later
changed to Center for Disease Control in 1970.
• “and Prevention” was tacked on in 1992, but it retained the
CDC acronym.
CDC
• CDC works 24/7 to protect America from health, safety
and security threats, both foreign and in the U.S.
Whether diseases start at home or abroad, are chronic
or acute, curable or preventable, human error or
deliberate attack, CDC fights disease and supports
communities and citizens to do the same.
• CDC increases the health security of our nation. As the
nation’s health protection agency, CDC saves lives and
protects people from health threats. To accomplish our
mission, CDC conducts critical science and provides
health information that protects our nation against
expensive and dangerous health threats, and responds
when these arise.
Real CDC
Zombie Apocalypse CDC from
the “Walking Dead.” This is not
really a CDC location.
The BRFSS
• The Behavioral Risk Factor Surveillance System (BRFSS)
is the nation's premier system of health-related
telephone surveys that collect state data about U.S.
residents regarding their health-related risk behaviors,
chronic health conditions, and use of preventive
services.
• Established in 1984 with 15 states, BRFSS now collects
data in all 50 states as well as the District of Columbia
and three U.S. territories.
• BRFSS completes more than 400,000 adult interviews
each year, making it the largest continuously
conducted health survey system in the world.
Getting the Data
• Go to
– http://www.cdc.gov/brfss/annual_data/annual_data.htm
– Click on the 2013 Annual Survey Data
– On the next screen scroll down until you see the
2013 BRFSS Data (SAS Transport Format) link
– Download the data
Getting the Data
• This next part is a bit tricky
• To open the file you first have to type a special
command. Do this in the command window or
a separate .do file.
• For this extract I typed in
fdause "C:\Users\dmhurdle\Desktop\LLCP2013.XPT"
• Your file path should be different
Getting the Data
• You should now be able to look at the data with your Data
Editor window.
– Within that window there is a sub-window called “Variables”
• Use it to browse all of the variables you have just collected.
• To gain an understanding of the variables you can also
download the 2013 BRFSS Codebook found under the
“2013 Survey Data Information” heading
United Nations
• Replaced the ineffective League of Nations
after WW2.
• Established in 1945
• There are several locations: Manhattan,
Geneva, Nairobi, and Vienna.
• The HQ is in Manhattan
The Department of Economic and
Social Affairs
• A subdivision of the UN
• The Department of Economic and Social Affairs (DESA)
promotes and supports international cooperation to
achieve development for all, and assists governments in
agenda-setting and decision-making on development issues
at the global level.
• DESA provides a broad range of analytical products and
policy advice that serve as valuable sources of reference
and decision-making tools for developed and developing
countries, particularly in translating global commitments
into national policies and action and in monitoring progress
towards the internationally agreed development goals,
including the Millennium Development Goals.
Getting the Data
• Go to:
– http://www.un.org/en/development/desa/population/publications/database/index.shtml
• Click on total population
• On the next screen click the Download Data Files Icon
• On the next screen click on the Download Data Files
Icon
• Convert the data from Excel to STATA (See Module 1)
• Done!
Now what
• Now that you’ve had some training here are
some great places to help you track down data
– https://ciser.cornell.edu/info/datasource.shtml
– http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
• Or do a Google search for the data you are
looking for.
A Caveat for Data Miners
• Ethics is a very important part of research.
• Hence, there are restrictions to accessing
information.
• Much data is top-coded making it very
difficult to find an individual within a data
set. Most data sets are anonymized and
attributes that could be used to single out
people are commonly omitted from the
data.
• Some types of data, like those covered by doctor-patient
confidentiality or student records (e.g., grades), are not easily
obtained because there are laws protecting their use.
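Top-coding can be illustrated with a minimal sketch in Python (for illustration only). It assumes the simplest form of top-coding, where every value above a cutoff is replaced by the cutoff itself. The incomes, the cutoff, and the function name are hypothetical; real data providers choose their own thresholds and sometimes substitute a group average instead.

```python
def top_code(values, cap):
    """Replace every value above the cap with the cap itself."""
    return [min(value, cap) for value in values]


incomes = [28_000, 45_000, 60_000, 310_000, 1_250_000]  # hypothetical incomes
cap = 150_000                                           # hypothetical cutoff

print(top_code(incomes, cap))
```

After top-coding, the two highest earners are indistinguishable from each other, which is the point: an unusually large value can no longer be used to single out one person.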