Messy Data:
Teaching Students Early on About
the Realities of Data
Cornell College
• Small liberal arts college (1100
students)
• Mathematics and Statistics Department with
4.5 tenure track lines
• Teach on the block plan
Statistics History at Cornell
• Intro stat
• Probability/Math Stat
• Stat 2
• “New Frontiers”
• Epidemiology
• Dealing with Data: Data Manipulation, Data
Visualization, and Big Data
Data Course
• Team taught with computer scientist
• Prerequisite either intro stat or CS 1
• Focused on hands-on work
• Morning was two hours of lecture
• Afternoon was two hours of computer lab
Data Course - Plan
• 1/3 of the course on each topic
• Data Cleaning
• Data Visualization
• Big Data
• Relevant computer science
fundamentals addressed in a just-in-time fashion
• Use R as the software tool
Data Course - Reality
• 1/3 Data Cleaning
• 1/2 Data Visualization
• 1/6 Big Data
Daily Structure
• Morning – 2 hours
• M-Thur: Lecture
• 1 hour stat
• 1 hour CS
• Fri: Student presentations
• Afternoon – 2 hours
• Computer lab
Data Cleaning
• Simple issues
• Clearly wrong entries
• Potentially wrong entries
• Functions of a variable
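A minimal sketch of the three simple issues above, written in Python for illustration (the course itself used R). The records, field names, and age thresholds are all hypothetical assumptions, not taken from the course materials.

```python
# Hypothetical race records; field names and values are illustrative only.
records = [
    {"name": "A", "age": 34, "time": "1:25:30"},
    {"name": "B", "age": -2, "time": "1:40:05"},  # clearly wrong: negative age
    {"name": "C", "age": 97, "time": "0:58:10"},  # potentially wrong: worth a second look
]

def to_minutes(hms):
    """Convert an 'h:mm:ss' string to minutes (a function of a variable)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60

def classify_age(age):
    """Label an age as 'ok', 'check' (plausible but unusual), or 'invalid'."""
    if age <= 0 or age > 120:
        return "invalid"
    if age < 10 or age > 90:  # assumed thresholds, chosen only for this example
        return "check"
    return "ok"

# Add the derived variable and the flag without altering the raw fields.
cleaned = [
    {**r, "minutes": to_minutes(r["time"]), "age_flag": classify_age(r["age"])}
    for r in records
]
flags = [r["age_flag"] for r in cleaned]
```

Flagging rather than deleting keeps the "potentially wrong" entries available for a human decision.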
Data Cleaning
• More complex issues
• Combining data sets
• Linking variable issues
• Making sure data sets are combined properly
• Different variable formatting in different data sets
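One way the combining issues above might look in practice, sketched in Python for illustration (the course itself used R). The two data sets, the "id" linking variable, and its two formats are hypothetical assumptions.

```python
# Hypothetical data sets: IDs are zero-padded strings in one and integers in the other.
runners = [
    {"id": "007", "name": "Lee"},
    {"id": "012", "name": "Kim"},
    {"id": "030", "name": "Diaz"},
]
results = [
    {"id": 7, "minutes": 85.5},
    {"id": 12, "minutes": 92.0},
    {"id": 99, "minutes": 70.1},
]

def normalize_id(raw):
    """Bring both ID formats to a common integer key."""
    return int(str(raw).strip())

results_by_id = {normalize_id(r["id"]): r for r in results}

merged, unmatched_runners = [], []
for runner in runners:
    key = normalize_id(runner["id"])
    match = results_by_id.get(key)
    if match is None:
        unmatched_runners.append(key)  # runner with no result
    else:
        merged.append({"id": key, "name": runner["name"], "minutes": match["minutes"]})

# Results with no runner record: the other half of the "combined properly" check.
runner_ids = {normalize_id(r["id"]) for r in runners}
unmatched_results = sorted(set(results_by_id) - runner_ids)
```

Reporting the unmatched keys on both sides is a simple check that the data sets were combined as intended.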
Data Visualizations
• Look at published visualizations
• Discuss ways to improve published
visualizations
• Specific visualizations created:
• Stream graphs
• Tree graphs
• Maps
Big Data
• Described “big data”
• Volume
• Velocity
• Variety
• Discussed computer science issues
• MapReduce
• Hadoop
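A toy, single-machine sketch of the MapReduce idea, written in Python for illustration. It only mimics the map, shuffle, and reduce steps on a word count; it does not use Hadoop, and the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one chunk of a large file.
documents = ["big data big ideas", "messy data"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run in sequence to show the data flow.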
Projects
• 3 Projects
• Chapter 2 of Data Science in R: A Case Studies
Approach to Computational Reasoning and
Problem Solving by Deborah Nolan and
Duncan Temple Lang
• Twitter project
• Group project
Project 1
• Introduce students to R
• 10 years of data from the Cherry
Blossom Road Race in DC
• Lots of data cleaning
• Introduced some visualization
issues with larger data sets
• Introduced the idea of smoothing
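To illustrate the idea of smoothing, here is a minimal Python sketch (the course itself used R). The slides do not say which smoother was used; a centered moving average is swapped in here as the simplest case, and the finishing times are made up.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed

# Hypothetical average finishing times (minutes) ordered by runner age.
times = [80.0, 90.0, 85.0, 95.0, 100.0]
smooth = moving_average(times, window=3)
```

A wider window gives a smoother curve at the cost of hiding local detail, which is the trade-off students meet when plotting larger data sets.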
Project 1
• Done in pairs
• Deliberately formed with one “stat” and one
“cs” student
• In-class work following the steps
given for the men’s data
• Written report due for women’s data
• Includes both code and statistical report
Project 2
• Download public tweets
• Filter for a query term
• Assign a sentiment score
• Aggregate tweets by state
• Produce geographic visualization of
data
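The middle three steps of this pipeline (filter, score, aggregate) can be sketched as follows, in Python for illustration (the course itself used R). The tweets, the word lists, and the scoring rule are hypothetical assumptions; downloading tweets and drawing the map are left out.

```python
from collections import defaultdict

# Hypothetical tweets, already downloaded and tagged with a state.
tweets = [
    {"text": "I love this coffee", "state": "IA"},
    {"text": "coffee was bad today", "state": "IA"},
    {"text": "great coffee, great day", "state": "NY"},
    {"text": "nice weather", "state": "NY"},
]

# Assumed toy lexicon; a real project would use a published sentiment word list.
POSITIVE = {"love", "great", "nice"}
NEGATIVE = {"bad", "awful"}

def sentiment(text):
    """Score = number of positive words minus number of negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mean_sentiment_by_state(tweets, query):
    """Keep tweets containing the query term, then average scores per state."""
    scores = defaultdict(list)
    for t in tweets:
        if query.lower() in t["text"].lower():
            scores[t["state"]].append(sentiment(t["text"]))
    return {state: sum(v) / len(v) for state, v in scores.items()}

by_state = mean_sentiment_by_state(tweets, "coffee")
```

The resulting state-to-score dictionary is the kind of table a geographic visualization would then color by.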
Project 2
• Again done in cs/stat pairs
• Final report
• Required an extension of the basic lab
• Required both code and statistical report
Project 3
• Term-long 4-person group project
• First week
• Individual brainstorming about topics
• Friday morning – elevator pitches
• Second week
• Teams find data and refine goals
• Friday morning – check-in report from all
teams – class feedback provided
Project 3
• Third week
• Lab time devoted to project
• Finish data cleaning and do much of the
analysis
• Friday morning – check-in report from all
teams – class feedback provided
Project 3
• Last 3 days of class
• Finishing touches on the analysis
• Create project website
• Final presentation to both class and other
visitors
Examples of group projects
Lessons Learned
• Slower introduction to R
• Small individual assignments as we
go
• More faculty input for statistical
analysis of group projects
For more information
Ann Cannon
Department of Math and Stat
Cornell College
600 First St SW
Mt. Vernon, IA 52314
(319) 895-4461
[email protected]