Introduction to Data Science Section 3

Introduction to Data Science
Section 3
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and
NCDS
Thomas M. Carsey
Big Data
• The launch of the Data Science conversation has
been sparked primarily by the so-called “Big
Data” revolution.
• As mentioned, we have always had data that
taxed our technical and computational capacities.
• “Big Data” makes front-page news, however,
because of the explosion of data about people.
• Contemporary definitions of Big Data focus on:
– Volume (the amount of data)
– Velocity (the speed of data in and out)
– Variety (the diverse types of data)
How Big is Big?
• 500 million Tweets sent per day (by 2013)
• 300 hours of video uploaded to YouTube every
minute.
– https://www.youtube.com/yt/press/statistics.html
• 1.44 billion Facebook users (April 2015)
• Internet Usage:
– http://www.internetlivestats.com/
So Much More
• Locational Tracking (smart cars, smart phones)
• Satellite images (Nightlight Project, parking lot
images, crop images)
• Internet of Things
– Smart energy grid; biochips in livestock; Fitbits;
predictive maintenance
Big Data
• Despite their linkage in many contemporary
discussions, Big Data ≠ Data Science.
– Data science principles apply to all data – big and
small.
– There is also the so-called “Long Tail” of data.
The Long Tail
[Figure: long-tail curve with regions labeled “Big Data” and “Most Data”]
Challenges of Big Data
• Big Data does present some unique challenges.
– Searching for average patterns may be better served
by sampling
– Searching for rare events might require big data
• Big haystacks (may) contain more needles.
– This raises a point about so-called outliers
• Rare or odd events might distort estimates of “average”
effects.
• However, rare events might also be exactly what you are
seeking to study
– Methods of outlier detection are crucial
• Note: this can mean looking for single outliers, pairs, or clusters
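The contrast between sampling for averages and needing big data for rare events can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the population size, the rare-event rate, and the sample size are all made-up assumptions, and `prob_at_least_one` is a hypothetical helper name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed, made-up numbers for illustration only.
N = 200_000              # size of the "big" data set
RARE_RATE = 1 / 10_000   # chance that any one record is a rare event
SAMPLE_SIZE = 2_000      # size of a simple random sample

# Simulate a value for every record and flag the rare ones.
values = [random.gauss(50.0, 10.0) for _ in range(N)]
is_rare = [random.random() < RARE_RATE for _ in range(N)]

# Draw a simple random sample of record indices (without replacement).
sample_idx = random.sample(range(N), SAMPLE_SIZE)

# Average patterns: the sample mean tracks the full-data mean closely.
full_mean = statistics.fmean(values)
sample_mean = statistics.fmean(values[i] for i in sample_idx)

# Rare events: the sample holds far fewer "needles" than the full haystack.
rare_in_full = sum(is_rare)
rare_in_sample = sum(is_rare[i] for i in sample_idx)


def prob_at_least_one(p, n):
    """Probability of seeing at least one event of rate p in n independent records."""
    return 1.0 - (1.0 - p) ** n


print(f"mean, full data: {full_mean:.2f}   mean, sample: {sample_mean:.2f}")
print(f"rare events, full data: {rare_in_full}   in sample: {rare_in_sample}")
print(f"P(at least one rare event), sample: {prob_at_least_one(RARE_RATE, SAMPLE_SIZE):.3f}")
print(f"P(at least one rare event), full:   {prob_at_least_one(RARE_RATE, N):.3f}")
```

Under these assumptions the sample is enough to estimate the average, while a sample of this size will often contain no rare events at all, which is the sense in which big haystacks may contain more needles.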
Challenges in the Long Tail
• Individual data sets are smaller.
• Aggregation could produce a whole greater
than the sum of its parts, but:
– Data sets might have similar measures, but use
slightly different measurement strategies,
metadata, etc.
• The DataBridge Project
• http://databridge.web.unc.edu/
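As a toy illustration of the aggregation problem, the sketch below pools two small data sets that measure the same concept in different units. Everything in it is hypothetical: the studies, the values, the `unit` metadata field, and the `harmonize` helper are assumptions made for this example and are not taken from the DataBridge Project.

```python
# Two hypothetical long-tail data sets recording height with different units.
study_a = {"unit": "cm", "heights": [170.0, 165.5, 180.2]}
study_b = {"unit": "in", "heights": [68.0, 70.5]}

# Conversion factors to a common unit (centimeters).
TO_CM = {"cm": 1.0, "in": 2.54}


def harmonize(study):
    """Convert one study's heights to centimeters using its unit metadata."""
    factor = TO_CM[study["unit"]]
    return [round(h * factor, 2) for h in study["heights"]]


# Only after harmonizing can the two studies be pooled meaningfully.
pooled_cm = harmonize(study_a) + harmonize(study_b)
print(pooled_cm)
```

The conversion itself is trivial; the point is that it depends on metadata (here, the unit) being recorded at all, which is often where small data sets differ.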
The Promise of Big Data
• There has been a lot of hype about Big Data.
• There is the belief among some that Big Data
will solve all sorts of social, economic, and
scientific problems.
• The “Truth” must be in there somewhere – we
just need to find it.
• We have big problems – Big Data can help us
solve them.
Hope or Hype?
• A Washington Post column by Samuel Arbesman
titled “Five myths about big data” (8-16-2013)
quoted the following tweet, offered as a
definition of Big Data.
– Big Data, n.: the belief that any sufficiently large
pile of shit contains a pony with probability
approaching 1 (by James Grimmelmann)
Even If True, What Kind of Pony?
Arbesman’s 5 Myths
• “Big Data” has a clear definition
• Big Data is new
• Big Data is revolutionary
• Bigger Data is better data
• Big Data means the end of scientific theories
Does Big = Good?
• Lost in most discussions of Big Data is whether it
is representative data or not.
– We can mine Twitter, but who tweets?
– We can mine health records, but whose records do
we have?
– We can track online purchasing, but what about offline market behavior?
• Survey research has spent decades worrying
about representativeness, weighting, etc., but I
do not see it discussed nearly as much in data
science.
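One standard survey-research response to an unrepresentative sample is post-stratification weighting. The sketch below is a minimal illustration with entirely made-up numbers: the age groups, the population shares, the sample, and the `poststratified_mean` helper are all assumptions for this example.

```python
from collections import Counter

# Hypothetical known population shares by age group (e.g., from a census).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Hypothetical sample of (age_group, supports_policy) pairs from a platform
# whose users skew young: 60 aged 18-29, 30 aged 30-49, 10 aged 50+.
sample = (
    [("18-29", 1)] * 45 + [("18-29", 0)] * 15
    + [("30-49", 1)] * 15 + [("30-49", 0)] * 15
    + [("50+", 1)] * 3 + [("50+", 0)] * 7
)


def poststratified_mean(sample, population_share):
    """Reweight each respondent by population share / sample share of their group."""
    counts = Counter(group for group, _ in sample)
    n = len(sample)
    weights = {g: population_share[g] / (counts[g] / n) for g in counts}
    weighted_sum = sum(weights[g] * y for g, y in sample)
    total_weight = sum(weights[g] for g, _ in sample)
    return weighted_sum / total_weight


raw_mean = sum(y for _, y in sample) / len(sample)
weighted_mean = poststratified_mean(sample, population_share)
print(f"raw estimate: {raw_mean:.2f}   weighted estimate: {weighted_mean:.2f}")
```

Weighting only corrects for the characteristics you weight on, and it cannot help with groups that are missing from the data entirely, so it is a partial answer to the question of who tweets.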
Theory, Methods, and Big Data
• The greatest need for theory and the greatest
challenges for computationally intensive methods
arise:
– When data is too small – there is not enough
information in the data by itself.
– When data is too big – the computational costs
become too high.
– In between, there is a “just right” amount of data that allows
complex models and computationally demanding methods to be used
so that theoretical assumptions can be relaxed.
One Example of Data Science
Data Science and Elections
• The Obama campaigns in 2008 and 2012 are credited with
successful use of social media and data mining.
• Micro-targeting in 2012
– http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
– http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-TargetingWon-the-2012-Election-for-Obama---Antony-Young-Mindshare-NorthAmerica.html
– Micro-profiles built from multiple sources and accessed by apps; real-time
updating of data based on door-to-door visits; focused media
buys; highly targeted e-mails and Facebook messages.
– 1 million people installed the Obama Facebook app that gave
access to info on “friends”.
Source: http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
Source: Nate Silver:
http://fivethirtyeight.com/features/senate-control-could-come-down-to-whole-foods-vs-cracker-barrel/
Big Data and Politics: Something Old, Something New . . .
• The massive data collection and micro-targeting of voters that defined 2012 are
both:
– New
• The amount and diversity of data mobilized for near-real-time
updating and analysis was unprecedented.
– Old
• It is a reversion to retail, door-to-door, personalized
politics.
– “All Politics is Local” – Tip O’Neill.