Transcript Big Data

What’s Going on in
Survey Research?
Lars Lyberg
Stockholm University
Frimis, November 11, 2015
A Changing Survey Landscape
•
•
•
•
•
•
Probability and nonprobability sampling
Total survey error
New technology
Big data
International surveys
Hard-to-survey populations
2
Probability Sample
Every object in the target population has a known
non-zero probability of being selected
• Very few samples in market, opinion and social
research live up to this definition
• Reasons include nonresponse, frame problems,
and special research goals
3
The Origins of Probability Sampling
• Introduced in 1934
• Basically a financial breakthrough
• Data collection was expensive
• To be able to say something about a population
based on a relatively small sample and a
margin of error to go with that was almost like
magic
4
A Couple of Giants
Sir Ronald Fisher
Jerzy Neyman
5
Problems
• It took a while for probability sampling to be
accepted
• The sampling theory did not handle other error
sources very well
• Basically the only “allowed” error source is
sampling
6
Issues Associated with Sampling
• Ridiculous response rates
• Increased demands for timely data
• Access to large volumes of (inexpensive) data
• Margins of error are understated
• Discussions about nonprobability sampling
• New less expensive ways of collecting data
• The advent of opt-in panels
• Proper inference not always possible
7
Examples of Statements
• Probability sampling is the only reasonable way
to achieve representativity
• Probability samples are not representative due
to nonresponse
• There is no theoretical foundation for opt-in
panels
• There are theories and methods based on
modeling and weighting
8
More Statements
• Studies show that probability sampling is more
accurate that nonprobability sampling
• Some of these comparisons are flawed since
weighting of the nonprobability samples has not
been sufficiently ambitious
• Even though results from opt-in panels might
be biased to some extent they come at a
fraction of the costs for a probability sample
and much quicker
9
The Current Situation
• Both probability and nonprobability sampling
have problems
• Bayesian inference gaining ground
• Lots of experimentation needed
• Quality criteria need to be defined
10
The Recent British Election
• Whilst the Conservatives won convincingly, 18% of
the campaign polls had suggested a dead heat and
a further 46% had suggested Labour leads.
• Of the 36% of polls that registered Conservative
leads, three out of four showed leads that were less
than half the actual outcome.
• Both probability sampling and panels failed.
• The British Polling Council has initiated an
investigation on why things went wrong.
11
Total Survey Error
Sampling
Error
Due to selecting
a sample instead of
the entire pop’n
Nonsampling
Error
Errors due to
mistakes or system
deficiencies
12
Risk of Bias and Variance by
Error Source
MSE Component
Sampling error
Specification error
Nonresponse error
Frame error
Measurement error
Data Processing error
Var
High
Low
Low
Low
High
High
Bias
Low
High
High
High
High
High
13
What to do about Total Survey
Error
• Minimize variances and biases through QA, QC,
QM, and best practices
• Estimate the size of the total error
• Apply risk management
14
New Technology
• Smartphones as a data collection mode
• Social media as an information source
• GPS
15
Big data is a term that describes data sets so large
and complex that they cannot be processed and
analyzed with conventional software systems.
Sources:
• Transaction databases
• Social media
• The Internet of Things
16
A Black Swan
A black swan is an undirected and unpredicted
event.
It is rare, has an extreme impact but in retrospect
we saw it coming
• Internet - yes
• 9/11 - yes
• The Lehman Brothers crash -
yes
• The advent of Big Data - ?
17
The Three V’s
•
Volume
• Tera- to Peta- to Exabytes of data, stored and
processed
•
Variability
• Structured, unstructured, text, images, maps,
multimedia
• Varying sources
•
Velocity
• Streaming data, from seconds to milliseconds
•
Veracity
• Can we trust Big Data? Can we use it? Proxies,
indicators
18
19
Big Data
Examples of Big Data with use or potential use in
statistics production
• Google searches (flu trends)
• Traffic camera data
• Retail scanner data
• Credit card and transaction data
• GPS data
20
Hype of Big Data
Gartner’s hype curve
Source: Wikipedia
21
Happiness and Well-being
The common survey question: How satisfied are
you with your life?
BD alternative
• 10 million tweets that are coded for happiness
(rainbow, love, beauty, hope, wonderful,
wine…) and non-happiness (damn, boo, ugly,
smoke, hate, lied,…)
• Happiest states: Hawaii, Utah, Idaho, Maine,
Washington
• Saddest states: Louisiana, Mississippi,
Maryland, Michigan, Delaware
22
Big Data Challenges
• Data quality
• Data analytics
• Confidentiality concerns
23
Mono Surveys vs 3MC Surveys
• 3MC=multinational, multregional and
•
•
•
•
multicultural contexts
One population vs more than one
population
In 3MC TSE or MSE as planning criteria
must be complemented by equivalence
or comparability
3MC surveys need to be designed with a
mixture of standardization and flexibility
to achieve operational equivalence
Implementation and control much more
demanding in 3MC surveys
24
Examples of 3MC Surveys
•
Gallup World Poll
(GWP)
Student assessment
(PISA)
•
European Statistical
System
•
European Social
Survey (ESS)
•
•
World values (WVS)
•
Health, ageing and
retirement (SHARE)
Marketing surveys on
customer satisfaction,
brand names,
attitudes, finances
etc
•
Pure entertainment
surveys
•
Adult literacy (IALS)
•
Adult skills (PIAAC)
•
•
Electoral systems
(CSES)
25
Some Special Features in a
3MC Survey Setting
• Comparability is the main goal
• Concepts must have a uniform meaning
• Risk management differs
• Financial and methodological resources
•
•
•
•
differ (3MC’s are expensive)
National and international interests are
in conflict
Scientific challenge
Administrative challenge
National pride is at stake
26
Response Rates in PIAAC, Cycle I (%)
•
Australia 71
•
Japan 50
•
Austria 53
•
Korea 75
•
Belgium 62
•
Netherlands 51
•
Canada 58
•
Norway 62
•
Cyprus 73
•
Poland 54
•
Czech Republic 66
•
Slovak Republic 66
•
Denmark 50
•
Spain 48
•
Estonia 63
•
Sweden 45
•
Finland 66
•
UK-England 59
•
Germany 55
•
UK-Northern Ireland 65
•
Ireland 72
•
USA 70
•
Italy 56
27
Challenges in 3MC Surveys
•
•
•
•
•
•
•
Design (what can vary, what is rigid)
Translation
Adaptation
Culturally different error structures
Data fabrication
Quality control
Often too many countries involved
28
Hard-to-survey Populations
(H2S)
Homeless
Prostitutes
Refugees
Victims
Persons with disabilities
Minorities
Illegal aliens
Rare (fans, musicians, language groups,
extremists)
• Mobile populations (nomads, migrants,
students)
•
•
•
•
•
•
•
•
29
Methodological Approaches to H2S
• Innovative sampling methods
• Venue-based (red light districts, voting
facilities)
• Indirect sampling
• Snowball and respondent driven
• Qualitative studies (anthropology etc)
• Formative research
30
The End of Theory
Faced with massive data, this
approach to science — hypothesize,
model, test — is becoming obsolete.
Petabytes allow us to say: ‘Correlation
is enough.’ We can stop looking for
models. We can analyze the data
without hypotheses about what it
might show. We can throw the
numbers into the biggest computing
clusters the world has ever seen and
let statistical algorithms find patterns
where science cannot.
Chris Anderson 2008
31
The Future of Surveys is Uncertain
Too many surveys, too much off-the-shelf tools
Active participation going down
Passive participation going up
Many problems are global
Decision makers need data fast and at low cost
The design-based approach needs refreshment
Decision makers need data from different
sources
• The big survey institutes are worried
•
•
•
•
•
•
•
32
Endnote
• Our industry needs innovations and less
•
•
•
•
fighting
We need to merge with other research cultures
We need to know more about combining data
sources
We need to account for all major sources of
uncertainty that is associated with data
collection and analysis of data
We need to develop new theories for handling
error structures, combining data sources, and
reaching equivalence
33
Over and Out
34