Perspectives on Big Data for Business Statistics


What Issues Do Big Data Present for Business Education?
Bob Andrews
(Virginia Commonwealth University)
Relevant Statements for Big Data Statistics
Analytics “describes any use of data and statistical analysis to drive business decisions from data whether the purpose is predictive or simply descriptive.”
Sharpe, DeVeaux & Velleman, 3rd edition
“Data mining refers to extracting useful knowledge from what may otherwise appear to be an overwhelming amount of noisy data.”
Stine & Foster
Definition of Statistics
“Statistics is the science and art of extracting answers from data. Some of the answers do require numbers and formulas, but you can also do every statistical analysis with pictures: graphs and tables. … in this course you’ll learn how to use statistics to interpret data and answer interesting questions.”
Stine & Foster
Definition of Statistics
"Statistics
is a way of reasoning, along with a
collection of tools and methods, designed to help us
understand the world."
Sharpe, DeVeaux & Velleman 2nd edition
Statistics is the science of uncertainty. (Andrews)
The Signal and the Noise by Nate Silver
“Finding patterns is easy in any kind of data-rich environment; … The key is in determining whether the patterns represent noise or signal.” (pg. 240)
“… sampling error does not always tell the whole story …” (pg. 252)
“If you’re using a biased instrument, it doesn’t matter how many measurements you take - you’re aiming at the wrong target.” (pg. 253)
“… the era of Big Data only seems to be worsening the problems of false positive findings in the research literature.” (pg. 253)
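To make the false-positive point concrete, here is a minimal simulation sketch (not from the talk; the sample sizes, predictor count, and use of scipy are my own assumptions): every variable is pure noise, yet screening at p < 0.05 still flags a handful of predictors as “significant.”

```python
# Minimal sketch: the outcome and all 200 candidate predictors are pure noise,
# yet conventional p < 0.05 screening still flags several "significant"
# relationships. Sizes and names are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_predictors = 500, 200

outcome = rng.normal(size=n_obs)                     # noise "response"
predictors = rng.normal(size=(n_obs, n_predictors))  # noise "features"

false_positives = 0
for j in range(n_predictors):
    r, p_value = stats.pearsonr(predictors[:, j], outcome)
    if p_value < 0.05:
        false_positives += 1

# Roughly 5% of the tests (about 10 here) will look "significant" by chance alone.
print(f"{false_positives} of {n_predictors} noise predictors passed p < 0.05")
```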
The Signal and the Noise by Nate Silver
“Essentially, the frequentist approach toward statistics seeks to wash its hands of the reason that predictions most often go wrong: human error. It views uncertainty as something intrinsic to the experiment rather than something intrinsic to our ability to understand the real world. The frequentist method also implies that, as you collect more data, your error will eventually approach zero; this will be necessary and sufficient to solve any problem.” (pg. 253)
The Signal and the Noise by Nate Silver
“Fisher’s notion of statistical significance, which uses arbitrary cutoffs devoid of context to determine what is a ‘significant’ finding and what isn’t, is much too clumsy …” (pg. 256)
“… some professions have considered banning Fisher’s hypothesis test from their journals.” (pg. 260)
“The Null Hypothesis Testing Controversy in Psychology,” JASA, December 1999, by David H. Krantz
The New Statistics
Geoff Cumming http://pss.sagepub.com/content/25/1/7
“in response to renewed recognition of the severe flaws of null-hypothesis significance testing (NHST), we need to shift from reliance on NHST to estimation and other preferred techniques.”
“The new statistics refers to recommended practices, including estimation based on effect sizes, confidence intervals, and meta-analysis.”
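As an illustration of the estimation emphasis Cumming describes, the sketch below (my own example, with simulated groups and invented numbers) reports a difference in means with Cohen’s d and a 95% confidence interval rather than a bare reject/fail-to-reject decision.

```python
# Minimal sketch of "new statistics" reporting: effect size plus confidence
# interval instead of a yes/no significance verdict. The two groups are
# simulated; all numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=15.0, size=60)   # e.g., control
group_b = rng.normal(loc=106.0, scale=15.0, size=60)   # e.g., treatment

diff = group_b.mean() - group_a.mean()
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# 95% CI for the difference in means, standard pooled two-sample t interval
n_a, n_b = len(group_a), len(group_b)
se_diff = pooled_sd * np.sqrt(1 / n_a + 1 / n_b)
t_crit = stats.t.ppf(0.975, n_a + n_b - 2)
ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff

print(f"difference = {diff:.1f}, Cohen's d = {cohens_d:.2f}, "
      f"95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```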
The Signal and the Noise by Nate Silver
“The goal of any predictive model is to capture as much signal as possible and as little noise as possible. Striking the right balance is not always so easy, and our ability to do so will be dictated by the strength of the theory and the quality and quantity of the data. In economic forecasting, the data is very poor and the theory is weak, hence Armstrong’s argument that the more complex you make the model the worse the forecast gets.” (pg. 388) (He is referring to Scott Armstrong of the Wharton School of U. Penn.)
“What matters most, as always, is how well the predictions do in the real world.”
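Armstrong’s warning about complexity can be illustrated with a small overfitting experiment; the polynomial degrees, noise level, and use of NumPy below are my own illustrative assumptions, not anything from Silver or Armstrong.

```python
# Minimal sketch: fit polynomials of increasing degree to noisy data generated
# from a simple straight line, then compare error on held-out points. Beyond a
# modest degree, the extra complexity fits noise and out-of-sample error grows.
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 10, 25)
x_test = np.linspace(0, 10, 100)
true_signal = lambda x: 2.0 + 0.5 * x   # a simple, weak "theory"
y_train = true_signal(x_train) + rng.normal(scale=2.0, size=x_train.size)
y_test = true_signal(x_test) + rng.normal(scale=2.0, size=x_test.size)

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)    # more complex model
    pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))    # out-of-sample error
    print(f"degree {degree:2d}: test RMSE = {rmse:.2f}")
```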
Wikipedia definition of Big Data (2-19-2014)
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
These are NOT issues to be addressed in introductory statistics.
A Statistician’s Definition of Big Data
Michael Horrigan from the Bureau of Labor Statistics sees “Big Data as nonsampled data, characterized by the creation of databases from electronic sources whose primary purpose is something other than statistical inference.”
Horrigan, Michael W., “Big Data: A Perspective from the BLS,” Amstatnews, Issue #427 (January 2013), pp. 25-27.
The V’s of Big Data
3 V’s: Volume, Velocity & Variety
4th V: Veracity, Validity or Verification
5th V: Value
Sources of Uncertainty/Variation
1. Standard Error of the Statistic
2. Uncertainty surrounding the veracity of the data used to calculate the Standard Error
For Big Data the second source of uncertainty becomes much more important.
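A minimal sketch of the distinction, assuming an instrument that systematically reads high (the true mean, bias, and sample sizes are invented for illustration): as n grows, the standard error (source 1) shrinks toward zero, but the error caused by the veracity problem (source 2) does not.

```python
# Minimal sketch: with more data the standard error of the mean vanishes, but a
# fixed measurement bias (a data-veracity problem) stays in the estimate.
import numpy as np

rng = np.random.default_rng(3)
true_mean, bias, sd = 50.0, 2.0, 10.0   # instrument reads 2 units too high

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=true_mean + bias, scale=sd, size=n)
    std_error = sample.std(ddof=1) / np.sqrt(n)      # source 1: shrinks with n
    error_of_estimate = sample.mean() - true_mean    # dominated by source 2: the bias
    print(f"n={n:>9,}: SE = {std_error:.3f}, estimate - truth = {error_of_estimate:.3f}")
```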
What is Driving Big Data Use?
What is the value from Big Data?
It’s Better Business Decision Making
Consider these titles of Tom Davenport’s books
Competing on Analytics: The New Science of Winning
Analytics at Work: Smarter Decisions, Better Results
Big Data is about Decision Making
Big Data is not about Hypothesis Testing.
Statistics Instruction for Big Data should have more emphasis on:
Statistical Thinking rather than statistical mechanics.
Graphing and Visualization to effectively communicate the data’s story (see the sketch after this list).
Decision Making rather than hypothesis testing.
Determining the veracity/validity of the data for making decisions about the phenomenon of interest (understanding the implications of data being obtained over time).
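One hedged sketch of the “data obtained over time” point, using a simulated daily series and matplotlib (both my own assumptions, not material from the talk): plotting the series with a rolling mean makes a mid-year level shift visible before the data are used to drive a decision.

```python
# Minimal sketch: visualize data collected over time before trusting it.
# A 30-day rolling mean makes the simulated level shift at day 250 obvious.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
days = np.arange(365)
values = 200 + rng.normal(scale=15, size=days.size)
values[250:] += 30                     # the process changed partway through the year

window = 30
rolling_mean = np.convolve(values, np.ones(window) / window, mode="valid")

plt.plot(days, values, alpha=0.4, label="daily value")
plt.plot(days[window - 1:], rolling_mean, label="30-day rolling mean")
plt.xlabel("day")
plt.ylabel("value")
plt.title("Is the process that generated the data stable over time?")
plt.legend()
plt.savefig("data_over_time.png")
```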