Tweaking Intro Stats in the Age of n = All
Glenn Miller
Borough of Manhattan Community College
AMATYC Conference
New Orleans
November 20, 2015
your name
Big Data and GAISE
• Emphasize statistical literacy and
statistical thinking
• Use real data
• Stress conceptual understanding
• Foster active learning
• Use technology (to develop
understanding and to analyse data)
• Use assessments to improve and
evaluate student learning
My focus:
• My Introduction to Statistics course is focused on the Central Limit Theorem
• What mathematical content is relevant for N = All?
Goodness-of-fit?
Experimental design
Replication & Bootstrapping
Data mining
Various test statistics
Math modelling
Deterministic, axiomatic methods
What is Big Data?
• Targeted marketing (Netflix, Amazon,
Google ads, political campaigns)
• Spam filters: log form of Bayes' Theorem
• Health care: large-scale factor analysis or
cluster analysis on unstructured data
• LinkedIn: graph theory
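The "log form of Bayes' Theorem" mentioned for spam filters can be illustrated with a minimal sketch. It assumes a naive (conditionally independent) word model. The word likelihoods, the prior, and the function names `spam_log_odds` and `spam_probability` are invented for illustration and do not come from the talk.

```python
import math

# Hypothetical per-word likelihoods: (P(word | spam), P(word | ham)).
# These numbers are made up for illustration only.
LIKELIHOODS = {
    "free":    (0.30, 0.02),
    "meeting": (0.01, 0.20),
    "winner":  (0.10, 0.001),
}

def spam_log_odds(words, prior_spam=0.5):
    """Return log[P(spam | words) / P(ham | words)] under a naive
    conditional-independence assumption."""
    # Start from the prior log-odds.
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for w in words:
        if w in LIKELIHOODS:  # unknown words are ignored in this sketch
            p_spam, p_ham = LIKELIHOODS[w]
            # Each observed word adds its log-likelihood ratio.
            log_odds += math.log(p_spam / p_ham)
    return log_odds

def spam_probability(words, prior_spam=0.5):
    """Convert the log-odds back to a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-spam_log_odds(words, prior_spam)))
```

Working on the log scale turns the product of many small probabilities into a sum, which is why this form suits messages with many words.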
Issues in Analysing Big Data
• Correlation is not causation, but it is
. . . $$$ profitable
• The Four V’s: Velocity, Variability,
Veracity and Volume
• Finding the signal in the noise of
unstructured data
• Apophenia: seeing patterns where
none actually exist. Replication?
Issues in Analysing Big Data
• Outliers: Dirty data or a Black
Swan?
• Smoothing: we do not want a model to
overfit the data
• N = All of what? Use of Twitter data
Tweak #1: Math Modelling
• Broader range of non-linear and non-normal models
• Emphasize usefulness of prediction (R²?)
rather than the confidence intervals
around slope and intercept (measures of
how well the sample statistics estimate
the parameters)
• “Model assessment remains critical,
while statistical significance is less
central” ASA Curriculum Guidelines
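As a rough sketch of judging a model by predictive usefulness rather than by intervals around its coefficients, the following compares R² for a straight-line model and for a simple non-linear one. The data are a toy example invented here, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, predictions):
    """Share of the variation in ys accounted for by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data (invented): the response grows with the square of x.
xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

# Model A: straight line in x.
slope_a, intercept_a = fit_line(xs, ys)
r2_linear = r_squared(ys, [intercept_a + slope_a * x for x in xs])

# Model B: straight line in the transformed predictor x squared.
squares = [x ** 2 for x in xs]
slope_b, intercept_b = fit_line(squares, ys)
r2_quadratic = r_squared(ys, [intercept_b + slope_b * s for s in squares])

print(f"R^2, linear model:    {r2_linear:.3f}")
print(f"R^2, quadratic model: {r2_quadratic:.3f}")
```

Both values here are computed on the same data used for fitting; assessing prediction on held-out data would be the natural next step in a course.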
Goodness-of-fit?
• Is it normal?? (Normal approximation to
the binomial)
Goodness-of-fit?
• Is it mathematics? Normal approximation
to the binomial
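One way to let students test the normal approximation to the binomial directly is to compare the exact cumulative probability with the approximation. The sketch below uses only the standard library and a continuity correction; the function names and the chosen values of n, p, and k are illustrative assumptions.

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """P(Y <= x) for Y ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def normal_approx_binom_cdf(k, n, p):
    """Normal approximation to P(X <= k) with a continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return normal_cdf(k + 0.5, mu, sigma)

# Compare a case where the approximation is usually considered safe
# (n*p and n*(1-p) both large) with one where it is not.
for n, p, k in [(100, 0.5, 55), (10, 0.05, 0)]:
    exact = binom_cdf(k, n, p)
    approx = normal_approx_binom_cdf(k, n, p)
    print(f"n={n}, p={p}, k={k}: exact={exact:.4f}, normal={approx:.4f}")
```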
Goodness-of-fit?
• Is it normal?? Return on S&P
[Figure: "S&P Daily Ln Return"; horizontal axis: daily log return, -0.23 to 0.106; vertical axis: 0 to 0.14]
Goodness-of-fit?
• Is it normal?? Return on S&P
[Figure: "S&P vs Normal"; series: S&P, Normal; horizontal axis: daily log return, -0.23 to 0.106; vertical axis: 0 to 0.14; annotation: October 19, 1987]
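To make the mismatch concrete, here is a small sketch of how improbable a daily log return of about -0.23 (the leftmost value on the chart's axis) would be under a normal model. The mean of 0 and the daily standard deviation of 0.01 are assumptions chosen for illustration, not estimates from the data behind the chart.

```python
import math

def normal_lower_tail(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma).

    Uses the complementary error function, which stays accurate far out
    in the tail where a naive 1 - cdf would round to zero."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Assumed parameters (illustrative, not fitted to the S&P data).
ASSUMED_DAILY_MEAN = 0.0
ASSUMED_DAILY_SD = 0.01
EXTREME_LOG_RETURN = -0.23  # leftmost value on the chart's axis

p = normal_lower_tail(EXTREME_LOG_RETURN, ASSUMED_DAILY_MEAN, ASSUMED_DAILY_SD)
z = (EXTREME_LOG_RETURN - ASSUMED_DAILY_MEAN) / ASSUMED_DAILY_SD
print(f"z-score under the assumed model: {z:.0f}")
print(f"normal-model probability of a day this bad or worse: {p:.3e}")
```

Under these assumptions the normal model treats such a day as essentially impossible, which is the sense in which the fitted normal curve understates the tails.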
Goodness-of-fit?
• Is it Pareto?? (Zipf’s Law)
[Figure: "Market Cap of Retail, Apparel, and Discount Stores ($ millions)"; axis label: Relative frequency]
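A quick classroom check for Zipf-like behaviour is to regress log(size) on log(rank) and see whether the slope is near -1. The sketch below does this on synthetic values constructed to follow Zipf's law exactly. The data and the function name `loglog_slope` are hypothetical and are not the market-cap figures from the slide.

```python
import math

def loglog_slope(values):
    """Least-squares slope of log(value) on log(rank), where rank 1 is
    assigned to the largest value."""
    ordered = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(v) for v in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic "market caps" (invented): size proportional to 1 / rank.
zipf_caps = [100000 / rank for rank in range(1, 51)]
slope = loglog_slope(zipf_caps)
print(f"log-log slope: {slope:.3f}")
```

A straight line on a log-log rank plot is only an informal diagnostic; it suggests a power law but does not by itself establish one.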
Tweak #2: De-emphasize experimental design
• With less emphasis on inference,
sampling techniques are less
important (and not mathematical)
• Emphasize type of data in that part
of the course
Tweak #3: The Project
• Have students choose their own
data set from an online source
• Allow the research question to be
chosen after the selection of data set
(global warming, education
research,…)
Tweak #4: Deeper not Broader
• Which hypothesis tests matter?
Again, something has to go . . .
• Confidence interval vs hypothesis
test
Tweak #5: Rethink the Role of Probability
• Probability’s role in inference?
• Conditional probability and Bayes’
Theorem
• Independence and dependence as
information theory
• This is where N=All has always been
in our course!
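One way to read "independence and dependence as information theory" is through mutual information, which is zero exactly when two variables are independent. The following is a minimal sketch on two invented 2×2 joint probability tables; the function name `mutual_information` and the tables are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """Mutual information, in bits, of a joint probability table
    given as a list of rows that sum to 1 overall."""
    row_totals = [sum(row) for row in joint]
    col_totals = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # cells with probability 0 contribute nothing
                mi += p_xy * math.log2(p_xy / (row_totals[i] * col_totals[j]))
    return mi

# Invented tables. The first is the product of marginals (0.4, 0.6)
# and (0.3, 0.7), so the variables are independent.
independent = [[0.12, 0.28],
               [0.18, 0.42]]
# In the second, knowing one variable determines the other.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

print(f"independent table: {mutual_information(independent):.6f} bits")
print(f"dependent table:   {mutual_information(dependent):.6f} bits")
```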
From ASA Webinar by Nicholas Horton:
Bibliography
Gould, R. (2015). Intro Stats and GAISE in the Age of Big Data. AMATYC Webinar, August 17, 2015.
Available at http://www.amatyc.org/?page=Webinars
Hand, D. J. (1999). Statistics and data mining: intersecting disciplines. ACM SIGKDD Explorations
Newsletter, 1(1), 16-19.
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2005). The elements of statistical learning: data
mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83-85.
Mayer-Schönberger, Viktor, and Kenneth Cukier. Big data: A revolution that will transform how we live,
work, and think. Houghton Mifflin Harcourt, 2013.
Miller, G. (2013). Implementing GAISE at Community Colleges: Benefits and Challenges. MathAMATYC
Educator, 5(1), 9-12.
Rudder, C. (2014). Dataclysm: Who We Are (When We Think No One's Looking). Random House
Incorporated.
Various presenters, Mathematics in Data Science Conference, ICERM Institute for Computational and
Experimental Research in Mathematics Conference, Brown University, July 28-30, 2015.
https://icerm.brown.edu/topical_workshops/tw15-6-mds/
American Statistical Association Undergraduate Guidelines Workgroup. 2014. 2014 curriculum guidelines for
undergraduate programs in statistical science. Alexandria, VA: American Statistical Association.
Mandelbrot, Benoit, and Richard L. Hudson. The Misbehavior of Markets: A fractal view of financial
turbulence. Basic Books, 2014.