After Dinner Talk at SLDS Meeting, 2016


Some Musings on Life
& “Data Science”
Statistical Learning and Data Science
Friday Center, UNC, Chapel Hill
J. S. Marron
Dept. of Statistics and Operations Research
University of North Carolina
Some Views of Statistics
Statistics
$\bar{X} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{n}}$
$H_0: \mu_1 = \mu_2$
$EY = X\beta$
Most People
Some Views of Statistics
Statistics
Bootstrap
HDLSS
$\bar{X} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{n}}$
Bayes
$H_0: \mu_1 = \mu_2$
$EY = X\beta$
Kernels
Survival Analysis
Sparsity
Functional Data
Machine Learning
MCMC
Time Series
Mixed Models
Etc. Etc. Etc. …
Reality
Some Views of Statistics
Statistics
Statistics in Science
Some Views of Statistics
Medicine
Biology
Agriculture
Physics
Statistics
Geology
Economics
Psychology
Statistics in Science
Some Views of Statistics
John Tukey Quote:
From:
http://www.morris.umn.edu/~sungurea/introstat/history/w98
Statistics in Science
Some Views of Statistics
John Tukey Quote:
“The best thing about being a statistician
is that you get to play in everyone's
backyard”
From: http://www.york.ac.uk/depts/maths/histstat/tukey_nytimes.htm
Statistics in Science
Some Views of Statistics
Words coined by John Tukey:
• Bit (0 – 1 data unit)
• Software
(mention to Computer Science friends…)
Some Views of Statistics
Another Prescient Statistician:
Bill Cleveland
Coined the Term “Data Science”
Cleveland, W. S. (2001). Data science: an action plan
for expanding the technical areas of the field of
statistics. International Statistical Review.
Some Views of Statistics
Statistics
$\bar{X} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{n}}$
$H_0: \mu_1 = \mu_2$
$EY = X\beta$
Most People
Some Views of Statistics
“Data Science (Analytics)”
• Computer Science
• Math (Applied)
• Bus. / Finance
• Others (Info. Sci., Psych, …)
Statistics
$\bar{X} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{n}}$
$H_0: \mu_1 = \mu_2$
$EY = X\beta$
Caution: ∃ a desire to replace old ideas with exciting new ones
Some Views of Statistics
What is (should be) the relationship?
Statistics
Data Science
Machine Learning
…
(Cleveland View)
Some Views of Statistics
What is (should be) the relationship?
Data Science
Machine Learning
…
Statistics
The Big Question
What are the Boundaries of Statistics?
NSF/DMS Program Director (late 2004):
“That is not statistics”
The Big Question
What are the Boundaries of Statistics?
OK, then where are they?
We should discuss this much more…
Openly, not in the “Rejection Process
(Publications, Grants, etc.)”
Variation
Thoughts From Business Statistics Course
Variation
A Fundamental Concept:
• Sounds Obvious
• Easy to Not Consider (Forget)
{Surprisingly So}
Variation
• Easy to Not Consider (Forget)
E.g. An Explorer Drowned in a Lake That
Averaged 6 Inches in Depth…
o Hard to visualize?
Thanks to N. I. Fisher
Variation
• Easy to Not Consider (Forget)
E.g. An Explorer Drowned in a Lake That
Averaged 6 Inches in Depth…
o Hard to visualize?
Lake Eyre, Australia, from Wikipedia
Variation
• Easy to Not Consider (Forget)
E.g. An Explorer Drowned in a Lake That
Averaged 6 Inches in Depth…
o Hard to visualize?
Lake Eyre, Australia, from
www.airadventure.com.au
Variation
• Easy to Not Consider (Forget)
E.g. An Explorer Drowned in a Lake That
Averaged 6 Inches in Depth…
o Hard to visualize?
o Key is Variation About “Average”
o Simple Idea Takes a Minute to Recall
(happens a lot)
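A quick numeric sketch of the lake example, using invented depth soundings (the numbers below are hypothetical, chosen only so that the average comes out to 6 inches; they are not measurements of Lake Eyre):

```python
from statistics import mean

# Hypothetical depth soundings in inches, invented for illustration only.
depths_in = [1, 2, 2, 3, 3, 4, 4, 5, 6, 30]


def summarize(depths):
    """Return the average, shallowest and deepest reading."""
    return {"mean": mean(depths), "min": min(depths), "max": max(depths)}


summary = summarize(depths_in)
print(summary)
```

The average alone says "6 inches"; the spread around it (here one 30-inch hole) is what matters to the explorer.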
Variation
A Fundamental Concept:
• Sounds Obvious
U.S. Presidential
Politics ?!?
Common Gross Oversimplification:
Group of people:
Political, Religious,
Ethnic Origin, …
They are going to …
They all want to …
Variation
Homework C0.1
Find an Example of Ignoring Variation.
Send me an email with the text and its attribution.
Plan to discuss in class.
Variation
Homework C0.1
Results:
Out of First 10 Quotes
9 Were From
Donald Trump
Ideas on Human Relationships
Common Question:
“How Are Dep’t Politics Going?”
Background:
• Long Dubious History
• Merger of Statistics & OR
(More Diverse Interests)
• Rapidly Changing University
Ideas on Human Relationships
Response:
“Best I’ve Seen in Chapel Hill”
Reason:
Respect
• Key to Current Interactions
• Moved Beyond “Politics of Disrespect”
Ideas on Human Relationships
Fundamental Observation:
Human Interactions Work Best In An
Atmosphere of Respect
• Day to Day Interactions w/ Colleagues
• Reviews of Papers / Grant Proposals
• US Congress
• US Presidential Politics…
Special Thanks
UNC, Stat & OR
Department of Statistics and Applied Prob.
National University of Singapore
For Many Discussions

This Talk
BIG DATA Models & Concepts
UNC, Stat & OR
Challenge from the Recent Media:
Mayer-Schönberger and Cukier (2014)
“Big Data: A Revolution That Will Transform
How We Live, Work, and Think”
BIG DATA Models & Concepts
Challenge from the Recent Media:
Mayer-Schönberger and Cukier (2014)
Major Premise:
Differing Data Analytic Goals
“Correlational” vs. “Causal”
BIG DATA Models & Concepts
“Causal” Data Analysis:
• Goal: Underlying Causes of Phenomena
• Approach: Classical “Scientific Method”
  o Formulate Hypothesis
  o Collect Data
  o Test Hypothesis
• Consequences:
Solid Knowledge w/ Measurable Certainty
BIG DATA Models & Concepts
“Correlational” Data Analysis:
• Goal: Find (and Use) Mere Correlations
• Motivation: Correlations are
  o Useful (e.g. ___ Recognition Software)
  o Valuable (Buying and Selling of Data…)
  o Insightful????
• Consequences:
Automatic Solutions to Some Hard Problems
Correlation vs. Causation
How New Is This Discussion?
Correlation vs. Causation
How New Is This Discussion?
Naïve Readers [of Mayer-Schönberger and Cukier (2014)]:
This is Exciting!!!
Great New Ideas!!!
Change Statistics Curricula!!!
Start Up “Data Analytics”!!!
Correlation vs. Causation
How New Is This Discussion?
Statistics
Time
Correlation vs. Causation
How New Is This Discussion?
Time
Statistics
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Correlation vs. Causation
How New Is This Discussion?
Time
Statistics
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
???
Correlation vs. Causation
How New Is This Discussion?
Time
Statistics
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Big Data – Data Science
A Small Aside
A Personal Apology to
Xiaotong Shen
For My Skepticism About
ASA Section on Data Mining
My (Wrong) Idea: Name Would Change,
So Not Appropriate as “Section”
{Great to See Recent Name Change}
Correlation vs. Causation
How New Is This Discussion?
Time
Statistics
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Big Data – Data Science
Correlation vs. Causation
How New Is This Discussion?
Some Came
With Major
New Ideas
Pattern Recognition
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Big Data
Correlation vs. Causation
How New Is This Discussion?
Pattern Recognition
Less So For
Others, But
More Focus
On
Artificial Intelligence
Neural Networks
Data Mining
Machine Learning
Big Data
Correlation vs. Causation
How New Is This Discussion?
Data Mining
Great Correlational Discovery
Correlation vs. Causation
How New Is This Discussion?
Data Mining
Great Correlational Discovery:
Super Market Scanner Data
Baby Diapers (aka Nappies) & Beer
Correlation vs. Causation
How New Is This Discussion?
Data Mining
Baby Diapers (aka Nappies) & Beer
Some Perspective:

Correlational Discovery

Makes Causational Sense
(Too Soon To Totally Dump Causation)
Correlation vs. Causation
Relative Emphasis???
Correlation vs. Causation
Relative Emphasis???
Classical Statistics:
Correlation
vs.
Causation
Correlation vs. Causation
Relative Emphasis???
Mayer-Schönberger and Cukier:
Correlation
vs.
Causation
Correlation vs. Causation
Relative Emphasis???
Suggested Actual Future Course:
Correlation & Causation
Correlation vs. Causation
Relative Emphasis???
Suggested Actual Future Course:
Correlation & Causation
Note: Changes Are Needed in Curricula, Etc.
The Big Question
What are the Boundaries of Statistics?
NSF/DMS Program Director (late 2004):
“That is not statistics”
The Big Question
What are the Boundaries of Statistics?
We Should Openly Discuss Much More…
Statistics
Data Science
Machine Learning
…
Data Science
OR
Machine Learning
…
Statistics
The Big Question
What are the Boundaries of Statistics?
We Should Openly Discuss Much More…
How Much Leadership Should We Take?
Let’s Embrace Our Wide Diversity
of Opinions on This Point
Challenges for You
• Lead Statistics (D. S.) into the Future
• Promote Increasing Breadth
• Embrace New Ideas
• Advocate Them While Reviewing
• Speak Up Serving On Panels
• Openly Discuss Boundaries