The Emergence of Data Science: Why Now?
Ike Nassi
(With contributions from Andrew McAfee, MIT Sloan)
17-Oct 2013
BSOE Research Day
What this talk is all about
Convince you that
There is a need
We have some tools
We need new approaches
We can’t do it all ourselves
Evidence-based decision making is important
And it needs more attention
It will happen anyway
Outline
Societal
Economic
Technological
A Short Story – Point of View
1984
1984
Configuration = 0
Configuration ≠ 0
The Future: Hard to Predict Accurately
iWatch?
Skynet?
Changes happen faster than we think!
How well can experts predict?
2012 Political Campaign
slide by Andrew McAfee (MIT)
“Bottom line: Romney 315, Obama 223. That sounds high for Romney. But he could drop Pennsylvania and Wisconsin and still win the election. Fundamentals.”
Michael Barone, “Going out on a limb: Romney beats Obama, handily (315 to 223),” The Washington Examiner, 11/2/12
What about the experts?
slide by Andrew McAfee (MIT)
A Meta-Study Scorecard
slide by Andrew McAfee (MIT)
136 studies of expert vs. algorithmic prediction
Experts Clearly Better: 8 (6%)
Tossup: 65 (48%)
Algorithm Clearly Better: 63 (46%)
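The percentages on the scorecard can be checked from the raw counts. The sketch below assumes the counts pair with the categories in the order they are listed on the slide (Tossup = 65, Algorithm Clearly Better = 63); the names `scorecard` and `shares` are illustrative and not from the talk.

```python
# Counts as read from the scorecard slide (136 studies in total).
# Assumption: the counts pair with the categories in the order listed.
scorecard = {
    "Experts clearly better": 8,
    "Tossup": 65,
    "Algorithm clearly better": 63,
}


def shares(counts):
    """Return each category's share of the total as a whole-number percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_studies = sum(scorecard.values())
percentages = shares(scorecard)

# Share of studies in which the algorithm did at least as well as the experts.
at_least_as_good = round(
    100 * (scorecard["Tossup"] + scorecard["Algorithm clearly better"]) / total_studies
)

for name, pct in percentages.items():
    print(f"{name}: {scorecard[name]} ({pct}%)")
print(f"Algorithm at least as good: {at_least_as_good}%")
```

Read this way, the algorithm matched or beat the experts in roughly 94% of the studies.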
The Digital Frontier Keeps Expanding
(slide contributed by Andrew McAfee, MIT)
Source: “Building Watson: It’s not so elementary, my dear” – W. Shih. HBS case #9-612-017
Ken Jennings
(slide contributed by Andrew McAfee, MIT)
Why is Data Science happening now?
We can collect “Big Data”
slide by Andrew McAfee (MIT)
Big Data
slide by Andrew McAfee (MIT)
What can Economics tell us?
We are collecting a lot more data, but…
We are facing a rapidly changing economic landscape
And we are not very good at controlling the economy
Who is going to analyze it?
Capital vs. Labor
slide by Andrew McAfee (MIT)
[Figure: Corporate Profits After Tax & Non-Farm Labor Share, 1947-2012. Left axis: Corporate Profits ($Billions). Right axis: Labor Share (2005 = 100). Series: Corporate Profit, Labor Share.]
Source: Federal Reserve Bank of St. Louis, Economic Research
Recent Trends
slide by Andrew McAfee (MIT)
[Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Y-axis: Level of GDP, Profits, and Investment (Jan-95 = 100). Series: GDP, Corporate Investments, All Profits After Tax, Non-Financial Profits After Tax. Shaded areas indicate recessions.]
Recent Trends
slide by Andrew McAfee (MIT)
[Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Left axis: Level of GDP, Profits, and Investment (Jan-95 = 100). Right axis: Employment/Population Ratio. Series: GDP, Corporate Investments, All Profits After Tax, Non-Financial Profits After Tax, Employment to Population Ratio. Shaded areas indicate recessions.]
Skill Disparities
slide by Andrew McAfee (MIT)
[Figure: Changes in Wages for Full-Time, Full-Year Male U.S. Workers, 1963-2008. Y-axis: Composition-Adjusted Real Log Weekly Wages. Series: Graduate School, College Graduate, Some College, High School Graduate, High School Dropout.]
Source: http://econ-www.mit.edu/~dautor/hole-vol4/figs/fig-04.zip
Superstars
[Figure: U.S. Top 0.01% Income Share, 1913-2010. Y-axis: Income Share.]
Source: http://emlab.berkeley.edu/users/saez/piketty-saezOUP04US.pdf
How to effect change
Make the experts more effective
Proactive and Reactive Approaches
Collect data, predict, act (proactive)
E.g. Evidence-based medicine
Build systems that collect data, create feedback loops (reactive)
E.g. Human body
Both are needed
Analysis
Proactive
Reactive
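The difference between the two approaches can be sketched in a few lines. This is only an illustration under simple assumptions: the proactive side is a naive trend forecast, and the reactive side is a proportional feedback loop; the function names, the gain, and the numbers are hypothetical and not from the talk.

```python
def proactive_forecast(history):
    """Proactive: predict the next value from collected data, then act on it.

    Uses the last value plus the average past change (a deliberately naive model).
    """
    steps = [b - a for a, b in zip(history, history[1:])]
    return history[-1] + sum(steps) / len(steps)


def reactive_control(target, start, gain, rounds):
    """Reactive: measure the current error and correct a fraction of it each round."""
    value = start
    trace = [value]
    for _ in range(rounds):
        error = target - value   # feedback: compare the observation with the goal
        value += gain * error    # act on the observed error
        trace.append(value)
    return trace


# Proactive: a steadily rising series is extrapolated one step ahead.
forecast = proactive_forecast([10, 12, 14, 16])

# Reactive: a body-temperature-like set point of 37.0, starting from 40.0.
trace = reactive_control(target=37.0, start=40.0, gain=0.5, rounds=4)
```

The forecast acts before the next observation arrives, while the feedback loop needs no model at all and simply keeps correcting; that is one way to read "both are needed".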
Technology Requirements
Data sizes for data under management are monotonically increasing
Who wants less data?
Our appetite for analysis is monotonically increasing
Do you think, or do you know?
Trend toward evidence-based management
Our appetite for speed is monotonically increasing
Who wants questions answered more slowly?
Hence the industry interest in in-memory data management systems
Our overall ability to manage complexity is not increasing
Technology To Support Data Science
Processor speeds are limited
Processor core density has been increasing at a healthy rate
Memory density is increasing (but at a lower rate than core density)!
Therefore, the memory/core ratio is going in the wrong direction!
We haven’t significantly changed the memory/storage hierarchies for decades
Interconnects are getting faster – as fast as memory access?
memory access is slow
caches are fast!
Memory-Density/Core-Density Declining…
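The declining ratio follows directly from the two growth rates. The sketch below uses made-up starting values and growth rates purely for illustration (the talk gives no figures); `memory_per_core` is a hypothetical helper.

```python
def memory_per_core(mem0, cores0, mem_growth, core_growth, years):
    """Memory available per core for each year under compound annual growth."""
    return [
        (mem0 * (1 + mem_growth) ** y) / (cores0 * (1 + core_growth) ** y)
        for y in range(years + 1)
    ]


# Hypothetical numbers: 64 GB and 8 cores today,
# memory density growing 20% per year, core density growing 40% per year.
ratio = memory_per_core(mem0=64.0, cores0=8, mem_growth=0.20, core_growth=0.40, years=5)

for year, gb in enumerate(ratio):
    print(f"year {year}: {gb:.2f} GB per core")
```

Whenever core density grows faster than memory density, the ratio shrinks by a constant factor each year, whatever the starting point.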
Technological Solutions
It’s in our nature to tackle more ambitious problems
Need faster answers
SAP, Oracle, Neo4j, Objectivity, etc.
More in-memory solutions (e.g. NYSE/Euronext – Steve Rubinow)
Cannot get faster processors, but we can get more of them
But: parallelism is difficult
Legacy software is a huge problem
Need more machine learning and, therefore, feedback
What about memory?
Scaling out
When all you have is a hammer, every problem looks like a nail
Or, in my case, a thumb!
Today we rely almost exclusively on “scale-out” systems
Because that’s the main way we add processors and memory
Shard the data, intelligently target the queries – time-consuming
It’s not easy to query partitioned databases
What is the best way to do it?
Moving data is time-consuming
And you might have to change it
What if you could build systems that “scale-up”?
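The scale-out costs listed on this slide can be made concrete with a toy example. This is a minimal sketch, assuming simple hash-based sharding over in-memory dictionaries; the scheme and all names (`shard_for`, `build_shards`, and so on) are illustrative and do not describe any particular product.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(records, num_shards):
    """Partition the records across num_shards dictionaries."""
    shards = [{} for _ in range(num_shards)]
    for key, value in records.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


def point_lookup(shards, key):
    """A well-targeted query touches exactly one shard."""
    return shards[shard_for(key, len(shards))].get(key)


def scatter_gather_sum(shards):
    """An untargeted query must visit every shard and merge the partial results."""
    return sum(sum(shard.values()) for shard in shards)


def keys_moved(records, old_n, new_n):
    """Count the keys that change shard when the shard count changes."""
    return sum(
        1 for key in records if shard_for(key, old_n) != shard_for(key, new_n)
    )


records = {f"user{i}": i for i in range(100)}
shards = build_shards(records, 4)

print(point_lookup(shards, "user7"))   # one shard consulted
print(scatter_gather_sum(shards))      # all four shards consulted
print(keys_moved(records, 4, 5))       # keys that would have to move to add a shard
```

Even in this toy form, the application has to know how the data is partitioned, aggregate queries fan out to every shard, and adding a shard forces data to move.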
What I’m doing about this
Enabling systems that scale-up (TidalScale Inc. mission)
Software that sits below the operating system but above the hardware, aggregates a set of servers, and runs that collection as a single virtual server running a single conventional operating system
dynamic scaling at linear cost
supporting unmodified legacy software and legacy operating systems
automatically, dynamically and hierarchically optimizing processors, memory, networks, and storage systems through machine learning
automatically evolving as hardware evolves
The computer begins to learn what it needs to do to manage itself!
Why Data Science Now?
NEED: the future is increasingly complex and difficult to predict
NEED: we don’t have enough qualified experts, and experts often get it wrong
RAW MATERIALS: we are collecting huge amounts of data at an increasing rate
ENABLER: new hardware and software tools are emerging
THEREFORE: Data science is inevitable! We don’t have a choice
What are the implications?
Danny Hillis, inventor of the Connection Machine:
“I want to build a computer that will be proud of me”
What about Skynet?
Let’s leave that discussion for another day…
The Second Machine Age
Andrew McAfee, MIT
@amcafee
Thank you
Ike Nassi
UCSC Computer Science
and
TidalScale, Inc.