
Introduction
Malathi Veeraraghavan
Professor
Charles L. Brown Dept. of Electrical and Computer Engineering
University of Virginia
[email protected]
Outline
• Increasing interest in data
• Course: From Data to Knowledge
• Summary
“The data deluge”
“Data, data everywhere”
• Economist Special Issue Feb 27-Mar. 5, 2010
• Walmart databases alone are estimated at
more than 2.5 petabytes (a petabyte is 1
million gigabytes): 2010 numbers
• From businesses to governments, data
collection and analysis is rapidly becoming the
next big thing.
• 2012: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all
“The data deluge”
• “A new kind of professional has emerged,
the data scientist, who combines the skills
of software programmer, statistician and
storyteller/artist to extract the nuggets
of gold hidden under mountains of data.”
• Hal Varian, Google’s chief economist, notes
that “Data are widely available; what is
scarce is the ability to extract wisdom
from them.”
Business intelligence
• Nestle sells > 100,000 products in 200 countries
using 550,000 suppliers
• Problem: not using its huge buying power
effectively
• Used SAP software and analyzed its data
• For just one ingredient, vanilla, its American
operation reduced the number of specifications
and used fewer suppliers, saving $30M per year
• Annual savings from such operational
improvements: $1 billion
Economist special issue
Medical use
• Dr. Carolyn McGregor of the University of Ontario Institute of Technology
• Goal: spot fatal infections in premature babies
• Monitors subtle changes in 7 streams of real-time
data, such as heart rate, blood pressure, etc.
• ECG alone takes 1000 readings/second
• Infections are detected before obvious symptoms
emerge
• Naked eye cannot see it, but the computer can!
• Who programs these? Stats experts.
• Another term: Evidence Based Medicine
Economist special issue
Government usage
• An add-on to a 1986 law required
firms to disclose the harmful
chemicals they release.
• When the public started tracking
these numbers, by 2000, American
businesses had reduced their
emissions of the chemicals covered
under the law by 40%
Economist special issue
Best-sellers
• “Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by
Ian Ayres
• “Moneyball: The Art of Winning an Unfair
Game” by Michael Lewis
• “The Long Tail” by Chris Anderson
• Malcolm Gladwell books - Outliers
• Microtrends – Mark Penn (elections)
• Freakonomics – S. Dubner and S. Levitt
Moneyball example
• 2002 season: Richest team, NY Yankees, had a payroll of
$126 million, while the Oakland A’s had a payroll of less than
a third of that, about $40 million, and yet they had reached
the playoffs three years in a row, and took the Yankees
close to elimination. How did they do it?
• Billy Beane, general manager of Oakland A’s
– Respected statistics
– Hired Paul DePodesta, a Harvard economics graduate, who applied Bill James’
formulas and selected players based on their statistics.
– Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks)
– Jeremy Brown – only player in the history of the SEC with 300
hits and 200 walks, but he was overweight
– Scouts vs. statisticians!
• The tendency of everyone to generalize wildly from his own
experience. Most people think their own experience is
typical!
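The runs-created formula on this slide can be tried out with a few lines of code. The sketch below is only an illustration: the function name `runs_created` and the season line passed to it are hypothetical, invented here and not taken from the book or from any real player.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Runs-created estimate as written on the slide:
    (Hits + Walks) * Total Bases / (At Bats + Walks)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        raise ValueError("at_bats + walks must be greater than zero")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, invented for illustration only.
example = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(f"Estimated runs created: {example:.1f}")
```

Note how walks count in the numerator: a player who draws many walks scores well on this statistic even if scouts are unimpressed by his appearance.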
Malcolm Gladwell's “Outliers”:
hockey players story
• Why Canadian hockey players born early in the year have a big
advantage; cutoff date was Jan. 1
• ESPN conducted a small study of all 2008-season NHL
players born from 1980 to 1990. [Later disputed for
2011 players]
• Sure enough: Many more were born early in the year than late.
Month   Players   Month   Players
Jan.    51        Jul.    36
Feb.    46        Aug.    41
Mar.    61        Sep.    36
Apr.    49        Oct.    34
May     46        Nov.    33
June    49        Dec.    30
http://sports.espn.go.com/espn/page2/story?page=merron/081208
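One way to check the "many more were born early in the year" claim is to total the two halves of the year and compare the counts with an even spread across months. The sketch below uses the ESPN counts from this slide; treating every month as equally likely is a simplifying assumption (months differ in length), and 19.675 is the standard 5% critical value of the chi-square distribution with 11 degrees of freedom.

```python
# Birth-month counts from the slide (ESPN study of 2008-season NHL players).
births = {
    "Jan": 51, "Feb": 46, "Mar": 61, "Apr": 49, "May": 46, "Jun": 49,
    "Jul": 36, "Aug": 41, "Sep": 36, "Oct": 34, "Nov": 33, "Dec": 30,
}

total = sum(births.values())
first_half = sum(births[m] for m in ["Jan", "Feb", "Mar", "Apr", "May", "Jun"])
second_half = total - first_half
first_half_share = first_half / total

# Chi-square statistic against an even spread (assumption: equal-length months).
expected = total / len(births)
chi_square = sum((count - expected) ** 2 / expected for count in births.values())

CRITICAL_5PCT_DF11 = 19.675  # 5% critical value, 11 degrees of freedom

print(f"Jan-Jun: {first_half}, Jul-Dec: {second_half}")
print(f"Share born Jan-Jun: {first_half_share:.1%}")
print(f"Chi-square: {chi_square:.2f} (5% critical value {CRITICAL_5PCT_DF11})")
```

A statistic above the critical value suggests the monthly counts are unlikely to come from an even spread, which is consistent with the slide's reading of the table.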
Examples from “The Long Tail”
• Rhapsody, an online music store that had 1.5M tracks in
Dec. 2005, reported that even its 100,000th track was
downloaded thousands of times per month, whereas a Walmart
store, the largest brick-and-mortar music retailer, stocks
only 55,000 tracks.
• Rhapsody reports that 40% of its total sales came from the
Long Tail products, i.e., those not available in retail stores.
• Anderson gives several such examples, calling these
businesses Long-Tail aggregators
– Google as the long-tail aggregator of advertising
– eBay of goods
– Amazon of books
– Apple of music
– Netflix of movies
Experts vs. intuition
• Ian Ayres’ book
– “The future belongs to people like
Wolfers who are comfortable with both
intuition and numbers”
– Wolfers analyzed 44,000 college
basketball games played over 16 years
• Also see Jonah Lehrer’s “How We
Decide” – another bestseller
Ian Ayres’ book, page 220
What Wolfers did
• Plot density function of number of games that beat the Las
Vegas spread
– Perfect normal bell curve!
• Just look at games with point spreads less than or equal to
12
– Perfect normal bell curve
• Look at games with point spread > 12
– 47% chance that the favored team beat the spread (53% failed
to cover the spread)
– more than 20% of games fell in this category of games with >12
spreads
– Is it point shaving?
• Look at the score five minutes before the end of the game –
right on track to beat the spread 50% of the time!
– Indeed a stronger case for point shaving
Ian Ayres’ book, page 216
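To see why a 47% cover rate is suspicious rather than ordinary noise, one can ask how far it sits from the 50% expected of a fair spread, measured in standard errors. The sketch below is a rough illustration and not Wolfers' actual analysis: the slide only says "more than 20%" of the 44,000 games had spreads above 12, so the 20% share used for the sample size is an assumption.

```python
import math

TOTAL_GAMES = 44_000          # from the slides
HEAVY_FAVORITE_SHARE = 0.20   # ASSUMPTION: the slide says "more than 20%"
OBSERVED_COVER_RATE = 0.47    # from the slide
FAIR_COVER_RATE = 0.50        # a fair spread is covered half the time

# Approximate number of games with a point spread above 12.
n = round(TOTAL_GAMES * HEAVY_FAVORITE_SHARE)

# Standard error of a proportion under the fair-spread hypothesis.
standard_error = math.sqrt(FAIR_COVER_RATE * (1 - FAIR_COVER_RATE) / n)

# Distance of the observed rate from 50%, in standard errors.
z_score = (OBSERVED_COVER_RATE - FAIR_COVER_RATE) / standard_error

print(f"Games with spread > 12 (assumed): {n}")
print(f"Standard error: {standard_error:.4f}")
print(f"z-score: {z_score:.2f}")
```

Under these assumptions the observed rate lies more than five standard errors below 50%, far outside the two-standard-deviation band used as the usual yardstick for chance variation.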
2SD Rule:
To understand variability
• There is a 95% chance that a normally distributed
variable will fall within two standard deviations
(plus or minus) of its mean
• Statistical significance – simple intuitive concept –
there is less than 5% chance that a random
variable will be more than two standard deviations
away from the mean.
• Stanford Law school students knew that
professors were required to give a 3.2 mean. They
wanted to know if the professor was a “spreader”
or a “clumper”!
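The 95% figure in the 2SD rule can be checked directly from the normal distribution. The sketch below uses only the Python standard library; the helper name `prob_within` is made up for this illustration.

```python
import math


def prob_within(k):
    """Probability that a normally distributed variable falls within
    k standard deviations (plus or minus) of its mean."""
    return math.erf(k / math.sqrt(2))


within_two_sd = prob_within(2)
outside_two_sd = 1 - within_two_sd

print(f"Within 2 SD:  {within_two_sd:.4f}")
print(f"Outside 2 SD: {outside_two_sd:.4f}")
```

The exact probability is slightly above 95%, so the chance of landing more than two standard deviations from the mean is a little under 5%, matching the rule of thumb on the slide.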
Ian Ayres’ book, page 221
Technology trends enabling
all this data analysis
• Cloud computing
– Amazon, Google, Yahoo, Microsoft
• Open source software
– R programming language
• NY Times article, Jan. 7, 2009
– Hadoop allows ordinary PCs to analyze
huge quantities of data that previously
required supercomputers
Economist special issue
Technology or techniques?
• Moore’s Law
– Processing power doubles every two years
– Supercrunching does need CPUs, but computing
power has been available
• More important: Kryder’s Law
– Storage capacity of hard drives has been
doubling every two years
– Named after Mark Kryder, chief technology officer of
hard drive manufacturer Seagate
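Both laws on this slide describe doubling every two years, and the compounding is easy to underestimate. A minimal sketch, assuming a steady two-year doubling period as stated on the slide (the function name `growth_factor` is made up for this illustration):

```python
def growth_factor(years, doubling_period=2):
    """How many times larger a quantity becomes after `years`
    if it doubles once every `doubling_period` years."""
    if doubling_period <= 0:
        raise ValueError("doubling_period must be positive")
    return 2 ** (years / doubling_period)


for years in (2, 10, 20):
    print(f"After {years:2d} years: x{growth_factor(years):,.0f}")
```

Ten years of doubling every two years gives a 32-fold increase, and twenty years gives roughly a thousand-fold increase.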
Ian Ayres’ book, page 151
Three techniques
• Regressions
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
– error term $\varepsilon_i \sim N(0, \sigma^2)$
• Randomization
– Run experiments by treating different
samples in different ways
• Neural networks
– Functional form is not assumed to be
linear or anything specific
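As an illustration of the first technique, the regression line can be fitted by ordinary least squares in a few lines. This is a minimal sketch and not the course's own code (the course uses R); the function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x.
    Returns (b0, b1): the intercept and the slope."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x, for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(f"intercept b0 = {b0}, slope b1 = {b1}")
```

With real data the points do not fall exactly on a line; the leftover vertical distances are the error terms in the regression equation.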
Ian Ayres’ book
Course material
• From Data to Knowledge
• Focus on data sets
• Less on details of statistical techniques
• Learn R programming through class-provided R programs and assignments
• http://www.ece.virginia.edu/mv/edu/D2K/index.htm
Summary
• Importance of data analysis
– in every walk of life!
• How to extract the “story” hidden in
the data set?