Slides from class 1

Download Report

Transcript Slides from class 1

Statistical Learning
Introduction:
Modeling Examples
Visualization example: Fraud by customer type
60
Our goal is to build
model to predict
fraud in advance
50
%
40
Legitimate (n=5000)
30
Fraud (n=200)
20
We can see
associations between
customer type and
fraudulent behavior.
10
Are they legitimate?
Data leakage?
0
Type A
Type B
Type C
• Predict whether someone will have a heart attack on the basis
of demographic, diet and clinical measurements
• Identify the risk factors for prostate cancer (lpsa), based on clinical
and demographic variables.
ESL Chap1 - Introduction
• Classify a recorded phoneme, based on a log-periodogram.
A restricted model
(red) does much
better than an
unrestricted one
(jumpy black)
• Customize an email spam detection system.
X = which words appear and how much
Y = Spam or not?
• Identify the numbers in a handwritten zip code, from a digitized image
X = color of each pixel
Y = which digit is it?
• Classify a tissue sample into one of several cancer classes, based on a
gene expression profile.
X = expression levels of genes
Y = which cancer?
• Classify the pixels in a LANDSAT image, according to usage:
Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray
soil, very damp gray soil}
X = values of pixels in several wavelength bands
October 2006 Announcement
of the NETFLIX Competition
USAToday headline:
“Netflix offers $1 million prize for better movie recommendations”
Details:
•
•
•
•
•
•
•
Beat NETFLIX current recommender model ‘Cinematch’ by 10% based on
absolute rating error prior to 2011
$50K for the annual progress price (relative to baseline)
Data contains a subset of 100 million movie ratings from NETFLIX including
480,189 users and 17,770 movies
Performance is evaluated on holdout movies-users pairs
NETFLIX competition has attracted 45878 contestants on 37660 teams from
180 different countries
Tens of thousands of valid submissions from thousands of teams
Conclusion: in 2009, an international team attained the goal and won the
prize! More later…
Data Overview:
NETFLIX
Internet Movie Data Base
17K
Selection
unclear
480 K
At least 20
Ratings by
end 2005
NETFLIX
Competition
Data
All users (6.8 M)
All movies (80K)
Fields
Title
Year
100 M ratings
4
5
1
Actors
Awards
3
2
Revenue
4
…
NETFLIX data generation process
User Arrival
17K movies
Movie Arrival
Training Data
1998
Time
2005
4
5
?
Qualifier
Dataset
3M
3
2
?
Netflix and us
• We will have a modeling challenge in our course which will use the
Netflix data. The winners will get a grade boost!
• The $1M was won in 2009 by a collaboration of several leading
teams
– The strongest team, which won both yearly $50K prizes, was founded at
AT&T, with an Israeli participant (Yehuda Koren)
– Yehuda was one of the major driving forces on the final winning team
– He is now back in Israel, and may come give us a talk!
• While I was at IBM Research, our team won a related competition in
KDD-Cup 2007 (same data, more “standard” modeling tasks)
– We may have a “case study” lecture on that as well
Project evolution and relevance
to our course
Business
problem
definition
Modeling
problem
definition
Targeting,
Sales force
mgmt.
Wallet /
opportunity
estimation
Model
generation &
validation
Programming,
Simulation,
IBM Wallets
Statistical
problem
definition
Quantile est.,
Latent
variable est.
Implementation
& application
development
Modeling
methodology
design
Quantile est.,
Graphical
model
Outside
scope
Keep in mind
OnTarget,
MAP
This is our
domain!