Demographics and Behavioral Data Mining Case Study



Demographics and Weblog Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers.
Design a classification model for insight into
which variables are important for strategies to
increase the subscription rate.
Learn by Doing
http://www.meetup.com/HandsOnProgrammingEvents/
Data Mining Hackathon
Funded by Rapleaf
• With Motley Fool’s data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; R on individual PCs
• Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free.
Getting More Subscribers
Headline data, weblog, demographics
Cleaning Data
• training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv
• Feature engineering; a loading-and-joining sketch follows below
• Github:
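A minimal loading-and-joining sketch in R, matching the files listed above. The join key (uid) and the weblog columns (pageV, visits) are assumptions for illustration; the real schemas came with the hackathon data.

  # Load the hackathon files (the .tsv files are tab-separated)
  train <- read.csv("training.csv")
  demo  <- read.delim("demographics.tsv")
  entry <- read.delim("entry.tsv")

  # Join behavioral and demographic features on an assumed user id "uid"
  dat <- merge(train, demo,  by = "uid", all.x = TRUE)
  dat <- merge(dat,  entry,  by = "uid", all.x = TRUE)

  # Example engineered feature: page views per visit (both columns assumed)
  dat$pages_per_visit <- dat$pageV / pmax(dat$visits, 1)

  # Demographics are sparse: treat empty strings as missing
  dat[dat == ""] <- NA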
Ensemble Methods
• Bagging, boosting, random forests
• Overfitting
• Instability (small changes in the data make large changes in predictions)
• Previously none of these worked at scale
• Small-scale results using R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.). A minimal small-scale sketch follows below.
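As a concrete small-scale example, a minimal random-forest fit with R's randomForest package; the merged data frame dat and the target column subscriber are assumptions carried over from the sketch above.

  library(randomForest)

  dat$subscriber <- factor(dat$subscriber)   # classification target
  set.seed(1)
  rf <- randomForest(subscriber ~ ., data = na.roughfix(dat),
                     ntree = 500, importance = TRUE)

  print(rf)        # out-of-bag error estimate
  varImpPlot(rf)   # which variables matter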
ROC Curves
• Binary classifiers only! (a computation sketch follows below)
[Figure: paid-subscriber ROC curve, ~61%]
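A hedged sketch of computing an ROC curve and its area with the pROC package; test (a held-out data frame) and the model rf are assumptions from the sketches above.

  library(pROC)

  # Predicted subscription probabilities on held-out data
  probs <- predict(rf, newdata = na.roughfix(test), type = "prob")[, 2]

  r <- roc(test$subscriber, probs)
  plot(r)   # the ROC curve
  auc(r)    # area under it: 0.5 is random, 1.0 is perfect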
Boosted Regression Trees Performance
• Training-data ROC score = 0.745
• CV ROC score = 0.737; se = 0.002
• 5.5% below the winning score, without doing any data preprocessing
• Random is 50%, i.e. 0.50; at 0.737 we are 23.7 percentage points (0.737 - 0.50) better than random
• A fitting sketch follows below
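A minimal sketch of the boosted-regression-trees fit with R's gbm package; the parameter values here are illustrative, not the hackathon's. The Conclusion's three tuning knobs map to interaction.depth (tc), shrinkage (lr), and n.trees.

  library(gbm)

  dat$subscriber <- as.integer(as.character(dat$subscriber))  # bernoulli wants 0/1
  set.seed(1)
  fit <- gbm(subscriber ~ ., data = dat, distribution = "bernoulli",
             n.trees = 2000,           # number of trees
             interaction.depth = 3,    # tree complexity (tc)
             shrinkage = 0.01,         # learning rate (lr)
             cv.folds = 5)

  best <- gbm.perf(fit, method = "cv") # CV-optimal number of trees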
Contribution of Predictor Variables
Predictive Importance
• Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement it gives the model. Also a measure of sparsity in the data. (The table below is reproduced by the gbm call shown after it.)
• Fit plots show a variable's effect with the other model variables averaged out
Rank  Variable  Relative influence
 1    pageV     74.0567852
 2    loc       11.0801383
 3    income     4.1565597
 4    age        3.1426519
 5    residlen   3.0813927
 6    home       2.3308287
 7    marital    0.6560258
 8    sex        0.6476549
 9    prop       0.3817017
10    child      0.2632598
11    own        0.2030012
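A table like the one above is what gbm reports as Friedman's relative influence, normalized to sum to 100; fit and best are the assumed objects from the gbm sketch above.

  summary(fit, n.trees = best)                 # relative-influence table and bar plot
  plot(fit, i.var = "pageV", n.trees = best)   # fit plot: pageV's effect with the
                                               # other variables averaged out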
Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source. Most sites aren’t using this information correctly.
• There is no single correct answer: trial and error on features. The features are more important than the algorithm.
• Linear vs. nonlinear
[Plots: fitted values, crappy vs. better]
Predictor Variable Interaction
• Adjusting variable interactions (a sketch follows below)
[Plots: variable interactions; interaction of age and loc]
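A sketch of quantifying and plotting interactions with the gbm package: interact.gbm computes Friedman's H statistic for a variable pair, and plot draws the two-way partial dependence. fit and best are assumed from the earlier sketch.

  # Strength of the age x loc interaction (H statistic; 0 = no interaction)
  interact.gbm(fit, data = dat, i.var = c("age", "loc"), n.trees = best)

  # Two-way plot of the age x loc interaction
  plot(fit, i.var = c("age", "loc"), n.trees = best)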
Trees vs. other methods
• The plots show multiple levels, which suits trees. Do other variables match this pattern? Simplify the model or add more features; iterate toward a better model.
• No math required; an analyst can read these plots.
Number of Trees
[Plot: number of trees by data set]
Hackathon Results
• Weblogs only: 68.15%, 18 percentage points better than random
• Demographics add 1%
AWS Advantages
• Running multiple instances with different algorithms and parameters using R (a single-machine sketch follows below)
• To do: add a tutorial, install Screen, fix R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
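A single-machine stand-in for the "multiple instances, different parameters" idea: a small parameter grid swept in parallel with R's base parallel package. The grid values are illustrative, and dat is assumed from the earlier sketches.

  library(parallel)
  library(gbm)

  grid <- expand.grid(depth = c(2, 3, 5), lr = c(0.1, 0.01))

  fits <- mclapply(seq_len(nrow(grid)), function(i) {
    gbm(subscriber ~ ., data = dat, distribution = "bernoulli",
        n.trees = 1000,
        interaction.depth = grid$depth[i],
        shrinkage = grid$lr[i],
        cv.folds = 5)
  }, mc.cores = 2)   # mclapply forks; on Windows use mc.cores = 1

  # Best cross-validated deviance per grid point
  sapply(fits, function(f) min(f$cv.error))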
Conclusion
• Data mining at scale requires more development in visualization, MR (MapReduce) algorithms, and MR data preprocessing.
• Tuning uses visualization. There are three parameters to tune: tc (tree complexity), lr (learning rate), and the number of trees; we didn’t cover two of the three.
• This isn’t reproducible in Hadoop/Mahout or any open-source code I know of.
• Other use cases: predicting which item will sell (eBay), search-engine ranking.
• Be careful with MR paradigms: Hadoop MR != Couchbase MR.