Focus the mining beacon: lessons and challenges
SF Bay ACM Data Mining SIG, 6/13/2006
Focus the Mining Beacon:
Lessons and Challenges from
the World of E-Commerce
Ronny Kohavi
General Manager, Experimentation Platform, Microsoft
Partly based on joint work with Llew Mason, Rajesh Parekh,
and Zijian Zheng, Machine Learning, vol. 57, 2004
Overview
Background/experience
Business lessons and Controlled Experiments
Simpson’s paradox
Technical lessons
Challenges
Q&A
Background (I)
A consultant is someone who
• borrows your razor,
• charges you by the hour,
• learns to shave on your face
1993-1995: Led development of
MLC++, the Machine Learning Library in C++ (Stanford University)
Implemented or interfaced many ML algorithms.
Source code is public domain, used for algorithm comparisons
1995-1998: Developed and managed MineSet
MineSet™ was a “horizontal” data mining and visualization product at Silicon
Graphics, Inc. (SGI). Utilized MLC++. Now owned by Purple Insight
Key insight: customers want simple stuff: Naïve Bayes + Viz
ICML 1998 keynote: claimed that to be successful, data mining
needs to be part of a complete solution in a vertical market
I followed this vision to Blue Martini Software
Background (II)
1998-2003: Director of Data Mining, then VP of
Business Intelligence at Blue Martini Software
Developed an end-to-end e-commerce platform with integrated business
intelligence, from collection and extract-transform-load (ETL) to
data warehouse, reporting, mining, and visualizations
Analyzed data from over 20 clients
Key insight: collection, ETL worked great. Found many insights.
However, customers mostly just ran the reports/analyses we provided
2003-2005: Director, Data Mining and Personalization,
Amazon
Key insights: (i) simple things work, and (ii) human insight is key
2005: Microsoft
Assistance Platform
Started Experimentation Platform group 3/2006
Business-level Lessons (I)
Auto-creation of the data warehouse worked
very well
At Blue Martini we owned the operational side as well as
the analysis; we had a ‘DSSGen’ process that auto-generated a star-schema data warehouse
This worked very well. For example, if a new customer
attribute was added at the operational side, it automatically
became available in the data warehouse
Clients are reluctant to list specific questions
Conduct an interim meeting with basic findings.
Clients often came up with a long list of questions when
faced with basic statistics about their data
Business-level Lessons (II)
Collect business-level data from operational
side
Many things are not observable in weblogs (search
information, shopping cart events, registration forms, time
to return results). Log more at the app server (a sketch follows below)
External events: marketing promotions, advertisements, site
changes
Choose to collect as much data as you realistically can
because you do not know what might be relevant for a
future question.
(Subject to privacy issues, but aggregated/anonymous data
is usually OK.)
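As a rough illustration of the “log more at the app server” point above, here is a minimal sketch of structured business-event logging; the event names and fields are hypothetical assumptions for illustration, not Blue Martini's or Amazon's actual schema:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("business_events")

def log_event(event_type, **fields):
    """Append one structured business event (hypothetical schema, for illustration)."""
    record = {"ts": time.time(), "event": event_type, **fields}
    logger.info(json.dumps(record))

# Events that never show up in a plain web log:
log_event("search", query="red shoes", results=0, latency_ms=420)
log_event("cart_add", sku="SKU-1234", quantity=1, source_page="search")
log_event("form_error", form="registration", field="birth_date", value="11/11/11")
log_event("promotion_shown", campaign="spring_sale_10_off")
```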
Collection example – Form Errors
Here is a good example of data
collection that we introduced
without knowing a priori whether it
would help: form errors
If a web form was filled and a field
did not pass validation, we logged
the field and value filled
This was the Bluefly home page
when they went live
Looking at form errors, we saw
thousands of errors every day on
this page
Any guesses?
Business-level Lessons (III)
Crawl, Walk, Run
Do basic reporting first, generate univariate statistics, then
use OLAP for hypothesis testing, and only then start
asking characterization questions and use data mining
algorithms
Agree on terminology
What is the difference between a visit and a session?
How do you define a customer
(e.g., did every customer purchase)?
How is “top seller” defined when showing best sellers?
Why are lists from Amazon (left) and Barnes & Noble (right)
so different?
The answer: no agreed-upon definition of sales rank.
Human Intuition is Poor
Do you believe in intuition?
No, but I have a feeling I might someday
Many explanations we give to “success” are backwards looking.
Hindsight is 20/20
Sales of sunglasses per-capita in Seattle vs. LA example
Our intuition at assessing new ideas is usually very poor
We are especially bad at assessing ideas that are not incremental, i.e., radical
changes
We commonly confuse ourselves with the target audience
Discoveries that contradict our prior thinking are usually the most interesting
Next set of slides are a series of examples where you can test
your intuition, or your “prior probabilities.”
How Priors Fail us
Warning: graphic image
may be disturbing to some
people.
However, it’s just your
priors.
We tend to interpret
the picture to the left
as a serious problem
We are not Used to Seeing Pacifiers with Teeth
Checkout Page
The conversion rate is the percentage of visits to the website that include a purchase
A
B
Which version has a higher conversion rate? Why?
Example from Bryan Eisenberg’s article on clickz.com
Graphics / Color
Which one converts (to search) better?
A
B
Source: Marketing Experiments
http://www.marketingexperiments.com
Amazon Shopping Cart Recs
Add an item to your shopping cart at a website
Most sites show the cart
At Amazon, Greg Linden had the idea of showing
recommendations based on cart items
Evaluation
Pro: cross-sell more items
Con: distract people from checking out – VP asked to stop work on this idea
As with many new things, hard to decide
A/B test was run
Idea was great. As many of you know from experience,
this feature is live on the site
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Office Online
Small UI changes can make a big difference
Example from Microsoft Help
When reading help (from product or web), you have an option to
give feedback
Office Online Feedback
A
B
Feedback A puts everything together, whereas
feedback B is two-stage: question follows rating.
Feedback A just has 5 stars, whereas B annotates the
stars with “Not helpful” to “Very helpful” and makes
them lighter
Which one has a higher response rate? By how much?
Another Feedback Variant
C
Call this variant C. Like B, also two stage.
Which one has a higher response rate, B or C?
Twyman’s Law
Any statistic that appears interesting
is almost certainly a mistake
Validate “amazing” discoveries in different ways.
They are usually the result of a business process
5% of customers were born on the exact same day (including year)
o 11/11/11 is the easiest way to satisfy the mandatory birth date field
For US and European Web sites, there will be a small sales
increase on Oct 29th, 2006
Twyman’s Law (II)
KDD Cup 2000
o Customers who were willing to receive e-mail correlated with heavy spenders (the target variable)
o Lots of participants found this spurious correlation, but it was terrible for predictions on the test set
Sites go through phases (launches) and multiple things change together
o The default for the registration question was changed from “yes” to “no” on 2/28, when it was realized that few were opting in
o This coincided with a $10 discount off every purchase
[Chart: Percentage of Customers by Date (2/8 to 3/28) for the “Heavy Spenders” and “Accepts Email” series]
Interrupt: Key Takeaways
Every talk (hopefully) has a few key points to
take away. Here are two from this talk:
Encourage controlled experiments (A/B tests)
o The previous examples should have convinced you that our intuition is
poor and we need to experiment to get data
Simpson’s paradox
o Lack of awareness of the phenomenon can lead to mistaken conclusions
o Unlike esoteric brain teasers, it happens in real life
o In the next few slides I’ll share examples that seem “impossible”
o We’ll then explain why they are possible and do happen
o Discuss implications/warning
Example 1: Drug Treatment
Real-life example for kidney stone treatments
Overall success rates:
Treatment A succeeded 78%, Treatment B succeeded 83% (better)
Further analysis splits the population by stone size
For small stones
Treatment A succeeded 93% (better), Treatment B succeeded 83%
For large stones
Treatment A succeeded 73% (better), Treatment B succeeded 69%
Hence treatment A is better in both cases, yet was worse in total
People going into treatment have either small stones
or large stones
A similar real-life example happened when the two population
segments were cities (A was better in each city, but worse overall)
Adapted from the Wikipedia article on Simpson's paradox
Example 2: Sex Bias?
Adapted from real data for UC Berkeley admissions
Women claimed sex discrimination
Only 34% of women were accepted, while 44% of men were accepted
Segmenting by departments to isolate the bias, they
found that all departments accept a higher percentage
of women applicants than men applicants.
(If anything, there is a slight bias in favor of women!)
There is no conflict in the above statements.
It’s possible and it happened
Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975). Sex bias in graduate
admissions: Data from Berkeley. Science, 187, 398-404.
Example 3: Purchase Channels
Multichannel customers spend 72% more
per year than single channel customers
-- State of Retailing Online, shop.org
Real example from a Blue Martini Customer
We plotted the average customer spending for customers
purchasing on the web or “on the web and offline (POS)”
(multi-channel), segmented by the number of purchases per customer
In all segments, multi-channel customers spent less
However, like shop.org predicted, ignoring the segments, multi-channel customers spent more on average
[Chart: Customer Average Spending (0 to 2,000) by Number of purchases (1, 2, 3, 4, 5, >5), comparing Multi-channel vs. Web-channel only]
Last Example: Batting Average
Baseball example
(For those not familiar with baseball, batting average is the fraction of at-bats that are hits.)
One player can hit for a higher batting average than another player
during the first half of the year
Do so again during the second half
But to have a lower batting average for the entire year
Example:

        First Half        Second Half       Total season
A       4/10   = 0.400    25/100 = 0.250    29/110 = 0.264
B       35/100 = 0.350    2/10   = 0.200    37/110 = 0.336
Key to the “paradox” is that the segmenting variable (e.g., half
year) interacts with “success” and with the counts.
E.g., “A” was sick and rarely played in the 1st half, then “B” was
sick in the 2nd half, but the 1st half was “easier” overall.
Not Really a Paradox, Yet Non-Intuitive
If a/b < A/B and c/d < C/D, it’s possible that
(a+c)/(b+d) > (A+C)/(B+D)
We are essentially dealing with weighted averages when we
combine segments
Here is a simple example with two treatments
Each cell has Success / Total = Percent Success %
T1 is superior in both segment C1 and segment C2, yet loses overall
C1 is “harder” (lower success for both treatments)
T1 gets tested more in C1
        T1                T2
C1      2/8  = 25%        1/5  = 20%
C2      4/5  = 80%        6/8  = 75%
Both    6/13 = 46%        7/13 = 54%
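A minimal sketch (Python, not from the talk) that re-checks the table above and makes the weighted-average explanation concrete:

```python
from fractions import Fraction

# Successes/totals per treatment and segment, as in the table above
data = {
    "T1": {"C1": (2, 8), "C2": (4, 5)},
    "T2": {"C1": (1, 5), "C2": (6, 8)},
}

for treatment, segments in data.items():
    rates = {seg: Fraction(s, n) for seg, (s, n) in segments.items()}
    total_s = sum(s for s, _ in segments.values())
    total_n = sum(n for _, n in segments.values())
    overall = Fraction(total_s, total_n)
    print(treatment, {seg: float(r) for seg, r in rates.items()},
          "overall:", float(overall))

# T1 wins in C1 (25% vs 20%) and in C2 (80% vs 75%), yet loses overall
# (46% vs 54%): the overall rate is a weighted average, and T1's weight
# is concentrated in the "harder" segment C1.
```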
Important, not Just Cool
Why is this so important?
In knowledge discovery, we state probabilities
(correlations) and associate them with causality
Treatment T1 works better
Berkeley discriminates against women
We must be careful to check for confounding
variables
Confounding variables may not be ones we are
collecting (e.g., latent/hidden)
Controlled Experiments
Multiple names for the same concept
o A/B tests
o Control/Treatment
o Controlled experiments
o Randomized Experimental Design
Concept is trivial
Randomly split traffic between two versions
o Control: usually the current live version
o Treatment: the new idea (or multiple)
Collect metrics of interest, analyze (statistical tests, data mining); a sketch of such a test follows below
[Diagram: 100% of users are randomly split into 50% Control (existing system) and 50% Treatment (existing system with Feature X); user interactions are instrumented, then analyzed & compared at the end of the experiment]
First known controlled experiment in the 1700s
A British captain noticed a lack of scurvy on Mediterranean ships
Had half the sailors eat limes (treatment), half did not (control)
The experiment was so successful that British sailors are still called limeys
Note: success despite no understanding of vitamin C deficiency
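To make “collect metrics of interest, analyze” concrete, here is a minimal sketch of one standard analysis, a two-proportion z-test on a conversion-rate metric; the visit and conversion counts are made up, and this is an illustrative sketch rather than the Experimentation Platform's actual analysis code:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Compare conversion rates of control vs. treatment (pooled z-test)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p_t - p_c, z, p_value

# Hypothetical experiment: 50/50 split, conversions out of visits
delta, z, p = two_proportion_ztest(conv_c=900, n_c=30000, conv_t=990, n_t=30000)
print(f"delta={delta:.4%}, z={z:.2f}, p={p:.3f}")
```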
Advantages of Controlled Experiments
Controlled experiments test for causal
relationships, not simply correlations
They insulate against external factors
Problems that plague interrupted time series, such as
history, seasonality, and regression effects, impact both versions
They are the standard in FDA drug tests
But like most great things, there are problems
and it’s important to recognize them…
Issues with Controlled Experiments (1 of 4)
If you don't know where you are going, any road will take you there
—Lewis Carroll
Org has to agree on key metric(s) to improve
While it may seem obvious that we need to know if we’re
improving, it’s not easy to get clear agreement
If nothing else, bringing this question to the surface is a great
benefit to the org!
Issues with Controlled Experiments (2 of 4)
Quantitative metrics, not always explanations of “why”
For example, we may know that lemons work against scurvy, but not why;
it may take a while to understand vitamin C deficiency
Data Mining may help identify segments where difference is large, leading
to better understanding
Usability studies also useful at explaining
Short-term vs. Long-term
Hard to assess long term effects, such as customer abandonment
Example: if you optimize ads for clickthrough revenue, you might
plaster the site with ads. Long-term concerns should be part of the metric
(e.g., revenue per pixel of real estate on the window)
Issues with Controlled Experiments (3 of 4)
Primacy effect
Changing navigation in a website may degrade the customer experience
(temporarily), even if the new navigation is better
Evaluation may need to focus on new users, or run for a long period
Multiple experiments
Even though the methodology shields an experiment from other changes,
statistical variance increases, making it harder to get significant results
It is useful to avoid multiple changes to the same “area.”
QA also becomes harder when tests interact
Consistency/contamination
On the web, assignment is usually cookie-based, but people may use
multiple computers, erase cookies, etc. Typically a small issue
Launch events / media announcements sometimes
preclude controlled experiments
The journalists need to be shown the “new” version
Issues with Controlled Experiments (4 of 4)
Statistical tests: distributions are far from
normal
97% of sessions do not purchase, so there’s a large mass at
zero spending
Proper randomization required
You cannot run option A on day 1 and option B on day 2, you
have to run them in parallel
When running in parallel, you cannot randomize based on IP
(e.g., load-balancer randomization) because all of AOL traffic
comes from a few proxy servers
Every customer must have an equal chance of falling into
control or treatment and must stick to that group
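One common way to satisfy both requirements (equal chance of assignment, and sticking to the assigned group) is to hash a persistent identifier such as the cookie ID; this is an illustrative sketch under that assumption, not the actual platform implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministic, sticky assignment: the same user and experiment always map
    to the same bucket, and buckets are uniform over users."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 10000.0
    return "treatment" if bucket < treatment_fraction else "control"

print(assign_variant("cookie-abc123", "checkout-page-test"))
```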
Technical Lessons – Cleansing (I)
Auditing data
Make sure time-series data exists for the whole period.
It is very easy to conclude that this week was bad
relative to last week because some data is missing
(e.g., collection bug)
Synchronize clocks from all data collection points.
In one example, some servers were set to GMT and
others to EST, leading to strange anomalies.
Even being a few minutes off can cause add-to-carts to
appear “prior” to the search
Technical Lessons – Cleansing (II)
Auditing data (continued)
Remove test data.
QA organizations constantly test the system.
Make sure the data can be identified and removed
from analysis
Remove robots/bots/spiders
5-40% of e-commerce site traffic is generated by
crawlers from search engines and
students learning Perl.
These significantly skew results unless removed
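A sketch of the kind of heuristic filtering this implies; the user-agent substrings and the request-count threshold are illustrative assumptions, not a complete bot-detection policy:

```python
BOT_SUBSTRINGS = ("bot", "crawler", "spider", "slurp", "libwww-perl", "wget", "curl")

def looks_like_bot(user_agent: str, requests_in_session: int,
                   max_human_requests: int = 300) -> bool:
    """Heuristic: known crawler strings in the user agent, an empty user agent,
    or an implausibly high request count for a single session."""
    ua = (user_agent or "").lower()
    return (not ua
            or any(s in ua for s in BOT_SUBSTRINGS)
            or requests_in_session > max_human_requests)

sessions = [
    {"ua": "Mozilla/5.0 (Windows NT 5.1)", "hits": 12},
    {"ua": "Googlebot/2.1 (+http://www.google.com/bot.html)", "hits": 800},
    {"ua": "libwww-perl/5.65", "hits": 40},
]
clean = [s for s in sessions if not looks_like_bot(s["ua"], s["hits"])]
print(len(clean), "of", len(sessions), "sessions kept")
```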
Data Processing
Utilize hierarchies
Generalizations are hard to find when there are many attribute
values (e.g., every product has a Stock Keeping Unit number)
Collapse such attribute values based on hierarchies
Remember date/time attributes
Date/time attributes are often ignored, but contain information
Convert them into cyclical attributes, such as hour of day or
morning/afternoon/evening, day of week, etc.
Compute deltas between such attributes (e.g., ship date minus
order date)
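A small sketch of both transformations suggested here, cyclical attributes and deltas, using hypothetical order/ship timestamps:

```python
from datetime import datetime

def datetime_features(order_ts: datetime, ship_ts: datetime) -> dict:
    """Derive cyclical attributes and a delta from raw timestamps."""
    return {
        "hour_of_day": order_ts.hour,
        "part_of_day": ("morning" if order_ts.hour < 12
                        else "afternoon" if order_ts.hour < 18
                        else "evening"),
        "day_of_week": order_ts.strftime("%A"),
        "ship_minus_order_days": (ship_ts - order_ts).days,
    }

print(datetime_features(datetime(2001, 9, 3, 14, 30), datetime(2001, 9, 7, 9, 0)))
```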
Analysis / Model Building
Mining at the right granularity level
To answer questions about customers, we must aggregate
clickstreams, purchases, and other information to the
customer level
Defining the right transformation and creating summary
attributes is the key to success
Phrase the problem to avoid leaks
A leak is an attribute that “gives away” the label.
E.g., heavy spenders pay more sales tax (VAT)
Phrasing the problem to avoid leaks is a key insight.
Instead of asking who is a heavy spender, ask which
customers migrate from spending a small amount in period 1
to a large amount in period 2
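A minimal sketch of aggregating purchase rows to the customer level and phrasing the label as migration from low spending in period 1 to high spending in period 2, so that period-2 attributes (such as sales tax) cannot leak into the features; the data and the 100-unit threshold are made up for illustration:

```python
from collections import defaultdict

# (customer_id, period, amount) purchase rows; hypothetical data
purchases = [("c1", 1, 20), ("c1", 2, 500), ("c2", 1, 400), ("c2", 2, 450), ("c3", 1, 15)]

spend = defaultdict(lambda: {1: 0.0, 2: 0.0})
for cust, period, amount in purchases:
    spend[cust][period] += amount

# Label: migrated from small spending in period 1 to large spending in period 2.
# Features must be computed from period 1 only, so nothing derived from period 2
# (e.g., period-2 sales tax paid) can leak into the attributes.
labels = {c: (p[1] < 100 and p[2] >= 100) for c, p in spend.items()}
print(labels)   # {'c1': True, 'c2': False, 'c3': False}
```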
Data Visualizations
Picking the right visualization is key to seeing patterns
On the left is traffic by day – note the weekends (but hard to see patterns)
On the right is a heatmap, showing traffic colored from green to yellow to red
utilizing the cyclical nature of the week (going up in columns)
It’s easy to see the weekends, Labor Day on Sept 3, and the effect of Sept 11
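For illustration, here is how a similar day-of-week heatmap could be drawn today with matplotlib; the traffic numbers are synthetic and the colormap is one possible choice, not the original chart:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = 13 * 7                                  # about one quarter of daily visit counts
traffic = rng.normal(100_000, 8_000, days)
traffic[5::7] *= 0.6                           # synthetic Saturday dips
traffic[6::7] *= 0.55                          # synthetic Sunday dips

grid = traffic.reshape(-1, 7).T                # rows = day of week, columns = week
fig, ax = plt.subplots(figsize=(8, 2.5))
im = ax.imshow(grid, aspect="auto", cmap="RdYlGn")   # green-yellow-red colormap
ax.set_yticks(range(7))
ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
ax.set_xlabel("Week")
fig.colorbar(im, label="Daily visits")
plt.show()
```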
Model Visualizations
When we build models for prediction, it is
sometimes important to understand them
For MineSet™, we built visualizations for all
models
Here is one: Naïve-Bayes / Evidence model (movie)
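The evidence visualizer itself cannot be reproduced in a transcript, but the quantity it displays per attribute value is essentially each attribute's contribution to the Naive Bayes log-probability of a class; here is a rough sketch of computing those contributions on a toy categorical dataset (the data and the smoothing choice are illustrative assumptions, not MineSet's implementation):

```python
import math
from collections import Counter, defaultdict

# Toy categorical dataset: (attribute values, class label); purely illustrative
rows = [
    ({"channel": "web", "region": "west"}, "buyer"),
    ({"channel": "web", "region": "east"}, "non-buyer"),
    ({"channel": "pos", "region": "west"}, "buyer"),
    ({"channel": "pos", "region": "east"}, "non-buyer"),
    ({"channel": "web", "region": "west"}, "buyer"),
]

class_counts = Counter(label for _, label in rows)
value_counts = defaultdict(Counter)        # (attribute, value) -> class -> count
for attrs, label in rows:
    for a, v in attrs.items():
        value_counts[(a, v)][label] += 1

def evidence(attrs, cls, alpha=1.0, n_values=2):
    """Per-attribute log P(value | class) contributions: the 'evidence' bars such a
    visualization would show (Laplace smoothing; n_values=2 since each toy attribute
    happens to have two possible values)."""
    return {a: math.log((value_counts[(a, v)][cls] + alpha) /
                        (class_counts[cls] + alpha * n_values))
            for a, v in attrs.items()}

print(evidence({"channel": "web", "region": "west"}, "buyer"))
```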
A Real Technical Lesson:
Computing Confidence Intervals
In many situations we need to compute confidence intervals,
which are simply estimated as acc_h ± z*stdDev,
where acc_h is the estimated mean accuracy,
stdDev is the estimated standard deviation, and
z is usually 1.96 (for a 95% confidence interval)
This fails miserably for small amounts of data
For example: if you see three coin tosses that are all heads, the confidence interval for
the probability of heads would be [1, 1]
Use a more accurate formula that does not require using stdDev
(but still assumes Normality); a candidate formula is sketched below
It’s not used often because it’s more complex, but that’s what computers are for
See Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation
and Model Selection” in IJCAI-95
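The exact formula is not reproduced here; its description (no plugged-in stdDev, normality still assumed, see the IJCAI-95 reference) matches the standard score (Wilson) interval for a proportion, so the sketch below uses that interval as a plausible stand-in rather than as the slide's confirmed formula:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Score (Wilson) confidence interval for a proportion: invert the normal
    approximation instead of plugging in an estimated stdDev."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

def naive_interval(successes: int, n: int, z: float = 1.96):
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

print(naive_interval(3, 3))    # (1.0, 1.0): degenerate, as noted above
print(wilson_interval(3, 3))   # roughly (0.44, 1.0): far more sensible
```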
Challenges (I)
Finding a way to map business questions to
data transformations
Don Chamberlin wrote on the design of SQL: “What we
thought we were doing was making it possible for non-programmers
to interact with databases.” The SQL99
standard is now about 1,000 pages
Many operations that are needed for mining are not easy to
write in SQL
Explaining models to users
What are ways to make models more comprehensible?
How can association rules be visualized/summarized?
Challenges (II)
Dealing with “slowly changing dimensions”
Customer attributes change (people get married, their children
grow and we need to change recommendations)
Product attributes change, or are packaged differently.
New editions of books come out
Supporting hierarchical attributes
Deploying models
Models are built based on constructed attributes in the data
warehouse. Translating them back to attributes available at
the operational side is an open problem
For web sites, detecting bots/robots/spiders
Detection is based on heuristics (user agent, IP, JavaScript)
Challenges (III)
Analyzing and measuring long-term impact of
changes
Control/Treatment experiments give us short-term value.
How do we address long-term impact of changes?
For non-commerce sites, how do we measure user
satisfaction?
Example: users hit F1 for help in Microsoft Office and
execute a series of queries, browsing through documents.
How do we measure satisfaction other than through surveys?
Summary
The lessons and challenges are from e-commerce, but
likely to be applicable in other domains
Think about the problem end-to-end from
collection, transformations, reporting, visualizations,
modeling, taking action
Beware of hidden variables when concluding causality.
Think about Simpson’s paradox.
Conduct many controlled experiments (A/B tests)
because our intuition is poor
Build infrastructure for controlled experiments (this is what my team is
now doing at Microsoft)
Copy of talk at http://exp-platform.com