Transcript of a PowerPoint presentation

KDD Cup Survey
Xinyue Liu
Outline

- Nuts and Bolts of KDD Cup
- KDD Cup 97-99
- KDD Cup 2000
- Summary
About KDD Cup

A knowledge discovery and data mining tools competition held in conjunction with the KDD conferences. It aims at:

- Showcasing the best methods for discovering higher-level knowledge from data
- Helping to close the gap between research and industry
- Stimulating further KDD research and development
Statistics

- Participation in KDD Cup grew steadily, especially requests to access the data
- Average person-hours per submission: 204
- Maximum person-hours per submission: 910
- Use of commercial software grew from 44% (Cup 97) to 52% (Cup 98) to 77% (Cup 2000)
Access and Final Participation

[Bar chart: counts of NDAs signed (access to data) vs. final participants for Cup 97, Cup 98, Cup 99, and Cup 2000]
Algorithms

- Decision trees were the most widely tried and by far the most commonly submitted
KDD Cup 97

A classification task: predict direct mail response in the financial services industry.

- Winners
  - Charles Elkan, a PhD from UC San Diego, with his Boosted Naive Bayesian (BNB) classifier
  - Silicon Graphics, Inc. with their software MineSet
  - Urban Science Applications, Inc. with their software Gain, a Direct Marketing Selection System
BNB

- Boosting: learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor. Repeated for T rounds.
- BNB is representationally equivalent to a multilayer perceptron with a single hidden layer.
- Complexity: O(ef), where e is the number of examples and f the number of attributes.
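As a rough illustration of the boosted naive Bayes idea (not Elkan's original BNB implementation), the sketch below combines scikit-learn's AdaBoostClassifier with a Gaussian naive Bayes base learner on a synthetic dataset; all parameters are placeholders.

```python
# Minimal boosted-naive-Bayes sketch (illustrative only, not Elkan's BNB).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# T = 25 boosting rounds; each round reweights the examples the previous
# naive Bayes classifier misclassified.
# (Older scikit-learn versions call the first argument base_estimator.)
bnb = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=25)
bnb.fit(X_train, y_train)
print("held-out accuracy:", bnb.score(X_test, y_test))
```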
MineSet

- A KDD tool that combines data access, transformation, classification, and visualization.
KDD Cup 98

- URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
- A classification task: analyze responses to fund-raising mail for a non-profit organization.
- Winners
  - Urban Science Applications, Inc. with their software GainSmarts
  - SAS Institute, Inc. with their software Enterprise Miner
  - Quadstone Limited with their software Decisionhouse
GainSmarts

- GainSmarts: a feature selection expert system.
- First step: used logistic regression to assign each prospect a probability of donating (Pi).
- Second step: used linear regression to estimate the conditional donation amount of responding donors (Ai).
- Result (< 1% error): Prediction = Pi * Ai
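A minimal sketch of this two-stage idea: a logistic regression gives each prospect a response probability, a linear regression fitted on responders gives a conditional donation amount, and the two are multiplied. The data, features, and mailing cost below are hypothetical, not the actual KDD Cup 98 fields.

```python
# Two-stage expected-donation sketch (illustrative; not GainSmarts itself).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                    # hypothetical prospect features
responded = (rng.random(5000) < 0.05).astype(int)  # ~5% response rate
amount = np.where(responded == 1, rng.gamma(2.0, 10.0, 5000), 0.0)

# Stage 1: probability of donating, Pi
clf = LogisticRegression(max_iter=1000).fit(X, responded)
p = clf.predict_proba(X)[:, 1]

# Stage 2: conditional donation amount Ai, fitted on responders only
reg = LinearRegression().fit(X[responded == 1], amount[responded == 1])
a = reg.predict(X)

prediction = p * a        # Prediction = Pi * Ai
mail = prediction > 0.68  # mail only if expected donation exceeds an assumed mailing cost
```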
Enterprise Miner

- A data mining solution that addresses the entire data mining process.
- SEMMA process
  - Sample
  - Explore
  - Modify
  - Model
  - Assess
- Algorithms
  - Decision tree
  - Neural network
  - Regression
Decisionhouse

- Decisionhouse: an integrated modelling software suite by Quadstone.
  - Data exploration using visualization modules.
  - Decision trees and scorecards to model more complex tasks.
  - The final model is chosen by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve).
Results

[Profit/lift chart comparing GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse across mailing percentages, against the maximum possible profit line ($72,776 in profits with 4,873 mailed) and the mail-to-everyone solution ($10,560 in profits with 96,367 mailed)]
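To make the profit-curve comparison concrete, here is a small sketch (my own illustration, not taken from any winning entry) that ranks prospects by a model's score and tracks cumulative net profit as the mailing depth grows; the scores, outcomes, and mailing cost are placeholders.

```python
# Cumulative-profit (lift) curve sketch: mail prospects in descending score
# order and accumulate net profit. Illustrative values only.
import numpy as np

def profit_curve(scores, donations, mail_cost=0.68):
    """Return fraction mailed and cumulative net profit when mailing
    prospects in descending order of model score."""
    order = np.argsort(-scores)
    net = donations[order] - mail_cost   # profit contribution per mailed prospect
    return np.arange(1, len(scores) + 1) / len(scores), np.cumsum(net)

rng = np.random.default_rng(1)
donations = np.where(rng.random(10000) < 0.05, rng.gamma(2.0, 10.0, 10000), 0.0)
model_a = donations + rng.normal(0, 5, 10000)   # hypothetical better-informed model
model_b = rng.normal(0, 1, 10000)               # hypothetical near-random model

for name, s in [("model A", model_a), ("model B", model_b)]:
    frac, profit = profit_curve(s, donations)
    print(name, "peak profit:", round(float(profit.max()), 2))
```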
KDD Cup 99

- URL: www.cse.ucsd.edu/users/elkan/kdresults.html
- Problem
  - Same data set as KDD Cup 98
- Winners
  - SAS Institute Inc. with their software Enterprise Miner
  - Amdocs with their Information Analysis Environment
Software

- SAS: used a two-stage model consisting of two multi-layer perceptron (MLP) neural network models.
- Amdocs: used its own Information Analysis Environment, which allows modeling the value and class membership simultaneously. The algorithm used was a hybrid logistic regression model.
KDD Cup 2000

- URL: www.ecn.purdue.edu/KDDCUP/
- Sponsored by
  - Purdue University
  - Blue Martini Software
Data Set

- Data collected from Gazelle.com, a legwear and legcare web retailer
- Pre-processed
  - Training set: two months
  - Test sets: one month
- Data collected includes:
  - Click streams
  - Order information
  - Registration forms
Problems

- The goal: design models to support web-site personalization and to improve the profitability of the site by increasing customer response.
- Questions: given a set of page views,
  1. will the visitor view another page on the site or leave?
  2. which product brand will the visitor view in the remainder of the session?
  3. characterize heavy spenders
  4. characterize killer pages
  5. characterize which product brand a visitor will view in the remainder of the session
Evaluation

- Accuracy/score was measured for the two questions with test sets.
- Insight questions were judged with the help of retail experts from Gazelle and Blue Martini:
  - A list of insights was created from all participants
  - Each insight was given a weight
  - Each participant was scored on all insights
  - Additional factors: presentation quality and correctness
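A toy sketch of the weighted insight-scoring scheme described above (the insight names and weights are made up; the real judging also weighed presentation quality and correctness):

```python
# Toy weighted insight scoring (hypothetical insights and weights).
insight_weights = {"crawler sessions": 3.0, "referrer effect": 2.0, "AOL layout": 1.0}

def participant_score(reported_insights):
    """Sum the weights of the judged insights this participant reported."""
    return sum(insight_weights.get(i, 0.0) for i in reported_insights)

print(participant_score({"crawler sessions", "referrer effect"}))  # 5.0
```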
The Winners

- Questions 1 & 5 winner: Amdocs
- Questions 2 & 3 winner: Salford Systems
- Question 4 winner: e-steam
Software (Amdocs)

- Exploratory data analysis: SAS
- Classification tree: Amdocs Business Insight Tool
  - Decision tree
  - Rules extraction
  - Modeling
  - Combining models
Scheme

[Flow diagram: all data is routed before scoring. Bot/crawler sessions are scored 0 for all; sessions matching a partial-testing pattern are scored 1 for all; one-click sessions are scored by a small model of 5 rules (0 or 1); multi-click sessions go to the main model. The individual score models feed the final prediction model.]
Main Model

[Diagram: the main model combines a decision tree component (5 trees built on 34,000 cases) with a rule generator (1,466 rules, of which 111 are "continue" rules); the resulting rules feed the sub-models (best rule, hybrid, merged rules), whose scores are ensembled into the score model.]
Sub-models

- Best rule: chooses the most accurate rule satisfied by each record.
- Merged rules: logistic regression on the rule set defines a score for each record as a combination of the rules the record satisfies.
- Hybrid model: logistic regression on the rule set plus the raw field values combines to define a score for each record.
- Each model captures a different aspect of the overall behavior in the data; combining or ensembling the models provides the best prediction results.
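A rough sketch of the merged-rules idea as described here (not Amdocs' actual system): each generated rule becomes a 0/1 indicator feature per record, and a logistic regression over those indicators produces the score; the hybrid variant appends raw field values. All rules, fields, and data below are hypothetical.

```python
# "Merged rules" sketch: logistic regression over rule-indicator features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
records = {
    "page_views": rng.integers(1, 30, n),
    "session_seconds": rng.integers(1, 1200, n),
    "is_returning": rng.integers(0, 2, n),
}
left_site = (rng.random(n) < 0.5).astype(int)  # hypothetical target: did the visitor leave?

# Each rule maps the record batch to a 0/1 indicator column.
rules = [
    lambda r: (r["page_views"] <= 1).astype(int),        # one-click session
    lambda r: (r["session_seconds"] > 600).astype(int),  # long session
    lambda r: (r["is_returning"] == 1).astype(int),      # returning visitor
]
R = np.column_stack([rule(records) for rule in rules])

merged = LogisticRegression().fit(R, left_site)          # merged-rules score
# Hybrid variant: append raw field values to the rule indicators before fitting.
H = np.column_stack([R, records["page_views"], records["session_seconds"]])
hybrid = LogisticRegression(max_iter=1000).fit(H, left_site)
```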
Software (Salford)

- CART: a decision tree tool that automatically searches for and isolates significant patterns and relationships
- MARS: a multivariate non-parametric regression procedure
- HotSpotDetector
- TreeNet
CART

- Binary recursive partitioning.
- Key elements:
  - Splitting rules
    - Brute-force search over all possible splits for all variables
    - Rank each splitting rule on the basis of a quality-of-split criterion (default: Gini)
  - Recursion: split until further splitting is impossible or is stopped
  - Class assignment
    - Plurality rule
    - Assigned to every node, whether it is terminal or not
  - Pruning trees: growing does not stop in the middle; the full tree is grown and then pruned back
  - Testing: the best sub-tree is the one with the lowest error
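A small sketch of these ingredients using scikit-learn's decision tree, which approximates CART but is not Salford's product: Gini splitting, growing the full tree, then keeping the cost-complexity-pruned sub-tree with the lowest held-out error. The data is synthetic.

```python
# CART-style workflow sketch with scikit-learn (not Salford's CART itself).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree with the Gini quality-of-split criterion (no early stopping).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)

# Prune back: walk the cost-complexity pruning path and keep the sub-tree
# with the lowest error (highest accuracy) on held-out data.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(
    (DecisionTreeClassifier(criterion="gini", ccp_alpha=a, random_state=0)
     .fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_test, y_test),
)
print("best sub-tree accuracy:", best.score(X_test, y_test))
```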
MARS

- Automatic variable search
- Automatic variable transformation
- Automatic limited interaction searches
- Variable nesting
- Built-in testing regimens for model selection parameters
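To give a feel for what MARS fits, here is a tiny numpy illustration of its core building block, hinge (piecewise-linear) basis functions combined by least squares. This is my own sketch, not Salford's MARS: a real implementation also performs forward knot search and backward pruning, which are omitted here.

```python
# Hinge-basis regression sketch: MARS-style basis functions max(0, x - knot)
# and max(0, knot - x) fitted by least squares. Knots are fixed quantiles
# here; real MARS searches knots and prunes terms automatically.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 500)
y = np.sin(x) + rng.normal(0, 0.1, 500)       # toy nonlinear target

knots = np.quantile(x, [0.25, 0.5, 0.75])
basis = [np.ones_like(x)]                     # intercept column
for k in knots:
    basis.append(np.maximum(0.0, x - k))      # right hinge at knot k
    basis.append(np.maximum(0.0, k - x))      # left hinge at knot k
B = np.column_stack(basis)

coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares fit of the basis expansion
pred = B @ coef
print("training RMSE:", float(np.sqrt(np.mean((pred - y) ** 2))))
```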
Insights (Heavy Spenders)

Some of the good insights:
- Referrers: establish ad policy based on conversion rates, not click-throughs
- Not an AOL user (browser window too small for the layout)
- Referring site traffic changed dramatically over time
- Came to the site from a print ad or news, not friends & family
- Very high and very low income
- Geographic: Northeast U.S. states
- Repeat visitors
Insights (Who Leaves?)

Some of the good insights:
- Crawlers and bots accounted for 16% of sessions
- Long processing time (> 12 seconds) implies high abandonment
- Referring sites: mycoupons visitors have long sessions, shopnow.com visitors are prone to exit quickly
- Returning visitors' probability of continuing is double
- Views of specific products (Oroblue, Levante) cause abandonment
- Probability of leaving decreases with page views
- Free Gift and Welcome templates on the first three pages encouraged visitors to stay at the site
Insights (Brand View)

Some good insights:
- Referrer URL is a great predictor:
  - Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan
  - mycoupons.com, tripod, and deal-finder are referrers for American Essentials
- Previous views of a product imply later views
Summary

- Data mining requires background knowledge and access to business users.
- Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight.
- Web mining is challenging: crawlers/bots, frequent site changes, etc.
- KDD Cup is an excellent source for learning state-of-the-art KDD techniques.
- KDD Cup data is available for research and education.
References

Elkan, C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD.

Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html

Urban Science (1998). Urban Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98release.html

Georges, J. & Milley, A. (1999). KDD'99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf

Rosset, S. & Inger, A. (1999). KDD-Cup 99: Knowledge Discovery in a Charitable Organization's Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc
References (Cont.)

Sebastiani, P., Ramoni, M. & Crea, A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf

Inger, A., Vatnik, N., Rosset, S. & Neumann, E. (2000). KDD-Cup 2000: Question 1 Winner's Report. Retrieved March 18, 2001 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt

Neumann, E., Vatnik, N., Rosset, S., Duenias, M., Sasson, I. & Inger, A. (2000). KDD-Cup 2000: Question 5 Winner's Report. Retrieved March 18, 2001 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt

Salford Systems white papers: http://www.salford-systems.com/whitepaper.html

Summary talk presented at KDD (2000): http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt