
KDD Cup 2009
Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop
June 28, 2009
The Organizing Team
KDD Cup 2009 Organizing Team
Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien
Beta testing and proceedings editor:
• Gideon Dror
Web site design:
• Olivier Guyon (MisterP.net, France)
Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel
Thanks to our sponsors: Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS.
Record KDD Cup Participation
[Figure: bar chart of KDD Cup participation by year, 1997-2009]

Year   # Teams
1997        45
1998        57
1999        24
2000        31
2001       136
2002        18
2003        57
2004       102
2005        37
2006        68
2007        95
2008       128
2009       453
Participation Statistics
• 1299 registered teams
• 7865 entries
• 46 countries:
Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay
A worldwide operator
• One of the main telecommunication operators in the world
• Providing services to more than 170 million customers over five continents
• Including 120 million under the Orange brand
KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)
• Three marketing tasks: predict the propensity of customers
  – to switch provider: Churn
  – to buy new products or services: Appetency
  – to buy upgrades or new options proposed to them: Up-selling
• Objective: improve the return on investment (ROI) of marketing campaigns
  – Increase the efficiency of the campaign for a given campaign cost
  – Decrease the campaign cost for a given marketing objective
• Better prediction leads to better ROI (toy illustration below)
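As a rough illustration of that last point (toy numbers of my own, not Orange figures), the following sketch shows how a model that ranks true churners higher converts the same contact budget into more retained customers, and hence a better campaign ROI:

```python
import numpy as np

# Toy illustration: a campaign contacts the N customers with the
# highest predicted churn scores. A better model ranks more true
# churners near the top, so the same budget yields a higher ROI.
rng = np.random.default_rng(0)
n = 100_000
churner = rng.random(n) < 0.07            # ~7% of customers would churn

def campaign_roi(scores, budget=10_000, cost=2.0, value=40.0):
    """ROI of contacting the `budget` highest-scored customers."""
    top = np.argsort(scores)[::-1][:budget]
    gain = value * churner[top].sum()      # value retained from true churners
    spend = cost * budget                  # contact cost
    return (gain - spend) / spend

weak = churner + rng.normal(0, 2.0, n)    # very noisy score: weak model
strong = churner + rng.normal(0, 0.5, n)  # less noisy score: strong model
print(f"weak model ROI:   {campaign_roi(weak):+.2f}")
print(f"strong model ROI: {campaign_roi(strong):+.2f}")
```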
Data, constraints and requirements
• Input data
  – Relational databases
  – Numerical or categorical
  – Noisy
  – Missing values
  – Heavily unbalanced distribution
• Train data
  – Hundreds of thousands of instances
  – Tens of thousands of variables
• Deployment
  – Tens of millions of instances
• Train and deploy requirements
  – About one hundred models per month
  – Fast data preparation and modeling
  – Fast deployment
• Model requirements
  – Robust
  – Accurate
  – Understandable
• Business requirement
  – Return on investment for the whole process
In-house system: from raw data to scoring models
• Data warehouse
  – Relational database
  [Figure: conceptual data model ("Modèle Conceptuel de Données", model MCD PAC_v4, dated 14/06/2005) of the warehouse, with entities such as customer (Tiers), household (Foyer), address (Adresse), operator (Opérateur), offer (Offre), product & service (Produit & Service), usage record (Compte Rendu Usage), invoice (Facture) and billing account (Compte Facturation)]
• Data mart
  – Star schema
  – Fed from customer, services, products, call details, …
• Feature construction (see sketch below)
  – PAC technology
  – Generates tens of thousands of variables
  – Example features: customer id, zip code, nb calls/month, nb calls/hour, nb calls per (month, weekday, hour, service)
• Data preparation and modeling
  – Khiops technology
  – Produces the scoring model
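For illustration only (this is not Orange's PAC code, and the table and column names are hypothetical): a minimal pandas sketch of the kind of per-customer aggregates such a feature-construction step derives from a call-details table:

```python
import pandas as pd

# Hypothetical call-detail records, one row per call.
calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(["2009-01-05 09:10", "2009-01-12 19:30",
                                 "2009-01-06 08:45", "2009-02-02 14:00",
                                 "2009-02-10 21:15"]),
    "service": ["voice", "sms", "voice", "voice", "data"],
})
calls["month"] = calls["timestamp"].dt.to_period("M")
calls["weekday"] = calls["timestamp"].dt.day_name()
calls["hour"] = calls["timestamp"].dt.hour

# One row per customer: average number of calls per month ...
nb_calls_month = (calls.groupby(["customer_id", "month"]).size()
                       .groupby(level="customer_id").mean()
                       .rename("nb_calls_per_month"))

# ... and number of calls per (month, weekday, hour, service),
# flattened into one named column per combination.
detailed = (calls.groupby(["customer_id", "month", "weekday", "hour", "service"])
                 .size()
                 .unstack(["month", "weekday", "hour", "service"], fill_value=0))
detailed.columns = ["nb_calls_" + "_".join(map(str, c)) for c in detailed.columns]

features = detailed.join(nb_calls_month)
print(features)
```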
Design of the challenge
• Orange business objective
  – Benchmark the in-house system against state-of-the-art techniques
• Tasks
  – Three representative marketing tasks
• Requirements
  – Fast data preparation and modeling (fully automatic)
  – Fast deployment
  – Accurate, robust, understandable
• Data
  – Data store: not an option
  – Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
  – Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
Data sets extraction and preparation
• Input data
  – 10 relational tables
  – A few hundred fields
  – One million customers
• Instance selection
  – Resampling given the three marketing tasks
  – Keep 100 000 instances, with less unbalanced target distributions
• Variable construction
  – Using PAC technology
  – 20 000 constructed variables to get a tabular representation
  – Keep 15 000 variables (discard constant variables)
  – Small track: subset of 230 variables related to classical domain knowledge
• Anonymization (sketched below)
  – Discard variable names, discard identifiers
  – Randomize order of variables
  – Rescale each numerical variable by a random factor
  – Recode each categorical variable using random category names
• Data samples
  – 50 000 train and test instances sampled randomly
  – 5 000 validation instances sampled randomly from the test set
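A minimal sketch of those anonymization steps, as I read them from the slide (this is not the organizers' actual tooling):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four anonymization steps listed above (a sketch)."""
    out = df.copy()
    # Discard variable names, then randomize the order of the variables.
    out.columns = [f"Var{i + 1}" for i in range(out.shape[1])]
    out = out[rng.permutation(out.columns)]
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Rescale each numerical variable by a random factor.
            out[col] = out[col] * rng.uniform(0.5, 2.0)
        else:
            # Recode each categorical variable using random category names.
            cats = out[col].dropna().unique()
            mapping = {c: f"C{j}" for j, c in enumerate(rng.permutation(cats))}
            out[col] = out[col].map(mapping)
    return out

# Tiny hypothetical example.
df = pd.DataFrame({"age": [33, 41, 29], "plan": ["gold", "silver", "gold"]})
print(anonymize(df))
```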
Scientific and technical challenge
• Scientific objective
  – Fast data preparation and modeling: within five days
  – Large scale: 50 000 train and test instances, 15 000 variables
  – Heterogeneous data
    – Numerical with missing values
    – Categorical with hundreds of values
    – Heavily unbalanced distribution
• KDD social meeting objective
  – Attract as many participants as possible
    – Additional small track and slow track
    – Online feedback on validation dataset
    – Toy problem (only one informative input variable)
  – Leverage challenge protocol overhead
    – One month to explore descriptive data and test the submission protocol
  – Attractive conditions
    – No intellectual property conditions
    – Money prizes
Business impact of the challenge
• Bring Orange datasets to the data mining community
  – Benefit for the community: access to challenging data
  – Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs
• Evaluate the Orange in-house system
  – High number of participants and high quality of the results
  – Orange in-house results:
    – Improved by a significant margin when leveraging all business requirements
    – Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness and understandability)
  – Need to study the best challenge methods to gain more insight
KDD Cup 2009: Result Analysis
[Figure legend for the following plots: best result (over the period considered in each figure); in-house system (downloadable at www.khiops.com); baseline (naïve Bayes, sketched below)]
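A baseline of that kind, and the test AUC metric used throughout the following figures, can be reproduced in a few lines; a minimal sketch assuming scikit-learn and stand-in data (this is not the organizers' code or the challenge data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))              # stand-in features
y = (X[:, 0] + rng.normal(0, 2, 50_000)) > 2   # unbalanced binary target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Naive Bayes baseline, evaluated by area under the ROC curve.
nb = GaussianNB().fit(X_tr, y_tr)
scores = nb.predict_proba(X_te)[:, 1]          # class-1 probability
print("test AUC:", roc_auc_score(y_te, scores))  # 0.5 = random, 1.0 = perfect
```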
[Figure: Overall – Test AUC – Fast: good results were obtained very quickly; best results (on each dataset) vs. all submissions]
In-house (Orange) system:
• No parameters
• On one standard laptop (single processor)
• Treating the three tasks as three separate problems
[Figure: Overall – Test AUC – Fast: very fast good results; small improvement after the first day (83.85 → 84.93)]
[Figure: Overall – Test AUC – Slow: very small improvement after the 5th day (84.93 → 85.2); improvement due to unscrambling?]
[Figure: Overall – Test AUC – all submissions]
• 23.24% of the submissions (with AUC > 0.5) scored below the baseline
• 84.75% of the submissions (with AUC > 0.5) scored below the in-house system
• 15.25% of the submissions (with AUC > 0.5) scored above the in-house system
[Figure: Overall – Test AUC: 'correlation' of test vs. validation AUC and test vs. train AUC; annotated outliers suggest random values submitted (?), and boosting methods or train targets submitted (overfitting)]
[Figure: Overall – Test AUC after 12 hours, 24 hours, 5 days and 36 days]
Overall – Test AUC
• Difference between the best result at the end of the first day and the best result at the end of the 36 days: Δ = 1.35%
• Time to adjust model parameters? Time to train ensemble methods? Time to find more processors? Time to test more methods? Time to unscramble? …
[Figure: test AUC at 12 hours vs. test AUC at 36 days]
Test AUC = f(time)
[Figure: per-task test AUC, day ∈ [0:36]]
• Churn: Δ = 1.84% (harder?)
• Appetency: Δ = 1.38%
• Up-selling: Δ = 0.11% (easier?)
Δ = difference between the best result at the end of the first day and the best result at the end of the 36 days
Correlation: Test AUC vs. Valid AUC (5 days)
[Figure: per-task scatter of test vs. validation AUC, day ∈ [0:5]; churn: harder? up-selling: easier?]
Correlation: Test AUC vs. Train AUC (36 days)
[Figure: per-task scatter of test vs. train AUC, day ∈ [0:36]; hard to draw any conclusion…]
Histogram: Test AUC / Valid AUC (day ∈ [0:5] vs. day ∈ ]5:36])
[Figure: per-task histograms, day ∈ [0:36]] Does knowledge (parameters?) found during the first 5 days help afterwards?
[Figure: per-task histograms, day ∈ ]5:36]] YES!
Fact Sheets: Preprocessing & Feature Selection
PREPROCESSING (overall usage = 95%)
[Bar chart: percent of participants using each technique; top steps sketched below]
• Replacement of the missing values
• Discretization
• Normalizations
• Grouping modalities
• Other preprocessing
• Principal Component Analysis

FEATURE SELECTION (overall usage = 85%)
[Bar chart: percent of participants using each technique]
• Feature ranking
• Filter method
• Other FS
• Forward / backward wrapper
• Embedded method
• Wrapper with search
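A minimal scikit-learn sketch of the three most-used preprocessing steps above: missing-value replacement, discretization, and grouping of rare modalities (my illustration, not any participant's code):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical mixed-type data with missing values and rare categories.
df = pd.DataFrame({
    "num": [1.0, None, 3.5, 10.0, None, 2.2],
    "cat": ["a", "b", "a", "rare1", "rare2", "b"],
})

# Replacement of the missing values (here: by the mean).
df[["num"]] = SimpleImputer(strategy="mean").fit_transform(df[["num"]])

# Discretization of the numerical variable into equal-frequency bins.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
df["num_bin"] = disc.fit_transform(df[["num"]]).ravel()

# Grouping modalities: collapse categories seen fewer than 2 times.
counts = df["cat"].value_counts()
df["cat"] = df["cat"].where(df["cat"].map(counts) >= 2, other="OTHER")
print(df)
```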
Fact Sheets: Classifier
CLASSIFIER (overall usage = 93%)
[Bar chart: percent of participants using each classifier]
• Decision tree…
• Linear classifier
• Non-linear kernel
• Other classifiers
• Neural network
• Naïve Bayes
• Nearest neighbors
• Bayesian network
• Bayesian neural network
Notes:
• About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss
• Less than 50% used regularization (20% 2-norm, 10% 1-norm)
• Only 13% used unlabeled data
Fact Sheets: Model Selection
MODEL SELECTION (overall usage = 90%)
[Bar chart: percent of participants using each method; the top method is sketched below]
• 10% test
• K-fold or leave-one-out
• Out-of-bag estimation
• Bootstrap estimation
• Other model selection
• Other cross-validation
• Virtual leave-one-out
• Penalty-based
• Bi-level
• Bayesian
Notes:
• About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other)
• About 10% used unscrambling
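A minimal scikit-learn sketch of the dominant model-selection procedure above, K-fold cross-validation comparing candidate models by validation AUC (my illustration, on stand-in data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 15))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 2000)) > 1.0

# Pick between candidate models by 5-fold cross-validated AUC.
candidates = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in candidates.items():
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean CV AUC = {aucs.mean():.3f} (+/- {aucs.std():.3f})")
```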
Fact Sheets: Implementation
[Bar charts: percent of participants per category]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Parallelism: none, multi-processor, run in parallel
• Operating system: Windows, Linux/Unix, Mac
• Software platform: C/C++, Matlab, Java, other (R, SAS)
Winning methods
Fast track:
– IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
– ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision-tree technology; bagging also used.
– David Slate & Peter Frey, USA: Grouping of modalities / discretization, filter feature selection, ensemble of decision trees.
Slow track:
– University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss (see the sketch below).
– Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
– National Taiwan University (+): Average of 3 classifiers: (1) joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective naïve Bayes.
(+: small-dataset unscrambling)
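A minimal sketch of that slow-track recipe, boosted classification trees with shrinkage and Bernoulli (logistic) loss scored by AUC, using scikit-learn on stand-in data (my illustration, not the Melbourne team's code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))
y = (X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(0, 1, 10_000)) > 0.5
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(
    loss="log_loss",        # Bernoulli / logistic loss
    learning_rate=0.05,     # shrinkage
    n_estimators=500,
    max_depth=3,            # shallow classification trees as weak learners
).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1]))
```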
Conclusion

• Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
  – A problem of real industrial interest with challenging scientific and technical aspects
  – Prizes
• Lessons learned:
  – Do not underestimate the participants: five days were given for the fast challenge, but a few hours sufficed for some participants.
  – Ensemble methods are effective.
  – Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and many missing values.