YES, but it depends

Science in Business Data Mining?

Background: support managerial decision making

Is there a science to data mining (with CI-methods)?
YES, but it depends
(and it may be empirical wizardry driven by efficiency rather than effectiveness!)
Outline
1. Data Mining in Business & Management
2. Rules established in business practice vs. data mining?
   • Statistics vs. data-driven modelling
   • A personal view
3. How to develop meta-knowledge
Sven F. Crone,
Lancaster University Management School
Research Centre for Forecasting
Business Data Mining?
[Figure: customer lifecycle from prospect to new, established (initial, high-value, high-potential, low-value) and former customer (voluntary vs. forced churn / resignation), annotated with the data mining task at each stage: extrapolative forecasting (incl. judgement), market experiments and intentions for aggregate demand and adoption; marketing response, direct marketing and credit scoring for targeting and acquiring new customers; churn prediction for retention; adapted from Berry and Linoff (2004) and Olafson et al. (2006)]

Main areas for Data Mining:
• Finance: Credit risk (personal & corporate)
• Marketing: Customer Relationship Management (= Direct Marketing, Database Marketing)
Best practices

Credit Scoring (a minimal pipeline is sketched below):
• Small & balanced classes: use ca. 2,000 cases of the minority class, undersample the majority
• Discretise all (!) variables
• Binary dummies / WOE to capture non-linearity
• Use logistic regression
• Extensive use of expert domain knowledge

Cross-Selling:
• Large & imbalanced sample
• Use large sample sizes
• Original (imbalanced) class distribution
• …

GAP → an efficient solution ≠ the best solution

A personal view:
• Data selection is best using prior domain knowledge (use filters)
• Pre-processing is more important than the method [Crone et al. 2006; Keogh 2002]
• (Balanced) sampling & pre-processing are method dependent
• Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
• Flat maximum effect [Lovie & Lovie, 1986]
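To make the credit scoring column above concrete, the following is a minimal sketch (Python) of that pipeline: undersample the majority class to a balanced sample, discretise a continuous variable into quantile bins, encode each bin by its weight of evidence (WOE), and fit a logistic regression. The synthetic data, the variable name, and the bin count are illustrative assumptions, not taken from the talk.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, imbalanced credit data (assumption for illustration):
# y = 1 marks the rare "bad" (default) class.
n_good, n_bad = 20_000, 2_000
income = np.concatenate([rng.normal(40, 10, n_good),
                         rng.normal(30, 10, n_bad)])
y = np.concatenate([np.zeros(n_good), np.ones(n_bad)]).astype(int)

# 1) Undersample the majority class to a small, balanced sample
#    (ca. 2,000 cases per class, as in the best practice above).
good_idx = rng.choice(np.where(y == 0)[0], size=n_bad, replace=False)
idx = np.concatenate([good_idx, np.where(y == 1)[0]])
x_s, y_s = income[idx], y[idx]

# 2) Discretise into 5 quantile bins, then replace each bin by its
#    weight of evidence: WOE = ln( P(bin|good) / P(bin|bad) ).
cuts = np.quantile(x_s, [0.2, 0.4, 0.6, 0.8])
bin_id = np.digitize(x_s, cuts)                 # bin index 0..4
woe = np.zeros(cuts.size + 1)
for b in range(cuts.size + 1):
    p_good = ((bin_id == b) & (y_s == 0)).sum() / (y_s == 0).sum()
    p_bad = ((bin_id == b) & (y_s == 1)).sum() / (y_s == 1).sum()
    woe[b] = np.log(p_good / p_bad)             # sketch: assumes no empty cells
x_woe = woe[bin_id].reshape(-1, 1)              # WOE-encoded design matrix

# 3) Fit logistic regression on the WOE-encoded variable.
model = LogisticRegression().fit(x_woe, y_s)
print("coefficient:", model.coef_[0, 0], "intercept:", model.intercept_[0])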
How to derive (meta-)knowledge?

Lessons from other disciplines: Time Series Forecasting

More "evidence-based methods" [Armstrong 2000]:
Empirical evidence: conditions under which methods perform well (multiple hypotheses)
• Multiple out-of-sample evaluations (≠ single fold, one origin; see the rolling-origin sketch after this list)
• Multiple homogeneous datasets from one domain
• Use of valid benchmark methods & unbiased error measures
• Honour the domain & decision context (active learning, cost-sensitive)
• Studies must allow replication: document all steps / parameters
• Domain-specific competitions (valid & reliable)
• Replications
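A minimal sketch (Python) of the rolling-origin evaluation referenced in the first bullet: the forecast origin moves through the series, forecasts are produced at each origin, and errors are averaged and compared against a naive benchmark rather than judged on a single train/test split. The drift "model", the synthetic series, and all parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(0.2, 1.0, 200))   # synthetic trending series

horizon, first_origin = 12, 120
model_err, naive_err = [], []

# Roll the forecast origin forward in steps of 6 observations.
for origin in range(first_origin, len(y) - horizon, 6):
    train, test = y[:origin], y[origin:origin + horizon]

    # "Model": drift method (random walk with average historical drift).
    drift = (train[-1] - train[0]) / (len(train) - 1)
    fc_model = train[-1] + drift * np.arange(1, horizon + 1)

    # Benchmark: naive forecast (repeat the last observed value).
    fc_naive = np.full(horizon, train[-1])

    model_err.append(np.mean(np.abs(test - fc_model)))
    naive_err.append(np.mean(np.abs(test - fc_naive)))

# Averaging MAE over many origins gives a far more reliable picture than
# a single fixed origin; a ratio < 1 means the model beats the benchmark.
print("MAE ratio (model / naive):",
      np.mean(model_err) / np.mean(naive_err))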
STOP FINE-TUNING / MARGINALLY EXTENDING A SINGLE METHOD ON A SINGLE TOY DATASET
Develop solutions for the domain (why make life harder?)
Where to start? → Follow the high-impact approach!
• Identify the most prominent application domains (e.g. credit risk)
• Select promising application domains for CI-methods
• Get a corporate sponsor & run a competition
• Analyse the conditions (!) using meta-studies (a minimal rank-aggregation sketch follows this list)
• Embed the findings as methodology in SOFTWARE
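As a hypothetical illustration of such a meta-study, the sketch below (Python, with made-up error figures) ranks each method per dataset and reports mean ranks across several homogeneous datasets; this kind of aggregate shows under which conditions a method wins rather than a single headline accuracy. Method names and numbers are assumptions for illustration only.

import numpy as np

methods = ["logistic", "neural net", "tree"]
# rows = methods, columns = homogeneous datasets from one domain
# (all error values fabricated purely for illustration)
errors = np.array([[0.21, 0.19, 0.25, 0.22],
                   [0.20, 0.21, 0.23, 0.24],
                   [0.25, 0.26, 0.24, 0.23]])

# Rank methods per dataset (1 = lowest error), then average the ranks.
ranks = errors.argsort(axis=0).argsort(axis=0) + 1
for name, r in zip(methods, ranks.mean(axis=1)):
    print(f"{name}: mean rank {r:.2f}")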
Literature

• Ian Ayres (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam
• Thomas H. Davenport, Jeanne G. Harris (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press
• Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research – a Review, JORS, forthcoming
• Finlay, Crone (under review) Sampling issues in Credit Scoring – the effect of sample size and sample distribution on predictive accuracy, EJOR
• Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD '02 & Data Mining Journal