IRMAC: Data Warehouse SIG
November 5, 2002
Data Mining
A Practical Look at Data Preparation
Jason Brown
Cognicase Inc.
Agenda
• Crash Course in Data Mining
– What
– Why
– How
• The virtuous cycle
– Data Preparation
• Case Study
– Background
– Going through the cycle
• Data Preparation
• Q&A
2
The Crash Course
• What
• Why
• How
3
Definitions
Data Mining:
The process of exploration and analysis, by
automatic or semi-automatic means, of large
quantities of data in order to discover
meaningful patterns and rules.
Knowledge Discovery
Data Mining is not Data Warehousing, OLAP etc.
4
Definitions
Modeling:
• Not an ER type Data Model
• A data mining model is computational, full of
algorithms
• A model can be descriptive or predictive.
– A descriptive model helps in understanding
underlying processes or behavior.
– A predictive model uses known values (input) to
predict an unknown value (output)
5
Two Types of Data Mining
• Directed
– Know specifically what we are looking for
• Who is likely to respond to our offer?
• What are our customers going to be worth to us over their
lifetime?
– Model is a Black Box (Input → Output)
6
Two Types of Data Mining
• Undirected
– Not exactly sure what we are looking for
• How should we define our Customer Segments?
• What is interesting about all of our point of sale data?
– Model is a Transparent Box (Input → Output)
7
Modeling Techniques
Decision Trees
Rule Induction
If …….. Then ……..
Neural Networks
Clustering
Nearest Neighbour
8
Modeling Techniques
Decision Trees
• The tree is built based on the input of a training
data set
– Training Data Set is based on historical data
– Oversample the data that reflects your question
• Each record of the Model Set is run through the
branches of the tree until the record reaches a leaf
9
Modeling Techniques
Decision Trees
[Example tree: first split on Age < 23, then Income < 12 000, then Male?, with leaf propensities such as 28% and 37%]
10
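To make the example tree above concrete, here is a minimal sketch in Python of training and applying a decision tree; the column names (age, income, is_male, bought_coke) and the tiny training set are hypothetical, loosely following the slide.

# Minimal sketch: fit a decision tree on a small training set, then run a
# record down the branches to a leaf. Data and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "age":         [19, 45, 22, 31, 18, 52],
    "income":      [9000, 40000, 15000, 28000, 11000, 60000],
    "is_male":     [1, 0, 1, 0, 1, 0],
    "bought_coke": [1, 0, 1, 0, 1, 0],
})

X, y = train[["age", "income", "is_male"]], train["bought_coke"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each new record ends up in a leaf, which gives the propensity for that segment.
new_record = pd.DataFrame({"age": [20], "income": [11000], "is_male": [1]})
print(tree.predict_proba(new_record))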
Modeling Techniques
Neural Networks
• Neural networks are a nonlinear model, similar
to a ‘brain’.
• The network is built based on the input of a
training set.
• Model sets run through this network will return
accurate results based on the patterns identified
in the training set.
• Very Complex
11
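As a rough illustration only, the sketch below fits a small neural network with scikit-learn's MLPClassifier; the inputs, the single hidden layer and the data are arbitrary assumptions, not anything from the presentation.

# Minimal sketch: train a small neural network on a toy training set.
# Column meanings (age, income) and the target are hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[19, 9000], [45, 40000], [22, 15000],
              [31, 28000], [18, 11000], [52, 60000]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# Scale the inputs first: the network's weights are sensitive to magnitudes.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                  random_state=0))
net.fit(X, y)
print(net.predict_proba(X[:1]))  # propensity based on patterns in the training set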
Modeling Techniques
Clustering
• Clustering finds groups
of records that are
similar.
• For example, customers
can be clustered by:
– income
– age
– ytd revenue
12
Modeling Techniques
Clustering
[Example: two clusters with the same profile (male, income < 12 000, age < 23), one of Coke buyers and one of non-Coke buyers]
13
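A minimal k-means sketch of this idea follows; k-means is just one clustering technique, and the customer data and the choice of two clusters are hypothetical.

# Minimal sketch: cluster customers by income, age and YTD revenue with k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "income":      [9000, 40000, 15000, 28000, 11000, 60000],
    "age":         [19, 45, 22, 31, 18, 52],
    "ytd_revenue": [120, 900, 300, 650, 150, 1400],
})

# Standardize first so no single column dominates the distance calculation.
scaled = StandardScaler().fit_transform(customers)
customers["cluster"] = KMeans(n_clusters=2, n_init=10,
                              random_state=0).fit_predict(scaled)
print(customers)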
Modeling Techniques
Nearest Neighbour
• Model is built based on the
input of a training set.
• Classifies a record by
calculating the distances
between the record criteria
and the training data set
• Then it assigns the record to
the class that is most
common among its nearest
neighbours
14
Modeling Techniques
Nearest Neighbour
[Example: records plotted by Income, Gender and Age, each labelled ‘Bought Coke’ or ‘Did Not Buy Coke’]
15
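A minimal nearest-neighbour sketch follows; the plotted variables (income, gender, age) come from the slide, while the data values and the choice of three neighbours are hypothetical.

# Minimal sketch: classify a new record by the class most common among its
# nearest neighbours in the training set.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[9000, 1, 19], [40000, 0, 45], [15000, 1, 22],
              [28000, 0, 31], [11000, 1, 18], [60000, 0, 52]],
             dtype=float)          # income, gender, age
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = bought Coke, 0 = did not

scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

# Distance is measured between the new record and the training records.
new_record = scaler.transform([[13000, 1, 20]])
print(knn.predict(new_record))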
Modeling Techniques
Rule Induction
• A technique that infers generalizations from the
information in the data
IF age < 19 AND purchase is coke THEN 40%
purchase chips
• Describes the data, allows us to visualize what is
going on
16
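The sketch below simply measures how well the example rule above holds in a hypothetical transaction table; real rule-induction tools search for such rules automatically.

# Minimal sketch: check the confidence of an induced rule with pandas.
# Rule: IF age < 19 AND purchased Coke THEN purchases chips. Data is hypothetical.
import pandas as pd

txns = pd.DataFrame({
    "age":          [17, 18, 25, 16, 40, 18],
    "bought_coke":  [1, 1, 1, 0, 1, 1],
    "bought_chips": [1, 0, 1, 0, 0, 1],
})

antecedent = (txns["age"] < 19) & (txns["bought_coke"] == 1)
confidence = txns.loc[antecedent, "bought_chips"].mean()
print(f"{confidence:.0%} of under-19 Coke buyers also bought chips")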
The Crash Course
• What
• Why
• How
17
The Reasons to Mine Data
Lower Expenses + Higher Revenues → Increased Profit
18
The Reasons to Mine Data
• For Marketing/CRM
– Targeting prospects
– Predicting future customer behaviour
– Lower costs, higher revenues
• For Research
– Identify drugs likely to be successful
– Lower costs
• For Process Improvement
– Identify causes of production failures
– Lower costs
19
The Crash Course
• What
• Why
• How
20
Process
• Many different processes for Data Mining
– Vendor Driven
• SAS - SEMMA
– Sample, Explore, Modify, Model, Assess
• SPSS - 5 A’s
– Assess, Access, Analyze, Act, Automate
– Consulting Companies
– The Virtuous Cycle
• Michael Berry and Gordon Linoff
21
Process
The Virtuous Cycle
Business Problem → Transform Data → Act → Measure
22
Business Problem
• Define the business problem
• Understand the business and the rules
• Determine if Data Mining fits the need
• Understand the value to the business of solving the problem
23
Data for Data Mining
• Type of Data Values
– Categorical
• Defined set of values
• Ontario, Quebec, PEI …
– Ranks
• High, Medium, Low
• 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000
– Intervals
• Date
• Time
• Temperature
– True Numeric
• Values that support numeric operations
24
Transform Data
Steps
1. Identify Data
2. Obtain Data
3. Validate & Clean
4. Transpose to Right Granularity
5. Add Derived Variables
6. Prepare Model Set
7. Conduct Modeling
25
Transform Data
Step 1 - Identify Data
• What data is required to meet the modeling
need?
• What data is available?
26
Transform Data
Step2 - Obtain Data
• OLTP
• Data Warehouse
• Data Marts and OLAP
• Self Reported
• External
27
Transform Data
Step 3: Validate & Clean
• Data Issues:
– Missing
– Fuzzy
– Incorrect
– Outliers
• Solutions:
– Change Source
– Filter Out
– Ignore
– Integrate
– Predict
– Derive a New Variable
28
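As a loose illustration of a few of those solutions (filter out, predict/fill, derive a new variable), here is a small pandas sketch; the table and its columns are hypothetical.

# Minimal sketch of a few cleaning options: filter out an outlier, flag and
# fill a missing value, and neutralize an incorrect value. Data is hypothetical.
import numpy as np
import pandas as pd

subs = pd.DataFrame({
    "age":         [23, np.nan, 41, 240, 35],           # one missing, one outlier
    "ytd_revenue": [310.0, 120.0, -5.0, 480.0, 95.0],   # one incorrect (negative)
})

# Filter out: drop records with an impossible age.
subs = subs[subs["age"].isna() | subs["age"].between(0, 120)].copy()

# Derive a new variable: keep track of which ages were missing, then fill them.
subs["age_missing"] = subs["age"].isna().astype(int)
subs["age"] = subs["age"].fillna(subs["age"].median())

# Incorrect: revenue cannot be negative, so treat it as missing instead.
subs.loc[subs["ytd_revenue"] < 0, "ytd_revenue"] = np.nan
print(subs)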
Transform Data
Step 4: Transpose to right granularity
• Data sets for Data Mining need one view, one
record
• Grain must be consistent throughout
– Aggregates can be problematic
– Atomic data is often required to build data set
• Training data sets are cast from the point in time of
the event, looking back
29
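A minimal pandas sketch of rolling atomic records up to a consistent grain is below; one record per subscriber per month is assumed here, and all names and data are hypothetical.

# Minimal sketch: aggregate atomic call detail to one record per subscriber
# per month, so every row in the model set has the same grain.
import pandas as pd

calls = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2],
    "month":         [200209, 200209, 200210, 200209, 200210],
    "minutes":       [12.0, 7.5, 30.0, 4.0, 16.0],
})

model_rows = (calls.groupby(["subscriber_id", "month"], as_index=False)
                   .agg(total_calls_cnt=("minutes", "size"),
                        total_calls_mou=("minutes", "sum")))
print(model_rows)  # one view, one record per subscriber-month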
Transform Data
Step 5: Add Derived Variables
• Combined Columns
• Summarizations
• Features from Columns
• Time Series
30
Transform Data
Step 6: Prepare Model Set
• The Actual Input to the modeling
31
Transform Data
Step 7: Conduct Modeling
• Get our result
– Decision Trees
– Neural Networks
– Clustering
– Nearest Neighbour
– Rule Induction
32
Act
The business has to actually do something with
the results; otherwise, what was the point?
Marketing or Retention Campaigns
Business Changes
33
Measure
• Answer 2 Questions
– Was the Data Mining effort accurate?
– Were the Business Actions successful?
• Use different sets of data to compare real
results
– Actioned Customers vs. Non Actioned
• Accuracy Types
– Absolute
• We predicted that 80% of Group D would buy Coke,
and 78% really did
– Relative
• We predicted that 80% of Group D would buy Coke but
only 57% did; however, Group C, which we predicted had
a 60% propensity to buy Coke, actually bought it 42% of
the time, so the ranking between the groups still held
34
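A small sketch of both checks, using the numbers from the relative-accuracy example above, might look like this:

# Minimal sketch: compare predicted propensities with observed rates per group.
import pandas as pd

results = pd.DataFrame({
    "group":     ["C", "D"],
    "predicted": [0.60, 0.80],
    "actual":    [0.42, 0.57],
})

# Absolute accuracy: how close was each prediction to what really happened?
results["abs_error"] = (results["predicted"] - results["actual"]).abs()

# Relative accuracy: did the groups at least rank in the predicted order?
ranking_held = (results.sort_values("predicted")["group"].tolist()
                == results.sort_values("actual")["group"].tolist())
print(results)
print("Ranking preserved:", ranking_held)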
And back around
Business Problem → Transform Data → Act → Measure
35
The Case Study
36
The Case Study
• Background
– The Business
– Data Warehouse Overview
– Strengths and Challenges
• The Project
– Business Problem
– Transform Data
– Act
– Measure
37
The Business
• One of the top 3 (4?) cellular phone providers
in Canada
• Recent Acquisitions
– Clearnet
– Quebectel
• Important Business Concepts
– Handset
– Subscriber
– Client
– Activity - Activations, Deactivations
– Churn
– Usage
39
DW Environment
Sources → Staging → Integration (3NF) → Data Marts (Dimensional) → Cubes, Reports, Catalogues
40
Strengths at the Client
• Commitment to Data Warehousing
• Prior Experience in Data Mining
• Tools already Established
• Strong business support for the outcomes Data Mining would provide
41
Challenges at the Client
• Data Warehouse still in the midst of a major re-architecture effort
• Ongoing billing system integration projects
• A data mart for data mining had existed (Clearnet), but it was a victim of both of the above
– Successful at Churn Prediction
42
The Case Study
• Background
– The Business
– Data Warehouse Overview
– Strengths and Challenges
• The Project
– Business Problem
– Transform Data
– Act
– Measure
43
Using the Virtuous Cycle
Business Problem → Transform Data → Act → Measure
44
Business Problems
• Churn Modeling
– predict which subscriber is likely to leave
• Behavioural Segmentation
– clustering subscribers into subgroups based on some
commonality
– revenue, usage, demographic
• Client Value Estimation
– the present value of all future profits generated throughout
the lifetime of that client
45
Transform Data Steps
1. Identify Data
2. Obtain Data
3. Validate & Clean
4. Transpose to Right Granularity
5. Add Derived Variables
6. Prepare Model Set
7. Conduct Modeling
46
Identify Data
• Business Wanted:
EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER!
• They provided a detailed list, by subject area,
of the variables that they believed were
required to conduct the kind of Data Mining
Activities desired.
47
Identify Data
IT Challenges
• 19 Subject Areas identified with up to 75
variables each - What is the Priority?
• Avoid the big bang - How much can we
actually do?
• Where to source the data from?
• Resources - Who is going to do it?
48
Identify Data
Prioritizing
• First we asked the business to rate each
variable as H, M or L priority
– Almost everything was given an H
• Then we asked the business to rank the
subject areas in order of importance
– Hard to convince them of the value
– Hard to find consensus
– Necessary for determining a release strategy
49
Obtain Data
• For each Subject Area and each variable we
assessed and documented the following:
– Where can it be sourced from (and when)?
– What are the known issues?
– Q&A back and forth on the variables with the business
– Identified possible additional variables
50
Obtain Data
Release Strategy:
• 1 Release a Quarter
• 3 planned releases
• 15 of 19 Subject Areas
51
Obtain Data
• Data Mining Access tool against sources
– Data Warehouse
– OLTP
– External Data
• Creation of flat files to feed to Data Mining
team
– Multi source
– Validate and Clean
• Build a Data Mart for Data Mining
– Part of the Data Warehouse Architecture
52
Obtain Data
Sources → Staging → Integration (3NF) → Data Marts (Dimensional), including a Data Mart for Data Mining
53
Validate & Clean Data
[Diagram: data is integrated and cleaned as it moves from Staging into the Integration (3NF) layer, and cleaned again on the way into the Data Mart for Data Mining]
• Clean?
– Not Fuzzy
– Correct (mostly)
– Missing and Outliers
54
Transpose to Right Granularity
• The Grain is Subscriber by Bill Cycle
• Monthly Snapshots of subscriber data
• Fact Data Cast to Bill Cycle
[Diagram: SUBSCRIBER_BILL_CYCLE_SNAPSHOT and CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT tables keyed by SUBSCRIBER_ID and MONTH, surrounded by fact tables (FACTS_2 … FACTS_5) at the same grain]
55
Add Derived Variables
• Combined Columns
– PERCENT_BUCKET_USED
• IN_BUCKET_CALLS / RATE_PLAN_BUCKET_MINUTES
• Summarizations
– Sphere of Influence
• UNIQUE_CALLED_NUMBER_CNT
• UNIQUE_CALLING_NUMBER_CNT
56
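A minimal sketch of those two derivations follows, assuming one row per subscriber per bill cycle; the column names mirror the slide, while the data and the way the two counts are combined into one measure are hypothetical.

# Minimal sketch: derive a combined column and a summarization per subscriber.
import pandas as pd

usage = pd.DataFrame({
    "in_bucket_calls_mou":       [180.0, 420.0],
    "rate_plan_bucket_minutes":  [200.0, 400.0],
    "unique_called_number_cnt":  [14, 53],
    "unique_calling_number_cnt": [9, 31],
})

# Combined column: share of the rate-plan bucket actually used.
usage["percent_bucket_used"] = (usage["in_bucket_calls_mou"]
                                / usage["rate_plan_bucket_minutes"])

# Summarization: one possible sphere-of-influence measure is the sum of
# distinct numbers called and distinct numbers calling in.
usage["sphere_of_influence"] = (usage["unique_called_number_cnt"]
                                + usage["unique_calling_number_cnt"])
print(usage)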
Add Derived Variables
• Features from Columns
– CONTRACT_INDICATOR (Y/N Flag)
• Was the date of the snapshot for that subscriber between COMMIT_START_DATE and COMMIT_END_DATE?
• Time Series
– BILLED_USAGE_3_MTH_AVG
• 3-month moving averages of billed usage (TOTAL_CALLS_MOU, LONG_DISTANCE_CALLS_MOU, WEEKEND_CALLS_MOU, EVENING_CALLS_MOU, PEAK_CALLS_MOU), keyed by SUBSCRIBER_ID and MONTH
[Diagram: the BILLED_USAGE and BUCKET_USAGE tables supply the base columns for these derivations]
57
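For the time-series variable, a minimal sketch of a 3-month moving average per subscriber might look like the following; the table and column names mirror the slide, but the data is hypothetical.

# Minimal sketch: 3-month moving average of billed usage per subscriber.
import pandas as pd

billed = pd.DataFrame({
    "subscriber_id":   [1, 1, 1, 1],
    "month":           [200207, 200208, 200209, 200210],
    "total_calls_mou": [120.0, 95.0, 150.0, 60.0],
}).sort_values(["subscriber_id", "month"])

billed["total_calls_mou_3_mth_avg"] = (
    billed.groupby("subscriber_id")["total_calls_mou"]
          .transform(lambda s: s.rolling(window=3, min_periods=3).mean())
)
print(billed)  # the first two months have no 3-month average yet (NaN)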
Prepare Model Set
• Select Sample
– Population
– Data for given modeling effort
• Denormalize completely
[Diagram: the SUBSCRIBER_BILL_CYCLE_SNAPSHOT and CLIENT_ACCOUNT_BILL_CYCLE_SNAPSHOT, with their related fact and billing tables (BILLED_AMOUNTS, VALUE_ADDED_SERVICES_BILLED, EQUIPMENT_ACTIVITY_SNAPSHOT, DROPPED_AND_BLOCKED_USAGE, BLOCKED_USAGE, FACTS_1 … FACTS_5), are sampled and flattened into the model set]
58
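A minimal sketch of selecting the sample is below: take the population relevant to the modeling effort and oversample the rare outcome (as mentioned earlier for training sets); the population, columns and sampling ratio are hypothetical.

# Minimal sketch: select the model-set sample and oversample the rare outcome.
import pandas as pd

population = pd.DataFrame({
    "subscriber_id": range(1, 11),
    "active":        [1] * 10,
    "churned":       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Population for a churn model: active subscribers only.
candidates = population[population["active"] == 1]

# Oversample churners so the model set reflects the question being asked.
churners = candidates[candidates["churned"] == 1]
keepers  = candidates[candidates["churned"] == 0].sample(n=4, random_state=0)
model_sample = pd.concat([churners, keepers]).sample(frac=1, random_state=0)
print(model_sample["churned"].mean())  # churn rate in the model set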
Prepare Model Set
Example of Denormalization for Data Mining
• Denormalize completely: BILLED_USAGE, BILLED_USAGE_3_MTH_AVG, BUCKET_USAGE, PRODUCT_SKU and the SUBSCRIBER_BILL_CYCLE_SNAPSHOT are combined into one wide record per subscriber per month
[Diagram: the joined tables, with an example row in which handset columns (Handset 1 … Handset 4) appear as 0/1 flags]
59
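A minimal sketch of the denormalization join is below; two hypothetical tables stand in for the snapshot and usage tables shown on the slide.

# Minimal sketch: denormalize snapshot and usage tables into one wide row
# per subscriber per month for the modeling tool.
import pandas as pd

snapshot = pd.DataFrame({
    "subscriber_id": [1, 2], "month": [200210, 200210],
    "rate_plan": ["A", "B"], "contract_indicator": ["Y", "N"],
})
billed_usage = pd.DataFrame({
    "subscriber_id": [1, 2], "month": [200210, 200210],
    "total_calls_mou": [150.0, 30.0], "roaming_calls_cnt": [3, 0],
})

# Left-join every fact table onto the snapshot so each subscriber-month is a
# single flat record (repeat the merge for each additional table).
model_set = snapshot.merge(billed_usage, on=["subscriber_id", "month"], how="left")
print(model_set)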
Prepare Model Set
Example of Casting a Test Set for Churn Modeling
• Sliding Windows Concept

                 Oct  Nov  Dec  Jan  Feb  Mar  Apr  May  Jun  Jul
Model Set Feb     3    2    1    X    P
Model Set Mar          3    2    1    X    P
Model Set Apr               3    2    1    X    P
Score Set                        3    2    1    X    P    P

(3, 2, 1 are the months of input history, X is the month skipped for scoring latency, P is the month being predicted)
60
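A minimal sketch of casting one such window is below; the layout follows the table above (three history months, one skipped month, one predicted month), while the function, its name and the sample data are hypothetical.

# Minimal sketch: cast one sliding-window model set. The three most recent
# history months become inputs (3, 2, 1), one month is skipped (X) and the
# following month supplies the target (P).
import pandas as pd

def cast_window(snapshots: pd.DataFrame, predict_month: int) -> pd.DataFrame:
    months = sorted(snapshots["month"].unique())
    i = months.index(predict_month)
    hist_months = months[i - 4:i - 1]    # months labelled 3, 2, 1
                                         # months[i - 1] is the skipped X month
    hist = (snapshots[snapshots["month"].isin(hist_months)]
            .pivot(index="subscriber_id", columns="month",
                   values="total_calls_mou"))
    hist.columns = [f"mou_minus_{3 - k}" for k in range(3)]   # oldest first

    target = (snapshots[snapshots["month"] == predict_month]
              .set_index("subscriber_id")["churned"].rename("churned_in_p"))
    return hist.join(target)

snapshots = pd.DataFrame({
    "subscriber_id":   [1] * 5 + [2] * 5,
    "month":           [200210, 200211, 200212, 200301, 200302] * 2,
    "total_calls_mou": [120.0, 95.0, 150.0, 60.0, 0.0,
                        40.0, 55.0, 30.0, 20.0, 10.0],
    "churned":         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

model_set_feb = cast_window(snapshots, predict_month=200302)
print(model_set_feb)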
Conduct Modeling
• Done by business team with help from Vendor
(SAS)
– Decision Trees and Neural Networks (Churn)
– Clustering (Segmentation)
61
The Action
Various Marketing and CRM Activities
62
Measuring
• Not there yet
– Intend to compare actioned vs. non-actioned results
– 50% to be actioned
63
And the cycle continues ...
Business Problem → Transform Data → Act → Measure
64
Jason Brown
[email protected]
Questions & Answers
65