Data Understanding

Data Mining Processes

• Identify actionable results

CRISP-DM

• Cross-Industry Standard Process for Data Mining
  – One of the first comprehensive attempts toward a standard process model for data mining
  – Independent of industry sector & technology

CRISP-DM Phases

1. Business (or problem) understanding
2. Data understanding
3. Data preparation
   • Transform & create the data set for modeling
4. Modeling
5. Evaluation
   • Check that the models are good; evaluate to assure nothing is missing
6. Deployment

Business Understanding

• Solve a specific problem
• A clear problem definition helps
  – Measurable success criteria
• Convert business objectives into a set of data-mining goals
  – What to achieve in technical terms

Data Understanding

• Related data can come from many sources
  – Internal
    • ERP (or MIS)
    • Data warehouse
  – External
    • Government data
    • Commercial data
  – Created
    • Research

Data Preparation

• Clean data
  – Formats, gaps; filter outliers & redundancies
• Unified numerical scales
  – Nominal data: code
  – Ordinal data: nominal code or scale
  – Cardinal data

Types of Data

Type         Features     Synonyms
Numerical    Continuous   Range
Integer                   Range
Binary       Yes/No       Flag
Categorical  Finite       Set
Date/Time                 Range
String
Text                      Typeless

Modeling

• Data treatment
  – Training set
  – Test set
  – Maybe others
• Techniques
  – Association
  – Classification
  – Clustering
  – Predictions
  – Sequential patterns

Evaluation

• Does model meet business objectives?
• Any important business objectives not addressed?
• Does model make sense?
• Is model actionable?

Deployment

• Ongoing monitoring & maintenance
  – Evaluate performance against success criteria
  – Market reaction & competitor changes

Example

• Training set for computer purchase
  – 16 records
  – 5 attributes
• Goal
  – Find classifier for consumer behavior

Database (1st half)

Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    ≤30    Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    ≤30    Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes

Database (2nd half)

Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No

Data Selection

• Gender has a weak relationship with purchase
  – Based on correlation
  – Drop gender
• Selected attribute set: {Age, Income, Student, Credit}

Data Preprocessing

• Income unknown in Case 15
• Credit not available in Case 16
• Drop these noisy cases (leaving 14 of the 16)

Data Transformation

• Assign numerical values to each attribute
  – Age: ≤30 = 3, 31-40 = 2, >40 = 1
  – Income: High = 3, Medium = 2, Low = 1
  – Student: Yes = 2, No = 1
  – Credit: Excellent = 2, Fair = 1

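A minimal Python sketch of this coding step (pandas assumed; the few rows shown are illustrative, not the full 14-case training set):

```python
import pandas as pd

# Numerical scales from the slide above
age_code = {"<=30": 3, "31-40": 2, ">40": 1}
income_code = {"High": 3, "Medium": 2, "Low": 1}
student_code = {"Yes": 2, "No": 1}
credit_code = {"Excellent": 2, "Fair": 1}

# Illustrative subset of the training cases (attribute values as recorded)
df = pd.DataFrame({
    "Age": ["31-40", ">40", "<=30"],
    "Income": ["High", "Medium", "Low"],
    "Student": ["No", "No", "Yes"],
    "Credit": ["Fair", "Fair", "Excellent"],
})

# Map each nominal/ordinal value onto its numerical code
for col, code in [("Age", age_code), ("Income", income_code),
                  ("Student", student_code), ("Credit", credit_code)]:
    df[col] = df[col].map(code)
print(df)
```
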
Data Mining

• Categorize output
  – Buys = C1, Doesn't buy = C2
• Conduct analysis
  – Model errs on A8 (an actual buyer classed as not buying) and A12 (an actual non-buyer classed as buying); the rest are classified correctly
  – Of the actual yes cases, 8 correct and 1 not
  – Of the actual no cases, 4 correct and 1 not

Data Interpretation
• Test on independent data
Test Data Set

Case  Actual  Model
B1    Yes     Yes
B2    Yes     Yes
B3    Yes     Yes
B4    Yes     Yes
B5    Yes     Yes
B6    Yes     Yes
B7    No      No
B8    No      Yes
B9    No      No
B10   No      No

Confusion Matrix

            Model Buy  Model Not  Totals
Actual Buy      6          0         6
Actual Not      1          3         4
Totals          7          3        10

Measures

• Correct classification rate: 9/10 = 0.90
• Cost function, with cost per error:
  – Model says buy, actual no: $20
  – Model says no, actual buy: $200
• Total cost: 1 × $20 + 0 × $200 = $20

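The rate and cost figures above can be checked directly; here is a short Python sketch, with the confusion matrix held in a plain dictionary (one convenient representation among many):

```python
# Confusion matrix from the test set: (actual, model) -> count
matrix = {("buy", "buy"): 6, ("buy", "not"): 0,
          ("not", "buy"): 1, ("not", "not"): 3}

total = sum(matrix.values())
correct = matrix[("buy", "buy")] + matrix[("not", "not")]
print(correct / total)  # correct classification rate: 9/10 = 0.90

# Asymmetric error costs from the slide: a wasted contact costs $20,
# a missed buyer costs $200
cost = matrix[("not", "buy")] * 20 + matrix[("buy", "not")] * 200
print(cost)  # 1 x $20 + 0 x $200 = $20
```
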
Goals

• Avoid broad concepts:
  – "Gain insight," "discover meaningful patterns," "learn interesting things"
  – Can't measure attainment
• Narrow and specify:
  – Identify customers likely to renew; reduce churn
  – Rank order by propensity to…

Goals
• Description: what is
– understand
– explain
– discover knowledge
• Prescription: what should be done
– classify
– predict
Goal

• Method A: four rules, explains 70%
• Method B: fifty rules, explains 72%
• Which is best? It depends on the goal:
  – Gain understanding: Method A is better (minimum description length, MDL)
  – Reduce cost of mailing: Method B is better

Measurement

• Accuracy
  – How well does the model describe the observed data?
• Confidence levels
  – The proportion of the time the true value falls between the lower and upper limits
• Comprehensibility
  – Of the whole model, or of its parts?

Measuring Predictive Accuracy

• Classification & prediction:
  – error rate = incorrect classifications / total
  – Requires that the evaluation set be representative
• Estimators: error = predicted − actual
  – MAD: Mean Absolute Deviation; MSE: Mean Squared Error; MAPE: Mean Absolute Percent Error
  – variance = sum of (predicted − actual)²; standard deviation = square root of variance
  – Distance: how far off the prediction is

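A small Python sketch of these estimator-error measures; the predicted and actual values are made up purely for illustration:

```python
import math

predicted = [105.0, 98.0, 110.0, 102.0]  # illustrative forecasts
actual = [100.0, 101.0, 107.0, 99.0]     # illustrative observations

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

mad = sum(abs(e) for e in errors) / n                  # Mean Absolute Deviation
mse = sum(e ** 2 for e in errors) / n                  # Mean Squared Error
mape = 100 * sum(abs(e) / a
                 for e, a in zip(errors, actual)) / n  # Mean Absolute Percent Error
std_dev = math.sqrt(mse)  # square root of the mean squared deviation

print(mad, mse, mape, std_dev)
```
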
Statistics

• Population: entire group studied
• Sample: subset drawn from the population
• Bias: difference between the sample average & the population average
• Related concepts:
  – mean, median, mode
  – distribution
  – significance
  – correlation, regression

Classification Models

• LIFT = probability in class by sample divided by probability in class by population
  – If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
• Best lift is not necessarily best overall
  – Need a sufficient sample size
  – As confidence increases, the list gets longer but the lift gets lower

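A minimal Python sketch of the lift calculation, reproducing the 0.3/0.2 example; the count arguments are hypothetical:

```python
def lift(sample_hits, sample_size, population_hits, population_size):
    """Response rate in the targeted sample relative to the population rate."""
    return (sample_hits / sample_size) / (population_hits / population_size)

# Slide's example: population probability 20%, sample probability 30%
print(lift(30, 100, 200, 1000))  # 0.3 / 0.2 = 1.5
```
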
Lift Chart

[Figure: lift chart plotting cumulative % responded against % mailed, both axes running 0 to 100; the "responded" curve rises above the diagonal "% mailed" baseline]

Measuring Impact

• Ideal measure: dollars (NPV, net present value) generated because of the expenditure
• A mass mailing may still be better than a targeted one
• Depends on:
  – fixed cost
  – cost per recipient
  – cost per respondent
  – value of a positive response

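A sketch of how these factors combine into a dollar figure, comparing a mass mailing with a model-targeted one; every number below is an assumed illustration, not from the text:

```python
def campaign_profit(n_mailed, response_rate, fixed_cost,
                    cost_per_recipient, value_per_respondent):
    """Net dollar impact of a mailing campaign."""
    respondents = n_mailed * response_rate
    return (respondents * value_per_respondent
            - fixed_cost
            - n_mailed * cost_per_recipient)

# Assumed figures: mass mailing reaches everyone at a 2% response rate;
# the model targets the top 20% with a 6% response rate
mass = campaign_profit(100_000, 0.02, 5_000, 0.50, 40.0)
targeted = campaign_profit(20_000, 0.06, 5_000, 0.50, 40.0)
print(mass, targeted)  # 25000.0 33000.0: targeting wins with these figures
```

Push the cost per recipient low enough, though, and the mass mailing comes out ahead, which is the slide's point.
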
Bottom Line
• Return on investment
Example Application

• Telephone industry
• Problem: unpaid bills
• Data mining used to develop models to predict nonpayment as early as possible

Knowledge Discovery Process

1. Data Selection
   – Learning the application domain
   – Creating the target data set
2. Data Preprocessing
   – Data cleaning & preprocessing
3. Data Transformation
   – Data reduction & projection
4. Data Mining
   – Choosing the function
   – Choosing the algorithms
   – Data mining
5. Data Interpretation
   – Interpretation
   – Using discovered knowledge

Telephone Bill Study

• Billing-period sequence analyzed
  – Two months of use, then a bill; payment due in the month of billing; disconnect if unpaid within a given period
• Hypothesis: insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period

1: Business Understanding

• Predict which customers would be insolvent
  – In time for the firm to take preventive measures (and avert losing good customers)
• Hypothesis:
  – Insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period

2: Data Understanding

• Static customer information available in files
  – Bills, payments, usage
• Used a data warehouse to gather & organize the data
  – Coded to protect customer privacy

Creating Target Data Set

• Customer files
  – Customer information
  – Disconnects
  – Reconnections
• Time-dependent data
  – Bills
  – Payments
  – Usage
• 100,000 customers over a 17-month period
• Stratified sampling to assure all groups appropriately represented

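A minimal Python sketch of one common stratified-sampling scheme (drawing the same fraction from every group); the record layout and the `key` function are assumptions for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=42):
    """Draw `fraction` of each stratum so every group stays represented."""
    random.seed(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)          # partition records into groups
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

# Hypothetical usage: 5% of each customer group
# sample = stratified_sample(customers, key=lambda r: r["group"], fraction=0.05)
```
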
3: Data Preparation

• Filtered out incomplete data
• Deleted inexpensive calls
  – Reduced data volume by about 50%
• Low number of fraudulent cases
• Cross-checked with phone disconnects
• Lagged data made synchronization necessary

Data Reduction & Projection

• Information grouped by account
• Customer data aggregated by 2-week periods
• Discriminant analysis on 23 categories
• Calculated average owed by category (significant)
• Identified extra charges (significant)
• Investigated payment by installments (not significant)

Choosing Data Mining Function

• Classes:
  – Most probably solvent (99.3%)
  – Most probably insolvent (0.7%)
• Costs of error widely different
• New data set created through stratified sampling
  – Retained all insolvent cases
  – Altered the distribution to 90% solvent
  – Used 2,066 cases total
• Critical period identified
  – Last 15 two-week periods before service interruption
• Variables defined by counting measures in two-week periods
  – 46 variables as candidate discriminant factors

4: Modeling
• Discriminant Analysis
– Linear model
– SPSS – stepwise forward selection
• Decision Trees
– Rule-based classifier
• Neural Networks
– Nonlinear model
Data Mining

• About two-thirds of the cases used as the training set; the rest used for testing
• Discriminant analysis
  – Used 17 variables
  – Equal costs: 0.875 correct
  – Unequal costs: 0.930 correct
• Rule-based classifier: 0.952 correct
• Neural network: 0.929 correct

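A hedged scikit-learn sketch of this setup: a roughly two-thirds/one-third split and the same three model families. The synthetic data only mimics the shape of the 2,066-case, 90%-solvent set and will not reproduce the study's accuracy figures:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in data: 2,066 cases, ~90% of one class, 17 candidate variables
X, y = make_classification(n_samples=2066, n_features=17,
                           weights=[0.9], random_state=0)

# About two-thirds for training, the rest for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

models = {
    "discriminant analysis": LinearDiscriminantAnalysis(),    # linear model
    "decision tree": DecisionTreeClassifier(random_state=0),  # rule-based
    "neural network": MLPClassifier(max_iter=1000, random_state=0),  # nonlinear
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```
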
5: Evaluation

• 1st objective: maximize accuracy of predicting insolvent customers
  – Decision tree classifier best
• 2nd objective: minimize the error rate for solvent customers
  – Neural network model close to the decision tree
• Used all 3 on a case-by-case basis

Coincidence Matrix – Combined Models

                  Model insolvent  Model solvent  Unclassified  Totals
Actual insolvent        19               17            28          64
Actual solvent           1              626            27         654
Totals                  20              643            55         718

6: Implementation

• Every customer examined using all 3 algorithms
  – If all 3 agreed, that classification was used
  – If they disagreed, the customer was categorized as unclassified
• Correct on test data: 0.898
  – Only 1 actually solvent customer would have been disconnected

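A minimal Python sketch of the agreement rule just described; the label strings are illustrative:

```python
def combined_decision(predictions):
    """Use a classification only when all three models agree."""
    if all(p == predictions[0] for p in predictions):
        return predictions[0]   # unanimous: act on that classification
    return "unclassified"       # any disagreement: no automatic action

print(combined_decision(["insolvent", "insolvent", "insolvent"]))  # insolvent
print(combined_decision(["insolvent", "solvent", "insolvent"]))    # unclassified
```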