Data Mining Challenges


Everything you ever wanted to know about data mining but were afraid to ask.
David Duling
Data Mining R&D Director
SAS Institute
Copyright © 2006, SAS Institute Inc. All rights reserved.
Abstract
 Data mining is the process of systematically sifting through
often large databases to identify patterns and trends
relevant to solving business problems such as increasing
sales and efficiency. Successful examples of data mining
can be found in many areas of business and science
including customer relations management, market basket
analysis, human resources, bio-informatics and medicine,
fraud detection, and searching the web. The rapid growth
in data mining is largely due to the increased availability of
large databases, advances in large scale computing and
development of data mining algorithms. This presentation
will begin with a brief history of data mining, cover current
trends with case studies and finish with a look into the
future of data mining from the SAS perspective.
Company confidential - for internal use only
Q: Where did data mining start ?
 1841
• Lewis Tappan formed the Mercantile Agency to provide creditworthiness reports (i.e., credit scores) to New York merchants.
• Employed a number of ‘correspondents’ in western frontier towns to
monitor the behavior of local traders, in addition to pooling merchant
records. A huge database was accumulated.
• Enormous success!
 1849
• John Bradstreet starts a credit reporting company in Cincinnati, OH.
 1859
• The Mercantile Agency was sold to Robert Graham Dun
 1933
• The Dun company merged with the Bradstreet company
Q: How about something with computers ?
 Date: September, 1963
Authors: James Myers and Edward Forgy
Title: The Development of Numerical Credit Evaluation Systems
Publication: Journal of the American Statistical Association
 Abstract
Several discriminant and multiple regression analyses
were performed on retail credit application data to develop
a numerical scoring system for predicting credit risk in a
finance company. Results showed that equal weights for
all significantly predictive items were as effective as
weights from the more sophisticated techniques of
discriminant analysis and "stepwise multiple regression."
However, a variation of the basic discriminant analysis
produced a better separation of groups at the lower score
levels, where more potential losses could be eliminated
with a minimum cost of potentially good accounts.
Data Mining, circa 1963
IBM 7090
600 applicants
“Machine storage limitations
restricted the total number of
variables which could be
considered at one time to 25.”
Q: That sounds like Statistics, so what’s the difference?
Statistics
 Experimental
 Prior Hypothesis
• Idea before data acquisition
• Data acquisition planned
 Experimental Design
• Sampling strategies
• Factorial designs
• Required confidence
• Minimize model terms
 Inference
• Hypothesis testing
• Prediction
Data mining
 Commercial
 Posterior Hypothesis
• Idea after data acquisition
• Data acquisition opportunistic
 No Experimental Design
• Explore data
• Create hypothesis
• Generate query
• Create models
 Prediction
• Lift, Profit, Response
• Inference
It’s all about the Data
             Experimental          Opportunistic
Purpose      Research              Operational
Value        Scientific            Commercial
Generation   Actively controlled   Passively observed
Size         Small                 Massive
Hygiene      Clean                 Dirty
State        Static                Dynamic
Where does mining data come from ?
Data Warehouses store detail data on
transactions and states
Very simple demo example
[Figure: entity-relationship diagram of a demo data warehouse. Entities: Geo_Type, Region, County, State, Street_Code, Zip_Code, City, Country, Continent, Customer, Customer_Type, Supplier, Price_List, Promotion, Order, Order_Item, Product, Product_Level, Staff, Organization, Org_Level]
Data: majority of time spent on mining
Intelligent Enterprise Magazine
http://www.intelligententerprise.com/030405/606feat2_1.jhtml
Each data type has its own mix of
data prep and mining
Demographics, Personal Information
Market baskets, Item sets
Market baskets with time order
Web paths: unique sequences
Time stamped transactions
Text normalization
Use integrated data for mining
Web paths
Market baskets
Demographic,
Financial
Seasonal indices
Interactions
ID columns
Text dimensions
Example
 Quest: Maximize response to this year’s summer promotion
 How: Find those customers most likely to respond
 Use response to last year’s summer promotion as indicator
of response to this year’s promotion. This is the dependent
variable.
 Use all customer data available before last summer. These
are the independent variables.
• Demographics
• Sales item history
• Sales amount history
• Web site history
• Call center records
• ….
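The setup above can be sketched as a small modeling table: inputs are built only from history before last summer, and the target is last year's response. The cutoff date, the sample records, and the names `build_modeling_table`, `n_orders`, and `total_sales` are illustrative assumptions, not part of the original example.

```python
from datetime import date

# Hypothetical cutoff: the start of last year's summer promotion.
CUTOFF = date(2005, 6, 1)

# Hypothetical sales history: (customer_id, event_date, sales_amount).
sales = [
    (1, date(2005, 3, 10), 120.0),
    (1, date(2005, 7, 2), 80.0),   # after the cutoff, so it must not become an input
    (2, date(2004, 11, 5), 40.0),
    (3, date(2005, 5, 20), 300.0),
]

# Hypothetical responders to last year's promotion (the dependent variable).
responders = {1, 3}


def build_modeling_table(sales, responders, cutoff):
    """One row per customer: inputs from before the cutoff, target from the promotion."""
    rows = {}
    for customer_id, event_date, amount in sales:
        row = rows.setdefault(customer_id, {"n_orders": 0, "total_sales": 0.0})
        if event_date < cutoff:  # use only what was known before the promotion
            row["n_orders"] += 1
            row["total_sales"] += amount
    for customer_id, row in rows.items():
        row["responded"] = int(customer_id in responders)
    return rows


table = build_modeling_table(sales, responders, CUTOFF)
for customer_id, row in sorted(table.items()):
    print(customer_id, row)
```

Filtering on the cutoff keeps information from after the promotion out of the independent variables.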
Q: Why don’t we just select last year’s customer lists ?
Data is non-stationary
(remember this point)
 Move to new locations
 Change jobs
 Income goes up or down
 Debt increase or decrease
 Marital status
 Parental status
 …
The model is a function of the attributes, not the individuals
Q: What Functions are Popular ?
 Associations
unsupervised
 Clustering
unsupervised
 PCA/SVD
unsupervised
 Logistic Regression
supervised
 Decision Tree
supervised
 Neural Network
supervised
 Ensembles
supervised
 … and many other forms, variants, and names
How do I find Patterns ?
 Try ASSOCIATIONS and SEQUENCES
 Searches for frequent patterns
 (Car Wreck  Dr. X )  (Diagnosis Code xxx  MRI)
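 Confidence:
 If (A) happens then (B) happens 80% of the time
 C = N(A and B) / N(A), where N(.) counts the itemsets containing its argument
 Support:
 (A) and (B) happen together in 10% of all itemsets
 S = N(A and B) / N, where N is the total number of itemsets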
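As a worked illustration of the two measures, the sketch below counts itemsets directly. The ten sample itemsets and the function name `support_and_confidence` are made up for the example.

```python
def support_and_confidence(itemsets, a, b):
    """Support and confidence of the rule a -> b over a list of itemsets (sets)."""
    n = len(itemsets)
    n_a = sum(1 for items in itemsets if a <= items)         # itemsets containing A
    n_ab = sum(1 for items in itemsets if (a | b) <= items)  # itemsets containing A and B
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical data: A appears in 5 of 10 itemsets, A together with B in 4.
itemsets = [{"A", "B"}] * 4 + [{"A"}] + [{"C"}] * 5

support, confidence = support_and_confidence(itemsets, {"A"}, {"B"})
print(support, confidence)  # 0.4 0.8
```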
Rule Sets show the next most likely action
Associations and Sequences / Visualization
What is the Most Popular Data Mining Function ?
-Logistic Regression Still Rules !
-Linear combination of terms (z)
-Relatively easy to compute
-Converges to a solution
-Explainable
[Figure: Prob plotted against Input]
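A minimal sketch of the model function: a linear combination of terms z is passed through the logistic function to give a probability. The weights and intercept below are placeholder values, not fitted coefficients.

```python
import math


def logistic_probability(inputs, weights, intercept):
    """Probability from a linear combination z of the inputs."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two inputs.
weights = [0.8, -0.5]
intercept = 0.1

print(logistic_probability([1.0, 2.0], weights, intercept))
```

Because each weight multiplies exactly one input, the sign and size of a weight show how that input moves the probability, which is what makes the model explainable.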
Q: What is CART and why do I need it ?
 A Decision Tree !
 Classification and Regression
 Strategy Development

Hunt (1966): Concept Learning System
Kass (1980): Chi-squared Automatic Interaction Detection
Breiman (1984): Classification and Regression Trees
Quinlan (1993): C4.5 rule sets
Numerous others…
• Algorithms for efficiently building trees
• Hypothesis tests for finding split points
− Various measurement scales
Building a Decision Tree
Keep doing that
until there are no
more beneficial
splits...
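The loop described above can be sketched as recursive partitioning. This is a simplified illustration under stated assumptions: one interval input, a binary target, and Gini impurity as the split criterion (the algorithms listed earlier use their own criteria, such as chi-squared tests). The names `gini`, `best_split`, and `grow` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Return (gain, threshold) for the most beneficial split, or None."""
    parent = gini(ys)
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain = parent - weighted
        if gain > 1e-12 and (best is None or gain > best[0]):
            best = (gain, threshold)
    return best


def grow(xs, ys):
    """Split, then keep splitting each side until no split is beneficial."""
    split = best_split(xs, ys)
    if split is None:
        return {"leaf": True, "prob": sum(ys) / len(ys), "n": len(ys)}
    _, threshold = split
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return {
        "leaf": False,
        "threshold": threshold,
        "left": grow([x for x, _ in left], [y for _, y in left]),
        "right": grow([x for x, _ in right], [y for _, y in right]),
    }


tree = grow([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(tree)
```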
Recursive Partitioning
Benefits of Trees
 Interpretability
• Tree structured presentation
 Mixed Measurement Scales
• Nominal, ordinal, interval
• Regression trees
 Robustness
 Missing Values
…Benefits
 Automatically
• Detects interactions (AID)
• Accommodates nonlinearity
• Selects input variables
[Figure: Multivariate Step Function, Prob plotted over two Inputs]
Drawbacks of Trees
 Roughness
 Linear, Main Effects
 Instability
Q: Why do they call it ‘Neural’ network ?
Neuron
Hidden Unit
Feed Forward Neural Network
[Figure: network with Input Layer, Hidden Layers, and Output Layer]
How does it work?
C = combination ( Weights * Inputs )
A = Activation ( C )
f(W,I) = A[C] + b -> output
p = f(P,X)
q = f(Q,X)
r = f(R,X)
s = f(S,(p,q,r))
t = f(T,(p,q,r))
y ~ f(W,(s,t))
y ~ f(W,(f(S,(f(P,X), f(Q,X), f(R,X))), f(T,(f(P,X), f(Q,X), f(R,X)))))
Err = E(Y,y) ~ (Y - y)^2
[Figure: network diagram with nodes a to j, p, q, r, s, t, and y]
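The nested function above can be written out as a forward pass. This sketch assumes a weighted sum as the combination and tanh as the activation (the slide names neither), and follows the slide's form f(W,I) = A[C] + b. The weight values are arbitrary placeholders.

```python
import math


def unit(weights, inputs, b=0.0):
    """f(W, I): combine the inputs, activate, then add b."""
    c = sum(w * i for w, i in zip(weights, inputs))  # C = combination(Weights * Inputs)
    return math.tanh(c) + b                          # A[C] + b, with tanh as an assumed A


def forward(x, P, Q, R, S, T, W):
    """y ~ f(W, (s, t)) with s, t built from p, q, r, which are built from X."""
    p, q, r = unit(P, x), unit(Q, x), unit(R, x)
    s, t = unit(S, (p, q, r)), unit(T, (p, q, r))
    return unit(W, (s, t))


def squared_error(target, output):
    """Err ~ (Y - y)^2."""
    return (target - output) ** 2


# Hypothetical input and weights.
x = (0.5, -1.0)
P, Q, R = (0.2, 0.4), (-0.3, 0.1), (0.7, -0.6)
S, T = (0.5, -0.2, 0.3), (-0.4, 0.6, 0.1)
W = (0.9, -0.8)

y = forward(x, P, Q, R, S, T, W)
print(y, squared_error(1.0, y))
```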
Input Layer
Activation Function
Training
 Iterative Optimization Algorithm
 Error Function
[Figure: error function plotted over Parameter 1 and Parameter 2]
Training history for our example
Error measure goes down with every iteration.
Weights evolve at every iteration
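A training history of this kind can be reproduced with a minimal sketch. It assumes a two-parameter linear model, a mean squared error function, and plain gradient descent with a small fixed step; the slide does not say which optimizer was used. With this step size the error falls at every iteration, which a larger step would not guarantee.

```python
def mean_squared_error(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, iterations=200):
    """Gradient descent on two parameters, recording the error at each iteration."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = [mean_squared_error(w, b, xs, ys)]
    for _ in range(iterations):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w  # the weights evolve at every iteration
        b -= learning_rate * grad_b
        history.append(mean_squared_error(w, b, xs, ys))
    return w, b, history


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, history = train(xs, ys)
print(w, b, history[0], history[-1])
```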
Neural Pros and Cons
 Very Flexible functions
 Implicit transformation and interactions
 Good algorithms for controlling complexity
 No inference
 Complex function
 Many possible networks – large search space
Q: How do I know the model will work on new data ?
 Make sure that you don’t have a perfect model !
• Real data has multiple forms of the dependent variable effect
 Limit exposure to data that changes over time
• Examine distributions of data at several time points
• Select stable data
• Use standardizations
• Use category=other
 Backtest
• Use a hold out sample from a later time period
 Monitor Performance
• Compare actual and expected results
• Compare input term distributions
 Don’t fit the noise.
Q: How do I model signal instead of noise ?
 Limit model complexity by using Validation Data.
 Decision Tree: Pruning
 Neural Network: Early Stopping
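Early stopping can be sketched as picking the training iteration with the lowest validation error and giving up once it stops improving. The `patience` rule and the error values below are illustrative assumptions.

```python
def early_stopping_iteration(validation_errors, patience=3):
    """Return the iteration with the lowest validation error seen before
    `patience` consecutive iterations passed without improvement."""
    best_iteration = 0
    best_error = validation_errors[0]
    for i, error in enumerate(validation_errors):
        if error < best_error:
            best_iteration, best_error = i, error
        elif i - best_iteration >= patience:
            break  # validation error has stopped improving: stop training
    return best_iteration


# Hypothetical validation errors per training iteration.
validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.30]

print(early_stopping_iteration(validation_errors))  # 3
```

The weights saved at the returned iteration are the ones kept. Pruning a decision tree plays the same role: subtrees are removed when they do not improve the validation data fit.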
Complexity -> Overfitting
[Figure: fit on the Training Set and on the Test Set]
Better Fitting … with a simpler model
[Figure: fit on the Training Set and on the Test Set]
How do I select the best model ?
ROC for overall model performance:
Decision Tree
Lift for targeted model performance:
Neural Network
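Both measures can be computed from model scores and known outcomes. The sketch below uses the area under the ROC curve as a one-number summary of overall performance and lift at a chosen depth for targeted performance. The sample scores and the function names are assumptions for illustration.

```python
def lift_at_depth(scores, targets, depth=0.1):
    """Response rate in the top `depth` fraction of scores, relative to the overall rate."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(round(len(ranked) * depth)))
    top_rate = sum(target for _, target in ranked[:k]) / k
    base_rate = sum(targets) / len(targets)
    return top_rate / base_rate


def auc(scores, targets):
    """Area under the ROC curve: chance that a responder outscores a non-responder."""
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scored hold-out sample.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
targets = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(lift_at_depth(scores, targets, depth=0.2), auc(scores, targets))
```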
Computers keep getting so much faster, why does
my neural network take so long to run?
The enemy: growth in data warehouses
In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those
surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey
respondents, the average large database — whether used for decision support or transaction
processing — increased its storage requirements three and one-half times in just two years.
DM-REVIEW
Single disk size growth
Clock speed vs. disk sizes
Even worse, it’s all about complexity of data
1M rows x 100 columns x 8 bytes = 800MB
1000 rows x 1000 columns x 8 bytes = 8MB
Which data is more complex ?
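One way to read the question, assuming complexity is driven by the number of candidate column pairs rather than by storage size, is sketched below. The pair count is an illustrative proxy, not a measure taken from the slide.

```python
def table_size_bytes(rows, columns, bytes_per_value=8):
    return rows * columns * bytes_per_value


def pairwise_interactions(columns):
    """Number of distinct column pairs a model could consider."""
    return columns * (columns - 1) // 2


for rows, columns in [(1_000_000, 100), (1_000, 1_000)]:
    print(
        rows, "rows x", columns, "columns:",
        table_size_bytes(rows, columns), "bytes,",
        pairwise_interactions(columns), "column pairs,",
        rows / columns, "rows per column",
    )
```

Under this reading the smaller table is the harder one: it has about 100 times as many column pairs and only one row per column to estimate them from.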
Q: So how does this make me money ?
 Models get deployed to operational systems
• New data is acquired
• Each case is scored with the model function
• Action taken on each case:
− Send promotion or don’t send promotion
− Select item for cross sell offer
− Grant credit or don’t grant credit
− Alert engineers that a manufacturing defect has been found.
 Model-driven decisions are nearly always better than
intuition
 …iff… the data miner has accounted for enough
sources of variation.
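The deployment loop above can be sketched as scoring each new case and mapping the score to an action. The model function, the cutoff of 0.5, and the case data are placeholders. In practice the model would come from the development step and the cutoff from the economics of the decision.

```python
def act_on_cases(cases, model, threshold=0.5):
    """Score each new case with the model function and choose an action."""
    actions = {}
    for case_id, inputs in cases.items():
        probability = model(inputs)
        if probability >= threshold:
            actions[case_id] = "send promotion"
        else:
            actions[case_id] = "don't send promotion"
    return actions


def toy_model(inputs):
    """Hypothetical model function: more past orders means a higher response probability."""
    return min(1.0, 0.2 * inputs["n_orders"])


# Hypothetical newly acquired cases.
new_cases = {"c1": {"n_orders": 4}, "c2": {"n_orders": 1}}

print(act_on_cases(new_cases, toy_model))
```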
Offline Applications
• ETL for model development and scoring
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Score tables pushed to external applications
[Figure: offline scoring architecture. Labels: Scheduled Scoring, ETL process, ETL engine, Model Development, Data Mining, Scoring Engine, BI Application, Campaign Planning, Operations, Campaign Execution, Information Technology, Data Store, Scores]
Online Applications
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Individual score requests contain one or more IDs
• Decision server translates score to action
[Figure: online scoring architecture. Labels: Scheduled Scoring, ETL process, ETL engine, Model Development, Scoring Engine, BI Application, Decision Server, Front Office Application, Data Store, Scores, Customer call center]
On-Demand Applications
• Model input data pre-loaded into data store
• New data provided by application
• Score engine pulls data by ID from data store, joins with new data
• Scores generated immediately
• Decision Server translates score to action
[Figure: on-demand scoring architecture. Labels: Scheduled Scoring, ETL process, ETL engine, Model Development, Front Office Application, Decision Server, Automation Application, Scoring Engine]
Fraud detection
Money laundering
Medical diagnostics
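The on-demand flow can be sketched as a lookup by ID, a join with the data supplied in the request, an immediate score, and a translation to an action. The data store contents, the field names, the scoring rule, and the action labels are all hypothetical.

```python
# Hypothetical pre-loaded model inputs, keyed by customer ID.
DATA_STORE = {101: {"n_orders": 4}}


def toy_model(inputs):
    """Hypothetical scoring function over stored and new inputs."""
    score = 0.1 * inputs.get("n_orders", 0) + 0.3 * inputs.get("items_in_cart", 0)
    return min(1.0, score)


def score_on_demand(customer_id, new_data, data_store, model):
    """Pull stored inputs by ID, join them with the new data, and score immediately."""
    stored = data_store.get(customer_id, {})
    inputs = {**stored, **new_data}  # new data overrides stored values on a clash
    return model(inputs)


def decision_server(score, threshold=0.5):
    """Translate a score into an action."""
    return "make cross-sell offer" if score >= threshold else "no offer"


request = {"items_in_cart": 1}  # new data provided by the application
print(decision_server(score_on_demand(101, request, DATA_STORE, toy_model)))
```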
Q: So what are the cool applications right now ?
 GOOGLE, YAHOO, ASK, etc…
• Huge model training task:
index and summarize the web
• Techniques:
text data processing; page rank
• Real time scoring task:
process your query
 NETFLIX
• $1M challenge:
beat their statisticians
• Huge sparse matrix:
fill in the blanks
• Techniques
SVD by numerical approximation
aka: Hebbian-learning Neural Net
Ensembles
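The "fill in the blanks" idea can be sketched as fitting factors to the observed cells only and then multiplying them to predict the missing ones. This is a simplified illustration: it fits a single factor by incremental gradient updates, and the ratings, learning rate, and starting values are assumptions. It is not the method of any particular contest entry.

```python
def factorize_rank_one(ratings, n_users, n_items, learning_rate=0.01, epochs=500):
    """Fit one user factor and one item factor to the observed (user, item, value) cells."""
    user = [0.1] * n_users  # small deterministic starting values (an assumption)
    item = [0.1] * n_items
    history = []
    for _ in range(epochs):
        total = 0.0
        for u, i, value in ratings:
            error = value - user[u] * item[i]
            total += error ** 2
            u_old, i_old = user[u], item[i]
            user[u] += learning_rate * error * i_old
            item[i] += learning_rate * error * u_old
        history.append(total / len(ratings))
    return user, item, history


# Hypothetical sparse ratings: 3 users x 3 items, 6 of 9 cells observed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 1)]

user, item, history = factorize_rank_one(ratings, n_users=3, n_items=3)
print("error:", history[0], "->", history[-1])
print("predicted rating for user 0, item 2:", user[0] * item[2])
```

Further factors could be fitted to the residuals in the same way, one at a time.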