Rule induction

Overview of Data Mining Methods
Data mining techniques
What techniques do, examples,
advantages & disadvantages
Contents
Reviews data mining tools
Compares data mining perspectives
Discusses data mining functions
Presents four sets of data used to demonstrate
tools in subsequent chapters
Shows the Enterprise Miner structure for data
mining analysis in the appendix
4-2
Data mining applications
Automobile insurance company: Fraud detection
Business applications: loan evaluation, customer
segmentation, employee evaluation…
Data mining tools are categorized by task: classification,
estimation, prediction, clustering, and summarization.
Classification, estimation, prediction are predictive,
while clustering and summarization are descriptive.
4-3
History
Statistics
AI:
genetic algorithms, neural networks
analogies with biology
memory-based reasoning
link analysis from graph theory
See Table 4.1
4-4
Data mining perspectives
Methods can be viewed from different perspectives.
Data mining methods include:
Cluster analysis (Chapter 5)
Regression of various forms (best fit methods, chapter 6)
Discriminant analysis (use of regression for
classification, chapter 6)
Line fitting through the operations research tool of
multiple objective linear programming (Chapter 9)
AI:
ANN (chapter 7)
Rule induction (decision trees, chapter 8)
Genetic algorithms (supplement)
See page 55 for more descriptions
4-5
Techniques
Statistical
Market-Basket Analysis - find groups of items
Memory-Based Reasoning - case based
Cluster Detection - undirected (quantitative)
Artificial Intelligence
Link Analysis - MCI’s Friends & Family
Decision Trees, Rule Induction - production rule
Neural Networks - automatic pattern detection
Genetic Algorithms - keep best parameters
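To make "find groups of items" concrete, here is a minimal sketch of the counting step behind market-basket analysis. The transactions, the `pair_support` helper, and the 0.4 support cutoff are all invented for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "chips"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)

# Keep pairs that appear in at least 40% of baskets (an assumed cutoff).
frequent = {pair: s for pair, s in support.items() if s >= 0.4}
print(frequent)
```

Pairs that pass the cutoff are the candidate "groups of items"; a fuller analysis would also look at larger item sets and at rule confidence.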
4-6
Models
Regression: Y = a + bX
Classification: assign new record to class
Predictive: assign value to new record
Clustering: groups for data
Time-series: assign future value
Links: patterns in data
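As an illustration of the regression model Y = a + bX, the sketch below estimates a and b by ordinary least squares and then assigns a value to a new record. The data points and the `fit_line` helper are made up for this example; the slides do not say which fitting method is used.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (a, b) for the line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical observations that lie exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
new_x = 6
predicted = a + b * new_x  # value assigned to a new record
print(f"a={a}, b={b}, prediction for X={new_x}: {predicted}")
```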
4-7
Fitting
Underfitting: not enough detail
leave out important variables
Overfitting: too much detail
memorizes training set, but doesn’t help with
new data
data set too small
redundancy in data
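The contrast between underfitting and overfitting can be sketched on a tiny made-up data set. Assumptions for this example only: the true relation is y = 2x, the training labels carry +1/-1 noise, "underfit" is a model that ignores x, and "overfit" is a model that memorizes the training set and returns the label of the nearest training point.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: true relation y = 2x, training labels have +1/-1 noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def underfit(x):
    """Leaves out the important variable: always predicts the training mean."""
    return mean_y

# Least-squares line y = a + b*x fitted on the training data.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
     / sum((x - mean_x) ** 2 for x in train_x))
a = mean_y - b * mean_x

def linear(x):
    return a + b * x

def memorizer(x):
    """Memorizes the training set: returns the label of the nearest training x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

models = {"underfit": underfit, "linear": linear, "memorizer": memorizer}
results = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
           for name, m in models.items()}

for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

On this data the memorizer has zero training error but a larger test error than the fitted line, while the mean-only model is poor on both sets.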
4-8
Comparison of Features
                 Rules      Neural Net  CaseBase   Genetic
Noisy data       Good       Very good   Good       Very good
Missing data     Good       Good        Very good  Good
Large sets       Very good  Poor        Good       Good
Different types  Good       Numerical   Very good  Transform
Accuracy         High       Very high   High       High
Explanation      Very good  Poor        Very good  Good
Integration      Good       Good        Good       Very good
Ease             Easy       Difficult   Easy       Difficult
4-9
Data Mining Functions
Classification
Identify categories in data
Prediction
Formula to predict future observations
Association
Rules using relationships among entities
Detection
Anomalies (unusual) & irregularities (fraud detection)
4-10
Financial Applications
Technique   Application                            Problem Type
Neural net  Forecast stock price                   Prediction
NN, Rule    Forecast bankruptcy; Fraud detection   Prediction; Detection
NN, Case    Forecast interest rate                 Prediction
NN, visual  Late loan detection                    Detection
Rule        Credit assessment; Risk classification Prediction; Classification
Rule, Case  Corporate bond rate                    Prediction
4-11
Telecom Applications
Technique                   Application                Problem Type
Neural net, Rule induction  Forecast network behavior  Prediction
Rule induction              Churn; Fraud detection     Classification; Detection
Case based                  Call tracking              Classification
4-12
Marketing Applications
Technique                        Application                              Problem Type
Rule induction                   Market segment; Cross-selling            Classification; Association
Rule induction, visual           Lifestyle analysis; Performance analysis Classification; Association
Rule induction, genetic, visual  Reaction to promotion                    Prediction
Case based                       Online sales support                     Classification
4-13
Web Applications
Technique                      Application                        Problem Type
Rule induction, Visualization  User browsing similarity analysis  Classification, Association
Rule-based heuristics          Web page content similarity        Association
4-14
Other Applications
Technique                   Application                          Problem Type
Neural net                  Software cost                        Detection
Neural net, rule induction  Litigation assessment                Prediction
Rule induction              Insurance fraud; Healthcare except.  Detection; Detection
Case based                  Insurance claim; Software quality    Prediction; Classification
Genetic algorithm           Budget spending                      Classification
4-15
Data Sets
Loan Applications: classification
Job Applications: classification
Insurance Fraud: detection
Expenditure Data: prediction
4-16
Loan Data
650 observations
OUTCOMES (binary):
On-time (cost of error: $300)
Late (default) (cost of error: $2,000)
Variables
Age, Income, Assets, Debts, Want, Credit
Credit ordinal
Transform: Assets, Debts, & Want → Risk
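Because the two kinds of error carry different costs, counting errors alone can mislead. The sketch below totals misclassification cost for two hypothetical models. It assumes that the $300 cost applies when an on-time applicant is classified as late and the $2,000 cost when a late applicant is classified as on-time; the slide lists the two costs without spelling out that pairing. The records and model outputs are invented.

```python
# Assumed pairing of costs to error types (not spelled out in the slide).
COST_ON_TIME_CALLED_LATE = 300
COST_LATE_CALLED_ON_TIME = 2000

def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan record."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "on-time" and p == "late":
            cost += COST_ON_TIME_CALLED_LATE
        elif a == "late" and p == "on-time":
            cost += COST_LATE_CALLED_ON_TIME
    return cost

# Hypothetical outcomes and predictions, for illustration only.
actual  = ["on-time", "on-time", "on-time", "late", "late"]
model_a = ["late", "late", "on-time", "late", "late"]           # 2 errors
model_b = ["on-time", "on-time", "on-time", "on-time", "on-time"]  # 2 errors

cost_a = total_error_cost(actual, model_a)
cost_b = total_error_cost(actual, model_b)
print(f"Model A cost: ${cost_a}, Model B cost: ${cost_b}")
```

Both models make two errors, yet under the assumed costs their totals differ widely, which is why the data set records a cost per error type.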
4-17
Job Application Data
500 observations
OUTCOMES (ordinal):
Unacceptable
Minimal
Acceptable
Excellent
Variables
Age, State, Degree, Major, Experience
State nominal; degree & major ordinal
State is superfluous
4-18
Insurance Claim Data
5000 observations
OUTCOMES (binary):
OK (cost of error: $500)
Fraudulent (cost of error: $2,500)
Variables
Age, Gender, Claim, Tickets, Prior claims, Attorney
Gender & attorney nominal, tickets & prior claims categorical
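Unequal error costs also shift the point at which a claim is worth investigating. The sketch below derives a break-even fraud probability, assuming that $500 is the cost of treating an OK claim as fraudulent and $2,500 the cost of treating a fraudulent claim as OK. That pairing, the function names, and the example probabilities are assumptions for illustration.

```python
# Assumed pairing of costs to error types (not spelled out in the slide).
COST_FLAG_OK_CLAIM = 500   # an OK claim is treated as fraudulent
COST_MISS_FRAUD = 2500     # a fraudulent claim is treated as OK

def break_even_probability(cost_false_alarm, cost_miss):
    """Fraud probability p at which both actions have equal expected cost.

    Expected cost of paying the claim:   p * cost_miss
    Expected cost of investigating it:   (1 - p) * cost_false_alarm
    Setting them equal gives p = cost_false_alarm / (cost_false_alarm + cost_miss).
    """
    return cost_false_alarm / (cost_false_alarm + cost_miss)

def decide(p_fraud):
    """Pick the action with the lower expected cost for one claim."""
    threshold = break_even_probability(COST_FLAG_OK_CLAIM, COST_MISS_FRAUD)
    return "investigate" if p_fraud > threshold else "pay"

threshold = break_even_probability(COST_FLAG_OK_CLAIM, COST_MISS_FRAUD)
print(f"Break-even fraud probability: {threshold:.3f}")
print(decide(0.10), decide(0.20))
```

Under these assumed costs a claim is worth investigating once its estimated fraud probability exceeds one in six, well below the 50% cutoff that equal costs would imply.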
4-19
Expenditure Data
10,000 observations
OUTCOMES:
Could predict response in a number of categories
Others
Variables:
Age, Gender, Marital, Dependents, Income, Job years,
Town years, Education years, Drivers license, Own
home, Number of credit cards
Churn, proportion of income spent on seven
categories
4-20