SAS Enterprise Miner

Download Report

Transcript SAS Enterprise Miner

Data Mining
A special presentation for BCIS 4660
Spring 2012
Dr. Nick Evangelopoulos, ITDS Dept.
Some slide material taken from: Groth, Han and Kamber, SAS Institute
Overview of this Presentation
• Introduction to Data Mining
• Examples of Data Mining applications
• The SEMMA Methodology
• SAS EM Demo: The Home Equity Loan Case
Logistic Regression
Decision Trees
Neural Networks
Introduction to DM
“It is a capital mistake to theorize
before one has data. Insensibly one
begins to twist facts to suit theories,
instead of theories to suit facts.”
Sir Arthur Conan Doyle: Sherlock Holmes, "A
Scandal in Bohemia"
What Is Data Mining?
• Data mining (knowledge discovery in
databases):
– A process of identifying hidden patterns
and relationships within data (Groth)
• Data mining:
– Extraction of interesting (non-trivial,
implicit, previously unknown and
potentially useful) information or
patterns from data in large databases
Multidisciplinary
Statistics
Pattern
Neurocomputing
Recognition
Machine
Data Mining Learning
Databases
KDD
AI
Architecture of a Typical Data Mining
System
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data
warehouse server
Filtering
Data cleaning &
data integration
Databases
Data
Warehouse
A Data Mining example:
The Home Equity Loan Case
• The analytic goal is to determine
who should be approved for a
home equity loan.
• The target variable is a binary
variable that indicates whether
an applicant eventually defaulted
on the loan.
• The input variables are variables
such as the amount of the loan,
amount due on the existing
mortgage, the value of the
property, and the number of
recent credit inquiries.
HMEQ case overview
– The consumer credit department of a bank wants to automate
the decision-making process for approval of home equity lines
of credit. To do this, they will follow the recommendations of
the Equal Credit Opportunity Act to create an empirically
derived and statistically sound credit scoring model. The
model will be based on data collected from recent applicants
granted credit through the current process of loan
underwriting. The model will be built from predictive modeling
tools, but the created model must be sufficiently interpretable
so as to provide a reason for any adverse actions (rejections).
– The HMEQ data set contains baseline and loan performance
information for 5,960 recent home equity loans. The target
(BAD) is a binary variable that indicates if an applicant
eventually defaulted or was seriously delinquent. This adverse
outcome occurred in 1,189 cases (20%). For each applicant, 12
input variables were recorded.
The HMEQ Loan process
1. An applicant comes forward with a specific
property and a reason for the loan (HomeImprovement, Debt-Consolidation)
2. Background info related to job and credit
history is collected
3. The loan gets approved or rejected
4. Upon approval, the Applicant becomes a
Customer
5. Information related to how the loan is serviced
is maintained, including the Status of the loan
(Current, Delinquent, Defaulted, Paid-Off)
The HMEQ Loan
Transactional Database
• Entity Relationship Diagram (ERD), Logical Design:
Loan
Reason
Approval
Date
APPLICANT
Applies for
HMEQ Loan on…
using…
becomes
PROPERTY
Balance
Status
OFFICER
CUSTOMER
ACCOUNT
MonthlyPayment
has
HISTORY
HMEQ Transactional database:
the relations
• Entity Relationship Diagram (ERD), Physical Design:
Officer
HMEQLoanApplication
OFFICERID
OFFICERNAME
PHONE
FAX
OFFICERID
APPLICANTID
PROPERTYID
LOAN
REASON
DATE
APPROVAL
Applicant
APPLICANTID
NAME
JOB
DEBTINC
YOJ
DEROG
CLNO
DELINQ
CLAGE
NINQ
Property
PROPERTYID
ADDRESS
VALUE
MORTDUE
Customer
Account
CUSTOMERID
APPLICANTID
NAME
ADDRESS
ACCOUNTID
CUSTOMERID
PROPERTYID
ADDRESS
BALANCE
MONTHLYPAYMENT
STATUS
History
HISTORYID
ACCOUNTID
PAYMENT
DATE
The HMEQ Loan
Data Warehouse Design
• We have some slowly changing attributes:
HMEQLoanApplication: Loan, Reason, Date
Applicant: Job and Credit Score related attributes
Property: Value, Mortgage, Balance
• An applicant may reapply for a loan, then
some of these attributes may have
changed.
– Need to introduce “Key” attributes and make
them primary keys
The HMEQ Loan
Data Warehouse Design
STAR 1 – Loan Application facts
• Fact Table: HMEQApplicationFact
• Dimensions: Applicant, Property, Officer, Time
STAR 2 – Loan Payment facts
• Fact Table: HMEQPaymentFact
• Dimensions: Customer, Property, Account, Time
Two Star Schemas for HMEQ Loans
Applicant
APPLICANTKEY
APPLICANTID
NAME
JOB
DEBTINC
YOJ
DEROG
CLNO
DELINQ
CLAGE
NINQ
Officer
OFFICERKEY
OFFICERID
OFFICERNAME
PHONE
FAX
Property
Customer
PROPERTYKEY
PROPERTYID
ADDRESS
VALUE
MORTDUE
CUSTOMERKEY
CUSTOMERID
APPLICANTID
NAME
ADDRESS
HMEQApplicationFact
HMEQPaymentFact
APPLICANTKEY
PROPERTYKEY
OFFICERKEY
TIMEKEY
LOAN
REASON
APPROVAL
CUSTOMERKEY
PROPERTYKEY
ACCOUNTKEY
TIMEKEY
BALANCE
PAYMENT
STATUS
Time
Account
TIMEKEY
DATE
MONTH
YEAR
ACCOUNTKEY
LOAN
MATURITYDATE
MONTHLYPAYMENT
The HMEQ Loan DW:
Questions asked by management
• How many applications were filed each month during the
last year? What percentage of them were approved each
month?
• How has the monthly average loan amount been
fluctuating during the last year? Is there a trend?
• Which customers were delinquent in their loan payment
during the month of September?
• How many loans have defaulted each month during the
last year? Is there an increasing or decreasing trend?
• How many defaulting loans were approved last year by
each loan officer? Who are the officers with the largest
number of defaulting loans?
The HMEQ Loan DW:
Some more involved questions
• Are there any patterns suggesting which applicants are
more likely to default on their loan after it is approved?
• Can we relate loan defaults to applicant job and credit
history? Can we estimate probabilities to default based
on applicant attributes at the time of application? Are
there applicant segments with higher probability?
• Can we look at relevant data and build a predictive
model that will estimate such probability to default on the
HMEQ loan? If we make such a model part of our
business policy, can we decrease the percentage of
loans that eventually default by applying more stringent
loan approval criteria?
Selecting Task-relevant attributes
Applicant
APPLICANTKEY
APPLICANTID
NAME
JOB
DEBTINC
YOJ
DEROG
CLNO
DELINQ
CLAGE
NINQ
Officer
OFFICERKEY
OFFICERID
OFFICERNAME
PHONE
FAX
Property
Customer
PROPERTYKEY
PROPERTYID
ADDRESS
VALUE
MORTDUE
CUSTOMERKEY
CUSTOMERID
APPLICANTID
NAME
ADDRESS
HMEQApplicationFact
HMEQPaymentFact
APPLICANTKEY
PROPERTYKEY
OFFICERKEY
TIMEKEY
LOAN
REASON
APPROVAL
CUSTOMERKEY
PROPERTYKEY
ACCOUNTKEY
TIMEKEY
BALANCE
PAYMENT
STATUS
Time
Account
TIMEKEY
DATE
MONTH
YEAR
ACCOUNTKEY
LOAN
MATURITYDATE
MONTHLYPAYMENT
HMEQ final task-relevant data file
Name
Model Role
Measurement Level
Description
BAD
Target
Binary
1=defaulted on loan, 0=paid back loan
REASON
Input
Binary
HomeImp=home improvement,
DebtCon=debt consolidation
JOB
Input
Nominal
Six occupational categories
LOAN
Input
Interval
Amount of loan request
MORTDUE
Input
Interval
Amount due on existing mortgage
VALUE
Input
Interval
Value of current property
DEBTINC
Input
Interval
Debt-to-income ratio
YOJ
Input
Interval
Years at present job
DEROG
Input
Interval
Number of major derogatory reports
CLNO
Input
Interval
Number of trade lines
DELINQ
Input
Interval
Number of delinquent trade lines
CLAGE
Input
Interval
Age of oldest trade line in months
NINQ
Input
Interval
Number of recent credit inquiries
HMEQ: Modeling Goal
– The credit scoring model should compute the
probability of a given loan applicant to default
on loan repayment. A threshold is to be
selected such that all applicants whose
probability of default is in excess of the
threshold are recommended for rejection.
– Using the HMEQ task-relevant data file, three
competing models will be built: A logistic
Regression model, a Decision Tree, and a
Neural Network
– Model assessment will allow us to select the
best of the three alternative models
Predictive Modeling
Inputs
Cases
.. .. .. .. .. .. .. .. ..
. . . . . . . . .
Target
...
...
...
...
...
...
...
...
...
...
.. ..
. .
Introducing
SAS Enterprise Miner (EM)
The SEMMA Methodology
– Introduced By SAS Institute
– Implemented in SAS Enterprise Miner (EM)
– Organizes a DM effort into 5 activity groups:
Sample
Explore
Modify
Model
Assess
Sample
Input Data Source
Sampling
Data Partition
Explore
Distribution
Explorer
Association
Multiplot
Variable Selection
Insight
Link Analysis
Modify
Data Set
Attributes
Clustering
Transform
Variables
Filter
Outliers
Self-Organized Maps
Kohonen Networks
Time Series
Replacement
Model
Regression
User Defined
Model
Tree
Ensemble
Neural Network
Memory Based
Reasoning
Princomp/
Dmneural
Two-Stage Model
Assess
Assessment
Reporter
SAS EM Demo: HMEQ Case