Transcript Document

Credit Card Analysis of Czech Bank
Goals
•
Our goal for this project is to analyze customer
and credit-card information, from the Berka
dataset, to extrapolate the type of customer who
makes a good candidate for a credit-card, and
what level of credit to extend to that customer.
•
The Berka dataset is from the 1999
PKDD Discovery Challenge.
•
The Berka dataset is a collection of
financial information from a Czech bank.
Credit Card Analysis of Czech Bank
Domain Description
Entity-Relationship
Description
•
•
•
•
Each account has both static
characteristics (e.g. date of
creation, address of the branch)
given in relation "account" and
dynamic characteristics (e.g.
payments debited or credited,
balances) given in relations "permanent order" and
"transaction".
Relation "client" describes characteristics of persons who can
manipulate with the accounts.
Relations "loan" and "credit card" describe some services
which the bank offers to its clients;
Relation "demographic data" gives some publicly available
information about the districts (e.g. the unemployment rate);
additional information about the clients can be deduced from
this.
Credit Card Analysis of Czech Bank
Domain Description - Continued
 The dataset contains the following tables:
• Accounts
o Each record describes static characteristics of an
account
o Size: 4500 records
• Clients
o Each record describes characteristics of a client
o Size : 5369 records
• Disposition (Disp)
o Each record relates a client with an account and
describes the client’s right to operate that account
o Size: 5369 records
Credit Card Analysis of Czech Bank
Domain Description - Continued
 Dataset Table Description, Continued
• Permanent Orders, Debit Only (Orders)
o Each record describes characteristics of a
payment order
o Size : 6471 records
• Transactions (Trans)
o Each record describes one transaction on an
account
o Size: 1056320 records
• Loans
o Each record describes a loan granted for a given
account
o Size: 682 records
Credit Card Analysis of Czech Bank
Domain Description - Continued
 Dataset Table Description, Continued
• Credit Cards (Cards)
o Each record describes a credit card issued to an
account
o Size : 892 records
• Demographic Data (District)
o Each record describes demographic
characteristics of a district
o Size: 77 records
Credit Card Analysis of Czech Bank
Data Preprocessing Activities
1.
Converted the ascii files to:
a.
MS Excel and/or MS Word files for cleaning data
b.
MS Access database for use in
i.
data mining,
ii. de-normalizing or ‘flattening’ files, and
iii. basic querying to learn more about the data.
c.
Put all modified files into file types recognized by Weka.
i.
These files are comma delimited with a ‘heading’ of
attribute definition information.
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
2.
Verified all table relationships:
a.
Every account has an Owner via Disp and Account tables
b.
Order and Loan records are duplicated in transaction records. That
is, the transactions include Order records and Loan payments.
i.
Loan records in Trans are identified by k_symbol=”LP”
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
3. De-normalize, or ‘flatten’, files for mining. Our database is
relational. In order to mine or cluster attributes, those attributes
must be in a single table. We have created a de-normalized table
based on our goals.
a.
Goal: Analyze credit-card information to extrapolate the type of
customer who makes a good candidate for a credit-card.
i.
Account-Client-Disp-Card-District-Loan-Transaction Table
•
Using information we discovered about accounts from
previous clustering, cluster customer information

Using card type as clustering attribute

Added "N" (None) as a possible value to the Loan Status
attribute
•
In order to better understand customers, we looked at this
table in two ways:
•
To identify, from all customers, which were credit card
holders and which were not.
•
To examine the variances that exist between all credit
card holding customers.
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
4. Add, change, remove, and descretize attributes as necessary.
The tables shown below describe the changes made:
Preprocessing Activities – Account Table:
Changes Made With Excel
Column
Description
Changes
Missing
or Invalid
Values
Notes
Frequency
Frequency of Statement
Issuance
Translated values as follows:
•POPLATEK MESICNE changed to
MONTHLY ISSUANCE (MI)
•POPLATEK TYDNE changed to WEEKLY
ISSUANCE (WI)
•POPLATEK PO OBRATU change to
ISSUANCE AFTER TRANSACTION (TI)
N/A
Translated for ease of use
Date
Date of account creation
Removed
Ignore
This attribute is not used in
our mining effort
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Client Table:
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid
Values
Notes
BirthNumber
Birthday and gender
Removed
This is a 6-digit number. The
documentation says that its format is as
follows:
- YYMMDD
(Men)
- YYMM50+DD
(Women)
Analysis suggests that the format is as
follows:
- YYMMDD
(Men)
- YY50+MMDD
(Women)
Format changed to:
- MM/DD/YYYY
- Created a new field for Client_Sex
N/A
Attribute removed from
dataset and replaced by
attributes Client_Sex and
Client_Age
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Client Table, Continued:
Column
Description
Changes
Missing or
Invalid
Values
Notes
Client_Sex
Gender derived from
BirthNumber attribute
Added
Built during data pre-processing.
Values are MALE (M) and FEMALE (F)
N/A
Distribution of Values:
- 49% Male
- 51% Female
Client_Age
Discretized value derived
from BirthNumber attribute
Values:
- 1 = YOUTH
(0 - 24)
- 2 = ADULT
(24 - 35)
- 3 = MIDDLE-AGE
(36 - 64)
- 4 = SENIOR
(65 - *)
Added
Built during data pre-processing.
N/A
None
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Disposition Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid
Values
Notes
Type
Type of
Disposition
(owner/user)
Translated as follows:
•Disponent = User
•Owner = Owner
User
Distribution of Values:
- 84% Owner
- 16% User
Preprocessing Activities – Order Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid Values
Notes
K_Symbol
Characterization of the
payment
Removed
22% of the
values are
missing
Removed attribute due to
missing values
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Loan Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid
Values
Notes
Date
Date the loan was
granted
Changed format from YYMMDD
to MM/DD/YYYY
Ignore
Correlation with
Status Attribute:
• R2 = 98%
• r = 0.49
Amount
Amount of loan
•Removed
•Discretized values stored in
Amt attribute
N/A
Correlation with
Status Attribute:
• R2 = 68%
• r = 0.34
Duration
•Duration of the loan
•Possible Values:
• 12 months
• 24 months
• 36 months
• 48 months
• 60 months
•Removed
•Discretized values stored in
Dur attribute
N/A
•Correlation with
Status attribute:
• R2 = 100%
• r = 0.51
•Distribution of
Values:
• 12 = 19%
• 20 = 20%
• 36 = 19%
• 40 = 20%
• 60 = 21%
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Loan Table, Continued
Column
Description
Changes
Missing or
Invalid
Values
Notes
Payments
Monthly loan payment
•Removed
•Discretized values stored in
Amt attribute
N/A
Dur
Discretized Duration
Attribute:
• (*,30)
• (31,42)
• (43,54)
• (55,*)
•Added
•Discretized in Rosetta using
Entropy Algorithm
N/A
Distribution of
values:
• (*,30) = 39%
• (31,42) = 19%
• (43,54) = 20%
• (55,*) = 21%
Pmt
Discretized Payment
Attribute:
• (*,8041) = 1
• (8041,*) = 2
•Added
•Discretized in Rosetta using
Entropy Algorithm
•Rosetta discretized into 50+
values.. We merged values
using <1%
N/A
Distribution of
values:
• (*,8041) = 94%
• (8041,*) = 6%
Amt
Discretized Amount
Attribute:
• (*,30708) = 1
• (30709,49380) = 2
• (49381,76926) = 3
• (76927,230310) = 4
• (230311,*) = 5
•Added
•Discretized in Rosetta using
Entropy Algorithm
•Rosetta discretized into 100+
values.. We merged values
using <1%
N/A
Distribution of
values:
• (*,30708) = 9%
• (30709,49380) =
10%
• (49381,76926) =
13%
• (76927,230310) =
47%
• (230311,*) =
21%
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Transaction Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid
Values
Date
Date of transaction
Format changed from YYMMDD to
MM/DD/YYYY
Ignore
Type
+/- transaction
Translated Values:
• PRIJEM = Credit
• VYDAJ = Withdrawal
Flag Record Do not use
Operation
Mode of transaction
Translated values:
• VYBER KARTOU = Credit Card
Withdrawal (CCW)
• VKLAD = Credit in Cash (CRC)
• PREVOD Z UCTU = Collection from
Another Bank (CAB)
• VYBER = Withdrawal in Cash (WC)
• PREVOD NA UCET = Remittance
to Another Bank (RAB)
Flag record Do not use
K_Symbol
Characterization of
transaction
Translated Values:
• POJISTNE = Insurance Payment
(IP)
• SLUZBY = Payment on Statement
(PS)
• UROK = Interest Credited (ICR)
• SANKC. UROK - Sanction Interest
(SI)
• SIPO - Household (H)
• DUCHOD - Old-age Pension (OP)
• UVER - Loan Payment (LP)
Flag record Do not use
Notes
Translated for ease of
use
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Demographic Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid Values
Notes
A2
District Name
Replaced Spaces with
underscore ("_")
N/A
A3
Region Name
Replaced Spaces with
underscore ("_")
N/A
Distribution of Values:
• Prague = 1%
• Central_Bohemia = 16%
• South_Bohemia = 10%
• West_Bohemia = 13%
• North_Bohemia = 13%
• East_Bohemia = 16%
• South_Moravia = 18%
• North_Moravia = 14%
Issued
Date card issued
•Removed
•Eliminated null timestamp and changed date
format from YYMMDD to
MM/DD/YYYY
Ignore
For our purposes, date card
issued is not material
Credit Card Analysis of Czech Bank
Data Preprocessing Activities - Continued
Preprocessing Activities – Credit Card Table
Changes Made With Excel
Column
Description
Changes
Missing or
Invalid
Values
Notes
Issued
Date card issued
•Removed
•Eliminated null time-stamp and
changed date format from YYMMDD
to MM/DD/YYYY
Ignore
For our purposes, date
card issued is not
material
Credit Card Analysis of Czech Bank
Methodology
 Methodology Overview
1. Attribute Ranking
2. Classification Analysis
3. Clustering Analysis
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
•
Classification is a process where a model is built describing a
predetermined set of data classes.
• The model is constructed by analyzing all the records in the
database.
• Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes called the class label.
• The tuples analyzed to build the model form the training data
set.
• Typically, the learned model is expressed in terms of decision
trees or classsification rules.
• These rules can be used to predict the test data set.
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
•
We used See5 as the classification tool for our project. A brief
description of the algorithm is as follows:
• The tree starts as a single node representing the training
samples.
•
If the samples are all of the same class, then the node
becomes a leaf and is labeled with that class.
• Otherwise, the algorithm uses an entropy-based
measure known as Information Gain as a heuristic for
selecting the attribute that will best separate the
samples into individual classes. This attribute
becomes the "test" or "decision" attribute at the node.
•
A branch is created for each known value of the test attribute
and the samples are partitioned accordingly.
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
•
See5 Algorithm, cont.
• The algorithm uses the same process recursively to form a
decision tree for the samples at each partition.
•
The recursive partition stops when all the samples for a given
node belong to the same class or if there are no remaining
attributes on which samples may be further partitioned.
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
We ran See5 using the RuleSets
option and the Boost option with 3
Client_District_ID =
10
trials. This represents simplified
version of the decision tree generated
Client_District ID =
38
by See5. To build this tree, we used
Trans_Avg_Balance
Class: 1
Rules obtained, which had high
> 46123.76
(8.5 /0.9)
confidence (>85%).
Trans_Avg_Balance
<= 37097.18
Class :0
(1828.3/93.4)
Class: 1
(6.5)
Account_Opened =
1995
Class: 1
(8.3)
Client_District ID =
1
Client_District ID =
15
Trans_Avg_Balance
> 66899.77
Client_Age = Senior
Class: 1
(8.3)
Class: 1
(5.3)
Class: 1
(5.3)
Client_District ID =
13
Class: 1
(5.3)
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
See5 also has the ability to express the clasifiers as Rule sets, which are
easier to understand. 96 rules were generated, out of which we have
listed 3 rules. All of these have confidence of more than 90%
Rule 1
Rule 1/1: (1828.4/93.4, lift 1.3)
Trans_Avg_Balance <= 37097.18 --> class 0 [0.948]
If Trans_Avg_Balance <= 37097.18, Then Class Prediction is
taken as Non-CardHolder
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
Rule 2
Rule 1/3: (8.3, lift 3.7)
Client_Age = M
Client_District_ID = 1
Trans_Avg_Balance > 66899.77 --> class 1 [0.903]
If (Client_Age = Middle-Age) and
(Client_District_ID = 1) and (Trans_Avg_Balance =
66899.77), Then Class Prediction is that
of a Card
Holder.
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
Rule 3
Rule 1/4: (8.3, lift 3.7)
Client_District_ID = 60
Trans_Avg_Balance > 46123.76
Loan_Status = none --> class 1 [0.903]
If (Client_District_ID = 60) and (Trans_Avg_Balance >
46123.76) and (Loan_Status = none), Then Class Prediction is
that of a Card Holder.
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
Training Set Evaluation
•
•
•
•
The estimated predictive error = (64 + 390)/3600 = 12.6%
The percentage of instances that were correctly classified as Non-card holders
= (2816/2880) = 97%
The percentage of instances that were incorrectly classified as Non-card
holders = (64/2880) = 22.2%
The percentage of instances that were in-correctly classified as Card holders =
(390/720) = 54%
The percentage of instances that were correctly classified as Card holders =
(330/720) = 45%
Credit Card Analysis of Czech Bank
Methodology – Classification Analysis
Test Set Evaluation
•
•
•
•
•
The estimated predictive error on the Test set is = ( 59 +119)/900 = 19.7%
The percentage of instances that were correctly classified as Non-card holders
= (669/728) =91%
The percentage of instances that were incorrectly classified as Non-card
holders = (59/728) = 8.1%
The percentage of instances that were in-correctly classified as Card holders =
(119/172) = 69%
The percentage of instances that were correctly classified as Card holders =
(53/172)= 73.6%
Credit Card Analysis of Czech Bank
Methodology – Cluster Analysis, Conceptual Method
•
We used cluster analysis to partition the data into a set of classes,
grouping together customers or attributes with similar
characteristics.
•
The purpose of cluster analysis is to place observations into
groups or clusters suggested by the data such that observations
in a given cluster tend to be similar to each other in some sense,
and objects in different clusters tend to be dissimilar.
•
The COBWEB Algorithm was chosen for this task. The
COBWEB algorithm used was implemented in the Weka Toolkit.
Credit Card Analysis of Czech Bank
Methodology – Cluster Analysis , Conceptual Method
•
The COBWEB algorithm was run twice. Once with a cutoff value of
0.18 and once with a cutoff value of 0.17.
• This is done because Weka returns the cluster information
only for the leaf clusters and prunes the tree by levels.
•
The results of the COBWEB Clustering analysis are interpreted in
the following chart.
Cluster 0
# Instances: 892
Cluster 1
# Instances: 423
Cluster 2
# Instances: 71
Cluster 3
# Instances: 185
Cluster 7
# Instances: 469
Cluster 4
# Instances: 167
Cluster 5
# Instances: 83
Cluster 57
# Instances: 278
Cluster 6
# Instances: 84
Cluster 58
# Instances: 191
Cluster 59
# Instances: 88
Cluster 60
# Instances: 103
Credit Card Analysis of Czech Bank
Methodology – Cluster Analysis , Partitioning Method
Description
Number of
Instances
All Card
Cluster 2
Holders
892
Cluster 3
75
182
Cluster 5
Cluster 6
83
Cluster 7
83
Cluster 57
Cluster 59
Cluster 60
469
277
94
83
Percentage Distributions within Clusters by AGE
Youth
21.00%
12.00%
0.00%
98.00%
0.00%
20.00%
0.00%
100.00%
0.00%
Adult
21.00%
24.00%
0.00%
0.00%
94.00%
19.00%
0.00%
0.00%
93.00%
Middle Age
56.00%
60.00%
100.00%
0.00%
0.00%
59.00%
100.00%
0.00%
0.00%
2.00%
4.00%
0.00%
1.00%
6.00%
2.00%
0.00%
0.00%
6.00%
Senior
Percentage Distributions within Clusters by GENDER
Female
47.00%
100.00%
100.00%
100.00%
100.00%
0.00%
0.00%
0.00%
0.00%
Male
53.00%
0.00%
0.00%
0.00%
0.00%
100.00%
100.00%
100.00%
100.00%
Mean Value of AVERAGE BALANCE
Mean
51653.23
54304.88
51668.22
46762.24
51136.43
52179.92
53217.44
48965.39
51136.43
Std. Deviation
11109.32
12241.39
9367.23
13780.03
10064.00
10983.18
11055.47
11241.49
10064.01
Median
51292.00
54205.72
51580.11
43661.80
50401.42
51449.19
52323.52
48804.16
50401.42
Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT
Single
83.00%
0.00%
100.00%
100.00%
100.00%
83.00%
82.00%
88.00%
100.00%
Joint
17.00%
100.00%
0.00%
0.00%
0.00%
17.00%
18.00%
12.00%
0.00%
Percentage Distributions Within Clusters by CARDTYPE
Junior
17.00%
10.00%
0.00%
74.00%
1.00%
16.00%
0.00%
78.00%
1.00%
Classic
74.00%
80.00%
91.00%
21.00%
88.00%
73.00%
86.00%
18.00%
88.00%
10.00%
10.00%
9.00%
4.00%
11.00%
11.00%
13.00%
3.00%
11.00%
Gold
Credit Card Analysis of Czech Bank
Methodology – Cluster Analysis , Partitioning Method
•
The distinction of each cluster was not obvious beyond the coupling
of age group and card type.
•
To further investigate the characteristics of customers, we ran another
clustering analysis using partitioning methods. We used the SAS
Enterprise Miner product. SAS uses a partitioning clustering tool
implementing the WARD (Minimum variance) method.
Credit Card Analysis of Czech Bank
Methodology – Cluster Analysis , Conceptual Method
Description
All Card
Holders
Cluster 1
Cluster 2
Cluster 4
Cluster 6
Cluster 19
Cluster 21
Percentage Distributions within Clusters by AGE
Youth
21.00%
80.00%
18.18%
17.65%
62.50%
0.00%
0.00%
Adult
21.00%
0.00%
18.00%
11.76%
0.00%
100.00%
50.00%
Middle Age
56.00%
20%
63.64%
70.59%
37.50%
0.00%
50.00%
2.00%
0.00%
0.00%
0.00%
0.00%
0.00%
0.00%
Senior
Percentage Distributions within Clusters by GENDER
Female
47.00%
20.00%
45.45%
29.41%
37.50%
0.00%
50.00%
Male
53.00%
80.00%
54.55%
70.59%
62.50%
100.00%
50.00%
Mean Value of AVERAGE BALANCE
Mean
51653.23
26887.36
75684.47
73667.43
30521.51
78237.63
80549.51
Std. Deviation
11109.32
328.09
508.51
579.88
601.73
0.00%
909.67
Minimum
0.00%
26522.30
74787.91
72794.25
29640.67
78237.63
79906.27
Maximum
0.00%
27362.37
76453.60
7424.22
31253.80
78237.63
81192.75
Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT
Single
83.00%
80.00%
54.55%
70.59%
100.00%
100.00%
100.00%
Joint
17.00%
20.00%
45.45%
29.41%
0.00%
0.00%
0.00%
Percentage Distributions Within Clusters by CARDTYPE
Junior
16.00%
60.00%
9.09%
11.76%
62.50%
0.00%
0.00%
Classic
74.00%
20.00%
45.45%
47.06%
37.50%
100.00%
100.00%
10.00%
20.00%
45.45%
41.18%
0.00%
0.00%
0.00%
Gold
Credit Card Analysis of Czech Bank
Results
The following rules represent our findings in analyzing
whether or not a customer will/will not have a credit card:
1. If the average account balance is less-than-or-equal 37097.18,
then the client will not be a card holder.
2. If the client is middle-age and lives in Prague, and his/her average
account balance is greater-than 66899.77, then that client will
be a card holder.
3. If the client lives in Prostejov, and has an average account balance
in excess of 46123.76, then that client will be a card holder.
Credit Card Analysis of Czech Bank
Results
The following are characterizations of cardholders within the
current population of bank clients
1.
2.
There is no apparent difference between men and
women in the
issuance of credit cards.
Middle aged single females have a higher percentage of
classic
and gold cards than the general population.
3.
All credit card customers that have taken a loan have good
status in repayment.
4.
Customers who are in the age bracket less than 24 and whose
average balance is under 50000.00 are likely candidates for a
JUNIOR Card.
5.
Otherwise, it is not easily determined from customer
information
which card type has been selected. Perhaps further examination
Credit Card Analysis of Czech Bank
Next Steps
•
Future continuation of this analysis might include
• identifying customers with a junior or classic credit card to whom the
bank can offer a higher-limit card.
•
Other Opportunities
• Which accounts are likely to default on loan payments? Why?
•
What are the characteristics of a good bank client ("good" is defined
here, in the most general of terms, as who will make payments on
their outstanding balances in a timely manner)?
•
What are the identifying characteristics of a good bank branch
(again, we define "good" only in the most general terms of which
branches are most successful in the collection of loan payments)?