In memory of
Dr. Jan Zytkow
SEP 09 1944 - JAN 16 2001
Mining Financial Data
Histograms & Contingency Tables
Shishir Gupta
Under the guidance of
Dr. Mirsad Hadzikadic
Agenda
• Database
• Task goals
• Tool & technique used
• Data preparation and cleaning
• Attribute selection
• Data transformation
• Data Mining/Pattern Evaluation
• Knowledge presentation
• Pros/Cons
• Questions & Demonstration
Database
• Financial Dataset from PKDD 1999
• Financial Dataset from a Czech Bank
• Relational Dataset
• 8 Relations
– ACCOUNT
– DEMOGRAPH
– TRANSACTION
– DISPOSITION
– LOAN
– ORDER
– CARD
– CLIENT
Task Goal
• Identify Good Clients to offer additional services
• Identify Bad Clients to watch carefully to minimize bank losses
• Services offered:
– Loan
– Credit Card
Technique Used - Histogram
SQL Statement used
SELECT age, COUNT(age)
FROM table_x
GROUP BY age
ORDER BY age
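A minimal sketch of how the histogram query above could be run and displayed, using Python's built-in `sqlite3` module. The in-memory table and its ages are hypothetical toy values, not the PKDD data, and the text rendering is an assumption about presentation rather than the tool's actual output.

```python
import sqlite3

# Hypothetical toy table standing in for table_x; values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_x (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO table_x (age) VALUES (?)",
    [(25,), (31,), (25,), (40,), (31,), (25,)],
)

# Same statement as on the slide: one row per distinct age with its frequency.
histogram = conn.execute(
    "SELECT age, COUNT(age) FROM table_x GROUP BY age ORDER BY age"
).fetchall()

# Render each bucket as a bar of '#' characters.
for age, count in histogram:
    print(f"{age:>3} {'#' * count} ({count})")
```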
Technique Used – C-Tables
SQL Statement used
SELECT sex, COUNT(sex), age
FROM table_x a, table_y b
WHERE a.id = b.fid
GROUP BY sex, age
ORDER BY sex, age
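The query above returns one row per (sex, age) cell. A sketch of pivoting those rows into a two-way contingency table follows; the tables, values, and the `pivot` helper are hypothetical and only illustrate the idea.

```python
import sqlite3

# Hypothetical toy tables standing in for table_x and table_y.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_x (id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE table_y (fid INTEGER, age INTEGER);
INSERT INTO table_x VALUES (1, 'F'), (2, 'M'), (3, 'F'), (4, 'M');
INSERT INTO table_y VALUES (1, 30), (2, 30), (3, 40), (4, 30);
""")

# Same statement as on the slide: cell counts for every (sex, age) pair.
cells = conn.execute("""
    SELECT sex, COUNT(sex), age
    FROM table_x a, table_y b
    WHERE a.id = b.fid
    GROUP BY sex, age
    ORDER BY sex, age
""").fetchall()

def pivot(cells):
    """Turn (row_value, count, col_value) rows into a dense 2-D table."""
    rows = sorted({r for r, _, _ in cells})
    cols = sorted({c for _, _, c in cells})
    counts = {(r, c): n for r, n, c in cells}
    # Combinations absent from the query result are empty cells (count 0).
    table = [[counts.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = pivot(cells)
for label, line in zip(rows, table):
    print(label, line)
```

Filling missing combinations with zero matters because `GROUP BY` only returns cells that actually occur in the data.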
Technique Used – Correlation
SQL Statement used
SELECT x, y
FROM table_x a, table_y b
WHERE a.id = b.fid
ORDER BY x, y
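The query above only fetches the (x, y) pairs; the correlation itself has to be computed from them. The slides do not say which correlation function the tool uses, so the sketch below assumes the Pearson coefficient, applied to hypothetical toy data.

```python
import math
import sqlite3

# Hypothetical toy tables; y is exactly 2 * x, so the correlation is 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_x (id INTEGER PRIMARY KEY, x REAL);
CREATE TABLE table_y (fid INTEGER, y REAL);
INSERT INTO table_x VALUES (1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0);
INSERT INTO table_y VALUES (1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0);
""")

# Same statement as on the slide.
pairs = conn.execute(
    "SELECT x, y FROM table_x a, table_y b WHERE a.id = b.fid ORDER BY x, y"
).fetchall()

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson(pairs)
print(f"r = {r:.3f}")
```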
Tool - Architecture
Tool - Description
Data Cleaning
• Missing Values
– Relation DEMOGRAPHIC
• Incorrect Values
– Relation TRANSACTION (Data reduced by 10% after cleaning)
Data Preparation
• Relation CLIENT
– Separating SEX & BDATE from BIRTHNUMBER
• All Date fields converted to AGE
– Ref 199901.
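A sketch of the two preparation steps above. It assumes the birth-number encoding described in the PKDD 1999 dataset documentation (YYMMDD for men, YYMM+50DD for women), that all clients were born in the 1900s, and that "Ref 199901" is a year-month reference point; the function names are hypothetical and the slides do not show the author's actual procedure.

```python
def split_birth_number(birth_number: str):
    """Split a 6-digit birth number into (sex, year, month, day).

    Assumed encoding: YYMMDD for men, YYMM+50DD for women.
    """
    yy = int(birth_number[0:2])
    mm = int(birth_number[2:4])
    dd = int(birth_number[4:6])
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    return sex, 1900 + yy, mm, dd  # assumption: every client was born in 19xx


def age_at_reference(year: int, month: int, ref: int = 199901) -> int:
    """Age in whole years at the reference point, given as YYYYMM."""
    ref_year, ref_month = divmod(ref, 100)
    months = (ref_year * 12 + ref_month) - (year * 12 + month)
    return months // 12


sex, year, month, day = split_birth_number("706213")
print(sex, year, month, day, age_at_reference(year, month))
```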
Data Preparation Cont….
• Creating Table definitions
• Setting up data in table-compatible format
• Loading data into Database
• Evaluating loading errors and changing attribute definitions accordingly
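The load-and-check loop above might look roughly like the following sketch. The semicolon-delimited input, the column names, and the sample rows are all assumptions made for illustration; SQLite stands in for whichever RDBMS was actually used.

```python
import csv
import io
import sqlite3

# Hypothetical sample in a semicolon-delimited text format; not real data.
RAW = '''"loan_id";"account_id";"amount";"status"
"1";"10";"5000";"A"
"2";"11";"not_a_number";"B"
"3";"12";"7500";"C"
'''

conn = sqlite3.connect(":memory:")
# Step 1: table definition (column names and types are assumptions).
conn.execute(
    "CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, account_id INTEGER,"
    " amount REAL, status TEXT)"
)

# Steps 2-3: convert each record to typed values and load it.
errors = []
for row in csv.DictReader(io.StringIO(RAW), delimiter=";"):
    try:
        values = (int(row["loan_id"]), int(row["account_id"]),
                  float(row["amount"]), row["status"])
        conn.execute("INSERT INTO loan VALUES (?, ?, ?, ?)", values)
    except ValueError as exc:
        # Step 4: keep rejected rows so attribute definitions can be revisited.
        errors.append((row["loan_id"], str(exc)))

loaded = conn.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
print(f"loaded={loaded}, rejected={len(errors)}")
```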
Attribute Selection
• Decision Relation
– LOAN
• Decision Attributes
– STATUS
• Classification Attributes
– All other attributes that do not belong to LOAN relation.
[Figure: example decision tree with test nodes A4?, A1? and A6?, Y/N branches, and leaves Class1 and Class2]
Data Transformation
• Discretization
– Continuous attributes into 4 to 10 buckets
• Only transactions performed in the year 1997 considered for relation TRANSACTION.
– Due to resource limitations
– Most loans were approved during this period
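The slides do not state how bucket boundaries were chosen. As one possible reading of the discretization step, the sketch below uses equal-width bucketing; the function name and the sample amounts are hypothetical.

```python
def equal_width_buckets(values, n_buckets):
    """Assign each value a bucket index in 0..n_buckets-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bucket.
        return [0] * len(values)
    width = (hi - lo) / n_buckets
    # The maximum value would land in bucket n_buckets, so clamp it.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]


# Hypothetical amounts; the slides use between 4 and 10 buckets per attribute.
amounts = [0, 100, 250, 499, 500, 750, 999, 1000]
buckets = equal_width_buckets(amounts, 4)
print(buckets)
```

Equal-frequency (quantile) buckets would be an equally plausible choice for skewed attributes such as transaction amounts.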
Data Mining/Pattern Evaluation
• Run Histogram on all non-key attributes to study their distributions.
• Discretize continuous attributes.
• Run Contingency Table to study the relationship between two attributes.
• Check significance with Correlation function if both attributes are continuous.
Knowledge Presentation - 1
• All loans on accounts where a second person has disposition rights are GOOD LOANS (100%)
Knowledge Presentation - 2
• Permanent Orders of type household & leasing indicate financial stability
Knowledge Presentation - 3
• Accounts with cash withdrawals are more likely to repay their loans
Knowledge Presentation - 4
• Accounts with low transaction amounts indicate good loans
Knowledge Presentation - 5
• Accounts that are in debt indicate BAD LOANS
Pros
• Flexibility to alter data presentation to understand the nature of the data
• Customers with no background in data mining can appreciate the output results because of their simplicity
• Since there is a provision to store the results in a file, subsequent analysis on a subset of data becomes very easy
Cons
• Needs capability for Multi-Variable analysis.
• Some form of quantification of the results needs to be added.
• Performance issues with using RDBMS.
Questions & Demonstration