Methodology
Qiang Yang, MTM521
Material
A High-level Process View for Data Mining

1. Develop an understanding of the application, set goals, and lay down all questions a user might pose as queries
2. Create dataset for study (from data warehouse, Web site, surveys)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose data mining task: black box or white box? Classification or clustering?
6. Choose data mining algorithms
7. Use algorithms to perform task
8. Interpret, evaluate and cross-validate, and iterate through 1-7 if necessary
9. Deploy: integrate into operational systems, gather feedback, revise goals and redo 1-9
Case Study: German Bank Credit

Application

Bank credit assessment
Decision: approve or reject the loan
Usage: automatic online screening or human assistant
Objective:
    Accurate prediction of values
    Giving reasons behind the decision is important
Potential Queries

Who is likely to be approved for a loan?
What are the most important characteristics of an applicant to look at?
What are the most indicative features for yes/no answers?
Which subset of customers should we market to, and what is the associated profit?
Added: what advice should we give an applicant to improve their chances in the future?
Create Data Set for Study

Access the bank data warehouse or conduct a customer survey
Must the cost of obtaining data be factored in?
How likely are we to obtain quality data in a limited amount of time?
Questions to be Asked

Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in months
Attribute 3: (qualitative)
Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown / no savings account
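
The coded values in the attribute list can be made readable with plain lookup tables. The following is a minimal sketch, assuming each applicant record is a Python dict. The field names `checking_status` and `savings` are hypothetical (the attribute list identifies attributes only by number); the codes and labels are copied from Attributes 1 and 6.

```python
# Code tables copied from Attribute 1 and Attribute 6 in the list above.
CHECKING_STATUS = {
    "A11": "... < 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": "... >= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}
SAVINGS = {
    "A61": "... < 100 DM",
    "A62": "100 <= ... < 500 DM",
    "A63": "500 <= ... < 1000 DM",
    "A64": "... >= 1000 DM",
    "A65": "unknown / no savings account",
}

# Hypothetical field names, chosen for this illustration only.
CODE_TABLES = {"checking_status": CHECKING_STATUS, "savings": SAVINGS}


def decode(record):
    """Return a copy of record with known attribute codes replaced by labels."""
    decoded = {}
    for field, value in record.items():
        table = CODE_TABLES.get(field)
        # Fields without a table (e.g. numerical ones) and unknown codes pass through.
        decoded[field] = table.get(value, value) if table else value
    return decoded
```

Keeping the tables as data rather than as if/else chains makes it easy to add the remaining qualitative attributes in the same way.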
Data Cleaning and Preprocessing

What to do with missing values?
How to fill in missing values and identify and correct incorrect values?
Do we know the cost of classification mistakes?
Do we know the cost of obtaining each feature?
How do we reduce noise? What are the sources of noise for each attribute?
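
One common answer to the missing-value question is simple imputation: fill numerical attributes with the median and qualitative attributes with the most frequent code. The sketch below assumes records are dicts and that a missing entry is marked with `None`; both are assumptions made for illustration, and the function name `impute` is hypothetical. It is one possible strategy, not the one prescribed by the slides.

```python
from collections import Counter
from statistics import median


def impute(rows, numeric_fields, missing=None):
    """Fill missing entries: median for numeric fields, most frequent value otherwise."""
    fields = sorted({f for row in rows for f in row})
    fill = {}
    for f in fields:
        present = [r[f] for r in rows if f in r and r[f] != missing]
        if not present:
            continue  # no observed value to learn a fill from; leave as missing
        if f in numeric_fields:
            fill[f] = median(present)
        else:
            fill[f] = Counter(present).most_common(1)[0][0]
    return [
        {f: (r[f] if f in r and r[f] != missing else fill.get(f, missing))
         for f in fields}
        for r in rows
    ]
```

Note that code A65 ("unknown / no savings account") mixes a genuinely missing value with a real category, so whether to treat it as missing is a modeling decision rather than something imputation can settle.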
Rudimentary Analysis

What is the data distribution?
How can you view data from different angles?
What does the rudimentary data analysis tell you?
Are you satisfied with the analysis?
Are there more queries that you cannot answer through this analysis?
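
A first look at the data distribution needs very little machinery. The sketch below assumes records are dicts; the function names `summarize` and `crosstab` and any field names passed to them are hypothetical. It reports basic statistics for a numerical attribute, relative frequencies for a qualitative one, and a cross-tabulation as one way of viewing the data "from a different angle".

```python
from collections import Counter
from statistics import mean, median


def summarize(rows, field, numeric):
    """Describe one attribute: basic statistics or relative code frequencies."""
    values = [r[field] for r in rows]
    if numeric:
        return {
            "min": min(values),
            "median": median(values),
            "mean": mean(values),
            "max": max(values),
        }
    counts = Counter(values)
    return {code: n / len(values) for code, n in counts.most_common()}


def crosstab(rows, field, label):
    """Count how often each (attribute value, class label) pair occurs."""
    return dict(Counter((r[field], r[label]) for r in rows))
```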
Data Reduction

How many data features do we want in the end?
Is it a data reduction problem or a data transformation problem?
Is it a supervised or an unsupervised data reduction problem?
Is it a linear or a nonlinear data reduction problem?
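
As an illustration of supervised data reduction, qualitative attributes can be ranked by information gain with respect to the class label, keeping only the top few. This is a sketch of one well-known criterion, not a method the slides commit to. It assumes records are dicts with a class label field; all field names are hypothetical, and numerical attributes would have to be binned before this applies.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, field, label):
    """Reduction in label entropy obtained by splitting rows on field."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[field]].append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[label] for r in rows]) - remainder


def rank_features(rows, fields, label):
    """Order fields from most to least informative about the label."""
    return sorted(fields, key=lambda f: information_gain(rows, f, label),
                  reverse=True)
```

The same ranking also gives a first answer to the earlier query about the most indicative features for yes/no answers.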
Choose Data Mining Task

Do we apply rule-based methods for better understanding?
Do we apply k-nearest-neighbor methods for dense data sets?
Do we apply SVM methods for accuracy, accepting a black-box model?
Is the final result (yes/no) important, or the action (what to do to reduce a customer's likelihood of being rejected)?
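
To make the k-nearest-neighbor option concrete, here is a minimal sketch over numerical attributes only (for example duration and credit amount). It assumes each training example is a `(feature_tuple, label)` pair; the function names are hypothetical. Because the attributes have very different scales, a min-max rescaling helper is included; in practice the query would have to be rescaled with the same bounds as the training data.

```python
from collections import Counter
from math import dist


def min_max_scale(points):
    """Rescale each coordinate to [0, 1] so no attribute dominates the distance."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to query (Euclidean)."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

A prediction from this model comes without an explanation, which is the trade-off the slide raises against rule-based methods.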
Use Algorithm to Perform Task

Which hardware platform to use?
Which software platform to use?
Are speed and scale more important than visual effects?
Is data porting an important issue?
Is the API important, or only the final answer?
How much does each package cost?
Evaluation

Do we have separate training and testing data?
Is data scarce?
What kind of cross-validation do we use?
    N folds, N = ?
    Bootstrapping or not?
Is ranking important (lift, ROC), or is the confusion matrix important?
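
When data is scarce and there is no separate test set, N-fold cross-validation reuses every record for both training and testing. The sketch below shows the index bookkeeping and a confusion matrix for yes/no labels. The function names and the fixed shuffle seed are assumptions made for illustration; the classifier that would be trained on each fold is left out.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is a test index exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def confusion_matrix(actual, predicted):
    """Count (actual label, predicted label) pairs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total
```

The off-diagonal cells are where the cost of classification mistakes enters: approving a bad loan and rejecting a good one rarely cost the same, so accuracy alone may not be the right summary.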
Interpretation

What do the results mean?
Do we need to support causal effects of the final decisions?
Do we need to go back to experts in the application domain?
Do we need visual effects or a ranking of the final results?
Iteration

After obtaining one set of results, do we need to return to the beginning to revise our objectives and obtain new data?
How many iterations are needed?
Is the process a one-shot or a continuous process?
Deployment Issues

Do we need to integrate with a real online banking system?
Do we need to provide an API for the software?
Do we need to use new data to supplement the training data set? If so, how often?