7. Decision Trees and Decision Rules


國立雲林科技大學
National Yunlin University of Science and Technology
A data mining approach to the prediction
of corporate failure
Advisor : Dr. Hsu
Presenter : Ching-Wen Hong
Authors : Feng Yu Lin, Sally McClean
Intelligent Database Systems Lab
Outline
N.Y.U.S.T.
I. M.

Motivation

Objective

Classifier technique

The five steps of Data Mining: the SAS SEMMA methodology

Sampling

Data exploration

Data manipulation

Modelling and results

Assessment

Conclusion

My opinion
Motivation

Because of recent changes in the world economy, more firms,
large and small, seem to fail now than ever before, so corporate failure
prediction is of increasing importance.

Objective

This paper uses a data mining approach to the prediction of
corporate failure.
Classifier technique

The classifiers include numerous statistical methods and machine
learning methods.

The statistical methods include discriminant analysis (DA) and
logistic regression (LG).

The machine learning methods include artificial neural networks
(NN) and the decision tree method (C5.0).

In addition, the authors propose a hybrid method.
The five steps of Data Mining: the SAS SEMMA methodology
Sampling: The data sample consists of company financial data from the UK.
Explore: Preprocessing of data
Manipulate: Feature selection
Model: The classification models used for the prediction of corporate failure
Assess: Comparison of prediction accuracy
Sampling

The data sample consists of company financial data
from the UK. The financial data were accessed from
Datastream/ICV.

The companies are divided into two groups: failed
companies and nonfailed companies.

The training sample consists of 690 nonfailed companies
and 106 failed companies (1980-1990).

The test dataset consists of 289 nonfailed companies and
48 failed companies (1991-1999).
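
The sample sizes above imply a strong class imbalance. The short sketch below only does the arithmetic on the counts stated on the slide; the majority-class baseline is an illustrative comparison point added here, not a figure reported by the authors.

```python
# Company counts as stated on the slide.
train = {"nonfailed": 690, "failed": 106}  # 1980-1990
test = {"nonfailed": 289, "failed": 48}    # 1991-1999


def failed_share(counts):
    """Fraction of companies in the sample that failed."""
    total = counts["nonfailed"] + counts["failed"]
    return counts["failed"] / total


def majority_baseline(counts):
    """Accuracy obtained by always predicting the larger class."""
    total = counts["nonfailed"] + counts["failed"]
    return max(counts.values()) / total


print(f"training sample: {failed_share(train):.1%} failed")
print(f"test sample:     {failed_share(test):.1%} failed")
print(f"test majority-class baseline: {majority_baseline(test):.1%}")
```

Because roughly one company in seven failed, a classifier has to beat the majority-class baseline before its accuracy says anything about failure prediction.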
Data exploration
Exploration, or preprocessing, of data is very important and is sometimes the most time-consuming part.
Exploration of data includes:
(1) Bringing the data into a ready-to-use state.
(2) Handling missing data.
(3) Filtering out redundant records such as duplicated data.
We decide to delete x719, x734, x735, x761, x766 and x792, since these variables include too many missing values.
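
The two cleaning steps described above (dropping variables with too many missing values and removing duplicated records) can be sketched in plain Python. This is a minimal illustration, not the authors' procedure: the slide does not state the cut-off, so the `max_missing=0.5` threshold is an assumption, and the toy records and their values are invented for the example.

```python
def drop_sparse_variables(records, max_missing=0.5):
    """Remove variables whose share of missing (None) values exceeds max_missing.

    max_missing is a hypothetical cut-off; the slide does not give one.
    """
    variables = sorted({name for record in records for name in record})
    n = len(records)
    keep = [
        v for v in variables
        if sum(record.get(v) is None for record in records) / n <= max_missing
    ]
    return [{v: record.get(v) for v in keep} for record in records]


def drop_duplicates(records):
    """Keep only the first occurrence of each identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy data: x719 is missing in 3 of 4 records, and two records are identical.
records = [
    {"x1": 0.12, "x719": None},
    {"x1": 0.30, "x719": None},
    {"x1": 0.30, "x719": None},
    {"x1": None, "x719": 1.5},
]
cleaned = drop_duplicates(drop_sparse_variables(records))
print(cleaned)
```

Dropping sparse variables first matters here: two records that differ only in a discarded variable become duplicates afterwards.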
Data manipulation

One main purpose of data manipulation is feature selection.

We use two feature selection methods. The
first is based on financial theory and human
judgement (Feature selection I); the second is
based on ANOVA (Feature selection II).
Feature selection I: human judgement based on financial theory


Laurent [21], Ezzamuel et al. [13] and Clarke [10]
attempted to reduce the wide variety of financial
ratios into several major groups (e.g. profitability,
liquidity, etc.), the suggestion being that a
researcher need only select one ratio from each
of the groups to obtain an indication of a
company's overall performance.
The features selected are from these categories:
return on capital employed (ROCE); turnover to
total assets employed (turnover/TA); capital
gearing (CG); working capital (WC) ratio.
Feature selection II: ANOVA statistical method

To classify failed companies and the nonfailed companies
effectively, the predictors should be chosen to enable us to
distinguish between these two groups.

We use ANOVA to select the variables that are significantly
different between these two groups.

In Table 3, the ANOVA output for the training set, the
variables marked with * differ significantly
between the failed and nonfailed groups.
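
As a rough illustration of this selection step, the sketch below computes the one-way ANOVA F statistic for each variable across the two groups and keeps the variables whose F exceeds a critical value. The variable names `ratio_a` and `ratio_b`, the toy values, and the helper `select_features` are hypothetical; the critical value must be supplied by the caller from an F table for the relevant degrees of freedom, since the slide does not state the significance level used.

```python
from statistics import mean


def anova_f(failed, nonfailed):
    """One-way ANOVA F statistic for one variable across two groups."""
    groups = [failed, nonfailed]
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_features(data, f_critical):
    """Return the variables whose F statistic exceeds f_critical.

    data maps a variable name to (failed_values, nonfailed_values).
    """
    return [name for name, (failed, nonfailed) in data.items()
            if anova_f(failed, nonfailed) > f_critical]


# Toy data with 3 companies per group, so the degrees of freedom are (1, 4).
data = {
    "ratio_a": ([1, 2, 3], [7, 8, 9]),  # group means differ clearly
    "ratio_b": ([1, 5, 9], [2, 5, 8]),  # identical group means
}
# 7.71 is the 5% critical value of F with (1, 4) degrees of freedom.
selected = select_features(data, f_critical=7.71)
print(selected)
```

With only two groups the F statistic equals the square of the two-sample t statistic, so this selection is equivalent to a pooled-variance t-test per variable.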
Modelling and results
The classification models used in this study are:
Statistics: DA, LG;
Machine learning methods: decision trees, NN.
Hybrid models combining different classifiers

A hybrid method usually integrates two or more technologies.
The purpose of integrating technologies is to strengthen the
best features of each.

A hybrid method that combines the best features of several
classification models is developed to increase the prediction
performance.
Our purpose is to minimize the number of wrong
predictions:
$\min\ \delta = \sum_{i=1}^{n} (T_i \oplus O_i)$
s.t. $O_i = f(w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} - \theta),\quad i = 1, \dots, n$
$n$ = the number of companies in the sample.
$m$ = the number of classifiers.
$T_i$ is the actual outcome of the $i$th company in the test sample.
$O_i$ is the predicted outcome of the $i$th company in the test
sample based on the hybrid method:
$O_i = 1$ if $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} > \theta$, otherwise $O_i = 0$.
$V_{ij}$ is the predicted output for the $i$th company in the test
sample by classifier $j$.
$w_j$ is the weight of classifier $j$; $\theta$ is the threshold.
$\oplus$ returns 1 if its two operands are different and
0 if they are the same.
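
The objective above can be evaluated directly for a given set of weights and a threshold. The following is a minimal sketch under the definitions on this slide; the function names, the weights, the threshold and the toy votes are all invented for illustration.

```python
def hybrid_output(votes, weights, theta):
    """O_i: 1 if the weighted sum of the classifier votes exceeds theta, else 0."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score > theta else 0


def misclassifications(actual, votes_per_company, weights, theta):
    """delta: the number of companies whose hybrid output differs from T_i."""
    return sum(
        t ^ hybrid_output(votes, weights, theta)  # XOR is 1 when they differ
        for t, votes in zip(actual, votes_per_company)
    )


# Hypothetical example: three classifiers, three companies.
weights = [0.8, 0.9, 0.7]                           # w_j, one per classifier
actual = [1, 0, 1]                                  # T_i
votes_per_company = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]  # V_ij
delta = misclassifications(actual, votes_per_company, weights, theta=1.0)
print(delta)
```

In this toy case the third company is labelled 1 but only the first classifier votes 1, so its weighted sum of 0.8 stays below the threshold and it is the single misclassification.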

Finding such a combination of classifiers means
finding (1) the classifiers to include, (2) their
respective weights and (3) the threshold $\theta$
such that the number of misclassifications $\delta$ between
the actual outcome $T_i$ and the predicted outcome $O_i$ is as small as possible.
The hybrid algorithm



1. Compute the total hit ratio (total accuracy) on
the training sample for each of the $m$ independent
classifiers, giving the weights $(w_1, w_2, \dots, w_m)$.
2. Take the outcome prediction $V_{ij}$ of classifier $j$
for every company $i$ in the test sample, for
$j = 1, \dots, m$, $i = 1, \dots, n$.
3. Take the whole population of classifiers, or subsets
of it. Compute $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im}$
for each company $i$ in the test sample.

4. If $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im} > \theta$, then
$O_i = 1$, else $O_i = 0$. Adjust the parameter $\theta$,
where $0 < \theta < w_1 + w_2 + w_3 + \dots + w_m$, such that
the misclassification count $\delta$ is as small as possible.
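
Steps 1 to 4 can be read as a small search over classifier subsets and thresholds. The sketch below is one possible reading, not the authors' implementation: the slide does not say how the threshold is adjusted, so the evenly spaced grid (`steps`) is an assumption, and the function name `fit_hybrid`, the accuracies and the votes are invented for the example.

```python
from itertools import combinations


def fit_hybrid(actual, votes, accuracies, steps=100):
    """Search classifier subsets and thresholds for the fewest misclassifications.

    actual[i]      actual outcome T_i (0 or 1) of company i
    votes[i][j]    prediction V_ij (0 or 1) of classifier j for company i
    accuracies[j]  training hit ratio of classifier j, used as weight w_j
    steps          grid resolution for theta (an assumption of this sketch)
    Returns (delta, subset, theta) for the best combination found.
    """
    m = len(accuracies)
    best = None
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            w_total = sum(accuracies[j] for j in subset)
            scores = [sum(accuracies[j] * row[j] for j in subset) for row in votes]
            for k in range(1, steps):
                theta = w_total * k / steps  # keeps 0 < theta < sum of weights
                delta = sum(t ^ (1 if s > theta else 0)
                            for t, s in zip(actual, scores))
                if best is None or delta < best[0]:
                    best = (delta, subset, theta)
    return best


# Hypothetical example: three classifiers, five companies.
accuracies = [0.9, 0.8, 0.7]
actual = [1, 1, 0, 0, 1]
votes = [[1, 1, 0], [0, 1, 1], [0, 1, 0], [1, 0, 0], [1, 0, 1]]
delta, subset, theta = fit_hybrid(actual, votes, accuracies)
print(delta, subset, round(theta, 3))
```

In this toy data every single classifier and every pair makes at least one error, while the combination of all three classifies every company correctly for a threshold between 0.9 and 1.5.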
We present three hybrid classifiers.
Hybrid1: DA+LG+NN+C5.0
Hybrid2: DA+NN+C5.0
Hybrid3: LG+C5.0
Assessment
Conclusions



For all the models (DA, LG, NN, C5.0 and the hybrid
classifiers), we found that ANOVA feature selection is
better than human-judgement feature selection, except
for DA.
The machine learning methods (NN and decision trees)
show better performance than the statistical approach.
We present three hybrid classifiers: Hybrid1 (DA+LG+NN+C5.0);
Hybrid2 (DA+NN+C5.0); Hybrid3 (LG+C5.0).
The empirical tests show that a hybrid classifier
produces higher prediction accuracy than the individual
classifiers.
My opinion

Advantage: A hybrid method that combines the best features of
several classification models is developed to increase the
prediction performance.

Disadvantage: (1) The hybrid algorithm is time-consuming, since it
computes $w_1 V_{i1} + w_2 V_{i2} + w_3 V_{i3} + \dots + w_m V_{im}$, $i = 1, \dots, n$, for
every subset of classifiers. (2) The hybrid method is not well suited to
practical application.