Eman B. A. Nashnush
Cost-Sensitive Bayesian Network algorithm
Eman Nashnush
[email protected]
University of Salford, Manchester, UK
Sponsor in Libya (Tripoli University)
Introduction:
Machine learning algorithms are an increasingly important area of research and application in Artificial Intelligence and data mining. One of the most important is the Bayesian network, which has been widely used in real-world applications such as medical diagnosis, image recognition, fraud detection, and inference problems. In all of these applications, accuracy alone is not a sufficient evaluation measure, because each decision involves costs. For example, in a fraud detection application, several costs are incurred when the classifier predicts a fraudulent case as non-fraudulent. In addition, fraud databases have an unbalanced class distribution, which is known to affect learning algorithms adversely. Therefore, this project develops a new algorithm that aims to minimize the costs associated with prediction, misclassification, imbalanced data, time, and tests.
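To make the difference between an accuracy-driven and a cost-driven decision concrete, here is a minimal sketch in Python. The cost matrix values, the class encoding, and the function names are assumptions chosen for illustration; they are not taken from the poster or from any particular Bayesian network library.

```python
# Hypothetical cost matrix, indexed as COST[true_class][predicted_class].
# The values are illustrative only: a missed fraud is assumed to cost ten
# times as much as a false alarm.
NONFRAUD, FRAUD = 0, 1
COST = [
    [0.0, 1.0],   # true nonfraud: predicting fraud costs 1
    [10.0, 0.0],  # true fraud: predicting nonfraud costs 10
]

def expected_costs(posterior, cost=COST):
    """Expected cost of each possible prediction, given P(class | x)."""
    n = len(cost)
    return [sum(posterior[i] * cost[i][j] for i in range(n)) for j in range(n)]

def most_probable_class(posterior):
    """Accuracy-oriented decision: pick the class with the highest probability."""
    return max(range(len(posterior)), key=lambda j: posterior[j])

def min_expected_cost_class(posterior, cost=COST):
    """Cost-sensitive decision: pick the prediction with the lowest expected cost."""
    costs = expected_costs(posterior, cost)
    return min(range(len(costs)), key=lambda j: costs[j])

# A case the classifier believes is probably (85%) not fraud.
posterior = [0.85, 0.15]
print(most_probable_class(posterior))      # accuracy-oriented choice
print(min_expected_cost_class(posterior))  # cost-sensitive choice
```

Under these assumed costs, the two rules disagree on the example case: the most probable class is nonfraud, but predicting fraud has the lower expected cost because missing a fraud is so much more expensive.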
In this work, we attempt to create a new cost-sensitive Bayesian network learning algorithm by adapting the standard Bayesian network algorithm, which focuses on accuracy only. There are several ways to make the algorithm cost-sensitive: changing the distribution of the data; changing the construction process, including adopting alternative measures in the algorithm that take account of cost; and using a Genetic Algorithm to learn the structure of the BN. This work applies these approaches, namely amending distributions, amending the formula, and using Genetic Algorithms. Finally, an empirical evaluation of the developed algorithms will be carried out on artificial data sets (e.g. diabetes data, lung cancer data, bank data, etc.).
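The first of these approaches, changing the distribution of the data, can be sketched as cost-proportionate resampling. The following Python sketch is one common way to do this and is not necessarily the sampling procedure used in this work; the function name `resample_by_cost`, the per-class costs, and the toy data are all hypothetical.

```python
import random

def resample_by_cost(rows, labels, class_cost, seed=0):
    """Resample with replacement so that each instance is drawn with a
    probability proportional to the misclassification cost of its class.
    The size of the data set is left unchanged."""
    rng = random.Random(seed)
    weights = [class_cost[y] for y in labels]
    idx = rng.choices(range(len(rows)), weights=weights, k=len(rows))
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy unbalanced data: 90 nonfraud and 10 fraud transactions (made-up values).
rows = [("txn", i) for i in range(100)]
labels = ["nonfraud"] * 90 + ["fraud"] * 10

# Assumed costs: misclassifying a fraud case is nine times as expensive.
class_cost = {"nonfraud": 1.0, "fraud": 9.0}

new_rows, new_labels = resample_by_cost(rows, labels, class_cost)
print(labels.count("fraud"), new_labels.count("fraud"))
```

The resampled data can then be passed to an unmodified, accuracy-oriented learner, which is why this kind of method treats the learner as a black box: the cost information is carried entirely by the changed class distribution.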
Cost-insensitive Vs. cost-sensitive (Research problem)
A cost-insensitive classifier focuses on accuracy only (class label output).
A cost-sensitive classifier attempts to minimize the expected cost.
[Diagram: Training Data -> Learner -> Classifier; Testing data -> Classifier -> Class Labels]
Training Data (Transaction, {fraud,nonfraud}; unbalanced):
($43.45,retail,10040, .. nonfraud)
($246,70,weapon,94583,.,fraud)
Learner:
1. Decision trees
2. Rules
3. Naive Bayes
...
Testing data:
($99.99,pharmacy,10027,...,?)
($1.00,gas,00234,...,?)
Class Labels:
nonfraud
fraud

Hypotheses/The problem
In real-world problems such as fraud detection, medical diagnosis, or any other decision problem, one class label in the data set (such as the fraud class) is often very rare and more expensive than the other class, because the cost of not recognizing instances that belong to the rare class is high. However, most machine learning methods do not take cost into account. Thus, those (cost-insensitive) algorithms give poor results, because ignoring cost might produce a very weak model. In reality, misclassification (classification error) is a very common problem in real-world data mining when the class labels are imbalanced.

Methodology
The problems mentioned above arise when classifying a data set. Therefore, three methods have been proposed to tackle those problems and minimize the expected misclassification cost:
1. Amend the data distribution to reflect cost.
2. Amend the formula by modifying the statistical measures to include cost.
3. Utilize a Genetic algorithm to evolve a 'fittest' Bayesian network.
The effect of our algorithms is evaluated and compared with other algorithms, such as MetaCost+J4.8, the standard decision tree (J48), and standard Bayesian networks.

Results
Up to now, two new methods for cost-sensitive Bayesian Network algorithms have been developed and explored: one that uses a black box (sampling) approach, and another that uses a transparent box approach (modifying the statistical measures), which amends the selection measure to take account of costs.

Conclusion:
Up to now, I have investigated experimentally how changing the distribution of the data affects the performance and cost of a Bayesian classifier. I tested my approach, called “Cost-Sensitive Bayesian Network using Sampling”, on 24 data sets from the UCI repository. I compare my proposed algorithm with existing methods, and also compare its performance with that of the original algorithm. The figure below shows the results of the cost-sensitive Bayes Network algorithm via changing the distributions and of the original Bayes Network algorithm.

[Figure: results per data set for the cost-sensitive Bayes Network algorithm (via changing the distributions) and the original Bayes Network algorithm. Panel titles: iono, ionosphere, labor, breast, hypo, pima, sonar, horse, horse-colic, hepati, mushroom, bupa liver disorder, breast cancer, heart, diabetes, tic-tac, german, spambase, crx, gymexamg, weather.]