Paper D1.S3.8 - Department of Computer and Information Sciences

Dr. Abdul Aziz
Associate Dean
Faculty of Computer Sciences
Riphah International University
Islamabad, Pakistan
[email protected]
Dr. Nazir A. Zafar
Department of Computer & Information Sciences
Pakistan Institute of Engineering & Applied Sciences
Nilore, Islamabad, Pakistan
[email protected]
Reduction in Over-Fitting for Classification without Compromising on Accuracy and Effectiveness
Machine Learning
Machine learning covers the following main types of learning:
• Classification learning: learn to put instances into pre-defined classes based on other attributes
• Association learning: learn relationships between the attributes
• Clustering: discover classes of instances that belong together
• Regression: learn to predict a numeric quantity instead of a class
Roots of Classification
1. Classification draws on the concepts of three major paradigms:
• Database technology
• Statistics
• Machines
2. Domain knowledge, i.e. the expertise of the end-user.
Knowledge Discovery in Databases
1. The KDD process typically generates a model from past records with known target classes (outputs); the model is then used to predict the outputs of future records (new cases).
2. Applications include fraud detection, marketing, investment analysis and insurance.
Marketing example
The goal is to predict whether a customer will buy a product given gender, country and age.
Gender  Country  Age  Buy?
M       France   25   Yes
M       England  21   Yes
F       France   23   Yes
F       England  34   Yes
F       France   30   No
M       Germany  21   No
M       Germany  20   No
F       Germany  18   No
F       France   34   No
M       France   55   No
Freitas and Lavington (1998) Data Mining, CEC99.
country?            (root; internal branching node)
  Germany: no       (leaf node)
  England: yes      (leaf node)
  France: age?      (internal branching node)
    <= 25: yes      (leaf node)
    > 25: no        (leaf node)
This is the decision tree induced by the Marketing example data. The first branch is called the root of the tree.
Tree induction
1. The tree is built by selecting one attribute at a time: the one that ‘best’ separates the classes.
2. The set of examples is then partitioned according to the value of the selected attribute.
3. This is repeated at each branch node until segmentation is complete.
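The three steps above can be sketched in Python on the Marketing example data. The slides do not say how the ‘best’ attribute is chosen, so this sketch assumes information gain (entropy reduction) as the criterion. It also treats age as a fixed test against 25, a threshold borrowed from the tree on the earlier slide rather than searched for. All function and variable names are illustrative only.

```python
import math
from collections import Counter

# Marketing example records: (gender, country, age, buy)
RECORDS = [
    ("M", "France", 25, "yes"),
    ("M", "England", 21, "yes"),
    ("F", "France", 23, "yes"),
    ("F", "England", 34, "yes"),
    ("F", "France", 30, "no"),
    ("M", "Germany", 21, "no"),
    ("M", "Germany", 20, "no"),
    ("F", "Germany", 18, "no"),
    ("F", "France", 34, "no"),
    ("M", "France", 55, "no"),
]

# Candidate tests: each maps a record to a branch label.
# Assumption: the age threshold of 25 is taken from the tree shown earlier.
TESTS = {
    "gender?": lambda r: r[0],
    "country?": lambda r: r[1],
    "age?": lambda r: "<= 25" if r[2] <= 25 else "> 25",
}

def entropy(records):
    """Shannon entropy (in bits) of the class labels in records."""
    counts = Counter(r[3] for r in records)
    total = len(records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def partition(records, test):
    """Group records by the branch label the test assigns to them."""
    groups = {}
    for r in records:
        groups.setdefault(test(r), []).append(r)
    return groups

def information_gain(records, test):
    """Entropy of records minus the weighted entropy of its partitions."""
    groups = partition(records, test)
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(records) - remainder

def induce(records, tests):
    """Return a class label (leaf) or a (test_name, {branch: subtree}) node."""
    classes = {r[3] for r in records}
    if len(classes) == 1 or not tests:
        # Leaf: one class only, or no tests left (fall back to the majority class).
        return Counter(r[3] for r in records).most_common(1)[0][0]
    # Step 1: select the test with the highest information gain.
    best = max(tests, key=lambda name: information_gain(records, tests[name]))
    remaining = {n: t for n, t in tests.items() if n != best}
    # Steps 2 and 3: partition on the selected test and recurse into each branch.
    branches = {
        label: induce(group, remaining)
        for label, group in partition(records, tests[best]).items()
    }
    return (best, branches)

tree = induce(RECORDS, TESTS)
print(tree)
```

On this data the sketch selects country? at the root and age? under the France branch, which matches the tree shown on the slides. A fuller implementation would also search for the age threshold instead of fixing it.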
country?            (4Y 6N)
  Germany: no       (0Y 3N)
  England: yes      (2Y 0N)
  France: age?      (2Y 3N)
    <= 25: yes      (2Y 0N)
    > 25: no        (0Y 3N)
Notice that in this simple example the leaf nodes contain records of one class only.
The number of yes and no examples is conserved as you move up and down the tree.
Rule derivation
Rules can be extracted directly from induction trees.
country?            (4Y 6N)
  Germany: no       (0Y 3N)
  England: yes      (2Y 0N)
  France: age?      (2Y 3N)
    <= 25: yes      (2Y 0N)
    > 25: no        (0Y 3N)
If (country = Germany) then (Buy? = No)
What are the other rules?
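One way to read rules off the tree is to walk every path from the root to a leaf and join the tests met along the way. The sketch below does this for the Marketing tree, written by hand as nested tuples and dictionaries. That representation and the function name extract_rules are assumptions made for illustration, not something the slides prescribe.

```python
# The Marketing tree as (attribute, {branch: subtree-or-class}); leaves are strings.
TREE = ("country", {
    "Germany": "No",
    "England": "Yes",
    "France": ("age", {"<= 25": "Yes", "> 25": "No"}),
})

def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path and return one rule string per leaf."""
    if isinstance(node, str):
        # Leaf: the conditions collected on the way down form the antecedent.
        antecedent = " and ".join(conditions) or "(true)"
        return [f"If {antecedent} then (Buy? = {node})"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        # Threshold branches such as "<= 25" already carry their operator.
        op = "" if value[0] in "<>" else "= "
        condition = f"({attribute} {op}{value})"
        rules.extend(extract_rules(subtree, conditions + (condition,)))
    return rules

for rule in extract_rules(TREE):
    print(rule)
```

Each leaf yields exactly one rule, so this tree gives four rules: the Germany rule shown on the slide, one for England, and two for France split on age.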
Heart Disease Dataset
What is needed?
1. With databases of enormous size, the user needs help to analyse the data more effectively than by simply querying and reporting.
2. Semi-automatic methods to extract useful, unknown (higher-level) information in a concise format will help the user make more sense of their data.
The KDD roadmap
1. KDD may be divided into the following stages:
• Problem Specification
• Resourcing
• Data Cleansing
• Pre-processing
• Data Mining
• Evaluation
• Interpretation
• Exploitation
2. Note the iterative nature of the process.
Expertise required
1. Any organisation that undertakes a project in KDD will require much expert input to ensure that the results produced are
• of high quality,
• valid,
• interesting/useful/novel/surprising,
• and comprehensible by the human user.
2. “If patient is pregnant then gender is female” is very accurate, but is neither useful nor surprising.
[Figure: the data is split into training data and test data; the classifier algorithm builds a model from the training data, and the model classifies the test data.]
Validation
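[Figure: the data is split into training data and test data; the classifier algorithm builds a model from the training data, and the model's classification of the test data is used for validation.]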
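The partitioning step in the diagram can be sketched as follows. The first function is a simple random split. The second is a generic class-stratified split that keeps the class proportions equal in both parts. It is offered only as an illustration of stratification in general and is not claimed to be the authors' sampling procedure. The function names, the 30% test fraction and the toy data are all assumptions.

```python
import random
from collections import defaultdict

def simple_random_split(records, test_fraction, seed=0):
    """Shuffle the records and use the first test_fraction of them as test data."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def stratified_split(records, label_of, test_fraction, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[label_of(r)].append(r)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy data: 40 "yes" and 60 "no" records (hypothetical, for illustration only).
data = [(i, "yes") for i in range(40)] + [(i, "no") for i in range(40, 100)]
train, test = stratified_split(data, label_of=lambda r: r[1], test_fraction=0.3)
print(len(train), len(test))
print(sum(1 for r in test if r[1] == "yes"), "yes records in the test set")
```

With the stratified split, the test set holds 12 of the 40 "yes" records and 18 of the 60 "no" records, so the 40/60 class ratio is preserved. A simple random split gives no such guarantee on small data sets.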
Relative error reduction
S.No.  Data Set                    SRS    SDS    ER %
1      Heart Disease (Cleveland)   77.37  82.12  20.99
2      Credit-A                    84.52  84.93   2.65
3      Diabetes (PIMA)             72.45  73.96   5.48
4      Liver disorder (BUPA)       64.86  65.90   2.96
5      Breast cancer Wisconsin     94.63  94.86   4.28
6      Hepatitis                   78.20  88.31  46.38
7      Ionosphere                  89.09  90.91  16.68
8      Boston housing              82.53  83.79   7.21
9      Credit (German)             71.64  72.20   1.97
10     Iris                        91.93  98.67  83.52
11     Sonar                       73.46  70.19 -12.32
       Overall average             80.06  82.35  16.35
SRS: Simple Random Sampling
SDS: Systematic Distribution Sampling
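The slide does not define the ER % column. The tabulated figures are, however, consistent with reading it as the relative reduction in error rate, where the error rate is 100 minus the accuracy. The sketch below works under that assumption; the function name is illustrative.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Percentage by which the error rate (100 - accuracy) falls from baseline to new."""
    error_baseline = 100.0 - acc_baseline
    error_new = 100.0 - acc_new
    return 100.0 * (error_baseline - error_new) / error_baseline

# (SRS accuracy, SDS accuracy) pairs copied from the table above.
rows = {
    "Heart Disease (Cleveland)": (77.37, 82.12),
    "Iris": (91.93, 98.67),
    "Sonar": (73.46, 70.19),
}
for name, (srs, sds) in rows.items():
    print(f"{name}: {relative_error_reduction(srs, sds):.2f}")
```

For Heart Disease the error rate falls from 22.63 to 17.88, a reduction of 4.75 points or 20.99% of the baseline error. The Sonar value is negative because its accuracy is lower under SDS than under SRS.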
Comparison
Data Set     SRS Accuracy  SRS Over fitting  SIS Accuracy  SIS Over fitting
Heart-C      77.37          6.14             79.58         3.11
Credit-A     84.52          5.25             87.08         2.93
Diabetes     72.45          7.83             72.66         3.58
Liver        64.86         10.47             67.17         4.79
Cancer       94.63          2.58             94.62         1.26
Hepatitis    78.20          6.16             83.12         3.07
Ionosphere   89.09          4.94             89.41         2.19
Housing      82.53          3.68             83.00         2.53
Credit-G     71.64          8.23             72.56         3.97
Iris         91.93          2.91             96.00         1.18
Sonar        73.46          8.26             76.73         3.95
Average      80.06          6.04             81.99         2.96
SRS: Simple Random Sampling
SIS: Stratified Induction Sampling
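The Average row can be reproduced as plain column means over the eleven data sets, as the sketch below shows. The final line also derives the relative drop in average over-fitting from SRS to SIS. That figure does not appear on the slide and is computed here only as an illustration.

```python
from statistics import mean

# Rows copied from the comparison table:
# (data set, SRS accuracy, SRS over-fitting, SIS accuracy, SIS over-fitting)
ROWS = [
    ("Heart-C", 77.37, 6.14, 79.58, 3.11),
    ("Credit-A", 84.52, 5.25, 87.08, 2.93),
    ("Diabetes", 72.45, 7.83, 72.66, 3.58),
    ("Liver", 64.86, 10.47, 67.17, 4.79),
    ("Cancer", 94.63, 2.58, 94.62, 1.26),
    ("Hepatitis", 78.20, 6.16, 83.12, 3.07),
    ("Ionosphere", 89.09, 4.94, 89.41, 2.19),
    ("Housing", 82.53, 3.68, 83.00, 2.53),
    ("Credit-G", 71.64, 8.23, 72.56, 3.97),
    ("Iris", 91.93, 2.91, 96.00, 1.18),
    ("Sonar", 73.46, 8.26, 76.73, 3.95),
]

# Transpose the rows into columns and drop the data-set names.
columns = list(zip(*ROWS))[1:]
averages = [round(mean(col), 2) for col in columns]
print(averages)

# Derived quantity (not on the slide): relative drop in average over-fitting.
drop = 100 * (averages[1] - averages[3]) / averages[1]
print(f"{drop:.0f}% lower average over-fitting with SIS")
```

The means come out as 80.06, 6.04, 81.99 and 2.96, matching the Average row: average over-fitting roughly halves while average accuracy rises by about two points.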
Conclusion
In this study, we have shown that partitioning the original data sets into training and test data sets using the stratified induction approach reduces over-fitting significantly without compromising accuracy.
Supporting Texts
Data Warehousing, Data Mining and OLAP, Alex Berson & Stephen Smith, McGraw-Hill (1997), ISBN 0-07-006272-2
Predictive Data Mining, Sholom Weiss & Nitin Indurkhya, Morgan Kaufmann (1998), ISBN 1-55860-403-0
Data Mining, Ian Witten & Eibe Frank, Morgan Kaufmann (1999), ISBN 1-55860-552-5
Useful URLs
1. University of East Anglia, School of Computing Sciences, UK
http://www.cmp.uea.ac.uk/research/groups/mag/kdd/
2. UCI ML repository, USA
http://www.ics.uci.edu/~mlearn/MLRepository.html
3. KD Nuggets, USA
http://www.kdnuggets.com/
Questions and Answers
Discussion
THANK YOU