Data Mining and Knowledge Discovery - Web
Classification and Prediction
Bamshad Mobasher
DePaul University
What Is Classification?
The goal of data classification is to organize and
categorize data in distinct classes
A model is first created based on the data distribution
The model is then used to classify new data
Given the model, a class can be predicted for new data
Classification = prediction for discrete and nominal
values
With classification, I can predict in which bucket to put the ball, but I
can’t predict the weight of the ball
2
Prediction, Clustering, Classification
What is Prediction?
The goal of prediction is to forecast or deduce the value of an attribute
based on values of other attributes
A model is first created based on the data distribution
The model is then used to predict future or unknown values
Supervised vs. Unsupervised Classification
Supervised Classification = Classification
We know the class labels and the number of classes
Unsupervised Classification = Clustering
We do not know the class labels and may not know the number of
classes
3
Classification: 3 Step Process
1. Model construction (Learning):
Each record (instance) is assumed to belong to a predefined class, as determined
by one of the attributes, called the class label
The set of all records used for construction of the model is called training set
The model is usually represented in the form of classification rules, (IF-THEN
statements) or decision trees
2. Model Evaluation (Accuracy):
Estimate the accuracy of the model on a test set
The known label of each test sample is compared with the model's prediction
Accuracy rate: percentage of test-set samples correctly classified by the model
The test set must be independent of the training set; otherwise over-fitting will occur
3. Model Use (Classification):
The model is used to classify unseen instances (assigning class labels)
i.e., predict the actual value of the class attribute for new instances
4
Model Construction
5
Model Evaluation
6
Model Use: Classification
7
Classification Methods
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbor
Case-Based Reasoning
Genetic Algorithms
Fuzzy Sets
Many More
8
Decision Trees
A decision tree is a flow-chart-like tree structure
Internal node denotes a test on an attribute (feature)
Branch represents an outcome of the test
All records in a branch have the same value for the tested attribute
Leaf node represents class label or class label distribution
outlook?
  sunny    -> humidity?
                high   -> N
                normal -> P
  overcast -> P
  rain     -> windy?
                true  -> N
                false -> P
9
Decision Trees
Example: “is it a good day to play golf?”
A set of attributes and their possible values:
  outlook: sunny, overcast, rain
  temperature: cool, mild, hot
  humidity: high, normal
  windy: true, false
A particular instance in the training set might be:
  <overcast, hot, normal, false>: play
In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
10
Using Decision Trees for Classification
Examples can be classified as follows
1. look at the example's value for the feature tested at the current node
2. move along the edge labeled with this value
3. if you reach a leaf, return the label of the leaf
4. otherwise, repeat from step 1
Example (a decision tree to decide whether to go on a picnic):
outlook?
  sunny    -> humidity?  (high -> N, normal -> P)
  overcast -> P
  rain     -> windy?     (true -> N, false -> P)

So a new instance
  <rainy, hot, normal, true>: ?
will be classified as "no play", since outlook = rain and windy = true lead to the leaf N.
11
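The four-step procedure above is just a tree walk. Below is a minimal Python sketch (not from the slides); the nested-dictionary encoding of the golf tree and the attribute names are illustrative assumptions.

# Sketch: classify an instance by walking a decision tree.
# Internal nodes are (attribute, {value: subtree}) pairs; leaves are class labels.
golf_tree = (
    "outlook",
    {
        "sunny": ("humidity", {"high": "N", "normal": "P"}),
        "overcast": "P",
        "rain": ("windy", {"true": "N", "false": "P"}),
    },
)

def classify(tree, instance):
    # Test the attribute at the current node, follow the branch labeled with
    # the instance's value, and stop as soon as a leaf (a plain label) is reached.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

# The new instance from the slide above: <rainy, hot, normal, true>
x = {"outlook": "rain", "temp": "hot", "humidity": "normal", "windy": "true"}
print(classify(golf_tree, x))   # -> "N", i.e. "no play"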
Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold. Example:

outlook?
  sunny    -> humidity?
                <= 75% -> yes
                > 75%  -> no
  overcast -> yes
  rain     -> windy?
                > 20   -> no
                <= 20  -> yes
Each path in the tree represents a decision rule:
Rule 1: If (outlook = "sunny") AND (humidity <= 0.75) Then (play = "yes")
Rule 2: If (outlook = "rainy") AND (wind > 20) Then (play = "no")
Rule 3: If (outlook = "overcast") Then (play = "yes")
...
12
Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
Tree construction
At the start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
Tree pruning
remove tree branches that may reflect noise in the training data and lead to
errors when classifying test data
improve classification accuracy
Basic Steps in Decision Tree Construction
Tree starts as a single node representing all the data
If the samples are all of the same class, the node becomes a leaf labeled with that class label
Otherwise, select the feature that best separates the samples into individual classes.
Recursion stops when:
Samples in node belong to the same class (majority)
There are no remaining attributes on which to split
13
Trees Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of training examples S, a set of features F
1. If every element of S has a class value “yes”, return “yes”; if every element of
S has class value “no”, return “no”
2. Otherwise, choose the best feature f from F (if there are no features
remaining, then return failure);
3. Extend tree from f by adding a new branch for each attribute value of f
3.1. Set F’ = F – {f},
4. Distribute the training examples to the leaf nodes (so each leaf node n represents the
subset of examples Sn of S with the corresponding attribute value)
5. Repeat steps 1-5 for each leaf node n with Sn as the new set of training
examples and F’ as the set of attributes
Main Question:
how do we choose the best feature at each step?
Note: ID3 algorithm only deals with categorical attributes, but can be extended
(as in C4.5) to handle continuous attributes
14
Choosing the “Best” Feature
Use Information Gain to find the “best” (most discriminating) feature
Assume there are two classes, P and N (e.g., P = "yes" and N = "no")
Let the set of instances S (training data) contains p elements of class P and n
elements of class N
The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p, n):
  I(p, n) = -Pr(P)·log2 Pr(P) - Pr(N)·log2 Pr(N)
Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n)
15
Choosing the “Best” Feature
More generally, if we have m classes, and s1, s2, …, sm are the number
of instances of S in each class, then the entropy is:
  I(s1, s2, …, sm) = -Σ pi·log2 pi   (sum over i = 1, …, m)
where pi is the probability that an arbitrary instance belongs to class i.
16
Choosing the “Best” Feature
Now, assume that using attribute A, a set S of instances will be partitioned into sets S1, S2, …, Sv, each corresponding to a distinct value of attribute A.
If Si contains pi cases of P and ni cases of N, the entropy, or the expected
information needed to classify objects in all subtrees Si is
  E(A) = Σ Pr(Si)·I(pi, ni)   (sum over i = 1, …, v)
where Pr(Si) = |Si| / |S| = (pi + ni) / (p + n) is the probability that an arbitrary instance in S belongs to the partition Si.
The encoding information that would be gained by branching on A:
  Gain(A) = I(p, n) - E(A)
At any point we want to branch using an attribute that provides the highest
information gain.
17
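For concreteness, the entropy and gain formulas on the last few slides translate directly into code. A minimal Python sketch (function names chosen here for illustration, not taken from the slides):

import math

def entropy(counts):
    # I(s1, ..., sm) = -sum_i p_i * log2(p_i), given the class counts s1..sm.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_entropy(partitions):
    # E(A): entropy of each partition S_i weighted by Pr(S_i) = |S_i| / |S|.
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * entropy(p) for p in partitions)

def gain(class_counts, partitions):
    # Gain(A) = I(p, n) - E(A)
    return entropy(class_counts) - expected_entropy(partitions)

print(round(entropy([9, 5]), 3))                         # 0.94, the I(9,5) used on the next slide
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ~0.247, splitting on outlook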
Attribute Selection - Example
The “Golf” example: what attribute should we choose as the root?
Day  outlook   temp  humidity  wind    play
D1   sunny     hot   high      weak    No
D2   sunny     hot   high      strong  No
D3   overcast  hot   high      weak    Yes
D4   rain      mild  high      weak    Yes
D5   rain      cool  normal    weak    Yes
D6   rain      cool  normal    strong  No
D7   overcast  cool  normal    strong  Yes
D8   sunny     mild  high      weak    No
D9   sunny     cool  normal    weak    Yes
D10  rain      mild  normal    weak    Yes
D11  sunny     mild  normal    strong  Yes
D12  overcast  mild  high      strong  Yes
D13  overcast  hot   normal    weak    Yes
D14  rain      mild  high      strong  No

S: [9+,5-]
Splitting on Outlook:
  sunny:    [2+,3-]
  overcast: [4+,0-]
  rainy:    [3+,2-]

I(9,5) = -(9/14).log(9/14) - (5/14).log(5/14) = 0.94
I(4,0) = -(4/4).log(4/4) - (0/4).log(0/4) = 0
I(2,3) = -(2/5).log(2/5) - (3/5).log(3/5) = 0.97
I(3,2) = -(3/5).log(3/5) - (2/5).log(2/5) = 0.97

Gain(outlook) = .94 - (4/14)*0 - (5/14)*.97 - (5/14)*.97 = .24
18
Attribute Selection - Example (Cont.)
(Training data: the same 14 "Golf" examples as on the previous slide.)

S: [9+,5-] (I = 0.940)
Splitting on humidity:
  high:   [3+,4-] (I = 0.985)
  normal: [6+,1-] (I = 0.592)
Gain(humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

S: [9+,5-] (I = 0.940)
Splitting on wind:
  weak:   [6+,2-] (I = 0.811)
  strong: [3+,3-] (I = 1.00)
Gain(wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for "temp". In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.
19
Attribute Selection - Example (Cont.)
Partially learned decision tree
S: [9+,5-]   {D1, D2, …, D14}
Outlook
  sunny:    [2+,3-]  {D1, D2, D8, D9, D11}    -> ?
  overcast: [4+,0-]  {D3, D7, D12, D13}       -> yes
  rainy:    [3+,2-]  {D4, D5, D6, D10, D14}   -> ?

Which attribute should be tested at the "sunny" branch?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019
20
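As a sanity check on the numbers above, the Python sketch below recomputes the gains directly from the 14 training examples; the tuple encoding of the data and the function names are illustrative assumptions, the gain definition follows the earlier slides.

import math
from collections import Counter, defaultdict

# The 14 "Golf" training examples as (outlook, temp, humidity, wind, play) tuples.
data = [
    ("sunny", "hot", "high", "weak", "No"),          ("sunny", "hot", "high", "strong", "No"),
    ("overcast", "hot", "high", "weak", "Yes"),      ("rain", "mild", "high", "weak", "Yes"),
    ("rain", "cool", "normal", "weak", "Yes"),       ("rain", "cool", "normal", "strong", "No"),
    ("overcast", "cool", "normal", "strong", "Yes"), ("sunny", "mild", "high", "weak", "No"),
    ("sunny", "cool", "normal", "weak", "Yes"),      ("rain", "mild", "normal", "weak", "Yes"),
    ("sunny", "mild", "normal", "strong", "Yes"),    ("overcast", "mild", "high", "strong", "Yes"),
    ("overcast", "hot", "normal", "weak", "Yes"),    ("rain", "mild", "high", "strong", "No"),
]
ATTRS = {"outlook": 0, "temp": 1, "humidity": 2, "wind": 3}

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(examples, attr):
    # Gain(S, A) = I(S) - sum over values v of |S_v|/|S| * I(S_v)
    groups = defaultdict(list)
    for e in examples:
        groups[e[ATTRS[attr]]].append(e[-1])
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy([e[-1] for e in examples]) - remainder

for a in ATTRS:                                   # gains at the root:
    print(a, round(gain(data, a), 3))             # outlook ~0.247, temp ~0.029,
                                                  # humidity ~0.152, wind ~0.048
s_sunny = [e for e in data if e[0] == "sunny"]    # the examples in the "sunny" branch
for a in ("humidity", "temp", "wind"):
    print(a, round(gain(s_sunny, a), 3))          # humidity ~0.971, temp ~0.571, wind ~0.020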
Dealing With Continuous Variables
Partition continuous attribute into a discrete set of intervals
sort the examples according to the continuous attribute A
identify adjacent examples that differ in their target classification
generate a set of candidate thresholds midway between those adjacent values
problem: may generate too many intervals
Another Solution:
take a minimum threshold M of the examples of the majority class in each
adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3

Temperature  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play?        yes  no   yes  yes  yes  no   no   yes  yes  yes  no   yes  no   no

Candidate thresholds shown: 70.5 and 77.5.
The partitions on either side of 70.5 have the same majority class ("yes"), so they are merged.
Final mapping: temperature <= 77.5 ==> "yes"; temperature > 77.5 ==> "no"
21
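A short Python sketch of the candidate-threshold step described above (midpoints between adjacent sorted examples whose class changes); the variable names and printout are illustrative:

# Sort the examples by the continuous attribute, then place a candidate
# threshold midway between adjacent values whose target class differs.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
         "yes", "yes", "no", "yes", "no", "no"]

pairs = sorted(zip(temps, play))
candidates = sorted({(t1 + t2) / 2
                     for (t1, c1), (t2, c2) in zip(pairs, pairs[1:])
                     if c1 != c2 and t1 != t2})
print(candidates)   # [64.5, 66.5, 70.5, 77.5, 80.5, 82.0] -- illustrates "may generate
                    # too many intervals"; the M = 3 merging above keeps only the 77.5 split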
Over-fitting in Classification
A tree generated may over-fit the training examples due to noise
or too small a set of training data
Two approaches to avoid over-fitting:
(Stop earlier): Stop growing the tree earlier
(Post-prune): Allow over-fit and then post-prune the tree
Approaches to determine the correct final tree size:
Separate training and testing sets or use cross-validation
Use all the data for training, but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may improve over entire
distribution
Use Minimum Description Length (MDL) principle: halting growth of the tree
when the encoding is minimized.
Rule post-pruning (C4.5): converting to rules before pruning
22
Pruning the Decision Tree
A decision tree constructed using the training data may need to
be pruned
over-fitting may result in branches or leaves based on too few examples
pruning is the process of removing branches and subtrees that are generated
due to noise; this improves classification accuracy
Subtree Replacement: merge a subtree into a leaf node
Using a set of data different from the training data
At a tree node, if the accuracy without splitting is higher than the accuracy with
splitting, replace the subtree with a leaf node; label it using the majority class
color
  red  -> yes   (leaf 1)
  blue -> no    (leaf 2)

Suppose that with the test set we find 3 red "no" examples and 2 blue "yes" examples reaching this node. We can then replace the subtree with a single "no" leaf. After replacement there will be only 2 errors instead of 5.
23
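The replace-or-keep decision in subtree replacement is just an error count on the held-out data. A tiny Python sketch of the example above (the counts come from the slide; the representation is an illustrative assumption):

# Held-out examples reaching the "color" node: 3 red "no" and 2 blue "yes".
validation = [("red", "no")] * 3 + [("blue", "yes")] * 2

def subtree_predict(color):
    # The learned subtree from the slide: red -> "yes", blue -> "no".
    return "yes" if color == "red" else "no"

errors_with_split = sum(1 for color, label in validation if subtree_predict(color) != label)
errors_as_leaf = sum(1 for _, label in validation if label != "no")  # majority-class leaf "no"

print(errors_with_split, errors_as_leaf)   # 5 2 -> replace the subtree with a "no" leaf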
Bayesian Methods
Bayes’s theorem plays a critical role in probabilistic learning and
classification
Uses prior probability of each category given no information about an item
Categorization produces a posterior probability distribution over the possible
categories given a description of an item
The models are incremental in the sense that each training example can incrementally
increase or decrease the probability that a hypothesis is correct. Prior knowledge can
be combined with observed data
Given a data sample X with an unknown class label, H is the hypothesis
that X belongs to a specific class C
The conditional probability of hypothesis H given observation X, Pr(H|X), follows the
Bayes’s theorem:
  Pr(H | X) = Pr(X | H)·Pr(H) / Pr(X)
Practical difficulty: requires initial knowledge of many probabilities,
significant computational cost
24
Axioms of Probability Theory
0 <= P(A) <= 1: all probabilities are between 0 and 1
A true proposition has probability 1, a false proposition has probability 0:
  P(true) = 1, P(false) = 0
The probability of a disjunction is:
  P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
25
Conditional Probability
P(A | B) is the probability of A given B
Assumes that B is all and only information known.
Defined by:
  P(A | B) = P(A ∧ B) / P(B)
26
Independence
A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
These two constraints are logically equivalent
Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  P(A ∧ B) = P(A) · P(B)
Bayes's Rule:
  P(H | E) = P(E | H) · P(H) / P(E)
27
Bayesian Categorization
Let the set of categories be {c1, c2, …, cn}
Let E be the description of an instance.
Determine the category of E by determining, for each ci:
  P(ci | E) = P(ci) · P(E | ci) / P(E)
P(E) can be determined since the categories are complete and disjoint:
  Σ P(ci | E) = Σ P(ci) · P(E | ci) / P(E) = 1   (sums over i = 1, …, n)
  P(E) = Σ P(ci) · P(E | ci)
28
Bayesian Categorization (cont.)
Need to know:
Priors: P(ci)
and
Conditionals: P(E | ci)
P(ci) are easily estimated from data.
If ni of the examples in D are in ci, then P(ci) = ni / |D|
Assume an instance is a conjunction of binary features/attributes:
  E = e1 ∧ e2 ∧ … ∧ em
  e.g., E = (Outlook = rain) ∧ (Temp = cool) ∧ (Humidity = normal) ∧ (Windy = true)
29
Naïve Bayesian Categorization
Problem: Too many possible instances (exponential in m) to
estimate all P(E | ci)
If we assume features/attributes of an instance are independent
given the category (ci) (conditionally independent)
  P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Π P(ej | ci)   (product over j = 1, …, m)
Therefore, we then only need to know P(ej | ci) for each feature
and category
30
Estimating Probabilities
Normally, probabilities are estimated based on observed
frequencies in the training data.
If D contains ni examples in category ci, and nij of these ni examples contain feature/attribute ej, then:
  P(ej | ci) = nij / ni
However, estimating such probabilities from small training sets
is error-prone.
If, due only to chance, a rare feature ek is always false in the training data, then for every category ci, P(ek | ci) = 0.
If ek then occurs in a test example E, the result is that P(E | ci) = 0 and hence P(ci | E) = 0 for every category ci.
31
Smoothing
To account for estimation from small samples, probability
estimates are adjusted or smoothed.
Laplace smoothing using an m-estimate assumes that each
feature is given a prior probability, p, that is assumed to have
been previously observed in a “virtual” sample of size m.
  P(ej | ci) = (nij + m·p) / (ni + m)
For binary features, p is simply assumed to be 0.5.
32
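A minimal Python sketch of the m-estimate above (the counts in the example calls are made up for illustration):

def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    # Smoothed estimate: P(e_j | c_i) = (n_ij + m*p) / (n_i + m)
    return (n_ij + m * p) / (n_i + m)

# A feature never observed with class c_i no longer gets probability exactly 0:
print(m_estimate(0, 10))    # 0.5 / 11  ~= 0.045 (instead of 0/10 = 0)
print(m_estimate(7, 10))    # 7.5 / 11  ~= 0.68  (instead of 7/10 = 0.7)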
Naïve Bayesian Classifier - Example
Here, we have two classes C1=“yes” (Positive) and C2=“no” (Negative)
Pr(“yes”) = instances with “yes” / all instances = 9/14
If a new instance X had outlook=“sunny”, then Pr(outlook=“sunny” | “yes”) = 2/9
(since there are 9 instances with “yes” (or P) of which 2 have outlook=“sunny”)
Similarly, for humidity=“high”, Pr(humidity=“high” | “no”) = 4/5
And so on.
33
Naïve Bayes (Example Continued)
Now, given the training set, we can compute all the probabilities
Suppose we have new instance X = <sunny, mild, high, true>. How should it
be classified?
X = < sunny , mild , high , true >
Pr(X | “no”) = 3/5 . 2/5 . 4/5 . 3/5
Similarly:
Pr(X | “yes”) = (2/9 . 4/9 . 3/9 . 3/9)
34
Naïve Bayes (Example Continued)
To find out to which class X belongs we need to maximize: Pr(X | Ci).Pr(Ci),
for each class Ci (here “yes” and “no”)
X = <sunny, mild, high, true>
Pr(X | “no”).Pr(“no”) = (3/5 . 2/5 . 4/5 . 3/5) . 5/14 = 0.04
Pr(X | “yes”).Pr(“yes”) = (2/9 . 4/9 . 3/9 . 3/9) . 9/14 = 0.007
To convert these to probabilities, we can normalize by dividing each by the
sum of the two:
Pr(“no” | X) = 0.04 / (0.04 + 0.007) = 0.85
Pr(“yes” | X) = 0.007 / (0.04 + 0.007) = 0.15
Therefore the new instance X will be classified as “no”.
35
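The whole calculation from the last three slides fits in a short Python sketch. The tuple encoding of the golf data below is an illustrative assumption, and windy = "true"/"false" here corresponds to wind = "strong"/"weak" in the earlier table.

from collections import Counter, defaultdict

# Golf training data: ((outlook, temp, humidity, windy), play)
data = [
    (("sunny", "hot", "high", "false"), "no"),       (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),   (("rain", "mild", "high", "false"), "yes"),
    (("rain", "cool", "normal", "false"), "yes"),    (("rain", "cool", "normal", "true"), "no"),
    (("overcast", "cool", "normal", "true"), "yes"), (("sunny", "mild", "high", "false"), "no"),
    (("sunny", "cool", "normal", "false"), "yes"),   (("rain", "mild", "normal", "false"), "yes"),
    (("sunny", "mild", "normal", "true"), "yes"),    (("overcast", "mild", "high", "true"), "yes"),
    (("overcast", "hot", "normal", "false"), "yes"), (("rain", "mild", "high", "true"), "no"),
]

class_counts = Counter(label for _, label in data)        # {"yes": 9, "no": 5}
feature_counts = defaultdict(Counter)                     # feature_counts[label][(position, value)]
for features, label in data:
    for j, value in enumerate(features):
        feature_counts[label][(j, value)] += 1

def score(x, label):
    # Unnormalized Pr(x | label) * Pr(label) under the conditional-independence assumption.
    s = class_counts[label] / len(data)
    for j, value in enumerate(x):
        s *= feature_counts[label][(j, value)] / class_counts[label]
    return s

x = ("sunny", "mild", "high", "true")
scores = {label: score(x, label) for label in class_counts}
print({label: round(s, 3) for label, s in scores.items()})          # no ~0.041, yes ~0.007
total = sum(scores.values())
print({label: round(s / total, 2) for label, s in scores.items()})  # no 0.85, yes 0.15 -> "no"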
Association-Based Classification
Recall “quantitative” association rules:
If the right-hand side of the rules is restricted to the class attribute to be predicted, the rules can be used directly for classification
It mines high-support, high-confidence rules of the form
  "cond_set => Y"
where Y is a class label.
Has been shown to work better than decision tree models in some cases.
36
Measuring Effectiveness of Classification Models
When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
Example
                 Predicted Class
Actual Class      T      F     Total
  T              18      3      21
  F               2     15      17
  Total          20     18      38
Overall correct classification rate = (18 + 15) / 38 = 87%
Given T, correct classification rate = 18 / 20 = 90%
Given F, correct classification rate = 15 / 18 = 83%
37
Measuring Effectiveness: Lift
usually used for classification, but can be adapted to other methods
measure change in conditional probability of a target class when going from the
general population (full test set) to a biased sample:
  lift = Pr(class_t | sample) / Pr(class_t | population)
Example:
suppose expected response rate to a direct mailing campaign is 5% in the training set
use classifier to assign “yes” or “no” value to a target class “predicted to respond”
the “yes” group will contain a higher proportion of actual responders than the test set
suppose the “yes” group (our biased sample) contains 50% actual responders
this gives lift of 10 = 0.5 / 0.05
What if the lift sample is too small?
  need to increase the sample size
  trade-off between lift and sample size
[Chart: number of respondents and lift as functions of sample size]
38
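The lift calculation in the example is a one-liner; here it is as a tiny Python sketch with the example's numbers:

def lift(p_target_given_sample, p_target_given_population):
    # lift = Pr(class_t | sample) / Pr(class_t | population)
    return p_target_given_sample / p_target_given_population

print(lift(0.50, 0.05))   # 10.0 -- the "yes" group is 10 times richer in responders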
What Is Prediction?
Prediction is similar to classification
First, construct a model
Second, use model to predict unknown value
Prediction is different from classification
Classification refers to predicting categorical class label (e.g., “yes”, “no”)
Prediction models are used to predict values of a numeric target attribute
They can be thought of as continuous-valued functions
Major method for prediction is regression
Linear and multiple regression
Non-linear regression
K-Nearest-Neighbor
Most common application domains:
recommender systems, credit scoring, customer lifetime values
39
Prediction: Regression Analysis
Most common approaches to prediction: linear or multiple regression.
Linear regression: Y = α + β·X
The model is a line which best reflects the data distribution; the line allows for prediction of the Y attribute value based on the single attribute X.
Two parameters, α and β, specify the line and are to be estimated by using the data at hand
Common approach: apply the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
Regression applet:
http://www.math.csusb.edu/faculty/stanton/probstat/regression.html
Multiple regression: Y = b0 + b1 X1 + b2 X2
Necessary when prediction must be made based on multiple attributes
E.g., predict Customer LTV based on: Age, Income, Spending, Items purchased, etc.
Many nonlinear functions can be transformed into the above.
40
Measuring Effectiveness of Prediction
Predictive models are evaluated based on the accuracy of their
predictions on unseen data
accuracy measured in terms of error rate (usually % of records classified incorrectly)
error rate on a pre-classified evaluation set estimates the real error rate
Prediction Effectiveness
Difference between predicted scores and the actual results (from evaluation set)
Typically the accuracy of the model is measured in terms of the average of the squared differences between predicted and actual values
E.g., Root Mean Squared Error (RMSE): the square root of the mean squared difference between the predicted and actual ratings
  RMSE = sqrt( ((p1 - a1)^2 + … + (pn - an)^2) / n )
41
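A minimal Python sketch of the RMSE formula (the predicted and actual values below are made up for illustration):

import math

def rmse(predicted, actual):
    # Square root of the average squared difference between predictions and actuals.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(round(rmse([4.0, 3.5, 5.0], [5.0, 3.0, 4.0]), 3))   # 0.866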
Example: Recommender Systems
Basic formulation as a prediction problem
Given a profile Pu for a user u, and a target item it,
predict the interest score of user u on item it
Typically, the profile Pu contains interest
scores by u on some other items, {i1, …, ik}
different from it
Interest scores on i1, …, ik may have been obtained
explicitly (e.g., movie ratings) or implicitly (e.g., time spent
on a product page or news article)
42
Example: Recommender Systems
Content-based recommenders
Predictions for unseen (target) items are computed based on their
similarity (in terms of content) to items in the user profile.
E.g., based on the items that the user profile Pu contains, some new items are recommended highly and others only "mildly", according to their content similarity to the profile. [Example item images omitted.]
43
Content-Based Recommender Systems
44
Example: Recommender Systems
Collaborative filtering recommenders
Predictions for unseen (target) items are computed based on other users with similar interest scores on the items in user u's profile
i.e., users with similar tastes (aka "nearest neighbors")
requires computing correlations between user u and other users according to their interest scores or ratings
          Star Wars  Jurassic Park  Terminator 2  Indep. Day   Average  Pearson (corr. to Karen)
Sally         7            6              3            7         5.75        0.82
Bob           7            4              4            6         5.25        0.96
Chris         3            7              7            2         4.75       -0.87
Lynn          4            4              6            2         4.00       -0.57
Karen         7            4              3            ?         4.67

Can we predict Karen's rating on the unseen item Independence Day?

Prediction from the K nearest neighbors (Pearson):
  K = 1: 6
  K = 2: 6.5
  K = 3: 5
45
Example: Recommender Systems
Collaborative filtering recommenders
Predictions for unseen (target) items are computed based on other users with similar interest scores on the items in user u's profile
i.e., users with similar tastes (aka "nearest neighbors")
requires computing correlations between user u and other users according to their interest scores or ratings
(Same ratings table as on the previous slide; Pearson correlation to Karen: Sally 0.82, Bob 0.96, Chris -0.87, Lynn -0.57; user averages: Sally 5.75, Bob 5.25, Chris 4.75, Lynn 4.00, Karen 4.67.)

Predictions for Karen on Indep. Day based on the K nearest neighbors (Pearson):
  K = 1: 6
  K = 2: 6.5
  K = 3: 5
46
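The numbers in the table can be reproduced with a short Python sketch of one common formulation: correlate Karen with each user over the items they have co-rated, centering each user on their own mean rating, then average the neighbors' ratings of the target item. The dictionary encoding and item abbreviations are illustrative assumptions.

import math

ratings = {                               # user -> {item: rating}
    "Sally": {"SW": 7, "JP": 6, "T2": 3, "ID": 7},
    "Bob":   {"SW": 7, "JP": 4, "T2": 4, "ID": 6},
    "Chris": {"SW": 3, "JP": 7, "T2": 7, "ID": 2},
    "Lynn":  {"SW": 4, "JP": 4, "T2": 6, "ID": 2},
    "Karen": {"SW": 7, "JP": 4, "T2": 3},
}

def mean(user):
    return sum(ratings[user].values()) / len(ratings[user])

def pearson(u, v):
    # Correlation over the items u and v have both rated, centered on each user's mean.
    common = sorted(set(ratings[u]) & set(ratings[v]))
    du = [ratings[u][i] - mean(u) for i in common]
    dv = [ratings[v][i] - mean(v) for i in common]
    num = sum(x * y for x, y in zip(du, dv))
    den = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(y * y for y in dv))
    return num / den

sims = {u: pearson("Karen", u) for u in ratings if u != "Karen"}
print({u: round(s, 2) for u, s in sims.items()})
# {'Sally': 0.82, 'Bob': 0.96, 'Chris': -0.87, 'Lynn': -0.57}

for k in (1, 2, 3):                       # Karen's predicted rating for "ID" (Indep. Day)
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    print(k, sum(ratings[u]["ID"] for u in neighbors) / k)
# 1 -> 6.0, 2 -> 6.5, 3 -> 5.0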
Possible Interesting Project Ideas
Build a content-based recommender for
Movies (e.g., previous example)
News stories (requires basic text processing and indexing of documents)
Music (based on features such as genre, artist, etc.)
Build a collaborative recommender for
Movies (using movie ratings), e.g., movielens.org
Music, e.g., pandora.com
Recommend songs or albums based on collaborative ratings
Or, recommend whole playlists based on playlists from other users (this might be a good candidate application for association rule mining -- why?)
47
Other Forms of Collaborative and Social Filtering
Social Tagging (Folksonomy)
people add free-text tags to their content
where people happen to use the same terms, their content is linked
frequently used terms float to the top, creating a kind of positive feedback loop for popular tags
Examples:
Del.icio.us
Flickr
48
Social Tagging
Deviating from standard mental models
No browsing of topical, categorized navigation or searching for an explicit
term or phrase
Instead, I use the language I use to define my world (tagging)
Sharing my language and contexts will create community
Tagging creates community through the overlap of perspectives
This leads to the creation of social networks which may further develop and
evolve
But, does this lead to dynamic evolution of complex
concepts or knowledge? Collective intelligence?
49
Clustering and Collaborative Filtering
:: clustering based on ratings: movielens
50
Clustering and Collaborative Filtering
:: tag clustering example
51
Classification Example - Bank Data
Want to determine likely responders to a direct mail campaign
a new product, a "Personal Equity Plan" (PEP)
training data include records kept about how previous customers responded and
bought the product
in this case the target class is “pep” with binary value
want to build a model and apply it to new data (a customer list) in which the
value of the class attribute is not available
id       age  sex     region      income   married  children  car  save_act  current_act  mortgage  pep
ID12101  48   FEMALE  INNER_CITY  17546    NO       1         NO   NO        NO           NO        YES
ID12102  40   MALE    TOWN        30085.1  YES      3         YES  NO        YES          YES       NO
ID12103  51   FEMALE  INNER_CITY  16575.4  YES      0         YES  YES       YES          NO        NO
ID12104  23   FEMALE  TOWN        20375.4  YES      3         NO   NO        YES          NO        NO
ID12105  57   FEMALE  RURAL       50576.3  YES      0         NO   NO        YES          NO        NO
ID12106  57   FEMALE  TOWN        37869.6  YES      2         NO   NO        YES          YES       NO
ID12107  22   MALE    RURAL       8877.07  NO       0         NO   YES       NO           YES       NO
ID12108  58   MALE    TOWN        24946.6  YES      0         YES  YES       YES          YES       NO
ID12109  37   FEMALE  SUBURBAN    25304.3  YES      2         YES  NO        NO           NO        NO
ID12110  54   MALE    TOWN        24212.1  YES      2         YES  NO        YES          YES       NO
ID12111  66   FEMALE  TOWN        59803.9  YES      0         NO   NO        YES          YES       NO
…        …    …       …           …        …        …         …    …         …            …         …
52
Data Preparation
Several steps to prepare the data for Weka and for See5
open the training data in Excel, remove the "id" column, and save the results as a comma-delimited file (e.g., "bank.csv")
do the same with the new customer data, but also add a new column called "pep" as the last column; the value of this column for each record should be "?"
Weka
must convert the data to ARFF format
attribute specification and data are in the same file
the data portion is just the comma delimited data file without the label row
See5/C5
create a “name” file and a “data” file
“name” file contains attribute specification; “data” file is same as above
first line of “name” file must be the name(s) of the target class(es) - in this case
“pep”
53
Data File Format for Weka
Training Data
New Cases
@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
...
@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
...
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
54
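Outside of Weka itself, the same ARFF files can also be read programmatically; SciPy, for instance, ships an ARFF reader. A minimal Python sketch (the file name is an assumption for illustration):

# Load the training ARFF file into a NumPy structured array; nominal values
# come back as byte strings. The file name here is assumed for illustration.
from scipy.io import arff

data, meta = arff.loadarff("train-bank-data.arff")
print(meta)        # the attribute names and types declared in the @attribute lines
print(len(data))   # number of training records
print(data[0])     # first record, e.g. (48., b'FEMALE', b'INNER_CITY', ..., b'YES')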
C4.5 Implementation in Weka
To build a model (decision tree), use the classifiers.trees.j48.J48 class

Decision Tree Output (pruned):
children <= 2
| children <= 0
| | married = YES
| | | mortgage = YES
| | | | save_act = YES: NO (16.0/2.0)
| | | | save_act = NO: YES (9.0/1.0)
| | | mortgage = NO: NO (59.0/6.0)
| | married = NO
| | | mortgage = YES
| | | | save_act = YES: NO (12.0)
| | | | save_act = NO: YES (3.0)
| | | mortgage = NO: YES (29.0/2.0)
| children > 0
| | income <= 29622
| | | children <= 1
| | | | income <= 12640.3: NO (5.0)
| | | | income > 12640.3
| | | | | current_act = YES: YES (28.0/1.0)
| | | | | current_act = NO
| | | | | | income <= 17390.1: NO (3.0)
| | | | | | income > 17390.1: YES (6.0)
| | | children > 1: NO (47.0/3.0)
| | income > 29622: YES (48.0/2.0)
children > 2
| income <= 43228.2: NO (30.0/2.0)
| income > 43228.2: YES (5.0)
55
C4.5 Implementation in Weka
The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc. The model can be saved to be later applied to the test data (or to new, unclassified instances).

=== Error on training data ===

Correctly Classified Instances      281       93.6667 %
Incorrectly Classified Instances     19        6.3333 %
Mean absolute error                   0.1163
Root mean squared error               0.2412
Relative absolute error              23.496  %
Root relative squared error          48.4742 %
Total Number of Instances           300

=== Confusion Matrix ===

   a   b   <-- classified as
 122  13 |   a = YES
   6 159 |   b = NO

=== Stratified cross-validation ===

Correctly Classified Instances      274       91.3333 %
Incorrectly Classified Instances     26        8.6667 %
Mean absolute error                   0.1434
Root mean squared error               0.291
Relative absolute error              28.9615 %
Root relative squared error          58.4922 %
Total Number of Instances           300

=== Confusion Matrix ===

   a   b   <-- classified as
 118  17 |   a = YES
   9 156 |   b = NO
56