ITCS 6265/8265 Project

Group 5
Gabriel Njock
Tanusree Pai
Ke Wang
Outline







Domain
Problem Statement and Objective
Data Description
Problem Characteristics and Method Used
Implementation
Data Formatting
Feature Selection
Boosting and Derived Attributes
Testing & Results
References
Domain



COIL CHALLENGE
Direct mailings to a company's potential customers - "junk
mail" to many - can be a very effective way for them to market
a product or a service. However, as we all know, much of this
junk mail is really of no interest to the people that receive it.
Most of it ends up thrown away, not only wasting the money
that the company spent on it, but also filling up landfill waste
sites or needing to be recycled. If the company had a better
understanding of who their potential customers were, they
would know more accurately who to send it to, so some of this
waste and expense could be reduced.
Motivation for Data Mining: cost reduction, realized by targeting
only a portion of the potential customers.
Problem Statement

The data used in this problem represents a frequently
occurring problem: analysis of data about customers of a
company, in this case an insurance company.

Information about customers consists of 86 variables and
includes product usage data and socio-demographic data
derived from zip codes.

The data was supplied by the Dutch data mining company
Sentient Machine Research, and is based on real world
business data.
Coil Challenge objective

The competition consists of two tasks:

Predict which customers are potentially interested
in a caravan insurance policy.

Describe the actual or potential customers; and
possibly explain why these customers buy a
caravan policy.
Project Objective

Propose a solution that will allow us to predict
whether a customer is interested in a caravan
insurance policy.

Find the subset of customers with a probability of
purchasing a caravan insurance policy above some
boundary probability.
Data Description



TRAINING SET
5822 customer records
86 attributes. The attributes could be broadly categorized as
follows:
Socio-demographic (43)
Insurance Policy Related (42)
Contribution-per-policy type
(21)
Number-of-policies
(21)
Decision Attribute “Caravan”
Data Description

TEST SET

4000 customer records

85 attributes. Caravan attribute was missing. Need
to predict the Caravan attribute value.

Note: Attribute values in both sets were prediscretized.
Problem Characteristics

The problem reduces to a classification analysis of
customers: the two classes are customers who are
interested in purchasing a Caravan policy and those
who are not.

The learning of the classification model is
supervised because the training set provides the
decision attribute values.
Method Used
Naive Bayesian Classification
Bayesian classifiers are statistical classifiers [1] used
to predict class membership probabilities.
Bayes Theorem
If X is a data sample whose class label is unknown
and H is some hypothesis, such as that X belongs to class C, then
the probability that the hypothesis H holds given the observed
data sample X is denoted as P(H|X) and given by Bayes
theorem as
P(H|X) = P(X|H) P(H) / P(X)
Naive Bayes Classification
The naive Bayesian classifier is based on Bayes theorem and works as follows:
1. Each data sample is represented by an n-dimensional feature vector,
X = (x1, x2, ..., xn), depicting n measurements made on the sample from
n attributes, respectively A1, A2, ..., An. For our problem the training set
has 86 attributes, one of which is the decision attribute, so n = 85.
2. If there are m classes, C1, C2, ..., Cm, then given an unknown data
sample, X (i.e. having no class label), the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned
on X. That is, the naive Bayesian classifier assigns an unknown sample
X to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j <> i
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized
is called the maximum a posteriori hypothesis. By Bayes theorem,
P(Ci|X) = P(X | Ci) P(Ci) / P(X)
Naive Bayes Classification
3. As P(X) is constant for all classes, only P(X | Ci) P(Ci) need be
maximized. If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, i.e., P(C1) =
P(C2) = ...= P(Cm), and we would therefore maximize P(X | Ci).
Otherwise we maximize P(X | Ci) P(Ci).
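4. In order to reduce computation in evaluating P(X | Ci), the naive
assumption of class conditional independence is made. This presumes that the
values of the attributes are conditionally independent of one another,
given the class label of the sample, so that
P(X | Ci) = P(x1 | Ci) P(x2 | Ci) ... P(xn | Ci)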
5. To classify an unknown sample X, P(X | Ci) P(Ci) is evaluated for
each class Ci. Sample X is then assigned to the class Ci if and only if
P(X | Ci) P(Ci) > P(X | Cj) P(Cj) for 1 <= j <= m, j <> i
In other words, it is assigned to the class Ci for which P(X | Ci) P(Ci)
is the maximum.
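Steps 1 to 5 above can be sketched in a few lines of Python. This is a minimal illustration for discrete attributes, not the Weka implementation used in the project. The function names, the toy data and the Laplace smoothing are assumptions added for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and the counts behind P(xk | Ci) from discrete data."""
    n = len(rows)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # value_counts[c][k][v] = number of class-c rows whose k-th attribute is v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            value_counts[c][k][v] += 1
    # Number of distinct values per attribute, used for smoothing
    n_values = [len({row[k] for row in rows}) for k in range(len(rows[0]))]
    return priors, class_counts, value_counts, n_values

def posterior(model, x):
    """Return P(Ci | X) for every class, normalized so the values sum to 1."""
    priors, class_counts, value_counts, n_values = model
    scores = {}
    for c, prior in priors.items():
        score = prior  # P(Ci)
        for k, v in enumerate(x):
            # P(xk | Ci) with Laplace smoothing (assumption, not in the slides)
            score *= (value_counts[c][k][v] + 1) / (class_counts[c] + n_values[k])
        scores[c] = score  # proportional to P(X | Ci) P(Ci)
    total = sum(scores.values())  # plays the role of P(X)
    return {c: s / total for c, s in scores.items()}

def classify(model, x):
    """Assign X to the class with the highest posterior probability."""
    post = posterior(model, x)
    return max(post, key=post.get)

# Made-up toy data: two discrete attributes, class 1 = interested
toy_rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0)]
toy_labels = [1, 1, 0, 0, 1, 0]
toy_model = train_naive_bayes(toy_rows, toy_labels)
```

Calling `classify(toy_model, (1, 1))` picks the class with the larger product P(X | Ci) P(Ci), and `posterior` gives the probabilities that a cut-off can later be applied to.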
Naive Bayes Classification

Advantages
1. Comparable in performance with decision trees
and neural networks.
2. High accuracy and speed when applied to large
databases.
3. In theory, Bayesian classifiers have the
minimum error rate in comparison to all other
classifiers, although in practice this does not always
hold when the underlying assumptions are violated.

Disadvantages
1. It assumes that attributes are conditionally
independent given the class, which may lead to
inaccuracy when attributes are correlated.
Implementation

Data Formatting and Tools Used:

Each data set (training as well as test) was available as
a simple text file with space-separated values.

A database (COILDB.mdb) was created with two
tables (COILDB_TRAIN and COILDB_TEST)
having appropriate column definitions for all
attributes. Data from the text files was loaded into
the tables.
Software used: MS Access
Purpose:
Allow executing SQL queries for data
analysis
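The loading step can be illustrated with Python's built-in sqlite3 module standing in for MS Access. This is only a sketch under that substitution: the `load_records` helper, the three-column layout and the sample values are made up for the example, and the real tables have one column per attribute.

```python
import sqlite3

def load_records(conn, table, columns, lines):
    """Create `table` and fill it from space-separated text lines."""
    col_defs = ", ".join(f"{name} INTEGER" for name in columns)
    conn.execute(f"CREATE TABLE {table} ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    # Skip blank lines and convert every field to an integer
    rows = [tuple(int(v) for v in line.split()) for line in lines if line.strip()]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
# Made-up sample lines, reduced to three columns for illustration
sample_lines = ["33 1 0", "37 1 0", "9 2 1", "40 0 0"]
columns = ["MOSTYPE", "APERSAUT", "CARAVAN"]
loaded = load_records(conn, "COILDB_TRAIN", columns, sample_lines)

# Example analysis query: number of records per value of the decision attribute
counts = conn.execute(
    "SELECT CARAVAN, COUNT(*) FROM COILDB_TRAIN "
    "GROUP BY CARAVAN ORDER BY CARAVAN"
).fetchall()
```

With the real file, `lines` would be the open text file, and the same kind of GROUP BY query shows how each attribute's values are distributed.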
Implementation

Data from the text file was populated into
spreadsheets.
Software used: MS Excel, Analyse-It
Purpose:
Allow statistical analysis
(histogram, correlation) and have
graphical output.

.arff files were created
Software Used: Weka
Purpose:
Use Bayesian Classification for
machine learning as well as testing.
Implementation

Feature selection
The analysis to determine the relevance of each
attribute was carried out in two steps:
1. Analyze the relevance of demographical
attributes.
2. Analyze the relevance of the non-demographical
attributes.
Feature Selection – Demographical Attributes

The attribute selection feature in Weka was used to rank the
demographical attributes according to information gain.

The 4 demographical attributes that had the highest
information gain values were
Customer Subtype (Mostype),
Customer Main Type (Moshoofd),
Average Income (Minkgem) and
Purchasing Power Class (Mkoopkla).
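Information gain for a discrete attribute is the class entropy minus the expected class entropy after splitting on that attribute. The sketch below shows the calculation that such a ranking relies on. It is not Weka's code, and the function names are assumptions for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_by_information_gain(table, labels):
    """Sort attribute names from highest to lowest information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in table.items()}
    return sorted(gains, key=gains.get, reverse=True)
```

Here `table` is assumed to map an attribute name to its column of values, so the first few names returned by `rank_by_information_gain` are the candidates to keep.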
Feature Selection – Demographical Attributes

Simple Naive Bayesian classification was used with
different combinations of the 4 demographical
attributes along with the non-demographical
attributes to determine which combination of these
four attributes would yield the best accuracy.

The percentage of correctly classified instances and
percentage of incorrectly classified instances were
then compared for all the combined attribute groups.
Feature Selection – Demographical Attributes

A correlation analysis was also conducted on the 4 demographical
attributes.

The figure shows the Customer Type and Customer Subtype
attributes, which have a Pearson correlation factor of 0.99.
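The Pearson correlation factor between two attribute columns can be computed as below. This is a plain-Python sketch of the standard formula, not the Analyse-It procedure used in the project, and the function name is an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance times n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)
```

A value close to 1, such as the 0.99 reported here, means that the two attributes carry almost the same information, so keeping both adds little to the classifier.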
Feature Selection – Demographical Attributes


Based on the correlation analysis between the 4
demographical attributes and the comparison of the
percentages of correctly and incorrectly classified
instances, it was decided to retain only 2 of the 43
demographical attributes.
Average Income (Minkgem)
Accuracy: 92.2363
Purchasing Power Class (Mkoopkla)
Accuracy: 92.0818
All attributes
Accuracy: 83.3047
Feature Selection – Policy Attributes

The 42 non-demographical attributes were mainly insurance
policy related and of two types:

Contribution-per-policy Attributes

Number-of-policy Attributes

Preliminary Analysis:
1. The contribution-per-policy and number-of-policies
attributes were highly correlated.
2. For 37 out of the 43 policy related attributes (including the
caravan policy ownership attribute), more than 90% of the
records have only 1 value (mainly: 0) – sparsely used attributes.
3. The vast majority of customers buy mostly fire, car and
third party insurance policies.
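The sparsity check in point 2 amounts to measuring, for each attribute, the share of records that take its most common value. The sketch below assumes a table that maps attribute names to columns. The helper names and the toy values are made up for illustration.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of records that take the most common value of one attribute."""
    return Counter(values).most_common(1)[0][1] / len(values)

def sparse_attributes(table, threshold=0.90):
    """Attributes where a single value covers more than `threshold` of the records."""
    return [name for name, values in table.items()
            if dominant_share(values) > threshold]

# Made-up toy columns, 20 records each
toy_table = {
    "CARAVAN": [0] * 19 + [1],        # one value covers 95% of the records
    "APERSAUT": [0] * 10 + [1] * 10,  # values are evenly split
}
sparse = sparse_attributes(toy_table)
```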
Boosting and Deriving Attributes

For each pair of attributes (contribution-per-policy,
number-of-policy) we performed two kinds of
analysis:
1. Determine the correlation factor among the
attributes
2. Derive the Total Contribution Attribute, which was
the product of the contribution from a policy and the
number of those policies.
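The derived attribute in step 2 is a simple product per record. In the sketch below, the `TOTAL_*` names, the helper function and the sample record are hypothetical. Because the attribute values are prediscretized, the product combines a contribution level with a policy count rather than giving an actual amount of money.

```python
def add_total_contributions(record, pairs):
    """Replace each (contribution, number) attribute pair by their product."""
    used = {name for pair in pairs.values() for name in pair}
    # Keep every attribute that is not part of a replaced pair
    derived = {k: v for k, v in record.items() if k not in used}
    for new_name, (contribution_name, number_name) in pairs.items():
        derived[new_name] = record[contribution_name] * record[number_name]
    return derived

# Hypothetical names for the derived attributes
pairs = {
    "TOTAL_CAR": ("PPERSAUT", "APERSAUT"),
    "TOTAL_THIRD_PARTY": ("PWAPART", "AWAPART"),
}
# Made-up record: contribution level 6 with 1 car policy, level 2 with 2 third party policies
record = {"PPERSAUT": 6, "APERSAUT": 1, "PWAPART": 2, "AWAPART": 2, "CARAVAN": 0}
derived_record = add_total_contributions(record, pairs)
```

Each pair of original attributes collapses into one derived attribute, which is how replacing attribute pairs reduces the attribute count.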
Boosting and Deriving Attributes

Simple Naive Bayes classification was conducted
using the derived attributes and it was found that
the product attribute gave a higher accuracy than
the attributes individually. The three derived
attributes which made a significant difference were:
1. CAR Policies (PPERSAUT, APERSAUT)
2. FIRE Policies (PBRAND, ABRAND)
3. Private Third Party Insurance Policies
(PWAPART, AWAPART)
Boosting and Deriving Attributes


The classification was performed again, using all
combinations of the derived attributes to determine
which of the three derived attributes had to be
retained.
We compared the percentage of correctly classified
instances and percentage of incorrectly classified
instances.
The highest accuracy and lowest error was found
when using all three derived attributes.
Feature Selection – Policy Attributes

The number of non-demographical attributes was then
reduced from 42 to 39 by replacing 6 attributes with the
3 derived attributes.

These attributes were combined with the Average Income and
Purchasing Power Class Attribute individually as well as
together.

Accuracy in each case:
Avg. Income & Policy Attributes (with 3 derived):
93.181
Purchasing Power Class & Policy Attributes (with 3 derived):
93.1982
Avg. Income, Purchasing Power class & Policy Attributes (3
derived):
93.0608
Feature Selection – Policy Attributes




We ranked the remaining attributes again according
to information gain to test their relevance.
The attributes with significant information gain were
related to Boat policies and SSN policies.
Correlation analysis as well as accuracy analysis with
classification were used to determine the final set of
attributes.
Accuracy reached with 6 attributes (Purchasing
Power, 3 derived attributes representing contribution
from Car, Fire and Private Third Party policies,
Number of Boat policies and Number of SSN
Policies) was:
93.9711
Testing & Results

After training of the model, the test data was used to
predict the subset of customers likely to purchase
the CARAVAN policy.

A cut-off probability of 80% was used.

We obtained 115 records with 80% or higher
probability of purchasing the CARAVAN policy.
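Applying the cut-off is a simple filter on the predicted probabilities. The sketch below assumes the classifier returns one purchase probability per test record. The function name and the sample probabilities are made up, while the counts 115 and 4000 come from the results above.

```python
def select_prospects(probabilities, cutoff=0.80):
    """Indices of records whose predicted purchase probability is at least `cutoff`."""
    return [i for i, p in enumerate(probabilities) if p >= cutoff]

# Made-up probabilities for four test records
selected = select_prospects([0.95, 0.10, 0.80, 0.79])

# Share of the test set that would be targeted, using the reported counts
mailing_share = 115 / 4000
```

With 115 of 4000 records selected, about 2.9% of the test customers would be targeted.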
Conclusion

From a set of 4000 customer records, our
classification analysis predicted only 115 with a
probability of purchasing the Caravan Policy higher
than 80%.

Significant reduction in cost can be obtained by
targeting only those customers.
References

[1]. The Insurance Company (TIC) Benchmark: The
Coil Challenge Report,
http://www.liacs.nl/~putten/library/cc2000/

[2]. Sentient Machine Research. http://www.smr.nl/