DATA MINING FINAL REPORT
Vipin Saini M964011062
許博淞 M964020009
陳昀志 M964020043
Outline
• Introduction
• DM Methodology (Step1~Step3)
• DM Methodology (Step4~Step8)
• DM Methodology (Step9~Step10)
• Conclusion
Introduction
• Direct marketing
• Response rate
• Telecommunications company
• Publicly available business data
• Addition of random companies
Step2-Records
Some characteristics about each prospect:
• Number of employees at a particular office
• Number of employees for the entire company
• Annual sales (in thousands) at a particular office
• Annual sales (in thousands) for the entire company
• Whether or not the company does business outside the United States
• Annual advertising expense
• Whether the company has moved recently or is a new business
• The type of ownership
• Specific industry code
• General industry code
• Age of the company (in years)
Step3-Data Type
Correcting the data types:
• Make sure "Buyer" is the type Yes/No.
• Change the type of "Age" to integer.
• Make sure the "International" type is string or Boolean.
• Change "Local Employees" to integer.
• Change "Local Sales" to integer.
• Change "Industry Type" to categorical.
• Change "Total Employees" to integer.
• Change "Total Sales" to integer.
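The type corrections listed for Step 3 can be sketched in plain Python. This is only an illustration: it assumes the records arrive as dictionaries of strings (for example from `csv.DictReader`), that Yes/No fields are spelled "Yes"/"No", and that a blank means a missing value. The helper names and the sample record are hypothetical and do not come from the report.

```python
# Sketch only: coerce raw string fields to the types listed in Step 3.
# Assumption: every value is a string and a blank means "missing".

INTEGER_COLUMNS = ["Age", "Local Employees", "Local Sales",
                   "Total Employees", "Total Sales"]


def to_bool(text):
    """Map Yes/No style strings to True/False; blank or unknown -> None."""
    value = text.strip().lower()
    if value in ("yes", "y", "true", "1"):
        return True
    if value in ("no", "n", "false", "0"):
        return False
    return None


def to_int(text):
    """Parse an integer, tolerating blanks and values such as '12.0'."""
    value = text.strip()
    if not value:
        return None
    return int(float(value))


def coerce_row(row):
    """Return a copy of one record with corrected data types."""
    fixed = dict(row)
    fixed["Buyer"] = to_bool(row["Buyer"])
    fixed["International"] = to_bool(row["International"])
    for column in INTEGER_COLUMNS:
        fixed[column] = to_int(row[column])
    # Categorical: keep the code as a label, never as a number.
    fixed["Industry Type"] = row["Industry Type"].strip()
    return fixed


# Hypothetical record, for illustration only.
sample = {"Buyer": "Yes", "Age": "4", "International": "No",
          "Local Employees": "25", "Local Sales": "1200",
          "Industry Type": " E ", "Total Employees": "100",
          "Total Sales": "5000"}
print(coerce_row(sample))
```

Keeping "Industry Type" as a string label prevents a tool from treating industry codes as ordered numbers.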
Step4: Create a Model Set

Employee counts and sales figures differ with the size of the company. Taken together, these characteristics give a picture of company size.
Employee Ratio, Sales Ratio, Productivity Ratio
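The slides name the three ratios but do not define them. The sketch below shows one plausible reading, and every formula in it is an assumption: Employee Ratio as local employees over total employees, Sales Ratio as local sales over total sales, and Productivity Ratio as local sales per local employee. A ratio with a missing or zero denominator is returned as `None`, which would correspond to an N/A value in the tree.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when it is undefined."""
    if numerator is None or denominator in (None, 0):
        return None
    return numerator / denominator


def add_ratios(record):
    """Return a copy of the record with three derived ratio columns.

    The formulas are assumptions; the report only gives the names.
    """
    out = dict(record)
    out["Employee Ratio"] = safe_ratio(record["Local Employees"],
                                       record["Total Employees"])
    out["Sales Ratio"] = safe_ratio(record["Local Sales"],
                                    record["Total Sales"])
    out["Productivity Ratio"] = safe_ratio(record["Local Sales"],
                                           record["Local Employees"])
    return out


# Hypothetical office: 25 of 100 employees, 1200 of 5000 in sales.
office = {"Local Employees": 25, "Total Employees": 100,
          "Local Sales": 1200, "Total Sales": 5000}
print(add_ratios(office))
```

Ratios of this kind describe an office relative to its parent company, so they carry information that raw size columns do not.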
Step4: Create a Model Set

With our newly applied rules, the World dataset
now has redundant columns.
Step5: Fix Problems with the Data

Categorical variables with too many values
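The slide does not say how the high-cardinality variables were handled. One common remedy, shown here purely as an assumption and not as the authors' method, is to merge rarely seen category values into a single "Other" bucket. The function name, the threshold, and the sample codes are all hypothetical.

```python
from collections import Counter


def collapse_rare(values, min_count, other_label="Other"):
    """Replace category values seen fewer than min_count times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical industry codes, for illustration only.
codes = ["7311", "7311", "7311", "5812", "5812", "0742", "8999"]
print(collapse_rare(codes, min_count=2))
# ['7311', '7311', '7311', '5812', '5812', 'Other', 'Other']
```

Another option consistent with the record layout is to rely on the general industry code rather than the specific one, since it has fewer distinct values.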
Step6: Transform the data


Create a training set and a testing set.
Total records: 13117
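Step 9 describes the testing set as a random 50% of the records, so the split can be sketched as a shuffle followed by a cut at the midpoint. The seed and the function name are assumptions, and the range of integers only stands in for the 13117 records.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it at the midpoint."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


record_ids = range(13117)  # stand-in for the 13117 records
train, test = split_half(record_ids)
print(len(train), len(test))  # 6558 6559
```

The sizes 6558 and 6559 agree with the totals of the training and testing tables reported in Steps 8 and 9 when the "undefined" counts are included.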
Step7: Build Model

We used PolyAnalyst version 5.0 to mine the data.
Step7: Build Model

We used our edited MarketData.CSV file as the source. After the software filtered out the missing values, we obtained the decision tree.
the Decision Tree
[Figure: decision tree diagram. The root splits on Local Employee (<23, >=23). Other node labels shown: Age (<3, >=3), Age (<2, >=2), Sales Ratio (<0.0027, >=0.0027, N/A), Local Employee (<10, >=10), Industry Category (A to I), Employee Ratio (split at 0.214).]
the Decision Tree

We made a decision tree with:
• Number of non-terminal nodes: 41
• Number of leaves: 91
• Depth of the tree: 8
Step 8:Assess model
• Decision tree results on the training set:

Real/predict   No     Yes
No             3018   379
Yes            528    2535
undefined      49     49

• Total classification error: 14.04%
• Classification accuracy: 85.96%
• Classification error for class No: 14.89%
• Classification error for class Yes: 13.01%
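The reported percentages can be reproduced from the four Yes/No counts of the training-set table. The sketch below rests on an inference from the arithmetic rather than on anything stated in the slides: the figures match when the "undefined" counts (49, 49) are left out and each class error is computed down the column carrying that class label.

```python
# table[row][column], using the training-set counts from the slide.
# Assumption: the "undefined" row (49, 49) is excluded from all rates.
table = {
    "No":  {"No": 3018, "Yes": 379},
    "Yes": {"No": 528,  "Yes": 2535},
}


def percent(part, whole):
    """Express part / whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total = sum(sum(row.values()) for row in table.values())
off_diagonal = table["No"]["Yes"] + table["Yes"]["No"]

total_error = percent(off_diagonal, total)
accuracy = round(100 - total_error, 2)
# Per-class error: off-diagonal count over the column total.
error_no = percent(table["Yes"]["No"],
                   table["No"]["No"] + table["Yes"]["No"])
error_yes = percent(table["No"]["Yes"],
                    table["No"]["Yes"] + table["Yes"]["Yes"])

print(total_error, accuracy, error_no, error_yes)
# 14.04 85.96 14.89 13.01
```

Substituting the testing-set counts from Step 9 (3074, 396, 610, 2395) into the same table gives 15.54, 84.46, 16.56 and 14.19, which matches the figures reported there.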
Step 8:Assess model

If we take the top 40% of the records as ranked by this model, we capture about 80% of the correct responses.
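The 40%/80% statement reads like one point on a cumulative gains curve, although the slides do not show how it was computed. A minimal sketch of that calculation follows, assuming each record has a model score and a known outcome. The function name and the ten example prospects are hypothetical and are not taken from the report.

```python
def cumulative_gain(scores, responded, fraction):
    """Share of all responders found in the top `fraction` of records,
    with records ranked by descending model score."""
    ranked = sorted(zip(scores, responded),
                    key=lambda pair: pair[0], reverse=True)
    cutoff = int(round(fraction * len(ranked)))
    captured = sum(1 for _, hit in ranked[:cutoff] if hit)
    total = sum(1 for hit in responded if hit)
    return captured / total if total else 0.0


# Hypothetical scores and outcomes for ten prospects.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
responded = [True, True, True, True, False,
             True, False, False, False, False]
print(cumulative_gain(scores, responded, 0.4))  # 0.8
```

In this toy data, contacting the four highest-scored prospects reaches four of the five responders, which is the same 40% to 80% pattern the slide describes.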
Step 9. Deploy models

The testing set is a randomly selected 50% of the records in the whole dataset.
Real/predict   No     Yes
No             3074   396
Yes            610    2395
undefined      45     39

• Total classification error: 15.54%
• Classification accuracy: 84.46%
• Classification error for class No: 16.56%
• Classification error for class Yes: 14.19%
Step 10. Assess result
[Figure: first split of the tree. Local Employee <23: No. Local Employee >=23: Yes.]
Step 10. Assess result

Companies with 23 or more local employees are more likely to respond (class label is Yes and the ratio is 75.5%).

Bigger companies with more employees tend to respond.
Companies with fewer than 23 local employees are likely not to respond (class label is No and the ratio is 72.9%).

Small companies do not tend to respond.
Step 10. Assess result
[Figure: tree branches by company size and industry. Node labels shown: Local Employee (<23, >=23), Industry Category (A to I), Employee Ratio (<0.214: No, >=0.214: Yes).]
Step 10. Assess result


If the Employee Ratio is smaller than 0.214, the response ratio is low (class label is No and the ratio is 85.7%).
If the Employee Ratio is 0.214 or bigger, the response ratio is high (class label is Yes and the ratio is 66.2%).

For bigger companies in Industry Category E, the response ratio depends on the Employee Ratio.
Step 10. Assess result
[Figure: tree branches for smaller companies. Node labels shown: Local Employee (<23, >=23), Age (<3, >=3), Age (<2, >=2), Sales Ratio (<0.0027, >=0.0027: Yes, N/A), Local Employee (<10, >=10).]
Step 10. Assess result

If the Sales Ratio is 0.27% (0.0027) or more, the response ratio is high (class label is Yes and the ratio is 98.2%).
A newly established company with a good sales ratio is likely to respond.
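The rules quoted in Step 10 can be written as a small function to show how they combine. This is a hand-simplified sketch and not the full tree of 41 non-terminal nodes. Two points are assumptions: that the Employee Ratio rule applies to companies with 23 or more local employees in Industry Category E, and that the Sales Ratio rule applies to small companies younger than 3 years. The function and parameter names are hypothetical.

```python
def predict_response(local_employees, industry_category=None,
                     employee_ratio=None, age=None, sales_ratio=None):
    """Return 'Yes' or 'No' using only the rules quoted in Step 10."""
    if local_employees >= 23:
        # Bigger companies respond (75.5%), except that in category E
        # the Employee Ratio decides (assumed placement in the tree).
        if industry_category == "E" and employee_ratio is not None:
            return "Yes" if employee_ratio >= 0.214 else "No"
        return "Yes"
    # Smaller companies mostly do not respond (72.9%), unless they are
    # young with a good sales ratio (assumed to mean age < 3).
    if (age is not None and age < 3
            and sales_ratio is not None and sales_ratio >= 0.0027):
        return "Yes"
    return "No"


print(predict_response(50))                          # Yes
print(predict_response(50, "E", employee_ratio=0.1)) # No
print(predict_response(5))                           # No
print(predict_response(5, age=1, sales_ratio=0.01))  # Yes
```

A rule set like this is easy to hand to a marketing team, but it ignores the other Industry Category branches, so it should not be expected to match the accuracy reported for the full tree.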
Conclusion


We used a decision tree to approach target marketing.
Knowing a company's industry category lets us get more information from this mining result.
Thank you for listening!