DATA MINING FINAL REPORT
Vipin Saini M964011062
許博淞 M964020009
陳昀志 M964020043
Outline
Introduction
DM Methodology(Step1~Step3)
DM Methodology(Step4~Step8)
DM Methodology(Step9~Step10)
Conclusion
Introduction
• Direct marketing
• Response rate
• Telecommunications company
• Publicly available business data
• Addition of random companies
Step2-Records
Some characteristics about each prospect
Number of employees at a particular office
Number of employees for the entire company
Annual sales (in thousands) at a particular office
Annual sales (in thousands) for the entire company
Whether or not the company does business outside the United States
Annual advertising expense
Whether the company has moved recently or is a new business
The type of ownership
Specific industry code
General industry code
Age of the company (in years)
Step3-Data Type
Correcting the data types.
Make sure "Buyer" is the type Yes/No.
Change the type of Age to integer.
Make sure the "International" type is string or
Boolean.
Change "Local Employees" to integer.
Change "Local Sales" to integer.
Change "Industry Type" to categorical.
Change "Total Employees" to integer.
Change "Total Sales" to integer.
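The type corrections above can be sketched in plain Python. This is only a minimal sketch, not the PolyAnalyst workflow used for the report: the helper name `coerce_record`, the accepted Yes/No spellings, and the choice to turn blank cells into `None` are all assumptions.

```python
# Hypothetical helper: convert one raw CSV row (all strings) to the types listed above.
INT_COLUMNS = ["Age", "Local Employees", "Local Sales", "Total Employees", "Total Sales"]
YES_NO = {"yes": True, "y": True, "no": False, "n": False}  # assumed spellings


def coerce_record(raw):
    """Return a copy of `raw` with corrected types; blank cells become None."""
    rec = dict(raw)
    for col in INT_COLUMNS:
        text = (raw.get(col) or "").strip()
        # int(float(...)) also accepts values exported as "12.0"
        rec[col] = int(float(text)) if text else None
    for col in ("Buyer", "International"):
        text = (raw.get(col) or "").strip().lower()
        rec[col] = YES_NO.get(text)  # None when missing or unrecognized
    # Categorical: keep as a stripped label, never as a number
    rec["Industry Type"] = (raw.get("Industry Type") or "").strip() or None
    return rec


# Made-up example row, not taken from the report's data
row = {"Buyer": "Yes", "Age": "12", "International": "No", "Local Employees": "40",
       "Local Sales": "1500", "Industry Type": " E ", "Total Employees": "200",
       "Total Sales": ""}
print(coerce_record(row))
```

Here "International" is stored as a Boolean; keeping it as a string would match the slide equally well.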
Step4: Create a Model Set
Employee counts and sales figures differ with the size of the company, so all of these characteristics mostly paint a picture of company size.
Derived variables: Employee Ratio, Sales Ratio, Productivity Ratio
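The slides name the three ratios but do not give their formulas. The sketch below assumes Employee Ratio = Local Employees / Total Employees, Sales Ratio = Local Sales / Total Sales, and Productivity Ratio = Total Sales / Total Employees. These definitions, and the helper names, are assumptions made for illustration only.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None (treated as N/A) when a value is missing or the divisor is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def add_ratios(rec):
    """Return a copy of `rec` with three size-independent ratios (assumed definitions)."""
    out = dict(rec)
    out["Employee Ratio"] = safe_ratio(rec["Local Employees"], rec["Total Employees"])
    out["Sales Ratio"] = safe_ratio(rec["Local Sales"], rec["Total Sales"])
    out["Productivity Ratio"] = safe_ratio(rec["Total Sales"], rec["Total Employees"])
    return out


# Made-up example values, not taken from the report's data
example = {"Local Employees": 30, "Total Employees": 120,
           "Local Sales": 500, "Total Sales": 20000}
print(add_ratios(example))
```

Returning `None` for an undefined ratio keeps an explicit N/A value available, which is one way a "Sales Ratio = N/A" branch could arise in a tree.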
Step4: Create a Model Set
With our newly applied rules, the World dataset
now has redundant columns.
Step5: Fix Problems with the Data
Categorical variables with too many values
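The slides do not say how this problem was fixed. One common approach, shown here purely as an illustration, is to collapse infrequent values into a single bucket. The function name `collapse_rare`, the `min_count` threshold, the "OTHER" label, and the sample codes are all hypothetical.

```python
from collections import Counter


def collapse_rare(values, min_count=50, other_label="OTHER"):
    """Replace every value seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up industry codes, not taken from the report's data
codes = ["4813"] * 3 + ["7372"] * 2 + ["0111"]
print(collapse_rare(codes, min_count=2))
```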
Step6: Transform the data
Create a training set and a testing set.
Total records: 13117
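A random half-and-half split of the 13117 records can be sketched as follows. The function name `split_half` and the fixed seed are assumptions; the report itself does not describe how the random selection was carried out.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of `records` and cut it into two halves (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


# Record indices stand in for the real rows
train, test = split_half(range(13117))
print(len(train), len(test))  # 6558 6559
```

Fixing the seed makes the split repeatable, so the same training and testing sets can be rebuilt later.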
Step7: Build Model
We used PolyAnalyst 5.0 to mine the data.
Step7: Build Model
We used our edited MarketData.CSV file as the source. After the software filtered out the missing values, we obtained the decision tree.
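To show what a single tree split does, here is a small sketch that searches for the best threshold on one numeric column. It uses Gini impurity, which is a swapped-in criterion for illustration: the slides do not describe the splitting criterion PolyAnalyst applies. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 'Yes'/'No' labels."""
    if not labels:
        return 0.0
    p_yes = sum(1 for y in labels if y == "Yes") / len(labels)
    return 2 * p_yes * (1 - p_yes)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value < threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary between equal values
        threshold = pairs[i][0]
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data, not the report's records
local_employees = [5, 8, 12, 20, 23, 30, 45, 60]
buyer = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_threshold(local_employees, buyer))  # (23, 0.0)
```

A full tree repeats this search on every column and then recurses into each side of the chosen split.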
The Decision Tree
[Figure: decision tree diagram. The root splits on Local Employee (< 23 / >= 23). Lower nodes split on Age (< 3 / >= 3, < 2 / >= 2), Sales Ratio (< 0.0027 / >= 0.0027 / N/A), Local Employee (< 10 / >= 10), Industry Category (A to I), and Employee Ratio (threshold 0.214).]
The Decision Tree
We made a decision tree with:
Number of non-terminal nodes: 41
Number of leaves: 91
Depth of the tree: 8
Step 8: Assess model
• Result of the decision tree on the training set:
Real/predict      No     Yes
No              3018     379
Yes              528    2535
undefined         49      49
Total classification error: 14.04%
Classification accuracy: 85.96%
Classification error for class No: 14.89%
Classification error for class Yes: 13.01%
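The four percentages above can be reproduced from the training-set counts. They match when the undefined counts are left out and each class error is computed down its own column, for example 528 / (3018 + 528) for class No. That reading of the table is an inference from the numbers, and the function name `assess` is hypothetical.

```python
def assess(rows):
    """`rows` maps a row label to its (No column, Yes column) counts.

    The 'undefined' row is deliberately left out, matching the reported figures.
    """
    no_no, no_yes = rows["No"]
    yes_no, yes_yes = rows["Yes"]
    classified = no_no + no_yes + yes_no + yes_yes
    return {
        "total_error": (no_yes + yes_no) / classified,
        "accuracy": (no_no + yes_yes) / classified,
        "error_no": yes_no / (no_no + yes_no),     # off-diagonal share of the No column
        "error_yes": no_yes / (no_yes + yes_yes),  # off-diagonal share of the Yes column
    }


training = {"No": (3018, 379), "Yes": (528, 2535)}
for name, value in assess(training).items():
    print(f"{name}: {value:.2%}")
```

This prints 14.04%, 85.96%, 14.89% and 13.01%, in that order.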
Step 8:Assess model
With this model, targeting the top 40% of the records captures 80% of the actual responses.
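This kind of statement is usually read off a cumulative gains chart: rank the records by model score, take the top fraction, and count how many of the actual responders fall inside it. The sketch below assumes that reading. The function name `cumulative_gain` and the ten scores are made up for illustration.

```python
def cumulative_gain(scores, responded, fraction):
    """Share of all responders found in the top `fraction` of records ranked by score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    top_n = round(len(ranked) * fraction)
    captured = sum(1 for _, r in ranked[:top_n] if r)
    total = sum(1 for r in responded if r)
    return captured / total if total else 0.0


# Made-up scores for 10 prospects; True marks an actual responder
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
responded = [True, True, True, True, False, True, False, False, False, False]
print(cumulative_gain(scores, responded, 0.40))  # 0.8
```

In this toy example the top 4 of 10 prospects contain 4 of the 5 responders, which is the same 40% to 80% pattern.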
Step 9. Deploy models
The testing set is a randomly selected 50% of the records from the whole dataset.
Real/predict      No     Yes
No              3074     396
Yes              610    2395
undefined         45      39
Total classification error: 15.54%
Classification accuracy: 84.46%
Classification error for class No: 16.56%
Classification error for class Yes: 14.19%
Step 10. Assess result
[Figure: top level of the decision tree. The root splits on Local Employee: < 23 is labeled No, >= 23 is labeled Yes.]
Step 10. Assess result
Companies with 23 or more local employees are more likely to respond (class label Yes, ratio 75.5%).
A bigger company with more employees tends to respond.
Companies with fewer than 23 local employees are likely not to respond (class label No, ratio 72.9%).
A small company tends not to respond.
Step 10. Assess result
[Figure: decision tree branch. The root splits on Local Employee (< 23 / >= 23), and the tree then splits on Industry Category (A to I). An Employee Ratio split follows: < 0.214 is labeled No, >= 0.214 is labeled Yes.]
Step 10. Assess result
If the Employee Ratio is smaller than 0.214, the response ratio is low (class label No, ratio 85.7%).
If the Employee Ratio is 0.214 or bigger, the response ratio is high (class label Yes, ratio 66.2%).
For bigger companies in Industry Category E, the response ratio depends on the Employee Ratio.
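The rules read off the tree so far can be written as a short function. This is a simplification that covers only the splits discussed on these slides, not the full tree with 91 leaves. The function name `predict_response` and its argument names are hypothetical.

```python
def predict_response(local_employees, industry_category=None, employee_ratio=None):
    """Return 'Yes' or 'No' using only the splits described on these slides."""
    if local_employees < 23:
        return "No"   # reported ratio: 72.9% No
    if industry_category == "E" and employee_ratio is not None:
        # Refinement reported for bigger companies in Industry Category E
        return "No" if employee_ratio < 0.214 else "Yes"
    return "Yes"      # reported ratio: 75.5% Yes


print(predict_response(10))             # No
print(predict_response(40, "E", 0.10))  # No
print(predict_response(40, "E", 0.50))  # Yes
```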
Step 10. Assess result
[Figure: decision tree branch. The root splits on Local Employee (< 23 / >= 23). Further nodes split on Age (< 3 / >= 3, < 2 / >= 2), Sales Ratio (< 0.0027 / >= 0.0027 / N/A), and Local Employee (< 10 / >= 10). One leaf in this branch is labeled Yes.]
Step 10. Assess result
If the Sales Ratio is 0.27% or more, the response ratio is high (class label Yes, ratio 98.2%).
A newly started company with a good sales ratio tends to respond.
Conclusion
We used a decision tree to approach target marketing.
Knowing a company's industry category, we can get more information from this mining result.
Thanks for listening!