Agenda - Gerstein Lab

Data Mining At Tech Journal
Agenda
Background
Questions of Interest
Data Overview
Selected Approach
Potential Issues
Current Status
First Results
The Company
• A US company (“TechJournal”) publishes an on-line journal
(“TechPub”) with content specifically aimed at IT professionals
• TechJournal is 15 years old; TechPub is 5 years old
• Content for TechPub comes from three sources:
– Aggregated content from public sources
– TechJournal-created content
– Peer-contributed content
• TechJournal’s core business is to produce a high-end list product
for the marketing departments of IT manufacturers
The Journal
• The content on the publication website is available to both
anonymous and registered users
• Registered users get access to some premium services as well
• Most content is free; some whitepapers are for sale
• Three unique features of the site:
– Peer-contributed content
– Auction system: readers get paid to contribute content
– New: personalized content for each reader
The Readers
• Target: IT professionals involved in their organization’s technology
purchasing decisions
• Different levels of “readership” (figure axis: number of individuals):
– E-mail recipients
– Anonymous visits (visited site)
– Repeat visitor
– Registered light reader
– Registered heavy reader
• The company continuously tries to stimulate new readership
through e-mail campaigns
The Business Model
[Figure: loop diagram of the business model]
Diagram elements:
• Company Resources For Reinvestment
• List Value To Technology Manufacturers
• New Readers: Company Prospecting
• New Readers: Reader Word Of Mouth
• Gathering New Content
• Tuning of Content
• TechPub Reader Activity
• Quality Of List Product
• Knowledge of Readers' Interests
• Total Readers
Loops shown:
• “Active Readers Produce Better Lists” Loop
• “Success Breeds Success” Loop
• “Known Readers Make For Better Journal” Loop
• “Buzz Marketing” Loop
Focal Areas For Data Mining
[Figure: the business model loop diagram, annotated with the questions below]
“Success Breeds Success” Loop
• Given email recipient attributes, what is the likelihood of a visit to the website?
• Which content headlines would maximize that visit likelihood?
“Known Readers Make For Better Journal” Loop
• Given registered readers’ attributes, which stories will they be interested in?
• Given past stories read, what is a registered reader most likely to also read?
• Given registered readers’ attributes, which will be most active?
“Active Readers Produce Better Lists” Loop
• Is TechJournal’s current content taxonomy effective, or
would some other content taxonomy be more useful?
The Data
My “Chunk of Data” to Mine:
• Issues Table: 713,110 records
• Issues - Content Linker Table: 2,185,664 records
• Content Items Table: 590 records
• Page Visit Table: 43,580 records
• Recipients Table: 195,455 records
• Taxonomy Click Table: 9,385 records
Attributes to Work With
[Figure: attribute diagram]
Legend: Primary Key; Data Mining Attributes (features that can be used directly, or derived from, for classification)
Attribute groups shown: Reader Attributes, Content Attributes, Format Attributes
Attributes, in slide order:
• Recipient ID, IP Address
• Content ID, Title, Abstract, Headline Main, Content Type, Media Type, Author, Content Taxonomy, Click Rate
• City, State, Country, Zip, Phone
• Format Attributes: Issue ID, Template Type, Media Type (HTML or Video)
• IT Budget, Employees, Sales, SIC Code, Industry
• Time Sent, Time Opened, Time of Visit, Time Content Click
Creating Content Classes
TechJournal’s current taxonomy for classifying content:
9,750 visits spread over 31 classes
• Manually derived
• Aggregation of other credible taxonomy fragments
• From a content provider point of view
• Goes out to 21 levels in some cases, others as shallow as three
Level   Classes
1       1
2       5
3       46
4       798
5       1909
...     ...
21      5000 +
#Visits   ContentClass
2925      |Software|Business
2736      |Hardware|Storage
1187      |Software|Operating Systems
670       |Hardware|Networking
314       |Software|Software Development
282       |Hardware|Computers
278       |Industries|News
131       |Hardware|Telecom
118       |Industries|IT Management
97        |Hardware|Mobile Devices
75        |Online|Search
53        |Online|Portal
42        |Hardware|Printers
40        |Software|Consumer
38        |Industries|PCs
36        |Industries|Legal
32        |Hardware|Power
28        |Software|Networking
21        |Hardware|News
13        |Industries|Standards
8         |Hardware
8         |Industries|Hacking
7         |Online|News
4         |Online|Software as a Service
4         |Hardware|Chips
4         |Services|Disaster Recovery
3         |Online|Email
3         |Online|IM
2         |Services|Security
1         |Hardware|Software
1         |Services|Software Development
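The visit counts listed on this slide are concentrated in a handful of classes, which matters later for classification. A minimal Python sketch of that arithmetic follows. It uses only the counts listed on the slide, and the variable names are illustrative rather than taken from the original analysis.

```python
# Visit counts per content class, in the order listed on the slide.
visits = [2925, 2736, 1187, 670, 314, 282, 278, 131, 118, 97, 75, 53, 42,
          40, 38, 36, 32, 28, 21, 13, 8, 8, 7, 4, 4, 4, 3, 3, 2, 1, 1]

total = sum(visits)                      # total visits across the listed classes
top3_share = sum(visits[:3]) / total     # share held by the three largest classes
rare = sum(1 for v in visits if v < 10)  # number of classes with fewer than 10 visits

print(f"{len(visits)} classes, {total} visits")
print(f"top 3 classes: {top3_share:.1%} of visits")
print(f"classes with fewer than 10 visits: {rare}")
```

The listed counts sum to 9,161 rather than the 9,750 quoted on the slide, so the shares computed here are relative to the listed counts only.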
A Variety of Approaches
PREDICTIVE MODELING
• Given email recipient attributes, what is the likelihood of a visit to the website?
• Which content headline would maximize that visit likelihood?
• Given registered readers’ attributes, which readers will be most active?
• Given registered readers’ attributes, which types of content will they read?
CLUSTER ANALYSIS
• Is TechJournal’s current content taxonomy effective, or would some other taxonomy be more useful?
ASSOCIATION ANALYSIS
• Given past stories read, what is a registered reader most likely to also read?
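One simple way to frame the association question is rule confidence: among readers who read story A, what fraction also read story B? The sketch below is a minimal illustration under that assumption. The reading histories, story IDs, and function names are hypothetical and do not come from the TechJournal data.

```python
from collections import Counter
from itertools import permutations

# Hypothetical reading histories: reader ID -> set of story IDs read.
histories = {
    "r1": {"s1", "s2", "s3"},
    "r2": {"s1", "s2"},
    "r3": {"s2", "s3"},
    "r4": {"s1", "s2", "s4"},
}

def rule_confidence(histories):
    """Return confidence(A -> B) = readers of both A and B / readers of A."""
    item_counts = Counter()
    pair_counts = Counter()
    for stories in histories.values():
        item_counts.update(stories)
        # Ordered pairs (A, B) with A != B that co-occur for this reader.
        pair_counts.update(permutations(sorted(stories), 2))
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}

def also_read(story, histories, top=3):
    """Rank other stories by confidence, given that `story` was read."""
    conf = rule_confidence(histories)
    ranked = [(b, c) for (a, b), c in conf.items() if a == story]
    return sorted(ranked, key=lambda bc: (-bc[1], bc[0]))[:top]

result = also_read("s1", histories)
print(result)
```

In this toy data every reader of s1 also read s2, so s2 ranks first. A real analysis would also need a minimum-support threshold so that rarely read stories do not dominate the ranking.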
Potential Issues
• Database evolution produces noisy, dirty, unevenly populated data
• Data comes from multiple sources; producing consistent data has been a challenge
• Still not clear if we will end up with enough data to see anything meaningful
• Content taxonomy is relatively new; most likely has real problems with how it’s structured
• Taxonomy measures article subject matter, but behavior-stimulating content may be in headlines
• Features are somewhat related: Industry, Size, Sales, Employees, Title, Location
• Features have a high number of discrete values and need to be put into meaningful groupings
• Under-representation of several feature and class values
Feature Grouping - Location
[Figure: location groups labeled 1 through 11, plus “Other”]
Feature Grouping - Title
• Start with ~ 1000 distinct self-reported Titles in the Database
• Most interested in Title as it correlates with impact and influence on IT buying decisions
• Reclassify them based on three concepts: Seniority, Function, Employees in Company
Result: 24 Categories
[Figure: title hierarchy. Labels: Owner, Chairman/CEO; Manager of Managers; Manager of Doer; Doer; Assistant; Functional Area 1 through Functional Area N (Functional Area 10 shown). Codes shown: 1; 2, 20 - 29; 3, 30 - 39; 4]
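The reclassification described on this slide could be approximated with keyword rules. The sketch below is one hedged possibility, not the method actually used: the rule lists, category names, and the company-size cutoff are all hypothetical, since the slides do not list the 24 categories or how titles were mapped to them.

```python
import re

# Hypothetical keyword rules for illustration only; the slides do not list
# the actual 24 categories or the mapping from titles to categories.
SENIORITY_RULES = [
    ("owner_or_chief", r"\b(owner|chairman|ceo|cio|cto|chief)\b"),
    ("manager_of_managers", r"\b(vp|vice president|director)\b"),
    ("manager_of_doers", r"\b(manager|supervisor)\b"),
    ("assistant", r"\b(assistant|intern)\b"),
]
FUNCTION_RULES = [
    ("it", r"\b(it|network|systems?|software|security|cio|cto)\b"),
    ("finance", r"\b(finance|accounting|cfo)\b"),
]

def first_match(rules, text, default):
    """Return the label of the first rule whose pattern occurs in text."""
    for label, pattern in rules:
        if re.search(pattern, text):
            return label
    return default

def size_band(employees, cutoff=100):
    """Hypothetical two-way company-size split; the cutoff is an assumption."""
    return "small" if employees < cutoff else "large"

def group_title(raw_title, employees):
    """Map a self-reported title to (seniority, function, company size)."""
    text = raw_title.lower().strip()
    seniority = first_match(SENIORITY_RULES, text, "doer")
    function = first_match(FUNCTION_RULES, text, "other")
    return seniority, function, size_band(employees)

for title, employees in [("Director of IT", 2500), ("Network Engineer", 40), ("Owner", 5)]:
    print(title, "->", group_title(title, employees))
```

Rule order matters: the first matching pattern wins, so more senior keywords are listed first. Titles that match nothing fall back to "doer" and "other", which would need manual review in practice.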
Where I Am In The Process
Problem Definition -> Data Gathering -> Data Prep -> Data Mining -> Results Analysis -> Visualization -> Sum Up Insights
First Results
Q: Given registered readers’ attributes, which readers will be most active?
Method: Decision Tree Induction – Training Set 599 Records, Test Set 187 Records
[Figure: decision tree; highlighted leaves: 0.1429 (n = 7) and 0.7037 (n = 27)]
MSE on Training Set = .1313
MSE on Test Set = .1451
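The two leaf values shown equal 1/7 and 19/27, which would be consistent with a 0/1 "active reader" target where each leaf predicts its fraction of active readers. That reading is an assumption; the slide does not state how the target was coded. Under it, MSE is the mean squared gap between each 0/1 label and the predicted fraction. A minimal Python sketch with made-up labels:

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical leaf with 7 readers, 1 of them active (label 1).
labels = [1, 0, 0, 0, 0, 0, 0]
leaf_value = sum(labels) / len(labels)   # fraction active in the leaf, 1/7
preds = [leaf_value] * len(labels)       # a tree predicts the leaf mean for every member

print(round(leaf_value, 4))              # 0.1429
print(round(mse(labels, preds), 4))      # 0.1224
```

For a single leaf with active fraction p, this works out to p(1 - p). The reported training and test MSE would be the same average taken over all records in each set.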
First Results
Q: Given the attributes of a registered reader, which content types will they read?
Method: Decision Tree Induction
n= 786
node), split, n, deviance, yval
      * denotes terminal node
1) root 786 223508.000 29.44402
  2) LocGrpID< 1.5 96 23784.990 24.01042
    4) RIC>=70.5 53 10433.890 19.66038 *
    5) RIC< 70.5 43 11112.050 29.37209
      10) RIC< 66 33 8432.545 25.27273 *
      11) RIC>=66 10 294.900 42.90000 *
  3) LocGrpID>=1.5 690 196494.400 30.20000
    6) RIC< 71.5 438 127844.900 28.34475
      12) RIC>=14.5 411 120569.000 27.69586 *
      13) RIC< 14.5 27 4468.667 38.22222 *
    7) RIC>=71.5 252 64521.570 33.42460
      14) Title_Code>=38 20 4712.950 20.45000 *
      15) Title_Code< 38 232 56151.570 34.54310 *
[Figure: decision tree; highlighted leaves: 20.45 (n = 20) and 34.54 (n = 232), matching nodes 14 and 15 in the printout]
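The printed tree can be read as a plain prediction function: follow the splits from the root and return the yval of the terminal node reached. The Python sketch below transcribes the splits and leaf values exactly as printed. The function and argument names are illustrative, and nothing here explains what RIC or the yval scale mean, since the slide does not say.

```python
def predict(loc_grp_id, ric, title_code):
    """Return the leaf value (yval) from the printed tree for one reader."""
    if loc_grp_id < 1.5:          # node 2
        if ric >= 70.5:           # node 4 (leaf)
            return 19.66038
        if ric < 66:              # node 10 (leaf)
            return 25.27273
        return 42.90000           # node 11 (leaf)
    if ric < 71.5:                # node 6
        if ric >= 14.5:           # node 12 (leaf)
            return 27.69586
        return 38.22222           # node 13 (leaf)
    if title_code >= 38:          # node 14 (leaf)
        return 20.45000
    return 34.54310               # node 15 (leaf)

# (n, yval) for each terminal node, copied from the printout.
leaves = [(53, 19.66038), (33, 25.27273), (10, 42.90000), (411, 27.69586),
          (27, 38.22222), (20, 20.45000), (232, 34.54310)]
n_total = sum(n for n, _ in leaves)
weighted_mean = sum(n * y for n, y in leaves) / n_total

print(n_total)                    # 786
print(round(weighted_mean, 3))    # 29.444
```

As a consistency check, the leaf sizes sum to the 786 records at the root, and the size-weighted mean of the leaf values reproduces the root yval of 29.44402.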
First Results
Q: Given registered reader attributes, which types of content will they read?
Method: Kernel SVM with Gaussian Kernel
Confusion matrix (rows: true class ID; columns: predicted class ID)
True   Pred 2   Pred 6   Pred 12   Pred 16   Pred 42   Pred 45   Pred 46   % In Class Pred
0      0        0        0         0         3         0         0         0%
1      0        0        0         0         0         2         0         0%
2      2        0        3         0         21        19        0         4%
5      0        0        0         0         5         6         0         0%
6      0        15       9         1         29        20        0         20%
7      0        0        0         0         0         3         0         0%
9      0        0        0         0         2         3         0         0%
10     0        1        1         0         1         4         0         0%
12     0        2        33        0         34        18        2         37%
13     0        1        1         0         3         10        0         0%
16     0        0        0         5         5         10        0         25%
17     0        3        1         0         1         0         0         0%
18     0        0        5         0         17        16        0         0%
20     0        0        0         0         1         2         0         0%
24     0        0        0         0         0         2         0         0%
25     0        0        0         0         0         1         0         0%
27     0        0        1         0         5         2         0         0%
30     0        0        2         0         4         5         0         0%
33     0        0        0         0         1         0         0         0%
42     0        1        12        0         151       42        1         73%
43     0        0        1         0         0         1         0         0%
44     0        0        3         0         1         3         0         0%
45     0        1        7         0         44        126       0         71%
46     1        1        4         0         9         28        6         12%
% Predictions Were Accurate:
       67%      60%      40%       83%       45%       39%       67%
Overall Training Error = .569975
Class ID legend (as shown on the slide):
15 |Industries|Hacking
16 |Industries|IT Management
17 |Industries|Legal
18 |Industries|News
20 |Industries|PCs
21 |Industries|Standards
24 |Online|Email
25 |Online|IM
26 |Online|News
27 |Online|Portal
30 |Online|Search
33 |Online|Software as a Service
37 |Services|Security
42 |Software|Business
43 |Software|Consumer
44 |Software|Networking
45 |Software|Operating Systems
46 |Software|Software Development
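The percentages and the overall training error on this slide can be recomputed from the confusion matrix counts. The Python sketch below does so, with the counts transcribed from the slide (rows are true class IDs; columns are the seven predicted class IDs). The variable names are illustrative.

```python
# Columns: predicted class IDs, in slide order.
pred_classes = [2, 6, 12, 16, 42, 45, 46]

# Rows: true class ID -> counts per predicted class, transcribed from the slide.
confusion = {
    0:  [0, 0, 0, 0, 3, 0, 0],     1:  [0, 0, 0, 0, 0, 2, 0],
    2:  [2, 0, 3, 0, 21, 19, 0],   5:  [0, 0, 0, 0, 5, 6, 0],
    6:  [0, 15, 9, 1, 29, 20, 0],  7:  [0, 0, 0, 0, 0, 3, 0],
    9:  [0, 0, 0, 0, 2, 3, 0],     10: [0, 1, 1, 0, 1, 4, 0],
    12: [0, 2, 33, 0, 34, 18, 2],  13: [0, 1, 1, 0, 3, 10, 0],
    16: [0, 0, 0, 5, 5, 10, 0],    17: [0, 3, 1, 0, 1, 0, 0],
    18: [0, 0, 5, 0, 17, 16, 0],   20: [0, 0, 0, 0, 1, 2, 0],
    24: [0, 0, 0, 0, 0, 2, 0],     25: [0, 0, 0, 0, 0, 1, 0],
    27: [0, 0, 1, 0, 5, 2, 0],     30: [0, 0, 2, 0, 4, 5, 0],
    33: [0, 0, 0, 0, 1, 0, 0],     42: [0, 1, 12, 0, 151, 42, 1],
    43: [0, 0, 1, 0, 0, 1, 0],     44: [0, 0, 3, 0, 1, 3, 0],
    45: [0, 1, 7, 0, 44, 126, 0],  46: [1, 1, 4, 0, 9, 28, 6],
}

total = sum(sum(row) for row in confusion.values())
# Correct predictions sit where the true class equals the predicted class.
correct = sum(confusion[c][j] for j, c in enumerate(pred_classes))
error = 1 - correct / total

# Share of each predicted class that was right ("% Predictions Were Accurate").
col_sums = [sum(row[j] for row in confusion.values()) for j in range(len(pred_classes))]
precision = {c: confusion[c][j] / col_sums[j] for j, c in enumerate(pred_classes)}

# Share of each true class that was predicted correctly ("% In Class Pred").
recall = {c: confusion[c][j] / sum(confusion[c]) for j, c in enumerate(pred_classes)}

print(total, correct, round(error, 6))
print({c: f"{p:.0%}" for c, p in precision.items()})
print({c: f"{r:.0%}" for c, r in recall.items()})
```

The model only ever predicts seven of the 24 classes, so every other class has 0% of its members predicted correctly. Most predictions fall in classes 42 and 45, which are among the most visited classes.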
Defining Project Success
Success for this project could come in different forms:
• Insights gained on any of the six questions within
the project’s scope;
- and/or –
• Insight into how TechJournal should modify its
data capture policies to facilitate data mining for the
answers to these questions in the future
Questions/Comments