Improving Web Design Mining Web Data at SCMP.com
Improving the Web Design
Mining Web Data at Cityjob.com
Hing-Po Lo, Linda Lu, Miriam Chan
Department of Management Sciences
City University of Hong Kong, Hong Kong
I. Introduction
Data Mining
Customer Relationship Management
The Web
A. The Web
• More than 200 million surfers per day
• Huge volume of data captured from the Web
• Only 2% of web data analyzed
[Figure: Worldwide Internet Commerce Revenues: Business and Consumer Segments, 1996-2002. Vertical axis: US$B, 0 to 600. Horizontal axis: 1996 to 2002. Series: Consumer, Business-Business.]
B. Customer Relationship Management
• DOT COM companies
  • work in an “information-intensive” and “ultra-competitive” mode
  • require the use of CRM to establish a personalized relationship with their customers
C. Data Mining Tools
• Many software packages and web vendors can help to explore and mine web log files.
• Most study the “clickstream” at the “session level”. To conduct CRM, one has to analyze the web log file at the “customer level”.
• Tailor-made software using SAS macros and Enterprise Miner has been developed.
Cityjob.com
• It offers information on almost all posts available from major companies in HK.
• It receives several thousand visitors per day on average.
II. The Data
Study Period:
11 December 2000 to 4 February 2001
Three types of data files:
• Web log files;
• Subscribers’ profiles;
• Jobs’ profiles.
1. Web log files
#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-12-11 00:00:00
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs(Cookie)
2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 GET /default.asp - 200 0 15838 645 1297 RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000;+OPN=000;+CTY=091;+RDB=c80200000000000000020028311b1b0000000000000000;+ASPSESSIO
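As an illustration only (the authors used SAS macros, not Python), a log in this layout could be read by taking the column names from the `#Fields:` directive and splitting each request line on whitespace. The function name `parse_w3c_log` is hypothetical, and the cookie in the sample line is shortened from the one on the slide.

```python
# Minimal sketch of reading a W3C extended log; not the authors' SAS macro.
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-12-11 00:00:00
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken cs(Cookie)
2000-12-11 00:00:00 208.223.166.3 - W3SVC4 PROD5_WEB 202.130.170.225 GET /default.asp - 200 0 15838 645 1297 RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000
"""


def parse_w3c_log(text):
    """Yield one dict per request line, keyed by the names in #Fields."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Column names for all request lines that follow.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue  # other directives: #Software, #Version, #Date
        elif fields:
            values = line.split()
            # Keep only lines whose column count matches the directive.
            if len(values) == len(fields):
                yield dict(zip(fields, values))


records = list(parse_w3c_log(SAMPLE_LOG))
print(records[0]["c-ip"], records[0]["cs-uri-stem"], records[0]["sc-status"])
```

Lines with an unexpected number of columns are skipped rather than repaired, which is a simplifying choice for the sketch.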
2. Subscribers’ profiles
User ID       Age  Sex  Ed. level
cityjob94290  27   F    SEC
cityjob94293  26   M    DIP
cityjob94338  28   F    SEC
cityjob94345  34   M    UC
Cont’d
Ind  Reg. Date  Country  Marital Status  Em. Status  Occ.
HOT  20001030   HK       S               FT          CUS
BNK  20001030   HK       S               FT          FIN
OMF  20001030   HK       S               FT          ACC
DPT  20001030   HK       M               FT          MGT
P. income / H. income codes shown: 2, 8, 9
Interest codes shown: BANK, CNEWS, COMPU, ECON, ENTER, FIN, GAME, HKNEWS, INVEST, MKT, PROP
3. Jobs’ profiles
Job ID        Title                 Type  Work Exp.  Quali.  Industry  Level
cityjobB7200  ORG. MANAGER          IT    4          UC      BANK      MID
cityjobAVU10  EXECUTIVE OFFICER II  LEG   3          DIP     GOV       JUN
cityjobB7040  ASST. ACCOUNTANT      ACC   5          SEC     RET       PRO
cityjobB7530  SALES EXECUTIVE       SAL   4          UC      TDG       JUN
Web log files
Subscribers’ files
Jobs’ files
SAS macros were written to perform
the following tasks:
A: Reading the web log files
B: Cleaning the data files
C: Creating new variables
D: Merging the data files
E: Preparing different SAS data files
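The steps above were implemented as SAS macros, which the slides do not show. A rough Python sketch of the same idea follows. It assumes, without confirmation from the slides, that the `LOGIN` value inside the cookie identifies the subscriber, that cleaning means keeping only requests with status 200, and that the subscriber lookup is keyed by that login. All names and sample values are hypothetical.

```python
from datetime import datetime


def login_from_cookie(cookie):
    """Return the LOGIN value embedded in the cs(Cookie) field, or None."""
    for part in cookie.replace("+", "").split(";"):
        # One cookie may pack several key=value pairs joined by '&'.
        for pair in part.split("&"):
            key, _, value = pair.rpartition("=")
            if key.endswith("LOGIN"):
                return value
    return None


def build_dataset(log_rows, subscribers):
    """Clean log rows, derive new variables, and merge subscriber profiles."""
    merged = []
    for row in log_rows:
        if row["sc-status"] != "200":  # B: assumed cleaning rule
            continue
        stamp = datetime.strptime(
            row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S"
        )
        login = login_from_cookie(row["cs(Cookie)"])  # C: new variable
        profile = subscribers.get(login, {})          # D: merge by login
        merged.append(
            {"login": login, "hour": stamp.hour,
             "page": row["cs-uri-stem"], **profile}
        )
    return merged


# Hypothetical inputs: two parsed log rows and one subscriber profile.
log_rows = [
    {"date": "2000-12-11", "time": "00:00:00", "cs-uri-stem": "/default.asp",
     "sc-status": "200",
     "cs(Cookie)": "RMID=d0dfa603398e0850;+CityjobID=LASTUPD=20001130&LOGIN=sloo;+IND=000"},
    {"date": "2000-12-11", "time": "00:00:05", "cs-uri-stem": "/missing.asp",
     "sc-status": "404", "cs(Cookie)": "-"},
]
subscribers = {"sloo": {"age": 27, "sex": "F"}}

dataset = build_dataset(log_rows, subscribers)
print(dataset)
```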
Useful Summary Information
A. Subscribers’ profiles
B. Jobs’ profiles
C. Web log files
D. Web log files + User ID
E. Web log files + Job ID
[Figure: Relative Percentage of Count in Each Hour. Horizontal axis: Time, labelled every two hours. Vertical axis: Relative Percentage, 0% to 8%.]
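A chart of this kind can be produced by counting requests per hour and dividing by the total. The sketch below assumes the hour is taken from the `time` field of the web log; the function name and the four sample times are made up.

```python
from collections import Counter


def hourly_share(times):
    """Map each hour (0-23) to its share of all requests, in percent."""
    counts = Counter(int(t.split(":")[0]) for t in times)
    total = sum(counts.values())
    return {hour: 100.0 * counts.get(hour, 0) / total for hour in range(24)}


times = ["00:00:00", "00:15:10", "09:30:00", "14:05:45"]  # made-up sample
share = hourly_share(times)
for hour in (0, 9, 14):
    print(f"{hour:02d}:00  {share[hour]:.1f}%")
```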
The most popular jobs
Job ID        Title                                  Industry  Visit No.  Popularity Index
cityjobCM070  OFFICER - CORPORATE BANKING            BNK       7748       100.0
cityjobC8570  ADMINISTRATIVE ASSISTANT               GOV       6552       84.6
cityjobCDU20  EXECUTIVE TRAINEE INVESTMENT PRODUCTS  BNK       5148       64.9
cityjobCL580  CONTRACT HOUSING OFFICER               GOV       4944       63.8
cityjobCK570  EXECUTIVES FOR CORPORATE FINANCE       BNK       4664       60.2
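The slides do not define the Popularity Index. Four of the five rows match the visit count divided by the highest visit count, times 100, so that formula is assumed in the sketch below. For cityjobCDU20 the same formula gives 66.4 rather than the 64.9 shown, so the authors' definition or that row's visit count may differ.

```python
# Visit counts copied from the slide; the formula itself is an assumption.
visits = {
    "cityjobCM070": 7748,
    "cityjobC8570": 6552,
    "cityjobCDU20": 5148,
    "cityjobCL580": 4944,
    "cityjobCK570": 4664,
}


def popularity_index(visit_counts):
    """Scale each visit count so that the most visited job scores 100."""
    top = max(visit_counts.values())
    return {job: round(100.0 * n / top, 1) for job, n in visit_counts.items()}


index = popularity_index(visits)
for job, score in index.items():
    print(job, score)
```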
III. Collaborative Filtering
1. By Association Rules
• Whenever a visitor enquires about a particular job, we can “cross sell” similar jobs by recommending other jobs that have the highest association with the original one.
• The association is based on the click history of all the visitors to the Web site.
For example, if
• Job A: cityjobCF520:
Title: Assistant Accountant; Qualification: Diploma; Working experience: one year
then
• Job B: cityjobCF180:
Title: Assistant Accountant; Qualification: Diploma; Working experience: three years
• Job C: cityjobCF100:
Title: Assistant Accountant; Qualification: University/College; Working experience: not specified
• Job D: cityjobCEUJ0:
Title: Assistant Accountant; Qualification: Not specified; Working experience: two years
This group of 4 jobs has a
• Confidence Value of 50.3%: given that a visitor enquires about job A, the probability that the visitor also enquires about jobs B, C, and D is 0.503;
• Lift Value of 298.46: a visitor who has enquired about job A is almost 300 times more likely to enquire about jobs B, C, and D than a visitor chosen at random.
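Under the usual definitions, confidence is P(B, C, D | A) and lift is that confidence divided by P(B, C, D) among all visitors. With the figures above, this would imply that roughly 0.503 / 298.46, or about 0.17%, of all visitors enquired about B, C, and D. The sketch below computes both measures from per-visitor click sets; the eight visitors and the single-letter job labels are made up for illustration and are not the authors' data.

```python
def confidence_and_lift(baskets, antecedent, consequent):
    """Confidence and lift of the rule antecedent -> consequent.

    baskets: one set of viewed jobs per visitor.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= b)
    n_c = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    if n_a == 0 or n_c == 0:
        raise ValueError("rule is undefined: antecedent or consequent never occurs")
    confidence = n_both / n_a           # P(consequent | antecedent)
    lift = confidence / (n_c / n)       # relative to a random visitor
    return confidence, lift


# Made-up click histories for eight visitors.
baskets = [
    {"A", "B", "C", "D"}, {"A"}, {"E"}, {"F"},
    {"E", "F"}, {"G"}, {"G", "E"}, {"F", "G"},
]
confidence, lift = confidence_and_lift(baskets, {"A"}, {"B", "C", "D"})
print(confidence, lift)
```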
2. By Popularity Index
For example, if
• Job A: cityjobCDU20
Title: EXECUTIVE TRAINEE - INVESTMENT PRODUCTS, Type: FIN, Working Experience: 0, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 64.9.
then (with the same type, industry and qualification)
• Job B: cityjobCM470
Title: ASSOCIATE (TREASURY), Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 59.2.
• Job C: cityjobCM470
Title: ASSOCIATES (CRM), Type: FIN, Working Experience: 2, Qualification: UC, Industry: BNK, Level: JUN, Index of popularity: 44.6.
• Job D: cityjobCFLC0
Title: DEALER & INVESTOR ADVISOR, Type: FIN, Working Experience: 3, Qualification: UC, Industry: BNK, Level: PRO, Index of popularity: 36.6.
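The selection in this example can be sketched as a filter on type, industry and qualification followed by a sort on the index of popularity. The job records below are copied from the slide (including the job ID that appears twice), except for the last one, which is a hypothetical non-matching job added to show the filter. The function name and the `top_n` parameter are assumptions.

```python
jobs = [
    {"id": "cityjobCDU20", "title": "EXECUTIVE TRAINEE - INVESTMENT PRODUCTS",
     "type": "FIN", "quali": "UC", "industry": "BNK", "popularity": 64.9},
    {"id": "cityjobCM470", "title": "ASSOCIATE (TREASURY)",
     "type": "FIN", "quali": "UC", "industry": "BNK", "popularity": 59.2},
    {"id": "cityjobCM470", "title": "ASSOCIATES (CRM)",
     "type": "FIN", "quali": "UC", "industry": "BNK", "popularity": 44.6},
    {"id": "cityjobCFLC0", "title": "DEALER & INVESTOR ADVISOR",
     "type": "FIN", "quali": "UC", "industry": "BNK", "popularity": 36.6},
    # Hypothetical job that should not be recommended (different attributes).
    {"id": "hypothetical-1", "title": "HYPOTHETICAL CLERK",
     "type": "ADM", "quali": "SEC", "industry": "GOV", "popularity": 84.6},
]


def recommend(all_jobs, viewed, top_n=3):
    """Most popular jobs sharing type, industry and qualification with `viewed`."""
    keys = ("type", "industry", "quali")
    matches = [
        job for job in all_jobs
        if job is not viewed and all(job[k] == viewed[k] for k in keys)
    ]
    return sorted(matches, key=lambda job: job["popularity"], reverse=True)[:top_n]


recommended = recommend(jobs, jobs[0])
print([job["title"] for job in recommended])
```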
IV. Predictive Models
1. Churn (Attrition) model
To identify subscribers who are likely to stop visiting the Web site, so that Cityjob.com can take action to retain them. It is often less expensive to retain subscribers than to win them back.
2. Popular job model
What are the characteristics of jobs that would attract more
visitors? Are they related to their job type and job industry?
1. The Churn (Attrition) Model
• Sample: All subscribers of Cityjob.com.
• Dependent Variable: Visit = 1 if the subscriber has visited Cityjob.com during the study period; Visit = 0 otherwise.
• Factors used: Gender; Age; Educational Level; dummy variables for interest and country; no. of days since registration.
• Sampling procedure: Stratified sampling based on the variable “Visit” is used to obtain an equal number of observations from the two groups of subscribers (Visit = 1 and Visit = 0).
• Data partition: Training data 70%, Validation data 30%
• Lift Chart
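The sampling and partition steps were done in SAS Enterprise Miner. As a hedged illustration of the idea, the sketch below draws equal numbers of Visit = 1 and Visit = 0 subscribers and then splits the balanced sample 70/30. The function names, the fixed seed and the made-up subscriber list are all assumptions.

```python
import random


def balanced_sample(rows, label="visit", seed=0):
    """Draw the same number of rows from the label=1 and label=0 groups."""
    rng = random.Random(seed)
    ones = [r for r in rows if r[label] == 1]
    zeros = [r for r in rows if r[label] == 0]
    n = min(len(ones), len(zeros))
    sample = rng.sample(ones, n) + rng.sample(zeros, n)
    rng.shuffle(sample)
    return sample


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle and split rows into training and validation sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Made-up population: 30 subscribers who visited, 70 who did not.
subscribers = [{"id": i, "visit": 1 if i < 30 else 0} for i in range(100)]
sample = balanced_sample(subscribers)
train, valid = partition(sample)
print(len(sample), len(train), len(valid))
```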
Churn model (logistic regression)
Important factors:
1. No. of days since registration
2. Educational level
3. Gender
4. Whether the subscriber has an interest in computer games
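A lift chart for a model like this is typically built by ranking the validation observations by predicted probability and comparing the event rate in the top-ranked groups with the overall rate. The sketch below shows one common cumulative form; the scores and labels are made up, and the slides do not say which variant Enterprise Miner plotted.

```python
def cumulative_lift(scores, labels, bins=10):
    """Cumulative lift for the top 1/bins, 2/bins, ... of ranked observations."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    base_rate = sum(labels) / len(labels)
    lifts = []
    for k in range(1, bins + 1):
        top = ranked[: round(k * len(ranked) / bins)]
        rate = sum(label for _, label in top) / len(top)
        lifts.append(rate / base_rate)
    return lifts


# Made-up predicted probabilities and observed outcomes for ten subscribers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
lifts = cumulative_lift(scores, labels, bins=5)
print([round(x, 2) for x in lifts])
```

A lift above 1 in the first groups means the model concentrates the events among its highest-scored observations; the last value is always 1 because it covers the whole sample.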
2. The Popular Job Model
• Sample: All jobs advertised on Cityjob.com.
• Dependent Variable: Popular = 1 if the job has been visited at least 20 times; Popular = 0 otherwise.
• Factors used: Dummy variables for different job types, job industries, job level, qualification required, working experience.
• Data partition: Training data 70%, Validation data 30%
• Missing values: missing values for working experience and qualification required were replaced by 0 and 3 (Secondary school completed) respectively.
• Lift Chart
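The preparation described above can be sketched as follows: derive the Popular flag from the visit count, fill in the two missing-value defaults, and expand job type into dummy variables. The numeric coding of qualification (other than 3 for secondary school), the list of type codes and all names are assumptions made for this illustration.

```python
JOB_TYPES = ["ACC", "IT", "LEG", "SAL"]  # type codes seen on the jobs' profiles slide


def encode_job(job, visit_count, job_types=JOB_TYPES):
    """Build one model row: target, imputed inputs and type dummies."""
    exp = job.get("exp")
    quali = job.get("quali")
    row = {
        "popular": int(visit_count >= 20),     # visited at least 20 times
        "exp": 0 if exp is None else exp,      # missing experience -> 0
        "quali": 3 if quali is None else quali,  # missing qualification -> 3
    }
    for code in job_types:
        row["type_" + code] = int(job.get("type") == code)
    return row


# Hypothetical job with both values missing, visited 25 times.
row = encode_job({"type": "SAL", "exp": None, "quali": None}, visit_count=25)
print(row)
```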
Popular job model (logistic regression)
Important factors:
1. Higher qualification (more likely)
2. Higher level (more likely)
3. Job industries: accounting, banking, building, construction (more likely)
4. Job types: art/design/creative, engineering, sales (less likely)
V. Recommendations
1. Web Design
a. To develop a collaborative filtering system
b. To include a popularity index
2. Marketing Strategies
a. To develop appropriate marketing strategies for customer retention
b. To develop Cityjob.com’s own web monitor system
VI. Unexpected Discovery
There was a user who came every day during the study period at exactly the same time (4:00 a.m. HK time) and stayed for one to three hours, browsing more than 500 pages each time (average 5 sec. per page).
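A pattern like this could be flagged automatically from per-visit summaries. The sketch below uses the figures quoted above (more than 500 pages, about 5 seconds per page) as thresholds; treating them as cut-offs is an assumption, as are the function name and the two sample visits.

```python
from datetime import datetime


def flag_automated(visits, min_pages=500, max_seconds_per_page=5.0):
    """Return users whose visits look machine-driven (many pages, very fast)."""
    flagged = []
    for visit in visits:
        seconds = (visit["end"] - visit["start"]).total_seconds()
        if (visit["pages"] > min_pages
                and seconds / visit["pages"] <= max_seconds_per_page):
            flagged.append(visit["user"])
    return flagged


# Made-up visit summaries: one fast bulk visit and one ordinary visit.
visits = [
    {"user": "u1", "pages": 720,
     "start": datetime(2000, 12, 11, 4, 0), "end": datetime(2000, 12, 11, 5, 0)},
    {"user": "u2", "pages": 12,
     "start": datetime(2000, 12, 11, 10, 0), "end": datetime(2000, 12, 11, 10, 20)},
]
flagged = flag_automated(visits)
print(flagged)
```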
The End