LinkSelector: A Web Mining Approach to Hyperlink Selection
Download
Report
Transcript LinkSelector: A Web Mining Approach to Hyperlink Selection
LinkSelector: A Web Mining Approach to
Hyperlink Selection for Web Portals
Xiao Fang
University of Arizona
10/18/2002
Agenda
Introduction
Related work
Problem definition -- Hyperlink Selection
Solution -- LinkSelector
Evaluation
Contributions, limitations and future
work
2
Introduction
Size of WWW (Lawrence and Giles,1999)
800 million web pages
1 million pages added daily
How to find information on the Web
Using search engines (best coverage 38.3%)
(Lawrence and Giles,1999)
Clicking on hyperlinks
3
Introduction
Web Page 1
Web Page 2
Product Category List
A
B
C
D
E
F
Click on A
Product Category A
Product List
A1
A2
A3
A4
A5
Click on A2
Web Page 3
Product A2
Price: 1000
Detailed description
B2
4
Introduction
Portal page: is the entrance to a
website.
Portal page
Homepage of a website
Default web portal (e.g.,My Yahoo!)
Most My Yahoo! users never customize their
default web portals (Manber et al., 2000).
5
Introduction
Hyperlinks in a portal page are selected
from a hyperlink pool.
A hyperlink pool is a set of hyperlinks
pointing to top-level web pages, e.g.,
hyperlink in a site index page.
6
Portal
page
7
Hyperlink
pool
8
Portal
page
9
Hyperlink
pool
10
Introduction
Number of hyperlinks in a portal page:
several dozens (e.g., 32 in the Arizona
Home page).
Number of hyperlinks in a hyperlink
pool: several hundreds (e.g., 743 in the
Arizona Index page).
11
Introduction
It is too computational expensive to do
an exhaustive search (e.g., C 1.44E 56 ).
32
743
Current practice of hyperlink selection –
expert selection
Only reflect domain experts’ perspectives
Subjective
12
Introduction
Our approach is based on
web access patterns extracted from a web
log – objective and reflect web surfers’
perspectives
web structural patterns extracted from an
existing website -- objective
13
Related work
Web mining is the process of applying
data mining techniques to extract
patterns from the Web.
Web Data
Content: texts and graphics in web pages
Structure: hyperlinks
Usage: web logs
14
Related work
Web content mining is the process of
automatically retrieving, parsing,
indexing and categorizing web
documents.(Chakrabrati, 2000)
Web structure mining
HITS (Kleinberg, 1998)
PageRank (Brin and Page, 1998)
15
Related work
Web usage mining is the process of applying
data mining techniques to extract web access
patterns from a web log.
16
Related work
Web usage mining
General purpose, e.g., Chen et al. 1996;
Cooley et al., 1999
Website improvement, e.g., Perkowitz and
Etzioni, 2000
Personalization, e.g., Yan 1996
17
Related work
Limitations of previous web usage
mining research
Not considering web structure information,
e.g., Chen et al., 1996
Web structure information are used to
exclude “uninteresting” web visiting
patterns, e.g., Yan et al., 1996 and Cooley
et al., 1999
18
Hyperlink Selection
The quality of a portal page is
measured using a web log and a web
log can be divided into sessions.
Metrics to measure the quality of a
portal page
Effectiveness
Efficiency
Usage
19
Hyperlink Selection
Effectiveness: is the percentage of usersought top-level web pages that can be
easily accessed from the portal page.
What are the user-sought top-level web
pages?
How to define the easiness to find a web
page from a Portal page?
20
Hyperlink Selection
User-sought top-level web pages
Session j: L1, L10, L11, L2, L13, L14, L5,
L9, L7, L12
L1, L2, L5, and L7 are in the hyperlink pool
User-sought top-level web pages: L1, L2,
L5, L7
21
Hyperlink Selection
Usually, web pages that are 1-2 clicks away
from a portal page can be easily found from
the portal page.
22
Hyperlink Selection
• Effectiveness measured at session level
effectiveness( S j )
UHL( S j ) HL
UHL( S j )
UHL( S j ) {L1, L2, L5, L7}
HL {L1, L2, L3, L4, L5, L8}
{L1, L2, L5}
effectiveness( S j )
75%
{L1, L2, L5, L7}
• Effectiveness measured at log level
s
effectiven ess ( wl)
effectiven ess( S
j)
j 1
s
23
Hyperlink Selection
• Efficiency measures the usefulness of hyperlinks
placed in a portal page.
• Efficiency measured at session level
efficiency( S j )
UHL( S j ) HL
PHL
UHL( S j ) {L1, L2, L5, L7}
HL {L1, L2, L3, L4, L5, L8}
PHL {L1, L2, L3}
efficiency( S j )
{L1, L2, L5}
1
{L1, L2, L3}
• Efficiency measured at log level
s
efficiency ( wl)
efficiency( S
j)
j 1
s
24
Hyperlink Selection
Usage : how often a portal page is visited.
usage( wl)
s
UHL(S
j ) HL
j 1
25
Hyperlink Selection
Definition : Given a website w, its hyperlink pool HP and
the number of hyperlinks to be placed in the portal page of w – N,
where N HP , the hyperlink selection problem is to construct
the portal page by selecting N hyperlinks from the hyperlink pool HP
to maximize the effectiveness, efficiency and usage of the
resulting portal page (i.e., all metrics are measured at the web log level).
26
LinkSelector
LinkSelector is based on relationships
between hyperlinks in a hyperlink pool.
Structure Relationship
Access Relationship
27
LinkSelector
Structure Relationship
28
LinkSelector
Structure Relationship
L2
L4
Web page 2
L6
L1
L8
L3
L5
Web page 1
L7
Web page 3
Structure relationships:
L1L2
L1L4
L3L5
L3L7
L1L6
L1L8
29
LinkSelector
Access Relationship
k-HS is denoted as a hyperlink set with k
hyperlinks. e.g., {L1,L2} is a 2-HS
The support of a k-HS is the percentage of
sessions that web pages pointed to by
hyperlinks in the k-HS are accessed together.
e.g., If web pages pointed to by L1 and L2
are accessed together in 20 sessions out of
total 100 sessions, then the support of 2-HS
{L1,L2} is 20%.
30
LinkSelector
Access Relationship
Definition : For a k-HS , where k
2 , there exists an
access relationship among elements in the k-HS
if and only if its support is greater than a
pre-defined threshold.
31
LinkSelector
32
LinkSelector
Group-I relationship
L2
L4
L1
L3
Web page 1
Web page 2
L6
L8
L5
L7
Web page 3
Structure relationships:
L1L2
L1L4
L3L5
L3L7
L1L6
L1L8
33
LinkSelector
Group-I relationship
L2
L4
L1
L3
Web page 1
L6
L8
L5
L7
L1L4
L3L5
L3L7
L1L6
Web page 3
Access relationships:
Structure relationships:
L1L2
Web page 2
L1L8
{L1,L2},0.1 {L1,L4},0.1
{L1,L6},0.05
{L1,L8},0.05
{L3,L5},0.4
{L3,L7},0.5
34
LinkSelector
Group-I relationship provides indicators of
preference for individual hyperlinks in hyperlink
selection
the number of structure relationships a
hyperlink participate in as an initial hyperlink
the quality of these structure relationships
35
LinkSelector
Group-II relationship
no structure relationship between L9 and L12
an access relationship between L9 and L12
36
LinkSelector
Group-II relationship provides indicators of
hyperlink pair preference in hyperlink
selection:
hyperlink pairs with Group-II relationships are
preferred to hyperlink pairs without
Group-II relationships
within hyperlink pairs with Group-II
relationships, hyperlink pairs with higher support
of access relationship are preferred to those with
lower support of access relationships
37
LinkSelector
Group-III relationship reveals patterns that
are not relevant to hyperlink selection
L1
L5
Web page 2
Web page 1
Group-IV relationship does not reveal
interesting patterns.
38
LinkSelector
The Sketch of LinkSelector
39
LinkSelector
Discover Structure Relationships
40
LinkSelector
Access relationship can be discovered
from a web log using association rule
mining
Data Preprocessing
Association rule mining
41
LinkSelector
Data Preprocessing
Web log cleaning
Error logs (e.g., status code 404)
Accessory logs (e.g., .gif)
Session identification
Modify web server
30-min time interval
42
LinkSelector
Association rule mining
Given: a transaction database
tid
item
001
1,2,3
002
1,2,4
003
2,3,4
004
4,5,6
An itemset is a set of items, e.g., {1,2}
The support of an itemset is the percentage of transactions
that contain (e.g., purchase) the itemst.
Objective: discover all itemsets with supports
larger than a user-defined threshold.
Apriori Algorithm (Agrawal and Srikant,1994 )
43
LinkSelector
Calculate Preferences for Hyperlinks
HP {L1 , L2 , L3 , L4 , L5 , L6 , L7 }
SR {L1 L2 , L1 L3 , L1 L4 ,}
AR {({L1 , L2 },0.022), ({L1 , L3},0.018), ({L1 , L4 },0.01) }
PREL1 0.022 0.018 0.01 0.05
44
LinkSelector
Calculate Preferences for Hyperlinks Sets
AR {({L3, L7 , L9},0.022), ({L4 , L8},0.02),}
No structure relationships between L3 and L7 ;
L3 and L9 ; L7 and L9 .
Preference for hyperlink set {L3 , L7 , L9}is 0.022
45
LinkSelector
Clustering is a data mining algorithm to
segment objects into groups based on
their similarities
1
3
0.2
5
1 2 3 4
5
1 0 0.2 0.1 0.1 0.05
2 0.2 0 0.1 0.1 0.05
3 0.1 0.1
4
5
2
4
0.12
46
LinkSelector
Hyperlink Clustering :
Hyperlinks Objects
Preferences for hyperlinks Weights of
objects
Preferences for hyperlinks sets
Similarities among hyperlinks
47
LinkSelector
Limitations of classical clustering
algorithms
Weights of objects are not considered.
Only considers similarities between two
objects
48
LinkSelector
Our solution
Indexes of the proposed similarity matrix
are clusters while indexes of the traditional
similarity matrix are objects to be
clustered.
Similarities involving two and more objects
are considered in the proposed similarity
matrix.
Weights of objects are considered
49
LinkSelector
50
Experiment
Data collected from UA website
Hyperlink pool: 110 links
Web log: collected in Sep. 2001
10 M records 4.2 M records
total 344 K sessions
262 k sessions Training data (23 days)
82 k sessions Testing data (7 days)
51
Experiment
Hyperlinks selected by LinkSelector, Domain experts
and access frequency (N=6)
52
Experiment
Average improvement: 12.7%
Improvement decrease from 22.1% to 8.4%
Average number of sessions per day: 11.5k
53
Experiment
Average improvement: 17.0%
Improvement decrease from 30.2% to 9.4%
Absolute number of hyperlinks improved: 15610 to 6509
54
Experiment
Average improvement: 16.9%
Improvement decrease from 30.2% to 9.3%
55
Experiment
Hyperlinks selected by LinkSelector, Classical
Hierarchical Clustering and Association rule mining
(N=6)
56
Experiment
Average improvement compared with association
rule mining: 25.8%
Average improvement compared with classical
clustering: 102.0%
57
Experiment
Average improvement compared with association
rule mining: 31.7%
Average improvement compared with classical
clustering: 124.0%
58
Experiment
Average improvement compared with association
rule mining: 31.6%
Average improvement compared with classical
clustering: 123.0%
59
Contributions
1. We proposed and formally defined a new and important research
problem – hyperlink selection.
2. We proposed and showed what a web mining based hyperlink
selection approach outperforms other hyperlink selection approaches.
3.We developed a new clustering algorithm for hyperlink selection.
60
Limitations and Future work
User Study
Adaptive LinkSelector
Structure of website changes
Web visiting patterns change
61