LinkSelector: A Web Mining Approach to Hyperlink Selection

Download Report

Transcript LinkSelector: A Web Mining Approach to Hyperlink Selection

LinkSelector: A Web Mining Approach to
Hyperlink Selection for Web Portals
Xiao Fang
University of Arizona
10/18/2002
Agenda






Introduction
Related work
Problem definition -- Hyperlink Selection
Solution -- LinkSelector
Evaluation
Contributions, limitations and future
work
2
Introduction

Size of WWW (Lawrence and Giles,1999)



800 million web pages
1 million pages added daily
How to find information on the Web


Using search engines (best coverage 38.3%)
(Lawrence and Giles,1999)
Clicking on hyperlinks
3
Introduction
Web Page 1
Web Page 2
Product Category List
A
B
C
D
E
F
Click on A
Product Category A
Product List
A1
A2
A3
A4
A5
Click on A2
Web Page 3
Product A2
Price: 1000
Detailed description
B2
4
Introduction


Portal page: is the entrance to a
website.
Portal page


Homepage of a website
Default web portal (e.g.,My Yahoo!)

Most My Yahoo! users never customize their
default web portals (Manber et al., 2000).
5
Introduction


Hyperlinks in a portal page are selected
from a hyperlink pool.
A hyperlink pool is a set of hyperlinks
pointing to top-level web pages, e.g.,
hyperlink in a site index page.
6
Portal
page
7
Hyperlink
pool
8
Portal
page
9
Hyperlink
pool
10
Introduction


Number of hyperlinks in a portal page:
several dozens (e.g., 32 in the Arizona
Home page).
Number of hyperlinks in a hyperlink
pool: several hundreds (e.g., 743 in the
Arizona Index page).
11
Introduction

It is too computational expensive to do
an exhaustive search (e.g., C  1.44E 56 ).
32
743

Current practice of hyperlink selection –
expert selection


Only reflect domain experts’ perspectives
Subjective
12
Introduction

Our approach is based on


web access patterns extracted from a web
log – objective and reflect web surfers’
perspectives
web structural patterns extracted from an
existing website -- objective
13
Related work


Web mining is the process of applying
data mining techniques to extract
patterns from the Web.
Web Data



Content: texts and graphics in web pages
Structure: hyperlinks
Usage: web logs
14
Related work


Web content mining is the process of
automatically retrieving, parsing,
indexing and categorizing web
documents.(Chakrabrati, 2000)
Web structure mining


HITS (Kleinberg, 1998)
PageRank (Brin and Page, 1998)
15
Related work
Web usage mining is the process of applying
data mining techniques to extract web access
patterns from a web log.

16
Related work

Web usage mining



General purpose, e.g., Chen et al. 1996;
Cooley et al., 1999
Website improvement, e.g., Perkowitz and
Etzioni, 2000
Personalization, e.g., Yan 1996
17
Related work

Limitations of previous web usage
mining research


Not considering web structure information,
e.g., Chen et al., 1996
Web structure information are used to
exclude “uninteresting” web visiting
patterns, e.g., Yan et al., 1996 and Cooley
et al., 1999
18
Hyperlink Selection


The quality of a portal page is
measured using a web log and a web
log can be divided into sessions.
Metrics to measure the quality of a
portal page



Effectiveness
Efficiency
Usage
19
Hyperlink Selection

Effectiveness: is the percentage of usersought top-level web pages that can be
easily accessed from the portal page.


What are the user-sought top-level web
pages?
How to define the easiness to find a web
page from a Portal page?
20
Hyperlink Selection

User-sought top-level web pages



Session j: L1, L10, L11, L2, L13, L14, L5,
L9, L7, L12
L1, L2, L5, and L7 are in the hyperlink pool
User-sought top-level web pages: L1, L2,
L5, L7
21
Hyperlink Selection

Usually, web pages that are 1-2 clicks away
from a portal page can be easily found from
the portal page.
22
Hyperlink Selection
• Effectiveness measured at session level
effectiveness( S j ) 
UHL( S j )  HL
UHL( S j )
UHL( S j )  {L1, L2, L5, L7}
HL  {L1, L2, L3, L4, L5, L8}
{L1, L2, L5}
effectiveness( S j ) 
 75%
{L1, L2, L5, L7}
• Effectiveness measured at log level
s
effectiven ess ( wl) 
 effectiven ess( S
j)
j 1
s
23
Hyperlink Selection
• Efficiency measures the usefulness of hyperlinks
placed in a portal page.
• Efficiency measured at session level
efficiency( S j ) 
UHL( S j )  HL
PHL
UHL( S j )  {L1, L2, L5, L7}
HL  {L1, L2, L3, L4, L5, L8}
PHL  {L1, L2, L3}
efficiency( S j ) 
{L1, L2, L5}
1
{L1, L2, L3}
• Efficiency measured at log level
s
efficiency ( wl) 
 efficiency( S
j)
j 1
s
24
Hyperlink Selection

Usage : how often a portal page is visited.
usage( wl) 
s
 UHL(S
j )  HL
j 1
25
Hyperlink Selection
Definition : Given a website w, its hyperlink pool HP and
the number of hyperlinks to be placed in the portal page of w – N,
where N  HP , the hyperlink selection problem is to construct
the portal page by selecting N hyperlinks from the hyperlink pool HP
to maximize the effectiveness, efficiency and usage of the
resulting portal page (i.e., all metrics are measured at the web log level).
26
LinkSelector

LinkSelector is based on relationships
between hyperlinks in a hyperlink pool.

Structure Relationship

Access Relationship
27
LinkSelector

Structure Relationship
28
LinkSelector

Structure Relationship
L2
L4
Web page 2
L6
L1
L8
L3
L5
Web page 1
L7
Web page 3
Structure relationships:
L1L2
L1L4
L3L5
L3L7
L1L6
L1L8
29
LinkSelector

Access Relationship
k-HS is denoted as a hyperlink set with k
hyperlinks. e.g., {L1,L2} is a 2-HS

The support of a k-HS is the percentage of
sessions that web pages pointed to by
hyperlinks in the k-HS are accessed together.

e.g., If web pages pointed to by L1 and L2
are accessed together in 20 sessions out of
total 100 sessions, then the support of 2-HS
{L1,L2} is 20%.

30
LinkSelector

Access Relationship
Definition : For a k-HS , where k
 2 , there exists an
access relationship among elements in the k-HS
if and only if its support is greater than a
pre-defined threshold.
31
LinkSelector
32
LinkSelector

Group-I relationship
L2
L4
L1
L3
Web page 1
Web page 2
L6
L8
L5
L7
Web page 3
Structure relationships:
L1L2
L1L4
L3L5
L3L7
L1L6
L1L8
33
LinkSelector

Group-I relationship
L2
L4
L1
L3
Web page 1
L6
L8
L5
L7
L1L4
L3L5
L3L7
L1L6
Web page 3
Access relationships:
Structure relationships:
L1L2
Web page 2
L1L8
{L1,L2},0.1 {L1,L4},0.1
{L1,L6},0.05
{L1,L8},0.05
{L3,L5},0.4
{L3,L7},0.5
34
LinkSelector

Group-I relationship provides indicators of
preference for individual hyperlinks in hyperlink
selection
the number of structure relationships a
hyperlink participate in as an initial hyperlink


the quality of these structure relationships
35
LinkSelector

Group-II relationship

no structure relationship between L9 and L12

an access relationship between L9 and L12
36
LinkSelector

Group-II relationship provides indicators of
hyperlink pair preference in hyperlink
selection:


hyperlink pairs with Group-II relationships are
preferred to hyperlink pairs without
Group-II relationships
within hyperlink pairs with Group-II
relationships, hyperlink pairs with higher support
of access relationship are preferred to those with
lower support of access relationships
37
LinkSelector

Group-III relationship reveals patterns that
are not relevant to hyperlink selection
L1
L5
Web page 2
Web page 1

Group-IV relationship does not reveal
interesting patterns.
38
LinkSelector

The Sketch of LinkSelector
39
LinkSelector

Discover Structure Relationships
40
LinkSelector

Access relationship can be discovered
from a web log using association rule
mining


Data Preprocessing
Association rule mining
41
LinkSelector

Data Preprocessing

Web log cleaning



Error logs (e.g., status code 404)
Accessory logs (e.g., .gif)
Session identification


Modify web server
30-min time interval
42
LinkSelector

Association rule mining
Given: a transaction database
tid
item
001
1,2,3
002
1,2,4
003
2,3,4
004
4,5,6
 An itemset is a set of items, e.g., {1,2}
 The support of an itemset is the percentage of transactions
that contain (e.g., purchase) the itemst.


Objective: discover all itemsets with supports
larger than a user-defined threshold.
Apriori Algorithm (Agrawal and Srikant,1994 )
43
LinkSelector
 Calculate Preferences for Hyperlinks
HP  {L1 , L2 , L3 , L4 , L5 , L6 , L7 }
SR  {L1  L2 , L1  L3 , L1  L4 ,}
AR  {({L1 , L2 },0.022), ({L1 , L3},0.018), ({L1 , L4 },0.01) }
PREL1  0.022 0.018 0.01  0.05
44
LinkSelector
 Calculate Preferences for Hyperlinks Sets
AR  {({L3, L7 , L9},0.022), ({L4 , L8},0.02),}
No structure relationships between L3 and L7 ;
L3 and L9 ; L7 and L9 .
Preference for hyperlink set {L3 , L7 , L9}is 0.022
45
LinkSelector

Clustering is a data mining algorithm to
segment objects into groups based on
their similarities
1
3
0.2
5
1 2 3 4
5
1 0 0.2 0.1 0.1 0.05
2 0.2 0 0.1 0.1 0.05
3 0.1 0.1
4
5
2
4
0.12
46
LinkSelector

Hyperlink Clustering :



Hyperlinks  Objects
Preferences for hyperlinks  Weights of
objects
Preferences for hyperlinks sets 
Similarities among hyperlinks
47
LinkSelector

Limitations of classical clustering
algorithms


Weights of objects are not considered.
Only considers similarities between two
objects
48
LinkSelector

Our solution



Indexes of the proposed similarity matrix
are clusters while indexes of the traditional
similarity matrix are objects to be
clustered.
Similarities involving two and more objects
are considered in the proposed similarity
matrix.
Weights of objects are considered
49
LinkSelector
50
Experiment

Data collected from UA website


Hyperlink pool: 110 links
Web log: collected in Sep. 2001




10 M records  4.2 M records
total 344 K sessions
262 k sessions  Training data (23 days)
82 k sessions  Testing data (7 days)
51
Experiment
Hyperlinks selected by LinkSelector, Domain experts
and access frequency (N=6)
52
Experiment
Average improvement: 12.7%
Improvement decrease from 22.1% to 8.4%
Average number of sessions per day: 11.5k
53
Experiment
Average improvement: 17.0%
Improvement decrease from 30.2% to 9.4%
Absolute number of hyperlinks improved: 15610 to 6509
54
Experiment
Average improvement: 16.9%
Improvement decrease from 30.2% to 9.3%
55
Experiment
Hyperlinks selected by LinkSelector, Classical
Hierarchical Clustering and Association rule mining
(N=6)
56
Experiment
Average improvement compared with association
rule mining: 25.8%
Average improvement compared with classical
clustering: 102.0%
57
Experiment
Average improvement compared with association
rule mining: 31.7%
Average improvement compared with classical
clustering: 124.0%
58
Experiment
Average improvement compared with association
rule mining: 31.6%
Average improvement compared with classical
clustering: 123.0%
59
Contributions
1. We proposed and formally defined a new and important research
problem – hyperlink selection.
2. We proposed and showed what a web mining based hyperlink
selection approach outperforms other hyperlink selection approaches.
3.We developed a new clustering algorithm for hyperlink selection.
60
Limitations and Future work

User Study

Adaptive LinkSelector

Structure of website changes

Web visiting patterns change
61