Finding Relevant Pages

Download Report

Transcript Finding Relevant Pages

VIPAS: Virtual Link Powered
Authority Search in the Web
Chi-Chun Lin and Ming-Syan Chen
Network Database Laboratory
National Taiwan University
Outline
 Motivation
and Goal
 Preliminaries and Related work

Introduction to Link-analysis
 Defects
of Traditional Link-analysis and
Ideas for Improvement
 System Framework and Algorithms
 Implementation and Experimental Results
 Conclusions
M.-S. Chen
NTU
2
Motivation and Goal
 To
find the most relevant pages satisfying
the user’s information need in the Web
 Traditional means for this task

Keyword-based search engines
 Problems

Some relevant pages do not contain the
keywords in the page text
 An

alternative method
Analyze the links contained in Web pages
instead of ranking by keywords
M.-S. Chen
NTU
3
HITS
(1/3)
 Authority

A page pointed to by many other pages
 Hub

pages
A page pointing to many other pages
 Mutual



pages
reinforcement
An authority pointed to by many hub pages is
an even better authority
A hub pointing to many authority pages is an
even better hub
Based on this argument, the goal of HITS is to
find the set of best authority pages
M.-S. Chen
NTU
4
HITS
(2/3)
Let xp and yp denote the authority and hub score
of page p, respectively

q1
q2
page p
xp := sum of yq
for all qp
q3
q2
page p
yp := sum of xq
for all pq
M.-S. Chen
q1
NTU
q3
5
HITS

(3/3)
Iterative algorithm
1.
2.
3.
4.
5.
Obtain a set of Web pages using a keywordbased query and expand it to form a base set
Assign each page of the base set an initial
authority and hub score of 1
According to its links, update the scores of
each page
Normalize the scores so that
(xp)2=1 and (yp)2=1 for all p in the base
set
Do steps 3 and 4 iteratively until the scores
converge
M.-S. Chen
NTU
6
The Problem with HITS
 Links
in Web pages only reflect page
creators’ judgment
 Sometimes a link will not be put in the
page even though its destination is very
relevant

e.g: There will be no link to a company’s
competitor in the same industry in its
homepage
 We
argue: Page readers’ consideration
should be of equal importance
M.-S. Chen
NTU
7
The Notion of Virtual Links
 The


basic idea
Identify pages that are heavily accessed
within a period, and form a “hot set” from
these pages
Create “virtual links” for pages in the hot set
and incorporate them into the computation of
authority scores
 Design
a Web warehouse for this task and
utilize it to identify authoritative Web
pages
M.-S. Chen
NTU
8
System Framework
Query
Interface
page content
Page
Archive
& links
Web Pages
keywords
virtual links
Virtual Link
Creator
Authority
Evaluator
scores
Keyword
& Ranking
Database
query results
Clicking
Observer
Clickstream
Database
M.-S. Chen
NTU
9
Creating Virtual Links
 Scenario:
A user interested in Javarelated Web pages came to our system




She submitted a query with keyword “java”
Assume that the query result contains 100
URLs
She clicked top 1-10 of the 100 URLs except
the 6th
The hot set consists of the 9 URLs clicked
M.-S. Chen
NTU
10
Creating Virtual Links
2
(cont’d)
criteria
URL 1
URL 1
URL 2
URL 2
Hub 1
URL 5
URL 6
URL 5
Virtual Hub
URL 6
URL 7
URL 7
URL 10
URL 10
M.-S. Chen
Hub 2
NTU
Hub n
11
Algorithm VIPAS
(Virtual LInk Powered Authority Search)

1.
2.

3.
4.
5.
Initialization Phase
For a query term, perform the regular HITS analysis
Collect a base set of pages with computed authority
and hub scores and store them in the database
Virtual Link Collection Phase
Monitor the user behavior to see whether a URL in
the list is clicked by the user or not
After a period of user behavior observation, put URLs
that are often accessed into the “hot set”
Create virtual links for pages in the hot set
M.-S. Chen
NTU
12
Algorithm VIPAS

6.
7.

(cont’d)
Refinement Phase
For each page in the hot set, compute its new
authority and hub scores
Run several iterations of score updating for pages in
the base set
2 flavors


VIPAS-VH(VIPAS with virtual links from a Virtual Hub)
VIPAS-TH(VIPAS with virtual links from Top Hubs)
M.-S. Chen
NTU
13
Finding Hot Sets
 Some
relevant URLs that have already
been browsed by the user will be skipped
1.
2.
3.
4.
In an observing period, pay attention to clicks of
continuous URLs in the list
When a user continuously clicks several URLs and
then skips some URLs following, we mark those
that have been skipped
Exclude pages marked with a frequency greater
than  from the forming of hot sets
Among pages left, those that are accessed by at
least % users are put into the hot set
M.-S. Chen
NTU
14
Finding Hot Sets
(cont’d)
1.
2.
3.
4.
5.
6.
http://java.sun.com/
http://www.sun.com/java/
http://www.javaworld.com/
http://java.oreilly.com/
http://www.jars.com/
…………..
clicked
1.
2.
3.
4.
5.
6.
http://java.sun.com/
http://www.sun.com/java/
http://www.javaworld.com/
http://java.oreilly.com/
http://www.jars.com/
…………..
skipped
M.-S. Chen
clicked
clicked
URL 4 is marked
skipped
clicked
clicked
clicked
skipped
URL 4 is marked,
but URL 1 is not
clicked
NTU
15
Assigning Weights to Virtual
Links
n pages in the hot set: t1,t2,…,tn
Clickstream 1:
(t1,t2,t3,t4,x1,x2)
4
4

(  0.267)
6 1 2  3 4
4
3
w1, 2  
(  0.200)
6 1 2  3 4
4
2
w1,3  
(  0.133)
6 1 2  3 4
4
1
w1, 4  
(  0.067)
6 1 2  3 4
w1,5  w1,6  ...  w1,n  0
w1,1 
M.-S. Chen
Clickstream 2:
(t3,x1,t1)
2
1

(  0.222)
3 1 2
2
2
w2,3  
(  0.444)
3 1 2
w2, 2  w2, 4  w2,5  ...  w2,n  0
w2,1 
NTU
16
Assigning Weights to Virtual
Links (cont’d)
N ( T1 )

Final weight:


wh (T1 ) 
w
k 1
k ,h
N (T1 )
0.267  0.222
 0.245
2
0.200  0
w2 (T1 ) 
 0.100
2
w1 (T1 ) 
For period Ti where i  2
N ( Ti )
w (Ti ) 
'
h
w
k 1
k ,h
N (Ti )
1
2
wh (Ti )   wh (Ti 1 )   wh' (Ti )
3
3
(1/3 is the degeneration factor)
M.-S. Chen
NTU
17
Computing the New Scores
 Let
xp and yp denote the authority and hub
score of page p, respectively
 For
each page p, we update p’s authority
score by x p   yq   A  wq ', p yq '
q : ( q, p )  E
 Similarly,
we update p’s hub score by
yp 
M.-S. Chen
q ' : ( q ', p )  E '

q : ( p ,q )  E
xq   H
NTU
w
x
p ,q ' q '
q ' : ( p , q ')  E '
18
User-behavior Observation
 Use
an ASP script
Query result for keyword: “Java”
1.
2.
3.
plain URL http://java.sun.com/
replaced by
wrapper.asp?URL=http://java.sun.com/
The Source of Java(TM) Technology
http://java.sun.com/
………………….
http://….
………
http://…
1.
2.
3.
Query result page
M.-S. Chen
NTU
Increment the click count of
http://java.sun.com/
Record the time
Redirect the user to
http://java.sun.com/
19
Implementation and
Experiments
 Experimental

NTUEE website (http://www.ee.ntu.edu.tw/)
 Data

testbed
collection
03/28/’02 ~ 05/31/’02
 Parameters
M.-S. Chen
Parameter Value

20%

A
H
40%
10
10
NTU
20
Evaluation Method
 For
a keyword, we manually select a list
of authority pages and compare it with
the output of each algorithm
SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
 Discrepancy
coefficient 
n

M.-S. Chen
 (R
k 1
k
 k)
n
NTU
21
Discrepancy Coefficient –
Regular HITS
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
7
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
8
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
9
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
10
8904
H/html_2000/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
41
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 41(SN 7228)


(1  1)  (5  2)  (41  3)
 13.67
3
M.-S. Chen
NTU
22
Discrepancy Coefficient –
VIPAS-VH
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
7
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
8
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
9
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
10
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 6(SN 7228)


(1  1)  (5  2)  (6  3)
2
3
M.-S. Chen
NTU
23
Evaluation Method
 Grouping
coefficient 
n

2
[(
R

k
)


]
 k
k 1
n
 Stability

The standard deviation of each algorithm’s
discrepancy coefficients for all of the
keywords
M.-S. Chen
NTU
24
Grouping Coefficient –
Regular HITS
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
7
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
8
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
9
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
10
8904
H/html_2000/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
41
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 41(SN 7228)

[(1  1)  13.67]2  [(5  2)  13.67]2  [( 41  3)  13.67]2

 17.25
3
M.-S. Chen
NTU
25
Grouping Coefficient –
VIPAS-VH
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
7
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
8
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
9
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
10
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 6(SN 7228)

[(1  1)  2]2  [(5  2)  2]2  [( 41  3)  2]2

 1.41
3
M.-S. Chen
NTU
26
Discrepancy Coefficient
Experimental Results
25
HITS
VIPAS-VH
VIPAS-TH
20
15
10
5
0
Grouping Coefficient
1
2
3
4
5
6
7
8
6
7
8
20
HITS
VIPAS-VH
VIPAS-TH
16
12
8
4
0
1
2
3
4
5
Keyword
M.-S. Chen
NTU
27
Stability
Experimental Results
9
8
7
6
5
4
3
2
1
0
HITS
M.-S. Chen
(cont’d)
VIPAS-VH
NTU
VIPAS-TH
28
Conclusions
 Link-analysis
algorithms are popular in
Web information retrieval

 In


But they need further improvement
our work, we built a Web warehouse
Incorporate user feedback into the
identification of authoritative resources
(Algorithm VIPAS)
Experimental results show that VIPAS is very
effective and the warehouse is able to retrieve
much more valuable information for users
M.-S. Chen
NTU
29