Finding Relevant Pages
Download
Report
Transcript Finding Relevant Pages
VIPAS: Virtual Link Powered
Authority Search in the Web
Chi-Chun Lin and Ming-Syan Chen
Network Database Laboratory
National Taiwan University
Outline
Motivation
and Goal
Preliminaries and Related work
Introduction to Link-analysis
Defects
of Traditional Link-analysis and
Ideas for Improvement
System Framework and Algorithms
Implementation and Experimental Results
Conclusions
M.-S. Chen
NTU
2
Motivation and Goal
To
find the most relevant pages satisfying
the user’s information need in the Web
Traditional means for this task
Keyword-based search engines
Problems
Some relevant pages do not contain the
keywords in the page text
An
alternative method
Analyze the links contained in Web pages
instead of ranking by keywords
M.-S. Chen
NTU
3
HITS
(1/3)
Authority
A page pointed to by many other pages
Hub
pages
A page pointing to many other pages
Mutual
pages
reinforcement
An authority pointed to by many hub pages is
an even better authority
A hub pointing to many authority pages is an
even better hub
Based on this argument, the goal of HITS is to
find the set of best authority pages
M.-S. Chen
NTU
4
HITS
(2/3)
Let xp and yp denote the authority and hub score
of page p, respectively
q1
q2
page p
xp := sum of yq
for all qp
q3
q2
page p
yp := sum of xq
for all pq
M.-S. Chen
q1
NTU
q3
5
HITS
(3/3)
Iterative algorithm
1.
2.
3.
4.
5.
Obtain a set of Web pages using a keywordbased query and expand it to form a base set
Assign each page of the base set an initial
authority and hub score of 1
According to its links, update the scores of
each page
Normalize the scores so that
(xp)2=1 and (yp)2=1 for all p in the base
set
Do steps 3 and 4 iteratively until the scores
converge
M.-S. Chen
NTU
6
The Problem with HITS
Links
in Web pages only reflect page
creators’ judgment
Sometimes a link will not be put in the
page even though its destination is very
relevant
e.g: There will be no link to a company’s
competitor in the same industry in its
homepage
We
argue: Page readers’ consideration
should be of equal importance
M.-S. Chen
NTU
7
The Notion of Virtual Links
The
basic idea
Identify pages that are heavily accessed
within a period, and form a “hot set” from
these pages
Create “virtual links” for pages in the hot set
and incorporate them into the computation of
authority scores
Design
a Web warehouse for this task and
utilize it to identify authoritative Web
pages
M.-S. Chen
NTU
8
System Framework
Query
Interface
page content
Page
Archive
& links
Web Pages
keywords
virtual links
Virtual Link
Creator
Authority
Evaluator
scores
Keyword
& Ranking
Database
query results
Clicking
Observer
Clickstream
Database
M.-S. Chen
NTU
9
Creating Virtual Links
Scenario:
A user interested in Javarelated Web pages came to our system
She submitted a query with keyword “java”
Assume that the query result contains 100
URLs
She clicked top 1-10 of the 100 URLs except
the 6th
The hot set consists of the 9 URLs clicked
M.-S. Chen
NTU
10
Creating Virtual Links
2
(cont’d)
criteria
URL 1
URL 1
URL 2
URL 2
Hub 1
URL 5
URL 6
URL 5
Virtual Hub
URL 6
URL 7
URL 7
URL 10
URL 10
M.-S. Chen
Hub 2
NTU
Hub n
11
Algorithm VIPAS
(Virtual LInk Powered Authority Search)
1.
2.
3.
4.
5.
Initialization Phase
For a query term, perform the regular HITS analysis
Collect a base set of pages with computed authority
and hub scores and store them in the database
Virtual Link Collection Phase
Monitor the user behavior to see whether a URL in
the list is clicked by the user or not
After a period of user behavior observation, put URLs
that are often accessed into the “hot set”
Create virtual links for pages in the hot set
M.-S. Chen
NTU
12
Algorithm VIPAS
6.
7.
(cont’d)
Refinement Phase
For each page in the hot set, compute its new
authority and hub scores
Run several iterations of score updating for pages in
the base set
2 flavors
VIPAS-VH(VIPAS with virtual links from a Virtual Hub)
VIPAS-TH(VIPAS with virtual links from Top Hubs)
M.-S. Chen
NTU
13
Finding Hot Sets
Some
relevant URLs that have already
been browsed by the user will be skipped
1.
2.
3.
4.
In an observing period, pay attention to clicks of
continuous URLs in the list
When a user continuously clicks several URLs and
then skips some URLs following, we mark those
that have been skipped
Exclude pages marked with a frequency greater
than from the forming of hot sets
Among pages left, those that are accessed by at
least % users are put into the hot set
M.-S. Chen
NTU
14
Finding Hot Sets
(cont’d)
1.
2.
3.
4.
5.
6.
http://java.sun.com/
http://www.sun.com/java/
http://www.javaworld.com/
http://java.oreilly.com/
http://www.jars.com/
…………..
clicked
1.
2.
3.
4.
5.
6.
http://java.sun.com/
http://www.sun.com/java/
http://www.javaworld.com/
http://java.oreilly.com/
http://www.jars.com/
…………..
skipped
M.-S. Chen
clicked
clicked
URL 4 is marked
skipped
clicked
clicked
clicked
skipped
URL 4 is marked,
but URL 1 is not
clicked
NTU
15
Assigning Weights to Virtual
Links
n pages in the hot set: t1,t2,…,tn
Clickstream 1:
(t1,t2,t3,t4,x1,x2)
4
4
( 0.267)
6 1 2 3 4
4
3
w1, 2
( 0.200)
6 1 2 3 4
4
2
w1,3
( 0.133)
6 1 2 3 4
4
1
w1, 4
( 0.067)
6 1 2 3 4
w1,5 w1,6 ... w1,n 0
w1,1
M.-S. Chen
Clickstream 2:
(t3,x1,t1)
2
1
( 0.222)
3 1 2
2
2
w2,3
( 0.444)
3 1 2
w2, 2 w2, 4 w2,5 ... w2,n 0
w2,1
NTU
16
Assigning Weights to Virtual
Links (cont’d)
N ( T1 )
Final weight:
wh (T1 )
w
k 1
k ,h
N (T1 )
0.267 0.222
0.245
2
0.200 0
w2 (T1 )
0.100
2
w1 (T1 )
For period Ti where i 2
N ( Ti )
w (Ti )
'
h
w
k 1
k ,h
N (Ti )
1
2
wh (Ti ) wh (Ti 1 ) wh' (Ti )
3
3
(1/3 is the degeneration factor)
M.-S. Chen
NTU
17
Computing the New Scores
Let
xp and yp denote the authority and hub
score of page p, respectively
For
each page p, we update p’s authority
score by x p yq A wq ', p yq '
q : ( q, p ) E
Similarly,
we update p’s hub score by
yp
M.-S. Chen
q ' : ( q ', p ) E '
q : ( p ,q ) E
xq H
NTU
w
x
p ,q ' q '
q ' : ( p , q ') E '
18
User-behavior Observation
Use
an ASP script
Query result for keyword: “Java”
1.
2.
3.
plain URL http://java.sun.com/
replaced by
wrapper.asp?URL=http://java.sun.com/
The Source of Java(TM) Technology
http://java.sun.com/
………………….
http://….
………
http://…
1.
2.
3.
Query result page
M.-S. Chen
NTU
Increment the click count of
http://java.sun.com/
Record the time
Redirect the user to
http://java.sun.com/
19
Implementation and
Experiments
Experimental
NTUEE website (http://www.ee.ntu.edu.tw/)
Data
testbed
collection
03/28/’02 ~ 05/31/’02
Parameters
M.-S. Chen
Parameter Value
20%
A
H
40%
10
10
NTU
20
Evaluation Method
For
a keyword, we manually select a list
of authority pages and compare it with
the output of each algorithm
SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
Discrepancy
coefficient
n
M.-S. Chen
(R
k 1
k
k)
n
NTU
21
Discrepancy Coefficient –
Regular HITS
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
7
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
8
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
9
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
10
8904
H/html_2000/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
41
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 41(SN 7228)
(1 1) (5 2) (41 3)
13.67
3
M.-S. Chen
NTU
22
Discrepancy Coefficient –
VIPAS-VH
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
7
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
8
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
9
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
10
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 6(SN 7228)
(1 1) (5 2) (6 3)
2
3
M.-S. Chen
NTU
23
Evaluation Method
Grouping
coefficient
n
2
[(
R
k
)
]
k
k 1
n
Stability
The standard deviation of each algorithm’s
discrepancy coefficients for all of the
keywords
M.-S. Chen
NTU
24
Grouping Coefficient –
Regular HITS
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
7
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
8
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
9
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
10
8904
H/html_2000/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
41
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 41(SN 7228)
[(1 1) 13.67]2 [(5 2) 13.67]2 [( 41 3) 13.67]2
17.25
3
M.-S. Chen
NTU
25
Grouping Coefficient –
VIPAS-VH
Rank SN
URL (H denotes http://www.ee.ntu.edu.tw)
Title
1
5633
H/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
2
93
H/professor_c.html
Faculty members of NTUEE
3
34
H/prodata_c.html
Faculty members of NTUEE
4
94
H/professor_e.html
Faculty members of NTUEE
5
8682
H/html_2000/www/faculty/rb-wu/rb-wu.htm
Homepage of professor Ruey-Beei Wu
6
7228
H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html
[no title]
7
7229
H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html
[no title]
8
7269
H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html
[no title]
9
5892
H/html_2000/WWW/faculty/NoSort.html
[no title]
10
4959
H/content/chinese/required/differential_equations.html
Engineering Mathematics I: Diff….
R1 = 1(SN 5633), R2 = 5(SN 8682), R3 = 6(SN 7228)
[(1 1) 2]2 [(5 2) 2]2 [( 41 3) 2]2
1.41
3
M.-S. Chen
NTU
26
Discrepancy Coefficient
Experimental Results
25
HITS
VIPAS-VH
VIPAS-TH
20
15
10
5
0
Grouping Coefficient
1
2
3
4
5
6
7
8
6
7
8
20
HITS
VIPAS-VH
VIPAS-TH
16
12
8
4
0
1
2
3
4
5
Keyword
M.-S. Chen
NTU
27
Stability
Experimental Results
9
8
7
6
5
4
3
2
1
0
HITS
M.-S. Chen
(cont’d)
VIPAS-VH
NTU
VIPAS-TH
28
Conclusions
Link-analysis
algorithms are popular in
Web information retrieval
In
But they need further improvement
our work, we built a Web warehouse
Incorporate user feedback into the
identification of authoritative resources
(Algorithm VIPAS)
Experimental results show that VIPAS is very
effective and the warehouse is able to retrieve
much more valuable information for users
M.-S. Chen
NTU
29