Transcript ppt
BrowseRank: letting the web users vote for page
importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR 2008
2009. 04.10.
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University
Center for E-Business Technology
Seoul National University
Seoul, Korea
Introduction
Page importance is a key factor for web search
Currently page importance is measured by using the link
graph
HITS
PageRank
If many important pages link to a page then the page is also
likely to be important
Copyright 2008 by CEBT
2
PageRank Drawbacks
Link graph is not reliable
Links can easily be created and deleted on the web
Can easily be manipulated by web spammers using link farms
PageRank does not considers the length of time which a web
surfer spends on the web page
Copyright 2008 by CEBT
3
BrowseRank
Utilize user browsing graph
Generated from user behavior data
Behavior data can be recorded by Internet browsers at web
clients and collected at web servers
Behavior data includes
–
URL
–
Time
–
Method of visiting (URL input or hyperlink click)
Copyright 2008 by CEBT
4
BrowseRank (2)
More visits of the page and longer time spent on a page
indicates that the page is important
Uses continuous-time Markov process as model on user
browsing graph
Markov process is a process in which the likelihood of a given
future state, at any given moment, depends only on its
present state, and not on any past states
Past
Present
Copyright 2008 by CEBT
Future
5
Originality
Propose the use of browsing graph for computing page
importance
Propose the use of continuous-time Markov process to model
a random walk on the user browsing graph
Copyright 2008 by CEBT
6
User Behavior Data
When user surfs on the web
Can input the URL
Choose to click on a hyperlink
Behavior data can be stored as triples
<URL, TIME, TYPE>
Copyright 2008 by CEBT
7
User Behavior Data (2)
Session Segmentation
Time Rule: If time of current record is 30 minutes behind that of
previous record, then current record is considered as new session
Type Rule: If the type of the record is ‘INPUT’ we will consider it
as new session
URL Pair construction
Within session, URL’s are placed in adjacent records
Indicates that the user transits from the first page to the second
page
Copyright 2008 by CEBT
8
User Behavior Data (3)
Reset probability estimation
For sessions segmented by type rule, the first URL is input by the
user
Assign reset probabilities to those URL’s
Staying time extraction
For each URL pair, use the time difference of second and first
page as staying time
For last session either use random time [for time rule] or time
difference from next session [for type rule]
Copyright 2008 by CEBT
9
User Browsing Graph
Vertex: Represent a URL
Metadata: Reset Probabilities, Staying Time
Directed Edge: Represents Transition between pages
Edge Weight: Number of transitions
18
25
30
6
3
14
15
7
45
17
Copyright 2008 by CEBT
10
Model
Continuous-time time-homogeneous Markov Process model
Assumptions
Independence of users ad sessions
Markov Property
Time-homogeneity
Copyright 2008 by CEBT
11
Continuous-time Markov Model
Xs represents page which the surfer is visiting at time s, s >
0
Continuous-time time-homogenous Markov Process
X { Xs, S 0}
Pij(t) denotes the transition probability from page i to
page j for time interval t
Stationary probability distribution Π unique and independent
P(t ), t 0
of t
Computing matrix P is difficult because it is hard to get
information for all time intervals
Algorithm is based on
Q P ' ( 0)
Copyright 2008 by CEBT
12
Algorithm
Copyright 2008 by CEBT
13
Experiments
Website-Level BrowseRank
Finding important websites and depressing spam sites
Page-Level BrowseRank
Improving relevance ranking
Dataset
3 billion records
950 million unique URL’s
Website Level Graph
–
5.6 million vertices
–
53 million edges
–
40 million websites
Copyright 2008 by CEBT
14
Top-20 Websites
Copyright 2008 by CEBT
15
Spam fighting
2714 websites labeled spam by human experts
Copyright 2008 by CEBT
16
Page Level Testing
Adopted 3 measures to evaluate performance
MAP
Precission (P@n)
Normalized Discounted Cummulative Gain (NDCG@n)
Copyright 2008 by CEBT
17
Results (1)
Copyright 2008 by CEBT
18
Results (2)
Copyright 2008 by CEBT
19
Technical Issues
User behavior data tends to be sparse
User behavior data can lead to reliable importance calculation
for the head web pages, but not for the tail web pages
Time homogeneity assumption is mainly for technical
convenience
Content information and metadata was not used in
BrowseRank
Copyright 2008 by CEBT
20
Discussion
Better approach to find page importance
Already highlights technical issues
Spammers can alter BrowseRank by sending fake user
behavior data. This will be easy too as behavior data is
collected from client.
Copyright 2008 by CEBT
21