Presentation transcript

BrowseRank: Letting Web Users Vote for Page Importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR 2008
2009.04.10
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University
Center for E-Business Technology
Seoul National University
Seoul, Korea
Introduction
- Page importance is a key factor for web search
- Currently, page importance is measured using the link graph
  – HITS
  – PageRank
- If many important pages link to a page, then that page is also likely to be important

PageRank Drawbacks
- The link graph is not reliable
  – Links can easily be created and deleted on the web
  – Links can easily be manipulated by web spammers using link farms
- PageRank does not consider the length of time a web surfer spends on a web page

BrowseRank
- Utilizes the user browsing graph
  – Generated from user behavior data
  – Behavior data can be recorded by Internet browsers at web clients and collected at web servers
  – Behavior data includes:
    – URL
    – Time
    – Method of visiting (URL input or hyperlink click)

BrowseRank (2)
- More visits to a page and longer time spent on a page indicate that the page is important
- Uses a continuous-time Markov process as the model on the user browsing graph
- A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states (formalized below)
[Slide diagram: Past → Present → Future]
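In symbols, with X_s denoting the page being visited at time s (the notation used on the model slides later in the talk), the Markov property and the time-homogeneity assumption read as follows; these are the standard textbook definitions, not formulas taken from the slides:

    P(X_{s+t} = j | X_s = i, X_u for 0 <= u < s) = P(X_{s+t} = j | X_s = i)    (Markov property)
    P_ij(t) := P(X_{s+t} = j | X_s = i), independent of s                      (time-homogeneity)
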
Originality
- Proposes the use of the user browsing graph for computing page importance
- Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph

User Behavior Data
- When a user surfs the web, they can either
  – input a URL directly, or
  – click on a hyperlink
- Behavior data can be stored as triples (see the sketch below)
  – <URL, TIME, TYPE>
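A minimal sketch of one such record as a Python data structure; the field names mirror the <URL, TIME, TYPE> triple from the slide, while the class name and the use of datetime are illustrative assumptions.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class BehaviorRecord:
        # One behavior-data triple: the visited URL, the visit time, and how the
        # page was reached ("INPUT" for a typed URL, "CLICK" for a hyperlink).
        url: str
        time: datetime
        type: str
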
User Behavior Data (2)
- Session segmentation
  – Time rule: if the current record is more than 30 minutes later than the previous record, it is treated as the start of a new session
  – Type rule: if the type of a record is 'INPUT', it is treated as the start of a new session
- URL pair construction (see the sketch below)
  – Within a session, URL pairs are built from adjacent records
  – A pair indicates that the user transits from the first page to the second page
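A minimal sketch of the two segmentation rules and the pair construction, assuming time-ordered input and the BehaviorRecord type sketched earlier; the 30-minute threshold comes from the slide, while the function names and the list-of-lists session representation are illustrative.

    from datetime import timedelta

    SESSION_GAP = timedelta(minutes=30)

    def segment_sessions(records):
        # Split time-ordered BehaviorRecord objects into sessions using the
        # type rule (an "INPUT" record starts a new session) and the time rule
        # (a gap of more than 30 minutes starts a new session).
        sessions = []
        for record in records:
            starts_new = (
                not sessions
                or record.type == "INPUT"                              # type rule
                or record.time - sessions[-1][-1].time > SESSION_GAP  # time rule
            )
            if starts_new:
                sessions.append([record])
            else:
                sessions[-1].append(record)
        return sessions

    def url_pairs(session):
        # Adjacent records within one session form (from_url, to_url) transitions.
        return [(a.url, b.url) for a, b in zip(session, session[1:])]
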
User Behavior Data (3)
- Reset probability estimation
  – For sessions segmented by the type rule, the first URL was input directly by the user
  – Reset probabilities are assigned to those URLs
- Staying time extraction (see the sketch below)
  – For each URL pair, the time difference between the second and the first page is used as the staying time of the first page
  – For the last page of a session, either a random time is used (time rule) or the time difference to the next session (type rule)
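A sketch of the staying-time extraction under the same assumptions; the random time used when the session ends by the time rule is drawn from an exponential distribution here purely as an illustrative choice, since the slide does not specify the distribution.

    import random

    def staying_times(session, next_session=None, next_split_by_type=False):
        # Staying time of a page = time difference between it and the next record
        # in the same session.  For the last page of a session, use the gap to the
        # next session (type rule) or a random time (time rule), as on the slide.
        times = {}
        for a, b in zip(session, session[1:]):
            times.setdefault(a.url, []).append((b.time - a.time).total_seconds())
        last = session[-1]
        if next_session is not None and next_split_by_type:
            last_stay = (next_session[0].time - last.time).total_seconds()
        else:
            last_stay = random.expovariate(1.0 / 60.0)  # illustrative distribution only
        times.setdefault(last.url, []).append(last_stay)
        return times
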
User Browsing Graph
- Vertex: represents a URL
  – Metadata: reset probability, staying times
- Directed edge: represents a transition between pages
- Edge weight: number of transitions (see the construction sketch below the figure)
[Slide figure: an example user browsing graph with transition counts as edge weights]
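A small sketch of how such a graph could be aggregated from the per-session URL pairs, reusing the url_pairs helper sketched earlier; vertex metadata (reset probabilities, staying times) is left out for brevity and all names are illustrative.

    from collections import Counter

    def build_browsing_graph(sessions):
        # Aggregate URL pairs from all sessions into a weighted directed graph:
        # edge_weights[(u, v)] = number of observed transitions from page u to page v.
        vertices = set()
        edge_weights = Counter()
        for session in sessions:
            for u, v in url_pairs(session):
                vertices.update((u, v))
                edge_weights[(u, v)] += 1
        return vertices, edge_weights
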
Model
- Continuous-time, time-homogeneous Markov process model
- Assumptions
  – Independence of users and sessions
  – Markov property
  – Time-homogeneity

Continuous-time Markov Model
- X_s represents the page which the surfer is visiting at time s, s ≥ 0
- X = {X_s, s ≥ 0} is a continuous-time, time-homogeneous Markov process
- P_ij(t) denotes the transition probability from page i to page j over a time interval t
- The stationary probability distribution π of P(t) is unique and independent of t: π = π P(t), t ≥ 0
- Computing the matrix P(t) directly is difficult because it is hard to get information for all time intervals
- The algorithm is instead based on the generator matrix Q = P'(0), which satisfies π Q = 0 (sketch below)
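For reference, a generic linear-algebra sketch of how a stationary distribution can be obtained once Q is known, by solving π Q = 0 together with the normalization constraint; this is a textbook method rather than necessarily the paper's own solver, and it assumes a small, dense Q held as a NumPy array.

    import numpy as np

    def stationary_from_generator(Q):
        # Solve pi @ Q = 0 subject to sum(pi) = 1, where Q = P'(0) is the
        # generator (rate) matrix of the continuous-time Markov process.
        n = Q.shape[0]
        A = np.vstack([Q.T, np.ones(n)])   # the pi Q = 0 equations plus the normalization row
        b = np.zeros(n + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return pi
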
Algorithm
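The algorithm itself appeared as a figure on this slide. As a rough, hedged sketch of the overall computation: for a continuous-time Markov process, the stationary probability of a page is proportional to the stationary probability of the embedded jump chain at that page multiplied by its mean staying time, so one can run a PageRank-style power iteration on the transition matrix estimated from the browsing graph and then reweight by mean staying times. The damping parameter alpha and all names below are illustrative assumptions, not taken from the paper.

    import numpy as np

    def browse_rank(P, mean_stay, reset, alpha=0.85, iters=100):
        # P         : row-stochastic transition matrix of the embedded jump chain,
        #             estimated from the edge weights of the user browsing graph
        # mean_stay : mean staying time per page, estimated from the log data
        # reset     : reset-probability vector estimated from "INPUT" records
        # alpha     : assumed damping factor for mixing transitions with resets
        n = P.shape[0]
        pi = np.full(n, 1.0 / n)
        for _ in range(iters):                        # power iteration on the jump chain
            pi = alpha * (pi @ P) + (1.0 - alpha) * reset
        scores = pi * mean_stay                       # pages held longer gain importance
        return scores / scores.sum()
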
Experiments
- Website-level BrowseRank
  – Finding important websites and depressing spam sites
- Page-level BrowseRank
  – Improving relevance ranking
- Dataset
  – 3 billion records
  – 950 million unique URLs
  – Website-level graph
    – 5.6 million vertices
    – 53 million edges
    – 40 million websites

Top-20 Websites
Spam Fighting
- 2,714 websites labeled as spam by human experts

Page-Level Testing
- Three measures adopted to evaluate performance (see the sketch below)
  – Mean Average Precision (MAP)
  – Precision at n (P@n)
  – Normalized Discounted Cumulative Gain (NDCG@n)
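For reference, one common formulation of P@n and NDCG@n (MAP omitted for brevity); the exact gain and discount used in the paper's evaluation are not given in the transcript, so this is an illustrative definition only.

    import math

    def precision_at_n(relevances, n):
        # Fraction of the top-n results judged relevant (relevance grade > 0).
        return sum(1 for r in relevances[:n] if r > 0) / n

    def dcg_at_n(relevances, n):
        # Discounted cumulative gain with the common (2^rel - 1) gain and log2 discount.
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:n]))

    def ndcg_at_n(relevances, n):
        # DCG of the ranking divided by the DCG of the ideally re-ordered ranking.
        ideal = dcg_at_n(sorted(relevances, reverse=True), n)
        return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0
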
Results (1)
Results (2)
Technical Issues
- User behavior data tends to be sparse
- User behavior data can lead to reliable importance calculation for head web pages, but not for tail web pages
- The time-homogeneity assumption is mainly for technical convenience
- Content information and metadata were not used in BrowseRank

Discussion
- A better approach to finding page importance
- The paper already highlights its technical issues
- Spammers can still manipulate BrowseRank by sending fake user behavior data; this may be relatively easy, since the behavior data is collected at the client