ranking_post[1]. - Computer Science and Engineering

Download Report

Transcript ranking_post[1]. - Computer Science and Engineering

Link Analysis Ranking Algorithms on the
World Wide Web
Allan Borodin
Computer Science Department, University of Toronto, and
Gammasite
Gareth O. Roberts
Department of Mathematics and Statistics, Lancaster University
Jeffrey S. Rosenthal
Department of Statistics, University of Toronto
Panayiotis Tsaparas
Computer Science Department, University of Toronto
Link Analysis Ranking on the Web
• View the Web as a graph
– Each web page is a node
– Each hyperlink is a directed edge
• Underlying Intuition:
– A link from node i to node j, denotes endorsement
of node j as an authority on a topic
• The Problem: Mine the Web Graph Secrets
– Discover good authorities.
– Rank nodes according to their authority weight
Roadmap
•
•
•
•
•
•
Previous Work
Extensions of existing algorithms
A novel Bayesian Algorithm
Experimental Results
A Theoretical Framework
The Grand Finale
Previous Algorithms
• Page Rank [Brin and Page 1997]
– Query independent
– Random Surfer Model
• Hubs and Authorities [Kleinberg 1997]
–
–
–
–
Query dependent
Kleinberg [Kleinberg 1997]
SALSA [Lempel and Moran 2000]
Other
• [Henzinger and Bharat 1998]
• [Rafiei and Mendelzon 2000]
• PHITS [Cohn and Chang 2001]
Hubs and Authorities
IN
Root
Set
OUT
BASE SET
•
•
•
•
Create Root Set from text-based search engine.
Expand to Base Set
Construct underlying Graph
Remove intra-domain links
Hubs and Authorities
• Pages with double identity (hubs,
authorities)
• Good hubs point to many good
authorities
• Good authorities are pointed by many
good hubs
• Assign each page i an authority
weight ia and a hub weight ih.
• Target: Find good authorities
HUBS
AUTHORITIES
Kleinberg Algorithm
• Initialize all weights to 1.
• Repeat until convergence
– I operation: authorities “collect” the hub
weights
ai 
h
j: j i
j
– O operation : hubs “collect” the authority
weights
hi 
a
j:i  j
j
– Normalize weights under some norm
HUBS
AUTHORITIES
Kleinberg Algorithm (cont.)
• Equivalent to SVD decomposition of adjacency matrix A
• Authority weights converge to principal eigenvector of
ATA
• Hub weights converge to principal eigenvector of AAT
SALSA
• Replace the I and O operations with
– I’ operation: authorities average the hub weights
1
ai 
B(i)
h
j: j i
j
– O’ operation: hubs average the authority weights
1
hi 
F (i)
a
j:i  j
j
SALSA
• Equivalent to a random walk on the
bipartite graph
• For a connected component the
stationary distribution satisfies
B (i )
ai 
B
• For the whole graph, pick starting point
uniformly at random. The authority
weight of node i in component j
N j B(i )
ai 
N Bj
HUBS
AUTHORITIES
pSALSA
• Pick the initial node with probability proportional to the
popularity (in-degree) of the node.
• Perform the same random walk as in SALSA
• Stationary distribution
B (i )
ai 
B
Kleinberg Algorithm and Random Walks
• pSALSA is equivalent to a single I
operation.
• The nth step of the Kleinberg algorithm
gives weight
ai 
( BF ) n (i )
( BF ) n
• The stationary distribution of a random
walk with transition probabilities
n
P(i, j ) 
( BF ) (i, j )
( BF )n (i )
HUBS
AUTHORITIES
Roadmap
•
•
•
•
•
•
Previous Work
Extensions of existing algorithms
A novel Bayesian Algorithm
Experimental Results
A Theoretical Framework
The Grand Finale
Hub-Averaging Algorithm
• Asymmetric view of Hubs, Authorities
• Good hubs point only to good authorities
• Algorithm
– Perform I operation of Kleinberg
– Perform O’ operation of SALSA
Threshold Algorithms
• Hub Threshold Algorithm
– I operation: Keep only the hub weights above
average
• Authority Threshold Algorithm
– O operation: Keep only the top K authority weights
• Full Threshold Algorithm
– Apply Thresholds to both I and O operations
Breadth First Search Algorithm
• pSALSA: weights according to 1-neighborhood popularity
• Kleinberg: weights according to global structure
• BFS: Combine
– Assign weights according to
n-neighborhood popularity.
– Visit neighbors in a BFS fashion,
alternating between B and F steps
– Apply exponentially decreasing
weighting
HUBS
AUTHORITIES
ai  2n BN (i)  2n1 BFN (i)  2n2 BFBN (i)    ( BF )nN (i)
Roadmap
•
•
•
•
•
•
Previous Work
Extensions of existing algorithms
A novel Bayesian Algorithm
Experimental Results
A Theoretical Framework
The Grand Finale
Bayesian Algorithm:The Model
• Assign to each page i parameters ai , hi , ei
– ( ei : “link tendency” parameter)
• Probability of a link between i and j
exp( a j hi  ei )
P[i  j ] 
1  exp( a j hi  ei )
• Simplified Bayesian:
P[i  j ] 
a j hi
1  a j hi
• Assign prior distributions to the parameters
hi ~ Exp(1)
ai ~ Exp(1)
ei ~ N (  ,  2 )
Bayesian Algorithm:The Algorithm
• Condition on the observed adjacency matrix A
• Obtain posterior distribution using Bayes Rule
 (a1 ,, an , h1 ,, hn , e1 ,, en ) 
n

 exp  hi  exp  ai  exp  ei   2
i 1
2
  Pi  j   Pi  j 
( i , j ): A ( i , j ) 1
( i , j ): A ( i , j )  0
• Compute the conditional means using Metropolis
Algorithm
aˆi   ai   (a1 ,, an , h1 ,, hn , e1 ,, en )
• Output the conditional means of the authority parameters
Roadmap
•
•
•
•
•
•
Previous Work
Extensions of existing algorithms
A novel Bayesian Algorithm
Experimental Results
A Theoretical Framework
The Grand Finale
• http://www.cs.toronto.edu/~tsap/experiments
Experimental Results
• No undisputed “best” algorithm
– No algorithm performs consistently well
– There are queries where no algorithm performs well
– There are queries where all algorithms perform well
• Some algorithms are more “focused”, others more
“spread”
• Some algorithms are more prone to topic drift
• The construction of the Base Set Graph is very important
Experimental Results
• Kleinberg
– Converges to the most Tightly Knit Community
(TKC phenomenon)
– Prone to topic drift
• pSALSA
– Spreads the authority weight over different
communities
– May introduce spurious authorities
• The two ends of the spectrum (east v.s. west) (genetic)
Comparative Evaluation of Algorithms
HThresh
Kleinberg
•
•
•
•
AThresh
FThresh
Hub-Avg
BFS
Bayesian
Similarity: intersection over top ten
Simplified Bayesian very close to pSALSA
Threshold algorithms close to Kleinberg
Other algorithms range in the middle
SBayesian
pSALSA
Roadmap
•
•
•
•
•
•
Previous Work
Extensions of existing algorithms
A Bayesian Approach
Experimental Results
A Theoretical Framework
The Grand Finale
A Theoretical Framework
• Link Analysis Ranking algorithm A
A : GN   N
– A(G)[j]: authority weight of jth page.
• L-algorithm A: vector A(G) is normalized under L norm
• Unnormalized algorithm: no normalization at any step
– e.g. unnormalized pSALSA
Monotonicity
• Definition: An algorithm A is monotone if for any two
nodes, j,k:
BN ( j )  BN (k )  A(G)[ j ]  A(G)[k ]
j
k
• All algorithms we consider are monotone
Similarity:Distance Measures
• L1 norm:
– Distance between weight vectors
w1 , w2 :
d1 w1 , w2    w1[i]  w2 [i]
– Distance between algorithms
A1 , A2 :
d1 ( A1 , A2 )  max d1  A1 (G), A2 (G)
GGN
Similarity:Distance Measures
• Rank Distance (counts the number of swapped pairs)
– Distance between weight vectors
w1 , w2 :
1, if w1 (i )  w1 ( j ) AND w2 (i )  w2 ( j )
I ( w1 , w2 ) (i, j )  
0, o.w.
1 N N
d r w1 , w2    I ( w1 ,w2 ) (i, j )
N i 1 j 1
– Distance between algorithms
A1 , A2 :
d r ( A1 , A2 )  max d r  A1 (G), A2 (G)
GGN
Similarity
• Definition:Two L-algorithms A1 , A2are similar if
d1 ( A1 , A2 )   ( N )
• Definition: Two algorithms A1 , A2are rank similar if
d r ( A1 , A2 )   ( N )
• Definition: Two algorithms A1 , A2are rank matching if
d r ( A1 , A2 )  0
Similarity:Results
• Hub-Averaging and pSALSA are neither similar, nor
rank similar
• Kleinberg and pSALSA are neither similar nor rank
similar
• Kleinberg and Hub-Averaging are neither similar nor
rank similar
Stability
• Intuition: An algorithm is stable if small changes on the
graph have small effect on the output
• Let G'  G   1 ,  2 ,,  k 
• Definition: An L-algorithm A is stable if
d1 ( A(G),   A(G' ))   ( N )
• Definition: An algorithm A is rank stable if
d r ( A(G), A(G' ))   ( N )
Stability: Results
• Kleinberg and Hub-Averaging are neither stable nor rank
stable
• pSALSA is stable, and rank stable
Locality
• Let G'  G  q  p
• Local:
A(G)[i]  A(G' )[i]  0 for all i  p
• Pairwise local:
A(G )[i] A(G ' )[i]

for all i, j  p
A(G)[ j ] A(G' )[ j ]
• Rank Local:
I ( A(G ), A(G ')) (i, j )  0 for all i, j  p
Locality: Results
• Unnormalized pSALSA is local
• pSALSA is rank local, and pairwise local
• Theorem (Uniqueness of pSALSA)
Any algorithm that is monotone, label independent and
local is rank matching with pSALSA
Future Work
• Investigate the use of other statistical and machine
learning techniques for link analysis
• Expand and explore the Theoretical Framework
• Investigate the similarity of Simplified Bayesian and
pSALSA