Methods of Computing the PageRank Vector

Download Report

Transcript Methods of Computing the PageRank Vector

Methods of
Computing the
PageRank Vector
Tom Mangan
Brief History of
Web Search
• Boolean term matching
Brief History of
Web Search
• Boolean term matching
• Sergey Brin and Larry Page
• Reputation based ranking
• PageRank
Reputation
• Count links to a page
• Weight links by how many come from a
page
• Further weight links by the reputation of
the linker
1
2
3
5
4
6
1
2
3
5
Link Matrix
4
6
Calculating Rank
Where:
= the set of all pages linking to P
= # of links from page Q
Calculating Rank
Where:
= the set of all pages linking to P
= # of links from page Q
The PageRank
Vector
Define:
where
where
From our
earlier mini-web:
where
Taken one row at a time:
where
Iterating this equation is called the
Power Method
where
Iterating this equation is called the
Power Method
and we define the
PageRank vector:
Power Method
• Convergence requires:
irreducibility (Perron-Frobenius Thm)
Definitions
• Markov chain
• The conditional probability of each
future state depends only on the
present state
• Markov matrix
• Transition matrix of a Markov chain
Transition Matrix
From our
earlier mini-web:
Markov Matrix
Properties
• Row-stochastic
• Stationary vector gives long-term
probability of each state
• All eigenvalues λ ≤ 1
not row-stochastic
Define a vector a such that:
Then we obtain a row-stochastic matrix:
or
S may or may not be reducible,
so we make one more fix:
The Google Matrix:
Now G is a positive, irreducible, row-stochastic
matrix, and the power method will converge,
but we’ve lost sparsity.
Note that:
so now the power method looks like:
Power method converges at
the same rate as
thus
1
2
3
5
Link Matrix
4
6
A Linear System
Formulation
• Amy Langville and Carl Meyer
• Exploit dangling nodes
• Solve a system instead of iterating
By Langville and Meyer,
solving the system
and letting
produces the PageRank vector
(proof omitted)
Exploiting Dangling Nodes:
Re-order the rows and
columns of H such that
Exploiting Dangling Nodes:
Re-order the rows and
columns of H such that
then
has some nice properties
that simplify solving the
linear system.
Non-singular
Source: L&M, A Reordering for the PageRank Problem
Langville and Meyer
Algorithm 1
• Re-order rows and columns so that
dangling nodes are lumped at bottom
• Solve
• Compute
• Normalize
Improvement
• In testing, Algorithm 1 reduces the time
necessary to find the PageRank vector
by a factor of 1-6
• This time is data-dependent
Further
Improvement?
• First improvement came from finding
zero rows in
• Now find zero rows in
Source: L&M, A Reordering for the PageRank Problem
Langville and Meyer
Algorithm 2
• Reorder rows and columns so that all
submatrices have zero rows at bottom
• Solve
• For i = 2 to b, compute
• Normalize
Problem with
Algorithm 2
• Finding submatrices of zero rows takes
longer than time saved in solve step
• L & M wait until all submatrices are
reordered to solve primary
Proposal
• As each submatrix is isolated, send it
out for parallel solving
Source: L&M, A Reordering for the PageRank Problem
Sources
DeGroot, M. and Schervish, M., Probability and Statistics, 3rd Ed., Addison Wesley, 2002
Langville, A. and Meyer, C., A Reordering for the PageRank Problem,
Journal of Scientific Computing, Vol. 27 No. 6, 2006
Langville, A. and Meyer, C., Deeper Inside PageRank, 2004
Lee, C., Golub, G. and Zenios, S., A Fast Two-Stage Algorithm for Computing PageRank,
undated
Rebaza, J., Lecture Notes