Transcript Slide 1

Exploiting Web Matrix Permutations
to Speedup PageRank Computation
Presented by: Aries Chan, Cody Lawson, and Michael Dwyer
Introduction

Internet Statistics
- 151 million active internet users as of January 2004
- 76% used a search engine at least once during the month
- Average time spent searching was about 40 minutes

Introduction

Search Engines
- Most common means of accessing the Web
- Easiest method of organizing and accessing information
- Therefore, high-quality, usable search engines are very important

Rank | Provider                      | Searches (000) | Share of Searches
---- | ----------------------------- | -------------- | -----------------
 -   | All Search                    | 10,812,734     | 100.0%
 1   | Google Search                 | 6,986,580      | 64.6%
 2   | Yahoo! Search                 | 1,726,060      | 16.0%
 3   | MSN/Windows Live/Bing Search  | 1,156,415      | 10.7%

Introduction

Basic Idea of a Search Engine
- Scanning the Web Graph
  - Using crawlers to create an index of the information
- Ranking
  - Organizing and ordering information in a usable way

Introduction

Webpage Ranking Problems
- Personalization
  - http://www.google.com/history/?hl=en
- Updating – keeping the order up to date
  - 25% of links are changed in one week
  - 5% "new content" in one week
  - 35% of the entire web changed in eleven weeks
- Reducing computation time

Introduction

Accelerating the PageRank
- Use compression to fit the Web Graph into main memory
- Use a sequence of scans of the Web Graph to compute efficiently in external memory
- Reduce computation time through numerical methods

Introduction

Reduce computation time through numerical methods
- Use iterative methods such as Gauss-Seidel and Jacobi (Arasu et al. and Bianchini et al.)
- Subtract off estimates of non-principal eigenvectors (Kamvar et al.)
- Sort the graph lexicographically by URL, resulting in an approximate block structure (Kamvar et al.)
- Split the problem in two: dangling nodes and non-dangling nodes (Lee et al.)
- Others have been working on ways to update only the ranks of nodes influenced by Web Graph changes

Introduction

Del Corso, Gulli, and Romani (authors of the paper)
- Numerical optimization of PageRank
  - View the PageRank computation as a linear system
    - Transform a dense problem into one which uses a sparse matrix
    - Treat dangling nodes in a way that naturally adapts to the random surfer model
  - Exploit web matrix permutations
    - Increase data locality and reduce the number of iterations necessary to solve the problem

Google's PageRank Overview
- Web as an oriented graph
- The random surfer
- v_i = Σ_j p_ji v_j
  (the sum of the PageRanks of the nodes linking to i, weighted by the transition probabilities)
- Equilibrium distribution: v^T = v^T H
  (v is the left eigenvector of H with eigenvalue 1)

Problems with the ideal model
- Dangling nodes (trap the user)
  - Fix: impose a random jump to every other page
  - B = H + a u^T
- Cyclic paths in the Web Graph (reducibility)
  - Brin and Page suggested adding artificial transitions (a low-probability jump to all nodes)
  - G = α B + (1 − α) e u^T
- Notation: a is the dangling-node indicator vector, u the personalization (jump) vector, e the vector of all ones, and α the damping factor.

Current PageRank Solution
- Since G is just a rank-one modification of αH, the power method can take advantage of the sparsity of the matrix H.
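
A sketch of that observation, reusing the hypothetical 4-node example above: each iteration applies only the sparse H and handles the rank-one corrections analytically, so the dense G is never formed. The function name and tolerance are assumptions of this sketch, not from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_power(H, a, u, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for v^T = v^T G without materializing the dense Google matrix G."""
    n = H.shape[0]
    HT = H.T.tocsr()                      # so v^T H can be computed as HT @ v
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # v^T G = alpha*v^T H + (alpha*(v^T a) + (1 - alpha)*(v^T e)) * u^T
        new = alpha * (HT @ v) + (alpha * (v @ a) + (1 - alpha) * v.sum()) * u
        if np.abs(new - v).sum() < tol:
            return new
        v = new
    return v

# Hypothetical 4-node graph from the previous sketch (node 3 is dangling).
H = csr_matrix(np.array([[0, .5, .5, 0],
                         [0, 0, 1., 0],
                         [.5, 0, 0, .5],
                         [0, 0, 0, 0]]))
a = np.array([0., 0., 0., 1.])
u = np.full(4, 0.25)
print(pagerank_power(H, a, u))
```
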
Google's PageRank eigenproblem as a linear system
- Expand
  - v^T G = v^T  (eigenproblem)
  - G = α(H + a u^T) + (1 − α) e u^T  (expansion of the Google matrix)
  - α v^T (H + a u^T) + (1 − α) v^T e u^T = v^T
- Restructure
  - Take S = I − α H^T − α u a^T
  - and use v^T e = 1
  - S v = (1 − α) u  (after taking the transpose and rearranging)
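
A quick numerical check of this derivation on the same toy data; the dense solve below is only to make the algebra concrete, since it is precisely the dense α u a^T term that the next slides eliminate.

```python
import numpy as np

# Same hypothetical 4-node example as before.
H = np.array([[0, .5, .5, 0],
              [0, 0, 1., 0],
              [.5, 0, 0, .5],
              [0, 0, 0, 0]])
a = np.array([0., 0., 0., 1.])
u = np.full(4, 0.25)
alpha = 0.85

# Linear-system form: S v = (1 - alpha) u, with S = I - alpha*H^T - alpha*u a^T.
S = np.eye(4) - alpha * H.T - alpha * np.outer(u, a)
v = np.linalg.solve(S, (1 - alpha) * u)

print(v, v.sum())   # v is the PageRank vector; it already sums to 1
```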

Dangling Nodes Problems

Dangling Nodes
- Pages with no links to other pages
- Pages whose existence is inferred but that the crawler has not yet reached
- According to a 2001 sample, approximately 76% of nodes are dangling nodes.
Dangling Nodes Problems

Natural Model
- B = H + a u^T
- A dangling node jumps with probability 1 to any other node

Drastic Model
- Completely removes dangling nodes
- Problems:
  - The dangling nodes themselves are not ranked
  - Removing nodes creates new dangling nodes

Self-loop Model
- B = H + F, where F_ij = 1 if i = j and i is dangling, 0 otherwise
- Still row-stochastic and similar to the natural model
- Problem: gives unreasonably high rank to the dangling nodes
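
A small sketch contrasting the natural and self-loop fixes on the same hypothetical 4-node graph (node 3 dangling); the drastic model would instead delete row and column 3 outright.

```python
import numpy as np

H = np.array([[0, .5, .5, 0],
              [0, 0, 1., 0],
              [.5, 0, 0, .5],
              [0, 0, 0, 0]])
a = np.array([0., 0., 0., 1.])      # dangling indicator (node 3)
u = np.full(4, 0.25)

B_natural = H + np.outer(a, u)      # dangling row becomes u: jump anywhere
F = np.diag(a)                      # F_ij = 1 if i == j and i is dangling
B_selfloop = H + F                  # dangling row becomes a self-loop

# Both are row-stochastic, but the self-loop keeps all mass parked on node 3.
print(B_natural[3], B_selfloop[3])
```
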
Which model is the best?
- The natural model is the most "accurate".
- Problem
  - It gives a much denser matrix B
  - It is at least partially for this reason that the problem is usually approached as an eigenproblem, so as to exploit the sparsity of H
- Can we have an equally lightweight iterative approach?

Iterative Approach with Sparsity
- Sparse matrix R
  - R = I − α H^T
  - The PageRank vector v is obtained by solving
  - R y = u, then v = γ y with γ chosen so that ||v||_1 = 1
- Why does this work?
  - Since S = R − α u a^T
  - and S v = (1 − α) u,
  - (R − α u a^T) v = (1 − α) u
  - Use the Sherman-Morrison formula to calculate the inverse

Iterative Approach with Sparsity
- We get
  - (R − α u a^T)^(−1) = R^(−1) + (R^(−1) u a^T R^(−1)) / (1/α − a^T R^(−1) u)
  - v = (1 − α) (1 + (a^T y) / (1/α − a^T y)) y
- So v = γ y, with γ = (1 − α) (1 + (a^T y) / (1/α − a^T y))
- The vector v is our PageRank vector, and it was computed using only the sparse matrix R
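
Putting the last two slides together, a sketch on the same toy data (α = 0.85 assumed): solve the sparse system R y = u, then recover PageRank by normalizing y; the Sherman-Morrison scalar γ is computed only to confirm that it coincides with that normalization.

```python
import numpy as np
from scipy.sparse import identity, csr_matrix
from scipy.sparse.linalg import spsolve

# Same hypothetical 4-node graph; node 3 is dangling.
H = csr_matrix(np.array([[0, .5, .5, 0],
                         [0, 0, 1., 0],
                         [.5, 0, 0, .5],
                         [0, 0, 0, 0]]))
a = np.array([0., 0., 0., 1.])
u = np.full(4, 0.25)
alpha = 0.85

R = identity(4, format="csr") - alpha * H.T     # sparse: no dense u a^T term
y = spsolve(R.tocsr(), u)                       # solve R y = u

gamma = (1 - alpha) * (1 + (a @ y) / (1 / alpha - a @ y))   # Sherman-Morrison scalar
v = y / y.sum()                                 # same vector as gamma * y, since ||v||_1 = 1

print(np.allclose(gamma * y, v), v)
```
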
Exploiting Web Matrix Permutations
- Use a variety of "cheap" operators to permute the web graph in an organized way, in the hope of:
  - increasing data locality
  - reducing the number of iterations needed to solve the problem
- Explore different iterative methods in order to solve the problem more quickly.
- Compare the performance of the different iterative methods on specifically permuted web graphs.

Permutation Strategies
- The following operations were chosen based on their "limited impact on the computational cost" (Del Corso et al.); see the sketch after the list.
  - O – orders nodes by increasing out-degree
  - Q – orders nodes by decreasing out-degree
  - X – orders nodes by increasing in-degree
  - Y – orders nodes by decreasing in-degree
  - B – orders nodes according to their BFS (Breadth-First Search) order of visit
  - T – transposes the matrix
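
A sketch of what these operators could look like on a sparse adjacency matrix: the orderings are returned as index arrays and the variable names mirror the slide's letters. The toy matrix, the BFS start node, and the helper functions are assumptions of this sketch.

```python
import numpy as np
from scipy.sparse import csr_matrix

def permute(A, p):
    """Symmetrically reorder rows and columns of A by permutation p."""
    return A[p, :][:, p]

def bfs_order(A, start=0):
    """Order of first visit in a BFS over the directed graph of A (CSR)."""
    n = A.shape[0]
    seen, order, queue = {start}, [start], [start]
    while queue:
        i = queue.pop(0)
        for j in A.indices[A.indptr[i]:A.indptr[i + 1]]:   # out-neighbors of i
            if j not in seen:
                seen.add(j); order.append(j); queue.append(j)
    order += [i for i in range(n) if i not in seen]        # unreachable nodes last
    return np.array(order)

A = csr_matrix(np.array([[0, 1, 1, 0],
                         [0, 0, 1, 0],
                         [1, 0, 0, 1],
                         [0, 0, 0, 0]]))   # unweighted adjacency of the toy graph

out_deg = np.diff(A.indptr)                # nonzeros per row
in_deg = np.diff(A.tocsc().indptr)         # nonzeros per column

O = np.argsort(out_deg)                    # increasing out-degree
Q = np.argsort(-out_deg)                   # decreasing out-degree
X = np.argsort(in_deg)                     # increasing in-degree
Y = np.argsort(-in_deg)                    # decreasing in-degree
B = bfs_order(A)                           # BFS visit order
T = A.T                                    # transposition operator

print(permute(A, B).toarray())             # matrix reordered by BFS visit order
```
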
Permutation Strategies (cont.)
- The O, Q, X, Y, and T operators by themselves still leave a full matrix.
- The B operator conveniently arranges R into a lower block triangular structure, thanks to the BFS order of visit.
- Combining these operations on R produces the following structures of the permuted web matrix.

Permutation Strategies (cont.)

Visual representations of the permuted web graph.

Iterative Strategies
- Power Method
  - Computes the dominant eigenvector.
- Jacobi Method
  - Starting from an initial guess, approximates the solution to the linear system of equations; each successive iteration uses the previous approximate solution as its next guess until a degree of convergence is reached.*
- Gauss-Seidel Method (and Reverse Gauss-Seidel)
  - A modification of the Jacobi Method in which the solution of each successive equation in the system uses the values already updated earlier in the same sweep.*

* http://college.cengage.com/mathematics/larson/elementary_linear/5e/students/ch08-10/chap_10_2.pdf
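
A compact sketch of the two stationary iterations on a generic system Ax = b, applied densely (for readability) to the R from the earlier sketches; the paper runs these sweeps on the sparse, permuted matrices. The function names and iteration counts are assumptions of this sketch.

```python
import numpy as np

def jacobi(A, b, iters=100):
    """Jacobi: every component is updated from the previous iterate only."""
    D = np.diag(A)
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - (A @ x - D * x)) / D      # x_i = (b_i - sum_{j!=i} a_ij x_j) / a_ii
    return x

def gauss_seidel(A, b, iters=100, reverse=False):
    """Gauss-Seidel: each component reuses values already updated in this sweep."""
    n = len(b)
    x = np.zeros_like(b)
    order = range(n - 1, -1, -1) if reverse else range(n)
    for _ in range(iters):
        for i in order:
            s = A[i, :] @ x - A[i, i] * x[i]
            x[i] = (b[i] - s) / A[i, i]
    return x

# Toy system: the R = I - alpha*H^T from the earlier sketches.
alpha = 0.85
H = np.array([[0, .5, .5, 0], [0, 0, 1., 0], [.5, 0, 0, .5], [0, 0, 0, 0]])
R = np.eye(4) - alpha * H.T
u = np.full(4, 0.25)
print(jacobi(R, u), gauss_seidel(R, u), gauss_seidel(R, u, reverse=True))
```
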
Iterative Strategies (cont.)
- Further exploration led to iterative methods based on the distinct block structure of certain web graph permutations.
- DN (or DNR) Method
  - The permuted matrix R_OT has the property that its lower diagonal block coincides with the identity matrix, so the matrix is easily partitioned into non-dangling and dangling portions. The non-dangling part is then solved by Gauss-Seidel (or Reverse Gauss-Seidel, respectively).
- LB/UB/LBR/UBR Methods
  - Use the Gauss-Seidel or Reverse Gauss-Seidel method to solve the individual blocks of the triangular block matrices produced by the B operator (see the sketch below).
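
A sketch of the block idea: forward substitution over a lower block triangular partition, with a solver applied to each diagonal block (Gauss-Seidel in the LB method; np.linalg.solve stands in here). The two-block example and its numbers are made up; in the DN method the second diagonal block is simply the identity.

```python
import numpy as np

def solve_lower_block_triangular(blocks, b_parts, block_solver):
    """Forward-substitute over a lower block triangular system.

    blocks[i][j] holds block (i, j) for j <= i; block_solver(A, b) solves
    one diagonal system (e.g. by Gauss-Seidel, as in the LB method).
    """
    x_parts = []
    for i, bi in enumerate(b_parts):
        rhs = bi.copy()
        for j in range(i):                      # subtract already-solved blocks
            rhs -= blocks[i][j] @ x_parts[j]
        x_parts.append(block_solver(blocks[i][i], rhs))
    return np.concatenate(x_parts)

# Tiny two-block illustration (values invented for the sketch).
R11 = np.array([[1.0, -0.4], [-0.3, 1.0]])
R21 = np.array([[-0.2, 0.0], [0.0, -0.1]])
R22 = np.eye(2)                                 # e.g. the dangling block of the DN method
blocks = [[R11], [R21, R22]]
b_parts = [np.full(2, 0.25), np.full(2, 0.25)]

y = solve_lower_block_triangular(blocks, b_parts, np.linalg.solve)
print(y)
```
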
Results
Results (cont.)
- For both the Power Method and the Jacobi Method, the number of Mflops does not depend on permutations of the web matrix ("the small differences in numerical data are due to the finite precision" [Del Corso et al.]).
- The Jacobi Method (applied to the matrix R) is only a slight improvement (about 3%) over the Power Method (applied to the matrix G).

Results (cont.)
- The Gauss-Seidel and Reverse Gauss-Seidel Methods reduced Mflops by around 40% and running time by around 45% on the full matrix, compared to the Power Method on the full matrix.
- In particular, the Reverse Gauss-Seidel Method applied to the permuted matrix R_YTB reduced the number of Mflops by 51% and the running time by 82% compared to the Power Method on the full matrix.

Results (cont.)
- The block methods achieved even better results.
- In particular, the best overall reduction in computation time and number of Mflops was achieved by the LBR method on the permuted matrix R_QTB:
  - 58% reduction in Mflops
  - 89% reduction in running time

Conclusion
- Objective:
  - Accelerate Google's PageRank using numerical methods
- Contribution:
  - Viewed the web matrix problem as a sparse linear system
  - Formalized a new method for treating dangling nodes
  - Explored new iterative methods and applied them to web matrix permutations
- Achievement:
  - Roughly 1/10 of the computation time
  - Over 50% reduction in Mflops

References
- G.M. Del Corso, A. Gulli, F. Romani, "Exploiting Web Matrix Permutations to Speedup PageRank Computation"
- Nielsen MegaView Search (http://en-us.nielsen.com/rankings/insights/rankings/internet)