The effect of New Links on Google Pagerank

Download Report

Transcript The effect of New Links on Google Pagerank

The effect of New Links on
Google Pagerank
By Hui Xie
Apr , 07
Computing PageRank
Matrix representation
Let P be an nn matrix and pij be the entry at the
i-th row and j-th column.
If page i has k>0 outgoing links
pij = 1/k if page i has a link to page j
pij = 0 if there is no link from i to j
If page I has no outgoing links
pij = 1/n j=1,…,n
Google matrix
• G=cP+(1-c)(1/n)eeT
e=(1,…,1)T
• G is stochastic matrix
Ge=e
• There exists a unique column vector π such
that
πT G= πT,
πT e=1
• πT =(1-c)/n eT(I-cP)-1
Discrete Time Markov Chains
• A sequence of random variables {Xn} is
called a Markov chain if it has the Markov
property:
• States are usually labeled {(0,)1,2,…}
• State space can be finite or infinite
Transition Probability
• Probability to jump from state i to state j
• Assume stationary: independent of time
• Transition probability matrix:
P = (pij)
• Two state MC:
Side Topic: Markov Chains
• A discrete time stochastic process is a sequence
of random variables {X0, X1, …, Xn, …} where the
0, 1, …, n, … are discrete points in time.
• A Markov chain is a discrete-time stochastic
process defined over a finite (or countably infinite)
set of states S in terms of a matrix P of transition
probabilities.
• Memorylessness property: for a Markov chain
• Pr[Xt+1 = j | X0 = i0, X1 = i1, …, Xt = i] =
Pr[Xt+1 = j | Xt = i]
•
Side Topic: Markov Chains
• Let pi(t) be the probability of being in state i at time
step t.
• Let p(t) = [p0(t), p1(t), … ] be the vector of
probabilities at time t.
• For an initial probability distribution p(0), the
probabilities at time n are
•
p(n) = p(0) Pn
• A probability distribution p is stationary if p = p P
• P(Xm+n =j|Xm = i) = P(Xn =j|X0 = i) = Pn(i,j)
absorbing Markov chain
Define a discrete-time absorbing markov chain
{Xt ,t=0,1,…}with the state space {0,1,…,n}
Where transitions between the states 1,…, n are
conducted by the matrix cP, and the state 0 is
absorbing.
The transition matrix is
0 
1


 (1  c)e cP 
Random walk interpretation
• Walk starts at a uniformly chosen web page
• At each step, if currently at page p
• W/p  , go to a uniformly chosen
outneighbor of p
• W/p 1 -  , stop
• Let Nj be the total number of
visits to state j before absorption
including the visit at time t = 0
if X0 is j . Formally,
N  1
,
j  1,..., n.

j
t 0
{ X t  j}
• Then zij=(I-cP)-1ij=E(Nj|X0=I)
• Let qij be the probability of reaching
the state j before absorption if
the initial state is i. Then we
have
• Theorem Let X denote a Markov chain
with state space E. The total number of
visits to a state j∈E under the condition
that the chain starts in state i is given by
P(Nj=m|X0=j)=qjjm-1(1-qjj)
and for i!=j
P(Nj=m|X0=i)= 1-qij
if m=0
qij qjjm-1(1-qjj) if m>=1
Corollary For all i,j ∈E the relations
zij=(1-qii)-1 and zij=qijzjj hold
Outgoing links from i do not affect qji for any j!=I
So by changing the outgoing links, a page can control its
PageRank up to multiplication by a factor zii=1/(1-qii)
For 0<=qii<=c2 ,
1<=zii<=(1-c2)-1≈3.6 for c=0.85
Rank one update of google pagerank
• Page 1 with k0 old links has k1 newly
created links to page 2 to k1+1
• k=k0+k1 , p1T be the first row of matrix P
• Updated hyperlink matrix
P  P  e1uT ,
k1 1
k1 T
1
T
T
u   ei  p1
k i 2
k
• According to (9) the ranking of
page 1 increases when
For z11=1/(1-q11),
zi2=qi1z11, i>1
The above is equivalent to
Hence, the page 1 increases its ranking when it
refers to pages that are characterized by a
high value of qi1. These must be the pages that
refer to page 1 or at least belong to the same
Web community. Here by a Web community we mean
a set of Web pages that a surfer can reach from
one to another in a relatively small number of
steps.
the PageRank of page j increases if
c
k1
k1 1
z
i 2
ij
 z1 j
1 c n
pj 
zkj

n k 1
the PageRank of page j increases if
c
k1
k1 1
z
i 2
ij
 z1 j
if several new links are added then the PageRank
of page j might actually decrease even if this
page receives one of the new links.
Such situation occurs when most of newly created
links point to “irrelevant” pages.
• For instance, let j = 2 and assume
that there is no hyperlink path
from pages 3,…,k+1 to page 2.Then
zij is close to zero for i = 3,…, k
+ 1, and the PageRank of page 2
will increase only if (c/k1)z22 >
z12, which is not necessarily true,
especially if z12 and k1 are
considerably large.
Asymptotic analysis
• Let  j  min{n  , X n  j} be the stopping time of
the first visit to the state j
• Mij=E( |X
=i) be the average time needed to
j 0
reach j starting from i(mean first passage time)
Optimal Linking Strategy
• Consider a page i = 1,…,n and
assume that i has links to pages
i1,…,ik distinct from i. Further,
let mij(c) be the mean first passage
time from page i to page j for the
Google transition matrix G with
parameter c.
1
pi 
mii (c)
• outgoing links from i do not affect
mji(c) for any j!= i. Thus, by linking
from i to j , one can only alter k, this
means that the owner of the page I has
very little control over its pagerank.
The best that he can do is to link only
to one page j* such that
m j i (c)  min{m ji (c)}
•
j
*
Note that (surprisingly) the PageRank of
j* plays no role here.
• Theorem. The optimal linking
strategy for a Web page is to have
only one outgoing link pointing to
a Web page with a shortest mean
first passage time back to the
original page.
Conclusions
• Our main conclusion is that a Web
page cannot significantly
manipulate its PageRank by changing
its outgoing links.
• Furthermore, keeping a logical
hyperlink structure and linking to
a relevant Web community is the
most sensible and rewarding policy.