Transcript ppt
HEINZ NIXDORF INSTITUTE
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Search Algorithms
Winter Semester 2004/2005
22 Nov 2004
6th Lecture
Christian Schindelhauer
[email protected]
1
HEINZ NIXDORF INSTITUTE
Chapter III
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Chapter III
Searching the Web
22 Nov 2004
Search Algorithms, WS 2004/05
2
HEINZ NIXDORF INSTITUTE
Searching the Web
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Introduction
The Anatomy of a Search Engine
Google’s Pagerank algorithm
– The Simple Algorithm
– Periodicity and convergence
Kleinberg’s HITS algorithm
– The algorithm
– Convergence
The Structure of the Web
– Pareto distributions
– Search in Pareto-distributed graphs
Search Algorithms, WS 2004/05
3
Overview Search Engines
HEINZ NIXDORF INSTITUTE
http://www.searchengineshowdown.com/
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
(March 2002)
Number of documents
Search Engine
Showdown
Claim
Estimate
(millions)
(millions)
Google
WiseNut
AllTheWeb
968
579
580
1,500
1,500
507
Northern Light
AltaVista
417
397
358
500
Hotbot
332
500
MSN Search
292
500
Search Algorithms, WS 2004/05
4
Overview Search Engines
HEINZ NIXDORF INSTITUTE
http://www.searchengineshowdown.com/
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
(Dez. 2002)
Number of documents
Search Engine
Showdown
Claim
Estimate
(millions)
(millions)
Google
3,033
3,083
AlltheWeb
AltaVista
2,106
1,689
2,116
1,000
WiseNut
1,453
1,500
Hotbot
1,147
3,000
MSN Search
1,018
3,000
Teoma
NLResearch
Gigablast
1,015
733
275
500
125
150
Search Algorithms, WS 2004/05
5
HEINZ NIXDORF INSTITUTE
Problems of Searching the Web
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Currently (Nov 2004) more than 8 billion = 8.000 millions web-pages
– 10.000 words cover more than 95% of each text
– much more web-pages than words
– Users hardly ever look through more than 40 results
The problem is not to find a pattern, but to find the most important pages
Problems:
– Important pages do not contain the search pattern
• www.porsche.com does not contain sports car or even car
• www.google.com does not contain web search engine
• www.airbus.com does not contain airplane
– Certain pages have nearly every word (dictionary)
– Names are misleading
• http://www.whitehouse.org/ is not the web-site of the white house
• www.theonion.com is not about vegetables
– Certain pattern can be found everywhere, e.g. page, web, windows, ...
Search Algorithms, WS 2004/05
6
HEINZ NIXDORF INSTITUTE
How to rank Web-pages
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
The main problem about searching the web is to rank the importance
Links are very helpful:
– Humans are usually introduced on purpose
– The context of the links gives some clues about the meaning of the web-page
– Pages where many people point to are of probably very important
– Most search rely on links
Other approach: Ontology of words
– Compare the combination of words with the search word
– Good for comparing text
– Difficult if single word patterns are given
Search Algorithms, WS 2004/05
7
HEINZ NIXDORF INSTITUTE
The Anatomy of a Web Search Engine
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
“The Anatomy of a Large-Scale
Hypertextual Web Search Engine”,
Sergey Brin and Lawrence Page,
Computer Networks and ISDN
Systems, Vol. 30, 1-6, p. 107-117, 1998
Design of the prototype
– Stanford University 1998
Key components:
– Web Crawler
– Indexer
– Pagerank
– Searcher
Zur Anzeige wird der QuickTime™
Dekompressor „TIFF (Unkomprimiert)“
benötigt.
Main difference between Google and
other search engines (in 1998)
– The Pagerank mechanism
Search Algorithms, WS 2004/05
8
HEINZ NIXDORF INSTITUTE
Simplified PageRank-Algorithmus
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Simplified PageRank-Algorithmus
– Rank of a wep-page R(u) [0,1]
– Important pages hand their rank down to the pages they link to.
– c is a normalisation factor such that ||R(u)||1= 1, i.e.
• the sum of all page ranks add to 1
– Predecessor nodes Bu
– sucessor nodes Fu
Search Algorithms, WS 2004/05
9
The Simplifed Pagerank Algorithm and an
example
HEINZ NIXDORF INSTITUTE
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Search Algorithms, WS 2004/05
10
HEINZ NIXDORF INSTITUTE
Matrix representaion
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
R cMR,
where R is a vector (R(1),R(2),…
R(n)) and M denotes the following n
n – Matrix
Search Algorithms, WS 2004/05
11
HEINZ NIXDORF INSTITUTE
The Simplified Pagerank Algorithm
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Does it converge?
If it converges, does it converge to a single result?
Is the result reasonable?
Search Algorithms, WS 2004/05
12
HEINZ NIXDORF INSTITUTE
The Eigenvector and Eigenvalue of the Matrix
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
For vector x and n n-matrix and a number λ:
– If M x = λ x then x is called the eigenvector and λ the eigen-value
Every n n-matrix M has at most n eigenvalues
Compute the eigenvalues by eigen-decomposition
Mx=λx
(M - I λ) x = 0,
where I is the identity matrix
– This equality has only non-trivial solutions if
Det(M - I λ) = 0
– This leads to a polynomial equation of degree n, which has always n
solutions λ1, λ2, ..., λn
• (Fundamental theorem of algebra)
– Solving the linear equations (M - I λi) x = 0 lead to the eigenvectors
The eigenvektor of the matrix is a fix point of the recursion of the simplified
pagerank algorithm
Search Algorithms, WS 2004/05
13
HEINZ NIXDORF INSTITUTE
Stochastic Matrices
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Consider n discrete states and a sequence of random variable X1, X2, ...
over this set of states
The sequence X1, X2, ... is a Markov chain if
A stochastic matrix M is the transition matrix for a finite Markov chain, also
called a Markov matrix:
– Elements of the matrix M must be real numbers of [0, 1].
– The sum of all column in M is 1
Observation for the matrix M of the simpl. pagerank algorithm
– M is stochastic if all nodes have at least one outgoing link
Search Algorithms, WS 2004/05
14
HEINZ NIXDORF INSTITUTE
The Random Surfer
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Consider the following algorithm
– Start in a random web-page according to a probability distribution
– Repeat the following for t rounds
• If no link is on this page, exit and produce no output
• Uniformly and randomly choose a link of the web-page
• Follow that link and go to this web-page
– Output the web-page
Lemma
The probability that a web-page i is output by the random surfer after t rounds
started with probability distribution x1, .., xn is described by the i-th entry of the
output of the simplified Pagerank-algorithm iterated for t rounds without
normalization.
Proof follows applying the definition of Markov chains
Search Algorithms, WS 2004/05
15
HEINZ NIXDORF INSTITUTE
Eigenvalues of Stochastic Matrices
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Notations
– Die L1-Norm of a vector x is defined as
– x0, if for all i: xi 0
– x0, if for all i: xi 0
Lemma
For every stochastic matrix M and every vector x we have
• || M x ||1 || x ||1
• || M x ||1 = || x ||1, if x0 or x0
Eigenvalues of M |i|
1
Theorem
For every stochastic matrix M there is an eigenvector x with eigenvalue
1 such that x 0 and ||x||1 = 1
Search Algorithms, WS 2004/05
16
HEINZ NIXDORF INSTITUTE
The problem of periodicity - Example
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Search Algorithms, WS 2004/05
17
HEINZ NIXDORF INSTITUTE
Periodicity - Example 2
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Search Algorithms, WS 2004/05
18
HEINZ NIXDORF INSTITUTE
Periodic Matrices
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Definition
– A square matrix M such that the matrix power Mk=M for k a positive integer is
called a periodic matrix.
– If k is the least such integer, then the matrix is said to have period k.
– If k = 1, then M2 = M and M is called idempotent.
Fact
– For non-periodic matrices there are vectors x, such that limk Mk x does not
converge.
Definition
– The directed graph G=(V,E) of a n x n-matrix consistis of the node set
V={1,..., n} and has edges
• E = {(i,j) | Mij 0}
– A path is a sequence of edges (u1,u2),(u2,u3),(u3,u4),..,(ut,ut+1) of a graph
– A graph cycle is a path where the start node is the end node
– A strongly connected subgraph S is a maximum sub-graph such that every
graph cycle starting and ending in a node of S is contained in S.
Search Algorithms, WS 2004/05
19
Necessary and Sufficient Conditions for
Periodicity
HEINZ NIXDORF INSTITUTE
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Theorem (necessary condition)
– If the stochastic matrix M is periodic with period t2,
then for the graph G of M there exists a strongly connected subgraph S of at
least two nodes such that every directed graph cycle within S has a length of
the form i t for natural number i.
Theorem (sufficient condition)
– Let the graph consist of one strongly connected subgraph and
– let L1,L2, ..., Lm be the lengths all directed graph cycles of maximal length n
– Then M is non-periodic if and only if gcd(L1,L2, ..., Lm) = 1
Notation:
– gcd(L1,L2, ..., Lm) = greatest common divisor of numbers L1,L2, ..., Lm
Corollary
– If the graph is strongly connected and there exists a graph cycly of length 1
(i.e. a loop), then M is non-periodic.
Search Algorithms, WS 2004/05
20
HEINZ NIXDORF INSTITUTE
Disadvantages of the Simplified PagerankAlgorithm
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
The Web-graph has sinks, i.e. pages without links
M is not a stochastic matrix
The Web-graph is periodic
Convergence is uncertain
The Web-graph is not strongly connected
Several convergence vectors possible
Rank-sinks
– Strongly connected subgraphs absorb all weight of the predecessors
– All predecessors pointing to a web-page loose their weight.
Search Algorithms, WS 2004/05
21
HEINZ NIXDORF INSTITUTE
The (non-simplified) Pagerank-Algorithm
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Add to a sink links to all web-pages
Uniformly and randomly choose a web-page
– With some probability q < 1 perform a step of the simplified Pagerank
algorithm
– With probability 1-q start with the first step (and choose a random web-page)
Note M ist stochastic
Search Algorithms, WS 2004/05
22
HEINZ NIXDORF INSTITUTE
Properties of the Pagerank-Algorithm
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Graph der Matrix is strongly connected
There are graph cycles of length 1
Theorem
In non-periodic matrices of strongly connected graphs the Markov-chain
converges to a unique eigenvector with eigenvalue 1.
PageRank converges to this unique eigenvector
Search Algorithms, WS 2004/05
23
HEINZ NIXDORF INSTITUTE
University of Paderborn
Algorithms and Complexity
Christian Schindelhauer
Thanks for your attention
End of 6th lecture
Next lecture:
Mo 29 Nov 2004, 11.15 am, FU 116
Next exercise class:
Mo 22 Nov 2004, 1.15 pm, F0.530 or
We 24 Nov 2004, 1.00 pm, E2.316
24