Link Analysis in Web Search


What is page importance?
• Page importance is hard to define in a single way that
satisfies everyone. There are however some
desiderata:
– It should be sensitive to
• The query (or at least the topic of the query)
• The user (or at least the user population)
• The link structure of the web
• The amount of accesses the page gets
– (How about the “eloquence” or “informativeness” of the page?)
– It should be stable w.r.t. small random changes in the
network link structure
– It shouldn’t be easy to subvert with intentional changes to the
link structure
Desiderata for link-based ranking
• A page that is referenced by a lot of important pages (has
more back links) is more important (Authority)
– A page referenced by a single important page may be more
important than one referenced by five unimportant pages
• A page that references a lot of important pages is also
important (Hub)
• “Importance” can be propagated (these are different
notions of importance)
– Your importance is the weighted sum of the importance
conferred on you by the pages that refer to you
– The importance you confer on a page may be inversely
proportional to how many other pages you refer to (cite)
• (Also what you say about them when you cite them!)
Qn: Can we assign consistent authority/hub values
to pages?
Authorities and Hubs
as mutually reinforcing properties
• Authorities and hubs related to the same query
tend to form a bipartite subgraph of the web
graph (hubs on one side, authorities on the other).
• Suppose each page has an authority score a(p)
and a hub score h(p)
Authority and Hub Pages
I: Authority Computation: for each page p:
   a(p) = Σ h(q)  over all q such that (q, p) ∈ E
   (i.e., sum the hub scores of the pages q1, q2, q3, … that point to p)
O: Hub Computation: for each page p:
   h(p) = Σ a(q)  over all q such that (p, q) ∈ E
   (i.e., sum the authority scores of the pages q1, q2, q3, … that p points to)
A set of simultaneous equations… Can we solve these?
Authority and Hub Pages (8)
Matrix representation of operations I and O.
Let A be the adjacency matrix of SG: entry (p, q) is
1 if p has a link to q, else the entry is 0.
Let A^T be the transpose of A.
Let h_i be the vector of hub scores after i iterations.
Let a_i be the vector of authority scores after i
iterations.
Operation I: a_i = A^T h_{i-1}, so a_i = (A^T A) a_{i-1} = (A^T A)^i a_0
Operation O: h_i = A a_i, so h_i = (A A^T) h_{i-1} = (A A^T)^i h_0
Normalize after every multiplication
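As an aside, here is a minimal sketch (not from the slides) of operations I and O in matrix form using NumPy; the adjacency-matrix convention and the normalize-after-every-multiplication step follow the definitions above, while the function and variable names are just illustrative.

```python
import numpy as np

def hits(A, iterations=50):
    """Iterate operations I and O on adjacency matrix A (A[p, q] = 1 if p links to q)."""
    n = A.shape[0]
    a = np.ones(n)               # authority scores a_0
    h = np.ones(n)               # hub scores h_0
    for _ in range(iterations):
        a = A.T @ h              # Operation I: a(p) = sum of h(q) over q with (q, p) in E
        h = A @ a                # Operation O: h(p) = sum of a(q) over q with (p, q) in E
        a /= np.linalg.norm(a)   # normalize after every multiplication
        h /= np.linalg.norm(h)
    return a, h
```

With this L2 normalization, repeated application converges to the principal eigen vectors of A^T A (authorities) and A A^T (hubs), as discussed below.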
Authority and Hub Pages (11)
Example (graph with hubs q1, q2, q3 and pages p1, p2): Initialize all scores to 1.
1st Iteration:
I operation: a(q1) = 1, a(q2) = a(q3) = 0,
a(p1) = 3, a(p2) = 2
O operation: h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0,
a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645,
h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0
Authority and Hub Pages (12)
After 2 Iterations:
a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791,
a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371,
h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0
After 5 Iterations:
a(q1) = a(q2) = a(q3) = 0,
a(p1) = 0.788, a(p2) = 0.615,
h(q1) = 0.657, h(q2) = 0.369,
h(q3) = 0.657, h(p1) = h(p2) = 0
What happens if you multiply a vector by a
matrix?
• In general, when you multiply a vector by a matrix, the vector
gets “scaled” as well as “rotated”
– ..except when the vector happens to be in the direction of one of the eigen
vectors of the matrix
– .. in which case it only gets scaled (stretched)
• A (symmetric square) matrix has all real eigen values, and the
values give an indication of the amount of stretching that is done
for vectors in that direction
• The eigen vectors of the matrix define a new ortho-normal space
– You can model the multiplication of a general vector by the matrix in terms of
• First decompose the general vector into its projections in the eigen vector directions
– ..which means just take the dot product of the vector with the (unit) eigen vector
• Then multiply the projections by the corresponding eigen values—to get the new
vector.
– This explains why power method converges to principal eigen vector..
• ..since if a vector has a non-zero projection in the principal eigen vector direction,
then repeated multiplication will keep stretching the vector in that direction, so that
eventually all other directions vanish by comparison..
Optional
(why) Does the procedure converge?
x_1 = M x_0          (M = A A^T)
x_2 = M x_1 = M^2 x_0
...
x_k = M^k x_0

Write M = E Λ E^{-1}, where E = [e_1 e_2 ... e_n] holds the (unit) eigen vectors
and Λ = diag(λ_1, λ_2, ..., λ_n) with λ_1 ≥ λ_2 ≥ ... ≥ λ_n.
Then M^2 = E Λ E^{-1} E Λ E^{-1} = E Λ^2 E^{-1}, and in general M^k = E Λ^k E^{-1}.
Decompose x_0 = c_1 e_1 + c_2 e_2 + ... + c_n e_n.
Then M^k x_0 = c_1 λ_1^k e_1 + c_2 λ_2^k e_2 + ... + c_n λ_n^k e_n ≈ c_1 λ_1^k e_1 for large k,
so after normalization x_k approaches the principal eigen vector e_1.
The rate of convergence depends on the “eigen gap” λ_1 − λ_2.
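A small sketch (my addition) of the power iteration analyzed above; how fast the direction stabilizes is governed by the eigen gap. Names are illustrative.

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeatedly apply M and normalize; converges to the principal eigen vector."""
    x = np.ones(M.shape[0])
    for _ in range(max_iter):
        x_new = M @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:   # stop once the direction stops changing
            break
        x = x_new
    return x_new

# e.g., hub scores are the principal eigen vector of M = A @ A.T
```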
Can we power iterate to get other
(secondary) eigen vectors?
• Yes—just find a matrix M2 such that M2 has
the same eigen vectors as M, but the eigen
value corresponding to the first eigen vector
e1 is zeroed out:
M2 = M − λ1 e1 e1'
Why? 1. M2 e1 = 0
2. If e2 is the second eigen vector of M,
then it is also an eigen vector of M2
• Now do power iteration on M2
• Alternately start with a random vector v, find
a new vector v’ = v – (v.e1)e1, and do
power iteration on M with v’
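A minimal sketch (my addition) of the deflation idea for a symmetric M (such as A Aᵀ), reusing the power_iteration sketch above; names are illustrative.

```python
import numpy as np

def second_eigenvector(M):
    """Deflation: remove the principal component, then power-iterate again."""
    e1 = power_iteration(M)
    lam1 = e1 @ M @ e1                    # Rayleigh quotient gives lambda_1 for unit e1
    M2 = M - lam1 * np.outer(e1, e1)      # M2 = M - lambda_1 * e1 e1'
    return power_iteration(M2)
```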
Authority and Hub Pages
Algorithm (summary)
submit q to a search engine to obtain the root
set S;
expand S into the base set T;
obtain the induced subgraph SG(V, E) using T;
initialize a(p) = h(p) = 1 for all p in V;
for each p in V until the scores converge
{ apply Operation I;
apply Operation O;
normalize a(p) and h(p); }
return pages with top authority & hub scores;
Base set computation
can be made easy by storing the link
structure of the Web in advance
Link structure table (built during crawling)
--Most search engines serve this
information now (e.g. Google’s
link: search)

parent_url   child_url
url1         url2
url1         url3
Handling “spam” links
Should all links be equally treated?
Two considerations:
• Some links may be more
meaningful/important than other links.
• Web site creators may trick the system to
make their pages more authoritative by
adding dummy pages pointing to their
cover pages (spamming).
Handling Spam Links (contd)
• Transverse link: a link between pages with
different domain names.
(Domain name: the first level of the URL of a page.)
• Intrinsic link: a link between pages with the
same domain name.
Transverse links are more important than
intrinsic links.
Two ways to incorporate this:
1. Use only transverse links and discard
intrinsic links.
2. Give lower weights to intrinsic links.
Considering link “context”
For a given link (p, q), let V(p, q) be the vicinity
(e.g., ±50 characters) of the link.
• If V(p, q) contains terms in the user query
(topic), then the link should be more useful
for identifying authoritative pages.
• To incorporate this: in the adjacency matrix A,
make the weight associated with link (p, q)
be 1 + n(p, q), where n(p, q) is the number of
terms in V(p, q) that appear in the query.
• Alternately, consider the “vector similarity” between
V(p, q) and the query Q.
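A short sketch (my addition) combining the two preceding ideas: intrinsic (same-domain) links get a lower weight, and the weight 1 + n(p, q) from this slide rewards query terms in the link’s vicinity; the helper name and the 0.5 down-weighting factor are illustrative assumptions.

```python
from urllib.parse import urlparse

def link_weight(p_url, q_url, vicinity_text, query_terms, intrinsic_weight=0.5):
    """Weight for edge (p, q) in the adjacency matrix."""
    n_pq = sum(1 for t in query_terms if t.lower() in vicinity_text.lower())
    weight = 1 + n_pq                                    # 1 + n(p, q) from the slide
    if urlparse(p_url).netloc == urlparse(q_url).netloc:
        weight *= intrinsic_weight                       # option 2: lower weight for intrinsic links
    return weight
```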
Evaluation
Sample experiments:
• Rank based on large in-degree (or backlinks)
query: game
Rank   In-degree   URL
1      13          http://www.gotm.org
2      12          http://www.gamezero.com/team-0/
3      12          http://ngp.ngpc.state.ne.us/gp.html
4      12          http://www.ben2.ucla.edu/~permadi/gamelink/gamelink.html
5      11          http://igolfto.net/
6      11          http://www.eduplace.com/geo/indexhi.html
• Only pages 1, 2 and 4 are authoritative game pages.
Evaluation
Sample experiments (continued)
• Rank based on large authority score.
query: game
Rank   Authority   URL
1      0.613       http://www.gotm.org
2      0.390       http://ad/doubleclick/net/jump/gamefan-network.com/
3      0.342       http://www.d2realm.com/
4      0.324       http://www.counter-strike.net
5      0.324       http://tech-base.com/
6      0.306       http://www.e3zone.com
• All pages are authoritative game pages.
Authority and Hub Pages (19)
Sample experiments (continued)
• Rank based on large authority score.
query: free email
Rank   Authority   URL
1      0.525       http://mail.chek.com/
2      0.345       http://www.hotmail.com/
3      0.309       http://www.naplesnews.net/
4      0.261       http://www.11mail.com/
5      0.254       http://www.dwp.net/
6      0.246       http://www.wptamail.com/
• All pages are authoritative free email pages.
Tyranny of Majority
(Figure: a graph with two disconnected communities: hubs 1, 2, 3
pointing to pages 4 and 5, and hubs 6, 7 pointing to page 8.)
Which do you think are
authoritative pages?
Which are good hubs?
Intuitively, we would say
that 4, 8, 5 will be authoritative
pages and 1, 2, 3, 6, 7 will be
hub pages.
BUT the power iteration will show that
only 4 and 5 have non-zero authorities
[.923 .382]
and only 1, 2 and 3 have non-zero hubs
[.5 .7 .5]
Tyranny of Majority (explained)
Suppose h0 and a0 are all initialized to 1, and consider two disconnected
star graphs: hubs p1 … pm all pointing to page p, and hubs q1 … qn all
pointing to page q, with m > n.

I operation:   a1(p) = m,   a1(q) = n
Normalized:    a1(p) = m / sqrt(m^2 + n^2),   a1(q) = n / sqrt(m^2 + n^2)
O operation:   h1(pi) = m / sqrt(m^2 + n^2),  h1(qi) = n / sqrt(m^2 + n^2)

After the next I operation:  a2(p) ∝ m^2,  a2(q) ∝ n^2,  so
   a2(q) / a2(p) = (n / m)^2
and in general
   ak(q) / ak(p) = (n / m)^k  →  0   as k grows (since m > n).

So the smaller community’s authority scores are driven to zero, even though
that community is perfectly good on its own.
Impact of Bridges..
(Figure: the same two communities as before, now bridged by a new
page 9 that points into both.)
When the graph is disconnected,
only 4 and 5 have non-zero authorities
[.923 .382]
and only 1, 2 and 3 have non-zero hubs
[.5 .7 .5].
When the components are bridged by
adding one page (9),
the authorities change:
only 4, 5 and 8 have non-zero authorities
[.853 .224 .47],
and only 1, 2, 3, 6, 7 and 9 have non-zero hubs
[.39 .49 .39 .21 .21 .6].
Finding minority Communities
• How to retrieve pages from smaller communities?
A method for finding pages in nth largest community:
– Identify the next largest community using the existing
algorithm.
– Destroy this community by removing links associated
with pages having large authorities.
– Reset all authority and hub values back to 1 and
calculate all authority and hub values again.
– Repeat the above n − 1 times; the next largest
community found will be the nth largest community.
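A rough sketch (my addition) of this peel-off loop, reusing the hits() sketch from earlier; interpreting “removing links associated with pages having large authorities” as dropping all links incident on the current top-authority pages is my assumption.

```python
import numpy as np

def nth_community(A, n, top_k=5):
    """Find the n-th largest community by repeatedly destroying the dominant one."""
    A = A.copy().astype(float)
    for _ in range(n - 1):
        a, h = hits(A)                    # scores are reset to 1 inside hits()
        top = np.argsort(a)[-top_k:]      # pages with the largest authorities
        A[:, top] = 0                     # remove links pointing to them
        A[top, :] = 0                     # ...and links from them
    return hits(A)                        # authorities/hubs of the n-th community
```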
Multiple Clusters on “House”
Query: House (first community)
Authority and Hub Pages (26)
Query: House (second community)
PageRank
The importance of publishing..
• A/H algorithm was
published in SODA
as well as JACM
– Kleinberg became
very famous in the
scientific community
(and got a MacArthur
Genius award)
• PageRank algorithm
was rejected from
SIGIR and was
never explicitly
published
– Larry Page never got
a genius award or
even a PhD
• (and had to be
content with being a
mere billionaire)
PageRank
(Importance as Stationary Visit Probability on a Markov
Chain)
Basic Idea:
Think of Web as a big graph. A random surfer keeps randomly clicking on the
links.
The importance of a page is the probability that the surfer finds herself on that
page
--Talk of transition matrix instead of adjacency matrix
Transition matrix M derived from adjacency matrix A
--If there are F(u) forward links from a page u,
then the probability that the surfer clicks
on any of those is 1/F(u) (Columns sum to 1. Stochastic matrix)
[M is the normalized version of A^T]
--But even a dumb user may once in a while do something other than
follow URLs on the current page..
--Idea: Put a small probability that the user goes off to a page not
pointed to by the current page.
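A small sketch (my addition) of deriving the column-stochastic transition matrix M from the adjacency matrix A, as described above; names are illustrative.

```python
import numpy as np

def transition_matrix(A):
    """M[u, v] = 1/F(v) if page v links to page u, where F(v) is v's out-degree."""
    F = A.sum(axis=1)                # forward-link counts F(u), one per page
    M = A.T.astype(float)            # M is the normalized version of A^T
    nonzero = F > 0
    M[:, nonzero] /= F[nonzero]      # columns for non-sink pages sum to 1
    return M
```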
Markov Chains & Random Surfer
Model
• Markov Chains & stationary distribution
– Necessary conditions for existence of a unique steady state
distribution: aperiodicity and irreducibility
– Irreducibility: each node can be reached from every other node
with non-zero probability
• Must not have disconnected components
» Because we can have several different steady state distributions
depending on which disconnected component we get stuck in
» Sufficient to put a low-probability link from every node to every
other node (in addition to the normal weight links corresponding
to actual hyperlinks)
• Must not have sink nodes (which have no out links)
» Because we can have several different steady state distributions
based on which sink we get stuck in
» If there are sink nodes, change them so that you can transition
from them to every other node with low probability
• The parameters of the random surfer model
– M: the way the link matrix is converted to a Markov chain
• Can make the links have differing transition probability
(e.g. query-specific links have higher prob., links in bold have
higher prob., etc.)
– c: the probability that the surfer follows the page
• The larger it is, the more the surfer sticks to what the page says
– K: the reset distribution of the surfer (a great thing to tweak)
• It is quite feasible to have m different reset distributions
corresponding to m different populations of users (or m possible
topic-oriented searches)
• It is also possible to make the reset distribution depend on other
things such as
– trust of the page [TrustRank]
– recency of the page [Recency-sensitive rank]
Computing PageRank (10)
Example: Suppose the Web graph is:
   A → C,  B → C,  C → D,  D → A,  D → B

Adjacency matrix (rows = source page, columns = destination page):
         A  B  C  D
    A  [ 0  0  1  0 ]
A = B  [ 0  0  1  0 ]
    C  [ 0  0  0  1 ]
    D  [ 1  1  0  0 ]

Transition matrix (each column v normalized by v’s out-degree):
         A  B  C  D
    A  [ 0  0  0  ½ ]
M = B  [ 0  0  0  ½ ]
    C  [ 1  1  0  0 ]
    D  [ 0  0  1  0 ]
Computing PageRank
Matrix representation
Let M be an N×N matrix and muv be the entry at
the u-th row and v-th column.
muv = 1/Nv if page v has a link to page u, where Nv is
the number of forward links of page v;
muv = 0 if there is no link from v to u.
Let Ri be the N×1 rank vector for the i-th iteration
and R0 be the initial rank vector.
Then
Ri = M × Ri-1
Computing PageRank
If the ranks converge, i.e., there is a rank vector R such
that
R = M × R,
then R is the eigenvector of matrix M with eigenvalue 1.
Convergence is guaranteed only if
• M is aperiodic (the Web graph is not one big cycle). This is practically
guaranteed for the Web.
• M is irreducible (the Web graph is strongly connected). This is usually not
true.
Computing PageRank (6)
Rank sink: A page or a group of pages is a
rank sink if it can receive rank
propagation from its parents but cannot
propagate rank to other pages.
Rank sinks cause the loss of total rank.
Example (figure): pages A, B, C, D where the pair (C, D) is a rank sink.
Computing PageRank (7)
A solution to the non-irreducibility and rank sink
problem.
• Conceptually add a link from each page v to
every page (include self).
• If v has no forward links originally, make all
entries in the corresponding column in M be
1/N.
• If v has forward links originally, replace 1/Nv in
the corresponding column by c·(1/Nv) and then
add (1−c)·(1/N) to all entries, 0 < c < 1.
Motivation comes also from the random-surfer model.
Computing PageRank (8)
Let Z be the matrix whose entries are 1/N in the columns
corresponding to sink pages and 0 otherwise, and let K be
the matrix with all entries equal to 1/N. Then
M* = c (M + Z) + (1 – c) K
• M* is irreducible.
• M* is stochastic: the sum of all entries of each
column is 1 and there are no negative entries.
Therefore, if M is replaced by M* as in
Ri = M* × Ri-1
then convergence is guaranteed and there
will be no loss of the total rank (which is 1).
Computing PageRank (10)
Example: the same Web graph as before (A → C, B → C, C → D,
D → A, D → B), with transition matrix M as shown earlier.
Computing PageRank (11)
Example (continued): Suppose c = 0.8. All
entries in Z are 0 and all entries in K are ¼.

                          0.05  0.05  0.05  0.45
M* = 0.8 (M+Z) + 0.2 K =  0.05  0.05  0.05  0.45
                          0.85  0.85  0.05  0.05
                          0.05  0.05  0.85  0.05

Compute rank by iterating R := M* × R.
MATLAB says:
R(A) = .338
R(B) = .338
R(C) = .6367
R(D) = .6052
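Below is a minimal sketch (my addition, not from the slides) that builds M* for this example in NumPy and power-iterates; the slide’s MATLAB numbers appear to be the L2-normalized eigenvector, so the result is normalized the same way for comparison. Names are illustrative.

```python
import numpy as np

# Transition matrix M for the graph A -> C, B -> C, C -> D, D -> A, D -> B
M = np.array([[0, 0, 0, 0.5],
              [0, 0, 0, 0.5],
              [1, 1, 0, 0.0],
              [0, 0, 1, 0.0]])
N, c = 4, 0.8
Z = np.zeros((N, N))                  # no sink pages in this graph
K = np.full((N, N), 1.0 / N)
M_star = c * (M + Z) + (1 - c) * K    # M* = c(M + Z) + (1 - c)K

R = np.full(N, 1.0 / N)
for _ in range(100):
    R = M_star @ R                    # R := M* x R
print(R / np.linalg.norm(R))          # ~ [0.338, 0.338, 0.637, 0.605], matching the slide
```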
Combining PR & Content similarity
Incorporate the ranks of pages into the ranking
function of a search engine.
• The ranking score of a web page can be a
weighted sum of its regular similarity with a
query and its importance.
ranking_score(q, d)
   = w × sim(q, d) + (1−w) × R(d),   if sim(q, d) > 0
   = 0,                              otherwise
where 0 < w < 1.
– Both sim(q, d) and R(d) need to be normalized
to [0, 1].
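A one-function sketch (my addition) of this weighted combination, with sim(q, d) and R(d) assumed already normalized to [0, 1].

```python
def ranking_score(sim, pagerank, w=0.5):
    """Combine query similarity and page importance, both normalized to [0, 1]."""
    return w * sim + (1 - w) * pagerank if sim > 0 else 0.0
```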
We can pick and choose
• Two alternate ways of
computing page
importance
– I1. As authorities/hubs
– I2. As stationary
distribution over the
underlying markov chain
• Two alternate ways of
combining importance
with similarity
– C1. Compute importance
over a set derived from
the top-100 similar pages
– C2. Combine apples &
oranges
• a*importance +
b*similarity
We can pick any pair of alternatives
(even though I1 was originally proposed with C1 and I2 with C2)
Efficient computation: Prioritized
Sweeping
We can use asynchronous
iterations, where an iteration
uses some of the values already
updated in the current iteration
Efficient Computation:
Preprocess
• Remove ‘dangling’ nodes
– Pages w/ no children
• Then repeat the process
– Since removing danglers can create new ones
• Stanford WebBase
– 25 M pages
– 81 M URLs in the link graph
– After two prune iterations: 19 M nodes
Representing ‘Links’ Table
• Stored on disk in binary format
Source node    Outdegree      Destination nodes
(32 bit int)   (16 bit int)   (32 bit int)
0              4              12, 26, 58, 94
1              3              5, 56, 69
2              5              1, 9, 10, 36, 78
• Size for Stanford WebBase: 1.01 GB
– Assumed to exceed main memory
Algorithm 1
(Figure: Dest = Links (sparse matrix) × Source)
∀s: Source[s] = 1/N
while residual > ε {
   ∀d: Dest[d] = 0
   while not Links.eof() {
      Links.read(source, n, dest1, … destn)
      for j = 1 … n
         Dest[destj] = Dest[destj] + Source[source]/n
   }
   ∀d: Dest[d] = c * Dest[d] + (1-c)/N     /* dampening */
   residual = ||Source – Dest||            /* recompute every few iterations */
   Source = Dest
}
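A rough in-memory sketch (my addition) of Algorithm 1’s sweep over (source, destinations) records; the on-disk binary layout and the “recompute residual every few iterations” optimization are elided, and all names are illustrative.

```python
import numpy as np

def pagerank_pass(links_records, N, c=0.8, epsilon=1e-6, max_iters=100):
    """links_records: re-iterable list of (source, [dest1, ..., destn]) tuples."""
    source = np.full(N, 1.0 / N)
    for _ in range(max_iters):
        dest = np.zeros(N)
        for s, dests in links_records:          # one sequential sweep over Links
            if not dests:
                continue
            share = source[s] / len(dests)
            for d in dests:
                dest[d] += share
        dest = c * dest + (1 - c) / N           # dampening
        residual = np.abs(source - dest).sum()
        source = dest                           # Dest becomes Source for the next iteration
        if residual < epsilon:
            break
    return source
```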
Analysis of Algorithm 1
• If memory is big enough to hold Source & Dest
– IO cost per iteration is |Links|
– Fine for a crawl of 24 M pages
– But web ~ 800 M pages in 2/99 [NEC study]
– Increase from 320 M pages in 1997 [same authors]
• If memory is big enough to hold just Dest
– Sort Links on the source field
– Read Source sequentially during the rank propagation step
– Write Dest to disk to serve as Source for the next iteration
– IO cost per iteration is |Source| + |Dest| + |Links|
• If memory can’t hold Dest
– Random access pattern will make the working set = |Dest|
– Thrash!!!
Block-Based Algorithm
• Partition Dest into B blocks of D pages each
– If memory = P physical pages
– D < P-2 since need input buffers for Source & Links
• Partition Links into B files
– Linksi only has some of the dest nodes for each source
– Linksi only has dest nodes such that
• DD*i <= dest < DD*(i+1)
• Where DD = number of 32 bit integers that fit in D pages
(Figure: Dest = Links (sparse matrix) × Source, with Dest and Links
partitioned into blocks.)
Partitioned Link File
Source node    Outdegr    Num out    Destination nodes
(32 bit int)   (16 bit)   (16 bit)   (32 bit int)

Buckets 0-31:
0              4          2          12, 26
1              3          1          5
2              5          3          1, 9, 10

Buckets 32-63:
0              4          1          58
1              3          1          56
2              5          1          36

Buckets 64-95:
0              4          1          94
1              3          1          69
2              5          1          78
Block-based Page Rank algorithm
Analysis of Block Algorithm
• IO Cost per iteration =
– B*| Source| + | Dest| + | Links|*(1+e)
– e is factor by which Links increased in size
• Typically 0.1-0.3
• Depends on number of blocks
• Algorithm ~ nested-loops join
Comparing the Algorithms
Effect of collusion on PageRank
(Figure: three pages A, B, C. Initially A → B → C → A.
Assuming c = 0.8 and K = [1/3]:)

       .066  .066  .866
M* =   .866  .066  .066        Rank(A) = Rank(B) = Rank(C) = 0.5774
       .066  .866  .066

(Now C also links to B, so B and C refer to each other:)

       .066  .066  .466        Rank(A) = 0.37
M* =   .866  .066  .466        Rank(B) = 0.6672
       .066  .866  .066        Rank(C) = 0.6461

Moral: By referring to each other, a cluster of pages can artificially boost
their rank (although the cluster has to be big enough to make an
appreciable difference).
Solution: Put a threshold on the number of intra-domain links that will count.
Counter: Buy two domains, and generate a cluster among those..
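A quick sketch (my addition) that recomputes both cases; the slide’s numbers appear to be L2-normalized stationary vectors, so the sketch normalizes the same way. The matrices are copied from the slide.

```python
import numpy as np

def stationary(M_star, iters=200):
    """Power-iterate a (column-stochastic) M* and return the L2-normalized result."""
    r = np.full(M_star.shape[0], 1.0 / M_star.shape[0])
    for _ in range(iters):
        r = M_star @ r
    return r / np.linalg.norm(r)

before = np.array([[.066, .066, .866],
                   [.866, .066, .066],
                   [.066, .866, .066]])
after = np.array([[.066, .066, .466],
                  [.866, .066, .466],
                  [.066, .866, .066]])
print(stationary(before))   # ~ [0.577, 0.577, 0.577]
print(stationary(after))    # ~ [0.37, 0.667, 0.646] -- B and C boosted at A's expense
```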
Topic-Specific PageRank
• For each page compute k different
page ranks
– k = number of top level
hierarchies in the Open
Directory Project
– When computing PageRank
w.r.t. a topic, say that with ε
probability we transition to one
of the pages of that topic k
• When a query q is issued,
– Compute the similarity between q
(+ its context) and each of the
topics
– Take the weighted combination
of the topic-specific page ranks
of q, weighted by the similarity
to the different topics
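A small sketch (my addition) of the query-time combination step; the per-topic similarities and the k topic-specific PageRank vectors are assumed to be given.

```python
import numpy as np

def topic_sensitive_rank(query_topic_sims, topic_ranks):
    """query_topic_sims: one weight per topic; topic_ranks: one PageRank vector per topic."""
    w = np.asarray(query_topic_sims, dtype=float)
    w /= w.sum()                                  # normalize topic similarities
    # weighted combination of the k topic-specific PageRank vectors
    return sum(wi * r for wi, r in zip(w, topic_ranks))
```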
Stability of Rank
Calculations
(From Ng et al.)
(Figure: the left-most column shows the original rank calculation; the
columns to the right show the rank calculations obtained when 30% of
the pages are randomly removed. This comparison can be done for the
base set as well as for the full web.)
• A/H calculations can be made more stable with subspace-based
A/H values [see Ng et al., 2001]
• PageRank is more stable because the random surfer model
allows low-probability edges to every place
(see also the topic-specific PageRank idea)
Summary of Key Points
• PageRank Iterative Algorithm
• Rank Sinks
• Efficiency of computation – Memory!
– Single precision Numbers.
– Don’t represent M* explicitly.
– Break arrays into Blocks.
– Minimize IO Cost.
• Number of iterations of PageRank.
• Weighting of PageRank vs. doc
similarity.
Challenges in Web Search Engines
• Spam
– Text Spam
– Link Spam
– Cloaking
• Content Quality
– Anchor text quality
• Quality Evaluation
– Indirect feedback
• Web Conventions
– Articulate and develop
validation
• Duplicate Hosts
– Mirror detection
• Vaguely Structured
Data
– Page layout
– The advantage of making
rendering/content
language be same
Fighting Page Spam
We saw a discussion of
these in the Henzinger
et al. paper
Can social networks, which gave
rise to the ideas of page importance
computation,
also rescue these computations
from spam?
TrustRank idea
[Gyongyi et al, VLDB 2004]
• Tweak the “default” distribution used in the PageRank
computation (the distribution that a bored user uses
when she doesn’t want to follow the links)
– From uniform
– To “trust based”
• Very similar in spirit to the Topic-sensitive or User-sensitive
page rank
– Where you also fiddle with the default distribution
• Sample a set of “seed pages” from the web
• Have an oracle (human) identify the good pages and the
spam pages in the seed set
– Expensive task, so we must make the seed set as small as
possible
• Propagate Trust (one pass)
• Use the normalized trust to set the initial distribution
Slides modified from Anand Rajaraman’s lecture at Stanford
Rules for trust propagation
• Trust attenuation
– The degree of trust conferred
by a trusted page decreases
with distance
• Trust splitting
– The larger the number of
outlinks from a page, the less
scrutiny the page author gives
each outlink
– Trust is “split” across outlinks
– Combining splitting and
damping, each out link of a
node p gets a propagated trust
of: b·t(p)/|O(p)|
» 0 < b < 1; O(p) is the out degree
and t(p) is the trust of p
• Trust additivity
– Propagated trust from different
directions is added up
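A minimal sketch (my addition) of one propagation pass that applies all three rules: attenuation and splitting via b·t(p)/|O(p)|, and additivity via summation; the graph and trust representations and all names are assumptions.

```python
def propagate_trust(outlinks, seed_trust, b=0.85, passes=1):
    """outlinks: dict page -> list of pages it links to; seed_trust: dict page -> initial trust."""
    trust = dict(seed_trust)
    for _ in range(passes):
        new_trust = dict(trust)
        for p, targets in outlinks.items():
            if not targets or trust.get(p, 0) == 0:
                continue
            share = b * trust[p] / len(targets)    # attenuation (b) and splitting (|O(p)|)
            for q in targets:
                new_trust[q] = new_trust.get(q, 0) + share   # additivity
        trust = new_trust
    return trust
```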
Picking the seed set
 Two conflicting considerations
 Human has to inspect each seed page, so
seed set must be as small as possible
 Must ensure every “good page” gets
adequate trust rank, so need make all good
pages reachable from seed set by short
paths
Approaches to picking seed set
 Suppose we want to pick a seed set of k pages
 The best idea would be to pick them from the
top-k hub pages.
 Note that “trustworthiness” is subjective
 Aljazeera may be considered more trustworthy
than NY Times by some (and the reverse by
others)
 PageRank
 Pick the top k pages by page rank
 Assume high page rank pages are close to other
highly ranked pages
 We care more about high page rank “good” pages
Inverse page rank
 Pick the pages with the maximum
number of outlinks
 Can make it recursive
 Pick pages that link to pages with many
outlinks
 Formalize as “inverse page rank”
 Construct graph G’ by reversing each edge
in web graph G
 Page Rank in G’ is inverse page rank in G
 Pick top k pages by inverse page rank
Anatomy of Google
(circa 1999)
Slides from
http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm
Some points…
• Fancy hits?
• Why two types of
barrels?
• How is indexing
parallelized?
• How does Google show
that it doesn’t quite care
about recall?
• How does Google avoid
crawling the same URL
multiple times?
• What are some of the
memory saving things
they do?
• Do they use TF/IDF?
• Do they normalize?
(why not?)
• Can they support
proximity queries?
• How are “page
synopses” made?
Types of Web Queries
• Navigational
– User is looking for the address of a specific page
(so the “relevant” set is a singleton!)
• Success on these is responsible for much of the “OOooo”
appeal of search engines..
• Informational
– User is trying to learn information about a specific
topic (so the relevant set can be non-singleton)
• Transactional
– The user is searching with the final aim of
conducting a transaction on that page..
• E.g. comparison shopping
Search Engine Size over Time
Number of indexed pages, self-reported
Google: 50% of the web?
Information from
searchenginewatch.com
Google Search Engine Architecture
SOURCE: BRIN & PAGE
URL Server- Provides URLs to be
fetched
Crawler is distributed
Store Server - compresses and
stores pages for indexing
Repository - holds pages for indexing
(full HTML of every page)
Indexer - parses documents, records
words, positions, font size, and
capitalization
Lexicon - list of unique words found
HitList – efficient record of word locs+attribs
Barrels hold (docID, (wordID, hitList*)*)*
sorted: each barrel has range of words
Anchors - keep information about links
found in web pages
URL Resolver - converts relative
URLs to absolute
Sorter - generates Doc Index
Doc Index - inverted index of all words
in all documents (except stop
words)
Links - stores info about links to each
page (used for Pagerank)
Pagerank - computes a rank for each
page retrieved
Searcher - answers queries
Major Data Structures
• Big Files
– virtual files spanning multiple file systems
– addressable by 64 bit integers
– handles allocation & deallocation of file
descriptors, since the OS’s support is not enough
– supports rudimentary compression
Major Data Structures (2)
• Repository
– tradeoff between speed & compression
ratio
– choose zlib (3 to 1) over bzip (4 to 1)
– requires no other data structure to access
it
Major Data Structures (3)
• Document Index
– keeps information about each document
– fixed width ISAM (index sequential access mode)
index
– includes various statistics
• pointer to repository, if crawled, pointer to info lists
– compact data structure
– we can fetch a record in 1 disk seek during search
Major Data Structures (4)
• Lexicon
– can fit in memory for reasonable price
• currently 256 MB
• contains 14 million words
• 2 parts
– a list of words
– a hash table
Major Data Structures (4)
• Hit Lists
– includes position font & capitalization
– account for most of the space used in the
indexes
– 3 alternatives: simple, Huffman, and hand-optimized encodings
– the hand encoding uses 2 bytes for every hit
Major Data Structures (4)
• Hit Lists (2)
Major Data Structures (5)
• Forward Index
– partially ordered
– uses 64 Barrels
– each Barrel holds a range of wordIDs
– requires slightly more storage
– each wordID is stored as a relative difference from
the minimum wordID of the Barrel
– saves considerable time in the sorting
Major Data Structures (6)
• Inverted Index
– 64 Barrels (same as the Forward Index)
– for each wordID the Lexicon contains a
pointer to the Barrel that wordID falls into
– the pointer points to a doclist with their hit
list
– the order of the docIDs is important
• by docID or doc word-ranking
– Two inverted barrels—the short barrel/full barrel
Major Data Structures (7)
• Crawling the Web
– fast distributed crawling system
– URLserver & Crawlers are implemented in Python
– each Crawler keeps about 300 connections open
– at peak time the rate is 100 pages (about 600 KB of data) per second
– uses an internal cached DNS lookup
– synchronized IO to handle events
– a number of queues
– Robust & Carefully tested
Major Data Structures (8)
• Indexing the Web
– Parsing
• should know how to handle errors
– HTML typos
– kilobytes of zeros in the middle of a tag
– non-ASCII characters
– HTML tags nested hundreds deep
• Developed their own Parser
– involved a fair amount of work
– did not cause a bottleneck
Major Data Structures (9)
• Indexing Documents into Barrels
– turning words into wordIDs
– in-memory hash table - the Lexicon
– new additions are logged to a file
– parallelization
• shared lexicon of 14 million words
• log of all the extra words
Major Data Structures (10)
• Indexing the Web
– Sorting
• creating the inverted index
• produces two types of barrels
– for titles and anchor (Short barrels)
– for full text (full barrels)
• sorts every barrel separately
• running sorters in parallel
• the sorting is done in main memory
Searching
• Algorithm
– 1. Parse the query
– 2. Convert words into wordIDs
– 3. Seek to the start of the doclist in the
short barrel for every word
– 4. Scan through the doclists until there is a
document that matches all of the search terms
– 5. Compute the rank of that document
– 6. If we’re at the end of the short barrels, start at the
doclists of the full barrel, unless we have enough
– 7. If we’re not at the end of any doclist, go to step 4
– 8. Sort the documents by rank; return the top K
• (May jump here after 40k pages)
The Ranking System
• The information
– Position, Font Size, Capitalization
– Anchor Text
– PageRank
• Hits Types
– title ,anchor , URL etc..
– small font, large font etc..
The Ranking System (2)
• Each Hit type has its own weight
– Count weights increase linearly with counts at first
but quickly taper off; this is the IR score of the doc
– (IDF weighting??)
• the IR is combined with PageRank to give the final
Rank
• For multi-word query
– A proximity score for every set of hits with a
proximity type weight
• 10 grades of proximity
Feedback
• A trusted user may optionally evaluate
the results
• The feedback is saved
• When modifying the ranking function we
can see the impact of this change on all
previous searches that were ranked
Results
• Produce better results than major commercial
search engines for most searches
• Example: query “bill clinton”
– returns results from the “Whitehouse.gov” domain
– email addresses of the president
– all the results are high quality pages
– no broken links
– no bill without clinton & no clinton without bill
Storage Requirements
• Using Compression on the repository
• about 55 GB for all the data used by the
SE
• most of the queries can be answered by
just the short inverted index
• with better compression, a high quality
SE can fit onto a 7GB drive of a new PC
Storage Statistics
Storage statistics:
  Total Size of Fetched Pages                 147.8 GB
  Compressed Repository                        53.5 GB
  Short Inverted Index                          4.1 GB
  Temporary Anchor Data                         6.6 GB
  Document Index Incl. Variable Width Data      9.7 GB
  Links Database                                3.9 GB
  Total Without Repository                     55.2 GB

Web page statistics:
  Number of Web Pages Fetched     24 million
  Number of URLs Seen             76.5 million
  Number of Email Addresses        1.7 million
  Number of 404’s                  1.6 million
System Performance
• It took 9 days to download 26 million pages
• 48.5 pages per second
• The Indexer & Crawler ran simultaneously
• The Indexer runs at 54 pages per second
• The sorters run in parallel using 4 machines;
the whole process took 24 hours