
Web Search Engines
Zachary G. Ives
University of Pennsylvania
CSE 455 / CIS 555 – Internet and Web Systems
March 27, 2008
Administrivia
 Homework 3 due 4/1
 Please decide on 2 or 3 partners to work with on the project (groups of 3 or 4 total)
2
Improved PageRank
 Remove out-degree 0 nodes (or consider them to refer back
to referrer)
 Add decay factor to deal with sinks
 PageRank(p) = d * Σ_{b ∈ B(p)} (PageRank(b) / N(b)) + (1 – d)
 Intuition comes from the idea of the “random surfer”:
 Surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 – d
3
Stopping the Hog
With nodes g = Google, y = Yahoo, a = Amazon and decay factor 0.8:

   [g']          [0    0    0.5]   [g]   [0.2]
   [y']  = 0.8 * [0.5  1    0.5] * [y] + [0.2]
   [a']          [0.5  0    0  ]   [a]   [0.2]

Running for multiple iterations:

   (g, y, a) = (0.6, 1.8, 0.6), (0.44, 2.12, 0.44), (0.38, 2.25, 0.38), (0.35, 2.30, 0.35)
… though does this seem right?
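To make the damped update concrete, here is a minimal Java sketch (mine, not from the slides) that runs the iteration on the three-page example above; the link matrix, decay factor d = 0.8, and starting vector of all ones are taken from the slide, and the loop reproduces the (0.6, 1.8, 0.6), (0.44, 2.12, 0.44), … sequence.

    public class DampedPageRank {
        public static void main(String[] args) {
            // Link matrix M[i][j] = 1/N(j) if page j links to page i, else 0.
            // Pages: 0 = Google (g), 1 = Yahoo (y), 2 = Amazon (a).
            double[][] M = {
                {0.0, 0.0, 0.5},
                {0.5, 1.0, 0.5},
                {0.5, 0.0, 0.0}
            };
            double d = 0.8;                      // decay (damping) factor
            double[] rank = {1.0, 1.0, 1.0};     // initial ranks

            for (int iter = 1; iter <= 4; iter++) {
                double[] next = new double[rank.length];
                for (int i = 0; i < rank.length; i++) {
                    double sum = 0.0;
                    for (int j = 0; j < rank.length; j++) {
                        sum += M[i][j] * rank[j]; // contributions from back-links
                    }
                    next[i] = d * sum + (1 - d);  // damped PageRank update
                }
                rank = next;
                System.out.printf("iter %d: g=%.2f y=%.2f a=%.2f%n",
                                  iter, rank[0], rank[1], rank[2]);
            }
        }
    }

Note that y (Yahoo), which links only to itself, still collects most of the rank even with the decay factor, which is presumably what the closing question is pointing at.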
4
Summary of Link Analysis
 Use back-links as a means of adjusting the
“worthiness” or “importance” of a page
 Use iterative process over matrix/vector values to
reach a convergence point
 PageRank is query-independent and considered
relatively stable
 But vulnerable to SEO
 Just one piece of the overall picture…
5
Overall Ranking Strategies in
Web Search Engines
 Everybody has their own “secret sauce” that uses:
 Vector model (TF/IDF)
 Proximity of terms
 Where terms appear (title vs. body vs. link)
 Link analysis
 Info from directories
 Page popularity
 gorank.com “search engine optimization site” compares these factors
 Some alternative approaches:
 Some new engines (Vivisimo, Teoma, Clusty) try to do clustering
 A few engines (Dogpile, Mamma.com) try to do meta-search
6
Google Architecture [Brin/Page 98]
 Focus was on scalability to the size of the Web
 First to really exploit Link Analysis
 Started as an academic project @ Stanford; became a startup
 Our discussion will be on early Google – today they keep things secret!
7
The Heart of Google Storage
“BigFile” system for storing indices, tables
 Support for 2^64 bytes across multiple drives, filesystems
 Manages its own file descriptors, resources
First use: Repository
 Basically, a warehouse of every HTML page (this is the cached page entry), compressed in zlib (faster than bzip)
 Useful for doing additional processing, any necessary rebuilds
 Repository entry format:
 [DocID][ECode][UrlLen][PageLen][Url][Page]
 The repository is indexed (not inverted here)
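As a rough illustration (not the paper's actual code), a repository entry could be serialized like this; the field widths are assumptions for the sketch, and java.util.zip's DeflaterOutputStream stands in for the zlib compression mentioned above.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DeflaterOutputStream;

    public class RepositoryWriter {
        // Append one [DocID][ECode][UrlLen][PageLen][Url][Page] entry to a stream.
        // Field widths (int docId, int ecode, ...) are assumptions for this sketch.
        static void writeEntry(DataOutputStream out, int docId, int ecode,
                               String url, String page) throws IOException {
            // zlib-compress the page body (DeflaterOutputStream emits zlib-format data)
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DeflaterOutputStream z = new DeflaterOutputStream(buf)) {
                z.write(page.getBytes(StandardCharsets.UTF_8));
            }
            byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
            byte[] pageBytes = buf.toByteArray();

            out.writeInt(docId);             // [DocID]
            out.writeInt(ecode);             // [ECode]
            out.writeInt(urlBytes.length);   // [UrlLen]
            out.writeInt(pageBytes.length);  // [PageLen]
            out.write(urlBytes);             // [Url]
            out.write(pageBytes);            // [Page], compressed
        }
    }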
8
Repository Index
 One index for looking up documents by
DocID
 Done in ISAM (think of this as a B+ Tree
without smart re-balancing)
 Index points to repository entries (or to URL
entry if not crawled)
 One index for mapping URL to DocID
 Sorted by checksum of URL
 Compute checksum of URL, then binsearch
by checksum
 Allows update by merge with another similar
file
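A hedged sketch of the URL-to-DocID side: compute a checksum of the URL and binary-search a sorted array of (checksum, DocID) pairs. CRC32 stands in here for whatever checksum Google actually used.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class UrlIndex {
        private final long[] checksums; // sorted ascending
        private final int[] docIds;     // docIds[i] corresponds to checksums[i]

        UrlIndex(long[] sortedChecksums, int[] docIds) {
            this.checksums = sortedChecksums;
            this.docIds = docIds;
        }

        static long checksum(String url) {
            CRC32 crc = new CRC32();
            crc.update(url.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        /** Binary search by checksum; returns the DocID, or -1 if the URL is unknown. */
        int lookup(String url) {
            long target = checksum(url);
            int lo = 0, hi = checksums.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (checksums[mid] < target)      lo = mid + 1;
                else if (checksums[mid] > target) hi = mid - 1;
                else return docIds[mid];
            }
            return -1;
        }
    }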
9
Lexicon
 The list of searchable words
 (Presumably, today it’s used to
suggest alternative words as well)
 The “root” of the inverted index
 As of 1998, 14 million “words”
 Kept in memory (was 256MB)
 Two parts:
 Hash table of pointers to words and
the “barrels” (partitions) they fall into
 List of words (null-separated)
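One plausible in-memory layout (an assumption, not the paper's exact structure): a null-separated word list plus a hash table from word to its WordID and barrel, mirroring the two parts above. The real lexicon partitions words into barrels by WordID range; the modulo below is just a stand-in.

    import java.util.HashMap;
    import java.util.Map;

    public class Lexicon {
        private final StringBuilder wordList = new StringBuilder(); // words separated by '\0'
        private final Map<String, int[]> table = new HashMap<>();   // word -> {wordId, barrel}
        private int nextWordId = 0;
        private final int numBarrels;

        Lexicon(int numBarrels) { this.numBarrels = numBarrels; }

        /** Add a word (if new) and return its WordID. */
        int add(String word) {
            int[] entry = table.get(word);
            if (entry == null) {
                int wordId = nextWordId++;
                int barrel = wordId % numBarrels;   // stand-in for range partitioning
                wordList.append(word).append('\0'); // null-separated list of words
                entry = new int[] {wordId, barrel};
                table.put(word, entry);
            }
            return entry[0];
        }

        int barrelOf(String word) { return table.get(word)[1]; }
    }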
10
Indices – Inverted and “Forward”
 Inverted index divided into
“barrels” (partitions by range)
 Indexed by the lexicon; for each
DocID, consists of a Hit List of
entries in the document
 Forward index uses the same
barrels
 Used to find multi-word queries
with words in same barrel
 Indexed by DocID, then a list of
WordIDs in this barrel and this
document, then Hit Lists
corresponding to the WordIDs
 Two barrels: short (anchor and
title); full (all text)
[Figure: the lexicon (293 MB) holds (WordID, ndocs) entries pointing into the inverted barrels (41 GB), which store (DocID, nhits, hit list) postings; the forward barrels (43 GB total) are keyed by DocID, each holding (WordID, nhits, hit list) entries terminated by NULL. Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]
11
Hit Lists (Not Mafia-Related)
 Used in inverted and forward indices
 Goal was to minimize the size – the bulk of data is in
hit entries
 For 1998 version, made it down to 2 bytes per hit (though
that’s likely climbed since then):
Plain hit:  cap: 1 | font: 3 | position: 12
Fancy hit:  cap: 1 | font: 7 | type: 4 | position: 8
Anchor hit (position special-cased): cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
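A minimal Java sketch (my own, matching the bit widths listed above) of packing a plain hit into 2 bytes: 1 capitalization bit, 3 font bits, 12 position bits. Fancy and anchor hits would set the font field to 7 and reinterpret the remaining bits; main() is just a tiny round-trip demo.

    public class HitPacking {
        /** Pack a plain hit: cap (1 bit), font size (3 bits), word position (12 bits). */
        static short packPlainHit(boolean cap, int font, int position) {
            int capBit = cap ? 1 : 0;
            int f = Math.min(font, 6);          // font value 7 is reserved for fancy/anchor hits
            int pos = Math.min(position, 4095); // clamp to 12 bits in this sketch
            return (short) ((capBit << 15) | (f << 12) | pos);
        }

        static boolean cap(short hit)  { return ((hit >> 15) & 0x1) != 0; }
        static int fontSize(short hit) { return (hit >> 12) & 0x7; }
        static int position(short hit) { return hit & 0xFFF; }

        public static void main(String[] args) {
            short h = packPlainHit(true, 3, 17);
            System.out.println(cap(h) + " " + fontSize(h) + " " + position(h)); // true 3 17
        }
    }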
12
Google’s Distributed Crawler
 Single URL Server – the coordinator
 A queue that farms out URLs to crawler nodes
 Implemented in Python!
 Crawlers have 300 open connections apiece
 Each needs own DNS cache – DNS lookup is major
bottleneck
 Based on asynchronous I/O (which is now supported in
Java)
 Many caveats in building a “friendly” crawler (we’ll
talk about some momentarily)
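Since DNS lookup was the major bottleneck, each crawler kept its own DNS cache; a simple hedged version in Java (the class name is mine, not Google's), which a real crawler would extend with TTL-based expiry:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    public class DnsCache {
        private final ConcurrentHashMap<String, InetAddress> cache = new ConcurrentHashMap<>();

        /** Resolve a hostname, hitting the real resolver only on a cache miss. */
        InetAddress resolve(String host) throws UnknownHostException {
            InetAddress cached = cache.get(host);
            if (cached != null) return cached;
            InetAddress addr = InetAddress.getByName(host); // blocking OS/DNS lookup
            cache.put(host, addr);
            return addr;
        }
    }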
13
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K
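Steps 3–4 are essentially a merge-intersection of doclists; a hedged Java sketch of that core step, assuming DocIDs are sorted ascending within each doclist:

    import java.util.ArrayList;
    import java.util.List;

    public class DoclistIntersect {
        /** Return DocIDs present in every doclist; each int[] must be sorted ascending. */
        static List<Integer> intersect(int[][] doclists) {
            List<Integer> result = new ArrayList<>();
            int[] pos = new int[doclists.length];       // one cursor per doclist
            outer:
            while (true) {
                if (pos[0] >= doclists[0].length) break;
                int candidate = doclists[0][pos[0]];    // current DocID of the first list
                for (int i = 1; i < doclists.length; i++) {
                    // Advance list i until it reaches or passes the candidate.
                    while (pos[i] < doclists[i].length && doclists[i][pos[i]] < candidate) pos[i]++;
                    if (pos[i] >= doclists[i].length) break outer;  // some list exhausted
                    if (doclists[i][pos[i]] > candidate) {          // candidate missing; move on
                        pos[0]++;
                        continue outer;
                    }
                }
                result.add(candidate);  // appears in every doclist: rank it (step 5)
                pos[0]++;
            }
            return result;
        }

        public static void main(String[] args) {
            int[][] lists = { {1, 4, 7, 9}, {2, 4, 9, 12}, {4, 8, 9} };
            System.out.println(intersect(lists));  // [4, 9]
        }
    }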
14
Ranking in Google
 Considers many types of information:
 Position, font size, capitalization
 Anchor text
 PageRank
 Count of occurrences (basically, TF) in a way that tapers off
 (Not clear if they do IDF?)
 Multi-word queries consider proximity also
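Google's exact combination is not public; as an illustration only, a toy score might blend the signals above like this (the weights and the tapering log are my assumptions, not the actual formula):

    public class ToyScorer {
        /**
         * Toy relevance score blending the signals listed above.
         * tf             - count of query-term occurrences in the document
         * anchorTf       - occurrences in anchor text pointing at the document
         * titleHit       - whether a term appears in the title
         * pageRank       - link-analysis score of the document
         * proximityBoost - higher when multi-word query terms appear close together
         */
        static double score(int tf, int anchorTf, boolean titleHit,
                            double pageRank, double proximityBoost) {
            double tapered = Math.log1p(tf);        // counts taper off rather than grow linearly
            double anchor  = 2.0 * Math.log1p(anchorTf);
            double title   = titleHit ? 1.5 : 0.0;  // where the term appears matters
            return tapered + anchor + title + 0.5 * pageRank + proximityBoost;
        }
    }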
15
Google’s Resources
 In 1998:
 24M web pages
 About 55GB data w/o repository
 About 110GB with repository
 Lexicon 293MB
 Worked quite well with low-end PC
 Today, > 25 billion pages, 400M queries/day:
 Don’t attempt to include all barrels on every machine!
 e.g., 5+TB repository on special servers separate from index servers
 Many special-purpose indexing services (e.g., images)
 Much greater distribution of data (now ~20K PCs?), huge net BW
 Advertising needs to be tied in (100,000 advertisers claimed)
16
Web Crawling in More Detail
 First, some politeness criteria
 Then we’ll look at the Mercator distributed crawler
as an example
17
The Basic Crawler (Spider/Robot) Process
 Start with some initial page P0
 Collect all URLs from P0 and add to the crawler queue
 Consider <base href> tag, anchor links, optionally image links, CSS,
DTDs, scripts
 Considerations:
 What order to traverse (polite to do BFS – why?)
 How deep to traverse
 What to ignore (coverage)
 How to escape “spider traps” and avoid cycles
 How often to crawl
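A bare-bones version of that process, as a sketch only (fetching and link extraction are stubbed out; a real crawler also needs robots.txt handling, depth limits, and trap detection, discussed below):

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public abstract class BasicCrawler {
        // Subclasses supply the actual fetching and link extraction.
        abstract String fetch(String url) throws Exception;
        abstract List<String> extractLinks(String url, String html);

        void crawl(String seed, int maxPages) {
            Queue<String> frontier = new ArrayDeque<>();  // BFS: FIFO queue of URLs
            Set<String> seen = new HashSet<>();           // avoid revisiting / simple cycle check
            frontier.add(seed);
            seen.add(seed);
            int fetched = 0;
            while (!frontier.isEmpty() && fetched < maxPages) {
                String url = frontier.poll();
                try {
                    String html = fetch(url);
                    fetched++;
                    for (String link : extractLinks(url, html)) {
                        if (seen.add(link)) frontier.add(link);  // enqueue only unseen URLs
                    }
                } catch (Exception e) {
                    // Skip pages that fail to download or parse.
                }
            }
        }
    }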
18
Essential Crawler Etiquette
 Robot exclusion protocols
 First, ignore pages with:
<META NAME="ROBOTS" CONTENT="NOINDEX">
 Second, look for robots.txt at root of web server
 See http://www.robotstxt.org/wc/robots.html
 To exclude all robots from a server:
User-agent: *
Disallow: /
 To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
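A hedged, minimal check for exactly the two kinds of rules above: collect the Disallow prefixes for a given User-agent (or "*") and test paths against them. Real parsers handle more (Allow lines, wildcards, group precedence).

    import java.util.ArrayList;
    import java.util.List;

    public class RobotsTxt {
        /** Return the Disallow path prefixes that apply to the given user agent (or "*"). */
        static List<String> disallowsFor(String robotsTxt, String userAgent) {
            List<String> disallows = new ArrayList<>();
            boolean applies = false;
            for (String raw : robotsTxt.split("\n")) {
                String line = raw.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    String agent = line.substring("user-agent:".length()).trim();
                    applies = agent.equals("*") || agent.equalsIgnoreCase(userAgent);
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallows.add(path);   // empty Disallow means allow all
                }
            }
            return disallows;
        }

        static boolean allowed(String path, List<String> disallows) {
            for (String prefix : disallows) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }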
19
Mercator
 A scheme for building a distributed web crawler
 Expands a “URL frontier”
 Avoids re-crawling same URLs
 Also considers whether a document has been seen
before
 Every document has signature/checksum info computed as
it’s crawled
20
Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
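The eight steps map onto a loop roughly like the following sketch; the method names and signatures are mine, not Mercator's, and the comments indicate which step each line corresponds to.

    import java.util.List;
    import java.util.Set;

    public abstract class MercatorLoop {
        abstract String dequeueFrontier();                  // 1. dequeue frontier URL
        abstract byte[] fetch(String url);                  // 2. fetch document
        abstract long fingerprint(byte[] content);          //    checksum of the buffered content
        abstract List<String> extractLinks(byte[] content); // 5. extract hyperlinks
        abstract boolean wanted(String url);                // 6. filter unwanted links
        abstract void enqueue(String url);                  // 8. enqueue URL

        void step(Set<Long> contentSeen, Set<Long> urlSeen) {
            String url = dequeueFrontier();                 // 1
            byte[] doc = fetch(url);                        // 2 (and 3: buffered so it can be re-read)
            long fp = fingerprint(doc);                     // 4: content-seen test
            if (!contentSeen.add(fp)) return;               //    already crawled this document
            for (String link : extractLinks(doc)) {         // 5
                if (!wanted(link)) continue;                // 6
                long h = (long) link.hashCode();            // 7: URL-seen test (hash of the URL)
                if (urlSeen.add(h)) enqueue(link);          // 8
            }
        }
    }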
21
Mercator’s Polite Frontier Queues
 Tries to go beyond breadth-first approach – want to
have only one crawler thread per server
 Distributed URL frontier queue:
 One subqueue per worker thread
 The worker thread is determined by hashing the
hostname of the URL
 Thus, only one outstanding request per web server
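The key operation is the hostname hash; a small sketch of picking the subqueue (and hence the worker thread) for a URL, assuming the URL has a host component:

    import java.net.URI;
    import java.net.URISyntaxException;

    public class PoliteFrontier {
        /** Map a URL to a subqueue index so all URLs on one host go to the same worker thread. */
        static int subqueueFor(String url, int numWorkerThreads) throws URISyntaxException {
            String host = new URI(url).getHost();  // e.g. "www.example.com"
            return Math.floorMod(host.hashCode(), numWorkerThreads);
        }
    }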
22
Mercator’s HTTP Fetcher
 First, needs to ensure robots.txt is followed
 Caches the contents of robots.txt for various web sites as
it crawls them
 Designed to be extensible to other protocols
 Had to write own HTTP requestor in Java – their
Java version didn’t have timeouts
 Today, can use setSoTimeout()
 Can also use Java non-blocking I/O if you wish:
 http://www.owlmountain.com/tutorials/NonBlockingIo.htm
 But they use multiple threads and synchronous I/O
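A minimal illustration of the setSoTimeout() point: a socket-level HTTP GET that will not hang forever on a silent server (error handling, redirects, and HTTPS omitted; the timeouts are arbitrary):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class TimeoutFetcher {
        static byte[] get(String host, String path) throws IOException {
            try (Socket sock = new Socket()) {
                sock.connect(new InetSocketAddress(host, 80), 5000); // connect timeout: 5 s
                sock.setSoTimeout(5000);                             // read timeout: 5 s per read()
                OutputStream out = sock.getOutputStream();
                out.write(("GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n")
                          .getBytes(StandardCharsets.US_ASCII));
                out.flush();
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                InputStream in = sock.getInputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
                return buf.toByteArray();  // status line + headers + body
            }
        }
    }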
23
Other Caveats
 Infinitely long URL names (good way to get a buffer
overflow!)
 Aliased host names
 Alternative paths to the same host
 Can catch most of these with signatures of
document data (e.g., MD5)
 Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
 May need to have a way for human to override certain
URL paths – see Section 5 of paper
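Computing such a signature in Java is a single MessageDigest call; a brief sketch:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class DocSignature {
        /** MD5 signature of a document body, as a hex string, for duplicate detection. */
        static String md5(String document) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(document.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).toString(16);
        }
    }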
24
Mercator Document Statistics
PAGE TYPE     PERCENT
text/html     69.2%
image/gif     17.9%
image/jpeg     8.1%
text/plain     1.5%
pdf            0.9%
audio          0.4%
zip            0.4%
postscript     0.3%
other          1.4%

[Figure: histogram of document sizes (60M pages)]
Further Considerations
 May want to prioritize certain pages as being most
worth crawling
 Focused crawling tries to prioritize based on relevance
 May need to refresh certain pages more often
26
Web Search Summarized
 Two important factors:
 Indexing and ranking scheme that allows most relevant
documents to be prioritized highest
 Crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
 We’ll be using Pastry to distribute the work of
crawling and to distribute the data (what Google
calls “barrels”)
27