
Anatomy of a Large-Scale Hypertextual Web Search Engine
(e.g. Google)
The Web
Index sizes
– 1994: World Wide Web Worm (McBryan) indexed 110,000 pages/documents
– 1997: search engines claimed 2 to 100 million pages indexed
– 2005: Google claimed 8 billion
– A later claim of 25 billion was removed
Queries
– 1994: WWWW had 1,500 queries per day
– 1997: AltaVista had 20 million queries per day
– 2002: Google had 250 million queries per day
– 2004: 2.5 billion queries per day
Clearly, Web search engines must scale to increased corpus size and increased use
Google Design Goals
Scale up to work with much larger collections
– In 1997, older search engines were breaking down due to manipulation by advertisers, etc.
Exhibit high precision among top-ranked documents
– Many documents are somewhat relevant
– For the Web, “relevant” should mean only the very best documents
Provide data sets for academic research on search engines
Changes from Earlier Engines
PageRank
– Rank the importance of pages
– Use citation patterns, as in academic research
– Normalize for the number of links on a page
– Variation on the probability that a random surfer would view a page (a minimal sketch follows this list)
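The random-surfer intuition can be made concrete with a short power-iteration sketch. The toy graph, damping factor, and iteration count below are illustrative assumptions, not values taken from the slides.

# Minimal power-iteration sketch of PageRank over a toy link graph
# given as {page: [pages it links to]} (illustrative values only).
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue                      # dangling pages pass on no rank in this sketch
            share = damping * rank[page] / len(outgoing)   # normalize by number of out-links
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))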
Anchor Text
– Store the text of anchors with the pages they link to (see the sketch after this list)
– Incoming link anchors can be a better index than the page content itself
– Allows uncrawled and nontextual content to be indexed
– Not a new idea: used by WWWW in 1994
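A minimal sketch of the anchor-text idea: index the words of each anchor under the page the link points to, so a page can be retrieved even if its own content was never crawled. The link tuples below are invented for illustration.

from collections import defaultdict

# (source URL, target URL, anchor text) triples, invented for this example
links = [
    ("http://a.example/", "http://b.example/", "great photo gallery"),
    ("http://c.example/", "http://b.example/", "photo archive"),
]

anchor_index = defaultdict(list)          # target URL -> anchor words pointing at it
for source, target, anchor_text in links:
    anchor_index[target].extend(anchor_text.split())

print(anchor_index["http://b.example/"])  # b.example is indexable without crawling it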
Related Work
Information retrieval
– Traditional IR metrics (TREC 96) called 20 GB a very large corpus
– Term vectors assume all documents have some value
• short documents (few words) that match the query terms get ranked highest
• “bill clinton” returns a page that only says “bill clinton sucks” (see the sketch below)
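A small worked example of that failure mode, using plain (assumed) cosine similarity over term counts: the near-empty page that matches the query almost exactly scores highest, beating a longer, more informative page.

import math
from collections import Counter

def cosine(query, doc):
    q, d = Counter(query.split()), Counter(doc.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = "bill clinton"
short_doc = "bill clinton sucks"
long_doc = ("bill clinton served as the 42nd president of the united states "
            "and this page discusses his record and policies at length")

print(cosine(query, short_doc))  # close to 1.0: length normalization favors the short page
print(cosine(query, long_doc))   # much lower, despite being the more useful result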
Different from well-controlled collections
– Content has abnormal vocabulary (e.g. product codes)
– Content generated by systems (e.g. complex IDs, binhexed data)
– Content includes bugs, flakiness, and disinformation
System Anatomy
Google Anatomy
URL Server
– Manages lists of URLs to be retrieved
Crawlers
– Retrieve content for the URLs provided
Store Server
– Compresses and stores content in the repository
Google Anatomy
Indexer
– Reads and uncompresses content in the repository
– Parses documents for word occurrences (“hits”) and links
– Records word, position in document, relative font size, and capitalization for each hit (a hit-extraction sketch follows this list)
– Hits are divided into barrels
– Records anchor text and from/to information for each link in the Anchors file
– Generates a lexicon (vocabulary list)
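A minimal sketch of the hit-extraction step, assuming the document text has already been fetched and uncompressed; font-size handling and the barrel/anchor bookkeeping are omitted.

# Walk the words of a document and record word, position, and capitalization
# for each occurrence (a simplified stand-in for the indexer's hits).
def extract_hits(doc_id, text):
    hits = []
    for position, word in enumerate(text.split()):
        hits.append((doc_id, word.lower(), position, word[0].isupper()))
    return hits

print(extract_hits(1, "Anatomy of a Large-Scale Search Engine"))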
Google Anatomy
URL Resolver
– Converts relative URLs into absolute URLs
– Generates a document ID for each absolute URL
– Puts anchor text into the forward index for the page the link points to
– Generates a database of links as pairs of document IDs (see the sketch below)
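A sketch of the URL resolver's bookkeeping, assuming crawled pages arrive as a page URL plus (relative link, anchor text) pairs; urljoin and the docid() helper are illustrative stand-ins, not the actual implementation.

from urllib.parse import urljoin

doc_ids = {}                                   # absolute URL -> document ID

def docid(url):
    return doc_ids.setdefault(url, len(doc_ids))

links_db = []                                  # (from doc ID, to doc ID) pairs
anchors_forward = {}                           # to doc ID -> anchor words

def resolve(page_url, outlinks):
    from_id = docid(page_url)
    for relative, anchor_text in outlinks:
        absolute = urljoin(page_url, relative)     # relative -> absolute URL
        to_id = docid(absolute)
        links_db.append((from_id, to_id))          # database of document ID pairs
        anchors_forward.setdefault(to_id, []).extend(anchor_text.split())

resolve("http://example.com/a/", [("../b.html", "next page")])
print(links_db, anchors_forward)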
Google Anatomy
Sorter
– Re-sorts the barrels (in place) by word ID to generate the inverted index
– Produces a list of word IDs and offsets into the inverted index (see the sketch below)
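A sketch of the sorter's job, assuming a forward barrel can be modeled as an in-memory dict keyed by document ID: flatten it, sort by word ID, and regroup into an inverted index. The dicts stand in for the on-disk barrels.

forward_barrel = {                     # doc ID -> [(word ID, hit), ...]
    1: [(7, "hit-a"), (3, "hit-b")],
    2: [(3, "hit-c")],
}

# Flatten to (word ID, doc ID, hit) triples and sort by word ID.
postings = sorted(
    (word_id, doc_id, hit)
    for doc_id, hits in forward_barrel.items()
    for word_id, hit in hits
)

inverted = {}                          # word ID -> [(doc ID, hit), ...]
for word_id, doc_id, hit in postings:
    inverted.setdefault(word_id, []).append((doc_id, hit))

print(inverted)   # {3: [(1, 'hit-b'), (2, 'hit-c')], 7: [(1, 'hit-a')]}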
Google Anatomy
PageRank
– Generates the page ranking based on links
DumpLexicon
– Takes the lexicon and inverted index and generates the lexicon used by the Searcher
Google Data Structures
BigFiles
– Used to create a virtual file system spanning multiple file systems
– Addressable by 64-bit integers
Repository
– Page contents compressed using zlib to balance speed and compression factor
– Stores document ID, length, and URL prefixed to each document
– Documents stored sequentially (see the sketch below)
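A sketch of writing the repository sequentially, one zlib-compressed document at a time, prefixed by document ID, URL, and lengths. The exact record framing (field order and widths) is an assumption for illustration.

import struct, zlib

def append_document(repo, doc_id, url, html):
    compressed = zlib.compress(html.encode())
    url_bytes = url.encode()
    # header: doc ID (8 bytes), URL length, compressed length
    repo.write(struct.pack("<QII", doc_id, len(url_bytes), len(compressed)))
    repo.write(url_bytes)
    repo.write(compressed)

with open("repository.dat", "wb") as repo:
    append_document(repo, 1, "http://example.com/", "<html>hello</html>")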
Google Data Structures
Document Index
– Ordered by document ID
– Includes document status, a pointer into the repository, a document checksum, and other statistics
Lexicon
– 14 million words (rare words not included here) concatenated together with separating nulls
– Hash table of pointers into the word list (see the sketch below)
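A sketch of the lexicon layout: words concatenated with separating nulls, plus a hash table (a Python dict here) mapping each word to its offset and word ID. The word list is illustrative.

words = ["anatomy", "engine", "hypertextual"]

lexicon_blob = bytearray()
word_table = {}                         # word -> (offset into blob, word ID)
for word_id, word in enumerate(words):
    word_table[word] = (len(lexicon_blob), word_id)
    lexicon_blob += word.encode() + b"\x00"   # null-separated concatenation

offset, word_id = word_table["engine"]
print(lexicon_blob[offset:].split(b"\x00")[0], word_id)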
Google Data Structures
Hit Lists
– Each hit is encoded in two bytes (a bit-packing sketch follows this list)
– Two kinds: plain and fancy hits
– Plain hits include
• 1 capitalization bit
• 3 font-size bits
• 12 position bits
– Fancy hits (identified by 111 in the font-size bits) include
• 4 bits for hit type
• 8 bits for position
• Anchor-type hits instead have
– 4 bits for a hash of the document ID of the anchor
– 4 bits for position
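A sketch of packing a plain hit into two bytes with 1 capitalization bit, 3 font-size bits, and 12 position bits, as listed above; which end of the word each field occupies is an assumption.

def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 7           # 7 (binary 111) is reserved for fancy hits
    assert 0 <= position < 4096         # 12 bits of position
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 3, 120)
print(hex(hit), unpack_plain_hit(hit))  # 0xb078 (True, 3, 120)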
Google Data Structures
Forward Index
– Stored in 64 barrels
– Each barrel covers a range of word IDs
– A barrel entry holds a document ID and the list of word IDs that belong to that barrel
– Document IDs are duplicated across barrels
– Word IDs are stored as differences from the minimum ID for the barrel (24 bits per word; see the sketch below)
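A sketch of that word-ID encoding: store each word ID as its difference from the barrel's minimum word ID so it fits in 24 bits. The barrel boundary below is invented for illustration.

BARREL_MIN_WORD_ID = 5_000_000          # lowest word ID assigned to this barrel (assumed)

def encode_word_id(word_id):
    delta = word_id - BARREL_MIN_WORD_ID
    assert 0 <= delta < (1 << 24)       # must fit in 24 bits
    return delta

def decode_word_id(delta):
    return BARREL_MIN_WORD_ID + delta

print(decode_word_id(encode_word_id(5_003_210)))  # 5003210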
Google Data Structures
Inverted Index
– Same barrels as the forward index
– The lexicon points to the barrel and to a list of document IDs with their hit lists
Two sets of barrels
– One for title and link-anchor text
• Searched first
– One for the rest
• Used if there are not enough hits in the barrels above (see the sketch below)
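A sketch of the two-tier lookup: consult the title/anchor barrels first and fall back to the full-text barrels only when too few documents match. The dicts and the hit threshold are stand-ins for the on-disk barrels.

title_anchor_barrel = {"clinton": [1, 4]}          # word -> doc IDs (title/anchor hits)
full_text_barrel = {"clinton": [1, 2, 3, 4, 5]}    # word -> doc IDs (all hits)

def lookup(word, min_hits=3):
    docs = title_anchor_barrel.get(word, [])
    if len(docs) < min_hits:                       # not enough high-quality hits
        docs = full_text_barrel.get(word, [])
    return docs

print(lookup("clinton"))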
Crawling, Indexing, Searching
Web sites were surprised to be crawled in 1997
– Required people to deal with email questions
Parsing HTML is a challenge
– Does not use YACC to generate a CFG parser
• Too much overhead (too slow)
– Uses flex to generate a lexical analyzer (see the tokenizer sketch below)
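Not the actual flex grammar, just a sketch of the underlying idea: scan HTML as a flat token stream with a tolerant lexer rather than building a full CFG parse tree, so malformed markup cannot derail indexing.

import re

TOKEN = re.compile(r"<[^>]*>|[^<\s]+")   # a tag or a run of non-tag text

def tokens(html):
    for match in TOKEN.finditer(html):
        text = match.group()
        yield ("tag", text) if text.startswith("<") else ("word", text)

print(list(tokens("<p>broken <b>markup")))   # unclosed tags still tokenize cleanly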
Ranking
– Use a vector of count weights and a vector of type weights
– Bias towards proximity of the search terms
– Use PageRank to give a final rank (see the sketch below)
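A sketch of how the signals on this slide might be combined: per-type hit counts dotted with a type-weight vector, a proximity bonus, and PageRank folded in at the end. The weights and the way the scores are mixed are assumptions, not the actual formula.

TYPE_WEIGHTS = {"title": 8, "anchor": 6, "body": 1}    # assumed type weights

def ir_score(hit_counts, proximity_bonus):
    # dot product of per-type hit counts with type weights, plus a proximity term
    return sum(TYPE_WEIGHTS[t] * count for t, count in hit_counts.items()) + proximity_bonus

def final_rank(hit_counts, proximity_bonus, pagerank):
    return ir_score(hit_counts, proximity_bonus) * pagerank

print(final_rank({"title": 1, "body": 4}, proximity_bonus=3, pagerank=0.002))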