Google and Scalable Query Services
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
October 29, 2008
Administrivia
Next reading: Olston: Pig Latin
Please write a review of this paper
Please send me an email update of your project status by Tuesday 1 PM
Databases and Search
Eric Brewer, UC Berkeley professor and founder of Inktomi:
Database management systems are not the right tool for running a search engine
Why?
Scale-up – they don’t focus on thousands of machines
Overheads – SQL parsing, locking, …
Most of the features aren’t used – transactions, joins, …
So How Would One Build a Search System from the Ground Up?
Fortunately, some database PhD students at Stanford tried to do that…
Google Architecture [Brin/Page 98]
Focus was on scalability to the size of the Web
First to really exploit link analysis
Started as an academic project @ Stanford; became a startup
Our discussion will be on early Google – today they keep things secret!
Google: A Very Specialized DBMS
Commodity, cheap hardware
Unreliable
Not very powerful
A fair amount of memory, reasonable hard disks
Lots of racks
Special air conditioning, power systems, big net pipes
Special queries, no need for locking
Partitioning of service between different versions:
The version being crawled and fleshed out
The version being searched
(Really, different pieces can be crawled & updated at different times)
What Does Google Need to Do?
Scalable crawling of documents
Archival of documents (“cache”)
Inverted indexing
Duplicate removal
Ranking – requires iteration over link structure
PageRank (a sketch follows this list)
TF/IDF
Heuristics
Do the new Google services change any of that?
Some may not need the crawler, e.g., maps, perhaps Froogle
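The PageRank item above is what makes ranking require iteration over the link structure. Here is a minimal Python sketch of that power iteration over a made-up three-page link graph; the 0.85 damping factor follows the published PageRank paper, but the graph, iteration count, and dangling-page handling are illustrative choices, not Google's.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Toy graph: "a" links to "b" and "c", etc.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))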
The Heart of Google Storage
The main database: the Repository
Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
Useful for doing additional processing, any necessary rebuilds
Repository entry format (a packing sketch follows this list):
[DocID][ECode][UrlLen][PageLen][Url][Page]
The repository is indexed (not inverted here)
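A minimal Python sketch of writing and reading one repository entry in the [DocID][ECode][UrlLen][PageLen][Url][Page] layout above. The field widths are my own guesses for illustration (the slide doesn't pin them down); zlib compression is per the slide.

import struct
import zlib

# Guessed header widths: DocID (8 bytes), ECode (1), UrlLen (2), PageLen (4).
# The real on-disk format may differ.
HEADER = struct.Struct("<QBHI")

def pack_entry(doc_id, ecode, url, page_html):
    page = zlib.compress(page_html.encode())       # pages stored zlib-compressed
    url_bytes = url.encode()
    return HEADER.pack(doc_id, ecode, len(url_bytes), len(page)) + url_bytes + page

def unpack_entry(buf):
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf)
    off = HEADER.size
    url = buf[off:off + url_len].decode()
    page = zlib.decompress(buf[off + url_len:off + url_len + page_len]).decode()
    return doc_id, ecode, url, page

entry = pack_entry(27, 0, "http://example.com", "<html>hello</html>")
print(unpack_entry(entry))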
Repository Index
One index for looking up documents by DocID
Done in ISAM (think of this as a B+ tree without smart re-balancing)
Index points to repository entries (or to a URL entry if not crawled)
One index for mapping URL to DocID
Sorted by checksum of URL
Compute the checksum of the URL, then binary-search by checksum (sketched below)
Allows update by merge with another similar file
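A Python sketch of the URL-to-DocID side: entries kept sorted by a URL checksum and looked up with binary search. crc32 stands in for whatever checksum function the original system used, and collisions are ignored here for brevity.

import bisect
import zlib

def checksum(url):
    # Stand-in checksum; the original's function isn't specified on the slide.
    return zlib.crc32(url.encode())

class UrlIndex:
    def __init__(self, url_to_docid):
        # Keep (checksum, DocID) pairs sorted by checksum: this is what
        # enables both binary search and merge-based batch updates.
        pairs = sorted((checksum(u), d) for u, d in url_to_docid.items())
        self.sums = [c for c, _ in pairs]
        self.docids = [d for _, d in pairs]

    def lookup(self, url):
        c = checksum(url)
        i = bisect.bisect_left(self.sums, c)
        if i < len(self.sums) and self.sums[i] == c:
            return self.docids[i]
        return None                     # not crawled yet

idx = UrlIndex({"http://a.example": 1, "http://b.example": 2})
print(idx.lookup("http://b.example"))   # -> 2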
Lexicon
The list of searchable words
(Presumably, today it’s used to suggest alternative words as well)
The “root” of the inverted index
As of 1998, 14 million “words”
Kept in memory (was 256 MB)
Two parts (see the sketch after this list):
Hash table of pointers to words and the “barrels” (partitions) they fall into
List of words (null-separated)
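A toy Python rendering of that two-part layout: one contiguous null-separated string of words, plus a hash table mapping each word to its offset and its barrel. The barrel-assignment rule here is invented for illustration.

words = ["apple", "banana", "cherry"]

# Part 2: the null-separated word list, one compact in-memory string.
word_blob = "\0".join(words) + "\0"

# Part 1: hash table mapping each word to (offset into blob, barrel id).
lexicon = {}
offset = 0
for w in words:
    lexicon[w] = (offset, hash(w) % 4)   # barrel id: toy rule, not the real one
    offset += len(w) + 1                 # +1 for the null separator

off, barrel = lexicon["banana"]
print(word_blob[off:off + len("banana")], "-> barrel", barrel)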
Indices – Inverted and “Forward”
Inverted index divided into “barrels” / “shards” (partitions); a toy barrel sketch follows the figure below
Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
Forward index uses the same barrels
Used to find multi-word queries with words in same barrel
Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
Two barrels: short (anchor and title); full (all text)
[Figure: the lexicon (293 MB) of (WordID, ndocs) entries points into inverted barrels (41 GB) holding, per document, a DocID, nhits, and the hit list; forward barrels (43 GB total) hold, per DocID, a null-terminated list of WordIDs, each with its nhits and hits. Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]
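An in-memory Python toy of one inverted barrel, mirroring the figure's layout (per WordID, a posting list of documents, each with its hits); the DocIDs, WordIDs, and positions are made up.

from collections import defaultdict

# wordID -> posting list of (DocID, hit list); ndocs and nhits fall out
# as the lengths of these lists.
barrel = defaultdict(list)

def index_document(doc_id, word_positions):
    """word_positions: dict of wordID -> positions of that word in the doc."""
    for word_id, positions in word_positions.items():
        barrel[word_id].append((doc_id, positions))

index_document(27, {24: [1, 5, 9], 31: [2]})
index_document(28, {24: [4]})
print(barrel[24])   # [(27, [1, 5, 9]), (28, [4])]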
Hit Lists (Not Mafia-Related)
Used in inverted and forward indices
Goal was to minimize the size – the bulk of the data is in hit entries
For the 1998 version, they made it down to 2 bytes per hit (though that’s likely climbed since then):
Plain hit: cap: 1, font: 3, position: 12
vs. Fancy hit: cap: 1, font: 7, type: 4, position: 8
special-cased (for anchors) to: cap: 1, font: 7, type: 4, hash: 4, pos: 4
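A Python sketch of packing one plain hit into those 2 bytes, following the 1/3/12-bit split above; fancy and anchor hits would reuse the same 16 bits, with the font field set to 7 acting as the flag.

def pack_plain_hit(cap, font, position):
    # cap: 1 bit, font: 3 bits (value 7 reserved to mean "fancy"), position: 12 bits
    assert 0 <= font < 7 and 0 <= position < 4096
    return ((cap & 1) << 15) | ((font & 0x7) << 12) | (position & 0xFFF)

def unpack_plain_hit(hit):
    return (hit >> 15) & 1, (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(cap=1, font=3, position=1000)
print(hex(h), unpack_plain_hit(h))   # fits in 2 bytes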
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
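Step 4 is essentially a k-way merge-intersection of sorted doclists. A minimal Python sketch of just that step, on made-up DocIDs:

def intersect_doclists(doclists):
    """doclists: one sorted list of DocIDs per query word."""
    iters = [iter(dl) for dl in doclists]
    current = [next(it, None) for it in iters]
    results = []
    while all(d is not None for d in current):
        hi = max(current)
        if all(d == hi for d in current):
            results.append(hi)                 # document matches every term
            current = [next(it, None) for it in iters]
        else:
            # advance only the lists that are behind the frontier
            current = [next(it, None) if d < hi else d
                       for d, it in zip(current, iters)]
    return results

print(intersect_doclists([[1, 3, 7, 9], [3, 4, 7], [2, 3, 7, 8]]))   # [3, 7]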
Ranking in Google
Considers many types of information:
Position, font size, capitalization
Anchor text
PageRank
Done offline, in a non-query-sensitive way
Count of occurrences (basically, TF) in a way that tapers off (a toy sketch follows this list)
Multi-word queries consider proximity also
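A toy Python rank function in the spirit of this slide: a term-count score that tapers off, combined with a query-independent PageRank. The log taper, the count cap, and the 0.7/0.3 weights are all illustrative choices, not Google's.

import math

def tapered_tf(count, cap=10):
    # Grows with the count but flattens out, so spamming a word stops helping.
    return math.log1p(min(count, cap))

def score(doc, query_terms, pagerank, w_ir=0.7, w_pr=0.3):
    ir = sum(tapered_tf(doc["counts"].get(t, 0)) for t in query_terms)
    return w_ir * ir + w_pr * pagerank[doc["id"]]

doc = {"id": 27, "counts": {"penn": 3, "database": 12}}
print(score(doc, ["penn", "database"], pagerank={27: 0.8}))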
Why Isn’t Google Based on a DBMS?
Transactional locking is not necessary anywhere in the system
Helps with partitioning and replication
Main memory indexing on lexicon
Unusual query model – what’s special here?
Weird consistency model!
OK if different users see different views
As long as we route same user to same machine(s), we’re OK
Updates are happening in a separate “instance”
Slipstream it in place
Can even extend this to change versions of software on the machines – as long as interfaces stay the same
Could We Change a DBMS?
What would a DBMS for Google-like environments look like?
What would it be useful for, other than Google?
Beyond Google
What if we wanted to:
Add on-the-fly query capabilities to Google?
e.g., query over up-to-the-second stock market results
Use WordNet or some thesaurus to supplement Google?
Do PageRank in a topic-specific way?
Supplement Google with “ontology” info?
Do some sort of XML path matching along with keywords?
Allow for OLAP-style analysis?
Do a cooperative, e.g., P2P, Google?
Benefits of this?
Beyond Search…
… Can we think about programming custom, highly parallel apps on a Google-like cluster?
“Cloud computing”
MapReduce / Hadoop (a toy sketch follows this list)
Pig Latin
Dryad
Sawzall
EC2
Windows Azure
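To make the list concrete, here is a single-process Python imitation of the classic MapReduce word-count example; real MapReduce/Hadoop distributes the map, shuffle, and reduce phases across a cluster.

from collections import defaultdict
from itertools import chain

def map_fn(doc):
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    # "Shuffle": group all mapped (key, value) pairs by key.
    for k, v in chain.from_iterable(map_fn(d) for d in docs):
        groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(mapreduce(["the quick fox", "the lazy dog"]))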