Google and Scalable Query Services
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
October 29, 2008
Administrivia
Next reading: Olston: Pig Latin
Please write a review of this paper
Please send me an email update of your project status by Tuesday 1 PM
Databases and Search
Eric Brewer, UC Berkeley professor and founder of Inktomi:
Database management systems are not the right tool for running a search engine
Why?
Scale-up – they don’t focus on thousands of machines
Overheads – SQL parsing, locking, …
Most of the features aren’t used – transactions, joins, …
So How Would One Build a Search System from the Ground Up?
Fortunately, some database PhD students at Stanford tried to do that…
Google Architecture [Brin/Page 98]
Focus was on scalability to the size of the Web
First to really exploit link analysis
Started as an academic project @ Stanford; became a startup
Our discussion will be on early Google – today they keep things secret!
Google: A Very Specialized DBMS
Commodity, cheap hardware
Unreliable
Not very powerful
A fair amount of memory, reasonable hard disks
Lots of racks
Special air conditioning, power systems, big net pipes
Special queries, no need for locking
Partitioning of service between different versions:
The version being crawled and fleshed out
The version being searched
(Really, different pieces can be crawled & updated at different times)
What Does Google Need to Do?
Scalable crawling of documents
Archival of documents (“cache”)
Inverted indexing
Duplicate removal
Ranking – requires iteration over link structure
PageRank (a sketch follows this list)
TF/IDF
Heuristics
Do the new Google services change any of that?
Some may not need the crawler, e.g., maps, perhaps Froogle
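The PageRank item above is what makes ranking require iteration over the link structure. Here is a minimal Python sketch of that power iteration over a made-up three-page link graph; the 0.85 damping factor follows the published PageRank paper, but the graph, iteration count, and dangling-page handling are illustrative choices, not Google's.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                       # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Toy graph: "a" links to "b" and "c", etc.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))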
The Heart of Google Storage
The main database: the Repository
Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
Useful for doing additional processing, any necessary rebuilds
Repository entry format (a packing sketch follows this list):
[DocID][ECode][UrlLen][PageLen][Url][Page]
The repository is indexed (not inverted here)
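A minimal Python sketch of writing and reading one repository entry in the [DocID][ECode][UrlLen][PageLen][Url][Page] layout above. The field widths are my own guesses for illustration (the slide doesn't pin them down); zlib compression is per the slide.

import struct
import zlib

# Guessed header widths: DocID (8 bytes), ECode (1), UrlLen (2), PageLen (4).
# The real on-disk format may differ.
HEADER = struct.Struct("<QBHI")

def pack_entry(doc_id, ecode, url, page_html):
    page = zlib.compress(page_html.encode())       # pages stored zlib-compressed
    url_bytes = url.encode()
    return HEADER.pack(doc_id, ecode, len(url_bytes), len(page)) + url_bytes + page

def unpack_entry(buf):
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf)
    off = HEADER.size
    url = buf[off:off + url_len].decode()
    page = zlib.decompress(buf[off + url_len:off + url_len + page_len]).decode()
    return doc_id, ecode, url, page

entry = pack_entry(27, 0, "http://example.com", "<html>hello</html>")
print(unpack_entry(entry))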
Repository Index
One index for looking up documents by DocID
Done in ISAM (think of this as a B+ tree without smart re-balancing)
Index points to repository entries (or to a URL entry if not crawled)
One index for mapping URL to DocID
Sorted by checksum of URL
Compute the checksum of the URL, then binary-search by checksum (sketched below)
Allows update by merge with another similar file
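A Python sketch of the URL-to-DocID side: entries kept sorted by a URL checksum and looked up with binary search. crc32 stands in for whatever checksum function the original system used, and collisions are ignored here for brevity.

import bisect
import zlib

def checksum(url):
    # Stand-in checksum; the original's function isn't specified on the slide.
    return zlib.crc32(url.encode())

class UrlIndex:
    def __init__(self, url_to_docid):
        # Keep (checksum, DocID) pairs sorted by checksum: this is what
        # enables both binary search and merge-based batch updates.
        pairs = sorted((checksum(u), d) for u, d in url_to_docid.items())
        self.sums = [c for c, _ in pairs]
        self.docids = [d for _, d in pairs]

    def lookup(self, url):
        c = checksum(url)
        i = bisect.bisect_left(self.sums, c)
        if i < len(self.sums) and self.sums[i] == c:
            return self.docids[i]
        return None                     # not crawled yet

idx = UrlIndex({"http://a.example": 1, "http://b.example": 2})
print(idx.lookup("http://b.example"))   # -> 2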
Lexicon
The list of searchable words
(Presumably, today it’s used to suggest alternative words as well)
The “root” of the inverted index
As of 1998, 14 million “words”
Kept in memory (was 256 MB)
Two parts (see the sketch after this list):
Hash table of pointers to words and the “barrels” (partitions) they fall into
List of words (null-separated)
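A toy Python rendering of that two-part layout: one contiguous null-separated string of words, plus a hash table mapping each word to its offset and its barrel. The barrel-assignment rule here is invented for illustration.

words = ["apple", "banana", "cherry"]

# Part 2: the null-separated word list, one compact in-memory string.
word_blob = "\0".join(words) + "\0"

# Part 1: hash table mapping each word to (offset into blob, barrel id).
lexicon = {}
offset = 0
for w in words:
    lexicon[w] = (offset, hash(w) % 4)   # barrel id: toy rule, not the real one
    offset += len(w) + 1                 # +1 for the null separator

off, barrel = lexicon["banana"]
print(word_blob[off:off + len("banana")], "-> barrel", barrel)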
Indices – Inverted and “Forward”
Inverted index divided into “barrels” / “shards” (partitions); a toy barrel sketch follows the figure below
Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
Forward index uses the same barrels
Used to find multi-word queries with words in same barrel
Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
Two barrels: short (anchor and title); full (all text)
[Figure: the lexicon (293 MB) of (WordID, ndocs) entries points into inverted barrels (41 GB) holding, per document, a DocID, nhits, and the hit list; forward barrels (43 GB total) hold, per DocID, a null-terminated list of WordIDs, each with its nhits and hits. Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]
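An in-memory Python toy of one inverted barrel, mirroring the figure's layout (per WordID, a posting list of documents, each with its hits); the DocIDs, WordIDs, and positions are made up.

from collections import defaultdict

# wordID -> posting list of (DocID, hit list); ndocs and nhits fall out
# as the lengths of these lists.
barrel = defaultdict(list)

def index_document(doc_id, word_positions):
    """word_positions: dict of wordID -> positions of that word in the doc."""
    for word_id, positions in word_positions.items():
        barrel[word_id].append((doc_id, positions))

index_document(27, {24: [1, 5, 9], 31: [2]})
index_document(28, {24: [4]})
print(barrel[24])   # [(27, [1, 5, 9]), (28, [4])]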
Hit Lists (Not Mafia-Related)
Used in inverted and forward indices
Goal was to minimize the size – the bulk of the data is in hit entries
For the 1998 version, they made it down to 2 bytes per hit (though that’s likely climbed since then):
Plain hit: cap: 1, font: 3, position: 12
vs. Fancy hit: cap: 1, font: 7, type: 4, position: 8
special-cased (for anchors) to: cap: 1, font: 7, type: 4, hash: 4, pos: 4
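A Python sketch of packing one plain hit into those 2 bytes, following the 1/3/12-bit split above; fancy and anchor hits would reuse the same 16 bits, with the font field set to 7 acting as the flag.

def pack_plain_hit(cap, font, position):
    # cap: 1 bit, font: 3 bits (value 7 reserved to mean "fancy"), position: 12 bits
    assert 0 <= font < 7 and 0 <= position < 4096
    return ((cap & 1) << 15) | ((font & 0x7) << 12) | (position & 0xFFF)

def unpack_plain_hit(hit):
    return (hit >> 15) & 1, (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(cap=1, font=3, position=1000)
print(hex(h), unpack_plain_hit(h))   # fits in 2 bytes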
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
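Step 4 is essentially a k-way merge-intersection of sorted doclists. A minimal Python sketch of just that step, on made-up DocIDs:

def intersect_doclists(doclists):
    """doclists: one sorted list of DocIDs per query word."""
    iters = [iter(dl) for dl in doclists]
    current = [next(it, None) for it in iters]
    results = []
    while all(d is not None for d in current):
        hi = max(current)
        if all(d == hi for d in current):
            results.append(hi)                 # document matches every term
            current = [next(it, None) for it in iters]
        else:
            # advance only the lists that are behind the frontier
            current = [next(it, None) if d < hi else d
                       for d, it in zip(current, iters)]
    return results

print(intersect_doclists([[1, 3, 7, 9], [3, 4, 7], [2, 3, 7, 8]]))   # [3, 7]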
Ranking in Google
Considers many types of information:
Position, font size, capitalization
Anchor text
PageRank
Done offline, in a non-query-sensitive way
Count of occurrences (basically, TF) in a way that tapers off (a toy sketch follows this list)
Multi-word queries consider proximity also
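A toy Python rank function in the spirit of this slide: a term-count score that tapers off, combined with a query-independent PageRank. The log taper, the count cap, and the 0.7/0.3 weights are all illustrative choices, not Google's.

import math

def tapered_tf(count, cap=10):
    # Grows with the count but flattens out, so spamming a word stops helping.
    return math.log1p(min(count, cap))

def score(doc, query_terms, pagerank, w_ir=0.7, w_pr=0.3):
    ir = sum(tapered_tf(doc["counts"].get(t, 0)) for t in query_terms)
    return w_ir * ir + w_pr * pagerank[doc["id"]]

doc = {"id": 27, "counts": {"penn": 3, "database": 12}}
print(score(doc, ["penn", "database"], pagerank={27: 0.8}))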
Why Isn’t Google Based on a DBMS?
Transactional locking is not necessary anywhere in the system
Helps with partitioning and replication
Main memory indexing on lexicon
Unusual query model – what’s special here?
Weird consistency model!
OK if different users see different views
As long as we route same user to same machine(s), we’re OK
Updates are happening in a separate “instance”
Slipstream it in place
Can even extend this to change versions of software on the machines – as long as interfaces stay the same
Could We Change a DBMS?
What would a DBMS for Google-like environments look like?
What would it be useful for, other than Google?
Beyond Google
What if we wanted to:
Add on-the-fly query capabilities to Google?
e.g., query over up-to-the-second stock market results
Use WordNet or some thesaurus to supplement Google?
Do PageRank in a topic-specific way?
Supplement Google with “ontology” info?
Do some sort of XML path matching along with keywords?
Allow for OLAP-style analysis?
Do a cooperative, e.g., P2P, Google?
Benefits of this?
Beyond Search…
… Can we think about programming custom, highly parallel apps on a Google-like cluster?
“Cloud computing”
MapReduce / Hadoop (a toy sketch follows this list)
Pig Latin
Dryad
Sawzall
EC2
Windows Azure
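To make the list concrete, here is a single-process Python imitation of the classic MapReduce word-count example; real MapReduce/Hadoop distributes the map, shuffle, and reduce phases across a cluster.

from collections import defaultdict
from itertools import chain

def map_fn(doc):
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    # "Shuffle": group all mapped (key, value) pairs by key.
    for k, v in chain.from_iterable(map_fn(d) for d in docs):
        groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(mapreduce(["the quick fox", "the lazy dog"]))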