Implementation Issues
Arjen P. de Vries
Centrum voor Wiskunde en Informatica
Amsterdam
[email protected]
Overview
• Indexing collections
• Some text statistics
• File organization
– Inverted files
– Signature files
• Search engines
• Indexing XML
Supporting the Search Process
[Diagram: the user side of the search process as a loop — source selection (predict, nominate, choose), query formulation, query, search by the IR system, ranked list, selection, document examination, and document delivery — with query reformulation / relevance feedback and source reselection feeding back into earlier steps.]
Supporting the Search Process
[Diagram: the same process with the system side added — document acquisition feeds a document collection, which is indexed to build the index used by the search step.]
Some Questions for Today
• How long will it take to find a document?
– Is there any work we can do in advance?
• If so, how long will that take?
• How big a computer will I need?
– How much disk space? How much RAM?
• What if more documents arrive?
– How much of the advance work must be repeated?
– Will searching become slower?
– How much more disk space will be needed?
Text Statistics
Frequent words on the WWW
term         count
the          65002930
a            62789720
to           60857930
of           57248022
and          54078359
in           52928506
s            50686940
for          49986064
on           45999001
this         42205245
is           41203451
by           39779377
with         35439894
or           35284151
at           34446866
all          33528897
are          31583607
from         30998255
e            30755410
you          30080013
be           29669506
that         29417504
not          28542378
an           28162417
as           28110383
home         28076530
it           27650474
i            27572533
have         24548796
if           24420453
new          24376758
t            24171603
your         23951805
page         23875218
about        22292805
com          22265579
information  22107392
will         21647927
can          21368265
more         21367950
has          21102223
no           20621335
other        19898015
one          19689603
c            19613061
d            19394862
m            19279458
was          19199145
copyright    19075253
us           18636563
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
Plotting Word Frequency by Rank
• Main idea:
– Count how many times tokens occur in the text
• Sum over all of the texts in the collection
• Now order these tokens according to how
often they occur (highest to lowest)
• This is called the rank
Zipf Distribution
• The product of the frequency of words (f) and their
rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence
f  C 1 / r
C  N / 10
• Another way to state this is with an approximately
correct rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
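A minimal sketch (not from the slides) for checking this rule of thumb on a collection, using only Python's standard library; the file name is a placeholder:

from collections import Counter
import re

# Count how many times each token occurs, summed over the whole collection.
text = open("collection.txt", encoding="utf-8").read().lower()
counts = Counter(re.findall(r"[a-z]+", text))

# Order the tokens from most to least frequent; the position in this order is the rank.
for rank, (token, freq) in enumerate(counts.most_common(20), start=1):
    # Under the Zipf distribution, freq * rank stays roughly constant (≈ C).
    print(f"{rank:4d}  {token:15s}  {freq:10d}  f*r = {freq * rank}")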
Zipf Distribution
[Plots of the Zipf distribution on a linear scale and on a logarithmic scale. Illustration by Jacob Nielsen.]
What has a Zipf Distribution?
• Words in a text collection
– Virtually any use of natural language
• Library book checkout patterns
• Incoming Web Page Requests (Nielsen)
• Outgoing Web Page Requests (Cunha &
Crovella)
• Document Size on Web (Cunha & Crovella)
Related Distributions/Laws
• For more examples, see Robert A. Fairthorne, “Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction,” Journal of Documentation, 25(4), 319-341
• Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity, market prices and communication errors; etc.
Consequences of Zipf
• There are always a few very frequent tokens that are
not good discriminators.
– Called “stop words” in IR
– Usually correspond to linguistic notion of “closed-class”
words
• English examples: to, from, on, and, the, ...
• Grammatical classes that don’t take on new members.
• There are always a large number of tokens that occur
once (and can have unexpected consequences with
some IR algorithms)
• Medium-frequency words are the most descriptive
Word Frequency vs. Resolving Power
(from Van Rijsbergen 79)
The most frequent words are not the most descriptive.
File Organization
File Structures for IR
• Lexicographical indices
– indices that are sorted
– inverted files
– Patricia (PAT) trees (Suffix trees and arrays)
• Indices based on hashing
– signature files
Document Vectors
[Table, built up over three slides: a term-document matrix for documents A through I over the terms nova, galaxy, heat, h'wood, film, role, diet, and fur. Each cell holds how often the term occurs in that document; a blank cell means 0 occurrences. For example, “Nova” occurs 10 times in text A, “Galaxy” 5 times in text A, “Heat” 3 times in text A; “Hollywood” occurs 7 times in text I, “Film” 5 times, “Diet” 1 time, and “Fur” 3 times in text I.]
Inverted Index
• The primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the
collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data
structure
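As a small illustration of these steps (a sketch only, using the two example documents from the following slides), an in-memory inverted index in Python:

from collections import defaultdict
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# "Dictionary" of tokens -> postings; each posting records (doc id, within-document frequency).
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z]+", text.lower()):
        index[token][doc_id] += 1

print(sorted(index["time"].items()))   # [(1, 1), (2, 1)]
print(sorted(index["the"].items()))    # [(1, 2), (2, 2)]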
Inverted Indexes
• An Inverted File is a vector file “inverted” so
that rows become columns and columns
become rows
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10
t1     1   1   0   1   1   1   0   0   0   0
t2     0   0   1   0   1   1   1   1   0   1
t3     1   0   1   0   1   0   0   0   1   1
How Inverted Files Are Created
• Documents are parsed to extract
tokens
• Save token with Document ID
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
(Term, Doc #) pairs, in order of appearance:
Doc 1: now, is, the, time, for, all, good, men, to, come, to, the, aid, of, their, country (each paired with Doc # 1)
Doc 2: it, was, a, dark, and, stormy, night, in, the, country, manor, the, time, was, past, midnight (each paired with Doc # 2)
What are the tokens?
• Stemming?
• Case folding?
• Thesauri and/or soundex?
• Spelling correction?
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically
[Left: the unsorted (Term, Doc #) list from the previous slide; right: the same list sorted by term:]

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2
How Inverted Files are Created
• Multiple term entries for a single document are merged
• Within-document term frequency information is compiled
[Left: the sorted (Term, Doc #) list; right: the merged list with within-document frequencies:]

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are Created
[The merged (Term, Doc #, Freq) list is split into a dictionary and a postings file. The dictionary stores each term once, with the number of documents containing it (N docs), its total frequency (Tot Freq), and a pointer into the postings file; the postings file stores the (Doc #, Freq) pairs:]

Dictionary
Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Postings
Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2
Searching
Three general steps
• Vocabulary search
– Identify the words and patterns in the query
– Search them in the vocabulary
• Retrieval of occurrences
– Retrieve the lists of occurrences of all the
words
• Manipulation of occurrences
– Solve phrases, proximity, or Boolean
operations
– Find the exact word positions when block
addressing is used
How Inverted Files are Used
[The dictionary and postings file constructed on the previous slides, used to answer a Boolean query:]
Boolean Query on
“time” AND “dark”
2 docs with “time” in
dictionary ->
IDs 1 and 2 from
posting file
1 doc with “dark” in
dictionary ->
ID 2 from posting
file
Therefore, only doc 2
satisfied the query.
Query optimization
• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its
term-doc incidence from the postings, then
AND together.
• Process in order of increasing freq:
– start with smallest set,
then keep cutting further.
Keep tf in the dictionary!
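A sketch of this optimization in Python (assuming each term's postings are available as a list of document ids):

def and_query(postings, terms):
    """Intersect the postings of all query terms, rarest term first."""
    # Process terms in order of increasing frequency: start with the smallest set.
    lists = sorted((postings[t] for t in terms), key=len)
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)      # keep cutting the candidate set further
        if not result:
            break                 # an empty intersection cannot grow again
    return sorted(result)

postings = {"time": [1, 2], "dark": [2], "country": [1, 2]}
print(and_query(postings, ["time", "dark"]))   # [2]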
Processing for ranking systems
For each query term
    for each document in the term's inverted list
        augment similarity coefficient
For each document
    finish calculation of similarity coefficient
Perform sort of similarity coefficients
Retrieve and present documents
Processing Vector Space Queries (I)
Set A ← {}   (A is the set of accumulators)
For each query term t ∈ Q:
    Stem t
    Search the lexicon, record f_t and the address of I_t
    Set w_t ← 1 + log_e(N / f_t)

Processing VS Queries (II)
    Read the inverted file entry I_t
    For each (d, f_{d,t}) pair in I_t:
        If A_d ∉ A then
            set A_d ← 0
            set A ← A + {A_d}
        Set A_d ← A_d + log_e(1 + f_{d,t}) × w_t
For each A_d ∈ A:
    Set A_d ← A_d / W_d   (where W_d is the weight of document d)
A_d is now proportional to the value cosine(Q, D_d); a runnable sketch follows.
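The same accumulator algorithm as a minimal Python sketch; the document weights W_d and the index contents are assumed to be given:

import math
from collections import defaultdict

def rank(query_terms, lexicon, inverted_file, doc_weights, N):
    """lexicon[t] = f_t (document frequency); inverted_file[t] = list of (d, f_dt) pairs."""
    A = defaultdict(float)                      # accumulators, one per candidate document
    for t in query_terms:
        if t not in lexicon:
            continue
        w_t = 1 + math.log(N / lexicon[t])      # query-term weight
        for d, f_dt in inverted_file[t]:
            A[d] += math.log(1 + f_dt) * w_t    # this term's contribution to document d
    for d in A:
        A[d] /= doc_weights[d]                  # normalise by the document weight W_d
    return sorted(A.items(), key=lambda item: -item[1])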
How Large is the Dictionary?
• Heaps’ law: the vocabulary grows as O(n^β), with β: 0.4~0.6
[Plot: number of unique terms (on the order of 1.0e6–2.0e6) versus the number of documents.]
Term Storage in Dictionary
• Store pointers to every kth term in one long concatenated term string.
• Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
[Table: dictionary entries with Freq. (33, 29, 44, 126, 7), a postings ptr. for each, and a term ptr. only for every kth entry.]
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Searching the Dictionary
• Sorted Arrays
– store the list of keywords in a sorted array
– using a standard binary search
– advantage: easy to implement
– disadvantage: updating the index is expensive
• Hashing Structures
• B-Trees, Tries, Pat(ricia) Trees, …
• Combinations of these structures
The lexicon as a Hash Table
• Store terms in a Hash table:
– Map the M terms into a set of m integer values 0 <= h(ti)
<= m-1 (m << M).
– A simple hash function for strings:
h(t) = ( Σ_{i=1..|t|} t_i × w_i ) mod m
– Retrieval time: O(|t|)
– Cons:
• collisions vs. space
• Cannot handle wildcards
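A sketch of such a string hash in Python; the per-position weights w_i are an arbitrary choice here:

def h(term, m):
    """Hash a term to a bucket in [0, m): weighted sum of character codes, mod m."""
    return sum(ord(c) * w for w, c in enumerate(term, start=1)) % m   # w_i = i

print(h("labor", 1009), h("labour", 1009))   # bucket numbers; different terms may collide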
Tries
• A structure constructed from strings
• Name comes from the word retrieval
• Used for string search in normal text
• Each edge represents a character of the string, including the terminator $
• Leaves contain the keys
TRIE - Non-compact
Each character is stored in one edge of the tree.
Consider 5 character strings: BIG, BIGGER, BILL, GOOD, GONG
[Trie diagram: one character per edge (including the terminator $); the leaves hold the keys BIG$, BIGGER$, BILL$, GOOD$, GONG$.]
TRIE - Properties
• An internal node can have from 1 to d
children where d is the size of the alphabet.
• A path from the root of T to an internal node
i corresponds to an i-character prefix of a
string S
• The height of the tree is the length of the
longest string
• If there are S unique strings, T has S external
nodes
• Looking up a string of length M is O(M)
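A minimal (non-compact) trie in Python, using the '$' terminator described above:

def trie_insert(root, key):
    node = root
    for ch in key + "$":                 # one edge per character; '$' ends the string
        node = node.setdefault(ch, {})
    node["key"] = key + "$"              # the leaf holds the key

def trie_lookup(root, key):
    node = root
    for ch in key + "$":                 # O(M) for a string of length M
        if ch not in node:
            return None
        node = node[ch]
    return node.get("key")

root = {}
for s in ["BIG", "BIGGER", "BILL", "GOOD", "GONG"]:
    trie_insert(root, s)
print(trie_lookup(root, "BILL"), trie_lookup(root, "GO"))   # BILL$ None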
Compact Trie
Remove chains leading to leaves
[Diagram: the same trie with every chain that leads to a leaf collapsed into a single edge; the leaves still hold BIG$, BIGGER$, BILL$, GOOD$, GONG$.]
PATRICIA (PAT TREE)
Practical Algorithm To Retrieve Information Coded In Alphabets - introduced
by D.R. Morrison in October 1968
Collapse all unary nodes
[Diagram: the PATRICIA tree for the same five strings; all unary nodes are collapsed, so edges carry substrings such as BI and GO, and the leaves hold BIG$, BIGGER$, BILL$, GOOD$, GONG$.]
PATRICIA - Comments
• Number of nodes proportional to number of strings
• Strings on edges become variable length
• Use an auxiliary data structure to actually store the
strings
• The TRIE only stores triplets of numbers indicating
where in the auxiliary data structure to look
Partially specified query (*)
• Using n-grams
– E.g., decompose labor into la-ab-bo-or
– Mark begin/end of word with special char ($)
– Labor ->$l-la-ab-bo-or-r$
• Answers to lab*r are obtained via
$l and la and ab and r$
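A sketch of the n-gram approach in Python, over the small lexicon shown on the next slide; the bigram postings give candidate terms, which must then be checked against the actual pattern (see the false-match discussion that follows):

from collections import defaultdict
import fnmatch

terms = ["abhor", "bear", "laaber", "labor", "laborator", "labour", "lavacaber", "slab"]

def bigrams(term):
    padded = "$" + term + "$"                       # mark begin/end of word with '$'
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

index = defaultdict(set)                            # bigram -> set of term ids
for tid, term in enumerate(terms, start=1):
    for bg in bigrams(term):
        index[bg].add(tid)

# lab*r: intersect the postings of the bigrams of the fixed parts: $l, la, ab, r$
candidates = index["$l"] & index["la"] & index["ab"] & index["r$"]
print(sorted(terms[i - 1] for i in candidates))     # includes false matches such as laaber
# post-filter the candidates against the actual pattern to remove false matches
print([terms[i - 1] for i in sorted(candidates) if fnmatch.fnmatch(terms[i - 1], "lab*r")])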
Use of bigrams

Number  Term
1       Abhor
2       Bear
3       Laaber
4       Labor
5       Laborator
6       Labour
7       Lavacaber
8       Slab

Bigram  Term id
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      5
r$      1,2,3,4,5,6,7
sl      8
Problems with bi-grams
• Query: $l and la and ab and r$
– might produce false matches:
– in this example, it will match laaber and lavacaber (from the TREC collection)
– Why: all bigrams are present, but not consecutive
• What about labrador?
– Right answer but not what was intended
– Wildcards with n-grams are risky!
• False matches can be checked with additional exact
pattern matcher
Can we support wildcards with a TRIE?
• Easy to answer queries with wildcard tails
– E.g., lab*, labor*, …
• What about lab*r or *bour?
• Use a rotated lexicon
– Store: $labor, labor$, abor$l, bor$la, or$lab, r$labo
• To search for lab*r
– rotate until wildcard is at end: r$lab* then
– search for strings that have this prefix
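A sketch of the rotated lexicon in Python: store every rotation of each '$'-padded term, then answer lab*r by rotating the pattern to r$lab* and doing a prefix search:

import bisect

terms = ["abhor", "bear", "laaber", "labor", "laborator", "labour", "lavacaber", "slab"]

rotated = []                                      # (rotation, term id) pairs
for tid, term in enumerate(terms, start=1):
    padded = "$" + term
    for k in range(len(padded)):
        rotated.append((padded[k:] + padded[:k], tid))
rotated.sort()
keys = [r for r, _ in rotated]

def wildcard(prefix):
    """Return the terms whose rotated form starts with `prefix`."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return sorted({terms[tid - 1] for _, tid in rotated[lo:hi]})

print(wildcard("r$lab"))   # lab*r  ->  ['labor', 'laborator', 'labour']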
Rotated Lexicon

id  Term
1   Abhor
2   Bear
3   Laaber
4   Labor
5   Laborator
6   Labour
7   Lavacaber
8   Slab

Rotated      Add     Rotated      Add
$abhor       (1,0)   aborator$l   (5,2)
$bear        (2,0)   abour$l      (6,2)
$laaber      (3,0)   aber$lavac   (7,6)
$labor       (4,0)   ab$sl        (8,3)
$laborator   (5,0)   r$abho       (1,5)
$labour      (6,0)   r$bea        (2,4)
$lavacaber   (7,0)   r$laabe      (3,6)
$slab        (8,0)   r$labo       (4,5)
aaber$l      (3,2)   r$laborato   (5,9)
abhor$       (1,1)   r$labou      (6,6)
aber$la      (3,3)   r$lavacabe   (7,9)
abor$l       (4,2)   slab$        (8,1)

(Add = address: term id and offset of the rotation within the padded term.)
What Goes in a Postings File?
• Boolean retrieval
– Just the document number
• Ranked Retrieval
– Document number and term weight (TF*IDF, ...)
• Proximity operators
– Word offsets for each occurrence of the term
• Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
How Big Is the Postings File?
• Very compact for Boolean retrieval
– About 10% of the size of the documents
• If an aggressive stopword list is used!
• Not much larger for ranked retrieval
– Perhaps 20%
• Enormous for proximity operators
– Sometimes larger than the documents!
[Index size as a percentage of the collection, for different addressing granularities and collection sizes:]

Indexing               Small collection  Medium collection  Large collection
                       (1 MB)            (200 MB)           (2 GB)
Addressing words       45%   73%         36%   64%          35%   63%
Addressing documents   19%   26%         18%   32%          26%   47%
Addressing 64k blocks  27%   41%         18%   32%          5%    9%
Addressing 256 blocks  18%   25%         1.7%  2.4%         0.5%  0.7%

In each cell, the first figure assumes stop words are not indexed, the second that all words are indexed.
Addressing words: full inversion (all words, exact positions, 4-byte pointers).
Addressing documents: document size 10KB; 1, 2 or 3 bytes per pointer, depending on text size.
Addressing blocks: 2 or 1 byte(s) per pointer, independent of the text size.
Inverted File Compression
• Key idea, each postings list can be stored as
an ascending sequence of integers
• Example: if t occurs in 8 documents, list the
postings in ascending doc id
– <8;3,5,20,21,23,76,77,78>
• Storing the gaps would change to
<8; [3,2,15,1,2,53,1,1]>
Inverted File Compression
• More formally, if t appears in n_t docs:
– <n_t; d_1, d_2, …, d_{n_t}>
– where d_k < d_{k+1}
• The postings can be equivalently stored as an initial position followed by a list of d-gaps: d_{k+1} - d_k
• Example
– From <8;3,5,20,21,23,76,77,78>
– To <8;3,2,15,1,2,53,1,1>
Inverted File Compression
• No information loss, but …
… where does the saving come from?
– The largest d-gap in the 2nd rep is potentially the
same as the largest document id in the 1st
– If there are N documents, and a fixed binary
encoding is used, both methods require log N bits
per any stored pointer…
Intuition
• Frequent words have d-gaps significantly
smaller than rare words
• Variable-length representations are considered, in which small values are more likely and are coded more economically than large ones
• Various methods based on probability
distribution of index numbers.
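One common variable-length scheme is variable-byte coding; a sketch in Python, applied to the d-gaps of the example list from the previous slides:

def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n % 128)      # 7 payload bits, continuation byte
            n //= 128
        out.append(n + 128)          # final byte carries the stop bit
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 1
    for b in data:
        if b < 128:
            n += b * shift
            shift *= 128
        else:
            numbers.append(n + (b - 128) * shift)
            n, shift = 0, 1
    return numbers

postings = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(gaps)                                                               # [3, 2, 15, 1, 2, 53, 1, 1]
print(len(vbyte_encode(gaps)), vbyte_decode(vbyte_encode(gaps)) == gaps)  # 8 True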
Bitmaps
• For every term in the lexicon, a bitvector is
stored, each bit corresponding to a
document. Set to 1 if term appears in
document, set to 0 otherwise
• Thus if term “pot” appears in documents 2
and 5, the corresponding bits are set
– pot -> 0100100
• Bitmaps are efficient for Boolean queries
– Answers to a query are obtained by simply
combining bitvectors with Boolean operators
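A sketch using Python integers as bitvectors, over the six-document "pease porridge" collection used in the example that follows:

docs = [
    "pease porridge hot pease porridge cold",   # d1
    "pease porridge in the pot",                # d2
    "nine days old",                            # d3
    "some like it hot some like it cold",       # d4
    "some like it in the pot",                  # d5
    "nine days old",                            # d6
]

# One bit per document; the bit for d1 is the leftmost one, matching the slide's notation.
bitmap = {}
for i, text in enumerate(docs):
    for term in set(text.split()):
        bitmap[term] = bitmap.get(term, 0) | (1 << (len(docs) - 1 - i))

answer = bitmap["some"] & bitmap["pot"]    # Boolean AND is just a bitwise AND
print(format(bitmap["some"], "06b"))       # 000110
print(format(bitmap["pot"], "06b"))        # 010010
print(format(answer, "06b"))               # 000010 -> d5 is the unique answer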
Example of bitmaps

term      bitvector
cold      100100
days      001001
hot       100100
in        010010
it        010010
like      000110
nine      001001
old       001001
pease     110000
porridge  110000
pot       010010
some      000110
the       010010

Query "some" AND "pot"
Answer: 000110 AND 010010 yields 000010 — d5 is the unique answer.
Signature Files
What Are Signature Files?
In your email programme you will find the option “signatures”; for example, in Outlook Express go to “tools”, “options” and you will see “signatures”. Signature files let you, as a person or business, state contact information at the end of an e-mail message. They are also known as "sig. files".
Using sig. files is a free opportunity to advertise.
Signature Files
• Signature files were popular in the past
because they are implicitly compressed (and
take less space than uncompressed inverted
files)
• A signature file is a probabilistic method for
indexing text: each document has an
associated signature or descriptor, a string of
bits that captures the content of the
document
Not a new idea…
Word signatures
• Assume that words have 16 bit descriptors
constructed by taking 3 hash functions
generating values between 1…16, and setting
corresponding bits in a 16 bit descriptor
• Example
– if 3 hashing values for cold are 3,5,16
– Cold -> 1000 0000 0010 0100
• Very unlikely but not impossible that
signatures of 2 words are identical
Example for given lexicon

term      Hash string
cold      1000 0000 0010 0100
days      0010 0100 0000 1000
hot       0000 1010 0000 0000
in        0000 1001 0010 0000
it        0000 1000 1000 0010
like      0100 0010 0000 0001
nine      0010 1000 0000 0100
old       1000 1000 0100 0000
pease     0100 0100 0010 0000
porridge  0100 0100 0010 0000
pot       0000 0010 0110 0000
some      0100 0100 0000 0001
the       1010 1000 0000 0000
Document signatures
• Signatures for each word in a given document
are superimposed
– The word signatures of all the words in a document d are ORed to form the document signature
– Example
• document 6: nine days old
Example

nine                 0010 1000 0000 0100
days                 0010 0100 0000 1000
old                  1000 1000 0100 0000
d6 “nine days old”   1010 1100 0100 1100
• To test whether a word is in a document, calculate its signature.
If the corresponding bits are set in the document, there is a
high probability that the document contains the word.
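A sketch of superimposed coding in Python, with 16-bit signatures and 3 bits per word; the hash functions here are arbitrary placeholders, so the exact bit patterns differ from the lexicon table above:

import hashlib

BITS, K = 16, 3                       # 16-bit signatures, 3 hash functions per word

def word_signature(word):
    sig = 0
    for k in range(K):
        digest = hashlib.md5(f"{k}:{word}".encode()).digest()   # k-th hash (placeholder choice)
        sig |= 1 << (digest[0] % BITS)
    return sig

def doc_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)      # superimpose: OR the word signatures together
    return sig

d6 = doc_signature(["nine", "days", "old"])
for query in ["old", "pot"]:
    q = word_signature(query)
    # If all the query's bits are set in the document signature, it is a probable (possibly false) hit.
    print(query, "probable match" if q & d6 == q else "no match")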
False hits
• Reduce false hits by increasing number of bits
set for each term and increase length of
signature
• Yet need to fetch the document at query
time, and scan/parse/stem it to check that
the term really occurs: costs time
• Effective when queries are long (lower
probabilities of false hits)
Example of false hits

Doc  Text                                       signature
1    Pease porridge hot, pease porridge cold.   1100 1111 0010 0101
2    Pease porridge in the pot.                 1110 1111 0110 0001
3    Nine days old.                             1010 1100 0100 1100
4    Some like it hot, some like it cold        1100 1110 1010 0111
5    Some like it in the pot                    1110 1111 1110 0011
6    Nine days old                              1010 1100 0100 1100

Query signatures and the documents they match:
cold  1000 0000 0010 0100  ->  docs 1 and 4
old   1000 1000 0100 0000  ->  docs 2, 3, 5 and 6
Real matches?
Merits of Signature Files
• Faster than full text scanning
– 1 or 2 orders of magnitude faster
– But… still linear in collection size
• Modest space overhead
– 10-15% vs. 50-300% (inversion)
• Insertions can be handled easily
– append only
– no reorganization or rewriting
Comparison of indexing methods
• Bitmap consumes an order of magnitude
more storage than inverted files and
signature files
• Signature files require un-necessary access to
the main text because of false matches
• Signature file main advantage: no in-memory
lexicon
• Compressed inverted files are the state-of-the-art index structure used by most search engines
Search Engines
Searches Per Day (2000)
[Chart: searches per day per search engine in 2000; information from searchenginewatch.com]
Searches Per Day (2001)
[Chart: the same comparison for 2001.]
Challenges for Web Searching
• Distributed data
• Volatile data/”Freshness”: 40% of the web changes
every month
• Exponential growth
• Unstructured and redundant data: 30% of web pages
are near duplicates
• Unedited data
• Multiple formats
• Commercial biases
• Hidden data
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[Diagram: crawl the web; check for duplicates and store the documents; assign DocIds; create an inverted index; the search engine servers use the inverted index to answer the user query and show results to the user.]
[A more detailed architecture, from Brin & Page 98, covers only the preprocessing in detail, not the query serving.]
Indexes for Web Search Engines
• Inverted indexes are still used, even though
the web is so huge
• Some systems partition the indexes across
different machines
– Each machine handles different parts of the data
• Other systems duplicate the data across
many machines
– Queries are distributed among the machines
• Most do a combination of these
In this example, the
data for the pages is
partitioned across
machines. Additionally,
each partition is
allocated multiple
machines to handle the
queries.
Each row can handle
120 queries per
second
Each column can
handle 7M pages
To handle more
queries, add another
row.
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Querying: Cascading Allocation of CPUs
• A variation on this that produces a cost savings:
– Put high-quality/common pages on many
machines
– Put lower quality/less common pages on fewer
machines
– Query goes to high quality machines first
– If no hits found there, go to other machines
Google
• Google maintains the world's largest Linux cluster (10,000 servers)
• These are partitioned between index servers
and page servers
– Index servers resolve the queries (massively
parallel processing)
– Page servers deliver the results of the queries
• Over 3 Billion web pages are indexed and
served by Google
Web Page Ranking
• Varies by search engine
– Pretty messy in many cases
– Details usually proprietary and fluctuating
• Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc)
– Term characteristics (boldface, capitalized, etc)
– Link analysis information
– Category information
– Popularity information
Ranking: Link Analysis
• Assumptions:
– If the pages pointing to this page are good, then
this is also a good page
– The words on the links pointing to this page are
useful indicators of what this page is about
– References: Page et al. 98, Kleinberg 98
Ranking: Link Analysis
• Why does this work?
– The official Toyota site will be linked to by lots of
other official (or high-quality) sites
– The best Toyota fan-club site probably also has
many links pointing to it
– Less high-quality sites do not have as many high-quality sites linking to them
Web Crawling
• How do the web search engines get all of the
items they index?
• Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
• How do the web search engines get all of the items
they index?
• More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
• Take the first page off of the queue
• If this page has not yet been processed:
– Record the information found on this page
» Positions of words, links going out, etc
– Add each link on the current page to the queue
– Record that this page has been processed
• Depth-first/breadth-first/…
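A breadth-first sketch of that loop using only Python's standard library; the seed URL is a placeholder, and a real crawler would also honour robots.txt, politeness delays, and much more robust error handling:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

queue = deque(["http://example.org/"])        # queue of known sites (placeholder seed)
seen = set()
while queue and len(seen) < 20:               # small bound, just for the sketch
    url = queue.popleft()
    if url in seen:
        continue                              # this page has already been processed
    seen.add(url)
    try:
        page = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except OSError:
        continue                              # server unavailable: skip and move on
    extractor = LinkExtractor()
    extractor.feed(page)                      # record the links found on this page
    for link in extractor.links:
        queue.append(urljoin(url, link))      # add each link on the current page to the queue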
Sites Are Complex Graphs, Not Just Trees
[Diagram: six web sites, each with several pages; links run both within a site and across sites, so the whole forms a graph rather than a tree.]
Web Crawling Issues
• Keep out signs
– A file called robots.txt tells the crawler which directories are
off limits
• Freshness
– Figure out which pages change often
– Recrawl these often
• Duplicates, virtual hosts, etc
– Convert page contents with a hash function
– Compare new pages to the hash table
• Lots of problems
– Server unavailable
– Incorrect html
– Missing links
– Infinite loops
• Web crawling is difficult to do robustly!
XML, IR, … databases?
What’s XML?
• W3C Standard since 1998
– Subset of SGML
– (ISO Standard Generalized Markup Language)
• Data-description markup language
– (in contrast, HTML is a text-rendering markup language)
• De facto format for data exchange on
Internet
– Electronic commerce
– Business-to-business (B2B) communication
XML Data Model Highlights
• Tagged elements describe semantics of data
– Easier to parse for a machine and for a human
• Element may have attributes
• Element can contain nested sub-elements
• Sub-elements may themselves be tagged
elements or character data
• Usually considered as tree structure … but
really a graph!
An XML Document
<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>
XPATH Example
• List the titles of articles in which the author
has “Tom Hanks”
– //article[//author=“Tom Hanks”]/title
• Find the titles of articles authored by “Tom
Hanks” in volume 1.
– //issue[/volume=“1”]/articles/article/[//author=“Tom Hanks”]/title
XQuery: Example
List the titles of the articles authored by “Tom
Hanks”
Query Expression
for $b IN document("sigmodRecord.xml")//article
where $b//author = "Tom Hanks"
return <title>$b/title.text()</title>
Query Result
<title>XML Research Issues</title>
Data- vs. Document-centric
• Data-centric
– Highly structured
– Usage stems from exchange of data from databases
• Example:
<site>
  <item ID="I001">
    <name>Chair</name>
    <description>This chair is in good condition ...</description>
  </item>
  <item ID="I002">
    <name>Table</name>
    <description>...</description>
  </item>
  ...
</site>

• Document-centric
– Semi-structured
– Embedded tags
– Fulltext search important
– Usage stems from exchange of formatted text
• Example:
<memo>
  <author>John Doe</author>
  <title>...</title>
  <body>
  This memo is meant for all persons responsible for
  <list bullets=1>
    <item>either <em>customers</em> abroad,
    <item>or <em>suppliers</em> abroad.
  </list>
  ...
</memo>
Classes of XML Documents
• Structured
– “Un-normalized” relational data
Ex: product catalogs, inventory data, medical
records, network messages, logs, stock quotes
• Mixed
– Structured data embedded in large text fragments
Ex: On-line manuals, transcripts, tax forms
Queries on Mixed Data
• Full-text search operators
Ex: Find all <bill>s where "striking" & "amended"
are within 6 intervening words
• Queries on structure & text
Ex: Return <text> element containing both
"exemption" & "social security" and preceding &
following <text> elements
• Queries that span (ignore) structure
Ex: Return <bill> that contains “referred to the
Committee on Financial Services”
XML Indexing
[Figure: an example XML document with elements A–E and text nodes T1–T4, its schema, and the two index structures built from it — a node index recording, for each element occurrence, its identifier, start and end positions, and nesting, and a word index recording the position of each text token.]
Text Region Algebras
contains(A, nodeindex)
[Tree diagram: the nodes of the example document (elements A–E, text nodes T1–T4) selected by this expression are highlighted.]

Text Region Algebras
contained-by(D, nodeindex)
[Tree diagram: the nodes selected by this expression are highlighted.]

Using Region Algebras
roots := contains and not contained-by
[Tree diagram: the root regions are highlighted.]

Using Region Algebras
leafs := contained-by and not contains
[Tree diagram: the leaf regions are highlighted.]

Using Region Algebras
inner := diff(diff(nodes, leafs), roots)
[Tree diagram: the inner regions are highlighted.]
XML Region Operators - 1
• Containment
– Contains and contained-by
• Length
– Length (including markup) and text-length (no markup)
• Other possibilities include…
– Range or distance - enabling references
– Overlaps
– (Minimal) bounding region (direct parent)
XML Region Operators – 2
• Transitive closure
– Containment - select on word index with known
lower and upper bound
• Regeneration of XML document
– Containment on node index
– Containment on word index
– Union of node and word subsets
Inverted Lists for XML Documents

1<section>
  2<title> 3Information 4Retrieval 5Using 6RDBMS 7</title>
  8<section>
    9<title> 10Beyond 11Simple 12Translation 13</title>
    14<section>
      15<title> 16Extension 17of 18IR 19Features 20</title>
    21</section>
  22</section>
23</section>
(superscript numbers are word positions)

Element index
<section>       (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) … …
<title>         (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) … …
Text index
“information”   (1, 3, 2) … …
“retrieval”     (1, 4, 2) … …

Containment, direct containment, tight containment, proximity
Inverted Lists to Relational Tables

ELEMENTS
Inverted list for <section>: (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …

term     docno  begin  end  level
section  1      1      23   0
section  1      8      22   1
section  1      14     21   2
…

TEXTS
Inverted list for “information”: (1, 3, 2) …

term         docno  wordno  level
information  1      3       2
…
Inverted List Operations to SQL Translation

-- E // ‘T’
select *
from   ELEMENTS e, TEXTS t
where  e.term = ‘E’
and    t.term = ‘T’
and    e.docno = t.docno
and    e.begin < t.wordno
and    t.wordno < e.end

-- E / ‘T’
select *
from   ELEMENTS e, TEXTS t
where  e.term = ‘E’ and t.term = ‘T’
and    e.docno = t.docno
and    e.begin < t.wordno
and    t.wordno < e.end
and    e.level = t.level - 1

-- E = ‘T’
select *
from   ELEMENTS e, TEXTS t
where  e.term = ‘E’ and t.term = ‘T’
and    e.docno = t.docno
and    t.wordno = e.begin + 1
and    e.end = t.wordno + 1

-- distance(“T1”, “T2”) <= n
select *
from   TEXTS t1, TEXTS t2
where  t1.term = ‘T1’ and t2.term = ‘T2’
and    t1.docno = t2.docno
and    t2.wordno > t1.wordno
and    t2.wordno <= t1.wordno + n
Experimental Setup: Workload
• Data sets
Data Sets                       Shakespeare  DBLP       Synthetic
Size of documents               8 MB         53 MB      207 MB
Inverted index size             11 MB        78 MB      285 MB
Relational table size           15 MB        121 MB     566 MB
Number of distinct elements     22           598        715
Number of distinct text words   22,825       250,657    100,000
Total number of elements        179,726      1,595,010  4,999,500
Total number of text words      474,057      3,655,148  19,952,000
• 13 micro-benchmark queries:
– Of form: ‘elem’ contains ‘word’
– Vary ‘elem’ and ‘word’ of different frequencies: check
sensitivity of performance w.r.t. selectivity
Number of Term Occurrences
Queries  term1 frequency  term2 frequency  result rows
QS1      90               277              2
QS2      107,833          277              36
QS3      107,833          3,231            1,543
QS4      107,833          1                1
QD1      654              55               13
QD2      4,188            712              672
QD3      287,513          6,363            6,315
QD4      287,513          3                3
QG1      50               1,000            809
QG2      134,900          55,142           1,470
QG3      701,000          165,424          21,936
QG4      50               82,712           12
QG5      701,000          17               4
DB2 / Inv. List Engine Performance Ratios
[Chart: per-query ratio of DB2 running time to the inverted list engine's running time.]
• Want to know why RDBMS:
– sometimes performs much better
– usually performs much worse
• Two significant factors: Join algorithm, cache utilization
Why Does RDBMS Sometimes Perform Better?

-- E // ‘T’
select *
from   ELEMENTS e, TEXTS t
where  e.term = ‘E’
and    t.term = ‘T’
and    e.docno = t.docno
and    e.begin < t.wordno
and    t.wordno < e.end

[Diagram: the inverted list engine merges the complete ‘E’ list and ‘T’ list; the RDBMS uses an index nested-loop join, probing an index on ‘E’ over ELEMENTS (or TEXTS) and an index on ‘T’ over TEXTS (or ELEMENTS). The diagram annotates the difference in I/O and CPU cost between the two.]
MPMGJN vs. Standard Merge Join
• Inverted list engine (MPMGJN): three merging predicates — d1 = d2, b <= w, w <= e
• RDBMS (standard merge join): one merging predicate — d1 = d2; additional filtering predicates: b <= w, w <= e

Example sorted inputs used by both engines:
ELEMENTS (d, b, e): (5, 7, 20) (5, 14, 19) (5, 21, 28) (5, 22, 27) (5, 29, 31) (5, 32, 40)
TEXTS (d, w): (5, 2) (5, 23) (5, 24) (5, 33) (5, 37) (5, 42)
Number of Row Pairs Compared
Queries  d1=d2, b<=w<=e  d1=d2
QS1      5               1,653
QS2      7,131           984,948
QS3      89,716          10,175,904
QS4      2,366           3,475
QD1      503             555
QD2      4,723           1,315,662
QD3      263,458         14,082,080
QD4      1,766           4,950
QG1      1,000           1,000
QG2      103,994         148,773,116
QG3      610,816         2,319,244,480
QG4      12              82,712
QG5      56,084          238,340
MPMGJN vs. Index Nested Loop Join
[Diagram: the inverted list engine runs MPMGJN over the two sorted lists (d, b, e) and (d, w); the RDBMS runs an index nested-loop join, probing an index on the (d, w) relation for each (d, b, e) row.]
Which wins depends on:
• the number of index key comparisons vs. record comparisons
• the cost of index key comparisons vs. record comparisons
RDBMS vs. Special-purpose?
• Recap:
– RDBMS’s have great technology (e.g., B+-tree,
optimization, storage)
– but not well-tuned for containment queries using
inverted lists
• Open question:
– use RDBMS or special purpose engine to process
containment queries using inverted lists?
Thanks
• Ron Larson
• Marti Hearst
• Doug Oard
• Philip Resnik
• Sriram Raghavan
• Johan List
• Maurice van Keulen
• David Carmel
• Aya Soffer
• …