r i keys and files
Download
Report
Transcript r i keys and files
Information Retrieval Techniques
MS(CS) Lecture 5
AIR UNIVERSITY MULTAN CAMPUS
Quick Review
•
•
•
•
Inverted Index Construction (Exercise)
Query Processing Using Inverted Index
Faster Posting Merges: Skip Pointers
Phrase Queries
– Bi-word Index
– Extended Bi-Word Index
– Positional Index
Sec. 2.4.2
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for such
queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work
for any value of k?
– This is a little tricky to do correctly and efficiently
– See Figure 2.12 of IIR
– There’s likely to be a problem on it!
3
Proximity search
We just saw how to use a positional index for phrase
searches.
We can also use it for proximity search.
For example: employment /4 place
Find all documents that contain EMPLOYMENT and
PLACE within 4 words of each other.
Employment agencies that place healthcare workers
are seeing growth is a hit.
Employment agencies that have learned to adapt
now place healthcare workers is not a hit.
4
4
Proximity search?
Use the positional index
Simplest algorithm: look at cross-product of positions
of (i) EMPLOYMENT in document and (ii) PLACE in
document
Very inefficient for frequent words, especially stop
words
Note that we want to return the actual matching
positions, not just a list of documents.
This is important for dynamic summaries etc.
5
5
Sec. 2.4.2
Positional index size
• You can compress position values/offsets:
• Nevertheless, a positional index expands
postings storage substantially link
• Nevertheless, a positional index is now
standardly used because of the power and
usefulness of phrase and proximity queries …
whether used explicitly or implicitly in a
ranking retrieval system.
6
Sec. 2.4.2
Positional index size
• Need an entry for each occurrence, not just
once per document
Why?
• Index size depends on average document size
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily
100,000 terms
• Consider a term with frequency 0.1%
Document size
Postings
Positional postings
1000
1
1
100,000
1
100
7
Sec. 2.4.2
Rules of thumb
• A positional index is 2–4 as large as a nonpositional index
• Positional index size 35–50% of volume of
original text
• Caveat: all of this holds for “English-like”
languages
8
Sec. 2.4.3
Combination schemes
• These two approaches can be profitably
combined
– For particular phrases (“Michael Jackson”,
“Britney Spears”) it is inefficient to keep on
merging positional postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of
the time of using just a positional index
– It required 26% more space than having a
positional index alone
9
Inverted Index Construction
•
•
•
•
Positional index size
Dictionary size
Hardware issues
Large collection requirements analysis
Inverted index
11
11
Dictionaries
The dictionary is the data structure for storing the
term vocabulary.
Term vocabulary: the data
Dictionary: the data structure for storing the term
vocabulary
12
12
Dictionary as array of fixed-width entries
For each term, we need to store a couple of items:
document frequency
pointer to postings list
. . .
Assume for the time being that we can store this
information in a fixed-length entry.
Assume that we store these entries in an array.
13
13
Dictionary as array of fixed-width entries
space needed: 20 bytes
4 bytes
4 bytes
How do we look up a query term qi in this array at query time?
That is: which data structure do we use to locate the entry (row)
in the array where qi is stored?
14
14
Data structures for looking up term
Two main classes of data structures: hashes and trees
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep
growing?
What are the relative frequencies with which
various keys will be accessed?
How many terms are we likely to have?
15
15
Hashes
Each vocabulary term is hashed into an integer.
Try to avoid collisions
At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
Lookup time is constant.
Cons
no way to find minor variants (resume vs. résumé)
no prefix search (all terms starting with automat)
need to rehash everything periodically if vocabulary keeps
growing
16
16
Trees
Trees solve the prefix problem (find all terms starting with
automat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(logM), where M is
the size of the vocabulary.
O(logM) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children
in the interval [a, b] where a, b are appropriate positive
integers, e.g., [2, 4].
17
17
Binary tree
18
18
B-tree
19
19
Ch. 4
Index construction
• How do we construct an index?
• What strategies can we use with limited main
memory?
Sec. 4.1
Hardware basics
• Many design decisions in information retrieval
are based on the characteristics of hardware
• We begin by reviewing hardware basics
Sec. 4.1
Hardware basics
• Access to data in memory is much faster than
access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
• Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
• Block sizes: 8KB to 256 KB.
Sec. 4.1
Hardware basics
• Servers used in IR systems now typically have
several GB of main memory, sometimes tens
of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• Fault tolerance is very expensive: It’s much
cheaper to use many regular machines rather
than one fault tolerant machine.
Sec. 4.2
Recall IIR 1 index construction
• Documents are parsed to extract words and these
are saved with the Document ID.
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Term
I
did
enact
julius
caesar
I
was
killed
i'
the
capitol
brutus
killed
me
so
let
it
be
with
caesar
the
noble
brutus
hath
told
you
caesar
was
ambitious
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
Sec. 4.2
Key step
• After all documents have been
parsed, the inverted file is
sorted by terms.
We focus on this sort step.
We have 100M items to sort.
Term
I
did
enact
julius
caesar
I
was
killed
i'
the
capitol
brutus
killed
me
so
let
it
be
with
caesar
the
noble
brutus
hath
told
you
caesar
was
ambitious
Doc #
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
Term
ambitious
be
brutus
brutus
capitol
caesar
caesar
caesar
did
enact
hath
I
I
i'
it
julius
killed
killed
let
me
noble
so
the
the
told
you
was
was
with
Doc #
2
2
1
2
1
1
2
2
1
1
1
1
1
1
2
1
1
1
2
1
2
2
1
2
2
2
1
2
2
Sec. 4.2
Scaling index construction
• In-memory index construction does not scale
– Can’t stuff entire collection into memory, sort,
then write back
• How can we construct an index for very large
collections?
• Taking into account the hardware constraints
we just learned about . . .
• Memory, disk, speed, etc.
Sec. 4.2
Sort-based index construction
• As we build the index, we parse docs one at a time.
– While building the index, we cannot easily exploit
compression tricks (you can, but much more complex)
• The final postings for any term are incomplete until the end.
• At 12 bytes per non-positional postings entry (term, doc, freq),
demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1
– So … we can do this in memory in 2009, but typical
collections are much larger. E.g., the New York Times
provides an index of >150 years of newswire
• Thus: We need to store intermediate results on disk.
Sec. 4.2
Sort using disk as “memory”?
• Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
• No: Sorting T = 100,000,000 records on disk is
too slow – too many disk seeks.
• We need an external sorting algorithm.
RCV1 collection
Shakespeare’s collected works are not large
enough for demonstrating many of the points in
this course.
As an example for applying scalable index
construction algorithms, we will use the
Reuters RCV1 collection.
English newswire articles sent over the wire in
1995 and 1996 (one year).
29
29
Same algorithm for disk?
Can we use the same index construction algorithm
for larger collections, but by using disk instead of
memory?
No: Sorting T = 100,000,000 records on disk is too
slow – too many disk seeks.
We need an external sorting algorithm.
30
30
“External” sorting algorithm
(using few disk seeks)
We must sort T = 100,000,000 non-positional postings.
Each posting has size 12 bytes (4+4+4: termID, docID,
document frequency).
Define a block to consist of 10,000,000 such postings
We can easily fit that many postings into memory.
We will have 10 such blocks for RCV1.
Basic idea of algorithm:
For each block: (i) accumulate postings, (ii) sort in memory,
(iii) write to disk
Then merge the blocks into one long sorted order.
31
31
Merging two blocks
32
32
Blocked Sort-Based Indexing
Key decision: What is the size of one block?
33
33
Problem with sort-based algorithm
Our assumption was: we can keep the dictionary in
memory.
We need the dictionary (which grows dynamically) in
order to implement a term to termID mapping.
Actually, we could work with term,docID postings
instead of termID,docID postings . . .
. . . but then intermediate files become very large.
(We would end up with a scalable, but very slow index
construction method.)
34
34
Single-pass in-memory indexing
Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each
block – no need to maintain term-termID mapping
across blocks.
Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.
35
35
Using wildcard in queries
Wildcard queries
mon*: find all docs containing any term beginning with mon
Easy with B-tree dictionary: retrieve all terms t in the range:
mon ≤ t < moo
*mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwards
Then retrieve all terms t in the range: nom ≤ t < non
Result: A set of terms that are matches for wildcard query
Then retrieve documents that contain any of these terms
37
37
How to handle * in the middle of a term
Example: m*nchen
We could look up m* and *nchen in the B-tree and
intersect the two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the *
occurs at the end.
Store each of these rotations in the dictionary, say, in
a B-tree
38
38
Permuterm index
For term HELLO: add hello$, ello$h, llo$he,
lo$hel, and o$hell to the B-tree where $ is a
special symbol
39
39
Permuterm → term mapping
40
40
Permuterm index
For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and
o$hell
Queries
For X, look up X$
For X*, look up X*$
For *X, look up X$*
For *X*, look up X*
For X*Y, look up Y$X*
Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree.
But permuterm index is the more common name.
41
41
Processing a lookup in the permuterm index
Rotate query wildcard to the right
Use B-tree lookup as before
Problem: Permuterm more than quadruples the size
of the dictionary compared to a regular B-tree.
(empirical number)
42
42
k-gram indexes
More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k characters)
occurring in a term
2-grams are called bigrams.
Example: from April is the cruelest month we get the bigrams:
$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo
on nt h$
$ is a special word boundary symbol, as before.
Maintain an inverted index from bigrams to the terms that
contain the bigram
43
43
Postings list in a 3-gram inverted index
44
44
k-gram (bigram, trigram, . . . ) indexes
Note that we now have two different types of
inverted indexes
The term-document inverted index for finding
documents based on a query consisting of terms
The k-gram index for finding terms based on a query
consisting of k-grams
45
45
Processing wildcarded terms in a bigram index
Query mon* can now be run as: $m AND mo AND on
Gets us all terms with the prefix mon . . .
. . . but also many “false positives” like MOON.
We must postfilter these terms against query.
Surviving terms are then looked up in the term-document
inverted index.
k-gram index vs. permuterm index
k-gram index is more space efficient.
Permuterm index doesn’t require postfiltering.
46
46
Exercise
Google has very limited support for wildcard queries.
For example, this query doesn’t work very well on Google:
[gen* universit*]
Intention: you are looking for the University of Geneva, but
don’t know which accents to use for the French words for
university and Geneva.
According to Google search basics, 2010-04-29: “Note that the
* operator works only on whole words, not parts of words.”
But this is not entirely true. Try [pythag*] and [m*nchen]
Exercise: Why doesn’t Google fully support wildcard queries?
47
47
Processing wildcard queries in the termdocument index
Problem 1: we must potentially execute a large number of
Boolean queries.
Most straightforward semantics: Conjunction of disjunctions
For [gen* universit*]: geneva university OR geneva université
OR genève university OR genève université OR general
universities OR . . .
Very expensive
Problem 2: Users hate to type.
If abbreviated queries like [pyth* theo*] for [pythagoras’
theorem] are allowed, users will use them a lot.
This would significantly increase the cost of answering queries.
Somewhat alleviated by Google Suggest
48
48
Spelling correction
Two principal uses
Correcting documents being indexed
Correcting user queries
Two different methods for spelling correction
Isolated word spelling correction
Check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., an
asteroid that fell form the sky
Context-sensitive spelling correction
Look at surrounding words
Can correct form/from error above
49
49
Correcting documents
We’re not interested in interactive spelling correction
of documents (e.g., MS Word) in this class.
In IR, we use document correction primarily for
OCR’ed documents. (OCR = optical character
recognition)
The general philosophy in IR is: don’t change the
documents.
50
50
Correcting queries
First: isolated word spelling correction
Premise 1: There is a list of “correct words” from which the
correct spellings come.
Premise 2: We have a way of computing the distance between
a misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct” word
that has the smallest distance to the misspelled word.
Example: informaton → information
For the list of correct words, we can use the vocabulary of all
words that occur in our collection.
Why is this problematic?
51
51
Alternatives to using the term vocabulary
A standard dictionary (Webster’s, OED etc.)
An industry-specific dictionary (for specialized IR
systems)
The term vocabulary of the collection, appropriately
weighted
52
52
Hash function for strings:
key[i]
98 108 105
key a l i
0 1 2
i
KeySize = 3;
hash(“ali”) = (105 * 1 + 108*37 + 98*372) % 10,007 = 8172
“ali”
0
1
2
hash
function
……
ali
8172
……
10,006 (TableSize)
CENG 213 Data Structures
53
QUESTIONS?