Text indexing
7CCSMWAL
Algorithmic Issues in the
WWW
Lecture 7
1
Text searching
To search for a keyword within text, we can
• Scan the text sequentially when
The text is small, e.g., a few MB
The text collection is very volatile (modified very
frequently)
No extra space is available (for building indices)
• Build a data structure over the text (called an
index) to speed up the search when
The text collection is large and static or semi-static
(can be updated at reasonably regular intervals)
2
Inverted files
• Also called inverted indices
• Mainly composed of two elements
Dictionary (or vocabulary, or lexicon)
• Set of all different words (tokens, index terms)
Posting list (or inverted list)
• Each word has a list of positions where the word appears;
"position" is used here to mean the documents containing the term
• Postings file is the set of all posting lists
• Postings lists are much larger than the dictionary
• Dictionary is commonly kept in memory, and
postings lists are normally kept on disk
• Structure of postings lists can vary (problem
dependent); e.g., the "documents" could be the pages of this PPT presentation
3
Example (see Intro to IR)
• The dictionary is sorted alphabetically by term
• Each posting list is sorted by document ID
• The numbers are the documents in which the term
occurs
(or lines in a page or book or whatever)
4
Construct an inverted file
• Input: a list of normalized tokens for each
document
• Example
Doc 1
• I did enact Julius Caesar: I was killed i’ the Capitol;
Brutus killed me.
Doc 2
• So let it be with Caesar. The noble Brutus hath told
you Caesar was ambitious:
5
Construct an inverted index
• The core indexing step is sorting the lists of
tokens so that the terms are alphabetical
• Multiple occurrences of the same term from the
same document are then merged
• The term frequency is also recorded. (#occs. of
term in doc)
• Instances of the same term are then grouped,
and the result is split into a dictionary (of terms)
and postings (list of documents containing term)
• The dictionary also records some statistics, such
as the number of documents which contain each
term (document frequency), and total number of
occurrences
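A minimal Python sketch of these steps on the two example documents (the tuple layout and variable names are illustrative, not a prescribed format):

from collections import defaultdict

docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}

# Core indexing step: collect (term, docID) pairs and sort them by term
pairs = sorted((term, doc_id) for doc_id, tokens in docs.items() for term in tokens)

# Merge occurrences from the same document, recording the term frequency
postings = defaultdict(list)          # term -> [(docID, term frequency), ...]
for term, doc_id in pairs:
    if postings[term] and postings[term][-1][0] == doc_id:
        last_doc, tf = postings[term][-1]
        postings[term][-1] = (last_doc, tf + 1)
    else:
        postings[term].append((doc_id, 1))

# The dictionary records statistics such as the document frequency
document_frequency = {term: len(plist) for term, plist in postings.items()}
print(document_frequency["caesar"])   # 2: 'caesar' occurs in both documents
print(postings["killed"])             # [(1, 2)]: twice, in Doc 1 only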
6
Example
[Figure: the inverted file built from the example documents]
7
Data structures for postings lists
• Fixed length arrays
Each postings list is a fixed length array
The arrays may not be fully utilized
Reading a postings list is time-efficient as it is
stored in contiguous locations
8
Data structures for postings lists
• Singly linked lists
Each postings list is a singly linked list
No empty entries but there is the overhead of
the pointers
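For contrast with the fixed-length array, a minimal Python sketch of a linked-list postings node (illustrative only):

class PostingNode:
    """Singly linked list node: a docID plus a pointer to the next node."""
    def __init__(self, doc_id, nxt=None):
        self.doc_id = doc_id
        self.next = nxt

caesar = PostingNode(1, PostingNode(2))   # postings list for 'caesar': 1 -> 2
node = caesar
while node is not None:                   # traversal follows pointers, so
    print(node.doc_id)                    # storage need not be contiguous
    node = node.next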
9
Methods to index the terms
Various approaches include:
• External sort and merge
• In-memory indexing based on hashing,
followed by merging
• Distributed indexing
Google, for example, processes documents using
MapReduce
10
Hardware constraint
• Indexing algorithm is governed by hardware
constraints
• Characteristics of computer hardware
Access to data in memory is much faster than access
to data on disk
It takes a few clock cycles to access a byte in memory,
but much longer to transfer it from disk
Want to keep as much data as possible in memory,
especially the data that we need to access frequently
11
Index construction with disk
• The list of tokens may be too large to be stored and sorted
in memory
• External sorting algorithm
minimize the number of random disk seeks during sorting
Blocked sort-based indexing algorithm (BSBI)
• BSBI
1. segment the collection into parts of equal size (Step 4 of the
pseudo code)
2. construct the intermediate inverted file for each part in memory
(Step 5 of the pseudo code)
This step is the same as when the complete list of tokens fits in
memory and the inverted file is constructed entirely in memory
3. store the intermediate inverted files on disk (Step 6 of the pseudo
code)
4. merge all intermediate inverted files into the final index (Step 7 of
the pseudo code)
12
BSBI pseudo code
BSBIndexConstruction()
1. n ← 0
2. while (all documents have not been processed)
3. do n ← n + 1
4.    block ← ParseNextBlock()
5.    BSBI-Invert(block)
6.    WriteBlockToDisk(block, fn)
7. MergeBlocks(f1, ..., fn; fmerge)
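As a rough illustration only, the same loop in Python, with ParseNextBlock and the disk files f1, ..., fn simplified to in-memory stand-ins and heapq.merge playing the role of MergeBlocks:

import heapq
from itertools import islice

def bsbi_index_construction(pair_stream, block_size):
    """pair_stream yields (term, docID) pairs; block_size caps one block."""
    runs = []                                  # stands in for files f1, ..., fn
    while True:
        block = list(islice(pair_stream, block_size))   # ParseNextBlock()
        if not block:
            break
        block.sort()                           # BSBI-Invert: sort the block
        runs.append(block)                     # WriteBlockToDisk(block, fn)
    return list(heapq.merge(*runs))            # MergeBlocks(f1, ..., fn; fmerge)

pairs = iter([("caesar", 2), ("brutus", 1), ("caesar", 1), ("noble", 2)])
print(bsbi_index_construction(pairs, block_size=2))
# [('brutus', 1), ('caesar', 1), ('caesar', 2), ('noble', 2)]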
13
Merging intermediate inverted files
Block 1 (on disk):              Block 2 (on disk):
brutus → d1, d3                 brutus → d6, d7
caesar → d1, d2, d4             caesar → d8, d9
noble  → d5                     julius → d10
with   → d1, d2, d3, d5         killed → d8

Merged posting lists (written back to disk):
brutus → d1, d3, d6, d7
caesar → d1, d2, d4, d8, d9
julius → d10
killed → d8
noble  → d5
with   → d1, d2, d3, d5
Two blocks (“posting lists to be merged”) are loaded from disk into memory,
merged in memory (“merged posting lists”) and written back to disk
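A small Python sketch of this two-block merge (dicts stand in for the on-disk blocks; a real implementation streams from disk rather than holding both blocks in memory):

def merge_blocks(block1, block2):
    """Merge two term -> docID-list blocks into one inverted file."""
    merged = {}
    for term in sorted(set(block1) | set(block2)):
        # Concatenation keeps the docIDs sorted because every docID in
        # block1 precedes every docID in block2, as in the figure above
        merged[term] = block1.get(term, []) + block2.get(term, [])
    return merged

b1 = {"brutus": ["d1", "d3"], "caesar": ["d1", "d2", "d4"],
      "noble": ["d5"], "with": ["d1", "d2", "d3", "d5"]}
b2 = {"brutus": ["d6", "d7"], "caesar": ["d8", "d9"],
      "julius": ["d10"], "killed": ["d8"]}
print(merge_blocks(b1, b2)["caesar"])   # ['d1', 'd2', 'd4', 'd8', 'd9']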
14
Single-pass in-memory indexing
(SPIMI)
• No sorting of tokens is required
• Tokens are processed one by one in memory
When a term occurs for the first time, it is added to the
dictionary (implemented as a hash table), and a new
posting list is created
Otherwise, find the corresponding postings list
Then, add the docID to the postings list
• The process continues until the memory is full.
The dictionary is then sorted and written to disk
Note: the dictionary is much shorter than the complete
list of all token occurrences (or that’s the idea)
15
Hash table
• A data structure that supports the operations Lookup
and Insert (and possibly Delete) in expected
constant time
• Can be considered as a table of data
Each term is stored in one of the entries of the table
• A hash function determines which table entry
each term is stored in
Typically, the hash function maps a string (an
index term or a key) to an integer (table entry)
• More details in slide p.37
16
Example
[Figure: hash table example; picture from Wikipedia]
17
SPIMI pseudo code
SPIMI-Invert(token_stream)
1. output_file = NewFile()
2. dictionary = NewHash()
3. while (free memory available)
4. do token ← next(token_stream)
5.    if term(token) ∉ dictionary
6.      then postings_list = AddToDictionary(dictionary, term(token))
7.      else postings_list = GetPostingsList(dictionary, term(token))
8.    if full(postings_list)
9.      then postings_list = DoublePostingsList(dictionary, term(token))
10.   AddToPostingsList(postings_list, docID(token))
11. sorted_terms ← SortTerms(dictionary)
12. WriteBlockToDisk(sorted_terms, dictionary, output_file)
13. return output_file
The pseudo code only shows how an intermediate inverted file is constructed
The final merging of the inverted files is the same as in BSBI
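A minimal Python sketch of one SPIMI pass (illustrative: memory is approximated by a simple posting count, and since Python lists grow automatically the DoublePostingsList step disappears):

def spimi_invert(token_stream, memory_limit):
    """Build one in-memory block from (term, docID) tokens, emit it sorted."""
    dictionary = {}                        # NewHash(): term -> postings list
    used = 0
    for term, doc_id in token_stream:
        if term not in dictionary:         # first occurrence of the term
            dictionary[term] = []          # AddToDictionary: new posting list
        dictionary[term].append(doc_id)    # AddToPostingsList
        used += 1
        if used >= memory_limit:           # "memory is full"
            break
    return sorted(dictionary.items())      # SortTerms + WriteBlockToDisk

tokens = iter([("i", 1), ("did", 1), ("enact", 1), ("did", 2)])
print(spimi_invert(tokens, memory_limit=4))
# [('did', [1, 2]), ('enact', [1]), ('i', [1])]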
18
Example
List of tokens:
(I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (so, 2),
(did, 2), (it, 2), (the, 3), (you, 3), (hold, 3), ...

Main memory (hash table)        Disk (inverted file)
Terms    Postings               Terms    Postings
caesar   1                      caesar   1
enact    1                      did      1, 2
I        1                      enact    1
so       2                      I        1
julius   1                      julius   1
did      1, 2                   so       2

Hash table: the tokens are read one by one and inserted into the hash
table in main memory until the memory is full
Inverted file: the entries in the hash table are sorted and written to
disk as an inverted file
19
Distributed indexing
• Perform indexing on large computer cluster
A computer cluster is a group of tightly coupled
computers that work closely together
The group may consist of hundreds or thousands of
nodes (computers)
Individual nodes can fail at any time
• The result of the construction process is a
distributed index that is partitioned across several
machines
Either according to term or according to document
We focus on term-partitioned index
20
Distributed indexing
• MapReduce: a general architecture for
distributed computing
• A master node (computer) directs the
process of
dividing the work up into small tasks
assigning the tasks to individual nodes
re-assigning tasks in case of node failure
21
Distributed indexing
• The master node breaks the input
documents into splits
Each split is a subset of documents
(corresponding to the partitions of the list of
tokens made in BSBI/SPIMI)
• Two sets of tasks
Parsers
Inverters
22
Parsers
• Master assigns a split to an idle parser
node
• Parser reads one document at a time and
produces (term, doc) pairs
• Parser writes pairs into j partitions for
passing on to Inverters
• Each partition covers a range of terms’ first
letters
E.g., a-f, g-p, q-z; here j = 3
23
Inverters
• To complete the index inversion
• Parsers pass the term partitions to the
inverters, or can send the (term, doc) pairs
one at a time
• An inverter collects all (term, doc) pairs
(= postings) for its term-partition
• Sorts and writes to posting lists
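A toy single-machine sketch of the parser and inverter tasks in Python (illustrative only; in real MapReduce these run as distributed tasks on separate nodes):

from collections import defaultdict

def partition_for(term):
    """Route a term to one of j = 3 partitions by its first letter."""
    return "a-f" if term[0] <= "f" else ("g-p" if term[0] <= "p" else "q-z")

def parser(split):
    """Parser task: read documents, emit (term, doc) pairs per partition."""
    partitions = defaultdict(list)
    for doc_id, tokens in split:
        for term in tokens:
            partitions[partition_for(term)].append((term, doc_id))
    return partitions

def inverter(pairs):
    """Inverter task: collect all (term, doc) pairs of one partition,
    sort them, and build the posting lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        postings[term].append(doc_id)
    return dict(postings)

parts = parser([(1, ["caesar", "brutus"]), (2, ["caesar", "so"])])
print(inverter(parts["a-f"]))   # {'brutus': [1], 'caesar': [1, 2]}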
24
Data flow
[Figure: data flow. The master assigns splits of documents to idle
parsers; each parser writes its (term, doc) pairs into the a-f, g-p,
q-z partitions; the postings of each partition are then passed to the
corresponding inverter (one inverter each for a-f, g-p and q-z).]
25
Dynamic indexing
• Up to now, we have assumed that
collections are static
• They rarely are
New documents come in over time and need
to be inserted
Documents are deleted and modified
• This means that the dictionary and the
postings have to be modified:
Posting updates for terms already in dictionary
New terms are added to dictionary
26
Simplest approach: block update
• Maintain “big” main index on disk
• New documents go into “small” auxiliary index in
memory
• Merge the auxiliary index block and the main
index when the auxiliary index is bigger than a
threshold value
• Assume that the threshold value for refreshing
the auxiliary index is a large constant n
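A minimal sketch of this policy (assuming the indexes are simple term → docID-list dicts, and measuring the auxiliary index by its number of postings):

def add_document(aux_index, main_index, doc_postings, threshold):
    """Add one document's (term, docID) pairs to the in-memory auxiliary
    index, then merge into the main index once the threshold n is hit."""
    for term, doc_id in doc_postings:
        aux_index.setdefault(term, []).append(doc_id)
    if sum(len(p) for p in aux_index.values()) >= threshold:
        for term, plist in aux_index.items():      # merge aux into main
            main_index.setdefault(term, []).extend(plist)
        aux_index.clear()                          # start a fresh aux index

aux, main = {}, {}
add_document(aux, main, [("caesar", 1), ("brutus", 1)], threshold=2)
print(main)   # {'caesar': [1], 'brutus': [1]} : merged once |aux| >= n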
27
Example
Suppose the symbol ⊕ represents the merge operation.

             Auxiliary index    Main index          New main index after merging
1st merge    n postings       ⊕ 0 postings        → n postings
2nd merge    n postings       ⊕ n postings        → 2n postings
3rd merge    n postings       ⊕ 2n postings       → 3n postings
...
k-th merge   n postings       ⊕ (k-1)n postings   → kn postings
28
Time complexity
• To process T = kn items uses k = T/n merges
• To merge two sorted lists of sizes n and Jn
takes O(n + Jn) = O(Jn) time
• The process of building a main index with T
postings needs the merges J = 1, ..., T/n, so takes
O(1n + 2n + 3n + ... + (T/n)n) = O(T²/n)
time (the arithmetic series sums to n · (T/n)(T/n + 1)/2)
29
Logarithmic merge
Basic idea: Don’t merge auxiliary and main index directly
• Speeds up merging and index construction in
dynamic indexing
• Maintain a series of indexes, each twice as large
as the previous one
• Keep smallest index (Z0) in memory
• Larger indices (I0, I1, ...) on disk (size doubling)
I0 with size n, I1 with size 2n, I2 with size 4n, and so on
• The scheme for merging
If Z0 gets too big (≥ n), either write it to disk as I0
(if no I0 exists), or merge it with I0 into Z1
Then either write Z1 to disk as I1 (if no I1 exists),
or merge it with I1 to form Z2, and so on
30
Pseudo code of logarithmic merging
LMergeAddToken(indexes, Z0, token)
1. Z0 ← Merge(Z0, {token})
2. if |Z0| = n
3.   then for i ← 0 to ∞
4.     do if Ii ∈ indexes
5.       then Zi+1 ← Merge(Ii, Zi)
6.            (Zi+1 is a temporary index on disk)
7.            indexes ← indexes − {Ii}
8.       else Ii ← Zi (Zi becomes the permanent index Ii)
9.            indexes ← indexes ∪ {Ii}
10.           Break
11.  Z0 ← ∅

LogarithmicMerge()
1. Z0 ← ∅ (Z0 is the in-memory index)
2. indexes ← ∅
3. while true
4. do LMergeAddToken(indexes, Z0, GetNextToken())
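A compact Python sketch of the flush step (illustrative: indexes maps level i to index Ii, an index is just a sorted list of postings, and the demo uses n = 1):

import heapq

def logarithmic_merge_flush(indexes, z0):
    """Flush the in-memory index Z0 once |Z0| = n: merge it up through
    the levels, doubling in size, until a free level Ii is found."""
    i = 0
    z = sorted(z0)
    while i in indexes:                     # Ii exists: merge and go up
        z = list(heapq.merge(indexes.pop(i), z))
        i += 1
    indexes[i] = z                          # Zi becomes the permanent Ii
    z0.clear()                              # Z0 <- empty

indexes = {}
for k in range(7):                          # seven flushes with n = 1
    logarithmic_merge_flush(indexes, [k])
print(sorted(indexes))                      # [0, 1, 2]: binary 111 = 7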
31
Example
The symbol ⊕ represents the merge operation.

When |Z0| = n   Actions taken                                  Indexes on disk
1st time        I0 ← Z0; Z0 ← ∅                                I0
2nd time        Z1 ← I0 ⊕ Z0; I1 ← Z1; Remove I0; Z0 ← ∅       I1
3rd time        I0 ← Z0; Z0 ← ∅                                I0, I1
4th time        Z1 ← I0 ⊕ Z0; Z2 ← I1 ⊕ Z1; I2 ← Z2;
                Remove I0, I1; Z0 ← ∅                          I2
5th time        I0 ← Z0; Z0 ← ∅                                I0, I2
6th time        Z1 ← I0 ⊕ Z0; I1 ← Z1; Remove I0; Z0 ← ∅       I1, I2
7th time        I0 ← Z0; Z0 ← ∅                                I0, I1, I2
k-th time       the indexes on disk are given by the
                binary representation of k
32
Time complexity
• Size doubling: for T postings, the series
of indexes consists of at most log T indexes,
I0, I1, I2, ..., I_log T
• Why? Need k = log(T/n) levels, since 2^k · n = T items
• To build a main index with T postings, the overall
construction time is O(T log T)
• Each posting is processed (i.e., merged) only
once on each of (at most) log T levels
• Why? Whenever a merge occurs, a posting moves up one level
• So logarithmic merge is more efficient for index
construction than block update as
T log T < T²
33
Searching the Index
34
Search structures for dictionaries
• Given a keyword of a query, determine if the
keyword exists in the vocabulary (dictionary).
If so, identify the pointer to the corresponding
postings list
• If no search structure exists, we have to check
the terms of the dictionary one by one until a
match is found or all terms are exhausted
takes O(n) time, where n is the number of terms of the
dictionary
• Search structures help speed up the vocabulary
look-up operation
35
Search structures for dictionaries
• Two main choices
Hash table (introduced on slide p.16)
Search tree
• Factors affecting choice
How many terms are we likely to have?
Is the number likely to remain static, or change a lot?
Are we likely to only have new terms inserted, or also
to have some terms in the dictionary deleted?
What are the relative frequencies with which various
terms will be accessed?
Generally speaking, a hash table is preferable for more static
data, while a search tree handles dynamic data more efficiently.
36
Hash table
• Hash table: An array with a hash function and
collision management
• Mainly operated by a hash function, which
determines where to find (or insert) a term
• A hash function maps a term to an integer between
0 and N-1, where N is the number of entries of
the hash table
• Hashing is reproducible randomness: it looks
like a term is mapped to a random array index,
but every time we map the term we get the same
index.
37
An example of hash function
Suppose the dictionary consists of terms
that are composed of lower-case letters
or white-space only
A term consists of at most 20 characters
Let f() be a function that maps whitespace to 0, ‘a’ to 1, ‘b’ to 2, ..., ‘z’ to 26.
Let N be a large prime number
The hash function F(word) can be
defined as
[ f(1st character) + f(2nd character)·26
+ f(3rd character)·26² + f(4th
character)·26³ + ... ] mod N
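This is easy to transcribe into Python (N = 13 below only to match the worked example on the next slide; the slide's advice is that N should be a large prime):

def f(ch):
    """Map whitespace to 0 and 'a', 'b', ..., 'z' to 1, 2, ..., 26."""
    return 0 if ch == " " else ord(ch) - ord("a") + 1

def F(word, N):
    """[ f(1st char) + f(2nd char)*26 + f(3rd char)*26^2 + ... ] mod N"""
    return sum(f(ch) * 26**i for i, ch in enumerate(word)) % N

print(F("caesar", 13))   # 3, matching the worked example on the next slide
print(F("enact", 13))    # 5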
38
Hash function
• Suppose N = 13
• For term ‘caesar’
F(‘caesar’) = (3 + 1·26 + 5·26² + 19·26³ + 1·26⁴ + 18·26⁵) mod 13
= 214659097 mod 13
= 3
• For term ‘enact’
F(‘enact’) = (5 + 14·26 + 1·26² + 3·26³ + 20·26⁴) mod 13
= 9193293 mod 13
= 5
Exercise: the, let, it, best
Powers of 26: 1, 26, 676, 17576, 456976

Entries   Terms
0
1
2
3         caesar
4         did
5         enact
6         so
7         the
8
9         i
10        julius
11        killed
12
39
Collision
• Collision – two different terms mapped
to the same entry
• For example
For term ‘was’: F(‘was’) = (23 + 1·26 + 19·26²) mod 13
= 12893 mod 13 = 10
‘was’ is mapped to the same entry (10) as
‘julius’ (see the table on slide 39)
• Collision can be resolved by
auxiliary structures, secondary hash function,
or rehashing
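A sketch of the first option in Python: each table entry keeps a chain (an auxiliary list) of (term, postings) pairs, so ‘was’ and ‘julius’ can share entry 10. F is the hash function of slide 38:

def F(word, N):
    """The hash function of slide 38."""
    f = lambda ch: 0 if ch == " " else ord(ch) - ord("a") + 1
    return sum(f(ch) * 26**i for i, ch in enumerate(word)) % N

class ChainedHashTable:
    """Hash table resolving collisions with one chain per table entry."""
    def __init__(self, N=13):
        self.N = N
        self.slots = [[] for _ in range(N)]   # an auxiliary list per entry

    def insert(self, term, postings):
        self.slots[F(term, self.N)].append((term, postings))

    def lookup(self, term):
        for t, postings in self.slots[F(term, self.N)]:   # scan the chain
            if t == term:
                return postings
        return None                                        # term absent

table = ChainedHashTable()
table.insert("julius", [1])
table.insert("was", [1, 2])    # F('was') = F('julius') = 10: a collision
print(table.lookup("was"))     # [1, 2], found despite the collision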
40
Search tree
• Binary search tree
• B-tree
• The terms are in sorted order in the inorder traversal of the tree
• Only practical for in-memory operations
• Read for interest only
41
Binary search tree (BST)
• Binary tree – a tree with every node having at
most two children
• Binary search tree – every node is associated
with a key (term) in which
The term associated with the left child is
lexicographically smaller than that of the parent node
and
The term associated with the parent is
lexicographically smaller than that of the right child
E.g., root ‘did’ with left child ‘caesar’ and right child ‘enact’
42
Example
Note: postings are the documents containing the term
[Figure: a binary search tree over the vocabulary {caesar, did, enact,
i, julius, killed, so, the} with ‘i’ at the root; each vocabulary node
points to its postings list of document numbers, e.g., caesar → 1, 2.]
43
Searching in BST
• Starting from the root node, the search proceeds to
one of the two subtrees below by comparing the
term you are searching for with the term associated
with the root
• The search stops when a match is found or a
leaf node is reached
• The search (or lookup) operation takes O(log T)
time where T is the number of terms, provided
that the BST is balanced
Balance criteria, e.g., the numbers of terms under the
two subtrees of any node are either equal or differ by 1
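A direct Python sketch of this lookup on the tiny tree of slide 42 (postings follow Doc 1 and Doc 2 from the Reminder slide):

class Node:
    def __init__(self, term, postings):
        self.term, self.postings = term, postings
        self.left = self.right = None

def bst_search(node, term):
    """Compare the query term with each node's term and descend."""
    while node is not None:
        if term == node.term:
            return node.postings              # match found
        node = node.left if term < node.term else node.right
    return None                               # reached a leaf: not present

root = Node("did", [1])                       # 'did' occurs in Doc 1
root.left = Node("caesar", [1, 2])            # lexicographically smaller
root.right = Node("enact", [1])               # lexicographically larger
print(bst_search(root, "caesar"))             # [1, 2]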
44
B-tree
• Number of subtrees under an internal node varies in a
fixed interval [a, b], where a ≤ b are positive integers
• The number of terms associated with an internal node,
except the root, is between a-1 and b-1
• Can be viewed as “collapsing” multiple levels of the binary
tree into one
• Good for the case where the dictionary is disk-resident, in which
case this collapsing serves the function of prefetching
imminent binary tests
The integers a and b are determined by the sizes of disk blocks
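A sketch of searching such a tree in Python (a hand-built fragment of the slide-46 example; insertion and node splitting are omitted, and here every node simply stores the postings of its own terms):

class BTreeNode:
    def __init__(self, terms, children=None, postings=None):
        self.terms = terms               # sorted keys, a-1 to b-1 of them
        self.children = children or []   # len(terms) + 1 children, or none
        self.postings = postings or {}   # term -> postings list

def btree_search(node, term):
    """Descend one node (one disk block) per level, scanning the keys."""
    while node is not None:
        i = 0
        while i < len(node.terms) and term > node.terms[i]:
            i += 1                       # find the first key >= term
        if i < len(node.terms) and node.terms[i] == term:
            return node.postings[term]   # match found in this node
        node = node.children[i] if node.children else None
    return None

root = BTreeNode(["capitol", "hath"],
                 postings={"capitol": [1], "hath": [2]},
                 children=[
    BTreeNode(["be", "brutus", "caesar"],
              postings={"be": [2], "brutus": [1, 2], "caesar": [1, 2]}),
    BTreeNode(["did", "enact"], postings={"did": [1], "enact": [1]}),
    BTreeNode(["julius", "killed"], postings={"julius": [1], "killed": [1]}),
])
print(btree_search(root, "enact"))       # [1]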
45
Example
a = 2 and b = 4
[Figure: a B-tree over the vocabulary (ambitious, be, brutus, caesar,
capitol, did, enact, hath, i, i’, it, julius, killed, let, me), with
‘capitol’ and ‘hath’ as the keys of the root node; each term points to
its postings list of document numbers, e.g., caesar → 1, 2.]
46
Reminder
Doc 1
• I did enact Julius Caesar: I was killed i’ the Capitol;
Brutus killed me.
Doc 2
• So let it be with Caesar. The noble Brutus hath told
you Caesar was ambitious:
47