I v - The University of Winnipeg

Download Report

Transcript I v - The University of Winnipeg

On the Intersection of Inverted Lists
Yangjun Chen and Weixin Shen
Dept. Applied Computer Science,
University of Winnipeg
515 Portage Ave.
Winnipeg, Manitoba, Canada R3B 2E9
Outline


Motivation
New Index Structure
-

Algorithm
-


Trie over word sequences
Interval sequence assigned to words
Sublists assigned to intervals
basic algorithm based on linear search
algorithm based on binary search
Experiments
Summary
Motivation

Evaluation of conjunctive queries in text
databases and search engines
w1  w2  …  wk,
where each wi is a word.
To find all the documents containing these
words.

Example
cat  dog
New Index Structures
Documents and word sequences:
DocId
1
2
3
4
5
6
7
8
9
10
11
words
a, f, d
a, d
a, e, d
f, b, a
c, d, e
d, f, e, c
f, d, e, a
f, d, e, b
e, c
a, e, f
f, e, c
sorted ws
d, f, a
d, a
e, d, a
f, a, b
e, d, c
e, d, f, c
e, d, f, a
e, d, f, b
e, c
e, f, a
e, f, c
Inverted lists:
e:
d:
f:
a:
c:
b:
{3, 5, 6, 7, 8, 9, 10, 11}
{1, 2, 3, 5, 6, 7, 8}
{1, 4, 6, 7, 8, 10, 11}
{1, 2, 3, 4, 7, 10}
{5, 6, 9, 11}
{4, 8}
e  d = {3, 5, 6, 7, 8, 9, 10, 11}
 {1, 2, 3, 5, 6, 7, 8}
= {3, 5, 6, 7, 8}
I v14
New Index Structures
Sorted word sequences:
DocId
1
2
3
4
5
6
7
8
9
10
11
Trie over sorted word sequences:
sequences
d, f, a
[1, 20]

d, a
v1
[1, 4] d v f
2
e, d, a
v
f, a, b [1, 2] 4
f
v5 a v6 a
e, d, c
v10
[3, 3]
v11 b
a
e, d, f, c
e, d, f, a
[5, 5]
[1, 1]
e, d, f, b
e, c
e, f, a
e, f, c
v0
v3
[5, 7]
[5, 6] [8, 14]
v12
d
[10, 10]
Fig. 1
[8, 19]
c [15, 15] f
c [10, 13] f
[9, 9]
v17 c
(c)
v8
v13
a
[8, 8]
v7
e
v14
a
v15
v9
[16, 18]
c
v16
[16, 16] [17, 17]
b v19
v18 a
[11, 11]
[12, 12]
New Index Structures
Trie. Assume that S = {s1, …, sn}. If |S| = 0, the trie is, of course, empty.
For |S| = 1, trie(S) is a single node. If |S| > 1, S is split into m (possibly empty)
subsets S1, S2, …, Sm so that a string is in Sj if its first word is wj (1 ≤ j ≤ m).
The tries trie(S1), trie(S2), …, trie(Sn) are constructed in the same way except
that at the kth step, the splitting of sets is based on the kth words in the
sequences. They are then connected from their respective roots to a single
node to create trie(S).
Tree encoding. Label each node v in a trie with an interval Iv = [αv, βv], where
βv denotes the rank of v in a post-order traversal of the trie. Here the ranks are
assumed to begin with 1, and all the children of a node are assumed to be ordered
and fixed during the traversal. Furthermore, αv denotes the lowest rank for any
node u in T[v] (the subtree rooted at v, including v). Thus, for any node u in T[v],
we have Iu  Iv since the post-order traversal enters a node before all of its
children, and leaves after having visited all of its children.
New Index Structures

More than one node may be labeled with the
same word

We associate each word w with a interval
1
2
k
sequence of the form: Lw = I w , I w , …, I w, where k
is the number of all those nodes labeled with w
i [ i [1], i [2]] (1  i  k) is an
and each = I w
Iw
Iw
interval associated with a certain node labeled
with w.
New Index Structures
Le: [8,19]
e: {3, 5, 6, 7, 8, 9, 10, 11}
Ld:[1, 4][8, 14]
d: {1, 2, 3, 5, 6, 7, 8}
Lf: [1, 2][5, 7][10, 13][16, 18]
f: {1, 4, 6, 7, 8, 10, 11}
La: [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16] a: {1, 2, 3, 4, 7, 10}
Lc: [9, 9][10, 10][15, 15][17, 17]
c: {5, 6, 9, 11}
Lb:[5, 5][12, 12]
b: {4, 8}
• In general, an interval sequence is shorter than the
corresponding inverted list.
• The longer an inverted list, the shorter the corresponding
interval sequence.
I v14
New Index Structures

Assignment of DocIDs to intervals

{1}
f
v10
{1} a
Ld: [1,
v1
v2
v3
{3, 5, 6, 7, 8, 9, 10, 11}
f {4}
d
e
v6
{3, 5, 6, 7, 8} v7
v9
v8
v5 a
{4}
a
c {9}
d
f {10, 11}
{2}
v12
v11
v13
v14 v15
v17
b {3} a {5} c {6, 7, 8} f
{4}
a {10} c
{11}
v17
v19
v18
c
a
b
{7}
{6}
{8}
4][8, 14]
{1, 2}
v4
v0
I v14 = [10, 13]. The set {6, 7, 8} assigned to v14
can be considered as the set assigned to [10, 13].
{1, 2} {3, 5, 6, 7, 8}
Fig. 2
Query Evaluation
Q = w  w′
?
Lw:
Lw′:
answer:
S1
⊎
S2
⊎
S3
Assume that frequency of w is higher than w.
BASIC EVALUATION ALGORITHM
conj(Lw, Lw) - to find all those intervals in Lw with each being
contained in some interval in Lw, stored in a new sequence L.
1. Let Lw = I1, …, Ik. Let Lw = J1, , …, Jk . L  .
2. Step through Lw and Lw from left to right. Let Ip and Iq be
the interval currently encountered. We will do one of the
following checkings:
i) If Ip  Jq append Jq to the end of L. Move to Jq+1 if q < k
(then, in a next step, we will Ip check against Jq+1 ).
If q = k, stop.
ii) If Ip[1] > Jq[2], move to Jq+1 if q < k. If q = k, stop.
iii) If Ip[2] < Jq[1], move Ip+1 to if p < k (then, in a next step,
we will check Jq against Ip+1). If p = k, stop.
BASIC ALGORITHM
db?
1st step:
p
2nd step:
Ld: [1, 4][8, 14]
p
[1, 4][8, 14]
Lb: [5, 5][12, 12] [5, 5][12, 12]
q
q
3rd step:
p
[1, 4][8, 14]
[5, 5][12, 12]
q
• In Lb, only [12, 12] is contained in an interval [8,
12] in Ld.
• Return the subset associated with [12, 12] as the
result. It is {8}.
BASIC ALGORITHM
Q = d  f  a.
Ld = [1, 4][8, 14]
Lf = [1, 2][5, 7][10, 13][16, 18]
La = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
L = conj(Ld, Lf) = [1, 2][10, 13].
L = conj(L, La) = [1, 1][11, 11].
{1}
{7}
The results: {1} ⊎ {7} = {1, 7}.
Algorithm based on binary search

Set intersection based on binary search
 Let L1 = I1, …, In and Let L2 = J1, , …, Jm be two interval
sequences with m = |L2 | ≤ n = |L1|.
 Let l = log(n/m). Then, 2l is the largest power of
two not exceeding n/m. Let t = n - 2l + 1.
L2:
L1:
Fig. 3
t
Algorithm based on binary search
Compare Jm and It.
• If Jm[1] > It[2] (Jm appears to the right of It), we should look
for the intervals (in L1) covered by Jm somewhere to the right
of It.
 By using the traditional binary search, we try to find an
interval I covered by Jm with l more comparisons.
 Around I, we will continually (by a simple linear search)
find the left-most interval x in L1, which can be covered
by Jm; and then with l more comparisons, we will find the
right-most interval y covered by Jm, in a similar way.
 Obviously, all the intervals between x and y, including x
and y, can be covered by Jm.
Algorithm based on binary search
This information allows us to reduce the problem to
the situation illustrated in Fig. 3. To complete the
whole operation, it is sufficient to apply the above
process to L1 and L2, where
L1 = I1, …, Ix-1 and L2 = J1, , …, Jm-1.
L2
L2:
L1
L1:
t x
y
t x
Fig. 4
y
Algorithm based on binary search
• If, on the other hand, Jm[2] < It[1] (Jm appears to the left
of It), , we should check the intervals to the left of It, and
the problem immediately reduces to the checking of L2 =
L2 against L1 = L1[1 .. t - 1]. We can complete the
operation by applying the above process to L1 and L2.
L2 = L2
L2:
L1
L1:
t
t
Fig. 5
Algorithm based on binary search
• However, L2 may become larger than L1. So in the
recursive call, the roles of L2 and L1 may be reversed, by
which we will check each interval I in L2 against L1 to find
an interval I in L2 such that I  the last interval in L1.
L1
L2:
x
L1:
t
x
y
L2
Fig. 6
y
Algorithm based on binary search
• If Jm  It, we will check linearly It-1, It-2, … until we meet
a left-most interval x which can covered by Jm. Then, check
It+1, It+2, … until a right-most interval y which can be
covered by Jm. All the encountered nodes, except x and y,
must be covered by Jm. This reduces the problem to a
checking of L2 = L2[1 .. m - 1] against L1 = L1[1 .. x].
L2
L2:
L1
L1:
x
t
y
x
Fig. 7
t
y
Algorithm based on binary search
• If Jm  It (we may have this case due to the roll interchange),
we add Jm to the result and the problem reduces to a checking
L2 = L2[1 .. m - 1] against L1 = L1[1 .. t].
L2
L2:
L1
L1:
t
t
Fig. 8
Algorithm based on binary search
Example 2 Consider Ld = [1, 4][8, 14] and
La = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]. (m = 2, n = 6)
By our algorithm based on the binary search, the following operations
will be conducted:
Step 1: checking Ld[2] = [8, 14] against La. l = log(6/2) = 1,
t = n - 2l + 1= 6 – 2 + 1= 5,
La[5] = [11, 11]. Since [11, 11]  [8, 14],
we will call linearSearch( ) to find x = 4 and y = 5.
Step 2: checking Ld[1] = [1, 4] against La[1 .. 3].
l = = 1, t = 3 – 21 + 1 = 2, La[2] = [3, 3].
Since [3, 3]  [1, 4], we will will call linearSearch( ) to find x = 1 and y = 2.
IMPROVEMENTS
Search control by using LCAs (least common ancestors)
v1
v10
Tw~
v0
Tw:
v6
v5
v2
v7
v12
:
[1, 20]
[1, 4] [5, 6] [8, 19]
v15
[1, 1] [3, 3] [8, 14] [16, 16]
[8, 8]
v18
[11, 11]
Fig. 9
La = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
v10
v5
v6
v12
v18
v15
IMPROVEMENTS
[1, 20]
4
3
[8, 14]
[1, 4]
2
1
1
Ia
[1, 1]
[8, 19]
2
Ia
[3, 3]
3
Ia
[5, 6]
4
5
Ia
Ia
[8, 8] [11, 11]
6
Ia
[16, 16]
Fig. 10
La = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
Experiments
In the experiments, we have tested seven methods:
Inverted files with melding [6] (IFm for short),
Inverted files with adaptive [16] (IFa for short),
Hashing-based (RanGroupScan in [19]; Hb for short),
Skip-list-based [33] (SkipL for short),
Interval-based (linear-search, discussed in the paper,
Ib for short),
• setIntersect (binary-search, discussed in the paper;
sI for short),
• setIntersect with LCAs (discussed in the paper; sIL for short).
•
•
•
•
•
Experiments
For the experiments with real data, we use part of Wikipedia
data, which contains more than 10 million text documents.
We numbered the documents as they were stored, by assigning
them a sequential number indicating their order in the indexing
process. The characteristics of this collection are shown in
Table 1.
Table 1: Characteristics of Wikipedia Data
Wikipedia data
pages
10,500,000
Size (gigabytes)
16.25
Word occurrences (without markup)
1,567,324,812
Distinct words (after stemming)
3,603,556
Experiments
Two-word queries:
Two inverted lists of the same length:
5 million elements on average for 20 queries
Experiments
Queries with more than two words:
inverted lists of the same length:
5 million elements on average for 20 queries

Summary
An efficient algorithm for intersection of
inverted lists
- Trie over sorted word sequences
- Tree encoding
- Interval sequences associated with words
- Binary search of interval sequences
Computational complexities
time: O(m (log(n/m) + 1)) (m  n), where n and m are
respectively the lengths of the two interval sequences
taking parting the intersection.
Thanks you.