Indexing Mixed Types for Approximate Retrieval

Download Report

Transcript Indexing Mixed Types for Approximate Retrieval

Indexing Mixed Types for Approximate Retrieval
Liang Jin*
Nick Koudas
Chen Li*
Anthony K.H. Tung
UC Irvine
University of Toronto
UC Irvine
National University of Singapore
VLDB’2005
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586
Queries with Mixed-Type Predicates
Star
Title
The Matrix
1999
Sci-Fi
Samuel Jackson
Star Wars: Episode III - Revenge of the Sith
2005
Sci-Fi
Schwarzenegger
The Terminator
1984
Sci-Fi
Samuel Jackson
Goodfellas
1990
Drama
…
…
…
…
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
SIMLARTO:
– a domain-specific function
– returns a similarity value between two strings
•
Genre
Keanu Reeves
SELECT *
•
Year
Example: edit distance
ed(Tom Hanks, Ton Hank) = 2
2
Why fuzzy predicates?
• Errors in queries
– User doesn’t remember a string exactly
– User types a wrong string
• Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
Relation R
Relation S
Star
Star
Keanu Reeves
Keanu Reeves
Samuel Jackson
Samuel L. Jackson
Schwarzenegger
Schwarzenegger
Samuel Jackson
Samuel L. Jackson
…
…
3
Problem Formulation
Given: A query with fuzzy predicates on strings and
range predicates on numeric attributes
on a single relation
Goal: Answer the query efficiently
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
4
Rest of the talk
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
5
Assumptions
• One fuzzy string predicate (edit distance)
• One numeric predicate
Query:
(Qs, δs, Qn, δn)
SELECT *
FROM Movies
(’Schwarrzenger’, 2, 1980, 5)
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
6
Intuition of MAT (Mixed-attribute-type) Tree
• “2 > 1 + 1”
– One integrated indexing structure is better than
– two independent indexing structures on two attributes
• Indexing numeric attributes: B-tree or R-tree
• Indexing strings as a tree to support fuzzy predicates?
*
*
...
< 1 9 4 6 ,1 9 5 7 >
MAT tree
Root
...
< 1 9 6 4 ,1 9 7 7 >
MBR
*
*
*
...
...
*
...
...
< 1 9 4 6 ,1 9 5 6 >
S p ie lb e rg
1946
G ib so n
1956
< 1 9 5 6 ,1 9 5 7 >
H a n ks
1956
H a n ks
1957
< 1 9 6 4 ,1 9 6 8 >
C ro w e
1964
R o b e rt
1968
7
< 1 9 7 4 ,1 9 7 7 >
D iC a p rio
1974
R o b e rrts
1977
Leaf
nodes
Answering a query (Qs, δs, Qn, δn)
• Top-down traverse the MAT-tree
• At each node, do pruning by checking:
– If [Qn – δn, Qn + δn] overlap with the numeric range.
– If minEditDistance(Qs, Tn) <= δs.
*
*
...
<1946,1957>
Root
...
<1964,1977>
MBR
*
*
*
...
...
*
...
...
<1946,1956>
Spielberg
1946
Gibson
1956
<1956,1957>
Hanks
1956
Hanks
1957
<1964,1968>
Crowe
1964
Robert
1968
8
<1974,1977>
DiCaprio
1974
Roberrts Leaf
nodes
1977
Challenge
• How to represent strings to fit into a limited space
• and support fuzzy-predicate pruning
Limited space (disk based)
*
*
...
< 1 9 4 6 ,1 9 5 7 >
Root
...
< 1 9 6 4 ,1 9 7 7 >
MBR
*
*
*
...
*
...
...
...
< 1 9 4 6 ,1 9 5 6 >
S p ie lb e rg
1946
G ib so n
1956
< 1 9 5 6 ,1 9 5 7 >
H a n ks
1956
H a n ks
1957
< 1 9 6 4 ,1 9 6 8 >
C ro w e
1964
R o b e rt
1968
< 1 9 7 4 ,1 9 7 7 >
D iC a p rio
1974
9
R o b e rrts
1977
Leaf
nodes
Existing Approaches to Indexing Strings as Trees
• M-tree:
– Edit distance: metric space
• Q-tree
– Utilize the q-gram property of strings.
– See our paper for details
10
Representing strings as a trie
S trin g s :
a a d , a b cd e , a b d fg ,
beh, ca, cb
b
a n2
a
b
n5
d n10
c
n11
d
e
n1
e
n6
d
c
n3
n7
a n8
h
n12
n13
f
n14
n15
g
n16
n17
11
n4
b
n9
Compressing a trie
Strings:
aad, abcde, abdfg,
beh, ca, cb
a n2
a n5
b n6
d n10 c
n11
d
n1
b n3
e n7
d h
n12
n13
c
S trin g s:
aad, abcde , abdfg ,
beh, ca, cb
n4
a n8 b n9
<{a},1>
compression
n5
f
n14
n2
n11
e
g
n16
<{c,d,e},3>
n17
• Select k representative nodes (centers).
• Each center is in the format of <alphabet,height>.
• A compressed trie represents more strings
12
<{b},2>
n7
n4
<{e},1>
<{a,b,c},2>
<{b,d},2>
n13
n6
<{a,d},2>
n15
n1
<{h},1>
n15
<{f,g},2>
Minimum edit distance between a string a trie
minEditDistace (Qs, Tn)?
– Convert a trie to an automaton.
– Compute the min distance between a string and an automaton [Myers and Miller, 1989]
– Early termination possible
E d it G ra p h
A u to m a to n
b
[*,a ]
d
a
b
[*,b ]
[a ,d ]
[a ,a ]
[*,*]
[a ,*]
b
[*,b ]
[a ,*]
Q u e ry S trin g
d
a
[a ,b ]
[a ,*]
[*,d ]
[a ,*]
[*,d ]
d
[*,*]
[*,a ]
a
“a c”
[c,*]
[c,*]
[c,b ]
[c,d ]
[c,a ]
[c,*]
b
[*,b ]
[c,*]
[*,a ]
a
13
[*,d ]
d
[*,*]
Compressed trie  Automaton
• Each node is a state.
• Each edge becomes a transition between two states.
• For compressed node <Σ, L>, expand it to L levels.
At each level, all characters in Σ become single
states and are connected to a common tail ε.
a
a
b
b
c
c
Convert a compressed node <{a,b,c},2> into automaton nodes.
14
Outline
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
15
Constructing MAT-tree
• Option 1: insert records one by one.
• Option 2:
– bulk-load records
– construct the MAT-tree bottom-up
16
Compressing a trie
• Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
• Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining of (1) and (2)
17
Method (1): Reducing # of accepted strings
• Intuition:
– reducing this # makes the compressed trie more accurate
• Goodness function: # of accepted strings
• Algorithm: “Randomized”
–
–
–
–
–
Randomly select k initial centers
Randomly select one of the centers
Randomly select an unselected node
Swap them if it can improve the goodness function
Do certain # of iterations
18
Method (2): Keeping accepted strings clustered
• Intuition:
– keeping the accepted strings similar to the original ones by letting
them share common prefix.
– Place k centers as close to the root as possible.
• Algorithm: “BreadthFirst”
S trin g s :
a a d , a b c d e , a b d fg ,
beh, ca, cb
n1
< {a ,d },2 >
n5
c
b n3
a n2
n4
< {a },1 >
n6
n7
n8
< {b ,c ,d ,e ,f,g },4 > < {e ,h },2 >
19
n9
< {b },1 >
Method (3): Combining (1) and (2)
• Intuition:
– minimize the number of accepted strings, and in the same
time maintain their similarity to the originals.
• Algorithm: “Bottomup”
– Keep shrinking the trie bottom up until we have k nodes
– Compress a node that minimizes # of additional strings
20
Dynamic maintenance
Insertion (s, n)
• Search the index for (s, n). If it’s not in the index,
identify the correct leaf node.
• If no overflow:
– update the “MBR” of the leaf node and its precedents
recursively if necessary.
• If overflow:
– Split the leaf node and
– Construct two compressed tries
– Cascade the split to the precedents if necessary.
Deletion and Update are handled similarly
21
Outline
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
22
Setting
• Data
– IMDB: 100K movie star records (Name and YOB).
– Customers: 50K records (Name and YOB)
• Test bed
– PC: 2.4G P4, 1.2GB Memory, Windows XP
– Visual C++ compiler
• Similar results. Report result for IMDB.
23
Implemented approaches
•
•
•
•
•
•
B-tree
Q-tree
B-tree & Q-tree
BQ-tree
BM-tree
Sequential scan
“BBQ-tree”? 
24
“2 > 1 + 1”
An integrated indexing structure is better than
two separate indexing structures
δs=3, δn=4
25
Scalability
26
Effect of numeric threshold δn
27
Effect of string threshold δs
28
Dynamic Maintenance: time
29
Dynamic maintenance: MAT quality
30
Number of centers
• Increasing cluster # may not reduce the running time: pruning power versus
computational cost
• For BottomUp and BreadthFirst (compared to Randomized)
- Centers close to the root, thus more likely to do early termination
31
Conclusion
• MAT-tree: an efficient indexing structure for
queries with mixed-type predicates
• Can be efficiently constructed and maintained
• Future work: develop a uniform framework to
support different kinds of similarity functions
The Flamingo Project : http://www.ics.uci.edu/~flamingo/
Q&A?
32