Transcript s793-jin
Indexing Mixed Types for Approximate Retrieval
Liang Jin*
Nick Koudas
Chen Li*
Anthony K.H. Tung
UC Irvine
University of Toronto
UC Irvine
National University of Singapore
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586
Queries with Mixed-Type Predicates
Star
Title
The Matrix
1999
Sci-Fi
Samuel Jackson
Star Wars: Episode III - Revenge of the Sith
2005
Sci-Fi
Schwarzenegger
The Terminator
1984
Sci-Fi
Samuel Jackson
Goodfellas
1990
Drama
…
…
…
…
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
SIMLARTO:
– a domain-specific function
– returns a similarity value between two strings
•
Genre
Keanu Reeves
SELECT *
•
Year
Example: edit distance
ed(Tom Hanks, Ton Hank) = 2
2
Why fuzzy predicates?
• Errors in queries
– User doesn’t remember a string exactly
– User types a wrong string
• Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
Relation R
Relation S
Star
Star
Keanu Reeves
Keanu Reeves
Samuel Jackson
Samuel L. Jackson
Schwarzenegger
Schwarzenegger
Samuel Jackson
Samuel L. Jackson
…
…
3
Problem Formulation
Given: A query with fuzzy predicates on strings and
range predicates on numeric attributes
on a single relation
Goal: Answer the query efficiently
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
4
Rest of the talk
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
5
Assumptions
• One fuzzy string predicate (edit distance)
• One numeric predicate
Query:
(Qs, δs, Qn, δn)
SELECT *
FROM Movies
(’Schwarrzenger’, 2, 1980, 5)
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
6
Intuition of MAT (Mixed-attribute-type) Tree
• “2 > 1 + 1”
– One integrated indexing structure is better than
– two independent indexing structures on two attributes
• Indexing numeric attributes: B-tree or R-tree
• Indexing strings as a tree to support fuzzy predicates?
*
*
...
<1946,1957>
MAT tree
Root
...
<1964,1977>
MBR
*
*
*
...
...
*
...
...
<1946,1956>
Spielberg
1946
Gibson
1956
<1956,1957>
Hanks
1956
Hanks
1957
<1964,1968>
Crowe
1964
Robert
1968
7
<1974,1977>
DiCaprio
1974
Roberrts Leaf
nodes
1977
Answering a query (Qs, δs, Qn, δn)
• Top-down traverse the MAT-tree
• At each node, do pruning by checking:
– If [Qn – δn, Qn + δn] overlap with the numeric range.
– If minEditDistance(Qs, Tn) <= δs.
*
*
...
<1946,1957>
Root
...
<1964,1977>
MBR
*
*
*
...
...
*
...
...
<1946,1956>
Spielberg
1946
Gibson
1956
<1956,1957>
Hanks
1956
Hanks
1957
<1964,1968>
Crowe
1964
Robert
1968
8
<1974,1977>
DiCaprio
1974
Roberrts Leaf
nodes
1977
Challenge
• How to represent strings to fit into a limited space
• and support fuzzy-predicate pruning
Limited space (disk based)
*
*
...
<1946,1957>
Root
...
<1964,1977>
MBR
*
*
*
...
*
...
...
...
<1946,1956>
Spielberg
1946
Gibson
1956
<1956,1957>
Hanks
1956
Hanks
1957
<1964,1968>
Crowe
1964
Robert
1968
<1974,1977>
DiCaprio
1974
9
Roberrts Leaf
nodes
1977
Existing Approaches to Indexing Strings as Trees
• M-tree:
– Edit distance: metric space
• Q-tree
– Utilize the q-gram property of strings.
– See our paper for details
10
Representing strings as a trie
Strings:
aad, abcde, abdfg,
beh, ca, cb
a n2
b n6
a n5
d n10
c
n11
d
e
n1
c
b n3
e n7
a n8 b n9
d h
n12
n13
f
n14
n15
g
n16
n4
n17
11
Compressing a trie
Strings:
aad, abcde, abdfg,
beh, ca, cb
a n2
a n5
b n6
d n10 c
n11
d
e
n1
b n3
e n7
d h
n12
n13
c
Strings:
aad, abcde, abdfg,
beh, ca, cb
n4
a n8 b n9
<{a},1>
compression
n5
<{a,d},2>
f
n14
n15
n11
g
n16
n2
• Select k representative nodes (centers).
• Each center is in the format of <alphabet,height>.
• A compressed trie represents more strings
12
<{b},2>
n7
n4
<{e},1>
<{a,b,c},2>
<{b,d},2>
n13
n6
<{h},1>
n15
<{c,d,e},3> <{f,g},2>
n17
n1
Minimum edit distance between a string a trie
minEditDistace (Qs, Tn)?
– Convert a trie to an automaton.
– Compute the min distance between a string and an automaton [Myers and Miller, 1989]
– Early termination possible
Edit Graph
Automaton
d
[a,*]
[a,d]
[a,a]
[*,d]
[*,a]
a
[c,*] [c,a]
[*,*]
[a,*]
b
[a,*]
“ac”
d
a
[a,b]
[*,b]
Query String
[*,d]
[*,a] [a,*]
b
a
b
[*,b]
d
[*,*]
[c,*]
[c,b]
[c,d]
[c,*]
b
[*,b]
[c,*]
[*,a]
a
13
[*,d]
d
[*,*]
Compressed trie Automaton
• Each node is a state.
• Each edge becomes a transition between two states.
• For compressed node <Σ, L>, expand it to L levels.
At each level, all characters in Σ become single
states and are connected to a common tail ε.
a
a
b
c
b
c
Convert a compressed node <{a,b,c},2> into automaton nodes.
14
Outline
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
15
Constructing MAT-tree
• Option 1: insert records one by one.
• Option 2:
– bulk-load records
– construct the MAT-tree bottom-up
16
Compressing a trie
• Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
• Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining of (1) and (2)
17
Method (1): Reducing # of accepted strings
• Intuition:
– reducing this # makes the compressed trie more accurate
• Goodness function: # of accepted strings
• Algorithm: “Randomized”
–
–
–
–
–
Randomly select k initial centers
Randomly select one of the centers
Randomly select an unselected node
Swap them if it can improve the goodness function
Do certain # of iterations
18
Method (2): Keeping accepted strings clustered
• Intuition:
– keeping the accepted strings similar to the original ones by letting
them share common prefix.
– Place k centers as close to the root as possible.
• Algorithm: “BreadthFirst”
Strings:
aad, abcde, abdfg,
beh, ca, cb
c n4
b n3
a n2
<{a,d},2>
n5
n1
n6
<{a},1>
n7
n8
<{b,c,d,e,f,g},4> <{e,h},2>
19
n9
<{b},1>
Method (3): Combining (1) and (2)
• Intuition:
– minimize the number of accepted strings, and in the same
time maintain their similarity to the originals.
• Algorithm: “Bottomup”
– Keep shrinking the trie bottom up until we have k nodes
– Compress a node that minimizes # of additional strings
20
Dynamic maintenance
Insertion (s, n)
• Search the index for (s, n). If it’s not in the index,
identify the correct leaf node.
• If no overflow:
– update the “MBR” of the leaf node and its precedents
recursively if necessary.
• If overflow:
– Split the leaf node and
– Construct two compressed tries
– Cascade the split to the precedents if necessary.
Deletion and Update are handled similarly
21
Outline
•
•
•
•
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments
22
Setting
• Data
– IMDB: 100K movie star records (Name and YOB).
– Customers: 50K records (Name and YOB)
• Test bed
– PC: 2.4G P4, 1.2GB Memory, Windows XP
– Visual C++ compiler
• Similar results. Report result for IMDB.
23
Implemented approaches
•
•
•
•
•
•
B-tree
Q-tree
B-tree & Q-tree
BQ-tree
BM-tree
Sequential scan
“BBQ-tree”?
24
“2 > 1 + 1”
An integrated indexing structure is better than
two separate indexing structures
δs=3, δn=4
25
Scalability
26
Effect of numeric threshold δn
27
Effect of string threshold δs
28
Dynamic Maintenance: time
29
Dynamic maintenance: MAT quality
30
Number of centers
• Increasing cluster # may not reduce the running time: pruning power versus
computational cost
• For BottomUp and BreadthFirst (compared to Randomized)
- Centers close to the root, thus more likely to do early termination
31
Conclusion
• MAT-tree: an efficient indexing structure for
queries with mixed-type predicates
• Can be efficiently constructed and maintained
• Future work: develop a uniform framework to
support different kinds of similarity functions
The Flamingo Project : http://www.ics.uci.edu/~flamingo/
Q&A?
32
Backup Slides
33
Constructing MAT-tree
• Option 1: inserting records one by one.
• Option 2: bulk-loading data records and constructing the
MAT-tree in a bottom-up fashion.
– Records are sorted based on one attribute.
– Fill pages with records until full.
– Calculate the numeric range and the compressed trie for
each leaf nodes.
– Merge leaf nodes into internal nodes recursively according
to desired fanout, until a single root is formed.
34
Example – Customer Service Call Center
Customer calls in
Serve the customer
Issue a fuzzy query:
Name LIKE “Tom Hanks” AND
YOB CLOSE to 1958
Return
result
Name
SSN
YOB
Jack Lemmon
430-871-8294
1978
Harrison Ford
292-918-2913
1962
Tom Hanks
234-762-1234
1956
Tim Legler
125-457-8654
1870
…
…
…
35
In this example, the
underline system
should be able to
support fuzzy query
on both the string and
numeric attributes!
Scalability test (IO)
36