Transcript Document

Developing 3-in-1 Index
Structures on Complex Structure
Similarity Search
Presented by Xiaoli Wang
Supervisor: Anthony K. H. Tung
2015/4/13
SOC@NUS
1
Thesis defense by Xiaoli Wang
Outline
Motivation
Complex structures
Limitations on existing systems
Our solutions
Inverted index for graph similarity search
Inverted index for sequence similarity search
A 3-in-1 inverted indexing tool and its applications
Conclusion and future work
2015/4/13
SOC@NUS
2
Thesis defense by Xiaoli Wang
Why complex structures?
Relational database
Data model: table
Real objects
Sequence, tree, and graph
Actor
Title
Year
Genre
Keanu Reeves
The Matrix
1999
Sci-Fi
Samual Jackson
Iron man
2008
Sci-Fi
Samuel Jackson
The man
2006
Crime
Chemical compound
<user>
<person><name>John</name></person>
</user>
<user>
<person>
<name>Mary</name>
<id>u200</id>
</person>
</user>
2015/4/13
Protein structure
Program flow
Coil
Image
Fingerprint
SOC@NUS
Letter
Shape
3
Thesis defense by Xiaoli Wang
How to manage complex structures?
Conventional indexing and searching systems
Restricted to a given data model
Hard to adapt to support various complex structures
Waste resource to re-design various storage systems
Access method
Access method
Access method
Like query
Sequence
database
Tree
database
Graph
database
Storage
…
Biology data
2015/4/13
Documents
Program flow
SOC@NUS
Image
4
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
2015/4/13
SOC@NUS
5
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
Graph solution[1]
1. Adam Schenker. Graph-theretic techniques for web content mining. PhD thesis. University of South Florida. 2003.
2015/4/13
SOC@NUS
6
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
Book annotation migration problem
Sequence solution[1]
1. Xiaoli Wang, etc. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB. 2013.
2015/4/13
SOC@NUS
7
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
Book annotation migration problem
Annotation search by snapping problem
Sequence solution[1]
1. Xiaoli Wang, etc. Efficient and Effective KNN Sequence Search with Approximate n-grams. In PVLDB. 2013.
2015/4/13
SOC@NUS
8
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
Graph matching problem
Book annotation migration problem
Sequence matching problem
Annotation search by snapping problem
Sequence similarity search problem
Build an information management tool to support the
storage and retrieval of sequences and graphs
2015/4/13
SOC@NUS
9
Thesis defense by Xiaoli Wang
Objectives
Building a 3-in-1 unified indexing system
Design a unified and effective indexing mechanism
Support similarity search on various complex structures
Scope of this thesis
 Focus on the similarity search
problems
 Select edit distance as similarity
measure
 Tree is considered as a specific
graph
2015/4/13
SOC@NUS
10
Thesis defense by Xiaoli Wang
Objectives
Building a 3-in-1 unified indexing system
Design a unified and effective indexing mechanism
Support storage and retrieval of all complex structures
 Step 1: Propose an effective index to
support efficient graph similarity
search
 Step 2: Extend the above index to
further support efficient sequence
similarity search
 Step 3: Design and implement the
unified 3-in-1 index to support
various complex structure search
2015/4/13
SOC@NUS
11
Thesis defense by Xiaoli Wang
Outline
Motivation
Complex structures
Limitations on existing systems
Our solutions
Inverted index for graph similarity search
Inverted index for sequence similarity search
A 3-in-1 inverted indexing tool and its applications
Conclusion and future work
2015/4/13
SOC@NUS
12
Thesis defense by Xiaoli Wang
Graph similarity search problem
Given a graph database, we want to find its most similar
graphs based on graph edit distance
A query graph
a
c
Similar graphs to the query
b
a
c
a
c
a
c
a
b
b
d
b
e
d
f
c
……
a
c
b
A graph database
Graph edit distance:
λ(g1, g2) = minimum # of operations (insertion, deletion, substitution) to
transform graph g1 into graph g2
2015/4/13
SOC@NUS
13
Thesis defense by Xiaoli Wang
Existing works
A naive solution: No pruning and poor scalability
100,000
full scan
candidates
100,000
GED computation
answers
50
C-Star (VLDB'09): Too many GED bound computations
exact
graph edit
GED bound computationUnpractical
GED
computation
candidates
answers
distance
computations
full scan
100,000
100
50
K-AT (TKDE'10): Loose bound – too many false positives
Expensive GED bound
Tighter bound
GED bound computation
GED computation
answers
computation candidates
100,000
70% scan
index
90,000
50
Loose GED bound
2015/4/13
SOC@NUS
14
Thesis defense by Xiaoli Wang
Our solution
Build an index to reduce scan ratio
Use tighter GED bounds to reduce candidate size
GED bound computation
100,000
5% scan
index
candidates
GED computation
100
answers
50
Inverted index based on the star decomposition
The graph is decomposed into smaller stars
Inverted index is designed to store the reference between
graphs and stars
Efficient CA-based search framework is proposed
2015/4/13
SOC@NUS
15
Thesis defense by Xiaoli Wang
Star decomposition
Star structure
A star structure is a labelled, single-level, rooted tree
which can be represented by a 3-tuple s = (r, L, l), where
r is the root vertex, L is the set of leaves and l is a
labelling function
Star representation for a graph
A graph can be broken into a multi-set of star structures
c
b
a
c
b
a
c
g1
2015/4/13
c
c
b
a
c
b
a
c
c
SOC@NUS
a
c
S(g1)
16
Thesis defense by Xiaoli Wang
Star decomposition
Star edit distance
Given two star structures s1 and s2,
λ(s1, s2) = T(r1, r2) + d(L1, L2)
Where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1
d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
M(L1, L2) = max{| ΨL1|, | ΨL2|} − |ΨL1∩ΨL2|
Example:
given s1 = abcc, and s2 = dcc,
T(r1, r2) = 1, as l(a) ≠ l(d);
d(L1, L2) = |3-2| + 3 – 2 = 2;
λ(s1, s2) = 1 + 2 = 3.
2015/4/13
a
b
SOC@NUS
c
s1
d
c
c
c
s2
17
Thesis defense by Xiaoli Wang
Mapping distance: distance between
two star representations
a
a
c
c
g2
g1
b
ε
ε
b
d
s0
ε
s1
s0 0
s2 3
2
0
0
1
0
0
s4 3
s5 1
2
0
0
2
0
0
s0
s0 0
s2 3
s1
2
s3
2
ε
5
1
2
5
s4 3
s5 1
2
1
5
2
2
5
b
a
a
0
a
c
s0
b
d
s2
1
a
c
d
s4
1
a
b
a
c
s0
b
c
s1
c
b
s3
s5
b
d
c
S(g1)
5
ε
S(g2)
The mapping distance: μ(g1, g2) = 0 + 1 + 1 + 5 = 7
2015/4/13
SOC@NUS
18
Thesis defense by Xiaoli Wang
Observations on star decomposition
The lower bound based on mapping distance is shown to
be tight, and can be computed in cubic time
If we use the star structure as the filtering feature in a
feature-based inverted index, then the derived tighter
bounds can be applied for graph pruning
Graphs have more similar stars can be more similar to
each other
If we can have a method to access database graphs in an
increasing order of edit distance to a query graph, then a
CA-based filtering framework can work well to reduce
the scan ratio
2015/4/13
SOC@NUS
19
Thesis defense by Xiaoli Wang
CA-based filtering strategy
Sorted lists for a query graph
Lists are sorted increasingly in order of star edit distances
Stars of a database graph are accessed by increasing
dissimilarity to stars of the query graph
a query graph q
Graph identity
2015/4/13
q: s1
q: s2
g1: 0
g4: 0
g2: 1
g1: 1
g3: 2
g2: 3
…
g6.: 5
g6.: 5
…
g4: 6
g3: 9
SOC@NUS
λ(s, s2), where s is a star in S(g4)
Sorted lists
20
Thesis defense by Xiaoli Wang
CA-based filtering strategy
Use mapping distance as the aggregation value
When CA halts, all unseen graphs are sure to have
mapping distances larger than the current threshold
value. They can be safely filtered out, as mapping
distance is a lower bound of graph edit distance.
a query graph q
q: s1
Seen
Possibly unseen
2015/4/13
q: s2
g1: 0
g4: 0
g2: 1
g1: 1
g3: 2
g2: 3
…
g6.: 5
g6.: 5
…
g4: 6
g3: 9
SOC@NUS
If λ(g6, q) ≥ ω, g6 can be
safely filtered out
ω = sum(2, 3) = 5 > d (= 4)
Threshold value
21
Thesis defense by Xiaoli Wang
Overview of the search framework
Two-level index
Upper-level inverted index between graphs and stars
Lower-level inverted index between vertex labels and stars
2015/4/13
a
c a
b
d b
c a
c
e
d
b
a
c
a
f
…
b
A graph database
SOC@NUS
c
22
Thesis defense by Xiaoli Wang
Overview of the search framework
a
c
g1
b
a
s0
c
b
a
s0
c
a
b
s2
d
a
b
s1
c
a
c
s4
d
a
c
s3
b
b
d
s5
c
Star decomposition
b
d
a
c
g2
b
ε
Two lists for s0 and s3
s0: abc
gid
frep
g2
1
g1
1
ε
s3: cab
gid feq
g2
1
Two-level index
Upper-level inverted index between graphs and stars
Lower-level inverted index between vertex labels and stars
2015/4/13
a
c a
b
d b
c a
c
e
d
b
a
c
a
f
…
b
A graph database
SOC@NUS
c
23
Thesis defense by Xiaoli Wang
Overview of the search framework
Two lists for vertex labels a and b
b
a
s0
b
a
s0
c
a
b
s2
d
a
b
s1
c
a
c
s4
d
a
c
s3
b
b
d
s5
c
c
ε
a
b
sid frep
sid feq
s1
1
s0
1
s2
1
s3
1
s3
1
s5
1
s4
1
Two-level index
Upper-level inverted index between graphs and stars
Lower-level inverted index between vertex labels and stars
2015/4/13
a
c a
b
d b
c a
c
e
d
b
a
c
a
f
…
b
A graph database
SOC@NUS
c
24
Thesis defense by Xiaoli Wang
Overview of the search framework
a query graph q
Similar graphs to the query
q: s2
q: s1
The graph similarity search
 construct graph score-sorted lists
using the top-k stars returned from
the lower-level
The top-k sub-unit query
 Construct a score-sorted list
 Run CA algorithm for graph
pruning
 Return top-k stars using TA method
Two-level index
Upper-level inverted index between graphs and stars
Lower-level inverted index between vertex labels and stars
2015/4/13
a
c a
b
d b
c a
c
e
d
b
a
c
a
f
…
b
A graph database
SOC@NUS
c
25
Thesis defense by Xiaoli Wang
Experimental results: query performance
Response Time(sec)
Candidate Size(K)
k-AT
C-Tree
SEGOS
30
10
25
20
1
15
10
0.1
5
0
0
0.01
2
4
6
8
10
12
14
16
18
20
Edit Distance Threshold(0 to 20)
10
30
25
1
0.1
20
15
10
0.01
5
0
0.001
5
10
15
20
25
30
35
40
Data Size(5K to 40K)
2015/4/13
SOC@NUS
26
Thesis defense by Xiaoli Wang
Outline
Motivation
Complex structures
Limitations on existing systems
Our solutions
Inverted index for graph similarity search
Inverted index for sequence similarity search
A 3-in-1 inverted indexing tool and its applications
Conclusion and future work
2015/4/13
SOC@NUS
27
Thesis defense by Xiaoli Wang
Sequence similarity search problem
KNN Query in a sequence database :
Given a sequence database, we want to find its k most similar sequences
based on sequence edit distance
A query sequence
Similar sequences
A sequence database
Sequence edit distance:
λ(s1,s2) = minimum # of operations (insertion, deletion, substitution) to
transform sequence s1 into sequence s2
2015/4/13
SOC@NUS
28
Thesis defense by Xiaoli Wang
Existing works
A naive solution: No pruning and poor scalability
100,000
candidates
100,000
full scan
SED computation
answers
50
Trie-based methods: Too complex time and space cost
Poor scalability
on long strings
SED computation
candidates
and large databases answers
70% scan
index
100,000
50
Gram-based methods: Too many false positives
SED bound computation
100,000
2015/4/13
index
SED computation
candidates
Poor scalability
90,000
answers
on large
50
databases
Loose GED bound
SOC@NUS
29
Thesis defense by Xiaoli Wang
Our solution
Extend the proposed indexing mechanism on graph
search to support efficient string search
Use tighter SED bounds to reduce candidate size
SED bound computation
candidates
100,000
index
SED computation
100
answers
50
Inverted index based on the n-gram decomposition
The sequence is decomposed into smaller n-grams
Similar n-grams are used to tightly bound SED
Efficient CA-based filtering strategy is proposed
2015/4/13
SOC@NUS
30
Thesis defense by Xiaoli Wang
The n-gram based decomposition
i ntroduct ion
Sliding Window
5-grams
Intuition: Similar strings must share enough number of common n-grams
Intuition: Similar strings must share enough number of approximate n-grams
Approximate n-gram: n-gram with certain edit distance
2015/4/13
SOC@NUS
31
Thesis defense by Xiaoli Wang
Count filtering with approximate n-grams
Lemma
Consider two strings s1 and s2. If s1 and s2 are within an
edit distance of τ, then s1 and s2 must share at least
ɸ(s1, s2) = max{|s1|, |s2|} - n + 1 - η(τ, t, n) n-grams with
gram edit distance ≤ t.
η(τ, t, n) = max{1, n-2×t}+(n-t) ×(τ-1)
η(τ, t, n) is the maximum # of n-grams affected by τ edit operations to have gram
edit distance >t.
For example, η(τ, 0, n) = τ×n. This means that τ edit operations will affect at most
τ×n n-grams to have gram edit distance >0.
2015/4/13
SOC@NUS
32
Thesis defense by Xiaoli Wang
Mapping distance: distance between
two n-gram representations
s=“Emm Wo”
q=“EmmaW”
ng0 ng1 ng3
ng0 0
2
3
ng2 2
1
2
ng4 3
2
1
ng5 3
3
3
Emm
0
ng0
1
mm_
ng2
m_W
ng4
_Wo
ng5
Emm
ng0
mma
ng1
maW
ng3
ε
3
3
1
3
3
3
G(s)
ε
G(q)
The mapping distance: μ(s, q) = 0 + 1 + 1 + 3 = 5
2015/4/13
SOC@NUS
33
Thesis defense by Xiaoli Wang
The lower bound on mapping distance
Given two strings s1 and s2. The gram mapping distance μ
(s1, s2) between s1 and s2 satisfies
μ(s1, s2) ≤ (3n-2)×λ(s1, s2)
Given an edit distance threshold value of τ. If we use the
summation of gram edit distances as the aggregation
function in the CA method, then the threshold value can be
computed as τ×(3n-2)
2015/4/13
SOC@NUS
34
Thesis defense by Xiaoli Wang
CA-based filtering strategy
Algorithm halts:
The aggregation value:
2+2+2+2+2+1+1+1+1 = 14
The maximum value in the heap is: max { λs| s∈ top-1} = 1
The CA halts:
14 > 1×(3n-2) = 13
For any unseen sequence, its mapping distance is larger than the
aggregation value. This means its edit distance is larger than the
maximum edit distance value in the top-k heap. It can be safely
filtered out.
Top-1 heap
GED=0
L0
20
30
L1
20
30
L2
20
30
L3
20
30
L4
20
30
L5
20
30
GED=1
GED=2
2015/4/13
20
30
40
20
30
20
30
20
30
20
30
10
20
30
40
L6
20
L7
20
10
30
10
20
30
20
30
40
L8
10
20
20
ID
10
20
30
40
SOC@NUS
Intermediate results
Freq.: 0
Freq.: 1
0
-8
8
6
7
0
--
μ(L)
-1
3
--
λ
-1
3
--
35
Thesis defense by Xiaoli Wang
Overview of the search framework
Two-level index
Upper-level inverted index between sequences and n1-grams
Lower-level inverted index between n1-grams and n2-grams
A sequence database
2015/4/13
SOC@NUS
36
Thesis defense by Xiaoli Wang
Overview of the search framework
The upper-level inverted index
ng0
ng1
ng2
ng3
ng4
ng5
ng6
ng7
ng8
ng9
trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti
10
10
20
10
10
10
10
10
20
10
20
20
20
20
20
20
5-grams
ID Sequence
10 Introduction
20 introductive
Two-level index
Upper-level inverted index between sequences and n1-grams
Lower-level inverted index between n1-grams and n2-grams
A sequence database
2015/4/13
SOC@NUS
37
Thesis defense by Xiaoli Wang
Overview of the search framework
The upper-level inverted index
ng0
ng1
ng2
ng3
ng4
ng5
ng6
ng7
ng8
ng9
trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti
10
10
20
10
10
10
10
10
20
10
20
20
20
20
20
20
2-grams
The lower-level inverted index
nt ct du in io iv od on ro uc ti ve tr
ng3 ng1 ng0 ng7 ng1 ng2 ng0 ng6 ng0 ng1 ng1 ng8 ng0
5-grams
ng7 ng2 ng4
ng6 ng8 ng3
ng3 ng2 ng2
ng3
ID Sequence
ng4 ng5
ng4
ng5 ng4 ng6
ng7
10 Introduction
ng6 ng9
ng5
ng7 ng5 ng8
20 introductive
ng8
ng9 ng9
ng9
Two-level index
Upper-level inverted index between sequences and n1-grams
Lower-level inverted index between n1-grams and n2-grams
A sequence database
2015/4/13
SOC@NUS
38
Thesis defense by Xiaoli Wang
Overview of the search framework
a query sequence q
Similar sequences to the query
q: ng2
q: ng1
The sequence KNN search
 Construct score-sorted lists using
the n1-grams from the lower-level
The n1-gram similarity search
 Update value of t
 Return n1-grams with distance ≤ t
 Accumulate the count of similar
n1-grams, and run CA algorithm to
check the halting condition
Two-level index
Upper-level inverted index between sequences and n1-grams
Lower-level inverted index between n1-grams and n2-grams
A sequence database
2015/4/13
SOC@NUS
39
Xiaoli WANG: Technical Meeting
Thesis defense by Xiaoli Wang
Experimental results
Dataset: 50K, average length=350, max length=5000
100 query sequences
Quality of count filtering
2015/4/13
KNN query performance
SESAME@IDMI
40
Thesis defense by Xiaoli Wang
Outline
Motivation
Complex structures
Limitations on existing systems
Our solutions
Inverted index for graph similarity search
Inverted index for sequence similarity search
A 3-in-1 inverted indexing tool and its applications
Conclusion and future work
2015/4/13
SOC@NUS
41
Thesis defense by Xiaoli Wang
The 3-in-1 storage system
Our proposed inverted index can support efficient
graph and sequence similarity search
Challenges on inverted
index
 Various complex structures =>
unified key and value data
structure
 Various query processing
algorithms => general list
processing method
2015/4/13
SOC@NUS
42
Thesis defense by Xiaoli Wang
Challenge 1: unified index structures
Key: smaller substructures
graph (star), sequence (n-gram), tree (binary branch[1])
A key is a string
Posting list: a list of values
value (the complex structure linked to the key)
A value is an integer of document number
Unified inverted index
(Terms with posting list of document ids)
stars
n-grams
binary branches
Graph data
Sequence data
Tree data
1. Yang, R. etc. Similarity Evaluation on Tree-structured Data. In SIGMOD 2005.
2015/4/13
SOC@NUS
43
Thesis defense by Xiaoli Wang
Challenge 1: unified index structures
Key: smaller substructures
graph (star), sequence (n-gram), tree (binary branch[1])
A key is a string
Posting list: a list of values
value (the complex structure linked to the key)
A value is an integer of document number
ID Sequence
10 Introduction
20 introductive
The upper-level inverted index
5-grams trodu uctio uctiv ntrod oduct roduc ction intro ctive ducti
10
20
10
20
10
20
10
20
10
20
10
10
20
20
10
20
1. Yang, R. etc. Similarity Evaluation on Tree-structured Data. In SIGMOD 2005.
2015/4/13
SOC@NUS
44
Thesis defense by Xiaoli Wang
Challenge 2: common list processing
List scanning: CA-based method
Mapping distance as the aggregation value
Count filtering as the pruned technique
Implementation on top of Lucene[1]
Application layer
Graph similarity
search
Sequence
similarity search
Tree similarity
search
List processing
Lucene index
Index layer
1. http://lucene.apache.org/.
2015/4/13
SOC@NUS
45
Thesis defense by Xiaoli Wang
Real application: social reading systems
Book duplicate detection problem
Graph matching problem
Book annotation migration problem
Sequence matching problem
Annotation search by snapping problem
Sequence similarity search problem
Build an information management tool to support the
storage and retrieval of sequences and graphs
2015/4/13
SOC@NUS
46
Thesis defense by Xiaoli Wang
Real application: social reading systems
An information management tool for Readpeer.com[1]
1. http://readpeer.com/.
2015/4/13
SOC@NUS
47
Thesis defense by Xiaoli Wang
Real application: social reading systems
An information management tool for Readpeer.com[1]
Efficient annotation retrieval: about 650ms
Book duplicate detection: about 1 sec
1. http://readpeer.com/.
2015/4/13
SOC@NUS
48
Thesis defense by Xiaoli Wang
Outline
Motivation
Complex structures
Limitations on existing systems
Our solutions
Inverted index for graph similarity search
Inverted index for sequence similarity search
A 3-in-1 inverted indexing tool and its applications
Conclusions and future work
2015/4/13
SOC@NUS
49
Thesis defense by Xiaoli Wang
Conclusions
Build a unified 3-in-1 index for various complex
structures
Propose an inverted index to support efficient graph
similarity search
Further extend the inverted index to support efficient
sequence similarity search
Implement a unified inverted index to support various
similarity search problems on complex structures
The unified inverted index helps to build an effective
information management tool in our real social reading
system
2015/4/13
SOC@NUS
50
Thesis defense by Xiaoli Wang
Future works
Implement the unified index using distributed system
Other Challenges on
inverted index
 Index size is proportional to data
size => disk base storage
 List processing may be slow with
multiple I/Os => parallel
processing to distribute I/Os into
different machines
2015/4/13
SOC@NUS
51
Thesis defense by Xiaoli Wang
Future works
Implement the unified index using distributed system
2015/4/13
SOC@NUS
52
Thesis defense by Xiaoli Wang
Future works
Extend the unified inverted index to support more types
of queries on other data models
Nested structured data
High dimensional data (e.g., image or video query
processing)
2015/4/13
SOC@NUS
53
Thesis defense by Xiaoli Wang
Thank You!
Q&A
2015/4/13
SOC@NUS
54