Transcript Document

Stack-based Algorithms for
Pattern Matching on DAGs
Li Chen, Amarnath Gupta, M. Erdem Kurul
San Diego Supercomputer Center (SDSC),
University of California, San Diego
VLDB’05
Motivation

Graph model is important in databases and
knowledge representation



Bibliographic citations, hypertext, ontology
A lot of scientific data are beyond XML tree model
Many of them are directed acyclic graphs (DAGs)




Taxonomy of proteins, chemical compounds, organisms
Data provenance graphs
Sequence data and multiple sequence alignments
Searching for highly similar substructures


7/22/2015
Gives rise to numerous pattern matching problems
e.g., a novel (metabolic) pathway against a pathway
database
2
Example & Its Abstraction
graph-structured patent citation network
Labeled DAG
 node: patent/article
 label: patent/article
properties


Query patterns: path / twig / dag
Stanford
UPenn
UWisconsin
USA
computer
science
Japan
biomedical economy
England
France
year, contact_author,
affiliation, country, etc.
directed edge: uniformly
“cited-by”
Query Model
 node: matching a certain
property of data nodes
 edges: / for direct (resp. //
for indirect ) “cited-by”
Germany
7/22/2015
3
Problem Definition

Is this a (sub)graph isomorphism problem?
Definition: Two graphs are isomorphic if there is a one-to-one correspondence
between their vertices and there is an edge between two vertices of one graph if
and only if there is an edge between the two corresponding vertices in the other.

NP-hard
!
Is this a subgraph homeomorphism problem?
Definition: The homeomorphic image of a pattern graph H in a data graph G is:
the images of nodes in H are nodes of G, and the images of edges in H are
paths in G.
Neither!! … ours is easier (polynomial)
• acyclic graph model
• corresponding vertices (nodes) have the same label
7/22/2015
4
Problem Definition (cont.)

Pattern matching on DAGs
The input is a (virtual) single rooted DAG G
A total mapping from Q to G, preserving parent-child /
ancestor-descendant relationships
Branches represent “and ” semantics  structural join




Return node bindings in witness structures
m1
c4
c1 p1
c2
computer
science
(c)
e1
e2
m2
DAG-structured data
7/22/2015
c2
c4
a1
biomedical
b1
c1
(b)
economy
(e)
Twig pattern query
b1
e1 b1
e1
b1
e2
query solutions
5
Related Work

Exact v.s. Inexact graph matching [Shasha PODS02]

Exact: a total mapping from query nodes to data nodes
(usually requiring label matching)
Inexact: either a partial mapping, or an approximated
total mapping



Trade-off between space and time
Trade space for time




Path Index: materialize fixed or parameterized length of paths
Transitive closure computing: for queries involving ‘//’
Store adjacency list and compute on-the-fly
Q: Is there a method economic in both time and space?
7/22/2015
6
Outline of The Talk





Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach






Linear-space representation for DAG
Stack-based algorithms for path, twig, and dag queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work
7/22/2015
7
Inspiration from XML Pattern Matching:
Interval Encodings
interval encoding of a tree
year
2
5
title
6
book
1
9
y is a descendant of x, i.e., y x 
y.left > x.left and y.right < x.right
٨
pre-order
30
authors
10
Implication of overlapping intervals
chapter
19
20
29
x
2000
3
4
XML
author
author
head
section
7
11 14
15
21 24
25 28
8
18
y
y
x

Bill
Jake
12 13
16 17
History
22
23
head
26
27
x
y

Difficulties in directly applying interval encoding to DAG

Each tree node has at most one parent, while a graph node may have more

Multiple encoding may be a solution, but more comparisons are introduced,
so likely not space nor time economic
7/22/2015
8
Inspiration from XML Pattern Matching:
Stack-based Algorithms for Holistic Joins

Stack-based Algorithm [Bruno et al. SIGMOD02]




7/22/2015
Build a stream and a stack corresponding to each query node
Nodes in streams are pushed into stacks in their document order
Pop a node from its stack if its interval no longer overlaps the
newly pushed node
For a node pushed into a leaf stack, output all root-to-leaf paths
9
Challenges

Whether stack-based algorithms are extendable to
pattern matching on DAGs?

If possible, how? Is it economic in space and time?
7/22/2015
10
Outline of The Talk





Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach






Linear-space representation for DAG
Stack-based algorithms for path, twig, and dag queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work
7/22/2015
11
DAG Representation
G = (V, E,
d
٨
What do we do?


7/22/2015
٨

Not pre-compute and store
Neither store adjacency list for d
Instead, store interval encoding of a tree-cover, covering part of
And index on the remaining linkages minimally but losslessly
٨

٨

٨
O(|V|2 )
space
٨
O(|E|)
٨

node partial order d , i.e., e =<a,b> E  b d a
transitive closure , i.e., p =<x,y> P  y x
٨ ٨

d)


٨
Partial order v.s. transitive closure


12
Our DAG Representation
Decompose a DAG G into T and GR

T = (V, ET) is a tree-cover (spanning tree)
GR = (VR, ER) is the remaining graph, ER = E - ET
m1
c4
c1 p1
c2
b1
a1
e1
e2
m2
=
nid
encoding
m1
c1
b1
c2
e1
p1
a1
e2
m2
c4
[1,20]
[2,9]
[3,4]
[5,8]
[6,7]
[10,17]
[11,16]
[12,13]
[14,15]
[18,19]
Two ways of inducing
e2
m2
[a1]
[a1]
surrogate
Given a node w, its
 surplus preds = direct non-tree
preds
 surrogate preds = nearest preds
that have surplus preds
e.g., b1[3,4] is contained by c1[2,9], hence b1 c1
thru SSPI, w/o further node interval checking

7/22/2015
Surrogate
& Surplus surplus
b1
[c2,a1]
predecessor
index (SSPI)
a1
[c4]
thru checking of node intervals


+
PL
٨

e.g., a1  PL(b1), hence b1 p1, similarly, b1 c4, e2
٨

٨
G
nid
c4
٨

٨

13
Properties of Our DAG Representation
Lossless in inducing

Building costs (a tree-cover traversal of G)

Procedure
1. encode each node w along the traversal of T
2. if w has surplus preds ui in addition to its tree parent v,
add ui in PL(w); and if v is also in SPPI,
1.
2.

7/22/2015
٨

add v in PL(w), if v does not have surplus preds itself
inherit PL(v) in PL(w), otherwise
Linear time & space
1. in terms of |V| for interval encoding
2. in terms of |E| for populating SSPI
14
Extending Stack-based Holistic Join
Algorithms
Key ideas





7/22/2015
Keep the data structures of streams and stacks
Add a new structure – partial solution pools
Put a popped node in its pool, rather than discard it
Grow partial solutions for the new-found
Exploit temporal properties to avoid vain attempts
٨

15
Algorithm Extension
SweepPartialSolutions: checking & building solutions in pools

When?


Where?


Between v and the nodes in each of its children pools
What (condition)?


A node v is popped out of stack
Check if each child pool has a node w, s.t. w
Check if uPL(PL…(w)) s.t.
u.L  v.L and u.R  v.R
w
v
w
v
w
What (action)?

7/22/2015
v
How?
v

٨

Expand: grow partial solutions headed by w to be headed by v
16
PathStackD by Example
c4
c1 p1
c2
b1
e1
m
c
a1
e2
(a) Data G
m2
b
b1
c1
Sb
Sc
c2
c1
b1
m1
Sm
m2
m1
c4
m2
m1
c4
m1
(c) Stacks
7/22/2015
Pb
Pc
Pm
m1
c2
c1
(b) Query
m1 c1 b1
b1
b1
c1
c2
c2
b1
c1
c2
m1 c2 b1
m1 c4 b1
(d) Pools
m2
٨
m1
(e) Results
17
Algorithm Analysis
The total containment ( ) checks in pools are
٨


Not |Sm1| x |Sm2| x … x |Smn| times of SPPI look-ups


|Smi|: size of the ith stream,
n: size of the path Q
But much tightly restricted due to temporal properties

Not all stream nodes, but child pool nodes (to the left of v)
w
v
u
v
w
u
v
v
u
Not entire SSPI is searched for checking if w v
٨

v
u
v
Function checkContainment(v,w)
while (u:=next PL(w) and !found)
if (u.L  v.L and u.R  v.R) return true
else if (u.L  v.R) return false
else if (u has no preds) remove u from PL(w)
else {found = checkContainment(v,u);
if (!found) PL(w)=PL(w)+PL(u)-{u} }
7/22/2015
for subsequent v’s,
this u can be ignored
18
PathStackD
Theorem 1 Given a path query q and a DAG G, PathStackD
correctly returns all the query answers for q.
sound and complete
Theorem 2 Given a path query q and a DAG G, PathStackD
has the worst-case I/O and CPU time complexities of O(|q||Smi| 2+
|q||Smi|d + |E|), i.e., max(|E|, |q||Smi|(max(|Smi|, d))).
|Smi|: average stream size
|q|: query size
d:
diameter of G
7/22/2015
Optimal compared
to O(|V|2 |q|)
19
Additional Changes in TwigStackD

Key changes:
getMinSources (original)  getMissings (ours)

1. node with minimal left value
1. the same
2. has all the required descendant types 2. record which required types are missing
sweepPartialSolutions

3. check if missing types are complemented by pool nodes
m1
c1 p1
c2
b1
e1
c4
c
a1
e2 m2
(a) Data G
7/22/2015
m
b e
(b) Query
Sb
c2
c1
e1
Sc
b1
m1
Pb
Sm
Se
Pc
Pm
Pe
(c) Stacks
m1 c1 b1 e1
m1 c2 b1 e1
(d) Pools
(e) Results
20
A Prefiltering Step

Purpose


Improve efficiency by reducing the I/O factor |Smi|
Basic Idea

Impose structural constraints of the query pattern
for filtering nodes to be put in streams
e.g.,
each QBitVec captures required
downwards structural constraints
a
b
0011
QBit
c
0001
7/22/2015
each QBitVec captures required
upwards structural constraints
a
1111
d
0100
QBitVec
b
1010
1000
d
1100
c
1011
21
Two Passes for Prefiltering

Downwards Filtering By Example
Traverse data DAG and aggregate the satisfied descendant types
Match the satisfied with the required


Data nodes are processed in post-order
a1
1111
a
b
0011
1111
d
0100
c
0001
e1
b1
d1
0001
0011
0100
c1
0001
m1
0001
a2
1000
c2
0001
encoded query pattern
encoded data DAG
? satisfied constraints
required constraints 
when exiting each edge directing from n to prev, do
// myBitVec is the bitVector value for n
myBitVec = bitOR(myBitVec,prevBitVec,QBit)
// prev is query relevant if it matches a query label
if (prev is query relevant &&
prev does not satisfies structural constraint)
then myBitVec=bitAND(myBitVec,~prevQBit)
if (n is query relevant &&
bitAND(myBitVec,QBitVec) == QBitVec)
then n satisfies structural constraint
put n into the corresponding stream
post-order : guarantees that a node is encoded before all its ancestors
topological-order : guarantees that a node is encoded before all its descendants
7/22/2015
22
Summary of Our Approach
The key ideas
Our DAG representation losslessly covers all transitivity closure




*_stackD algorithms leverage tradeoffs between space and time




7/22/2015
Interval encoding on tree-cover T for covering
SSPI and tree-cover encoding together cover the complete
Worst-case space is O(|V|+|E|), compared to O(|V|^2) if pre-compute
and store all transitive closure
٨

٨

Adopt a new structure, i.e., partial solution pools, in addition to
streams and stacks
Modify/add procedures to handle stack-popped nodes in pools,
where remaining solutions can be found
Worst-case time is O(max(|E|, |Smi|^2)), compared to O(|V|^2) if no
path index is utilized
Prefiltering further optimizes performance by reducing |Smi|
23
Outline of The Talk





Motivation
Problem Definition
Related Work
The inspiration of our idea
Our Approach






Linear-space representation for DAG
Stack-based algorithms for path, twig, and dag queries
Complexity analysis
Optimization by prefiltering
Experimental Evaluations
Conclusions and Future Work
7/22/2015
24
Experimental Evaluations

System implementation


Java 1.4
Light-weight storage engine -- PSEPro from ObjectStore


Utilize its VMMA for memorydisk data structure mapping
Experimental setups

Tunable synthetic DAG data generator


Real-life data


7/22/2015
Parameters: diameter, fan-out, fan-in, distinct # of labels
Gene ontology data, tree data from XMark benchmark
augmented by random cross links
2.6Ghz Pentium IV PC, 1GB MM, 2GB VM
25
Experiment 1
(ms) 16
(ms) 60
60
Nav
Exec
Filter
12
(ms) 160
Nav
Exec
Filter
40
40
120
80
8
20
20
4
00
0
PQ
TQ
DQ
40
0
PQ
TQ
a
b
a
b
c
c
d
d
DQ
e
f
b
Nav
Exec
Filter
300
e
d
f
900
200
600
100
300
0
0
PQ
Nav
Exec
Filter
1200
TQ
DQ
PQ
n=200K, m=360K
n: |V|
m: |E|
7/22/2015
DQ
(ms) 1500
400
a
TQ
n=100K, m=180K
(ms) 500
TQ
PQ
DQ
n=50K, m=90K
n=25K, m=45K
PQ
Nav
Exec
Filter
TQ
DQ
n=400K, m=720K
Compare processing time (including prefiltering and query
execution) of *StackD against Nav[Kanza PODS03]
26
Experiment 2
(ms) 6
Nav
TwigStackD
4
2
TQ
0
5
n=5K,
10
20
30
40
50
60
b
70 (K)
m=5K, 10K, 20K, 30k, 40K, 50K, 60K, 70K
a
d
c
e
f
Evaluate the performances of both algorithms with the changing characteristics (density) of DAG
7/22/2015
27
Experiment 3
(ms)100
80
(ms)
200
Nav
PathStackD
150
Nav
TwigStackD
60
100
40
50
20
0
0
PQ1
PQ2
PQ3
a
a
a
a
a
a
b
b
b
b
b
b
c
c
c
c
c
d
PQ4
PQ5
TQ1
PQ6
d
d
d
e
e
e
f
f
TQ2
a
a
b
TQ3
c
b
d
a
c
e
b
c
f d
f
e
g
h
i
g
Evaluate the performances of both algorithms with the changing characteristics (size) of query
7/22/2015
28
Experiment 4
(ms) 12
10.4
PSD/Filtering
PSD/NoFiltering
8
4.2
4
1.32
0.17 0.09
0.5
1.4
0.41
0
1K
5K
25K
50K
Evaluate the performance of PathStack-D with or without the aid of the prefiltering step
7/22/2015
29
Experiment 5
BuildSSPI TSDFilter TSDExec #Scan
#Result NavAlgo
25K
1.02
1.26
1.58
0.58M
2.4K
14.5
50K
1.94
1.86
6.27
2.34M
5.6K
51.8
4.38
3.84
28.35
10.9M
14K
148.7
8.31
8.25
94.76
37.7M
24K
481.7
19.06
21.11
375.90
121M
44K
1276
100K
200K
400K
TQ
a
b
d
c
e
f
100MB XML document (~ 1.4M nodes and ~ 1.6M edges)
PQ=//site//person//age
TQ=//site(//item//description, //category//name, //person//age)
*StackD
7/22/2015
#Scan #Result NavAlgo
PQ
22.39
3.59M
12.8K
53.4
TQ
53.51
10.4M
27.8K
72.2
30
Conclusions and Future work


Conclusions

Gracefully generalized the stack-based algorithms for pattern
matching on DAGs

The extended algorithms are sound and complete

The proposed approach is optimal among those that do not rely
on precomputed transitive closure
Future Work

Further improvement by incorporating statistics on a graph
structure and/or advanced indexing schemes

Allow for more general graph operations which gives rise to more
challenging query optimizations
7/22/2015
31
Questions?
7/22/2015
32