Slides - Department of Computer and Information Science and

Download Report

Transcript Slides - Department of Computer and Information Science and

Algorithmics and Applications of
Tree and Graph Searching
Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno
Presenters:
Jerod Watson & Christan Grant


Introduction
Searching in Trees




Searching in Graphs




Approximate Containment Queries
Path-Only Searches
Extension to Trees
Keygraph Searching in Graph DBs
GraphGrep
Subgraph Matching
Conclusion
Introduction

Modern search engines



Keyword-based queries
Impressive speed
Several research efforts have attempted
to generalize keyword search to keytree
and keygraph searching
XQuery
AQUA Query





Query expressed as a tree pattern,
termed “query tree”
DB can be represented as single tree or
as set of trees
Each tree could be ordered or unordered
Queries often concerned with the
parent-child, ancestor-descendant”, or
path relationship among nodes
Queries can be expressed by
containment mapping.



Query tree may contain fixed length don’t
cares (FLDCs) ex. “?”
Query tree may contain variable length
don’t cares (VLDCs) ex. “*”
This class of queries referred to as
approximate containment (AC) queries
Path-Only Searches

Many AC queries are concerned with
paths only.


Ex. “Find the descendants of Mary who is a
child of John”
XISS is an indexing and querying system
designed to support regular path
expressions
Extension to Trees

Pathfix algorithm


Phase 1: Encodes each root-to-leaf path of
every data tree into a suffix array DB
Phase 2: Compares the query tree Q with
each data tree D in the DB allowing a
difference of DIFF

Handling Don’t Cares




Partition query into connected subtrees
having don’t cares
Match each of those don’t care free subtrees
with data trees in the DB
For the matched subtrees that belong to the
same data tree, determine whether they
combine to match the query based on the
matching semantics of the don’t cares.
Filtering

Implementation

ATreeGrep
Graphs
Graphs




Abstract data type of elements (nodes or
vertices) interconnected by edges.
A graph is a specialized tree in which
there is no constraint on the number of
paths is possible from a node
No root
Graph may contain cycles
Keygraph Searching



Searching for a particular graph or order
of elements inside of a large graph (i.e.
internet)
Searching for a particular graph or
structure among several graphs (i.e.
chemical elements)
Use indexing to reduce complexity
Keygraph Searching

Three basic steps
1.
2.
3.
Reduce the search space by filtering
Formulate query into simple structures
Match
Keygraph Searching (survey)




A* algorithm
GraphDB
Daylight
Lore
A*



Seminal work by Nilson (1980)
Route finding algorithm that keeps track of its
visited nodes and the distance it has traveled.
Applications:
Protein databases (discovery and search)
Image databases
Chinese character databases
CAD circuit data and software source code
A*

Pseudocode
function A*(start,goal)
var closed := the empty set
var q := make_queue(path(start))
while q is not empty
var p := remove_first(q)
var x := the last node of p
if x in closed
continue
if x = goal
return p
add x to closed
foreach y in successors(x)
enqueue(q, p, y)
return failure
GraphDB


Specifies a data model and query model.
1.
Queries are in the form of regular expressions
2.
Nodes are classes representing data objects
3.
Edges are classes to store paths in the database
4.
Path classes are and indexing data structures are used to
index database
Provides graph and search operations to:

Shortest path between two nodes

Subgraphs from a starting node and range
GraphDB
Daylight


"Provide the best known computer
algorithms for chemical information
processing to those who need them."
Uses finger printing to index/prune
ChemDB (Contains 6.5 million unique structures or subgraphs)
Lore



Database management system for XML
Modeled using rooted labeled subgraph
Indexed in four ways for fast regular
expression use

Vindex, Tindex, Lindex, Pindex(Data Guide)
Lore
1)
2)
3)
4)
Vindex: For each edge labeled l, all nodes are index with
incomming edges labeled l and some unique atomic value that
satisfy some condition.
Tindex: A text index for all nodes with l-labeled edges a with a
string of specific values containing specific words
Lindex: Link index to index nodes with outgoing l-labled
edges
Pindex (DataGuide): indexes all nodes reachable from root
through labled path.
The DataGuide is used by all queries from root.
Other queries traverse paths using indexs(1-3), pruning what is not
a match.
Tindex (1999)

A Data structure to index semistructured
database nodes that are reachable from
several regular path expressions

T-index may be more efficient than Pindex because it relaxes some constraints
Reportedly in graph of size 1500 T-index
is 13% of database

GraphGrep


Uses variable length paths (cyclic or
acyclic) to index DB. This provides for
efficient filtering.
Nodes have ids (numbers) and labels
(letters).
GraphGrep

Index Construction
1.
2.
3.
Choose an lp max indexing length
Create “path-representation”
Create fingerprint
GraphGrep

Filtering the Database
1.
2.
Query graph is parsed and a fingerprint
built
Fingerprint are compared
1.
2.
If a graph has at least one value in its fingerprint
that is less than the query fingerprint it is
discarded.
Remaining graphs may contain > 1 sub graphs
GraphGrep

Filtering the Database


Takes linear time to the size of the database
But discards 99% of database!!!
GraphGrep

Finding Subgraphs Matching with Queries

Query tree depth first traversal branches are
decomposed into sequences of overlapping
label-paths (patterns)
GraphGrep

Overlaps
1.
2.
3.
Last node in a patters coincides with first
node of next pattern (e.g. ABCB (lp = 3)
ABC CB)
If a node has branches, it is included in the
first pattern of every branch
The first node in a cycle is visited twice
GraphGrep

Matching Example
1.
2.
3.
Select the set of paths
Combine lists with constraints
Remove lists with equal id nodes in non
overlapping positions
GraphGrep

Techniques for Queries with Wildcards



Consider the parts of the query graph that is
between wild cards (like pathfix)
The cartesian product of the components
that match are valid.
An entry in the cartesian product is a valid
path (length = wildcards) between nodes.
GraphGrep





1 GHz pentium III
NCI databases (1,000 – 16,000 nodes)
Average 20 nodes in db (max 270 nodes)
Queries 13-189 nodes
Lp = 4 and 10
GraphGrep


Linear in size of DB
Different lp influence running time
Conclusions / Questions

Searching in Trees


Introduces ATreeGrep
Searching in Graphs
Introduces GraphGrep
Thanks to:




God
Class
Wikipedia
Various other Googled sources