Mining Frequent Query Patterns from XML Queries

Download Report

Transcript Mining Frequent Query Patterns from XML Queries

Mining Frequent Query
Patterns from XML
Queries
L.H. Yang, M.L. Lee, W. Hsu, and S.
Acharya.
Proc. of 8th Int. Conf. on Database
System for Advanced
Applications(DASFAA’03)
2004/12/31
報告人:邱紹禎
1
Motive


As XML prevail, the efficient retrieval of
XML Data become important
many researches focus on
1.
2.
3.
index XML documents
process regular path expression
discover frequent query pattern
2
Query Pattern Tree
Q1 {resultPattern = {/book/title, /book/price},
predicates = {/book/author/data() = ”Buneman”},
documents = {book.xml}}


Wildcards “*” indicate the ANY label in DTD
Relative path “//”indicate zero or more labels
3
Query Pattern Tree
Def. Query Pattern tree




A rooted tree QPT<V, E>
V is the vertex set
E is the edge set
Each vertex v has a label with its value in {“*”, “//”, tagSet}
Def. Rooted Subtree



A rooted subtree RST <V’, E’>
Root(RST)= Root(QPT)
V’ V, E’ E
4
Frequent Query Pattern Trees




D : a database of query pattern trees {QPT1,……,QPTN}
Freq(RST) : the total occurrence of a rooted subtree RST in D
Supp(RST) = freq(RST) / |D|
The problem is to find all the frequent RSTs in D with some
minimum support
Supp(RST) = 2/3
5
Tree Pattern Matching


book/section/figure/title  book // title
book/section/figure/image  book/section/*/image
So node with label x ≦ * ≦ //
Def. Query Pattern Tree Matching
we say that RST is contained in a QPT if the following hold:
1.
The root nodes in RST and QPT have the same label
2.
If a node w  RST is matched with node v  QPT, then it
satisfies
(a)w.label ≦v.label
(b)each subtree of w is contained in some subtree of QPT
6
Discovering Frequent Rooted
Subtrees
find all frequent 1-edge RSTs by
scaning Database once
RST-Gen generate the candidate
set Ck+1 by using the previously
found frequent set Fk and pruning
those unqualified candidates.
Contains determines if
RSTk+1 is contained in
the pattern tree t.
7
Generation of Candidate RSTs



use schema-guided enumeration method to
generate candidate RST without repetition
contruct a G-QPT by merging the query pattern tree
in the database
use Apriori property to prune the candidates RST
8
Generation of Candidate RSTs
9
Containment of RST in a
Pattern Tree
count the RSTs’ support in the database
compare recursive from the root to the leaf node


Algorithm Contains
Case1 : w is a leaf node
v.label is not ”//”
a)
1)
2)
b)
w.label = “//” or “*”, return the result of comparison w.label≦v.label
If w.label apprears in the set of labels of node v’s ancestors, return
TRUE
v.label is “//”, we must find if any of v’s child node n satisfies
w.label≦n.label
10
Containment of RST in a
Pattern Tree
Case2 : w is not a leaf node, and v is a leaf node
w is impossible to be contained in v
Case3 : Both w and v are not leaf nodes
1.
2.
3.
if w.label≦v.label doesn’t hold, return false
compute whether all of the subtrees of w is contained in those
of v
If v is “//”
a)
b)
Check whether w is contained in one of v’s children
Check whether the subtree of w is contained in v
11
Containment of RST in a
Pattern Tree-Example
book
book
section
section
section
w
section
w’
//
v
section
v’
12
Optimizations for XQPMiner

Encoding Query Pattern Trees
Replaced by “1,2,-1,3,-1,8,-1”


Indexing Frequent RSTs
Using Transaction IDs
Divide the enumeration of RSTk into two sets
1. Gleaf : generated by expanding the right most leaf
node
2. Ginternal : generated by expanding the nodes
along the right most branch except the leaf node
13
Optimizations for XQPMinerExample
RSTk+1.TIDList = RSTk.TIDList ∩ RST1k.TIDList
the RSTs in Ginternal need not be matched in D
14
Performance Study



P4 2.4GHz with 1GB RAM, running Windows XP
Each dataset consist of 200000 QPTs
Zipfian distribution
Datasets
G-QPT
DBLP
Shakespears
Play
Num. of nodes
98
67
Max depth
8
6
Num. of //
13
0
Max fanout
12
9
Ave # of nodes
7.4
7.5
8
6
12
9
QPT in
Max depth
DB
Max fanout
15
16
17
Algorithm Contains
18
Algorithm Contains
19