Transcript Document

Online Mining of Frequent
Query Trees over XML Data
Streams
Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee
Department of Computer Science
National Chiao-Tung University
Hsinchu, Taiwan 300, R.O.C.
http://www.csie.nctu.edu.tw/~hfli/
*: corresponding author
[email protected]
1
Outline

Introduction


Problem Definition


Online Mining of Frequent Query Trees over
XML Data Streams
The Proposed Algorithm


Mining of Data Streams, Tree Mining
FQT-Stream (Frequent Query Trees of
Streams)
Conclusions and Future Work
[email protected]
2
Mining of Data Streams:
Motivations

Many Applications generate data streams







Application characteristics




Day to day business (credit card, ATM transactions, etc)
Hot Web services (XML data, record and click streams)
Telecommunication (call records)
Financial market (stock exchange)
Surveillance (sensor network, audio/video)
System management (network events)
Massive volumes of data (several terabytes)
Records arrive at a rapid rate
Data distribution changes on the fly
What do we want to get from data streams ?

Real time query answering, Statistics, and Pattern
discovery
[email protected]
3
Mining of Data Streams:
Computation Model

Requirements of Mining Data Streams



Single pass: each record is examined at most once
Bounded storage: Limited Memory for storing synopsis
Real-time: Per record processing time (to maintain
synopsis) must be low
Synopsis
in Memory
Buffer
Stream
Mining
Processor
(Approximate)
Results
Data Streams
[email protected]
4
Problem Definition of Frequent
Query Tree Mining (1/2)

XML Query Tree Stream (XQTS)


A sequence of query trees (QTs)
QT1, QT2, …, QTN


N is tree id the latest incoming query tree
Support of a Query Tree QTi

sup(QTi): the number of QTs in XQTS
containing QTi as a subtree
[email protected]
5
Problem Definition of Frequent
Query Tree Mining (2/2)

A QTi is a Frequent Query Tree (FQT)

if and only if sup(QTi)  sN


s is a user-defined minimum support
threshold in the range of [0, 1]
Our Task

To mine the set of all frequent query
trees (FQTs) by one scan of the XQTS

Using as smaller memory as possible
[email protected]
6
Proposed Algorithm FQT-Stream
(Frequent Query Trees of Streams)

FQT-Stream consists of 5 phases






1. read a QT (Query Tree) from the buffer in the main
memory
2. transform the QT into a new NQTS (Normalized
Query Tree Sequence) representation
3. construct a in-memory summary data structure called
FQT-forest (a forest of Frequent Query Trees) by
projecting the NQTSs
4. prune the infrequent query trees from FQT-forest
5. find the set of all FQTs (Frequent Query Trees) from
current FQT-forest
Since phase 1 is straightforward,

We focus on phases 2-5
[email protected]
7
Phase 2 of FQT-Stream: NQTS
Transformation

NQTS Transformation of QT

Using DFS on the QT

A sequence of triple (node-id, level, order)



level: the level of the QT
order: sequence order of the NQTS
For example (5-NQTS in Figure 1)
[email protected]
8
Phase 3 of FQT-Stream: FQTforest Construction (1/4)

For each NQTS, 2 steps are
performed to construct the FQTforest

Step 1: enumerate each NQTS into a
set of sub-sequences using Order-Break
(OB) technique

OB is a level-wise method
[email protected]
9
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (2/4)

For example, a 5-NQTS = <(A, 0, 1), (B, 1,
2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>

First, the 5-NQTS is broken into three 4NQTSs




<(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>
These sequences are 1-OB (One Order Break)


1-OB sequences have one order break in the sequence
order
The original 5-NQTS is called 0-OB
[email protected]
10
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (3/4)

After delete the duplicates


Three 4-NQTSs  Two 3-NQTSs with
One Order Break
Two 3-NQTSs  One 2-NQTS


<(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1,
2), (C, 1, 5)><(A, 0, 1), (C, 1, 5)>
Finally, the set of 1-OB contains 8
NQTSs
[email protected]
11
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (4/4)

Set of 2-OB is generated from the set of
1-OB

For example



2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from
1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
Repeat this process until no candidate kOB
Property 1

The maximum size of order break is k-3, i.e., (k3)-OB, if the query tree has k nodes
[email protected]
12
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (1/3)

The OBs (0-OB, 1-OB, 2-OB) are
projected and inserted into a FQTforest using Incremental Projection
(IP) technique

A NQTS, <X1X2…Xi>, with i nodes is
projected into i sub-NQTSs (also called
node-suffix NQTSs)

<Xi>, <XiXi-1>, …, <X2>, <X1>

We use one field node-id to represent the fields
(node-id, level, order) for simplicity
[email protected]
13
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (2/3)

Example of IP

1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected
into 4 node-suffix NQTSs as follows





<(C, 1, 5)>
<(E, 2, 4), (C, 1, 5)>
<(D, 2, 3), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
After projection, a tree structure checking is
preformed

If the level of the first node in a node-suffix NQTS is
not the smallest level

the node-suffix NQTS is deleted
[email protected]
14
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (3/3)

After tree structure checking

The node-suffix NQTSs are inserted into FQT-forest


Update the corresponding nodes’ supports
FQT-forest consists of 2 parts

FN-list



A list of Frequent Nodes
Each node Xi in FN-list has a NQTS-tree (Xi.NQTS-tree)
NQTS-trees (trees of Normalized Query Tree
Sequences)


A sequence (NQTS) is represented by a path
And its appearance frequent is maintained in the last of
node of the path
[email protected]
15
Phase 4 of FQT-Stream:
Infrequent Information Pruning

In order to guarantee the limited
space requirement


Pruning Infrequent Information
Pruning steps

Check each node Xi in the FN-list of
FQT-forest


If its sup(Xi) < sN  delete Xi and its
NQTS-tree
Check other NQTS-trees to prune these
infrequent nodes
[email protected]
16
Phase 4 of FQT-Stream:
Frequent Query Tree Mining

Assume that there are k frequent nodes,
<X1, X2, …, Xk>, in the FN-list


FQT-Stream traverses the Xi.NQTS-tree (i, i
= 1, 2, …, k) to find the sequences with prefix
Xi whose estimated support is greater than or
equal to sN in a DFS manner
These frequent query trees are stored into
a temporal list, called FQT-List
[email protected]
17
Conclusions and Future Work

We propose an efficient one-pass
algorithm FQT-Stream (Frequent
Query Trees of Streams)


To find the set of all frequent query
trees over the entire history of online
XML data streams
Future Work

Online Mining of Frequent Query Trees
over Sliding Windows
[email protected]
18