Routing of XML and XPath Queries in Data Dissemination Networks

Download Report

Transcript Routing of XML and XPath Queries in Data Dissemination Networks

Routing of XML and XPath Queries
in Data Dissemination
Networks
Guoli Li, Shuang Hou and H.-A. Jacobsen
University of Toronto, Canada
Published at: The 28th International Conference on
Distributed Computing Systems
(ICDCS 2008)
Outline
•
•
•
•
•
•
Research Problem
Background (Key Concepts)
Related Work
Proposed Solution
Experimental Results
Conclusions and Limitations
Research Problem
• The XML/XPath routing problem in an overlay
network of distributed XML content-based
routers.
• More specifically, the paper focuses on the problem
of efficiently routing an XML document in a
Publisher/Subscriber System (an XML document
emitted from a data producer at one point in the
network to a set of data consumers located
anywhere throughout the network).
Background
• Some key concepts that we need to know,
before we can fully understand this paper:
– XML-based Data Dissemination Network
– Publisher/Subscriber System
– Content-based Routing
Key Concepts
• XML-based Data Dissemination Network
– XML has evolved as the standard for data
representation and exchange. XPath is a popular
query language for XML data.
– In an XML-based dissemination network, the data is
marked up in XML, and is routed based on Filter
Expressions stored at intermediate nodes that
indicate where the XML document is to be routed to.
– Filter expressions, often expressed as XPath
expressions (XPEs), are submitted by data consumers
who express interest in receiving certain kinds of
documents.
Key Concepts
• Publisher/Subscriber System
– A publisher/subscriber system is comprised of
information producers who publish information, and
information consumers who subscribe to information
[12].
– The publisher/subscriber system matches or routes
relevant information to interested consumers.
– Publishers and subscribers are clients to the
publisher/subscriber system, are loosely coupled in
space and time, and have no knowledge of each other.
Key Concepts
• Content-based Routing
– Messages in content-based
publisher/subscriber systems are routed based
on their contents rather than the IP address of
their destinations.
Content-based Routing
Related Work
1) Centralized filtering approaches for XML data [2, 9, 6,
3].
2) Content-based routing approaches for non-XML-based
data (in distributed publisher/subscriber architectures)
[4, 24, 8].
3) Distributed query-based retrieval approaches for XML
data [5, 18, 19].
4) ONYX [10] is a query-based XML retrieval approach for
XML data dissemination in distributed environments. Its
approach is closely related to Li’s approach. But the
objectives of the two approaches are quite different.
Related Work
5) Galanis et al. [15] explore XML data dissemination
based on a DHT. They use data summaries to ensure
that queries are only sent to relevant servers.
6) XTreeNet [16] proposes a dissemination protocol to
avoid repeatedly matching XML data against queries at
intermediate routers.
7) Theoretical properties of XPE containment are
discussed in [11, 22]. They propose their own
algorithms to detect the covering relations and give the
computational complexity analysis for these algorithms.
Challenges in this Paper
• Many optimization techniques, such as
advertisements [4] and covering and merging
techniques [8, 4, 24], have proven to gain
significant benefits for non-XML based
publisher/subscriber systems.
• How to apply these concepts to XML
(considering the structural complexity of XML
data) is the challenge addressed by this paper.
Proposed Solution
• In Li et al.’s paper, their proposed solution
mainly consists of three parts:
– Advertisement-based Routing
– Covering
– Merging
Advertisement-based Routing
• Basic idea:
Advertisement-based Routing
• Step 1: Use the XML Document Type Definition (DTD)
to generate advertisements about the information a
data producer is going to publish
– Each XML document is decomposed into a set of XML paths and each
path is represented as e = /t1/t2/.../tn, where ti is the XML element
name.
– Then, an advertisement is described as: a = /t1/t2/…/tn-1/tn, where ti
can be either an element name or a wildcard, and a has the same
length with the publication it advertises.
– In their approach, advertisements are derived from the DTD, since the
DTD allows deriving all possible paths from the root to the leaves
appearing in related XML documents.
Advertisement-based Routing
• An advertisement may have multiple recursive parts
that appear in sequence or are embedded in each
other. They classify the recursive advertisements to
three categories as below:
– Simple-recursive advertisements:
• described as a = a1(a2)+a3
– Series-recursive advertisements:
• described as a = a1(a2)+a3(a4)+a5
– Embedded-recursive advertisements:
• described as a = a1(a2(a3)+a4)+a5
Advertisement-based Routing
• Step 2: Use the matching algorithms (for both
non-recursive and recursive advertisements)
to determine if an advertisement a matches a
subscription s (which means the publication
sets P(a) and P(s) overlap)
– Algorithms for Non-recursive advertisement
– Algorithms for Recursive advertisement
Advertisement-based Routing
• Algorithms for Non-recursive advertisement
1) Absolute simple XPEs:
• If the given XPE is longer than the advertisement the
algorithm does not have to be applied
• the algorithm compares each pair of elements or
wildcards in advertisement and subscription, according
to the matching rules shown in the following table
Advertisement-based Routing
• Algorithms for Non-recursive advertisement
2) Relative simple XPEs:
• These expressions are similar to absolute simple
XPEs except for the first operator, that is, the XPE is
relative. The matching could start at any position of
the advertisement because the subscription is
relative. A naive matching algorithm in this case is
repeatedly calling AbsExprAndAdv.
Advertisement-based Routing
• Algorithms for Non-recursive advertisement
3) Descendant operators in XPEs:
• Descendant operators indicate that more than one
element should appear in the matching
advertisement.
• The matching algorithm for XPE with descendant
operators and advertisement (DesExprAndAdv) is
based on the above XPE matching algorithms (i.e.,
AbsExprAndAdv and RelExprAndAdv).
Advertisement-based Routing
• Algorithms for Recursive advertisement
Covering
1. In this part, the paper first proposes a novel data
structure called subscription tree to maintain
XPEs by identifying the covering relations among
them.
2. Then they present covering algorithms for
absolute simple XPEs, relative simple XPEs and
XPEs with descendant operator separately, to
reduce the routing table size stored at each
router and speed up routing computation in the
routers.
Covering
• Subscription Tree
– The idea is to store the subscriptions according to
the covering relationship among them. A
subscription at a node in the tree covers all
subscriptions in its subtree. A subscription node
can have only one parent in the tree, but it may be
covered by several subscriptions. Each node is
allowed to have a set of super pointers, which
indicate the covering relations with nodes outside
its subtree.
Covering
• Covering Algorithm
– The covering rules used in the algorithms are
straightforward. We say Sub1 containing an
element ti covers Sub2 containing an element mi
at the corresponding position, if ti is a wildcard no
matter what mi is, or ti = mi, where none of ti and
mi is wildcard.
– The covering algorithm is also discussed in the
following three subscription cases:
• Absolute simple XPEs, Relative simple XPEs and
Descendant operators in XPEs
Merging
• Subscriptions can be merged into a new
subscription to create a more concise routing
table. In Li et al.’s paper, the basic merging
rules for XPEs are as follows:
– The subscriptions can be merged if they have only
one difference.
– Different operators are merged to //-operator.
– Different elements are merged to * wildcard.
Experimental Results
• They performed some experiments to evaluate
the performance of their routing and covering
algorithms.
• They use the XPath generator released by Diao et
al. [9] to generate XPath queries, and set the
maximum length of an XPE to 10.
• Use the IBM XML Generator [1] to create the XML
document workload, and set the maximum
number of levels of the resulting XML documents
to 10 (consistent with the maximum length of
XPEs).
Experimental Results
• The performance metrics include:
– routing table size
– XPE processing time
– publication routing time in a single broker
(content-based router)
– the network traffic
– notification delay in broker topologies
Experimental Results
• The experimental results show that:
– the routing table size is reduced by up to 90%.
– the routing time at each broker is improved by up to
85% in the most favorable cases.
– the overall network traffic has been reduced by about
35%.
• The performance of the system has been greatly
improved by applying advertisement-based
routing, covering, and merging techniques.
Conclusions and Limitations
• The originality of this work is not that good. As
we’ve seen, almost all the techniques used in this
paper have been proposed and used in some
previous work.
• They applied some existing techniques to solve a
new type of problem which nobody else has done
before.
• The performance improvement of their solution
in this paper is really impressive, which indicates
their improved algorithms are quite effective. So I
still think this is a good paper.
Conclusions and Limitations
• The strength of the proposed approach highly
depends on the possibilities of overlaps
among different XPEs (XPath Expressions).
Otherwise, the covering algorithms could not
be very effective. Therefore, I am afraid this
approach won’t perform very well in the cases
that there are not many overlaps among the
XPEs from different users.
Conclusions and Limitations
• The paper mentioned quite a few existing
research work, but didn’t make comparisons with
the previous work in their experiments part.
It would be more convincing, if they could
provide some experiments compared with the
previous approaches, since some of the previous
work use similar techniques as the ones used in
this paper.
Conclusions and Limitations
• In the experiments part, the maximum length of
an XPE is set to 10 (which is the same as the
maximum number of levels of the resulting XML
documents).
• I am wondering whether the performance of
their solution will be downgraded a lot, when the
maximum length of an XPE is increased to 50 or
100. In other words, how is the scalability of the
approach, in terms of the maximum length of an
XPE?
Thank you!