Modeling and Querying Web Data
A Survey
By Li Lu
Overview
Introduction
Data Representation for Querying the Web
Modeling and Querying the Web
Summary and Future
Introduction
Background
• The most common techniques for finding
information on the Web send
information retrieval requests to index servers.
• Use web query techniques to locate, filter and
present web information.
Challenges
• Difficult to build a common model for the Web.
• Hard to extract information from web data.
Data Representation for Querying the Web
Graph Data Models
• Based on a labeled graph in which the nodes
represent web pages, the edges represent links between
web pages, and the labels on the edges can be
attribute names.
• Capable of expressing navigational queries over the
graph structure.
Semistructured Data Models
• Based on labeled directed graphs. There is no
restriction on the number of edges that can go out
from a given node, or on the type of attribute value.
• Able to query the schema or the labels on the
edges of the graph.
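As a rough illustration of the two bullets above, the following Python sketch stores semistructured data as a labeled directed graph and then queries the edge labels themselves rather than the values. The node names, the labels and the helper `labels_reachable` are hypothetical and chosen only for this example.

```python
# Hypothetical labeled directed graph: node -> list of (label, target).
# A node may have any number of outgoing edges, and a target may be
# another node or an atomic value of any type.
graph = {
    "paper1": [
        ("title", "Querying the Web"),
        ("author", "a1"),
        ("author", "a2"),
        ("year", 1998),
    ],
    "a1": [("name", "Smith")],
    "a2": [("name", "Jones"), ("email", "jones@example.org")],
}


def labels_reachable(graph, start):
    """Return the set of edge labels reachable from `start`."""
    seen, labels, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for label, target in graph.get(node, []):
            labels.add(label)
            if target in graph:  # follow the edge only if it leads to a node
                stack.append(target)
    return labels
```

Calling `labels_reachable(graph, "paper1")` answers a schema-level question ("which attributes occur below this object?") without any fixed schema being declared in advance.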
Data Representation for Querying the Web (cont.)
A Hypertree Containing a Publications Database
(WebOQL) [AM98]
Data Representation for Querying the Web (cont.)
Semantic Web Data Models
• The Semantic Web is a Web whose content can be
annotated with metadata and processed
automatically by machines.
• Semantic assertions on the Semantic Web are
formulated in the Resource Description Framework
(RDF) model [LS99], which can be viewed as a
partially labeled directed graph.
• Semantic Web data models can exploit the semantics of
Web content and can provide better query results than
models based only on the content and
structure of the Web data.
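A minimal Python sketch of the RDF view described above: each assertion is a (subject, predicate, object) triple, which is one labeled edge of a directed graph. The URIs, the property names and both helper functions are hypothetical; this is not an implementation of any RDF library or query language.

```python
# Hypothetical RDF-style statements as (subject, predicate, object) triples.
triples = [
    ("http://example.org/paper1", "dc:title", "Querying the Web"),
    ("http://example.org/paper1", "dc:creator", "http://example.org/people/smith"),
    ("http://example.org/paper2", "dc:title", "Another Paper"),
    ("http://example.org/paper2", "dc:creator", "http://example.org/people/jones"),
    ("http://example.org/people/smith", "foaf:name", "Smith"),
    ("http://example.org/people/jones", "foaf:name", "Jones"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def titles_by_creator_name(triples, name):
    """Join three patterns: person has name, document has creator, document has title."""
    titles = []
    for person, _, _ in match(triples, p="foaf:name", o=name):
        for doc, _, _ in match(triples, p="dc:creator", o=person):
            titles += [title for _, _, title in match(triples, s=doc, p="dc:title")]
    return titles
```

The query follows the meaning of the edges (creator, name) rather than the text or link structure of the pages, which is the point of the third bullet.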
Data Representation for Querying the Web (cont.)
An Example RDF Graph [WWW1]
Modeling and Querying the Web
Query Languages for Graph Representation of
Website
• These query languages combine content-based
and structure-based queries. They can therefore
formulate regular path expression queries
and express navigational queries over the graph
structure.
• WebSQL [MMM97], W3QL [KS95], WebLog
[LSS96]
• Example: WebSQL [MMM97]
Modeling and Querying the Web (cont.)
WebSQL [MMM97]
• Models the Web as a relational database with two virtual
relations, Document and Anchor:
Document[url, title, text, type, length, modif]
Anchor[base, href, label]
• To map onto the graph structure of the WWW, each
document in the Document relation is mapped to a node
object in the graph, and each hypertext link between two
documents in the Anchor relation is represented by a link
object.
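The mapping from the two virtual relations to a graph can be sketched in Python as follows. The attribute names come from the slide; the attribute types, the sample rows and the function `build_graph` are assumptions made for illustration.

```python
from typing import NamedTuple


class Document(NamedTuple):
    url: str
    title: str
    text: str
    type: str
    length: int   # assumed to be a size in bytes
    modif: str    # assumed to be a last-modified date string


class Anchor(NamedTuple):
    base: str     # URL of the document containing the link
    href: str     # URL the link points to
    label: str    # anchor text


def build_graph(documents, anchors):
    """Map each document to a node and each anchor to a labeled edge."""
    graph = {d.url: [] for d in documents}
    for a in anchors:
        if a.base in graph:
            graph[a.base].append((a.label, a.href))
    return graph


# Hypothetical sample rows.
docs = [
    Document("http://example.org/", "Home", "...", "text/html", 120, "1998-01-01"),
    Document("http://example.org/p", "Papers", "...", "text/html", 300, "1998-02-01"),
]
links = [Anchor("http://example.org/", "http://example.org/p", "Papers")]
graph = build_graph(docs, links)
```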
Modeling and Querying the Web (cont.)
• Sample query [FLM98]: find the tuples of the form
(d1, d2, label), where d1 is a document stored at the local site,
d2 is a document stored somewhere else, and d1 points to
d2 by a link labeled label. Assume all local
documents are reachable from www.mysite.start.
SELECT d.url, e.url, a.label
FROM Document d SUCH THAT
www.mysite.start ->* d,
Document e SUCH THAT d => e,
Anchor a SUCH THAT a.base = d.url
WHERE a.href = e.url
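To make the evaluation of this query concrete, here is a small Python sketch, not WebSQL itself. It assumes that a link is local when both URLs share a host and global otherwise. The anchor rows and the functions `is_local` and `local_to_global_links` are hypothetical, and the start URL is written with an explicit `http://` scheme so that the host can be parsed.

```python
from urllib.parse import urlsplit

# Hypothetical Anchor tuples (base, href, label).
anchors = [
    ("http://www.mysite.start/", "http://www.mysite.start/a.html", "A"),
    ("http://www.mysite.start/a.html", "http://other.example/x.html", "X"),
    ("http://www.mysite.start/a.html", "http://www.mysite.start/", "home"),
    ("http://other.example/x.html", "http://third.example/", "T"),
]


def is_local(base, href):
    """Assumption: a link is local when source and target share a host."""
    return urlsplit(base).netloc == urlsplit(href).netloc


def local_to_global_links(start, anchors):
    """Find (d, e, label) where d is reachable from start by local links
    and d points to e by a global link."""
    reachable, frontier = {start}, [start]
    while frontier:  # zero or more local links from the start page
        url = frontier.pop()
        for base, href, _ in anchors:
            if base == url and is_local(base, href) and href not in reachable:
                reachable.add(href)
                frontier.append(href)
    return sorted(
        (base, href, label)
        for base, href, label in anchors
        if base in reachable and not is_local(base, href)
    )


result = local_to_global_links("http://www.mysite.start/", anchors)
```

The first loop plays the role of the path expression over local links, and the final filter plays the role of the single global link from d to e.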
Modeling and Querying the Web (cont.)
Query Languages for Semi-Structured
Representation of Website
• These languages discover the implicit structure within
semistructured Web data and then recast the
data to fit the discovered structure.
• WG-Log [CDPT98], ULIXES and PENELOPE
[AMM97a], WebOQL [AM98]
• Example: WebOQL [AM98]
Modeling and Querying the Web (cont.)
WebOQL [AM98]
• Introduces a hypertree data structure. A hypertree is an ordered
arc-labeled tree with two kinds of arcs: internal arcs and external arcs.
Internal arcs represent structured objects and external
arcs represent hyperlinks among objects. Arcs are
labeled with records.
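One possible way to represent a hypertree in Python, as a sketch under stated assumptions: records are plain dictionaries, an external arc keeps its target in a `Url` field of its record, and the class names and the helper `external_urls` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Arc:
    label: dict                          # record labeling the arc
    internal: bool                       # True = internal arc, False = external arc
    child: Optional["Hypertree"] = None  # subtree below an internal arc


@dataclass
class Hypertree:
    arcs: list = field(default_factory=list)  # ordered outgoing arcs


def external_urls(tree):
    """Collect the Url field of every external arc, in arc order."""
    urls = []
    for arc in tree.arcs:
        if arc.internal:
            if arc.child is not None:
                urls.extend(external_urls(arc.child))
        else:
            urls.append(arc.label["Url"])
    return urls


# Hypothetical fragment of a publications database.
paper = Hypertree([
    Arc({"Label": "Full version", "Url": "http://example.org/a.ps"}, internal=False),
    Arc({"Label": "Abstract", "Url": "http://example.org/a.txt"}, internal=False),
])
papers = Hypertree([
    Arc({"Title": "Paper A", "Authors": "Smith"}, internal=True, child=paper),
])
```

Keeping the arcs in a list preserves their order, which matters because the tree is ordered.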
A Hypertree Containing a Publications Database
(WebOQL) [AM98]
Modeling and Querying the Web (cont.)
• Represents web pages by hypertrees and a mapping function.
The mapping function maps URLs to their corresponding
hypertrees. The hypertree and the mapping function are also
called the schema and the browsing function of the Web,
respectively.
• Sample query [FLM98]: extract the title and URL of the
full version of papers authored by “Smith” from the
csPapers database.
SELECT [y.Title, y'.Url]
FROM x in csPapers, y in x'
WHERE y.Authors ~ "Smith"
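The query can be traced with a short Python sketch. It rests on a simplified reading of the operators, which is an assumption rather than the WebOQL definition: `x in t` iterates over the single-arc trees of `t`, the prime operator returns the subtree below the first arc, the dot operator reads a field from the record on the first arc, and `~` is a pattern match. The data, the tuple encoding of arcs and the helper names are hypothetical, and the full-version link is assumed to be the first arc below each paper.

```python
import re

# An arc is (record, subtree); a tree is an ordered list of arcs.


def simple_trees(tree):
    """Iterate over the single-arc trees of `tree`, in order."""
    return [[arc] for arc in tree]


def prime(tree):
    """Subtree below the first arc of `tree`."""
    return tree[0][1]


def dot(tree, field_name):
    """Field of the record on the first arc of `tree`."""
    return tree[0][0][field_name]


cs_papers = [
    ({"Group": "Databases"}, [
        ({"Title": "Paper A", "Authors": "Smith, Jones"}, [
            ({"Label": "Full version", "Url": "http://example.org/a.ps"}, []),
            ({"Label": "Abstract", "Url": "http://example.org/a.txt"}, []),
        ]),
        ({"Title": "Paper B", "Authors": "Lee"}, [
            ({"Label": "Full version", "Url": "http://example.org/b.ps"}, []),
        ]),
    ]),
]

result = [
    [dot(y, "Title"), dot(prime(y), "Url")]   # SELECT [y.Title, y'.Url]
    for x in simple_trees(cs_papers)          # FROM x in csPapers,
    for y in simple_trees(prime(x))           #      y in x'
    if re.search("Smith", dot(y, "Authors"))  # WHERE y.Authors ~ "Smith"
]
```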
Modeling and Querying the Web
Query Languages for Semantic Web
• The Semantic Web is a Web whose content can be
annotated with metadata and processed
automatically by machines.
• Semantic queries can exploit the
semantics of the Web content.
• RQL [KACPS02], SquishQL [MSR02], TRIPLE
[SBAHKW02].
Summary and Future
Summary
• Web data models are divided into three main categories:
graph data models, semistructured data models and semantic
Web data models.
• Based on these data models, Web query languages are also
classified into three primary groups.
Future
• Developing techniques to manipulate dynamic pages could
benefit Web query applications and may be a promising
direction for future research.
• Combining query results from different resources on the
Web, especially results from both structured and
unstructured data sources, also poses challenges for
future research.
References
[AM98] G. Arocena, A. Mendelzon, “WebOQL: Restructuring Documents, Databases,
and Webs”, Proc. ICDE'98, Orlando, Florida, Feb. 1998.
[CDPT98] S. Comai, E. Damiani, R. Posenato, L. Tanca, “A Schema-based Approach to
Modeling and Querying WWW Data”, Proc. of FQAS'98, Roskilde, May 1998,
LNAI 1495.
[AMM97a] P. Atzeni, G. Mecca, P. Merialdo, “To Weave the Web”, International
Conference on Very Large Data Bases (VLDB'97), Athens, Greece, August 26-29, 1997, pages 206-215.
[FLM98] D. Florescu, A. Levy, A. Mendelzon, “Database Techniques for the World-Wide
Web: A Survey”, SIGMOD Record 27, 3 (1998), 59-74.
[KACPS02] G. Karvounarakis, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl,
“RQL: A Declarative Query Language for RDF”, WWW2002, May 2002,
Honolulu, Hawaii.
[KS95] D. Konopnicki and O. Shmueli, “W3QS: A query system for the World Wide
Web”, In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages 54-65, Zurich, Switzerland, 1995.
[LSS96] L. V. S. Lakshmanan, F. Sadri, I. N. Subramanian, “A declarative language for
querying and restructuring the Web”, In Proc. of the sixth International
Workshop on Research Issues in Data Engineering, RIDE’96, New Orleans,
February 1996.
[MM97] A. O. Mendelzon, T. Milo, “Formal Models of Web Queries”,
Proceedings of the Sixteenth ACM Symposium on Principles of Database
Systems, 134-143, 1997.
[MMM97] A. Mendelzon, G. Mihaila, T. Milo, “Querying the world wide web”,
International Journal on Digital Libraries, 1(1):54-67, 1997.
[MSR02] L. Miller, A. Seaborne, A. Reggiori, “Three Implementations
of SquishQL, a Simple RDF Query Language”, Proceedings of 1st
International Semantic Web Conference. ISWC2002, Sardinia, Italy,
June 9-12, 2002.
[SBAHKW02] A. Sheth, C. Bertram, D. Avant, B. Hammond, K. Kochut, Y. Warke,
“Semantic Content Management for Enterprises and the Web”, IEEE
Internet Computing, July/August 2002, pp.80-87, 2002.
[WWW1] http://www.amk.ca/talks/semweb-intro, “Introduction to the Semantic
Web and RDF”