XML on Semantic Web
Outline
The Semantic Web
Ontology
XML
Probabilistic DTD
References
The Semantic Web (1/4)
The first generation Web
The second generation Web: the current Web
The third generation Web: the Semantic Web
The conceptual structuring of the Web in an explicit, machine-readable way
Requirements: universal expressive power, support for syntactic interoperability, support for semantic interoperability
The Semantic Web (2/4)
Syntactic interoperability concerns parsing the data; semantic interoperability means defining mappings between unknown terms and known terms in the data
Semantic interoperability requires standards for both the syntactic form of documents and their semantic content
A further representation and inference layer is needed on top of the currently available layers of the WWW: ontologies
The Semantic Web (3/4)
The Semantic Web (4/4)
Ontology (1/5)
An explicit, machine-readable specification of a shared conceptualization
Crucial role: representing a shared conceptualization of a particular domain
Reusable
Makes it possible to find pages that contain syntactically different but semantically similar words
Constructs: concepts (usually organized in taxonomies), relations, functions, axioms, instances
Ontology (2/5)
Ontology (3/5)
Concepts:
– Can be anything about which something is said
– Also known as classes (XOL, RDF(S), OIL, DAML+OIL), objects (OML), categories (SHOE)
Taxonomies:
– Used to organize ontological knowledge through generalization and specialization relationships, to which single and multiple inheritance can be applied
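A taxonomy with single and multiple inheritance can be sketched in a few lines. The class name `Concept`, the method `ancestors`, and the example concepts are illustrative assumptions and are not taken from any of the ontology languages named above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept (class) with zero or more direct super-concepts."""
    name: str
    parents: list[Concept] = field(default_factory=list)

    def ancestors(self) -> set[str]:
        """Return the names of all concepts that this one specializes."""
        found: set[str] = set()
        stack = list(self.parents)
        while stack:
            concept = stack.pop()
            if concept.name not in found:
                found.add(concept.name)
                stack.extend(concept.parents)
        return found


# Hypothetical example taxonomy.
person = Concept("Person")
student = Concept("Student", [person])        # single inheritance
employee = Concept("Employee", [person])      # single inheritance
teaching_assistant = Concept("TeachingAssistant", [student, employee])  # multiple inheritance

print(sorted(teaching_assistant.ancestors()))  # ['Employee', 'Person', 'Student']
```

Walking the parent links is what lets a specialized concept inherit everything stated about its generalizations.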
Ontology (4/5)
Relations and functions:
– A relation is an interaction between concepts of the domain and attributes
– Called relations in SHOE and OML, roles in OIL
– Functions are a special kind of relation
Axioms:
– Used for constraining information, verifying correctness, and deducing new information
– Also known as assertions (OML), rules, or logic
Ontology (5/5)
Instances:
– Represent elements in the domain attached to a specific concept
Measurement of expressiveness:
– XOL, RDF(S), SHOE, OML, OIL, DAML+OIL
XML (1/7)
As a serialization syntax for other markup languages, e.g. SMIL, XOL, SHOE
As semantic markup of Web pages
As a uniform data-exchange format
XML (2/7)
Universal expressive power: anything can be encoded in XML if a grammar can be defined for it
Syntactic interoperability: an XML parser can parse any XML data and is usually a reusable component
Semantic interoperability: there is no way of recognizing a semantic unit from a particular domain of interest (a limitation that is not yet widely recognized)
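The contrast between syntactic and semantic interoperability can be illustrated with Python's standard `xml.etree.ElementTree` parser. The two documents and the helper `leaf_items` are hypothetical; the sketch only shows that one reusable parser handles both vocabularies while nothing in the parsed result says that `author` and `creator` denote the same thing.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that describe the same fact in different vocabularies.
doc_a = "<book><author>Ann Lee</author></book>"
doc_b = "<publication><creator>Ann Lee</creator></publication>"


def leaf_items(xml_text: str) -> list[tuple[str, str]]:
    """Parse any well-formed XML and list (tag, text) pairs of its leaf elements."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text or "") for el in root.iter() if len(el) == 0]


# Syntactic interoperability: the same parser works for both documents.
print(leaf_items(doc_a))  # [('author', 'Ann Lee')]
print(leaf_items(doc_b))  # [('creator', 'Ann Lee')]

# Semantic interoperability is missing: the tag names differ, and the parser
# has no way of knowing that they refer to the same domain concept.
print(leaf_items(doc_a)[0][0] == leaf_items(doc_b)[0][0])  # False
```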
XML (3/7)
XML (4/7)
Data exchange:
– Build a model of the domain of interest
– From the domain model, a DTD or an XML Schema is constructed
Advantage: reusability of the parsing software components
There are multiple ways to encode a given domain model into a DTD, so the direct connection from the DTD to the domain model is lost and cannot easily be reconstructed
XML (5/7)
XML (6/7)
A direct mapping based on the different DTDs is not possible
So we have to define the mappings between the different domain models, then between the different DTDs:
– Reengineering of the original domain model from the DTD or XML Schema
– Establishing mappings between the entities in the domain models
– Defining translation procedures for XML documents
Using a more suitable formalism than pure XML can save much of this additional effort
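The three steps can be sketched for a toy case. Everything here is an assumption made for illustration: the `Person` domain model, the two encodings (attributes versus child elements), and the function names. The point is only that the translation goes through the reengineered domain model, not directly from one DTD to the other.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Person:
    """Step 1: domain-model entity reengineered from both encodings (hypothetical)."""
    name: str
    email: str


def from_encoding_a(xml_text: str) -> Person:
    """Step 2: map encoding A, which stores properties as attributes, to the model."""
    el = ET.fromstring(xml_text)
    return Person(name=el.get("name", ""), email=el.get("email", ""))


def to_encoding_b(person: Person) -> str:
    """Step 3: translate the model into encoding B, which uses child elements."""
    root = ET.Element("contact")
    ET.SubElement(root, "fullName").text = person.name
    ET.SubElement(root, "mail").text = person.email
    return ET.tostring(root, encoding="unicode")


doc_a = '<person name="Ann Lee" email="ann@example.org"/>'
doc_b = to_encoding_b(from_encoding_a(doc_a))
print(doc_b)
```

Each additional encoding needs its own pair of mapping functions, which is the extra effort the last bullet refers to.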
XML (7/7)
Probabilistic DTD(1/11)
Describes the most likely orderings of XML tags and contains statistical properties for each tag
Uses association rule discovery algorithms and sequence mining techniques
Probabilistic DTD (2/11)
Objectives: tagging all text documents and deriving an appropriate preliminary flat XML DTD
– A knowledge discovery in textual databases (KDT) process builds clusters of semantically similar text units; new documents can then be converted into XML documents
Probabilistic DTD (3/11)
UML schema: initially conceived by experts; it serves as a reference for the DTD, but there is no guarantee that the final DTD will be contained in, or contain, this schema
KDT process:
– Tagging initial text documents
– Domain knowledge, such as a thesaurus and the preliminary UML schema, is input to the process
– Pre-processing
– Iterative clustering
– Post-processing
– Establishing a probabilistic DTD
Probabilistic DTD (4/11)
Probabilistic DTD (5/11)
Pre-processing:
– Setting the level of granularity
– NLP processing such as tokenization, normalization, and word stemming
– Building text unit descriptors: a reduced feature space (currently chosen by the engineer)
– Mapping all text units into Boolean vectors of this feature space
– Extracting named entities
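The mapping of text units into Boolean vectors can be sketched as follows. The descriptor list, the example sentences, and the function names are hypothetical, and stemming and named-entity extraction are left out to keep the sketch short.

```python
import re

# Hypothetical text unit descriptors that span the reduced feature space.
DESCRIPTORS = ["company", "capital", "owner", "register"]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphabetic tokens (no stemming here)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def to_boolean_vector(text_unit: str, descriptors: list[str] = DESCRIPTORS) -> list[bool]:
    """One component per descriptor: True if the descriptor occurs in the text unit."""
    tokens = tokenize(text_unit)
    return [d in tokens for d in descriptors]


units = [
    "The company was entered in the commercial register.",
    "The owner increased the capital of the company.",
]
vectors = [to_boolean_vector(u) for u in units]
for unit, vector in zip(units, vectors):
    print(vector, unit)
```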
Probabilistic DTD (6/11)
Clustering:
– Performed in multiple iterations; each iteration outputs a set of clusters
– All text unit vectors are clustered
– Clusters are partitioned into “acceptable” and “unacceptable” according to quality criteria
– Members of “unacceptable” clusters are the input data for the next iteration
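The iteration scheme can be sketched as below. The slides do not name the clustering algorithm or the quality criteria, so this sketch substitutes assumptions of its own: a greedy grouping by Jaccard similarity, a minimum cluster size as the only quality criterion, and a similarity threshold that is relaxed in each round. Text units are represented by the set of descriptors that are true in their Boolean vector.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Similarity of two descriptor sets (assumed measure, not from the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_clusters(units: list, threshold: float) -> list:
    """Put each unit into the first cluster whose first member is similar enough."""
    clusters: list = []
    for unit in units:
        for cluster in clusters:
            if jaccard(unit, cluster[0]) >= threshold:
                cluster.append(unit)
                break
        else:
            clusters.append([unit])
    return clusters


def iterative_clustering(units, thresholds=(0.8, 0.5), min_size=2):
    """Return (acceptable clusters, units left over after the last iteration)."""
    accepted = []
    remaining = list(units)
    for threshold in thresholds:
        if not remaining:
            break
        next_round = []
        for cluster in greedy_clusters(remaining, threshold):
            if len(cluster) >= min_size:      # assumed quality criterion
                accepted.append(cluster)
            else:                             # "unacceptable": retry next iteration
                next_round.extend(cluster)
        remaining = next_round
    return accepted, remaining


# Hypothetical text units given as sets of active descriptors.
units = [
    frozenset({"company", "register"}),
    frozenset({"company", "register"}),
    frozenset({"owner", "capital"}),
    frozenset({"owner", "capital", "company"}),
    frozenset({"court"}),
]
accepted, remaining = iterative_clustering(units)
print(len(accepted), remaining)
```

With these inputs the first round accepts the two identical units, the second round groups the two ownership-related units, and the single `court` unit is left unclustered.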
Probabilistic DTD (7/11)
Post-processing:
– “Acceptable” clusters are semi-automatically assigned a label
– Ultimately, cluster labels are determined by the engineer
– All default cluster labels are derived from text unit descriptors
– An XML DTD is derived automatically from the XML tags
Probabilistic DTD (8/11)
Probabilistic DTD (9/11)
Establishing a probabilistic DTD:
– Deriving the most likely ordering of the tags
– Computing the statistical properties of each tag inside the document type definition
Deriving the ordering of the tags:
– Backward construction of DTD sequences: builds “maximal” sequences
– Forward sequence construction
Probabilistic DTD (10/11)
Backward Construction of DTD Sequences
– Starts with an arbitrary tag Y and identifies the tag most likely to appear before it
– If no such tag exists, the algorithm moves on to the next sequence; if there is one, the next iteration starts; if there are k such tags, k copies of the incomplete sequence are created
– Each tag Xi leads to Y with a confidence Ci
– If one Ci is larger than the others, then Xi is the predecessor of Y in the sequence
– If C0, the confidence that Y has no predecessor, is the largest, then Y is the first element
– Confidence is the tag’s TagSupport multiplied by the accuracy
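The backward construction can be sketched under several assumptions that the slides do not settle: the confidences are given as a ready-made table, `START` stands for the "no predecessor" case (C0), a tag that is already in the sequence is not added again so that the loop terminates, and a tie that involves `START` ends the sequence. The tag names and numbers are made up.

```python
START = None  # marks "no predecessor", i.e. the C0 case


def confidence(tag_support: float, accuracy: float) -> float:
    """Confidence as described above: TagSupport multiplied by accuracy."""
    return tag_support * accuracy


def backward_sequences(tag, predecessors):
    """Return all maximal sequences that end in `tag`.

    predecessors[y] maps each candidate predecessor x (or START) to the
    confidence that x appears directly before y.
    """
    complete = []
    pending = [[tag]]
    while pending:
        seq = pending.pop()
        candidates = predecessors.get(seq[0], {})
        # Assumption: skip tags already in the sequence to avoid endless cycles.
        usable = {x: c for x, c in candidates.items() if x is START or x not in seq}
        if not usable:
            complete.append(seq)              # no predecessor known
            continue
        best = max(usable.values())
        winners = [x for x, c in usable.items() if c == best]
        if START in winners:
            complete.append(seq)              # C0 is largest: first element reached
            continue
        for x in winners:                     # k equally likely tags: k sequences
            pending.append([x] + seq)
    return complete


# Hypothetical confidence table.
predecessors = {
    "date": {"company": confidence(0.9, 0.8), START: confidence(0.9, 0.1)},
    "company": {START: confidence(0.7, 0.9), "date": confidence(0.7, 0.05)},
}
print(backward_sequences("date", predecessors))  # [['company', 'date']]
```

A table in which two predecessors share the highest confidence yields one sequence per tied tag, which corresponds to the duplication step in the list above.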
Probabilistic DTD (11/11)
References
The Semantic Web - on the respective Roles of XML and RDF
– Stefan Decker, Frank van Harmelen, Jeen Broekstra, Michael Erdmann, Dieter Fensel, Ian Horrocks, Michel Klein, Sergey Melnik
Intelligent Information Agent with Ontology on the Semantic Web
– Weihua Li
Ontology Languages for the Semantic Web
– Asuncion Gomez-Perez, Oscar Corcho
Extraction of Semantic XML DTDs from Texts Using Data Mining Techniques
– Karsten Winkler, Myra Spiliopoulou