Enabling Semantic Searching

by Stefano Mazzocchi
<[email protected]>
What is the “Semantic Web”?
The Semantic Web is an
extension of the current web in
which information is given
well-defined meaning, better
enabling computers and people
to work in cooperation.
[Tim Berners-Lee, James Hendler, Ora Lassila]
Didn’t get it? Let’s try again
• The web is the most successful
publishing medium in the history of
mankind.
• And it is still growing!
• The ‘semantic web’ dream is to have
machines that help us consume
that much information!
What do we need to build a
semantic web?
• Data identification and retrieval
• Development of vocabularies
• Model constraints
• Assertion and proofs
[Eric Prud’hommeaux]
All that?
• Unfortunately, yes…
• …but each time we reach one of
these steps, the resulting capabilities
turn out to be surprising!
One example for all: Google!
• Google infers page importance from the global
web hyperlink topology.
• This is possible because the semantics of
hyperlinks are well determined, thus
understandable by machines.
• The results of such a simple computation are
astonishing.
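As a rough illustration of ranking pages from link topology alone, here is a minimal sketch of a PageRank-style power iteration. The function name, the toy link graph, the number of iterations, and the 0.85 damping factor are assumptions made for this example; it is not Google's actual implementation.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Estimate page importance from a link graph.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c", "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = page_rank(toy_web)
```

In this toy graph the page with the most incoming links ("c") ends up with the highest score, and the page nobody links to ("b") with the lowest. Nothing but the meaning of "links to" is needed to get that ordering.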
Semantic Searching
The act of looking for data
with the help of information
inferred from some well-defined meaning of the data
itself.
Warning: Problems Ahead!
• The Babel Problem
• The Chicken-Egg Problem
• The ROI Problem
• The Screen-Scrape Problem
• The Marginal Costs Problem
The Babel Problem (1)
• XML makes it possible to create
new markup languages to fit each
little need.
• In many situations, existing
markups are complex and their
learning curve is too steep… thus:
• We see an explosion of markup
languages
The Babel Problem (2)
• It is not obvious that this trend
will ever reach saturation
(especially with the advent of
SOAP-based web services)
• Automatic translation between
markups is not always
algorithmically possible.
The Chicken-Egg Problem
• People won’t feel the need to
publish information in more
semantically meaningful
languages until there is some
use for it.
• And no use will emerge until there
is enough of such semantic
information to work on.
The ROI Problem
• If writing ‘semantized’ information
is more expensive than writing
‘non-semantized’ information…
• … and the return on these extra
costs doesn’t pay them off, it simply
won’t happen!
The Screen-Scrape Problem
• The great majority of web
information is published using
HTML, which has intrinsically poor
semantic capabilities.
• If the extraction of semantic
information from HTML is done
using ‘screen-scraping’, the costs
will always exceed the benefits!
The Marginal Cost Problem
• If the marginal cost of adding
semantic information while
authoring some text grows linearly
with the text size, the whole semantic
web might never scale
economically! (especially together with the
ROI problem)
Enabling semantic searching
• We need a way to solve all the
previous problems, or there will
never be anything better than
Google.
Enter the solutions!
• XML-based Web Publishing
• Standardized semantic HTTP
variants
• Semantic-aware content editors
XML-based Web Publishing
• XML-based web publishing systems
make it economically worthwhile to create
XML content.
• This partially solves the chicken-egg
and the ROI problems since such
systems allow people to have
immediate benefits (especially for those
with cross-media publishing needs)
HTTP Variants!
• HTTP/1.1 has the notion of ‘resource
variants’. So it is possible to ask for a
specific flavor of a given resource.
• If ‘semantic variants’ were
standardized, this might solve, together
with XML-based publishing systems, the
Screen-Scrape problem.
• Apache Cocoon already implements
such a concept with ‘resource views’.
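To make the ‘resource variants’ idea concrete, here is a minimal sketch of server-side variant selection driven by the HTTP Accept header. The function names, the example variants, and the choice of application/rdf+xml as the ‘semantic variant’ are assumptions for illustration only; no such variant has been standardized, and this is not how Apache Cocoon implements its resource views.

```python
def parse_accept(header):
    """Return the media types listed in an Accept header, most preferred first."""
    preferences = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        quality = 1.0  # HTTP/1.1 default when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((quality, media_type))
    # Stable sort: equal q-values keep the order in which the client listed them.
    preferences.sort(key=lambda pair: pair[0], reverse=True)
    return [media_type for quality, media_type in preferences if quality > 0]


def choose_variant(accept_header, variants, default="text/html"):
    """Pick which variant of a resource to serve.

    Simplification: only exact media types and "*/*" are matched, and an
    unsatisfiable request falls back to the default instead of a 406 response.
    """
    for media_type in parse_accept(accept_header):
        if media_type in variants:
            return media_type
        if media_type == "*/*":
            return default
    return default


# Hypothetical variants of the same resource: one for people, one for machines.
variants = {
    "text/html": "<html>...</html>",
    "application/rdf+xml": "<rdf:RDF>...</rdf:RDF>",
}
chosen = choose_variant("application/rdf+xml, text/html;q=0.5", variants)
```

A browser asking for text/html keeps getting the human-readable page, while a crawler that prefers the machine-readable variant gets structured data from the same URL and never has to scrape the HTML.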
Semantic-aware Content
Editors
• A simple and cost-effective
solution for semantic-aware
content editing is a conditio sine
qua non for the production of
semantically-meaningful content.
Conclusions (1)
• Searching is the first scenario of
use of semantic web technologies
since it doesn’t require all the
infrastructure to be present.
• Still, many problems must be
faced, especially the socio-economic
ones that academia is currently ignoring.
Conclusions (2)
• Without an incremental and
economically feasible plan of
adoption, the semantic web is
unlikely to happen.
• The proposed plan of adoption
uses XML publishing on the server
side along with standardized
semantic HTTP variants.
Conclusions (3)
• Still, the biggest problem to face is
semantically-aware content editing
and the solution of the Babel
problem without requiring the
creation of huge ontologies that
are very unlikely to be manageable
for the entire web.
ToDo (1)
• Agree on a way to publish the different
resource variants!
• Agree on markups/metadata or, at
least, provide mechanical ways to
translate one into another.
• Enforce the use of namespaced XML
(despite the lack of validation support in
DTD and lack of coherence between the
infoset and the syntax)
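One way to see why namespaced XML matters: a program that knows a single shared vocabulary can pick its elements out of any document, whatever home-grown markup surrounds them. The sketch below assumes the Dublin Core element set as that shared vocabulary; the sample document, its example.org namespace, and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical document mixing a private markup with Dublin Core metadata.
SAMPLE = """<page xmlns="http://example.org/ns/page"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Enabling Semantic Searching</dc:title>
  <dc:creator>Stefano Mazzocchi</dc:creator>
  <title>Slide 1</title>
</page>"""


def dublin_core_fields(xml_text):
    """Return {local name: text} for every Dublin Core element in the document."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"  # ElementTree writes qualified tags as {uri}local
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            fields[element.tag[len(prefix):]] = (element.text or "").strip()
    return fields


metadata = dublin_core_fields(SAMPLE)
```

The private title element is ignored because it lives in a different namespace, even though its local name collides with dc:title. Without namespaces the two could not be told apart mechanically.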
ToDo (2)
• Think about semantic-aware editing
(which is not only XML-aware, but also
RDF-aware!)
• Research less expressive (than
RDF) but more practical and cost-effective
solutions that encode semantic
information into the schemas instead of
their content (semantic-sheets?,
semantic relevance ratings?)
Thanks!
Any questions?