Inside semantic Web search engines: between semantic annotation and Natural Language Processing
ISKO Italia meeting - Turin, 3 April 2009
Presentation by Mela Bosch
Terminology on Web Search Engines
Text search engine: based on lexical analysis. The main aim of lexical
analysis is to divide the text into paragraphs, sentences and words, and to
identify entities such as e-mail addresses or URLs. All these elements are
known as tokens. The search engine parses them using statistical parameters
to produce a ranked list of links in response to a query.
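A minimal sketch of the lexical analysis step described above, assuming simple regular expressions for e-mail addresses, URLs and words. The patterns and function names are illustrative assumptions, not the method of any particular search engine.

```python
import re

# Assumed token patterns. Order matters: e-mail and URL patterns are tried
# before the plain-word pattern so that they are kept as single tokens.
TOKEN_RE = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<url>https?://[^\s<>\"]+)"
    r"|(?P<word>\w+(?:'\w+)?)"
)

def tokenize(text):
    """Return (kind, token) pairs for e-mail addresses, URLs and words."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

The URL pattern is deliberately crude: a full stop directly after a URL would be swallowed into the token, so a real tokenizer needs more careful rules.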
Latent semantic indexing (LSI): based on latent semantic analysis (LSA).
LSI is a Natural Language Processing (NLP) technique which uses an indexed
database of documents to find related terms. It can find a synonym of a query
term and then return the best-matching websites for the query. LSI does not
require exactly matching words to rank results.
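The idea that LSI can match a query without exact word overlap can be illustrated with a truncated singular value decomposition of a term-document matrix. This is a toy sketch with NumPy; the four-document corpus, the choice of k = 2 and the function name are assumptions made for the example.

```python
import numpy as np

docs = [
    "car engine",
    "automobile engine",
    "flower petal",
    "flower",
]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix: rows are terms, columns are documents.
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

k = 2  # number of latent dimensions kept (an arbitrary choice for this toy corpus)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, docs_k = U[:, :k], s[:k], Vt[:k, :].T  # docs_k: one row per document

def query_similarities(query):
    """Fold the query into the latent space; return its cosine similarity to each document."""
    q = np.array([query.split().count(t) for t in vocab], dtype=float)
    q_k = (q @ U_k) / s_k
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / norms

sims = query_similarities("car")
```

In this corpus the second document never contains the word "car", yet it scores as highly as the first, because "car" and "automobile" both co-occur with "engine" and end up on the same latent dimension. The two "flower" documents score close to zero.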
Semantic Web search engines: take the sense of a word into account as a
factor in their ranking, or offer the user a choice among the senses of a
word or phrase.
Semantic Web search engines, or third-generation search engines
Three types:
User-oriented Semantic Web search engine: returns links to web pages.
Internally it can use both Semantic Web technologies and LSI. Ex.:
True Knowledge, Hakia and PowerSet.
Semantic Web Services-oriented engine: returns links to ontologies,
OWL files and RDF instances. It is not suited to end users; the idea is to
provide ways for businesses to interoperate across domains or services.
Ex.: SOWL, WSE, Watson, Falcons, Sindice and Swoogle.
Social-semantic Web-oriented engine: the socio-semantic web (s2w)
uses classification and ontologies in very practical situations. The aim of
s2w search engines is to complement the formal Semantic Web vision with a
pragmatic collaborative tagging (folksonomy) approach. The main interest
is to enable users to share knowledge.
Ex.: http://www.stumpedia.com/
Semantic Web search engines. What are all these differences for?
“Semantic Web means many things to different people:
• It is about artificial intelligence, computer programs solving complex optimization problems
• It is about web services, in terms of end user value
• It is the web of data, where information is represented in RDF or microformats and OWL.”
See: http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php
The components of Semantic Web search engines
• Natural Language Processing (NLP)
• Annotation
•Annotation
Free-text annotation:
The annotations can be comments, notes, explanations, references, examples,
advice, corrections or any other type of external remark that can be attached
to or embedded in a Web document or a selected part of the document.
See: http://www.ncb.ernet.in/groups/dake/annotate/intro.shtml
Semantic annotation in general
Semantic annotation is the association of a data entity with an element
from a classification scheme, ontology or other knowledge repository.
Examples of semantic annotation:
• the assignment of MeSH descriptors to citations in MEDLINE
• the assignment of Gene Ontology terms to gene products in UniProt
Semantic Web Annotation
Semantic Web annotation is the technique of putting machine-understandable
data on the Web by creating metadata through semantic tagging.
• A semantic annotation is a formal annotation, where the predicate is an
ontological term and the object conforms to an ontological definition.
• The term “annotation” can denote both the process of annotating and
the result of that process.
It is crucial to the fulfillment of the Semantic Web to give
useful meaning to data or to unstructured text.
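The definition of a formal annotation given above (a predicate that is an ontological term, an object that conforms to an ontological definition) can be sketched as a small check on subject-predicate-object triples. The mini-ontology, the `ex:` identifiers and the function name below are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology: each predicate maps to the class its object must belong to.
ONTOLOGY_PREDICATES = {"ex:locatedIn": "ex:City", "ex:author": "ex:Person"}
# Hypothetical instance data: the class of each known entity.
ENTITY_CLASSES = {"ex:Turin": "ex:City", "ex:Alice": "ex:Person"}

@dataclass(frozen=True)
class Annotation:
    subject: str    # the annotated resource, e.g. a document identifier
    predicate: str  # should be a term defined in the ontology
    obj: str        # should conform to the class the predicate expects

def is_semantic(annotation):
    """True if the predicate is an ontology term and the object has the expected class."""
    expected = ONTOLOGY_PREDICATES.get(annotation.predicate)
    return expected is not None and ENTITY_CLASSES.get(annotation.obj) == expected
```

Under these assumptions, `Annotation("doc1", "ex:locatedIn", "ex:Turin")` passes the check, while a free-text remark such as `Annotation("doc1", "ex:comment", "nice page")` does not, because its predicate is not an ontology term.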
Semantic Web Annotation
The Semantic Web annotation process includes three components:
• an ontology, which describes the domain of interest
• a data instance recognition process, which discovers all instances of
interest in target web documents based on the defined ontology
• an annotation generation process, which creates a semantic meaning
disclosure file for each annotated document. Through the semantic meaning
disclosure file, any ontology-aware machine agent can understand the
target document.
See: http://www.deg.byu.edu/ding/research/SemanticAnnotation.html
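The three components listed above can be sketched end to end on a toy scale. In this sketch the "ontology" is just a dictionary of labels, recognition is plain string matching, and the disclosure file is a JSON record; all names, identifiers and the JSON layout are assumptions for illustration, not the format used by the cited project.

```python
import json
import re

# Component 1: a hypothetical toy ontology mapping surface labels to concept identifiers.
ONTOLOGY = {"turin": "ex:City/Turin", "search engine": "ex:Concept/SearchEngine"}

def recognize_instances(text):
    """Component 2: find every occurrence of an ontology label in the text."""
    found = []
    for label, concept in ONTOLOGY.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append(
                {"concept": concept, "text": m.group(), "start": m.start(), "end": m.end()}
            )
    return sorted(found, key=lambda a: a["start"])

def generate_disclosure(doc_id, text):
    """Component 3: serialise the recognised instances as a per-document JSON record."""
    record = {"document": doc_id, "annotations": recognize_instances(text)}
    return json.dumps(record, indent=2)
```

A real recognizer would use NLP rather than exact label matching, and a real disclosure file would typically be expressed in RDF so that ontology-aware agents can read it.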
Annotations can be generated manually, automatically or semi-automatically.
The process of annotating requires semantic annotation tools:
Types of semantic annotation tools
Inline annotation means that the original document is augmented with
metadata information.
[Diagram: an HTML document (<html> … </html>) with embedded metadata (<annot>) inside it]
It focuses on annotating information on pages using RDF so that it is
machine readable.
Also called: Semantic Authoring, or bottom-up approach
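As a rough illustration of inline annotation, the sketch below embeds metadata directly in the document by wrapping known labels in a `span` element that carries a resource URI, in the spirit of RDFa. The label-to-URI table and the function name are hypothetical; a real authoring tool would draw the URIs from an ontology and emit complete RDFa.

```python
import html
import re

# Hypothetical label-to-URI mapping, used only for this sketch.
CONCEPT_URIS = {"Turin": "http://example.org/id/Turin"}

def annotate_inline(text):
    """Return an HTML fragment in which each known label is wrapped in an annotated span."""
    pattern = re.compile(
        "|".join(r"\b" + re.escape(label) + r"\b" for label in CONCEPT_URIS)
    )
    parts, last = [], 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[last:m.start()]))  # untouched text before the match
        uri = html.escape(CONCEPT_URIS[m.group()], quote=True)
        parts.append(f'<span resource="{uri}">{html.escape(m.group())}</span>')
        last = m.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The annotated output replaces the original text, which is the defining property of the inline approach: document and metadata travel together in one file.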
Types of semantic annotation tools:
Standoff annotation means that the metadata is stored separately from the
original document.
[Diagram: an HTML document (<html> … </html>) with attached metadata annotation stored outside it]
The annotations are then stored in a database that is made available to
users via websites and sometimes via web services.
It is generally preferable from the point of view of interoperability.
Also called: top-down approach. Its focus is leveraging information from
existing web pages to derive meaning automatically.
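A small sketch of standoff annotation, assuming an in-memory SQLite database as the separate annotation store. The table layout, column names and helper functions are assumptions for this example; the point is only that the document text is never modified and annotations refer to it by URL and character offsets.

```python
import sqlite3

def open_store():
    """Create an in-memory annotation store (the schema is an assumption of this sketch)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE annotation ("
        "doc_url TEXT, start_pos INTEGER, end_pos INTEGER, concept TEXT)"
    )
    return db

def add_annotation(db, doc_url, start, end, concept):
    """Record that characters [start, end) of the document denote the given concept."""
    db.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)", (doc_url, start, end, concept)
    )

def annotations_for(db, doc_url, text):
    """Resolve the stored offsets against the untouched document text."""
    rows = db.execute(
        "SELECT start_pos, end_pos, concept FROM annotation "
        "WHERE doc_url = ? ORDER BY start_pos",
        (doc_url,),
    )
    return [(text[s:e], concept) for s, e, concept in rows]
```

Because the store is separate, several parties can annotate the same page without write access to it, which is one reason the approach helps interoperability. The cost is that offsets break if the page changes.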
There are several choices for annotation
The components of Semantic Web search engines
•Natural Language Processing (NLP)
Initially, NLP
• is conceived as a support for linguistic studies
• aims at using computers to interpret and manipulate words as part of a
language
It is a powerful method for the investigation and evaluation of human
language itself, for example through enhanced study of large corpora of texts.
Then
• Artificial Intelligence defines NLP as the act of using computers
to process written and spoken languages for some practical
purpose, such as translating languages or carrying on conversations
with machines.
After the Web explosion, NLP has been used for the development of natural
language understanding systems that convert samples of human language into
more formal representations that are easier for computer programs to
manipulate.
Now
• NLP techniques such as chunking, clustering, parsing, spellchecking,
tagging and word sense disambiguation are used to handle text intelligently
and to extract information from text data banks on the Web in order
to answer questions.
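One of the techniques named above, word sense disambiguation, can be sketched with a simplified Lesk-style overlap count: the chosen sense is the one whose gloss shares the most words with the surrounding sentence. The sense inventory, glosses and stopword list below are hypothetical and written only for this example.

```python
# Hypothetical sense inventory: each sense of "bank" has a short gloss written for this sketch.
SENSES = {
    "bank": {
        "bank/finance": "institution that accepts money deposits and grants loans",
        "bank/river": "sloping land beside a river or lake",
    }
}
STOPWORDS = {"a", "an", "and", "at", "in", "of", "or", "that", "the", "to"}

def content_words(text):
    """Lower-case, whitespace-split words with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get)
```

This sketch does not strip punctuation or handle inflected forms, and ties fall back to the first sense listed, so it should be read as an outline of the idea rather than a usable disambiguator.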
Conclusion
Both methodologies are now being combined:
• Semantic Web search engines need many pages to be annotated, which
requires an enormous effort,
• so NLP becomes an important aid in automatic or semi-automatic
annotation.
• At the same time, the precision of text analysis may be improved by means
of assignment techniques provided by users and professionals.
In conclusion, the trend is the development of collective knowledge systems
that improve as more people participate, since they are based on human
contributions. All of this will possibly be complemented by NLP algorithms.
References
Iskold, A. (2006) Semantic Web Patterns: A Guide to Semantic Technologies.
http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php
Kiryakov, A. et al. (2005) Semantic Annotation, Indexing, and Retrieval. Ontotext Lab.
http://www.ontotext.com/publications/SemAIR_ISWC169.pdf
Vehvilainen, A. et al. (2006) Semi-Automatic Semantic Annotation and Authoring Tool for a Library Help
Desk Service. Helsinki University. http://www.seco.tkk.fi/publications/2006/vehvilainen-hyvonen-almsemi-automatic-semantic-annotation-and-authoring-tool.pdf
Maynard, D. (2005) Benchmarking ontology-based annotation tools for the Semantic Web. Department
of Computer Science, University of Sheffield, UK. http://gate.ac.uk/sale/ahm05/ahm.pdf
Good, B. M.; Kawas, E.; Wilkinson, M. (2007) Bridging the gap between social tagging
and semantic annotation: E.D. the Entity Describer.
http://precedings.nature.com/documents/945/version/2/html
Useful links:
http://www.semanticfocus.com/
http://logic.stanford.edu/oem/projects.html#_Coordinating_Collective_Work
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki