Ontology Learning
For the Semantic Web
The Paper Itself

Based around two products: OntoEdit and Text-to-Onto.
A rather foundational approach to the problems surrounding information extraction.
Occasionally, some really weak sentence structure.
Problems with inconsistent use of examples.
A frustrating exercise in presenting questions but not the answers.

The Web
(our semantic battleground)
The Web was created as a free-form
information space.
 Made for human comprehension, not
machine understanding.
From experience with the web, there also seems to be an inherent aversion to correct speeling or gRamar.

Machine Semantics

“Computers, while originally designed to
understand a series of electrical pulses,
have had that same vocabulary expanded
to be able to also evaluate letters,
booleans, and numbers.”
-Brian Goodrich
Ontologies

Ontologies are metadata schemas.
– Controlled vocabulary of concepts
– Machine-understandable semantics
– Define shared domain conceptualizations (e.g. website to website, people to machines, etc.)
The Assumption

“If every internet webpage had an
associated perfect ontology that was just
as accessible as the selfsame webpage,
the creation of the semantic web would be
only as far away as the creation of a
browser that can find and interpret those
ontologies and extract information based
upon those models.”
-Brian Goodrich
The Knowledge Bottleneck

“…manual acquisition of ontologies still
remains a tedious cumbersome task
resulting in a knowledge acquisition
bottleneck.”
– Steffen Staab
The Challenge

In overcoming this knowledge acquisition bottleneck, the authors took a three-fold approach:
– Time (Can you develop an ontology fast?)
– Difficulty (Is it difficult to build an ontology?)
– Confidence (How do you know that you’ve got
the ontology right?)
OntoEdit

OntoEdit supports the development and maintenance of ontologies using graphical means. It supports RDF-Schema, DAML-ONT, OIL and F-Logic.
It has many of the same features as our Ontology Editor:
– Cardinality Restrictions
– Keyword Associations
– Value Phrase Restrictions
Assaulting the walls of Jericho

Multi-disciplinary approach
– Machine learning (human assisted)

Five-phase approach:
– Import and reuse existing ontologies.
– Extraction uses machine learning to sculpt major sections of the target ontology.
– The target ontology is "pruned."
– Refinement(?) (automatically and incrementally maintained by evaluating the "quality" of proposals) [Hahn & Schnattinger]
– Validation uses the prime target application as a measure of the ontology's success.
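To make the flow concrete, here is a minimal sketch of the five phases chained as a pipeline. All names and the toy implementations below are mine, not OntoEdit's or Text-to-Onto's code; the point is only how the phases feed into one another.

# A minimal, hypothetical sketch of the five-phase cycle as a pipeline of
# callables. The phase names come from the paper; the implementations here
# are trivial placeholders, not the actual OntoEdit / Text-to-Onto code.

def import_and_reuse(ontology, corpus):
    return ontology                       # merge in pre-existing ontologies

def extract(ontology, corpus):
    ontology["concepts"] |= {w for doc in corpus for w in doc.split()}
    return ontology                       # ML-driven extraction (stubbed as word harvesting)

def prune(ontology, corpus):
    return ontology                       # drop unfocused items (see the pruning sketch later)

def refine(ontology, corpus):
    return ontology                       # incremental, quality-driven updates

def validate(ontology, corpus):
    return len(ontology["concepts"]) > 0  # success measured against the target application

def learn(corpus):
    ontology = {"concepts": set()}
    for phase in (import_and_reuse, extract, prune, refine):
        ontology = phase(ontology, corpus)
    return ontology, validate(ontology, corpus)

if __name__ == "__main__":
    print(learn(["the hotel offers rooms", "rooms have a balcony"]))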




Another Wonderful Graph

Figure: the five-phase cycle (Import/Reuse, Extract, Prune, Refine, Validate), with a callout noting that "legacy data" refers to archaic "databasing" techniques.

Text-to-Onto
Components for Learning Ontologies, by Staab and Maedche:
– Management Component
– Resource Processing Component
– Algorithm Library
– GUI for Manual Engineering
Ontology Primitives

An ontology comprises:
– a set of strings that describe lexical entries L for concepts and relations;
– a set of concepts C;
– a taxonomy of concepts with multiple inheritance (heterarchy) H_C;
– a set of non-taxonomic relations R, described by their domain and range restrictions;
– a heterarchy of relations, i.e. a set of taxonomic relations H_R;
– relations F and G that relate concepts and relations, respectively, with their lexical entries;
– a set of axioms A that describe additional constraints on the ontology and allow implicit facts to be made explicit.
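A minimal sketch of how these primitives could be held in one data structure. The field names mirror the symbols above (L, C, H_C, R, H_R, F, G, A); the representation itself is an assumption for illustration, not the paper's formalism.

# A hypothetical container for the ontology primitives listed above.
from dataclasses import dataclass, field

@dataclass
class Ontology:
    lexicon: set = field(default_factory=set)           # L: lexical entries (strings)
    concepts: set = field(default_factory=set)          # C: concepts
    concept_taxonomy: set = field(default_factory=set)  # H_C: (sub, super) concept pairs, multiple inheritance allowed
    relations: dict = field(default_factory=dict)       # R: relation -> (domain concept, range concept)
    relation_taxonomy: set = field(default_factory=set) # H_R: (sub, super) relation pairs
    concept_lexicon: set = field(default_factory=set)   # F: (lexical entry, concept) pairs
    relation_lexicon: set = field(default_factory=set)  # G: (lexical entry, relation) pairs
    axioms: list = field(default_factory=list)          # A: extra constraints / inference rules

hotel = Ontology()
hotel.concepts |= {"Hotel", "Accommodation"}
hotel.concept_taxonomy.add(("Hotel", "Accommodation"))
hotel.lexicon.add("hotel")
hotel.concept_lexicon.add(("hotel", "Hotel"))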
Management Component
The OntoEngineer uses it to select the desired XML/HTML pages, document type definitions, databases, or pre-existing ontologies.
Selects methods for the Resource Processing Component and algorithms for the Library Component.
Also includes a crawler that can find legacy data on the web relevant to the creation of the ontology (used as training data).
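As a rough illustration of that last point, here is a toy focused crawler using only the Python standard library. The paper does not describe its crawler at this level of detail, so everything below is a hypothetical stand-in, not the crawler shipped with Text-to-Onto.

# Hypothetical focused crawler: fetch pages, keep those mentioning the
# domain keywords, and follow their links until enough pages are collected.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_urls, keywords, max_pages=20):
    seen, queue, corpus = set(), list(seed_urls), []
    while queue and len(corpus) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        if any(k.lower() in html.lower() for k in keywords):
            corpus.append((url, html))          # relevant legacy data -> training corpus
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
    return corpus

# corpus = crawl(["https://example.com"], ["hotel", "accommodation"])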

Resource Processing Component

HTML documents may be indexed and reduced to free text.
Semi-structured documents, like dictionaries, may be transformed into a predefined relational structure.
Semi-structured and structured schema data (like DTDs, structured database schemata, and existing ontologies) are handled following different strategies that may (or may not) be discussed later.
For processing free natural text, the system accesses the natural language processing system SMES (Saarbrücken Message Extraction System), a shallow text processor for German. SMES comprises a tokenizer based on regular expressions, a lexical analysis component including various word lexicons, a morphological analysis module, a named entity recognizer, a part-of-speech tagger, and a chunk parser.
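SMES itself is a German-language system we cannot reproduce here, so the following is a toy English pipeline with the same stage names (tokenizer, lexicon lookup, morphology, a fake named-entity check, tagging, chunking). All of it is an illustrative stand-in, not SMES.

# Toy shallow-processing pipeline mirroring the stages the slide lists for SMES.
import re

LEXICON = {"hotel": "N", "room": "N", "offer": "V", "the": "DET", "a": "DET"}

def tokenize(text):                      # regex-based tokenizer
    return re.findall(r"[A-Za-zÄÖÜäöüß]+|\d+|[^\s\w]", text)

def morphology(token):                   # crude stemmer standing in for morphological analysis
    return token[:-1] if token.lower().endswith("s") else token

def tag(tokens):                         # lexicon lookup + capitalisation as a fake NE recogniser
    tagged = []
    for tok in tokens:
        if tok[0].isupper() and tok.lower() not in LEXICON:
            tagged.append((tok, "NE"))
        else:
            tagged.append((tok, LEXICON.get(morphology(tok).lower(), "UNK")))
    return tagged

def chunk(tagged):                       # minimal chunker: runs of DET/N/NE become noun chunks
    chunks, current = [], []
    for tok, pos in tagged:
        if pos in ("DET", "N", "NE"):
            current.append(tok)
        elif current:
            chunks.append(" ".join(current)); current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk(tag(tokenize("The Ritz hotel offers rooms"))))   # ['The Ritz hotel', 'rooms']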
Algorithm Library Component

This is the actual ontology builder, where we revisit our previous model:
– Import/Reuse
– Extraction
– Pruning
– Refining

It then almost introduces some actual algorithms these phases use.
Import/Reuse

Recovering Conceptualizations:
– First, schema structures are identified and
imported separately. This may be done
manually or using reverse engineering tools.
– Second, merging and aligning.
 This is a HUGE body of research that is largely
ignored by this document:
“While the general research issue concerning merging and
aligning is still an open problem, recent proposals (e.g., [8])
have shown how to improve the manual process of
merging/aligning.”
Extraction
Lexical Entry & Concept Extraction
 Hierarchical Concept Clustering
 Dictionary Parsing
 Association Rules

Lexical Entry & Concept Extraction
Uses a statistical technique (n-grams), similar to the product from Cui's presentation on the BioMedicine data extractor, to group multi-word nouns together and associate them with their corresponding verbs.
Every time a new lexical entry is introduced to L, the OntoEngineer must decide whether to include the entry in an existing concept domain or to introduce a new one.
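A toy version of the n-gram idea, assuming a simple co-occurrence score and threshold (both of which are mine, not the paper's): adjacent words that appear together often enough are proposed as one multi-word lexical entry.

# Count adjacent word pairs and keep the pairs frequent enough to be treated
# as a single multi-word lexical entry. Scoring and thresholds are illustrative.
from collections import Counter
from itertools import pairwise   # Python 3.10+

def multiword_candidates(sentences, min_count=2):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = sent.lower().split()
        unigrams.update(words)
        bigrams.update(pairwise(words))
    candidates = {}
    for (a, b), n in bigrams.items():
        # simple association score: how often the pair appears relative to
        # how often either word appears on its own
        score = n / min(unigrams[a], unigrams[b])
        if n >= min_count and score > 0.5:
            candidates[f"{a} {b}"] = score
    return candidates

corpus = ["the swimming pool is heated",
          "every room overlooks the swimming pool",
          "the pool bar closes at ten"]
print(multiword_candidates(corpus))   # {'swimming pool': 1.0}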

Hierarchical Concept Clustering
A useful way of creating a taxonomic classification of concepts.
Done automatically: Text-to-Onto clusters concepts by adjacency of terms and syntactic relationships.
Also done by a cooperative machine learning system, ASIUM, presented by Faure & Nedellec, which uses verb-to-noun and noun-to-verb associations.
"Thus, they cooperatively extend the lexicon, the set of concepts, and the concept heterarchy." (L, C, H_C)
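In the spirit of those verb/noun associations, here is a toy bottom-up clustering sketch: nouns that share verb contexts are merged, and each merge records one step of the emerging heterarchy. The similarity measure and threshold are assumptions; this is not the ASIUM algorithm.

# Toy agglomerative clustering of nouns by the verbs they occur with.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(noun_verbs, threshold=0.5):
    # start with one cluster per noun; each cluster remembers its verb contexts
    clusters = [({noun}, set(verbs)) for noun, verbs in noun_verbs.items()]
    merges = []
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if sim >= threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:
            return clusters, merges
        sim, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        merges.append((clusters[i][0], clusters[j][0], sim))   # one step of the hierarchy
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

noun_verbs = {"hotel":  {"book", "rate", "leave"},
              "hostel": {"book", "rate"},
              "river":  {"cross", "swim"}}
print(cluster(noun_verbs))   # groups hotel/hostel, leaves river on its own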

Dictionary Parsing

This is really only one step further than what we are doing with the lexicons in our own Ontology Editor. The identified dictionary words are used, together with the concept-clustered verb and noun associations, to infer relationships between lexical entries.
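A toy sketch of that step, assuming dictionary glosses of the form "term: a <head noun> ...". The regular expression and the is-a interpretation are my simplifications, not the actual dictionary parser.

# Toy dictionary parser: propose an is-a relation from each term to the head
# noun of its gloss. Pairs like these would feed the concept heterarchy H_C.
import re

GLOSS = re.compile(r"^(?P<term>[\w ]+):\s*(?:a|an|the)\s+(?P<genus>\w+)", re.I)

def parse_dictionary(entries):
    taxonomy = []
    for line in entries:
        m = GLOSS.match(line)
        if m:
            taxonomy.append((m.group("term").strip().lower(), m.group("genus").lower()))
    return taxonomy

entries = ["hotel: a building where travellers pay for rooms",
           "suite: a set of connected rooms in a hotel"]
print(parse_dictionary(entries))   # [('hotel', 'building'), ('suite', 'set')]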
Association Rules
These algorithms are usually used for data mining.
They work by using the taxonomic heterarchy to generalize the lexical entries and thereby draw conclusions about their use.

– "Snacks are purchased together with drinks" instead of "Lay's chips are purchased with Sprite."
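A toy sketch of that generalization step: items in each transaction are lifted to their taxonomy parents before counting co-occurrences, so the frequent pair surfaces at the snack/drink level rather than the brand level. The toy taxonomy and support threshold are assumptions.

# Generalize items via the taxonomy, then keep pairs whose generalized
# co-occurrence is frequent. Thresholds are illustrative, not Text-to-Onto's.
from collections import Counter
from itertools import combinations

TAXONOMY = {"chips": "snack", "pretzels": "snack", "sprite": "drink", "cola": "drink"}

def generalize(item):
    return TAXONOMY.get(item, item)

def frequent_pairs(transactions, min_support=0.5):
    pair_counts = Counter()
    for items in transactions:
        lifted = {generalize(i) for i in items}          # "chips" -> "snack", "sprite" -> "drink"
        pair_counts.update(combinations(sorted(lifted), 2))
    n = len(transactions)
    return {pair: count / n for pair, count in pair_counts.items()
            if count / n >= min_support}

baskets = [{"chips", "sprite"}, {"pretzels", "cola"}, {"chips", "cola", "bread"}]
print(frequent_pairs(baskets))   # {('drink', 'snack'): 1.0}; the bread pairs fall below support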
Example Output from Text-to-Onto
Completeness vs. Scarcity
(Pruning)

Pruning the Ontology:
“It is a widely held belief that targeting
completeness for the domain model on the one hand
appears to be practically unmanageable and
computationally intractable, and targeting the
scarcest model on the other hand is overly limiting
with regard to expressiveness. Hence, what we strive
for is the balance between these two, which is really
working.”
– Staab and Maedche
Import and Reuse, as well as the different extraction methods we've discussed, all tend to introduce unfocused elements into the ontology, since more general rules satisfy the conditional statements far more often.
Pruning is the art of trimming the ontology down to more specific rules.

– First, evaluate how the removal of an item from C (the set of concepts) will affect the rest of the ontology (Petersen [9]: no dangling or broken links).
– Second, based on absolute or relative frequency counts, determine which ontology items are to be either kept or pruned (Kietz [13]).
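A toy sketch of such a frequency-based pruning pass, assuming a domain corpus and a generic corpus for comparison (the ratio threshold is mine): concepts that are not domain-specific are dropped, and taxonomy links are then cleaned so nothing dangles.

# Keep a concept only if its relative frequency in the domain corpus clearly
# exceeds its relative frequency in a generic corpus; then drop is-a links
# that would otherwise dangle.
def prune(concepts, taxonomy, domain_freq, generic_freq, ratio=2.0):
    kept = set()
    for concept in concepts:
        d = domain_freq.get(concept, 0) / max(sum(domain_freq.values()), 1)
        g = generic_freq.get(concept, 0) / max(sum(generic_freq.values()), 1)
        if (g == 0 and d > 0) or (g > 0 and d / g >= ratio):
            kept.add(concept)
    # consistency step: no dangling or broken is-a links after removal
    kept_taxonomy = {(sub, sup) for sub, sup in taxonomy if sub in kept and sup in kept}
    return kept, kept_taxonomy

concepts = {"hotel", "balcony", "thing"}
taxonomy = {("hotel", "thing"), ("balcony", "thing")}
domain = {"hotel": 40, "balcony": 12, "thing": 3}
generic = {"hotel": 5, "balcony": 2, "thing": 50}
print(prune(concepts, taxonomy, domain, generic))   # "thing" is pruned, its links are dropped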
Refine
Hahn and Schnattinger
An incremental approach to updating an ontology, "centered around linguistic and conceptual 'quality' of various forms of evidence [i.e. conflicting and analogous semantic structures] underlying the generation and refinement of concept hypothesis."

Conclusions

Ontology learning gives significant leverage to the Semantic Web.
– It propels the propagation of ontologies.

A multi-disciplinary approach to the problem.
Further Challenges in Learning
XML namespace mechanisms will turn the web into an "amoeba-like" structure, with ontologies supporting and referring to each other (Reuse and Import). It is not yet clear what the semantic result of this evolution will be.
This examination has been restricted almost entirely to RDF-Schema. Additional layers on top of RDF (the future OIL or DAML-ONT) will require new means for improved ontology engineering.

Questions?