Content Management
and the role of taxonomies
Judith Molka-Danielsen
Oct. 13, 2003
Primary Challenges for
Content Management Systems

Heterogeneous Data Sources – create
some normalized representation of data to
provide equal (reading) accessibility for
human and machine alike.



Retrieving data from an RDBMS involves
programmatic access (ODBC, SQL).
HTML files consist of tagged text that mixes stylistic
and structural info. Browsers interpret the same code
in different ways, which confuses automated
programs, although humans manage it.
Word processing applications (Word, Acrobat) store
binary data that is converted to text by a proprietary
interpreter and an associated viewer. We want
interoperability of viewers with other formats.
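A minimal sketch of what a "normalized representation" could look like, assuming a common record shape with `source`, `kind` and `text` fields. The record shape, the `notes` table and the sample page are hypothetical; an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def record_from_html(source_id, html):
    """Normalize tagged text into the common record shape."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"source": source_id, "kind": "html", "text": " ".join(parser.chunks)}


def records_from_rdbms(connection, query):
    """Normalize rows fetched through SQL into the same record shape."""
    rows = connection.execute(query).fetchall()
    return [
        {"source": f"db:{row_id}", "kind": "rdbms", "text": body}
        for row_id, body in rows
    ]


# Hypothetical in-memory database standing in for a production RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES ('Container 7 loaded')")

records = records_from_rdbms(conn, "SELECT id, body FROM notes")
records.append(record_from_html("page:1", "<p>Ship <b>Aurora</b> departed</p>"))
```

Once both sources share one shape, the same search or classification code can read either of them.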
Primary Challenges for
Content Management Systems

Distribution of Data Sources



Access involves use of protocols (HTTP,
HTTPS, FTP, SCP,…) to go through firewalls.
With business applications we still need security
and to limit views to selected individuals and
groups.
Additional protocols (XML, IIOP, SOAP and Web
Services) are being used to build tools for
integrating systems.

To deliver messages to components through HTTP, a
messaging protocol is needed. The Simple Object Access
Protocol (SOAP), written in XML, is emerging as that
protocol.
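As a rough illustration of a SOAP message, the sketch below builds an envelope with the Python standard library. The envelope namespace is the SOAP 1.1 one; the application namespace, the `GetDocument` operation and the `documentId` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/cms"  # hypothetical application namespace


def build_soap_request(operation, params):
    """Wrap an operation name and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


message = build_soap_request("GetDocument", {"documentId": 42})
```

In a real exchange the resulting string would be sent as the body of an HTTP POST to the component's endpoint.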
Primary Challenges for
Content Management Systems

What is being used to identify distributed data
sources: Distributed Directories and protocols




The Domain Name Service (DNS) is a hierarchically
distributed directory of Names (home.himolde.no) and IP
addresses.
The X.500 directory service is a hierarchically distributed
directory of objects. Object attribute-value pairs may be
stored and looked up.
LDAP is a protocol for accessing a directory service. Most
visions of the Web imagine “federated” servers to help find
objects.
UDDI is one protocol for advertising and discovery of Web Services.
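The attribute-value lookup that such directories offer can be sketched with a plain mapping. This is not LDAP or X.500 itself, only the idea of searching a hierarchical name space by attributes; all entry names and attributes below are invented.

```python
# Hypothetical directory entries: hierarchical names mapped to
# attribute-value pairs. The paths are invented for illustration.
DIRECTORY = {
    "no/himolde/docs/report-2003": {"type": "document", "topic": "logistics"},
    "no/himolde/docs/syllabus": {"type": "document", "topic": "teaching"},
    "no/himolde/people/jmd": {"type": "person", "topic": "teaching"},
}


def search(directory, base, **filters):
    """Return the names under `base` whose attributes match every filter."""
    return sorted(
        name
        for name, attributes in directory.items()
        if name.startswith(base + "/")
        and all(attributes.get(key) == value for key, value in filters.items())
    )


matches = search(DIRECTORY, "no/himolde/docs", type="document", topic="logistics")
```

A DNS lookup answers "where is this name?"; a directory search like this one answers "which objects have these properties?".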
The Web Today
[Figure: a client, a web server and four DN servers. Labeled steps: 1. Location Lookup; 2. Object Request.]
The Web with Object
Directories
[Figure: a client, two web servers, four DN servers and an LDAP server. Labeled steps: 1. Registration; 2. Attribute/Value Request and Object/Location Response; 3. The Rest.]
Primary Challenges for
Content Management Systems

Data Size and the Relevance Factor




Large repositories like the WWW
Need a system to drill down to subsets of
relevant information. Speed and automation
are critical. (Find not just more results, but
better ones.)
Find a particular needle in a haystack with a
billion needles.
Find all the needles which are similar to
some other needle which has already been
discovered.
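"Find all the needles similar to one already discovered" can be sketched as a similarity ranking. The example below assumes a simple bag-of-words model with cosine similarity, which is one common choice and not something the slides prescribe; the sample documents are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as word counts (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, documents, top_n=2):
    """Return up to top_n documents ranked by similarity to the query."""
    query_vector = vectorize(query)
    scored = [(cosine(query_vector, vectorize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]


DOCUMENTS = [
    "ship carries container to port",
    "container holds item",
    "rose is a flower",
]
similar = most_similar("container on ship", DOCUMENTS)
```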
What can help?
Semantic web technology



XML and the Resource Description
Framework (RDF) will allow XML tags to be
labeled in conjunction with a referential
knowledge representation.
Machine based inference engines should
replace today's search engines.
New editors are needed to infuse semantic
information into content easily, just as some
editors let users who do not know HTML
syntax create web pages.
Syntactic Integration
Structural Integration
Semantic Integration
RDF
As an example of RDF applied in a logistics context,
we model the three entities ship, container and item.
RDF provides a simple data model for expressing statements
using (subject, predicate, object) triples, and an associated
serialization syntax in XML. All three elements of the triple can
be defined within the current document or refer to another
resource on the Web.
RDF in use


In RDF we can express relations between entities, such
as a ship transports a container, and a container
contains an item. These relations can, but need not, be
hierarchical, e.g. a business can be the owner of the
transported item and at the same time the user of the
container. It is important to note that these relations can
change over time: ownership moves from one business
to another, and containers move from ships to trucks for
further transportation. These transitions may trigger
events, like financial transactions or notifications.
An ontology can be used to define all the concepts and
their meaning used in a certain (set of) schema(s).
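The ship, container and item relations can be sketched as plain (subject, predicate, object) tuples. This is not an RDF library, only the data model; the identifiers are hypothetical, and a real RDF graph would use full URIs.

```python
# Hypothetical identifiers; a real RDF graph would use full URIs.
triples = {
    ("ship:Aurora", "transports", "container:C7"),
    ("container:C7", "contains", "item:I42"),
    ("business:Acme", "owns", "item:I42"),
    ("business:Acme", "uses", "container:C7"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        triple
        for triple in graph
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    )


def move_container(graph, container, old_carrier, new_carrier):
    """Relations change over time: swap the 'transports' statement and
    return a notification text that an event handler could send."""
    graph.discard((old_carrier, "transports", container))
    graph.add((new_carrier, "transports", container))
    return f"notify: {container} moved from {old_carrier} to {new_carrier}"
```

For example, `match(triples, obj="container:C7")` lists everything that refers to the container, whether the relation is hierarchical (transports) or not (uses).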
Components of
Semantic Technology



Classification
Metadata
Ontologies (taxonomies)
Classification



General keyword searches lead to many irrelevant
results.
An automatic classification system could, for example,
divide 1,000 stories into 5 categories, so keyword
searches would be more relevant.
Techniques for classification







Statistical analysis and pattern matching
Rule-based methods
Linguistic analysis
Bayesian theory (probabilistic)
Ontology driven: name-entity and domain-phrase recognition
Committee-based approaches use various techniques
Classification is more precise if documents are tagged
with metadata and conform to a predetermined
schema.
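Of the techniques listed, the Bayesian one is easy to sketch. Below is a minimal naive Bayes text classifier with Laplace smoothing, assuming whitespace tokenization; the training texts and category names are invented.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the category with the highest log posterior probability."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set.
TRAINING = [
    ("ship container port cargo", "logistics"),
    ("cargo truck delivery", "logistics"),
    ("stock market shares", "finance"),
    ("bank shares dividend", "finance"),
]
model = train(TRAINING)
```

With this training set, `classify("cargo ship delayed", *model)` lands in the logistics category even though "delayed" was never seen.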
Metadata


Data about the data
Levels of Metadata



Syntactic
Structural
Semantic
Syntactic Metadata




General information
Little for context determination
Document size, location, date of creation..
Used in




Assessment of the document’s relevance
Version tracking
User level access policies
Email, docs in file systems, have this info.
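For files, this kind of metadata can be read straight from the file system. A small sketch, assuming the modification time as the date field, since a true creation time is not available portably through `os.stat`:

```python
import os
import tempfile
from datetime import datetime, timezone


def syntactic_metadata(path):
    """Collect size, location and date for one file."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size_bytes": info.st_size,
        # Modification time; creation time is platform dependent.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello world")
    sample_path = handle.name

metadata = syntactic_metadata(sample_path)
os.remove(sample_path)
```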
Structural Metadata




Information about the structure of
content
Varies widely with document type
XML allows creators to enclose
content within meaningful tags.
Can make associations between
content from multiple documents.
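Structural metadata can be pictured as the outline of a document's tags. The sketch below walks an XML document and lists the path of every element; the sample report and its tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with meaningful tags.
DOC = (
    "<report><title>Port delays</title>"
    "<section><heading>Ships</heading><para>Aurora is late.</para></section>"
    "</report>"
)


def outline(xml_text):
    """Return the path of every element, in document order."""
    paths = []

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.append(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


structure = outline(DOC)
```

Two documents that share paths such as `/report/title` can be associated on that basis, even before their content is examined.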
Semantic Metadata

Semantic Metadata is “data which may
be associated explicitly or implicitly
with a given piece of content (such as
a document) and whose relevance for
that content is determined by its
ontological position (its context) within
one or more domains of knowledge.”
Semantic Metadata




Metadata receives its contextual information from a
reference knowledgebase.
Metadata that is extracted from any document may
be stored as a snapshot of that document’s relevant
information.
The metadata contained within this snapshot simply
references the instances of name-entities, which
are stored in the ontology.
Each name-entity has related information stored:
synonyms, attributes, related entities.
Semantic Metadata

Documents can link to each other in several
ways



Explicit metadata – docs that mention the same
exact metadata
Implicitly related metadata – docs that contain
synonyms or hierarchically related name
entities.
Ontological associations – by name-entity
associations, e.g. one doc mentions a company
name while another mentions the ticker symbol.
Standards: DCML defines a generic
element set, non-specific to domain of
knowledge. Can be used as a top domain.
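The company name and ticker symbol case can be sketched with a tiny reference knowledge base. The company, its ticker and the matching by substring are all simplifying assumptions made for illustration.

```python
# Hypothetical knowledge base: each name-entity stores synonyms and
# related entities (here, an invented company and its invented ticker).
ENTITIES = {
    "example-widgets": {
        "synonyms": {"example widgets", "example widgets inc"},
        "related": {"xwdg"},
    },
}


def entities_in(text):
    """Return ids of entities whose names, synonyms or related
    symbols occur in the text (naive substring matching)."""
    lowered = text.lower()
    found = set()
    for entity_id, info in ENTITIES.items():
        names = info["synonyms"] | info["related"]
        if any(name in lowered for name in names):
            found.add(entity_id)
    return found


def related(doc_a, doc_b):
    """Two documents are linked when they reference a common entity."""
    return bool(entities_in(doc_a) & entities_in(doc_b))


linked = related(
    "Example Widgets reported record sales.",
    "XWDG rose 3% in early trading.",
)
```

A plain keyword search would not connect these two documents, since they share no words.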
Forms of knowledge representation



Dictionary – terms are the keys and definitions are
the values. There are no links between terms.
Thesaurus – includes antonyms and synonyms.
The pieces of knowledge are linked.
Taxonomy – includes etymological information
(derivation) and synonyms are organized
hierarchically (inheritance).



Flower is a subclass of plant. But a rose may be related to
love. Associations may be emotional, cultural, temporal.
Relevant associations can be discovered by a data-analysis system utilizing a reference knowledge base.
Ontology – is the labeling of the relationship in the
taxonomy.
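One way to see the difference between these forms is to write each one as a data structure. The entries below are invented, and the relation labels `is_a` and `symbolizes` are assumptions chosen to mirror the flower and rose example.

```python
# Dictionary: term -> definition, with no links between terms.
dictionary = {"rose": "a flowering shrub", "flower": "the bloom of a plant"}

# Thesaurus: terms linked to synonyms and antonyms.
thesaurus = {"love": {"synonyms": {"affection"}, "antonyms": {"hate"}}}

# Taxonomy: child -> parent, an unlabeled hierarchy.
taxonomy = {"rose": "flower", "flower": "plant"}

# Ontology: the relationships carry labels, so non-hierarchical
# associations (rose and love) fit next to the hierarchy.
ontology = {
    ("rose", "is_a", "flower"),
    ("flower", "is_a", "plant"),
    ("rose", "symbolizes", "love"),
}


def ancestors(term, parents):
    """Follow child -> parent links up the hierarchy."""
    chain = []
    while term in parents:
        term = parents[term]
        chain.append(term)
    return chain


def relations_of(term, labeled):
    """Return the labeled relationships that start at a term."""
    return sorted((label, target) for source, label, target in labeled if source == term)
```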
Types of Metadata
Ontology Description Languages



Knowledge model building in a given domain is
subjective
Problems combining independently developed
ontologies
The Resource Description Framework (RDF) and RDF Schema (RDF-S) data models try to address this:



Resource – is an item of interest at the atomic level: an entity,
concept or document. Each resource is uniquely identified by
a URI.
Properties – descriptive characteristics and attributes of a
resource. They may be associative, relating one resource to
another.
Statement – is what is known as an RDF triple. It contains a
reference to a resource, a property name, and that
property's value. These identifiers take the form of link
addresses.
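A statement can be written down in the line-based N-Triples notation, where resources and properties appear as URIs in angle brackets. The sketch below is a simplification: the example.org URIs are hypothetical, and literal values are not escaped.

```python
# Hypothetical URIs under example.org; only the triple structure matters.
STATEMENTS = [
    ("http://example.org/ship/aurora",
     "http://example.org/terms/transports",
     "http://example.org/container/c7"),
    ("http://example.org/container/c7",
     "http://example.org/terms/label",
     "Container 7"),
]


def to_ntriples(statements):
    """Serialize (resource, property, value) statements, one per line."""
    lines = []
    for resource, prop, value in statements:
        # Assumption: values that look like http URIs are resources,
        # everything else is a plain literal (no escaping is attempted).
        target = f"<{value}>" if value.startswith("http://") else f'"{value}"'
        lines.append(f"<{resource}> <{prop}> {target} .")
    return "\n".join(lines)


serialized = to_ntriples(STATEMENTS)
```

The first statement relates one resource to another (an associative property); the second attaches a descriptive value to a resource.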
Ontology Description Languages

RDF-S (specification for ontology modeling)
http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
Dublin Core Metadata Initiative
http://dc2003.ischool.washington.edu/program.html
DARPA Agent Markup Language + Ontology
Interface Layer (DAML+OIL) expands on the
RDF-S. Classes are defined as elements and
can be related to other classes in disjunction,
union, or equality.

The W3C has a Web Ontology Language (OWL) that
is based on DAML+OIL.
Meta-data Interpretation


DAML (DARPA) endeavors to interpret a
simple ontology to infer information about
resources.
Put very simply:





If people have names
If students are people
If resource X is about a student
Resource X should have a name
This kind of inference could be easily
constructed within the context of an object-oriented directory.
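The student example can be sketched as property inheritance along a subclass chain. The three lookup tables are hypothetical stand-ins for an ontology; no DAML tooling is involved.

```python
SUBCLASS_OF = {"Student": "Person"}       # students are people
CLASS_PROPERTIES = {"Person": {"name"}}   # people have names
RESOURCE_TYPES = {"X": "Student"}         # resource X is about a student


def expected_properties(resource):
    """Collect the properties inherited along the subclass chain."""
    properties = set()
    current = RESOURCE_TYPES.get(resource)
    while current is not None:
        properties |= CLASS_PROPERTIES.get(current, set())
        current = SUBCLASS_OF.get(current)
    return properties


inferred = expected_properties("X")
```

Nothing in the data says directly that X has a name; the conclusion follows from combining the three facts.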
Schema Interpretation – and integration

Consider two sets of resources:





For set A, the attributes are structured in accord with
the kind of metadata described on the previous
slide.
Imagine the same for set B, but using different
attribute names and values.
Accept that the attribute-values are called resource
descriptions, and that a document called a resource
description schema defines the relations for each set.
Imagine the two schemas are related through a third
schema.
Finally, imagine an engine that relates resources in
set A to resources in set B based on schema-level
inferences.
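A toy version of such an engine: two mappings play the role of the third schema by translating each set's attribute names into common ones, and resources are related when their translated values agree. All attribute names and resource descriptions here are invented.

```python
# Hypothetical "third schema": maps each set's attribute names
# onto shared ones.
SCHEMA_A_TO_COMMON = {"author": "creator", "headline": "title"}
SCHEMA_B_TO_COMMON = {"writer": "creator", "name": "title"}

SET_A = {"a1": {"author": "J. Doe", "headline": "Port delays"}}
SET_B = {
    "b1": {"writer": "J. Doe", "name": "Shipping news"},
    "b2": {"writer": "A. Smith", "name": "Port delays"},
}


def to_common(description, mapping):
    """Rewrite a resource description using the shared attribute names."""
    return {mapping[key]: value for key, value in description.items() if key in mapping}


def related_resources(set_a, set_b, attribute):
    """Pair resources from A and B that agree on a shared attribute."""
    pairs = []
    for id_a, description_a in set_a.items():
        value_a = to_common(description_a, SCHEMA_A_TO_COMMON).get(attribute)
        for id_b, description_b in set_b.items():
            value_b = to_common(description_b, SCHEMA_B_TO_COMMON).get(attribute)
            if value_a is not None and value_a == value_b:
                pairs.append((id_a, id_b))
    return pairs
```

Neither set knows the other's attribute names; the relation between `a1` and `b1` exists only at the schema level.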
The Semantic Web Vision
[Figure: a client, two web servers, four DN servers, three LDAP servers and four schema servers. Labeled steps: 1. Schema Registration; 2. Description Association; 3. Object Query; 4. Inferencing; 5. The Rest.]
Sample Knowledgebases

WordNet is a networked thesaurus, developed at
Princeton, in the form of a lexical matrix. It maps
word forms to word meanings in a many-to-many relationship.
The set of word forms that share one meaning is a synset.
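The lexical matrix can be pictured as a many-to-many mapping between word forms and word meanings. The toy data below is invented and does not use WordNet's own identifiers or glosses; each meaning groups the word forms that express it.

```python
# Toy lexical matrix with invented entries: meaning id -> gloss and
# the word forms that express that meaning.
MEANINGS = {
    "m1": {"gloss": "a financial institution", "forms": {"bank", "depository"}},
    "m2": {"gloss": "sloping land beside water", "forms": {"bank", "shore"}},
}


def meanings_of(form):
    """One word form can map to several meanings."""
    return sorted(mid for mid, entry in MEANINGS.items() if form in entry["forms"])


def forms_of(meaning_id):
    """One meaning can be expressed by several word forms."""
    return sorted(MEANINGS[meaning_id]["forms"])
```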


Open Directory Project


It is not an ontology because it does not contain real world
information required in labeled relationships, such as, a
“branch” is an administrative division with a chairman
above it.
http://www.dmoz.org/
The National Library of Medicine has an ontology
system, the Unified Medical Language System (UMLS),
with researchers and institutions contributing to it.

http://www.nlm.nih.gov/research/umls/
Toolkits – should provide for..






Establishing of configurable parameters
Extraction agents and classifiers modules
The system should accept training sets of
data, and learn from patterns, so future
items are classified without manual trigger.
Easily navigable visual environment
Tracking date and time of data entry
ROADS provides tools for creating subject
gateways, http://www.ilrt.bristol.ac.uk/roads/
Extracting Wrapper Technologies






WysiWyg Web Wrapper Factory (W4F) crawls and
retrieves data from web pages to create wrappers
that represent the content of the pages.
ANDES, uses XPath rules
XWRAP toolkit, has interactive rules formulation
S-CREAM (semiautomatic creation of metadata)
lets the user annotate documents.
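The wrapper idea, in the rule-based style mentioned for ANDES, can be sketched with the limited XPath subset in the Python standard library. This is not how any of the listed toolkits is implemented; the page, the field names and the rules are hypothetical, and the page is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page.
PAGE = (
    "<html><body><h1>Aurora</h1>"
    "<table><tr><td class='k'>Port</td><td class='v'>Molde</td></tr></table>"
    "</body></html>"
)

# Extraction rules: field name -> path expression.
RULES = {"title": ".//h1", "port": ".//td[@class='v']"}


def wrap(page, rules):
    """Apply each rule to the page and return the extracted record."""
    root = ET.fromstring(page)
    return {field: root.findtext(path) for field, path in rules.items()}


record = wrap(PAGE, RULES)
```

The wrapper turns a page meant for human readers into a record that a classifier or metadata annotator can consume.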
Ontoprise (product by Semagix)
http://www.ontoprise.com
BUT, an ontology driven classifier and domain
specific metadata annotator allows searching on
classification by keyword AND on implied entity
association. (SEE example on next slide.)
Semagix Visualizer – is a visualization tool for
viewing an ontology or schema.
Related References

http://bazaar.sis.pitt.edu/ The E-Speak
Initiative at the University of Pittsburgh




E-Speak Overview
(http://bazaar.sis.pitt.edu/es_ppt_over/AIntrotoESpeak_files/frame.htm)
E-Speak Revised
(http://bazaar.sis.pitt.edu/es_ppt_over/AESpeakRevisited_files/frame.htm)
Oracle9i Data Mining Concepts
Oracle9i AS Personalization is used to build
data mining models.