Semantic Lifting
Semantic Lifting for Traditional Content Resources
Semantic CMS Community
Co-funded by the European Union
Copyright IKS Consortium
Part I: Foundations
(1) Introduction of Content Management
(2) Foundations of Semantic Web Technologies
Part II: Semantic Content Management
(3) Knowledge Interaction and Presentation
(4) Knowledge Representation and Reasoning
(5) Semantic Lifting
(6) Storing and Accessing Semantic Data
Part III: Methodologies
(7) Requirements Engineering for Semantic CMS
(8) Designing Semantic CMS
(9) Semantifying your CMS
(10) Designing Interactive Ubiquitous IS
www.iks-project.eu
What is this Lecture about?
We have learned ...
... how to build ontologies representing complex knowledge domains.
... a way to reason about knowledge.
We need a way ...
... to extract knowledge from content in an automatic way: Semantic Lifting
Part II: Semantic Content Management
(3) Knowledge Interaction and Presentation
(4) Knowledge Representation and Reasoning
(5) Semantic Lifting
(6) Storing and Accessing Semantic Data
Overview
What is semantic lifting?
Core concepts
Scenarios
Requirements
Technologies
Semantic Reengineering
Semantic Enhancements of textual content
What is “Semantic Lifting”?
Semantic Lifting refers to the process of associating content items with suitable semantic objects as metadata to turn “unstructured” content items into semantic knowledge resources
Semantic Lifting makes explicit “hidden” metadata in content items
Semantic Lifting Targets
Semantic Reengineering of structured data
Semantic Lifting harmonizes metadata representations
Semantic Lifting reengineers data from an existing resource so that the data can be reused within a semantic repository
Semantic Content Enhancement
Semantic Lifting generates additional metadata and annotations by semantic analysis of content items
Semantic Lifting classifies content objects by means of semantic annotations
Structured Content
Structured content provides implicit semantics through the structure definition
Table definitions in relational databases, XML schemata, field definitions for address books, calendars, etc.
Application programs are designed to “know” how to interpret the structures and the data within.
Semantic Lifting is used for reengineering to support data exchange and seamless interoperability between different systems
Unstructured Content
Unstructured content
Images, texts, videos, music, web pages composed of various types of media items
Meaningful only to humans, not to machines
Content must be described semantically by metadata
to become meaningful to machines, e.g. what the text
or image is about.
Semantic Lifting is used as content enhancement
Mixed Content
No dichotomy of structured and unstructured content
Structured databases are used to store unstructured
content types, such as texts, images etc.
Documents can be composed of unstructured content
items such as free text and images as well as more
structured information, e.g. tables and charts
(Figure labels: free text; structured content)
Metadata: Variants
Metadata exist in many forms:
Free text descriptions
Descriptive, content-related keywords or tags, from fixed vocabularies or in free form
Taxonomic and classificatory labels
Media-specific metadata, such as MIME types, encoding, language, bit rate
Media-type-specific structured metadata schemes such as EXIF for photos, IPTC tags for images, ID3 tags for MP3, MPEG-7 for videos, etc.
Content-related structured knowledge markup, e.g. to specify what objects are shown in an image or mentioned in a text, what the actors are doing, etc.
Metadata: Variants
Inline metadata are part of content
ID3 tags embedded in MP3 files
Offline metadata are kept separate from content
Formal semantic metadata
Data representation in a formalism with a formal semantic interpretation that defines the concept of (logical) entailment for reasoning:
Soundness: conclusions are valid entailments
Completeness: every valid entailment can be deduced
Decidability: a procedure exists to determine whether a
conclusion can be deduced
Embodiments:
Logics
Knowledge Representation Systems, Description Logics
Semantic Web: RDF, OWL
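To make the notion of entailment concrete, the following is a minimal sketch in Python, not a full reasoner. It forward-chains only two RDFS rules (rdfs9 and rdfs11 from the RDF Semantics specification) over triples written as plain tuples. The prefix `ex:` and the class `ex:City` are hypothetical names chosen for illustration; every triple the function adds is a sound consequence of the input, but the procedure is far from complete for RDFS as a whole.

```python
def rdfs_closure(triples):
    """Forward-chain two RDFS rules until no new triple is produced."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # rdfs11: rdfs:subClassOf is transitive
                if p == p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdfs:subClassOf", o2))
                # rdfs9: an instance of a subclass is an instance of the superclass
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and o == s2:
                    new.add((s, "rdf:type", o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical example data (ex:City is an assumption, not from the slides)
FACTS = {
    ("ex:Paderborn", "rdf:type", "ex:City"),
    ("ex:City", "rdfs:subClassOf", "ex:PopulatedPlace"),
    ("ex:PopulatedPlace", "rdfs:subClassOf", "ex:SpatialThing"),
}
closure = rdfs_closure(FACTS)
```

Because the rule set is finite and only adds triples over a fixed vocabulary, the loop always terminates, which is a small-scale illustration of decidability.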
„Semantics“ in CMS
CMSs provide various methods to include metadata
Organize content in hierarchies
Hierarchical taxonomies
Attachment of properties to content items for metadata
Content type definitions with inheritance
These methods are used in CMSs in an ad hoc fashion without clear semantics. Therefore, no well-defined reasoning is possible.
Semantic Lifting Usage
Content Creation and Acquisition
Authoring content
Uploading external content/documents
Automatic extraction and analysis, e.g. for indexing
Importing content from external sources/documents
Support content editors in providing metadata of specified types
Integration of external content into the content repository
Content needs to be transformed to match internal CMS structures and metadata schemes
Cross-referencing/linking among CMS content items and external content
Detect related or additional content
Add pointers/links to related or additional content
Semantic Lifting Usage
Access to external documents and content repositories
Semantic harmonization with CMS semantic structures
Semantic interoperability in data exchange with other content repositories
The CMS needs to understand the data structures used by external services and programs
E.g. synchronization of a local calendar from Outlook with an external calendar based on the iCalendar format
E.g. importing RDF from a Linked Data endpoint such as DBpedia
The CMS must present its data in a form understood by external target services or programs
Semantic Lifting Usage
Publishing content with metadata
Metadata need to be transformed into a form compatible with the publication format
E.g. converting FreeDB metadata into ID3 tags for inclusion in an MP3 file
Publishing Web Content with Semantic Metadata
Augmenting web content with structured information becomes increasingly important
Several methods have emerged in recent years to include structured metadata in Web pages
Microformats
RDFa
Microdata (HTML5)
Supported by the major search engines to improve search and result presentation, e.g. Google (“Rich Snippets”), Bing, Yahoo
Augmenting Web Content
The HTML code contains a review of a restaurant in plain text, using only line breaks for structuring
Without specialized information extraction tools, a program cannot interpret it: that it is a review (of what, and when?), who the reviewer was, etc.
<div>
L’Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza,
brought to your table promptly and without fuss. An ideal neighborhood
pizza joint.
Rating: 4.5
</div>
Microformats
Same text but additional span elements with class attributes to
encode the type of contained information (hReview) and the
properties of that type
<div class="hreview">
<span class="item">
<span class="fn">L’Amourita Pizza</span>
</span>
Reviewed by <span class="reviewer">Ulysses Grant</span> on
<span class="dtreviewed">
Jan 6<span class="value-title" title="2009-01-06"></span>
</span>.
<span class="summary">Delicious, tasty pizza on Eastlake!</span>
<span class="description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza, brought to your table promptly and without fuss.
An ideal neighborhood pizza joint.</span>
Rating:
<span class="rating">4.5</span>
</div>
RDFa
Same text but additional attributes and span elements encoding an RDF structure:
Namespace declaration of the ontology used
RDF class encoded by a typeof attribute and its properties by property attributes
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
<span property="v:itemreviewed">L’Amourita Pizza</span>
Reviewed by
<span property="v:reviewer">Ulysses Grant</span> on
<span property="v:dtreviewed" content="2009-01-06">Jan 6</span>.
<span property="v:summary">Delicious, tasty pizza on Eastlake!</span>
<span property="v:description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza, brought to your table promptly and without fuss.
An ideal neighborhood pizza joint.</span>
Rating:
<span property="v:rating">4.5</span>
</div>
Microdata (HTML5)
Same text but additional attributes and span elements:
A class declaration as value of an itemtype attribute and its
properties as values of an itemprop attribute
<div>
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span itemprop="itemreviewed">L’Amourita Pizza</span>
Reviewed by <span itemprop="reviewer">Ulysses Grant</span> on
<time itemprop="dtreviewed" datetime="2009-01-06">Jan 6</time>.
<span itemprop="summary">Delicious, tasty pizza in Eastlake!</span>
<span itemprop="description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza,
brought to your table promptly and without fuss. An ideal neighborhood pizza
joint.</span>
Rating: <span itemprop="rating">4.5</span>
</div>
</div>
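The benefit of such markup is that a program can read the review's properties without any linguistic analysis. The following is a minimal sketch using only Python's standard `html.parser` module. The class name `ItempropCollector` and the helper `extract_review` are hypothetical, and the sketch handles only the flat, non-nested case shown above, not the full HTML5 microdata extraction algorithm.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects itemprop name/value pairs from flat microdata markup."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # name of the itemprop being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self._buffer = []
            # <time> carries its machine-readable value in the datetime attribute
            if "datetime" in attrs:
                self.properties[self._current] = attrs["datetime"]
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            # Normalize whitespace in the element text and store it
            self.properties[self._current] = " ".join("".join(self._buffer).split())
            self._current = None


SAMPLE = """
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span itemprop="itemreviewed">L'Amourita Pizza</span>
Reviewed by <span itemprop="reviewer">Ulysses Grant</span> on
<time itemprop="dtreviewed" datetime="2009-01-06">Jan 6</time>.
Rating: <span itemprop="rating">4.5</span>
</div>
"""


def extract_review(html):
    """Return the item type and a dict of its properties."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.properties
```

Calling `extract_review(SAMPLE)` yields the item type URL and a dictionary keyed by property name, which a CMS could map onto its own metadata scheme.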
Lifting Requirements: Overview
Top-level requirements
Semantic Associations with Content
Semantic Harmonization
Semantic Linking
Interactive Lifting
Customizability
Semantically Transparent Structured Content Sources
Semantic Associations with Content
Unstructured content and information must be supplied with structured semantic annotations and metadata.
Support for various content/media types
Information extraction from text, topic classification, image tagging, …
Support for creation of semantic annotations in content authoring
Semantic Harmonization
Metadata and annotations must be harmonized with requirements for semantic processing in the CMS
Reengineering methods, interpreters and wrappers for all types and formats of metadata and annotations, e.g. tags, microformats, XML metadata (MPEG-7, …), ID3 tags, EXIF data, …
Ensure semantic interoperability of data and annotation schemes within the CMS and across external resources
Ontology mapping and harmonization of annotations
External metadata
Metadata generated by semantic analysis
Semantic Linking
Lifting must enable the interlinking of content objects by semantic relationships.
Internal linking of content items within the CMS
Links to external resources, e.g. Linked Open Data
Establish semantic relatedness of content for different views as well as different search, navigation and browsing strategies, …
Direct semantic links among content items and metadata
Similarity relations over sets of content items
Clustering of content items
Interactive Lifting
Lifting must interact with CMS users.
Suggest semantic annotations during content creation
Support for various publishing formats such as microformats, RDFa, etc.
Automatic annotations (autotagging) with an option for manual correction
Learning capabilities and adaptability of automatic annotation components from user feedback
Customizability
Lifting components must be customizable by CMS users/customers.
Users must not be restricted to predefined vocabularies, ontologies, …
Domain ontologies, terminologies, tag sets are defined by CMS users/customers.
Browsers and editors for component resources are necessary.
Transparent Structured Content Sources
Structured content sources need to be reengineered into semantic resources
Support uniform data access to structured content repositories, e.g. SPARQL endpoints based on D2RQ technologies for transparent access to RDF and non-RDF databases
Extraction of ontologies from database structures, schemata, XML, resources, …
Alignment and mapping of the descriptions
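The basic idea behind reengineering a relational source can be sketched in a few lines: each row becomes a resource, and each column becomes a property of that resource. The sketch below assumes Python's standard `sqlite3` module, a hypothetical namespace `http://example.org/`, and a hypothetical `person` table. It only illustrates the row-to-triple mapping and emits N-Triples-style strings; it is not the D2RQ mapping language or any tool's actual API.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, an assumption
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def table_to_triples(conn, table, key):
    """Map each row to a resource and each non-key column to a property."""
    # Table names cannot be bound as SQL parameters; pass trusted names only.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        # The table name serves as the class of the resource
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}{table}> .")
        for column, value in record.items():
            if column == key or value is None:
                continue  # NULL values produce no triple
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}{table}#{column}> "{literal}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ulysses Grant', NULL)")
triples = table_to_triples(conn, "person", "id")
```

A real mapping would additionally align the generated properties with an existing ontology, which is the alignment and mapping step named above.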
Semantic Reengineering of structured data sources
Focus on tools for reengineering structured data sources to RDF representations
Many tools and platforms exist for this purpose:
D2R Server: exposes relational DBs as RDF
Talis platform: Linked Open Data
Triplify: like D2R but in PHP
Virtuoso middleware
Krextor/OntoCape: generating RDF from XML
Various transformers for inducing RDF ontologies and instance data from XSD and XML
More details in the presentation on Knowledge Representation (KReS)
Semantic Content Enhancements: Overview
Focus here is on textual content
Metadata Extraction from existing content in various formats to make embedded metadata explicit
Information Extraction from textual content:
Named Entities
Coreference
Relationships
Classification and Clustering of content items
Statistical methods and tools
Semantic classification based on ontological definitions
Information Extraction
Rule based approaches for shallow text analysis
Usually based on Finite State technology: fast, robust
Cascaded processing
Based on templates as target structures to be filled
Example platforms:
GATE
SProUT
Can be used for nearly any kind of extraction/annotation task,
including Named-Entity-Recognition (NER)
Easy customization
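As a toy illustration of the rule-based, template-filling style, the sketch below uses Python regular expressions, which are themselves a finite-state formalism. The first stage defines shallow entity patterns; the second stage combines them into a rule that fills an employment template. All patterns, the template, and the function name are assumptions made for illustration; they are not rules from GATE or SProUT, and real grammars are far more extensive.

```python
import re

# Stage 1: shallow entity patterns (illustrative, deliberately narrow)
ORG = r"(?P<org>(?:[A-Z][\w&]*\s)+(?:Corp\.|Inc\.|Ltd\.))"
ROLE = r"(?P<role>CEO|CTO|President|Chairman)"
PERSON = r"(?P<person>[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"

# Stage 2: a template rule built on top of the stage-1 patterns
EMPLOYMENT = re.compile(rf"{ORG}\s{ROLE}\s{PERSON}")


def fill_employment_templates(text):
    """Return one filled template (a dict of slots) per match of the rule."""
    return [match.groupdict() for match in EMPLOYMENT.finditer(text)]


templates = fill_employment_templates(
    "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"
)
```

Customizing such a system means editing the patterns directly, which is why rule-based approaches are considered easy to adapt compared with retraining a statistical model.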
Information Extraction
Semi-supervised learning approaches
Rule induction from corpora
Use example annotations as seeds for bootstrapping
Pattern rules learned from contextual features with generalization over contexts
Named Entities
Statistical Approaches: examples
Lingpipe: Hidden Markov Models
OpenNLP: Maximum Entropy Models
Stanford NER: Conditional Random Fields
Statistical models created by supervised learning techniques
Large annotated corpora required
Customization difficult except by re-annotation/re-training
Not suitable for every type of named entity
NER Document Markup
NER Markup for a Web Page
IE Template
A Person template (as a Typed Feature Structure) instantiated from text.
The template supports the extraction of various properties of a person.
Classification
Assign a data item to some predefined class
Statistical classification
Numerous methods, e.g.:
Bayes classifiers
K-Nearest Neighbor (KNN)
Support Vector Machines (SVM)
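As one concrete instance of statistical classification, here is a minimal K-Nearest Neighbor sketch in plain Python. The feature vectors and labels are made-up illustration data, and `math.dist` assumes Python 3.8 or later. In text classification the vectors would typically be term-weight vectors derived from documents.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (feature vector, class label)
TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
```

For example, `knn_classify(TRAIN, (1.1, 1.0))` votes among the three nearest points, all labeled "a". KNN needs no training phase, but each prediction scans the whole training set.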
Semantic Classification
Semantic classification in Knowledge Representation Formalisms
Infer the item's class from the item's properties by matching them with the class definitions: Which classes allow for these properties?
Assume that our ontology contains 2 classes with some properties:
SpatialThing: latitude, longitude
PopulatedPlace: population
Paderborn is an object with latitude “51°43′0″N”, longitude “8°46′0″E” and a population of 146283.
Then we can infer that Paderborn is a SpatialThing, since those are the things that have latitudes and longitudes in our ontology. We can also infer that it is a PopulatedPlace, since those are the things that have a population.
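The inference in the Paderborn example can be sketched as a simple subset test in Python. This is a deliberate simplification: it treats a class as applicable when the item has all of that class's defining properties, whereas an actual RDFS or Description Logic reasoner works from axioms such as property domains. The dictionary and function names are assumptions made for illustration.

```python
# Class definitions: the properties whose presence implies class membership
CLASS_DEFINITIONS = {
    "SpatialThing": {"latitude", "longitude"},
    "PopulatedPlace": {"population"},
}


def infer_classes(properties):
    """Return every class whose defining properties the item all has."""
    present = set(properties)
    return sorted(
        name
        for name, required in CLASS_DEFINITIONS.items()
        if required <= present
    )


paderborn = {
    "latitude": "51°43′0″N",
    "longitude": "8°46′0″E",
    "population": 146283,
}
```

Here `infer_classes(paderborn)` returns both classes, while an item with only a population would be classified as a PopulatedPlace alone.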
Clustering
Detection of classes in a data set
Partitioning data into classes in an unsupervised way with
high intra-class similarity
low inter-class similarity
Main variants:
Hierarchical clustering: agglomerative
Partitioning clustering: K-Means
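A minimal K-Means sketch in plain Python follows. It assumes the initial centroids are supplied by the caller, which keeps the example deterministic, and it runs a fixed number of iterations rather than testing for convergence. The sample points are made-up illustration data, and `math.dist` assumes Python 3.8 or later.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain K-Means: assign points to the nearest centroid, then recompute."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its cluster
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
assignment, centroids = k_means(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```

The assignment step produces high intra-class similarity (points sit near their own centroid), and the update step moves centroids apart, which lowers inter-class similarity.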
Tools for Classification and Clustering
Generic:
WEKA: Java library implementing several dozen methods
for data mining. Application to textual data requires special
preprocessing.
Text:
MALLET: Java library with implementations of major
methods for text and document classification and
clustering
Evaluation Measures
Standard evaluation measures for IE/IR etc. systems:
Accuracy: $\mathrm{acc} = \frac{tp + tn}{tp + fp + tn + fn}$
Precision: $\mathrm{prec} = \frac{tp}{tp + fp}$
Recall: $\mathrm{recall} = \frac{tp}{tp + fn}$
F-Measure: $F = \frac{2 \cdot \mathrm{prec} \cdot \mathrm{recall}}{\mathrm{prec} + \mathrm{recall}}$
tp = true positive
tn = true negative
fp = false positive
fn = false negative
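The four measures translate directly into code. The sketch below is plain Python; returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions prescribe.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all decisions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Share of positive decisions that were correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

The F-measure is a harmonic mean, so it stays low unless precision and recall are both high.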
Evaluation Measures: Classification
A confusion matrix which reports on the classification of 27 wines by grape variety. The reference in this case is the true variety and the response arises from the blind evaluation of a human judge.
Many-way Confusion Matrix (rows: reference, columns: response)
Reference       Cabernet  Syrah  Pinot  Precision  Recall  F-Measure
Cabernet        9         3      0      0.69       0.75    0.72
Syrah           3         5      1      0.56       0.56    0.56
Pinot           1         1      4      0.80       0.67    0.73
Macro average                           0.68       0.66    0.67
Overall accuracy: 0.67
Examples: precision for Cabernet = 9/(9+3+1); recall for Pinot = 4/(1+1+4)
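The per-class figures for the wine example can be recomputed from the matrix itself. In the Python sketch below, precision for a class divides the diagonal cell by its column total (all responses of that class) and recall divides it by its row total (all references of that class). The function names are assumptions made for illustration.

```python
LABELS = ["Cabernet", "Syrah", "Pinot"]
# Rows are reference classes, columns are responses (values from the slide)
MATRIX = [
    [9, 3, 0],
    [3, 5, 1],
    [1, 1, 4],
]


def per_class_scores(matrix):
    """Return a (precision, recall, f_measure) tuple per class index."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        column_total = sum(row[i] for row in matrix)  # all responses of class i
        row_total = sum(matrix[i])                    # all references of class i
        p = tp / column_total
        r = tp / row_total
        scores.append((p, r, 2 * p * r / (p + r)))
    return scores


def overall_accuracy(matrix):
    """Correct decisions (the diagonal) divided by all decisions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def macro_average(scores):
    """Unweighted mean of each measure over the classes."""
    return tuple(sum(column) / len(scores) for column in zip(*scores))
```

Macro averaging gives every class the same weight regardless of how many items it contains, so a small class such as Pinot influences the average as much as Cabernet does.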
Evaluation Measures: NER
Reference annotations:
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
Recognized annotations:
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
Category  Count  Entities
TP        1      [Microsoft Corp.]
FP        3      [CEO], [Steve], [today]
FN        2      [Windows 7], [Steve Ballmer]
TN        -      -
Precision: 1/(1+3) = 0.25
Recall: 1/(1+2) = 0.33
F-Measure: 2*0.25*0.33/(0.25+0.33) = 0.28
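The counts in this example follow from exact-match scoring: a recognized entity counts as a true positive only if it matches a reference entity completely, so the partial match [Steve] is a false positive and [Steve Ballmer] a false negative. The Python sketch below compares entity strings as sets, which is a simplification; real evaluations usually compare character offsets and entity types. The function name is an assumption.

```python
def span_scores(reference, recognized):
    """Exact-match scoring: a span counts only if it matches completely."""
    reference, recognized = set(reference), set(recognized)
    tp = len(reference & recognized)   # found and correct
    fp = len(recognized - reference)   # found but not in the reference
    fn = len(reference - recognized)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f": f}


reference = ["Microsoft Corp.", "Steve Ballmer", "Windows 7"]
recognized = ["Microsoft Corp.", "CEO", "Steve", "today"]
scores = span_scores(reference, recognized)
```

Computed without intermediate rounding, the F-measure is 2/7, about 0.286; the slightly lower figure in the worked example results from plugging in the rounded recall value.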
NER Evaluation
Nobel Prize Corpus from NYT, BBC, CNN
538 documents (Ø 735 words/document)
28948 person, 16948 organization occurrences
            SProUT  Calais  Stanford NER  OpenNLP
Precision   77.26   94.22   73.21         57.69
Recall      65.85   86.66   73.62         42.86
F1          71.10   90.28   73.41         49.18
References
Microformats: http://microformats.org/
RDFa: http://www.w3.org/TR/xhtml-rdfa-primer/
Google Rich Snippets: http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html
Linked Data: http://linkeddata.org/guides-and-tutorials
Linked Data: Heath and Bizer, Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011. (Online: http://linkeddatabook.com/book)
Information Extraction: Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer, 2006.
Text Mining: Feldman and Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. CUP, 2007.