
Inconsistent Data on the Semantic Web
A Theoretical Approach
Brian Goodrich
The Problem
A computer application has a set of inputs and a set of outputs determined by those inputs and its internal logic.
If an application is given input data that puts it in a conflicted state when deciding its output, it will crash unless it has some kind of logic by which to resolve that conflict.
The Semantic Web is based on being able to parse human intent from structured, semi-structured, and unstructured data on the Web.
Human intent is frequently conflicting.
Conflicting Data Sources
- Malicious – deceptive or rerouting attempts, or just ignorantly incorrect information
- Incomplete Information – having insufficient context, or simply unfinished data
- Humor – especially sarcasm, satire, and exaggeration (e.g. political cartoons)
- Time – what once was one thing is now another (e.g. quality of service, price, etc.)
- Ontological Deficiency – when the extraction ontology lacks sufficient vividness to separate data appropriately
Solution
- Fast – maintain the current speed of the Web.
- Accurate – make correct decisions about which data to rely on.
- Dynamic – keep pace with change on the Web.
Thesis
To propose a method for simplifying the task of dealing with conflicting data on the Semantic Web in a fast, accurate, and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).
Methods
CRSBot inputs:
- Site Type (a)
- Incoming Index (b)
- Usage Mining (c)
- Direct Survey (d)
A formula derives the CRS from inputs a, b, c, and d, with weighted constants z, y, x, and w (see the sketch below).
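The per-input slides that follow pair each input with its constant as (a * z), (b * y), (c * x), and (d * w). Read as a weighted sum, the derivation would look like this minimal Python sketch; combining the four terms by addition is an assumption, since the slides do not state the combining operator:

    def derive_crs(a, b, c, d, z=1.0, y=1.0, x=1.0, w=1.0):
        # Pair each CRSBot input with its weighted constant, as in the
        # slide titles. The default weights of 1.0 are placeholders; the
        # slides note that x and w in particular need more experimentation.
        return (a * z) + (b * y) + (c * x) + (d * w)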
Site Type Mining (a * z)…
Five types of Web pages:
- Head Pages
- Navigation Pages
- Content Pages
- Look-up Pages
- Personal Pages
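The slides do not say how a page's type becomes the numeric input a. As a purely hypothetical illustration, a lookup table could do it; the five page types come from the list above, but the function name and every score below are assumptions, not values from the presentation:

    SITE_TYPE_SCORE = {
        "head": 0.9,        # head pages (hypothetical score)
        "navigation": 0.5,  # navigation pages
        "content": 1.0,     # content pages
        "lookup": 0.8,      # look-up pages
        "personal": 0.2,    # personal pages
    }

    def site_type_input(page_type):
        # Unknown page types contribute nothing toward the CRS.
        return SITE_TYPE_SCORE.get(page_type, 0.0)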
Incoming Index …(b * y)…
- A distributed web crawler counts hyperlinks, then traverses the unique hyperlink paths looking for additional links (sketched below).
- Link counts are stored in a hash indexed by the destination of the hyperlinks.
- This provides a dynamic count of how often the internet as a whole points to a given web source, and therefore an indication of how often people use that source.
- Excludes orphan sites (mostly personal sites and spam pop-ups).
- Based on the success of the Google search engine.
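A minimal single-process sketch of that link-counting idea, using only standard-library Python. The slide describes a distributed crawler; distribution, politeness, and robustness are omitted here, and all names are illustrative:

    from collections import Counter
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collect the href of every anchor tag on a page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        # Hash indexed by hyperlink destination, as the slide describes.
        counts = Counter()
        seen, frontier = set(), list(seed_urls)
        while frontier and len(seen) < max_pages:
            url = frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue  # unreachable pages are simply skipped
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                dest = urljoin(url, href)
                counts[dest] += 1              # count every incoming hyperlink
                if dest not in seen:
                    frontier.append(dest)      # traverse unique hyperlink paths
        return counts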
Usage Mining …(c * x)…
The most straightforward approach to testing how often people use a web source: query the site's number of hits, i.e. how many people have seen the site (sketched below).
Problem: unlike the Incoming Index method, this does not exclude orphan sites.
Further experimentation is needed to determine x's weight.
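A minimal sketch of turning a raw hit count into input c. The slide only says to query the site's number of hits, so normalizing against a reference traffic volume below is an assumption:

    def usage_input(hit_count, reference_hits):
        # Scale a site's raw hit counter into [0, 1] for input c. Note this
        # still cannot exclude orphan sites, unlike the Incoming Index.
        if reference_hits <= 0:
            return 0.0
        return min(hit_count / reference_hits, 1.0)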
Direct Survey …(d * w)…
The most reliable method of determining reliability: manually query users directly.
Too slow and costly to be considered a whole solution, but it can assist in CRS derivation.
It can hopefully offset frequently visited sites with no true information (onion.com, humor, etc.).
More experimentation is needed to determine w's weight.
Review
Semantic Agents
Semantic Browsers
CRSBot inputs:
- Site Type (a)
- Incoming Index (b)
- Usage Mining (c)
- Direct Survey (d)
“Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.”
– Brian Goodrich
Storage
Global Index
- Fast access.
- Centralized storage for CRSBot.
- Centralized vulnerability.
- A vital non-distributed resource in a distributed system.
Local Storage
- Non-centralized vulnerability.
- Non-unified derivation formula (disrupts the trust algorithm).
Local Derivation
- Too slow to be useful (the problem size is too large).
Related Work
Tim Berners-Lee:
“There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, ‘whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432’. The other is to say, ‘whatever is of form xxxx and can be inferred from information signed with key 32457934759432’.”
There are problems with both choices, but both use static references in a dynamic environment (the web).
Contributions
The CRS provides a fast and accurate measure of community consensus on the web.
It allows reliable decisions between conflicting data on the web, fine-tuning the results from the Semantic Web.
Limitations
Totally reliant on the usage patterns of the internet, which may not always reflect which data is more correct.
Reflects only consensus about a data source, not the actual data contained in it.
Cannot express complex or compound
relationships or extract partial truths.
Questions?