Semantic Web in the real world

Semantic Web In Industry
R. Guha
Two Levels of the Semantic Web
• Deep Semantic Web:
– Intelligent agents performing inference
– Semantic Web as distributed AI
– Small problem … the AI problem is not yet solved
• Shallow Semantic Web: using SW/Knowledge
Representation techniques for
– Data integration
– Search
– Is starting to see traction in industry
Integration: The new buzzword in
business
• Huge explosion in the number of new databases,
applications, documents, … in the 90s
– Lots of redundancy, duplication … => high inefficiency
• Economic pressures forcing consolidation and
efforts to reduce inefficiency
• Two aspects to integration: Process & Data
– Process integration depends on data integration
Data Integration for Science
• Many experimental fields will generate more data
in the next 2 years than exists today
• Large part of research consists of writing programs
to analyze data, e.g., NASA
• Tools to normalize, share, integrate data stuck in
the 80s (ftp, perl, …)
• Semantic Web could create a “web of data” that
changes all this.
• Example of the Internet Observatory
Varieties of Data Integration: Data
Transformation
• Data Transformation Example
– Contact Information in SAP, Siebel, PeopleSoft, …
– We want to reflect updates in one data source into another
[Diagram labels: Siebel, Clarify, PeopleSoft, App. Server, XSLT, etc.]
Varieties of Data Integration: Data
Aggregation
• Data Aggregation Example
– Clinical trial data at Stanford, UCSF, Mayo …
– We want to give a Meta-analyst a uniform view of data
from these different clinical trials
– Example of how this would have helped recent meta
studies such as the estrogen study
[Diagram labels: Meta-Analyst, Relational Views, DBMS, UCSF, Stanford, Mayo]
Data Integration Layers
• Coping with software from different vendors
– Oracle vs. DB2 vs. SQL Server … this is a solved problem
• Coping with different formats
– Relational vs. XML vs. ISAM… this too is a solved problem
• Coping with different schemas
– Solved for the small case where one person understands all the
schemas
– No products for the case where it is truly distributed
• We know how to do it in theory, but lots of practical problems
• Coping with data from unknown sources
– Wide open … lots of unsolved problems
Typical Data Integration
Methodology
• Use a common namespace of terms for the concepts in the
domain of the data sources being integrated, e.g., Employee,
Customer, Patient, weight, height, bodyTemperature, …
• Mappings relate data items in data sources to terms in
namespace
• Transformation algorithms map queries in terms of common
namespace into corresponding queries in terms of data
source vocabularies
• Background knowledge about terms is essential for
transformations … e.g., Employee subClassOf Person, or 2
people with the same last name, first name and street
address are likely to be the same person; i.e., the common
namespace is really an ontology
• Mappings and common namespace are the workhorse
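The methodology above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular product: the source names (DB1, DB2), the table names, and the helper functions are all hypothetical. It shows a query posed against a common-namespace class being rewritten into per-source queries, using subClassOf background knowledge so that a query for Person also reaches tables mapped to Employee or Customer.

```python
# Background knowledge (hypothetical): child class -> parent class.
SUBCLASS_OF = {"Employee": "Person", "Customer": "Person"}

# Mappings (hypothetical): for each data source, common-namespace class -> table.
CLASS_TABLES = {
    "DB1": {"Employee": "emp"},
    "DB2": {"Customer": "cust", "Person": "people"},
}


def subclasses(cls: str) -> set[str]:
    """Return cls plus every class that is transitively a subclass of it."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def rewrite(cls: str) -> dict[str, list[str]]:
    """Rewrite a query over a common-namespace class into queries per source."""
    wanted = subclasses(cls)
    queries = {}
    for source, tables in CLASS_TABLES.items():
        # Only classes that the source actually maps contribute a query.
        queries[source] = sorted(
            f"SELECT * FROM {tables[c]}" for c in wanted if c in tables
        )
    return queries


if __name__ == "__main__":
    print(rewrite("Person"))
```

Without the subClassOf facts, the query for Person would miss DB1 entirely, which is why the background knowledge matters as much as the mappings.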
Role of Semantic Web in Data
Integration
• The XML stack (XML, XSD, XPath, XQuery, …) does
not have the concepts (objects, classes, properties,
…) required for representing ontologies
• RDF/S does …
• Neither of them has a language for
expressing mappings
– But RDF/S, being closer to logic, has more of the
machinery that is required
Kinds of Mappings
• Simple structural
– DB1.patient.weight corresponds to Patient’s weight
• Conditional structural
– If DB1.patient.type equals Outpatient then
DB1.patient.foo corresponds to Patient’s visits duration …
• Term mappings
– CA in DB1 corresponds to California in domain namespace
– Object with ssn 7687667 in database 1 corresponds to
object with id “aksdks” in database 2
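The three kinds of mappings can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the row layout for DB1.patient, the "state" column, the output keys, and the to_common helper are hypothetical. Only the weight, type/foo, and CA/California examples come from the slide.

```python
# Term mapping (from the slide): a source code -> the domain-namespace term.
TERM_MAP = {"CA": "California"}


def to_common(row: dict) -> dict:
    """Translate one hypothetical DB1.patient row into common-namespace terms."""
    out = {}

    # Simple structural: DB1.patient.weight corresponds to Patient's weight.
    out["weight"] = row["weight"]

    # Conditional structural: foo is the visit duration only for outpatients.
    if row.get("type") == "Outpatient":
        out["visitsDuration"] = row["foo"]

    # Term mapping: replace a source-specific code; unknown codes pass through.
    out["state"] = TERM_MAP.get(row["state"], row["state"])
    return out


if __name__ == "__main__":
    print(to_common({"weight": 70, "type": "Outpatient", "foo": 45, "state": "CA"}))
```

Object-level term mappings (the ssn-to-id example) would need a lookup table that is maintained over time rather than a fixed dictionary.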
Challenges and non-challenges in
data integration
• Non-challenge: algorithms for doing the
transformations (ISI, MCC, SU & AT&T)
• Engineering Challenges
– Creating large, useful ontologies that are shared by many
– Creating mappings
• Research Challenges
– Semantic Drift
– Fuzzy terms, probabilistic mappings
– Trust
Engineering Challenges
• Creating large, detailed ontologies is complex and
expensive
– But it is happening … CrossWorlds for business concepts,
MAGE, etc. for medicine
– Danger: some of them might turn out to be proprietary
• Creating mappings is tedious and time consuming
• Object mappings pose special challenges
– Mappings need to be dynamic and constantly updated
Research Challenges with mappings
• Semantic Drift
– The meaning of terms, as interpreted by different members of a
community, can drift over time
– Cyc experience shows that Description Logic mechanisms are not
adequate for either detecting or fixing these
• Fuzzy mappings
– E.g., Walmart’s concept of chair is similar to but not the same as
MoMA’s concept of chair
• Probabilistic mappings
– There is an 82% likelihood that Michael Jordan in database 1 is
the same as Michael Jordan in database 2
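One very simple way to attach a number to an object mapping is a weighted field-agreement score, sketched below in Python. This is an assumption-laden illustration, not a method from the talk: the field names, the weights, and the match_score helper are hypothetical, and the result is an uncalibrated score between 0 and 1 rather than a true probability.

```python
# Hypothetical weights: how much agreement on each field counts toward a match.
WEIGHTS = {"last_name": 0.3, "first_name": 0.3, "street_address": 0.4}


def normalize(value: str) -> str:
    """Lower-case and trim a field so trivial differences do not block a match."""
    return value.strip().lower()


def match_score(a: dict, b: dict) -> float:
    """Sum the weights of fields present in both records whose values agree."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va is not None and vb is not None and normalize(va) == normalize(vb):
            score += weight
    return score


if __name__ == "__main__":
    rec1 = {"last_name": "Jordan", "first_name": "Michael", "street_address": "1 Main St"}
    rec2 = {"last_name": "jordan", "first_name": "Michael", "street_address": "9 Elm Ave"}
    print(match_score(rec1, rec2))
```

Two records that agree on name but not address get a middling score, which reflects the slide's point: the same name in two databases is evidence of identity, not proof.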
Other data web related challenges
• Trust: How should the program know whether to trust some
new data source?
– Without this, we will only have closed systems
– Options: centralized approaches like UDDI or decentralized
approaches like WOTs
• Inverse trust: how can I trust you not to indiscriminately
distribute my data? A big issue in fresh scientific data
• Systems challenges
– Caching
– Preventing accidental DOS attacks
Forecast for SW and Data
Integration
• We already have a number of data integration
tools on the market
• We are seeing the first generation of ontology
based data integration tools from small companies
• At least some of the big players will probably have
some offerings for doing data integration based on
Semantic Web concepts in the near future
– Whether they use Semantic Web formats and acronyms is
an open question …
• These common vocabularies will exhibit very
strong network effects
Semantic Web for Search: Going
beyond search as Location Bar
• Keywords → a particular page
– Typically a home page or well known hub page
– United airlines → www.united.com
– Unix → gnu.org, linux.org, freebsd.org
• Search as a smarter location bar
• Page rank is ideally suited for this
– This is largely a solved problem
Varieties of Search: Research
searches
• User is searching for info about something
• Could be directed – user is looking for a particular
property
– Price of something, location of some event, …
• Or undirected – user is looking for some general
class of properties
– Reviews/feedback on product, info on person or country
• If there is no hub page on the thing, existing
search engines perform very poorly
• New focus is on this class of searches
Semantic Web for Search
• Keyword based approaches haven’t made
significant advances since PageRank
• Improvements may be gained by adding a
modicum of understanding about the *object*
denoted by the search query
• Improvements not just in search itself but also in
the relevance of search related advertising
Basic Issues
• Need database of potential objects user may be
referring to, along with some properties of the
object … e.g., its type
• Too many objects to manually construct DB
– At least 300 million distinct object references on Web
• If it does know something more about the search
term’s denotation, (e.g., it denotes a musician),
how can the search engine do better?
Building the Web KB
• Many different automated approaches
– Simple natural language processing (Riloff, TAP, …)
– Scrapers
– Machine Learning
• Most commercial efforts lead to proprietary KBs
• Huge opportunity for wider SW community
– Collaborate to actually create the KB
Using the KB
• Word sense disambiguation, e.g., MSN Search,
Teoma
• Incorporating data feeds into search results. E.g.,
MSN with popular musicians
• Incorporating object type specific actions. E.g.,
Google with addresses and stock symbols
• Coming soon … KB construction driven by ads
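The idea of object-type-specific actions can be sketched as a lookup against a small knowledge base. This is a toy illustration under stated assumptions: the KB entries, the action descriptions, and the enrich helper are hypothetical and do not describe how any of the search engines named above actually work.

```python
# Hypothetical knowledge base: normalized query string -> type of the object denoted.
KB = {"yo-yo ma": "Musician", "ibm": "StockSymbol"}

# Hypothetical type-specific additions to the ordinary result list.
ACTIONS = {
    "Musician": "add a data feed of albums and tour dates",
    "StockSymbol": "add a current stock quote",
}

DEFAULT = "plain keyword results only"


def enrich(query: str) -> str:
    """Pick a type-specific action if the KB knows what the query denotes."""
    obj_type = KB.get(query.strip().lower())
    if obj_type is None:
        return DEFAULT
    return ACTIONS.get(obj_type, DEFAULT)


if __name__ == "__main__":
    for q in ("Yo-Yo Ma", "IBM", "semantic web"):
        print(q, "->", enrich(q))
```

The fallback to plain keyword results matters: with hundreds of millions of object references on the Web, most queries will miss any manually seeded KB.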
Conclusions
• Please help Eric Miller