Managing Semi-Structured Data

Is the web a database?
Rules—What Rules?
“The web changed the digital information rules.”
• Easy to create web information
• Cannot all be stored in relational databases
• Cannot be queried in traditional ways
Semi-structured Data
• Fully structured data
– Databases
– Hidden web
• Fully unstructured data—ordinary text
• Semi-structured data—the grey area in between
– No “good solutions;” no good “software, tools, or
methodologies to manipulate [semi-structured data]”
– “[Researchers] don’t even agree on the shape of the
problem—much less, good approaches to solving it.”
Nature of the Problem
• Information embedded in text
– Keyword search insufficient to answer queries
– Natural language processing also insufficient
• Lack of agreement on vocabularies and schemas
– “Reaching schema agreements among different
communities is one of the most expensive steps in
software design.”
– “We need to be able to process information without
requiring … a priori schema and vocabulary
agreements among participants.”
Example: eBay
• “Impossible for … developers to define an a priori
schema for the information.”
• “Information stored in raw text and searched using
only keywords, significantly limiting its usability.”
• “Some standard entities (e.g., buyer, date, ask, bid …),
but the meat of the information—the item
descriptions—has a rich and evolving structure that
isn’t captured.”
Why Schemas?
• “Schemas assign meaning to the data and … allow
automatic data search, comparison, and processing.”
• Hierarchy of meaning
– Raw text: strings (values)
– Data: attribute-value pairs
– Information: data in a conceptual framework
– Knowledge: information with a degree of certainty or
community agreement
– Meaning: knowledge that is relevant or activates
• “We have to learn to use and exploit schemas as helpers,
but not rely on their existence or allow them to be
constraining factors.”
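The step from raw text to data in the hierarchy above can be illustrated with a minimal Python sketch. The item description, the `attribute: value` pattern, and the function names are all hypothetical and chosen only for illustration; real descriptions rarely follow such a regular pattern.

```python
import re

# Hypothetical raw-text item description (illustrative only).
raw_text = "Vintage camera, brand: Leica, year: 1954, condition: good"


def keyword_match(text, keyword):
    """Raw-text level: we can only test whether a string occurs."""
    return keyword.lower() in text.lower()


def extract_pairs(text):
    """Data level: pull out 'attribute: value' pairs with a simple pattern."""
    found = re.findall(r"(\w+)\s*:\s*([^,]+)", text)
    return {attr.strip().lower(): value.strip() for attr, value in found}


pairs = extract_pairs(raw_text)

# A keyword search can find the string "1954", but it cannot answer
# "was this item made before 1960?". Attribute-value pairs can.
made_before_1960 = int(pairs["year"]) < 1960
```

Once the values carry attribute names, comparison and processing become possible without anyone having agreed on a schema in advance.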
Schema-Agnostic Tools
Possible Places to Start
• Information retrieval (sophisticated search engines?)
– Find (maybe?) but not answer
– No DB-like query logic, updates, transactions
• XML
– XML data can exist with or without schemas; schemas can be
defined before or after the data is created
– Mixed text/data content
– Languages for query (XQuery) and transformation (XSLT)
• OWL & RDF
– RDF: subject-predicate-object triples
– OWL: ontological descriptions usually over RDF triples
– Classification & inferencing
– Semantic annotation and tagging
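As a rough illustration of the XML and RDF points above, the following Python sketch parses a schema-less XML fragment with mixed text/data content and turns its tagged parts into subject-predicate-object tuples. The document, the subject name `item1`, and the `match` helper are hypothetical. Plain tuples stand in for RDF here; actual RDF identifies resources with IRIs, and a real system would more likely use XQuery or an RDF library.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing with mixed text/data content; no schema is declared.
doc = (
    "<item><title>Vintage camera</title>"
    "<desc>A <brand>Leica</brand> from <year>1954</year>, works well.</desc>"
    "</item>"
)
root = ET.fromstring(doc)

# The free text survives alongside the tagged values.
description = "".join(root.find("desc").itertext())

# Turn each leaf element into a (subject, predicate, object) tuple.
subject = "item1"  # hypothetical identifier
triples = [
    (subject, el.tag, el.text)
    for el in root.iter()
    if el is not root and el.text and len(el) == 0
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every non-None position."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


brands = match(triples, p="brand")
```

No schema was needed to parse or query the fragment, although one could still be added afterwards to validate or interpret it.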
Are We Stuck?
What’s Next?
• Better information-authoring tools (annotation
assistance)
• Information extraction (automatic annotation)
• Creation and reuse of standard schemas and
vocabularies (ontology generation)
• Mapping schemas to each other (schema mapping)
• Automatic data linking (data linking & merging)
• Automatic processing of semi-structured data
(free-form queries)
– Florescu (Embley)
What’s beyond a database system?
Dataspace System
• Supports data and applications in a wide variety
of formats all within a dataspace.
• Offers an integrated means of searching,
querying, updating, and administering the
dataspace.
• Has varying levels of service (e.g. “best-effort”
or approximate answers)
• Includes tools to create tighter integration of the
data, as necessary.
– Franklin, Halevy, Maier
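The idea of varying levels of service can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two sources, the record contents, and the labels `exact` and `best-effort` are hypothetical and are not taken from Franklin, Halevy, and Maier's design.

```python
# Hypothetical sources in different formats within one dataspace.
structured = [{"id": "r1", "brand": "Leica", "year": 1954}]
unstructured = {"t1": "Selling my old Leica, bought in the fifties."}


def search(term):
    """Return (source_id, level) pairs; level signals how reliable the hit is."""
    needle = term.lower()
    hits = []
    # Structured records: compare against attribute values exactly.
    for rec in structured:
        values = (str(v).lower() for k, v in rec.items() if k != "id")
        if needle in values:
            hits.append((rec["id"], "exact"))
    # Raw text: fall back to keyword matching and label it best-effort.
    for text_id, text in unstructured.items():
        if needle in text.lower():
            hits.append((text_id, "best-effort"))
    return hits


leica_hits = search("Leica")
year_hits = search("1954")
```

A single search call covers both sources, and the label tells the caller which answers rest on structure and which are only approximate.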
“We are still at day one.”
“We need to find a compromise to the
tension between the advantages of having
schemas, in terms of better understanding
and automatically processing the data, and
disadvantages imposed by schemas, in
terms of inflexibility and lack of
evolution.”
– Florescu