Introduction to Database Systems

Download Report

Transcript Introduction to Database Systems

Data Warehousing/Mining
Comp 150 DW
Semistructured Data
Instructor: Dan Hebert
Data Warehousing/Mining
1
Semistructured Data

Everything that has no rigid schema
– Schema is contained within the data (self-describing), OR
– No separate schema, OR
– Schema exists but places only loose constraints on data

Emerged as an important topic for a variety of reasons
– Many data sources like WWW which we would like to treat as
databases but cannot for the lack of schema
– Desirable to have an extremely flexible format for data exchange
between disparate databases
– May want to view structured data as semistructured data for the
purpose of browsing
Data Warehousing/Mining
2
Motivation

Some data really is unstructured/semistructured
– World Wide Web,
– Data exchange formats
– Some exotic database management systems, e.g.,
ACeDB, popular with biologists
Data integration
 Browsing

Data Warehousing/Mining
3
Motivation - World Wide Web

Why do we want to treat the Web as a database?
– To maintain integrity
– To query based on structure (as opposed to content)
– To introduce some “organization”.

But the Web has no structure. The best we can say is
that it is an enormous graph.
Data Warehousing/Mining
4
Motivation - Data Formats
Much (probably most) of the world’s data is in
data formats
 These are formats defined for the interchange
and archiving of data
 Data formats vary in generality. ASN.1 and
XDR are quite general
 Scientific data formats tend to be “fixed
schemas”
 The textual representation given by data
formats is sometimes not immediately
translatable into a standard relational/objectoriented representation

Data Warehousing/Mining
5
Motivation - Data Integration

Goal is to integrate all types of information, including
unstructured information
– Irregular, missing information, structure not fully known, dynamic
schema evolution, etc.

Traditional data models and languages not well suited
– Cannot accommodate heterogeneous data sets (different types and
structures), etc.
– Difficult to build software that will easily convert between two
disparate models

OEM (Object Exchange Model)
– Semistructured data model from TSIMMIS project at Stanford
– Internal data structure for exchange of data between DBMSs
– Used by other systems: e.g., Windows 95 registry, Lotus Notes
Data Warehousing/Mining
6
Motivation - Browsing
To query a database one needs to understand the
schema.
 However schemas have opaque terminology and
the user may want to start by querying the data
with little or no knowledge of the schema.

– Where in the database is the string “Casablanca” to be
found?
– Are there integers in the database greater than 216 ?
– What objects in the database have an attribute name
that starts with “act”?

While extensions to relational query languages
have been proposed for such queries, there is no
generic technique for interpreting them.
Data Warehousing/Mining
7
The Model

Represent data as some kind of graph-like or treelike model
– Cycles are allowed but usually refer to them as trees
– Several different approaches with minor differences
(easy to convert)


Data on labels or edges, nodes carry information or not
Straightforward to encode relational and objectoriented databases
– Issue: object identity
Data Warehousing/Mining
8
Querying Semistructured Data

There are (at least) three approaches to this
problem
– Add arbitrary features to SQL or to your favorite query
language
– Find some principled approach to programs that are
based on the type of the data
– Represent the graph (or whatever the structure is) as
appropriate predicates and use some variety of datalog
on that structure
Data Warehousing/Mining
9
The “Extend SQL” Approach
In fact it is an attempt to extend the
philosophy of OQL and comprehension
syntax to these new structures
 It is the approach taken in the design of
UnQL and also of Lorel
 Looks very similar to OQL (path expressions)

Data Warehousing/Mining
10
Example
select
from
where
Data Warehousing/Mining
Entry.Movie.Title
DB
Entry.Movie.Director...
11
Syntax Issues
Need (path) variables to tie paths and edges
together
 Paths of arbitrary length

– “Find all strings in db”
– “Find whether “Allen” acted in “Casablanca”
– Need regular expresions to constrain paths

Rich set of overloadings for operators to deal
with comparisons of objects with values and
of values with sets
Data Warehousing/Mining
12
Underlying Computational Strategy

Model graph as a relational database and use
relational query language.
– Database large relation (node-id, label, node-id)
– Used by Stanford group in LORE/LOREL

Complications
– Labels are from heterogeneous set of types, need
more than one relation
– Additional relations if info to be stored in nodes
– Various navigation issues
Data Warehousing/Mining
13
Semistructured Data - Case Study
Object Exchange Model
Data Warehousing/Mining
14
OEM Features
• Common model for heterogeneous information
exchange, self-describing
• Each object:
OID

Label
Type
Value
OID = unique identifier or NULL
 Label = character string descriptor
 Type = atomic data type or set
 Value = atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL
Data Warehousing/Mining
15
15
Representing Semistructured Data Using OEM
Label
Memory
Addresses
<collection, {b1, a1, ...}>
Set Value
b1: <book, {t, a}>
t: <title, “Database and ...”>
Atomic Value
a: <author, {n, p}>
n: <name, “Jeff Ullman”>
p: <picture, “/gifs/ullman.gif”>
a1: <article, {v, w, x}>
v: <author, “Gio Wiederhold”>
w: <title, “Mediators in the …”>
x: <journal, “IEEE Computer”>
...
Data Warehousing/Mining
16
16
An OEM Query Language: OEM-QL
• Logic-based language for OEM
– Match object patterns, generate variable bindings,
construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
P :P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
<answer_title T> :<book {<author “Jeff Ullman”> <title T>}>
Data Warehousing/Mining
17
17
Semistructured Data - Case Study
WWW Extraction
Data Warehousing/Mining
18
Problem

Lots of valuable information on the Web
– irregular structure
– highly dynamic
Embedded in HTML
 Limited query facilities

Data Warehousing/Mining
19
Data Extraction Tool
Flexible, easy to use
 Accommodate virtually any HTML source
 Interface with existing system, e.g., data
warehouse, user interface for querying

Query
World
Wide
Web
Extractor
WH
Integrator
Data
Warehouse
Specification
Data Warehousing/Mining
20
Approach

Extract Web data into OEM format
– Query using OEM-QL
Python-based, configurable parser
 Declarative description of HTML source

– location of data on page
– how to package data into OEM
“Regular expression”-like syntax
 Human intelligence rather than A.I.

Data Warehousing/Mining
21
Extractor Specification
Consists of commands of the form:
[ “variable(s)”, “source”, “pattern” ]
Data Warehousing/Mining
22
HTML Source File
<HTML>
<HEAD>
..
.
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
..
.
</TABLE>
..
.
</BODY>
</HTML>
Data Warehousing/Mining
23
[
Specification File
[“root”, “get('http://www.example.test/')”, “#” ],
[“__tempvar1”, “root”, “*<table>#</table>*” ],
[“__tempvar2”, “split (__tempvar1,’</tr>’)”, “#” ],
[“rows”, “__tempvar2[1:-1]”, “#” ],
[“header1,header2_url,header2,header3”, “rows”,
“*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”]
]
Data Warehousing/Mining
24
Result OEM Object
...
<root complex {
<rows complex {
<header1 string “text 1”>
<header2_url string “http://www.stuff”>
<header2 string “text 2”
<header3 string “text 3”>
}>
<rows complex {
...
}>
}>
Data Warehousing/Mining
25
Basic Syntax:Variable

variable(l:p:t)
– optional parameters for specification of
corresponding OEM object
l: label name
 t: type
 p: parent object


_variable
– temporary data structure, does not appear as OEM
object
Data Warehousing/Mining
26
Basic Syntax: Source

split(variable,token)
– creates a list with multiple elements using token as the element
separator

get(URL)
– obtain contents of HTML file at address URL
Data Warehousing/Mining
27
Basic Syntax: Patterns

token1 # token2
– match and store current input (between tokens)

token1 * token2
– match, don’t store current input (between tokens)
Data Warehousing/Mining
28
Syntactic Sugar

Functions for extracting commonly used HTML
constructs
– extract_table(variable),pattern
– split_table_row(variable)
– split_table_column(variable)
– extract_list(variable),pattern
– split_list(variables)
Data Warehousing/Mining
29
Advanced Features

Customization of output
– structure, label names, data type, ...
Extraction across multiple HTML pages
 Graceful recovery from parse errors

– resume parsing using next input from source

Multiple patterns in single command
– follow different parse tree depending on structure in
source
Data Warehousing/Mining
30
...
Sample Extraction Scenario
Data Warehousing/Mining
31
Extracted OEM Data
root
OEM-QL}query:
complex {
complex {
temperature
city_temp complex {
country string
city_url url
string
city
weather_today
high_today
low_today
weather_tom
high_tomorrow
low_tomorrow
}
city_temp complex {
country string
city_url url
string
city
…
}
…
}
“Austria”
http://www…
“Vienna”
string
“snow”
string
“-2”
string
“-7”
string
“snow”
string
“-2”
string
“-7”
“Belgium”
http://www…
“Brussles”
<city C {<high H> < low L>}> :<temperature {<city_temp
{<country “Germany”> <city C> <high_today H> <low_today L>}>}>
Data Warehousing/Mining
32
Evaluation

Better than
– writing programs
– YACC, PERL, etc.
– A.I.

Can do better
– GUI tool to simplify the generation of extractor
specification
– Machine learning or data mining techniques to
automatically infer structure...
Data Warehousing/Mining
33