Introduction to Database Systems
Download
Report
Transcript Introduction to Database Systems
Data Warehousing/Mining
Comp 150 DW
Semistructured Data
Instructor: Dan Hebert
Data Warehousing/Mining
1
Semistructured Data
Everything that has no rigid schema
– Schema is contained within the data (self-describing), OR
– No separate schema, OR
– Schema exists but places only loose constraints on data
Emerged as an important topic for a variety of reasons
– Many data sources like WWW which we would like to treat as
databases but cannot for the lack of schema
– Desirable to have an extremely flexible format for data exchange
between disparate databases
– May want to view structured data as semistructured data for the
purpose of browsing
Data Warehousing/Mining
2
Motivation
Some data really is unstructured/semistructured
– World Wide Web,
– Data exchange formats
– Some exotic database management systems, e.g.,
ACeDB, popular with biologists
Data integration
Browsing
Data Warehousing/Mining
3
Motivation - World Wide Web
Why do we want to treat the Web as a database?
– To maintain integrity
– To query based on structure (as opposed to content)
– To introduce some “organization”.
But the Web has no structure. The best we can say is
that it is an enormous graph.
Data Warehousing/Mining
4
Motivation - Data Formats
Much (probably most) of the world’s data is in
data formats
These are formats defined for the interchange
and archiving of data
Data formats vary in generality. ASN.1 and
XDR are quite general
Scientific data formats tend to be “fixed
schemas”
The textual representation given by data
formats is sometimes not immediately
translatable into a standard relational/objectoriented representation
Data Warehousing/Mining
5
Motivation - Data Integration
Goal is to integrate all types of information, including
unstructured information
– Irregular, missing information, structure not fully known, dynamic
schema evolution, etc.
Traditional data models and languages not well suited
– Cannot accommodate heterogeneous data sets (different types and
structures), etc.
– Difficult to build software that will easily convert between two
disparate models
OEM (Object Exchange Model)
– Semistructured data model from TSIMMIS project at Stanford
– Internal data structure for exchange of data between DBMSs
– Used by other systems: e.g., Windows 95 registry, Lotus Notes
Data Warehousing/Mining
6
Motivation - Browsing
To query a database one needs to understand the
schema.
However schemas have opaque terminology and
the user may want to start by querying the data
with little or no knowledge of the schema.
– Where in the database is the string “Casablanca” to be
found?
– Are there integers in the database greater than 216 ?
– What objects in the database have an attribute name
that starts with “act”?
While extensions to relational query languages
have been proposed for such queries, there is no
generic technique for interpreting them.
Data Warehousing/Mining
7
The Model
Represent data as some kind of graph-like or treelike model
– Cycles are allowed but usually refer to them as trees
– Several different approaches with minor differences
(easy to convert)
Data on labels or edges, nodes carry information or not
Straightforward to encode relational and objectoriented databases
– Issue: object identity
Data Warehousing/Mining
8
Querying Semistructured Data
There are (at least) three approaches to this
problem
– Add arbitrary features to SQL or to your favorite query
language
– Find some principled approach to programs that are
based on the type of the data
– Represent the graph (or whatever the structure is) as
appropriate predicates and use some variety of datalog
on that structure
Data Warehousing/Mining
9
The “Extend SQL” Approach
In fact it is an attempt to extend the
philosophy of OQL and comprehension
syntax to these new structures
It is the approach taken in the design of
UnQL and also of Lorel
Looks very similar to OQL (path expressions)
Data Warehousing/Mining
10
Example
select
from
where
Data Warehousing/Mining
Entry.Movie.Title
DB
Entry.Movie.Director...
11
Syntax Issues
Need (path) variables to tie paths and edges
together
Paths of arbitrary length
– “Find all strings in db”
– “Find whether “Allen” acted in “Casablanca”
– Need regular expresions to constrain paths
Rich set of overloadings for operators to deal
with comparisons of objects with values and
of values with sets
Data Warehousing/Mining
12
Underlying Computational Strategy
Model graph as a relational database and use
relational query language.
– Database large relation (node-id, label, node-id)
– Used by Stanford group in LORE/LOREL
Complications
– Labels are from heterogeneous set of types, need
more than one relation
– Additional relations if info to be stored in nodes
– Various navigation issues
Data Warehousing/Mining
13
Semistructured Data - Case Study
Object Exchange Model
Data Warehousing/Mining
14
OEM Features
• Common model for heterogeneous information
exchange, self-describing
• Each object:
OID
Label
Type
Value
OID = unique identifier or NULL
Label = character string descriptor
Type = atomic data type or set
Value = atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL
Data Warehousing/Mining
15
15
Representing Semistructured Data Using OEM
Label
Memory
Addresses
<collection, {b1, a1, ...}>
Set Value
b1: <book, {t, a}>
t: <title, “Database and ...”>
Atomic Value
a: <author, {n, p}>
n: <name, “Jeff Ullman”>
p: <picture, “/gifs/ullman.gif”>
a1: <article, {v, w, x}>
v: <author, “Gio Wiederhold”>
w: <title, “Mediators in the …”>
x: <journal, “IEEE Computer”>
...
Data Warehousing/Mining
16
16
An OEM Query Language: OEM-QL
• Logic-based language for OEM
– Match object patterns, generate variable bindings,
construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
P :P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
<answer_title T> :<book {<author “Jeff Ullman”> <title T>}>
Data Warehousing/Mining
17
17
Semistructured Data - Case Study
WWW Extraction
Data Warehousing/Mining
18
Problem
Lots of valuable information on the Web
– irregular structure
– highly dynamic
Embedded in HTML
Limited query facilities
Data Warehousing/Mining
19
Data Extraction Tool
Flexible, easy to use
Accommodate virtually any HTML source
Interface with existing system, e.g., data
warehouse, user interface for querying
Query
World
Wide
Web
Extractor
WH
Integrator
Data
Warehouse
Specification
Data Warehousing/Mining
20
Approach
Extract Web data into OEM format
– Query using OEM-QL
Python-based, configurable parser
Declarative description of HTML source
– location of data on page
– how to package data into OEM
“Regular expression”-like syntax
Human intelligence rather than A.I.
Data Warehousing/Mining
21
Extractor Specification
Consists of commands of the form:
[ “variable(s)”, “source”, “pattern” ]
Data Warehousing/Mining
22
HTML Source File
<HTML>
<HEAD>
..
.
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
..
.
</TABLE>
..
.
</BODY>
</HTML>
Data Warehousing/Mining
23
[
Specification File
[“root”, “get('http://www.example.test/')”, “#” ],
[“__tempvar1”, “root”, “*<table>#</table>*” ],
[“__tempvar2”, “split (__tempvar1,’</tr>’)”, “#” ],
[“rows”, “__tempvar2[1:-1]”, “#” ],
[“header1,header2_url,header2,header3”, “rows”,
“*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”]
]
Data Warehousing/Mining
24
Result OEM Object
...
<root complex {
<rows complex {
<header1 string “text 1”>
<header2_url string “http://www.stuff”>
<header2 string “text 2”
<header3 string “text 3”>
}>
<rows complex {
...
}>
}>
Data Warehousing/Mining
25
Basic Syntax:Variable
variable(l:p:t)
– optional parameters for specification of
corresponding OEM object
l: label name
t: type
p: parent object
_variable
– temporary data structure, does not appear as OEM
object
Data Warehousing/Mining
26
Basic Syntax: Source
split(variable,token)
– creates a list with multiple elements using token as the element
separator
get(URL)
– obtain contents of HTML file at address URL
Data Warehousing/Mining
27
Basic Syntax: Patterns
token1 # token2
– match and store current input (between tokens)
token1 * token2
– match, don’t store current input (between tokens)
Data Warehousing/Mining
28
Syntactic Sugar
Functions for extracting commonly used HTML
constructs
– extract_table(variable),pattern
– split_table_row(variable)
– split_table_column(variable)
– extract_list(variable),pattern
– split_list(variables)
Data Warehousing/Mining
29
Advanced Features
Customization of output
– structure, label names, data type, ...
Extraction across multiple HTML pages
Graceful recovery from parse errors
– resume parsing using next input from source
Multiple patterns in single command
– follow different parse tree depending on structure in
source
Data Warehousing/Mining
30
...
Sample Extraction Scenario
Data Warehousing/Mining
31
Extracted OEM Data
root
OEM-QL}query:
complex {
complex {
temperature
city_temp complex {
country string
city_url url
string
city
weather_today
high_today
low_today
weather_tom
high_tomorrow
low_tomorrow
}
city_temp complex {
country string
city_url url
string
city
…
}
…
}
“Austria”
http://www…
“Vienna”
string
“snow”
string
“-2”
string
“-7”
string
“snow”
string
“-2”
string
“-7”
“Belgium”
http://www…
“Brussles”
<city C {<high H> < low L>}> :<temperature {<city_temp
{<country “Germany”> <city C> <high_today H> <low_today L>}>}>
Data Warehousing/Mining
32
Evaluation
Better than
– writing programs
– YACC, PERL, etc.
– A.I.
Can do better
– GUI tool to simplify the generation of extractor
specification
– Machine learning or data mining techniques to
automatically infer structure...
Data Warehousing/Mining
33