Processing Semi-Structured Data

Download Report

Transcript Processing Semi-Structured Data

Processing Semi-Structured Data
October 1998
Kazimierz Subieta
Institute of Computer Science PAS
Warsaw, Poland
[email protected]
http://www.ipipan.waw.pl/~subieta
Polish-Japanese Institute
of Computer Techniques,
Warsaw, Poland
K.Subieta. Processing Semi-Structured Data, Slide 1
October 1998
Motivation: Why Semi-Structured Data?
Some data are really unstructured or semistructured:
WWW. HTML pages and WWW databases are untyped, unformatted, with dangling
links. There is little doubt that they will require efficient programming techniques to
deal with their irregularities and inconsistencies (e.g. for implementing active agents).
Interoperability. Heterogeneous and/or distributed databases require resolution of
missing or conflicting data values that occur when semantically identical data items
may have some attribute values different or missing in some data sources.
Schema evolution. As a result of schema evolution some new attributes may appear,
some may disappear, some may change type, some single attributes can be changed
into repeating ones. This naturally leads to irregular data.
Data warehouses. They assume collecting decision-support data from heterogeneous
sources. Analytical processing of heterogeneous information (statistical analysis,
overview reports, data mining, etc.) requires capabilities of integrated
query/programming languages that will make querying and processing irregular data
possible .
K.Subieta. Processing Semi-Structured Data, Slide 2
October 1998
Is the Problem New?
No. But it has many new aspects.
The problem of semi-structured data is closely related to the classical problem
of null values in databases and has about thirty years of history. Many authors
proposed extensions of capabilities of database systems for storing and querying
irregular, uncertain or missing information:
• various extensions of the relational algebra,
• specialized probabilistic DBMSs
• nested relational algebras dealing with null values
• querying disjunctive information or so-called “possible worlds”
• three/four/multi-valued logics
A bibliography on uncertainty management by Dyreson presents about 400
papers devoted to this topic. Null values received the first-class citizenship in
the relational model and became a component of many important notions such
as outer joins and universal relations.
K.Subieta. Processing Semi-Structured Data, Slide 3
October 1998
Null Values: the-state-of-the art
Currently relational databases, SQL and their object-oriented counterparts introduce
null values and some simple means to query/process them as a standard feature.
However, it is rather a common opinion that the problem remains unsolved.
Up to now the theoretical proposals do not fit well with other features of practical
database systems, thus they are not implemented, even in prototypes.
Pirotte and Zimanyi: “...a great variety of approaches that are not easily comparable,
and (...) few practically usable results”.
Dubois, Prade and Testamale: “...the problem is generally not well understood, and
any attempt to incorporate support for null values into an implemented system
should be considered premature at this time'“.
These doubts are actually supported by many professionals, including some of the
former proponents of theoretical ideas related to null values.
K.Subieta. Processing Semi-Structured Data, Slide 4
October 1998
Sources of Null Values
Information is irrelevant, for example, “nickname” for a person who has no nickname.
Information is known, but not filled yet because data input lasts some time. For
example, a new employee is hired, all his/her data are already introduced to the
database, except “salary” which has to be input by another administration division.
Information is valid only for fixed time; after it must be nullified (to prevent errors) and
then input again or recalculated; for example, weather or stock market forecasts.
Information is known and could be filled in, but it is considered inessential for the
particular domain of applications. For example, in some airline ticket reservation system
the nationality of passengers is not filled in for domestic flights.
Information consists of “legacy” records, with limited amount of information, and actual
records having an extended set of attributes.
The result of some processing is an intermediary data structure containing null-valued
elements; e.g. outer joins.
None of these cases can be qualified as uncertain information.
They are related to specific aspects of data storing and processing.
K.Subieta. Processing Semi-Structured Data, Slide 5
October 1998
Null Values in Theories
The fundamental defect of relational theories w.r.t. null values is
the scope mismatch. It concerns the difference between the scope of theories and the scope
in which null values need to be considered in practice.
Theories usually address an idealized query language with very limited capabilities.
The scope for null values is wider and should include:
• all advanced query constructs, such as grouping, aggregate functions, quantifiers, ordering;
• imperative constructs based on queries: creating, updating, inserting, deleting;
• the interface for embedding a query language into a programming language;
• database semantic enhancements: views, database procedures, integrity constraints,
deductive rules, active rules;
• object-oriented extensions: ADT-s, types, classes, interfaces, encapsulation, inheritance;
• static typing systems;
• privacy and security;
• transaction mechanisms;
• interfaces for end users and programming interfaces: graphical query lang., 4GLs, ....
The scope mismatch is the main reason that theoretical results
in this area can be summarised as absolute zero.
Ironically, this does not concerns careers of researchers
that developed these “theories”: the result is great!
K.Subieta. Processing Semi-Structured Data, Slide 6
October 1998
Unions/Variants
In the domain of programming languages there is another well-known feature dealing
with semi-structured data: “variants” in the Pascal family of languages, or “unions” in
the C family. Unions and null values are conceptually similar notions, however, not
equivalent.
For example, if some record type involves n attributes that can be null valued, then
the number of corresponding components of a union is 2n. Unions cover the situation
when names of attributes are the same but types are different.
The programming language community noticed unions and ignored null values.
The database community did vice-versa.
Unions are proposed in IDL of OMG CORBA, with an explicit discrimination
attribute, and in the ODMG model. Unfortunately, ODMG 2.0 contains no construct
to deal with unions in the query language OQL and contains no suggestion how
unions will be treated by the assumed strong typing system. The issue is not trivial,
especially concerning unions with an explicit discrimination attribute.
Actually there is no powerful query language consistently dealing with unions.
K.Subieta. Processing Semi-Structured Data, Slide 7
October 1998
Null Values in SQL
A null value can be stored in a database and returned by an SQL expression if it
cannot be evaluated correctly. This may happen because of null-valued arguments of
an operation, as well as in the case of wrong arguments of operations.
For example, function sum returns NULL for an empty argument table.
SQL does not allow explicit comparisons of nulls and ordinary values but involves a
special predicate is [not] null.
A special function if_null returns an attribute's value if it is not null, or some constant
value otherwise.
Comparison of an expression which returns NULL with any other value returns a third
truth value UNKNOWN; in the WHERE clause, however, it is equivalent to FALSE.
In embedded SQL the admission of null values requires for every such attribute two
variables in an underlying host language program. An indicator variable is used to
store boolean information determining if the particular value of an attribute is null.
K.Subieta. Processing Semi-Structured Data, Slide 8
October 1998
Flaws of Null Values in SQL
The SQL solutions related to null values are semantically inconsistent.
C.J.Date presents a severe criticism of null values in the relational model (“...the null value
concept is far more trouble than it is worth”, “...the SQL null value concept introduces far
more problems than it solves”). Anomalies implied by null values:
Although in SQL officially every null value is distinct (thus A = B returns UNKNOWN if both
A and B return nulls), the group by and unique operators treat them as identical.
Aggregate functions sum, avg, min, max ignore null values.
If relation R contains numerical attributes A and B which could be null valued, than in general
select sum(A+B) from R may return the result different from select sum(A) + sum(B) from R.
Date concludes that these flaws are not only the property of an inadequate SQL design, but
they are the inevitable consequence of the idea of null values.
Unfortunately, this idea of null values is continued and even extended in SQL3.
K.Subieta. Processing Semi-Structured Data, Slide 9
October 1998
Default values
As an alternative to null values Date and other authors propose the concept of
default values. They are ordinary values that are filled in into a tuple in the case of
absent information. Default values can be determined in a relational schema.
For example, the schema
declare SUPPLIER relation
SNO
char(4)
SNAME
char(15)
STATUS
int
CITY
char(20)
primary key SNO;
nodefault,
default( “
“ ),
default ( -1 ),
default (“???” ),
presents defaults for attributes SNAME, STATUS, and CITY as a string of spaces,
integer -1, and a string of three question marks, correspondingly.
An advantage of default values over null values is that they need not any special
options in a query/programming language to process them.
However, default values do not solve all problems, they are not universal and they
are error-prone.
K.Subieta. Processing Semi-Structured Data, Slide 10
October 1998
The New Wave: Semi-Structured Data
A new wave known under the term “semi-structured data” approaches the problem
from a different angle (and actually does not refer to null values), but inherits a lot of
problems and doubts related to null values.
The main limitation of relational approaches is repeated!
Querying semi-structured data is not enough!
Any data processing system must eventually deal with complete computational and
pragmatic power of user/programmer interfaces.
Features related to semi-structured data should be combined with all aspects of database
systems:
• database design,
• data description,
• query languages,
• programming capabilities,
• typing,
• object-orientedness,
• procedures, views, active rules,
• transactions, security, catalogs,....
K.Subieta. Processing Semi-Structured Data, Slide 11
October 1998
Related Works
1. Management of semi-structured data
TSIMMIS (Stanford)
INRIA
Rufus
 UnQL (Uni Penn)
2. Semantic markup technique
HTML generation and semantic markup
Lightweight data mark-up
1. Combine Web information service and Database system
2. Intermediate Representation Model
for semi-structured data (HTML) importing
AVPL (Semantic tag markup)
3. Preserve Data Model and Query Facility
Introduce the concept of schema construction
4. Can be used as a Database Entry
System independently
Applicable to datawarehousing
K.Subieta. Processing Semi-Structured Data, Slide 12
October 1998
Mark-up, Wrappers and Mediators
Wrappers and mediators are new architectural elements of processing
unstructured (semi-structured) information.
Mark-up: special tags attached to e.g. the content of a Web page allowing to determine a
structure of plain text.
Wrapper: a piece of software that converts (virtually) “external” interface to “internal”
interface. A wrapper allows “to see” and process some data in a manner that is presents some
standard.
Mediator: a piece of software on higher abstraction level that allows to convert requests
(queries) from higher abstraction level to lower level. The concept is close to the concept of
database view (updatable).
Wrappers and mediators increase the level of data independence.
They make it possible the access to data in various formats.
They isolate the programmer from details of internal data organization.
They increase the level of abstractions allowing to query and process data by
interfaces that are more user friendly.
The precise meaning of these terms is however not clear.
The rules of designing wrappers and mediators are very informal.
K.Subieta. Processing Semi-Structured Data, Slide 13
October 1998
Current Approaches to Semi-Structured Data
After application of mark-up techniques, wrappers and mediators semi-structured
data should have some form that is more convenient to formal treatment.
In comparison to classical database models (network, hierarchical, relational, objectoriented) the main novelty of the approaches to semi-structured data is that data
names are kept together with data values.
A data structure is viewed as a graph with nodes representing some values (or
internal identifiers) and edges representing data names (i.e. logical access paths).
Every instance of some conceptual entities can be assigned different attributes.
In Tsimmis objects are triples <identifier, label, value>, where
identifier is a unique internal object identification,
label is a data name, and
value is some atomic value (integer, string, etc.) or a set of objects.
The database is a pair <O, N>, where O is a set of objects, and N (a subset of O) is a set of
entry points to the database.
A similar approach (but without identifiers) is assumed in UnQL.
Almost identical assumptions (the most general) are assumed in the stack-based approach.
K.Subieta. Processing Semi-Structured Data, Slide 14
October 1998
TSIMMIS at Stanford
Lightweight Object Model: Object Exchange Model (OEM),
like AVPL, O2’s Complex Value
Lightweight Repository LORE
Lightweight Query Language LOREL
LOREL
select -from-where Clause:
SELECT Frodos.Group.Name
FROM Frodos
WHERE Frodos.Group.Category=“Opera”
SQL sugar + path expressions (limited).
The retrieval power is impossible to determine.
Object-orientation (classes, methods, encapsulation, inheritance): not considered
The integration with fully-fledged programming capabilities: not considered.
Conclusion: limited goals that will be not enough for many applications.
K.Subieta. Processing Semi-Structured Data, Slide 15
October 1998
Example - Movie Database
Entry
Entry
Entry
TV Show
Movie
Title
Title
Cast
Director
Cast
“Casablanca” “Bogart”
“Bacall”
“Play it again, Sam”
Credi
Actors
Title
Actors
Cast
Episode
“Alan”
1
2
3
Special Guests
Data labels (names of objects, names of attributes, etc.)
are kept together with data values.
K.Subieta. Processing Semi-Structured Data, Slide 16
October 1998
Relational Databases are a Particular Case
R1
A
“a”
“b”
B
2
4
R1
C
3
5
Tup
A
R2
C
3
4
5
D
“c”
“d”
“e”
“a”
K.Subieta. Processing Semi-Structured Data, Slide 17
B
2
C
R2
Tup
A
3 “b”
Tup
B
4
C
5
C
3
D
Tup
Tup
C
“c” 5
D
C
“d” 5
D
“e”
October 1998
Querying Paradigms
LISP-like. Data structures are lists which are served by special functions (low
level).
Data structure is a graph. A query determines a sub-graph of it (UnQL).
Comprehensions syntax, structural recursion as an intellectual support.
Navigation in a labeled graph. Path expressions + conditions determine the
target to be retrieved (Lorel).
Extensions to SQL syntax: special constructs testing the presence and absence
of data (Lorel).
Stack based. A query language is based on binding names occuring in a query
according to the state of the environment stack. The stack is managed by
query operators (Loqis).
K.Subieta. Processing Semi-Structured Data, Slide 18
October 1998
My Old Experience: The Great Emigration
In late 70-ties I implemented the system “The Great Emigration” devoted to historical
research. The system recorded data on Polish emigrants (about 15000) after the unsuccessful
November Revolution against Russian domination in 1830-31.
The data were collected from police archives all over the Europe, mostly from France.
Each emigrant was described by about 50 attributes, such as AttendedSchools, Opinions,
StayingPlaces, Duels, Jobs, etc.
Each attribute might be optional (null valued), repeated and complex; each occurence of an
attribute might be (optionally) associated with dates. To the purpose of this system I
implemented a powerful query/programming language, which allowed us to ask non-trivial
queries; for example “Identify the groups of emigrants according to the same time of their
staying in at least three places”.
The generic part of this system we upgraded to a universal DBMS called LINDA. This
system had several other applications, e.g. for processing medical data.
It may be interesting to note that the LINDA data model was almost identical with the
Tsimmis data model (!).
K.Subieta. Processing Semi-Structured Data, Slide 19
October 1998
The Stack-Based Approach:
An Abstract Store Model
I - a set of internal identifiers
N - a set of external names
V - a set of atomic values
Store:
A set of objects +
A set of identifiers (roots)
< i, n, v > atomic object
< i1, n, i2 > pointer object
< i, n, T > complex object
T is a set of (any) objects
+
some obvious constraints
(uniqueness of identifiers,
referential integrities)
No record, tuple, array, set, and bag constructors in the model:
essentially all of them are collections of objects (“environments”).
No uniqueness of external names on any level of data hierarchy:
modeling bulk data.
The model allows us to treat uniformly
relational, nested-relational, functional, object-oriented databases.
Ordering, encapsulation and classes require some extension of the model. later
K.Subieta. Processing Semi-Structured Data, Slide 20
October 1998
Tiny Database
TinyDatabase
i1 EMP
i5 EMP
i9 EMP
i2 NAME Brown
i6 NAME Smith
i10 NAME Jones
i3 SAL 2500
i7 SAL 2000
i11 SAL 1500
i4 WORKS_IN i13
i8 WORKS_IN i17
i12 WORKS_IN i17
i13 DEPT
i17 DEPT
i14 DNAME Toys
i18 DNAME Sales
i15 LOC Paris
i19 LOC Berlin
i16 LOC London
K.Subieta. Processing Semi-Structured Data, Slide 21
October 1998
A Complex Object
i3 NAME Smith
i2 ENO e127
i5 WORKS_IN i17
i4 JOB designer
i8 COMPANY IBM
i1 EMP
i7 PREV_JOB
i6 SAL 7500
i9 WHEN 1975-77
i11 COMPANY ICL
i10 PREV_JOB
i12 WHEN 1977-90
K.Subieta. Processing Semi-Structured Data, Slide 22
October 1998
Universality of the Store Model
Complex hierarchical objects can be defined (no limit in levels)
Programming variables can be defined
We abstract from the persistence status of objects and variables, i.e. we define
in the same way persistent and transient objects
Binary relationships (associations) can be defined via pointer objects.
Ternary and higher order relationships, attributes of relationships: we do not deal
with them - they must be decomposed into binary ones.
Bulk data: we deal with sets/bags. They are modeled by the same name assigned
to many objects on the same hierarchy level
Relational structures: each tuple is understood as an object with subobjects.
Relativity: no difference in the treatment of objects on any hierarchy level.
big advantage for the universality, minimality and simplicity of semantics.
K.Subieta. Processing Semi-Structured Data, Slide 23
October 1998
Semi-Structured Data
Dynamic arrays: the only difference is that names of sub-objects are first-class.
i0 A
i1 1
Monday
i2 2 Tuesday
i7 7 Sunday
Variants/unions: different structures of sub-objects.
i10 EMP
i11 NAME Smith
i12 KIND regular
i20 EMP
i21 NAME Brown
i22 KIND part-time
i13 SAL 5000
i23 EMPLOYER Smith
Null-values: absence of some sub-objects (same approach as for variants).
Consistent! ... in contrast to the SQL approach.
i10 EMP
i11 NAME Smith
i12 KIND regular
i20 EMP
i21 NAME Brown
i22 KIND part-time
i13 SAL 5000
Null
K.Subieta. Processing Semi-Structured Data, Slide 24
October 1998
Type-less, Schema-less?
There is misunderstanding
concerning the terms “type-less” and “schema-less”.
Type-less structures have their advantages (dynamic flexibility) and disadvantages
(error prone, storage overhead).
However, “type-less” does not mean “schema-less”. An important aspects of types
concerns conceptual modeling. Types not only constraint data: they also inform
the user what the database contains; i.e. they represent informal data semantics.
The user must be informed what the database could contain by a sufficiently
precise statement. This statement presents the database schema. It is possible that
the schema is not implemented in the system (thus do not present a typing
constraint), but it must be known for the user for correct formulation of queries.
Such a schema is also necessary for writing wrappers. For example, the designer of
a wrapper must take decisions concerning data names (labels of graph edges). Such
decisions must be supported by a data schema.
K.Subieta. Processing Semi-Structured Data, Slide 25
October 1998
Processing Semi-Structured Data:
Assumptions of the Stack-Based Approach
• A special null value, understood as a generic property of a database system, is
avoided.
• Default values are useful but do not solve the problem of semi-structured data.
Default values are the properties (invariants) of classes.
• The approach unifies null values, unions and repeating data. These cases
irregularities are modeled by absent objects.
• Every kind of objects (atomic, complex, reference objects) can be absent. Such
irregularities are served by standard query facilities.
• The classical environmental stack is used to express the semantics of query
operators in terms of the naming, scoping and binding mechanism.
• The same mechanisms should be used for determining imperative statements,
procedures, methods, views and other procedural abstractions.
• The query/programming interface should have full programming power.
• Some technique of typing semi-structured data should be provided.
K.Subieta. Processing Semi-Structured Data, Slide 26
October 1998
Optional, Alternative and Repeating Data
a) Optional data (null values)
EMP
NAME Brown
EMP
NAME Smith
SAL 2500
WORKS_IN i13
EMP
NAME Jones
SAL 1500
WORKS_IN i17
b) Union
EMP
NAME Clark
EMP
NAME Davis
KIND regular
KIND apprentice
SAL 2700
SKILL 3
c) Repeating (complex) attribute
EMP
NAME Clark
K.Subieta. Processing Semi-Structured Data, Slide 27
EMP
NAME Davis
PREV_JOB
WHERE IBM
WHEN 1985
PREV_JOB
WHERE ICL
October 1998
Classes and Inheritance
K.Subieta. Processing Semi-Structured Data, Slide 28
October 1998
Default Values
They are stored inside class objects as their invariants.
CLASS_SUPPLIER
SNAME “
STATUS -1
SNO 1234
CITY “Rome”
K.Subieta. Processing Semi-Structured Data, Slide 29
.....
CITY “???”
SUPPLIER
SUPPLIER
SNAME “Black“
“
SNAME “Gray“
SNO 1256
STATUS 55
October 1998
Research Problems:
Assuming semi-structured data are modeled by absent (sub) objects,
there are the following problems:
• Consistent integration of this idea with object-orientedness
• Powerful constructs querying data with absent (sub)objects
• Consistent integration of this idea with imperative constructs
• Consistent integration of this idea with programming abstractions
• Consistent integration of this idea with schema declarations and typing
• Methodologies and technologies for writing wrappers for unstructured data
• Methodologies for designing integrators of heterogeneous sources
• Methodologies for analysis and design of IS based on unstructured data.
We believe that these problems should be treated with respect to all
programming issues.
Querying semistructured data is not enough.
Simplistic theories concerning semi-structured data are not progress.
K.Subieta. Processing Semi-Structured Data, Slide 30
October 1998