XML and database technology


XML, distributed databases, and OLAP/warehousing
The semantic web and a lot more
What is XML?



A framework for declarative languages
A syntax and two major constructs: elements & attributes
Elements:




Have begin and end tags
Can be embedded
Can be put in lists (homogeneous or heterogeneous)
Attributes:



Are assigned to elements
Are strings
Are put in quotes
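
The two constructs above can be seen in a small parse, sketched here with Python's standard xml.etree.ElementTree module. The document and its names (library, book, isbn, and so on) are invented for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every element and attribute name is made up.
doc = """
<library>
  <book isbn="111" year="2001">
    <title>First</title>
    <author>Ann</author>
    <author>Bob</author>
  </book>
  <book isbn="222" year="2005">
    <title>Second</title>
    <author>Cy</author>
  </book>
</library>
"""

root = ET.fromstring(doc)
for book in root:  # elements are embedded in other elements
    # Attributes are assigned to an element and are always strings.
    isbn = book.get("isbn")
    year = book.get("year")
    # A repeated child element forms a list.
    authors = [a.text for a in book.findall("author")]
    print(isbn, year, type(year).__name__, book.find("title").text, authors)
```

Note that year comes back as a string, so any numeric interpretation is left to the application.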
What is XML for?

Initially, as a cornerstone of the semantic web



Automatic searching of the web (versus interactive)
Self-describing data
Has been adapted to a wide variety of application domains


As a means for specifying the structure of data
As a catch-all for nontraditional data
XML documents








An instance of XML is a language
An instance of an XML language is a document
Documents are hierarchical & list-oriented
XML documents can be parsed in a single, linear pass
There is no notion of a fixed schema
Does not leverage metadata for set-oriented queries
Order matters in a set of documents
Order matters in a series of elements in a document
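
As a rough illustration of the single-pass and ordering points above, the sketch below streams a tiny document with ElementTree's iterparse and records events as they arrive. The steps/step document is a made-up example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are invented.
doc = b"<steps><step>mix</step><step>bake</step><step>cool</step></steps>"

events = []  # every start/end tag, in the order the parser meets it
order = []   # text of each <step>, in document order

# iterparse reads the input once, front to back, emitting events as it goes.
for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end")):
    events.append((event, elem.tag))
    if event == "end" and elem.tag == "step":
        order.append(elem.text)

print(events)
print(order)
```

The list of steps comes back in the order it was written, which is the sense in which order is part of the data rather than an accident of storage.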
Is it a generalized HTML?



Sort of, but perhaps more of a meta alternative to HTML
The real point is to allow HTML pages to be located and
searched automatically
This is done by allowing language developers to create
their own names for documents, elements, & attributes
What else is part of the XML philosophy?

Namespaces



Associated with URLs
Can be referenced in a nested fashion in an XML document
Widely distributed sharing of data, XML languages, and
namespaces
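
A minimal sketch of namespaces in use, assuming three invented example.com URLs. One namespace is declared on a nested element to show that declarations can appear at any depth.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces; the URLs are placeholders, not real vocabularies.
doc = """
<order xmlns="http://example.com/orders"
       xmlns:c="http://example.com/customers">
  <c:customer>
    <c:name>Ann</c:name>
  </c:customer>
  <item xmlns:p="http://example.com/products" p:sku="A1">Widget</item>
</order>
"""

root = ET.fromstring(doc)
ns = {
    "o": "http://example.com/orders",
    "c": "http://example.com/customers",
}

# ElementTree expands every name to {namespace-URL}local-name.
print(root.tag)

name = root.find("c:customer/c:name", ns).text
item = root.find("o:item", ns)  # unprefixed elements use the default namespace
sku = item.get("{http://example.com/products}sku")
print(name, item.text, sku)
```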
What’s missing, from the database user’s and
the programmer’s perspective?






No innate notion of a query language
No Objects
Very limited data structuring capabilities
Yet another impedance mismatch problem
No way to store XML documents in a relational database,
at least not natively
No way to make a database out of a set of documents
So, in response to the database
community’s desires…


A hierarchical query language – XPath
A specification format for schemas – DTDs


But DTDs use their own, non-XML syntax
And DTDs do not accommodate namespaces
So, in response to the database
community’s desires, phase 2…

XML schema





More atomic or “basic” types
Like DTDs, but with an XML syntax
Supports namespaces
Adds primary keys and foreign keys
Adds more constructs for structuring data

Simple types: primitive types, list and union, & restriction


Attributes can be of simple types
Complex types: compositors



all (unordered) and sequence (ordered), and choice
Extension and restriction
Integrity constraints
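
Because an XML Schema is itself an XML document, it can be read with an ordinary XML parser. The sketch below does only that: it parses a small, invented schema and lists what it declares. Python's standard library does not validate documents against a schema, so no validation is attempted here. The names Rating, Book, library, and bookKey are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: a restricted simple type, a complex type built with
# the sequence compositor, a required attribute, and a key constraint.
schema = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Rating">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="5"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="Book">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="rating" type="Rating"/>
    </xs:sequence>
    <xs:attribute name="isbn" type="xs:string" use="required"/>
  </xs:complexType>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="Book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="bookKey">
      <xs:selector xpath="book"/>
      <xs:field xpath="@isbn"/>
    </xs:key>
  </xs:element>
</xs:schema>
"""

root = ET.fromstring(schema)
ns = {"xs": XS}

simple_types = [t.get("name") for t in root.findall("xs:simpleType", ns)]
complex_types = [t.get("name") for t in root.findall("xs:complexType", ns)]
book_fields = [
    e.get("name")
    for e in root.findall("xs:complexType[@name='Book']/xs:sequence/xs:element", ns)
]
keys = [k.get("name") for k in root.iter(f"{{{XS}}}key")]

print(simple_types, complex_types, book_fields, keys)
```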
Query language 1: XPath


Follows hierarchy of XML documents
Uses syntax borrowed from Unix file system








/ for root
. for current node
@ for value of an attribute
[1], [2], etc., for selecting among siblings by position
// for self or descendant of
.//x for searching all descendants for elements of a specific type x
Augmented with URLs to create XPointer
Relational database systems generally have an XML data
type now
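
The path syntax listed above can be tried with ElementTree, which implements only a subset of XPath. In that subset, paths are relative to an element (so a leading / is not accepted), and attribute values are read with get() rather than with a trailing @name step. The document below is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; all names are made up.
doc = """
<library>
  <shelf id="s1">
    <book isbn="111"><title>First</title></book>
    <book isbn="222"><title>Second</title></book>
  </shelf>
  <shelf id="s2">
    <book isbn="333"><title>Third</title></book>
  </shelf>
</library>
"""

root = ET.fromstring(doc)

# .//x : every descendant of the current node with tag x
all_titles = [t.text for t in root.findall(".//title")]

# [1] : the first book among its siblings, on each shelf
first_books = [b.get("isbn") for b in root.findall("./shelf/book[1]")]

# [2] : the second sibling, where one exists
second_book = root.find("./shelf/book[2]").get("isbn")

# [@attr='value'] : filter on an attribute value
third_title = root.find(".//book[@isbn='333']/title").text

print(all_titles, first_books, second_book, third_title)
```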
Distributed databases & distributed transactions –
homogeneous and heterogeneous


See page 689: multiple DBs vs. a distributed DB
Homogeneous distributed DBs




Single unified schema
Designed top down
Distribution by row, by column, by table, or by table selection
Issues of distribution



Redundancy: availability vs. keeping copies up to date
Hidden joins with column distribution
Hidden unions with table selection distribution
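
The hidden joins and hidden unions can be sketched on a single machine with sqlite3, using separate tables to stand in for fragments held at separate sites. The table names and data are invented. The point is only that a query against the unified schema silently becomes a union or a join over the fragments.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Selection fragments: each stand-in site holds the rows for one region.
CREATE TABLE cust_east (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE cust_west (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
INSERT INTO cust_east VALUES (1, 'Ann', 'east'), (2, 'Bob', 'east');
INSERT INTO cust_west VALUES (3, 'Cy', 'west');

-- Column fragments: each stand-in site holds some columns plus the key.
CREATE TABLE emp_public  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emp_private (id INTEGER PRIMARY KEY, salary INTEGER);
INSERT INTO emp_public  VALUES (1, 'Dee'), (2, 'Eli');
INSERT INTO emp_private VALUES (1, 100), (2, 120);

-- The unified schema the user sees: a hidden union and a hidden join.
CREATE VIEW customer AS
  SELECT * FROM cust_east UNION ALL SELECT * FROM cust_west;
CREATE VIEW employee AS
  SELECT p.id, p.name, s.salary
  FROM emp_public p JOIN emp_private s ON s.id = p.id;
""")

customers = con.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
employees = con.execute(
    "SELECT id, name, salary FROM employee ORDER BY id"
).fetchall()
print(customers)
print(employees)
```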
Executing distributed transactions

Each node has a master and a client module



Masters are all identical and contain distributed data info
Clients are like single site databases with a prepare to commit

3 basic strategies for query fragment execution



Bring data to procedure
Send procedure to data
Meet in a 3rd place
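
The "prepare to commit" step on the clients suggests a voting protocol in the style of two-phase commit. The slides do not name the protocol, so that reading is an assumption, and the Client class and run_transaction function below are hypothetical names in a toy, in-memory sketch with no networking, logging, or failure recovery.

```python
class Client:
    """Hypothetical single-site participant with a prepare-to-commit step."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(clients):
    """Master-side coordination: commit only if every client votes yes."""
    votes = [c.prepare() for c in clients]  # phase 1: collect votes
    if all(votes):                          # phase 2: broadcast the decision
        for c in clients:
            c.commit()
        return "committed"
    for c in clients:
        c.abort()
    return "aborted"


ok = run_transaction([Client("a"), Client("b")])
bad_clients = [Client("a"), Client("b", can_commit=False)]
bad = run_transaction(bad_clients)
print(ok, bad, [c.state for c in bad_clients])
```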
Estimating costs




Data shipping
Result shipping
Wait times on nodes
Integrity constraint enforcement
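
A toy comparison of data shipping and result shipping, under loudly stated assumptions: the cost formula (one message latency plus transfer time) and every number below are invented for illustration, and wait times on nodes and constraint enforcement are ignored entirely.

```python
def shipping_cost(rows, bytes_per_row, bytes_per_second, latency_seconds):
    """Toy estimate in seconds: one message latency plus transfer time."""
    return latency_seconds + rows * bytes_per_row / bytes_per_second


# Data shipping: move a whole 1,000,000-row table to the querying site.
data_shipping = shipping_cost(1_000_000, 100, 10_000_000, 0.05)

# Result shipping: run the selection remotely and move 1,000 result rows.
result_shipping = shipping_cost(1_000, 100, 10_000_000, 0.05)

print(data_shipping, result_shipping)
```

With these made-up figures, shipping the result is far cheaper, but a selection that keeps most of the rows would narrow the gap.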
Heterogeneous distributed databases

Forms of heterogeneity









Model
Schema
Database product
Namespace
Table structure (implications for object identities)
Keys and Foreign keys
Units
SQL dialect
Semantic issues relating to varying interpretations of data
Integrating heterogeneous databases





After the fact
Stability is never achieved
Mappings are complex
Data may have conflicts, redundancy, and gaps
Closed world vs. open world
Engineering for nonstop change




Mediators around databases
Gateways connecting old apps and new databases
Gateways connecting new apps and old databases
A stability of instability
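
A mediator can be sketched as a thin layer that maps each source into one shared form. The example below is hypothetical throughout: two in-memory sources disagree on key names and on units, and the mediator reports a conflict when they disagree about the same item.

```python
# Two hypothetical sources describing the same kind of entity differently.
source_a = [{"part_id": "P1", "weight_kg": 2.0}]
source_b = [{"sku": "P2", "weight_lb": 4.4}, {"sku": "P1", "weight_lb": 4.5}]

LB_PER_KG = 2.20462


def from_a(row):
    """Map source A's schema onto the shared one."""
    return {"id": row["part_id"], "weight_kg": row["weight_kg"], "source": "a"}


def from_b(row):
    """Map source B's schema onto the shared one, converting pounds to kg."""
    kg = round(row["weight_lb"] / LB_PER_KG, 2)
    return {"id": row["sku"], "weight_kg": kg, "source": "b"}


def mediate():
    """Merge both sources by id; collect records that disagree on weight."""
    merged, conflicts = {}, []
    records = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
    for rec in records:
        seen = merged.get(rec["id"])
        if seen is None:
            merged[rec["id"]] = rec
        elif abs(seen["weight_kg"] - rec["weight_kg"]) > 0.01:
            conflicts.append((seen, rec))
    return merged, conflicts


merged, conflicts = mediate()
print(sorted(merged), len(conflicts))
```

Each new source needs its own mapping function, which is one reason the mappings grow complex and never quite settle.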
OLAP

Standard model






N dimension tables
1 fact table (PK is the combination of the dimension tables' keys)
Hypercube visualization
Multidimensional table result visualizations
Star and constellation schemas
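
A minimal star schema, sketched with sqlite3. The three dimension tables, the fact table, and all data are invented. The fact table's primary key is the combination of the dimension keys, as in the standard model above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT, year INTEGER);

-- The fact table sits at the center of the star.
CREATE TABLE fact_sales (
  product_id INTEGER REFERENCES dim_product,
  store_id   INTEGER REFERENCES dim_store,
  date_id    INTEGER REFERENCES dim_date,
  amount     REAL,
  PRIMARY KEY (product_id, store_id, date_id)
);

INSERT INTO dim_product VALUES (1, 'pen', 'office'), (2, 'mug', 'kitchen');
INSERT INTO dim_store   VALUES (1, 'Austin', 'south'), (2, 'Boston', 'north');
INSERT INTO dim_date    VALUES (1, '2024-01', 2024), (2, '2024-02', 2024);
INSERT INTO fact_sales  VALUES
  (1, 1, 1, 10.0), (1, 2, 1, 20.0), (2, 1, 2, 5.0), (2, 2, 2, 7.0);
""")

# A typical query joins the fact table to the dimensions it needs.
totals = con.execute("""
  SELECT p.category, s.region, SUM(f.amount)
  FROM fact_sales f
  JOIN dim_product p USING (product_id)
  JOIN dim_store   s USING (store_id)
  GROUP BY p.category, s.region
  ORDER BY p.category, s.region
""").fetchall()
print(totals)
```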
Terminology



Drilling down – stepping down nested attributes
Rolling up – moving up nested attributes
Pivot – group by
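
The three terms can be sketched as GROUP BY queries at different levels. This assumes a time hierarchy of year over month and reads "pivot" as regrouping on a different dimension. The sales table and its data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (year INTEGER, month INTEGER, region TEXT, amount REAL);
INSERT INTO sales VALUES
  (2023, 1, 'north', 10), (2023, 2, 'north', 20),
  (2024, 1, 'north', 30), (2024, 1, 'south', 40);
""")

# Rolled up: totals at the coarser level of the hierarchy (year).
rolled_up = con.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# Drilled down: the same totals one level finer (year, then month).
drilled_down = con.execute(
    "SELECT year, month, SUM(amount) FROM sales "
    "GROUP BY year, month ORDER BY year, month"
).fetchall()

# Pivoted: the same facts grouped on a different dimension (region).
pivoted = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(rolled_up, drilled_down, pivoted)
```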
Specialized operators


Cube operator and 4 equivalent queries
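
Assuming the slide refers to a cube over two dimensions, the operator's result matches the union of four GROUP BY queries: by both dimensions, by each one alone, and by neither. SQLite has no CUBE operator, so the sketch below writes out the four queries by hand. The table and data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (product TEXT, region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('pen', 'north', 10), ('pen', 'south', 20),
                         ('mug', 'north', 5);
""")

# NULL marks a dimension that has been aggregated away.
cube = con.execute("""
  SELECT product, region, SUM(amount) FROM sales GROUP BY product, region
  UNION ALL
  SELECT product, NULL,   SUM(amount) FROM sales GROUP BY product
  UNION ALL
  SELECT NULL,    region, SUM(amount) FROM sales GROUP BY region
  UNION ALL
  SELECT NULL,    NULL,   SUM(amount) FROM sales
""").fetchall()

for row in cube:
    print(row)
```

In general a cube over n dimensions corresponds to 2^n groupings, which is why a dedicated operator is convenient.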
Viewing results


See page 722
Equivalent – see 723
Populating the warehouse



Transformation
Integration
Cleaning
Data mining



Effectively an open world application
Association, classification, clustering – page 730
Association – confidence and support – page 731
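
Support and confidence can be computed directly from their usual definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X -> Y is support(X and Y together) divided by support(X). The basket data below is invented.

```python
# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)


print(support({"bread", "milk"}))       # bread and milk together
print(confidence({"bread"}, {"milk"}))  # rule: bread -> milk
```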