Chapter 25: Distributed Databases

Download Report

Transcript Chapter 25: Distributed Databases

Chapter 25: Distributed
Databases
Definitions
• Distributed Database – a collection of of
multiple logically interrelated databases
distributed over a computer network
• Distributed Database Management
System – A software system that
manages a distributed database while
making he distribution transparent to the
user.
Motivations for Distributed DBs
•
•
•
•
No centralized point of failure.
Local Autonomy.
There’s a whole lot of data out there to store.
Replication of Data for Disaster Recovery and
High Availability (think RAID on a network)
• High-throughput query processing (either interquery or intra-query parallelism), dynamic loadbalancing,
• Poor people can’t afford supercomputers.
Drawbacks of DDBs:
•
•
•
•
Security
Increased complexity of Database Design
Increased complexity of Software
Data integrity and resolution of concurrent
operations.
• Cost (But if you’re big enough to need one, you probably can afford one?)
Transparency:
Transparency of Data:
– Location Transparency – A command works the same no matter
where in the system it is issued
– Naming Transparency – We can refer to data by the same name,
from anywhere in the system, with no further specification.
– Replication Transparency – Hides multiple copies of data from
user
– Fragmentation Transparency – Hide the fact that data is
fragmented (ie, different sections of correlated data may be in
different locations)
Two Fundamental Patterns for
Fragmenting Data
• Horizontal – Store Whole Tuples on
Different machines.
– Nice because we can use standard relational algebra
statements to define a restriction on a relation that
creates these:

s”new york” (City)
s “chicago” (City)
(Do we need to know all possible values for City in order to fully
specify a fragmentation.)
• Vertical – Store Different Fields of the
same tuples on Different machines.
– Use Projection Op to declare these:
P (Acct #, Branch, Client Name Account)
P (Acct #, Balance Account)
(Notice this requires redundant storage of at
least one primary key per tuple)
Redundant / Non-Redundant
Allocations:
• Full Replication (Completely Redundant)
– Good read time, good recoverability
– Requires more coordination for multiple
writers on same data, hogs disk space
• No Replication (Non-Redundant)
– Easier to coordinate multiple writers, multiple
readers. But no backup in case of disaster.
• Partial Replication
– Trade-off between the above two options.
Global Directory
• Global Centralized
(Why have a DDBMS at all if you’re
going to do this?)
• Dispersed or no Global Directory
• Completely Replicated
• Local-Master Directory
– Each node has its own catalog of data
– Each node has a directory to all of its data
that is replicated elsewhere.
Each database in a distributed database is distinct
from all other databases in the system and has its
own global database name.
Name Resolution in Oracle8
• Every data object in every schema in every database
has a unique identifying name:
– SELECT * FROM
[email protected]_auto.com;
• A remote query is a query that selects
information from one or more remote tables, all
of which reside at the same remote node. For
example:
– SELECT * FROM
[email protected]_auto.com;
Remote and Distributed SQL
Statements in Oracle8
• A remote update is an update that modifies data
in one or more tables, all of which are located at
the same remote node. For example:
– UPDATE
[email protected]_auto.com SET
loc = 'NEW YORK' WHERE deptno = 10;
• A distributed query retrieves information from
two or more nodes. For example:
– SELECT ename, dname FROM scott.emp,
[email protected]_auto.com d
WHERE e.deptno = d.deptno;
• A distributed update modifies data on two or
more nodes. A distributed update is possible
using a PL/SQL subprogram unit, such as a
procedure or trigger, that includes two or more
remote updates that access data on different
nodes. For example:
BEGIN
UPDATE [email protected]_auto.com
SET loc = 'NEW YORK' WHERE deptno = 10;
UPDATE scott.emp
SET deptno = 11 WHERE deptno = 10;
END;
2-Phase Commit Process
• Easy to trigger with the COMMIT directive.
• The Recoverer (RECO) background
process on each server involved in the
transaction coordinates to resolve any indoubt transactions.
• All RECOs either commit or roll-back the
change in a consistent manner.
Chapter 28: Data Warehousing and
OLAP
• Data Warehousing “a subject-oriented,
integrated, nonvolatile, time-variant
collection of data in support of
managements decisions”
• Decision Support Systems or Executive
Information Systems
• Online Analytical Processing (OLAP)
analysis of complex data from a data
warehouse
Data Warehouses
• Optimized for providing general
information about large data sets instead
of explicit information about individual data
records
• Multidimensional Matrices called Data
Cubes (or hypercubes)
• Efficient storage, data marts, distributed
DW, federate DW
Some steps in Data Acquisition
1. Data is extracted (from multiple heterogeneous
sources)
2. Data must be formatted
3. Data must be cleaned (the most involved step)
Data can be backflushed to its source after
cleaning.
4. Data must be converted from its source
(relational, OO, hierarchical) to the DW’s
multidimensional scheme.
5. The data must actually be loaded.
Basic Operations
•
•
•
•
Pivot (rotate)
Roll-Up (grouping)
Drill-Down (subdivision)
Slice and dice: Perform projection
operations on dimensions.
• Sort (data, by some criteria)
• Select Data (by value or range)
Chunk-Offset
Compression
Only stores the addresses
and data for valid cells in
each chunk in a
(offset, cellValue) format
Heum-Geun Kang and Chin-Wan Chung, Exploiting versions for on-line data
warehouse maintenance in MOLAP servers, VLDB, 2002.
Bitmap Indexing
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
0
0
0
0
1
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
Multidimensional Schema
• Components:
– Dimension Tables – tuples of attributes of the
dimension
– Fact Table – Holds tuples that correspond to recorded
facts.
• Patterns:
– Star Schema – A single table for each dimension.
– Snowflake Schema – obtained by normalizing a star
schema, creating a new hierarchy of multiple
dimensional tables
– Fact Constellation – A set of fact tables that share
some dimension tables
Data Warehousing vs. Materialized
Views
• DWs exist as persistent storage instead of being
materialized on-demand.
• DWs are multidimensional, not relational. Views
of a relational database are relational.
• DWs can be indexed to optimize performance.
Views are dependant on the structure of the
underlying database.
• DWs contain compositions of data collected from
multiple datasources. Views are derived from a
single database.