Advanced Databases CG096
Lecture 10: Distributed Databases –
Replication and Fragmentation
Nick Rossiter
1
Overview
Last week:
Saw difficulty in handling logical relationships
between distributed information
Potential solutions such as federated DDBMS
This week:
Look at an area where distributed databases are
extensively used
replication
for backup
for improving reliability of service
such as for a mirror site
2
Strategies for Data Allocation 1
Centralised
Single database, users distributed across network
High communication costs
All data access by users over network
No local references
Low reliability and low availability
Failure of central site leads to no access to entire
database system
Storage costs
No duplication so minimal
Performance
Likely to be unsatisfactory
3
Strategies for Data Allocation 2
Fragmented
Database distributed by fragments (disjoint views)
Low communication costs
Reliability and availability vary depending on failed site
Failure of one part loses fragments situated there
Other fragments continue to be available
Storage costs
Fragments located near their main users (if good design)
No duplication so minimal
Performance
Likely to be satisfactory – better than centralised as less
network traffic
4
Strategies for Data Allocation 3
Complete Replication
Database completely copied to each site
Communication costs
High for update, low for read
Need to propagate updates through system
High reliability and high availability
Can switch from failed site to another
High storage costs
Complete duplication
Performance
High for reads
Potentially poor for updates with propagation of updates
5
Strategies for Data Allocation 4
Selective Replication
Fragments are selectively replicated
Communication costs
Low (if good design)
Reliability and availability vary depending on failed site
Failure of one part loses fragments situated there
Other fragments continue to be available
Storage costs
Duplication of some fragments means that it is not
minimal but less than with complete replication
Performance
Likely to be satisfactory – better than centralised as less
network traffic
6
Fragmentation -- Further Details
A fragment is a view on a table.
Two main types
Horizontal (classification by value)
subset of tuples obtained by restrict operation
(algebra) or WHERE clause (SQL)
Vertical (classification by property)
subset of columns obtained by project operation
(algebra) or SELECT clause (SQL)
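As a minimal sketch in SQL (a Staff table with columns staffNo, name, salary and branchNo is assumed purely for illustration), the two types can be defined as views:

-- Horizontal fragment: restrict (WHERE) keeps a subset of rows
CREATE VIEW Staff_London AS
SELECT staffNo, name, salary, branchNo
FROM Staff
WHERE branchNo = 'B003';

-- Vertical fragment: project (SELECT list) keeps a subset of columns,
-- always retaining the primary key
CREATE VIEW Staff_Pay AS
SELECT staffNo, salary
FROM Staff;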
7
Other Forms of Fragmentation
Mixed (classification by both value and property)
both horizontal and vertical fragmentation are
used to obtain a single fragment
Derived (association)
an expression such as a join connects the
fragments
None
The whole of a table appears without change
in a view
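Continuing the assumed Staff example above (a PropertyForRent table with a staffNo column is also assumed for illustration), a mixed fragment combines a restriction with a projection, while a derived fragment of one table is defined through a join with a fragment of another:

-- Mixed fragment: restrict by value and project by property
CREATE VIEW Staff_London_Pay AS
SELECT staffNo, salary
FROM Staff
WHERE branchNo = 'B003';

-- Derived fragment: the properties handled by staff in the
-- Staff_London fragment, connected by a join
CREATE VIEW Property_London AS
SELECT p.*
FROM PropertyForRent p
JOIN Staff_London s ON p.staffNo = s.staffNo;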
8
Why fragment?
Most applications use only part of the data
in a table
To minimise network traffic, do not send
more data than is strictly necessary to any
site
Data not required by an application is not
visible to it, enhancing security
9
Factors against fragmentation
Performance
may be affected adversely by the need for
some applications to reconstruct fragments
into larger units
Integrity
more difficult to control with dependencies
possibly scattered across fragments
10
Three rules for fragmentation R1
R1) Completeness
If a table T is decomposed into fragments
every value found in T must be found in at least
one of the fragments
Otherwise data is lost
So no loss of data as a whole in
fragmentation
11
Three rules for fragmentation R2
R2) Reconstruction
It must be possible to reconstruct T from the
fragments using a relational operation
(typically a natural join)
Otherwise decomposition into fragments is
lossy
Functional dependencies are preserved
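A minimal illustration of R1 and R2 on the assumed Staff fragments (complementary vertical fragments Staff_Pay(staffNo, salary) and Staff_Names(staffNo, name, branchNo), and complementary horizontal fragments Staff_London and Staff_NotLondon, are assumed): horizontal fragments are reconstructed by a union, vertical fragments by a join on the shared primary key.

-- Vertical reconstruction: both fragments carry the primary key
-- staffNo, so the join is lossless
SELECT n.staffNo, n.name, p.salary, n.branchNo
FROM Staff_Names n
JOIN Staff_Pay p ON p.staffNo = n.staffNo;

-- Horizontal reconstruction: the union of the disjoint row subsets
SELECT * FROM Staff_London
UNION ALL
SELECT * FROM Staff_NotLondon;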
12
Three rules for fragmentation R3
R3) Disjointness
A data item may not appear in more than one
fragment unless it is a component of a
primary key
Avoids duplication and potential
inconsistency
although transaction management should prevent the latter
Primary key duplication allows
reconstructions to be made
13
Strategy for Designing a Partially
Replicated Distributed Database 1
Design global database using standard
methodology
Examine regional distribution of business.
What data should be held by each part of
business?
Some data is only used locally (not exported, as in
Federated DDBMS)
Some data is mostly used locally
14
Strategy for Designing a Partially
Replicated Distributed Database 2
Transactions give many clues as to ideal
placement of fragments
a transaction will perform slowly if it requires data
from different sites, unless the network connecting
them is very fast
a transaction performing much replication of updates
will perform slowly if there is frequent contention for
resources (locking)
frequently used transactions should be optimised;
infrequently used ones can be ignored
15
Strategy for Designing a Partially
Replicated Distributed Database 3
Decide on which relations are not to be
fragmented. These will normally be
replicated everywhere:
as they are then easy to update and integrity is easy to maintain.
Fragment remaining relations to suit:
locality
transactions
16
Transparencies in DDBMS
Transparency hides details at lower levels
(often implementation ones) from user
Four main types:
Distribution
Transaction
Performance
DBMS
17
Distribution Transparency
The DDB is perceived by the user as a
single, logical unit even though the data is:
distributed over several sites
fragmented in various ways
18
Significance of Full Distribution
Transparency
User does not need to know anything about
the distribution techniques
User addresses global schema in queries
User will, however, not understand why
some queries take longer than others
Highest form of distribution transparency is
termed
fragmentation transparency
19
Reduced forms of distribution
transparency
Location transparency
user needs to know about fragmentation but
not about placements at sites
user does not need to know which
replicas exist
Local mapping transparency
the most limited transparency
user needs to know about fragmentation and
sites
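As a rough sketch of the three levels, assuming the Staff table is fragmented by branch with a Staff_London fragment held at a site reached by a hypothetical database link london_site, the levels differ in what the query must name:

-- Fragmentation transparency: query the global schema only
SELECT name FROM Staff WHERE branchNo = 'B003';

-- Location transparency: name the fragment, but not its site
SELECT name FROM Staff_London;

-- Local mapping transparency: name both fragment and site
-- (Oracle database-link style; syntax varies by product)
SELECT name FROM Staff_London@london_site;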
20
Transaction Transparency
Ensures that all transactions maintain the
DDB’s integrity and consistency
Each transaction is divided into
subtransactions
one subtransaction for each site
usually execute subtransactions in parallel
gains in efficiency
More complicated than in centralised
system
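A rough sketch of one logical transaction split into subtransactions, using Oracle-style database links to two hypothetical sites (Account, site_A and site_B are assumed names); the DDBMS coordinates the single COMMIT across both sites:

-- Each UPDATE becomes a subtransaction at the site holding the data
UPDATE Account@site_A SET balance = balance - 100 WHERE accNo = 1;
UPDATE Account@site_B SET balance = balance + 100 WHERE accNo = 2;

-- One COMMIT covers both subtransactions (two-phase commit underneath),
-- so either both sites apply the change or neither does
COMMIT;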
21
Forms of Transaction
Transparency
Concurrency Transparency
all concurrent transactions (centralised and
distributed) execute independently
DDBMS must ensure that:
each subtransaction is executed in the normal
spirit of transactions (ACID)
the subtransactions as a whole, forming one
transaction, are executed ACID-style
the mixture of subtransactions and whole
transactions is executed ACID-style
22
Transactions -- problems with
replication
Failure Transparency
Users are unaware of problems, such as the one below,
encountered during transaction execution
If say 6 copies of a data item (at 6 sites) need to
be updated:
problems if only 5 are currently reachable
need to delay COMMIT until all sites are processed
otherwise data is inconsistent
unless delayed asynchronous update is allowed
23
Performance transparency
Requires:
the DDBMS to determine the most cost-effective way to handle a request
which fragment to use
(if replicated) which copy of a fragment to use
which site to use
avoidance of any performance degradation
compared with a centralised system
24
DBMS transparency
Hides knowledge of which DBMS is being
used
The most difficult transparency of all
particularly with heterogeneous models
See problems highlighted in lecture 9:
Global Schema Integration
Federated Databases
Multidatabase Languages
25
Replication Servers
Copying and maintenance of data on
multiple servers
Replication -- the process of generating and
reproducing multiple copies of data at one or
more sites
Servers – provide the file resources – the
distributed database
26
Benefits of Replication
Increased reliability
Better data availability
Potential for better performance (with good
design)
Warm stand-by
As in a mirror site, shadowing the actions of the main
site and cutting in if the main site crashes
27
Timing of Replication
Synchronous
Immediate according to some common signal such as
time
Ideal as it ensures immediate consistency
Assumes availability of all sites
Asynchronous
Independently with delays ranging from a few
seconds to several days
Immediate consistency is not achieved
More flexible as at any one time not all sites need to
be available
28
Types of data replicated
Across heterogeneous data models
Object replication
Mapping required (hard)
More varied than just base data
Also auxiliary structures such as indexes
Stored procedures and functions
Scalability
No volume restrictions
29
Replication administration
Subscription mechanism
Allows a permitted user to subscribe to
replicated data/objects
Initialisation mechanism
Allows for the initialisation of a target
replication
30
Ownership of Replicated Data 1
Master/Slave
Master site
Primary owner of replicated data
Sole right to change data
Publish and subscribe procedure
Asynchronous replication as slave sites receive
copies of the data
Slave site
Receives read-only data from the master site
Slaves can be used as mobile clients
31
Ownership of Replicated Data 2
Workflow Ownership
Flexible master designation
Dynamic ownership model
Right to update data moves along the chain of
command (replicating sites)
For example, as order is processed the master
right moves to each department in turn
32
Ownership of Replicated Data 3
Update-anywhere
Peer-to-peer model
Multiple sites can update data
Conflict resolution required
More complex implementation
33
Distribution and Replication in
Oracle 9i
Materialised views
Formerly known as snapshots
Views are updated by
Refresh mechanism
Variable frequency to suit application
Fast – based on identified changes
Complete – replaces existing data
Force – tries Fast; if not possible, does
Complete
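An illustrative Oracle-style materialised view over a database link (staff_copy, Staff and head_office are assumed names; the hourly schedule is an assumption for the sketch):

-- Replicate the remote Staff table to the local site, refreshing
-- asynchronously: FORCE tries a FAST (incremental) refresh and
-- falls back to COMPLETE when that is not possible
CREATE MATERIALIZED VIEW staff_copy
REFRESH FORCE
START WITH SYSDATE NEXT SYSDATE + 1/24  -- roughly every hour
AS SELECT * FROM Staff@head_office;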
34
Oracle 9i transparency
Does not support
Fragmentation transparency
Supports
Site (location) transparency
35
Summary of Distributed DBMS
An area under keen development as it improves:
Availability of data
Overall reliability of system
Performance (with good design)
However, disadvantages remain:
Implementation can be complex (expensive)
Heterogeneity in models is poorly handled
Use for replicating data is the main application today
36