Reference Book Principles of Distributed Database System

Download Report

Transcript Reference Book Principles of Distributed Database System

Reference Book
Principles of Distributed Database
System
Chapters
Chapter 12: Distributed DBMS Reliability
Chapter 14: Distributed Object Database
Management Systems
Chapter 16: Current Issues
Preethi Vishwanath
Week 2 : 12th September 2006 –24th September 2006
Reliability concepts - definitions
System refers to a mechanism that consists of a collection of
components and interacts with its environment with a recognizable
pattern of behavior.
Each component of a system is itself a system, commonly called a
subsystem.
The way components of a system are put together is called the design
of the system.
An external state of a system can be defined as the response that a
system gives to an external stimulus.
The behavior of the system in providing response to all the possible
stimuli from the environment needs to be laid out in an authoritative
specification of its behavior.
Any deviation of a system from the behavior described in the
specification is considered a failure.
Some transactions could cause system failure, such internal states are
called erroneous states.
Any error in the internal states of the components of a system or in the
design of a system is called a fault in the system.
A permanent fault, also called a hard fault, is one that reflects an
irreversible change in the behavior of the system.
Reliability
– Reliability refers to the
probability that the system
under consideration does not
experience any failures in a
given time interval.
– R(t) = Pr{0 failure in time [0,t]
no failures at t = 0}
where R(t)
: reliability of the system
Availability
– Refers to the probability that
the system is operational
according to its specification at
a given point in time t
– A=µ/‫ح‬+µ
where ‫ ح‬is a failure rate
µ is a mean repair time
Mean Time between Failures
– Is the expected time between
subsequent failures in a
system with repair.
– Can be calculated either from
empirical data or from the
reliability function
– Is related to the failure rate
– MTBF = ∫∞0 R(t) dt
Mean Time to repair
– Expected time to repair a
failed system.
– Is related to the repair rate
Steady State availability of a
system with exponential failure
and repair rates can be
specified as
A = MTBF/(MTBF + MTTR)
Reasons for Failure
SE SS Sw i t c h D a t a
S t udy Conduc t e d a t S t a nf or d Li ne a r
a c c e l e r a t or
Un kn own
Envi r onment
Har dwar e
Oper ati ons
Sof t war e
Har dwar e
Oper at ion s
Sof twar e
Ta nde m D a t a
E nv i r onment
har dwar e
s of t war e
mai nt ai nenc e
oper at i ons
Fault tolerance
–
Refers to a system design approach which recognizes that faults will occur
Fault prevention/Fault intolerance
–
–
Aim at ensuring that the implemented system will not contain any faults
Two aspects
Fault avoidance
–
–
Refers to the techniques used to make sure that faults are not introduced into the system
Involve detailed design methodologies such as design walkthroughs, design inspections etc..
Fault removal
–
Refers to the techniques that are employed to detect any faults that might have remained in the system despite the
application of fault avoidance and removed these faults.
Fault detection
–
Issue a warning when a failure occurs but do not provide any means of tolerating the failure.
Latent Failure
–
One that is detected some time after its occurrence
Mean time to detect
–
Average error latency time over a number of identical systems.
Fail-stop modules
–
Constantly monitors itself and when it detects a fault, shuts itself down automatically
Fail-fast
–
Implemented in software by defensive programming, where each software module checks its
own state during state transactions.
Different ways of implementing process pairs
–
–
–
–
–
Lock-step
Automatic check pointing
State check pointing
Data check pointing
Persistent process pairs
Failure in Distributed DBMS
Site(System) Failures
– Always assumed to result in
the loss of main memory
contents.
– Total failures, refers to the
simultaneous failure of all sites
in the distributed system.
– Partial Failure indicates the
failure of only some sites while
the others remain operational.
Media Failures
– Refers to the failures of the
secondary storage devices
that store the database.
– Duplexing of disk storage and
maintaining archival copies of
the database are common
techniques that deal with this
sort of catastrophic problem.
Communication Failure
Transaction Failure
– Incorrect input data
– Detection of present or
potential deadlock
– Usual approach to take in
cases of transaction failure is
to abort the transaction.
– Unique to the distributed case.
– Most common ones are the
errors in the messages,
improperly ordered messages,
lost messages and line
failures
– The term for the failure of the
communication network to
deliver messages and the
confirmations within this
period is performance failure
Interface between the local recovery manager and the
buffer manager
Local Recovery
Manager
Stable
database
Database Buffer
Manager
Database
Buffers
(Volatile
Database)
Recovery Information
In-Place Update Recovery
Information
– Necessary to store info about
database state changes, inorder
to recover back.
– Recorded in the database log
– REDO Action
Database needs to include
sufficient data to permit the undo
by taking the old database state
and recover the new state
– UNDO Action
Database needs to include
sufficient data to permit the undo
by taking the new database state
and recover the old state.
Out-of-place update recovery
information
Network Partitioning
–
Network is divided into only two
components
–
Multiple partitioning
Network is divided into more than two
components
Centralized Protocols
–
Primary Site
Makes sense to permit the operation of the
partition that contains the primary site,
since it manages the lock.
–
Primary copy
More than one partition may be operational
for different queries.
Voting-based Protocols
–
–
– Typical techniques
Shadowing
Every time an update is made, the
old stable storage page, called
shadow page is left intact and a
new page with the updated data
item values is written into the
stable database.
Differential files
Simple partition
–
Transactions are executed if a majority
of the sites vote to execute it.
Quorum-based voting can be used as a
replica control method, as well as a
commit method to ensure transaction
atomicity in the presence of network
partitioning.
In case of non replicated databases,
this involves the integration of the
voting principle with commit protocols.
2 Phase Commit Protocol
The two phase commit protocol is a distributed algorithm which lets all
sites in a distributed system agree to commit a transaction.
The protocol results in either all nodes committing the transaction or
aborting, even in the case of site failures and message losses.
Basic Algorithm
–
1.
2.
3.
4.
Commit-request phase
The coordinator sends a query to commit message to all cohorts.
The cohorts execute the transaction up to the point where they will be
asked to commit. They each write an entry to their undo log and an entry
to their redo log.
Each cohort replies with an agreement message if the transaction
succeeded, or an abort message if the transaction failed.
The coordinator waits until it has a message from each cohort
1.
2.
3.
4.
5.
Commit phase
– Success
If the coordinator received an agreement message from all cohorts
during the commit-request phase:
The coordinator writes a commit record into its log.
The coordinator sends a commit message to all the cohorts.
Each cohort completes the operation, and releases all the locks and
resources held during the transaction.
Each cohort sends an acknowledgement to the coordinator.
The coordinator completes the transaction when acknowledgements
have been received.
– Failure
1. If any cohort sent an abort message during the commit-request
phase:
2. The coordinator sends a rollback message to all the cohorts.
3. Each cohort undoes the transaction using the undo log, and
releases the resources and locks held during the transaction.
4. Each cohort sends an acknowledgement to the coordinator.
5. The coordinator completes the transaction when
acknowledgements have been received.
3 Phase Commit
Non blocking when failures are restricted to site failures
A commit protocol that is synchronous within one state
transition is nonblocking if and only if its state transition
diagram contains neither of the following.
– No state that is “adjacent” to both a commit and an abort state.
– No noncommittal state that is “adjacent” to a commit state.
Replication and Replica Control Protocols
Having replicas of data items improves system
availability.
Advantages
– With careful design, it is possible to ensure that single points of
failure are eliminated
– Overall system availability is maintained even when one or more
sites fail.
Disadvantages
– Whenever updates are introduced, the complexity of keeping
replicas consistent arises and this is the topic of replication
protocols.
Concepts
Object
– Represents a real entity in the
system
– Represented as a pair (object
Identity, state)
– Enables referential object sharing.
State
– Either an atomic value or a
constructed value
Value
– An element of D is a value, called an
atomic value
– [a1:v1,…,an:vn], in which ai is an
element of A and vi is either a value
or an element of I, is called a tuple
value.
– {v1,..,vn}, in which vi is either a value
or an element of I, is called a set
value.
Class
– Grouping of common objects
– Template for all common objects
Inheritance
– Declaring a type to be a subtype of
another.
Abstract Data Types
– Template for all objects of that type.
– Describes type of data by providing a
domain of data with the same
structure, as well as operations
applicable to the objects of that
domain.
– Abstraction capability commonly
referred as encapsulation.
Composition (Aggregation)
– Restriction on composite objects
results in complex objects
– The composite object relationship
between types can be represented by
a composition graph.
Collection
– User defined grouping of objects
– Similar to class in that it groups
objects.
Subtyping
– Based on specialization relationship
among types.
Object Distribution Design
Path partitioning
– A concept describe the clustering
of all the objects forming a
composite object into a partition.
– Can be represented as a
hierarchy of nodes forming a
structural index.
– Index contains the references to
all the component objects of a
composite object, eliminating the
need to traverse the class
composition hierarchy.
Class Partitioning Algorithms
– Main issue is to improve the
performance of user queries and
applications by reducing the
irrelevant data access.
– Affinity based approach
Affinity among instance variables
and methods and affinity among
multiple methods can be used for
horizontal and vertical class
partitioning.
– Cost-Driven Approach
Allocation
– Local behavior-local object
Behavior, the object to which it is
applied, and the arguments are all
co-located.
No special mechanism needed to
handle this case.
– Local behavior-remote object
Behavior, the object to which it is
applied, and the arguments are all
co-located.
Two ways to deal
– Move th remote object to the site
where the behavior is located.
– Ship the behavior implementation
to the site where the object is
located
Client-Server Architecture
Object
Database
Object
Database
Cache Consistency
Problem in any data shipping system that moves data to the clients.
Cache consistency algorithms
– Avoidance-based synchronous algorithms
Clients retain read locks across transactions, but they relinquish write locks at the end
of the transaction.
The client send lock requests to the server and they block until the server responds.
If the client requests a write lock on a page that is cached at other clients.
– Avoidance-based asynchronous algorithms
Do not have the message blocking overhead present in synchronous algorithms.
Clients send lock escalation messages to the server and continue application
processing
– Avoidance-based deferred algorithms
Clients batch their lock escalation requests and send them to the server at commit time.
The server blocks the updating client if other clients are reading the updated objects.
– Detection-based synchronous algorithms
Clients contact the server whenever they access a page in their cache to ensure that
the page is not stale or being written to by other clients.
– Detection-based asynchronous algorithms
Clients send lock escalation requests to the server, but optimistically assume that their
requests will be successful.
After a client transaction commits, the server propagates the updated pages to all the
other clients that have also cached the affected pages.
– Detection-based deferred algorithms
Can outperform callback locking algorithms even while encountering a higher abort rate
if the client transaction state completely fits into the client cache, and all application
processing is strictly performed at the clients.
Object Identifier Management
Object Identifiers are system generated
Used to Uniquely identify every object
Transient object identity can be implemented more
efficiently
Two common solutions
– Physical Identifier approach (POID)
Equates the OID with the physical address of the corresponding
object
Advantage , the object can be obtained directly from the OID.
Drawback, all the parent objects and indexes must be updated
whenever an object is moved to a different page.
– Logical Identifier approach (LOID)
Consists of allocating a system wide unique OID.
Since OIDs are invariant, there is no overhead due to object
movement.
Object Migration
Three alternatives can be
considered for the migration of
classes (types)
– The source code is moved and
recompiled at the destination
– The compiled version of a class is
migrated just like any other object,
or
– The source code of the class
definition is moved, but not its
compiled operations, for which a
lazy migration strategy us used.
Objects can be in one of the four
states
– Ready,
Ready objects are not currently
invoked, or have not received a
message, but are ready to be
invoked to receive a message.
– Active
Active objects are currently
involved in an activity in response
to an invocation or a message
– Waiting
Waiting objects have invoked
another object and are waiting for
a response.
– Suspended
Suspended objects are
temporarily unavailable for
invocation.
Migration involves two steps
– Shipping the object from the
source to the destination, and
– Creating a proxy at the source,
replacing the original object.
Object Clustering
– Difficult for two reasons
Not orthogonal to object identity implementation. Logical
OIDs incur more overhead , but enable vertical partitioning of
classes.
Clustering of complex objects along the composition
relationship is more involved because of object sharing .
– Given a class graph, there are three basic storage
models for object clustering
The decomposition storage model, partitions each object
class in binary relations.
The normalized storage model stores each class as a
separate relation.
The direct storage model enables multi-class clustering of
complex objects based on the composition relationship.
Distributed Garbage Collection
– As programs modify objects and remove references, a persistent
object may become unreachable from the persistent roots of the
system when there is no more reference to it.
– Basic garbage collection algorithms can be categorized
reference counting
– In reference counting, each object has an associated count o reference
– Each time a program creates an additional reference that points to an
object, the object’s count is incremented.
– When reference to an object is destroyed, the corresponding count is
decremented.
tracing-based.
– Mark and sweep algorithms
Two phase algorithms
First phase, mark phase, starts from the root and marks every
reachable object
Once all live objects are marked, the memory is examined and
unmarked objects are reclaimed.
– Copy-based algorithms
Divide memory into two disjoint areas
From-space, Programs manipulate from this space
To-space, left empty
Object Query Processing – Important issues
Object Query Processor
Architectures
–
Open OODB project
Separation between the user
query language parsing
structures and the operator
graph on which the optimizer
operates
–
EPOQ project
Approach to query optimization
extensibility, where the search
space is divided into regions
–
TIGUKAT project
Uses an object approach to
query processing extensibility
Is an extensible uniform
behavioral model characterized
by a purely behavioral semantics
and a uniform approach to
objects.
Query Processing Issues
–
–
Search space and transformation
rules
Search Algorithm
– Cost Function
Can be defined recursively based
on the algebraic processing tree.
– Parameterization
– Path Expression
– Rewriting and Algebraic
Optimization
– Path Indexes
Query Execution
– Path Indexes
Algorithms
1. Create an index on each class
traversed
2. Define indexes on objects across
their type inheritance
3. Access support relations, is a data
structure that stores selected path
expression.
– Set Matching
Algorithms
1. Centralized Algorithms
2. Join execution algorithm
Data Delivery alternatives
Architecture of a Data
Warehouse
Pull-only
– Transfer of data from servers to
clients is initiated by a client
pull.
– Arrival of new data items or
updates to existing data items
are carried out a server without
modification to clients unless
clients explicitly poll the server.
Push-only
– Transfer of data from servers to
clients is initiated by a server
push in the absence of any
specific request from clients.
Hybrid
– Combines the client-pull and
server-push mechanisms.
Query/Analysis
Reporting
Data Mining
Q
U
E
R
I
E
s
Target
Database
Integ
rate
Source
database
Metadata
repository
Semi structured Data
– Free and commercial database on product information etc, interfaces to
such sources, is typically a collection of fill-out forms.
– Typically modeled as a labeled graph
– A labeled graph are self-describing and have no schema.
– Object Exchange Model is used to illustrate such a labeled graph
A label which is the name of the object class
A type which is either atomic (integer, string etc.) or set
A value which is either atomic or a set of objects
An optional object identifier
Web
Server
Global
Data
Dictionary
Wrapper
Data
Source
Wrapper
Data
Source
Wrapper
Data
Source
Problems with Pull-based approach
– users need to know a priori where
and when to look for data.
– Mismatch between the
asymmetric nature of some
applications and the symmetric
communications infrastructure on
applications such as internet.
– Two types of asymmetry
Network asymmetry, network
bandwidth between client- server
different from server-client.
Distributed information systems,
due to imbalance between the
number of clients and the number
of servers.
Data, amount of data being
transferred between client and
server.
Data volatility
Why Push based technologies?
Response to some of the
problems inherent in pull-based
systems.
Algorithm – Push based approach
1.
2.
3.
4.
5.
Order the data items from hottest to
coldest
Partition the data items into ranges of
items, such that the items in each
range have similar application access
profiles. The number of ranges is
denoted by num_ranges.
Choose the relative broadcast
frequency for each range as integers
(rel_freqi, where i is the range).
Divide each range into smaller
elements, called chunks (Cij is the j-th
chunk of range i). Determine the
number of chunks into which range i is
divided as num_chunk, =
max_chunks/rel_freqi, where
max_chunks is the least common
multiple of rel_freqi,¥i.
Create the broadcast schedule by
interleaving the chunks of each range
using the following procedure.
for I from 0 to max_chunks-1 by 1 do
for j from 1 to max ranges by 1 do
Broadcast chunk Cj, (i mod
num_chunksj)
end-for
end-for
Difference between pull-based and push-based systems
–
–
Cache replacement policies
Prefetching mechanism
An idealized algorithm for page replacement is one which determines the
page with the smallest ratio between its probability of access and its
frequency of broadcast.
PIX algorithm, calculates the “cost” of replacing a page and replaces the
least costly one.
The operation of the algorithm is as follows:
When a page Pi is brought into cache and inserted into a chain.
Pri = 0, LTi = CurrentTime
2. When Pi is accessed again, it is moved to the top of its own chain and the following
caculations are made:
Pri = HF / (Current Time –LTi) + (1 – HF) * LTi , LTi = CurrentTime,
3. If a new page needs to be flushed out to open up space, a lix value is calculated
for the pages at the bottom of each chain and the page with the lowest lix value is
flushed out. The lix value is calculated as follows:
lixi = Pri/rel-freqi
where rel-freqi is the relative broadcast frequency of the range (disk) to
which that page Pi belongs.
1.