Investigating Distributed Database Systems

Download Report

Transcript Investigating Distributed Database Systems

Investigating Distributed
Database Systems
Challenges and Technology
Kishore Puppala Rao
Definitions

A database is a logically related collection
of data, stored in one or many files
 A distributed database is a collection of
multiple, logically interrelated databases
distributed over a computer network
Architecture

Client/server architectures
 Multiple clients, single server – this is the
most common and straightforward
implementation
 Multiple clients, multiple servers – more
flexible. DB distributed over multiple
servers. Each client directs requests to a
“home” server.
Architecture (cont’d)

DB is physically distributed by fragmenting
and replicating data (discussed later)
 Regardless of architecture, implementation
details of queries, transactions and DB
operations should be transparent to users.
Architecture (Peer-to-peer)

No distinction between client and server
 Each site has functionality of both client
and server
 E.g. File-sharing apps such as BearShare,
LiveWire
 Sophisticated protocols needed to manage
data distributed across multiple sites
Fragmentation

Partitions the data
 Subdivides each relation either vertically
(by project operation) or horizontally (by
selection operation)
 Facilitates the placement of data close to its
place of use, reducing transmission costs
Replication

Refers to duplication of data for access
and/or security purposes
 Fragments or whole database may be
replicated
 Replication involves keeping physical
separate copies of data at different sites
Distributed vs. Parallel

Distributed DBMS are not parallel DBMS,
although distinction may be unclear
 Distributed DBMS assume loose connection
between processors operating
independently, perhaps under different
operating systems
Parallel DBMS

Multiple processors under same operating
system.
 Architecture: Shared-none, shared-disk, or
shared memory
 Shared-Nothing: Each processor has
exclusive access to its main memory and
disk. Each processing element (PE) is a
local site.
Parallel DBMS (cont’d)

Shared-memory: Each PE has access to any
memory module or disk through some fast
connection (e.g. LAN or cross-bar switch)
 Shared-disk: Each PE has exclusive access
to its own memory, but shared access to any
disk via a fast connection. PE accesses DB
pages on shared disk and copy to local
cache
Transparency

Distributed (and Parallel) DBMS must
provide same functionality and consistency
of centralized DBMS.
 Transparency implies presenting a
consistent view that shields the user from
implementation details such as
fragmentation, replication, and distribution.
 Introduces major challenges
Challenges

Query processing and optimization
 Concurrency control
 Reliability protocols
 Replication protocols
Query Processing and
Optimization

Techniques needed to address difficulties arising
from data distribution and fragmentation.
Localization techniques employed.
 Algebraic queries on global relations are
transformed to operate on fragments
 Opportunities for parallel processing are identified
(fragments are stored at different sites),
unnecessary work is eliminated (not all fragments
may be involved in the query)
Query optimization

Determining the execution sites for
distributed operations
 Identifying the best distributed algorithm
for distributed operations
 Changing the order of operations in a query
Concurrency Control

Challenge in synchronizing user transactions is to
extend serializability and concurrency to the
distributed execution environment
 Serializability: The ability to perform a set of
operations in parallel with the same effect as if
they were performed in a certain sequence,
requires:
 (a) execution of the set of transactions at each site
must be serializable
 (b) the serialization orders of these transactions at
all these sites must be identical
Concurrency (cont’d)

If locking-based algorithms used, lock
management may be centralized or
distributed
 Deadlocks must be avoided
 Deadlock detection and management in a
distributed database can be difficult
Reliability protocols

Several types of failures: System, media,
transaction, communication
 May be difficult to differentiate type of
failure
 Distributed reliability protocols enforce
transaction atomicity (commit all or commit
nothing)
Reliability (cont’d)

E.g. of Atomic commitment protocol: Twophase commit
 All sites involved in the execution of a
distributed transaction must agree to
commit the transaction before it is made
permanent.
Replication protocols

Each logical data item has a number of physical
instances
 Challenge is to maintain (or approximate)
consistency among physical copies as user updates
logical data
 Example criterion: One-copy equivalence – All
physical copies of logical data should be
equivalent after being updated by a transaction
 Read-One/Write All (ROWA) protocol – enforces
one-copy equivalence. Disadvantage: failure of
one site may block entire transaction
Replication (cont’d)

Alternative algorithms relax ROWA by
mapping each write to a subset of the
physical copies
 Quorum-based voting: Copies are assigned
votes; read and (especially) write operations
have to collect votes and reach a quorum to
commit data. (see class notes)
Research and Trends

Workflow models (advanced transaction
models)
 Network scaling problems
 Multi-database systems and interoperability
 Distributed object management
Trends (cont’d)

Primitive objects are not simple-structured data.
Can consist of programs, voice, images, etc.
 Distributed DBMS must handle increasingly larger
data objects. E.g. 1MB storage needed for 1 digital
X-Ray image (1024x1024) @ 8 bits/pixel
 Most commercial DBMS (e.g. MS SQL Server
2000, Oracle 8i) provide some sort of distribution
 Emergence of broadband networks eliminates the
network as a bottleneck
Trends (cont’d)





Mobile computing is escalating in interest and
prevalence
Mobile stations may download data as needed
Alternatively, more powerful mobile stations may
store native data for sharing with others
Mobility raises issues of address migration,
maintenance of directories, and determining the
location of stations
Object-oriented DBMS e.g. CORBA (platform
independent), COM/OLE (MS-specific)
CORBA

Common Object Request Broker
Architecture
 Facilitates the maintenance and DB access
of data from a number of autonomous and
heterogeneous sources (e.g. file systems,
spreadsheets) via a multidatabase approach
 Provides a generic platform for distributed
computing
CORBA (cont’d)

In multidatabase systems, the main problem is the
heterogeneity extant at four levels: platform,
communication, database system, and semantic.
 CORBA facilitates implementation transparency
by providing client access via interfaces defined in
a special Interface Definition Language (IDL),
independent of the databases actual software and
hardware environment.
 Provides location transparency, allowing clients to
access DB objects independent of location and
communication protocols
CORBA (cont’d)

Provides a common interface to mask
heterogeneity among native database
system implementations based on different
data models (e.g. flat-file, relational,
spreadsheet) and query languages
 Common interface overcomes semantic
conflicts such as schema and data conflicts
References

M.T. Ozsu and P. Valduriez, "Distributed and
Parallel Database Systems – Technology and
Current State-of-the-Art", ACM Computing
Surveys, 28(1): 125 - 128, March 1996.
 A. Dogac, C. Dengi and M.T. Ozsu, "Distributed
Object Computing Platforms", Communications of
ACM, 41(9): 95-103, September 1998.
 J. N. Gray, “Notes on Data Base Operating
Systems.” Operating Systems: An Advanced
Course. R. Bayer, R.M. Graham (eds.) New
York: Springer-Verlag, 1979, pp. 393-481.
References (cont’d)

M.T. Ozsu, "The Push/Pull Effect - Can
Distributed Database Technology Meet The
Challenges of New Applications?",
Database Programming & Design, April
1997.
Thank you