Lesson16 Distributed Database

Download Report

Transcript Lesson16 Distributed Database

Client/Server Database and
Distributed Database
Advanced Database
Dr. AlaaEddin Almabhouh
Chapter Objectives

Describe the concepts and advantages of the client/server
database approach.

Describe the concepts and advantages of the distributed
database approach.

Explain how data can be distributed and replicated in a
distributed database.
12-2
Local Area Network (LAN)
12-3

An arrangement of personal
computers connected
together by communications
lines.

The PCs must be located
fairly close to each other.

Allows sharing of resources
such as servers, printers, etc.
Local Area Network (LAN)
12-4

The PCs on a LAN can
certainly operate
independently, but they can
also communicate with each
other.

A gateway computer on the
LAN can link the LAN and
its PCs to other LANs, to
one or more mainframe
computers, or to the
Internet.
Two-Tiered Client/Server Arrangement

Clients = PCs on the LAN

Server = powerful computer on the LAN

A shared database can be stored on a LAN server so that
all of the PCs on the LAN can access it.
12-5
File Server Approach

When a LAN client needs to query, update, or
otherwise use a file on the server, the entire file must
be sent from the server to that client.

All of the querying, updating, or other processing is
then performed at the client.

If changed, the entire file is then shipped back to the
server.
12-6
DBMS Server Approach

The database is located at the server.

The processing is split between the client and the server.

Result: there is much less data traffic on the network.
12-7
Two-Tier Approach
12-8

Some databases can be
stored on a client PC’s own
hard drive while other
databases that the client
might access are stored on
the LAN’s server.

Software has been
developed that makes the
location of the data
transparent to the user at
the client.
Two-Tier Approach
12-9

The user issues a query at
the client.

The software first checks to
see if the required data is on
the PC’s own hard drive. If
yes, the data is retrieved
from it.

If it is not on local drive, the
software looks for it on the
server.
Three-Tier Approach

If the software doesn’t find the data on the client PC’s
hard drive or on the LAN server, it can leave the
LAN through a gateway computer and look for the
data on, for example, a large, mainframe computer
that may be reachable from many LANs.
12-10
Three-Tier Approach

Another use of the term “Three-Tier Approach” with
the following three tiers:



The client PCs
Servers known as application servers
Other servers known as database servers
12-11
The Distributed Database Concept

Instead of having one, centralized database, we are going
to spread the data out among various cities on the
distributed network, each of which has its own computer
and data storage facilities.

All of this distributed data is still considered to be a
single logical database.
12-12
The Distributed Database Concept

Location transparency - The user just issues the query,
and the result is returned.

A person or process anywhere on the distributed network
queries the database.

It is not necessary to know where on the network the data
being sought is located.
12-13
Distributed DBMS

Distributed database management system


Sophisticated software
Manages location transparency
12-14
Distributing the Data
12-15

Headquartered in NY, a
company’s database
consists of 6 large tables:
A, B, C, D, E, F.

With a centralized
database, all 6 tables
would be located in NY.
Distributing the Data

The company has major sites in Los Angeles, Memphis,
New York, Paris, and Tokyo.

The first and simplest idea in distributing the data would
be to disperse the six tables among the five sites, perhaps
based on frequency of use of each table.
12-16
Distributing the Data
12-17

Tables A and B are kept at
New York

Table C is moved to
Memphis

Tables D and E are moved to
Tokyo

Table F is moved to Paris.
Distributing the Data

Paris employees can now access Table F without incurring
telecommunications costs associated with accessing Table
F in NY.

Local autonomy - Paris employees, e.g., can take
responsibility for Table F -- its security, backup and
recovery, and concurrency control.
12-18
Distributing the Data: Problems

When the database was centralized at New York, a
query issued at any of the sites that required a join of
two or more of the tables could be handled in the
standard way by the computer at New York.

The result would then be sent to the site that issued
the query.

In the dispersed approach, a join might require tables
located at different sites!
12-19
Replicated Tables

Duplicated tables at two or more sites on the network.

Advantages

Availability - during a site failure, data can still be accessed at a
replicated location.

Local access - Replicate table at a site requiring frequent
access.
12-20
Replicated Tables

Disadvantages

Security risk

Concurrency control - How do you keep data consistent when
it is replicated in tables on three continents?
12-21
Full Data Replication
12-22

The maximum approach
of replicating every table
at every site.

Great for availability

Great for joins
Full Data Replication
12-23

Worst for concurrency
control - every change to
every table has to be
reflected at every site.

Worst for security

Takes up a lot of disk space
Partial Replication

12-24
Have a copy of the
entire database at
headquarters in New
York and have each table
replicated exactly once
at one of the other
sites.
Partial Replication
12-25

Improves availability each table is now at two
sites.

Security and
concurrency exposures
are limited.

Joins occur at NY.
Partial Replication
12-26

New York could tend to
become a bottleneck.

If a table is heavily used in
both Tokyo and Los
Angeles, it can only be
placed at one of the two
sites (plus the copy of the
entire database in New
York), leaving the other
with speed and telecom
cost problems.
Replication Principles

Place copies of tables at the sites that use them most
heavily in order to minimize telecommunications costs.

Ensure that there are at least two copies of important or
frequently used tables to realize the gains in availability.

Limit the number of copies of any one table to control
the security and concurrency issues.

Avoid any one site becoming a bottleneck.
12-27
Replication Principles
12-28
Concurrency Control in Distributed
Database

The “lost update” problem.

The protections that we discussed earlier that can be put
into place to handle the problem of concurrent update in
a single table are not adequate to handle the new,
expanded problem in distributed database systems.
12-29
Asynchronous Approach

If retrieved data does not necessarily have to be up-tothe-minute accurate, we can use “asynchronous”
approaches to updating replicated data.
12-30
Asynchronous Schemes

The site where the data was updated can send a
message to update to the other sites that contain a
copy of the same table.

One of the sites can be chosen to accumulate all of
the updates to all of the tables, and transmit changes
regularly.

Each table can have one of the sites be declared the
“dominant” site for that table, which periodically
transmits updates to the other sites.
12-31
Synchronous Approach

If retrieved data does have to be up-to-the-minute
accurate.

All data in replicated tables worldwide must always be
consistent, accurate, and up-do-date.

Use two-phase commit.
12-32
Two-Phase Commit:
Prepare Phase

Each computer on the network has a special log file in addition to its
database tables.

The computer at the initiating site sends the updated data to the other
sites that have copies of the table to be updated.

The computers at the other sites record the changes in their logs (but not
in the actual database tables.)

These computers attempt to lock the database tables involved in the update.

If they are successful (the tables are not busy and can be locked) they inform the
initiating site.
12-33
Two-Phase Commit:
Commit Phase

If all of the other sites reported they were successful
in logging the update and locking the tables, the
initiating site issues instructions to transfer the
update from the logs to the actual database tables.
12-34
Two-Phase Commit

Either all of the replicated files have to be updated or
none of them must be updated.

A complex, costly, and time-consuming process.

The more volatile the data in the database, the less
attractive is this procedure for updating replicated
tables in the distributed database.
12-35
Distributed Joins

A query that is run from one of the computers in a
distributed database system that requires a join of
two or more tables that are not all at the same
computer.

The distributed DBMS must have its own built-in
expert system that is capable of figuring out an
efficient way to handle a request for a distributed
join.
12-36
Distributed Joins

The DBMS evaluates various options for performing a
join by considering:

The number and size of the records from each table
involved in the join.

The distances and costs of transmitting the records from
one site to another to execute the join.

The distance and cost of shipping the result of the join back
to the site that issued the query in the first place.
12-37
Partitioning

The purpose is to have records or columns of a table
resident at the sites that use them the most
frequently.

Horizontal Partitioning

Vertical Partitioning
12-38
Horizontal Partitioning
12-39

A relational table can be
split up so that some
records are located at
one site, other records
are located at another
site, and so on.

e.g., partitioning of Table
G.
Vertical Partitioning

The columns of a table are divided up among several
cities on the network.

Each such partition must include the primary key
attribute(s) of the table.

Makes sense when different sites are responsible for
processing different functions involving an entity.
12-40
Distributed DBMS Summary

Centralized Database Advantages

Single site provides high degree of security, concurrency, and
backup and recovery control.

No need for a distributed directory since all of the data is in
one place.

No need for distributed joins since all of the data is in one
place.
12-41
Distributed DBMS Summary

Centralized Database Disadvantages

All data accesses from other than the site with the database
incur communications costs.

The site with the database can become a bottleneck.

Possible availability problem: if the site with the database goes
down, there can be no data access.
12-42
Distributed DBMS Summary

Dispersing Tables on the Network Advantages

Local autonomy.

Reduced communications costs because each table can be
located at the site that most heavily uses it.

Improved availability because portions of the database are
available even if one or some of the sites are down.
12-43
Distributed DBMS Summary

Dispersing Tables on the Network Disadvantages

Several sites have to be concerned with security, concurrency,
and backup and recovery.

Requires a distributed directory and the software to support
location transparency.

Requires distributed joins.
12-44
Distributed DBMS Summary

Targeted Data Replication Advantages



Greatly reduced communications costs for read-only data
access because copies of tables can be located at multiple sites
that most heavily use them.
Greatly improved availability because if a site with a database
table goes down, there may be another site with a copy of that
table.
Targeted Data Replication Disadvantages

Multi-site concurrency control when data in replicated tables is
updated.
12-45
Distributed DBMS Summary

Partitioned Tables Advantages



Greatest local autonomy because data at the record or column
level can be stored at the site(s) that most heavily use it.
Greatly reduced communications costs because data at the
record or column level can be stored at the site(s) that most
heavily use it.
Partitioned Tables Disadvantages

Retrieving all or a large portion of a table may require multisite accesses.
12-46