pmit-6102-15-lec3-Distributed DatabaseDesignx

Download Report

Transcript pmit-6102-15-lec3-Distributed DatabaseDesignx

PMIT-6102
Advanced Database Systems
ByJesmin Akhter
Assistant Professor, IIT, Jahangirnagar University
Lecture 03
Distributed Database Design
Outline

Distributed Database Design
 Distributed Design Problem
 Distributed Design Issues
o Fragmentation
o Data Allocation
Slide 3
Distributed Design Problem

The design of a distributed computer system involves
 making decisions on the placement of data and
programs across the sites of a computer network
 as well as possibly designing the network itself

In the case of distributed DBMSs, the distribution of
applications involves two things:
 distribution of the distributed DBMS software
 distribution of the application programs that run on it.
Are not significant problem
 Assume a copy of the distributed DBMS software exists
at each site where data are stored
 Network has already been designed

We concentrate on distribution of data
Slide 4
Distribution Design Issues

The following set of interrelated questions covers the
entire issue.
 Why fragment at all?
 How should we fragment?
 How much should we fragment?
 How to test correctness?
 How should we allocate?
 What is the necessary information for fragmentation
and allocation?
Slide 5
Reasons for Fragmentation

The important issue is the appropriate unit of
distribution.
 A relation is not a suitable unit, for a number of
reasons.
o First, application views are usually subsets of relations.
o Therefore, the locality of accesses of applications is
defined not on entire relations but on their subsets
o Hence consider subsets of relations as distribution
units.
Slide 6
Reasons for Fragmentation

The relation is not replicated and is stored at only one site,
 results in an unnecessarily high volume of remote data
accesses

The relation is replicated at all or some of the sites where
the applications reside.
 May has unnecessary replication, which causes problems in
executing updates
 may not be desirable if storage is limited.

Finally, the decomposition of a relation into fragments, each
being treated as a unit, permits a number of transactions to
execute concurrently.
 Thus fragmentation typically increases the level of concurrency
and therefore the system throughput.
Slide 7
Fragmentation Alternatives

Relation instances are essentially tables, so the issue
is one of finding alternative ways of dividing a table
into smaller ones.

There are clearly two alternatives for this:
 dividing it horizontally or dividing it vertically.
Slide 8
Fragmentation Alternatives – Horizontal
PROJ
PROJ1 :
projects with budgets less than
$200,000
PROJ2 :
P1
P2
P3
P4
P5
projects with budgets greater
than or equal to $200,000
PROJ1
PNO
PNO
PNAME
Instrumentation
Database Develop.
CAD/CAM
Maintenance
CAD/CAM
BUDGET
LOC
150000
135000
250000
310000
500000
Montreal
New York
New York
Paris
Boston
PROJ2
PNAME
BUDGET
LOC
PNO
PNAME
BUDGET
LOC
P1 Instrumentation
150000
Montreal
P3
CAD/CAM
250000
New York
P2
135000
New York
P4
Maintenance
310000
Paris
P5
CAD/CAM
500000
Boston
Database Develop.
Example of Horizontal Partitioning
Slide 9
Fragmentation Alternatives – Vertical
PROJ
PROJ1: information about project
budgets
PROJ2: information about project
names and locations
PROJ1
PNO
PNAME
BUDGET
P1
P2
P3
P4
P5
Instrumentation
Database Develop.
CAD/CAM
Maintenance
CAD/CAM
LOC
150000 Montreal
135000 New York
250000 New York
310000 Paris
500000 Boston
PROJ2
PNO
BUDGET
PNO
PNAME
LOC
P1
P2
P3
P4
P5
150000
135000
250000
310000
500000
P1
P2
P3
P4
P5
Instrumentation
Database Develop.
CAD/CAM
Maintenance
CAD/CAM
Montreal
New York
New York
Paris
Boston
Example of Vertical Partitioning
Slide 10
Correctness of Fragmentation

Completeness
 Decomposition of relation R into fragments R1, R2, ..., Rn is
complete if and only if each data item in R can also be
found in some Ri
 This property, which is identical to the lossless
decomposition property of normalization
 it ensures that the data in a global relation are mapped
into fragments without any loss
Slide 11
Correctness of Fragmentation

Reconstruction
 If relation R is decomposed into fragments FR={R1, R2,
..., Rn}, then there should exist some relational operator ∇
such that
R = ∇ Ri ,
 The reconstructability of the relation from its fragments
ensures that constraints defined on the data in the form
of dependencies are preserved.
Slide 12
Correctness of Fragmentation

Disjointness
 If relation R is horizontally decomposed into fragments
FR={R1, R2, ..., Rn}, and data item dj is in Rj, then dj should
not be in any other fragment Rk (k ≠ j ).
 This criterion ensures that the horizontal fragments are
disjoint.
 If relation R is vertically decomposed, its primary key
attributes are typically repeated in all its fragments (for
reconstruction).
 Therefore, in case of vertical partitioning, disjointness is
defined only on the non-primary key attributes of a
relation.
Slide 13
Allocation Alternatives

Non-replicated
 partitioned : each fragment resides at only one site

Replicated
 fully replicated : each fragment at each site
 partially replicated : each fragment at some of the sites
Slide 14
Information Requirements

The information needed for distribution design can be
divided into four categories:

Database information

Application information

Communication network information

Computer system information
Slide 15
Fragmentation

Horizontal Fragmentation (HF)

Vertical Fragmentation (VF)

Hybrid Fragmentation (HF)
Slide 16
Horizontal Fragmentation (HF)

Horizontal fragmentation partitions a relation along its
tuples.
 each fragment has a subset of the tuples of the relation.

There are two versions of horizontal partitioning:
 Primary horizontal fragmentation
 Derived horizontal fragmentation

Primary horizontal fragmentation of a relation is
performed
 using predicates that are defined on that relation.

Derived horizontal fragmentation is the partitioning of a
relation
 results from predicates being defined on another relation.
Slide 17
Information Requirements of HF
 Database Information
 how the database relations are connected to one another,
especially with joins.
 In the relational model, these relationships are also depicted
as relations.
SKILL
TITLE, SAL
L1
PROJ
EMP
ENO, ENAME, TITLE
L2
ASG
PNO, PNAME, BUDGET, LOC
L3
ENO, PNO, RESP, DUR
Expression of Relationships Among Relations Using Links
 cardinality of each relation: card(R)
Slide 18
Information Requirements of HF

Application Information
 simple predicates : Given R[A1, A2, …, An], a simple
predicate pj is
pj : Ai θValue
where θ  {=,<,≤,>,≥,≠}, Value  Di and Di is the domain
of Ai.
For relation R we define Pr = {p1, p2, …,pm}
Example :
PNAME = "Maintenance"
BUDGET ≤ 200000
Slide 19
Information Requirements of HF

Application Information
 minterm predicates : Given R and Pr = {p1, p2, …,pm}
define M = {m1,m2,…,mr} as
M = { mi | mi =
pjPr pj* }, 1≤j≤m, 1≤i≤z
where pj* = pj or pj* = ¬(pj).
Example
m1: PNAME="Maintenance"  BUDGET≤200000
m2: NOT(PNAME="Maintenance")  BUDGET≤200000
m3: PNAME= "Maintenance"  NOT(BUDGET≤200000)
m4: NOT(PNAME="Maintenance")  NOT(BUDGET≤200000)
Slide 20
Information Requirements of HF

Application Information

In terms of quantitative information about user applications,
we need to have two sets of data.
 minterm selectivities: sel(mi)
o The number of tuples of the relation that would be accessed
by a user query which is specified according to a given
minterm predicate mi.
 access frequencies: acc(qi)
o The frequency with which a user application qi accesses
data.
o Access frequency for a minterm predicate can also be
defined.
Slide 21
Primary Horizontal Fragmentation

Definition :
A primary horizontal fragmentation is defined by a selection operation
on the owner relations of a database schema. Therefore, given relation
R, its horizontal fragments are given by
Ri = Fi(R), 1 ≤ i ≤ w

where Fi is a selection formula used to obtain fragment Ri. If Fi is in
conjunctive normal form, it is a minterm predicate (mi).
 A horizontal fragment Ri of relation R consists of all the tuples of R which
satisfy a minterm predicate mi.
 Given a set of minterm predicates M, there are as many horizontal fragments
of relation R as there are minterm predicates.
 Set of horizontal fragments also referred to as minterm fragments.
Slide 22
Primary Horizontal Fragmentation

PROJ1 =  LOC=“Montreal” (PROJ)

PROJ2 =  LOC=“New York” (PROJ)

PROJ3 =  LOC=“Paris” (PROJ)
Primary Horizontal Fragmentation of Relation PROJ
Slide 23
PHF – Algorithm
Given:
A relation R, the set of simple predicates Pr
Output: The set of fragments of R = {R1, R2,…,Rw} which
obey the fragmentation rules.
Preliminaries :
 Pr should be complete
 Pr should be minimal
Slide 24
Completeness of Simple Predicates

A set of simple predicates Pr is said to be complete if
and only if there is an equal probability of access by
every application to any tuple belonging to any minterm
fragment that is defined according to Pr.

Example :
 Assume PROJ[PNO,PNAME,BUDGET,LOC] has two
applications defined on it.
 Find the budgets of projects at each location. (1)
 Find projects with budgets less than $200000. (2)
Slide 25
Completeness of Simple Predicates
According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}
which is not complete with respect to (2).
Modify
Pr ={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}
which is complete.
Slide 26
Minimality of Simple Predicates

If a predicate influences how fragmentation is performed,
(i.e., causes a fragment f to be further fragmented into, say,
fi and fj) then there should be at least one application that
accesses fi and fj differently.

In other words, the simple predicate should be relevant in
determining a fragmentation.

If all the predicates of a set Pr are relevant, then Pr is
minimal.

Pi is relevant if and only if
acc(mi ) acc(m j )

card ( f i ) card ( f j )
 acc(mi ) the access frequency of a minterm mi.
Slide 27
Minimality of Simple Predicates
Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}
is minimal (in addition to being complete). However, if we add
PNAME = “Instrumentation”
then Pr is not minimal.
Slide 28
COM_MIN Algorithm
Given:
a relation R and a set of simple predicates Pr
Output: a complete and minimal set of simple predicates
Pr' for Pr
Rule 1: a relation or fragment is partitioned into at least
two parts which are accessed differently by at
least one application.
Slide 29
COM_MIN Algorithm
 Initialization :

find a pi  Pr such that pi partitions R according to Rule 1

set Pr' = pi ; Pr Pr – {pi} ; F  {fi}
 Iteratively add predicates to Pr' until it is complete

find a pj  Pr such that pj partitions some fk defined
according to minterm predicate over Pr' according to Rule 1

set Pr' = Pr'  {pj}; Pr Pr – {pj}; F  F  {fj}

if pk  Pr' which is nonrelevant then
Pr'  Pr' – {pk}
F  F – {fk}
Slide 30
PHORIZONTAL Algorithm
Makes use of COM_MIN to perform fragmentation.
Input:
a relation R and a set of simple predicates Pr
Output: a set of minterm predicates M according to
which relation R is to be fragmented
 Pr'  COM_MIN (R,Pr)
 determine the set M of minterm predicates
 determine the set I of implications among pi  Pr‘
 Iteratively eliminate the contradictory minterms from M
 M  M-mi
Slide 31
PHF – Example

Two candidate relations : PAY and PROJ.

Fragmentation of relation PAY
 Application: Check the salary info and determine raise.
 Employee records kept at two sites  application run at
two sites
 Simple predicates
p1 : SAL ≤ 30000
p2 : SAL > 30000
Pr = {p1,p2} which is complete and minimal Pr'=Pr
 Minterm predicates
m1 : (SAL ≤ 30000)
m2 : NOT(SAL ≤ 30000)  (SAL > 30000)
Slide 32
PHF – Example
PAY1
TITLE
PAY2
SAL
TITLE
SAL
Mech. Eng.
27000
Elect. Eng.
40000
Programmer
24000
Syst. Anal.
34000
Slide 33
PHF – Example

Fragmentation of relation PROJ
 Applications:
o Find the name and budget of projects given their location.
– Issued at three sites
o Access project information according to budget
– one site accesses ≤200000 other accesses >200000
 Simple predicates
 For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
 For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
 Pr = Pr' = {p1,p2,p3,p4,p5}
Slide 34
PHF – Example

Fragmentation of relation PROJ continued
 Minterm fragments left after elimination
m1 : (LOC = “Montreal”)  (BUDGET ≤ 200000)
m2 : (LOC = “Montreal”)  (BUDGET > 200000)
m3 : (LOC = “New York”)  (BUDGET ≤ 200000)
m4 : (LOC = “New York”)  (BUDGET > 200000)
m5 : (LOC = “Paris”)  (BUDGET ≤ 200000)
m6 : (LOC = “Paris”)  (BUDGET > 200000)
Slide 35
PHF – Example
The result of the primary horizontal fragmentation of PROJ forms six
fragments FPROJ = {PROJ1, PROJ2, PROJ3, PROJ4, PROJ5, PROJ}
according to the minterm predicates M .
Since fragments PROJ2, and PROJ5 are empty, they are not depicted in
Figure
PROJ3
PROJ1
PNO
PNAME
BUDGET
P1
Instrumentation
150000
LOC
PNO
Montreal
P2
PROJ4
PNO
P3
PNAME
BUDGET
LOC
New York
Database Develop. 135000
PROJ6
PNAME
BUDGET
CAD/CAM
250000
LOC
PNO
New York
P4
PNAME
Maintenance
BUDGET
LOC
310000
Paris
Slide 36
PHF – Correctness

Completeness
 Since Pr' is complete and minimal, the selection
predicates are complete

Reconstruction
 If relation R is fragmented into FR = {R1,R2,…,Rr}
R = Ri FR Ri

Disjointness
 Minterm predicates that form the basis of fragmentation
should be mutually exclusive.
Slide 37
Derived Horizontal Fragmentation

Defined on a member relation of a link according to a
selection operation specified on its owner.
 Each link is an equijoin.
 Equijoin can be implemented by means of semijoins.
SKILL
TITLE, SAL
L1
EMP
PROJ
ENO, ENAME, TITLE
L2
PNO, PNAME, BUDGET, LOC
L3
ASG
ENO, PNO, RESP, DUR
Slide 38
DHF – Definition
Given a link L where owner(L)=S and member(L)=R, the
derived horizontal fragments of R are defined as
Ri = R ⋉ Si, 1≤i≤w
where w is the maximum number of fragments that will be
defined on R and
Si = Fi (S)
where Fi is the formula according to which the primary
horizontal fragment Si is defined.
Slide 39
DHF – Example
Given link L1 where owner(L1)=SKILL and
member(L1)=EMP
EMP1 = EMP ⋉ SKILL1
EMP2 = EMP ⋉ SKILL2
where
SKILL1 = SAL≤30000(SKILL)
SKILL2 = SAL>30000(SKILL)
EMP2
EMP1
ENO
ENAME
E3
E4
E7
A. Lee
J. Miller
R. Davis
TITLE
Mech. Eng.
Programmer
Mech. Eng.
ENO
ENAME
TITLE
E1
J. Doe
Elect. Eng.
E2
E5
E6
E8
M. Smith
B. Casey
L. Chu
J. Jones
Syst. Anal.
Syst. Anal.
Elect. Eng.
Syst. Anal.
Slide 40
DHF – Correctness

Completeness
 Referential integrity
o ensures that the tuples of any fragment of the member
relation are also in the owner relation.
o Let R be the member relation of a link whose owner is
relation S which is fragmented as FS = {S1, S2, ..., Sn}.
Furthermore, let A be the join attribute between R and S.
Then, for each tuple t of R, there should be a tuple t' of S
such that
t[A] = t' [A]

Reconstruction

Let a relation R with fragmentation FR = { R1,R2,……Rw }

Disjointness
 Simple join graphs between the owner and the member
fragments.
Slide 41
DHF – Correctness
Slide 42
Vertical Fragmentation

More difficult than horizontal, because more alternatives
exist.

Example:
 In horizontal partitioning, if the total number of simple predicates
in Pr is n, there are 2n possible minterm predicates that can be
defined on it.
 some of these will contradict the existing implications, further
reducing the candidate fragments that need to be considered

In the case of vertical partitioning if a relation has m nonprimary key attributes, the number of possible fragments is
equal to B(m), which is the mth Bell number.

For large values of m;B(m)= approximately (mm)
 for m=10, B(m) =115,000,
 for m=15, B(m) =109,
 for m=30, B(m) = 1023
Slide 43
Hybrid Fragmentation

In most cases a simple horizontal or vertical
fragmentation of a database schema will not be
sufficient to satisfy the requirements of user
applications.

In this case a vertical fragmentation may be followed
by a horizontal one, or vice versa, producing a tree
structured Partitioning.

Since the two types of partitioning strategies are
applied one after the other, this alternative is called
hybrid fragmentation.

It has also been named mixed fragmentation or nested
fragmentation.
Slide 44
Hybrid Fragmentation
It is also called mixed fragmentation or nested fragmentation.
R
HF
HF
R1
R2
VF
VF
VF
R11
R12
R21
VF
R22
VF
R23
Slide 45
Correctness of Hybrid Fragmentation

To reconstruct the original global relation in case of hybrid
fragmentation, one starts at the leaves of the partitioning
tree and moves upward by performing joins and unions.

The fragmentation is complete if the intermediate and leaf
fragments are complete.

Similarly, disjointness is guaranteed if intermediate and leaf
fragments are disjoint.
Slide 46
Allocation

Allocation Problem
Given
F = {F1, F2, …, Fn}
fragments
S ={S1, S2, …, Sm}
network sites
on which a set of applications Q = {q1, q2,…, qq} is running.

The allocation problem involves finding the “optimal” distribution of F to S.

Optimality can be defined with respect to two measures:
 Minimal cost
o The cost function consists of the cost of storing each Fi at a site Sj,
o the cost of querying Fi at site Sj , the cost of updating Fi at all sites where it is
stored,
o the cost of data communication.
 Performance
o minimize the response time.
o maximize the system throughput at each site.
Slide 47
Allocation Model
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
xij 
1 if fragment Fi is stored at site Sj
0 otherwise
Slide 48
Allocation Model

Total Cost

query processing cost 
all queries

all sites


cost of storing a fragment at a site
all fragments
Storage Cost (of fragment Fj at Sk)
(unit storage cost at Sk)  (size of Fj)  xjk

We choose a different approach in our model of the database
allocation problem (DAP) and specify it as consisting of the
processing cost (PC) and the transmission cost (TC).

Thus the query processing cost (QPC) for application qi is:
processing component + transmission component
Slide 49
Allocation Model

Query Processing Cost
 Processing component PC, consists of three cost factors
 the access cost (AC) + the integrity enforcement cost (IE) + the
concurrency control cost (CC)

 Access cost

(no. of update accesses+ no. of read accesses)  xij  local processing cost at a site
all sites all fragments
o The first two terms calculate the number of accesses of user query qi
to fragment Fj.
o We assume that the local costs of processing them are identical.
o The summation gives the total number of accesses for all the
fragments referenced by qi. Multiplication by LPCk gives the cost of
this access at site Sk.
o We again use xij to select only those cost values for the sites where
fragments are stored.
 Integrity enforcement and concurrency control costs
o Can be similarly calculated
Slide 50
Allocation Model

Query Processing Cost
Transmission component
cost of processing updates + cost of processing retrievals
 In update queries it is necessary to inform all the sites where replicas exist,
while in retrieval queries, it is sufficient to access only one of the copies.
 In addition, at the end of an update request, there is no data transmission
back to the originating site other than a confirmation message, whereas the
retrieval-only queries may result in significant data transmission.
 Cost of updates
  update message cost    acknowledgment cost
all sites all fragments
all sites all fragments
 Retrieval Cost

min all sites (cost of retrieval command  cost of sending back the result)
all fragments
Slide 51
Allocation Model

Constraints
 Response Time
execution time of query ≤ max. allowable response time for
that query
 Storage Constraint (for a site)

storage requirement of a fragment at that site 
storage capacity at that site
all fragments
 Processing constraint (for a site)

processing load of a query at that site 
processing capacity of that site
all queries
Slide 52
Thank You
Slide 53