#### Transcript pmit-6102-15-lec3-Distributed DatabaseDesignx

PMIT-6102 Advanced Database Systems ByJesmin Akhter Assistant Professor, IIT, Jahangirnagar University Lecture 03 Distributed Database Design Outline Distributed Database Design Distributed Design Problem Distributed Design Issues o Fragmentation o Data Allocation Slide 3 Distributed Design Problem The design of a distributed computer system involves making decisions on the placement of data and programs across the sites of a computer network as well as possibly designing the network itself In the case of distributed DBMSs, the distribution of applications involves two things: distribution of the distributed DBMS software distribution of the application programs that run on it. Are not significant problem Assume a copy of the distributed DBMS software exists at each site where data are stored Network has already been designed We concentrate on distribution of data Slide 4 Distribution Design Issues The following set of interrelated questions covers the entire issue. Why fragment at all? How should we fragment? How much should we fragment? How to test correctness? How should we allocate? What is the necessary information for fragmentation and allocation? Slide 5 Reasons for Fragmentation The important issue is the appropriate unit of distribution. A relation is not a suitable unit, for a number of reasons. o First, application views are usually subsets of relations. o Therefore, the locality of accesses of applications is defined not on entire relations but on their subsets o Hence consider subsets of relations as distribution units. Slide 6 Reasons for Fragmentation The relation is not replicated and is stored at only one site, results in an unnecessarily high volume of remote data accesses The relation is replicated at all or some of the sites where the applications reside. May has unnecessary replication, which causes problems in executing updates may not be desirable if storage is limited. Finally, the decomposition of a relation into fragments, each being treated as a unit, permits a number of transactions to execute concurrently. Thus fragmentation typically increases the level of concurrency and therefore the system throughput. Slide 7 Fragmentation Alternatives Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones. There are clearly two alternatives for this: dividing it horizontally or dividing it vertically. Slide 8 Fragmentation Alternatives – Horizontal PROJ PROJ1 : projects with budgets less than $200,000 PROJ2 : P1 P2 P3 P4 P5 projects with budgets greater than or equal to $200,000 PROJ1 PNO PNO PNAME Instrumentation Database Develop. CAD/CAM Maintenance CAD/CAM BUDGET LOC 150000 135000 250000 310000 500000 Montreal New York New York Paris Boston PROJ2 PNAME BUDGET LOC PNO PNAME BUDGET LOC P1 Instrumentation 150000 Montreal P3 CAD/CAM 250000 New York P2 135000 New York P4 Maintenance 310000 Paris P5 CAD/CAM 500000 Boston Database Develop. Example of Horizontal Partitioning Slide 9 Fragmentation Alternatives – Vertical PROJ PROJ1: information about project budgets PROJ2: information about project names and locations PROJ1 PNO PNAME BUDGET P1 P2 P3 P4 P5 Instrumentation Database Develop. CAD/CAM Maintenance CAD/CAM LOC 150000 Montreal 135000 New York 250000 New York 310000 Paris 500000 Boston PROJ2 PNO BUDGET PNO PNAME LOC P1 P2 P3 P4 P5 150000 135000 250000 310000 500000 P1 P2 P3 P4 P5 Instrumentation Database Develop. CAD/CAM Maintenance CAD/CAM Montreal New York New York Paris Boston Example of Vertical Partitioning Slide 10 Correctness of Fragmentation Completeness Decomposition of relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri This property, which is identical to the lossless decomposition property of normalization it ensures that the data in a global relation are mapped into fragments without any loss Slide 11 Correctness of Fragmentation Reconstruction If relation R is decomposed into fragments FR={R1, R2, ..., Rn}, then there should exist some relational operator ∇ such that R = ∇ Ri , The reconstructability of the relation from its fragments ensures that constraints defined on the data in the form of dependencies are preserved. Slide 12 Correctness of Fragmentation Disjointness If relation R is horizontally decomposed into fragments FR={R1, R2, ..., Rn}, and data item dj is in Rj, then dj should not be in any other fragment Rk (k ≠ j ). This criterion ensures that the horizontal fragments are disjoint. If relation R is vertically decomposed, its primary key attributes are typically repeated in all its fragments (for reconstruction). Therefore, in case of vertical partitioning, disjointness is defined only on the non-primary key attributes of a relation. Slide 13 Allocation Alternatives Non-replicated partitioned : each fragment resides at only one site Replicated fully replicated : each fragment at each site partially replicated : each fragment at some of the sites Slide 14 Information Requirements The information needed for distribution design can be divided into four categories: Database information Application information Communication network information Computer system information Slide 15 Fragmentation Horizontal Fragmentation (HF) Vertical Fragmentation (VF) Hybrid Fragmentation (HF) Slide 16 Horizontal Fragmentation (HF) Horizontal fragmentation partitions a relation along its tuples. each fragment has a subset of the tuples of the relation. There are two versions of horizontal partitioning: Primary horizontal fragmentation Derived horizontal fragmentation Primary horizontal fragmentation of a relation is performed using predicates that are defined on that relation. Derived horizontal fragmentation is the partitioning of a relation results from predicates being defined on another relation. Slide 17 Information Requirements of HF Database Information how the database relations are connected to one another, especially with joins. In the relational model, these relationships are also depicted as relations. SKILL TITLE, SAL L1 PROJ EMP ENO, ENAME, TITLE L2 ASG PNO, PNAME, BUDGET, LOC L3 ENO, PNO, RESP, DUR Expression of Relationships Among Relations Using Links cardinality of each relation: card(R) Slide 18 Information Requirements of HF Application Information simple predicates : Given R[A1, A2, …, An], a simple predicate pj is pj : Ai θValue where θ {=,<,≤,>,≥,≠}, Value Di and Di is the domain of Ai. For relation R we define Pr = {p1, p2, …,pm} Example : PNAME = "Maintenance" BUDGET ≤ 200000 Slide 19 Information Requirements of HF Application Information minterm predicates : Given R and Pr = {p1, p2, …,pm} define M = {m1,m2,…,mr} as M = { mi | mi = pjPr pj* }, 1≤j≤m, 1≤i≤z where pj* = pj or pj* = ¬(pj). Example m1: PNAME="Maintenance" BUDGET≤200000 m2: NOT(PNAME="Maintenance") BUDGET≤200000 m3: PNAME= "Maintenance" NOT(BUDGET≤200000) m4: NOT(PNAME="Maintenance") NOT(BUDGET≤200000) Slide 20 Information Requirements of HF Application Information In terms of quantitative information about user applications, we need to have two sets of data. minterm selectivities: sel(mi) o The number of tuples of the relation that would be accessed by a user query which is specified according to a given minterm predicate mi. access frequencies: acc(qi) o The frequency with which a user application qi accesses data. o Access frequency for a minterm predicate can also be defined. Slide 21 Primary Horizontal Fragmentation Definition : A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema. Therefore, given relation R, its horizontal fragments are given by Ri = Fi(R), 1 ≤ i ≤ w where Fi is a selection formula used to obtain fragment Ri. If Fi is in conjunctive normal form, it is a minterm predicate (mi). A horizontal fragment Ri of relation R consists of all the tuples of R which satisfy a minterm predicate mi. Given a set of minterm predicates M, there are as many horizontal fragments of relation R as there are minterm predicates. Set of horizontal fragments also referred to as minterm fragments. Slide 22 Primary Horizontal Fragmentation PROJ1 = LOC=“Montreal” (PROJ) PROJ2 = LOC=“New York” (PROJ) PROJ3 = LOC=“Paris” (PROJ) Primary Horizontal Fragmentation of Relation PROJ Slide 23 PHF – Algorithm Given: A relation R, the set of simple predicates Pr Output: The set of fragments of R = {R1, R2,…,Rw} which obey the fragmentation rules. Preliminaries : Pr should be complete Pr should be minimal Slide 24 Completeness of Simple Predicates A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any minterm fragment that is defined according to Pr. Example : Assume PROJ[PNO,PNAME,BUDGET,LOC] has two applications defined on it. Find the budgets of projects at each location. (1) Find projects with budgets less than $200000. (2) Slide 25 Completeness of Simple Predicates According to (1), Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”} which is not complete with respect to (2). Modify Pr ={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”, BUDGET≤200000,BUDGET>200000} which is complete. Slide 26 Minimality of Simple Predicates If a predicate influences how fragmentation is performed, (i.e., causes a fragment f to be further fragmented into, say, fi and fj) then there should be at least one application that accesses fi and fj differently. In other words, the simple predicate should be relevant in determining a fragmentation. If all the predicates of a set Pr are relevant, then Pr is minimal. Pi is relevant if and only if acc(mi ) acc(m j ) card ( f i ) card ( f j ) acc(mi ) the access frequency of a minterm mi. Slide 27 Minimality of Simple Predicates Example : Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”, BUDGET≤200000,BUDGET>200000} is minimal (in addition to being complete). However, if we add PNAME = “Instrumentation” then Pr is not minimal. Slide 28 COM_MIN Algorithm Given: a relation R and a set of simple predicates Pr Output: a complete and minimal set of simple predicates Pr' for Pr Rule 1: a relation or fragment is partitioned into at least two parts which are accessed differently by at least one application. Slide 29 COM_MIN Algorithm Initialization : find a pi Pr such that pi partitions R according to Rule 1 set Pr' = pi ; Pr Pr – {pi} ; F {fi} Iteratively add predicates to Pr' until it is complete find a pj Pr such that pj partitions some fk defined according to minterm predicate over Pr' according to Rule 1 set Pr' = Pr' {pj}; Pr Pr – {pj}; F F {fj} if pk Pr' which is nonrelevant then Pr' Pr' – {pk} F F – {fk} Slide 30 PHORIZONTAL Algorithm Makes use of COM_MIN to perform fragmentation. Input: a relation R and a set of simple predicates Pr Output: a set of minterm predicates M according to which relation R is to be fragmented Pr' COM_MIN (R,Pr) determine the set M of minterm predicates determine the set I of implications among pi Pr‘ Iteratively eliminate the contradictory minterms from M M M-mi Slide 31 PHF – Example Two candidate relations : PAY and PROJ. Fragmentation of relation PAY Application: Check the salary info and determine raise. Employee records kept at two sites application run at two sites Simple predicates p1 : SAL ≤ 30000 p2 : SAL > 30000 Pr = {p1,p2} which is complete and minimal Pr'=Pr Minterm predicates m1 : (SAL ≤ 30000) m2 : NOT(SAL ≤ 30000) (SAL > 30000) Slide 32 PHF – Example PAY1 TITLE PAY2 SAL TITLE SAL Mech. Eng. 27000 Elect. Eng. 40000 Programmer 24000 Syst. Anal. 34000 Slide 33 PHF – Example Fragmentation of relation PROJ Applications: o Find the name and budget of projects given their location. – Issued at three sites o Access project information according to budget – one site accesses ≤200000 other accesses >200000 Simple predicates For application (1) p1 : LOC = “Montreal” p2 : LOC = “New York” p3 : LOC = “Paris” For application (2) p4 : BUDGET ≤ 200000 p5 : BUDGET > 200000 Pr = Pr' = {p1,p2,p3,p4,p5} Slide 34 PHF – Example Fragmentation of relation PROJ continued Minterm fragments left after elimination m1 : (LOC = “Montreal”) (BUDGET ≤ 200000) m2 : (LOC = “Montreal”) (BUDGET > 200000) m3 : (LOC = “New York”) (BUDGET ≤ 200000) m4 : (LOC = “New York”) (BUDGET > 200000) m5 : (LOC = “Paris”) (BUDGET ≤ 200000) m6 : (LOC = “Paris”) (BUDGET > 200000) Slide 35 PHF – Example The result of the primary horizontal fragmentation of PROJ forms six fragments FPROJ = {PROJ1, PROJ2, PROJ3, PROJ4, PROJ5, PROJ} according to the minterm predicates M . Since fragments PROJ2, and PROJ5 are empty, they are not depicted in Figure PROJ3 PROJ1 PNO PNAME BUDGET P1 Instrumentation 150000 LOC PNO Montreal P2 PROJ4 PNO P3 PNAME BUDGET LOC New York Database Develop. 135000 PROJ6 PNAME BUDGET CAD/CAM 250000 LOC PNO New York P4 PNAME Maintenance BUDGET LOC 310000 Paris Slide 36 PHF – Correctness Completeness Since Pr' is complete and minimal, the selection predicates are complete Reconstruction If relation R is fragmented into FR = {R1,R2,…,Rr} R = Ri FR Ri Disjointness Minterm predicates that form the basis of fragmentation should be mutually exclusive. Slide 37 Derived Horizontal Fragmentation Defined on a member relation of a link according to a selection operation specified on its owner. Each link is an equijoin. Equijoin can be implemented by means of semijoins. SKILL TITLE, SAL L1 EMP PROJ ENO, ENAME, TITLE L2 PNO, PNAME, BUDGET, LOC L3 ASG ENO, PNO, RESP, DUR Slide 38 DHF – Definition Given a link L where owner(L)=S and member(L)=R, the derived horizontal fragments of R are defined as Ri = R ⋉ Si, 1≤i≤w where w is the maximum number of fragments that will be defined on R and Si = Fi (S) where Fi is the formula according to which the primary horizontal fragment Si is defined. Slide 39 DHF – Example Given link L1 where owner(L1)=SKILL and member(L1)=EMP EMP1 = EMP ⋉ SKILL1 EMP2 = EMP ⋉ SKILL2 where SKILL1 = SAL≤30000(SKILL) SKILL2 = SAL>30000(SKILL) EMP2 EMP1 ENO ENAME E3 E4 E7 A. Lee J. Miller R. Davis TITLE Mech. Eng. Programmer Mech. Eng. ENO ENAME TITLE E1 J. Doe Elect. Eng. E2 E5 E6 E8 M. Smith B. Casey L. Chu J. Jones Syst. Anal. Syst. Anal. Elect. Eng. Syst. Anal. Slide 40 DHF – Correctness Completeness Referential integrity o ensures that the tuples of any fragment of the member relation are also in the owner relation. o Let R be the member relation of a link whose owner is relation S which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A be the join attribute between R and S. Then, for each tuple t of R, there should be a tuple t' of S such that t[A] = t' [A] Reconstruction Let a relation R with fragmentation FR = { R1,R2,……Rw } Disjointness Simple join graphs between the owner and the member fragments. Slide 41 DHF – Correctness Slide 42 Vertical Fragmentation More difficult than horizontal, because more alternatives exist. Example: In horizontal partitioning, if the total number of simple predicates in Pr is n, there are 2n possible minterm predicates that can be defined on it. some of these will contradict the existing implications, further reducing the candidate fragments that need to be considered In the case of vertical partitioning if a relation has m nonprimary key attributes, the number of possible fragments is equal to B(m), which is the mth Bell number. For large values of m;B(m)= approximately (mm) for m=10, B(m) =115,000, for m=15, B(m) =109, for m=30, B(m) = 1023 Slide 43 Hybrid Fragmentation In most cases a simple horizontal or vertical fragmentation of a database schema will not be sufficient to satisfy the requirements of user applications. In this case a vertical fragmentation may be followed by a horizontal one, or vice versa, producing a tree structured Partitioning. Since the two types of partitioning strategies are applied one after the other, this alternative is called hybrid fragmentation. It has also been named mixed fragmentation or nested fragmentation. Slide 44 Hybrid Fragmentation It is also called mixed fragmentation or nested fragmentation. R HF HF R1 R2 VF VF VF R11 R12 R21 VF R22 VF R23 Slide 45 Correctness of Hybrid Fragmentation To reconstruct the original global relation in case of hybrid fragmentation, one starts at the leaves of the partitioning tree and moves upward by performing joins and unions. The fragmentation is complete if the intermediate and leaf fragments are complete. Similarly, disjointness is guaranteed if intermediate and leaf fragments are disjoint. Slide 46 Allocation Allocation Problem Given F = {F1, F2, …, Fn} fragments S ={S1, S2, …, Sm} network sites on which a set of applications Q = {q1, q2,…, qq} is running. The allocation problem involves finding the “optimal” distribution of F to S. Optimality can be defined with respect to two measures: Minimal cost o The cost function consists of the cost of storing each Fi at a site Sj, o the cost of querying Fi at site Sj , the cost of updating Fi at all sites where it is stored, o the cost of data communication. Performance o minimize the response time. o maximize the system throughput at each site. Slide 47 Allocation Model General Form min(Total Cost) subject to response time constraint storage constraint processing constraint Decision Variable xij 1 if fragment Fi is stored at site Sj 0 otherwise Slide 48 Allocation Model Total Cost query processing cost all queries all sites cost of storing a fragment at a site all fragments Storage Cost (of fragment Fj at Sk) (unit storage cost at Sk) (size of Fj) xjk We choose a different approach in our model of the database allocation problem (DAP) and specify it as consisting of the processing cost (PC) and the transmission cost (TC). Thus the query processing cost (QPC) for application qi is: processing component + transmission component Slide 49 Allocation Model Query Processing Cost Processing component PC, consists of three cost factors the access cost (AC) + the integrity enforcement cost (IE) + the concurrency control cost (CC) Access cost (no. of update accesses+ no. of read accesses) xij local processing cost at a site all sites all fragments o The first two terms calculate the number of accesses of user query qi to fragment Fj. o We assume that the local costs of processing them are identical. o The summation gives the total number of accesses for all the fragments referenced by qi. Multiplication by LPCk gives the cost of this access at site Sk. o We again use xij to select only those cost values for the sites where fragments are stored. Integrity enforcement and concurrency control costs o Can be similarly calculated Slide 50 Allocation Model Query Processing Cost Transmission component cost of processing updates + cost of processing retrievals In update queries it is necessary to inform all the sites where replicas exist, while in retrieval queries, it is sufficient to access only one of the copies. In addition, at the end of an update request, there is no data transmission back to the originating site other than a confirmation message, whereas the retrieval-only queries may result in significant data transmission. Cost of updates update message cost acknowledgment cost all sites all fragments all sites all fragments Retrieval Cost min all sites (cost of retrieval command cost of sending back the result) all fragments Slide 51 Allocation Model Constraints Response Time execution time of query ≤ max. allowable response time for that query Storage Constraint (for a site) storage requirement of a fragment at that site storage capacity at that site all fragments Processing constraint (for a site) processing load of a query at that site processing capacity of that site all queries Slide 52 Thank You Slide 53