
CSE2132
Database Systems
Week 12 Lecture
Distributed Database
Evolution of Distributed DBMS
 CENTRALISED systems - all system components located on a single computer
• data
• DBMS
• secondary storage (disks etc.)
– access via serially connected 'dumb terminals' - all processing occurs at the central site
 1980s changes:
– business operations decentralised geographically; the need to become 'lean-and-meaner, quick-reacting, dispersed operations'
– technological change: low-cost, powerful computing platforms
 these changes led to DISTRIBUTED SYSTEMS
Distributed Data Base Management Systems
 system components distributed over multiple sites, interconnected via a communication system (network)
 managed by a Distributed Data Base Management System (DDBMS)
 DDBMS advantages:
– data located near the site of greatest demand
– faster data access - the desired data subset is locally available
– faster data processing - system processing load spread out over multiple CPUs
– growth facilitation - easy to add new sites to the network
– less danger of a single point of failure
 DDBMS disadvantages:
– complexity of management and control
– security - weaker due to distribution (thus more people involved) and network traffic
– lack of standards - many communication protocols exist, e.g. TCP/IP, NetBIOS, DECnet etc.
Distributed Processing
 the database's logical processing is shared among multiple CPUs
 the actual database resides on a single computer
[Figure: distributed processing - the EMPLOYEE database and its DBMS reside on Computer A at Site 1 (User McGyver, Melbourne); over the communication network, Computer B at Site 2 (User Agatha, Sydney) updates payroll data and Computer C at Site 3 (User Perry, Adelaide) generates a payroll report from the database records.]
Distributed Database
 the actual database is stored over two or more independent CPUs - each part is known as a database fragment
[Figure: distributed database - fragments E1, E2 and E3 of the EMPLOYEE database are stored on Computer A at Site 1 (User McGyver, Melbourne), Computer B at Site 2 (User Agatha, Sydney) and Computer C at Site 3 (User Perry, Adelaide) respectively, each managed by its own DBMS and linked by the communication network.]
Distribution Options
 Two important components of a DDBMS are:
– Transaction Processor (TP) - receives and processes the application's data requests
– Data Processor (DP) - stores and retrieves the data located at the site
 Data and processing distribution options:
– Single site processing, single site data
• all processing done on a single CPU (host computer)
• all data stored on the host computer's local disk(s)
• e.g. traditional mainframe / minicomputer DBMS with dumb terminals, or a single-user microcomputer DBMS
Distribution Options
– Multiple site processing, single site data
• multiple processes run on different computers
• all data stored on a single computer's local disk(s)
• e.g. LAN file server
the TP acts as a redirector - it routes all network data requests to the file server
the file server appears to the end user as a hard disk, e.g. F:
all data selection, search and update occur at the workstation
the entire file must be transported across the network
inefficient and costly (communication)
Distribution Options (1)
 Client-server is a term used very loosely and is difficult to define clearly.
Client-server is similar to the file server approach, except that database processing occurs at the server site - known as a database server. Client-server is about a split in processing rather than a split in data. A 3-tier architecture is often employed. This requires communications middleware to resolve translation issues through the use of an agreed protocol (e.g. TCP/IP, IPX/SPX or NetBEUI). Client-server approaches overlap with, and are used in conjunction with, distributed databases.
 Multiple site processing, multiple site data
This describes a fully distributed DBMS. These can then be classified as homogeneous or heterogeneous.
Distribution Options (2)
• homogeneous DDBMS
The same DBMS runs at each site. While there is still much complexity to deal with, the task of accessing data over many sites is simpler than in the heterogeneous case.
• heterogeneous DDBMS
Will support different DBMSs and even different DBMS models (relational, network, hierarchical) at each site. Such implementations of a DDBMS operate under certain restrictions (e.g. access to other databases is read only, or only other relational databases can be accessed).
Date's Distributed Database Rules
 A distributed database should look exactly like a non-distributed 'local' database, with local access to local data.
 Many vendors claim to be delivering distributed database applications, so criteria are needed: Date's 12 rules.
– 0 - TO THE USER A DISTRIBUTED DATABASE SHOULD LOOK NO DIFFERENT TO A NON-DISTRIBUTED DB - distribution transparency; the database is treated as a single logical database
– 1 - LOCAL AUTONOMY, local data is owned and managed locally
– 2 - NO RELIANCE ON A CENTRAL SITE
– 3 - CONTINUOUS OPERATION, failure transparency - site independence; even in the event of a node failure the system continues to operate
– 4 - LOCATION INDEPENDENCE, location transparency
– 5 - FRAGMENTATION INDEPENDENCE, fragmentation transparency
Date's rules continued
– 6 - REPLICATION INDEPENDENCE
– 7 - DISTRIBUTED QUERY PROCESSING, multiple site queries
– 8 - DISTRIBUTED TRANSACTION MANAGEMENT,
transaction transparency - multiple site updates
– 9 - HARDWARE INDEPENDENCE
– 10 - OPERATING SYSTEM INDEPENDENCE
– 11 - COMMUNICATION NETWORK INDEPENDENCE
– 12 - DATABASE INDEPENDENCE
Levels of Distribution Transparency
 Can be used as a method of classification, by determining the highest level of transparency supported by the DDBMS:
– FRAGMENTATION transparency (highest level)
• no need to specify fragment names or locations
select *
from employee
where dob < 01/01/40;
– LOCATION transparency
• specify fragment names but not locations
select *
from e1
where dob < 01/01/40
UNION
select *
from e2
where dob < 01/01/40
UNION
select *
from e3
where dob < 01/01/40;
Levels of Distribution Transparency
– local MAPPING transparency (lowest level)
Need to specify both fragment and location (using pseudo-SQL):
select * from e1 node melbourne
where dob < 01/01/40
UNION
select * from e2 node sydney
where dob < 01/01/40
UNION
select * from e3 node adelaide
where dob < 01/01/40;
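The node names above are exactly what the higher transparency levels hide from the user. A minimal sketch of how a TP could supply them from a catalogue (Python; the catalogue contents and query text are taken from the example above, everything else is hypothetical):

# Hypothetical fragment catalogue kept by the Transaction Processor (TP).
FRAGMENT_CATALOGUE = {
    "e1": "melbourne",
    "e2": "sydney",
    "e3": "adelaide",
}

def expand_query(predicate):
    # Turn a location-transparent query into per-fragment 'node' queries,
    # so the user never has to name fragments or sites.
    return [
        f"select * from {fragment} node {node} where {predicate}"
        for fragment, node in FRAGMENT_CATALOGUE.items()
    ]

# The user only writes the predicate; the TP fills in fragments and nodes.
for q in expand_query("dob < 01/01/40"):
    print(q)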
DDBMS Operations
 Join operation (supported by most vendors)
– easier to achieve
– query optimisation is critical
 Update operation
– e.g. debit / credit of two accounts at different sites
– more difficult to manage
– needs sophisticated transaction management; the most popular strategy is TWO PHASE COMMIT
 Two Phase Commit requires three operations:
– DO - performs the operation and records before and after images in the transaction log
– UNDO - undoes an operation using the log entries created by DO
– REDO - redoes an operation using the log entries created by DO
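As an illustration only, a minimal sketch of how before and after images in the log support these three operations (Python; the account names and values are invented):

# Sketch of a transaction log holding before and after images
# (hypothetical in-memory structures, not any vendor's implementation).
database = {"acct_A": 100, "acct_B": 50}
log = []

def do_op(key, new_value):
    # DO: perform the operation and record before / after images in the log.
    log.append({"key": key, "before": database[key], "after": new_value})
    database[key] = new_value

def undo_op(entry):
    # UNDO: restore the before image recorded by DO.
    database[entry["key"]] = entry["before"]

def redo_op(entry):
    # REDO: re-apply the after image recorded by DO.
    database[entry["key"]] = entry["after"]

do_op("acct_A", 80)    # debit one account
do_op("acct_B", 70)    # credit the other
undo_op(log[-1])       # roll the credit back
redo_op(log[-1])       # apply it again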
Two Phase Commit
 The site originating the transaction (the coordinator) sends requests to the other sites (the subordinates); each site processes its sub-transaction but does not commit.
 Phase 1 - Preparation:
– 1. the coordinator sends a PREPARE TO COMMIT message to all subordinates
– 2. the subordinates receive the message, write their log entries and reply to the coordinator - READY TO COMMIT or NOT READY
– 3. the coordinator checks that all nodes are ready to commit - if not, it broadcasts an ABORT; if all are ready:
 Phase 2 - Final COMMIT:
– 1. the coordinator broadcasts a COMMIT message to all subordinates and awaits their replies
– 2. each subordinate receives the COMMIT and updates its database
– 3. the subordinates reply with COMMITTED or NOT COMMITTED to the coordinator
– if any subordinate did not commit, the coordinator sends an ABORT, forcing an UNDO
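A minimal sketch of that message flow (Python; the Subordinate class, its vote logic and the site names are hypothetical stand-ins, not a real DBMS API):

# Two-phase commit sketch following the message flow described above.
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def prepare(self):
        # Phase 1: write log entries and vote READY or NOT READY.
        self.log.append("PREPARED" if self.can_commit else "NOT READY")
        return self.can_commit

    def commit(self):
        # Phase 2: apply the update and acknowledge.
        self.log.append("COMMITTED")
        return True

    def abort(self):
        # UNDO the local sub-transaction using the log.
        self.log.append("UNDONE")

def two_phase_commit(subordinates):
    # Phase 1 - preparation: collect votes from every subordinate.
    votes = [sub.prepare() for sub in subordinates]
    if not all(votes):
        for sub in subordinates:
            sub.abort()            # any NOT READY vote forces a global ABORT
        return "ABORTED"
    # Phase 2 - final commit: broadcast COMMIT and collect acknowledgements.
    acks = [sub.commit() for sub in subordinates]
    if not all(acks):
        for sub in subordinates:
            sub.abort()            # a failed commit forces an UNDO everywhere
        return "ABORTED"
    return "COMMITTED"

print(two_phase_commit([Subordinate("sydney"), Subordinate("adelaide")]))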
DISTRIBUTION STRATEGIES
The overall data model - the company view:
CUSTOMER - ORDER - ORDER-LINE - PRODUCT - WAREHOUSE - INVENTORY
 Distribution principles
– examine geography / frequency of access
– guiding principle - minimise network traffic and communication costs
 The sales team
– situated in Sydney
– needs most access to customer / order / order-line / product
 The supply branch
– situated in Melbourne
– needs most access to warehouse / inventory / product
 Data model partitioning
[Figure: the data model is partitioned into a sales portion (CUSTOMER, ORDER, ORDER-LINE, PRODUCT) and a supply portion (PRODUCT, WAREHOUSE, INVENTORY), with PRODUCT sitting on the boundary.]
What do we do with RELATIONS at the BOUNDARY? FRAGMENTATION vs REPLICATION
 Possible starting points:
– Which site accesses it most?
• Storage at a single site
• Minimises update complexity
– Is there a case for fragmentation?
• Minimises local access time
• Minimises network traffic
– Is there a case for replication?
• Minimises local access time
• Minimises network traffic
 Fragmentation (a combined sketch of the three forms follows the hybrid example below)
– Horizontal fragmentation
• based on SELECTION, e.g. fragment the customer table on customer_city
– Vertical fragmentation
• based on PROJECTION, e.g. fragment the product table on the attributes needed by each group
SALES : P_CODE, DESC, UNIT_PRICE
SUPPLY : P_CODE, PACK_SIZE
– Hybrid fragmentation
• assume 3 warehouses
Footscray - p_code < 100
Collingwood - p_code = 100
Dandenong - p_code > 100
• further subdivide the SUPPLY vertical fragment horizontally
SELECT P_CODE < 100 --> FOOTSCRAY
SELECT P_CODE = 100 --> COLLINGWOOD
SELECT P_CODE > 100 --> DANDENONG
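Putting the three forms together, a rough sketch of how a single PRODUCT row could be fragmented (Python; the attribute names and p_code boundaries come from the example above, the sample row is invented):

# Fragmentation sketch: vertical split of the PRODUCT table into SALES and
# SUPPLY attribute sets, then horizontal (selection-based) placement of the
# SUPPLY fragment by p_code, as in the warehouse example above.
SALES_ATTRS = ["p_code", "desc", "unit_price"]
SUPPLY_ATTRS = ["p_code", "pack_size"]

def supply_site(p_code):
    # Horizontal fragmentation rule for the SUPPLY fragment.
    if p_code < 100:
        return "FOOTSCRAY"
    if p_code == 100:
        return "COLLINGWOOD"
    return "DANDENONG"

def fragment_row(product_row):
    sales_frag = {a: product_row[a] for a in SALES_ATTRS}    # vertical projection
    supply_frag = {a: product_row[a] for a in SUPPLY_ATTRS}  # vertical projection
    return sales_frag, supply_site(product_row["p_code"]), supply_frag

row = {"p_code": 42, "desc": "widget", "unit_price": 9.95, "pack_size": 12}
print(fragment_row(row))  # SALES fragment, warehouse site, SUPPLY fragment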
Replication
 Efficient retrievals vs multi-site updating
– multi-site updating is costly and accident prone
 Updating techniques (a sketch of the primary-node technique follows this list)
– Conservative - don't commit until all sites accept
– Primary node - one site accepts updates and broadcasts them to the others
– Majority voting
– Snapshots, etc.
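As an illustration of the primary-node technique named above, a minimal sketch (Python; the Replica class and site names are hypothetical):

# Primary-node replication sketch: only the primary copy accepts updates,
# then broadcasts the change to every other replica.
class Replica:
    def __init__(self, site):
        self.site = site
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

def update(primary, replicas, key, value):
    primary.apply(key, value)      # the single site that accepts updates
    for r in replicas:
        r.apply(key, value)        # broadcast the change to the other sites

primary = Replica("melbourne")
others = [Replica("sydney"), Replica("adelaide")]
update(primary, others, "P42", {"unit_price": 9.95})
print(others[0].data)              # every replica now holds the new value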
Query Optimization
 Query - list the supplier numbers for Cleveland suppliers of red parts
– SUPPLIER(SUPPLIER#, CITY) - 10,000 rows, stored at DETROIT
– PART(PART#, COLOUR) - 100,000 rows, stored at CHICAGO
– SHIPMENT(SUPPLIER#, PART#) - 1,000,000 rows, stored at DETROIT
SELECT S.SUPPLIER#
FROM SUPPLIER S, PART P, SHIPMENT H
WHERE S.SUPPLIER# = H.SUPPLIER# AND
H.PART# = P.PART# AND
S.CITY = 'CLEVELAND' AND
P.COLOUR = 'RED';
 The time for the query varies from 1 second to 2.3 days depending on the query plan selected
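A back-of-the-envelope sketch of why the chosen plan matters so much (Python; the table sizes come from the slide, while the red-part selectivity and the two candidate plans are invented for illustration):

# Compare two naive distributed plans by rows shipped between Detroit and Chicago.
SUPPLIER_ROWS = 10_000      # stored at Detroit
PART_ROWS = 100_000         # stored at Chicago
SHIPMENT_ROWS = 1_000_000   # stored at Detroit
RED_FRACTION = 0.1          # assumed fraction of parts that are red

# Plan A: ship the whole SHIPMENT table to Chicago and join there.
plan_a_rows_moved = SHIPMENT_ROWS

# Plan B: select red parts at Chicago first, ship only their part numbers to Detroit.
plan_b_rows_moved = int(PART_ROWS * RED_FRACTION)

print("Plan A ships", plan_a_rows_moved, "rows; Plan B ships", plan_b_rows_moved, "rows")
# A hundred-fold difference in data shipped is the kind of gap that separates
# a one-second query from a multi-day one.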