String Containers Strings and StringBuffers

Download Report

Transcript String Containers Strings and StringBuffers

Distributed Databases
Going Beyond Simple Client
Server Configuration
Copyright © 2011-2012 Curt Hill
• We have already seen that a DBMS
might use:
– Multiple CPUs
– Multiple disks in RAID configuration or
– Networking
• How do we take it larger and faster?
– Multiple sites separated by significant
Copyright © 2011-2012 Curt Hill
• Distributed Database Management
System (DDBMS)
A software system that manages
multiple logically related databases
distributed over a several sites while
making the distribution transparent
• It appears to users to be a single
• The pieces are geographically
– Miles rather than feet apart
Copyright © 2011-2012 Curt Hill
Parallel and Distributed
• A parallel architecture may loosely
or tightly coupled
• Tightly coupled
– Multiple CPUs sharing a common
• Loosely coupled
– Multiple CPUs sharing disk but not
• Neither of these is considered
Copyright © 2011-2012 Curt Hill
Other Differences
• Networks connect different sites (or
– Not just a server farm with many
machines within a small distance of
• The connected databases must be
related to one another
– A query could involve several sites
• The systems do not have to be
identical in terms of data, hardware
and software
Copyright © 2011-2012 Curt Hill
• Easier to develop applications for a
• Increased reliability and availability
• Improved flexibility
• Increased performance
• Easier to scale to large
Copyright © 2011-2012 Curt Hill
• There is quite a bit of terminology
– Some is unique to DDBMS
• Transparency types
• Autonomy
• Homogeneity
Copyright © 2011-2012 Curt Hill
• The users should not be able to
discern that multiple sites contribute
• Several types of transparency
• Data organization transparency
• Replication transparency
• Fragmentation transparency
• Design transparency
• Execution transparency
Copyright © 2011-2012 Curt Hill
Data organization
• Also known as distribution or
network transparency
• Details of where data is are hidden
from user
• Location transparency
– The command to access is independent
of the location of own site or the data’s
• Naming transparency
– Named objects (such as tables) may be
accessed regardless of location where
they may be
Copyright © 2011-2012 Curt Hill
Replication transparency
• We may have copies of the data at
multiple sites
• This increases performance,
reliability, availability
• The user cannot tell that such
copies exist nor if copies are
accessed rather than the original
• Keeping the copies up to date is also
handled transparently
Copyright © 2011-2012 Curt Hill
Fragmentation transparency
• Tables may be fragmented in either
a horizontal or vertical way across
– Horizontal: not all tuples are at the
same site
– Vertical: not all fields are at the same
• A global query is transparently
transformed into local queries
– User is unaware of any fragmentation
Copyright © 2011-2012 Curt Hill
Design and Execution
• Design: freedom from knowing how
the database is organized across
the different sites
• Execution: ignorance of where the
work of the query is actually
– The source site
– The site where data resides
– Another site
Copyright © 2011-2012 Curt Hill
• How much independent operation is
• If the network is inoperable can the
separated sites still in function to
what degree?
• High degree of autonomy increases
• Design, Communication and
Execution autonomies exist
Copyright © 2011-2012 Curt Hill
Autonomy Again
• Design: Each site may employ its
own design:
– How the tables and views are
– How many and what type of indices
• Communication: how much sharing
of informaiton with other nodes
• Execution: independence of users to
access information
Copyright © 2011-2012 Curt Hill
Reliability and Availability
• Two terms are related but not
• Reliability: likelihood of the system
running at any point in time
• Availability: likelihood that the
system will be continuously running
during some interval
Copyright © 2011-2012 Curt Hill
• How similar are the systems within a
• Homogeneous – all sites are the
• Heterogeneous – sites have different
– Not just different versions of the same
Copyright © 2011-2012 Curt Hill
Local autonomy
• How well can a site function by
– Such as in the case of network failure
• No local autonomy makes the site a
slave of the system
– It does not function by itself
Copyright © 2011-2012 Curt Hill
Bad News: CAP theorem
• A distributed database or web
service cannot guarantee all of the
• Consistency
– That operations occur all at once
• Availability
– Every operation must terminate in the
intended operation
• Partition tolerance
– Operations will complete even if
individual components fail
Copyright © 2011-2012 Curt Hill
Types of DBMS
Traditional central database
Pure distributed database
Federated database
– AKA peer to peer database
Copyright © 2011-2012 Curt Hill
Pure distributed database
• Made to look like traditional
database except that it is distributed
over many sites
• Sites have no local autonomy
• Single conceptual schema for all
• All access is through a single site
that is part of it
Copyright © 2011-2012 Curt Hill
Federated Database
• Each server site has local autonomy
• Users and a DBA are associated
with each site
• A global schema that is used across
the sites
• Heterogeneous software may be
Copyright © 2011-2012 Curt Hill
• Each server site has full local
• Users and a DBA are associated
with each site
• Each site has its own schema
• Each site cooperates with other
sites converting the schema of a
request into the local schema as
• Heterogeneous software may be
Copyright © 2011-2012 Curt Hill
Pure DDB Architecture
• A pure distributed database has a
single conceptual schema
– The Global Conceptual Schema (GCS)
– Users interact with this conceptual
• Each site then has:
– A Local Conceptual Schema (LCS)
– A Local Internal Schema (LIS)
• The local schemas only reflect the
pieces of the overall database
possessed at the local site
Copyright © 2011-2012 Curt Hill
Query Translation
• The query compiler must take a
• It then translates it into the subqueries for each local site
• This is based on the fragmentation
of the whole database
• It may generate several alternative
queries and then evaluate their
costs to find the best one
• Local sites will also have query
Copyright © 2011-2012 Curt Hill
Pure and Federated
• In a pure distributed database there
is typically little or no heterogeneity
or local autonomy
– Each site is very uniform
– Except for the differences demanded
by the fragmentation types
• In a federated system the greater
levels of heterogeneity and
autonomy make for different
handling of schemas
Copyright © 2011-2012 Curt Hill
Federated Architecture
• At the top level there is a federated
schema that supports the external
schema that the users see
• Immediately below this is an export
– This is the schema that all the other
sites are able to access
• Below this is the local schema
• Query translation is similar but more
difficult in this approach
Copyright © 2011-2012 Curt Hill
Three-Tier Client Server
• The three tier approach is generally
required in DDBMS
• The presentation layer
– This is the client - usually a web
browser or specialized client
• The application server
– Split the request among the individual
• Database server
– An individual site
Copyright © 2011-2012 Curt Hill
A Three Tier Approach
DDBMS Application Server
Copyright © 2011-2012 Curt Hill
Application Layer
• This layer largely distinguishes a
• It must have substantial logic so that
it divides each query into the
components that are to be sent to
the individual sites
• It then reconstitutes the results and
relays them to the original client
• The application server may be
multiple machines as well
Copyright © 2011-2012 Curt Hill
Load Balancer
• A load balancer may introduce
another level
• Requests are sent from the client to
the load balancer
• It chooses the application server
that is the least busy
• Once this is chosen it gets out of the
• Unlike the application server
Copyright © 2011-2012 Curt Hill
A Four Tier Approach
App Server
App Server
App Server
Load balancer
Copyright © 2011-2012 Curt Hill
• Without some form of fragmentation
there is no distribution
• It is this fragmentation that
increases performance, reliability
and availability
• It is also this fragmentation that
complicates the global query
compiler and optimizer
Copyright © 2011-2012 Curt Hill
• Horizontal fragmentation stores one
table in multiple sites
• The tuples that are stored at a
particular site are determined by a
– A condition that could be in a query
• Example:
– The employees stored in a fragment at
a site may be those who work at that
– The products that are available at this
site’s warehouse
Copyright © 2011-2012 Curt Hill
• Similar to a decomposition to bring a
schema to a higher normal form
• Example:
– A site may have employee information
suitable for matching to a task, but
another site have payroll information
• Both would need the primary key
– Many sites might have in-stock
information but only one reordering
Copyright © 2011-2012 Curt Hill
More on Fragmentation
• If we have use both we get mixed or
hybrid fragmentation
• A fragmentation schema defines
what pieces are stored
– This includes all attributes of the
fragmented tuple
• An allocation schema maps the
fragments to the sites
– This must also account for replicated
Copyright © 2011-2012 Curt Hill
• Maintaining copies of all or a portion
of database at multiple sites makes
it easier for queries to be handled
without much network traffic
– A fully replicated distributed database
maintains all data at all sites
– No replication is the other extreme
– Partial replication is the middle
• This also increases the overhead of
any type of data change
– The change must be propagated to
other sites
Copyright © 2011-2012 Curt Hill
• The allocation schema records the
assignment of fragments to sites
• A replication schema records where
copies of the data are maintained
• Finding a good replication scheme is
not easy
• Often use frequency of usage data
to assign sites to minimize network
Copyright © 2011-2012 Curt Hill
Query Processing
• Distributed query processing
usually involves stages
• Query mapping
– Not much different than that on a
regular DBMS
• Localization
• Global query optimization
• Local query optimization
• Not much different than that on a
regular DBMS
Copyright © 2011-2012 Curt Hill
• Fragmentation scatters the data
over several sites
• Replication increases the number of
sites where the data may be found
• Localization uses the fragmentation
and replication schemas to
determine where the data is that is
needed for this query
– Often several possibilities exist
Copyright © 2011-2012 Curt Hill
Global Query Optimization
• Localization will provide several
candidate queries
• These will be evaluated for costs:
– CPU costs
– I/O costs
– Communication costs
• Usually the communication cost is
most significant
• The costs are generally expressed in
terms of time
Copyright © 2011-2012 Curt Hill
Data Transfer Costs
• Generally:
– CPUs are faster than memory
– Memory is faster than disk storage
– Disk storage is faster than network
• In a DBMS minimizing disk storage
accesses is very important
• In a DDBMS it is still, but minimizing
network traffic is even more
– Consider a join across two sites
Copyright © 2011-2012 Curt Hill
• Generally executing joins is time
• If the two tables are in different sites
– Transfer one table to another site
– Join the two tables
• Substantial network work,
especially if the table is large
• Many queries have multiple joins
Copyright © 2011-2012 Curt Hill
The Semijoin
• The idea is to reduce as much
volume to be transmitted over the
network as possible
• Reduce the number of tuples prior to
• Reduce the number of columns as
Copyright © 2011-2012 Curt Hill
The Join strategy
• Project out all the unnecessary
columns from the table
– Those columns not needed by result
• Transmit result
– Usually attempt to transmit the smaller
to the larger
• Join the two
• You may even send the result back
to original site and rejoin to get the
full data needed
Copyright © 2011-2012 Curt Hill
Example Tables
• Suppose an employee table:
• Suppose also a department table:
Copyright © 2011-2012 Curt Hill
Example Locations
• Employee table is horizontally
– Each site has only its own employees
• Department table is only at company
– Every department may be represented
at each site
• Want a query that connects
employees with their VPs and
Copyright © 2011-2012 Curt Hill
Example Query
• Query might look something like
Select loc.E_Name,
glob.E_Name, loc.E_salary
From employees loc,
employees glob, department
Where E_site = 4 AND
loc.E_Dept = D_ID AND
glob.E_ID = D_VP
Copyright © 2011-2012 Curt Hill
Example Numbers
• Suppose that an employee record is
88 bytes
• There are 250 records at site 4
• Suppose that there are 700
employees at the central office
• A department record is 35 bytes and
there are 100 departments
Copyright © 2011-2012 Curt Hill
Example Query Execution
• Take the local employees fragment
and project out everything but E_ID,
• Send this to the central site
• Join this with department and then
with employee fragment at central
• Project out everything but loc.E_ID,
and glob.E_Name
• Send back
• Join back to local employee file
• Project andCopyright
© 2011-2012 Curt Hill
Example Transmissions
• Transmitting the site 4 employees is
22,000 bytes
• Transmitting the site 1 employees is
61,600 bytes
• Transmitting the department table is
3,500 bytes
• Transmitting the E_ID, E_Dept
columns from 4 to 1 is 2000 bytes
• Transmitting the semi-joined file
back to 4 is 5000 bytes
Copyright © 2011-2012 Curt Hill
Transaction Management
• The ACID properties may be
• Replicated pieces must be updated
with originals
• Locking multiple sites may slow
matters down
• What needs to happen if
communication is lost during the
Copyright © 2011-2012 Curt Hill
Strict 2PL Rules
• A transaction requests a lock when it
desires to access the object
– Shared lock for reads and exclusive for writes
• It holds all locks until complete
– Either a commit or rollback
• A transaction that cannot get the lock is
suspended until the item is available
• If both reading and writing is desired get
the exclusive lock
• A transaction holding a lock for too long
may be deadlocked
– Often aborted
Copyright © 2011-2012 Curt Hill
Multiple Sites
• A multidatabase or a distributed
database complicates this
• A multidatabase usually needs a two
phase commit (2PC)
• A distributed database often needs
a three phase commit (3PC)
• In both cases a Global Recovery
Manager is required
– AKA Global Recovery Coordinator
Copyright © 2011-2012 Curt Hill
2PC - Phase 1
• Each local signals that the
transaction has concluded
• Global recovery manager sends a
Prepare for Commit message
• Each local will write its log records
for potential recovery
• The locals will respond with ready to
commit if everything is fine and
unable to commit otherwise
Copyright © 2011-2012 Curt Hill
2PC – Phase 2
• If all locals have responded with
ready to commit, then global
recovery manager sends a commit
• Otherwise entire transaction is
rolled back
• Since log information is up to date
this can be recovered even in the
event of a failure
Copyright © 2011-2012 Curt Hill
2PC problems
• This is a blocking protocol
• If the manager fails, then all sites are
left hanging
• Generally serious performance
issues result
• This caused the development of the
three phase commit (3PC)
• Both 2PC and 3PC rely on votes
• Any problems by any local can rollback
the whole transaction
Copyright © 2011-2012 Curt Hill
Three phase commit
• What is added here is the vote result
• An upper bound on the time allowed
to perform the commit exists
• The global manager sends out the
vote tallies
• Now every site knows every other
site involved
• If the global manager fails, another
site may see the transaction to its
Copyright © 2011-2012 Curt Hill
3PC – Phase 1
• Manager sends out the can commit
• Locals respond yes
• If any local votes no or fails to
respond the transaction is aborted
Copyright © 2011-2012 Curt Hill
3PC – Phase 2
• Manager sends out the pre-commit
• Locals acknowledge
• If any local fails to respond the
transaction is aborted
Copyright © 2011-2012 Curt Hill
3PC – Phase 3
• Manager sends out the do commit
• Locals respond have committed
• Timeout causes abort
Copyright © 2011-2012 Curt Hill
• DDBMS are more complicated than
single site DBMS
• Allow for scales unattainable with
regular DBMS
• The various transparencies
complicate implementation, but
simplify usage
Copyright © 2011-2012 Curt Hill