Modern Databases - Stellenbosch University
Download
Report
Transcript Modern Databases - Stellenbosch University
Modern Databases
Willem Visser
RW334
The Web is Changing the Game
• Databases used to be the domain of
corporations with limited amounts of data and
limited amounts of users
– Very valuable information, but not a lot of it
– Important users, but not many of them
• In the modern web-driven world
– Enormous amounts of data are being generated
– Millions of users are interested in that data
What is Wrong here?
What is Wrong here?
How to make the DB scale?
Partition and Distribute the Data
Distributed Database
• A single logical database spread
physically across computers in multiple
locations that are connected by a data
communications link
6
Major Objectives
• Location Transparency
– User does not have to know the location of the data
– Data requests automatically forwarded to appropriate
sites
• Local Autonomy
– Local site can operate with its database when network
connections fail
– Each site controls its own data, security, logging,
recovery
7
Distributed Databases
Advantages
•
•
•
•
•
8
Increased reliability/availability
Local control over data
Modular growth
Lower communication costs
Faster response for certain queries
Distributed Database
Disadvantages
•
•
•
•
9
Software cost and complexity
Processing overhead
Data integrity exposure
Slower response for certain queries
Options for
Distributing a Database
• Data replication
– Copies of data distributed to different sites
• Horizontal partitioning/Sharding
– Different rows of a table distributed to different sites
• Vertical partitioning
– Different columns of a table distributed to different sites
• Combinations of the above
10
Data Replication
• Advantages:
– Reliability
– Fast response
– May avoid complicated distributed transaction integrity
routines (if replicated data is refreshed at scheduled
intervals)
– Decouples nodes (transactions proceed even if some
nodes are down)
– Reduced network traffic at prime time (if updates can
be delayed)
11
Data Replication (cont.)
• Disadvantages:
– Additional requirements for storage space
– Additional time for update operations
– Complexity and cost of updating
– Integrity exposure of getting incorrect data if
replicated data is not updated simultaneously
Therefore, better when used for non-volatile
(read-only) data
12
Factors in Choice of
Distributed Strategy
•
•
•
•
•
•
13
Funding, autonomy, security
Site data referencing patterns
Growth and expansion needs
Technological capabilities
Costs of managing complex technologies
Need for reliable service
Distributed DBMS
• Distributed database requires distributed DBMS
• Functions of a distributed DBMS:
– Locate data with a distributed data dictionary
– Determine location from which to retrieve data and process query
components
– DBMS translation between nodes with different local DBMSs (using
middleware)
– Data management functions: security, concurrency, deadlock control, query
optimization, failure recovery
– Data consistency (via multiphase commit protocols)
– Global primary key control
– Scalability
– Data and stored procedure replication
– Allowing for different DBMSs and application code at different nodes
14
Distributed DBMS
Transparency Objectives
• Location Transparency
– User/application does not need to know where data resides
• Replication Transparency
– User/application does not need to know about duplication
• Failure Transparency
– Either all or none of the actions of a transaction are committed
– Each site has a transaction manager
• Logs transactions and before and after images
• Concurrency control scheme to ensure data integrity
– Requires special commit protocol
15
Query Optimization
•
In a query involving a multi-site join and, possibly, a distributed
database with replicated files, the distributed DBMS must decide where
to access the data and how to proceed with the join. Three step
process:
1.
2.
3.
Query decomposition–rewritten and simplified
Data localization–query fragmented so that fragments reference
data at only one site
Global optimization–
•
•
•
•
16
Order in which to execute query fragments
Data movement between sites
Where parts of the query will be executed
Semi join operation: only the joining attribute of the query is sent from
one site to the other, rather than all selected attributes
Brewer’s CAP Theorem
• Eric Brewer, Keynote at ACM Symposium on
the Principles of Distributed Computing 2000
• You cannot have all three of:
– Consistency
– Availability
– Partition Tolerance
• Nothing short of complete network failure and the
system must keep functioning
• Theorem proven in 2002 by Gilbert and Lynch
• See http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Dealing with CAP?
• Drop Partitioning Tolerance
– Don’t partition, but then you have serious scalability
issues, which is probably why you want to partition in the
first place
• Drop Availability
– Wait for all the partitions to sync before allowing any
usage
– This is as bad for scalability as having no partitioning
• Drop Consistency
– Eventual Consistency seems to work in most cases
– If you have to drop one, this is the preferred option
– Flies against most DB principles
Relational DB?
• Seems like we are assuming the DB must still
be relational
• Web also forces a new concept
– Not all data look the same anymore!
• Email messages, Images, News documents,
Facebook updates, Tweets,…
• Relations are too rigid
• Semi-structured data
The Information-Integration Problem
•
Related data exists in many places and could, in
principle, work together.
• But different databases differ in:
– Model (relational, object-oriented?).
– Schema (normalized/ not normalized?).
– Terminology: are consultants employees?
Retirees? Subcontractors?
– Conventions (meters versus feet?).
• How do we model information residing in
heterogeneous sources (if we cannot combine it
all in a single new database)?
Example
• Suppose we are integrating information about
bars in some town.
• Every bar has a database.
– One may use a relational DBMS; another keeps
the menu in an MS-Word document.
– One stores the phones of distributors, another
does not.
– One distinguishes ales from other beers,
another doesn’t.
– One counts beer inventory by bottles, another
by cases.
Semi-structured Data
• Purpose: represent data from independent
sources more flexibly than either relational or
object-oriented models.
• Think of objects, but with the type of each object
its own business, not that of its “class.”
• Labels to indicate meaning of substructures.
• Data is self-describing: structural information is
part of the data.
Graphs of Semistructured Data
• Nodes = objects.
• Labels on arcs (attributes, relationships).
• Atomic values at leaf nodes (nodes with no arcs
out).
• Flexibility: no restriction on:
– Labels out of a node.
– Number of successors with a given label.
Example: Data Graph
Notice a
new kind
of data.
root
beer
bar
beer
manf
name
servedAt
Bud
A.B.
manf
prize
name
M’lob
name
addr
Joe’s
Maple
The bar object
for Joe’s Bar
year
1995
The beer object
for Bud
award
Gold
XML
• XML = Extensible Markup Language.
• While HTML uses tags for formatting (e.g., “italic”), XML uses
tags for semantics (e.g., “this is an address”).
• Key idea: create tag sets for a domain (e.g., bars), and
translate all data into properly tagged XML documents.
• Well formed XML - XML which is syntactically correct
– tags and their nesting totally arbitrary.
• Valid XML - XML which has DTD (document type definition)
– imposes some structure on the tags, but much more flexible than
relational database schema.
• DTD and XML Schema
– Meta-data for XML
– Describe what are valid XML structures
XML and Semi-structured Data
• Well-Formed XML with nested tags is exactly the
same idea as trees of semi-structured data.
• XML also enables non-tree structures (with
references to IDs of nodes), as does the semistructured data model.
Example: Well-Formed XML
<?xml version = “1.0” standalone = “yes” ?>
<BARS>
<BAR><NAME>Joe’s Bar</NAME>
<BEER><NAME>Bud</NAME>
<PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME>
Root tag
<PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> …
Tags surrounding
</BARS>
a BAR element
A NAME
subelement
A BEER
subelement
27
Example
• The <BARS> XML document is:
BARS
BAR
BAR
BAR
NAME
BEER
BEER
Joe’s Bar
NAME
Bud
PRICE
2.50
NAME
Miller
PRICE
3.00
...
DTD Elements
• The description of an element consists of its
name (tag), and a parenthesized description of
any nested tags.
– Includes order of subtags and their multiplicity.
• Leaves (text elements) have #PCDATA (Parsed
Character DATA ) in place of nested tags.
29
Example: DTD
A BARS object has
zero or more BAR’s
nested within.
<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
NAME and PRICE
are text.
A BAR has one
NAME and one
or more BEER
subobjects.
A BEER has a
NAME and a
PRICE.
30
Querying XML
• Why query XML-documents?
– special XML databases
– major DBMSs “speak” XML;
• Does the world need a new query language?
• Most of the world's business data is stored in relational
databases;
• The relational language SQL is mature and wellestablished;
• Can SQL be adapted to query XML data?
– Leverage existing software
– Leverage existing user skills
XML vs Relational Data
• Relational data is "flat”: rows and columns;
• XML data is nested: and its depth may be
irregular and unpredictable;
• Relations can represent hierarchic data by
foreign keys or by structured datatypes;
• In XML it is natural to search for objects at
• unknown levels of the hierarchy:
• "Find all the red things“;
XML vs Relational Data (cont.)
• Relational data is uniform and repetitive;
• All bank accounts are similar in structure;
• Metadata can be factored out to a system catalog;
• XML data is highly variable;
• Every web page is different;
• Each XML object needs to be self-describing;
• Metadata is distributed throughout the document;
• Queries may access metadata as well as data:
"Find elements whose name is the same as their
content“:
• //*[name(.) =string(.)]
XML vs Relational Data (cont.)
• Relational queries return uniform sets of rows;
• The results of an XML query may have mixed
types and complex structures;
• "Red things": a flag, a cherry, a stopsign, ...
• Elements can be mixed with atomic values
• XML queries need to be able to perform
structural transformations
• Example: invert a hierarchy;
XML vs Relational Data (cont.)
• The rows of a relation are unordered
• Any desired output ordering must be derived from
values;
• The elements in an XML document are ordered
• Implications for query:
• Preserve input order in query results
• Specify an output ordering at multiple levels
• "Find the fifth step“
• "Find all the tools used before the
hammer“;
XML vs Relational Data (cont.)
• Relational data is "dense“
• Every row has a value in every column;
• A "null" value is needed for missing or inapplicable
data
• XML data can be "sparse“
• Missing or inapplicable elements can be "empty“ or
"not there“
• This gives XML a degree of freedom not
present in relational databases
XPATH and XQUERY
• XPATH is a language for describing paths in XML documents.
– Really think of the semi-structured data graph and its
paths.
– Why do we need path description language: can’t get at
the data using just Relation.Attribute expressions.
• XQUERY is a full query language for XML documents with
power similar to OQL (Object Query Language, query
language for object-oriented databases).
'Modern' Databases