Transcript Slides

CSC 485E/CSC 571
Advanced Databases
Introduction
What’s a database?
• In essence a database is nothing more than a collection of
information that exists over a long period of time.
• Databases are empowered by a body of knowledge and
technology embodied in specialized software called a
database management system, or DBMS.
• A DBMS is a powerful tool for creating and managing large
amounts of data efficiently and allowing it to persist over long
periods of time, safely.
• Among the most complex types of software available.
The database [management] system
1. Allows users to create new databases and specify their schema (logical
structure of the data), using a data-definition language.
2. Enables users to query and modify the data, using a query language
and data-manipulation language.
3. Supports intelligent storage of very large amounts of data.
–
Protects data from accident or not proper use.
Example: We can require from the DBMS to not allow the insertion of
two different employees with the same SIN.
–
Allows efficient access to the data for queries and modifications.
Example: Indexes over a specified fields
4. Controls access to data from many users at once (concurrency),
without allowing “bad” interactions that can corrupt the data
accidentally.
5. Recovers from software failures and crashes.
Early DBMS’s (1960’s)
•
They encouraged the user to view the data much as it was
stored.
•
The chief models were the Hierarchical and Network.
•
The main characteristic of these models was the possibility
of easy jumping or navigating from one object to another
through pointers.
–
•
However these models didn’t provide a high-level query
language for the data.
–
•
E.g. From one employee to his department.
So, one had still to write programs for querying the data.
Also they didn’t allow on-line schema modifications.
Relational databases
Codd (1970)
• A database system should present the user with a view of
data organized as tables (also called relations).
•
Behind the scene there could be a complex data structure
that allows rapid response to a variety of queries.
–
•
But the user would not be concerned with the storage structure.
Queries could be expressed in a very high-level language,
which greatly increases the efficiency of database
programmers.
–
This high-level query language for relational databases is called:
Structured Query Language (SQL)
Database Studies
• Design of databases.
– What kinds of information go into the database?
– How is the information structured?
– How do data items connect?
• Database programming.
– How does one express queries on the database?
– How does one use other capabilities of a DBMS, such as transactions or
constraints, in an application?
– How is database programming combined with conventional
programming?
• Database system implementation.
We’ll focus on
this part
– How does one build a DBMS, including such matters as query
processing, transaction processing and organizing storage for efficient
access?
Fictitious Megatron 2006 DBMS
• Stores relations as Unix files
• Students(name, sid, dept) is stored in the file
/home/megatron/students as
Smith#123#CS
Jones#533#EE
• Schemas are stored in /home/megatron/schemas
e.g.
Students#name#STR#id#INT#dept#STR
Depts#name#STR#office#str
Megatron sample session
mayne$ megatron
WELCOME TO MEGATRON 2006
megaSQL% SELECT * FROM Students;
Name
id
dept
---------------------------------Smith
123 CS
Johnson 522 EE
megaSQL%
Megatron sample session II
megaSQL% SELECT * FROM Students
WHERE id >= 500;
Johnson#522#EE
megaSQL% quit
THANK YOU FOR USING MEGATRON 2006
mayne$
Megatron Implementation
• To execute SELECT * FROM R WHERE <COND>
• Read file schema to get attributes of R
• Check that the <COND> is semantically valid for R
• Read file R,
– for each line
• check condition
• if OK, display
• If we pipe the result into a file, say T, then add an
entry for T in the file /home/megatron/schemas
Megatron Implementation II
• To execute
SELECT office
FROM Students, Dept
WHERE Students.name = 'Smith' AND
Students.dept = Depts.name;
• Read file schema to get attributes and do semantic check.
• If Ok, then,
for each tuple s in Students
for each tuple d in Depts
if s and d satisfy the WHERE condition,
display the office value from s
What’s wrong with Megatron?
• Tuple layout on disk: no flexibility for DB modifications.
– Change CS to ECON and the entire file has to be rewritten.
• Search Expensive: no indexes; always read entire relation.
• Bruteforce query processing.
– Did we need to look at all pairs of studentdept tuples?
• No buffer manager: everything comes off of disk all the
time.
• No concurrency control: several users can modify a file at
the same time with unpredictable results.
• No reliability: can lose data in a crash or leave operations
half done.
• Little security: file system protection too coarse.
Architecture of a DBMS
•
The “cylindrical” component contains
not only data, but also metadata, i.e.
info about the structure of data.
•
If DBMS is relational, metadata
includes:
– names of relations,
– names of attributes of those
relations, and
– data types for those attributes (e.g.,
integer or character string).
•
A database also maintains indexes for
the data.
– Indexes are part of the stored data.
– Description of which attributes have
indexes is part of the metadata.
What will be covered
1. Secondary Storage Management
a)
b)
c)
d)
Disks
Accelerated access
Handling disk failures
Arranging data on disk
2. Index Structures
1.
2.
B-Trees, Extensible Hash Tables, etc.
Multidimensional Indexes (for GIS and OLAP)
3. Query Execution (we concentrate a lot here)
a) Algorithms for relational operators
b) Join methods.
4. Query Compiler (we concentrate a lot here)
a) Algebraic laws for improving query plans.
b) Cost based plan selection
c) Join orders
What will be covered
5. Concurrency Control
a) Pessimistic schemes (locking)
b) Optimistic schemes (timestamps)
6. Parallel and Distributed Databases
a)
b)
c)
d)
e)
Parallel algorithms on relations
Distributed query processing
Distributed transactions
Google’s Map-Reduce framework
Peer-to-peer distributed search
7. Data Mining
a) Frequent-Itemset Mining
b) Finding similar items
c) Clustering of large-scale data
8. Databases and the Internet
a)
b)
c)
d)
Search engines
PageRank
Data streams
Data mining of streams