Introduction - AIC - Department of Computer Science

Download Report

Transcript Introduction - AIC - Department of Computer Science

CS 319: Theory of Databases:
Generalities
Dr. Alexandra I. Cristea
http://www.dcs.warwick.ac.uk/~acristea/
Lecturers
• Alexandra I. Cristea
• Hugh Darwen: [email protected]
• Other invited talks?: TBA
2
Schedule
• See website
• Exceptions:
– TBA: check website, lectures, forum
3
Slides
• Thanks to:
– Dr. Paul Goldberg:
• http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/cs319/
cs319index.html
– Dr. Meurig Beynon:
• http://www.dcs.warwick.ac.uk/people/staff/Meurig.Beynon/
– Dr. Ad Aerts:
• http://wwwis.win.tue.nl/~aaerts/
– Prof. Dr. Paul De Bra:
• http://wwwis.win.tue.nl/~debra/
– Others: mentioned directly
4
Contact
• Forum:
http://forums.warwick.ac.uk/wf/browse/category.jsp?cat=24
• IF (and ONLY IF) a question is personal, you
might address it to [email protected]
– FORMAT: subject of email should contain ‘CS319’
and topic of the email (otherwise it will be filtered
out)
5
Organisational
• Office hours, see website
• Per email is also possible – but
availability depends on demand
• (general questions better to post on
forum)
6
Course site(s):
• Current:
– http://www.dcs.warwick.ac.uk/~acristea/courses/CS319/
– Will contain current slides, as taught at the course
– Will contain notifications: check BEFORE & AFTER the
course
• Official:
– http://www.dcs.warwick.ac.uk/undergraduate/modules/cs319.html
• Past:
– http://www.dcs.warwick.ac.uk/people/academic/Paul.Goldberg/cs31
9/cs319index.html
7
Content: Topics
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Generalities DB
Integrity constraints (FD revisited)
Relational Algebra (revisited)
Query optimisation
Tuple calculus
Domain calculus
Query equivalence
Temporal Data
The “Askew Wall”
How to Handle Missing Information without Using
NULL
8
Books
• Korth and Silberschatz, Database
System Concepts, McGraw-Hill,1991.
• Ullman J D, Principles of Database
Systems (Vols 1 and 2), Computer
Science Press,1988.
9
Purpose of this course
• More in-depth information on the theory
of databases
• How (some of) the existing db
languages fit in the theory (and how
they don’t)
10
Overlaps and sequencing
• Form: optional
• Prerequisites
– CS252: Fundamentals of Database Systems
– Optional: CS253: Topics in Database Systems
11
Organization of the course
•
•
•
•
•
15 CATS
CS, CSE, CBS, Mathematics
~ 30 1-hour lectures
Exam at the end: 3 hours
Rules of the game:
– Read also comments on the slides.
– Presence is optional, but beware: slides-only are
NOT ENOUGH to learn from for the exam; you
need to participate, take your own notes, so selfstudy!
12
Scope of CS 319
• Expressive power of db query languages
• Algorithms for the computational problems
• Limitations to classical relational theory
• A central contribution of theory is to say what cannot
be done, not just what can.
• Consider how this observation is relevant to the
above topics.
13
A theme of CS319
• How do theory and computing practice relate to
specific reference to databases?
• Key ideas:
• Theory: based on Codd's relational theory.
– There is an excellent correspondence between
relational theory and practical database application of a
certain kind.
• Relational databases can be seen as a precursor
of 2 principal kinds of computer application:
– environments for end-user programming and
– computer-based models of real-world state.
14
1. Database Generalities
15
DB generalities : What is a database?
• Chris Date:
– Database = computer-based record keeping system
• R.W. Engles: "A Tutorial on DB Organisation"
(1974)
– Db = collection of stored operational data used by
the applications system of a particular enterprise
– enterprise: hospital, university, bank, company etc
– operational data:
•
•
data on products, accounts, patients etc
typically persistent cf conventional program IO data
16
DB generalities : Why use a db?
• How to meet needs using a traditional file-processing
system supported by a conventional OS?
• Files: permanent records of customers, accounts
• Applications programs (APs): enable user to modify
files
–
–
–
–
to credit or debit an account
to add a new account
to find the balance in an account
to generate monthly statements
• APs written by systems programmers as required
• new requirements  new files + new programs
17
Original context for data modelling
1970s style applications
unsophisticated computer users
batch mode interaction
modest response times
no visualisation or GUI
modest expectations for ease-of-use
programming perceived as technical
simple infrastructure and environment
no PC, web etc
no live feeds of data
textual interaction the norm
18
Original context for data modelling
• 1970s style applications
– Business context
• simple business model, limited automation,
access etc
• low volume of data
• not initially distributed
– Computing context
• existing/emerging DB proposals unconvincing
• computers not very powerful
• human and computing resources very expensive
19
Summary of issues for data
management
1.
2.
3.
4.
5.
6.
Data redundancy and inconsistency
Difficulty in accessing data
Data isolation
Concurrent access anomalies
Security problems
Integrity problems
20
Data redundancy & inconsistency
•  programmer uses ≠ format file, developed at
≠ stage in history of enterprise
=> data duplicated:
+ in a DB rationalise and standardise data
[rationalise: shared source for data]
… rationalise doesn't necessarily mean centralise
21
Data redundancy & inconsistency
– compromises are needed
• where users suit themselves => efficient results
 perfect data organisation to suit all users
– duplication: insurance against info loss
22
Difficulty in accessing data
• unforeseen requests, new functionality
• new programs, possibly new data structures
+ in a DB, simplify access & manipulation by
intelligent organisation of data cf. modelling
approach to requirements
e.g. in use of UML in OOSE
23
Data isolation
data to retrieve from many sources in APs
+ in DB, hide source physical data :
higher level of abstraction
– automation: less human interaction with data
risk of corrupted data 
24
Concurrent access anomalies
• would like multiple access (efficiency &
faster response)
e.g. simultaneous withdrawal
+ concurrency can't be managed without
a form of overall control
25
Security problems
• to restrict access to (un)authorised users for
confidential info
+ security needs a form of overall control
– issues: where should the control be?
inside or outside computer system
* e.g., non-trivial problem to determine what can be
inferred from responses to queries that aren't
explicit
26
Integrity problems
• integrity constraints
– may arise dynamically:
– difficult to modify programs to cope;
– hard to guarantee if data stored in ≠ files
+ automated management demands a form of
overall control
– automation reduces scope for human
intervention / interpretation
27
Conclusions:
Issues for data management
• For many commercial applications, a
good solution is offered by a database
management system (DBMS).
• A DBMS is an unconventional OS
operating over a structured file system.
28
Motivating the DBMS concept
• abstract model of corpus of operational data
that simplifies data processing activity:
– simple queries can be handled without writing
new application programs
– if APs need => accessing & manipulating
operational data consistently and efficiently is
simplified
29
The ingredients of a database
Data
integrated
shared
(possibly distributed)
Hardware
storage
Software
database management system: DBMS
protects users from hardware level detail
serves the needs of all users
30
DB Users
• end-user:
– non-specialist accessing data via a query language
– naïve user accessing data via a special-purpose interface
performs data retrieval and update (extend / modify)
• applications programmer:
– writes programs that use the DB by embedding queries to
the DB in a HLL
– develops interfaces for the naïve user
31
DB Users
Database Administrator (DBA):
responsible for overall control
decides what data is to be stored
designs the conceptual scheme
used to represent the operational data
implements authorisation checks
decides strategy for backup and recovery
monitors performance
oversees modification to suit user requirements
32
Data abstraction in a db
• addresses issues of design, use,
management and implementation
• Data model describes (formally) 3 different
levels of abstraction:
physical level
conceptual level
view level
33
3 Levels
physical level:
how is the data actually represented in the hardware?
bits, bytes
conceptual level:
what meaningful relationships are expressed by the
physical data?
entities, and relationships between entities
view level:
what particular relationships are required by users?
more abstract because partial typically very high-level
knowledge constitutes the view
34
Illustrating data abstraction:
DB w. date of birth of a client (bit string).
senior citizens?:  clients aged > 65
Representations at ≠ abstraction levels:
Conceptual: date of birth
Physical: bit string
View: age (not stored in DB!!)
35
Data abstraction in a database
DESIGN
&
MANAGEMENT
USE
views
conceptual
model
external
schemas
subschemes
IMPLEMENTATION

physical
layout
internal
schemas
conceptual
scheme

physical
scheme
36
Role of abstraction (1)
• Internal & external translation schemas:
* protect conceptual model from change
• when physical organisation changes / new views are required
* protect user from a need to change views
* protect altering physical organisation
• if conceptual model is modified
37
Role of data abstraction (2)
• physical data independence:
protecting conceptual model from change when
physical organisation changes
• logical data independence:
protecting user from need to change views when
conceptual model changes
38
Characteristics of electronic
data 1970
• “Abstract model of the entire corpus of operational data”
• Demands of the abstract model in 1970 quite low …
– small volumes of data, modest performance
– limited levels of volatility and automation tolerated
• Today different, BUT
– (subject to viewing human agency as a metaphor for any
agency, )
– key issues of a classical database are still vital
• Any DB modelling paradigm must handle 70s problems
39
db models
• 2 principal kinds of abstract data model:
– object-based models
– record-based models
• earliest
• reflects file system culture they displaced
40
Object-based models
• main models:
– entity-relationship models
– object-oriented data models
• Others: semantic & functional data models.
• E-R model widely used to model data abstractly
• OO model gaining acceptance in practice:
effectively represents data + operations on it.
– RDF, OWL
41
Record-based Logical Models
• Used at the conceptual and view levels.
Specify both
– overall logical structure of the database
– higher-level description of the implementation.
• Record-based: uses records in fixed-format of
several types.
simplifies implementation <> trend towards richness
and variety in structures used to implement OODBs
42
Varieties of record-based
logical model
hierarchical model
records & links organised as a family of trees
network model
records & links organised as a family of graphs
relational model
uses tables to record relationships between data
43
Physical Data Models
• not our primary concern in this module.
• Relevant issues here would be:
–
–
–
–
are data tables stored using hashing?
how are data tables indexed?
how are entries in data tables encoded and ordered?
what algorithms are used to retrieve and update?
44
Classical database features
Instances and Schemes
State of DB changes over time: structure vs. current
state.
overall design of DB =
current content of DB
database schema
=
instance of the DB
Useful analogy with procedural variables:
database schema type definition for variable
instance of databasevalue of the variable
45
Data abstraction and schemas
•
•
•
physical scheme at lowest level
conceptual scheme at intermediate level
several sub-schemes (possibly userdefined) at highest level (views of the DB)
46
Data Definition Language (DDL)
• for database schema
• compiling DDL > Data Dictionary
• the storage & access methods: specified in storage &
definition language
Implementation details usually hidden from users
47
Data Manipulation Language (DML
• data manipulation: accessing DB to
retrieve, insert, delete, or modify data
• data retrieval
– most common
– "querying the DB"
• retrieval component of DML = query language
(abusively: ‘query language’ ~ synonym DML)
48
Varieties of Data
Manipulation Language
• tension between
•
efficiency at physical level
•
intelligibility / ease of use at higher level
• procedural: requires knowledge of data
implementation
• non-procedural: need only specify what data is
needed
49
Data Manipulation Languages
(DML) for typical data models
• Procedural:
– object-based, hierarchical, network models
– user explicit responsibility: optimising queries,
• needs knowledge of data organisation
• Non-procedural:
– relational models
– formulate queries without above,
• implementation has to be optimised
50
Optimisation
• Database Manager (program module!)
– interfaces between low-level data in DB and application
programs & user queries.
• Large volumes of data (available technology)
– gigabytes
– terabytes
thousand megabytes = 1 billion bytes [!]
million megabytes = 1 trillion bytes
• Requires auxiliary storage media, such as disk, CD
etc.
• Optimisation is primarily concerned with
eliminating data transfers between main and
auxiliary memory.
51
Functions of the DB
manager program module
• query processing
– interacting with the file manager modules doing actual
operations on physical data
• integrity enforcement
– checking that data in the DB conforms to specified
constraints
• security enforcement
– ensuring that authorisation is given for access to data
• backup and recovery
– coping with failure, and recovery to consistent DB state
• concurrency control
– ensuring that simultaneous transactions do not interfere.
52
Overall DB system structure
• Processing components
–
–
–
–
–
file manager
database manager
query processor
DML precompiler (to process DML embedded in APs)
DDL compiler
• Data structures
– data files: actual content of db
– data dictionary meta-data
– Indices: auxiliary files to assist fast access
53
Summary
• We have revisited and reconsidered the
generalities of database systems
– Background of appearance
– Main problems they try to solve
– Components & languages
54
… to follow
How to reason with Integrity Enforcements (Constraints):
Functional Dependencies (FDs) applied
55
• Questions?
56