Lecture slides - School of Computer Science

Download Report

Transcript Lecture slides - School of Computer Science

1. Intro to the relational data model
& simple SQL queries
1. What is a database?
2. What is a database management system?
3. Tables
4. The relational data model
“All data can be represented in the form of tables.”
(Edgar Codd, 1970)
2. Querying a database with SQL
5. Simple SELECT queries
6. How does a database system evaluate a
query?
7. Some technical info on simple SELECT queries
8. exercise
9. working environments
Lecture 2
Aggregate and nested queries
10. Sorting the output
11. Random selection
12. Boolean connectives in the WHERE clause
13. Anomalies in tables: repeated rows
14. Anomalies in tables: null values
15. Set operators
16. Aggregation
10. Sorting the output
SELECT * FROM staff ORDER BY title, lastname;
SELECT * FROM staff ORDER BY title DESC, lastname;
11. Random selection
SELECT * FROM staff ORDER BY random();
SELECT * FROM staff ORDER BY random() LIMIT 1;
SELECT * FROM staff ORDER BY salary DESC LIMIT3 OFFSET
3;
12. Boolean connectives in the
WHERE clause
SELECT * FROM staff WHERE salary>50000 AND NOT title =
'Prof';
Precedence:
OR < AND < NOT
(Java:
||
&&
!)
Take care when translating natural language into boolean
connectives in a query!
Take care when translating natural language into boolean
connectives in a query!
Examples of tricky ones:
“Do you want icecream or cake for desert?”
In both Java and SQL you can have both icecream and cake!!!
If you mean exclusive or (between p=icecream and q=cake),
you must make that explicit:
p XOR q →
q)
....WHERE (p AND NOT q) OR (NOT p AND
“Bring me all the blue books if they are in the bottom shelf.”
q
p
Translation word-by-word:
“Bring me all the blue books if they are in the bottom shelf.”
Translation word-by-word:
p => q. equivalently, NOT p OR q
But what was meant in fact in the sentence?
p AND q
Beware that natural language can be ambiguous. Always check
with your customers what they really mean by what they say!
13. Anomalies in tables:
repeated rows
* When entering data into the database:
Imagine a payroll database with some accidentally
repeated rows... This is a source of problems!
When appropriate, the db system can be instructed to
check for identical repeated records by declaring a
(subset of) attribute(s) as UNIQUE.
* For the output of a query:
SELECT DISTINCT title FROM staff;
14. Anomalies in tables: null values
Null vales are better than fantasy values, whenever the value of an
attribute of an entry is unknown. But, we need to be aware of its
effects.
SELECT * FROM staff WHERE office>=100 AND office < 200;
SELECT * FROM staff WHERE NOT(office>=100 AND office < 200);
Neither lists staff whose office number is unknown! To list those too,
SELECT * FROM staff WHERE office>=100 AND office < 200 OR
office IS NULL;
Logic with <unknown> works like this:
NOT <unknown> = <unknown>
<unknown> AND TRUE = <unknown>
<unknown> AND FALSE = FALSE
<unknown> OR TRUE = TRUE
<unknown> OR FALSE = <unknown>
When appropriate, NULL values can be banned for an
attribute at the creation of the table.
15. Set operators
Warning: these are time-consuming operations
SELECT lastname, firstname FROM staff
INTERSECT
SELECT lastname, firstname FROM regular_drug_users;
INTERSECT ALL
UNION
UNION ALL
EXCEPT
EXCEPT ALL
16. Aggregation
= extracting summary info
COUNT,
MAX, MIN, AVG, STDDEV
SELECT MAX(marks) FROM allmarks;
SELECT COUNT (DISTINCT title) FROM staff;
SELECT COUNT (*) FROM staff;
NULL values are ignored
Next:
17. Grouped aggregation
18. Nested queries
19. A conflict between selection and grouping
Exercise...
17. Grouped aggregation
We have a table allmarks which has the following
attributes: student, mark, bc.
We want to know the average mark for each individual
course, as this:
Course Code | Avg mark
---------------------------------
06-02324
| 60.4
.....
| …
SELECT bc AS “Course Code”, AVG(mark) AS “Average mark”
FROM allmarks
GROUP BY bc;
It is important to understand what is going on behind the scene:
i) the table gets sorted by the 'bc' field
ii) it gets split apart into a list of tables (blocks), each having
identical value for the field 'bc'.
iii) each block gets squashed to one line
=> Any attribute that is mentioned in the SELECT part
must be either in GROUP BY or in an aggregation
operator!
Check:
SELECT bc AS “Course Code”, AVG(mark) AS “Average mark”
FROM allmarks
GROUP BY bc;
=> When using GROUP BY, any attribute that is mentioned in the
SELECT part must be either in GROUP BY or in an aggregation
operator! Check:
SELECT bc AS “Course Code”, AVG(mark) AS “Average mark”
FROM allmarks
GROUP BY bc;
Q: How about if we know of another attribute that also keeps
constant within the groups defined in GROUP BY above (e.g.
cid)? - can we not select it too?
A: Nope! [some DB systems permit this but it’s a source of mess!]
Note: We can have several attributes after GROUP BY. That will
create groups that have the concatenated values of those
attributes all the same within a group.
...selection before or after GROUP BY
* to filter the lines that go into grouping: use the
WHERE clause
SELECT bc AS “Course Code”, AVG(mark) AS “Average mark”
FROM allmarks WHERE mark > 0 GROUP BY bc;
...selection before or after GROUP BY
* to filter the lines that come out of grouping:
SELECT bc AS “Course Code”, AVG(mark) AS “Average mark”
FROM allmarks WHERE mark > 0 GROUP BY bc HAVING
COUNT(*) >=5;
- What will this query return?
- What kind of condition can we use in the HAVING construction?
18. Nested queries
- remember that queries return tables
- but SQL lets us use them as values inside the WHERE clause of
another query
- the conversion is done by SQL silently
SELECT lastname, firstname FROM staff
WHERE salary =(SELECT MAX(salary) FROM staff);
Q: what is the result of the inner query? [table with one col & 1
row!]
Q: what is the result of this entire query?
When the inner query returns more than one line:
SELECT lastname, firstname FROM staff
WHERE sid IN (SELECT sid FROM lecturing);
SELECT year, numbers FROM lecturing
WHERE (cid, numbers) IN
(SELECT cid, MAX(numbers) FROM lecturing GROUP BY cid);
Q: what do these queries search for ?
SELECT lastname, firstname FROM staff
WHERE sid IN (SELECT sid FROM lecturing);
[searches for those members of staff which were actually involved
in teaching]
SELECT year, numbers FROM lecturing
WHERE (cid, numbers) IN
(SELECT cid, MAX(numbers) FROM lecturing GROUP BY cid);
[selects for each course the year(s) in which enrolment was
highest]
Other than '=' and 'IN', we can also use 'EXISTS' (or 'NOT
EXISTS'):
SELECT name FROM courses
WHERE NOT EXISTS (SELECT * FROM ALLMARKS
WHERE allmarks.bc = courses.bc);
[searches for those courses for which there are no marks in the
'allmarks' table]
19. Worked example
[A conflict between selection and grouping]
For each course, find the number of students that failed the exam?
SELECT bc AS “Course Code”, COUNT(*) as “Nr Failures”
FROM allmarks
WHERE mark < 40
GROUP BY BC;
??? but how about courses with no failing? - those don't get listed
To fix this, we need to separately construct those courses where
everyone passed, put 0 for “Nr Failures”, and take union.
SELECT bc AS “Course Code”, COUNT(*) as “Nr Failures”
FROM allmarks
WHERE mark < 40
GROUP BY BC
UNION
SELECT DISTINCT bc AS “Course Code”, 0 as “Nr Failures”
FROM allmarks
WHERE bc NOT IN (SELECT DISTINCT bc
FROM allmarks
WHERE mark < 40);
Finishing off review of SQL
21. Beyond single tables
22. How to use SQL to extract information from more
than one table
23. How are multi-table queries evaluated
24. The role of record identifiers (=keys)
25. Using a table more than once
26. Sub-queries in the “FROM” field
27. Design pattern: Finding a maximum
28. Other operators
29. Expressivity of SQL
21. Beyond single tables
Database = several tables
Each table encodes a single entity
Different tables can have different sizes
22. How to use SQL to extract information from more
than one table
List all courses taught by Prof. Yung, without
using his sid.
SELECT lecturing.cid
FROM staff, lecturing
WHERE staff.firstname = ’Achim’ AND
staff.lastname = ’Jung’ AND
staff.sid = lecturing.sid;
Further modify this to display course names
instead of cid
SELECT courses.name
FROM staff, lecturing, courses
WHERE staff.firstname = ’Achim’ AND
staff.lastname = ’Jung’ AND
staff.sid = lecturing.sid AND
lecturing.cid=courses.cid;
23. How are multi-table queries evaluated
All combinations of records from the three tables
are considered.
We require identifier fields (like cid and sid) to
match, which narrows down the number of hits.
Imagine what would be the result if you left out the
WHERE cause?
– don't try it out while other students are working with
the database system...!
24. The role of record identifiers (=keys)
We create a unique identifier for each record.
We use them to make the link between tables. (E.g.
teachers and courses)
Whenever we need to refer to a record from another
table, we use that identifier.
Terminology: unique identifier = 'key'.
We can declare keys when we set up the database. So
the system will check uniqueness.
Terminology: unique identifier = 'key'
Inside that table we call it 'primary key'
We can have several fields that are unique to a record.
We call those 'secondary key'.
* E.g. in courses, cid is primary cay, bc is secondary
key.
Keys allow us to refer to entities (which are more fully
described in their own table) compactly. When we do
so from another table, the key is called 'foreign'.
* E.g. sid is primary key in staff, but occurs as
foreign key in lecturing.
25. Using a table more than once
The system searches line by line. How can we then compare
one record with another one?
Which courses changed lecturer from 2002 to 2003?
SELECT l2.cid
FROM lecturing AS l1, lecturing AS l2
WHERE l1.cid = l2.cid AND
l1.year = 2002 AND l2.year = 2003 AND
l1.sid <> l2.sid;
(Another way would be via a subquery in the WHERE part –
as in last lecture.)
26. Sub-queries in the “FROM” field
SELECT courses.name, AVG(temp.numbers)
FROM courses,
(SELECT DISTINCT cid, numbers
FROM lecturing) AS temp
WHERE courses.cid = temp.cid
GROUP BY courses.cid, courses.name;
[Calculates the average number of students on each course,
avoiding the problem of repeated entries]
27. A design pattern to remember: Finding a maximum
We want to know not just the maximum, but also some details of
the row where the maximum occurs. E.g. who earns the
maximum salary?
Wrong (why?):
SELECT lastname, MAX(salary) FROM staff;
Correct:
SELECT lastname, salary FROM staff
WHERE salary = (SELECT MAX(salary) FROM staff);
28. Other operators
We walked through the main features of SQL.
There are many more.
The main ones should allow you to be able to
consult a book when you need to use more
features.
Examples:
type casting
pretty printing
rounding
29. Expressivity of SQL
Q: Can every query be formulated in SQL?
A: No.
A: Among those that can be, some are quite
cumbersome.
A: In majority of practical applications, the queries
expressing the query is straightforward
Relational Algebra
30. What & Why
31. Relation schemas
32. The basic operators
33. Examples
30. What & Why
Codd:
data = tables
query: table → table
We saw a language to specify such computations:
SQL
- most widely used
- but not the most convenient
Want:
- language of a few basic operators
- combination sufficient to express all queries
LIKE ALGEBRA FOR NUMBERS!
31. Relation schemas
Tables (='relations') = domain of interest in
relational algebra
= the analogue of numbers for high school
algebra
But unlike numbers, tables come in various
shapes...
Solution: schemas
32. The basic relational algebra
operators
Selection
Projection
Renaming
Natural join
Other
33. Examples
Translate the following into relational algebraic
expressions:
1. SELECT lastname
FROM staff
WHERE title=’Prof’;
33. Examples
Translate the following into relational algebraic
expressions:
2. SELECT s.name, l.cid
FROM staff AS s, lecturing AS l
WHERE s.sid = l.sid;
33. Examples
Translate the following into relational algebraic
expressions:
3. SELECT s.lastname, c.name
FROM staff AS s, lecturing AS l, courses AS c
WHERE s.sid = l.sid AND l.cid = c.cid;
33. Examples
Translate the following into relational algebraic
expressions:
4. SELECT l2.cid
FROM lecturing AS l1, lecturing AS l2
WHERE l1.cid = l2.cid AND
l1.year = 2002 AND l2.year = 2003 AND
l1.sid <> l2.sid;