Final Exam Review
Lecture 31
Administrivia
• Office hours 1:15 – 2:15 today
– also available via e-mail: [email protected]
– TAs will have extra office hours,
• see class web page, news group
• Final Exam May 16 8-11 a.m.
– Location: 55 Warren
– Closed book, 2 pages of notes, both sides
– IEEE will provide pastries and juice starting at 7:30!
Final Exam Topics
• Up to midterm 1 (25%)
– Relational Model & Query Languages (Roth)
• Relational Algebra and Calculus
• SQL
– Database Implementation (Haber)
• Disks, buffers, files
• Indexes: B-Trees, Hash Indexes
• Between midterm 1 & midterm 2 (25%)
– Query Execution
• Relational Operators (Haber)
• Sorting (Haber)
• Joining (Haber)
• Query Optimization (Roth)
• Since midterm 2 (48%)
– Database Design (Haber)
• The ER Model
• Functional Dependencies & Normalization
– Transactions, Concurrency Control, & Recovery (Roth)
• Guest Lectures (2%)
Why are databases interesting?
• Theoretical foundation
– Modelling structure of information
• Relations: sets of identically structured tuples
• Design constraints: FDs and correct decompositions
– Formal query languages
• Algebra: operators on relations, return relations
• Calculus: declarative specification of query result
• Practical application of theory
– Using computer structures
• pages, files, memory, buffer pools, indexes
– ACID properties (xacts, concur. control, recovery)
– Reasonable efficiency
Review Outline
• Up to midterm 1
– Relational Model & Query Languages
• Relational Algebra and Calculus
• SQL
– Database Implementation
• Disks, buffers, files
• Indexes: B-Trees, Hash Indexes
• Between midterm 1 & midterm 2
– Query Execution
• Relational Operators
• Sorting
• Joining
• Query Optimization
• Since midterm 2
– Database Design
• The ER Model
• Functional Dependencies & Normalization
– Transactions, Concurrency Control, & Recovery
DBMS components
• Database application
– Talks to DBMS to manage data for a specific task
-> e.g. app to withdraw/deposit money or provide a history of the account
• Query Optimization and Execution
– Figures out the best way to answer a question
-> There is always more than 1 way to skin a cat…!
• Relational Operators
– Provides generic ways to combine data
-> Do you want a list of customers and accounts or the total account balance of all customers?
• Access Methods
– Provides efficient ways to extract data
-> Do you need 1 record or a bunch?
• Buffer Management
– Makes efficient use of RAM
-> Think 1,000,000 simultaneous requests!
• Disk Space Management
– Makes efficient use of disk space
-> Think 300,000,000 accounts!
• DB
The Storage Hierarchy
Smaller, Faster (top of the hierarchy)
– Main memory (RAM) for currently used data.
– Disk for the main database (secondary storage).
– Tapes for archiving older versions of the data (tertiary storage).
Bigger, Slower (bottom of the hierarchy)
Source: Operating Systems Concepts 5th Edition
Disks are slow. Why?
• Time to access (read/write) a disk block:
– seek time (moving arms to position disk head on track)
– rotational delay (waiting for block to rotate under head)
– transfer time (actually moving data to/from disk surface)
[Figure: disk diagram labelled with arm movement, seek time, rotational delay, and transfer time]
Disk Space Manager
• Lowest layer of DBMS software manages space on disk
(using OS file system or not?).
• Higher levels call upon this layer to:
– allocate/de-allocate a page
– read/write a page
• Best if a request for a sequence of pages is satisfied by
pages stored sequentially on disk!
– Responsibility of disk space manager.
– Higher levels don’t know how this is done, or how free space is
managed.
– Though they may make performance assumptions!
• Hence disk space manager should do a decent job.
Buffer Management in a DBMS
• Page requests come from higher levels
[Figure: BUFFER POOL in MAIN MEMORY holding disk pages and free frames, with pages moving to and from the DB on DISK]
• Choice of frame dictated by replacement policy
• Buffer pool information table contains: <frame#, pageid, pin_count, dirty>
Buffer Management
• Keeps a group of disk pages in memory
• Records whether each is pinned
– What happens when all pages are pinned?
– What happens when a page is unpinned?
• Replacement
– When all frames used, but not pinned, and new page
requested?
– How is the replaced page chosen?
– Least Recently Used (LRU)
– Most Recently Used (MRU)
– Clock
– Advantages? Disadvantages?
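The pin/unpin and replacement questions above can be explored with a small model. The following is a minimal sketch, not Minibase's buffer manager: the class name `BufferPool`, its methods, and the use of a Python dict as the "disk" are all assumptions made for illustration. It tracks a pin count and dirty bit per frame and uses LRU replacement, where recency is updated when a page is pinned.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool (hypothetical API): pin counts, dirty bits, LRU."""

    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk              # stand-in for the disk: page_id -> data
        self.frames = OrderedDict()   # page_id -> [data, pin_count, dirty]

    def pin(self, page_id):
        """Return the page's data, reading it from disk if necessary."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)   # now the most recently used
        else:
            if len(self.frames) >= self.num_frames:
                self._evict()
            self.frames[page_id] = [self.disk[page_id], 0, False]
        self.frames[page_id][1] += 1
        return self.frames[page_id][0]

    def unpin(self, page_id, dirty=False):
        """Release one pin and record whether the caller modified the page."""
        frame = self.frames[page_id]
        if frame[1] == 0:
            raise ValueError("page is not pinned")
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def write(self, page_id, data):
        """Overwrite a buffered page in memory; the caller unpins it as dirty."""
        self.frames[page_id][0] = data

    def _evict(self):
        """Evict the least recently used unpinned page, flushing it if dirty."""
        for victim, (data, pin_count, dirty) in self.frames.items():
            if pin_count == 0:
                if dirty:
                    self.disk[victim] = data   # write back before reuse
                del self.frames[victim]
                return
        raise RuntimeError("all frames are pinned")
```

With two frames, pinning and unpinning pages 1 and 2 and then pinning page 3 evicts page 1, and a dirty page 1 is written back first. If every frame is pinned, the request fails, which is one possible answer to "what happens when all pages are pinned?".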
What is in Database Pages?
• Database contains files, which are made up
of…
• Pages, which are made up of…
• Records, which are made up of…
• Fields, which hold single values.
How are records/pages organized?
• Record layout depends on whether fields are variable or fixed length
• In Minibase, array of type/offsets, followed by data.
[Figure: record with fields F1 to F4 preceded by an array of field offsets]
• Page layout depends on whether records are variable or fixed length.
• Minibase: slot array at beginning of page, records compacted at end of page.
[Figure: page i with a SLOT DIRECTORY (number of slots N, pointer to start of free space, one entry per slot) and records Rid = (i,1), Rid = (i,2), ..., Rid = (i,N)]
How are files organized?
• Unordered Heap File: chained directory pages,
containing records that point to data pages.
Data
Page 1
Header
Page
Data
Page 2
DIRECTORY
Data
Page N
• Other possibilities: sorted files, clustered
indexes, unclustered index + heap file
– Many tradeoffs between them
I/O Cost of Operations
B: The number of data pages
R: Number of records per page
F: Fanout of B-Tree
S: Time required for equality search
*: Don’t use index

Operation              Heap File   Sorted File           Clustered Tree        Unclustered Tree        Hash Index
Scan all records       B           B                     1.5 B                 B*                      B*
Get all in sort order  4B          B                     1.5 B                 4B*                     4B*
Equality search        0.5 B       log2 B                logF (1.5 B)          logF (.15 B) + 1        2
Range search           B           S + #matching pages   S + #matching pages   S + #matching records   B*
Insert                 2           S+B                   S+1                   S+2                     4
Delete                 0.5B + 1    S+B                   0.5B + 1              S+2                     S+2
Indexes
• Can be used to store data records (alt 1), or be an
auxiliary data structure that refers to an existing file of
records (alt 2, 3)
• Many types of index (B-Tree, Hash Table, R-Tree, etc.)
• How do you choose the right index?
• Difference between clustered and unclustered indexes?
[Figure: CLUSTERED vs. UNCLUSTERED index. In both, index entries direct the search for data entries (index file), and data entries point to data records (data file).]
Review: Query Processing
• Queries start out as SQL
• Database translates SQL to one or more Relational
Algebra plans
• Plan is a tree of operations, with access path for each
• Access path is how each operator gets tuples
– If working directly on table, can use scan, index
– Some operators, like sort-merge join, or group-by, need tuples
sorted
– Often, operators pipelined, getting tuples that are output from
earlier operators in the tree
• Database estimates cost for various plans, chooses least
expensive
Cost of Operations
• Selections
• Projections
• Sorting, a.k.a. Order By
• Removing duplicates, a.k.a. Select Distinct
• Joins
Selections: “age < 20”, “fname = Bob”, etc
• No index
– Do sequential scan over all tuples
– Cost: N I/Os
• Sorted data
– Do binary search
– Cost: log2(N) I/Os
• Clustered B-Tree
– Cost: 2 or 3 to find first record +
1 I/O for each qualifying page
• Unclustered B-Tree
– Cost: 2 or 3 to find first RID +
~1 I/O for each qualifying tuple
• Clustered Hash Index
– Cost: ~1.2 I/Os to find bucket, all tuples inside
• Unclustered Hash Index
– Cost: ~1.2 I/Os to find bucket, +
~1 I/O for each matching tuple
Projection
• Expensive when eliminating duplicates
• Can do this via:
– Sorting: cost no more than external sort
• Cheaper if you project columns in initial pass, since more
projected tuples fit in each page.
– Hashing: build a hash table, duplicates will end up
in the same bucket
Sorting
• External Merge Sort
– Minimum amount of memory: 3 pages
• Initial runs of 3 pages
• Then 2-way merge of sorted runs (2 pages for inputs, one for outputs)
• # of passes: 1 + ⌈log2 ⌈N/3⌉⌉
– With more memory, fewer passes
• With B pages, # of passes: 1 + ⌈log_(B-1) ⌈N/B⌉⌉
– I/O Cost = 2N * (# of passes)
• Using B+ Trees for Sorting
– Idea:
• Retrieve records in order by traversing leaf pages.
– Is this a good idea? Cases to consider:
• B+ tree is clustered: Good idea!
• B+ tree is not clustered: Could be a very bad idea!
– I/O Cost
• Clustered tree: ~ 1.5N
• Unclustered tree: 1 I/O per tuple, worst case!
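The pass-count and I/O formulas for external merge sort can be checked numerically. The sketch below is one way to do it, assuming a file of N pages and B buffer pages; the function names are hypothetical. It counts runs with integer arithmetic rather than evaluating the logarithm directly, which avoids floating-point rounding.

```python
import math


def external_sort_passes(n_pages, buffer_pages):
    """Number of passes: 1 + ceil(log_{B-1}(ceil(N/B)))."""
    if buffer_pages < 3:
        raise ValueError("external merge sort needs at least 3 buffer pages")
    runs = math.ceil(n_pages / buffer_pages)   # sorted runs after pass 0
    passes = 1
    while runs > 1:                            # each later pass is a (B-1)-way merge
        runs = math.ceil(runs / (buffer_pages - 1))
        passes += 1
    return passes


def external_sort_io(n_pages, buffer_pages):
    """Every pass reads and writes each page once: 2N * (# of passes)."""
    return 2 * n_pages * external_sort_passes(n_pages, buffer_pages)
```

For example, a 108-page file with 5 buffer pages needs 4 passes (22 runs, then 6, then 2, then 1), so 2 * 108 * 4 = 864 I/Os. With only 3 buffer pages the same file needs 7 passes.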
Remove duplicates with Hashing
• Idea:
– Many ops don’t need the data ordered
– e.g.: removing duplicates in DISTINCT
– e.g.: finding matches in JOIN
• Good enough to match all tuples with equal values
• Hashing does this!
– And may be cheaper than sorting! (Hmmm…!)
– But how to do it for data sets bigger than memory??
• If we can hash in two passes -> cost is 4N
• How big of a table can we hash in two passes?
– B-1 “partitions” result from Phase 0
– Each should be no more than B pages in size
– 2 passes possible if table smaller than B(B-1)
i.e.: can hash a table of size N pages in about √N space
– Note: assumes hash function distributes records evenly!
• Have a bigger table? Recursive partitioning!
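The two-phase idea can be sketched in a few lines. This is an in-memory illustration only, with hypothetical function names: Python lists stand in for the on-disk partitions, and the built-in `hash` stands in for the partitioning hash function. The helper functions restate the 4N cost and the B(B-1) size limit from the bullets above.

```python
def distinct_by_hashing(records, num_partitions):
    """Remove duplicates by partitioning, then deduplicating each partition."""
    # Phase 0: equal values always hash to the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(record) % num_partitions].append(record)
    # Phase 1: each partition is assumed small enough to dedupe in memory.
    result = []
    for partition in partitions:
        seen = set()
        for record in partition:
            if record not in seen:
                seen.add(record)
                result.append(record)
    return result


def can_hash_in_two_passes(n_pages, buffer_pages):
    """B-1 partitions of at most B pages each, assuming no skew."""
    return n_pages <= buffer_pages * (buffer_pages - 1)


def two_pass_hash_io(n_pages):
    """Read and write every page in each of the two phases: 4N."""
    return 4 * n_pages
```

With B = 10 buffer pages, a table of up to 90 pages qualifies for two passes; a 91-page table would need recursive partitioning.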
Sorting vs Hashing
• Based on our simple analysis:
– Same memory requirement for 2 passes
– Same IO cost
• Digging deeper …
• Sorting pros:
– Great if input already sorted (or almost sorted)
– Great if need output to be sorted anyway
– Not sensitive to “data skew” or “bad” hash functions
• Hashing pros:
– Highly parallelizable
– Can exploit extra memory to reduce # IOs
Nested Loops Joins
• R, with M pages, joins S, with N pages (pR = tuples per page of R)
• Nested Loops
– Simple nested loops
• Insanely inefficient: M + pR*M*N
– Paged nested loops: only 3 pages of memory
• M + M*N
– Blocked nested loops: B pages of memory
• M + ⌈M/(B-2)⌉ * N
• If R fits in memory (M ≤ B-2), cost only M + N
– Index nested loops
• M + pR*M * index cost
• Only good if M very small
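To compare the nested loops variants on concrete numbers, the cost formulas can be written as small functions. This is a sketch under the slide's cost model (I/Os only, output not counted); the function and parameter names are hypothetical, with `tuples_per_page_r` standing for the number of R tuples per page.

```python
import math


def simple_nested_loops_io(m, n, tuples_per_page_r):
    """Scan S once per tuple of R: M + pR * M * N."""
    return m + tuples_per_page_r * m * n


def page_nested_loops_io(m, n):
    """Scan S once per page of R: M + M * N."""
    return m + m * n


def block_nested_loops_io(m, n, buffer_pages):
    """Scan S once per block of B-2 pages of R: M + ceil(M / (B-2)) * N."""
    return m + math.ceil(m / (buffer_pages - 2)) * n


def index_nested_loops_io(m, tuples_per_page_r, index_cost):
    """Probe the index on S once per tuple of R: M + pR * M * index cost."""
    return m + tuples_per_page_r * m * index_cost
```

Taking M = 1000, N = 500, 100 tuples per page, and B = 102 as example values: simple nested loops costs 50,001,000 I/Os, paged costs 501,000, and blocked costs 6,000. When R fits in B-2 pages the blocked variant drops to M + N.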
Sort-Merge Join
• Simple case:
– Sort both tables on join column
– Merge
– Cost: external sort cost + merge cost
• 2M*(1 + log_(B-1)(M/B)) + 2N*(1 + log_(B-1)(N/B)) + M + N
• Optimized Case:
– If we have enough memory, do final merge and join in same
pass. This avoids final write pass from sort, and read pass from
merge
– Can we merge on 2nd pass? Only if #runs from 1st pass < B
– #runs for R is M/B. #runs for S is N/B.
• Total #runs ≈ (M+N)/B
– Can merge on 2nd pass if (M+N)/B < B, or M+N < B^2
– Cost: 3(M+N)
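The simple and optimized sort-merge costs can be compared with a short calculation. The sketch below assumes the external merge sort pass count used earlier in the review and rounds run counts up to whole runs; all function names are hypothetical.

```python
import math


def sort_passes(pages, buffer_pages):
    """Passes of external merge sort: 1 + ceil(log_{B-1}(ceil(pages/B)))."""
    runs = math.ceil(pages / buffer_pages)
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffer_pages - 1))
        passes += 1
    return passes


def sort_merge_join_io(m, n, buffer_pages):
    """Simple case: fully sort both inputs, then read each once to merge."""
    return (2 * m * sort_passes(m, buffer_pages)
            + 2 * n * sort_passes(n, buffer_pages)
            + m + n)


def optimized_sort_merge_join_io(m, n, buffer_pages):
    """3(M+N), valid only if all runs from pass 0 can be merged together."""
    total_runs = math.ceil(m / buffer_pages) + math.ceil(n / buffer_pages)
    if total_runs >= buffer_pages:
        raise ValueError("too many runs to merge and join in the second pass")
    return 3 * (m + n)
```

With M = 1000, N = 500, and B = 102 as example values, the simple case costs 7,500 I/Os and the optimized case costs 4,500, since the 15 initial runs fit in one merge.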
Cost of Hash Join
• Partitioning phase: read+write both relations
⇒ 2(|R|+|S|) I/Os
• Matching phase: read both relations
⇒ |R|+|S| I/Os
• Total cost of 2-pass hash join = 3(|R|+|S|)
Q: what is cost of 2-pass merge-sort join?
Q: how much memory needed for 2-pass sort join?
Q: how much memory needed for 2-pass hash join?
Summary: Hashing vs. Sorting
• Sorting pros:
– Good if input already sorted, or need output sorted
– Not sensitive to data skew or bad hash functions
• Hashing pros:
– Often cheaper due to hybrid hashing
– For join: # passes depends on size of smaller relation
– Highly parallelizable
Review: Database Design
• Requirements Analysis
– user needs; what must database do?
• Conceptual Design
– high level descr (often done w/ER model)
• Logical Design
– translate ER into DBMS data model
• Schema Refinement
– consistency, normalization
• Physical Design - indexes, disk layout
• Security Design - who accesses what
Review: the ER Model
[Figure: ER diagram with entity set Employees (ssn, name, age), relationship set Policy (cost), and entity set Dependents (pname, age)]
• Entities and Entity Set (boxes)
• Relationships and Relationship sets (diamonds)
– binary
– n-ary
• Key constraints (1-1,1-M, M-N, arrows on 1 side)
• Participation constraints (bold for Total)
• Weak entities - require strong entity for key
ISA (‘is a’) Hierarchies
• Attributes are inherited.
• If we declare A ISA B, every A entity is also considered to be a B entity.
[Figure: Employees (ssn, name, lot) with ISA children Hourly_Emps (hourly_wages, hours_worked) and Contract_Emps (contractid)]
• Overlap constraints: Can Simon be an Hourly_Emps as well as
a Contract_Emps entity? (Allowed/disallowed)
• Covering constraints: Does every Employees entity also have
to be an Hourly_Emps or a Contract_Emps entity? (Yes/no)
• Conversions between Relational schema <-> ER Diagram
Review: Functional Dependencies
– Properties of the real world
– Decide when to decompose relations
– Help us find keys
– Help us evaluate Design Tradeoffs
• Want to reduce redundancy, avoid anomalies
• Want reasonable efficiency
• Must avoid lossy decompositions
– F+: closure, all dependencies that can be inferred from a set F
– A+: attribute closure, all attributes functionally determined by
the set of attributes A
– G: minimal cover, smallest set of FDs such that G+ == F+
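Attribute closure is the workhorse behind finding keys and checking normal forms, so a small implementation may help. This is a sketch of the standard fixpoint algorithm; the representation is an assumption made here: each attribute is a single character, and each FD is a `(lhs, rhs)` pair of strings.

```python
def attribute_closure(attrs, fds):
    """Return A+: every attribute functionally determined by attrs."""
    closure = set(attrs)
    changed = True
    while changed:                     # repeat until no FD adds anything new
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def is_superkey(attrs, relation, fds):
    """attrs is a superkey if its closure covers the whole relation."""
    return attribute_closure(attrs, fds) >= set(relation)
```

Using the Hourly_Emps relation SNLRWH with S → SNLRWH and R → W, the closure of R is {R, W}, so R is not a superkey, while S is.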
Problems Due to R → W
Hourly_Emps
S            N          L   R  W   H
123-22-3666  Attishoo   48  8  10  40
231-31-5368  Smiley     22  8  10  30
131-24-3650  Smethurst  35  5  7   30
434-26-3751  Guldu      35  5  7   32
612-67-4134  Madayan    35  8  10  40
• Update anomaly: Can we modify W in only the 1st
tuple of SNLRWH?
• Insertion anomaly: What if we want to insert an
employee and don’t know the hourly wage for his or
her rating? (or we get it wrong?)
• Deletion anomaly: If we delete all employees with
rating 5, we lose the information about the wage for
rating 5!
Review: Normal Forms
• A property of a single relation
• Tells us something about redundancy in reln
• Reln R with FDs F is in BCNF if, for all X → A in F+
– A ∈ X (called a trivial FD), or
– X is a superkey for R.
• Reln R with FDs F is in 3NF if, for all X → A in F+
– A ∈ X (called a trivial FD), or
– X is a superkey of R, or
– A is part of some candidate key (not superkey!) for R.
(sometimes stated as “A is prime”)
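The two definitions differ only in the last 3NF condition, which a short checker makes concrete. The sketch below is illustrative: it assumes single-character attributes and FDs given as `(lhs, rhs)` string pairs, finds candidate keys by brute force (exponential, fine for slide-sized examples), and tests only the FDs listed in F, which is taken here to be sufficient for the relation F is defined over. All names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    keys = []
    attrs = sorted(relation)
    for size in range(1, len(attrs) + 1):      # smallest sets first
        for combo in combinations(attrs, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue                       # contains a key, so not minimal
            if closure(candidate, fds) >= set(relation):
                keys.append(candidate)
    return keys


def violations(relation, fds, normal_form="BCNF"):
    """Return the (X, A) pairs from F that violate BCNF or 3NF."""
    prime = set().union(*candidate_keys(relation, fds))
    bad = []
    for lhs, rhs in fds:
        for a in rhs:
            if a in lhs:
                continue                       # trivial FD
            if closure(lhs, fds) >= set(relation):
                continue                       # X is a superkey
            if normal_form == "3NF" and a in prime:
                continue                       # A is prime
            bad.append((lhs, a))
    return bad
```

For Hourly_Emps (SNLRWH with S → SNLRWH and R → W), the only candidate key is S, and R → W violates both BCNF and 3NF because W is not prime.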
Review: Decomposition
• If reln violates normal form, decompose
– but must have lossless decomposition
• Lossless decomposition:
– decomposition of R into X and Y is lossless if and only if
X ∩ Y is a key for either X or Y
– If W → Z holds over R and (W ∩ Z) is empty, then
decomposition of R into R-Z and WZ is lossless.
• Algorithm:
– For each FD W → Z in R that violates normal form,
decompose R into R-Z and WZ. Repeat as needed.
– Order not important, but can produce very different results
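The lossless test and a single decomposition step can be sketched as follows. This is an illustration, assuming single-character attributes and FDs as `(lhs, rhs)` string pairs; the function names are hypothetical. "X ∩ Y is a key for X or Y" is checked by testing whether the closure of the common attributes covers X or Y.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(x, y, fds):
    """True if the common attributes determine all of X or all of Y."""
    common_closure = closure(set(x) & set(y), fds)
    return set(x) <= common_closure or set(y) <= common_closure


def split_on(relation, lhs, rhs):
    """Decompose R into (R - Z, WZ) for a violating FD W -> Z."""
    z = set(rhs) - set(lhs)            # keep W in both halves
    return set(relation) - z, set(lhs) | z
```

Splitting Hourly_Emps (SNLRWH) on R → W gives SNLRH and RW. The common attribute R determines RW, so the split is lossless. Splitting the same relation into SNL and RWH shares no attributes and is lossy.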
Review: Dependency Preservation
– decompose too much, and it might be necessary to join
tables to check FDs
– decomposition of R into X and Y is dependency preserving if
(F_X ∪ F_Y)+ = F+
• F_X is all FDs in F+ involving only attributes in X
• F_Y is all FDs in F+ involving only attributes in Y
– Not always obvious
• ABC, A → B, B → C, C → A, decomposed into AB and BC.
• Is this dependency preserving? Is C → A preserved?
– note: F+ contains F ∪ {A → C, B → A, C → B}, so…
• F_AB contains A → B and B → A; F_BC contains B → C and C → B
• So, (F_AB ∪ F_BC)+ contains C → A
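The ABC example can be verified mechanically without enumerating F+. The sketch below uses a well-known iterative test: to check X → Y, grow X by repeatedly taking closures restricted to each decomposed relation, and see whether Y is reached. It assumes single-character attributes and FDs as `(lhs, rhs)` string pairs; the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(lhs, rhs, decomposition, fds):
    """Can lhs -> rhs be derived using only FDs local to each relation?"""
    reached = set(lhs)
    changed = True
    while changed:
        changed = False
        for relation in decomposition:
            attrs = set(relation)
            # Attributes of this relation implied by what we have reached in it.
            gained = closure(reached & attrs, fds) & attrs
            if not gained <= reached:
                reached |= gained
                changed = True
    return set(rhs) <= reached


def is_dependency_preserving(decomposition, fds):
    """True if every FD in F is preserved by the decomposition."""
    return all(is_preserved(lhs, rhs, decomposition, fds) for lhs, rhs in fds)
```

For F = {A → B, B → C, C → A} and the decomposition into AB and BC, the check starts from C, gains B inside BC (via C → B), then gains A inside AB (via B → A), so C → A is preserved.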
Minimal Cover for a Set of FDs
• G: minimal cover, smallest set of FDs such that G+ == F+
– Closure of F = closure of G.
– Right hand side of each FD in G is a single attribute.
– If we modify G by deleting an FD or by deleting attributes from an FD
in G, the closure changes.
• Every FD in G is needed, and ``as small as possible’’ in order to
get the same closure as F.
• e.g., F+ = {A → B, B → C, C → A, B → A, C → B, A → C}
– several minimal covers: {A → B, B → A, C → B, B → C} (AB + BC)
– or {A → C, C → A, B → C, C → B} (AC + BC)
– or {A → B, B → A, C → A, A → C} (AB + AC)
• e.g., A → B, ABCD → E, EF → GH, ACDF → EG minimal cover:
– A → B, ACD → E, EF → G and EF → H
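One common way to compute a minimal cover is in three steps: split right-hand sides, drop extraneous left-hand attributes, then drop redundant FDs. The sketch below follows that recipe, assuming single-character attributes and FDs as `(lhs, rhs)` string pairs; the function names are hypothetical. Since minimal covers are not unique, the result can depend on the order in which FDs and attributes are examined.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    """Return a minimal cover as a list of (frozenset lhs, attribute) pairs."""
    # Step 1: one attribute on each right-hand side.
    g = [(frozenset(lhs), a) for lhs, rhs in fds for a in rhs]
    # Step 2: remove left-hand attributes that are not needed.
    for i, (lhs, a) in enumerate(g):
        for x in sorted(lhs):
            reduced = lhs - {x}
            if reduced and a in closure(reduced, g):
                lhs = reduced
                g[i] = (lhs, a)
    g = list(dict.fromkeys(g))         # drop exact duplicates, keep order
    # Step 3: remove FDs implied by the remaining ones.
    i = 0
    while i < len(g):
        lhs, a = g[i]
        rest = g[:i] + g[i + 1:]
        if a in closure(lhs, rest):
            g = rest
        else:
            i += 1
    return g
```

On the second example above (A → B, ABCD → E, EF → GH, ACDF → EG), step 2 shrinks ABCD → E and ACDF → E to ACD → E, and step 3 drops ACDF → G because ACD → E and EF → G already imply it, leaving A → B, ACD → E, EF → G, and EF → H.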
BCNF and Dependency Preservation
• In general, there may not be a dependency preserving
decomposition into BCNF.
• But, you can always find dependency-preserving decomposition
into 3NF
– Top down:
• Decompose until it is in 3NF
• Compute minimal cover for FDs
• If minimal cover contains an FD X → Y that is not preserved, add reln XY
– Bottom up:
• Compute minimal cover
• For each FD X → Y in minimal cover, create reln XY
– Why does this work? Minimal cover doesn’t include redundant
transitive dependencies, which don’t need to be preserved
Questions?
Thank you!
• See you next Wednesday.