Transcript Document

Topics 9-10: Database
optimization for modern machines
Computer architecture is changing over
the years, so we have to change our
programmes, too!
Most database operators, indexes,
buffering techniques and storage
schemes were developed and optimized
over 20 years ago, so we need to retune them for the modern reality.
Dr. N. Mamoulis
Advanced Database Technologies
1
Moore’s Law – Dictionary Term
http://info.astrian.net/jargon/terms/m/moore_s_law.html
Moore's Law
/morz law/ prov. The observation that the
logic density of silicon integrated circuits has closely followed
the curve (bits per square inch) = 2(t - 1962) where t is time in
years; that is, the amount of information storable on a given
amount of silicon has roughly doubled every year since the
technology was invented. This relation, first uttered in 1964
by semiconductor engineer Gordon Moore (who co-founded
Intel four years later) held until the late 1970s, at which point
the doubling period slowed to 18 months. The doubling
period remained at that value through time of writing (late
1999). Moore's Law is apparently self-fulfilling. The
implication is that somebody, somewhere is going to be able
to build a better chip than you if you rest on your laurels, so
you'd better start pushing hard on the problem.
Dr. N. Mamoulis
Advanced Database Technologies
2
Features of Modern Machines and
Future Trends
CPU speed, memory bandwidth, and disk
bandwidth, follow Moore’s Law. On the
other hand, memory latency improves by
only 1% per year.
Memories are becoming larger and slower
relatively to memory bandwidth and CPU
speed.


Most of the time the processor of your PC is idle,
waiting for data to be fetched from memory.
The role of main memory caches is becoming
very important in the overall performance.
Dr. N. Mamoulis
Advanced Database Technologies
3
Features of Modern Machines and
Future Trends (cont’d)
Disk bandwidth also follows Moore’s
Law.
On the other hand, disk seek time
improves by only 10% per year.
Thus a random disk access today may
cost as much as 20-50 sequential
accesses.
Dr. N. Mamoulis
Advanced Database Technologies
4
Features of Modern Machines and
Future Trends (cont’d)
Another characteristic of modern
machines is that their processors
(AMD Athlon, Intel P4), have parallel
processing units able to process at the
same time multiple (e.g., 5-9)
independent instructions.
Dr. N. Mamoulis
Advanced Database Technologies
5
The new bottleneck: Memory
Access

Memory is now hierarchical: two levels of caching
L1 cache
CPU die
CPU
L2 cache
L1 cache-line
L2 cache-line
Memory page
Main memory

Memory-latency: the time needed to transfer 1 byte from the main memory
to the L2 cache.

Cache (L2) miss: if the requested data is not in cache and needs to be
fetched from main memory
Cache-line: The transfer unit from main memory to cache (e.g., L2 cacheline = 128 bytes)
Dr. N. Mamoulis
Advanced Database Technologies
6
Example of Memory Latency
Effects in Databases
Employee(id: 4 bytes, name: 20 bytes,
age: 4 bytes, gender: 1 byte)
id
12000
12305
12503
13454
53959
92123
name
Mike Chan
George Best
Gina Gillian
John Smith
Sarah Croft
James White
age
30
27
45
39
53
57
gender
M
M
F
M
F
M
Disk/Memory representation:
12000 Mike Chan 30 M 12305 George Best 27 M ...
29 bytes
Dr. N. Mamoulis
29 bytes
Advanced Database Technologies
7
Example of Memory Latency
Effects in Databases (cont’d)
Example query: how many girls work
for the company?

SELECT COUNT(*)
FROM EMPLOYEE
WHERE gender = “F”;
What is the cost of this query, assuming
that the relation is in main memory?
We have to access the gender values,
compare them to “F” and add 1 (or 0).
Dr. N. Mamoulis
Advanced Database Technologies
8
Example of Memory Latency
Effects in Databases (cont’d)
Accessing a specific address from
memory loads also the information
around it in the cache-line (in parallel
over a wide bus).
Example: Accessing the gender of the
first tuple:
cache-line
(128 bytes)
Cache:
MMem: 12000 Mike Chan 30 M 12305 George Best 27 M ...
29 bytes
Dr. N. Mamoulis
29 bytes
Advanced Database Technologies
9
Example of Memory Latency
Effects in Databases (cont’d)
Thus a cache-line at a time is accessed
from main memory.
If the requested information is already
in cache (e.g., the gender of the second
tuple), main memory is not accessed.
Otherwise, a cache-miss occurs.
Dr. N. Mamoulis
Advanced Database Technologies
10
Effect of cache misses
The stride.c program demonstrates how
cache misses may dominate the query
processing cost.
http://www.csis.hku.hk/~nikos/courses/
CSIS7101/stride.c
initialize a random memory buffer;
sum = 0;
ilimit = 200000; //tentative
for(i=0; i<ilimit; i+=stride)
sum += buffer[i];
printf("sum=%d\n",sum);
Dr. N. Mamoulis
Advanced Database Technologies
+
stride=5
11
Effect of cache misses (cont’d)
Dr. N. Mamoulis
Advanced Database Technologies
12
Re-designing the DBMS
We need to redesign the DBMS in order to face the
new bottleneck: memory access
New storage schemes are proposed to minimize
memory and disk accesses.
Query processing techniques are redesigned to take
under consideration the memory-latency effects.
Algorithms are changed to take advantage of the
intraparallelism of instruction execution.
New instruction types are used (e.g., SIMD
instructions).
Dr. N. Mamoulis
Advanced Database Technologies
13
Problems in optimizing the DBMS
Programming languages do not have control over the
replacement policy of memory caches. These are
determined by the hardware.
As a result, we cannot apply buffer management
techniques for memory cache management.
Also, simple instructions generated by programming
languages cannot control the parallel instruction
execution capabilities of the machine.
In many cases, we re-write the programs, trying to
“fool” the page replacement policy, and the
instruction execution in order to make them more
efficient.
Dr. N. Mamoulis
Advanced Database Technologies
14
The N-ary storage model
The N-ary storage model (NSM) stores the
information for each tuple as a sequence of bytes.
id
12000
12305
12503
13454
53959
92123
name
Mike Chan
George Best
Gina Gillian
John Smith
Sarah Croft
James White
age
30
27
45
39
53
57
gender
M
M
F
M
F
M
Disk/Memory representation:
12000 Mike Chan 30 M 12305 George Best 27 M ...
29 bytes
Dr. N. Mamoulis
29 bytes
Advanced Database Technologies
15
A Decomposition Storage Model
(DSM)
The table is vertically decomposed, and information
for each attribute is stored sequentially.
a1
id
name
12000 Mike Chan
12305 George Best
12503 Gina Gillian
13454 John Smith
53959 Sarah Croft
92123 James White
id
12000
12305
12503
13454
53959
92123
12000 Mike Chan 12305 George Best ...
Dr. N. Mamoulis
a2
name
Mike Chan
George Best
Gina Gillian
John Smith
Sarah Croft
James W hite
id
12000
12305
12503
13454
53959
92123
age
30
27
45
39
53
57
age
30
27
45
39
53
57
a3
gender
M
M
F
M
F
M
id
12000
12305
12503
13454
53959
92123
12000 30 12305 27 ...
Advanced Database Technologies
gender
M
M
F
M
F
M
12000 M 12305 M ...
16
Properties of DSM
The relation is decomposed into many binary tables,
one for each attribute.
The attributes of a binary table are a surrogate id
and the attribute from the original relation.
The surrogate-id is necessary to bring back together
information for a tuple, by joining the binary tables.
Example: Print the information for tuple with
id=12305
NSM
DSM
SELECT *
FROM EMPLOYEE
WHERE id=12305
Dr. N. Mamoulis
SELECT id,name,age,gender
FROM a1,a2,a3
WHERE a1.id=12305 AND
a2.id = a1.id AND
a3.id = a1.id 17
Advanced Database Technologies
Properties of DSM (cont’d)
Advantages of DSM

If the relation has many attributes, but queries involve only
few attributes, then:
 Much less information is read from disk, than in NSM
 Cache misses are fewer than in NSM, because the stride is
smaller

Projection queries are very fast
Disadvantages of DSM


If a tuple needs to be reconstructed, this will require many
joins.
The size of the decomposed relation is larger than the
original due to the replication of the surrogate key.
Dr. N. Mamoulis
Advanced Database Technologies
18
A Decomposition Model with
Unary tables (Monet)
not materialized
void id
1 12000
2 12305
3 12503
4 13454
5 53959
6 92123
id
12000
12305
12503
13454
53959
92123
void name
1 Mike Chan
2 George Best
3 Gina Gillian
4 John Smith
5 Sarah Croft
6 James White
name
Mike Chan
George Best
Gina Gillian
John Smith
Sarah Croft
James W hite
void
Given the surrogate sid of a tuple,
we can compute its attribute value v by:
v = *(table_address+(sid - first_surrogate)*size)
Dr. N. Mamoulis
1
2
3
4
5
6
age
30
27
45
39
53
57
age
30
27
45
39
53
57
gender
M
M
F
M
F
M
void
1
2
3
4
5
6
gender
M
M
F
M
F
M
1 1 M F ...
value of first
surrogate
Advanced Database Technologies
size of attribute
type in bytes
19
Partition Attributes Across
Attempts to combine advantages of both NSM
and DSM, while avoiding their disadvantages.
The data are split to pages, like in NSM, but
the organization in each page is like DSM.
Thus:


Cache misses are minimized because the
information for a specific attribute is stored
compactly in the page.
Record reconstruction cost is low, since the tuples
are actually stored like in NSM on disk, but
vertically partitioned in each page, where the
reconstruction cost is minimal.
Dr. N. Mamoulis
Advanced Database Technologies
20
How do pages look in each
schema?
NSM
PAGE HEADER 12000
Mike Chan 30 M 12305
George Best 27 M ...
...
DSM
PAGE HEADER 12000
Mike Chan 12305 George
Best
...
...
PAGE HEADER
PAGE HEADER 54647
John Kit 33 M 94765
...
Flora Ho 27 F
PAX
...
PAGE HEADER 12000 30
12305 27 54647 33
94765 27 ... ...
...
PAGE HEADER 12000
12305 ...
Mike Chan George Best
...
30 27 ...
PAGE HEADER 54647
94765 ...
John Kit
PAGE HEADER
...
Flora Ho
...
33 27 ...
...
many tables
Dr. N. Mamoulis
Advanced Database Technologies
21
Comparison between PAX and NSM
NSM
PAGE HEADER 12000
Mike Chan 30 M 12305
George Best 27 M ...
Cache
12000 Mike Chan 30 M
...
SELECT AVG(age)
FROM EMPLOYEE
WHERE id<20000
irrelevant data are
fetched to cache
PAX
PAGE HEADER 12000
12305 ...
Mike Chan George Best
...
30 27 ...
Dr. N. Mamoulis
Advanced Database Technologies
Cache
12000 12305 ...
30 27 ...
relevant data are
fetched to cache
22
Comparison between PAX and DSM
DSM
Cache
PAGE HEADER 12000
Mike Chan 12305 George
Best
...
...
SELECT AVG(age)
FROM EMPLOYEE
WHERE name=“M*”
12000 Mike Chan 12305
George ...
PAGE HEADER 12000 30
12305 27 54647 33
94765 27 ... ...
PAX
PAGE HEADER 12000
12305 ...
Mike Chan George Best
...
30 27 ...
qualifying ids have
to be joined with
other pages
12000
...
join
Cache
Mike Chan George ...
12000 ...
30 ...
join is avoided
Dr. N. Mamoulis
Advanced Database Technologies
23
Is PAX always better than DSM?
NO. If the relation has many attributes (e.g., 10) and the query
involves only few (e.g., 2) then the join may be more beneficial, than
reading whole tuples in memory.
Example: R(a1,a2,a3, ..., a10)
Each attribute is 4 bytes long.
DSM: D1(id,a1), D2(id,a2),...,D10(id,a10).
Query:
SELECT avg(a2)
FROM R
WHERE a1=x;
Query selectivity is very high 1%. The qualifying ids can fit in memory
PAX has to read the whole table=40bytes|R|
DSM has to read D1, apply the query, and create an intermediate table
X with qualifying ids (in memory). Then it has to read D2 to get the
qualifying a2 that join with X. So the total bytes DSM reads from disk
are 16bytes|R|.
Dr. N. Mamoulis
Advanced Database Technologies
24
Summary
In modern computer architectures the bottleneck is
memory access. In many cases the processor is
waiting for data to be fetched from memory.
Memory access is no longer “random”. When we
access a location in disk then data around it are
loaded to a fast memory chip (cache). Access locality
is very important.
Database operators, storage and buffering schemes,
indexes are optimized for the reduction of cache
misses.
Dr. N. Mamoulis
Advanced Database Technologies
25
References
George P. Copeland, Setrag Khoshafian, A
Decomposition Storage Model. SIGMOD, 1985.
Peter A. Boncz, Stefan Manegold, Martin L. Kersten,
Database Architecture Optimized for the New
Bottleneck: Memory Access. VLDB, 1999.
Anastassia Ailamaki, David J. DeWitt, Mark D. Hill,
Marios Skounakis: Weaving Relations for Cache
Performance, VLDB, 2001.
P. A. Boncz. Monet: A Next-Generation DBMS Kernel
For Query-Intensive Applications. PhD dissertation.
Universiteit van Amsterdam, May 2002.
Dr. N. Mamoulis
Advanced Database Technologies
26