Database Systems on Modern Processors
Kenneth Ross
Columbia University
(also visiting IBM T. J. Watson Research Center)
EDGE Workshop, May 2006
Talk Outline
Intro – what is a DBMS and what are its performance goals?
Some history.
How DBMS researchers are dealing with changes in architecture.
The future.
Database Management Systems
Store and manage large data collections
Features
– Concurrency control
– Persistence, Recovery
– Integrity (constraints)
– Declarative Queries
– Transactions (all or nothing)
– Distribution
– …
Database Management Systems
Focus on Relational DBMS
– IBM DB2, Oracle, Sybase, Microsoft SQL Server, MySQL, Postgres, …
– Multi-billion dollar market
DBMS servers are a major use of modern computers
– DBMS architects and HW architects care about performance of DB workloads
Performance Scalability
Transaction processing
– Keep up with very high transaction rates
– Millions of transactions/minute
– Single-row updates with locking
– Benchmark: TPC-C (www.tpc.org)
Query Processing
– Data warehouses with terabytes of data
– Complex queries over many tables
– Large data volumes: throughput
– Benchmark: TPC-H
Query Processing 101
Each query has many potential execution plans
– Plans are logically equivalent, but differ in speed
Query optimizer explores a space of plans
– Chooses the plan with lowest cost (i.e., time)
Cost model: I/O, CPU, …
– Each operator has a cost function (illustrated below)
Costs are estimated based on
– operators used, input sizes, expected selectivity of predicates, and other DB statistics
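To make the per-operator cost function concrete, here is a minimal C sketch (not from the talk; the constants, names, and weighting are all hypothetical) of what a cost estimate for a table scan might look like:

/* Hypothetical cost function for a table scan: an I/O term plus CPU
   terms, scaled by estimated input size and predicate selectivity. */
double scan_cost(double pages, double tuples, double selectivity)
{
    double io_cost  = pages * 1.0;                   /* sequential page reads */
    double cpu_cost = tuples * 0.01                  /* per-tuple predicate eval */
                    + tuples * selectivity * 0.005;  /* per-result handling */
    return io_cost + cpu_cost;
}

An optimizer typically sums such per-operator estimates over each candidate plan and keeps the cheapest.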
Talk Outline
Intro – what is a DBMS and what are its performance goals?
Some history.
How DBMS researchers are dealing with changes in architecture.
The future.
A Bit of History
Database Machines (1980s)
– Build specialized processors for databases
– Associative memory, SIMD/MIMD, etc.
Major failure
– By the time the hardware had been built, commodity hardware had caught up
– Advances in software (e.g., better algorithms)
Lesson: use commodity hardware!
– Scope of “commodity” has changed between 1980 and 2006.
Early-Mid 1990s
Move to shared-nothing parallel systems
– Database workloads typically partition well
– Predictable performance with minimal interference between processors
Good scale-up and speed-up
– Just add more resources
– Embarrassing parallelism
I/O is the bottleneck
Late 1990s
Disk speed/capacity, memory capacity, and processor speed all increase
– But not memory latency
Recognition that these improvements were changing the performance profile of a DBMS
– I/O is no longer the bottleneck
– L2 data cache misses and L1 instruction cache misses are significant [Ailamaki et al 99]
– 2 to 5 times the CPI of SPECInt workloads
What to do about it?
Build a new DBMS from scratch?
– Major investments in existing infrastructure.
– Performance is only one of many goals.
– Research prototypes, e.g., Monet [CWI].
Try to adapt existing systems.
– Constrains options, but improvements are attainable without re-implementing everything.
– Be configurable to available hardware.
– Focus on “inner loops” of the critical paths.
Talk Outline
Intro – what is a DBMS and what are its performance goals?
Some history.
How DBMS researchers are dealing with changes in architecture.
The future.
Dealing with Modern Architectures
Data & instruction cache performance
– Better utilization of space in each cache line
– Schedule work to efficiently utilize data in each cache line
– Chunk work into units processed by L1 cache-sized code segments
– Prefetching
Branches
– Take mispredictions into account when estimating query cost
– Avoid branches to get better CPI
SIMD
– Speedup due to parallelism and absence of branch misprediction
TLB, GPU, SMT, CAM, etc.
Better Cache-line Utilization I
B+-Tree node: P0 K1 P1 K2 P2 K3 P3 (one pointer per child; 4 children)
CSB+-Tree node [Rao & Ross 2000]: K1 K2 K3 K4 K5 K6 K7 P (one pointer to 8 contiguous children; sketched in C below)
[Chart: elapsed time in seconds (0.4–1.1) vs. proportion of searches, for B+-Tree, CSB+-Tree, and Full CSB+-Tree on 4 million leaf entries. “Full” means space for all possible children is allocated in advance.]
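The layout difference can be sketched in C as follows. This is an illustrative reconstruction (field names and sizes are mine), not the paper’s exact structure, which additionally segments child groups to make insertion cheap:

/* B+-Tree node: one pointer per child, so pointers consume roughly
   half the node and fewer keys fit per cache line. */
struct bptree_node {
    int nkeys;
    int keys[3];                    /* 4 children => 3 keys */
    struct bptree_node *child[4];
};

/* CSB+-Tree node [Rao & Ross 2000]: all children are allocated
   contiguously, so one pointer reaches them all and the freed space
   holds more keys (7 keys => 8 children) per cache line. */
struct csb_node {
    int nkeys;
    int keys[7];
    struct csb_node *children;      /* child i is &children[i] */
};

Because siblings are adjacent in memory, descending the tree compares against more keys per cache line fetched.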
Better Cache-line Utilization II
Columnwise storage, e.g., Monet [CWI]
– Only process the columns you need
PAX [Ailamaki et al 2001] (sketched below)
– Each column stored columnwise within disk page
– Good cache behavior for queries
– Good I/O behavior for row-access
Fractured Mirrors [Ramamurthy et al 2002]
– Best of both worlds with two data copies
CSM [Shao et al 2004]
– In-memory copy keeps just part of disk page
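A PAX page can be pictured with a small C sketch; the attribute names, types, and page size below are illustrative, not from the paper:

/* PAX-style page [Ailamaki et al 2001]: the page’s rows are stored
   column-by-column in per-attribute "minipages", so scanning one
   attribute reads contiguous cache lines, while the full rows still
   live in one disk page (good I/O for row access). */
#define PAGE_ROWS 1000

struct pax_page {
    int    nrows;
    int    orderkey[PAGE_ROWS];   /* minipage for attribute 1 */
    float  price[PAGE_ROWS];      /* minipage for attribute 2 */
    short  quantity[PAGE_ROWS];   /* minipage for attribute 3 */
};

/* A single-column scan touches only the price minipage. */
double sum_price(const struct pax_page *p)
{
    double s = 0.0;
    for (int i = 0; i < p->nrows; i++)
        s += p->price[i];
    return s;
}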
Consecutive Lookups to an Index [Zhou & Ross 2003]
[Diagram, shown in three stages: a tree index with root A, internal nodes B and C, and leaves D, E, F holding keys and RIDs r1…r8. A probe stream (2 11 1 20 9 19 3 5) arrives in random key order. The L2 cache holds only up to 3 nodes, so consecutive probes to different subtrees keep evicting each other’s nodes: cache thrashing!]
Buffering Accesses to the Index
[Diagram: the same tree, now with a buffer of pending probe keys attached to each subtree below the root (buffered item counts such as 300, 249, 200, 100 at nodes B through F). Incoming probes are appended to the buffer of the subtree they fall into; each buffer is drained while its subtree is resident in the L2 cache, and results are emitted as buffers drain.]
Impact of Buffering
Increases temporal and spatial locality
Overhead: extra copy of data into buffers
Key idea: buffer units of L2 cache size or less (a code sketch follows the chart)
– several tree levels if fan-out is small
[Chart: elapsed time in seconds (Pentium 4, 5M keys, 5M probes) for B+-Tree, CSB+-Tree, and R-Tree, comparing conventional vs. buffered lookup.]
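The buffering idea of the preceding slides can be sketched as follows. This is a minimal reconstruction in the spirit of [Zhou & Ross 2003], not the paper’s algorithm: the node layout, buffer sizing, and helper names are mine, buffering happens only at the root, and results come back grouped by subtree rather than in probe order.

#define FANOUT 8
#define NBUF   4096

struct node {
    int is_leaf, nkeys;
    int keys[FANOUT - 1];
    struct node *child[FANOUT];   /* used by internal nodes */
    int rids[FANOUT - 1];         /* leaf payload (record ids) */
};

static int slot_of(const struct node *n, int key)
{
    int s = 0;
    while (s < n->nkeys && key >= n->keys[s]) s++;   /* linear key search */
    return s;
}

static int lookup(const struct node *n, int key)     /* plain root-to-leaf probe */
{
    while (!n->is_leaf) n = n->child[slot_of(n, key)];
    int s = slot_of(n, key);
    return (s > 0 && n->keys[s - 1] == key) ? n->rids[s - 1] : -1;
}

void buffered_probes(const struct node *root, const int *keys, int n, int *rids)
{
    static int buf[FANOUT][NBUF];
    int len[FANOUT] = {0}, out = 0;

    /* Pass 1: route each probe through the root once and park it in the
       buffer of the child subtree it falls into (n <= NBUF assumed). */
    for (int i = 0; i < n; i++) {
        int c = slot_of(root, keys[i]);
        buf[c][len[c]++] = keys[i];
    }
    /* Pass 2: drain one buffer at a time, so each (cache-sized) subtree
       serves many consecutive probes while its nodes stay resident. */
    for (int c = 0; c <= root->nkeys; c++)
        for (int i = 0; i < len[c]; i++)
            rids[out++] = lookup(root->child[c], buf[c][i]);
}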
Conventional Demand-Driven Pipelined Execution Model [Zhou & Ross 2004]
Support open-next-close iterator interface
An operator returns control immediately after generating one tuple
[Diagram: a parent operator calls GetNext() on its child once per tuple; the two operators’ instructions compete for the L1 I-cache and control ping-pongs between them: cache thrashing!]
Poor instruction locality
Poor branch prediction accuracy
Solution: buffering! The execution order ABABABA… becomes ABBBBBAAAAAB… (sketch below)
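A buffering operator of this kind might look like the following minimal sketch; the iterator interface, type names, and buffer size are illustrative, not the paper’s code.

#define BUF_SIZE 1024   /* chosen so operator code plus buffer fit in cache */

typedef struct { /* ... attribute values ... */ int dummy; } Tuple;

typedef struct Operator {
    int (*next)(struct Operator *self, Tuple *out);  /* 1 = tuple, 0 = EOF */
    struct Operator *child;
    Tuple buf[BUF_SIZE];
    int   pos, len;
} Operator;

static int buffered_next(Operator *self, Tuple *out)
{
    if (self->pos == self->len) {               /* buffer empty: refill */
        self->len = 0;
        while (self->len < BUF_SIZE &&
               self->child->next(self->child, &self->buf[self->len]))
            self->len++;                        /* run the child many times in a row */
        self->pos = 0;
        if (self->len == 0) return 0;           /* child exhausted */
    }
    *out = self->buf[self->pos++];              /* hand out buffered tuples */
    return 1;
}

Wrapping a child in such an operator makes the child’s next() run up to BUF_SIZE times consecutively, so its instructions stay resident in the L1 I-cache across calls instead of alternating with the parent’s.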
Validation on Postgres
Query (from TPC-H):

SELECT SUM(l_extendedprice * (1 - l_discount) *
           (1 + l_tax)) AS sum_charge,
       AVG(l_quantity) AS avg_qty,
       COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-11-01'

Original query plan: TableScan(lineitem) → Agg(sum, avg, count).
Combined instruction footprint: 23K > 16K L1 I-cache.
Buffered query plan: TableScan(lineitem) → Buffer → Agg(sum, avg, count).
[Chart: elapsed time in seconds for the two plans, broken into L1 instruction cache miss penalty, L2 cache miss penalty, branch misprediction penalty, and other cost.]
The buffered plan is 12% faster and reduces L1 instruction cache misses by 80%.
Steps: Cache-resident Transaction Processing [Harizopoulos et al 2004]
Group similar transactions
– Reuse L1-cache resident instructions for many transactions
– Fast context switch
– 40% speedup for TPC-C workloads
Prefetching
Rely on hardware prefetching
– Use predictable (e.g., sequential) access patterns
Explicit prefetching
– Child/sibling cache lines in tree indexes [Chen et al 2001, 2002] (sketched below)
– Staged access to data [Harizopoulos & Ailamaki 2003]
Work-ahead set [Zhou et al 2005] [EDGE poster]
– Use another SMT thread to manage memory access
– Load rather than prefetch (which may be dropped)
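For the tree-index case, explicit prefetching can be sketched like this. It is a hedged illustration in the spirit of [Chen et al 2001], with an invented node layout, using the GCC/Clang __builtin_prefetch intrinsic:

#define FANOUT 8

struct node {
    int is_leaf, nkeys;
    int keys[FANOUT - 1];
    struct node *child[FANOUT];
};

/* Descend toward the leaf that may hold `key`, issuing prefetches for
   every child of the current node before searching its keys, so the
   chosen child's cache lines are (likely) resident when we follow the
   pointer. */
static struct node *descend(struct node *n, int key)
{
    while (!n->is_leaf) {
        for (int i = 0; i <= n->nkeys; i++)
            __builtin_prefetch(n->child[i], /*rw=*/0, /*locality=*/1);
        int slot = 0;
        while (slot < n->nkeys && key >= n->keys[slot])
            slot++;                          /* linear key search */
        n = n->child[slot];
    }
    return n;
}

The prefetches overlap the child’s memory latency with the key comparisons in the current node, at the cost of fetching some children that will not be visited.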
Branch Mispredictions
Conjunctive Selections [Ross 2002]

for (i = 0; i < number_of_records; i++) {
    if (/* all conditions are true, i.e., equal to 1, for record i */)
        answer[j++] = i;
}
Code Alternatives (I)

if (f1(R[i]) && f2(R[i])) {…}

versus

if (f1(R[i]) & f2(R[i])) {…}

The && version does less work if f1 is selective.
But… the && version has two conditional branches while the & version has one.
Code Alternatives (II)
Version with no conditional branches:

for (i = 0; i < number_of_records; i++) {
    answer[j] = i;
    j += (f1(R[i]) & f2(R[i]));
}
[Chart: performance of the code alternatives for a 4-way conjunction with simple f1, f2, ….]
Query Optimization
Large space of plans, e.g.,

for (i = 0; i < number_of_records; i++) {
    if ((f1(R[i]) & f2(R[i])) && f3(R[i])) {
        answer[j] = i;
        j += f4(R[i]);
    }
}

The optimizer’s cost function must include a term for the branch misprediction penalty.
Branches and CPI
Branches create control dependencies that limit instruction-level parallelism on superscalar CPUs
– Avoid dependencies to get better CPI
Compression/decompression [Zukowski et al 2006]
– Branch-free decompression algorithm (sketched below)
– Decompress at 2-6 GB/sec to get good data bandwidth without hundreds of disks
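A minimal sketch of what “branch-free decompression” means, assuming a simple frame-of-reference scheme; the actual format in [Zukowski et al 2006] is more elaborate, with separate exception patching:

/* Decode n values stored as 8-bit offsets from a frame base. The loop
   contains no branch that depends on the data values, so it never
   mispredicts and the superscalar pipeline stays full. */
void decode_frame(const unsigned char *in, int *out, int n, int base)
{
    for (int i = 0; i < n; i++)
        out[i] = base + in[i];
}
/* Values too large for 8 bits would be patched in afterwards from a
   separate exception list, keeping this hot loop branch-free. */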
SIMD on Modern Processors
E.g., SSE on Pentium
Comparisons generate masks
– Convert if-then-else control dependencies to data dependencies (see the sketch below)
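As an illustration of masks replacing branches, here is a hedged SSE2 sketch (intrinsics as in <emmintrin.h>; the query shape anticipates the next slide, and the bound handling ignores integer-overflow corner cases):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sum y[i] over all i with a <= x[i] <= b, four records per iteration,
   with no branch on the predicate outcome: each comparison produces an
   all-ones or all-zeros lane mask, which is ANDed with the data. */
int simd_sum(const int *x, const int *y, int n, int a, int b)
{
    __m128i lo  = _mm_set1_epi32(a - 1);    /* cmpgt is strict: x > a-1 */
    __m128i hi  = _mm_set1_epi32(b + 1);    /*                  b+1 > x */
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128i xs   = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i ys   = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i mask = _mm_and_si128(_mm_cmpgt_epi32(xs, lo),
                                     _mm_cmpgt_epi32(hi, xs));
        acc = _mm_add_epi32(acc, _mm_and_si128(mask, ys));  /* masked add */
    }
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];  /* a scalar tail loop
        for the last n % 4 records is omitted for brevity */
}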
Benefits of SIMD for DBs [Zhou & Ross 2002]
SELECT Agg(R.y) FROM R WHERE R.x BETWEEN a AND b
Table with 1 million records; predicate selectivity 0.2
[Chart: elapsed time in milliseconds (0–25), split into branch misprediction penalty and other cost, for SUM, COUNT, MAX, and MIN with and without SIMD; the SIMD variants avoid the branch misprediction penalty.]
Additional work
Efficient SMT parallelism for single operators [Zhou et al 2005, Garcia et al 2006]
TLB optimization
– More passes over data to avoid TLB thrashing (e.g., data partitioning) [Manegold et al 2004]
Use of other commodity hardware
– Use of co-processors (e.g., GPU)
– Fast sorting [Govindaraju et al 2006]
– Spatial query primitives [Bandi et al 2004]
– Content-addressable memory (CAM) [Bandi et al 2005]
Talk Outline
Intro – what is a DBMS and what are its performance goals?
Some history.
How DBMS researchers are dealing with changes in architecture.
The future.
Future Challenges I
Multi-core, more threads per core
DBs already parallel (e.g., Oracle will run on Niagara) but…
– Cache interference (or underutilization) on each chip
– Bandwidth in/out of the chip now a bottleneck
– Speculative prefetching a bad idea
Flipside: no cache coherence problem with a shared L2 cache
– Move away from complete shared-nothing?
New trade-offs
– Need to model on-chip and off-chip communication costs
Future Challenges II
Making use of multiple HW components
– CPU+GPU
– PPE+SPEs on Cell
– Various kinds of accelerators (commodity!)
Best use of whole system
– Configurability, tunability
– Programmability
– Cost models
Conclusions
DBMS software is changing.
Evolution and revolution in response to new hardware.
More changes ahead for multi-core systems
– Existing parallel design is a head start
What’s the best HW mix for databases?
– E.g., is a GPU a server component?