Transcript Title

Database Systems on Modern Processors
Kenneth Ross
Columbia University
(also visiting IBM T. J. Watson Research Center)
EDGE Workshop, May 2006
Talk Outline
 Intro – what is a DBMS and what are its performance goals?
 Some history.
 How DBMS researchers are dealing with changes in architecture.
 The future.
Database Management Systems
 Store and manage large data collections
 Features
 – Concurrency control
 – Persistence, Recovery
 – Integrity (constraints)
 – Declarative Queries
 – Transactions (all or nothing)
 – Distribution
 – …
Database Management Systems
 Focus on Relational DBMS
 – IBM DB2, Oracle, Sybase, Microsoft SQL Server, MySQL, Postgres, …
 Multi-billion dollar market
 DBMS servers are a major use of modern computers
 – DBMS architects and HW architects care about performance of DB workloads
Performance Scalability
 Transaction processing
 – Keep up with very high transaction rates
 – Millions of transactions/minute
 – Single row updates with locking
 – Benchmark: TPC-C (www.tpc.org)
 Query Processing
 – Data warehouses with terabytes of data
 – Complex queries over many tables
 – Large data volumes: throughput
 – Benchmark: TPC-H
Query Processing 101
 Each query has many potential execution plans
 – Plans are logically equivalent, but differ in speed
 Query optimizer explores a space of plans
 – Chooses the plan with lowest cost (i.e., time)
 Cost model: I/O, CPU, …
 – Each operator has a cost function
 Costs are estimated based on
 – operators used, input sizes, expected selectivity of predicates, other DB statistics.
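To make the cost-model idea concrete, here is a minimal, illustrative cost function for a single scan operator in C; the constants, field names, and formula are invented for this sketch and are not taken from any particular optimizer.

#include <stdio.h>

/* Hypothetical per-operator statistics; real optimizers track far more. */
typedef struct {
    double pages;        /* pages the operator must read            */
    double rows;         /* input rows                              */
    double selectivity;  /* estimated fraction of rows that qualify */
} ScanStats;

/* cost = I/O term + CPU term, in arbitrary relative units */
double scan_cost(ScanStats s)
{
    const double IO_COST_PER_PAGE = 1.0;
    const double CPU_COST_PER_ROW = 0.01;
    return s.pages * IO_COST_PER_PAGE
         + s.rows  * s.selectivity * CPU_COST_PER_ROW;
}

int main(void)
{
    ScanStats full_scan  = { 10000, 1e6, 1.0  };
    ScanStats index_scan = {   500, 1e6, 0.01 };
    /* the optimizer keeps whichever plan the model says is cheaper */
    printf("pick %s\n",
           scan_cost(index_scan) < scan_cost(full_scan) ? "index scan"
                                                        : "full scan");
    return 0;
}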
Talk Outline
 Intro – what is a DBMS and what are its performance goals?
 Some history.
 How DBMS researchers are dealing with changes in architecture.
 The future.
A Bit of History
 Database Machines (1980s)
 – Build specialized processors for databases
 – Associative memory, SIMD/MIMD, etc.
 Major failure
 – By the time hardware had been built, commodity hardware had caught up
 – Advances in software (e.g., better algorithms)
 Lesson: use commodity hardware!
 – The scope of “commodity” has changed between 1980 and 2006.
Early-Mid 1990s
 Move to shared-nothing parallel systems
 – Predictable performance with minimal interference between processors
 – Database workloads typically partition well
 Good scale-up and speed-up
 – Just add more resources
 – Embarrassing parallelism
 I/O is the bottleneck
Late 1990s
 Disk speed/capacity, memory capacity, and processor speed all increase
 – But not memory latency
 Recognition that these improvements were changing the performance profile of a DBMS
 – I/O is no longer the bottleneck
 – L2 data cache misses and L1 instruction cache misses are significant [Ailamaki et al 99]
 – 2 to 5 times the CPI of SPECInt workloads
What to do about it?
 Build a new DBMS from scratch?
 – Major investments in existing infrastructure.
 – Performance is only one of many goals.
 – Research prototypes, e.g., Monet [CWI].
 Try to adapt existing systems.
 – Constrains options, but improvements attainable without re-implementing everything.
 – Be configurable to available hardware.
 – Focus on “inner loops” of the critical paths.
Talk Outline
 Intro – what is a DBMS and what are its performance goals?
 Some history.
 How DBMS researchers are dealing with changes in architecture.
 The future.
Dealing with Modern Architectures
 Data & instruction cache performance
 – Better utilization of space in each cache line
 – Schedule work to efficiently utilize data in each cache line
 – Chunk work into units processed by L1 cache-sized code segments
 – Prefetching
 Branches
 – Take mispredictions into account when estimating query cost
 – Avoid branches to get better CPI
 SIMD
 – Speedup due to parallelism and absence of branch misprediction
 TLB, GPU, SMT, CAM, etc.
Better Cache-line Utilization I
B+-Tree node: P0 K1 P1 K2 P2 K3 P3 (one pointer per child; 4 children per cache line)
CSB+-Tree node [Rao & Ross 2000]: K1 K2 K3 K4 K5 K6 K7 P (one pointer to 8 contiguous children)

[Chart: elapsed time (s) vs. proportion of searches for an index with 4 million leaf entries, comparing B+-Tree, CSB+-Tree, and Full CSB+-Tree; “Full” means space for all possible children is allocated in advance.]
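A minimal sketch of the two node layouts in C, assuming 4-byte keys, 4-byte child pointers (as on a 32-bit build), and a 32-byte cache line; the exact layouts in [Rao & Ross 2000] differ in detail.

/* B+-Tree node: keys and child pointers interleaved in one cache line,
 * so roughly half of the line is spent on pointers (3 keys, 4 children). */
struct btree_node {
    struct btree_node *child[4];   /* P0 P1 P2 P3 */
    int                keys[3];    /* K1 K2 K3    */
};

/* CSB+-Tree node: all children of a node are stored contiguously, so a
 * single pointer to the first child suffices.  The space freed up holds
 * more keys per cache line (7 keys, 8 children), nearly doubling the
 * fan-out and reducing cache misses per descent. */
struct csb_node {
    int              keys[7];      /* K1 ... K7                     */
    struct csb_node *children;     /* child i lives at children + i */
};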
Better Cache-line Utilization II
 Columnwise storage, e.g., Monet [CWI]
 – Only process the columns you need
 PAX [Ailamaki et al 2001]
 – Each column stored columnwise within disk page
 – Good cache behavior for queries
 – Good I/O behavior for row-access
 Fractured Mirrors [Ramamurthy et al 2002]
 – Best of both worlds with two data copies
 CSM [Shao et al 2004]
 – In-memory copy keeps just part of disk page
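A minimal sketch of a PAX-style page in C: every attribute of the records on the page is kept in its own contiguous “minipage”, so a scan over one column touches only that column’s cache lines while the whole record still resides on a single disk page. Field names and sizes are illustrative, not the layout from [Ailamaki et al 2001].

#include <stddef.h>

#define PAGE_SIZE 8192
#define MAX_ATTRS 16

typedef struct {
    int  nattrs;                    /* attributes per record        */
    int  nrecords;                  /* records stored on this page  */
    int  minipage_off[MAX_ATTRS];   /* byte offset of each minipage */
    char data[PAGE_SIZE - (2 + MAX_ATTRS) * sizeof(int)];
} PaxPage;

/* Fetch fixed-width attribute a of record r (assumes 4-byte ints and
 * 4-byte-aligned minipages). */
static inline int pax_get_int(const PaxPage *p, int a, int r)
{
    const int *minipage = (const int *)(p->data + p->minipage_off[a]);
    return minipage[r];
}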
Consecutive Lookups to an Index [Zhou & Ross 2003]

[Diagram, shown over three animation steps: a probe stream of keys (2, 11, 1, 20, 9, 19, 3, 5) is looked up one at a time in a tree index with root A and nodes B–F mapping keys to RIDs r1–r8. The L2 cache holds only about 3 nodes, so successive probes to different subtrees keep evicting each other’s nodes. Cache thrashing!]
Buffering Accesses to the Index
[Diagram: probes (e.g., keys 1000, 451, 998, 999) are no longer taken straight to a leaf. The root A and its top children stay in the L2 cache, and each lower, cache-sized subtree (C, D, E, F) has a buffer of pending probe keys. A probe is appended to the buffer of the subtree it falls into; when a buffer fills, its probes are processed together while that subtree is cache-resident, producing results in a batch.]
Impact of Buffering
 Increases temporal and spatial locality
 Overhead: extra copy of data into buffers
 Key idea: buffer units of L2 cache size or less (see the sketch below)
 – several tree levels if fan-out is small
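A minimal sketch of buffered index probes in C, assuming a simplified two-level index whose root fits in cache and whose leaf groups are roughly L2-sized; the scheme in [Zhou & Ross 2003] buffers at every cache-sized subtree of a real B+-tree, but the idea is the same: divert each probe into a per-subtree buffer and drain a buffer only when it is full, so each subtree is reused while it is cache-resident.

#include <stddef.h>

#define FANOUT  64                    /* children of the root           */
#define BUF_CAP 1024                  /* probes buffered per subtree    */

typedef struct {
    int    sep[FANOUT - 1];           /* separator keys in the root     */
    int   *leaf_keys[FANOUT];         /* sorted keys of each leaf group */
    long  *leaf_rids[FANOUT];
    size_t leaf_len[FANOUT];
} TwoLevelIndex;

static int pick_child(const TwoLevelIndex *ix, int key)
{
    int c = 0;
    while (c < FANOUT - 1 && key >= ix->sep[c]) c++;
    return c;
}

static long search_leaf_group(const TwoLevelIndex *ix, int c, int key)
{
    size_t lo = 0, hi = ix->leaf_len[c];      /* binary search */
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (ix->leaf_keys[c][mid] < key) lo = mid + 1; else hi = mid;
    }
    return (lo < ix->leaf_len[c] && ix->leaf_keys[c][lo] == key)
               ? ix->leaf_rids[c][lo] : -1;
}

typedef struct { int key; size_t id; } BufEntry;

void buffered_probes(const TwoLevelIndex *ix,
                     const int *keys, size_t n, long *out)
{
    static BufEntry buf[FANOUT][BUF_CAP];     /* not reentrant: sketch only */
    size_t len[FANOUT] = {0};

    for (size_t i = 0; i < n; i++) {
        int c = pick_child(ix, keys[i]);      /* only the root is touched   */
        buf[c][len[c]++] = (BufEntry){keys[i], i};
        if (len[c] == BUF_CAP) {              /* drain: leaf group c stays  */
            for (size_t j = 0; j < BUF_CAP; j++)      /* cache-resident     */
                out[buf[c][j].id] = search_leaf_group(ix, c, buf[c][j].key);
            len[c] = 0;
        }
    }
    for (int c = 0; c < FANOUT; c++)          /* flush leftover buffers     */
        for (size_t j = 0; j < len[c]; j++)
            out[buf[c][j].id] = search_leaf_group(ix, c, buf[c][j].key);
}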
[Chart: elapsed time in seconds on a Pentium 4 with 5M keys and 5M probes, comparing conventional vs. buffered lookups for B+-Tree, CSB+-Tree, and R-Tree indexes.]
Conventional Demand-Driven Pipelined Execution Model [Zhou & Ross 2004]

 Support open-next-close iterator interface
 An operator returns control immediately after generating one tuple

[Diagram: the parent operator calls GetNext() on the child operator for each result tuple; the instructions of both operators compete for the L1 I-cache. Cache thrashing!]

 Poor instruction locality
 Poor branch prediction accuracy
 Solution: buffering!
 – Execution order ABABABA… becomes ABBBBBAAAAAB…
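A minimal sketch, in C, of the open-next-close interface and of a buffer operator spliced between parent and child, in the spirit of [Zhou & Ross 2004]; type and function names are illustrative, not Postgres APIs.

#include <stddef.h>

typedef struct Tuple    Tuple;       /* opaque tuple handle              */
typedef struct Operator Operator;

struct Operator {                    /* open-next-close iterator         */
    void   (*open)(Operator *);
    Tuple *(*next)(Operator *);      /* returns NULL at end of stream    */
    void   (*close)(Operator *);
    Operator *child;
    void     *state;
};

#define BUF_TUPLES 512               /* chosen so one operator's code and */
                                     /* its buffer fit in the L1 I-cache  */
typedef struct {
    Tuple *t[BUF_TUPLES];
    int    n, pos;
} BufState;

/* next() of the buffer operator: instead of the ABAB... call pattern,
 * run the child in a tight loop to fill the buffer (ABBBB...), then
 * hand tuples to the parent one by one from the buffer. */
static Tuple *buffer_next(Operator *op)
{
    BufState *b = op->state;
    if (b->pos == b->n) {
        b->n = b->pos = 0;
        Tuple *t;
        while (b->n < BUF_TUPLES && (t = op->child->next(op->child)) != NULL)
            b->t[b->n++] = t;
        if (b->n == 0)
            return NULL;             /* child exhausted                   */
    }
    return b->t[b->pos++];
}

Because the parent still sees an ordinary next() call, a buffer operator like this can be inserted into an existing plan without changing the other operators.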
Validation on Postgres
SELECT SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG(l_quantity) AS avg_qty,
       COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-11-01'
(TPC-H)

 Original query plan: Agg (sum, avg, count) over TableScan(lineitem)
 – Combined instruction footprint: 23K > 16K L1 I-cache
 Buffered query plan: Agg (sum, avg, count) over Buffer over TableScan(lineitem)
 The buffered plan reduces L1 instruction cache misses by 80%

[Chart: elapsed time in seconds for the original vs. buffered query plans (about a 12% improvement), broken down into L1 instruction cache miss penalty, L2 cache miss penalty, branch misprediction penalty, and other cost.]
Steps: Cache-resident Transaction Processing [Harizopoulos et al 2004]

 Group similar transactions
 – Reuse L1-cache resident instructions for many transactions
 – Fast context switch
 – 40% speedup for TPC-C workloads
Prefetching
 Rely on hardware prefetching
 – Use predictable (e.g., sequential) access patterns
 Explicit prefetching
 – Child/sibling cache lines in tree indexes [Chen et al 2001, 2002]
 – Staged access to data [Harizopoulos & Ailamaki 2003]
 Work-ahead set [Zhou et al 2005] [EDGE poster]
 – Load rather than prefetch (which may be dropped)
 – Use another SMT thread to manage memory access
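A minimal sketch of explicit software prefetching for a stream of random lookups, using GCC/Clang's __builtin_prefetch; the cited papers prefetch child/sibling tree nodes or use a work-ahead set managed by a helper thread, but the distance-based pattern below conveys the basic idea. The data structure and prefetch distance are illustrative.

#include <stddef.h>

typedef struct { int key; long rid; } Bucket;

/* Probe a large hash table for a batch of keys.  While handling probe i,
 * prefetch the bucket that probe i+DIST will need, so its cache line is
 * (hopefully) resident by the time we get there. */
void probe_all(const Bucket *table, size_t mask,
               const unsigned *hashes, const int *keys,
               size_t n, long *out)
{
    enum { DIST = 8 };                         /* prefetch distance */
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&table[hashes[i + DIST] & mask]);
        const Bucket *b = &table[hashes[i] & mask];
        out[i] = (b->key == keys[i]) ? b->rid : -1;
    }
}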
Branch Mispredictions
Conjunctive Selections [Ross 2002]
j = 0;
for (i = 0; i < number_of_records; i++)
{
    /* each predicate fk returns 0 or 1 for record i */
    if (f1(R[i]) && f2(R[i]) && f3(R[i]) && f4(R[i]))
    { answer[j++] = i; }
}
Code Alternatives (I)
if (f1(R[i]) && f2(R[i])) {…}
versus
if (f1(R[i]) & f2(R[i])) {…}
 && version does less work if f1 is selective.
 But… && version has two conditional branches while & version has one.
Code Alternatives (II)
 Version with no conditional branches:

for (i = 0; i < number_of_records; i++)
{
    answer[j] = i;                  /* always write the candidate        */
    j += (f1(R[i]) & f2(R[i]));     /* advance j only if both are true   */
}
4-way conjunction; simple f1, f2, …
Query Optimization
Large space of plans, e.g.,

for (i = 0; i < number_of_records; i++)
{
    if ( (f1(R[i]) & f2(R[i])) && f3(R[i]) )
    {
        answer[j] = i;
        j += f4(R[i]);
    }
}

The optimizer’s cost function must include a term for the branch misprediction penalty.
Branches and CPI
 Branches create control dependencies that limit instruction-level parallelism on superscalar CPUs
 – Avoid dependencies to get better CPI
 Compression/decompression [Zukowski et al 2006]
 – Branch-free decompression algorithm
 – Decompress at 2–6 GB/sec to get good data bandwidth without hundreds of disks
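A minimal sketch of branch-free decompression of fixed-width bit-packed codes; it is in the spirit of, but not identical to, the PFOR-style schemes of [Zukowski et al 2006]. The loop body contains no data-dependent branches, so mispredictions cannot stall the decode.

#include <stdint.h>
#include <stddef.h>

/* Unpack n codes of `bits` bits each (1 <= bits <= 32) from `packed`.
 * The caller must allocate one spare 64-bit word past the packed data
 * so the two-word read below never runs off the end. */
void unpack(const uint64_t *packed, unsigned bits, size_t n, uint32_t *out)
{
    const uint64_t mask = (1ULL << bits) - 1;
    for (size_t i = 0; i < n; i++) {
        size_t   bitpos = i * bits;
        size_t   word   = bitpos >> 6;          /* which 64-bit word      */
        unsigned off    = bitpos & 63;          /* bit offset in the word */
        uint64_t lo     = packed[word];
        uint64_t hi     = packed[word + 1];
        /* combine the two words without branching, even when a code
         * straddles the word boundary; (hi << 1) << (63 - off) equals
         * hi << (64 - off), and is 0 when off == 0 */
        uint64_t v = (lo >> off) | ((hi << 1) << (63 - off));
        out[i] = (uint32_t)(v & mask);
    }
}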
SIMD on Modern Processors
E.g., SSE on Pentium
 Comparisons generate masks
 – Convert if-then-else control dependencies to data dependencies
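A minimal sketch of a SIMD range selection with aggregation using SSE2 intrinsics, in the spirit of [Zhou & Ross 2002] but not their exact code: the BETWEEN predicate produces a per-lane mask, and ANDing the mask with the aggregated column replaces the if-then-else, so there is no data-dependent branch.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* SELECT SUM(R.y) FROM R WHERE R.x BETWEEN a AND b,
 * for 32-bit signed columns; assumes n is a multiple of 4 and ignores
 * 32-bit overflow in the per-lane accumulators (sketch only). */
int64_t simd_sum_between(const int32_t *x, const int32_t *y,
                         size_t n, int32_t a, int32_t b)
{
    __m128i lo  = _mm_set1_epi32(a);
    __m128i hi  = _mm_set1_epi32(b);
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i below = _mm_cmpgt_epi32(lo, vx);     /* x <  a */
        __m128i above = _mm_cmpgt_epi32(vx, hi);     /* x >  b */
        /* mask lane = all ones where a <= x <= b, else all zeros */
        __m128i mask  = _mm_andnot_si128(_mm_or_si128(below, above),
                                         _mm_set1_epi32(-1));
        acc = _mm_add_epi32(acc, _mm_and_si128(vy, mask));
    }

    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
}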
Benefits of SIMD for DBs [Zhou & Ross 2002]
Select Agg(R.y) From R Where R.x Between a AND b
Table with 1 million records. Predicate selectivity 0.2
[Chart: elapsed time in milliseconds for scalar vs. SIMD versions of SUM, COUNT, MAX, and MIN aggregates, broken down into branch misprediction penalty and other cost.]
Additional work
 Efficient SMT parallelism for single operators
 – [Zhou et al 2005, Garcia et al 2006]
 TLB optimization
 – More passes over data to avoid TLB thrashing (e.g., data partitioning) [Manegold et al 2004]
 Use of co-processors (e.g., GPU)
 – Fast sorting [Govindaraju et al 2006]
 – Spatial query primitives [Bandi et al 2004]
 Use of other commodity hardware
 – Content-addressable memory (CAM) [Bandi et al 2005]
Talk Outline
 Intro – what is a DBMS and what are its performance goals?
 Some history.
 How DBMS researchers are dealing with changes in architecture.
 The future.
Future Challenges I
 Multi-core, more threads per core
 DBs already parallel (e.g., Oracle will run on Niagara) but…
 – Cache interference (or underutilization) on each chip
 – Flipside: no cache coherence problem with shared L2 cache
 New trade-offs
 – Move away from complete shared-nothing?
 – Need to model on-chip and off-chip communication costs
 Bandwidth in/out of the chip now a bottleneck
 – Speculative prefetching a bad idea
Future Challenges II
 Making use of multiple HW components
 – CPU+GPU
 – PPE+SPEs on Cell
 – Various kinds of accelerators (commodity!)
 Best use of whole system
 – Configurability, tunability
 – Programmability
 – Cost models
Conclusions
 DBMS software is changing.
 Evolution and revolution in response to new hardware.
 More changes ahead for multi-core systems
 – Existing parallel design is a head start
 What’s the best HW mix for databases?
 – E.g., is a GPU a server component?