Transcript derwin3

DBMSs On A Modern Processor:
Where Does Time Go?
by
A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood
University of Wisconsin-Madison Computer Science Dept.
Madison, WI
Presented by
Derwin Halim
Agenda
Database and DBMS
Motivation for DBMS performance study
Proposed DBMS performance study
Processor model
Query execution time breakdown
Database workload
Experimental setup and results
Conclusion
Database and DBMS
Database is a collection of data, typically
describing the activities of one or more
related organizations: entities and
relationships
DBMS (Database Management System) is a
software designed to assist in maintaining
and utilizing large collections of data
Motivation for
DBMS Performance Study
DBMSs are becoming compute and memory
bound
Modern processors do not improve database
system performance to the same extent as
scientific workloads
Contrasting commercial DBMSs and identifying
common characteristics are difficult
Urgent need to evaluate and understand the
processor and memory behavior of commercial
DBMSs on existing hardware platform
Proposed DBMS Performance Study
Analyze the execution time breakdown of multiple
different commercial DBMSs on the same
hardware platform
Use workload consists of simple queries on a
memory resident database
Isolate basic operations and identify common
trends across the DBMSs
Identify and analyze bottlenecks and provide
solutions
Processor Model:
Basic Pipeline Operation
Processor Model:
Handling Pipeline Stall
Non-blocking cache
Out-of-order execution
Speculative execution with branch
prediction
Query Execution Time Breakdown
TQ = TC + TM + TB + TR – TOVL
Database Workload
Single-table range selections and two table
equijoins over a memory resident database,
running a single command stream
Eliminates dynamic and random parameters
Isolate basic operations: sequential access
and index selection
Allows examination of the processor and
memory behavior without I/O interference
Database Workload
Table:
create table R (a1 integer not null,
a2 integer not null,
a3 integer not null,
<rest of field>)
Sequential range selection:
select avg(a3)
from R
where a2 < Hi and a2 > Lo
Database Workload
Indexed range selection:
construct non-clustered index on R.a2 then
resubmitted the range selection
Sequential join:
select avg(R.a3)
from R, S
where R.a2 = S.a1
40,000 100-byte records in S, each of which
joins with 30 records in R
Experimental Setup:
Hardware and Software Platform
400MHz PII Xeon/MT Workstation
512 MB main memory with 100 MHz system bus
Out-of-order engine and speculative instruction
execution
Non-blocking cache
Separate data and instruction first level caches
Unified second level cache
4 commercial DBMSs on Windows NT 4.0
Service Pack 4
Event measurement counters and emon
Experimental Setup:
PII Xeon Cache Characteristics
Experimental Setup:
Measuring Stall Time Components
Results:
Execution Time Breakdown
Processor spends most of the time stalled
The problem will be exacerbated by the ever increasing
processor-memory gap
Bottleneck shifts
Results:
Memory Stalls Breakdown
L1 D-cache, L2 I-cache, ITLB stall time are insignificant
Focus on L1 I-cache and L2 D-cache stall time component
Results:
L2 D-cache Stall Time
Position of the accessed data in the records
and the record size
L2 D-cache miss is much more expensive
than L1 D-cache miss
Only gets worse as processor-memory
performance gap increases
Larger cache => longer latency
Results:
L1 I-cache Stall Time
L1 I-cache miss is difficult to overlap and causes
serial bottleneck in the pipeline
L1 cache size vs. latency
L1 cache miss increases as data record size
increases
- Inclusion: L2 cache replacement forces L1 cache
replacement
- OS interrupt: periodical context switching
- Page boundary crossing
Results:
Branch Mis-prediction
Serial bottleneck and instruction cache misses
40% BTB misses on average => more static prediction
L1 I-cache miss follows branch mis-prediction behavior
Results:
Resource Stall Time
Dominated by dependency and/or functional unit stalls
Dependency stalls are the most important resource stalls
due to low ILP except for System A
FU stalls are caused by contention in the execution unit
Results:
Simple Query vs TPC Benchmarks
Simple Query vs TPC-D (DSS):
- Similar CPI breakdown
- Still dominated by L1 I-cache and L2 D-cache
miss
Simple Query vs TPC-C (OLTP):
- CPI rate of TPC-C is much higher
- Resource stalls are higher
- Dominated by L2 D- and I-cache miss
Conclusion
Memory stall is a serious performance bottleneck
Focus on L1 I-cache and L2 D-cache misses
Improvements should address all of the stall
components due to possibility of bottleneck shifts
Simple query offers methodological advantage
TPC-D has similar execution time breakdown,
while TPC-C incur more second level cache and
resource stalls