On the Interaction Between Commercial Workloads
and Memory Systems in High-Performance Servers
Per Stenström
Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
in collaboration with
Sun Microsystems and Ericsson Research
Department of Computer Engineering, Chalmers,
Göteborg, Sweden
http://www.ce.chalmers.se/~pers
Motivation
Dominating multiprocessor server applications 1995
(Source: Dataquest)
[Pie chart: 1995 multiprocessor server workload mix by category — database, file server, media and e-mail, print server, scientific and engineering, other; segments of 32%, 32%, 16%, 9%, 6%, and 5%]
• Database applications dominate (32%)
• Yet, the major research focus is on scientific/eng. apps (16%)
Project Objective
• Design principles for high-performance memory systems for
emerging applications
• Systems considered:
– high-performance compute nodes
– SMP and DSM systems built out of them
• Applications considered:
– Decision support and on-line transaction processing
– Emerging applications
• Computer graphics
• video/sound coding/decoding
• handwriting recognition
• ...
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive
workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Experimental Platform
[Diagram: simulation platform — application and operating system (Linux) running on a modeled Sparc V8 CPU with cache and memory, plus Ethernet, SCSI, TTY, and interrupt devices; single- and multiprocessor system models]
Platform enables
• Analysis of commercial workloads
• Analysis of OS effects
• Tracking architectural events to OS or application level
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive
workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Decision-Support Systems (DSS)
Compile a list of matching entries in several database relations
[Diagram: query plan — a scan node at level 1, and at each higher level up to i a join fed by a scan node and by the output of the level below]
Will moderately sized caches suffice for huge databases?
Our Findings
[Plot: cache miss rate versus cache size — the miss rate drops sharply once the cache holds MWS, and again as it grows to hold DWS1, DWS2, ..., DWSi]
• MWS: footprint of instructions and private data to access a
single tuple
– typically small (< 1 Mbyte) and not affected by database size
• DWS: footprint of database data (tuples) accessed across
consecutive invocations of same scan node
– typically small impact (~0.1%) on overall miss rate
Methodological Approach
Challenges:
• Not feasible to simulate huge databases
• Need source code: we used PostgreSQL and MySQL
Approach:
• Analytical model using
– parameters that describe the query
– parameters measured on downscaled query executions
– system parameters
Footprints and Reuse Characteristics in DSS
Footprints per tuple access
[Diagram: the query plan annotated with the footprint touched per tuple access — MWS and DWS1 at the level-1 scan, MWS and DWSi at the join/scan pair of each higher level i]
• MWS: instructions, private, and metadata
– can be measured on downscaled simulation
• DWS: all tuples accessed at lower levels
– can be computed based on query composition and prob. of match
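To make these footprints concrete, here is a minimal sketch (illustrative C, not PostgreSQL or MySQL code; tuple_t, relation_t, matches, and emit are invented names) of a pipeline in which each level's scan feeds a nested-loop join above it. Every tuple passed down from a higher level causes the lower scans to be re-invoked, so the database data they touch is revisited (the DWS footprint), while the per-tuple code and private data (the MWS footprint) recur on every access.

```c
#include <stddef.h>

typedef struct { int key; /* ...other attributes... */ } tuple_t;

typedef struct {
    const tuple_t *rows;    /* relation scanned at this level */
    size_t         nrows;
} relation_t;

/* Join predicate: nonzero if the outer and inner tuples match. */
static int matches(const tuple_t *outer, const tuple_t *inner) {
    return outer->key == inner->key;
}

/* Scan level `lvl`, filter against the tuple received from the level
 * above, and hand matching tuples down to the next level (or emit at
 * the bottom).  Re-invoking the lower scans for every outer tuple is
 * what creates the reuse that DWS captures. */
static void scan_and_join(const relation_t *rel, int lvl,
                          const tuple_t *outer,
                          void (*emit)(const tuple_t *)) {
    for (size_t i = 0; i < rel[lvl].nrows; i++) {   /* scan node */
        const tuple_t *t = &rel[lvl].rows[i];
        if (outer != NULL && !matches(outer, t))    /* join node */
            continue;
        if (lvl == 0)
            emit(t);                      /* matching entry found */
        else
            scan_and_join(rel, lvl - 1, t, emit);
    }
}
```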
Analytical Model: An Overview
• Goal: predict the miss rate versus cache size for fully associative
caches with an LRU replacement policy on single-processor systems
Number of cold misses:
• size of footprint / block size
– |MWS| is measured
– |DWSi| is computed from parameters describing the query (size of relations,
probability of matching a search criterion, index versus sequential scan, etc.)
Number of capacity misses for a tuple access at level i:
• CM0 * (1 - (C - C0) / (|MWS| - C0)) if C0 < cache size C < |MWS|
• size of tuple / block size if |MWS| <= cache size C < |MWS| + |DWSi|
Number of accesses per tuple: measured
Total number of misses and accesses: computed
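As a minimal sketch of how these pieces combine (assumed, simplified parameter names; the paper's model carries more terms than shown here), the two miss components for one level i might be coded as:

```c
/* Sketch of the model's two miss components for one level i.
 * Cache: fully associative, LRU, single processor.                      */
typedef struct {
    double mws;      /* |MWS|: per-tuple footprint of instructions,
                        private data and metadata (bytes, measured)      */
    double dws_i;    /* |DWS_i|: database-data footprint reused across
                        consecutive invocations of the level-i scan node
                        (bytes, computed from the query composition)     */
    double tuple;    /* size of one tuple (bytes)                        */
    double block;    /* cache block size (bytes)                         */
    double cm0;      /* capacity misses measured at cache size c0        */
    double c0;       /* smallest simulated cache size (bytes)            */
} level_params_t;

/* Number of cold misses for a footprint: size of footprint / block size */
double cold_misses(double footprint, double block) {
    return footprint / block;
}

/* Number of capacity misses per tuple access at level i, following the
 * piecewise expression on this slide (c = cache size in bytes).          */
double capacity_misses(const level_params_t *p, double c) {
    if (c <= p->c0)
        return p->cm0;
    if (c < p->mws)                        /* C0 < c < |MWS|              */
        return p->cm0 * (1.0 - (c - p->c0) / (p->mws - p->c0));
    if (c < p->mws + p->dws_i)             /* |MWS| <= c < |MWS| + |DWS_i| */
        return p->tuple / p->block;
    return 0.0;                            /* both working sets fit       */
}
/* Total misses: cold misses for MWS and DWS_i plus capacity misses summed
 * over the measured tuple accesses; miss rate = misses / accesses.       */
```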
Model Validation
[Chart: measured versus predicted miss ratio, for database data and in total, as a function of cache size in Kbytes]
Goal:
• Prediction accuracy for queries with different compositions
– Q3, Q6, and Q10 from TPC-D
• Prediction accuracy when scaling up the database
– parameters measured on a 5-Mbyte database used to predict 200-Mbyte databases
• Robustness across database engines
– Two engines: PostgreSQL and MySQL
Chart shown for Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins
Model Predictions: Miss Rates for Huge Databases
[Chart: miss ratio (%) versus cache size (Kbytes, from 16 Kbytes upward) for Q3 on a 10-Terabyte database, broken down into instructions, private data, meta data, and database data]
• Miss-rate components for instructions, private data, and meta data decay rapidly (by 128 Kbytes)
• The miss-rate component for database data is small
What’s in the tail?
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive
workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Cache Issues for Linked Data Structures
• Traversal of lists may exhibit poor temporal locality
• Results in chains of data-dependent loads, called pointer-chasing
• Pointer-chasing shows up in many interesting applications:
– 35% of the misses in OLTP (TPC-B)
– 32% of the misses in an expert system
– 21% of the misses in Raytrace
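A minimal illustration of such a dependent-load chain (generic C, not taken from the workloads): the address of each node is only known once the previous load completes, so the misses serialize.

```c
struct node { struct node *next; int payload; };

long sum_list(const struct node *p) {
    long sum = 0;
    while (p != NULL) {
        sum += p->payload;  /* little computation per node...            */
        p = p->next;        /* ...and the next address is unknown until  */
    }                       /*    this data-dependent load completes     */
    return sum;
}
```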
SW Prefetch Techniques to Attack Pointer-Chasing
• Greedy Prefetching (G)
– falls short when the computation per node is less than the memory latency
• Jump Pointer Prefetching (J)
– falls short for short lists, or when the traversal is not known a priori
• Prefetch Arrays (P.S/P.H)
– generalization of G and J that addresses the above shortcomings
– trades memory space and bandwidth for more latency tolerance (all three schemes are sketched below)
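The sketch below (generic C using the GCC/Clang __builtin_prefetch intrinsic; the `jump` field and the per-list `pf` array are illustrative additions, and P.H would issue the same prefetches from hardware) shows where each scheme inserts its prefetches on a list traversal: greedy prefetches only one node ahead, jump pointers reach several nodes ahead but cannot cover the first nodes or short lists, and the prefetch array covers exactly those first nodes.

```c
#define PF_DEPTH 4                   /* lookahead distance k (assumed)    */

struct pnode {
    struct pnode *next;
    struct pnode *jump;              /* jump pointer: k nodes ahead       */
    int payload;
};

struct plist {
    struct pnode *head;
    struct pnode *pf[PF_DEPTH];      /* prefetch array: first k nodes     */
};

long traverse(const struct plist *l) {
    long sum = 0;

    /* Prefetch array (P.S): cover the first nodes, which neither greedy
     * nor jump pointer prefetching can reach.                            */
    for (int i = 0; i < PF_DEPTH; i++)
        if (l->pf[i] != NULL)
            __builtin_prefetch(l->pf[i]);

    for (const struct pnode *p = l->head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next);   /* greedy (G): one node ahead  */
        if (p->jump != NULL)
            __builtin_prefetch(p->jump);   /* jump pointer (J): k ahead   */
        sum += p->payload;                 /* little work per node        */
    }
    return sum;
}
```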
Results: Hash Tables and Lists in Olden
[Bar chart: normalized execution time, split into busy time and memory stall time, for HEALTH and MST under baseline (B), greedy (G), jump pointer (J), and prefetch-array (P.S, P.H) prefetching]
Prefetch arrays do better because:
• MST has short lists and little computation per node
• They prefetch data for the first nodes in HEALTH, unlike jump pointer prefetching
Results: Tree Traversals in OLTP and Olden
[Bar chart: normalized execution time, split into busy time and memory stall time, for DB.tree and Tree.add under baseline (B), greedy (G), jump pointer (J), and prefetch-array (P.S, P.H) prefetching]
Hardware-based prefetch arrays do better because:
• The traversal path is not known in DB.tree (depth-first search)
• Data for the first nodes is prefetched in Tree.add
Other Results in Brief
• Impact of longer memory latencies:
– Robust for lists
– For trees, prefetch arrays may cause severe cache pollution
• Impact of memory bandwidth
– Performance improvements sustained for bandwidths of typical
high-end servers (2.4 Gbytes/s)
– Prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
• Node insertion and deletion for jump pointers and prefetch
arrays
– Results in instruction overhead (-). However,
– insertion/deletion is sped up by prefetching (+)
Outline
• Experimental platform
• Memory system issues studied
– Working set size in DSS workloads
– Prefetch approaches for pointer-intensive
workloads (such as in OLTP)
– Coherence issues in OLTP workloads
• Concluding remarks
Coherence Issues in OLTP
[Diagram: an SMP node — processors with private caches sharing memory over a bus — and a DSM system built from several such nodes, each with its own memory]
• Favorite protocol: write-invalidate
• Ownership overhead: invalidations cause write stall and
inval. traffic
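For context (an illustrative access pattern, not code from the workloads): any read-modify-write of shared data produces such a load/store sequence, and under write-invalidate the store then needs its own ownership transaction.

```c
/* Under write-invalidate, the load typically brings the block into the
 * cache in shared state; the store that follows must then obtain
 * ownership, stalling the writer and generating invalidation traffic.   */
void credit_account(volatile long *balance, long amount) {
    long b = *balance;        /* load (miss): block arrives shared        */
    *balance = b + amount;    /* store: separate ownership transaction    */
}
```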
Ownership Overhead in OLTP
Simulation setup:
• CC-NUMA with 4 nodes
• MySQL, TPC-B, 600 MB database
                 Kernel   DBMS   Library   Total
Load/Store         44%     31%     27%      40%
Stored-Between     32%     61%     66%      41%
Loaded-Between     24%      8%      7%      19%
• 40% of all ownership transactions stem from load/store
sequences
Techniques to Attack Ownership Overhead
• Dynamic detection of migratory sharing
– detects two consecutive load/store sequences by different processors
– covers only a subset of all load/store sequences (~40% in OLTP)
• Static detection of load/store sequences
– compiler algorithms that tag a load followed by a store and bring the block into the cache in exclusive state
– poses problems in TPC-B
New Protocol Extension
Criterion:
when a load miss from processor i is followed by a global store from i,
tag the block as Load/Store
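A minimal sketch of this detection criterion at a directory entry (illustrative C; the data structure, the names, and the choice to hand tagged blocks out in exclusive state on later load misses are assumptions, not the protocol specification):

```c
#include <stdbool.h>

enum block_class { PLAIN, LOAD_STORE };

struct dir_entry {
    enum block_class cls;       /* sticky Load/Store tag                   */
    int  last_load_miss_cpu;    /* processor whose load miss fetched block */
    bool grant_exclusive;       /* supply the block in exclusive state?    */
};

/* On a load miss from processor `cpu`.  If the block is already tagged
 * Load/Store, hand it out exclusively so the expected store needs no
 * separate ownership transaction (assumed policy).                        */
void on_load_miss(struct dir_entry *e, int cpu) {
    e->last_load_miss_cpu = cpu;
    e->grant_exclusive = (e->cls == LOAD_STORE);
}

/* On a global store (one that must obtain ownership) from `cpu`: a load
 * miss from i followed by a global store from i tags the block.           */
void on_global_store(struct dir_entry *e, int cpu) {
    if (cpu == e->last_load_miss_cpu)
        e->cls = LOAD_STORE;
}
```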
[Bar chart: normalized execution time, split into busy time, read stall, and write stall, for the baseline protocol, migratory-sharing detection (Mig), and the Load/Store extension (LS)]
Concluding Remarks
• Focus on DSS and OLTP has revealed
challenges not exposed by traditional appl.
– Pointer-chasing
– Load/store optimizations
• Application scaling not fully understood
– Our work on combining simulation with
analytical modeling shows some promise