jobTalkSimple - OPERA research group



Improving the Performance of Storage Servers
Yuanyuan Zhou
Princeton University
Traditional Storage
• Delivers limited performance
  – Locally-attached
  – Little processing power
  – Small or no internal cache
  – Limited scale
  – Limited bandwidth
  – Simple storage interface
[Figure: a database server (file server) directly attached to a disk array]
Modern Storage Servers
• “Disks become super-computers” --Jim Gray
  – Network attachable
  – Increasing processing power
  – Gigabytes of memory cache
  – Gigabytes of bandwidth
  – Clustering of storage
  – Offloading application operations
[Figure: database and file servers connected through a Storage Area Network to clustered storage servers, each with processors and memory]
Impact of Storage Performance
• Storage I/O remains a bottleneck in many high-end or mid-size On-line Transaction Processing (OLTP) databases (Microsoft report & SOSP’95).
• Current technology trends
  – Processor speed increases 60% per year
  – Disk access time improves 7% per year
• Our goal: reduce I/O time
[Chart: normalized execution time for MS SQL, split into computation and I/O]
Approaches to Improving I/O Performance
• Improving response time and throughput
• Minimizing I/O and communication overhead
My Solutions
• Effective hierarchy-aware storage caching
  – Improving response time and throughput
• Using user-level communication as the database-storage network
  – Minimizing I/O & communication overhead
[Figure: database and file servers connected through a Storage Area Network to storage servers with processors and memory]
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & Implementations
  – Results
  – Summary
Multi-level Server Cache Hierarchy
[Figure: database and file clients, each with a 1st-level buffer cache (64MB – 128MB), access database and file servers (caches of 4GB – 32GB) over a network; those servers in turn access storage servers (caches of 1GB – 64GB)]
• Client cache (64MB – 128MB) << database server cache (4GB – 32GB) ~ storage server cache (1GB – 64GB)
• No need for inclusion property
Multi-level Server Caching
[Figure: accesses first go to the database or file server cache (higher level), typically managed with Least Recently Used (LRU) replacement; hits are served there, while misses go down to the storage server cache (lower level), which commonly also uses LRU]
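Since LRU replacement recurs throughout the comparison that follows, here is a minimal sketch of the policy in Python (illustrative only; the class name and interface are mine, not from the talk):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: a hit moves the block to the most-recently-used
    end; a miss on a full cache evicts the least-recently-used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block id -> payload, in LRU order

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # hit: refresh recency
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[block] = None
        return False                         # miss
```

As the later slides argue, recency alone is a poor fit for the storage level, because the higher-level cache has already absorbed most short-term reuse.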
Analogy: Storage Box (Basement)
• Assumption for analogy: item = box
• Question: do you keep the box?
• If you have a basement, you can keep all the boxes
[Figure: living room (higher level) above a basement (lower level) that holds every box; this is the traditional client-server cache hierarchy]
Analogy: Storage Box (Closet)
• If you just have a closet, you may keep only the box for your holiday decorations!
[Figure: living room (higher level) with hot and cold accesses; the closet (lower level) sees only hot misses. This is the database-storage server cache hierarchy]
But If You Use LRU for Your Closet…
• Your closet will be full of garbage!
[Figure: living room (higher level) and the lower-level storage filled with discarded boxes, e.g. a pizza box]
“Your cache ain’t nothin’ but trash”
• Storage server cache access patterns are not well understood
  – Most storage server caches still use LRU
• Muntz & Honeyman (USENIX ’92)
  – Cache hit ratios at lower-level file server caches are very low
• Willick et al. (ICDCS ’92)
  – FBR outperforms LRU for disk caches
Questions
• What is the access pattern at storage server caches?
• What are the properties of a good storage server cache replacement algorithm?
• What algorithms are good for storage server caches?
Storage Cache Access Traces
(miss traces collected below the database or file server cache)

Trace         | Description            | DB/file cache size (MB) | #Reads (millions) | #Writes (millions) | #DB or file servers
Oracle-1      | TPC-C, 100 GB database | 128                     | 7.3               | 4.3                | Single
Oracle-2      | TPC-C, 100 GB database | 16                      | 3.8               | 2.0                | Single
HP Disk       | Cello, 1992            | 30                      | 0.2               | 0.3                | Multiple
Auspex Server | File server, 1993      | 8                       | 1.8               | 0.8                | Multiple
Temporal Distances
• Temporal distance: the inter-reference gap from the previous reference to the same block
• Example:
  blocks: A, B, C, D
  access sequence:     A  B  C  A  D  B  C
  temporal distances:  -  -  -  3  -  4  4
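The temporal distances in the example above can be computed mechanically; a small Python helper (mine, for illustration) makes the definition concrete:

```python
def temporal_distances(accesses):
    """For each access, return the gap (in number of accesses) since the
    previous reference to the same block, or None for a first reference."""
    last_seen = {}
    distances = []
    for position, block in enumerate(accesses):
        if block in last_seen:
            distances.append(position - last_seen[block])
        else:
            distances.append(None)  # first reference: no distance yet
        last_seen[block] = position
    return distances
```

On the slide’s sequence, `temporal_distances(list('ABCADBC'))` reproduces the distances shown: `[None, None, None, 3, None, 4, 4]`.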
Temporal Distance Distribution
Accesses to the storage server have poor temporal locality.
[Histograms: number of accesses vs. temporal distance (from 1 up to 16m) for the Auspex access trace (higher level) and the Auspex miss trace (lower level); the miss-trace distribution is shifted toward much longer distances, with a minimum distance minDist below which few re-references occur]
Notation: 1k = 1,000 references; 1m = 1,000,000 references
Why Poor Temporal Locality?
[Figure: two consecutive accesses to block B first pass through the database or file cache, a 16K-block LRU queue; only misses reach the storage cache. Before B can be referenced again at the storage cache, more than 16K distinct blocks must be accessed at the higher level, which (assuming a 20% miss ratio) means more than 3.2K intervening accesses at the storage cache]
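This filtering effect is easy to reproduce: pass a synthetic trace through an upper-level LRU cache and compare the shortest reuse distance in the original stream with that in the miss stream the storage cache would see. The trace size, block population, and cache capacity below are my own illustrative choices, not the traces from the talk:

```python
import random
from collections import OrderedDict

def lru_filter(accesses, capacity):
    """Run an access stream through an LRU cache of `capacity` blocks and
    return the miss stream that the next (storage) level would observe."""
    cache, misses = OrderedDict(), []
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)         # hit: absorbed at this level
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the LRU block
            cache[block] = None
            misses.append(block)             # miss: passed downward
    return misses

def min_reuse_distance(stream):
    """Shortest gap between two references to the same block."""
    last, best = {}, None
    for position, block in enumerate(stream):
        if block in last:
            gap = position - last[block]
            best = gap if best is None else min(best, gap)
        last[block] = position
    return best

# Short-distance reuse is served by the upper cache, so the miss stream
# reaching the storage level contains only long-distance re-references.
random.seed(0)
trace = [random.randrange(200) for _ in range(20000)]
misses = lru_filter(trace, 100)
```

Comparing `min_reuse_distance(trace)` with `min_reuse_distance(misses)` shows the minimum reuse distance growing once the upper cache has filtered the stream, which is exactly the shift the Auspex histograms display.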
Minimal Lifetime Property
A block should stay in cache for at least minDist time to be hit at the next reference.
[Histograms of temporal distances for Oracle-1 (128MB client cache), Oracle-2 (16MB client cache), the HP Disk trace, and the Auspex server trace; each shows a minimum distance minDist below which almost no re-references occur]
What Blocks to Keep?
A large percentage of accesses are made to a small percentage of data, though to a lesser extent than at the higher level.
[Chart (Oracle-1): cumulative percentage of accesses and percentage of blocks vs. access frequency (1 to 128)]
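The skew claimed above can be quantified directly from a trace. This helper (my own, illustrative) reports the fraction of accesses that land on the most frequently referenced fraction of blocks, the statistic the Oracle-1 chart plots:

```python
from collections import Counter

def access_skew(accesses, top_fraction=0.1):
    """Fraction of all accesses that go to the most frequently referenced
    `top_fraction` of distinct blocks."""
    counts = Counter(accesses)
    ranked = sorted(counts.values(), reverse=True)
    top_n = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / len(accesses)

# A synthetic trace where one of 11 blocks receives 80 of 100 accesses:
trace = ['A'] * 80 + ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'] * 2
```

Here `access_skew(trace, 0.1)` is 0.8: the hottest 10% of blocks (a single block) draws 80% of the accesses.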
Frequency-based Priority Property
Blocks should be prioritized based on their access frequencies.
[Charts: cumulative percentage of accesses vs. access frequency for Oracle-1, Oracle-2, the HP Disk trace, and the Auspex server trace]
Replacement Algorithm Properties
• Minimal lifetime
  – A block should stay in cache for at least minDist time
• Frequency-based priority
  – Blocks should be prioritized based on their access frequencies
• Temporal (aged) frequency
  – Reference counts accumulated a long time ago should carry less weight
Performance of Existing Algorithms
There is a big gap from the off-line optimal (OPT) algorithm.
[Chart: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, FBR, and LRU; cache hit ratios for Oracle-1]
Do They Satisfy the Properties?
No on-line algorithm satisfies all three properties.

Algorithm | Minimal lifetime            | Frequency-based priority | Temporal frequency
OPT       | Best                        | Best                     | Best
LRU       | Poor with small cache sizes | Poor                     | Well
FBR       | Poor                        | Well                     | Well
Our Replacement Algorithm: Multi-Queue (MQ)
• Designed based on the three properties
  – Minimal lifetime: multiple LRU queues with different priorities
  – Frequency-based priority: promotion based on reference counts
  – Aged frequency: demotion when a block’s lifetime expires
• On an access to block B: lifetime = f(minDist); B.expireTime = CurrentTime + lifetime
[Figure: LRU queues Q0–Q3 holding blocks tagged with block ID and reference count (e.g. C:1, D:6, B:7 promoted to B:8); blocks are promoted on reference, demoted when expireTime < CurrentTime (e.g. D:6), and evicted from the lowest queue into a history buffer]
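The mechanism on this slide can be sketched as follows. This is an illustrative Python reconstruction guided by the bullets (m LRU queues indexed by log2 of the reference count, a per-block expireTime, demotion on expiry, and a history buffer of evicted blocks’ counts); the fixed lifetime, history-buffer bound, and interface are my own choices, not the authors’ implementation:

```python
import math
from collections import OrderedDict

class MQCache:
    """Sketch of Multi-Queue (MQ) replacement: m LRU queues of rising
    priority; a block with reference count c lives in queue
    min(floor(log2(c)), m-1); expired blocks are demoted one queue;
    eviction takes the LRU block of the lowest non-empty queue."""

    def __init__(self, capacity, m=8, lifetime=32):
        self.capacity, self.m, self.lifetime = capacity, m, lifetime
        self.queues = [OrderedDict() for _ in range(m)]  # id -> (count, expire)
        self.history = OrderedDict()                     # evicted id -> count
        self.size = 0
        self.time = 0

    def _queue_num(self, count):
        return min(int(math.log2(count)), self.m - 1)

    def access(self, block):
        self.time += 1
        hit = False
        for q in self.queues:
            if block in q:
                count, _ = q.pop(block)          # hit: found in some queue
                hit = True
                break
        else:
            count = self.history.pop(block, 0)   # remembered frequency
            if self.size >= self.capacity:
                self._evict()
            else:
                self.size += 1
        count += 1                               # promote by reference count
        self.queues[self._queue_num(count)][block] = (count, self.time + self.lifetime)
        self._adjust()
        return hit

    def _evict(self):
        for q in self.queues:                    # lowest non-empty queue
            if q:
                victim, (count, _) = q.popitem(last=False)
                self.history[victim] = count     # keep count for a while
                if len(self.history) > 4 * self.capacity:
                    self.history.popitem(last=False)
                return

    def _adjust(self):
        # Demote the LRU block of any queue whose lifetime has expired
        for k in range(1, self.m):
            if self.queues[k]:
                block, (count, expire) = next(iter(self.queues[k].items()))
                if expire < self.time:
                    del self.queues[k][block]
                    self.queues[k - 1][block] = (count, self.time + self.lifetime)
```

Unlike plain LRU, a frequently referenced block sits in a high queue and survives long scans of cold blocks through the low queues, which is exactly the minimal-lifetime behavior the traces call for.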
Simulation Evaluation
• Trace-driven cache simulator
  – Write-through
  – Block size: 8 Kbytes
• MQ
  – m = 8
  – Adaptive lifetime setting based on on-line statistics
Simulation Results
MQ performs better than the others.
[Chart: cache hit ratio (%) vs. storage cache size (64MB – 1024MB) for OPT, MQ, FBR, and LRU; cache hit ratios for Oracle-1]
Why Does MQ Perform Better?
• MQ can selectively keep some blocks for a longer time
[Histogram: 71% of re-references have temporal distance < 64K, 29% have distance >= 64K]

Algorithm | Temp. distance < 64K | Temp. distance >= 64K
          | #hits  | #misses     | #hits  | #misses
MQ        | 1553K  | 293K        | 1919K  | 2645K
FBR       | 1611K  | 235K        | 1146K  | 3418K
LRU       | 1846K  | 0           | 407K   | 4157K
(Oracle-1, storage cache size: 512 MB = 64K entries)
Implementation Results (Oracle)
• MQ has an effect similar to doubling the cache size with LRU

Storage Cache Size | MQ     | LRU
128 MB             | 19.85% | 8.85%
256 MB             | 31.42% | 17.66%
512 MB             | 44.34% | 31.69%

Storage cache hit ratios (database cache size: 128MB, database size: 100GB)
OLTP End Performance (Oracle)
• MQ reduces I/O time by 16~25% and improves overall performance by 9~11% compared to LRU.
[Chart: normalized execution time, split into computation and I/O, for LRU vs. MQ at three storage cache sizes; the I/O fraction drops from 0.5 (LRU) to 0.42 (MQ) at 128 MB, from 0.42 to 0.32 at 256 MB, and from 0.32 to 0.24 at 512 MB]
Related Work
• Practice
  – LRU, MRU, LFU, SEQ, LFRU, 2Q, FBR, LRU-k, …
• Theory
  – LUP & RLUP analytical model
  – Competitive analysis
• Temporal locality metrics
  – LRU stack distance (1970)
  – Distance string (1976)
  – IRG model (1995)
• Multi-queue process scheduling
Summary
• Access pattern?
  – Long temporal distances & unevenly distributed frequency
• Properties?
  – Minimal lifetime, frequency-based priority, aged frequency
• What algorithms are good for storage caches?
  – MQ performs better than seven tested alternatives and has an effect similar to doubling the cache size with LRU.
• Details can be found in
  – Y. Zhou, J. F. Philbin and K. Li. The Multi-Queue Server Buffer Cache Replacement Algorithm. USENIX ’01.
Outline
• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary
• User-level communication for database storage
  – Background
  – Architecture & Implementations
  – Results
  – Summary
I/O Related Host Overhead
• High-end or mid-size OLTP configurations
  – Tolerate disk access latency via asynchronous I/Os
  – Problem: high I/O-related processor overhead
• Reasons
  – OS overhead
  – Communication protocol overhead
• Our solution: use user-level communication for database storage
Analogy: Overhead Walkway
• An overhead walkway avoids stairs & traffic, so you can win money faster!
[Figure: Las Vegas casinos linked by an overhead walkway vs. crossing the “Strip” at street level; analogously, a database (user space) reaches the storage server via user-level communication instead of going through kernel space over SCSI or Fibre Channel]
User-level Communication
• High bandwidth, low latency, low overhead
• Main features
  – Bypassing the OS
  – Zero-copying
  – Remote DMA (RDMA)
• University research
  – VMMC, U-Net, FM, AM, Memory Channel, Myrinet, …
• Industrial standard
  – Virtual Interface (VI) Architecture
Using User-level Communication
• Intra-cluster communication
  – Scientific parallel applications
  – Application servers (Intel CSP, Compaq TruCluster)
  – Web servers
• Client-server communication
  – Databases (Oracle, DB/2, MS SQL)
  – Direct Access File Systems (DAFS)
Our Goals
• User-level communication as a database-storage interconnect:
  – Is user-level communication effective in reducing the host overhead for database storage?
  – How should user-level communication be used to connect databases with storage?
VI-Attached Storage Server
• Database and storage communicate using VI
• The storage cache is managed using MQ
[Figure: multiple database hosts, each with a client stub and a VI endpoint, connected over a VI network to storage servers, each running a storage cache over local disks]
Client-stub Implementations
• Challenges
  – Application transparency
  – Taking advantage of VI
  – Storage API
• Implementations (in decreasing order of transparency)
  – Kernel driver
  – DLL interceptor
  – Direct Storage Access (DSA)
Client Stub: Kernel Driver
• Fully transparent + standard API
• Plus
  – Supports all applications
  – Takes advantage of VI’s zero-copying and RDMA features
• Minus
  – High kernel overhead
  – Requires kernel-space VI
[Figure: databases call a standard DLL, which enters a kernel-space device driver layered on kernel-space VI]
Client Stub: DLL Interceptor
• User-level transparent + standard API
• Plus
  – No modification to databases
  – Takes advantage of user-level communication
• Minus
  – High overhead to satisfy standard I/O API semantics (example: triggering events for I/O completions)
[Figure: databases call a DLL interceptor layered between the standard DLL and user-space VI, bypassing the kernel]
Client Stub: Direct Storage Access (DSA)
• Least transparent + new API
• DSA interface:
  – Minimizes kernel involvement
  – Designed around VI features, e.g. polling for I/O completions
• Plus
  – Fully takes advantage of VI
• Minus
  – Requires small modifications to the database
[Figure: modified databases call the DSA library directly on user-space VI, bypassing the kernel]
Limitations of User-level Communication
• Substantial enhancements were needed to address the following issues
  – Lack of flow-control and reconnection mechanisms
  – High memory registration overhead
  – High locking overhead
  – High interrupt overhead
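As one example of the needed enhancements, missing flow control is commonly added as a sender-side credit scheme: each credit corresponds to a preposted receive buffer at the peer, and the receiver returns credits as it drains buffers. A minimal sketch of the idea (class and method names are hypothetical, not from the paper):

```python
from collections import deque

class CreditLink:
    """Credit-based flow control for a transport with no built-in
    back-pressure: the sender consumes one credit per message (one
    preposted receive buffer at the peer) and stalls at zero; the
    receiver returns credits as it reposts buffers."""

    def __init__(self, recv_buffers):
        self.credits = recv_buffers   # sender-side credit count
        self.pending = deque()        # messages waiting for a credit
        self.delivered = []           # messages actually put on the wire

    def send(self, msg):
        self.pending.append(msg)
        self._drain()

    def ack(self, n=1):
        """Receiver reposted n buffers; return n credits to the sender."""
        self.credits += n
        self._drain()

    def _drain(self):
        while self.pending and self.credits > 0:
            self.credits -= 1         # consume one receive buffer
            self.delivered.append(self.pending.popleft())
```

The stall-at-zero behavior is what prevents the sender from overrunning the peer's receive buffers, the failure mode that raw VI sends would otherwise allow.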
Evaluation
• Real systems
  – OS: Windows XP
  – Databases: MS SQL, TPC-C benchmark
  – VI network: Giganet
  – Tested by customers for 6 months
• Large-size configuration
  – Database: 32-way SMP, 32 GB memory
  – Storage: 8 PCs, each with 3 GB memory; 12 Terabytes of data
• Mid-size configuration
  – Database: 4-way SMP, 4 GB memory
  – Storage: 4 PCs, each with 2 GB memory; 1 Terabyte total
OLTP Performance (Large-size Config.)
• DSA improvement over Fibre Channel
  – I/O time: 40%
  – Overall: 18%
[Charts: normalized execution time, split into computation and I/O, for FibreChannel, Driver, DLL, and DSA; and a TPC-C I/O overhead breakdown showing CPU utilization split into OS kernel, lock, client stub, VI, and other for the same four configurations]
Summary
• User-level communication can effectively connect databases with storage, but may require substantial enhancements.
• A storage API that minimizes kernel involvement in the I/O path is necessary to fully exploit the benefits of user-level communication.
• Details can be found in
  – Y. Zhou, A. Bilas, S. Jagannathan, C. Dubnicki, J. F. Philbin and K. Li. Experience with VI-based Communication for Database Storage. To appear in ISCA ’02.
Conclusions
• Effective hierarchy-aware storage caching
  – MQ has a doubling-the-cache-size effect compared to LRU, and can reduce the I/O time in OLTP by 15~30%
  – Provides insights for other similar multi-level cache hierarchies (e.g. Web proxy caches)
• Using user-level communication for database storage
  – DSA can reduce the I/O time in OLTP by 40%
  – Provides guidelines for the design and implementation of new I/O interconnects (e.g. InfiniBand) and other applications (e.g. DAFS)
My Other Related Research
• Improving availability
  – Fast cluster fail-over using memory-mapped communication (ICS ’99)
• Memory management for networked servers
  – Consistency protocols for DSMs (OSDI ’96)
  – Coherence granularity vs. protocols (PPoPP ’97)
  – Performance limitations of software DSMs (HPCA ’99)
  – Thread scheduling for locality (IOPADS ’99)
• http://www.cs.princeton.edu/~yzhou/