The Third Extended File System with Copy-on
Download
Report
Transcript The Third Extended File System with Copy-on
A Workload-Driven Unit of Cache
Replacement for Mid-Tier Database Caching
Xiaodan Wang, Tanu Malik,
Randal Burns
Johns Hopkins University
Stratos Papadomanolakis,
Anastassia Ailamaki
Carnegie Mellon University
Hopkins Storage Systems Lab, Department of Computer Science
Overview
Motivation
– Data intensive scientific database federations
– Mid-tier caching improves scalability
Choosing the unit of cache replacement
– Minimize aggregate network traffic
– Improve query execution performance
Query prototypes
– Cache groups of columns
– Adapts to changes in the workload
Hopkins Storage Systems Lab, Department of Computer Science
OpenSkyQuery
Federation of sky surveys (a virtual telescope)
–
Expected to grow from 30 sites to over 100
Available over the Internet (community of
astronomers, educational users)
Sites are autonomous, heterogeneous, and
geographically distributed
Data intensive workload (large data sets, networkbound)
Scaling through mid-tier caching
–
–
Minimize network traffic
Offload query processing
Hopkins Storage Systems Lab, Department of Computer Science
Caching Schema
Difficult to achieve good query performance
–
–
Both network and query performance are sensitive to
granularity of cache replacement
Fine granularity (column)
–
–
Caches employ commodity hardware
An index-free environment
Poor network performance at small cache sizes
High I/O overhead
Coarse granularity (table)
–
–
Groups unrelated columns
Inefficient query and network performance
Hopkins Storage Systems Lab, Department of Computer Science
Contributions
Cache workload-defined groups of columns (query
prototypes)
Adaptive – candidate query prototypes are
discovered incrementally from the request stream
Self-organizing – each prototype describes a physical
schema optimized for a specific class of queries
Improve in-cache query execution performance
without sacrificing network savings
Hopkins Storage Systems Lab, Department of Computer Science
Caching for Network Savings
Identify and cache database objects that provide
network savings
–
–
Requests that access these objects are serviced from the
cache
Reduces contention for network bandwidth
Bypass Yield Caching (Malik et al., ICDE’05)
–
–
Caching framework that uses economic principles to
maximize network savings
Database objects are ranked by yield (expected network
savings per unit of cache space utilized)
Hopkins Storage Systems Lab, Department of Computer Science
Choosing the Unit of Cache
Replacement
Semantic caching is unsuitable for Astronomy
–
–
Lack locality (objects are rarely reused)
Evaluating query containment is difficult (nested
queries, complex joins, and user-defined functions
are common)
Employ schema-based caching
–
–
–
Queries reuse the same set of columns
Derive popular columns from the workload
Analogous materialized views
Hopkins Storage Systems Lab, Department of Computer Science
File-Bundling (Otoo et al., SC’04)
Loading only columns with high yield at small cache sizes
Q1
Q2
Q3
A B C D E F G H I
Q4
J
Caching columns B, C, H, and I results in no cache hits
Solution: cache groups of columns
Hopkins Storage Systems Lab, Department of Computer Science
Cache
B
C
H
I
Caching Groups of Columns
Existing schema-based caching models are static
(e.g. CacheTables, MTCache, TimesTen)
– Do not account for dynamic workload access patterns
– Physical schema of backend database or defined a priori
– May group columns that are rarely used together
Query prototypes caching
–
–
Identifies the best groupings from the workload
Minimizes query execution cost against prototypes without
sacrificing network savings
Hopkins Storage Systems Lab, Department of Computer Science
Query Prototype
Given a query qi, define the Query Access Set, QAS(qi), as the
set of attributes accessed by qi
qi and qj share the same query prototype if they access the
same attributes (QAS(qi) = QAS(qj))
Example:
SELECT objID
FROM Galaxy, SpecObj
WHERE objID = bestobjID and specclass = 2 and z between
0.121 and 0.127
QAS = {Galaxy:objID, SpecObj:bestobjID,
SpecObj:specclass, SpecObj:z}
Hopkins Storage Systems Lab, Department of Computer Science
Query Prototype
Q1
QAS(Q1) = {R1:A2, R1:A3, R2:B1}
QAS(Q2) = {R2:B1, R2:B2, R2:B3}
Q2
Cache
A1
A2
A3
B1
B2
B3
Prototype
A2
R1
A3
B1
Prototype
B1
B2
B3
R2
Base Tables
B1 is replicated in the cache
Hopkins Storage Systems Lab, Department of Computer Science
Workload Properties
Read-only queries
One-month trace against the Sloan Digital Sky
Survey (SDSS) Data Release 4 – 2TB
1.4 million queries generating 360GB of network
traffic
1176 query prototypes describe the entire workload
11 prototypes capture 91% of the queries
6 prototypes generate 89% of the network traffic
Hopkins Storage Systems Lab, Department of Computer Science
Experiments
Evaluate caching of tables, columns, vertical
partitions, and query prototypes
AutoPart (Papadomanolakis et al., SSDBM’04)
–
–
–
An automated partitioning algorithm for large
scientific databases
Groups columns in order to improve query
execution performance
Produces the best workload-driven, static grouping
Hopkins Storage Systems Lab, Department of Computer Science
Network Savings
Hopkins Storage Systems Lab, Department of Computer Science
Cache Pollution
Hopkins Storage Systems Lab, Department of Computer Science
Query Performance
Hopkins Storage Systems Lab, Department of Computer Science
Discussion
Improving network and query execution performance
are complementary goals
Columns should be grouped together at small cache
sizes (cache hits suffer due to file-bundling)
Column groupings should be adaptive because
–
–
Workload access pattern is dynamic
Indexes are not available
Hopkins Storage Systems Lab, Department of Computer Science
Questions
???
Hopkins Storage Systems Lab, Department of Computer Science
Schema Reuse
• Localized to a small subset of tables
Hopkins Storage Systems Lab, Department of Computer Science
Schema Reuse
• Similar reuse among columns
Hopkins Storage Systems Lab, Department of Computer Science
Object Reuse
• Few objects are reused
Hopkins Storage Systems Lab, Department of Computer Science
SkyQuery
Federation
middleware built at
Hopkins
Wrapper/Mediator
architecture using
web services
Hopkins Storage Systems Lab, Department of Computer Science
Load Cost
Object Load Cost by Unit of Cache Replacement
200
180
160
# Writes/MB
140
120
100
80
60
40
20
0
Column
Qry Prototype
Unit of Cache Replacement
Hopkins Storage Systems Lab, Department of Computer Science
Scan Cost
Scanning large tables,
the useful region is a
small fraction
Incur IO overhead for
accessing data from
extraneous columns
Spatial locality among
related columns
Hopkins Storage Systems Lab, Department of Computer Science
Q
Join Cost
Joining results for queries
that access multiple
fragments
Access should be localized
to few fragments to minimize
join overhead
Hopkins Storage Systems Lab, Department of Computer Science
Q