Slides - University of California, Santa Cruz
Automatic Index Selection
for Shifting Workloads
Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
On-line Index Selection
A heavily used database needs indexes
But an evolving workload makes it tough
Popularity shifts over time
Applications come and go
The problem: on-line index selection
Maintain a dynamic set of indexes that
improves the performance of the current queries
Attempted On-line Solution
Can off-line techniques be adapted?
Off-line algorithms need a fixed query load
This is not given in the on-line problem
We could try:
  Loop:
    Gather query workload W    ← no obvious implementation
    Optimize indexes for W     ← can be very slow; index creation cost not considered
Our Solution
COLT: Continuous On-Line Tuning
Features:
A system for on-line index selection
Continuously tracks query load
Evaluates indexes with what-if optimizer
Adaptively controls overhead
Selects indexes based on recent trends
Prototype implementation in PostgreSQL
Problem Assumptions
We make the following assumptions
Data is in a relational database
Workload of SELECT statements
Restricted to single-column indexes
Given a budget for index storage
This problem still has major challenges
Lifting these assumptions in future work
System Architecture
[Architecture diagram: each query flows through the Parser and Optimizer to the Executor and the Database; COLT observes every query & plan, probes the Optimizer through a What-if Interface during each epoch, and issues CREATE INDEX ... / DROP INDEX ... commands between epochs.]
COLT Internal Organization
[Diagram: COLT tracks recent queries and candidate indexes, and evaluates candidates through the What-if Interface using each query & plan.]
Organizing the Queries
Similar queries are grouped in clusters
[Diagram: Queries 1-5 partitioned into Cluster 1 and Cluster 2.]
Each query is placed in a cluster on arrival
Aggregate benefit statistics for each cluster
Confidence intervals of past measurements
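The per-cluster bookkeeping above can be sketched as follows. This is a minimal illustration, not COLT's actual implementation: the class, `cluster_key`, and the normal-approximation interval are all assumptions for exposition.

```python
import math
from collections import defaultdict

class ClusterStats:
    """Aggregate benefit measurements for one query cluster (illustrative sketch)."""
    def __init__(self):
        self.n = 0           # number of measurements
        self.total = 0.0     # sum of observed benefits
        self.total_sq = 0.0  # sum of squared benefits

    def add(self, benefit):
        """Record one observed benefit measurement for this cluster."""
        self.n += 1
        self.total += benefit
        self.total_sq += benefit * benefit

    def confidence_interval(self, z=1.96):
        """Return (low, high) bounds on the mean benefit, normal approximation."""
        if self.n == 0:
            return (0.0, 0.0)
        mean = self.total / self.n
        var = max(self.total_sq / self.n - mean * mean, 0.0)
        half = z * math.sqrt(var / self.n)
        return (mean - half, mean + half)

def cluster_key(query_template, predicate_columns):
    """Group arriving queries by template shape and referenced columns (assumed scheme)."""
    return (query_template, frozenset(predicate_columns))

clusters = defaultdict(ClusterStats)  # cluster key -> aggregated statistics
```

Each arriving query would be mapped via `cluster_key` and its measured benefit folded into the matching `ClusterStats`, so later decisions can use the interval rather than a single noisy sample.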
Organizing the Indexes
[Diagram: a new candidate enters the Cold Set, may be promoted to the Hot Set and then the Materialized Set, or may be discarded.]
Cold Set: relevant, but not promising candidates; benefit measured with a crude metric
Hot Set: promising candidates for materialization; profiled accurately with what-if calls
Materialized Set: indexes materialized by COLT; disk storage must fit in the budget; profiled with reverse what-if calls
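The three-tier organization can be sketched as a small data structure. The class and method names are hypothetical; only the cold/hot/materialized split and the storage budget come from the slides.

```python
class IndexTiers:
    """Cold / Hot / Materialized candidate organization (illustrative sketch)."""
    def __init__(self, storage_budget):
        self.cold = {}          # index name -> crude benefit estimate
        self.hot = {}           # index name -> accurately profiled benefit
        self.materialized = {}  # index name -> (size, profiled benefit)
        self.storage_budget = storage_budget

    def add_candidate(self, index, crude_benefit):
        """Admit a relevant index to the Cold Set if its crude benefit is positive."""
        if crude_benefit > 0 and index not in self.hot and index not in self.materialized:
            self.cold[index] = crude_benefit

    def promote(self, index):
        """Move a promising cold candidate to the Hot Set for what-if profiling."""
        self.hot[index] = self.cold.pop(index)

    def materialized_size(self):
        """Total disk storage used; must stay within the budget."""
        return sum(size for size, _ in self.materialized.values())
```

Promotion and materialization decisions would then run at epoch boundaries over these three dictionaries.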
Key Challenges
Challenge: Selective Profiling
Which hot and materialized indexes
are profiled w.r.t. the current query?
Challenge: Index Selection
How are index candidates selected for
the cold, hot, and materialized sets?
Selective Profiling (1/2)
[Diagram: the incoming query is mapped to a cluster (C1-C4), which determines the indexes relevant for profiling. A random sample of the Cold Set is profiled; for the Hot and Materialized Sets, profiling focuses on indexes with uncertain benefit in the cluster.]
Selective Profiling (2/2)
Profiling budget:
max what-if calls per epoch
Budget is adjusted after each epoch
Set proportional to potential of hot indexes
Potential based on optimistic assumptions
Result:
  stable workload  ⇒ suspend profiling
  shifting workload ⇒ intensify profiling
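One way to realize "budget proportional to the potential of hot indexes" is the following sketch. The formula is illustrative only; the slides do not give COLT's exact rule, and `epoch_query_cost` is an assumed normalizer.

```python
def adjust_budget(max_calls, hot_potentials, epoch_query_cost):
    """Next epoch's what-if call budget, proportional to the optimistic
    benefit potential of the hot indexes (illustrative formula)."""
    if epoch_query_cost <= 0:
        return 0
    potential = sum(hot_potentials)  # optimistic per-index benefit estimates
    # Clamp the potential/cost ratio to [0, 1] and scale the hard cap.
    fraction = min(max(potential / epoch_query_cost, 0.0), 1.0)
    return round(max_calls * fraction)
```

With zero potential (a stable, well-tuned workload) the budget drops to zero and profiling is effectively suspended; a shift that raises the potential pushes the budget back toward the cap.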
Index Selection (1/3)
How are new candidates chosen?
Use a crude benefit metric
(new candidates enter the Cold Set)
Cost of index access vs. sequential scan
Approximate and cheap to compute
When each query arrives:
Compute crude benefit of relevant indexes
Indexes with benefit > 0 become candidates
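A crude metric of this kind might look like the sketch below. The cost model (pages for a sequential scan, one page fetch per matching row for the index) is an assumption chosen to be cheap to compute, not COLT's exact formula.

```python
def crude_benefit(table_pages, n_rows, selectivity):
    """Crude benefit of a single-column index for one predicate:
    estimated sequential-scan cost minus estimated index-access cost,
    measured in page reads (illustrative cost model)."""
    seq_scan_cost = table_pages
    # Pessimistic index cost: assume one page fetch per matching row.
    index_cost = max(1.0, n_rows * selectivity)
    return seq_scan_cost - index_cost
```

A highly selective predicate yields a positive benefit and the index becomes a candidate; an unselective one yields a negative value and is ignored.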
Index Selection (2/3)
Cold → Hot
How are hot indexes chosen?
At the end of each epoch:
  Get crude benefit of hot and cold indexes
  Find cut-off point to be in hot set
[Diagram: indexes ranked by crude benefit; the COLD/HOT cut-off point is derived from a two-cluster model.]
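A two-cluster model over one-dimensional benefit values can be computed exactly by trying every split point, as in this sketch (an assumed interpretation of the slide, equivalent to two-means in 1-D):

```python
def hot_cutoff(benefits):
    """Cut-off separating low ('cold') from high ('hot') crude benefits,
    via an exact two-cluster split minimizing within-cluster variance."""
    vals = sorted(benefits)
    if len(vals) < 2:
        return vals[0] if vals else 0.0
    best_split, best_score = 1, float("inf")
    for i in range(1, len(vals)):
        low, high = vals[:i], vals[i:]
        m1 = sum(low) / len(low)    # mean of the low cluster
        m2 = sum(high) / len(high)  # mean of the high cluster
        score = (sum((v - m1) ** 2 for v in low)
                 + sum((v - m2) ** 2 for v in high))
        if score < best_score:
            best_split, best_score = i, score
    # Place the threshold midway between the two clusters.
    return (vals[best_split - 1] + vals[best_split]) / 2
```

Indexes whose crude benefit exceeds the returned threshold would be promoted to (or kept in) the hot set.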
Index Selection (3/3)
Hot → Materialized
How is the materialized set chosen?
At the end of each epoch:
  Predict benefit of hot and materialized indexes
  [Diagram: observed benefit over time is extrapolated into predicted benefit.]
  Cost of materialization is discounted
  Select indexes to maximize total benefit
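Selecting indexes to maximize total benefit under a storage budget is a 0/1 knapsack problem, sketched below. The input tuple layout and the way creation cost is discounted are assumptions; the exact COLT formulation may differ.

```python
def choose_materialized(candidates, budget):
    """Pick the index set maximizing total predicted benefit within the
    storage budget, discounting creation cost for indexes not yet built
    (0/1 knapsack sketch; inputs are illustrative)."""
    # candidates: (name, size, predicted_benefit, creation_cost, is_materialized)
    items = []
    for name, size, benefit, create_cost, built in candidates:
        net = benefit if built else benefit - create_cost  # discount creation
        if net > 0:
            items.append((name, size, net))
    # Exact 0/1 knapsack via dynamic programming over used storage.
    best = {0: (0.0, frozenset())}  # used storage -> (total benefit, chosen set)
    for name, size, net in items:
        for used, (value, chosen) in list(best.items()):
            new_used = used + size
            if new_used <= budget:
                cand_value = value + net
                if cand_value > best.get(new_used, (-1.0, None))[0]:
                    best[new_used] = (cand_value, chosen | {name})
    return max(best.values(), key=lambda vc: vc[0])[1]
```

Already-materialized indexes pay no creation cost, so they are slightly favored, which dampens index thrashing between epochs.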
Performance of COLT
[Chart: query time over a 4-phase workload, split into Minimum Time, COLT Extra Time, and Off-line Extra Time.]
Experimented with a 4-phase workload
COLT adapts to each phase
Off-line chooses the best static index set
Overhead of Profiling
[Chart: profiling overhead per query over the four workload phases.]
Overhead peaks at the start of each phase
Decreases when the system is well tuned
Average < 1 what-if call per query
Closing Remarks
On-line index selection harder than off-line
Efficiency is a higher priority
Essentially need to guess the future
Index creation is an issue
COLT is our solution
Solves a constrained problem
Potentially extensible to other domains
Only some components would change
Thank You
Setting System Parameters
Epoch length = 10
What-if limit = 20
Or less if very worried about overhead
Averaging window
Used when predicting future benefit
Indicates the number of epochs that give
a good picture of the query distribution
Not easy to set
Other On-line Systems
Bruno and Chaudhuri, ICDE 2007
Avoids extra optimizer calls completely
Heuristics to capture index interaction
Very different mechanisms for index selection
Sattler, Schallehn, and Geist, IDEAS 2004
More similarities to COLT
Different prediction of future benefit
No control over profiling overhead
Performance with Noise
[Chart: performance vs. fraction of noise queries (20%-100%).]
The worst-case scenario for COLT: concentrated bursts of noise queries
Performance loss in some cases