Slides - University of California, Santa Cruz
Automatic Index Selection
for Shifting Workloads
Karl Schnaitter and Neoklis Polyzotis (UC Santa Cruz)
Serge Abiteboul (INRIA and University of Paris 11)
Tova Milo (University of Tel Aviv)
On-line Index Selection
A heavily used database needs indexes
But an evolving workload makes it tough
Popularity shifts over time
Applications come and go
The problem: on-line index selection
Maintain a dynamic set of indexes that
improves the performance of the current queries
Attempted On-line Solution
Can off-line techniques be adapted?
Off-line algorithms need a fixed query load
This is not given in the on-line problem
We could try:
  Loop:
    Gather query workload W    ← no obvious implementation
    Optimize indexes for W     ← can be very slow; index creation cost not considered
Our Solution
COLT: Continuous On-Line Tuning
Features:
A system for on-line index selection
Continuously tracks query load
Evaluates indexes with what-if optimizer
Adaptively controls overhead
Selects indexes based on recent trends
Prototype implementation in PostgreSQL
Problem Assumptions
We make the following assumptions
Data is in a relational database
Workload of SELECT statements
Restricted to single-column indexes
Given a budget for index storage
This problem still has major challenges
Lifting these assumptions in future work
System Architecture
[Architecture diagram: each query flows through the Parser and Optimizer to the Executor and the Database; COLT observes every query & plan, probes the Optimizer through a What-if Interface during each epoch, and issues CREATE INDEX ... / DROP INDEX ... commands between epochs.]
COLT Internal Organization
[Diagram: COLT tracks recent queries and candidate indexes, and evaluates candidates through the What-if Interface using each query & plan.]
Organizing the Queries
Similar queries are grouped in clusters
[Diagram: Queries 1-5 partitioned into Cluster 1 and Cluster 2.]
Each query is placed in a cluster on arrival
Aggregate benefit statistics for each cluster
Confidence intervals of past measurements
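The per-cluster bookkeeping above can be sketched as follows. This is a minimal illustration, not COLT's actual implementation: the class, `cluster_key`, and the normal-approximation interval are all assumptions for exposition.

```python
import math
from collections import defaultdict

class ClusterStats:
    """Aggregate benefit measurements for one query cluster (illustrative sketch)."""
    def __init__(self):
        self.n = 0           # number of measurements
        self.total = 0.0     # sum of observed benefits
        self.total_sq = 0.0  # sum of squared benefits

    def add(self, benefit):
        """Record one observed benefit measurement for this cluster."""
        self.n += 1
        self.total += benefit
        self.total_sq += benefit * benefit

    def confidence_interval(self, z=1.96):
        """Return (low, high) bounds on the mean benefit, normal approximation."""
        if self.n == 0:
            return (0.0, 0.0)
        mean = self.total / self.n
        var = max(self.total_sq / self.n - mean * mean, 0.0)
        half = z * math.sqrt(var / self.n)
        return (mean - half, mean + half)

def cluster_key(query_template, predicate_columns):
    """Group arriving queries by template shape and referenced columns (assumed scheme)."""
    return (query_template, frozenset(predicate_columns))

clusters = defaultdict(ClusterStats)  # cluster key -> aggregated statistics
```

Each arriving query would be mapped via `cluster_key` and its measured benefit folded into the matching `ClusterStats`, so later decisions can use the interval rather than a single noisy sample.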
Organizing the Indexes
[Diagram: a new candidate enters the Cold Set, may be promoted to the Hot Set and then the Materialized Set, or may be discarded.]
Cold Set: relevant, but not promising candidates; benefit measured with a crude metric
Hot Set: promising candidates for materialization; profiled accurately with what-if calls
Materialized Set: indexes materialized by COLT; disk storage must fit in the budget; profiled with reverse what-if calls
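The three-tier organization can be sketched as a small data structure. The class and method names are hypothetical; only the cold/hot/materialized split and the storage budget come from the slides.

```python
class IndexTiers:
    """Cold / Hot / Materialized candidate organization (illustrative sketch)."""
    def __init__(self, storage_budget):
        self.cold = {}          # index name -> crude benefit estimate
        self.hot = {}           # index name -> accurately profiled benefit
        self.materialized = {}  # index name -> (size, profiled benefit)
        self.storage_budget = storage_budget

    def add_candidate(self, index, crude_benefit):
        """Admit a relevant index to the Cold Set if its crude benefit is positive."""
        if crude_benefit > 0 and index not in self.hot and index not in self.materialized:
            self.cold[index] = crude_benefit

    def promote(self, index):
        """Move a promising cold candidate to the Hot Set for what-if profiling."""
        self.hot[index] = self.cold.pop(index)

    def materialized_size(self):
        """Total disk storage used; must stay within the budget."""
        return sum(size for size, _ in self.materialized.values())
```

Promotion and materialization decisions would then run at epoch boundaries over these three dictionaries.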
Key Challenges
Challenge: Selective Profiling
Which hot and materialized indexes
are profiled w.r.t. the current query?
Challenge: Index Selection
How are index candidates selected for
the cold, hot, and materialized sets?
Selective Profiling (1/2)
[Diagram: the incoming query is mapped to a cluster (C1-C4), which determines the indexes relevant for profiling. A random sample of the Cold Set is profiled; for the Hot and Materialized Sets, profiling focuses on indexes with uncertain benefit in the cluster.]
Selective Profiling (2/2)
Profiling budget:
max what-if calls per epoch
Budget is adjusted after each epoch
Set proportional to potential of hot indexes
Potential based on optimistic assumptions
Result:
  stable workload  ⇒ suspend profiling
  shifting workload ⇒ intensify profiling
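One way to realize "budget proportional to the potential of hot indexes" is the following sketch. The formula is illustrative only; the slides do not give COLT's exact rule, and `epoch_query_cost` is an assumed normalizer.

```python
def adjust_budget(max_calls, hot_potentials, epoch_query_cost):
    """Next epoch's what-if call budget, proportional to the optimistic
    benefit potential of the hot indexes (illustrative formula)."""
    if epoch_query_cost <= 0:
        return 0
    potential = sum(hot_potentials)  # optimistic per-index benefit estimates
    # Clamp the potential/cost ratio to [0, 1] and scale the hard cap.
    fraction = min(max(potential / epoch_query_cost, 0.0), 1.0)
    return round(max_calls * fraction)
```

With zero potential (a stable, well-tuned workload) the budget drops to zero and profiling is effectively suspended; a shift that raises the potential pushes the budget back toward the cap.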
Index Selection (1/3)
How are new candidates chosen?
Use a crude benefit metric
(new candidates enter the Cold Set)
Cost of index access vs. sequential scan
Approximate and cheap to compute
When each query arrives:
Compute crude benefit of relevant indexes
Indexes with benefit > 0 become candidates
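A crude metric of this kind might look like the sketch below. The cost model (pages for a sequential scan, one page fetch per matching row for the index) is an assumption chosen to be cheap to compute, not COLT's exact formula.

```python
def crude_benefit(table_pages, n_rows, selectivity):
    """Crude benefit of a single-column index for one predicate:
    estimated sequential-scan cost minus estimated index-access cost,
    measured in page reads (illustrative cost model)."""
    seq_scan_cost = table_pages
    # Pessimistic index cost: assume one page fetch per matching row.
    index_cost = max(1.0, n_rows * selectivity)
    return seq_scan_cost - index_cost
```

A highly selective predicate yields a positive benefit and the index becomes a candidate; an unselective one yields a negative value and is ignored.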
Index Selection (2/3)
Cold → Hot
How are hot indexes chosen?
At the end of each epoch:
  Get crude benefit of hot and cold indexes
  Find cut-off point to be in hot set
[Diagram: indexes ranked by crude benefit; the COLD/HOT cut-off point is derived from a two-cluster model.]
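A two-cluster model over one-dimensional benefit values can be computed exactly by trying every split point, as in this sketch (an assumed interpretation of the slide, equivalent to two-means in 1-D):

```python
def hot_cutoff(benefits):
    """Cut-off separating low ('cold') from high ('hot') crude benefits,
    via an exact two-cluster split minimizing within-cluster variance."""
    vals = sorted(benefits)
    if len(vals) < 2:
        return vals[0] if vals else 0.0
    best_split, best_score = 1, float("inf")
    for i in range(1, len(vals)):
        low, high = vals[:i], vals[i:]
        m1 = sum(low) / len(low)    # mean of the low cluster
        m2 = sum(high) / len(high)  # mean of the high cluster
        score = (sum((v - m1) ** 2 for v in low)
                 + sum((v - m2) ** 2 for v in high))
        if score < best_score:
            best_split, best_score = i, score
    # Place the threshold midway between the two clusters.
    return (vals[best_split - 1] + vals[best_split]) / 2
```

Indexes whose crude benefit exceeds the returned threshold would be promoted to (or kept in) the hot set.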
Index Selection (3/3)
Hot → Materialized
How is the materialized set chosen?
At the end of each epoch:
  Predict benefit of hot and materialized indexes
  [Diagram: observed benefit over time is extrapolated into predicted benefit.]
  Cost of materialization is discounted
  Select indexes to maximize total benefit
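Selecting indexes to maximize total benefit under a storage budget is a 0/1 knapsack problem, sketched below. The input tuple layout and the way creation cost is discounted are assumptions; the exact COLT formulation may differ.

```python
def choose_materialized(candidates, budget):
    """Pick the index set maximizing total predicted benefit within the
    storage budget, discounting creation cost for indexes not yet built
    (0/1 knapsack sketch; inputs are illustrative)."""
    # candidates: (name, size, predicted_benefit, creation_cost, is_materialized)
    items = []
    for name, size, benefit, create_cost, built in candidates:
        net = benefit if built else benefit - create_cost  # discount creation
        if net > 0:
            items.append((name, size, net))
    # Exact 0/1 knapsack via dynamic programming over used storage.
    best = {0: (0.0, frozenset())}  # used storage -> (total benefit, chosen set)
    for name, size, net in items:
        for used, (value, chosen) in list(best.items()):
            new_used = used + size
            if new_used <= budget:
                cand_value = value + net
                if cand_value > best.get(new_used, (-1.0, None))[0]:
                    best[new_used] = (cand_value, chosen | {name})
    return max(best.values(), key=lambda vc: vc[0])[1]
```

Already-materialized indexes pay no creation cost, so they are slightly favored, which dampens index thrashing between epochs.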
Performance of COLT
[Chart: query time over a 4-phase workload, split into Minimum Time, COLT Extra Time, and Off-line Extra Time.]
Experimented with a 4-phase workload
COLT adapts to each phase
Off-line chooses the best static index set
Overhead of Profiling
[Chart: profiling overhead per query over the four workload phases.]
Overhead peaks at the start of each phase
Decreases when the system is well tuned
Average < 1 what-if call per query
Closing Remarks
On-line index selection harder than off-line
Efficiency is a higher priority
Essentially need to guess the future
Index creation is an issue
COLT is our solution
Solves a constrained problem
Potentially extensible to other domains
Only some components would change
Thank You
Setting System Parameters
Epoch length = 10
What-if limit = 20
Or less if very worried about overhead
Averaging window
Used when predicting future benefit
Indicates the number of epochs that give
a good picture of the query distribution
Not easy to set
Other On-line Systems
Bruno and Chaudhuri, ICDE 2007
Avoids extra optimizer calls completely
Heuristics to capture index interaction
Very different mechanisms for index selection
Sattler, Schallehn, and Geist, IDEAS 2004
More similarities to COLT
Different prediction of future benefit
No control over profiling overhead
Performance with Noise
[Chart: performance vs. fraction of noise queries (20%-100%).]
The worst-case scenario for COLT: concentrated bursts of noise queries
Performance loss in some cases