
Developing a Characterization of
Business Intelligence Workloads for
Sizing New Database Systems
Ted J. Wasserman (IBM Corp. / Queen’s University)
Pat Martin (Queen’s University)
David B. Skillicorn (Queen’s University)
Haider Rizvi (IBM Canada)
Outline
- Background and Motivation
- Workload Characterization Analysis
- Results
- Future work
What is Sizing?
- Estimating the amount of physical computing resources needed to support a new workload
  - Processor (CPU), disk, memory
- Relies on simplifying assumptions, extrapolations, estimates, projections, rules of thumb, prior experience, etc.
  - Due to the lack of available information about the new application
- Sizing is more of an art than a science
Our Sizing Approach
1) Collect the required high-level input data from the customer
2) Cross-check and verify the input data, making assumptions and estimates if needed
3) Determine the required system resource demands for each workload class and type
4) Aggregate the resource demands of the different workload types and classes to determine the overall requirements
5) Determine which hardware configurations will meet the required resource demands
6) Produce a ranked list of hardware configurations

* For more details, see: Wasserman, T.J., Martin, P., Rizvi, H. Sizing DB2 Servers for Business Intelligence Workloads. In Proc. of CASCON 2004, October 2004, Toronto, Canada.
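As a rough illustration of steps 3 and 4, the sketch below aggregates per-class resource demands into an overall requirement. The class names, per-query costs, and workload mix are hypothetical placeholders, not figures from the talk.

```python
# Minimal sketch of steps 3-4: aggregate per-class resource demands.
# All numbers and class names are hypothetical placeholders.

# Per-query resource cost of each workload class (CPU-seconds, MB read).
CLASS_COSTS = {
    "trivial":  {"cpu_s": 0.5,   "io_mb": 50},
    "simple":   {"cpu_s": 5.0,   "io_mb": 500},
    "moderate": {"cpu_s": 30.0,  "io_mb": 5_000},
    "complex":  {"cpu_s": 120.0, "io_mb": 20_000},
}

def size_workload(mix, peak_hour_queries):
    """Estimate peak-hour CPU-seconds and IO volume for a query mix.

    mix: fraction of queries in each class (should sum to ~1.0).
    peak_hour_queries: expected number of queries in the peak hour.
    """
    total = {"cpu_s": 0.0, "io_mb": 0.0}
    for cls, fraction in mix.items():
        n = fraction * peak_hour_queries
        total["cpu_s"] += n * CLASS_COSTS[cls]["cpu_s"]
        total["io_mb"] += n * CLASS_COSTS[cls]["io_mb"]
    return total

# Example: a customer describes their workload as 50% trivial,
# 30% simple, 15% moderate, 5% complex, at 1000 queries per peak hour.
demand = size_workload(
    {"trivial": 0.50, "simple": 0.30, "moderate": 0.15, "complex": 0.05},
    peak_hour_queries=1000,
)
print(demand)  # compare against candidate hardware configurations (step 5)
```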
Motivation
- Problem: How does a customer describe the workload of their new application?
  - May not know the exact queries yet
  - No production-level measurements available
  - Only vague, high-level information
- Solution: Study the characteristics of a proxy workload (TPC-H) and have the customer describe their workload in terms of the approximate performance goals and mix of the different classes of queries inherent in the proxy workload
Workload Characterization Analysis
- Partition queries into a few general classes based on their resource usage
  - Need to keep the partitioning simple so that the customer can understand and relate to the classes
  - Each class comprises the queries that are similar to each other based on resource usage and other characteristics
Data Collection
- Data from 5 recent TPC-H benchmark power runs used (pre-audited runs)
- Hardware configurations and database scales varied across benchmarks
  - Balanced system configurations were used
- Data collected using standard OS monitoring tools at 5-second intervals for each query
Parameter Selection
- Query parameters monitored and used in the analysis:
  - Query response time (seconds)
  - Average (user) CPU utilization
  - Average MB/second rate
  - Average IO/second rate
  - Size of largest n-way table join
- The above set was sufficient for our analysis
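To make the later steps concrete, here is a small sketch of assembling these parameters into a query-by-parameter matrix, the input to the normalization and decomposition that follow; the values are invented for illustration, not benchmark measurements.

```python
import numpy as np

# Each row is one TPC-H query; columns are the monitored parameters:
# response time (s), avg user CPU %, avg MB/s, avg IO/s, largest n-way join.
# The values below are invented placeholders, not benchmark measurements.
PARAMS = ["resp_s", "cpu_pct", "mb_per_s", "io_per_s", "join_size"]
A = np.array([
    [120.0, 85.0, 300.0, 2500.0, 2],   # Q1
    [ 10.0, 40.0,  80.0,  900.0, 5],   # Q2
    [ 95.0, 70.0, 250.0, 2100.0, 3],   # Q3
    # ... one row for each query in the power run
])
```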
Data Normalization
- Within each benchmark run, the data was normalized
  - Each benchmark data set was transformed to one with a mean of 0 and a standard deviation of 1
  - The normalized query data from each benchmark was then combined and used for analysis
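A minimal NumPy sketch of this per-run normalization, assuming each run's measurements are held in an array with one row per query:

```python
import numpy as np

def zscore(run: np.ndarray) -> np.ndarray:
    """Transform one benchmark run to zero mean and unit std per parameter."""
    return (run - run.mean(axis=0)) / run.std(axis=0)

# Normalize each run separately, then combine the runs for the analysis.
runs = [np.random.rand(22, 5) for _ in range(5)]  # placeholder data
A = np.vstack([zscore(r) for r in runs])
```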
Partitioning Techniques
- Goal: Partition the workload into classes or clusters so that objects within a cluster are similar to each other, but are dissimilar to objects in other clusters
- Singular Value Decomposition (SVD) & SemiDiscrete Decomposition (SDD)
  - Matrix decomposition techniques
  - Unsupervised data mining
  - Good at revealing underlying or 'hidden' factors in data typical of real-world processes
Singular Value Decomposition (SVD)
- A matrix A can be decomposed as A = U S V^T where, for an n x r matrix A, U is n x r, S is an r x r diagonal matrix whose entries are decreasing (the singular values), V is r x r, and U and V are orthogonal
- The singular values indicate how important each new dimension is in representing the structure of A
Singular Value Decomposition (2)
- Can be regarded as transforming the original space to new axes such that as much variation as possible is expressed along the first axis, as much as possible of what remains along the second, and so on
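A brief NumPy sketch of this transformation; the placeholder matrix and variable names are ours, standing in for the combined normalized benchmark data:

```python
import numpy as np

# A: normalized query-by-parameter matrix (one row per query);
# random placeholder data standing in for the combined benchmark runs.
rng = np.random.default_rng(0)
A = rng.standard_normal((22, 5))

# Full SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Share of the structure captured by each new dimension.
print(s**2 / (s**2).sum())

# Coordinates of each query on the new axes; queries that land close
# together on the leading axes are candidates for the same class.
coords = U * s
```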
SemiDiscrete Decomposition (SDD)
- The SDD of A is given by A = X D Y, where X is n x k, D is a k x k diagonal matrix, and Y is k x m (for arbitrary k)
- The matrices X and Y have entries that are only -1, 0, or +1
SemiDiscrete Decomposition (2)
- For any decomposition, the product of the ith column of X, the ith diagonal element of D, and the ith row of Y is a matrix of the same shape as A
- In SDD, each of these layer matrices describes a 'bump' in the data, a region (not necessarily contiguous) of large magnitude
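The layer idea can be shown directly. The sketch below hand-builds tiny X, D, and Y factors with entries in {-1, 0, +1} and reconstructs the layer matrices; it illustrates the structure of the decomposition only, not the iterative algorithm that actually computes an SDD.

```python
import numpy as np

# Hand-built SDD factors for illustration (entries of X and Y are in
# {-1, 0, +1}); a real SDD is fitted by an iterative algorithm.
X = np.array([[1,  0],
              [1,  1],
              [0, -1]])          # n x k
D = np.diag([5.0, 2.0])          # k x k
Y = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])     # k x m

# Each layer d_i * x_i y_i^T is an n x m matrix marking one 'bump'.
for i in range(D.shape[0]):
    layer = D[i, i] * np.outer(X[:, i], Y[i, :])
    print(f"layer {i}:\n{layer}")

A = X @ D @ Y  # the layers sum to the reconstructed matrix
```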
Analysis Results
[Figure: scatter plot of the queries in the transformed space, with the four clusters labeled C1, C2, C3, and C4]
Results
- Four clusters of queries:
  - Cluster 1 ("Moderate Complexity"): Q1, Q3, Q4, Q5, Q6, Q11, Q14, Q19
  - Cluster 2 ("Simple Complexity"): Q2, Q20
  - Cluster 3 ("High Complexity"): Q7, Q8, Q9, Q18, Q21
  - Cluster 4 ("Trivial Complexity"): Q10, Q13, Q15, Q16, Q22
Results (2)
- Queries appear to scale well across different system architectures and database sizes
- Attempt to understand the meaning of the new "dimensions" of the SVD analysis:
  - U1: CPU-bound vs. IO-bound queries
  - U2: Query response times
  - U3: Sequential-IO-intensive vs. random-IO-intensive queries
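One common way to back such interpretations, sketched here under our own (assumed) parameter names, is to correlate each column of U with the original normalized parameters; a dimension whose scores track CPU utilization positively and IO rate negatively would support the 'CPU-bound vs. IO-bound' reading:

```python
import numpy as np

# Continuing the SVD sketch: A is the normalized query matrix and U its
# left singular vectors (random placeholder data again).
rng = np.random.default_rng(0)
A = rng.standard_normal((22, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

params = ["resp_s", "cpu_pct", "mb_per_s", "io_per_s", "join_size"]
for dim in range(3):  # U1, U2, U3
    corrs = [np.corrcoef(U[:, dim], A[:, j])[0, 1]
             for j in range(A.shape[1])]
    print(f"U{dim + 1}:",
          ", ".join(f"{p}={c:+.2f}" for p, c in zip(params, corrs)))
```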
Future Work
- Perform the analysis on a larger set of data
- Use a more robust/representative workload
- Extend to other workload types (e.g., OLTP)
Fin
Thank you.