Transcript Title Slide

Best Practices to Improve Query Performance in a
Data Warehouse - 1
Calisto Zuzarte, STSM, IBM, [email protected]
0
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Data Warehouse Life Cycle
Database design / Application design
– The Warehouse Application architects and Database
Administrators work together to design the queries and schema
before they put the application in production
Database performance layer implementation
– In order to meet SLAs, DBAs usual go through some iterations
augmenting the database with performance layer objects and
set up the initial configuration to get good performance
Database tuning operations
– During production, with changing requirements and change in
data, there is on-going tuning required to keep operations
smooth.
1
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Motivation
Data warehouse environments characteristics:
– Large volumes of data
• Millions/Billions of rows involved in some tables
• Large Joins
• Large Sorts,
• Large Aggregations
• Many tables involved
• Large amount of data rolled-in and rolled-out
– Complex queries
• Report Queries
• Ad Hoc Queries
It is important to pay attention to query performance
2
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Objective
Provide recommendations from a DB2
optimizer perspective to improve query
performance through the Data
Warehouse life cycle
3
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Agenda
SESSION 1
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Configuration and Operations
SESSION 2
Best Practices – Performance Layer
4
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – Database Design
Parallelism
– Inter-partition Shared nothing parallelism (DPF)
– Intra-Query Parallelism (SMP)
Partitioning
– Database Partitioning
– Table Partitioning
• Table (Range) Partitioning
• UNION ALL Views
– Multi-Dimension Clustering
Schema
5
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices - Parallelism
DPF or SMP or both ?
Database partition feature (DPF) is generally recommended
to achieve parallelism in a data warehouse
– Achieves scalability and query performance
SMP (Intra-Query Parallelism) is NOT recommended in
concurrent multi-user environments with heavy CPU usage
SMP is only recommended
– When CPUs are highly under utilized and when DPF is not an
option
6
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Partitioning (Complimentary Strategies in DB2)
Database Partitioning (DPF)
“Database Partitioning”
“Distribution Key”
– CREATE TABLE … DISTRIBUTE BY HASH
– Key Benefit : Better scalability and performance through
parallelism
Table Partitioning
“Table Partitioning”
“Table Partitioning Key”
– Table (Range) Partitioning
– CREATE TABLE …PARTITION BY RANGE
– Key Benefit : Better data management (roll-in and roll-out of data)
“UNION ALL branch
– UNION ALL Views
Partitioning”
– CREATE VIEW V AS (SELECT … FROM F1 UNION ALL … )
– Key Benefit : Independent branch optimization
“Cells”, “Blocks”,
“Dimensions”
Multidimensional Clustering (MDC)
– CREATE TABLE … ORGANIZE BY DIMENSION
– Key Benefit : Better query performance through data clustering
7
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Distribute By … Partition By … Organize By ..
CREATE TABLE …
DISTRIBUTE BY HASH
PARTITION BY RANGE
ORGANIZE BY DIMENSION
Database
Partition 1
8
Database
Partition 2
Database
Partition 3
TS1
TS2
TS1
TS2
TS1
TS2
Jan
Feb
Jan
Feb
Jan
Feb
North South
North South
North South
North South
North South
North South
East West
East West
East West
East West
East West
East West
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – DPF Partitioning
Collocate the fact and largest frequently joined dimension
Choose to avoid significant skew on some partitions
Avoid DATE dimension where active transactions for current
date all fall on one database partition (TIMESTAMP is good)
Possibilities for workload isolation for data marts
– Different partition groups but common dimension tables
– Recommend that dimension tables be replicated (discussed
later)
9
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – Table Partitioning
Recommend partitioning the fact tables
Recommend using the DATE dimension
Works better with application key predicates applied directly
Table (Range) Partitioning
– Consider partitioned indexes with V9.7
– Choose partitioning based on roll-in / roll-out granularity
UNION ALL Views
– Define view predicates or CHECK Constraints to get branch
elimination with query predicates (with constants only)
– Use UNION ALL views only with well designed applications
• Dangers of materialization with ad hoc queries
• Large number of branches needs time and memory to optimize
10
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – Multidimensional Clustering (MDC)
Recommend defining MDC on the fact table
– Guaranteed clustering (Avoids the need to REORG for
clustering)
– I/O optimization
– Compact indexes (compact, coexists with regular indexes)
Choose dimensions based on query predicates
– Recommend the use of 1 to 4 dimensions
– Need to ensure dimensions are chosen such that they do not
waste storage
Could choose a finer granularity of Table partitioning range
– For example: Table partition range by month, MDC by date
11
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Star Schema
STORE
PRODUCT
Store_id
Product_id
SALES
Region_id
…
TIME
Date_id
Month_id
Quarter_id
Year_id
12
Product_id
Store_id
Channel_id
Date_id
Amount
Quantity
…
Class_id
Group_id
Family_id
Line_id
Division_id
…
CHANNEL
Channel_id
…
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Dimension Hierarchy
Product Dimension
Division
Level 5
Time Dimension
Line
Level 4
Year
Family
Level 3
Group
Level 2
Class
Level 1
Product
Level 0
Store Dimension
Channel Dimension
Retailer
Channel
Store
Quarter
Month
Date
Sales Fact
13
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices - Schema
Surrogate Keys
– As far as possible use application keys themselves
• allows predicates to be applied/transferred directly on the fact table
• DATE is a good candidate (easier to roll-in/roll-out and for MDC )
Star Schema / Snowflakes
– Separate tables for each dimension hierarchy (snowflake) may
result in a large number of joins
– Flattened dimensions may contain a lot of redundancy (space)
Define Columns NOT NULL when appropriate
– Many optimizations that are done based on NOT NULL
Define Uniqueness when appropriate
– Primary Keys / Unique Constraints / Unique Indexes
14
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Agenda
SESSION 1
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Configuration and Operations
SESSION 2
Best Practices – Performance Layer
15
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Application Considerations - Expressions
Use constants instead of expressions in the query
– Example
• SELECT … WHERE DateCol <= CURRENT DATE – 5
• Use VALUES(CURRENT DATE – 5) to get the constant first and
use it in the query
Avoid expressions on indexed columns
– Example
• SELECT … WHERE DATECOL – 2 DAYS > ‘2009-10-22’
• SELECT … WHERE DATECOL > ‘2009-10-22’ + 2 DAYS
Similar recommendation with cast functions
– Example
• SELECT … WHERE INT(CHARCOL) = 2009
• SELECT … WHERE CHARCOL = ‘2009’
• Note you may lose Errors/Warnings
16
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Application Considerations – Table Partitioning / MDC
As far as possible put local predicates directly on Table
Partition or MDC dimension columns of the fact table
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F
WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and
T.Date = ‘2009-01-15’ and T.KEYCOL= F.TIMEKEYCOL
Simplify if the TIMEKEYCOL is correlated to the TIME values
(For example TIMEKEYCOL= 20090115 for the date ‘2009-01-15’)
SELECT ... FROM CUSTDIM C, FACT F
WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and
F.TIMEKEYCOL = 20090115
17
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Application Considerations – Table Partitioning / MDC
Another example … consider
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F
WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and
T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL
First get the values for MINKEY and MAXKEY
SELECT MIN(KEYCOL) FROM TIMEDIM WHERE YEAR=2009
SELECT MAX(KEYCOL) FROM TIMEDIM WHERE YEAR=2009
Then write the SQL as follows
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F
WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and
T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL AND
F.TIMEKEYCOL >= MINKEY AND
F.TIMEKEYCOL <= MAXKEY
18
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Application Considerations – General Recommendations
Avoid repetitions of complex expressions
Use Global Temporary Tables to split a query if it contains
more than about 15 tables and compile time is an issue
19
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Agenda
SESSION 1
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Configuration and Operations
SESSION 2
Best Practices – Performance Layer
20
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – Configuration and Operations
Configuration
– Database Configuration
– DBMS Configuration
– Registry Settings
Operations
– Collecting Statistics
21
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Configuration
Optimization Level 5
Avoid multiple bufferpools of the same page size
Configuration thumb rules
– BUFFPOOL ~= SHEAPTHRES
– SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)
22
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Registry Variables
DB2_ANTIJOIN=EXTEND
• If slow queries have NOT EXISTS, NOT IN predicates
23
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Registry Variables
DB2_REDUCED_OPTIMIZATION=YES
– Set if compile time is an issue
IBM Service may recommend a more complex setting for
example:
– DB2_REDUCED_OPTIMIZATION=10,15,20,00011000….
• First Part : DB2_REDUCED_OPTIMIZATION=A,B,C
– IF more than C joins, then "quick greedy"
– ELSE IF more than B joins, then use “greedy”
– ELSE IF more than A joins, use reduced “dynamic” strategy.
• Second Part not documented (Mainly intended for setting by
service)
24
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices
Optimization Level 5
BUFFERPOOL~=SHEAPTHRES
DB2_ANTIJOIN=EXTEND
DB2_REDUCED_OPTIMIZATION=YES
25
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Collecting Statistics
The DB2 Query Optimizer relies on reasonably accurate
statistics to get a good query plans
User runs RUNSTATS when data changes (part of ETL)
Statistics Fabrication (unreliable)
– DB2 keeps UPDATE / DELETE / INSERT counters
– Fabrication limited to a few statistics – Not enough
Automatic Statistics
– Automatically collects statistics on tables in need
– Runs in the background as a low priority job
Real Time Statistics
– Collects statistics on-the-fly
26
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
AUTO RUNSTATS
Set Under Automatic Table Maintenance hierarchy
– AUTO_RUNSTATS cannot be ON unless
AUTO_TBL_MAINT is ON
Automatic maintenance
Automatic database backup
Automatic table maintenance
Automatic runstats
Automatic statement statistics
Automatic statistics profiling
Automatic profile updates
Automatic reorganization
27
(AUTO_MAINT)
(AUTO_DB_BACKUP)
(AUTO_TBL_MAINT)
(AUTO_RUNSTATS)
(AUTO_STMT_STATS)
(AUTO_STATS_PROF)
(AUTO_PROF_UPD)
(AUTO_REORG)
=
=
=
=
=
=
=
=
ON
OFF
ON
ON
OFF
OFF
OFF
OFF
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
REAL TIME STATISTICS
Set Under Automatic Table Maintenance hierarchy
– Real Time Statistics cannot be ON unless AUTO
RUNSTATS is ON
– AUTO_RUNSTATS cannot be ON unless
AUTO_TBL_MAINT is ON
Automatic maintenance
Automatic database backup
Automatic table maintenance
Automatic runstats
Automatic statement statistics
Automatic statistics profiling
Automatic profile updates
Automatic reorganization
28
(AUTO_MAINT)
(AUTO_DB_BACKUP)
(AUTO_TBL_MAINT)
(AUTO_RUNSTATS)
(AUTO_STMT_STATS)
(AUTO_STATS_PROF)
(AUTO_PROF_UPD)
(AUTO_REORG)
=
=
=
=
=
=
=
=
ON
OFF
ON
ON
ON
OFF
OFF
OFF
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Best Practices – RUNSTATS
Distribution Statistics
– Collect large Quantile Statistics for Date columns
– Collect distribution statistics on columns used in predicates
Index Statistics
– Do not collect DETAILED INDEX statistics . Use SAMPLED
DETAILED INDEX statistics instead
Avoid statistics on columns you know will never be used in
predicates or GROUP BY columns
Use TABLESAMPLE option for very large tables and
statistical views
Use RUNSTATS Profiles to store customized invocations
RUNSTATS with ATTACH ?
COMMIT immediately after RUNSTATS of each table
29
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Collecting Statistics
Automatic RUNSTATS
Real Time Statistics
SAMPLED DETAILED INDEX
TABLESAMPLE
Selective column statistic specification
Use RUNSTATS PROFILES
30
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Summary
Tips and best practices to improve data warehouse query
performance have been discussed.
– Database Design
– Application Design
– Configuration and Operations
These include key considerations related to :
– Parallelism
– Partitioning
– Schema
– Application queries
– Configuration
Session 2 will cover the Performance Layer
31
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee
Disclaimer
© Copyright IBM Corporation [current year]. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES
ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE
INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT
PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM
SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE
RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS
PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR
REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND
CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR
SOFTWARE.
Please update paragraph below for the particular product or family brand trademarks you mention such as
WebSphere, DB2, Maximo, Clearcase, Lotus, etc
IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered
trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these
symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
If you have mentioned trademarks that are not from IBM, please update and add the following lines:
[Insert any special 3rd party trademark names/attributions here]
Other company, product, or service names may be trademarks or service marks of others.
32