A Paradigm Shift in Database Optimization: From Indices to

Transcript A Paradigm Shift in Database Optimization: From Indices to

A Paradigm Shift in Database
Optimization:
From Indices to Aggregates
Presented to:
The Data Warehousing & Data Mining mini-track – AMCIS 2002 as
Research-in-Progress
Ryan LaBrie • Robert St. Louis • Lin Ye
Arizona State University
[email protected] • [email protected] • [email protected]
Agenda





A need for a shift in optimization
strategy
What our research is focusing on
How we performed this research
Update on our results
Next steps
Why a Shift, Why Now?

HISTORICALLY

Relational database technology is really
good at what it does…




Transaction-oriented, operational systems
Optimized for data INPUT
FOCUS: Storage of DATA
TODAY’S ENVIRONMENT

Large Data Warehouses



Used for decision support
Need to be optimized for information OUTPUT
FOCUS: Retrieval of INFORMATION
The Decision Support Problem

Relational DBMS limitations

Too much data


Too complex queries


Tera- and petabytes, quickly approaching exabytes
Structured Query Language
Too long for results

Indexing limitations


Usage of (i.e. Table Scans)
B+ Trees
A Possible Decision Support Solution

Multidimensional Databases
New effective storage techniques
 Simpler modeling techniques
 Potential for easier query interfaces
and


Intelligent Aggregation


Appropriate use of redundancy
More effective indexing algorithms

Bitmapped indices
The Focus of Our Research

CURRENT RESEARCH
Cost comparisons of Relational vs.
Multidimensional Decision Support Systems
Working towards a multidimensional
benchmarking system
1.
2.



TPC-H is positioned as a Decision Support benchmark,
however it is based on relational technologies
GOAL: Vendor neutral benchmark for comparing
multidimensional database products
FUTURE RESEARCH
In the long term, show that decisions can be
made more easily with multidimensional
technology


Simpler design, simple interfaces, faster responses
Why Develop a Multidimensional
Benchmark?

Benchmarking is an established method for creating
vendor neutral tests


Benchmarking has been examine in other IS fields
including



Server Platforms: Johnson & Gray, 1993
eCommerce: Menasce, 2002
It has been called for specifically in the data
warehousing academic community


Transaction Processing Performance Council (TPC)
Nemati et al., 2000
and
Has yet to be done
How Are We Building Our Benchmark





Based on the TPC-H relational decision
support benchmark
Create a relational dimensional model
that forms the basis for the data mart
Build a multidimensional cube off the
dimensional model
Convert the SQL statement to the
equivalent MDX
Run both the SQL query and the MDX
query, report results
What We Have Done To Date

Initially have mapped all 22 TPC-H
relational queries to potential data marts





3-4 data marts necessary
Built 2 TPC-H data sets (1GB and 10GB)
Converted TPC-H Query #4 to MDX
Ran comparisons on both data sets
In the process of converting a second
query (TPC-H Query #7) for additional
analysis/confirmation of gains
TPC-H: Query #4 – Relational SQL
Typical Decision Support Request: Answers the questions, “How many orders were
delivered late in Quarter 3 of 1993, sorted by priority?”
SELECT o_orderpriority,
COUNT(*) AS order_count
FROM orders
WHERE o_orderdate >= '1993-07-01'
AND o_orderdate < '1993-10-01'
AND EXISTS
(SELECT *
FROM lineitem
WHERE l_orderkey = o_orderkey
AND l_commitdate < l_receiptdate)
GROUP BY o_orderpriority
ORDER BY o_orderpriority
lineitem
L_ORDERKEY
L_PARTKEY
L_SUPPKEY
L_LINENUMBER
L_QUANTITY
L_EXTENDEDPRICE
L_DISCOUNT
L_TAX
L_RETURNFLAG
L_LINESTATUS
L_SHIPDATE
L_COMMITDATE
L_RECEIPTDATE
L_SHIPINSTRUCT
L_SHIPMODE
L_COMMENT
part
P_PARTKEY
P_NAME
P_MFGR
P_BRAND
P_TYPE
P_SIZE
P_CONTAINER
P_RETAILPRICE
P_COMMENT
partsupp
PS_PARTKEY
PS_SUPPKEY
PS_AVAILQTY
PS_SUPPLYCOST
PS_COMMENT
customer
C_CUSTKEY
C_NAME
C_ADDRESS
C_NATIONKEY
C_PHONE
C_ACCTBAL
C_MKTSEGMENT
C_COMMENT
supplier
S_SUPPKEY
S_NAME
S_ADDRESS
S_NATIONKEY
S_PHONE
S_ACCTBAL
S_COMMENT
NATION
N_NATIONKEY
N_NAME
N_REGIONKEY
N_COMMENT
REGION
R_REGIONKEY
R_NAME
R_COMMENT
orders
O_ORDERKEY
O_CUSTKEY
O_ORDERSTATUS
O_TOTALPRICE
O_ORDERDATE
O_ORDERPRIORITY
O_CLERK
O_SHIPPRIORITY
O_COMMENT
TPC-H: Query #4 – Multidimensional
Expression (MDX) Equivalent
SELECT
{[Measures].[O Latecount]} ON COLUMNS,
{[PriorityDim].children} ON ROWS
FROM Q4Cube
WHERE ([TimeDim].[All TimeDim].[1993].[Quarter 3])
The Database Costs Dilemma
Disk
Space?
Build
Time?
Query
Speed?
Results To Date (Query Speed)
TPC-H
Query 4
1 GB Dataset
10 GB Dataset
Multidimensional
0.33 seconds
0.33 seconds
Relational
46.6 seconds
(140x slower)
925 seconds (~15.5 min)
(~2800x slower)
Relational (optimized
w/Indices)
38 seconds
(114x slower)
Relational (optimized
w/Indices & Striping)
26 seconds
(78x slower)
Test not run
247 seconds (~4.0 min)
(~750x slower)
Results To Date (Other Measures)
TPC-H
Query 4
1 GB Dataset
10 GB Dataset
Relational DB
1.2 GB
12.5 GB
Relational DB
1.8 GB
(w/Indices)
Multidimensional .16 MB
Cube Size
Multidimensional 46 seconds
Cube Build Time
28.9 GB
.16 MB
356 seconds
(~6 minutes)
Preliminary Conclusions



For a very modest investment
organizations will be able to process
very large data warehouses
The multidimensional data mart is the
only practical (speed, processing
time) way to support the end-user
decision maker.
Aggregation truly is a substitute for
expensive hardware
Next Steps







Acquire a larger server
Build 100GB and 300GB TPC-H data sets
Benchmark both relational and dimensional
queries
Publish results
Consider ROLAP, HOLAP, MOLAP issues
Possible extensions to some data mining
research
Possible extensions to decision making
through technology research
Thank You for Your Time
Questions?
[email protected]
www.public.asu.edu/~rlabrie
(for this presentation and paper)
Appendix A: Current System
SOFTWARE

Microsoft Windows
2000 Advanced
Server

Microsoft SQL
Server 2000
Enterprise Edition

Microsoft SQL
Server 2000 Analysis
Services Enterprise
Edition
HARDWARE

(1) 1.8GHz Intel
Pentium 4 processor

768MB RAM

240GB HD space (3
IDE 80GB 7200RPM
Drives)

Total cost: $1100
(Hardware only)

A Paradigm Shift in Database Optimization: From Indices to

Transcript A Paradigm Shift in Database Optimization: From Indices to

Directory