Escape from the Data Warehouse Death March
A Narrow Escape:
How We Avoided the Data
Warehouse “Death March”
Northern California Oracle Users Group
November 15, 2001
Chris Lawson
home.mindspring.com/~karo3
What is a “Death March” ?
Death March projects use a “forced march
imposed upon relatively innocent victims, the
outcome of which is usually a high casualty rate.”
Edward Yourdon, 1997, Death March:
The Complete Software Developer’s Guide to
Surviving “Mission Impossible” Projects
Some “Encouraging” Words
Yourdon: “A large percentage of the
projects associated with start-up
companies are death march
projects. A large percentage of
these projects will fail.”
• DW projects have a terrible reputation,
often producing mad customers and a
frustrated project team.
Death March,
Death March!
• Many seek to “jump ship.”
Further “Words of Encouragement”
By the time the data sources have
been identified and transformed, the
in-house customers have often lost
interest.
• Approximately 50% of large DW projects “fail to
meet the desired levels of success,” often failing
on first or even second attempts.
CIO magazine, “The Middle Ground,” 1/15/99.
Traditional Approach: Will it Work?
• For large information systems,
development has traditionally been an
extremely structured process.
• Many days are spent up front on requirements
analysis, design reviews, etc.
• The strategy for these types of projects is
to invest much time early, when mistakes
are cheaper to fix.
Traditional Software Project
• For large government-funded
projects, this is reasonable
because requirements are fixed.
• The customer may even mandate standards, such as
those of the SEI (Software Engineering Institute).
• These standards require strict requirement
analysis and rigorous design reviews.
• Emphasis is on repeatable processes and
continuous improvement.
Change in “Ground Rules”
• The strategy described above often does not
work well with large, private-sector projects,
such as Data Warehouses.
• The “ground rules” defining a successful
project are different.
• For example, in contrast to government
projects, Data Warehouse requirements are
negotiable, elusive, and will probably change.
Change in Ground Rules (cont’d)
“Just when you are about to deliver,
expect the rules to change – then
change again.”
Steve Tracy, Intelligent Enterprise,
“Close the Loop,” March 27, 2001.
In other words, the entire nature of the project
development cycle is fluid.
Our Escape Route: Prototyping
The data warehouse designed for
DigitalThink used a prototype
strategy with a very restricted scope.
• Early phases were very limited in scope.
• We deliberately followed a simpler
development process.
• The Prototype approach allows early
demonstration of a working model with some
features.
Key to Prototyping Benefits
• With scope limited to only six key
reports and one data source, our
early successes provided
confidence for later stages.
• Prototyping provides the opportunity
to quickly uncover “show stoppers.”
• Roadblocks will be found
quickly.
Another Benefit: Credibility
• Prototyping allows the team to gain support
among the potential users.
• The working prototype enabled the team to
perform several “demos” of key reports.
• These demos provided “buy-in” from
interested parties.
• The demos also proved that the project was
“on track”; we gained credibility.
Design Overview
• Like most Data Warehouses, DigitalThink
used a process called ETL (Extraction,
Transformation, and Loading).
• Nightly runs perform data extraction,
transformation, and loading from the
production Sybase system into Oracle 8i.
• Only the changed production data is inserted,
as sketched below.
• Data is transformed into the appropriate ‘star’
schema format using the E.piphany ETL tool.
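A minimal plain-SQL sketch of the “changed data only” idea (the real transformation was done by the E.piphany tool; all table, column, and database-link names here are hypothetical):

  -- Nightly incremental load: copy only the rows changed since yesterday
  INSERT INTO stage_activity (activity_id, customer_id, amount, modified_date)
  SELECT activity_id, customer_id, amount, modified_date
  FROM   prod_activity@prod_link               -- production source via DB link
  WHERE  modified_date >= TRUNC(SYSDATE) - 1;  -- changed rows only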
[Architecture diagram: Production Sybase (Sun E4500 cluster) → Load Program → STAGING Oracle → E.piphany ETL Program → REPORTING Oracle on the Reporting Server (Sun E-3500 cluster), with Actuate e.Reporting Server and Actuate ReportCast; reports are delivered through the Web Server, Apache 1.3.12 on two Sun Ultra 2s.]
Reporting System
• The application-server layer was actually a
Report Server in our case.
• Reports are produced by the Actuate reporting
system.
• Actuate reads the pre-defined report
definition to build a database query.
• It then executes the query on the Oracle 8i
database and adds formatting and graphics.
• Finally, it forwards the report to the Apache web
server for delivery to the client’s web browser.
Example of Report
Database Surprises
• Surprisingly(?), some of the newest Oracle
features were a mixed blessing. (I wasn’t surprised.)
• Star Transformation did improve
performance of some reports, but
degraded others.
• The feature was turned off when a bug caused
different results from the same query!
More on Star Schemas later!
Report “Factories”
• Some queries require extensive processing.
• This presents risk that numerous reports might be
requested simultaneously, degrading the system.
• This risk was mitigated via “factories,” which
restrict the number of simultaneous queries.
• Reports are assigned to particular factories based
on expected run time.
• The factory for the typical report was configured
to allow execution of two reports per CPU.
A Big Challenge: Lumpy Data
• Certain customers accounted for a large
proportion of the data. This presented a
performance-tuning dilemma.
• Time for reports ranged from a few seconds
for most to a few minutes for some customers.
• Histograms were no good because a ‘big’ customer
looks like a ‘regular’ account (at first).
How can performance be optimized
given this huge variance?
Oracle Trick: “Query Rewrite”
• A recent database feature uses “materialized
views” (Snapshots) with “Query Rewrite.”
• Each night, aggregate data is stored in a
materialized view for the big accounts (see the
sketch below).
• Then, whenever these customers submit a
report, the optimizer rewrites the database
query to use the special aggregate data.
• Another help was “pre-running reports” for big
customers; the Report Server then uses the
pre-built report instead of a new query.
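A minimal sketch of the materialized-view approach, assuming a hypothetical fact table and columns (the actual DigitalThink schema differed):

  -- Nightly aggregate for the big accounts (all names hypothetical)
  CREATE MATERIALIZED VIEW big_account_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    ENABLE QUERY REWRITE
  AS
  SELECT customer_id, period_id,
         SUM(amount) AS total_amount, COUNT(*) AS row_count
  FROM   fact_activity
  GROUP  BY customer_id, period_id;

  -- Let the optimizer rewrite matching queries to use the aggregate
  ALTER SESSION SET query_rewrite_enabled = TRUE;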
Star Schema Queries
Let’s look at how Oracle uses Star Schemas
[Star schema diagram:]
• SALES (Fact table): 100 million rows
• Dimension tables: FACTORY (10 rows), STORE (100 rows),
TIME_PERIOD (365 rows), PRODUCT (500 rows)
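As a concrete illustration, a minimal DDL sketch of such a star schema (the column names are invented here, later examples reuse them, and the Dimension tables are assumed to already exist with matching primary keys):

  -- Hypothetical Fact table carrying one foreign key per Dimension
  CREATE TABLE sales (
    sale#        NUMBER,
    amount       NUMBER,
    time_key     NUMBER REFERENCES time_period (time_key),
    factory_key  NUMBER REFERENCES factory (factory_key),
    product_key  NUMBER REFERENCES product (product_key),
    store_key    NUMBER REFERENCES store (store_key)
  );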
Star Schema Performance
• Star Schemas are very useful, but good
performance can be elusive when joining.
• There are always the “normal” ways of joining:
Hash Join, Nested Loop, Sort/Merge.
• But--the Oracle optimizer has some “tricks” to
use when joining tables in Star Schemas.
• These two tricks are called “Star Join” and
“Star Transformation.”
• These two methods are NOT the same, and
use drastically different ideas.
Intro to SQL Joins with Stars
• The issue: what if the selection criteria are
spread among the Dimension tables?
• That is, the query as a whole returns only a
few rows, but the WHERE clause on each
Dimension table is not very restrictive.
• This presents a special type of problem, and
the join methods can have drastically
different performance, as we will see.
Failure to understand this issue will
cause bad things to happen.
Example: Typical Query
Select Sales.Sale#
From Sales, Time_Period, Factory, Product, Store
Where Time_Period.Quarter  = ‘Q1-2000’
  and Factory.Manuf_Name   = ‘Coleman’
  and Product.Description  = ‘Camping Gear’
  and Store.City           = ‘Elk Grove’;
(Join predicates to the Fact table are omitted for brevity.)
The query as a whole returns only a few rows,
but any one criterion alone would return many rows.
The Dilemma
• We only want the final result set,
not all the rows we have to “wade
through” to get there.
• How can we “jump” to the final result set?
• It seems like no matter how we join the
tables, we must process many rows. The
earliest joins must “pay the price.”
• A solution seems impossible! Or is it?
Oracle’s First Answer: STAR Join
• The optimizer recognizes the Star setup and
bypasses the usual join methods.
• Prior to joining to the Fact table, the optimizer
jointly considers the dimension constraints.
• It builds a list of all possible combinations
(the Cartesian product) of the Dimension rows
that meet the selection criteria.
• This small set of rows is then used to access the
Fact table via a (B*-tree) composite index, as
sketched below.
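A sketch of such an index, reusing the hypothetical key columns from the earlier DDL sketch:

  -- 4-column composite B*-tree index on the Fact table for the Star Join
  CREATE INDEX sales_star_ix
    ON sales (time_key, factory_key, product_key, store_key);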
Star Join Example
Step 1: Find the Cartesian product of the Dimension
rows that meet the search criteria:
  Quarter = ‘Q1-2000’       90 rows
  Manuf   = ‘Coleman’       x 1 row
  Product = ‘Sleeping Bag’  x 1 row
  Store   = ‘Boston’        x 1 row
                            Total: 90 rows
Step 2: Using this set of 90 rows, access the
Sales table via the 4-column index.
STAR Join: Complications
• Requires huge composite index on Fact table.
• Multiple huge indexes will be necessary so
that “leading columns” rule will always be met.
• The Cartesian product can be huge. Example:
  Quarter = ‘Q1-2000’       90 rows
  Manuf like ‘C%’           x 10 rows
  Product like ‘Sleeping%’  x 10 rows
  Store_Area = ‘East’       x 20 rows
                            = 180,000 rows
Thus, a Star Join would require 180,000 index lookups.
Oracle’s Second Answer:
STAR Transformation
The “trick”: use the inherent speed of
combining bitmaps.
• Instead of figuring all possible combinations,
the combining is done at the bitmap level.
• A bitmap index is needed for each foreign key in
the Fact table pointing to a Dimension table
(see the sketch below).
• Set Star_Transformation_Enabled = True.
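A minimal setup sketch, reusing the hypothetical key columns from the earlier DDL:

  -- One bitmap index per Fact-table foreign key
  CREATE BITMAP INDEX sales_time_bix    ON sales (time_key);
  CREATE BITMAP INDEX sales_factory_bix ON sales (factory_key);
  CREATE BITMAP INDEX sales_product_bix ON sales (product_key);
  CREATE BITMAP INDEX sales_store_bix   ON sales (store_key);

  -- Enable the feature (also settable as an init.ora parameter)
  ALTER SESSION SET star_transformation_enabled = TRUE;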
Example of STAR Transformation
Consider again the criteria: Quarter = ‘Q1-2000’ AND
Manuf like ‘C%’ AND Product like ‘Sleeping%’ AND
Store_Area = ‘East’.
Step 1: Scan each Dimension table to find rows
that match the search criteria, and get their primary keys:
  QUARTER: 90 rows {‘1/1/2000’, ‘1/2/2000’, etc.}
  MANUF:   10 rows {‘Coleman’, ‘Clark’, etc.}
  PRODUCT: 10 rows {‘Sleep Bag’, ‘Sleepware’, etc.}
  STORE:   20 rows {‘Boston’, ‘New York’, etc.}
STAR Transformation
Step 2: Access the matching bitmaps on the Fact
table, and combine (‘OR’) the bitmaps for each
Dimension:
  QUARTER: merge 90 bitmaps => 1 bitmap
  MANUF:   merge 10 bitmaps => 1 bitmap
  PRODUCT: merge 10 bitmaps => 1 bitmap
  STORE:   merge 20 bitmaps => 1 bitmap
STAR Transformation
Step 3: Combine (‘AND’) bitmaps yielding the
“final” bitmap that encompasses all selection
criteria.
Step 4: Use final bitmap to read the Fact table.
Step 5: [If necessary] Return to Dimension
tables to get columns not yet read.
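Putting the steps together: the transformation can also be requested explicitly with Oracle’s STAR_TRANSFORMATION hint. The join-key columns below are the hypothetical ones from the earlier DDL sketch:

  SELECT /*+ STAR_TRANSFORMATION */ s.sale#
  FROM   sales s, time_period t, factory f, product p, store st
  WHERE  s.time_key     = t.time_key
  AND    s.factory_key  = f.factory_key
  AND    s.product_key  = p.product_key
  AND    s.store_key    = st.store_key
  AND    t.quarter      = 'Q1-2000'
  AND    f.manuf_name   LIKE 'C%'
  AND    p.description  LIKE 'Sleeping%'
  AND    st.store_area  = 'East';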
High Availability
• The Reporting system was required to be
available 24x7 because of worldwide customers.
• This required a method of keeping the database
available while new data is loaded.
• The E.piphany tool provided a handy solution
to this via a duplicate set of database tables.
High Availability
• Each table set contains all the data needed to
run a report. Thus, while the users are
querying set “A,” new data is copied into “B.”
• At the conclusion of the nightly data update,
new reports are directed (via a view) to the
most recent set of data--either “A” or “B.”
• Of course, this feature requires more storage,
but the high availability benefit was deemed a
good tradeoff.
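A minimal sketch of the A/B switch idea (the view and table names are hypothetical):

  -- Users always query the view; after tonight’s load into set “B,”
  -- repoint the view from set “A” to set “B.”
  CREATE OR REPLACE VIEW report_activity AS
    SELECT * FROM report_activity_b;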
High Availability (HA)
• The database and reporting servers were
configured for HA (High Availability) using
Veritas Cluster Server (VCS).
• With VCS, each server is actually part of a
two-node cluster.
• Upon failure of one node, the entire
application and database automatically switch
to the alternate node.
• See web site for further documentation on
Veritas VCS.
Teamwork
• The critical position of project
architect was filled very early
in the project.
• The architect was one of the few team
members with extensive data warehouse
design experience.
• He provided the “road map,” as well as
designing the database “Star” schema.
Teamwork (continued)
• External customers were invited to preview
prototype reports and provide feedback.
• The DBA worked with the designers
to optimize response time and
identify database bottlenecks.
• This frequently meant identifying
poorly performing SQL code and
suggesting corrections or new indexes.
(Our DBA loves SQL tuning.)
Summary: Our Secret Formula
The project’s success was not primarily
due to “gurus” or clever development
techniques.
• Rather, the prototyping strategy provided
positive impetus and laid good groundwork
for further development.
• Trust and cooperation among team members
were also key contributors to success.
• Recipe for success: competent personnel, plus
small early successes, plus credibility.
For Further Information:
Chris_Lawson@Yahoo.com
home.mindspring.com/~karo3