Data Warehouses and OLAP What are data warehousing systems?

Download Report

Transcript Data Warehouses and OLAP What are data warehousing systems?

Data Warehouses and OLAP
 What are data warehousing systems?
 Data Warehouse Architecture & Design
 Multidimensional Data Model
 ROLAP and MOLAP Systems
 View Design in Data Warehouses
 Object-Oriented Data Warehousing
 Summary and New Aspects
What Is Data Warehousing?
Data warehousing is a collection of decision
support technologies, aimed at enabling the
knowledge worker (e.g., chief executive,
manager, analyst) to make better and faster
decisions.
- Chaudhuri and Dayal, SIGMOD Record, March 1997
Characteristics of Data Warehousing
Systems
 Historical, summarized and consolidated data ;
very large databases
 Query intensive processing, query throughput
and response time driven
 Multidimensional model, new operations
 ROLAP, MOLAP, Data Marts.
Operational Database Systems
Mainly on-line transaction processing systems.
Have been bread-butter systems for most
database system vendors.
Lot of work has gone in building these systems.
But these systems have limited decision support
functionality (data analysis) required for
competitive business environments
Why Need Data Analysis?
to know your customers and yourself better,
for effective business strategies,
to provide future directions to business organizations.
This kind of data analysis has been going on for long
time. But there is an urgency in getting such data
analysis done faster. Main problem in doing this has been
the disparate and heterogeneous data sources.
Data warehousing systems aim to solve this problem!
Data Warehousing Architecture
Monitoring & Administration
OLAP
Servers
Metadata
Repository
Analysis
Query/Reporting
External Sources
Operational Dbs
Data Warehouse
Extract
Transform
Load
Refresh
Serve
Data Mining
Tools
Data Sources
Data Marts
Taken from Chaudhri&Dayal, SIGMOD RECORD March 1997
Data Warehouse Design
 Define the architecture, do capacity planning, and
select storage servers, database and OLAP
servers, and tools
 Integrate the servers, storage and client tools
 Design the warehouse schema and views
 Define the physical warehouse organization, data
placement, partitioning, and access methods
Data Warehouse Design (Cont..)
 Connect the servers using gateways, ODBC
drivers, or other wrappers
 Design and implement scripts for data extraction,
cleaning, transformation, load and refresh
 Populate the repository with the schema and view
definitions, scripts, and other metadata
 Design and implement end-user applications
 Roll out the warehouse and applications
Back-end Tools and Utilities
 Data Cleansing consists of data migration, data
scrubbing and data auditing
 Data Loading - consists of checking integrity
constraints; sorting; summarization; aggregation and
other computation to build derived tables stored in
warehouse; building indices and other access paths;
and partitioning to multiple target storage areas.
 Refresh - data shipping (triggers) vs transaction
shipping (based on logs)
Multidimensional Data Model
Multidimensional view of data in the warehouse
Each dimension is described by a set of attributes;
the attributes of a dimension may be related via
hierarchy of relationships.
Dimensions: Product, City, Date
Hierarchical summarization paths
Industry
Country
Year
Category
State
Quarter
Product
City
Month
Week
Date
On-Line Analytical Processing (OLAP)
OLAP tools provide an environment for decision making
and business modeling activities by supporting ad hoc
queries
 provide a multidimensional conceptual view of the data
usually star schema in which a single fact table relates to each
dimensional table, or
snowflake schema where dimensional tables are normalized for
simplifying the data operations related to the dimension
 provide easy-to-use end user interfaces
OLAP (Front-end) Tools
Multidimesional data model grew out of the view of
business data popularized by PC spread sheet programs.
Operations supported by multidimensional data model
 Aggregation: total sales by store and by year
 Selection (slicing); sales where toys = “soft” and store =
“LA” and year=1996
 Roll up (multiple group by): sales by city to sales by state
 Drill-down: sales by state to sales by city
 Calculation by positioning: top 5 stores by total sales
Star Schema
Order
OrderNo
OrderDate
Customer
CustomerNo
CustomerName
CustomerAddress
City
Salesperson
SalespersonID
SalespersonName
City
Quota
Product
Fact Table
OrderNo
SalespersonID
CustomerNo
ProdNo
DateKey
CityName
Quantity
TotalPrice
ProdNo
ProdName
ProdDescr
Category
CategoryDescr
UnitPrice
QOH
Date
DateKey
Date
Month
Year
City
CityName
State
Country
Relational OLAP (ROLAP)
 Stores the data in specialized relational
tables (star schema);
 ROLAP offers flexibility; cost is the many
joins needed for each query
 ROLAP extends SQL for decision support
data requests
 Bitmapped indexes more useful than B-trees
in handling large amount of data
Multidimensional OLAP (MOLAP)
 Stores data in a N-dimensional cube (hyper
cube) using array-based storage structure
 each cell is formed by the intersection of all
the dimensions; not all cells have a value (eg,
not every product is sold in every store)
 Cubes are created before can be used and
are static
 Suited for small and medium data sets
View Design & Data Warehousing
 The virtual view approach may be better if the
information sources are changing frequently;
 The materialized view approach would be
superior if the information sources are
changing infrequently and very fast query
response time is needed.
A Motivating Example
Suppose the member databases contain following tables
Item(I_id, I_name, I_price)
Part(P_id, P_name, I_id)
Supplier(S_id, S_name, P_id, city, cost, preference)
Sales(I_id, month, year, amount)
Example continued
Assume we have the following frequently asked queries:
Q1: Select
I_id, sum(amount*I_price)
From
Item, Sales
Where I_name like {MAZADA, NISSEN, TOYOTA}
And
year=1996
And
Item.I_id=Sales.I_id
Group by I_id
Q2: Select P_id, month, sum(amount)
From Item, Sales, Part
Where I_name like {MAZADA, NISSEN, TOYOTA}
And
year=1996
And
Item.I_id=sales.I_id
And
Part.I_id=Item.I_id
Group by P_id, month
Example Continued
Q3: Select
From
Where
And
Group by
Q4: Select
From
Where
And
And
P_id, min(cost), max(cost)
Part, Supplier
Part.P_id=Supplier.P_id
P_name like {spark_plug, gas_kit}
P_id
I_id, sum(amount*min_cost)
Item, Sales, Part
I_name like {MAZADA, NISSEN, TOYOTA}
year=1996 And Item.I_id=Sales.I_id
Item.I_Id=Part.I_id and Part.P_id =
(Select P_id, min(cost) as min_cost From supplier
Group by P_id)
Group by I_id
An MVPP for the Example
1
Q2
Q4
120k
2
5
30k
Q4
1.5k
10
Q1
30k
360k result4
result3
result2
 I_id, sum 360k 
P_id,
P_id, month
(mincost*
min(cost)
sum(amount*no)
amount*no)
max(cost)
36k
result1
I_id, sum
(amount*I_price)
tmp
tmp8
3.6b
tmp7
15m
360m
tmp3
tmp1
I_name like
1k
36m
tmp2
year=“1996”
12k tmp5
{Mazda,
Nisson,
Toyota}
p_name like
10k
{spark_plug,
gas_kit}
Item
1.5k
Sales
Part
tmp6
50k
P_id,
min(cost)
max(cost)
Supplier
Different Materialization Strategies
Materalized Views
Cost of Query Processing Cost of Maintenance
Total Cost
Item, Sales, Part, Supplier
8b980m860k
0
8b980m860k
tmp3, tmp4, tmp8
7b201m547k
1b350m125k
8b551m672k
416m747k
16b32m204k
16b448m951k
tmp3, tmp4, tmp7
7b276m497k
1b220m55k
8b496m552k
tmp3, tmp7
8b281m547k
126m122k
8b407m669k
1m447k
17b384m934k
17b386m381k
tmp3, tmp5
result1, result2, result3, result4
Issues & Problems
 Finding all the common subexpressions and
combining individual query access plans into
one MVPP, such that all the common
subexpressions are merged;
 Finding a set of intermediate nodes in the
MVPP, such that if the members of this set
are materialized, the total cost of global
query access and view maintenance is
minimal.
Algorithms for Materialized View
Selection
 Algorithms for multiple MVPP design;
a feasible solution - working with individual
optimal plans;
generating optimal plan(s) - applying 0-1 integer
programming technique.
 Given an MVPP, using heuristic rules to find a
set of nodes to be materialized so that the
total cost is minimal.
Dynamic Materialized View Selection
 Monitor the queries being executed over
time
 Maintain MVPP by incorporating most
frequently executed queries (common
subexpressions)
 Modify MVPP incrementally by executing
MVPP generation algorithm (in background)
 Decide on the views to be materialized
 Reorganize the existing views
Materialized View Selection Costs
The dynamic materialized view selection
problem has to take into consideration:
Benefit to the query processing cost in future
Cost of maintaining the materialized views
Cost of reorganization
Materialized View Reorganization
 Given a set of views V1, V2, …, Vn currently
materialized
 Let V’1, V’2, …, V’m be the new views that
need to be materialized
 Need to design algorithms for efficient view
reorganization
 on-line (concurrency, failure recovery) & offline (efficiency) algorithms
Relational Schema
Order
OrderNo
OrderDate
Customer
CustomerNo
CustomerName
CustomerAddress
City
Salesperson
SalespersonID
SalespersonName
City
Quota
Fact Table
OrderNo
SalespersonID
CustomerNo
ProdNo
DateKey
CityName
Quantity
TotalPrice
Product
ProdNo
ProdName
ProdDescr
Category
CategoryDescr
UnitPrice
QOH
Date
DateKey
Date
Month
Year
City
CityName
State
Country
An Object Model
Country
OrderPYCView
City
State
City
GetRegion()
OrderView
OrderSet
Summarize()
SalesPerson
SalesPersonID
Quota
Person
Name
DateOfBirth
Address
GetAge()
OrderPYView
Product
Year
Category
CategoryName
CategoryDescr
GetCategName()
Product
ProdName
UnitPrice
GetProdName()
Order
OrderNo
Quantity
TotalPrice
Date
OrderDate
GetDate()
Customer
CustomerNo
Month
Year
ISA relationship
IS PART-OF relationship
Why Object Oriented Data Warehouse?
 Object Identity reduces data redundancy can it help materialized view maintenance?
 Is-a hierarchy facilitates common data
objects and methods reuse (overloading)
 Class composition hierarchy helps fast
traversal using OIDs
 Methods facilitate implementation of
complex aggregate functions (over complex
objects, such as volume of a CAD object)
Efficiency considerations
Structural join index hierarchies and class
partitioning can facilitate in
Evaluating of multiple path operations
Efficiently processing methods
In calculating multidimensional aggregate
operations, such as data cube, and pivoting
Architecture considerations
Following issues need to be addressed
 Is the preferred architecture OO front-end
with relational back-end?
 What about OO back end and front-end?
 How does one integrate data mining and OO
data warehousing components
 How does one build distributed object
oriented data warehousing systems?
Summary
 Data warehousing systems are about 5 years old
 Most of the work has concentrated on
materialized view maintenance, preliminaries
 New aspects of data warehousing have to be
considered to build next generation systems
dynamic materialized view design
object-orientation, etc.
Some References
Dynamic Materialized View Design/Selection
 Timos Sellis group, Stanford Group,
CSIRO/HKUST/CityU
Object Oriented Data Warehousing
 Rundensteiner group, Tore Risch Group,
CityU/HKUST, Univ. of South Australia