Data Warehousing and OLAP
Motivation
Aims of information technology:
 To help workers in their everyday business activity and
improve their productivity – clerical data processing
tasks
 To help knowledge workers (executives, managers,
analysts) make faster and better decisions – decision
support systems
• Two types of applications:
– Operational applications
– Analytical applications
Motivation
On the other hand:
 In most organizations, data about specific parts of
business is there - lots and lots of data, somewhere, in
some form.
 Data is available but not information -- and not the
right information at the right time.
A data warehouse is to:
 bring together information from multiple sources so as to provide a consistent database source for decision support queries.
 off-load decision support applications from the on-line
transaction system.
Warehousing
• Growing industry: $ 8 billion in 1998
• Range from desktop to huge warehouses
– Walmart: 900-CPU, 2,700 disks, 23TB
– Teradata system
• Lots of new terms
– ROLAP, MOLAP, HOLAP
– roll-up, drill-down, slice & dice
The Architecture of Data

[Figure: a layered architecture of data. At the bottom, operational data (who, what, when, where); above it, summary data (summaries by who, what, when, where, ...); the database schema, logical model, and physical layout of data are described by metadata; at the top, business rules – what has been learned from the data.]
Decision Support and OLAP
 DSS: Information technology to help
knowledge workers (executives, managers,
analysts) make faster and better decisions:
 what were the sales volumes by region and by product
category in the last year?
 how did the share price of computer manufacturers
correlate with quarterly profits over the past 10 years?
 will a 10% discount increase sales volume sufficiently?
 OLAP is an element of decision support system
 Data mining is a powerful, high-performance
data analysis tool for decision support.
Data Processing Models
There are two basic data processing models:
 OLTP – the main aim of OLTP is reliable and
efficient processing of a large number of
transactions and ensuring data consistency.
 OLAP – the main aim of OLAP is efficient
multidimensional processing of large data
volumes.
Traditional OLTP
Traditionally, DBMSs have been used for on-line transaction processing (OLTP)
 order entry: pull up order xx-yy-zz and update status field
 banking: transfer $100 from account X to account Y
 clerical data processing tasks
 detailed, up-to-date data
 structured, repetitive tasks
 short transactions are the unit of work
 read and/or update a few records
 isolation, recovery, and integrity are critical
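As a sketch, the bank-transfer example above written as a short SQL transaction (the accounts table, balance column, and account ids are illustrative assumptions):

  BEGIN;
  -- move $100 from account X to account Y atomically;
  -- isolation and recovery guarantee all-or-nothing behaviour
  UPDATE accounts SET balance = balance - 100 WHERE acctId = 'X';
  UPDATE accounts SET balance = balance + 100 WHERE acctId = 'Y';
  COMMIT;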
OLTP vs. OLAP
• OLTP: On Line Transaction Processing
– Describes processing at operational sites
• OLAP: On Line Analytical Processing
– Describes processing at warehouse
OLTP vs. OLAP

                      OLTP                          OLAP
  users               clerk, IT professional        knowledge worker
  function            day-to-day operations         decision support
  DB design           application-oriented          subject-oriented
  data                current, up-to-date,          historical, summarized,
                      detailed, flat relational,    multidimensional,
                      isolated                      integrated, consolidated
  usage               repetitive                    ad-hoc
  access              read/write,                   lots of scans
                      index/hash on prim. key
  unit of work        short, simple transaction     complex query
  # records accessed  tens                          millions
  # users             thousands                     hundreds
  DB size             100MB-GB                      100GB-TB
  metric              transaction throughput        query throughput, response
What is a Data Warehouse
 “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management’s decision-making process.” -- W. H. Inmon
 Collection of data that is used primarily in organizational decision making
 A decision support database that is maintained separately from the organization's operational database
Data Warehouse - Subject Oriented
 Subject oriented: oriented to the major
subject areas of the corporation that have
been defined in the data model.
 E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
 Operational DBs and applications may be organized differently
 E.g. based on the type of insurance: auto, life, medical, fire, ...
Data Warehouse – Integrated
 There is no consistency in encoding,
naming conventions, …, among different
data sources
 Heterogeneous data sources
 When data is moved to the warehouse, it is
converted.
Data Warehouse - Non-Volatile
 Operational data is regularly accessed and
manipulated a record at a time, and update
is done to data in the operational
environment.
 Warehouse Data is loaded and accessed.
Update of data does not occur in the data
warehouse environment.
Data Warehouse - Time Variance
 The time horizon for the data warehouse is
significantly longer than that of operational
systems.
 Operational database: current value data.
 Data warehouse data: nothing more than a sophisticated series of snapshots, taken at some moment in time.
 The key structure of operational data may or may
not contain some element of time. The key
structure of the data warehouse always contains
some element of time.
Why Separate Data Warehouse?
 Performance
 special data organization, access methods, and
implementation methods are needed to support
multidimensional views and operations typical
of OLAP
 Complex OLAP queries would degrade
performance for operational transactions
 Concurrency control and recovery modes of
OLTP are not compatible with OLAP analysis
Why Separate Data Warehouse?
 Function
 missing data: Decision support requires
historical data which operational DBs do not
typically maintain
 data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources: operational DBs,
external sources
 data quality: different sources typically use
inconsistent data representations, codes and
formats which have to be reconciled.
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information
Advantages of Mediator Systems
• No need to copy data
– less storage
– no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
The Architecture of Data Warehousing

[Figure: operational databases and external data sources are extracted, transformed, loaded, and refreshed into the data warehouse, which is described by a metadata repository; the warehouse serves data marts and OLAP servers, which in turn serve the front-end tools: reports, OLAP, and data mining.]
Data Sources
 Data sources are often the operational systems,
providing the lowest level of data.
 Data sources are designed for operational use, not
for decision support, and the data reflect this fact.
 Multiple data sources are often from different
systems, run on a wide range of hardware and
much of the software is built in-house or highly
customized.
 Multiple data sources introduce a large number of
issues -- semantic conflicts.
Creating and Maintaining a Warehouse
A data warehouse needs several tools that automate or support tasks such as:
 Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW).
 Data cleaning (finding and resolving inconsistencies in the source data)
 Integration and transformation of data (between different data formats, languages, etc.)
Creating and Maintaining a Warehouse
 Data loading (loading the data into the data warehouse)
 Data replication (replicating the source database into the data warehouse)
 Data refreshment
 Data archiving
 Checking for data quality
 Analyzing metadata
Physical Structure of Data Warehouse
There are three basic architectures for constructing a data warehouse:
 Centralized
 Federated
 Tiered
The data warehouse is distributed (as in the federated and tiered architectures) for load balancing, scalability, and higher availability
Physical Structure of Data Warehouse

[Figure: centralized architecture – multiple sources feed a single central data warehouse, which serves all clients.]
Physical Structure of Data Warehouse

[Figure: federated architecture – sources feed a logical data warehouse, which is materialized only as local data marts (e.g. marketing, financial, distribution) serving the end users.]
Physical Structure of Data Warehouse

[Figure: tiered architecture – sources feed a physical data warehouse; local data marts form the next tier, and workstations holding highly summarized data form the top tier.]
Physical Structure of Data Warehouse
• Federated architecture
– The logical data warehouse is only virtual
• Tiered architecture
 The central data warehouse is physical
 Local data marts on different tiers store copies or summarizations of the data of the previous tier
Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
• Star schema
• Snowflake schema
• Fact constellations
Star schema
Star schema: A single object (fact table) in
the middle connected to a number of
dimension tables
Star schema

  sale (fact table): orderId, date, custId, prodId, storeId, qty, amt

  product:  prodId, name, price
  store:    storeId, city
  customer: custId, name, address, city
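A minimal SQL sketch of this star schema; the table and column names come from the figure, while the types are illustrative assumptions (and the date column may need quoting in some dialects):

  CREATE TABLE product (
    prodId  VARCHAR(10) PRIMARY KEY,
    name    VARCHAR(50),
    price   DECIMAL(10,2)
  );
  CREATE TABLE store (
    storeId VARCHAR(10) PRIMARY KEY,
    city    VARCHAR(50)
  );
  CREATE TABLE customer (
    custId  INT PRIMARY KEY,
    name    VARCHAR(50),
    address VARCHAR(100),
    city    VARCHAR(50)
  );
  -- fact table: one row per sale, with a foreign key to each dimension
  CREATE TABLE sale (
    orderId VARCHAR(10),
    date    DATE,
    custId  INT REFERENCES customer(custId),
    prodId  VARCHAR(10) REFERENCES product(prodId),
    storeId VARCHAR(10) REFERENCES store(storeId),
    qty     INT,
    amt     DECIMAL(10,2)
  );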
Star schema

  product: prodId  name  price          store: storeId  city
           p1      bolt  10                    c1       nyc
           p2      nut   5                     c2       sfo
                                               c3       la

  sale: orderId  date    custId  prodId  storeId  qty  amt
        o100     1/7/97  53      p1      c1       1    12
        o102     2/7/97  53      p2      c1       2    11
        o105     3/8/97  111     p1      c3       5    50

  customer: custId  name   address    city
            53      joe    10 main    sfo
            81      fred   12 main    sfo
            111     sally  80 willow  la
Terms
 Basic notion: a measure (e.g. sales, qty,
etc)
 Given a collection of numeric measures
 Each measure depends on a set of
dimensions (e.g. sales volume as a function of
product, time, and location)
Terms
• The relation that relates the dimensions to the measure of interest is called the fact table (e.g. sale)
• Information about dimensions can be
represented as a collection of relations –
called the dimension tables (product,
customer, store)
• Each dimension can have a set of
associated attributes
Example of Star Schema

  Sales Fact Table: Date, Product, Store, Customer,
                    unit_sales, dollar_sales, schilling_sales (measurements)

  Dimension tables:
    Date:     Date, Month, Year
    Product:  ProductNo, ProdName, ProdDesc, Category, QOH
    Store:    StoreID, City, State, Country, Region
    Customer: CustId, CustName, CustCity, CustCountry
Dimension Hierarchies
• For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:

  store → city → region        store → sType
  customer → city → state → country
Dimension Hierarchies

  store: storeId  cityId  tId  mgr        sType: tId  size   location
         s5       sfo     t1   joe               t1   small  downtown
         s7       sfo     t2   fred              t2   large  suburbs
         s9       la      t1   nancy

  city: cityId  pop  regId                region: regId  name
        sfo     1M   north                        north  cold region
        la      5M   south                        south  warm region
Snowflake Schema
Snowflake schema: A refinement of star
schema where the dimensional hierarchy
is represented explicitly by normalizing
the dimension tables
Example of Snowflake Schema

  Sales Fact Table: Date, Product, Store, Customer,
                    unit_sales, dollar_sales, schilling_sales (measurements)

  Normalized dimension tables:
    Date:    Date, Month        Month:   Month, Year       Year: Year
    Product: ProductNo, ProdName, ProdDesc, Category, QOH
    Store:   StoreID, City      City:    City, State
    State:   State, Country     Country: Country, Region
    Cust:    CustId, CustName, CustCity, CustCountry
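A minimal SQL sketch of the normalized Store dimension chain from this snowflake schema (types are illustrative assumptions):

  CREATE TABLE country (
    Country VARCHAR(50) PRIMARY KEY,
    Region  VARCHAR(50)
  );
  CREATE TABLE state (
    State   VARCHAR(50) PRIMARY KEY,
    Country VARCHAR(50) REFERENCES country(Country)
  );
  CREATE TABLE city (
    City  VARCHAR(50) PRIMARY KEY,
    State VARCHAR(50) REFERENCES state(State)
  );
  CREATE TABLE store (
    StoreID VARCHAR(10) PRIMARY KEY,
    City    VARCHAR(50) REFERENCES city(City)
  );

Note the design trade-off: queries on region must join back along the whole chain, which is the price paid for normalizing the dimension.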
Fact constellations
Fact constellations: Multiple fact tables
share dimension tables
Database design methodology for data warehouses (1)
• Nine-step methodology – proposed by Kimball

  Step  Activity
  1     Choosing the process
  2     Choosing the grain
  3     Identifying and conforming the dimensions
  4     Choosing the facts
  5     Storing the pre-calculations in the fact table
  6     Rounding out the dimension tables
  7     Choosing the duration of the database
  8     Tracking slowly changing dimensions
  9     Deciding the query priorities and the query modes
Database design methodology for data warehouses (2)
• There are many approaches that offer alternative routes to the creation of a data warehouse
• Typical approach – decompose the design of the data warehouse into manageable parts – data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
• The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse.
Step 1: Choosing the process
• The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
• The best choice for the first data mart tends to be the one that is related to 'sales'
Step 2: Choosing the grain
• Choosing the grain means deciding exactly what a fact table record represents. For example, the entity 'Sales' may represent the facts about each property sale. Therefore, the grain of the 'Property_Sales' fact table is an individual property sale.
• Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
• The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for 'Property_Sales' is an individual property sale, then the grain of the 'Client' dimension is the detail of the client who bought a particular property.
Step 3: Identifying and conforming the dimensions
• Dimensions set the context for formulating queries about the facts in the fact table.
• We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
• If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application).
• When a dimension is used in more than one data mart, the dimension is referred to as being conformed.
Step 4: Choosing the facts
• The grain of the fact table determines which facts can be
used in the data mart – all facts must be expressed at the
level implied by the grain.
• In other words, if the grain of the fact table is an individual
property sale, then all the numerical facts must refer to this
particular sale (the facts should be numeric and additive).
Step 5: Storing pre-calculations in the fact table
• Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
• Common example: a profit or loss statement
• These types of facts are useful since they are additive
quantities, from which we can derive valuable information.
• This is particularly true for a value that is fundamental to
an enterprise, or if there is any chance of a user calculating
the value incorrectly.
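A sketch of what this looks like in the fact table, using hypothetical property-sale columns; the derived fact is computed once at load time rather than by each user:

  CREATE TABLE property_sales_fact (
    sale_id       INT PRIMARY KEY,
    offer_price   DECIMAL(12,2),
    selling_price DECIMAL(12,2),
    -- pre-calculated at load time: selling_price - offer_price
    profit        DECIMAL(12,2)
  );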
Step 6: Rounding out the dimension
tables
• In this step we return to the dimension tables and add as
many text descriptions to the dimensions as possible.
• The text descriptions should be as intuitive and
understandable to the users as possible
Step 7: Choosing the duration of the
data warehouse
• The duration measures how far back in time the fact table
goes.
• For some companies (e.g. insurance companies) there may
be a legal requirement to retain data extending back five or
more years.
• Very large fact tables raise at least two very significant
data warehouse design issues:
– The older the data, the more likely there will be problems in reading and interpreting the old files
– It is mandatory that the old versions of the important dimensions
be used, not the most current versions (we will discuss this issue
later on)
Step 8: Tracking slowly changing
dimensions
• The changing dimension problem means that the proper
description of the old client and the old branch must be
used with the old data warehouse schema
• Usually, the data warehouse must assign a generalized key
to these important dimensions in order to distinguish
multiple snapshots of clients and branches over a period of
time
• There are different types of changes in dimensions:
– A dimension attribute is overwritten
– A changed dimension attribute causes a new dimension record to be created
– etc.
Step 9: Deciding the query priorities and the query modes
• In this step we consider physical design issues:
– The presence of pre-stored summaries and aggregates
– Indices
– Materialized views
– Security issues
– Backup issues
– Archive issues
Database design methodology for data warehouses - summary
• At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
• A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.
Multidimensional Data Model
Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g.: clients and products
Multidimensional Data Model

  Fact relation (sale):           Two-dimensional cube:
  Product  Client  Amt                  c1   c2   c3
  p1       c1      12             p1    12        50
  p2       c1      11             p2    11    8
  p1       c3      50
  p2       c2      8
Multidimensional Data Model

  Fact relation (sale):           3-dimensional cube:
  Product  Client  Date  Amt      day 1:      c1   c2   c3
  p1       c1      1     12              p1   12        50
  p2       c1      1     11              p2   11    8
  p1       c3      1     50
  p2       c2      1     8        day 2:      c1   c2   c3
  p1       c1      2     44              p1   44    4
  p1       c2      2     4
Multidimensional Data Model and Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(Amt) FROM SALE WHERE Date = 1

  sale: Product  Client  Date  Amt        result: 81
        p1       c1      1     12
        p2       c1      1     11
        p1       c3      1     50
        p2       c2      1     8
        p1       c1      2     44
        p1       c2      2     4
Multidimensional Data Model and Aggregates
• Add up amounts by day
• In SQL: SELECT Date, sum(Amt) FROM SALE GROUP BY Date

  sale: Product  Client  Date  Amt        result: Date  sum
        p1       c1      1     12                 1     81
        p2       c1      1     11                 2     48
        p1       c3      1     50
        p2       c2      1     8
        p1       c1      2     44
        p1       c2      2     4
Multidimensional Data Model and
Aggregates
• Add up amounts by client, product
• In SQL: SELECT client, product, sum(amt)
FROM SALE
GROUP BY client, product
Multidimensional Data Model and Aggregates

  sale: Product  Client  Date  Amt        result: Product  Client  Sum
        p1       c1      1     12                 p1       c1      56
        p2       c1      1     11                 p1       c2      4
        p1       c3      1     50                 p1       c3      50
        p2       c2      1     8                  p2       c1      11
        p1       c1      2     44                 p2       c2      8
        p1       c2      2     4
Multidimensional Data Model and Aggregates
• In the multidimensional data model, together with the measure values, we usually store summarizing information (aggregates)

        c1   c2   c3   Sum
  p1    56    4   50   110
  p2    11    8         19
  Sum   67   12   50   129
Aggregates
• Operators: sum, count, max, min, median, avg
• "Having" clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
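A sketch combining these features against the star schema shown earlier: aggregation along the store → city hierarchy with a HAVING filter on the groups (the threshold is an illustrative assumption):

  SELECT st.city, AVG(s.amt) AS avg_amt
  FROM sale s
  JOIN store st ON s.storeId = st.storeId
  GROUP BY st.city
  HAVING SUM(s.qty) > 10;   -- keep only cities with sufficient volume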
Cube Aggregation

Example: computing sums

  day 1:      c1   c2   c3        day 2:      c1   c2   c3
        p1    12        50              p1    44    4
        p2    11    8

  summed over Date:               summed further over Product:
        c1   c2   c3                c1 = 67, c2 = 12, c3 = 50 (total 129)
  p1    56    4   50
  p2    11    8                   summed over Date and Client:
                                    p1 = 110, p2 = 19
Cube Operators

Using the cube above, aggregate cells can be addressed with * coordinates:

  sale(c1,*,*)  = 67    – total for client c1 over all products and dates
  sale(*,p2,*)  = 19    – total for product p2
  sale(c2,p2,*) = 8     – total for client c2 and product p2
  sale(*,*,*)   = 129   – grand total
Cube

The full cube extends each dimension with a * value holding the subtotals:

  day 1:      c1   c2   c3   *     day 2:      c1   c2   c3   *
        p1    12        50   62          p1    44    4        48
        p2    11    8        19          *     44    4        48
        *     23    8   50   81

  all days (*):    c1   c2   c3   *
              p1   56    4   50   110
              p2   11    8         19
              *    67   12   50   129
Aggregation Using Hierarchies

  customer → region → country

  day 1:      c1   c2   c3        aggregated by region:
        p1    12        50              region A   region B
        p2    11    8               p1  12         50
                                    p2  11         8
  (customer c1 in Region A;
   customers c2, c3 in Region B)
Aggregation Using Hierarchies

[Figure: sales of CD, video, and camera per client and date of sale, with the client → city → region hierarchy (clients c1, c2, c3 in New Orleans; client c4 in Poznań). Aggregation with respect to city gives:]

        Video  Camera  CD
  NO    22     8       30
  PN    23     18      22
A Sample Data Cube

[Figure: a 3-dimensional data cube with dimensions Date (1Q, 2Q, 3Q, 4Q, sum), Product (camera, video, CD, sum), and Country (USA, Canada, Mexico, sum).]
Exercise (1)
• Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars.
• The measure - price of a car
• We would like to answer the following typical queries:
– find total sales by day, week, month and year
– find total sales by week, month, ... for each dealer
– find total sales by week, month, ... for each car model
– find total sales by month for all dealers in a given city, region and state.
Exercise (2)
• Dimensions:
– time (day, week, month, quarter, year)
– dealer (name, city, state, region, phone)
– cars (serialno, model, color, category, ...)
• Design the conceptual data warehouse schema
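One possible sketch of an answer – not the official solution – as a star schema with a sales fact table and the three dimensions (all names and types are assumptions):

  CREATE TABLE time_dim (
    timeId INT PRIMARY KEY,
    day DATE, week INT, month INT, quarter INT, year INT
  );
  CREATE TABLE dealer_dim (
    dealerId INT PRIMARY KEY,
    name VARCHAR(50), city VARCHAR(50), state VARCHAR(50),
    region VARCHAR(50), phone VARCHAR(20)
  );
  CREATE TABLE car_dim (
    serialno VARCHAR(20) PRIMARY KEY,
    model VARCHAR(50), color VARCHAR(20), category VARCHAR(20)
  );
  -- fact table: the measure is the price of each car sold
  CREATE TABLE car_sales (
    timeId   INT REFERENCES time_dim(timeId),
    dealerId INT REFERENCES dealer_dim(dealerId),
    serialno VARCHAR(20) REFERENCES car_dim(serialno),
    price    DECIMAL(12,2)
  );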
OLAP Servers
 Relational OLAP (ROLAP):
 Extended relational DBMS that maps operations on multidimensional data to standard relational operations
 Stores all information, including fact tables, as relations
 Multidimensional OLAP (MOLAP):
 Special purpose server that directly implements multidimensional data and operations
 Stores multidimensional datasets as arrays
OLAP Servers
 Hybrid OLAP (HOLAP):
 Give users/system administrators freedom to
select different partitions.
OLAP Queries
 Roll up: summarize data along a dimension hierarchy
 if we are given total sales volume per city, we can aggregate on the Location dimension to obtain sales per state
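A sketch of a roll-up in SQL, assuming a sales table joined to a location dimension with city and state columns (all names are assumptions); moving from the first query to the second climbs one level of the hierarchy:

  -- total sales per city
  SELECT l.state, l.city, SUM(s.amt) AS total
  FROM sales s JOIN location l ON s.locId = l.locId
  GROUP BY l.state, l.city;

  -- rolled up to total sales per state
  SELECT l.state, SUM(s.amt) AS total
  FROM sales s JOIN location l ON s.locId = l.locId
  GROUP BY l.state;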
OLAP Queries

[Figure: the per-client sales cube (CD, video, camera by date of sale) rolled up along the client → city hierarchy (c1, c2, c3 in New Orleans; c4 in Poznań):]

        Video  Camera  CD
  NO    22     8       30
  PN    23     18      22
OLAP Queries
 Roll down, drill down: go from higher
level summary to lower level summary or
detailed data
 For a particular product category, find the
detailed sales data for each salesperson by
date
 Given total sales by state, we can ask for sales
per city, or just sales by city for a selected state
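The inverse direction in SQL, as a sketch under the same assumed schema: from state totals back down to city totals for one selected state (the state value is illustrative):

  -- drill down from state to city for a selected state
  SELECT l.city, SUM(s.amt) AS total
  FROM sales s JOIN location l ON s.locId = l.locId
  WHERE l.state = 'CA'
  GROUP BY l.city;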
OLAP Queries

[Figure: roll-up and drill-down on the sale cube: the per-day cubes (day 1, day 2) roll up to the all-days table (p1: c1 = 56, c2 = 4, c3 = 50; p2: c1 = 11, c2 = 8), and further to the totals c1 = 67, c2 = 12, c3 = 50 (sum 129) and p1 = 110, p2 = 19; drill-down moves in the opposite direction.]
OLAP Queries
• Slice and dice: select and project
 Sales of video in USA over the last 6 months
 Slicing and dicing reduce the number of dimensions
 Pivot: reorient cube
 The result of pivoting is called a cross-tabulation
 If we pivot the Sales cube on the Client and Product dimensions, we obtain a table with one entry for each client and product value
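A slice in SQL is just a selection on one dimension before aggregating; a sketch under an assumed schema (the product, location, and date names are assumptions, and the fixed date is a stand-in for "last 6 months"):

  -- slice on product = 'video', dice down to country = 'USA'
  SELECT l.region, SUM(s.amt) AS total
  FROM sales s
  JOIN product p  ON s.prodId = p.prodId
  JOIN location l ON s.locId  = l.locId
  WHERE p.name = 'video'
    AND l.country = 'USA'
    AND s.saleDate >= DATE '1997-01-01'
  GROUP BY l.region;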
OLAP Queries
 Pivoting can be combined with aggregation

  sale: prodId  clientId  date  amt     pivot on date and client:
        p1      c1        1     12            c1   c2   c3   Sum
        p2      c1        1     11        1   23    8   50    81
        p1      c3        1     50        2   44    4         48
        p2      c2        1     8        Sum  67   12   50   129
        p1      c1        2     44
        p1      c2        2     4       pivot on product and client:
                                              c1   c2   c3   Sum
                                         p1   56    4   50   110
                                         p2   11    8         19
                                         Sum  67   12   50   129
OLAP Queries
 Ranking: selection of the first n elements (e.g. select the 5 best-selling products in July)
 Others: stored procedures, selection, etc.
• Time functions
– e.g., time average
Implementing a Warehouse
• Designing and rolling out a data warehouse is a complex process, consisting of the following activities:
 Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers (ROLAP vs MOLAP), and tools,
 Integrate the servers, storage, and client tools,
 Design the warehouse schema and views,
Implementing a Warehouse
Define the physical warehouse organization,
data placement, partitioning, and access
method,
Connect the sources using gateways, ODBC
drivers, or other wrappers,
Design and implement scripts for data
extraction, cleaning, transformation, load, and
refresh,
Implementing a Warehouse
Populate the repository with the schema and
view definitions, scripts, and other metadata,
Design and implement end-user applications,
Roll out the warehouse and applications.
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing, ...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
Monitoring
• Data Extraction
– Data extraction from external sources is usually
implemented via gateways and standard
interfaces (such as Information Builders
EDA/SQL, ODBC, JDBC, Oracle Open
Connect, Sybase Enterprise Connect, Informix
Enterprise Gateway, etc.)
Monitoring Techniques
 Detect changes to an information source
that are of interest to the warehouse:
 define triggers in a full-functionality DBMS
 examine the updates in the log file
 write programs for legacy systems
 polling (queries to source)
 screen scraping
 Propagate the change in a generic form to
the integrator
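A sketch of the trigger-based option in MySQL-style syntax, assuming a change-capture table sale_changes (all names are assumptions):

  -- record every new sale row so the integrator can pick up the change
  CREATE TRIGGER capture_sale_changes
  AFTER INSERT ON sale
  FOR EACH ROW
  INSERT INTO sale_changes (prodId, storeId, date, amt)
  VALUES (NEW.prodId, NEW.storeId, NEW.date, NEW.amt);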
Integration
• Integrator
 Receive changes from the monitors
 make the data conform to the conceptual schema
used by the warehouse
 Integrate the changes into the warehouse
 merge the data with existing data already present
 resolve possible update anomalies
• Data Cleaning
• Data Loading
Data Cleaning
• Data cleaning is important for a warehouse – there is a high probability of errors and anomalies in the data:
– inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violations of integrity constraints.
– optional fields in data entry are significant sources of inconsistent data.
Data Cleaning Techniques
 Data migration: allows simple data transformation rules to be specified, e.g. "replace the string gender by sex" (Warehouse Manager from Prism is an example of this tool)
 Data scrubbing: uses domain-specific knowledge to scrub data (e.g. postal addresses) (Integrity and Trillium fall in this category)
 Data auditing: discovers rules and relationships by scanning data (detect outliers). Such tools may be considered variants of data mining tools
Data Loading
• After extracting, cleaning and transforming, data
must be loaded into the warehouse.
• Loading the warehouse includes some other
processing tasks: checking integrity constraints,
sorting, summarizing, etc.
• Typically, batch load utilities are used for loading.
A load utility must allow the administrator to
monitor status, to cancel, suspend, and resume a
load, and to restart after failure with no loss of
data integrity
Data Loading Issues
• The load utilities for data warehouses have to deal
with very large data volumes
• Sequential loads can take a very long time.
• Full load can be treated as a single long batch
transaction that builds up a new database. Using
checkpoints ensures that if a failure occurs during
the load, the process can restart from the last
checkpoint
Data Refresh
• Refreshing a warehouse means propagating
updates on source data to the data stored in the
warehouse
 when to refresh:
 periodically (e.g. daily or weekly)
 immediately (deferred vs. immediate refresh)
– determined by usage, types of data source, etc.
Data Refresh
 how to refresh
– data shipping
– transaction shipping
• Most commercial DBMS provide replication
servers that support incremental techniques for
propagating updates from a primary database to
one or more replicas. Such replication servers can
be used to incrementally refresh a warehouse
when sources change
Data Shipping
• data shipping (e.g. Oracle Replication Server): a table in the warehouse is treated as a remote snapshot of a table in the source database. An after-row trigger is used to update the snapshot log table and propagate the updated data to the warehouse
Transaction Shipping
• transaction shipping (e.g. Sybase Replication Server, Microsoft SQL Server): the regular transaction log is used. The transaction log is checked to detect updates on replicated tables, and those log records are transferred to a replication server, which packages up the corresponding transactions to update the replicas
Derived Data
• Derived Warehouse Data
– indexes
– aggregates
– materialized views
• When to update derived data?
• The most difficult problem is how to refresh the derived data. The problem of constructing algorithms for incrementally updating derived data has been the subject of much research!
Materialized Views
• Define new warehouse relations using SQL expressions

  sale: prodId  clientId  date  amt      product: id  name  price
        p1      c1        1     12                p1  bolt  10
        p2      c1        1     11                p2  nut   5
        p1      c3        1     50
        p2      c2        1     8
        p1      c1        2     44
        p1      c2        2     4

  joinTb (join of sale and product):
        prodId  name  price  clientId  date  amt
        p1      bolt  10     c1        1     12
        p2      nut   5      c1        1     11
        p1      bolt  10     c3        1     50
        p2      nut   5      c2        1     8
        p1      bolt  10     c1        2     44
        p1      bolt  10     c2        2     4
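A sketch of joinTb defined as a SQL view over the two tables above (whether it is stored, i.e. materialized, is a separate decision; materialized-view syntax varies by system):

  CREATE VIEW joinTb AS
  SELECT s.prodId, p.name, p.price, s.clientId, s.date, s.amt
  FROM sale s
  JOIN product p ON s.prodId = p.id;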
Processing
• Index Structures
• What to Materialize?
• Algorithms
Index Structures
• Indexing principle:
 mapping key values to records for associative direct access
 The most popular indexing technique in relational databases: B+-trees
 For multi-dimensional data, a large number of indexing techniques have been developed: R-trees
Index Structures
• Index structures applied in warehouses:
– inverted lists
– bitmap indexes
– join indexes
– text indexes
Inverted Lists

[Figure: an inverted list index on age. For each age value (18, 19, ..., 26) the index stores the list of ids of the records with that value; e.g. for the data records joe 20 (r4), fred 20 (r18), sally 21 (r19), nancy 20 (r34), tom 20 (r35), pat 25 (r36), dave 21 (r5), jeff 26 (r41), the inverted list for age 20 is r4, r18, r34, r35.]
Inverted Lists
• Query:
– Get people with age = 20 and name = “fred”
• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18
Bitmap Indexes
• Bitmap index: An indexing technique that has attracted attention in multi-dimensional database implementation

  table: Customer  City     Car
         c1        Detroit  Ford
         c2        Chicago  Honda
         c3        Detroit  Honda
         c4        Poznan   Ford
         c5        Paris    BMW
         c6        Paris    Nissan
Bitmap Indexes
• The index consists of bitmaps:

  Index on City:
  rec  Chicago  Detroit  Paris  Poznan
  1    0        1        0      0
  2    1        0        0      0
  3    0        1        0      0
  4    0        0        0      1
  5    0        0        1      0
  6    0        0        1      0
Bitmap Indexes

  Index on Car:
  rec  BMW  Ford  Honda  Nissan
  1    0    1     0      0
  2    0    0     1      0
  3    0    0     1      0
  4    0    1     0      0
  5    1    0     0      0
  6    0    0     0      1
Bitmap Indexes
 Index on a particular column
 Index consists of a number of bit vectors (bitmaps)
 Each value in the indexed column has a bit vector (bitmap)
 The length of each bit vector is the number of records in the base table
 The i-th bit is set if the i-th row of the base table has the value for the indexed column
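Some products expose this technique directly in the DDL; for example, Oracle-style syntax for a bitmap index on the City column of the customer table above (a sketch; support and syntax vary by system):

  -- one bitmap per distinct city value is maintained internally
  CREATE BITMAP INDEX customer_city_idx ON customer (city);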
Bitmap Index

[Figure: a bitmap index on age for 8 data records: joe 20, fred 20, sally 21, nancy 20, tom 20, pat 25, dave 21, jeff 26. The index keeps one bitmap per age value, e.g.:]

  age 20: 11011000
  age 21: 00100010
  age 25: 00000100
  age 26: 00000001
Using Bitmap indexes
• Query:
– Get people with age = 20 and name = "fred"
• Bitmap for age = 20: 1101100000
• Bitmap for name = "fred": 0100000001
• Answer is the intersection (bit-wise AND): 0100000000
 Good if domain cardinality is small
 Bit vectors can be compressed
Using Bitmap indexes
 They allow the use of efficient bit operations to answer some queries
 "how many customers from Detroit have car 'Ford'"
– perform a bit-wise AND of the two bitmaps: answer – c1
 "how many customers have a car 'Honda'"
– count the 1's in the bitmap – answer: 2
 Compression – bit vectors are usually sparse for large databases, hence the need for decompression
Bitmap Index – Summary
 With efficient hardware support for bitmap
operations (AND, OR, XOR, NOT), bitmap index
offers better access methods for certain queries
 e.g., selection on two attributes
 Some commercial products have implemented
bitmap index
 Works poorly for high cardinality domains since
the number of bitmaps increases
 Difficult to maintain - need reorganization when
relation sizes change (new bitmaps)
Join
• "Combine" SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

  sale: prodId  storeId  date  amt      product: id  name  price
        p1      c1       1     12                p1  bolt  10
        p2      c1       1     11                p2  nut   5
        p1      c3       1     50
        p2      c2       1     8
        p1      c1       2     44
        p1      c2       2     4

  joinTb: prodId  name  price  storeId  date  amt
          p1      bolt  10     c1       1     12
          p2      nut   5      c1       1     11
          p1      bolt  10     c3       1     50
          p2      nut   5      c2       1     8
          p1      bolt  10     c1       2     44
          p1      bolt  10     c2       2     4
Join Indexes

  join index:
  product: id  name  price  jIndex         sale: rId  prodId  storeId  date  amt
           p1  bolt  10     r1,r3,r5,r6          r1   p1      c1       1     12
           p2  nut   5      r2,r4                r2   p2      c1       1     11
                                                 r3   p1      c3       1     50
                                                 r4   p2      c2       1     8
                                                 r5   p1      c1       2     44
                                                 r6   p1      c2       2     4
Join Indexes
 Traditional indexes map the value to a list of
record ids. Join indexes map the tuples in the join
result of two relations to the source tables.
 In data warehouse cases, join indexes relate the
values of the dimensions of a star schema to rows
in the fact table.
 For a warehouse with a Sales fact table and dimension
city, a join index on city maintains for each distinct city
a list of RIDs of the tuples recording the sales in the
city
 Join indexes can span multiple dimensions
What to Materialize?
• Store in warehouse results useful for common queries

Example: materialize the total sale table

  day 1:      c1   c2   c3        day 2:      c1   c2   c3
        p1    12        50              p1    44    4
        p2    11    8

  materialized total sale:        further aggregates derived from it:
        c1   c2   c3                c1 = 67, c2 = 12, c3 = 50, total = 129
  p1    56    4   50                p1 = 110, p2 = 19
  p2    11    8
Cube Operation
• SELECT date, product, customer, SUM(amount)
  FROM SALES
  CUBE BY date, product, customer
• Need to compute the following group-bys:
– (date, product, customer),
– (date, product), (date, customer), (product, customer),
– (date), (product), (customer),
– ()
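CUBE BY is the syntax of the original data-cube proposal; in SQL:1999 and most current systems the same computation is written with GROUP BY CUBE:

  SELECT date, product, customer, SUM(amount)
  FROM sales
  GROUP BY CUBE (date, product, customer);  -- computes all 8 group-bys at once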
Cuboid Lattice
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cube.
 The top-most cuboid contains only one cell.

  (A,B,C,D)
  (A,B,C) (A,B,D) (A,C,D) (B,C,D)
  (A,B) (A,C) (A,D) (B,C) (B,D) (C,D)
  (A) (B) (C) (D)
  (all)
Cuboid Lattice

[Figure: the lattice for the sale example. The base cuboid (city, product, date) holds the per-day tables; it rolls up to (city, product), (city, date), and (product, date); these roll up to (city) – c1 = 67, c2 = 12, c3 = 50 – (product) – p1 = 110, p2 = 19 – and (date); the apex "all" = 129. Use a greedy algorithm to decide what to materialize.]
Efficient Data Cube Computation
 Materialization of data cube
 Materialize every cuboid, none, or some
 Algorithms for selecting which cuboids to materialize, based on size, sharing, and access frequency:
– Type/frequency of queries
– Query response time
– Storage cost
– Update cost
Dimension Hierarchies
• Client hierarchy: city → state → region

  cities: city  state  region
          c1    CA     East
          c2    NY     East
          c3    SF     West
Dimension Hierarchies

[Figure: the cuboid lattice extended by a roll-up along the client hierarchy – rolling up the city-level cuboids gives state-level cuboids: (state), (state, date), (state, product), (state, product, date).]
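A sketch of this roll-up in SQL, joining the cities table above to a hypothetical city-level cuboid sale_by_city(city, product, date, amt):

  -- (city, product, date) rolled up to (state, product, date)
  SELECT c.state, s.product, s.date, SUM(s.amt) AS amt
  FROM sale_by_city s
  JOIN cities c ON s.city = c.city
  GROUP BY c.state, s.product, s.date;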
Cube Computation - Array Based Algorithm
 An MOLAP approach:
 the base cuboid is stored as a multidimensional array
 read in a number of cells to compute partial cuboids
Cube computations

[Figure: a three-dimensional array with dimensions A, B, C; from the base cuboid {ABC} the algorithm computes {AB}, {AC}, {BC}, then {A}, {B}, {C}, and { } (ALL).]
View and Materialized Views
 View
 derived relation defined in terms of base
(stored) relations
 Materialized views
 a view can be materialized by storing the tuples
of the view in the database
 index structures can be built on the materialized
view
View and Materialized Views
 Maintenance is an issue for materialized
views
 recomputation
 incremental updating
Maintenance of materialized views
• “Deficit” departments
• To find all “deficit” departments:
– group by deptid
– join (deptid)
– select all dept.names with budget < sum(salary)
  Dept: DeptId  Name   Budget      Employee: EmpId  Lname   salary  DeptId
        1       CS     7500                  100    Kim     2500    1
        2       Math   5500                  200    Jabbar  2000    1
        3       Comm.  4500                  300    Smith   3000    1
                                             400    Brown   3500    2
                                             500    Lu      3000    2
Maintenance of materialized views
• Temp (relation):
  select DeptId, sum(salary) Real_Budget
  from Employee
  group by DeptId;
• select Name
  from Dept, Temp
  where Dept.DeptId = Temp.DeptId
  and Budget < Real_Budget;
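In systems with direct support (e.g. Oracle), Temp can be declared as a materialized view rather than recomputed by hand; a sketch (the refresh clause is illustrative and, in Oracle, requires materialized view logs):

  CREATE MATERIALIZED VIEW Temp
  REFRESH FAST ON COMMIT   -- incremental refresh, where supported
  AS
  SELECT DeptId, SUM(salary) AS Real_Budget
  FROM Employee
  GROUP BY DeptId;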
Maintenance of materialized views
• assume the following update:
update Employee
set salary=salary+1000
where Lname=‘Jabbar’;
• recompute the whole view?
• use intermediate materialized results
(Temp), and update the view incrementally?
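A sketch of the incremental alternative: the update raises one salary by 1000, so only the affected group's sum needs adjusting (assuming Lname identifies a single employee):

  -- propagate the delta instead of recomputing the whole view
  UPDATE Temp
  SET Real_Budget = Real_Budget + 1000
  WHERE DeptId = (SELECT DeptId FROM Employee WHERE Lname = 'Jabbar');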
Managing
Metadata Repository
 Administrative metadata
 source database and their contents
 gateway descriptions
 warehouse schema, view and derived data
definitions
 dimensions and hierarchies
 pre-defined queries and reports
 data mart locations and contents
Metadata Repository
 Administrative metadata
 data partitions
 data extraction, cleansing, transformation rules,
defaults
 data refresh and purge rules
 user profiles, user groups
 security: user authorization, access control
Metadata Repository
• Business
– business terms & definition
– data ownership, charging
• Operational
– data layout
– data currency (e.g., active, archived, purged)
– use statistics, error reports, audit trails
Design
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?
Summary
 Data warehouse is not a software product
or application - it is an important
information processing system
architecture for decision making!
 Data warehouse combines a number of products, each of which has operational uses besides the data warehouse
Summary
 OLAP provides powerful and fast tools
for reporting on data:
 ROLAP vs. MOLAP
 Issues associated with data warehouses:
 new techniques: multidimensional database,
data cube computation, query optimization,
indexing, …
 data warehousing and application design:
vendors and application developers.
Current State of Industry
• Extraction and integration done off-line
– Usually in large, time-consuming, batches
• Everything copied at warehouse
– Not selective about what is stored
– Query benefit vs storage & update cost
• Query optimization aimed at OLTP
– High throughput instead of fast response
– Process whole query before displaying anything
Research
• Incremental Maintenance
• Data Consistency
• Data Expiration
• Recovery
• Data Quality
• Dynamic Data Warehouses (how to maintain a data warehouse over changing external data sources?)
Research
• Rapid Monitor Construction
• Materialization & Index Selection
• Data Fusion
• Integration of Text & Relational Data & Semistructured Data & .....
• Data Mining