Transcript ch14

Chapter 14
The Data Warehouse
Fundamentals of Database Management Systems
by
Mark L. Gillenson, Ph.D.
University of Memphis
Presentation by: Amita Goyal Chin, Ph.D.
Virginia Commonwealth University
John Wiley & Sons, Inc.
Chapter Objectives

Compare the data needs of transaction
processing systems with those of decision
support systems.

Describe the data warehouse concept and
list its main features.

Compare the enterprise data warehouse
with the data mart.
14-2
Chapter Objectives

Design a data warehouse.

Build a data warehouse, including the
steps of data extraction, data cleaning,
data transformation, and data loading.

Describe how to use a data warehouse
with online analytic processing and data
mining.
14-3
Chapter Objectives

List the types of expertise needed to
administer a data warehouse.

List the challenges in data warehousing.
14-4
Application Systems

Transaction Processing Systems (TPS)
 Everyday
application systems that support
banking and insurance operations, manage
the parts inventory on manufacturing
assembly lines, keep track of airline and hotel
reservations, support Web-based sales, etc.

Decision Support Systems (DSS)
 specifically
designed to aid managers in
decision-making tasks.
14-5
The Data Warehouse Concept

A data warehouse is a broad-based,
shared database for management decision
making that contains data that has been
accumulated over time.

Formally, a database warehouse is, “a
subject oriented, integrated, non-volatile,
and time variant collection of data in
support of management’s decisions.”
14-6
Characteristics of
Data Warehouse Data








The data is subject oriented
The data is integrated
The data is non-volatile
The data is time variant
The data must be high quality
The data may be aggregated
The data is often denormalized
The data is not necessarily absolutely current
14-7
The Data is Subject Oriented

Data warehouses are organized around
subjects, really the major entities of
concern in the business environment.
 Sales,
customers, orders, claims, accounts,
employees, other entities that are central to
the company’s business.
14-8
The Data is Integrated

Data about each of the subjects in the data warehouse is typically
collected from several of the company’s transactional databases,
each of which supports one or more applications that have
something to do with the particular subject.

All of the data about a subject must be organized or integrated in
such a way that it provides a unified, overall picture of all the
important details about the subject over time.

Data from disparate application databases must be transformed into
common measurements, codes, data types.
14-9
The Data is Non-Volatile

Once data is added to the data
warehouse, it doesn’t change.

It will never change. Changing it would be
like going back and rewriting history.
14-10
The Data is Time Variant

Data warehouse data, with its historic nature,
always includes some kind of a timestamp.

If we are storing sales data on a weekly or
monthly basis and we have accumulated ten
years of such historic data, each weekly or
monthly sales figure must be accompanied by a
timestamp indicating the week or month (and
year!) that it represents.
14-11
The Data Must Be
High Quality

Consider a section of a data warehouse in which the
subject is customer.

If there is a customer address misspelling in one
transactional file, when the data from that file is
integrated with the data from the other transactional files,
there will be some difficulty in reconciling whether the
two different addresses both represent one customer, or
whether they actually represent two different customers.

This must be reconciled as the data is integrated and
entered into the data warehouse.
14-12
The Data May Be Aggregated

The type of data that management requires for decision
making is generally summarized data.

The sheer volume of all the historic detail data would
make the data warehouse unacceptably huge in many
cases.

If the detail data was stored in the data warehouse, the
amount of time that it would take to summarize the data
for management every time a query was posed would
often be unacceptable.
14-13
The Data is Often
Denormalized

If a company is willing to tolerate the substantial
additional space taken up by the redundant
denormalized data, it can gain the advantage of the
improved query performance that redundancy provides
without paying the penalties of increased update time
and potential data integrity problems.

This works because the data integrity problems that can
be caused by redundant data only arise when the data is
updated. The historic data in the data warehouse will not
be updated.
14-14
The Data is Not Necessarily
Absolutely Current

Data warehouse data is updated at some
time interval -- weekly, monthly, etc.

Any changes since the last data
warehouse update are not recorded in it
until the next scheduled update.

Inconsequential when looking at long-term
trends.
14-15
Types of Data Warehouses

Enterprise Data
Warehouse (EDW)

Data Mart (DM)
14-16
Enterprise Data Warehouse

Large-scale; incorporates the data of an entire
company or of a major division, site, or activity of
a company.

A full scale EDW is built around several different
subjects.

Support a wide variety of DSS applications and
serve as a data resource with which company
managers can explore new ways of using the
company’s data to its advantage.
14-17
The Data Mart

Small-scale; designed to support a small
part of an organization.

A company will often have several DMs.

Are based on a limited number of subjects
(possibly one) and are constructed from a
limited number of transactional databases.
14-18
Which to Choose:
The EDW, the DM, or Both?

It depends from company to company.

Top-down development implies that the EDW was
created first and then later data was extracted from an
EDW to create one or more DMs.

A company that has deliberately or as a matter of
circumstance developed a series of independent DMs
may decide, in a bottom-up development fashion to build
an EDW out of the existing DMs.
14-19
Designing a Data Warehouse

Two characteristics of data warehouses are
central to any design:



The subject orientation.
The historic nature of the data.
Data warehouses are often referred to as
multidimensional databases because each
occurrence of the subject is referenced by an
occurrence of each of several dimensions or
characteristics of the subject, one of which is
time.
14-20
Multidimensional Databases

Two dimensions can easily be visualized on a
flat piece of paper.
14-21
Multidimensional Databases

Three dimensions can easily be visualized on a flat piece of paper as a
cube.

Four or more dimensions are more difficult to visualize.
14-22
Storing Multidimensional Data

There is much interest in storing
multidimensional data in relational databases.

The star schema.

Visual design in which the subject is in the middle and
the dimensions radiate outwards.

Have a “fact table” which represents the data
warehouse “subject” and several “dimension tables.”
14-23
General Hardware Company
Data Warehouse

Here is the General Hardware transactional database.
Salesperson Salesperson
Number
Name
SALESPERSON
Customer
Customer
Number
Name
CUSTOMER
Product
Product
Number
Name
PRODUCT
Commi ssion
Percentage
Year of H ire
Salesperson
Number
Unit Price
Office
Number
HQ City
Salesperson
Number
SALES
Customer Employee Employee
Number
Number
Name
CUSTOMER EMPLOYEE
Product
Number
Quantit y
Office
Number
OFFICE
Title
Telephone
Size
14-24
General Hardware Company
Data Warehouse

SALE is the fact table.


Like any relational
table, must have a
primary key.
Dimension tables:



SALESPERSON
PRODUCT
TIME PERIOD
14-25
General Hardware Company
Data Warehouse
14-26
Good Reading Bookstores
Data Warehouse

Do they need a data warehouse, since they
already store a date attribute?

Yes, for two reasons:

While the transactional database performs acceptably
with perhaps the last couple of months of data in it, its
performance would degrade to an unacceptable level
if we tried to keep ten years of data in it.

The kinds of management decision making that
require long-term historic sales data require
aggregate not daily data.
14-27
Good Reading Bookstores
Data Warehouse

SALE is the fact table.


Dimension tables:





Like any relational table,
must have a primary key.
BOOK
PUBLISHER
CUSTOMER
TIME PERIOD
Snowflake design

One dimension table (BOOK)
leads to another dimension
table (PUBLISHER).
14-28
Lucky Rent-A-Car Data
Warehouse

RENTAL is the fact table.


Dimension tables:





Does not contain aggregated
data.
CAR
MANUFACTURER
CUSTOMER
TIME PERIOD
Snowflake design

One dimension table (CAR)
leads to another dimension
table (MANUFACTURER). 14-29
What About a World Music
Association Data Warehouse?

There is already a Year attribute in the RECORDING
table.

The essence of the WMA data is historic.

By its nature, the amount of data in a WMA type
transactional database is much lower than the amount of
data in a Good Reading or Lucky-type transactional
database.

Since the nature of the WMA transactional database
blurs with what a WMA data warehouse would look like,
no WMA data warehouse is needed.
14-30
Building a Data Warehouse

Data Extraction

Data Cleaning

Data Transformation

Data Loading
14-31
Building a Data Warehouse:
Data Extraction

Process of copying the data from the transactional
databases in preparation for loading it into the data
warehouse.

This is not a one-time event.

The data is likely to come from several transactional
databases.

Some of the data entering into this process may come
from outside of the company (data enrichment).
14-32
Lucky Rent-A-Car with
Enrichment Data

In the CUSTOMER
table, Customer Age,
Customer Income, and
Customer Education is
the enrichment data.
14-33
Data Cleaning

Transactional data can have all kinds of
errors in it.

Data warehouses are very sensitive to
data errors
 Data
errors must be “cleaned” or “cleansed”
or “scrubbed” as the data is loaded into the
data warehouse.
14-34
Data Cleaning

There are two steps to cleaning transactional
data in preparation for loading it into a data
warehouse.

Identify the problem data.
• Due to the massive volume of data, this is typically done
using a program.

Fix it.
• Can be handled by using sophisticated artificial intelligence
programs or by creating exception reports for employees to
scrutinize.
14-35
Good Reading Bookstores
Before Data Cleaning
1
2
3
4
5
6
7
8
9
10
11
Customer Number Customer
Name
02847
Mervis
03185
Gomez
03480
Taylor
06837
Steven s
08362
Adams
12739
Gomez
13848
Lucas
15367
Tail or
15933
Chang
18575
Smith
21359
Sanche z
(a) CUSTOMER Tab le .
Street
123 Oak St.
345 Main Ave.
50 Elm R d.
876 Leslie Ln.
1200 Wall aby St.
345 Main Ave.
742 Ave. Lou is e
50 Elm R d.
48 Maple Ave .
390 Martin D r.
666 Ave. Bo li var
Book Number Customer
Number
1
426478
03480
2
077656
18575
3
365905
06837
4
645688
21359
5
474640
15367
6
426478
08362
7
276432
03480
8
365905
12738
9
276432
06837
10
327467
18575
11
426478
06837
(b) SALE Table.
Cit y
State
TN
Columbus OH
San Diego CA
Raleigh
NC
Brisbane
Columbus GA
Brussels
San Diego CA
Toronto
ON
Columbus RP
Santiago
Date
Pric e Quantit y
May 19, 2003
32.99
1
May 19, 2003
19.95
21
May 19, 2003
24.99
3
May 20, 2003
49.50
1
May 34, 2003 3200.99
1
Jun e 03, 2003
32.99
2
Jun e 04, 2003
30.00
1
Jun e 04, 2003
24.99
1
Jun e 05, 2003
30.00
5
Jun e 12, 2003
-32.99
2
Jun e 15, 2003
32.99
1
Country
USA
USA
USA
USA
Australi a
USA
Belgium
USA
Canada
USA
Chil e

Errors in
Customer:

Missing data in row 1, city is
blank.

Questionable
data - the state
for rows 2 & 6
should be the
same.
14-36
Good Reading Bookstores
Before Data Cleaning
1
2
3
4
5
6
7
8
9
10
11
Customer Number Customer
Name
02847
Mervis
03185
Gomez
03480
Taylor
06837
Steven s
08362
Adams
12739
Gomez
13848
Lucas
15367
Tail or
15933
Chang
18575
Smith
21359
Sanche z
(a) CUSTOMER Tab le .
Street
123 Oak St.
345 Main Ave.
50 Elm R d.
876 Leslie Ln.
1200 Wall aby St.
345 Main Ave.
742 Ave. Lou is e
50 Elm R d.
48 Maple Ave .
390 Martin D r.
666 Ave. Bo li var
Book Number Customer
Number
1
426478
03480
2
077656
18575
3
365905
06837
4
645688
21359
5
474640
15367
6
426478
08362
7
276432
03480
8
365905
12738
9
276432
06837
10
327467
18575
11
426478
06837
(b) SALE Table.
Cit y
State
TN
Columbus OH
San Diego CA
Raleigh
NC
Brisbane
Columbus GA
Brussels
San Diego CA
Toronto
ON
Columbus RP
Santiago
Country
USA
USA
USA
USA
Australi a
USA
Belgium
USA
Canada
USA
Chil e
Date
Pric e Quantit y
May 19, 2003
32.99
1
May 19, 2003
19.95
21
May 19, 2003
24.99
3
May 20, 2003
49.50
1
May 34, 2003 3200.99
1
Jun e 03, 2003
32.99
2
Jun e 04, 2003
30.00
1
Jun e 04, 2003
24.99
1
Jun e 05, 2003
30.00
5
Jun e 12, 2003
-32.99
2
Jun e 15, 2003
32.99
1

Errors in
Customer:

Possible
Misspelling - do
rows 3 & 8
refer to the
same person?

Impossible
Data - row 10s
state “RP” is
wrong.
14-37
Good Reading Bookstores
Before Data Cleaning
1
2
3
4
5
6
7
8
9
10
11
Customer Number Customer
Name
02847
Mervis
03185
Gomez
03480
Taylor
06837
Steven s
08362
Adams
12739
Gomez
13848
Lucas
15367
Tail or
15933
Chang
18575
Smith
21359
Sanche z
(a) CUSTOMER Tab le .
Book Number Customer
Number
1
426478
03480
2
077656
18575
3
365905
06837
4
645688
21359
5
474640
15367
6
426478
08362
7
276432
03480
8
365905
12738
9
276432
06837
10
327467
18575
11
426478
06837
(b) SALE Table.
Street
123 Oak St.
345 Main Ave.
50 Elm R d.
876 Leslie Ln.
1200 Wall aby St.
345 Main Ave.
742 Ave. Lou is e
50 Elm R d.
48 Maple Ave .
390 Martin D r.
666 Ave. Bo li var
Cit y
State
TN
Columbus OH
San Diego CA
Raleigh
NC
Brisbane
Columbus GA
Brussels
San Diego CA
Toronto
ON
Columbus RP
Santiago
Date
Pric e Quantit y
May 19, 2003
32.99
1
May 19, 2003
19.95
21
May 19, 2003
24.99
3
May 20, 2003
49.50
1
May 34, 2003 3200.99
1
Jun e 03, 2003
32.99
2
Jun e 04, 2003
30.00
1
Jun e 04, 2003
24.99
1
Jun e 05, 2003
30.00
5
Jun e 12, 2003
-32.99
2
Jun e 15, 2003
32.99
1
Country
USA
USA
USA
USA
Australi a
USA
Belgium
USA
Canada
USA
Chil e

Errors in SALE:

Questionable data
- is the book
quantity of 21 in
row 2 correct?

Impossible/Out-ofRange Data - row
5 indicates that a
single book costs
$3,200.99.
14-38
Good Reading Bookstores
Before Data Cleaning
1
2
3
4
5
6
7
8
9
10
11
Customer Number Customer
Name
02847
Mervis
03185
Gomez
03480
Taylor
06837
Steven s
08362
Adams
12739
Gomez
13848
Lucas
15367
Tail or
15933
Chang
18575
Smith
21359
Sanche z
(a) CUSTOMER Tab le .
Street
123 Oak St.
345 Main Ave.
50 Elm R d.
876 Leslie Ln.
1200 Wall aby St.
345 Main Ave.
742 Ave. Lou is e
50 Elm R d.
48 Maple Ave .
390 Martin D r.
666 Ave. Bo li var
Book Number Customer
Number
1
426478
03480
2
077656
18575
3
365905
06837
4
645688
21359
5
474640
15367
6
426478
08362
7
276432
03480
8
365905
12738
9
276432
06837
10
327467
18575
11
426478
06837
(b) SALE Table.
Cit y
State
TN
Columbus OH
San Diego CA
Raleigh
NC
Brisbane
Columbus GA
Brussels
San Diego CA
Toronto
ON
Columbus RP
Santiago
Date
Pric e Quantit y
May 19, 2003
32.99
1
May 19, 2003
19.95
21
May 19, 2003
24.99
3
May 20, 2003
49.50
1
May 34, 2003 3200.99
1
Jun e 03, 2003
32.99
2
Jun e 04, 2003
30.00
1
Jun e 04, 2003
24.99
1
Jun e 05, 2003
30.00
5
Jun e 12, 2003
-32.99
2
Jun e 15, 2003
32.99
1
Country
USA
USA
USA
USA
Australi a
USA
Belgium
USA
Canada
USA
Chil e

Errors in SALE:

Apparently
Incorrect Data there is no
customer number
12738, as stated
in row 8.

Impossible Data row 10 shows a
negative price for
a book, which is
impossible.
14-39
Data Transformation

As the data is extracted from the transactional
databases, it must go through several kinds of
data transformations on its way to the data
warehouse:

Data from different transactional databases being
merged to form the data warehouse tables.

Data will often be aggregated as it is being extracted
from the transactional databases and prepared for the
data warehouse.
14-40
Data Transformation

Units of measure used for attributes in different transactional
databases must be reconciled as they are being merged into
common data warehouse tables.

Coding schemes used for attributes in different transactional
databases must be reconciled as they are being merged into
common data warehouse tables.

Sometimes values from different attributes in transactional
databases are combined into a single attribute in the data
warehouse (e.g., employee name).
14-41
Data Loading

After all of the extracting, cleaning, and
transforming, the data is ready to be
loaded into the data warehouse.

A schedule for regularly updating the data
warehouse must be put in place.
14-42
Using a Data Warehouse

Online analytic processing (OLAP)

Data Mining
14-43
Online Analytic Processing

A decision support methodology based on
viewing data in multiple dimensions.

There are many OLAP systems on the
market today.

The OLAP environment’s multidimensional
data is very well suited for querying and
for multi-time period trend analyses.
14-44
Online Analytic Processing

Drill-Down


Slice


Going back to the database and retrieving finer levels
of data detail than you have already retrieved.
A subset of the data that focuses on a single value of
one of the dimensions.
Pivot or Rotation

Merely a matter of interchanging the data dimensions.
14-45
Online Analytic Processing

A slice of the
patient data
cube.
14-46
Data Mining

The searching out of hidden knowledge in the
company’s data that can give the company a
competitive advantage in its marketplace.

Due to the massive volume of data warehouse
data, data mining must be done by software.

Case-based learning
 Decision trees
 Neural networks
 Genetic algorithms
14-47
Data Mining Application:
Market Based Analysis

Consider the data collected by a supermarket as it checks out its
customers by scanning the bar codes on the products they’re
purchasing.

The company might have software study the collected market
baskets, each of which is literally the goods that a particular
customer bought in one trip to the store.

The software might try to discover whether certain items “fall into”
the same market basket more frequently than would otherwise be
expected.

Then the items often bought in the same shopping trip can be
placed next to each other in the store to remind someone buying
one that they might also need the other.
14-48
Data Mining: Lucky Rent-A Car
Manufa cture r
Customer
Class
Name
Cost Number Age
1 Compact
Ford
320 884730
54
2 Luxury
Lincoln
850 528262
45
3 Full -Size
Gene ral Motors 489 109565
48
4 Sub- Compact Toyot a
159 532277
25
5 Luxury
Lincoln
675 155434
42
6 Compact
Chrysle r
360 965578
64
7 Mid-Size
Nissan
429 688632
31
8 Luxury
Lincoln
925 342786
47
9 Full -Size
Gene ral Motors 480 385633
51
10 Compact
Toyot a
230 464367
64
11 Luxury
Jaguar
1170 528262
45
12 Sub- Compact Nissan
89 759930
29
13 Full -Size
Ford
335 478432
57
14 Full -Size
Chrysle r
328 207867
29
CAR/RENTAL/CUSTOMER

Inco me
58,000
158,000
62,000
34,000
125,000
47,500
43,000
95,000
72,000
200,000
158,000
28,000
53,500
162,000
Educa tion
B.A.
M.B.A.
B.S.
High School
Ph.D .
High School
M.B.A.
M.A.
B.S.
M.A.
M.B.A.
B.A.
B.S.
Ph.D .
A data mining application may look for patterns in
the data.

Rows 2, 5, 8, and 11 all involve rentals of luxury class
cars with high-cost (revenue to the company) figures. 14-49
Data Mining: Lucky Rent-A Car
Manufa cture r
Customer
Class
Name
Cost Number Age
1 Compact
Ford
320 884730
54
2 Luxury
Lincoln
850 528262
45
3 Full -Size
Gene ral Motors 489 109565
48
4 Sub- Compact Toyot a
159 532277
25
5 Luxury
Lincoln
675 155434
42
6 Compact
Chrysle r
360 965578
64
7 Mid-Size
Nissan
429 688632
31
8 Luxury
Lincoln
925 342786
47
9 Full -Size
Gene ral Motors 480 385633
51
10 Compact
Toyot a
230 464367
64
11 Luxury
Jaguar
1170 528262
45
12 Sub- Compact Nissan
89 759930
29
13 Full -Size
Ford
335 478432
57
14 Full -Size
Chrysle r
328 207867
29
CAR/RENTAL/CUSTOMER

Inco me
58,000
158,000
62,000
34,000
125,000
47,500
43,000
95,000
72,000
200,000
158,000
28,000
53,500
162,000
Educa tion
B.A.
M.B.A.
B.S.
High School
Ph.D .
High School
M.B.A.
M.A.
B.S.
M.A.
M.B.A.
B.A.
B.S.
Ph.D .
If, as is the case here, these similar rentals were made by
people with similar demographics, a “cluster”, then future
marketing can concentrate on selling this product to people
with these demographics.
14-50
Administering a Data
Warehouse

The data warehouse requires a serious level of
management.

Data warehouse administrator - personnel
specialization in the management of the data
warehouse.

Three kinds of employee expertise is required:



Business expertise
Data expertise
Technical expertise
14-51
Administering a Data
Warehouse: Business
Expertise

An understanding of the company’s business
processes that underlies an understanding of
the company’s transactional data and
databases.

An understanding of the company’s business
goals to help in determining what data should be
stored in the data warehouse for eventual OLAP
and data mining purposes.
14-52
Administering a Data
Warehouse: Data Expertise

An understanding of the company’s transactional
data and databases for selection and integration
into the data warehouse.

An understanding of the company’s transactional
data and databases to design and manage data
cleaning and data transformation, as necessary.

Familiarity with outside data sources for the
acquisition of enrichment data.
14-53
Administering a Data
Warehouse: Technical
Expertise

An understanding of data warehouse
design principles for the initial design.

An understanding of OLAP and data
mining techniques so that the data
warehouse design will properly support
these processes.
14-54
Administering a Data
Warehouse: Technical
Expertise

An understanding of the company’s transactional
databases in order to manage or coordinate the
regularly scheduled appending of new data to
the data warehouse.

An understanding of how to handle very large
databases with their unique requirements for
security, backup and recovery, being split across
multiple disk devices, etc.
14-55
Challenges in Data
Warehousing

Data cleaning and finding more “dirty” data than
expected.

Problems associated with coordinating the
regular appending of new data from the
transactional databases to the data warehouse.

Difficulties in managing very large databases.

The challenge of building and maintaining the
data dictionary.
14-56
“Copyright 2004 John Wiley & Sons, Inc. All rights reserved. Reproduction or
translation of this work beyond that permitted in Section 117 of the 1976
United States Copyright Act without express permission of the copyright owner
is unlawful. Request for further information should be addressed to the
Permissions Department, John Wiley & Sons, Inc. The purchaser may make
back-up copies for his/her own use only and not for distribution or resale. The
Publisher assumes no responsibility for errors, omissions, or damages caused
by the use of these programs or from the use of the information contained
herein.”
14-57