Chapter 1 The Data Warehouse


Data Warehousing
Minsoo Lee
Dept. Computer Science and Engineering
Ewha Womans University
Heterogeneous Database Integration
[Figure: an integration system connecting the World Wide Web, digital libraries, scientific databases, and personal databases]
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
Data Warehouse Basics
• A data warehouse is not an end in itself but part of the BI infrastructure
• Business Intelligence (BI)
  – Data Warehouse or Data Mart
  – On-Line Analytical Processing (OLAP)
  – Data Mining
Business Intelligence
• Why?
  – Lack of BI is an enormous competitive disadvantage
• Organizations with purely operational systems
  – Unable to extract meaningful information from large volumes of data
• The business of chess
  – "What shall I do next?" = strategic thinking = chess
  – The business environment is full of unknowns
  – The business strategist predicts the behavior of business nouns
Business Intelligence
• BI
  – Helps develop strategy
  – Must be able to anticipate future conditions
  – Needs to understand the past
Business Intelligence loop
• Operational environment
• Data Warehouse/Data Mart
• Decision Support Systems (DSS)
[Figure: Business Intelligence Loop. Operational systems (CRM, Accounting, Finance, HR) feed extraction, transformation, and cleansing into the data warehouse (data storage); decision support (OLAP, data mining, reports) serves the business strategist.]
Data Warehouse Architecture
[Figure: Data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the data warehouse (reconciled data), which has a metadata repository and monitoring & administration. The warehouse and data marts serve OLAP servers, which feed the tools: analysis, query/reporting, and data mining.]
Features of the Data Warehouse
• "A Data Warehouse is a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions."
  – W.H. Inmon
Subject Orientation
• Transaction-oriented systems structure data in a way that optimizes processing of transactions (normalization)
  – DW is concerned with the business nouns (customers, products, sales, etc.)
• Operational data is distributed across multiple applications
  – DW gathers all data in one place
Subject Orientation
[Figure: Operational systems (transaction oriented), such as Accounts Payable, Order Processing, and Accounts Receivable, keep data in separate files (File 1 to File 6); the data warehouse (subject oriented) organizes it as Customer Data, Product Data, and Sales Data.]
Integration
• Forms a single cohesive environment
• Data cleansing and data transformation
• Data cleansing
  – Removing errors from the input stream
  – A good cleansing process can improve the quality of the operational environment
  – Debate on the appropriate action when detecting errors: correct them in the operational environment as well?
  – Cannot detect all errors
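The cleansing step above can be sketched as a filter over the input stream. This is only an illustration: the record layout (`customer`, `quantity`) and the two validation rules are assumptions, not part of the slides. Rejected records are kept together with their reasons, so they can be reported back to the owners of the operational system if that is the chosen policy.

```python
def cleanse(records):
    """Split an input stream into clean rows and rejected rows with reasons."""
    clean, rejected = [], []
    for rec in records:
        errors = []
        # Hypothetical rule 1: a customer name must be present.
        name = (rec.get("customer") or "").strip()
        if not name:
            errors.append("missing customer name")
        # Hypothetical rule 2: quantity must be a non-negative integer.
        qty = rec.get("quantity")
        if not isinstance(qty, int) or qty < 0:
            errors.append("quantity must be a non-negative integer")
        if errors:
            rejected.append((rec, errors))
        else:
            clean.append({"customer": name, "quantity": qty})
    return clean, rejected


# Made-up input stream for illustration.
stream = [
    {"customer": " IBM ", "quantity": 3},
    {"customer": "", "quantity": 5},
    {"customer": "Acme", "quantity": -1},
]
clean, rejected = cleanse(stream)
```

Only rules that someone thought to write down are enforced here, which is one concrete reading of "cannot detect all errors".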
Integration
• Data Transformation
  – Receives input streams and transforms them into one consistent format
  – Issue of defining inconsistencies
    • Description
    • Encoding
    • Units of Measure
    • Format
Integration
Inconsistency | Field         | Sales Voucher        | Purchase Order       | Inventory
Description   | Customer Name | I.B.M                | IBM                  | International Business Machines
Encoding      | Sex           | 1 = Male, 2 = Female | M = Male, F = Female | X = Male, Y = Female
Units         | Cable Length  | Centimeters          | Yards                | Inches
Formats       | Key           | Character(10)        | Integer              | pic ‘99999999’
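The four kinds of inconsistency above can be resolved with one lookup or conversion each. The sketch below is an assumption-laden illustration: the source names (`source_a`, `source_b`, `source_c`), the record fields, and the chosen target formats (the name `IBM`, `M`/`F` codes, centimeters, an 8-digit zero-padded key) are hypothetical choices, not something the slides prescribe.

```python
# Description: map every known spelling to one canonical name.
NAME_ALIASES = {
    "I.B.M": "IBM",
    "IBM": "IBM",
    "International Business Machines": "IBM",
}

# Encoding: each source has its own code table, so decode per source.
SEX_CODES = {
    "source_a": {"1": "M", "2": "F"},
    "source_b": {"M": "M", "F": "F"},
    "source_c": {"X": "M", "Y": "F"},
}

# Units of measure: convert everything to centimeters.
CM_PER_UNIT = {"cm": 1.0, "inch": 2.54, "yard": 91.44}


def transform(source, rec):
    """Convert one source record into the (assumed) warehouse format."""
    return {
        "customer_name": NAME_ALIASES.get(rec["customer_name"], rec["customer_name"]),
        "sex": SEX_CODES[source][str(rec["sex"])],
        "cable_length_cm": round(rec["cable_length"] * CM_PER_UNIT[rec["unit"]], 2),
        # Format: keys arrive as text or integers; store as 8-digit text.
        "key": f"{int(rec['key']):08d}",
    }


row_a = transform("source_a", {"customer_name": "I.B.M", "sex": 1,
                               "cable_length": 100, "unit": "cm", "key": "42"})
row_c = transform("source_c", {"customer_name": "International Business Machines",
                               "sex": "Y", "cable_length": 10, "unit": "inch", "key": 42})
```

Decoding per source matters because the same code could mean different things in different systems; a single global code table would hide such a clash.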
Nonvolatile
• Once data is written, it remains unchanged in the DW
• Effectively a read-only database system
• The DB can eliminate background processes used for recovery (e.g., redo log)
[Figure 1-5: Nonvolatile]
Time-Variant Collection of Data
• Adds a time dimension to the data warehouse
• Creates snapshots of the organization
• Can view patterns and trends over time
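A minimal sketch of the snapshot idea, assuming a hypothetical operational state that maps each customer to a balance. Every snapshot is appended with its date and earlier rows are never updated, which is what makes trends over time visible.

```python
from datetime import date

# Warehouse rows: (snapshot_date, customer, balance). Rows are only appended.
snapshots = []


def take_snapshot(snapshot_date, operational_state):
    """Copy the current operational values into the warehouse with a date."""
    for customer, balance in operational_state.items():
        snapshots.append((snapshot_date, customer, balance))


def history(customer):
    """Return the dated values for one customer, oldest first."""
    return sorted((d, b) for d, c, b in snapshots if c == customer)


# Made-up month-end states of the operational system.
take_snapshot(date(2024, 1, 31), {"IBM": 100})
take_snapshot(date(2024, 2, 29), {"IBM": 140})
take_snapshot(date(2024, 3, 31), {"IBM": 120})

ibm = history("IBM")
# Change between consecutive snapshots.
trend = [later[1] - earlier[1] for earlier, later in zip(ibm, ibm[1:])]
```

The operational system only ever holds the latest balance; the trend exists only because the warehouse kept the earlier snapshots.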
Supporting Management’s Decision
• The DW user is the business strategist
• Static reports generated by the IT dept. can no longer satisfy the business strategist
• Requires appropriate, timely performance
• Design the user interface for the business strategist
Decision Support Systems
• DSS extends from the extraction of the data, through the DW, to the presentation to the business strategist
• Reporting, OLAP, Data Mining
Reporting
• The higher the level of the business strategist, the higher the level of summarization required
• Enterprise-class reporting
  – Rapid development
  – Easy maintenance
  – Easy distribution
  – Internet enabled
On-Line Analytical Processing
• Leverages the time-variant characteristic so the strategist can look both back and ahead in time
• MOLAP (Multi-dimensional OLAP)
• ROLAP (Relational OLAP)
• HOLAP (Hybrid OLAP)
• Typical OLAP interface: spreadsheet style
• Rotation, roll-up, drill-down
• Supports “what-if” analysis: manipulate variables
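The three operations named above (rotation, roll-up, drill-down) all amount to aggregating the same facts along a different choice of dimensions. The sketch below uses plain Python and made-up fact rows; the dimension names and the `aggregate` helper are hypothetical, and a real OLAP server would do this far more efficiently.

```python
from collections import defaultdict

DIMENSIONS = ("year", "quarter", "region")

# Made-up fact rows: (year, quarter, region, units).
facts = [
    (2023, "Q1", "East", 10),
    (2023, "Q2", "East", 15),
    (2023, "Q1", "West", 7),
    (2024, "Q1", "East", 12),
]


def aggregate(rows, group_by):
    """Sum units per combination of the requested dimensions."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


# Roll-up: summarize at the coarsest time level.
by_year = aggregate(facts, ["year"])
# Drill-down: add the quarter to see more detail.
by_year_quarter = aggregate(facts, ["year", "quarter"])
# Rotation (pivot): view the same data by region first.
by_region_year = aggregate(facts, ["region", "year"])
```

A "what-if" analysis could reuse the same helper on a modified copy of `facts`, for example with the units of one region scaled up.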
Data Mining
• Data mining allows us to see the hidden picture
• Find Association, Classification
  – Association: relationship among data
  – Classification: segment into different classes
• Use a subset of the data: its size depends on how much the data characteristics vary
• Methods
  – Decision Trees, Neural Networks, Genetic Modeling
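As one concrete reading of "association", the sketch below computes the two standard measures of an association rule, support and confidence, over a made-up set of market-basket transactions. The data and function names are illustrative assumptions; the slides do not name a specific algorithm.

```python
# Made-up transactions: each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


bread_milk_support = support({"bread", "milk"})
bread_to_milk = confidence({"bread"}, {"milk"})
```

Here bread and milk appear together in 2 of 4 transactions, and 2 of the 3 bread purchases also include milk.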
Data Warehouse Schema
• Star Schema
• Fact Constellation Schema
• Snowflake Schema
Star Schema
• A single, large, central fact table and one table for each dimension
• Every fact points to one tuple in each of the dimensions and has additional attributes
• Does not capture hierarchies directly
Star Schema
Fact Table: Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City, State, Region
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc
• Benefits: easy to understand, easy to define hierarchies, reduces the number of physical joins
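The star schema above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table names (`store`, `product`, `period`, `sales_fact`) are lowercase renderings of the slide's tables, and all rows are made up. The query reaches each dimension with a single join from the fact table.

```python
import sqlite3

# In-memory database; columns follow the slide, data is invented.
star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE store   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                      city TEXT, state TEXT, region TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE period  (period_key INTEGER PRIMARY KEY, year INTEGER,
                      quarter TEXT, month INTEGER);
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store(store_key),
    product_key INTEGER REFERENCES product(product_key),
    period_key  INTEGER REFERENCES period(period_key),
    units INTEGER,
    price REAL
);
INSERT INTO store VALUES (1, 'Downtown', 'Seoul', 'Seoul', 'North'),
                         (2, 'Harbor', 'Busan', 'Busan', 'South');
INSERT INTO product VALUES (1, 'Cable');
INSERT INTO period VALUES (1, 2024, 'Q1', 1), (2, 2024, 'Q2', 4);
INSERT INTO sales_fact VALUES (1, 1, 1, 10, 2.5),
                              (1, 1, 2, 5, 2.5),
                              (2, 1, 1, 7, 2.5);
""")

# Units sold per region and year: one join per dimension used.
units_by_region_year = star.execute("""
    SELECT s.region, p.year, SUM(f.units)
    FROM sales_fact AS f
    JOIN store  AS s ON s.store_key  = f.store_key
    JOIN period AS p ON p.period_key = f.period_key
    GROUP BY s.region, p.year
    ORDER BY s.region, p.year
""").fetchall()
```

Because city, state, and region all live in the one store table, the region is available without any further join, at the cost of repeating those values for every store in the same city.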
Snowflake Schema
• Variant of the star schema model
• A single, large, central fact table and one or more tables for each dimension
• Dimension tables are normalized, i.e., dimension table data is split into additional tables
Snowflake Schema
Fact Table: Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City Key
City Dimension: City Key, City, State, Region
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc
• Drawbacks: time-consuming joins, slow report generation
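A reduced sketch of the snowflake variant, again using `sqlite3` with hypothetical lowercase table names and made-up rows. Only the store and city dimensions are modeled, to show where the extra join comes from: the region now sits in a separate city table.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
CREATE TABLE city  (city_key INTEGER PRIMARY KEY, city TEXT,
                    state TEXT, region TEXT);
-- The store dimension keeps only a key into the city table.
CREATE TABLE store (store_key INTEGER PRIMARY KEY, store_name TEXT,
                    city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales_fact (store_key INTEGER REFERENCES store(store_key),
                         units INTEGER, price REAL);
INSERT INTO city VALUES (10, 'Seoul', 'Seoul', 'North'),
                        (20, 'Busan', 'Busan', 'South');
INSERT INTO store VALUES (1, 'Downtown', 10), (2, 'Uptown', 10), (3, 'Harbor', 20);
INSERT INTO sales_fact VALUES (1, 10, 2.5), (2, 4, 2.5), (3, 7, 2.5);
""")

# Reaching the region takes two joins: fact -> store -> city.
units_by_region = snow.execute("""
    SELECT c.region, SUM(f.units)
    FROM sales_fact AS f
    JOIN store AS s ON s.store_key = f.store_key
    JOIN city  AS c ON c.city_key  = s.city_key
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
```

The city row for Seoul is stored once even though two stores share it, which is the normalization benefit; the second join is the cost the slide lists as a drawback.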
Fact Constellation
• Multiple fact tables share dimension tables
• This schema is viewed as a collection of stars, hence called a galaxy schema or fact constellation
• Sophisticated applications require such a schema
Fact Constellation
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Product Dimension: Product Key, Product Desc
Store Dimension: Store Key, Store Name, City, State, Region
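The sketch below builds a small fact constellation in `sqlite3`: two fact tables that share the store and product dimensions. Table names and rows are assumptions for illustration. Each fact table is aggregated on its own before the results are lined up per store; joining the two fact tables row by row would pair every sale with every shipment and inflate the sums.

```python
import sqlite3

galaxy = sqlite3.connect(":memory:")
galaxy.executescript("""
CREATE TABLE store   (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_desc TEXT);
-- Two fact tables reference the same dimension tables.
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store(store_key),
    product_key INTEGER REFERENCES product(product_key),
    period_key  INTEGER, units INTEGER, price REAL);
CREATE TABLE shipping_fact (
    shipper_key INTEGER,
    store_key   INTEGER REFERENCES store(store_key),
    product_key INTEGER REFERENCES product(product_key),
    period_key  INTEGER, units INTEGER, price REAL);
INSERT INTO store VALUES (1, 'Downtown'), (2, 'Harbor');
INSERT INTO product VALUES (1, 'Cable');
INSERT INTO sales_fact VALUES (1, 1, 1, 10, 2.5), (1, 1, 2, 5, 2.5), (2, 1, 1, 7, 2.5);
INSERT INTO shipping_fact VALUES (9, 1, 1, 1, 20, 0.3), (9, 2, 1, 1, 10, 0.3);
""")

# Aggregate each fact table separately, then align on the shared dimension.
sold_vs_shipped = galaxy.execute("""
    SELECT s.store_name,
           (SELECT COALESCE(SUM(units), 0) FROM sales_fact
             WHERE store_key = s.store_key) AS sold,
           (SELECT COALESCE(SUM(units), 0) FROM shipping_fact
             WHERE store_key = s.store_key) AS shipped
    FROM store AS s
    ORDER BY s.store_name
""").fetchall()
```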
The Future of Data Warehousing
• Multibillion dollar business!
• New SQL commands
  – Support complex analysis
• Performance
  – Efficient query processing via materialized views
• Better tools
  – Easy extraction, easy query formulation, better visualization of data
• Semistructured data
  – XML based data warehouses
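To illustrate the materialized-view item above: SQLite has no materialized views, so this sketch imitates one with an ordinary summary table that is rebuilt on demand. The table names and the `refresh_summary` helper are hypothetical, and the data is made up. Queries read the small precomputed table instead of scanning the fact table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales_fact (region TEXT, year INTEGER, units INTEGER);
INSERT INTO sales_fact VALUES ('North', 2024, 10), ('North', 2024, 5),
                              ('South', 2024, 7);
-- Stand-in for a materialized view: stored results of an aggregate query.
CREATE TABLE sales_by_region_year (region TEXT, year INTEGER, total_units INTEGER);
""")


def refresh_summary(conn):
    """Recompute the stored aggregate from the fact table in one transaction."""
    with conn:
        conn.execute("DELETE FROM sales_by_region_year")
        conn.execute("""
            INSERT INTO sales_by_region_year
            SELECT region, year, SUM(units)
            FROM sales_fact
            GROUP BY region, year
        """)


def read_summary(conn):
    """Answer the report from the precomputed table only."""
    return conn.execute("""
        SELECT region, year, total_units
        FROM sales_by_region_year
        ORDER BY region, year
    """).fetchall()


refresh_summary(db)
summary = read_summary(db)
```

The trade-off is freshness: rows loaded into `sales_fact` after a refresh do not show up in the summary until `refresh_summary` runs again. Database systems with real materialized views manage that refresh for you.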
Summary
• Why build a DW? So the business strategist can make a plan for the organization to thrive
• DW is a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions