Transcript DATA IN

Intro. to Data Warehouse
รศ.ดร. วรพจน์ กรี สุระเดช
Worapoj Kreesuradej, Ph.D.
Associate Professor
Data Mining & Data Exploration Laboratory (DME Lab),
Faculty of Information Technology,
King Mongkut's Institute of Technology Ladkrabang,
Web: www.it.kmitl.ac.th/dme
Email: [email protected]
Book

Paulraj Ponniah, Data Warehousing
Fundamentals, John Wiley & Sons, 2001.

Ralph Kimbal and Margy Ross, The Data
Warehouse Toolkit, John Wiley and Sons,
2002.
Definition of DW
“A collection of integrated, subjectoriented databases designed to supply
the information required for decisionmaking.”
- W. Inmon
 A decision support database that is
maintained separately from the
organization’s operational databases.
 A physical repository where relational
data are specially organized to provide
enterprise-wide, cleansed data in a
standardized format –E. Turban and etc.

R. Kimball’s definition of a DW

A data warehouse is a copy of
transactional data specifically
structured for querying and analysis.
Problem: Data Management
in Large Enterprises


Vertical fragmentation of informational
systems
Result of application (user)-driven
development of operational systems
Sales Planning
Suppliers
Num. Control
Stock Mngmt
Debt Mngmt
Inventory
...
...
...
Sales Administration
Finance
Manufacturing
...
Problem: Data Management
in Large Enterprises
 Two Approaches for accessing
data:
 Query-Driven (Lazy)
?
 Warehouse (Eager)
Source
Source
The Need for DW
 Query-driven (lazy, on-demand)
Clients
Integration System
Metadata
...
Wrapper
Source
Wrapper
Source
Wrapper
...
Source
Disadvantages of QueryDriven Approach



Delay in query processing
Inefficient and potentially expensive
for frequent queries
Competes with local processing at
sources
The Warehousing Approach
 Information
integrated in
advance
 Stored in wh
for direct
querying and
analysis
Extractor/
Monitor
Source
Clients
Data
Warehouse
Integration System
Metadata
...
Extractor/
Monitor
Source
Extractor/
Monitor
...
Source
Advantages of Warehousing
Approach



High query performance
Doesn’t interfere with local processing
at sources
Information copied at warehouse



Can modify, annotate, summarize,
restructure, etc.
Can store historical information
Security, no auditing
Characteristics of DW
Subject oriented Data are organized by how
users refer to it
Integrated
Non-volatile
Time variant
Inconsistencies are removed
in both nomenclature and
conflicting information; (i.e.
data are ‘clean’)
Read-only data. Data do not
change over time.
Data are time series, not
current status
Subject Oriented
Data Warehouse is designed around
“subjects” rather than processes
 A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System
 DW will have a Sales Subject Area

Subject Oriented
OLTP Systems
Retail Sales
System
Outlet Sales
System
Catalog Sales
System
Data Warehouse
Sales Subject Area
Subject-Oriented Sales Information
Integrated

Heterogeneous Source Systems

Need to Integrate source data

For Example: Product codes could
be different in different systems

Arrive at common code in DW
Integrated
 Information
integrated in
advance
 Stored in DW
for direct
querying and
analysis
Clients
Data
Warehouse
Integration System
Metadata
...
Extractor/
Monitor
Source
Extractor/
Monitor
Source
Extractor/
Monitor
...
Source
Non-Volatile

Operational update of data does not occur
in the data warehouse environment.

Does not require transaction
processing, recovery, and concurrency
control mechanisms

Requires only two operations in data
accessing:

initial loading of data and access of
data.
Non-Volatile(Read-Mostly)
Write
USER
OLTP
Read
USER
DW
Read
Time Variant

The time horizon for the data warehouse is
significantly longer than that of operational
systems.

Operational database: current value data.

Data warehouse data: provide information
from a historical perspective (e.g., past 510 years)
Time Variant

Most business
analysis has a
time component
Sales
2001

Trend Analysis
(historical data is
required)
2002 2003 2004
Data Warehousing
Process Overview
Data Warehousing
Process Overview

The major components of a data
warehousing process







Data sources
Data extraction
Data loading
Comprehensive Database /Data Store
Data Mart
Metadata
Middleware tools /information delivery
tools
ETL
• Data Extraction
• Data Cleaning and Transformation
•
Convert from legacy/host format to
warehouse format
Load
Sort, summarize, consolidate,
compute views, check integrity,
build indexes, partition
The ETL Process
Source
Systems
Staging
Area
Extract Transform Load
DW
Database
Data Staging Area
• A storage area where extracted data is
•
•
•
•
•
cleaned, transformed and deduplicated.
Initial storage for data
Need not be based on Relational model
Mainly sorting and Sequential
processing
Does not provide data access to users
Analogy – kitchen of a restaurant
ETL Process
Issues & Challenges
• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Different currencies, measurement units
• Ensuring data quality
Comprehensive Database
/Data Store

Mostly a relational DB
 Oracle,

DB2, Sybase, SQL Server
New DB design for special purpose of
DW (e.g., scale up, speed up, parallel
processing)
Data Warehouse Design
OLTP Systems are Data Capture Systems
 “DATA IN” systems
 DW are “DATA OUT” systems

OLTP
DW
Dimensional Modeling
Facts are stored in FACT Tables
 Dimensions are stored in
DIMENSION tables
 Dimension tables contains textual
descriptors of business
 Fact and dimension tables form a
Star Schema
 “BIG” fact table in center surrounded
by “SMALL” dimension tables

Star Schema
CUSTOMER
TIME
#
*
*
*
*
*
*
*
*
*
*
TIME_KEY
ORD ERD ATE
D AY_ OF_ WEEK
D AY_ NUMBER_IN _MONTH
D AY_ NUMBER_IN _YEAR
WEEK_N UMBER
MON TH
QUARTER
H OL IDAY_FLAG
FISC AL_YEAR
FISC AL_QUARTER
referenced by
referenced by
SALES
reference
#
#
#
*
*
*
reference
TIME_KEY
PRODUC T_KEY
CUSTOMER_ KEY
PRIC E
QUANTITY
SALES
reference
referenced by
PRODUCT
#
*
*
*
#
*
*
*
*
PRODUC T_KEY
PID
PNAME
PCN AME
CUSTOMER_ KEY
C ID
C NAME
STATE
C ITY
Star Schema
Data mart
Data mart = subset of DW for community
users, e.g. accounting department
 Sometimes exist as Multidimensional
Database
 Info mart = summarized data + report for
community users

Meta Data




Data about data
Needed by both information technology
personnel and users
IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
Users need to know entity/attribute
definitions; reports/query tools available;
report distribution information; help desk
contact information, etc.
Information Delivery Tools

Tools
 Query
& reporting
 OLAP
 Data
mining, visualization, segmentation,
clustering
 New developments: text mining, web mining
& personalization
 Mining multimedia data
Information Delivery Tools

Commercial tools
 Crystal

Report, Impromptu, WebFocus
Increasingly common mode of delivery:
Web-enabled
Data Warehouse Architecture


Data Flow Architecture
System Architecture
Data Flow Architecture
Data Flow Architecture
Data Flow Architecture

Operational data stores (ODS)
A type of database often used as an
interim area for a data warehouse,
especially for customer information files

MDB=Multidimensional databases
System Architectures

Three parts of the data warehouse



The data warehouse that contains the data
and associated software
Data acquisition (back-end) software that
extracts data from legacy systems and
external sources, consolidates and
summarizes them, and loads them into the
data warehouse
Client (front-end) software that allows
users to access and analyze data from the
warehouse
System Architectures
System Architectures
System Architecture
System Architecture
Data Warehouse Development

Data warehouse development
approaches



Inmon Model: EDW approach, Enterprisewide warehouse, top down
Kimball Model: Data mart approach, Data
mart, bottom up
Which model is best?


There is no one-size-fits-all strategy to data
warehousing
When properly executed, both result in an
enterprise-wide data warehouse, but with
different architectures
The Data Mart Strategy







The most common approach
Begins with a single mart and architected
marts are added over time for more subject
areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data
warehousing
Can perpetuate the “silos of information”
problem
Can postpone difficult decisions and
activities
Requires an overall integration plan
The Enterprise-wide
Strategy





A comprehensive warehouse is built initially
An initial dependent data mart is built using a
subset of the data in the warehouse
Additional data marts are built using subsets
of the data in the warehouse
Like all complex projects, it is expensive, time
consuming, and prone to failure
When successful, it results in an integrated,
scalable warehouse
DW Lifecycle (Ralph Kimball )
Data Warehouse Development

Some best practices for implementing a
data warehouse (Weir, 2002):





Project must fit with corporate strategy and
business objectives
There must be complete buy-in to the
project by executives, managers, and users
It is important to manage user expectations
about the completed project
The data warehouse must be built
incrementally
Build in adaptability
Data Warehouse Development

Some best practices for implementing a
data warehouse (Weir, 2002):





The project must be managed by both IT
and business professionals
Develop a business/supplier relationship
Only load data that have been cleansed and
are of a quality understood by the
organization
Do not overlook training requirements
Be politically aware