Populating Data Warehouse Structures
Data Warehouse
Why Data Warehouse and OLAP?
The Intelligence stage of decision making needs correct data
The data should also be clean, well organized, and fast to access
Often the data is only used for subsequent stages and should not be changed (decision makers are users of data, not generators of data)
The data for decision making comes from multiple sources: OLTP databases, XML files, flat files such as CSV, PDF, etc.
These sources need to be combined into one version of the truth
These needs cannot be easily served by traditional data sources, hence the concept of the data warehouse/data mart
Business Intelligence Process
(Pyramid diagram; the potential to support business decisions increases toward the top)
Making Decisions
Data Presentation: visualization techniques
Data Mining: information discovery
Data Exploration: statistical analysis, querying and reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: paper, files, information providers, database systems, OLTP
Roles, from the top of the pyramid to the bottom: End User, Business Analyst, Data Analyst, DBA
Data Warehouse: Definitions
Data warehouse
There is no single definition, as the term can encompass many aspects of data management. Some definitions found in the literature are:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
Defined in many different ways, but not rigorously
A decision support database that is maintained separately from the organization's operational database
A consistent database source that brings together information from multiple sources for decision support queries
Supports information processing by providing a solid platform of consolidated, historical data for analysis
Data warehousing
A process by which data from multiple sources are extracted, transformed, and loaded into a data warehouse in a planned operation
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
Aims at reliable and efficient processing of a large number of transactions and at ensuring data consistency
• OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Aims at efficient multidimensional processing of large data volumes
Fast, interactive answers to large aggregate queries
• Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
(Each row: OLTP vs. OLAP)
User: clerk, IT professional vs. knowledge worker
Function: day-to-day operations vs. decision support
DB design: application-oriented vs. subject-oriented
Data: current, isolated vs. historical, consolidated
View: detailed, flat relational vs. summarized, multidimensional
Usage: structured, repetitive vs. ad hoc
Unit of work: short, simple transaction vs. complex query
Access: read/write vs. read mostly
Operations: index/hash on primary key vs. lots of scans
# records accessed: tens vs. millions
# users: thousands vs. hundreds
DB size: 100 MB-GB vs. 100 GB-TB
Metric: transaction throughput vs. query throughput, response time
Data Warehousing Characteristics
Subject oriented
Integrated
Time variant (time series)
Nonvolatile
Web based
Relational/multidimensional
Client/server
Real-time
Include metadata
Data Warehouse Architecture
Inmon (Bill Inmon) Model: EDW approach
Kimball (Ralph Kimball) Model: data mart approach
Which model is best?
There is no one-size-fits-all strategy to data warehousing
One alternative is the hosted warehouse
Alternative DW Architectures
Teradata’s DW Architecture
DW Architecture
Categories/Types of Data Warehouses
Data Mart: a departmental data warehouse that stores only relevant data
Dependent Data Mart: a subset that is created directly from a data warehouse
Independent Data Mart: a small data warehouse designed for a strategic business unit or a department
Operational Data Stores: a type of database often used as an interim area for a data warehouse, especially for customer information files
Operational Data Marts: an operational data mart is a small-scale data mart typically used by a single department or functional area in an organization
Various OLAPs
ROLAP: Relational OLAP. The analysis cube is supported by a multi-dimensional data warehouse (or data mart) built on a relational database.
MOLAP: Multidimensional OLAP. The analysis cube is supported by a multi-dimensional data warehouse built on a dedicated multi-dimensional data store.
HOLAP: Hybrid OLAP. A combination of the above two to optimize performance.
Your task: Search the web to find the relative
advantages and disadvantages of each.
Data Warehousing Architectures
Issues to consider when deciding which architecture
to use:
Which database management system (DBMS) should be used?
Will parallel processing and/or partitioning be used?
Will data migration tools be used to load the data warehouse?
What tools will be used to support data retrieval and analysis?
Data Warehouse Development and Implementation
Data Warehouse and BI Life Cycle
(Phase I diagram): structured and unstructured data from OLTP sources feed the enterprise data warehouse, which supports business intelligence (reporting, ad-hoc query, analysis).
Data Warehouse and BI Life Cycle
(Phase II diagram): source data is extracted, transformed, and loaded (data integration), passing through an ODS into the enterprise data warehouse; reporting tools support a develop-discover-deliver cycle.
Data Warehouse and BI Life Cycle
(Diagram continued): the enterprise data warehouse feeds departmental data marts and subject-wise multi-dimensional cubes, which deliver business intelligence through the develop-discover-deliver cycle.
Data Warehouse and BI Life Cycle
Data Warehouse OLAP Cube (diagram): e.g., the cube cell holding total sales for Customer 3, Product C, for July.
From: http://oraclezine.blogspot.com/2009/01/data-warehousingand-olap-cube.html, accessed January 20, 2011
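To make the cube idea concrete, here is a minimal sketch of the same kind of lookup in Python with pandas; the DataFrame, its column names, and the sales figures are hypothetical and are not taken from the diagram's source.

```python
# Sketch: slicing an OLAP-style cube with pandas (hypothetical data and columns).
import pandas as pd

sales = pd.DataFrame({
    "customer": ["Customer 1", "Customer 3", "Customer 3", "Customer 3"],
    "product":  ["Product A",  "Product C",  "Product C",  "Product B"],
    "month":    ["July",       "July",       "July",       "August"],
    "amount":   [100.0,        250.0,        125.0,        80.0],
})

# Build a cube-like structure: total sales by customer x product x month.
cube = sales.pivot_table(index=["customer", "product"], columns="month",
                         values="amount", aggfunc="sum", fill_value=0)

# The single cell highlighted in the diagram: total sales for Customer 3,
# Product C, in July.
print(cube.loc[("Customer 3", "Product C"), "July"])  # 375.0
```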
Data Integration and the Extraction,
Transformation, and Load (ETL) Process
Extraction
reading data from a database
Transformation
Converting the extracted data from its source form into the form required by the data warehouse (or simply another database)
Load
Putting the data into the data warehouse
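A minimal sketch of the three ETL steps using only Python's standard library; the file name daily_sales.csv, the warehouse database, and the column names are hypothetical.

```python
# Sketch of the three ETL steps with Python's standard library and sqlite3.
import csv
import sqlite3

# Extract: read rows from a source file (could equally be an OLTP query).
with open("daily_sales.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: convert source values into the form the warehouse expects.
clean_rows = [
    (row["store_id"].strip(), row["sale_date"], float(row["amount"]))
    for row in raw_rows
    if row["amount"]  # drop records with a missing amount
]

# Load: put the transformed data into the warehouse table.
dw = sqlite3.connect("warehouse.db")
dw.execute("""CREATE TABLE IF NOT EXISTS sales_fact
              (store_id TEXT, sale_date TEXT, amount REAL)""")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean_rows)
dw.commit()
dw.close()
```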
Data Warehouse Schema
Dimensional Modeling
The Star Schema
Dimension tables contain the dimensions for analysis
Example: time, region, salesperson, etc.
Fact tables contain the measures and aggregates
Example: average sales, total commission, total sales, etc.
The Snowflake Schema
Very similar to the star schema, with a central fact table, but the dimensions are normalized into a hierarchy of tables.
Example: a listing agent is part of a listing company; one city can have multiple zip codes; etc.
Reduces redundant data but can be inefficient for queries that do not follow the hierarchy.
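The structural difference can be sketched in a few lines of Python; the listing-agent example below is hypothetical. In the star schema the dimension is one wide, denormalized table, while in the snowflake schema the company attributes are split into their own table and referenced by a key.

```python
# Star schema: one denormalized dimension table (company attributes repeated per agent).
agent_dim_star = [
    {"agent_key": 1, "agent_name": "J. Smith", "company_name": "Acme Realty", "company_city": "Dallas"},
    {"agent_key": 2, "agent_name": "A. Jones", "company_name": "Acme Realty", "company_city": "Dallas"},
]

# Snowflake schema: the dimension is normalized into a hierarchy of tables.
company_dim = [
    {"company_key": 10, "company_name": "Acme Realty", "company_city": "Dallas"},
]
agent_dim_snowflake = [
    {"agent_key": 1, "agent_name": "J. Smith", "company_key": 10},
    {"agent_key": 2, "agent_name": "A. Jones", "company_key": 10},
]
# Less redundancy, but queries now need an extra join from agent to company.
```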
Steps for DW OLAP Design
Decide on your information needs
Example: I want to know the total sales by month, by region, and by salesperson
Decide on the sources of data for each information need
Total sales is not available directly; it must be calculated from unit price and quantity sold
Most of the unit and quantity data are in OLTP, but some are in CSV files for stores not connected to the database
The monthly information can be derived from the date, which is in OLTP
The region information is also in OLTP transaction records, via a region code
The salesperson information is maintained in an Excel file and is coded by region
Declare the grain of the fact table (preferably at the most atomic level)
The grain is the lowest level of detail you are interested in. The finer the grain (tending towards atomic), the more dimensions you have.
An atomic grain is preferred because it allows easy roll-ups; however, it can take a lot of space and processing.
Often the grain of the fact table is much coarser, with the ability to drill down
Examples of grains
Sales for each customer, for each day, for each store
Monthly total sales for a particular region for all female customers
Add dimensions for "everything you know" about this grain
Add numeric measured facts true to the grain
Creating a Star* Schema
Identify the dimensions (typically, the attributes you analyze by). In our real estate listing example, these can be the city name, bedrooms, listing agent, etc.
Identify the measures. In our case, these can be the average price per sq. ft., the total number of houses in a city, the average price, etc.
Identify the attributes of each dimension. Attributes are the properties of a dimension. For example, if the listing agent is a dimension, then the first name, last name, phone number, etc., of a listing agent will be the attributes.
* The snowflake schema design is very similar, with the hierarchy of dimensions separated into additional tables.
Star Schema continued.
Create the dimension tables with a surrogate primary key (PK). Include all the necessary attributes.
Decide on the measures and calculations. Those will go in the fact table.
For each PK in the dimension tables, create a foreign key (FK) in the fact table.
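Putting the three steps together, here is a minimal sketch of a star schema in SQLite from Python, loosely following the real estate listing example; the table and column names are assumptions for illustration.

```python
# Sketch: a tiny star schema in SQLite with surrogate keys; table and column
# names follow the real-estate listing example but are otherwise hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE agent_dim (
    agent_key   INTEGER PRIMARY KEY,   -- surrogate key
    agent_id    TEXT NOT NULL,         -- application (natural) key
    first_name  TEXT, last_name TEXT, phone TEXT
);
CREATE TABLE city_dim (
    city_key  INTEGER PRIMARY KEY,     -- surrogate key
    city_name TEXT NOT NULL, zip_code TEXT
);
CREATE TABLE listing_fact (
    agent_key  INTEGER NOT NULL REFERENCES agent_dim(agent_key),
    city_key   INTEGER NOT NULL REFERENCES city_dim(city_key),
    list_price REAL, square_feet REAL   -- measures
);
""")
```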
Example of Star Schema
Example of Snowflake Schema
Implementing the Star Schema
1. Extract Data From Multiple Sources
2. Integrate, Transform, and Restructure Data
3. Load Data Into Dimension Tables and Fact Tables
The Star Schema Data Load
(Diagram) Heterogeneous data sources (the Northwind OLTP database, a financial system, and internal and external files) are extracted with DTS into the data warehouse staging area, transformed with DTS, and loaded into the Sales star and Inventory star schemas.
Verifying the Dimension Source Data
Verifying Accuracy of Source Data
Integrating data from multiple sources
Applying business rules
Checking structural requirements
Correcting Invalid Data
Managing Invalid Data
Transforming data
Rejecting invalid data
Reassigning data values
Saving invalid data to a log
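A minimal Python sketch of this verification flow, separating valid rows from invalid ones and saving the rejects to a log; the business rule, file names, and columns (buyer_name, reg_id) are hypothetical.

```python
# Sketch: separating valid and invalid dimension source rows and logging rejects.
import csv

VALID_REGIONS = {"2", "4", "6"}   # hypothetical business rule

def check(row):
    """Return None if the row is acceptable, else a reason for rejection."""
    if not row.get("buyer_name"):
        return "missing buyer_name"
    if row.get("reg_id") not in VALID_REGIONS:
        return "unknown reg_id"
    return None

valid, rejected = [], []
with open("buyers_source.csv", newline="") as f:
    for row in csv.DictReader(f):
        reason = check(row)
        if reason is None:
            valid.append(row)
        else:
            rejected.append((row, reason))

# Save invalid data to a log instead of silently dropping it.
with open("rejected_buyers.log", "w") as log:
    for row, reason in rejected:
        log.write(f"{reason}: {row}\n")
```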
Dimension Data Load Examples:
(Diagrams, condensed) DTS transformations standardize buyer data from several sources before the dimension load. The examples show buyer_name values such as "Barr, Adam" and "Chai, Sean" being split into buyer_first ("Adam", "Sean") and buyer_last ("Barr", "Chai") columns; region identifiers recorded as Roman numerals (II, IV) in one source being converted to the numeric reg_id values (2, 4) used elsewhere; buyer_code values (U999, A123, B456) being carried into the dimension; and buyer records from different sources (Barr, Chai, O'Melia, Smith, Paper) being merged into one consistent result set.
Maintaining Integrity of the Dimension
Assigning a Surrogate Key to Each Record
Defines the dimension’s primary key
Relates to the foreign key fields of the fact table
Loading One Record Per Application Key
Maintains uniqueness in the dimension
Depends on how you manage changing dimension data
Maintains integrity of the fact table
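A small Python sketch of loading one record per application key while assigning surrogate keys; the buyer_code application key and the sample rows are hypothetical.

```python
# Sketch: assigning one surrogate key per application key while loading a dimension.
source_rows = [
    {"buyer_code": "A123", "buyer_last": "Barr"},
    {"buyer_code": "B456", "buyer_last": "Chai"},
    {"buyer_code": "A123", "buyer_last": "Barr"},   # duplicate application key
]

dimension = {}          # application key -> dimension record
next_surrogate = 1

for row in source_rows:
    app_key = row["buyer_code"]
    if app_key not in dimension:                    # load one record per application key
        dimension[app_key] = {"buyer_key": next_surrogate, **row}
        next_surrogate += 1

# buyer_key is the surrogate primary key the fact table will reference.
print(list(dimension.values()))
```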
Managing Changing Dimension Data
One of the problems in a DW is dealing with data that changes between reloads and refreshes. OLTP is operational and thus does not face this problem. These changing areas are commonly called Slowly Changing Dimensions (SCD).
Dimensions with Changing Column Values
Inserts of new data
Updates of existing data
Slowly-Changing Dimension Design Solutions
Type 1: Overwrite the dimension record
Type 2: Write another dimension record
Type 3: Add attributes to the dimension record
Examining the Star Schema
(Diagram) The Sales star schema: a central fact table surrounded by its dimension tables.
Type 1: Overwriting the Dimension Record
Product dimension (product key, product name, product size, product package, product dept, product cat, product subcat, ...)
Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, ...
After: 001, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, ...
The existing record is changed in place.
Type 2: Writing Another Dimension Record
Product dimension (product key, product name, product size, product package, product dept, product cat, product subcat, effective_date, ...)
Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, ...
After: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, ...
       731, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, 10-15-1998, ...
A new record is added with a new surrogate key and effective date.
Type 3: Adding Attributes in the Dimension Record
Product dimension (product key, product name, product size, product package, product dept, product cat, product subcat, current product size date, previous product size, previous product size date, 2nd previous product size, 2nd previous product size date, ...)
Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, 11 oz., 03-20-1994, (null), (null), ...
After: 001, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, 10-15-1998, 10 oz., 05-01-1995, 11 oz., 03-20-1994, ...
Additional information is stored in an existing record.
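The three strategies can be summarized in a short Python sketch applied to the Rice Puffs example above; plain dictionaries stand in for dimension rows, and the function and column names are assumptions for illustration.

```python
# Sketch of the three slowly-changing-dimension strategies from the slides.
from copy import deepcopy

dim = [{"product_key": 1, "product_name": "Rice Puffs", "product_size": "10 oz.",
        "effective_date": "1995-05-01"}]

def scd_type1(dim, new_size):
    """Type 1: overwrite the existing record; history is lost."""
    dim[0]["product_size"] = new_size
    return dim

def scd_type2(dim, new_size, change_date, next_key):
    """Type 2: add a new record with a new surrogate key and effective date."""
    new_row = deepcopy(dim[0])
    new_row.update(product_key=next_key, product_size=new_size,
                   effective_date=change_date)
    return dim + [new_row]

def scd_type3(dim, new_size, change_date):
    """Type 3: keep the prior value in extra attributes of the same record."""
    row = dim[0]
    row["previous_size"], row["previous_size_date"] = row["product_size"], row["effective_date"]
    row["product_size"], row["effective_date"] = new_size, change_date
    return dim

print(scd_type2(deepcopy(dim), "12 oz.", "1998-10-15", next_key=731))
```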
Verifying the Fact Table Source Data
Verifying Accuracy of Source Data
Integrating data from multiple sources
Applying business rules
Checking structural requirements
Creating calculated fields
Correcting Invalid Data
Managing Invalid Data
Transforming data
Rejecting invalid data
Reassigning data values
Saving invalid data to a log
Assigning Foreign Keys
(Diagram) Source data: customer_id ALFI, product_id 123, order date 1/1/2000, quantity_sales 400, amount_sales 10,789.
Dimension tables: customer_dim (201, ALFI, Alfreds), product_dim (25, 123, Chai), time_dim (134, 1/1/2000).
Sales fact data: cust_key 201, prod_key 25, time_key 134, quantity_sales 400, amount_sales 10,789.
Each natural key in the source row is looked up in the corresponding dimension table and replaced by that dimension's surrogate key in the fact row.
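A minimal pandas sketch of the same key assignment, using the values from the diagram; the DataFrame layout is an assumption, not the DTS implementation described in these slides.

```python
# Sketch: replacing natural keys in the source data with surrogate keys from
# the dimension tables, mirroring the lookup in the diagram above.
import pandas as pd

source = pd.DataFrame([{"customer_id": "ALFI", "product_id": "123",
                        "order_date": "1/1/2000",
                        "quantity_sales": 400, "amount_sales": 10789}])

customer_dim = pd.DataFrame([{"cust_key": 201, "customer_id": "ALFI"}])
product_dim  = pd.DataFrame([{"prod_key": 25,  "product_id": "123"}])
time_dim     = pd.DataFrame([{"time_key": 134, "order_date": "1/1/2000"}])

fact = (source
        .merge(customer_dim, on="customer_id")
        .merge(product_dim,  on="product_id")
        .merge(time_dim,     on="order_date")
        [["cust_key", "prod_key", "time_key", "quantity_sales", "amount_sales"]])

print(fact)   # one fact row: 201, 25, 134, 400, 10789
```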
Defining Measures
(Diagram, condensed) Measures are loaded from the source system and additional measures are calculated during the load.
Source system data: customer_id (VINET, ALFI, HANAR), product_id (9GZ, 1KJ, 0ZA), price (.55, 1.10, .98), qty (32, 48, 9)
Fact table data: customer_key (100, 238, 437), product_key (512, 207, 338), qty (32, 48, 9), total_sales (17.60, 52.80, 8.82)
The total_sales measure is calculated as price × qty.
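A small Python sketch of the calculated measure, using the price and quantity values shown above; the dictionary layout is an assumption for illustration.

```python
# Sketch: deriving an additional measure (total_sales = price * qty) during the load.
rows = [
    {"customer_id": "VINET", "product_id": "9GZ", "price": 0.55, "qty": 32},
    {"customer_id": "ALFI",  "product_id": "1KJ", "price": 1.10, "qty": 48},
    {"customer_id": "HANAR", "product_id": "0ZA", "price": 0.98, "qty": 9},
]

for row in rows:
    row["total_sales"] = round(row["price"] * row["qty"], 2)

print([r["total_sales"] for r in rows])   # [17.6, 52.8, 8.82]
```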
Maintaining Data Integrity
Adhering to the Fact Table Grain
A fact table can only have one grain
You must load a fact table with data at the same level of detail
as defined by the grain
Enforcing Column Constraints
NOT NULL constraints
FOREIGN KEY constraints
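A minimal sketch of enforcing these constraints with SQLite from Python; the table definitions reuse names from the sales star example and are assumptions for illustration.

```python
# Sketch: enforcing NOT NULL and FOREIGN KEY constraints with SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
db.executescript("""
CREATE TABLE product_dim (prod_key INTEGER PRIMARY KEY, product_name TEXT NOT NULL);
CREATE TABLE sales_fact (
    prod_key       INTEGER NOT NULL REFERENCES product_dim(prod_key),
    quantity_sales INTEGER NOT NULL,
    amount_sales   REAL    NOT NULL
);
""")
db.execute("INSERT INTO product_dim VALUES (25, 'Chai')")
db.execute("INSERT INTO sales_fact VALUES (25, 400, 10789)")   # passes both constraints

try:
    db.execute("INSERT INTO sales_fact VALUES (99, 1, 5.0)")   # unknown prod_key
except sqlite3.IntegrityError as e:
    print("rejected:", e)                                      # FOREIGN KEY constraint failed
```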
Implementing Staging Tables
Centralize and Integrate Source Data
Break Up Complex Data Transformations
Facilitate Error Recovery
(Diagram) Staging area tables: market_stage, shipments_stage, sales_stage, inventory_stage
DTS Functionality
Accessing Heterogeneous Data Sources
Importing, Exporting, and Transforming Data
Creating Reusable Transformations and Functions
Automating Data Loads
Managing Metadata
Customizing and Extending Functionality
Defining DTS Packages
Identifies Data Sources and Destinations
Defines Tasks or Actions
Implements Transformation Logic
Defines Order of Operations
Identifying Package Components
Connections: Access Data Sources and Destinations
Tasks: Describe Data Transformations or Functions
Steps: Define the Order of Task Operations or Workflow
Global Variables: Store Data that Can Be Shared Across Tasks
Creating Packages
Using the DTS Import / Export Wizard
Perform ad-hoc table and data transfers
Develop a prototype package
Using DTS Package Designer
Edit packages created with the DTS Import/Export Wizard
Create packages with a wide range of functionality
Programming DTS Applications
Directly access the functionality of the DTS Object Model
Requires Microsoft Visual Basic or Microsoft Visual C++
Using DTS to Populate the Sales Star
Populating the Sales Star Dimensions
Populating the Sales Star Fact Table
Populating the Sales Star Dimensions
(Diagram) Product data in tab-delimited files is loaded with DTS into product_dim; the Northwind OLTP database is loaded with DTS into customer_dim; a SQL Server stored procedure populates time_dim via DTS.
Populating the Sales Star Fact Table
(Diagram) The sales data file is loaded with DTS into the sales_stage table; a second DTS step joins sales_stage with product_dim, customer_dim, and time_dim to load sales_fact.
Why not Just Excel Pivot Tables?
For small projects, pivot tables are excellent
However, if you insert or delete records, you have to refresh the pivot table and may have to change the table reference (though the Table feature makes this easier)
A calculation cannot be defined once for a field and automatically reflected for all records; you have to copy it to all the records
No pre-processing can be done; the pivot table is refreshed and recalculated every time, which can lead to inefficiencies
There is no built-in integration mechanism with a data source (it can be done by connecting to cube servers and ODBC-compliant databases, but that is an extra step and extra work)
Selective drill-down is difficult; it is mainly all or none (grouping can help, but you have to be skilled with pivot tables)
Data Warehouse Implementation
Implementation issues
Implementing a data warehouse is generally a
massive effort that must be planned and executed
according to established methods
There are many facets to the project lifecycle, and
no single person can be an expert in each area
Major Tasks for Successful Implementation of a DW
1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support
From Solomon, 2005
Data Warehouse Implementation
Implementation factors can be categorized into three criteria:
Organizational issues
Project issues
Technical issues
User participation in the development of data and
access modeling is a critical success factor in data
warehouse development
Data Warehouse Management
Data warehouse administrator (DWA)
A person responsible for the administration and management of a data warehouse
Security Concerns
"All eggs in one basket"
Unauthorized access
Usage Concerns
Users need to understand how to use the data correctly
Why was the data collected, how was it stored, and how do these affect the current use of the data?
Data Warehouse Management
Effective security in a data warehouse should focus
on four main areas:
Establishing effective corporate and security policies and
procedures
Implementing logical security procedures and techniques
to restrict access
Limiting physical access to the data center environment
Establishing an effective internal control review process
with an emphasis on security and privacy
Data Warehouse Benefits
Direct benefits of a data warehouse
Allows end users to perform extensive analysis
Allows a consolidated view of corporate data
Better and more timely information
Enhanced system performance
Simplification of data access
Data Warehouse Benefits
Indirect benefits result from end users using these
direct benefits
Enhance business knowledge
Present competitive advantage
Enhance customer service and satisfaction
Facilitate decision making
Help in reforming business processes
Data Warehouse Vendor Selection
Important guidelines for selecting a DW vendor
Financial strength
ERP linkages
Qualified consultants
Market share
Industry experience
Established partnerships
DW Development Best Practices
The project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives, managers, and users
It is important to manage user expectations about the completed project
The data warehouse must be built incrementally
Build in adaptability
The project must be managed by both IT and business professionals
Develop a business/supplier relationship
Only load data that have been cleansed and are of a quality understood by the organization
Do not overlook training requirements
Be politically aware
From Weir, 2002
DW Development: What NOT to do
Cultural issues being ignored
Inappropriate architecture
Unclear business objectives
Missing information
Unrealistic expectations
Low levels of data summarization
Low data quality
Believing promises of performance, capacity, and scalability
Believing that your problems are over when the data warehouse is up and running
Believing that DW database design is the same as transactional database design
Choosing a DW manager who is technology oriented rather than user oriented
Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video
Focusing on ad hoc data mining and periodic reporting instead of alerts
Delivering data with overlapping and confusing definitions