Data Warehouse - wongsteve.com

Download Report

Transcript Data Warehouse - wongsteve.com

MGS 8020
Business Intelligence
The Data Warehouse
& Relational Database Management System
Feb 5, 2015
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 1
Agenda
Data
Warehouse
Georgia State University - Confidential
Relational
Database
Designing &
Building the
Data
Warehouse
MGS8020_05.ppt/Feb 5, 2015/Page 2
The Data Warehouse
The Data Warehouse
•
is physically separated from all other operational systems
•
holds aggregated data and transactional data for management separate
from that data used for online transaction processing
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 3
Data Flow
Operational
Data Store
Legacy
Systems
Data
Warehouse
Personal
Data
Warehouse
Data
Mart
Metadata
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 4
The Data Warehouse
The Data Warehouse
•
is physically separated from all other operational systems
•
holds aggregated data and transactional data for management separate
from that data used for online transaction processing
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 5
Characteristics of a Data Warehouse
•
Subject Orientation
•
Data Integrated
Consistent Naming and Measurement Attributes
•
Time Variant
•
Nonvolatility
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 6
Executive Information System
Internal Source
Systems
Extract, Transformation and Load
Business Intelligence & Data Warehouse
External Data
Sources
Georgia State University - Confidential
Dashboard
Balanced Scorecard
KPI
Financial Metrics
Data Warehouse
Data Mart
Business Intelligence
Advanced Analytics
Business Objects
Cognos
Microstrategy
Brios
Microsoft Access
Etc.
SAS
Minitab
SPSS
MGS8020_05.ppt/Feb 5, 2015/Page 7
Data Warehouse Vendors
•
Business Objects
•
Cognos
•
Hyperion
•
IBM
•
Microsoft
•
NCR / Teradata
•
Oracle
•
SAS
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 8
Metadata
What is Metadata?
•
•
•
Data about Data
Without metadata, the data is meaningless
Provides consistency of the truth
Components of Metadata
•
•
•
•
•
•
•
Transformation Mapping
Extraction and Relationship History
Algorithms for Summarization (and calculations)
Data Ownership
Patterns of Warehouse Access
Business Friendly naming conventions
Status Information
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 9
Agenda
Data
Warehouse
Georgia State University - Confidential
Relational
Database
Designing &
Building the
Data
Warehouse
MGS8020_05.ppt/Feb 5, 2015/Page 10
Relational Database
A relational database is a collection of data items organized as a set of formally-described
tables from which data can be accessed or reassembled in many different ways without having
to reorganize the database tables. The relational database was invented by E. F. Codd at IBM
in 1970.
The standard user and application program interface to a relational database is the structured
query language (SQL). SQL statements are used both for interactive queries for information
from a relational database and for gathering data for reports.
A relational database is a set of tables containing data fitted into predefined categories. Each
table (which is sometimes called a relation) contains one or more data categories in columns.
Each row contains a unique instance of data for the categories defined by the columns. For
example, a typical business order entry database would include a table that described a
customer with columns for name, address, phone number, and so forth. Another table would
describe an order: product, customer, date, sales price, and so forth. A user of the database
could obtain a view of the database that fitted the user's needs. For example, a branch office
manager might like a view or report on all customers that had bought products after a certain
date. A financial services manager in the same company could, from the same tables, obtain a
report on accounts that needed to be paid.
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 11
Relational Database
When creating a relational database, you can define the domain of possible values in a data
column and further constraints that may apply to that data value. For example, a domain of
possible customers could allow up to ten possible customer names but be constrained in one
table to allowing only three of these customer names to be specifiable.
The definition of a relational database results in a table of metadata or formal descriptions of
the tables, columns, domains, and constraints. Meta is a prefix that in most information
technology usages means "an underlying definition or description." Thus, metadata is a
definition or description of data and metalanguage is a definition or description of language.
A database is a collection of data that is organized so that its contents can easily be accessed,
managed, and updated. The most prevalent type of database is the relational database, a
tabular database in which data is defined so that it can be reorganized and accessed in a
number of different ways. A distributed database is one that can be dispersed or replicated
among different points in a network. An object-oriented programming database is one that is
congruent with the data defined in object classes and subclasses.
SQL (Structured Query Language) is a standard interactive and programming language for
getting information from and updating a database. Although SQL is both an ANSI and an ISO
standard, many database products support SQL with proprietary extensions to the standard
language. Queries take the form of a command language that lets you select, insert, update,
find out the location of data, and so forth.
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 12
Internal Source
Systems
Extract, Transformation and Load
Business Intelligence Environment
External Data
Sources
Data Warehouse
Data Mart
SQL
RDBMS
Business Intelligence
Business Objects
Cognos
Microsoft Access
Etc.
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 13
Relational Database
• IBM DB2, DB2/400
• Microsoft SQL/Server
• Teradata
• Oracle
• Sybase
• Informix / Red Brick
• Microsoft Access
• MySQL
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 14
Relational Database
SQL
Request
RDBMS
BI Software
Application
Result
Set
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 15
SQL
SQL – Structured Query Language
1.
DDL – Data Definition Language
•
•
•
Create
Drop
Alter
2.
DML – Data Manipulation Language
•
•
•
•
Insert
Update
Delete
Select
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 16
SQL Select Statement
SELECT
column1, column2, . . .
FROM
table1, table2, . . .
WHERE
AND/OR
.....
criteria1
criteria2
ORDER BY
column1, column1, . . .
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 17
SQL Select Statement
Aggregation
SELECT
column1, column2, . . .
FROM
table1, table2, . . .
WHERE
AND/OR
.....
criteria1
criteria2
GROUP BY
column1, column1, . . .
HAVING
AND/OR
.....
criteria1
criteria2
ORDER BY
column1, column1, . . .
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 18
SQL – Example 1
SQL
CREATE
TABLE ADDR_BOOK ( NAME char(30),
COMPANY char(20),
E_MAIL char (25)
Output
Name
Company
Email
John Smith
Microsoft
[email protected]
Jeff Jones
Delta
[email protected]
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 19
SQL – Example 2
2a)
2b)
SQL
Table - Product
SELECT
ID
Name
Category
NAME,
I
Internet
A
COMPANY,
B
Browsers
A
E_MAIL
A
Application
Null
FROM
G
Graphics
Null
ADDR_BOOK
WHERE COMPANY = ‘Microsoft'
SQL
SELECT
Output
Name
ID,
Company
John Smith Microsoft
Email
NAME
[email protected]
from
PRODUCT
WHERE CATEGORY = NULL
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 20
SQL – Example 3
SQL
SELECT
ADDR_BOOK.NAME,
COMPANY.EMAIL
FROM
ADDR_BOOK,
COMPANY
WHERE ADDR_BOOK.EMPLOYEE_ID = COMPANY.EMPLOYEE_ID
Output
Name
Email
John Smith
[email protected]
Jeff Jones
[email protected]
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 21
SQL – Example 4
SQL
SQL
CREATE TABLE CUSTOMER (
SELECT
CUST_NO INTEGER,
ORDER.ORDER_NO,
CUSTOMER.NAME,
CUSTOMER.ADDRESS,
CUSTOMER.CITY,
CUSTOMER.ZIP_CIDE,
CUSTOMER.COUNTRY
FIRST_NAME CHAR(30),
LAST_NAME CHAR(30),
ADDRESS CHAR(50),
CITY CHAR(30),
STATE CHAR (2),
ZIP_CODE CHAR(9),
COUNTRY CHAR(20) )
FROM
ORDER, CUSTOMER
WHERE ORDER.CUST_NO =
CUSTOMER.CUST_NO
AND
CREATE TABLE ORDER (
ORDER_NO INTEGER,
ORDER.DATE_ENTERED = '1998-2011'
DATE_ENTERED DATE,
CUST_NO INTEGER )
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 22
SQL – Example 5
SQL
CREATE
TABLE ADDR_BOOK ( NAME char(30),
COMPANY char(20),
E_MAIL char (25)
Output
Name
Company
Email
John Smith
Microsoft
[email protected]
Jeff Jones
Delta
[email protected]
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 23
SQL – Example 6 – Referential Integrity
SQL
SQL
CREATE TABLE CUSTOMER (
CREATE TABLE ORDER_ITEMS (
CUST_NO INTEGER PRIMARY KEY,
ORDER_NO INTEGER,
FIRST_NAME CHAR(30),
ITEM_NO INTEGER,
LAST_NAME CHAR(30),
PRODUCT CHAR(30),
ADDRESS CHAR(50),
QUANTITY INTEGER,
CITY CHAR(30),
UNIT_PRICE MONEY )
ZIP_CODE CHAR(9),
COUNTRY CHAR(20) )
CREATE TABLE ORDER (
ALTER TABLE ORDER_ITEMS
ADD PRIMARY KEY PK_ORDER_ITEMS
(ORDER_NO, ITEM_NO)
ORDER_NO INTEGER PRIMARY KEY,
DATE_ENTERED DATE,
CUST_NO INTEGER REFERENCES
CUSTOMER (CUST_NO) )
ALTER TABLE ORDER_ITEMS
ADD FOREIGN KEY FK_ORDER_ITEMS_1
(ORDER_NO)
REFERENCES ORDER (ORDER_NO)
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 24
SQL – Example 7 – Index
When you have a primary key, you already have an implicitly (or explicitly)
defined unique index on the primary key columns. It's generally a good
idea to define non-unique indexes on the foreign keys.
SQL
CREATE UNIQUE INDEX PK_CUSTOMER ON CUSTOMER (CUST_NO)
CREATE UNIQUE INDEX PK_ORDER ON ORDER (ORDER_NO)
CREATE INDEX FK_ORDER_1 ON ORDER (CUST_NO)
CREATE UNIQUE INDEX PK_ORDER_ITEMS ON ORDER_ITEMS (ORDER_NO,
ITEM_NO)
CREATE INDEX FK_ORDER_ITEMS_1 ON ORDER_ITEMS (ORDER_NO)
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 25
Agenda
Data
Warehouse
Georgia State University - Confidential
Relational
Database
Designing &
Building
the Data
Warehouse
MGS8020_05.ppt/Feb 5, 2015/Page 26
Why Business Intelligence
1.
Improve consistency and accuracy of reporting
2.
Reduce stress on operational systems for reporting and analysis
3.
Faster access to information
4.
BI tools provide increased analytical capabilities
5.
Empowering the Business User
6.
Companies are realizing that data is a company’s most underutilized asset
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 27
ERM vs. DM
ERM - Entity Relationship Model
•
Remove redundancy
•
Efficiency of transactions
DM - Dimensional Model
•
Intuitive View of the Data
•
Efficiency of access and analysis
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 28
Dimensional Model
Dimension Table
Dimension Table
Primary_Key
Descriptive_Attribute_1
Descriptive_Attribute_2
Descriptive_Attribute_3
Descriptive_Attribute_4
Descriptive_Attribute_5
Descriptive_Attribute_6
Descriptive_Attribute_7
....
Dimension Table
Primary_Key
Descriptive_Attribute_1
Descriptive_Attribute_2
Descriptive_Attribute_3
Descriptive_Attribute_4
Descriptive_Attribute_5
Descriptive_Attribute_6
Descriptive_Attribute_7
....
Georgia State University - Confidential
Fact Table
Foreign_Key_1
Foreign_Key_2
Foreign_Key_3
Foreign_Key_4
Metric_1
Metric_2
....
Star Schema
Primary_Key
Descriptive_Attribute_1
Descriptive_Attribute_2
Descriptive_Attribute_3
Descriptive_Attribute_4
Descriptive_Attribute_5
Descriptive_Attribute_6
Descriptive_Attribute_7
....
Dimension Table
Primary_Key
Descriptive_Attribute_1
Descriptive_Attribute_2
Descriptive_Attribute_3
Descriptive_Attribute_4
Descriptive_Attribute_5
Descriptive_Attribute_6
Descriptive_Attribute_7
....
MGS8020_05.ppt/Feb 5, 2015/Page 29
Retail Sales Dimensional Model (Partial)
Product Dimension Table
Time Dimension Table
Time_Key (PK)
Date
Day_of_Week
Week_Number
Month
....
Etc.
Sales Fact Table
Store Dimension Table
Store_Key (PK)
Store_ID
Store_Name
Address
District
Floor_Plan
....
Etc.
Georgia State University - Confidential
Time_Key (FK)
Product_Key (FK)
Store_Key (FK)
Customer_Key
(FK)
Units
Revenue
Cost
...
Product_Key (PK)
SKU_Number
Description
Brand
Product_Category
Size
....
Etc.
Customer Dimension Table
Customer_Key (PK)
Customer_Name
Purchase_Profile
Credit_Profile
Demographic_Category
Address
....
Etc.
MGS8020_05.ppt/Feb 5, 2015/Page 30
Fact Table
Sales Fact Table
Time_Key (FK)
Product_Key (FK)
Store_Key (FK)
Customer_Key
(FK)
Units
Revenue
Cost
...
Georgia State University - Confidential
1.
Contains Foreign Keys that relate
to Dimension Tables
2.
Have a many-to-one relationship
to Dimension Tables
3.
Contains Metrics to be
aggregated
4.
Typically does not contain any
non-foreign key or non-metric
data elements
5.
Level of Granularity defines depth
and flexibility of analysis
MGS8020_05.ppt/Feb 5, 2015/Page 31
Dimension Table
Product Dimension Table
Product_Key (PK)
SKU_Number
Description
Brand
Product_Category
Size
....
Etc.
Georgia State University - Confidential
1.
Contains a Primary Key that
relates to the Fact Table(s)
2.
Has a one-to-many relationship to
the Fact Table(s)
3.
Contains Descriptive data used to
limit and aggregated metrics from
the Fact Table(s)
4.
Can sometimes contain preaggregated data
MGS8020_05.ppt/Feb 5, 2015/Page 32
Warehouse Architecture Specification
•
Common Sources
•
Common Dimensions
•
Common Business Rules
•
Common Semantics
•
Common Metrics
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 33
Time Dimension
Date (Primary Key) – a day, 365 per year
Week
Month
Quarter
Year
Fiscal Month
Fiscal Quarter
Fiscal Year
– defined by an end of week day
– January, February, March, . . .
– Q1: 01/01 – 03/31
Q2: 04/01 – 06/30
Q3: 07/01 – 09/30
Q4: 10/01 – 12/31
– 2000, 2001, 2002, 2003
– 4/4/5
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 34
Time Dimension
Date (Primary Key) – a day, 365 per year
Weekday/Weekend
Day of Week
– Monday, Tuesday, Wednesday, . . .
Season
– Winter, Spring, Summer, Fall
Holiday
– Labor Day, 4th of July, Memorial Day, . . .
...
Georgia State University - Confidential
MGS8020_05.ppt/Feb 5, 2015/Page 35