Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehousing
ISYS 650
What is a data warehouse?
• A data warehouse is a subject-oriented, integrated,
nonvolatile, time-variant collection of data in support of
management’s decision.
– Subject-oriented: data is organized around major subjects of
the enterprise, such as sales, rather than individual
transactions, and is oriented to decision making.
– Integrated: the same piece of information collected from
various systems is referred to in only one way.
• Example: Gender: M, F; Male, Female; Sex: 0, 1
– Nonvolatile: Data is loaded into a data warehouse on a
scheduled basis.
– Time-variant: Historical data to support time-series and
trend analysis.
What is a Data Warehouse?
•
A physical repository where relational data are
specially organized to provide enterprise-wide,
cleansed data in a standardized format
•
“The data warehouse is a collection of integrated,
subject-oriented databases designed to support
DSS functions, where each unit of data is nonvolatile and relevant to some moment in time”
Need for Data Warehousing
• Separation of operational and informational systems and
data for improved performance.
Types of Data in a DW
• Current detailed data: consistent at the time
the data is extracted from the transaction
system.
• Old detailed data: need to be archived.
• Summarized data
• Metadata:
– A directory of what is in the warehouse.
– A guide to mapping data from transaction
database to data warehouse
Data Mart
A departmental data warehouse that stores
only relevant data
– Dependent data mart
A subset that is created directly from a data
warehouse
– Independent data mart
A small data warehouse designed for a
strategic business unit or a department
DW Framework
No data marts option
Applications
(Visualization)
Data
Sources
Access
ETL
Process
Select
Legacy
Metadata
Extract
POS
Transform
Enterprise
Data warehouse
Integrate
Other
OLTP/wEB
Data mart
(Finance)
Load
Replication
External
data
Data mart
(Engineering)
Data mart
(...)
/ Middleware
Data mart
(Marketing)
API
ERP
Routine
Business
Reporting
Data/text
mining
OLAP,
Dashboard,
Web
Custom built
applications
Data Integration and the Extraction,
Transformation, and Load (ETL) Process
Extraction, transformation, and load (ETL)
Transient
data source
Packaged
application
Data
warehouse
Legacy
system
Extract
Transform
Cleanse
Load
Data mart
Other internal
applications
Representation of Data in DW
• Dimensional Modeling – a retrieval-based system that
supports high-volume query access
• Star schema – the most commonly used and the simplest style
of dimensional modeling
– Contain a fact table surrounded by and connected to several
dimension tables
– Fact table contains the descriptive attributes (numerical values)
needed to perform decision analysis and query reporting
– Dimension tables contain classification and aggregation information
about the values in the fact table
• Snowflakes schema – an extension of star schema where the
diagram resembles a snowflake in shape
Multidimensionality
• Multidimensionality
The ability to organize, present, and analyze data by
several dimensions, such as sales by region, by product, by
salesperson, and by time (four dimensions)
• Multidimensional presentation
– Dimensions: products, salespeople, market segments, business units,
geographical locations, distribution channels, country, or industry
– Measures: money, sales volume, head count, inventory profit, actual
versus forecast
– Time: daily, weekly, monthly, quarterly, or yearly
Example:
Northwind Database
Examples of Sales Analysis
• Total sales by Product
• Sales related to Customer:
– Location: Sales by City, Country
• Sales related to Time:
– Quarterly, monthly, yearly Sales
• Sales related to Employee:
Analyze Sales Data
Detailed Business Data
• Total sales:
• Amount of each detail line:
Quantity*UnitPrice*Discount
• Sum (Quantity*UnitPrice*Discount)
• Total quantity sold:
• Sum(Quantity)
• Detailed business data:
– Quantity*UnitPrice*Discount
– Quantity
Dimensions for Data Analysis:
Factors relevant to the detailed business data
• Analyze sales by:
– Product, product category
– Location: City, State, Country
– Time:
• Quarterly, yearly sales
– Employee:
– And combinations of these dimensions:
• Ex: Product and Location, Product and Time
Data Warehouse Design
- Star Schema • Dimension tables
– contain descriptions about the subjects of the
business such as customers, employees, locations,
products, time periods, etc.
• Fact table
– contain detailed business data with links to
dimension tables.
Define Product Dimension
• Product Table:
– ProductID, ProductName, SupplierID, CategoryID,
QuantityPerUnit, UnitPrice, UnitsInStock,
UnitsOnOrder, ReorderLevel, Discontinued
• Product dimension table:
– ProductID, ProductName, CategoryID
Define Employee Dimenstion
• Employees Table: EmployeeID, LastName,
FirstName, Title, TitleOfCourtesy, BirthDate,
HireDate, Address, City, Region, PostalCode,
Country, HomePhone, etc.
• Employee Dimension:
– EmployeeID, FullName, Title, EmpCity
Define Location Dimension
• Customers table:
– CustomerID, CompanyName, ContactName,
ContactTitle, Address, City, Region, PostalCode, Country,
Phone, Fax
• Location dimension:
– LocationCode, City, Country
– Define Location Code: This is an artificial code created
to link detailed business data with the city and country.
– In the Northwind database, I used the Make Table query
to create a Location table from the Customers table
with City and Country fields. Then I used the Customers
table’s design view to add a LocationCode field with the
Auto Number data type.
Define Period Dimension
• Period:
– In the Orders table: OrderDate
– In the data warehouse we define Period to be: PeriodCode,
Year, Quarter
• OrderDate: 04-Jul-1996 -> 1996, 3, 7
• OrderDate: 20-Dec-1996 -> 1996, 4, 12
– In Access: Create view based on Orders table
• Year:Year(OrderDate); Month:Month(OrderDate)
• Quarter:
– Quarter: IIf([month]<=3,1,IIf([month]<=6,2,IIf([month]<=9,3,4)))
– Define Period Code:
• PeriodCode:Cstr(Year) + Cstr(Quarter)
• 1996, 3, 7 -> 19963
• 1996, 4, 12 -> 19964
Star Schema
Location
Dimension
LocationCode
City
Country
FactTable
LocationCode
PeriodCode
EmployeeID
ProductID
Qty
Amount
Product
Dimension
ProductID
ProductName
CategoryID
Employee
Dimension
EmployeeID
FullName
Title
EmpCity
Period
Dimension
PeriodCode
Year
Quarter
A Query to retrieve data for Fact Table
Transfer Data Between Access
Databases
• Create/Query/Design View
– 1. Create the query with the data to transfer
– 2. Click Make Table button
• Make table in the same database
• Make table in other database
– 3. Click Run