Data Warehousing Fundamentals
noynot@163.com
Course Objectives
After completing this course, you should be able to do the following:
Describe the role of business intelligence (BI) and data warehousing in today's marketplace
Describe data warehousing terminology and the various technologies that are required to implement a data warehouse
Explain the implementation and organizational issues surrounding a data warehouse project
Identify data warehouse modeling concepts
Explain the extraction, transformation, and loading processes for building a data warehouse
Course Objectives
Identify management and maintenance processes that are associated with a data warehouse project
Describe methods for refreshing warehouse data
Explain warehouse metadata concepts
Identify tools that can be employed at each stage of the data warehouse project
Describe user profiles and techniques for querying the warehouse
Identify methods and tools for accessing and analyzing warehouse data
Lessons
1. Business Intelligence and Data Warehousing
2. Defining Data Warehouse Concepts and Terminology
3. Planning and Managing the Data Warehouse Project
4. Modeling the Data Warehouse
5. Building the Data Warehouse: Extracting Data
6. Building the Data Warehouse: Transforming Data
7. Building the Data Warehouse: Loading Warehouse Data
8. Refreshing Warehouse Data
9. Leaving a Metadata Trail
10. Managing and Maintaining the Data Warehouse
Let’s Get Started
Lesson 1
Lesson 1 Objectives
After completing this lesson, you should be able to do the following:
Describe the role of business intelligence in today's marketplace
Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
Describe how extract processing for decision support querying led to the data warehouse solutions that are employed today
Explain why businesses are driven to employ data warehouse technology
What Is Business Intelligence?
“Business Intelligence is the process of transforming data into information and, through discovery, transforming that information into knowledge.”
— Gartner Group
Purpose of Business Intelligence
The purpose of business intelligence is to convert the volume of data into business value through analytical reporting.
[Diagram: a pyramid rising from Data to Information to Knowledge to Decision as volume decreases and value increases.]
Early Management
Information Systems
MIS systems provided business data.
Reports were developed on request.
Reports provided little analysis capability.
Decision support tools gave personal ad hoc access to data.
[Diagram: decision makers gain ad hoc access to operational reports from production platforms.]
Analyzing Data from
Operational Systems
Data structures are complex.
Systems are designed for high performance and throughput.
Data is not meaningfully represented.
Data is dispersed.
OLTP systems may be unsuitable for intensive queries.
Why OLTP Is Not Suitable
for Analytical Reporting
OLTP                                    Analytical Reporting
Information to support day-to-day       Historical information to analyze
service
Data stored at the transaction level    Data needs to be integrated
Database design: normalized             Database design: denormalized, star schema
Data Extract Processing
End-user computing offloaded from the operational environment
[Diagram: decision makers work on extracts of their own data taken from operational systems.]
Management Issues with
Data Extract Programs
[Diagram: extracts of extracts proliferate between operational systems and decision makers — an "extract explosion".]
Productivity Issues with
Extract Processing
Duplicated effort
Multiple technologies
Obsolete reports
No metadata
Data Quality Issues with
Extract Processing
No common time basis
Different calculation algorithms
Different levels of extraction
Different levels of granularity
Different data field names
Different data field meanings
Missing information
No data correction rules
No drill-down capability
Data Warehousing and
Business Intelligence
[Diagram: legacy data, operations data, and external data feed the enterprise data warehouse, which feeds data marts for analytical reporting.]
Advantages of Warehouse
Processing Environments
Controlled
Reliable
Quality information
Single source of data
[Diagram: internal and external systems feed the data warehouse, which serves decision makers.]
Advantages of Warehouse
Processing Environments
No duplication of effort
No need for tools to support many technologies
No disparity in data, meaning, or representation
No time period conflict
No algorithm confusion
No drill-down restrictions
Success Factors for a Dynamic
Business Environment
Know the business
Reinvent to face new challenges
Invest in products
Invest in customers
Retain customers
Invest in technology
Improve access to business information
Provide superior services and products
Be profitable
Business Drivers for
Data Warehouses
Provide supporting information systems
Get quality information:
• Reduce costs
• Streamline the business
• Improve margins
Technological Advances
Enabling Data Warehousing
Advances across the stack enable data warehousing:
• Hardware
• Operating system
• Database
• Query tools
• Applications
These include large databases, 64-bit architectures, indexing techniques, affordable and cost-effective open systems, robust warehouse tools, and sophisticated end-user tools.
The Difference Between the Two Kinds of Data
Summary
In this lesson, you should have learned how to:
Describe the role of business intelligence in today's marketplace
Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting
Describe how extract processing for decision support querying led to the data warehouse solutions employed today
Explain why businesses are driven to employ data warehouse technology
Practice 1-1 Overview
This practice covers the following topics:
Answering questions about data warehousing
Discussing how data warehousing meets business needs
Lesson 2
Defining Data Warehouse
Concepts and Terminology
Objectives
After completing this lesson, you should be able to do the following:
Identify a common, broadly accepted definition of a data warehouse
Describe the differences between dependent and independent data marts
Identify some of the main warehouse development approaches
Recognize some of the operational properties and common terminology of a data warehouse
Definition of a Data Warehouse
“A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions.”
— W.H. Inmon, Building the Data Warehouse (1991)
Definition of a Data Warehouse
“…a data warehouse is nothing more than the union of all the data marts…”
— Ralph Kimball
“A data warehouse is the implementation of an informational database used to store shared data sourced from operational databases. A typical data warehouse is a subject database that lets users discover information in vast stores of operational data, track and respond to business trends, and support business forecasting and planning.”
— DM Review
Definition of a Data Warehouse
“An enterprise-structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data.”
— Oracle's data warehouse definition
“A data warehouse is a process, not a project.”
— another perspective on the data warehouse
Data Warehouse Properties
A data warehouse is:
Subject-oriented
Integrated
Time-variant
Nonvolatile
Subject-Oriented
Data is categorized and stored by business subject rather than by application.
[Diagram: OLTP applications such as equity plans, shares, insurance, loans, and savings map to a single warehouse subject — customer financial information.]
Integrated
Data on a given subject is defined and stored once.
[Diagram: the savings, current accounts, and loans OLTP applications integrate into a single Customer subject in the data warehouse.]
Time-Variant
Data is stored as a series of snapshots, each representing a period of time.
Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
[Diagram: operational systems insert, update, delete, and read; the warehouse is loaded and then only read.]
Changing Warehouse Data
[Diagram: operational databases feed the warehouse database through a first-time load and then periodic refreshes; aged warehouse data is purged or archived.]
Data Warehouse Versus OLTP
Property            OLTP                     Data Warehouse
Response time       Subseconds to seconds    Seconds to hours
Operations          DML                      Primarily read-only
Nature of data      30-60 days               Snapshots over time
Data organization   Application              Subject, time
Size                Small to large           Large to very large
Data sources        Operational, internal    Operational, internal, external
Activities          Processes                Analysis
Usage Curves
Operational system is predictable
Data warehouse:
• Variable
• Random
Enterprisewide Warehouse
Large-scale implementation
Scoped to the entire business
Data from all subject areas
Developed incrementally
Single source of enterprisewide data
Synchronized enterprisewide data
Single distribution point to dependent data marts
Central Ideas of Data Warehouse Design
Choose a grain, or level of detail, appropriate to serve all the data marts
Ensure the design does not prevent data marts from using different technologies; it should accommodate multidimensional marts, statistics, mining, and exploration warehouses
Data Marts
A subset of the data in the data warehouse.
Most analytical activity in a BI environment takes place in the data marts. The data in each data mart is usually tailored to a particular function and need not be valid for other uses.
Data Warehouses Versus
Data Marts
Property             Data Warehouse     Data Mart
Scope                Enterprise         Department
Subjects             Multiple           Single subject, LOB
Data sources         Many               Few
Implementation time  Months to years    Months
Dependent Data Mart
[Diagram: operational systems, legacy data, flat files, operations data, and external data feed the data warehouse; dependent marketing, sales, finance, and HR data marts draw from the warehouse, optionally supplemented with external data.]
Independent Data Mart
[Diagram: operational systems, legacy data, flat files, operations data, and external data feed a sales or marketing data mart directly, with no intervening data warehouse.]
Features of a Data Mart
Not Real-Time Data
Consolidation and Cleansing
Warehouse Development Approaches
“Big bang” approach
Incremental approach:
• Top-down incremental approach
• Bottom-up incremental approach
“Big Bang” Approach
Analyze enterprise requirements → build the enterprise data warehouse → report in subsets or store in data marts
Top-Down Approach
Analyze requirements at the enterprise level
Develop a conceptual information model
Identify and prioritize subject areas
Complete a model of the selected subject area
Map to available data
Perform a source system analysis
Implement the base technical architecture
Establish metadata, extraction, and load processes for the initial subject area
Create and populate the initial subject area data mart within the overall warehouse framework
Bottom-Up Approach
Define the scope and coverage of the data warehouse, and analyze the source systems within this scope
Define the initial increment based on political pressure, assumed business benefit, and data volume
Implement the base technical architecture, and establish metadata, extraction, and load processes as required by the increment
Create and populate the initial subject areas within the overall warehouse framework
Incremental Approach
to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
[Diagram: each increment cycles through Strategy, Definition, Analysis, Design, Iterative Build, and Production.]
Data Warehousing
Process Components
Methodology
Architecture
Extraction, Transformation, and Load (ETL)
Implementation
Operation and Support
Methodology
Ensures a successful data warehouse
Encourages incremental development
Provides a staged approach to an enterprisewide warehouse:
• Safe
• Manageable
• Proven
• Recommended
Architecture
“Provides the planning, structure, and standardization needed to ensure integration of multiple components, projects, and processes across time.”
“Establishes the framework, standards, and procedures for the data warehouse at an enterprise level.”
— The Data Warehousing Institute
Extraction, Transformation,
and Load (ETL)
“Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.”
— DM Review, March 2001
[Diagram: data flows from the source through a staging area to the target.]
Implementation
Implementation of the data warehouse architecture proceeds incrementally: Increment 1, Increment 2, …, Increment n.
Operation and Support
Data access and reporting
Refreshing warehouse data
Monitoring
Responding to change
Phases of the
Incremental Approach
Each increment moves through the same phases: Strategy, Definition, Analysis, Design, Build, and Production.
Strategy Phase Deliverables
Business goals and objectives
Data warehouse purpose, objectives, and scope
Enterprise data warehouse logical model
Incremental milestones
Source systems data flows
Subject area gap analysis
Strategy Phase Deliverables
Data acquisition strategy
Data quality strategy
Metadata strategy
Data access environment
Training strategy
Summary
In this lesson, you should have learned how to:
Identify a common, broadly accepted definition of a data warehouse
Describe the differences between dependent and independent data marts
Identify some of the main warehouse development approaches
Recognize some of the operational properties and common terminology of a data warehouse
Practice 2-1 Overview
This practice covers the following topics:
Answering questions regarding data warehousing concepts and terminology
Discussing some of the data warehouse concepts and terminology
Lesson 3
Modeling the Data Warehouse
Objectives
After completing this lesson, you should be able to do the following:
Discuss data warehouse environment data structures
Discuss data warehouse database design phases:
• Defining the business model
• Defining the dimensional model
• Defining the physical model
Data Warehouse Modeling Issues
Among the main issues that data warehouse data modelers face are:
Different data types
Many ways to use warehouse data
Many ways to structure the data
Multiple modeling techniques
Planned replication
Large volumes of data
Data Warehouse Environment
Data Structures
The data modeling structures that are commonly found in a data warehouse environment are:
Third normal form (3NF)
Star schema
Snowflake schema
Star Schema Model
[Diagram: a central Sales fact table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …) joined to denormalized dimension tables: Product (Product_id, Product_desc, …), Store (Store_id, District_id, …), Time (Day_id, Month_id, Year_id, …), and Item (Item_id, Item_desc, …).]
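The star layout above can be sketched in a few lines of SQL. A minimal sketch using SQLite rather than the course's Oracle environment; table and column names follow the slide's example, and the sample rows are invented for illustration.

```python
# Build a tiny star schema: one fact table with foreign keys into
# denormalized dimension tables, then run a typical star query that
# constrains on dimensions and aggregates the fact.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim(product_id),
    store_id     INTEGER REFERENCES store_dim(store_id),
    day_id       INTEGER REFERENCES time_dim(day_id),
    sales_amount REAL,
    sales_units  INTEGER
);
""")
cur.execute("INSERT INTO product_dim VALUES (1, 'Ham Pizza')")
cur.execute("INSERT INTO store_dim VALUES (10, 100)")
cur.execute("INSERT INTO time_dim VALUES (20020102, 200201, 2002)")
cur.execute("INSERT INTO sales_fact VALUES (1, 10, 20020102, 10.0, 1)")
cur.execute("INSERT INTO sales_fact VALUES (1, 10, 20020102, 20.0, 2)")

# Star query: join the fact to two dimensions and aggregate the measure.
cur.execute("""
SELECT p.product_desc, t.year_id, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim p ON f.product_id = p.product_id
JOIN time_dim    t ON f.day_id     = t.day_id
GROUP BY p.product_desc, t.year_id
""")
result = cur.fetchall()
```

Because every dimension joins directly to the fact table, each query needs at most one join per dimension — the property the slides credit for star-schema query performance.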
Snowflake Schema Model
[Diagram: the Sales fact table (Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units) joins to dimensions that are themselves normalized: Store (Store_id, Store_desc, District_id) → District (District_id, District_desc); Time (Week_id, Period_id, Year_id); Product (Product_id, Product_desc); Item (Item_id, Item_desc, Dept_id) → Dept (Dept_id, Dept_desc, Mgr_id) → Mgr (Mgr_id, Dept_id, Mgr_name).]
Snowflake Schema Model
Direct use by some tools
More flexible to change
Provides for speedier data loading
Can become large and unmanageable
Degrades query performance
More complex metadata
[Diagram: a geography hierarchy normalized as Country → State → County → City.]
Data Warehouse Database
Design Phases
Phase 1: Defining the business model
Phase 2: Defining the dimensional model
Phase 3: Defining the physical model
Phase 1: Defining the Business Model
Performing strategic analysis
Creating the business model
Documenting metadata
Performing Strategic Analysis
Identify crucial business processes
Understand business processes
Prioritize and select the business processes to implement
[Diagram: candidate processes plotted on a matrix of business benefit (low to high) against feasibility (low to high).]
Creating the Business Model
Defining business requirements:
• Identifying the business measures
• Identifying the dimensions
• Identifying the grain
• Identifying the business definitions and rules
Verifying data sources
Business Requirements Drive
the Design Process
[Diagram: business requirements are the primary input to the design process; existing metadata, the production ERD model, and research are secondary inputs.]
Identifying Measures
and Dimensions
Measures: attributes that vary continuously, such as balance, units sold, cost, and sales.
Dimensions: attributes perceived as constant or discrete, such as product, location, time, and size.
Using a Business Process Matrix
[Sample business process matrix: business dimensions (customer, date, product, channel, promotion) mapped against business processes (sales, returns, inventory).]
Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?
Identifying Business Rules
Examples:
Location: geographic proximity bands (0-1 miles, 1-5 miles, > 5 miles)
Time: Month > Quarter > Year
Store: Store > District > Region
Product: classified by type (monitor — 15 inch, 17 inch, 19 inch; PC; server) and status (none, new, rebuilt, custom)
Documenting Metadata
Documenting metadata should include:
Documenting the design process
Documenting the development process
Providing a record of changes
Recording enhancements over time
Metadata Documentation Approaches
Automated
• Data modeling tools
• ETL tools
• End-user tools
Manual
Phase 2: Defining the Dimensional Model
Identify fact tables:
• Translate business measures into fact tables
• Analyze source system information for additional measures
Identify dimension tables
Link fact tables to the dimension tables
Model the time dimension
Star Dimensional Modeling
[Diagram: the Sales fact table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …) at the center, joined to the Product (Product_id, Product_desc, …), Time (Day_id, Month_id, Period_id, Year_id), Store (Store_id, District_id, …), and Item (Item_id, Item_desc, …) dimension tables.]
Fact Table Characteristics
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly
Can contain base, derived, and summarized data
Are typically additive
Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
[Diagram: the Sales fact table with Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, …]
Dimension Table Characteristics
Dimension tables have the following characteristics:
Contain textual information that represents the attributes of the business
Contain relatively static data
Are joined to a fact table through a foreign key reference
Star Dimensional
Model Characteristics
The model is easy for users to understand.
Primary keys represent a dimension.
Nonforeign key columns are values.
Facts are usually highly normalized.
Dimensions are completely denormalized.
Fast response to queries is provided.
Performance is improved by reducing table joins.
End users can express complex queries.
Support is provided by many front-end tools.
Using Time in the Data Warehouse
Defining standards for time is critical.
Aggregation based on time is complex.
The Time Dimension
Time is critical to the data warehouse. A consistent representation of time is required for extensibility.
Where should the element of time be stored? [Diagram: the Sales fact table linked to a Time dimension.]
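One common answer is to precompute a time dimension with one row per day, keyed by a surrogate day_id, rather than storing raw dates in the fact table. A minimal sketch; the column names (day_id, month_id, quarter, year_id) are illustrative choices, not a prescribed layout.

```python
# Generate a day-grain time dimension covering a date range.
# Surrogate day_id keys are derived from the calendar date (YYYYMMDD),
# so every fact row can join to a consistent representation of time.
from datetime import date, timedelta

def build_time_dimension(start, end):
    rows, day = [], start
    while day <= end:
        rows.append({
            "day_id": int(day.strftime("%Y%m%d")),      # surrogate key
            "month_id": day.year * 100 + day.month,      # e.g. 200201
            "quarter": (day.month - 1) // 3 + 1,
            "year_id": day.year,
        })
        day += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2002, 1, 1), date(2002, 1, 31))
```

Precomputing calendar attributes once keeps time-based aggregation (by month, quarter, year) a simple GROUP BY on dimension columns instead of date arithmetic in every query.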
Using Data Modeling Tools
Tools with a GUI enable definition, modeling, and reporting.
Avoid a mix of modeling techniques caused by:
• Development pressures
• Developers who lack knowledge
• No strategy
Determine a strategy.
Write and publish it formally.
Make it available electronically.
Phase 3: Defining the
Physical Model
Translate the dimensional design to a physical model for implementation.
Define the storage strategy for tables and indexes.
Perform database sizing.
Define the initial indexing strategy.
Define the partitioning strategy.
Update the metadata document with physical information.
Physical Model Design Tasks
Define naming and database standards.
Perform database sizing.
Develop initial indexing strategy.
Develop data partition strategy.
Define storage parameters.
Set initialization parameters.
Use parallel processing.
Define summary data.
Determine hardware architecture.
Database Object Naming Conventions
Develop a reasonable list of abbreviations.
List all the objects' names, and work with the user community to define them.
Resolve name disputes.
Document your naming standards in the metadata document.
Plan for the naming standards to be a living document.
Architectural Requirements
Scalability
Manageability
Availability
Extensibility
Integration
Flexibility
[Diagram: requirements balance user, budget, business, and technology drivers.]
Strategy for Architecture Definition
Obtain existing architecture plans.
Obtain existing capacity plans.
Document existing interfaces.
Prepare capacity plan.
Prepare technical architecture.
Document operating system requirements.
Develop recovery plans.
Develop security and control plans.
Create architecture.
Create technical risk assessment.
Hardware Requirements
SMP
Cluster
MPP
Hybrids (employing both SMP and MPP)
Making the Right Choice
Requirements differ from operational systems.
Benchmark
• Available from vendors
• Develop your own
• Use realistic queries
Scalability is important.
Storage and Performance Considerations
Database sizing
• Test Load Sampling
Data partitioning
• Horizontal partitioning
• Vertical partitioning
Indexing
• B-Tree indexes
• Bitmap indexes
• Bitmap-join indexes
Star query optimization
• Star transformation
Database Sizing
Sizing influences capacity planning and
systems environment management.
Sizing is required for:
• The database
• Other storage areas
Sizing is not an exact science.
Techniques vary.
Test Load Sampling
Analyze a representative sample of the data, chosen using proven statistical methods. Ensure that the sample reflects:
Test loads for different periods
Day-to-day operations
Seasonal data and worst-case scenarios
Indexes and summaries
Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Data partitioning provides ease of:
• Restructuring
• Reorganization
• Removal
• Recovery
• Monitoring
• Management
• Archiving
• Indexing
Horizontal Partitioning
Table and index data are split by:
• Time
• Sales region or person
• Geography
• Organization
• Line of business
Candidate columns appear in a WHERE clause.
Analysis determines requirements.
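Splitting by time can be sketched as a routing rule: each row lands in the partition whose range covers its month, in the spirit of a database's range partitioning (Oracle's VALUES LESS THAN bounds, for example). The quarterly boundaries and partition names below are invented for illustration.

```python
# Route rows to horizontal partitions by a month_id column (YYYYMM).
# Each bound is an exclusive upper limit, so bisect finds the first
# partition whose range still covers the value.
import bisect

BOUNDS = [200204, 200207, 200210, 200301]   # exclusive upper bounds
NAMES  = ["q1_2002", "q2_2002", "q3_2002", "q4_2002"]

def partition_for(month_id):
    i = bisect.bisect_right(BOUNDS, month_id)
    if i == len(BOUNDS):
        raise ValueError(f"no partition covers month {month_id}")
    return NAMES[i]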
Vertical Partitioning
You can use vertical partitioning when:
• It improves the speed of query and update actions
• Users require access to specific columns
• Some data is changed infrequently
• Descriptive dimension text may be better moved away from the dimension itself
Partitioning Methods
Range partitioning
List partitioning
Hash partitioning
Composite partitioning
• Composite range-hash partitioning
• Composite range-list partitioning
Index partitioning
Indexing
Indexing is used for the following reasons:
It is a huge cost saving, greatly improving performance and scalability.
It can replace a full table scan with a quick read of the index followed by a read of only those disk blocks that contain the rows needed.
B-Tree Index
Most common type of indexing
Used for high cardinality columns
Designed for few rows returned
Bitmap Indexes
Provide performance benefits and storage savings
Store values as 1s and 0s
Use instead of B-tree indexes when:
• Tables are large
• Columns have relatively low cardinality
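The idea behind a bitmap index can be shown in a few lines: for each distinct value of a low-cardinality column, keep one bit per row, so a multi-predicate WHERE clause becomes a bitwise AND/OR. A toy sketch only, with invented sample data; a real database stores these bitmaps compressed.

```python
# Toy bitmap index: Python ints serve as arbitrary-length bit vectors,
# one bitmap per distinct column value, one bit per row.
from collections import defaultdict

def build_bitmap_index(column_values):
    index = defaultdict(int)
    for row_no, value in enumerate(column_values):
        index[value] |= 1 << row_no   # set this row's bit for its value
    return index

regions = ["east", "west", "east", "north", "west"]
status  = ["new", "new", "old", "new", "old"]
by_region = build_bitmap_index(regions)
by_status = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'new'  ->  AND the two bitmaps.
hits = by_region["east"] & by_status["new"]
matching_rows = [i for i in range(len(regions)) if hits >> i & 1]
```

This is why the slide restricts bitmaps to low-cardinality columns: the index needs one bitmap per distinct value, so a high-cardinality column (where a B-tree shines) would explode the number of bitmaps.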
Bitmap Join Indexes
A bitmap index for the join of two or more tables:
They are new to Oracle9i.
They provide better performance and storage savings.
Parallelism
[Diagram: a join of the Sales and Customers tables split across parallel execution servers P1, P2, and P3.]
Using Summary Data
Designing summary tables offers the following
benefits:
Provides fast access to precomputed data
Reduces use of I/O, CPU, and memory
Summary
In this lesson, you should have learned how to:
Describe data warehouse environment data structures
Define the business model:
• Performing strategic analysis
• Creating the business model
• Identifying business rules
Define the dimensional model:
• Star dimensional model characteristics
Define the physical model:
• Physical model design tasks
• Architectural and hardware requirements
• Storage and performance considerations
Practice 3-1 Overview
This practice covers the following topics:
Specifying true or false to a series of statements
Completing a series of sentences accurately
Practicing identifying a simple business model
Identifying indexing methods
Lesson 4
Building the Data Warehouse:
Extracting Data
Objectives
After completing this lesson, you should be able to do the following:
Outline the ETL (extraction, transformation, and loading) processes for building a data warehouse
Identify ETL tasks, importance, and cost
Explain how to examine data sources
Identify extraction techniques and methods
Identify analysis issues and design options for extraction processes
List the selection criteria for ETL tools
Extraction, Transformation, Loading
(ETL) Processes
Extract source data
Transform/clean data
Index and summarize
Load data into the warehouse
Detect changes
Refresh data
[Diagram: programs, gateways, and tools move data from operational systems through ETL into the data warehouse.]
ETL: Tasks, Importance, and Cost
[Diagram: ETL extracts, cleans up, consolidates, restructures, loads, maintains, and refreshes data from operational systems into a data warehouse that is relevant, useful, quality, accurate, and accessible.]
Extracting Data
Source systems
• Data from various data sources in various formats
Extraction routines
• Developed to select data fields from sources
• Consist of business rules, audit trails, and error-correction facilities
Data mapping
[Diagram: data is extracted from operational databases into a data staging area, transformed, and loaded into the warehouse database.]
Examining Data Sources
Production
Archive
Internal
External
Production Data
Operating system platforms
File systems
Database systems and vertical applications: IMS, DB2, Oracle, Sybase, Informix, VSAM, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials
Archive Data
Historical data
Useful for analysis over long periods of time
Useful for the first-time load
May require unique transformations
[Diagram: archived operational data is loaded into the warehouse database.]
Internal Data
Planning, sales, and marketing organization data
Maintained in the form of:
• Spreadsheets (structured)
• Documents (unstructured)
Treated like any other source data
[Diagram: planning, marketing, and accounting data feed the warehouse database.]
External Data
Information from outside the organization
Issues of frequency, format, and predictability
Described and tracked using metadata
[Diagram: purchased databases (A.C. Nielsen, IRI, IMS, Walsh America, Dun and Bradstreet), competitive information, economic forecasts, and publications such as Barron's and the Wall Street Journal feed the warehousing databases.]
Mapping Data
Mapping data defines:
Which operational attributes to use
How to transform the attributes for the warehouse
Where the attributes exist in the warehouse
[Example: source File A fields F1, F2, F3 (123, Bloggs, 10/12/56) map to Staging File One fields Number (USA123), Name (Mr. Bloggs), and DOB (10-Dec-56); the field-to-field mapping itself is recorded as metadata.]
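The File A example can be expressed as a mapping table that drives the transform, which is also a natural shape for recording the mapping as metadata. A sketch under assumed rules: the country prefix, salutation, and DD/MM/YY source format are read off the slide's sample values, not a documented specification.

```python
# A declarative field mapping: target field -> (source field, transform).
# Running the mapping produces the staging record from the source record.
from datetime import datetime

MAPPING = {
    "Number": ("F1", lambda v: "USA" + v),                  # add country prefix
    "Name":   ("F2", lambda v: "Mr. " + v),                 # add salutation
    "DOB":    ("F3", lambda v: datetime.strptime(v, "%d/%m/%y")
                                       .strftime("%d-%b-%y")),  # reformat date
}

def apply_mapping(source_row):
    return {target: fn(source_row[src]) for target, (src, fn) in MAPPING.items()}

staged = apply_mapping({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
```

Keeping the mapping in one data structure (rather than scattered through code) means the same definition can be published to the metadata repository and reused by audit and validation steps.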
Extraction Techniques
Programs: C, C++, COBOL, PL/SQL, Java
Gateways: transparent database access
Tools:
• In-house developed tools
• Vendor’s data extraction tools
Extraction Methods
Logical Extraction methods:
• Full Extraction
• Incremental Extraction
Physical Extraction methods:
• Online Extraction
• Offline Extraction
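The difference between full and incremental extraction can be sketched with a high-water mark: incremental extraction pulls only rows modified since the previous run, then advances the mark. A sketch only — the modified_at column and ISO timestamp strings are assumptions, and a real source would need such a change-tracking column (or a log) to support this at all.

```python
# Incremental (delta) extraction using a persisted high-water mark.
def extract_incremental(rows, last_extracted_at):
    """Return rows changed after the previous run's high-water mark,
    plus the new mark to persist for the next run."""
    delta = [r for r in rows if r["modified_at"] > last_extracted_at]
    new_mark = max((r["modified_at"] for r in delta), default=last_extracted_at)
    return delta, new_mark

source = [
    {"id": 1, "modified_at": "2002-01-01T10:00"},
    {"id": 2, "modified_at": "2002-01-03T09:30"},
    {"id": 3, "modified_at": "2002-01-05T17:45"},
]
delta, mark = extract_incremental(source, "2002-01-02T00:00")
```

A full extraction is the degenerate case: pass the minimum possible mark and every row qualifies. That trade-off — simplicity of full extraction versus the smaller data volume of deltas — is the design decision the slide's two logical methods name.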
Designing Extraction Processes
Analysis:
• Sources, technologies
• Data types, quality, owners
Design options:
• Manual, custom, gateway, third-party
• Replication, full, or delta refresh
Design issues:
• Volume and consistency of data
• Automation, skills needed, resources
Maintaining Extraction Metadata
Source location, type, structure
Access method
Privilege information
Temporary storage
Failure procedures
Validity checks
Handlers for missing data
Extraction Tools
Selection Criteria
Base functionality
Interface features
Metadata repository
Open API
Metadata access
Repository utilities
Input and output processing
Cleansing, reformatting, and auditing
References
Training requirements
Possible ETL Failures
A missing source file
A system failure
Inadequate metadata
Poor mapping information
Inadequate storage planning
A source structural change
No contingency plan
Inadequate data validation
Maintaining ETL Quality
ETL must be:
• Tested
• Documented
• Monitored and reviewed
Disparate metadata must be coordinated.
Summary
In this lesson, you should have learned how to:
Outline the ETL (extraction, transformation, and loading) processes for building a data warehouse
Identify ETL tasks, importance, and cost
Explain how to examine data sources
Identify extraction techniques and methods
Identify analysis issues and design options for extraction processes
List the selection criteria for ETL tools
Identify Oracle's solution for the ETL process
Practice 4-1 Overview
This practice covers the following topics:
Answering a series of short questions
Answering questions based on the business scenario for Frontier Airways
Lesson 5
Building the Data Warehouse:
Transforming Data
Objectives
After completing this lesson, you should be able to do the following:
Define transformation
Identify possible staging models
Identify data anomalies and eliminate them
Explain the importance of quality data
Describe techniques for transforming data
Design the transformation process
Transformation
Transformation eliminates anomalies from operational data:
Cleans and standardizes
Presents subject-oriented data
[Diagram: data is extracted from operational systems into the data staging area, where it is cleaned up, consolidated, and restructured, then loaded into the warehouse.]
Possible Staging Models
Remote staging model
Onsite staging model
Remote Staging Model
Data staging area within the warehouse environment:
[Diagram: extract from the operational system, transform in a staging area inside the warehouse environment, then load into the warehouse.]
Data staging area in its own environment:
[Diagram: extract from the operational system into a separate staging environment, transform there, then load into the warehouse.]
On-site Staging Model
Data staging area within the operational environment, possibly affecting the operational system
[Diagram: extract and transform in a staging area inside the operational environment, then load into the warehouse.]
Data Anomalies
No unique key
Data naming and coding anomalies
Data meaning anomalies between groups
Spelling and text inconsistencies
[Example rows:
CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA]
Transformation Routines
Cleaning data
Eliminating inconsistencies
Adding elements
Merging data
Integrating data
Transforming data before load
Transforming Data:
Problems and Solutions
Multipart keys
Multiple local standards
Multiple files
Missing values
Duplicate values
Element names
Element meanings
Input formats
Referential Integrity constraints
Name and address
Multipart Keys Problem
Multipart keys: for example, Product code = 12 M 654313 45, composed of a country code (12), a sales territory (M), a product number (654313), and a salesperson code (45).
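Splitting such a key into its components is a small fixed-width parse. A sketch under assumptions: the segment widths and field names below are inferred from the slide's single example, not from a documented key layout.

```python
# Split a multipart product code into its component fields during
# transformation, so each part can land in its own warehouse column.
KEY_LAYOUT = [                 # (field, width), in order of appearance
    ("country_code", 2),
    ("sales_territory", 1),
    ("product_number", 6),
    ("salesperson_code", 2),
]

def split_multipart_key(code):
    parts, pos = {}, 0
    for field, width in KEY_LAYOUT:
        parts[field] = code[pos:pos + width]
        pos += width
    return parts

parts = split_multipart_key("12M65431345")
```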
Multiple Local Standards Problem
Multiple local standards; use tools or filters to preprocess.
[Example: sources variously record lengths in cm or inches, dates as DD/MM/YY, MM/DD/YY, or DD-Mon-YY, and money as 1,000 GBP, FF 9,990, or USD 600.]
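Such a preprocessing filter amounts to normalizing every local format to one warehouse standard before load. A sketch: the target standards chosen here (ISO dates, centimetres) are illustrative, and note that a value like 10/12/56 is ambiguous between DD/MM/YY and MM/DD/YY, so the source system's format must be part of the extraction metadata rather than guessed.

```python
# Normalize locally formatted dates and units to one warehouse standard.
from datetime import datetime

def normalize_date(text, source_format):
    # source_format comes from metadata about the source system; we never
    # guess, because "10/12/56" parses under both DD/MM/YY and MM/DD/YY.
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")

def normalize_length_cm(value, unit):
    factors = {"cm": 1.0, "inches": 2.54}
    return value * factors[unit]

iso = normalize_date("10/12/56", "%d/%m/%y")   # a DD/MM/YY source
cm = normalize_length_cm(10, "inches")
```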
Multiple Files Problem
Added complexity of multiple source files; start simple.
[Diagram: logic detects the correct source among multiple source files before producing transformed data.]
Missing Values Problem
Solutions:
Ignore
Wait
Mark rows (for example: if NULL then field = 'A')
Extract when time-stamped
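The mark-rows option can be sketched as substituting a sentinel for missing values, so downstream queries can find and revisit the affected rows. The 'A' sentinel follows the slide's rule; the record shape is invented for illustration.

```python
# Mark rows with missing values using a sentinel, per "if NULL then
# field = 'A'": the rows still load, but remain findable for correction.
SENTINEL = "A"

def mark_missing(rows, field):
    marked = []
    for row in rows:
        fixed = dict(row)              # leave the source row untouched
        if fixed.get(field) is None:
            fixed[field] = SENTINEL
        marked.append(fixed)
    return marked

rows = mark_missing([{"cust": "Smith", "region": None},
                     {"cust": "Jones", "region": "East"}], "region")
```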
Duplicate Values Problem
Example: "ACME Inc" appears three times in the source.
Solutions:
SQL self-join techniques
RDBMS constraint utilities
SQL> SELECT ...
  2  FROM table_a, table_b
  3  WHERE table_a.key (+)= table_b.key
  4  UNION
  5  SELECT ...
  6  FROM table_a, table_b
  7  WHERE table_a.key = table_b.key (+);
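The same de-duplication can also be done in a staging script before load. A minimal sketch that keeps the first occurrence of each key value:

```python
def deduplicate(rows, key):
    """Keep the first occurrence of each key value
    (e.g. collapse three 'ACME Inc' rows into one)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

names = [{"name": "ACME Inc"}, {"name": "ACME Inc"}, {"name": "ACME Inc"}]
```

In practice this is only safe after names have been standardized; "ACME Inc" and "ACME Inc." would otherwise survive as two rows.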
Element Names Problem
Solution: common naming conventions
Example: the same element appears as Customer, Client, Customer Contact, or Name in different sources
Element Meaning Problem
Avoid misinterpretation
Complex solution
Document meaning in metadata
Example: a field named Customer_detail may hold the customer's name, all customer details, or all details except the name, depending on the source
Input Format Problem
Sources deliver the same data in different input formats:
EBCDIC versus ASCII encodings
"123-73" versus 12373
Free text and non-Latin character sets, for example "ACME Co." or "Beer (Pack of 8)"
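Python's standard codecs can translate between EBCDIC and Unicode/ASCII-compatible encodings. The cp500 code page below is an illustrative choice, since a real mainframe extract would name its own code page:

```python
# Round-trip a field through an EBCDIC code page (cp500 is illustrative;
# real mainframe extracts specify their own code page).
ebcdic_bytes = "123-73".encode("cp500")
decoded = ebcdic_bytes.decode("cp500")

# A numeric field may also need its punctuation stripped after decoding,
# matching the slide's "123-73" vs. 12373 example.
as_number = int(decoded.replace("-", ""))
```

Encoding conversion and format normalization are separate steps: the first fixes the byte representation, the second fixes the field layout.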
Referential Integrity Problem
Solutions:
SQL anti-join
Server constraints
Dedicated tools
Department table: 10, 20, 30, 40
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50  (no such department)
6786  Harris  60  (no such department)
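The anti-join that finds the violating rows can be sketched in a staging script as follows — table and field names are illustrative:

```python
def orphan_rows(emps, departments):
    """Return employee rows whose department has no parent row
    (the anti-join from the slide, done in a staging script)."""
    valid = set(departments)
    return [e for e in emps if e["dept"] not in valid]

emps = [
    {"emp": 1099, "name": "Smith", "dept": 10},
    {"emp": 1289, "name": "Jones", "dept": 20},
    {"emp": 1234, "name": "Doe", "dept": 50},
    {"emp": 6786, "name": "Harris", "dept": 60},
]
bad = orphan_rows(emps, [10, 20, 30, 40])
```

Rows returned by the anti-join can then be routed to an exception file for repair rather than silently loaded.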
Name and Address Problem
Single-field format:
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
Multiple-field format:
Name:    Mr. J. Smith
Street:  100 Main St.
Town:    Bigtown
County:  County Luth
Code:    23565
The same people may also be recorded differently across sources:
Database 1:
NAME              LOCATION
DIANNE ZIEFELD    N100
HARRY H. ENFIELD  M300
Database 2:
NAME              LOCATION
ZIEFELD, DIANNE   100
ENFIELD, HARRY H  300
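Reconciling the two name formats is a typical cleansing step. The sketch below normalizes both "LAST, FIRST" and "FIRST LAST" to one form; real name-and-address matching usually needs dedicated tools, as the slide's solution list suggests:

```python
def normalize_name(raw: str) -> str:
    """Normalize 'ZIEFELD, DIANNE' and 'DIANNE ZIEFELD' to one form
    (a simple sketch; production matching needs dedicated tools)."""
    if "," in raw:
        last, first = [p.strip() for p in raw.split(",", 1)]
    else:
        first, _, last = raw.rpartition(" ")
        first, last = first.strip(), last.strip()
    return f"{first.title()} {last.title()}"

a = normalize_name("ZIEFELD, DIANNE")
b = normalize_name("DIANNE ZIEFELD")
```

Once both sources produce the same canonical form, the duplicate-matching techniques above can link the records.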
Quality Data: Importance and Benefits
Quality data is key to a successful warehouse implementation.
Quality data helps you in:
Targeting the right customers
Determining buying patterns
Identifying householders: private and commercial
Matching customers
Identifying historical data
Data Quality Guidelines
Operational data:
Should not be used directly in the warehouse
Must be cleaned for each increment
Is not simply fixed by modifying applications
Transformation Techniques
Merging data
Adding a Date Stamp
Adding Keys to Data
Merging Data
Operational transactions do not usually map one-to-one with warehouse data.
Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, and second:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
Merging Data
Pizza sales/returns by day, hour, and second:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
After merging, the sale and return that cancel each other are removed:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
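The merge above — cancelling a return against its matching sale — can be sketched as follows (the field names are illustrative):

```python
def net_sales(transactions):
    """Drop sale/return pairs that cancel out
    (same product, opposite amounts), as in the pizza example."""
    result = list(transactions)
    for t in transactions:
        if t["type"] == "Return":
            match = next((s for s in result
                          if s["type"] == "Sale"
                          and s["product"] == t["product"]
                          and s["amount"] == -t["amount"]), None)
            if match is not None:
                result.remove(match)
                result.remove(t)
    return result

txns = [
    {"type": "Sale", "product": "Ham Pizza", "amount": 10.00},
    {"type": "Sale", "product": "Cheese Pizza", "amount": 15.00},
    {"type": "Sale", "product": "Anchovy Pizza", "amount": 12.00},
    {"type": "Return", "product": "Anchovy Pizza", "amount": -12.00},
    {"type": "Sale", "product": "Sausage Pizza", "amount": 11.00},
]
merged = net_sales(txns)
```

A real merge would also match on customer or order keys; matching only on product and amount is a simplification for the example.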
Adding a Date Stamp
The time element can be represented as a:
• Single point in time
• Time span
Add the time element to:
• Fact tables
• Dimension data
Adding a Date Stamp: Fact Tables and Dimensions
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Time Table: Time_key, Week_id, Period_id, Year_id
Product Table: Product_id, Time_key, Product_desc
Store Table: Store_id, District_id, Time_key
Item Table: Item_id, Dept_id, Time_key
Adding Keys to Data
Source rows with operational keys:
#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Warehouse rows keyed by data values or artificial keys:
#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00
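Assigning the artificial warehouse keys shown above can be sketched as follows — a minimal version, assuming dictionary rows and the #dw prefix from the example:

```python
import itertools

def assign_warehouse_keys(rows, prefix="dw"):
    """Replace operational keys with sequential artificial warehouse
    keys (#dw1, #dw2, ...), as in the slide's example."""
    counter = itertools.count(1)
    for row in rows:
        row["key"] = f"#{prefix}{next(counter)}"
    return rows

rows = assign_warehouse_keys([{"product": "Ham Pizza"},
                              {"product": "Cheese Pizza"}])
```

A production version would draw the numbers from a database sequence so keys stay unique across load runs.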
Summarizing Data
Summarize either:
1. During extraction, in the staging area
2. After loading, on the warehouse server
[Operational databases → Staging area → Warehouse database]
Maintaining Transformation Metadata
Transformation metadata contains:
Transformation rules
Algorithms and routines
[Sources → Extract → Stage (Rules) → Transform → Load → Publish → Query]
Maintaining Transformation Metadata
Restructure keys
Identify and resolve coding differences
Validate data from multiple sources
Handle exception rules
Identify and resolve format differences
Fix referential integrity inconsistencies
Identify summary data
Data Ownership and Responsibilities
Data ownership and responsibilities should be
shared by the:
• Operational team
• Data warehouse team
Business benefits are gained with a "work together" approach
Transformation Timing and Location
Transformation is performed:
• Before load
• In parallel
Can be initiated at different points:
• On the operational platform
• In a separate staging area
Choosing a Transformation Point
Workload
Impact on environment:
CPU usage
Disk space
Network bandwidth
Parallel execution
Load window time
User information needs
Monitoring and Tracking
Transformations should:
Be self-documenting
Provide summary statistics
Handle process exceptions
Designing Transformation Processes
Analysis:
• Sources and target mappings, business rules
• Key users, metadata, grain
Design options:
• Third-party tools
• Custom 3GL programs (FORTRAN, COBOL, C, C++, Java)
• 4GLs such as SQL or PL/SQL
• Replication
Design issues:
• Performance
• Size of the staging area
• Exception handling, integrity maintenance
Transformation Tools
Third-party tools
SQL*Loader
In-house developed programs
Summary
In this lesson, you should have learned how
to:
Define transformation
Identify possible staging models
Identify data anomalies and eliminate them
Explain the importance of quality data
Describe techniques for transforming data
Design transformation processes
Practice 5-1 Overview
This practice covers the following topics:
Answering a series of questions based on the
business scenario for Frontier Airways
Answering a series of short questions
Lesson 6
Building the Data Warehouse:
Loading Warehouse Data
Objectives
After completing this lesson, you should be able to
do the following:
Explain key concepts in loading warehouse data
Outline how to build the loading process for the initial
load
Identify loading techniques
Describe the loading techniques provided by Oracle
Identify the tasks that take place after data is loaded
Explain the issues involved in designing the
transportation, loading, and scheduling processes
Loading Data into the Warehouse
Loading moves the data into the warehouse.
Loading can be time-consuming:
• Consider the load window
• Schedule and automate the loading
The initial load moves large volumes of data.
Subsequent refreshes move smaller volumes of data.
[Operational databases → Extract → Staging area (Transform) → Transport, Load → Warehouse database]
Initial Load and Refresh
Initial Load:
Single event that populates the database with
historical data
Involves large volumes of data
Employs distinct ETL tasks
Involves large amounts of processing after load
Refresh:
Performed according to a business cycle
Less data to load than first-time load
Less-complex ETL tasks
Smaller amounts of post-load processing
Data Refresh Models: Extract Processing Environment
After each time interval, build a new snapshot of the database.
Purge old snapshots.
[Operational databases → snapshots at T1, T2, T3]
Data Refresh Models: Warehouse Processing Environment
Build a new database.
After each time interval, add changes to the database.
Archive or purge the oldest data.
[Operational databases → changes applied at T1, T2, T3]
Building the Loading Process
Techniques and tools
File transfer methods
The load window
Time window for other tasks
First-time and refresh volumes
Frequency of the refresh cycle
Connectivity bandwidth
Building the Loading Process
Test the proposed technique
Document proposed load
Monitor, review, and revise
Data Granularity
An important design and operational issue
Low-level grain: expensive, high level of processing, more disk space, more detail
High-level grain: cheaper, less processing, less disk space, less detail
Loading Techniques
Tools
Utilities and 3GL
Gateways
Customized copy programs
Replication
FTP
Manual
Loading Technique Considerations
Tools are comprehensive, but costly.
Data-movement utilities are fast and powerful.
Gateways are suitable for specific instances:
• Access other databases
• Supply dependent data marts
• Support a distributed environment
• Provide real-time access if needed
Use customized programs as a last resort.
Replication is limited by data-transfer rates.
Post-Processing of Loaded Data
[Extract → Staging area (Transform) → Load → Warehouse]
Post-processing of loaded data:
Create indexes
Generate keys
Summarize
Filter
Indexing Data
Before load:
Enable indexes at the server
During load:
Adds time to the load window; row-by-row approach
After load:
Adds time to the load window, but faster than the row-by-row approach
[Operational databases → Staging area → Warehouse database → Index]
Unique Indexes
Disable constraints before load.
Enable constraints after load.
Re-create the index if necessary.
[Disable constraints → Load data → Enable constraints → Catch errors → Reprocess → Create index]
Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
Methods:
• Concatenate the operational key with a number: 109908 → 109908 01
• Assign a number sequentially from a list: 109908 → 100
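The concatenation method can be sketched as follows — the underscore separator and two-digit suffix are illustrative choices, not mandated by the slide:

```python
def derived_key(operational_key: str, instance: int) -> str:
    """Concatenate the operational key with a running number, so each
    new version of the row (109908) gets a unique derived key."""
    return f"{operational_key}_{instance:02d}"
```

Sequential assignment from a list is the alternative: ignore the operational key entirely and take the next free number from a warehouse sequence.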
Summary Management
Summary tables
Materialized views
Summary data
Filtering Data
Summary data is filtered from the warehouse to the data marts.
[Warehouse → Data marts]
Verifying Data Integrity
Load data into an intermediate file.
Compare the target's counts and amounts after load with the flash totals taken before load.
If the totals do not match: preserve, inspect, fix, then load.
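The flash-total comparison amounts to recomputing record counts and amount sums on both sides of the load and checking that they agree. A minimal sketch, with illustrative field names:

```python
def flash_totals(rows, amount_field):
    """Compute the control totals (record count and summed amount)
    used to reconcile the data before and after load."""
    return {"count": len(rows),
            "amount": round(sum(r[amount_field] for r in rows), 2)}

source = [{"amt": 10.0}, {"amt": 15.0}, {"amt": 11.0}]
loaded = [{"amt": 10.0}, {"amt": 15.0}, {"amt": 11.0}]
ok = flash_totals(source, "amt") == flash_totals(loaded, "amt")
```

When the totals disagree, the intermediate file is preserved and inspected rather than discarded, so the failing rows can be fixed and reloaded.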
Steps for Verifying Data Integrity
[Diagram: source files → Extract → control process → SQL*Loader load into the target, verified against the .log and .bad files]
Standard Quality Assurance Checks
Load status
Completion of the process
Completeness of the data
Data reconciliation
Referential integrity violations
Reprocessing
Comparison of counts and amounts (catching errors such as 1 + 1 = 3)
Summary
In this lesson, you should have learned how to:
Explain key concepts in loading data into the
warehouse
Outline how to build the loading process for the
initial load
Identify loading techniques
Describe the loading techniques provided by
Oracle
Identify the tasks that take place after data is
loaded
Explain the issues involved in designing the
transportation, loading, and scheduling
processes
Practice 6-1 Overview
This practice covers the following topics:
Answering a series of short questions
Answering questions based on the business scenario
for Frontier Airways
Lesson 7
Refreshing Warehouse Data
Objectives
After completing this lesson, you should be
able to do the following:
Describe methods for capturing changed data
Explain techniques for applying the changes
Describe the Change Data Capture and refresh mechanisms supported in Oracle9i
Describe techniques for purging and archiving
data and outline the techniques supported by
Oracle
Outline final tasks, such as publishing the data,
controlling access, and automating processes
List the selection criteria for choosing ETL tools
Developing a Refresh Strategy
for Capturing Changed Data
Consider load window.
Identify data volumes.
Identify cycle.
Know the technical infrastructure.
Plan a staging area.
Determine how to detect changes.
User Requirements and Assistance
Users define the refresh cycle.
IT balances requirements against technical issues.
Document all tasks and processes.
Employ user skills.
Load Window Requirements
Time available for the entire ETL process
Plan
Test
Prove
Monitor
[Timeline: the 24-hour day divided between the load window and the user access period]
Planning the Load Window
Plan and build processes according to a strategy.
Consider volumes of data.
Identify technical infrastructure.
Ensure currency of data.
Consider user access requirements first.
High availability requirements may mean a small load
window.
Scheduling the Load Window
1. Receive the data (FTP).
2. Check the control file against the load requirements: load cycle, file names, file types, number of files, number of loads, first-time load or refresh, date of file, date range, record counts, and total amounts.
3-4. The control process opens and reads the files to verify and analyze them.
(Starting at about 3 a.m.)
Scheduling the Load Window
5. Load into the warehouse (parallel load).
6. Verify, analyze, reapply.
7. Index the data.
8. Create summaries.
9. Update metadata.
(Roughly 3 a.m. to 9 a.m.)
Scheduling the Load Window
10. Back up the warehouse.
11. Create views for specialized tools.
12. Users access summary data.
13. Publish.
(Roughly 6 a.m. to 9 a.m., then user access.)
Capturing Changed Data for Refresh
Capture new fact data.
Capture changed dimension data.
Determine the method of capture in each case.
Methods:
• Wholesale data replacement
• Comparison of database instances
• Time stamping
• Database triggers
• Database log
Wholesale Data Replacement
Expensive
Useful for data marts with less data
Limited historical data analysis is possible
Time period often exceeds load window
Mirroring techniques can be used to provide
access to the users
Comparison of Database Instances
Delta file:
• Changes to operational data since the last refresh
• Used to update the warehouse
Simple to perform, but expensive in terms of time and processing
Efficient for smaller volumes of data
[Yesterday's operational database compared with today's operational database → delta file holds the changed data]
Time and Date Stamping
Fast scanning for records changed since the last refresh cycle
Useful for data with an updated date field
No detection of deleted data
[Operational data → delta file holds changed data based on the time stamp]
Database Triggers
Changed data is intercepted at the server level.
Extra I/O is required.
Maintenance overhead.
[Operational data → operational server (RDBMS) with triggers → delta file holds the changed data]
Using a Database Log
Contains before and after images
Requires a system checkpoint
Common technique
[Operational data → operational server (DBMS) → log → log analysis and data extraction → delta file holds the changed data]
Choosing a Method
for Change Data Capture
Consider each method on merit.
Consider a hybrid approach if one approach is not
suitable.
Consider current technical, operational, and
application issues.
Applying the Changes to Data
You have a choice of techniques:
Overwrite a record
Add a record
Add a field
Maintain history
Add version numbers
Overwriting a Record
Easy to implement
Loses all history
Not recommended
Example: 42135 John Doe Single is overwritten by 42135 John Doe Married
Adding a New Record
History is preserved; dimensions grow.
Time constraints are not required.
A generalized key is created.
Metadata tracks the usage of keys.
Example: 42135 John Doe Single is kept, and 42135_01 John Doe Married is added
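The add-a-record technique can be sketched as follows — a minimal version, assuming dictionary rows and a two-digit suffix for the generalized key:

```python
def apply_change(dimension, cust_id, new_value, field="status"):
    """'Add a record': preserve history by inserting a new row with a
    generalized key (42135 -> 42135_01) instead of overwriting."""
    versions = [r for r in dimension if r["key"].split("_")[0] == cust_id]
    new_key = f"{cust_id}_{len(versions):02d}"
    dimension.append({"key": new_key,
                      "name": versions[-1]["name"],
                      field: new_value})
    return dimension

dim = [{"key": "42135", "name": "John Doe", "status": "Single"}]
apply_change(dim, "42135", "Married")
```

The old row stays queryable, so historical facts keep joining to the dimension values that were current when they occurred.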
Adding a Current Field
Maintains some history
Loses intermediate values
Is enhanced by adding an Effective Date field
Example: 42135 John Doe Single becomes 42135 John Doe Single Married 1-Jan-01
Limitations of Methods for Applying Changes
Difficult to maintain history
Dimensions may grow large
Maintenance overhead
Maintaining History: Techniques
History tables
One-to-many relationships
Versioning
Preserve complete history
Maintaining History: Techniques
History tables:
Normalize dimensions
Hold current and historical data
One-to-many relationships:
One current record and many history records
[Star schema: the Sales fact joined to Time, CUSTOMER, and Product dimensions, with CUSTOMER linked to a HIST_CUST history table]
Versioning
Avoid double counting
Facts hold a version number
Customer.CustId  Version  Customer Name
1234             1        Comer
1234             2        Comer
Sales.CustId     Version  Sales Facts
1234             1        $11,000
1234             2        $12,000
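Joining facts to dimensions on both the key and the version number keeps each fact tied to the dimension row that was current when it occurred, so aggregating across versions does not double count. A minimal sketch using the table above:

```python
def versioned_sales(customers, sales, cust_id):
    """Join facts to dimensions on (key, version) so each sales fact
    matches exactly one customer version -- no double counting."""
    by_version = {(c["cust_id"], c["version"]): c["name"] for c in customers}
    return [(by_version[(s["cust_id"], s["version"])], s["amount"])
            for s in sales if s["cust_id"] == cust_id]

customers = [{"cust_id": 1234, "version": 1, "name": "Comer"},
             {"cust_id": 1234, "version": 2, "name": "Comer"}]
sales = [{"cust_id": 1234, "version": 1, "amount": 11000},
         {"cust_id": 1234, "version": 2, "amount": 12000}]
rows = versioned_sales(customers, sales, 1234)
```

Joining on the customer ID alone would match each sales row to both customer versions and inflate the totals.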
Preserve Complete History
Complete history:
• Enables realistic historical analysis
• Retains context of data
Model must be able to:
• Reflect business changes
• Maintain context between fact and dimension data
• Retain sufficient data to relate old to new
Purging and Archiving Data
As data ages, its value depreciates.
Remove old data from the warehouse:
• Archive for later use (if needed)
• Purge without copy
Final Tasks
Update metadata
Publish data
Use database roles to control access to the warehouse
[Sources → Extract → Stage (Rules) → Transform → Load → Publish → Query]
Publishing Data
Control access using database roles
Compromise between load action and user access
Consider:
• Staggering updates
• Using temporary tables
• Using separate tables
ETL Tools: Selection Criteria
Overlap with existing tools
Availability of meta model
Supported data sources
Ease of modification and maintenance
Required fine tuning of code
Ease of change control
Power of transformation logic
Level of modularization
Power of error, exception, resubmission features
Intuitive documentation
Performance of code
ETL Tool Selection Criteria
Activity scheduling and sophistication
Metadata generation
Learning curve
Flexibility
Supported operating systems
Cost
Summary
In this lesson, you should have learned how to:
Describe methods for capturing changed data
Explain techniques for applying the changes
Describe the Change Data Capture and refresh mechanisms supported in Oracle9i
Describe techniques for purging and archiving
data and outline the techniques supported by
Oracle
Outline final tasks, such as publishing the data,
controlling access, and automating processes
List the selection criteria for choosing ETL tools
Practice 7-1 Overview
This practice covers the following topics:
Answering a series of questions based on the
business scenario for Frontier Airways
Answering a series of short questions
Lesson 8
Leaving a Metadata Trail
Objectives
After completing this lesson, you should be
able to do the following:
Define warehouse metadata, its types, and its
role in a warehouse environment
Examine each type of warehouse metadata
Develop a metadata strategy
Outline the Common Warehouse Metamodel
(CWM)
Defining Warehouse Metadata
Data about warehouse data and processing
Vital to the warehouse
Used by everyone
The key to understanding warehouse information
Metadata Users
End users, developers, and IT professionals all draw on the metadata repository.
Types of Metadata
End-user metadata:
• Key to a good warehouse
• Navigation aid
• Information provider
ETL metadata:
• Maps structure
• Source and target information
• Transformations
• Context
Operational metadata:
• Load, management, scheduling processes
• Performance
Examining Types of Metadata
[External sources and operational data sources → ETL (ETL metadata) → warehouse → end user (end-user metadata), all recorded in the metadata repository]
Examining Metadata: ETL Metadata
Business rules
Source tables, fields, and key values
Ownership
Field conversions
Encoding and reference tables
Name changes
Key value changes
Default values
Logic to handle multiple sources
Algorithms
Time stamp
Extraction Metadata
Space and storage requirements
Source location information
Diverse source data
Access information
Security
Contacts
Program names
Frequency details
Failure procedures
Validity checking information
Transformation Metadata
Duplication routines
Exception handling
Key restructuring
Grain conversions
Program names
Frequency
Summarization
Loading Metadata
Method of transfer
Frequency
Validation procedures
Failure procedures
Deployment rules
Contact information
Examining Metadata: End-User Metadata
Location of facts and dimensions
Availability
Description of contents and algorithms used for derived and summary data
Data ownership details
End-User Metadata: Context
Users need to know the context of the table queried.
Associate the metadata with its description.
Example of End-User Metadata
Table Name  Column Name  Data    Meaning
Product     Prod_ID      739516  Unique identifier for the product
Product     Valid_date   01/97   Last refresh date
Product     Ware_loc     1816    Warehouse location number
Product     Ware_bin     666     Warehouse bin number
Product     Code         15      The color of the product; refer to table COL_REF for details
Product     Weight       17.62   Packed shipping weight in kilograms
Historic Context of Data
Supports change history
Maintains the context of information across the years
Types of Context
Simple:
• Data structures
• Naming conventions
• Metrics
Complex:
• Product definitions
• Markets
• Pricing
External:
• Economic
• Political
Developing a Metadata Strategy
Define a strategy to ensure high-quality metadata useful to users and developers.
Primary strategy considerations:
• Define goals and intended use
• Identify target users
• Choose tools and techniques
• Choose the metadata location
• Manage the metadata
• Manage access to the metadata
• Integrate metadata from multiple tools
• Manage change
Defining Metadata Goals
and Intended Usage
Define clear goals.
Identify requirements.
Identify intended usage.
Identifying Target Metadata Users
Who are the metadata users?
• Developers
• End users
What information do they need?
How will they access the metadata?
Choosing Metadata Tools and
Techniques
Tools:
• Data modeling
• ETL
• End user (query and analysis)
Database schema definitions
COBOL copybooks
Middleware tools
Choosing the Metadata Location
Usually the warehouse server
Possibly on operational platforms
Desktop tool with metalayer
Managing the Metadata
Managed by the metadata manager
Maintained by the metadata architect
Standards should be followed
Integrating Multiple Sets of Metadata
Multiple tools may generate their own metadata.
These metalayers should be properly integrated.
Metadata exchangeability is desirable.
Managing Changes to Metadata
Different types of metadata have different rates of
change.
Consider metadata changes resulting from refresh
cycles.
Additional Metadata Content
and Considerations
Summarization algorithms
Relationships
Stewardship
Permissions
Pattern analysis
Reference tables
Common Warehouse Metamodel
[Diagram: any source (ERP, operational, external) feeds data integration into the warehouse and marts; information delivery then provides any access (reporting, ad hoc query & analysis, data mining); design and administration tools and analytic applications sit alongside; everything shares the CWM metadata repository]
Summary
In this lesson, you should have learned how
to:
Define warehouse metadata, its types, and its
role in a warehouse environment
Examine each type of warehouse metadata
Develop a metadata strategy
Outline the Common Warehouse Metamodel
(CWM)
Practice 8-1 Overview
This practice covers the following topics:
Answering a series of short questions
Answering questions based on the business scenario
for Frontier Airways
Lesson 9
Managing and Maintaining
the Data Warehouse
Objectives
After completing this lesson, you should be able to
do the following:
Develop a plan for managing the transition from
development to implementation
Identify challenges pertaining to the growth of the data
warehouse
Describe backup and archive mechanisms
Identify data warehouse performance issues
Managing the Transition to Production
Promoting support for change
Pilot versus large-scale implementation
Documentation
Testing
Training
Post-implementation support
Maintaining the warehouse
Promoting Support for the
Data Warehouse
Awareness
Feedback
Information
Skills
Education
Direction
Control
Choosing Between Pilot and Large-Scale Implementation
Weigh a pilot against a large-scale implementation.
The Warehouse Pilot
Demonstrates benefits to:
• Management
• Users
• IT staff
Relevant to the business
Low technical risk
Small and feasible
Anticipates increased use
Focused on an initial business issue
Remains in context
Piloting the Warehouse
Designers:
• Prove model, data, and access tools
Users:
• Prove ease of use of tool
• Check data and query performance
• Identify training requirements
Developers:
• Resolve ETL and metadata issues
• Determine users' data and training requirements
• Test security and access levels, monitor performance
Documentation
Produces textual deliverables:
Glossary
User and technical documentation
Online help
Metadata reference guide
Warehouse management reference
New features guide
Testing the Warehouse
Test every stage.
Use a realistic test database and environment.
Training
Users:
• Metadata
• DSS tools
• Ad hoc queries
• Getting help
• Registration of enhancement requests
Information systems developers:
• Analysis techniques
• Hardware technicalities
• Networking
• Implementing, building, and supporting DSS
Post-Implementation Support
Evaluate and review the implementation.
Monitor the warehouse:
• Respond to problems
• Conduct performance tuning
• Roll out metadata, queries, reports, filters, and
conditions
• Implement security
• Incorporate new users
• Distribute data marts and catalogs
• Transfer ownership from IT
Monitoring the Success of the Data Warehouse
[Chart: number of users plotted from the initial rollout through 3, 6, 12, and 24 months after implementation]
Measuring the Success
of the Data Warehouse
Metrics may include:
Availability
Response time
Response to problems
Managing Growth
Increasing number of users
Broader usage
Growth of data volumes over the period after implementation
Expansion and Adjustment
Evaluate continually:
• Changes
• New increments
• Unnecessary components
• Strategies
Ensure an open environment.
Document development processes for the future:
• Planning
• Cost analysis
• Problem assessment and correction
• Performance assessment
Controlling Expansion
Ensure the continuity of staff.
Document processes, solutions, and metrics.
Establish working test and production
architecture for further increments.
Create a strategy for maintaining changes to
data.
Sizing Storage
Consider different methods.
Determine the best for your needs.
Know the business requirements.
Do not underestimate requirements.
Plan for growth.
Consider space for unwanted data.
Estimating Storage
Fact volumes
Fact lifetime
Technology availability
Technology purchase
Storing presummarized data
Mirroring or other techniques
requiring disk storage
Objects That Need Space
ODS
Indexes and metadata
Summary data
Redo logs
Rollback information
Sort areas
Temporary space
Workspace for backup
and recovery
Other Considerations and Techniques
Queuing models
Rule of thumb: total database size is three to four times the size of the base fact tables
Consider:
• Sparseness
• Dimensions
• Indexes
• Summaries
• Sort operational space
Space Management
Monitor
Avoid fragmentation
Test load data
Plan for growth
Know business patterns
Never let space become an issue
Archiving Data
Determine data life expectancy.
Identify archive frequency.
Use read-only tablespaces.
Include in early specifications.
Purging Data
Reduce data volumes:
• Create summaries
• Remove unwanted base data
Choose the most effective method.
Identifying Data Warehouse Performance Issues
Improving query efficiency:
• Use indexes.
• Use query governors.
• Run large jobs out of hours.
• Consider a data mart approach.
Improving network performance:
• Provide sufficient bandwidth and optimize configuration for access.
• Analyze traffic.
• Deploy data marts at remote locations.
Review and Revise
Monitor the warehouse:
Usage
Access
Accurate grain
Detail data
Periodicity
Secret of Success
Think big; start small!
Course Summary
In this course, you should have learned that
the successful warehouse:
Is driven by the business
Focuses on objectives
Adds value to the business
Can be understood and used
Delivers good data
Performs well
Belongs to the users
Sample Study
Note: we discuss the creation of data marts, rather than the perhaps more familiar data warehouse. Data warehouses tend to be large, one-stop-shopping repositories where all the historical data for the organization would be stored. Nothing is wrong with this as a concept; however, attempting to create a data warehouse often led to huge, multiyear technology projects that were never quite finished or were outdated when they finally did get done.
Data Mart Structure
Measures
A measure is a numeric quantity expressing some aspect of the organization's performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. A measure can also be called a fact. The tables that hold measure information are known as fact tables.
Dimensions
A Dimension is a categorization used to spread out an
aggregate measure to reveal its constituent parts.
Dimensions are used to facilitate this slicing and dicing.
The Star Schema
A Star Schema is a relational database schema used to
hold measures and dimensions in a data mart. The
measures are stored in a fact table and dimensions are
stored in dimension tables.
Attributes
An Attribute is an additional piece of information
pertaining to a dimension member that is not the unique
identifier or the description of the member.
Hierarchies
A hierarchy is a structure made up of two or more levels of related dimensions. A dimension at an upper level of the hierarchy completely contains one or more dimensions from the next lower level of the hierarchy.
The Snowflake Schema