Data Staging

Download Report

Transcript Data Staging

ACCTG 6910
Building Enterprise &
Business Intelligence Systems
(e.bis)
Data Staging
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
1
The Business Dimensional Lifecycle
Technical
Architecture
Design
Product
Selection &
Installation
Business
Project
Planning
Requirement
Dimensional
Modeling
Physical
Design
Data Staging
Design &
Development
Deployment
Maintenance
and
Growth
Definition
End-User
Application
Specification
End-User
Application
Development
Project Management
2
Data Staging
DB2
Access
Data
Staging
Data
Warehouse
(Oracle)
Excel
Legacy System
3
Data Staging
•
•
•
•
•
•
Extraction
Data Cleansing
Data Integration
Transformation
Transportation (Loading)
Maintenance
4
Extraction
• Extract source data from legacy systems and
place it in a staging area.
• To reduce the impact on the performance of
legacy systems, source data is extracted
without any cleansing, integration and
transformation operations.
5
Extraction
• A variety of file formats exist in legacy
systems
– Relational database: DB2, Oracle, SQL Server,
Informix, Access …
– Flat file: Excel file, text file
• Commercial data extraction tools are very
helpful in data extraction.
– Ex: Oracle Data Mart Builder
6
Data Preparation (Cleansing)
It’s all about data quality!!!
7
Outline
•
•
•
•
•
•
•
Measures for Data Quality
Causes for data errors
Common types of data errors
Common error checks
Correcting missing values
Timing for error checks and corrections
Steps of data preparation
8
Measures for Data Quality
• Correctness/Accuracy - w.r.t. the real data
• Consistency/Uniqueness – data values,
references, measures and interpretations
• Completeness - scope of data & values
• Relevancy – w.r.t. the requirements
• Current data – relevant to the requirements
9
Causes for Data Errors
• Data entry errors
• Correct data not available at the time of data
entries
• By different users same time or same users
overtime
– Inconsistent or incorrect use of “codes”
– Inconsistent or incorrect interpretation of “fields”
• Transaction processing errors
• System and recovery errors
• Data extract/transformation errors
10
Common Data Errors
•
•
•
•
Missing (null) values
Incorrect use of default values (e.g., zero)
Data domain integrity violation (e.g., 0/1)
Data value (dependency) integrity violation
(e.g., if mm=02 then DD<30)
• Data referential integrity violation
(e.g., a customer’s order record cannot exist
unless the customer record already exists)
11
Common Data Errors, Cont’d
• Data retention integrity violation (e.g., old
inventory snapshots should not be stored)
• Data Derivation/Transformation/Aggregation
Integrity Violation (e.g., profit not = sales –
costs)
• Inconsistent data values of the same data (M
versus m for male)
• Inconsistent use of the same data value (DM
for Data Mining and Data Marts)
12
Error Checks
•
•
•
•
•
•
•
•
Domain value validation
Value dependency validation
Referential integrity validation
Identify missing-value or default-value
records
Identify outliers
Cross-footing -Check aggregates and
derivations across different levels and against
common sense
Eyeballs!
13
Process validation
Data Cleaning: Missing Values
1. Exclude the record
2. Exclude the attribute/field
3. Replaced by a global constant
4. Replaced by the attribute mean
5. Replaced by the most probable value
6. Apply 4 – 6 by class/segments of records
7. Manual correction
8. Application specific algorithm
1-6 are less practical for OLAP bound data
14
Timing for Error Checking
• During Data Staging
• During Data Loading
• Others
– Before data extraction (data entries,
transaction processing, recovery, audits,
etc.)
– After data loading
15
Steps of Data Preparation
•
•
•
•
•
•
•
•
•
Identify data sources
Extract and analyze source data
Standardize data
Correct and complete data
Match and consolidate data
Analyze data defect types
Transform and enhance data into target
Calculate derivations and summary data
Audit and control data extract, transformation
and loading
16
Data Integration
• Data from different data sources with
different formats need to be integrated
into one data warehouse
– Ex: 3 customer table in sales department,
marketing department and an acquired
company
Customer (cid, cname, city …)
Customer (customerid, customername,city…)
Customer (custid, custname, cname,…)
17
Data Integration
• Same attribute with different name: cid,
customerid, custid
• Different attribute with same name:
– cname -> customer name
– cname -> city name
• Same attribute with different formats
18
Data Integration
• How to integrate
– Get the schemas of all data sources
– Get the schema of the data warehouse
– Integrate source schemas with the help
from commercial tools and domain experts
19
Transformation
• Prepare data for loading into the data
warehouse
– Change the data format
– Create derived attributes and tables
– Aggregate
– Create warehouse keys
20
Transportation
• Using bulk load tools, such as Oracle
SQL Loader, instead of SQL command
• Create indexes
21
Maintenance
• Maintenance frequency: daily, weekly,
monthly
• Identify change records and new records in
legacy systems
– Create timestamps for changes and new records in
legacy systems
– Compare data between legacy systems and DW
• Load changes and new records into DW
22