Populating the Data Warehouse (ETL)
Download
Report
Transcript Populating the Data Warehouse (ETL)
Extract, Transform, Load
1
Agenda
Review
Analysis
Logical Design
Physical Design
Implementation
(Bus Matrix, Info Package)
(Dimensional Modeling)
(Spreadsheet)
(Data Mart Relational Tables)
ETL Process Overview
ETL Components
Staging Area
Extraction
Transformation
Loading
Documenting High-Level ETL Requirements
Documenting Detailed ETL Flows
Example ETL
2
Review: Dimensional Modeling
3
Review: DM Implementation
DimStudent
CREATE TABLE DimStudent(
student_sk
int identity(1,1),
student_id
varchar(9),
firstname
varchar(30),
lastname
varchar(30),
city
varchar(20),
state
varchar(2),
major
varchar(6),
classification
varchar(25),
gpa
numeric(3, 2),
club_name
varchar(25),
undergrad_school
varchar(25),
gmat
int,
undergrad_or_grad varchar(10),
CONSTRAINT dimstudent_pk PRIMARY
KEY (student_sk));
GO
FactEnrollment
CREATE TABLE FactEnrollment(
student_sk
int,
class_sk
int,
date_sk
int,
professor_sk
int,
course_grade
numeric(2, 1),
CONSTRAINT factenrollment_pk PRIMARY KEY
(student_sk, class_sk, date_sk, professor_sk),
CONSTRAINT factenrollment_student_fk FOREIGN
KEY (student_sk) REFERENCES
dimstudent(student_sk),
CONSTRAINT factenrollment_class_fk FOREIGN
KEY(class_sk) REFERENCES dimclass (class_sk),
CONSTRAINT factenrollment_date_fk FOREIGN
KEY(date_sk) REFERENCES dimtime (date_sk),
CONSTRAINT factenrollment_professor_fk FOREIGN
KEY(professor_sk) REFERENCES dimprofessor
(professor_sk));
GO
4
Review: Physical DW Design
5
ETL Overview
Reshaping relevant data from source systems
into useful information stored in the DW
Extract
Copying and integrating data from OLTP and
other data sources in preparation for cleansing
and loading into the DW
Transform
Cleaning and converting data to prepare it for
loading into the DW
Load
Putting cleansed and converted data into the DW
6
ETL Process
Not Really New, BUT…
Much more data
Includes rearranging, summarizing
Data used for strategic decision-making
Characteristics:
Process AND technology
Detailed, highly-dependent tasks
Consumes average 75% of DW development
An on-going process for life of DW
Requirements:
Well-documented
Automated
Flexible
7
ETL Process
1. Determine target data
2. Determine data sources
3. Prepare data mapping
4. Organize data staging area
5. Establish data extraction rules
6. Establish data transformation rules
7. Plan aggregate tables
8. Establish data load procedures
9. Load dimension tables
10. Load fact tables
8
ETL Process Flow
3, Spreadsheet
1, Dim Model
2, Spreadsheet
6, 7, Map
& SSIS
5, SSIS
8, 9, 10, SSIS
4
9
ETL Staging Area
Information hub, facilitating the enriching
stages that data goes through to populate a DW
Advantages:
Separates source systems and DW
Minimizes ETL impact on source AND DW systems
Can consist of multiple “hubs”
“upload” area
“staging” area
“DW load images”
10
ETL Staging Area, cont…
11
High Level Design of ETL Process
Initial documentation of:
What data do we need and where is it coming
from?
Physical DW Design Spreadsheet shown previously
What are the major transformation/cleansing
needs?
“Extend” Physical DW Design Spreadsheet OR
ETL Map
What’s the sequence of activities for ETL?
ETL Map
12
Common Transformations
Format Revisions
Key Restructuring, Lookup
Handling of Null Values
Decoding fields
Calculated, Derived values
Merging of Data
13
Common Transformations, cont…
Splitting of single fields
Character set conversion
Units of measurement conversion
Date/time conversion
Summarization
Deduplication
14
Common Transformations, cont…
Other Data Quality Issues
Standardize values
Validate values
Identifying mismatches, misspellings
Etc…
Suggestions:
Appoint “Data Stewards”
Ensure ETL programs have control checks
Data Profiling…
15
Comparison of Models
16
Transformations Example
DimTime
DimProfessor
DimClass
DimStudent
FactEnrollment
Create table
Generate SK
Generate SK
Generate SK
Add SKs:
student, section, prof
(join registration to
student, time, and section
dims;
left join them to prof)
Insert row w/SK = -1
Insert row w/SK = -1
Insert row w/SK = -1
Insert row w/SK = -1
Expand rank values
(use SQL case)
Get coursename & cred
hrs from section tbl
(join section to course)
Expand classification values
(use SQL case)
Expand department values
(join prof to departments)
Expand state values
(needs lookup table but
use SQL case instead)
Get gmat, undergrad school
from grad table
(join student to grad)
Get club name from club
(join student to undergrad;
Left join them to club)
Create undergrad_or_grad
values
(if stud_id in undergrad or
stud_id in grad)
17
Data Profiling
Systematic analysis of the content of a data
source
Goals:
Anticipate potential data quality issues upfront
Build quality corrections and controls into ETL
process
Manual and/or Tool-assisted
18
Profiling Example: Manual
Account
CustID Number
Customer
First
Type
Title Name
AW000110
11000 00
I
AW000110
11001 01
I
AW000110
11002 02
Last
Name
Gender Email
Phone
Address Line1
Address
Line2
State
Postal
Code Country
Yang
F
[email protected].
1(11) 500 5550162
3761 N. 14th St
Queensland
4700
AU
Eugene
Huang
F
[email protected].
500-555-0110
2243 W St.
Victoria
3198
AU
I
Ruben
Torres
F
[email protected].
1(11) 500 5550184
5844 Linden Dr
New South
Wales
7001
AU
AW000110
11003 03
I
Christy
Zhu
F
[email protected].
1(11) 500 5550162
1825 Village Pl.
Queensland
2113
AW000110
11004 04
I
F
[email protected].
7553 Harness
(500) 555-0131 Circle
AW000110
11005 05
I
M
[email protected].
1(11) 500 5550151
Mr. Jon
Mrs. Elizabeth Johnson
Julio
Ruiz
7305 Humphrey
Drive
New South
Wales
2500
AU
4169
OZ
19
Profiling Example: SSIS
20
Documenting ETL High Level Design
Add to existing DW Physical Design
Spreadsheet
21
Documenting ETL High Level Design
22
Low Level Design of ETL Process
Detailed documentation of:
What data do we need and where is it coming
from?
What are the major transformation/cleansing
needs?
What’s the sequence of activities for ETL?
Can use tool like SSIS
23
Extracting Source Data
Two forms:
1.
Static Data Capture
Point-in-time snapshot
Initial Loads and periodic refreshes
2. Revised Data Capture
Only data that has been added, updated, deleted
since last load
Ongoing incremental loads
Two timeframes
Immediate
Deferred
24
Static Data Capture
(T)SQL Scripts
e.g., small number of tables/rows
Export/Import Tables
e.g., database or non-database sources
Backup/Restore Database
e.g., copying sqlserver source database for initial
load ETL
Detach/Attach Database
e.g., copying older sqlserver version to newer
sqlserver version for initial load ETL
25
Revised Data Capture
Immediate / Real-time
ETL side:
OLTP side:
OLTP side:
procs get changed data from log real-time
and update ETL staging tables
triggers update ETL staging tables
apps write to OLTP AND ETL staging
tables
Deferred
ETL side:
ETL side:
OLTP side:
procs get changed data from OLTP tables
based on timestamps
procs do file comparison
changed data capture (SS 2008)
26
Documenting ETL Low Level Design:
SSIS
Comes with SQL Server
Helps document and automate ETL process
Based on defining
Packages
Tasks
One approach
A package for each target table
A "master" package
27
SSIS Package Examples: Master
28
SSIS Package Examples: Extract All
29
SSIS Package Examples: Extract Changed
using CDC
Eg, SELECT * from cdccustomer WHERE
cdc_chg_date >
etl_last_capture_date;
30
SSIS Package Examples: Transforms
31
SSIS Package Examples: Load
32
Class Performance DW Example
Create ClassPerformanceDW database
Using ClassPerformanceDW database…
Create ClassPerformanceDW tables using SQL
Script
http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_d
w_tables/create_class_performance_dw_tables.sql
33
ETL Example using SQL Scripts
One "Master Script"
Calls five "table" scripts
34
"Master" Script
--be sure to turn on Query, SQLCMD mode in order to run this script
Use ClassPerformanceDW
print 'loading dimclass table'
Go
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql"
print 'loading dimprofessor table'
Go
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql"
print 'loading dimstudent table'
Go
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql"
print 'loading dimtime table'
Go
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql"
print 'loading factenrollment table'
Go
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql"
Print 'class performance DW data transformation and loading is complete'
Go
35
Load "DimProfessor" Script (pg. 1 of 3)
set nocount on
print 'remove existing data from dimprofessor'
delete from dimprofessor;
go
print 'reseeding SK identity value back to 1'
dbcc checkident ('dimprofessor', reseed, 0);
go
print 'adding oltp prof data to dimprofessor'
print 'professor_sk will be automatically inserted'
insert into dimprofessor (
professor_id,
firstname,
lastname,
rank,
department)
select
prof_id, firstname, lastname, rank, dept
from
regnOLTP.dbo.prof
;
go
36
Load "DimProfessor" Script (pg. 2 of 3)
print 'decoding rank field'
UPDATE dimprofessor
SET dimprofessor.rank = case dimprofessor.rank
when 'asst' then 'assistant prof'
when 'assc' then 'associate prof'
when 'prof' then 'full prof'
end
;
Go
print 'decoding department field using imported excel spreadsheet'
UPDATE dimprofessor
SET
dimprofessor.department = regnOLTP.dbo.departments.department
FROM dimprofessor, regnOLTP.dbo.departments
WHERE dimprofessor.department = regnOLTP.dbo.departments.prefix
;
Go
37
Load "DimProfessor" Script (pg. 3 of 3)
print 'adding SK -1 row'
set identity_insert dimprofessor on
Go
insert into dimprofessor (
professor_sk,
professor_id,
firstname,
lastname,
rank,
department)
Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
GO
set identity_insert dimprofessor off
Go
Set nocount off
38
Load "FactEnrollment" Script
print 'adding oltp registration data to fact_enrollment'
INSERT INTO factenrollment (
student_sk,
class_sk,
date_sk,
professor_sk,
course_grade)
SELECT student_sk, class_sk, datekey, professor_sk, final_grade
FROM
((((regnOLTP.dbo.registration INNER JOIN dimstudent ON
registration.stud_id = dimstudent.student_id)
INNER JOIN dimclass ON
regnOLTP.dbo.registration.callno = dimclass.crn)
INNER JOIN dimtime ON
CONVERT(varchar(10),regnOLTP.dbo.registration.regn_date,101) = actualdatekey)
INNER JOIN regnOLTP.dbo.section ON
dimclass.crn = regnOLTP.dbo.section.callno)
LEFT JOIN dimprofessor ON regnOLTP.dbo.section.prof_id =
dimprofessor.professor_id
;
Go
39
Entire Transform/Load "Package"
http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_d
w_tables.zip
40