Slides(II) - Zhangxi Lin
Download
Report
Transcript Slides(II) - Zhangxi Lin
ISQS 6339, Data Management & Business Intelligence
Extraction, Transformation, and Loading (II)
Zhangxi Lin
Texas Tech University
1
ISQS 6339, Data Management & Business
Intelligence
Agenda
I. Using SSIS for ETL
Integration Services
Learn by doing
Package items
Problem-oriented package development
II. The Principle of ETL
Extraction
Transformation
Loading
2
ISQS 6339, Data Management & Business
Intelligence
II. The Principle of ETL
3
ISQS 6339, Data Management & Business
Intelligence
Structure and Components of Business
Intelligence
SSMS
SSIS
SSAS
SSRS
SAS
EG
4
ISQS 6339, Data Management & Business
Intelligence
SAS
EM
Automating your routine information
processing tasks
Your routine information processing tasks
Read online news at 8:00a and collect a few most important pieces
Retrieve data from database to draft a short daily report at 10a
View and reply emails and take some notes that are saved in a database
View 10 companies’ webpage to see the updates. Input the summaries into a database
Browse three popular magazines twice a week. Input the summaries into a database
Generate a few one-way frequency and two-way frequency tables and put them on
the web
Merge datasets collected by other people into a main database.
Prepare a weekly report using the database and at 4p every Monday, and publish it to
the internal portal site.
Prepare a monthly report at 11a on the first day of a month, which must be
converted into a pdf file and uploaded to the website.
Seems there are many things are on going. How to handle them properly in the
right time?
Organizer – yes
How about regular data processing tasks?
5
ISQS 6339, Data Management & Business
Intelligence
Information Processing and
Information Flow
Transaction processing
Interactions between a user and a computer application system with
immediate responses from the application
Operational processing
Make use of computer to control a process
Batch processing
Consisting of a series of executions, each of which is applied to a set of data
and turns the result to the next one.
Analytical processing
The interaction between analysts and collections of aggregated data that may
have been reformulated into alternative representational forms for improved
analytical performance.
6
ISQS 6339, Data Management & Business
Intelligence
Extraction, Transformation, Loading (ETL)
Processes
Extract source data
Transform/clean data
Operational systems
Index and summarize
Load data into warehouse
Detect changes
Refresh data
Programs
Gateways
Tools
Data Warehouse
7
ISQS 6339, Data Management & Business
Intelligence
ETL
ETL: Tasks, Importance, and Cost
Operational
systems
Extract
Clean up
Consolidate
Restructure
Load
Maintain
Refresh
Data Warehouse
ETL
Relevant
Useful
Quality
Accurate
Accessible
8
ISQS 6339, Data Management & Business
Intelligence
Extracting Data
Source systems
Data from various data sources in various formats
Extraction Routines
Developed to select data fields from sources
Consist of business rules, audit trails, error correction facilities
Data mapping
Transform
Operational
databases
9
Data staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Production Data
Operating system platforms
File systems
Database systems and vertical applications
IMS
DB2
Oracle
Sybase
Informix
VSAM
10
ISQS 6339, Data Management & Business
Intelligence
SAP
Shared Medical
Systems
Dun and Bradstreet
Financials
Hogan Financials
Oracle Financials
Archive Data
Historical data
Useful for analysis over long periods of time
Useful for first-time load
May require unique transformations
Operation
databases
11
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Internal Data
Planning, sales, and marketing organization data
Maintained in the form of:
Spreadsheets (structured)
Documents (unstructured)
Treated like any other source data
Planning
Marketing
Accounting
12
ISQS 6339, Data Management & Business
Intelligence
Warehouse database
External Data
Information from outside the organization
Issues of frequency, format, and predictability
Described and tracked using metadata
Purchased
databases
A.C. Nielsen, IRI, IMS,
Walsh America
Dun and
Bradstreet
Barron's
13
Warehousing
databases
ISQS 6339, Data Management & Business
Intelligence
Competitive
information
Economic
forecasts
Wall Street
Journal
Possible ETL Failures
A missing source file
A system failure
Inadequate metadata
Poor mapping information
Inadequate storage planning
A source structural change
No contingency plan
Inadequate data validation
14
ISQS 6339, Data Management & Business
Intelligence
Maintaining ETL Quality
ETL must be:
Tested
Documented
Monitored and reviewed
Disparate metadata must be coordinated.
15
ISQS 6339, Data Management & Business
Intelligence
Transformation
Transformation eliminates anomalies from operational data:
Cleans and standardizes
Presents subject-oriented data
Extract
Transform:
Clean up
Consolidate
Restructure
Operational
systems
Load
Data Staging Area
16
ISQS 6339, Data Management & Business
Intelligence
Warehouse
Remote Staging Model
Data staging area within the warehouse environment
Transform
Extract
Operational
system
Load
Staging area
Warehouse
Data staging area in its own environment
Transform
Operational
system
17
Extract
Load
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
On-site Staging Model
Data staging area within the operational environment,
possibly affecting the operational system
Transform
Extract
Operational
system
18
Load
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
Data Anomalies
No unique key
Data naming and coding anomalies
Data meaning anomalies between groups
Spelling and text inconsistencies
19
CUSNUM
NAME
ADDRESS
90233479
Oracle Limited
100 N.E. 1st St.
90233489
Oracle Computing
15 Main Road, Ft. Lauderdale
90234889
Oracle Corp. UK
15 Main Road, Ft. Lauderdale, FLA
90345672
Oracle Corp UK Ltd
181 North Street, Key West, FLA
ISQS 6339, Data Management & Business
Intelligence
Transformation Routines
Cleaning data
Eliminating inconsistencies
Adding elements
Merging data
Integrating data
Transforming data before load
20
ISQS 6339, Data Management & Business
Intelligence
Transforming Data:
Problems and Solutions
Multipart keys
Multiple local standards
Multiple files
Missing values
Duplicate values
Element names
Element meanings
Input formats
Referential Integrity constraints
Name and address
2
1
ISQS 6339, Data Management & Business
Intelligence
Multipart Keys Problem
Multipart keys
Product code = 12 M 654313 45
Salesperson
code
Country Sales
code
territory
22
Product
number
ISQS 6339, Data Management & Business
Intelligence
Multiple Local Standards Problem
Multiple local standards
Tools or filters to preprocess
23
cm
DD/MM/YY
1,000 GBP
inches
MM/DD/YY
FF 9,990
cm
DD-Mon-YY
USD 600
ISQS 6339, Data Management & Business
Intelligence
Multiple Files Problem
Added complexity of multiple source files
Start simple
Multiple
source files
24
Logic to detect
correct source
ISQS 6339, Data Management & Business
Intelligence
Transformed
data
Missing Values Problem
Solution:
Ignore
Wait
Mark rows
Extract when time-stamped
If NULL then
field = ‘A’
25
ISQS 6339, Data Management & Business
Intelligence
A
Duplicate Values Problem
Solution:
SQL self-join techniques
RDMBS constraint utilities
ACME Inc
ACME Inc
ACME Inc
SQL>
2
3
4
5
6
7
26
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+)= table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
ISQS 6339, Data Management & Business
Intelligence
Element Names Problem
Solution:
Customer
Common naming
conventions
Client
Contact
Name
27
ISQS 6339, Data Management & Business
Intelligence
Customer
Element Meaning Problem
Avoid misinterpretation
Complex solution
Document meaning in metadata
Customer’s
name
All customer
details
Customer_detail
28
ISQS 6339, Data Management & Business
Intelligence
All details
except name
Input Format Problem
EBCDIC
“123-73”
ASCII
12373
ACME Co.
áøåëéí äáàéí
29
ISQS 6339, Data Management & Business
Intelligence
Beer (Pack of 8)
Referential Integrity Problem
Solution:
SQL anti-join
Server constraints
Dedicated tools
Departme
nt
Emp
Name
Department
10
1099
Smith
10
1289
Jones
20
1234
Doe
50
6786
Harris
60
20
30
40
30
ISQS 6339, Data Management & Business
Intelligence
Name and Address Problem
Single-field format
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Multiple-field format
31
Database 1
Name
Mr. J. Smith
NAME
LOCATION
Street
100 Main St.
DIANNE ZIEFELD
N100
Town
Bigtown
HARRY H. ENFIELD
M300
Country
County Luth
Code
23565
Database 2
NAME
LOCATION
ZIEFELD, DIANNE
100
ENFIELD, HARRY H
300
ISQS 6339, Data Management & Business
Intelligence
Quality Data: Importance and Benefits
Quality data:
◦ Key to a successful warehouse implementation
Quality data helps you in:
◦ Targeting right customers
◦ Determining buying patterns
◦ Identifying householders: private and commercial
◦ Matching customers
◦ Identify historical data
32
ISQS 6339, Data Management & Business
Intelligence
Data Quality Guidelines
Operational data:
Should not be used directly in the warehouse
Must be cleaned for each increment
Is not simply fixed by modifying applications
33
ISQS 6339, Data Management & Business
Intelligence
Transformation Techniques
Merging data
Adding a Date Stamp
Adding Keys to Data
3
4
ISQS 6339, Data Management & Business
Intelligence
Merging Data
Operational transactions do not usually map
one-to-one with warehouse data.
Data for the warehouse is merged to provide information
for analysis.
Pizza sales/returns by day, hour, seconds
35
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:02
Anchovy Pizza
$12.00
Return 1/2/02 12:00:03
Anchovy Pizza
- $12.00
Sale
Sausage Pizza
$11.00
1/2/02 12:00:04
ISQS 6339, Data Management & Business
Intelligence
Merging Data
Pizza sales/returns by day, hour, seconds
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:02
Anchovy Pizza
$12.00
Return 1/2/02 12:00:03
Anchovy Pizza
- $12.00
Sale
Sausage Pizza
$11.00
1/2/02 12:00:04
Pizza sales
36
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:04
Sausage Pizza
ISQS 6339, Data Management & Business
Intelligence
$11.00
Adding a Date Stamp
Time element can be represented as a:
◦ Single point in time
◦ Time span
Add time element to:
◦ Fact tables
◦ Dimension data
37
ISQS 6339, Data Management & Business
Intelligence
Adding a Date Stamp:
Fact Tables and Dimensions
Product Table
Product_id
Time_key
Product_desc
Time Table
Week_id
Period_id
Year_id
Time_key
38
Store Table
Store_id
District_id
Time_key
Sales Fact Table
Item_id
Store_id
Time_key
Sales_dollars
Sales_units
ISQS 6339, Data Management & Business
Intelligence
Item Table
Item_id
Dept_id
Time_key
Adding Keys to Data
#1
Sale
1/2/98
12:00:01 Ham Pizza
$10.00
#2
Sale
1/2/98
12:00:02 Cheese Pizza
$15.00
#3
Sale
1/2/98
12:00:02 Anchovy Pizza $12.00
#4
Return 1/2/98
12:00:03 Anchovy Pizza - $12.00
#5
Sale
12:00:04 Sausage Pizza $11.00
1/2/98
Data values
39
or artificial keys
#dw1
Sale
1/2/98
12:00:01 Ham Pizza
$10.00
#dw2
Sale
1/2/98
12:00:02 Cheese Pizza
$15.00
#dw3
Sale
1/2/98
12:00:04 Sausage Pizza $11.00
ISQS 6339, Data Management & Business
Intelligence
Summarizing Data
1.
2.
During extraction on staging area
After loading to the warehouse server
Operational
databases
40
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Maintaining Transformation Metadata
Transformation metadata contains:
◦ Transformation rules
◦ Algorithms and routines
Sources
Stage
Rules
Extract
Transform
41
ISQS 6339, Data Management & Business
Intelligence
Publish
Load
Query
Maintaining Transformation Metadata
Restructure keys
Identify and resolve coding differences
Validate data from multiple sources
Handle exception rules
Identify and resolve format differences
Fix referential integrity inconsistencies
Identify summary data
42
ISQS 6339, Data Management & Business
Intelligence
Transformation Timing and Location
◦ Transformation is performed:
Before load
In parallel
◦ Can be initiated at different points:
On the operational platform
In a separate staging area
43
ISQS 6339, Data Management & Business
Intelligence
Monitoring and Tracking
Transformations should:
Be self-documenting
Provide summary statistics
Handle process exceptions
44
ISQS 6339, Data Management & Business
Intelligence
Loading Data into the Warehouse
Loading moves the data into the warehouse
Loading can be time-consuming:
Consider the load window
Schedule and automate the loading
Initial load moves large volumes of data
Subsequent refresh moves smaller volumes of data
Transform
Extract
45
Transport,
Load
Operational
Staging area
Warehouse
database
ISQSdatabases
6339, Data Management & Business
Intelligence
Initial Load and Refresh
Initial Load:
Single event that populates the database with
historical data
Involves large volumes of data
Employs distinct ETL tasks
Involves large amounts of processing after load
Refresh:
Performed according to a business cycle
Less data to load than first-time load
Less-complex ETL tasks
Smaller amounts of post-load processing
46
ISQS 6339, Data Management & Business
Intelligence
Data Refresh Models: Extract
Processing Environment
After each time interval, build a new snapshot of
the database.
Purge old snap shots.
Operational
databases
T1
47
ISQS 6339, Data Management & Business
Intelligence
T2
T3
Data Refresh Models: Warehouse Processing
Environment
Build a new database.
After each time interval, add changes to database.
Archive or purge oldest data.
Operational
databases
T1
48
ISQS 6339, Data Management & Business
Intelligence
T2
T3
Building the Loading Process
Techniques and tools
File transfer methods
The load window
Time window for other tasks
First-time and refresh volumes
Frequency of the refresh cycle
Connectivity bandwidth
49
ISQS 6339, Data Management & Business
Intelligence
Building the Loading Process
Test the proposed technique
Document proposed load
Monitor, review, and revise
50
ISQS 6339, Data Management & Business
Intelligence
Data Granularity
Important design and operational issue
Low-level grain:
Expensive, high level of processing,
more disk space, more details
High-level grain:
Cheaper, less processing, less
disk space, little details
51
ISQS 6339, Data Management & Business
Intelligence
Loading Techniques
Tools
Utilities and 3GL
Gateways
Customized copy programs
Replication
FTP
Manual
52
ISQS 6339, Data Management & Business
Intelligence
Loading Technique Considerations
Tools are comprehensive, but costly.
Data-movement utilities are fast and powerful.
Gateways are suitable for specific instances:
Access other databases
Supply dependent data marts
Support a distributed environment
Provide real-time access if needed
Use customized programs as a
last resort.
Replication is limited by
data-transfer rates.
53
ISQS 6339, Data Management & Business
Intelligence
Post-Processing of Loaded Data
Transform
Extract
Load
Staging area
Create
indexes
Generate
keys
Post-processing
of loaded data
Summarize
54
ISQS 6339, Data Management & Business
Intelligence
Filter
Warehouse
Creating Derived Keys
The use of derived or generalized keys is
recommended to maintain the uniqueness of a row.
Methods:
Concatenate operational key with a number
Assign a number sequentially from a list
55
109908
109908 01
109908
100
ISQS 6339, Data Management & Business
Intelligence
Summary Management
Summary tables
Materialized views
Summary data
56
ISQS 6339, Data Management & Business
Intelligence
Filtering Data
From warehouse to data marts
Summary data
Warehouse
Data marts
57
ISQS 6339, Data Management & Business
Intelligence
Verifying Data Integrity
Load data into intermediate file.
Compare target flash totals with totals before load.
Counts &
Amounts
=
Flash
Totals
Load
58
Intermediate file
Target
Counts &
Amounts
Flash
Totals
=
ISQS 6339, Data Management & Business
Intelligence
Preserve, inspect,
fix, then load
Steps for Verifying Data Integrity
Source files
Source files
Source files
Control
Target
3
4
1
Extract
SQL*Loader
5
2
6
7
59
ISQS 6339, Data Management & Business
Intelligence
.log
.bad
Standard Quality Assurance Checks
Load status
Completion of the process
Completeness of the data
Data reconciliation
Referential integrity violations
Reprocessing
Comparison of counts and amounts
1+1=3
60
ISQS 6339, Data Management & Business
Intelligence