Slides(II) - Zhangxi Lin

Download Report

Transcript Slides(II) - Zhangxi Lin

ISQS 6339, Data Management & Business Intelligence
Extraction, Transformation, and Loading (II)
Zhangxi Lin
Texas Tech University
1
ISQS 6339, Data Management & Business
Intelligence
Agenda
I. Using SSIS for ETL




Integration Services
Learn by doing
Package items
Problem-oriented package development
II. The Principle of ETL
 Extraction
 Transformation
 Loading
2
ISQS 6339, Data Management & Business
Intelligence
II. The Principle of ETL
3
ISQS 6339, Data Management & Business
Intelligence
Structure and Components of Business
Intelligence
SSMS
SSIS
SSAS
SSRS
SAS
EG
4
ISQS 6339, Data Management & Business
Intelligence
SAS
EM
Automating your routine information
processing tasks
 Your routine information processing tasks
Read online news at 8:00a and collect a few most important pieces
Retrieve data from database to draft a short daily report at 10a
View and reply emails and take some notes that are saved in a database
View 10 companies’ webpage to see the updates. Input the summaries into a database
Browse three popular magazines twice a week. Input the summaries into a database
Generate a few one-way frequency and two-way frequency tables and put them on
the web
 Merge datasets collected by other people into a main database.
 Prepare a weekly report using the database and at 4p every Monday, and publish it to
the internal portal site.
 Prepare a monthly report at 11a on the first day of a month, which must be
converted into a pdf file and uploaded to the website.






 Seems there are many things are on going. How to handle them properly in the
right time?
 Organizer – yes
 How about regular data processing tasks?
5
ISQS 6339, Data Management & Business
Intelligence
Information Processing and
Information Flow
 Transaction processing
 Interactions between a user and a computer application system with
immediate responses from the application
 Operational processing
 Make use of computer to control a process
 Batch processing
 Consisting of a series of executions, each of which is applied to a set of data
and turns the result to the next one.
 Analytical processing
 The interaction between analysts and collections of aggregated data that may
have been reformulated into alternative representational forms for improved
analytical performance.
6
ISQS 6339, Data Management & Business
Intelligence
Extraction, Transformation, Loading (ETL)
Processes
 Extract source data
 Transform/clean data
Operational systems
 Index and summarize
 Load data into warehouse
 Detect changes
 Refresh data
Programs
Gateways
Tools
Data Warehouse
7
ISQS 6339, Data Management & Business
Intelligence
ETL
ETL: Tasks, Importance, and Cost
Operational
systems
Extract
Clean up
Consolidate
Restructure
Load
Maintain
Refresh
Data Warehouse
ETL
Relevant
Useful
Quality
Accurate
Accessible
8
ISQS 6339, Data Management & Business
Intelligence
Extracting Data
 Source systems
 Data from various data sources in various formats
 Extraction Routines
 Developed to select data fields from sources
 Consist of business rules, audit trails, error correction facilities
Data mapping
Transform
Operational
databases
9
Data staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Production Data
 Operating system platforms
 File systems
 Database systems and vertical applications
IMS
DB2
Oracle
Sybase
Informix
VSAM
10
ISQS 6339, Data Management & Business
Intelligence
SAP
Shared Medical
Systems
Dun and Bradstreet
Financials
Hogan Financials
Oracle Financials
Archive Data
 Historical data
 Useful for analysis over long periods of time
 Useful for first-time load
 May require unique transformations
Operation
databases
11
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Internal Data
 Planning, sales, and marketing organization data
 Maintained in the form of:
 Spreadsheets (structured)
 Documents (unstructured)
 Treated like any other source data
Planning
Marketing
Accounting
12
ISQS 6339, Data Management & Business
Intelligence
Warehouse database
External Data
 Information from outside the organization
 Issues of frequency, format, and predictability
 Described and tracked using metadata
Purchased
databases
A.C. Nielsen, IRI, IMS,
Walsh America
Dun and
Bradstreet
Barron's
13
Warehousing
databases
ISQS 6339, Data Management & Business
Intelligence
Competitive
information
Economic
forecasts
Wall Street
Journal
Possible ETL Failures
 A missing source file
 A system failure
 Inadequate metadata
 Poor mapping information
 Inadequate storage planning
 A source structural change
 No contingency plan
 Inadequate data validation
14
ISQS 6339, Data Management & Business
Intelligence
Maintaining ETL Quality
 ETL must be:
 Tested
 Documented
 Monitored and reviewed
 Disparate metadata must be coordinated.
15
ISQS 6339, Data Management & Business
Intelligence
Transformation
 Transformation eliminates anomalies from operational data:
 Cleans and standardizes
 Presents subject-oriented data
Extract
Transform:
Clean up
Consolidate
Restructure
Operational
systems
Load
Data Staging Area
16
ISQS 6339, Data Management & Business
Intelligence
Warehouse
Remote Staging Model
Data staging area within the warehouse environment
Transform
Extract
Operational
system
Load
Staging area
Warehouse
Data staging area in its own environment
Transform
Operational
system
17
Extract
Load
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
On-site Staging Model
 Data staging area within the operational environment,
possibly affecting the operational system
Transform
Extract
Operational
system
18
Load
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
Data Anomalies
 No unique key
 Data naming and coding anomalies
 Data meaning anomalies between groups
 Spelling and text inconsistencies
19
CUSNUM
NAME
ADDRESS
90233479
Oracle Limited
100 N.E. 1st St.
90233489
Oracle Computing
15 Main Road, Ft. Lauderdale
90234889
Oracle Corp. UK
15 Main Road, Ft. Lauderdale, FLA
90345672
Oracle Corp UK Ltd
181 North Street, Key West, FLA
ISQS 6339, Data Management & Business
Intelligence
Transformation Routines
 Cleaning data
 Eliminating inconsistencies
 Adding elements
 Merging data
 Integrating data
 Transforming data before load
20
ISQS 6339, Data Management & Business
Intelligence
Transforming Data:
Problems and Solutions
 Multipart keys
 Multiple local standards
 Multiple files
 Missing values
 Duplicate values
 Element names
 Element meanings
 Input formats
 Referential Integrity constraints
 Name and address
2
1
ISQS 6339, Data Management & Business
Intelligence
Multipart Keys Problem
 Multipart keys
Product code = 12 M 654313 45
Salesperson
code
Country Sales
code
territory
22
Product
number
ISQS 6339, Data Management & Business
Intelligence
Multiple Local Standards Problem
 Multiple local standards
 Tools or filters to preprocess
23
cm
DD/MM/YY
1,000 GBP
inches
MM/DD/YY
FF 9,990
cm
DD-Mon-YY
USD 600
ISQS 6339, Data Management & Business
Intelligence
Multiple Files Problem
 Added complexity of multiple source files
 Start simple
Multiple
source files
24
Logic to detect
correct source
ISQS 6339, Data Management & Business
Intelligence
Transformed
data
Missing Values Problem
 Solution:
 Ignore
 Wait
 Mark rows
 Extract when time-stamped
If NULL then
field = ‘A’
25
ISQS 6339, Data Management & Business
Intelligence
A
Duplicate Values Problem
 Solution:
 SQL self-join techniques
 RDMBS constraint utilities
ACME Inc
ACME Inc
ACME Inc
SQL>
2
3
4
5
6
7
26
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+)= table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
ISQS 6339, Data Management & Business
Intelligence
Element Names Problem
 Solution:
Customer
Common naming
conventions
Client
Contact
Name
27
ISQS 6339, Data Management & Business
Intelligence
Customer
Element Meaning Problem
 Avoid misinterpretation
 Complex solution
 Document meaning in metadata
Customer’s
name
All customer
details
Customer_detail
28
ISQS 6339, Data Management & Business
Intelligence
All details
except name
Input Format Problem
EBCDIC
“123-73”
ASCII
12373
ACME Co.
áøåëéí äáàéí
29
ISQS 6339, Data Management & Business
Intelligence
Beer (Pack of 8)
Referential Integrity Problem
 Solution:
 SQL anti-join
 Server constraints
 Dedicated tools
Departme
nt
Emp
Name
Department
10
1099
Smith
10
1289
Jones
20
1234
Doe
50
6786
Harris
60
20
30
40
30
ISQS 6339, Data Management & Business
Intelligence
Name and Address Problem
 Single-field format
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
 Multiple-field format
31
Database 1
Name
Mr. J. Smith
NAME
LOCATION
Street
100 Main St.
DIANNE ZIEFELD
N100
Town
Bigtown
HARRY H. ENFIELD
M300
Country
County Luth
Code
23565
Database 2
NAME
LOCATION
ZIEFELD, DIANNE
100
ENFIELD, HARRY H
300
ISQS 6339, Data Management & Business
Intelligence
Quality Data: Importance and Benefits
 Quality data:
◦ Key to a successful warehouse implementation
 Quality data helps you in:
◦ Targeting right customers
◦ Determining buying patterns
◦ Identifying householders: private and commercial
◦ Matching customers
◦ Identify historical data
32
ISQS 6339, Data Management & Business
Intelligence
Data Quality Guidelines
 Operational data:
 Should not be used directly in the warehouse
 Must be cleaned for each increment
 Is not simply fixed by modifying applications
33
ISQS 6339, Data Management & Business
Intelligence
Transformation Techniques
 Merging data
 Adding a Date Stamp
 Adding Keys to Data
3
4
ISQS 6339, Data Management & Business
Intelligence
Merging Data
 Operational transactions do not usually map
one-to-one with warehouse data.
 Data for the warehouse is merged to provide information
for analysis.
Pizza sales/returns by day, hour, seconds
35
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:02
Anchovy Pizza
$12.00
Return 1/2/02 12:00:03
Anchovy Pizza
- $12.00
Sale
Sausage Pizza
$11.00
1/2/02 12:00:04
ISQS 6339, Data Management & Business
Intelligence
Merging Data
Pizza sales/returns by day, hour, seconds
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:02
Anchovy Pizza
$12.00
Return 1/2/02 12:00:03
Anchovy Pizza
- $12.00
Sale
Sausage Pizza
$11.00
1/2/02 12:00:04
Pizza sales
36
Sale
1/2/02
12:00:01
Ham Pizza
$10.00
Sale
1/2/02
12:00:02
Cheese Pizza
$15.00
Sale
1/2/02 12:00:04
Sausage Pizza
ISQS 6339, Data Management & Business
Intelligence
$11.00
Adding a Date Stamp
 Time element can be represented as a:
◦ Single point in time
◦ Time span
 Add time element to:
◦ Fact tables
◦ Dimension data
37
ISQS 6339, Data Management & Business
Intelligence
Adding a Date Stamp:
Fact Tables and Dimensions
Product Table
Product_id
Time_key
Product_desc
Time Table
Week_id
Period_id
Year_id
Time_key
38
Store Table
Store_id
District_id
Time_key
Sales Fact Table
Item_id
Store_id
Time_key
Sales_dollars
Sales_units
ISQS 6339, Data Management & Business
Intelligence
Item Table
Item_id
Dept_id
Time_key
Adding Keys to Data
#1
Sale
1/2/98
12:00:01 Ham Pizza
$10.00
#2
Sale
1/2/98
12:00:02 Cheese Pizza
$15.00
#3
Sale
1/2/98
12:00:02 Anchovy Pizza $12.00
#4
Return 1/2/98
12:00:03 Anchovy Pizza - $12.00
#5
Sale
12:00:04 Sausage Pizza $11.00
1/2/98
Data values
39
or artificial keys
#dw1
Sale
1/2/98
12:00:01 Ham Pizza
$10.00
#dw2
Sale
1/2/98
12:00:02 Cheese Pizza
$15.00
#dw3
Sale
1/2/98
12:00:04 Sausage Pizza $11.00
ISQS 6339, Data Management & Business
Intelligence
Summarizing Data
1.
2.
During extraction on staging area
After loading to the warehouse server
Operational
databases
40
Staging area
ISQS 6339, Data Management & Business
Intelligence
Warehouse
database
Maintaining Transformation Metadata
 Transformation metadata contains:
◦ Transformation rules
◦ Algorithms and routines
Sources
Stage
Rules
Extract
Transform
41
ISQS 6339, Data Management & Business
Intelligence
Publish
Load
Query
Maintaining Transformation Metadata
 Restructure keys
 Identify and resolve coding differences
 Validate data from multiple sources
 Handle exception rules
 Identify and resolve format differences
 Fix referential integrity inconsistencies
 Identify summary data
42
ISQS 6339, Data Management & Business
Intelligence
Transformation Timing and Location
◦ Transformation is performed:
 Before load
 In parallel
◦ Can be initiated at different points:
 On the operational platform
 In a separate staging area
43
ISQS 6339, Data Management & Business
Intelligence
Monitoring and Tracking
 Transformations should:
 Be self-documenting
 Provide summary statistics
 Handle process exceptions
44
ISQS 6339, Data Management & Business
Intelligence
Loading Data into the Warehouse
 Loading moves the data into the warehouse
 Loading can be time-consuming:
 Consider the load window
 Schedule and automate the loading
 Initial load moves large volumes of data
 Subsequent refresh moves smaller volumes of data
Transform
Extract
45
Transport,
Load
Operational
Staging area
Warehouse
database
ISQSdatabases
6339, Data Management & Business
Intelligence
Initial Load and Refresh
 Initial Load:
 Single event that populates the database with
historical data
 Involves large volumes of data
 Employs distinct ETL tasks
 Involves large amounts of processing after load
 Refresh:
 Performed according to a business cycle
 Less data to load than first-time load
 Less-complex ETL tasks
 Smaller amounts of post-load processing
46
ISQS 6339, Data Management & Business
Intelligence
Data Refresh Models: Extract
Processing Environment
 After each time interval, build a new snapshot of
the database.
 Purge old snap shots.
Operational
databases
T1
47
ISQS 6339, Data Management & Business
Intelligence
T2
T3
Data Refresh Models: Warehouse Processing
Environment
 Build a new database.
 After each time interval, add changes to database.
 Archive or purge oldest data.
Operational
databases
T1
48
ISQS 6339, Data Management & Business
Intelligence
T2
T3
Building the Loading Process
 Techniques and tools
 File transfer methods
 The load window
 Time window for other tasks
 First-time and refresh volumes
 Frequency of the refresh cycle
 Connectivity bandwidth
49
ISQS 6339, Data Management & Business
Intelligence
Building the Loading Process
 Test the proposed technique
 Document proposed load
 Monitor, review, and revise
50
ISQS 6339, Data Management & Business
Intelligence
Data Granularity
 Important design and operational issue
 Low-level grain:
Expensive, high level of processing,
more disk space, more details
 High-level grain:
Cheaper, less processing, less
disk space, little details
51
ISQS 6339, Data Management & Business
Intelligence
Loading Techniques
 Tools
 Utilities and 3GL
 Gateways
 Customized copy programs
 Replication
 FTP
 Manual
52
ISQS 6339, Data Management & Business
Intelligence
Loading Technique Considerations
 Tools are comprehensive, but costly.
 Data-movement utilities are fast and powerful.
 Gateways are suitable for specific instances:
 Access other databases
 Supply dependent data marts
 Support a distributed environment
 Provide real-time access if needed
 Use customized programs as a
last resort.
 Replication is limited by
data-transfer rates.
53
ISQS 6339, Data Management & Business
Intelligence
Post-Processing of Loaded Data
Transform
Extract
Load
Staging area
Create
indexes
Generate
keys
Post-processing
of loaded data
Summarize
54
ISQS 6339, Data Management & Business
Intelligence
Filter
Warehouse
Creating Derived Keys
 The use of derived or generalized keys is
recommended to maintain the uniqueness of a row.
 Methods:
 Concatenate operational key with a number
 Assign a number sequentially from a list
55
109908
109908 01
109908
100
ISQS 6339, Data Management & Business
Intelligence
Summary Management
 Summary tables
 Materialized views
Summary data
56
ISQS 6339, Data Management & Business
Intelligence
Filtering Data
 From warehouse to data marts
Summary data
Warehouse
Data marts
57
ISQS 6339, Data Management & Business
Intelligence
Verifying Data Integrity
 Load data into intermediate file.
 Compare target flash totals with totals before load.
Counts &
Amounts
=
Flash
Totals
Load
58
Intermediate file
Target
Counts &
Amounts
Flash
Totals
=
ISQS 6339, Data Management & Business
Intelligence
Preserve, inspect,
fix, then load
Steps for Verifying Data Integrity
Source files
Source files
Source files
Control
Target
3
4
1
Extract
SQL*Loader
5
2
6
7
59
ISQS 6339, Data Management & Business
Intelligence
.log
.bad
Standard Quality Assurance Checks
 Load status
 Completion of the process
 Completeness of the data
 Data reconciliation
 Referential integrity violations
 Reprocessing
 Comparison of counts and amounts
1+1=3
60
ISQS 6339, Data Management & Business
Intelligence