Slides - Zhangxi Lin - Texas Tech University
Download
Report
Transcript Slides - Zhangxi Lin - Texas Tech University
ISQS 3358, Business Intelligence
Extraction, Transformation, and Loading
Zhangxi Lin
Texas Tech University
1
Outline
Integration Services
Learn by doing
Package development tools
2
ETL Topics
Dimension Processing
◦ Extract changed row from the operational database
◦ Handling slowly changing dimensions
◦ De-duplication and fuzzy transforms
Fact Processing
◦
◦
◦
◦
◦
◦
◦
◦
Extract fact data from the operational database
Extract fact updates and deletes
Cleaning fact data
Checking data quality and halting package execution
Transform fact data
Surrogate key pipeline
Loading fact data
Analysis services processing
Integrating all tasks
3
Automating your routine
information processing tasks
Your routine information processing tasks
◦
◦
◦
◦
◦
◦
◦
◦
◦
Read online news at 8:00a and collect a few most important pieces
Retrieve data from database to draft a short daily report at 10a
View and reply emails and take some notes that are saved in a database
View 10 companies’ webpage to see the updates. Input the summaries into a
database
Browse three popular magazines twice a week. Input the summaries into a
database
Generate a few one-way frequency and two-way frequency tables and put
them on the web
Merge datasets collected by other people into a main database.
Prepare a weekly report using the database and at 4p every Monday, and
publish it to the internal portal site.
Prepare a monthly report at 11a on the first day of a month, which must be
converted into a pdf file and uploaded to the website.
Seems there are many things are on going. How to handle them
properly in the right time?
◦ Organizer – yes
◦ How about regular data processing tasks?
4
SQL Server Integration Services (SSIS)
The data in data warehouses and data marts is usually
updated frequently, and the data loads are typically very
large.
Integration Services includes a task that bulk loads data
directly from a flat file into SQL Server tables and views,
and a destination component that bulk loads data into a
SQL Server database as the last step in a data
transformation process.
An SSIS package can be configured to be restartable.
This means you can rerun the package from a
predetermined checkpoint, either a task or container in
the package. The ability to restart a package can save a
lot of time, especially if the package processes data from
a large number of sources.
5
What can you do with SSIS?
To load the dimension and fact tables in the database. If the source
data for a dimension table is stored in multiple data sources, the
package can merge the data into one dataset and load the dimension
table in a single process, instead of using a separate process for each
data source.
To update data in data warehouses and data marts. The Slowly
Changing Dimension Wizard automates support for slowly changing
dimensions by dynamically creating the SQL statements that insert and
update records, update related records, and add new columns to
tables.
To process Analysis Services cubes and dimensions. When the package
updates tables in the database that a cube is built on, you can use
Integration Services tasks and transformations to automatically
process the cube and to process dimensions as well.
To compute functions before the data is loaded into its destination. If
your data warehouses and data marts store aggregated information,
the SSIS package can compute functions such as SUM, AVERAGE, and
COUNT. An SSIS transformation can also pivot relational data and
transform it into a less-normalized format that is more compatible
with the table structure in the data warehouse.
6
SSIS Architecture
7
Control Flow
Bulk Insert task: Perform a fast load of data fro flat files into a
target table. Good for loading clean data.
Execute SQL task: Perform database operations, creating views,
tables, or even databases. Query data or metadata
File Transfer Protocol and File System tasks: transfer files or sets
of files.
Execute Package, Execute DTS2000 Package, and Execute
Process tasks: Break a complex workflow into smaller ones, and
define a parent or master package to execute them.
Send Mail task: sends an email message.
8
Control Flow (cont’d)
Script and ActiveX Script tasks: Perform an endless
array of operations that are beyond the scope of the
standard tasks.
Data Mining and Analysis Service Processing tasks:
Launch processing on SSAS dimensions and databases.
Use SSAS DDL task to create new Analysis Services
partitions, or perform any data definition language
operation.
XML and Web Services tasks
Message Queue, WMI Data Reader, and WMI Event
Watcher tasks: Help to build an automatic ELT system.
ForEach Loop, For Loop, and Sequence containers:
Execute a set of tasks multiple times
Data Flow tasks
9
Data Flow Task
Data Flow task is a pipeline in which data is
picked up, processed and written to a
destination.
Avoids I/O, which provided excellent
performance
Concepts
◦ Data sources
◦ Data destinations
◦ Data transformations
◦ Error flows
10
Some Data transformation Steps
Sort and Aggregate transforms
Conditional Split and Multicast transforms
Union All, Merge Join, and Lookup transforms
Slowly Changing Dimension transform
OLE DB Command transform
Row Count and Audit transforms
Pivot and Unpivot transforms
Data mining Model Training and data Mining Query
transforms
Term extraction and Term Lookup transforms
File Extractor and File Injector transforms
11
Dynamic Packaging
Modifying the actions that a package takes when it’s
executing.
SSIS implements a rich expression language that is used
in control flow and also in data flow transform.
Concepts
◦ Expressions. Uses an expression language, simple.
◦ Variables. Can be defined within a package. Can be
scoped to any object: package-wide, within a
container, a single task, etc.
◦ Configurations. Can overwrite most of the settings
for SSIS objects by supplying a configuration file at
runtime.
12
Using SSIS Import and Export Wizard
Importing or exporting a flat file into SQL
Server 2005
◦ Choose Start | Run. In the Run dialog box’s Open
field, type DTSWizard; click OK button to open
SQL Server Import and Export Wizard’s Welcome
page.
No need to have SQL Server to run the SSIS
Import and Export Wizard.
13
SQL Server Integration Services
The hierarchy of SSIS
◦ Project -> Package -> Control flow -> Data flow
Package structure
◦
◦
◦
◦
◦
Control flow
Data flow
Event handler
Package explorer
Connection tray
Features
◦
◦
◦
◦
Event driven
Layered
Drag-and-drop programming
Data I/O definitions are done using Connection Managers
14
Three Layer Structure of Integration
Services
Control
Flow
Data
Flow
Event
Handler
15
Snowflake Schema of the Data Mart
DimBatch
ManufacturingFact
DimMachine
DimProduct
DimMachineType
DimPlant
DimMaterial
DimCountry
DimProductSubType
DimProductType
Does data loading order matter?
16
Snowflake Schema of the Data Mart
10
DimBatch
9
DimMachine
8
ManufacturingFact
3
2
1
DimProduct
7
DimMachineType
DimPlant
5
6
DimMaterial
DimCountry
4
DimProductSubType
DimProductType
Aggregate
SQL Coding
17
Exercise 4: Populate Maximum Miniatures
Manufacturing Data Mart Dimensions
Preparation: Data sources and destination definition
Loading dimensions
◦
◦
◦
◦
◦
◦
◦
◦
ProductType
ProductSubType
Product
Country
Plant (using SQL Command)
Material (using SQL Command, Aggregate item)
MachineType (copied from the Material loading task)
Machine (copied from the MachineType loading task)
Note: DimBatch and the fact table will be loaded in the next exercise.
Debugging
◦ Step by step
◦ Understand the error messages
◦ Watch database loading status
See more detailed Guidelines of this exercise
18
Codes for data flows
The following codes are used to selectively retrieve data from the
source for the destination database
Code for DimPlant loading
SELECT LocationCode, LocationName, CountryCode
From Locations
WHERE LocationType = 'Plant Site'
Code for DimMaterial loading
SELECT AssetCode, AssetName, AssetClass, LocationCode,
Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'
19
Package Items
Data flow Task – main task
Control Flow Items
◦
Data Preparation Tasks
◦
Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining Query Task
Transfer Tasks
◦
ActiveX Script Task, Script Task
Analysis Services Tasks
◦
Bulk Insert Task, Execute SQL task
Scripting Tasks
◦
Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send
Mail Task, WMI Data Reader Task, WMI Event Watcher Task
SQL Server Tasks
◦
File System Task, FTP Task, Web Service Task, XML Task
Work Flow Tasks
◦
For Loop Container, Foreach Loop Container, Sequence Container
Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task Transfer Objects Task, Transfer
Stored Procedures Task
Maintenance Tasks
Custom Tasks
20
Connection managers
Excel Connection Manger
File Connection Manger
Flat File Connection Manager
FTP Connection Manager
HTTP Connection Manager
ODBC Connection Manager
OLE DB Connection Manager ADO Connection Manager – for legacy
applications using earlier versions of programming languages, such as VB 6.0
ADO.NET Connection Manager – Access to Microsoft SQL Server and
data sources exposed through OLE DB and XML by using a .NET provider
Microsoft .NET Data Provider for mySQL Business Suite – access to SAP
server and enables to execute RFC/NAPI commands and select queries
against SAP tables
Design-time data source objects can be created in SSIS, SSAS and SSRS
projects
21
Container Managers
Foreach Loop Container
For Loop Container
Sequence Container
22
Data Flow Components
Data flow sources
Data flow destinations
Data transformations
23
Different Types of ETL Control Flows
With data flows, e.g.
◦ Import data
◦ Database updates
◦ Loading SCD
◦ Database cleansing
◦ Aggregating data
Without data flows, e.g.
◦ Downloading zipped files
◦ Archiving downloaded files
◦ Reading application log
◦ Mailing opportunities
◦ Consolidating workflow package
24
Data Flow for Updating Database
25
Data Flow for Loading Slowly Changing
Dimension
26
Control Flow for Importing Expanded Files
27
De-Duplication
Two common situations: person, and
organization
SSIS provides two general-purpose transforms
helping address data quality and de-duplication
◦ Fuzzy Lookup
◦ Fuzzy Grouping
28
ETL System Debugging
Most frequently encountered errors
◦ Data format error: The database table’s data type
does not match the input data’s format
Reason 1: Flat Text file uses varchar(50), or
string[DT_STR] format; Excel file uses nvarchar format
Reason 2:You defined the database using different
formats, which could be caused by the imported data
set.
Solution: A Data Conversion data transformation node
can be used for changing the format
◦ SQL Server system error: Even though you did
things correctly you could not get through.
Solution: the easiest way to solve this problem is to redo
the ETL flow.
29
ETL How-to Problems
How to use Merge function of Data Transformation
to join datasets from two tables into one.
How to split a dataset to two tables
How to remove duplicated rows in a table.
How to detect the changes of the rows in the data
sources and extract the updated rows into a table in
the data warehouse.
How to load multiple datasets with similar structure
into a table
Reference: SQL Server 2005 Integration Services,
McGraw Hill Osborne, 2007
30
Exploring Features of SQL Server ETL
System
Data Source and Data destination
◦ Flat File
Data flow transformation
◦
◦
◦
◦
Aggregate
Derived Column
Data Conversion
Sort
31
Exercise 5: Exploring Features of SQL
Server ETL System
Data Set:
◦ Source: Commrex_2008, D5.txt (in the shared
directory under \OtherDatasets)
◦ Destination: Flat file, Excel file, OLE DB file
Data flow transformation
◦ Aggregate (Use D5.txt, and aggregate the data with
regard to UserID)
◦ Derived Column (Use Commrex_2008, and create a
new column “NewID”)
◦ Data Conversion (Use Commrex_2008, and convert
data type of some columns, such as UserID, Prop_ID)
◦ Sort (use D5.txt, sort ascending with ID, Date, Time)
32