ETL Basics - dbmanagement.info

Download Report

Transcript ETL Basics - dbmanagement.info

ETL Basics & SSIS Architecture
0
• Extraction Transformation & Load
– Extracts data from source systems
– Enforces data quality and consistency standards
– Conforms data from different sources
– Load data to target systems
• Usually a
– batch process
– Involves large volumes of data
• Scenarios
– Load a data warehouse, data mart for analytical and reporting applications
– Data Integration
– Load packaged applications, or external systems through their APIs or
interface databases
– Data Migration
1
Advantages of Tool-based ETL
– Reusability
– Metadata repository
– Incremental load
– Managed batch loading
– Simpler connectivity
– Parallel operation
– Vendor experience
2
ETL Products from
– Pure-play ETL vendors
– Database vendors
– Business Intelligence vendors
3
• Extract
– Heterogeneous Data Sources: Relational & non-relational databases;
Sequential flat files, complex flat files, COBOL files, VSAM data, XML
data, etc.; Packaged Applications (e.g. SAP, Siebel, etc.)
– Incremental/changed data or complete/snapshot data
– Internal data or third-party data
– Push/Pull
• Transform
– Cleansing & validation
• Simple - Range checks, duplicate checks, NULL value transforms etc.
• Specialized/Complex – Name & address validations, de-duplication, etc.
4
– Transform
• Computations (arithmetic, string, date, etc.)
• Pivot
• Split or Concatenate
• Aggregate
• Filter
• Join, look-up
• Load
– Historical vs. Refresh load
– Incremental vs. Snapshot
– Bulk Loading vs. Record-level Loading
5
ETL Basics
• Usual features provided by ETL tools:
– Graphical data flow definition interfaces for easy development
– Native & ODBC connectivity to standard databases, packages, etc.
– Metadata maintenance components
– Metadata import & export from standard databases, packages, etc.
– Inbuilt standard functions & transformations – e.g. date, aggregate, sort,
etc.
– Options for sharing or reusing developed components
– Facility to call external routines or write custom code for complex
requirement
– Batch definition to handle dependencies between data flows to create the
application
– ETL Engines that handle the data manipulation without depending on the
database engines.
– Run-time support for monitoring the data flow and reading message logs
– Scheduling options
6
ETL Basics
Architecture of a Typical ETL Tool
Source & Target Database
Source & Target Database
Data
ETL Metadata Repository
Metadata
ETL Engine
Data
GUI-Based Development Environment
Run-time Environment
• Metadata Definition/Import/Export
• Data Flow & Transformation Definition
• Batch Definition
• Test & Debug
• Schedule
• Trigger ETL
• Monitor flow
• View logs
7
ARCHITECTURE OF SQL SERVER INTEGRATION SERVICES
8
Microsoft SQL Server Integration Services consists of diverse
components:
• SSIS Designer :
SSIS Designer is a graphical tool that you can use to create and maintain
Integration Services packages. SSIS Designer is available in Business
Intelligence Development Studio as part of an Integration Services project.
• Runtime engine:
The Integration Services runtime saves the layout of packages, runs
packages, and provides support for logging, breakpoints, configuration,
connections, and transactions.
• Tasks and other executables:
The Integration Services run-time executables are the package, containers,
tasks, and event handlers that Integration Services includes. Run-time
executables also include custom tasks that you develop.
9
Microsoft SQL Server Integration Services consists of diverse
components:
• API or object model:
The Integration Services object model includes managed application
programming interfaces (API) for creating custom components for use in
packages, or custom applications that create, load, run, and manage
packages.
10