Do you ELT on z? - Midwest Database User Group

Download Report

Transcript Do you ELT on z? - Midwest Database User Group

Do you ELT on z?
Mar 16. 2009
IBM INFORMATION ON DEMAND 2008
© 2008 IBM Corporation
Information Management Software
Agenda
• Design Studio Overview
• Development SQW Flows using Design Studio
• Physical Data Modeling
• Data Flows
• Control Flows
• Deployment & Runtime Management
© 2008 IBM Corporation
Information Management Software
Inhibitors to data warehousing on System z?
• DB2 functionality?
• DB2 V8 and V9 – great strides
• Costs?
• Specialty processors help – IIPs
• Lower cost licensing for new DB2 workloads – DB2 VUE
• Perception?
• Conventional wisdom over last decade: use distributed sytems for
Data Warehouses
• A lot of shops defied conventional wisdom!
• Lack of skills and/or appropriate data warehouse building tooling
on System z
• Fewer people have skills on System z?
• Recent graduates grew up using graphical tooling!
• No green screen for them
• IBM recognizes this and is porting tooling to System z
© 2008 IBM Corporation
Information Management Software
Data Movement and Transformation
• Moving and transforming data is a key component to
building a data warehouse
• What do you use on System z?
• Programs? COBOL? Rexx?
• Database utilities? Import, Export?
• SQL scripts?
• Stored procedures?
• FTPs?
• ????
© 2008 IBM Corporation
Information Management Software
Alphabet Soup for Tools
E = Extract
• ETL
• ELT
T = Transform
• ETLT
• TELT
• TETLT
L = Load
• What???
• Just MarketingSpeak of stating how and where the data
movement and transformation activities occur
• Leverage an stand-alone transformation engine
• Leverage a database engine for data transformations
© 2008 IBM Corporation
Information Management Software
ETL – Extract Transform Load
Extract
Transform
Load
Usually a stand-alone server separate from any source or target systems
Referred to as an ETL Engine that performs all extracts and transformations
Database agnostic
May push some processing to source databases
May invoke database utilities
Usually a procedural design slant
© 2008 IBM Corporation
Information Management Software
ELT – Extract Load Transform
Extract
Load
Transform
Turns the relational database engine into a data movement and transformation engine
The work is usually done at the target system
Can source from multiple database types but typically supports a specific target
May push some processing to source databases
May invoke database utilities
By definition, has a set-based design slant
© 2008 IBM Corporation
Information Management Software
Variations on a theme
• ETLT, TETLT
• ETL tools are now pushing some processing down into
the source and/or target relational databases by
generating SQL
• TELT, TETLT
• ELT tools can typically push SQL to remote databases
• ELT tools can do some limited non-relational processing
– typically by calling executables or scripts
© 2008 IBM Corporation
Information Management Software
Reality…
• Don’t get caught up in the ETL vs ELT wars
• Advantages to doing some work outside the database
• Advantages go doing some work inside the database
• Don’t build an ETL or an ELT system
• Build an architected Population Subsystem
• Apply the appropriate tools to the appropriate function for
the available (and future) skill level
• ETL tools when appropriate
• ELT tools when appropriate
• Other tools when appropriate
© 2008 IBM Corporation
Information Management Software
A Population Subsystem
• Consider using an
architected model for
building population
subsystems
• Each logical layer performs a
specific kind of function and
processing is encapsulated to
that layer
• Logical layers are grouped
into physical components and
a staging of data occurs
between physical
components
• Each physical component will
be implemented with
technical functions
implemented by one or more
technologies
© 2008 IBM Corporation
Information Management Software
Population Subsystem
• Consider using an
architected model for
building population
subsystems
• Each logical layer
performs a specific kind
of function and
processing is
encapsulated to that
layer
• Logical layers are
grouped into physical
components and a
staging of data occurs
between physical
components
• Each physical
component will be
implemented with
technical functions
implemented by one or
more technologies
From IBM course DW130
© 2008 IBM Corporation
Information Management Software
Tools on System z (z/OS or Linux on System z)
• Data Movement and Transformation
• ETL: InfoSphere DataStage for Linux on System z
• Data Cleansing: InfoSphere Quality Stage for Linux on System z
• ELT: InfoSphere Warehouse on System z – SQL Warehousing Tool
• Expanded Sources
• Heterogeneous data access: InfoSphere Federation Server
• Classic Data Sources: InfoSphere Classic Federation Server for z/OS
• Capturing Changes
• SQL and Q-based data replication: InfoSphere Replication Server for z/OS
• Classic data source replication: InfoSphere Classic Replication Server for z/OS
• DB2 event publishing: InfoSphere Data Event Publisher for z/OS
• Class event publishing: InfoSphere Classic Data Event Publisher for z/OS
© 2008 IBM Corporation
Information Management Software
ELT on System z
InfoSphere Warehouse on System z
SQL Warehousing Tool
(SQW)
© 2008 IBM Corporation
Information Management Software
InfoSphere Warehouse on System z
MQT
Adviso
r
Eclipse
Client Layer
Design Studio
IE/Firefox
Admin Console
Excel
Cognos 8 BI for
System z
Third Parties / BPs
Windows / Linux
MDX
JDBC/DB2 Connect
• Design and admin
client
• BI / Reporting
tools and apps
SQL
JDBC/DB2 Connect
Administration
SQW Runtime
Cubing Services
Engine
Application
Server
WebSphere App Server
Linux on System Z Partition / IFL
JDBC/DB2 Connect
MQT
Cube Metadata
DB2 for z/OS
DB2 for z/OS
IMS
VSAM
Data Warehouse
Server
Control
DB
RDBMS
Source Systems
© 2008 IBM Corporation
Information Management Software
Design Studio
© 2008 IBM Corporation
Information Management Software
InfoSphere Warehouse Design Studio – Key features
• (IDE) Integrated Development Environment for DB2 Warehouse
projects
• Integrated consistent and interoperable tools for:
• Connecting and browsing databases
• Exploring data
• Designing physical database models (reverse/forward engineering)
• Designing OLAP objects
• Designing data movement and transformation flows
• The platform is extensible (Eclipse based) and can be easily
extended with third party or customer developed plug-ins
© 2008 IBM Corporation
Information Management Software
Getting Started – The Workspace
• A workspace is:
• A directory on the local file system where the projects
(i.e. Metadata) created in the Design Studio are stored as
XML files.
• The Design Studio GUI.
Metadata
© 2008 IBM Corporation
Information Management Software
The Business Intelligence Perspective
•
The Business Intelligence Perspective is the default perspective in the Design Studio.
•
It contains the views which are useful during the development of a Data Warehousing project.
•
Views can be moved/stacked by using drag and drop. They can be maximized by double-clicking on their title.
•
View can be closed and reopened. To add a new view to the perspective, use the menu bar “Window -> Show View”
Editor(s)
Object
Palette
Data Project
Explorer View
Outline
View
Data Output
View
Problems
View
Data Source
Explorer View
Properties
View
© 2008 IBM Corporation
Information Management Software
Data Source Explorer
• Define live jdbc connections to relational sources
• Could be nicknames if Federated Server installed
• Browse and work with objects in the live database
• Live connection required for many Design Studio
operations
• Reverse engineering data models
• Test execution of flows
• Sampling data
© 2008 IBM Corporation
Information Management Software
Team component
•
Version and configuration management
•
Share resources with team via a repository
•
The Design Studio includes a CVS repository provider
•
Other repository providers can be used by installing the plugins provided by the repository vendors
•
Rational Clearcase
•
IBM CMVC
•
Merant PVCS; Version Manager
•
… see eclipse.org community page for a list of the available plugins.
Check in/out
Repository
Check in/out
© 2008 IBM Corporation
Information Management Software
Integration with other IBM Tools – Eclipse Shell Sharing
• Share the core Eclipse components so that they are not duplicated between
each Eclipse-based product. Shell sharing eliminates the need to install
several Eclipse platforms for each product, thus saving disk space and
eliminating duplication of components.
• Supported products that shell share with the Design Studio:
• Data Studio Developer (DSD) v2.1
• Data Studio Administrator (DSA) v2.1
• InfoSphere Developer Architect (IDA) v7.5.1
• Data Studio Optimization Expert for z/OS (DSOE) v2.1
• Rational Architect Developer (RAD) v7.5.1
• Rational Software Architect (RSA) v7.5.1
© 2008 IBM Corporation
Information Management Software
Physical Data Modeling
© 2008 IBM Corporation
Information Management Software
Data Models
• Two types of data models
• Logical – The “Business” representation of data without regard to
underlying DBMS
• Physical – The representation of the data as it would appear in the
DBMS
• Design Studio supports development of physical data models
• Metadata representation of actual objects that are present in the
DBMS
• Create from scratch
• Reverse engineer from existing database or DDL
• Physical data model required to provide database metadata to
other SQW components
© 2008 IBM Corporation
Information Management Software
Data modeling overview
• Design and modify database physical models (schema & storage
design, as well as cubes, dimensions, hierarchies)
Key Features:
•
Create a new DB design from scratch
•
Reverse engineer from an existing
Database Explorer connection or from DDL
•
Create overview diagrams
•
Modify the schema graphically or in the
project tree
•
Compare DB objects with each other or
with objects existing in the database
•
Analyze design (best practices, and
dependencies), Validation
•
Generate DDL script & Deploy
•
Impact Analysis
•
DB2 Storage Modeling: Table Space,
Buffer Pool, Partition
© 2008 IBM Corporation
Information Management Software
Design Studio vs InfoSphere Data Architect
• Design Studio includes a subset of functionality provided in InfoSphere Data Architect (IDA).
• Design Studio includes the physical data modeling and corresponding SQL generation
capabilities to help you implement and modify to your physical model.
IDA
Design
Studio
Logical data modeling
Naming model
Glossary model
Other non-LUW advanced physical data modeling
Web publishing and report
Mapping editor
UML - LDM transformation
Data Project Explorer
Database Explorer
Complete Physical Data Modeling for DB2
Basic Physical Data Modeling for others
Impact Analysis
© 2008 IBM Corporation
Information Management Software
Design Studio - Physical Data Model
Data Project
Database – one per
model
Physical data model
Schema
Diagrams logical folder –
one per schema
Diagram
SQL statements logical
folder – one per schema
Table
Primary key column
Column
Primary key constraint
Index
Unique constraint
© 2008 IBM Corporation
Information Management Software
Physical Data Model - Diagram
Hide/show
palette
Geometric
shapes
Drawing
area
Palette –
select the
element to
create on
diagram
© 2008 IBM Corporation
Information Management Software
Embedded Data Movement
SQL Warehouse Tool
(SQW)
© 2008 IBM Corporation
Information Management Software
SQL warehousing tool (SQW)
• Build and execute intra-warehouse (SQL-based) data movement and transformation
services
• Integrated Development Environment and metadata system
• Model logical flows of higher-level operations
• Generate code and create execution plans
• Test and debug flows
• Package generated code and artifacts into a data warehouse application
• Integrate SQW Flows and DataStage jobs
• Generate DB2 z/OS specific optimized SQL code (Data Flows)
• DB2 z/OS specific operators
• DB2 z/OS specific code generation
• Across query optimization
• Predicates pushdown and move around
• Unnecessary column reduction
• Staging table handled automatically by the engine
• Integrate SQL based flows with non-database activities (Control Flows)
• Sequence and manage activity flow
© 2008 IBM Corporation
Information Management Software
Data Flows
© 2008 IBM Corporation
Information Management Software
Data flows
• Data flows are flow models that represent data movement and transformation
requirements
• SQW Codegen translates the models into repeatable, SQL-based warehouse building
processes
• Data from source files and tables moves through a series of transformation steps then
loads or updates a target table or creates a file
© 2008 IBM Corporation
Information Management Software
Would you rather type this ?
……………………….
SELECT
SALES.OU_IP_ID AS STR_IP_ID,
SALES.PD_ID AS PD_ID,
SALES.MSR_PRD_ID AS TIME_ID,
SALES.C_D_MKT_BSKT_TXN_ID AS NMBR_OF_MRKT_BSKTS,
SALES.SUM_NBR_ITM AS NUMBER_OF_ITEMS,
CASE
WHEN SALES.M_BK_PD_SUB_DEPT_NM IN ('BATH AND SHOWER', 'CAMERAS') THEN
SALES.BKP_SUM_NBR_ITMXPRC * DECIMAL(MARTS.RAND1N(5) + 123) / 100
ELSE
SALES.BKP_SUM_NBR_ITMXPRC * DECIMAL(MARTS.RAND1N(5) + 102) / 100
END AS PRDCT_BK_PRC_AMUNT,
CASE
WHEN
SALES.MIN_CG_PD_DEPT_NM IN ('TEEN BOYS', 'TEEN BOYS JEANS', 'DRESS FORMAL','MEN SHOES')
THEN
(DECIMAL(68 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
WHEN
SALES.MIN_CG_PD_DEPT_NM IN ('ELECTRICAL APPLIANCES','ELECTRONICS','COLORED TELEVISIONS','WOMEN
SHOES')
THEN
(DECIMAL(77 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
WHEN
SALES.MIN_CG_PD_DEPT_NM IN ('HEALTH AND BEAUTY')
THEN
(DECIMAL(65 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
ELSE
(DECIMAL(72 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
END AS CST_OF_GDS_SLD_CGS,
SALES.SUM_NBR_ITMXSTM_PRC AS SALES_AMOUNT
FROM
SALES) …………
© 2008 IBM Corporation
Information Management Software
would you rather describe your logic at
a higher-level ?
A simple “Skills”
Star schema
© 2008 IBM Corporation
Information Management Software
… and have
optimized SQL
generated for you?
INSERT INTO OLAPANL.STAR_FACT_TABLE
(ID, COMPANY_ID, TIME_ID, SKILL_DETAILS_ID, NB_SKILLS)
WITH INPUT_04 (COMPANY_NAME, TIME, ID, SKILL_CAT, SKILL_DETAILS, SKILL_ID)
AS (
SELECT
COMPANY_NAME AS COMPANY_NAME,
TIME AS TIME,
ID AS ID,
SKILL_CAT AS SKILL_CAT,
SKILL_DETAILS AS SKILL_DETAILS,
SKILL_ID AS SKILL_ID
FROM
TXTANL.IT_SKILLS_ASKED INPUT_0281),
IN4_07 (ID, SKILLS_PER_OFFER)
AS (
SELECT
INPUT_04.ID AS ID,
COUNT(*) AS SKILLS_PER_OFFER
FROM
INPUT_04
GROUP BY INPUT_04.ID)
SELECT
IN_07.ID AS ID,
IN1_07.COMPANY_ID AS COMPANY_ID,
IN2_07.TIME_ID AS TIME_ID,
IN3_07.SKILL_DETAILS_ID AS SKILL_DETAILS_ID,
DOUBLE(1) / DOUBLE(IN4_07.SKILLS_PER_OFFER) AS NB_SKILLS
FROM
TXTANL.IT_SKILLS_ASKED IN_07,
OLAPANL.STAR_COMPANY IN1_07,
OLAPANL.STAR_TIME IN2_07,
OLAPANL.STAR_SKILL IN3_07,
IN4_07
WHERE (IN_07.SKILL_DETAILS = IN3_07.SKILL_DETAILS AND
IN_07.COMPANY_NAME = IN1_07.COMPANY_NAME AND
IN_07.TIME = IN2_07.TIME_DATE AND
IN_07.ID = IN4_07.ID)
© 2008 IBM Corporation
Information Management Software
Data Flow Operators
Most operators same as in LUW versions but
generate DB2 z/OS specific SQL


Sources & Targets
Table Source (Local and Remote)
Table Target (Local and Remote)
Data Set Import
Data Set Export
SQL Query Source
Data Station
SQL Transformation Operators
Select List
Distinct
Group By
Order By
Table Join
Where (Filter)
Union

Warehouse Operators
Fact Key Replace
Key Lookup
Pivot
Unpivot
Splitter

Custom Tranformations
Custom SQL
DB2 Table Functions
DB2 User Defined Functions
© 2008 IBM Corporation
Information Management Software
Data Flows that call DB2 z/OS utilities
Data Set Import Operator
 Invokes the load utility to load data in a target table from a data set
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
Data Set Export Operator
 Invokes the unload utility to unload data from a table to a BSAM sequential
data set.
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
Cross Loader Operator
 Invoke load utility to directly load the output of a dynamic SQL SELECT
statement into a table
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
© 2008 IBM Corporation
Information Management Software
Execution Database
• The DB2 for z/OS subsystem to which the generated SQL of a
dataflow is submitted
• Sources/Target tables are local when in the execution subsystem,
otherwise remote
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Local Source – Local Target
• Both Source and Target tables are in the same DB2 for z/OS
subsystem
• SQL submitted to DB2C
• No data flows outside of DB2C
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
SQL
processin
g
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Remote Source Table via jdbc – Local Target
• Source table in remote database accessed as remote Table Source
operator
• Java application runs on Linux connects to DB2A and DB2C
• Data flows throught Linux
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
SQL
processin
g
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Remote Source DB2 via Cross Loader – Local Target
• Source table in remote DB2 database accessed as remote Table
Source operator but using Cross Loader Target operator
• Cross Load utility invoked at DB2C
• Uses DDF to access remote DB2 table over DRDA
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
SQL
processin
g
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Local Source Table– Remote Target via jdbc
• Source table in local DB2 z database and target is defined as
remote Table Target operator
• Java application runs on Linux connects to DB2A and DB2C
• Data flows throught Linux
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
SQL
processin
g
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Remote Source Table– Remote Target Table via jdbc
• Source and Target tables are remote
• Java application runs on Linux connects to DB2A and DB2C
• Data flows through Linux
• All SQL processing is in DB2 z @ DB2C
Blue = Source Table
Green – Target Table
Linux
Control
DB2A
Data
Admin Console/SQW Runtime
Distributed Database
DB2C
SQL
processin
g
Execution database = DB2C
© 2008 IBM Corporation
Information Management Software
Other data flow features
•
Variables
•
Variables can be used in Data Flows
•
Defer the definition of certain properties until a
later phase in the life cycle.
•
•
•
•
•
•
File Names
Table Names
Database Schema Names
Many more
Generalize a Data Flow
Subflows
•
A subflow is a predefined set of operators that
you can place inside a data flow.
•
Useful as a plugin into multiple versions of the
same or similar data flows
•
Containers or building blocks for complex flows
(division of labor)
•
Blue ports represent subflow inputs and
outputs
© 2008 IBM Corporation
Information Management Software
Control Flows
© 2008 IBM Corporation
Information Management Software
Definition and simple example
• A control flow is a flow model that sequences one or more data
flows and integrates other data processing tasks and activities.
• Control flows are the unit of execution.
• This simple example processes two data flows in sequence. If they fail,
e-mail is sent to an administrator:
© 2008 IBM Corporation
Information Management Software
Control Flow Operators
• Task-oriented operators (Do things)
• Data flow
• Subprocess
• JCL Job
• Command (DB2 Shell/FTP)
• Secure Command
• Secure FTP
• Email
• Period row generator
• Load
• Unload
• Reorg
• Runstats
• Table Partition
• Stored procedure
• DataStage job sequence
• DataStage parallel job
• Custom SQL
© 2008 IBM Corporation
Information Management Software
Control Flow Operators
• Flow control operators (Manage things)
• Parallel Container
• Start
• End
• Iterator/End Iterator
• Continue
• Break
• Fail
• File wait
• Variable assignment
• Variable comparison
• File Write
© 2008 IBM Corporation
Information Management Software
CF Operator Introduction(1)
Exchange Operator
 Switch the contents of a base table and its associated clone table.
 Exchange operation is not supported when it runs against a DB2 z/OS version
8 database.
Table Partition Operator

Perform table partition operation
1. Adding a partition
2. Rotating partitions
3. Changing partition boundary
Runstats Operator
 Update the system catalog statistics for DB2 for z/OS database through the
DB2 RUNSTATS utility.
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
© 2008 IBM Corporation
Information Management Software
CF Operator Introduction(2)
Reorg Operator
 Reorganize a table space or an index for DB2 for z/OS database through the
DB2 REORG utility.
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
Unload Operator
 Unload data from an entire table space or select table, columns to the BSAM
sequential data sets.
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
DB2 Online Utility Operator
 Runs any DB2 for z/OS utility that can be invoked by the stored procedure
DSNUTILU
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU
© 2008 IBM Corporation
Information Management Software
CF Operator Introduction(3)
Load Operator
 Four load approaches
1.
2.
3.
4.
Load from Z/OS data set
Load from cursor
Load from file on client side (Using FTP to transfer file to Z/OS data set)
Load from file on client side (Using FTP to transfer file to Z/OS BATCHPIPES data set)
 Call DB2-Supplied stored procedure SYSPROC.DSNUTILU for 1,2,3
approaches, Call ADMIN_JOB_SUBMIT, ADMIN_JOB_QUERY,
ADMIN_JOB_FETCH, ADMIN_JOB_CANCEL for 4 approach.
Command Operator (FTP)
 Advanced Options Tab has been added ,user could specify following additional
options for Z/OS file transfer:
1. Record length
2. Record format
3. Data transfer type
© 2008 IBM Corporation
Information Management Software
CF Operator Introduction(4)
JCL Operator
 JCL: a control language that is used to identify a job to an operating system
and to describe the job's requirements.
 JCL Operator works to submit a job, query the status of the job, fetch the
output of a job and purge a job.
 Support three scenarios
1. JCL on local machine
2. JCL on Z/OS side
3. New edited JCL
 Call DB2-Supplied stored procedures :ADMIN_JOB_SUBMIT,
ADMIN_JOB_QUERY, ADMIN_JOB_FETCH, ADMIN_JOB_CANCEL,
ADMIN_DS_BROWSE
© 2008 IBM Corporation
Information Management Software
Deployment and Runtime
© 2008 IBM Corporation
Information Management Software
Deployment
The process of promoting a Warehouse Application from a development
environment to Test and Production environments.
Prepare
Design / Test
Application
Development
Test
Deploy / Install
Production
Zip file
Application Profile
DDL
EPG
Manage /
Execute
Deploy / Install
Manage /
Deployment preparation is done in
the Design Studio
Deployment is done via the
Administration Console
Execute
© 2008 IBM Corporation
Information Management Software
InfoSphere Warehouse on System z
Administration Console – manage the
runtime environment
•
•
•
•
•
Deploy data movement applications
Schedule, Execute, Monitor flows
Define and manage cube servers
Manage OLAP Metadata
Assign cubes to cube servers
54
© 2008 IBM Corporation
Information Management Software
Admin Console - SQW
Manage database connections
Deploy and manage SQW apps
Run and schedule Control Flows
Monitor run instances
© 2008 IBM Corporation
Information Management Software
Contact Info – z Warehouse SWAT Team
• Mgr – Beth Hamel – [email protected]
• Andy Perkins – [email protected]
• Jonathan Sloan – [email protected]
• Sundari Voruganti – [email protected]
• Willie Favero – [email protected]
© 2008 IBM Corporation