Transcript Document

Extraction, Transformation,
and Loading (ETL)
Extraction and Transportation
Copyright © 2005, Oracle. All rights reserved.
Objectives
After completing this lesson, you should be able to do
the following:
• Describe the core ETL framework inside the
database and its integration advantage
• Explain data warehousing extraction methods
• Identify transportation methods:
– Flat file
– Distributes operations
– Transportable tablespaces
•
4-2
Describe transformation flow
Copyright © 2005, Oracle. All rights reserved.
Overview
•
•
•
4-3
Lesson 4: Extraction/Transportation
Lesson 5: Loading
Lesson 6: Transformation
Copyright © 2005, Oracle. All rights reserved.
What Is ETL?
•
•
ETL is an acronym for Extraction, Transformation,
and Loading.
The following happen during the ETL process:
– The desired data is identified and extracted from
many different sources.
– Some transformations may take place during this
extraction process.
– After extraction, the data must be transported to a
target system or an intermediate system for further
processing.
– Depending on the method of transportation, some
transformations can be done simultaneously.
•
4-4
ETL refers to a broad process.
Copyright © 2005, Oracle. All rights reserved.
Extraction Methods
•
Extraction can be thought of in two parts:
– Extraction
– Transportation
•
There are two extraction methods:
– Logical
– Physical
•
•
Your logical choice influences the way the data is
physically extracted.
Some criteria for choosing a combination:
–
–
–
–
4-6
Business needs
Location of the source and target systems
Availability of the source system
Time required to extract data
Copyright © 2005, Oracle. All rights reserved.
Logical Extraction Methods
There are two kinds of logical extraction:
• Full extraction
– All data is pulled
– Less information to track
– More time required to pull the data
•
Incremental extraction
– A subset of data is pulled
– Must track what data needs to be pulled
– Less time required to pull the data
4-7
Copyright © 2005, Oracle. All rights reserved.
Physical Extraction Methods
There are two types of physical extraction.
• Online extraction:
– Pulls data from the source system
•
Offline extraction:
– Pulls data from a staging area
– Staging areas include flat files, dump files, and
transportable tablespaces.
4-9
Copyright © 2005, Oracle. All rights reserved.
Offline Extraction
Staging areas:
• Flat files
– Requires data in a predefined, generic format
•
Dump files
– Must be in an Oracle-specific format
•
Redo and archive logs
– Data located in special dump files
•
Transportable tablespaces
– Powerful, fast method for moving large volumes of
data
4-10
Copyright © 2005, Oracle. All rights reserved.
Implementing Methods of Extraction
•
Extracting to a file:
–
–
–
–
•
4-11
Spooling from SQL*Plus
Using OCI or Pro*C to dump to a file
Using Data Pump to export to an Oracle dump file
Using external tables
Extracting through distributed operations
Copyright © 2005, Oracle. All rights reserved.
Incremental Extraction Using CDC
CDC can capture and publish committed change data
in either of the following modes:
• Synchronous
– Triggers on the source database allow change data
to be captured immediately.
– Change data is captured as part of the transaction
modifying the source table.
•
Asynchronous
– Change data is captured after a SQL statement
performing DML is committed using the redo logs.
– Asynchronous Change Data Capture is built on
Oracle Streams.
4-13
Copyright © 2005, Oracle. All rights reserved.
Publish and Subscribe Model
The publisher performs the following tasks:
• Identifies source tables from which the data
warehouse is interested in capturing change data
• Uses the DBMS_CDC_PUBLISH package to:
– Set up the capture of data from the source tables
– Determine and advance the change sets
– Publish the change data
•
4-14
Allows controlled access to subscribers using the
SQL GRANT and REVOKE statements
Copyright © 2005, Oracle. All rights reserved.
Publish and Subscribe Model
The subscriber uses the DBMS_CDC_SUBSCRIBE
package to:
• Subscribe to source tables
•
•
•
•
•
4-16
Extend the window and create change view
Prepare the subscriber views
View data stored in change tables
Purge the subscriber view
Remove the subscriber views
Copyright © 2005, Oracle. All rights reserved.
Synchronous CDC
Source database
SYNC_SOURCE
Change source
Source
database
transactions
Change set
Trigger
execution
Change tables
Source tables
Subscriber
views
4-18
Copyright © 2005, Oracle. All rights reserved.
Asynchronous CDC
Asynchronous CDC:
• Captures change data from redo log files after
changes have been committed to the source
database
• Modes are dependent on the level of supplemental
logging used on the source database
• Uses Oracle Streams to capture change data from
redo log files
• Has three source modes:
– Asynchronous AutoLog mode
– Asynchronous HotLog mode
– Asynchronous Distributed HotLog mode
4-19
Copyright © 2005, Oracle. All rights reserved.
Asynchronous AutoLog Mode
Staging database
Source database
Source
database
transactions
Distributed AutoLog
change set
LOG_ARCHIVE_DEST_2
Change set
RFS
Change
tables
Source tables
LGWR
Online
redo logs
Streams
capture
Distributed AutoLog
change source
4-20
Standby
redo logs
Copyright © 2005, Oracle. All rights reserved.
Subscriber
views
Asynchronous HotLog Configuration
Source database
HOTLOG_SOURCE
Change Source
Source
database
transactions
Change set
LGWR
Change tables
Streams local
capture
Source tables
Online
redo logs
4-22
Subscriber
views
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed HotLog Mode
Staging database
Source database
Distributed HotLog
change set
Source
database
transactions
DBlink
Change set
Source tables
LGWR
Streams
propagation
Change tables
Online
redo logs
Distributed HotLog
change source
4-23
DBlink
Subscriber
views
Copyright © 2005, Oracle. All rights reserved.
Preparing to Publish Change Data
1. Gather requirements from the subscribers.
2. Determine which source database contains the
relevant source tables.
3. Choose the capture mode:
•
•
•
•
Synchronous
Asynchronous HotLog
Asynchronous Distributed HotLog
Asynchronous AutoLog
4. Ensure that the source and staging databases
have appropriate database initialization
parameters set.
5. Set up database links between the source
database and the staging database.
4-24
Copyright © 2005, Oracle. All rights reserved.
Creating a Publisher User
•
The staging database publisher must be granted
the following privileges and roles:
– EXECUTE_CATALOG_ROLE privilege
– SELECT_CATALOG_ROLE privilege
– CREATE TABLE and
CREATE SESSION privileges
– EXECUTE on the
DBMS_CDC_PUBLISH package
•
4-25
Create a default tablespace for
the publisher.
Copyright © 2005, Oracle. All rights reserved.
Synchronous Publishing
1. Create a change set.
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
description => 'Change set for sales history info',
change_source_name => 'SYNC_SOURCE');
END;
2. Create a change table.
3. Grant access to subscribers.
GRANT SELECT ON cdcpub.products_ct TO subscriber1;
4-27
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Prepare the source and staging databases:
1. Configure Oracle Net so that the source database
can communicate with the staging database.
2. Set initialization parameters on the source
database.
compatible = 10.2.0
global_names = true
job_queue_processes = <current value> + 2
open_links = 4
parallel_max_servers = <current value> + 3
processes = <current value> + 4
sessions = <current value> + 1
streams_pool_size = <current value> + 20 MB
undo_retention = 3600
3. Set initialization parameters on the staging
database.
4-29
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Prepare the staging database:
• Set the database initialization parameters on the
staging database.
compatible = 10.2.0
global_names = true
java_pool_size = 50000000
open_links = 4
job_queue_processes = 2
parallel_max_servers = <current_value> + 2
processes = <current_value> + 3
sessions = <current value> + 1
streams_pool_size = <current_value> + 11 MB
undo_retention = 3600
4-30
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Alter the source database:
1. Place the database into FORCE LOGGING logging
mode to protect against unlogged direct writes.
ALTER DATABASE FORCE LOGGING;
2. Enable supplemental logging.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
3. Create an unconditional log group on all columns
to be captured in the source table.
ALTER TABLE SH.PRODUCTS
ADD SUPPLEMENTAL LOG GROUP log_group_products
(PROD_ID, PROD_NAME, PROD_LIST_PRICE) ALWAYS;
4-31
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Publisher privileges on source and staging databases:
1. Create and grant privileges to the source database
publisher.
CREATE USER source_cdcpub IDENTIFIED BY source_cdcpub
QUOTA UNLIMITED ON SYSTEM
QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO source_cdcpub;
GRANT DBA TO source_cdcpub;
GRANT CREATE DATABASE LINK TO source_cdcpub;
GRANT EXECUTE on DBMS_CDC_PUBLISH TO source_cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO source_cdcpub;
GRANT SELECT_CATALOG_ROLE TO source_cdcpub;
EXECUTE DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE(
GRANTEE=> 'source_cdcpub');
2. Create and grant privileges to the staging
database publisher.
4-32
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Create source and staging database links:
1. Create the source database link.
CREATE DATABASE LINK staging_db
CONNECT TO staging_cdcpub IDENTIFIED BY
staging_cdcpub
USING 'staging_db';
2. Create the staging database link.
CREATE DATABASE LINK source_db
CONNECT TO source_cdcpub IDENTIFIED BY
source_cdcpub
USING 'source_db';
4-34
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Create change sources and change sets:
1. Create the change sources.
BEGIN
DBMS_CDC_PUBLISH.CREATE_HOTLOG_CHANGE_SOURCE(
change_source_name => 'CHICAGO',
description => 'test source',
source_database => 'source_db');
END;
2. Create the change sets.
DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
description => 'change set for product info',
change_source_name => 'CHICAGO',
stop_on_ddl => 'y');
END;
4-35
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Create the change tables on the staging database:
BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
owner => 'staging_cdcpub',
change_table_name => 'products_ct',
change_set_name => 'CHICAGO_DAILY',
source_schema => 'SH',
source_table => 'PRODUCTS',
column_type_list => 'PROD_ID NUMBER(6),
PROD_NAME VARCHAR2(50), PROD_LIST_PRICE NUMBER(8,2),
JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
capture_values => 'both',
rs_id => 'y',
row_id => 'n',
...
options_string => 'TABLESPACE TS_CHICAGO_DAILY');
END;
4-36
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed
HotLog Publishing
Enable the change source and change set:
1. Enable the change source.
BEGIN
DBMS_CDC_PUBLISH.ALTER_HOTLOG_CHANGE_SOURCE(
change_source_name => 'CHICAGO',
enable_source => 'Y');
END;
2. Enable the change set.
BEGIN
DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
change_set_name => 'CHICAGO_DAILY',
enable_capture => 'y');
END;
3. Grant access to subscribers.
4-38
Copyright © 2005, Oracle. All rights reserved.
Subscribing to Change Data
1. Find the source tables for which the subscriber
has access privileges.
SQL> SELECT * FROM ALL_SOURCE_TABLES;
SOURCE_SCHEMA_NAME
SOURCE_TABLE_NAME
----------------------------------SH
PRODUCTS
2. Find the change set names and columns for which
the subscriber has access privileges.
SQL> SELECT UNIQUE CHANGE_SET_NAME, COLUMN_NAME, PUB_ID FROM
2 ALL_PUBLISHED_COLUMNS WHERE SOURCE_SCHEMA_NAME ='SH' AND
3 SOURCE_TABLE_NAME = 'PRODUCTS';
CHANGE_SET_NAME COLUMN_NAME
PUB_ID
---------------- ------------------ -----------CHICAGO_DAILY
PROD_ID
41494
CHICAGO_DAILY
PROD_LIST_PRICE
41494
CHICAGO_DAILY
PROD_NAME
41494
4-39
Copyright © 2005, Oracle. All rights reserved.
Subscribing to Change Data
3. Create a subscription.
BEGIN
DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
change_set_name => 'CHICAGO_DAILY',
description => 'Change data for PRODUCTS',
subscription_name => 'SALES_SUB');
END;
4. Subscribe to a source table and columns.
BEGIN
DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
subscription_name => 'SALES_SUB',
source_schema => 'SH',
source_table => 'PRODUCTS',
column_list => 'PROD_ID, PROD_NAME, PROD_LIST_PRICE',
subscriber_view => 'SALES_VIEW');
END;
4-40
Copyright © 2005, Oracle. All rights reserved.
Subscribing to Change Data
5. Activate the subscription.
BEGIN
DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(
subscription_name => 'SALES_SUB');
END;
6. Get the next set of change data.
BEGIN
DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(
subscription_name => 'SALES_SUB');
END;
4-41
Copyright © 2005, Oracle. All rights reserved.
Subscribing to Change Data
7. Query the subscriber views.
SELECT PROD_ID, PROD_NAME, PROD_LIST_PRICE FROM SALES_VIEW;
PROD_ID PROD_NAME
PROD_LIST_PRICE
------- ---------------------------------------------30
And 2 Crosscourt Tee Kids
14.99
30
And 2 Crosscourt Tee Kids
17.66
10
Gurfield& Murks Pleated Trousers
17.99
10
Gurfield& Murks Pleated Trousers
21.99
8. Indicate that the change data is no longer needed.
BEGIN
DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(
subscription_name => 'SALES_SUB');
END;
9. End the subscription.
4-42
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed HotLog
Source Database Initialization Parameters
For all Oracle Database 10g releases:
Parameter
Value
COMPATIBLE
GLOBAL_NAMES
10.2.0 or 10.0.0
TRUE
JOB_QUEUE_PROCESSES
Maximum number of DBMS_JOB jobs that can
run simultaneously plus 2
OPEN_LINKS
Should be equal to the number of Distributed
HotLog change sources planned
The current value + (3 times the number of
change sources planned)
The current value + (4 times the number of
change sources planned)
The current value + (the number of change
sources planned)
3600
PARALLEL_MAX_SERVERS
PROCESSES
SESSIONS
UNDO_RETENTION
4-43
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed HotLog
Source Database Initialization Parameters
For Oracle 9.2 databases:
Parameter
Value
COMPATIBLE
GLOBAL_NAMES
9.2.0
TRUE
JOB_QUEUE_PROCESSES
Maximum number of DBMS_JOB jobs that can
run simultaneously plus 2
LOG_PARALLELISM
1
LOGMNR_MAX_PERSISTENT_
SESSIONS
OPEN_LINKS
The number of change sources planned
PARALLEL_MAX_SERVERS
PROCESSES
4-44
The number of Distributed HotLog change
sources planned
The current value + (3 times the number of
change sources planned)
The current value + (the number of change
sources planned)
Copyright © 2005, Oracle. All rights reserved.
Asynchronous Distributed HotLog
Staging Database Initialization Parameters
For Oracle Database 10g Release 2:
Parameter
Value
COMPATIBLE
GLOBAL_NAMES
10.2.0
TRUE
50000000
JAVA_POOL_SIZE
OPEN_LINKS
PARALLEL_MAX_SERVERS
PROCESSES
SESSIONS
STREAMS_POOL_SIZE
4-45
Equal to the number of Distributed HotLog
change sources planned, but no less than 4
The current value + (2 times the number of
change sources planned)
The current value + (3 times the number of
change sources planned)
The current value + (the number of change
sources planned)
Set to the current value + ((the number of
change sources planned) * (11MB))
Copyright © 2005, Oracle. All rights reserved.
Data Dictionary Views Supporting CDC
•
•
•
•
•
•
•
•
•
4-46
CHANGE_SOURCES lists existing change sources.
CHANGE_SETS lists existing change sets.
CHANGE_PROPAGATIONS describes the streams
propagation associated with a given distributed
HotLog change source on the source database.
CHANGE_TABLES lists existing change tables.
DBA_SOURCE_TABLES lists published source tables.
DBA_PUBLISHED_COLUMNS lists published source
table columns.
DBA_SUBSCRIPTIONS lists all registered
subscriptions.
DBA_SUBSCRIBED_TABLES lists published tables to
which subscribers have subscribed.
DBA_SUBSCRIBED_COLUMNS lists the columns of
tables to which subscribers have subscribed.
Copyright © 2005, Oracle. All rights reserved.
Transportation in a Data Warehouse
Three basic choices in transportation:
• Transportation using flat files
• Transportation through distributed operations
• Transportation using transportable tablespaces
4-47
Copyright © 2005, Oracle. All rights reserved.
Transportable Tablespaces
•
•
•
•
4-48
This is the fastest method for moving large
volumes of data.
Source and target databases can have different
block sizes.
The method is especially useful for transporting
data from OLTP to data warehouse.
Before Oracle Database 10g, source and target
databases needed to use the same operating
system.
Copyright © 2005, Oracle. All rights reserved.
Transportable Tablespaces: Example
1. Place the data into its own tablespace.
CREATE TABLE temp_jan_sales NOLOGGING
TABLESPACE ts_temp_sales AS SELECT * FROM sales
WHERE time_id BETWEEN '31-DEC-1999' AND '01-FEB-2000';
2. Export the metadata.
EXPDP DIRECTORY=DW_DUMP_DIR DUMPFILE=jan.dmp
TRANSPORT_TABLESPACES=ts_temp_sales
3. Copy the data and export file to the target system.
4. Import the metadata.
IMPDP DIRECTORY=DM_DUMP_DIR DUMPFILE=jan.dmp
TRANSPORT_DATAFILES='/db/tempjan.f'
5. Insert the new data into the fact table or employ
the partition exchange feature.
4-50
Copyright © 2005, Oracle. All rights reserved.
Summary
In this lesson, you should have learned how to:
• Describe the core ETL framework inside the
database and its integration advantage
• Explain data warehousing extraction methods
• Identify transportation methods:
– Flat file
– Distributes operations
– Transportable tablespaces
•
4-52
Describe transformation flow
Copyright © 2005, Oracle. All rights reserved.
Practice 4: Overview
This practice covers the following topics:
• Loading data from a flat file by using SQL*Loader
• Configuring synchronous Change Data Capture
• Loading data from a transportable tablespace by
using Data Pump
4-53
Copyright © 2005, Oracle. All rights reserved.