Transcript Extract
Big Data for Dummies using DataStage
By Peter Bjelvert
InfoSphere Architect
Middlecon AB
ETL – Relational DB
Extract
Transform in
DataStage
Load
Your powerful DataStage server will handle all complex
transformation and the database is only used for reading and
writing.
ELT – Relationel DB
Extract
Load with Transform
If you have powerful Database servers you can
push down much of the work to the database,
then DataStage will mostly control the flow
Balanced Optimization
Use DataStage Balanced
Optimization to select how to
push the load:
-To Source
-To Target
-To Both
The DataStage job is re-written
into SQL code.
Bal. Opt. creates a new copy of the
jobb that push the load into Source
and Target
Bal. Opt. create a second copy of the
jobb that push everything into target.
Creates one big SQL statement.
DB
ETL
Balanced Optimization
feature of Datastage
ELT – PushDown
DataStage is doing the main work
Bal. Opt. creates a new copy of
the job with SQL code:
DB server is doing the main job
SELECT * FROM (
SELECT distinct BRANCH_CITY,
BRANCH_STATE, BRANCH_ZIP FROM
JK_BANK2.BANK_BRANCH) AS A, ( Select
distinct BRANCH_CITY,
Hadoop Distributed File System - HDFS
Application
Layer
Workload mgmt
Layer
Data Layer
One file
3 copies
MapReduce example
Hadoop application stack
Application Layer
JACL, AQL….
Workload mgmt Layer
MapReduce
Data Layer
HDFS
IBM’s Hadoop implementation
ETL – HDFS
Extract
Transform in
DataStage
Load
Node
Node
Node
Node
Node
HDFS
HDFS
Node
Node
Node
Node
Node
Node
Node
Your powerful DataStage server can read and
write to the distributed file system
DataStage HDFS example
Read and write to a Hadoop system
using the new BDFS stage
ELT – Hadoop system
Extract
Load with Transform
Node
Node
Node
Node
Node
Hadoop
Hadoop
Node
Node
Node
Node
Node
Node
Node
Use DataStage Balanced Optimization to select how to push
the load:
-To Source
-To Target
-To Both
The DataStage job is re-written into JACL code.
DataStage JACL example
Bal. Opt. create a second copy of the jobb that push
everything into target.
Creates one big JACL statement.
DB
ETL
Balanced Optimization
feature of Datastage
ELT – PushDown
DataStage is doing the main work
Bal. Opt. creates a new copy of
the job with SQL code:
DB server is doing the main job
Bal. Opt. creates a new copy of
the job with JACL code:
Hadoop application server execute the
JACL code on
all nodes.
SELECT * FROM (
SELECT distinct BRANCH_CITY,
BRANCH_STATE, BRANCH_ZIP FROM
JK_BANK2.BANK_BRANCH) AS A, ( Select
distinct BRANCH_CITY,
HDFS
DataStage is doing the main work
SetOptions({conf:{"mapred.job.name":"Da
Node
Node
Node
BDFS
Node
taStage BalOp job BIGDATA:dstage1
ff_read_write_to_hadoop_jaql_balopt_join
CustomerTarget
16_#DSJobInvocationId#"}});
setOptions({conf:{"mapred.reduce.tasks":1
}}));
Node
Node
Node
Hadoop
Node
Node
Node
Node
Node
A good scenario for DS customer
Extract, Transform and filter in DataStage
Load good data
into HDFS
Node
Node
Node
BDFS
Node
Node
Node
DataStage can read from many different sources.
Convert common data (like time/date) to failitate
following queries. Send unwanted data to garbage
Analytic
functions
AQL …
o LIVE DEMO
o Borrowed images from google
Slide 6- https://yoyoclouds.wordpress.com/tag/hadoop/
Slide 7- http://kickstarthadoop.blogspot.se/2011/04/word-count-hadoopmap-reduce-example.html
Slide 8 - http://www.rosebt.com/1/post/2012/07/hadoop-internal-softwarearchitecture.html
Slide 9- http://www.ndm.net/datawarehouse/IBM/ibm-infospherebiginsights
Handling Big Data without angst