Informix Warehouse Accelerator (IWA)

Download Report

Transcript Informix Warehouse Accelerator (IWA)

Informix Warehouse and
Informix Warehouse
Accelerator Overview
Click to add text
Samar
T Desai
For questions about this presentation contact [email protected]
June 28, 2012
© 2012 IBM Corporation
Agenda
 Data Warehousing on Informix
– History & Roadmap
 Informix Data Warehouse
– Informix Warehouse Tooling – ETL
– IDS 11.70 Server Features
 Informix Warehouse Accelerator
 Q&A
2
© 2012 IBM Corporation
Data Warehousing
on Informix
Click to add text
June 28, 2012
© 2012 IBM Corporation
Informix Warehouse History
 Informix has 3 Database Products:
– XPS for MPP Data Warehousing.
– Red Brick for Star Schema data marts/data warehousing.
– Informix Dynamic Server (IDS) for OLTP & (now) Data
Warehousing.
4
© 2012 IBM Corporation
Existing IDS Warehousing Features
 Performance & Scalability
– Inherent SMP Multi-threading
– Parallel Data Query (PDQ)
– Light Scan for fast table scans
– Online Index build
– Efficient Hash Joins
– Auto Fragment Elimination
– Memory Grant Manager (MGM)
– High Performance Loader
– Optimistic Concurrency
 Easy of Management
– Time cyclic data management using Range Partitioning
– Sophisticated Query Optimizer for OLTP and Warehousing
5
© 2012 IBM Corporation
Informix Warehousing Moving Forward
 Goal is to provide a comprehensive warehousing platform
that is highly competitive in the marketplace
– Incorporating the best features of XPS and Red Brick into IDS for
OLTP/Warehousing and Mixed-Workload.
– Using the latest Informix technology in:
 Continuous Availability and Flexible Grid.
 Data Warehouse Accelerator using latest industry technology.
– Integration of IBM’s BI software stack.
6
© 2012 IBM Corporation
Informix Warehouse
11.70 Features
Click to add text
June 28, 2012
© 2012 IBM Corporation
Typical Data Warehouse Architecture
8
© 2012 IBM Corporation
Informix Warehouse Tooling - SQW
DESIGN
Data Flows + Control
Flows
Deployment
preparation
IDS
Databases
Design Studio
Design Center
(Eclipse)
SQW
Execution
DB
SQL Server
Oracle
IDS
Warehouse
DB2
Data Source
DB
DEPLOY
User scripts
Deployment
package
Code Units
Build Profile
Admin Console
IDS
9
SQW
Deploy
Control DB
HTTP service (WAS )
SQW Runtime
RUNTIME
Applications
Other Servers
(DataStage)
© 2012 IBM Corporation
11.70 Warehousing Features
Deep Compression
HPL
LOB
apps
DB utilities
ON utilities
DataStage
External Tables
Databases
Other
transactional
data sources
Interval and List
Fragmentation
Online
attach/detach
Table defragmenter
I/O & data
loading
10
Merge
Online attach/detachFragment level stats
Storage
provisioning
Data Loading
Light Scans
Data & Storage
Management
DBMS & Storage
mgmt
Hierarchical Queries
Multi-Index Scan
Skip Scan
Bitmap Technology
Star and Snowflake
join optimization
Implicit PDQ
Query
Tools
BI
Apps
BPS
Apps
Access performance
Query
Processing
Query
processing
Analytics
Source: Forrester
© 2012 IBM Corporation
11.70 Feature: Warehouse Time-Cyclic Data Management
Dec 2010
field
field
field
field
field
field
field
 Time-cyclic data management (roll-on, roll-off)
 Attach and detach online without requiring exclusive
lock and access to the table
 Automatically kicks off background process to recollect
statistics.
 Interval and List Fragmentation
 Auto Fragment level statistics
Jan
Feb
Mar
Apr
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
field
enables storing data over time
11
May 2011
field
field
field
field
field
field
field
© 2012 IBM Corporation
Interval Fragmentation
 Fragments data based on an interval value:
– E.g. fragment for every month or every million customer records.
 Tables have an initial set of fragments defined by a range expression.
 When a row is inserted that does not fit in the initial range fragments,
IDS will automatically create fragment to hold the row (no DBA
intervention).
 No Exclusive-lock is required for fragment addition.
 All the benefits of fragment by expression.
12
© 2012 IBM Corporation
Informix 11.70 Feature:
Multi-Index Scan




Make use of all available indices.
Use set operations to apply to all rowids.
Use bitmap operations like union and intersection.
Bitmap can also be used for Skip Scan operations.
13
© 2012 IBM Corporation
Multi-Index Scan – An Example
 Handling common Data Warehouse queries more efficiently
 Large dimension tables, e.g. customer table
 Multiple low-selectivity attributes like gender, age group, zip
code, etc.
 Example
SELECT count (customer_id)
FROM customer_table
WHERE gender = male
AND income_category = HIGH
AND education_level = MASTERS
AND zip_code = 95032;
14
© 2012 IBM Corporation
Multi-Index Scan Example
 Method #1:
– Evaluates the most selective constraint
– Generates a list of rows that qualify, and
– Evaluate the remaining constraints for each of the rows
generated above which will produce the answer to the query
Method retrieves rows based on the most selective constraint
using only the index for that column, followed by a sequential
evaluation of each of other constraints in a post-retrieval manner.
15
© 2012 IBM Corporation
Multi-Index Scan Example
 Method #2
– Evaluate each constraint by using a different B-tree index on
each attribute – results in a list of rows that qualify for each
constraints.
– Merge the lists to form one master list that satisfies all the
constraints
– Retrieve the qualifying rows to produce the answers
Sorted RIDs
Records
AND
Education_level
= “masters”
Gender=‘m’
Zipcode=‘95032’
Sequential Skip Scan
Income_Category=“high”
16
© 2012 IBM Corporation
Informix 11.70 Feature: Push Down Hash Join
First, a standard Hash Join for typical warehousing queries
involving a “large” Fact table with multiple dimension tables
 Build Hash Table on Left Input
 Probe with Right Input
 Typically, build on smaller input
– avoids hash table overflow to disk
Hash Join
Build
Build Scan
17
Probe
Probe Scan
© 2012 IBM Corporation
Typical Star Schema: An Example
Dim (D1)
10K rows
sel : 1/10
Dim (D2)
Fact (F)
Dim (D3)
1M rows
10K rows
sel :
1/1000
sel : 1/10
10K rows
sel: 1/10
 Large Central “Fact” table
 Smaller “Dimension” tables
 Restrictions on Dimension
tables
– assume independence
 Small fraction of Fact table in
result
18
© 2012 IBM Corporation
Prior to 11.70: Standard Left Deep Tree Solution
Hash Join
Problem Join
10K
1K
Scan D3
Hash Join
100K
1K
Scan D2
Hash Join
1K
Scan D1
19
1M
Second Join
Build Too Large
Scan F
© 2012 IBM Corporation
11.70 Feature: Pushdown Hash-Join Solution
Join Keys Pushed Down to
Reduce Probe Size
Hash Join
1K
1K
Scan D3
Hash Join
1K
1K
Scan D2
Hash Join
1K
1K
Scan D1
Scan F
Multi Index Scan
of Fact Table
using Join Keys and
Single-Column Indexes
Join Keys
20
© 2012 IBM Corporation
Informix Warehouse Accelerator
(IWA)
Click to add text
June 28, 2012
© 2012 IBM Corporation
Third Generation of Database Technology
According to IDC’s Article (Carl Olofson) – Feb. 2010
1st Generation:
- Vendor proprietary databases of IMS, IDMS, Datacom
2nd Generation:
- RDBMS for Open Systems, dependent on disk layout, limitations in
scalability and disk I/O
- Database tuning by adding updating stats, creating/dropping
indexes, data partitioning, summary tables & cubes, force query
plans, resource governing
3rd Generation: IDC Predicts that within 5 years:
 Most data warehouses will be stored in a columnar fashion
 Most OLTP database will either be augmented by an in-memory
database (IMDB) or reside entirely in memory
 Most large-scale database servers will achieve horizontal scalability
through clustering
22
© 2012 IBM Corporation
Agenda
• Overview of the Informix Warehouse Accelerator (IWA)
• IWA vs. Row/Column/Hybrid Stores
• Loading IWA
• Query Matching & Execution
• Case Study Results
23
© 2012 IBM Corporation
IWA - Characteristics
• A dedicated SMP system (Linux on Intel x86_64)
• No changes to the applications
–Applications continue to attach to IDS.
–When applicable query needs to be executed, IDS exploits the accelerator
transparently to the applications
–Fencing and protection of IDS against possible accelerator failures
• Order of magnitude performance improvement
• Reducing need for tedious tuning of IDS (partitioning, indexes, etc.)
• Appliance-like form-factor
–Hands free operations
• Significantly improved price/performance and TCO as a combined effect
of:
–Accelerating intensive & complex analytics queries
–Orders of magnitude performance improvement for accelerated queries
–Reduced DBA effort for tuning accelerated queries
24
© 2012 IBM Corporation
Informix Warehouse Accelerator
3rd Generation Database Technology is Here
What is it?
The Informix Warehouse Accelerator (IWA) is a
workload optimized, appliance-like, add-on, that enables
the integration of business insights into operational
processes to drive winning strategies. It accelerates
select queries, with unprecedented response times.
How is it different?
• Performance: Unprecedented response
times to enable 'train of thought' analysis
frequently blocked by poor query
performance.
• Integration: Connects to IDS through deep
integration providing transparency to all
applications.
• Self-managed workloads: queries are
executed in the most efficient way
• Transparency: applications connected to
IDS, are entirely unaware of IWA
• Simplified administration: appliance-like
hands-free operations, eliminating many
database tuning tasks
Breakthrough Technology Enabling New Opportunities
25
© 2012 IBM Corporation
Target Market: Business Intelligence (BI)
• Characterized by:
– “Star” or “snowflake” schema:
Dimensions
Region
Fact Table
City
Produ
ct
Store
Month
Perio
d
Brand
SALES
Categor
y
Quarte
r
Complex, ad hoc queries that typically
26
―
Look for trends, exceptions to make actionable business decisions
―
Touch large subset of the database (unlike OLTP)
―
Involve aggregation functions (e.g., COUNT, SUM, AVG,…)
―
The “Sweet Spot” for the IWA!
© 2012 IBM Corporation
What IWA is Designed For:
• Selective, fast scans over large (fact) tables
• Joins with smaller Dimension tables
• OLAP-style queries over large fact tables in relational star schema with
grouping and aggregations
SELECT PRODUCT_DEPARTMENT, REGION, SUM(REVENUE)
FROM FACT_SALES F
INNER JOIN DIM_PRODUCT P ON F.FKP = P.PK
INNER JOIN DIM_REGION R ON F.FKR = R.PK
LEFT OUTER JOIN DIM_TIME T ON F.FKT = T.PK
WHERE T.YEAR = 2009
AND R.GEOID = 17
AND P.TYPEID = 3
GROUP BY PRODUCT_DEPARTMENT, REGION
27
© 2012 IBM Corporation
Agenda
• Overview of the Informix Warehouse Accelerator (IWA)
• IWA vs. Row/Column/Hybrid Stores
• Loading IWA
• Query Matching & Execution
• Case Study Results
28
© 2012 IBM Corporation
Breakthrough technologies for performance
Extreme Compression
Row & Columnar Database
Required because RAM is the limiting factor.
Row format within IDS for transactional
workloads and columnar data access via
accelerator for OLAP queries.
Multi-core and Vector
Optimized Algorithms
In Memory Database
7
Avoiding locking or synchronization
generation database technology
avoids I/O. Compression allows huge
databases to be completely memory
resident
1
6
2
5
Predicate evaluation on
compressed data
3rd
3
4
Frequency Partitioning
Enabler for the effective parallel access
of the compressed data for scanning.
Horizontal and Vertical Partition
Elimination.
Often scans w/o decompression
during evaluation
Massive Parallelism
All cores are used within used for
queries
29
© 2012 IBM Corporation
Row Oriented Data Store
Each row stored sequentially
• Optimized for record
I/O
• Fetch and decompress
entire row, every time
• Result –
• Very efficient for
transactional workloads
• Not always efficient for
analytical workloads
If only few columns are required the complete
row is still fetched and uncompressed
30
© 2012 IBM Corporation
Columnar Data Store
Data is stored sequentially by column
• Data is compressed
sequentially for
column:
•Aids sequential scan
•Slows random access
If attributes are not required for a specific query
execution,
they are skipped completely.
31
© 2012 IBM Corporation
Compression: Frequency Partitioning
Vol Prod Origin
Column Partitions
Histogram
on Origin
Number of
Occurrences
Trade Info (volume, product,
origin country)
China GER,
USA FRA,
…
Common
Values
Rest
Rare
values
Top 64
traded goods
– 6 bit code
Rest
Product
Origin
Histogram
on Product
•
•
32
Cel
l1
Cell
3
Cell 4
Cell
2
Cell
5
Cell 6
Table partitioned
into Cells
Field lengths vary between cells
• Higher Frequencies  Shorter Codes (Approximate Huffman)
Field lengths fixed within cells
© 2012 IBM Corporation
Data is Processed in Compressed Format
•Within a Register – Store, several
columns are grouped together.
•The sum of the width of the
compressed columns doesn‘t exceed a
register compatible width. This utilizes
the full capabilities of a 64 bit system. It
doesn‘t matter how many columns are
placed within the register – wide data
element.
•It is beneficial to place commonly used
columns within the same register –
wide data element. But this requires
dynamic knowledge about the executed
workload (runtime statistics).
•Having multiple columns within the
same register – wide data element
prevents ANDing of different results.
Predicate evaluation is done against compressed data!
The Register – Store is an optimization of the Column – Store approach where we try to make
the best use of existing hardware. Reshuffeling small data elements at runtime into a register is
time consuming and can be avoided. The Register – Store also delivers good vectorization
capabilities.
33
© 2012 IBM Corporation
Register Stores Facilitate SIMD Parallelism
• Access only the banks referenced in the query (like a column store):
– SELECT
SUM (T.G)
– FROM
T
– WHERE
T.A > 5
– GROUP BY T.D
• Pack multiple rows from the same bank into the 128-bit register
• Enables yet another layer of parallelism: SIMD (Single-Instruction, MultipleData)!
D1
G1
B1
E1
F1
C1
H1
A2
D2
G2
B2
E2
F2
C2
H2
A3
D3
G3
B3
F3
C3
H3
Operand
32 bits
Operand
32 bits
Operand
32 bits
E3
Operand
32 bits
Vector Operation
A4
Result1
D4
G4
Bank β1 (32 bits)
34
B4
E4
F4
Result2 128 bits Result3
Bank β2 (32 bits)
Cell Block
A1
C4 H4
Result4
Bank β3
(16 bits)
© 2012 IBM Corporation
Simultaneous Evaluation of Equality Predicates
• CPU operates on 128-bit units
• Lots of fields fit in 128 bits
State==‘CA’ && Quarter ==
‘Q4’
• These fields are at fixed
offsets
• Apply predicates to all
columns simultaneously!
State
…
State==01001 && Quarter==1110
Quarter
…
…
…
Row
1111
0
Mask
&
11111
0
==
01001
35
0
1110
0
Selection
result
© 2012 IBM Corporation
Agenda
• Overview of the Informix Warehouse Accelerator (IWA)
• IWA vs. Row/Column/Hybrid Stores
• Loading IWA
• Query Matching & Execution
• Case Study Results
36
© 2012 IBM Corporation
Defining, What Data to Accelerate
• A MART is a logical collection of tables which are related to each other.
•
•
•
For example, all tables of a single star schema would belong to the
same MART.
The administrator uses a rich client interface to define the tables which
belong to a MART together with the information about their
relationships.
IDS creates definitions for these MARTs in the own catalog. The related
data is read from the IDS tables and transferred to IWA.
The IWA transforms the data into a highly compressed, scan optimized
format which is kept locally (in memory) on the Accelerator
IDS + IWA
Define
37
Coordinato
r Process
Worker
Processe
s
© 2012 IBM Corporation
Distributing data from IDS (Fact tables)
Fact Table
IDS Stored Procedures
Data Chunk
Coordinato
r Process
UNLOAD
UNLOAD
UNLOAD
UNLOAD
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Chunk
Data
A copy of the IDS data is
now transferred over to
the Worker process. The
Worker process holds a
subset of the data
(compressed) in main
memory and is able to
execute queries on this
subset. The data is evenly
distributed (no value
based partitioning) across
the cpus.
38
Copy
Worker
Proces
s
Compressed
Data
Compressed
Data
Worker
Proces
s
Compressed
Data
Compressed
Data
Worker
Proces
s
Compressed
Data
Compressed
Data
© 2012 IBM Corporation
Distributing data from IDS
(Dimension tables)
IDS Stored Procedure
IDS
Dimension Table
Dimension Table
Coordinato
r Process
UNLOAD
UNLOAD
UNLOAD
UNLOAD
Dimension Table
Dimension Table
All dimension tables
are transferred to the
worker process.
39
Worker
Proces
s
Worker
Proces
s
Worker
Proces
s
© 2012 IBM Corporation
Mapping Data from IDS to IWA
Inside IDS
Inside IWA
Fact Table
Fact Table
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Data Chunk
Compressed
Data Chunk
Data Chunk
Data Chunk
Dimension Table
Dimension Table
Dimension Table
Dimension Table
Dimension Table
Dimension Table
Dimension Table
Dimension Table
40
© 2012 IBM Corporation
Agenda
• Overview of the Informix Warehouse Accelerator (IWA)
• IWA vs. Row/Column/Hybrid Stores
• Loading on IWA
• Query Matching & Execution
• Case Study Results
41
© 2012 IBM Corporation
Acceleration with Informix Warehouse
Step 1. Submit SQL
Accelerator
DB protocol: SQLI or DRDA
Applications
Network : TCP/IP,SHM
2. IDS query
Informix
matching and
redirection
technology
BI Tools
Step 5. Return
results/describe/error
Local
Execution
Database protocol: SQLI or DRDA
Network : TCP/IP, SHM
Step 4
Results:
DRDA over tcp/ip
Step 3
offload SQL.
DRDA over tcp/ip
Coordinator process
Worker
Processes
Access plan comparison “Informix only” vs. “with Accelerator”
Query Block
Remote Scan
Fetch data
from accelerator
43
© 2012 IBM Corporation
Agenda
• Overview of the Informix Warehouse Accelerator (IWA)
• IWA vs. Row/Column/Hybrid Stores
• Loading on IWA
• Query Matching & Execution
• Case Study Results
44
© 2012 IBM Corporation
Case Study #1: Major U.S. Shoe Retailer
 Top 7 time-consuming queries in Retail BI and
Warehouse:
(Against 1 Billion rows Fact
Table)
Query
IDS 11.5
IDS 11.7 IWA
1
22 mins
4 secs
2
1 min 3 secs
2 secs
3
3 mins 40 secs
2 secs
4
30 mins & up
4 secs
5
2 mins
2 secs
Our Retail users will be really happy to see such a huge improvement in the queries
processing
6 timings. 30 mins
2 secs
This IWA extension to IDS will really bring value to the Retail BI environment.
45
7
45 mins & up
2 secs
© 2012 IBM Corporation
Case Study #2: Datamart at a Government Agency
 Microstrategy report was run, which generates
– 667 SQL statements of which 537 were Select statements
 Datamart for this report has 250 Tables and 30 GB Data size
 Original report on XPS and Sun Sparc M9000 took 90 mins
 With IDS 11.7 on Linux Intel box, it took 40 mins
 With IWA, it took 67 seconds.
46
© 2012 IBM Corporation
Case Study #3: U.S. Government Agency
Query
1
47
Description
Find Top 100 Entities
Informix
1:28:22
Informix w/ IWA
Notes
0:01:28
Fact Table
Scan
6023.23%
7640.45%
41708.49%
2
Find Top 100 Members
1:22:32
0:01:05
Fact Table
Scan
3
Summarize all transactions
by State and County
1:34:37
0:00:14
Fact Table
Scan
4
Detailed Report on Specific
Programs in a Date
Range
0:00:06
0:00:06
Index Read
5
Summarize all transactions
by State, County, City,
State, Zip, Program,
Program Year,
Commodity and Fiscal
Year
0:00:41
Fact Table
Scan
1:48:58
Improvement
108.41%
15800.89%
© 2012 IBM Corporation
(Samar Desai – [email protected])
48
© 2012 IBM Corporation
49
© 2012 IBM Corporation