Top ten DW features - Northern California Oracle Users

Download Report

Transcript Top ten DW features - Northern California Oracle Users

Oracle 10g for Data
Warehousing
Hermann Baer, Oracle
Product Management Data Warehousing
Server Technologies
NoCOUG Winter Conference, Feb 8th 2005
Agenda
 Oracle10g for data warehousing - short trip
back in the history
–
Continuous innovation over decades
 Adoption trends and drivers
–
What do we see in the market
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 – sneak preview
The way to Oracle10g …
 Data Warehousing development started
decades ago with Oracle 7.0
–
–
Primary focus on performance and scalability
Extended with Manageability and the BI platform
vision in the Oracle8i time frame
 Data Warehousing Imperatives
–
–
–
–
–
Efficient Extract, Transform, Load (ETL)
Managing Large Data Volumes
Fast Query Response
Supporting Large User Population
Managing Simply
Oracle10g for Data Warehousing
Oracle 7.3
Continuous Innovation
Oracle 8.0
Oracle8i









Partitioned Tables and Indexes
Partition Pruning

Hash and Composite Partitioning
Parallel Index
Scans

Resource Manager
Parallel Insert,
Update,

ProgressDelete
Monitor
Parallel Bitmap
Star Query

Adaptive
Parallel Query
Parallel ANALYZE

Server-based
Functions
 Analytic
List and
Range-List Partitioning

Materialized
Parallel Constraint
Enabling
Views
Table Compression

Transportable
Tablespaces
Server Managed Backup/Recovery
 Bitmap Join Index

Direct Loader
API
Point-in-Time
Recovery
 Self-Tuning Runtime Memory

Functional Indexes
 Self-tuning SQL Optimization

Partition-wise
Joins
 New Analytic Functions
 SQL Access Advisor

Security Enhancements
 Grouping Sets
Oracle9i
Oracle10g





External Tables Automatic Storage Manager
 Self-tuning Memory
MERGE
 Change Data Capture
Multi-Table Insert
SQL Models
Proactive Query Governing
System Managed
 Undo
SQL Frequent Itemsets
 SQL Partition Outer Joins
 Statistical functions
 and much more ...
Agenda
 Oracle10g for data warehousing - short trip
back in the history
–
Continuous innovation over decades
 Adoption trends and drivers
–
What do we see in the market
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 – sneak preview
Main Trends and Drivers
 Oracle VLDWs are growing
–
Less systems, more data
 DW systems are consolidated
–
Global view of the business
 Importance of Data Warehousing increases
dramatically
–
Growing operational/tactical importance
 Cost Effectiveness becomes more important
–
Better decisions, lower cost
Oracle VLDWs are growing
Winter 2003 VLDB Survey
Largest Database Size, Decision Support
1998 Survey
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Sears
HCIA
Wal-Mart
Tele Danmark
CitiCorp
MCI
NDC Health
Sprint
Ford
Acxiom
Teradata
Informix
Teradata
DB2
DB2
Informix
Oracle
Teradata
Oracle
Oracle
4.63
4.50
4.42
2.84
2.47
1.88
1.85
1.30
1.20
1.13
2001 Survey
2003 Survey
SBC
First Union
Dialog
Telecom Italia
FedEx
Office Depot
AT & T
SK C&C
NetZero
Telecom Italia
France Telecom
AT&T
SBC
Anonymous
Amazon.com
Kmart
Claria
HIRA
FedEx
Vodafone
Teradata 10.50
Informix
4.50
Proprietary 4.25
DB2
3.71
Teradata
3.70
Teradata
3.08
Teradata
2.83
Oracle
2.54
Oracle
2.47
Informix
2.32
Oracle
Daytona
Teradata
DB2
Oracle
Teradata
Oracle
Sybase IQ
Teradata
Teradata
29.23
26.27
24.81
16.19
13.00
12.59
12.10
11.94
9.98
9.91
Oracle VLDWs are growing
 Powerful RDBMS functionality becomes more
important and visible, e.g.
–
–
–
–
Partitioning
Table compression
Automatic Storage Management (ASM)
Parallel processing
Increasing Importance of DW
 Latency between operational and analytical
data must be minimized
–
Intelligence when you need it
 Need for new and enhanced analytical
capabilities
–
More value from your data
 “Classical” strengths of an RDBMS become
more important
–
E.g. Security, B/R, Availability, Concurrency
Cost Effectiveness
 Safe money whenever possible
–
–
–
Commodity servers
Commodity disks
Software manageability
 Example Amazon
–
–
–
16 low cost Intel boxes replaced one SuperDome
Low cost storage arrays replaced high end storage
arrays
2 DBAs
Cost Effectiveness
Pay and Scale Incrementally
W
300%
o
r
k
l
200%
100%
o
d
3
6
9
12 15
Months
18
21
24
Cost Effectiveness
Pay and Scale Incrementally ... with RAC
W
300%
o
r
k
l
200%
100%
o
d
3
6
9
12 15
Months
18
21
24
Cost Effectiveness
 Commodity components make specific
database functionality more important
–
–
–
–
RAC for Scalability and Availability
Resource Manager
Automatic Storage Management (ASM)
RMAN / Oracle Backup (Oracle10gR2)
Oracle Database 10g
DW Major Feature Summary
Reduced Total Cost of Ownership
• Manageability
–
–
–
–
• ULDB support
Workload Repository
Automatic SQL Tuning
Self-Tuning Global Memory
ASM
–
–
–
–
Database size extended to
Exabytes (BIGFILES)
Unlimited size LOBs
Hash Partitioned Global Indexes
ASM removes file system limits
More Value From Your Data
–
–
–
–
Many New OLAP Features
New Data Mining algorithms
Stand-alone Data Mining Tool
Advanced Statistics
–
–
–
SQL Model Clause
Frequent Item Sets
Partition Outer Join
Intelligence When You Need It
–Cross
Platform Transportable Tablespaces
–Data Pump
–
–
Async Change Data Capture
Enhancements to MERGE
Agenda
 Oracle10g for data warehousing - short trip
back in the history
–
Continuous innovation over decades
 Adoption trends and drivers
–
What do we see in the market
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 – sneak preview
Build the foundation for Success
 Even after decades of innovation, a computer ‘still’
consists of three main components
–
–
–
CPU provides the computing power
Memory stores the transient data for computational operations
Disks (I/O) store the persistent information
 Getting the best performance is finding the right balance
of all these components and use them optimally
–
–
–
Size your system appropriately
Design your database appropriately
Use the database appropriately
 Data Warehousing is ‘just a special kind of application’
Configuring for your Workload
 CPU requirements depend on user workload:
–
Concurrency of users, ratio of CPU-related tasks
 Memory requirement mostly user-process driven
 IO requirements depend on query-mix:
–
CPU vs. IO
 Relative CPU power for IO related tasks
–
Logically Random IOs (predominant in star schema)
 required for index driven queries, e.g. Index lookups, Index
driven joins, Index scans
–
Logically Sequential IOs (predominant in 3rd NF schema)
 required for table scans, e.g. Hash Joins
 Find the balance between CPU and IO
Configuring for Throughput
Sizing Guidelines
 Oracle can read 300+MB/sec per GHz/CPU power
–
Direct Read, multi-block IO,
 e.g, parallel full table scan ('lab environment')
 An ‘average’ DW system should plan for 75 100MB/sec per GHz/CPU
–
–
Typical mixture of IO and CPU intensive operations
Ball park number, adjust accordingly
 TPC-H plans for appr. 200MB per 3GHz Xeon
“The weakest link” defines
the throughput
Components to consider:
HBA2
HBA1
HBA2
HBA1
HBA2
HBA1
HBA2
HBA1
Configuring for Throughput
●
●
●
FC-Switch1
FC-Switch2
●
●
Disk
Array 1
Disk
Array 2
Disk
Array 3
Disk
Array 4
Disk
Array 5
Disk
Array 6
Disk
Array 7
CPU: Quantity and speed
HBA (Host Bus Adapter):
Quantity and speed
Switch speed
Controller: Quantity and speed
Disk: Quantity and speed
Disk
Array 8
Configuring for Throughput
Bit is not Byte
Throughput Performance
Component
theory (Bit/s)
maximal Byte/s
HBA
1/2Gbit/s
100/200 Mbytes/s
16 Port Switch
8 x 2Gbit/s
1600 Mbytes/s
Fibre Channel
2Gbit/s
200 Mbytes/s
Disk Controller
2Gbit/s
200 Mbytes/s
GigE NIC
1Gbit/s
80 Mbytes/s
Infiniband
10Gbit/s
890 Mbytes/s
CPU
200MB/s
Configuring for Throughput
HBA2
HBA1
HBA2
HBA1
HBA2
HBA1
HBA2
HBA1
Each machine has 2 CPUs; all four servers
drive about 2 * 200MB/s * 4 = 1600 MB/s
Each machine has 2 HBAs = 400MB/s; all 8 HBAs
can sustain 8 * 200MB/s = 1600 MB/s
Each switch needs to support 800MB/s
to guarantee a total system throughput
of 1600 MB/s
FC-Switch1
FC-Switch2
Each disk array has one 2Gbit
controller; all 8 disk arrays can
sustain 8 * 200MB/s = 1600 MB/s
Disk
Array 1
Disk
Array 2
Disk
Array 3
Disk
Array 4
Disk
Array 5
Disk
Array 6
Disk
Array 7
Disk
Array 8
Configuring the Storage
 Design for throughput, not capacity
 Keep it simple
–
Try using RAID 0+1
 Use S.A.M.E. methodology
–
–
–
Stripe And Mirror Everything
At the HW level, if available
Using ASM capabilities
 Leverage ASM whenever possible
–
–
–
Striping and Mirroring capabilities
Automatic rebalancing
Enables low cost storage
Calibrate your System
 You can easily compute the theoretical I/O performance
of your system
–
Typically measured by the minimum of [ I/O channel capacity,
I/O controller capacity, disk I/O capacity]
 Verify the I/O performance limits using OS-level
commands
–
Do this prior to using the database
 Cover basic IO operations and the average future load
pattern
–
–
Random single block IO vs. sequential multi block IO
Concurrency
Calibrate your System
Throughput dd vs. ORCL DIRECT READ
500
dd
Oracle
400
Throughput 300
[MB/s]
200
100
0
1
2
3
4
5
6
7
8
9
Copies of dd/Degree of Paralellism
●
Oracle drives about 90% of what dd can drive with a table scan
●
If you do not get the expected throughput fix the hardware
Agenda
 Oracle10g for data warehousing - short trip
back in the history
–
Continuous innovation over decades
 Adoption trends and drivers
–
What do we see in the market
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 – sneak preview
Schema – which way to go?
 Don’t get lost in theory and academia
–
–
Philosophical discussions won’t help (“Star fights 3NF”)
None of the two extremes will work (RedBrick?, Teradata?)
 Design according to your business needs
–
Reality shows that most of the customers are doing a mix and
match
 3NF more in an ODS layer
 ‘Denormalized’ 3NF in DW/Stage for general
purposes
 Dimensional model for subject areas, e.g. sales,
marketing (remember shared dimensions!)
Successful database has to support everything
* OLAP will not be covered in this presentation
Schema – which way to go?
 The chosen schema approach determines used
Oracle functionality
 The chosen schema approach determines IO pattern
–
Logically Random IOs (predominant in star schema)
 required for index driven queries, e.g. Index lookups, Index
driven joins, Index scans
–
Logically Sequential IOs (predominant in 3rd NF schema)
 required for table scans, e.g. Hash Join
 Oracle has both functionality to
–
–
Push the IO to the limit
Optimize the IO requirements
Schema – which way to go?
Star Schema
Leading performance for dimensional schemas
 Innovative usage of bitmap indexes and
bitmap join indexes
–
–
Index access instead of large table access
Bitmap indexes 3 to 20 times smaller than
btree indexes
 Support for complex star schemas
–
–
–
Multiple fact tables
Snowflake schemas
Large number of dimensions
 Fully integrated


Parallel execution
Partition Pruning
I/O – Minimize Requests
Partition Pruning
99-Jan
99-Feb
99-Mar
 Only the relevant partitions will be
accessed
 Optimizer knows or finds the relevant
partitions
–
99-Apr
–
99-May
Static pruning with known values in
advance
Dynamic pruning uses internal recursive
SQL to find the relevant partitions
 Minimizes I/O operations
99-Jun
Sales
–
Also provides order of magnitude
performance gains
I/O – Minimize Requests
Materialized Views
Monthly Sales
by Region
Query
What were
the sales
in the
West and
South
regions
for the
past three
Quarters?
A simple rollup Month -> Quarter
provides unprecedented gain on
performance and minimal I/O
Detail
Query
Rewrite
Schema – which way to go?
3NF example
Jan
...
Apr
Mar
Feb
Jan
Apr
Mar
CUSTOMER_ORDER_PRODUCTS
...
Jan
...
Feb
Jan
CUSTOMER_ORDERS
...
Example of an optimized parallel
partition-wise join of a composite
partitioned table
Jan, Hash 1
Jan, Hash 2
Jan, Hash 3
Jan, Hash 4
Schema – which way to go?
Schema Agnostic - Parallel Execution
 Use parallelism to enable single process
scalability
 Unrestricted parallelism
–
–
No data layout requirement or restriction (as in
shared nothing systems)
All operations can be parallelized
Data on Disk
Query Servers
Coordinator
scan
sort A-K
scan
sort L-S
scan
sort T-Z
Scanners
Sorters (Aggregators)
Dispatch work
Schema – which way to go?
Schema Agnostic - Parallel Execution
DOP 2
DOP 2
 I/O bandwidth requirement
increases with single process
parallelism and multi-user
concurrency
–
Total 200 MB/sec
Plan for your system’s
expected I/O throughput
based on average concurrent
users and parallelism
Schema – which way to go?
Schema Agnostic - Parallel Execution
DOP 4
DOP 4
DOP 4
DOP 4
Total 400 MB/sec
 I/O bandwidth requirement
increases with single process
parallelism and multi-user
concurrency
–
Plan for your system’s
expected I/O throughput
based on average concurrent
users and parallelism
Schema – which way to go?
Schema Agnostic - Parallel Execution
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
 I/O bandwidth requirement
increases with single process
parallelism and multi-user
concurrency
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
DOP 8
Total
8001600
MB/sec
Total
MB/sec
–
Plan for your system’s
expected I/O throughput
based on average concurrent
users and parallelism
Schema – which way to go?
Oracle‘s functionality
 Star schema
–
–
–
–
Range-partition fact tables by time
Bitmap indexes on dimension-key columns of fact table
‘Star transformation’ for end-user queries
Materialized views for pre-aggregated cubes
 3NF or normalized schema
–
–
Composite range-hash partitioning on large tables
‘Partition-wise’ joins and parallel execution are key
performance enabler for joining large tables
 Hybrid environments
–
Use both dogmas concurrently in the same system without
affecting each other
Choose what fits your needs best!
Oracle provides optimizations for any kind of setup
Init.ora – less is more
Lessons learned from History
 Do not de-tune Oracle
–
–
Very often, our performance engineers are
getting improvements just by removing
parameters
Results can be poor optimizer plans, wasted
memory, and serialization points
 Trust Oracle
–
–
Don’t try and second guess the software
With the exception of buffer and subject area
related parameters, the system defaults are
usually optimum
Init.ora – less is more
Basic Rules
 Ensure that data warehouse relevant
parameters are set
–
Not all parameters are enabled by default in older
database releases prior to Oracle10g
 Size and set buffer and memory related
parameters
–
Two parameters are enough
 Do not touch other parameters unless
necessary
Init.ora – less is more
Data Warehouse relevant parameters

COMPATIBLE
–

OPTIMIZER_FEATURES_ENABLED
–

Set to TRUE, enabled by default with Oracle10g
QUERY_REWRITE_INTEGRITY
–

Adjust to system capabilities (default to 5 prior to Oracle10g)
QUERY_REWRITE_ENABLED
–

Set to TRUE (Only relevant for older Linux versions)
PARALLEL_MAX_SERVERS
–

Maximize multiblock I/O (use multiple of OS I/O size)
DISK_ASYNCH_IO
–

Database release version to enable new functionality
DB_MULTIBLOCK_READ_COUNT
–

Database release version to enable new functionality
ENFORCED by default, can be potentially lowered
STAR_TRANSFORMATION_ENABLED
–
Set to TRUE
Build the foundation for Success
Summary
 Data Warehousing is ‘just a special kind of
application’
 Ensure a well-tuned I/O subsystem
–
–
Size for I/O throughput, not for disk capacity
Use appropriate hardware / storage
 Find a schema balance
–
Design according your needs using the
appropriate model, not the other way around
 Init.ora settings: less is more
Agenda
 Oracle10g for data warehousing - short trip
back in the history
–
Continuous innovation over decades
 Adoption trends and drivers
–
What do we see in the market
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 – sneak preview
ETL Enhancements
 DML error logging
•
•
•
•
•
Column values that are too large
Constraint violations (NOT NULL, unique,
referential, check constraints)
Errors raised during trigger execution
Type conversion errors
Partition mapping errors
• Distributed Change Data Capture
•
Enables 9.2 as source for asynchronous CDC
DML Error Logging (example)
INSERT
INTO sales
SELECT product_id, customer_id
, TRUNC(sales_date), 3, promotion_id
, quantity, amount
FROM sales_activity_direct
LOG ERRORS INTO
sales_activity_errors('load_20050801')
REJECT LIMIT UNLIMITED ;
Performance Enhancements
 Sort
–
–
–
ORDER BY statements
(B-tree) index creation
Up to 5 times performance improvement
 Aggregation
–
–
–
–
GROUP BY statements
Materialized views using aggregations
Implicit use of aggregations, e.g. statistics gathering
Two to three times performance improvement
 Query rewrite using multiple materialized views
Partitioning Enhancements
 Scalability
–
–
–
–
Maximum number of partitions 64K -> 1M
Resource optimization for DROP TABLE of a
partitioned table
Support for partitioning on index-organized tables
Support for hash-partitioned global indexes
 Performance
–
Support for ‘Multi dimensional’ partition pruning
Other Enhancements
 Manageability
–
–
SQL Access Advisor improvements
Materialized view refresh improvements
 Analytics
–
SQL model clause enhancements
Summary
 Oracle10g for data warehousing - short trip back in
the history
–
The most powerful and successful DW platform
 Adoption trends and drivers
–
–
Be visionary, though conservative
Guarantee success and protect investments
 Design and build a Data Warehouse
–
–
Ensure a well-balanced system
Optimize Oracle
 Oracle Database 10gR2 Beta – Interested?
QUESTIONS
ANSWERS