Wait-Time Based Oracle Performance Management
Download
Report
Transcript Wait-Time Based Oracle Performance Management
Wait-Time Based Oracle
Performance Management
Prepared for NCOUG
Presented by Matt Larson
CTO, Confio Software
Who am I?
2
Founder and CTO of database performance
software company
Former DBA consultant specializing in Oracle
performance tuning
Co-author of three Oracle books (Oracle
Development Unleashed, Oracle Unleashed
2nd Edition, Oracle8 Server Unleashed)
Co-author of two other database related
books
Agenda
3
Foundation
Case Study
Case Study
Case Study
Case Study
Q&A
One: PL/SQL Issue
Two: Full Table Scans
Three: Inefficient Indexes
Four: Locking Problems
Working the Wrong Problems
After spending an agonizing week tuning
Oracle buffers to minimize I/O operations,
management typically rewards you with:
•
•
•
•
4
A.
B.
C.
D.
An all expense paid vacation
A free lunch
A stale donut
Reward? Nobody even noticed!
Tuning Success (or lack thereof)
Your role in the rollout of a new customer
facing application results in:
•
•
•
•
5
A.
B.
C.
D.
Keys to drive the CEO’s Porsche
Keys to use the executive restroom
A mop to use in the executive restroom
Your office has been moved to the restroom
Conventional Tools Measure System
Health…
Assumption: If I make the database
healthy, users benefit
Symptoms
• DBA finds “big” problem and fixes it, users
report no impact
• Lots of data to review and things to fix, not
sure which to do first
• Unclear view of performance leads to
Finger-pointing
It’s your Code!
IT staff
6
It’s your
Database!
Developer or
vendor
…RMM Focuses on User Wait-Time
End User
7
1. Identify each bottleneck affecting the user
2. Rank bottlenecks by user impact
3. Implement proven suggestions
4. Set correct expectations on impact of fix
5. Show proof the fix helped users
RMM: Confio’s Underlying Methodology
Resource Mapping Methodology: Industry
best-practice optimizing performance tuning for
maximum business impact
Three Key Principles of RMM
1. SQL View: All statistics at SQL statement level
2. Time View: Measure Time, not number of times a
resource is utilized
3. Full View: Separately measure every resource to
isolate source of problems
8
Illustrating example: SQL View Principle
9
Example: ‘CEO’ measuring ‘employee’ output
Averaging over entire company gives no useful data
Must measure each job separately
DBA must manage database similarly
Measure and identify bottlenecks for each SQL independently
Illustrating example: Time View Principle
10
Example: ‘CEO’ counting ‘tasks’ vs. ‘time to complete’
Counting system statistics not meaningful
Must measure Time to complete
System stats (buffer size, hit ratios, I/O counts) do not identify
where database customers are waiting
Identify and optimize Wait Time for each SQL as best indicator of
performance
Illustrating example: Full View Principle
Example: ‘CEO’ measuring results with blind spot hiding key
11
processes
Without direct visibility, valuable info is lost
Must have visibility to every process step
Distinctly identify and measure each Oracle resource for each
distinct SQL
RMM-compliant Performance Tool Types
Two Primary Types of Tools
Session Specific Tools
• Tools that focus on one session at a time often by tracing the
process
• Examples: tkprof (Oracle), OraSRP Profiler (open source)
Continuous DB Wide Monitoring Tools
• Tools that focus on all sessions by sampling Oracle
• Example: Confio Ignite
12
Both tools have a place in the organization
Tracing
Tracing with wait events complies with RMM
Should be used cautiously in non-batch environments
due to session statistics skew
• 80 out of 100 sessions have no locking contention issues
• 20 out of 100 have spent 99% of time waiting for locked
rows
• If you trace one of the “80” sessions, it appears as if you
have no locking issues (and spend time trying to tune other
items that may not be important)
• If you trace one of the “20” sessions, it appears as if you
could fix the locking problems and reduce your wait time by
99%
13
Tracing (cont)
14
Very precise statistics, may be only way to get
certain statistics
Bind variable information is available
Different types of tracing available providing
detail analysis even deeper than wait events
Ideal if a known problem is going to occur in
the future (and known session)
Difficult to see trends over time
Primary audience is technical user
Continuous DB Wide Monitoring Tools
24/7 sampling provides real-time and
historical perspective
Allows DBA to go back in time and retrieve
information even if problem was not expected
Not the level of detail provided by tracing
Most of these tools have trend reports that
allow communication with others outside of
the group
• What is starting to perform poorly?
• What progress have we made while tuning?
15
Case Study One
PL/SQL Issue
Problem Observed
Critical situation: application performance
unsatisfactory
• Response time between 240 and 900 seconds
• Most times users shutdown application
• Very high network traffic (3x—4x normal),
indicating time-outs and user refreshes
• “CritSit” declared: major effort to resolve problem
17
Wait Events During Problem
library
cache lock
library
cache pin
18
Investigation
19
What does RMM tell us?
20
Which SQL:
Which Resource:
How much time:
CERN_PROFILE
Truncate
library cache pin
library cache lock
up to 16 Hours of
wait time per hour
Results
Found an invalid trigger
• Insert statement was trying to fire trigger
• Truncate was locked behind it
21
Response time improvement from 60,000
seconds (worst case) to 0 seconds
Configured alert to notify DBA when the
problem starts next time
Problem should not occur for 22 hours without
anyone knowing
Case Study Two
DB File Scattered Reads
Problem Observed
Problem: Login taking 4 minutes for each user
everyday they started their day
• High wait accumulation from 6:30 – 8:30 am
• 600 Users X 4 Minutes = 40 Hours Every Day
• 40 Hours lost productivity every day
Applied RMM approach to problem
identification
• Identify Wait Time, offending SQL, offending
Resource
23
Wait Events During Problem
24
Investigation
25
What does RMM tell us?
26
Which SQL:
Which Resource:
How much time:
LoginLookup
UpdateInventory
Scattered Read
Buffer Busy Waits
40+ Hour
Every Day
Hypotheses: Oracle Interpretations
Two Alternative paths for optimization:
I.
Eliminate Full Table Scan
•
II.
Improve response time
•
There isn’t a need to read the whole table, so we need to
find the right shortcut
We need to read most or all of the table anyway, so let’s
just figure out how to do it faster
Key Questions:
1. Is full table scan necessary?
2. What causes a full table scan for this SQL Statement?
27
I. Unnecessary Full Table Scan?
Solutions:
1. Add / Modify index(es) on the table
2. Update table and/or index statistics if proper
index not being used
3. Add hint to use existing index
4. Optimize the application
28
Full Table Scan is Needed
Two alternative paths for optimization:
I. Eliminate Full Table Scan
• There isn’t a need to read the whole table, so we
need to find the right shortcut
II. Improve response time
• We need to read most or all of the table anyway,
so let’s just figure out how to do it faster
29
II. Improve Response Time for Db
File Scattered Reads
Solutions:
1. Use Parallel Reads
2. Set Database Parameters
3. Improve I/O Speed
4. Optimize the application
5. Larger Database Caches (64-bit)
30
1. Use Parallel Reads = Faster FTS
Parallel Reads
• Can be set at the table level (use with caution)
Alter table customer parallel degree 4;
• Normally used by hinting in the SQL Statement
select /*+ FULL(customer) PARALLEL(customer, 4) */ customer_name
from customer;
A delicate tradeoff
• sacrifice the performance of others for the running query.
Not necessarily efficient, just faster
• Parallel Reads may actually do twice the work of a sequential
query but have four workers, thus finishing in half the time
while using 8x resource
31
2. Set database parameters
32
DB_FILE_MULTIBLOCK_READ_COUNT
• specifies the maximum number of blocks read in one I/O operation
during a sequential scan
• Impacts the optimizer
• Reduces number of I/Os required
• For OLTP, typically between 4 to 16
• Optimizer will more likely to FTS if set too high
Ensure that the database read requests are
synced up with the O/S.
This gets tricky if different block sizes are
used in different tablespaces
3. Improve I/O speed
Get your SA involved
Investigate I/O sub-system
• Iostat, vmstat, sar, … for potential problems
• Monitor during high activity
Investigate contention at the disk/controller
level.
• Learn which disks share common resources
• Use more disks to spread I/O and reduce hot spots
33
Investigate caching on disk sub-system and
current memory usage
4. Optimizing the Application
Review application – do you have access to
code for changes?
Understand the code around the problem SQL
Techniques to Optimize a statement:
• Reduce the number of calls for a SQL
– Caching the data in the application
– Creating a summary table (perhaps via a materialized view)
– Eliminating the need for the data
• Retrieve Less Data with each statement
– Add fields to the WHERE clause
• Combine SQLs for fewer calls
– Combine several SQLs with different bind variables into one large statement that
retrieves all the data in one shot
34
5. Larger Database Caches (64-bit)
Larger cache means fewer disk reads
May need large increase to have significant
impact
Performance
Gain
% of database in
memory
35
Results
Added indexes to underlying tables
Added Materialized View
Full Table
Scan Fixed
36
Case Study Three
DB File Sequential Reads
Problem Observed
Data Warehouse loads were taking too long
Noticed high wait times on db file sequential
read wait event
DBAs were confused – why are data loads
“reading” data
Applied RMM approach to problem
identification
• Identify Wait Time, offending SQL, offending
Resource
38
Investigation
SQL
Sequential read time
Sequential read time
by object for SQL
39
What does RMM tell us?
40
Which SQL:
Which Resource:
How much time:
3 Insert Statements
DB File Sequential
Read
5 hour+
90% of wait time
Investigating db file sequential reads
Often considered a “good” read
DB file sequential reads normally occur during
index lookups
Often a single-block read although it may
retrieve more than one block.
Sequential Read may also be seen for reads
from:
• datafile headers
• rebuilding the control file
• dumping datafile headers
41
Hypotheses: Oracle Interpretations of
Sequential Reads
Causes of excessive wait times:
I. Reading too many index leaf blocks
II. Not finding block in buffer cache forces disk
read
III. Slow disk reads
IV. Contention for certain blocks
V. High Read time on INSERT statements
42
I. Reading too many index and table
blocks (cont)
1.
2.
3.
4.
43
Rebuild Fragmented Indexes
•
alter index rebuild [online];
Compress Indexes
•
•
alter index rebuild compress;
Uses more CPU
Multi-column indexes
•
•
Avoid the table lookup
Will create a larger index
Pre-sort Table data
II. Not finding block in buffer cache forces
disk read
44
Db File sequential reads occur because the
block is not in the buffer cache.
How do we make sure more blocks are
already in the cache?
Solutions
1. Increase the size of the buffer cache(s)
2. Put the object in a cache where it is less likely to
get flushed out
III. Slow disk reads
45
With databases, it often comes down to this –
the disk just needs to be faster
Put certain objects on the fastest disk
O/S file caching using special software that
makes normal files perform like raw files
Increase Storage System Caching – such as
an EMC cache
Results
46
Inserts were updating indexes that had low
cardinality leading columns
Reordered columns in the index and got a
50% performance improvement
Log file sync wait event was then the largest
wait event
Data was being committed too often
Tuning is an iterative process
Case Study Four
Enqueue
Problem Observed
Problem: High Wait on CPPFPROD
• Accumulated wait 9.5 hours (34,000 sec) during
3.00-4.00am hour
• End users were complaining loudly
Applied RMM approach to problem
identification:
• Identify Wait Time, offending SQL, offending
Resource
48
Investigation: Drill down to Top SQL &
Identify likely source of Problem
49
enqueue
Causes
TX enqueue
Locks held for the life of a transaction until a COMMIT or
ROLLBACK.
TM enqueue
Locks being held when foreign key constraints are not indexed
properly.
ST enqueue
Locks held during dynamic space allocations.
HW enqueue
50
Serialization for the allocation of space beyond the high water
mark
enqueue
TX
51
Generally due to application or table setup
issues
Is acquired when a transaction initiates an
UPDATE
• Row is locked by the session
• Others may select from the row (read consistency)
• Others wanting to UPDATE same row must wait
Lock is held until a COMMIT or ROLLBACK is
issued
enqueue
TX
Waits caused by “normal” active transactions
Just issue a COMMIT or ROLLBACK
Determine what the true unit of work is
52
enqueue
TX
Waits due to Insufficient 'ITL' slots in a Block
The ITL (interested transaction list) is an area
at the top of each data block where Oracle
keeps track of which rows are locked by which
transaction
Every transaction wanting to change a block
requires a slot in the ITL list of the block
The number of ITL slots is controlled by
• INITRANS, initial number of slots at block creation
• MAXTRANS, total allowable slots over time
• ITL list will expand to allow MAXTRANS only if
space is available
53
enqueue
TX
Waits due to rows being covered by the same
BITMAP index fragment
Bitmap indexes allow one index entry to cover
many rows within a table
If two sessions try to insert or update the
same key value the second session has to
wait
54
enqueue
55
ST
Caused by space management operations
Unnecessary Sorting
• Can happen with small extent sizes, allocation of
temporary segments for sorting
• May get an ORA-01575 indicating a timeout
• Disk sorting requires space management and thus
contention on the ST enqueue
• Eliminate as much disk sorting as possible
• Evaluate SORT_AREA_SIZE or
PGA_AGGREGATE_TARGET parameters
enqueue
ST
General TEMP tablespace advice
• SMON cleanup of temporary space
• Set PCTINCREASE equal to zero to stop
cleanup/coalesce
• Set temporary tablespaces as TEMPORARY
SMON in parallel environment
• SMON Cleanup operations are magnified
Hanging or slow system
• Side effect of many processes on the ST enqueue
56
enqueue
HW
57
Acquired to move the HW mark
High volume of inserts across concurrent
session will cause a wait on this contention
Recreate / modify the object with larger
extents
Pre-allocate extents
• ALTER TABLE … ALLOCATE EXTENT
V$LOCK.ID1 is the tablespace number.
V$LOCK.ID2 is the relative dba of segment
header of the object for which space is being
allocated.
What is blocking session waiting on?
58
Idle Session
DB File Scattered Reads
Another session
Idle Session Scenario
Sally
Update customer 147
Goes to Lunch
Jim
Locked trying to update customer 147
Jim will needlessly wait a long time. DBA can kill
Sally’s session IF they can tell that the session
is idle.
59
Missing Index Scenario
Sally
Update customer 147
Selects from order table
with missing index
Jim
Locked trying to update customer 147
DBA can tell that Jim is really waiting because of
a missing index on the order table – even
though Jim isn’t using the order table.
60
Idle Session Scenario
Sally
Update customer 147
Selects from order table
with missing index
Jim
Updated warehouse 22
Locked trying to update customer 147
Bob
Locked trying to update warehouse
22
A chain of locks occurs even though both locked
users aren’t accessing the table with missing
indexes
61
Wait Events for Development
62
Tuning SQL for optimal performance
Debug/test/integrate/pilot process
Understand impact on existing database
Understand Oracle impact on application
performance
View into production for better development
prioritization and feedback
Reduce finger-pointing
Conclusion
Conventional Tuning focus on “system
health” and lead to finger-pointing and
confusion
Wait event tuning implemented according to
RMM is the new way to tune
Two RMM-compliant tools types
•
•
63
Tracing tools
Continuous DB-wide monitoring tools
Questions & Answers
Who is Confio?
64
Oracle product is “Ignite for Oracle”, fast install, free
trial at www.confio.com
Organizations who trust Confio to monitor their most
critical applications include:
Thank you for coming
Matt Larson
Founder/Chief Technology Officer
Contact Information
• [email protected]
• 303-938-8282 ext. 110
• Company website
www.confio.com
65