Oracle Wait Interface - 911
SURENDER SARA
NCOAUG
Email: [email protected], [email protected]
Performance Methodology Goals
• Always Consistent and Repeatable Results
• Cost Effective
• Corporate-Wide Deployment
• Customizable
• Always Accurate, so it Becomes a Standard
Performance Tuning Categories
• When we don't know anything about the issue and need to zero in on a narrow time window, and eventually on the set of problem statements.
• We know the time window when the problem seems to occur, but not the problem.
• We know what specific problem we need to resolve (end-user response time perspective); the users know what needs to be tuned – the BEST START!
• Meeting an ongoing performance SLA.
• Building a performance history (a STATSPACK-like approach, but more intelligent and specific).
Performance Tuning Goals (1)
• To be able to answer questions about sudden slowness, localized or generalized, when we receive a help desk call, someone walks in, or we get a page.
• Ability to compare hourly performance horizontally; this will enable us to answer "WHY ARE WE SLOW TODAY?" type questions compared to "YESTERDAY WAS FINE".
• Need to build historical performance data to compare the fluctuations vertically (today) and horizontally (same time frame but on different days).
Performance Tuning Goals (2)
• Look for signatures and patterns over time, so we know what to expect in response time and when (PROACTIVE GOAL).
• Baseline for tuning efforts.
• Translate the data into meaningful capacity planning needs.
• Quick: a few clicks and the problem is exposed.
Best Strategy, Always On!
• The ideal way to collect data is an "ALWAYS ON" strategy. This can be accomplished in two ways.
• 10046 always on (a sketch of both switches follows below)
– init.ora parameter
– Database logon trigger
• Sampling v$session_wait every second
– A direct SGA attach method, where we can sample the data without using database resources, can be the best way to accomplish this goal (needs C / Pro*C expertise).
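As a rough illustration of the two 10046 "always on" switches, here is a minimal sketch; the trigger name and the choice of level 8 (waits, no binds) are illustrative assumptions, not prescribed above.

-- init.ora: enable 10046 instance-wide (level 8 = waits, no binds)
-- event = "10046 trace name context forever, level 8"

-- Or per session, via a database logon trigger (name is illustrative):
create or replace trigger trace_all_logon
after logon on database
begin
  execute immediate
    'alter session set events ''10046 trace name context forever, level 8''';
end;
/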
Issues with Best Strategy - 10046
• 10046
– Not possible to leave it on always; it will have a negative effect on your performance
– Will fill your space in no time on a highly active system (USER_DUMP_DEST)
Issues with Best Strategy – Sampling v$session_wait
• If sampling is done per second on this view via the direct SGA attach method, that method will work as long as the data store is not in the same spot as the monitored database
What to do - 10046?
• 10046
– Use this for tracing programs once they are identified as a bottleneck.
– Use this to trace all sessions when a problematic time window is defined; for example, create a database logon trigger at 9 am and drop it at 10 am. We will get traces for all the users in that time interval.
– Use this to trace all of a given user's connections, including already active sessions (it will start tracing inactive sessions too, once they become active), using PL/SQL code. See the sketch below.
– Have to be prudent when tracing specific programs manually, by eliminating the START and END empty wait times.
– EMAIL [email protected] for the tracing paper
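A minimal sketch of the PL/SQL approach for sessions that are already connected; it assumes the well-known but undocumented SYS.DBMS_SYSTEM.SET_EV procedure, and SCOTT is a placeholder user name.

-- Enable 10046 level 8 for every current session of one user
begin
  for s in (select sid, serial#
            from   v$session
            where  username = 'SCOTT') loop
    sys.dbms_system.set_ev(s.sid, s.serial#, 10046, 8, '');
  end loop;
end;
/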
What to do – v$session_wait?
• Sample data every X seconds – 30 to 60 seconds is recommended (a sampling sketch follows below)
– We are after the big bottlenecks during the first phase
– This way we meet the "always on" goal for all sessions without creating problems of our own
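A minimal sketch of such a sampler; the holding table name, the 30-second interval, and the one-hour bound are all illustrative assumptions.

-- Hypothetical holding table shaped like v$session_wait plus a timestamp
create table ssdba.ssdba_wait_samples as
select sysdate as snap_dt, w.*
from   v$session_wait w
where  1 = 2;

-- Sample every 30 seconds for one hour
begin
  for i in 1 .. 120 loop
    insert into ssdba.ssdba_wait_samples
    select sysdate, w.* from v$session_wait w;
    commit;
    dbms_lock.sleep(30);
  end loop;
end;
/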
Fitting It Together - GOALS!
• Need something that will capture what each session did, in terms of WAITS and RESOURCES used, before it disconnects
• Need something to sample every session all the time
– Meets our "always on" goal but is lightweight
– Provides historical data
– Provides a quick way to catch the big issues every time
– Cost effective and a corporate standard
Program Long Response Time
Data Collection
• Collect intelligent data based on IO- and memory-related waits
• Build holding tables: one for IO and a couple for memory-related waits
• Build ssdba_extents and ssdba_segments tables and build indexes on them; depending on segment changes, you may have to refresh them before running the diagnosis script (see the sketch below)
• Sample data every 30 seconds to 1 minute from v$session_wait into the holding tables; this interval is changeable
• Create a logoff trigger to save v$session_event and sesstat statistics per user into the logoff holding tables
• Memory-related waits need to capture the problematic constructs at the time they occur
• IO-related waits can be taken care of using holding tables (direct IO read/write can be an exception)
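A minimal sketch of building the ssdba_extents lookup copy; the column list and index are assumptions based on what the file#/block# diagnosis join needs.

-- Local copy of DBA_EXTENTS so diagnosis joins avoid hitting the
-- dictionary views repeatedly (refresh after significant segment changes)
create table ssdba.ssdba_extents as
select owner, segment_name, segment_type, file_id, block_id, blocks
from   dba_extents;

create index ssdba.ssdba_extents_ix
    on ssdba.ssdba_extents (file_id, block_id);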
Logoff Trigger
• Create the logoff stats tables in the same database that is being monitored
• Keep 5-7 days of data at a minimum
• Create one table for events and one for statistics (a trigger sketch follows the two table definitions below)
Logoff Event Table

drop table ssdba.ssdba_logoff_event_t;

create table ssdba.ssdba_logoff_event_t as
select a.sid,
       a.event,
       a.total_waits,
       a.total_timeouts,
       a.time_waited,
       a.average_wait,
       a.max_wait,
       sysdate as logoff_time,
       b.serial#,
       b.paddr,
       b.username,
       b.osuser,
       b.process,
       b.machine,
       b.terminal,
       b.type,
       b.logon_time
from   v$session b,
       v$session_event a
where  1 = 2;   -- copies the structure only, no rows
Logoff Stats Table

drop table ssdba.ssdba_logoff_stats_t;

create table ssdba.ssdba_logoff_stats_t as
select c.username,
       c.osuser,
       a.sid,
       c.serial#,
       c.paddr,
       c.process,
       c.logon_time,
       a.statistic#,
       b.name,
       a.value,
       sysdate as logoff_time
from   v$session c, v$sesstat a, v$statname b
where  1 = 2;   -- copies the structure only, no rows
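A minimal sketch of the logoff trigger that fills both tables; the trigger body is not shown in the original, so the name and the AUDSID-based join back to the current session are assumptions.

create or replace trigger ssdba.ssdba_logoff_trg
before logoff on database
begin
  -- save this session's wait events
  insert into ssdba.ssdba_logoff_event_t
  select a.sid, a.event, a.total_waits, a.total_timeouts,
         a.time_waited, a.average_wait, a.max_wait, sysdate,
         b.serial#, b.paddr, b.username, b.osuser, b.process,
         b.machine, b.terminal, b.type, b.logon_time
  from   v$session b, v$session_event a
  where  b.audsid = userenv('sessionid')
  and    a.sid    = b.sid;

  -- save this session's statistics
  insert into ssdba.ssdba_logoff_stats_t
  select c.username, c.osuser, a.sid, c.serial#, c.paddr,
         c.process, c.logon_time, a.statistic#, b.name, a.value, sysdate
  from   v$session c, v$sesstat a, v$statname b
  where  c.audsid     = userenv('sessionid')
  and    a.sid        = c.sid
  and    b.statistic# = a.statistic#;
end;
/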
ssdba_io_dump (IO Capture)
• HOUR              NUMBER
• DAY               NUMBER
• MONTH             NUMBER
• YEAR              NUMBER
• SNAP_DT           DATE
• SID               NUMBER
• SEQ#              NUMBER
• EVENT             VARCHAR2(64)
• P1TEXT            VARCHAR2(64)
• P1                NUMBER
• P1RAW             RAW(4)
• P2TEXT            VARCHAR2(64)
• P2                NUMBER
• P2RAW             RAW(4)
• P3TEXT            VARCHAR2(64)
• P3                NUMBER
• P3RAW             RAW(4)
• WAIT_TIME         NUMBER
• SECONDS_IN_WAIT   NUMBER
• STATE             VARCHAR2(19)
• SERIAL#           NUMBER
• USERNAME          VARCHAR2(30)
• OSUSER            VARCHAR2(30)
• PADDR             RAW(4)
• LOGON_TIME        DATE
• PROCESS           VARCHAR2(9)
• SQL_HASH_VALUE    NUMBER
• SADDR             RAW(4)
• MODULE            VARCHAR2(48)
• ROW_WAIT_OBJ#     NUMBER
• ROW_WAIT_FILE#    NUMBER
• ROW_WAIT_BLOCK#   NUMBER
• ROW_WAIT_ROW#     NUMBER
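A minimal sketch of the sampling INSERT that would fill ssdba_io_dump from v$session_wait joined to v$session; the date-part breakdown and the 'db file%' event filter are assumptions inferred from the column layout.

-- One sampling pass: capture current IO waits with session context
insert into ssdba.ssdba_io_dump
select to_number(to_char(sysdate, 'HH24')),   -- HOUR
       to_number(to_char(sysdate, 'DD')),     -- DAY
       to_number(to_char(sysdate, 'MM')),     -- MONTH
       to_number(to_char(sysdate, 'YYYY')),   -- YEAR
       sysdate,                               -- SNAP_DT
       w.sid, w.seq#, w.event,
       w.p1text, w.p1, w.p1raw,
       w.p2text, w.p2, w.p2raw,
       w.p3text, w.p3, w.p3raw,
       w.wait_time, w.seconds_in_wait, w.state,
       s.serial#, s.username, s.osuser, s.paddr, s.logon_time,
       s.process, s.sql_hash_value, s.saddr, s.module,
       s.row_wait_obj#, s.row_wait_file#, s.row_wait_block#, s.row_wait_row#
from   v$session_wait w, v$session s
where  w.sid = s.sid
and    w.event like 'db file%';   -- IO events only; filter is illustrative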
ssdba_library_dump
• HOUR             NUMBER
• DAY              NUMBER
• MONTH            NUMBER
• YEAR             NUMBER
• SNAP_DT          DATE
• SID              NUMBER
• SERIAL#          NUMBER
• USERNAME         VARCHAR2(30)
• PADDR            RAW(4)
• LOGON_TIME       DATE
• SQL_HASH_VALUE   NUMBER
• KGLPNMOD         NUMBER
• KGLNAOBJ         VARCHAR2(1000)
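The KGLPNMOD and KGLNAOBJ columns suggest this dump is fed from the x$kglpn and x$kglob fixed tables; a minimal sketch under that assumption (x$ tables require a SYS connection, and the join follows the commonly published library cache pin query):

-- Capture who is pinning what in the library cache
insert into ssdba.ssdba_library_dump
select to_number(to_char(sysdate, 'HH24')),
       to_number(to_char(sysdate, 'DD')),
       to_number(to_char(sysdate, 'MM')),
       to_number(to_char(sysdate, 'YYYY')),
       sysdate,
       s.sid, s.serial#, s.username, s.paddr, s.logon_time,
       s.sql_hash_value,
       p.kglpnmod,     -- pin mode held
       o.kglnaobj      -- name of the pinned object
from   x$kglpn p, x$kglob o, v$session s
where  p.kglpnhdl = o.kglhdadr
and    p.kglpnuse = s.saddr;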
ssdba_buffer_dump
• HOUR          NUMBER
• DAY           NUMBER
• MONTH         NUMBER
• YEAR          NUMBER
• SNAP_DT       DATE
• HLADDR        RAW(4)
• FILE#         NUMBER
• DBABLK        NUMBER
• TCH           NUMBER
• OBJ           NUMBER
• OBJECT_NAME   VARCHAR2(128)
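The HLADDR/DBABLK/TCH columns point at the x$bh fixed table; a minimal sketch of a hot-block capture under that assumption (the touch-count threshold of 100 is illustrative, and x$bh requires SYS):

-- Capture hot buffers (high touch count) with their object names
insert into ssdba.ssdba_buffer_dump
select to_number(to_char(sysdate, 'HH24')),
       to_number(to_char(sysdate, 'DD')),
       to_number(to_char(sysdate, 'MM')),
       to_number(to_char(sysdate, 'YYYY')),
       sysdate,
       bh.hladdr, bh.file#, bh.dbablk, bh.tch, bh.obj,
       o.object_name
from   x$bh bh, dba_objects o
where  bh.obj = o.data_object_id
and    bh.tch > 100;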
ssdba_sqltext_dump
• HOUR         NUMBER
• DAY          NUMBER
• MONTH        NUMBER
• YEAR         NUMBER
• SNAP_DT      DATE
• HASH_VALUE   NUMBER
• ADDRESS      RAW(4)
• PIECE        NUMBER
• SQL_TEXT     VARCHAR2(64)
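A minimal sketch of filling this dump from v$sqltext for the statements already caught waiting in ssdba_io_dump; the subquery linkage is an assumption.

-- Keep the full text of the SQL that showed up in the IO dump
insert into ssdba.ssdba_sqltext_dump
select to_number(to_char(sysdate, 'HH24')),
       to_number(to_char(sysdate, 'DD')),
       to_number(to_char(sysdate, 'MM')),
       to_number(to_char(sysdate, 'YYYY')),
       sysdate,
       t.hash_value, t.address, t.piece, t.sql_text
from   v$sqltext t
where  t.hash_value in (select sql_hash_value from ssdba.ssdba_io_dump);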
FILE-BLOCK Related Waits

Event                             P1            P2          P3
--------------------------------  ------------  ----------  ----------
buffer busy due to global cache   file#         block#      id
buffer busy waits                 file#         block#      id
buffer read retry                 file#         block#      not used
control file parallel write       files         blocks      requests
control file sequential read      file#         block#      blocks
control file single write         file#         block#      blocks
conversion file read              files         block#      blocks
db file parallel read             files         blocks      requests
db file parallel write            files         blocks      requests
db file scattered read            file#         block#      blocks
db file sequential read           file#         block#      blocks
db file single write              file#         block#      blocks
direct path write                 file number   first dba   block cnt
enqueue                           name|mode     id1         id2
free buffer waits                 file#         block#      set-id#
write complete waits              file#         block#      id
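For any event above whose P1/P2 are file# and block#, the owning segment falls out of a join against the extents copy; a minimal sketch (the bind names are illustrative):

-- Which segment does this file#/block# wait point at?
select owner, segment_name, segment_type
from   ssdba.ssdba_extents
where  file_id = :p1                                    -- file#
and    :p2 between block_id and block_id + blocks - 1;  -- block#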
Memory Related Waits – need to capture the wait as well as the details
• library cache pin
• Shared Pool Latch
• Hot Blocks
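For the Hot Blocks case, a common two-step approach (a sketch, not spelled out in the slides) is to find the busiest cache buffers chains latch children and then list the buffers protected by one of them:

-- Step 1: the most-slept-on cache buffers chains latch children
select addr, sleeps
from   v$latch_children
where  name = 'cache buffers chains'
order  by sleeps desc;

-- Step 2: buffers hanging off one hot latch (use an ADDR from step 1)
select file#, dbablk, tch
from   x$bh
where  hladdr = :hot_latch_addr
order  by tch desc;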
Q&A regarding this Approach – Distorting the event by
merely looking at it!
•
•
Are we aggravating the issue? No – To prove that lets sample v$session_wait for 10 minutes at
per minute sample see the resource used
Test was done to take snapshot of sesstat and system_event for this activity
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
execute snap_events_start
execute snap_latch_start
set autotrace on explain statistics
declare
pp number;
begin
for i in 1..10
loop
select count(*) into pp from v$session_wait ;
dbms_lock.sleep(60);
end loop;
end;
/
set autotrace off
execute snap_latch.end_snap
execute snap_events.end_snap
Reports
• Will tell you exactly what you waited for and what caused the wait!
• Fry the big fish in days!
• Resort to 10046 once the TOP bad code is found!
• Do typical performance fixes for the events discovered.
• Analyzing wait event data will yield a path toward a solution for almost any problem.
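As a closing illustration, a minimal sketch of the kind of "why are we slow today?" report the dump tables enable, comparing today's IO wait samples per hour against the same hours yesterday (table and column names come from the ssdba_io_dump layout above):

-- IO wait samples per hour: today vs. yesterday
select hour,
       event,
       sum(case when trunc(snap_dt) = trunc(sysdate)
                then 1 else 0 end) as today_samples,
       sum(case when trunc(snap_dt) = trunc(sysdate) - 1
                then 1 else 0 end) as yesterday_samples
from   ssdba.ssdba_io_dump
where  snap_dt >= trunc(sysdate) - 1
group  by hour, event
order  by hour, event;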