Speaker Qualifications

Download Report

Transcript Speaker Qualifications

RDBMS Forensics:
Troubleshooting Using ASH
Session UGF-9740
Tim Gorman
EvDBT.com
Speaker Qualifications
•
Co-author…
1.
2.
3.
4.
5.
•
29 years in IT…
•
•
•
•
•
•
•
•
“Oracle8 Data Warehousing”, 1998 John Wiley & Sons
“Essential Oracle8i Data Warehousing”, 2000 John Wiley & Sons
“Oracle Insights: Tales of the Oak Table”, 2004 Apress
“Basic Oracle SQL” 2009 Apress
“Expert Oracle Practices: Database Administration with the Oak Table”, 2010 Apress
“C” programmer, UNIX system administrator since 1984
Consultant and technical consulting manager at Oracle (1990-1998)
Independent consultant (www.EvDBT.com) since 1998
Board of Rocky Mountain Oracle Users Group (www.RMOUG.org) since 1995
Board of Oracle Developers Tools Users Group (www.ODTUG.com) since 2013
Oak Table network (www.OakTable.net) since 2002
Oracle ACE since 2007, Oracle ACE Director since 2012
More…
•
Board of directors of Project SafeGuard (www.PSGHelps.org), since 2011
Agenda
• When someone complains of a performance problem or
an error
• The problem is not occurring right this minute
• So, what can we do to find performance problems, even
outright failure conditions, that have occurred in the past?
• Where do we begin?
• Where do we look, what tools?
• How do we decide where to go next?
• Two case studies as illustration…
First case study…
• Informatica jobs reporting Oracle error (something about
“connection lost contact”)
• Obtaining the error messages from the Informatica logs showed
ORA-03135
• Searching for that error message in MyOracleSupport (MOS)
yields…
• Note #730066.1 (entitled “Diagnosis of ORA-3135/ORA-3136
Connection Timeouts when the Fault is in the Database”)
Researching in MOS…
• Note #730066.1 indicates that ORA-03135 (“connection
lost contact”) is returned to the client-side of the database
connection (i.e. Informatica PowerMart, in this case)
• It also mentions that on the database side, error ORA-03136
(“inbound connection timed out”) is recorded to the database
alert.log file as well…
Researching in MOS…
• Note #730066.1 goes on to explain the sequence of SQL
commands executed when establishing a database
connection
• Each of these SQL statements has to be parsed, executed, and
fetched
• Thus presenting the possibility of a “hang” if there is any kind of
contention
• Also, mentions two Shared Pool bugs with symptoms
including ORA-03135/3136 errors
• Almost goes without saying… 
Researching in MOS…
•
Note #730066.1 then goes on to suggest three different
methods for diagnosing “hangs” while establishing
database connections…
1.
2.
3.
From a separate SYSDBA session in SQL*Plus, take three
SYSTEMSTATE dumps each 90 seconds apart while the
problems are happening
Examine Active Session History (ASH) information for the 15
minutes leading up to the occurrence of the error
Continuous monitoring of the contents of V$LATCHHOLDER
while the problems are happening
Researching in MOS…
• Also, note #465043.1 (entitled “Troubleshooting
ORA-3136 WARNING Inbound connection timed out”)
is useful
• Further describing the error as a timeout specified by
the SQL*Net parameter
SQLNET.INBOUND_CONNECT_TIMEOUT (defaults
to 60 secs)
• Introduced in Oracle10g to prevent Denial Of Service (DoS)
attacks on the database
Researching on Google…
• Searching for keywords “oracle ora-3135 ora3136” seems to mostly yield articles that repeat the
following advice posted to an Oracle TechNet (OTN)
discussion forum…
It's a warning. by default there
a connection from client will
prescribed time limit. If the
find the warning messages. In
some parameters in sqlnet.ora
would be some time in seconds, usually
get connected to DB Server with in
connection does not happen then you can
order to handle this you can include
and listener.ora
Example :sqlnet.ora
SQLNET.INBOUND_CONNECT_TIMEOUT = 120
In the listener.ora
INBOUND_CONNECT_TIMEOUT_LISTENER = 110
Research summary
• Connections taking longer to establish than the timeout
specified by SQL*Net timeout parameters
• On both instances of a 2-node RAC cluster, we have already set
all these SQL*Net parameters to 300 secs (5 mins)
• Would it be reasonable to set these higher?
• Consider:
• timeouts tend to turn a failure condition into a performance problem
• longer timeouts show as bigger performance problems…
Research summary
• ORA-03136 errors only showed up in the alert.log file of RAC02
• No error messages found in RAC01
...
Mon Oct
SUCCESS:
Mon Oct
WARNING:
Mon Oct
WARNING:
...
4 22:05:44 2010
diskgroup D3098_ORAARCH was dismounted
4 22:07:05 2010
inbound connection timed out (ORA-3136)
4 22:07:05 2010
inbound connection timed out (ORA-3136)
Active Session History (ASH)
ASH is a slightly different take from other information in the
Automatic Workload Repository (AWR)
• All AWR information is based on real-time statistics maintained in
memory-based V$ performance views
• V$ performance views come in two basic flavors…
• Cumulative instance-level statistics
• Cumulative session-level statistics for currently-active sessions only
• AWR is the long-term repository for the information in the V$
performance views
• Only for the cumulative instance-level statistics…
• …session-level statistics are not retained past the end of the
session…
Active Session History (ASH)
• Why is session-level data valuable?
• Instance-level data is aggregated almost to the point of being
meaningless
• AWR data aggregated to an hour, instance-wide, by default
• Good for finding “Top N Worst” SQL and general trends lasting more
than an hour
• Bad for finding anomalies, incidents happening in seconds or minutes
• Session-level data shows more detailed information from
sampled session data
• Includes information about the identity of the user or program, giving
some idea of the context of a problem
Active Session History (ASH)
• Circular buffer is intended to store about an hour of active session
information
• MMNL process scans V$SESSION info in shared memory every second
for “active” session information, flushes to ASH buffers in shared
memory
• Parameter “_ash_sampling_interval”, defaults to 1000 millisecs (1
sec)
• MMNL process also periodically flushes V$ view info to AWR table
• Parameter “_ash_disk_filter_ratio”,defaults to 10 (seconds)
• If ASH buffer becomes 66% full, then MMNL process flushes to AWR
between snapshots
• Parameter “_ash_eflush_trigger”, defaults to 66 (percent)
• How long data is retained in V$A_S_H is clearly dependent on workload
• Historical ASH data retained in AWR tables under all circumstances
• Intra-snapshot flushes provide a valuable trail of evidence in dire situations
V$ACTIVE_SESSION_HISTORY
• View is too wide to display here, but information
includes…
•
•
•
•
•
•
•
•
Timestamp
User ID, session ID, type, and status
Program, module, action
SQL ID, SQL plan hash value, SQL opcode
PL/SQL module info (entry point, current procedure)
Parallel execution query coordinator (QC) information
Blocking session information, if waiting on enqueue
Wait event information with parameters
DBA_HIST_ACTIVE_SESS_HISTORY
• Contains same columns as
V$ACTIVE_SESSION_HISTORY, plus…
• AWR snapshot ID
• DBID and instance number
• A few more columns, though fewer rows, but data
retained much longer (default: 1 week) than the V$ view
Back to the case study…
• ORA-03136 errors are occurring on RAC02 only
• We focus on two “groupings” of errors a couple hours
apart
(jump to Listing01.rtf)
(then, jump to Listing02.rtf)
Lots of “concurrency” class waits…
• Remembering that the long-term retained ASH information in the
AWR view represents 10-second samples (instead of 1-second
samples)
• Filter out the I/O waits and PX-related waits…
• …notice the concurrency waits…
•
•
•
•
•
•
enq: SQ – contention
row cache lock
gc current block 2-way
gc buffer busy
gc current request
latch: row cache objects
Waits on sequence cache refresh?
•
Drilling down on the event “enq: SQ – contention”…
•
•
•
•
This event means that a sequence is “refreshing” its next set of sequence values
from the data dictionary table SYS.SEQ$
Note the P2 value of “144”?
According to the view V$EVENT_NAME, parameter P2 on the wait event contains
an OBJECT_ID value
Translating OBJECT_ID = 144 using the DBA_OBJECTS view in the RAC
database shows…
SQL> select owner, object_name, object_type from dba_objects where
object_id = 144;
OWNER OBJECT_NAME
OBJECT_TYPE
------ --------------- ------------------SYS
AUDSES$
SEQUENCE
•
This sequence is used to generate values for the column AUDSES in the
V$SESSION view when a database session is created…
Waits on the row cache?
•
Drilling down on the event “row cache lock”…
•
•
•
•
This event indicates problems modifying data values in the data dictionary cache
(a.k.a. “row cache”) within the Shared Pool of the SGA
Note the P1 value of “13”?
According to view V$EVENT_NAME, parameter P1 is the “cache id”
So, looking in V$ROWCACHE where CACHE# = 13, we see…
SQL> select parameter from v$rowcache
2 where cache# = 13;
PARAMETER
-------------------------dc_sequences
•
Hmmm… are we seeing a pattern?
•
..but wait – there’s more!!!
Waits on RAC “cache fusion”?
•
Drilling down on the event “gc current request”…
•
•
•
•
This event is posted when an instance must obtain the most current copy of a
block buffer from another RAC instance…
According to the view V$EVENT_NAME, parameter P1 is file# and P2 is block#
In the ASH query, we saw a *lot* of these events where P1=1 and P2=180…
Looking in the DBA_EXTENTS view, we can find the database object in which this
contended-for block resides…
SQL> select owner, segment_name from dba_extents
2 where file_id = 1 and
3 180 between block_id and (block_id + (blocks - 1));
OWNER
SEGMENT_NAME
-------------- ---------------------SYS
SEQ$
•
…it just keeps piling up, doesn’t it?…
Poorly-cached sequences?
• Let’s find what sequences might need caching…
SQL>
2
3
4
5
select sequence_owner, sequence_name, cache_size, last_number
from dba_sequences
where cache_size < 20
and (nvl(min_value,0) + last_number) / increment_by >= 10000
order by (nvl(min_value,0) + last_number) / increment_by desc;
OWNER
-------SYS
SYS
SYS
PROD
SYS
SYS
MDSYS
PROD
PROD
SYS
SYS
SYS
PROD
SEQUENCE_NAME
CACHE_SIZE LAST_NUMBER
------------------------------ ---------- ----------DBMS_LOCK_ID
10 1073741891
IDGEN1$
10
18987301
WRI$_ADV_SEQ_MSGGROUP
10
6428319
JL_SEQ
0
2344665
ORA_TQ_BASE$
0
1895033
WRI$_ALERT_SEQUENCE
10
1180544
TMP_COORD_OPS
0
1000000
DL_SEQ
0
964612
RUN_ID_SEQ
0
187744
SCHEDULER$_INSTANCE_S
10
166467
WRI$_ADV_SEQ_TASK
10
151284
OBJECT_GRANT
10
139382
DF_SEQ
0
53402
Poorly-cached sequences?
• The sequence SYS.AUDSES$ is not poorly cached…
• CACHE_SIZE = 100
(jump to Listing03.rtf – oops!!!)
• But all of the other poorly-cached sequences affect the
performance of any sequence
• They’re all housed within the same database block in the SEQ$
table
RAC contention on a block?
• Does the AUDSES$ sequence reside within block #180 on file #1?
SQL> set serveroutput on
SQL> declare
2
v_rid
rowid;
3
v_type number;
4
v_obj
number;
5
v_rfno number;
6
v_bno
number;
7
v_rno
number;
8 begin
9
select rowid into v_rid from sys.seq$ where obj# = 144;
10
dbms_output.put_line('v_rid = "'||v_rid||'"');
11
dbms_rowid.rowid_info(v_rid, v_type, v_obj, v_rfno, v_bno, v_rno);
12
dbms_output.put_line('v_type = "'||v_type||'"');
13
dbms_output.put_line('v_obj = "'||v_obj||'"');
14
dbms_output.put_line('v_rfno = "'||v_rfno||'"');
15
dbms_output.put_line('v_bno = "'||v_bno||'"');
16
dbms_output.put_line('v_rno = "'||v_rno||'"');
17 end;
18 /
v_rid = "AAAABEAABAAAAC0AAE"
v_type = "1"
v_obj = "68"
v_rfno = "1“
 file #1
v_bno = "180“  block #180
v_rno = "4"
RAC contention on a block?
• Using the same PL/SQL block to translate a ROWID into file# and
block#...
• it turns out that all of these busy sequences with low or disabled caching
are in file# = 1 and block# = 180…
• Now, do the waits on “gc buffer busy”, “gc current block
2-way”, and “gc current request” make more sense?
• We’re experiencing high demand for slowly-serviced sequences, due to
lack of caching…
• Causing inter-instance contention
• Since RAC02 is the less-busy of the two-node RAC environment…
• …it spends more time waiting because RAC01 has the current image of
these blocks
Taking all the findings together…
1.
Uncached or poorly-cached sequences are resulting in lots of I/O
to one block in the SYS.SEQ$ data dictionary table
•
2.
Thus waits on event “enq: SQ – contention”…
This causes contention within the Row Cache of the Shared Pool
•
3.
Thus waits on “row cache lock” on the section of the Row Cache
storing sequence information…
The less-utilized RAC instance (RAC02) frequently has to wait for
a current image of that block to be shipped over from the heavilyutilized RAC instance (RAC01)
•
Thus waits on “gc” events…
…to reach a conclusion
• The resolution for the ORA-03135/3136 turns out to be
very simple…
• Increase caching on all under-cached sequences
• Increased sequence caching reduces physical I/O to/from the
SYS.SEQ$ table, which reduces inter-instance RAC contention as
well (gc waits), which eliminates starvation by lesser-used instance
Resolution
SQL>
2
3
4
5
6
select
'alter sequence '||sequence_owner||'.'||sequence_name||
' cache 100 /* prev_cache_size='||cache_size||', last#='||last_number||' */ ;'
from
dba_sequences
where
cache_size <= 20
and
(nvl(min_value,0) + last_number) / increment_by >= 10000
order by (nvl(min_value,0) + last_number) / increment_by desc;
CMD
-------------------------------------------------------------------------------------------alter sequence SYS.DBMS_LOCK_ID cache 100
/* prev_cache_size=20, last#=1073741891 */ ;
alter sequence SYS.IDGEN1$ cache 100
/* prev_cache_size=20, last#=18987301 */ ;
alter sequence SYS.WRI$_ADV_SEQ_MSGGROUP cache 100 /* prev_cache_size=10, last#=6428319 */ ;
alter sequence PROD.JL_SEQ cache 100
/* prev_cache_size=0, last#=2344665 */ ;
alter sequence SYS.ORA_TQ_BASE$ cache 100
/* prev_cache_size=0, last#=1895045 */ ;
alter sequence SYS.WRI$_ALERT_SEQUENCE cache 100
/* prev_cache_size=20, last#=1180544 */ ;
alter sequence MDSYS.TMP_COORD_OPS cache 100
/* prev_cache_size=0, last#=1000000 */ ;
alter sequence PROD.DL_SEQ cache 100
/* prev_cache_size=0, last#=964612 */ ;
alter sequence PROD.RUN_ID_SEQ cache 100
/* prev_cache_size=0, last#=187744 */ ;
alter sequence SYS.SCHEDULER$_INSTANCE_S cache 100 /* prev_cache_size=20, last#=166467 */ ;
alter sequence SYS.WRI$_ADV_SEQ_TASK cache 100
/* prev_cache_size=10, last#=151284 */ ;
alter sequence SYS.OBJECT_GRANT cache 100
/* prev_cache_size=20, last#=139402 */ ;
alter sequence PROD.DF_SEQ cache 100
/* prev_cache_size=0, last#=53402 */ ;
13 rows selected.
Another case study…
• Another two-node RAC environment…
• Supporting OBIEE 10.1.3.1.2
• App-server (NQServer.exe) supporting Hyperion users
encountering ORA-03135
• Numerous ORA-03136 found frequently in both database
instance “alert.log” files
…which at first seems similar…
• Similar alert log entries in both instances, at
different times…
Mon Oct 31 05:51:07 2011
WARNING: inbound connection
Mon Oct 31 05:51:07 2011
WARNING: inbound connection
Mon Oct 31 05:51:07 2011
WARNING: inbound connection
Mon Oct 31 05:51:07 2011
WARNING: inbound connection
Mon Oct 31 05:51:07 2011
WARNING: inbound connection
timed out (ORA-3136)
timed out (ORA-3136)
timed out (ORA-3136)
timed out (ORA-3136)
timed out (ORA-3136)
…but ultimately seems quite different…
SQL>
2
3
4
5
6
6
7
select
event,
count(*) cnt,
sum(decode(session_state, 'ON CPU', wait_time, time_waited))/1000000 time_waited
from
dba_hist_active_sess_history
where
sample_time between to_date('31-OCT-2011 05:50:30','DD-MON-YYYY HH24:MI:SS')
and
to_date('31-OCT-2011 05:51:10','DD-MON-YYYY HH24:MI:SS')
group by event
order by time_waited desc;
EVENT
CNT TIME_WAITED
------------------------------ -------- -----------latch free
277
81
latch: shared pool
153
44
latch: library cache
45
13
reliable message
1
1
12
0
DBMS_LDAP: LDAP operation
1
0
Research…
• Bearing in mind that this is the AWR view…
• Not the memory-based V$ view…
• With 10-second samples, not 1-second samples…
• Latch contention…
• “latch free”
• “latch: shared pool”
• “latch: library cache”
• No RAC-related events
• Nothing to do with sequences either…
Research (cont’d)…
SQL> select name, parameter1, parameter2, parameter3
2 from
v$event_name
3 where name in ('latch free','latch: shared pool',
4
'latch: library cache');
NAME
-------------------latch: shared pool
latch: library cache
latch free
PARAMETER1
------------address
address
address
PARAMETER2
------------number
number
number
PARAMETER3
-----------tries
tries
tries
Research (cont’d)…
SQL>
2
3
4
5
6
7
8
9
select
h.event, l.name, count(*) cnt,
sum(decode(h.session_state,'ON CPU',h.wait_time,h.time_waited))/1000000 secs
from
dba_hist_active_sess_history h, v$latchname l
where
h.sample_time between to_date('31-OCT-2011 05:50:30','DD-MON-YYYY HH24:MI:SS')
and
to_date('31-OCT-2011 05:51:10','DD-MON-YYYY HH24:MI:SS')
and
h.event in ('latch free','latch: shared pool','latch: library cache')
and
l.latch# = h.p2
group by h.event, l.name
order by secs desc;
EVENT
---------------------latch free
latch: shared pool
latch free
latch: library cache
latch free
latch free
NAME
CNT
SECS
--------------------------------------- -------- -----------parameter table allocation management
174
51
shared pool
153
44
user lock
90
26
library cache
45
13
active checkpoint queue latch
12
4
kokc descriptor allocation latch
1
0
Research (cont’d)…
• parameter table allocation management
• Automatic SGA management (10.1  10.2)
• kokc descriptor allocation latch
• Related to high concurrency in XML/XDB (10.2.0.4+)
• user lock
• Acquired when cleaning dead user processes or dropping a schema
user (from note on MOS)
• active checkpoint queue latch
• Database full/incremental checkpoint in progress
• shared pool
• Synchronizes changes to Shared Pool in SGA
• library cache
• Synchronizes changes to sub-component of Shared Pool (hard parsing?)
Research summary
• Not RAC-related
• Not sequence-caching related
• Not hard-parsing (i.e. no bind variables)
• Checkpointing? LGWR problems?
• Process cleanup? PMON issues?
• Auto SGA management?
Automatic SGA Management
• SGA_TARGET and SGA_MAX_SIZE
• Both set to 20G
• DB_CACHE_SIZE, SHARED_POOL_SIZE,
LARGE_POOL_SIZE, JAVA_POOL_SIZE,
STREAMS_POOL_SIZE, and LOG_BUFFER all unset
• displays as “0”
• Has auto SGA management been particularly active?
• …bingo!...
(jump to Listing04.rtf and Listing05.rtf)
Dampening the thrashing…
• When SGA_TARGET > 0, then the meaning of the
standard SGA sizing parameters changes
• Instead of specifying a static size…
• …they become a “floor” or minimum value
• So for example, setting SHARED_POOL_SIZE = 6G
when SGA_TARGET > 0 means…
• Shared Pool can still be automatically resized larger than 6G, but
not smaller…
Resolution for the 2nd case study
• Dampen the thrashing between the Buffer Cache and the
Shared Pool
• If Buffer Cache is shrunk or constrained, then more cache misses
and physical read I/O is the result
• If Shared Pool is shrunk or constrained, then more latch
contention is the result
• Based on past history of auto re-sizing, set a floor value for the
Shared Pool
ALTER SYSTEM SET SHARED_POOL_SIZE = 3584M;
Summary
• Although primarily regarded as a performance
tuning tool
• ASH is also a great diagnostic tool
• If you understand its limitations
• Database forensics
• Understand what is recorded and stored within AWR
• Understand how it can be used
COLLABORATE14.IOUG.ORG
April 7 – 11, 2014
The Venetian Resort,
Las Vegas, NV
• Network with 5,000+ database and
application professionals
• 5 days of in-depth education built by
users, for users
• Complimentary Pre-Conference
Workshops for IOUG registrants
Attend for free!
IOUG’s Call for Speakers is now open
collaborate14.ioug.org/call-for-speakers
Training Close to Home
IOUG Master Class Series Returns
Coming soon to a city near you!
One Day Interactive Training Events to meet your
educational needs
Maximum Availability Architecture
with Oracle’s Larry Carpenter and Joe
Meeks
Communities
October 15: Milwaukee, WI
October 16: Rochester, NY
Performance and Internals with Craig Shallahamer
November 6: Chicago, IL
November 8: Atlanta, GA
Get more information at www.ioug.org/masterclass
or visit the IOUG kiosk in Moscone West, 2nd floor
Come See IOUG at the User Group Pavilion
Win a Free Registration to COLLABORATE 14!
Stop by the IOUG kiosk in the User Group Pavilion in
Moscone West, 2nd floor, to pick up a lucky poker chip and
see if you are a winner of a free registration to
COLLABORATE 14 in Las Vegas!
Complimentary eBook
Register Now
www.mhprofessional.com/dbsec
Use Code: db12c
45
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
Thank You!
[email protected]
RDBMS Forensics: Troubleshooting
Using ASH
Session UGF-9740